Version 8 (modified by malcolm.wallace@…, 8 years ago)

--

Unicode

The Haskell 98 Report claims the language uses Unicode. Most of the rest of the world stores and exchanges text as bytes in some particular encoding (not always an encoding of Unicode), and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode.

Background on Unicode

  • Unicode defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
  • Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc, and some glyphs may correspond to a sequence of characters (e.g. base character and combining mark characters).
  • The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
  • For reasons of backwards compatibility and space efficiency, there are a variety of variable-length encodings of the code points themselves into byte streams.
    • UTF-8 seeks to ensure that the ASCII characters retain their traditional coding in the bottom 7 bits of a single byte. Non-ASCII characters are coded using two or more bytes, each with the top bit set.
    • UTF-16 uses 16-bit code units. A single unit cannot cover the entire code space, so code points above 0xffff are coded as a 'surrogate pair': two successive 16-bit units taken from reserved ranges (0xd800 to 0xdbff, followed by 0xdc00 to 0xdfff). So although most characters end up fitting in a single 16-bit field, some must be coded as two successive fields.
    • UCS-4 uses a full 32-bit word per character.
    • To make things more exciting, the UTF-16 and UCS-4 encodings have two variations, depending on the endianness of the machine they were originally written on. So if you read a raw byte-stream and want to convert it to 16-bit chunks, you first need to work out the byte-ordering. This is often done by reading a few bytes and then looking up a heuristic table, although there is also a 'byte-order mark' (U+FEFF), a non-printing character that may or may not be present at the start of the stream.
  • Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity.
  • Since Unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms.
  • The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16.
  • Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type wchar_t (wide character) is UCS-4.
  • Any system must be able to read/write files that originated on any other platform.
  • As an example of the complex heuristics needed to guess the encoding of any particular file, see the XML standard.
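
To make two of the points above concrete, here is a minimal sketch of how a code point outside the first 0x10000 is split into two 16-bit UTF-16 units, and how a byte-order mark at the start of a stream can be recognised. The names utf16Units, Bom and detectBom are invented for this illustration and are not part of any library. BOM detection is only a heuristic: the bytes FF FE 00 00 could equally be a UTF-16 little-endian mark followed by a NUL character.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- | Encode one character as UTF-16 code units (hypothetical helper).
-- Code points below 0x10000 fit in one unit; higher ones are split into
-- a high unit (0xD800..0xDBFF) and a low unit (0xDC00..0xDFFF).
-- No check is made that the input is not itself in that reserved range.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000   -- 20-bit offset, split 10 bits + 10 bits

-- | Encodings that a leading byte-order mark (U+FEFF) can identify.
data Bom = Utf8Bom | Utf16BE | Utf16LE | Utf32BE | Utf32LE
  deriving (Eq, Show)

-- | Recognise a byte-order mark at the start of a byte stream.
-- The 4-byte patterns are tried before the 2-byte ones, because
-- FF FE is a prefix of FF FE 00 00.
detectBom :: [Word8] -> Maybe Bom
detectBom (0x00:0x00:0xFE:0xFF:_) = Just Utf32BE
detectBom (0xFF:0xFE:0x00:0x00:_) = Just Utf32LE
detectBom (0xEF:0xBB:0xBF:_)      = Just Utf8Bom
detectBom (0xFE:0xFF:_)           = Just Utf16BE
detectBom (0xFF:0xFE:_)           = Just Utf16LE
detectBom _                       = Nothing
```

When detectBom returns Nothing, a reader still has to fall back on a default or on further guessing, which is the situation the XML example illustrates.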

Unicode in Haskell source

The Haskell 98 Report claims that Haskell source code uses the Unicode character set. If Unicode were allowed, how would implementations know which encoding was used?

  • Jhc is the only implementation that allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8.
  • Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
  • Others treat source code as Latin-1.

Some things we could do:

  • Revert to US-ASCII, Latin-1 or implementation-defined character sets.
  • Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this).
  • Allow Unicode, with a mechanism for specifying encoding in the source file.
  • Allow Unicode with the encoding specified by the current locale (as currently done by Hugs). This is arguably the correct thing for all programs that read text files, but it makes Haskell source using non-ASCII characters non-portable. (We could specify that all compilers must support UTF-8 and/or some other portable form too.)

If Unicode is allowed, should its use be restricted, e.g. to character and string literals?

The Char type

The Haskell 98 Report claims that the type Char represents Unicode, which seems to be the canonical choice. The functions of Char work with Unicode for GHC and Hugs, with one divergence from the Report:

  • isAlpha selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
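
The divergence can be observed directly. The sketch below assumes an implementation whose Data.Char follows Unicode categories, as described above for GHC and Hugs; isCasedLetter and caselessLetters are names made up for this example.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- | The narrower reading: a letter is anything upper- or lower-case.
isCasedLetter :: Char -> Bool
isCasedLetter c = isUpper c || isLower c

-- | Characters that isAlpha accepts even though they have no case,
-- such as CJK ideographs.
caselessLetters :: String -> String
caselessLetters = filter (\c -> isAlpha c && not (isCasedLetter c))
```

For instance, the ideograph '\x4E2D' is alphabetic but neither upper- nor lower-case, so caselessLetters keeps it while dropping ASCII letters.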

I/O and System functions

The Report goes on to provide I/O primitives using the Char type, define FilePath as String, and have the functions in System use String.

  • Hugs treats the bytes interchanged with the operating system (I/O, filenames, program arguments and the environment) as using a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
  • All character-based I/O in jhc-compiled programs uses the encoding of the current locale. Handling of strings will be similar when the CString functions become conformant.
  • Other implementations treat the bytes interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.

Assuming we retain Unicode as the representation of Char:

  • Flexible handling of character encodings is needed, but not necessarily as part of this standard.
  • BinaryIO is needed anyway, and would provide a base for these encodings.
  • A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
  • An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.

A Straw-Man Proposal

  • Internal character representation. The Haskell type Char is UCS-4.
  • Haskell source encoding.
    • Introduce a pragma {-# ENCODING e #-} with a range of possible values of the encoding e. If the pragma is present, it must be at the beginning of the file. If it is not present, the file is encoded in Latin-1. Note that even if the pragma is present, some heuristic may be needed even to get as far as interpreting the encoding declaration, as in XML. The fact that the first three characters must be {-# will be useful here.
    • A literal string may contain any literal character representable in the source encoding. In addition, escapes are provided to permit the specification of any Unicode character (which may or may not be otherwise representable in the source encoding).
    • An identifier may contain any Unicode alphanumeric or symbol characters from a defined range. Thus, a source text may not be representable in certain other encodings (especially in ASCII).
  • I/O. All raw I/O is in terms of octets, i.e. Word8.
  • Conversions. Pure functions exist to convert octets to and from any particular encoding:
       stringDecode :: Encoding -> [Word8] -> [Char]
       stringEncode :: Encoding -> [Char] -> [Word8]
    
    The codecs must operate on strings, not individual characters, because some encodings use variable-length sequences of octets.
  • Efficiency. Semantically, character-based I/O is a simple composition of the raw I/O primitives with an encoding conversion function. However, for efficiency, an implementation might choose to provide certain encoded I/O operations primitively. If such primitives are exposed to the user, they should have standard names so that other implementations can provide the same functionality in pure Haskell Prime.
  • Locales. It may be possible to retain the traditional I/O signatures for hGetChar, hPutChar, readFile, writeFile, etc, but only by introducing a stateful notion of current encoding associated with each individual handle. The default encoding could be inherited from the operating system environment, but it should also be possible to change the encoding explicitly.
       getIOEncoding :: Handle -> IO Encoding
       setIOEncoding :: Encoding -> Handle -> IO ()
       resetIOEncoding :: Handle -> IO ()  -- go back to default
    
  • Filenames, program arguments, environment.
    • Filenames are stored in Haskell as [Char], but the operating system should receive [Word8] for any I/O using filenames. Some encoding conversion is therefore required. Usually, this will be platform-dependent, and so the actual encoding may be hidden from the programmer as part of the default locale.
    • Program arguments, and symbols from the environment, are supplied by the operating system to the Haskell program as [Word8]. The program is responsible for conversion to [Char]. Again, there may be a default encoding chosen based on the locale.
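
As an illustration of the conversion functions in the proposal, here is one possible sketch of stringEncode and stringDecode for just two encodings. The proposal leaves the Encoding type open, so the constructors Latin1 and UTF8 are assumptions, as is the error policy: characters that cannot be encoded become '?', and malformed UTF-8 input decodes to U+FFFD.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical set of encodings; the proposal does not fix one.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

-- | Substituted for malformed input (an assumed policy).
replacement :: Char
replacement = '\xFFFD'

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    latin1 c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F              -- '?' for unrepresentable characters
stringEncode UTF8 = concatMap utf8        -- note: surrogates are not rejected
  where
    utf8 c
      | n < 0x80    = [byte n]
      | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
      | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
      | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18)), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
      where n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    cont x = byte (0x80 .|. (x .&. 0x3F)) -- continuation byte: 10xxxxxx

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go
  where
    go [] = []
    go (b:bs)
      | b < 0x80              = chr (fromIntegral b) : go bs
      | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80 bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800 bs
      | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000 bs
      | otherwise             = replacement : go bs
    -- Read k continuation bytes after a lead byte. Truncated, overlong,
    -- surrogate and out-of-range sequences yield the replacement character,
    -- and decoding resumes just after the lead byte.
    multi :: Int -> Int -> Int -> [Word8] -> [Char]
    multi k lead lo bs
      | length cs == k && all isCont cs
          && n >= lo && n <= 0x10FFFF && not (isSurr n) = chr n : go rest
      | otherwise                                       = replacement : go bs
      where
        (cs, rest) = splitAt k bs
        n = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cs
        isCont c = c .&. 0xC0 == 0x80
        isSurr x = x >= 0xD800 && x <= 0xDFFF
```

The UTF-8 decoder shows why the codecs work on whole strings: one character may consume up to four octets, so a per-character function of type Word8 -> Char cannot express it. Under this sketch, character-based reading is just stringDecode applied to the octets delivered by the raw I/O layer.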