
Unicode

The Haskell 98 Report claims the language uses Unicode. Most of the rest of the world uses other character sets, or at best some particular encoding of Unicode, and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode.

Background on Unicode

  • Unicode defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
  • Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc., and some glyphs may correspond to a sequence of characters (e.g. a base character followed by combining marks).
  • The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
  • For reasons of backwards compatibility and space efficiency, there are a variety of variable-length encodings of the code points themselves into byte streams.
    • UTF-8 seeks to ensure that the ASCII characters retain their traditional coding in the bottom 7 bits of a single byte. Non-ASCII characters are coded using two to four bytes, each with the top bit set. (A minimal encoder is sketched after this list.)
    • UTF-16 encodes most characters as a single 16-bit unit. Unfortunately 16 bits do not cover the entire code space, so characters above 0xffff are encoded as a pair of 16-bit 'surrogate' units. So although most characters end up fitting in a single 16-bit field, some must be coded as two successive fields.
    • UCS-4 uses a full 32-bit word per character.
    • To make things more exciting, the UTF-16 and UCS-4 encodings come in two variants, depending on the endianness of the machine they were originally written on. So if you read a raw byte stream and want to convert it to 16-bit units, you first need to work out the byte ordering. This is often done by reading a few bytes and applying heuristics, although there is also a 'byte-order mark', a non-printing character which may or may not be present at the start of the stream.
  • Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity.
  • Since Unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms.
  • The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16.
  • Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type wchar_t (wide character) is UCS-4.
  • Any system must be able to read/write files that originated on any other platform.
  • As an example of the complex heuristics needed to guess the encoding of any particular file, see the XML standard.
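
To make the variable-length encodings above concrete, here is a minimal sketch of UTF-8 and UTF-16 encoders for a single code point (the names encodeUtf8 and encodeUtf16 are illustrative, not taken from any existing library):

    import Data.Bits (shiftR, (.&.), (.|.))
    import Data.Char (ord)
    import Data.Word (Word8, Word16)

    -- Encode one code point as UTF-8: one to four bytes, ASCII unchanged.
    encodeUtf8 :: Char -> [Word8]
    encodeUtf8 c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. top 6, cont 0]
      | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        top k  = fromIntegral (n `shiftR` k)
        cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)

    -- Encode one code point as UTF-16: a single unit below 0x10000,
    -- otherwise a pair of 'surrogate' units.
    encodeUtf16 :: Char -> [Word16]
    encodeUtf16 c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [0xD800 .|. fromIntegral (m `shiftR` 10),
                       0xDC00 .|. fromIntegral (m .&. 0x3FF)]
      where
        n = ord c
        m = n - 0x10000

For example, encodeUtf8 '\xE9' yields [0xC3,0xA9], and encodeUtf16 '\x10000' yields the surrogate pair [0xD800,0xDC00].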

Unicode in Haskell source

The Haskell 98 Report claims that Haskell source code uses the Unicode character set. If Unicode were allowed, how would implementations know which encoding was used?

  • Jhc is the only implementation that allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8.
  • Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
  • Others treat source code as Latin-1.

Some things we could do:

  • Revert to US-ASCII, Latin-1 or implementation-defined character sets.
  • Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this; see the sketch after this list).
  • Allow Unicode, with a mechanism for specifying encoding in the source file.
  • Allow Unicode with the encoding specified by the current locale (as currently done by Hugs). This is arguably the correct thing for all programs that read text files, but it makes Haskell source using non-ASCII characters non-portable. (We could specify that all compilers must support UTF-8 and/or some other portable form too.)
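
Haskell 98 already provides numeric escapes in character and string literals, which gives one portable (if unreadable) way to denote any code point while keeping the source file pure ASCII; the Haskell 1.4 \uNNNN escapes were in the same spirit. A small example:

    -- Pure-ASCII source that nevertheless denotes non-ASCII characters.
    greeting :: String
    greeting = "caf\xE9"      -- "café", via a hexadecimal escape

    smiley :: Char
    smiley = '\x263A'         -- WHITE SMILING FACE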

If Unicode is allowed, should its use be restricted, e.g. to character and string literals?

Unicode in Haskell IO

  • All character-based I/O in jhc-compiled programs is carried out in the current locale of the system.
  • In nhc and ghc, character-based I/O is carried out as if the encoding were Latin-1.
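
Later GHC releases (6.12 and onward) let the encoding attached to a Handle be selected explicitly, which covers both behaviours above; a minimal sketch:

    import System.IO

    main :: IO ()
    main = do
      -- Text I/O in the system's current locale (the jhc behaviour):
      hSetEncoding stdout localeEncoding
      putStrLn "locale-encoded text"
      -- Text I/O as if the encoding were Latin-1 (the nhc/ghc behaviour):
      hSetEncoding stdout latin1
      putStrLn "caf\xE9"      -- \xE9 is written as the single byte 0xE9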

The Char type

The Haskell 98 Report claims that the type Char represents Unicode. It goes on to provide I/O primitives using the Char type, define FilePath as [Char], etc. Most implementations treat the octets interchanged with the operating system (file contents, filenames, program arguments and the environment) as characters belonging to the Latin-1 subset. Hugs treats them as a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings are not legal encodings.

Using Unicode for Char seems the principled thing to do. If we retain it:

  • Flexible handling of character encodings is needed, but not necessarily as part of this standard.
  • BinaryIO is needed anyway, and would provide a base for these encodings.
  • A simple character-based I/O interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
  • An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment (one possible shape is sketched below).
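
As a sketch of what such an abstract type might look like (OSString, decodeOS and encodeOS are hypothetical names, not an existing API; Latin-1 stands in for whatever encoding the O/S actually uses):

    import Data.Char (chr, ord)
    import Data.Word (Word8)

    -- Hypothetical opaque carrier for data in O/S form
    -- (filenames, program arguments, the environment).
    newtype OSString = OSString [Word8]

    -- Conversions are explicit and may fail: not every byte string is a
    -- legal encoding, and not every String is encodable. Latin-1 is used
    -- here only as a stand-in codec.
    decodeOS :: OSString -> Maybe String
    decodeOS (OSString ws) = Just (map (chr . fromIntegral) ws)
      -- Latin-1 decoding never fails; a UTF-8 decoder would return
      -- Nothing on ill-formed input.

    encodeOS :: String -> Maybe OSString
    encodeOS cs
      | all ((<= 0xFF) . ord) cs = Just (OSString (map (fromIntegral . ord) cs))
      | otherwise                = Nothing   -- not representable in Latin-1

Functions such as getArgs or readFile would then traffic in OSString, with decoding to String requested explicitly where text is wanted.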