
Unicode

The Haskell 98 Report claims the language uses Unicode (also known as ISO 10646-1). Most of the rest of the world uses some other character set, or at least a particular encoding of Unicode, and the Report is silent on how this gap is to be bridged. No implementation yet complies fully with Unicode.

Background on Unicode

  • Unicode defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
  • Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc, and some glyphs may correspond to a sequence of characters (e.g. base character and combining mark characters).
  • The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
  • For reasons of backwards compatibility and space efficiency, there are a variety of variable-length encodings of the code points themselves into byte streams.
    • UTF-8 seeks to ensure that the ASCII characters retain their traditional coding in the bottom 7 bits of a single byte. Non-ASCII characters are coded using two or more bytes with the top bit set. (A minimal encoder sketch appears after this list.)
    • UTF-16 encodes most characters as a single 16-bit unit. Unfortunately 16 bits do not cover the entire code space, so characters outside the Basic Multilingual Plane are encoded as a pair of 16-bit 'surrogate' units. So although most characters fit in a single 16-bit field, some must be coded as two successive fields.
    • UCS-4 uses a full 32-bit word per character.
    • To make things more exciting, the UTF-16 and UCS-4 encodings come in big-endian and little-endian variants, depending on the machine the data was originally written on. So a reader of a raw byte-stream must work out the byte ordering before splitting it into 16- or 32-bit units. This is often done by applying heuristics to the first few bytes, although there is also a 'byte-order mark', a non-printing character that may or may not be present.
  • Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity.
  • Unix-like systems and many others traditionally deal with byte-streams. Various regional encodings are still widely used, but UTF-8 is growing in popularity.
  • Windows NT and later uses UTF-16.
  • Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type wchar_t (wide character) is UCS-4.
  • Any system must be able to read/write files that originated on any other platform.
  • As an example of the complex heuristics needed to guess the encoding of any particular file, see the XML standard.
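
As a concrete illustration of the variable-length scheme described for UTF-8 above, here is a minimal encoder for a single code point (a sketch only; the name utf8Encode is ours, and a real codec would also validate surrogates and handle decoding):

    import Data.Word (Word8)
    import Data.Char (ord)
    import Data.Bits ((.&.), (.|.), shiftR)

    -- Encode one Unicode code point as one to four UTF-8 octets.
    utf8Encode :: Char -> [Word8]
    utf8Encode c
      | n < 0x80    = [octet n]                                  -- 0xxxxxxx
      | n < 0x800   = [octet (0xc0 .|. shiftR n 6), cont n]      -- 110xxxxx 10xxxxxx
      | n < 0x10000 = [octet (0xe0 .|. shiftR n 12), cont (shiftR n 6), cont n]
      | otherwise   = [octet (0xf0 .|. shiftR n 18), cont (shiftR n 12),
                       cont (shiftR n 6), cont n]
      where
        n      = ord c
        octet  = fromIntegral
        cont m = octet (0x80 .|. (m .&. 0x3f))                   -- 10xxxxxx

For example, utf8Encode '\xe9' yields [0xc3,0xa9], the familiar two-byte form of é.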

Unicode in Haskell source

The Haskell 98 Report claims that Haskell source code uses the Unicode character set. If full Unicode source were allowed, how would implementations know which encoding a given file used?

  • Jhc is the only implementation that allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8.
  • Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
  • Others treat source code as Latin-1.

Some things we could do:

  • Revert to US-ASCII, Latin-1 or implementation-defined character sets.
  • Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this).
  • Allow Unicode, with a mechanism for specifying encoding in the source file.
  • Allow Unicode with the encoding specified by the current locale (as currently done by Hugs). This is arguably the correct thing for all programs that read text files, but it makes Haskell source using non-ASCII characters non-portable. (We could specify that all compilers must support UTF-8 and/or some other portable form too.)

If Unicode is allowed, should its use be restricted, e.g. to character and string literals?

The Char type

The Haskell 98 Report claims that the type Char represents Unicode, which seems to be the canonical choice. The Char functions in GHC and Hugs work with Unicode, with one divergence from the Report:

  • isAlpha selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
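
A small illustration of the divergence, runnable with GHC (Hugs behaves likewise when built with Unicode support); the escape '\x4e2d', a CJK ideograph, avoids depending on the source encoding:

    import Data.Char (isAlpha, isUpper, isLower)

    -- '\x4e2d' is alphabetic under Unicode, but neither upper- nor lower-case,
    -- so the Report's definition (isUpper or isLower) would reject it.
    main :: IO ()
    main = do
        print (isAlpha '\x4e2d')                      -- True
        print (isUpper '\x4e2d' || isLower '\x4e2d')  -- False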

Input and Output

Haskell 98 provides I/O primitives using the Char type.

  • All character-based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale.
  • Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset.

Assuming we retain Unicode as the representation of Char:

  • Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation?
  • BinaryIO is needed anyway, and would provide a base for these encodings (a sketch of the composition follows this list).
  • A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However it might not be in the Prelude if we shrink the Prelude.
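
A sketch of that composition, under the assumption that binary I/O delivers octets; readFileWith is a hypothetical name, openBinaryFile is a common extension (it is in GHC's System.IO), and the decoder argument stands in for whatever codec interface is adopted:

    import System.IO (openBinaryFile, hGetContents, IOMode(ReadMode))
    import Data.Word (Word8)
    import Data.Char (ord)

    -- Character input as raw octet input composed with a decoding function.
    readFileWith :: ([Word8] -> String) -> FilePath -> IO String
    readFileWith decode path = do
        h <- openBinaryFile path ReadMode
        s <- hGetContents h               -- each Char here carries one octet
        return (decode (map (fromIntegral . ord) s))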

Strings in System functions

Native system calls use varying representations of strings:

  • Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all).
  • The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16.

Haskell 98 defines FilePath as String; the functions in System use String for program arguments and environment values.

  • Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale.
  • Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.
  • The ForeignFunctionInterface specifies CString functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations.

A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to String and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many.
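
One possible shape for such an abstract type, sketched with hypothetical names and with Latin-1 standing in for the platform's real representation:

    import Data.Word (Word8)
    import Data.Char (chr, ord)

    -- Opaque carrier for strings in the operating system's own form: raw
    -- bytes on Unix (not necessarily character data), UTF-16 units on
    -- recent Windows.
    newtype OSString = OSString [Word8]

    -- Conversions are explicit because they can fail or lose information.
    -- Under the Latin-1 stand-in, decoding happens to be total, but
    -- encoding fails for characters above 0xff.
    fromOSString :: OSString -> String
    fromOSString (OSString ws) = map (chr . fromIntegral) ws

    toOSString :: String -> Maybe OSString
    toOSString s
      | all ((< 0x100) . ord) s = Just (OSString (map (fromIntegral . ord) s))
      | otherwise               = Nothing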

A Straw-Man Proposal

  • Internal character representation. The Haskell type Char is UCS-4.
  • Haskell source encoding.
    • Introduce a pragma {-# ENCODING e #-} with a range of possible values of the encoding e. If the pragma is present, it must be at the beginning of the file. If it is not present, the file is encoded in ASCII. Note that even if the pragma is present, some heuristic may be needed to interpret the encoding declaration itself, as in XML; the fact that the first three characters must be {-# will be useful here. Haskell compilers must support at least the encodings ASCII, LATIN1, and UTF8. (An example file appears after this list.)
  • A literal string may contain any literal character representable in the source encoding. In addition, escapes are provided to permit the specification of any Unicode character (which may or may not be otherwise representable in the source encoding).
  • An identifier may contain any Unicode alphanumeric or symbol characters from a defined range. Thus, a source text may not be representable in certain other encodings (especially in ASCII).
  • I/O. All raw I/O is in terms of octets, i.e. Word8.
  • Conversions. Pure functions exist to convert octets to and from any particular encoding:
       stringDecode :: Encoding -> [Word8] -> [Char]
       stringEncode :: Encoding -> [Char] -> [Word8]
    
    The codecs must operate on strings, not individual characters, because some encodings use variable-length sequences of octets. (A minimal LATIN1 model appears after this list.)
  • Efficiency. Semantically, character-based I/O is a simple composition of the raw I/O primitives with an encoding conversion function. However, for efficiency, an implementation might choose to provide certain encoded I/O operations primitively. If such primitives are exposed to the user, they should have standard names so that other implementations can provide the same functionality in pure Haskell Prime.
  • Locales. It may be possible to retain the traditional I/O signatures for hGetChar, hPutChar, readFile, writeFile, etc., but only by introducing a stateful notion of the current encoding associated with each individual handle. The default encoding could be inherited from the operating system environment, but it should also be possible to change the encoding explicitly (a usage sketch follows this list).
       getIOEncoding :: Handle -> IO Encoding
       setIOEncoding :: Encoding -> Handle -> IO ()
       resetIOEncoding :: Handle -> IO ()  -- go back to default
    
  • Filenames, program arguments, environment.
    • Filenames are stored in Haskell as [Char], but the operating system should receive [Word8] for any I/O using filenames. Some encoding conversion is therefore required. Usually, this will be platform-dependent, and so the actual encoding may be hidden from the programmer as part of the default locale.
    • Program arguments, and symbols from the environment, are supplied by the operating system to the Haskell program as [Word8]. The program is responsible for conversion to [Char]. Again, there may be a default encoding chosen based on the locale.
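
By way of illustration, a source file under the proposed encoding pragma might begin as follows (the pragma and the encoding names are part of this straw-man proposal, not an implemented feature):

    {-# ENCODING UTF8 #-}
    -- The first three characters are '{', '-', '#' in every required
    -- encoding, which is what lets a compiler bootstrap its guess.
    module Example where

    -- A non-ASCII identifier and literal, representable because the
    -- file declares UTF8.
    grüße :: String
    grüße = "Grüße"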
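A minimal model of the proposed conversion interface, implementing only LATIN1 (ASCII and UTF8 would be handled analogously; this is a sketch, not a complete codec library):

    import Data.Word (Word8)
    import Data.Char (chr, ord)

    data Encoding = ASCII | LATIN1 | UTF8
      deriving (Eq, Show)

    -- LATIN1 is the identity on code points 0x00-0xff, which makes it the
    -- simplest possible codec; variable-length encodings such as UTF8 are
    -- the reason the interface works on whole strings.
    stringDecode :: Encoding -> [Word8] -> [Char]
    stringDecode LATIN1 = map (chr . fromIntegral)
    stringDecode e      = error ("stringDecode: " ++ show e ++ " not sketched")

    stringEncode :: Encoding -> [Char] -> [Word8]
    stringEncode LATIN1 = map (fromIntegral . ord)   -- truncates above 0xff
    stringEncode e      = error ("stringEncode: " ++ show e ++ " not sketched")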
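And a sketch of the per-handle encoding state in use; setIOEncoding is stubbed out here, since the API is proposed rather than implemented:

    import System.IO

    data Encoding = ASCII | LATIN1 | UTF8

    -- Stub standing in for the proposed per-handle API; a real
    -- implementation would install the corresponding codec on the handle.
    setIOEncoding :: Encoding -> Handle -> IO ()
    setIOEncoding _ _ = return ()

    main :: IO ()
    main = do
        h <- openFile "data.txt" ReadMode
        setIOEncoding UTF8 h      -- ask for UTF-8 decoding on this handle
        s <- hGetContents h
        putStr s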