Changes between Version 2 and Version 3 of Unicode


Timestamp: Dec 3, 2005 12:56:39 PM
Author: ross@…
= Unicode =

The Haskell'98 Report claims the language uses Unicode. Most of the rest of the world uses something else, or at best some particular encoding of Unicode, and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode.

== Background on Unicode ==

 * [http://www.unicode.org Unicode] defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
 * Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc, and some glyphs may correspond to a sequence of characters (e.g. a base character plus combining mark characters).
 * The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
 * For reasons of backwards compatibility and space efficiency, there are a variety of ''variable-length'' encodings of the code points themselves into byte streams.
   * UTF-8 seeks to ensure that the ASCII characters retain their traditional coding in the bottom 7 bits of a single byte. Non-ASCII characters are coded using two or more bytes with the top bit set.
   * UTF-16 encodes most characters as a single 16-bit unit. Unfortunately 16 bits cannot cover the entire code space, so code points above 0xffff are encoded as a ''surrogate pair'': two successive 16-bit units drawn from a range reserved for the purpose.
   * UCS-4 uses a full 32-bit word per character.
   * To make things more exciting, the UTF-16 and UCS-4 encodings have two variations, depending on the endianness of the machine they were originally written on. So if you read a raw byte-stream and want to convert it to 16-bit chunks, you first need to work out the byte-ordering. This is often done by reading a few bytes and applying a heuristic, although there is also a 'byte-order mark', a non-printing character that may or may not be present.
 * Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity.
 * Since Unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms.
 * The NTFS file system (Windows) stores filenames and file contents in UTF-16, and the Win32 interface provides functions using UTF-16.
 * Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type `wchar_t` (wide character) is UCS-4.
 * Any system must be able to read/write files that originated on any other platform.
 * As an example of the complex heuristics needed to guess the encoding of any particular file, see the [http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing XML standard].
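The UTF-8 byte layout described above can be sketched in a few lines of Haskell. This is an illustration of the bit-level rules only (it assumes a valid code point and ignores error cases); a real program would use a library encoder.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Sketch of UTF-8 encoding for a single code point: ASCII stays one
-- byte, everything else becomes two to four bytes with the top bit set.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [byte n]                                -- 1 byte, plain ASCII
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]  -- 2 bytes
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]           -- 3 bytes
  | otherwise   = [0xF0 .|. byte (n `shiftR` 18),
                   cont (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]           -- 4 bytes, up to 0x10ffff
  where
    n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    cont x = 0x80 .|. byte (x .&. 0x3F)   -- trailing byte: 10xxxxxx
```

For example, `utf8Encode '\xE9'` yields `[0xC3, 0xA9]`: a single Latin-1 character becomes two bytes.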
== Unicode in Haskell source ==

The Haskell 98 Report claims that Haskell source code uses the Unicode character set.
No existing implementation allows unrestricted use of the Unicode character set in Haskell source. Most treat source code as Latin-1. If Unicode were allowed, how would implementations know which encoding was used?

Some things we could do:

 * Revert to US-ASCII, Latin-1 or implementation-defined character sets.
 * Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this).
 * Allow Unicode, with a mechanism for specifying the encoding.
 * Allow Unicode only in some places, e.g. character and string literals.
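On the literals option, it is worth noting that Haskell 98 already provides a portable numeric escape syntax (the descendant of the Haskell 1.4 \uNNNN form), so any code point can be written in plain ASCII source today. A small illustration:

```haskell
-- Arbitrary code points written in ASCII-only source via Haskell 98
-- numeric escapes (\x for hexadecimal); no agreement on a source
-- encoding is needed for this much.
lambda :: Char
lambda = '\x3BB'       -- GREEK SMALL LETTER LAMDA, U+03BB

greeting :: String
greeting = "caf\xE9"   -- "café": \xE9 is U+00E9, outside ASCII but in Latin-1
```

The open question is whether such characters may also appear literally in the source file, and if so in which encoding.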

== The Char type ==

The Haskell 98 Report claims that the type `Char` represents Unicode. It goes on to provide I/O primitives using the `Char` type, define `FilePath` as `[Char]`, etc. Most implementations treat the octets interchanged with the operating system (file contents, filenames, program arguments and the environment) as characters, i.e. belonging to the Latin-1 subset. Hugs treats them as a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
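The claim that `Char` is a code point rather than an octet is directly observable. A quick check (the 0x10ffff upper bound is what current GHC and Hugs report; the Report itself does not pin it down):

```haskell
import Data.Char (chr, ord)

-- Char ranges over full Unicode code points, not single bytes:
euro :: Char
euro = chr 0x20AC                  -- the euro sign, well outside Latin-1

highest :: Int
highest = ord (maxBound :: Char)   -- 0x10ffff in current GHC and Hugs
```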

Using Unicode for `Char` seems the principled thing to do. If we retain it:

 * Flexible handling of character encodings is needed, but not necessarily as part of this standard.
 * [wiki:BinaryIO] is needed anyway, and would provide a base for these encodings.
 * A simple character-based I/O interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
 * An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.
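The last point can be made concrete with a sketch. The type `OSString` and both helpers below are illustrative names, not an existing API: the idea is that the O/S hands the program octets, and conversion to `String` becomes an explicit, possibly partial step instead of a silent assumption.

```haskell
import Data.Word (Word8)

-- Hypothetical abstract type for O/S-supplied data (filenames,
-- program arguments, environment): keep the raw octets and decode
-- explicitly, rather than pretending they are already characters.
newtype OSString = OSString [Word8]
  deriving (Eq, Show)

-- One total decoder: interpret the octets as Latin-1, as most
-- current implementations implicitly do.
decodeLatin1 :: OSString -> String
decodeLatin1 (OSString ws) = map (toEnum . fromIntegral) ws

-- The reverse direction is partial: only code points below 256 fit.
encodeLatin1 :: String -> Maybe OSString
encodeLatin1 s
  | all ((< 256) . fromEnum) s = Just (OSString (map (fromIntegral . fromEnum) s))
  | otherwise                  = Nothing
```

A locale-aware implementation would swap in UTF-8 or UTF-16 decoders behind the same abstract type.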