Changes between Version 8 and Version 9 of Unicode


Timestamp:
Dec 5, 2005 12:22:11 AM (8 years ago)
Author:
ross@…
Comment:

--

  • Unicode

    v8 v9  
    11= Unicode = 
    22 
    3 The Haskell'98 Report claims the language uses Unicode. Most of the rest of the world uses something else, or at least some encoding, and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode. 
     3The Haskell'98 Report claims the language uses Unicode (aka ISO 10646-1). Most of the rest of the world uses something else, or at least some encoding, and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode. 
    44 
    55== Background on Unicode == 
     
    2121      although there is also a 'byte-order mark' which is a non-printing character which may or may not be present. 
    2222  * Other character sets and their encodings may be treated as encodings of Unicode, but they will not represent all characters, and in some cases (e.g. ISO 2022) conversion to Unicode and back will not be an identity. 
    23   * Since Unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms. 
    24   * The NTFS file system (Windows) stores filenames and file contents in UTF-16, and the Win32 interface provides functions using UTF-16. 
     23  * Unix-like systems and many others traditionally deal with byte-streams. Various regional encodings are still widely used, but UTF-8 is growing in popularity. 
     24  * Windows NT and later use UTF-16. 
    2525  * Almost no system stores UCS-4 in files, but in some C libraries (e.g. glibc), the type `wchar_t` (wide character) is UCS-4. 
    2626  * Any system must be able to read/write files that originated on any other platform. 
     
    5050 * `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters. 
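
As a small illustration of the `isAlpha` point, the sketch below picks out characters that are alphabetic but neither upper- nor lower-case. The helper names `isCased` and `alphaButUncased` are made up for this example, and the behaviour described assumes an implementation whose `Data.Char` predicates follow Unicode general categories (as GHC's do); other implementations may differ.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- Hypothetical helper: the "union of lower- and upper-case letters".
isCased :: Char -> Bool
isCased c = isUpper c || isLower c

-- Hypothetical helper: characters that are alphabetic yet have no case.
-- Under a Unicode-aware isAlpha this keeps letters such as
-- HEBREW LETTER ALEF (U+05D0), which a case-based test would miss.
alphaButUncased :: String -> String
alphaButUncased = filter (\c -> isAlpha c && not (isCased c))
```

Applied to a string mixing Latin letters, digits and a Hebrew letter, `alphaButUncased` would be expected to keep only the Hebrew letter.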
    5151 
    52 == I/O and System functions == 
     52== Input and Output == 
    5353 
    54 The Report goes on to provide I/O primitives using the `Char` type, define `FilePath` as `String`, and have the functions in `System` use `String`. 
    55  * Hugs treats the bytes interchanged with the operating system (I/O, filenames, program arguments and the environment) as using a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings. 
    56  * All character based I/O in jhc-compiled programs uses the encoding of the current locale. Handling of strings will be similar when the CString functions become conformant. 
    57  * Other implementations treat the bytes interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset. 
     54Haskell 98 provides I/O primitives using the `Char` type. 
     55 
     56 * All character based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale. 
     57 * Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset. 
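
The difference between the two behaviours above can be sketched with plain byte lists. This is only an illustration: `latin1` and `utf8` are hypothetical helpers, not functions of any implementation mentioned here, and UTF-8 merely stands in for "the encoding of the current locale". The decoder is deliberately minimal and does not reject overlong forms or surrogates.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Latin-1 view: every byte becomes the character with the same code point.
-- This never fails, but multi-byte sequences turn into several characters.
latin1 :: [Word8] -> String
latin1 = map (chr . fromIntegral)

-- Minimal UTF-8 decoder (assumed stand-in for a locale encoding).
-- Returns Nothing when the bytes are not a well-formed sequence.
utf8 :: [Word8] -> Maybe String
utf8 [] = Just []
utf8 (b : bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> utf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise = Nothing
  where
    -- Consume n continuation bytes (10xxxxxx) and fold their low six
    -- bits into the code point started by the lead byte.
    multi n lead =
      let (cont, rest) = splitAt n bs
          ok = length cont == n && all (\c -> c .&. 0xC0 == 0x80) cont
          code = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cont
      in if ok && code <= 0x10FFFF then (chr code :) <$> utf8 rest else Nothing
```

For the two bytes `0xC3 0xA9`, `latin1` yields two characters while `utf8` yields the single character U+00E9; for the lone byte `0xE9`, `latin1` yields U+00E9 while `utf8` fails. Neither reading is right for all inputs, which is why the choice of encoding matters.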
    5858 
    5959Assuming we retain Unicode as the representation of `Char`: 
    6060 
    61  * Flexible handling of character encodings is needed, but not necessarily as part of this standard. 
     61 * Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation? 
    6262 * [wiki:BinaryIO] is needed anyway, and would provide a base for these encodings. 
    63  * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. 
    64  * An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment. 
     63 * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However it might not be in the Prelude if we [wiki:Prelude shrink the Prelude]. 
     64 
     65== Strings in System functions == 
     66 
     67Native system calls use varying representations of strings: 
     68 
     69 * Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all). 
     70 * The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16. 
     71 
     72Haskell 98 defines `FilePath` as `String`; the functions in `System` use `String` for program arguments and environment values. 
     73 
     74 * Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale. 
     75 * Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset. 
     76 * The ForeignFunctionInterface specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations. 
     77 
     78A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many. 
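
One possible shape for such an abstract type is sketched below. Everything here is hypothetical: `NativeString`, `fromBytes`, `toBytes`, `toDisplay` and `fromDisplay` are names invented for the example, a byte list models only the Unix-like case, and ASCII stands in for the locale encoding to keep the sketch short.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type: a string exactly as the O/S supplied it.
-- A byte list models the Unix-like case; a UTF-16 platform would hold
-- 16-bit code units instead.
newtype NativeString = NativeString [Word8]
  deriving (Eq, Show)

fromBytes :: [Word8] -> NativeString
fromBytes = NativeString

toBytes :: NativeString -> [Word8]
toBytes (NativeString bs) = bs

-- Lossy view for display: bytes outside ASCII become U+FFFD.
-- (ASCII is an assumed stand-in for "the locale encoding".)
toDisplay :: NativeString -> String
toDisplay (NativeString bs) = map view bs
  where
    view b
      | b < 0x80 = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- Partial conversion from String: fails on characters that the
-- assumed encoding cannot represent.
fromDisplay :: String -> Maybe NativeString
fromDisplay = fmap NativeString . traverse enc
  where
    enc c
      | ord c < 0x80 = Just (fromIntegral (ord c))
      | otherwise = Nothing
```

With a type like this, a program argument can be handed to a file-opening call unchanged, since `toBytes . fromBytes` is the identity, whereas a round trip through `String` via `toDisplay` and `fromDisplay` can fail or lose information for bytes that are not legal in the encoding.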
    6579 
    6680== A Straw-Man Proposal ==