Changes between Version 13 and Version 14 of Unicode


Timestamp:
Dec 6, 2005 11:53:45 AM (8 years ago)
Author:
ross@…
Comment:

--

  • Unicode

    v13 v14  
    1 = Unicode = 
     1= Background on Unicode = 
    22 
    3 The Haskell 98 Report claims the language uses Unicode (aka ISO 10646-1). Most of the rest of the world uses something else, or at least some particular encoding of Unicode, and the Report is silent on how this gap is to be bridged. No implementation yet complies fully with Unicode. 
    4  
    5 == Background on Unicode == 
    6  
    7   * [http://www.unicode.org Unicode] defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity). 
     3  * [http://www.unicode.org Unicode] (or equivalently ISO 10646-1) defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity). 
    84  * Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc, and some glyphs may correspond to a sequence of characters (e.g. base character and combining mark characters). 
     5  * Unicode has a story for the display of mixed left-to-right and right-to-left scripts (the [http://www.unicode.org/reports/tr9/ BiDi algorithm]). 
    96  * The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1). 
    107  * For reasons of backwards compatibility and space efficiency, there are a variety of ''variable-length'' encodings of the 
     
    2724  * As an example of the complex heuristics needed to guess the encoding of any particular file, see the [http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing XML standard]. 
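
The code point range and the Latin-1 correspondence described above can be checked directly against `Char`. This is a minimal sketch, assuming an implementation whose `Char` covers the full Unicode range (as in GHC); the names `maxCodePoint`, `decodeLatin1` and `encodeLatin1` are illustrative and do not come from any library.

```haskell
import Data.Char (chr, isLatin1, ord)
import Data.Word (Word8)

-- Largest code point a Char can hold; 0x10FFFF when Char covers all of Unicode.
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Latin-1 decoding is the identity on code points: octet n becomes U+00nn.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Latin-1 encoding only succeeds when every code point is below 256.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | isLatin1 c = Just (fromIntegral (ord c))
      | otherwise  = Nothing
```

Any text containing a character above U+00FF makes `encodeLatin1` return `Nothing`, which is the gap that the variable-length encodings exist to fill.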
    2825 
    29 == Unicode in Haskell source == 
    30  
    31 The Haskell 98 Report claims that Haskell source code uses the Unicode character set. 
    32 If Unicode were allowed, how would implementations know which encoding was used? 
    33  * Jhc is the only implementation that allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8. 
    34  * Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals. 
    35  * Others treat source code as Latin-1. 
    36  * Jhc accepts several Unicode characters as alternatives to Haskell 
    37    keywords and reserved symbols. 
    38    * [chr 0x2192] '→' is equivalent to '->' 
    39    * [chr 0x2190] '←' is equivalent to '<-' 
    40    * [chr 0x2237] '∷' is equivalent to '::' 
    41    * [chr 0x2025] '‥' is equivalent to '..' 
    42    * [chr 0x21d2] '⇒' is equivalent to '=>' 
    43    * [chr 0x2200] '∀' is equivalent to 'forall' 
    44    * [chr 0x2203] '∃' is equivalent to 'exists' -- future extension will use 
    45    * In addition, there is experimental support for defining new operators and 
    46      names using various Unicode characters. 
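
The equivalences listed for Jhc amount to a small lookup table. The sketch below is not Jhc's lexer: `unicodeAlternatives` and `toAscii` are hypothetical names, and the rewrite is a naive character-by-character substitution that ignores string literals, comments and token boundaries.

```haskell
-- Unicode alternatives from the list above, paired with their ASCII spellings.
unicodeAlternatives :: [(Char, String)]
unicodeAlternatives =
  [ ('\x2192', "->")
  , ('\x2190', "<-")
  , ('\x2237', "::")
  , ('\x2025', "..")
  , ('\x21d2', "=>")
  , ('\x2200', "forall")
  , ('\x2203', "exists")
  ]

-- Replace each listed character by its ASCII spelling; leave all others alone.
-- A real lexer would also have to keep 'forall' and 'exists' separate from
-- adjacent identifier characters, which this sketch does not attempt.
toAscii :: String -> String
toAscii = concatMap expand
  where
    expand c = maybe [c] id (lookup c unicodeAlternatives)
```

For instance, `toAscii "f \x2237 a \x2192 b"` gives `"f :: a -> b"`.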
    47  
    48 Some things we could do: 
    49  
    50  * Revert to US-ASCII, Latin-1 or implementation-defined character sets. 
    51  * Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this). 
    52  * Allow Unicode, with a mechanism for specifying encoding in the source file. 
    53  * Allow Unicode with the encoding specified by the current locale (as currently done by Hugs). This is arguably the correct thing for all programs that read text files, but it makes Haskell source using non-ASCII characters non-portable. (We could specify that all compilers must support UTF-8 and/or some other portable form too.) 
    54  
    55 If Unicode is allowed, should its use be restricted, e.g. to character and string literals? 
    56  
    57 What about supporting scripts where characters are written right-to-left instead of left-to-right?  In theory, you still have a simple sequence of characters, but what about mixing L-to-R and R-to-L scripts, e.g. the spelling of language keywords?  Should 'module' be spelled 'eludom' in a R-to-L encoding?  Or is this just a text-editor/visualisation problem that does not concern the language standard?  (Note that, for instance in Arabic script, words are R-to-L but numbers are L-to-R.) 
    58  
    59 Unicode has space characters of varying width (m-width, n-width, l-width, non-breaking space, narrow non-breaking space, zero-width non-breaking space...).  How do these interact with the layout rules? 
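
One way to see why the layout question is not trivial is to ask how a given implementation classifies these characters. The following sketch uses only `isSpace` and `generalCategory` from `Data.Char`; the list `spaceCandidates` and the helpers `classify` and `report` are illustrative names, and the results depend on the implementation's Unicode tables.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isSpace)

-- A few space-like characters, named as in the Unicode character database.
spaceCandidates :: [(String, Char)]
spaceCandidates =
  [ ("space", ' ')
  , ("no-break space", '\x00A0')
  , ("en space", '\x2002')
  , ("em space", '\x2003')
  , ("narrow no-break space", '\x202F')
  , ("zero width no-break space", '\xFEFF')
  ]

-- Whether the character counts as white space, and its general category.
classify :: Char -> (Bool, GeneralCategory)
classify c = (isSpace c, generalCategory c)

-- One line of text per candidate, suitable for printing with mapM_ putStrLn.
report :: [String]
report = [ name ++ ": " ++ show (classify c) | (name, c) <- spaceCandidates ]
```

With GHC's `Data.Char`, the em space is reported as white space while U+FEFF falls in the `Format` category and is not, so a layout rule phrased in terms of "white space" would treat the two differently.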
    60  
    61 == The Char type == 
    62  
    63 The Haskell 98 Report claims that the type `Char` represents Unicode, which seems to be the canonical choice. 
    64 The functions of `Char` work with Unicode for GHC and Hugs, with one divergence from the Report: 
    65  * `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters. 
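
The divergence shows up on letters that have no case. A minimal sketch, in which `isCasedLetter` and `alphaButUncased` are names invented here for illustration:

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- The narrower reading: a letter is anything that is upper- or lower-case.
isCasedLetter :: Char -> Bool
isCasedLetter c = isUpper c || isLower c

-- The characters of a string on which the two definitions disagree.
alphaButUncased :: String -> String
alphaButUncased = filter (\c -> isAlpha c && not (isCasedLetter c))
```

Applied to a string containing HIRAGANA LETTER A (U+3042) or HEBREW LETTER ALEF (U+05D0), `alphaButUncased` returns those characters: they are alphabetic but neither upper- nor lower-case.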
    66  
    67 == Input and Output == 
    68  
    69 Haskell 98 provides I/O primitives using the `Char` type. 
    70  
    71  * All character-based I/O in Hugs and Jhc-compiled programs uses the encoding of the current locale. 
    72  * Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset. 
    73  
    74 Assuming we retain Unicode as the representation of `Char`: 
    75  
    76  * Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation? 
    77  * [wiki:BinaryIO] is needed anyway, and would provide a base for these encodings. 
    78  * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However, it might not be in the Prelude if we [wiki:Prelude shrink the Prelude]. 
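
As a point of comparison for what flexible encoding handling could look like, GHC releases that postdate this page provide per-handle encodings in `System.IO` (`hSetEncoding`, with `utf8` and `latin1` among the predefined encodings). The sketch below assumes such a GHC; `writeFileUtf8` and `readFileLatin1` are illustrative helper names, not library functions.

```haskell
import System.IO

-- Write a String to a file, encoding its characters as UTF-8.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Read a file octet by octet: Latin-1 maps each octet to the Char with the
-- same code point, so the result shows the raw bytes as characters.
readFileLatin1 :: FilePath -> IO String
readFileLatin1 path = withFile path ReadMode $ \h -> do
  hSetEncoding h latin1
  s <- hGetContents h
  length s `seq` return s  -- force the lazy read before the handle closes
```

Writing `"\233"` (é) with `writeFileUtf8` and reading it back with `readFileLatin1` yields the two characters `"\195\169"`, the UTF-8 octets C3 A9 seen through Latin-1.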
    79  
    80 == Strings in System functions == 
    81  
    82 Native system calls use varying representations of strings: 
    83  
    84  * Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all). 
    85  * The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16. 
    86  
    87 Haskell 98 defines `FilePath` as `String`; the functions in `System` use `String` for program arguments and environment values. 
    88  
    89  * Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale. 
    90  * Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset. 
    91  * The ForeignFunctionInterface specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations. 
    92  
    93 A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many. 
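
A rough sketch of the abstract-type idea, for a byte-oriented system only: the type `OSString` and its functions are hypothetical, and plain ASCII stands in for whatever encoding the locale would really select. The point is that the original octets are kept, and decoding is a separate step that is allowed to fail.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type: the octets exactly as the O/S supplied them.
newtype OSString = OSString [Word8]
  deriving (Eq, Show)

-- Decoding may fail, so it returns Maybe. ASCII is a stand-in encoding here.
toStringAscii :: OSString -> Maybe String
toStringAscii (OSString bs) = mapM decode bs
  where
    decode b
      | b < 128   = Just (chr (fromIntegral b))
      | otherwise = Nothing

-- Encoding may fail as well, for characters the encoding cannot represent.
fromStringAscii :: String -> Maybe OSString
fromStringAscii s = fmap OSString (mapM encode s)
  where
    encode c
      | ord c < 128 = Just (fromIntegral (ord c))
      | otherwise   = Nothing
```

A program argument held as an `OSString` can be handed back to the system as a filename unchanged, even when `toStringAscii` returns `Nothing` for it.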
    94  
    95 == A Straw-Man Proposal == 
    96  
    97  * '''Internal character representation.''' 
    98    The Haskell type {{{Char}}} is UCS-4. 
    99  * '''Haskell source encoding.''' 
    100    * Introduce a pragma {{{{-# ENCODING e #-}}}} with a range of possible 
    101      values of the encoding {{{e}}}.  If the pragma is present, it must be 
    102      at the beginning of the file.  If it is not present, the file is 
    103      encoded in ASCII.  Note that even if the pragma is present, some 
    104      heuristic may be needed even to get as far as interpreting the 
    105      encoding declaration, like in 
    106      [http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing XML]. 
    107      The fact that the first three characters must be {{{{-#}}} will be 
    108      useful here. Haskell compilers must support at least the encodings ASCII, LATIN1, and UTF8.  
    109    * A literal string may contain any literal character representable in the 
    110      source encoding.  In addition, escapes are provided to permit the specification of 
    111      ''any'' Unicode character (which may or may not be otherwise 
    112      representable in the source encoding). 
    113    * An identifier may contain any Unicode alphanumeric or symbol 
    114      characters from a defined range.  Thus, a source text may not be 
    115      representable in certain other encodings (especially in ASCII). 
    116  * '''I/O.'''   
    117    All raw I/O is in terms of octets, i.e. {{{Word8}}}. 
    118  * '''Conversions.''' 
    119    Pure functions exist to convert octets to and from any particular encoding: 
    120 {{{ 
    121    stringDecode :: Encoding -> [Word8] -> [Char] 
    122    stringEncode :: Encoding -> [Char] -> [Word8] 
    123 }}} 
    124    The codecs must operate on strings, not individual characters, because some 
    125    encodings use variable-length sequences of octets. 
    126  * '''Efficiency.''' 
    127    Semantically, character-based I/O is a simple composition of the raw  
    128    I/O primitives with an encoding conversion function.  However, for 
    129    efficiency, an implementation might choose to provide certain encoded 
    130    I/O operations primitively.  If such primitives are exposed to the  
    131    user, they should have standard names so that other implementations can 
    132    provide the same functionality in pure Haskell Prime. 
    133  * '''Locales.''' 
    134    It may be possible to retain the traditional I/O signatures for 
    135    hGetChar, hPutChar, readFile, writeFile, etc, but only by introducing 
    136    a stateful notion of ''current encoding'' associated with each 
    137    individual handle.  The default encoding could be inherited from the 
    138    operating system environment, but it should also be possible to 
    139    change the encoding explicitly. 
    140 {{{ 
    141    getIOEncoding :: Handle -> IO Encoding 
    142    setIOEncoding :: Encoding -> Handle -> IO () 
    143    resetIOEncoding :: Handle -> IO ()  -- go back to default 
    144 }}} 
    145  * '''Filenames, program arguments, environment.''' 
    146    * Filenames are stored in Haskell as {{{[Char]}}}, but the operating 
    147      system should receive {{{[Word8]}}} for any I/O using filenames. 
    148      Some encoding conversion is therefore required.  Usually, this will 
    149      be platform-dependent, and so the actual encoding may be hidden 
    150      from the programmer as part of the default locale. 
    151    * Program arguments, and symbols from the environment, are supplied 
    152      by the operating system to the Haskell program as {{{[Word8]}}}. 
    153      The program is responsible for conversion to {{{[Char]}}}.  Again, 
    154      there may be a default encoding chosen based on the locale. 
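
The conversion functions of the straw-man proposal could be sketched as follows for the three encodings the proposal requires. This is only an illustration under stated assumptions: the constructors of `Encoding` are invented here, unrepresentable characters are encoded as `?`, undecodable octets become U+FFFD, and the UTF-8 decoder is lenient (it does not reject overlong forms or surrogates). The proposal itself specifies none of these choices.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical representation of the encodings named in the proposal.
data Encoding = ASCII | LATIN1 | UTF8
  deriving (Eq, Show)

-- Substitute used for undecodable input (an assumption of this sketch).
replacement :: Char
replacement = '\xFFFD'

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode ASCII  = map (narrow 0x7F)
stringEncode LATIN1 = map (narrow 0xFF)
stringEncode UTF8   = concatMap encodeUtf8Char

-- Characters above the limit become '?' (an assumption of this sketch).
narrow :: Int -> Char -> Word8
narrow limit c
  | ord c <= limit = fromIntegral (ord c)
  | otherwise      = fromIntegral (ord '?')

-- One to four octets per character, which is why the codecs work on strings.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    cont s     = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode ASCII  = map (\b -> if b < 0x80 then chr (fromIntegral b) else replacement)
stringDecode LATIN1 = map (chr . fromIntegral)
stringDecode UTF8   = decodeUtf8

-- Lenient UTF-8 decoder: a bad lead or continuation octet yields U+FFFD.
decodeUtf8 :: [Word8] -> [Char]
decodeUtf8 [] = []
decodeUtf8 (b:bs)
  | b < 0x80           = chr (fromIntegral b) : decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = replacement : decodeUtf8 bs
  where
    multi k hi =
      let (cs, rest) = splitAt k bs
          ok = length cs == k && all (\x -> x .&. 0xC0 == 0x80) cs
          n  = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                     (fromIntegral hi) cs :: Int
      in if ok && n <= 0x10FFFF
           then chr n : decodeUtf8 rest
           else replacement : decodeUtf8 bs
```

For example, `stringEncode UTF8 "\x20AC"` produces the three octets `[0xE2, 0x82, 0xAC]`, and `stringDecode UTF8` maps them back to the single character U+20AC.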
     26See also UnicodeInHaskellSource and CharAsUnicode.