Changes between Initial Version and Version 1 of CharAsUnicode


Timestamp:
Dec 6, 2005 11:53:38 AM (8 years ago)
Author:
ross@…
Comment:

--

  • CharAsUnicode

    v1 v1  
     1= The Char type = 
     2 
     3The Haskell 98 Report ([http://www.haskell.org/onlinereport/basic.html#characters Characters and Strings]) states that the type `Char` represents [wiki:Unicode], which seems to be the canonical choice. 
     4The functions of the [http://www.haskell.org/onlinereport/char.html Char] module work with Unicode for GHC and Hugs, with one divergence from the Report: 
     5 * `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters. 
     6More sophisticated functions could be provided by additional libraries. 
     7 
     8== Input and Output == 
     9 
     10The Haskell 98 [http://www.haskell.org/onlinereport/standard-prelude.html#preludeio Prelude] and [http://www.haskell.org/onlinereport/io.html IO] modules provide I/O primitives using the `Char` type. 
     11 
     12 * All character based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale. 
     13 * Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset. 
     14 
     15Assuming we retain Unicode as the representation of `Char`: 
     16 
     17 * Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation? 
     18 * [wiki:BinaryIO] is needed anyway, and would provide a base for these encodings. 
     19 * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However, it might not be in the Prelude if we [wiki:Prelude shrink the Prelude]. 
     20 
     21== Strings in System functions == 
     22 
     23Native system calls use varying representations of strings: 
     24 
     25 * Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all). 
     26 * The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16. 
     27 
     28Haskell 98 defines `FilePath` as `String` (used in the [http://www.haskell.org/onlinereport/standard-prelude.html#preludeio Prelude], [http://www.haskell.org/onlinereport/io.html IO] and [http://www.haskell.org/onlinereport/directory.html Directory] modules). 
     29The functions in [http://www.haskell.org/onlinereport/system.html System] use `String` for program arguments and environment values. 
     30 
     31 * Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale. 
     32 * Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset. 
     33 * The ForeignFunctionInterface specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations. 
     34 
     35A disadvantage of using encodings is that some byte-strings may not be legal encodings, so that, for example, using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many. 
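
One possible shape for such an abstract type is sketched below, assuming a Unix-like system where the O/S form is a byte string. The names `OSString`, `fromBytes`, `toBytes`, `toDisplayString` and `fromAsciiString` are hypothetical and do not come from any existing library; on Windows the representation would presumably be UTF-16 code units instead.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type for strings in O/S form. In a real library
-- the constructor would not be exported, so the representation could
-- differ per platform (bytes here, UTF-16 code units on Windows).
newtype OSString = OSString [Word8]
  deriving (Eq, Show)

-- Wrap bytes received from the system (e.g. a program argument).
fromBytes :: [Word8] -> OSString
fromBytes = OSString

-- Unwrap for passing back to the system (e.g. as a filename).
-- No decoding happens, so every byte sequence survives unchanged.
toBytes :: OSString -> [Word8]
toBytes (OSString bs) = bs

-- Lossy view for display only: non-ASCII bytes become '?'.
-- (Assumption: a real version would consult the locale's encoding.)
toDisplayString :: OSString -> String
toDisplayString (OSString bs) = map view bs
  where
    view b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '?'

-- Partial conversion from String: succeeds only for pure ASCII, where
-- the byte representation is unambiguous in this sketch.
fromAsciiString :: String -> Maybe OSString
fromAsciiString s
  | all ((< 0x80) . ord) s = Just (OSString (map (fromIntegral . ord) s))
  | otherwise              = Nothing
```

With this design a program argument that is not a legal encoding can still be handed back to the system as a filename, because `toBytes . fromBytes` is the identity; only the display conversion loses information.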
     36 
     37== A Straw-Man Proposal == 
     38 
     39 * '''I/O.'''   
     40   All raw I/O is in terms of octets, i.e. {{{Word8}}}. 
     41 * '''Conversions.''' 
     42   Pure functions exist to convert octets to and from any particular encoding: 
     43{{{ 
     44   stringDecode :: Encoding -> [Word8] -> [Char] 
     45   stringEncode :: Encoding -> [Char] -> [Word8] 
     46}}} 
     47   The codecs must operate on strings, not individual characters, because some 
     48   encodings use variable-length sequences of octets. 
     49 * '''Efficiency.''' 
     50   Semantically, character-based I/O is a simple composition of the raw  
     51   I/O primitives with an encoding conversion function.  However, for 
     52   efficiency, an implementation might choose to provide certain encoded 
     53   I/O operations primitively.  If such primitives are exposed to the  
     54   user, they should have standard names so that other implementations can 
     55   provide the same functionality in pure Haskell Prime. 
     56 * '''Locales.''' 
     57   It may be possible to retain the traditional I/O signatures for 
     58   `hGetChar`, `hPutChar`, `readFile`, `writeFile`, etc., but only by introducing 
     59   a stateful notion of ''current encoding'' associated with each 
     60   individual handle.  The default encoding could be inherited from the 
     61   operating system environment, but it should also be possible to 
     62   change the encoding explicitly. 
     63{{{ 
     64   getIOEncoding :: Handle -> IO Encoding 
     65   setIOEncoding :: Encoding -> Handle -> IO () 
     66   resetIOEncoding :: Handle -> IO ()  -- go back to default 
     67}}} 
     68 * '''Filenames, program arguments, environment.''' 
     69   * Filenames are stored in Haskell as {{{[Char]}}}, but the operating 
     70     system should receive {{{[Word8]}}} for any I/O using filenames. 
     71     Some encoding conversion is therefore required.  Usually, this will 
     72     be platform-dependent, and so the actual encoding may be hidden 
     73     from the programmer as part of the default locale. 
     74   * Program arguments, and symbols from the environment, are supplied 
     75     by the operating system to the Haskell program as {{{[Word8]}}}. 
     76     The program is responsible for conversion to {{{[Char]}}}.  Again, 
     77     there may be a default encoding chosen based on the locale.
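
The conversion functions in the proposal could look roughly like the following minimal sketch. The proposal leaves `Encoding` abstract; the two-constructor type here, the use of `?` and U+FFFD as replacement characters, and the decision to replace rather than fail on bad input are all assumptions made for illustration. The UTF-8 decoder is deliberately simplified: it does not reject overlong forms or surrogate code points.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical: only two encodings, enough to show the idea.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    latin1 c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F  -- '?', an assumed replacement
stringEncode UTF8 = concatMap utf8
  where
    utf8 c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n      = ord c
        hi s   = fromIntegral (n `shiftR` s)
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8 = go
  where
    go [] = []
    go (b : bs)
      | b < 0x80           = chr (fromIntegral b) : go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs
      | otherwise          = '\xFFFD' : go bs
    -- Read k continuation octets after a lead octet. On malformed input,
    -- emit U+FFFD and resume right after the lead octet.
    multi k lead bs =
      let (cs, rest) = splitAt k bs
          n = foldl step (fromIntegral lead) cs :: Int
      in if length cs == k && all isCont cs && n <= 0x10FFFF
           then chr n : go rest
           else '\xFFFD' : go bs
    isCont c   = c .&. 0xC0 == 0x80
    step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
```

The decoder has to look ahead across several octets before it can emit one `Char`, which is why the codecs operate on whole strings rather than on individual characters. A lone octet such as `0xE9` is valid Latin-1 but not valid UTF-8, so the same program argument can decode cleanly under one encoding and only to a replacement character under another.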