Changes between Version 7 and Version 8 of Unicode


Timestamp:
Dec 4, 2005 4:21:50 PM (10 years ago)
Author:
malcolm.wallace@…
Comment:

--

  • Unicode

    v7 v8  
    63 63 * A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
    64 64 * An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.
     65
     66== A Straw-Man Proposal ==
     67
     68 * '''Internal character representation.'''
     69   The Haskell type {{{Char}}} is UCS-4.
     70 * '''Haskell source encoding.'''
     71   * Introduce a pragma {{{{-# ENCODING e #-}}}} with a range of possible
     72     values of the encoding {{{e}}}.  If the pragma is present, it must be
     73     at the beginning of the file.  If it is not present, the file is
     74     encoded in Latin-1.  Note that even if the pragma is present, some
     75     heuristic may be needed even to get as far as interpreting the
     76     encoding declaration, as in
     77     [http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing XML].
     78     The fact that the first three characters must be {{{{-#}}} will be
     79     useful here.
     80   * A literal string may contain any literal character representable in the
     81     source encoding.  In addition, escapes are provided to permit the specification of
     82     ''any'' Unicode character (which may or may not be otherwise
     83     representable in the source encoding).
     84   * An identifier may contain any Unicode alphanumeric or symbol
     85     characters from a defined range.  Thus, a source text may not be
     86     representable in certain other encodings (especially in ASCII).
     87 * '''I/O.''' 
     88   All raw I/O is in terms of octets, i.e. {{{Word8}}}.
     89 * '''Conversions.'''
     90   Pure functions exist to convert between characters and octets in any particular encoding:
     91{{{
     92   stringDecode :: Encoding -> [Word8] -> [Char]
     93   stringEncode :: Encoding -> [Char] -> [Word8]
     94}}}
     95   The codecs must operate on strings, not individual characters, because some
     96   encodings use variable-length sequences of octets.
     97 * '''Efficiency.'''
     98   Semantically, character-based I/O is a simple composition of the raw
     99   I/O primitives with an encoding conversion function.  However, for
     100   efficiency, an implementation might choose to provide certain encoded
     101   I/O operations primitively.  If such primitives are exposed to the
     102   user, they should have standard names so that other implementations can
     103   provide the same functionality in pure Haskell Prime.
     104 * '''Locales.'''
     105   It may be possible to retain the traditional I/O signatures for
     106   {{{hGetChar}}}, {{{hPutChar}}}, {{{readFile}}}, {{{writeFile}}}, etc., but only by introducing
     107   a stateful notion of ''current encoding'' associated with each
     108   individual handle.  The default encoding could be inherited from the
     109   operating system environment, but it should also be possible to
     110   change the encoding explicitly.
     111{{{
     112   getIOEncoding :: Handle -> IO Encoding
     113   setIOEncoding :: Encoding -> Handle -> IO ()
     114   resetIOEncoding :: Handle -> IO ()  -- go back to default
     115}}}
     116 * '''Filenames, program arguments, environment.'''
     117   * Filenames are stored in Haskell as {{{[Char]}}}, but the operating
     118     system should receive {{{[Word8]}}} for any I/O using filenames.
     119     Some encoding conversion is therefore required.  Usually, this will
     120     be platform-dependent, and so the actual encoding may be hidden
     121     from the programmer as part of the default locale.
     122   * Program arguments, and symbols from the environment, are supplied
     123     by the operating system to the Haskell program as {{{[Word8]}}}.
     124     The program is responsible for conversion to {{{[Char]}}}.  Again,
     125     there may be a default encoding chosen based on the locale.
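
The conversion functions in the proposal can be sketched in plain Haskell as follows. This is only an illustration under stated assumptions: the proposal leaves {{{Encoding}}} abstract, so the two constructors {{{Latin1}}} and {{{UTF8}}} are hypothetical, as are the substitution policies ({{{?}}} for characters that Latin-1 cannot represent, U+FFFD for malformed UTF-8 input). The decoder is deliberately lenient and does not reject overlong forms or surrogate code points.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical encoding tag; the proposal does not define this type.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

-- Encode a whole string into octets.
stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    latin1 c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F -- '?': assumed policy for unrepresentable chars
stringEncode UTF8 = concatMap utf8
  where
    utf8 c
      | n < 0x80    = [fromIntegral n]
      | n < 0x800   = [0xC0 .|. hi 6, cont 0]
      | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        n = ord c
        hi, cont :: Int -> Word8
        hi s   = fromIntegral (n `shiftR` s)                    -- leading payload bits
        cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F) -- continuation octet

-- Decode a whole octet sequence into characters.  Working on sequences
-- rather than single octets is what makes variable-length encodings possible.
stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go
  where
    go [] = []
    go (b : bs)
      | b < 0x80           = chr (fromIntegral b) : go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs
      | otherwise          = '\xFFFD' : go bs -- stray continuation or invalid lead

    -- Consume k continuation octets after a lead octet, or emit U+FFFD
    -- and resume decoding just after the lead octet.
    multi :: Int -> Word8 -> [Word8] -> [Char]
    multi k lead bs =
      let (cs, rest) = splitAt k bs
      in if length cs == k && all isCont cs
           then safeChr (foldl step (fromIntegral lead) cs) : go rest
           else '\xFFFD' : go bs

    step :: Int -> Word8 -> Int
    step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80

    -- Guard against values beyond the Unicode range, where chr would fail.
    safeChr :: Int -> Char
    safeChr n
      | n > 0x10FFFF = '\xFFFD'
      | otherwise    = chr n
```

With these definitions, {{{stringDecode e . stringEncode e}}} is the identity on any string whose characters are representable in {{{e}}}. Character-based I/O in the sense of the efficiency item is then the composition of such a codec with the raw octet primitives.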