Changes between Version 4 and Version 5 of SourceEncodingDetection


Timestamp:
Aug 27, 2006 10:01:03 AM (9 years ago)
Author:
ross@…
Comment:

add refs, plus some grammatical pedantry

  • SourceEncodingDetection

    v4 v5  
    44== Brief Explanation ==
    55
    6 Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside ASCII range non-portable.
     6Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see UnicodeInHaskellSource).
    77
    8 This proposal outlines a detection heuristics that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
     8This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
    99
    1010This proposal does not cover user-specified source encoding.
    1111
     12== References ==
     13
     14 * [http://www.unicode.org/faq/utf_bom.html Unicode UTF and Byte Order Mark FAQ]
     15
    1216== Proposal ==
    1317
    14 This heuristics uses at most 4 bytes from the byte representation of Haskell source code.
     18This heuristic uses at most 4 bytes from the byte representation of Haskell source code.
    1519
    1620{{{
     
    5357}}}
    5458
    55 The heuristics has the following properties:
     59The heuristic has the following properties:
    5660 * Byte-order mark is optional on all three encodings.
    5761 * If present, byte-order-marks are consumed before lexical analysis.
    58  * Source code known to begin with the NULL chracter is disallowed.
     62 * Source code known to begin with the NULL character is disallowed.
    5963
    6064Furthermore, as long as the first logical characters in the program is
    61 under codepoint 0xFF (the "ASCII/Latin1" range), this heuristics can always
    62 gracefully handle two common class of text editor flaws:
     65under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
     66gracefully handle two common classes of text editor flaws:
    6367 * Emitting byte-order mark for UTF-8 text.
    6468 * Omitting byte-order mark for UTF-16 or UTF-32 text.
     
    7175 * Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
    7276 * Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion.
     77   (At present, this is mostly Latin-1.)
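The detection table itself is elided in this diff view, so the following is only a minimal sketch of a heuristic consistent with the properties listed under Proposal: it inspects at most 4 bytes, treats the byte-order mark as optional, reports how many BOM bytes to consume before lexing, and assumes the first character is non-NULL and below codepoint 0xFF. The names `Encoding` and `detect` are hypothetical and are not taken from the proposal.

```haskell
import Data.Word (Word8)

-- Hypothetical result type; the proposal only names UTF-8, UTF-16 and UTF-32.
data Encoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- | Guess the encoding from at most the first four bytes of the source.
-- Returns the guess and the number of byte-order-mark bytes to skip
-- before lexical analysis.
detect :: [Word8] -> (Encoding, Int)
detect bytes = case take 4 bytes of
  -- Explicit byte-order marks. The UTF-32 marks are tested first because
  -- FF FE 00 00 would otherwise read as a UTF-16LE mark followed by NULL,
  -- and source beginning with NULL is disallowed.
  (0x00:0x00:0xFE:0xFF:_) -> (UTF32BE, 4)
  (0xFF:0xFE:0x00:0x00:_) -> (UTF32LE, 4)
  (0xFE:0xFF:_)           -> (UTF16BE, 2)
  (0xFF:0xFE:_)           -> (UTF16LE, 2)
  (0xEF:0xBB:0xBF:_)      -> (UTF8, 3)
  -- No mark: assume the first character is non-NULL and below 0xFF, so
  -- the position of the zero bytes reveals the code unit width and order.
  (0x00:0x00:0x00:_:_)    -> (UTF32BE, 0)
  (_:0x00:0x00:0x00:_)    -> (UTF32LE, 0)
  (0x00:_:_)              -> (UTF16BE, 0)
  (_:0x00:_)              -> (UTF16LE, 0)
  -- Anything else is treated as UTF-8 without a mark.
  _                       -> (UTF8, 0)
```

Under these assumptions, both editor flaws mentioned above are tolerated: a UTF-8 file that carries a byte-order mark matches the `EF BB BF` case, and a UTF-16 or UTF-32 file without one is caught by the zero-byte patterns.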