Changes between Version 4 and Version 5 of SourceEncodingDetection


Timestamp:
Aug 27, 2006 10:01:03 AM
Author:
ross@…
Comment:

add refs, plus some grammatical pedantry

  • SourceEncodingDetection

    v4 v5  
    44== Brief Explanation == 
    55 
    6 Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside ASCII range non-portable. 
     6Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see UnicodeInHaskellSource). 
    77 
    8 This proposal outlines a detection heuristics that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input. 
     8This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input. 
    99 
    1010This proposal does not cover user-specified source encoding. 
    1111 
     12== References == 
     13 
     14 * [http://www.unicode.org/faq/utf_bom.html Unicode UTF and Byte Order Mark FAQ] 
     15 
    1216== Proposal == 
    1317 
    14 This heuristics uses at most 4 bytes from the byte representation of Haskell source code. 
     18This heuristic uses at most 4 bytes from the byte representation of Haskell source code. 
    1519 
    1620{{{ 
     
    5357}}} 
    5458 
    55 The heuristics has the following properties: 
     59The heuristic has the following properties: 
    5660 * Byte-order mark is optional on all three encodings. 
    5761 * If present, byte-order-marks are consumed before lexical analysis. 
    58  * Source code known to begin with the NULL chracter is disallowed. 
     62 * Source code known to begin with the NULL character is disallowed. 
    5963 
    6064Furthermore, as long as the first logical characters in the program is 
    61 under codepoint 0xFF (the "ASCII/Latin1" range), this heuristics can always 
    62 gracefully handle two common class of text editor flaws: 
     65under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always 
     66gracefully handle two common classes of text editor flaws: 
    6367 * Emitting byte-order mark for UTF-8 text. 
    6468 * Omitting byte-order mark for UTF-16 or UTF-32 text. 
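The properties listed above can be illustrated with a small sketch. The proposal's own decision table is collapsed in this diff, so what follows is not that table: it is a minimal Haskell approximation that assumes a conventional sniffing order (byte-order marks first, then the zero-byte pattern of a first character in the ASCII/Latin-1 range). The names `Encoding`, `detectEncoding` and `stripBOM` are hypothetical and chosen only for illustration.

```haskell
import Data.Word (Word8)

-- Hypothetical result type; the proposal itself names only UTF-8, UTF-16 and UTF-32.
data Encoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- | Guess the encoding by inspecting at most the first four bytes.
-- Returns the guessed encoding and the number of byte-order-mark bytes to skip.
-- Assumption: the first character is non-NUL and below code point 0xFF
-- whenever no byte-order mark is present.
detectEncoding :: [Word8] -> (Encoding, Int)
detectEncoding bytes = case take 4 bytes of
  -- Byte-order marks. The UTF-32LE mark must be tried before the UTF-16LE one,
  -- because both begin with FF FE.
  [0x00, 0x00, 0xFE, 0xFF]       -> (UTF32BE, 4)
  [0xFF, 0xFE, 0x00, 0x00]       -> (UTF32LE, 4)
  0xFE : 0xFF : _                -> (UTF16BE, 2)
  0xFF : 0xFE : _                -> (UTF16LE, 2)
  0xEF : 0xBB : 0xBF : _         -> (UTF8, 3)
  -- No mark: infer width and byte order from where the zero bytes fall.
  [0x00, 0x00, 0x00, c] | c /= 0 -> (UTF32BE, 0)
  [c, 0x00, 0x00, 0x00] | c /= 0 -> (UTF32LE, 0)
  0x00 : c : _          | c /= 0 -> (UTF16BE, 0)
  c : 0x00 : _          | c /= 0 -> (UTF16LE, 0)
  -- Everything else, including input too short to classify, is treated as UTF-8.
  _                              -> (UTF8, 0)

-- | Drop the byte-order mark, if any, so that it never reaches the lexer.
stripBOM :: [Word8] -> (Encoding, [Word8])
stripBOM bytes =
  let (enc, n) = detectEncoding bytes
  in  (enc, drop n bytes)
```

In this sketch, a file that starts with the letter `m` stored as the bytes `6D 00` is classified as little-endian UTF-16 even though it carries no byte-order mark, and a UTF-8 file that begins with `EF BB BF` has those three bytes removed before lexical analysis.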
     
    7175 * Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers. 
    7276 * Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion. 
     77   (At present, this is mostly Latin-1.)