Version 4 (modified by autrijus@…, 8 years ago)

sync back from Pugs's actual code; fix one typo.

Source Encoding Detection

Brief Explanation

Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8) or require the encoding to be signalled by out-of-band means, which makes Haskell source code containing characters outside the ASCII range non-portable.

This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.

This proposal does not cover user-specified source encoding.

Proposal

This heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.

import Data.Word

data EncodedSource
    = UTF8 [Word8]
    | UTF16 Endian [Word8]
    | UTF32 Endian [Word8]
 -- | UserDefined ...

data Endian = LittleEndian | BigEndian

detectSourceEncoding :: [Word8] -> EncodedSource
detectSourceEncoding bytes = case bytes of
    []                          -> UTF8 []
    [0x00]                      -> invalidNulls
    xs@[_]                      -> UTF8 xs
    [0xFF, 0xFE]                -> UTF16 LittleEndian []
    (0xFE:0xFF:xs)              -> UTF16 BigEndian xs
    [0x00, 0x00]                -> invalidNulls
    xs@[0x00, _]                -> UTF16 BigEndian xs
    xs@[_, 0x00]                -> UTF16 LittleEndian xs
    xs@[_, _]                   -> UTF8 xs
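    -- The UTF-8 BOM is matched before the 3-byte catch-all below, so that
    -- an input consisting of the BOM alone still has it consumed.
    (0xEF:0xBB:0xBF:xs)         -> UTF8 xs
    [0x00, 0x00, 0x00]          -> invalidNulls
    xs@[_, _, _]                -> UTF8 xs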
    (0x00:0x00:0xFE:0xFF:xs)    -> UTF32 BigEndian xs
    (0xFF:0xFE:0x00:0x00:xs)    -> UTF32 LittleEndian xs
    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
    (0x00:0x00:0x00:0x00:_)     -> invalidNulls
    xs@(0x00:0x00:0x00:_)       -> UTF32 BigEndian xs
    xs@(_:0x00:0x00:0x00:_)     -> UTF32 LittleEndian xs
    (0x00:0x00:_)               -> invalidNulls
    xs@(0x00:_)                 -> UTF16 BigEndian xs
    xs@(_:0x00:_)               -> UTF16 LittleEndian xs
    xs                          -> UTF8 xs
    where
    invalidNulls = error "(implementation-specific error message)"
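
Since detection depends only on the leading bytes, an implementation could peek at a short prefix before committing to a decoder. The following is a minimal sketch, assuming the `bytestring` package that ships with GHC; `peekPrefix` is a hypothetical helper name and is not part of the proposal.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import System.IO (IOMode (ReadMode), withBinaryFile)

-- | Hypothetical helper: read at most the first four bytes of a file.
-- Shorter files simply yield a shorter list.
peekPrefix :: FilePath -> IO [Word8]
peekPrefix path =
    withBinaryFile path ReadMode $ \h ->
        -- B.hGet is strict, so the bytes are read before the handle closes.
        B.unpack <$> B.hGet h 4
```

The prefix is only enough to choose the encoding; `detectSourceEncoding` itself still takes the complete byte list so that it can return the remaining source.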

The heuristic has the following properties:

  • A byte-order mark is optional in all three encodings.
  • If present, byte-order marks are consumed before lexical analysis.
  • Source code known to begin with the NULL character is disallowed.
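
The byte-order-mark handling described above can be illustrated in isolation. This is a simplified sketch, not the proposal's function: `boms` and `stripBom` are hypothetical names, and the encodings are identified by plain strings rather than by `EncodedSource`.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (listToMaybe)
import Data.Word (Word8)

-- | Byte-order marks, longest first, so that the UTF-32 LE mark
-- (FF FE 00 00) is tried before the UTF-16 LE mark (FF FE).
boms :: [(String, [Word8])]
boms =
    [ ("UTF-32BE", [0x00, 0x00, 0xFE, 0xFF])
    , ("UTF-32LE", [0xFF, 0xFE, 0x00, 0x00])
    , ("UTF-8",    [0xEF, 0xBB, 0xBF])
    , ("UTF-16BE", [0xFE, 0xFF])
    , ("UTF-16LE", [0xFF, 0xFE])
    ]

-- | If the input starts with a byte-order mark, return the encoding name
-- and the remaining bytes with the mark removed; otherwise Nothing.
stripBom :: [Word8] -> Maybe (String, [Word8])
stripBom bytes =
    listToMaybe
        [ (name, rest)
        | (name, bom) <- boms
        , Just rest <- [stripPrefix bom bytes]
        ]
```

For example, in GHCi `stripBom [0xEF, 0xBB, 0xBF, 0x6D]` evaluates to `Just ("UTF-8",[109])`, while input without a mark gives `Nothing` and would fall through to the null-byte checks.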

Furthermore, as long as the first logical character in the program is under code point 0xFF (the "ASCII/Latin1" range), this heuristic can always gracefully handle two common classes of text editor flaws:

  • Emitting byte-order mark for UTF-8 text.
  • Omitting byte-order mark for UTF-16 or UTF-32 text.
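
To see why a missing byte-order mark is recoverable, note that a leading character in the Latin-1 range leaves a recognisable pattern of zero bytes in UTF-16 and UTF-32. The sketch below is an illustration under stated assumptions: `encodeNarrow` and `guessByNulls` are hypothetical names, the encoder is only meaningful for non-NULL characters up to U+00FF, and the guesser ignores byte-order marks and the invalid all-NULL cases that the proposal rejects.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a string of characters up to U+00FF without a byte-order mark.
-- 'width' is the code-unit size in bytes: 1 (UTF-8/ASCII), 2 (UTF-16), 4 (UTF-32).
-- Assumption: callers pass only characters in that range.
encodeNarrow :: Int -> Bool -> String -> [Word8]
encodeNarrow width bigEndian = concatMap unit
  where
    unit c =
        let low = fromIntegral (ord c) :: Word8
            pad = replicate (width - 1) 0x00
        in if bigEndian then pad ++ [low] else low : pad

-- | Guess the encoding from where the zero bytes fall in the first four bytes.
guessByNulls :: [Word8] -> String
guessByNulls bytes = case take 4 bytes of
    [0, 0, 0, b] | b /= 0 -> "UTF-32BE"
    [b, 0, 0, 0] | b /= 0 -> "UTF-32LE"
    (0 : b : _)  | b /= 0 -> "UTF-16BE"
    (b : 0 : _)  | b /= 0 -> "UTF-16LE"
    _                     -> "UTF-8"
```

With a source file that starts with `module`, `guessByNulls (encodeNarrow 2 False "module")` gives `"UTF-16LE"` and `guessByNulls (encodeNarrow 4 True "module")` gives `"UTF-32BE"`, even though no mark is present.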

Pros

  • Ensures uniform treatment of Unicode in source code.
  • Disallows implicit ISO-8859-* encodings in source code, ensuring portability.

Cons

  • Mandating minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
  • Existing code relying on a non-UTF-8, locale- or implementation-specific encoding will need conversion.