
Source Encoding Detection

Brief Explanation

Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see UnicodeInHaskellSource).

This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.

This proposal does not cover user-specified source encoding.

Proposal

The heuristic examines at most the first 4 bytes of the byte representation of Haskell source code.

import Data.Word

data EncodedSource
    = UTF8 [Word8]
    | UTF16 Endian [Word8]
    | UTF32 Endian [Word8]
 -- | UserDefined ...

data Endian = LittleEndian | BigEndian

detectSourceEncoding :: [Word8] -> EncodedSource
detectSourceEncoding bytes = case bytes of
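    -- short inputs (fewer than four bytes) and the UTF-16BE BOM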
    []                          -> UTF8 []
    [0x00]                      -> invalidNulls
    xs@[_]                      -> UTF8 xs
    [0xFF, 0xFE]                -> UTF16 LittleEndian []
    (0xFE:0xFF:xs)              -> UTF16 BigEndian xs
    [0x00, 0x00]                -> invalidNulls
    xs@[0x00, _]                -> UTF16 BigEndian xs
    xs@[_, 0x00]                -> UTF16 LittleEndian xs
    xs@[_, _]                   -> UTF8 xs
    [0x00, 0x00, 0x00]          -> invalidNulls
    xs@[_, _, _]                -> UTF8 xs
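    -- byte-order marks on longer inputs; the UTF-32LE BOM must be
    -- checked before the UTF-16LE BOM, of which it is an extension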
    (0xEF:0xBB:0xBF:xs)         -> UTF8 xs
    (0x00:0x00:0xFE:0xFF:xs)    -> UTF32 BigEndian xs
    (0xFF:0xFE:0x00:0x00:xs)    -> UTF32 LittleEndian xs
    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
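    -- no BOM: infer the encoding from the positions of null bytes
    -- among the first four bytes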
    (0x00:0x00:0x00:0x00:_)     -> invalidNulls
    xs@(0x00:0x00:0x00:_)       -> UTF32 BigEndian xs
    xs@(_:0x00:0x00:0x00:_)     -> UTF32 LittleEndian xs
    (0x00:0x00:_)               -> invalidNulls
    xs@(0x00:_)                 -> UTF16 BigEndian xs
    xs@(_:0x00:_)               -> UTF16 LittleEndian xs
    xs                          -> UTF8 xs
    where
    invalidNulls = error "(implementation-specific error message)"

The heuristic has the following properties (illustrated by the sketch after this list):

  • A byte-order mark is optional in all three encodings.
  • If present, the byte-order mark is consumed before lexical analysis.
  • Source code known to begin with the NUL character is disallowed.
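
As a rough illustration of these properties (a sketch only: the name propertyExamples and the sample byte sequences are invented for this page; 0x6D and 0x61 are the ASCII codes of 'm' and 'a'), the following pairs some inputs with the value that detectSourceEncoding produces for them:

propertyExamples :: [([Word8], EncodedSource)]
propertyExamples =
    [ -- UTF-16BE with a BOM: the BOM bytes are consumed
      ( [0xFE, 0xFF, 0x00, 0x6D]
      , UTF16 BigEndian [0x00, 0x6D] )
      -- the same encoding without a BOM is still recognised,
      -- and nothing is consumed
    , ( [0x00, 0x6D, 0x00, 0x61]
      , UTF16 BigEndian [0x00, 0x6D, 0x00, 0x61] )
    ]

-- An input of four null bytes, i.e. a source file beginning with a NUL
-- character under any of the three encodings, is rejected:
--     detectSourceEncoding [0x00, 0x00, 0x00, 0x00]   -- invalidNulls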

Furthermore, as long as the first logical character of the program is no greater than codepoint 0xFF (the "ASCII/Latin-1" range), this heuristic always gracefully handles two common classes of text-editor flaw (see the sketch after this list):

  • Emitting a byte-order mark for UTF-8 text.
  • Omitting the byte-order mark for UTF-16 or UTF-32 text.
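
For instance (again a sketch assuming the module above; the names spuriousUTF8BOM and missingUTF16BOM are introduced here purely for illustration), a UTF-8 file to which an editor has added a BOM is still classified as UTF-8 with the BOM stripped, and a UTF-16LE file whose BOM was dropped is still recognised from the null high-order byte of its first character:

-- Flaw 1: a spurious BOM on UTF-8 text ("main" is 0x6D 0x61 0x69 0x6E
-- in ASCII). The BOM is stripped, so the lexer never sees U+FEFF.
spuriousUTF8BOM :: EncodedSource
spuriousUTF8BOM =
    detectSourceEncoding [0xEF, 0xBB, 0xBF, 0x6D, 0x61, 0x69, 0x6E]
    -- = UTF8 [0x6D, 0x61, 0x69, 0x6E]

-- Flaw 2: a missing BOM on UTF-16LE text ("ma"). The null high-order
-- byte of the first code unit gives the encoding away.
missingUTF16BOM :: EncodedSource
missingUTF16BOM =
    detectSourceEncoding [0x6D, 0x00, 0x61, 0x00]
    -- = UTF16 LittleEndian [0x6D, 0x00, 0x61, 0x00]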

Pros

  • Ensures uniform treatment of Unicode in source code.
  • Disallows implicit ISO-8859-* encodings in source code, ensuring portability.

Cons

  • Mandating minimum support for UTF-8 and UTF-16 places an implementation burden on compiler writers.
  • Existing code relying on a non-UTF-8, locale-/implementation-specific encoding will need conversion. (At present, this is mostly Latin-1.)