
Version 2 (modified by malcolm.wallace@…, 8 years ago)

--

Unicode

The Haskell'98 Report states that the language uses Unicode. The statement is somewhat vague, and as yet no implementation complies fully with Unicode. Some things we could do:

  • Remove any mention of Unicode, and revert to either plain ASCII, or implementation-defined character sets.
  • Make the specification clearer:
    • Should source code be permitted to use Unicode characters in identifiers etc.?
    • If not in identifiers, should Unicode be permitted in Strings?
    • What about the I/O primitives?
      • Do we read/write files as ASCII (see also BinaryIO), or as some Unicode format(s)?
      • How are FilePaths represented - ASCII or some Unicode format(s)?
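To make the file I/O question concrete, here is a minimal sketch of one possible answer: the program names the encoding per handle. It assumes GHC's System.IO, whose hSetEncoding, utf8 and utf16le are not part of Haskell'98; the helper names writeFileWith, readFileWith and fileSizeBytes are hypothetical and exist only for this illustration.

```haskell
import System.IO

-- | Write a String to a file using an explicitly chosen encoding.
-- (Hypothetical helper; relies on GHC's hSetEncoding.)
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s

-- | Read a whole file, decoding it with the given encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` hClose h   -- force the lazy read before closing the handle
  return s

-- | Size of a file in bytes, independent of any text encoding.
fileSizeBytes :: FilePath -> IO Integer
fileSizeBytes path = withBinaryFile path ReadMode hFileSize
```

With these helpers, the same one-character String (the euro sign, code point 0x20AC) occupies three bytes on disk when written with utf8 and two bytes when written with utf16le, which is exactly why the Report would need to say which encoding the I/O primitives use.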

It might be helpful to outline some of the background to Unicode itself, because I find many people get confused by it.

  • Unicode assigns a numeric code point in the range 0x0 to 0x10FFFF (which fits in 21 bits) to each abstract character. It deals in characters rather than glyphs; the glyphs are a matter for fonts. The mapping is (in theory at least) one-to-one, but it is not onto: many code points are unassigned or reserved.
  • For reasons of backwards compatibility and space efficiency, there are several encodings of the code points themselves into byte streams, most of them variable-length.
    • UTF-8 ensures that the ASCII characters retain their traditional coding in the bottom 7 bits of a single byte. Every other character is coded as a lead byte followed by one to three continuation bytes, all with the top bit set; the lead byte indicates how many continuation bytes follow. Thus, non-ASCII characters are two to four bytes long.
    • UTF-16 codes each character in the range 0x0 to 0xFFFF as a single 16-bit unit. Unfortunately this does not cover the entire code space, so the remaining characters (0x10000 and up) are coded as a 'surrogate pair': a unit in the range 0xD800 to 0xDBFF followed by one in the range 0xDC00 to 0xDFFF, both ranges being reserved for this purpose. So although most characters fit in a single 16-bit field, some must be coded as two successive fields.
    • UCS-4 (also known as UTF-32) uses a full 32-bit word per code point.
    • To make things more exciting, the UTF-16 and UCS-4 encodings each have two variations, big-endian and little-endian, reflecting the byte order of the machine they were originally written on. So if you read a raw byte-stream and want to convert it to 16-bit chunks, you first need to work out the byte-ordering. This is often done by reading a few bytes and then looking up a heuristic table, although there is also a 'byte-order mark' (code point 0xFEFF), a non-printing character which may or may not be present at the start of the stream.
  • Since unix-like systems traditionally deal with byte-streams, UTF-8 is the most common encoding on those platforms.
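  • The NTFS file system (Windows) stores filenames as sequences of 16-bit units, conventionally UTF-16. File contents are uninterpreted bytes, although Windows applications may write text as UTF-16.
  • Almost no system stores UCS-4 natively, but there is a C library type 'wchar_t' (wide character), which has 32 bits on most unix-like systems and 16 bits on Windows.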
  • But of course any system must be able to read/write files that originated on any other platform.
  • As an example of the complex heuristics needed to guess the encoding of any particular file, see the XML standard (Appendix F, 'Autodetection of Character Encodings').
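The encodings described above can be sketched in a few lines of Haskell. This is an illustration under stated assumptions rather than a conforming implementation: the names encodeUtf8, encodeUtf16, Encoding and detectBom are hypothetical, the encoders do not reject the reserved surrogate code points, and detectBom only recognises the byte-order mark, ignoring the heuristics needed when no mark is present.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- | Encode one character as UTF-8: one byte for ASCII, otherwise a
-- lead byte followed by one to three continuation bytes.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6,  cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6,  cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n          = ord c
    -- The lead byte carries a length tag plus the topmost payload bits.
    lead tag s = fromIntegral (tag .|. (n `shiftR` s))
    -- Each continuation byte is 10xxxxxx with six payload bits.
    cont s     = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- | Encode one character as UTF-16: a single 16-bit unit below
-- 0x10000, otherwise a surrogate pair.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))   -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]   -- low surrogate
  where
    n = ord c
    m = n - 0x10000

-- | The result of looking for a byte-order mark (code point 0xFEFF).
data Encoding = UTF16BE | UTF16LE | UCS4BE | UCS4LE | Unknown
  deriving (Eq, Show)

-- | Inspect the first bytes of a stream for a byte-order mark.
-- The UCS-4 patterns are tried first; note that FF FE 00 00 could
-- also be a little-endian UTF-16 mark followed by a NUL character,
-- so even this step involves a guess.
detectBom :: [Word8] -> Encoding
detectBom (0x00:0x00:0xFE:0xFF:_) = UCS4BE
detectBom (0xFF:0xFE:0x00:0x00:_) = UCS4LE
detectBom (0xFE:0xFF:_)           = UTF16BE
detectBom (0xFF:0xFE:_)           = UTF16LE
detectBom _                       = Unknown
```

For example, the euro sign gives encodeUtf8 '\x20AC' == [0xE2, 0x82, 0xAC] and a single UTF-16 unit 0x20AC, whereas the musical G clef (0x1D11E) needs four UTF-8 bytes and the surrogate pair [0xD834, 0xDD1E] in UTF-16.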