Version 3 (modified by autrijus@…, 8 years ago)

Add the SourceEncodingDetection proposal link.

Unicode in Haskell source

The Haskell 98 Report (Lexical Structure) claims that Haskell source code uses the Unicode character set.

Current support for Unicode in source files

Haskell source code is stored in text files using various character sets and encodings.

  • Jhc allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8. Several uses of Unicode characters in place of Haskell keywords are permitted:
    • '→' ('\x2192') is equivalent to '->'
    • '←' ('\x2190') is equivalent to '<-'
    • '∷' ('\x2237') is equivalent to '::'
    • '‥' ('\x2025') is equivalent to '..'
    • '⇒' ('\x21d2') is equivalent to '=>'
    • '∀' ('\x2200') is equivalent to 'forall'
    • '∃' ('\x2203') is equivalent to 'exists' (see ExistentialQuantification)
    In addition there is experimental support for defining new operators and names using various Unicode characters.
  • Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
  • GHC now (as of early Jan 2006) interprets source files as UTF-8. In -fglasgow-exts mode the special symbols listed above are interpreted as in Jhc, and additionally the lambda symbol 'λ' is interpreted as lambda. GHC knows the character classification of every Unicode character via the Data.Char library, and can therefore understand identifiers written using alphanumeric characters from any language (but see below for a note about caseless character sets).
  • Others treat source code as ISO 8859-1 (Latin-1).
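
As an illustration of the symbol equivalences listed for Jhc, the table can be written down directly in Haskell. This is only a sketch: the names unicodeSyntax and toAsciiSyntax are hypothetical, and the rewrite works on raw characters rather than tokens, which is not how any of the compilers above handle these symbols.

```haskell
import Data.Maybe (fromMaybe)

-- The Unicode alternatives listed above, paired with their ASCII spellings.
unicodeSyntax :: [(Char, String)]
unicodeSyntax =
  [ ('\x2192', "->")      -- →
  , ('\x2190', "<-")      -- ←
  , ('\x2237', "::")      -- ∷
  , ('\x2025', "..")      -- ‥
  , ('\x21d2', "=>")      -- ⇒
  , ('\x2200', "forall")  -- ∀
  , ('\x2203', "exists")  -- ∃
  ]

-- Replace each listed symbol by its ASCII spelling and leave every other
-- character alone. Hypothetical helper: it is not token-aware, so it also
-- rewrites symbols inside string literals and comments, and "∀a" would
-- become "foralla" because no space is inserted.
toAsciiSyntax :: String -> String
toAsciiSyntax = concatMap (\c -> fromMaybe [c] (lookup c unicodeSyntax))
```

A real implementation would make this substitution in the lexer, where the token boundaries are known.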

Problems with Unicode in Haskell 98

There are plenty of Unicode alphabetic characters which are neither upper, lower, nor title case, and hence are not allowed in identifiers. Some languages have no notion of case at all. Since Haskell's syntax relies on case for distinguishing constructors and variables, what should our position be with respect to caseless character sets?

The report should at least be absolutely clear about which Unicode character properties (N, Ll, Lu, Sm, etc.) correspond to which lexical class in the syntax.
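
One way to make such a correspondence explicit is a function from Unicode general categories to lexical classes. The sketch below is one literal reading of the Haskell 98 productions uniSmall, uniLarge, uniDigit and uniSymbol, built on Data.Char.generalCategory; the type LexClass and the function lexClass are hypothetical names, and the ASCII special cases (for example '_' counting as a lowercase letter) are deliberately ignored.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Lexical classes named after the Haskell 98 productions.
data LexClass = UniSmall | UniLarge | UniDigit | UniSymbol | Unclassified
  deriving (Eq, Show)

-- Assumed mapping: Ll -> uniSmall; Lu, Lt -> uniLarge; Nd -> uniDigit;
-- every symbol (S*) and punctuation (P*) category -> uniSymbol.
lexClass :: Char -> LexClass
lexClass c = case generalCategory c of
  LowercaseLetter      -> UniSmall
  UppercaseLetter      -> UniLarge
  TitlecaseLetter      -> UniLarge
  DecimalNumber        -> UniDigit
  MathSymbol           -> UniSymbol
  CurrencySymbol       -> UniSymbol
  ModifierSymbol       -> UniSymbol
  OtherSymbol          -> UniSymbol
  ConnectorPunctuation -> UniSymbol
  DashPunctuation      -> UniSymbol
  OpenPunctuation      -> UniSymbol
  ClosePunctuation     -> UniSymbol
  InitialQuote         -> UniSymbol
  FinalQuote           -> UniSymbol
  OtherPunctuation     -> UniSymbol
  -- Caseless letters (Lo), modifier letters (Lm), marks and so on
  -- land here: they belong to no class usable in an identifier.
  _                    -> Unclassified
```

Under this mapping a caseless letter such as 'あ' (category Lo) is Unclassified, which is exactly the problem described above.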

Some things we could do

  • Allow Unicode, with detection for the two common encodings (UTF-8 and UTF-16). See SourceEncodingDetection for a proposal.
  • Revert to US-ASCII, Latin-1 or implementation-defined character sets.
  • Allow Unicode with the encoding specified outside source files (e.g. by the current locale, as currently done by Hugs). This would make Haskell source containing non-ASCII characters non-portable.
  • Allow Unicode, with a mechanism for specifying encoding in the source file, e.g.
    • Introduce a pragma {-# ENCODING e #-} with a range of possible values of the encoding e (cf. IANA character sets). If the pragma is present, it must be at the beginning of the file. If it is not present, the file is encoded in US-ASCII. Note that even if the pragma is present, some heuristic may be needed just to get as far as interpreting the encoding declaration, as in XML. The fact that the first three characters must be {-# will be useful here. Haskell implementations must support at least the encodings US-ASCII, ISO-8859-1, and UTF-8.
  • Allow Unicode, defining a portable form (the \uNNNN escapes in Haskell 1.4 were an attempt at this).
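
The heuristic mentioned for the ENCODING pragma can be sketched as a check on the first few bytes of the file, in the style of XML encoding sniffing. This is not the SourceEncodingDetection proposal itself; the type Guess and the function guessEncoding are hypothetical, and the sketch assumes the file starts either with a byte order mark or with the three characters {-#.

```haskell
import Data.List (isPrefixOf)
import Data.Word (Word8)

-- Coarse guess about how the file is encoded, made before any decoding.
data Guess = Utf8 | Utf16BE | Utf16LE | AsciiCompatible | Unknown
  deriving (Eq, Show)

guessEncoding :: [Word8] -> Guess
guessEncoding bytes
  -- Byte order marks take priority.
  | [0xEF, 0xBB, 0xBF] `isPrefixOf` bytes = Utf8
  | [0xFE, 0xFF] `isPrefixOf` bytes = Utf16BE
  | [0xFF, 0xFE] `isPrefixOf` bytes = Utf16LE
  -- No BOM: look for "{-#" ('{' = 0x7B, '-' = 0x2D, '#' = 0x23)
  -- in 16-bit units, big-endian then little-endian.
  | [0x00, 0x7B, 0x00, 0x2D, 0x00, 0x23] `isPrefixOf` bytes = Utf16BE
  | [0x7B, 0x00, 0x2D, 0x00, 0x23, 0x00] `isPrefixOf` bytes = Utf16LE
  -- "{-#" as single bytes: US-ASCII, ISO-8859-1 and UTF-8 all agree here,
  -- so the pragma text can be read as ASCII to find the declared encoding.
  | [0x7B, 0x2D, 0x23] `isPrefixOf` bytes = AsciiCompatible
  | otherwise = Unknown
```

After an AsciiCompatible result, the rest of the pragma would still have to be parsed to choose between the single-byte-compatible encodings.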

If Unicode is allowed, should its use be restricted?

  • Haskell 98 already has character escapes for arbitrary Unicode characters in character and string literals. Thus Unicode in these literals can always be transformed into a portable form.
  • Haskell 98 permits upper, title and lower case alphabetic characters (but not other alphabetic characters) in identifiers, and symbol or punctuation characters in symbols. Thus a source text may not be representable in all encodings (especially ASCII).
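
The first point can be seen with the standard Show instance for String, which already writes every non-ASCII character as a numeric escape. A minimal sketch, with portableLiteral and isPortable as hypothetical names:

```haskell
import Data.Char (isAscii)

-- Render a string as a Haskell string literal. show escapes every
-- character above '\DEL' as a decimal escape, e.g. λ becomes \955.
portableLiteral :: String -> String
portableLiteral = show

-- True when the text can be stored in a plain US-ASCII file.
isPortable :: String -> Bool
isPortable = all isAscii
```

Reading the literal back with read gives the original string, so nothing is lost in the transformation. No comparable escape exists for identifiers and symbols, which is why the second point is a real restriction.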

It is not reasonable to display all Unicode characters with the same width, but the Haskell 98 Report (Layout) says:

For the purposes of the layout rule, Unicode characters in a source program are considered to be of the same, fixed, width as an ASCII character. However, to avoid visual confusion, programmers should avoid writing programs in which the meaning of implicit layout depends on the width of non-space characters.

Is this adequate?
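
To make the question concrete, the rule quoted above amounts to the following column computation. It is a sketch with a hypothetical function name, columnAfter, and it assumes the Haskell 98 conventions that the first column is column 1 and tab stops are 8 characters apart.

```haskell
-- Column at which the next character would start, given the text that
-- precedes it on the line. Every character other than a tab advances the
-- column by exactly one, whatever width an editor uses to display it.
columnAfter :: String -> Int
columnAfter = foldl step 1
  where
    step col '\t' = nextTabStop col
    step col _    = col + 1
    -- Tab stops sit at columns 1, 9, 17, ...
    nextTabStop col = ((col - 1) `div` 8 + 1) * 8 + 1
```

Under this rule the prefixes x = "漢字" and x = "ab" end at the same column, although most terminals draw the first one two cells wider.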