Opened 7 years ago

Last modified 11 hours ago

#5518 patch bug

Some Unicode symbols are not allowed in character or string literals

Reported by: ertai
Owned by: ulysses4ever
Priority: normal
Milestone:
Component: Compiler
Version:
Keywords:
Cc:
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s): Phab:D5066
Wiki Page:

Description

main = putChar 'ₖ'

This program is rejected with the following error message: lexical error in string/character literal at character '\8342'

There are at least a few other characters with the same issue; for instance, this whole string should be accepted: "ₕₖₗₘₙₒₚᵣₛₜᵤᵥₓ"

A related issue is that GHCi does not let me paste these characters either.

Attachments (1)

q.hs (23 bytes) - added by ertai 7 years ago.


Change History (12)

comment:1 Changed 7 years ago by judahj

GHC requires that source files be encoded in UTF-8. Can you please check whether that's the case for your program? If you're not sure or if that didn't fix the problem, can you please attach the bad program to this ticket?

For ghci: What terminal are you using (e.g. xterm, urxvt, etc.)? Also, please let us know the results of running these commands in that terminal:

echo $TERM
echo $LANG

comment:2 Changed 7 years ago by igloo

Status: new → infoneeded

It works for me:

$ hexdump -C q.hs
00000000  0a 6d 61 69 6e 20 3d 20  70 75 74 43 68 61 72 20  |.main = putChar |
00000010  27 e2 82 96 27 0a 0a                              |'...'..|
00000017
$ ghc -c q.hs
$
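To relate the hexdump to the error message: the bytes e2 82 96 are the UTF-8 encoding of U+2096, and 0x2096 is 8342 in decimal, which is the number GHC prints as '\8342'. The following is only a sketch of that arithmetic; `utf8Bytes` is a hypothetical helper written for this illustration, not a function from GHC or base.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- | Encode one character as UTF-8 bytes.
-- Hypothetical helper for illustration; it does not reject surrogates.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- payload bits carried by the leading byte
    hi s   = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx with six payload bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  -- Written as an escape so that older GHCs also accept this file.
  let c = '\8342'
  putStrLn (showHex (ord c) "")                         -- prints "2096"
  putStrLn (unwords [showHex b "" | b <- utf8Bytes c])  -- prints "e2 82 96"
```

The second line of output matches the three bytes between the quote characters (27) in the hexdump.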

Changed 7 years ago by ertai

Attachment: q.hs added

comment:3 Changed 7 years ago by ertai

Version: 7.2.1 → 7.0.3

I reproduced the same file as igloo and I get the same hexdump output.

However ghc -c q.hs yields:

q.hs:2:17:

lexical error in string/character literal at character '\8342'

(The GHC version I use is actually 7.0.3; I have updated the ticket info.)

$ echo $TERM
rxvt-unicode-256color
$ echo $LANG
en_US.UTF-8

comment:4 Changed 7 years ago by judahj

I could reproduce the issue with ghc-7.0.3 and ghc-7.0.4.

I looked into this since it seemed to be affecting Haskeline too. The cause (for both problems) was that older versions of GHC support an older version of Unicode:

$ ghc-7.0.3 -e "Data.Char.generalCategory '\8342'"
NotAssigned
$ ghc-7.0.4 -e "Data.Char.generalCategory '\8342'"
NotAssigned
$ ghc-7.2.1 -e "Data.Char.generalCategory '\8342'"
ModifierLetter

So if you want to use those characters, you will probably need to upgrade to ghc-7.2.1.
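The same check can be run from a small program rather than with `ghc -e`. This is a minimal sketch using only `Data.Char` from base; the names `unassignedIn` and `describe` are made up for this example, and what it prints depends on the Unicode tables of the GHC that compiles it.

```haskell
import Data.Char (GeneralCategory (NotAssigned), generalCategory, ord)
import Numeric (showHex)

-- | Characters that the running GHC's Unicode tables treat as unassigned.
-- (Hypothetical helper name, for illustration only.)
unassignedIn :: String -> String
unassignedIn = filter ((== NotAssigned) . generalCategory)

-- | One report line per character: code point (hex and decimal) and category.
describe :: Char -> String
describe c =
  "U+" ++ showHex (ord c) "" ++ " (" ++ show (ord c) ++ "): "
    ++ show (generalCategory c)

main :: IO ()
main =
  -- The character is written as a numeric escape so that the file can be
  -- compiled even by a GHC whose tables do not know it yet.
  mapM_ (putStrLn . describe) "\8342"
```

On a GHC with old tables the category should come out as NotAssigned, as in the ghc-7.0.3 run above; add further escapes to the string to survey other characters.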

comment:5 Changed 7 years ago by ertai

Ok, thank you.

comment:6 Changed 7 years ago by igloo

Resolution: fixed
Status: infoneeded → closed

Yup, I can also reproduce it with 7.0.2 but not 7.2.1.

comment:7 Changed 29 hours ago by joeyh

Similarly, with ghc 8.2.2 (debian), this is not accepted:

main = putChar '🥖'

That's U+1F956 baguette. ghc says:

lexical error in string/character literal at character '\129366'

My system is fully utf-8 enabled and the original problem character works ok.

I guess this is just lag in getting the Unicode character tables updated. However, while it seems reasonable for ghc to not let me define a function, e.g.

(🥖) = (</>)

since it doesn't know what kind of symbol baguette is, it seems much less reasonable to not accept any unicode inside a string.
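A possible workaround until the tables are updated, offered as a sketch rather than a confirmed fix: write the character as a numeric escape. This assumes the lexer rejects only the raw character inside the literal and not an escape naming the same code point, which is consistent with the `ghc-7.0.3 -e "... '\8342'"` run in comment:4 being accepted. The name `baguette` is just for illustration.

```haskell
-- Assumption: a numeric escape is accepted even when the raw character is
-- rejected by the lexer of the GHC in use.
baguette :: String
baguette = "\x1F956"  -- same code point as the decimal escape "\129366"

main :: IO ()
main = putStrLn baguette
```

Whether the glyph is displayed correctly still depends on the terminal and locale encoding.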

comment:8 Changed 13 hours ago by ulysses4ever

I can confirm this for 8.4.3 and HEAD.

comment:9 Changed 11 hours ago by ulysses4ever

Differential Rev(s): Phab:D5066
Resolution: fixed
Status: closed → new
Version: 7.0.3

I renewed the Unicode tables as described here, and this fixed the issue. Merge?

comment:10 Changed 11 hours ago by ulysses4ever

Owner: set to ulysses4ever

comment:11 Changed 11 hours ago by ulysses4ever

Status: new → patch