Opened 4 years ago

Closed 3 years ago

Last modified 3 years ago

#10196 closed bug (fixed)

Regression regarding Unicode subscript characters in identifiers

Reported by: thomie Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler (Parser) Version: 7.10.1
Keywords: unicode, report-impact Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: GHC rejects valid program Test Case: parser/should_compile/T10196, parser/should_fail/T10196Fail1, parser/should_fail/T10196Fail2
Blocked By: Blocking:
Related Tickets: #5108 Differential Rev(s): Phab:D969
Wiki Page:

Description (last modified by hvr)

As reported by both hvr as user Yongqian Li:

The Unicode 7.0 update in GHC 7.10 had the side effect of breaking code making use of subscript symbols that did compile with GHC 7.8.4, but won't anymore with GHC 7.10.1:

For instance, GHCi 7.8.4 accepts

   let xᵦ = 1
   let xᵤ = 1
   let xᵩ = 1
   let xᵢ = 1
   let xᵪ = 1
   let xᵣ = 1
   let x = 1

whereas GHC 7.10.1RC fails parsing those with a lexical error. (NB: GHC 7.8 does not accept all latin subscript letters either).

Change History (19)

comment:1 Changed 4 years ago by thomie

I think Simon's suggested change in #5108 would fix this:

"allow the category Lm (MODIFIER LETTER) as part of an identifier? That would include all the primes and subscript/superscript things."

comment:2 Changed 4 years ago by hvr

Description: modified (diff)

Minor markup improvement in ticket-description

comment:3 Changed 4 years ago by thomie

But perhaps don't allow a "MODIFIER LETTER" as the first character of an identifier.

We're unfortunately outside of the report territory here.

comment:4 Changed 4 years ago by yongqli

@Thomas Miedema,

Yes, subscript characters only starting from the second character on is fine for me.

comment:5 Changed 4 years ago by hvr

Owner: set to hvr

We're planning to allow Lm from the 2nd character on in an identifier for 7.10.2

comment:6 Changed 3 years ago by thoughtpolice

Owner: changed from hvr to thoughtpolice

comment:7 Changed 3 years ago by thoughtpolice

Differential Rev(s): Phab:D969
Status: newpatch

comment:8 Changed 3 years ago by thoughtpolice

Milestone: 7.10.27.10.3

I think we're going to end up punting this to 7.10.3 at least, because the current patch has some regressions, and this is ultimately fairly minor.

comment:9 in reply to:  5 Changed 3 years ago by thomie

Replying to hvr:

We're planning to allow Lm from the 2nd character on in an identifier for 7.10.2

The current patch does exactly this. It still needs a changelog entry.

I would have preferred to only allow Lm in the suffix of an identifier. But we can leave that for 7.12 or later, as there is a slight chance it breaks someone's code. We could mention it in the docs.

There's also the issue that ModifierLetter perhaps brings in too many weird characters:

"15-06-18T11:46:27"<             hvr@> thomie: can we easily list all modifier letters in Haskell?
"15-06-18T11:47:55"<             hvr@>  [ c | c <- ['\0'..],  generalCategory c == ModifierLetter ]
"15-06-18T11:47:56"<             hvr@> got it
"15-06-18T11:48:56"<             hvr@> ok, there's a lot in there one doesn't want to allow in identifiers :-/
"15-06-18T11:49:31"<          thomie > booh
"15-06-18T11:49:50"<             hvr@> these look nasty: 
"15-06-18T11:50:46"<             hvr@> so many column variants, theres also "ː"

hvr: do you think this a big enough issue to not proceed with the current patch?

Last edited 3 years ago by thomie (previous) (diff)

comment:10 Changed 3 years ago by Ben Gamari <ben@…>

In 6b01d3ce6c681428e7a9865af85685c2a76ba657/ghc:

parser: Allow Lm (MODIFIER LETTER) category in identifiers

Easy fix in the parser to stop regressions, due to Unicode 7.0 changing
the classification of some prior code points.

Signed-off-by: Austin Seipp <austin@well-typed.com>

Test Plan: `tests/parser/should_compile/T10196.hs`

Reviewers: hvr, austin, bgamari

Reviewed By: austin, bgamari

Subscribers: thomie, bgamari

Differential Revision: https://phabricator.haskell.org/D969

GHC Trac Issues: #10196

comment:11 Changed 3 years ago by bgamari

Resolution: fixed
Status: patchclosed

comment:12 in reply to:  10 Changed 3 years ago by hvr

Replying to Ben Gamari <ben@…>:

parser: Allow Lm (MODIFIER LETTER) category in identifiers

Easy fix in the parser to stop regressions, due to Unicode 7.0 changing the classification of some prior code points.

nitpick: the way the commit message is worded (as well the comments in this ticket) suggests that e.g. xᵦx is now a valid identifier... which it isn't...

comment:13 Changed 3 years ago by hvr

Owner: thoughtpolice deleted
Resolution: fixed
Status: closednew

comment:14 Changed 3 years ago by hvr

I'm reopening this temporarily, because GHC 7.8.4 does in fact accept e.g.

λ:6> let xᵦx = ()
xᵦx :: ()

PS: turns out, the current patch does allow this; but we may want to remove the misleading TODOs and expect_broken() as the patch is fine as it is.

Last edited 3 years ago by hvr (previous) (diff)

comment:15 Changed 3 years ago by thomie

Milestone: 7.10.37.10.2
Status: newmerge

Please merge.

comment:16 Changed 3 years ago by bgamari

Resolution: fixed
Status: mergeclosed

This has been merged to ghc-7.10 as 358e0a8d4cb49baa29cf6b001eaa9d4ac428bb2d.

comment:17 Changed 3 years ago by nomeata

Resolution: fixed
Status: closednew

Sorry, to bring this up late, but the report specifies “Haskell compilers are expected to make use of new versions of Unicode as they are made available.” So if we deviate from that, we should make sure that

This is also important for, e.g. #11012.

comment:18 Changed 3 years ago by thomie

Keywords: unicode report-impact added
Resolution: fixed
Status: newclosed
Test Case: parser/should_compile/T10196, parser/should_fail/T10196Fail1, parser/should_fail/T10196Fail2

So if we deviate

I've opened #11609 for that.

comment:19 Changed 3 years ago by Thomas Miedema <thomasmiedema@…>

In d738e66/ghc:

Modifier letter in middle of identifier is ok

Refactoring only. Cleanup some loose ends from #10196.

Initially the idea was to only allow modifier letters at the end of
identifiers. Since we later decided to allow modifier letters also in
the middle of identifiers (because not doing so would not fix the
regression completely), the names `suffix` and `okIdSuffixChar` don't
seem appropriate anymore.

Remove TODO. Move test from should_fail to should_compile.
Note: See TracTickets for help on using tickets.