Opened 3 years ago

Last modified 19 months ago

#5108 new feature request

Allow unicode sub/superscript symbols in both identifiers and operators

Reported by: mikhail.vorozhtsov Owned by:
Priority: normal Milestone: 7.6.2
Component: Compiler (Parser) Version: 7.1
Keywords: lexer unicode Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

While #4373 permits

Prelude> let v₁ = 1

the following is rejected

Prelude> let m >>=₁ f = undefined

<interactive>:0:10: lexical error at character '\8321'

Identifiers with non-numeric subscripts are not accepted either:

Prelude> let vₐ = 1

<interactive>:0:6: lexical error at character '\8336'

I wrote a small patch that makes such definitions possible.

  1. A new unicode Alex macro, $subsup, is introduced and added to $idchar, $symchar, and $graphic
  2. A unicode code point is classified as $subsup by alexGetChar iff either of the following holds:
    1. The code point is annotated with <sub> or <super> in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
    2. It is the [DOUBLE/TRIPLE/QUADRUPLE] PRIME (U+2032, U+2033, U+2034, U+2057)
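
The classification rule above can be sketched in Haskell as a check on a single UnicodeData.txt record. This is only an illustration, not the code in the attached patch: the names `fields`, `parseCodePoint`, `isPrime`, and `isSubSupRecord` are hypothetical, and it assumes the standard record layout in which field 0 is the hexadecimal code point and field 5 is the decomposition mapping.

```haskell
import Data.Char (chr)
import Data.List (isPrefixOf)
import Numeric (readHex)

-- Split one UnicodeData.txt record into its ';'-separated fields.
fields :: String -> [String]
fields s = case break (== ';') s of
  (f, [])       -> [f]
  (f, _ : rest) -> f : fields rest

-- Parse the hexadecimal code point stored in field 0.
parseCodePoint :: String -> Maybe Char
parseCodePoint str = case readHex str of
  [(n, "")] | n <= 0x10FFFF -> Just (chr n)
  _                         -> Nothing

-- Rule 2.2: PRIME, DOUBLE PRIME, TRIPLE PRIME, QUADRUPLE PRIME.
isPrime :: Char -> Bool
isPrime c = c `elem` ['\x2032', '\x2033', '\x2034', '\x2057']

-- Rule 2.1 or rule 2.2, applied to one record (hypothetical helper).
isSubSupRecord :: String -> Bool
isSubSupRecord line = case fields line of
  (cp : _ : _ : _ : _ : decomp : _) ->
    any (`isPrefixOf` decomp) ["<sub>", "<super>"]
      || maybe False isPrime (parseCodePoint cp)
  _ -> False
```

Running `filter isSubSupRecord . lines` over the contents of UnicodeData.txt would then list the records that the proposal treats as $subsup. A real lexer would more likely use a precomputed table than parse the file.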

Attachments (1)

unicode-subsupscripts.patch (5.4 KB) - added by mikhail.vorozhtsov 2 years ago.


Change History (10)

comment:1 Changed 3 years ago by mikhail.vorozhtsov

  • Status changed from new to patch

comment:2 Changed 3 years ago by igloo

  • Component changed from Compiler to Compiler (Parser)
  • Milestone set to 7.4.1

comment:3 Changed 3 years ago by mikhail.vorozhtsov

rebased

Changed 2 years ago by mikhail.vorozhtsov

comment:4 follow-up: Changed 2 years ago by simonmar

  • Difficulty set to Unknown

I'm not keen on this patch for a few reasons:

  • It's inconsistent to allow superscript/subscript on symbols. Haskell doesn't currently allow primes on symbols, for example.
  • The patch has a bunch of Unicode constants baked into it
  • It adds a bunch of extra tests to the inner loop. I haven't measured it but I wouldn't be surprised if this slows down the lexer.

Perhaps it might be better just to allow the category Lm (MODIFIER LETTER) as part of an identifier? That would include all the primes and subscript/superscript things.

comment:5 in reply to: ↑ 4 Changed 2 years ago by mikhail.vorozhtsov

Replying to simonmar:

I'm not keen on this patch for a few reasons:

  • It's inconsistent to allow superscript/subscript on symbols. Haskell doesn't currently allow primes on symbols, for example.

In fact, GHC already allows unicode primes on symbols. alexGetByte classifies OtherPunctuation characters (including primes) as $unisymbol.

$ ghci
GHCi, version 7.2.2: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
λ> let a +′ b = a + b

The patch just makes sure that primes at least do not appear at the start of a @varsym. We can further restrict sub/sup characters to appear only in the suffix of a symbol, i.e. @varsym = $symbol $symchar* $subsup*.
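
As a rough sketch of the suffix-only restriction, the shape `$symbol $symchar* $subsup*` can be checked with a small predicate. The function name `matchesSuffixRule` is hypothetical. For brevity it uses one predicate for both $symbol and $symchar, and it assumes that predicate and the $subsup predicate do not overlap. Both are passed in as arguments so that no particular character classification is implied.

```haskell
-- Accepts strings of the form: one symbol character, then any number of
-- symbol characters, then any number of sub/superscript characters.
matchesSuffixRule
  :: (Char -> Bool)  -- stands in for $symbol / $symchar (assumed disjoint from $subsup)
  -> (Char -> Bool)  -- stands in for $subsup
  -> String
  -> Bool
matchesSuffixRule isSym isSubSup s = case s of
  (c : rest)
    | isSym c ->
        -- Drop the remaining run of symbol characters; whatever is left
        -- must consist solely of sub/superscript characters.
        let (_, suffix) = span isSym rest
        in all isSubSup suffix
  _ -> False
```

With predicates that treat `>` and `=` as symbols and U+2081 as the only sub/superscript, this accepts `>>=₁` but rejects `₁>>` and `>₁>`, since a sub/superscript may only appear at the end.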

  • The patch has a bunch of Unicode constants baked into it

The same can ultimately be said about generalCategory; just look at u_gencat. I can move the sup/sub test to a separate inlinable function.

  • It adds a bunch of extra tests to the inner loop. I haven't measured it but I wouldn't be surprised if this slows down the lexer.

Hm, I don't know if a few extra comparisons on already rare unicode characters will outweigh the binary search in u_gencat, let alone significantly increase the overall lexing time. Is there any way to stop GHC right after lexing so I can benchmark?

Perhaps it might be better just to allow the category Lm (MODIFIER LETTER) as part of an identifier? That would include all the primes and subscript/superscript things.

Lm leaves out a bunch of characters (e.g. sub/sup variants of "+" "-" "=" "(" ")"), including the primes which, as I mentioned, are Po. Another drawback is that identifiers like "abcₓdef" would be accepted. BTW, we can already write something not-so-beautiful like:

λ> let ᵤxᵤy = 1

because "ᵤ" is in the Ll category.
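
The categories discussed in this comment can be inspected with `generalCategory` from `Data.Char`. The sketch below is only an aid for checking them locally: `samples`, `report`, and `isModifierLetter` are hypothetical names, and the results depend on the Unicode tables shipped with the installed `base`, so they may differ between GHC versions.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)
import Numeric (showHex)

-- Characters mentioned in this ticket: PRIME, SUBSCRIPT ONE,
-- SUBSCRIPT PLUS SIGN, LATIN SUBSCRIPT SMALL LETTER A,
-- LATIN SUBSCRIPT SMALL LETTER U.
samples :: [Char]
samples = ['\x2032', '\x2081', '\x208A', '\x2090', '\x1D64']

-- Render a code point (in hexadecimal) together with its general category.
report :: Char -> String
report c = "U+" ++ showHex (fromEnum c) "" ++ " " ++ show (generalCategory c)

-- The test suggested in comment:4: is the character in category Lm?
isModifierLetter :: Char -> Bool
isModifierLetter c = generalCategory c == ModifierLetter
```

In GHCi, `mapM_ (putStrLn . report) samples` prints one line per character, and `filter (not . isModifierLetter) samples` shows which of them an Lm-only rule would leave out.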

comment:6 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1

comment:7 Changed 21 months ago by simonpj

  • Status changed from patch to new

Mikhail,

The first issue here is whether we want sub/superscripts (or indeed primes) on operators, and that's a language design question. We tend towards "no" but if there was a clear consensus from the Unicode-aware Haskell community, we'd accept it. The implementation questions are probably resolvable.

Could you start a thread on glasgow-haskell-users to ask them?

(A possible outcome might be that operators should not allow primes! i.e. the current behaviour is inconsistent, as you point out. And it's weird that you can use Unicode primes but not ASCII ones!)

Simon

comment:8 Changed 20 months ago by mikhail.vorozhtsov

Sorry for the late reply. I'll try to revisit the issue and come up with a less ad-hoc proposal in a month or two. Right now I'm completely out of free time.

comment:9 Changed 19 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2