Opened 12 years ago

Closed 11 years ago

Last modified 4 years ago

#1103 closed bug (fixed)

Japanese Unicode

Reported by: humasect
Owned by:
Priority: normal
Milestone: 6.10 branch
Component: Compiler (Parser)
Version: 6.6
Keywords: japanese unicode lexical -fglasgow-exts report-impact
Cc: pho@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

Using Japanese characters (either katakana or hiragana) in identifiers results in this:

Source/Hehe.hs:12:0: lexical error at character '\12390'

There is no issue with Haskell98 for upper/lower case identifiers and type constructor identification with the two complementary Japanese character sets. I am using -fglasgow-exts along with other Unicode characters for various operators, which work great.

Attachments (2)

UniTest.hs (720 bytes) - added by humasect 12 years ago.
1 working out of 3 unicode tests
CJK.hs (863 bytes) - added by humasect 12 years ago.
Here is an idea/proposal for some kind of simple extension to also allow backward-compatible "international" source code. Multilingual language?


Change History (14)

comment:1 Changed 12 years ago by humasect

Priority: normal → high

comment:2 Changed 12 years ago by simonmar

difficulty: Easy (1 hr) → Unknown
Priority: high → normal

Please attach some example code illustrating the bug.

BTW, the "priority" field of the ticket is mainly for the GHC developers so we can prioritise tickets; please use "severity" to indicate how badly the bug affects you. Someday I'll figure out how to put a link to some docs next to these fields on the ticket page.

Changed 12 years ago by humasect

Attachment: UniTest.hs added

1 working out of 3 unicode tests

comment:3 Changed 12 years ago by humasect

Operating System: MacOS X → Multiple

My apologies. I've attached some test code. We could really use Japanese identifiers in in-house development. I don't know what to say about upper/lower case for identifiers and constructors. I could create some example code for conventions that would work very well. Thanks again.

comment:4 Changed 12 years ago by ross

I don't think there's any reason why these characters couldn't be treated as upper or lower case letters; the question is which. We'd want to treat all the members of a Unicode General Category the same way, because special cases would be too cumbersome. A lexical syntax based on the case of letters was never going to work well with caseless scripts.

Kanji, katakana and hiragana all belong to the Letter, Other category. If we treated these as lower case, your third example would work, but you'd have to adopt a convention of prepending capital letters (like M, C, T and D) to Japanese module, class, type and data constructor names. (The same would apply to all the other caseless scripts too.)
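
To illustrate the classification described here, the following is a minimal sketch using Data.Char from base. The `samples` list and the `describe` helper are names made up for this illustration; the first sample is U+3066, the '\12390' from the reported lexical error.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isLower, isUpper)

-- Sample characters, written as escapes so the file itself stays ASCII:
-- hiragana TE (U+3066), katakana TE (U+30C6), a kanji (U+8A9E),
-- and two Latin letters for comparison.
samples :: [Char]
samples = ['\12390', '\12486', '\35486', 'a', 'A']

-- General category plus the alphabetic and case predicates for one character.
describe :: Char -> (GeneralCategory, Bool, Bool, Bool)
describe c = (generalCategory c, isAlpha c, isUpper c, isLower c)

main :: IO ()
main = mapM_ (print . describe) samples
```

The three Japanese characters are reported as OtherLetter: they count as alphabetic but are neither upper nor lower case, which is why a case-based lexical syntax has no obvious slot for them.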

comment:5 Changed 12 years ago by igloo

Another option would be to treat them as neither upper nor lower case, so they could be part of a name but not the first character of it. I think treating them as lower case makes more sense, though. Whatever we do, we should make sure Haskell' matches it.
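
The two options can be compared with a rough sketch. This is not GHC's lexer: the predicates below are hypothetical, heavily simplified checks for variable names only (they ignore reserved words and qualified names), written just to show where the two rules differ.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlphaNum, isLower)

-- Hypothetical helper: is the character in the "Letter, Other" category?
isOtherLetter :: Char -> Bool
isOtherLetter c = generalCategory c == OtherLetter

-- Characters allowed after the first one (simplified).
isIdentTail :: Char -> Bool
isIdentTail c = isAlphaNum c || c == '_' || c == '\''

-- Option A: "Letter, Other" counts as lower case, so it may start a variable name.
validVarA :: String -> Bool
validVarA []       = False
validVarA (c : cs) = (isLower c || c == '_' || isOtherLetter c) && all isIdentTail cs

-- Option B: "Letter, Other" has no case; allowed inside a name but not as its first character.
validVarB :: String -> Bool
validVarB []       = False
validVarB (c : cs) = (isLower c || c == '_') && all isIdentTail cs

main :: IO ()
main = do
  let kanaOnly  = "\12390\12377\12392" -- three hiragana characters
      latinLead = "x\12390"            -- Latin letter followed by hiragana
  print (validVarA kanaOnly, validVarB kanaOnly)
  print (validVarA latinLead, validVarB latinLead)
```

Under option B a name written entirely in kana or kanji is rejected, so every identifier would need a Latin first character; under option A only names that must be upper case (modules, classes, types, constructors) need one.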

comment:6 Changed 12 years ago by simonmar

Milestone: 6.6.1 → 6.8

Punt to 6.8: this requires further thought and coordination with Haskell'.

Changed 12 years ago by humasect

Attachment: CJK.hs added

Here is an idea/proposal for some kind of simple extension to also allow backward-compatible "international" source code. Multilingual language?

comment:7 Changed 12 years ago by igloo

Milestone: 6.8 → 6.10

Punt to 6.10 as this still requires further thought and coordination with Haskell'.

comment:8 Changed 11 years ago by PHO

Cc: pho@… added

comment:9 Changed 11 years ago by simonmar

Resolution: fixed
Status: new → closed

I did as Ross suggested and made the "Letter, Other" class behave as lower-case.

Wed Jul  9 10:12:52 BST 2008  Simon Marlow <marlowsd@gmail.com>
  * Treat the Unicode "Letter, Other" class as lowercase letters (#1103)
  This is an arbitrary choice, but it's strictly more useful than the
  current situation, where these characters cannot be used in
  identifiers at all.
  
  In Haskell' we may revisit this decision (it's on my list of things to
  discuss), but for now this is an improvement for those using caseless
  languages.
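
As a rough illustration of what this change permits, here is a small sketch, assuming a GHC that includes the patch and a source file saved as UTF-8. All names are invented for the example; the T and C prefixes follow the convention proposed in comment:4 for names that must start with an upper-case letter.

```haskell
module Main where

-- Types and data constructors still need an upper-case first letter,
-- so this sketch prefixes them with Latin capitals (T for the type,
-- C for the constructors).
data T色 = C赤 | C青
  deriving (Show, Eq)

-- Function and variable names may start with kanji or kana, since
-- "Letter, Other" characters are lexed as lower case.
反転 :: T色 -> T色
反転 C赤 = C青
反転 C青 = C赤

二倍 :: Int -> Int
二倍 数 = 数 * 2

main :: IO ()
main = do
  -- Only ASCII is printed, so the result does not depend on the
  -- terminal's locale or encoding.
  print (二倍 21)
  print (反転 C赤 == C青)
```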

comment:10 Changed 10 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:11 Changed 10 years ago by simonmar

Operating System: Multiple → Unknown/Multiple

comment:12 Changed 4 years ago by ekmett

Keywords: report-impact added
Type of failure: None/Unknown