Opened 23 months ago

Last modified 20 months ago

#11529 new bug

Show instance of Char should print literals for non-ascii printable charcters

Reported by: nushio Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.3
Keywords: Cc: lelf, RyanGlScott, erikd
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Show instance for non-ascii characters prints their character codes. This is sad for Haskell users that speaks language other than English.

> 'A'
'A'
> 'Ä'
'\196'
> '漢'
'\28450'
> print $ [(++"'s dad"), (++"'s mom")] <*> ["Simon", "John"]
["Simon's dad","John's dad","Simon's mom","John's mom"]
> print $ [(++"の父"), (++"の母")] <*> ["田中", "山田"]
["\30000\20013\12398\29238","\23665\30000\12398\29238","\30000\20013\12398\27597","\23665\30000\12398\27597"]

The function that needs improvement is showLitChar in GHC.Show, which currently prints any character larger than ASCII code 127 by its character code:

http://haddock.stackage.org/lts-5.1/base-4.8.2.0/src/GHC-Show.html

showLitChar                :: Char -> ShowS
showLitChar c s | c > '\DEL' =  showChar '\\' (protectEsc isDec (shows
(ord c)) s)

On the other hand, there is GHC.Unicode.isPrint, the predicate for printable Unicode characters, that is calling on a foreign function u_iswprint for the knowledge.

https://hackage.haskell.org/package/base-4.8.2.0/docs/src/GHC.Unicode.html#isPrint

I think one of the solution is to import and call u_iswprint from GHC.Show, too, but I don't know it's against any design choices.

Change History (19)

comment:1 Changed 23 months ago by nushio

One of application that is broken by this change, is when a customized Show instance of a type is controlled by other variables in that type. For example, the following code simulates a press code that respects privacy for people of age under 20.

data Sex = Male | Female
data Person = Person {name :: String, age :: Int, sex :: Sex}

instance Show Person where
  show (Person _ a Male  ) | a < 20 = "A boy (" ++ show a ++ ")"
  show (Person _ a Female) | a < 20 = "A girl (" ++ show a ++ ")"
  show (Person n a _     )          = n

assert $ show (Person "村主崇行" 19 Male) == "A boy (19)"
assert $ show (Person "村主崇行" 20 Male) == "\26449\20027\23815\34892"

I'm very looking forward to learn other drawbacks of this change.

comment:2 Changed 23 months ago by dfeuer

Absolutely any code in the entire world that relies on the current behavior will break. The current behavior is expressed in the reference implementation in the Haskell 2010 report. Frankly, changing it is not an option. You can write your own function to unescape valid Unicode. You can also write your own UShow class if you like with a method for showing various things using Unicode generally. You can then try to convince other developers to depend on your package and write instances of your class.

comment:3 Changed 23 months ago by nushio

Dear dfeuer, thank you for pointing out that the Show Char is specified in Haskell 2010. I believe the corresponding section is the following:

https://www.haskell.org/onlinereport/haskell2010/haskellch16.html#x24-21700016.6

16.6 String representations

showLitChar :: Char -> ShowS
    Convert a character to a string using only printable characters, 
    using Haskell source-language escape conventions. For example:

     showLitChar '\n' s  =  "\\n" ++ s 

where "Haskell source-language escape conventions" are defined, in turn, in Section 2.6 https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#x7-200002.6 .

Correct me if I'm wrong.

comment:4 Changed 23 months ago by thomie

You can put something like this in your .ghci file:

:seti -XScopedTypeVariables

:{
let myShow :: Show a => a -> String
    myShow x = go (show x) where
      go :: String -> String
      go [] = []
      go s@(x:xs) = case x of
          '\"' -> '\"' : str ++ "\"" ++ go rest
          '\'' -> '\'' : char : '\'' : go rest'
          _    -> x : go xs
        where
          (str :: String, rest):_ = reads s
          (char :: Char, rest'):_ = reads s
:}

:{
let myPrint :: Show a => a -> IO ()
    myPrint = putStrLn . myShow
:}

:set -interactive-print=myPrint

Example:

Prelude> [(++"の父"), (++"の母")] <*> ["田中", "山田"]
["田中の父","山田の父","田中の母","山田の母"]

comment:5 Changed 23 months ago by nushio

Resolution: invalid
Status: newclosed

Dear thomie, thank you for your comment. Yes, -interactive-print is a great feature! I regret that I was not able to search out this has been done for years.

There are also several customized show function proposed, like myShow here. However, when I used it in some detail, I found that printing in Unicode has many corner cases that are more difficult than it seems .... As far as I have searched, I cannot find a unicode-printing function that satisfies read . unicode_show == id for sufficiently many types. For example, https://gist.github.com/nushio3/4a10f3c0092295696daf

Thus, I decided to start a small package -interactive-printing. http://hackage.haskell.org/package/unicode-show I wish this helps many people enjoy Haskell!

comment:6 Changed 23 months ago by nushio

By the way, now I know that this issue was a language feature, rather than lack of implementation, I think it is proper to close this ticket.

comment:7 Changed 23 months ago by thomie

GHCi could be changed to show unicode characters nicely by default. The code is in the function tcUserStmt in compiler/typecheck/TcRnDriver.hs.

Expressions:

        -- The plans are:
        --   A. [it <- e; print it]     but not if it::()
        --   B. [it <- e]
        --   C. [let it = e; print it]

Statements:

        -- The plans are:
        --      [stmt; print v]         if one binder and not v::()
        --      [stmt]                  otherwise

Replace print by putStrLn . uShow, with a suitable uShow. That shouldn't break anyone's code.

comment:8 Changed 23 months ago by thomie

Resolution: invalid
Status: closednew

Oh, silly me, print in tcUserStmt of course uses that interactive printer setting I mentioned in comment:4.

So my suggestion is to change the default interactive printer to display unicode characters nicely.

comment:9 Changed 23 months ago by thomie

There is also a bug, ghci -fprint-bind-result doesn't use the interactive printer for statements:

Prelude> :set -fprint-bind-result
Prelude> let x = "の父"
"\12398\29238"
Prelude> "の父"
"の父"

comment:10 Changed 22 months ago by lelf

Cc: lelf added

comment:11 Changed 22 months ago by nushio

(+1) to suggestion that to change the default interactive printer to display unicode characters nicely. The algorithm in unicode-show might be suitable for the purpose, although there should be various opinions on what is the "nice way to print unicode."

By the way, if we update the default interactive printer, will we be breaking the doctests that shows values with unicodes, forcing them to update the expected results from the interpreter?

comment:12 Changed 22 months ago by rwbarton

I would love for something like comment:4 to become the default in ghci. It could even be simpler/stupider and just replace any sequence like \12345 with the corresponding Unicode character wherever it appears. I mean when would you ever have such a string in the output of show, short of a weird custom Show instance? And it would be more robust to other weird custom Show instances, that used quotes in an unbalanced fashion.

I don't think we should replace \n or \ESC or especially \\ though. Just printable Unicode characters outside the ASCII range, probably. And we could decline to do the replacement if the replacement character can't be encoded in the user's locale.

One drawback is that the user's font might not contain the Unicode characters in question, like mine does not contain \12345. So there should probably be an option to disable these replacements.

comment:14 Changed 21 months ago by RyanGlScott

Cc: RyanGlScott added

comment:15 in reply to:  2 Changed 21 months ago by chak

Replying to dfeuer:

Absolutely any code in the entire world that relies on the current behavior will break. The current behavior is expressed in the reference implementation in the Haskell 2010 report. Frankly, changing it is not an option. You can write your own function to unescape valid Unicode. You can also write your own UShow class if you like with a method for showing various things using Unicode generally. You can then try to convince other developers to depend on your package and write instances of your class.

I disagree. I think, the current implementation is actually wrong and does not adhere to the standard. The standard states in 16.6 that showLitChar be defined as follows:

Convert a character to a string using only printable characters, using Haskell source-language escape conventions.

However, the current implementation of showLitChar fail to use isPrint; instead it uses a naive condition, c > '\DEL', to determine printability. This is wrong.

The solution is simple, replace the condition c > '\DEL' by not (isPrint c) in the definition of showLitChar.

comment:16 Changed 21 months ago by allbery_b

isPrint does not answer the question "can this character be displayed by the current user given their current locale?". That would require it to be in IO, and would limit the ability to use it in other contexts.

isPrint answers the question "is the Unicode codepoint contained in the given Char considered printable by the version of the Unicode standard to which the runtime conforms?".

It is not the correct question to ask here.

comment:17 Changed 21 months ago by erikd

Cc: erikd added

comment:18 in reply to:  16 Changed 21 months ago by chak

Replying to allbery_b:

isPrint does not answer the question "can this character be displayed by the current user given their current locale?". That would require it to be in IO, and would limit the ability to use it in other contexts.

isPrint answers the question "is the Unicode codepoint contained in the given Char considered printable by the version of the Unicode standard to which the runtime conforms?".

It is not the correct question to ask here.

It is, however, what the standard prescribes. IMHO it is also the right thing to do as it leads to less unexpected behaviour than the current implementation.

comment:19 Changed 20 months ago by nushio

I believe there is an ambiguity in the specification

showLitChar :: Char -> ShowS
    "Convert a character to a string using only printable characters

whether "printable" means ASCII printable or Unicode printable. How shall we solve the ambiguity?

By the way, I use map fromEnum to investigate the content of the string when lack of appropriate font or when I am debugging a pretty printer.

Note: See TracTickets for help on using tickets.