Opened 7 years ago

Last modified 7 months ago

#4471 new bug

Incorrect Unicode output on Windows Console

Reported by: sankeld
Owned by:
Priority: normal
Milestone:
Component: Compiler
Version: 6.12.3
Keywords:
Cc: ekmett@…, dagitj@…, simon@…, shelarcy@…, ryan.gl.scott@…, setre3+ghc@…
Operating System: Windows
Architecture: x86
Type of failure: Incorrect result at runtime
Test Case:
Blocked By:
Blocking:
Related Tickets: #11394
Differential Rev(s):
Wiki Page:

Description

To reproduce:

  • Start a Windows console.
  • Change the console's font to a TTF Unicode font, like "Lucida Console".
  • Type "chcp 65001" to set it to the UTF-8 code page.

test.hs

main = putStrLn "∷⇒∀→←⋯⊢"

Output to the console is garbled. runghc test.hs:

∷⇒∀→←⋯⊢
→←⋯⊢
⋯⊢
∷⇒∀→←⋯⊢→←⋯⊢←⋯⊢⋯⊢⊢⊢⊢<stdout>: hFlush: permission denied (Permission denied)

Piping works correctly. runghc test.hs > output && type output:

∷⇒∀→←⋯⊢

ghci fails. ghci test.hs:

GHCi, version 6.12.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( test.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
∷*** Exception: <stdout>: hPutChar: permission denied (Permission denied)
*Main>

Change History (28)

comment:2 Changed 7 years ago by sankeld

comment:3 Changed 7 years ago by sankeld

A solution that doesn't require changing from the posix emulation layer is shown here: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

test.c

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <string.h>  /* strlen */
#include <unistd.h>

int main(void)
{
    // this seems to fix the problem
    _setmode(_fileno(stdout), _O_U8TEXT);
    char testStr[] = "∷⇒∀→←⋯⊢";
    // posix emulation
    write(STDOUT_FILENO, testStr, strlen(testStr));
    return 0;
}

gcc test.c -o test.exe
test.exe

∷⇒∀→←⋯⊢

test.exe > output && type output

∷⇒∀→←⋯⊢

comment:4 Changed 7 years ago by simonmar

Surely that solution only works for UTF-8? What about other code pages?

comment:5 Changed 7 years ago by sankeld

Change to the Greek code page with chcp 1253, then run test.exe:

∷⇒∀→←⋯⊢

_setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no?

Here is the link for the _setmode documentation:

http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

comment:6 in reply to:  5 ; Changed 7 years ago by simonmar

Replying to sankeld:

> _setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no?
>
> Here is the link for the _setmode documentation:
>
> http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx

I don't like to apply a fix without fully understanding what the problem is and why the fix works, and this is all very mysterious to me right now. Why doesn't it work to send UTF-8 to stdout if the current code page is set to UTF-8?

comment:7 Changed 7 years ago by igloo

Milestone: 7.2.1

comment:8 in reply to:  6 Changed 7 years ago by sankeld

Replying to simonmar:

> Replying to sankeld:
>
>> _setmode then is a solution only when the console is set to the Unicode code page. This seems like an adequate solution for now, no?
>>
>> Here is the link for the _setmode documentation:
>>
>> http://msdn.microsoft.com/en-us/library/tw4k6df8.aspx
>
> I don't like to apply a fix without fully understanding what the problem is and why the fix works, and this is all very mysterious to me right now. Why doesn't it work to send UTF-8 to stdout if the current code page is set to UTF-8?

I understand your hesitation. I carefully read through the documentation linked there and on the blog post I mentioned. The only thing Microsoft is putting out right now is the "how" and not the "why" unfortunately.

I don't have high hopes we'll be able to get beyond speculation as to why the default console mode produces unexpected and unpredictable unicode console output.

One thing we can note is that the mention of _O_U16TEXT, _O_U8TEXT, and _O_WTEXT in the _setmode documentation is a recent addition (VS 2010), although the flags worked before that. This may be an indicator that Microsoft is "blessing" this workaround for the console.

comment:9 Changed 7 years ago by simonmar

There are still too many unknowns here.

  • Won't _O_U8TEXT do newline mangling too? The IO library already does that, so we could have a problem.
  • the original report said that piping the output to a file worked fine. So presumably we need to do this only when the file descriptor is attached to a console?
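A minimal sketch of the "is this descriptor attached to a console?" check raised in the second point, assuming the C runtime's isatty (POSIX) or _isatty (MSVCRT/MinGW) is an acceptable test. The name fd_is_console is hypothetical. On Windows, _isatty reports any character device, not only consoles, so a stricter check (for example, whether GetConsoleMode succeeds on the underlying handle) may be needed in practice.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <io.h>       /* _isatty */
#define fd_isatty _isatty
#else
#include <unistd.h>   /* isatty */
#define fd_isatty isatty
#endif

/* Hypothetical helper: returns 1 if fd refers to a terminal-like
   device, 0 otherwise (for example a pipe or a regular file). */
static int fd_is_console(int fd)
{
    assert(fd >= 0);
    return fd_isatty(fd) != 0;
}
```

Any console-specific workaround would then be applied only when fd_is_console(1) is true, leaving redirected output (which the original report says works) untouched.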

And I still don't understand exactly what this _setmode is a workaround for. Something apparently goes wrong when you try to output Unicode to the console, but at what layer does the problem occur? (GHC.IO, msvcrt, Win32, kernel)

I don't like to be obstructive when there's an apparent fix for a problem, but I've seen many cases where a "fix" has introduced new problems, so I want to make sure the cure is not worse than the disease :)

comment:10 Changed 7 years ago by sankeld

I think I have the bug pinpointed and can explain the behavior of the original test program.

I've verified that the posix write system call (when applied to stdout where stdout is attached to a console with code page 65001) returns the number of *characters* written instead of the number of *bytes*. This can probably be traced to this issue.

The reasoning for our original output

∷⇒∀→←⋯⊢ -- outputs correctly, but runtime thinks that 9/15 characters remain
→←⋯⊢ -- runtime tries to output the remaining characters, but still thinks characters remain.
⋯⊢ -- ...and so on until a buffer overrun I assume.
∷⇒∀→←⋯⊢→←⋯⊢←⋯⊢⋯⊢⊢⊢⊢<stdout>: hFlush: permission denied (Permission denied)

The source of the fdWrite function in GHC/IO/FD.hs confirms this behavior.
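To make the mechanism concrete, here is a simplified model in portable C, not GHC's actual fdWrite code. The names buggy_write and write_all are hypothetical. buggy_write stands in for the console write: it emits every byte it is given but reports a count of UTF-8 characters. write_all is an ordinary retry loop that trusts the return value as a byte count. The real output in the report also depends on Handle buffering, so this sketch only illustrates why the tail of the string gets re-sent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Everything the simulated console has "printed" so far. */
static char out[256];
static size_t out_len = 0;

/* Hypothetical stand-in for the buggy write: outputs all len bytes,
   but returns the number of UTF-8 characters instead of bytes. */
static size_t buggy_write(const char *buf, size_t len)
{
    size_t chars = 0;
    assert(out_len + len <= sizeof out);
    memcpy(out + out_len, buf, len);
    out_len += len;
    for (size_t i = 0; i < len; i++) {
        /* Bytes of the form 10xxxxxx continue a character; count the rest. */
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

/* A retry loop that treats the return value as a byte count and
   re-sends whatever it believes is still unwritten.
   Returns the number of write calls made. */
static int write_all(const char *buf, size_t len)
{
    int calls = 0;
    size_t off = 0;
    while (off < len) {
        size_t n = buggy_write(buf + off, len - off);
        calls++;
        if (n == 0)   /* only continuation bytes left; avoid spinning */
            break;
        off += n;
    }
    return calls;
}
```

For the 22-byte string "∷⇒∀→←⋯⊢\n", this model makes seven calls and emits 58 bytes in total, each retry starting partway through the already-printed text.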

An ugly solution, if we want to work around this write bug, would be to check, upon write, if this is a 65001 console (not piped to a file). If so, treat the return value of write as a number of characters instead of a number of bytes.
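A rough sketch of what "treat the return value as a number of characters" could look like, assuming the buffer holds valid UTF-8 starting at a character boundary. The helper name utf8_chars_to_bytes is hypothetical. It also assumes the reported count really is a character count, which later comments in this ticket call into question when the console font lacks the glyphs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a UTF-8 buffer and the number of characters
   the console claims to have written, return how many bytes those
   characters occupy, so the caller can advance its offset correctly. */
static size_t utf8_chars_to_bytes(const char *buf, size_t len, size_t nchars)
{
    size_t i = 0;
    assert(buf != NULL);
    while (i < len && nchars > 0) {
        i++;  /* consume the lead byte of one character */
        /* then consume its continuation bytes (10xxxxxx) */
        while (i < len && ((unsigned char)buf[i] & 0xC0) == 0x80)
            i++;
        nchars--;
    }
    return i;
}
```

A write loop would advance by utf8_chars_to_bytes(buf + off, len - off, n) instead of by n, but only after confirming that the descriptor is a code page 65001 console.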

Arg.

comment:11 Changed 7 years ago by sankeld

Also, looking at comments here, there would also have to be a check of whether or not the console is using a ttf font. This workaround strategy is beginning to look like a dead end.

comment:12 Changed 7 years ago by simonmar

That does clarify things a lot, thanks for that. To summarise:

  • the bug is that Win32 WriteFile() returns the wrong result when writing to a Console in codepage 65001. Furthermore, the result it actually returns is the number of characters written to the console, which depends on the actual font being used! (if the font doesn't have the required Unicode glyph, it falls back to outputting characters corresponding to the raw UTF-8 bytes).

The only way to work around the bug seems to be to use WriteConsole() and write Unicode characters directly. If a Handle is attached to a console, then all writes must be decoded from the codepage encoding to UTF-16 before being written using WriteConsole(). Even better would be to bypass the codepage encoding entirely and encode directly from UTF-32 to UTF-16 in the IO library. None of this is particularly easy, though.
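The UTF-32 to UTF-16 step is the standard surrogate-pair encoding, and can be sketched in portable C as below. The function name utf32_to_utf16 and its error convention are hypothetical; this is not code from GHC's IO library. On Windows, the resulting buffer of 16-bit units is the kind of input WriteConsoleW expects, but that call is deliberately left out so the sketch stays platform-independent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode n Unicode code points as UTF-16 code units
   into out (capacity cap). Returns the number of units written, or
   (size_t)-1 if a code point is invalid or out is too small. */
static size_t utf32_to_utf16(const uint32_t *in, size_t n,
                             uint16_t *out, size_t cap)
{
    size_t j = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        /* Reject values beyond U+10FFFF and lone surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        if (cp < 0x10000) {
            /* Basic Multilingual Plane: one code unit. */
            if (j + 1 > cap)
                return (size_t)-1;
            out[j++] = (uint16_t)cp;
        } else {
            /* Supplementary plane: high and low surrogate. */
            if (j + 2 > cap)
                return (size_t)-1;
            cp -= 0x10000;
            out[j++] = (uint16_t)(0xD800 | (cp >> 10));
            out[j++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return j;
}
```

All seven characters in the test string are in the Basic Multilingual Plane, so each becomes a single UTF-16 unit; the surrogate branch only matters for code points above U+FFFF.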

comment:13 Changed 5 years ago by igloo

Milestone: 7.4.1 → 7.6.1
Priority: normal → low

comment:14 Changed 5 years ago by igloo

Milestone: 7.6.1 → 7.6.2

comment:15 Changed 4 years ago by ekmett

Cc: ekmett@… added

comment:16 Changed 4 years ago by dagit

Cc: dagitj@… added

comment:17 Changed 4 years ago by simonmic

Cc: simon@… added

comment:18 Changed 4 years ago by shelarcy

Cc: shelarcy@… added

comment:19 Changed 3 years ago by thoughtpolice

Milestone: 7.6.2 → 7.10.1

Moving to 7.10.1.

comment:20 Changed 3 years ago by RyanGlScott

Cc: ryan.gl.scott@… added
difficulty: Unknown

comment:21 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:22 Changed 2 years ago by qwfy

Milestone: 7.12.1 → 7.10.1
Priority: low → high

comment:23 Changed 2 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to the 7.12.1 milestone, as these tickets won't be fixed in time for the 7.10.1 release (unless you, the reader, help write a patch :)

comment:24 Changed 22 months ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:25 Changed 18 months ago by bgamari

Milestone: 8.0.1 → 8.2.1
Priority: high → normal

Yet another release where this didn't happen. Due to the relatively small amount of activity on this ticket I'm going to reduce its priority.

comment:26 Changed 18 months ago by bgamari

comment:27 Changed 7 months ago by jmn

Cc: setre3+ghc@… added

comment:28 Changed 7 months ago by bgamari

Milestone: 8.2.1 (removed)

De-milestoning due to lack of progress.
