Opened 6 months ago

#15553 new bug

GHC.IO.Encoding not flushing partially converted input

Reported by: msakai
Owned by:
Priority: normal
Milestone: 8.6.1
Component: Core Libraries
Version: 8.4.3
Keywords:
Cc:
Operating System: Linux
Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

Conversion by GHC.IO.Encoding produces incomplete output for some encodings because it does not flush partially converted input at the end of the string.

iconv(3) provides an API for this flushing. From its manual page:

In each series of calls to iconv(), the last should be one with inbuf or *inbuf equal to NULL, in order to flush out any partially converted input.

However, GHC.IO.Encoding does not perform this flush properly, which can leave the conversion result incomplete. I found two cases in which it actually produces incomplete output, but there may be more.

Case 1: EUC-JISX0213

For example, the following code is expected to output the two bytes 0xa4 0xb1, but it outputs nothing.

enc <- mkTextEncoding "EUC-JISX0213"
withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3051"
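To check which bytes actually reach the file, the snippet above can be wrapped in a small program that reads the file back in binary form. This is only a sketch: `hexBytes` and `encodedBytes` are hypothetical helper names, the `bytestring` package is assumed to be available, and the system iconv is assumed to know the requested encoding. No expected output is shown because the result depends on whether the flush is performed.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)
import Numeric (showHex)
import System.IO

-- Render bytes in the "0xa4 0xb1" style used in this report,
-- zero-padded to two hex digits.
hexBytes :: [Word8] -> String
hexBytes bs = unwords [ "0x" ++ pad (showHex b "") | b <- bs ]
  where pad s = replicate (2 - length s) '0' ++ s

-- Write a string through the named encoding, then read the raw bytes back.
-- "test.txt" is the scratch file name used in this report.
encodedBytes :: String -> String -> IO [Word8]
encodedBytes encName str = do
  enc <- mkTextEncoding encName
  withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h str
  B.unpack <$> B.readFile "test.txt"

main :: IO ()
main = do
  bytes <- encodedBytes "EUC-JISX0213" "\x3051"
  putStrLn (hexBytes bytes)
```

The same helper can be reused for other encodings by changing the encoding name and the input string.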

The problem happens because of the following mapping between Unicode and EUC-JISX0213.

Unicode          EUC-JISX0213
U+3051 U+309A    0xa4 0xfa
U+3051           0xa4 0xb1

After seeing the code point U+3051, the converter cannot determine which of the two byte sequences to output until it sees the next character or the end of the string. Because GHC.IO.Encoding does not call the flushing API mentioned above, the converter never learns that the string has ended.
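The lookahead problem can be illustrated with a toy encoder. The sketch below is not the real iconv converter and not GHC's code: it models only the two mapping rows above, and the names `Pending`, `single`, `step`, `flush` and `encode` are made up for illustration. The `Bool` argument of `encode` selects whether a final flush is performed.

```haskell
import Data.Word (Word8)

-- Toy model covering only the two mapping rows above; NOT the real converter.
-- The state is a code point that has been read but not yet emitted.
type Pending = Maybe Char

-- Bytes for a character that turned out to stand alone.
single :: Char -> [Word8]
single '\x3051' = [0xa4, 0xb1]
single _        = [0x3f]          -- '?' for anything this toy does not model

-- Feed one character: new state plus the bytes that can be emitted right away.
step :: Pending -> Char -> (Pending, [Word8])
step (Just '\x3051') '\x309A' = (Nothing, [0xa4, 0xfa])   -- combined sequence
step (Just p) c               = let (st, out) = step Nothing c
                                in (st, single p ++ out)  -- p stood alone after all
step Nothing '\x3051'         = (Just '\x3051', [])       -- undecided: U+309A may follow
step Nothing c                = (Nothing, single c)

-- End of input: whatever is still pending must stand alone.
flush :: Pending -> [Word8]
flush = maybe [] single

-- Encode a whole string, with or without the final flush.
encode :: Bool -> String -> [Word8]
encode doFlush = go Nothing
  where
    go st []     = if doFlush then flush st else []
    go st (c:cs) = let (st', out) = step st c in out ++ go st' cs

main :: IO ()
main = do
  print (encode False "\x3051")  -- prints [] (the pending U+3051 is lost)
  print (encode True  "\x3051")  -- prints [164,177], i.e. 0xa4 0xb1
```

Without the flush, the toy encoder reproduces the reported symptom: a string ending in U+3051 produces no bytes for that character.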

Case 2: ISO-2022-JP

Similarly, the following code is expected to output the byte sequence 0x1b 0x24 0x42 0x24 0x22 0x1b 0x28 0x42, but the last three bytes 0x1b 0x28 0x42 are not produced.

enc <- mkTextEncoding "ISO-2022-JP"
withFile "test.txt" WriteMode $ \h -> hSetEncoding h enc >> hPutStr h "\x3042"

ISO-2022-JP is a stateful encoding, and RFC 1468 requires that the state be reset to the initial state at the end of the string. The missing three bytes 0x1b 0x28 0x42 are the escape sequence for that purpose. Again, because GHC.IO.Encoding does not call the flushing API mentioned above, the converter cannot recognize the end of the string and cannot reset the state.
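The role of the flush in a stateful encoding can be sketched the same way. The following toy model is an assumption-laden illustration, not the real ISO-2022-JP converter: it knows only ASCII and a single JIS X 0208 character (U+3042), and the names `Charset`, `escapeTo`, `classify`, `step`, `flush` and `encode` are hypothetical.

```haskell
import Data.Word (Word8)

-- Toy model: only the two character sets needed for this example.
data Charset = ASCII | JISX0208 deriving (Eq, Show)

-- Escape sequence that switches into the given character set.
escapeTo :: Charset -> [Word8]
escapeTo ASCII    = [0x1b, 0x28, 0x42]  -- ESC ( B
escapeTo JISX0208 = [0x1b, 0x24, 0x42]  -- ESC $ B

-- Hypothetical one-entry mapping; a real converter has a full table.
classify :: Char -> (Charset, [Word8])
classify '\x3042' = (JISX0208, [0x24, 0x22])
classify c
  | c < '\x80' = (ASCII, [fromIntegral (fromEnum c)])
  | otherwise  = (ASCII, [0x3f])        -- '?' for anything not modelled

-- Feed one character, emitting an escape sequence only on a charset change.
step :: Charset -> Char -> (Charset, [Word8])
step cur c =
  let (want, bytes) = classify c
      esc = if want == cur then [] else escapeTo want
  in (want, esc ++ bytes)

-- End of input: return to the initial state (ASCII) if we are not there.
flush :: Charset -> [Word8]
flush ASCII = []
flush _     = escapeTo ASCII

-- Encode a whole string, with or without the final flush.
encode :: Bool -> String -> [Word8]
encode doFlush = go ASCII
  where
    go st []     = if doFlush then flush st else []
    go st (c:cs) = let (st', out) = step st c in out ++ go st' cs

main :: IO ()
main = do
  print (encode False "\x3042")  -- five bytes, no trailing ESC ( B
  print (encode True  "\x3042")  -- eight bytes, ending in 0x1b 0x28 0x42
```

With the flush, the toy encoder produces the eight bytes expected above; without it, the output stops after 0x24 0x22, which matches the reported behaviour.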
