Opened 8 years ago

Closed 6 years ago

#3309 closed bug (fixed)

getArgs should return Unicode on Unix

Reported by: YitzGale
Owned by: batterseapower
Priority: high
Milestone: 7.2.1
Component: libraries/base
Version: 6.11
Keywords: unicode
Cc: slyfox@…, marcot@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

The raw bytes of args should be decoded according to the current locale.

An additional function should be added:

getArgsBytes :: IO [Word8]

to provide access to the raw bytes.
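For illustration, decoding one argument's raw bytes according to the locale could look roughly like the sketch below. It relies on GHC.Foreign and GHC.IO.Encoding from current versions of base, which is an assumption on my part and not something this ticket specifies. The names decodeArgWith, decodeArgLocale and decodeArgUtf8 are hypothetical.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, getLocaleEncoding, utf8)

-- | Decode one argument's raw bytes with an explicit encoding.
-- (Hypothetical helper, not part of base.)
decodeArgWith :: TextEncoding -> [Word8] -> IO String
decodeArgWith enc bytes =
  -- Copy the bytes into a temporary buffer and decode exactly 'len' bytes.
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)

-- | Decode with whatever encoding the current locale selects.
decodeArgLocale :: [Word8] -> IO String
decodeArgLocale bytes = do
  enc <- getLocaleEncoding
  decodeArgWith enc bytes

-- | Decode as UTF-8 regardless of locale, handy for deterministic checks.
decodeArgUtf8 :: [Word8] -> IO String
decodeArgUtf8 = decodeArgWith utf8
```

With this sketch, decodeArgUtf8 [0xC3, 0xA9] gives the one-character string "\233" (é), whereas treating each byte as a Char would give two characters.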

This change needs to be coordinated with #3007 so that it will still work to read a file name from the command line args and use it to access a file.

This change should also be made on Windows: #3008

See the discussion at http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html

Change History (14)

comment:1 Changed 8 years ago by igloo

difficulty: Unknown
Milestone: 6.14.1

comment:2 in reply to:  description Changed 7 years ago by slyfox

Type of failure: None/Unknown

Replying to YitzGale:

The raw bytes of args should be decoded according to the current locale.

An additional function should be added:

getArgsBytes :: IO [Word8]

s/\[Word8\]/\[\[Word8\]\]/ :]

to provide access to the raw bytes.

This change needs to be coordinated with #3007 so that it will still work to read a file name from the command line args and use it to access a file.

This change should also be made on Windows: #3008

See the discussion at http://www.haskell.org/pipermail/haskell-cafe/2009-June/062795.html

Or, maybe, make getArgs/readFile and friends polymorphic like Text.Printf printf does?

Text.Printf printf :: PrintfType r => String -> r

instance (IsChar c) => PrintfType [c] -- Defined in Text.Printf
instance PrintfType (IO a) -- Defined in Text.Printf
instance (PrintfArg a, PrintfType r) => PrintfType (a -> r)

In our case it would be something like

getArgs :: StringAlike s => IO [s]

and usage would look like:
foo = getArgs :: IO [[Word8]] -- raw bytes
foo = getArgs :: IO [ByteString]  -- raw bytes in fast bytestring
foo = getArgs :: IO [String]  -- locale encoded
-- maybe others?
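A minimal sketch of how such a class might be written, using only base. StringAlike, fromRawArg, convertArgs and exampleRawArgs are all hypothetical names. The raw argument vector is passed in explicitly because base has no portable function returning the undecoded bytes, and a ByteString instance is left out to avoid a dependency on the bytestring package.

```haskell
{-# LANGUAGE FlexibleInstances #-}

import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, utf8)

-- | Hypothetical class: types an argument's raw bytes can be turned into.
class StringAlike s where
  fromRawArg :: TextEncoding -> [Word8] -> IO s

-- Raw bytes: returned untouched, the encoding is ignored.
instance StringAlike [Word8] where
  fromRawArg _ = return

-- String: decoded with the supplied encoding.
instance StringAlike [Char] where
  fromRawArg enc bytes =
    withArrayLen bytes $ \len ptr ->
      GHC.peekCStringLen enc (castPtr ptr, len)

-- | Convert a whole raw argument vector; the result type picks the instance.
convertArgs :: StringAlike s => TextEncoding -> [[Word8]] -> IO [s]
convertArgs enc = mapM (fromRawArg enc)

-- | Made-up sample input: the arguments "-o" and "é" as UTF-8 bytes.
exampleRawArgs :: [[Word8]]
exampleRawArgs = [[0x2D, 0x6F], [0xC3, 0xA9]]
```

As with printf, the caller chooses the representation with a type annotation, for example convertArgs utf8 exampleRawArgs :: IO [String].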

Thanks!

comment:3 Changed 7 years ago by slyfox

Cc: slyfox@… added

comment:4 Changed 6 years ago by igloo

Milestone: 7.0.1 → 7.0.2

comment:5 Changed 6 years ago by marcotmarcot

Cc: marcot@… added

The same applies to System.Environment.getEnvironment.

comment:6 Changed 6 years ago by igloo

Milestone: 7.0.2 → 7.2.1

comment:7 Changed 6 years ago by batterseapower

I have a patch to add locale-awareness to the CString functions in Foreign.C.String, which fixes this problem, but I have a problem: The documentation for charIsRepresentable claims that unrepresentable characters are replaced with ?, but the current code does not in fact do this - you get a nonsense character instead. Furthermore, it is difficult to fix the code to match the documentation in my new locale-aware implementation because iconv only provides transliteration and ignore modes for unrepresentable characters.

So there are two problems:

  1. The documented behaviour on unrepresentable characters does not match the implemented behaviour
  2. The documented behaviour is difficult to implement

So we should probably change the documented behaviour. The easiest thing to do is drop unrepresentable characters, which can be implemented easily either using our code page decoder (on Win32) or iconv (on *nix).

Does this sound like a reasonable approach?
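To make the "drop" behaviour concrete, here is a rough sketch in the decoding direction. It uses the //IGNORE suffix that mkTextEncoding accepts in current versions of base, so it shows today's API and not necessarily the patch described above. decodeDroppingInvalid is a hypothetical name, and UTF-8 is chosen only so that the result does not depend on the locale.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (mkTextEncoding)

-- | Decode bytes as UTF-8, silently dropping bytes that do not decode.
-- (Hypothetical helper for illustration only.)
decodeDroppingInvalid :: [Word8] -> IO String
decodeDroppingInvalid bytes = do
  -- The "//IGNORE" suffix selects the ignore mode for coding failures.
  enc <- mkTextEncoding "UTF-8//IGNORE"
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)
```

Here decodeDroppingInvalid [0x61, 0xFF, 0x62] should come back as "ab": the stray 0xFF byte disappears, with no ? or nonsense character put in its place.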

comment:8 Changed 6 years ago by simonmar

Are you planning to make peekCString and friends do decoding by default? I have a horrible feeling that will break lots of things. I know it's what the FFI spec requires, but since we've never done it, changing the behaviour now could be surprising.

I've no objection to your proposal for unrepresentable chars, provided we document it appropriately.

comment:9 in reply to: 8 Changed 6 years ago by ross

Replying to simonmar:

Are you planning to make peekCString and friends do decoding by default? I have a horrible feeling that will break lots of things. I know it's what the FFI spec requires, but since we've never done it, changing the behaviour now could be surprising.

This behaviour has been specified by the FFI spec since 2002, and was incorporated into Haskell 2010. The documentation of the module has been promising this change since 2004, and in all that time the alternative CAString versions have been available, so it's probably not too hasty to implement it now. I fear you're right about breakage, but it has to happen some time.
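For anyone unsure what the difference amounts to, the sketch below reads the same NUL-terminated buffer with both families from Foreign.C.String. peekRaw and peekDecoded are hypothetical names. The result of peekDecoded depends on the locale in effect, and on whether the installed base already decodes in peekCString.

```haskell
import Data.Word (Word8)
import Foreign.C.String (peekCAString, peekCString)
import Foreign.C.Types (CChar)
import Foreign.Marshal.Array (withArray0)

-- | Read a byte buffer with the CAString variant: one Char per byte,
-- no decoding, independent of the locale.
peekRaw :: [Word8] -> IO String
peekRaw bytes =
  -- withArray0 appends the NUL terminator that the peek functions expect.
  withArray0 0 (map fromIntegral bytes :: [CChar]) peekCAString

-- | Read the same kind of buffer with peekCString, which is the function
-- the FFI spec says should decode according to the locale.
peekDecoded :: [Word8] -> IO String
peekDecoded bytes =
  withArray0 0 (map fromIntegral bytes :: [CChar]) peekCString
```

For the UTF-8 bytes of é, peekRaw [0xC3, 0xA9] is the two-character string "\195\169" everywhere. Code that relies on that byte-per-character behaviour is the code that would have to move to the CAString functions.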

comment:10 in reply to:  9 Changed 6 years ago by simonmar

Replying to ross:

This behaviour has been specified by the FFI spec since 2002, and was incorporated into Haskell 2010. The documentation of the module has been promising this change since 2004, and in all that time the alternative CAString versions have been available, so it's probably not too hasty to implement it now. I fear you're right about breakage, but it has to happen some time.

Then I fear we will all need to brace for impact before the next major release :-)

comment:11 Changed 6 years ago by Athas

Has there been any further work on this issue? I'm willing to help out (with testing/hacking) if necessary.

comment:12 Changed 6 years ago by igloo

Owner: set to batterseapower
Priority: normal → high

If we're going to do this, we should do it as soon as possible; as ross says, any breakage has to happen some time, and it's only going to get worse if we leave it. So I'll make it high priority for 7.2.1.

batterseapower, are you happy to take the lead on this?

comment:13 Changed 6 years ago by batterseapower

Yes, I am going to get my patches in - I was away in China for 2 weeks or I would have already moved this forward.

comment:14 Changed 6 years ago by batterseapower

Resolution: fixed
Status: new → closed