Opened 3 years ago

Closed 3 years ago

#5061 closed bug (fixed)

Implement FFI spec behaviour for *CString family

Reported by: batterseapower Owned by:
Priority: normal Milestone: 7.2.1
Component: Compiler Version: 7.0.3
Keywords: Cc: pho@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Although the FFI spec requires the *CString functions in Foreign.C.String to use the locale encoding to interpret the supplied CString, the implementation currently uses ASCII.

I have implemented this feature, and at the same time have changed the behaviour of the encoder upon encountering non-decodable characters to silently ignore them. The rationale behind this is just that the previously-specified behaviour (replace them with ?) was in fact also unimplemented and just ignoring them is marginally easier.

Here are some discussion points:

  1. It seems a shame not to expose a general interface to peek CStrings in any supported TextEncoding?.
  1. I'm not a fan of either the "ignore" error handling behaviour or the "replace with ?" behaviour. In my opinion we should throw an exception upon encoding failure because how to recover in this situation in general will depend on the user application
  1. I could implement the replace-with-? error handling behaviour with modest extra effort, if it is deemed necessary.
  1. To ensure that this patch does not change the behaviour of GHC in any way, I replaced every instance of a *CString function with a call to the CAString equivalent, and marked the source with a comment of the form "-- UNICODE". The intention is that if and when this patch is accepted I will then go back and figure out what is really going on in each case and choose the correct function to call.
  1. Some of the occurrences of CString in my GHC repo came from projects with a distinct upstream, such as Cabal. Should I be submitting these patches upstream rather than here?

I note that if we did expose a version of the CString functions that took a TextEncoding?, it would then be easy for the user to decode ignoring errors instead, because they could simply supply a TextEncoding? with different error-handling behaviour.

I have validated this patch on Windows and OS X, and not seen any reproducible failures above and beyond the usual set.

Attachments (1)

5061.zip (174.6 KB) - added by batterseapower 3 years ago.

Download all attachments as: .zip

Change History (16)

Changed 3 years ago by batterseapower

comment:1 follow-up: Changed 3 years ago by batterseapower

See also #3309 #3308 #3307

comment:2 in reply to: ↑ 1 ; follow-up: Changed 3 years ago by simonmar

Replying to batterseapower:

It seems a shame not to expose a general interface to peek CStrings in any supported TextEncoding

I have some work-in-progress to provide a generic interface to encoding/decoding with TextEncoding, I'll try to get it finished up sometime.

comment:3 in reply to: ↑ 2 ; follow-up: Changed 3 years ago by batterseapower

Replying to simonmar:

I have some work-in-progress to provide a generic interface to encoding/decoding with TextEncoding, I'll try to get it finished up sometime.

I have such functions as part of my patch, I just always call them with the first parameter being localeEncoding/localeEncoding_ignore. I have tested them and they seem to work but I'm not sure if they are what you have in mind here, and they could probably be faster and more efficient with memory.

comment:4 in reply to: ↑ 3 Changed 3 years ago by simonmar

Replying to batterseapower:

I have such functions as part of my patch, I just always call them with the first parameter being localeEncoding/localeEncoding_ignore. I have tested them and they seem to work but I'm not sure if they are what you have in mind here, and they could probably be faster and more efficient with memory.

That's fine then, just go ahead and use yours.

comment:5 Changed 3 years ago by batterseapower

  • Status changed from new to patch

comment:6 Changed 3 years ago by batterseapower

I see that this is a dup of #1414, so this issue has been around for a while.

How about this for a plan:

  • Implement PEP 383 (surrogate byte encodings)
  • Use the locale encoding w/ PEP 383 behaviour for encoding/decoding stuff in withCString by default. This has the following knock-on effects:
  1. File names will be decoded using the current locale on all platforms, insofar as that is possible (#3307, maybe more)
  1. Command line arguments from getArgs will be decoded in the same manner (#3309, #3308)
  1. Command line arguments to spawned processes will be encoded with current locale (#4006), but the surrogate byte stuff means that filenames supplied in such a command line won't be mangled

In the glorious future where everyones machines use UTF-8 locales and UTF-8 filenames everywhere, surrogate bytes will never appear and we will be able to turn off the surrogate bytes mechanism without anyone noticing, and all Strings will be decoded just as people expect.

I can probably take the lead on implementing this if you think it is a good strategy.

comment:7 Changed 3 years ago by igloo

Will there still be a way to get the actual bytes that were passed?

comment:8 Changed 3 years ago by batterseapower

With PEP-383 behaviour, yes: if you decode the string using the locale decoding (and handling surrogates by turning them back into bytes) you should get exactly the same bytes as originally came in.

The only downside is you need to *en*code with the same method as was originally used to *de*code. So if the command line arguments were decoded e.g. Shift-JIS there would be no guarantee that encoding them back to bytes with UTF-8 will work. However, if we stick to the policy of always decoding using the locale encoding, the user will be able to reliably use the locale encoding to get raw bytes out of a String that exactly match what we originally decoded.

comment:9 Changed 3 years ago by igloo

I'm not sure if I understood that. What I mean is, if we are passed a sequence of bytes which is not valid in the current locale, is there still a way in which we can get it as a sequence of bytes ([Word8])?

comment:10 Changed 3 years ago by batterseapower

Sorry if I was unclear. The one word answer is: yes

comment:11 Changed 3 years ago by PHO

  • Cc pho@… added

comment:12 Changed 3 years ago by batterseapower

For those interested I am collecting relevant tickets at http://hackage.haskell.org/trac/ghc/wiki/Status/Encoding-Tickets and have some code on the "encoding" branch of the base library.

comment:13 Changed 3 years ago by simonmar

I support the approach in principle, go for it Max.

comment:14 Changed 3 years ago by igloo

  • Milestone set to 7.2.1

comment:15 Changed 3 years ago by batterseapower

  • Resolution set to fixed
  • Status changed from patch to closed

Fixed by commit 509f28cc93b980d30aca37008cbe66c677a0d6f6 to base.

Note: See TracTickets for help on using tickets.