Opened 2 years ago

Last modified 4 months ago

#11312 new bug

GHC inlining primitive string literals can affect program output

Reported by: RyanGlScott Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.3
Keywords: strings Cc: ekmett, core-libraries-committee
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Test Case:
Blocked By: Blocking:
Related Tickets: #11292, #5218 Differential Rev(s):
Wiki Page:

Description

First noted in #11292, this program, when compiled with -O1 or higher:

{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.Exts (Addr#, isTrue#)
import GHC.Prim (eqAddr#)

data A = A { runA :: Addr# }

a :: A
a = A "a"#

main :: IO ()
main = print (isTrue# (eqAddr# (runA a) (runA a)))

will result in the following after inlining:

Main.main2 =
  case GHC.Prim.tagToEnum#
         @ GHC.Types.Bool (GHC.Prim.eqAddr# "a"# "a"#)
  of _ [Occ=Dead] {
    GHC.Types.False -> GHC.Show.shows26;
    GHC.Types.True -> GHC.Show.shows24
  }

As a result, there are two of the same string constant with different addresses, which causes eqAddr# to return False. If compiled without optimizations, "a"# is not inlined, and as a result, eqAddr# returns True.

Two questions:

  1. Is this okay semantics-wise? Or is this a necessary risk when working with primitive string literals, and should programmers judiciously use {-# NOINLINE #-} with them?
  2. Is this okay from a code duplication standpoint? As Reid Barton noted in #11292, "a"# is duplicated due to inlining. In this example, not much is duplicated, but if it were a longer string constant, that could result in a noticeable increase in the object file size.

Change History (13)

comment:1 Changed 2 years ago by simonpj

Anyone who uses eqAddr# to compare string literals is asking for trouble. Really it should be harder to do. eqAddr# is a respectable function, but it is NOT respectable that "foo"# :: Addr#. We should have String#, and operations to compare the strings.

Meanwhile, see #8472 for a better approach to sharing string literals.

comment:2 Changed 2 years ago by RyanGlScott

To collect some of the comments you've made in #8472 and #11292, is this what you want to be done?

  1. Introduce a new String# type, and make primitive string literals their values, e.g., "foo"# :: String#.
  2. Allow top-level String#s, e.g.,
a::Addr# = "foo"#

but not top-level String# computations, which would entail:

  • Modify the test isUnLiftedType ty in SetLevels.lvlMFE, which stops unlifted things getting floated to top level.
  • Similarly Simplify.bindingOk.
  • Make CmmLint check the new invariant.
  • The STG->Cmm code generator would need to generate some suitable CmmData stuff.
  1. Introduce primitive operations that manipulate String#s (e.g., a Haskell equivalent of strcmp)

comment:3 Changed 2 years ago by simonpj

Cc: ekmett core-libraries-committee added

Let's separate two things:

  • Top-level unboxed string literals: #8472
  • Not using Addr# for string literals: this ticket.

Here's my summary for this ticket, after talking to Simon M.

  • It's plain wrong to use Addr# as the type of a string literal. If we do so, there is no reliable way to compute equality for
    data T = MkT Addr# deriving( Eq )
    
    Since the Addr# might come from malloc or something, it must compare using equality on Addr#. But then there is no guarantee that MkT "foo"# == MkT "foo"#.
  • So we need a new type for unlifted string literals, say String#. It could be primitive, and that's what I'll assume for now.
  • Of course the underlying representation will be the same as Addr#. But there should be no operation get :: String# -> Addr# (except maybe in the IO monad), else it'd possible that get "foo"# might be not-equal to get "foo"#.
  • What operations do we need on String#? Presumably at least
    eqString#    :: String# -> String# -> Int#   -- Like eqChar#
    cmpString#   :: String# -> String# -> Int#   -- 3-way compare
    lenString#   :: String# -> Int#              -- Number of chars
    indexString# :: String# -> Int# -> Char#   -- Get the ith char
    
  • NB: I'm deliberately not saying that the string is null-terminated. That's be up to the implementation of String#, provided it offered the above operations. A better representation might be a record of a length and a blob of bytes.
  • Could String# simply be a ByteArray#?
    type String# = ByteArray#
    
    After all, ByteArray# already has primops sizeOfByteArray# and indexCharArray#. We'd just need a way to have a statically-allocated ByteArray#, but that would be an excellent thing anyway. e.g. I believe that Happy mis-uses literal strings to allow it to build a statically-allocated array. Avoiding yet another primitive type would be a relief.

See also #5218 and #9577

  • I'm not sure about how Unicode plays with all of this.
  • This would be a potentially breaking change for any code using unboxed string literals. I'm copying the Core Libraries Committee

I'd love someone to take this up. there

comment:4 Changed 2 years ago by ekmett

From a CLC perspective, I'd consider anything involving unboxed string literals to fall under the purview of the GHC developers.

It is a very compiler-specific detail that doesn't leak into much user visible code, so whatever you think is cleanest should be good enough for us.

As for the lack of an operation for String# -> Addr# that gets a bit messier.

We already have

byteArrayContents# :: ByteArray# -> Addr#

replete with the caveat that it should only be used with pinned byte arrays and with the suggestion that String# = ByteArray# these are effectively pre-pinned byte array literals.

The existence of that primitive wasn't an issue until now, but becomes one in the presence of unboxed string literals.

comment:5 Changed 22 months ago by xnyhps

Fyi, the patch that I've been working on for #9577 already fixes this on Linux: the patch marks string literals as "mergeable" asm sections, so the assembler merges them and makes the addresses identical.

I don't mean to say that introducing String# isn't an improvement. I'll try to have a look.

comment:6 Changed 22 months ago by xnyhps

I hacked around with this a bit, these are my findings:

  • ByteArray# is represented as StgArrBytes, which includes a 1 word header and 1 word to indicate the size and is word aligned. This may be very convenient for things that want to use static ByteArray#s, but for usual string literals it sounds quite wasteful: on x86-64 it would require 24 bytes of padding to store a one-character string, instead of 2 for a zero-terminated string.

(The alignment changes of #9577 would conflict with this, but it should still be possible to mark the sections as mergeable.)

Considering that, and the existence of byteArrayContents#, I think it would be preferable to introduce a new primitive type String# that is used only for string literals.

  • Whichever different type ""# gets, bootstrapping the compiler becomes tricky: the template for both Alex and Happy defines ""# literals. What makes it worse is that this means stage 1 and stage 2 need a different template.

comment:7 Changed 15 months ago by simonpj

See #12585 for another report of this bug.

comment:8 Changed 14 months ago by hvr

See #12757 for yet another instance/report, this time affecting pre-release GHC 8.0.2 snapshots

comment:9 Changed 14 months ago by simonpj

And #12757 shows that having literal strings be of type Addr# can lead to seg-faults. Yikes. We really should fix this.

comment:10 Changed 14 months ago by duncan

From the point of view of the bytestring package, what we would like is an O(1) way to convert from a constant/literal string to a pinned ByteArray# since we can build a ByteString from one of those.

You may say why not just convert to a raw Addr# given that ByteString uses ForeignPtr which can use pointers. Well, long term we'd like to eliminate the use of ForeignPtr and use only pinned or unpinned ByteArray#s, and similarly in the long term Text will switch to UTF8 and it already uses ByteArray# not ForeignPtr. So long term, to have O(1) Text or ByteString literals we'd need to be able to convert to ByteArray#.

Last edited 14 months ago by duncan (previous) (diff)

comment:11 Changed 14 months ago by simonpj

Fair enough, but there are then three separate matters here

  1. Top-level unboxed string literals: #8472
  2. Disentangling Addr# and String# along the lines of comment:3 (this ticket)
  3. Some kind of support for ByteArray# literals, as you ask.

Perhaps (2) and (3) are related, beause perhaps a String# can be a static ByteArray#?

Simon

comment:12 Changed 5 months ago by bgamari

comment:13 Changed 4 months ago by bgamari

Keywords: strings added
Note: See TracTickets for help on using tickets.