8.2.1 regression: heap corruption after safe foreign calls

changed weight to 10

Attached file ghc_output (download).

compiler output with -v

For whatever reason, I'm not able to reproduce this on my Ubuntu 14.04 or 17.04 machine (both with GHC 8.2.1). I'm doing this:

$ ghc -fforce-recomp -threaded bug.c Bug.hs
[1 of 1] Compiling Main             ( Bug.hs, Bug.o )
Linking Bug ...
$ ./Bug

It then proceeds to run forever (AFAICT) without hitting any value mistmatches or segfaults.

Some questions:

What operating system are you using?
How can I reproduce this issue //with just GHC//? Please, no instructions involving fancy build tools like stack, since if this really is a GHC bug, one should be able to trigger the issue with just GHC.

Thanks for your report andrewchan; unfortunately, as with RyanGlScott, I am unable to reproduce this with +RTS -N4, +RTS -N1, or under any of GHC's optimization levels on Debian 9 running on amd64. Having a standalone testcase, free of build tools, would be quite helpful.

I am running arch linux x64 on a i5-4200U laptop.

I'm able to reproduce with just the system ghc:

ghc Main.hs test.c -threaded -O1 -fforce-recomp
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...

For me the program fails within seconds:

$ time ./Main
value mismatch
value mismatch
value mismatch
value mismatch
zsh: segmentation fault (core dumped)  ./Main
./Main  2.19s user 0.20s system 67% cpu 3.553 total

I'm also able to reproduce the issue in a fedora virtual machine on the same physical machine using ghc 8.2.1 binaries downloaded from haskell.org.

I can reproduce this same issue on my machine, I am using: NixOS x86_64 Unstable Branch (as of October 12th 2017) GHC 8.2.1 Binutils 2.28.1 GCC 6.4.0

I noticed the bug does not occur and the program runs infinitely if I simply compile with 'ghc Main.hs test.c -threaded -o Bug', however, if Optimization level 1 or 2 are enabled, the bug happens very quickly after running the binary.

Also, I forgot to add, the bug does not occur when compiled with debug symbols (-g).

Interesting; I can also reproduce this in my Nix unstable VM.

Somewhat to my surprise, this regression was introduced in 8d5cf8bf (Join points).

Well, ticket:14346#comment:143804 certainly explains why -g avoids the crash: in 8.2 source note ticks essentially prevented GHC from marking anything as a join point.

I could have sworn I left a comment last night but it seems I am mistaken. Here is what I discovered while looking into this so far:

The test is indeed rather environment sensitive. Moreover, as it doesn't occur under rr I strongly suspect it's a race of some sort. When compiled with -debug the eventual segmentation fault always seems to occur in stg_putMVarzh. Specifically here,

Dump of assembler code for function stg_putMVarzh:
   0x00000000004ab1b0 <+0>:	cmpl   $0x1,0x4f4800
   0x00000000004ab1b8 <+8>:	je     0x4ab35e <stg_putMVarzh+430>
   0x00000000004ab1be <+14>:	mov    $0x479e18,%eax
   0x00000000004ab1c3 <+19>:	mov    %rbx,%rcx
   0x00000000004ab1c6 <+22>:	sub    $0x8,%rsp
   0x00000000004ab1ca <+26>:	mov    %rcx,%rdi
   0x00000000004ab1cd <+29>:	mov    %rax,%rcx
   0x00000000004ab1d0 <+32>:	xor    %eax,%eax
   0x00000000004ab1d2 <+34>:	callq  *%rcx
   0x00000000004ab1d4 <+36>:	add    $0x8,%rsp
   0x00000000004ab1d8 <+40>:	cmpq   $0x4f2c30,0x18(%rbx)
   0x00000000004ab1e0 <+48>:	jne    0x4ab366 <stg_putMVarzh+438>
   0x00000000004ab1e6 <+54>:	mov    0x8(%rbx),%rcx
   0x00000000004ab1ea <+58>:	cmp    $0x4f2c30,%rcx
   0x00000000004ab1f1 <+65>:	je     0x4ab466 <stg_putMVarzh+694>
   0x00000000004ab1f7 <+71>:	cmpq   $0x4ac2f0,(%rcx)
   0x00000000004ab1fe <+78>:	je     0x4ab45d <stg_putMVarzh+685>
   0x00000000004ab204 <+84>:	cmpq   $0x4aca20,(%rcx)
   0x00000000004ab20b <+91>:	je     0x4ab45d <stg_putMVarzh+685>
   0x00000000004ab211 <+97>:	mov    0x10(%rcx),%rdx
   0x00000000004ab215 <+101>:	mov    0x8(%rcx),%rsi
   0x00000000004ab219 <+105>:	mov    %rsi,0x8(%rbx)
   0x00000000004ab21d <+109>:	cmpq   $0x4f2c30,0x8(%rbx)
   0x00000000004ab225 <+117>:	jne    0x4ab22f <stg_putMVarzh+127>
   0x00000000004ab227 <+119>:	movq   $0x4f2c30,0x10(%rbx)
=> 0x00000000004ab22f <+127>:	cmp    0x28(%rdx),%rbx
   0x00000000004ab233 <+131>:	je     0x4ab276 <stg_putMVarzh+198>
   ...

I believe this corresponds to this bit of C--,

    ...

    tso = StgMVarTSOQueue_tso(q);
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) { // cmpq   $0x4f2c30,0x8(%rbx)
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure; // movq   $0x4f2c30,0x10(%rbx)
    }

    ASSERT(StgTSO_block_info(tso) == mvar);             // cmp    0x28(%rdx),%rbx
    
    ...

Indeed we find that,

>>> print/a $rbx
$1 = 0x42000b8400
>>> print/a $rdx
$2 = 0x42deadbeef

Yikes!

This sounds to me like we reentered STG while forgetting to do some bit of cleanup from the foreign call.

I managed to do a rr capture with --chaos mode.

Here's the part in main where it does the comparison:

0x404581 <Main_main1_info+361>  mov    ecx,DWORD PTR [rax]
0x404583 <Main_main1_info+363>  cmp    rcx,rbx                       // compares value with 0xDEADBEEF
0x404586 <Main_main1_info+366>  jne    0x40443c <Main_main1_info+36> // goes to print "value mismatch"

(rr) p/x $rcx
$22 = 0x1
(rr) p/x $rbx
$23 = 0xdeadbeef
(rr) p/x $rax
$24 = 0x42000b7540

Putting a watch point on the the memory address and reverse continuing leads to this:

Old value = 1
New value = -559038737
0x0000000000470b42 in base_GHCziEventziPoll_new5_info ()
=> 0x0000000000470b42 <base_GHCziEventziPoll_new5_info+1218>:   49 89 04 24     mov    QWORD PTR [r12],rax

(rr) p/x $r12
$27 = 0x42000b7540

Not sure what's going on there, but I hope this is of some help.

Indeed the --chaos tip is quite helpful. Thanks!

So it appears that the crazy TSO is loaded in stg_putMVar# on line 1737:

    ...
    // There are readMVar/takeMVar(s) waiting: wake up the first one

    tso = StgMVarTSOQueue_tso(q);                                // <--- here
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
    }
...

Here q is 0x42000b7530 which is a fairly reasonable-looking MVAR_TSO_QUEUE, except with a completely wild tso field,

0x42000b7530:	0x4acd28 <stg_MVAR_TSO_QUEUE_info>	0x4f2c30
0x42000b7540:	0x42deadbeef	0x433508 <base_GHCziInt_I32zh_con_info>

Indeed the last guy to write to StgMVarTSOQueue_tso(q) is the FFI target, test,

Dump of assembler code for function test:
=> 0x00000000004044f0 <+0>:	movl   $0xdeadbeef,(%rdi)
   0x00000000004044f6 <+6>:	retq

where %rdi == 0x00000042000b7540.

Let's look at the calling sequence produced by GHC,

_c4Rp:
        movq $block_c4Ru_info,-8(%rbp) # I64[Sp - 8] = c4Ru;
        movq %rax,(%rbp)               # I64[Sp] = _s4Ok::I64;
        addq $-8,%rbp                  # Sp = Sp - 8;
        movq 872(%r13),%rbx            # _u4RJ::P64 = CurrentTSO;
        movq 24(%rbx),%rcx             # I64[I64[_u4RJ::P64 + 24] + 16] = Sp;
        movq %rbp,16(%rcx)
        movq 888(%r13),%rcx            # _u4RK::I64 = CurrentNursery;
        leaq 8(%r12),%rdx              # P64[_u4RK::I64 + 8] = Hp + 8;
            # I64[_u4RJ::P64 + 104] = I64[_u4RJ::P64 + 104] 
            #                         - ((Hp + 8) - I64[_u4RK::I64]);
        movq %rdx,8(%rcx)              
        leaq 8(%r12),%rdx
        subq (%rcx),%rdx
        movq 104(%rbx),%rcx
        subq %rdx,%rcx
        movq %rcx,104(%rbx)

            # (_u4RH::I64) = call "ccall" arg hints:  [PtrHint,]
            #                             result hints:  [PtrHint]
            #                             suspendThread(BaseReg, 0);
        subq $8,%rsp                   # native-call stack adjustment
        movq %r13,%rdi                 # setup argument 1 (BaseReg)
        xorl %esi,%esi                 # setup argument 2 (0)
        movq %rax,%rbx                 # Save $rax in callee-saved register
        xorl %eax,%eax                 # No floating point arguments for this call
        call suspendThread
        addq $8,%rsp                   # undo stack adjustment
        subq $8,%rsp                   # redo stack adjustment; silly GHC
        movq %rbx,%rdi                 # ??? <--- This is where the bad argument comes from
        movq %rax,%rbx                 # Spill again? But we never actually unspilled it!
                                       # I think this is where we go wrong
        xorl %eax,%eax                 # No floating point arguments for this call
        call test                      # Native call
        addq $8,%rsp                   # undo stack adjustment
        subq $8,%rsp                   # you are such a joker, GHC
        movq %rbx,%rdi
        xorl %eax,%eax
        call resumeThread
        ...

It looks to me like what happens here is that we spill $rax (which contains a pointer to the current MVar closure) to $rbx twice, losing knowledge of the first spill. Consequently we end up passing the MVar as the argument to test. Hilarity ensues.

mentioned in issue #14359

mentioned in issue #14361 (closed)

On looking at this with fresh eyes, it seems that unfortunately my analysis from ticket:14346#comment:143885 is flawed; the movq %rbx,$rdi is completely correct. We spill to the callee-saved %rbx register before suspendThread and then more the value from %rbx to %rdi, which is where we expect the first argument to reside. The second spill is simply preserving _u4RH, which is still alive after the call to test.

Back to the drawing board. I think now I'll focus on catching the issue earlier in execution; namely, when we first get the value mismatch message.

You don't need the FFI to trigger the bug. You can replace:

foreign import ccall safe "test"
    c_test :: Ptr Word32 -> IO ()

with:

{-# NOINLINE c_test #-}
c_test :: Ptr Word32 -> IO ()
c_test ptr = poke ptr 0xDEADBEEF

Smaller reproducer:

-- ghc -Wall -fforce-recomp -O1 Test.hs -threaded
module Main where

import Control.Concurrent
import Control.Monad
import Data.Word
import Foreign.Marshal.Alloc
import Foreign.Storable
import Numeric

main :: IO ()
main = do
    replicateM_ 49 $ threadDelay 1
    allocaBytes 4 $ \p -> do
      forever $ do
        poke p (0xDEADBEEF :: Word32)
        threadDelay 10
        x <- peek p
        unless (x == 0xDEADBEEF) $ putStrLn (showHex x "")

To trigger the bug, we need at least **49** threadDelay executions **before** allocaBytes. Their duration doesn't matter.

Really interestingly replacing forever with replicateM_ 1000000000 doesn't trigger the bug anymore.

A bit of speculation: compiler sees that the touch# at the end of allocaBytes is unreachable due to forever, and so ignores it and allows the allocated are to be GC'ed.

I had a look at the cmm code for comment [ticket:14346#comment:144039] and I have some questions:

      c4CO: // global
          _s4Bx::I64 = R1 + 16; <---- assign _s4Bx the pointer to the first 
                                      byte of the bytearray 
                                      (16 byte = infotable ptr + length field )
          goto c4CW;
      c4CW: // global
          I32[_s4Bx::I64] = 3735928559 :: W32;
          (_s4BE::I64) = call "ccall" arg hints:  []  result hints:  [`signed'] rtsSupportsBoundThreads();
          if (_s4BE::I64 != 0) goto c4Dz; else goto c4DE;
      c4Dz: // global
          I64[Sp - 8] = block_c4Dx_info;
          R2 = Main.main2_closure+1;
          I64[Sp] = _s4Bx::I64;   <------ is it ok to store an address which clearly points 
                                          into heap allocated memory but doesn't point to 
                                          an info table? 
                                          
          Sp = Sp - 8;
          call GHC.Conc.Windows.threadDelay1_info(R2) returns to c4Dx, args: 8, res: 8, upd: 8;
      c4Dx: // global
          _s4Bx::I64 = I64[Sp + 8];
          goto c4D2;
      c4DE: // global
          I64[Sp - 8] = block_c4DD_info;
          R1 = 10;              <------- overwrite R1, R1 was our *only* 
                                         reference to the bytearray closure. 
          I64[Sp] = _s4Bx::I64;
          Sp = Sp - 8;
          call stg_delay#(R1) returns to c4DD, args: 8, res: 8, upd: 8;
      c4DD: // global
          _s4Bx::I64 = I64[Sp + 8];
          goto c4D2;
      c4D2: // global

The only reference to the ByteArray closure is in R1
_s4Bx points to the first byte in the byte array
In block c4DE R1 is overwritten.
The rtsSupportsBoundThreads is a ccall, don't we have to save R1 over these calls?

Maybe the garbage collector assumes the ByteArray is dead and collects it too early?

Edit: Ah, I missed comment 15. This might support andrewchens observations in comment 15.

Very good insights, alexbiehl and andrewchen. Indeed it looks indeed the GC is (correctly, given the code) concluding that the array is unreachable. Looking at the -dverbose-core2core output one sees that the touch# call is dropped during one of the simplifier passes (SimplMode {Phase = 0 [post-call-arity], inline, rules, eta-expand, case-of-case}). That is certainly the cause of the crash.

To answer, a few of your questions:

is it ok to store an address which clearly points into heap allocated memory but doesn't point to an info table?

In the above case the answer is probably yes. This pointer is saved as a field of a stack frame (namely a return frame for block_c4Dx_info). The info table for this frame likely declares this field as a non-pointer. Consequently it won't be traced by the GC. Of course, for this to be safe we do need to keep some reference to the ByteArray# itself. That is where we go wrong.

The rtsSupportsBoundThreads is a ccall, don't we have to save R1` over these calls?

This will happen when we lower the call in nativeGen. Cmm is platform/calling-convention independent and so we can't yet determine which registers need to be saved and which do not for a particular call.

It looks like to avoid this we will either need to teach the simplifier not to throw away otherwise dead continuations which contain some "important" primops (e.g. touch#) or mark allocaBytes as NOINLINE so the simplifier can't see the bad simplification (which seems to be how this is typically dealt with; e.g. see GHC.Compact.Serialized).

However, in general I wonder whether touch# is more unsafe than strictly necessary. It seems to me that for a tad of stack allocation you can get a much safer way to keep values alive. The trick is to introduce a primop,

with# :: a -> r -> r

When with# a cont is entered, the entry code will,

Push an StgWithFrame, a new sort of return frame which carries a reference to a, onto the stack
Enters cont

When cont returns, it will enter the entry code for StgWithFrame, which will simply pop itself and return. I believe this should be more robust against the simplifier; in particular, the present bug couldn't occur under this scheme.

So for instance, alloca would be changed from being (parapharsing to avoid obfuscating the point with state passing),

allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
allocaBytes (I# size) action = IO $ \ s0 ->
     case newPinnedByteArray# size s0      of { (# s1, mbarr# #) ->
     case unsafeFreezeByteArray# mbarr# s1 of { (# s2, barr#  #) ->
     let addr = Ptr (byteArrayContents# barr#) in
     case action addr     of { IO action' ->
     case action' s2      of { (# s3, r #) ->
     case touch# barr# s3 of { s4 ->
     (# s4, r #)
  }}}}}

We would instead have,

allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
allocaBytes (I# size) action = IO $ \ s0 ->
     case newPinnedByteArray# size s0      of { (# s1, mbarr# #) ->
     case unsafeFreezeByteArray# mbarr# s1 of { (# s2, barr#  #) ->
     let addr = Ptr (byteArrayContents# barr#) in
     case action addr                      of { IO action' ->
     with# barr# (action' s2)
  } } }

It seems to me that this does a much better job of capturing the real idea allocaBytes seeks.

The only potential issue with this proposal is float-out; I suspect we might need to say that we can't float out of a with# application. I'll need to sleep on that.

It looks like to avoid this we will either need to teach the simplifier not to throw away otherwise dead continuations which contain some "important" primops

Can you give an example to show what it is throwing away, and why that's bad? I don't get it yet. I have even forgotten why touch# exists.

Copying Simon Marlow.

Interesting, can someone boil down the transformation that dropped the touch#?

Simon: touch# is keeping the ByteArray# alive until after the action, in allocaBytes (see ticket:14346#comment:144131). The action itself doesn't keep the array alive, because it is working with the raw pointer, not the ByteArray#. This is how we allocate temporary memory for marshalling data between Haskell and C, because it's a lot faster to allocate memory on the Haskell heap than to use malloc() and free().

I imagine the simplifier has proven that action never returns and then dropped the case with the continuation containing the touch#. That seems like a reasonable thing to do.

I like @bgamari's alternative suggestion of with#, although we probably want it to be

with# :: a -> (State# s -> (# State# s, b #)) -> State# s -> (# State# s, b #)

otherwise the second argument must be a thunk (yuck).

I imagine the simplifier has proven that action never returns and then dropped the case with the continuation containing the touch#. That seems like a reasonable thing to do.

Correct. I believe it is Simplify.rebuildCall that is responsible for this. I also agree: this seems like a perfectly reasonable thing to do and it's not clear how exactly we would prevent this behavior in the case of touch# (since the touch# may be buried deep in the continuation).

changed milestone to %8.2.2

See D4110 for an implementation. Still need to add a test.

Trac metadata

Trac field	Value
Differential revisions	- → D4110

..for an implementation of with#, that is.

Indeed D4110 appears to fix the testcases from ticket:14346#comment:144039 and the ticket description.

mentioned in issue #14375 (closed)

Let's move discussion of with# to new ticket just for that purpose: #14375 (closed).

This ticket remains to track the bug reported in the Description.

Trac metadata

Trac field	Value
Differential revisions	D4110 → -

changed milestone to %8.4.1

Alright, I have marked allocaBytes and allocaBytesAligned as NOINLINE for 8.2.2. A more principled solution, in the form of #14375 (closed), coming in 8.4.1.

mentioned in issue #14403 (closed)

assigned to @bgamari

Trac metadata

Trac field	Value
BlockedBy	- → #14375 (closed)

closed

I'm going to close this since the serious correctness issue is resolved for now. A more efficient solution will come in 8.4 (#14375 (closed)).

Trac metadata

Trac field	Value
Resolution	Unresolved → ResolvedFixed
BlockedBy	#14375 (closed) → -

mentioned in issue #15260 (closed)

reopened

changed milestone to %8.4.4

Sadly 404bf05e never made into HEAD nor GHC 8.4 branch and #14375 (closed) hasn't been completed before GHC 8.4 release. Hence HEAD and 8.4.* suffer from this bug (see #15260 (closed)).

I've made a new diff reintroducing the workaround and adding a regression test: D5020.

Trac metadata

Trac field	Value
Version	8.2.1 → 8.4.3
Resolution	ResolvedFixed → Unresolved
Component	Runtime System → Compiler
Differential revisions	- → D5020

removed RTS label

changed milestone to %8.6.1

This issue seems quite serious to me and it's actually encountered in the wild (see the related ticket). bgamari, can we milestone this for the next release?

ticket:14346#comment:145503 says "correctness issue is resolved for now" but I don't see how. As far as I can see no commits were done for this ticket.

Trac metadata

Trac field	Value
Version	8.4.3 → 8.5
Related	- → #15260 (closed)

Replying to [ticket:14346#comment:157570 osa1]:

This issue seems quite serious to me and it's actually encountered in the wild (see the related ticket). bgamari, can we milestone this for the next release?

Could we also have a 8.4.4 release?

ticket:14346#comment:145503 says "correctness issue is resolved for now" but I don't see how. As far as I can see no commits were done for this ticket.

It was fixed in 8.2.2: https://git.haskell.org/ghc.git/commitdiff/404bf05ed3193e918875cd2f6c95ae0da5989be2

It was fixed in 8.2.2: https://git.haskell.org/ghc.git/commitdiff/404bf05ed3193e918875cd2f6c95ae0da5989be2

Ah, I guess the patch was only merged to 8.2.2 branch (not to master).

8.2.1 regression: heap corruption after safe foreign calls

Child items 0

Activity

8.2.1 regression: heap corruption after safe foreign calls

Relates to

Activity