Opened 2 months ago

Closed 3 weeks ago

#14346 closed bug (fixed)

8.2.1 regression: heap corruption after safe foreign calls

Reported by: andrewchen Owned by: bgamari
Priority: highest Milestone: 8.4.1
Component: Runtime System Version: 8.2.1
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Test case: (compile with ghc 8.2.1 and -threaded option)

module Main where

import Control.Concurrent
import Control.Monad
import Data.Word
import Foreign.Marshal.Alloc
import Foreign.Ptr
import Foreign.Storable

foreign import ccall safe "test"
    c_test :: Ptr Word32 -> IO ()

main :: IO ()
main = do
    replicateM_ 1000 $ threadDelay 1000
    _ <- forkIO $ forever $ threadDelay 100
    allocaBytes 4 $ \p -> forever $ do
        c_test p
        x <- peek p
        unless (x == 0xDEADBEEF) $ putStrLn "value mismatch"
void test(unsigned int *buf) {
    *buf = 0xDEADBEEF;
}

On my machine, it detects a few value mismatches before crashing with sigsegv.

$ time ./.stack-work/install/x86_64-linux-nopie/nightly-2017-10-10/8.2.1/bin/bug
value mismatch
value mismatch
value mismatch
value mismatch
zsh: segmentation fault (core dumped)  ./.stack-work/install/x86_64-linux-nopie/nightly-2017-10-10/8.2.1/bin/bug
./.stack-work/install/x86_64-linux-nopie/nightly-2017-10-10/8.2.1/bin/bug  2.11s user 0.25s system 66% cpu 3.543 total

I believe this is what is causing crashes in xmobar. See discussion: https://github.com/jaor/xmobar/issues/310. Note that the crash in xmobar still happens without -threaded option, while this example only breaks when compiled with -threaded.

Attachments (1)

ghc_output (17.5 KB) - added by andrewchen 2 months ago.
compiler output with -v

Download all attachments as: .zip

Change History (30)

Changed 2 months ago by andrewchen

Attachment: ghc_output added

compiler output with -v

comment:1 Changed 2 months ago by RyanGlScott

Status: newinfoneeded

For whatever reason, I'm not able to reproduce this on my Ubuntu 14.04 or 17.04 machine (both with GHC 8.2.1). I'm doing this:

$ ghc -fforce-recomp -threaded bug.c Bug.hs
[1 of 1] Compiling Main             ( Bug.hs, Bug.o )
Linking Bug ...
$ ./Bug 

It then proceeds to run forever (AFAICT) without hitting any value mistmatches or segfaults.

Some questions:

  1. What operating system are you using?
  2. How can I reproduce this issue with just GHC? Please, no instructions involving fancy build tools like stack, since if this really is a GHC bug, one should be able to trigger the issue with just GHC.
Last edited 2 months ago by RyanGlScott (previous) (diff)

comment:2 Changed 2 months ago by bgamari

Thanks for your report andrewchan; unfortunately, as with RyanGlScott, I am unable to reproduce this with +RTS -N4, +RTS -N1, or under any of GHC's optimization levels on Debian 9 running on amd64. Having a standalone testcase, free of build tools, would be quite helpful.

Last edited 2 months ago by bgamari (previous) (diff)

comment:3 Changed 2 months ago by andrewchen

I am running arch linux x64 on a i5-4200U laptop.

I'm able to reproduce with just the system ghc:

ghc Main.hs test.c -threaded -O1 -fforce-recomp
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...

For me the program fails within seconds:

$ time ./Main
value mismatch
value mismatch
value mismatch
value mismatch
zsh: segmentation fault (core dumped)  ./Main
./Main  2.19s user 0.20s system 67% cpu 3.553 total

I'm also able to reproduce the issue in a fedora virtual machine on the same physical machine using ghc 8.2.1 binaries downloaded from haskell.org.

comment:4 Changed 2 months ago by rezb1t

I can reproduce this same issue on my machine, I am using: NixOS x86_64 Unstable Branch (as of October 12th 2017) GHC 8.2.1 Binutils 2.28.1 GCC 6.4.0

I noticed the bug does not occur and the program runs infinitely if I simply compile with 'ghc Main.hs test.c -threaded -o Bug', however, if Optimization level 1 or 2 are enabled, the bug happens very quickly after running the binary.

Last edited 2 months ago by rezb1t (previous) (diff)

comment:5 Changed 2 months ago by andrewchen

Also, I forgot to add, the bug does not occur when compiled with debug symbols (-g).

comment:6 Changed 2 months ago by bgamari

Interesting; I can also reproduce this in my Nix unstable VM.

comment:7 Changed 2 months ago by RyanGlScott

Somewhat to my surprise, this regression was introduced in 8d5cf8bf584fd4849917c29d82dcf46ee75dd035 (Join points).

comment:8 Changed 2 months ago by bgamari

Well, comment:7 certainly explains why -g avoids the crash: in 8.2 source note ticks essentially prevented GHC from marking anything as a join point.

comment:9 Changed 2 months ago by bgamari

I could have sworn I left a comment last night but it seems I am mistaken. Here is what I discovered while looking into this so far:

The test is indeed rather environment sensitive. Moreover, as it doesn't occur under rr I strongly suspect it's a race of some sort. When compiled with -debug the eventual segmentation fault always seems to occur in stg_putMVarzh. Specifically here,

Dump of assembler code for function stg_putMVarzh:
   0x00000000004ab1b0 <+0>:	cmpl   $0x1,0x4f4800
   0x00000000004ab1b8 <+8>:	je     0x4ab35e <stg_putMVarzh+430>
   0x00000000004ab1be <+14>:	mov    $0x479e18,%eax
   0x00000000004ab1c3 <+19>:	mov    %rbx,%rcx
   0x00000000004ab1c6 <+22>:	sub    $0x8,%rsp
   0x00000000004ab1ca <+26>:	mov    %rcx,%rdi
   0x00000000004ab1cd <+29>:	mov    %rax,%rcx
   0x00000000004ab1d0 <+32>:	xor    %eax,%eax
   0x00000000004ab1d2 <+34>:	callq  *%rcx
   0x00000000004ab1d4 <+36>:	add    $0x8,%rsp
   0x00000000004ab1d8 <+40>:	cmpq   $0x4f2c30,0x18(%rbx)
   0x00000000004ab1e0 <+48>:	jne    0x4ab366 <stg_putMVarzh+438>
   0x00000000004ab1e6 <+54>:	mov    0x8(%rbx),%rcx
   0x00000000004ab1ea <+58>:	cmp    $0x4f2c30,%rcx
   0x00000000004ab1f1 <+65>:	je     0x4ab466 <stg_putMVarzh+694>
   0x00000000004ab1f7 <+71>:	cmpq   $0x4ac2f0,(%rcx)
   0x00000000004ab1fe <+78>:	je     0x4ab45d <stg_putMVarzh+685>
   0x00000000004ab204 <+84>:	cmpq   $0x4aca20,(%rcx)
   0x00000000004ab20b <+91>:	je     0x4ab45d <stg_putMVarzh+685>
   0x00000000004ab211 <+97>:	mov    0x10(%rcx),%rdx
   0x00000000004ab215 <+101>:	mov    0x8(%rcx),%rsi
   0x00000000004ab219 <+105>:	mov    %rsi,0x8(%rbx)
   0x00000000004ab21d <+109>:	cmpq   $0x4f2c30,0x8(%rbx)
   0x00000000004ab225 <+117>:	jne    0x4ab22f <stg_putMVarzh+127>
   0x00000000004ab227 <+119>:	movq   $0x4f2c30,0x10(%rbx)
=> 0x00000000004ab22f <+127>:	cmp    0x28(%rdx),%rbx
   0x00000000004ab233 <+131>:	je     0x4ab276 <stg_putMVarzh+198>
   ...

I believe this corresponds to this bit of C--,

    ...

    tso = StgMVarTSOQueue_tso(q);
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) { // cmpq   $0x4f2c30,0x8(%rbx)
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure; // movq   $0x4f2c30,0x10(%rbx)
    }

    ASSERT(StgTSO_block_info(tso) == mvar);             // cmp    0x28(%rdx),%rbx
    
    ...

Indeed we find that,

>>> print/a $rbx
$1 = 0x42000b8400
>>> print/a $rdx
$2 = 0x42deadbeef

Yikes!

This sounds to me like we reentered STG while forgetting to do some bit of cleanup from the foreign call.

comment:10 Changed 2 months ago by andrewchen

I managed to do a rr capture with --chaos mode.

Here's the part in main where it does the comparison:

0x404581 <Main_main1_info+361>  mov    ecx,DWORD PTR [rax]
0x404583 <Main_main1_info+363>  cmp    rcx,rbx                       // compares value with 0xDEADBEEF
0x404586 <Main_main1_info+366>  jne    0x40443c <Main_main1_info+36> // goes to print "value mismatch"
(rr) p/x $rcx
$22 = 0x1
(rr) p/x $rbx
$23 = 0xdeadbeef
(rr) p/x $rax
$24 = 0x42000b7540

Putting a watch point on the the memory address and reverse continuing leads to this:

Old value = 1
New value = -559038737
0x0000000000470b42 in base_GHCziEventziPoll_new5_info ()
=> 0x0000000000470b42 <base_GHCziEventziPoll_new5_info+1218>:   49 89 04 24     mov    QWORD PTR [r12],rax
(rr) p/x $r12
$27 = 0x42000b7540

Not sure what's going on there, but I hope this is of some help.

comment:11 Changed 2 months ago by bgamari

Indeed the --chaos tip is quite helpful. Thanks!

So it appears that the crazy TSO is loaded in stg_putMVar# on line 1737:

    ...
    // There are readMVar/takeMVar(s) waiting: wake up the first one

    tso = StgMVarTSOQueue_tso(q);                                // <--- here
    StgMVar_head(mvar) = StgMVarTSOQueue_link(q);
    if (StgMVar_head(mvar) == stg_END_TSO_QUEUE_closure) {
        StgMVar_tail(mvar) = stg_END_TSO_QUEUE_closure;
    }
...

Here q is 0x42000b7530 which is a fairly reasonable-looking MVAR_TSO_QUEUE, except with a completely wild tso field,

0x42000b7530:	0x4acd28 <stg_MVAR_TSO_QUEUE_info>	0x4f2c30
0x42000b7540:	0x42deadbeef	0x433508 <base_GHCziInt_I32zh_con_info>

Indeed the last guy to write to StgMVarTSOQueue_tso(q) is the FFI target, test,

Dump of assembler code for function test:
=> 0x00000000004044f0 <+0>:	movl   $0xdeadbeef,(%rdi)
   0x00000000004044f6 <+6>:	retq  

where %rdi == 0x00000042000b7540.

Let's look at the calling sequence produced by GHC,

_c4Rp:
        movq $block_c4Ru_info,-8(%rbp) # I64[Sp - 8] = c4Ru;
        movq %rax,(%rbp)               # I64[Sp] = _s4Ok::I64;
        addq $-8,%rbp                  # Sp = Sp - 8;
        movq 872(%r13),%rbx            # _u4RJ::P64 = CurrentTSO;
        movq 24(%rbx),%rcx             # I64[I64[_u4RJ::P64 + 24] + 16] = Sp;
        movq %rbp,16(%rcx)
        movq 888(%r13),%rcx            # _u4RK::I64 = CurrentNursery;
        leaq 8(%r12),%rdx              # P64[_u4RK::I64 + 8] = Hp + 8;
            # I64[_u4RJ::P64 + 104] = I64[_u4RJ::P64 + 104] 
            #                         - ((Hp + 8) - I64[_u4RK::I64]);
        movq %rdx,8(%rcx)              
        leaq 8(%r12),%rdx
        subq (%rcx),%rdx
        movq 104(%rbx),%rcx
        subq %rdx,%rcx
        movq %rcx,104(%rbx)

            # (_u4RH::I64) = call "ccall" arg hints:  [PtrHint,]
            #                             result hints:  [PtrHint]
            #                             suspendThread(BaseReg, 0);
        subq $8,%rsp                   # native-call stack adjustment
        movq %r13,%rdi                 # setup argument 1 (BaseReg)
        xorl %esi,%esi                 # setup argument 2 (0)
        movq %rax,%rbx                 # Save $rax in callee-saved register
        xorl %eax,%eax                 # No floating point arguments for this call
        call suspendThread
        addq $8,%rsp                   # undo stack adjustment
        subq $8,%rsp                   # redo stack adjustment; silly GHC
        movq %rbx,%rdi                 # ??? <--- This is where the bad argument comes from
        movq %rax,%rbx                 # Spill again? But we never actually unspilled it!
                                       # I think this is where we go wrong
        xorl %eax,%eax                 # No floating point arguments for this call
        call test                      # Native call
        addq $8,%rsp                   # undo stack adjustment
        subq $8,%rsp                   # you are such a joker, GHC
        movq %rbx,%rdi
        xorl %eax,%eax
        call resumeThread
        ...

It looks to me like what happens here is that we spill $rax (which contains a pointer to the current MVar closure) to $rbx twice, losing knowledge of the first spill. Consequently we end up passing the MVar as the argument to test. Hilarity ensues.

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:12 Changed 8 weeks ago by bgamari

On looking at this with fresh eyes, it seems that unfortunately my analysis from comment:11 is flawed; the movq %rbx,$rdi is completely correct. We spill to the callee-saved %rbx register before suspendThread and then more the value from %rbx to %rdi, which is where we expect the first argument to reside. The second spill is simply preserving _u4RH, which is still alive after the call to test.

Back to the drawing board. I think now I'll focus on catching the issue earlier in execution; namely, when we first get the value mismatch message.

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:13 Changed 8 weeks ago by hsyl20

You don't need the FFI to trigger the bug. You can replace:

foreign import ccall safe "test"
    c_test :: Ptr Word32 -> IO ()

with:

{-# NOINLINE c_test #-}
c_test :: Ptr Word32 -> IO ()
c_test ptr = poke ptr 0xDEADBEEF

comment:14 Changed 8 weeks ago by hsyl20

Smaller reproducer:

-- ghc -Wall -fforce-recomp -O1 Test.hs -threaded
module Main where

import Control.Concurrent
import Control.Monad
import Data.Word
import Foreign.Marshal.Alloc
import Foreign.Storable
import Numeric

main :: IO ()
main = do
    replicateM_ 49 $ threadDelay 1
    allocaBytes 4 $ \p -> do
      forever $ do
        poke p (0xDEADBEEF :: Word32)
        threadDelay 10
        x <- peek p
        unless (x == 0xDEADBEEF) $ putStrLn (showHex x "")

To trigger the bug, we need at least 49 threadDelay executions before allocaBytes. Their duration doesn't matter.

Last edited 8 weeks ago by hsyl20 (previous) (diff)

comment:15 Changed 8 weeks ago by andrewchen

Really interestingly replacing forever with replicateM_ 1000000000 doesn't trigger the bug anymore.

A bit of speculation: compiler sees that the touch# at the end of allocaBytes is unreachable due to forever, and so ignores it and allows the allocated are to be GC'ed.

comment:16 Changed 8 weeks ago by alexbiehl

I had a look at the cmm code for comment 14 and I have some questions:

      c4CO: // global
          _s4Bx::I64 = R1 + 16; <---- assign _s4Bx the pointer to the first 
                                      byte of the bytearray 
                                      (16 byte = infotable ptr + length field )
          goto c4CW;
      c4CW: // global
          I32[_s4Bx::I64] = 3735928559 :: W32;
          (_s4BE::I64) = call "ccall" arg hints:  []  result hints:  [`signed'] rtsSupportsBoundThreads();
          if (_s4BE::I64 != 0) goto c4Dz; else goto c4DE;
      c4Dz: // global
          I64[Sp - 8] = block_c4Dx_info;
          R2 = Main.main2_closure+1;
          I64[Sp] = _s4Bx::I64;   <------ is it ok to store an address which clearly points 
                                          into heap allocated memory but doesn't point to 
                                          an info table? 
                                          
          Sp = Sp - 8;
          call GHC.Conc.Windows.threadDelay1_info(R2) returns to c4Dx, args: 8, res: 8, upd: 8;
      c4Dx: // global
          _s4Bx::I64 = I64[Sp + 8];
          goto c4D2;
      c4DE: // global
          I64[Sp - 8] = block_c4DD_info;
          R1 = 10;              <------- overwrite R1, R1 was our *only* 
                                         reference to the bytearray closure. 
          I64[Sp] = _s4Bx::I64;
          Sp = Sp - 8;
          call stg_delay#(R1) returns to c4DD, args: 8, res: 8, upd: 8;
      c4DD: // global
          _s4Bx::I64 = I64[Sp + 8];
          goto c4D2;
      c4D2: // global
  • The only reference to the ByteArray closure is in R1
  • _s4Bx points to the first byte in the byte array
  • In block c4DE R1 is overwritten.
  • The rtsSupportsBoundThreads is a ccall, don't we have to save R1 over these calls?

Maybe the garbage collector assumes the ByteArray is dead and collects it too early?

Edit: Ah, I missed comment 15. This might support andrewchens observations in comment 15.

Last edited 8 weeks ago by alexbiehl (previous) (diff)

comment:17 Changed 8 weeks ago by bgamari

Very good insights, alexbiehl and andrewchen. Indeed it looks indeed the GC is (correctly, given the code) concluding that the array is unreachable. Looking at the -dverbose-core2core output one sees that the touch# call is dropped during one of the simplifier passes (SimplMode {Phase = 0 [post-call-arity], inline, rules, eta-expand, case-of-case}). That is certainly the cause of the crash.

To answer, a few of your questions:

is it ok to store an address which clearly points into heap allocated memory but doesn't point to an info table?

In the above case the answer is probably yes. This pointer is saved as a field of a stack frame (namely a return frame for block_c4Dx_info). The info table for this frame likely declares this field as a non-pointer. Consequently it won't be traced by the GC. Of course, for this to be safe we do need to keep some reference to the ByteArray# itself. That is where we go wrong.

The rtsSupportsBoundThreads is a ccall, don't we have to save R1` over these calls?

This will happen when we lower the call in nativeGen. Cmm is platform/calling-convention independent and so we can't yet determine which registers need to be saved and which do not for a particular call.

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:18 Changed 8 weeks ago by bgamari

It looks like to avoid this we will either need to teach the simplifier not to throw away otherwise dead continuations which contain some "important" primops (e.g. touch#) or mark allocaBytes as NOINLINE so the simplifier can't see the bad simplification (which seems to be how this is typically dealt with; e.g. see GHC.Compact.Serialized).

However, in general I wonder whether touch# is more unsafe than strictly necessary. It seems to me that for a tad of stack allocation you can get a much safer way to keep values alive. The trick is to introduce a primop,

with# :: a -> r -> r

When with# a cont is entered, the entry code will,

  1. Push an StgWithFrame, a new sort of return frame which carries a reference to a, onto the stack
  2. Enters cont

When cont returns, it will enter the entry code for StgWithFrame, which will simply pop itself and return. I believe this should be more robust against the simplifier; in particular, the present bug couldn't occur under this scheme.

So for instance, alloca would be changed from being (parapharsing to avoid obfuscating the point with state passing),

allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
allocaBytes (I# size) action = IO $ \ s0 ->
     case newPinnedByteArray# size s0      of { (# s1, mbarr# #) ->
     case unsafeFreezeByteArray# mbarr# s1 of { (# s2, barr#  #) ->
     let addr = Ptr (byteArrayContents# barr#) in
     case action addr     of { IO action' ->
     case action' s2      of { (# s3, r #) ->
     case touch# barr# s3 of { s4 ->
     (# s4, r #)
  }}}}}

We would instead have,

allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
allocaBytes (I# size) action = IO $ \ s0 ->
     case newPinnedByteArray# size s0      of { (# s1, mbarr# #) ->
     case unsafeFreezeByteArray# mbarr# s1 of { (# s2, barr#  #) ->
     let addr = Ptr (byteArrayContents# barr#) in
     case action addr                      of { IO action' ->
     with# barr# (action' s2)
  } } }

It seems to me that this does a much better job of capturing the real idea allocaBytes seeks.

The only potential issue with this proposal is float-out; I suspect we might need to say that we can't float out of a with# application. I'll need to sleep on that.

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:19 Changed 8 weeks ago by simonpj

Cc: simonmar added

It looks like to avoid this we will either need to teach the simplifier not to throw away otherwise dead continuations which contain some "important" primops

Can you give an example to show what it is throwing away, and why that's bad? I don't get it yet. I have even forgotten why touch# exists.

Copying Simon Marlow.

comment:20 Changed 8 weeks ago by simonmar

Interesting, can someone boil down the transformation that dropped the touch#?

Simon: touch# is keeping the ByteArray# alive until after the action, in allocaBytes (see comment:18). The action itself doesn't keep the array alive, because it is working with the raw pointer, not the ByteArray#. This is how we allocate temporary memory for marshalling data between Haskell and C, because it's a lot faster to allocate memory on the Haskell heap than to use malloc() and free().

I imagine the simplifier has proven that action never returns and then dropped the case with the continuation containing the touch#. That seems like a reasonable thing to do.

I like @bgamari's alternative suggestion of with#, although we probably want it to be

with# :: a -> (State# s -> (# State# s, b #)) -> State# s -> (# State# s, b #)

otherwise the second argument must be a thunk (yuck).

comment:21 Changed 8 weeks ago by bgamari

I imagine the simplifier has proven that action never returns and then dropped the case with the continuation containing the touch#. That seems like a reasonable thing to do.

Correct. I believe it is Simplify.rebuildCall that is responsible for this. I also agree: this seems like a perfectly reasonable thing to do and it's not clear how exactly we would prevent this behavior in the case of touch# (since the touch# may be buried deep in the continuation).

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:22 Changed 8 weeks ago by bgamari

Differential Rev(s): Phab:D4110
Milestone: 8.2.2
Status: infoneededpatch

See Phab:D4110 for an implementation. Still need to add a test.

comment:23 Changed 8 weeks ago by simonpj

...for an implementation of with#, that is.

comment:24 Changed 8 weeks ago by bgamari

Indeed Phab:D4110 appears to fix the testcases from comment:14 and the ticket description.

Last edited 8 weeks ago by bgamari (previous) (diff)

comment:25 Changed 8 weeks ago by simonpj

Differential Rev(s): Phab:D4110

Let's move discussion of with# to new ticket just for that purpose: #14375.

This ticket remains to track the bug reported in the Description.

comment:26 Changed 7 weeks ago by bgamari

Milestone: 8.2.28.4.1

Alright, I have marked allocaBytes and allocaBytesAligned as NOINLINE for 8.2.2. A more principled solution, in the form of #14375, coming in 8.4.1.

comment:27 Changed 6 weeks ago by bgamari

Owner: set to bgamari

comment:28 Changed 6 weeks ago by bgamari

Blocked By: 14375 added

comment:29 Changed 3 weeks ago by bgamari

Blocked By: 14375 removed
Resolution: fixed
Status: patchclosed

I'm going to close this since the serious correctness issue is resolved for now. A more efficient solution will come in 8.4 (#14375).

Note: See TracTickets for help on using tickets.