Segfaults when using dynamic wrappers and concurrency
I had a largish program that sometimes segfaulted, with the segfault seemingly coming from the code that gets a C function pointer from a Haskell function.
After much sweat I've managed to produce a self-contained program that exhibits the same behavior:
bitonic@clay /tmp/ptr-crash % uname -a
Linux clay 3.13.0-48-generic #80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
bitonic@clay /tmp/ptr-crash % cabal configure --disable-library-profiling -w ghc-7.11.20150411
Resolving dependencies...
Configuring ptr-crash-0...
bitonic@clay /tmp/ptr-crash % cabal build
Building ptr-crash-0...
Preprocessing executable 'ptr-crash' for ptr-crash-0...
[1 of 1] Compiling Main ( Main.hs, dist/build/ptr-crash/ptr-crash-tmp/Main.o )
Linking dist/build/ptr-crash/ptr-crash ...
bitonic@clay /tmp/ptr-crash % strace -f -r -o strace-out ./dist/build/ptr-crash/ptr-crash +RTS -N2 -RTS
[1] 26612 segmentation fault (core dumped) strace -f -r -o strace-out ./dist/build/ptr-crash/ptr-crash +RTS -N2 -RTS
I'm running GHC HEAD on a 64-bit Linux machine. In the larger program, I'm pretty sure the segfaults happened on GHC 7.8.4 too, but currently I can reproduce it only on 7.10 and later.
More details (thanks to Sergei Trofimovich on #ghc for helping me investigate this):
- The segfault only happens when using -N2 or more.
- Curiously, the segfault seems to happen much more often when compiling the program with -g.
- The segfault doesn't happen every time; I get it roughly half of the time on my machine.
- stracing the program when it segfaults shows that all the threads crash together right after some calls to mremap. I've attached the end of the output of strace.
- gdbing the program and breaking on mremap shows that all the calls to mremap originate from getStablePtr. I've attached a run of gdb that shows this pattern.
- The segfault only happens with repeated calls to the dynamic wrapper and with certain timings, which explains the weird nature of the example (I kind of mimicked the behaviour of a C function we were calling from a proprietary C library). Note that the call to sum_arr is not really important and is there just so that some time is spent in the callback; the example works equally well if we convert the pointer to a Haskell vector and sum it from Haskell.
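For readers unfamiliar with the pattern in the last bullet, here is a minimal sketch of repeatedly creating and calling a wrapper from several threads. It is not the attached Main.hs: the callback type, the names mkCallback, callCallback and hammer, and the iteration count are all made up for illustration, and the C side (sum_arr and the array pointer) is left out. I have not checked that this reduced version triggers the segfault. As far as I understand, each "wrapper" call registers the Haskell closure as a stable pointer, which would be consistent with the getStablePtr frames seen in gdb.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where

import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, replicateM_, unless)
import Foreign.C.Types (CInt (..))
import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

-- Hypothetical callback type. The real program's callback received a
-- pointer to an array; a plain CInt keeps the sketch self-contained.
type Callback = CInt -> IO CInt

-- "wrapper" import: turns a Haskell function into a C function pointer.
foreign import ccall "wrapper"
  mkCallback :: Callback -> IO (FunPtr Callback)

-- "dynamic" import: calls a C function pointer from Haskell. This stands
-- in for the C library that would normally invoke the callback.
foreign import ccall "dynamic"
  callCallback :: FunPtr Callback -> Callback

-- Create, call, and free a wrapper the given number of times.
hammer :: Int -> IO ()
hammer iterations = replicateM_ iterations $ do
  fp <- mkCallback (\x -> return (x + 1))
  r <- callCallback fp 41
  freeHaskellFunPtr fp
  unless (r == 42) $ error "callback returned an unexpected value"

main :: IO ()
main = do
  caps <- getNumCapabilities
  -- Run at least two workers so that wrapper creation overlaps in time.
  dones <- forM [1 .. max 2 caps] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO (hammer 100000 >> putMVar done ())
    return done
  mapM_ takeMVar dones
  putStrLn "all workers finished"
```

To exercise more than one capability, this would need to be built with ghc -threaded and run with +RTS -N2 -RTS, as in the transcript above.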
Sergei had a hunch that this had to do with thread-unsafe calls to stgReallocBytes in enlargeStablePtrTable.