Opened 18 months ago

Closed 16 months ago

Last modified 16 months ago

#7361 closed bug (fixed)

Segmentation fault on 5f37e0c71ff4af8539c5aebc739b006b4f0c6ebf

Reported by: bgamari Owned by: simonmar
Priority: highest Milestone: 7.8.1
Component: Compiler Version: 7.7
Keywords: Cc: bos@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

I'm experiencing a segmentation fault in previously working code compiled with GHC master (5f37e0c71ff4af8539c5aebc739b006b4f0c6ebf). This occurs with several executables in bayes-stack[1] (BenchLDA and RunST).

The easiest reproduction case is BenchLDA, which can be run without any external data (although it isn't built by the cabal file). In the case of BenchLDA, gdb says that the crash is occurring in c6FD_info, which seems to fall within BayesStack.Models.Topic.Types, although -ddump-simpl doesn't seem to give any hints as to which function this is.

[1] http://github.com/bgamari/bayes-stack

Attachments (3)

Test.hs (602 bytes) - added by bgamari 17 months ago.
Slightly more minimal test case
Test2.hs (356 bytes) - added by bgamari 17 months ago.
A slightly more minimal test case
Test3.hs (460 bytes) - added by bgamari 16 months ago.
Test case using only mwc-random

Change History (37)

comment:1 Changed 18 months ago by bgamari

After playing around a bit with various build options and ensuring object and interface files were cleaned up, I managed to get some reasonable output from gdb (with -dcore-lint -debug).

The backtrace is,

Program received signal SIGSEGV, Segmentation fault.
0x00007fffec554b6c in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12002426983225102154) at includes/rts/storage/ClosureMacros.h:248
248         return info->type != INVALID_OBJECT && info->type < N_CLOSURE_TYPES;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.5.x86_64 gmp-4.3.1-7.el6_2.2.x86_64
(gdb) bt
#0  0x00007fffec554b6c in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=12002426983225102154) at includes/rts/storage/ClosureMacros.h:248
#1  0x00007fffec554bb7 in LOOKS_LIKE_INFO_PTR (p=12002426983225102154) at includes/rts/storage/ClosureMacros.h:253
#2  0x00007fffec554bec in LOOKS_LIKE_CLOSURE_PTR (p=0x7fffe624a418) at includes/rts/storage/ClosureMacros.h:258
#3  0x00007fffec55532e in evacuate (p=0x7fffe629a028) at rts/sm/Evac.c:371
#4  0x00007fffec55d049 in scavenge_mut_arr_ptrs (a=0x7fffe629a000) at rts/sm/Scav.c:125
#5  0x00007fffec55de28 in scavenge_block (bd=0x7fffe6202680) at rts/sm/Scav.c:617
#6  0x00007fffec55faae in scavenge_find_work () at rts/sm/Scav.c:1791
#7  0x00007fffec55fba9 in scavenge_loop () at rts/sm/Scav.c:1867
#8  0x00007fffec557db2 in scavenge_until_all_done () at rts/sm/GC.c:999
#9  0x00007fffec556c7e in GarbageCollect (collect_gen=1, do_heap_census=rtsFalse, gc_type=0, cap=0x7fffec791100) at rts/sm/GC.c:392
#10 0x00007fffec544289 in scheduleDoGC (pcap=0x7fffffffdff0, task=0x63fce0, force_major=rtsFalse) at rts/Schedule.c:1667
#11 0x00007fffec5437dc in schedule (initialCapability=0x7fffec791100, task=0x63fce0) at rts/Schedule.c:582
#12 0x00007fffec544bf3 in scheduleWaitThread (tso=0x7fffe6305390, ret=0x0, pcap=0x7fffffffe0d0) at rts/Schedule.c:2368
#13 0x00007fffec53e62a in rts_evalLazyIO (cap=0x7fffffffe0d0, p=0x624730, ret=0x0) at rts/RtsAPI.c:497
#14 0x00007fffec5412ff in real_main () at rts/RtsMain.c:63
#15 0x00007fffec5413f4 in hs_main (argc=1, argv=0x7fffffffe248, main_closure=0x624730, rts_config=...) at rts/RtsMain.c:114
#16 0x00000000004231ef in main (argc=1, argv=0x7fffffffe248) at /tmp/ghc9240_0/ghc9240_0.c:7

comment:2 Changed 18 months ago by bgamari

When compiling with -dcore-lint -debug -O, I see the following output from the compiler,

*** Core Lint warnings : in result of Desugar (after optimization) ***
{-# LINE 26 "BayesStack/Models/Topic/Types.hs #-}: Warning:
    [RHS of $c/=_a1cN :: BayesStack.Models.Topic.Types.Edge
                         -> BayesStack.Models.Topic.Types.Edge -> GHC.Types.Bool]
    INLINE binder is (non-rule) loop breaker: $c/=_a1cN

gdb also gives me a slightly different stack trace,

Program received signal SIGSEGV, Segmentation fault.
0x00007fffec55f136 in scavenge_mutable_list (bd=0x7fffe6202880, gen=0x6379e8) at rts/sm/Scav.c:1353
1353                switch (get_itbl((StgClosure *)p)->type) {
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.5.x86_64 gmp-4.3.1-7.el6_2.2.x86_64
(gdb) bt
#0  0x00007fffec55f136 in scavenge_mutable_list (bd=0x7fffe6202880, gen=0x6379e8) at rts/sm/Scav.c:1353
#1  0x00007fffec55f328 in scavenge_capability_mut_lists (cap=0x7fffec791100) at rts/sm/Scav.c:1427
#2  0x00007fffec556acb in GarbageCollect (collect_gen=0, do_heap_census=rtsFalse, gc_type=0, cap=0x7fffec791100) at rts/sm/GC.c:343
#3  0x00007fffec544289 in scheduleDoGC (pcap=0x7fffffffdff0, task=0x645ce0, force_major=rtsFalse) at rts/Schedule.c:1667
#4  0x00007fffec5437dc in schedule (initialCapability=0x7fffec791100, task=0x645ce0) at rts/Schedule.c:582
#5  0x00007fffec544bf3 in scheduleWaitThread (tso=0x7fffe6305390, ret=0x0, pcap=0x7fffffffe0d0) at rts/Schedule.c:2368
#6  0x00007fffec53e62a in rts_evalLazyIO (cap=0x7fffffffe0d0, p=0x62c260, ret=0x0) at rts/RtsAPI.c:497
#7  0x00007fffec5412ff in real_main () at rts/RtsMain.c:63
#8  0x00007fffec5413f4 in hs_main (argc=1, argv=0x7fffffffe248, main_closure=0x62c260, rts_config=...) at rts/RtsMain.c:114
#9  0x000000000042aa03 in main (argc=1, argv=0x7fffffffe248) at /tmp/ghc9328_0/ghc9328_0.c:7

comment:3 Changed 18 months ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to 7.8.1
  • Priority changed from normal to highest
  • Version set to 7.7

comment:4 Changed 18 months ago by bgamari

Just to ensure this detail isn't lost, I've been using BayesStack commit 6df883ec0ac8859cd7422ab2184be620fd72c63a to reproduce this issue.

Do folks think it would be practical to try bisecting this issue or would attempting this just be a waste of time?

comment:5 Changed 18 months ago by simonmar

  • Owner set to simonmar

This looks similar to a crash I'm currently investigating in tests/concurrent/prog001/concprog001 (threaded2). I found a couple of bugs today, but there seems to be at least one remaining. Does your program need -threaded and/or +RTS -N to fail?

comment:6 Changed 18 months ago by bgamari

No, the bug seems to be perfectly reproducible without -threaded.

comment:7 Changed 18 months ago by simonmar

I reproduced this today, but didn't get very far with diagnosing it because the code that fails is very complicated. Something is writing beyond the end of an ARR_WORDS, and the code blob that does it is in random-source:Data.Random.Source.MWC.$wa, but that function is huge and I have no idea what it is supposed to do. The error could well be somewhere else. We could add some bounds checks to the array primitives (maybe with -debug), though I think it probably wouldn't help me in this case.

I have some other bugs I'm chasing so hopefully this one will get fixed as a side-effect of fixing something else. Failing that, we will need a smaller test case.

comment:8 Changed 18 months ago by bgamari

Alright, thanks for trying! I can try narrowing down the test case soon.

comment:9 Changed 18 months ago by bgamari

Just for the record, my initial thought for a reproduction case (simply generating random variates with mwc-random through random-source) does not reproduce the crash. Looks like this could be a tricky crash to reproduce. Simon, I might wait a few days until some of the better understood crashes (if there is such a thing) are sorted out and reevaluate then.

comment:10 Changed 17 months ago by simonmar

Does this still happen for you? I've fixed the other outstanding bugs that I know about in the new code generator, so now would be a good time to test again.

comment:11 Changed 17 months ago by bgamari

I've been trying to get a ghc built but unfortunately the DYNAMIC_BY_DEFAULT change has really broken quite a bit it seems, particularly when profiling is enabled. I've documented a few of the issues here. That being said, I think I have something building now. I'll let you know when I have some results.

comment:12 Changed 17 months ago by bgamari

Unfortunately the segmentation fault is still reproducible on 6486213bc4ad307273956bc6164eeeb3f6f31d1c.

Changed 17 months ago by bgamari

Slightly more minimal test case

comment:13 Changed 17 months ago by bgamari

I've attached Test.hs which is a slightly more compact testcase. While it gets nowhere near isolating the root cause of the crash, at least you don't need to clone all of bayes-stack to test.

comment:14 Changed 17 months ago by simonmar

I did manage to get this built eventually (after much yak-shaving), but it doesn't crash for me any more.

> ./7361 
[Node 10,Node 57,Node 57,Node 80,Node 33,Node 47,Node 13,Node 58,Node 13,Node 44,Node 76,Node 13,Node 44,Node 47,Node 58,Node 80,Node 87,Node 15,Node 52,Node 90,Node 67,Node 29,Node 80,Node 87,Node 15,Node 52,Node 90,Node 67,Node 35,Node 41,Node 23,Node 26,Node 65,Node 89,Node 9,Node 58,Node 92,Node 13,Node 44,Node 47,Node 14,Node 51,Node 53,Node 86,Node 55,Node 98,Node 48,Node 10,Node 72,Node 12,Node 65,Node 49,Node 11,Node 9,Node 48,Node 1,Node 87,Node 71,Node 45,Node 7,Node 97,Node 84,Node 80,Node 87,Node 15,Node 52,Node 90,Node 67,Node 35,Node 41,Node 23,Node 26,Node 65,Node 89,Node 9,Node 77,Node 10,Node 36,Node 84,Node 10,Node 31,Node 46,Node 81,Node 49,Node 86,Node 1,Node 17,Node 94,Node 62,Node 3,Node 20,Node 99,Node 64,Node 8,Node 49,Node 10,Node 35,Node 84,Node 34,Node 34]

comment:15 Changed 17 months ago by bgamari

Oh dear. I do know that unfortunately the program is sensitive to the number of replications passed to replicate. With 30 iterations, the program crashes for me roughly half the time. With 28 it practically never crashes. Perhaps try raising this number?

Moreover, the case Test2.hs that I'm about to attach is slightly more minimal and also crashes. This demonstrates that, as expected, the newtypes don't contribute to the crash, nor does the (>>=); it's strictly a matter of how many numbers are generated.

Changed 17 months ago by bgamari

A slightly more minimal test case

comment:16 Changed 17 months ago by simonmar

  • Status changed from new to infoneeded

Tried again today, and it still doesn't reproduce for me on x86_64/Linux.

I'll need all the information about your specific setup to reproduce it: build.mk, exact version of every dependent package (I had to modify mersenne-random-pure64 to get it to build, removing the import of GHC.IOBase), and any relevant .cabal/config settings.

comment:17 Changed 16 months ago by bgamari

Sorry for the latency. I've tried to pull again and now sit on 4f7027d6947af9c5cdecc0c18097268594c4592b. Sadly the crash in Test2.hs is still reproducible for me. I should have pointed out the mersenne-random-pure64 incompatibility earlier. The only change I have made here is s/GHC.IOBase/GHC.IO/ as you point out.

I am also on x86_64/Linux. The exact dependencies (extracted from ghc -v) are,

random-fu-0.2.3.1
rvar-0.2.0.1
random-source-0.3.0.2
stateref-0.3
random-shuffle-0.0.4
mwc-random-0.12.0.1
monad-loops-0.3.3.0
stm-2.4.2
mersenne-random-pure64-0.2.0.3
old-time-1.1.0.1
gamma-0.9.0.2
vector-0.10.0.1
primitive-0.5.0.1
flexible-defaults-0.0.1.0
th-extras-0.0.0.1
syb-0.3.7
template-haskell-2.9.0.0
pretty-1.1.1.0
erf-2.0.0.0
converge-0.1.0.1
continued-fractions-0.9.1.1
containers-0.5.0.0
MonadRandom-0.1.8
random-1.0.1.1
time-1.4.0.2
old-locale-1.0.0.5
deepseq-1.3.0.2
array-0.4.0.2
MonadPrompt-1.0.0.3
mtl-2.1.2
transformers-0.3.0.0
base-4.7.0.0
integer-gmp-0.5.1.0
ghc-prim-0.3.1.0

The uncommented regions of build.mk are,

BuildFlavour = perf

ifeq "$(BuildFlavour)" "perf"

# perf matches the default settings, repeated here for comparison:

SRC_HC_OPTS     = -O -H64m
GhcStage1HcOpts = -O -fasm
GhcStage2HcOpts = -O2 -fasm
GhcHcOpts       = -Rghc-timing
GhcLibHcOpts    = -O2
GhcLibWays     += p

ifeq "$(PlatformSupportsSharedLibs)" "YES"
GhcLibWays += dyn
endif

endif

# NoFib settings
NoFibWays =
STRIP_CMD = :

Likewise for ~/.cabal/config,

remote-repo: hackage.haskell.org:http://hackage.haskell.org/packages/archive
remote-repo-cache: /home/ben/.cabal/packages
world-file: /home/ben/.cabal/world
library-profiling: True

Congratulations on your new job!

comment:18 Changed 16 months ago by bgamari

  • Status changed from infoneeded to new

comment:19 Changed 16 months ago by bgamari

Simon, can I do anything else to help here?

comment:20 Changed 16 months ago by bgamari

It seems this can still be reproduced with db9c062a4a7c39563a3a9a83718cc0ce6d4babae.

comment:21 Changed 16 months ago by simonmar

  • Status changed from new to infoneeded

I reproduced it again today, but the bug still eludes me. The code is reading and writing beyond the end of a ByteArray#, and I can't tell whether this is a codegen bug or a library bug that has gone unnoticed so far.

The crash still happens with -fno-cmm-sink and with -fllvm, which eliminates the two things that I would be most suspicious of (the Cmm sinking pass and the native code generator).

The code that writes over the end of the array is Data.Random.Source.MWC.$wa. This function is large and complicated and the code is TH-generated, so I'm a bit lost here. I tried to follow the code and couldn't see any codegen bugs though.

The erroneous access happens on the 3rd call, where the arguments are a normal-looking MVector:

(gdb) p4 0x2aaaac40bde9
0x2aaaac40be00: 0x102
0x2aaaac40bdf8: 0x0
0x2aaaac40bdf0: 0x2aaaac405628
0x2aaaac40bde8: 0x4cc3d0 <vectorzm0zi10zi0zi1_DataziVectorziPrimitiveziMutable_MVector_con_info>

and an Int# value 1.

Maybe someone familiar with random-source could add some debugging tests and try to narrow down the failure?

comment:22 Changed 16 months ago by bgamari

I've let the maintainer (James Cook) know. He's been quite responsive in the past so I suspect I'll hear back shortly. Thanks again for your time!

comment:23 Changed 16 months ago by bgamari

I'm not sure why this didn't occur to me earlier, but it seems much more likely that mwc-random is the culprit here (I'm fairly certain that mwc-random uses ByteArray#s whereas random-source does not). So far my attempts at producing a crashing bit of code (or even one producing valgrind errors) using only mwc-random have turned up nothing, but I'll keep trying. I've let bos know. Perhaps we'll hear from him soon.

comment:24 Changed 16 months ago by bos

  • Cc bos@… added

comment:25 Changed 16 months ago by simonmar

The ByteArray# code seems to come from vector, because the ByteArray# is inside an MVector constructor (from Data.Vector.Primitive.Mutable).

comment:26 Changed 16 months ago by bgamari

Yes, this lends further support to the hypothesis that it is mwc-random, not random-source, which is at fault. mwc-random uses an MVector to hold the RNG state. Sadly, I've still been unsuccessful at producing a test case using only mwc-random.

comment:27 Changed 16 months ago by bgamari

Indeed, using bounds-checked reads and writes in place of the unsafe varieties currently used in System.Random.MWC reveals that uniform1 is the culprit.

comment:28 Changed 16 months ago by bgamari

In particular, it appears to be the M.read q i in uniformWord32.
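
For illustration, the debugging technique described above (swapping unchecked reads for checked ones so that the first bad index is reported instead of silently corrupting memory) can be sketched with base alone. This is only a sketch under stated assumptions: checkIndex and checkedRead are hypothetical helpers, a raw Ptr Word32 buffer stands in for the MVector, and the length of 256 is chosen for the example rather than taken from mwc-random.

```haskell
import Data.Word (Word32)
import Foreign.Marshal.Array (allocaArray, pokeArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff)

-- Hypothetical helper: accept the index only if it lies in [0, len).
-- The site label says which call reported the problem.
checkIndex :: String -> Int -> Int -> Either String Int
checkIndex site len i
  | i < 0 || i >= len =
      Left (site ++ ": index " ++ show i ++ " out of bounds (0," ++ show len ++ ")")
  | otherwise = Right i

-- Hypothetical checked read: validate first, then do the unchecked access.
checkedRead :: String -> Int -> Ptr Word32 -> Int -> IO (Either String Word32)
checkedRead site len p i =
  case checkIndex site len i of
    Left err -> return (Left err)
    Right ok -> Right <$> peekElemOff p ok

main :: IO ()
main =
  allocaArray 256 $ \p -> do
    -- Fill the buffer so that in-bounds reads return a known value.
    pokeArray p (replicate 256 (7 :: Word32))
    checkedRead "uniformWord32" 256 p 255 >>= print  -- last valid index
    checkedRead "uniformWord32" 256 p 256 >>= print  -- one past the end
```

The second read is rejected before any memory is touched, which is what makes the offending call site visible.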

Changed 16 months ago by bgamari

Test case using only mwc-random

comment:29 Changed 16 months ago by bgamari

Well, it seems that nextIndex isn't wrapping correctly when used in uniformWord32. Currently it is implemented as,

nextIndex :: Integral a => a -> Int
nextIndex i = fromIntegral j
    where j = fromIntegral (i+1) :: Word8

If I trace the index values over iterations of uniformWord32, I find that they grow beyond 256, resulting in the crash.

If I implement nextIndex with a simple mod, on the other hand, things work fine,

nextIndex :: Integral a => a -> Int
nextIndex i = fromIntegral $ (i + 1) `mod` 256

Simon, is it possible that the optimizer could now be optimizing out the cast?
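
For reference, a minimal sketch that compares the two definitions quoted above side by side. The names nextIndexNarrow and nextIndexMod are made up for this comparison; the bodies are the ones quoted above. With a correct compiler both versions should wrap at 256 and agree on every input, so a False on the last line (or an index of 256 or more in the first list) would point at the narrowing being lost.

```haskell
import Data.Word (Word8)

-- The original definition: wrap by casting through Word8.
nextIndexNarrow :: Integral a => a -> Int
nextIndexNarrow i = fromIntegral j
    where j = fromIntegral (i + 1) :: Word8

-- The workaround: wrap with an explicit mod.
nextIndexMod :: Integral a => a -> Int
nextIndexMod i = fromIntegral $ (i + 1) `mod` 256

main :: IO ()
main = do
  let inputs = [0, 1, 254, 255, 256, 511] :: [Int]
  print (map nextIndexNarrow inputs)
  print (map nextIndexMod inputs)
  -- Both definitions should agree on every index, including at the wrap point.
  print (all (\i -> nextIndexNarrow i == nextIndexMod i) [0 .. 100000 :: Int])
```

Compiling this both with and without -O is a cheap way to check whether the optimizer changes the result.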

comment:30 Changed 16 months ago by simonmar

Great analysis, thanks. We do indeed seem to have lost the narrowing in nextIndex, and it happens in Core, so nothing to do with the code generator. I'll dig a bit deeper and see if I can find out where we went wrong.

comment:31 Changed 16 months ago by simonmar

Got it - bogus rules for narrow8Int# and friends in PrelRules. Fix coming.

comment:32 Changed 16 months ago by marlowsd@…

commit 3af022f3ae6ff3adceb2318cf50549d954e8bbe7

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Wed Jan 9 16:52:16 2013 +0000

    Fix some incorrect narrowing rules (#7361)
    
    e.g. narrow8Int# subsumes narrow16Int#, not the other way around.

 compiler/prelude/PrelRules.lhs |   24 ++++++++++++------------
 1 files changed, 12 insertions(+), 12 deletions(-)
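
The subsumption stated in the commit message can be checked with ordinary fixed-width integer types. A minimal sketch, assuming Int8 and Int16 conversions are fair stand-ins for narrow8Int# and narrow16Int#; the helpers narrow8 and narrow16 are made up for illustration and are not the PrelRules code:

```haskell
import Data.Int (Int8, Int16)

-- Stand-ins for the primops: truncate to 8 or 16 bits, then widen back to Int.
narrow8, narrow16 :: Int -> Int
narrow8  x = fromIntegral (fromIntegral x :: Int8)
narrow16 x = fromIntegral (fromIntegral x :: Int16)

main :: IO ()
main = do
  let xs = [-70000 .. 70000] :: [Int]
  -- Sound: an outer 8-bit narrowing makes an inner 16-bit one redundant.
  print (all (\x -> narrow8 (narrow16 x) == narrow8 x) xs)
  -- Unsound: keeping only the 16-bit narrowing changes the result.
  print (all (\x -> narrow8 (narrow16 x) == narrow16 x) xs)
  -- A concrete counterexample: 300 fits in 16 bits but not in 8.
  print (narrow8 300, narrow16 300)
```

The first check should print True and the second False, which is the asymmetry the fix restores: the narrower operation may absorb the wider one, never the reverse.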

comment:33 Changed 16 months ago by simonmar

  • Resolution set to fixed
  • Status changed from infoneeded to closed

comment:34 Changed 16 months ago by bgamari

Thanks!
