Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#3875 closed merge (fixed)

DPH QuickHull example crashes on SPARC with -N > 1

Reported by: benl
Owned by: igloo
Priority: high
Milestone: 6.12.2
Component: Runtime System
Version: 6.13
Keywords:
Cc:
Operating System: Solaris
Architecture: sparc
Type of failure: Runtime crash
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

... ghc-head-build/libraries/dph/examples/quickhull$ par/quickhull 10000 +RTS -N1
N = 10000: 503/520 503/520 503/520

... ghc-head-build/libraries/dph/examples/quickhull$ par/quickhull 10000 +RTS -N2
Bus Error (core dumped)

GDB says:

(gdb) bt
#0  0x0197f5ec in eval_thunk_selector ()
#1  0x019806e0 in scavenge_block ()
#2  0x01981b48 in scavenge_loop ()
#3  0x01974934 in scavenge_until_all_done ()
#4  0x01975828 in GarbageCollect ()
#5  0x0196cd3c in scheduleDoGC ()
#6  0x0196e708 in schedule ()
#7  0x0196ba54 in real_main ()
#8  0x0196bbfc in hs_main ()
#9  0x000121bc in _start ()

Change History (11)

comment:1 Changed 7 years ago by benl

Owner: set to benl
Status: changed from new to assigned

comment:2 Changed 7 years ago by benl

Sadly, compiling with -debug and running the same program with +RTS -Dg makes it work. Heisenbugs are my least favourite.

comment:3 Changed 7 years ago by simonmar

Component: changed from Compiler to Runtime System
Milestone: set to 6.12.2
Priority: changed from normal to high

Is this only on Sparc? I wonder if there is a missing write_barrier() somewhere.

This is likely to be a tricky one - eval_thunk_selector is a notorious bug farm. I thought I'd got them all, but evidently not.

comment:4 Changed 7 years ago by benl

The same program works fine on x86/Linux.

Another DPH benchmark, sumsq, crashes about 1 time in 20 on SPARC/Solaris, so it does sound like a race condition of some kind. The Wikipedia page http://en.wikipedia.org/wiki/Memory_ordering suggests that SPARC TSO and x86 memory ordering should be similar. If that's true, I don't know why SPARC is crashing and x86 isn't. I'll do some digging.
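
For readers unfamiliar with what a write barrier protects, here is a minimal sketch in portable C11, not RTS code. The `Mailbox` type and the `publish` and `try_consume` functions are hypothetical names invented for illustration. The release store stands in for the role a write barrier plays: the payload store may not be reordered after the flag store.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mailbox: a payload guarded by a ready flag. */
typedef struct {
    int payload;
    atomic_bool ready;
} Mailbox;

/* Writer: store the payload, then publish it. The release store ensures
   the payload store cannot be reordered after the flag store. */
static void publish(Mailbox *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Reader: returns true and fills *out only once the flag is visible.
   The acquire load pairs with the release store in publish(). */
static bool try_consume(Mailbox *m, int *out) {
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;
    *out = m->payload;
    return true;
}
```

If the two stores in `publish` could become visible in the opposite order, a reader might see `ready` set and still read a stale payload, which is the kind of failure a missing barrier would produce.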

comment:5 Changed 7 years ago by benl

The concprog002 testsuite program also crashes consistently with -N16 and -N32.

Once, when running it with -N16, it printed:

benl@greyarea:~/devel/ghc/ghc-head-devel/tmp$ ./Server +RTS -N16 > /dev/null
Server: thread blocked indefinitely in an MVar operation
Server: thread blocked indefinitely in an MVar operation

comment:6 Changed 7 years ago by benl

After compiling quickhull with

... inplace/bin/ghc-stage2 --make vect.hs -threaded -fdph-par 
       -package dph-bench -package-conf ../lib/dph-bench.conf 
       -Odph -fforce-recomp -debug

Running

./vect 10000 +RTS -N16                  -- crashes
./vect 10000 +RTS -N16 -qg              -- works, no parallel GC
./vect 10000 +RTS -N16 -A100M -H100M -S -- works, only 2 GCs run 

And it always crashes in the first GC cycle.

./vect 100000 +RTS -N2 -S 
    Alloc    Copied     Live    GC    GC     TOT     TOT  Page Flts
    bytes     bytes     bytes  user  elap    user    elap
   524688    309168    295340  0.07  0.05    0.08    0.06    0    0  (Gen:  1)
Bus Error (core dumped)

Note: +RTS -DS (debug sanity) does not sanity check the heap when running with -threaded. See rts/sm/Sanity.c

comment:7 Changed 7 years ago by benl

This looks like a race in the bail-out code at the bottom of eval_thunk_selector. An instrumented version of the code logs the events below. Before instrumentation, the program crashed when it tried to dereference the last pointer (info_ptr_table=fec96b21), which is misaligned.

tid= 3 p=fece84f4 entering eval_thunk_selector.
tid= 3 p=fece84f4 info_ptr_table=19a0750
tid= 3 p=fece84f4 bailing out
tid= 0 p=fece84f4 entering eval_thunk_selector.
tid= 0 p=fece84f4 info_ptr_table=fec96b21 BLERGH!!!!!! stopping program

comment:8 Changed 7 years ago by benl

Resolution: fixed
Status: changed from assigned to closed
Sun Feb 21 19:16:27 PST 2010  Ben.Lippmeier@anu.edu.au
 * Fix #3875: Crash in parallel GC, wrong pointer was being tested.

   M ./rts/sm/Evac.c -1 +1
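
The patch itself is not reproduced in this ticket. As a hedged illustration of how testing the wrong pointer can go unnoticed in this kind of code, here is a sketch in C11. It is not the code from Evac.c. `Obj`, `LOCKED`, `is_forwarding`, `lock_header`, `unlock_header` and `already_forwarded` are hypothetical names, and the convention that a set low bit marks a forwarding pointer is an assumption made for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical heap object with a single header word. */
typedef struct { _Atomic uintptr_t header; } Obj;

/* Hypothetical sentinel header value meaning "locked by a GC thread". */
#define LOCKED ((uintptr_t)0x10)

/* Assumed convention: a header with the low bit set is a forwarding pointer. */
static bool is_forwarding(uintptr_t w) { return (w & 1u) != 0; }

/* Spin until we own the header; return the value it held before locking. */
static uintptr_t lock_header(Obj *o) {
    uintptr_t old;
    do {
        old = atomic_exchange(&o->header, LOCKED);
    } while (old == LOCKED);
    return old;
}

/* Release the lock by writing back the header value we swapped out. */
static void unlock_header(Obj *o, uintptr_t old) {
    atomic_store(&o->header, old);
}

/* Returns true if another thread already forwarded the object.
   The test must be applied to the header value that was swapped out.
   Applying it to the object's own address would always give false,
   because that address is word-aligned and its low bit is clear. */
static bool already_forwarded(Obj *o) {
    uintptr_t old = lock_header(o);
    bool fwd = is_forwarding(old);
    unlock_header(o, old);  /* bail out: restore the header we found */
    return fwd;
}
```

In this sketch, replacing `is_forwarding(old)` with `is_forwarding((uintptr_t)o)` would compile and would behave identically whenever no other thread had forwarded the object first, so the mistake would show up only under parallel GC.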

comment:9 Changed 7 years ago by simonmar

Resolution: fixed
Status: changed from closed to reopened
Type: changed from bug to merge

comment:10 Changed 7 years ago by simonmar

Owner: changed from benl to igloo
Status: changed from reopened to new

comment:11 Changed 7 years ago by igloo

Resolution: fixed
Status: changed from new to closed

Merged.
