Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#11978 closed bug (fixed)

running a profiled build of shake test suite with rts args +RTS -hb -N10 triggers SIGSEGV

Reported by: carter Owned by:
Priority: highest Milestone: 8.0.2
Component: Compiler Version: 8.0.1
Keywords: Cc: lelf
Operating System: MacOS X Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #4820 Differential Rev(s): Phab:D2174
Wiki Page:

Description (last modified by carter)

on my 4 core macbook pro i can reliably get a sigsegv when running the shake test suite with profiling enabled and the RTS flags +RTS -hb -N10

steps to reproduce (with either ghc 7.10.3 or an 8.0 RC4 build)

cabal get --source shake
cd shake
#  shake master  was commit f0c0ce8d4c2c017b9a26b5d4258e07d881f4cde1 as of time of testing
# but problem should happen with eg 0.15.6 too
cabal install # this is so that the shake test suite can fine the right template file
cabal install --only-dep --enable-tests --enable-profiling  
./dist/build/shake-test/shake-test   +RTS -hb -N10

this last command yields the error

******************************************************************
** Running shake test suite, run with '--help' to see arguments **
******************************************************************
fish: './dist/build/shake-test/shake-t…' terminated by signal SIGSEGV (Address boundary error)

i can also trigger the bug reported in #4820 by running the test suite with the -hr flag rather than -hb

I'm marking this as HIGHEST so it can be discussed, because the profiiling build of the shake test suite only seems to crash *if* the test suite is run with -hb or -hr flags, which seems like a pretty nasty bug (which colleagues have it in other code)

nb: ./dist/build/shake-test/shake-test +RTS -N10 runs into a not enough stack space error before the shake test suite completes, but ./dist/build/shake-test/shake-test +RTS -K1G -N10 seems to work fine, though the crashes happen super early in the gargantuan test suite

the analogous output for reproducing #4820 is :

./dist/build/shake-test/shake-test   +RTS  -hr  -N10
******************************************************************
** Running shake test suite, run with '--help' to see arguments **
******************************************************************
Longest file modification time lag was 1005ms
## BUILD tar test
## TESTING tar
## BUILD tar --abbrev=output=$OUT -j3
Writing report to $OUT/tar/report.html
Build completed in 0:01m
## BUILD tar --no-build --report=-
Warning: No want/action statements, nothing to do
Writing report to output/tar/report.html
Writing report to -
* This database has tracked 1 runs.
* There are 4 rules (4 rebuilt in the last run).
* Building required 1 traced commands (1 in the last run).
* The total (unparallelised) time is 0.01s of which 0.01s is traced commands.
* The longest rule takes 0.01s (output/tar/result.tar), and the longest traced command takes 0.01s (tar).
* Last run gave an average parallelism of 0.44 times over 0.01s.
Build completed in 0:01m
## BUILD tar
Writing report to output/tar/report.html
Build completed in 0:01m
## FINISHED TESTING tar
## BUILD self test
## TESTING self
## BUILD self --abbrev=output=$OUT -j3
Writing report to $OUT/self/report.html
Build completed in 0:01m
## BUILD self --no-build --report=-
Warning: No want/action statements, nothing to do
Writing report to output/self/report.html
Writing report to -
* This database has tracked 2 runs.
* There are 460 rules (2 rebuilt in the last run).
* Building required 56 traced commands (1 in the last run).
* The total (unparallelised) time is 22.73s of which 22.33s is traced commands.
* The longest rule takes 1.11s (output/self/Development/Shake/Types.o output/self/Development/Shake/Types.hi), and the longest traced command takes 1.11s (ghc).
* Last run gave an average parallelism of 0.78 times over 0.05s.
Build completed in 0:01m
## BUILD self
shake-test: internal error: Invalid object in isRetainer(): 39
    (GHC version 8.0.0.20160421 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug 

Attachments (1)

0001-rts-LdvProfile.c-Fix-NULL-dereference-on-shutdown.patch (703 bytes) - added by erikd 3 years ago.
Propsed fix

Download all attachments as: .zip

Change History (33)

comment:1 Changed 3 years ago by carter

comment:2 Changed 3 years ago by carter

Description: modified (diff)
Summary: running a profiled build of shake test suite with rts args +RTS -hb -N10 triggers SIGSEVrunning a profiled build of shake test suite with rts args +RTS -hb -N10 triggers SIGSEGV

comment:3 Changed 3 years ago by carter

found a simple reproducer for the sigsevg

echo "main = putStrLn \"hii\"" > main.hs
ghc -prof -rtsopts  -threaded main.hs
./main  +RTS -hb -N10
hii
fish: './main  +RTS -hb -N10' terminated by signal SIGSEGV (Address boundary error)

comment:4 Changed 3 years ago by carter

without setting -N to any value, the hello world program doesnt crash, but for any f>1 -Nf` seems to trigger the crash. at least with 7.10.3 as the compiler

comment:5 in reply to:  3 Changed 3 years ago by erikd

Replying to carter:

found a simple reproducer for the sigsevg

echo "main = putStrLn \"hii\"" > main.hs
ghc -prof -rtsopts  -threaded main.hs
./main  +RTS -hb -N10
hii
fish: './main  +RTS -hb -N10' terminated by signal SIGSEGV

Can confirm that with the the same thing happens on linux with ghc-7.10.3 and ghc-8.0.0.20160421.

comment:6 Changed 3 years ago by erikd

Running under gdb I actually get a sensible backtrace, so it seems this crashed was in C code:

Program received signal SIGSEGV, Segmentation fault.
0x00000000004c8130 in LdvCensusForDead.part.0 ()
(gdb) bt
#0  0x00000000004c8130 in LdvCensusForDead.part.0 ()
#1  0x00000000004d27ad in GarbageCollect ()
#2  0x00000000004c9cc5 in scheduleDoGC ()
#3  0x00000000004cba8b in exitScheduler ()
#4  0x00000000004bc890 in hs_exit_ ()
#5  0x00000000004bce1f in shutdownHaskellAndExit ()
#6  0x00000000004ce13d in hs_main ()
#7  0x000000000040a61d in main ()

Changed 3 years ago by erikd

Propsed fix

comment:7 Changed 3 years ago by erikd

@carter Please test the attached patch to see it if fixes your initial problem. It does fix the hello world version of the problem.

comment:8 Changed 3 years ago by erikd

Recreated @carter's original test running shake's test suite. With my proposed fix applied, I still get a segfault.

Running under gdb provides the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbb7fe700 (LWP 9743)]
overwritingClosure (p=0x20010d8e0) at includes/rts/storage/ClosureMacros.h:535
535             ((StgThunk *)(p))->payload[i] = 0;
(gdb) bt
#0  overwritingClosure (p=0x20010d8e0) at includes/rts/storage/ClosureMacros.h:535
#1  doneWithMsgThrowTo (m=0x20010d8e0) at rts/Messages.h:25
#2  executeMessage (cap=cap@entry=0x1496740, m=0x20010d8e0) at rts/Messages.c:104
#3  0x0000000000f47834 in scheduleProcessInbox (pcap=<optimized out>) at rts/Schedule.c:1005
#4  scheduleFindWork (pcap=<optimized out>) at rts/Schedule.c:616
#5  schedule (initialCapability=<optimized out>, task=task@entry=0x7fff88000910)
    at rts/Schedule.c:274
#6  0x0000000000f490dc in scheduleWorker (cap=<optimized out>, task=0x7fff88000910)
    at rts/Schedule.c:2382
#7  0x00007ffff75b6454 in start_thread (arg=0x7fffbb7fe700) at pthread_create.c:334
#8  0x00007ffff6d72eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) print i
$1 = 910560
(gdb) print p
$2 = (StgClosure *) 0x20010d8e0

comment:9 Changed 3 years ago by lelf

Cc: lelf added

comment:10 Changed 3 years ago by erikd

A bit of printf debugging showed the problem was in includes/rts/storage/ClosureMacros.h in the function overwritingClosure in the code:

    for (i = 0; i < size - sizeofW(StgThunkHeader); i++) {
        ((StgThunk *)(p))->payload[i] = 0;
    }

The problem was that size (with a value of 3) was less than sizeofW(StgThunkHeader) (value 4) resulting an unsigned integer overflow, cause the code to write past where it should.

Why would the size returned by closure_sizeW be less than sizeofW(StgThunkHeader)?

Last edited 3 years ago by erikd (previous) (diff)

comment:11 Changed 3 years ago by erikd

My friend printf tells me that when the value of size is less than sizeofW(StgThunkHeader) when the closure type is WHITEHOLE which seems legitimate. Since everything seems legitimate, the solution to this may simply be to add a check:

    if (size > sizeofW(StgThunkHeader)) {
        for (i = 0; i < size - sizeofW(StgThunkHeader); i++) {
            ((StgThunk *)(p))->payload[i] = 0;
        }
    }

@simonmar does this seem right?

comment:12 Changed 3 years ago by Erik de Castro Lopo <erikd@…>

In bcfee218/ghc:

rts/LdvProfile.c: Fix NULL dereference on shutdown

Test Plan: validate

Reviewers: carter, austin, simonmar, bgamari

Reviewed By: simonmar, bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2154

GHC Trac Issues: #11978

comment:13 Changed 3 years ago by erikd

Differential Rev(s): phab:D2159
Status: newpatch

comment:14 Changed 3 years ago by erikd

On IRC @simonmar said that in this function overwritingClosure the closure type should not be WHITEHOLE. The backtrace looks like:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffca7fc700 (LWP 27422)]
executeMessage (cap=cap@entry=0x14861a0, m=0x20022b340) at rts/Messages.c:104
104                 doneWithMsgThrowTo(t);
(gdb) bt
#0  executeMessage (cap=cap@entry=0x14861a0, m=0x20022b340) at rts/Messages.c:104
#1  0x0000000000f47834 in scheduleProcessInbox (pcap=<optimized out>) at rts/Schedule.c:1005
#2  scheduleFindWork (pcap=<optimized out>) at rts/Schedule.c:616
#3  schedule (initialCapability=<optimized out>, task=task@entry=0x7fffc4000910) at rts/Schedule.c:274
#4  0x0000000000f490dc in scheduleWorker (cap=<optimized out>, task=0x7fffc4000910) at rts/Schedule.c:2382
#5  0x00007ffff75b6454 in start_thread (arg=0x7fffca7fc700) at pthread_create.c:334
#6  0x00007ffff6d72eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The call path seems to be scheduleProcessInbox gets the next message, passes it to executeMessage which calls throwToMsg on it which returns THROWTO_SUCCESS, and then passes the message to doneWithMsgThrowTo which then passes it to overwritingClosure.

Reading @ezyang's paper http://ezyang.com/jfp-ghc-rts-draft.pdf suggests that yes indeed this code path should not be executed when the closure type is WHITEHOLE.

comment:15 Changed 3 years ago by erikd

The whole issue arises because we enter the function overwritingClosure with a closure of type WHITEHOLE and that happens because executeMessage (in rts/Messages.c calls doneWithMsgThrowTo.

So lets look at the interesting part of executeMessage with some extra comments explaining what I think is happening:

void
executeMessage (Capability *cap, Message *m)
{
    const StgInfoTable *i;

loop:
    write_barrier(); // allow m->header to be modified by another thread
    i = m->header.info;
    if (i == &stg_MSG_TRY_WAKEUP_info)
    {
        StgTSO *tso = ((MessageWakeup *)m)->tso;
        debugTraceCap(DEBUG_sched, cap, "message: try wakeup thread %ld",
                      (W_)tso->id);
        tryWakeupThread(cap, tso);
    }
    else if (i == &stg_MSG_THROWTO_info)
    {
        MessageThrowTo *t = (MessageThrowTo *)m;
        uint32_t r;
//** This shadows the outer variable named `i`, but that seems ok.
        const StgInfoTable *i;

//** lockClosure writes `&stg_WHITEHOLE_info` to `m->header.info` and returns
//** the old value of `m->header.info`.
        i = lockClosure((StgClosure*)m);

//** t->header.info == &stg_WHITEHOLE_info

        if (i != &stg_MSG_THROWTO_info) {
            unlockClosure((StgClosure*)m, i);
            goto loop;
        }

        debugTraceCap(DEBUG_sched, cap, "message: throwTo %ld -> %ld",
                      (W_)t->source->id, (W_)t->target->id);

        ASSERT(t->source->why_blocked == BlockedOnMsgThrowTo);
        ASSERT(t->source->block_info.closure == (StgClosure *)m);

        r = throwToMsg(cap, t);

//** t->header.info == &stg_WHITEHOLE_info

        switch (r) {
        case THROWTO_SUCCESS: {
            // this message is done
            StgTSO *source = t->source;

//** doneWithMsgThrowTo calls overwritingClosure with
//** t->header.info == &stg_WHITEHOLE_info

            doneWithMsgThrowTo(t);
            tryWakeupThread(cap, source);
            break;
        }

@simonmar, on IRC you said that in overwritingClosure the closure type should not be WHITEHOLE, but the above code suggests that it can be. If the t->header.info value is supposed to be changed back to &stg_MSG_THROWTO_info (or something else), where is that supposed to happen?

Last edited 3 years ago by erikd (previous) (diff)

comment:16 Changed 3 years ago by simonmar

Yeah, this actually looks like a bug. overwritingClosure() wants to see the MSG_THROWTO closure, but we have locked it with lockClosure(), which replaces the header with a WHITEHOLE.

We can't replace the header before calling overwritingClosure(), because that would cause race conditions - another thread could execute the MSG_THROWTO. We can't replace the header with MSG_NULL before calling overwritingClosure(), because a MSG_NULL is smaller than a MSG_THROWTO, and we won't overwrite the payload correctliy.

So I think the only alternative is to have another version of overwritingClosure() that takes the size of the original closure as an argument. The size is sizeofW(MessageThrowTo).

comment:17 Changed 3 years ago by simonmar

Forgot to say: good catch!

comment:18 Changed 3 years ago by bgamari

Milestone: 8.0.2
Status: patchnew

The previous patch didn't fix the underlying issue.

comment:19 Changed 3 years ago by bgamari

Differential Rev(s): phab:D2159Phab:D2174
Status: newpatch

It actually turns out this was pretty easy to fix (assuming Phab:D2174 indeed is a fix).

comment:20 Changed 3 years ago by erikd

Owner: set to erikd

comment:21 Changed 3 years ago by erikd

With @bgamari's ok, I've taken ownership of this ticket.

I have a fix very similar to Ben's (with an extra ASSERT) and a test. However the test also triggers the bug in #12009 so I need to fix that bug before I can commit the fix and test for this one.

comment:22 Changed 3 years ago by erikd

Even with the MSG_THROWTO issue with overwritingClosure() fixed, I'm still getting a failure for the assertion I added. The back trace looks like:

T11978b: internal error: ASSERTION FAILED: file includes/rts/storage/ClosureMacros.h, line 551

    (GHC version 8.1.20160504 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcaffd700 (LWP 30814)]
0x00007ffff6ec0478 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
55      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6ec0478 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007ffff6ec18fa in __GI_abort () at abort.c:89
#2  0x00000000004dc8bc in rtsFatalInternalErrorFn (s=0x550808 "ASSERTION FAILED: file %s, line %u\n", ap=0x7fffcaff8cb8) at rts/RtsMessages.c:182
#3  0x00000000004dc4ee in barf (s=0x550808 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:46
#4  0x00000000004dc551 in _assertFail (filename=0x54c3c0 "includes/rts/storage/ClosureMacros.h", linenum=551) at rts/RtsMessages.c:61
#5  0x00000000004c7127 in overwritingClosure (p=0x200180748) at includes/rts/storage/ClosureMacros.h:551
#6  0x00000000004e031d in threadPaused (cap=0x7b6340, tso=0x200087c00) at rts/ThreadPaused.c:310
#7  0x00000000004f7d51 in stg_returnToSched ()
#8  0x0000000000000000 in ?? ()

comment:23 Changed 3 years ago by erikd

The overwritingClosure issue can only be triggered in the -hb profiling mode and I finally figure our that the -hb profiling mode was not thread safe (see #12019).

Leaving this ticket open as a place holder for adding tests for the other profiling modes.

comment:24 Changed 3 years ago by bgamari

Owner: erikd deleted
Status: patchnew

comment:25 Changed 3 years ago by erikd

I have some tests I'd like to add for this after phab:D1187 lands.

comment:26 Changed 2 years ago by bgamari

erikd, any update on these additional tests?

comment:28 Changed 2 years ago by erikd

I remember writing that. Let me see if I can find them.

comment:29 Changed 2 years ago by bgamari

Milestone: 8.0.28.2.1

The tests aren't critical for 8.0.2. Bumping.

comment:30 Changed 2 years ago by bgamari

Milestone: 8.2.18.0.2
Resolution: fixed
Status: newclosed

As described in #12019 -hb is not thread-safe. As of 6555c6bb8447ed65d5da4bab462ee9da7dc3f97a the RTS will fail with a proper error message when it is used with more than one capability, so I think this can be safely closed.

Last edited 2 years ago by bgamari (previous) (diff)

comment:31 Changed 2 years ago by carter

Resolution: fixed
Status: closednew

This commit breaks builds on OS X 10.11.

comment:32 Changed 2 years ago by bgamari

Resolution: fixed
Status: newclosed

Can you be more specific, Carter? In what way do things break? Let's handle this breakage on a new ticket.

Note: See TracTickets for help on using tickets.