#9078 closed bug (fixed)

Segfault with makeStableName

Reported by: robertce Owned by:
Priority: highest Milestone: 7.8.3
Component: Compiler Version: 7.8.2
Keywords: Cc: tmcdonell@…, chak@…, erkokl@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Calling makeStableName on the same object many times causes a segfault. The following triggers this behaviour on both OSX and Linux-x86_64, provided it is compiled (it doesn't seem to occur if the call to makeStableName is interpreted).

module Main where

import Control.Monad
import System.Mem.StableName

main :: IO ()
main = replicateM_ 100000 (makeStableName foo)

foo :: Int
foo = 1

It seems to happen largely at random. The 100000 calls to makeStableName in the above example essentially guarantee its occurrence, but I have observed it happen with far fewer.

The reason this bug was discovered is it is causing frequent failures in Accelerate on 7.8.2. See https://github.com/AccelerateHS/accelerate/issues/162. In that case it seems to be causing more strange closure type errors than segfaults, but I assume it is the same root cause.

Change History (23)

comment:1 Changed 15 months ago by tmcdonell

  • Cc tmcdonell@… added

comment:2 Changed 15 months ago by ezyang

Reproduced; when run with the debug build we get an assert failure:

Starting program: /home/ezyang/test +RTS -DS -C0 -V0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
cap 0: initialised
test: internal error: ASSERTION FAILED: file rts/Stable.c, line 337

    (GHC version 7.9.20140410 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Program received signal SIGABRT, Aborted.
0x00007ffff6eaef79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56	../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6eaef79 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6eb2388 in __GI_abort () at abort.c:89
#2  0x000000000067edef in rtsFatalInternalErrorFn (s=0x6e63c0 "ASSERTION FAILED: file %s, line %u\n", ap=0x7fffffff9878) at rts/RtsMessages.c:170
#3  0x000000000067ea27 in barf (s=0x6e63c0 "ASSERTION FAILED: file %s, line %u\n") at rts/RtsMessages.c:42
#4  0x000000000067ea8a in _assertFail (filename=0x6e6d44 "rts/Stable.c", linenum=337) at rts/RtsMessages.c:57
#5  0x0000000000682c03 in lookupStableName (p=0x9076b0) at rts/Stable.c:337
#6  0x00000000006a2f46 in stg_makeStableNamezh ()
#7  0x0000000000000000 in ?? ()
(gdb) f 5
#5  0x0000000000682c03 in lookupStableName (p=0x9076b0) at rts/Stable.c:337
337	    ASSERT(stable_name_table[sn].addr == p);

comment:3 Changed 15 months ago by ezyang

I believe the bug is when we free stable-name entries (in gcStableTables), we also need to remove the stable name from the hash table. updateStableTables will not catch this because it only iterates over live stable names. The bug is masked because if we immediately make another stable pointer, it will probably be put in the right slot, but the ASSERT catches the problem immediately.

comment:4 Changed 15 months ago by ezyang

This patch seems to do the trick.

diff --git a/rts/Stable.c b/rts/Stable.c
index ec74b0d..431b7c6 100644
--- a/rts/Stable.c
+++ b/rts/Stable.c
@@ -246,6 +246,7 @@ STATIC_INLINE void
 freeSnEntry(snEntry *sn)
 {
   ASSERT(sn->sn_obj == NULL);
+  removeHashTable(addrToStableHash, (W_)sn->old, NULL);
   sn->addr = (P_)stable_name_free;
   stable_name_free = sn;
 }

Alas, there are a bunch of validate failures at the moment.

comment:5 Changed 15 months ago by chak

@ezyang Thanks for chasing this, but I'm a bit surprised that the problem is with freeing stable names as the example program only allocated them (and they should never be freed by the GC).

@robertce Could you please build a GHC with this patch included and check whether it solves our problem?

comment:6 Changed 15 months ago by chak

  • Cc chak@… added
  • Priority changed from normal to high

comment:7 Changed 15 months ago by robertce

@chak I have just patched GHC and it does seem to solve our problem. I can run all of our testsuite without problem, something that was previously impossible.

comment:8 Changed 15 months ago by thoughtpolice

  • Milestone set to 7.8.3
  • Priority changed from high to highest
  • Status changed from new to patch

comment:9 Changed 15 months ago by chak

@robertce Thanks!

@thoughtpolice What is the current timeline for 7.8.3? This bug is basically a killer for Accelerate as it leads to intermittent failures and it doesn't seem as if we can work around it. (Would be a pity if it ends up in the HP.)

comment:10 Changed 15 months ago by ezyang

@chak Stable names are GC'd as normal (as opposed to stable *pointers*), so they'll get freed upon GCs. This behavior is new from fixing #7674, I assume that is how the bug got introduced too.

comment:11 Changed 15 months ago by chak

@ezyang Yes, you are right.

comment:12 Changed 15 months ago by lerkok

  • Cc erkokl@… added

A recent bug reported in SBV points to the same problem: https://github.com/LeventErkok/sbv/issues/87

Only with GHC 7.8.2; heavy calls to SBV's sat/prove calls start causing seemingly random failures. We were able to track it down to the uses of makeStableName. We have not observed seg-faults, but we did observe erroneous behavior that we could not replicate with GHC 7.6.3 and earlier.

SBV uses makeStableName to implement Andy Gill's type-safe observable sharing idea (as described here: http://www.ittc.ku.edu/csdl/fpg/papers/Gill-09-TypeSafeReification.html), which is a popular trick amongst DSL implementations these days. I'm not sure if Accelerate uses it for the same purpose, but I suspect this bug can impact a bunch of other DSL implementations out there, and would be really hard to debug for end-users. A timely fix before 7.8.2 becomes more widely adapted would be really good for the community.

comment:13 Changed 15 months ago by chak

@erkokl Yes, Accelerate uses stable names for observable sharing. I agree that it is a nasty bug.

comment:14 Changed 15 months ago by thoughtpolice

So we weren't sure when we wanted to release 7.8.3. But it looks like there are quite a few important bugfixes biting people at this point. I'll send an email to ghc-devs shortly...

comment:15 Changed 15 months ago by Austin Seipp <austin@…>

In b75d126e779e8690c675be84e8972dc023e04b9b/ghc:

rts: remove stable-names from hashtable upon free

This fixes #9078.

Signed-off-by: Austin Seipp <[email protected]>

comment:16 Changed 15 months ago by thoughtpolice

  • Status changed from patch to merge

comment:17 Changed 14 months ago by chak

@thoughtpolice Is there any news on a timeline?

comment:18 follow-up: Changed 14 months ago by simonpj

Manuel, you mean a timeline for 7.8.3? What would be helpful for you? (You can always build your own 7.8.3 working copy.)

Simon

comment:19 Changed 14 months ago by thoughtpolice

I have posted an update here, please let me know what you think: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/025022.html

comment:20 in reply to: ↑ 18 Changed 14 months ago by chak

Replying to simonpj:

Manuel, you mean a timeline for 7.8.3? What would be helpful for you? (You can always build your own 7.8.3 working copy.)

Simon

Simon, I'm worried that the Haskell Platform might ship with 7.8.2 (including this bug). It would mean that the Platform would be unusable for Accelerate (and possibly all other EDSLs that use Andy Gill's method to implement observable sharing). Hence, I would much appreciate if 7.8.3 could be pushed out of the door in time to get included in the forthcoming Haskell Platform.

comment:21 Changed 14 months ago by lerkok

@chak Looks like the next platform will indeed ship with 7.8.2; so maybe it's already too late: https://github.com/haskell/haskell-platform/blob/new-build/hptool/src/Releases2014.hs

comment:22 Changed 14 months ago by carter

I think the case could be made to mark re haskell platform using 7.8.3 assuming the timeline for 7.8.3 is soon enough...

comment:23 Changed 14 months ago by thoughtpolice

  • Resolution set to fixed
  • Status changed from merge to closed

Merged into 7.8 branch for 7.8.3.

Note: See TracTickets for help on using tickets.