Opened 3 years ago

Last modified 2 years ago

#12019 new bug

Profiling option -hb is not thread safe

Reported by: erikd Owned by:
Priority: normal Milestone:
Component: Runtime System Version:
Keywords: Cc: carter, simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: #11978, #12009 Differential Rev(s): Phab:D2516
Wiki Page:

Description (last modified by bgamari)

This ticket is a continuation of #11978 and #12009. After fixing a couple of issues in those two tickets I found that the profiling run time is not thread safe.

Have a trivial test program (written as one of the tests for #11978):

import Control.Concurrent
import Control.Concurrent.MVar
import Control.Exception
import Control.Monad

main :: IO ()
main = do
    putStrLn "Start ..."
    mvar <- newMVar (0 :: Int)

    let count = 50

    forM_ [ 1 .. count ] $ const $ forkIO $ do
            threadDelay 100
            i <- takeMVar mvar
            putMVar mvar $! i + 1

    threadDelay 1000000
    end <- takeMVar mvar
    putStrLn $ "Final result " ++ show end
    assert (end == count) $ return ()

Compiling that with a compiler that has bug fixes arising from #11978 and #12009 as:

inplace/bin/ghc-stage2 testsuite/tests/profiling/should_run/T11978b.hs \
    -fforce-recomp -rtsopts -fno-warn-tabs -O -prof -static -auto-all \
    -threaded -debug -o T11978b

and run as:

./T11978b +RTS -p -hb -N10 

crashes in a number of different ways. I've seen at least 3 different assertion failures and numerous segfaults (in different stg_ap_* functions).

Replace -hb with other profiling options like -hr etc do not seem to crash.

Looking at code, one example of lack of thread safetly is the function LDV_recordDead which mutates global variable censuses which does not have any locking around it. Only figured this out because the following assert (in LDV_recordDead) was being triggered occasionally.

    ASSERT(censuses[t].void_total < censuses[t].not_used);

Change History (12)

comment:1 Changed 3 years ago by simonmar

Yes, when I made the rest of profiling thread-safe I didn't look at +RTS -hb. Perhaps we should disable it except at -N1, unless you want to have a go at fixing it?

comment:2 Changed 3 years ago by erikd

Yes, I'm going to try to fix it.

comment:3 Changed 3 years ago by erikd

Fixing this is definitely not trivial.

I've placed locks around some of the shared mutable data structures, but I still get the same assertion (in function processHeapClosureForDead) failing:


because the overwritingClosure has already been called on it.

comment:4 Changed 2 years ago by bgamari

We should at least try to disable it except at -N1.

comment:5 Changed 2 years ago by bgamari

Description: modified (diff)
Differential Rev(s): Phab:D2516
Status: newpatch

Phab:D2516 throws an error when -hb is used with more than one capability.

Last edited 2 years ago by bgamari (previous) (diff)

comment:6 Changed 2 years ago by Ben Gamari <ben@…>

In 6555c6bb/ghc:

rts: Disable -hb with multiple capabilities

Biographical profiling is not thread-safe as documented in #12019. Throw
an error when it is used in this way.

Test Plan: Validate

Reviewers: simonmar, austin, erikd

Reviewed By: erikd

Subscribers: thomie

Differential Revision:

GHC Trac Issues: #12019

comment:7 Changed 2 years ago by bgamari

Resolution: fixed
Status: patchclosed

comment:8 Changed 2 years ago by bgamari

Version: 8.1

comment:9 Changed 2 years ago by bgamari

Milestone: 8.0.2
Owner: erikd deleted
Resolution: fixed
Status: closednew

Reopening since this is still an issue; we just happen to fail more gracefully now.

comment:10 Changed 2 years ago by carter

The fix on the 8.0 branch broke the build on my Mac. Reverting it for now for my local build. But surely it's impacted other folks

comment:11 Changed 2 years ago by carter

rts/ProfHeap.c: In function 'initHeapProfiling':

rts/ProfHeap.c:389:49: error:
     error: 'PAR_FLAGS {aka struct _PAR_FLAGS}' has no member named 'nCapabilities'
         if (doingLDVProfiling() && RtsFlags.ParFlags.nCapabilities > 1) {

I think the patch / CPP somehow doesn't quite work on the build ways matrix of the RTS in the expected way, i'll try to poke at it myself if i have time, but it def died

comment:12 Changed 2 years ago by Ben Gamari <ben@…>

In d1b4fec1/ghc:

Mark T11978a as broken due to #12019

Test Plan: `validate --slow`

Reviewers: austin

Subscribers: thomie

Differential Revision:

GHC Trac Issues: #12019
Note: See TracTickets for help on using tickets.