#13751 closed bug (fixed)

Runtime crash with <<loop>> after concurrent stressing of STM computations

Reported by: literon Owned by: simonmar
Priority: highest Milestone: 8.2.1
Component: Runtime System Version: 8.0.2
Keywords: Cc: simonmar, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime crash Test Case:
Blocked By: Blocking:
Related Tickets: 10414 Differential Rev(s): Phab:D3630
Wiki Page:

Description

For reproduction, see stack project at https://github.com/robinp/bugreports/tree/master/blackhole. Copying its readme here:

### Compiler & OS

GHC 7.10.3, 8.0.1, 8.0.2 (didn't check others).

Linux: 4.10.13-1-ARCH #1 SMP PREEMPT x86_64 GNU/Linux 4-core

### What does the code do?

TLDR endlessly forks a batch of 8 threads, and waits for them to finish. Each thread calls observeDuration on an irrelevant IO action. observeDuration measures the time, then updates some data structures in STM context in a TVar.

After a short while (usually within 1 minute), the program aborts with <<loop>>. See below for a more detailed investigation.

### To run

stack install

# Restart if doesn't terminate in a minute.

loop-exe +RTS -N4 -Ds > debug 2> debug.out ; beep

### Observe

debug shows the metering batches run for a while, then get stuck.

debug.out contains <<loop>>.

My debug log reading fu is poor, but in less cleaned-up versions of the program it was more trivial to see that two threads get blocked on each others blackhole.

In the current log there is mentioning of blackholes, but also MVars, and I don't see what's going on.

### Tracking down

When built with profiling (remove --eventlog --debug from cabal file, then stack clean then stack build --profile), and running with +RTS -N4 -xc the following stack traces appear:

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Prometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Prometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace: Main.meter, called from Main.mainPrometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (IND_STATIC), stack trace:

Prometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

loop-exe: loop-exe: <<loop>> <<loop>> loop-exe: thread blocked indefinitely in an MVar operation

An other stack trace from an other run as bonus (these two are representative):

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Prometheus.Metric.Summary.invariant, called from Prometheus.Metric.Summary.compress, called from Prometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Prometheus.Metric.Summary.invariant, called from Prometheus.Metric.Summary.compress, called from Prometheus.Metric.Summary.observe, called from Prometheus.Metric.Summary.observeDuration, called from Main.meter.action, called from Main.meter.cfork, called from Main.meter, called from Main.main

loop-exe: <<loop>> * Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Main.meter, called from Main.main

* Exception (reporting due to +RTS -xc): (THUNK_STATIC), stack trace:

Main.meter, called from Main.main

loop-exe: thread blocked indefinitely in an MVar operation

I tried to even simplify by removing prometheus-client dep, but it seems that the kind of computation done in Prometheus.Metric.Summary.observe, namely putting a lazy computation inside a data type stored in a TVar, are needed to trigger the blackholing. When I tried to simplify those operations, or bang the remaining few fields of the data type, the error didn't reproduce (at least couldn't reproduce as quick as it usually does).

The relevant pieces of code from prometheus-client's P.M.Summary module. Note: I manually replaced the MonadMonitor constraint with plain IO in that call chain, but it didn't have much effect.

                data Estimator = Estimator {
                    estCount      :: !Int64
                ,   estSum        :: !Double
                ,   estQuantiles  :: [Quantile]
                ,   estItems      :: [Item]
                } deriving (Show)

                newtype Summary = MkSummary (STM.TVar Estimator)

                observeDuration :: IO a -> Metric Summary -> IO a
                observeDuration io metric = do
                                start  <- getCurrentTime
                                result <- io
                                end    <- getCurrentTime
                                let dt = fromRational $ toRational $ end `diffUTCTime` start
                                withSummary metric dt
                                return result

                observe :: MonadMonitor m => Double -> Metric Summary -> m ()
                observe v s = withSummary s (insert v)

                withSummary :: MonadMonitor m => Metric Summary -> (Estimator -> Estimator) -> m ()
                withSummary (Metric {handle = MkSummary valueTVar}) f =
                                doIO $ STM.atomically $ do
                                                STM.modifyTVar' valueTVar compress
                                                STM.modifyTVar' valueTVar f


                insert :: Double -> Estimator -> Estimator
                insert = ...

                compress :: Estimator -> Estimator
                compress = ...

I checked insert and compress and they don't seem to be able to loop in any edge case, so this bug is likely an RTS thing.

### Related tickets I found:

https://ghc.haskell.org/trac/ghc/ticket/10218

GHC creates incorrect code which throws <<loop>>

https://ghc.haskell.org/trac/ghc/ticket/10414 :

Buggy behavior with threaded runtime (-N1 working, -N2 getting into <<loop>>)

Attachments (1)

Main.hs (7.0 KB) - added by RyanGlScott 19 months ago.

Download all attachments as: .zip

Change History (11)

comment:1 Changed 19 months ago by simonmar

Component: CompilerRuntime System
Milestone: 8.2.1
Priority: normalhighest

This definitely looks bad. Thanks for reporting and for making a test case.

comment:2 Changed 19 months ago by RyanGlScott

Summary: Runtime crash with <<loop>> after concurrent stressing of STM computatinos.Runtime crash with <<loop>> after concurrent stressing of STM computations

comment:3 Changed 19 months ago by RyanGlScott

For the sake of making it easier to run, I'm attaching a version without any dependencies. To run it:

ghc -fforce-recomp -O -threaded Main.hs
./Main +RTS -N4 -RTS

Changed 19 months ago by RyanGlScott

Attachment: Main.hs added

comment:4 Changed 19 months ago by George

fwiw above reproduces in 8.2.1-rc2 on a Mac with 4 cpus

comment:5 Changed 19 months ago by simonmar

I reproduced it too. It does look like a lost wakeup somewhere in the blackhole handling code, but I haven't narrowed it down further yet.

comment:6 Changed 19 months ago by simonmar

Owner: set to simonmar

Yep, I think I've found the cause of this. Working on a fix.

comment:7 Changed 18 months ago by simonmar

Differential Rev(s): Phab:D3630

comment:8 Changed 18 months ago by Simon Marlow <marlowsd@…>

In 5984729/ghc:

Fix a lost-wakeup bug in BLACKHOLE handling (#13751)

Summary:
The problem occurred when
* Threads A & B evaluate the same thunk
* Thread A context-switches, so the thunk gets blackholed
* Thread C enters the blackhole, creates a BLOCKING_QUEUE attached to
  the blackhole and thread A's `tso->bq` queue
* Thread B updates the blackhole with a value, overwriting the BLOCKING_QUEUE
* We GC, replacing A's update frame with stg_enter_checkbh
* Throw an exception in A, which ignores the stg_enter_checkbh frame

Now we have C blocked on A's tso->bq queue, but we forgot to check the
queue because the stg_enter_checkbh frame has been thrown away by the
exception.

The solution and alternative designs are discussed in Note [upd-black-hole].

This also exposed a bug in the interpreter, whereby we were sometimes
context-switching without calling `threadPaused()`.  I've fixed this
and added some Notes.

Test Plan:
* `cd testsuite/tests/concurrent && make slow`
* validate

Reviewers: niteria, bgamari, austin, erikd

Reviewed By: erikd

Subscribers: rwbarton, thomie

GHC Trac Issues: #13751

Differential Revision: https://phabricator.haskell.org/D3630

comment:9 Changed 18 months ago by simonmar

Status: newmerge

comment:10 Changed 18 months ago by bgamari

Resolution: fixed
Status: mergeclosed
Note: See TracTickets for help on using tickets.