Performance issues with blackholes

The attached program demonstrates an acute performance problem when many threads are blocking on blackholes.

To run it:

$ cd event
$ cabal configure && cabal build
$ cd benchmarks
$ make thread-delay
$ ./thread-delay -n 20000

often the program completes in under a second, but sometimes it takes much longer (15s up to a minute).

I diagnosed the problem: in this program, thousands of threads compete to update a single IORef using atomicModifyIORef. If one of these threads is pre-empted while evaluating the thunk left in the IORef by atomicModifyIORef, then the thunk becomes a blackhole. From that point on, all the other threads become blocked on blackholes, as they each call atomicModifyIORef and evaluate the result.

Eventually there are thousands of threads on the blackhole queue, and since this queue is traversed at least once and possibly several times during GC, performance drops severely.

Some thoughts I've had follow. There isn't an obvious good solution yet.

We can't attach blocked threads to the blackhole directly, due to the race conditions this would cause with parallel execution (we replaced the blackhole blocking queues with the global blackhole queue in 6.6 when implementing parallel execution).
maintaining a hash-table mapping blackholes to blocked threads would improve wakeup times, but wouldn't help here, because we'd then have a large hash table to update on each GC. I have an unpushed patch to do just this sitting around (attached); it didn't improve performance when I tried it, and is rather more complicated than the current design.
we could divide the blackhole queue into per-generation queues as we've done with several other data structures. This would help GC to scale, but doesn't address the root problems.
If instead of blackholing the thunk we suspended it (ala raiseAsync), that would solve the problem. It could be done in threadPaused, but how do we know when to do this?
Another solution along similar lines would be to just not blackhole the thunk at all: we let other threads evaluate it again. This would make sense for thunks with a fixed known-small evaluation cost, such as selectors. Unfortunately in this case the thunk in the IORef is not known to be small, although the two selector thunks also created by atomicModifyIORef do fall into this category. (Note however that blackholing is currently mandatory due to a problem that otherwise occurs in parallel GC: see NOTE [upd-black-hole] in [[GhcFile(rts/sm/Scav.c)]]).
If we knew which threads owned which blackholes, then we could arrange to schedule threads on which others were blocked more quickly. This is like a priority-inversion problem: one thread is blocking all the others, we need to increase its priority. Unfortunately we don't have a mapping from blackholes to threads available, and maintaining it would introduce an overhead. In this case we have a cascade of threads blocked on each other, so choosing the right scheduling is particularly difficult.
Using STM would avoid the problem, as long as the program is careful to not store any thunks in a TVar. However, STM is rather more expensive, and the reason for using atomicModifyIORef in the first place was speed.
Using MVar would not solve the problem, as modifyMVar can be pre-empted while holding the MVar, which would lead to a similar problem (in fact, it would be impossible for the RTS to do anything in this case, since we can't revert an MVar or know which thread is holding it).

Edited Mar 09, 2019 by Simon Marlow

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information