Performance issues with blackholes
The attached program demonstrates an acute performance problem when many threads are blocking on blackholes.
To run it:
$ cd event
$ cabal configure && cabal build
$ cd benchmarks
$ make thread-delay
$ ./thread-delay -n 20000
often the program completes in under a second, but sometimes it takes much longer (15s up to a minute).
I diagnosed the problem: in this program, thousands of threads compete to update a single IORef
using atomicModifyIORef
. If one of these threads is pre-empted while evaluating the thunk left in the IORef
by atomicModifyIORef
, then the thunk becomes a blackhole. From that point on, all the other threads become blocked on blackholes, as they each call atomicModifyIORef
and evaluate the result.
Eventually there are thousands of threads on the blackhole queue, and since this queue is traversed at least once and possibly several times during GC, performance drops severely.
Some thoughts I've had follow. There isn't an obvious good solution yet.
- We can't attach blocked threads to the blackhole directly, due to the race conditions this would cause with parallel execution (we replaced the blackhole blocking queues with the global blackhole queue in 6.6 when implementing parallel execution).
- maintaining a hash-table mapping blackholes to blocked threads would improve wakeup times, but wouldn't help here, because we'd then have a large hash table to update on each GC. I have an unpushed patch to do just this sitting around (attached); it didn't improve performance when I tried it, and is rather more complicated than the current design.
- we could divide the blackhole queue into per-generation queues as we've done with several other data structures. This would help GC to scale, but doesn't address the root problems.
- If instead of blackholing the thunk we suspended it (ala
raiseAsync
), that would solve the problem. It could be done inthreadPaused
, but how do we know when to do this? - Another solution along similar lines would be to just not blackhole the thunk at all: we let other threads evaluate it again. This would make sense for thunks with a fixed known-small evaluation cost, such as selectors. Unfortunately in this case the thunk in the
IORef
is not known to be small, although the two selector thunks also created byatomicModifyIORef
do fall into this category. (Note however that blackholing is currently mandatory due to a problem that otherwise occurs in parallel GC: seeNOTE [upd-black-hole]
in [[GhcFile(rts/sm/Scav.c)]]). - If we knew which threads owned which blackholes, then we could arrange to schedule threads on which others were blocked more quickly. This is like a priority-inversion problem: one thread is blocking all the others, we need to increase its priority. Unfortunately we don't have a mapping from blackholes to threads available, and maintaining it would introduce an overhead. In this case we have a cascade of threads blocked on each other, so choosing the right scheduling is particularly difficult.
- Using STM would avoid the problem, as long as the program is careful to not store any thunks in a
TVar
. However, STM is rather more expensive, and the reason for usingatomicModifyIORef
in the first place was speed. - Using
MVar
would not solve the problem, asmodifyMVar
can be pre-empted while holding theMVar
, which would lead to a similar problem (in fact, it would be impossible for the RTS to do anything in this case, since we can't revert anMVar
or know which thread is holding it).