Opened 9 years ago

Closed 9 years ago

Last modified 4 years ago

#3838 closed bug (fixed)

Performance issues with blackholes

Reported by: simonmar
Owned by: simonmar
Priority: normal
Milestone: 6.14 branch
Component: Runtime System
Version: 6.12.1
Keywords:
Cc: bos@…, tibbe
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description (last modified by simonmar)

The attached program demonstrates an acute performance problem when many threads are blocking on blackholes.

To run it:

$ cd event
$ cabal configure && cabal build
$ cd benchmarks
$ make thread-delay
$ ./thread-delay -n 20000

Often the program completes in under a second, but sometimes it takes much longer (from 15 seconds up to a minute).

I diagnosed the problem: in this program, thousands of threads compete to update a single IORef using atomicModifyIORef. If one of these threads is pre-empted while evaluating the thunk left in the IORef by atomicModifyIORef, then the thunk becomes a blackhole. From that point on, all the other threads become blocked on blackholes, as they each call atomicModifyIORef and evaluate the result.
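
A distilled version of the contention pattern (a hypothetical sketch, not the attached benchmark; the counter and nThreads are illustrative only):

  import Control.Concurrent
  import Control.Exception (evaluate)
  import Control.Monad
  import Data.IORef

  main :: IO ()
  main = do
    ref  <- newIORef (0 :: Int)
    done <- newEmptyMVar
    let nThreads = 20000
    forM_ [1..nThreads] $ \_ -> forkIO $ do
      -- atomicModifyIORef swaps an *unevaluated* thunk (old + 1) into the
      -- IORef; the evaluate below then forces it.  If the forcing thread is
      -- pre-empted mid-evaluation, the thunk becomes a blackhole and every
      -- other thread forcing its own result blocks behind it.
      new <- atomicModifyIORef ref (\old -> let n = old + 1 in (n, n))
      _ <- evaluate new
      putMVar done ()
    replicateM_ nThreads (takeMVar done)

Each update leaves a thunk chaining back to earlier updates, so whichever thread forces its result first ends up evaluating the whole chain on behalf of everyone else.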

Eventually there are thousands of threads on the blackhole queue, and since this queue is traversed at least once and possibly several times during GC, performance drops severely.

Some thoughts I've had follow. There isn't an obvious good solution yet.

  • We can't attach blocked threads to the blackhole directly, due to the race conditions this would cause with parallel execution (we replaced the blackhole blocking queues with the global blackhole queue in 6.6 when implementing parallel execution).
  • Maintaining a hash table mapping blackholes to blocked threads would improve wakeup times, but wouldn't help here, because we'd then have a large hash table to update on each GC. I have an unpushed patch that does just this sitting around (attached); it didn't improve performance when I tried it, and it is rather more complicated than the current design.
  • We could divide the blackhole queue into per-generation queues, as we've done with several other data structures. This would help GC scale, but doesn't address the root problem.
  • If instead of blackholing the thunk we suspended it (à la raiseAsync), that would solve the problem. It could be done in threadPaused, but how do we know when to do this?
  • Another solution along similar lines would be to just not blackhole the thunk at all: we let other threads evaluate it again. This would make sense for thunks with a fixed known-small evaluation cost, such as selectors. Unfortunately in this case the thunk in the IORef is not known to be small, although the two selector thunks also created by atomicModifyIORef do fall into this category. (Note however that blackholing is currently mandatory due to a problem that otherwise occurs in parallel GC: see NOTE [upd-black-hole] in rts/sm/Scav.c).
  • If we knew which threads owned which blackholes, then we could arrange to schedule threads on which others were blocked more quickly. This is like a priority-inversion problem: one thread is blocking all the others, we need to increase its priority. Unfortunately we don't have a mapping from blackholes to threads available, and maintaining it would introduce an overhead. In this case we have a cascade of threads blocked on each other, so choosing the right scheduling is particularly difficult.
  • Using STM would avoid the problem, as long as the program is careful not to store any thunks in a TVar (see the sketch after this list). However, STM is rather more expensive, and the reason for using atomicModifyIORef in the first place was speed.
  • Using MVar would not solve the problem, as modifyMVar can be pre-empted while holding the MVar, which would lead to a similar problem (in fact, it would be impossible for the RTS to do anything in this case, since we can't revert an MVar or know which thread is holding it).
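
The last two bullets hint at user-level workarounds. A minimal sketch of both, assuming the shared state is a simple counter (bumpIORef and bumpTVar are invented names, not library functions):

  import Control.Concurrent.STM
  import Control.Exception (evaluate)
  import Data.IORef

  -- Force the new value immediately, so the thunk left in the IORef is
  -- normally evaluated by the thread that created it rather than by its
  -- competitors.  This narrows, but does not close, the window in which
  -- other threads can see the thunk.  (Later versions of base provide
  -- atomicModifyIORef' for this pattern.)
  bumpIORef :: IORef Int -> IO ()
  bumpIORef ref = do
    new <- atomicModifyIORef ref (\old -> let n = old + 1 in (n, n))
    _ <- evaluate new
    return ()

  -- With STM, write only fully evaluated values into the TVar, so no thunk
  -- is ever stored in the shared cell.
  bumpTVar :: TVar Int -> IO ()
  bumpTVar tv = atomically $ do
    n <- readTVar tv
    writeTVar tv $! n + 1   -- ($!) forces the value before the write

Neither removes the cost of contention; they only keep long-lived thunks out of the shared cell.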

Attachments (2)

event.tar.bz2 (55.3 KB) - added by simonmar 9 years ago.
code and benchmark
blackhole-hash.patch.gz (66.5 KB) - added by simonmar 9 years ago.


Change History (11)

Changed 9 years ago by simonmar

Attachment: event.tar.bz2 added

code and benchmark

Changed 9 years ago by simonmar

Attachment: blackhole-hash.patch.gz added

comment:1 Changed 9 years ago by simonmar

Description: modified (diff)

comment:2 Changed 9 years ago by simonmar

Description: modified (diff)

comment:3 Changed 9 years ago by bos

Cc: bos@… added

comment:4 Changed 9 years ago by bos

See also http://github.com/bos/event/commit/29abdc23d7d693e0f050ea81d288c9ba541f20ff for an example of how this makes performance with IORefs very difficult to understand or predict.

comment:5 Changed 9 years ago by tibbe

Cc: tibbe added

comment:6 Changed 9 years ago by simonmar

After several weeks of research, heated discussion, hacking, epic debugging, and with casualties including one whiteboard:

$ ./thread-delay -n 20000 +RTS -s
     134,165,424 bytes allocated in the heap
     151,967,568 bytes copied during GC
      23,767,936 bytes maximum residency (7 sample(s))
       7,930,616 bytes maximum slop
              81 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:   239 collections,     0 parallel,  0.25s,  0.25s elapsed
  Generation 1:     7 collections,     0 parallel,  0.10s,  0.15s elapsed

  Parallel GC work balance: nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.16s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.25s    (  0.28s)
  Task  2 (worker) :    0.00s    (  0.16s)       0.00s    (  0.00s)
  Task  3 (bound)  :    0.00s    (  0.16s)       0.09s    (  0.11s)

  SPARKS: 0 (0 converted, 0 pruned)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.14s  (  0.16s elapsed)
  GC    time    0.34s  (  0.40s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.48s  (  0.56s elapsed)

And it always runs in <0.6s. Now all I have to do is get these patches cleaned up and committed.

comment:7 Changed 9 years ago by simonmar

Resolution: fixed
Status: new → closed

Patches committed today. The main ones are:

Mon Mar 29 07:44:56 PDT 2010  Simon Marlow <marlowsd@gmail.com>
  * New implementation of BLACKHOLEs
  
  This replaces the global blackhole_queue with a clever scheme that
  enables us to queue up blocked threads on the closure that they are
  blocked on, while still avoiding atomic instructions in the common
  case.
  
  Advantages:
  
   - gets rid of a locked global data structure and some tricky GC code
     (replacing it with some per-thread data structures and different
     tricky GC code :)
  
   - wakeups are more prompt: parallel/concurrent performance should
     benefit.  I haven't seen anything dramatic in the parallel
     benchmarks so far, but a couple of threading benchmarks do improve
     a bit.
  
   - waking up a thread blocked on a blackhole is now O(1) (e.g. if
     it is the target of throwTo).
  
   - less sharing and better separation of Capabilities: communication
     is done with messages, the data structures are strictly owned by a
     Capability and cannot be modified except by sending messages.
  
   - this change will ultimately enable us to do more intelligent
     scheduling when threads block on each other.  This is what started
     off the whole thing, but it isn't done yet (#3838).
  
  I'll be documenting all this on the wiki in due course.


Mon Mar 29 07:45:21 PDT 2010  Simon Marlow <marlowsd@gmail.com>
  * Move a thread to the front of the run queue when another thread blocks on it
  This fixes #3838, and was made possible by the new BLACKHOLE
  infrastructure.  To allow reordering of the run queue I had to make it
  doubly-linked, which entails some extra trickiness with regard to
  GC write barriers and suchlike.
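
As a rough illustration of that second patch's scheduling idea, here is a toy model in plain Haskell (not RTS code; promoteOwner and the Seq-based run queue are invented for this sketch):

  import Data.Sequence (Seq, (<|))
  import qualified Data.Sequence as Seq

  -- When a thread blocks on a BLACKHOLE, pull the thread that owns that
  -- blackhole to the front of the run queue, so the thunk gets evaluated,
  -- and its waiters woken, as soon as possible.
  promoteOwner :: Int -> Seq Int -> Seq Int
  promoteOwner owner runQueue =
    case Seq.findIndexL (== owner) runQueue of
      Nothing -> runQueue                           -- owner not on this queue
      Just i  -> owner <| Seq.deleteAt i runQueue   -- move it to the front

In the real RTS the run queue is a doubly-linked list of TSOs, which is what makes this reordering cheap.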

comment:8 Changed 9 years ago by bos

Beautiful. Thanks, Simon!

comment:9 Changed 4 years ago by Austin Seipp <austin@…>

In a520761d065a84838896e8dd09d8aaec77480d60/ghc:

Remove outdated TODO in TimeManager

Summary:
It describes a workaround for Trac #3838, but that bug is already fixed and
the workaround was removed in Trac #7653.

Test Plan: not needed

Reviewers: hvr, Mikolaj, austin

Reviewed By: austin

Subscribers: thomie, carter

Differential Revision: https://phabricator.haskell.org/D478