#8972 closed feature request (fixed)

Investigate adding fast compare-and-swap Int type/primops

Reported by: tibbe Owned by: tibbe
Priority: normal Milestone: 7.10.1
Component: Compiler Version: 7.9
Keywords: Cc: simonmar, idhameed@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: #8157, #7883 Differential Revisions:

Description

I've received reports that using IORef Int and atomicModifyIORef to implement an atomic counter in the ekg package has become a bottleneck for some of its users. These users update the counter thousands of times per second, using multiple threads.

I will investigate whether adding a dedicated atomic Int reference type offers significant speed improvements. Such a type could also be used to implement cheaper locks (by using bits in the int to represent different lock states, e.g. for reader/writer locks).

Let's call this new type AtomicIntRef for now. It needs to support at least these functions:

add :: AtomicIntRef -> Int -> IO Int
set :: AtomicIntRef -> Int -> IO ()
get :: AtomicIntRef -> IO Int

add would be implemented using a lock-prefixed xaddq instruction. set and get are just plain loads and stores, which are atomic for aligned words on x86.

We might also want other functions, such as compare-and-swap (CAS). Furthermore, there are subtleties with memory barriers that might motivate having barrier and barrier-free versions of some functions.
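As a sketch, here is what this API could look like on top of the Data.Atomics.Counter module from the atomic-primops package (which comes up in the comments below). The function names and signatures used here are that package's, not part of this proposal; incrCounter returns the value after the increment:

import Data.Atomics.Counter
    (AtomicCounter, incrCounter, newCounter, readCounter, writeCounter)

type AtomicIntRef = AtomicCounter

new :: Int -> IO AtomicIntRef
new = newCounter

add :: AtomicIntRef -> Int -> IO Int
add ref n = incrCounter n ref   -- fetch-and-add; returns the new value

set :: AtomicIntRef -> Int -> IO ()
set = writeCounter

get :: AtomicIntRef -> IO Int
get = readCounter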

Attachments (2)

Bench.hs (1.0 KB) - added by tibbe 16 months ago.
counter.c (450 bytes) - added by tibbe 16 months ago.


Change History (16)

Changed 16 months ago by tibbe

comment:1 follow-up: Changed 16 months ago by tibbe

I've added a little benchmark that shows that even calling out to C can speed things up a lot.
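The attachment itself isn't inlined here, but a hypothetical reconstruction of its shape (the C symbol name incr_counter and the iteration count are invented for illustration) is: one hot loop bumping a counter ten million times, with -DOLD=1 selecting the IORef variant.

{-# LANGUAGE CPP #-}
module Main (main) where

import Control.Monad (replicateM_)
#ifdef OLD
import Data.IORef
#else
import Foreign (Ptr, malloc, poke)

-- A C function in counter.c doing an atomic increment, e.g. via
-- __sync_fetch_and_add; the symbol name is hypothetical.
foreign import ccall unsafe "incr_counter"
    cIncrCounter :: Ptr Int -> IO ()
#endif

main :: IO ()
#ifdef OLD
main = do
    ref <- newIORef (0 :: Int)
    replicateM_ 10000000 $ atomicModifyIORef' ref (\n -> (n + 1, ()))
#else
main = do
    p <- malloc
    poke p (0 :: Int)
    replicateM_ 10000000 (cIncrCounter p)
#endif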

Using IORef:

$ ghc -O2 Bench.hs counter.c -DOLD=1
...
$ time ./Bench 

real	0m1.698s
user	0m1.670s
sys	0m0.020s

Calling out to C:

$ ghc -O2 Bench.hs counter.c
...
$ time ./Bench 

real	0m0.108s
user	0m0.090s
sys	0m0.010s

That's a ~16x speed-up.

Things look similar with the threaded RTS:

Using IORef:

$ ghc -O2 Bench.hs counter.c -DOLD=1 -threaded
...
$ time ./Bench 

real	0m1.998s
user	0m1.980s
sys	0m0.010s

Calling out to C:

$ ghc -O2 Bench.hs counter.c -threaded
$ time ./Bench 

real	0m0.117s
user	0m0.100s
sys	0m0.010s

With +RTS -N6 (real hardware cores on the test machine):

Using IORef:

$ time ./Bench +RTS -N6

real	1m22.565s
user	1m23.850s
sys	1m11.240s

Calling out to C:

$ time ./Bench +RTS -N6

real	0m0.247s
user	0m1.340s
sys	0m0.010s

Atrocious results for the IORef solution! It seems like contended IORefs don't work well at all.

comment:2 Changed 16 months ago by carter

Unboxed references! Yes, that'd be a good idea.

Related tickets are #8157 and #7883.

Now that 7.8 is cut, I can take some time to work on #7883 and have some hope of getting it merged in.

comment:3 Changed 16 months ago by carter

Changed 16 months ago by tibbe

comment:4 Changed 16 months ago by rrnewton

Update: I confirmed that Data.Atomics.Counter (from atomic-primops) gives me exactly the same time as Johan (tibbe)'s counter.c version of the benchmark here.

Johan and I talked about this, and it seems like the obvious way to improve performance further is to inline primops for the fetch-and-add (#7883). We confirmed that the MutableByteArray# used by Data.Atomics.Counter is a single contiguous heap object, so the only advantage of a dedicated mutable reference type would be saving the word that represents the length.

More generally, I wonder if we can have a user-facing interface that makes mutable unboxed data more pleasant. Unboxed vectors are nice, but there's nothing like an IORef (# Int#, Int# #), for example, even though with double-word CAS we could support some interesting data types...

Last edited 16 months ago by tibbe (previous) (diff)

comment:5 in reply to: ↑ 1 Changed 15 months ago by jberryman

FWIW...

Replying to tibbe:

Atrocious results for the IORef solution! It seems like contended IORefs don't work well at all.

This is my experience. The behavior I've seen is interesting; check out the distribution of samples from a test of 100K concurrent atomicModifyIORefs, on 2 cores:

http://htmlpreview.github.io/?https://github.com/jberryman/chan-benchmarks/blob/master/reports/ghc-7.6/bench-multi.vars.html#b2

...and 8 cores:

http://htmlpreview.github.io/?https://github.com/jberryman/chan-benchmarks/blob/master/reports/ghc-7.6_8-core/bench-multi.vars.html#b2

The atomicModifyIORefCAS from atomic-primops doesn't display that nasty spread; with the primitive CAS one thread always wins, but it seems like whatever GHC is doing causes vicious retry loops, which sometimes stay in phase for a while.

And the fetch-and-add-based counter from atomic-primops handles contention far and away better than anything else available on Hackage. From what I remember, I couldn't even measure contention effects going from 2 to 8 contending threads (although I did run the two tests on different machines, so...).
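For concreteness, here's a minimal usage sketch of that CAS-based variant, assuming the Data.Atomics API from atomic-primops; its signature mirrors atomicModifyIORef:

import Data.Atomics (atomicModifyIORefCAS)
import Data.IORef (IORef)

-- Increment the counter via a CAS retry loop in the library,
-- returning the new value.
bump :: IORef Int -> IO Int
bump ref = atomicModifyIORefCAS ref (\n -> (n + 1, n + 1))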

Last edited 15 months ago by jberryman (previous) (diff)

comment:6 follow-ups: Changed 15 months ago by rrnewton

Thanks for these benchmarks!

I suspect atomicModifyIORef's poor behavior may have to do with blackholes? Tracing GHC events will perturb these workloads quite a bit, I guess, but maybe ThreadScope could still tell us something about blackhole events.

By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in practice:

  • between a forkIO and the new thread waking up on another CPU.
  • between two forkIO's from a parent thread waking up (gang synchrony)
  • delta between two threads blocked on the same MVar waking up

I noticed the third method is used in MainN.hs, and I think it is very standard. However, I wonder if it is worth having all the test threads busy-wait on a memory location to start with a greater degree of synchrony. That way we can have them run for exactly the same time interval and maximize contention over the whole period ;-).

In any case, these benchmarks are great. I'd like to run these on a few different machines with our nightly benchmarking stuff. I read the README but I can't currently build the benchmarks because I don't have chan-split-fast -- where is that?

P.S. We currently accumulate benchmark data into Google Fusion Tables (but we'd like to switch to BigQuery or Cloud SQL or something). HSBencher does the upload, but it doesn't yet integrate well with criterion. It's focused more on long-running (parallel) benchmarks than on statistically significant sampling of short runs. Still, adding support for consuming criterion reports would be great.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 15 months ago by jberryman

Replying to rrnewton:

By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in practice:

  • between a forkIO and the new thread waking up on another CPU.

.......

  • between two forkIO's from a parent thread waking up (gang synchrony)

Good question! Don't have enough cores on my real machine to look at that right now.

  • delta between two threads blocked on the same MVar waking up

I noticed the third method is used in MainN.hs, and I think it is very standard. However, I wonder if it is worth having all the test threads busy-wait on a memory location to start with a greater degree of synchrony. That way we can have them run for exactly the same time interval and maximize contention over the whole period ;-).

I think that's a good idea. I started doing that in later tests/benchmarks: pass an IORef Int to each forked thread; each thread increments the ref and then busy-waits until the counter reaches the target.
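A minimal sketch of that start-barrier idea (the function name and shape are made up here, not the actual benchmark code):

import Control.Concurrent (forkIO)
import Control.Monad (forM_, unless)
import Data.IORef

-- Fork nThreads workers; each checks in on a shared IORef and then
-- busy-waits until everyone has checked in, so all workers enter the
-- contended action at (nearly) the same instant.
startTogether :: Int -> IO () -> IO ()
startTogether nThreads action = do
    ready <- newIORef (0 :: Int)
    forM_ [1 .. nThreads] $ \_ -> forkIO $ do
        atomicModifyIORef' ready (\n -> (n + 1, ()))
        let spin = do
                n <- readIORef ready
                unless (n == nThreads) spin
        spin                       -- busy-wait for the full gang
        action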

In any case, these benchmarks are great. I'd like to run these on a few different machines with our nightly benchmarking stuff. I read the README but I can't currently build the benchmarks because I don't have chan-split-fast -- where is that?

Sorry, this whole project is really just my personal messy scratchwork; even the README is just personal notes that are out of date anyway, and I should probably just remove it. You'll have a hard time building it for a number of reasons, and might have better luck just copy-pasting what looks interesting. I'm happy to help extract and clean up any of them that you're interested in.

P.S. We currently accumulate benchmark data into Google Fusion Tables (but we'd like to switch to BigQuery or Cloud SQL or something). HSBencher does the upload, but it doesn't yet integrate well with criterion. It's focused more on long-running (parallel) benchmarks than on statistically significant sampling of short runs. Still, adding support for consuming criterion reports would be great.

Oh cool, I'd like to look into how you're doing all that. Yeah, I was initially nervous about using criterion for these kinds of benchmarks (involving concurrency), but it's actually worked really well, and seeing the distributions has been really enlightening.

comment:8 in reply to: ↑ 7 Changed 15 months ago by jberryman

Replying to jberryman:

Replying to rrnewton:

By the way, I started peeking at the benchmark code. I don't have a good sense currently of how many cycles each of the following takes in practice:

  • between a forkIO and the new thread waking up on another CPU.

.......

Oops, sorry! Here I meant to write that I'd sort of benchmarked this already, but I've just added a better test (fork a putMVar, then block on takeMVar in the main thread), which I measure at between 5.4 and 5.8 μs. So as long as you can get your contention test runs into the millisecond range, those effects should be completely hidden. Feel free to email me if this is distracting from this ticket and you want to chat further.
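For clarity, the operation being timed is just this (a sketch of the shape; the real test presumably runs it under criterion):

import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (void)

-- Fork a thread that immediately fills an MVar, and block on that
-- MVar in the parent; the round trip is the fork-to-wakeup latency.
forkWakeup :: IO ()
forkWakeup = do
    done <- newEmptyMVar
    void (forkIO (putMVar done ()))
    takeMVar done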

comment:9 Changed 15 months ago by hvr

  • Milestone set to 7.10.1

comment:10 in reply to: ↑ 6 ; follow-up: Changed 15 months ago by nominolo

Replying to rrnewton:

I suspect atomicModifyIORef's poor behavior may have to do with blackholes? Tracing GHC events will perturb these workloads quite a bit, I guess, but maybe ThreadScope could still tell us something about blackhole events.

We were affected by this issue; we looked at the status of all threads and noticed that they all get blocked on a blackhole. We suspect there are two issues here:

  1. Poor cache behaviour due to the CAS retry busy loop.
  2. All the threads get queued up waiting for the thread that currently owns the blackhole to get scheduled again.

The second point is a potential issue for many other applications (e.g., if a thread is holding on to an MVar). In the operating-system world, these kinds of issues are dealt with using either:

  • interrupt blocking: i.e., a thread cannot be de-scheduled while holding the lock
  • time-slice donation / priority boosting: if a thread tries to enter a critical section that is already held and still has some allotted compute time left, it "donates" that time to the thread owning the critical section, so that the owner makes progress and leaves the critical section. If you have thread priorities, you can temporarily raise the priority of the thread in the critical section. This is rather complicated because priority boosting may have to be recursive (you have to boost to the priority of the highest-priority waiting thread), so not everyone likes this method.

For GHC, I think it could be useful to have both mechanisms available. Marking a thread as non-interruptible could be exported from a "here be dragons" module, for use when you know that a thread will hold a lock/MVar only for a very short time.

For thunks, when a thread blocks on a blackhole, the scheduler should probably schedule the thread that owns the blackhole next. This way, the time spent evaluating a blackhole will be proportional to the number of threads waiting on it. I don't know if the reduced fairness causes issues here, though. You could also mark a blackhole as "boosted", meaning that as soon as the blackhole is updated we not only wake up all the blocked threads into the run queue but also yield to the first thread that was waiting on the blackhole.

comment:11 Changed 15 months ago by simonmar

  • Cc simonmar added

comment:12 in reply to: ↑ 10 Changed 15 months ago by jberryman

Replying to nominolo:

For GHC, I think it could be useful to have both mechanisms available. Marking a thread as non-interruptible could be exported from a "here be dragons" module, for use when you know that a thread will hold a lock/MVar only for a very short time.

Would this be something like mask, but for scheduling? If so, I want it. From a library-writer's point of view, this would be a really easy way to get rid of a lot of bad scheduling effects, e.g. in probably every use of modifyMVar. Sorry if I'm not understanding.

comment:13 Changed 15 months ago by ihameed

  • Cc idhameed@… added

comment:14 Changed 13 months ago by tibbe

  • Resolution set to fixed
  • Status changed from new to closed

We added atomic primops for byte arrays (which can be used to implement a single-cell mutable variable) in d8abf85f8ca176854e9d5d0b12371c4bc402aac3.
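A minimal sketch of such a single-cell variable on top of the new primops (assuming GHC >= 7.10; note that fetchAddIntArray# returns the value before the addition, and the 8-byte allocation assumes a 64-bit word):

{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.IO (IO (..))

data AtomicIntRef = AIR (MutableByteArray# RealWorld)

new :: Int -> IO AtomicIntRef
new (I# n) = IO $ \s0 ->
    case newByteArray# 8# s0 of                 -- one 64-bit word
      (# s1, mba #) -> case atomicWriteIntArray# mba 0# n s1 of
        s2 -> (# s2, AIR mba #)

add :: AtomicIntRef -> Int -> IO Int            -- returns the old value
add (AIR mba) (I# n) = IO $ \s ->
    case fetchAddIntArray# mba 0# n s of
      (# s', old #) -> (# s', I# old #)

get :: AtomicIntRef -> IO Int
get (AIR mba) = IO $ \s ->
    case atomicReadIntArray# mba 0# s of
      (# s', n #) -> (# s', I# n #)

set :: AtomicIntRef -> Int -> IO ()
set (AIR mba) (I# n) = IO $ \s ->
    (# atomicWriteIntArray# mba 0# n s, () #)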
