Opened 5 months ago

Last modified 4 months ago

#8578 new task

Improvements to SpinLock implementation

Reported by: parcs Owned by:
Priority: normal Milestone:
Component: Runtime System Version: 7.7
Keywords: Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:


The SpinLock implementation has a number of deficiencies. Here is a pair of patches that improves its implementation. Let me know what you think.
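For context, GHC's SpinLock is essentially a machine word manipulated with atomic operations. The following is a simplified, self-contained sketch of the pre-patch scheme, using GCC's `__sync` builtins rather than the RTS's own atomic primitives; names and the exact 1 = free / 0 = held convention are assumptions for illustration, and details may differ from the actual rts/ code:

```c
#include <stdint.h>

/* Simplified stand-in for the RTS's SpinLock (a StgWord in GHC).
 * Convention assumed here: 1 = free, 0 = held. */
typedef volatile uint64_t SpinLock;

static inline void init_spin_lock(SpinLock *p)
{
    *p = 1;
}

/* Spin, retrying an atomic compare-and-swap from 1 -> 0 until it succeeds. */
static inline void acquire_spin_lock(SpinLock *p)
{
    while (__sync_val_compare_and_swap(p, 1, 0) != 1)
        ; /* busy-wait; this is the loop the patches rework */
}

static inline void release_spin_lock(SpinLock *p)
{
    __sync_synchronize(); /* make prior writes visible before releasing */
    *p = 1;
}
```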

Attachments (2)

0001-Consolidate-common-spin-lock-code.patch (2.3 KB) - added by parcs 5 months ago.
0002-Rewrite-ACQUIRE_SPIN_LOCK.patch (2.6 KB) - added by parcs 5 months ago.


Change History (10)

Changed 5 months ago by parcs

comment:1 Changed 5 months ago by parcs

  • Status changed from new to patch

comment:2 Changed 5 months ago by parcs

According to a C microbenchmark, the new spinlock code is much more scalable than the old code. However, this scalability improvement remains to be seen in a GHC program. Is there a program one can test that is particularly sensitive to the scalability of the RTS's spinlocks?

comment:3 Changed 5 months ago by simonmar

The patch looks like a nice improvement, thanks! Before committing I want to run some benchmarks. The usual ones I use are nofib/parallel running on various numbers of cores, at least 8. I can try these sometime next week.

comment:4 Changed 5 months ago by thoughtpolice

Very good stuff, Patrick - this looks nice to me, and I can see how it'd help scalability.

I took a quick glance, and one question I have is: why are we using busy_wait_nop in the wait loop, which just becomes rep; nop on x86?

For spinlock implementations, Intel at least has a PAUSE instruction which specifically informs the CPU that it is executing a spin-wait loop, allowing it to optimize cache and memory accesses and avoid the memory-ordering violations that would otherwise require the CPUs to synchronize. The effect can be quite dramatic on older processors, I believe (I wouldn't be surprised if rep; nop devolves into pause on some microarchitectures, but that's probably not guaranteed). PAUSE may also actually *delay* the CPU to let this optimization happen, whereas rep; nop will merely run as fast as possible. So I'd suggest replacing busy_wait_nop on x86 with something like _mm_pause from GCC:

#define _mm_pause()                             \
  ({                                            \
    __asm__ __volatile__("pause" ::: "memory"); \
  })

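To show the macro in use, here is a tiny wait loop built around it. This is only an illustrative sketch: the macro is renamed cpu_pause to avoid clashing with the real _mm_pause intrinsic from xmmintrin.h, and the no-op fallback for non-x86 targets is my own addition:

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* The pause instruction hints to the CPU that this is a spin-wait loop. */
#define cpu_pause()                             \
  ({                                            \
    __asm__ __volatile__("pause" ::: "memory"); \
  })
#else
#define cpu_pause() ((void)0) /* no-op on non-x86 targets (assumption) */
#endif

/* Spin until another thread stores a non-zero value into *flag,
 * pausing between reads to ease pressure on the sibling hyperthread. */
static uint64_t wait_for_flag(volatile uint64_t *flag)
{
    while (*flag == 0)
        cpu_pause();
    return *flag;
}
```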
Finally, one thing I did for fun a while back was write an accelerated spinlock implementation using the new TSX extensions in Intel processors. In my very non-scientific experiments, this essentially eliminated any lock contention (or even any need to take a lock) in a lot of cases. I wonder if it's worth adding here to see if there's any difference in throughput or contention. You can find the code here (pinpointed to the code in question):

Note that my example does *not* use lock-elision prefixes for xchg (which are backwards compatible), but instead uses the new TSX RTM instructions to wrap locks/unlocks in a transaction, allowing speculative elision. In practice this works just as well, and RTM is more flexible.

It would however require checking cpuid to see if TSX is available and, if so, dynamically replacing the code path as I have done. On Haswell, this indirect branch would probably be predicted so well that its overhead is basically zero, but I don't know what it might do to e.g. AMD machines in terms of prediction or throughput.
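To make the elision idea concrete, here is a rough sketch using the RTM intrinsics (_xbegin/_xend/_xabort from immintrin.h) gated by a cpuid check, falling back to a plain spinlock when TSX is absent. This is my own illustration, not the code linked above, and all the names are hypothetical:

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>
#include <cpuid.h>
#endif

typedef volatile uint64_t SpinLock; /* assumed convention: 1 = free, 0 = held */

/* CPUID.(EAX=7,ECX=0):EBX bit 11 advertises RTM support. */
static int rtm_available(void)
{
#if defined(__x86_64__)
    unsigned a, b, c, d;
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d))
        return (b >> 11) & 1;
#endif
    return 0;
}

static void spin_acquire(SpinLock *p)
{
    while (__sync_val_compare_and_swap(p, 1, 0) != 1)
        ;
}

static void spin_release(SpinLock *p)
{
    __sync_synchronize();
    *p = 1;
}

#if defined(__x86_64__)
/* Start a transaction; succeed only if the lock looks free, so that the
 * lock word joins the read set and a real acquisition aborts us. */
__attribute__((target("rtm")))
static int try_elide_acquire(SpinLock *p)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        if (*p == 1)
            return 1;      /* run the critical section transactionally */
        _xabort(0xff);     /* lock actually held: abort, take it for real */
    }
    return 0;              /* transaction aborted: caller falls back */
}

__attribute__((target("rtm")))
static void elide_end(void) { _xend(); }
#endif

/* Elided acquire: try a transaction first, fall back to the real lock. */
static void elided_acquire(SpinLock *p, int *elided)
{
    *elided = 0;
#if defined(__x86_64__)
    if (rtm_available() && try_elide_acquire(p)) {
        *elided = 1;
        return;
    }
#endif
    spin_acquire(p);
}

static void elided_release(SpinLock *p, int elided)
{
#if defined(__x86_64__)
    if (elided) { elide_end(); return; }
#endif
    (void)elided;
    spin_release(p);
}
```

Whether the transactional or the fallback path runs is decided at runtime, so the same binary works on non-TSX hardware.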

In any case - Patrick, thanks. If you'd like to test the elision thing, or the impact of _mm_pause, I have a powerful Haswell server available (8 hardware threads/32GB RAM) you can play around with for testing.

comment:5 Changed 5 months ago by simonmar

"rep; nop" is the pause instruction :-)

comment:6 Changed 5 months ago by thoughtpolice

Ah, yes, you're right - I did some more digging. The tricky bit is that rep; nop and pause actually share the same opcode (F3 90) for backwards compatibility, but on newer CPUs that encoding is treated specially (yielding resources to the other hyperthread on the same core), and pause is simply an alias to make the intent clearer.

In that case, ignore me and my inability to keep up with Intel shenanigans.

comment:7 Changed 4 months ago by simonmar

Here are my results with -N4 on an Intel Core i7-3770 (4 cores, 8 threads).

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
   blackscholes          +0.0%     +0.0%     -1.7%     -2.4%     -0.3%
          coins          +0.0%     -0.0%     +0.4%     +1.0%     -8.6%
           gray          +0.0%     +0.0%    +15.1%    +14.3%     +0.0%
         mandel          +0.0%     +0.0%     +3.3%     +3.3%     -0.8%
        matmult          +0.0%     +8.1%     -2.4%     -2.6%     +0.0%
        minimax          +0.0%     +0.0%     -1.3%     -1.1%     +0.0%
          nbody          +0.0%     -6.0%     -1.9%      0.06     +0.0%
         parfib          +0.0%     +0.1%    +16.2%    +16.2%     +0.0%
        partree          +0.0%     -0.0%     +1.0%     +0.5%     -3.0%
           prsa          +0.0%     -0.1%     +1.1%     +0.9%     +0.0%
         queens          +0.0%     -0.5%     -1.3%     -0.5%     +7.1%
            ray          +0.0%     -0.3%     -0.4%     -0.5%     +0.0%
       sumeuler          +0.0%     +0.0%     +1.0%     +1.0%     +0.0%
      transclos          +0.0%     +0.0%     +1.2%     +1.4%     +0.0%
            Min          +0.0%     -6.0%     -2.4%     -2.6%     -8.6%
            Max          +0.0%     +8.1%    +16.2%    +16.2%     +7.1%
 Geometric Mean          +0.0%     +0.1%     +2.0%     +2.3%     -0.4%

Not good! Two programs (gray and parfib) are significantly worse.

The effect is real, here is the timing info for parfib before and after:

5.70user 0.00system 0:01.43elapsed 397%CPU (0avgtext+0avgdata 20816maxresident)k
6.52user 0.00system 0:01.64elapsed 397%CPU (0avgtext+0avgdata 21568maxresident)k

I wonder whether not using a locked instruction in the spinlock might cause the loop to spin for longer, because it takes longer for the memory write to reach the core that is waiting for it?

Someone could probably dig into this further with perf. But the lesson here, as usual, is to always benchmark, and not to assume that because something looks good it will work in practice!
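One way to picture the hypothesis above: compare a loop that retries the atomic operation on every iteration against one that spins on a plain read and only attempts the atomic swap when the lock looks free. This is a generic test-and-test-and-set sketch under the assumed 1 = free / 0 = held convention, not the actual patch:

```c
#include <stdint.h>

typedef volatile uint64_t SpinLock; /* assumed: 1 = free, 0 = held */

/* Retry the locked operation every iteration: each cas pulls the cache
 * line into exclusive state, generating coherence traffic even while
 * the lock is held by another core. */
static void acquire_locked(SpinLock *p)
{
    while (__sync_val_compare_and_swap(p, 1, 0) != 1)
        ;
}

/* Test-and-test-and-set: spin on a plain read, so waiters share the
 * line read-only; only attempt the locked operation once the lock
 * appears free. */
static void acquire_ttas(SpinLock *p)
{
    for (;;) {
        if (__sync_val_compare_and_swap(p, 1, 0) == 1)
            return;
        while (*p == 0)
            ; /* plain read; no bus locking */
    }
}
```

The flip side, as suggested above, is that with only plain reads in the inner loop, the release may take longer to propagate to the waiting core, so the lock can be observed free later than with a locked instruction hammering the line.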

comment:8 Changed 4 months ago by simonmar

  • Owner parcs deleted
  • Status changed from patch to new

Moving out of the patch state pending further investigation.
