Opened 11 years ago

Closed 4 years ago

Last modified 4 years ago

#693 closed task (fixed)

dynamic locking

Reported by: simonmar Owned by: simonmar
Priority: low Milestone: 7.8.1
Component: Runtime System Version: 6.4.1
Keywords: Cc: shumovichy@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: N/A
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by simonmar)

Now that the SMP and threaded runtimes are the same, there is some unnecessary locking going on in the single threaded case. For example, MVar operations now require an atomic memory operation, which can be quite slow. It might be a good idea to dynamically turn on these locks if we're running multi-threaded (ie. +RTS -N2 or greater). ToDo: measure the difference.

Attachments (1)

0001-Dynamically-turn-on-off-locking-in-take-put-mvar-693.patch (2.0 KB) - added by Yuras 4 years ago.

Download all attachments as: .zip

Change History (17)

comment:1 Changed 11 years ago by simonmar

Description: modified (diff)

comment:2 Changed 10 years ago by igloo

Test Case: N/A

comment:3 Changed 10 years ago by simonmar

Priority: normallow

comment:4 Changed 9 years ago by igloo

Milestone: 6.8 branch_|_

comment:5 Changed 8 years ago by simonmar

Architecture: UnknownUnknown/Multiple

comment:6 Changed 8 years ago by simonmar

Operating System: UnknownUnknown/Multiple

comment:7 Changed 7 years ago by simonmar

difficulty: Moderate (1 day)Moderate (less than a day)

comment:8 Changed 4 years ago by Yuras

Cc: shumovichy@… added
Type of failure: None/Unknown

comment:9 Changed 4 years ago by Yuras

Status: newpatch

The patch is attached.

I don't see any difference in nofib. I see ~40% speedup in the next program:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
  var <- newMVar ()
  replicateM_ 100000000 $ do
    takeMVar var
    putMVar var ()

I tried to check n_capabilities in lockClosure, but the speedup was only ~30%. I don't see any difference for "+RTS -N2" case.

So, I don't see why it should not be implemented. But I'm ok with wontfix in case there are arguments against it.

$ uname -a
Linux shum-laptop 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux

comment:10 Changed 4 years ago by simonmar

Milestone: _|_7.8.1
Owner: set to simonmar

Ok, I'll look at this.

comment:11 Changed 4 years ago by ian@…

commit 75947bb63794cae5950f679c8df86441b736b3fa

Author: Ian Lynagh <>
Date:   Sat Jun 15 19:07:58 2013 +0100

    Optimise lockClosure when n_capabilities == 1; fixes #693
    Based on a patch from Yuras Shumovich.

 includes/Rts.h                       |    2 +-
 includes/rts/storage/SMPClosureOps.h |   28 ++++++++++++++++++++++------
 rts/PrimOps.cmm                      |   28 ++++++++++++++++++++++++----
 3 files changed, 47 insertions(+), 11 deletions(-)

comment:12 Changed 4 years ago by igloo

Resolution: fixed
Status: patchclosed

Thanks; I've pushed a similar patch which also optimises the C users of lockClosure.

comment:13 Changed 4 years ago by ezyang

The way the C-- code is written now seems really odd.

@@ -1290,7 +1295,12 @@ stg_tryTakeMVarzh ( P_ mvar /* :: MVar a */ )
     W_ val, info, tso, q;
 #if defined(THREADED_RTS)
-    ("ptr" info) = ccall lockClosure(mvar "ptr");
+    if (CInt[n_capabilities] == 1 :: CInt) {
+        info = GET_INFO(mvar);
+    }
+    else {
+        ("ptr" info) = ccall reallyLockClosure(mvar "ptr");
+    }
     info = GET_INFO(mvar);

Why can't it just call lockClosure? Is the concern here the cost of the extra jump? Seems like we should make a macro for this, then.

comment:14 Changed 4 years ago by igloo

Given the existing code, I assumed that we want to avoid the overhead of a C function call.

comment:15 Changed 4 years ago by ezyang

OK, I adjusted this a little bit:

commit 3a8c50111d5a92594f5c2f1b2b96a7c1cfab82eb
Author: Edward Z. Yang <>
Date:   Wed Jul 10 13:10:32 2013 -0700

    Add LOCK_CLOSURE macro for use in C--, which inlines the capability check.
    This patch also tweaks lockClosure to be INLINE_HEADER, so C-- clients
    don't accidentally use them and updates some other code which locks closures
    to do the capability check.
    Signed-off-by: Edward Z. Yang <>

comment:16 Changed 4 years ago by parcs

I think atomicModifyMutVar# and noDuplicate# could be optimized in a similar fashion.

When n_capabilities == 1, atomicModifyMutVar# could update the MutVar in a faster (1.8x), non-atomic fashion instead of using a compare-and-swap. Similarly, noDuplicate# could elide the call to threadPaused since there is no possibility of work being duplicated. Are these safe assumptions?

How does this patch look? ./validate seems content with it, at least.

diff --git a/rts/PrimOps.cmm b/rts/PrimOps.cmm
index ced15ee..863c352 100644
--- a/rts/PrimOps.cmm
+++ b/rts/PrimOps.cmm
@@ -340,6 +340,15 @@ stg_atomicModifyMutVarzh ( gcptr mv, gcptr f )
    StgThunk_payload(r,0) = z;

+   if (CInt[n_capabilities] == 1 :: CInt) {
+     x = StgMutVar_var(mv);
+     StgThunk_payload(z,1) = x;
+     StgMutVar_var(mv) = y;
+     goto out;
+   }
    x = StgMutVar_var(mv);
    StgThunk_payload(z,1) = x;
@@ -350,6 +359,7 @@ stg_atomicModifyMutVarzh ( gcptr mv, gcptr f )
    StgMutVar_var(mv) = y;

    if (GET_INFO(mv) == stg_MUT_VAR_CLEAN_info) {
      ccall dirty_MUT_VAR(BaseReg "ptr", mv "ptr");
@@ -2008,6 +2018,10 @@ INFO_TABLE_RET(stg_noDuplicate, RET_SMALL, W_ info_ptr)

 stg_noDuplicatezh /* no arg list: explicit stack layout */
+    if (CInt[n_capabilities] == 1 :: CInt) {
+        jump %ENTRY_CODE(Sp(0)) [];
+    }
     STK_CHK(WDS(1), stg_noDuplicatezh);

     // leave noDuplicate frame in case the current
Note: See TracTickets for help on using tickets.