Opened 8 years ago

Closed 10 months ago

Last modified 9 months ago

#693 closed task (fixed)

dynamic locking

Reported by: simonmar Owned by: simonmar
Priority: low Milestone: 7.8.1
Component: Runtime System Version: 6.4.1
Keywords: Cc: shumovichy@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Moderate (less than a day)
Test Case: N/A Blocked By:
Blocking: Related Tickets:

Description (last modified by simonmar)

Now that the SMP and threaded runtimes are the same, there is some unnecessary locking going on in the single threaded case. For example, MVar operations now require an atomic memory operation, which can be quite slow. It might be a good idea to dynamically turn on these locks if we're running multi-threaded (ie. +RTS -N2 or greater). ToDo: measure the difference.

Attachments (1)

0001-Dynamically-turn-on-off-locking-in-take-put-mvar-693.patch (2.0 KB) - added by Yuras 16 months ago.

Download all attachments as: .zip

Change History (17)

comment:1 Changed 8 years ago by simonmar

  • Description modified (diff)
  • Milestone changed from 6.6 to 6.6.1

comment:2 Changed 7 years ago by igloo

  • Test Case set to N/A

comment:3 Changed 7 years ago by simonmar

  • Milestone changed from 6.6.1 to 6.8
  • Priority changed from normal to low

comment:4 Changed 6 years ago by igloo

  • Milestone changed from 6.8 branch to _|_

comment:5 Changed 6 years ago by simonmar

  • Architecture changed from Unknown to Unknown/Multiple

comment:6 Changed 6 years ago by simonmar

  • Operating System changed from Unknown to Unknown/Multiple

comment:7 Changed 4 years ago by simonmar

  • Difficulty changed from Moderate (1 day) to Moderate (less than a day)

comment:8 Changed 17 months ago by Yuras

  • Cc shumovichy@… added
  • Type of failure set to None/Unknown

comment:9 Changed 16 months ago by Yuras

  • Status changed from new to patch

The patch is attached.

I don't see any difference in nofib.
I see ~40% speedup in the next program:

import Control.Concurrent
import Control.Monad

main :: IO ()
main = do
  var <- newMVar ()
  replicateM_ 100000000 $ do
    takeMVar var
    putMVar var ()

I tried to check n_capabilities in lockClosure, but the speedup was only ~30%.
I don't see any difference for "+RTS -N2" case.

So, I don't see why it should not be implemented. But I'm ok with wontfix in case there are arguments against it.

$ uname -a
Linux shum-laptop 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux

comment:10 Changed 16 months ago by simonmar

  • Milestone changed from _|_ to 7.8.1
  • Owner set to simonmar

Ok, I'll look at this.

comment:11 Changed 10 months ago by ian@…

commit 75947bb63794cae5950f679c8df86441b736b3fa

Author: Ian Lynagh <ian@well-typed.com>
Date:   Sat Jun 15 19:07:58 2013 +0100

    Optimise lockClosure when n_capabilities == 1; fixes #693
    
    Based on a patch from Yuras Shumovich.

 includes/Rts.h                       |    2 +-
 includes/rts/storage/SMPClosureOps.h |   28 ++++++++++++++++++++++------
 rts/PrimOps.cmm                      |   28 ++++++++++++++++++++++++----
 3 files changed, 47 insertions(+), 11 deletions(-)

comment:12 Changed 10 months ago by igloo

  • Resolution set to fixed
  • Status changed from patch to closed

Thanks; I've pushed a similar patch which also optimises the C users of lockClosure.

comment:13 Changed 9 months ago by ezyang

The way the C-- code is written now seems really odd.

@@ -1290,7 +1295,12 @@ stg_tryTakeMVarzh ( P_ mvar /* :: MVar a */ )
     W_ val, info, tso, q;
 
 #if defined(THREADED_RTS)
-    ("ptr" info) = ccall lockClosure(mvar "ptr");
+    if (CInt[n_capabilities] == 1 :: CInt) {
+        info = GET_INFO(mvar);
+    }
+    else {
+        ("ptr" info) = ccall reallyLockClosure(mvar "ptr");
+    }
 #else
     info = GET_INFO(mvar);
 #endif

Why can't it just call lockClosure? Is the concern here the cost of the extra jump? Seems like we should make a macro for this, then.

comment:14 Changed 9 months ago by igloo

Given the existing code, I assumed that we want to avoid the overhead of a C function call.

comment:15 Changed 9 months ago by ezyang

OK, I adjusted this a little bit:

commit 3a8c50111d5a92594f5c2f1b2b96a7c1cfab82eb
Author: Edward Z. Yang <ezyang@mit.edu>
Date:   Wed Jul 10 13:10:32 2013 -0700

    Add LOCK_CLOSURE macro for use in C--, which inlines the capability check.
    
    This patch also tweaks lockClosure to be INLINE_HEADER, so C-- clients
    don't accidentally use them and updates some other code which locks closures
    to do the capability check.
    
    Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

comment:16 Changed 9 months ago by parcs

I think atomicModifyMutVar# and noDuplicate# could be optimized in a similar fashion.

When n_capabilities == 1, atomicModifyMutVar# could update the MutVar in a faster (1.8x), non-atomic fashion instead of using a compare-and-swap. Similarly, noDuplicate# could elide the call to threadPaused since there is no possibility of work being duplicated. Are these safe assumptions?

How does this patch look? ./validate seems content with it, at least.

diff --git a/rts/PrimOps.cmm b/rts/PrimOps.cmm
index ced15ee..863c352 100644
--- a/rts/PrimOps.cmm
+++ b/rts/PrimOps.cmm
@@ -340,6 +340,15 @@ stg_atomicModifyMutVarzh ( gcptr mv, gcptr f )
    LDV_RECORD_CREATE(r);
    StgThunk_payload(r,0) = z;

+#ifdef THREADED_RTS
+   if (CInt[n_capabilities] == 1 :: CInt) {
+     x = StgMutVar_var(mv);
+     StgThunk_payload(z,1) = x;
+     StgMutVar_var(mv) = y;
+     goto out;
+   }
+#endif
+
  retry:
    x = StgMutVar_var(mv);
    StgThunk_payload(z,1) = x;
@@ -350,6 +359,7 @@ stg_atomicModifyMutVarzh ( gcptr mv, gcptr f )
    StgMutVar_var(mv) = y;
 #endif

+out:
    if (GET_INFO(mv) == stg_MUT_VAR_CLEAN_info) {
      ccall dirty_MUT_VAR(BaseReg "ptr", mv "ptr");
    }
@@ -2008,6 +2018,10 @@ INFO_TABLE_RET(stg_noDuplicate, RET_SMALL, W_ info_ptr)

 stg_noDuplicatezh /* no arg list: explicit stack layout */
 {
+    if (CInt[n_capabilities] == 1 :: CInt) {
+        jump %ENTRY_CODE(Sp(0)) [];
+    }
+
     STK_CHK(WDS(1), stg_noDuplicatezh);

     // leave noDuplicate frame in case the current
Note: See TracTickets for help on using tickets.