Opened 5 years ago

Closed 5 years ago

#3019 closed bug (fixed)

sparc membar asm instruction requires mode parameter

Reported by: duncan
Owned by: simonmar
Priority: normal
Milestone: 6.12 branch
Component: Runtime System
Version: 6.11
Keywords:
Cc:
Operating System: Solaris
Architecture: sparc
Type of failure:
Difficulty: Unknown
Test Case: rts/testwsdeque.c
Blocked By:
Blocking:
Related Tickets:

Description

The new parallel/WSDeque.c uses store_load_barrier() from includes/SMP.h.

EXTERN_INLINE void
store_load_barrier(void) {
#if i386_HOST_ARCH
    __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory");
#elif x86_64_HOST_ARCH
    __asm__ __volatile__ ("lock; addq $0,0(%%rsp)" : : : "memory");
#elif powerpc_HOST_ARCH
    __asm__ __volatile__ ("msync" : : : "memory");
#elif sparc_HOST_ARCH
    /* Sparc in TSO mode does not require write/write barriers. */
    __asm__ __volatile__ ("membar" : : : "memory");
#elif !defined(WITHSMP)
    return;
#else
#error memory barriers unimplemented on this architecture
#endif
}

In particular for sparc the bit:

    /* Sparc in TSO mode does not require write/write barriers. */
    __asm__ __volatile__ ("membar" : : : "memory");

This is not right: the membar instruction requires a parameter to specify which kind of memory barrier is required. For store_load_barrier() it is of course membar #StoreLoad.

Without this the assembler complains:

/usr/ccs/bin/as: "parallel/WSDeque.s", line 11: error: statement syntax

With #StoreLoad added it's fine.
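For reference, here is a quick sketch (not part of the ticket) of the four SPARC v9 membar ordering constraints, written as GCC inline asm in the same style as SMP.h. Only the #StoreLoad variant is needed for store_load_barrier(); the helper names are illustrative:

    /* Hypothetical helpers showing the four SPARC v9 membar ordering
       constraints; constraints can also be OR-ed together in one
       instruction, e.g. "membar #StoreLoad|#StoreStore". */
    #define SPARC_MEMBAR(mode) \
        __asm__ __volatile__ ("membar " mode : : : "memory")

    static void load_load_barrier_sketch  (void) { SPARC_MEMBAR("#LoadLoad");   }
    static void store_store_barrier_sketch(void) { SPARC_MEMBAR("#StoreStore"); }
    static void load_store_barrier_sketch (void) { SPARC_MEMBAR("#LoadStore");  }
    static void store_load_barrier_sketch (void) { SPARC_MEMBAR("#StoreLoad");  }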

Note also that the comment appears to be wrong:

    /* Sparc in TSO mode does not require write/write barriers. */

This is store_load_barrier(), not store_store_barrier(), so the store/load ordering is exactly and only the case that is required here: TSO already preserves store/store order, but it does allow a store to be reordered after a later load.
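To see why TSO still needs this one barrier: TSO allows a store to sit in the store buffer while a later load of a different location runs ahead of it, which is exactly the store-then-load pattern of a Dekker-style handshake. A minimal sketch (hypothetical, not code from WSDeque.c):

    /* Each thread sets its own flag, then reads the other's. Under
       TSO both loads can overtake the (buffered) stores, so both
       calls can return 0 unless a #StoreLoad barrier sits between
       the store and the load. */
    static volatile int flag0 = 0, flag1 = 0;

    int thread0_sees_thread1(void)
    {
        flag0 = 1;
        __asm__ __volatile__ ("membar #StoreLoad" : : : "memory");
        return flag1;
    }

    int thread1_sees_thread0(void)
    {
        flag1 = 1;
        __asm__ __volatile__ ("membar #StoreLoad" : : : "memory");
        return flag0;
    }

With the barriers in place, at least one of the two calls must return 1; store/store and load/load orderings, by contrast, are already guaranteed by TSO.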

Attachments (1)

membar#StoreLoad.dpatch (164.5 KB) - added by duncan 5 years ago.
Specify the sparc membar #StoreLoad parameter


Change History (7)

comment:1 Changed 5 years ago by simonmar

  • Difficulty set to Unknown

Changed 5 years ago by duncan

Specify the sparc membar #StoreLoad parameter

comment:2 Changed 5 years ago by duncan

hunk ./includes/SMP.h 200
 #elif powerpc_HOST_ARCH
     __asm__ __volatile__ ("msync" : : : "memory");
 #elif sparc_HOST_ARCH
-    /* Sparc in TSO mode does not require write/write barriers. */
-    __asm__ __volatile__ ("membar" : : : "memory");
+    __asm__ __volatile__ ("membar #StoreLoad" : : : "memory");
 #elif !defined(WITHSMP)
     return;
 #else

With this fix (darcs patch attached), rts/parallel/WSDeque.c compiles fine and the test program rts/testwsdeque.c appears to work (and runs very quickly on the T2 CPU).

Out of interest, we would like to see whether omitting the membar causes the test program to fail in practice.

comment:3 Changed 5 years ago by duncan

If we modify rts/parallel/WSDeque.c to omit the call to store_load_barrier(), then running testwsdeque (with 3 threads) fails like so:

internal error: internal error: FAIL: 3461204 1 10FAIL: 3460860 2 10

    (GHC version 6.11.20090211 for sparc_sun_solaris2)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
    (GHC version 6.11.20090211 for sparc_sun_solaris2)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Abort (core dumped)

Clearly several threads are hitting internal errors and the messages are getting interleaved in the console output. With just two threads the output is more readable:

internal error: FAIL: 3456932 1 10
    (GHC version 6.11.20090211 for sparc_sun_solaris2)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
Abort (core dumped)

Note that at least one thread does not die and the test runs on for a second or so until it aborts.

So the conclusion would seem to be that the attached patch works and that the store/load memory barrier is necessary on Sparc.
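This matches the classic store/load race in a work-stealing deque: in the owner's pop, the store that claims the bottom slot must become globally visible before top is re-read, or the owner and a thief can both take the last element. A simplified sketch of that step (hypothetical names, not the actual WSDeque.c code):

    /* Owner-side pop of a Chase-Lev style deque (simplified). */
    typedef struct {
        volatile long top;     /* thieves steal from this end       */
        volatile long bottom;  /* the owner pushes/pops at this end */
        void **elements;
        long size;
    } DequeSketch;

    extern void store_load_barrier(void);   /* from includes/SMP.h */

    static void *pop_bottom_sketch(DequeSketch *q)
    {
        long b = q->bottom - 1;
        q->bottom = b;            /* store: provisionally claim slot   */
        store_load_barrier();     /* store must be visible before ...  */
        long t = q->top;          /* ... this load sees thieves' CASes */
        if (b < t) {              /* deque was empty: undo and bail    */
            q->bottom = t;
            return NULL;
        }
        void *elem = q->elements[b % q->size];
        if (b > t)
            return elem;          /* not the last element: no race     */
        /* last element: race any thief with a CAS on top */
        if (!__sync_bool_compare_and_swap(&q->top, t, t + 1))
            elem = NULL;          /* a thief got there first           */
        q->bottom = t + 1;        /* reset to canonical empty state    */
        return elem;
    }

Without the barrier, the owner's load of top can execute before its store to bottom is visible; the owner can then see a stale top, conclude there is no race, and return an element that a thief is simultaneously stealing, which is consistent with the failures above.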

As for performance, it is apparently a bit quicker than on x86, which is interesting. For reference, here are timings from running the testwsdeque test program with a 10x bigger problem size (i.e. SCRATCH_SIZE) on the UltraSPARC T2:

  • 3 threads: 1.6s
  • 4 threads: 1.6s
  • 8 threads: 1.9s
  • 16 threads: 3.4s
  • 32 threads: 8.3s
  • 64 threads: 18.6-19.2s (more variability)

comment:4 in reply to: ↑ 3 Changed 5 years ago by duncan

Replying to duncan:

As for performance, it is apparently a bit quicker than on x86, which is interesting. For reference, here are timings from running the testwsdeque test program with a 10x bigger problem size (i.e. SCRATCH_SIZE) on the UltraSPARC T2:

Correction: the above were THREADS=n results. The number of threads actually in use is n+1, because the main thread also does part of the work in the test. So for THREADS=7:

  • 8 threads: 1.79s (+/- 0.01)

On x86-64, Simon reports for THREADS=7:

  • 8 threads: 17.2s

So we might hope that work-stealing queues for par sparks will be pretty nippy on the T2 CPUs.

comment:5 Changed 5 years ago by simonmar

  • Component changed from Compiler to Runtime System
  • Milestone set to 6.12 branch

Fixed:

Thu Feb 12 01:23:40 PST 2009  Simon Marlow <marlowsd@gmail.com>
  * update Sparc store/load barrier (#3019), and fix comments

comment:6 Changed 5 years ago by simonmar

  • Resolution set to fixed
  • Status changed from new to closed