Opened 11 years ago

Closed 7 years ago

#650 closed merge (fixed)

Improve interaction between mutable arrays and GC

Reported by: simonmar Owned by: igloo
Priority: normal Milestone: 6.12.2
Component: Runtime System Version: 6.4.1
Keywords: Cc: dons, ertai, ganesh, sclv, blarsen
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by simonmar)

This is the result of a discussion between myself and Simon PJ a few weeks ago, inspired by recent discoveries of poor performance with mutable arrays (e.g. see the Data.Hash discussion on glasgow-haskell-users around October 2005).

Note that all this applies to mutable arrays of pointers, i.e. IOArray and STArray, not to unboxed arrays, i.e. IOUArray and STUArray.

Current implementation

There are two primitive types: MutArr# and Array#. We convert between them with:

   unsafeFreezeArray# :: MutArr# s a -> State# s -> (# State# s, Array# a #)
   unsafeThawArray# :: Array# a -> State# s -> (# State# s, MutArr# s a #)

An Array# is not normally on the old-gen mutable list, unless (a) it has pointers to young gen objects, or (b) it has been recently frozen. The implementation of unsafeFreezeArray# is a single write to the header word of the array. The implementation of unsafeThawArray# is slightly more complex: if the array was not already on the mutable list (indicated by the value of the header), then we add it. Also, we change the header word to indicate that the array is now mutable.

A MutArr# is always on the mutable list.

Objects pointed to by Array# are eagerly promoted to the generation in which the Array# resides, with the aim that the Array# can then be removed from the mutable list.

It is only safe to write to a MutArr#, so if multiple threads are accessing an array, they should not be doing thaw/freeze tricks without extra locking around the array (such behaviour can cause the GC to crash).

The Problem

The problem is that mutable arrays are always completely traversed on every GC. To get around this, we can keep an array in a frozen state and thaw it just before writing, then freeze it again afterward. This is a bit inconvenient, not to mention unsafe with multiple threads unless extra locking is used.

Furthermore, a modified array is completely scanned, whereas for larger arrays it would be much better to just scan the part of the array that had been modified (known in the GC literature as "card-marking").

The benefit of the current approach is that writing to a mutable array is a single write instruction, whereas to do card-marking or something else requires a write-barrier. The unsafeThaw/write/unsafeFreeze sequence amounts to a write barrier, so if this is a common technique we should provide an easy way to do it, possibly making it the default.

Solutions

Leaving aside card-marking for now, let's think about incorporating the write barrier in the write operation.

Suppose that mutable arrays are always kept on the mutable list, but the header word indicates whether the array needs to be scanned or not (e.g. we have MUT_ARR_DIRTY, MUT_ARR_CLEAN). The array write op should (a) set the header to MUT_ARR_DIRTY, and (b) do the write. The GC turns MUT_ARR_DIRTY into MUT_ARR_CLEAN when everything the array points to is in the same generation (or older).

Downsides to this:

  • initialising a mutable array, or doing block writes, will be more painful, because each write will have the write barrier (perhaps not too painful)

How does freezing/thawing interact with this? We currently create immutable arrays by starting with a MutArr#, initialising it, and then freezing it to make an Array#. We can still do this, exactly as now (and with the same thread-unsafety), but initialisation will be a bit slower due to the write barrier.

Block writes

We could try to provide for "block writes", by allowing a thread to "open" the array for modification, and then "close" it again after it had finished writing, with all writes in between being done without a write barrier. This would replace unsafeThaw/unsafeFreeze.

To do this safely, we would have to use some kind of synchronisation on the open/close; techniques that we came up with were to increment (atomically) a counter in the array header, or to allocate a new heap object pointing to the array in the current thread's allocation area.

Card marking

We could refine the write barrier so that it marks just part of the array as dirty, instead of the whole array. The natural choice is to put the mark bit in the block descriptor for the current block, giving us a granularity of 4/8k, which is possibly a bit large but other solutions are much more expensive. Even this would significantly increase the cost of the write barrier, so it may be that we want a different kind of array type for this (LargeMutArr#?). Furthermore, currently not all arrays have their own block descriptors ("large objects" in GHC's storage manager); the small ones are allocated in movable memory. To do this, we would have to ensure that every array had its own block (or check in the write barrier, which adds even more expense).

IORefs

This also affects IORefs, which are essentially single-element IOArrays. I just noticed that GHC often has a large number of IORefs hanging around in the heap from the typechecker, and the cost of traversing the mutable list can dominate minor GCs.

Change History (29)

comment:1 Changed 11 years ago by simonmar

Description: modified (diff)

comment:2 Changed 11 years ago by simonmar

Milestone: 6.6

Dropped milestone. Some of this work has been done in 6.6 already: mutable arrays are only traversed during GC if they were modified since the last GC (although they remain on the mutable list), and IORefs are only placed on the mutable list when they are modified. Threads are also marked as clean/dirty to avoid traversing a thread's stack if it hasn't run since the last GC.

The main thing missing is restricting the traversal of an array to just the parts that were modified, using card-marking or similar.

comment:3 Changed 10 years ago by igloo

Milestone: 6.8

comment:4 Changed 9 years ago by simonmar

Milestone: 6.8 branch → 6.10 branch

comment:5 Changed 9 years ago by simonmar

Milestone: 6.10 branch → _|_
Type: task → run-time performance bug

comment:6 Changed 8 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:7 Changed 8 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:8 Changed 8 years ago by dons

Cc: dons added

comment:9 Changed 8 years ago by ertai

Cc: ertai added

comment:10 Changed 7 years ago by ganesh

Cc: ganesh added

comment:11 Changed 7 years ago by simonmar

difficulty: Moderate (1 day) → Moderate (less than a day)

comment:12 Changed 7 years ago by simonmar

Type of failure: Runtime performance bug

comment:13 Changed 7 years ago by sclv

Cc: sclv added

comment:14 Changed 7 years ago by simonmar

See Commentary/Rts/Storage/GC/RememberedSets for more background on this.

comment:15 Changed 7 years ago by blarsen

Cc: blarsen added

comment:16 Changed 7 years ago by morrow

One other possible method to deal with this is to make old-generation pages containing mutable array data read-only via mprotect (or equivalent). Any later write then causes a SIGSEGV, which is caught by an installed signal handler that unprotects the page and marks it as dirty (however that needs to be done). A benefit is that any future writes to that page need no write barrier at all. Limitations are that it reserves the SIGSEGV handler for the RTS, and it can only provide PAGE_SIZE granularity (which for large arrays is nevertheless a large improvement). What are people's thoughts on this? Is this (or some variation on it) an option given GHC's RTS?

comment:17 Changed 7 years ago by simonmar

difficulty: Moderate (less than a day) → Difficult (2-5 days)
Milestone: _|_ → 6.14.1
Owner: set to simonmar

I worked on this problem yesterday. So far I have a Data.HashTable benchmark that goes from 50s before the fix to 10s afterward with the default heap settings. It's a simple card-marking strategy where the card bits (actually bytes, for atomicity) are stored after the array data.

I want to do some more measurements and testing before I commit. Roman also pointed out to me that this doesn't really fix the problem, it only reduces the constant factor (albeit by a lot), since we still have to traverse the mark bits on every GC, but I can't think of a workable strategy that doesn't suffer from this problem.

comment:18 in reply to:  16 ; Changed 7 years ago by simonmar

Replying to morrow:

One other possible method to deal with this is to make old-generation pages containing mutable array data read-only via mprotect (or equivalent). Any later write will then cause a SIGSEGV,

Not keen on this for a few reasons:

  • non-portability
  • too coarse-grained
  • only works for large arrays
  • GC needs to fiddle around with mprotecting pages
  • taking the signal is expensive
  • recovering from a SIGSEGV is very tricky

I must be getting old, but my feeling is that tricks like this sound quite cool but end up being a pain in the neck in the long run.

comment:19 Changed 7 years ago by GregoryCollins

It's a simple card-marking strategy where the card bits (actually bytes, for atomicity)

Can you use a BTS instruction to do bitwise marking atomically? It'd be nice to see if there's a performance improvement; in theory you trade some bookkeeping complexity for 8x fewer cache line loads.

comment:20 in reply to:  19 Changed 7 years ago by simonmar

Replying to GregoryCollins:

Can you use a BTS instruction to do bitwise marking atomically? It'd be nice to see if there was a performance improvement, in theory you trade some bookkeeping complexity for 8x fewer cache line loads.

BTS is only atomic with a LOCK prefix, and that would make it a lot more expensive than doing byte operations.

comment:21 Changed 7 years ago by guest

Most of the time there won't be contention.

How about using cmpxchg, and falling back to bts if there happened to be contention? Or would that also be much more expensive than byte operations?

comment:22 Changed 7 years ago by simonmar

cmpxchg is also only atomic with a LOCK prefix. Typically a LOCK prefix adds on the order of 100 cycles, even without contention. Apparently it's better on Nehalem CPUs, but we still need something that works well across a variety of hardware. I don't think byte writes are a big problem in practice.

comment:23 in reply to:  18 Changed 7 years ago by morrow

I must be getting old, but my feeling is that tricks like this sound quite cool but end up being a pain in the neck in the long run.

To be honest, the thought of dealing with that scares the crap out of me. :-)

comment:24 Changed 7 years ago by simonmar

Milestone: 6.14.1 → 6.12.2

Fixed:

Thu Dec 17 14:42:28 PST 2009  Simon Marlow <marlowsd@gmail.com>
  * Fix #650: use a card table to mark dirty sections of mutable arrays
  The card table is an array of bytes, placed directly following the
  actual array data.  This means that array reading is unaffected, but
  array writing needs to read the array size from the header in order to
  find the card table.
  
  We use a bytemap rather than a bitmap, because updating the card table
  must be multi-thread safe.  Each byte refers to 128 entries of the
  array, but this is tunable by changing the constant
  MUT_ARR_PTRS_CARD_BITS in includes/Constants.h.

    M ./compiler/codeGen/CgPrimOp.hs -2 +14
    M ./includes/Cmm.h +3
    M ./includes/HaskellConstants.hs +3
    M ./includes/mkDerivedConstants.c +1
    M ./includes/rts/Constants.h +7
    M ./includes/rts/storage/ClosureMacros.h -1 +29
    M ./includes/rts/storage/Closures.h +2
    M ./rts/PrimOps.cmm -5 +23
    M ./rts/Weak.c -2 +9
    M ./rts/sm/Scav.c -54 +123

I benchmarked this program, amongst others:

   import Control.Monad
   import qualified Data.HashTable as H
   import System.Environment

   main = do
     [size] <- fmap (fmap read) getArgs
     m <- H.new (==) H.hashInt
     forM_ [1..size] $ \n -> H.insert m n n
     v <- H.lookup m 100
     print v

for size=1000000, with 6.12.1 it takes around 50s on my x86/Linux laptop, with the HEAD and this patch it takes 9.5s.

We could merge this into 6.12.2 after more testing and benchmarking, as the changes are fairly localised.

comment:25 Changed 7 years ago by simonmar

NB. this patch is needed too

Mon Dec 21 11:52:49 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * Fixes to account for the new layout of MUT_ARR_PTRS (see #650)

comment:26 Changed 7 years ago by igloo

Owner: changed from simonmar to igloo
Type: bug → merge

comment:27 Changed 7 years ago by igloo

Is the "more testing and benchmarking" done? Or is someone doing it?

comment:28 Changed 7 years ago by simonmar

Go ahead and merge, this seems to have no adverse effects in the HEAD.

comment:29 Changed 7 years ago by igloo

Resolution: fixed
Status: new → closed

Merged, along with

Tue Dec 22 22:20:17 GMT 2009  dias@cs.tufts.edu
  * Copying Simon M's fix for 650 to the new codegen
Note: See TracTickets for help on using tickets.