Opened 23 months ago

Last modified 20 months ago

#10652 new feature request

Better cache performance in Array#

Reported by: MikeIzbicki Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.10.1
Keywords: Cc: lelf, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

A common use case for arrays is to loop over them, performing an operation on each element of the array. For ByteArray#s, this is guaranteed to have good cache performance: When element i is accessed, the CPU's prefetcher will ensure that the element i+1 is already sitting in cache, so the next loop will not suffer a cache miss. Currently, the Array# and SmallArray# types do not significantly benefit from the prefetcher. This feature request asks for a few new primops that would improve the situation.

My understanding is that the "raw" element in an Array# is actually a pointer to the region of memory holding the "logical" element of the array. When looping, the subsequent pointers are guaranteed to get prefetched, but the subsequent logical elements are not. In particular, we'll only get the benefits of prefetching on the logical elements if we're lucky enough that the memory manager happened to allocate them next to each other on the heap. I don't know enough about GHC internals to check the source code, but my experiments demonstrate that this is not currently happening in my use case, resulting in rather significant performance degradation. (The experiments are rather complicated, so I'm not attaching them.)
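The effect is easy to see at the library level. The sketch below (my own illustration, assuming only the standard `array` package; the names are made up) contrasts the two layouts: the boxed scan chases one pointer per element, while the unboxed scan reads a contiguous ByteArray#:

```haskell
import qualified Data.Array as A          -- boxed: an Array# of pointers
import qualified Data.Array.Unboxed as U  -- unboxed: a flat ByteArray#

n :: Int
n = 100000

-- Boxed: every slot is a pointer, and the pointed-to Integer closures
-- may land anywhere on the heap, so a linear scan can miss cache per step.
boxed :: A.Array Int Integer
boxed = A.listArray (0, n - 1) (map fromIntegral [0 .. n - 1])

-- Unboxed: the Ints live contiguously in one ByteArray#, so the
-- hardware prefetcher keeps the next element warm during a scan.
unboxed :: U.UArray Int Int
unboxed = U.listArray (0, n - 1) [0 .. n - 1]

sumBoxed :: Integer
sumBoxed = sum [boxed A.! i | i <- [0 .. n - 1]]

sumUnboxed :: Int
sumUnboxed = sum [unboxed U.! i | i <- [0 .. n - 1]]
```

Both scans compute the same sum; the difference shows up only in cache-miss counters, not in the result.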

I propose adding the primops:

packArray#      :: Array# a      -> b -> b
packSmallArray# :: SmallArray# a -> b -> b

packMutableArray#      :: MutableArray# s a      -> State# s -> State# s 
packSmallMutableArray# :: SmallMutableArray# s a -> State# s -> State# s 

These operations would have the semantic effect of no-ops, except that they request that the logical elements of the array be arranged adjacently in memory. Thus, future loops over the array would benefit from CPU prefetching. There are a number of ways the memory manager could handle these requests, and I don't particularly care which is chosen. For example, the memory manager could rearrange the memory immediately, or wait until the next garbage collection pass and do it then.
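A hypothetical use (a sketch only: `packArray#` does not exist, and `sumElems` is a made-up helper) would pack once and then loop:

```haskell
-- Hypothetical: request a contiguous layout, then scan. Semantically,
-- packArray# is the identity on its second argument; only the heap
-- placement of arr's elements changes.
sumPacked :: Array# Integer -> Integer
sumPacked arr = packArray# arr (sumElems arr)
  where
    -- sumElems would be an ordinary left-to-right fold over indexArray#,
    -- elided here; after packing, its reads should be cache-friendly.
    sumElems = ...
```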

Change History (16)

comment:1 Changed 23 months ago by simonpj

Getting better cache behaviour would be a good thing!

I don't know the impl in detail either, but I'm pretty sure that all you need to do is to use unboxed arrays. Certainly before even beginning to think about new primops we need to understand in detail what the current implementation does, and why it has poor cache behaviour.

comment:2 Changed 23 months ago by MikeIzbicki

In my specific use case, unboxed arrays are not possible because each element does not have a fixed size (think something like Integer).

comment:3 Changed 23 months ago by carter

Mike: have you looked at how the GC arranges values only reachable via a boxed array? I guess one problem I can think of offhand is that your primops get tricky when a value is shared across multiple boxed arrays! A valid and interesting idea would be to just make sure that the GC places unshared values reachable only via a boxed array contiguously when doing a copying GC. I should try to look into this myself. Relatedly: Bob Harper et al. proved a few years ago that certain GC strategies do result in cache-optimal memory layouts, but I don't know whether GHC uses a traversal order that matches their specific proof.

comment:4 in reply to:  3 Changed 23 months ago by MikeIzbicki

Replying to carter:

Mike: have you looked at how the GC arranges values only reachable via a boxed array?

I wouldn't even know where to look for this. My only evidence that things aren't aligned is the huge number of cache misses I'm getting. But I suppose these cache misses could be caused by something else.

I guess one problem I can think of offhand is that your primops get tricky when a value is shared across multiple boxed arrays!

My proposal doesn't run into this problem. I'm specifically not proposing that GHC try to find a "globally optimal" memory allocation scheme. All I want is for the elements to be laid out contiguously immediately after one of these operations is called. A subsequent call to these primops on another array holding the same elements in a different order would destroy that layout.

A valid and interesting idea would be to just make sure that the GC places unshared values reachable only via a boxed array contiguously when doing a copying GC. I should try to look into this myself.

I think this would work for me, and probably be better in general performance-wise. I was just (maybe naively) imagining this would be harder to implement.

comment:5 Changed 23 months ago by carter

I actually think it'd be easier, and safer with respect to space behavior, to provide the latter (e.g., preserve sharing!). Your proposed operations require doing a GC-style move of the values held in the underlying array to have the right semantics.

I'm not sure where to look in the GC myself, but I think https://github.com/ghc/ghc/blob/master/rts/sm/Evac.c#L711-L725 is probably a good starting point.

comment:6 Changed 23 months ago by carter

Roughly: as long as the GC does a *breadth-first* copy of the values reachable through a boxed array, in left-to-right order, you should get the behavior you want. I *believe* GHC roughly does this in depth-first order, but I could be wrong.
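A toy model (my own sketch, not GHC's actual evacuation code) of why the traversal order matters: the order in which a copying collector visits closures is the order they land in to-space, so BFS keeps an array's direct elements adjacent while DFS interleaves each element with its children:

```haskell
-- A closure and its pointer fields, in a toy heap model.
data Obj = Obj String [Obj]

-- Depth-first visit order: each element's children are copied
-- immediately after it, separating the array's own elements.
dfsOrder :: [Obj] -> [String]
dfsOrder = concatMap visit
  where visit (Obj name kids) = name : dfsOrder kids

-- Breadth-first visit order: all direct elements are copied first,
-- ending up contiguous in to-space; children follow afterwards.
bfsOrder :: [Obj] -> [String]
bfsOrder []   = []
bfsOrder objs =
  [name | Obj name _ <- objs]
    ++ bfsOrder (concat [kids | Obj _ kids <- objs])
```

For two array slots `a` and `b`, each with one child, DFS visits `a, a', b, b'` while BFS visits `a, b, a', b'`, keeping the slots themselves contiguous.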

Likewise, https://ghc.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/GC/Copying provides some more information about the rough flavor of the copying algorithms.

One challenge might be that the underlying storage layer of the GC is block-oriented, and I believe the GC can promote a block of memory between generations instead of copying it.

comment:7 Changed 23 months ago by rwbarton

If the values pointed to by your array don't themselves contain pointers (and assuming nothing else on the heap points to them), then either a BFS or a DFS traversal will visit them all sequentially. But in many cases, if your values don't contain pointers, you could find a way to store them in an unboxed array with higher memory density anyway. (Using a boxed array costs you two words per element up front: one for the pointer in the array and one for the info pointer of the boxed value itself.)

If your values do themselves contain pointers then it does matter whether the traversal is BFS or DFS, but which one you want depends on how you plan to use the values, which the GC can't know without further hints. For example a large Integer contains a pointer to a ByteArray# containing the actual integer data, which you will need for most (but not all) operations on the Integer. If you have an array with a lot of large Integers, you may or may not want to put the ByteArray# after its corresponding Integer (even ignoring the possibility of the ByteArray#s being shared).

My potential concerns about the original feature request are

  • It sounds like a lot of extra complexity in the GC
  • Existing programs gain nothing from this added complexity unless they are modified to use the new primops
  • Consequently the implementation of the new primops must have zero cost for programs that do not use them
  • A fairly narrow range of programs stand to benefit from the new primops
  • It's unclear how much there is to be gained in the best case, especially compared to user-space alternatives like explicitly prefetching the next value of the array before processing the current one

You might be better off just implementing your own memory management (e.g. allocate variable-length representations of your data sequentially from a mutable ByteArray, then store offsets into this storage in your array and access the data through short-lived Haskell values, Storable-style).
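The last suggestion can be prototyped in ordinary Haskell. The sketch below (my own illustration; the `Packed` type and its helpers are made-up names) packs variable-length byte records back to back in one unboxed array and indexes them through an offset table:

```haskell
import qualified Data.Array.Unboxed as U
import Data.Word (Word8)

-- A user-space "arena": variable-length records packed contiguously
-- in one unboxed array, with a separate offset table.
data Packed = Packed
  { payload :: U.UArray Int Word8  -- all record bytes, back to back
  , offsets :: U.UArray Int Int    -- offsets U.! i = start of record i,
  }                                -- plus one extra final entry

pack :: [[Word8]] -> Packed
pack records = Packed
  { payload = U.listArray (0, total - 1) (concat records)
  , offsets = U.listArray (0, n) (scanl (+) 0 (map length records))
  }
  where
    n     = length records
    total = sum (map length records)

-- Read record i back out as a list of bytes; a real implementation
-- would hand out a slice instead of copying.
unpack :: Packed -> Int -> [Word8]
unpack p i = [payload p U.! j | j <- [lo .. hi - 1]]
  where
    lo = offsets p U.! i
    hi = offsets p U.! (i + 1)
```

Scanning the arena in record order touches `payload` sequentially, which is exactly the access pattern the hardware prefetcher handles well.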

Last edited 23 months ago by rwbarton (previous) (diff)

comment:8 Changed 22 months ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:9 Changed 22 months ago by lelf

Cc: lelf added

comment:10 Changed 22 months ago by thomie

See this blog post for a use case for this feature, towards the end of "Lesson 4: Haskell’s standard libraries were not designed for fast numeric computing."

comment:11 Changed 22 months ago by bgamari

Cc: bgamari added

comment:12 Changed 22 months ago by rwbarton

So to simplify, the use case is something like

data IntArrayTree = IntArrayTree {
  value    :: {-# UNPACK #-} !Int,
  children :: {-# UNPACK #-} !(Array Int IntArrayTree)
  }

and arguably the underlying problem that would be nice to fix is that there are two levels of indirection (IntArrayTree -> Array# -> IntArrayTree) per level of the tree.

Of note is that the IntArrayTree values themselves are actually of constant size. But we can't store them efficiently in any sort of array because they contain a mix of pointer and non-pointer fields.

comment:13 in reply to:  12 Changed 22 months ago by tibbe

Getting rid of the kind of array indirection rwbarton mentioned would greatly help unordered-containers.

comment:14 Changed 22 months ago by rwbarton

One idea I had was to create a new family of array types alongside boxed and unboxed arrays, called unpacked arrays. (Note this is the opposite of UnpackingArrays.)

The idea is that whenever T is a type that can be unpacked with the {-# UNPACK #-} pragma, we can form values of type UnpackedArray# T. The heap representation of such a value is of the form

UnpackedArray#_info T_con_info n a1 b1 c1 a2 b2 c2 ... an bn cn

representing an array containing the values

T_con_info a1 b1 c1
T_con_info a2 b2 c2
...
T_con_info an bn cn

(Since T can be unpacked, it has just one constructor.)

Compared to an Array# this costs one extra word T_con_info. It might be possible to save this word with a new type of info table layout, then the heap representation of Array# a would be identical (modulo info pointer value) to that of UnpackedArray# (Box a) where data Box a = Box a.

The GC can use T_con_info to understand the layout of the rest of the heap object, while mutator operations like indexing would use operations from a magically-generated Unpack type class. This last part might be tricky.

Then IntArrayTree could hold an UnpackedArray# IntArrayTree to avoid one of the layers of indirection. It won't apply to tibbe's unordered-containers though, since IntMap has multiple constructors. (I guess it could be used inside Collision !Hash !(A.Array (Leaf k v)), but that's rather pointless.)
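For intuition, the flat `a1 b1 a2 b2 ...` layout can be emulated today at the user level for the special case where every field is itself unboxable (which misses the point of UnpackedArray#, whose value is precisely in handling mixed pointer/non-pointer fields, but does show the layout). All names below are illustrative, not a proposed API:

```haskell
import qualified Data.Array.Unboxed as U

-- A single-constructor type whose fields would all be unpacked.
data P = P !Int !Int deriving (Eq, Show)

-- Emulated UnpackedArray# P: fields stored flat, two slots per element;
-- the "constructor" is re-applied on every indexing operation.
newtype UnpackedArrayP = UnpackedArrayP (U.UArray Int Int)

fromListP :: [P] -> UnpackedArrayP
fromListP ps = UnpackedArrayP $
  U.listArray (0, 2 * length ps - 1) (concat [[a, b] | P a b <- ps])

indexP :: UnpackedArrayP -> Int -> P
indexP (UnpackedArrayP arr) i = P (arr U.! (2 * i)) (arr U.! (2 * i + 1))
```

The real proposal would have the RTS do this re-boxing via the stored T_con_info, so it would also work when some fields are pointers.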

comment:15 Changed 20 months ago by rwbarton

MikeIzbicki, you will probably be interested in Phab:D1264.

comment:16 Changed 20 months ago by carter

This seems to overlap with the array tech that Edward Kmett has demoed for mutable data structures, and the resulting conversations that ensued on the mailing lists and at ICFP 2015.

Note: See TracTickets for help on using tickets.