Better cache performance in Array#
|Reported by:||MikeIzbicki||Owned by:|
|Type of failure:||Runtime performance bug||Test Case:|
|Related Tickets:||Differential Rev(s):|
A common use case for arrays is to loop over them, performing an operation on each element. For ByteArray#s, this is guaranteed to have good cache performance: when element i is accessed, the CPU's prefetcher will ensure that element i+1 is already sitting in cache, so the next iteration will not suffer a cache miss. Currently, the Array# and SmallArray# types do not significantly benefit from the prefetcher. This feature request asks for a few new primops that would improve the situation.
My understanding is that the "raw" element in an Array# is actually a pointer to the region in memory associated with the "logical" element of the array. When looping, the subsequent pointers are guaranteed to get prefetched, but the subsequent logical element is not guaranteed to be prefetched. In particular, we'll only get the benefits of prefetching on the logical elements if we're lucky enough that the memory manager happened to allocate these logical elements next to each other on the heap. I don't know enough about GHC internals to check the source code, but my experiments demonstrate that this is not currently happening in my use case, resulting in rather significant performance degradation. (The experiments are rather complicated, so I'm not attaching them.)
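To make the difference concrete, here is a minimal sketch of the two kinds of loop, written directly against the existing primops in GHC.Exts. The wrapper types and function names (Boxed, Unboxed, boxedFromList, unboxedFromList, sumBoxed, sumUnboxed, intWidth) are made up for illustration. It only shows where the extra pointer dereference happens; it makes no claim about where the garbage collector actually places the boxed elements.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Foreign.Storable (sizeOf)
import GHC.Exts
import GHC.ST (ST (..), runST)

-- Lifted wrappers (hypothetical names) so the unlifted arrays can be
-- passed around like ordinary values.
data Boxed = Boxed (Array# Int)
data Unboxed = Unboxed ByteArray#

-- Width of one Int in bytes.
intWidth :: Int
intWidth = sizeOf (0 :: Int)

-- Every slot of the Array# stores a pointer to a separately allocated Int.
boxedFromList :: [Int] -> Boxed
boxedFromList xs = case length xs of
  I# n -> runST (ST (\s0 ->
    case newArray# n (0 :: Int) s0 of
      (# s1, marr #) ->
        let fill _ [] s = s
            fill i (y : ys) s = fill (i +# 1#) ys (writeArray# marr i y s)
        in case unsafeFreezeArray# marr (fill 0# xs s1) of
             (# s2, arr #) -> (# s2, Boxed arr #)))

-- The ByteArray# stores the raw machine integers back to back.
unboxedFromList :: [Int] -> Unboxed
unboxedFromList xs = case length xs * intWidth of
  I# bytes -> runST (ST (\s0 ->
    case newByteArray# bytes s0 of
      (# s1, mba #) ->
        let fill _ [] s = s
            fill i (I# y : ys) s = fill (i +# 1#) ys (writeIntArray# mba i y s)
        in case unsafeFreezeByteArray# mba (fill 0# xs s1) of
             (# s2, ba #) -> (# s2, Unboxed ba #)))

-- indexArray# reads the pointer in slot i; forcing x then follows that
-- pointer to wherever the element lives on the heap.
sumBoxed :: Boxed -> Int
sumBoxed (Boxed arr) = go 0# 0
  where
    go i acc
      | isTrue# (i >=# sizeofArray# arr) = acc
      | otherwise = case indexArray# arr i of
          (# x #) -> let acc' = acc + x in acc' `seq` go (i +# 1#) acc'

-- indexIntArray# reads the value itself from contiguous memory.
sumUnboxed :: Unboxed -> Int
sumUnboxed (Unboxed ba) = case intWidth of
  I# w -> I# (go (quotInt# (sizeofByteArray# ba) w) 0# 0#)
  where
    go n i acc
      | isTrue# (i >=# n) = acc
      | otherwise = go n (i +# 1#) (acc +# indexIntArray# ba i)
```

Loaded into GHCi, both sumBoxed (boxedFromList xs) and sumUnboxed (unboxedFromList xs) compute the same sum; the only difference is the memory each loop touches along the way.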
I propose adding the primops:
    packArray# :: Array# a -> b -> b
    packSmallArray# :: SmallArray# a -> b -> b
    packMutableArray# :: MutableArray# s a -> State# s -> State# s
    packSmallMutableArray# :: SmallMutableArray# s a -> State# s -> State# s
These operations would have the semantic effect of no-ops, with the exception that they request that the logical elements in the arrays be arranged adjacently in memory. Thus, future loops over the arrays would benefit from CPU prefetching. There are a number of ways the memory manager could handle these requests, and I don't particularly care which is chosen. For example, the memory manager could rearrange the memory immediately, or wait until the next garbage collection pass and do it then.
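As a rough illustration of how call sites might look, here is a sketch that uses stand-in definitions for two of the proposed operations. These primops do not exist in GHC; the stubs below are hypothetical and model only the no-op semantics described above, with no effect on memory layout. The helper names (Boxed, fromListPacked, sumPacked) are likewise invented for the example.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts
import GHC.ST (ST (..), runST)

-- HYPOTHETICAL stand-ins for the proposed primops. They are plain no-ops:
-- a real implementation would additionally ask the memory manager to lay
-- the elements out adjacently.
packArray# :: Array# a -> b -> b
packArray# _ b = b

packMutableArray# :: MutableArray# s a -> State# s -> State# s
packMutableArray# _ s = s

-- Lifted wrapper (hypothetical name) around the unlifted array.
data Boxed a = Boxed (Array# a)

-- Fill a mutable array, request packing once, then freeze it.
fromListPacked :: [a] -> Boxed a
fromListPacked xs = case length xs of
  I# n -> runST (ST (\s0 ->
    case newArray# n (error "fromListPacked: unset slot") s0 of
      (# s1, marr #) ->
        let fill _ [] s = s
            fill i (y : ys) s = fill (i +# 1#) ys (writeArray# marr i y s)
        in case unsafeFreezeArray# marr (packMutableArray# marr (fill 0# xs s1)) of
             (# s2, arr #) -> (# s2, Boxed arr #)))

-- Request packing on an immutable array right before a hot loop over it.
sumPacked :: Boxed Int -> Int
sumPacked (Boxed arr) = packArray# arr (go 0# 0)
  where
    go i acc
      | isTrue# (i >=# sizeofArray# arr) = acc
      | otherwise = case indexArray# arr i of
          (# x #) -> let acc' = acc + x in acc' `seq` go (i +# 1#) acc'
```

Because the operations are semantically no-ops, adding or removing a call never changes a program's result, so they could be inserted wherever profiling suggests a loop is suffering cache misses.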