Add inline versions of clone array primops
|Reported by:||tibbe||Owned by:||simonmar|
|Type of failure:||None/Unknown||Test Case:|
|Related Tickets:||Differential Rev(s):|
I've changed the clone array primops (i.e. cloneArray#, cloneMutableArray#, freezeArray#, and thawArray#) to use the new inline allocation optimization for statically known array sizes. Furthermore, I've moved the implementation for the non-statically known case out-of-line, which should reduce code size.
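For readers unfamiliar with these primops, here is a minimal sketch of the two cases described above. It is not the attached benchmark or the patch itself; the wrapper type `Array` and the helpers `replicateArray`, `cloneStatic`, `cloneDynamic`, `sizeOf` and `indexArr` are hypothetical names introduced only for illustration. In `cloneStatic` the size argument is a literal, which is the statically known case; in `cloneDynamic` the size is only known at run time.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Array#, Int (I#), cloneArray#, indexArray#, newArray#,
                 sizeofArray#, unsafeFreezeArray#)
import GHC.ST (ST (ST), runST)

-- Hypothetical boxed wrapper so an unlifted Array# can be passed around.
data Array a = Array (Array# a)

-- Build an immutable array holding n copies of x.
replicateArray :: Int -> a -> Array a
replicateArray (I# n#) x = runST (ST (\s0 ->
  case newArray# n# x s0 of
    (# s1, marr# #) -> case unsafeFreezeArray# marr# s1 of
      (# s2, arr# #) -> (# s2, Array arr# #)))

-- Statically known case: the size (16#) is a literal at the call site.
cloneStatic :: Array a -> Array a
cloneStatic (Array arr#) = Array (cloneArray# arr# 0# 16#)

-- Non-statically known case: the size is a run-time value.
cloneDynamic :: Array a -> Int -> Array a
cloneDynamic (Array arr#) (I# n#) = Array (cloneArray# arr# 0# n#)

sizeOf :: Array a -> Int
sizeOf (Array arr#) = I# (sizeofArray# arr#)

indexArr :: Array a -> Int -> a
indexArr (Array arr#) (I# i#) = case indexArray# arr# i# of (# x #) -> x

main :: IO ()
main = do
  let src = replicateArray 16 'x'
  print (sizeOf (cloneStatic src))          -- prints 16
  print (indexArr (cloneDynamic src 8) 7)   -- prints 'x'
```

Both functions return the same kind of result; the only difference is whether the compiler can see the size at compile time.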
The numbers are very encouraging: the new implementation takes 74% less total time than the old one (0.08s vs. 0.31s, i.e. almost 4x faster). I measured this by looking at the total time reported by +RTS -s for the attached InlineCloneArrayAlloc benchmark.
Here are the stats from the best out of three runs of the old implementation:
   1,600,041,120 bytes allocated in the heap
           6,504 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
            1588 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.01s    0.01s     0.0082s    0.0082s
  Gen  1         1 colls,     0 par    0.00s    0.11s     0.1062s    0.1062s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.29s  (  0.57s elapsed)
  GC      time    0.01s  (  0.11s elapsed)
  EXIT    time    0.01s  (  0.11s elapsed)
  Total   time    0.31s  (  0.80s elapsed)

  %GC     time       2.7%  (14.2% elapsed)

  Alloc rate    5,497,251,856 bytes per MUT second

  Productivity  97.3% of total user, 37.4% of total elapsed
Here are the same for the new implementation:
   1,600,041,120 bytes allocated in the heap
          57,224 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3125 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0003s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.08s  (  0.08s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.08s  (  0.09s elapsed)

  %GC     time       6.4%  (8.8% elapsed)

  Alloc rate    21,260,179,643 bytes per MUT second

  Productivity  93.5% of total user, 88.8% of total elapsed
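As a sanity check on the headline figure, the two "Total time" values from the runs above (0.31s old, 0.08s new) can be plugged into a small calculation. This is only a sketch of the arithmetic; `timeReduction` and `speedup` are hypothetical helper names, not part of the benchmark.

```haskell
module Main (main) where

import Text.Printf (printf)

-- Percentage of the old running time that the new implementation saves.
timeReduction :: Double -> Double -> Double
timeReduction old new = 100 * (old - new) / old

-- How many times faster the new implementation is.
speedup :: Double -> Double -> Double
speedup old new = old / new

main :: IO ()
main = do
  let oldTotal = 0.31  -- "Total time" (user) of the old implementation, in seconds
      newTotal = 0.08  -- "Total time" (user) of the new implementation, in seconds
  printf "%.1f%% less total time\n" (timeReduction oldTotal newTotal)  -- 74.2% less total time
  printf "%.1fx faster\n" (speedup oldTotal newTotal)                  -- 3.9x faster
```

The same calculation on the elapsed totals (0.80s vs. 0.09s) would give a larger ratio, so which column is quoted matters.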
The gap between the two implementations widens in favor of the new one as the iteration count is increased.
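There's also an interesting difference in the Gen 1 collection times between the two implementations: the single Gen 1 collection takes 0.11s elapsed (0.1062s pause) with the old implementation but 0.00s elapsed (0.0003s pause) with the new one. Total memory in use also drops from 1588 MB to 1 MB, while the number of Gen 0 collections rises from 1 to 3125.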