Opened 13 months ago

Closed 12 months ago

Last modified 12 months ago

#8885 closed feature request (fixed)

Add inline versions of clone array primops

Reported by: tibbe
Owned by: simonmar
Priority: normal
Milestone:
Component: Compiler
Version: 7.9
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

I've changed the clone array primops (i.e. cloneArray#, cloneMutableArray#, freezeArray#, and thawArray#) to use the new inline allocation optimization for statically known array sizes. Furthermore, I've moved the implementation for the non-statically known case out-of-line, which should reduce code size.
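
For illustration, here is roughly what the statically-known distinction means at a call site (a minimal sketch; these functions are not from the ticket's attachments):

  {-# LANGUAGE MagicHash #-}
  module CloneSketch where

  import GHC.Exts (Array#, Int#, cloneArray#)

  -- The length argument 16# is a compile-time constant, so the code
  -- generator can allocate the clone inline.
  cloneStatic :: Array# a -> Array# a
  cloneStatic arr = cloneArray# arr 0# 16#

  -- Here the length is only known at run time, so the clone compiles to a
  -- call to the out-of-line implementation instead.
  cloneDynamic :: Array# a -> Int# -> Array# a
  cloneDynamic arr n = cloneArray# arr 0# n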

The numbers are very encouraging: the new implementation cuts the total time by 74% (i.e. it is almost 4x faster than the old one). I measured this by looking at the total time reported by +RTS -s for the attached InlineCloneArrayAlloc benchmark.
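
The benchmark itself is attached rather than inlined here, so the following is only a sketch of the kind of program being measured (the module body, iteration count, and the use of cloneMutableArray# are assumptions): clone a small, statically sized array in a tight loop so that the clone primop dominates allocation and mutator time. With 16 pointer elements each clone is 160 bytes (3-word header, 16 words of payload, and a card table word, matching the allocate(..., 20) call in the Cmm below), so ten million iterations give roughly the 1.6 GB of allocation reported in the stats.

  {-# LANGUAGE MagicHash, UnboxedTuples #-}
  module Main (main) where

  import GHC.Exts
  import GHC.IO (IO(..))

  -- Allocate one 16-element array, then clone it ten million times.
  main :: IO ()
  main = IO $ \s0 ->
    case newArray# 16# () s0 of
      (# s1, marr #) ->
        let go :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
            go 0# s = (# s, () #)
            go i  s = case cloneMutableArray# marr 0# 16# s of
                        (# s', _ #) -> go (i -# 1#) s'
        in go 10000000# s1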

Here are the stats from the best out of three runs of the old implementation:

   1,600,041,120 bytes allocated in the heap
           6,504 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
            1588 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.01s    0.01s     0.0082s    0.0082s
  Gen  1         1 colls,     0 par    0.00s    0.11s     0.1062s    0.1062s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.29s  (  0.57s elapsed)
  GC      time    0.01s  (  0.11s elapsed)
  EXIT    time    0.01s  (  0.11s elapsed)
  Total   time    0.31s  (  0.80s elapsed)

  %GC     time       2.7%  (14.2% elapsed)

  Alloc rate    5,497,251,856 bytes per MUT second

  Productivity  97.3% of total user, 37.4% of total elapsed

Here are the same stats for the new implementation:

   1,600,041,120 bytes allocated in the heap
          57,224 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3125 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0003s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.08s  (  0.08s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.08s  (  0.09s elapsed)

  %GC     time       6.4%  (8.8% elapsed)

  Alloc rate    21,260,179,643 bytes per MUT second

  Productivity  93.5% of total user, 88.8% of total elapsed

The gap between the new and the old implementation widens further in the new implementation's favour as the iteration count is increased.

There's also an interesting difference in the Gen 1 collection times between the two implementations.

Change History (17)

comment:1 Changed 13 months ago by tibbe

  • Status changed from new to patch

comment:2 Changed 13 months ago by tibbe

I did some investigation to figure out why the old implementation holds on to all memory (assuming that it's not an accounting error). First, here's what the inner loop looks like for the old implementation:

$wa_entry() //  [R3, R2]
        { [(c3i2,
            $wa_info:
                const 12884901901;
                const 0;
                const 15;)]
        }
    {offset
      c3i2:
          if (R2 != 0) goto c3i0; else goto c3i1;
      c3i0:
          _s3f9::P64 = R3;
          (_c3hS::I64) = call "ccall" arg hints:  [PtrHint,]  result hints:  [PtrHint] allocate(BaseReg - 24, 20);
          I64[_c3hS::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN0_info@GOTPCREL];
          I64[_c3hS::I64 + 8] = 16;
          I64[_c3hS::I64 + 16] = 17;
          _c3i8::I64 = _c3hS::I64 + 24;
          call MO_Memcpy(_c3i8::I64, _s3f9::P64 + 24, 128, 8);
          call MO_Memset(_c3i8::I64 + 128, 1, 1, 8);
          R3 = _s3f9::P64;
          R2 = R2 - 1;
          call $wa_info(R3, R2) args: 8, res: 0, upd: 8;
      c3i1:
          R1 = PicBaseReg + (()_closure+1);
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
}

The first thing I did was to use the correct info table (FROZEN, not FROZEN0). That didn't help.

The second thing I did was to use -fno-omit-yields to try to make sure the GC was run, as there's no heap check in this loop (and I don't know if allocate ever invokes the GC). That didn't have any effect.

For reference, here's what the code with -fno-omit-yields looks like:

$wa_entry() //  [R3, R2]
        { [(c3i7,
            $wa_info:
                const 12884901901;
                const 0;
                const 15;)]
        }
    {offset
      c3i7:
          if (I64[BaseReg + 856] == 0) goto c3i8; else goto c3ia;
      c3i8:
          // nop
          // nop
          R1 = PicBaseReg + $wa_closure;
          call (I64[BaseReg - 8])(R3, R2, R1) args: 8, res: 0, upd: 8;
      c3ia:
          if (R2 != 0) goto c3i5; else goto c3i6;
      c3i5:
          _s3f9::P64 = R3;
          (_c3hX::I64) = call "ccall" arg hints:  [PtrHint,]  result hints:  [PtrHint] allocate(BaseReg - 24, 20);
          I64[_c3hX::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN_info@GOTPCREL];
          I64[_c3hX::I64 + 8] = 16;
          I64[_c3hX::I64 + 16] = 17;
          _c3ie::I64 = _c3hX::I64 + 24;
          call MO_Memcpy(_c3ie::I64, _s3f9::P64 + 24, 128, 8);
          call MO_Memset(_c3ie::I64 + 128, 1, 1, 8);
          R3 = _s3f9::P64;
          R2 = R2 - 1;
          call $wa_info(R3, R2) args: 8, res: 0, upd: 8;
      c3i6:
          R1 = PicBaseReg + (()_closure+1);
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
}

comment:3 Changed 13 months ago by tibbe

Both patches validate (modulo unrelated type-signature-printing failures that are breaking validate in HEAD at the moment).

comment:4 Changed 13 months ago by simonmar

1588 MB total memory in use (0 MB lost due to fragmentation)

Hehe, there was a bug in the old implementation: it didn't check for heap overflow, so this benchmark never GC'd.

comment:5 follow-up: Changed 13 months ago by simonmar

So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's probably better to inline the array alloc.

comment:6 in reply to: ↑ 5 Changed 13 months ago by tibbe

Replying to simonmar:

So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's probably better to inline the array alloc.

Right. Is the bug in the implementation of allocate or in the implementation of the primops? If it's in the primops, this bug exists in at least stg_newArrayzh as well, as I stole the allocation code from there. Let me know when the bug is fixed and I'll rerun the benchmark.

comment:7 Changed 13 months ago by simonmar

The bug is in the primop. The out-of-line implementation doesn't have the bug: look for MAYBE_GC(), that's the heap-overflow check.

comment:8 Changed 13 months ago by tibbe

There's not much point in comparing the new inline version against the old, incorrect version, and fixing the old version just to get benchmark numbers doesn't seem worth it, as we'd have to replicate MAYBE_GC in StgCmmPrim. Instead I compared the new inline version against the new out-of-line version (which calls allocate). The inline version is 69% faster.

Here are the +RTS -s numbers for the new out-of-line version:

   1,600,041,120 bytes allocated in the heap
          57,992 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3173 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.25s  (  0.25s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.26s  (  0.26s elapsed)

  %GC     time       2.1%  (2.9% elapsed)

  Alloc rate    6,417,285,798 bytes per MUT second

  Productivity  97.9% of total user, 95.6% of total elapsed

And for the inline version:

   1,600,041,120 bytes allocated in the heap
          57,224 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3125 colls,     0 par    0.00s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.08s  (  0.08s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.08s  (  0.09s elapsed)

  %GC     time       6.1%  (8.2% elapsed)

  Alloc rate    20,999,017,271 bytes per MUT second

  Productivity  93.9% of total user, 89.2% of total elapsed

You can see that the GC issue has been fixed.

I've attached updated versions of my patches that address the MAYBE_GC issue, which was also present in my new out-of-line implementation.

comment:9 Changed 13 months ago by tibbe

Marking these "out-of-line but may be inlined" primops as out-of-line in primops.txt.pp had the side effect of always adding an extra heap check for them, removing some of the benefits of inlining them in the first place. I've attached a patch that makes the heap check layout code respect the decision to inline these primops.
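
As a rough sketch of the idea (this is not the actual StgCmmPrim or heap-check layout code; the type and names below are illustrative only), the layout code should only pay for an extra heap check when the primop really does end up as an out-of-line call:

  -- Illustrative only: the real code consults shouldInlinePrimOp.
  module HeapCheckSketch where

  data PrimOpPlacement
    = InlineAlloc    -- allocates through the normal virtual-Hp mechanism
    | OutOfLineCall  -- may allocate outside the code generator's knowledge

  needsOwnHeapCheck :: PrimOpPlacement -> Bool
  needsOwnHeapCheck InlineAlloc   = False  -- covered by the enclosing heap check
  needsOwnHeapCheck OutOfLineCall = True   -- wrap the call in its own heap check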

comment:10 Changed 13 months ago by tibbe

My intuition to use a pointer-based copy loop instead of an index-based loop seems to have been wrong. Changing the loop from:

  dst_p = dst + SIZEOF_StgMutArrPtrs;
  src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset);
while:
  if (n != 0) {
    n = n - 1;
    W_[dst_p] = W_[src_p];
    dst_p = dst_p + WDS(1);
    src_p = src_p + WDS(1);
    goto while;
  }

to

  dst_p = dst + SIZEOF_StgMutArrPtrs;
  src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset);
  i = 0;
while:
  if (i < n) {
    W_[dst_p + i] = W_[src_p + i];
    i = i + 1;
    goto while;
  }

improves performance on my benchmark (clone array of 17 elements) by 14%.

I've attached a patch with this improvement.

comment:11 Changed 12 months ago by tibbe

I've rebased the first three patches, validated them, and submitted them as:

In 1eece45692fb5d1a5f4ec60c1537f8068237e9c1/ghc:

commit 1eece45692fb5d1a5f4ec60c1537f8068237e9c1
Author: Johan Tibell <[email protected]>
Date:   Thu Mar 13 09:35:21 2014 +0100

    codeGen: inline allocation optimization for clone array primops
    
    The inline allocation version is 69% faster than the out-of-line
    version, when cloning an array of 16 unit elements on a 64-bit
    machine.
    
    Comparing the new and the old primop implementations isn't
    straightforward. The old version had a missing heap check that I
    discovered during the development of the new version. Comparing the
    old and the new version would require fixing the old version, which
    in turn means reimplementing the equivalent of MAYBE_GC in StgCmmPrim.
    
    The inline allocation threshold is configurable via
    -fmax-inline-alloc-size which gives the maximum array size, in bytes,
    to allocate inline. The size does not include the closure header size.
    
    Allowing the same primop to be either inline or out-of-line has some
    implication for how we lay out heap checks. We always place a heap
    check around out-of-line primops, as they may allocate outside of our
    knowledge. However, for the inline primops we only allow allocation
    via the standard means (i.e. virtHp). Since the clone primops might be
    either inline or out-of-line the heap check layout code now consults
    shouldInlinePrimOp to know whether a primop will be inlined.

This commit does not include 0004-codeGen-cloneArray-use-index-based-for-loop-instead-.patch, which was incorrect.
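
For reference, a hedged example of tuning the new threshold (the module is made up, and I'm assuming -fmax-inline-alloc-size is accepted in an OPTIONS_GHC pragma like other dynamic -f flags):

  {-# LANGUAGE MagicHash #-}
  {-# OPTIONS_GHC -fmax-inline-alloc-size=256 #-}
  module CloneFlagExample where

  import GHC.Exts (Array#, cloneArray#)

  -- With a 256-byte threshold, clones of up to 32 pointers (on a 64-bit
  -- machine) are allocated inline; larger or dynamically sized clones fall
  -- back to the out-of-line implementation.
  cloneSixteen :: Array# a -> Array# a
  cloneSixteen arr = cloneArray# arr 0# 16#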

comment:12 Changed 12 months ago by tibbe

  • Resolution set to fixed
  • Status changed from patch to closed

comment:13 Changed 12 months ago by simonmar

Sorry I didn't get around to reviewing this. I just skimmed the commit and it all looks very reasonable though. Nice work :)
