Opened 13 months ago

Closed 12 months ago

Last modified 12 months ago

#8885 closed feature request (fixed)

Add inline versions of clone array primops

Reported by: tibbe
Owned by: simonmar
Priority: normal
Milestone:
Component: Compiler
Version: 7.9
Keywords:
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Revisions:

Description

I've changed the clone array primops (i.e. cloneArray#, cloneMutableArray#, freezeArray#, and thawArray#) to use the new inline allocation optimization for statically known array sizes. Furthermore, I've moved the implementation for the non-statically known case out-of-line, which should reduce code size.
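
For illustration, here is roughly what the statically-known distinction means at a call site (a minimal sketch; these functions are not from the ticket's attachments):

  {-# LANGUAGE MagicHash #-}
  module CloneSketch where

  import GHC.Exts (Array#, Int#, cloneArray#)

  -- The length argument 16# is a compile-time constant, so the code
  -- generator can allocate the clone inline.
  cloneStatic :: Array# a -> Array# a
  cloneStatic arr = cloneArray# arr 0# 16#

  -- Here the length is only known at run time, so the clone compiles to a
  -- call to the out-of-line implementation instead.
  cloneDynamic :: Array# a -> Int# -> Array# a
  cloneDynamic arr n = cloneArray# arr 0# n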

The numbers are very encouraging: the new implementation cuts the total time by 74% (i.e. it is almost 4x faster than the old one). I measured this by looking at the total time reported by +RTS -s for the attached InlineCloneArrayAlloc benchmark.
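
The benchmark itself is attached rather than inlined here, so the following is only a sketch of the kind of program being measured (the module body, iteration count, and the use of cloneMutableArray# are assumptions): clone a small, statically sized array in a tight loop so that the clone primop dominates allocation and mutator time. With 16 pointer elements each clone is 160 bytes (3-word header, 16 words of payload, and a card table word, matching the allocate(..., 20) call in the Cmm below), so ten million iterations give roughly the 1.6 GB of allocation reported in the stats.

  {-# LANGUAGE MagicHash, UnboxedTuples #-}
  module Main (main) where

  import GHC.Exts
  import GHC.IO (IO(..))

  -- Allocate one 16-element array, then clone it ten million times.
  main :: IO ()
  main = IO $ \s0 ->
    case newArray# 16# () s0 of
      (# s1, marr #) ->
        let go :: Int# -> State# RealWorld -> (# State# RealWorld, () #)
            go 0# s = (# s, () #)
            go i  s = case cloneMutableArray# marr 0# 16# s of
                        (# s', _ #) -> go (i -# 1#) s'
        in go 10000000# s1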

Here are the stats from the best out of three runs of the old implementation:

   1,600,041,120 bytes allocated in the heap
           6,504 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
            1588 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.01s    0.01s     0.0082s    0.0082s
  Gen  1         1 colls,     0 par    0.00s    0.11s     0.1062s    0.1062s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.29s  (  0.57s elapsed)
  GC      time    0.01s  (  0.11s elapsed)
  EXIT    time    0.01s  (  0.11s elapsed)
  Total   time    0.31s  (  0.80s elapsed)

  %GC     time       2.7%  (14.2% elapsed)

  Alloc rate    5,497,251,856 bytes per MUT second

  Productivity  97.3% of total user, 37.4% of total elapsed

Here are the same stats for the new implementation:

   1,600,041,120 bytes allocated in the heap
          57,224 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3125 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0003s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.08s  (  0.08s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.08s  (  0.09s elapsed)

  %GC     time       6.4%  (8.8% elapsed)

  Alloc rate    21,260,179,643 bytes per MUT second

  Productivity  93.5% of total user, 88.8% of total elapsed

The gap between the new and the old implementation widens further in the new implementation's favour as the iteration count is increased.

There's also an interesting difference in the Gen 1 collection times between the two implementations.

Change History (17)

comment:1 Changed 13 months ago by tibbe

  • Status changed from new to patch

comment:2 Changed 13 months ago by tibbe

I did some investigation to figure out why the old implementation holds on to all memory (assuming that it's not an accounting error). First, here's what the inner loop looks like for the old implementation:

$wa_entry() //  [R3, R2]
        { [(c3i2,
            $wa_info:
                const 12884901901;
                const 0;
                const 15;)]
        }
    {offset
      c3i2:
          if (R2 != 0) goto c3i0; else goto c3i1;
      c3i0:
          _s3f9::P64 = R3;
          (_c3hS::I64) = call "ccall" arg hints:  [PtrHint,]  result hints:  [PtrHint] allocate(BaseReg - 24, 20);
          I64[_c3hS::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN0_info@GOTPCREL];
          I64[_c3hS::I64 + 8] = 16;
          I64[_c3hS::I64 + 16] = 17;
          _c3i8::I64 = _c3hS::I64 + 24;
          call MO_Memcpy(_c3i8::I64, _s3f9::P64 + 24, 128, 8);
          call MO_Memset(_c3i8::I64 + 128, 1, 1, 8);
          R3 = _s3f9::P64;
          R2 = R2 - 1;
          call $wa_info(R3, R2) args: 8, res: 0, upd: 8;
      c3i1:
          R1 = PicBaseReg + (()_closure+1);
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
}

The first thing I did was to use the correct info table (FROZEN, not FROZEN0). That didn't help.

The second thing I did was to use -fno-omit-yields to try to make sure the GC was run, as there's no heap check in this loop (and I don't know if allocate ever invokes the GC). That didn't have any effect.

For reference, here's what the code with -fno-omit-yields looks like:

$wa_entry() //  [R3, R2]
        { [(c3i7,
            $wa_info:
                const 12884901901;
                const 0;
                const 15;)]
        }
    {offset
      c3i7:
          if (I64[BaseReg + 856] == 0) goto c3i8; else goto c3ia;
      c3i8:
          // nop
          // nop
          R1 = PicBaseReg + $wa_closure;
          call (I64[BaseReg - 8])(R3, R2, R1) args: 8, res: 0, upd: 8;
      c3ia:
          if (R2 != 0) goto c3i5; else goto c3i6;
      c3i5:
          _s3f9::P64 = R3;
          (_c3hX::I64) = call "ccall" arg hints:  [PtrHint,]  result hints:  [PtrHint] allocate(BaseReg - 24, 20);
          I64[_c3hX::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN_info@GOTPCREL];
          I64[_c3hX::I64 + 8] = 16;
          I64[_c3hX::I64 + 16] = 17;
          _c3ie::I64 = _c3hX::I64 + 24;
          call MO_Memcpy(_c3ie::I64, _s3f9::P64 + 24, 128, 8);
          call MO_Memset(_c3ie::I64 + 128, 1, 1, 8);
          R3 = _s3f9::P64;
          R2 = R2 - 1;
          call $wa_info(R3, R2) args: 8, res: 0, upd: 8;
      c3i6:
          R1 = PicBaseReg + (()_closure+1);
          call (P64[Sp])(R1) args: 8, res: 0, upd: 8;
    }
}

comment:3 Changed 13 months ago by tibbe

Both patches validate (modulo unrelated type-signature-printing failures that are breaking validate in HEAD at the moment).

comment:4 Changed 13 months ago by simonmar

1588 MB total memory in use (0 MB lost due to fragmentation)

Hehe, there was a bug in the old implementation: it didn't check for heap overflow, so this benchmark never GC'd.

comment:5 follow-up: Changed 13 months ago by simonmar

So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's probably better to inline the array alloc.

comment:6 in reply to: ↑ 5 Changed 13 months ago by tibbe

Replying to simonmar:

So we don't know how much of the speed increase is due to fixing the bug, and how much is due to inlining the array alloc. Still, it's probably better to inline the array alloc.

Right. Is the bug in the implementation of allocate or in the implementation of the primops? If it's in the primops, this bug exists in at least stg_newArrayzh as well, as I stole the allocation code from there. Let me know when the bug is fixed and I'll rerun the benchmark.

comment:7 Changed 13 months ago by simonmar

The bug is in the primop. The out-of-line implementation doesn't have the bug: look for MAYBE_GC(), that's the heap-overflow check.

comment:8 Changed 13 months ago by tibbe

There's not much point in comparing the new inline version against the old, incorrect version, and fixing the old version just to get benchmark numbers doesn't seem worth it, as we'd have to replicate MAYBE_GC in StgCmmPrim. Instead I compared the new inline version against the new out-of-line version (which calls allocate). The inline version is 69% faster.

Here are the +RTS -s numbers for the new out-of-line version:

   1,600,041,120 bytes allocated in the heap
          57,992 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3173 colls,     0 par    0.01s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.25s  (  0.25s elapsed)
  GC      time    0.01s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.26s  (  0.26s elapsed)

  %GC     time       2.1%  (2.9% elapsed)

  Alloc rate    6,417,285,798 bytes per MUT second

  Productivity  97.9% of total user, 95.6% of total elapsed

And for the inline version:

   1,600,041,120 bytes allocated in the heap
          57,224 bytes copied during GC
          35,992 bytes maximum residency (1 sample(s))
          21,352 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      3125 colls,     0 par    0.00s    0.01s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0002s    0.0002s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.08s  (  0.08s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.08s  (  0.09s elapsed)

  %GC     time       6.1%  (8.2% elapsed)

  Alloc rate    20,999,017,271 bytes per MUT second

  Productivity  93.9% of total user, 89.2% of total elapsed

You can see that the GC issue has been fixed.

I've attached updated versions of my patches that address the MAYBE_GC issue, which was also present in my new out-of-line implementation.

comment:9 Changed 13 months ago by tibbe

Marking these "out-of-line but may be inlined" primops as out-of-line in primops.txt.pp had the side effect of always adding an extra heap check for them, removing some of the benefits of inlining them in the first place. I've attached a patch that makes the heap check layout code respect the decision to inline these primops.
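
As a rough sketch of the idea (this is not the actual StgCmmPrim or heap-check layout code; the type and names below are illustrative only), the layout code should only pay for an extra heap check when the primop really does end up as an out-of-line call:

  -- Illustrative only: the real code consults shouldInlinePrimOp.
  module HeapCheckSketch where

  data PrimOpPlacement
    = InlineAlloc    -- allocates through the normal virtual-Hp mechanism
    | OutOfLineCall  -- may allocate outside the code generator's knowledge

  needsOwnHeapCheck :: PrimOpPlacement -> Bool
  needsOwnHeapCheck InlineAlloc   = False  -- covered by the enclosing heap check
  needsOwnHeapCheck OutOfLineCall = True   -- wrap the call in its own heap check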

comment:10 Changed 13 months ago by tibbe

My intuition to use a pointer-based copy loop instead of an index-based loop seems to have been wrong. Changing the loop from:

  dst_p = dst + SIZEOF_StgMutArrPtrs;
  src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset);
while:
  if (n != 0) {
    n = n - 1;
    W_[dst_p] = W_[src_p];
    dst_p = dst_p + WDS(1);
    src_p = src_p + WDS(1);
    goto while;
  }

to

  dst_p = dst + SIZEOF_StgMutArrPtrs;
  src_p = src + SIZEOF_StgMutArrPtrs + WDS(offset);
  i = 0;
while:
  if (i < n) {
    W_[dst_p + i] = W_[src_p + i];
    i = i + 1;
    goto while;
  }

improves performance on my benchmark (clone array of 17 elements) by 14%.

I've attached a patch with this improvement.

comment:11 Changed 12 months ago by tibbe

I've rebased the first three patches, validated them, and submitted them as:

In 1eece45692fb5d1a5f4ec60c1537f8068237e9c1/ghc:

commit 1eece45692fb5d1a5f4ec60c1537f8068237e9c1
Author: Johan Tibell <[email protected]>
Date:   Thu Mar 13 09:35:21 2014 +0100

    codeGen: inline allocation optimization for clone array primops
    
    The inline allocation version is 69% faster than the out-of-line
    version, when cloning an array of 16 unit elements on a 64-bit
    machine.
    
    Comparing the new and the old primop implementations isn't
    straightforward. The old version had a missing heap check that I
    discovered during the development of the new version. Comparing the
    old and the new version would require fixing the old version, which
    in turn means reimplementing the equivalent of MAYBE_GC in StgCmmPrim.
    
    The inline allocation threshold is configurable via
    -fmax-inline-alloc-size which gives the maximum array size, in bytes,
    to allocate inline. The size does not include the closure header size.
    
    Allowing the same primop to be either inline or out-of-line has some
    implication for how we lay out heap checks. We always place a heap
    check around out-of-line primops, as they may allocate outside of our
    knowledge. However, for the inline primops we only allow allocation
    via the standard means (i.e. virtHp). Since the clone primops might be
    either inline or out-of-line the heap check layout code now consults
    shouldInlinePrimOp to know whether a primop will be inlined.

This commit does not include 0004-codeGen-cloneArray-use-index-based-for-loop-instead-.patch, which was incorrect.
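
For reference, a hedged example of tuning the new threshold (the module is made up, and I'm assuming -fmax-inline-alloc-size is accepted in an OPTIONS_GHC pragma like other dynamic -f flags):

  {-# LANGUAGE MagicHash #-}
  {-# OPTIONS_GHC -fmax-inline-alloc-size=256 #-}
  module CloneFlagExample where

  import GHC.Exts (Array#, cloneArray#)

  -- With a 256-byte threshold, clones of up to 32 pointers (on a 64-bit
  -- machine) are allocated inline; larger or dynamically sized clones fall
  -- back to the out-of-line implementation.
  cloneSixteen :: Array# a -> Array# a
  cloneSixteen arr = cloneArray# arr 0# 16#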

comment:12 Changed 12 months ago by tibbe

  • Resolution set to fixed
  • Status changed from patch to closed

comment:13 Changed 12 months ago by simonmar

Sorry I didn't get around to reviewing this. I just skimmed the commit and it all looks very reasonable though. Nice work :)
