Opened 4 years ago

Last modified 6 months ago

#8905 new bug

Function arguments are always spilled/reloaded if scrutinee is already in WHNF

Reported by: tibbe Owned by: kavon
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.9
Keywords: CodeGen Cc: simonmar
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

The code generator unnecessarily spills and reloads function arguments if the scrutinee turns out to be already evaluated (i.e. has non-zero tag bits).

Here's the beginning of a function body, taken from the insert function at https://github.com/tibbe/unordered-containers/blob/master/Data/HashMap/Base.hs#L303:

c2wQ:  // stack check
    if ((Sp + -72) < SpLim) goto c2wR; else goto c2wS;
c2wR:  // stack check failure
    R1 = PicBaseReg + $wpoly_go_closure;
    I64[Sp - 40] = R2;
    I64[Sp - 32] = R3;
    P64[Sp - 24] = R4;
    I64[Sp - 16] = R5;
    P64[Sp - 8] = R6;
    Sp = Sp - 40;
    call (I64[BaseReg - 8])(R1) args: 48, res: 0, upd: 8;
c2wS:  // stack check success
    I64[Sp - 40] = PicBaseReg + block_c2my_info;  // return addr for eval
    R1 = R6;  // t
    I64[Sp - 32] = R2;  // spill: s
    I64[Sp - 24] = R3;  // spill: x
    P64[Sp - 16] = R4;  // spill: k
    I64[Sp - 8] = R5;  // spill: h
    Sp = Sp - 40;
    if (R1 & 7 != 0) goto c2my; else goto c2mz;  // eval check of t
c2mz:  // eval check failed
    call (I64[R1])(R1) returns to c2my, args: 8, res: 8, upd: 8;  // eval
c2my:  // eval check succeeded
    _s2b1::I64 = I64[Sp + 8];  // reload: h
    _s2b2::I64 = I64[Sp + 16];  // reload: k
    _s2b3::P64 = P64[Sp + 24];  // reload: x
    _s2b4::I64 = I64[Sp + 32];  // reload: s
    switch [0 .. 4] (R1 & 7 - 1) {
        case 0 : goto c2wK;
        case 1 : goto c2wL;
        case 2 : goto c2wM;
        case 3 : goto c2wN;
        case 4 : goto c2wO;
    }

It seems to me that all the spills/reloads could be pushed into the c2mz block.

The c2my block, in its current form, is reused for a heap check failure case, so the heap check most likely will have to do its own spilling/reloading. However, since the scrutinee not having tags bit or the eval checking failing is not the common case, they should be out of the common path.

If it matters the data type is spine strict so GHC should have enough information to know that the common case (e.g. self-recursive calls) already have an evaluated argument (although there might be an indirection in some cases).

Attachments (3)

0001-EXPERIMENTAL-save-the-continuation-after-the-tag-che.patch (2.5 KB) - added by simonmar 4 years ago.
cryptarithm1-master.dump-opt-cmm (41.5 KB) - added by tibbe 4 years ago.
cryptarithm1-no-spill.dump-opt-cmm (60.5 KB) - added by tibbe 4 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 Changed 4 years ago by tibbe

If code duplication is a concern, perhaps the spill and load code could be give their own basic blocks that could be jumped to.

comment:2 Changed 4 years ago by tibbe

The register allocator doesn't tidy this up either:

_c2wS:
        leaq block_c2my_info(%rip),%rax
        movq %rax,-40(%rbp)
        movq %r9,%rbx
        movq %r14,-32(%rbp)
        movq %rsi,-24(%rbp)
        movq %rdi,-16(%rbp)
        movq %r8,-8(%rbp)
        addq $-40,%rbp
        testq $7,%rbx
        jne _c2my
Last edited 4 years ago by tibbe (previous) (diff)

comment:3 Changed 4 years ago by carter

@tibbe, is this possibly related to https://ghc.haskell.org/trac/ghc/ticket/8048 ?

comment:4 Changed 4 years ago by simonmar

Yeah, it is somewhat by design (to get smaller code), but you're quite right that we should have faster code for the common case. I'll look into it and see if we can improve things.

comment:5 Changed 4 years ago by tibbe

If the stack change code can be moved out of the way, couldn't the switch statements also include checking the tags bit for 0?

comment:6 Changed 4 years ago by tibbe

To provide some motivation, I've included the full Cmm of the common path below. Every line that starts with > is what I would considered unnecessary spills/loads for this path:

  c2wQ:  // stack check
      if ((Sp + -72) < SpLim) goto c2wR; else goto c2wS;
  c2wS:  // stack check success
>     I64[Sp - 40] = PicBaseReg + block_c2my_info;  // return addr for eval
      R1 = R6;  // t
>     I64[Sp - 32] = R2;  // spill: s
>     I64[Sp - 24] = R3;  // spill: x
>     P64[Sp - 16] = R4;  // spill: k
>     I64[Sp - 8] = R5;  // spill: h
      Sp = Sp - 40;
      if (R1 & 7 != 0) goto c2my; else goto c2mz;  // eval check of t
  c2my:  // eval check succeded
>     _s2b1::I64 = I64[Sp + 8];  // reload: h
>     _s2b2::I64 = I64[Sp + 16];  // reload: k
>     _s2b3::P64 = P64[Sp + 24];  // reload: x
>     _s2b4::I64 = I64[Sp + 32];  // reload: s
      switch [0 .. 4] (R1 & 7 - 1) {
          case 0 : goto c2wK;
          case 1 : goto c2wL;
          case 2 : goto c2wM;
          case 3 : goto c2wN;
          case 4 : goto c2wO;
      }
  c2wN:  // Full
      _s2cx::P64 = P64[R1 + 4];  // ary
      _s2cy::I64 = (_s2b1::I64 >> _s2b4::I64) & 15;  // i
      _s2cC::P64 = P64[(_s2cx::P64 + 24) + (_s2cy::I64 << 3)];  // st
      I64[Sp] = PicBaseReg + block_c2nJ_info;  // return addr
      R6 = _s2cC::P64;  // arg: st
      R5 = _s2b4::I64 + 4;  // arg: s + bitsPerSubkey
      R4 = _s2b3::P64;  // arg: x
      R3 = _s2b2::I64;  // arg: k
      R2 = _s2b1::I64;  // arg: h
      P64[Sp + 8] = _s2cC::P64;  // spill: st
      I64[Sp + 16] = _s2cy::I64;  // spill: i
      P64[Sp + 24] = _s2cx::P64;  // spill: ary
      P64[Sp + 32] = R1;  // spill: t (only used in uncommon branch)
      call $wpoly_go_info(R6,
                          R5,
                          R4,
                          R3,
                          R2) returns to c2nJ, args: 8, res: 8, upd: 8;
  c2nJ:
>     _s2b6::P64 = P64[Sp + 32];  // reload: t (only used in uncommon branch)
      _s2cx::P64 = P64[Sp + 24];  // reload: ary
      _s2cy::I64 = I64[Sp + 16];  // reload: i
      _s2cE::P64 = R1;
      _s2cF::I64 = R1 == P64[Sp + 8];
      if (_s2cF::I64 != 1) goto c2nR; else goto c2AE;
  c2nR:  // heap check
      Hp = Hp + 176;
      if (Hp > I64[BaseReg + 856]) goto c2AB; else goto c2AA;
  c2AA:  // heap check success
      I64[Hp - 168] = I64[PicBaseReg + stg_MUT_ARR_PTRS_DIRTY_info@GOTPCREL];
      I64[Hp - 160] = 16;
      I64[Hp - 152] = 17;
      _c2nT::I64 = Hp - 168;
      call MO_Memcpy(_c2nT::I64 + 24, _s2cx::P64 + 24, 128, 8);
      P64[(_c2nT::I64 + 24) + (_s2cy::I64 << 3)] = _s2cE::P64;
      I64[_c2nT::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_DIRTY_info@GOTPCREL];
      I8[(_c2nT::I64 + 24) + ((I64[_c2nT::I64 + 8] << 3) + (_s2cy::I64 >> 7))] = 1 :: W8;
      I64[_c2nT::I64] = I64[PicBaseReg + stg_MUT_ARR_PTRS_FROZEN0_info@GOTPCREL];
      I64[Hp - 8] = PicBaseReg + Full_con_info;
      I64[Hp] = _c2nT::I64;
      R1 = Hp - 4;
      Sp = Sp + 40;
      call (P64[Sp])(R1) args: 8, res: 0, upd: 8;  // return

The main body of this function, except for the recursive call, is in the last block, c2AA. Quite of bit of time is spent on "administrative" things. Also note that there are a bunch of static arguments passed around (x, k, and h). I will try to see what the Cmm looks like if I manually perform a static argument transform on this code.

comment:7 Changed 4 years ago by simonmar

I attached a patch you can try. It definitely increases code size, but I'd be interested how much benefit it gives in your benchmarks.

comment:8 Changed 4 years ago by simonpj

I'm puzzled by why the spill code is before the eval check rather than after. Before spilling etc I'd expect to see:

   ...
   if <eval check> goto Leval else goto Ldone

Leval: call (I64[R1])(R1) returns to Ldone

Ldone:  // Now it's evaluated

Then the stack-layout stuff should insert spills and reloads around the call, but not interfering with the fast path.

So how come the spills affect the fast path at all? Is there some special code-size-saving magic in the code generator?

Simon

comment:9 Changed 4 years ago by simonmar

Simon - this is one of the things I changed when I worked on the new code generator. We could certainly investigate alternatives, the patch I attached earlier changes it but increases code size, I haven't done a full nofib run though.

comment:10 Changed 4 years ago by simonmar

Oh, I just remembered one thing: when we are splitting proc points, then the "faster" sequence is actually worse because it generates two separate proc points, one for the call-return and one for the joint point. If we always do the copyout, then the call-return can be the join point too, which means we only need one proc point.

However, we now don't generate proc points (when using the NCG), so we can probably use the faster sequence with only a small increase in code size. Someone needs to measure. And figure out whether we still want to do this when using LLVM or not.

comment:11 in reply to:  7 Changed 4 years ago by tibbe

Replying to simonmar:

I attached a patch you can try. It definitely increases code size, but I'd be interested how much benefit it gives in your benchmarks.

I will try to benchmark the patch on my code tonight. Will you do a nofib run?

comment:12 Changed 4 years ago by tibbe

Here are the nofib results on my machine

$ nofib-analyse/nofib-analyse spill.log no-spill.log
NoFib Results

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna          +2.2%      0.0%      0.07      0.08      0.0%
           ansi           0.0%      0.0%      0.00      0.00      0.0%
           atom           0.0%      0.0%     +7.9%     +6.1%      0.0%
         awards           0.0%      0.0%      0.00      0.00      0.0%
         banner           0.0%      0.0%      0.00      0.00      0.0%
     bernouilli           0.0%      0.0%      0.11      0.12      0.0%
   binary-trees           0.0%      0.0%    -14.8%    -15.1%      0.0%
          boyer           0.0%      0.0%      0.02      0.03      0.0%
         boyer2          +0.2%      0.0%      0.00      0.00      0.0%
           bspt          +1.7%      0.0%      0.00      0.01      0.0%
      cacheprof          +0.5%     -0.0%    +11.8%    +12.7%     +5.5%
       calendar          +0.1%      0.0%      0.00      0.00      0.0%
       cichelli          +0.1%      0.0%      0.05      0.05      0.0%
        circsim          +0.2%      0.0%    +15.0%    +15.8%      0.0%
       clausify           0.0%      0.0%      0.02      0.03      0.0%
  comp_lab_zift          +0.2%      0.0%      0.12      0.13      0.0%
       compress          +0.1%      0.0%      0.11      0.12      0.0%
      compress2          +0.6%      0.0%      0.10      0.12      0.0%
    constraints          +0.1%      0.0%     +9.4%     +9.4%      0.0%
   cryptarithm1           0.0%      0.0%    +19.2%    +18.4%      0.0%
   cryptarithm2          +0.1%      0.0%      0.00      0.01      0.0%
            cse           0.0%      0.0%      0.00      0.00      0.0%
          eliza           0.0%      0.0%      0.00      0.00      0.0%
          event          +0.1%      0.0%      0.09      0.10      0.0%
         exp3_8           0.0%      0.0%      0.15      0.16      0.0%
         expert          +0.2%      0.0%      0.00      0.00      0.0%
 fannkuch-redux          +0.1%      0.0%     -7.6%     -7.3%      0.0%
          fasta           0.0%      0.0%     -4.5%     -4.2%      0.0%
            fem          +1.0%      0.0%      0.01      0.02      0.0%
            fft          +0.1%      0.0%      0.02      0.03      0.0%
           fft2          +0.1%      0.0%      0.03      0.03      0.0%
       fibheaps          +0.2%      0.0%      0.02      0.02      0.0%
           fish          +0.1%      0.0%      0.01      0.01      0.0%
          fluid          +2.5%      0.0%      0.00      0.01      0.0%
         fulsom          +0.8%      0.0%      0.16      0.18      0.0%
         gamteb          +0.3%      0.0%      0.02      0.03      0.0%
            gcd           0.0%      0.0%      0.02      0.02      0.0%
    gen_regexps           0.0%      0.0%      0.00      0.00      0.0%
         genfft           0.0%      0.0%      0.02      0.02      0.0%
             gg          +0.5%      0.0%      0.00      0.01      0.0%
           grep          +0.3%      0.0%      0.00      0.00      0.0%
         hidden          +0.5%      0.0%     +1.0%     -0.0%      0.0%
            hpg          +0.3%      0.0%      0.04      0.11      0.0%
            ida          +0.2%      0.0%      0.06      0.06      0.0%
          infer          +0.2%      0.0%      0.03      0.04      0.0%
        integer           0.0%      0.0%     +5.6%     +5.7%      0.0%
      integrate           0.0%      0.0%      0.06      0.09      0.0%
   k-nucleotide          +0.2%      0.0%     -4.6%     -4.8%      0.0%
          kahan           0.0%      0.0%      0.17      0.18      0.0%
        knights          +0.3%      0.0%      0.00      0.00      0.0%
           lcss           0.0%     -0.0%     +2.9%     +2.6%      0.0%
           life           0.0%      0.0%      0.17      0.17      0.0%
           lift          +0.3%      0.0%      0.00      0.00      0.0%
      listcompr           0.0%      0.0%      0.04      0.05      0.0%
       listcopy           0.0%      0.0%      0.05      0.06      0.0%
       maillist           0.0%     +0.0%      0.03      0.17     +2.2%
         mandel           0.0%      0.0%      0.03      0.04      0.0%
        mandel2          +0.1%      0.0%      0.00      0.00      0.0%
        minimax          +0.1%      0.0%      0.00      0.00      0.0%
        mkhprog          +0.1%      0.0%      0.00      0.00      0.0%
     multiplier          +0.1%      0.0%      0.07      0.08      0.0%
         n-body           0.0%      0.0%     -6.3%     -6.2%      0.0%
       nucleic2          +1.8%      0.0%      0.04      0.05      0.0%
           para          +0.5%      0.0%      0.20      0.22      0.0%
      paraffins          +0.1%      0.0%      0.07      0.09      0.0%
         parser          +1.0%      0.0%      0.02      0.02      0.0%
        parstof          +0.4%      0.0%      0.00      0.01      0.0%
            pic          +0.5%      0.0%      0.00      0.01      0.0%
       pidigits           0.0%      0.0%     -6.3%     -5.5%      0.0%
          power          +0.1%      0.0%     +7.7%     +7.4%     +1.8%
         pretty           0.0%      0.0%      0.00      0.00      0.0%
         primes           0.0%      0.0%      0.04      0.04      0.0%
      primetest           0.0%      0.0%      0.06      0.07      0.0%
         prolog          +0.2%      0.0%      0.00      0.00      0.0%
         puzzle          +0.3%      0.0%      0.09      0.13      0.0%
         queens           0.0%      0.0%      0.01      0.01      0.0%
        reptile          +0.5%      0.0%      0.00      0.01      0.0%
reverse-complem          +0.1%      0.0%      0.05      0.08      0.0%
        rewrite          +0.2%      0.0%      0.01      0.01      0.0%
           rfib           0.0%      0.0%      0.01      0.01      0.0%
            rsa           0.0%      0.0%      0.01      0.01      0.0%
            scc           0.0%      0.0%      0.00      0.00      0.0%
          sched          +0.1%      0.0%      0.01      0.01      0.0%
            scs          +1.8%      0.0%    -10.4%    -13.1%      0.0%
         simple          +2.9%      0.0%      0.15      0.17      0.0%
          solid          +0.1%      0.0%      0.09      0.11      0.0%
        sorting          +0.1%      0.0%      0.00      0.00      0.0%
  spectral-norm           0.0%      0.0%     +3.5%     +3.1%      0.0%
         sphere          +0.6%      0.0%      0.02      0.03      0.0%
         symalg          +0.2%      0.0%      0.00      0.00      0.0%
            tak           0.0%      0.0%      0.01      0.01      0.0%
      transform          +0.4%      0.0%      0.22    +12.6%      0.0%
       treejoin          +0.1%      0.0%      0.09      0.11      0.0%
      typecheck          +0.1%      0.0%      0.14      0.15      0.0%
        veritas          +1.9%      0.0%      0.00      0.01      0.0%
           wang          +0.1%      0.0%      0.07      0.09      0.0%
      wave4main          +0.1%      0.0%      0.14      0.16      0.0%
   wheel-sieve1          +0.1%      0.0%     +3.8%     +4.5%      0.0%
   wheel-sieve2           0.0%      0.0%      0.12      0.14      0.0%
           x2n1           0.0%      0.0%      0.00      0.00      0.0%
--------------------------------------------------------------------------------
            Min           0.0%     -0.0%    -14.8%    -15.1%      0.0%
            Max          +2.9%     +0.0%    +19.2%    +18.4%     +5.5%
 Geometric Mean          +0.3%     -0.0%     +1.5%     +1.8%     +0.1%

A more detailed analysis of the results would be interesting. For example, I note that programs that are already heavily tuned (from the shootout suite) get faster, while some more "traditional" benchmarks get slower.

However, given that it's a slight regressions on average on nofib, I think this isn't something we should go ahead with without further investigation.

Last edited 4 years ago by tibbe (previous) (diff)

comment:13 Changed 4 years ago by tibbe

I've attached the Cmm for the biggest loss in the nofib run above, cryptarithm1. There's definitely more code generated with simonmar's patch (which nofib strangely doesn't report).

Just to validate nofib's result, I reran this benchmark manually and grabbed the results for the best out of five runs:

master:

real	0m0.328s
user	0m0.320s
sys	0m0.007s

no-spill:

real	0m0.333s
user	0m0.325s
sys	0m0.007s

That's a 1.5% runtime difference. Looking at the mutator time with +RTS -s (using five new runs), I can see no difference (+RTS -s has lower resolution).

Changed 4 years ago by tibbe

Changed 4 years ago by tibbe

comment:14 Changed 3 years ago by thomie

Component: CompilerCompiler (CodeGen)

comment:15 Changed 8 months ago by simonpj

Keywords: CodeGen added

comment:16 Changed 6 months ago by kavon

Owner: set to kavon
Note: See TracTickets for help on using tickets.