wiki:Commentary/Compiler/NewCodeGenStupidity

Version 10 (modified by ezyang, 3 years ago) (diff)

multiple reloads

Stupidity in the New Code Generator

Presently compiling using the new code generator results in a fairly sizable performance hit, because the new code generator produces sub-optimal (and sometimes absolutely terrible code.) There are a lot of ideas for how to make things better; the idea for this wiki page is to document all of the stupid things the new code generator is doing, to later be correlated with specific refactorings and fixes that will hopefully eliminate classes of these stupid things. The hope here is to develop a sense for what the most endemic problems with the newly generated code is.

Lots of temporary variables

Lots of temporary variables (these can tickle other issues when the temporaries are long-lived, but otherwise would be optimized away). You can at least eliminate some of them by looking at the output of -ddump-opt-cmm, which utilizes some basic temporary inlining when used with the native backend -fasm, but this doesn't currently apply to the GCC or LLVM backends.

At least one major culprit for this is allocDynClosure, described in Note Return a LocalReg; this pins down the value of the CmmExpr to be something for one particular time, but for a vast majority of use-cases the expression is used immediately afterwards. Actually, this is mostly my patches fault, because the extra rewrite means that the inline pass is broken. Fixed in latest version of the pass; we don't quite manage to inline enough but there's only one extra temporary.

Another cause of all of these temporary variables is that the new code generator immediately assigns any variables that were on the stack to temporaries immediately upon entry to a function.

Rewriting stacks

3586.hs emits the following code:

 Main.$wa_entry()
         { [const Main.$wa_slow-Main.$wa_info;, const 3591;, const 0;,
    const 458752;, const 0;, const 15;]
         }
     c17W:
         _s16B::F64 = F64[Sp + 20];
         F64[Sp - 8] = _s16B::F64;
         _s16h::I32 = I32[Sp + 16];
         _s16j::I32 = I32[Sp + 12];
         _s16y::I32 = I32[Sp + 8];
         _s16x::I32 = I32[Sp + 4];
         _s16w::I32 = I32[Sp + 0];
         if (Sp - 12 < SpLim) goto u1bR;
         Sp = Sp + 32;
// [SNIP]
         // directEntry else
         // emitCall: Sequel: Return
         F64[Sp - 12] = _s17a::F64;
         I32[Sp - 16] = _s17b::I32;
         I32[Sp - 20] = _s16j::I32;
         I32[Sp - 24] = _s16y::I32;
         I32[Sp - 28] = _s16x::I32;
         I32[Sp - 32] = _s16w::I32;
         Sp = Sp - 32;
         jump Main.$wa_info ();
     u1bR:
         Sp = Sp + 32;
         // outOfLine here
         R1 = Main.$wa_closure;
         F64[Sp - 12] = _s16B::F64;
         I32[Sp - 16] = _s16h::I32;
         I32[Sp - 20] = _s16j::I32;
         I32[Sp - 24] = _s16y::I32;
         I32[Sp - 28] = _s16x::I32;
         I32[Sp - 32] = _s16w::I32;
         Sp = Sp - 32;
         jump stg_gc_fun ();

We see that these temporary variables are being repeatedly rewritten to the stack, even when there are no changes.

Spilling Hp/Sp?

3586.hs emits the following code:

     _c1ao::I32 = Hp - 4;
     I32[Sp - 20] = _c1ao::I32;
     foreign "ccall"
       newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_];
     _c1ao::I32 = I32[Sp - 20];

We see Hp - 4 being allocated to a temp, and then consequently being spilled to the stack even though newCAF definitely will not change Hp, so we could have floated the expression down.

This seems to happen whenever there's a newCAF ccall.

We also seem to reload these values multiple times.

        _c7Yt::I32 = Hp - 4;
        I32[Sp - 28] = _c7Yt::I32;
        foreign "ccall"
          newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_];
        _c7Yt::I32 = I32[Sp - 28];
        I32[R1 + 4] = _c7Yt::I32;
        I32[R1] = stg_IND_STATIC_info;
        _c7Yt::I32 = I32[Sp - 28];  <--- totally unnecessary
        I32[Sp - 8] = _c7Yt::I32;
        I32[Sp - 12] = stg_upd_frame_info;

Up and Down

A frequent pattern is the stack pointer being bumped up and then back down again, for no particular reason.

         Sp = Sp + 4;
         Sp = Sp - 4;
         jump block_c7xh_entry ();

This is mentioned at the very top of cmm-notes.

Sp is generally stupid

Here is an optimized C-- sample from arr016.hs.

Main.D:Arbitrary_entry()
        { [const 131084;, const 0;, const 15;]
        }
    c7J5:
        _B1::I32 = I32[Sp + 4];
        _B2::I32 = I32[Sp + 0];
        if ((Sp + 0) < I32[BaseReg + 84]) goto u7Jf;
        Sp = Sp + 12;
        Hp = Hp + 12;
        if (Hp > I32[BaseReg + 92]) goto c7Jc;
        I32[Hp - 8] = Main.D:Arbitrary_con_info;
        I32[Hp - 4] = _B2::I32;
        I32[Hp + 0] = _B1::I32;
        _c7J4::I32 = Hp - 7;
        R1 = _c7J4::I32;
        Sp = Sp - 4;
        jump (I32[Sp + 0]) ();
    u7Jf:
        Sp = Sp + 12;
        goto c7Jd;
    c7Jc:
        I32[BaseReg + 112] = 12;
        goto c7Jd;
    c7Jd:
        R1 = Main.D:Arbitrary_closure;
        Sp = Sp - 12;
        jump (I32[BaseReg - 4]) ();
}

Compare with the old code:

Main.D:Arbitrary_entry()
        { [const 131084;, const 0;, const 15;]
        }
    c4pX:
        Hp = Hp + 12;
        if (Hp > I32[BaseReg + 92]) goto c4pU;
        I32[Hp - 8] = Main.D:Arbitrary_con_info;
        I32[Hp - 4] = I32[Sp + 0];
        I32[Hp + 0] = I32[Sp + 4];
        R1 = Hp - 7;
        Sp = Sp + 8;
        jump (I32[Sp + 0]) ();
    c4pV:
        R1 = Main.D:Arbitrary_closure;
        jump (I32[BaseReg - 4]) ();
    c4pU:
        I32[BaseReg + 112] = 12;
        goto c4pV;
}

There are a lot of things wrong

  • We do an unnecessary stack check on entry to this function
  • Sp should be bumped before the stack check (but we need this fishy code due to ncg spilling before the check)
  • Sp is getting bumped too much, and then being adjusted back down again

This pattern essentially happens for every function, since we always assign incoming parameters to temporary variables before doing anything.