Changes between Version 29 and Version 30 of Commentary/Compiler/NewCodeGenStupidity


Ignore:
Timestamp:
Jan 25, 2012 2:41:22 PM (3 years ago)
Author:
simonmar
Comment:

rearraged to put open issues at the top

Legend:

Unmodified
Added
Removed
Modified
  • Commentary/Compiler/NewCodeGenStupidity

    v29 v30  
    22 
    33Presently compiling using the new code generator results in a fairly sizable performance hit, because the new code generator produces sub-optimal (and sometimes absolutely terrible code.) There are [http://darcs.haskell.org/ghc/compiler/cmm/cmm-notes a lot of ideas for how to make things better]; the idea for this wiki page is to document all of the stupid things the new code generator is doing, to later be correlated with specific refactorings and fixes that will hopefully eliminate classes of these stupid things. The hope here is to develop a sense for what the most endemic problems with the newly generated code is. 
    4  
    5 == Lots of temporary variables == 
    6  
    7 WONTFIX. Lots of temporary variables (these can tickle other issues when the temporaries are long-lived, but otherwise would be optimized away). You can at least eliminate some of them by looking at the output of `-ddump-opt-cmm`, which utilizes some basic temporary inlining when used with the native backend `-fasm`, but this doesn't currently apply to the GCC or LLVM backends. 
    8  
    9 ~~At least one major culprit for this is `allocDynClosure`, described in Note `Return a LocalReg`; this pins down the value of the `CmmExpr` to be something for one particular time, but for a vast majority of use-cases the expression is used immediately afterwards. Actually, this is mostly my patches fault, because the extra rewrite means that the inline pass is broken.~~ Fixed in latest version of the pass; we don't quite manage to inline enough but there's only one extra temporary. 
    10  
    11 Another cause of all of these temporary variables is that the new code generator immediately assigns any variables that were on the stack to temporaries immediately upon entry to a function. This is on purpose. The idea is we optimize these temporary variables away. 
    12  
    13 == Rewriting stacks == 
    14  
    15 FIXED. `3586.hs` emits the following code: 
    16  
    17 {{{ 
    18  Main.$wa_entry() 
    19          { [const Main.$wa_slow-Main.$wa_info;, const 3591;, const 0;, 
    20     const 458752;, const 0;, const 15;] 
    21          } 
    22      c17W: 
    23          _s16B::F64 = F64[Sp + 20]; 
    24          F64[Sp - 8] = _s16B::F64; 
    25          _s16h::I32 = I32[Sp + 16]; 
    26          _s16j::I32 = I32[Sp + 12]; 
    27          _s16y::I32 = I32[Sp + 8]; 
    28          _s16x::I32 = I32[Sp + 4]; 
    29          _s16w::I32 = I32[Sp + 0]; 
    30          if (Sp - 12 < SpLim) goto u1bR; 
    31          Sp = Sp + 32; 
    32 // [SNIP] 
    33          // directEntry else 
    34          // emitCall: Sequel: Return 
    35          F64[Sp - 12] = _s17a::F64; 
    36          I32[Sp - 16] = _s17b::I32; 
    37          I32[Sp - 20] = _s16j::I32; 
    38          I32[Sp - 24] = _s16y::I32; 
    39          I32[Sp - 28] = _s16x::I32; 
    40          I32[Sp - 32] = _s16w::I32; 
    41          Sp = Sp - 32; 
    42          jump Main.$wa_info (); 
    43      u1bR: 
    44          Sp = Sp + 32; 
    45          // outOfLine here 
    46          R1 = Main.$wa_closure; 
    47          F64[Sp - 12] = _s16B::F64; 
    48          I32[Sp - 16] = _s16h::I32; 
    49          I32[Sp - 20] = _s16j::I32; 
    50          I32[Sp - 24] = _s16y::I32; 
    51          I32[Sp - 28] = _s16x::I32; 
    52          I32[Sp - 32] = _s16w::I32; 
    53          Sp = Sp - 32; 
    54          jump stg_gc_fun (); 
    55 }}} 
    56  
    57 We see that these temporary variables are being repeatedly rewritten to the stack, even when there are no changes. 
    58  
    59 Since these areas on the stack are all old call areas, one way to fix this is to inline all of the memory references. However, this has certain undesirable properties for other code, so we need to be a little more clever. The key thing to notice is that these accesses are only used once per control flow path, in which case sinking the loads down and then inlining them should be OK (it will increase code size but not execution time.) However, the other difficulty is that the CmmOpt inliner, as it stands, won't inline things that look like this because although the variable is only used once in different branches, the same name is used, so it can't distinguish between the temporaries with mutually exclusive live ranges. Building a more clever inliner with Hoopl is also a bit tricky, because inlining is a forward analysis/transformation, but usage counting is a backwards analysis. 
    60  
    61 This looks fixed with the patch from April 14. 
    62  
    63 == Spilling Hp/Sp == 
    64  
    65 FIXED. `3586.hs` emits the following code: 
    66  
    67 {{{ 
    68      _c1ao::I32 = Hp - 4; 
    69      I32[Sp - 20] = _c1ao::I32; 
    70      foreign "ccall" 
    71        newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; 
    72      _c1ao::I32 = I32[Sp - 20]; 
    73 }}} 
    74  
    75 We see `Hp - 4` being allocated to a temp, and then consequently being spilled to the stack even though `newCAF` definitely will not change `Hp`, so we could have floated the expression down. 
    76  
    77 This seems to happen whenever there's a `newCAF` ccall. 
    78  
    79 We also seem to reload these values multiple times. 
    80  
    81 {{{ 
    82         _c7Yt::I32 = Hp - 4; 
    83         I32[Sp - 28] = _c7Yt::I32; 
    84         foreign "ccall" 
    85           newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; 
    86         _c7Yt::I32 = I32[Sp - 28]; 
    87         I32[R1 + 4] = _c7Yt::I32; 
    88         I32[R1] = stg_IND_STATIC_info; 
    89         _c7Yt::I32 = I32[Sp - 28];  <--- totally unnecessary 
    90         I32[Sp - 8] = _c7Yt::I32; 
    91         I32[Sp - 12] = stg_upd_frame_info; 
    92 }}} 
    93  
    94 ~~We need to not spill across certain foreign calls, but for which calls this is OK for is unclear.~~ Variables stay live across all unsafe foreign calls (foreign calls in the middle), except for the obvious cases (the return registers), so no spilling should happen at all. The liveness analysis is too conservative. 
    95  
    96 This is not fixed in the April 14 version of the patch... we still need to fix the liveness analysis? I thought I fixed that... that's because the transform did extra spilling for CmmUnsafeForeignCalls. Removed that code, and now it's fixed. 
    97  
    98 == Up and Down == 
    99  
    100 FIXED. A frequent pattern is the stack pointer being bumped up and then back down again, for no particular reason.  
    101  
    102 {{{ 
    103          Sp = Sp + 4; 
    104          Sp = Sp - 4; 
    105          jump block_c7xh_entry (); 
    106 }}} 
    107  
    108 This is mentioned at the very top of `cmm-notes`. This was a bug in the stack layout code that I have fixed. 
    109  
    110 == Sp is generally stupid == 
    111  
    112 FIXED. Here is an optimized C-- sample from `arr016.hs`. 
    113  
    114 {{{ 
    115 Main.D:Arbitrary_entry() 
    116         { [const 131084;, const 0;, const 15;] 
    117         } 
    118     c7J5: 
    119         _B1::I32 = I32[Sp + 4]; 
    120         _B2::I32 = I32[Sp + 0]; 
    121         if ((Sp + 0) < I32[BaseReg + 84]) goto u7Jf; 
    122         Sp = Sp + 12; 
    123         Hp = Hp + 12; 
    124         if (Hp > I32[BaseReg + 92]) goto c7Jc; 
    125         I32[Hp - 8] = Main.D:Arbitrary_con_info; 
    126         I32[Hp - 4] = _B2::I32; 
    127         I32[Hp + 0] = _B1::I32; 
    128         _c7J4::I32 = Hp - 7; 
    129         R1 = _c7J4::I32; 
    130         Sp = Sp - 4; 
    131         jump (I32[Sp + 0]) (); 
    132     u7Jf: 
    133         Sp = Sp + 12; 
    134         goto c7Jd; 
    135     c7Jc: 
    136         I32[BaseReg + 112] = 12; 
    137         goto c7Jd; 
    138     c7Jd: 
    139         R1 = Main.D:Arbitrary_closure; 
    140         Sp = Sp - 12; 
    141         jump (I32[BaseReg - 4]) (); 
    142 } 
    143 }}} 
    144  
    145 Compare with the old code: 
    146  
    147 {{{ 
    148 Main.D:Arbitrary_entry() 
    149         { [const 131084;, const 0;, const 15;] 
    150         } 
    151     c4pX: 
    152         Hp = Hp + 12; 
    153         if (Hp > I32[BaseReg + 92]) goto c4pU; 
    154         I32[Hp - 8] = Main.D:Arbitrary_con_info; 
    155         I32[Hp - 4] = I32[Sp + 0]; 
    156         I32[Hp + 0] = I32[Sp + 4]; 
    157         R1 = Hp - 7; 
    158         Sp = Sp + 8; 
    159         jump (I32[Sp + 0]) (); 
    160     c4pV: 
    161         R1 = Main.D:Arbitrary_closure; 
    162         jump (I32[BaseReg - 4]) (); 
    163     c4pU: 
    164         I32[BaseReg + 112] = 12; 
    165         goto c4pV; 
    166 } 
    167 }}} 
    168  
    169 You can see the up and down behavior here, but that's been fixed, so ignore it for now. (Update the C--!) The unfixed problem is this (some of the other problems were already addressed): we do an unnecessary stack check on entry to this function. We should eliminate the stack check (and by dead code analysis, the GC call) in such cases. 
    170  
    171 This pattern essentially happens for every function, since we always assign incoming parameters to temporary variables before doing anything. 
    1724 
    1735== Instruction reordering == 
     
    22759 
    22860After I discussed this with SPJ, we've decided that we need to teach the stack layout how to handle partial conflicts. There is a complication here, in that if we do this naively, the interference graph will blow up (since, rather than conflicting call areas, we now have conflicting words of call areas.) Simon suggested that we bound the amount of conflicts we track: either up to 3 or conflict with everything (in which case we just place the area as far down as necessary rather than try to be clever.) I plan on doing this once I understand the current layout code... 
    229  
    230 == Heap and R1 aliasing == 
    231  
    232 FIXED. Values on the heap and values from R1 don't necessarily clobber 
    233 each other.  allocDynClosure seems like a pretty safe bet they 
    234 don't.  But is this true in general? ANSWER: Memory writes with 
    235 Hp are always new allocations, so they don't clobber anything. 
    236  
    237 {{{ 
    238         _s14Y::F64 = F64[_s1uN::I32 + 8]; 
    239         _s152::F64 = F64[_s1uN::I32 + 16]; 
    240         _s156::F64 = F64[_s1uN::I32 + 24]; 
    241         _s15a::F64 = F64[_s1uN::I32 + 32]; 
    242         _s15e::F64 = F64[_s1uN::I32 + 40]; 
    243         _s15i::F64 = F64[_s1uN::I32 + 48]; 
    244         _s15m::F64 = F64[_s1uN::I32 + 56]; 
    245         _c39O::I32 = Hp - 60; 
    246         // calling allocDynClosure 
    247         // allocDynClosure 
    248         I32[Hp - 60] = sat_s1uQ_info; 
    249         F64[Hp - 52] = _s14Y::F64; 
    250         F64[Hp - 44] = _s152::F64; 
    251         F64[Hp - 36] = _s156::F64; 
    252         F64[Hp - 28] = _s15a::F64; 
    253         F64[Hp - 20] = _s15e::F64; 
    254         F64[Hp - 12] = _s15i::F64; 
    255         F64[Hp - 4] = _s15m::F64; 
    256 }}} 
    25761 
    25862== Double temp-use means no inlinining? == 
     
    429233 
    430234We generate an extra proc-point for ``cmM``, where in theory we ought to be able to stick the subsequent ``stg_ap_pp_fast`` onto the stack as another return point. 
     235 
     236== Lots of temporary variables == 
     237 
     238WONTFIX. Lots of temporary variables (these can tickle other issues when the temporaries are long-lived, but otherwise would be optimized away). You can at least eliminate some of them by looking at the output of `-ddump-opt-cmm`, which utilizes some basic temporary inlining when used with the native backend `-fasm`, but this doesn't currently apply to the GCC or LLVM backends. 
     239 
     240~~At least one major culprit for this is `allocDynClosure`, described in Note `Return a LocalReg`; this pins down the value of the `CmmExpr` to be something for one particular time, but for a vast majority of use-cases the expression is used immediately afterwards. Actually, this is mostly my patches fault, because the extra rewrite means that the inline pass is broken.~~ Fixed in latest version of the pass; we don't quite manage to inline enough but there's only one extra temporary. 
     241 
     242Another cause of all of these temporary variables is that the new code generator immediately assigns any variables that were on the stack to temporaries immediately upon entry to a function. This is on purpose. The idea is we optimize these temporary variables away. 
     243 
     244== Rewriting stacks == 
     245 
     246FIXED. `3586.hs` emits the following code: 
     247 
     248{{{ 
     249 Main.$wa_entry() 
     250         { [const Main.$wa_slow-Main.$wa_info;, const 3591;, const 0;, 
     251    const 458752;, const 0;, const 15;] 
     252         } 
     253     c17W: 
     254         _s16B::F64 = F64[Sp + 20]; 
     255         F64[Sp - 8] = _s16B::F64; 
     256         _s16h::I32 = I32[Sp + 16]; 
     257         _s16j::I32 = I32[Sp + 12]; 
     258         _s16y::I32 = I32[Sp + 8]; 
     259         _s16x::I32 = I32[Sp + 4]; 
     260         _s16w::I32 = I32[Sp + 0]; 
     261         if (Sp - 12 < SpLim) goto u1bR; 
     262         Sp = Sp + 32; 
     263// [SNIP] 
     264         // directEntry else 
     265         // emitCall: Sequel: Return 
     266         F64[Sp - 12] = _s17a::F64; 
     267         I32[Sp - 16] = _s17b::I32; 
     268         I32[Sp - 20] = _s16j::I32; 
     269         I32[Sp - 24] = _s16y::I32; 
     270         I32[Sp - 28] = _s16x::I32; 
     271         I32[Sp - 32] = _s16w::I32; 
     272         Sp = Sp - 32; 
     273         jump Main.$wa_info (); 
     274     u1bR: 
     275         Sp = Sp + 32; 
     276         // outOfLine here 
     277         R1 = Main.$wa_closure; 
     278         F64[Sp - 12] = _s16B::F64; 
     279         I32[Sp - 16] = _s16h::I32; 
     280         I32[Sp - 20] = _s16j::I32; 
     281         I32[Sp - 24] = _s16y::I32; 
     282         I32[Sp - 28] = _s16x::I32; 
     283         I32[Sp - 32] = _s16w::I32; 
     284         Sp = Sp - 32; 
     285         jump stg_gc_fun (); 
     286}}} 
     287 
     288We see that these temporary variables are being repeatedly rewritten to the stack, even when there are no changes. 
     289 
     290Since these areas on the stack are all old call areas, one way to fix this is to inline all of the memory references. However, this has certain undesirable properties for other code, so we need to be a little more clever. The key thing to notice is that these accesses are only used once per control flow path, in which case sinking the loads down and then inlining them should be OK (it will increase code size but not execution time.) However, the other difficulty is that the CmmOpt inliner, as it stands, won't inline things that look like this because although the variable is only used once in different branches, the same name is used, so it can't distinguish between the temporaries with mutually exclusive live ranges. Building a more clever inliner with Hoopl is also a bit tricky, because inlining is a forward analysis/transformation, but usage counting is a backwards analysis. 
     291 
     292This looks fixed with the patch from April 14. 
     293 
     294== Spilling Hp/Sp == 
     295 
     296FIXED. `3586.hs` emits the following code: 
     297 
     298{{{ 
     299     _c1ao::I32 = Hp - 4; 
     300     I32[Sp - 20] = _c1ao::I32; 
     301     foreign "ccall" 
     302       newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; 
     303     _c1ao::I32 = I32[Sp - 20]; 
     304}}} 
     305 
     306We see `Hp - 4` being allocated to a temp, and then consequently being spilled to the stack even though `newCAF` definitely will not change `Hp`, so we could have floated the expression down. 
     307 
     308This seems to happen whenever there's a `newCAF` ccall. 
     309 
     310We also seem to reload these values multiple times. 
     311 
     312{{{ 
     313        _c7Yt::I32 = Hp - 4; 
     314        I32[Sp - 28] = _c7Yt::I32; 
     315        foreign "ccall" 
     316          newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; 
     317        _c7Yt::I32 = I32[Sp - 28]; 
     318        I32[R1 + 4] = _c7Yt::I32; 
     319        I32[R1] = stg_IND_STATIC_info; 
     320        _c7Yt::I32 = I32[Sp - 28];  <--- totally unnecessary 
     321        I32[Sp - 8] = _c7Yt::I32; 
     322        I32[Sp - 12] = stg_upd_frame_info; 
     323}}} 
     324 
     325~~We need to not spill across certain foreign calls, but for which calls this is OK for is unclear.~~ Variables stay live across all unsafe foreign calls (foreign calls in the middle), except for the obvious cases (the return registers), so no spilling should happen at all. The liveness analysis is too conservative. 
     326 
     327This is not fixed in the April 14 version of the patch... we still need to fix the liveness analysis? I thought I fixed that... that's because the transform did extra spilling for CmmUnsafeForeignCalls. Removed that code, and now it's fixed. 
     328 
     329== Up and Down == 
     330 
     331FIXED. A frequent pattern is the stack pointer being bumped up and then back down again, for no particular reason.  
     332 
     333{{{ 
     334         Sp = Sp + 4; 
     335         Sp = Sp - 4; 
     336         jump block_c7xh_entry (); 
     337}}} 
     338 
     339This is mentioned at the very top of `cmm-notes`. This was a bug in the stack layout code that I have fixed. 
     340 
     341== Sp is generally stupid == 
     342 
     343FIXED. Here is an optimized C-- sample from `arr016.hs`. 
     344 
     345{{{ 
     346Main.D:Arbitrary_entry() 
     347        { [const 131084;, const 0;, const 15;] 
     348        } 
     349    c7J5: 
     350        _B1::I32 = I32[Sp + 4]; 
     351        _B2::I32 = I32[Sp + 0]; 
     352        if ((Sp + 0) < I32[BaseReg + 84]) goto u7Jf; 
     353        Sp = Sp + 12; 
     354        Hp = Hp + 12; 
     355        if (Hp > I32[BaseReg + 92]) goto c7Jc; 
     356        I32[Hp - 8] = Main.D:Arbitrary_con_info; 
     357        I32[Hp - 4] = _B2::I32; 
     358        I32[Hp + 0] = _B1::I32; 
     359        _c7J4::I32 = Hp - 7; 
     360        R1 = _c7J4::I32; 
     361        Sp = Sp - 4; 
     362        jump (I32[Sp + 0]) (); 
     363    u7Jf: 
     364        Sp = Sp + 12; 
     365        goto c7Jd; 
     366    c7Jc: 
     367        I32[BaseReg + 112] = 12; 
     368        goto c7Jd; 
     369    c7Jd: 
     370        R1 = Main.D:Arbitrary_closure; 
     371        Sp = Sp - 12; 
     372        jump (I32[BaseReg - 4]) (); 
     373} 
     374}}} 
     375 
     376Compare with the old code: 
     377 
     378{{{ 
     379Main.D:Arbitrary_entry() 
     380        { [const 131084;, const 0;, const 15;] 
     381        } 
     382    c4pX: 
     383        Hp = Hp + 12; 
     384        if (Hp > I32[BaseReg + 92]) goto c4pU; 
     385        I32[Hp - 8] = Main.D:Arbitrary_con_info; 
     386        I32[Hp - 4] = I32[Sp + 0]; 
     387        I32[Hp + 0] = I32[Sp + 4]; 
     388        R1 = Hp - 7; 
     389        Sp = Sp + 8; 
     390        jump (I32[Sp + 0]) (); 
     391    c4pV: 
     392        R1 = Main.D:Arbitrary_closure; 
     393        jump (I32[BaseReg - 4]) (); 
     394    c4pU: 
     395        I32[BaseReg + 112] = 12; 
     396        goto c4pV; 
     397} 
     398}}} 
     399 
     400You can see the up and down behavior here, but that's been fixed, so ignore it for now. (Update the C--!) The unfixed problem is this (some of the other problems were already addressed): we do an unnecessary stack check on entry to this function. We should eliminate the stack check (and by dead code analysis, the GC call) in such cases. 
     401 
     402This pattern essentially happens for every function, since we always assign incoming parameters to temporary variables before doing anything. 
     403 
     404== Heap and R1 aliasing == 
     405 
     406FIXED. Values on the heap and values from R1 don't necessarily clobber 
     407each other.  allocDynClosure seems like a pretty safe bet they 
     408don't.  But is this true in general? ANSWER: Memory writes with 
     409Hp are always new allocations, so they don't clobber anything. 
     410 
     411{{{ 
     412        _s14Y::F64 = F64[_s1uN::I32 + 8]; 
     413        _s152::F64 = F64[_s1uN::I32 + 16]; 
     414        _s156::F64 = F64[_s1uN::I32 + 24]; 
     415        _s15a::F64 = F64[_s1uN::I32 + 32]; 
     416        _s15e::F64 = F64[_s1uN::I32 + 40]; 
     417        _s15i::F64 = F64[_s1uN::I32 + 48]; 
     418        _s15m::F64 = F64[_s1uN::I32 + 56]; 
     419        _c39O::I32 = Hp - 60; 
     420        // calling allocDynClosure 
     421        // allocDynClosure 
     422        I32[Hp - 60] = sat_s1uQ_info; 
     423        F64[Hp - 52] = _s14Y::F64; 
     424        F64[Hp - 44] = _s152::F64; 
     425        F64[Hp - 36] = _s156::F64; 
     426        F64[Hp - 28] = _s15a::F64; 
     427        F64[Hp - 20] = _s15e::F64; 
     428        F64[Hp - 12] = _s15i::F64; 
     429        F64[Hp - 4] = _s15m::F64; 
     430}}} 
     431