avoid redundant stores to the stack when examining already-tagged data
GHC compiles a function that performs case analysis on a value of an ADT like
```haskell
bool :: a -> a -> Bool -> a
bool f t b = case b of
  False -> f
  True  -> t
```
to Cmm of the form
```
{offset
  cwV:
      if ((Sp + -24) < SpLim) goto cwW; else goto cwX;
  cwW:
      R4 = R4;
      R3 = R3;
      R2 = R2;
      R1 = Bool.bool_closure;
      call (stg_gc_fun)(R4, R3, R2, R1) args: 8, res: 0, upd: 8;
  cwX:
      I64[Sp - 24] = cwL;    -- (*)
      R1 = R4;
      P64[Sp - 16] = R2;     -- (†1)
      P64[Sp - 8] = R3;      -- (†2)
      Sp = Sp - 24;          -- (‡)
      if (R1 & 7 != 0) goto cwL; else goto cwM;
  cwM:
      call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
  cwL:
      if (R1 & 7 >= 2) goto cwT; else goto cwU;
  cwT:
      R1 = P64[Sp + 16];
      Sp = Sp + 24;
      call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
  cwU:
      R1 = P64[Sp + 8];
      Sp = Sp + 24;
      call stg_ap_0_fast(R1) args: 8, res: 0, upd: 8;
}
```
Statement (*) stores a return address for the evaluation of `b` to return to, and statements (†1), (†2) save local variables that are live in the case alternatives, since they cannot be held in registers across the evaluation of `b`. But in the event that `b` is already evaluated and represented by a tagged pointer, all these stores are unnecessary: the return address written by (*) is simply dead, and the values saved by (†1), (†2) are still available in whatever locations they were copied to the stack from.
In many cases the data we examine is mostly tagged, and while the active part of the stack is likely to be in L1 cache, the cost of these stores and reads is probably still positive (barring secondary effects from changes to pipelining, branch prediction, and so on).
In this case we could certainly move the return-address store (*) into block `cwM`, and possibly move the local-variable stores (†1), (†2) into `cwM` as well, though it's then not clear to me how to recover the values in the alternatives (does Cmm have something like phi nodes?). I don't propose moving statement (‡), since arithmetic on registers is essentially free anyway.
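For concreteness, here is a hand-written sketch (not compiler output) of what `cwX` and `cwM` could look like with just the return-address store (*) sunk into `cwM`; since the store would now happen after the Sp adjustment (‡), its offset changes from `Sp - 24` to `Sp`:

```
  cwX:
      R1 = R4;
      P64[Sp - 16] = R2;     -- (†1) still stored unconditionally in this sketch
      P64[Sp - 8] = R3;      -- (†2)
      Sp = Sp - 24;          -- (‡)
      if (R1 & 7 != 0) goto cwL; else goto cwM;
  cwM:
      I64[Sp] = cwL;         -- (*) moved: only written when b must be evaluated
      call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
```

On the tagged path the return-address slot is simply never written; the frame layout and the reads at `Sp + 8` and `Sp + 16` in the alternatives are unchanged.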
I tried implementing the part of this pertaining to the return address (*) and ran into two complications.
- For some reason, when I moved the return-address store (*) into the "data is not tagged" branch in the Stg->Cmm translation, this also resulted in both the local-variable stores (†1), (†2) and the update to Sp (‡) being sunk into both branches of the "is the data tagged" conditional at some point in the Cmm optimization pipeline (roughly the shape shown in the first sketch after this list). This was useless, since they couldn't be pushed further past the branch on the returned tag value, so the result was enlarged code size that outweighed the savings of avoiding a single store. I didn't investigate exactly why this sinking depended on the location of the store (*), but it should be fixable.
- There may be heap checks in the alternatives. In that case, the code generator currently cleverly reuses the stack frame and info table set up for the evaluation of `b` in the heap-failure branches (roughly as in the second sketch after this list). If we move some of the stores (*), (†1), (†2) into the evaluation branch `cwM`, then we either have to duplicate them in the heap-failure branches, set up a new stack frame and info table, or do some other clever thing. Or, in the worst case, only do this optimization when performing the heap check before the case (which may then become slightly more attractive).
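To illustrate the first complication, this is roughly the shape I believe the sinking produced (a hand-written sketch, not a dump I've re-checked; the label `cwN` is invented here):

```
  cwX:
      R1 = R4;
      if (R1 & 7 != 0) goto cwN; else goto cwM;
  cwM:
      I64[Sp - 24] = cwL;    -- (*) as placed by the patch
      P64[Sp - 16] = R2;     -- (†1) sunk into this branch
      P64[Sp - 8] = R3;      -- (†2) sunk into this branch
      Sp = Sp - 24;          -- (‡) sunk into this branch
      call (I64[R1])(R1) returns to cwL, args: 8, res: 8, upd: 8;
  cwN:                       -- invented label for the tagged path
      P64[Sp - 16] = R2;     -- (†1) duplicated here as well
      P64[Sp - 8] = R3;      -- (†2) duplicated
      Sp = Sp - 24;          -- (‡) duplicated
      goto cwL;
```

The stores still have to happen on both paths because `cwL` reads the saved values from the stack, so the duplication buys nothing.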
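And to illustrate the second complication: if an alternative allocated (the `bool` example doesn't, so this is hypothetical), its heap-check failure branch would currently reuse the frame laid down in `cwX`, along the lines of the following sketch (labels invented; the exact GC entry point may differ):

```
  cwT:                       -- hypothetical: an alternative that allocates 16 bytes
      Hp = Hp + 16;
      if (Hp > HpLim) goto cwHF; else goto cwOK;
  cwHF:
      HpAlloc = 16;
      -- Relies on the return address (*) and the saves (†1), (†2) already being
      -- on the stack, so after GC we return through cwL's info table and re-run
      -- the alternative.  If those stores only happen on the "not tagged" path,
      -- this frame is incomplete when we arrive here via the tagged fast path.
      call stg_gc_unpt_r1(R1) returns to cwL, args: 8, res: 8, upd: 8;
```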
I'm attaching the current version of my patch mainly for my own future reference; it seems to produce correct, but larger and marginally slower code, I believe for the reasons described above.
Trac metadata
| Trac field | Value |
| --- | --- |
| Version | 7.11 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler (CodeGen) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | simonmar |
| Operating system | |
| Architecture | |