Version 30 (modified by simonpj, 7 years ago) (diff)


Design of the new code generator

This page contains notes about the design of the new code generator. See also: overview of the module structure in the new code generator.


Code generation now has three stages:

  1. Convert STG to Cmm, with implicit stack implicit, and native Cmm calls.
  2. Optimise the Cmm, and CPS-convert it to have an explicit stack, and no native calls. This part of the pipeline is stitched together in cmm/CmmCPSZ.hs.
  3. Feed the CPS-converted Cmm to the existing, unmodified native code generators.

Ultimately our plan is to expand the capability of the new pipeline so that it does native code generation too, and we can ultimately discard the existing code generators. The design of this stage is here: Commentary/Compiler/IntegratedCodeGen

The Cmm pipeline

The first two steps are described in more detail here:

  • Code generator converts STG to CmmGraph. Implemented in StgCmm* modules (in directory codeGen).
    • Parameter passing is made explicit. Parameters are passed in virtual registers R1, R2 etc. Overflow parameters are passed on the stack using explicit memory stores, to locations described abstractly using the ''Stack Area'' abstraction..
    • That includes a store of the return address, which is stored explicitly on the stack in the same way as overflow parameters.
    • No CopyIn, CopyOut nodes any more; instead "smart constructors" lower the calling convention to loads/stores/register transfers, using stack area abstraction.
    • But we still have LastCall, LastReturn, LastBranch, LastJump as Last nodes.
  • Simple control flow optimisation, implemented in CmmContFlowOpt:
    • Branch chain elimination.
    • Remove unreachable blocks.
    • Block concatenation. branch to K; and this is the only use of K.
    • Common Block Elimination (like CSE). This essentially implements the Adams optimisation, we believe.
    • Consider (sometime): block duplication. branch to K; and K is a short block. Branch chain elimination is just a special case of this.
  • Proc-point analysis and transformation, implemented in CmmProcPointZ. (Adams version is CmmProcPoint.) The transformation part adds a function prologue to the front of each proc-point, following a standard entry convention.
    • The analysis produces a set of BlockId that should become proc-points
    • The transformation inserts a function prologue at the start of each proc-point, and a function epilogue just before each branch to a proc-point.
  • Add spill/reload, implemented in CmmSpillReload, to spill live C-- variables before a call and reload them afterwards. The spill and reload instructions are simply memory stores and loads respectively, using symbolic stack offsets (see stack layout). For example, a spill of variable 'x' would look like Ptr32[SS(x)] = x.
  • Figure out the stack layout, implemented in CmmStackLayout.
    • Each variable 'x', and each proc-point label 'K', has an associated Area, written SS(x) and SS(k) resp, that names a contiguous portion of the stack frame.
    • The stack layout pass produces a mapping of: (Area -> StackOffset). For more detail, see the description of stack layout.
    • A StackOffset is the byte offset of a stack slot from the old end (high address) of the frame. It doesn't vary as the physical stack pointer moves.
  • Manifest the stack pointer, implemented in CmmStackLayout. Once the stack layout mapping has been determined, a second pass walks over the graph, making the stack pointer, Sp explicit. Before this pass, there is no Sp at all. After this, Sp is completely manifest.
    • replacing references to Areas with offsets from Sp.
    • adding adjustments to Sp.

  • Split into multiple CmmProcs, implemented in CmmProcPointZ. At this point we build an info-table for each of the CmmProcs, including SRTs. Done on the basis of the live local variables (by now mapped to stack slots) and live CAF statics.
    • LastCall and LastReturn nodes are replaced by Jumps.
  • Build info tables, implemented in CmmBuildInfoTables..
    • Find each safe MidForeignCall node, "lowers" it into the suspend/call/resume sequence (see Note [Foreign calls] in CmmNode.hs.), and build an info table for them.
    • Convert the CmmInfo for each CmmProc into a [CmmStatic], using the live variable information computed just before "Figure out stack layout".

Branches to continuations and the "Adams optimisation"

A GC block for a heap check after a call should only take one or two instructions. However the natural code:

    ...put params in R1 R2 etc...
    call foo returns to L
 L: r = R1   -- get return value
    goto M
 M: Hp = Hp + 20
    if (Hp > HpLim) { call do_gc returns to K;
                   K: goto M; }

The label M is the head of the call-gc-and-try-again loop. If we do this, we'll generate two info tables, one for L and one for K.

We can do better like this:

    ...put params in R1 R2 etc...
    call foo returns to L
 L: r = R1
    goto M
 M: Hp = Hp + 20
    if (Hp > HpLim) { R1 = r
                      call do_gc_p returns to K;
                   K: r = R1; goto M; }

Now the do_gc_p call has the same return signature as foo and can use the same continuation, thus:

    ...put params in R1 R2 etc...
    call foo returns to L
 L: r = R1
    goto M
 M: Hp = Hp + 20
    if (Hp > HpLim) { R1 = r
                      call do_gc_p returns to L }

(A call followed by a goto thus gets optimized down to just the call.)

Now things are good. Simple common block elimination (CBE) will common up K and L, so both calls share the same info table.

Runtime system

  • Garbage collector entry points: see Note [Heap checks] in StgCmmHeapery.
  • PAPs

  • Update frames and exception handling. Also STM frames.
  • Primitives can be rewritten:
    • Use parameters
    • In a few cases, use native calls (notably eval)