wiki:Commentary/Compiler/BackendOpt

Planned Backend Optimizations

This page gathers together a list of ideas and plans for improving the backend of the compiler. Any progress made will also be noted here.

Removal of Proc Point Splitting for LLVM

The reason we needed proc-point splitting for the LLVM backend is detailed here.

Rationale

Ideally, we would expose an entire Cmm function to LLVM for better optimization and machine code generation, instead of breaking the functions apart just to grab a valid label. The proc-point splitting is also a rather ugly transformation we would like to remove from the pipeline.

Challenges

The address of a basic block is not a well-defined concept outside of an LLVM function. There has been no movement on this in LLVM due to the key assumption made by LLVM's IR that all basic blocks have known predecessors, forming a complete control-flow graph amongst blocks. An escaping block label opens up the possibility of an edge from an unknown location. This is why an indirect branch instruction must conservatively list all possible block targets.

Proposal

We propose to extend LLVM, slightly, to support "CPS calls". The initial proposal, to the LLVM community, can be found here. What follows is a somewhat revised version of that proposal.

Implementation Progress

After recieving a helpful suggestion from Ried on the LLVM mailing list, I've extended LLVM with an experimental CPS call. To help understand how it works, let's consider this Core program:

f x = if x != 0#
      then I# (x-1)   -- This branch allocates
      else foo x

This would generate roughly the following Post CPS Cmm code:

fEntry:
  if x != 0 goto G; else goto J

G:
  if Hp+10 >= HpLim goto needGC; else goto R

needGC:
  I64[SP + 8] = x
  I64[SP] = afterGC             // write block address to memory
  call doGC returns to afterGC  // a non-tail call, x is live

afterGC:
  x = I64[SP + 8]
  goto G

J:
  call foo( x )  // a tail call

R:
  x' = x - 1
  ...allocate (I# x') and return...

afterGC is a return point (aka, proc-point) from a non-tail call. Previously, we would need to split this block out into its own function, and the call would look like the following in LLVM:

define @fEntry (...) {
  entry:
    if x != 0 goto callG; else goto J

  callG:
    tail call G(...)

  J:
    tail call foo ( x )
}
define @G (...) {
  G_blk:
    if Hp+10 >= HpLim goto needGC; else goto R

  needGC:
    SP[8] = x
    SP[0] = @afterGC   // the call returns to afterGC
    tail call @doGC(... SP ...)

  R:
     ...
}

define @afterGC (...) {
  ...
  tail call @G (...)
}  

This is some rather ugly-looking and hard-to-optimize LLVM code, and it is all because we must take the address of the function @afterGC and store it on the stack SP to facilitate a return, which is also performed with a tail call:

define @doGC(... SP ...) {
  returnBlk:
    %returnAddr = SP[0]
    tail call %returnAddr (...)
}

We can avoid the messy LLVM code if we use the proposed llvm.experimental.cpscall intrinsic.

The CPS Call Intrinsic

The design of this intrinsic relies on a fundamental property of programs in continuation-passing style: every call carries in its arguments every live value at that program point. For example, when calling doGC, we pass the continuation we captured (i.e, SP, a continuation that resumes at afterGC) as an argument, along with other runtime values like the Base pointer, etc. When performing a return in CPS, we invoke the continuation SP by jumping to the address at the top of SP, and pass to it SP and the return values as arguments.

Now, to do this properly in LLVM, we need a way to capture a continuation. The live values of the continuation can be saved to a stack frame using ordinary memory writes, but we need a way to describe *where* to continue execution while in in the middle of an LLVM function. Our solution is to create a new LLVM intrinsic cpscall that *implicitly* will write the return address to the given SP when performing a function call.

To be more specific: in GHC, we first omit any uses of a block address in a Cmm assignment, since that usage is undefined in LLVM. Thus, the line I64[Sp] = afterGC in our Cmm example is not generated by GHC. Instead, the cpscall intrinsic will take care of writing the return address to our explicit stack pointer, SP, when we emit assembly code. The Cmm call to doGC would translate to the following LLVM code:

%Slot = inttoptr (%Sp + 8)
store i64* %x, i64** %Slot
%retVals = call ghccc {i64*, i64*, ...}
              @llvm.experimental.cpscall(
                @doGC,
                ID,
                RA_OFFSET,
                SP_ARGNUM,
                %Base, %Sp, ...)
%Sp1 = extractvalue %retVals, 0
%Base1 = extractvalue %retVals, 1
br label %afterGC

afterGC:
  %Slot1 = inttoptr (%Sp1 + 8)
  %x1 = load i64*, i64** %Slot1
  ... more uses of %Sp1, %Base1, %x1, etc ...

In the example above, the cpscall will invoke the given function, @doGC and pass the given arguments %Base, %Sp ... as specified by the ghccc convention. The cpscall also opens a point within the function that allows all of the live values needed by afterGC to be returned in the correct registers, by using a struct and updated ghccc calling convention. The extractvalue instructions are standard struct accessors.

The importance of the "all live values are passed" property comes into play here, because we know that there are no SSA-values that are live across the cpscall, by construction. Thus, we know that LLVM does not need to save any values to its internal stack, i.e., the call stack that is implicit in LLVM where allocas and other values are automatically stored. If LLVM *were* to save values to its internal stack, we would not have a complete continuation in the explicit Sp argument, and would not be able to find the value at runtime.

A few notes about the other arguments to the intrinsic:

  • SP_ARGNUM is a constant integer that describes the offset within the argument list to the cpscall where the explicit stack pointer is passed as an argument to the call. In this example, it would be 1, since we start counting from Base.
  • RA_OFFSET is the constant integer representing a byte offset from the explicit stack pointer where the return address should be written. It is typically 0 but not always (e.g., over-saturated calls and update frames).
  • ID is an arbitrary constant-integer value that is used for the purposes of identifying the return point to help the garbage collector understand the stack frame that was captured at that point.

Suppose RA_OFFSET = 28 and SP_ARGNUM = 1 To understand the final assembly code output of the cpscall intrinsic, the last piece we must consider is the calling convention used. In this the above example, the calling convention is ghccc, so the 2nd integer register (the one used for Sp, which is "argument number" 1) onx86-64 is %rbp. Thus, llc will output the following x86-64 assembly code for the cpscall:

    leaq   returnPt_ID(%rip),  SCRATCH_REG
    movq  SCRATCH_REG,   28(%rbp)       // I64[Sp + RA_OFFSET] = returnPt_ID
    jmp     _doGC

returnPt_ID:
    // code to execute afterGC is here

The details of how the cpscall is carried through LLVM's backend to arrive at the assembly code above is rather detailed and subject to change. The gist of it is that we thread the cpscall through instruction selection as a pseudo-op, and then expand that pseudo-op to LLVM Machine Instructions by splitting the block, etc.


Tuning LLVM IR Passes

The optimization pass sequence in LLVM is well tested for languages like C/C++, but not Haskell. See #11295 for details and progress updates.

Improving Heap Checks

See #8905 and #12231 and [Compiling case expressions] in StgCmmExpr


Here's something that's probably worth improving.

Code like this

if y > 17
  then joinP (y - x)
  else joinP 10

Turns into this:

  call GHC.Classes.>_info(R2) returns to c2mn, args: 32, res: 8, upd: 8;   // CmmCall
c2mn: // global
  _s2lC::P64 = P64[Sp + 8];   // CmmAssign
  _s2lE::P64 = P64[Sp + 24];   // CmmAssign
  if (R1 & 7 != 1) goto c2n1; else goto c2mX;   // CmmCondBranch
c2n1: // global
  Hp = Hp + 40;   // CmmAssign
  if (Hp > HpLim) (likely: False) goto c2n4; else goto c2n3;   // CmmCondBranch
c2mX: // global
  Hp = Hp + 24;   // CmmAssign
  if (Hp > HpLim) (likely: False) goto c2n0; else goto c2mZ;   // CmmCondBranch

Since both branches of the first conditional test lead to a heap check, it's better to commute the conditional tests here so we only have one heap test, since the heap tests produce extra code to enter the GC. The above would transform into:

  call GHC.Classes.>_info(R2) returns to mergedHeapTest, args: 32, res: 8, upd: 8;   // CmmCall
mergedHeapTest:
  _localTemp = Hp + 40; // max(40, 24)
  if (_localTemp > HpLim) (likely: False) goto needGC; else goto continue;
needGC:
  HpAlloc = 40; // max(40, 24)
  R1 = R1;
  call stg_gc_unpt_r1(R1) returns to mergedHeapTest
continue:
  _s2lC::P64 = P64[Sp + 8];   // CmmAssign
  _s2lE::P64 = P64[Sp + 24];   // CmmAssign
  if (R1 & 7 != 1) goto leftSide; else goto rightSide;
leftSide:
  Hp = Hp + 40;  // could be merged into c2n3 if it has one pred.
  goto c2n3;
rightSide:
  Hp = Hp + 24;  // could be merged into c2mZ if it has one pred.
  goto c2mZ;

Other Performance Bugs To Consider

Last modified 4 weeks ago Last modified on May 22, 2018 3:39:58 AM