wiki:Commentary/Compiler/BackendOpt

Planned Backend Optimizations

This page collects ideas and plans for improving the backend of the compiler. Any progress made will also be noted here.

Removal of Proc Point Splitting

The reason we needed proc-point splitting for the LLVM backend is detailed here.

Rationale

Ideally, we would expose an entire Cmm function to LLVM for better optimization and machine code generation, instead of breaking the function apart just to grab a valid label. Proc-point splitting is also a rather ugly transformation that we would like to remove from the pipeline.
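
As a rough sketch (the procedure and label names below are invented for illustration), proc-point splitting takes a single Cmm procedure containing a return point and breaks it into separate top-level procedures, one per proc-point, so that each return point gets a real symbol the code generator can refer to:

// Before splitting: one Cmm procedure with an internal return point.
f_entry:
  ...
  I64[Sp] = afterCall            // needs the address of a local block
  call g() returns to afterCall
afterCall:
  ...

// After splitting: afterCall becomes its own top-level procedure with a
// real symbol, but LLVM now sees two small functions instead of one.
f_entry:
  ...
  I64[Sp] = f_afterCall_entry
  call g()                       // a tail call; execution later re-enters
                                 // through f_afterCall_entry
f_afterCall_entry:
  ...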

Challenges

The address of a basic block is not a well-defined concept outside of an LLVM function. There has been no movement on this in LLVM due to the key assumption made by LLVM's IR that all basic blocks have known predecessors, forming a complete control-flow graph amongst blocks. An escaping block label opens up the possibility of an edge from an unknown location. This is why an indirect branch instruction must conservatively list all possible block targets.
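
For instance, here is a minimal LLVM IR sketch (invented for illustration, not taken from any proposal) of the only mechanism LLVM offers today for branching to a dynamically chosen block: blockaddress may only name blocks of the enclosing function, and the indirectbr that consumes it must enumerate every block it could possibly reach:

define i8 @pick(i1 %c) {
entry:
  ; blockaddress is only meaningful inside the function that owns the block
  %dest = select i1 %c, i8* blockaddress(@pick, %left), i8* blockaddress(@pick, %right)
  ; the indirect branch must conservatively list every possible successor,
  ; so no edge can originate from outside this function
  indirectbr i8* %dest, [label %left, label %right]

left:
  ret i8 0

right:
  ret i8 1
}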

Proposal

We propose to extend LLVM slightly to support "CPS calls". The initial proposal to the LLVM community can be found here. What follows is a somewhat revised version of that proposal.

Implementation Progress

After receiving a helpful suggestion from Reid on the LLVM mailing list, I've extended LLVM with an experimental CPS call. To help understand how it works, let's consider this Core program:

f x = if x /=# 0#
      then I# (x -# 1#)   -- This branch allocates
      else foo x

This would generate roughly the following post-CPS Cmm code:

fEntry:
  if x != 0 goto G; else goto J

G:
  if Hp+10 >= HpLim goto needGC; else goto R

needGC:
  I64[Sp + 8] = x
  I64[Sp] = afterGC             // write block address to memory
  call doGC returns to afterGC  // a non-tail call, x is live

afterGC:
  x = I64[Sp + 8]
  goto G

J:
  call foo( x )  // a tail call

R: 
  x' = x - 1
  ...allocate (I# x') and return...

afterGC is a return point (a.k.a. proc-point). Previously, we would need to split this block out into its own function, but we no longer need to if we use the llvm.experimental.cpscall intrinsic. In order to use this intrinsic, we must first omit any use of a block address on the right-hand side of a Cmm assignment. So, the line I64[Sp] = afterGC is not generated; instead, the LLVM cpscall intrinsic will write the return address to the stack (at Sp) for us. The Cmm call to doGC would translate to the following LLVM code:

; where R1 = x
%retVals = call ghccc @llvm.experimental.cpscall( @doGC, ID, RA_OFFSET, SP_ARGNUM, Base, Sp, ..., R1)
br label %afterGC

A few notes:

  • SP_ARGNUM is a constant integer giving the position, counting from Base in this example, of the argument in which the explicit stack pointer is passed to the call. In this case, it would be 1.
  • RA_OFFSET is the constant integer representing a byte offset from the explicit stack pointer where the return address should be written. It is typically 0 but not always (over-saturated calls).
  • ID is an arbitrary constant integer used to identify the return point for garbage collector support in the emitted assembly code.

Let RA_OFFSET = 7 and SP_ARGNUM = 1. To understand the final assembly code output of the cpscall intrinsic, the last piece we must consider is the calling convention used. In the above example, the calling convention is ghccc, so the 2nd integer register (the one used for Sp, which is "argument number" 1) on x86-64 is %rbp. Thus, we will output the following x86-64 assembly code for the cpscall:

    leaq  returnPt_ID(%rip), SCRATCH_REG
    movq  SCRATCH_REG, 7(%rbp)      # I64[Sp + RA_OFFSET] = returnPt_ID
    jmp   _doGC

returnPt_ID:
    // code to execute afterGC is here


NOTE: out-of-date implementation details for this intrinsic within LLVM follow:


Here is what the needGC block looks like by the time we're ready to expand the CPSCALL pseudo-instruction:

  BB#4: derived from LLVM BB %needGC
    Predecessors according to CFG: BB#3
  %vreg10<def> = LEA64r %RIP, 1, %noreg, <blockaddress(@foo, %G)>, %noreg; GR64:%vreg10
  MOV64mr %vreg3, 1, %noreg, 0, %noreg, %vreg10<kill>; mem:ST8[%G_phi_sp] GR64:%vreg3,%vreg10
  ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  %R13<def> = COPY %vreg3; GR64:%vreg3
  %RBP<def> = COPY %vreg4; GR64:%vreg4
  CPSCALLdi64 <ga:@doGC>, <regmask>, %RSP<imp-use>, %R13<imp-use>, %RBP<imp-use>, %RSP<imp-def>, %R13<imp-def>, %RBP<imp-def>
  ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  %vreg11<def> = COPY %R13; GR64:%vreg11
  %vreg12<def> = COPY %RBP; GR64:%vreg12
  %vreg5<def> = COPY %vreg11; GR64:%vreg5,%vreg11
  %vreg6<def> = COPY %vreg12; GR64:%vreg6,%vreg12
  JMP_1 <BB#3>
    Successors according to CFG: BB#3(?%)

When LLVM's expand-isel-pseudos pass is run after isel is complete, we expand the CPSCALL at the MachineBasicBlock level in the following way:

  1. The instructions following the CPSCALL are analyzed to determine which virtual registers correspond to physical registers used by the ghccc convention to return those struct fields.
  2. The stack adjustment following the CPSCALL is moved before the CPSCALL.
  3. The remaining instructions after the CPSCALL are then deleted, leaving the CPSCALL pseudo-op at the end of the needGC block.
  4. The CPSCALL is converted into a TCRETURN, which is the pseudo-instruction used in the x86 backend of LLVM to later emit a tail call.
  5. The G_standin block has physical registers added to it as live-in values, it is marked as an EH landing pad, and vReg = COPY pReg copies are emitted in that block.
  6. The phi-nodes of G are updated so that the virtual registers taken from needGC now come from G_standin instead.

This leaves us with the following MachineBasicBlock representation:

  BB#4: derived from LLVM BB %needGC
    Predecessors according to CFG: BB#6
  %vreg10<def> = LEA64r %RIP, 1, %noreg, <blockaddress(@foo, %G)>, %noreg; GR64:%vreg10
  MOV64mr %vreg3, 1, %noreg, 0, %noreg, %vreg10<kill>; mem:ST8[%G_phi_sp] GR64:%vreg3,%vreg10
  ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  %R13<def> = COPY %vreg3; GR64:%vreg3
  %RBP<def> = COPY %vreg4; GR64:%vreg4
  ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def,dead>, %RSP<imp-use>
  TCRETURNdi64 <ga:@doGC>, 0, %RSP<imp-use>, %R13<imp-use>, %RBP<imp-use>
    Successors according to CFG: BB#3(?%)

BB#3: derived from LLVM BB %G, EH LANDING PAD, ADDRESS TAKEN
    Live Ins: %R13 %RBP
    Predecessors according to CFG: BB#4
  %vreg6<def> = COPY %RBP<kill>; GR64:%vreg6
  %vreg5<def> = COPY %R13<kill>; GR64:%vreg5
  JMP_1 <BB#6>
    Successors according to CFG: BB#6(?%)

BB#6: 
    Predecessors according to CFG: BB#2 BB#3
  %vreg2<def> = PHI %vreg0, <BB#2>, %vreg6, <BB#3>; GR64:%vreg2,%vreg0,%vreg6
  %vreg3<def> = PHI %vreg1, <BB#2>, %vreg5, <BB#3>; GR64:%vreg3,%vreg1,%vreg5
  %vreg4<def,tied1> = DEC64r %vreg2<tied0>, %EFLAGS<imp-def,dead>; GR64:%vreg4,%vreg2
  %vreg9<def,tied1> = SUB64ri32 %vreg3<tied0>, 1000, %EFLAGS<imp-def>; GR64:%vreg9,%vreg3
  JL_1 <BB#1>, %EFLAGS<imp-use>
  JMP_1 <BB#4>
    Successors according to CFG: BB#4(0x7c000000 / 0x80000000 = 96.88%) BB#1(0x04000000 / 0x80000000 = 3.12%)

The machine code for the loops in this example ends up looking quite nice:

LBB0_5:                       ## %needGC
  leaq  Ltmp0(%rip), %rdx
  movq  %rdx, (%rax)
  movq  %rax, %r13
  movq  %rcx, %rbp
  popq  %rax                    # modify PEI & RTS to omit this.
  jmp _doGC     ## TAILCALL
Ltmp0:                        ## Block address taken
LBB0_3:                       ## %G
  movq  %rbp, %rcx
  movq  %r13, %rax
LBB0_4:                       ## BB#6
  decq  %rcx
  cmpq  $1000, %rax
  jge LBB0_5
LBB0_1:                       ## %H
  testq %rcx, %rcx
  jne LBB0_4

Tuning LLVM IR Passes

The optimization pass sequence in LLVM is well tested for languages like C/C++, but not Haskell. See #11295 for details and progress updates.
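
As a rough sketch of how one might experiment here (exact flags and pass names depend on the GHC and LLVM versions in use), GHC's -optlo flag forwards options to LLVM's opt, so alternative optimization levels or pipelines can be compared without modifying GHC itself:

# Baseline: GHC drives LLVM with its default pass sequence.
ghc -O2 -fllvm Main.hs

# Possible experiment: forward a different optimization level to opt and
# compare the generated code.
ghc -O2 -fllvm -optlo-O3 Main.hs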

Improving Heap Checks

See #8905 and #12231, and the note [Compiling case expressions] in StgCmmExpr.


Here's something that's probably worth improving.

Code like this:

if y > 17
  then joinP (y - x)
  else joinP 10

turns into this:

  call GHC.Classes.>_info(R2) returns to c2mn, args: 32, res: 8, upd: 8;   // CmmCall
c2mn: // global
  _s2lC::P64 = P64[Sp + 8];   // CmmAssign
  _s2lE::P64 = P64[Sp + 24];   // CmmAssign
  if (R1 & 7 != 1) goto c2n1; else goto c2mX;   // CmmCondBranch
c2n1: // global
  Hp = Hp + 40;   // CmmAssign
  if (Hp > HpLim) (likely: False) goto c2n4; else goto c2n3;   // CmmCondBranch
c2mX: // global
  Hp = Hp + 24;   // CmmAssign
  if (Hp > HpLim) (likely: False) goto c2n0; else goto c2mZ;   // CmmCondBranch

Since both branches of the first conditional test lead to a heap check, it would be better to commute the conditional tests here so that only one heap test is needed, because each heap test produces extra code to enter the GC. The above would transform into:

  call GHC.Classes.>_info(R2) returns to mergedHeapTest, args: 32, res: 8, upd: 8;   // CmmCall
mergedHeapTest:
  _localTemp = Hp + 40; // max(40, 24)
  if (_localTemp > HpLim) (likely: False) goto needGC; else goto continue;
needGC:
  HpAlloc = 40; // max(40, 24)
  R1 = R1;
  call stg_gc_unpt_r1(R1) returns to mergedHeapTest
continue:
  _s2lC::P64 = P64[Sp + 8];   // CmmAssign
  _s2lE::P64 = P64[Sp + 24];   // CmmAssign
  if (R1 & 7 != 1) goto leftSide; else goto rightSide;
leftSide:
  Hp = Hp + 40;  // could be merged into c2n3 if it has one pred.
  goto c2n3;
rightSide:
  Hp = Hp + 24;  // could be merged into c2mZ if it has one pred.
  goto c2mZ;

Other Performance Bugs To Consider
