Opened 4 years ago

Last modified 20 months ago

#4308 new bug

LLVM compiles Updates.cmm badly

Reported by: dterei Owned by:
Priority: low Milestone: 7.6.2
Component: Compiler (LLVM) Version: 6.13
Keywords: Cc: pumpkingod@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by dterei)

Simon M. reported that compiled rts/Updates.cmm on x86-64 with the LLVM backend produced some pretty bad code. The ncg produces this:

stg_upd_frame_info:
.Lco:
       movq 8(%rbp),%rax
       addq $16,%rbp
       movq %rbx,8(%rax)
       movq $stg_BLACKHOLE_info,0(%rax)
       movq %rax,%rcx
       andq $-1048576,%rcx
       movq %rax,%rdx
       andq $1044480,%rdx
       shrq $6,%rdx
       orq %rcx,%rdx
       cmpw $0,52(%rdx)
       jne .Lcf
       jmp *0(%rbp)
.Lcf:
       [...]

The LLVM backend produces this though:

stg_upd_frame_info:                     # @stg_upd_frame_info
# BB#0:                                 # %co
       subq    $104, %rsp
       movq    8(%rbp), %rax
       movq    %rax, 24(%rsp)          # 8-byte Spill
       movq    %rbx, 8(%rax)
       mfence
       movq    $stg_BLACKHOLE_info, (%rax)
       movq    %rax, %rcx
       andq    $-1048576, %rcx         # imm = 0xFFFFFFFFFFF00000
       andq    $1044480, %rax          # imm = 0xFF000
       shrq    $6, %rax
       addq    %rcx, %rax
       addq    $16, %rbp
       cmpw    $0, 52(%rax)
       movsd   %xmm6, 88(%rsp)         # 8-byte Spill
       movsd   %xmm5, 80(%rsp)         # 8-byte Spill
       movss   %xmm4, 76(%rsp)         # 4-byte Spill
       movss   %xmm3, 72(%rsp)         # 4-byte Spill
       movss   %xmm2, 68(%rsp)         # 4-byte Spill
       movss   %xmm1, 64(%rsp)         # 4-byte Spill
       movq    %r9, 56(%rsp)           # 8-byte Spill
       movq    %r8, 48(%rsp)           # 8-byte Spill
       movq    %rdi, 40(%rsp)          # 8-byte Spill
       movq    %rsi, 32(%rsp)          # 8-byte Spill
       je      .LBB1_4

This has two main problems:

  1. mfence instruction (write barrier) isn't required. (write-write barriers aren't required on x86)
  2. LLVM backend is spilling a lot of stuff unnecessarily.

Both these I think are fairly easy fixes. LLVM is handling write barriers quite naively at the moment so 1. is easy. The spilling problem I think is related to a previous fix I made where I need to explicitly kill some of the stg registers if they aren't live across the call, otherwise LLVM rightly thinks they are live since I always pass the stg registers around (so live on entry and exit of every function unless I kill them).

Change History (14)

comment:2 Changed 4 years ago by dterei

  • Description modified (diff)

comment:3 Changed 4 years ago by pumpkin

  • Cc pumpkingod@… added

comment:4 Changed 4 years ago by igloo

  • Milestone set to 7.2.1

comment:5 Changed 4 years ago by dterei

OK, writebarrier problem fixed, patch in my repo waiting to be pushed.

The extra spills is a little more tricky but a bad bug that needs fixing asap. Basically because I always pass the STG registers around their live range is the entire function. Across C calls this is bad since some of them are caller save and get spilled. To fix this I currently store an 'undef' value into the STG registers that are caller save just before the C call. This is kind of like live range splitting. This is a naive fix as if the C call is in one of the branches of say a diamond control flow then in one branch the stg regs are undefined, in the other they are defined as the value they were on entry. Hence why we still get the spilling of them.

So need to handle when the C call is in a branch of a if-else or case statement. Either bubble up or down to the entry or exit node. Or perhaps keep track of what STG regs are actually modified in a method and don't pass through those that aren't. Not sure if this later approach is safe but it seems to produce better code so is preferred if it works.

comment:6 follow-up: Changed 4 years ago by dterei

OK so problem with C call in a branch is due to the LLVM optimiser. In one branch the <Rn> stg registers are the <Rn(old)> value while in the branch with the C call they're 'undef'. So you would expect there to be a phi node like this:

join:
   R3 = phi [lbranch, undef], [rbranch, R3_old]
   ...

However the optimiser doesn't do this as it simplifies phi nodes with 'undef' values by removing the undef value:

join:
   R3 = R3_old

So this is why we get the bad code. Our undef's are removed by the optimiser and so the code generator doesn't know it can let the stg registers be trashed.

When the C call occurs in a non-branch block the 'undef' technique works as the llvm optimiser bubbles down the undef value to the tail call, eg:

mid_block:
   store i64 undef R3
   ...
   %0 = call ccc sin()
   ...
   br %final

final:
   tail call g (Base, Hp, Sp, R1, R2, R3, ...)

Becomes:

mid_block:
   %0 = call ccc sin()
   ...
   br %final

final:
   tail call g (Base, Hp, Sp, R1, R2, undef, ...)

So how to fix? well could try to change the semantics of 'undef' in LLVM. But that seems a risky approach as I don't think the LLVM developers will want to do that. Also there is a better way. We change the Cmm representation to carry with all calls and jumps the live STG register information. Then I don't need to keep passing the stg registers around for every call and can free up some registers for intermediate code. Also will need to stop the saving and restoring of caller save variables being done in Cmm and let LLVM do it instead.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 3 years ago by simonmar

Replying to dterei:

So how to fix? well could try to change the semantics of 'undef' in LLVM. But that seems a risky approach as I don't think the LLVM developers will want to do that. Also there is a better way. We change the Cmm representation to carry with all calls and jumps the live STG register information. Then I don't need to keep passing the stg registers around for every call and can free up some registers for intermediate code. Also will need to stop the saving and restoring of caller save variables being done in Cmm and let LLVM do it instead.

Yes, we do need to do this, and we've discussed doing it in the new backend, but if you have to wait for that it could be a while.

Why not in the meantime propagate the information about dead STG registers yourself, from the foreign call to the final downstream tail-calls? Presumably this could be done with a pass over the LLVM intermediate code.

comment:8 in reply to: ↑ 7 Changed 3 years ago by dterei

Replying to simonmar:

Yes, we do need to do this, and we've discussed doing it in the new backend, but if you have to wait for that it could be a while.

Could we not do this in the old/current codegen? I can't imagine its that much work, I haven't look at the old/current codegen yet but I would assume there is some fairly easy way to access the live stg register information.

Why not in the meantime propagate the information about dead STG registers yourself, from the foreign call to the final downstream tail-calls? Presumably this could be done with a pass over the LLVM intermediate code.

True. I thought about using this approach and actually I have a patch in my local branch that does 'kind' of this. I say kind of since to do it properly you need to process the function in control flow order and my local patch doesn't do this, instead just does a few tricks/hacks to avoid needing to determine control flow.

I thought it would actually be simpler to implement the live stg register info approach then do propagation. The live approach has the advantage of freeing up a lot of registers for most functions since really R3 ... R6, F3, F4... aren't in use a lot of the time.

If I was to do propagation how do you suggest I process a function in control flow order? Just write some code to do this myself or is there an existing framework in ghc? (e.g maybe even use Hoopl?)

comment:9 Changed 3 years ago by simonmar

In the current (aka old) codegen you have to build the control-flow graph yourself, but it's not hard: look at sccBlocks in nativeGen/AsmCodeGen.lhs.

I'm not sure how much work it would be to do this properly, certainly you would have to change the CmmJump constructor to add the information, and make sure it gets correctly generated in the stg->Cmm pass.

comment:11 Changed 2 years ago by dterei

  • Resolution set to fixed
  • Status changed from new to closed

comment:12 Changed 2 years ago by dterei

  • Owner dterei deleted
  • Resolution fixed deleted
  • Status changed from closed to new

So the code is much better now and the live register tracking is done but its still not as good as we'd like. So I'm reopening to track any further work on this.

comment:13 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1
  • Priority changed from normal to low

comment:14 Changed 20 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2
Note: See TracTickets for help on using tickets.