Opened 4 years ago

Last modified 6 months ago

#8048 new bug

Register spilling produces ineffecient/highly contending code

Reported by: schyler Owned by:
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.6.3
Keywords: register allocator spill Cc: schyler, simonmar, tjakway, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #10012 Differential Rev(s):
Wiki Page:

Description

The native codegen and llvm both produce ineffecient code for functions using structures with many strict fields or unboxed values.

Consider the following program:

{-# LANGUAGE BangPatterns #-}
module Spill where

import GHC.Exts

data S = S !Int !Int !Int !Int !Int !Int !Int !Int !Int

spill :: S -> S -> S -> S
spill (S !a !b !c !d !e !f !g !h !i) (S !j !k !l !m !n !o !p !q !r) (S !s !t !u !v !w !x !y !z _)
 = S (a + j + s) (b + c) (k + r) (a + b + c + d + e + f + g + h + i) (j + k + l + m + n + o + p + q + r) (s + t + u + v + w + x + y + z) (a + b + c) (j + k + l) (s + t + u)

Parts of the code produced for this (which is identical regardless of -funbox-strict-fields) looks like:

_cnc:
    addq $80,%r12
    cmpq 144(%r13),%r12
    ja _cni
    movq $Spill.S_con_info,-72(%r12)
    movq 32(%rbp),%rax
    movq %rax,-64(%r12)
    movq 24(%rbp),%rax
    movq %rax,-56(%r12)
    movq 16(%rbp),%rax
    movq %rax,-48(%r12)
    movq 8(%rbp),%rax
    movq %rax,-40(%r12)
    movq 40(%rbp),%rax
    movq %rax,-32(%r12)
    movq 48(%rbp),%rax
    movq %rax,-24(%r12)
    movq 56(%rbp),%rax
    movq %rax,-16(%r12)
    movq 64(%rbp),%rax
    movq %rax,-8(%r12)
    movq 7(%rbx),%rax
    movq %rax,0(%r12)
    leaq -71(%r12),%rbx
    addq $72,%rbp
    jmp *0(%rbp)
_csv:
    movq 63(%rbx),%rax
    movq %rax,-56(%rbp)
    movq 55(%rbx),%rax
    movq %rax,-48(%rbp)
    movq 47(%rbx),%rax
    movq %rax,-40(%rbp)
    movq 39(%rbx),%rax
    movq %rax,-32(%rbp)
    movq 31(%rbx),%rax
    movq %rax,-24(%rbp)
    movq 23(%rbx),%rax
    movq %rax,-16(%rbp)
    movq 71(%rbx),%rax
    movq %rax,-8(%rbp)
    movq 15(%rbx),%rax
    movq %rax,0(%rbp)

And likewise for LLVM:

.LBB10_1: # %coZ
        movq    7(%rbx), %rcx
        movq    $Spill_S_con_info, 8(%rax)
        movq    8(%rbp), %rdx
        movq    %rdx, 16(%rax)
        movq    16(%rbp), %rdx
        movq    %rdx, 24(%rax)
        movq    24(%rbp), %rdx
        movq    %rdx, 32(%rax)
        movq    32(%rbp), %rdx
        movq    %rdx, 40(%rax)
        movq    40(%rbp), %rdx
        movq    %rdx, 48(%rax)
        movq    48(%rbp), %rdx
        movq    %rdx, 56(%rax)
        movq    56(%rbp), %rdx
        movq    %rdx, 64(%rax)
        movq    64(%rbp), %rdx
        movq    %rdx, 72(%rax)
        movq    %rcx, (%r12)
        movq    72(%rbp), %rax
        leaq    72(%rbp), %rbp
        leaq    -71(%r12), %rbx
        jmpq    *%rax  # TAILCALL

Quoting from #ghc "the [register allocator] core algo is '96 vintage". Improvements are needed;

  • Take into consideration pipelining and handle spills less dramatically, attempting to reduce register contention
  • Sink memory reads in order to reduce register pressure

Attachments (1)

spill.hs (431 bytes) - added by schyler 4 years ago.
spill.hs built with -ddump-asm and -O2 produces the code above

Download all attachments as: .zip

Change History (17)

Changed 4 years ago by schyler

Attachment: spill.hs added

spill.hs built with -ddump-asm and -O2 produces the code above

comment:1 Changed 4 years ago by jstolarek

Cc: jan.stolarek@… added

comment:2 Changed 4 years ago by schyler

This paper might be relevant or otherwise useful. http://ftp.cs.wisc.edu/galileo/papers/ICS88_Hsu.pdf

comment:3 Changed 4 years ago by carter

Looking at this, it looks like the real issue is the lack of an analogue of the llvm mem2reg optimization for the ghc stack I think... (ie what we have here is one basic block is pushing args onto the stack, then we're funcalling to another basic block that is unloading them again)

(and llvm's mem2reg can't do it because GHC manages its own stack in a way llvm doesn't understand)

https://github.com/mlite/HsLlvm/blob/master/src/Llvm/Pass/Mem2Reg.hs is a implementation of mem2reg using hoopl, for the llvm ir, might be worth figuring out how to make that work for cmm.

(what follows is me sketching out an idea, though I or someone else will have to do some experimentation to see if it helps ) naively: what I think is going on is we have a few different basic blocks as CMM functions going on in the test code for the spill function, and we're moving between these blocks using general function calls, when we actually went to "fuse" the blocks together so that we don't have this busy work of doing the trivial spill to stack only to immediately unspill again.

ok, now that i've thought about it out loud, might be easier to do this optimization before we go to CMM, because then we dont have to invert the spill / unspill code

comment:4 Changed 4 years ago by carter

dcoutts has pointed out this mem2reg work could be done nicely on the CMM side, though [I'm] not sure how much is needed to make it so

edit: I'll try and look into understanding whats needed for this, may yield some nice low hanging performance fruit

Last edited 4 years ago by carter (previous) (diff)

comment:5 Changed 4 years ago by schyler

Unsure if this helps, but there's a project implementing mem2reg using Hoopl on github: https://github.com/mlite/HsLlvm

comment:6 Changed 4 years ago by schyler

Priority: normalhigh

comment:7 Changed 4 years ago by carter

Priority: highnormal

Schyler, do you have a engineering plan of attack to resolve the problem?

Do you have some example codes where you can share benchmarks on the ticket that demonstrate this actually creates a performance problem? I remember you saying you did, but also that you said later "register renaming on newer CPUs seems to make the problem go away".

Is it an actual performance issue, or merely an aesthetical one?

comment:8 Changed 4 years ago by schyler

Using a pretty crude benchmark, found a 9-10% hit. https://gist.github.com/kvanberendonck/6836713 It's hard to benchmark something so small though - those of you with Intel processors might have less luck running this one, but AMD is a little more linear and gives quite stable results.

(Also, I'm aware the benchmark itself can be optimized. It's like that so it's easier to cycle through the results and generate some other stats like dev).

Last edited 4 years ago by schyler (previous) (diff)

comment:9 Changed 4 years ago by schyler

Cc: schyler added

comment:10 Changed 4 years ago by simonmar

Cc: simonmar added
Component: CompilerCompiler (NCG)

comment:11 Changed 4 years ago by schyler

Component: Compiler (NCG)Compiler

Actually simon, I was able to reproduce this also with LLVM. Interestingly, when we fiddled with some flags like -march=native and such, llvm seemed to bandaid it somehow by using mmx (is it really just a case of running out of registers?)

Apologies if your intent was to indicate it was an issue with the mangler or something like that.

Last edited 4 years ago by schyler (previous) (diff)

comment:12 Changed 4 years ago by simonmar

Component: CompilerCompiler (CodeGen)

Alright then, if we're being nitpicky about the Component field :-)

comment:13 Changed 2 years ago by jstolarek

Cc: jan.stolarek@… removed

comment:14 Changed 18 months ago by bgamari

comment:15 Changed 15 months ago by tjakway

Cc: tjakway added

comment:16 Changed 6 months ago by michalt

Cc: michalt added
Note: See TracTickets for help on using tickets.