Opened 4 years ago

Last modified 2 months ago

#7198 new bug

New codegen more than doubles compile time of T3294

Reported by: simonmar Owned by: simonmar
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.4.2
Keywords: Cc: simonmar, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #4258 Differential Rev(s):
Wiki Page:

Description

I did some preliminary investigation, and there seem to be a couple of things going on.

First, the stack allocator generates lots of unnecessary reloads at a continuation, for variables that are not used. These would be cleaned up by the sinking pass (if we were running the sinking pass), but generating them in the first place costs compile time.
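
For illustration (a sketch only; the registers and offsets here are made up), the continuation after a call ends up looking something like

c1:   // continuation, entered when the call returns
    x1 = [Sp+8]
    x2 = [Sp+16]
    ...

where every live stack slot is reloaded into a local, even if, say, only x1 is actually used in the code that follows.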

Second, there is a large nested let expression of the form

let x = let y = let z = ...
                in  f z
        in  f y

where each let binding has a lot of free variables. So the body of each let ends up copying a ton of variables out of its closure to build the inner let binding's closure. These sequences look like:

x1 = [R1+8]
x2 = [R1+16]
...
[Hp-32] = x1
[Hp-24] = x2
...

Now, CmmSink can't currently inline all the locals, because knowing that [R1+8] doesn't alias [Hp-32] is tricky (see the comments in CmmSink). However, again, we're not even running the sinking pass because this is -O0. The fact that we generate all this code in the first place is a problem. The old code generator generated

[Hp-32] = [R1+8]
[Hp-24] = [R1+16]
...

which amounts to a lot less Cmm, and a lot less trouble for the register allocator later.
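
To make the aliasing point above concrete (illustrative only): rewriting the second store in

x1 = [R1+8]
x2 = [R1+16]
[Hp-32] = x1
[Hp-24] = x2

into [Hp-24] = [R1+16] means moving the load from [R1+16] past the store to [Hp-32], which is only sound if [Hp-32] cannot alias [R1+16] -- exactly the kind of fact that is tricky for CmmSink to establish.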

One thing we could do is flatten out the lets, on the grounds that the inner let binding has a lot of free variables that need to be copied when the let is nested. This could be based on a heuristic about the number of free variables and the amount of extra allocation that would be entailed if the let is never entered.
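
Concretely, flattening the example above would turn the nested bindings into something like

let z = ...
    y = f z
    x = f y

so that the repeated copying of free variables from one closure into the next disappears, at the cost of allocating z and y up front even when x is never entered.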

Change History (14)

comment:1 Changed 2 years ago by thoughtpolice

  • Milestone changed from 7.8.3 to 7.10.1

Bumping priority down (these tickets haven't been closely followed or fixed in 7.4), and moving out to 7.10 and out of 7.8.3.

comment:2 Changed 2 years ago by thoughtpolice

  • Priority changed from high to normal

Actually dropping priority. :)

comment:3 Changed 20 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:4 Changed 17 months ago by thomie

  • Cc simonmar added
  • Component changed from Compiler to Compiler (CodeGen)
  • Type of failure changed from None/Unknown to Compile-time performance bug

comment:5 Changed 12 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:6 Changed 7 months ago by thomie

  • Milestone 8.0.1 deleted

comment:7 Changed 6 months ago by michalt

This sounds interesting and I wanted to attempt a fix. :-) I'm not very familiar with the code, so some help would be super useful. IIRC there are a few things happening here:

  • We have a bunch of nested lets that we could try to flatten based on a heuristic that considers how many free variables the let has.
  • We generate a lot of unnecessary code for copying the values from the stack to the heap (using local variables instead of assigning them directly).
  • The original description also mentions reloads for unused variables, but I'm not sure I understand what exactly this is referring to. (Some example would be nice!)

Based on that I have a few questions on how to approach solving this:

  • What would be a good place in the code to consider flattening of lets?
  • What code is responsible for the unnecessary copying of values? It seems that this would be the StgCmm* modules that compile STG to Cmm, is that correct?

Note: looking at the Cmm produced when compiling with optimizations, it seems to me that the sinking pass is able to optimize away the unnecessary copying of values through local variables (the assignments are of the form F64[Hp - 312] = F64[Sp + 8];).

comment:8 Changed 6 months ago by michalt

  • Cc michalt added

comment:9 Changed 5 months ago by simonmar

The basic problem here is that

  • The Stg->Cmm code generator generates suboptimal code
  • The CmmSink pass would mostly optimise it away, but it is only run at -O.
  • The extra code is hurting compile time
  • The old code generator generated simpler code, and thus was faster

It's not obvious what the solution is. We might need several improvements in various places.

It's not clear whether the lets should be flattened; if so, this is the job of earlier phases (the simplifier or CorePrep). Flattening lets is a tradeoff, and I suspect the simplifier is already doing a reasonable job here and has simply decided not to flatten in this case.

In some cases I made the STG->Cmm code generator a bit more clever, so as to generate less code and improve compile time, even though later optimisation phases would have produced the same result. I like this approach because it improves compile times for -O0, relying on later optimisation is brittle, and a few tweaks can have a big effect due to the regular patterns that occur in Cmm.

comment:10 Changed 5 months ago by simonpj

Yes, if improving STG->Cmm can be done without major contortions -- and the additional complexity is documented -- that sounds perfect.

comment:11 Changed 2 months ago by michalt

So I was trying to learn more about the codegen using this ticket as the motivation and I think I’m missing something.

I was looking at modifying StgCmmBind.thunkCode, where we unconditionally load all the free variables into registers (before generating the code for the body of the thunk). Instead, I thought we could simply populate the environment with expressions containing the right memory accesses (using addBindC). My thinking was: node is already a local register (so it shouldn't change), so anything relative to it should be stable and thus safe to use. Unfortunately, this segfaults ghc-stage2 (i.e., ghc-stage1 creates something bad). :-/

I tried using the devel1 flavor but that didn't turn up anything (i.e., no assertion failures). Trivial programs compiled by ghc-stage1 seem to work. So it seems that GHC is large enough to trigger some silly bug in my code, but I'm a bit stuck on what that might be. Any ideas what I'm missing? I'm happy to dig further if you have no ideas (or time), but I thought I'd ask whether there's something obviously wrong with the idea or the code. Thanks!

Code: https://github.com/michalt/ghc/tree/t7198/1
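
To restate the idea in the comment above (this is a toy model in plain Haskell, not GHC source and not the linked branch; all names are made up): the current code generator emits one load per free variable up front and binds the variable to the resulting local, whereas the experiment binds the variable directly to a node-relative load, which is then re-emitted at every use site.

import qualified Data.Map as Map

-- A tiny fragment of a Cmm-like expression language (illustrative only).
data Expr
  = Local String        -- a local register
  | Node                -- the closure pointer (R1)
  | LoadW Expr Int      -- a word-sized load at (base + byte offset)
  deriving Show

type Env = Map.Map String Expr

-- Current scheme: load every free variable into a fresh local first,
-- and bind the variable to that local in the environment.
bindEagerly :: [(String, Int)] -> (Env, [String])
bindEagerly fvs =
  ( Map.fromList [ (v, Local ("_" ++ v)) | (v, _) <- fvs ]
  , [ "_" ++ v ++ " = " ++ show (LoadW Node off) | (v, off) <- fvs ] )

-- Experimental scheme: emit no loads up front; bind the variable directly
-- to the memory access, so every use reads from the closure again.
bindLazily :: [(String, Int)] -> (Env, [String])
bindLazily fvs =
  ( Map.fromList [ (v, LoadW Node off) | (v, off) <- fvs ]
  , [] )

main :: IO ()
main = do
  let fvs = [("a", 8), ("b", 16)]
  print (bindEagerly fvs)
  print (bindLazily fvs)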

comment:12 Changed 2 months ago by simonmar

I suspect that R1 is not stable enough, but I'm not sure. You could probably find a smaller test case that fails by running the testsuite with stage=1.

In any case, this probably isn't a good idea, at least in general, because if a free var is used more than once then we'll get a memory access each time it is referenced, rather than loading it into a register once. Of course, if we run out of registers then it might be a good idea... but that's hard to tell at this stage (better to let the register allocator make that decision).

comment:13 Changed 2 months ago by simonpj

Distinguish between

  • Global registers, like R1, Sp, which are part of the calling convention, and live in fixed places
  • Local registers, written x, y, etc in Cmm, which are just like local variables in a programming language. There are an infinite number of them. You can't take their address.

We do indeed generate loads into a local register at the start; but we expect those loads to sink downstream to their use sites, so they don't necessarily cause register pressure.
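
For illustration (offsets as in the description, and assuming nothing in between could write to memory that might alias [R1+8]), a successful sink turns

x = [R1+8]
...
[Hp-32] = x

into the single statement [Hp-32] = [R1+8], and the local x disappears entirely.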

It's not safe to replace them with memory loads. Example

x = R1[4]
call f( p, q )
...use x...

We must save x on the stack across the call. R1 is involved in f's calling convention, so it may well change a lot. Indeed any modification to R1 is going to invalidate your saved "x = R1[4]" delayed load.

Plus the double-load efficiency problem that Simon mentions.

Better to load into a local variable and then sink it.

Does that make sense?

comment:14 Changed 2 months ago by michalt

Right, that makes sense - I think I made the wrong assumption that the value of the node (R1) register is always also put into a local register (that's what was happening in all my experiments). But that's probably not the case in general (which would invalidate the stored memory loads). Thanks for the explanations!

As for sinking everything later: as far as I understand, this ticket is about trying to avoid generating so much Cmm in the first place, even if it would get sunk later on (because it hurts compile time, and the sinking pass is not actually run at -O0).

In any case, I agree that this might not be the best idea if the free variables are used multiple times. I'll keep poking around and ping this issue once I have a better idea :-) (or someone else comes up with one)
