Opened 4 years ago

Last modified 2 months ago

#7198 new bug

New codegen more than doubles compile time of T3294

Reported by: simonmar Owned by: simonmar
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.4.2
Keywords: Cc: simonmar, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #4258 Differential Rev(s):
Wiki Page:

Description

I did some preliminary investigation, and there seem to be a couple of things going on.

First, the stack allocator generates lots of unnecessary reloads at a continuation, for variables that are not used. These would be cleaned up by the sinking pass (if we were running the sinking pass), but generating them in the first place costs compile time.
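
For illustration (a sketch only; the registers and offsets here are made up), the continuation after a call ends up looking something like

c1:   // continuation, entered when the call returns
    x1 = [Sp+8]
    x2 = [Sp+16]
    ...

where every live stack slot is reloaded into a local, even if, say, only x1 is actually used in the code that follows.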

Second, there is a large nested let expression of the form

let x = let y = let z = ...
                in  f z
        in  f y

where each let binding has a lot of free variables. So the body of each let ends up copying a ton of variables out of its closure to build the inner let binding's closure. These sequences look like:

x1 = [R1+8]
x2 = [R1+16]
...
[Hp-32] = x1
[Hp-24] = x2
...

Now, CmmSink can't currently inline all the locals, because knowing that [R1+8] doesn't alias [Hp-32] is tricky (see the comments in CmmSink). However, again, we're not even running the sinking pass because this is -O0. The fact that we generate all this code in the first place is a problem. The old code generator generated

[Hp-32] = [R1+8]
[Hp-24] = [R1+16]
...

which amounts to a lot less Cmm, and a lot less trouble for the register allocator later.
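
To make the aliasing point above concrete (illustrative only): rewriting the second store in

x1 = [R1+8]
x2 = [R1+16]
[Hp-32] = x1
[Hp-24] = x2

into [Hp-24] = [R1+16] means moving the load from [R1+16] past the store to [Hp-32], which is only sound if [Hp-32] cannot alias [R1+16] -- exactly the kind of fact that is tricky for CmmSink to establish.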

One thing we could do is flatten out the lets, on the grounds that the inner let binding has a lot of free variables that need to be copied when the let is nested. This could be based on a heuristic about the number of free variables and the amount of extra allocation that would be entailed if the let is never entered.
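
Concretely, flattening the example above would turn the nested bindings into something like

let z = ...
    y = f z
    x = f y

so that the repeated copying of free variables from one closure into the next disappears, at the cost of allocating z and y up front even when x is never entered.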

Change History (14)

comment:1 Changed 2 years ago by thoughtpolice

  • Milestone changed from 7.8.3 to 7.10.1

Bumping priority down (these tickets haven't been closely followed or fixed in 7.4), and moving out to 7.10 and out of 7.8.3.

comment:2 Changed 2 years ago by thoughtpolice

  • Priority changed from high to normal

Actually dropping priority. :)

comment:3 Changed 20 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:4 Changed 17 months ago by thomie

  • Cc simonmar added
  • Component changed from Compiler to Compiler (CodeGen)
  • Type of failure changed from None/Unknown to Compile-time performance bug

comment:5 Changed 12 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:6 Changed 7 months ago by thomie

  • Milestone 8.0.1 deleted

comment:7 Changed 6 months ago by michalt

This sounds interesting and I wanted to attempt a fix. :-) I'm not very familiar with the code, so some help would be super useful. IIRC there are a few things happening here:

  • We have a bunch of nested lets that we could try to flatten based on a heuristic that considers how many free variables the let has.
  • We generate a lot of unnecessary code for copying the values from the stack to the heap (using local variables instead of assigning them directly).
  • The original description also mentions reloads for unused variables, but I'm not sure I understand what exactly this is referring to. (Some example would be nice!)

Based on that I have a few questions on how to approach solving this:

  • What would be a good place in the code to consider flattening of lets?
  • What code is responsible for the unnecessary copying of values? It seems that this would be the StgCmm* modules that compile STG to Cmm, is that correct?

Note: looking at the Cmm produced when compiling with optimizations, it seems to me that the sinking pass is able to optimize away the unnecessary copying of values through local variables (the assignments are of the form F64[Hp - 312] = F64[Sp + 8];).

comment:8 Changed 6 months ago by michalt

  • Cc michalt added

comment:9 Changed 5 months ago by simonmar

The basic problem here is that

  • The Stg->Cmm code generator generates suboptimal code
  • The CmmSink pass would mostly optimise it away, but it is only run at -O.
  • The extra code is hurting compile time
  • The old code generator generated simpler code, and thus was faster

It's not obvious what the solution is. We might need several improvements in various places.

It's not clear whether the lets should be flattened; if so, this is the job of earlier phases (the simplifier or CorePrep). Flattening lets is a tradeoff, and I suspect the simplifier is already doing a reasonable job here and has simply decided not to flatten in this case.

In some cases I made the STG->Cmm code generator a bit more clever, so as to generate less code and improve compile time, even though later optimisation phases would have produced the same result. I like this approach because it improves compile times for -O0, relying on later optimisation is brittle, and a few tweaks can have a big effect due to the regular patterns that occur in Cmm.

comment:10 Changed 5 months ago by simonpj

Yes, if improving STG->Cmm can be done without major contortions -- and the additional complexity is documented -- that sounds perfect.

comment:11 Changed 2 months ago by michalt

So I was trying to learn more about the codegen using this ticket as the motivation and I think I’m missing something.

I was looking at modifying StgCmmBind.thunkCode, where we unconditionally load all the free variables into registers (before generating the code for the body of the thunk). Instead, I thought we could simply populate the environment with expressions containing the right memory accesses (using addBindC). My thinking was: node is already a local register (so it shouldn't change), so anything relative to it should be stable and thus safe to use. Unfortunately, this segfaults ghc-stage2 (i.e., ghc-stage1 creates something bad). :-/

I tried using the devel1 flavor but that didn't turn up anything (i.e., no assertion failures). Trivial programs compiled by ghc-stage1 seem to work. So it seems that GHC is large enough to trigger some silly bug in my code, but I'm a bit stuck on what that might be. Any ideas what I'm missing? I'm happy to dig further if you have no ideas (or time), but I thought I'd ask whether there's something obviously wrong with the idea or the code. Thanks!

Code: https://github.com/michalt/ghc/tree/t7198/1
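
To restate the idea in the comment above (this is a toy model in plain Haskell, not GHC source and not the linked branch; all names are made up): the current code generator emits one load per free variable up front and binds the variable to the resulting local, whereas the experiment binds the variable directly to a node-relative load, which is then re-emitted at every use site.

import qualified Data.Map as Map

-- A tiny fragment of a Cmm-like expression language (illustrative only).
data Expr
  = Local String        -- a local register
  | Node                -- the closure pointer (R1)
  | LoadW Expr Int      -- a word-sized load at (base + byte offset)
  deriving Show

type Env = Map.Map String Expr

-- Current scheme: load every free variable into a fresh local first,
-- and bind the variable to that local in the environment.
bindEagerly :: [(String, Int)] -> (Env, [String])
bindEagerly fvs =
  ( Map.fromList [ (v, Local ("_" ++ v)) | (v, _) <- fvs ]
  , [ "_" ++ v ++ " = " ++ show (LoadW Node off) | (v, off) <- fvs ] )

-- Experimental scheme: emit no loads up front; bind the variable directly
-- to the memory access, so every use reads from the closure again.
bindLazily :: [(String, Int)] -> (Env, [String])
bindLazily fvs =
  ( Map.fromList [ (v, LoadW Node off) | (v, off) <- fvs ]
  , [] )

main :: IO ()
main = do
  let fvs = [("a", 8), ("b", 16)]
  print (bindEagerly fvs)
  print (bindLazily fvs)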

comment:12 Changed 2 months ago by simonmar

I suspect that R1 is not stable enough, but I'm not sure. You could probably find a smaller test case that fails by running the testsuite with stage=1.

In any case, this probably isn't a good idea, at least in general, because if a free var is used more than once then we'll get a memory access each time it is referenced, rather than loading it into a register once. Of course, if we run out of registers then it might be a good idea... but that's hard to tell at this stage (better to let the register allocator make that decision).

comment:13 Changed 2 months ago by simonpj

Distinguish between

  • Global registers, like R1, Sp, which are part of the calling convention, and live in fixed places
  • Local registers, written x, y, etc in Cmm, which are just like local variables in a programming language. There are an infinite number of them. You can't take their address.

We do indeed generate loads into a local register at the start; but we expect those loads to sink downstream to their use sites, so they don't necessarily cause register pressure.
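
For illustration (offsets as in the description, and assuming nothing in between could write to memory that might alias [R1+8]), a successful sink turns

x = [R1+8]
...
[Hp-32] = x

into the single statement [Hp-32] = [R1+8], and the local x disappears entirely.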

It's not safe to replace them with memory loads. Example

x = R1[4]
call f( p, q )
...use x...

We must save x on the stack across the call. R1 is involved in f's calling convention, so it may well change a lot. Indeed any modification to R1 is going to invalidate your saved "x = R1[4]" delayed load.

Plus the double-load efficiency problem that Simon mentions.

Better to load into a local variable and then sink it.

Does that make sense?

comment:14 Changed 2 months ago by michalt

Right, that makes sense - I think I made the wrong assumption that the value of the node (R1) register is always also put into a local register (that's what was happening in all my experiments). But that's probably not the case in general (which would invalidate the stored memory loads). Thanks for the explanations!

As for sinking everything later: as far as I understand, this ticket is about trying to avoid generating so much Cmm in the first place, even if it would get sunk later on (because it hurts compile time, and the sinking pass is not actually run at -O0).

In any case, I agree that this might not be the best idea if the free variables are used multiple times. I'll keep poking around and ping this issue once I have a better idea :-) (or someone else comes up with one)
