Opened 4 years ago

Last modified 2 months ago

#7198 new bug

New codegen more than doubles compile time of T3294

Reported by: simonmar Owned by: simonmar
Priority: normal Milestone:
Component: Compiler (CodeGen) Version: 7.4.2
Keywords: Cc: simonmar, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #4258 Differential Rev(s):
Wiki Page:

Description

I did some preliminary investigation, and there seem to be a couple of things going on.

First, the stack allocator generates lots of unnecessary reloads at a continuation, for variables that are not used. These would be cleaned up by the sinking pass (if we were running the sinking pass), but generating them in the first place costs compile time.

Second, there is a large nested let expression of the form

let x = let y = let z = ...
                in  f z
        in  f y

where each let binding has a lot of free variables. So the body of each let ends up copying a ton of variables out of its closure to build the inner let binding's closure. These sequences look like:

x1 = [R1+8]
x2 = [R1+16]
...
[Hp-32] = x1
[Hp-24] = x2
...

now CmmSink can't currently inline all the locals because knowing that [R1+8] doesn't alias [Hp-32] is tricky (see comments in CmmSink). However, again, we're not even running the sinking pass because this is -O0. The fact that we generate all this code in the first place is a problem. The old code generator generated

[Hp-32] = [R1+8]
[Hp-24] = [R1+16]
...

which amounts to a lot less Cmm, and a lot less trouble for the register allocator later.

One thing we could do is flatten out the lets, on the grounds that the inner let binding has a lot of free variables that need to be copied when the let is nested. This could be based on a heuristic about the number of free variables and the amount of extra allocation that would be entailed if the let is never entered.

Change History (10)

comment:1 Changed 2 years ago by thoughtpolice

  • Milestone changed from 7.8.3 to 7.10.1

Bumping priority down (these tickets haven't been closely followed or fixed in 7.4), and moving out to 7.10 and out of 7.8.3.

comment:2 Changed 2 years ago by thoughtpolice

  • Priority changed from high to normal

Actually dropping priority. :)

comment:3 Changed 17 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:4 Changed 14 months ago by thomie

  • Cc simonmar added
  • Component changed from Compiler to Compiler (CodeGen)
  • Type of failure changed from None/Unknown to Compile-time performance bug

comment:5 Changed 9 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:6 Changed 4 months ago by thomie

  • Milestone 8.0.1 deleted

comment:7 Changed 2 months ago by michalt

This sounds interesting and I wanted to attempt to fix this. :-) I'm not very familiar with the code, so some help would be super useful. IIRC there are a few things happening here:

  • We have a bunch of nested lets that we could try to flatten based on a heuristic that considers how many free variables the let has.
  • We generate a lot of unnecessary code for copying the values from the stack to the heap (using local variables instead of assigning them directly).
  • The original description also mentions reloads for unused variables. But I'm not sure I understand what exactly is this referring to. (Some example would be nice!)

Based on that I have a few questions on how to approach solving this:

  • What would be a good place in the code to consider flattening of lets?
  • What code is responsible for the unnecessary code for copying the values? It seems that this is would be the StgCmm* modules that compile STG to cmm, is that correct?

Note: looking at Cmm when compiled with optimizations, it seems to me that the sinking pass is able to optimize the unnecessary copying of values through local variables (the assignment are of the form F64[Hp - 312] = F64[Sp + 8];).

comment:8 Changed 2 months ago by michalt

  • Cc michalt added

comment:9 Changed 2 months ago by simonmar

The basic problem here is that

  • The Stg->Cmm code generator generates suboptimal code
  • The CmmSink pass would mostly optimise it away, but it is only run at -O.
  • The extra code is hurting compile time
  • The old code generator generated simpler code, and thus was faster

It's not obvious what the solution is. We might need several improvements in various places.

it's not clear whether the lets should be flattened; if so, this is the job of earlier phases (simplifier or CorePrep). Flattening lets is a tradeoff, and I suspect the simplifier is already doing a reasonable job here, and has just decided not to in this case.

In some cases I made the STG->Cmm code generator a bit more clever, so as to generate less code and improve compile time, even though later optimisation phases would have produced the same result. I like this approach because it improves compile times for -O0, relying on later optimisation is brittle, and a few tweaks can have a big effect due to the regular patterns that occur in Cmm.

comment:10 Changed 2 months ago by simonpj

Yes, if improving STG->Cmm can be done without major contortions -- and the additional complexity is documented -- that sounds perfect

Note: See TracTickets for help on using tickets.