Version 16 (modified by nfrisby, 14 months ago)
- Plan for my internship summary
- general core knowledge
- Late Lambda Float
  - Notes for Write-up
  - Discovered Detriments of LLF
  - Discovered Benefits of LLF
  - Miscellaneous Findings
This page serves as a public log of what I did for my GHC internship from 21 Jan 2013 to 12 April 2013.
Plan for my internship summary
Compared to 351a8c6bbd53ce07d687b5a96afff77c4c9910cc, we implemented OPTIMIZATIONS with a cumulative effect of EFFECT on the generated code as well as EFFECT on the compiler's code. The hope is for the optimizations to have beneficial non-trivial interactions and to simplify/tidy the GHC code base.
general core knowledge
Max's page about code generation is really helpful!
- UNKNOWN_CALL_ctr - put arguments on stack and call the RTS's stg_ap_<PAT>_fast routine for that argument pattern
- KNOWN_CALL_ctr - "fast call": put arguments in registers and call the function's fast entry point
- KNOWN_CALL_TOO_FEW_ARGS_ctr - "PAP": creates a Partial APplication closure
- KNOWN_CALL_EXTRA_ARGS_ctr - like a fast call, but first push a continuation onto the stack that effectively uses stg_ap_<PAT>_fast for the extra args
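As a rough way to remember which counter a call site bumps, here is a minimal sketch that classifies a call from the callee's arity and the number of arguments supplied. The names CallKind and classifyCall are hypothetical and made up for this page; this is not how the code generator is actually written.

```haskell
-- Hypothetical names throughout; this is not GHC's code generator.
data CallKind
  = UnknownCall          -- UNKNOWN_CALL_ctr
  | KnownCall            -- KNOWN_CALL_ctr
  | KnownCallTooFewArgs  -- KNOWN_CALL_TOO_FEW_ARGS_ctr
  | KnownCallExtraArgs   -- KNOWN_CALL_EXTRA_ARGS_ctr
  deriving (Eq, Show)

-- | Classify a call site from the callee's arity (Nothing if the callee is
-- not statically known) and the number of arguments supplied.
classifyCall :: Maybe Int -> Int -> CallKind
classifyCall Nothing _ = UnknownCall
classifyCall (Just arity) nArgs
  | nArgs <  arity = KnownCallTooFewArgs  -- builds a PAP
  | nArgs == arity = KnownCall            -- fast call
  | otherwise      = KnownCallExtraArgs   -- fast call plus a continuation
```

For example, classifyCall (Just 2) 3 lands in the "extra args" case, while any call through a lambda-bound variable is classifyCall Nothing n.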
Core -> STG -> CMM
TODO and _what you can learn by looking at each one_
Late Lambda Float
LLF = Late Lambda Float
As the GHC optimization papers explain, it is an early design decision to *not* perform lambda lifting. My initial project was to investigate the effects of aggressively floating lambdas to the top-level at the end of the core2core pipeline.
- The main reason to not perform lambda lifting is that abstracting over free variables loses information and thereby inhibits *downstream* optimization.
- Doing it *late* (ie just before CorePrep) circumvents this issue.
- The original conjecture was that doing it would save allocation: a dynamically allocated closure becomes a static top-level function.
- Max Bolingbroke did a quick implementation of this idea some years ago (~mid 2000s), but it seems it was abandoned. I don't know why.
Notes for Write-up
- puzzle's $fEnumItemType.$cenumFromThen has a nice demonstration: a cascade of let-no-escapes becomes a series of top-level functions, all tail-calls
We decided to implement LLF by re-using most of the FloatOut machinery.
FloatOut is structured in three phases.
- Annotate all expressions with their free variables.
- Consume those annotations while annotating each binder with the target "level" (essentially a depth wrt value lambdas) to which we want to float it.
- Consume those annotations while actually relocating the bindings.
We wholesale re-use the third phase (compiler/simplCore/FloatOut) with no changes, add logic to the middle phase, and enrich the first phase with more analyses.
Most of my changes were
- Adding flags (compiler/main/DynFlags compiler/simplCore/CoreMonad compiler/simplCore/SimplCore)
- Implementing the LLF logic in the first two FloatOut phases (compiler/simplCore/SetLevels)
- Adding LLF to the core2core pipeline (compiler/simplCore/SimplCore)
To minimize confounding factors, I decided to float only lambdas during LLF. Thus there is no need to perform FloatIn afterwards: all of our floats are to the top-level, so there will be nothing to FloatIn.
We placed LLF as the last pass before CorePrep. After experimentation, we decided to follow it with a simplifier pass.
The basic shape of things:
outer = CTX[let f x = RHS[x] in BODY[f]]
where outer is a top-level binding. LLF transforms this to:
poly_f FVS x = RHS[x]
outer = CTX[BODY[poly_f FVS]]
where FVS are the free variables of RHS[x]. We'll use a, b, c, ... for particular variables in FVS.
The poly prefix is vestigial: in the past, floated bindings could never cross lambdas, so the abstracted variables were only type variables. Hence the machinery that adds the new parameters was only ever adding type parameters; it was creating polymorphic functions. The naming scheme was not updated when the machinery was enriched to also abstract over values.
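To make the shape above concrete, here is a minimal hand-written sketch in source Haskell rather than Core. The functions outerBefore, outerAfter and this particular poly_f are invented for illustration; the real pass works on Core and picks its own names.

```haskell
-- Before LLF: f is a local lambda that closes over a and b.
outerBefore :: Int -> Int -> [Int] -> Int
outerBefore a b xs =
  let f x = a * x + b          -- free variables of the RHS: a, b
  in  sum [f x | x <- xs]      -- every occurrence of f is saturated

-- After LLF (written by hand): the free variables become leading parameters
-- of a top-level function.
poly_f :: Int -> Int -> Int -> Int
poly_f a b x = a * x + b

-- Each occurrence of f is replaced by poly_f applied to the free variables.
outerAfter :: Int -> Int -> [Int] -> Int
outerAfter a b xs = sum [poly_f a b x | x <- xs]
```

The two versions compute the same result; the difference is only in whether a closure for f has to be allocated each time the outer function runs.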
- join points
- let-no-escape (LNE)
- Note [join point abstraction]
Discovered Detriments of LLF
These are the various negative consequences that we discovered on the way. We discuss mitigation below.
- Unapplied occurrences of f in BODY result in the creation of PAPs, which increases allocation. For example: map f xs becomes map (poly_f a b c) xs. Max had identified this issue earlier.
- Abstracting over a known function might change a fast entry call in RHS to a slow entry call. For example, if CTX binds a to a lambda, that information is lost in the right-hand side of poly_f. This can increase runtime.
- Replacing a floated binder's occurrence (ie f becomes poly_f a b c) can add free variables to a thunk's closure, which increases allocation.
- Abstracting over a let-no-escape binder renders it a normal let, which increases allocation.
Mitigating PAP Creation
This is the simplest to mitigate: we do not float f if it ever occurs unapplied.
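The check can be sketched on a toy expression type. Everything here (Expr, occursUnapplied, the two examples) is hypothetical and far simpler than the real analysis over Core in SetLevels; in particular it only asks whether f appears outside the head of an application and ignores saturation.

```haskell
-- Toy expression language (hypothetical; not GHC's CoreExpr).
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  | Let String Expr Expr       -- non-recursive let
  deriving Show

-- | True if the variable occurs anywhere other than at the head of an
-- application, respecting shadowing by Lam and Let.
occursUnapplied :: String -> Expr -> Bool
occursUnapplied f = go False
  where
    -- The flag records whether this node is in function position of an App.
    go inHead (Var v)     = v == f && not inHead
    go _      (App g a)   = go True g || go False a
    go _      (Lam x b)   = x /= f && go False b
    go _      (Let x r b) = go False r || (x /= f && go False b)

-- map f xs: f is passed as an argument, so floating it would create a PAP.
mapExample :: Expr
mapExample = App (App (Var "map") (Var "f")) (Var "xs")

-- f x: f is only ever called.
callExample :: Expr
callExample = App (Var "f") (Var "x")
```

With this sketch, occursUnapplied "f" mapExample holds while occursUnapplied "f" callExample does not, so only the second f would be a candidate for floating.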
Mitigating Thunk Growth
In nucleic2, we floated a binding with 11 free variables. But this binder occurred in about 60 thunks, so many closures grew by ~11 pointers, giving a +2.2% allocation change (as opposed to -0.9%).
We've considered three heuristics for avoiding this. In ascending complexity:
- (easy) Limit the number of free variables the binding is allowed.
- in-thunk: If f occurs inside of a thunk in BODY, then limit its free variables.
- thunk-growth: Approximate the maximum number of free variables that floating f would add to a thunk in BODY, and limit that.
We did not implement the first one, since in-thunk is not very complicated. thunk-growth is significantly more complicated.
- The question of whether f occurs in a thunk is not simple.
  - We count non-trivial arguments as thunks; but not all non-trivial arguments end up as thunks.
  - We do not count lambda-forms as thunks, since the lambda will hopefully be floated.
- Estimating the effect of floating f on such a thunk's (call it t) closure size is more complicated.
  - Another floated function (say g) may also add some of f's free variables to t; we shouldn't penalize both f and g for that.
  - If f itself has a free variable, say h, which is a binder that gets floated, then floating f will also add h's free variables to t.
Therefore, these are rough approximations. Being more accurate would require changing the setLevels pass instead of just the simpler first pass (the one that only analyzes the original term).
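The three heuristics can be compared as predicates over a summary of the candidate binding. This is only a sketch under simplifying assumptions: the Candidate record and the three function names are hypothetical, each thunk is summarised by the free variables it already captures, and none of the complications listed above (other floated binders, floated free variables of f) are modelled.

```haskell
import Data.List ((\\))

type Var = String

-- Hypothetical summary of a candidate binding f.
data Candidate = Candidate
  { freeVars :: [Var]    -- free variables of f's RHS (no duplicates)
  , thunkFVs :: [[Var]]  -- for each thunk in BODY mentioning f: the
                         -- variables that thunk already captures
  }

-- (easy) Limit the number of free variables outright.
fvLimitOk :: Int -> Candidate -> Bool
fvLimitOk limit c = length (freeVars c) <= limit

-- in-thunk: the limit applies only if f occurs inside some thunk.
inThunkOk :: Int -> Candidate -> Bool
inThunkOk limit c = null (thunkFVs c) || length (freeVars c) <= limit

-- thunk-growth: limit the largest number of *new* variables that any one
-- thunk would have to capture once f is replaced by poly_f FVS.
thunkGrowthOk :: Int -> Candidate -> Bool
thunkGrowthOk limit c =
  all (\captured -> length (freeVars c \\ captured) <= limit) (thunkFVs c)
```

For a binding with 11 free variables that occurs in a thunk already capturing 9 of them, a limit of 10 rejects it under in-thunk but accepts it under thunk-growth, since the thunk would only grow by 2.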
We tried limits of 32, 16, 8, and 4 to differentiate between the last two. At a limit of 8, the allocation increases in ida and puzzle were 1.3 and 2.9 percentage points better with thunk-growth than with in-thunk. But there were no differences around 10 --- which is the lowest we can go while improving nucleic2, anyway --- so we're adopting in-thunk for now.
There might have been some potential benefits to run-time from thunk-growth versus in-thunk (with limit=8, 35 percentage points better on constraint, eg), but we're not confident in those measurements.
Preserving Fast Entries
The first idea here was simply: do not float a binding if its RHS applies a free variable.
But since the idea was to avoid losing fast entries, this only applies to saturated and oversaturated calls. As a sanity check, however, I added two flags.
- -f(no-)late-float-abstract-undersat-var: the no- form doesn't allow undersaturated applications
- -f(no-)late-float-abstract-sat-var: the no- form doesn't allow saturated or oversaturated applications
Ever since, I've been doing parameter sweeps over these as we make other refinements to the system.
- nn - do not float a binding that applies one of its free variables.
- yn - do not float a binding that applies one of its free variables saturated or oversaturated.
- ny - do not float a binding that applies one of its free variables undersaturated.
- yy - do not restrict application of the binding's free variables
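The four variants can be read as a small decision table. In this sketch the first letter corresponds to the undersat flag and the second to the sat flag; the types Use and Variant and the function mayFloat are hypothetical names for illustration, not what SetLevels actually defines.

```haskell
-- How a candidate's RHS applies one of its free variables.
data Use = Undersaturated | Saturated | Oversaturated
  deriving (Eq, Show)

-- The two switches: first letter = abstract-undersat-var,
-- second letter = abstract-sat-var.
data Variant = Variant
  { abstractUndersat :: Bool
  , abstractSat      :: Bool
  }

nn, yn, ny, yy :: Variant
nn = Variant False False
yn = Variant True  False
ny = Variant False True
yy = Variant True  True

-- | May we float a binding whose RHS applies its free variables in the
-- given ways? An empty list means it applies none of them.
mayFloat :: Variant -> [Use] -> Bool
mayFloat v = all allowed
  where
    allowed Undersaturated = abstractUndersat v
    allowed _              = abstractSat v   -- saturated or oversaturated
```

So mayFloat yn [Saturated] is False (the fast entry is protected), while mayFloat ny [Saturated] is True.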
There was no variant that bested the others on most programs' runtime, and the data was so noisy that it was difficult to choose which tests to investigate. I eventually developed a bash script (I'm so sorry) to transpose the NoFib loops: instead of running the entire NoFib suite for one set of switches, then running it again for the next set of switches, and so on, I build all the variants and then run each variant's version of each program sequentially. I intend for this to reduce noise by improving the time locality of the measurements of the same test. Even so, the noise in Runtime was bad. Eventually, I turned the iterations up to the 40/50 range and found some steadiness. To my surprise, there were a couple of tests that had the best Runtime if we *do* abstract over functions with fast calls! More on this later (cf #puzzle-time-issue).
I also saw some surprises in allocation. Roughly, we expect that more floating means (barely) less allocation but worse runtime (by how much?) because some known calls become unknown calls. But, eg, going from nn -> yn --- ie floating functions that undersaturate free variables instead of not floating them --- caused worse allocation! This investigation led to #MitigatingLNEAbstraction.
Based on that example, it occurred to me that we should only restrict the binding's saturation of its *known* free variables... duh. For example, we should float a binding even if its RHS exactly applies a free variable when that free variable is lambda bound. Not floating in that case has no benefit, and indeed was causing knock-on effects that increase allocation (eg #MitigatingLNEAbstraction).
After the allocation investigation led to some code tweaks, I reran the runtime tests with high iterations. hpg and puzzle were odd cases where ny did the best, by far. Both have low run times, so I cranked the iterations up to 200. hpg's results changed drastically, which I haven't yet resolved. But puzzle remained; see #puzzle-time-issue.
I have yet to determine that the preservation of fast entries is worth the trouble --- I certainly hope so... the parameter sweeps have taken a lot of time!
To enable further measurements, I have identified the semantics of some ticky counters, cf #TickyCounters, and started resurrecting useful ones that are no longer enabled.
Mitigating LNE Abstraction
We had actually already seen this for a non-lambda join point in knights, but we were preoccupied with the unintentional existence of non-lambda join points and moved on after fixing those. I re-discovered this while experimenting with the fast-entry preservation variants above.
NB I think this will be mitigated "for free", since I'm predicting that we will never abstract variables that occur exactly saturated and an LNE binder can only be exactly saturated. If we do end up abstracting over saturated functions, we may want to consider mitigating this separately.
Using -flate-float-in-thunk-limit=10, -fprotect-last-arg, and -O1, I tested the libraries+NoFib for the four variants from #PreservingFastEntries. In fish (1.6%), hpg (~4.5%), and sphere (10.4%), allocation gets worse for ny and yy compared to nn and yn. The nn and yn variants do not change the allocation compared to the baseline library (ie no LLF).
The nn -> ny comparison is counter to our rough idea: floating more bindings (those that saturate/oversaturate some free variables) worsens allocation. So I investigated.
The sphere program hammers hPutStr. Its extra allocation is mostly due to a regression in GHC.IO.Encoding.UTF8. Here's the situation.
With the nn variant:
outer a b c ... =
  let-no-escape f x = CTX[let-no-escape $j y = ... (f ...) ...
                          in CTX2[$j]]
  in ...
In this case, $j is not floated because it applies f. With the ny variant, $j gets floated.
poly_$j a b c ... f y = ...
outer a b c ... =
  let f x = CTX[CTX2[poly_$j a b c ... f]]
  in ...
Thus f cannot be let-no-escape because it now occurs as an argument to poly_$j.
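Here is the same shape as a small source-level sketch. The functions outerNN, outerNY and this poly_j are invented for illustration (and assume b > 0 so the recursion terminates). Whether GHC really treats f as let-no-escape in the first version depends on what the code generator sees after optimization; the point is only that in the second version f is passed as an argument.

```haskell
-- nn variant: j stays local, so f is only ever tail-called.
outerNN :: Int -> Int -> Int -> Int
outerNN a b n =
  let f x
        | x <= 0    = a
        | otherwise =
            let j y = f (y - b)          -- j applies the free variable f
            in  if even x then j x else j (x - 1)
  in  f n

-- ny variant, written by hand: j has been floated to the top level, so f
-- is now one of its parameters.
poly_j :: Int -> (Int -> Int) -> Int -> Int
poly_j b f y = f (y - b)

outerNY :: Int -> Int -> Int -> Int
outerNY a b n =
  let f x
        | x <= 0    = a
        | otherwise =
            -- f occurs as an argument here, so it needs a real closure.
            if even x then poly_j b f x else poly_j b f (x - 1)
  in  f n
```

Both versions return the same value; the cost shows up only as the closure allocated for f on each entry to the outer function.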
This contributes to sphere's 1 megabyte of extra allocation for two reasons:
- outer is entered about 60,000 times.
- The RHS of f has 13 free variables, so its closure is rather large.
13*60,000 ~ 750,000. I suspect the rest of sphere's increase is due to a similar issue in GHC.IO.Handle.
In hpg, it's principally due to GHC.IO.Encoding.UTF8 again, with a second place contributor of GHC.IO.FD, where the function $wa17 is again like the outer example above, but with fewer free variables and thus less effect.
Discovered Benefits of LLF
We haven't seen as much decrease in allocation as we would have liked, but there have been some nice benefits:
Creates Inliner Opportunities
Floating functions to the top-level creates more opportunities for the inliner. We've found two ways.
- #7663 - simulates having the inliner discount for free variables like it discounts for parameters
- It also decreases size of functions by floating out internal let-bindings (eg big join points, etc).
Both of these have been observed on puzzle, with commit feec91b71, in-thunk-limit=10, protect-last-arg. We get a big improvement in both allocation (-15.1%) and runtime (-1.4%) by allowing fast entries to be abstracted over. Oddly, if we additionally disallow undersat known calls to be abstracted over, we get another runtime boost (up to -3.9%). These are both unfortunate from the fast-entry perspective, but demonstrate a big win.
In particular, the worker for the derived equality function for the StateType contains a join-point. When the join-point is floated, the worker's Guidance goes from
IF_ARGS [70 70 80 0 60 60 60 0] 360 20
to
IF_ARGS [130 0 0 0 60 0 0 0] 220 20
while the floated join point itself gets a Guidance of
IF_ARGS [170 160 0 60 120 0 0] 300 60
The loss of parameter discounts may be bad, but the reduction in size exemplifies a good thing.
But there's a bigger change in puzzle's main loop: $wtransfer gets a 28% reduction in allocation. Furthermore, one of its contained letrecs gets a 56% reduction. This results in a 15% reduction for the whole program.
TODO I need ticky to track LNEs in order to pin down what's happening there.
Creates Simplifier Opportunities
Floating functions to the top-level creates more opportunities for the simplifier.
Abstracted from boyer2 (where f is a join point):
CTX[case a of [p1 -> let f x = ... in case a of ...]]
The let prevents the cases from being merged. Since LLF is so aggressive, it floats f when it otherwise wouldn't be, enabling the case merge.
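A minimal source-level sketch of the effect, with a made-up type T and made-up functions before, after and poly_f (boyer2's actual code is much larger, and I am only imitating what the simplifier is then free to do):

```haskell
data T = A | B
  deriving (Eq, Show)

-- Before: the local let for f sits between the two cases on t.
before :: T -> Int -> Int
before t n = case t of
  A -> let f x = x + n
       in  case t of
             A -> f 1
             B -> f 2
  B -> 0

-- After (by hand): f is floated to the top level as poly_f, so the inner
-- case on t sits directly under the outer A alternative and collapses.
poly_f :: Int -> Int -> Int
poly_f n x = x + n

after :: T -> Int -> Int
after t n = case t of
  A -> poly_f n 1
  B -> 0
```

The two functions agree on all inputs; the second is what we hope to end up with once the let no longer separates the cases.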
Thunk Join Points
We discovered that the worker-wrapper was removing the void argument from join points (eg knights and mandel2). This ultimately resulted in LLF *increasing* allocation. A thunk was let-no-escape before LLF but not after, since it occurred free in the right-hand side of a floated binding and hence now occurred (escapingly) as an argument.
SPJ was expecting no such non-lambda join points to exist. We identified where it was happening (WwLib.mkWorkerArgs) and switched it off. Here are the programs whose allocation was affected.
protect-no  = allow wwlib to remove the last value argument (ie the previous behavior)
protect-yes = protect the last value argument from being removed (ie the experimental behavior)
Both are applied to both the libraries and the program.

Allocations
-------------------------------------------------------------------------------
        Program           protect-no          protect-yes
-------------------------------------------------------------------------------
        circsim           1326468688                -0.7%
         hidden           1165299720                -0.7%
            scs           1029909256                -0.1%
      transform            738757608                -0.1%
      cacheprof            478120432                +0.3%
       listcopy            334710912                -0.4%
  comp_lab_zift            330889440                -5.0%
         fulsom            321534872                -0.3%
      listcompr            304662896                -0.4%
           anna             70685104                +0.1%
         gamteb             59846096                -0.3%
         parser             32406448                +0.2%
             gg              8970344                -0.2%

        -1 s.d.                -----                -0.6%
        +1 s.d.                -----                +0.5%
        Average                -----                -0.1%
In circsim, put gets 1,000,000 better and Text.Read.Lex gets a tiny bit better.
In hidden, it's Text.Read.Lex, Text.ParserCombinators.ReadP, and GHC.Read.
In cacheprof, $wpreparseline gets a bit worse.
In comp_lab_zift: f_make_tree gets 2,060,000 better and f_union_br gets 1,500 better.
In parser, Main.leLex gets worse (5000).
In gg, Graph.$wprintFloat gets worse (12 -> 84).
Bigger swings in allocation (mostly good) took place in the programs themselves (eg transform.f_prmdef ~130,000 better, listcopy.f_se ~150,000 better).
Many of the Core differences were of the following form. For example, see circsim's put function. When protecting the last argument from being removed by WwLib.mkWorkerArgs, the Core looks like this:
let x :: RealWorld# -> ...
    x = \_void -> let d = ... in Ctor(... d ...) (... d ...) ...
in CTX[x]
Without protection, it looks like:
let d = ... in CTX[Ctor(... d ...) (... d ...) ...]
Simon explained that it is probably the simplifier floating d out of the unprotected x binding *in order to reveal x as let-bound to a constructor*. Thus revealed, x is immediately inlined. Because of the \_void, this doesn't happen when the last argument is protected.
With protection, d isn't allocated unless x is entered, which might not always happen in CTX. This is a potential win because x might be let-no-escape.
A potential detriment of protection is that x is not exposed as a let-bound constructor. Simon conjectures that's not actually harmful. The reasoning is as follows.
These void arguments arise in two ways.
- when join points are created
- the strictness pass on constant functions
In both cases, it is unlikely that revealing the binding's RHS as a HNF will lead to any beneficial optimizations.
- Join points are unlikely to be case-scrutinized. It's unlikely that further simplification will render them scrutinized.
- Removing the value arg from constant functions would create sharing, which SPJ says is always a "dodgy" thing to do. If the programmer defines and uses a constant function, they may be trying to avoid retention of large data structures. I was concerned that such constant functions might arise upstream (eg from use of generics), but he regards that unlikely/not worth it (because the optimization is not always a good thing).
Effect on Expression Sizes
When we protect the void arguments from being removed by WwLib.mkWorkerArgs, we affect unfolding decisions.
- The body of the let that keeps its void argument is bigger because of the void arg's lambda
- It also gets a different result discount because of the lambda (instead of the lambda's body's result discount)
- The entire let expression itself is larger because:
  - The let's body is larger (see above)
  - The occurrences of the let have the additional application (supplying the void actual)
- Parameters' scrut discounts have also changed, but I didn't try figuring out why --- it was already clear we should do something about this.
Just turning on the -fprotect-last-arg flag was generally helpful for allocation. It was also generally helpful for runtime: somewhere between +1% and -3%. But a couple were > +4%. In a couple of those cases, I discovered that the increased expression size was breaching the unfolding creation threshold (in a base library module). This actually prevented some inlining.
I implemented the -funfolding-ignore-RealWorld flag by ignoring arguments of type State# RealWorld when measuring applications and lambdas. The only subtlety was that we still need to count the void args when considering whether to award the partial application result discount. Otherwise, every invocation of a join point meant a huge discount (ufFunAppDiscount). In a couple of programs, this increased allocation drastically because the newly inlined function's RHS included a join point that lost its LNE status at the call-site because the entire inlined RHS was case-scrutinized.
The switch improved the runtimes of -fprotect-last-arg, especially the formerly bad ones. Two exceptional cases:
- rewrite gets +3% allocation when ignoring RealWorld (regardless of protection) and
- hidden gets +15% run time with both protecting and ignoring.
TODO I'm investigating these now.
SPJ thought this may be another means of hoisting the let-no-escape functionality from the code generator into the core2core pipeline. LLF would handle let-no-escape lambdas, but it does not handle let-no-escape thunks (which we didn't initially realize were being identified).
Changing the let-no-escape thunks to \void -> ... closures upstream would then subject the binding to more optimisations. Formerly, its non-lambda status meant that inlining it, eg, would lose sharing.