Currently, this is pretty much the core that comes out at the end. But what we want to see instead is
letgo10=letx=fx0inxgoi=go(i+1)ingo(0::Int)
In general, we do not want to float a binding into a recursive function, because we would risk doing the allocation and/or evaluation of it multiple times. But in this case, we can see that it is going to be used at most once, so it is safe to do so. Even more: In the slightly less contrived examples that I was looking at, the call to x was happening in the a less likely code path, so this way we’d avoid doing the allocation in most cases, a clear win.
It might be enough to simply make CallArity (or rather the cardinality analysis done by call arity) tell the rest of the compiler that it found that x is called at most once, and hopefully the simplifier knows what to make of that information.
Trac metadata
Trac field
Value
Version
7.10.2
Type
Task
TypeOfFailure
OtherFailure
Priority
normal
Resolution
Unresolved
Component
Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
Edited
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related or that one is blocking others.
Learn more.
Looking at the code, maybe this is a viable course of action:
In FloatIn, in the equation of fiExpr for recursive bindings, where extra_fvs are calculated, exclude any variables that are free variables of a right-hand-side of the binding if they have a Dmd signature that indicates that they are used at most once. This allows them to float inside the let
Then we still need to ensure that they pass past the lambda. So the lambda case of fiExpr needs to be improved to separate the used-once floats from the others and float only those in.
I’m not sure if the float in pass is the right place to do this, though. Shouldn’t the simplifier be able to do these things? So maybe simplRecBind should not zap all floats, but rather distinguish between those that may float into a recursive group and the others?
This needs advise from one of the experts on the simplifier.
I think float-in is the right place. The simplifier does only strictly-local floating; and only float out. Really the only floating transformation the simplifier does is
let x = let y = e in (e1:e2) in ...===> let y = e in let x = e1:e2 in ...
Next thing to consider problem: Can we float it past the case analysis (or past multiple recursive bindings):
In
let x = f x0 in let go 30 = x go 20 = x go 10 = something else go i = go (i+1) in go y
we know that x is evaluated at most once, so we want to float it in. Also, x might actually not be needed, so we would gain something by not allocating it before hand. But we also definitely not want it to float past the let and the lambda, just to get
let go i = let x = f x0 in case i of 30 -> x 20 -> x 10 -> something else i -> go (i+1) in go y
as now we would allocate it on every call.
So either, floats that are allowed to go into a recursive group must also go past a case, possibly duplicating it:
let go 30 = let x = f x0 in x go 20 = let x = f x0 in x go 10 = something else go i = go (i+1) in go y
But this would blow up the code size... .
A more cautious route would be to only float something inside a recursive group if it is used at most once and there is only one syntactical call to it, because then we can be reasonably sure that it will also float into the branches.
Is there an easy way to detect that something has only one syntactic occurence?
Is it not possible to tell the inliner to do this work and decision making?
Oh, you are right! The inliner is a much better place to do this, specifically, the code in preInlineUnconditionally.
It deals with the case where we have
let x = <expression> in ....x.....
where there is a single syntactic occurrence of x, not inside a lambda (unless it's a one-shot lambda). The occurrence analyser marks x with an OccInfo of OneOcc.
So you want to teach the occurrence analyser how to make x as "occurs once" even though it occurs inside a lambda which is called more than once (the one for go). Or perhaps, when the occurrence analyser is about to mark it it as OneOcc True ..., where the True is the InsideLam info, we can switch the InsideLam info to False if x is marked "demanded once"
Hey, where is my comment I reported something here a few days ago :-( Darn it. Maybe I should open a ticket asking for the removal of the “Preview” button (after all, there is a live preview these days).
Anyways, I made CallArity consider all variables interesting, so that we actually collect cardinality information on things that cannot be eta-expanded, made it store the once-used-information:
and then made preInlineUnconditionally use this information:
try_oncein_lamint_cxt-- There's one textual occurrence|notin_lam=isNotTopLeveltop_lvl||early_phase-|otherwise=int_cxt&&canInlineInLamrhs+|otherwise=(int_cxt&&canInlineInLamrhs)||isSingleUsed(idDemandInfobndr)
in the simplifier phase following Call Arity, as expected. But the later FloatOut pass would simply float f x0 out again to where it was before.
I’m not sure how to prevent that. In general, floating something out of a recursive group is good, and the information that was used to effect the inlining (namely the once-used of x) is lost, as no x remains.
Good point. The only obvious thing is to do arity/strictness analysis after the last use of float-out. And in fact with -flate-dmd-anal we do a late strictness analysis pass, which is not followed by float-out. I suppose you might need to add a last arity-analysis pass to that?
I suppose that would work, although I don’t like solving such issues by slamming just another pass to the end of the pipeline, instead of making sure the passes work well together (which is indeed tricky here).
I’ll give it a shot to see if there are performance wins to be gained (I don’t actually expect much, I was mostly hoping for nice core), and if there are not many, I’d rather not pay the cost of yet another pass until there is a more compelling use case.
Well, currently -flate-dmd-anal is an -O2 thing, which also seems appropriate here.
It's also possible (measure!) that a late run of arity analysis might be able to exploit information that was not available early; that might have performance benefits all by itself. That was the reason we introduced -flate-dmd-anal, for example.
Prelimary results: In queens allocation goes down by 17%, due to this change to the core (which is pretty much a poster child for what I am aiming for)
One effect, which might or might not be the reason for the paraffin regression (but certainly obscures the view) is that an inlined expression seems to float out further than a let-bound expression. More explicitly, consider this code:
foo::[[Bool]]->[Bool]--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]foo[]=[]foo(y:ys)=caseyof[]->fooys(x:xs)->letz=fooysinletgo[]=zgo(y':ys')=noty':goys'innotx:goxs
With my change, this will be turned into
foo::[[Bool]]->[Bool]--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]foo[]=[]foo(y:ys)=caseyof[]->fooys(x:xs)->letgo[]=fooysgo(y':ys')=noty':goys'innotx:goxs
If the compiler would stop here, I’d be happy.
But instead, something interesting happens. In the pristine case, the binding to z is not affected by the level set:
(let { <z,<1,3>> <z,<1,3>> = T10918a.foo ys } in
whereas the plain foo ys expression got wrapped in a binding and is assigned a level further out:
[] -> let { <lvl,F<1,1>> <lvl,F<1,1>> = T10918a.foo ys } in lvl;
The F<1,1> is – as expected – the binding site of ys.
But now the first invocation of foo ys is in scope of the lvl, CSE happens, the variable is no longer used at most once, it cannot be inlined back and we end up with
foo::[[Bool]]->[Bool]--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]foo[]=[]foo(y:ys)=letlvl=fooyscaseyof[]->lvl(x:xs)->letgo[]=lvlgo(y':ys')=noty':goys'innotx:goxs
which is not what I want to see.
So the issue here seems to be that under some circumstances, an e in C[e] can be floated out further than let z = e in C[z].
(All this does not look too server in this case, but it happens a few times in paraffin, so I hope that this is actually related to the allocation increase.)
A side effect of Call Arity setting the once-used flag on bindings is that the code generator does not create update code for such thunks (good!). With paraffin something goes wrong, and allocation skyrockets. Unfortnately, I was not able to pin-point what goes wrong, despite pulling out the ticky-ticky-hammer.
(Wishful thinking: A tool that takes a -ddump-prep and a ticky report, annotates the bindings with the ticky numbers, and then removes all uniques from the report, so that I can diff that with a different compilation without the uniques obscuring the diff.)
I suppose that would work, although I don’t like solving such issues by slamming just another pass to the end of the pipeline, instead of making sure the passes work well together (which is indeed tricky here).
I agree with the principle here. But it does seem hard. The float-out pass assumes, crudely, that it's good to float out a redex of any called-many lambda. But, as we see here, that's wrong for case branches that are only evaluated on one of those calls (the final one in this case). Not only is that info hard to record in the syntax tree, but it's also potentially quite fragile to program transformation, like other sorts of cardinality information.
So refraining from let-floating after the final call-arity/simplifier pass does seem plausible.
Annoyingly, it's just possible that inlining (foo ys into that [] branch might then put it in a context when foo inlines, leading to a cascade of further transformations. So it's not necessarily just a little delta.