Float once-used let binding into a recursive function

changed weight to 5

Looking at the code, maybe this is a viable course of action:

In FloatIn, in the equation of fiExpr for recursive bindings, where extra_fvs are calculated, exclude any variables that are free variables of a right-hand-side of the binding if they have a Dmd signature that indicates that they are used at most once. This allows them to float inside the let

Then we still need to ensure that they pass past the lambda. So the lambda case of fiExpr needs to be improved to separate the used-once floats from the others and float only those in.

I’m not sure if the float in pass is the right place to do this, though. Shouldn’t the simplifier be able to do these things? So maybe simplRecBind should not zap all floats, but rather distinguish between those that may float into a recursive group and the others?

This needs advise from one of the experts on the simplifier.

I think float-in is the right place. The simplifier does only strictly-local floating; and only float out. Really the only floating transformation the simplifier does is

  let  
     x = let y = e in (e1:e2)
  in ...
===>
  let y = e in
  let x = e1:e2 in ...

So yes: float-in!

Simon

Thanks for the input!

Next thing to consider problem: Can we float it past the case analysis (or past multiple recursive bindings): In

  let x = f x0
  in let go 30 = x
         go 20 = x
         go 10 = something else
         go i = go (i+1)
     in go y

we know that x is evaluated at most once, so we want to float it in. Also, x might actually not be needed, so we would gain something by not allocating it before hand. But we also definitely not want it to float past the let and the lambda, just to get

  let go i =
    let x = f x0
    in case i of
         30 -> x
         20 -> x
         10 -> something else
         i  -> go (i+1)
  in go y

as now we would allocate it on every call.

So either, floats that are allowed to go into a recursive group must also go past a case, possibly duplicating it:

  let go 30 = let x = f x0 in x
      go 20 = let x = f x0 in x
      go 10 = something else
      go i  = go (i+1)
  in go y

But this would blow up the code size... .

A more cautious route would be to only float something inside a recursive group if it is used at most once and there is only one syntactical call to it, because then we can be reasonably sure that it will also float into the branches.

Is there an easy way to detect that something has only one syntactic occurence?

Is it not possible to tell the inliner to do this work and decision making?

Oh, you are right! The inliner is a much better place to do this, specifically, the code in preInlineUnconditionally.

It deals with the case where we have

let x = <expression> in ....x.....

where there is a single syntactic occurrence of x, not inside a lambda (unless it's a one-shot lambda). The occurrence analyser marks x with an OccInfo of OneOcc.

So you want to teach the occurrence analyser how to make x as "occurs once" even though it occurs inside a lambda which is called more than once (the one for go). Or perhaps, when the occurrence analyser is about to mark it it as OneOcc True ..., where the True is the InsideLam info, we can switch the InsideLam info to False if x is marked "demanded once"

In haste

Simon

Hey, where is my comment I reported something here a few days ago :-( Darn it. Maybe I should open a ticket asking for the removal of the “Preview” button (after all, there is a live preview these days).

Anyways, I made CallArity consider all variables interesting, so that we actually collect cardinality information on things that cannot be eta-expanded, made it store the once-used-information:

    v'' | called_once = v' `setIdDemandInfo` oneifyDmd (idDemandInfo v')
        | otherwise   = v'

and then made preInlineUnconditionally use this information:

     try_once in_lam int_cxt     -- There's one textual occurrence
         | not in_lam = isNotTopLevel top_lvl || early_phase
-        | otherwise  = int_cxt && canInlineInLam rhs
+        | otherwise  = (int_cxt && canInlineInLam rhs) || isSingleUsed (idDemandInfo bndr)

This had the desired effect of changing

  let x = f x0
  in let go 20 = x
         go 10 = something else
         go i = go (i+1)
     in go y

to

     let go 20 = f x0
         go 10 = something else
         go i = go (i+1)
     in go y

in the simplifier phase following Call Arity, as expected. But the later FloatOut pass would simply float f x0 out again to where it was before.

I’m not sure how to prevent that. In general, floating something out of a recursive group is good, and the information that was used to effect the inlining (namely the once-used of x) is lost, as no x remains.

Good point. The only obvious thing is to do arity/strictness analysis after the last use of float-out. And in fact with -flate-dmd-anal we do a late strictness analysis pass, which is not followed by float-out. I suppose you might need to add a last arity-analysis pass to that?

I suppose that would work, although I don’t like solving such issues by slamming just another pass to the end of the pipeline, instead of making sure the passes work well together (which is indeed tricky here).

I’ll give it a shot to see if there are performance wins to be gained (I don’t actually expect much, I was mostly hoping for nice core), and if there are not many, I’d rather not pay the cost of yet another pass until there is a more compelling use case.

Well, currently -flate-dmd-anal is an -O2 thing, which also seems appropriate here.

It's also possible (measure!) that a late run of arity analysis might be able to exploit information that was not available early; that might have performance benefits all by itself. That was the reason we introduced -flate-dmd-anal, for example.

Prelimary results: In queens allocation goes down by 17%, due to this change to the core (which is pretty much a poster child for what I am aiming for)

                          go =
                            \ (ds3 :: [[Int]]) ->
                              case ds3 of _ [Occ=Dead] {
                                [] -> [] @ [Int];
                                : y ys ->
-                                 let {
-                                   z :: [[Int]]
-                                   [LclId, Str=DmdType]
-                                   z = go ys } in
                                  letrec {
                                    go1 [Occ=LoopBreaker] :: [Int] -> [[Int]]
                                    [LclId, Arity=1, Str=DmdType <S,1*U>]
                                    go1 =
                                      \ (ds4 :: [Int]) ->
                                        case ds4 of _ [Occ=Dead] {
-                                         [] -> z;
+                                         [] -> go ys;
                                          : y1 ys1 ->

Multiple similar changes in cryptarithm2, 8% improved allocation.

But in paraffins something goes very wrong, +1290% allocation. Needs more investigation.

One effect, which might or might not be the reason for the paraffin regression (but certainly obscures the view) is that an inlined expression seems to float out further than a let-bound expression. More explicitly, consider this code:

foo :: [[Bool]] -> [Bool]
--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]
foo [] = []
foo (y:ys) =
    case y of
        [] -> foo ys
        (x:xs) ->
            let z = foo ys in
            let go [] = z
                go (y':ys') = not y' : go ys'
            in  not x : go xs

With my change, this will be turned into

foo :: [[Bool]] -> [Bool]
--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]
foo [] = []
foo (y:ys) =
    case y of
        [] -> foo ys
        (x:xs) ->
            let go [] = foo ys
                go (y':ys') = not y' : go ys'
            in  not x : go xs

If the compiler would stop here, I’d be happy.

But instead, something interesting happens. In the pristine case, the binding to z is not affected by the level set:

(let {                             
   <z,<1,3>>                       
   <z,<1,3>> = T10918a.foo ys } in

whereas the plain foo ys expression got wrapped in a binding and is assigned a level further out:

[] ->                                  
  let {                                
    <lvl,F<1,1>>                       
    <lvl,F<1,1>> = T10918a.foo ys } in 
  lvl;

The F<1,1> is – as expected – the binding site of ys.

But now the first invocation of foo ys is in scope of the lvl, CSE happens, the variable is no longer used at most once, it cannot be inlined back and we end up with

foo :: [[Bool]] -> [Bool]
--foo input = [ not y | (x:xs) <- input, y <- (x:xs) ]
foo [] = []
foo (y:ys) =
    let lvl = foo ys
    case y of
        [] -> lvl
        (x:xs) ->
            let go [] = lvl
                go (y':ys') = not y' : go ys'
            in  not x : go xs

which is not what I want to see.

So the issue here seems to be that under some circumstances, an e in C[e] can be floated out further than let z = e in C[z].

(All this does not look too server in this case, but it happens a few times in paraffin, so I hope that this is actually related to the allocation increase.)

A side effect of Call Arity setting the once-used flag on bindings is that the code generator does not create update code for such thunks (good!). With paraffin something goes wrong, and allocation skyrockets. Unfortnately, I was not able to pin-point what goes wrong, despite pulling out the ticky-ticky-hammer.

(Wishful thinking: A tool that takes a -ddump-prep and a ticky report, annotates the bindings with the ticky numbers, and then removes all uniques from the report, so that I can diff that with a different compilation without the uniques obscuring the diff.)

With paraffin something goes wrong

Maybe this was due to #11064 (closed), I’ll revisit this ticket soonish.

I suppose that would work, although I don’t like solving such issues by slamming just another pass to the end of the pipeline, instead of making sure the passes work well together (which is indeed tricky here).

I agree with the principle here. But it does seem hard. The float-out pass assumes, crudely, that it's good to float out a redex of any called-many lambda. But, as we see here, that's wrong for case branches that are only evaluated on one of those calls (the final one in this case). Not only is that info hard to record in the syntax tree, but it's also potentially quite fragile to program transformation, like other sorts of cardinality information.

So refraining from let-floating after the final call-arity/simplifier pass does seem plausible.

Annoyingly, it's just possible that inlining (foo ys into that [] branch might then put it in a context when foo inlines, leading to a cascade of further transformations. So it's not necessarily just a little delta.

It'd be good to understand the paraffins thing.

mentioned in issue #14137

mentioned in issue #14152 (closed)

added Pnormal label

changed the description

Trac field	Value
Version	7.10.2
Type	Task
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

Float once-used let binding into a recursive function

Child items 0

Activity