Opened 4 years ago

Last modified 2 weeks ago

#9476 patch feature request

Implement late lambda-lifting

Reported by: simonpj Owned by: sgraf
Priority: normal Milestone: 8.8.1
Component: Compiler Version: 7.8.2
Keywords: LateLamLift Cc: bgamari, kavon, sgraf, maoe
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8763 #13286 Differential Rev(s): Phab:D5224
Wiki Page: LateLamLift

Description (last modified by simonpj)

This ticket exists to coordinate work on Nick Frisby's late lambda-lifting optimisation.

  • Branch in GHC repo: wip/llf
  • Related tickets
    • #9279: local closure remains in final program
    • #9374: static argument transformation
    • #5945 (closed as dup, but with useful example)
    • #11284 (closed as dup, but with useful example)

Attachments (6)

unknown.txt (5.0 KB) - added by sgraf 3 months ago.
allow turning known into unknown calls
rec-5-6-8-10.txt (9.7 KB) - added by sgraf 3 months ago.
Varying max number of recursive parameters
nonrec-5-6-8-10.txt (9.7 KB) - added by sgraf 3 months ago.
Varying max number of non-recursive parameters
8-8.txt (5.0 KB) - added by sgraf 3 months ago.
Allowing max 8 (non-)recursive parameters
f-5-5.txt (11.3 KB) - added by sgraf 3 months ago.
Don't turn known into unknown calls, allow max 5 (non-)recursive non-void parameters after lifting, don't lift things with undersaturated applications or partial applications
nofib.txt (4.9 KB) - added by sgraf 2 weeks ago.
Runtime measurements, ignoring all with <200ms runtime, LLVM backend with -Os


Change History (58)

comment:1 Changed 4 years ago by simonpj

Owner: set to nfrisby

comment:2 Changed 4 years ago by simonpj

Thoughts about the wiki page (which is a fantastic start):

  • The "extended consequences" section is great. But could you give a small example of each point? Otherwise it is hard to grok.

There are lots of other places where an example would be extremely useful. To take just one: "If an LNE function f occurs in another LNE function g and we only lift g, then it will spoil f: it will no longer be an LNE, because it will now be an argument to llf_g."
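A minimal sketch of the quoted situation (hypothetical names; both f and g are tail-called only, hence LNEs):

let f x = ... in
let g y = ... f y ... in
g a

==>

llf_g f y = ... f y ...

let f x = ... in
llf_g f a

After lifting only g, f occurs as an argument to llf_g, so it escapes and can no longer be compiled as an LNE.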

  • I recall that you implemented a more-or-less complex analysis to try to get the good consequences without the bad ones. Is it worth sketching what the analysis does? The complexity here is my principal worry about the whole thing.
  • Small point: the $(theRHS 'a 'b) notation is more distracting than helpful. I'd use something more informal (..a...b...).

comment:3 Changed 3 years ago by simonpj

Description: modified (diff)

comment:4 Changed 3 years ago by bgamari

Cc: bgamari added
Description: modified (diff)

comment:5 Changed 3 years ago by bgamari

Description: modified (diff)

comment:6 Changed 3 years ago by bgamari

Wiki Page: LateLamLift

comment:7 Changed 3 years ago by simonpj

Description: modified (diff)

comment:8 Changed 3 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:9 Changed 18 months ago by simonpj

Cc: kavon added

comment:10 Changed 7 months ago by sgraf

comment:11 Changed 7 months ago by sgraf

Cc: sgraf added

I rebased Nicholas Frisby's work here: https://github.com/sgraf812/ghc/tree/llf.

It doesn't currently pass the testsuite (even in default mode, not doing any llf), probably because I introduced some errors while merging some decision code and taking care of join point (then still going by the name let-no-escape) information. In particular, the devel2 version triggers an ASSERT in add_id.

comment:12 Changed 7 months ago by simonpj

OK, great. Do keep us posted. I think there are real wins to be had here. Nicholas's detailed wiki page identifies many of the issues very clearly.

comment:13 Changed 5 months ago by sgraf

It took me quite some time, but this commit passes ./validate (modulo 4 compiler perf tests). Fixing the testsuite was rather simple, but investigating the various performance regressions to see which knobs we could turn is really time-consuming, so I figured I'd better post now than never.

I updated the wiki page with a summary of changes I made. For completeness:

  • A hopefully faithful rebase, removing previous LNE (= join point) detection logic
  • Activate all LLF flags (see the above llf-nr10-r6 configuration) by default
  • Actually use the -fllf-nonrec-lam-limit setting
  • Don't stabilise Unfoldings mentioning makeStatic
  • Respect RULEs and Unfoldings in the identifier we abstract over (previously, when SpecConstr added a RULE mentioning an otherwise absent specialised join point, we would ignore it, which is not in line with how CoreFVs works)
  • Stabilise Unfoldings only when we lifted something out of a function (Not doing so led to a huge regression in veritas' Edlib.lhs)

I'll attach nofib results in a following post. Here's the summary:

        Program         Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
--------------------------------------------------------------------------------
            Min         -20.3%    -20.3%     -7.8%    -16.5%
            Max          +2.0%     +1.6%    +18.4%    +18.4%
 Geometric Mean          -0.4%     -1.0%     +0.3%     -0.0%

llf is a plain benchmark run, whereas no-llf means libraries compiled with -fllf, but benchmarks compiled with -fno-llf. This is a useful baseline, as it allows us to detect test cases where the regression actually happens in the test case rather than somewhere in the boot libraries.

Hardly surprisingly, allocations go down. More surprisingly, not in a consistent fashion. The most illustrative test case is real/pic:

                        Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
            pic          -0.0%     +1.0%     +0.0%     -3.4%

The lifting of some functions results in functions of rather big arity after lifting (6 and 7), which can no longer be fast-called. Apparently, there's no stg_ap_pppnn variant matching the call pattern.

Also, counted instructions went up in some cases, so that there's no real win to be had. If I completely refrain from lifting non-recursive join points, things look better wrt. counted instructions:

        Program         Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
--------------------------------------------------------------------------------
            Min         -20.3%    -20.3%     -3.4%    -17.1%
            Max          +2.0%     +1.6%     +6.4%     +6.4%
 Geometric Mean          -0.3%     -1.0%     +0.1%     -0.4%

But I recently questioned using cachegrind results (such as the very relevant counted memory reads/writes) as a reliable metric (#15333).

There are some open things that should be measured:

  • Is it worthwhile at all to lift join points? (Related: don't we rather want 'custom calling conventions' that inherit register/closure configurations to top-level bindings?)
  • Isn't a reduction in allocations a lie when all we did is spill more onto the stack? Imagine we lift a (non-tail-recursive) function to top-level that would have arity > 5. Arguments would have to be passed on the stack for each recursive call. I'd expect that to be worse than the status quo. So maybe don't just count the number of free ids we abstract over, but rather bound the resulting arity? (See the sketch below.)
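A minimal sketch of that concern (hypothetical code): go has one parameter and five free variables, so the lifted version would have arity 6, and at least one argument would have to move to the stack on every recursive call.

f :: Int -> Int -> Int -> Int -> Int -> Int
f a b c d e = go 0
  where
    -- non-tail-recursive, hence no join point; free variables: a..e
    go n | n > a     = b + c + d + e
         | otherwise = n + go (n + 1)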

Finally, the whole transformation feels more like it belongs in the STG layer: we anticipate CorePrep in a very brittle way and have to pull really low-level details into the analysis, all while having to preserve unfoldings whenever we change anything. It seems like a very local optimisation (except for enabling intra-module inlining opportunities) that doesn't enable many other core2core optimisations (nor should it; that's why we lift late).

comment:14 Changed 5 months ago by sgraf

plain base, no-llf, llf

same without lifting non-rec join points

Btw. is there some way around the 256KB upload limit?

Last edited 5 months ago by sgraf (previous) (diff)

comment:15 Changed 5 months ago by sgraf

I'm currently thinking about Pros and Cons involved with (re-)implementing this as an optimization on STG rather than Core.

Pros:

  • Much less involved analysis that doesn't need to stay in sync with CorePrep
  • LLF only enables intra-module opportunities in the simplifier anyway
  • Runtime arguments are immediately visible, as are PAPs
  • Lifting to top-level only is orthogonal enough to full laziness that it might be worthwhile to split anyway
  • Integrating more low-level information might be more feasible (like calling convention, e.g. number of argument registers, or stg_ap_* patterns)

Cons:

  • The Simplifier might not be able to inline as much (which probably had limited effect anyway)
  • It's not enough to look at Core alone to gauge allocation (quite a bummer)
  • No Core Lint (but STG Lint, still)
  • Potentially we need to re-invent some utility functions we already have for Core
  • Perform the lifting by hand, rather than let FloatOut do the work

I'm really interested to hear opinions/objections about this!

Last edited 4 months ago by sgraf (previous) (diff)

comment:16 Changed 4 months ago by simonpj

Thoughts

  • There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases.
  • I don't think we should float join points at all, recursive or non-recursive. Think of them like labels in a control-flow graph.
  • I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting can affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by thinking about this.)
  • Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a try.

I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage.

When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about.

If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent.

comment:17 in reply to:  16 Changed 4 months ago by sgraf

Replying to simonpj:

Thoughts

  • There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases.

It's hard to tell for n-body: The -fno-llf variant shows the same reduction in allocations, meaning that the reduction must happen somewhere in the base libraries rather than in n-body itself.

queens has its go1 function within gen lifted (arity 1 with 3 free vars). Non-lifted Core after CorePrep:

let {
  n_s5S3 [Occ=OnceL*] :: [[GHC.Types.Int]]
  [LclId]
  n_s5S3 = go_s5RY ys_s5S2 } in
case GHC.Prim.># 1# ww1_s5RX of {
  __DEFAULT ->
    letrec {
      go1_s5S5 [Occ=LoopBreaker]
        :: GHC.Prim.Int# -> [[GHC.Types.Int]]
      [LclId, Arity=1, Str=<S,U>, Unf=OtherCon []]
      go1_s5S5
        = \ (x1_s5S6 :: GHC.Prim.Int#) ->
            case Main.main_$ssafe y_s5S1 1# x1_s5S6 of {
              GHC.Types.False ->
                case GHC.Prim.==# x1_s5S6 ww1_s5RX of {
                  __DEFAULT ->
                    case GHC.Prim.+# x1_s5S6 1#
                    of sat_s5S9 [Occ=Once]
                    { __DEFAULT ->
                    go1_s5S5 sat_s5S9
                    };
                  1# -> n_s5S3
                };
              GHC.Types.True ->
                let {
                  sat_s5Se [Occ=Once] :: [[GHC.Types.Int]]
                  [LclId]
                  sat_s5Se
                    = case GHC.Prim.==# x1_s5S6 ww1_s5RX
                      of {
                        __DEFAULT ->
                          case GHC.Prim.+# x1_s5S6 1#
                          of sat_s5Sd [Occ=Once]
                          { __DEFAULT ->
                          go1_s5S5 sat_s5Sd
                          };
                        1# -> n_s5S3
                      } } in
                let {
                  sat_s5Sa [Occ=Once] :: GHC.Types.Int
                  [LclId]
                  sat_s5Sa = GHC.Types.I# x1_s5S6 } in
                let {
                  sat_s5Sb [Occ=Once] :: [GHC.Types.Int]
                  [LclId]
                  sat_s5Sb
                    = GHC.Types.:
                        @ GHC.Types.Int sat_s5Sa y_s5S1 } in
                GHC.Types.:
                  @ [GHC.Types.Int] sat_s5Sb sat_s5Se
            }; } in
    go1_s5S5 1#;
  1# -> n_s5S3
}

And when go1 is lifted to top-level, the CorePrep'd call site changes to

case GHC.Prim.># 1# ww1_s5Ts of {
  __DEFAULT ->
    let {
      sat_s5Tz [Occ=Once, Dmd=<L,1*U>] :: [[GHC.Types.Int]]
      [LclId]
      sat_s5Tz = go_s5Tt ys_s5Tx } in
    llf_go_r5SK y_s5Tw ww1_s5Ts sat_s5Tz 1#;
  1# -> go_s5Tt ys_s5Tx
}

The important detail is that n_s5S3 corresponds to sat_s5Tz in the lifted version. Both have a 'single' occurrence, but n_s5S3's occurrence is under a non-one-shot lambda (which apparently, and understandably, is enough for OccAnal to give up, as opposed to analysing parameters of recursive functions), so n_s5S3 is updateable and causes an increase in heap allocation, while sat_s5Tz is single-entry.

So, I'm afraid this gain can't entirely be claimed by LLF. A late demand analysis run would probably detect the same. There's cichelli, which shows similar improvements (-10% allocs) due to something happening in Prog.hs, but there are too many lifts to bisect, so I'll postpone that to a follow-up post.

  • I don't think we should float join points at all, recursive or non-recursive. Think of them like labels in a control-flow graph.

This was also what I thought, but there can be synergetic effects with the inliner if we manage to "outline" (that's what the transformation would be called in an imperative language) huge non-recursive join points, where the call overhead is negligible. At least that's what the wiki page mentions. But that brings me to the next point...

  • I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting can affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by thinking about this.)

Yes, I wasn't sure if having it play nice with the simplifier was intended here. I take comfort in your confirmation seeing it as a code-gen strategy.

  • Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a try.

Will do.

I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage.

Maybe it's just me only slowly understanding every layer of compilation in GHC, but I felt like I could have this mental model of roughly where and how much things allocate after CorePrep, but that's probably an illusion considering things like let-no-escapes (that story got better with join points, I suppose). Having another pass transforming heap allocation into stack allocation *after* CorePrep isn't exactly helping that intuition.

Or were you thrown off by me using words I maybe mistranslated ("gauge" for "estimate", roughly)?

When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about.

There is this note that explains why it's necessary to approximate CorePrep. What follows it is the new analysis, which interleaves a free variable analysis with computing use information for ids that are free in the floated bindings.

If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent.

Yes! I'm not really sure I understand all of it yet. (Re-)implementing it on STG should help me figure things out.

comment:18 Changed 4 months ago by simonpj

A late demand analysis run would probably detect the same.

-O2 does late demand analysis for this very reason. So presumably it doesn't catch it.

"It's not enough to look at Core alone to gauge allocation"

I agree that STG has a stronger operational model than Core. (Core is not bad, but STG is better.) But that's an advantage of doing LLF in STG, not a disadvantage.

Do have a go!

comment:19 Changed 3 months ago by sgraf

I reimplemented this as an STG pass and just rebased it onto a recent 8.6 alpha commit. You can track the progress on my `lift-lambda` branch, of which this is the current HEAD. It passes ./validate, but because nofib's lambda benchmark hasn't been made fit for the next step of the MonadFail proposal, we can't run benchmarks yet.

I ran some benchmarks before the rebase, which was relative to this commit. Here are the numbers, eliding some uninteresting ones:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
            VSD          -0.3%     +0.0%
            VSM           0.0%     -0.1%
           atom          -0.8%     -1.2%
   binary-trees          -0.0%     -1.5%
        circsim          -1.0%     -0.6%
       clausify          -1.9%     -0.2%
    constraints          -0.9%     -0.7%
   cryptarithm1          -2.8%     -8.0%
   cryptarithm2          -4.0%     -2.9%
    exact-reals          -2.1%     -0.1%
           fft2          -1.0%     -0.4%
          fluid          -0.9%     -0.6%
         hidden          -1.0%     -0.6%
   k-nucleotide          -0.0%     +2.4%
           lcss          -0.1%     -1.3%
           mate          -3.2%     -0.6%
        mkhprog          -1.3%     +1.0%
          power          -0.7%     -1.0%
         puzzle          -0.0%     +0.0%
         queens         -17.7%     -0.9%
        reptile          -0.3%    -17.1%
        rewrite          -1.7%     -0.1%
         simple          -0.2%     +4.2%
      typecheck          -2.7%     -1.9%
--------------------------------------------------------------------------------
            Min         -17.7%    -17.1%
            Max           0.0%     +4.2%
 Geometric Mean          -0.5%     -0.3%

So, no regressions in allocation (I'd consider any such regression a bug) and some small gains in general. Note that because of ticket:15333#comment:2 I don't really trust many of the regressions in counted instructions. I'd test with the proposed -fllvm -optlo -Os configuration, but that will have to wait until nofib is fixed.

Last edited 3 months ago by sgraf (previous) (diff)

comment:20 Changed 3 months ago by sgraf

comment:21 Changed 3 months ago by sgraf

Owner: changed from nfrisby to sgraf

comment:22 Changed 3 months ago by simonpj

Keywords: LateLamLift added

comment:23 Changed 3 months ago by sgraf

Over the last week, I polished this some more and played around with heuristics regarding known and unknown calls. I'm pretty confident the transformation can handle every idea outlined in Wiki:LateLamLift. Here is an excerpt from a benchmark run:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
   cryptarithm1          -2.8%     -7.9%
   cryptarithm2          -4.0%     -2.4%
    exact-reals          -2.1%     -0.0%
   k-nucleotide          -0.0%     +2.4%
          kahan          -0.4%     -2.0%
           lcss          -0.1%     -5.8%
           mate          -8.4%     -3.5%
         n-body         -20.2%     -0.0%
         queens         -17.7%     -0.8%
      typecheck          -2.7%     -1.8%
--------------------------------------------------------------------------------
            Min         -20.2%     -7.9%
            Max          +0.0%     +2.4%
 Geometric Mean          -0.8%     -0.3%

I had to fight with a surprising amount of benchmark wibbles (±12% in instructions, c.f. #15333) related to GC parameters and (probably) code layout, even though I was using cachegrind for measurements. The above numbers are from running nofib with make EXTRA_RUNTEST_OPTS='-cachegrind +RTS -V0 -A128M -H1G -RTS' NoFibRuns=1.

I looked at the regression in k-nucleotide (+2.4% counted instructions), but couldn't reproduce it so far.

I'll probably refactor the one huge module into 2 or 3 smaller ones and then prepare a patch.

comment:24 Changed 3 months ago by simonpj

This sounds great, thank you!

There are some knobs you can turn in LLF, aren't there? E.g. if a function has 200 free vars, we probably don't want to lambda-lift it and turn it into a function with 200 parameters.

Did you experiment with different settings of the threshold for number-of-free-vars?

There may be more knobs too.

comment:25 Changed 3 months ago by sgraf

I think Nicolas played around with those parameters, concluding that wiki:LateLamLift#llf-nr10-r6 was the best configuration.

I played around with settings like the number of vars to abstract over when I was still test-driving the Core transformation. IIRC, the desired cutoff was 5 arguments in total, because any excess arguments are passed on the stack. But now that the transformation works reliably, I think I can play around with it some more.

There are some knobs to turn: There's -fstg-lift-lams-args(-any) (default: number of available registers for argument passing, currently hard-coded to 5) which takes an upper bound for the number of arguments the function may have after lifting. Also no distinction if the function is recursive or not (yet?).

As well as -f(no-)stg-lift-lams-known (default: no) which controls if we are willing to turn known calls into unknown calls. This happens when we lift a function that abstracts over a known function:

let f x = ...
    mapF xs = ... f xs ...
in mapF [1,2,3]

==>

mapF_up f xs = ... f xs ...
let f x = ...
in mapF_up f [1,2,3]

Lifting mapF turns the call to f into an unknown call.

Wrt. the other heuristics I employ, peek here if you're interested: https://github.com/sgraf812/ghc/blob/1e45f1fe3133a263694d05d01b84a245d4244098/compiler/simplStg/StgLiftLams.hs#L413-L420

comment:26 Changed 3 months ago by simonpj

OK, it sounds as if you are exploring the space really well -- thanks

Changed 3 months ago by sgraf

Attachment: unknown.txt added

allow turning known into unknown calls

Changed 3 months ago by sgraf

Attachment: rec-5-6-8-10.txt added

Varying max number of recursive parameters

Changed 3 months ago by sgraf

Attachment: nonrec-5-6-8-10.txt added

Varying max number of non-recursive parameters

Changed 3 months ago by sgraf

Attachment: 8-8.txt added

Allowing max 8 (non-)recursive parameters

comment:27 Changed 3 months ago by sgraf

I investigated various variations of the configuration that intuitively should give the best results. Specifically, I played with 3 parameters:

  1. -fstg-lift-lams-known: Allow turning known into unknown calls. Default: Don't allow.

Imagine we lift f in the following program:

let g x = ...
    f y = ... g y ...
in f 2
==>
f g y = ... g y ...;
let g x = ...
in f g 2

This turns the known call to g within f's body into an unknown one (let's call it anonymisation for the sake of this post), which needs to go through one of the generic apply functions. My gut tells me this probably isn't worth taking the risk: there's the (very small) chance that, when there's no matching slow call pattern, this will turn into a very slow call, which, from my understanding, would allocate.

  2. -fstg-lift-lams-rec-args <n>: Allow lifting as long as the resulting recursive function has at most n arguments. Default: 5, the number of available registers.

Excess arguments are passed on the stack, which would mean the same slow memory accesses we try to avoid. Still, allocating on the stack vs. on the heap might be favorable, but I'm not sure how passing arguments on the stack interacts with (non-tail) recursion; e.g., would passing arguments on the stack mean we have to pass the same static args all over again for each nested recursive activation?

Anyway, I measured.

  3. -fstg-lift-lams-nonrec-args <n>: Allow lifting as long as the resulting non-recursive function has at most n arguments. Default: 5.

Lifting non-recursive functions could have different effects than lifting recursive ones, because (a) there are no recursive calls, so we pay the call overhead only once, and (b) they are probably huge enough that the call overhead is negligible.

I'll abbreviate each configuration I tested by a triple {t,f}-<nrec>-<nnonrec>, so the (current) default parameter would be f-5-5.

  • t-5-5 -- or: allow turning known calls into unknown calls. Total mean changes: -0.0% allocs, -0.0% counted instructions. Numbers in attachment:unknown.txt. No regressions in allocations (that's what we have the cost model for, after all), with two benchmarks standing out:
    • rewrite: -1.5% allocs, -0.1% counted instructions. Changes were somewhere in base, so I didn't bother to investigate further
    • mate: -0.9% allocs, -0.1% counted instructions. This one lifted recursive functions of the form
      let {
        $warg_scDq [InlPrag=NOUSERINLINE[2],
                    Occ=OnceL*!,
                    Dmd=<L,C(C1(C1(U)))>]
          :: Board.Kind
              -> Board.Square
              -> [(Move.MoveInFull, Board.Board)]
              -> [(Move.MoveInFull, Board.Board)]
        [LclId, Arity=3, Str=<S,U><L,U(U(U),U(U))><L,U>, Unf=OtherCon []] = ...
      } in ... let {
        go_scKg [Occ=LoopBreaker]
          :: [(Board.Kind, Board.Square)] -> [(Move.MoveInFull, Board.Board)]
        [LclId, Arity=1, Str=<S,1*U>, Unf=OtherCon []] =
            sat-only [go_scKg $warg_scDq] \r [ds_scKh]
                case ds_scKh of {
                  [] -> [] [];
                  : y_scKj [Occ=Once!] ys_scKk [Occ=Once] ->
                      case y_scKj of {
                        (,) ww3_scKm [Occ=Once] ww4_scKn [Occ=Once] ->
                            let {
                              sat_scKo [Occ=Once] :: [(Move.MoveInFull, Board.Board)]
                              [LclId] =
                                  [ys_scKk go_scKg] \u [] go_scKg ys_scKk;
                            } in  $warg_scDq ww3_scKm ww4_scKn sat_scKo;
                      };
                };
      } in  go_scKg ww_scCR;
      
      Which is exactly the kind of lift that I thought we don't want to make: lifting go to top-level will result in abstracting over $warg, which will turn the known call into an unknown one. Perhaps this is only beneficial because the unknown call isn't on the hot path.
  • f-<n>-5: This varied the max number of recursive args between 5 and 10 (attachment:rec-5-6-8-10.txt). Allowing 6 parameters lifted some more functions, 8 parameters didn't do anything more than 6, and 10 parameters did another influential lift (-7.7% allocs in mandel2, but +0.3% in counted instructions). The only real beneficial lift here was in fibheaps, happening with n >= 6 (-0.5% on both allocs and counted instructions). The rest seems to be just noise.

So, what happened in fibheaps? It seems two recursive functions were lifted, both taking 6 arguments. Ah, but one of those arguments (the last, in particular) is a 'void' parameter (so, slow call pattern pppppv) that is completely erased from the resulting Cmm! ... the tip of my branch should allow the lift here now (see the sketch after this list).

  • f-5-<n>: This varied the max number of non-recursive args between 5 and 10 (attachment:nonrec-5-6-8-10.txt). Allowing up to 8 parameters had a great effect on allocations in cichelli (-10.4%), while also improving counted instructions negligibly (-0.0%). Allowing 10 parameters also had a tiny effect on simple (-0.9% allocs, -0.1% instructions). Codegen for both benchmarks reveals that the changes hide somewhere in base, so I'm not investigating further at the moment; it seems like it's not worth the time.
  • f-8-8: To test the interaction of both tuning parameters. No surprising results: attachment:8-8.txt (the baseline doesn't use the fibheaps opportunity which is now optimised in f-5-5)
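The fibheaps sketch promised above (hypothetical code): a State# token is a 'void' argument that occupies neither a register nor a stack slot, so a function with five pointer arguments plus one void argument still makes for a cheap call and arguably shouldn't count against the limit of 5.

{-# LANGUAGE MagicHash #-}
import GHC.Exts (RealWorld, State#)

-- Six syntactic arguments, but the State# token is erased in Cmm,
-- so only five arguments need registers.
go :: Int -> Int -> Int -> Int -> Int -> State# RealWorld -> Int
go a b c d e _s = a + b + c + d + e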

I didn't bother evaluating the combination of allowing anonymisation of calls with the max bound on parameters, because I think they are largely independent of one another.

Should I evaluate more variants, like allowing the lifting of undersaturated applications (are there even any in STG? I thought not, but then there's satCallsOnly)? I don't think these would be worthwhile, except when the resulting PAP is on a cold path...


So, here's a TLDR and some questions:

  1. Should we allow anonymisation (see 1. above)? I don't see big wins (configuration t-5-5), and I don't really understand why there are any at all, which is probably the bigger problem here.
  2. I think the f-5-5 configuration is a good default. Can you think of any other parameters I could vary? Looking at the wiki page, I think these are the other ones Nicolas evaluated (back then on Core, which is a lot less explicit about these things):
    i. Abstracting over/lifting LNEs: That's a no-no, as the last few years have shown
    ii. Abstract undersaturated applications: As I wrote above, I don't think these are worthwhile, because they only shift allocation to call sites via PAPs (and they aren't trivial to implement because of STG's ANF guarantees)
    iii. Abstract (over-)saturated applications: That's a no-brainer in my eyes. Oversats are just regular calls returning something unknown that we then call. If the known call is worthwhile to lift, just do it; the unknown call won't change a bit.
    iv. Create PAPs: That's a special case of ii., I think, or at least this is the case that would entail a special case in the transformation, because STG only allows atoms as arguments.
    v. Use strictness: Seems to be related to anticipating CorePrep, which fortunately we no longer have to do.
    vi. Use one-shot info: I'm not entirely sure what this means either. Lifting would spoil one-shot info, but I don't think this has any operational consequences, because one-shot info was already used to identify single-entry thunks.

comment:28 Changed 3 months ago by simonpj

-fstg-lift-lams-known: Allow turning known into unknown calls.

In your example, you say that we might float f but not g. But why don't we float g?

lifting go to top-level will result in abstracting over $warg,

Why didn't we float $warg?

Anyway, I agree; it is Very Bad to turn known into unknown calls. Let's not do that.

Last edited 3 months ago by simonpj (previous) (diff)

comment:29 Changed 3 months ago by simonpj

Ah, but one of them (the last, in particular) is a 'void' parameter (so, slow call pattern pppppv), which is completely erased from the resulting Cmm!

OK, so the conclusion is: don't count void args when counting args and comparing with the threshold? I agree!

Last edited 3 months ago by simonpj (previous) (diff)

comment:30 Changed 3 months ago by simonpj

The non-recursive case

f-5-<n>: This varied the max number of non-recursive args between 5 and 10 (attachment:nonrec-5-6-8-10.txt). Allowing up to 8 parameters had a great effect on allocations in cichelli (-10.4%),

Hmm. Consider

let f x = e in ...(f e1)...(f e2)...

where e has a lot of free vars y1 .. y20. If we lift, we get a top-level f with 21 args, and the calls look like:

...(f y1 .. y20 e1) ... (f y1 .. y20 e2)...

Now

  • In the original form we store y1..y20 into the function closure for f, once.
  • In the lifted form we store (most of) y1..y20 on the stack, once for each call.

So if we had a lot of calls, there'd be more memory traffic. Plus of course, any case-expression evals would need to save y1..y20 on the stack, and then reload them to make the call...

I was thinking that maybe for non-rec functions we could claim that lifting was a win regardless of the number of parameters, but I don't think that's true.

So it's a bit surprising to me that even with a threshold of 10 we have such a small hit on instruction count.

comment:31 Changed 3 months ago by simonpj

iii) Abstract (over-)saturated applications:

I agree. Let's do it.

vi) Use one-shot info:

I took a quick look at Nick's code. It seemed to be to do with NOT floating a binding that looked like

   f = \x[oneshot]. blah

I'm not sure why, and I think it's irrelevant by the time we get to STG. So yes, ignore this.

comment:32 Changed 3 months ago by simonpj

  • Would you like to publish figures for your best option (f-5-5, don't count void args, float over-sat apps)?
  • What happens to binary sizes?
  • I think you should write a paper! It's very easy to forget all the things that are paged into your head at the moment. Get them written down before you forget them!

comment:33 in reply to:  30 Changed 3 months ago by sgraf

Replying to simonpj:

-fstg-lift-lams-known: Allow turning known into unknown calls.

In your example, you say that we might float f but not g. But why don't we float g?

Because at the time we handled g, it (hypothetically) was deemed non-beneficial to lift. It might lead to more allocation or hurt some other heuristic.

Of course one could argue that lifting both bindings could be a win. Our decision strategy is greedy in that regard; there's no backtracking involved. We transform expressions going from outer to inner, at each step taking into account the lifting decisions we already made (and which we don't question or undo). So, for that example in particular, the assumption was that we examined g, decided not to lift it, and then looked at f knowing that g won't be lifted.

lifting go to top-level will result in abstracting over $warg,

Why didn't we float $warg?

Same reasoning: because we decided earlier that the lift might not be worthwhile. I could look at some debug output to see what the reason(s) were for $warg if you are really interested.

Replying to simonpj:

Ah, but one of them (the last, in particular) is a 'void' parameter (so, slow call pattern pppppv), which is completely erased from the resulting Cmm!

OK, so the conclusion is: don't count void args when counting args and comparing with the threshold? I agree!

Yes, exactly. See https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9dea/compiler/simplStg/StgLiftLams/Analysis.hs#L310

Replying to simonpj:

The non-recursive case

So it's a bit surprising to me that even with a threshold of 10 we have such a small hit on instruction count.

I was wondering the same thing. I think the other heuristics (https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9dea/compiler/simplStg/StgLiftLams/Analysis.hs#L228-L235, for completeness) fire way too often to allow non-recursive functions with 20 arguments. Consider the impact on allocations: when we lift such a function, all 20ish free variables would have to be made available at all call sites, leading to a massive increase in closure allocation around the actual call sites. Example:

let f = [x1...x20] \[a b c] -> ...
    g = [f y] \z -> f y y z
in map g [1,2,3]

Assuming for the sake of the example that g wouldn't be lifted to top-level itself (it's just a PAP of f), lifting f would just shift the whole closure into g's closure. In general, there could be arbitrarily many closures between f's binding site and its call sites, each of which would have to carry f's former free variables, should we decide to lift it.

I could try and measure with the allocation heuristic and the max args heuristic turned off, to see if we get more chaotic results.

Changed 3 months ago by sgraf

Attachment: f-5-5.txt added

Don't turn known into unknown calls, allow max 5 (non-)recursive non-void parameters after lifting, don't lift things with undersaturated applications or partial applications

comment:34 in reply to:  32 Changed 3 months ago by sgraf

Replying to simonpj:

  • Would you like to publish figures for your best option (-f-5-5, don't count void args, float over-sat apps)?

I'm currently re-running benchmarks with the new void-args handling, but I suspect that the results from comment:23 are pretty much the same, except for fibheaps. All results with at least 1% change:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
           anna          -1.1%     -0.4%
           atom          -0.8%     -1.1%
        circsim          -1.0%     -0.7%
       clausify          -1.9%     -0.4%
   cryptarithm1          -2.8%     -7.9%
   cryptarithm2          -4.0%     -2.4%
    exact-reals          -2.1%     -0.0%
         expert          -1.0%     -0.0%
           fft2          -1.0%     -0.3%
       fibheaps          -1.4%     -0.7%
          fluid          -1.5%     -0.6%
         hidden          -1.0%     -0.6%
          infer          -0.6%     -1.1%
   k-nucleotide          -0.0%     +2.4%
          kahan          -0.4%     -2.0%
           lcss          -0.1%     -5.8%
           mate          -8.4%     -3.5%
        mkhprog          -1.3%     -0.1%
         n-body         -20.2%     -0.0%
       nucleic2          -1.0%     -0.1%
         queens         -17.7%     -0.8%
      typecheck          -2.7%     -1.8%
--------------------------------------------------------------------------------
            Min         -20.2%     -7.9%
            Max          +0.0%     +2.4%
 Geometric Mean          -0.8%     -0.3%
  • What happens to binary sizes?

Good question, given that we push work from binding sites to call sites. It seems they consistently went down by 0.1%. I wonder why? Maybe some heap checks somewhere in base? Or is it just a wibble? Maybe I should re-evaluate my baseline...

  • I think you should write a paper! It's very easy to forget all the things that are paged into your head at the moment. Get them written down before you forget them!

Yes, I'd love to! In fact, I started something at https://github.com/sgraf812/late-lam-lift. It's a bit rough and only has a section about the decision heuristics at the moment, but I'll post when there's something you could have a look at. I'd very much like it if Nicolas and/or you would co-author it with me, as I haven't really written a paper before. I think this would be an excellent exercise.

comment:35 Changed 3 months ago by simonpj

I could try and measure with the allocation heuristic ... turned off, to see if we get more chaotic results.

What is the "allocation heuristic"?

comment:36 Changed 3 months ago by sgraf

The allocation heuristic (https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9dea/compiler/simplStg/StgLiftLams/Analysis.hs#L235) tries to estimate closure growth and only allows the lift if it can guarantee that there are no regressions.

I think this is quite essential for identifying beneficial lifts, and it's non-trivial to implement. The whole Analysis module linked above mostly revolves around computing this costToLift function: https://github.com/sgraf812/ghc/blob/9b9260c1d45d127edf9ebdfe04a3daaff24a9dea/compiler/simplStg/StgLiftLams/Analysis.hs#L375.
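Since this keeps coming up, here is a much-simplified sketch of the underlying idea (hypothetical helpers, not the actual GHC code, which works syntactically over STG and accounts for free variables the enclosing closures already share): lifting f deletes f's own closure, but every closure that previously captured f must now capture f's former free variables instead.

closureGrowth
  :: Int  -- number of free variables the lift abstracts f over
  -> Int  -- size of f's own closure, which the lift removes
  -> Int  -- number of closures that previously captured f
  -> Int
closureGrowth fvs ownSize captors = captors * (fvs - 1) - ownSize
  -- each captor drops its slot for f but gains fvs new slots

-- The lift is vetoed unless growth is guaranteed non-positive:
allowLift :: Int -> Int -> Int -> Bool
allowLift fvs ownSize captors = closureGrowth fvs ownSize captors <= 0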

comment:37 Changed 3 months ago by simonpj

I think this is quite essential to identify beneficial lifts,

Oh, OK. If I were reading the paper, I'd ask "which bits of this analysis really make a difference?" E.g.

  • If I switched it off altogether, what happens?
  • If I switch off pieces of it, what happens? We've discussed some of these pieces above, but not all, I think.

All good grist for the paper. (Which I have not yet looked at.)

Last edited 3 months ago by simonpj (previous) (diff)

comment:38 Changed 3 months ago by simonpj

Yes, I'd love to! In fact, I started something at ​https://github.com/sgraf812/late-lam-lift.

Re paper: yes I'd be happy to co-author. Please yell when you think it's ready for a serious look.

comment:39 Changed 6 weeks ago by simonpj

Sebastian, how is this going?

comment:40 Changed 6 weeks ago by sgraf

You can find the latest version of the paper at https://github.com/sgraf812/late-lam-lift/blob/master/paper.pdf. It's still pretty rough and littered with TODOs, but you can take a glance at the two chapters I got so far. What's missing is an evaluation + the usual prologue and epilogue.

While writing the paper, I realised a few things here and there that I could change in the transformation.

For example, currently we can't lift functions that occur in argument position, i.e.

let f x = ... x ... a ... b ... in
let g y z = ... in
g f 1

Lifting f would mean having to allocate a PAP f a b at the call site of g. One of our heuristics forbids the lift here, because it's nonsense: we'd immediately re-introduce the allocation we saved, at each call site of the lifted function. The only situation I can think of in which this could be beneficial is when it pushes allocation from a hot code path into multiple cold call sites.

Making the transformation correct here is somewhat of a nuisance, but probably needed in order to back my claims in section 3 (C1, in particular) with benchmark figures.

comment:41 Changed 6 weeks ago by simonpj

Good stuff on the paper. I'll try to look.

For example, currently we can't lift functions that occur in argument position, i.e.

That seems correct, doesn't it? I.e. the implementation doesn't lift such functions, and it shouldn't. So why do you say "I realised a few things here and there that I could change in the transformation"?

What about the implementation. Are you done? Ready to submit a Phab patch for review? With an up-to-date wiki page describing what it does?

comment:42 Changed 6 weeks ago by sgraf

That seems correct, doesn't it? I.e. the implementation doesn't lift such functions, and it shouldn't.

The implementation doesn't lift these functions because we think they won't lead to a beneficial lift anyway. But that's rather a heuristic, and for illustrative purposes (e.g., to show how things get worse when we allow this) we could implement the transformation in a way that copes with these cases.

But after having played with it for a few hours yesterday, I came to realise that this has more serious consequences than I anticipated. In addition to the StgApp case, the constructor application and PrimOp cases are particularly egregious, because they require returning floats along with the actual transformed RHS. I'm no longer convinced that the illustrative purpose justifies complicating the implementation so much.

So why do you say "I realised a few things here and there that I could change in the transformation"?

In the process of writing the paper, I repeatedly had to double-check what the transformation does and that it actually does the right thing. For example, thinking about how to formulate the function computing the closure growth resulting from a lifting decision led me to realise that we could in fact make better use of strictness in addition to usage (i.e. one-shot) information.

Example:

let f = [x y]. \z -> ...
    g = [x y f]. \u -> ... x ... y ... f u ...
in g 1

What's the closure growth resulting from lambda-lifting f? Previously, although g's closure would shrink by one slot (so closure growth -1 from g's RHS), we couldn't be certain that this benefit would be visible on every program path, so we'd have to return 0 here.

But that's too conservative, because g is in fact called strictly with one argument (strictness demand C(S)), which means the closure growth for the expression is actually -1 (not yet counting the two slots saved by not having to allocate a closure for f).
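A sketch of that refinement (hypothetical helper, not the actual implementation): closure shrinkage under a lambda may only be counted when strictness guarantees the lambda is entered, while growth must always be counted.

refineGrowth :: Bool -> Int -> Int
refineGrowth guaranteedEntered growth
  | growth < 0 && not guaranteedEntered = 0      -- benefit not certain on all paths
  | otherwise                           = growth -- growth counts pessimistically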

What about the implementation. Are you done? Ready to submit a Phab patch for review? With an up-to-date wiki page describing what it does?

I consider the implementation done (I probably should've submitted a patch much earlier). I'll bring the wiki page up to date and then submit a patch within the next week or so.

comment:43 Changed 6 weeks ago by simonpj

In the process of writing the paper, I repeatedly had to double-check what the transformation does and that it actually does the right thing. For example, thinking about how to formulate the function computing the closure growth resulting from a lifting decision led me to realise that we could in fact make better use of strictness in addition to usage (i.e. one-shot) information

Great. So did you update the impl to take advantage of these new insights?

I'll look forwards to the patch. An up to date nofib comparison would be good.

And please yell when you are ready for me to take a proper sweep through the paper. Thanks!

comment:44 Changed 6 weeks ago by sgraf

Differential Rev(s): D5224
Status: new → patch

I submitted a patch of the implementation at Phab:D5224 and brought the wiki page up to date. This is ready for review, although still lacking a unit test suite.

comment:45 Changed 6 weeks ago by potato44

Differential Rev(s): D5224 → Phab:D5224

comment:46 Changed 5 weeks ago by sgraf

I realised that while the currently employed heuristics work quite well, we are too pessimistic wrt. undersaturated applications.

Currently, we don't lift binders that have undersaturated calls at all. But there's nothing wrong with lifting those, as long as allocations don't get worse. So I set out to get rid of that heuristic and replaced it with a check that disallows occurrences in argument position instead (previously, this was subsumed by the undersaturated-call check).

The results are inconclusive, with some significant slow-downs, which I could pinpoint to a bug in the heuristic that detects known calls. Currently, we say that a call to some variable would be lowered as a known call when idArity bndr > 0. But there actually are thunks (i.e., arity 0) of function type, which count as known calls. Abstracting over these thunks makes for an increase in allocation and runtime. I'm investigating more robust ways of detecting known calls.
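For illustration, a hypothetical example of such a binder: f has a function type, yet its arity is 0, because its RHS is a thunk (a call) rather than a lambda, so it must be evaluated before it can be called (ignoring that the simplifier would normally inline id here).

-- f :: Int -> Int, yet idArity f == 0: the RHS is a call, not a lambda.
f :: Int -> Int
f = id (+ 1)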

comment:47 Changed 5 weeks ago by sgraf

Never mind, the regression was due to lifting binders occurring in nullary applications. Example from CS:

main_$s$wgo =
    \r [sc sc1]
        case sc1 of wild {
          __DEFAULT ->
              let {
                lvl =
                    \u []
                        case -# [wild 1#] of sat { __DEFAULT -> main_$s$wgo sc sat; }; } in
              let {
                sat =
                    \r [s1] case plusInteger s1 ds of s' { __DEFAULT -> lvl s'; };
              } in  sat;
          0# -> sc ();
        };
==>
main_$s$wgo =
    \r [sc sc1]
        case sc1 of wild {
          __DEFAULT ->
              let {
                lvl =
                    \u []
                        case -# [wild 1#] of sat { __DEFAULT -> main_$s$wgo sc sat; };
              } in  $lsat lvl;
          0# -> sc ();
        };
$lsat =
    \r [lvl s1] case plusInteger s1 ds of s' { __DEFAULT -> lvl s'; };

(main_$s$wgo is a ternary function)

This is a beneficial lift from the perspective of closure growth, but it actually allocates more, because the nullary application sat turns into a partial application $lsat lvl, which allocates.

Fixing this by regarding nullary applications as argument occurrences of the binder (i.e., a no-go) has the same effect on nofib benchmarks as just disallowing any undersaturated calls to the binder to be lifted. The latter is much simpler to check for (just look at the StgBinderInfo of the RHS) and is the status quo in Phab:D5224. So, no news, basically.

comment:48 Changed 5 weeks ago by sgraf

Milestone: 8.8.1

comment:49 Changed 4 weeks ago by sgraf

Re: ticket:15770#comment:11 (StgBinderInfo is going to be removed in Phab:D5232)

I can probably just reuse a branch I had lying around that didn't have any consequences for benchmark performance. Still, I'll probably need to wait for the changes from Phab:D5232 to hit master.

comment:50 Changed 4 weeks ago by simonpj

I realised that while the currently employed heuristics work quite well, we are too pessimistic wrt. undersaturated applications.

As I say on Phab, we really need a Note summarising the heuristics. (And of course the paper will help a lot.)

comment:51 Changed 3 weeks ago by maoe

Cc: maoe added

Changed 2 weeks ago by sgraf

Attachment: nofib.txt added

Runtime measurements, ignoring all with <200ms runtime, LLVM backend with -Os

comment:52 Changed 2 weeks ago by sgraf

I'm currently trying to find the right configuration for Runtime benchmarking.

When using the NCG on the architecture I benchmark on, there are seemingly random performance outliers, even when ignoring benchmarks with less than 200ms running time. Take CSD from real/eff, for example. On the target architecture (i7-6700), things are consistently 4.5% slower, yet there isn't a single lifted function in that benchmark; it's basically just a counting loop. To make matters worse, I can't reproduce this on my local PC; quite the contrary there. Altogether this makes for a very meager improvement of -0.2% in runtime.

This leads me to believe that the (relatively minor) benefits are obscured by code size and layout concerns. If I only include benchmarks that ran at least 500ms, things look much better (-0.4%), but that's probably because I excluded the eff 'microbenchmarks'.

I tried another configuration that probably does better justice to the optimisation: I re-ran the benchmarks with -fllvm -optlo -Os to have LLVM optimise for size, which IME yields less code-layout-dependent results.

Anyway, ignoring benchmarks with <200ms runtime yields an improvement of -1.0% (result: https://ghc.haskell.org/trac/ghc/attachment/ticket/9476/nofib.txt), while ignoring all benchmarks with <500ms runtime yields an -1.2% improvement. Ironically, runtime of CSD improved by -7.1%.

Notable is also that while n-body allocates 20% less (heap space!), it got slower by a non-meaningful margin of 0.1%. Maybe watching out for allocations isn't the be-all and end-all here.

I really think we should flag benchmarks for being eligible for runtime measurements. I get hung up on what are architectural wibbles all the time.
