Opened 4 years ago

Last modified 11 days ago

#9476 new feature request

Implement late lambda-lifting

Reported by: simonpj Owned by: nfrisby
Priority: normal Milestone:
Component: Compiler Version: 7.8.2
Keywords: Cc: bgamari, kavon, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8763 Differential Rev(s):
Wiki Page: LateLamLift

Description (last modified by simonpj)

This ticket exists to coordinate work on Nick Frisby's late lambda-lifting optimisation.

  • Branch in GHC repo: wip/llf
  • Related tickets
    • #9279: local closure remains in final program
    • #9374: static argument transformation
    • #5945 (closed as dup, but with useful example)
    • #11284 (closed as dup, but with useful example)
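
For readers coming to this fresh: lambda lifting converts a local function's free variables into extra parameters so the function can be floated to top level, trading a heap-allocated closure for extra arguments at each call. A minimal, hand-written sketch (the names countBetween and llf_go are illustrative, not from the wip/llf branch):

```haskell
-- Before lifting: 'go' has free variables 'lo' and 'hi', so each call
-- to 'countBetween' allocates a closure for 'go' on the heap.
countBetween :: Int -> Int -> [Int] -> Int
countBetween lo hi xs0 = go xs0
  where
    go []     = 0
    go (x:xs) = (if lo <= x && x <= hi then 1 else 0) + go xs

-- After lifting: the free variables become extra parameters, and the
-- resulting 'llf_go' is a closure-free top-level function.
llf_go :: Int -> Int -> [Int] -> Int
llf_go _  _  []     = 0
llf_go lo hi (x:xs) = (if lo <= x && x <= hi then 1 else 0) + llf_go lo hi xs

countBetween' :: Int -> Int -> [Int] -> Int
countBetween' lo hi xs = llf_go lo hi xs
```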

Change History (18)

comment:1 Changed 4 years ago by simonpj

Owner: set to nfrisby

comment:2 Changed 4 years ago by simonpj

Thoughts about the wiki page (which is a fantastic start):

  • The "extended consequences" section is great. But could you give a small example of each point? Otherwise it is hard to grok.

There are lots of other places where an example would be extremely useful. To take just one: "If an LNE function f occurs in another LNE function g and we only lift g, then it will spoil f: it will no longer be an LNE, because it will now be an argument to llf_g."
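
The quoted situation can be sketched in source terms (a hand-written sketch, not from the patch; before, after and llf_g are hypothetical names):

```haskell
-- Before lifting: both 'f' and 'g' are only ever tail-called, so both
-- can be compiled as let-no-escape (join-point) bindings.
before :: Bool -> Int -> Int
before b e =
  let f x = x * 10
      g y = f (e + y)          -- 'g' tail-calls 'f'
  in case b of
       True  -> g 1
       False -> f 2

-- After lifting only 'g': 'f' is a free variable of 'g', so it becomes
-- an argument of 'llf_g'. Being passed as a value makes 'f' escape, so
-- it can no longer be compiled as a let-no-escape binding.
llf_g :: (Int -> Int) -> Int -> Int -> Int
llf_g f e y = f (e + y)

after :: Bool -> Int -> Int
after b e =
  let f x = x * 10
  in case b of
       True  -> llf_g f e 1
       False -> f 2
```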

  • I recall that you implemented a more-or-less complex analysis to try to get the good consequences without the bad ones. Is it worth sketching what the analysis does? The complexity here is my principal worry about the whole thing.
  • Small point: the $(theRHS 'a 'b) notation is more distracting than helpful. I'd use something more informal (..a...b...).

comment:3 Changed 3 years ago by simonpj

Description: modified (diff)

comment:4 Changed 3 years ago by bgamari

Cc: bgamari added
Description: modified (diff)

comment:5 Changed 3 years ago by bgamari

Description: modified (diff)

comment:6 Changed 3 years ago by bgamari

Wiki Page: LateLamLift

comment:7 Changed 3 years ago by simonpj

Description: modified (diff)

comment:8 Changed 3 years ago by thomie

Type of failure: None/Unknown → Runtime performance bug

comment:9 Changed 14 months ago by simonpj

Cc: kavon added

comment:10 Changed 3 months ago by sgraf

comment:11 Changed 3 months ago by sgraf

Cc: sgraf added

I rebased Nicholas Frisby's work here: https://github.com/sgraf812/ghc/tree/llf.

It doesn't currently pass the testsuite (even in default mode, without doing any LLF), probably because I introduced some errors while merging some decision code and taking care of join-point (then still called let-no-escape) information. In particular, the devel2 version triggers an ASSERT in add_id.

comment:12 Changed 3 months ago by simonpj

OK, great. Do keep us posted. I think there are real wins to be had here. Nicholas's detailed wiki page identifies many of the issues very clearly.

comment:13 Changed 3 weeks ago by sgraf

It took me quite some time, but this commit passes ./validate (modulo 4 compiler perf tests). Fixing the testsuite was rather simple, but investigating the various performance regressions to see which knobs we could turn is really time-consuming, so I figured I had better post now than never.

I updated the wiki page with a summary of changes I made. For completeness:

  • A hopefully faithful rebase, removing previous LNE (= join point) detection logic
  • Activate all LLF flags (see the above llf-nr10-r6 configuration) by default
  • Actually use the -fllf-nonrec-lam-limit setting
  • Don't stabilise Unfoldings mentioning makeStatic
  • Respect RULEs and Unfoldings in the identifiers we abstract over (previously, when SpecConstr added a RULE mentioning an otherwise absent specialised join point, we would ignore it, which is not in line with how CoreFVs works)
  • Stabilise Unfoldings only when we lifted something out of a function (Not doing so led to a huge regression in veritas' Edlib.lhs)

I'll attach nofib results in a following post. Here's the summary:

        Program         Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
--------------------------------------------------------------------------------
            Min         -20.3%    -20.3%     -7.8%    -16.5%
            Max          +2.0%     +1.6%    +18.4%    +18.4%
 Geometric Mean          -0.4%     -1.0%     +0.3%     -0.0%

llf is a plain benchmark run, whereas no-llf means libraries compiled with -fllf but benchmarks compiled with -fno-llf. This is a useful baseline, as it allows us to detect test cases where the regression actually happens in the test case itself rather than somewhere in the boot libraries.

Hardly surprisingly, allocations go down. More surprisingly, not in a consistent fashion. The most illustrative test case is real/pic:

        Program         Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
            pic          -0.0%     +1.0%     +0.0%     -3.4%

The lifting of some functions results in functions of rather high arity (6 and 7), which can no longer be fast-called. Apparently, there's no stg_ap_pppnn variant matching the call pattern.

Also, counted instructions went up in some cases, so there's no real win to be had. If I completely refrain from lifting non-recursive join points, things look better with respect to counted instructions:

        Program         Allocs    Allocs    Instrs    Instrs
                        no-llf     llf      no-llf      llf
--------------------------------------------------------------------------------
            Min         -20.3%    -20.3%     -3.4%    -17.1%
            Max          +2.0%     +1.6%     +6.4%     +6.4%
 Geometric Mean          -0.3%     -1.0%     +0.1%     -0.4%

But I recently questioned using cachegrind results (such as the very relevant counted memory reads/writes) as a reliable metric (#15333).

There are some open things that should be measured:

  • Is it worthwhile at all to lift join points? (Related: don't we rather want 'custom calling conventions' that inherit register/closure configurations for top-level bindings?)
  • Isn't a reduction in allocations a lie when all we did was spill more onto the stack? Imagine we lift a (non-tail-recursive) function to top level that would have arity > 5. Arguments would have to be passed on the stack for each recursive call. I'd expect that to be worse than the status quo. So maybe don't just count the number of free ids we abstract over, but rather bound the resulting arity?
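
The arity worry in the last point can be illustrated with a hand-written sketch (names are illustrative; the exact number of register-passed arguments depends on the platform's calling convention):

```haskell
-- A local loop with one parameter but five free variables. Before
-- lifting, the free variables live in a single heap-allocated closure,
-- and each recursive call passes just one argument.
sumLoop :: Int -> Int -> Int -> Int -> Int -> Int -> Int
sumLoop a b c d e n0 = go n0
  where
    go 0 = 0
    go n = a + b + c + d + e + go (n - 1)

-- After lifting, 'llf_go' has arity 6. Once the argument registers run
-- out, every recursive call pushes the remaining arguments onto the
-- stack, so the saving in heap allocation may not be a net win.
llf_go :: Int -> Int -> Int -> Int -> Int -> Int -> Int
llf_go _ _ _ _ _ 0 = 0
llf_go a b c d e n = a + b + c + d + e + llf_go a b c d e (n - 1)
```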

Finally, the whole transformation feels more like it belongs in the STG layer: we anticipate CorePrep in a very brittle way and have to pull really low-level details into the analysis, all while having to preserve unfoldings when we change anything. It seems like a very local optimization (except for enabling intra-module inlining opportunities) that doesn't enable many other core2core optimizations (nor should it; that's why we lift late).

comment:14 Changed 3 weeks ago by sgraf

plain base, no-llf, llf

same without lifting non-rec join points

By the way, is there some way around the 256KB upload limit?

Last edited 3 weeks ago by sgraf (previous) (diff)

comment:15 Changed 13 days ago by sgraf

I'm currently thinking about the pros and cons involved in (re-)implementing this as an optimization on STG rather than Core.

Pros:

  • Much less involved analysis that doesn't need to stay in sync with CorePrep
  • LLF only enables intra-module opportunities in the simplifier anyway
  • Runtime arguments are immediately visible, as are PAPs
  • Lifting to top-level only is orthogonal enough to full laziness that it might be worthwhile to split anyway
  • Integrating more low-level information might be more feasible (like calling convention, e.g. number of argument registers, or stg_ap_* patterns)

Cons:

  • The Simplifier might not be able to inline as much (which probably had limited effect anyway)
  • It's not enough to look at Core alone to gauge allocation (quite a bummer)
  • No Core Lint (but STG Lint, still)
  • Potentially we need to re-invent some utility functions we already have for Core
  • Perform the lifting by hand, rather than let FloatOut do the work

I'm really interested to hear opinions/objections about this!

Last edited 12 days ago by sgraf (previous) (diff)

comment:16 Changed 11 days ago by simonpj

Thoughts

  • There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases.
  • I don't think we should float join points at all, recursive or non-recursive. Think of them like labels in a control-flow graph.
  • I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting can affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by thinking about this.)
  • Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a try.
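
The inlining effect mentioned in the third point can be sketched as follows (hypothetical code, not from the branch): lifting the loop out of a function leaves only a tiny wrapper, which the simplifier can then inline at call sites.

```haskell
-- The lifted loop: this used to be a local 'go' inside 'search',
-- making 'search' too big to inline.
llf_go :: (a -> Bool) -> [a] -> Maybe a
llf_go _ [] = Nothing
llf_go p (y:ys)
  | p y       = Just y
  | otherwise = llf_go p ys

-- With its guts floated out, 'search' is a one-liner, so the
-- simplifier may inline it at every call site.
search :: (a -> Bool) -> [a] -> Maybe a
search p xs = llf_go p xs
```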

I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage.

When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about.

If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent.

comment:17 in reply to:  16 Changed 11 days ago by sgraf

Replying to simonpj:

Thoughts

  • There are a handful of spectacular reductions in allocation (queens, n-body). It'd be good to understand and explain them. Perhaps we can more closely target LLF on those cases.

It's hard to tell for n-body: The -fno-llf variant shows the same reduction in allocations, meaning that the reduction must happen somewhere in the base libraries rather than in n-body itself.

queens has its go1 function within gen lifted (arity 1 with 3 free vars). Non-lifted Core after CorePrep:

let {
  n_s5S3 [Occ=OnceL*] :: [[GHC.Types.Int]]
  [LclId]
  n_s5S3 = go_s5RY ys_s5S2 } in
case GHC.Prim.># 1# ww1_s5RX of {
  __DEFAULT ->
    letrec {
      go1_s5S5 [Occ=LoopBreaker]
        :: GHC.Prim.Int# -> [[GHC.Types.Int]]
      [LclId, Arity=1, Str=<S,U>, Unf=OtherCon []]
      go1_s5S5
        = \ (x1_s5S6 :: GHC.Prim.Int#) ->
            case Main.main_$ssafe y_s5S1 1# x1_s5S6 of {
              GHC.Types.False ->
                case GHC.Prim.==# x1_s5S6 ww1_s5RX of {
                  __DEFAULT ->
                    case GHC.Prim.+# x1_s5S6 1#
                    of sat_s5S9 [Occ=Once]
                    { __DEFAULT ->
                    go1_s5S5 sat_s5S9
                    };
                  1# -> n_s5S3
                };
              GHC.Types.True ->
                let {
                  sat_s5Se [Occ=Once] :: [[GHC.Types.Int]]
                  [LclId]
                  sat_s5Se
                    = case GHC.Prim.==# x1_s5S6 ww1_s5RX
                      of {
                        __DEFAULT ->
                          case GHC.Prim.+# x1_s5S6 1#
                          of sat_s5Sd [Occ=Once]
                          { __DEFAULT ->
                          go1_s5S5 sat_s5Sd
                          };
                        1# -> n_s5S3
                      } } in
                let {
                  sat_s5Sa [Occ=Once] :: GHC.Types.Int
                  [LclId]
                  sat_s5Sa = GHC.Types.I# x1_s5S6 } in
                let {
                  sat_s5Sb [Occ=Once] :: [GHC.Types.Int]
                  [LclId]
                  sat_s5Sb
                    = GHC.Types.:
                        @ GHC.Types.Int sat_s5Sa y_s5S1 } in
                GHC.Types.:
                  @ [GHC.Types.Int] sat_s5Sb sat_s5Se
            }; } in
    go1_s5S5 1#;
  1# -> n_s5S3
}

And when go1 is lifted to top-level, the CorePrep'd call site changes to

case GHC.Prim.># 1# ww1_s5Ts of {
  __DEFAULT ->
    let {
      sat_s5Tz [Occ=Once, Dmd=<L,1*U>] :: [[GHC.Types.Int]]
      [LclId]
      sat_s5Tz = go_s5Tt ys_s5Tx } in
    llf_go_r5SK y_s5Tw ww1_s5Ts sat_s5Tz 1#;
  1# -> go_s5Tt ys_s5Tx
}

The important detail is that n_s5S3 corresponds to sat_s5Tz in the lifted version. Both have a 'single' occurrence, but n_s5S3's occurrence is under a non-one-shot lambda (which apparently, and understandably, is enough for OccAnal to give up, as opposed to analysing parameters of recursive functions), so n_s5S3 is updateable and causes an increase in heap allocation, while sat_s5Tz is single-entry.

So I'm afraid this gain can't entirely be claimed by LLF. A late demand-analysis run would probably detect the same. There's cichelli, which shows similar improvements (-10% allocs) due to something happening in Prog.hs, but there are too many lifts to bisect, so I'll postpone that to a follow-up post.

  • I don't think we should float join points at all, recursive or non-recursive. Think of them like labels in a control-flow graph.

This was also what I thought, but there can be synergistic effects with the inliner if we manage to "outline" (that's what the transformation would be called in an imperative language) huge non-recursive join points, where the call overhead is negligible. At least that's what the wiki page mentions. But that brings me to the next point...

  • I think of LLF as a code-generation strategy, that we do once all other transformations are done. (Lambda-lifting can affect earlier optimisations. It can make a big function into a small one (by floating out its guts), and thereby let it be inlined. But that is subtle and difficult to get consistent gains for. Let's not complicate LLF by thinking about this.)

Yes, I wasn't sure whether having it play nice with the simplifier was intended here. I take comfort in your confirmation that it's a code-gen strategy.

  • Given that it's a code-gen strategy, doing it on STG makes perfect sense to me. You've outlined the pros and cons well. Definitely worth a try.

Will do.

I'm not sure what you meant by "It's not enough to look at Core alone to gauge allocation" as a disadvantage.

Maybe it's just me only slowly understanding every layer of compilation in GHC, but I felt like I could have a mental model of roughly where and how much things allocate after CorePrep. That's probably an illusion considering things like let-no-escapes (that story got better with join points, I suppose). Having another pass that transforms heap allocation into stack allocation *after* CorePrep isn't exactly helping that intuition.

Or were you thrown off by my using words I may have mistranslated ("gauge" for "estimate", roughly)?

When you say "Much less involved analysis that doesn't need to stay in sync with CorePrep", I think it would be v helpful to lay out "the analysis". I have vague memories, but I don't know what this "stay in sync" stuff is about.

There is this note that explains why it's necessary to approximate CorePrep. What follows it is the new analysis, which interleaves a free-variable analysis with the computation of usage information for ids that are free in the floated bindings.

If you do it in STG you don't need to explain "stay in sync", but explaining the analysis would be excellent.

Yes! I'm not really sure I understand all of it yet. (Re-)implementing it on STG should help me figure things out.

comment:18 Changed 11 days ago by simonpj

A late demand analysis run would probably detect the same.

-O2 does late demand analysis for this very reason. So presumably it doesn't catch it.

"It's not enough to look at Core alone to gauge allocation"

I agree that STG has a stronger operational model than Core. (Core is not bad, but STG is better.) But that's an advantage of doing LLF in STG, not a disadvantage.

Do have a go!
