Opened 17 months ago

# Loopification using join points

Reported by: Owned by: nomeata nomeata normal Compiler 8.0.1 JoinPoints sanjitk@…, sgraf, kavon Unknown/Multiple Unknown/Multiple None/Unknown #13966 #14067 #14827 ​Phab:D3811

The function

```let f x y = case y of
A -> f x' y'
B -> e2
C -> e3
in g f
```

is not turned into a recursive join point, because the call to `f` is not in tail call position. But the recursive calls are, and these matter performance-wise! Hence, it would be beneficial to turn this into

```let f x y = joinrec \$j x y = case x y of
A -> \$j x' y'
B -> e2
C -> e3
in \$j x y
in g f
```

This has the additional effect that now `f` is no longer recursive and may get inlined.

The idea is described under "New idea: use join points" in Commentary/Compiler/Loopification.

Some notes:

• We should to this both at top level and for nested definitions.
• We can remove the "loopification" code from the code generator when this is done.
• It might work well with #13051, which Thomas Jakway is still thinking about.

### comment:1 Changed 17 months ago by nomeata

Owner: set to mpickering → #13966 #14067

### comment:2 Changed 17 months ago by nomeata

Just noting down some thoughts.

How does this work for mutually recursive bindings?

```let f x y = case y of
A -> f x' y'
B -> g (x', y')
C -> e3
g p = case p of (x, y) -> f x' y'
in f 1 2 + g 3 4
```

Here `f` and `g` look like they might be join points. One way of doing that would be

```let f x y = joinrec \$j1 x y = case y of
A -> \$j1 x' y'
B -> \$j2  (x', y')
C -> e3
\$j2 p = case p of (x, y) -> f x' y'
in \$j1 x y
g p   = joinrec \$j1 x y = case y of
A -> \$j1 x' y'
B -> \$j2  (x', y')
C -> e3
\$j2 p = case p of (x, y) -> f x' y'
in \$j2 p
in f 1 2 + g 3 4
```

but such code duplication is clearly not desired.

The next best thing I can think of is some encoding

```let f_g arg = joinrec \$j1 x y = case y of
A -> \$j1 x' y'
B -> \$j2  (x', y')
C -> e3
\$j2 p = case p of (x, y) -> f x' y'
in case arg of Left (x,y) -> \$j1 x y
Right p    -> \$j2 p
f x y = f_h (Left (x,y))
g p   = f_g (Right p)
in f 1 2 + g 3 4
```

(maybe using unboxed sums for the parameter encoding).

This would have the desired effect, but I guess it goes too far. So I conclude this ticket should focus on the singly recursive case.

### comment:3 Changed 17 months ago by simonpj

Description: modified (diff) JoinPoints added Turn tail-recursive functions into recursive jointpoints → Loopification using join points

### comment:4 Changed 17 months ago by simonpj

Yes, I agree: stick to the singly-recursive case.

Make sure we do this at top level too.

### comment:5 Changed 17 months ago by simonpj

Description: modified (diff)

### comment:6 Changed 17 months ago by nomeata

We can remove the "loopification" code from the code generator when this is done.

Ok, so I remembered correctly that, in the end, such code already today produces tight loops. So there are not immediate performance benefits to be gained from this transformation, because we just do loopification earlier, but possible knock-on effects due to inlining or (eventually) joinrec-SAT, right?

### comment:7 Changed 17 months ago by simonpj

Yes, there are immediate perf benefits: #13966 is one such case.

### comment:8 Changed 17 months ago by nomeata

Owner: changed from mpickering to nomeata

Some preliminary work in `wip/T14068` resp D3811.

Last edited 17 months ago by nomeata (previous) (diff)

### comment:9 Changed 17 months ago by simonpj

Differential Rev(s): → Phab:D3811

### comment:10 Changed 15 months ago by simonpj

What's the status here Joachim?

### comment:11 Changed 15 months ago by bgamari

Joachim has said that Phab:D3903 is ready for review.

Last edited 15 months ago by bgamari (previous) (diff)

### comment:12 Changed 15 months ago by bgamari

Differential Rev(s): Phab:D3811 → Phab:D3903

Phab:D3811 has been abandoned in favor of Phab:D3903.

### comment:13 Changed 15 months ago by bgamari

Differential Rev(s): Phab:D3903 → Phab:D3811

Never mind, Phab:D3811 was broader in scope than just the exitification work in Phab:D3903. Only Joachim knows the status of the broader loopification work.

### comment:14 Changed 15 months ago by nomeata

I’d like to get ​Phab:D3903 in first, and then I will come back to this one.

### comment:15 Changed 15 months ago by simonpj

Fair enough. Let's get on with Phab:D3903 (i.e. #14152) then. I've commented.

### comment:16 Changed 15 months ago by simonpj

Description: modified (diff)

### comment:17 Changed 12 months ago by simonpj

Description: modified (diff)

### comment:18 Changed 12 months ago by simonpj

Description: modified (diff)

### comment:19 Changed 11 months ago by simonpj

Joachim, where are we on loopification? I think you have `wip/T14068` with this commit

```commit b4ab3a5f1fa051be9c5689f7ecef16458b2d700d
Author: Joachim Breitner <mail@joachim-breitner.de>
Date:   Fri Aug 4 15:34:11 2017 -0400

Prevent inlining of loopified programs

Previously, a recursive function is not inlineable. After loopification,
it turns into a non-recursive function, and suddenly it is. While this
is in general desirable, it has many knock-on effects, which makes it
hard to evaluate and debug loopification. Therefore, this commit (tries to)
prevent this inlining. When this results in no unfixable regressions,
then we can tackle the next step.

It is surprisingly hard to reliably prevent inlining, it seems, so I
have been playing whack-a-mole a bit:
* simpl_binds has two copies of the ids around, one in the env and one
in the AST. If maybeLoopify changes only one of them, then things go wrong.
Worked-around that for now, but probably not ideal.
TODO: Apply maybeLoopify before entering simplTopBinds
* Also, worker-wrapper needs to preserve the no-inlining better
```

But I think it's in limbo? Or has it been overtaken by events?

### comment:20 Changed 11 months ago by nomeata

Loopification per se works, but it causes huge performance swings in both directions; not because of loopification itself, but because other parts of the compiler now treat the code differently. The current (disheartening) stats are at https://perf.haskell.org/ghc/#compare/af0aea9c3d5f68f2694bd7b6380788764aa3f1ff/f04fdcbc51fffa36619157defb105dae461da4b7 (this URL might stop working in the future when `wip/T14068` is rebased.)

Without “Prevent inlining of loopified programs”, we now start inlining stuff that used to be recursive, which is not always a win, it seems. With this patch, we prevent that, but now other things don’t work as well as they used to.

I lost steam in the fall tracking down all the regressions, but I now have a master student at Penn, Sanjit Kalapatapu, who is helping to track down them. Currently, he is looking into `SpecConstr`, which seems to stopped doing its thing (maybe because of the no-inline marker that we add…). I hope that with him there will be progress again, but I expect it to be slow progress.

### comment:21 Changed 11 months ago by simonpj

OK great. If you can do it on a wip branch then I can readily check it out, which will make it easier for me to offer suggestions if you need any discusison. Thanks!

### comment:22 Changed 11 months ago by nomeata

Well, `wip/T14068` is what we have, and you are very welcome to dabble with it, or even push to it.

### comment:23 Changed 11 months ago by simonpj

Terrific; I doubt I'll dabble much, but am very open to discussion

### comment:24 Changed 10 months ago by simonpj

Related Tickets: #13966 #14067 → #13966 #14067 #14827

### comment:25 Changed 10 months ago by nomeata

Just had a brief look, just to remind myself of the issues here, and compared the Core of `x2n1`. With `-O`, I see loopification happening (yay!) and no allocation changes (expected). But with `-O -fspec-constr`, I see the big increase that is also reported by perf.haskell.org.

It might just be an unwanted side effect of me trying preventing inlining of loopification, and might go away when I release this break – but then other things happen. Well, I should test that hypothesis.

### comment:27 Changed 10 months ago by nomeata

Here is some data about `x2n1`:

```master   loopification  flags
1017640  1017688        -O
1017640  1017688        -O -fliberate-case
378248   378296        -O -fspec-constr
378248   698264        -O -fspec-constr -fliberate-case (== -O2)
```

So liberate case + loopification breaks parts of spec-constr…

### comment:28 Changed 10 months ago by nomeata

Is it possible that this happens (using the example from the top of `SpecConstr`)?

Before:

```Rec {
drop n xs = case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ -> drop (I# (n# -# 1#)) xs
}
```

here SpecConstr kicks, creating a rule and specializing, yielding the nice

```Rec {
\$sdrop n# xs = case xs of
[]     -> []
(y:ys) -> case n# of
0# -> []
_ -> \$sdrop (n# -# 1#) xs
}
RULE drop (I# n) xs = \$sdrop n xs
```

```drop n xs =
joinrec j n xs = case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ -> call j (I# (n# -# 1#)) xs
in j n xs
```

(yay!), and now we get (with liberate case? something else?) we get (I just checked)

```drop n xs =
case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ ->
joinrec j xs n# = case xs of
[]     -> []
(y:ys) -> case n# of
0# -> []
_ -> call j xs (n# -# 1#)
in j n (y:ys)
```

but now spec-constr no longer kicks! It’s comments say (abbreviated):

So we look for a self-recursive function AND at a recursive call, one or more parameters is an explicit constructor application AND that same parameter is scrutinised by a case somewhere in the RHS of the function

This used to be true for the original, recursive `drop`, but not for the loopified: `drop` is not recursive, `j` does not scrutinize the parameter.

So we don’t create a specialization for `drop`, causing extra allocations when there are calls to `drop (I\$ n)` somewhere.

Now in some cases this might be ok, namely when we can inline `drop` (which is no longer recursive). But `drop` contains this big `joinrec`, so it’s too big to be inlined?

Last edited 10 months ago by nomeata (previous) (diff)

### comment:29 Changed 10 months ago by simonpj

now we liberate case can do its work

Are you sure? I don't think liberate case applies here, though I have not tried it out.

ow spec-constr no longer kicks!

Why not? It should apply to the recursive function j.

### comment:30 Changed 10 months ago by nomeata

I am trying it out right now… will report back.

Are you sure? I don't think liberate case applies here, though I have not tried it out.

Hmm, you are right, this example does not work. `n` is not a free variable variable of the recursive function. I will try to construct a better example (or find out what really happens with `x2n1`).

### comment:31 Changed 10 months ago by nomeata

I am trying it out right now… will report back.

Ok, it is independent of liberate case, but otherwise the result is as I described: Without loopification we get a specialization for `drop`, with loopification we do not.

### comment:32 Changed 10 months ago by simonpj

Without loopification we get a specialization for drop, with loopification we do not.

I wonder why. Are you happy to pursue, or would you like help?

### comment:33 Changed 10 months ago by nomeata

Without loopification we get a specialization for drop, with loopification we do not.

I wonder why. Are you happy to pursue, or would you like help?

It seems rather clear to me *why*: We get this code for `drop`, which does not satisfy the conditions for SpecConstr; in particular, `drop` is not recursive (SpecConstr only creates specializations for recursive functions):

```drop n xs =
case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ ->
joinrec j xs n# = case xs of
[]     -> []
(y:ys) -> case n# of
0# -> []
_ -> call j ys (n# -# 1#)
in j ys (n# -# 1#)
```

But I think I need your guidance as to what we should do about it!

Should we simply enable SpecConstr even for all non-recursive functions? (My gut feeling is that this will blow up code size and compilation times too much.)

Maybe we need to float the `joinrec … in …` out out of `drop`, so that the remaining code in `drop` (which does the case analysis on `n`) is small enough to be always inlined?

### comment:34 Changed 10 months ago by nomeata

JFTR, I just compared `master` with my branch with SpecConstr disabled, and all allocation regressions but one (in `queens`, not yet investigated) did go away.

So the issue of SpecConstr vs. loopification seems to be the main road blocker here.

### comment:35 Changed 10 months ago by simonpj

We get this code for drop, which does not satisfy the conditions for SpecConstr; in particular, drop is not recursive (SpecConstr only creates specializations for recursive functions):

And rightly so! The new code for `drop` look splendid. The first iteration unboxes the boxed parameter `n`, and then the `joinrec` is a tight inner loop that races down the list; it's pretty much identical to the `\$sdrop` we got before. So why is this worse than the previous version?

It should be easy enough for me to repro this even without your loopification patch, by manually writing the post-loopification code. If you can post that code here I'll be in a better position to help. Thanks!

### comment:36 Changed 10 months ago by nomeata

So why is this worse than the previous version?

Because if `drop` itself is in an inner loop, something like `map (\n -> drop n xs) [0..1000]`, then we have to allocate a box for the the `I#` to wrap the argument.

Previously, `drop` would have a specialization rule, and the calls to `drop` would be replaced by a call to `\$sdrop`.

But: Maybe we should simply not worry: For `x2n1`, allocation goes up by 80%, but the instruction count stays the same (+0.05%). I guess that the extra `I#` constructor needed to allocate is consumed pretty quickly, and never copied by the GC…

But again: This line of reasoning doesn’t cut it for other benchmarks. For example `treejoin`’s allocation goes up by 27%, and the instruction count by 72%(!). With `-f-no-spec-constr`, there is no difference in either number.

Last edited 10 months ago by nomeata (previous) (diff)

### comment:37 Changed 10 months ago by simonpj

Because if drop itself is in an inner loop, something like map (\n -> drop n xs) [0..1000], then we have to allocate a box for the the I# to wrap the argument.

Ah, well that would also be true of any non-recursive function that was called in this way.

I think you are claiming that all the extra allocation is due to boxing the arguments to a non-recursive function that is called repeatedly. That seems surprising, but perhaps it is true.

If so, then I suppose a possible fix is to specialise non-recursive functions too. Maybe that would be good in general; I'm not sure. Might be worth trying...

### comment:38 Changed 10 months ago by nomeata

If so, then I suppose a possible fix is to specialise non-recursive functions too. Maybe that would be good in general; I'm not sure. Might be worth trying...

Let’s do this in #14844 and then come back here.

### comment:39 Changed 10 months ago by sgraf

As to why

```drop n xs =
joinrec j n xs = case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ -> call j (I# (n# -# 1#)) xs
in j n xs
```

specializes to

```drop n xs =
case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ ->
joinrec j xs n# = case xs of
[]     -> []
(y:ys) -> case n# of
0# -> []
_ -> call j ys (n# -# 1#)
in j ys (n# -# 1#)
```

There is the call pattern `j (I# (n# -# 1#)) xs`, also `j` scrutinizes `n`. So SpecConstr makes something like this:

```drop n xs =
joinrec
j n xs = case xs of
[]     -> []
(y:ys) -> case n of
I# n# -> case n# of
0# -> []
_ -> call j# (n# -# 1#) xs

j# n# xs = case xs of
[]     -> []
(y:ys) -> case n# of
0# -> []
_ -> call j# (n# -# 1#) xs
in j n xs
```

Now `j` is not recursive and is inlined and we get the resulting code.

I wonder why `drop` isn't inlined in your example. That would surely make the allocation go away? Are there multiple calls to `drop` that obstruct inlining?

And if so, do they share a common call pattern (e.g. `j (I# n) xs`)? Then that's one of the cases non-rec specialisation might be beneficial.

### comment:40 Changed 10 months ago by nomeata

I wonder why drop isn't inlined in your example. That would surely make the allocation go away? Are there multiple calls to drop that obstruct inlining?

I am only using `drop` as an example here, the real thing might be much larger; too large to be inlined. I’ll have to look.

And if so, do they share a common call pattern (e.g. j (I# n) xs)? Then that's one of the cases non-rec specialisation might be beneficial.

Yes, that is the hope!

### comment:42 Changed 9 months ago by kavon

While working on my own version of this patch (without knowing about this work yet) I noticed that we may not want to perform the loopification transformation too early, as specialization and other optimizations end up turning a binder marked as `RecursiveTailCalled` into a a full join point, and we end up with better code.

Thus, my plan was to implement it as a separate Core pass to figure out at what point(s) during optimization to perform loopification.

### comment:43 Changed 9 months ago by nomeata

Hi Kavon,

nice to see someone else working on this. With the ICFP deadline out of the way, I want to pick this up again.

as specialization and other optimizations end up turning a binder marked as RecursiveTailCalled into a a full join point, and we end up with better code.

Can you elaborate?

I noticed that the most regressions due to loopification are due to ConstSpec only targetting recursive things. Do you observe that as well?

So the plan would be to experiment with a specializer that works also for non-recursive functions. Do you want to try that out?

### comment:44 Changed 9 months ago by simonpj

So the plan would be to experiment with a specializer that works also for non-recursive functions.

That idea is pursued in #14844.

But I'd still like to be be sure that this is truly the reason for the regression. We have comment:27 which shows a regession for `x2n1` (is it the only regression?). Then comment:28 says "maybe it's this". But we have no concrete evidence for what the `x2n1` regression really is. It'd be good to know.

### comment:45 Changed 9 months ago by nomeata

I am now confident that this is it, together with #14844 and #14951, because only if I fix both these issues, I get the same low allocation number for `x2n1`. Also, the example in #14951 is distilled directly from `x2n1`, and ticket:14844#comment:6 is directly from `x2n1`.

Last edited 9 months ago by nomeata (previous) (diff)

### comment:46 Changed 9 months ago by nomeata

And, to add numbers to this: Running SpecConstr on top-level non-recursive functions and running it twice (with simplification in between) improves `x2n1` allocations by 45%, compared to just loopification, and fixes some of the other regressions: https://perf.haskell.org/ghc/#compare/9245c4bbc2156b3b84f253c97cc2ee8bd8b7dd98/0f1fee6be3df20837543ede7223a827abb6a4759.

Overall, the whole branch has a few promising improvements of ~7%, but also still some egregious regressions that need to be tracked down (`queens` +240%). And it is of course not immediately clear which of the improvements are due to loopification, and which are independent of loopification and due to the changes to SpecConstr. But in any ways we need to conclude the SpecConstr story first and see what, if anything at all, we want to change there.

### comment:47 Changed 9 months ago by nomeata

More data. #14951 fixes some regressions that loopification introduces:

```name                     previous    change    now
nofib/allocs/last-piece  642047800  - 11.08%    570920968  bytes
nofib/allocs/x2n1           697240  - 45.89%       377272  bytes
```

But it does not fix all the regressions that can be fixed by running SpecConstr twice (with simplification in between):

```Benchmark                 previous  change     now
nofib/allocs/constraints  20491632       -      3.63%    19747392   bytes
nofib/allocs/event        129682800      +      8.44%    140626888  bytes
nofib/allocs/fulsom       243329208      -      7.83%    224287496  bytes
nofib/allocs/integer      40633632       -      3.06%    39389560   bytes
nofib/allocs/last-piece   642047800      -     11.49%    568297080  bytes
nofib/allocs/mandel2      1649568        -     26.78%    1207888    bytes
nofib/allocs/minimax      5371584        -      8.73%    4902576    bytes
nofib/allocs/parstof      3023208        -      3.7%     2911384    bytes
nofib/allocs/x2n1         697240         -     45.89%    377272     bytes
```

Which raises the question: Should we just go the easy route and run SpecConstr twice? (I am measuring the effect of that on master, independent of loopification, in branch `wip/T14951-blunt`.)

Last edited 9 months ago by nomeata (previous) (diff)

### comment:48 Changed 9 months ago by kavon

I've been playing around with `wip/T14068-inline`, and the behavior I was seeing occurs with this simple example:

```countDown n = case n of
0 -> 0
n -> countDown (n - 1)

... countDown 10 {- non-tail call -} ...
```

countDown satisfies our `RecursiveTailCalled` requirement right out of the gate, and is marked as such by OccurAnal.

With loopification turned off, the first simplification pass will give us:

```countDown :: forall t p. (Eq t, Num t, Num p) => t -> p
[LclIdX,
Arity=4,
Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
WorkFree=True, Expandable=True,
Guidance=IF_ARGS [30 90 30 0] 550 0}]
countDown
= \ (@ t_a2HQ)
(@ p_a1vP)
(\$dEq_a2OT :: Eq t_a2HQ)
(\$dNum_a2OU :: Num t_a2HQ)
(\$dNum_a2OV :: Num p_a1vP)
(eta_B1 :: t_a2HQ) ->
letrec {
countDown_a1vH [Occ=LoopBreaker] :: t_a2HQ -> p_a1vP
[LclId, Arity=1]
countDown_a1vH
= \ (n_aX3 :: t_a2HQ) ->
case ==
@ t_a2HQ \$dEq_a2OT n_aX3 (fromInteger @ t_a2HQ \$dNum_a2OU 0)
of {
False ->
countDown_a1vH
(- @ t_a2HQ \$dNum_a2OU n_aX3 (fromInteger @ t_a2HQ \$dNum_a2OU 1));
True -> fromInteger @ p_a1vP \$dNum_a2OV 0
}; } in
countDown_a1vH eta_B1
```

Thanks to the eta-expansion, `countDown_a1vH` can now be turned into a full join point. Thus, we didn't gain anything by performing loopification so early, though we also lose nothing since we end up with the same code after some clean up... just a small observation.

A more troubling behavior I've spotted happens in `queens`, where we end up with some good looking code after we loopify and then inline the once-called `safe`, but all of that progress seems to be undone by FloatOut. Simplification will then redo all of that work. I wonder if it would be more efficient to use loopification after outward let-floating?

Also: I think I've narrowed down where the slowdown comes from in `queens`. I'll post the details in a later comment since I need to be up early tomorrow.

### comment:49 Changed 9 months ago by simonpj

Can you explain more precisely why float-out un-does good work?

I'm not sure whether float-out does or does not float join bindings. I suspect it should not?

### comment:50 Changed 9 months ago by kavon

Here are the details on FloatOut: Using `wip/T14068-inline`, I compiled `queens` with `-O2 -ddump-occur-anal -ddump-simpl-iterations -ddump-call-arity -dverbose-core2core` to watch every Core transformation, and by the time we reach Specialize, we have already nicely inlined the "safe" loop (`safe_s5vS`) into the folding function:

```...  inside gen ...
GHC.Base.foldr
@ [Int]
@ a_d3jr
(\ (ds_d3jv :: [Int]) (ds_d3ju [OS=OneShot] :: a_d3jr) ->
GHC.Base.foldr
@ Int
@ a_d3jr
(\ (ds_d3jx :: Int)
(ds_d3jw [OS=OneShot] :: a_d3jr) ->
case joinrec {
safe_s5vS [Occ=LoopBreaker]
:: Int -> Int -> [Int] -> Bool
[LclId[JoinId(3)], Arity=3]
safe_s5vS (x_X14P :: Int)
(d_X14R :: Int)
(ds_X3kg :: [Int])
= ... BODY OF SAFE ... } in
jump safe_s5vS
ds_d3jx (GHC.Types.I# 1#) ds_d3jv
of {
False -> ds_d3jw;
True ->
c_d3js
(GHC.Types.: @ Int ds_d3jx ds_d3jv) ds_d3jw
})
```

But right after Specialize, SetLevels tells FloatOut to move that join point to the top level:

```  GHC.Base.foldr
@ [GHC.Types.Int]
@ a_d3jr
(\ <ds_d3jv,<2,0>> <ds_d3ju,<2,0>> ->
GHC.Base.foldr
@ GHC.Types.Int
@ a_d3jr
(\ <ds_d3jx,<3,0>> <ds_d3jw,<3,0>> ->
case letrec {
<safe_s5w0,F<0,0>>
<safe_s5w0,F<0,0>>
= \ <x_X14P,<1,0>>
<d_X14R,<1,0>>
<ds_X3kg,<1,0>> -> ...
```

So we end up with an ordinary top-level Rec:

```  Rec {
safe_s5w0 [Occ=LoopBreaker] :: Int -> Int -> [Int] -> Bool
[LclId, Arity=3]
safe_s5w0
= \ (x_X14P :: Int) (d_X14R :: Int) (ds_X3kg :: [Int]) -> ...

end Rec }

... inside gen ...
GHC.Base.foldr
@ Int
@ a_d3jr
(\ (ds_d3jx :: Int)
(ds_d3jw [OS=OneShot] :: a_d3jr) ->
case safe_s5w0 ds_d3jx lvl_s5w1 ds_d3jv of {
False -> ds_d3jw;
True ->
c_d3js
(GHC.Types.: @ Int ds_d3jx ds_d3jv) ds_d3jw
})
...
```

We end up recovering the good code that existed before FloatOut after two further iterations of Simplification.

FloatOut does try to float join points (see the Note [Join points] in its file). Perhaps there's something wrong with the "join ceiling" (Note [Join ceiling]) implementation in SetLevels?

Last edited 9 months ago by kavon (previous) (diff)

### comment:51 Changed 9 months ago by simonpj

Yes, the entire join-ceiling machinery is wonky. I have a long-gestated patch about this, not yet completed alas, but this conversation may help me to make progress on it.

I think float-out should not be moving join points at all -- except perhaps if they can go all the way to top level. Why top level? Well

• Then the function it was part of before becomes smaller, and hence may inline.
• At top level there is no closure to allocate.

I suppose that if lifted to top level then it might be loopified afresh. Question: how beneficial is it to loopify a top-level function? Danger: once loopified, it becomes non-recursive and might be inlined again -- which would be funny but not very productive.

### comment:52 follow-up:  53 Changed 9 months ago by nomeata

Question: how beneficial is it to loopify a top-level function?

Well, we still want to use jumps for the recursive calls, rather than normal function calls, even if it is top-level, right?

### comment:53 in reply to:  52 Changed 9 months ago by kavon

Question: how beneficial is it to loopify a top-level function?

Well, we still want to use jumps for the recursive calls, rather than normal function calls, even if it is top-level, right?

In the end, they'll still be emitted as jumps since they're tail calls. Considering the transformation in isolation (i.e., ignoring knock-on effects like inlining), using join-point throws instead of tail-recursive calls theoretically allows us to cheapen the iteration overhead in the following ways:

1. For joinrecs whose RHS contains a non-tail call, we can avoid a stack check and stack pointer bumps on each iteration, since the join continuation can keep reusing the stack frame setup on the initial entry to the function. This depends on whether StackLayout in Cmm is optimized to do this.
1. Optimizing argument-passing, such as by moving static arguments out of the recursive throws, spilling rarely used arguments in high-pressure loops, or allowing code generation to pick registers with a smaller instruction encoding (which LLVM loves to do for x86_64).
1. Aligning a hot loop's header. Many x86_64 CPUs prefer 16-byte aligned jump targets, but because we add info tables just before a function label, the alignment of a function's body may only be 8-byte aligned. Code generators can more easily align the target of a join-point throw since it is less likely to have info table attached to it.

In summary, these are either micro-optimizations or not currently implemented, so they may not make a difference. It's worth checking to see what happens in nofib when we leave the join point at the top level, as #13286 's discussion focused on only a few examples.

Last edited 9 months ago by kavon (previous) (diff)

### comment:54 Changed 9 months ago by simonpj

In the end, they'll still be emitted as jumps since they're tail calls.

Probably more important: for C and LLVM we can convert them into genuine loops, which the C/LLVM compiler will do a much better job of.

### comment:55 Changed 7 months ago by nomeata

Status update for those who understandably don’t want to read the whole thread:

There is working code for loopification on `wip/T14068`. If one disables SpecConstr and compares this branch to master, then there is only one allocation regression (in `queens`, not investigated). However, with SpecConstr (i.e. the default), many nofib tests regress.

I have paused working on loopification until we have improved SpecConstr to yield equivalently good results for the loopified code. Relevant tickets are #14844 (SpecConstr should also apply to top-level non-recursive bindings) and #14951 (SpecConstr is not idempotent; in other words: a stack of deeply nested functions do not get specialized in one go). If these were fixed we would get rid of some of the SpecConstr related regressions introduced by loopification – but still not all of them, I fear, and there will be more to be tracked down.

### comment:56 Changed 7 months ago by simonpj

If these were fixed we would get rid of some of the SpecConstr related regressions introduced by loopification – but still not all of them, I fear, and there will be more to be tracked down.

But these are good things to track, more generally. A program might be loopified by hand, so making loopification be non-regressive should benefit all programs.

Another possible thing to try is to loopify after SpecConstr; that is, treat loopificaiton as an optimisation whose main payoff is in the back end. (One reason I am keen to move forward with loopificaion is because then we can remove some rather ad-hoc loopification code in the code generator.) But I don't have a clear picture of where the wins from loopification come. Maybe there are benefits earlier in the pipeline.

Note: See TracTickets for help on using tickets.