Opened 6 months ago

Last modified 7 weeks ago

#15560 new bug

Full laziness destroys opportunities for join points

Reported by: AndreasK
Owned by: chessai
Priority: normal
Milestone: 8.10.1
Component: Compiler (CodeGen)
Version: 8.4.3
Keywords: JoinPoints
Cc: sgraf
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets: #14287 #13286
Differential Rev(s):
Wiki Page:

Description

Even if we already know a binding is a join point, we STILL float it to the top and turn it into a function.

The simple example below results in a join point after the first simplifier run. The float-out pass then runs and immediately undoes this by making it a top-level binding.

It then stays at the top level until we are done, resulting in the Core I've put in the comments below.

data T = A | B | C | D | E | F | G

{-# NOINLINE n #-}
n :: T -> T
n A = B
n B = C
n _ = A

f :: Int -> T -> T -> T
f sel x y =
    -- function large enough to avoid being simply inlined
    let j z = n . n . n . n . n . n $ z
    in case sel of
        -- j is always tailcalled
        0   -> j x
        _   -> j y

-- j is floated to the top level instead of ending up as a join point.
-- T.f_j
--   = \ (eta_B1 [OS=OneShot] :: T) -> n (n (n (n (n (n eta_B1)))))

-- -- RHS size: {terms: 14, types: 6, coercions: 0, joins: 0/0}
-- f :: Int -> T -> T -> T
-- f = \ (sel_aYP :: Int) (x_aYQ :: T) (y_aYR :: T) ->
--       case sel_aYP of { GHC.Types.I# ds_d2fL ->
--       case ds_d2fL of {
--         __DEFAULT -> T.f_j y_aYR;
--         0# -> T.f_j x_aYQ
--       }
--       }

Change History (24)

comment:1 Changed 6 months ago by simonpj

Generally, I think we should not float join points -- in general floating them will stop them being join points. It'd be easy to change SetLevels thus -- but intuition is a poor guide and it'd be really worth trying it and measuring the effect.

However, if the binding can go all the way to the top level, then there seems no downside to floating it: we end up with a smaller (and hence perhaps more inlinable) function; and the jump to the join point is still a nice, efficient jump. What's not to like? (Your Description suggests that you think that doing so is Bad. Why?)

All that said, there is a Huge Mess in SetLevels around join points with stuff about the "join ceiling". I want to expunge all that, but have lacked the time. The trouble is that we do want some limited, local floating -- I have a whole WIP tree pending on that, but it's in limbo.

If anyone would like to work on this, I'll commit my WIP to a branch and offer advice/support on taking it forward.

comment:2 Changed 6 months ago by AndreasK

However, if the binding can go all the way to the top level, then there seems no downside to floating it: we end up with a smaller (and hence perhaps more inlinable) function; and the jump to the join point is still a nice, efficient jump. What's not to like?

There are no top-level join points as far as I know. (I might be mistaken though?) Turning them into a top-level binding means the calling convention changes to the one used for regular functions. So we have the overhead of register saving, stack checks, a layout penalty, the works.

Limiting them to the top of the RHS instead seems like an advantage. Maybe that should already happen and I just hit a bug there? I haven't looked too far into the SetLevels machinery.

Not sure how this would affect inlining. Maybe we should instead try to push join-point-compatible bindings back into the RHS if they are not exported and are called from a single function.
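
For illustration, a sketch of what that could look like on the example from the description (f_j and f' are invented names, and the float-in pass is hypothetical, not something GHC does today):

-- After full laziness: j has become a top-level function, compiled
-- with the general calling convention.
f_j :: T -> T
f_j z = n (n (n (n (n (n z)))))

f :: Int -> T -> T -> T
f sel x y = case sel of
    0 -> f_j x
    _ -> f_j y

-- After a hypothetical late float-in pass: f_j is not exported and is
-- only called from f, so it is pushed back into f's RHS, where the
-- simplifier can rediscover it as a join point.
f' :: Int -> T -> T -> T
f' sel x y =
    let j z = n (n (n (n (n (n z)))))
    in case sel of
        0 -> j x
        _ -> j y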

Last edited 6 months ago by AndreasK

comment:3 Changed 6 months ago by simonpj

Turning them into a top-level binding means the calling convention changes to the one used for regular functions. So we have the overhead of register saving, stack checks, a layout penalty, the works

Can you be more specific? I think that the jumps to the join points will turn into jumps to the top-level function code, no arity checks, nothing. You could look at the code and see, but I think it'll be no less efficient.

comment:4 Changed 6 months ago by sgraf

Cc: sgraf added

comment:5 Changed 6 months ago by AndreasK

If j becomes a top-level binding we use the general calling convention, which at the assembly level is still a jump, as you said. However, there are subtle differences between jumping to a top-level binding and jumping into a basic block, and these can have a major performance impact.

Things I can immediately think of are:

  • If we jump to a top-level symbol we can't place the jump target immediately after the caller. This means we:
    • Can't eliminate one of the jump instructions, so they take up branch-prediction resources and need to be executed by the CPU.
    • The code won't be placed sequentially in memory, leading to worse cache utilization.
  • Top-level bindings require an additional info table compared to a regular jump target. This means more code size, which is never a good thing.
  • Being a top-level function that uses the stack, j now performs a stack check. For very small functions this can be a lot of overhead.

It's quite possible that in the general case more inlining offsets this cost, but in some cases this makes a major difference.

For example, the program below runs ~7% faster when full laziness is disabled (780 ms vs 730 ms).

--Simpler core to read without worker/wrapper
{-# OPTIONS_GHC -fno-full-laziness -fno-worker-wrapper #-}
{-# LANGUAGE MagicHash, BangPatterns #-}

module Main where

import System.Environment
import GHC.Prim

data T = A | B | C

-- If we inline the functions, case-of-known-constructor kicks in.
-- Which is good! But it means j becomes small enough to be inlined
-- and won't become a join point. So for this example we don't
-- want that.
{-# NOINLINE n #-}
{-# NOINLINE f #-}
n :: T -> T
n A = B
n B = C
n _ = A

toInt :: T -> Int
toInt A = 1
toInt B = 2
toInt C = 3

f :: Int -> T -> T -> T
f sel x y =
    -- function large enough to avoid being simply inlined
    let j z = n . n . n . n . n . n $ z
    in case sel of
        -- j is always tailcalled
        0   -> j x
        _   -> j y

main = do
    print $ sum . map toInt . map (\n -> f n A B) $ [0..50000000]

comment:6 Changed 6 months ago by simonpj

Wow, that's an impressively large effect. Thanks -- I had no idea. If you stare at the assembly code, can you guess which of your bullets is causing the effect here? E.g. do we in fact end up eliminating one of the jump instructions?

Your point about the stack check is a good one.

Info tables. Suppose a function is:

  • top level
  • not exported
  • always called with known, saturated calls

Then it does not need either slow entry code or an info table. So rather than avoid creating such functions (by not floating join points) maybe we should apply the optimisation uniformly to all top-level functions?
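
For concreteness, a minimal sketch of a binding meeting all three conditions (module and names invented):

module M (g) where  -- helper is not exported

-- helper is top level, not exported, and both calls below are known
-- and saturated, so under this idea it would need neither slow-entry
-- code nor an info table.
helper :: Int -> Int -> Int
helper a b = a * b + 1
{-# NOINLINE helper #-}

g :: Int -> Int
g x = helper x 2 + helper 3 x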

Hmm. What about heap checks? If a join point j does heap allocations, where do we do the heap-overflow check? Maybe we should absorb the heap allocation into the jump site (as if the code was inlined)? That could avoid doing two heap checks where only one is needed. (Would only work for non-recursive join points.)

Also I'm unclear about how we save live variables around a GC call at the start of a join point. (On function entry we use the function's info table; on a case return point we use its info table.)

comment:7 in reply to:  6 Changed 6 months ago by sgraf

Replying to simonpj:

Hmm. What about heap checks? If a join point j does heap allocations, where do we do the heap-overflow check? Maybe we should absorb the heap allocation into the jump site (as if the code was inlined)? That could avoid doing two heap checks where only one is needed. (Would only work for non-recursive join points.)

Also I'm unclear about how we save live variables around a GC call at the start of a join point. (On function entry we use the function's info table; on a case return point we use its info table.)

I was under the impression that join points never allocate and that they rather re-use the closure of their enclosing scope. Also, full laziness currently never floats join points (or any other bindings) that close over local variables to the top level, so we can probably disregard heap overflow checks.

comment:8 Changed 6 months ago by simonpj

I was under the impression that join points never allocate and that they rather re-use the closure of their enclosing scope

Correct. But consider

f x = let j y = Just (y,y)
      in case x of
           A -> j (True, False)
           B -> j (False, True)
           C -> j (True, True)

Suppose branch A is taken. Then GHC must allocate the object (True, False) and jump to j. Then j must allocate (y,y) and return it.

I suspect that there's a heap check at the beginning of the A branch; and then a second heap check at the start of the body of j. But instead we could make j not do a heap check (ever) and instead put on the caller (of j) the responsibility for making sure that there's enough heap space for j to do its allocation.

That way we'd get one heap check, at the beginning of the A branch, which would ensure there was enough space for both pairs.

(As you say, no closure is allocated for j itself, but that's an entirely different matter.)

comment:9 in reply to:  6 Changed 6 months ago by AndreasK

Replying to simonpj:

Wow, that's an impressively large effect. Thanks -- I had no idea. If you stare at the assembly code, can you guess which of your bullets is causing the effect here? E.g. do we in fact end up eliminating one of the jump instructions?

It's hard to tell, really. I've looked at VTune, and it seems the top-level variant has more cache misses and decoder stalls, so it seems to be primarily a matter of code size and worse code layout. We also execute about 2.5% more instructions, which will certainly come at a cost, but given the size of the difference I would expect layout/code size to be the larger factors.

Your point about the stack check is a good one.

Info tables. Suppose a function is:

  • top level
  • not exported
  • always called with known, saturated calls

Then it does not need either slow entry code or an info table. So rather than avoid creating such functions (by not floating join points) maybe we should apply the optimisation uniformly to all top-level functions?

I assume you are talking about removing the stack check here with the optimisation? Then we would still get the cache fragmentation/layout penalty. So ideally we would do both.

  • Remove the stack check if it's a floated join point called from many places to avoid code bloat.
  • Push them back into the RHS if there is only a single call site.

I suspect that there's a heap check at the beginning of the A branch; and then a second heap check at the start of the body of j. But instead we could make j not do a heap check (ever) and instead put on the caller (of j) the responsibility for making sure that there's enough heap space for j to do its allocation.

Ideally we would combine the checks for all branches into a single one to begin with. But that seems like a different issue to me, as this would be beneficial with or without join points.

comment:10 Changed 6 months ago by AndreasK

It seems this was changed deliberately in #13286.

The ticket does mention examples where join points becoming top-level functions improved things, but it doesn't contain any performance metrics to judge the impact.

I can see how it might be beneficial for exported functions, but I'm not yet convinced that this is better in general.

I've also looked at the Core output of the two testsuite files, and at least at -O2 there seems to be the same amount of floating happening with 8.0.2 and 8.4.

I also don't think we will get much benefit out of the original example. Looking at the floated code:

$wg_s6x5 [InlPrag=[0], Occ=LoopBreaker]
  :: Int# -> Int# -> Int# -> Int#
$wg_s6x5 (ww5_s6wV :: Int#) (ww6_s6wZ :: Int#) (ww7_s6x3 :: Int#) =
  case remInt# ww6_s6wZ 2# of {
    __DEFAULT ->
      case ww6_s6wZ of wild5_Xen {
        __DEFAULT ->
          jump $wg_s6x5
            (*# ww5_s6wV ww5_s6wV)
            (quotInt# (-# wild5_Xen 1#) 2#)
            (*# ww5_s6wV ww7_s6x3);
        1# -> *# ww5_s6wV ww7_s6x3
      };
    0# ->
      $wg_s6x5 (*# ww5_s6wV ww5_s6wV) (quotInt# ww6_s6wZ 2#) ww7_s6x3
  }

$wf_s6xg [InlPrag=[0], Occ=LoopBreaker] :: Int# -> Int# -> Int#
$wf_s6xg (ww3_X6Hi :: Int#) (ww4_s6xe :: Int#) =
  case remInt# ww4_s6xe 2# of {
    __DEFAULT ->
      case ww4_s6xe of wild3_Xe3 {
        __DEFAULT ->
          $wg_s6x5
            (*# ww3_X6Hi ww3_X6Hi) (quotInt# (-# wild3_Xe3 1#) 2#) ww3_X6Hi;
        1# -> ww3_X6Hi
      };
    0# -> $wf_s6xg (*# ww3_X6Hi ww3_X6Hi) (quotInt# ww4_s6xe 2#)
  }

GHC.Real.$w$s^1 [InlPrag=[0]] :: Int -> Int# -> Int#
GHC.Real.$w$s^1 =
  \ (w_s6xh :: Int) (ww_s6xl :: Int#) ->
    case tagToEnum# @ Bool (<# ww_s6xl 0#) of {
      False ->
        case ww_s6xl of wild1_XdK {
          __DEFAULT ->
            case w_s6xh of { I# ww2_s6xa -> $wf_s6xg ww2_s6xa wild1_XdK  };
          0# -> 1#
        };
      True -> case GHC.Real.^2 of wild1_00 { }
    }
  • For exponent < 0 we throw an exception, so we can probably ignore that case when it comes to performance.
  • For exponent == 0 there is an advantage IF GHC.Real.$w$s^1 gets inlined. But an exponent of zero seems like an unlikely case to me.
  • For exponents > 0 it depends heavily on how much gets inlined.
    • If everything gets inlined (essentially undoing the floating) we save nothing, as the unfloated variant would have been inlined as well.
    • If we inline GHC.Real.$w$s^1 and $wf_s6xg we save at best one call overhead if we don't jump into $wg_s6x5; otherwise we save nothing.
    • If nothing gets inlined we are worse off, as we now have at least as much overhead, if not more, should we jump into the floated functions.

On the bright side, the floated bindings work only on unboxed ints, so they might not cause an additional stack check.

comment:11 Changed 6 months ago by simonpj

Ideally we would combine the checks for all branches into a single one to begin with.

Do you mean that in

   case x of
     True -> e1
     False -> e2

instead of a heap-check at the start of e1 and another at the start of e2, we could have a single one before the case?

No, we can't do this: evaluating x might force a thunk, and hence allocate an arbitrary amount of stuff.

If the case is scrutinising an unlifted type (which does not require evaluation) then yes, it's different, and indeed in that case we sometimes do move the heap check up. See the long Note [Compiling case expressions] in StgCmmExpr.hs.
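
A minimal sketch (invented example) of the unlifted case: scrutinising an Int# forces nothing, so the branch's heap check can sometimes be hoisted above the case:

{-# LANGUAGE MagicHash #-}
import GHC.Exts

f :: Int# -> Maybe Int
f x = case x of   -- no evaluation needed: x is already an unboxed value
        0# -> Nothing        -- does not allocate
        _  -> Just (I# x)    -- allocates a boxed Int and a Just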

(Another possibility that looks unattractive, and that I have not explored: put the heap check after returning from evaluating x but before doing the case-analysis to decide which branch to take. That might reduce code size, but would never eliminate a heap check altogether; indeed it might put one in the code path that was not there before, for a branch that did not allocate at all.)

comment:12 Changed 6 months ago by simonpj

I assume you are talking about removing the stack check here with the optimisation?

I wasn't very clear. My idea is to ignore whether a top-level binding started life as a join point, and instead optimise all top-level bindings the same way.

There are three things in play:

  1. Eliminating the closure and info-table for some top-level functions. Consider a top-level binding
    f = \x y. e
    
    When can we do without a closure and info table for f? Answer: if
    • All calls to f are known, saturated calls (no partial applications).
    • f is not exported

NB 1: it's fine if the call to f is in a lazy context, e.g.

g y = let t = f (y+1) in Just t

NB 2: for a nested, let-bound function, we do need a function closure, even if all the calls are known, saturated calls. Why? Because we need a heap-allocated container for its free variables. So we need the info table. But if all the calls are known, saturated, we do not need any slow-entry code, so the code pointer for the info table could be "crash" (will never happen).

  2. Eliminating stack checks. Join points already don't have a stack check; regular let-bindings do (of course). Idea: extend the stack-check elimination to more functions; that is, absorb any stack use for that function into its call site. When can we do this? Answer:
    • All calls to f are known, saturated calls (no partial applications).
    • f is not exported
    • f is not recursive; or f is top-level and tail-calls itself (or is part of a top-level tail-recursive nest).

NB 1: this works for non-top-level functions too. Example:

g x = let f y = blah
          t = f 4 + f 5
      in Just t

Currently the thunk for t will do a stack check, and f will do its own stack check too. We could absorb f's check into t.

NB 2: why not recursive functions? Consider

   let f x = if x==0 then 0 else 1 + f (x-1)

Clearly the stack grows, so we can't eliminate the stack check. But if f is a join point there is no problem, and similarly at top level (i.e. not a join point) if it tail-calls itself or some other top-level function with the same property... let's call these "top-level join points".
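
A minimal sketch (invented names) of such a top-level tail-recursive nest:

-- isEven and isOdd only ever tail-call each other, so the stack does
-- not grow; under this proposal neither would need its own stack check.
isEven, isOdd :: Int -> Bool
isEven 0 = True
isEven n = isOdd (n - 1)
isOdd 0 = False
isOdd n = isEven (n - 1)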

  3. Eliminating heap checks. Idea: absorb the heap check for a function into its caller. When can we do this?
    • All calls to f are known, saturated calls (no partial applications).
    • f is not exported
    • f is not recursive. Reason: if it's recursive, one of its call sites is in its own RHS, so we can't statically bound how much heap it uses.

NB 1: absorbing the heap check is not currently done even for join points. (Reason: it only works for non-recursive functions.)

NB: None of these optimisations rely on f being a join point.

All of this is STG-level/code-gen level optimisation, not Core.

These would be interesting ideas to try out. If you feel motivated, I could advise.

comment:13 in reply to:  11 Changed 6 months ago by AndreasK

Replying to simonpj:

Ideally we would combine the checks for all branches into a single one to begin with.

(Another possibility that looks unattractive, and that I have not explored: put the heap check after returning from evaluating x but before doing the case-analysis to decide which branch to take. That might reduce code size, but would never eliminate a heap check altogether; indeed it might put one in the code path that was not there before, for a branch that did not allocate at all.)

The obvious solution, it seems to me, is to simply limit this to cases where all branches allocate (see the sketch after the list below). This would reduce code size while incurring few penalties. It would probably make sense to special-case two other scenarios:

  • If the difference in allocation is huge.
  • If the non-allocating branches are bottoming (e.g. pattern match failures).
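
A minimal sketch (invented example) of the basic shape: every branch allocates, so one shared check after forcing the scrutinee never burdens an allocation-free path:

g :: Int -> Bool -> Maybe (Int, Int)
g x b = case b of   -- b may be a thunk, so the combined check would go
                    -- after evaluation, just before selecting a branch
    True  -> Just (x, x + 1)  -- allocates a pair and a Just
    False -> Just (x + 2, x)  -- also allocates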

But I'm not sure how easy it is to check these conditions during StgToCmm code generation. I'm not really familiar with that part of the codegen yet.

I wasn't very clear. My idea is to ignore whether a top-level binding started life as a join point, and instead optimise all top-level bindings the same way.

That sounds like something we should do! Making them join points would still have additional advantages for code layout and possibly register allocation. So a late "float-in" pass of sorts, after we are done inlining, might still make sense.

But your suggested changes should still reduce the overhead of these calls quite a bit compared to what we have now.

These would be interesting ideas to try out. If you feel motivated, I could advise.

I'm interested as I might need these optimizations for other things I'm working on anyway.

I guess a good way to start would be to look into 1), starting at the StgToCmm code?

comment:14 Changed 6 months ago by simonpj

(comment:11) Another possibility that looks unattractive, ... The obvious solution, it seems to me, is to simply limit this to cases where all branches allocate

Perhaps, but I think the other 1,2,3 possibilities look as if they'll have more payoff per hour invested. Once they are done, maybe come back to this.

comment:15 Changed 6 months ago by simonpj

I guess a good way to start would be to look into 1), starting at the StgToCmm code?

Yes, sounds good to me!

comment:16 in reply to:  14 Changed 6 months ago by AndreasK

Replying to simonpj:

(comment:11) Another possibility that looks unattractive, ... The obvious solution, it seems to me, is to simply limit this to cases where all branches allocate

Perhaps, but I think the other 1,2,3 possibilities look as if they'll have more payoff per hour invested. Once they are done, maybe come back to this.

Hard to judge, given that the heap check issue also arises in code not using join points at all. But it certainly seems like an orthogonal issue. I will open a separate ticket for this so we don't get too much noise on this one. Edit: There is a similar enough ticket already: #8326

Last edited 6 months ago by AndreasK

comment:17 Changed 5 months ago by chessai

Milestone: 8.6.1 → 8.8.1
Owner: set to chessai

I am looking into this. Not sure if it's going to be better, but it would at least be good to have benchmarks for the differences over various mock setups and popular libraries.

comment:18 Changed 5 months ago by simonpj

I am looking into this.

Great -- thank you!

comment:19 Changed 4 months ago by AndreasK

@chessai Are you still interested in this?

comment:20 in reply to:  19 Changed 4 months ago by chessai

Replying to AndreasK:

@chessai Are you still interested in this?

Yes - I am working on https://ghc.haskell.org/trac/ghc/ticket/13104 as a sort of preparation. It's a simpler issue related to join points, in that it's less time-consuming and has a straightforward solution, as far as I can tell.

comment:21 Changed 3 months ago by chessai

Is it OK to make a joinpoints directory in testsuite/tests/? If not, where should tests for this go?

comment:22 Changed 3 months ago by AndreasK

There is already testsuite/tests/perf/join_points which you can use.

comment:23 Changed 3 months ago by chessai

Ah, I didn't look in perf/. Thanks.

comment:24 Changed 7 weeks ago by osa1

Milestone: 8.8.1 → 8.10.1

Bumping milestones of low-priority tickets.
