Apparently idiomatic code like forM_ [1.._N] does not get fused away.
This can give serious performance problems when unnoticed.
-- Slow:forM_[0.._N-1]$\i->do...-- Around 10 times faster:loop_N$\i->do...{-# INLINE loop #-}loop::(Monadm)=>Int->(Int->m())->m()loopbexf=go0wherego!n|n==bex=return()|otherwise=fn>>go(n+1)
People on #ghc think that it is because the [1.._N] gets floated out as a top-level constant expression:
nomeata: nh2: I’d consider that a bug
nomeata: I believe the problem is that [0..512] does not depend on any local values
nomeata: so it is floated out as a top-level value
nomeata: and there it is not matched by any rules
thoughtpolice: let floating strikes again
thoughtpolice: (well, floating-out being a non-optimization, anyway.)
Fuuzetsu: does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once?
hvr: Fuuzetsu: there's a chance, yes
Fuuzetsu: neat
thoughtpolice: well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps)
thoughtpolice: unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen.
A comment in SetLevels (which I just came across) in the code indicates that this problem should have been taken care of:
-- We are keen to float something to the top level, even if it does not -- escape a lambda, because then it needs no allocation. But it's controlled -- by a flag, because doing this too early loses opportunities for RULES -- which (needless to say) are important in some nofib programs -- (gcd is an example).
So either my assumption is wrong, or this does not work as desired.
It turns out that this comment is obsolete; the flag is never set. I quote from SimplCore
-- Was: gentleFloatOutSwitches -- -- I have no idea why, but not floating constants to -- top level is very bad in some cases. -- -- Notably: p_ident in spectral/rewrite -- Changing from "gentle" to "constantsOnly" -- improved rewrite's allocation by 19%, and -- made 0.0% difference to any other nofib -- benchmark
This comment was introduced in eaeca51e (2009) by SPJ.
Maybe rules matching should look though unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing [0..N] in a rule application whose purpose is to eliminate that list.
Regarding your suspicion that it gets floated out as a constant, I don't see an improvement when getting the upper bound m of [1..m] from an IO action. What do you think?
There is an example I made for this, mentioned in the bug description.
The performance I measure for that is:
using forM_ with ghc -O: 2.0 s
using loop with ghc -O: 1.6 s
using forM_ with ghc -O2: 0.9 s
using loop with ghc -O2: 0.3 s
using forM_ with ghc -O2 -fllvm: 0.75 s
using loop with ghc -O2 -fllvm: 0.15 s
I tried to make an even smaller benchmark (https://gist.github.com/nh2/11333427) but the performance is identical there although the same thing changes as before.
Could you try my two benchmarks and see if you get the same behaviour?
I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).
(Note that measuring the simple example with -fllvm is pointless because it can optimize it away completely. It can't do that with the matmult though.)
I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).
The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed Int#s, at least with -O2. In the matmult example, you need to conceal the fact that both lists are the same from GHC to get a loop on unboxed Int#s.
So in principle, GHC can do the desired thing, just the sharing gets in the way. Can somebody think of a situation where sharing is beneficial for forM_ [a .. b] $ \n -> do ... code? If not, perhaps special-casing enumFromTo arguments for forM_ etc. is, at least for standard integer types, something to be considered.
Preventing it from sharing sounds sensible for me: If the [a .. b] was something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level.
I reported that my manually written loop is much faster than forM_ [1..n] in some cases, suggesting that in some cases optimizing the list away doesn't work well.
nomeata said some technical things that are a bit beyond me.
I submit two benchmarks in the gist at https://gist.github.com/nh2/11333427, a "matmult" benchmark where there is a big difference between forM_ and the hand-written loop, and a "simple" benchmark where they are equally fast.
Daniel suspects the slow case comes from using the same syntactical list twice, and that in this case GHC floats it out to share it, which breaks eliminating it. He suggests we might special-case enumFromTo when used with forM_ to prevent it.
I give a counter example for his suspicion, by changing my "simple" benchmark, where using the same list twice gives the same good performance as using it once.
Slight correction, @Niklas, it's not a suspicion that it's the floating out of the list to the top-level and consequently the use of the list for looping instead of unboxed Int#s, that is direct from the core (-ddump-simpl is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody knowing how GHC works.
I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process.
Here is a test case that does not require criterion. With ghc-8.0.1 -O2 -ddump-simpl, you can see that foo is compiled into code that uses a top-level CAF containing GHC.Enum.eftInt 0# 799#.
The program in rep.hs is rather unusual because it has a constant list, not dependent on input data. Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.
It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win
Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.
Just wanted to chime in that I found this problem in a real-world program, where I looped over all 32-bit integers. Similarly, these known-at-compile-time iteration lists may appear in inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels with fixed size, e.g. 3x3 image templates).
I haven't confirmed, but it looks like 2effe18a may have fixed this. Now fusion happens during the "gentle" phase of simplifier, which happens before any of the float-out passes.
Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.
I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?
I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?
What is the first argument of forM_ in you program? This ticket (and #7206) is specifically about forM_/mapM_ when the foldable argument is of form [n .. m]. If your program is different than perhaps a new ticket would be better. In any case, it's not a big deal if you paste your program here, we can move it to a new ticket if it turns out to be something different than what we try to improve here.
Attached file fuse.hs , similar to the first file but with nested forM_, both versions run in the same time but the forM_ version allocates about 50% more
ghc -O2 fuse.hs +RTS[1 of 1] Compiling Main ( fuse.hs, fuse.o )Linking fuse ...bash-3.2$ ./fuse 1 +RTS -s486341683267690 320,057,352 bytes allocated in the heap 8,232 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 29,152 bytes maximum slop 308 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.025s 0.0254s 0.0254s INIT time 0.000s ( 0.003s elapsed) MUT time 5.432s ( 5.634s elapsed) GC time 0.000s ( 0.025s elapsed) EXIT time 0.000s ( 0.005s elapsed) Total time 5.433s ( 5.667s elapsed) %GC time 0.0% (0.4% elapsed) Alloc rate 58,918,474 bytes per MUT second Productivity 100.0% of total user, 99.5% of total elapsedbash-3.2$ ./fuse 2 +RTS -s486341683267690 560,057,328 bytes allocated in the heap 15,992 bytes copied during GC 320,028,576 bytes maximum residency (2 sample(s)) 868,448 bytes maximum slop 308 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 228 colls, 0 par 0.001s 0.002s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.000s 0.026s 0.0128s 0.0254s INIT time 0.000s ( 0.003s elapsed) MUT time 5.453s ( 5.630s elapsed) GC time 0.002s ( 0.027s elapsed) EXIT time 0.000s ( 0.008s elapsed) Total time 5.455s ( 5.667s elapsed) %GC time 0.0% (0.5% elapsed) Alloc rate 102,698,216 bytes per MUT second Productivity 100.0% of total user, 99.5% of total elapsed
Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of [2,3..n] instead of [2..n], which desugar to different functions. This results in a difference in a huge difference in allocation as well as runtime:
$ ./repro 2 +RTS -s # [2,3..n]() 960,056,856 bytes allocated in the heap 21,536 bytes copied during GC 44,576 bytes maximum residency (2 sample(s)) 29,152 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 918 colls, 0 par 0.005s 0.003s 0.0000s 0.0000s Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0002s INIT time 0.000s ( 0.000s elapsed) MUT time 0.123s ( 0.125s elapsed) GC time 0.005s ( 0.003s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.129s ( 0.129s elapsed) %GC time 4.1% (2.5% elapsed) Alloc rate 7,778,808,106 bytes per MUT second Productivity 95.8% of total user, 97.4% of total elapsed
$ ./repro 1 +RTS -s # [2..n]() 56,872 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s INIT time 0.000s ( 0.000s elapsed) MUT time 0.048s ( 0.048s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.000s elapsed) Total time 0.048s ( 0.048s elapsed) %GC time 0.2% (0.2% elapsed) Alloc rate 1,188,432 bytes per MUT second Productivity 99.6% of total user, 99.6% of total elapsed
This happens in ST, but not in IO, so probably related to some hack. Also the difference vanishes when we allow the functions to inline.
Here's some Core for g (the offending function):
-- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}$wg$wg = \ @ s ww w -> let { $wc = <huge> } in case <# ww 3# of { __DEFAULT -> let { y' y' = -# ww 1# } in letrec { go_up go_up = \ x eta -> case ># x y' of { __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` <Co:4>) eta; 1# -> $wc x (lvl `cast` <Co:4>) eta }; } in $wc 2# ((go_up 3#) `cast` <Co:4>) w; 1# -> case <# ww 2# of { __DEFAULT -> $wc 2# (lvl `cast` <Co:4>) w; 1# -> (# w, () #) } }
From my understanding of join points, $wc is only nearly a join point, because go_up with its transitive tail call to $wc appears in argument position. It would be great if we could get rid of this! The IO variant (g 40000000 >>= print) doesn't have this weakness, it's all join points there. Hence my suspicion about some IO hack that let's GHC eta-expand stuff more aggresively, but I'm not sure how that's helping.
For our case, c is the <huge> computation (see the worker $wc in ticket:8763#comment:151206) performed for each outer list element and would be duplicated by inlining: It's mentioned four times in the definition of efdtIntUpFB. Consequently, c has almost always Guidance=NEVER, except in the IO case, where it miraculously gets Guidance=IF_ARGS [20 420 0] 674 0 just when it is inlined. Not sure what this decision is based on.
I'm not sure if IO gets special treatment by the inliner, but I see a few ways out:
Do the same hacks for ST, if there are any which apply (ugly)
Reduce the number of calls to c in the implementation of efdtIntUpFB, probably for worse branch prediction
Figure out why the floated out expression of \x -> (nop x *>) occuring in forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ()) doesn't get eta-expanded in the ST case, whereas the relevant IO code is. I hope that by fixing this, the c expression inlines again.
Here's how it inlines for IO:
(>>) . nop= \x -> (nop x >>)= \x -> (nop x *>) -- notice how it's no different than ST up until here= \x -> (thenIO (nop x))
The inliner probably stops here, but because of eta-expansion modulo coercions to \x k s -> thenIO (nop x) k s, we can inline thenIO:
\x k s -> thenIO (nop x) y s= \x k s -> case nop x s of (# new_s, _ #) -> k new_s)
which is much better and probably more keenly inlined than \x -> (nop x *>) in the ST case. What makes GHC eta-expand one, but not the other?
This is just a wild guess and the only real difference I could make out in diffs. Maybe someone with actual insights into the simplifier can comment on this claim (that the inliner gives up on c due to the missed eta-expansion and inlining)?
Your example seems to be different than mine in that when I compile yours with ghc 8.4.1 and -O there is no difference in allocation unlike mine. What flags did you compile with?
ghc -O repro.hs +RTS [1 of 1] Compiling Main ( repro.hs, repro.o )Linking repro ...bash-3.2$ ./repro 1 +RTS -s() 56,880 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.000s ( 0.002s elapsed) MUT time 0.051s ( 0.052s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.003s elapsed) Total time 0.051s ( 0.057s elapsed) %GC time 0.3% (0.4% elapsed) Alloc rate 1,118,716 bytes per MUT second Productivity 99.3% of total user, 95.6% of total elapsedbash-3.2$ ./repro 2 +RTS -s() 56,880 bytes allocated in the heap 3,480 bytes copied during GC 44,576 bytes maximum residency (1 sample(s)) 25,056 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.000s 0.000s 0.0002s 0.0002s INIT time 0.000s ( 0.002s elapsed) MUT time 0.051s ( 0.051s elapsed) GC time 0.000s ( 0.000s elapsed) EXIT time 0.000s ( 0.005s elapsed) Total time 0.051s ( 0.059s elapsed) %GC time 0.3% (0.3% elapsed) Alloc rate 1,120,655 bytes per MUT second Productivity 99.3% of total user, 95.7% of total elapsed
Yuck! I was under the impression I used GHC 8.4.1 via a nix-shell when in reality I was using another locally installed 8.2.2. So, back to minimization, I guess.
It seems I uploaded the variant where I used IO instead of ST, where things still inline. When you substitute ST s for IO and use print $ runST $ ... instead of ... >>= print, it should reproduce with 8.4.1.
Now here's the funny part: I managed to make this reproduce even for IO by duplicating the call to nop. So it seems like c really just hits the threshold where the inliner gives up. The only solution I can think of is what I described in my second point above: Implement efdtIntUpFB in a way that doesn't duplicate c.
In general we should avoid to call c in builds more than once because of scenarios like this. Huge cs aren't uncommon at all (do blocks in forM_ bodies, the functions passed as first argument to foldr, etc.) and otherwise we can't guarantee that everything inlines.
In general we should avoid to call c in builds more than once because of scenarios like this.
I vaguely having seen a Note about this somewhere, but I cannot find it right now. But yes, a single occurrence of c is beneficial…
Ok, found it:
{-# INLINE [0] eftIntFB #-} -- See Note [Inline FB functions] in GHC.ListeftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> reftIntFB c n x0 y | isTrue# (x0 ># y) = n | otherwise = go x0 where go x = I# x `c` if isTrue# (x ==# y) then n else go (x +# 1#) -- Watch out for y=maxBound; hence ==, not > -- Be very careful not to have more than one "c" -- so that when eftInfFB is inlined we can inline -- whatever is bound to "c"
This definition of efdtIntUpFB only has a single occurence of c and n and consequently fixes th issue.
But this probably doesn't have the same performance for the non-fused case. Also fingers crossed wrt. correctness.
data CounterState = More | Last | End-- Requires x2 >= x1{-# INLINE [0] efdtIntUpFB #-} -- See Note [Inline FB functions] in GHC.ListefdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> refdtIntUpFB c n x1 x2 y = -- Be careful about overflow! let !first_state | isTrue# (y <# x2) = if isTrue# (y <# x1) then End else Last | otherwise = More -- Common case: x1 <= x2 <= y !delta = x2 -# x1 -- >= 0 !y' = y -# delta -- x1 <= y' <= y; hence y' is representable next_state End _ = End next_state Last _ = End next_state More x | isTrue# (x ># y') = Last | otherwise = More -- Invariant: x <= y -- Note that: z <= y' => z + delta won't overflow -- so we are guaranteed not to overflow if/when we recurse emit End _ = n emit st x | let x' = x +# delta = I# x `c` emit (next_state st x') x' in emit first_state x1
How about this (untested), which seems a lot simpler
efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> refdtIntUpFB c n x1 x2 y -- Be careful about overflow! | isTrue# (y <# x1) = n | otherwise = go_up x1 -- Common case: x1 <= y where !delta = x2 -# x1 -- >= 0 !y' = y -# delta -- x1 <= y' <= y; hence y' is representable -- Invariant: x <= y -- Note that: x <= y' => x + delta won't overflow -- so we are guaranteed not to overflow if/when we recurse go_up x = I# x `c` if isTrue# (x ># y') then n else go_up (x +# delta)
I just ran some quick quickCheck $ withMaxSuccess 100000 $ \i j k -> take 1000 [i,j..k] == take 1000 (efdtInt (unI# i) (unI# j) (unI# k)) tests and both versions pass. Given that simonpj's is much more to the point, let's run with that one! Although the duplicate n has potential to cause pain... But it's also there in eftIntFB, so it's probably fine.
Edit: Well, either the search space is too huge for QC or fusion didn't kick in. Either way, Simon's code has an underflow weakness: E.g. [minBound,minBound+2..minBound+1] == [minBound], but in the new variant y' = minBound+1-2 will underflow. I fixed this in the diff I prepared by an additional flag which has to be checked on every iteration.
Nofib suggests that this regresses allocations in integer by 6.0% and counted instructions by 0.1%. I had a look at the simplified Core and it seems that it's entirely due to the new definition, although I'm not sure where exactly this allocates more. Maybe it's due to an increase in closure size of go_up because of single?. Here's the Core diff and the new definition of efdtIntUpFB for reference.
It seems that c is still not inlined, even with the new definition. I assume that's because there are multiple occurences of c which were probably duplicated before the inliner had a chance to inline the argument c. It better had introduced a join point before.
Maybe loopification helps here? Indeed #14068 suggests that something beneficial happens, maybe more so with this patch.
Or we could introduce some kind of annotation mechanism to tell GHC to be careful not to duplicate occurences of certain parameters that occur once (f {-# HUGE #-} c n = ...).
I dug through dump-simpl-iterations and noticed that the duplication happens through efdtIntFBs unfolding, which mentions efdtIntUpFBandefdtIntDnFB, which both want to inline their c later on.
Note that's not a problem for [x..y] (eftInt), because that doesn't need to consider counting down. It's not an issue for [x,y..] (edtInt), because although it calls efdtInt{Up,Dn} internally, it doesn't take part in fusion at all (is that an oversight or by design?).
Here's an implementation of efdtIntFB that fits our requirements:
direction :: Int# -> Int# -> Orderingdirection from to = compareInt# to fromopposed :: Ordering -> Ordering -> Boolopposed LT GT = Trueopposed GT LT = Trueopposed _ _ = False-- | Implements `enumFromThenTo @Int` as per section 6.3.4 of the Haskell2010-- report:-- https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#dx13-131001.efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> refdtIntFB c n x1 x2 y = emit first x1 where -- We can safely emit the first element if an iteration -- doesn't move away from @y@. That's exactly the case when -- @dir_x2@ is not opposed to @dir_y@. !first = not (opposed dir_x2 dir_y) && (dir_x2 /= EQ || dir_y /= LT) -- [1,1..0] == [] !dir_x2 = direction x1 x2 !dir_y = direction x1 y -- We need the overflow flag in 'emit'. (# delta, delta_ovf #) = x2 `subIntC#` x1 -- | Think of @emit :: Maybe Int -> [Int]@, only unboxed. -- If the argument is 'Nothing', we reached the end of the list. -- If the argument is 'Just', we emit an element, compute -- the next candidate, validate it and recurse. emit False _ = n emit True x = I# x `c` emit next_ok next where -- Check that @next@ didn't move past @y@. -- Also, overflow is only allowed iff the computation for -- @delta@ overflowed. (# next, next_ovf #) = addIntC# x delta !next_ok = isTrue# (next_ovf ==# delta_ovf) && not (opposed (direction next y) dir_y) -- TODO: evaluate strict && for branchless code{-# INLINE[0] efdtIntFB #-}
Some pros:
I find this much easier to understand. No complicated invariants, etc.
No Up/Dn variants to maintain. Still, if the direction is statically known, constant folding and inlining will simplify stuff to the equivalent code.
As a result, no more duplication of c occurrences
Also no more duplication of n occurrences
Cons:
emits closure is 4 words big (2 words bigger than the closure of the original go_up helper) in the most general form. It's unfortunate that we can't pack together dir_y and delta_ovf into a single word without confusing constant folding. This would need either some kind of constant propagation through bit fields (out of scope for GHC, I think) or a smarter closure allocation theme that packs together non-pointer payloads.
We pay for the generalisation of Up/Dn variants by having to compare with dir_y all the time.
base lacks addWordC# primitives, which I'll probably add now
Looks plausible to me, but needs a careful Note to explain the issues.
But before we go too far with this, I'd like to point to latelambdalifting. In the core reported in ticket:8763#comment:151206, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.
To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.
I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?
Looks plausible to me, but needs a careful Note to explain the issues.
But before we go too far with this, I'd like to point to latelambdalifting. In the core reported in ticket:8763#comment:151206, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.
To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.
I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?
I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.
I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.
In general inlining is good. But in the case on this thread, I think it's the non-allocation of a function closure that saves all the work, rather than any optimisations that happen after inlining c.
Currently, it doesn't pass the testsuite (even in default mode, which doesn't do any new lambda lifting), probably because I introduced some errors during the merge.
I think we should continue the discussion in #9476 (closed).
As this is now 4 years old, does it make sense to open a new issue and close this one? My problem of 3 years ago is basically fixed. In any case I have uploaded a fixed version, reprof.hs of repro.hs which was modified per the comments in 49
george.colpittschanged title from forM_ [1..N] does not get fused (10 times slower than go function) to forM_ [1..N] does not get fused (allocates 50% more)
changed title from forM_ [1..N] does not get fused (10 times slower than go function) to forM_ [1..N] does not get fused (allocates 50% more)
Now that I spent some time with late lambda lifting and have an implementation at hand, I revisited this issue.
Indeed, the original $wg binding from ticket:8763#comment:151206 gets lifted to top-level, but there is no reduction in allocation to be had (it's just swapping each mention of the local binding in closures for its single free variable, after all) and instructions executed are roughly the same. That's because part of the original problem persists: The hot loop go_up is still not a join point.
I still think the implementation from ticket:8763#comment:151815 is the way to go. With the improvements to constant folding in D4605, the mentioned increase in closure size should be constant-folded away in the majority of cases. I'll conduct some benchmarks next week.
It turns out that the implementation from ticket:8763#comment:151815 doesn't get rid of the problem. It seems that the hard earned single call to c in emit True x = I# x \c\ emit next_ok next gets duplicated because of case-of-case.
The relevant Core began as this expression:
case (case ==# next_ovf_a3hz delta_ovf_a3h0 of lwild_s4fL { __DEFAULT -> GHC.Types.False 1# -> b_a2jS }) of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx)}
Now case-of-case comes along and immediately simplifies this to
case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf GHC.Types.False next_a3hx) }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx) }}
I'm not sure if the intermediate join point is never generated or is just inlined immediately, but I'd very much like this not to duplicate the call to c.
Sure will do. Let's begin with the original problem. There's two functions in [[attachment:reprof.hs]]. One is f, having a loop of the forM_ [2..n] body form, the other is g which has the forM_ [2,3..n] body form. The former doesn't allocate (56kB in total), whereas the latter allocates quite a lot (960MB).
Why is that? The arithmetic sequences desugar, rewrite and specialise to build (\c t -> eftIntFB c z 2 n) and build (\c z -> efdtIntFB c z 2 3 n), respectively. That cancels away with forM_'s implementation in terms of foldr:
forM_= flip mapM_ mapM_ body= {- definition -} foldr ((>>) . body) (return ())= {- eta expand the section -} foldr (\x k -> body x >> k) (return ())= {- (>>) of ST, written as `s -> (a, s)` for lighter syntax -} foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ())
Note how k occurs in tail position within the lambda. Now, this cancels with the definition of ef(d)tIntFB:
foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ()) . build (\c z -> efdtIntFB c z 2 3 n)= {- foldr/build -} efdtIntFB (\x k s -> case (body x s) of (_, s') -> k s') (return ()) 2 3 n
So, that lambda is what is bound to c when ef(d)tIntFB gets inlined.
go x = I# x `c` if isTrue# (x ==# y) then n else go (x +# 1#)==> c = \x k s -> case (body x s) of (_, s') -> k s'go x s = case (body (I# x) s) of (_, s') -> if isTrue# (x ==# y) then n s' else go (x +# 1#) s'
And go can be made a join point.
The same isn't possible in the current efdtIntFB, because it duplicates c by branching on whether to count up or down (and also within the loop itself, anyway):
efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> refdtIntFB c n x1 x2 y | isTrue# (x2 >=# x1) = efdtIntUpFB c n x1 x2 y | otherwise = efdtIntDnFB c n x1 x2 y
let c x k s = case (body x s) of (_, s') -> k s'in ... let go_up x | isTrue# (x ># y') = I# x `c` n | otherwise = I# x `c` go_up (x +# delta)
Note that go_up tail calls c and passes itself as the k parameter. If c was inlined, all would be fine and go_up would turn into a join point. That's not the case because c is duplicated in efdtIntFB and then one more time in efdtInt{Up,Dn}FB. My first implementation in ticket:8763#comment:151261 (for which you provided a simplification in ticket:8763#comment:151263) dealt with the latter, while the idea in ticket:8763#comment:151815 is supposed to deal with the former. Sadly, case-of-case seems to undo the painful de-duplication of the c parameter (that's what ticket:8763#comment:159581 is about).
Why doesn't LLF help here? Well, lifting out c to top-level gets rid of allocations for c, but there's still at least the allocation for the thunk for go_up (x+1) (the Int box goes away because of strictness). Also, the call to go_up is still an unknown call, as opposed to the simple join call we would get by inlining c.
I can think of another transformation that would save the day here: We just have to put c in the same recursive group as go_up and recognize the mutual recursive join point. Seems like a better way than to mess with the simplifier and we don't even risk duplication of any code! So
let c x k s = case (body x s) of (_, s') -> k s'in ... let go_up x | isTrue# (x ># y') = I# x `c` n | otherwise = I# x `c` go_up (x +# delta)==> float `c` inwards into the same recursive group, specialise it for `go_up (x+#delta)` and `n` as `k` (SpecConstr? Would entail seeing tail-calls as kind-of a pattern match for functions)join go_up x | isTrue# (x ># y') = jump c1 (I# x) | otherwise = c2 (I# x) go_up (x +# delta) c1 x s = -- k = n case (body x s) of (_, s') -> n s' c2 x s = -- k = go_up (x +# delta) case (body x s) of (_, s') -> go_up (x +# delta) s'
Well, it's probably not SpecConstr that will do the specialisation. Also, why specialise when we could just inline c? Seems like we risk duplication of the potentially huge body after all.
Although the same weakness doesn't apply to the situation in ticket:8763#comment:159581: It's enough to specialise for the emit argument (which serves a similar role as go_up) without any specific arguments to see that it's tail called:
let c_a2jU x k s = case (body x s) of (_, s') -> k s'in join emit_a4hf next_ok next = ... case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf GHC.Types.False next_a3hx) }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx) } }==> Specialising `c` for the call pattern `[ds s next_ok next] |> [I# ds, emit next_ok next, s]` as `c'`join c' ds s next_ok next = case (body (I# ds) s) of (_, s') -> emit_a4hf next_ok next s' emit_a4hf next_ok next s = ... case ==# next_ovf_a3hz delta_ovf_a3h0 of { __DEFAULT -> case b_a2jS of { __DEFAULT -> jump c' ds_d42Y GHC.Types.False next_a3hx s }; 1# -> case b_a2jS of next_ok_a2k8 { __DEFAULT -> jump c' ds_d42Y next_ok_a2k8 next_a3hx s } }
This latter case is probably a lot easier to handle. Maybe this is worth some specialised pass?
forM_2 :: (Foldable t, Monad m) => t a -> (a -> m b) -> m ()forM_2 xs f = let c x k = f x >> k {-# INLINE c #-} in foldr c (return ()) xs
and use forM_2 instead of forM_ in the outer calls in f and g.
I then get good results for both.
How does this work? Well by marking c as INLINE, I prevent f from
inlining into it -- remember, the promise of INLINE things is that what
you write gets inlined. And this is what we want: c is small, just
f x >> k, and inlining it is very very good. Without the INLINE
pragmas on c we have something like
let f = BIG in let c x k = f x >> k in BODY
Since f occurs just once, we inline f to give
let c x k = BIG x >> k in BODY
and now c becomes too big to inline. This is a classic inlining dilemma:
do we inline f into c or c into BODY? The latter is much better in
this case.
I think we could build this into the libraries just by changing the definition
of mapM_.
Yes, that works and seems a lot simpler! I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way. The only minor downside is that there might be more unidentified cases where such an INLINE annotation on a lambda could be beneficial.
I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way.
Yes, it's super-important that INLINE works like that. Consider
f x = <very big>g y = Just (f y){-# INLINE g #-}
where that is the only occurrence of f, but there are zillions of occurrences of g.
You would be jolly annoyed if GHC inlined <very big> into g, and then inlined the new g at each of its zillions of call sites!!
No: INLINE says "when you see a saturated application of this function, inline the RHSI wrote at the call site".
It might be useful to beef up the documentation of INLINE to explain this, if you felt able to do so.
I agree that there may be other places such an INLINE would be desirable. A good start would be a careful Note on mapM_ explaining from first principles why the local let and INLINE is important. Then others can refer to that Note.
Right, actually I was aware of INLINE unfoldings being captured as the unoptimized variant, but have never seen it to actively prevent (or rather postpone) inlining of another part of the code. Quite a neat trick!
I've identified some other functions that should probably rewritten the same way (traverse_, foldrM, etc.). No regressions, no improvements according to NoFib. D5131 passes ./validate.sh.