forM_ [1..N] does not get fused (allocates 50% more)

mentioned in issue #7206

changed weight to 5

People on #ghc think that it is because the [1.._N] gets floated out as a top-level constant expression:

nomeata: nh2: I’d consider that a bug nomeata: I believe the problem is that [0..512] does not depend on any local values nomeata: so it is floated out as a top-level value nomeata: and there it is not matched by any rules thoughtpolice: let floating strikes again thoughtpolice: (well, floating-out being a non-optimization, anyway.) Fuuzetsu: does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once? hvr: Fuuzetsu: there's a chance, yes Fuuzetsu: neat thoughtpolice: well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps) thoughtpolice: unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen.

I believe let-floating happens due to the full-laziness transform. Do you get better performance with -fno-full-laziness ?

A comment in SetLevels (which I just came across) in the code indicates that this problem should have been taken care of:

	  -- We are keen to float something to the top level, even if it does not
	  -- escape a lambda, because then it needs no allocation.  But it's controlled
	  -- by a flag, because doing this too early loses opportunities for RULES
	  -- which (needless to say) are important in some nofib programs
	  -- (gcd is an example).

So either my assumption is wrong, or this does not work as desired.

It turns out that this comment is obsolete; the flag is never set. I quote from SimplCore

                -- Was: gentleFloatOutSwitches
                --
                -- I have no idea why, but not floating constants to
                -- top level is very bad in some cases.
                --
                -- Notably: p_ident in spectral/rewrite
                --          Changing from "gentle" to "constantsOnly"
                --          improved rewrite's allocation by 19%, and
                --          made 0.0% difference to any other nofib
                --          benchmark

This comment was introduced in eaeca51e (2009) by SPJ.

Maybe rules matching should look though unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing [0..N] in a rule application whose purpose is to eliminate that list.

mentioned in issue #8814

@nomeata

Regarding your suspicion that it gets floated out as a constant, I don't see an improvement when getting the upper bound m of [1..m] from an IO action. What do you think?

It might still get floated out of some local function. Have you tried coming up with a minimal example?

@nomeata

There is an example I made for this, mentioned in the bug description.

The performance I measure for that is:

using forM_ with ghc -O: 2.0 s
using loop with ghc -O: 1.6 s
using forM_ with ghc -O2: 0.9 s
using loop with ghc -O2: 0.3 s
using forM_ with ghc -O2 -fllvm: 0.75 s
using loop with ghc -O2 -fllvm: 0.15 s

I tried to make an even smaller benchmark (https://gist.github.com/nh2/11333427) but the performance is identical there although the same thing changes as before.

Could you try my two benchmarks and see if you get the same behaviour?

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

(Note that measuring the simple example with -fllvm is pointless because it can optimize it away completely. It can't do that with the matmult though.)

Replying to [ticket:8763#comment:81818 nh2]:

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed Int#s, at least with -O2. In the matmult example, you need to conceal the fact that both lists are the same from GHC to get a loop on unboxed Int#s.

So in principle, GHC can do the desired thing, just the sharing gets in the way. Can somebody think of a situation where sharing is beneficial for forM_ [a .. b] $ \n -> do ... code? If not, perhaps special-casing enumFromTo arguments for forM_ etc. is, at least for standard integer types, something to be considered.

Preventing it from sharing sounds sensible for me: If the [a .. b] was something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level.

Replying to [ticket:8763#comment:81819 daniel.is.fischer]:

The simple example doesn't use the same list in different places

Unfortunately, I still get no difference in performance even if I duplicate the loops to

forM_ [1.._MAX] $ \i -> do

UM.unsafeWrite v 0 i

forM_ [1.._MAX] $ \i -> do

UM.unsafeWrite v 0 i

Any idea?

I've got lost in this thread. If someone thinks there is a real bug here, in GHC 7.8, could you re-articulate what you think it is?

Simon

@simonpj: Summary:

I reported that my manually written loop is much faster than forM_ [1..n] in some cases, suggesting that in some cases optimizing the list away doesn't work well.
nomeata said some technical things that are a bit beyond me.
I submit two benchmarks in the gist at https://gist.github.com/nh2/11333427, a "matmult" benchmark where there is a big difference between forM_ and the hand-written loop, and a "simple" benchmark where they are equally fast.
Daniel suspects the slow case comes from using the same syntactical list twice, and that in this case GHC floats it out to share it, which breaks eliminating it. He suggests we might special-case enumFromTo when used with forM_ to prevent it.
I give a counter example for his suspicion, by changing my "simple" benchmark, where using the same list twice gives the same good performance as using it once.

I get the same behaviour for 7.6 and 7.8.

Slight correction, @Niklas, it's not a suspicion that it's the floating out of the list to the top-level and consequently the use of the list for looping instead of unboxed Int#s, that is direct from the core (-ddump-simpl is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody knowing how GHC works.

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

I'm not using criterion; just using +RTS -s, with _SIZE = 300.

So I can't reproduce the problem.

Simon

Replying to [ticket:8763#comment:82391 simonpj]:

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

Are you compiling with -O? You need -O2 for a significant difference to appear (with only -O, both versions are slow).

mentioned in issue #9476 (closed)

With 7.10.1, compiling with -O2, I see a factor of 2 difference in performance:

benchmarking matmultForM_
time                 10.90 μs   (10.89 μs .. 10.91 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.89 μs   (10.89 μs .. 10.91 μs)
std dev              32.72 ns   (18.98 ns .. 65.42 ns)

benchmarking matmultLoop
time                 5.404 μs   (5.387 μs .. 5.419 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.409 μs   (5.398 μs .. 5.420 μs)
std dev              37.99 ns   (33.64 ns .. 44.26 ns)

with -fllvm it is a factor of almost 3

benchmarking matmultForM_
time                 9.796 μs   (9.788 μs .. 9.806 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.802 μs   (9.795 μs .. 9.815 μs)
std dev              31.64 ns   (21.92 ns .. 49.61 ns)

benchmarking matmultLoop
time                 3.499 μs   (3.497 μs .. 3.502 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.499 μs   (3.497 μs .. 3.503 μs)
std dev              10.27 ns   (7.068 ns .. 16.47 ns)

Trac metadata

Trac field	Value
CC	mail@nh2.me, pivo@pobox.sk → george, mail@nh2.me, pivo@pobox.sk

changed milestone to %8.0.1

Milestone renamed

seeing same issue and performance on 7.10.3. This is on MacOS but I believe the problem occurs on multiple platforms,

This is with _SIZE = 10. Even if I replace

forM_ [0.._SIZE-1]

with

forM_ [0..10-1]

I see the same problem so I don't think it has to do with top level constant expressions

changed milestone to %8.0.2

still seeing the same problem on 8.0.1. Any chance of this getting fixed in 8.0.2? If not 8.0.2 I hope 8.1

changed milestone to %8.2.1

I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process.

Attached file rep.hs ($1229).

Test case

Here is a test case that does not require criterion. With ghc-8.0.1 -O2 -ddump-simpl, you can see that foo is compiled into code that uses a top-level CAF containing GHC.Enum.eftInt 0# 799#.

changed the description

It looks like this is the same issue as described in #7206. The branch mentioned in #7206/#ticket:8763#comment:81856 fixes my test case.

Good to know that's it.

The program in rep.hs is rather unusual because it has a constant list, not dependent on input data. Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.

It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win

Simon

Replying to [ticket:8763#comment:124875 simonpj]:

Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.

Just wanted to chime in that I found this problem in a real-world program, where I looped over all 32-bit integers. Similarly, these known-at-compile-time iteration lists may appear in inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels with fixed size, e.g. 3x3 image templates).

changed milestone to %8.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4

changed milestone to %8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

Trac metadata

Trac field	Value
Related	- → #7206

I confirmed that #7206 fixes this.

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

If the problem is gone in 8.2.2, do we know what commit fixed it?

I haven't confirmed, but it looks like 2effe18a may have fixed this. Now fusion happens during the "gentle" phase of simplifier, which happens before any of the float-out passes.

Replying to [ticket:8763#comment:150151 osa1]:

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?

What is the first argument of forM_ in you program? This ticket (and #7206) is specifically about forM_/mapM_ when the foldable argument is of form [n .. m]. If your program is different than perhaps a new ticket would be better. In any case, it's not a big deal if you paste your program here, we can move it to a new ticket if it turns out to be something different than what we try to improve here.

Attached file fuse.hs ($1365).

Trac metadata

Trac field	Value
CC	akio, george, mail@nh2.me, pivo@pobox.sk → George, akio, mail@nh2.me, pivo@pobox.sk

Attached file fuse.hs , similar to the first file but with nested forM_, both versions run in the same time but the forM_ version allocates about 50% more

 ghc -O2 fuse.hs +RTS
[1 of 1] Compiling Main             ( fuse.hs, fuse.o )
Linking fuse ...
bash-3.2$ ./fuse 1 +RTS -s
486341683267690
     320,057,352 bytes allocated in the heap
           8,232 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          29,152 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.025s     0.0254s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.432s  (  5.634s elapsed)
  GC      time    0.000s  (  0.025s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    5.433s  (  5.667s elapsed)

  %GC     time       0.0%  (0.4% elapsed)

  Alloc rate    58,918,474 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

bash-3.2$ ./fuse 2 +RTS -s
486341683267690
     560,057,328 bytes allocated in the heap
          15,992 bytes copied during GC
     320,028,576 bytes maximum residency (2 sample(s))
         868,448 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       228 colls,     0 par    0.001s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.026s     0.0128s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.453s  (  5.630s elapsed)
  GC      time    0.002s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    5.455s  (  5.667s elapsed)

  %GC     time       0.0%  (0.5% elapsed)

  Alloc rate    102,698,216 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

Hoping my example is the new reproducer. If not this can be set back to infoneeded or closed.

Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of [2,3..n] instead of [2..n], which desugar to different functions. This results in a difference in a huge difference in allocation as well as runtime:

$ ./repro 2 +RTS -s # [2,3..n]
()
     960,056,856 bytes allocated in the heap
          21,536 bytes copied during GC
          44,576 bytes maximum residency (2 sample(s))
          29,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       918 colls,     0 par    0.005s   0.003s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.123s  (  0.125s elapsed)
  GC      time    0.005s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.129s  (  0.129s elapsed)

  %GC     time       4.1%  (2.5% elapsed)

  Alloc rate    7,778,808,106 bytes per MUT second

  Productivity  95.8% of total user, 97.4% of total elapsed

$ ./repro 1 +RTS -s # [2..n]
()
          56,872 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.048s  (  0.048s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.048s  (  0.048s elapsed)

  %GC     time       0.2%  (0.2% elapsed)

  Alloc rate    1,188,432 bytes per MUT second

  Productivity  99.6% of total user, 99.6% of total elapsed

This happens in ST, but not in IO, so probably related to some hack. Also the difference vanishes when we allow the functions to inline.

Here's some Core for g (the offending function):

-- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}
$wg
$wg
  = \ @ s ww w ->
      let { $wc = <huge> } in
      case <# ww 3# of {
        __DEFAULT ->
          let {
            y'
            y' = -# ww 1# } in
          letrec {
            go_up
            go_up
              = \ x eta ->
                  case ># x y' of {
                    __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` <Co:4>) eta;
                    1# -> $wc x (lvl `cast` <Co:4>) eta
                  }; } in
          $wc 2# ((go_up 3#) `cast` <Co:4>) w;
        1# ->
          case <# ww 2# of {
            __DEFAULT -> $wc 2# (lvl `cast` <Co:4>) w;
            1# -> (# w, () #)
          }
      }

From my understanding of join points, $wc is only nearly a join point, because go_up with its transitive tail call to $wc appears in argument position. It would be great if we could get rid of this! The IO variant (g 40000000 >>= print) doesn't have this weakness, it's all join points there. Hence my suspicion about some IO hack that let's GHC eta-expand stuff more aggresively, but I'm not sure how that's helping.

Attached file repro.hs ($1370).

Reproduction for comment:44

It seems that for IO, GHC decides that it's OK to inline c from the fusion helper of enumFromThenTo, but not so for ST s.

For our case, c is the <huge> computation (see the worker $wc in ticket:8763#comment:151206) performed for each outer list element and would be duplicated by inlining: It's mentioned four times in the definition of efdtIntUpFB. Consequently, c has almost always Guidance=NEVER, except in the IO case, where it miraculously gets Guidance=IF_ARGS [20 420 0] 674 0 just when it is inlined. Not sure what this decision is based on.

The inlining decision for eftIntFB is much easier: c only happens once there.

I'm not sure if IO gets special treatment by the inliner, but I see a few ways out:

Do the same hacks for ST, if there are any which apply (ugly)
Reduce the number of calls to c in the implementation of efdtIntUpFB, probably for worse branch prediction
Figure out why the floated out expression of \x -> (nop x *>) occuring in forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ()) doesn't get eta-expanded in the ST case, whereas the relevant IO code is. I hope that by fixing this, the c expression inlines again.

Here's how it inlines for IO:

  (>>) . nop
= \x -> (nop x >>)
= \x -> (nop x *>) -- notice how it's no different than ST up until here
= \x -> (thenIO (nop x))

The inliner probably stops here, but because of eta-expansion modulo coercions to \x k s -> thenIO (nop x) k s, we can inline thenIO:

  \x k s -> thenIO (nop x) y s
= \x k s -> case nop x s of (# new_s, _ #) -> k new_s)

which is much better and probably more keenly inlined than \x -> (nop x *>) in the ST case. What makes GHC eta-expand one, but not the other?

This is just a wild guess and the only real difference I could make out in diffs. Maybe someone with actual insights into the simplifier can comment on this claim (that the inliner gives up on c due to the missed eta-expansion and inlining)?

Trac metadata

Trac field	Value
CC	George, akio, mail@nh2.me, pivo@pobox.sk → George, akio, mail@nh2.me, pivo@pobox.sk, sgraf

Your example seems to be different than mine in that when I compile yours with ghc 8.4.1 and -O there is no difference in allocation unlike mine. What flags did you compile with?

 ghc -O repro.hs +RTS 
[1 of 1] Compiling Main             ( repro.hs, repro.o )
Linking repro ...
bash-3.2$ ./repro 1 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.052s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.051s  (  0.057s elapsed)

  %GC     time       0.3%  (0.4% elapsed)

  Alloc rate    1,118,716 bytes per MUT second

  Productivity  99.3% of total user, 95.6% of total elapsed

bash-3.2$ ./repro 2 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.051s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    0.051s  (  0.059s elapsed)

  %GC     time       0.3%  (0.3% elapsed)

  Alloc rate    1,120,655 bytes per MUT second

  Productivity  99.3% of total user, 95.7% of total elapsed

Yuck! I was under the impression I used GHC 8.4.1 via a nix-shell when in reality I was using another locally installed 8.2.2. So, back to minimization, I guess.

It seems I uploaded the variant where I used IO instead of ST, where things still inline. When you substitute ST s for IO and use print $ runST $ ... instead of ... >>= print, it should reproduce with 8.4.1.

Now here's the funny part: I managed to make this reproduce even for IO by duplicating the call to nop. So it seems like c really just hits the threshold where the inliner gives up. The only solution I can think of is what I described in my second point above: Implement efdtIntUpFB in a way that doesn't duplicate c.

In general we should avoid to call c in builds more than once because of scenarios like this. Huge cs aren't uncommon at all (do blocks in forM_ bodies, the functions passed as first argument to foldr, etc.) and otherwise we can't guarantee that everything inlines.

I can't attach the fixed reproduction, so use this gist instead: https://gist.github.com/sgraf812/6089d81fbc95af9c5f817ff9dc417401

I'll see if I can craft a variant of efdtInt* that doesn't duplicate c.

In general we should avoid to call c in builds more than once because of scenarios like this.

I vaguely having seen a Note about this somewhere, but I cannot find it right now. But yes, a single occurrence of c is beneficial…

Ok, found it:

{-# INLINE [0] eftIntFB #-} -- See Note [Inline FB functions] in GHC.List
eftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> r
eftIntFB c n x0 y | isTrue# (x0 ># y) = n
                  | otherwise         = go x0
                 where
                   go x = I# x `c` if isTrue# (x ==# y)
                                   then n
                                   else go (x +# 1#)
                        -- Watch out for y=maxBound; hence ==, not >
        -- Be very careful not to have more than one "c"
        -- so that when eftInfFB is inlined we can inline
        -- whatever is bound to "c"

This definition of efdtIntUpFB only has a single occurence of c and n and consequently fixes th issue.

But this probably doesn't have the same performance for the non-fused case. Also fingers crossed wrt. correctness.

data CounterState
  = More
  | Last
  | End

-- Requires x2 >= x1
{-# INLINE [0] efdtIntUpFB #-} -- See Note [Inline FB functions] in GHC.List
efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y =
   -- Be careful about overflow!
   let !first_state
         | isTrue# (y <# x2) = if isTrue# (y <# x1) then End else Last
         | otherwise         = More -- Common case: x1 <= x2 <= y

       !delta = x2 -# x1 -- >= 0
       !y' = y -# delta  -- x1 <= y' <= y; hence y' is representable

       next_state End  _     = End
       next_state Last _     = End
       next_state More x
         | isTrue# (x ># y') = Last
         | otherwise         = More

       -- Invariant: x <= y
       -- Note that: z <= y' => z + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
       emit End _ = n
       emit st  x
         | let x' = x +# delta
         = I# x `c` emit (next_state st x') x'

   in emit first_state x1

@nomeata: Ah, right before my nose. Very reassuring.

How about this (untested), which seems a lot simpler

efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y    -- Be careful about overflow!
 | isTrue# (y <# x1)
 = n
 | otherwise
 = go_up x1    -- Common case: x1 <= y
 where
   !delta = x2 -# x1   -- >= 0
   !y' = y -# delta    -- x1 <= y' <= y; hence y' is representable

       -- Invariant: x <= y
       -- Note that: x <= y' => x + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
   go_up x = I# x `c` if isTrue# (x ># y')
                      then n
		      else go_up (x +# delta)

I just ran some quick quickCheck $ withMaxSuccess 100000 $ \i j k -> take 1000 [i,j..k] == take 1000 (efdtInt (unI# i) (unI# j) (unI# k)) tests and both versions pass. Given that simonpj's is much more to the point, let's run with that one! Although the duplicate n has potential to cause pain... But it's also there in eftIntFB, so it's probably fine.

Edit: Well, either the search space is too huge for QC or fusion didn't kick in. Either way, Simon's code has an underflow weakness: E.g. [minBound,minBound+2..minBound+1] == [minBound], but in the new variant y' = minBound+1-2 will underflow. I fixed this in the diff I prepared by an additional flag which has to be checked on every iteration.

Nofib suggests that this regresses allocations in integer by 6.0% and counted instructions by 0.1%. I had a look at the simplified Core and it seems that it's entirely due to the new definition, although I'm not sure where exactly this allocates more. Maybe it's due to an increase in closure size of go_up because of single?. Here's the Core diff and the new definition of efdtIntUpFB for reference.

It seems that c is still not inlined, even with the new definition. I assume that's because there are multiple occurences of c which were probably duplicated before the inliner had a chance to inline the argument c. It better had introduced a join point before.

Maybe loopification helps here? Indeed #14068 suggests that something beneficial happens, maybe more so with this patch.

Or we could introduce some kind of annotation mechanism to tell GHC to be careful not to duplicate occurences of certain parameters that occur once (f {-# HUGE #-} c n = ...).

I really don't want to add new annotations!

Would it be possible to distil out a small example of what goes wrong with integer? It may not be fixable, but it's always worth a try.

I suppose the gain here is important... it's a fairly common case.

I dug through dump-simpl-iterations and noticed that the duplication happens through efdtIntFBs unfolding, which mentions efdtIntUpFB and efdtIntDnFB, which both want to inline their c later on.

efdtIntFB
  = \ (@ r_a3cz)
      (c_a2ho :: Int -> r_a3cz -> r_a3cz)
      (n_a2hp :: r_a3cz)
      (x1_a2hq :: Int#)
      (x2_a2hr :: Int#)
      (y_a2hs :: Int#) ->
      case >=# x2_a2hr x1_a2hq of {
        __DEFAULT ->
          efdtIntDnFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs;
        1# -> efdtIntUpFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs
      }

I'll see what we can do about that tomorrow...

Note that's not a problem for [x..y] (eftInt), because that doesn't need to consider counting down. It's not an issue for [x,y..] (edtInt), because although it calls efdtInt{Up,Dn} internally, it doesn't take part in fusion at all (is that an oversight or by design?).

Here's an implementation of efdtIntFB that fits our requirements:

direction :: Int# -> Int# -> Ordering
direction from to = compareInt# to from

opposed :: Ordering -> Ordering -> Bool
opposed LT GT = True
opposed GT LT = True
opposed _  _  = False

-- | Implements `enumFromThenTo @Int` as per section 6.3.4 of the Haskell2010
-- report:
-- https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#dx13-131001.
efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntFB c n x1 x2 y = emit first x1
  where
    -- We can safely emit the first element if an iteration
    -- doesn't move away from @y@. That's exactly the case when
    -- @dir_x2@ is not opposed to @dir_y@.
    !first  = not (opposed dir_x2 dir_y)
           && (dir_x2 /= EQ || dir_y /= LT) -- [1,1..0] == []
    !dir_x2 = direction x1 x2
    !dir_y  = direction x1 y
    -- We need the overflow flag in 'emit'.
    (# delta, delta_ovf #) = x2 `subIntC#` x1

    -- | Think of @emit :: Maybe Int -> [Int]@, only unboxed.
    -- If the argument is 'Nothing', we reached the end of the list.
    -- If the argument is 'Just', we emit an element, compute
    -- the next candidate, validate it and recurse.
    emit False _ = n
    emit True  x = I# x `c` emit next_ok next
      where
        -- Check that @next@ didn't move past @y@.
        -- Also, overflow is only allowed iff the computation for
        -- @delta@ overflowed.
        (# next, next_ovf #) = addIntC# x delta
        !next_ok =  isTrue# (next_ovf ==# delta_ovf)
                 && not (opposed (direction next y) dir_y)
                 -- TODO: evaluate strict && for branchless code
{-# INLINE[0] efdtIntFB #-}

Some pros:

I find this much easier to understand. No complicated invariants, etc.
No Up/Dn variants to maintain. Still, if the direction is statically known, constant folding and inlining will simplify stuff to the equivalent code.
As a result, no more duplication of c occurrences
Also no more duplication of n occurrences

Cons:

emits closure is 4 words big (2 words bigger than the closure of the original go_up helper) in the most general form. It's unfortunate that we can't pack together dir_y and delta_ovf into a single word without confusing constant folding. This would need either some kind of constant propagation through bit fields (out of scope for GHC, I think) or a smarter closure allocation theme that packs together non-pointer payloads.
We pay for the generalisation of Up/Dn variants by having to compare with dir_y all the time.
base lacks addWordC# primitives, which I'll probably add now

mentioned in issue #15031 (closed)

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to latelambdalifting. In the core reported in ticket:8763#comment:151206, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?

Replying to [ticket:8763#comment:152048 simonpj]:

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to latelambdalifting. In the core reported in ticket:8763#comment:151206, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

In general inlining is good. But in the case on this thread, I think it's the non-allocation of a function closure that saves all the work, rather than any optimisations that happen after inlining c.

I rebased wip/llf onto a recent HEAD commit. You can find my progress here: https://github.com/sgraf812/ghc/tree/llf.

Currently, it doesn't pass the testsuite (even in default mode, which doesn't do any new lambda lifting), probably because I introduced some errors during the merge. I think we should continue the discussion in #9476 (closed).

mentioned in commit bb338f2e

changed milestone to %8.8.1

This won't be fixed in 8.6. Bumping to 8.8.

As this is now 4 years old, does it make sense to open a new issue and close this one? My problem of 3 years ago is basically fixed. In any case I have uploaded a fixed version, reprof.hs of repro.hs which was modified per the comments in 49

Attached file reprof.hs ($1382).

fixed version of repro, modified per comment 49

changed title from forM_ [1..N] does not get fused (10 times slower than go function) to forM_ [1..N] does not get fused (allocates 50% more)

updated summary to match current state of this ticket. The speed issue seems to have been resolved.

Thanks for the curation, George!

Now that I spent some time with late lambda lifting and have an implementation at hand, I revisited this issue.

Indeed, the original $wg binding from ticket:8763#comment:151206 gets lifted to top-level, but there is no reduction in allocation to be had (it's just swapping each mention of the local binding in closures for its single free variable, after all) and instructions executed are roughly the same. That's because part of the original problem persists: The hot loop go_up is still not a join point.

I still think the implementation from ticket:8763#comment:151815 is the way to go. With the improvements to constant folding in D4605, the mentioned increase in closure size should be constant-folded away in the majority of cases. I'll conduct some benchmarks next week.

That sounds great! Thanks to you an everybody else who has improved this.

It turns out that the implementation from ticket:8763#comment:151815 doesn't get rid of the problem. It seems that the hard earned single call to c in emit True x = I# x \c\ emit next_ok next gets duplicated because of case-of-case.

The relevant Core began as this expression:

case (case ==# next_ovf_a3hz delta_ovf_a3h0 of lwild_s4fL {
           __DEFAULT -> GHC.Types.False
           1# -> b_a2jS
      }) of next_ok_a2k8 {
    __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx)
}

Now case-of-case comes along and immediately simplifies this to

case ==# next_ovf_a3hz delta_ovf_a3h0 of {
  __DEFAULT ->
    case b_a2jS of { __DEFAULT ->
      c_a2jU
        (GHC.Types.I# ds_d42Y)
        (emit_a3hf GHC.Types.False next_a3hx)
    };
  1# ->
    case b_a2jS of next_ok_a2k8 { __DEFAULT ->
      c_a2jU
        (GHC.Types.I# ds_d42Y)
        (emit_a3hf next_ok_a2k8 next_a3hx)
    }
}

I'm not sure if the intermediate join point is never generated or is just inlined immediately, but I'd very much like this not to duplicate the call to c.

This happens even in -O0. Not sure how to turn case-of-case of (for debugging reasons), -fno-case-of-case dosn't seem to be a thing anymore.

What, exactly is "the problem"? You say in ticket:8763#comment:159457 that something is not a join point, and LLF helps a bit but not enough.

Can you give a from-scratch explanation of what the problem is, perhaps assuming that LLF is available if necessary?

Sure will do. Let's begin with the original problem. There's two functions in [[attachment:reprof.hs]]. One is f, having a loop of the forM_ [2..n] body form, the other is g which has the forM_ [2,3..n] body form. The former doesn't allocate (56kB in total), whereas the latter allocates quite a lot (960MB).

Why is that? The arithmetic sequences desugar, rewrite and specialise to build (\c t -> eftIntFB c z 2 n) and build (\c z -> efdtIntFB c z 2 3 n), respectively. That cancels away with forM_'s implementation in terms of foldr:

  forM_= flip mapM_
  mapM_ body
= {- definition -}
  foldr ((>>) . body) (return ())
= {- eta expand the section -}
  foldr (\x k -> body x >> k) (return ())
= {- (>>) of ST, written as `s -> (a, s)` for lighter syntax -}
  foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ())

Note how k occurs in tail position within the lambda. Now, this cancels with the definition of ef(d)tIntFB:

  foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ()) . build (\c z -> efdtIntFB c z 2 3 n)
= {- foldr/build -}
  efdtIntFB (\x k s -> case (body x s) of (_, s') -> k s') (return ()) 2 3 n

So, that lambda is what is bound to c when ef(d)tIntFB gets inlined.

Now, the implementation of eftIntFB only has a single occurrence of c, so it will inline properly and bind its k parameter (the continuation in tail position) to the recursive go call here [https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/libraries/base/GHC/Enum.hs#L522]. The call to k turns into a tail call within gos body:

go x = I# x `c` if isTrue# (x ==# y)
       then n
       else go (x +# 1#)
==> c = \x k s -> case (body x s) of (_, s') -> k s'
go x s =
  case (body (I# x) s) of
    (_, s') ->
      if isTrue# (x ==# y)
      then n s'
      else go (x +# 1#) s'

And go can be made a join point.

The same isn't possible in the current efdtIntFB, because it duplicates c by branching on whether to count up or down (and also within the loop itself, anyway):

efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntFB c n x1 x2 y
 | isTrue# (x2 >=# x1) = efdtIntUpFB c n x1 x2 y
 | otherwise = efdtIntDnFB c n x1 x2 y

Now, when efdtInt{Up,Dn}FB gets inlined, we end up with a let-bound c that still tail-calls its argument k, and is even tail-called itself within go_up/go_dn (here https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/libraries/base/GHC/Enum.hs#L588):

let c x k s = case (body x s) of (_, s') -> k s'
in ...
   let go_up x
         | isTrue# (x ># y') = I# x `c` n
         | otherwise = I# x `c` go_up (x +# delta)

Note that go_up tail calls c and passes itself as the k parameter. If c was inlined, all would be fine and go_up would turn into a join point. That's not the case because c is duplicated in efdtIntFB and then one more time in efdtInt{Up,Dn}FB. My first implementation in ticket:8763#comment:151261 (for which you provided a simplification in ticket:8763#comment:151263) dealt with the latter, while the idea in ticket:8763#comment:151815 is supposed to deal with the former. Sadly, case-of-case seems to undo the painful de-duplication of the c parameter (that's what ticket:8763#comment:159581 is about).

Why doesn't LLF help here? Well, lifting out c to top-level gets rid of allocations for c, but there's still at least the allocation for the thunk for go_up (x+1) (the Int box goes away because of strictness). Also, the call to go_up is still an unknown call, as opposed to the simple join call we would get by inlining c.

I can think of another transformation that would save the day here: We just have to put c in the same recursive group as go_up and recognize the mutual recursive join point. Seems like a better way than to mess with the simplifier and we don't even risk duplication of any code! So

let c x k s = case (body x s) of (_, s') -> k s'
in ...
   let go_up x
         | isTrue# (x ># y') = I# x `c` n
         | otherwise = I# x `c` go_up (x +# delta)
==> float `c` inwards into the same recursive group, specialise it for `go_up (x+#delta)` and `n` as `k` (SpecConstr? Would entail seeing tail-calls as kind-of a pattern match for functions)
join go_up x
       | isTrue# (x ># y') = jump c1 (I# x)
       | otherwise = c2 (I# x) go_up (x +# delta)
     c1 x s = -- k = n
       case (body x s) of (_, s') -> n s'
     c2 x s = -- k = go_up (x +# delta)
       case (body x s) of (_, s') -> go_up (x +# delta) s'

Well, it's probably not SpecConstr that will do the specialisation. Also, why specialise when we could just inline c? Seems like we risk duplication of the potentially huge body after all.

Although the same weakness doesn't apply to the situation in ticket:8763#comment:159581: It's enough to specialise for the emit argument (which serves a similar role as go_up) without any specific arguments to see that it's tail called:

let c_a2jU x k s = case (body x s) of (_, s') -> k s'
in join emit_a4hf next_ok next =
          ...
          case ==# next_ovf_a3hz delta_ovf_a3h0 of {
            __DEFAULT ->
              case b_a2jS of { __DEFAULT ->
                c_a2jU
                  (GHC.Types.I# ds_d42Y)
                  (emit_a3hf GHC.Types.False next_a3hx)
              };
            1# ->
              case b_a2jS of next_ok_a2k8 { __DEFAULT ->
                c_a2jU
                  (GHC.Types.I# ds_d42Y)
                  (emit_a3hf next_ok_a2k8 next_a3hx)
              }
          }
==> Specialising `c` for the call pattern
    `[ds s next_ok next] |> [I# ds, emit next_ok next, s]`
    as `c'`
join c' ds s next_ok next =
       case (body (I# ds) s) of (_, s') -> emit_a4hf next_ok next s'
     emit_a4hf next_ok next s =
       ...
       case ==# next_ovf_a3hz delta_ovf_a3h0 of {
         __DEFAULT ->
           case b_a2jS of { __DEFAULT ->
             jump c' ds_d42Y GHC.Types.False next_a3hx s
           };
         1# ->
           case b_a2jS of next_ok_a2k8 { __DEFAULT ->
             jump c' ds_d42Y next_ok_a2k8 next_a3hx s
           }
       }

This latter case is probably a lot easier to handle. Maybe this is worth some specialised pass?

Thanks for the explanation.

Try this

forM_2 :: (Foldable t, Monad m) => t a -> (a -> m b) -> m ()
forM_2 xs f = let c x k = f x >> k
                  {-# INLINE c #-}
              in foldr c (return ()) xs

and use forM_2 instead of forM_ in the outer calls in f and g. I then get good results for both.

How does this work? Well by marking c as INLINE, I prevent f from inlining into it -- remember, the promise of INLINE things is that what you write gets inlined. And this is what we want: c is small, just f x >> k, and inlining it is very very good. Without the INLINE pragmas on c we have something like

  let f = BIG
  in let c x k = f x >> k
  in BODY

Since f occurs just once, we inline f to give

  let c x k = BIG x >> k
  in BODY

and now c becomes too big to inline. This is a classic inlining dilemma: do we inline f into c or c into BODY? The latter is much better in this case.

I think we could build this into the libraries just by changing the definition of mapM_.

Do you agree?

Yes, that works and seems a lot simpler! I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way. The only minor downside is that there might be more unidentified cases where such an INLINE annotation on a lambda could be beneficial.

I'll prepare a patch and benchmark.

I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way.

Yes, it's super-important that INLINE works like that. Consider

f x = <very big>

g y = Just (f y)
{-# INLINE g #-}

where that is the only occurrence of f, but there are zillions of occurrences of g.

You would be jolly annoyed if GHC inlined <very big> into g, and then inlined the new g at each of its zillions of call sites!!

No: INLINE says "when you see a saturated application of this function, inline the RHSI wrote at the call site".

It might be useful to beef up the documentation of INLINE to explain this, if you felt able to do so.

I agree that there may be other places such an INLINE would be desirable. A good start would be a careful Note on mapM_ explaining from first principles why the local let and INLINE is important. Then others can refer to that Note.

Right, actually I was aware of INLINE unfoldings being captured as the unoptimized variant, but have never seen it to actively prevent (or rather postpone) inlining of another part of the code. Quite a neat trick!

I've identified some other functions that should probably rewritten the same way (traverse_, foldrM, etc.). No regressions, no improvements according to NoFib. D5131 passes ./validate.sh.

forM_ [1..N] does not get fused (allocates 50% more)

Child items ...

Activity

forM_ [1..N] does not get fused (allocates 50% more)

Relates to

Activity