Opened 5 years ago

Closed 8 days ago

#8763 closed bug (fixed)

forM_ [1..N] does not get fused (allocates 50% more)

Reported by: nh2 Owned by:
Priority: normal Milestone: 8.8.1
Component: Compiler Version: 7.6.3
Keywords: Cc: mail@…, pivo@…, George, akio, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #7206 Differential Rev(s): Phab:D5131
Wiki Page:

Description (last modified by bgamari)

Apparently idiomatic code like forM_ [1.._N] does not get fused away.

This can give serious performance problems when unnoticed.

-- Slow:
forM_ [0.._N-1] $ \i -> do ...

-- Around 10 times faster:
loop _N $ \i -> do ...

{-# INLINE loop #-}
loop :: (Monad m) => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
  where
    go !n | n == bex = return ()
          | otherwise = f n >> go (n+1)

Full code example: https://gist.github.com/nh2/8905997 - the relevant alternatives are commented.

Attachments (4)

rep.hs (211 bytes) - added by akio 2 years ago.
Test case
fuse.hs (2.5 KB) - added by George 7 months ago.
repro.hs (854 bytes) - added by sgraf 7 months ago.
Reproduction for comment:44
reprof.hs (946 bytes) - added by George 3 months ago.
fixed version of repro, modified per comment 49

Download all attachments as: .zip

Change History (92)

comment:1 Changed 5 years ago by nh2

People on #ghc think that it is because the [1.._N] gets floated out as a top-level constant expression:


nomeata: nh2: I’d consider that a bug nomeata: I believe the problem is that [0..512] does not depend on any local values nomeata: so it is floated out as a top-level value nomeata: and there it is not matched by any rules thoughtpolice: let floating strikes again thoughtpolice: (well, floating-out being a non-optimization, anyway.) Fuuzetsu: does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once? hvr: Fuuzetsu: there's a chance, yes Fuuzetsu: neat thoughtpolice: well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps) thoughtpolice: unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen.

comment:2 Changed 5 years ago by darchon

I believe let-floating happens due to the full-laziness transform. Do you get better performance with -fno-full-laziness ?

comment:3 Changed 5 years ago by nomeata

A comment in SetLevels (which I just came across) in the code indicates that this problem should have been taken care of:

	  -- We are keen to float something to the top level, even if it does not
	  -- escape a lambda, because then it needs no allocation.  But it's controlled
	  -- by a flag, because doing this too early loses opportunities for RULES
	  -- which (needless to say) are important in some nofib programs
	  -- (gcd is an example).

So either my assumption is wrong, or this does not work as desired.

comment:4 Changed 5 years ago by nomeata

It turns out that this comment is obsolete; the flag is never set. I quote from SimplCore

                -- Was: gentleFloatOutSwitches
                --
                -- I have no idea why, but not floating constants to
                -- top level is very bad in some cases.
                --
                -- Notably: p_ident in spectral/rewrite
                --          Changing from "gentle" to "constantsOnly"
                --          improved rewrite's allocation by 19%, and
                --          made 0.0% difference to any other nofib
                --          benchmark

This comment was introduced in eaeca51efc0be3ff865c4530137bfbe9f8553549 (2009) by SPJ.

Maybe rules matching should look though unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing [0..N] in a rule application whose purpose is to eliminate that list.

comment:5 Changed 4 years ago by nh2

@nomeata

Regarding your suspicion that it gets floated out as a constant, I don't see an improvement when getting the upper bound m of [1..m] from an IO action. What do you think?

comment:6 Changed 4 years ago by nomeata

It might still get floated out of some local function. Have you tried coming up with a minimal example?

comment:7 Changed 4 years ago by nh2

@nomeata

There is an example I made for this, mentioned in the bug description.

The performance I measure for that is:

  • using forM_ with ghc -O: 2.0 s
  • using loop with ghc -O: 1.6 s
  • using forM_ with ghc -O2: 0.9 s
  • using loop with ghc -O2: 0.3 s
  • using forM_ with ghc -O2 -fllvm: 0.75 s
  • using loop with ghc -O2 -fllvm: 0.15 s

I tried to make an even smaller benchmark (https://gist.github.com/nh2/11333427) but the performance is identical there although the same thing changes as before.

Could you try my two benchmarks and see if you get the same behaviour?

comment:8 Changed 4 years ago by nh2

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

(Note that measuring the simple example with -fllvm is pointless because it can optimize it away completely. It can't do that with the matmult though.)

Last edited 4 years ago by nh2 (previous) (diff)

comment:9 in reply to:  8 ; Changed 4 years ago by daniel.is.fischer

Replying to nh2:

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed Int#s, at least with -O2. In the matmult example, you need to conceal the fact that both lists are the same from GHC to get a loop on unboxed Int#s.

So in principle, GHC can do the desired thing, just the sharing gets in the way. Can somebody think of a situation where sharing is beneficial for forM_ [a .. b] $ \n -> do ... code? If not, perhaps special-casing enumFromTo arguments for forM_ etc. is, at least for standard integer types, something to be considered.

comment:10 Changed 4 years ago by nh2

Preventing it from sharing sounds sensible for me: If the [a .. b] was something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level.

comment:11 in reply to:  9 Changed 4 years ago by nh2

Replying to daniel.is.fischer:

The simple example doesn't use the same list in different places

Unfortunately, I still get no difference in performance even if I duplicate the loops to

forM_ [1.._MAX] $ \i -> do

UM.unsafeWrite v 0 i

forM_ [1.._MAX] $ \i -> do

UM.unsafeWrite v 0 i

Any idea?

comment:12 Changed 4 years ago by pivo

Cc: pivo@… added

comment:13 Changed 4 years ago by simonpj

I've got lost in this thread. If someone thinks there is a real bug here, in GHC 7.8, could you re-articulate what you think it is?

Simon

comment:14 Changed 4 years ago by nh2

@simonpj: Summary:

1) I reported that my manually written loop is much faster than forM_ [1..n] in some cases, suggesting that in some cases optimizing the list away doesn't work well.

2) nomeata said some technical things that are a bit beyond me.

3) I submit two benchmarks in the gist at ​https://gist.github.com/nh2/11333427, a "matmult" benchmark where there is a big difference between forM_ and the hand-written loop, and a "simple" benchmark where they are equally fast.

4) Daniel suspects the slow case comes from using the same syntactical list twice, and that in this case GHC floats it out to share it, which breaks eliminating it. He suggests we might special-case enumFromTo when used with forM_ to prevent it.

5) I give a counter example for his suspicion, by changing my "simple" benchmark, where using the same list twice gives the same good performance as using it once.

I get the same behaviour for 7.6 and 7.8.

comment:15 Changed 4 years ago by daniel.is.fischer

Slight correction, @Niklas, it's not a suspicion that it's the floating out of the list to the top-level and consequently the use of the list for looping instead of unboxed Int#s, that is direct from the core (-ddump-simpl is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody knowing how GHC works.

comment:16 Changed 4 years ago by simonpj

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

I'm not using criterion; just using +RTS -s, with _SIZE = 300.

So I can't reproduce the problem.

Simon

comment:17 in reply to:  16 Changed 4 years ago by daniel.is.fischer

Replying to simonpj:

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

Are you compiling with -O? You need -O2 for a significant difference to appear (with only -O, both versions are slow).

comment:18 Changed 3 years ago by George

With 7.10.1, compiling with -O2, I see a factor of 2 difference in performance:

benchmarking matmultForM_
time                 10.90 μs   (10.89 μs .. 10.91 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.89 μs   (10.89 μs .. 10.91 μs)
std dev              32.72 ns   (18.98 ns .. 65.42 ns)

benchmarking matmultLoop
time                 5.404 μs   (5.387 μs .. 5.419 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.409 μs   (5.398 μs .. 5.420 μs)
std dev              37.99 ns   (33.64 ns .. 44.26 ns)

with -fllvm it is a factor of almost 3

benchmarking matmultForM_
time                 9.796 μs   (9.788 μs .. 9.806 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.802 μs   (9.795 μs .. 9.815 μs)
std dev              31.64 ns   (21.92 ns .. 49.61 ns)

benchmarking matmultLoop
time                 3.499 μs   (3.497 μs .. 3.502 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.499 μs   (3.497 μs .. 3.503 μs)
std dev              10.27 ns   (7.068 ns .. 16.47 ns)
Last edited 3 years ago by simonpj (previous) (diff)

comment:19 Changed 3 years ago by George

Milestone: 7.12.1

comment:20 Changed 3 years ago by George

Cc: george added

comment:21 Changed 3 years ago by thoughtpolice

Milestone: 7.12.18.0.1

Milestone renamed

comment:22 Changed 3 years ago by George

seeing same issue and performance on 7.10.3. This is on MacOS but I believe the problem occurs on multiple platforms,

This is with _SIZE = 10. Even if I replace

forM_ [0.._SIZE-1]

with

forM_ [0..10-1]

I see the same problem so I don't think it has to do with top level constant expressions

Last edited 3 years ago by George (previous) (diff)

comment:23 Changed 3 years ago by thomie

Milestone: 8.0.1

comment:24 Changed 2 years ago by George

Milestone: 8.0.2

still seeing the same problem on 8.0.1. Any chance of this getting fixed in 8.0.2? If not 8.0.2 I hope 8.1

Last edited 2 years ago by George (previous) (diff)

comment:25 Changed 2 years ago by bgamari

Milestone: 8.0.28.2.1

I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process.

comment:26 Changed 2 years ago by akio

Cc: akio added

Changed 2 years ago by akio

Attachment: rep.hs added

Test case

comment:27 Changed 2 years ago by akio

Here is a test case that does not require criterion. With ghc-8.0.1 -O2 -ddump-simpl, you can see that foo is compiled into code that uses a top-level CAF containing GHC.Enum.eftInt 0# 799#.

comment:28 Changed 2 years ago by bgamari

Description: modified (diff)

comment:29 Changed 2 years ago by akio

It looks like this is the same issue as described in #7206. The branch mentioned in https://ghc.haskell.org/trac/ghc/ticket/7206/#comment:13 fixes my test case.

comment:30 Changed 2 years ago by simonpj

Good to know that's it.

The program in rep.hs is rather unusual because it has a constant list, not dependent on input data. Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.

It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win

Simon

comment:31 in reply to:  30 Changed 2 years ago by nh2

Replying to simonpj:

Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.

Just wanted to chime in that I found this problem in a real-world program, where I looped over all 32-bit integers. Similarly, these known-at-compile-time iteration lists may appear in inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels with fixed size, e.g. 3x3 image templates).

comment:32 Changed 19 months ago by bgamari

Milestone: 8.2.18.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4

comment:33 Changed 9 months ago by bgamari

Milestone: 8.4.18.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:34 Changed 8 months ago by osa1

comment:35 Changed 8 months ago by osa1

I confirmed that #7206 fixes this.

comment:36 Changed 8 months ago by osa1

Status: newinfoneeded

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

comment:37 Changed 8 months ago by nh2

If the problem is gone in 8.2.2, do we know what commit fixed it?

comment:38 Changed 8 months ago by akio

I haven't confirmed, but it looks like 2effe18ab51d66474724d38b20e49cc1b8738f60 may have fixed this. Now fusion happens during the "gentle" phase of simplifier, which happens before any of the float-out passes.

comment:39 in reply to:  36 Changed 7 months ago by George

Replying to osa1:

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?

comment:40 Changed 7 months ago by osa1

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with recursive go but both versions run in about the same time . Is it worth giving that code here in this bug or should I enter a new bug referencing this one?

What is the first argument of forM_ in you program? This ticket (and #7206) is specifically about forM_/mapM_ when the foldable argument is of form [n .. m]. If your program is different than perhaps a new ticket would be better. In any case, it's not a big deal if you paste your program here, we can move it to a new ticket if it turns out to be something different than what we try to improve here.

Changed 7 months ago by George

Attachment: fuse.hs added

comment:41 Changed 7 months ago by George

Cc: George added; george removed

comment:42 Changed 7 months ago by George

Attached file fuse.hs , similar to the first file but with nested forM_, both versions run in the same time but the forM_ version allocates about 50% more

 ghc -O2 fuse.hs +RTS
[1 of 1] Compiling Main             ( fuse.hs, fuse.o )
Linking fuse ...
bash-3.2$ ./fuse 1 +RTS -s
486341683267690
     320,057,352 bytes allocated in the heap
           8,232 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          29,152 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.025s     0.0254s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.432s  (  5.634s elapsed)
  GC      time    0.000s  (  0.025s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    5.433s  (  5.667s elapsed)

  %GC     time       0.0%  (0.4% elapsed)

  Alloc rate    58,918,474 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

bash-3.2$ ./fuse 2 +RTS -s
486341683267690
     560,057,328 bytes allocated in the heap
          15,992 bytes copied during GC
     320,028,576 bytes maximum residency (2 sample(s))
         868,448 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       228 colls,     0 par    0.001s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.026s     0.0128s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.453s  (  5.630s elapsed)
  GC      time    0.002s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    5.455s  (  5.667s elapsed)

  %GC     time       0.0%  (0.5% elapsed)

  Alloc rate    102,698,216 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

comment:43 Changed 7 months ago by George

Status: infoneedednew

Hoping my example is the new reproducer. If not this can be set back to infoneeded or closed.

comment:44 Changed 7 months ago by sgraf

Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of [2,3..n] instead of [2..n], which desugar to different functions. This results in a difference in a huge difference in allocation as well as runtime:

$ ./repro 2 +RTS -s # [2,3..n]
()
     960,056,856 bytes allocated in the heap
          21,536 bytes copied during GC
          44,576 bytes maximum residency (2 sample(s))
          29,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       918 colls,     0 par    0.005s   0.003s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.123s  (  0.125s elapsed)
  GC      time    0.005s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.129s  (  0.129s elapsed)

  %GC     time       4.1%  (2.5% elapsed)

  Alloc rate    7,778,808,106 bytes per MUT second

  Productivity  95.8% of total user, 97.4% of total elapsed
$ ./repro 1 +RTS -s # [2..n]
()
          56,872 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.048s  (  0.048s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.048s  (  0.048s elapsed)

  %GC     time       0.2%  (0.2% elapsed)

  Alloc rate    1,188,432 bytes per MUT second

  Productivity  99.6% of total user, 99.6% of total elapsed

This happens in ST, but not in IO, so probably related to some hack. Also the difference vanishes when we allow the functions to inline.

Here's some Core for g (the offending function):

-- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}
$wg
$wg
  = \ @ s ww w ->
      let { $wc = <huge> } in
      case <# ww 3# of {
        __DEFAULT ->
          let {
            y'
            y' = -# ww 1# } in
          letrec {
            go_up
            go_up
              = \ x eta ->
                  case ># x y' of {
                    __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` <Co:4>) eta;
                    1# -> $wc x (lvl `cast` <Co:4>) eta
                  }; } in
          $wc 2# ((go_up 3#) `cast` <Co:4>) w;
        1# ->
          case <# ww 2# of {
            __DEFAULT -> $wc 2# (lvl `cast` <Co:4>) w;
            1# -> (# w, () #)
          }
      }

From my understanding of join points, $wc is only nearly a join point, because go_up with its transitive tail call to $wc appears in argument position. It would be great if we could get rid of this! The IO variant (g 40000000 >>= print) doesn't have this weakness, it's all join points there. Hence my suspicion about some IO hack that let's GHC eta-expand stuff more aggresively, but I'm not sure how that's helping.

Changed 7 months ago by sgraf

Attachment: repro.hs added

Reproduction for comment:44

comment:45 Changed 7 months ago by sgraf

It seems that for IO, GHC decides that it's OK to inline c from the fusion helper of enumFromThenTo, but not so for ST s.

For our case, c is the <huge> computation (see the worker $wc in comment:44) performed for each outer list element and would be duplicated by inlining: It's mentioned four times in the definition of efdtIntUpFB. Consequently, c has almost always Guidance=NEVER, except in the IO case, where it miraculously gets Guidance=IF_ARGS [20 420 0] 674 0 just when it is inlined. Not sure what this decision is based on.

The inlining decision for eftIntFB is much easier: c only happens once there.

I'm not sure if IO gets special treatment by the inliner, but I see a few ways out:

  • Do the same hacks for ST, if there are any which apply (ugly)
  • Reduce the number of calls to c in the implementation of efdtIntUpFB, probably for worse branch prediction
  • Figure out why the floated out expression of \x -> (nop x *>) occuring in forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ()) doesn't get eta-expanded in the ST case, whereas the relevant IO code is. I hope that by fixing this, the c expression inlines again.

Here's how it inlines for IO:

  (>>) . nop
= \x -> (nop x >>)
= \x -> (nop x *>) -- notice how it's no different than ST up until here
= \x -> (thenIO (nop x))

The inliner probably stops here, but because of eta-expansion modulo coercions to \x k s -> thenIO (nop x) k s, we can inline thenIO:

  \x k s -> thenIO (nop x) y s
= \x k s -> case nop x s of (# new_s, _ #) -> k new_s)

which is much better and probably more keenly inlined than \x -> (nop x *>) in the ST case. What makes GHC eta-expand one, but not the other?

This is just a wild guess and the only real difference I could make out in diffs. Maybe someone with actual insights into the simplifier can comment on this claim (that the inliner gives up on c due to the missed eta-expansion and inlining)?

Last edited 7 months ago by sgraf (previous) (diff)

comment:46 Changed 7 months ago by sgraf

Cc: sgraf added

comment:47 Changed 7 months ago by George

Your example seems to be different than mine in that when I compile yours with ghc 8.4.1 and -O there is no difference in allocation unlike mine. What flags did you compile with?

 ghc -O repro.hs +RTS 
[1 of 1] Compiling Main             ( repro.hs, repro.o )
Linking repro ...
bash-3.2$ ./repro 1 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.052s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.051s  (  0.057s elapsed)

  %GC     time       0.3%  (0.4% elapsed)

  Alloc rate    1,118,716 bytes per MUT second

  Productivity  99.3% of total user, 95.6% of total elapsed

bash-3.2$ ./repro 2 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.051s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    0.051s  (  0.059s elapsed)

  %GC     time       0.3%  (0.3% elapsed)

  Alloc rate    1,120,655 bytes per MUT second

  Productivity  99.3% of total user, 95.7% of total elapsed

comment:48 Changed 7 months ago by sgraf

Yuck! I was under the impression I used GHC 8.4.1 via a nix-shell when in reality I was using another locally installed 8.2.2. So, back to minimization, I guess.

comment:49 Changed 7 months ago by sgraf

It seems I uploaded the variant where I used IO instead of ST, where things still inline. When you substitute ST s for IO and use print $ runST $ ... instead of ... >>= print, it should reproduce with 8.4.1.

Now here's the funny part: I managed to make this reproduce even for IO by duplicating the call to nop. So it seems like c really just hits the threshold where the inliner gives up. The only solution I can think of is what I described in my second point above: Implement efdtIntUpFB in a way that doesn't duplicate c.

In general we should avoid to call c in builds more than once because of scenarios like this. Huge cs aren't uncommon at all (do blocks in forM_ bodies, the functions passed as first argument to foldr, etc.) and otherwise we can't guarantee that everything inlines.

comment:50 Changed 7 months ago by sgraf

I can't attach the fixed reproduction, so use this gist instead: https://gist.github.com/sgraf812/6089d81fbc95af9c5f817ff9dc417401

I'll see if I can craft a variant of efdtInt* that doesn't duplicate c.

comment:51 Changed 7 months ago by nomeata

In general we should avoid to call c in builds more than once because of scenarios like this.

I vaguely having seen a Note about this somewhere, but I cannot find it right now. But yes, a single occurrence of c is beneficial…

Ok, found it:

{-# INLINE [0] eftIntFB #-} -- See Note [Inline FB functions] in GHC.List
eftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> r
eftIntFB c n x0 y | isTrue# (x0 ># y) = n
                  | otherwise         = go x0
                 where
                   go x = I# x `c` if isTrue# (x ==# y)
                                   then n
                                   else go (x +# 1#)
                        -- Watch out for y=maxBound; hence ==, not >
        -- Be very careful not to have more than one "c"
        -- so that when eftInfFB is inlined we can inline
        -- whatever is bound to "c"
Last edited 7 months ago by nomeata (previous) (diff)

comment:52 Changed 7 months ago by sgraf

This definition of efdtIntUpFB only has a single occurence of c and n and consequently fixes th issue.

But this probably doesn't have the same performance for the non-fused case. Also fingers crossed wrt. correctness.

data CounterState
  = More
  | Last
  | End

-- Requires x2 >= x1
{-# INLINE [0] efdtIntUpFB #-} -- See Note [Inline FB functions] in GHC.List
efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y =
   -- Be careful about overflow!
   let !first_state
         | isTrue# (y <# x2) = if isTrue# (y <# x1) then End else Last
         | otherwise         = More -- Common case: x1 <= x2 <= y

       !delta = x2 -# x1 -- >= 0
       !y' = y -# delta  -- x1 <= y' <= y; hence y' is representable

       next_state End  _     = End
       next_state Last _     = End
       next_state More x
         | isTrue# (x ># y') = Last
         | otherwise         = More

       -- Invariant: x <= y
       -- Note that: z <= y' => z + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
       emit End _ = n
       emit st  x
         | let x' = x +# delta
         = I# x `c` emit (next_state st x') x'

   in emit first_state x1
Last edited 7 months ago by sgraf (previous) (diff)

comment:53 Changed 7 months ago by sgraf

@nomeata: Ah, right before my nose. Very reassuring.

comment:54 Changed 7 months ago by simonpj

How about this (untested), which seems a lot simpler

efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y    -- Be careful about overflow!
 | isTrue# (y <# x1)
 = n
 | otherwise
 = go_up x1    -- Common case: x1 <= y
 where
   !delta = x2 -# x1   -- >= 0
   !y' = y -# delta    -- x1 <= y' <= y; hence y' is representable

       -- Invariant: x <= y
       -- Note that: x <= y' => x + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
   go_up x = I# x `c` if isTrue# (x ># y')
                      then n
		      else go_up (x +# delta)

comment:55 Changed 7 months ago by sgraf

I just ran some quick quickCheck $ withMaxSuccess 100000 $ \i j k -> take 1000 [i,j..k] == take 1000 (efdtInt (unI# i) (unI# j) (unI# k)) tests and both versions pass. Given that simonpj's is much more to the point, let's run with that one! Although the duplicate n has potential to cause pain... But it's also there in eftIntFB, so it's probably fine.

Edit: Well, either the search space is too huge for QC or fusion didn't kick in. Either way, Simon's code has an underflow weakness: E.g. [minBound,minBound+2..minBound+1] == [minBound], but in the new variant y' = minBound+1-2 will underflow. I fixed this in the diff I prepared by an additional flag which has to be checked on every iteration.

Last edited 7 months ago by sgraf (previous) (diff)

comment:56 Changed 7 months ago by sgraf

Nofib suggests that this regresses allocations in integer by 6.0% and counted instructions by 0.1%. I had a look at the simplified Core and it seems that it's entirely due to the new definition, although I'm not sure where exactly this allocates more. Maybe it's due to an increase in closure size of go_up because of single?. Here's the Core diff and the new definition of efdtIntUpFB for reference.

It seems that c is still not inlined, even with the new definition. I assume that's because there are multiple occurences of c which were probably duplicated before the inliner had a chance to inline the argument c. It better had introduced a join point before.

Maybe loopification helps here? Indeed #14068 suggests that something beneficial happens, maybe more so with this patch.

Or we could introduce some kind of annotation mechanism to tell GHC to be careful not to duplicate occurences of certain parameters that occur once (f {-# HUGE #-} c n = ...).

comment:57 Changed 7 months ago by simonpj

I really don't want to add new annotations!

Would it be possible to distil out a small example of what goes wrong with integer? It may not be fixable, but it's always worth a try.

I suppose the gain here is important... it's a fairly common case.

comment:58 Changed 6 months ago by sgraf

I dug through dump-simpl-iterations and noticed that the duplication happens through efdtIntFBs unfolding, which mentions efdtIntUpFB and efdtIntDnFB, which both want to inline their c later on.

efdtIntFB
  = \ (@ r_a3cz)
      (c_a2ho :: Int -> r_a3cz -> r_a3cz)
      (n_a2hp :: r_a3cz)
      (x1_a2hq :: Int#)
      (x2_a2hr :: Int#)
      (y_a2hs :: Int#) ->
      case >=# x2_a2hr x1_a2hq of {
        __DEFAULT ->
          efdtIntDnFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs;
        1# -> efdtIntUpFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs
      }

I'll see what we can do about that tomorrow...

comment:59 Changed 6 months ago by sgraf

Note that's not a problem for [x..y] (eftInt), because that doesn't need to consider counting down. It's not an issue for [x,y..] (edtInt), because although it calls efdtInt{Up,Dn} internally, it doesn't take part in fusion at all (is that an oversight or by design?).

comment:60 Changed 6 months ago by sgraf

Here's an implementation of efdtIntFB that fits our requirements:

direction :: Int# -> Int# -> Ordering
direction from to = compareInt# to from

opposed :: Ordering -> Ordering -> Bool
opposed LT GT = True
opposed GT LT = True
opposed _  _  = False

-- | Implements `enumFromThenTo @Int` as per section 6.3.4 of the Haskell2010
-- report:
-- https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#dx13-131001.
efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntFB c n x1 x2 y = emit first x1
  where
    -- We can safely emit the first element if an iteration
    -- doesn't move away from @y@. That's exactly the case when
    -- @dir_x2@ is not opposed to @dir_y@.
    !first  = not (opposed dir_x2 dir_y)
           && (dir_x2 /= EQ || dir_y /= LT) -- [1,1..0] == []
    !dir_x2 = direction x1 x2
    !dir_y  = direction x1 y
    -- We need the overflow flag in 'emit'.
    (# delta, delta_ovf #) = x2 `subIntC#` x1

    -- | Think of @emit :: Maybe Int -> [Int]@, only unboxed.
    -- If the argument is 'Nothing', we reached the end of the list.
    -- If the argument is 'Just', we emit an element, compute
    -- the next candidate, validate it and recurse.
    emit False _ = n
    emit True  x = I# x `c` emit next_ok next
      where
        -- Check that @next@ didn't move past @y@.
        -- Also, overflow is only allowed iff the computation for
        -- @delta@ overflowed.
        (# next, next_ovf #) = addIntC# x delta
        !next_ok =  isTrue# (next_ovf ==# delta_ovf)
                 && not (opposed (direction next y) dir_y)
                 -- TODO: evaluate strict && for branchless code
{-# INLINE[0] efdtIntFB #-}

Some pros:

  • I find this much easier to understand. No complicated invariants, etc.
  • No Up/Dn variants to maintain. Still, if the direction is statically known, constant folding and inlining will simplify stuff to the equivalent code.
  • As a result, no more duplication of c occurrences
  • Also no more duplication of n occurrences

Cons:

  • emits closure is 4 words big (2 words bigger than the closure of the original go_up helper) in the most general form. It's unfortunate that we can't pack together dir_y and delta_ovf into a single word without confusing constant folding. This would need either some kind of constant propagation through bit fields (out of scope for GHC, I think) or a smarter closure allocation theme that packs together non-pointer payloads.
  • We pay for the generalisation of Up/Dn variants by having to compare with dir_y all the time.
  • base lacks addWordC# primitives, which I'll probably add now
Last edited 6 months ago by sgraf (previous) (diff)

comment:61 Changed 6 months ago by simonpj

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to late lambda lifting. In the core reported in comment:44, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?

comment:62 in reply to:  61 Changed 6 months ago by sgraf

Replying to simonpj:

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to late lambda lifting. In the core reported in comment:44, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late-lambda lifting does. And the result might be faster than the very careful code above, because of the extra argument passing and case-tesing it has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

comment:63 Changed 6 months ago by simonpj

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

In general inlining is good. But in the case on this thread, I think it's the non-allocation of a function closure that saves all the work, rather than any optimisations that happen after inlining c.

comment:64 Changed 6 months ago by sgraf

I rebased wip/llf onto a recent HEAD commit. You can find my progress here: https://github.com/sgraf812/ghc/tree/llf.

Currently, it doesn't pass the testsuite (even in default mode, which doesn't do any new lambda lifting), probably because I introduced some errors during the merge. I think we should continue the discussion in #9476.

comment:65 Changed 5 months ago by Ben Gamari <ben@…>

In bb338f2/ghc:

Algebraically simplify add/sub with carry/overflow

Previously, the `{add,sub}{Int,Word}C#` PrimOps weren't handled
in PrelRules (constant folding and algebraic simplification) at all.
This implements the necessary logic, so that using these primitives
isn't too punishing compared to their well-optimised, overflow-unaware
counterparts.

This is so that using these primitives in `enumFromThenTo @Int` can
be optimized by constant folding, reducing closure sizes.

Reviewers: bgamari, simonpj, hsyl20

Reviewed By: bgamari, simonpj

Subscribers: AndreasK, thomie, carter

GHC Trac Issues: #8763

Differential Revision: https://phabricator.haskell.org/D4605

comment:66 Changed 4 months ago by bgamari

Milestone: 8.6.18.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:67 Changed 3 months ago by George

As this is now 4 years old, does it make sense to open a new issue and close this one? My problem of 3 years ago is basically fixed. In any case I have uploaded a fixed version, reprof.hs of repro.hs which was modified per the comments in 49

Last edited 3 months ago by George (previous) (diff)

Changed 3 months ago by George

Attachment: reprof.hs added

fixed version of repro, modified per comment 49

comment:68 Changed 8 weeks ago by George

Summary: forM_ [1..N] does not get fused (10 times slower than go function)forM_ [1..N] does not get fused (allocates 50% more)

updated summary to match current state of this ticket. The speed issue seems to have been resolved.

comment:69 Changed 7 weeks ago by sgraf

Thanks for the curation, George!

Now that I spent some time with late lambda lifting and have an implementation at hand, I revisited this issue.

Indeed, the original $wg binding from comment:44 gets lifted to top-level, but there is no reduction in allocation to be had (it's just swapping each mention of the local binding in closures for its single free variable, after all) and instructions executed are roughly the same. That's because part of the original problem persists: The hot loop go_up is still not a join point.

I still think the implementation from comment:60 is the way to go. With the improvements to constant folding in Phab:D4605, the mentioned increase in closure size should be constant-folded away in the majority of cases. I'll conduct some benchmarks next week.

comment:70 Changed 7 weeks ago by George

That sounds great! Thanks to you an everybody else who has improved this.

comment:71 Changed 7 weeks ago by sgraf

It turns out that the implementation from comment:60 doesn't get rid of the problem. It seems that the hard earned single call to c in emit True x = I# x \c\ emit next_ok next gets duplicated because of case-of-case.

The relevant Core began as this expression:

case (case ==# next_ovf_a3hz delta_ovf_a3h0 of lwild_s4fL {
           __DEFAULT -> GHC.Types.False
           1# -> b_a2jS
      }) of next_ok_a2k8 {
    __DEFAULT -> c_a2jU (GHC.Types.I# ds_d42Y) (emit_a3hf next_ok_a2k8 next_a3hx)
}

Now case-of-case comes along and immediately simplifies this to

case ==# next_ovf_a3hz delta_ovf_a3h0 of {
  __DEFAULT ->
    case b_a2jS of { __DEFAULT ->
      c_a2jU
        (GHC.Types.I# ds_d42Y)
        (emit_a3hf GHC.Types.False next_a3hx)
    };
  1# ->
    case b_a2jS of next_ok_a2k8 { __DEFAULT ->
      c_a2jU
        (GHC.Types.I# ds_d42Y)
        (emit_a3hf next_ok_a2k8 next_a3hx)
    }
}

I'm not sure if the intermediate join point is never generated or is just inlined immediately, but I'd very much like this not to duplicate the call to c.

comment:72 Changed 7 weeks ago by sgraf

This happens even in -O0. Not sure how to turn case-of-case of (for debugging reasons), -fno-case-of-case dosn't seem to be a thing anymore.

comment:73 Changed 7 weeks ago by simonpj

What, exactly is "the problem"? You say in comment:69 that something is not a join point, and LLF helps a bit but not enough.

Can you give a from-scratch explanation of what the problem is, perhaps assuming that LLF is available if necessary?

comment:74 Changed 7 weeks ago by sgraf

Sure will do. Let's begin with the original problem. There's two functions in reprof.hs. One is f, having a loop of the forM_ [2..n] body form, the other is g which has the forM_ [2,3..n] body form. The former doesn't allocate (56kB in total), whereas the latter allocates quite a lot (960MB).

Why is that? The arithmetic sequences desugar, rewrite and specialise to build (\c t -> eftIntFB c z 2 n) and build (\c z -> efdtIntFB c z 2 3 n), respectively. That cancels away with forM_'s implementation in terms of foldr:

  forM_= flip mapM_
  mapM_ body
= {- definition -}
  foldr ((>>) . body) (return ())
= {- eta expand the section -}
  foldr (\x k -> body x >> k) (return ())
= {- (>>) of ST, written as `s -> (a, s)` for lighter syntax -}
  foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ())

Note how k occurs in tail position within the lambda. Now, this cancels with the definition of ef(d)tIntFB:

  foldr (\x k s -> case (body x s) of (_, s') -> k s') (return ()) . build (\c z -> efdtIntFB c z 2 3 n)
= {- foldr/build -}
  efdtIntFB (\x k s -> case (body x s) of (_, s') -> k s') (return ()) 2 3 n

So, that lambda is what is bound to c when ef(d)tIntFB gets inlined.

Now, the implementation of eftIntFB only has a single occurrence of c, so it will inline properly and bind its k parameter (the continuation in tail position) to the recursive go call here https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/libraries/base/GHC/Enum.hs#L522. The call to k turns into a tail call within gos body:

go x = I# x `c` if isTrue# (x ==# y)
       then n
       else go (x +# 1#)
==> c = \x k s -> case (body x s) of (_, s') -> k s'
go x s =
  case (body (I# x) s) of
    (_, s') ->
      if isTrue# (x ==# y)
      then n s'
      else go (x +# 1#) s'

And go can be made a join point.

The same isn't possible in the current efdtIntFB, because it duplicates c by branching on whether to count up or down (and also within the loop itself, anyway):

efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntFB c n x1 x2 y
 | isTrue# (x2 >=# x1) = efdtIntUpFB c n x1 x2 y
 | otherwise = efdtIntDnFB c n x1 x2 y

Now, when efdtInt{Up,Dn}FB gets inlined, we end up with a let-bound c that still tail-calls its argument k, and is even tail-called itself within go_up/go_dn (here https://github.com/ghc/ghc/blob/fa3143c76ac77ee96fd89559cacc089205abaa20/libraries/base/GHC/Enum.hs#L588):

let c x k s = case (body x s) of (_, s') -> k s'
in ...
   let go_up x
         | isTrue# (x ># y') = I# x `c` n
         | otherwise = I# x `c` go_up (x +# delta)

Note that go_up tail calls c and passes itself as the k parameter. If c was inlined, all would be fine and go_up would turn into a join point. That's not the case because c is duplicated in efdtIntFB and then one more time in efdtInt{Up,Dn}FB. My first implementation in comment:52 (for which you provided a simplification in comment:54) dealt with the latter, while the idea in comment:60 is supposed to deal with the former. Sadly, case-of-case seems to undo the painful de-duplication of the c parameter (that's what comment:71 is about).

Why doesn't LLF help here? Well, lifting out c to top-level gets rid of allocations for c, but there's still at least the allocation for the thunk for go_up (x+1) (the Int box goes away because of strictness). Also, the call to go_up is still an unknown call, as opposed to the simple join call we would get by inlining c.

comment:75 Changed 7 weeks ago by sgraf


I can think of another transformation that would save the day here: We just have to put c in the same recursive group as go_up and recognize the mutual recursive join point. Seems like a better way than to mess with the simplifier and we don't even risk duplication of any code! So

let c x k s = case (body x s) of (_, s') -> k s'
in ...
   let go_up x
         | isTrue# (x ># y') = I# x `c` n
         | otherwise = I# x `c` go_up (x +# delta)
==> float `c` inwards into the same recursive group, specialise it for `go_up (x+#delta)` and `n` as `k` (SpecConstr? Would entail seeing tail-calls as kind-of a pattern match for functions)
join go_up x
       | isTrue# (x ># y') = jump c1 (I# x)
       | otherwise = c2 (I# x) go_up (x +# delta)
     c1 x s = -- k = n
       case (body x s) of (_, s') -> n s'
     c2 x s = -- k = go_up (x +# delta)
       case (body x s) of (_, s') -> go_up (x +# delta) s'

Well, it's probably not SpecConstr that will do the specialisation. Also, why specialise when we could just inline c? Seems like we risk duplication of the potentially huge body after all.

Although the same weakness doesn't apply to the situation in comment:71: It's enough to specialise for the emit argument (which serves a similar role as go_up) without any specific arguments to see that it's tail called:

let c_a2jU x k s = case (body x s) of (_, s') -> k s'
in join emit_a4hf next_ok next =
          ...
          case ==# next_ovf_a3hz delta_ovf_a3h0 of {
            __DEFAULT ->
              case b_a2jS of { __DEFAULT ->
                c_a2jU
                  (GHC.Types.I# ds_d42Y)
                  (emit_a3hf GHC.Types.False next_a3hx)
              };
            1# ->
              case b_a2jS of next_ok_a2k8 { __DEFAULT ->
                c_a2jU
                  (GHC.Types.I# ds_d42Y)
                  (emit_a3hf next_ok_a2k8 next_a3hx)
              }
          }
==> Specialising `c` for the call pattern
    `[ds s next_ok next] |> [I# ds, emit next_ok next, s]`
    as `c'`
join c' ds s next_ok next =
       case (body (I# ds) s) of (_, s') -> emit_a4hf next_ok next s'
     emit_a4hf next_ok next s =
       ...
       case ==# next_ovf_a3hz delta_ovf_a3h0 of {
         __DEFAULT ->
           case b_a2jS of { __DEFAULT ->
             jump c' ds_d42Y GHC.Types.False next_a3hx s
           };
         1# ->
           case b_a2jS of next_ok_a2k8 { __DEFAULT ->
             jump c' ds_d42Y next_ok_a2k8 next_a3hx s
           }
       }

This latter case is probably a lot easier to handle. Maybe this is worth some specialised pass?

comment:76 Changed 7 weeks ago by simonpj

Thanks for the explanation.

Try this

forM_2 :: (Foldable t, Monad m) => t a -> (a -> m b) -> m ()
forM_2 xs f = let c x k = f x >> k
                  {-# INLINE c #-}
              in foldr c (return ()) xs

and use forM_2 instead of forM_ in the outer calls in f and g. I then get good results for both.

How does this work? Well by marking c as INLINE, I prevent f from inlining into it -- remember, the promise of INLINE things is that what you write gets inlined. And this is what we want: c is small, just f x >> k, and inlining it is very very good. Without the INLINE pragmas on c we have something like

  let f = BIG
  in let c x k = f x >> k
  in BODY

Since f occurs just once, we inline f to give

  let c x k = BIG x >> k
  in BODY

and now c becomes too big to inline. This is a classic inlining dilemma: do we inline f into c or c into BODY? The latter is much better in this case.

I think we could build this into the libraries just by changing the definition of mapM_.

Do you agree?

comment:77 Changed 7 weeks ago by sgraf

Yes, that works and seems a lot simpler! I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way. The only minor downside is that there might be more unidentified cases where such an INLINE annotation on a lambda could be beneficial.

I'll prepare a patch and benchmark.

comment:78 Changed 7 weeks ago by simonpj

I didn't know that INLINE works 'both ways', e.g. that we can prevent inlining the body into c this way.

Yes, it's super-important that INLINE works like that. Consider

f x = <very big>

g y = Just (f y)
{-# INLINE g #-}

where that is the only occurrence of f, but there are zillions of occurrences of g.

You would be jolly annoyed if GHC inlined <very big> into g, and then inlined the new g at each of its zillions of call sites!!

No: INLINE says "when you see a saturated application of this function, inline the RHSI wrote at the call site".

It might be useful to beef up the documentation of INLINE to explain this, if you felt able to do so.

I agree that there may be other places such an INLINE would be desirable. A good start would be a careful Note on mapM_ explaining from first principles why the local let and INLINE is important. Then others can refer to that Note.

comment:79 Changed 7 weeks ago by sgraf

Right, actually I was aware of INLINE unfoldings being captured as the unoptimized variant, but have never seen it to actively prevent (or rather postpone) inlining of another part of the code. Quite a neat trick!

I've identified some other functions that should probably rewritten the same way (traverse_, foldrM, etc.). No regressions, no improvements according to NoFib. Phab:D5131 passes ./validate.sh.

comment:80 Changed 7 weeks ago by sgraf

Differential Rev(s): Phab:D5131

comment:81 Changed 6 weeks ago by simonpj

Here's a longer, and to me more comprehensible, Note

Note [List fusion and continuations in 'c']
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Suppose we define
   mapM_ f = foldr ((>>) . f) (return ())
(this is the way it used to be).

Now suppose we want to optimise the call

  mapM_ <big> (build g)
    where
  g c n = ...(c x1 y1)...(c x2 y2)....n...

GHC used to proceed like this:

  mapM_ <big> (build g)

  = { Defintion of mapM_ }
    foldr ((>>) . <big>) (return ()) (build g)

  = { foldr/build rule }
    g ((>>) . <big>) (return ())

  = { Inline g ]
    let c = (>>) . <big>
        n = return ()
    in ...(c x1 y1)...(c x2 y2)....n...

The trouble is that `c`, being big, will not be inlined.  And that can
be absolutely terrible for performance, as we saw in Trac #8763.

It's much better to define

   mapM_ f = foldr c (return ())
     where
       c x k = f x >> k
       {-# INLINE c #-}

Now we get
  mapM_ <big> (build g)

  = { inline mapM_ }
    foldr c (return ()) (build g)
      where c x k = f x >> k
            {-# INLINE c #-}
            f = <big>

Notice that `f` does not inine into the RHS of `c`,
because the ININE pragma stops it; see 
Note [Simplifying inside stable unfoldings] in SimplUtils.
Continuing:

  = { foldr/build rule }
    g c (return ())
      where ...
         c x k = f x >> k
         {-# INLINE c #-}
            f = <big>

  = { inline g }
    ...(c x1 y1)...(c x2 y2)....n...
      where c x k = f x >> k
            {-# INLINE c #-}
            f = <big>
            n = return ()

      Now, crucially, `c` does inline

  = { inline c }
    ...(f x1 >> y1)...(f x2 >> y2)....n...
      where f = <big>
            n = return ()

And all is well!  The key thing is that the fragment
`(f x1 >> y1)` is inlined into the body of the builder
`g`.
Last edited 6 weeks ago by simonpj (previous) (diff)

comment:82 Changed 6 weeks ago by sgraf

OK, I revised the patch. I agree, your examples are worth a thousand words.

comment:83 Changed 6 weeks ago by sgraf

Well, I revised the patch but I am now at work, where they block some outgoing ports needed for pushing to staging. Long story short, this will have to wait until tonight or at least until I figured out a way to configure a proxy.

comment:84 Changed 5 weeks ago by simonpj

Let's commit this!

comment:85 Changed 5 weeks ago by Krzysztof Gogolewski <krz.gogolewski@…>

In e655aac1/ghc:

Make sure forM_ and related functions fuse cleanly

Summary:
It was revealed in #8763 that it's hard to come up with a list fusion
helper for `efdtIntFB` that doesn't duplicated occurrences of `c`,
which is crucial in guaranteeing that it is inlined.

Not inlining `c` led to spoiled join points, in turn leading to unnecessary
heap allocation. This patch tackles the problem from a different angle:
Fixing all consumers instead of the less often used producer
`efdtIntFB` by inserting an INLINE pragma in the appropriate places.
See https://ghc.haskell.org/trac/ghc/ticket/8763#comment:76 and the new
Note [List fusion and continuations in 'c'].

A quick run of NoFib revealed no regression or improvements whatsoever.

Reviewers: hvr, bgamari, simonpj

Reviewed By: simonpj

Subscribers: simonpj, rwbarton, carter

GHC Trac Issues: #8763

Differential Revision: https://phabricator.haskell.org/D5131

comment:86 Changed 5 weeks ago by monoidal

Status: newmerge

Not sure if we'd like to merge this.

comment:87 Changed 3 weeks ago by RyanGlScott

Milestone: 8.8.18.6.2

Moving to the 8.6.2 milestone, since these tickets were all recently marked as merge.

comment:88 Changed 8 days ago by bgamari

Milestone: 8.6.28.8.1
Resolution: fixed
Status: mergeclosed

I'm going to leave this for 8.8.

Note: See TracTickets for help on using tickets.