Opened 5 years ago

Last modified 3 weeks ago

#8763 new bug

forM_ [1..N] does not get fused (10 times slower than go function)

Reported by: nh2 Owned by:
Priority: normal Milestone: 8.8.1
Component: Compiler Version: 7.6.3
Keywords: Cc: mail@…, pivo@…, George, akio, sgraf
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #7206 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

Apparently idiomatic code like forM_ [1.._N] does not get fused away.

This can cause serious performance problems when it goes unnoticed.

-- Slow:
forM_ [0.._N-1] $ \i -> do ...

-- Around 10 times faster:
loop _N $ \i -> do ...

{-# INLINE loop #-}
loop :: (Monad m) => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
  where
    go !n | n == bex = return ()
          | otherwise = f n >> go (n+1)

Full code example: https://gist.github.com/nh2/8905997 - the relevant alternatives are commented.
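For reference, here is a minimal self-contained sketch of the comparison (my illustration, assuming the vector package; the gist above contains the actual benchmark):

{-# LANGUAGE BangPatterns #-}
import Control.Monad (forM_)
import qualified Data.Vector.Unboxed.Mutable as UM

loop :: Monad m => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
  where
    go !n | n == bex  = return ()
          | otherwise = f n >> go (n+1)
{-# INLINE loop #-}

main :: IO ()
main = do
  v <- UM.new 1 :: IO (UM.IOVector Int)
  -- Slow: the enumeration may be floated out instead of fused away.
  forM_ [0 .. 100000000 - 1] $ \i -> UM.unsafeWrite v 0 i
  -- Fast: the explicit counting loop compiles to a tight loop over Int#.
  loop 100000000 $ \i -> UM.unsafeWrite v 0 i
  print =<< UM.unsafeRead v 0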

Attachments (4)

rep.hs (211 bytes) - added by akio 2 years ago.
Test case
fuse.hs (2.5 KB) - added by George 5 months ago.
repro.hs (854 bytes) - added by sgraf 5 months ago.
Reproduction for comment:44
reprof.hs (946 bytes) - added by George 3 weeks ago.
Fixed version of repro.hs, modified per comment:49


Change History (71)

comment:1 Changed 5 years ago by nh2

People on #ghc think that it is because the [1.._N] gets floated out as a top-level constant expression:


nomeata: nh2: I’d consider that a bug
nomeata: I believe the problem is that [0..512] does not depend on any local values
nomeata: so it is floated out as a top-level value
nomeata: and there it is not matched by any rules
thoughtpolice: let floating strikes again
thoughtpolice: (well, floating-out being a non-optimization, anyway.)
Fuuzetsu: does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once?
hvr: Fuuzetsu: there's a chance, yes
Fuuzetsu: neat
thoughtpolice: well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps)
thoughtpolice: unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen.

comment:2 Changed 5 years ago by darchon

I believe let-floating happens due to the full-laziness transform. Do you get better performance with -fno-full-laziness ?
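For readers unfamiliar with the transform: full laziness floats expressions that do not depend on any lambda-bound variable outwards, potentially to the top level. Schematically (my illustration, not actual GHC output; imports as in the sketch in the description):

-- Before full laziness: the list sits next to its consumer, where the
-- foldr/build fusion RULE can still eliminate it.
before :: UM.IOVector Int -> IO ()
before v = forM_ [0 .. 511] $ \i -> UM.unsafeWrite v 0 i

-- After full laziness: the constant list is floated to the top level so
-- it can be shared, but the use site no longer contains a build form
-- for the RULE to match on.
lvl :: [Int]
lvl = [0 .. 511]

after :: UM.IOVector Int -> IO ()
after v = forM_ lvl $ \i -> UM.unsafeWrite v 0 i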

comment:3 Changed 5 years ago by nomeata

A comment in SetLevels (which I just came across) indicates that this problem should have been taken care of:

	  -- We are keen to float something to the top level, even if it does not
	  -- escape a lambda, because then it needs no allocation.  But it's controlled
	  -- by a flag, because doing this too early loses opportunities for RULES
	  -- which (needless to say) are important in some nofib programs
	  -- (gcd is an example).

So either my assumption is wrong, or this does not work as desired.

comment:4 Changed 5 years ago by nomeata

It turns out that this comment is obsolete; the flag is never set. I quote from SimplCore:

                -- Was: gentleFloatOutSwitches
                --
                -- I have no idea why, but not floating constants to
                -- top level is very bad in some cases.
                --
                -- Notably: p_ident in spectral/rewrite
                --          Changing from "gentle" to "constantsOnly"
                --          improved rewrite's allocation by 19%, and
                --          made 0.0% difference to any other nofib
                --          benchmark

This comment was introduced in eaeca51efc0be3ff865c4530137bfbe9f8553549 (2009) by SPJ.

Maybe rule matching should look through unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing [0..N] in a rule application whose purpose is to eliminate that list.

comment:5 Changed 4 years ago by nh2

@nomeata

Regarding your suspicion that it gets floated out as a constant, I don't see an improvement when getting the upper bound m of [1..m] from an IO action. What do you think?

comment:6 Changed 4 years ago by nomeata

It might still get floated out of some local function. Have you tried coming up with a minimal example?

comment:7 Changed 4 years ago by nh2

@nomeata

There is an example I made for this, mentioned in the bug description.

The performance I measure for that is:

  • using forM_ with ghc -O: 2.0 s
  • using loop with ghc -O: 1.6 s
  • using forM_ with ghc -O2: 0.9 s
  • using loop with ghc -O2: 0.3 s
  • using forM_ with ghc -O2 -fllvm: 0.75 s
  • using loop with ghc -O2 -fllvm: 0.15 s

I tried to make an even smaller benchmark (https://gist.github.com/nh2/11333427), but there the performance is identical, even though I change the same thing as before.

Could you try my two benchmarks and see if you get the same behaviour?

comment:8 Changed 4 years ago by nh2

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

(Note that measuring the simple example with -fllvm is pointless because it can optimize it away completely. It can't do that with the matmult though.)

Last edited 4 years ago by nh2

comment:9 in reply to:  8 ; Changed 4 years ago by daniel.is.fischer

Replying to nh2:

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed Int#s, at least with -O2. In the matmult example, you need to conceal the fact that both lists are the same from GHC to get a loop on unboxed Int#s.

So in principle, GHC can do the desired thing; just the sharing gets in the way. Can somebody think of a situation where sharing is beneficial for forM_ [a .. b] $ \n -> do ... code? If not, perhaps special-casing enumFromTo arguments to forM_ etc., at least for the standard integer types, is something to consider; a sketch follows below.
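A hypothetical sketch of such a special case (forInts_ and loopFromTo are made-up names, and this design is untested; a RULE on the overloaded forM_ itself would rarely fire, because class methods are replaced by dictionary selectors early):

{-# LANGUAGE BangPatterns #-}
import Control.Monad (forM_)

-- A counting loop that needs no list at all.
loopFromTo :: Monad m => Int -> Int -> (Int -> m ()) -> m ()
loopFromTo lo hi f = go lo
  where
    go !n | n > hi    = return ()
          | otherwise = f n >> go (n + 1)
{-# INLINE loopFromTo #-}

-- A monomorphic wrapper gives the RULE something reliable to match on.
forInts_ :: Monad m => [Int] -> (Int -> m ()) -> m ()
forInts_ = forM_
{-# NOINLINE [1] forInts_ #-}

{-# RULES
"forInts_/enumFromTo" forall lo hi f.
    forInts_ (enumFromTo lo hi) f = loopFromTo lo hi f
  #-}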

comment:10 Changed 4 years ago by nh2

Preventing it from sharing sounds sensible to me: if the [a .. b] was something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level.

comment:11 in reply to:  9 Changed 4 years ago by nh2

Replying to daniel.is.fischer:

The simple example doesn't use the same list in different places

Unfortunately, I still get no difference in performance even if I duplicate the loops to

forM_ [1.._MAX] $ \i -> do
  UM.unsafeWrite v 0 i

forM_ [1.._MAX] $ \i -> do
  UM.unsafeWrite v 0 i

Any idea?

comment:12 Changed 4 years ago by pivo

Cc: pivo@… added

comment:13 Changed 4 years ago by simonpj

I've got lost in this thread. If someone thinks there is a real bug here, in GHC 7.8, could you re-articulate what you think it is?

Simon

comment:14 Changed 4 years ago by nh2

@simonpj: Summary:

1) I reported that my manually written loop is much faster than forM_ [1..n] in some cases, suggesting that in some cases optimizing the list away doesn't work well.

2) nomeata said some technical things that are a bit beyond me.

3) I submit two benchmarks in the gist at https://gist.github.com/nh2/11333427: a "matmult" benchmark where there is a big difference between forM_ and the hand-written loop, and a "simple" benchmark where they are equally fast.

4) Daniel suspects the slow case comes from using the same syntactical list twice, and that in this case GHC floats it out to share it, which breaks eliminating it. He suggests we might special-case enumFromTo when used with forM_ to prevent it.

5) I give a counterexample to his suspicion by changing my "simple" benchmark: using the same list twice gives the same good performance as using it once.

I get the same behaviour for 7.6 and 7.8.

comment:15 Changed 4 years ago by daniel.is.fischer

Slight correction, @Niklas: it's not a suspicion. That it's the floating out of the list to the top level, and consequently the use of the list for looping instead of unboxed Int#s, can be read directly from the Core (-ddump-simpl is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody who knows how GHC works.

comment:16 Changed 4 years ago by simonpj

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

I'm not using criterion; just using +RTS -s, with _SIZE = 300.

So I can't reproduce the problem.

Simon

comment:17 in reply to:  16 Changed 4 years ago by daniel.is.fischer

Replying to simonpj:

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

Are you compiling with -O? You need -O2 for a significant difference to appear (with only -O, both versions are slow).

comment:18 Changed 3 years ago by George

With 7.10.1, compiling with -O2, I see a factor of 2 difference in performance:

benchmarking matmultForM_
time                 10.90 μs   (10.89 μs .. 10.91 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.89 μs   (10.89 μs .. 10.91 μs)
std dev              32.72 ns   (18.98 ns .. 65.42 ns)

benchmarking matmultLoop
time                 5.404 μs   (5.387 μs .. 5.419 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.409 μs   (5.398 μs .. 5.420 μs)
std dev              37.99 ns   (33.64 ns .. 44.26 ns)

With -fllvm it is a factor of almost 3:

benchmarking matmultForM_
time                 9.796 μs   (9.788 μs .. 9.806 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.802 μs   (9.795 μs .. 9.815 μs)
std dev              31.64 ns   (21.92 ns .. 49.61 ns)

benchmarking matmultLoop
time                 3.499 μs   (3.497 μs .. 3.502 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.499 μs   (3.497 μs .. 3.503 μs)
std dev              10.27 ns   (7.068 ns .. 16.47 ns)
Last edited 3 years ago by simonpj

comment:19 Changed 3 years ago by George

Milestone: 7.12.1

comment:20 Changed 3 years ago by George

Cc: george added

comment:21 Changed 3 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:22 Changed 3 years ago by George

I am seeing the same issue and performance on 7.10.3. This is on macOS, but I believe the problem occurs on multiple platforms.

This is with _SIZE = 10. Even if I replace

forM_ [0.._SIZE-1]

with

forM_ [0..10-1]

I see the same problem, so I don't think it has to do with top-level constant expressions.

Last edited 3 years ago by George

comment:23 Changed 3 years ago by thomie

Milestone: 8.0.1

comment:24 Changed 2 years ago by George

Milestone: 8.0.2

Still seeing the same problem on 8.0.1. Any chance of this getting fixed in 8.0.2? If not 8.0.2, I hope 8.1.

Last edited 2 years ago by George

comment:25 Changed 2 years ago by bgamari

Milestone: 8.0.2 → 8.2.1

I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process.

comment:26 Changed 2 years ago by akio

Cc: akio added

Changed 2 years ago by akio

Attachment: rep.hs added

Test case

comment:27 Changed 2 years ago by akio

Here is a test case that does not require criterion. With ghc-8.0.1 -O2 -ddump-simpl, you can see that foo is compiled into code that uses a top-level CAF containing GHC.Enum.eftInt 0# 799#.
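The attachment is only 211 bytes; a guess at the shape of such a test case (my reconstruction, not the actual contents of rep.hs):

import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

foo :: ST s Int
foo = do
  r <- newSTRef 0
  forM_ [0 .. 799] $ \i -> modifySTRef' r (+ i)
  readSTRef r

main :: IO ()
main = print (runST foo)

Whether this exact program exhibits the CAF is an assumption; the point is the shape: a constant enumeration consumed by forM_ inside ST.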

comment:28 Changed 2 years ago by bgamari

Description: modified (diff)

comment:29 Changed 2 years ago by akio

It looks like this is the same issue as described in #7206. The branch mentioned in https://ghc.haskell.org/trac/ghc/ticket/7206/#comment:13 fixes my test case.

comment:30 Changed 2 years ago by simonpj

Good to know that's it.

The program in rep.hs is rather unusual because it has a constant list, not dependent on input data. Real programs usually depend on input data. So this problem may be less of a problem in practice than in benchmarks.

It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win.

Simon

comment:31 in reply to:  30 Changed 2 years ago by nh2

Replying to simonpj:

Real programs usually depend on input data. So this probem may be less of a probem in practice than in benchmarks.

Just wanted to chime in that I found this problem in a real-world program, where I looped over all 32-bit integers. Similarly, these known-at-compile-time iteration lists may appear in inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels with fixed size, e.g. 3x3 image templates).

comment:32 Changed 17 months ago by bgamari

Milestone: 8.2.1 → 8.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4 milestone.

comment:33 Changed 7 months ago by bgamari

Milestone: 8.4.1 → 8.6.1

This ticket won't be resolved in 8.4; remilestoning for 8.6. Do holler if you are affected by this or would otherwise like to work on it.

comment:34 Changed 5 months ago by osa1

comment:35 Changed 5 months ago by osa1

I confirmed that #7206 fixes this.

comment:36 Changed 5 months ago by osa1

Status: new → infoneeded

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

comment:37 Changed 5 months ago by nh2

If the problem is gone in 8.2.2, do we know what commit fixed it?

comment:38 Changed 5 months ago by akio

I haven't confirmed, but it looks like 2effe18ab51d66474724d38b20e49cc1b8738f60 may have fixed this. Now fusion happens during the "gentle" phase of the simplifier, which runs before any of the float-out passes.

comment:39 in reply to:  36 Changed 5 months ago by George

Replying to osa1:

Sorry for the confusion -- it turns out GHC 8.2.2 already optimizes this. Perhaps this can be closed or we may need a new reproducer.

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with a recursive go, but both versions run in about the same time. Is it worth giving that code here in this bug, or should I enter a new bug referencing this one?

comment:40 Changed 5 months ago by osa1

I have a very similar program where, with 8.4.1, the forM_ version allocates 50% more than the version with a recursive go, but both versions run in about the same time. Is it worth giving that code here in this bug, or should I enter a new bug referencing this one?

What is the first argument of forM_ in your program? This ticket (and #7206) is specifically about forM_/mapM_ when the Foldable argument is of the form [n .. m]. If your program is different, then perhaps a new ticket would be better. In any case, it's not a big deal if you paste your program here; we can move it to a new ticket if it turns out to be something different from what we are trying to improve here.

Changed 5 months ago by George

Attachment: fuse.hs added

comment:41 Changed 5 months ago by George

Cc: George added; george removed

comment:42 Changed 5 months ago by George

Attached file fuse.hs, similar to the first file but with nested forM_. Both versions run in the same time, but the forM_ version allocates about 50% more:

 ghc -O2 fuse.hs +RTS
[1 of 1] Compiling Main             ( fuse.hs, fuse.o )
Linking fuse ...
bash-3.2$ ./fuse 1 +RTS -s
486341683267690
     320,057,352 bytes allocated in the heap
           8,232 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          29,152 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.025s     0.0254s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.432s  (  5.634s elapsed)
  GC      time    0.000s  (  0.025s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    5.433s  (  5.667s elapsed)

  %GC     time       0.0%  (0.4% elapsed)

  Alloc rate    58,918,474 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

bash-3.2$ ./fuse 2 +RTS -s
486341683267690
     560,057,328 bytes allocated in the heap
          15,992 bytes copied during GC
     320,028,576 bytes maximum residency (2 sample(s))
         868,448 bytes maximum slop
             308 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       228 colls,     0 par    0.001s   0.002s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.026s     0.0128s    0.0254s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    5.453s  (  5.630s elapsed)
  GC      time    0.002s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.008s elapsed)
  Total   time    5.455s  (  5.667s elapsed)

  %GC     time       0.0%  (0.5% elapsed)

  Alloc rate    102,698,216 bytes per MUT second

  Productivity 100.0% of total user, 99.5% of total elapsed

comment:43 Changed 5 months ago by George

Status: infoneeded → new

Hoping my example is the new reproducer. If not, this can be set back to infoneeded or closed.

comment:44 Changed 5 months ago by sgraf

Here is a smaller example that highlights the problem without vectors. The only difference between the two functions is the use of [2,3..n] instead of [2..n]; the two desugar to different functions. This results in a huge difference in allocation as well as runtime:

$ ./repro 2 +RTS -s # [2,3..n]
()
     960,056,856 bytes allocated in the heap
          21,536 bytes copied during GC
          44,576 bytes maximum residency (2 sample(s))
          29,152 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       918 colls,     0 par    0.005s   0.003s     0.0000s    0.0000s
  Gen  1         2 colls,     0 par    0.000s   0.000s     0.0001s    0.0002s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.123s  (  0.125s elapsed)
  GC      time    0.005s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.129s  (  0.129s elapsed)

  %GC     time       4.1%  (2.5% elapsed)

  Alloc rate    7,778,808,106 bytes per MUT second

  Productivity  95.8% of total user, 97.4% of total elapsed
$ ./repro 1 +RTS -s # [2..n]
()
          56,872 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.048s  (  0.048s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.048s  (  0.048s elapsed)

  %GC     time       0.2%  (0.2% elapsed)

  Alloc rate    1,188,432 bytes per MUT second

  Productivity  99.6% of total user, 99.6% of total elapsed

This happens in ST, but not in IO, so it is probably related to some IO-specific hack. Also, the difference vanishes when we allow the functions to inline.

Here's some Core for g (the offending function):

-- RHS size: {terms: 235, types: 242, coercions: 61, joins: 4/13}
$wg
$wg
  = \ @ s ww w ->
      let { $wc = <huge> } in
      case <# ww 3# of {
        __DEFAULT ->
          let {
            y'
            y' = -# ww 1# } in
          letrec {
            go_up
            go_up
              = \ x eta ->
                  case ># x y' of {
                    __DEFAULT -> $wc x ((go_up (+# x 1#)) `cast` <Co:4>) eta;
                    1# -> $wc x (lvl `cast` <Co:4>) eta
                  }; } in
          $wc 2# ((go_up 3#) `cast` <Co:4>) w;
        1# ->
          case <# ww 2# of {
            __DEFAULT -> $wc 2# (lvl `cast` <Co:4>) w;
            1# -> (# w, () #)
          }
      }

From my understanding of join points, $wc is only nearly a join point, because go_up, with its transitive tail call to $wc, appears in argument position. It would be great if we could get rid of this! The IO variant (g 40000000 >>= print) doesn't have this weakness; it's all join points there. Hence my suspicion about some IO hack that lets GHC eta-expand stuff more aggressively, but I'm not sure how that's helping.
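For readers not fluent in Core: a join point must only ever be tail-called. Roughly, as a Haskell-level analogy (schematic, not the Core above):

-- OK: j is only tail-called, so it can become a join point and needs
-- no closure allocation.
f1 :: Bool -> Int
f1 b = let j x = x * 2 + 1          -- imagine <huge> here
       in case b of
            True  -> j 1
            False -> j 2

-- Not OK: j is passed as (part of) an argument, so it must remain an
-- ordinary closure. This is the $wc/go_up situation above, where go_up
-- appears as an argument to $wc.
f2 :: Bool -> [Int]
f2 _ = let j x = x * 2 + 1
       in map j [1, 2, 3]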

Changed 5 months ago by sgraf

Attachment: repro.hs added

Reproduction for comment:44

comment:45 Changed 5 months ago by sgraf

It seems that for IO, GHC decides that it's OK to inline c from the fusion helper of enumFromThenTo, but not so for ST s.

For our case, c is the <huge> computation (see the worker $wc in comment:44) performed for each outer list element, and it would be duplicated by inlining: it's mentioned four times in the definition of efdtIntUpFB. Consequently, c almost always has Guidance=NEVER, except in the IO case, where it miraculously gets Guidance=IF_ARGS [20 420 0] 674 0 just when it is inlined. I'm not sure what this decision is based on.

The inlining decision for eftIntFB is much easier: c occurs only once there.

I'm not sure if IO gets special treatment by the inliner, but I see a few ways out:

  • Do the same hacks for ST, if there are any which apply (ugly)
  • Reduce the number of calls to c in the implementation of efdtIntUpFB, probably at the cost of worse branch prediction
  • Figure out why the floated-out expression \x -> (nop x *>) occurring in forM_ nop = flip mapM_ nop = foldr ((>>) . nop) (return ()) doesn't get eta-expanded in the ST case, whereas the relevant IO code is. I hope that by fixing this, the c expression inlines again.

Here's how it inlines for IO:

  (>>) . nop
= \x -> (nop x >>)
= \x -> (nop x *>) -- notice how it's no different than ST up until here
= \x -> (thenIO (nop x))

The inliner probably stops here, but because of eta-expansion modulo coercions to \x k s -> thenIO (nop x) k s, we can inline thenIO:

  \x k s -> thenIO (nop x) k s
= \x k s -> case nop x s of (# new_s, _ #) -> k new_s

which is much better and probably more keenly inlined than \x -> (nop x *>) in the ST case. What makes GHC eta-expand one, but not the other?

This is just a wild guess and the only real difference I could make out in diffs. Maybe someone with actual insights into the simplifier can comment on this claim (that the inliner gives up on c due to the missed eta-expansion and inlining)?

Last edited 5 months ago by sgraf

comment:46 Changed 5 months ago by sgraf

Cc: sgraf added

comment:47 Changed 5 months ago by George

Your example seems to be different from mine: when I compile yours with GHC 8.4.1 and -O, there is no difference in allocation, unlike with mine. What flags did you compile with?

 ghc -O repro.hs +RTS 
[1 of 1] Compiling Main             ( repro.hs, repro.o )
Linking repro ...
bash-3.2$ ./repro 1 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.052s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    0.051s  (  0.057s elapsed)

  %GC     time       0.3%  (0.4% elapsed)

  Alloc rate    1,118,716 bytes per MUT second

  Productivity  99.3% of total user, 95.6% of total elapsed

bash-3.2$ ./repro 2 +RTS -s
()
          56,880 bytes allocated in the heap
           3,480 bytes copied during GC
          44,576 bytes maximum residency (1 sample(s))
          25,056 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0002s    0.0002s

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    0.051s  (  0.051s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.005s elapsed)
  Total   time    0.051s  (  0.059s elapsed)

  %GC     time       0.3%  (0.3% elapsed)

  Alloc rate    1,120,655 bytes per MUT second

  Productivity  99.3% of total user, 95.7% of total elapsed

comment:48 Changed 5 months ago by sgraf

Yuck! I was under the impression I used GHC 8.4.1 via a nix-shell when in reality I was using another locally installed 8.2.2. So, back to minimization, I guess.

comment:49 Changed 5 months ago by sgraf

It seems I uploaded the variant where I used IO instead of ST, where things still inline. When you substitute ST s for IO and use print $ runST $ ... instead of ... >>= print, it should reproduce with 8.4.1.

Now here's the funny part: I managed to make this reproduce even for IO by duplicating the call to nop. So it seems like c really just hits the threshold where the inliner gives up. The only solution I can think of is what I described in my second point above: Implement efdtIntUpFB in a way that doesn't duplicate c.

In general, we should avoid calling c more than once in build-style producers because of scenarios like this. Huge cs aren't uncommon at all (do blocks in forM_ bodies, the functions passed as the first argument to foldr, etc.), and otherwise we can't guarantee that everything inlines.

comment:50 Changed 5 months ago by sgraf

I can't attach the fixed reproduction, so use this gist instead: https://gist.github.com/sgraf812/6089d81fbc95af9c5f817ff9dc417401

I'll see if I can craft a variant of efdtInt* that doesn't duplicate c.

comment:51 Changed 5 months ago by nomeata

In general, we should avoid calling c more than once in build-style producers because of scenarios like this.

I vaguely remember having seen a Note about this somewhere, but I cannot find it right now. But yes, a single occurrence of c is beneficial…

Ok, found it:

{-# INLINE [0] eftIntFB #-} -- See Note [Inline FB functions] in GHC.List
eftIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> r
eftIntFB c n x0 y | isTrue# (x0 ># y) = n
                  | otherwise         = go x0
                 where
                   go x = I# x `c` if isTrue# (x ==# y)
                                   then n
                                   else go (x +# 1#)
                        -- Watch out for y=maxBound; hence ==, not >
        -- Be very careful not to have more than one "c"
        -- so that when eftInfFB is inlined we can inline
        -- whatever is bound to "c"
Last edited 5 months ago by nomeata

comment:52 Changed 5 months ago by sgraf

This definition of efdtIntUpFB has only a single occurrence of c and n, and consequently fixes the issue.

But this probably doesn't have the same performance for the non-fused case. Also fingers crossed wrt. correctness.

data CounterState
  = More
  | Last
  | End

-- Requires x2 >= x1
{-# INLINE [0] efdtIntUpFB #-} -- See Note [Inline FB functions] in GHC.List
efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y =
   -- Be careful about overflow!
   let !first_state
         | isTrue# (y <# x2) = if isTrue# (y <# x1) then End else Last
         | otherwise         = More -- Common case: x1 <= x2 <= y

       !delta = x2 -# x1 -- >= 0
       !y' = y -# delta  -- x1 <= y' <= y; hence y' is representable

       next_state End  _     = End
       next_state Last _     = End
       next_state More x
         | isTrue# (x ># y') = Last
         | otherwise         = More

       -- Invariant: x <= y
       -- Note that: z <= y' => z + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
       emit End _ = n
       emit st  x
         | let x' = x +# delta
         = I# x `c` emit (next_state st x') x'

   in emit first_state x1
Last edited 5 months ago by sgraf

comment:53 Changed 5 months ago by sgraf

@nomeata: Ah, right before my nose. Very reassuring.

comment:54 Changed 5 months ago by simonpj

How about this (untested), which seems a lot simpler:

efdtIntUpFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntUpFB c n x1 x2 y    -- Be careful about overflow!
 | isTrue# (y <# x1)
 = n
 | otherwise
 = go_up x1    -- Common case: x1 <= y
 where
   !delta = x2 -# x1   -- >= 0
   !y' = y -# delta    -- x1 <= y' <= y; hence y' is representable

       -- Invariant: x <= y
       -- Note that: x <= y' => x + delta won't overflow
       -- so we are guaranteed not to overflow if/when we recurse
   go_up x = I# x `c` if isTrue# (x ># y')
                      then n
                      else go_up (x +# delta)

comment:55 Changed 5 months ago by sgraf

I just ran some quick quickCheck $ withMaxSuccess 100000 $ \i j k -> take 1000 [i,j..k] == take 1000 (efdtInt (unI# i) (unI# j) (unI# k)) tests and both versions pass. Given that simonpj's is much more to the point, let's run with that one! Although the duplicate n has potential to cause pain... But it's also there in eftIntFB, so it's probably fine.

Edit: Well, either the search space is too huge for QC or fusion didn't kick in. Either way, Simon's code has an underflow weakness: e.g. [minBound,minBound+2..minBound+1] == [minBound], but in the new variant y' = minBound+1-2 will underflow. I fixed this in the diff I prepared with an additional flag which has to be checked on every iteration.
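To spell out the edge case (a GHCi-style illustration; the wrapped value is what precomputing y' produces):

-- Reference semantics:
--   λ> [minBound :: Int, minBound + 2 .. minBound + 1]
--   [-9223372036854775808]
-- i.e. [minBound]: delta = 2, and minBound + 2 already overshoots y.
-- But precomputing y' = y - delta evaluates (minBound + 1) - 2, which
-- wraps around to maxBound, so the termination test isTrue# (x ># y')
-- never succeeds and the loop keeps emitting.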

Last edited 5 months ago by sgraf

comment:56 Changed 5 months ago by sgraf

Nofib suggests that this regresses allocations in integer by 6.0% and counted instructions by 0.1%. I had a look at the simplified Core and it seems that it's entirely due to the new definition, although I'm not sure where exactly this allocates more. Maybe it's due to an increase in the closure size of go_up because of the single flag? Here's the Core diff and the new definition of efdtIntUpFB for reference.

It seems that c is still not inlined, even with the new definition. I assume that's because there are multiple occurrences of c, which were probably duplicated before the inliner had a chance to inline the argument c. It would have been better to introduce a join point first.

Maybe loopification helps here? Indeed #14068 suggests that something beneficial happens, maybe more so with this patch.

Or we could introduce some kind of annotation mechanism to tell GHC to be careful not to duplicate occurrences of certain parameters that occur once (f {-# HUGE #-} c n = ...).

comment:57 Changed 4 months ago by simonpj

I really don't want to add new annotations!

Would it be possible to distil out a small example of what goes wrong with integer? It may not be fixable, but it's always worth a try.

I suppose the gain here is important... it's a fairly common case.

comment:58 Changed 4 months ago by sgraf

I dug through -ddump-simpl-iterations and noticed that the duplication happens through efdtIntFB's unfolding, which mentions efdtIntUpFB and efdtIntDnFB, both of which want to inline their c later on.

efdtIntFB
  = \ (@ r_a3cz)
      (c_a2ho :: Int -> r_a3cz -> r_a3cz)
      (n_a2hp :: r_a3cz)
      (x1_a2hq :: Int#)
      (x2_a2hr :: Int#)
      (y_a2hs :: Int#) ->
      case >=# x2_a2hr x1_a2hq of {
        __DEFAULT ->
          efdtIntDnFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs;
        1# -> efdtIntUpFB @ r_a3cz c_a2ho n_a2hp x1_a2hq x2_a2hr y_a2hs
      }

I'll see what we can do about that tomorrow...

comment:59 Changed 4 months ago by sgraf

Note that this is not a problem for [x..y] (eftInt), because that doesn't need to consider counting down. It's not an issue for [x,y..] (efdInt) either, because although it calls efdtInt{Up,Dn} internally, it doesn't take part in fusion at all (is that an oversight or by design?).

comment:60 Changed 4 months ago by sgraf

Here's an implementation of efdtIntFB that fits our requirements:

direction :: Int# -> Int# -> Ordering
direction from to = compareInt# to from

opposed :: Ordering -> Ordering -> Bool
opposed LT GT = True
opposed GT LT = True
opposed _  _  = False

-- | Implements `enumFromThenTo @Int` as per section 6.3.4 of the Haskell2010
-- report:
-- https://www.haskell.org/onlinereport/haskell2010/haskellch6.html#dx13-131001.
efdtIntFB :: (Int -> r -> r) -> r -> Int# -> Int# -> Int# -> r
efdtIntFB c n x1 x2 y = emit first x1
  where
    -- We can safely emit the first element if an iteration
    -- doesn't move away from @y@. That's exactly the case when
    -- @dir_x2@ is not opposed to @dir_y@.
    !first  = not (opposed dir_x2 dir_y)
           && (dir_x2 /= EQ || dir_y /= LT) -- [1,1..0] == []
    !dir_x2 = direction x1 x2
    !dir_y  = direction x1 y
    -- We need the overflow flag in 'emit'.
    (# delta, delta_ovf #) = x2 `subIntC#` x1

    -- | Think of @emit :: Maybe Int -> [Int]@, only unboxed.
    -- If the argument is 'Nothing', we reached the end of the list.
    -- If the argument is 'Just', we emit an element, compute
    -- the next candidate, validate it and recurse.
    emit False _ = n
    emit True  x = I# x `c` emit next_ok next
      where
        -- Check that @next@ didn't move past @y@.
        -- Also, overflow is only allowed iff the computation for
        -- @delta@ overflowed.
        (# next, next_ovf #) = addIntC# x delta
        !next_ok =  isTrue# (next_ovf ==# delta_ovf)
                 && not (opposed (direction next y) dir_y)
                 -- TODO: evaluate strict && for branchless code
{-# INLINE[0] efdtIntFB #-}

Some pros:

  • I find this much easier to understand. No complicated invariants, etc.
  • No Up/Dn variants to maintain. Still, if the direction is statically known, constant folding and inlining will simplify stuff to the equivalent code.
  • As a result, no more duplication of c occurrences
  • Also no more duplication of n occurrences

Cons:

  • emit's closure is 4 words big (2 words bigger than the closure of the original go_up helper) in the most general form. It's unfortunate that we can't pack together dir_y and delta_ovf into a single word without confusing constant folding. This would need either some kind of constant propagation through bit fields (out of scope for GHC, I think) or a smarter closure allocation scheme that packs together non-pointer payloads.
  • We pay for the generalisation of Up/Dn variants by having to compare with dir_y all the time.
  • PrelRules lacks constant folding for the {add,sub}{Int,Word}C# primops, which I'll probably add now
Last edited 4 months ago by sgraf

comment:61 Changed 4 months ago by simonpj

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to late lambda lifting. In the Core reported in comment:44, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late lambda lifting does. And the result might be faster than the very careful code above, because it avoids the extra argument passing and case-testing the careful version has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?
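For context: late lambda lifting turns local functions with free variables into top-level functions that take those variables as extra arguments, trading closure allocation for argument passing. A schematic before/after (my illustration, not GHC's actual output):

{-# LANGUAGE BangPatterns #-}

-- Before: go captures k and n, so entering sumTo allocates a closure.
sumTo :: Int -> Int -> Int
sumTo k n = go 0 0
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + k + i) (i + 1)

-- After lifting: goTop is top-level and fully applied at every call
-- site, so no closure is allocated; the former free variables are now
-- plain arguments.
goTop :: Int -> Int -> Int -> Int -> Int
goTop k n !acc !i
  | i > n     = acc
  | otherwise = goTop k n (acc + k + i) (i + 1)

sumTo' :: Int -> Int -> Int
sumTo' k n = goTop k n 0 0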

comment:62 in reply to:  61 Changed 4 months ago by sgraf

Replying to simonpj:

Looks plausible to me, but needs a careful Note to explain the issues.

But before we go too far with this, I'd like to point to late lambda lifting. In the Core reported in comment:44, all we need to do is lambda-lift $wc and go_up to top level, and all will be well, I claim. And that is precisely what late lambda lifting does. And the result might be faster than the very careful code above, because it avoids the extra argument passing and case-testing the careful version has to do.

To me LLF is low-hanging fruit. There are promising results described on the wiki page, and the whole join-point business eliminates its principal shortcoming.

I wonder if, before going ahead with this somewhat-delicate efdtIntFB business, it might be fun to re-awaken LLF?

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

comment:63 Changed 4 months ago by simonpj

I'll give it a try. Without understanding all operational consequences of LLF, I'd still guess that making sure all cs inline would be more beneficial in this scenario.

In general inlining is good. But in the case on this thread, I think it's the non-allocation of a function closure that saves all the work, rather than any optimisations that happen after inlining c.

comment:64 Changed 4 months ago by sgraf

I rebased wip/llf onto a recent HEAD commit. You can find my progress here: https://github.com/sgraf812/ghc/tree/llf.

Currently, it doesn't pass the testsuite (even in default mode, which doesn't do any new lambda lifting), probably because I introduced some errors during the merge. I think we should continue the discussion in #9476.

comment:65 Changed 3 months ago by Ben Gamari <ben@…>

In bb338f2/ghc:

Algebraically simplify add/sub with carry/overflow

Previously, the `{add,sub}{Int,Word}C#` PrimOps weren't handled
in PrelRules (constant folding and algebraic simplification) at all.
This implements the necessary logic, so that using these primitives
isn't too punishing compared to their well-optimised, overflow-unaware
counterparts.

This is so that using these primitives in `enumFromThenTo @Int` can
be optimized by constant folding, reducing closure sizes.

Reviewers: bgamari, simonpj, hsyl20

Reviewed By: bgamari, simonpj

Subscribers: AndreasK, thomie, carter

GHC Trac Issues: #8763

Differential Revision: https://phabricator.haskell.org/D4605
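To illustrate what the new rules buy (my example, not code from the patch): with literal arguments, the carry/overflow primops can now be folded at compile time.

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int (..), addIntC#)

-- With constant folding in PrelRules, this whole case expression can be
-- reduced at compile time to (5, 0): result 5, overflow flag 0.
demo :: (Int, Int)
demo = case addIntC# 2# 3# of
  (# r, ovf #) -> (I# r, I# ovf)

main :: IO ()
main = print demo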

comment:66 Changed 2 months ago by bgamari

Milestone: 8.6.1 → 8.8.1

This won't be fixed in 8.6. Bumping to 8.8.

comment:67 Changed 3 weeks ago by George

As this is now 4 years old, does it make sense to open a new issue and close this one? My problem of 3 years ago is basically fixed. In any case, I have uploaded reprof.hs, a fixed version of repro.hs, modified per comment:49.

Last edited 3 weeks ago by George

Changed 3 weeks ago by George

Attachment: reprof.hs added

Fixed version of repro.hs, modified per comment:49
