Opened 3 years ago

Last modified 7 weeks ago

#8763 new bug

forM_ [1..N] does not get fused (10 times slower than go function)

Reported by: nh2 Owned by:
Priority: normal Milestone: 8.4.1
Component: Compiler Version: 7.6.3
Keywords: Cc: mail@…, pivo@…, george, akio
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

Apparently, idiomatic code like forM_ [1.._N] does not get fused away.

This can cause serious performance problems when it goes unnoticed.

-- Slow:
forM_ [0.._N-1] $ \i -> do ...

-- Around 10 times faster:
loop _N $ \i -> do ...

-- Note: the bang pattern on go's argument needs {-# LANGUAGE BangPatterns #-}.
{-# INLINE loop #-}
loop :: (Monad m) => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
  where
    go !n | n == bex  = return ()
          | otherwise = f n >> go (n + 1)

Full code example: https://gist.github.com/nh2/8905997 - the relevant alternatives are commented.
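The gist is not reproduced here; a minimal self-contained sketch of the comparison could look like the following (it accumulates into an IORef instead of the vector code in the gist, and the names sumForM_ and sumLoop are illustrative, not from the gist):

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad (forM_)
import Data.IORef

-- The hand-written counting loop from the ticket description.
loop :: Monad m => Int -> (Int -> m ()) -> m ()
loop bex f = go 0
  where
    go !n | n == bex  = return ()
          | otherwise = f n >> go (n + 1)
{-# INLINE loop #-}

-- The allegedly slow variant, iterating via a list:
sumForM_ :: Int -> IO Int
sumForM_ n = do
  ref <- newIORef 0
  forM_ [0 .. n - 1] $ \i -> modifyIORef' ref (+ i)
  readIORef ref

-- The same computation using the counting loop:
sumLoop :: Int -> IO Int
sumLoop n = do
  ref <- newIORef 0
  loop n $ \i -> modifyIORef' ref (+ i)
  readIORef ref
```

Both compute the same result; the ticket's claim is about how much intermediate list structure survives optimization, which can be inspected with -ddump-simpl.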

Attachments (1)

rep.hs (211 bytes) - added by akio 8 months ago.
Test case

Download all attachments as: .zip

Change History (33)

comment:1 Changed 3 years ago by nh2

People on #ghc think that it is because the [1.._N] gets floated out as a top-level constant expression:


nomeata: nh2: I’d consider that a bug
nomeata: I believe the problem is that [0..512] does not depend on any local values
nomeata: so it is floated out as a top-level value
nomeata: and there it is not matched by any rules
thoughtpolice: let floating strikes again
thoughtpolice: (well, floating-out being a non-optimization, anyway.)
Fuuzetsu: does this mean that if I use [0 .. 512] twice in a program in different places, it will only be computed once?
hvr: Fuuzetsu: there's a chance, yes
Fuuzetsu: neat
thoughtpolice: well, not so neat. in cases like this you really don't want to float out some things, because it hurts later opportunities to optimize sometimes (e.g. float out a binding that otherwise would have triggered a RULE or fusion, perhaps)
thoughtpolice: unfortunately floating like this is one of the harder things to 'fight against' when you don't want it, from what i've seen.

comment:2 Changed 3 years ago by darchon

I believe let-floating happens due to the full-laziness transform. Do you get better performance with -fno-full-laziness ?

comment:3 Changed 3 years ago by nomeata

A comment in SetLevels (which I just came across) indicates that this problem should have been taken care of:

	  -- We are keen to float something to the top level, even if it does not
	  -- escape a lambda, because then it needs no allocation.  But it's controlled
	  -- by a flag, because doing this too early loses opportunities for RULES
	  -- which (needless to say) are important in some nofib programs
	  -- (gcd is an example).

So either my assumption is wrong, or this does not work as desired.

comment:4 Changed 3 years ago by nomeata

It turns out that this comment is obsolete; the flag is never set. I quote from SimplCore

                -- Was: gentleFloatOutSwitches
                --
                -- I have no idea why, but not floating constants to
                -- top level is very bad in some cases.
                --
                -- Notably: p_ident in spectral/rewrite
                --          Changing from "gentle" to "constantsOnly"
                --          improved rewrite's allocation by 19%, and
                --          made 0.0% difference to any other nofib
                --          benchmark

This comment was introduced in eaeca51efc0be3ff865c4530137bfbe9f8553549 (2009) by SPJ.

Maybe rules matching should look through unfoldings more easily (at the risk of losing sharing)? There is no point in worrying about sharing [0..N] in a rule application whose purpose is to eliminate that list.

comment:5 Changed 3 years ago by nh2

@nomeata

Regarding your suspicion that it gets floated out as a constant: I don't see an improvement when the upper bound m of [1..m] is obtained from an IO action. What do you think?
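For reference, the shape of that experiment is roughly the following (a sketch; sumTo is a made-up name, and in the actual test the bound came from an IO action):

```haskell
import Control.Monad (forM_)
import Data.IORef

-- Sketch: m is only known at run time here (in the experiment it was
-- obtained from an IO action), so [1 .. m] is not a compile-time
-- constant, yet nh2 reports no improvement over the constant case.
sumTo :: Int -> IO Int
sumTo m = do
  ref <- newIORef (0 :: Int)
  forM_ [1 .. m] $ \i -> modifyIORef' ref (+ i)
  readIORef ref
```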

comment:6 Changed 3 years ago by nomeata

It might still get floated out of some local function. Have you tried coming up with a minimal example?

comment:7 Changed 3 years ago by nh2

@nomeata

There is an example I made for this, mentioned in the bug description.

The performance I measure for that is:

  • using forM_ with ghc -O: 2.0 s
  • using loop with ghc -O: 1.6 s
  • using forM_ with ghc -O2: 0.9 s
  • using loop with ghc -O2: 0.3 s
  • using forM_ with ghc -O2 -fllvm: 0.75 s
  • using loop with ghc -O2 -fllvm: 0.15 s

I tried to make an even smaller benchmark (https://gist.github.com/11333427) but there the performance is identical, even though I change the same thing as before.

Could you try my two benchmarks and see if you get the same behaviour?

comment:8 Changed 3 years ago by nh2

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

(Note that measuring the simple example with -fllvm is pointless because it can optimize it away completely. It can't do that with the matmult though.)

Last edited 3 years ago by nh2 (previous) (diff)

comment:9 in reply to:  8 ; Changed 3 years ago by daniel.is.fischer

Replying to nh2:

I have updated the gist at https://gist.github.com/nh2/11333427 to contain both the matmult example (where the difference between forM_ and loop is big) and the simple example (where no difference can be measured).

The simple example doesn't use the same list in different places, so GHC is capable of eliminating it and giving you a loop on unboxed Int#s, at least with -O2. In the matmult example, you need to conceal from GHC the fact that both lists are the same in order to get a loop on unboxed Int#s.

So in principle GHC can do the desired thing; the sharing just gets in the way. Can somebody think of a situation where sharing is beneficial for forM_ [a .. b] $ \n -> do ... code? If not, perhaps special-casing enumFromTo arguments to forM_ etc., at least for standard integer types, is something to be considered.
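One way to read that suggestion: a rewrite rule could replace forM_ over an Int enumeration with a direct counting loop. A rough sketch follows; forMRange_ and the rule name are hypothetical, and in practice such a rule would need careful phasing against forM_'s own inlining:

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Monad (forM_)
import Data.IORef

-- Hypothetical helper that forM_ over a range could rewrite to:
forMRange_ :: Monad m => Int -> Int -> (Int -> m ()) -> m ()
forMRange_ lo hi f = go lo
  where
    go !i | i > hi    = return ()
          | otherwise = f i >> go (i + 1)
{-# INLINE forMRange_ #-}

-- Sketch of the special-casing; a real rule would have to fire
-- before forM_ itself is inlined:
{-# RULES
"forM_/enumFromTo" forall a b (f :: Int -> IO ()).
  forM_ (enumFromTo a b) f = forMRange_ a b f
  #-}

-- Small sanity check that the loop covers the closed range [a, b]:
sumRange :: Int -> Int -> IO Int
sumRange a b = do
  ref <- newIORef 0
  forMRange_ a b (\i -> modifyIORef' ref (+ i))
  readIORef ref
```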

comment:10 Changed 3 years ago by nh2

Preventing it from sharing sounds sensible to me: if the [a .. b] were something expensive to compute (a list of primes, say), I suspect any sane person would easily share it manually by declaring it top-level.
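For instance (a toy illustration, not from the ticket), an expensive list can be shared deliberately by giving it a top-level binding:

```haskell
-- Toy example of deliberate top-level sharing: an expensive list bound
-- once at the top level, by explicit programmer choice. Naive
-- trial-division sieve, for illustration only.
primes :: [Int]
primes = sieve [2 ..]
  where
    sieve (p : xs) = p : sieve [x | x <- xs, x `mod` p /= 0]

-- Every consumer now reuses the same shared list:
sumOfFirstPrimes :: Int -> Int
sumOfFirstPrimes n = sum (take n primes)
```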

comment:11 in reply to:  9 Changed 3 years ago by nh2

Replying to daniel.is.fischer:

The simple example doesn't use the same list in different places

Unfortunately, I still get no difference in performance even if I duplicate the loops to

forM_ [1.._MAX] $ \i -> do
    UM.unsafeWrite v 0 i

forM_ [1.._MAX] $ \i -> do
    UM.unsafeWrite v 0 i

Any idea?

comment:12 Changed 3 years ago by pivo

Cc: pivo@… added

comment:13 Changed 3 years ago by simonpj

I've got lost in this thread. If someone thinks there is a real bug here, in GHC 7.8, could you re-articulate what you think it is?

Simon

comment:14 Changed 3 years ago by nh2

@simonpj: Summary:

1) I reported that my manually written loop is much faster than forM_ [1..n] in some cases, suggesting that optimizing the list away doesn't always work well.

2) nomeata said some technical things that are a bit beyond me.

3) I submitted two benchmarks in the gist at https://gist.github.com/nh2/11333427: a "matmult" benchmark where there is a big difference between forM_ and the hand-written loop, and a "simple" benchmark where they are equally fast.

4) Daniel suspects that the slow case comes from using the same syntactic list twice, and that in this case GHC floats it out to share it, which prevents it from being eliminated. He suggests special-casing enumFromTo when used with forM_ to prevent this.

5) I gave a counterexample to his suspicion by changing my "simple" benchmark: using the same list twice gives the same good performance as using it once.

I get the same behaviour for 7.6 and 7.8.

comment:15 Changed 3 years ago by daniel.is.fischer

Slight correction, @Niklas: it's not a suspicion. That the list is floated out to the top level, and consequently used for looping instead of unboxed Int#s, can be read directly from the Core (-ddump-simpl is your friend in all matters related to performance). The question is under which exact circumstances GHC floats the list out. To answer that, you need somebody who knows how GHC works.

comment:16 Changed 3 years ago by simonpj

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

I'm not using criterion; just using +RTS -s, with _SIZE = 300.

So I can't reproduce the problem.

Simon

comment:17 in reply to:  16 Changed 3 years ago by daniel.is.fischer

Replying to simonpj:

I've tried with 7.8.2. I get the same performance for matmultForM_ and matMultLoop.

Are you compiling with -O? You need -O2 for a significant difference to appear (with only -O, both versions are slow).

comment:18 Changed 2 years ago by George

With 7.10.1, compiling with -O2, I see a factor of 2 difference in performance:

benchmarking matmultForM_
time                 10.90 μs   (10.89 μs .. 10.91 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.89 μs   (10.89 μs .. 10.91 μs)
std dev              32.72 ns   (18.98 ns .. 65.42 ns)

benchmarking matmultLoop
time                 5.404 μs   (5.387 μs .. 5.419 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.409 μs   (5.398 μs .. 5.420 μs)
std dev              37.99 ns   (33.64 ns .. 44.26 ns)

with -fllvm it is a factor of almost 3

benchmarking matmultForM_
time                 9.796 μs   (9.788 μs .. 9.806 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.802 μs   (9.795 μs .. 9.815 μs)
std dev              31.64 ns   (21.92 ns .. 49.61 ns)

benchmarking matmultLoop
time                 3.499 μs   (3.497 μs .. 3.502 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.499 μs   (3.497 μs .. 3.503 μs)
std dev              10.27 ns   (7.068 ns .. 16.47 ns)
Last edited 2 years ago by simonpj (previous) (diff)

comment:19 Changed 2 years ago by George

Milestone: 7.12.1

comment:20 Changed 2 years ago by George

Cc: george added

comment:21 Changed 21 months ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:22 Changed 17 months ago by George

I'm seeing the same issue and performance on 7.10.3. This is on MacOS, but I believe the problem occurs on multiple platforms.

This is with _SIZE = 10. Even if I replace

forM_ [0.._SIZE-1]

with

forM_ [0..10-1]

I see the same problem, so I don't think it has to do with top-level constant expressions.

Last edited 17 months ago by George (previous) (diff)

comment:23 Changed 16 months ago by thomie

Milestone: 8.0.1

comment:24 Changed 9 months ago by George

Milestone: 8.0.2

I'm still seeing the same problem on 8.0.1. Any chance of this getting fixed in 8.0.2? If not 8.0.2, then I hope 8.1.

Last edited 9 months ago by George (previous) (diff)

comment:25 Changed 8 months ago by bgamari

Milestone: 8.0.2 → 8.2.1

I'm afraid 8.0.2 is out of the question. However, 8.2.1 is a possibility. Of course, any help that you, the reader, can offer would expedite the process.

comment:26 Changed 8 months ago by akio

Cc: akio added

Changed 8 months ago by akio

Attachment: rep.hs added

Test case

comment:27 Changed 8 months ago by akio

Here is a test case that does not require criterion. With ghc-8.0.1 -O2 -ddump-simpl, you can see that foo is compiled into code that uses a top-level CAF containing GHC.Enum.eftInt 0# 799#.
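The attachment itself is not inlined in this ticket; a minimal program in the same spirit (a guess at its shape, not the actual rep.hs) would be:

```haskell
import Control.Monad (forM_)
import Data.IORef

-- Guessed reconstruction of the attached test case: foo iterates over
-- the constant range [0 .. 799]; per the comment above, ghc-8.0.1 -O2
-- compiles the list to a top-level CAF (GHC.Enum.eftInt 0# 799#)
-- instead of fusing it away.
foo :: IORef Int -> IO ()
foo ref = forM_ [0 .. 799] $ \i -> modifyIORef' ref (+ i)
```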

comment:28 Changed 8 months ago by bgamari

Description: modified (diff)

comment:29 Changed 8 months ago by akio

It looks like this is the same issue as described in #7206. The branch mentioned in https://ghc.haskell.org/trac/ghc/ticket/7206/#comment:13 fixes my test case.

comment:30 Changed 8 months ago by simonpj

Good to know that's it.

The program in rep.hs is rather unusual in that it has a constant list, not dependent on input data. Real programs usually depend on input data, so this problem may be less of a problem in practice than in benchmarks.

It would be great if someone felt able to dig into #7206 and figure out why it's not a straight win.

Simon

comment:31 in reply to:  30 Changed 8 months ago by nh2

Replying to simonpj:

Real programs usually depend on input data. So this problem may be less of a problem in practice than in benchmarks.

Just wanted to chime in that I found this problem in a real-world program where I looped over all 32-bit integers. Similarly, known-at-compile-time iteration lists may appear in the inner loops of algorithms or data structures (like B-trees with fixed B, or image convolution kernels of fixed size, e.g. 3x3 image templates).

comment:32 Changed 7 weeks ago by bgamari

Milestone: 8.2.1 → 8.4.1

Given that 8.2.1-rc1 is imminent, I'm bumping these off to the 8.4 milestone.

Note: See TracTickets for help on using tickets.