Opened 4 years ago
Last modified 4 months ago
#6166 new bug
Performance regression in mwc-random since 7.0.x
Reported by: | bos | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | 8.0.1 |
Component: | Compiler | Version: | 7.4.2 |
Keywords: | Cc: | pho@…, dima@…, bgamari@… | |
Operating System: | Unknown/Multiple | Architecture: | x86_64 (amd64) |
Type of failure: | Runtime performance bug | Test Case: | |
Blocked By: | Blocking: | ||
Related Tickets: | Differential Rev(s): | ||
Wiki Page: |
Description (last modified by bgamari)
I've had a report that the performance of the mwc-random package has regressed seriously after upgrading from GHC 7.0 to 7.4. It turns out that 7.2 also has the regression.
Here's a sample program.
import qualified Data.Vector.Unboxed as U import qualified System.Random.MWC as R import System.Random.MWC.Distributions (standard) count = 1000 * 1000 fast gen = standard gen slow gen = standard gen >>= return -- Edit this to choose fast or slow. which gen = slow gen main = do gen <- R.create v <- U.replicateM count (which gen) print (U.last v)
With GHC 7.0.3 -O2, this runs in 0.294 sec, regardless of whether fast or slow is used.
Under 7.4, fast runs in 0.062 sec (a nice speedup!), but slow now takes 9.2 sec (yikes!).
Roman suggested compiling the slow version with -fno-state-hack, which brings performance back up to 0.062 sec.
Change History (18)
comment:1 Changed 4 years ago by simonmar
- difficulty set to Unknown
- Milestone set to 7.6.1
- Priority changed from normal to high
comment:2 Changed 4 years ago by PHO
- Cc pho@… added
comment:3 Changed 4 years ago by Dzhus
- Cc dima@… added
comment:4 Changed 4 years ago by bgamari
- Cc bgamari@… added
comment:5 Changed 4 years ago by Dzhus
Still present in 7.5.20120718
comment:6 Changed 3 years ago by igloo
- Milestone changed from 7.6.1 to 7.6.2
comment:7 Changed 2 years ago by Khudyakov
Bug is still present in 7.6.3. I've made a reduced test case with stripped-down standard inlined. Note that adding or removing return in main loop have no effect. Something interesting is going on with blocks. Replacing f with constant or removing cons all makes bug go away. Simplifying go function changes runtime drastically.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Control.Monad.Primitive (PrimMonad, PrimState) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (200*1000) $ standard g standard :: PrimMonad m => Gen (PrimState m) -> m Double {-# INLINE standard #-} standard gen = do u <- (subtract 1 . (*2)) `liftM` uniform gen ri <- uniform gen let i = fromIntegral ((ri :: Word32) .&. 127) bi = (I.!) blocks i return $! u * bi where blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v = 9.91256303526217e-3 r = 3.442619855899 f = exp (-0.5 * r * r) -- Replacing with constant make bug go away! -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:8 Changed 2 years ago by simonpj
Thank you for a stripped-down case. Can you explain exactly how to demonstrate the bug with this test program? Ie "Try X and program runs in 2s; make trivial change to Y and it takes 10s". Or whatever.
Does it need mwc-random in all its glory, or would it be possible to make a standalone test case?
Simon
comment:9 Changed 2 years ago by Khudyakov
I've slightly simplified test case. I've tried to replace call to uniform with mock function but to avail. It's certainly possible to add only relevant parts of mwc-random. Only small part is actually used
Test case is slow (~100x) version of program. It's quite fragile. Small changes can return program to normal performance. Known methods: replace definition of f with constant (marked as fast), float blocks to the top level.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (100*1000) $ standard g standard :: GenIO -> IO Double {-# INLINE standard #-} standard gen = do ri <- uniform gen return $! blocks ! fromIntegral ((ri :: Word32) .&. 127) where blocks :: I.Vector Double blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:10 Changed 2 years ago by Khudyakov
I've been able to remove all stuff from mwc-random. Here is test case. Again it's slow version.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (100*1000) (return $! standard) standard :: Double {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons r $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
Couple of observations
- Replacing f with constant restores run time to normal. AFAIR GHC cannot constant fold exp and similar functions. So it may matter
- Floating block to top level or removing I.cons restores run time too.
- Simplifying go function changes run time. Removing sqrt or log reduce rim time. It looks like blocks is reevaluated every time standard is evaluated.
comment:11 Changed 2 years ago by Khudyakov
Another simplification
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (200*1000) (return $! standard) standard :: Double -- Removing or replacing with NOINLINE returns perfomance to normal {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons 0.123 $ I.unfoldrN 130 go (T f) where go q@(T a) = Just (log (exp a), q) {-# NOINLINE blocks #-} r,f :: Double r = 3.442619855899 -- replacing f with constant return perfomance to normal -- f = 2.669629083880923e-3 f = exp (-0.5 * r * r) -- replacing data with newtype returns performance to normal data T = T {-# UNPACK #-} !Double
Problem is visible at the core level. Code is compiled down to the something similar to following pseudocode:
loop i = if i /= 0 then evaluate (blocks ! 0) >> loop (i-1) else return ()
blocks array is inlied despite being marked as NOINLINE and is evaluated on each iteration so performance is abysmal. When small chages to the program are made it's not inlined and evaluated only once.
comment:12 Changed 19 months ago by thoughtpolice
- Priority changed from high to normal
Lowering priority (these tickets are assigned to older versions, so they're getting bumped as they've been around for a while).
comment:13 Changed 19 months ago by thoughtpolice
- Milestone changed from 7.6.2 to 7.10.1
Moving to 7.10.1.
comment:14 Changed 14 months ago by thoughtpolice
- Milestone changed from 7.10.1 to 7.12.1
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:15 Changed 5 months ago by thoughtpolice
- Milestone changed from 7.12.1 to 8.0.1
Milestone renamed
comment:16 Changed 4 months ago by bgamari
- Description modified (diff)
comment:17 Changed 4 months ago by bgamari
- Priority changed from normal to high
This looks like a pretty serious performance regression, bumping priority.
comment:18 Changed 4 months ago by Khudyakov
Does same regression present in latest version of mwc-random? AFAIR performance regression appears because GHC sometimes chose to inline lookup table in where block. It's marked as NOINLINE. It thus was recalculated on each call and caused slowdown. I moved it at top level and problem disapeared.
We should look into this: -fno-state-hack should never make something faster; the whole point of the state hack is to improve performance.