Opened 23 months ago
Last modified 7 months ago
#6166 new bug
Performance regression in mwc-random since 7.0.x
Reported by: | bos | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | 7.6.2 |
Component: | Compiler | Version: | 7.4.2 |
Keywords: | Cc: | pho@…, dima@…, bgamari@… | |
Operating System: | Unknown/Multiple | Architecture: | x86_64 (amd64) |
Type of failure: | Runtime performance bug | Difficulty: | Unknown |
Test Case: | Blocked By: | ||
Blocking: | Related Tickets: |
Description
I've had a report that the performance of the mwc-random package has regressed seriously after upgrading from GHC 7.0 to 7.4. It turns out that 7.2 also has the regression.
Here's a sample program.
import qualified Data.Vector.Unboxed as U import qualified System.Random.MWC as R import System.Random.MWC.Distributions (standard) count = 1000 * 1000 fast gen = standard gen slow gen = standard gen >>= return -- Edit this to choose fast or slow. which gen = slow gen main = do gen <- R.create v <- U.replicateM count (which gen) print (U.last v)
With GHC 7.0.3 -O2, this runs in 0.294 sec, regardless of whether fast or slow is used.
Under 7.4, fast runs in 0.062 sec (a nice speedup!), but slow now takes 9.2 sec (yikes!).
Roman suggested compiling the slow version with -fno-state-hack, which brings performance back up to 0.062 sec.
Change History (11)
comment:1 Changed 22 months ago by simonmar
- Difficulty set to Unknown
- Milestone set to 7.6.1
- Priority changed from normal to high
comment:2 Changed 22 months ago by PHO
- Cc pho@… added
comment:3 Changed 22 months ago by Dzhus
- Cc dima@… added
comment:4 Changed 22 months ago by bgamari
- Cc bgamari@… added
comment:5 Changed 21 months ago by Dzhus
Still present in 7.5.20120718
comment:6 Changed 20 months ago by igloo
- Milestone changed from 7.6.1 to 7.6.2
comment:7 Changed 7 months ago by Khudyakov
Bug is still present in 7.6.3. I've made a reduced test case with stripped-down standard inlined. Note taht adding or removing return in main loop have no effect. Something interesting is going on with blocks. Replacing f with constant or removing cons all makes bug go away. Simplifying go function changes runtime drastically.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Control.Monad.Primitive (PrimMonad, PrimState) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (200*1000) $ standard g standard :: PrimMonad m => Gen (PrimState m) -> m Double {-# INLINE standard #-} standard gen = do u <- (subtract 1 . (*2)) `liftM` uniform gen ri <- uniform gen let i = fromIntegral ((ri :: Word32) .&. 127) bi = (I.!) blocks i return $! u * bi where blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v = 9.91256303526217e-3 r = 3.442619855899 f = exp (-0.5 * r * r) -- Replacing with constant make bug go away! -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:8 Changed 7 months ago by simonpj
Thank you for a stripped-down case. Can you explain exactly how to demonstrate the bug with this test program? Ie "Try X and program runs in 2s; make trivial change to Y and it takes 10s". Or whatever.
Does it need mwc-random in all its glory, or would it be possible to make a standalone test case?
Simon
comment:9 Changed 7 months ago by Khudyakov
I've slightly simplified test case. I've tried to replace call to uniform with mock function but to avail. It's certainly possible to add only relevant parts of mwc-random. Only small part is actually used
Test case is slow (~100x) version of program. It's quite fragile. Small changes can return program to normal performance. Known methods: replace definition of f with constant (marked as fast), float blocks to the top level.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Data.Word import Data.Bits import Control.Monad import System.Random.MWC main :: IO () main = do g <- create replicateM_ (100*1000) $ standard g standard :: GenIO -> IO Double {-# INLINE standard #-} standard gen = do ri <- uniform gen return $! blocks ! fromIntegral ((ri :: Word32) .&. 127) where blocks :: I.Vector Double blocks = I.cons r -- Removing cons $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
comment:10 Changed 7 months ago by Khudyakov
I've been able to remove all stuff from mwc-random. Here is test case. Again it's slow version.
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (100*1000) (return $! standard) standard :: Double {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons r $ I.unfoldrN 130 go $ T r f where go (T b g) = let !u = T h (exp (-0.5 * h * h)) h = sqrt (-2 * log (v / b + g)) in Just (h, u) {-# NOINLINE blocks #-} v,r,f :: Double v = 9.91256303526217e-3 r = 3.442619855899 -- f = 2.669629083880923e-3 -- FAST f = exp (-0.5 * r * r) -- SLOW -- Unboxed 2-tuple data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double
Couple of observations
- Replacing f with constant restores run time to normal. AFAIR GHC cannot constant fold exp and similar functions. So it may matter
- Floating block to top level or removing I.cons restores run time too.
- Simplifying go function changes run time. Removing sqrt or log reduce rim time. It looks like blocks is reevaluated every time standard is evaluated.
comment:11 Changed 7 months ago by Khudyakov
Another simplification
{-# LANGUAGE BangPatterns #-} import qualified Data.Vector.Unboxed as I import Data.Vector.Unboxed ((!)) import Control.Monad main :: IO () main = replicateM_ (200*1000) (return $! standard) standard :: Double -- Removing or replacing with NOINLINE returns perfomance to normal {-# INLINE standard #-} standard = blocks ! 0 where blocks :: I.Vector Double blocks = I.cons 0.123 $ I.unfoldrN 130 go (T f) where go q@(T a) = Just (log (exp a), q) {-# NOINLINE blocks #-} r,f :: Double r = 3.442619855899 -- replacing f with constant return perfomance to normal -- f = 2.669629083880923e-3 f = exp (-0.5 * r * r) -- replacing data with newtype returns performance to normal data T = T {-# UNPACK #-} !Double
Problem is visible at the core level. Code is compiled down to the something similar to following pseudocode:
loop i = if i /= 0 then evaluate (blocks ! 0) >> loop (i-1) else return ()
blocks array is inlied despite being marked as NOINLINE and is evaluated on each iteration so performance is abysmal. When small chages to the program are made it's not inlined and evaluated only once.
We should look into this: -fno-state-hack should never make something faster; the whole point of the state hack is to improve performance.