Opened 22 months ago

Last modified 7 months ago

#6166 new bug

Performance regression in mwc-random since 7.0.x

Reported by: bos Owned by:
Priority: high Milestone: 7.6.2
Component: Compiler Version: 7.4.2
Keywords: Cc: pho@…, dima@…, bgamari@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

I've had a report that the performance of the mwc-random package has regressed seriously after upgrading from GHC 7.0 to 7.4. It turns out that 7.2 also has the regression.

Here's a sample program.

import qualified Data.Vector.Unboxed as U

import qualified System.Random.MWC as R
import System.Random.MWC.Distributions (standard)

count = 1000 * 1000

fast gen = standard gen

slow gen = standard gen >>= return

-- Edit this to choose fast or slow.
which gen = slow gen

main = do
  gen <- R.create
  v <- U.replicateM count (which gen)
  print (U.last v)

With GHC 7.0.3 -O2, this runs in 0.294 sec, regardless of whether fast or slow is used.

Under 7.4, fast runs in 0.062 sec (a nice speedup!), but slow now takes 9.2 sec (yikes!).

Roman suggested compiling the slow version with -fno-state-hack, which brings performance back up to 0.062 sec.

Change History (11)

comment:1 Changed 22 months ago by simonmar

  • Difficulty set to Unknown
  • Milestone set to 7.6.1
  • Priority changed from normal to high

We should look into this: -fno-state-hack should never make something faster; the whole point of the state hack is to improve performance.

comment:2 Changed 22 months ago by PHO

  • Cc pho@… added

comment:3 Changed 21 months ago by Dzhus

  • Cc dima@… added

comment:4 Changed 21 months ago by bgamari

  • Cc bgamari@… added

comment:5 Changed 21 months ago by Dzhus

Still present in 7.5.20120718

comment:6 Changed 19 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2

comment:7 Changed 7 months ago by Khudyakov

Bug is still present in 7.6.3. I've made a reduced test case with stripped-down standard inlined. Note taht adding or removing return in main loop have no effect. Something interesting is going on with blocks. Replacing f with constant or removing cons all makes bug go away. Simplifying go function changes runtime drastically.

{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Unboxed as I
import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Word
import Data.Bits
import Control.Monad
import System.Random.MWC


main :: IO ()
main = do
  g <- create
  replicateM_ (200*1000) $ standard g


standard :: PrimMonad m => Gen (PrimState m) -> m Double
{-# INLINE standard #-}
standard gen = do
  u  <- (subtract 1 . (*2)) `liftM` uniform gen
  ri <- uniform gen
  let i  = fromIntegral ((ri :: Word32) .&. 127)
      bi = (I.!) blocks i
  return $! u * bi
  where
    blocks = I.cons r           -- Removing cons 
           $ I.unfoldrN 130 go
           $ T r f
      where
        go (T b g)   = let !u = T h (exp (-0.5 * h * h))
                           h  = sqrt (-2 * log (v / b + g))
                       in Just (h, u)
    {-# NOINLINE blocks #-}
    v = 9.91256303526217e-3
    r = 3.442619855899
    f = exp (-0.5 * r * r) -- Replacing with constant make bug go away!


-- Unboxed 2-tuple
data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double

comment:8 Changed 7 months ago by simonpj

Thank you for a stripped-down case. Can you explain exactly how to demonstrate the bug with this test program? Ie "Try X and program runs in 2s; make trivial change to Y and it takes 10s". Or whatever.

Does it need mwc-random in all its glory, or would it be possible to make a standalone test case?

Simon

comment:9 Changed 7 months ago by Khudyakov

I've slightly simplified test case. I've tried to replace call to uniform with mock function but to avail. It's certainly possible to add only relevant parts of mwc-random. Only small part is actually used

Test case is slow (~100x) version of program. It's quite fragile. Small changes can return program to normal performance. Known methods: replace definition of f with constant (marked as fast), float blocks to the top level.

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as I
import           Data.Vector.Unboxed   ((!))
import Data.Word
import Data.Bits
import Control.Monad
import System.Random.MWC

main :: IO ()
main = do
  g <- create
  replicateM_ (100*1000) $ standard g


standard :: GenIO -> IO Double
{-# INLINE standard #-}
standard gen = do
  ri <- uniform gen
  return $! blocks ! fromIntegral ((ri :: Word32) .&. 127)
  where
    blocks :: I.Vector Double
    blocks = I.cons r           -- Removing cons 
           $ I.unfoldrN 130 go
           $ T r f
      where
        go (T b g)   = let !u = T h (exp (-0.5 * h * h))
                           h  = sqrt (-2 * log (v / b + g))
                       in Just (h, u)
    {-# NOINLINE blocks #-}

v,r,f :: Double
v = 9.91256303526217e-3
r = 3.442619855899
-- f = 2.669629083880923e-3  -- FAST
f = exp (-0.5 * r * r)    -- SLOW

-- Unboxed 2-tuple
data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double

comment:10 Changed 7 months ago by Khudyakov

I've been able to remove all stuff from mwc-random. Here is test case. Again it's slow version.

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as I
import           Data.Vector.Unboxed   ((!))
import Control.Monad


main :: IO ()
main = replicateM_ (100*1000) (return $! standard)

standard :: Double
{-# INLINE standard #-}
standard = blocks ! 0
  where
    blocks :: I.Vector Double
    blocks = I.cons r
           $ I.unfoldrN 130 go
           $ T r f
      where
        go (T b g)   = let !u = T h (exp (-0.5 * h * h))
                           h  = sqrt (-2 * log (v / b + g))
                       in Just (h, u)
    {-# NOINLINE blocks #-}

v,r,f :: Double
v = 9.91256303526217e-3
r = 3.442619855899
-- f = 2.669629083880923e-3  -- FAST
f = exp (-0.5 * r * r)    -- SLOW

-- Unboxed 2-tuple
data T = T {-# UNPACK #-} !Double {-# UNPACK #-} !Double

Couple of observations

  1. Replacing f with constant restores run time to normal. AFAIR GHC cannot constant fold exp and similar functions. So it may matter
  1. Floating block to top level or removing I.cons restores run time too.
  1. Simplifying go function changes run time. Removing sqrt or log reduce rim time. It looks like blocks is reevaluated every time standard is evaluated.

comment:11 Changed 7 months ago by Khudyakov

Another simplification

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as I
import           Data.Vector.Unboxed   ((!))
import Control.Monad

main :: IO ()
main = replicateM_ (200*1000) (return $! standard)

standard :: Double
-- Removing or replacing with NOINLINE returns perfomance to normal
{-# INLINE standard #-}
standard = blocks ! 0
  where
    blocks :: I.Vector Double
    blocks = I.cons 0.123
           $ I.unfoldrN 130 go (T f)
      where
        go q@(T a) = Just (log (exp a), q)
    {-# NOINLINE blocks #-}

r,f :: Double
r = 3.442619855899
-- replacing f with constant return perfomance to normal
-- f = 2.669629083880923e-3
f = exp (-0.5 * r * r)

-- replacing data with newtype returns performance to normal
data T = T {-# UNPACK #-} !Double

Problem is visible at the core level. Code is compiled down to the something similar to following pseudocode:

loop i = if i /= 0 
         then evaluate (blocks ! 0) >> loop (i-1) 
         else return ()

blocks array is inlied despite being marked as NOINLINE and is evaluated on each iteration so performance is abysmal. When small chages to the program are made it's not inlined and evaluated only once.

Note: See TracTickets for help on using tickets.