Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#7429 closed bug (wontfix)

Unexplained performance boost with +RTS -h

Reported by: simonmar
Owned by:
Priority: normal
Milestone: 7.8.1
Component: Compiler
Version: 7.6.1
Keywords:
Cc: pho@…, tomberek@…
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Runtime performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:

Description

In #5505 the following program was reported:

{-# LANGUAGE 
        RankNTypes
        ,RecursiveDo
        ,TypeFamilies
        ,GeneralizedNewtypeDeriving
  #-}

import System.TimeIt
import qualified Data.Vector as V
import Control.Monad.State.Strict
import Control.Monad.ST.Strict
import Data.STRef

-- STO Monad is just StateT with an Int stacked on top of a ST monad.
newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
    deriving (Functor, Monad, MonadFix)

runSTO :: (forall s. STO s a) -> a
runSTO x = runST (evalStateT (unSTO x) 0)

data CircList s = ConsM {value :: {-# UNPACK #-} !Int
                        ,cons  :: {-# UNPACK #-} !(STRef s (CircList s )) }

-- | Defines a circular list of length ns
clist ns = do
    rec allItems <-
            let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
            in V.generateM ns next
    return $ (V.!) allItems 0
    where nextCons v n = do
                        n' <- switchSTorSTO  $ newSTRef $ n
                        return $ ConsM v n'
                        
-- | Take one step in the CircList
oneStep (ConsM v c) = switchSTorSTO $ readSTRef c

-- | I tie a circular list of size 100 and step through it n times.
main :: IO ()
main = do 
    let n = 333222111 :: Int
    timeIt . print $ runSTO ( clist 1001 >>= liftM value . iterateM n oneStep)
    --timeIt . print $ runST ( clist 1001 >>= liftM value . iterateM n oneStep)
--  ****************************  TO SWITCH TO ST: switch between the two sets of definitions above and below this line
--switchSTorSTO = id
switchSTorSTO  = STO . lift

iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b
iterateM n f c = go n c where 
    go 0 b = return b
    go ns b = f b >>= go (ns-1)

On my system, when compiled for profiling, this program goes more than twice as fast with +RTS -h. Here is some output from perf stat for both versions:

> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505        
222
CPU time:   5.34s

 Performance counter stats for './5505':

    10,676,916,633 cycles                    #    0.000 GHz                     [75.00%]
    11,023,609,154 instructions              #    1.03  insns per cycle        
                                             #    0.60  stalled cycles per insn [75.01%]
       462,115,593 L1-dcache-load-misses                                        [75.01%]
     6,661,087,690 stalled-cycles-backend    #   62.39% backend  cycles idle    [75.01%]

       5.346786849 seconds time elapsed

and for the fast version:

> perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 +RTS -h
222
CPU time:   2.38s

 Performance counter stats for './5505 +RTS -h':

     4,766,694,528 cycles                    #    0.000 GHz                     [75.02%]
    10,691,687,462 instructions              #    2.24  insns per cycle        
                                             #    0.06  stalled cycles per insn [75.02%]
        18,763,412 L1-dcache-load-misses                                        [75.02%]
       673,890,507 stalled-cycles-backend    #   14.14% backend  cycles idle    [74.99%]

       2.399953273 seconds time elapsed

There's a huge difference in the number of L1-dcache misses, and the slow version is stalled for 62% of the time.

Why is this? I'm not exactly sure. The program doesn't allocate anything except at the beginning. The effect of -h is to cause a major GC to happen every 0.1s or so, which copies some data, which should tend to cause more cache misses, rather than fewer.

I don't know what's going on, so I'm leaving the details in this ticket for now.
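One experiment that might separate the GC-scheduling effect from -h itself is to force a major collection every 0.1s from a background thread, with no heap profiling enabled. This is a sketch of that idea, not from the ticket; the stand-in workload is hypothetical and would be replaced by the benchmark above (compile with -threaded so the timer thread actually runs concurrently):

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import System.Mem (performGC)

-- Emulate -h's periodic major collections: force one every 100 ms
-- while the main computation runs, without enabling heap profiling.
main :: IO ()
main = do
  _ <- forkIO . forever $ threadDelay 100000 >> performGC
  print (sum [1 .. 10000000 :: Int])  -- stand-in workload, not the benchmark
```

If the speedup reappears under this scheme, that would point at the GC's data movement rather than anything specific to the profiler.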

Attachments (3)

Main.hs (1.5 KB) - added by tomberek 6 years ago.
Smaller version of the same thing
Main.2.hs (2.0 KB) - added by tomberek 6 years ago.
Removed reliance on mtl
Main2.hs (3.4 KB) - added by tomberek 6 years ago.
Added performGC and comments


Change History (14)

Changed 6 years ago by tomberek

Attachment: Main.hs added

Smaller version of the same thing

comment:1 Changed 6 years ago by PHO

Cc: pho@… added

comment:2 in reply to:  description Changed 6 years ago by tomberek

Replying to simonmar:

{-# LANGUAGE 
        RankNTypes
        ,RecursiveDo
        ,TypeFamilies
        ,GeneralizedNewtypeDeriving
  #-}

import System.TimeIt
import Control.Monad.State.Strict
import Control.Monad.ST.Strict
import Data.STRef

-- STO Monad is just StateT with an Int stacked on top of a ST monad.
newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
    deriving (Functor, Monad)

runSTO :: (forall s. STO s a) -> a
runSTO x = runST (evalStateT (unSTO x) 0)

data CircList s = ConsM {value :: {-# UNPACK #-} !Int
                        ,cons  :: {-# UNPACK #-} !(STRef s (CircList s )) }

-- | Defines a circular list of length 2
twoItemList :: forall s. ST s (CircList s)
twoItemList = do
    rec item1 <- liftM (ConsM 0) $ newSTRef item2
        item2 <- liftM (ConsM 1) $ newSTRef item1
    return item1

-- | I tie a circular list of size 2 and step through it n times.
main :: IO ()
main = do 
    let n = 633222111 :: Int
    let takeOneStep = switchSTorSTO . readSTRef . cons
    
    runLine $ switchSTorSTO twoItemList  >>= liftM value . iterateM n takeOneStep

--switchSTorSTO = id
--runLine = timeIt . print . runST 
--  ****************************  TO SWITCH TO ST: switch between the two sets of definitions above and below this line
switchSTorSTO  = STO . lift
runLine = timeIt . print . runSTO

-- I find myself writing this function a lot, did I miss some basic Monad helper?
iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b
iterateM n f c = go n c where 
    go 0 b = return b
    go ns b = f b >>= go (ns-1)
    

Here is a simpler version (also attached as Main.hs) that uses a circular list of length two. The odd thing is that it runs in around 3 seconds when I use only the ST monad, or STO without profiling, but STO with profiling is FASTER, at 2.6 seconds. The two versions have the same GC stats as before. So why not just use profiling ALL the time? I don't know whether this difference shows up in large, project-sized uses of ST, but a monad transformer on top of it, combined with profiling, makes it go faster.

The STO version produces this core, which looks pretty good to me:

Rec {
Main.$wa [Occ=LoopBreaker]
  :: forall s_XF2.
     GHC.Prim.Int#
     -> Main.CircList s_XF2
     -> GHC.Types.Int
     -> GHC.Prim.State# s_XF2
     -> (# GHC.Prim.State# s_XF2,
           (Main.CircList s_XF2, GHC.Types.Int) #)
[GblId, Arity=4, Caf=NoCafRefs, Str=DmdType LLLL]
Main.$wa =
  \ (@ s_XF2)
    (ww_s1na :: GHC.Prim.Int#)
    (w_s1nc :: Main.CircList s_XF2)
    (w1_s1nd :: GHC.Types.Int)
    (w2_s1ne :: GHC.Prim.State# s_XF2) ->
    case ww_s1na of ds_XFh {
      __DEFAULT ->
        case w_s1nc of _ { Main.ConsM rb_dFC rb1_dFD ->
        case GHC.Prim.readMutVar#
               @ s_XF2 @ (Main.CircList s_XF2) rb1_dFD w2_s1ne
        of _ { (# ipv_aJ9, ipv1_aJa #) ->
        Main.$wa @ s_XF2 (GHC.Prim.-# ds_XFh 1) ipv1_aJa w1_s1nd ipv_aJ9
        }
        };
      0 -> (# w2_s1ne, (w_s1nc, w1_s1nd) #)
    }
end Rec }

The effect disappears when you switch to using just ST, but the core looks almost identical other than the threading of the original Int in the StateT. Here is the core produced for the ST version:

Rec {
Main.$wpoly_a [Occ=LoopBreaker]
  :: forall s_aDw.
     GHC.Prim.Int#
     -> Main.CircList s_aDw
     -> GHC.Prim.State# s_aDw
     -> (# GHC.Prim.State# s_aDw, Main.CircList s_aDw #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.$wpoly_a =
  \ (@ s_aDw)
    (ww_s1la :: GHC.Prim.Int#)
    (w_s1lc :: Main.CircList s_aDw)
    (w1_s1ld :: GHC.Prim.State# s_aDw) ->
    case ww_s1la of ds_XFl {
      __DEFAULT ->
        case w_s1lc of _ { Main.ConsM rb_dFI rb1_dFJ ->
        case GHC.Prim.readMutVar#
               @ s_aDw @ (Main.CircList s_aDw) rb1_dFJ w1_s1ld
        of _ { (# ipv_aIP, ipv1_aIQ #) ->
        Main.$wpoly_a @ s_aDw (GHC.Prim.-# ds_XFl 1) ipv1_aIQ ipv_aIP
        }
        };
      0 -> (# w1_s1ld, w_s1lc #)
    }
end Rec }

Changed 6 years ago by tomberek

Attachment: Main.2.hs added

Removed reliance on mtl

comment:3 Changed 6 years ago by tomberek

Some code golf at #haskell produced a new version of iterateM:

iterateM2 :: Monad m => Int -> (b -> m b) -> b -> m b
iterateM2 = (fmap.fmap) (foldr (>=>) return) replicate

iterateM2 produces slightly different Core:

Rec {
Main.main6 [Occ=LoopBreaker]
  :: forall s_atH.
     GHC.Prim.Int#
     -> Main.CircList s_atH
     -> GHC.Types.Int
     -> GHC.Prim.State# s_atH
     -> (# GHC.Prim.State# s_atH,
           (Main.CircList s_atH, GHC.Types.Int) #)
[GblId, Arity=4, Caf=NoCafRefs, Str=DmdType LSLL]
Main.main6 =
  \ (@ s_atH)
    (m_a16D :: GHC.Prim.Int#)
    (eta_B3 :: Main.CircList s_atH)
    (eta1_X14 :: GHC.Types.Int)
    (eta2_X15 :: GHC.Prim.State# s_atH) ->
    case GHC.Prim.<=# m_a16D 1 of _ {
      GHC.Types.False ->
        case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 ->
        case GHC.Prim.readMutVar#
               @ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15
        of _ { (# ipv_aEY, ipv1_aEZ #) ->
        Main.main6 @ s_atH (GHC.Prim.-# m_a16D 1) ipv1_aEZ eta1_X14 ipv_aEY
        }
        };
      GHC.Types.True ->
        case eta_B3 of _ { Main.ConsM rb_dE3 rb1_dE4 ->
        case GHC.Prim.readMutVar#
               @ s_atH @ (Main.CircList s_atH) rb1_dE4 eta2_X15
        of _ { (# ipv_aEY, ipv1_aEZ #) ->
        (# ipv_aEY, (ipv1_aEZ, eta1_X14) #)
        }
        }
    }
end Rec }

This destroys the effect, i.e. it only produces the "slow" version. The original iterateM, with StateT over ST and with +RTS -s -h, is the only combination that produces the "fast" version.
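As an aside, the pointfree iterateM2 above unfolds to an n-fold Kleisli composition of f. A spelled-out equivalent may make that clearer (iterateM2' is a name introduced here for illustration, not from the ticket):

```haskell
import Control.Monad ((>=>))

-- Equivalent to iterateM2: build n copies of f and chain them with
-- Kleisli composition, so iterateM2' n f = f >=> f >=> ... >=> return.
iterateM2' :: Monad m => Int -> (b -> m b) -> b -> m b
iterateM2' n f = foldr (>=>) return (replicate n f)

main :: IO ()
main = print =<< iterateM2' 4 (\x -> return (x + 1)) (0 :: Int)  -- prints 4
```

Unlike the original iterateM, this builds the composed action before running it, which plausibly accounts for the different Core.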

And to clarify the effect:

  1. I compile with -O2 -rtsopts.
  2. I run with +RTS -s
  3. Without recompiling or anything else I run +RTS -s -h

Run #3 is faster than #2 or ANYTHING else, including runs with no RTS options and builds without -rtsopts. The effect persists with -O1; without any -Ox flag I end up thrashing the system.

comment:4 Changed 6 years ago by simonpj

Does -fno-state-hack make any difference?

comment:5 in reply to:  4 Changed 6 years ago by tomberek

Replying to simonpj:

Does -fno-state-hack make any difference?

No effect.

comment:6 Changed 6 years ago by simonmar

I suspect it's a weird caching effect. The program creates a small data structure and then repeatedly traverses it (if I'm understanding correctly). After a GC the program performs much better, which suggests that the GC has rearranged the data in memory in a more cache-friendly way (not deliberately, I should point out). Or perhaps the data rearrangement has avoided some cache conflicts, but that seems unlikely - L1 caches are usually very associative.

You should be able to get the same effect by adding a performGC after allocating the data structure but before traversing it.
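A minimal sketch of triggering that GC from inside ST, assuming unsafeIOToST from Control.Monad.ST.Unsafe is acceptable here (gcInST is a hypothetical helper name; the unsafety is arguably benign because performGC changes only heap layout, not the computation's result):

```haskell
import Control.Monad.ST (ST, runST)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import System.Mem (performGC)

-- Force a major GC from within an ST computation, e.g. after building
-- the circular list but before traversing it.
gcInST :: ST s ()
gcInST = unsafeIOToST performGC
```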

comment:7 Changed 6 years ago by tomberek

Cc: tomberek@… added

I'm not sure how to insert an IO () into the ST/STO monad: either it sounds "unsafe" or I have to lift everything to IO. I tried using:

lift $ unsafeIOToST performGC

in Main2.hs

This fixes the problem when the circular list is of size 2 (using function twoItemList). The problem creeps back in when using threeItemList or when the list is produced by nlist. Perhaps my generation of the circular list is suspect. Perhaps too lazy? The inner loop in Core is unchanged from previous versions.

Regarding "small data structure": This effect is actually more pronounced the larger the circular list is. I've tested this up to n = 2 million and it seems that the GC overhead of profiling catches up around 1 million, but the non-GC time difference becomes very pronounced.

So the "GC fix" works for an explicitly constructed circular list of length 2, but nothing else. And still, working in only ST is always slower than in a superfluous StateT over ST.

Changed 6 years ago by tomberek

Attachment: Main2.hs added

Added performGC and comments

comment:8 in reply to:  7 Changed 6 years ago by tomberek

Replying to tomberek: correction, the GC fix doesn't work for twoItemList either (it's late, I got confused).

comment:9 Changed 6 years ago by simonmar

Resolution: wontfix
Status: new → closed

Aha, of course I was forgetting something: the GC will also eliminate the indirections, and that seems to explain the difference, at least in Main2.hs.

In the original version, the data structure with the indirections is enough to push it over the size of the L1 cache, so all the misses are capacity misses (on my machine anyway). In Main2.hs, the difference is all due to having to traverse the extra indirections in the data structure.

Mystery solved - and I don't think there's anything we can do here. If it is important to make this case go fast, you might try to rewrite the program so that it creates the data structure strictly, which will eliminate the indirections at creation time.

comment:10 in reply to:  9 Changed 6 years ago by tomberek

Ok, then I'd like to know what I can do to create the data structure strictly. I've got the bang patterns on the CircList,

data CircList s = ConsM {value :: {-# UNPACK #-} !Int
                        ,cons  :: {-# UNPACK #-} !(STRef s (CircList s )) }

Is there something in the following that should be "strictified"?

twoItemList :: forall s. ST s (CircList s)
twoItemList = do
    rec item1 <- liftM (ConsM 0) $ newSTRef item2
        item2 <- liftM (ConsM 1) $ newSTRef item1
    return item1

or in

-- | Defines a circular list of length n     
nlist :: forall s. Int -> ST s (CircList s)
nlist ns = do
    rec allItems <-
            let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
            in V.generateM ns next
    return $ (V.!) allItems 0
    where nextCons v n = do
                        n' <- newSTRef n
                        return $ ConsM v n'

comment:11 Changed 6 years ago by simonmar

The laziness here is in the contents of the STRef, where you can't put a bang pattern. Now the problem is that because you're creating a circular list, there must be at least one lazy element in it, because you're using laziness to create the circularity. So you could probably add one bang pattern in twoItemList, but not two (I haven't tried this).
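One possible reading of "create the data structure strictly" is to allocate the refs with placeholders first and then overwrite them with fully evaluated nodes, sidestepping the laziness-for-circularity issue entirely. A sketch along those lines (twoItemListStrict is a name introduced here, and this is untested against the original benchmark):

```haskell
import Control.Monad.ST (ST, runST)
import Data.STRef (STRef, newSTRef, writeSTRef)

data CircList s = ConsM { value :: {-# UNPACK #-} !Int
                        , cons  :: {-# UNPACK #-} !(STRef s (CircList s)) }

-- Allocate both cells with placeholders, then overwrite each with a
-- fully evaluated ConsM node, so no thunk ever sits behind a ref.
twoItemListStrict :: ST s (CircList s)
twoItemListStrict = do
    r1 <- newSTRef (error "placeholder")  -- overwritten below
    r2 <- newSTRef (error "placeholder")
    let item1 = ConsM 0 r1                -- item1's successor lives in r1
        item2 = ConsM 1 r2
    writeSTRef r1 $! item2                -- force before storing
    writeSTRef r2 $! item1
    return item1
```

Because the nodes are built non-recursively and forced with ($!) before being stored, neither ref should hold an indirection after construction.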
