Looking at the core, it seems that repeating the definition means that one of the key functions doesn't get specialised as expected which leads to a much slower program.
Observe that in the first two benchmarks there is a worker function go :: Int# -> Int -> ReaderT Int (StateT Int Identity Int but in the third benchmark this is specialised to $sgo :: Int# -> Int -> Int -> Int# -> Int# -> Identity (Int, Int). Removing the duplicate benchmark means that specialisation happens properly in the first case as well.
The proper specialisation also happens in 8.0.2.
This causes the first two cases to be 6x slower than the last case.
Trac metadata
Trac field
Value
Version
8.2.1-rc2
Type
Bug
TypeOfFailure
OtherFailure
Priority
normal
Resolution
Unresolved
Component
Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related or that one is blocking others.
Learn more.
{-# LANGUAGE BangPatterns #-}{-# LANGUAGE FlexibleContexts #-}{-# LANGUAGE FlexibleInstances #-}{-# LANGUAGE FunctionalDependencies #-}{-# LANGUAGE MultiParamTypeClasses #-}{-# LANGUAGE UndecidableInstances #-}moduleMain(main)where-- | Benchmarks for various effect system implementations-- import Criterion.MainimportData.BitsimportData.IntimportData.IORefimportData.RatioimportData.Time(getCurrentTime,utctDayTime)importControl.ExceptionimportControl.MonadimportControl.Monad.Trans.ClassimportqualifiedControl.Monad.Trans.State.StrictasSimportControl.Monad.Trans.ReaderimportSystem.CPUTime(getCPUTime)importSystem.IO.Unsafe-- Use only state, lift variable number of effects over/under--------------------------------------------------------------------------------test1mtl::MonadStateIntm=>Int->mInttest1mtln=foldMf1[0..n]wherefaccx|x`rem`5==0=dos<-getput$!(s+1)pure$!maxaccx|otherwise=pure$!maxaccxmain=do-- Used to definitively disable bench argument inlining-- !n <- randomRIO (1000000, 1000000) :: IO Int!m<-randomRIO(0,0)::IOIntletrunRT=(`runReaderT`(m::Int))letrunS=(`S.runState`(m::Int))replicateM_100$do!n<-randomRIO(1000000,1000000)::IOIntevaluate$(runS.runRT.test1mtl)nreplicateM_100$do!n<-randomRIO(1000000,1000000)::IOIntevaluate$(runS.runRT.test1mtl)nreplicateM_100$do!n<-randomRIO(1000000,1000000)::IOIntevaluate$(runS.runRT.runRT.test1mtl)n------- Auxiliary----classMonadm=>MonadStatesm|m->swhereget::msget=state(\s->(s,s))put::s->m()puts=state(\_->((),s))state::(s->(a,s))->mastatef=dos<-getlet~(a,s')=fsputs'returna{-# MINIMAL state | get, put #-}instanceMonadStatesm=>MonadStates(ReaderTrm)whereget=liftgetput=lift.putstate=lift.stateinstanceMonadm=>MonadStates(S.StateTsm)whereget=S.getput=S.putstate=S.stateclassRandomawhererandomR::RandomGeng=>(a,a)->g->(a,g)-- random :: RandomGen g => g -> (a, g)randomRIO::(a,a)->IOarandomRIOrange=getStdRandom(randomRrange)instanceRandomIntwhererandomR=randomIvalIntegral-- ; random = randomBoundedrandomIvalIntegral::(RandomGeng,Integrala)=>(a,a)->g->(a,g)randomIvalIntegral(l,h)=randomIvalInteger(toIntegerl,toIntegerh){-# SPECIALIZE randomIvalInteger :: (Num a) => (Integer, Integer) -> StdGen -> (a, StdGen) #-}randomIvalInteger::(RandomGeng,Numa)=>(Integer,Integer)->g->(a,g)randomIvalInteger(l,h)rng|l>h=randomIvalInteger(h,l)rng|otherwise=case(f10rng)of(v,rng')->(fromInteger(l+v`mod`k),rng')where(genlo,genhi)=genRangerngb=fromIntegralgenhi-fromIntegralgenlo+1q=1000k=h-l+1magtgt=k*qfmagvg|mag>=magtgt=(v,g)|otherwise=v'`seq`f(mag*b)v'g'where(x,g')=nextgv'=(v*b+(fromIntegralx-fromIntegralgenlo))classRandomGengwherenext::g->(Int,g)genRange::g->(Int,Int)genRange_=(minBound,maxBound)dataStdGen=StdGen!Int32!Int32instanceRandomGenStdGenwherenext=stdNextgenRange_=stdRangestdRange::(Int,Int)stdRange=(1,2147483562)stdNext::StdGen->(Int,StdGen)stdNext(StdGens1s2)=(fromIntegralz',StdGens1''s2'')wherez'=ifz<1thenz+2147483562elsezz=s1''-s2''k=s1`quot`53668s1'=40014*(s1-k*53668)-k*12211s1''=ifs1'<0thens1'+2147483563elses1'k'=s2`quot`52774s2'=40692*(s2-k'*52774)-k'*3791s2''=ifs2'<0thens2'+2147483399elses2'getStdRandom::(StdGen->(a,StdGen))->IOagetStdRandomf=atomicModifyIORef'theStdGen(swap.f)whereswap(v,g)=(g,v)theStdGen::IORefStdGentheStdGen=unsafePerformIO$dorng<-mkStdRNG0newIORefrngmkStdRNG::Integer->IOStdGenmkStdRNGo=doct<-getCPUTime(sec,psec)<-getTimereturn(createStdGen(sec*12345+psec+ct+o))createStdGen::Integer->StdGencreateStdGens=mkStdGen32$fromIntegralsmkStdGen32::Int32->StdGenmkStdGen32sMaybeNegative=StdGen(s1+1)(s2+1)wheres=sMaybeNegative.&.maxBound(q,s1)=s`divMod`2147483562s2=q`mod`2147483398getTime::IO(Integer,Integer)getTime=doutc<-getCurrentTimeletdaytime=toRational$utctDayTimeutcreturn$quotRem(numeratordaytime)(denominatordaytime)
$ /opt/ghc/8.0.2/bin/ghc MultiBench2.hs -O2 -fforce-recomp[1 of 1] Compiling Main ( MultiBench2.hs, MultiBench2.o )Linking MultiBench2 ...$ time ./MultiBench2 real 0m2.954suser 0m2.952ssys 0m0.000s$ /opt/ghc/8.2.1/bin/ghc MultiBench2.hs -O2 -fforce-recomp[1 of 1] Compiling Main ( MultiBench2.hs, MultiBench2.o )Linking MultiBench2 ...$ time ./MultiBench2 real 0m12.335suser 0m12.292ssys 0m0.048s
The bulk of the increase in runtime is due to 2effe18a (The Early Inline Patch):
Commit 55efc9718b520ef354e32c15c4b49cdfecce412f (Combine identical case alternatives in CSE)-----$ time ./MultiBench2 real 0m2.786suser 0m2.784ssys 0m0.000sCommit 2effe18ab51d66474724d38b20e49cc1b8738f60 (The Early Inline Patch)-----$ time ./MultiBench2 real 0m11.861suser 0m11.816ssys 0m0.052s
(I'm not sure yet what contributes to the other 0.5 seconds in runtime increase.)
Since the first arg of the foldr has no free vars, we float it out to give
lvl = \x y z. blah $stest1mtl = \eta. ...foldr lvl z e...
That makes $stest1mtl small, so it is inlined at its two call sites (the first two test case in main).
So now there are two calls to lvl, and it is quite big, so it doesn't get inlined.
But actually it is much better not to inline $stest1mtl, and instead (after the foldr/build stuff has happened) to inline lvl back into it.
This kind of thing not new; I trip over it quite often. Generally, given
f = e g = ...f.. h = ...g...g..f...
should we inline f into g, thereby making g big, so it doesn't inline into h? Or should we instead inline g into h? Sometimes one is better, sometimes the other; I don't know any systematic way of doing The Right Thing all the time. It turned out that the early-inline patch changed the choice, which resulted in the changed performance.
However I did spot several things worth trying out
In CoreArity.rhsEtaExpandArity we carefully do not eta-expand thunks. But I saw some thunks like
lvl_s621 = case z_a4NJ of wild_a4OF { GHC.Types.I# x1_a4OH -> case x_a4NH of wild1_a4OJ { GHC.Types.I# y1_a4OL -> case GHC.Prim.<=# x1_a4OH y1_a4OL of { __DEFAULT -> (\ _ (eta_B1 :: Int) -> (wild_a4OF, eta_B1)) 1# -> (\ _ (eta_B1 :: Int) -> (wild1_a4OJ, eta_B1))
Here it really would be good to eta-expand; then that particular lvl could be inlined at its call sites. Here's a change to CoreArity.rhsEtaExpandArity that did the job:
- | isOneShotInfo os || has_lam e -> 1 + length oss+ | isOneShotInfo os || not (is_app e) -> 1 + length oss- has_lam (Tick _ e) = has_lam e- has_lam (Lam b e) = isId b || has_lam e- has_lam _ = False+ is_app (Tick _ e) = is_app e+ is_app (App f _) = is_app f+ is_app (Var _) = True+ is_app _ = False
Worth trying.
Now the offending top-level lvl function is still not inlined; but it has a function argument that is applied, so teh call sites look like
lvl ... (\ab. blah) ...
When considering inining we do get a discount for the application of the argument inside lvl's rhs, but it was only a discout of 60, which seems small considering how great it is to inline a function. Boosting it to 150 with -funfolding-fun-discount=150 make the function inline, and we get good code all round. Maybe we should just up the default.
All the trouble is caused by the early float-out. I think we could try just elminating it.
Based on this diagnosis, I don't think we should hold up the release. It's not a bug in pass X that can readily be fixed; it's the (very difficult) challenge of making correct inlining decisions.