Inspired by #3138, it might be useful for StrAnal to detect functions such as the following where only one of the data constructors for a sum type are CPRed:
loop x = case x < 10 of True -> Left x; False -> loop (x*2)
We can usefully transform to:
$wloop x = case (case x < 10 of True -> Left x; False -> loop (x*2)) of Left y -> (# y #)loop x = case loop x of (# y #) -> Left y
Attached patch implements this behaviour. Most of the complication in the new code occurs because adding a DataCon field to the Demand data type means that we have to define a separate IfaceDemand type for storage in interface files.
The patch validates but I haven't done any tests on nofib. I have confirmed that the new optimisation hits on some examples, though.
Trac metadata
Trac field
Value
Version
7.0.3
Type
FeatureRequest
TypeOfFailure
OtherFailure
Priority
normal
Resolution
Unresolved
Component
Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Still to come: the sum-CPR stuff is switched off for nested functions because it turns some let-no-escape functions into non-let-no-escape ones, which increases allocation. I'm hopeful that Nick F's work on late-lambda-lifting may solve this, in which case we can switch it on more vigorously. Hence leaving open for now.
Indeed. But the opportunity only arises after SpecConstr, and we don't currently run the demand/CPR analyser after SpecConstr, we run it before.
Several other tickets (#6087, #5302 (closed), #6070 (closed)) identify other opportunities that could be exploited by running the demand analyser again, late in the optimisation pipeline.
Nick Frisby is planning to experiment with this. Thanks for a new example.
Also, in case someone looks here: The branch cpr-sum-types is considered dead; Its last commit ([9b0d70e/ghc]) is from October 2011, while CPR for sum types has hit master in January 2013 ([d3b8991b/ghc]). (Maybe we should rename dead branches to dead/...?)
So one quite extreme degradation, and otherwise minor losses.
If I make sure that join points do not get the CPR information unless the expression they are a join point for does, we get this change (relative to the same baseline):
This looks quite different now: A few losses, but much smaller than before. A mean of zero, and more gains than before. Unfortunately, the code size increase is still there...
If it were not for the code size increase, I’d suggest to merge these changes: Fewer special cases in transformations is a good thing. But it is probably not worth the code size increase, is it?
(Just for the record: In theory it is possible that CPR’ing a non-sum-type can destroy the join-point-property of a let. But it does not seem to happen: If I enable the cpr-join-point-fix, but keep CPR for sum types disabled, nofib does not record any change whatsoever.)
What happens for the program in the Description? Do we have a regression test?
Do we have a regression test for the program in this comment?
Is sum-CPR switched off for nested functions? Apparently this was because "it turns some let-no-escape functions into non-let-no-escape ones, which increases allocation.", and as you say that's no longer necessary because we know which functions are join points. But let's double check.
Does switching on/off sum-CPR (how?) make allocation go up in some programs? If so it'd be good to know why, even if we agree to accept it.
What happens for the program in the Description? Do we have a regression test?
When we give it the type signature loop :: Int -> Either Int Int, it works. I couldn't find a regression test with that exact condition (x < 10).
Do we have a regression test for the program in this comment?
I presume you mean this program:
loop :: Int -> Maybe Int -> Maybe Intloop n x = case n of 0 -> x _ -> loop (n-1) (fmap (+1) x)
I grepped for the type signature and couldn't find it.
Is sum-CPR switched off for nested functions?
Yes, see the binding for trim_sums and Note [CPR for sum types]. More details below.
Does switching on/off sum-CPR (how?) make allocation go up in some programs?
TBD, I'm on it.
Interestingly, the example from the Description does not get WW'd. Here it is, for completeness (note the lack of a type signature):
loop x = case x < 10 of True -> Left x False -> loop (x*2)
This will be turned into the following Core before Demand Analysis:
loop = \ (@ t_a2e5) (@ b_a2e3) ($dOrd_a2ty :: Ord t_a2e5) ($dNum_a2tz :: Num t_a2e5) (eta_B1 :: t_a2e5) -> let { lvl_s2uU = fromInteger @ t_a2e5 $dNum_a2tz lvl_s2uT } in let { lvl_s2uW = fromInteger @ t_a2e5 $dNum_a2tz lvl_s2uV } in joinrec { loop_s2uR (x_atn :: t_a2e5) = case < @ t_a2e5 $dOrd_a2ty x_atn lvl_s2uU of { False -> jump loop_s2uR (* @ t_a2e5 $dNum_a2tz x_atn lvl_s2uW); True -> Data.Either.Left @ t_a2e5 @ b_a2e3 x_atn }; } in jump loop_s2uR eta_B1
The inner loop is now actually a local binding, for which we historically trim CPR information in order not to spoil potential join points. This is detailed in Note [CPR for sum types] in DmdAnal and hinted at by https://gitlab.haskell.org//ghc/ghc/issues/5075#note_68711.
We really don't want to spoil the join point in this particular example, but there are probably plenty of cases when the heuristic not . isTopLevel (see trim_sums) is too broad: First, we unnecessarily penalise any local non-join binding. Second, when a join point has the CPR property, we can give it to the whole expression that defines the join point.
When we give loop a proper type signature, like loop :: Int -> Either Int Int, everything is well, though:
Test.$wloop = \ (ww_s2gr :: GHC.Prim.Int#) -> case GHC.Prim.<# ww_s2gr 10# of { __DEFAULT -> Test.$wloop (GHC.Prim.*# ww_s2gr 2#); 1# -> (# GHC.Types.I# ww_s2gr #) }loop = \ (w_s2go :: Int) -> case w_s2go of { GHC.Types.I# ww1_s2gr -> case Test.$wloop ww1_s2gr of { (# ww3_s2gx #) -> Data.Either.Left @ Int @ Int ww3_s2gx } }
So, I'm going to try to get rid of trim_sums, which should propagate CPR from join points rather automatically by virtue of the different code path join points already take in LetDown.
I'll evaluate the resulting version against HEAD and a modified version of HEAD where I set trim_sums = True, effectively removing all sum CPR info.
I think this whole thing about sum-CPR may be rendered moot by Note [Don't CPR join points] in WorkWrap. It seems that we compute CPR info for join points, but do not use that info for w/w, regardless of whether it's a sum type. The example in that Note is great.
I think that's all we need. Maybe we can kill off the special case in DmdAnal?
My job for benchmarking the baseline got stuck, but here's the comparison of no CPR for sums (i.e., trim_sums = True) vs CPR for all sums (i.e., trim_sums = False):
NoFib Results-------------------------------------------------------------------------------- Program Allocs Instrs-------------------------------------------------------------------------------- binary-trees -0.8% +3.8% digits-of-e2 -5.4% -1.5% fibheaps -0.3% -1.2% fish 0.0% +1.9% genfft -0.1% +1.1% integer 0.0% +1.0% last-piece +0.0% +2.0% mkhprog -1.3% -0.8% parser -0.4% -4.0% parstof -6.8% -4.0% reptile +25.9% +22.4%-------------------------------------------------------------------------------- Min -6.8% -4.0% Max +25.9% +22.4% Geometric Mean +0.1% +0.2%
So there are regressions, most notably in reptile. Binary sizes increased by 2.5% on average. Will investigate some time next week or so.
-- | The resulting program should not contain any Just constructor-- top-levelloop1 :: Int -> Maybe Intloop1 x | x > 10 = Just x | otherwise = loop1 (x*2){-# NOINLINE loop1 #-}-- nestedloop2 :: Int -> Maybe Intloop2 x = go x +? go (x+1) where Just x +? Just y = Just (x + y) _ +? _ = Nothing go z | z > 10 = Just (x + z) | otherwise = go (z*2){-# NOINLINE loop2 #-}main = do let Just y = loop1 2 print y let Just z = loop2 2 print z
The resulting simplifier output should not contain any mentions of Just (checkable by check_errmsg). loop2 currently does contain mentions even after simplification.
Interestingly, deactivating CPR for sums altogether makes things faster:
NoFib Results-------------------------------------------------------------------------------- Program Allocs Allocs Instrs Instrs no-cpr-su cpr-all-s no-cpr-su cpr-all-s-------------------------------------------------------------------------------- parstof +7.3% -0.0% +4.2% -0.0% reptile -20.6% 0.0% -18.3% 0.0%-------------------------------------------------------------------------------- Min -20.6% -0.3% -18.3% -1.0% Max +7.3% +0.4% +4.2% +2.0% Geometric Mean -0.1% +0.0% -0.2% +0.1%
All wrt. to a recent HEAD commit. Doing no CPR for sums (i.e., always trimming sum CPR, no-cpr-sums) has a huge win in reptile, which is probably the reason it's a win at all. Excluding reptile, doing sum CPR is a marginal win. Also note that doing CPR for all sums (i.e., never trim sum CPR, cpr-all-sums) makes things marginally slower.
Binary sizes go down by 1.7% if we deactivate CPR for sums.
I believe that the regression in reptile is due to the following spin-off of the hash function we now use throughout Nofib:
lvl9 :: Int -> Int[GblId]lvl9 = case Mgrfuns.$wescom Mgrfuns.shapewindow1 lvl8 of { (# ww1, ww2 #) -> go3 (GHC.Types.: @ Char ww1 ww2) }Main.main_go = \ (ds :: [Char]) (eta :: Int) -> case ds of { [] -> lvl9 eta; : y ys -> case eta of { GHC.Types.I# ipv -> Main.main_go ys (case y of { GHC.Types.C# c# -> GHC.Types.I# (GHC.Prim.+# (GHC.Prim.ord# c#) (GHC.Prim.*# ipv 31#)) }) } }
Note that main_go (which is essentially a strict left fold computing a hash over a String) couldn't unbox its Int parameter because lvl9 is a thunk that apparently couldn't be inlined. Without !312 (merged) (and probably also with it, seeing that idArity is 0), this will not assign a useful strictness signature to lvl9.
This is an example where we want to look at how lvl9 is used to determine which strictness signature to give it.
Short of being able to propagate the appropriate higher-order info, we have to blame the simplifier. Why wasn't lvl9 inlined? I guess if the wrapper for $wescom wasn't inlined into lvl9, it would just be a PAP amenable to being inlined into main_go. This is the reason why the no-sum-cpr version (which doesn't WW escom) properly inlines lvl9 and propagates the right strictness signatures.
So: we should make sure that lvl9 properly inlines or is given idArity 1 (which would be a lie, strange).
I tracked down how we got into the above situation. There is this huge foldl', 'optimised' to a foldr:
foldr k (... (foldr k id (escom bitcopy1 [...])) ...) (escom shapewindow [...])) NofibUtils.hash1
The escom wrappers get inlined, thus destroying the usful property that each foldr in itself will desugar to a PAP of the form
lvl = go (escom shapewindow [...])
which will have the proper arity 1 and inline, instead of
lvl = case $wescom shapewindow [...] of (# ww1, ww2 #) -> go (GHC.Types.: @ Char ww1 ww2)
I wonder why the wrapper was inlined in the first place; It doesn't look like there's any incoming demand on that foldr argument justifying that decision.
I'm not sure if this is the particular go3, but they are all pretty similar:
go3_rnDO = \ (ds_a3er :: [Char]) (eta_B1 :: Int) -> case ds_a3er of { [] -> z1_rnDN eta_B1; : y_a3ew ys_a3ex -> case eta_B1 of { GHC.Types.I# ipv_smq5 -> go3_rnDO ys_a3ex (case y_a3ew of { GHC.Types.C# c#_a9VM -> GHC.Types.I# (GHC.Prim.+# (GHC.Prim.ord# c#_a9VM) (GHC.Prim.*# ipv_smq5 31#)) }) } }
This accumulates a hash in eta_B1 of the string ds_a3er. Which is basically main_go from above with a different continuation z1_rnDN (which is a CAF again, exhibiting the same lack of argument strictness).
Becauuse then Mgrfuns.$wescom Mgrfuns.shapewindow1 lvl8 might be evaluated many times
Right, the culprit is probably that the wrapper escom was inlined and the resulting case floated out of the inner foldr application, because foldr is strict in its third argument.
Example:
foldr k (foldr k ys (escom shapewindow [...])) xs
(*)
turns into (wrapper inlined)
foldr k (foldr k ys (case $wescom shapewindow [...] of ... -> ...)) xs
and then (case floated out, because foldr is strict in its third argument, but not in its second)
foldr k (case $wescom shapewindow [...] of ... -> foldr k ys (...)) xs
Now, in practice the case expression gets really big and is floated out to top-level, leaving us with a non-PAP CAF lvl9 that can't be inlined.
Consider what happens if we were at (*) and didn't inline the wrapper (or didn't w/w escom in the first place): No floating out of the argument is happening (I wonder why -- I bet this is documented somewhere) and we'd just float out a huge thunk like
where go is one of those arity 2 hashing functions. Note how there was an arity increase!. Crucially, this will just inline at call sites and we'll see gos strictness signature!
Compare that to lvl9 above, has exactly the same purpose but doesn't inline because we floated out the case. Here is it again for reference:
lvl9 = case Mgrfuns.$wescom Mgrfuns.shapewindow1 lvl8 of { (# ww1, ww2 #) -> go3 (GHC.Types.: @ Char ww1 ww2) }
This will not inline and we won't see go3 strictness signature. Note how we in turn memoised the head of the list, but that's much worse than being able to propagate lvl9's argument strictness.
Solutions
lvl9's strictness signature is pretty useless. Since top-level thunks are analysed with LetDown anyway (yay), maybe we should compute strictness signatures of top-level thunks for typeArity? Seems hacky. Btw., this is an instance were I'm sure the idea from my master's thesis would be able to propagate the right strictness signature for arity 1.
The other way out I see is to refrain from floating out the case expression in the first place, but that only turned out to be beneficial because the arity of the thunk increased, rendering it a PAP in one case but not in the other.
I feel like this isn't so much related to CPR anymore. Maybe we should fork this into its own ticket?
I think this ticket needs to be solved (the issues around #5075 (comment 189715) and #16570 (closed)) before sum CPR can live up to its full potential.
It appears that fixing #18793 (closed) helped with the immediate problem described in the ticket. But the regression in reptile persisted even after this patch and is only fixed with !5667 (closed) where I deactivate CPR for recursive data cons (such as list cons), so basically not fixed at all.
In reptile we have the more general setting
--CR should try to generalise this to cope with more (all?) commandsescom::[Char]->[Int]->[Char]escomstrns='\ESC':foldrf""nswherefn""=shown++strfns=shown++","++sbitcopy,circle,shapewindow::[Int]->[Char]bitcopy=escom"b"-- xd yd w h xs yscircle=escom"o"-- x y rshapewindow=escom"W"-- x y w h-- and so on
And if I look at the Core for shapewindow or the others, I still see the wrapper of escom being inlined:
This looks alright, but then later in main we have something like foldr k id (shapewindow [0,0,1150]) and we'll inline the wrapper shapewindow and get foldr k id (case $wescom Mgrfuns.shapewindow1 [0,0,1150] of <rebuild (:)>) and we'll float out that case again, to
And when we finally inline foldr we don't see that the inner loop thing actually has arity 1. Contrast that with
lvl8=shapewindow[0,0,1150]lvl9=foldrkidlvl8
Here, lvl9 has arity 1 and after inlining foldr completely, we'll have a much more useful strictness signature.
It took me a few hours to reproduce, but I finally was able to do so:
{-# LANGUAGE MagicHash, UnboxedTuples #-}moduleLib(hash,salt,shapewindow,potatotile,setup,func)whereimportData.Listhiding(foldr,length)importData.Char-- It's important that escom gets WW'd for its result to reproduce the issue-- We'll do so manually hereescom::[Char]->[Int]->[Char]escomstrns=casewescomstrnsof(#x,xs#)->x:xs{-# INLINE escom #-}wescom::[Char]->[Int]->(#Char,[Char]#)wescomstrns=case'\ESC':foldrf""nsofx:xs->(#x,xs#)wherefn""=shown++strfns=shown++","++s{-# NOINLINE wescom #-}shapewindow::[Int]->[Char]shapewindow=escom"W"-- x y w hfunc::Int->[Char]funcmode=escom"b"[mode]setup::Stringsetup="huge"{-# NOINLINE setup #-}potatotile::[String]->[Char]potatotile=escom"W".maplength-- x y w h{-# NOINLINE potatotile #-}hash::String->Inthash=foldl'(\accc->ordc+acc*31)0{-# INLINE hash #-}salt::a->IOasalt=pure{-# NOINLINE salt #-}moduleMainwhereimportControl.MonadimportLibmain=doinput<-getContentsreplicateM_500$dos<-saltinputprint$hash(func5++potatotile(liness)++setup++func15)
If you look at the simplified core output, you see something like
-- RHS size: {terms: 17, types: 13, coercions: 0, joins: 0/0}z_r2GL :: Int -> Int[GblId]z_r2GL = case Lib.$wwescom Lib.func1 lvl1_r2GK of { (# ww_i29e, ww1_i29f #) -> \ (w_s2Fm :: Int) -> case w_s2Fm of { GHC.Types.I# ww2_s2Fo -> case $s$wgo1_r2GH ww_i29e ww1_i29f ww2_s2Fo of ww3_s2FR { __DEFAULT -> GHC.Types.I# ww3_s2FR } } }
This is the problematic thunk that isn't eta-expanded. We want it to have arity 1 so that it has a useful strictness signature and so that we can unbox the Int. Tracking the issue as #19970.
This is all irrelevant to whether or not we should give sums the CPR property unconditionally, though. It was a very unfortunate coincidence that doing so triggered the regression in reptile, which since has made it into GHC HEAD anyway. If !5667 (closed) lands, we'll only give it to non-recursive data constructors and I hope we'll soon be able to resolve this ticket, too.