```haskell
test :: (a -> a -> a) -> Int -> a -> HashMap Int a -> HashMap Int a
test f k a m = insertModifying a (blink (f a)) k m

blink :: (a -> b) -> a -> (# b #)
-- Or blink g = \a -> (# g a #) ; it makes no difference.
blink g a = (# g a #)
```
nomeata, I finally came up with a standalone test case that exhibits the same (apparent) peculiarity. I don't really understand what you're asking, so I'm hoping this will help.
```haskell
{-# language UnboxedTuples #-}
module Fish where

import Data.Array.ST
import Control.Monad.ST.Strict
import Control.Monad

blink :: (a -> b) -> a -> (# b #)
blink g a = (# g a #)

test :: Int -> a -> (a -> a -> a) -> STArray s Int a -> ST s (STArray s Int a)
test k a f m = insertModifyingArr k (blink (f a)) m
{-# NOINLINE test #-}

insertModifyingArr :: Int -> (a -> (# a #)) -> STArray s Int a -> ST s (STArray s Int a)
insertModifyingArr i0 f arr0 = do
  rng <- range <$> getBounds arr0
  go i0 rng arr0
  where
    go i [] arr = pure arr
    go i (k : ks) arr
      | i == k = do
          old <- readArray arr i
          case f old of
            (# new #) -> writeArray arr i new
          return arr
      | otherwise = go i ks arr
```
It works if you don't export insertModifyingArr. Then it gets inlined into test, and Call Arity, when looking at the binding of g_s9E in your original snippet, sees all the uses, sees that it is called at most once, and is happy to eta-expand it!
But without inlining insertModifyingArr, this is beyond the reach of Call Arity, because it is not a higher-order analysis.
Now, why does the demand analyser (which is higher-order, i.e. knows how functions call their arguments) not fix this? Because the demand analyser does not see that insertModifyingArr calls its argument only once, because the call is in a recursive loop.
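To see the shape of the problem in isolation, here is a distilled sketch (my own minimal example, not code from this ticket): the functional argument is applied at most once per outer call, but that single application sits inside a recursive loop, which is exactly what defeats the analysis as described above.

```haskell
-- Hypothetical minimal shape: f is called at most once per call of applyAt,
-- but the call is buried inside the recursive worker go.
applyAt :: Int -> (a -> a) -> [a] -> [a]
applyAt i f = go 0
  where
    go _ []     = []
    go k (x:xs)
      | k == i    = f x : xs            -- the only call to f
      | otherwise = x : go (k + 1) xs
```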
Sebastian Graf (@sgraf812) has thoughts on combining the analyses to give us the best of both worlds, maybe he can comment.
For now, does this workaround help?
```haskell
import GHC.Magic

blink :: (a -> b) -> a -> (# b #)
blink g = oneShot $ \a -> (# g a #)
```
Inlining insertModifying (in the real-life case) is probably a bad idea. Your workaround does work. An alternative workaround does too:
```haskell
test k a f m = insertModifyingArr k (oneShot $ blink (f a)) m
```
I'm glad to hear someone's been thinking about this issue; I'm much less glad that I've spent so much time coming up with a test case for something that's already known!
Hey, it seems I'm lucky I subscribed to ghc-ticket just yesterday :)
I investigated a little.
So, it turns out that Demand Analysis doesn't recognise insertModifyingArr's argument f as being called once (it has usage C(U(U))). That's really weird; the analysis should actually be able to figure that out. The fact that -ddump-stranal doesn't mention f in the demand type as a free variable of go makes me feel like there is some conservative approximation at play here. And in fact, I believe that's how fix-pointing works for functions in DmdAnal: completely forget about free variables and assume the most pessimistic result.
You can circumvent that if you make f an argument to go (reverse static arg transform, essentially):
```haskell
insertModifyingArr :: Int -> (a -> (# a #)) -> STArray s Int a -> ST s (STArray s Int a)
insertModifyingArr i0 f arr0 = do
  rng <- range <$> getBounds arr0
  go f i0 rng arr0
  where
    go f i [] arr = pure arr
    go f i (k : ks) arr
      | i == k = do
          old <- readArray arr i
          case f old of
            (# new #) -> writeArray arr i new
          return arr
      | otherwise = go f i ks arr
```
This makes DmdAnal propagate the usage information to the top level: the demand signature for insertModifyingArr mentions f with usage 1*C1(U(U)), which in theory should be enough information for the simplifier to eta-expand the blink expression.
> You can circumvent that if you make f an argument to go (reverse static arg transform, essentially):
That’s smart! By making it an argument, you essentially tell GHC to apply inductive reasoning, and then the Demand Analyzer finds out that f is used at most once! Cool.
And I presume if the recursion were non-linear, it would also do the right thing…
So – can we just include the free variables in the strictness signature during fixpointing and get that optimization without the code transformation?
> So – can we just include the free variables in the strictness signature during fixpointing and get that optimization without the code transformation?
Well, actually I had expected DmdAnal to just work here. I'm not sure why f doesn't end up in go's signature; I suspected this has something to do with the Lazy and unleashable free variables hack, but I'm not so sure anymore. I'll investigate.
OK, this is due to the call to reuseEnv here (that function effectively duplicates every demand, so 1*C1(U(U)) -> C(U(U))) with the explanation in Aggregated demand for cardinality, the gist of which is that we have to treat free variables of let-bound thunks differently in usage analysis than we would like to for strictness analysis.
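To make the effect concrete, here is a rough, self-contained sketch of what "reusing" an environment means, using simplified usage types of my own rather than GHC's actual DmdAnal code:

```haskell
import qualified Data.Map as M

-- Simplified stand-ins for GHC's usage demands, for illustration only.
data Count = Once | Many          deriving Show
data Usage = Abs | Used Count     deriving Show

-- "Reusing" a demand forgets the used-at-most-once information,
-- which is how 1*C1(U(U)) effectively degrades to C(U(U)).
markReused :: Usage -> Usage
markReused Abs      = Abs
markReused (Used _) = Used Many

-- A reuseEnv-style operation over a whole free-variable environment.
reuseEnvSketch :: M.Map String Usage -> M.Map String Usage
reuseEnvSketch = M.map markReused
```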
I think this should only be relevant for thunks, e.g. we should first splitFVs is_thunk rhs_fv and only then reuseEnv the lazy_fv. I'll give that a shot.
OK, that didn't work, but for reasons I didn't expect. If we apply that change, suddenly some bindings get *worse* *strictness* annotations, although it should only make for *less conservative* (possibly unsound) *usage* annotations, as reuseEnv will only affect usage information.
It turns out that this is due to the interaction between the lazy fv hack and the fix-pointing algorithm. Here's an example, adapted from T876:
```haskell
foo :: Int -> Int
foo n = sum [ length [i..n] | i <- [1..n] ]

main = print (foo 100)
```
The variant that does get rid of the call to reuseEnv altogether will produce something like this code:
```
foo
  = \ (n_aYV [Dmd=<L,U(U)>] :: Int) ->
      joinrec {
        go_a2c3 [Occ=LoopBreaker] :: [Int] -> Int -> Int
        [LclId[JoinId(2)], Arity=2,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [30 20] 137 0}]
        go_a2c3 (ds_a2c4 [Dmd=<S,1*U>] :: [Int]) (eta_B1 [Dmd=<L,1*U(U)>] :: Int)
          = case ds_a2c4 of {
              [] -> eta_B1;
              : y_a2c9 [Dmd=<L,1*U(U)>] ys_a2ca ->
                jump go_a2c3 ys_a2ca
                  (case eta_B1 of { GHC.Types.I# x_a3ii [Dmd=<S,U>] ->
                   case y_a2c9 of { GHC.Types.I# x_a3jq [Dmd=<S,U>] ->
                   case n_aYV of { GHC.Types.I# y_a3jz [Dmd=<S,U>] ->
                   case GHC.List.$wlenAcc @ Int (GHC.Enum.eftInt x_a3jq y_a3jz) 0#
                   of ww2_a4JP [Dmd=<S,U>] { __DEFAULT ->
                   GHC.Types.I# (GHC.Prim.+# x_a3ii ww2_a4JP)
                   } } } })
            }; } in
      jump go_a2c3
        (case n_aYV of { GHC.Types.I# y_a3jz [Dmd=<S,U>] ->
         GHC.Enum.eftInt 1# y_a3jz })
        lvl_s4Jd
```
Note that foo is clearly strict in n (that's what HEAD finds out), but this variant is too conservative. Some printfs of the lazy_fvs and sig_fv envs revealed that this is due to abortion of fix-pointing.
It flip-flops between putting n's demand into the sig_fv and the lazy_fv. That's decided by isWeakDmd, which amounts to checking whether the demand is equivalent to <L,U> and will thus no longer change during fix-pointing. After the initial iteration, we find that n is called once, and it gets tagged onto the strictness signature. The second iteration sees that n is called an additional time, demand <L,U(U)>. This means it no longer has an interesting demand and goes into lazy_fv. But here's the culprit: the fix-pointer only compares the strictness signature for changes! It will start a third iteration, completely forget about any lazy_fv, and flop back to the state of the first iteration.
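Schematically, the loop looks something like the following sketch (hypothetical, simplified types, not GHC's actual dmdFix): only the signature is compared between rounds, and the lazy_fv computed in a round is dropped when the next iteration starts, which is what allows the oscillation.

```haskell
import qualified Data.Map as M

type Dmd    = String                   -- stand-in for GHC's demand type
type DmdEnv = M.Map String Dmd

-- One analysis pass over the RHS, given the current signature environment;
-- it returns the new signature env and the lazy_fv found in this round.
type Pass = DmdEnv -> (DmdEnv, DmdEnv)

fixSketch :: Int -> Pass -> DmdEnv -> (DmdEnv, DmdEnv)
fixSketch 0 _       sig = (sig, M.empty)           -- abort after too many rounds
fixSketch n analyse sig
  | sig' == sig = (sig', lazy')                    -- ONLY the signature is compared ...
  | otherwise   = fixSketch (n - 1) analyse sig'   -- ... and lazy' is dropped here
  where
    (sig', lazy') = analyse sig
```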
There are two ways out:

1. Also check lazy_fvs for changes. This is the thing we wanted to avoid in the first place. Also, this is LetUp in disguise, which purposefully isn't equipped to deal with recursive bindings.
2. Don't check lazy_fvs. These are outer bindings only, so they don't actually need to play a role in fix-pointing. Also, everything in lazy_fvs is already top-ish, so it suffices to check whether a variable in a prior signature is now part of lazy_fvs and exclude it from the check.
I spent some time thinking about how to implement this, but given that lazy_fvs plays a largely different role for thunks (unleashes strictness through signatures and stores usage in lazy_fvs) vs. non-thunks (puts weak demands into lazy_fvs, e.g. <L,U> and <L,C(C1(U))> which won't change anymore during fixpointing), I don't think pursuing this is worth the trouble.
I'm afraid we just have to live with free variables getting a second-class treatment compared to parameters for the time being. This is another good argument for eventually splitting the demand analyser into its three sub-analyses.
As I noted in #14816 (comment 149404), I think in order to drop the call to reuseEnv, we have to make sure that we don't exit fixed-point iteration too early. That entails comparing DmdEnv entries, which is very costly (solution (1) in the comment above).
On further thought, I think we can make the DmdEnv comparison quite efficient with the following realisation:
If an entry in the DmdEnv is stable between two iterations, it will stay so for all the coming iterations, even if some other entry is still unstable. Needs a proof, or at least an assertion in the code.
Then we can maintain a VarSet during fixed-point iteration with the FVs that still need stabilisation. We can immediately remove variables with weak demands (cf. isWeakDmd) from this set.
Edit: I'm pretty sure that the realisation doesn't hold for the DmdEnv in the strictness signature. But maybe it does for the lazy_fv part...
Motivated by #19001 (comment 340233), I set out again to remove the call to reuseEnv in !5349. At the moment, it hangs while compiling libraries/base/GHC/Ix.hs.
It seems that the Ix instances for tuples are responsible, all the way up to 16-tuples. By successively deleting code I realised that each additional tuple component doubles the allocation it takes to compile the instance, so it's clearly exponential. I inserted a trace instruction in the place where the reuseEnv call used to be and looked at an excerpt of its output.
What appears to be flip-flopping is actually thousands of duplicated go_X5 definitions. But note that they all take an additional iteration because, initially, the recursive function go_X5 itself is inferred to be one-shot, MCM(L). That's of course bogus: all recursive functions are many-shot. So we should simply re-use all bindings of the recursive group in dmdFix or dmdAnalRhsSig. Out of cycles for today, more tomorrow.
Generally I believe that the range methods generate particularly gnarly code.
Besides being a source of quadratic slowdown (#19584), removing reuseEnv regressed GHC.Ix exponentially, as I found above.
I've meanwhile characterised the problem. Consider the Cartesian product of lists a and b:
```haskell
cart a b =
  let go1 xs = case xs of
        []    -> []
        x:xs' ->
          let z = go1 xs' in
          let go2 ys = case ys of
                []    -> z
                y:ys' -> (x,y) : go2 ys'
          in go2 b
  in go1 a
```
Given the stable demand signature <1L>{z->ML} and lazy_fvs {go2->L, x->L} for go2, these are the demand signatures and lazy_fvs of go1 after each iteration:
1. <1L>{b->ML, go1->MCM(L)}, {}
2. <1L>{}, {b->L, go1->L}
3. <1L>{b->ML, go1->MCM(L)}, {}
4. <1L>{}, {b->L, go1->L}
...
and so on, ad infinitum. Actually not ad infinitum, because after 10 iterations we'll stop and continue with a Top signature. But it's easy to see how this scales exponentially if you increase the number of nested bindings.
Note that reuseEnv made sure that we didn't get (2): Reusing ML and MCM(L) would result in L and then be put into lazy_fvs.
Clearly, monotonicity of the lazy_fvs is the problem here. More tomorrow...
"Given the stable demand signature <1L>{z->ML} for go2". That's an odd signature: go2 is patently strict in ys
Yes, but ys is the arg to go2 and hence you'll see that through <1L>.
I think the problem is simply that we can't compute a new signature without considering the lazy_fvs of the previous iteration. Going from (2) to (3), if we don't look at the previous lazy_fvs, we can't see that we already found b to be used multiple times, and so we put it into the signature again.
The fix is to continually extend lazy_fv while doing fixed-point iteration. I did that, but the resulting analysis is still too slow. Everything needs an additional iteration to go from ML to L, so every first stable approximation takes 3 iterations.
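In terms of the schematic loop sketched earlier in this thread (again hypothetical, not GHC's actual code), "continually extending lazy_fv" means threading the accumulated lazy_fv through the iteration, and letting the pass consult it, instead of dropping it each round:

```haskell
import qualified Data.Map as M

type Dmd    = String
type DmdEnv = M.Map String Dmd

-- The pass now also sees the lazy_fv accumulated in earlier iterations.
type Pass2 = DmdEnv -> DmdEnv -> (DmdEnv, DmdEnv)

fixExtend :: Int -> Pass2 -> DmdEnv -> (DmdEnv, DmdEnv)
fixExtend limit analyse sig0 = go limit sig0 M.empty
  where
    go 0 sig lazy = (sig, lazy)                    -- abort: keep last approximation
    go n sig lazy
      | sig' == sig = (sig', lazy')                -- signatures stable: done
      | otherwise   = go (n - 1) sig' lazy'
      where
        (sig', lazyNew) = analyse sig lazy         -- consult the previous lazy_fv ...
        lazy'           = M.union lazyNew lazy     -- ... and keep extending it
```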
To make matters worse, we are exponential in the following variation of the Cartesian product:
```haskell
cart2 :: [a] -> [b] -> [(a,b)]
cart2 a b =
  let go1 xs = case xs of
        []    -> []
        x:xs' ->
          let k1 = go1 xs' in
          let go2 ys = case ys of
                []    -> k1
                y:ys' -> (x,y) : go2 ys' ++ k1   -- added (++ k1) here
          in go2 b
  in go1 a
```
The signatures and lazy_fv (omitting the recursive binders which are always L) after every iteration:
1. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {b->ML}, {}
2. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {}, {b->L}
3. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {}, {b->L} stable!
Why do we need to iterate go2 thrice again in (2) and (3)? The reason is that we drop the lazy_fv every time around: unlike the signature, which we persist via idStrictness in the CoreExpr (abusing the AST as state), we can do no such hack for lazy_fv. So we get exponential runtime, as every analysis of an inner binding begins anew with an empty lazy_fv.
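A back-of-the-envelope cost model (my own rough estimate, assuming about three iterations per binding as observed above, and that every outer iteration re-analyses the nested binding from an empty lazy_fv) makes the blow-up explicit:

```haskell
-- Rough number of analysis passes for d >= 1 levels of nested bindings,
-- if each level needs ~3 iterations and inner levels restart from scratch.
work :: Int -> Int
work 1 = 3
work d = 3 * work (d - 1)   -- i.e. work d = 3^d, exponential in the nesting depth
```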
I see no other way than to merge lazy_fv back into idStrictness. Maybe we can afford an additional field in DmdType? It is all very painful because we only need it for analysis efficiency reasons.
Here's another option that is much simpler: Store the idStrictness state in the AnalEnv (which then has to be threaded like regular State) instead of the CoreExpr, because then we can add the lazy_fvs to the AnalEnv and wouldn't have to complicate DmdType. I feel like Note [Initialising strictness] is too smart for its own good anyway. Big Problem with that approach: It needs all binders to be globally unique.
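A rough sketch of the shape of that option (hypothetical names and types; GHC's real AnalEnv is richer): thread signatures and lazy_fvs through the analysis as state, keyed by binder, which is also why all binders would have to be globally unique.

```haskell
import qualified Data.Map as M
import Control.Monad.State

type Sig     = String                        -- stand-in for StrictSig
type LazyFvs = M.Map String String           -- stand-in for the lazy_fv env

-- Analysis state: signatures and lazy_fvs keyed by binder name.
-- Shadowed binders would clash here, hence the need for global uniqueness.
data AnalEnv = AnalEnv { sigs :: M.Map String Sig, lazyFvs :: LazyFvs }

type AnalM a = State AnalEnv a

lookupSig :: String -> AnalM (Maybe Sig)
lookupSig x = gets (M.lookup x . sigs)

extendSig :: String -> Sig -> AnalM ()
extendSig x s = modify (\env -> env { sigs = M.insert x s (sigs env) })

addLazyFvs :: LazyFvs -> AnalM ()
addLazyFvs new = modify (\env -> env { lazyFvs = M.union new (lazyFvs env) })
```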
Btw., this is the kind of problem I solved in my thesis by building a dependency graph for the whole program, where every node (e.g. one for go1 and go2 each) has its own state. I'm still hesitant to explain it to someone because it was so complicated and the constants weren't very encouraging, but it's exactly what we need.
> Yes, but ys is the arg to go2 and hence you'll see that through <1L>.
Oh, I was misled by the L. I think the demand on ys is C_11 :* Poly C_0N, which is displayed as 1L. So it's strict all right.
Displaying Poly C_0N as L is quite confusing. I wonder about P(L). Oh, that's used for products, so maybe Q(L) or something. But I suppose that's a separate discussion.
!5349 caches lazy FVs with the StrictSig. We currently don't zap the cache after DmdAnal, so I'm a bit wary I've introduced a space leak... But that should be easy to rectify.
Yes! Now there's only #19584 in the way and any ghc/alloc regressions the testsuite might turn up. I ran the stranal testsuite, no changes in annotations so far. I haven't yet run the full testsuite, the OP or #19001. Will do in time.
Needless to say, the changes around lazy FVs and fixed-point iteration aren't all that simple... But maybe it's worthwhile if we can get better eta-expansion in #19001.
I can confirm that in the reproducer from #14816 (comment 149239), !5349 now assigns the more precise arg signature MCM(C1(L)) (which is how we say "called at most once" at the moment) to f. By contrast, HEAD assigns LCL(C1(L)), as when this ticket was created. So !5349 will indeed fix this bug, but it currently has performance problems as long as #19584 is not resolved; I have an idea for how to fix that.
After a radical rewrite of the whole demand analyser in !5349, I can finally say that the prototype implementation there fixes this ticket and manages to pass the testsuite (modulo perf tests). It also is a fix for #19584.
While !5349 is still under way (and hopefully nears conclusion in the form of a paper), I wondered if we could have a shortcut solution to the original issue. My motivation is in the context of !9089, where the join point loopification idea obscures exit contexts of otherwise easily inlinable bindings, as explained in #22274 (comment 460376). We wondered whether this issue is more easily solvable than #22274.
I don't think it is. My (third, apparently) summary to this ticket is as follows:
The call to reuseEnv is responsible for demand signatures never having interesting used-at-most-once information on free variables. (And having this info would allow us to inline in loop exits, for example.)
I had hoped we could remove that call, or the motivation for it. I recalled it had to do with analysis perf, as the FIXME comment above the call in DmdAnal suggests, so I tried to do it less often, e.g., when the env has at most 5 elements.
But then I rediscovered the monotonicity issues regarding weak variables, see #14816 (comment 341630). The short story is that during fixed-point iteration, variables flip between occurring in weak_fvs (those not unleashed at call sites because there is no use in doing so; perf!) and sig_fvs (those we do unleash at call sites, such as strict ones).
But (4) is not easily possible, because we sometimes don't have access to the previous iteration's weak_fvs; namely when we continue fixed-point iteration from a previously approximated demand signature. The weak_fvs are not currently part of the demand signature, so we forget them every time around. Nor should they be, because they belong to the binding group rather than just the binding.
I think I tried to experiment with special demand signatures that cached the weak_fvs about a year ago. To no avail, but I forgot the specifics. I think I could look them up in a log of the calls Simon and I had about the issue.
...
And so on. The rabbit hole never ends. Rather than trying to solve any of these problems piecemeal, I suggest waiting for !5349, which I recall solved the issue entirely.