```haskell
test :: (a -> a -> a) -> Int -> a -> HashMap Int a -> HashMap Int a
test f k a m = insertModifying a (blink (f a)) k m

blink :: (a -> b) -> a -> (# b #)
-- Or blink g = \a -> (# g a #) ; it makes no difference.
blink g a = (# g a #)
```
nomeata, I finally came up with a standalone test case that exhibits the same (apparent) peculiarity. I don't really understand what you're asking, so I'm hoping this will help.
```haskell
{-# language UnboxedTuples #-}
module Fish where

import Data.Array.ST
import Control.Monad.ST.Strict
import Control.Monad

blink :: (a -> b) -> a -> (# b #)
blink g a = (# g a #)

test :: Int -> a -> (a -> a -> a) -> STArray s Int a -> ST s (STArray s Int a)
test k a f m = insertModifyingArr k (blink (f a)) m
{-# NOINLINE test #-}

insertModifyingArr :: Int -> (a -> (# a #)) -> STArray s Int a -> ST s (STArray s Int a)
insertModifyingArr i0 f arr0 = do
  rng <- range <$> getBounds arr0
  go i0 rng arr0
  where
    go i [] arr = pure arr
    go i (k : ks) arr
      | i == k = do
          old <- readArray arr i
          case f old of
            (# new #) -> writeArray arr i new
          return arr
      | otherwise = go i ks arr
```
It works if you don't export insertModifyingArr. Then it gets inlined into test, and Call Arity, when looking at the binding of g_s9E in your original snippet, sees all the uses, sees that it is called at most once, and is happy to eta-expand it!
But without inlining insertModifyingArr, this is beyond the reach of Call Arity, because it is not a higher-order analysis.
Now, why does the demand analyser (which is higher-order, i.e. knows how functions call their arguments) not fix this? Because the demand analyser does not see that insertModifyingArr calls its argument only once, because the call is in a recursive loop.
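To see the shape of the problem in isolation, here is a distilled sketch (my own minimal example, not code from this ticket): the functional argument is applied at most once per outer call, but that single application sits inside a recursive loop, which is exactly what defeats the analysis as described above.

```haskell
-- Hypothetical minimal shape: f is called at most once per call of applyAt,
-- but the call is buried inside the recursive worker go.
applyAt :: Int -> (a -> a) -> [a] -> [a]
applyAt i f = go 0
  where
    go _ []     = []
    go k (x:xs)
      | k == i    = f x : xs            -- the only call to f
      | otherwise = x : go (k + 1) xs
```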
Sebastian Graf (@sgraf812) has thoughts on combining the analyses to give us the best of both worlds, maybe he can comment.
For now, does this workaround help?
```haskell
import GHC.Magic

blink :: (a -> b) -> a -> (# b #)
blink g = oneShot $ \a -> (# g a #)
```
Inlining insertModifying (in the real-life case) is probably a bad idea. Your workaround does work. An alternative workaround does too:
```haskell
test k a f m = insertModifyingArr k (oneShot $ blink (f a)) m
```
I'm glad to hear someone's been thinking about this issue; I'm much less glad that I've spent so much time coming up with a test case for something that's already known!
Hey, it seems I'm lucky I subscribed to ghc-ticket just yesterday :)
I investigated a little.
So, it turns out that Demand Analysis doesn't recognise insertModifyingArr's argument f as being called once (it has usage C(U(U))). That's really weird; the analysis should actually be able to figure that out. The fact that -ddump-stranal doesn't mention f in the demand type as a free variable of go makes me feel like there is some conservative approximation at play here. And in fact, I believe that's how fix-pointing works for functions in DmdAnal: completely forget about free variables and assume the most pessimistic result.
You can circumvent that if you make f an argument to go (reverse static arg transform, essentially):
```haskell
insertModifyingArr :: Int -> (a -> (# a #)) -> STArray s Int a -> ST s (STArray s Int a)
insertModifyingArr i0 f arr0 = do
  rng <- range <$> getBounds arr0
  go f i0 rng arr0
  where
    go f i [] arr = pure arr
    go f i (k : ks) arr
      | i == k = do
          old <- readArray arr i
          case f old of
            (# new #) -> writeArray arr i new
          return arr
      | otherwise = go f i ks arr
```
This makes DmdAnal propagate the usage information to the top level: the demand signature for insertModifyingArr mentions f with usage 1*C1(U(U)), which in theory should be enough information for the simplifier to eta-expand the blink expression.
> You can circumvent that if you make f an argument to go (reverse static arg transform, essentially):
That’s smart! By making it an argument, you essentially tell GHC to apply inductive reasoning, and then the Demand Analyzer finds out that f is used at most once! Cool.
And I presume if the recursion were non-linear, it would also do the right thing…
So – can we just include the free variables in the strictness signature during fixpointing and get that optimization without the code transformation?
> So – can we just include the free variables in the strictness signature during fixpointing and get that optimization without the code transformation?
Well, actually I had expected DmdAnal to just work here. I'm not sure why f doesn't end up in go's signature; I suspected this has something to do with the Lazy and unleashable free variables hack, but I'm not so sure anymore. I'll investigate.
OK, this is due to the call to reuseEnv here (that function effectively duplicates every demand, so 1*C1(U(U)) -> C(U(U))) with the explanation in Aggregated demand for cardinality, the gist of which is that we have to treat free variables of let-bound thunks differently in usage analysis than we would like to for strictness analysis.
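To make the effect concrete, here is a rough, self-contained sketch of what "reusing" an environment means, using simplified usage types of my own rather than GHC's actual DmdAnal code:

```haskell
import qualified Data.Map as M

-- Simplified stand-ins for GHC's usage demands, for illustration only.
data Count = Once | Many          deriving Show
data Usage = Abs | Used Count     deriving Show

-- "Reusing" a demand forgets the used-at-most-once information,
-- which is how 1*C1(U(U)) effectively degrades to C(U(U)).
markReused :: Usage -> Usage
markReused Abs      = Abs
markReused (Used _) = Used Many

-- A reuseEnv-style operation over a whole free-variable environment.
reuseEnvSketch :: M.Map String Usage -> M.Map String Usage
reuseEnvSketch = M.map markReused
```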
I think this should only be relevant for thunks, e.g. we should first splitFVs is_thunk rhs_fv and only then reuseEnv the lazy_fv. I'll give that a shot.
OK, that didn't work, but for reasons I didn't expect. If we apply that change, suddenly some bindings get *worse* *strictness* annotations, although it should only make for *less conservative* (possibly unsound) *usage* annotations, as reuseEnv will only affect usage information.
It turns out that this is due to the interaction between the lazy fv hack and the fix-pointing algorithm. Here's an example, adapted from T876:
```haskell
foo :: Int -> Int
foo n = sum [ length [i..n] | i <- [1..n] ]

main = print (foo 100)
```
The variant that does get rid of the call to reuseEnv altogether will produce something like this code:
```
foo
  = \ (n_aYV [Dmd=<L,U(U)>] :: Int) ->
      joinrec {
        go_a2c3 [Occ=LoopBreaker] :: [Int] -> Int -> Int
        [LclId[JoinId(2)], Arity=2,
         Unf=Unf{Src=<vanilla>, TopLvl=False, Value=True, ConLike=True,
                 WorkFree=True, Expandable=True, Guidance=IF_ARGS [30 20] 137 0}]
        go_a2c3 (ds_a2c4 [Dmd=<S,1*U>] :: [Int]) (eta_B1 [Dmd=<L,1*U(U)>] :: Int)
          = case ds_a2c4 of {
              [] -> eta_B1;
              : y_a2c9 [Dmd=<L,1*U(U)>] ys_a2ca ->
                jump go_a2c3 ys_a2ca
                  (case eta_B1 of { GHC.Types.I# x_a3ii [Dmd=<S,U>] ->
                   case y_a2c9 of { GHC.Types.I# x_a3jq [Dmd=<S,U>] ->
                   case n_aYV of { GHC.Types.I# y_a3jz [Dmd=<S,U>] ->
                   case GHC.List.$wlenAcc @ Int (GHC.Enum.eftInt x_a3jq y_a3jz) 0#
                   of ww2_a4JP [Dmd=<S,U>] { __DEFAULT ->
                   GHC.Types.I# (GHC.Prim.+# x_a3ii ww2_a4JP)
                   } } } })
            }; } in
      jump go_a2c3
        (case n_aYV of { GHC.Types.I# y_a3jz [Dmd=<S,U>] ->
         GHC.Enum.eftInt 1# y_a3jz })
        lvl_s4Jd
```
Note that foo is clearly strict in n (that's what HEAD finds out), but this variant is too conservative. Some printfs of the lazy_fvs and sig_fv envs revealed that this is due to abortion of fix-pointing.
It flip-flops between putting n's demand into the sig_fv and the lazy_fv. That's decided by isWeakDmd, which amounts to checking whether the demand is equivalent to <L,U> and will thus no longer change during fix-pointing. After the initial iteration, we find that n is called once, and it gets tagged onto the strictness signature. The second iteration sees that n is called an additional time, demand <L,U(U)>. This means it no longer has an interesting demand and goes into lazy_fv. But here's the culprit: the fix-pointer only compares the strictness signature for changes! It will start a third iteration, completely forget about any lazy_fv, and flop back to the state of the first iteration.
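Schematically, the loop looks something like the following sketch (hypothetical, simplified types, not GHC's actual dmdFix): only the signature is compared between rounds, and the lazy_fv computed in a round is dropped when the next iteration starts, which is what allows the oscillation.

```haskell
import qualified Data.Map as M

type Dmd    = String                   -- stand-in for GHC's demand type
type DmdEnv = M.Map String Dmd

-- One analysis pass over the RHS, given the current signature environment;
-- it returns the new signature env and the lazy_fv found in this round.
type Pass = DmdEnv -> (DmdEnv, DmdEnv)

fixSketch :: Int -> Pass -> DmdEnv -> (DmdEnv, DmdEnv)
fixSketch 0 _       sig = (sig, M.empty)           -- abort after too many rounds
fixSketch n analyse sig
  | sig' == sig = (sig', lazy')                    -- ONLY the signature is compared ...
  | otherwise   = fixSketch (n - 1) analyse sig'   -- ... and lazy' is dropped here
  where
    (sig', lazy') = analyse sig
```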
There are two ways out:

1. Also check lazy_fvs for changes. This is the thing we wanted to avoid in the first place. Also, this is LetUp in disguise, which purposefully isn't equipped to deal with recursive bindings.
2. Don't check lazy_fvs. These are outer bindings only, so they don't actually need to play a role in fix-pointing. Also, everything in lazy_fvs is already top-ish, so it suffices to check whether a variable in a prior signature is now part of lazy_fvs and exclude it from the check.
I spent some time thinking about how to implement this, but given that lazy_fvs plays a largely different role for thunks (unleashes strictness through signatures and stores usage in lazy_fvs) vs. non-thunks (puts weak demands into lazy_fvs, e.g. <L,U> and <L,C(C1(U))> which won't change anymore during fixpointing), I don't think pursuing this is worth the trouble.
I'm afraid we just have to live with free variables getting a second-class treatment compared to parameters for the time being. This is another good argument for eventually splitting the demand analyser into its three sub-analyses.
As I noted in #14816 (comment 149404), I think in order to drop the call to reuseEnv, we have to make sure that we don't exit fixed-point iteration too early. That entails comparing DmdEnv entries, which is very costly (solution (1) in the comment above).
On further thought, I think we can make the DmdEnv comparison quite efficient with the following realisation:
If an entry in the DmdEnv is stable between two iterations, it will stay so for all the coming iterations, even if some other entry is still unstable. Needs a proof, or at least an assertion in the code.
Then we can maintain a VarSet during fixed-point iteration with the FVs that still need stabilisation. We can immediately remove variables with weak demands (cf. isWeakDmd) from this set.
Edit: I'm pretty sure that the realisation doesn't hold for the DmdEnv in the strictness signature. But maybe it does for the lazy_fv part...
Motivated by #19001 (comment 340233), I set out again to remove the call to reuseEnv in !5349. At the moment, it hangs while compiling libraries/base/GHC/Ix.hs.
It seems that the Ix instances for tuples are responsible, all the way up to 16-tuples. By successively deleting code I realised that each additional tuple component doubles the allocation it takes to compile the instance, so it's clearly exponential. I inserted a trace instruction in the place where the reuseEnv call used to be and looked at an excerpt of its output.
What appears to be flip-flopping is actually thousands of duplicated go_X5 definitions. But note that they all take an additional iteration because, initially, the recursive function go_X5 itself is inferred to be one-shot, MCM(L). That's of course bogus: all recursive functions are many-shot. So we should simply re-use all bindings of the recursive group in dmdFix or dmdAnalRhsSig. Out of cycles for today, more tomorrow.
Generally I believe that the range methods generate particularly gnarly code.
Besides being a source of quadratic slowdown (#19584), removing reuseEnv regressed GHC.Ix exponentially, as I found above.
I've meanwhile characterised the problem. Consider the Cartesian product of lists a and b:
```haskell
cart a b =
  let go1 xs = case xs of
        []    -> []
        x:xs' ->
          let z = go1 xs' in
          let go2 ys = case ys of
                []    -> z
                y:ys' -> (x,y) : go2 ys'
          in go2 b
  in go1 a
```
Given the stable demand signature <1L>{z->ML} and lazy_fvs {go2->L, x->L} for go2, these are the demand signatures and lazy_fvs of go1 after each iteration:
1. <1L>{b->ML, go1->MCM(L)}, {}
2. <1L>{}, {b->L, go1->L}
3. <1L>{b->ML, go1->MCM(L)}, {}
4. <1L>{}, {b->L, go1->L}
...
and so on, ad infinitum. Actually not ad infinitum, because after 10 iterations we'll stop and continue with a Top signature. But it's easy to see how this scales exponentially if you increase the number of nested bindings.
Note that reuseEnv made sure that we didn't get (2): Reusing ML and MCM(L) would result in L and then be put into lazy_fvs.
Clearly, monotonicity of the lazy_fvs is the problem here. More tomorrow...
"Given the stable demand signature <1L>{z->ML} for go2". That's an odd signature: go2 is patently strict in ys
Yes, but ys is the arg to go2 and hence you'll see that through <1L>.
I think the problem is simply that we can't compute a new signature without considering the lazy_fvs of the previous iteration. Going from (2) to (3), if we don't look at the previous lazy_fvs, we can't see that we already found b to be used multiple times, and so we put it into the signature again.
The fix is to continually extend lazy_fv while doing fixed-point iteration. I did that, but the resulting analysis is still too slow. Everything needs an additional iteration to go from ML to L, so every first stable approximation takes 3 iterations.
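In terms of the schematic loop sketched earlier in this thread (again hypothetical, not GHC's actual code), "continually extending lazy_fv" means threading the accumulated lazy_fv through the iteration, and letting the pass consult it, instead of dropping it each round:

```haskell
import qualified Data.Map as M

type Dmd    = String
type DmdEnv = M.Map String Dmd

-- The pass now also sees the lazy_fv accumulated in earlier iterations.
type Pass2 = DmdEnv -> DmdEnv -> (DmdEnv, DmdEnv)

fixExtend :: Int -> Pass2 -> DmdEnv -> (DmdEnv, DmdEnv)
fixExtend limit analyse sig0 = go limit sig0 M.empty
  where
    go 0 sig lazy = (sig, lazy)                    -- abort: keep last approximation
    go n sig lazy
      | sig' == sig = (sig', lazy')                -- signatures stable: done
      | otherwise   = go (n - 1) sig' lazy'
      where
        (sig', lazyNew) = analyse sig lazy         -- consult the previous lazy_fv ...
        lazy'           = M.union lazyNew lazy     -- ... and keep extending it
```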
To make matters worse, we are exponential in the following variation of the Cartesian product:
```haskell
cart2 :: [a] -> [b] -> [(a,b)]
cart2 a b =
  let go1 xs = case xs of
        []    -> []
        x:xs' ->
          let k1 = go1 xs' in
          let go2 ys = case ys of
                []    -> k1
                y:ys' -> (x,y) : go2 ys' ++ k1   -- added (++ k1) here
          in go2 b
  in go1 a
```
The signatures and lazy_fv (omitting the recursive binders which are always L) after every iteration:
1. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {b->ML}, {}
2. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {}, {b->L}
3. Iterate go1:
   - go2: <1L> {k1->ML}, {x->L} (NB: I made sure that go2 immediately ends up in lazy_fv. Irrelevant here)
   - go2: <1L> {}, {k1->L, x->L}
   - go2: <1L> {}, {k1->L, x->L} stable!
   - Finish go1: <1L> {}, {b->L} stable!
Why do we need to iterate go2 thrice again in (2) and (3)? The reason is that we drop the lazy_fv every time around: unlike the signature, which we persist via idStrictness in the CoreExpr (abusing the AST as state), we can do no such hack for lazy_fv. So we get exponential runtime, as every analysis of an inner binding begins anew with an empty lazy_fv.
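A back-of-the-envelope cost model (my own rough estimate, assuming about three iterations per binding as observed above, and that every outer iteration re-analyses the nested binding from an empty lazy_fv) makes the blow-up explicit:

```haskell
-- Rough number of analysis passes for d >= 1 levels of nested bindings,
-- if each level needs ~3 iterations and inner levels restart from scratch.
work :: Int -> Int
work 1 = 3
work d = 3 * work (d - 1)   -- i.e. work d = 3^d, exponential in the nesting depth
```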
I see no other way than to merge lazy_fv back into idStrictness. Maybe we can afford an additional field in DmdType? It is all very painful because we only need it for analysis efficiency reasons.
Here's another option that is much simpler: Store the idStrictness state in the AnalEnv (which then has to be threaded like regular State) instead of the CoreExpr, because then we can add the lazy_fvs to the AnalEnv and wouldn't have to complicate DmdType. I feel like Note [Initialising strictness] is too smart for its own good anyway. Big Problem with that approach: It needs all binders to be globally unique.
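A rough sketch of the shape of that option (hypothetical names and types; GHC's real AnalEnv is richer): thread signatures and lazy_fvs through the analysis as state, keyed by binder, which is also why all binders would have to be globally unique.

```haskell
import qualified Data.Map as M
import Control.Monad.State

type Sig     = String                        -- stand-in for StrictSig
type LazyFvs = M.Map String String           -- stand-in for the lazy_fv env

-- Analysis state: signatures and lazy_fvs keyed by binder name.
-- Shadowed binders would clash here, hence the need for global uniqueness.
data AnalEnv = AnalEnv { sigs :: M.Map String Sig, lazyFvs :: LazyFvs }

type AnalM a = State AnalEnv a

lookupSig :: String -> AnalM (Maybe Sig)
lookupSig x = gets (M.lookup x . sigs)

extendSig :: String -> Sig -> AnalM ()
extendSig x s = modify (\env -> env { sigs = M.insert x s (sigs env) })

addLazyFvs :: LazyFvs -> AnalM ()
addLazyFvs new = modify (\env -> env { lazyFvs = M.union new (lazyFvs env) })
```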
Btw., this is the kind of problem I solved in my thesis by building a dependency graph for the whole program, where every node (e.g. one for go1 and go2 each) has its own state. I'm still hesitant to explain it to someone because it was so complicated and the constants weren't very encouraging, but it's exactly what we need.
> Yes, but ys is the arg to go2 and hence you'll see that through <1L>.
Oh, I was misled by the L. I think the demand on ys is C_11 :* Poly C_0N, which is displayed as 1L. So it's strict all right.
Displaying Poly C_0N as L is quite confusing. I wonder about P(L). Oh, that's used for products, so maybe Q(L) or something. But I suppose that's a separate discussion.
!5349 caches lazy FVs with the StrictSig. We currently don't zap the cache after DmdAnal, so I'm a bit wary I've introduced a space leak... But that should be easy to rectify.
Yes! Now there's only #19584 in the way and any ghc/alloc regressions the testsuite might turn up. I ran the stranal testsuite, no changes in annotations so far. I haven't yet run the full testsuite, the OP or #19001. Will do in time.
Needless to say, the changes around lazy FVs and fixed-point iteration aren't all that simple... But maybe it's worthwhile if we can get better eta-expansion in #19001.
I can confirm that in the reproducer from #14816 (comment 149239), !5349 now assigns the more precise arg signature MCM(C1(L)) (which is how we say "called at most once" at the moment) to f. By contrast, HEAD assigns LCL(C1(L)), as when this ticket was created. So !5349 will indeed fix this bug, but it currently has performance problems as long as #19584 is not resolved; I have an idea for how to fix that.
After a radical rewrite of the whole demand analyser in !5349, I can finally say that the prototype implementation there fixes this ticket and manages to pass the testsuite (modulo perf tests). It also is a fix for #19584.
While !5349 is still under way (and hopefully nears conclusion in the form of a paper), I wondered if we could have a shortcut solution to the original issue. My motivation is in the context of !9089, where the join point loopification idea obscures exit contexts of otherwise easily inlinable bindings, as explained in #22274 (comment 460376). We wondered whether this issue is more easily solvable than #22274.
I don't think it is. My (third, apparently) summary to this ticket is as follows:
The call to reuseEnv is responsible for demand signatures never having interesting used-at-most-once information on free variables. (And having this info would allow us to inline in loop exits, for example.)
I had hoped we could remove that call, or the motivation for it. I recalled it had to do with analysis perf, as the FIXME comment above the call in DmdAnal suggests, so I tried to do it less often, e.g., when the env has at most 5 elements.
But then I rediscovered the monotonicity issues regarding weak variables, see #14816 (comment 341630). The short story is that during fixed-point iteration, variables flip between occurring in weak_fvs (those not unleashed at call sites because there is no use in doing so; perf!) and sig_fvs (those we do unleash at call sites, such as strict ones).
But (4) is not easily possible, because we sometimes don't have access to the previous iteration's weak_fvs; namely when we continue fixed-point iteration from a previously approximated demand signature. The weak_fvs are not currently part of the demand signature, so we forget them every time around. Nor should they be, because they belong to the binding group rather than just the binding.
I think I tried to experiment with special demand signatures that cached the weak_fvs about a year ago. To no avail, but I forgot the specifics. I think I could look them up in a log of the calls Simon and I had about the issue.
...
And so on. The rabbit hole never ends. Rather than trying to solve any of these problems piecemeal, I suggest waiting for !5349, which I recall solved the issue entirely.