
Making LetUp more precise

The demand analyser uses two different rules for handling let bindings, as explained in the cardinality paper. TLDR:

  • LetDown is used for top-level bindings, recursive bindings and local, non-recursive functions. It closely mimics the semantics of an actual function call: it unleashes the demand signature of the bound RHS at each call site within the let body.
  • LetUp is used for local, non-recursive thunks. This works more like a type system: analysing the let body reveals its demand on the identifier, under which the RHS of the binding is then analysed exactly once. Only then are the results from the body and the RHS combined sequentially.

There are reasons why we currently need both rules (cardinality paper §3.5 and §3.6), although the general approach of LetDown is more flow-sensitive and thus strictly more precise than LetUp.

Consider the following running example:

let y = ... 
in let x = 2*y
   in if ... then x else y

This evaluates y only once, but the LetUp rule (in contrast to LetDown) fails to recognise this: The inner let-body has a demand environment of [x :-> <L,1*U>, y :-> <L,1*U>], which combines with the DmdEnv of the RHS [y :-> <L,1*U>] to a total DmdEnv for the let-expression binding x of [y :-> <L,U>]. I.e., y is possibly evaluated multiple times!
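To make the rule concrete, here is a minimal Haskell sketch of LetUp for let x = rhs in body (dmdAnalEnv, dmdAnalUnder and lookupDmd are made-up placeholders for the purposes of this page, not the actual DmdAnal API):

letUp :: Id -> CoreExpr -> CoreExpr -> DmdEnv
letUp x rhs body = bothDmdEnv body_env' rhs_env
  where
    body_env  = dmdAnalEnv body            -- analyse the body first
    dmd_on_x  = lookupDmd body_env x       -- the demand the body puts on x
    body_env' = delVarEnv body_env x       -- x goes out of scope
    rhs_env   = dmdAnalUnder dmd_on_x rhs  -- analyse the RHS exactly once, under that demand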

Co-call graphs

Meanwhile, Call Arity, which uses a LetUp strategy at let-bindings exclusively, has countered the exact same loss of precision by employing Co-call graphs. Instead of storing cardinality information in a plain identifier map, where all identifiers are assumed to be evaluated alongside each other, co-call graphs additionally record that some ids are never evaluated during the same evaluation of the containing expression.

For the above example, the co-call graph for the inner let-body would specifically not contain an edge between x and y, because they are evaluated on mutually exclusive code paths. Thus, substituting the co-call graph of x's RHS into the co-call graph of the let-body will not induce a loop on y, so y will be recognised as evaluated at most once, which is what we want here.
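To make this concrete, here is a toy model of co-call graphs (just a sketch; the real data structure in CallArity.hs is more refined):

import qualified Data.Set as S

-- Symmetric edge set: an edge (x, y) means x and y may be evaluated
-- during the same evaluation of the containing expression.
type CoCalls = S.Set (Id, Id)

-- Alternative code paths (e.g. the branches of an if): union the
-- graphs, but add no edges between the two sides.
lubCoCalls :: CoCalls -> CoCalls -> CoCalls
lubCoCalls = S.union

-- Sequential composition: both sides are evaluated, so every id
-- occurring on one side may be co-called with every id on the other.
bothCoCalls :: S.Set Id -> CoCalls -> S.Set Id -> CoCalls -> CoCalls
bothCoCalls ids1 g1 ids2 g2
  = S.unions [g1, g2, S.fromList [ (x, y) | x <- S.toList ids1, y <- S.toList ids2 ]]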

BOTH/OR trees

The idea behind co-call graphs translates to demand analysis by being as lazy as possible in computing DmdEnvs.

In particular, bothDmdType and lubDmdType currently smash together both incoming DmdEnvs. This forgets important structure in the above example: namely, that x and y are evaluated mutually exclusively.

To remedy that, we could model DmdEnvs as a tree structure (leaving Termination-based nastiness aside for a moment):

data DmdEnv'
  = Empty
  | Var Id Demand
  | Or DmdEnv' DmdEnv'
  | Both DmdEnv' DmdEnv'

flattenDmdEnv' :: DmdEnv' -> DmdEnv
flattenDmdEnv' env =
  case env of
    Empty          -> emptyVarEnv
    Var id dmd     -> unitVarEnv id dmd
    Or env1 env2   -> lubDmdEnv (flattenDmdEnv' env1) (flattenDmdEnv' env2)  -- alternative paths
    Both env1 env2 -> bothDmdEnv (flattenDmdEnv' env1) (flattenDmdEnv' env2) -- sequential composition

This is essentially the interpreter pattern, with flattenDmdEnv' being an interpreter in terms of the old DmdEnv.

However, as we'll see in a second, this buys us the ability to perform proper substitution when resolving let-bindings with LetUp, instead of just deleting the bound id from the body env and bothing it with the result of the RHS.

For the above example, we get (Both (...) (Or (Var x <S,1*U>) (Var y <S,1*U>))) as the DmdEnv' of the let-body and (Var y <L,1*U>) as the DmdEnv' of x's RHS under the recorded demand <L,1*U> on x within the body. We can now substitute the DmdEnv' of x's RHS into all occurrences of x within the DmdEnv' of the let-body:

  (Both (...) (Or env             (Var y <S,1*U>)))[env := (Var y <L,1*U>)]
= (Both (...) (Or (Var y <L,1*U>) (Var y <S,1*U>)))

If we look up the demand on y in the resulting DmdEnv', we find that y is put under demand <L,1*U>, so used at most once, which is what we wanted.

Note however that y is still not detected as being used strictly! For this, we would need to analyse x's RHS under the demand of the use site we substitute the DmdEnv' into, much like LetDown would. Let's revisit this later.
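The substitution just performed could be implemented like this (a sketch; substDmdEnv' is a made-up name, and as just noted, it does not feed the use-site demand into the RHS):

substDmdEnv' :: Id -> DmdEnv' -> DmdEnv' -> DmdEnv'
substDmdEnv' x rhs = go
  where
    go (Var y _) | y == x = rhs  -- NB: the use-site demand on x is dropped here
    go (Or a b)           = Or (go a) (go b)
    go (Both a b)         = Both (go a) (go b)
    go t                  = t    -- Empty and other Vars are left alone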

Thunk Sharing

There's an(other) issue with the substitution model.

Substitution appeals because it approximates LetDown in terms of precision, but it also suffers from the same imprecision when it comes to modeling thunk sharing:

let y = ...
in let x = 2*y
   in x + x

Note that although x is evaluated twice, the evaluation of its RHS is shared, so that y is evaluated at most once. This is also what the current demand analysis finds out.

Let's look at the DmdEnv' for the inner let-body:

(Both
  (Var x <S,1*U>) 
  (Var x <S,1*U>))

Substituting the DmdEnv' of x's RHS:

(Both
  (Var y <L,1*U>) 
  (Var y <L,1*U>))

Ouch! Substitution forgets about sharing and lets us derive an imprecise upper bound of w*U. What we are lacking is some kind of model of the runtime heap.

How does LetUp solve this? Well, it operates under the simple assumption that the binding is evaluated exactly once in the entire body, if it's evaluated at all. So it's OK to both the DmdEnv' of the RHS with that of the let body at the root. This won't destroy sharing!

Now, for the first example (if ... then x else y), it is quite obvious that the bound id x doesn't even occur in the first Or branch of the DmdEnv' of the let body (after deleting x from it). Only the second branch evaluates x at all, so it should be safe to 'push' the graft point, where we graft the DmdEnv' of the RHS of x onto the DmdEnv' of the let body, down to the ''most recent common ancestor'' (MCRA) of all occurrences of x in the body.

For the first example:

-- [] marks the MCRA of all previous occurrences of `x`, which were already deleted
  graft (\t -> (Both t (Var y <L,1*U>))) onto (Both (...) (Or [Empty] (Var y <S,1*U>)))
= (Both (...) (Or (Both Empty (Var y <L,1*U>)) (Var y <S,1*U>)))
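A sketch of this grafting scheme on the DmdEnv' type from above (names are made up; for simplicity, occurrences of x are located and deleted on the fly instead of being pre-deleted as in the example):

occurs :: Id -> DmdEnv' -> Bool
occurs x (Var y _)  = x == y
occurs x (Or a b)   = occurs x a || occurs x b
occurs x (Both a b) = occurs x a || occurs x b
occurs _ Empty      = False

deleteId :: Id -> DmdEnv' -> DmdEnv'
deleteId x (Var y _) | x == y = Empty
deleteId x (Or a b)           = Or (deleteId x a) (deleteId x b)
deleteId x (Both a b)         = Both (deleteId x a) (deleteId x b)
deleteId _ t                  = t

-- Graft rhs at the MCRA of all occurrences of x: descend as long as
-- exactly one child mentions x, otherwise graft at the current node.
graftAtMCRA :: Id -> DmdEnv' -> DmdEnv' -> DmdEnv'
graftAtMCRA x rhs body = go body
  where
    go (Var y _) | y == x = Both Empty rhs  -- sole occurrence: graft here
    go (Or a b)
      | occurs x a && not (occurs x b) = Or (go a) b
      | occurs x b && not (occurs x a) = Or a (go b)
      | occurs x a                     = Both (Or (deleteId x a) (deleteId x b)) rhs
    go (Both a b)
      | occurs x a && not (occurs x b) = Both (go a) b
      | occurs x b && not (occurs x a) = Both a (go b)
      | occurs x a                     = Both (Both (deleteId x a) (deleteId x b)) rhs
    go t = t  -- x does not occur below here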

Multiple graft points

Grafting the DmdEnv' of the RHS of x at the MCRA of all occurrences of x in the DmdEnv' of the body is AFAICT the only solution that doesn't duplicate the DmdEnv' of the RHS into the tree, so it should be quite predictable from a performance perspective.

However, there are shortcomings, exposed by examples like this:

let y = ... in
let x = 2*y in
if ... then
  x
else if ... then
  y
else
  x

(Both 
  (...)
  (Or
    (Var x <S,1*U>)
    (Both
      (...)
      (Or
        (Var y <S,1*U>)
        (Var x <S,1*U>)))))

The MCRA of all occurrences of x is the root of the DmdEnv'. That's unfortunate, as it means we are no more precise than LetUp here, which is insufficient to realise that y is evaluated at most once.

This tension can only be resolved by having multiple graft points, one at each occurrence of x.

When is this safe to do? It's always safe to push down a graft point when there's only one child node in which x occurred (that's why it's safe to choose the MCRA). For the case when x occurred in both children (the node can only be a Both or an Or):

  1. It's safe to push graft points down both branches of an Or, as this can't destroy sharing: only one of the branches is ever evaluated. The only gripe is that the tree grows linearly in the number of Or nodes with this property, compared to the MCRA or plain LetUp approach.
  2. It's not generally safe to push graft points down both branches of a Both: (Both (Var x <S,1*U>) (Var x <S,1*U>)) from the thunk sharing example above proves that.

So, by additionally pushing down graft points along both branches of an Or node where needed, we can handle cases like the above (see the sketch below). We buy increased precision at the cost of possibly super-linear space complexity.
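Relative to the graftAtMCRA sketch above, only the Or case changes (again hypothetical):

    go (Or a b)
      | occurs x a && not (occurs x b) = Or (go a) b
      | occurs x b && not (occurs x a) = Or a (go b)
      | occurs x a                     = Or (go a) (go b)  -- both branches mention x: duplicate the graft point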

Distributing Both over Or

There's one additional trick we can apply to arrive at a simpler model of DmdEnv'.

In order to arrive at a disjunctive normal form-like model for DmdEnv', we can apply the distributive law Both env1 (Or env2 env3) = Or (Both env1 env2) (Both env1 env3). This leads to the following representation:

newtype DmdEnv'' = DNF [DmdEnv]

flattenDmdEnv'' :: DmdEnv'' -> DmdEnv
flattenDmdEnv'' (DNF envs) = lubDmdEnvs envs

dnf :: DmdEnv' -> DmdEnv''
dnf = DNF . go
  where
    go env =
      case env of
        Empty          -> [emptyDmdEnv]
        Var id dmd     -> [unitDmdEnv id dmd]
        Or env1 env2   -> go env1 ++ go env2
        Both env1 env2 -> [ bothDmdEnv e1 e2 | e1 <- go env1, e2 <- go env2 ] -- ugh, the full cross product

dnf shows the biggest drawback of this approach: it is pretty much exponential in the number of Both nodes.

'Substitution' is pretty easy here: because there are no Both nodes (or rather, the top level consists of Ors only), we can just substitute the disjuncts into one another, as sketched below.
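For instance (a sketch with made-up names, ignoring the thunk sharing issue discussed above):

substDNF :: Id -> DmdEnv'' -> DmdEnv'' -> DmdEnv''
substDNF x (DNF rhss) (DNF bodies) = DNF (concatMap subst1 bodies)
  where
    -- A disjunct mentioning x is combined with every disjunct of the
    -- RHS; a disjunct not mentioning x is kept unchanged.
    subst1 env
      | x `elemVarEnv` env = [ bothDmdEnv (env `delVarEnv` x) rhs | rhs <- rhss ]
      | otherwise          = [env]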

sgraf: Actually, I'm not sure what this buys us, other than simplicity of implementation. And performance-wise, I have a strong feeling that this blows up pretty fast. Effectively, we model every possible path of execution separately. It's the same as in logic, I guess: DNF is always possible, and easily handled, but conversion blows up exponentially in general.

Termination

The previous ideas simplify greatly by ignoring Termination/DmdResults. In fact, the bugs encountered while implementing the first proposal were all due to complex interplay with Termination.

Consider the two DmdTypes {} (i.e. empty, terminating) and {y-><S,1*U>}x (uses y exactly once and throws, like in error y). When these are lubed, we want the DmdType {y-><L,1*U>} (uses y at most once, possibly terminating) as a result.

The status quo is to compute the new DmdEnv by plusVarEnv_CD lubDmd env1 (defaultDmd res1) env2 (defaultDmd res2).

  • Where *either* env1 or env2 has an entry, we compute a new entry by lubing, using defaultDmd res* (where res* is the DmdResult of the corresponding DmdType) for the side that lacks an entry.
  • When there is no entry in either env, there won't be one in the result! (See the sketch after this list.)
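In code, the status quo looks roughly like this (a simplified sketch of lubDmdType from Demand.hs; in reality, the argument demand lists are first padded to equal length):

lubDmdType :: DmdType -> DmdType -> DmdType
lubDmdType (DmdType fv1 ds1 res1) (DmdType fv2 ds2 res2)
  = DmdType lub_fv lub_ds lub_res
  where
    -- Entries missing on one side default to defaultDmd of that side's DmdResult.
    lub_fv  = plusVarEnv_CD lubDmd fv1 (defaultDmd res1) fv2 (defaultDmd res2)
    lub_ds  = zipWith lubDmd ds1 ds2
    lub_res = lubDmdResult res1 res2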

Since this approach gets its precision from delaying lubs as long as possible, we have to store Termination information in the DmdEnv'/DmdEnv'' data structure. At the very least we have to tag the children of Or nodes; for the first two approaches (DmdEnv'), we also have to tag Both nodes, e.g.

data DmdEnv'
  = Empty
  | Var Id Demand
  | Or DmdEnv' (Termination ()) DmdEnv' (Termination ())
  | Both DmdEnv' (Termination ()) DmdEnv' (Termination ())

newtype DmdEnv''
  = DNF [(DmdEnv, Termination ())] -- Not quite, see below

Variables not mentioned in either lub operand are assumed to be used according to defaultDmd res, where res is the DmdResult currently part of the containing DmdType. Note that due to ExnStr/catch#, the defaultDmd res can change after computing the actual lub above, so in general defaultDmd res* and defaultDmd res can vary independently!

That's also why [findIdDemand](https://github.com/sgraf812/ghc/blob/922db3dac896b8cf364c9ebaebf1a27c2468c709/compiler/basicTypes/Demand.hs#L1577) in Demand.hs takes a DmdResult as argument.
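For reference, the linked definition is essentially a lookup that falls back to defaultDmd:

findIdDemand :: DmdType -> Var -> Demand
findIdDemand (DmdType fv _ res) id
  = lookupVarEnv fv id `orElse` defaultDmd res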

And that's also why the above definition for DmdEnv'' is incorrect! It assumes that a single Termination result per lub operand is enough, which isn't the case, as the following continued example shows:

We started with lubDmdType {} {y-><S,1*U>}x = {y-><L,1*U>} (in terms of the status quo). Now, lub that again with {z-><S,1*U>} to get {y-><L,1*U>,z-><L,1*U>}.

So far so good. The DNF would look something like [({},Dunno ()), ({y-><S,1*U>},ThrowsExn), ({z-><S,1*U>},Dunno ())], which flattenDmdEnv'' turns into pretty much the same DmdEnv as in the DmdType above.

Now consider what happens when Termination changed in-between the two lubs, e.g. because the first result of lubDmdType was an argument to error:

  lubDmdType (markThrowsExn (lubDmdType {} {y-><S,1*U>}x)) {z-><S,1*U>}
= lubDmdType (markThrowsExn {y-><L,1*U>}                 ) {z-><S,1*U>}
= lubDmdType  {y-><L,1*U>}x                                {z-><S,1*U>}
= {y-><L,1*U>,z-><S,1*U>}

Note that, unlike before, z is now evaluated strictly! In our current DNF formulation, we wouldn't be able to express the difference.

The problem here is that lub is no longer associative, because it depends on the Termination at the time of the call. So, we can't actually model DmdEnv'' as a DNF, but have to resort to

data DmdEnv''
  = Lit DmdEnv
  | Or DmdEnv'' (Termination ()) DmdEnv'' (Termination ())

This is only marginally less complex than a version of DmdEnv' where we subsume the Empty and Var cases with an equivalent Lit case:

data DmdEnv'
  = Lit DmdEnv
  | Or DmdEnv' (Termination ()) DmdEnv' (Termination ())
  | Both DmdEnv' (Termination ()) DmdEnv' (Termination ())

dnf :: DmdEnv' -> DmdEnv''
dnf (DmdEnv'.Lit env) = DmdEnv''.Lit env                               -- hypothetical disambiguating syntax
dnf (DmdEnv'.Or fv1 t1 fv2 t2) = DmdEnv''.Or (dnf fv1) t1 (dnf fv2) t2
dnf (DmdEnv'.Both fv1 t1 fv2 t2) = distr (dnf fv1) t1 (dnf fv2) t2 -- case in point
  where
    -- both env1 with defaultDmd t1 into every Lit of fv2
    distr (DmdEnv''.Lit env1) t1 fv2 t2 = undefined
    -- both env2 with defaultDmd t2 into every Lit of fv1
    distr fv1 t1 (DmdEnv''.Lit env2) t2 = undefined
    -- Not exactly sure how to compute this. 
    -- This case leads to the exponential explosion, 
    -- as we have to combine every Lit in fv1 with every Lit in fv2.
    -- This gets insane as we have to keep in mind which `Termination` to use for each point
    -- in each `Lit` separately.
    distr DmdEnv''.Or{} t1 DmdEnv''.Or{} t2 = undefined

dnf mirrors how bothDmdEnv' would have to change in the current implementation to get to the DNF-like version.

In fact, the currently working prototype of the first proposal had a complicated bug in the `lookupDmdTree` implementation, which forced me to think about how the lookup of a single point can potentially access *all* Termination () tags on the path to the Lit it is contained in. This also applies to the implementation of dnf: I'm pretty sure the step that distributes Both over both subtrees needs to model all points separately to know which Termination to both with. Sorry if this is a little incomprehensible without an example, but this is already pretty deep down the rabbit hole.

The bottom line is that some DNF-like structure is still possible, but in addition to the incurred exponential space complexity, there's also the issue of how to distribute Both nodes into the leaves of the Or-tree while keeping the implementation sane.

Evaluation

Both/Or trees, grafted at MCRAs

The current working prototype of the first proposal (progress tracked on the `and-or` branch) passes ./validate and nofib.

GHC is built with 0.1% more allocations and 7.1% (!) more executed instructions (cachegrind). There may still be potential in optimizing the DmdTree (called DmdEnv' on this page) representation in its smart constructors (bothDmdTree, lubDmdTree) to avoid big trees.

Running nofib revealed two benchmarks where allocation changed: fft2 with -0.1% and transform with -0.0%. Changes in counted instructions were all around ±0.0%, with the exception of fft2 (-0.2%). Seems like a net gain, but a pretty insignificant one, especially compared to the significant increase in compilation time.

Both/Or trees, pushed into Or

The implementation of the second proposal is tracked on the `and-or-distr` branch. This was a minimal change compared to the single graft point solution, but with great repercussions:

Compilation of modules with big records (looking at you, `Distribution.Types.BuildInfo`) eats up all resources on my machine without coming to an end. Given the minimally invasive change, this can only be due to combinatorial explosion. Some debugging pinned this down to derived Eq and Show instances, where the generated DmdTrees grow bigger than e.g. 10000 nodes. I'm not entirely sure what causes the duplication that leads to the slowdown, so I tried falling back to the status quo for dictionary Ids (like we do for unlifted lets anyway), to no avail.

DNF

When trying to implement this, I stumbled over the problems wrt. Termination outlined above. Also considering that the space complexity of this approach will be even less predictable, I've put the implementation on ice for now.
