I'll look at this again when someone finds that SpecConstr really isn't doing the Right Thing for them. My guess is that this will happen either in array fusion for nested data parallelism, or in the use of stream fusion for lists (Coutts et al ICFP'07).
Meanwhile, here is the last message I had from Roman on the subject, which summarises what he did to make SpecConstr work reasonably well for their ICFP paper.
Specialising on constants
Consider the following program (admittedly contrived, but we get this kind of code with stream fusion):
lvl = Just True

foo :: Maybe Bool -> Int -> Int
foo (Just True) i = i
foo _           i = foo lvl i
SpecConstr doesn't optimise this even though it is supposed to specialise on global variables. If I understand correctly, the reason for this is the following test in argToPat:
-- Check if the argument is a variable that
-- is in scope at the function definition site
-- It's worth specialising on this if
--   (a) it's used in an interesting way in the body
--   (b) we know what its value is
argToPat in_scope con_env (Var v) arg_occ
  | not (isLocalId v) || v `elemInScopeSet` in_scope,
    case arg_occ of { UnkOcc -> False; other -> True },  -- (a)
    isValueUnfolding (idUnfolding v)                      -- (b)
  = return (True, Var v)
The problem with this is that v might have been cloned (via extendBndr*) and in that case, it won't have an unfolding. That is precisely what happens in the above example, I believe.
I solved this by explicitly recording which binders have a known value
but you might have a better idea.
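A toy model of that idea, with made-up names (GHC's SpecConstr keeps a similar ValueEnv, mentioned further down in this ticket, but its actual shape differs):

module ValueEnvToy where

import qualified Data.Map as M

-- Instead of relying on a binder's unfolding (which is lost when the binder
-- is cloned), carry a side map from binders to the values they were bound
-- to; the map travels with the environment, so clones keep the information.
type Binder = String

data KnownValue
  = KnownCon String [Binder]   -- bound to a saturated constructor application
  | KnownLam                   -- bound to a lambda

type ValueEnv = M.Map Binder KnownValue

-- Extend the environment when going under  let v = <value> in ...
bindValue :: Binder -> Maybe KnownValue -> ValueEnv -> ValueEnv
bindValue v (Just val) env = M.insert v val env
bindValue _ Nothing    env = env

-- The argToPat-style question: do we know this argument's value?
knowsValue :: ValueEnv -> Binder -> Bool
knowsValue env v = M.member v env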
The threshold
I don't like the spec-threshold idea at all. In fact, it was the first thing I had to disable for the ICFP paper since it just doesn't make sense for the kind of code stream fusion generates. It can produce some very large loops which *must* be specialised; indeed, SpecConstr will usually decrease the program size.
Specialising on lambdas
Specialising on lambdas is absolutely crucial for stream fusion (for nested lists). I completely agree that the direct approach was too fragile. What I did was a rather bad hack but it worked.
I extended FloatOut (or, rather, SetLevels) such that it would float
out lambdas in recursive arguments, i.e., given
foo x = ... foo (C (\x. e)) ...
it would produce
f a1 ... an x = e
foo x = ... foo (C (f a1 ... an)) ...
where a1,...,an are the free variables of e.
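For concreteness, here is what that floating could look like on a small made-up example (all definitions are hypothetical; in reality SetLevels rewrites foo in place, and the separate names below exist only so both versions fit side by side):

module FloatOutExample where

data C = C (Int -> Int)

-- Before floating: the recursive argument contains a lambda whose free
-- variables are a and b.
foo :: Int -> Int -> C -> Int
foo a b (C g)
  | a > 100   = g b
  | otherwise = foo (a + 1) b (C (\x -> a * x + b))

-- After floating: the lambda body has become a top-level function applied
-- to its free variables, so the recursive call now passes a partial
-- application of a global, something SpecConstr can be taught to match.
f :: Int -> Int -> Int -> Int
f a b x = a * x + b

fooFloated :: Int -> Int -> C -> Int
fooFloated a b (C g)
  | a > 100   = g b
  | otherwise = fooFloated (a + 1) b (C (f a b))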
This floating would happen before every SpecConstr pass (see below).
I extended SpecConstr (very hackishly) to specialise on partial
applications of global variables. For the above, it would produce
foo' a1 ... an = ...
and the rewrite rule
forall a1 ... an. foo (C (f a1 ... an)) = foo' a1 ... an
Now, f is used directly in foo'. We have a bad staging problem here,
however: we definitely want to inline f in foo' (i.e., it should get
an INLINE pragma) but not in foo (at least, not before the rule has
fired). I solved this by a terrible hack involving core notes but
clearly, something better is needed here. Note that we might
generate several specialisations which use the same f and we want to
inline the latter in every specialisation.
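At the source level, the staging described here corresponds to the usual phase-control idiom on INLINE and RULES. A sketch with stand-in definitions (all names and bodies are made up, and in reality the rule is generated internally by SpecConstr rather than written by hand):

module StagingSketch where

data C = C (Int -> Int)

-- Stand-in for the floated-out lambda body.
f :: Int -> Int -> Int -> Int
f a1 a2 x = a1 * x + a2
{-# INLINE [0] f #-}     -- keep f unexpanded until the final simplifier phase

-- Stand-in for the big recursive loop; NOINLINE so the rule can match it.
foo :: C -> Int -> Int
foo (C g) n = g n
{-# NOINLINE foo #-}

-- The specialisation for the call pattern foo (C (f a1 a2)); here f is
-- applied directly, so inlining it is exactly what we want.
foo' :: Int -> Int -> Int -> Int
foo' a1 a2 n = f a1 a2 n

-- The rule is active only before the final phase, and f inlines only in the
-- final phase, so the rule gets its chance to fire before f disappears.
{-# RULES
"foo/spec-lambda" [~0] forall a1 a2. foo (C (f a1 a2)) = foo' a1 a2
  #-}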
Finally, inlining f in foo' and simplifying might expose further
opportunities for SpecConstr. For the ICFP paper, I implemented
a FloatOut/SpecConstr/Simplify loop. Stream fusion with nested lists
requires as many iterations as there are nesting levels (each
iteration would peel away one nesting level). Note that the current
specLoop doesn't help here because we really have to do full
simplification (including inlining).
That was basically all I did. We didn't really use the static argument
transformation because specialising on global variables essentially has
the same effect.
Do we have a place for the kind of knowledge buried in simonpj's comment above? Or a ticket type? (My apologies if I could have found this info elsewhere.)
To have a concrete example to work on, here is stream-fusion.hs, which contains a minimal stream fusion harness using just singletonS, enumFromToS, sumS, mapS and concatMapS. There are three example functions ex{1,2,3} with increasing nesting of concatMapS that should all reduce to the same function goal. More examples can follow once we get these running.
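For readers without the attachment, here is a minimal sketch of what such a harness could look like. Everything in it is an assumption: the constructor layout is chosen to match the Core fragments quoted further down, and the combinator definitions and ex1 are made up, so the real stream-fusion.hs may well differ in detail.

{-# LANGUAGE ExistentialQuantification, BangPatterns #-}
module StreamFusionSketch where

-- Step s a carries the next state first, then the element, matching the
-- shape of the Core quoted later (Yield @ Bool @ Int False ...).
data Step s a = Stop | Yield s a
data Stream a = forall s. MkStream (s -> Step s a) s

singletonS :: a -> Stream a
singletonS x = MkStream step True
  where
    step True  = Yield False x
    step False = Stop

enumFromToS :: Int -> Int -> Stream Int
enumFromToS lo hi = MkStream step lo
  where
    step i | i > hi    = Stop
           | otherwise = Yield (i + 1) i

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (MkStream step s0) = MkStream step' s0
  where
    step' s = case step s of
                Stop       -> Stop
                Yield s' x -> Yield s' (f x)

-- The combinator that gives SpecConstr trouble: the inner stream, including
-- its step function, is captured in the state of the outer loop.
concatMapS :: (a -> Stream b) -> Stream a -> Stream b
concatMapS f (MkStream step s0) = MkStream step' (s0, Nothing)
  where
    step' (s, Nothing) =
      case step s of
        Stop       -> Stop
        Yield s' x -> step' (s', Just (f x))
    step' (s, Just (MkStream istep is)) =
      case istep is of
        Stop        -> step' (s, Nothing)
        Yield is' y -> Yield (s, Just (MkStream istep is')) y

sumS :: Stream Int -> Int
sumS (MkStream step s0) = go 0 s0
  where
    go !acc s = case step s of
                  Stop       -> acc
                  Yield s' x -> go (acc + x) s'

-- One level of concatMapS nesting; ex2/ex3 would nest further, and the
-- attachment's goal is the hand-written loop they should all reduce to.
ex1 :: Int -> Int
ex1 n = sumS (concatMapS (\x -> enumFromToS 1 x) (enumFromToS 1 n))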
Currently, GHC HEAD (8.5, 8529fbba309cd692bbbb0386321515d05a6ed256) produces this infamous piece of Core (-ddump-spec) for ex1. The goal is to specialise go for the call-pattern including a lambda here.
Before I found this ticket, I basically re-implemented part of what simonpj did, resulting in this commit: I tracked CallOccurrences similarly to ScrutOccs.
With this specialisation for lambdas, things currently look like this. The function has been specialised alright, but the free variables of said lambda, which includes the constant Step, are passed to the specialisation as arguments.
We want to specialise for this new Call pattern! However, without -fspec-constr-keen, SpecConstr will only fire when it finds a matching ScrutOcc. Looking at the output of -ddump-spec, the corresponding ScrutOcc only becomes visible in the specialised RHS, but we currently only specialise the original RHS. Even then, I imagine that in the general case we need a simplifier run in between to reliably detect that we scrutinise the new parameter. But that's not actually a problem, because we can use GHC.Types.SPEC to tell the compiler to specialise without seeing an ArgOcc.
Still, GHC needs to see a Call pattern with that constant argument, which will emerge here, but only after the corresponding RULEs fired.
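For reference, the GHC.Types.SPEC hint works by threading a SPEC-typed argument through the loop, as the vector library does. A sketch, reusing the Stream and Step types from the harness sketch above (whether the attached harness does the same is an assumption):

import GHC.Types (SPEC(..))

-- The same sumS as in the sketch above, now taking a strict SPEC argument:
-- SpecConstr specialises such loops on every call pattern it sees, even
-- without a matching ArgOcc.
sumS :: Stream Int -> Int
sumS (MkStream step s0) = go SPEC 0 s0
  where
    go !_ !acc s =
      case step s of
        Stop       -> acc
        Yield s' x -> go SPEC (acc + x) s'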
Here are some ideas:
1. We can query exprFreeVarsList here and see which free variables of the lambda have a known constant value and include these in the specialisation. This was somewhat fruitless so far for this case, as the free variable for the Yield ... wasn't included in the ValueEnv. I'm not sure this is even enough to reach the fixed point in all cases.
2. Specialise the specialisations (and fix anything non-terminating), plus mini simplifier passes and a mini RULE engine. I'm uncomfortable with that much code duplication.
3. Include call-pattern specialisation in the Simplifier, as a more radical change. On an abstract level this seems reasonable: specialisation is a more modest form of inlining with code size in mind, one that even works for recursive functions. It could chew on things the inliner gave up on because of code size requirements (even non-recursive functions).
Another way of thinking about this: think in terms of defunctionalisation.
Consider this higher order function
let f :: (Int -> Bool) -> Int -> Char
    f g x = ....(g e)...
in ...(f (\x.y+x))....(f (\v.v*p*q))....(f h)...
Now defunctionalise by making a version of f that takes
a data structure as its argument:
data G_Fun = G1 Int          -- \x.y+x
           | G2 Int Int      -- \v.v*p*q
           | G3 (Int->Int)   -- Escape hatch

applyG :: G_Fun -> Int -> Bool
applyG (G1 y)   = \x.y+x
applyG (G2 p q) = \v.v*p*q
applyG (G3 g)   = g

let f' :: G_Fun -> Int -> Char
    f' ga x = ...(applyG ga e)...
in ...(f' (G1 y))...(f' (G2 p q))...(f' (G3 h))
(I guess you could do this via a w/w kind of transformation, but for
now it's purely hypothetical.)
Now we are back in the land of data-constructors, where SpecConstr thrives.
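To make the sketch concrete, here is a compiling version with the holes filled in arbitrarily (bodies, values and the Int -> Int function type are all made up so that the example type-checks; it is not meant to match the schematic above exactly):

module DefuncSketch where

-- Captured free variables of the lambdas at f's call sites (values made up).
y, p, q :: Int
y = 1
p = 2
q = 3

h :: Int -> Int
h = negate

-- The original higher-order function (body made up).
f :: (Int -> Int) -> Int -> Char
f g x = if g x > 0 then 'a' else 'b'

callers :: String
callers = [f (\x -> y + x) 10, f (\v -> v * p * q) 10, f h 10]

-- Defunctionalised: one constructor per lambda shape, plus an escape hatch.
data G_Fun
  = G1 Int            -- \x -> y + x      (captures y)
  | G2 Int Int        -- \v -> v * p * q  (captures p and q)
  | G3 (Int -> Int)   -- escape hatch for unknown functions

applyG :: G_Fun -> Int -> Int
applyG (G1 y')    = \x -> y' + x
applyG (G2 p' q') = \v -> v * p' * q'
applyG (G3 g)     = g

-- f with its functional argument replaced by the data structure.
f' :: G_Fun -> Int -> Char
f' ga x = if applyG ga x > 0 then 'a' else 'b'

callers' :: String
callers' = [f' (G1 y) 10, f' (G2 p q) 10, f' (G3 h) 10]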
Suppose the call is actually
...(f' (G1 (Yield e1 e2 e3)))...
Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))?
It depends how much f' scrutinises its argument. And you can see that from
what applyG does.
I think you could follow all this reasoning without actually creating G_Fun etc.
Thanks for your insights! I have to read up on defunctionalisation, but what you suggest is basically what I had in mind in idea 1 above: query the free variables of the lambda and see if they are in the ValueEnv (meaning their RHSs are in WHNF, too). For some reason they weren't present where I expected them; I'm looking into this now.
Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))? It depends how much f' scrutinises its argument. And you can see that from what applyG does.
Well, I'm not sure we can see that without running some kind of simplification first. But with SPEC, the matching ArgOcc is irrelevant, so we can look into this later. I conveniently left CallOcc with an ArgOcc field for exactly that case, i.e. when the result of the call is scrutinised.
I'll go and see if I can get the free vars of the lambda into the ValueEnv.
The reason why I couldn't find the lambda's free vars in the ValueEnv earlier is that the matching argToPat case didn't add them before. That's a problem for our specific case, where the free vars of the lambda are bound within the call-pattern, e.g. Just (let x = 4 in \y. x + y).
With that fix in place, the free var values are detected just fine and guided by the ConVal case I could implement the specialisation code. From what I can tell, the generated specialisation looks great, but it isn't called anywhere because now the rules no longer fire.
My particular solution would just introduce let bindings into the lambda body that re-bind things we find to be values (similar to how arguments of a data con are handled). Example:
let { lvl_s4md = *# sc_s4rb sc_s4rb
      lvl_s4lL = I# lvl_s4md
      lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream @ Int @ Bool
  (\ (ds_d2Fz :: Bool) ->
     case ds_d2Fz of { False -> Stop @ Bool @ Int; True -> lvl_s4ll })
  True

==>

let { lvl_s4md = *# sc_s4rb sc_s4rb
      lvl_s4lL = I# lvl_s4md
      lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream @ Int @ Bool
  (\ (ds_d2Fz :: Bool) ->
     let { lvl_s4ll = Yield @ Bool @ Int False (I# sc_s4md) } in
     case ds_d2Fz of { False -> Stop @ Bool @ Int; True -> lvl_s4ll })
  True
This is kind of a manual float-in of all bindings that we know are values and that have a corresponding ScrutOcc (which we ignore at the moment).
This results in a much better specialisation for Just (MkStream <lam> True), I believe (-ddump-simpl):
A quick update: ex1 is now optimised to code similar to the above, without resorting to a forced SPEC. For example, inference of ArgOccs is now much better, because it looks at occs from specialised RHSs (specialising for lambdas gives rise to new occs). This entailed a rewrite of the spec loop. I also had to pass on occs from recursive calls to achieve something like the static argument transformation.
I'll write things up in a wiki page once I'm done trying to optimise ex{2,3}. This is the code currently generated for ex1:
-flate-dmd-anal gets rid of the superfluous $s$wgo_s4or, but what pass is responsible for contracting the recursive group into a single binding (by inlining until we hit the loopbreaker)?