I'll look at this again when someone finds that SpecConstr really isn't doing the Right Thing for them. My guess is that this will happen either in array fusion for nested data parallelism, or in the use of stream fusion for lists (Coutts et al ICFP'07).
Meanwhile, here is the last message I had from Roman on the subject, which summarises what he did to make SpecConstr work reasonably well for their ICFP paper.
Specialising on constants
Consider the following program (admittedly contrived, but we get this kind of code with stream fusion):
lvl = Just True

foo :: Maybe Bool -> Int -> Int
foo (Just True) i = i
foo _           i = foo lvl i
SpecConstr doesn't optimise this even though it is supposed to specialise on global variables. If I understand correctly, the reason for this is the following test in argToPat:
-- Check if the argument is a variable that
-- is in scope at the function definition site
-- It's worth specialising on this if
--   (a) it's used in an interesting way in the body
--   (b) we know what its value is
argToPat in_scope con_env (Var v) arg_occ
  | not (isLocalId v) || v `elemInScopeSet` in_scope,
    case arg_occ of { UnkOcc -> False; other -> True },  -- (a)
    isValueUnfolding (idUnfolding v)                      -- (b)
  = return (True, Var v)
The problem with this is that v might have been cloned (via extendBndr*) and in that case, it won't have an unfolding. That is precisely what happens in the above example, I believe.
I solved this by explicitly recording which binders have a known value
but you might have a better idea.
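A toy model of that idea, with made-up names (GHC's SpecConstr keeps a similar ValueEnv, mentioned further down in this ticket, but its actual shape differs):

module ValueEnvToy where

import qualified Data.Map as M

-- Instead of relying on a binder's unfolding (which is lost when the binder
-- is cloned), carry a side map from binders to the values they were bound
-- to; the map travels with the environment, so clones keep the information.
type Binder = String

data KnownValue
  = KnownCon String [Binder]   -- bound to a saturated constructor application
  | KnownLam                   -- bound to a lambda

type ValueEnv = M.Map Binder KnownValue

-- Extend the environment when going under  let v = <value> in ...
bindValue :: Binder -> Maybe KnownValue -> ValueEnv -> ValueEnv
bindValue v (Just val) env = M.insert v val env
bindValue _ Nothing    env = env

-- The argToPat-style question: do we know this argument's value?
knowsValue :: ValueEnv -> Binder -> Bool
knowsValue env v = M.member v env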
The threshold
I don't like the spec-threshold idea at all. In fact, it was the first thing I had to disable for the ICFP paper since it just doesn't make sense for the kind of code stream fusion generates. It can produce some very large loops which *must* be specialised; indeed, SpecConstr will usually decrease the program size.
Specialising on lambdas
Specialising on lambdas is absolutely crucial for stream fusion (for nested lists). I completely agree that the direct approach was too fragile. What I did was a rather bad hack but it worked.
I extended FloatOut (or, rather, SetLevels) such that it would float
out lambdas in recursive arguments, i.e., given
foo x = ... foo (C (\x. e)) ...
it would produce
f a1 ... an x = e
foo x = ... foo (C (f a1 ... an)) ...
where a1,...,an are the free variables of e.
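For concreteness, here is what that floating could look like on a small made-up example (all definitions are hypothetical; in reality SetLevels rewrites foo in place, and the separate names below exist only so both versions fit side by side):

module FloatOutExample where

data C = C (Int -> Int)

-- Before floating: the recursive argument contains a lambda whose free
-- variables are a and b.
foo :: Int -> Int -> C -> Int
foo a b (C g)
  | a > 100   = g b
  | otherwise = foo (a + 1) b (C (\x -> a * x + b))

-- After floating: the lambda body has become a top-level function applied
-- to its free variables, so the recursive call now passes a partial
-- application of a global, something SpecConstr can be taught to match.
f :: Int -> Int -> Int -> Int
f a b x = a * x + b

fooFloated :: Int -> Int -> C -> Int
fooFloated a b (C g)
  | a > 100   = g b
  | otherwise = fooFloated (a + 1) b (C (f a b))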
This floating would happen before every SpecConstr pass (see below).
I extended SpecConstr (very hackishly) to specialise on partial
applications of global variables. For the above, it would produce
foo' a1 ... an = ...
and the rewrite rule
forall a1 ... an. foo (C (f a1 ... an)) = foo' a1 ... an
Now, f is used directly in foo'. We have a bad staging problem here,
however: we definitely want to inline f in foo' (i.e., it should get
an INLINE pragma) but not in foo (at least, not before the rule has
fired). I solved this by a terrible hack involving core notes but
clearly, something better is needed here. Note that we might
generate several specialisations which use the same f and we want to
inline the latter in every specialisation.
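At the source level, the staging described here corresponds to the usual phase-control idiom on INLINE and RULES. A sketch with stand-in definitions (all names and bodies are made up, and in reality the rule is generated internally by SpecConstr rather than written by hand):

module StagingSketch where

data C = C (Int -> Int)

-- Stand-in for the floated-out lambda body.
f :: Int -> Int -> Int -> Int
f a1 a2 x = a1 * x + a2
{-# INLINE [0] f #-}     -- keep f unexpanded until the final simplifier phase

-- Stand-in for the big recursive loop; NOINLINE so the rule can match it.
foo :: C -> Int -> Int
foo (C g) n = g n
{-# NOINLINE foo #-}

-- The specialisation for the call pattern foo (C (f a1 a2)); here f is
-- applied directly, so inlining it is exactly what we want.
foo' :: Int -> Int -> Int -> Int
foo' a1 a2 n = f a1 a2 n

-- The rule is active only before the final phase, and f inlines only in the
-- final phase, so the rule gets its chance to fire before f disappears.
{-# RULES
"foo/spec-lambda" [~0] forall a1 a2. foo (C (f a1 a2)) = foo' a1 a2
  #-}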
Finally, inlining f in foo' and simplifying might expose further
opportunities for SpecConstr. For the ICFP paper, I implemented
a FloatOut/SpecConstr/Simplify loop. Stream fusion with nested lists
requires as many iterations as there are nesting levels (each
iteration would peel away one nesting level). Note that the current
specLoop doesn't help here because we really have to do full
simplification (including inlining).
That was basically all I did. We didn't really use the static argument
transformation because specialising on global variables essentially has
the same effect.
Do we have a place for the kind of knowledge buried in simonpj's comment above? Or a ticket type? (My apologies if I could have found this info elsewhere.)
To have a concrete example to work on, here is stream-fusion.hs, which contains a minimal stream fusion harness using just singletonS, enumFromToS, sumS, mapS and concatMapS. There are three example functions ex{1,2,3} with increasing nesting of concatMapS that should all reduce to the same function goal. More examples can follow once we get these running.
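For readers without the attachment, here is a minimal sketch of what such a harness could look like. Everything in it is an assumption: the constructor layout is chosen to match the Core fragments quoted further down, and the combinator definitions and ex1 are made up, so the real stream-fusion.hs may well differ in detail.

{-# LANGUAGE ExistentialQuantification, BangPatterns #-}
module StreamFusionSketch where

-- Step s a carries the next state first, then the element, matching the
-- shape of the Core quoted later (Yield @ Bool @ Int False ...).
data Step s a = Stop | Yield s a
data Stream a = forall s. MkStream (s -> Step s a) s

singletonS :: a -> Stream a
singletonS x = MkStream step True
  where
    step True  = Yield False x
    step False = Stop

enumFromToS :: Int -> Int -> Stream Int
enumFromToS lo hi = MkStream step lo
  where
    step i | i > hi    = Stop
           | otherwise = Yield (i + 1) i

mapS :: (a -> b) -> Stream a -> Stream b
mapS f (MkStream step s0) = MkStream step' s0
  where
    step' s = case step s of
                Stop       -> Stop
                Yield s' x -> Yield s' (f x)

-- The combinator that gives SpecConstr trouble: the inner stream, including
-- its step function, is captured in the state of the outer loop.
concatMapS :: (a -> Stream b) -> Stream a -> Stream b
concatMapS f (MkStream step s0) = MkStream step' (s0, Nothing)
  where
    step' (s, Nothing) =
      case step s of
        Stop       -> Stop
        Yield s' x -> step' (s', Just (f x))
    step' (s, Just (MkStream istep is)) =
      case istep is of
        Stop        -> step' (s, Nothing)
        Yield is' y -> Yield (s, Just (MkStream istep is')) y

sumS :: Stream Int -> Int
sumS (MkStream step s0) = go 0 s0
  where
    go !acc s = case step s of
                  Stop       -> acc
                  Yield s' x -> go (acc + x) s'

-- One level of concatMapS nesting; ex2/ex3 would nest further, and the
-- attachment's goal is the hand-written loop they should all reduce to.
ex1 :: Int -> Int
ex1 n = sumS (concatMapS (\x -> enumFromToS 1 x) (enumFromToS 1 n))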
Currently, GHC HEAD (8.5, 8529fbba309cd692bbbb0386321515d05a6ed256) produces this infamous piece of Core (-ddump-spec) for ex1. The goal is to specialise go for the call-pattern including a lambda here.
Before I found this ticket, I basically re-implemented part of what simonpj did, resulting in this commit: I tracked CallOccurrences similarly to ScrutOccs.
With this specialisation for lambdas, things currently look like this. The function has been specialised alright, but the free variables of said lambda, which includes the constant Step, are passed to the specialisation as arguments.
We want to specialise for this new Call pattern! However, without -fspec-constr-keen, SpecConstr will only fire when it finds a matching ScrutOcc. Looking at the output of -ddump-spec, the corresponding ScrutOcc only becomes visible in the specialised RHS, but we currently only specialise the original RHS. Even then, I imagine that in the general case we need a simplifier run in between to reliably detect that we scrutinise the new parameter. But that's not actually a problem, because we can use GHC.Types.SPEC to tell the compiler to specialise without seeing an ArgOcc.
Still, GHC needs to see a Call pattern with that constant argument, which will emerge here, but only after the corresponding RULEs fired.
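For reference, the GHC.Types.SPEC hint works by threading a SPEC-typed argument through the loop, as the vector library does. A sketch, reusing the Stream and Step types from the harness sketch above (whether the attached harness does the same is an assumption):

import GHC.Types (SPEC(..))

-- The same sumS as in the sketch above, now taking a strict SPEC argument:
-- SpecConstr specialises such loops on every call pattern it sees, even
-- without a matching ArgOcc.
sumS :: Stream Int -> Int
sumS (MkStream step s0) = go SPEC 0 s0
  where
    go !_ !acc s =
      case step s of
        Stop       -> acc
        Yield s' x -> go SPEC (acc + x) s'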
Here are some ideas:
1. We can query exprFreeVarsList here and see which free variables of the lambda have a known constant value and include these in the specialisation. This was somewhat fruitless so far for this case, as the free variable for the Yield ... wasn't included in the ValueEnv. I'm not sure this is even enough to reach the fixed point in all cases.
2. Specialise the specialisations (and fix anything non-terminating), plus mini simplifier passes and a mini RULE engine. I'm uncomfortable with that much code duplication.
3. Include call-pattern specialisation in the Simplifier, as a more radical change. On an abstract level this seems reasonable: specialisation is a more modest form of inlining with code size in mind, one that even works for recursive functions. It could chew on things the inliner gave up on because of code size requirements (even non-recursive functions).
Another way of thinking about this: think in terms of defunctionalisation.
Consider this higher order function
let f :: (Int -> Bool) -> Int -> Char
    f g x = ....(g e)...
in ...(f (\x.y+x))....(f (\v.v*p*q))....(f h)...
Now defunctionalise by making a version of f that takes
a data structure as its argument:
data G_Fun = G1 Int          -- \x.y+x
           | G2 Int Int      -- \v.v*p*q
           | G3 (Int->Int)   -- Escape hatch

applyG :: G_Fun -> Int -> Bool
applyG (G1 y)   = \x.y+x
applyG (G2 p q) = \v.v*p*q
applyG (G3 g)   = g

let f' :: G_Fun -> Int -> Char
    f' ga x = ...(applyG ga e)...
in ...(f' (G1 y))...(f' (G2 p q))...(f' (G3 h))
(I guess you could do this via a w/w kind of transformation, but for
now it's purely hypothetical.)
Now we are back in the land of data-constructors, where SpecConstr thrives.
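To make the sketch concrete, here is a compiling version with the holes filled in arbitrarily (bodies, values and the Int -> Int function type are all made up so that the example type-checks; it is not meant to match the schematic above exactly):

module DefuncSketch where

-- Captured free variables of the lambdas at f's call sites (values made up).
y, p, q :: Int
y = 1
p = 2
q = 3

h :: Int -> Int
h = negate

-- The original higher-order function (body made up).
f :: (Int -> Int) -> Int -> Char
f g x = if g x > 0 then 'a' else 'b'

callers :: String
callers = [f (\x -> y + x) 10, f (\v -> v * p * q) 10, f h 10]

-- Defunctionalised: one constructor per lambda shape, plus an escape hatch.
data G_Fun
  = G1 Int            -- \x -> y + x      (captures y)
  | G2 Int Int        -- \v -> v * p * q  (captures p and q)
  | G3 (Int -> Int)   -- escape hatch for unknown functions

applyG :: G_Fun -> Int -> Int
applyG (G1 y')    = \x -> y' + x
applyG (G2 p' q') = \v -> v * p' * q'
applyG (G3 g)     = g

-- f with its functional argument replaced by the data structure.
f' :: G_Fun -> Int -> Char
f' ga x = if applyG ga x > 0 then 'a' else 'b'

callers' :: String
callers' = [f' (G1 y) 10, f' (G2 p q) 10, f' (G3 h) 10]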
Suppose the call is actually
...(f' (G1 (Yield e1 e2 e3)))...
Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))?
It depends how much f' scrutinises its argument. And you can see that from
what applyG does.
I think you could follow all this reasoning without actually creating G_Fun etc.
Thanks for your insights! I have to read up on defunctionalisation, but what you suggest is basically what I had in mind in idea 1 above: query the free variables of the lambda and see if they are in the ValueEnv (meaning their RHSs are in WHNF, too). For some reason they weren't present where I expected them; I'm looking into this now.
Should we specialise on (G1 x) or on the deeper pattern (G1 (Yield a b c))? It depends how much f' scrutinises its argument. And you can see that from what applyG does.
Well, I'm not sure we can see that without running some kind of simplification first. But with SPEC, the matching ArgOcc is irrelevant, so we can look into this later. I conveniently left CallOcc with an ArgOcc field for exactly that case, i.e. when the result of the call is scrutinised.
I'll go and see if I can get the free vars of the lambda into the ValueEnv.
The reason why I couldn't find the lambda's free vars in the ValueEnv earlier is that the matching argToPat case didn't add them before. That's a problem for our specific case, where the free vars of the lambda are bound within the call-pattern, e.g. Just (let x = 4 in \y. x + y).
With that fix in place, the free var values are detected just fine and guided by the ConVal case I could implement the specialisation code. From what I can tell, the generated specialisation looks great, but it isn't called anywhere because now the rules no longer fire.
My particular solution would just introduce let bindings into the lambda body that re-bind things we find to be values (similar to how arguments of a data con are handled). Example:
let { lvl_s4md = *# sc_s4rb sc_s4rb
      lvl_s4lL = I# lvl_s4md
      lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream @ Int @ Bool
  (\ (ds_d2Fz :: Bool) ->
     case ds_d2Fz of { False -> Stop @ Bool @ Int; True -> lvl_s4ll })
  True

==>

let { lvl_s4md = *# sc_s4rb sc_s4rb
      lvl_s4lL = I# lvl_s4md
      lvl_s4ll = Yield @ Bool @ Int False lvl_s4lL } in
MkStream @ Int @ Bool
  (\ (ds_d2Fz :: Bool) ->
     let { lvl_s4ll = Yield @ Bool @ Int False (I# sc_s4md) } in
     case ds_d2Fz of { False -> Stop @ Bool @ Int; True -> lvl_s4ll })
  True
This is kind of a manual float-in of all bindings that we know are values and that have a corresponding ScrutOcc (which we ignore at the moment).
This results in a much better specialisation for Just (MkStream <lam> True), I believe (-ddump-simpl):
A quick update: ex1 is now optimised to code similar to the above, without resorting to a forced SPEC. For example, inference of ArgOccs is now much better, because it looks at occs from specialised RHSs (specialising for lambdas gives rise to new occs). This entailed a rewrite of the spec loop. I also had to pass on occs from recursive calls to achieve something like the static argument transformation.
I'll write things up in a wiki page once I'm done trying to optimise ex{2,3}. This is the code currently generated for ex1:
-flate-dmd-anal gets rid of the superfluous $s$wgo_s4or, but what pass is responsible for contracting the recursive group into a single binding (by inlining until we hit the loopbreaker)?