Opened 5 years ago

Last modified 2 years ago

#4941 new task

SpecConstr generates functions that do not use their arguments

Reported by: simonpj Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.0.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Consider this function:

f :: Int -> (Bool,Bool) -> Bool -> Bool
f 0 x     y = y
f n (p,q) y = f (n-1) (p,q) q

SpecConstr does a reasonable job, but ends up with a function like this:

T4908a.f_$s$wf =
  \ (sc_sp4 :: GHC.Prim.Int#)
    (sc1_sp5 :: GHC.Types.Bool)
    (sc2_sp6 :: GHC.Types.Bool)
    (sc3_sp7 :: GHC.Types.Bool) ->
    case sc_sp4 of ds_Xom {
      __DEFAULT ->
        T4908a.f_$s$wf (GHC.Prim.-# ds_Xom 1) sc1_sp5 sc2_sp6 sc2_sp6;
      0 -> sc3_sp7
    }

Note that sc1_sp5 is passed around the loop but never used.
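Ideally the specialisation would drop that argument altogether. At source level the loop we want amounts to something like this (a sketch with made-up names, not compiler output):

fSpec :: Int -> Bool -> Bool -> Bool
fSpec 0 _ y = y                   -- the 0 branch returns y (sc3_sp7)
fSpec n q _ = fSpec (n - 1) q q   -- the never-used Bool (sc1_sp5) is gone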

I had a quick go at trying to make SpecConstr cleverer, but absence info requires a fixpoint analysis, which the existing ArgOcc stuff doesn't do. Nor can we rely on absence analysis from earlier in the compiler, because CSE invalidates it.

A possibility would be to run strictness/absence analysis again after SpecConstr, which would pick this up. I'm not sure what other consequences this would have.

So there's an opportunity here, but I'm not sure how much it matters in practice.

Change History (21)

comment:1 Changed 5 years ago by batterseapower

  • Type of failure changed from None/Unknown to Runtime performance bug

I added this after SpecConstr:

        -- Attempt to clean up dead bindings introduced by SpecConstr with another
        -- round of absence analysis and simplification
        runWhen strictness (CoreDoPasses [
                simpl_phase 0 ["pre-final-strictness"] max_iter,
                CoreDoStrictness,
                CoreDoWorkerWrapper,
                CoreDoGlomBinds
                ]),

And it does indeed fix the rubbish left in STUArray-Rewrite2 after #4945 is fixed. Even though the absent arguments in that example are invariant around the inner loop (so keeping them around doesn't cause allocation in the inner loop), the outer loop *can* usefully avoid allocating thunks for those arguments, because it calls the inner loop a few different ways.

Another argument for doing strictness analysis after SpecConstr is that some of the specialisations might be strict in a way that the non-specialised version is not. A concrete example of this is the Tak program I sent you by email yesterday:

module Tak (tak) where

tak' :: Int -> Int -> Int -> Bool -> Int
tak' x y z b = if b
              then z
              else tak (let x' = x-1 in tak' x' y z (not (y < x')))
                       (let y' = y-1 in tak' y' z x (not (z < y')))
                       (let z' = z-1 in tak' z' x y (not (x < z')))

tak :: Int -> Int -> Int -> Int
tak x y z = tak' x y z (not (y < x))

If we do strictness analysis after SpecConstr (compile with -O2 -fspec-constr-count=4), we are able to totally unbox all the Ints to Int#s in this program, because the specialisations specialise tak' on whether the last (boolean) argument is True or False!
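Concretely, the two call-pattern specialisations amount to something like this at source level (my own sketch with invented names, not compiler output):

tak'T :: Int -> Int -> Int -> Int     -- specialisation for the call pattern (tak' x y z True)
tak'T _ _ z = z

tak'F :: Int -> Int -> Int -> Int     -- specialisation for the call pattern (tak' x y z False)
tak'F x y z = tak (let x' = x-1 in tak' x' y z (not (y < x')))
                  (let y' = y-1 in tak' y' z x (not (z < y')))
                  (let z' = z-1 in tak' z' x y (not (x < z')))

With the Bool fixed in each specialisation, the subsequent strictness analysis is what produces the fully unboxed worker shown in the parenthetical remark below.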

Parenthetical remark: actually, even with this change the resulting Tak program contains some stupidity in the form of functions like this one:

$w$stak'1_rr8
  :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
$w$stak'1_rr8 =
  \ (w_sqz :: GHC.Prim.Int#)
    (w1_sqA :: GHC.Prim.Int#)
    (w2_sqB :: GHC.Prim.Int#) ->
    let {
      a_smv [Dmd=Just L] :: GHC.Prim.Int#
      [LclId, Str=DmdType]
      a_smv = GHC.Prim.-# w_sqz 1 } in
    let {
      a1_smt [Dmd=Just L] :: GHC.Prim.Int#
      [LclId, Str=DmdType]
      a1_smt = GHC.Prim.-# w1_sqA 1 } in
    let {
      y_smH [Dmd=Just U(L)] :: GHC.Types.Int
      [LclId, Str=DmdType m]
      y_smH =
        case GHC.Prim.<# w_sqz a1_smt of _ {
          GHC.Types.False -> GHC.Types.I# w2_sqB;
          GHC.Types.True ->
            case $w$stak'1_rr8 w2_sqB w_sqz a1_smt of ww_sqE { __DEFAULT ->
            GHC.Types.I# ww_sqE
            }
        } } in
    let {
      a2_smr [Dmd=Just L] :: GHC.Prim.Int#
      [LclId, Str=DmdType]
      a2_smr = GHC.Prim.-# w2_sqB 1 } in
    let {
      x_smF [Dmd=Just U(L)] :: GHC.Types.Int
      [LclId, Str=DmdType m]
      x_smF =
        case GHC.Prim.<# w1_sqA a2_smr of _ {
          GHC.Types.False -> GHC.Types.I# w_sqz;
          GHC.Types.True ->
            case $w$stak'1_rr8 w_sqz w1_sqA a2_smr of ww_sqE { __DEFAULT ->
            GHC.Types.I# ww_sqE
            }
        } } in
    case GHC.Prim.<# w2_sqB a_smv of _ {
      GHC.Types.False ->
        case y_smH of _ { GHC.Types.I# x1_Xmg ->
        case x_smF of _ { GHC.Types.I# y1_Xmo ->
        case GHC.Prim.<# x1_Xmg y1_Xmo of _ {
          GHC.Types.False -> w1_sqA;
          GHC.Types.True -> $w$stak'1_rr8 w1_sqA x1_Xmg y1_Xmo
        }
        }
        };
      GHC.Types.True ->
        case y_smH of _ { GHC.Types.I# x1_Xmg ->
        case x_smF of _ { GHC.Types.I# y1_Xmo ->
        case GHC.Prim.<# x1_Xmg y1_Xmo of _ {
          GHC.Types.False -> $w$stak'1_rr8 w1_sqA w2_sqB a_smv;
          GHC.Types.True ->
            case $w$stak'1_rr8 w1_sqA w2_sqB a_smv of ww_sqE { __DEFAULT ->
            Tak.$w$stak' ww_sqE x1_Xmg y1_Xmo
            }
        }
        }
        }
    }

You can see that e.g. y_smH allocates an I# box explicitly, which is then explicitly demanded in both branches of the case! What is going on?? This pattern occurs no less than 4 times in the tidied core output.

comment:2 Changed 5 years ago by rl

So there are a few useful things we can do after SpecConstr:

  1. eliminate unused arguments
  2. eliminate duplicate arguments
  3. static argument transformation (?)

Strictness analysis helps with the first of these but not with the other two. It seems that a pass that does either 1 and 2, or all 3, would be quite useful. Or perhaps SpecConstr itself could be extended to do this analysis. Roughly speaking, it could record information about the shape/values of, and the demand on, arguments and free variables, split them into equivalence classes, and then specialise based on that information. My gut feeling is that something like this could do 1-3 and also subsume LiberateCase.
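For reference, item 3 is the usual static argument transformation; a textbook illustration (invented names, nothing to do with this ticket's code):

myMap :: (a -> b) -> [a] -> [b]
myMap f []     = []
myMap f (x:xs) = f x : myMap f xs

-- after SAT: the static argument f is captured once by a local worker
-- instead of being passed on every recursive call
myMapSAT :: (a -> b) -> [a] -> [b]
myMapSAT f = go
  where
    go []     = []
    go (x:xs) = f x : go xs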

comment:3 Changed 5 years ago by batterseapower

Indeed. I found that you need to have a fixed point between SAT and SpecConstr to deal with the output of stream fusion. Similarly there is an argument for having a fixed point between SpecConstr and strictness analysis, because it is possible that strictness analysis (in particular the CPR optimisation) would give rise to further opportunities for SpecConstr. Consider:

f b x = case b of A -> I# 0#; B -> x; C -> f A (x + 1)

After SpecConstr, the (f A z) specialisation has the CPR property, as worker-wrapper will reveal. So if there was some other call:

g x = g (f A x)

After inlining the wrapper we now have a SpecConstr opportunity for (g (I# z)). I wonder how often this would actually occur in practice...

comment:4 Changed 5 years ago by rl

Why do you need SAT for stream fusion?

The problem with pulling strictness analysis and w/w into the loop is that you basically need a SpecConstr/simplify loop with an unbounded number of iterations because strictness and w/w are only useful if you simplify afterwards. I implemented something like this (without strictness analysis) for the stream fusion paper - see the "Specialising on lambdas" bit in #855. It was awfully fragile.

Perhaps it is possible to implement a combined shape/demand analysis which subsumes both the current strictness analysis and the analysis that is done by SpecConstr. Maybe SpecConstr could then subsume worker/wrapper as well.

comment:5 Changed 5 years ago by simonpj

Concerning your parenthetical remark, what would you prefer? Each of those strict lets has two branches and GHC rightly refrains from duplicating all the code of the continuation into each branch.

Simon

comment:6 Changed 5 years ago by batterseapower

Roman: I was thinking about making stream fusion do concatMap properly. Of course you can specialise on lambdas directly as an alternative.

Simon: I would prefer if you eliminated the box for e.g. y_smH, so y_smH binds something of type Int# instead of Int. This would eliminate the allocation and the unboxing in both branches. I don't think this requires code duplication, unless I'm missing something?

comment:7 Changed 5 years ago by simonpj

I've made a new ticket #4957 for the "excessive boxing" point, leaving this ticket to address its original purpose, namely the question of absent parameters.

Simon

comment:8 Changed 5 years ago by simonpj

OK I've fixed #4957. Would you like to try the effect of strictness analysis following SpecConstr now? Is it always a win?

Simon

comment:9 follow-up: Changed 5 years ago by batterseapower

Headline numbers for this change after your #4957 fix:

-------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
-------------------------------------------------------------------------------
            Min          +0.0%     -3.2%     -4.4%     -2.5%     +0.0%
            Max          +0.3%     +4.0%     +4.5%     +4.9%     +0.0%
 Geometric Mean          +0.0%     -0.0%     +0.2%     +0.2%     -0.0%

So a solid reduction in allocations for a few benchmarks (knights, parstof, simple and sphere are the biggest winners). However, some benchmarks get worse. I looked at the worst offender (cryptarithm2) and saw something pretty odd. It looks like the transformation is working correctly but the output code (post simplification) contains dead let-bindings for SpecConstr specialisations that have been worker/wrappered, and those workers have been subsequently inlined into all use sites.

For example, we have a local specialisation within the final body of permute1:

      $sgo_s1lg [InlPrag=INLINE[0], Occ=LoopBreaker!]
        :: Main.Digits
           -> Main.Digits
           -> [(Main.Digits, Main.Digits)]
           -> [(GHC.Types.Int, Main.Digits)]
      [LclId,
       Arity=3,
       Str=DmdType U(SL)LL,
       Unf=Unf{Src=Worker=$w$sgo_s1mv, TopLvl=False, Arity=3, Value=True,
               ConLike=True, Cheap=True, Expandable=True,
               Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=False)
               Tmpl= \ (w_s1lD [Occ=Once!] :: Main.Digits)
                       (w1_s1lI [Occ=Once] :: Main.Digits)
                       (w2_s1lJ [Occ=Once] :: [(Main.Digits, Main.Digits)]) ->
                       case w_s1lD
                       of _ { Main.Digits ww_s1lF [Occ=Once] ww1_s1lG [Occ=Once] ->
                       $w$sgo_s1mv ww_s1lF ww1_s1lG w1_s1lI w2_s1lJ
                       }}]
      $sgo_s1lg =
        \ (w_s1lD :: Main.Digits)
          (w1_s1lI :: Main.Digits)
          (w2_s1lJ :: [(Main.Digits, Main.Digits)]) ->
          case w_s1lD of _ { Main.Digits ww_s1lF ww1_s1lG ->
          $w$sgo_s1mv ww_s1lF ww1_s1lG w1_s1lI w2_s1lJ
          };
      $w$sgo_s1mv [Occ=LoopBreaker]
        :: [GHC.Types.Int]
           -> [(GHC.Types.Char, GHC.Types.Int)]
           -> Main.Digits
           -> [(Main.Digits, Main.Digits)]
           -> [(GHC.Types.Int, Main.Digits)]
      [LclId, Arity=4, Str=DmdType SLLL]
      $w$sgo_s1mv = ... big ...

But $sgo_s1lg does not occur anywhere in the code! These bindings are probably the source of the increased allocation as their closures will be allocated anew for each application of permute1.

I found that changing the final simplification phase to use at least 3 iterations (just like the main simplification phase) did not eliminate these. Is there any obvious explanation for this behaviour?

comment:10 Changed 5 years ago by batterseapower

I think I see the problem. preInlineUnconditionally/postInlineUnconditionally prevent dead let-bindings from being inlined if they have a stable unfolding.

IMHO they should unconditionally inline bindings with IAmDead occurrence information.

comment:11 Changed 5 years ago by batterseapower

Er, no - I was wrong as usual. The issue is right there in the code: $sgo_s1lg is a LoopBreaker for some reason. Weird.

comment:12 Changed 5 years ago by batterseapower

I'm being particularly slow today. The issue was obvious once I looked at the simplified output rather than the tidied core - there is a non-dead function later in that letrec (go_s1gC) that has a RULE that rewrites to $sgo_s1lg.

It is sort of unfortunate that this is keeping the thing it refers to alive even after core tidying, where rules cannot possibly apply any more.

comment:13 Changed 5 years ago by simonpj

Well, if go_s1gC was externally visible (via the interface file), then its RULE could still fire. But if it's not exported, it can't. So

  • CoreTidy could discard RULES and unfoldings for non-externally-visible things
  • CoreTidy could run the occurrence analyser to drop code thereby shown to be dead

But that still doesn't account for some programs allocating more and running slower.

Simon

comment:14 Changed 5 years ago by batterseapower

You are right that the rules need to be kept for at least the interface file. But it is useless to keep around let bindings only referred to by rules at the CorePrep stage - i.e. for code generation.

I fail to see how this does not explain the allocation increase - if my inner loop is allocating a spurious closure for this dead function, doesn't that increase the allocation figures? The dead function is not present in the version of the code without the extra strictness analysis pass.

comment:15 Changed 5 years ago by batterseapower

I've confirmed that the problem I identified was the source of increased allocations in cryptarithm2. I opened #4962 to track progress on that issue.

With my hacky patch for that problem, the new numbers are:

-------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
-------------------------------------------------------------------------------
            Min          +0.0%     -4.7%     -4.8%     -4.9%     +0.0%
            Max          +0.3%     +2.4%     +4.6%     +4.7%     +0.0%
 Geometric Mean          +0.0%     -0.2%     +0.3%     +0.1%     -0.0%

Allocations are only being increased in lift (+0.1%) and para (+2.4%). I'm investigating now.

comment:16 Changed 5 years ago by batterseapower

I've stared at para for a while and can't see anything obviously bad going on to cause the allocation increase.

I did notice one suspicious thing, but I don't think it's actually a problem. Before the extra demand-analysis pass we have something like:

let wild = (a, b)
    $j   = \x y -> case BIG of A -> wild; B -> (x, y)
in if e1 then $j c d else $j e f

CPRish $j gets worker/wrappered with this new invocation of the demand analyser so we get:

let $j = \x y -> case BIG of A -> (# a, b #); B -> (# x, y #)
in if e1 then case $j c d of (# k, l #) -> (k, l) else case $j e f of (# k, l #) -> (k, l)

This causes reboxing if "wild" is mentioned anywhere else, since sharing of the allocation is lost. The wild binding does actually get eliminated by the simplifier. However, I've managed to convince myself that wild was probably actually used Once in the input so this is in fact OK.

So in summary I don't think the nofib results are going to get any better than that. Take it or leave it :-)

Personally I've seen this change making a difference to numerical loops that pass around STUArray. After SpecConstr (which unpacks the STUArray data type) several of the STUArray fields will typically be dead, though we may still want e.g. the MutableByteArray# inside. Eliminating these useless arguments from numerical loops using STUArrays might be a cool enough reason to warrant this change.
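For concreteness, a made-up example of the kind of loop I mean (illustrative only, names invented, not taken from any benchmark):

import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, readArray, writeArray)

-- The STUArray is passed around the recursion and taken apart (inside
-- readArray/writeArray) on every iteration, so SpecConstr has a constructor
-- to unpack; after that, fields other than the MutableByteArray# can turn
-- out to be absent in the specialised worker.
sumAndBump :: STUArray s Int Int -> Int -> Int -> Int -> ST s Int
sumAndBump arr i n acc
  | i >= n    = return acc
  | otherwise = do
      x <- readArray arr i
      writeArray arr i (x + 1)
      sumAndBump arr (i + 1) n (acc + x)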

comment:17 follow-up: Changed 5 years ago by simonpj

FYI: compiling with -ticky is a fantastic way to nail exactly where allocation increases.

comment:18 in reply to: ↑ 17 Changed 5 years ago by batterseapower

Replying to simonpj:

FYI: compiling with -ticky is a fantastic way to nail exactly where allocation increases.

I have just tried this without success. Ticky tells me that a function $wtrim is increasing allocs by 23%, but I have traced it through manually and I just can't see how that can be happening.

(I infer that the -ticky output includes those allocations made within a let-no-escape in the allocation count of their caller).

To me it looks like the optimised code is never worse, and is sometimes an improvement, in the number-of-allocations metric - and the optimisation reduces the number of let-no-escapes to boot. So I'm really flummoxed as to why allocations increase. It doesn't seem to be simply that closure sizes have increased, because ALLOC_HEAP_ctr rises as well. Perhaps it would be clearer if ticky gave me a breakdown of allocation either by the type of thing being allocated or by the let binding from which it originated.

comment:19 Changed 4 years ago by simonpj

See also #5302

comment:20 in reply to: ↑ 9 ; follow-up: Changed 2 years ago by nfrisby

Replying to batterseapower:

I looked at the worst offender (cryptarithm2) and saw something pretty odd. It looks like the transformation is working correctly but the output code (post simplification) contains dead let-bindings for SpecConstr specialisations that have been worker/wrappered, and those workers have been subsequently inlined into all use sites.

Replying to batterseapower:

I fail to see how this does not explain the allocation increase - if my inner loop is allocating a spurious closure for this dead function, doesn't that increase the allocation figures? The dead function is not present in the version of the code without the extra strictness analysis pass.

These dead SpecConstr'd and ww'd closures are still allocated in cryptarithm2.

With the late demand analysis, cryptarithm2 has +8% allocation. These dead closures account for 4 of those percentage points (according to ticky). The other 4 points are reboxing: 3 words, ~4000 times, for calls to those dead lets' progeny, no less.

I used the pipeline: SpecConstr, simpl, stranal, wwsplit, simpl.

I have confirmed that the variables are dead in the STG.

I have not investigated why the fix for #4962 doesn't remove these lets.

I altered ticky to count the heap used when allocating each closure.

    Entries     Allocs    Alloc'd  Arity  Kinds        Function
--------------------------------------------------------------------------------
          0          0       2056      3    SSL        $sgo{v s2X7} (main:Main)      
          0          0       8262      3    SSL        $sgo{v s2Qx} (main:Main)      
          0          0      13460      3    ESL        $sgo1{v s2Zq} (main:Main)     
          0          0      25288      3    ESL        $sgo1{v s2R0} (main:Main)     
          0          0      75814      3    SSL        $sgo{v s2Ui} (main:Main) 

3376149 ALLOC_HEAP_tot

With just -O2, it's 3125158 ALLOC_HEAP_tot.

Otherwise, the late demand analysis looks pretty good. Column header = <libraries options>/<nofib test options>. A "____" entry is 0.0%. In this notation, "late-dmd" also implies -O2.

Allocations

-------------------------------------------------------------------------------
        Program                O2/O2     late-dmd/O2    late-dmd/late-dmd
-------------------------------------------------------------------------------
   cryptarithm2             25078168           +____           +8.0%
       nucleic2             98331744           +____           +3.2%

       cichelli             80310632           +____          -22.9%
          fasta            401159024           -9.1%           -9.1%
         fulsom            321427240           +____           -2.6%
   k-nucleotide           4125102928           -____           -4.8%
        knights              2037984           +____           -3.7%
        mandel2              1041840           +____          -21.4%
        parstof              3103208           +____           -1.4%
reverse-complem            155188304          -12.8%          -12.8%
         simple            226412800           -____           -1.0%

I'm investigating nucleic2 at the moment; it's changing something currently not specifically tracked by ticky, so I'm resurrecting more counters. There's nothing obvious to me in the STG, at least.

I don't yet have trustworthy runtime numbers.

comment:21 in reply to: ↑ 20 Changed 2 years ago by nfrisby

Replying to nfrisby:

I'm investigating nucleic2 at the moment; it's changing something currently not specifically tracked by ticky, so I'm resurrecting more counters. There's nothing obvious to me in the STG, at least.

It's actually pretty clear in the STG. A let at the entry to $w$wvar_most_distant_atom with 12 free Float# vars gets inlined into numerous thunks, at least 24 (25?) of which are allocated all at once. In the CMM Hp = Hp + 1376 becomes Hp = Hp + 3576. Some arithmetic with these numbers just about covers the reported ticky difference for $w$wvar_most_distant_atom.

Needs more inspection (distinct from this ticket... sorry), though this seems a bit of a corner case: arity=40 and the involved state type uses record types with numerous non-strict fields of type Float. I'll record it with the ticket I'm soon making for the late demand analysis pass.
