Optimisation: eliminate unnecessary heap check in recursive function

changed milestone to %6.10 branch

changed weight to 5

Trac metadata

Trac field	Value
Architecture	Unknown → Unknown/Multiple

Trac metadata

Trac field	Value
Operating system	Unknown → Unknown/Multiple

changed milestone to %6.12 branch

removed milestone

Another new codegen task.

mentioned in issue #4258 (closed)

changed milestone to %7.2.1

Trac metadata

Trac field	Value
TypeOfFailure	- → OtherFailure
BlockedBy	- → #4258 (closed)

This is actually biting high-performance code with the LLVM backend quite badly. I'm not sure if and how the new codegen will affect the LLVM backend but this is one of the top reasons for suboptimal loop performance right now.

Trac metadata

Trac field	Value
Type	Task → Bug
TypeOfFailure	OtherFailure → RuntimePerformance

changed milestone to %7.6.1

changed weight to 3

Trac metadata

Trac field	Value
Priority	normal → low

It's worth noting that the new codegen does put the heap check in the "correct" place for the example program:

foo as n = loop 0# 0.0##
  where
    loop i x
      | i >=# n = (# (), D# x #)
      | otherwise = loop (i +# 1#) (x *## indexDoubleArray# as i)

The old code generator still has the problem, of course.

Out of curiosity I changed CgCase to push heap checks into branches in more cases, according to the following diff. My patch affects those cases with primop scrutinees which have at least 2 alts *and* are which will not take a heap check before scrutinee evaluation.

Such cases are instead treated just like algebraic cases where the alts are given a full info tables etc so the heap check can be done in the branches, *after* evaluating the scrutinee.

Nofib results were poor: code size increased by an average of 0.1% (max 1.7%) and allocations didn't fall perceptibly on average (they fell by 0.2% in cacheprof though). Runtime rose by 1.8%, with a max increase of 16.2% in wheel-sieve1.

It is not really surprising that this costs so much when you consider the overhead of spilling variables to a stack-carried closure for the alts, and that even if you end up eliminating a heap check, you might end up causing a brand new stack check. It doesn't help that the code generated by this change tends to be particularly stupid in more obvious ways. Here is the inner loop of the example:

[section "data" {
     HC_foo_closure:
         const HC_foo_info;
 },
 sbX_ret()
         { update_frame: <none>
           label: sbX_info
           rep:StackRep [True, True, False, True]
         }
     cca:
         _ccb::I64 = R1 & 7;
         if (_ccb::I64 >= 2) goto ccc;
         _sbR::F64 = F64[I64[Sp + 24] + 16 + (I64[Sp + 16] << 3)];
         _sbU::F64 = %MO_F_Mul_W64(F64[Sp + 8], _sbR::F64);
         _sbV::I64 = I64[Sp + 16] + 1;
         R1 = _sbV::I64;
         D1 = _sbU::F64;
         Sp = Sp + 16;
         jump sbO_info; // [D1, R1]
     ccc:
         Hp = Hp + 16;
         if (Hp > HpLim) goto cci;
         I64[Hp - 8] = ghczmprim_GHCziTypes_Dzh_con_info;
         F64[Hp + 0] = F64[Sp + 8];
         R1 = ghczmprim_GHCziTuple_Z0T_closure+1;
         R2 = Hp - 7;
         Sp = Sp + 40;
         jump (I64[Sp + 0]); // [R2, R1]
     ccj: jump stg_gc_enter_1; // [R1]
     cci:
         HpAlloc = 16;
         goto ccj;
 },
 sbO_ret()
         { update_frame: <none>
           label: sbO_info
           rep:StackRep [False, True]
         }
     ccl:
         I64[Sp + 0] = R1;
         F64[Sp - 8] = D1;
         _ccn::I64 = %MO_S_Ge_W64(R1, I64[Sp + 16]);
         R1 = I64[ghczmprim_GHCziTypes_Bool_closure_tbl + (_ccn::I64 << 3)];
         I64[Sp - 16] = sbX_info;
         Sp = Sp - 16;
         jump sbX_info; // [R1]
 },
 HC_foo_info()
         { update_frame: <none>
           label: HC_foo_info
           rep:HeapRep static { Fun {arity: 2 fun_type: ArgSpec 11} }
         }
     ccp:
         if (Sp - 40 < SpLim) goto ccr;
         I64[Sp - 8] = R3;
         I64[Sp - 16] = R2;
         R1 = 0;
         D1 = 0.0 :: W64;
         Sp = Sp - 24;
         jump sbO_info; // [D1, R1]
     ccr:
         R1 = HC_foo_closure;
         jump stg_gc_fun; // [R1, R3, R2]
 }]

Note the totally unnecessary use of the ghczmprim_GHCziTypes_Bool_closure_tbl. (It is worth noting this same problem afflicts the new codegen even in HEAD.)

Here is a possible restriction of this optimization that will still hit the particular example in this ticket. Note that deferring the heap check is a clear win when the heap check can have the GC reenter in such a way that it *reevaluates the scrutinee*. This is safe from a work-duplication perspective when the scrutinee is cheap, as "i >=# n" is. If we don't mind duplicating that work then we can avoid creating a stack entry for the alts and just reuse the existing stack entry for the whole case. In the CMM above, this would amount to not giving sbX_ret an info table and instead having stg_gc just reenter the already-present sbO_ret closure.

Actually implementing this idea is a problematic, not least because we need to know whether all the currently emitted code from the last closure onwards is safe to work-duplicate in this way. It is not sufficient to just check the scrutinee - it depends on the context of the cgCase.

For reference, the patch I tested is as follows:

$ git diff
diff --git a/compiler/codeGen/CgCase.lhs b/compiler/codeGen/CgCase.lhs
index dd607de..ad7610e 100644
--- a/compiler/codeGen/CgCase.lhs
+++ b/compiler/codeGen/CgCase.lhs
@@ -190,10 +190,15 @@ cgCase (StgOpApp (StgPrimOp SeqOp) [StgVarArg a, _] _)
 Special case #3: inline PrimOps and foreign calls.
 
 \begin{code}
-cgCase (StgOpApp (StgPrimOp primop) args _)
-       _live_in_whole_case live_in_alts bndr alt_type alts
+cgCase e@(StgOpApp (StgPrimOp primop) args _)
+       live_in_whole_case live_in_alts bndr alt_type alts
   | not (primOpOutOfLine primop)
-  = cgInlinePrimOp primop args bndr alt_type live_in_alts alts
+  = do { hp_usage <- getHpUsage
+       ; if heapHWM hp_usage > heapHWM initHpUsage || length alts <= 1 -- Reuse existing heap check for alts if it already exists, or if there is no way
+          then cgInlinePrimOp primop args bndr alt_type live_in_alts alts
+          else cgCase' e live_in_whole_case live_in_alts bndr alt_type alts }
+cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
+  = cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
 \end{code}
 
 TODO: Case-of-case of primop can probably be done inline too (but
@@ -205,7 +210,7 @@ Special case #4: inline foreign calls: an unsafe foreign call can be done
 right here, just like an inline primop.
 
 \begin{code}
-cgCase (StgOpApp (StgFCallOp fcall _) args _)
+cgCase' (StgOpApp (StgFCallOp fcall _) args _)
        _live_in_whole_case live_in_alts _bndr _alt_type alts
   | unsafe_foreign_call
   = ASSERT( isSingleton alts )
@@ -229,7 +234,7 @@ This can be done a little better than the general case, because
 we can reuse/trim the stack slot holding the variable (if it is in one).
 
 \begin{code}
-cgCase (StgApp fun args)
+cgCase' (StgApp fun args)
         _live_in_whole_case live_in_alts bndr alt_type alts
   = do  { fun_info <- getCgIdInfo fun
         ; arg_amodes <- getArgAmodes args
@@ -268,7 +273,7 @@ deAllocStackTop call is doing above.
 Finally, here is the general case.
 
 \begin{code}
-cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
+cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
   = do  {       -- Figure out what volatile variables to save
           nukeDeadBindings live_in_whole_case

Very interesting (even if a bit disappointing). I'm somewhat suspicious of your condition, though:

cases with primop scrutinees which have at least 2 alts *and* are which will not take a heap check before scrutinee evaluation

The tight loops that LLVM can optimise well all look like this:

f x = case ... of
        p1 -> ... f y ...      -- recursive, doesn't allocate
        ...
        pm -> ... f z ...      -- recursive, doesn't allocate
        q1 -> ...              -- non-recursive, might allocate
        ...
        qn -> ...              -- non-recursive, might allocate

So we have some alternatives that don't allocate and perform recursive tailcalls (the tight loop) and then a number of loop exits which might allocate. It seems that pushing the heap check into the exits should always be a win here. This seems quite different from the condition you used so there might still be an easy fix for these cases.

In any case, thanks for looking into this!

Max: note that I'm partway through a large overhaul of the new code generator, which you can find on the newcg branch of the GHC repo (you also need the simonmar-hoopl-opt branch of the hoopl repo). In particular I fixed the Bool_closure_tbl problem you mentioned, and I've redone the stack allocator.

I think probably the right way to approach this problem is for the heap check failure code to branch back to the beginning of the function, rather than the branch, to avoid turning the branch into a proc point and causing everything to get spilled to the stack.

Simon: agreed that the in-branch heap check should just reuse the immediately dominating procpoint rather than creating a new one. This is what I was getting at in my previous comment. We just have to be careful about code like:

f x y = case sideEffectingPrimOp of _ -> case (x ># y) of True -> let a = D# x in ...; False -> f (x +# 1#) y

Because we don't want to reexecute sideEffectingPrimOp if the heap check in the True branch fails. I've attached a patch that attempts to implement this (it applies on top of the result of merging 7a7f6d70 into your newcg branch), but I cant' benchmark it because newcg seems to be broken at the moment (the built stage2 compiler crashes?).

The implementation is very hacky and awkward. I added a new field to the FCode's state which carries around information about the last procpoint (if any), and we can then use that procpoint as the target of our heap check when generating code for certain alts.

One reason the implementation is not very nice because you have to be very careful to zap this last-procpoint info wherever you emit any code which it is not safe to duplicate the evaluation of (and where you do not immediately establish a new procpoint). Currently I zap only when emitting a tick or a call to a non-cheap inline primop, but there may be more cases where zapping should happen (in fact, now I think about I should also have zapped at any point where heapCheck was called but none was required because virtHP didn't move). In particular, I'm not 100% sure if we should allow heap allocation itself to be duplicated, e.g. in code like:

f x y = let z = D# x in case x ># y of ...

Perhaps there is a nicer way to implement it that I'm missing. I'm not 100% comfortable with how the codegen is written, so I consider this quite likely.

Attached file 0001-Horrible-hacky-implementation-of-avoiding-heap-check.patch (download).

Max: the newcg branch may well be broken, it's work in progress. Can we revisit this when I have the new code generator back together and working? (I'm actively working on it)

changed milestone to %7.6.2

mentioned in issue #7561 (closed)

Bump; what's the status of the newcg branch?

The new code generator is live in HEAD - it is now the only code generator that exists. This optimization still hasn't been implemented yet, though.

Is someone able to provide a complete code example that demonstrates this bug? I'd like to take a look.

mentioned in issue #8326

http://neilmitchell.blogspot.co.uk/2014/01/optimising-haskell-for-tight-inner-loop.html - the code in version 4/version 5 includes a small example of the bug.

Trac metadata

Trac field	Value
Related	→ #8326

changed milestone to %7.10.1

Moving to 7.10.1.

removed milestone

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

changed milestone to %8.0.1

Milestone renamed

Trac metadata

Trac field	Value
CC	jan.stolarek@p.lodz.pl, johan.tibell@gmail.com, mail@nh2.me, ndmitchell@gmail.com, omeragacan@gmail.com, rwbarton, slyfox@inbox.ru → johan.tibell@gmail.com, mail@nh2.me, ndmitchell@gmail.com, omeragacan@gmail.com, rwbarton, slyfox@inbox.ru

mentioned in issue #12231

mentioned in issue #14791

added Plow label

marked this issue as related to #8317 (closed)

marked this issue as related to #16064

marked this issue as related to #8326

removed the relation with #8326

marked this issue as related to #12231

mentioned in issue #25642

Optimisation: eliminate unnecessary heap check in recursive function

Child items ...

Activity

Optimisation: eliminate unnecessary heap check in recursive function

Relates to

Activity