Opened 8 years ago

# Optimisation: eliminate unnecessary heap check in recursive function

• Reported by:
• Owned by: simonmar
• Priority: low
• Milestone: 8.0.1
• Component: Compiler
• Version: 6.6.1
• Cc: ndmitchell@…, johan.tibell@…, slyfox@…, mail@…, omeragacan@…, rwbarton
• Architecture: Unknown/Multiple
• Operating System: Unknown/Multiple
• Type of failure: Runtime performance bug
• Related: #4258, 8326

### Description

In a basic block consisting of some primitive cases, in which only some branches require heap, we could push the heap check into the branches as long as the primitive operations are safe to repeat, and the stack has not been modified. This might save a heap check in the common recursive case, which could be a significant win.
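As a minimal sketch of the pattern being described (not code from the ticket; the names and the example are hypothetical): in a loop like the one below, only the exit branch allocates, so a heap check hoisted in front of the whole case is pure overhead on every iteration of the allocation-free recursive branch.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, Int (I#), isTrue#, (+#), (>=#))

-- Only the exit branch allocates (boxing the accumulator with I#);
-- the recursive branch is allocation-free, so pushing the heap check
-- into the exit branch would remove it from the hot path.
sumTo :: Int# -> Int
sumTo n = go 0# 0#
  where
    go i acc
      | isTrue# (i >=# n) = I# acc                 -- exit: allocates
      | otherwise         = go (i +# 1#) (acc +# i) -- loop: no allocation

main :: IO ()
main = print (sumTo 10#)
```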

### comment:3 Changed 7 years ago by simonmar

• Architecture changed from Unknown to Unknown/Multiple

### comment:4 Changed 7 years ago by simonmar

• Operating System changed from Unknown to Unknown/Multiple

### comment:5 Changed 7 years ago by igloo

• Milestone changed from 6.10 branch to 6.12 branch

### comment:6 Changed 6 years ago by simonmar

• Milestone changed from 6.12 branch to 6.14.1

### comment:7 Changed 6 years ago by simonmar

• difficulty changed from Moderate (1 day) to Moderate (less than a day)

### comment:8 Changed 5 years ago by simonmar

• Milestone changed from 7.0.1 to 7.2.1
• Type of failure set to None/Unknown

### comment:9 Changed 4 years ago by rl

• Type changed from task to bug
• Type of failure changed from None/Unknown to Runtime performance bug

This is actually biting high-performance code with the LLVM backend quite badly. I'm not sure if or how the new codegen will affect the LLVM backend, but this is one of the top reasons for suboptimal loop performance right now.

### comment:10 Changed 4 years ago by igloo

• Milestone changed from 7.4.1 to 7.6.1
• Priority changed from normal to low

### comment:11 Changed 4 years ago by batterseapower

It's worth noting that the new codegen does put the heap check in the "correct" place for the example program:

```
foo as n = loop 0# 0.0##
  where
    loop i x
      | i >=# n   = (# (), D# x #)
      | otherwise = loop (i +# 1#) (x *## indexDoubleArray# as i)
```


The old code generator still has the problem, of course.

### comment:12 Changed 4 years ago by batterseapower

Out of curiosity I changed CgCase to push heap checks into branches in more cases, according to the following diff. My patch affects those cases with primop scrutinees which have at least 2 alts *and* which will not take a heap check before scrutinee evaluation.

Such cases are instead treated just like algebraic cases, where the alts are given full info tables etc., so the heap check can be done in the branches, *after* evaluating the scrutinee.

Nofib results were poor: code size increased by an average of 0.1% (max 1.7%) and allocations didn't fall perceptibly on average (they fell by 0.2% in cacheprof though). Runtime rose by 1.8%, with a max increase of 16.2% in wheel-sieve1.

It is not really surprising that this costs so much when you consider the overhead of spilling variables to a stack-carried closure for the alts, and that even if you end up eliminating a heap check, you might end up causing a brand new stack check. It doesn't help that the code generated by this change tends to be particularly stupid in more obvious ways. Here is the inner loop of the example:

```
[section "data" {
HC_foo_closure:
const HC_foo_info;
},
sbX_ret()
{ update_frame: <none>
label: sbX_info
rep:StackRep [True, True, False, True]
}
cca:
_ccb::I64 = R1 & 7;
if (_ccb::I64 >= 2) goto ccc;
_sbR::F64 = F64[I64[Sp + 24] + 16 + (I64[Sp + 16] << 3)];
_sbU::F64 = %MO_F_Mul_W64(F64[Sp + 8], _sbR::F64);
_sbV::I64 = I64[Sp + 16] + 1;
R1 = _sbV::I64;
D1 = _sbU::F64;
Sp = Sp + 16;
jump sbO_info; // [D1, R1]
ccc:
Hp = Hp + 16;
if (Hp > HpLim) goto cci;
I64[Hp - 8] = ghczmprim_GHCziTypes_Dzh_con_info;
F64[Hp + 0] = F64[Sp + 8];
R1 = ghczmprim_GHCziTuple_Z0T_closure+1;
R2 = Hp - 7;
Sp = Sp + 40;
jump (I64[Sp + 0]); // [R2, R1]
ccj: jump stg_gc_enter_1; // [R1]
cci:
HpAlloc = 16;
goto ccj;
},
sbO_ret()
{ update_frame: <none>
label: sbO_info
rep:StackRep [False, True]
}
ccl:
I64[Sp + 0] = R1;
F64[Sp - 8] = D1;
_ccn::I64 = %MO_S_Ge_W64(R1, I64[Sp + 16]);
R1 = I64[ghczmprim_GHCziTypes_Bool_closure_tbl + (_ccn::I64 << 3)];
I64[Sp - 16] = sbX_info;
Sp = Sp - 16;
jump sbX_info; // [R1]
},
HC_foo_info()
{ update_frame: <none>
label: HC_foo_info
rep:HeapRep static { Fun {arity: 2 fun_type: ArgSpec 11} }
}
ccp:
if (Sp - 40 < SpLim) goto ccr;
I64[Sp - 8] = R3;
I64[Sp - 16] = R2;
R1 = 0;
D1 = 0.0 :: W64;
Sp = Sp - 24;
jump sbO_info; // [D1, R1]
ccr:
R1 = HC_foo_closure;
jump stg_gc_fun; // [R1, R3, R2]
}]
```


Note the totally unnecessary use of the ghczmprim_GHCziTypes_Bool_closure_tbl. (It is worth noting this same problem afflicts the new codegen even in HEAD.)

Here is a possible restriction of this optimization that will still hit the particular example in this ticket. Note that deferring the heap check is a clear win when the heap check can have the GC reenter in such a way that it *reevaluates the scrutinee*. This is safe from a work-duplication perspective when the scrutinee is cheap, as "i >=# n" is. If we don't mind duplicating that work then we can avoid creating a stack entry for the alts and just reuse the existing stack entry for the whole case. In the CMM above, this would amount to not giving sbX_ret an info table and instead having stg_gc just reenter the already-present sbO_ret closure.
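To make the safety condition concrete (a sketch with hypothetical names, not the ticket's code): a pure register-level comparison like the one below can be re-evaluated after GC re-entry without observable effect, whereas a scrutinee that touches mutable state cannot be, because re-entering would duplicate the effect.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, isTrue#, (>=#))
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')

-- Cheap, pure scrutinee: duplicating this work when the GC re-enters
-- and re-runs the case is harmless.
cheapScrut :: Int# -> Int# -> Bool
cheapScrut i n = isTrue# (i >=# n)

-- Effectful scrutinee: re-executing this after a heap-check failure
-- would bump the counter twice, so the optimisation must not apply.
effectfulScrut :: IORef Int -> IO Bool
effectfulScrut ref = do
  modifyIORef' ref (+ 1)
  (> 0) <$> readIORef ref

main :: IO ()
main = print (cheapScrut 3# 2#)
```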

Actually implementing this idea is problematic, not least because we need to know whether all the currently emitted code from the last closure onwards is safe to work-duplicate in this way. It is not sufficient to just check the scrutinee: it depends on the context of the cgCase.

For reference, the patch I tested is as follows:

```
$ git diff
diff --git a/compiler/codeGen/CgCase.lhs b/compiler/codeGen/CgCase.lhs
--- a/compiler/codeGen/CgCase.lhs
+++ b/compiler/codeGen/CgCase.lhs
@@ -190,10 +190,15 @@ cgCase (StgOpApp (StgPrimOp SeqOp) [StgVarArg a, _] _)
Special case #3: inline PrimOps and foreign calls.

\begin{code}
-cgCase (StgOpApp (StgPrimOp primop) args _)
-       _live_in_whole_case live_in_alts bndr alt_type alts
+cgCase e@(StgOpApp (StgPrimOp primop) args _)
+       live_in_whole_case live_in_alts bndr alt_type alts
| not (primOpOutOfLine primop)
-  = cgInlinePrimOp primop args bndr alt_type live_in_alts alts
+  = do { hp_usage <- getHpUsage
+       ; if heapHWM hp_usage > heapHWM initHpUsage || length alts <= 1 -- Reuse existing heap check for alts if it already exists, or if there is no way
+          then cgInlinePrimOp primop args bndr alt_type live_in_alts alts
+          else cgCase' e live_in_whole_case live_in_alts bndr alt_type alts }
+cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
+  = cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
\end{code}

TODO: Case-of-case of primop can probably be done inline too (but
@@ -205,7 +210,7 @@ Special case #4: inline foreign calls: an unsafe foreign call can be done
right here, just like an inline primop.

\begin{code}
-cgCase (StgOpApp (StgFCallOp fcall _) args _)
+cgCase' (StgOpApp (StgFCallOp fcall _) args _)
_live_in_whole_case live_in_alts _bndr _alt_type alts
| unsafe_foreign_call
= ASSERT( isSingleton alts )
@@ -229,7 +234,7 @@ This can be done a little better than the general case, because
we can reuse/trim the stack slot holding the variable (if it is in one).

\begin{code}
-cgCase (StgApp fun args)
+cgCase' (StgApp fun args)
_live_in_whole_case live_in_alts bndr alt_type alts
= do  { fun_info <- getCgIdInfo fun
; arg_amodes <- getArgAmodes args
@@ -268,7 +273,7 @@ deAllocStackTop call is doing above.
Finally, here is the general case.

\begin{code}
-cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
+cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
= do  {       -- Figure out what volatile variables to save
```


### comment:13 Changed 4 years ago by rl

Very interesting (even if a bit disappointing). I'm somewhat suspicious of your condition, though:

> cases with primop scrutinees which have at least 2 alts *and* which will not take a heap check before scrutinee evaluation

The tight loops that LLVM can optimise well all look like this:

```
f x = case ... of
        p1 -> ... f y ...      -- recursive, doesn't allocate
        ...
        pm -> ... f z ...      -- recursive, doesn't allocate
        q1 -> ...              -- non-recursive, might allocate
        ...
        qn -> ...              -- non-recursive, might allocate
```


So we have some alternatives that don't allocate and perform recursive tailcalls (the tight loop) and then a number of loop exits which might allocate. It seems that pushing the heap check into the exits should always be a win here. This seems quite different from the condition you used so there might still be an easy fix for these cases.
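A concrete, compilable instance of that loop shape (hypothetical names, not from the ticket): two allocation-free recursive alternatives forming the tight loop, and a single loop exit that allocates by boxing the result.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Int#, Int (I#), isTrue#, (+#), (*#), (==#),
                 remInt#, quotInt#)

-- Collatz step counter: both recursive alternatives are allocation-free;
-- only the exit alternative allocates (I# boxes the count), so it is the
-- natural place for the heap check.
steps :: Int# -> Int
steps = go 0#
  where
    go k n = case n of
      1# -> I# k                                        -- exit: allocates
      _  | isTrue# (remInt# n 2# ==# 0#)
                     -> go (k +# 1#) (quotInt# n 2#)    -- recursive, no alloc
         | otherwise -> go (k +# 1#) ((3# *# n) +# 1#)  -- recursive, no alloc

main :: IO ()
main = print (steps 6#)
```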

In any case, thanks for looking into this!

### comment:14 Changed 4 years ago by simonmar

Max: note that I'm partway through a large overhaul of the new code generator, which you can find on the newcg branch of the GHC repo (you also need the simonmar-hoopl-opt branch of the hoopl repo). In particular I fixed the Bool_closure_tbl problem you mentioned, and I've redone the stack allocator.

I think probably the right way to approach this problem is for the heap check failure code to branch back to the beginning of the function, rather than the branch, to avoid turning the branch into a proc point and causing everything to get spilled to the stack.

### comment:15 Changed 4 years ago by batterseapower

Simon: agreed that the in-branch heap check should just reuse the immediately dominating procpoint rather than creating a new one. This is what I was getting at in my previous comment. We just have to be careful about code like:

```
f x y = case sideEffectingPrimOp of
          _ -> case (x ># y) of
                 True  -> let a = D# x in ...
                 False -> f (x +# 1#) y
```


This is because we don't want to re-execute sideEffectingPrimOp if the heap check in the True branch fails. I've attached a patch that attempts to implement this (it applies on top of the result of merging 7a7f6d703a99045cb9f590c819b795409a090022 into your newcg branch), but I can't benchmark it because newcg seems to be broken at the moment (the built stage2 compiler crashes).

The implementation is very hacky and awkward. I added a new field to the FCode's state which carries around information about the last procpoint (if any), and we can then use that procpoint as the target of our heap check when generating code for certain alts.

One reason the implementation is not very nice is that you have to be very careful to zap this last-procpoint info wherever you emit code whose evaluation is not safe to duplicate (and where you do not immediately establish a new procpoint). Currently I zap only when emitting a tick or a call to a non-cheap inline primop, but there may be more cases where zapping should happen (in fact, now I think about it, I should also have zapped at any point where heapCheck was called but none was required because virtHP didn't move). In particular, I'm not 100% sure if we should allow heap allocation itself to be duplicated, e.g. in code like:

```
f x y = let z = D# x in case x ># y of ...
```


Perhaps there is a nicer way to implement it that I'm missing. I'm not 100% comfortable with how the codegen is written, so I consider this quite likely.

### comment:16 Changed 4 years ago by simonmar

Max: the newcg branch may well be broken, it's work in progress. Can we revisit this when I have the new code generator back together and working? (I'm actively working on it)

### comment:17 Changed 3 years ago by igloo

• Milestone changed from 7.6.1 to 7.6.2

### comment:18 Changed 3 years ago by morabbin

Bump; what's the status of the newcg branch?

### comment:19 Changed 3 years ago by thoughtpolice

The new code generator is live in HEAD; it is now the only code generator that exists. This optimization hasn't been implemented yet, though.

### comment:21 Changed 2 years ago by jstolarek

Is someone able to provide a complete code example that demonstrates this bug? I'd like to take a look.

### comment:22 Changed 23 months ago by NeilMitchell

http://neilmitchell.blogspot.co.uk/2014/01/optimising-haskell-for-tight-inner-loop.html - the code in version 4/version 5 includes a small example of the bug.

### comment:26 Changed 17 months ago by thoughtpolice

• Milestone changed from 7.6.2 to 7.10.1

Moving to 7.10.1.

### comment:27 Changed 11 months ago by thoughtpolice

• Milestone changed from 7.10.1 to 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

### comment:30 Changed 3 months ago by thoughtpolice

• Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

### comment:31 Changed 2 months ago by jstolarek

• Cc jan.stolarek@… removed