In a basic block consisting of some primitive cases, in which only some branches require heap, we could push the heap check into the branches as long as the primitive operations are safe to repeat, and the stack has not been modified. This might save a heap check in the common recursive case, which could be a significant win.
Trac metadata
Trac field
Value
Version
6.6.1
Type
Task
TypeOfFailure
OtherFailure
Priority
normal
Resolution
Unresolved
Component
Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Unknown
Architecture
Unknown
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
This is actually biting high-performance code with the LLVM backend quite badly. I'm not sure if and how the new codegen will affect the LLVM backend but this is one of the top reasons for suboptimal loop performance right now.
Out of curiosity I changed CgCase to push heap checks into branches in more cases, according to the following diff. My patch affects those cases with primop scrutinees which have at least 2 alts *and* are which will not take a heap check before scrutinee evaluation.
Such cases are instead treated just like algebraic cases where the alts are given a full info tables etc so the heap check can be done in the branches, *after* evaluating the scrutinee.
Nofib results were poor: code size increased by an average of 0.1% (max 1.7%) and allocations didn't fall perceptibly on average (they fell by 0.2% in cacheprof though). Runtime rose by 1.8%, with a max increase of 16.2% in wheel-sieve1.
It is not really surprising that this costs so much when you consider the overhead of spilling variables to a stack-carried closure for the alts, and that even if you end up eliminating a heap check, you might end up causing a brand new stack check. It doesn't help that the code generated by this change tends to be particularly stupid in more obvious ways. Here is the inner loop of the example:
Note the totally unnecessary use of the ghczmprim_GHCziTypes_Bool_closure_tbl. (It is worth noting this same problem afflicts the new codegen even in HEAD.)
Here is a possible restriction of this optimization that will still hit the particular example in this ticket. Note that deferring the heap check is a clear win when the heap check can have the GC reenter in such a way that it *reevaluates the scrutinee*. This is safe from a work-duplication perspective when the scrutinee is cheap, as "i >=# n" is. If we don't mind duplicating that work then we can avoid creating a stack entry for the alts and just reuse the existing stack entry for the whole case. In the CMM above, this would amount to not giving sbX_ret an info table and instead having stg_gc just reenter the already-present sbO_ret closure.
Actually implementing this idea is a problematic, not least because we need to know whether all the currently emitted code from the last closure onwards is safe to work-duplicate in this way. It is not sufficient to just check the scrutinee - it depends on the context of the cgCase.
For reference, the patch I tested is as follows:
$ git diffdiff --git a/compiler/codeGen/CgCase.lhs b/compiler/codeGen/CgCase.lhsindex dd607de..ad7610e 100644--- a/compiler/codeGen/CgCase.lhs+++ b/compiler/codeGen/CgCase.lhs@@ -190,10 +190,15 @@ cgCase (StgOpApp (StgPrimOp SeqOp) [StgVarArg a, _] _) Special case #3: inline PrimOps and foreign calls. \begin{code}-cgCase (StgOpApp (StgPrimOp primop) args _)- _live_in_whole_case live_in_alts bndr alt_type alts+cgCase e@(StgOpApp (StgPrimOp primop) args _)+ live_in_whole_case live_in_alts bndr alt_type alts | not (primOpOutOfLine primop)- = cgInlinePrimOp primop args bndr alt_type live_in_alts alts+ = do { hp_usage <- getHpUsage+ ; if heapHWM hp_usage > heapHWM initHpUsage || length alts <= 1 -- Reuse existing heap check for alts if it already exists, or if there is no way+ then cgInlinePrimOp primop args bndr alt_type live_in_alts alts+ else cgCase' e live_in_whole_case live_in_alts bndr alt_type alts }+cgCase expr live_in_whole_case live_in_alts bndr alt_type alts+ = cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts \end{code} TODO: Case-of-case of primop can probably be done inline too (but@@ -205,7 +210,7 @@ Special case #4: inline foreign calls: an unsafe foreign call can be done right here, just like an inline primop. \begin{code}-cgCase (StgOpApp (StgFCallOp fcall _) args _)+cgCase' (StgOpApp (StgFCallOp fcall _) args _) _live_in_whole_case live_in_alts _bndr _alt_type alts | unsafe_foreign_call = ASSERT( isSingleton alts )@@ -229,7 +234,7 @@ This can be done a little better than the general case, because we can reuse/trim the stack slot holding the variable (if it is in one). \begin{code}-cgCase (StgApp fun args)+cgCase' (StgApp fun args) _live_in_whole_case live_in_alts bndr alt_type alts = do { fun_info <- getCgIdInfo fun ; arg_amodes <- getArgAmodes args@@ -268,7 +273,7 @@ deAllocStackTop call is doing above. Finally, here is the general case. \begin{code}-cgCase expr live_in_whole_case live_in_alts bndr alt_type alts+cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts = do { -- Figure out what volatile variables to save nukeDeadBindings live_in_whole_case
Very interesting (even if a bit disappointing). I'm somewhat suspicious of your condition, though:
cases with primop scrutinees which have at least 2 alts *and* are which will not take a heap check before scrutinee evaluation
The tight loops that LLVM can optimise well all look like this:
f x = case ... of p1 -> ... f y ... -- recursive, doesn't allocate ... pm -> ... f z ... -- recursive, doesn't allocate q1 -> ... -- non-recursive, might allocate ... qn -> ... -- non-recursive, might allocate
So we have some alternatives that don't allocate and perform recursive tailcalls (the tight loop) and then a number of loop exits which might allocate. It seems that pushing the heap check into the exits should always be a win here. This seems quite different from the condition you used so there might still be an easy fix for these cases.
Max: note that I'm partway through a large overhaul of the new code generator, which you can find on the newcg branch of the GHC repo (you also need the simonmar-hoopl-opt branch of the hoopl repo). In particular I fixed the Bool_closure_tbl problem you mentioned, and I've redone the stack allocator.
I think probably the right way to approach this problem is for the heap check failure code to branch back to the beginning of the function, rather than the branch, to avoid turning the branch into a proc point and causing everything to get spilled to the stack.
Simon: agreed that the in-branch heap check should just reuse the immediately dominating procpoint rather than creating a new one. This is what I was getting at in my previous comment. We just have to be careful about code like:
f x y = case sideEffectingPrimOp of _ -> case (x ># y) of True -> let a = D# x in ...; False -> f (x +# 1#) y
Because we don't want to reexecute sideEffectingPrimOp if the heap check in the True branch fails. I've attached a patch that attempts to implement this (it applies on top of the result of merging 7a7f6d70 into your newcg branch), but I cant' benchmark it because newcg seems to be broken at the moment (the built stage2 compiler crashes?).
The implementation is very hacky and awkward. I added a new field to the FCode's state which carries around information about the last procpoint (if any), and we can then use that procpoint as the target of our heap check when generating code for certain alts.
One reason the implementation is not very nice because you have to be very careful to zap this last-procpoint info wherever you emit any code which it is not safe to duplicate the evaluation of (and where you do not immediately establish a new procpoint). Currently I zap only when emitting a tick or a call to a non-cheap inline primop, but there may be more cases where zapping should happen (in fact, now I think about I should also have zapped at any point where heapCheck was called but none was required because virtHP didn't move). In particular, I'm not 100% sure if we should allow heap allocation itself to be duplicated, e.g. in code like:
f x y = let z = D# x in case x ># y of ...
Perhaps there is a nicer way to implement it that I'm missing. I'm not 100% comfortable with how the codegen is written, so I consider this quite likely.
Max: the newcg branch may well be broken, it's work in progress. Can we revisit this when I have the new code generator back together and working? (I'm actively working on it)