Version 16 (modified by nomeata, 21 months ago)


This is nomeata’s notepad about the nested CPR information:

Related tickets

  • #1600 Main ticket where I mention progress.

Tickets with stuff that would make nested CPR better:

  • #8598 CPR after IO (partly done)

Related testcases


  • Does Nick Frisby’s late λ-lifting alleviate problems when CPR’ing join-points?
    • Need to see if his branch can be merged onto master.
  • Paper-Writeup of CPR
  • Shouldn’t nested CPR help a lot with Complex-heavy code? Is there something in nofib?
  • Try passing CPR information from the scrutinee to the pattern variables. For that: reverse the flow of analysis for complex scrutinees (for simple ones, we want the demand coming from the body; for complex ones, this is not so important).
  • Use ticky-profiling to learn more about the effects of nested CPR.
  • Look at DmdAnal-related [SLPJ-Tickets] and see which ones are affected by nested-cpr.
  • Do not destroy join points (see below).
  • Can we make sure more stuff gets the Converging flag, e.g. after a case of an unboxed value? Should case binders get the Converging flag? What about pattern match variables in strict data constructors? Unboxed values?
  • Why does nested CPR make some stuff so bad?
    • Possibly because of character reboxing. Try avoiding CPR’ing C# altogether!
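The character-reboxing concern in the last item can be made concrete with a hand-written worker/wrapper pair. This is only a sketch of what nested CPR might do to a String-returning function, not actual compiler output; the names greet, wgreet and greet' are hypothetical.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Char (C#), Char#)

-- Source function: returns a (:) cell whose head is a boxed Char.
greet :: Bool -> String
greet b = (if b then 'H' else 'h') : "ello"

-- Hypothetical worker: returns the fields of the (:) cell in an unboxed
-- tuple, with the head Char unboxed as well (the "nested" part).
wgreet :: Bool -> (# Char#, String #)
wgreet b = case (if b then 'H' else 'h') of
  C# c -> (# c, "ello" #)

-- Hypothetical wrapper: reboxes the Char and rebuilds the cons cell.
-- Any caller that needs the head as a Char pays for this C# again.
greet' :: Bool -> String
greet' b = case wgreet b of
  (# c, rest #) -> C# c : rest

main :: IO ()
main = do
  print (greet True == greet' True)
  putStrLn (greet' False)
```

The wrapper only pays off when the caller consumes the Char# directly; otherwise the C# box is rebuilt at the use site.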

Degradation explanation

At one point, I thought that a major contributor to increased allocations was nested-CPR’ing things returning String, causing them to return (# Char#, String #). But removing the CPR information from C# calls has zero effect on the allocations, both on master and on nested-cpr. It had a very small (positive) effect on code size. Will have to look at Core... Here are some case studies:


Baseline: [0e2fd3/ghc], Tested: nested-cpr (without nesting inside sum-types, without join-point detection).

Found an 11% increase in allocation, around 9000000 bytes.

The most obvious changes in the ticky-ticky numbers are:

  • FUNCTION ENTRIES and ENTERS increasing by ~100000
  • RETURNS doubling from 140745 to 280795
  • ALLOC_FUN_ctr and ALLOC_FUN_gds almost doubling, by ~18000 and ~9000000 respectively

So we are allocating more function closures. First guess: Join point property destroyed somewhere.

The ticky output shows a $wgo{v s60k} (main:Main) appearing that was not there before, with 140016 enters and 23522688 allocations. This appears in $wtabulate, and indeed corresponds to a go1 that was a join point before. So what is happening? We are changing

go1 [Occ=LoopBreaker]                                      
  :: GHC.Prim.Int#                                         
     -> GHC.Prim.State# s                                  
     -> (# GHC.Prim.State# s, GHC.Arr.Array GHC.Types.Int x #)

to

$wgo [Occ=LoopBreaker]          
  :: GHC.Prim.Int#
     -> GHC.Prim.State# s
     -> (# GHC.Prim.State# s,   
           GHC.Prim.Array# x #) 

go1 is recursive, but tail-recursive, so the worker and wrapper indeed cancel for the recursive call. But where it is being used, we simply apply the Array constructor to the second component. So nothing is gained, but a join-point is lost.
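A simplified, hand-written analogue of this transformation follows. It is a sketch under assumptions: Box stands in for GHC.Arr.Array, a plain Int# stands in for the State# s token, and wgo plays the role of $wgo (which is not a valid source-level name). In the real program go1 is a local binding, which is what allows it to be a join point; here everything is top-level to keep the example self-contained.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main (main) where

import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (-#), (==#))

-- Stand-in for GHC.Arr.Array: a box around a primitive payload.
data Box = Box Int#

-- Before: the tail-recursive loop returns the boxed result itself.
go1 :: Int# -> Int# -> (# Int#, Box #)
go1 n acc
  | isTrue# (n ==# 0#) = (# acc, Box acc #)
  | otherwise          = go1 (n -# 1#) (acc +# n)

-- After nested CPR: the worker returns the payload unboxed.
wgo :: Int# -> Int# -> (# Int#, Int# #)
wgo n acc
  | isTrue# (n ==# 0#) = (# acc, acc #)
  | otherwise          = wgo (n -# 1#) (acc +# n)

-- Use site before: the result is passed on as is.
tabulateBefore :: Int -> Box
tabulateBefore (I# n) = case go1 n 0# of
  (# _, b #) -> b

-- Use site after: the constructor is applied right away, so no
-- allocation is saved compared to tabulateBefore.
tabulateAfter :: Int -> Box
tabulateAfter (I# n) = case wgo n 0# of
  (# _, p #) -> Box p

unBox :: Box -> Int
unBox (Box p) = I# p

main :: IO ()
main = print (unBox (tabulateBefore 10), unBox (tabulateAfter 10))
```

Both versions allocate exactly one Box per call; the only difference is where. The cost observed above comes from the worker no longer being a join point, which this top-level sketch cannot show directly.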

Join points

CPR can kill join points.

Common Context

Idea to fix this, possibly with more general benefits; prototype in branch wip/common-context.

  • On its own, improvements are present but very small.
  • Enabling CPR for sum types in non-top-level-bindings (which is currently disabled due to worries about lost join points) yields mixed results (min -3.8%, mean -0.0%, max 3.4%).
  • Enabling sum types inside nested CPR: Also yields mixed, not very promising results (-6.9% / +0.0% / +11.3%).

Direct detection

Alternative: Detect join points during dmdAnal and make sure that their CPR info is not greater than that of the expression they are a join-point for. Would also fix #5075, see 5075#comment:19 for benchmark numbers.

  • On its own, no changes.
  • Enabling CPR for sum types: (min -3.8%, mean -0.0%, max 1.7%) (slightly better than with Common Context)
  • Enabling sum types inside nested CPR: TBD
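The "not greater than" rule for direct detection can be illustrated with a toy model. This is a sketch only: the CPR type, lubCPR and clampJoinPoint below are hypothetical and do not reflect how dmdAnal actually represents or combines CPR information.

```haskell
module Main (main) where

-- Toy CPR lattice (NOT GHC's representation):
--   NoCPR     : nothing is known about the result
--   RetCon fs : the result is a constructor application whose fields
--               carry the CPR information fs
data CPR = NoCPR | RetCon [CPR]
  deriving (Eq, Show)

-- Keep only the information that both arguments agree on.
lubCPR :: CPR -> CPR -> CPR
lubCPR (RetCon xs) (RetCon ys)
  | length xs == length ys = RetCon (zipWith lubCPR xs ys)
lubCPR _ _ = NoCPR

-- Hypothetical rule: a join point may not claim more CPR information
-- than the expression it is a join point for.
clampJoinPoint :: CPR -> CPR -> CPR
clampJoinPoint exprCPR joinCPR = lubCPR exprCPR joinCPR

main :: IO ()
main = do
  -- The join point returns a tuple whose second field is itself a
  -- constructor application; the enclosing expression only promises
  -- the tuple.
  let joinCPR = RetCon [NoCPR, RetCon [NoCPR]]
      exprCPR = RetCon [NoCPR, NoCPR]
  print (clampJoinPoint exprCPR joinCPR)
```

Under this model the nested information on the second component is dropped, so a binding like go1 above would keep its original result type and stay a join point.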

Side tracks

  • Should runSTRep be inlined (see ticket:1600#comment:34)?
  • Can we use Terminates CPR information to eagerly evaluate thunks? Yes, and there is a small gain there: #8655
    • But why no allocation change? Understand this better!
    • Can we statically and/or dynamically count the number of thunks, and the number of CBV’ed thunks?
  • Why is cacheprof not deterministic? (→ #8611)
  • What became of Simon’s better-ho-cardinality branch? See better-ho-cardinality.
  • Try VTune to get better numbers.
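The eager-evaluation idea in the list above (turning a thunk into a call-by-value binding when its right-hand side is known to terminate) can be mimicked at source level. This is a sketch, with hypothetical names lazyLet and eagerLet; the bang patterns on the arguments stand in for the analysis knowing that a + b converges.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Lazy binding: without optimisation, s is allocated as a thunk and
-- only forced in the branch that uses it.
lazyLet :: Int -> Int -> Int
lazyLet !a !b =
  let s = a + b
  in if a > 0 then s else 0

-- Call-by-value binding: s is computed up front. This is safe here
-- because a and b are already evaluated, so a + b cannot diverge.
eagerLet :: Int -> Int -> Int
eagerLet !a !b =
  let !s = a + b
  in if a > 0 then s else 0

main :: IO ()
main = print [ (lazyLet a b, eagerLet a b) | a <- [-1, 2], b <- [3, 4] ]
```

The two functions agree on all inputs; the trade-off is one possibly wasted addition against one avoided thunk allocation.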