Opened 7 years ago

Last modified 3 months ago

#1600 new task

Optimisation: CPR the results of IO

Reported by: simonmar Owned by: nomeata
Priority: lowest Milestone: 7.6.2
Component: Compiler Version: 6.6.1
Keywords: Cc: pho@…, mail@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Moderate (less than a day)
Test Case: Blocked By:
Blocking: Related Tickets: #8598

Description

GHC currently cannot unbox the result of a function in the IO monad. For example:

facIO :: Int -> IO Int
facIO n = if n < 2 then return 1 else do n' <- facIO (n-1); return (n*n')

the Int argument is unboxed fine, but not the result. It ought to be possible to do this: the CPR analysis needs to somehow look inside the unboxed pair returned by IO-monadic code.

Change History (40)

comment:1 Changed 6 years ago by simonmar

  • Architecture changed from Unknown to Unknown/Multiple

comment:2 Changed 6 years ago by simonmar

  • Operating System changed from Unknown to Unknown/Multiple

comment:3 Changed 5 years ago by igloo

  • Milestone changed from 6.10 branch to 6.12 branch

comment:4 Changed 5 years ago by simonpj

See also #2387 and #2289

comment:5 Changed 4 years ago by PHO

  • Cc pho@… added

comment:6 Changed 4 years ago by simonmar

  • Difficulty changed from Moderate (1 day) to Moderate (less than a day)

comment:7 Changed 4 years ago by igloo

  • Milestone changed from 6.12 branch to 6.12.3

comment:8 Changed 4 years ago by simonmar

  • Type of failure set to Runtime performance bug

comment:9 Changed 4 years ago by igloo

  • Milestone changed from 6.12.3 to 6.14.1
  • Priority changed from normal to low

comment:10 Changed 3 years ago by igloo

  • Milestone changed from 7.0.1 to 7.0.2

comment:11 Changed 3 years ago by igloo

  • Milestone changed from 7.0.2 to 7.2.1

comment:12 Changed 3 years ago by igloo

  • Milestone changed from 7.2.1 to 7.4.1

comment:13 Changed 2 years ago by igloo

  • Milestone changed from 7.4.1 to 7.6.1
  • Priority changed from low to lowest

comment:14 Changed 20 months ago by igloo

  • Milestone changed from 7.6.1 to 7.6.2

comment:15 Changed 15 months ago by morabbin

Time to revisit this optimization.

Is this more of feature request though?

comment:16 Changed 5 months ago by nomeata

  • Cc mail@… added

comment:17 Changed 5 months ago by nomeata

Looking at examples for code where nested CPR could, would and should work, I found this ticket. At first glance, this looks good. But isn’t it the case that a usage like

main = do
    _ <- facIO 1000

will not do any calculation (besides counting down from 1000, and allocating a lot of thunks)?

This shows that the transformation wanted here cannot be obtained without either offering two variants of the function (would inflate code size), or requires an analysis that calculating the factorial even if not needed will still always be better than allocating the thunks.

comment:18 Changed 5 months ago by nomeata

Maybe this code is a slightly better example for something where we want to avoid the allocations:

facIO :: Int -> IO Int
facIO n = if n < 2 then return 1 else do n' <- facIO (n-1); return $! n*n'

The worker Core currently (7.6.3) looks as

FactIO.$wa [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
FactIO.$wa =
  \ (ww_snv :: GHC.Prim.Int#)
    (w_snx :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# ww_snv 2 of _ {
      GHC.Types.False ->
        case FactIO.$wa (GHC.Prim.-# ww_snv 1) w_snx
        of _ { (# ipv_amL, ipv1_amM #) ->
        case ipv1_amM of _ { GHC.Types.I# y_amd ->
        (# ipv_amL, GHC.Types.I# (GHC.Prim.*# ww_snv y_amd) #)
        }
        };
      GHC.Types.True -> (# w_snx, lvl_roe #)
    }

and there is a very obvious packing/unpacking going on.

A nested CPR should probably first be able to fix this case; and then venture into detecting cheap-and-total-code (as required for the variant without $!) beyond constructors.

comment:19 Changed 5 months ago by nomeata

I’m working on nested cpr now. There a few related tickets (#1885 and #2289), but I’ll report my progress here, mostly because this has the lowest number and the fine factIO example.

Building on a patch from SPJ I now have a branch that works in simple cases. I have some doubts about correctness (i.e. bothDmdType is used both for type applications and cases, but I believe these need to treat the question of convergence differently).

Also, it does not work for the factIO example yet, probably because there is an unboxed tuple to begin with. Need to investigate.

The testsuite goes through (reporting, as it seems, only differences in, well, demand signatures), but nofib’s typecheck fills the memory quickly, so the analysis clearly is not conservative yet.. Need to investigate.

comment:20 Changed 5 months ago by nomeata

Analysing the typecheck divergence yields a very interesting example of how nested CPR can go wrong, which I’d like to document here.

It completely wreak havoc with this innocent function:

repeat :: x -> [x]
repeat x = x : repeat x

(This requires nesting of sum types, but just imagine for this example that [] was a stream data type with only one constructor, if you want).

The analyizer figurs out that its demand signature is DmdType <L,U>tm2(d,tm2(d,tm2(d,d))), i.e. it lazily uses its argument and will, with guaranteed convergence, produce a : constructor, and evaluating the second parameter thereof will also converge to a :, and so on. The signature could actually be infinite; my code cuts them off a a certain depth. This signature is certainly correct.

So it seems this is eligible to a worker-wrapper-transformation. But when we do it we end up with (using non-core patterns for clarity):

$wrepeat :: x -> (# a, a, a, [a] #)
$wrepeat x = case x : repeat x of (x1:x2:x3:r) -> (# x1, x2, x3, r#)

repeat x :: -> [x]
repeat x = case $wrepeat x of (# x1, x2, x3, r#) -> (x1:x2:x3:r)

and now this diverges.

I’m not entirely sure who is at fault here. Since the analysis yields a correct result, most likely the w/w transformation. But that does not seem to have enough information to prevent this. Needs more thinking.

comment:21 Changed 5 months ago by nomeata

One solution, which seems to make sense, is to make the analysis a little bit less precise and more careful with recursive definition. If I make sure that an Id that is a LoopBreaker never has a DmdResult that Converges (by lub’ing it with Diverges), both the example above and the typecheck program of nofib compiles.

Last edited 5 months ago by nomeata (previous) (diff)

comment:22 follow-up: Changed 5 months ago by simonpj

Great example! Graham Hutton would be interested, since he wrote a paper about worker/wrapper (JFP 2009).

A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.

Simom

comment:23 in reply to: ↑ 22 Changed 5 months ago by nomeata

Replying to simonpj:

A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.

Woudn’t that be far too imprecise? That would completely prevent nested CPR in loops, like factIO above...

Nevermind, you mean we should trim the signature on the binders when analyzing their RHSs. But isn’t that almost what I suggest with the LoopBreaker?

Last edited 5 months ago by nomeata (previous) (diff)

comment:24 Changed 5 months ago by nomeata

Run nofib the first time, mixed results:

   binary-trees          -1.3%     +0.6%     +7.7%     +7.5%     +0.0%
         hidden          -1.3%     -0.5%     -0.3%     -0.5%     +0.0%
         mandel          -1.2%     -9.2%      0.10      0.10     +0.0%
       nucleic2          -1.0%     -2.6%      0.11      0.11     +0.0%
reverse-complem          -1.3%     +4.9%      0.21      0.22     +0.0%
--------------------------------------------------------------------------------
            Min          -1.4%     -9.2%     -6.3%     -6.6%     -4.5%
            Max          -1.0%     +4.9%     +7.7%     +7.5%     +0.0%
 Geometric Mean          -1.3%     -0.0%     +0.5%     +0.6%     -0.1%

The gain in mandel seems to come from a a few more unboxed results in Data.Complex, in particular the return types of $w$sphase and $w$smagnitued turns from Double to Double# (and likewise the Float-specializations).

And why the new chance for that? Because $fFloatingComplex (which is a CAF for D# 0.0) and other constants are inlined (and thus no longer shared). Not sure why that part changed, and it seems that this is not related to nestedCPR...

comment:25 Changed 5 months ago by nomeata

  • Owner set to nomeata

Interesting corner case: With nested CPR enabled, GHC.TopHandler in base fails to compile with:

ghc-stage1: panic! (the 'impossible' happened)
  (GHC version 7.7.20131203 for x86_64-unknown-linux):
        mkWWcpr:
    non-algebraic or open body type a{tv a28d} [tv] but CPR type tm1()

The problem is this code:

              case {__pkg_ccall_GC base shutdownHaskellAndExit GHC.Prim.Int#
                                            -> GHC.Prim.Int#
                                            -> GHC.Prim.State# GHC.Prim.RealWorld
                                            -> (# GHC.Prim.State# GHC.Prim.RealWorld #)}_d2VI
                     255 ds_d2VF eta_XG
              of _ [Occ=Dead, Dmd=<L,A>] { (# ds_d2VG [OS=OneShot] #) ->
              (# ds_d2VG, GHC.Tuple.() #)

stemming from

foreign import ccall "shutdownHaskellAndSignal"
  shutdownHaskellAndSignal :: CInt -> CInt -> IO ()

which then is unsafeCoerce#’ed to a, with note explanation

-- we have to use unsafeCoerce# to get the 'IO a' result type, since the
-- compiler doesn't let us declare that as the result type of a foreign export.

There are a few ways to attack this issue:

  1. We allow a in the return type of a foreign export (but maybe overkill to do that just for this code).
  2. We try to forget CPR information when things pass through unsafeCoerce# (but that is just a `cast` in Core, and may be hard to detect reliably. Maybe every coercion that has a UnivCo representational inside?)
  3. We re-write that code in base, e.g. instead of unsafeCoerce use ... >> error "I’m still alive?".
  4. We simply turn the panic into a warning.

I guess 2. makes most sense, because we don’t want other instances of unsafeCoerce# to cause this error either. For now I’ll do 4, to not get stuck compiling base.

comment:26 Changed 5 months ago by nomeata

Also see ticket #8598 (CPR after IO action) which is not directly related to nested cpr, but will become more relevant when want to do nested CPR inside IO, such as in this example.

comment:27 Changed 5 months ago by nomeata

Finally I got the code in a shape good enough for controlled experiments.

I started with a merge of branch better-ho-cardinality and master (at [4025d66/ghc]). This is my baseline.

The next measurement is with the nested CPR analysis, but without changing the worker-wrapper code, i.e. the CPR information is only used as far as before. Here is the result (skipping lines with +0.0%):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.6%     +0.3%     +0.6%     -3.4%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.0%     -8.9%     -8.7%     -3.4%
            Max          +0.1%     +0.6%     +9.1%     +9.1%     +3.0%
 Geometric Mean          +0.0%     +0.0%     -0.4%     -0.4%     -0.0%

I did not look at cacheprof, but I am confident that I did not break anything serious by introducing nested CPR to the analysis.

Next measurement is is with the worker/wrapper code handling nested CPR, but using -fnested-cpr-off to make sure that all CPR information is actually what we had before. I used this to ensure that my mkWWcpr-implementation did not regress over the old one (it turned out that I did mess up in various ways, so this is important):

--------------------------------------------------------------------------------
      cacheprof          +0.0%     -0.6%     -0.3%     -0.6%     +1.8%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.6%     -1.4%     -1.7%     +0.0%
            Max          +0.0%     +0.0%     +2.3%     +2.4%     +1.8%
 Geometric Mean          -0.0%     -0.0%     +0.1%     +0.1%     +0.0%

And now the exciting part: The same code, but now without -fnested-cpr-off:

--------------------------------------------------------------------------------
           anna          +0.5%     +0.1%      0.18      0.19     +0.0%
         boyer2          +0.5%     +0.4%      0.01      0.01     +0.0%
           bspt          +1.1%     +0.2%      0.01      0.01     +0.0%
      cacheprof          +0.5%     -0.7%     +0.6%     +1.2%     -2.6%
       calendar          +0.5%     +0.1%      0.00      0.00     +0.0%
  comp_lab_zift          +0.5%     +0.1%     +0.0%     +0.0%     +0.0%
          infer          +0.4%     -1.2%      0.09      0.09     +0.0%
           para          +0.4%     +0.2%     -1.5%     -1.5%     +0.0%
         prolog          +0.5%     +0.2%      0.00      0.00     +0.0%
        reptile          +0.7%     +0.3%      0.02      0.02     +0.0%

--------------------------------------------------------------------------------
            Min          +0.4%     -1.2%     -8.0%     -8.0%     -4.4%
            Max          +1.1%     +0.4%     +6.9%     +6.9%     +4.2%
 Geometric Mean          +0.5%     -0.0%     +0.0%     -0.0%     -0.1%

so most of the 101 benchmarks are not affected by nested CPR (in its current, prelimary) form at all (besides a quite reliable increase of binary sizes by +0.5% – does Size include .hi-files?). When it changes the Allocs number the effect is small and indecisive.

It may be that fixing #8598 would improve matters (or at least allow nested CPR to occur more often, whether for the better or for the worse, I don’t know.) I might look at that next, but probably off master, as it is an independent feature.

Of course it is also quite likely that the nested CPR analysis needs more tuning by looking at code where we want it to fire.

comment:28 Changed 5 months ago by nomeata

The change of allocations in cacheprof is weird: the output of -ddump-simpl is identical (ignoring unique numbers and strictness annotations). So there is a change hidden in the libraries somewhere... hopefully nothing to worry about.

comment:29 Changed 4 months ago by nomeata

Found one possible reason for the allocation increase: boyer2 was losing a no-let-escape, because the insight from Note [CPR for sum types] was not applied inside a nested CPR result. Fixed that now, nofibs are running.

comment:30 Changed 4 months ago by nomeata

Unfortunately, that was not it. The only visible change left is that addtoLUT in Lisplikefns is getting nestedly CPR’ed, without real gain (but also without loss, one would think).

There is one other lead: With nested CPR, lots of "show" methods get a nested CPR property. E.g. the show method for (,) has now a return demand tm2(tm(d),d) (instead of m2 as before), so the worker returns (# GHC.Prim.Char#, [GHC.Types.Char] #) instead of (# GHC.Types.Char, [GHC.Types.Char] #), likely causing a re-boxing of the ( character.

I’ll disable nested CPR information inside a sum type to confirm this theory.

comment:31 Changed 4 months ago by nomeata

It helps, i.e. removes all increases of Allocs (I don’t trust cacheprof, the allocations there seem to vary even from run to run):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.2%     +1.5%     +0.9%     -1.8%
         gamteb          +0.1%     -0.2%      0.06      0.06     +0.0%
          infer          +0.0%     -1.2%      0.09      0.09     +0.0%
            pic          +0.0%     -0.6%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -1.2%     -5.2%     -5.2%     -4.0%
            Max          +0.6%     +0.2%     +3.3%     +3.0%     +9.1%
 Geometric Mean          +0.0%     -0.0%     +0.1%     +0.1%     +0.0%

So on the one hand: Nice, no regression due to nested CPR, and some improvements (although very minor – is that even significant?)

But the “fix” is not well-targeted, it rather is a heuristic. Unfortunately, I don’t see anything smarter to do here if we do not do whole-program compilation, or do not provide multiple implementations of the the function with varying degrees of CPRness.

An alternative, not well-aimed fix would be to zap the CPR property for all top-level constants (and not just for thunks). This would make CPR much more robust against allocation increase, but that would prevent a lot of CPR where we really want it, i.e. where the constant is on the cold path and the unshared Int# cell is worth having an unboxed result type (i.e. in a naive fac function).

comment:32 Changed 4 months ago by nomeata

Here is a nother reason why nested CPR is not very successful: The requirement of definite termination is not easy to meet. One would think that the extended Euclidean algorithm is a good candidate for nested CPR:

extendedEu2 :: Int -> Int -> (Int, Int)
extendedEu2 a 0 = (1, 0)
extendedEu2 a b = (t, s - q * t)
    where (q, r) = quotRem a b
          (s, t) = extendedEu2 b r

but it is not: With a return type of dm(d,dm(d)) we cannot do more than unbox the tuple. Now with a few iterations between the code and the core, one can find a strictness annotation that makes it work: With

extendedEu :: Int -> Int -> (Int, Int)
extendedEu a 0 = (1, 0)
extendedEu a b = let b' = s - q * t
                 in b' `seq` (t, b')
    where (q, r) = quotRem a b
          (s, t) = extendedEu b r

we infer dm(tm(d),tm(d)) and the worker gets type GHC.Types.Int -> GHC.Prim.Int# -> (# GHC.Prim.Int#, GHC.Prim.Int# #).

So likely nested CPR might help those who are careful to use strictness annotation and use strict data types (which some people are doing almost exclusively), but not a lot with the usual lazy programming.

Last edited 4 months ago by simonpj (previous) (diff)

comment:33 Changed 4 months ago by nomeata

The numbers are a bit more interesting if I enabled nested CPR inside unboxed tuples, i.e. in code involving IO or ST:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         banner          +2.1%     +0.1%      0.00      0.00     +0.0%
      compress2          +2.0%     +0.1%    +11.5%     +9.8%    +20.8%
         expert          +2.1%     +0.1%      0.00      0.00     +0.0%
       fibheaps          +2.0%     -0.3%      0.05      0.05     +0.0%
          fluid          +2.5%     +0.1%      0.01      0.01     +0.0%
         gamteb          +2.1%     -0.2%      0.06      0.06     +0.0%
           grep          +2.0%     +0.1%      0.00      0.00     +0.0%
          infer          +2.0%     -1.2%      0.09      0.09     +0.0%
   k-nucleotide          +1.5%     -6.9%     +0.1%     +0.2%     +0.0%
       maillist          +2.1%     -0.3%      0.10      0.12     +0.0%
            pic          +2.2%     -0.5%      0.01      0.01     +0.0%
           rfib          +2.1%     +0.1%      0.03      0.03     +0.0%
            scs          +2.3%     +0.2%     +1.0%     +1.4%     +0.0%
            tak          +2.1%     -0.1%      0.02      0.02     +0.0%
       treejoin          +2.1%     +0.1%     +0.0%     +0.0%     +0.0%
      wave4main          +2.1%    +11.3%     -0.5%     +0.0%     -7.1%
--------------------------------------------------------------------------------
            Min          +1.5%     -6.9%    -13.2%    -13.3%    -33.3%
            Max          +2.5%    +11.3%    +16.5%    +16.0%    +20.8%
 Geometric Mean          +2.0%     +0.0%     +0.2%     +0.3%     -0.3%

One particular good result (k-nucleotide), and one bad wave4main, and otherwise a slight general improvement. The change in k-nucleotide’s core is too large to spot the reason for the improvement.

Diffing the -ddump-simpl of wave4main shows only one change. It stems from this function

tabulate :: (Int -> x) -> (Int, Int) -> Array Int x
tabulate f (l,u) = array (l,u) [(i, f i) | i <- [l..u]]

where in the (inlined) array a worker for go gets its return type changed from (# GHC.Prim.State# s_aTM, GHC.Arr.Array GHC.Types.Int x_aqE #) to (# GHC.Prim.State# s_aTM, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Array# x_aqE #). Which looks good, but the worker is tail-recursive, and the boxing in the wrapper is not cancelled at the use site of go, so there is nothing gain by moving the constructor applications from the worker to the wrapper.

But some isolated testing indicates that this costs 96 bytes of allocation per run, so I doubt that this is the main cause for the 11% increase; there might be something hidden in the libraries.

comment:34 Changed 4 months ago by nomeata

Trac seems to think I write too much, just lost a somewhat long text here :-(.

I was explaining why nested CPR kills a let-no-espcape in this code taken from scs’s LinearAlgebra:

v_zipWith :: (a -> b -> c) -> Array Int a -> Array Int b -> Array Int c
v_zipWith f a b | compatible = listArray (bounds a) (zipWith f (elems a) (elems b))
                | otherwise  = error "error"
  where
  compatible  = bounds a == bounds b 

Point taken, I’ll be brief: CPR can kill the no-let-escape property; hence more CPR kills more of that. The problem occurs if the $j gets a more detailed CPR type than the expression it is part of, or the expression is somewhere where CPR w/w cannot happen (e.g. in an argument to runStRep). This problem is not new and some work-arounds for it exist in the current code ([CPR for Sun types]). But maybe this needs a generally better story.

(Sidenote: Inlining runSTRep would have helped here, but was disabled by simonmar in 920dbbddf57ff02e0734943bb93dd4cecc5568e0/base.)

comment:35 Changed 4 months ago by nomeata

(Update to sidenote: inlining runSTRep may allow it to get a CPR property, but this does not magically help with join-points, also see NestedCPR/wave4main).

comment:36 Changed 3 months ago by nomeata

(Removed post; looks like one of the trees was dirty in libraries/base from my experiments with inlining runSTRep. Re-running nofib....)

Last edited 3 months ago by nomeata (previous) (diff)

comment:37 Changed 3 months ago by nomeata

New numbers, after uprooting a bug where things unrelated to CPR (namely things alrady returning an unboxed tuple, with nothing to be CPRed inside) would suddenly get an INLINE flag, including some thunks. Finally a measurable positive change in the geometric mean!

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -1.6%     -1.4%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     +0.8%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.3%     +0.3%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
      transform          +0.3%     +0.2%     -2.2%     -2.2%     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -28.2%    -11.2%
            Max          +0.7%     +0.2%     +2.1%     +2.1%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.8%     -2.3%     +0.2%

Surprisingly to me, the bug fix also fixed a +11% increase in wave4main’s allocations, which I thought were caused by join-point losses.

A quick glance into transform shows that f_list_cmp lost its (non-nested) CPR property, accounting for most of the increase according to ticky. Will investigate tomorrow.

comment:38 Changed 3 months ago by nomeata

Fixed the transform issue (when analysing a complex case expression where we do the scrunitee first to feed nested CPR information into the pattern match variable, I was not making sure that the case binder gets at least a flat CPR property, even if the scrunitee has none), and we are finally where Simon expected nested CPR to be: No increased allocations any more (or at least none that are not surpassed by the gains). Still a very small effect, though.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
         awards          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -0.7%     -0.5%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     -4.2%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.0%     +0.0%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -27.3%     -8.0%
            Max          +0.7%     +0.0%     +1.4%     +1.4%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.7%     -2.3%     +0.2%

comment:39 Changed 3 months ago by nomeata

Bummer. While finding out what happened to wave4main, I noticed that I introduced a bug (well, imprecision) where Converges information was lost. This prevented nested CPR in wave4main, and everything looked good. Fixing that bug gives us back the +11.3% for wave4main, and also a +0.2% for scs. So it remains all very heuristical...

The change in scs is hard to pin-point, as there is quite a bit of CPR’ing going on which moves allocations between different ticky-ticky-counters. But in the summary, we see an increase in ALLOC_FUN_gds again, and there are $wgo popping up where there were none, so this indicates lost join points. (Also interesting: ALLOC_CON_ctr goes up, but ALLOC_CON_gds goes down.)

comment:40 Changed 3 months ago by nomeata

Small status update:

My work on nested CPR has now reached somewhat conclusive state. The performance changes are not great, but it is still nice to have, especially if users who would re-write their code for performance would no longer have to do so.

It should not go in for 7.8 – the gains are too small, and there still might be corner cases bugs. So I plan to merge it somewhen after 7.8. Until then, the branch wip/nested-cpr contains the changes in mostly cleaned-up, individually validated patches. I might occasionally rebase that branch to keep the conflicts small.

Note: See TracTickets for help on using tickets.