Opened 10 years ago

Last modified 8 weeks ago

#1600 new task

Optimisation: CPR the results of IO

Reported by: simonmar Owned by:
Priority: lowest Milestone:
Component: Compiler Version: 6.6.1
Keywords: Cc: pho@…, mail@…, maoe@…, akio, osa1, maurerl@…, nomeata
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #8598 Differential Rev(s):
Wiki Page:

Description

GHC currently cannot unbox the result of a function in the IO monad. For example:

facIO :: Int -> IO Int
facIO n = if n < 2 then return 1 else do n' <- facIO (n-1); return (n*n')

The Int argument is unboxed fine, but not the result. It ought to be possible to do this: the CPR analysis needs to somehow look inside the unboxed pair returned by IO-monadic code.
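
For concreteness, here is a hand-written sketch (not GHC output) of the kind of worker/wrapper split one would hope for. The names and exact shape are illustrative only, and note that this sketch is strict in the result, a subtlety that the comments below come back to:

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (Int#, Int (I#), State#, RealWorld, isTrue#, (<#), (-#), (*#))
import GHC.IO (IO (IO))

-- Worker (sketch): the Int result stays unboxed across the recursion.
facIO# :: Int# -> State# RealWorld -> (# State# RealWorld, Int# #)
facIO# n s
  | isTrue# (n <# 2#) = (# s, 1# #)
  | otherwise         = case facIO# (n -# 1#) s of
                          (# s', r #) -> (# s', n *# r #)

-- Wrapper (sketch): boxes the result only once, at the very end.
facIO :: Int -> IO Int
facIO (I# n) = IO (\s -> case facIO# n s of (# s', r #) -> (# s', I# r #))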

Change History (77)

comment:1 Changed 9 years ago by simonmar

Architecture: Unknown → Unknown/Multiple

comment:2 Changed 9 years ago by simonmar

Operating System: Unknown → Unknown/Multiple

comment:3 Changed 8 years ago by igloo

Milestone: 6.10 branch → 6.12 branch

comment:4 Changed 8 years ago by simonpj

See also #2387 and #2289

comment:5 Changed 8 years ago by PHO

Cc: pho@… added

comment:6 Changed 8 years ago by simonmar

difficulty: Moderate (1 day) → Moderate (less than a day)

comment:7 Changed 7 years ago by igloo

Milestone: 6.12 branch → 6.12.3

comment:8 Changed 7 years ago by simonmar

Type of failure: Runtime performance bug

comment:9 Changed 7 years ago by igloo

Milestone: 6.12.3 → 6.14.1
Priority: normal → low

comment:10 Changed 7 years ago by igloo

Milestone: 7.0.1 → 7.0.2

comment:11 Changed 7 years ago by igloo

Milestone: 7.0.2 → 7.2.1

comment:12 Changed 6 years ago by igloo

Milestone: 7.2.1 → 7.4.1

comment:13 Changed 6 years ago by igloo

Milestone: 7.4.1 → 7.6.1
Priority: low → lowest

comment:14 Changed 5 years ago by igloo

Milestone: 7.6.1 → 7.6.2

comment:15 Changed 5 years ago by morabbin

Time to revisit this optimization.

Is this more of a feature request, though?

comment:16 Changed 4 years ago by nomeata

Cc: mail@… added

comment:17 Changed 4 years ago by nomeata

Looking at examples for code where nested CPR could, would and should work, I found this ticket. At first glance, this looks good. But isn’t it the case that a usage like

main = do
    _ <- facIO 1000

will not do any calculation (besides counting down from 1000, and allocating a lot of thunks)?

This shows that the transformation wanted here cannot be had without either offering two variants of the function (which would inflate code size), or an analysis showing that calculating the factorial even when it is not needed is still always better than allocating the thunks.

comment:18 Changed 4 years ago by nomeata

Maybe this code is a slightly better example for something where we want to avoid the allocations:

facIO :: Int -> IO Int
facIO n = if n < 2 then return 1 else do n' <- facIO (n-1); return $! n*n'

The worker Core currently (7.6.3) looks as follows:

FactIO.$wa [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Types.Int #)
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
FactIO.$wa =
  \ (ww_snv :: GHC.Prim.Int#)
    (w_snx :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case GHC.Prim.<# ww_snv 2 of _ {
      GHC.Types.False ->
        case FactIO.$wa (GHC.Prim.-# ww_snv 1) w_snx
        of _ { (# ipv_amL, ipv1_amM #) ->
        case ipv1_amM of _ { GHC.Types.I# y_amd ->
        (# ipv_amL, GHC.Types.I# (GHC.Prim.*# ww_snv y_amd) #)
        }
        };
      GHC.Types.True -> (# w_snx, lvl_roe #)
    }

and there is a very obvious packing/unpacking going on.

Nested CPR should probably first be able to fix this case, and then venture into detecting cheap-and-total code (as required for the variant without $!) beyond constructors.

comment:19 Changed 4 years ago by nomeata

I’m working on nested CPR now. There are a few related tickets (#1885 and #2289), but I’ll report my progress here, mostly because this has the lowest number and the fine factIO example.

Building on a patch from SPJ I now have a branch that works in simple cases. I have some doubts about correctness (i.e. bothDmdType is used both for type applications and cases, but I believe these need to treat the question of convergence differently).

Also, it does not work for the factIO example yet, probably because there is an unboxed tuple to begin with. Need to investigate.

The testsuite goes through (reporting, as it seems, only differences in, well, demand signatures), but nofib’s typecheck fills the memory quickly, so the analysis is clearly not conservative yet. Need to investigate.

comment:20 Changed 4 years ago by nomeata

Analysing the typecheck divergence yields a very interesting example of how nested CPR can go wrong, which I’d like to document here.

It completely wreaks havoc with this innocent function:

repeat :: x -> [x]
repeat x = x : repeat x

(This requires nesting of sum types, but just imagine for this example that [] was a stream data type with only one constructor, if you want).

The analyser figures out that its demand signature is DmdType <L,U>tm2(d,tm2(d,tm2(d,d))), i.e. it lazily uses its argument and will, with guaranteed convergence, produce a : constructor, and evaluating the second parameter thereof will also converge to a :, and so on. The signature could actually be infinite; my code cuts them off at a certain depth. This signature is certainly correct.

So it seems this is eligible for a worker/wrapper transformation. But when we do it we end up with (using non-Core patterns for clarity):

$wrepeat :: x -> (# x, x, x, [x] #)
$wrepeat x = case x : repeat x of (x1:x2:x3:r) -> (# x1, x2, x3, r #)

repeat :: x -> [x]
repeat x = case $wrepeat x of (# x1, x2, x3, r #) -> (x1:x2:x3:r)

and now this diverges: the worker’s pattern match forces the tail repeat x, which calls the worker again, which forces its tail in turn, so no call ever returns.

I’m not entirely sure who is at fault here. Since the analysis yields a correct result, most likely the w/w transformation. But that does not seem to have enough information to prevent this. Needs more thinking.

comment:21 Changed 4 years ago by nomeata

One solution, which seems to make sense, is to make the analysis a little bit less precise and more careful with recursive definitions. If I make sure that an Id that is a LoopBreaker never has a DmdResult that Converges (by lub’ing it with Diverges), both the example above and the typecheck program of nofib compile.
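
To illustrate the intuition with a self-contained toy (the types here are invented for the sketch; they are not GHC’s actual DmdResult):

-- Toy termination lattice: a loop breaker's own signature is weakened
-- before it is used in the fixed-point iteration, so no recursive use
-- can ever claim guaranteed convergence.
data Term = Converges | MayDiverge deriving (Eq, Show)

lubTerm :: Term -> Term -> Term
lubTerm Converges Converges = Converges
lubTerm _         _         = MayDiverge

trimLoopBreaker :: Term -> Term
trimLoopBreaker t = t `lubTerm` MayDiverge   -- always MayDiverge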

Last edited 4 years ago by nomeata (previous) (diff)

comment:22 Changed 4 years ago by simonpj

Great example! Graham Hutton would be interested, since he wrote a paper about worker/wrapper (JFP 2009).

A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.

Simon

comment:23 in reply to:  22 Changed 4 years ago by nomeata

Replying to simonpj:

A neat approach might be to say that, when doing the fixpoint on a recursive group, trim off any nested CPR info from the demand signatures we use when processing the RHSs.

Wouldn’t that be far too imprecise? That would completely prevent nested CPR in loops, like factIO above...

Never mind, you mean we should trim the signature on the binders when analysing their RHSs. But isn’t that almost what I suggest with the LoopBreaker?

Last edited 4 years ago by nomeata (previous) (diff)

comment:24 Changed 4 years ago by nomeata

Ran nofib for the first time; mixed results:

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
   binary-trees          -1.3%     +0.6%     +7.7%     +7.5%     +0.0%
         hidden          -1.3%     -0.5%     -0.3%     -0.5%     +0.0%
         mandel          -1.2%     -9.2%      0.10      0.10     +0.0%
       nucleic2          -1.0%     -2.6%      0.11      0.11     +0.0%
reverse-complem          -1.3%     +4.9%      0.21      0.22     +0.0%
--------------------------------------------------------------------------------
            Min          -1.4%     -9.2%     -6.3%     -6.6%     -4.5%
            Max          -1.0%     +4.9%     +7.7%     +7.5%     +0.0%
 Geometric Mean          -1.3%     -0.0%     +0.5%     +0.6%     -0.1%

The gain in mandel seems to come from a few more unboxed results in Data.Complex; in particular, the return types of $w$sphase and $w$smagnitude turn from Double to Double# (and likewise the Float specialisations).

And why the new opportunity for that? Because $fFloatingComplex (which is a CAF for D# 0.0) and other constants are inlined (and thus no longer shared). Not sure why that part changed, and it seems that this is not related to nested CPR...

comment:25 Changed 4 years ago by nomeata

Owner: set to nomeata

Interesting corner case: With nested CPR enabled, GHC.TopHandler in base fails to compile with:

ghc-stage1: panic! (the 'impossible' happened)
  (GHC version 7.7.20131203 for x86_64-unknown-linux):
        mkWWcpr:
    non-algebraic or open body type a{tv a28d} [tv] but CPR type tm1()

The problem is this code:

              case {__pkg_ccall_GC base shutdownHaskellAndExit GHC.Prim.Int#
                                            -> GHC.Prim.Int#
                                            -> GHC.Prim.State# GHC.Prim.RealWorld
                                            -> (# GHC.Prim.State# GHC.Prim.RealWorld #)}_d2VI
                     255 ds_d2VF eta_XG
              of _ [Occ=Dead, Dmd=<L,A>] { (# ds_d2VG [OS=OneShot] #) ->
              (# ds_d2VG, GHC.Tuple.() #)

stemming from

foreign import ccall "shutdownHaskellAndSignal"
  shutdownHaskellAndSignal :: CInt -> CInt -> IO ()

which is then unsafeCoerce#’d to IO a, with this explanatory note:

-- we have to use unsafeCoerce# to get the 'IO a' result type, since the
-- compiler doesn't let us declare that as the result type of a foreign export.

There are a few ways to attack this issue:

  1. We allow a in the return type of a foreign export (but maybe overkill to do that just for this code).
  2. We try to forget CPR information when things pass through unsafeCoerce# (but that is just a `cast` in Core, and may be hard to detect reliably. Maybe every coercion that has a UnivCo representational inside?)
  3. We re-write that code in base, e.g. instead of unsafeCoerce use ... >> error "I’m still alive?".
  4. We simply turn the panic into a warning.

I guess 2. makes most sense, because we don’t want other instances of unsafeCoerce# to cause this error either. For now I’ll do 4, to not get stuck compiling base.

comment:26 Changed 4 years ago by nomeata

Also see ticket #8598 (CPR after IO action), which is not directly related to nested CPR, but will become more relevant when we want to do nested CPR inside IO, such as in this example.

comment:27 Changed 4 years ago by nomeata

Finally I got the code in a shape good enough for controlled experiments.

I started with a merge of branch better-ho-cardinality and master (at [4025d66/ghc]). This is my baseline.

The next measurement is with the nested CPR analysis, but without changing the worker-wrapper code, i.e. the CPR information is only used as far as before. Here is the result (skipping lines with +0.0%):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.6%     +0.3%     +0.6%     -3.4%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.0%     -8.9%     -8.7%     -3.4%
            Max          +0.1%     +0.6%     +9.1%     +9.1%     +3.0%
 Geometric Mean          +0.0%     +0.0%     -0.4%     -0.4%     -0.0%

I did not look at cacheprof, but I am confident that I did not break anything serious by introducing nested CPR to the analysis.

The next measurement is with the worker/wrapper code handling nested CPR, but using -fnested-cpr-off to make sure that all CPR information is actually what we had before. I used this to ensure that my mkWWcpr implementation did not regress relative to the old one (it turned out that I did mess up in various ways, so this is important):

--------------------------------------------------------------------------------
      cacheprof          +0.0%     -0.6%     -0.3%     -0.6%     +1.8%
--------------------------------------------------------------------------------
            Min          +0.0%     -0.6%     -1.4%     -1.7%     +0.0%
            Max          +0.0%     +0.0%     +2.3%     +2.4%     +1.8%
 Geometric Mean          -0.0%     -0.0%     +0.1%     +0.1%     +0.0%

And now the exciting part: The same code, but now without -fnested-cpr-off:

--------------------------------------------------------------------------------
           anna          +0.5%     +0.1%      0.18      0.19     +0.0%
         boyer2          +0.5%     +0.4%      0.01      0.01     +0.0%
           bspt          +1.1%     +0.2%      0.01      0.01     +0.0%
      cacheprof          +0.5%     -0.7%     +0.6%     +1.2%     -2.6%
       calendar          +0.5%     +0.1%      0.00      0.00     +0.0%
  comp_lab_zift          +0.5%     +0.1%     +0.0%     +0.0%     +0.0%
          infer          +0.4%     -1.2%      0.09      0.09     +0.0%
           para          +0.4%     +0.2%     -1.5%     -1.5%     +0.0%
         prolog          +0.5%     +0.2%      0.00      0.00     +0.0%
        reptile          +0.7%     +0.3%      0.02      0.02     +0.0%

--------------------------------------------------------------------------------
            Min          +0.4%     -1.2%     -8.0%     -8.0%     -4.4%
            Max          +1.1%     +0.4%     +6.9%     +6.9%     +4.2%
 Geometric Mean          +0.5%     -0.0%     +0.0%     -0.0%     -0.1%

so most of the 101 benchmarks are not affected by nested CPR (in its current, preliminary form) at all (besides a quite reliable increase of binary sizes by +0.5% – does Size include .hi files?). When it changes the Allocs number, the effect is small and indecisive.

It may be that fixing #8598 would improve matters (or at least allow nested CPR to occur more often, whether for the better or for the worse, I don’t know.) I might look at that next, but probably off master, as it is an independent feature.

Of course it is also quite likely that the nested CPR analysis needs more tuning by looking at code where we want it to fire.

comment:28 Changed 4 years ago by nomeata

The change of allocations in cacheprof is weird: the output of -ddump-simpl is identical (ignoring unique numbers and strictness annotations). So there is a change hidden in the libraries somewhere... hopefully nothing to worry about.

comment:29 Changed 4 years ago by nomeata

Found one possible reason for the allocation increase: boyer2 was losing a let-no-escape, because the insight from Note [CPR for sum types] was not applied inside a nested CPR result. Fixed that now; nofibs are running.

comment:30 Changed 4 years ago by nomeata

Unfortunately, that was not it. The only visible change left is that addtoLUT in Lisplikefns is getting nestedly CPR’ed, without real gain (but also without loss, one would think).

There is one other lead: with nested CPR, lots of show methods get a nested CPR property. E.g. the show method for (,) now has a return demand tm2(tm(d),d) (instead of m2 as before), so the worker returns (# GHC.Prim.Char#, [GHC.Types.Char] #) instead of (# GHC.Types.Char, [GHC.Types.Char] #), likely causing a re-boxing of the '(' character.

I’ll disable nested CPR information inside a sum type to confirm this theory.

comment:31 Changed 4 years ago by nomeata

It helps, i.e. removes all increases of Allocs (I don’t trust cacheprof, the allocations there seem to vary even from run to run):

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          +0.0%     +0.2%     +1.5%     +0.9%     -1.8%
         gamteb          +0.1%     -0.2%      0.06      0.06     +0.0%
          infer          +0.0%     -1.2%      0.09      0.09     +0.0%
            pic          +0.0%     -0.6%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -1.2%     -5.2%     -5.2%     -4.0%
            Max          +0.6%     +0.2%     +3.3%     +3.0%     +9.1%
 Geometric Mean          +0.0%     -0.0%     +0.1%     +0.1%     +0.0%

So on the one hand: Nice, no regression due to nested CPR, and some improvements (although very minor – is that even significant?)

But the “fix” is not well-targeted; it is rather a heuristic. Unfortunately, I don’t see anything smarter to do here if we do not do whole-program compilation, or do not provide multiple implementations of the function with varying degrees of CPR-ness.

An alternative, not well-aimed fix would be to zap the CPR property for all top-level constants (and not just for thunks). This would make CPR much more robust against allocation increases, but it would prevent a lot of CPR where we really want it, i.e. where the constant is on the cold path and the unshared Int# cell is a price worth paying for an unboxed result type (e.g. in a naive fac function).
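
For reference, the naive fac meant here is something like:

-- Naive factorial: CPR gives its worker an Int# result; the price is that
-- the constant 1 in the (cold) base case is re-allocated instead of shared.
fac :: Int -> Int
fac n = if n < 2 then 1 else n * fac (n - 1)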

comment:32 Changed 4 years ago by nomeata

Here is another reason why nested CPR is not very successful: the requirement of definite termination is not easy to meet. One would think that the extended Euclidean algorithm is a good candidate for nested CPR:

extendedEu2 :: Int -> Int -> (Int, Int)
extendedEu2 a 0 = (1, 0)
extendedEu2 a b = (t, s - q * t)
    where (q, r) = quotRem a b
          (s, t) = extendedEu2 b r

but it is not: With a return type of dm(d,dm(d)) we cannot do more than unbox the tuple. Now with a few iterations between the code and the core, one can find a strictness annotation that makes it work: With

extendedEu :: Int -> Int -> (Int, Int)
extendedEu a 0 = (1, 0)
extendedEu a b = let b' = s - q * t
                 in b' `seq` (t, b')
    where (q, r) = quotRem a b
          (s, t) = extendedEu b r

we infer dm(tm(d),tm(d)) and the worker gets type GHC.Types.Int -> GHC.Prim.Int# -> (# GHC.Prim.Int#, GHC.Prim.Int# #).

So nested CPR will likely help those who are careful to use strictness annotations and strict data types (which some people do almost exclusively), but not a lot with the usual lazy programming.
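
For illustration, here is a hypothetical strict-pair variant of the example above: the strict fields play the role of the manual seq, so building the result already evaluates both components and nested CPR has a chance to unbox them without touching the function body.

data P = P !Int !Int

extendedEuP :: Int -> Int -> P
extendedEuP a 0 = P 1 0
extendedEuP a b = P t (s - q * t)
    where (q, r) = quotRem a b
          P s t  = extendedEuP b r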

Last edited 4 years ago by simonpj (previous) (diff)

comment:33 Changed 4 years ago by nomeata

The numbers are a bit more interesting if I enable nested CPR inside unboxed tuples, i.e. in code involving IO or ST:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         banner          +2.1%     +0.1%      0.00      0.00     +0.0%
      compress2          +2.0%     +0.1%    +11.5%     +9.8%    +20.8%
         expert          +2.1%     +0.1%      0.00      0.00     +0.0%
       fibheaps          +2.0%     -0.3%      0.05      0.05     +0.0%
          fluid          +2.5%     +0.1%      0.01      0.01     +0.0%
         gamteb          +2.1%     -0.2%      0.06      0.06     +0.0%
           grep          +2.0%     +0.1%      0.00      0.00     +0.0%
          infer          +2.0%     -1.2%      0.09      0.09     +0.0%
   k-nucleotide          +1.5%     -6.9%     +0.1%     +0.2%     +0.0%
       maillist          +2.1%     -0.3%      0.10      0.12     +0.0%
            pic          +2.2%     -0.5%      0.01      0.01     +0.0%
           rfib          +2.1%     +0.1%      0.03      0.03     +0.0%
            scs          +2.3%     +0.2%     +1.0%     +1.4%     +0.0%
            tak          +2.1%     -0.1%      0.02      0.02     +0.0%
       treejoin          +2.1%     +0.1%     +0.0%     +0.0%     +0.0%
      wave4main          +2.1%    +11.3%     -0.5%     +0.0%     -7.1%
--------------------------------------------------------------------------------
            Min          +1.5%     -6.9%    -13.2%    -13.3%    -33.3%
            Max          +2.5%    +11.3%    +16.5%    +16.0%    +20.8%
 Geometric Mean          +2.0%     +0.0%     +0.2%     +0.3%     -0.3%

One particularly good result (k-nucleotide), one bad one (wave4main), and otherwise a slight general improvement. The change in k-nucleotide’s Core is too large to spot the reason for the improvement.

Diffing the -ddump-simpl of wave4main shows only one change. It stems from this function

tabulate :: (Int -> x) -> (Int, Int) -> Array Int x
tabulate f (l,u) = array (l,u) [(i, f i) | i <- [l..u]]

where, in the (inlined) array, a worker for go gets its return type changed from (# GHC.Prim.State# s_aTM, GHC.Arr.Array GHC.Types.Int x_aqE #) to (# GHC.Prim.State# s_aTM, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Int#, GHC.Prim.Array# x_aqE #). This looks good, but the worker is tail-recursive, and the boxing in the wrapper is not cancelled at the use site of go, so nothing is gained by moving the constructor applications from the worker to the wrapper.

But some isolated testing indicates that this costs 96 bytes of allocation per run, so I doubt that this is the main cause for the 11% increase; there might be something hidden in the libraries.

comment:34 Changed 4 years ago by nomeata

Trac seems to think I write too much, just lost a somewhat long text here :-(.

I was explaining why nested CPR kills a let-no-escape in this code taken from scs’s LinearAlgebra:

v_zipWith :: (a -> b -> c) -> Array Int a -> Array Int b -> Array Int c
v_zipWith f a b | compatible = listArray (bounds a) (zipWith f (elems a) (elems b))
                | otherwise  = error "error"
  where
  compatible  = bounds a == bounds b 

Point taken, I’ll be brief: CPR can kill the let-no-escape property; hence more CPR kills more of them. The problem occurs if the $j gets a more detailed CPR type than the expression it is part of, or the expression is somewhere where CPR w/w cannot happen (e.g. in an argument to runSTRep). This problem is not new, and some work-arounds for it exist in the current code (Note [CPR for sum types]). But maybe this needs a generally better story.

(Sidenote: Inlining runSTRep would have helped here, but was disabled by simonmar in 920dbbddf57ff02e0734943bb93dd4cecc5568e0/base.)

comment:35 Changed 4 years ago by nomeata

(Update to sidenote: inlining runSTRep may allow it to get a CPR property, but this does not magically help with join-points, also see NestedCPR/wave4main).

comment:36 Changed 4 years ago by nomeata

(Removed post; looks like one of the trees was dirty in libraries/base from my experiments with inlining runSTRep. Re-running nofib....)

Last edited 4 years ago by nomeata (previous) (diff)

comment:37 Changed 4 years ago by nomeata

New numbers, after uprooting a bug where things unrelated to CPR (namely things already returning an unboxed tuple, with nothing to be CPR’ed inside) would suddenly get an INLINE flag, including some thunks. Finally a measurable positive change in the geometric mean!

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -1.6%     -1.4%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     +0.8%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.3%     +0.3%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
      transform          +0.3%     +0.2%     -2.2%     -2.2%     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -28.2%    -11.2%
            Max          +0.7%     +0.2%     +2.1%     +2.1%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.8%     -2.3%     +0.2%

Surprisingly to me, the bug fix also fixed a +11% increase in wave4main’s allocations, which I had thought was caused by join-point losses.

A quick glance into transform shows that f_list_cmp lost its (non-nested) CPR property, accounting for most of the increase according to ticky. Will investigate tomorrow.

comment:38 Changed 4 years ago by nomeata

Fixed the transform issue (when analysing a complex case expression where we do the scrutinee first to feed nested CPR information into the pattern-match variables, I was not making sure that the case binder gets at least a flat CPR property, even if the scrutinee has none), and we are finally where Simon expected nested CPR to be: no increased allocations any more (or at least none that are not surpassed by the gains). Still a very small effect, though.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           ansi          +0.3%     -0.1%      0.00      0.00     +0.0%
         awards          +0.3%     -0.1%      0.00      0.00     +0.0%
      compress2          +0.6%     -0.8%      0.11      0.11     -8.0%
       fibheaps          +0.3%     -0.3%      0.03      0.03     +0.0%
         gamteb          +0.4%     -0.2%      0.04      0.04     +0.0%
           grep          +0.3%     -0.1%      0.00      0.00     +0.0%
            hpg          +0.4%     -3.0%      0.13      0.13     +0.0%
          infer          +0.3%     -1.2%      0.03      0.03     +0.0%
   k-nucleotide          +0.2%     -6.9%     -0.7%     -0.5%     +0.0%
       maillist          +0.4%     -0.8%      0.04      0.04     -4.2%
        mkhprog          +0.4%     -0.3%      0.00      0.00     +0.0%
            pic          +0.3%     -0.7%      0.00      0.00     +0.0%
         pretty          +0.4%     -0.1%      0.00      0.00     +0.0%
           rfib          +0.3%     -0.2%      0.01      0.01     +0.0%
            scc          +0.3%     -0.1%      0.00      0.00     +0.0%
  spectral-norm          +0.4%     -0.1%     +0.0%     +0.0%     +0.0%
         sphere          +0.4%     -4.7%      0.04      0.04     +0.0%
         symalg          +0.3%     -0.1%      0.01      0.01     +0.0%
            tak          +0.3%     -0.3%      0.01      0.01     +0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.9%     -5.3%    -27.3%     -8.0%
            Max          +0.7%     +0.0%     +1.4%     +1.4%    +50.0%
 Geometric Mean          +0.3%     -0.2%     -0.7%     -2.3%     +0.2%

comment:39 Changed 4 years ago by nomeata

Bummer. While finding out what happened to wave4main, I noticed that I had introduced a bug (well, an imprecision) where Converges information was lost. This prevented nested CPR in wave4main, and everything looked good. Fixing that bug gives us back the +11.3% for wave4main, and also a +0.2% for scs. So it all remains very heuristic...

The change in scs is hard to pin-point, as there is quite a bit of CPR’ing going on which moves allocations between different ticky-ticky-counters. But in the summary, we see an increase in ALLOC_FUN_gds again, and there are $wgo popping up where there were none, so this indicates lost join points. (Also interesting: ALLOC_CON_ctr goes up, but ALLOC_CON_gds goes down.)

comment:40 Changed 4 years ago by nomeata

Small status update:

My work on nested CPR has now reached a somewhat conclusive state. The performance changes are not great, but it is still nice to have, especially since users who would otherwise rewrite their code for performance would no longer have to do so.

It should not go in for 7.8 – the gains are too small, and there might still be corner-case bugs. So I plan to merge it sometime after 7.8. Until then, the branch wip/nested-cpr contains the changes in mostly cleaned-up, individually validated patches. I might occasionally rebase that branch to keep the conflicts small.

comment:41 Changed 3 years ago by maoe

Cc: maoe@… added

comment:42 Changed 3 years ago by thoughtpolice

Milestone: 7.6.2 → 7.10.1

Moving to 7.10.1.

comment:43 Changed 3 years ago by nomeata

Owner: nomeata deleted

I think the conclusion was that Nested CPR was not worth it. The branch is still there if someone wants to pick it up. Unassigning.

comment:44 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:45 Changed 3 years ago by thoughtpolice

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:46 in reply to:  43 Changed 2 years ago by akio

Replying to nomeata:

I think the conclusion was that Nested CPR was not worth it. The branch is still there if someone wants to pick it up. Unassigning.

Why is it not worth it?

If there is a chance that the branch can be improved to the point that it can be merged, I'm interested in trying. The lack of nested CPR has been the most frequent reason that I have to use unboxed types manually.

comment:47 Changed 2 years ago by akio

Cc: akio added

comment:48 Changed 2 years ago by nomeata

I do not know how much the code changed in master... just give it a try!

comment:49 Changed 2 years ago by simonpj

Yes, do rebase wip/nested-cpr on master, do a like-for-like nofib comparison. I'd quite like to see this land.

Simon

comment:50 Changed 2 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:51 Changed 20 months ago by thomie

Milestone: 8.0.1

comment:52 Changed 11 months ago by akio

I've rebased nomeata's branch: https://github.com/takano-akio/ghc/compare/afdde48...f8eb150

Some nofib results:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         mandel          +1.9%    -25.0%     0.047     0.047      0.0%
            gcd          +2.0%    -21.6%     0.030     0.031      0.0%
   k-nucleotide          +1.7%    -18.0%     +3.4%     +2.7%      0.0%
      wave4main          +2.0%    +12.3%     0.179     0.179      0.0%
          kahan          +2.0%     +7.8%     +1.4%     +1.3%      0.0%
          solid          +2.0%     -6.4%     0.081     0.081      0.0%
         sphere          +2.0%     +5.8%     0.040     0.040      0.0%
            hpg          +2.0%     +4.1%     0.102     0.102      0.0%
         symalg          +1.9%     +3.6%     0.007     0.007      0.0%
         gamteb          +2.4%     -4.8%     0.026     0.026      0.0%
--------------------------------------------------------------------------------
            Min          +1.5%    -25.0%     -0.6%     -0.4%    -20.0%
            Max          +2.8%    +12.3%    +12.7%    +12.8%    +12.0%
 Geometric Mean          +2.0%     -0.5%     +5.0%     +4.7%     +0.2%

I'll look into the allocation regressions next.

comment:53 Changed 11 months ago by nomeata

Oh, that’s an old branch. I’m happy that a second pair of eyes looks at it. How tedious was it to rebase?

comment:54 in reply to:  53 Changed 11 months ago by akio

Replying to nomeata:

How tedious was it to rebase?

It took me a few days, but in the process I learned a lot about nested CPR and about the demand analyzer in general.

comment:55 Changed 11 months ago by simonpj

Cc: osa1 added

I wonder why there's a consistent 2% increase in binary size; I was not expecting that.

I have not looked at the code... when you are ready, seeing it all as one Phab diff (not lots) would be helpful.

Omer may be thinking about nested CPR, and has an implementation on a branch somewhere. This is a particular instance of nested CPR, so it'd be good to sync up with him (cc'd).

Simon

comment:56 Changed 11 months ago by akio

Regarding wave4main, the main problem seems to be destruction of a join point.

The program contains an expression that looks like this:

runRW#
  (\ s ->
      letrec
        go s1 =
          let ...
          in (# s2, Array a b c #)
      in go s)

After a nested CPR transformation, it becomes:

runRW#
  (\ s ->
      letrec
        $wgo s1 =
          let ...
          in (# s2, a, b, c #)
      in case $wgo s of
        (# s3, a', b', c' #) -> (# s3, Array a' b' c' #))

destroying the join-point property of go. This is basically the same as what @nomeata found 2 years ago.

I see two ways to move forward:

  1. Wait for SequentCore to be implemented.
  2. Have special cases for runRW# in the CPR analyzer and in the simplifier, as discussed in https://ghc.haskell.org/trac/ghc/ticket/10678#comment:10

I'm mildly interested in doing (2), but I don't think I fully understand the idea. In particular, the above comment mentions an example expression case (runRW e) of BN# farr -> farr that the simplifier would be able to transform. However, since runRW# returns an unboxed pair rather than a single boxed value, it seems that it would have to spot a much more complicated pattern to accomplish this transformation.

comment:57 Changed 11 months ago by simonpj

Cc: maurerl@… added

Let's do (1); it's nearly ready!

Akio, Luke (now in cc), can you work together on this? Luke has a pretty stable join-point branch which never destroys join points. Best just to build on that I think.

Simon

comment:58 Changed 8 months ago by akio

I've rebased the branch on top of the join points work. In nofib, allocations are generally down but the difference is not very big:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           bspt          +0.0%     +0.1%     0.007     0.007      0.0%
      compress2          +1.5%     -0.9%     0.158     0.158     -3.7%
          fluid          +1.6%     -0.2%     0.007     0.007      0.0%
          infer          +0.0%     -1.2%     0.039     0.039      0.0%
            pic          +0.1%     -0.6%     0.005     0.005      0.0%
          solid          +0.0%     -6.6%     0.159     0.159      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%     -6.6%    -28.7%    -28.8%     -3.7%
            Max          +1.6%     +0.1%     +4.3%     +4.3%     +6.7%
 Geometric Mean          +0.1%     -0.1%    -14.7%    -14.5%     +0.0%

I'll look more carefully to see why the improvement is so small. I may have messed up something while rebasing.

comment:59 Changed 8 months ago by simonpj

Indeed. Start with one or two poster-child examples to show what the goal is; and check they work as expected. Only then go in with nofib.

Also, this is a long ticket. It would be good to write a description somewhere (on the wiki page?) of what exactly the patch does.

comment:60 Changed 7 months ago by akio

It seems like the above results were bogus, because they were obtained with libraries compiled by the wrong compiler (I had stage=2 in build.mk. I removed it and typed make. Is that not sufficient to cause a full build?).

After a clean build, the allocation numbers look very good:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      compress2          +2.0%     -0.9%     0.170     0.170     -3.7%
           fft2          +1.0%     -0.2%     0.078     0.078      0.0%
          fluid          +2.1%     -0.4%     0.006     0.006      0.0%
         gamteb          +0.4%     -4.5%     0.032     0.032      0.0%
            gcd          +0.7%    -21.4%     0.046     0.046      0.0%
          infer          -0.1%     -1.2%     0.054     0.054      0.0%
        integer          +1.0%     -1.5%    -15.0%    -15.0%      0.0%
         mandel          +1.0%    -24.4%     0.051     0.051      0.0%
           rfib          +1.0%     -0.3%     0.012     0.012      0.0%
          solid          +0.4%     -6.6%     0.130     0.131      0.0%
--------------------------------------------------------------------------------
            Min          -0.2%    -24.4%    -21.1%    -21.3%     -3.7%
            Max          +2.1%     +0.1%    +39.3%    +39.3%    +17.3%
 Geometric Mean          +0.2%     -0.7%     -4.9%     -4.9%     +0.1%

Should I worry about the increase in code size? If not, I'll clean up the branch and submit a Diff.

comment:61 Changed 7 months ago by simonmar

(I had stage=2 in build.mk. I removed it and typed make. Is it not sufficient to cause a full build?).

make won't necessarily rebuild your libraries, due to GHC's recompilation avoidance. It knows about changes in the source code and the flags, but it doesn't know about changes in GHC itself. So when I'm making measurements like this, I always delete the library object files manually:

find libraries/*/dist-install | grep 'o$' | xargs rm
Last edited 7 months ago by simonmar (previous) (diff)

comment:62 in reply to:  61 Changed 7 months ago by akio

Replying to simonmar:

make won't necessarily rebuild your libraries, due to GHC's recompilation avoidance.

Oh I see... Thank you for the tip!

comment:63 Changed 7 months ago by simonpj

Good looking numbers esp for mandel, gcd. Can you dig in to see what is really happening there? How does the inner loop look?

For code size, it's worth investigating to get insight. Look at the "module sizes" in the nofib-analyse log, and take a look at modules that got significantly bigger. Is that code-size increase because of an actual optimisation or is it just accidental?

comment:64 Changed 7 months ago by akio

mandel and gcd

The main difference seems to come from this change, which makes integer-gmp allocate less by inlining wrappers for S#, Jp#, etc. Actually, this small change seems to be responsible for much of the allocation improvement in this branch:

            gcd          +0.8%    -21.4%     0.039     0.040      0.0%
        integer          +1.1%     -1.5%     +5.0%     +5.1%      0.0%
         mandel          +1.1%    -24.4%     0.070     0.070      0.0%
--------------------------------------------------------------------------------
            Min          -0.0%    -24.4%    -38.7%    -38.5%      0.0%
            Max          +1.1%      0.0%    +13.7%    +13.6%    +14.4%
 Geometric Mean          +0.3%     -0.6%     -6.9%     -6.9%     +0.1%

Perhaps I should try to get this change merged, separately from the rest of the nested CPR work.

Binary size increase

I had a brief look at nofib/real/fluid/Jcb_method.hs, which showed a +60% increase in the binary size, but I wasn't able to figure out why the change happened. SpecConstr now seems to duplicate a big function, but I don't know what caused this (yet).

Wiki page

I think the content under wiki:NestedCPR is mostly valid, because all I have done so far is essentially just to rebase nomeata's branch on top of the join-point commit.

comment:65 Changed 7 months ago by simonpj

The main difference seems to come from ​this change...

Interesting! There are two changes in that commit:

  • Making data con wrappers have ug_unsat_ok = unSaturatedOk. I buy this. Can you try the effect of this change alone? It's entirely unrelated to nested-CPR. It would be good to isolate that perf bump from the perf changes due to nested CPR.
  • Making data con wrapper have ug_boring_ok = boringCxtOk. I'm not sure I buy this in full. In Note [Inline data constructor wrappers aggresively] there's a claim that we get better nested-CPR info. That's true if we have a polymorphic bang argument (e.g. data T a = MkT !a) but for a monomorphic one like data S = MkS !Int we'll unpack it.

So I think the benefit comes only for polymorphic bangs. And there is a cost to aggressive inlining, so we don't want to do it gratuitously.

comment:66 Changed 7 months ago by akio

There are two changes in that commit:

Actually the only change is the latter one. The former change is already in HEAD, as of 2be364ac8c.

Making data con wrapper have ug_boring_ok = boringCxtOk. I'm not sure I buy this in full. In Note [Inline data constructor wrappers aggresively] there's a claim that we get better nested-CPR info.

I don't think the improvement has anything to do with nested CPR, because I'm testing this change in isolation, without any other changes from the nested CPR branch.

The big difference in nofib comes from the GHC.Integer.Type module in integer-gmp. This module has this definition:

data Integer  = S#                !Int#
                -- ^ iff value in @[minBound::'Int', maxBound::'Int']@ range
              | Jp# {-# UNPACK #-} !BigNat
                -- ^ iff value in @]maxBound::'Int', +inf[@ range
              | Jn# {-# UNPACK #-} !BigNat
                -- ^ iff value in @]-inf, minBound::'Int'[@ range

Note the redundant bang in the S# constructor. This causes a wrapper for this constructor to be created:

GHC.Integer.Type.$WS# =
  \ (dt [Occ=Once] :: Int#) ->
    case dt of dt { __DEFAULT -> GHC.Integer.Type.S# dt }

And this wrapper remains un-inlined in various places like:

quotRemInteger =
  \ (n :: Integer) (ds :: Integer) ->
    join {
      fail1 [Dmd=<L,1*C1(U(U,U))>] :: Void# -> (# Integer, Integer #)
      [LclId[JoinId(1)], Arity=1, Str=<L,A>, Unf=OtherCon []]
      fail1 _ [Occ=Dead, OS=OneShot] =
        case n of n1 { __DEFAULT ->
        join {
          fail2 [Dmd=<L,1*C1(U(U,U))>] :: Void# -> (# Integer, Integer #)
          [LclId[JoinId(1)], Arity=1, Str=<L,A>, Unf=OtherCon []]
          fail2 _ [Occ=Dead, OS=OneShot] =
            case n1 of wild {
              S# ds3 ->
                case ds3 of ds4 {
                  __DEFAULT ->
                    case ds of {
                      S# d# ->
                        case quotRemInt# ds4 d# of { (# ipv, ipv1 #) ->
                        (# GHC.Integer.Type.$WS# ipv, GHC.Integer.Type.$WS# ipv1 #)
...

leading to unnecessary allocation of thunks. It should be possible to fix this problem by doing one of the following:

  1. Inline constructor wrappers like $WS# aggressively, at least when they are fully applied.
  2. Do not create a redundant wrapper in a case like this.
  3. Remove this particular bang from integer-gmp.

I have tried (1) and (3), and they seem to have very similar effects. (3) is definitely the easiest to implement, but I wonder if (2) is more correct. I'd like to hear some suggestions about this.

comment:67 Changed 7 months ago by Simon Peyton Jones <simonpj@…>

In 900cfdc2/ghc:

Do not generate a data-con wrapper for !Int#

See Note [Data con wrappers and unlifted types] in MkId.
We were being totally stupid!

See Trac #1600 comment:66

comment:68 Changed 7 months ago by simonpj

Akio, I've fixed that wrapper issue.

What's next on this ticket?

comment:69 in reply to:  68 Changed 6 months ago by akio

Replying to simonpj:

Akio, I've fixed that wrapper issue.

Thank you!

What's next on this ticket?

After rebasing the branch on top of your fix, the nofib improvements are again consistent but very small:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            hpg          +0.2%     -1.1%     0.212     0.212      0.0%
          infer          +0.3%     -1.2%     0.054     0.054      0.0%
          solid          +0.3%     -6.6%     0.181     0.181      0.0%
         sphere          +0.3%     -1.7%     0.086     0.086      0.0%
--------------------------------------------------------------------------------
            Min          +0.2%     -6.6%    -23.2%    -23.4%    -19.2%
            Max          +0.7%     +0.1%    +16.3%    +16.2%      0.0%
 Geometric Mean          +0.3%     -0.1%     -0.1%     -0.1%     -0.3%

The branch can perform nested CPR in simple cases, but it fails to transform some cases I would like it to. Currently I'm trying to make extendSigsWithLam a bit more aggressive by allowing it to generate nested CPR information.

comment:70 Changed 6 months ago by simonpj

it fails to transform some cases I would like it to.

Can you give some specific programs that you would like to be optimised and aren't? A list of specific examples would help to make sure that everyone on this thread was talking about the same things.

comment:71 Changed 6 months ago by akio

I made a wiki page with a few specific examples: NestedCPR/Akio2017. I think I'm now done with updating extendSigsWithLam. I'm going to run nofib again and look for more improvements to make.

comment:72 Changed 6 months ago by simonpj

OK great, thank you! Neither example on your status page uses I/O, and I think it may be there that the main gains are to be found.

comment:73 Changed 4 months ago by akio

simonpj: I've added two examples that use IO.

After a bit of work I managed to make a big improvement on x2n1:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           bspt          -0.1%     +0.1%     0.007     0.007      0.0%
      cacheprof          -0.1%     +0.1%    +28.4%    +28.4%     +0.9%
           grep          -0.1%     -0.2%     0.000     0.000      0.0%
       maillist          -0.1%     -0.3%     0.053     0.053      0.0%
            hpg          -0.1%     -0.4%     0.205     0.205      0.0%
            pic          -0.1%     -0.6%     0.007     0.007      0.0%
      compress2          +0.2%     -0.9%     0.196     0.196    -19.2%
          infer          -0.1%     -1.2%     0.069     0.069      0.0%
         sphere          -0.1%     -1.7%     0.049     0.049      0.0%
          solid          -0.1%     -6.6%     0.127     0.169      0.0%
           x2n1          -0.1%    -84.7%     0.006     0.006      0.0%
--------------------------------------------------------------------------------
            Min          -0.1%    -84.7%    -37.7%    -35.7%    -19.2%
            Max          +0.2%     +0.1%    +34.3%    +34.2%     +0.9%
 Geometric Mean          -0.1%     -1.9%     -2.6%     -1.3%     -0.2%

I wonder if this improvement justifies the added complexity to the compiler (about 400 lines of code added under compiler/). Right now I don't see any way to make the analysis/transformation more effective.

comment:74 Changed 8 weeks ago by bgamari

What is the status of this, akio?

I wonder if this improvement justifies the added complexity to the compiler (about 400 lines of code added under compiler/). Right now I don't see any way to make the analysis/transformation more effective.

Well, 2.6% improvement in runtime is quite sizeable. This sounds quite worthwhile to me. Good work!

Simon, do you agree?

comment:75 Changed 8 weeks ago by akio

I'm currently cleaning up the branch before submitting a Diff.

Well, 2.6% improvement in runtime is quite sizeable.

The runtime numbers are probably unreliable, because I didn't try to keep the machine unloaded while running nofib.

comment:76 Changed 8 weeks ago by simonpj

Cc: nomeata added

I still feel that this is the Right Thing To Do. The performance results are not so exciting... but it would not surprise me if there was more to be had.

I'm currently cleaning up the branch before submitting a Diff.

Great. I'd like Joachim to review your code, if he is willing; after all, he had the first go at this. And sometimes a review leads to improvements and clean ups.

On your wiki page NestedCPR/Akio2017 you speak of "nested strictness", "more aggressive worker/wrapper", "correct handling of strict constructor fields", etc., but in each case I have little idea of what you mean. Could you expand with more specifics?

Really I'd like to have a page describing the design. (E.g. do you have a convergence analyser? Could we use that for other things?) Having such a description would make it much easier to read the code.

There's a paper in here somewhere.

comment:77 Changed 8 weeks ago by nomeata

Let me know when the branch is cleaned up. I can have a look (although I probably forgot everything since then…).
