Opened 15 months ago

Last modified 2 months ago

#7596 new bug

Opportunity to improve CSE

Reported by: simonpj Owned by: simonpj
Priority: normal Milestone: 7.8.3
Component: Compiler Version: 7.6.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

In nofib/spectral/mandel2, the function check_perim calls point_colour on various arguments. After inlining point_colour there is the opportunity to CSE among the sub-expressions the inlinings create.

In GHC 7.6, the join point didn't have a "one-shot" flag, so the full laziness pass floated these sub-expressions out, and they got CSEd.

As part of Ilya's new demand analyser changes, we get the "one-shot" flag right, so don't MFE those sub-expressions, so they aren't CSEd. As a result, allocation on mandel2 increases slightly (4.2%).

The solution is, I think, to do something a bit like like CorePrep before CSE. At the moment if we have

case f (I# y) of { (a,b) ->
case g (I# y) of { (p,q) -> ... }}

we stuipdly don't CSE that (I# y) even though it is manifestly sharable. Somehow we should.

Change History (6)

comment:1 Changed 15 months ago by simonpj@…

commit 0831a12ea2fc73c33652eeec1adc79fa19700578

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Thu Jan 17 10:54:07 2013 +0000

    Major patch to implement the new Demand Analyser
    
    This patch is the result of Ilya Sergey's internship at MSR.  It
    constitutes a thorough overhaul and simplification of the demand
    analyser.  It makes a solid foundation on which we can now build.
    Main changes are
    
    * Instead of having one combined type for Demand, a Demand is
       now a pair (JointDmd) of
          - a StrDmd and
          - an AbsDmd.
       This allows strictness and absence to be though about quite
       orthogonally, and greatly reduces brain melt-down.
    
    * Similarly in the DmdResult type, it's a pair of
         - a PureResult (indicating only divergence/non-divergence)
         - a CPRResult (which deals only with the CPR property
    
    * In IdInfo, the
        strictnessInfo field contains a StrictSig, not a Maybe StrictSig
        demandInfo     field contains a Demand, not a Maybe Demand
      We don't need Nothing (to indicate no strictness/demand info)
      any more; topSig/topDmd will do.
    
    * Remove "boxity" analysis entirely.  This was an attempt to
      avoid "reboxing", but it added complexity, is extremely
      ad-hoc, and makes very little difference in practice.
    
    * Remove the "unboxing strategy" computation. This was an an
      attempt to ensure that a worker didn't get zillions of
      arguments by unboxing big tuples.  But in fact removing it
      DRAMATICALLY reduces allocation in an inner loop of the
      I/O library (where the threshold argument-count had been
      set just too low).  It's exceptional to have a zillion arguments
      and I don't think it's worth the complexity, especially since
      it turned out to have a serious performance hit.
    
    * Remove quite a bit of ad-hoc cruft
    
    * Move worthSplittingFun, worthSplittingThunk from WorkWrap to
      Demand. This allows JointDmd to be fully abstract, examined
      only inside Demand.
    
    Everything else really follows from these changes.
    
    All of this is really just refactoring, so we don't expect
    big performance changes, but acutally the numbers look quite
    good.  Here is a full nofib run with some highlights identified:
    
            Program           Size    Allocs   Runtime   Elapsed  TotalMem
    --------------------------------------------------------------------------------
             expert          -2.6%    -15.5%      0.00      0.00     +0.0%
              fluid          -2.4%     -7.1%      0.01      0.01     +0.0%
                 gg          -2.5%    -28.9%      0.02      0.02    -33.3%
          integrate          -2.6%     +3.2%     +2.6%     +2.6%     +0.0%
            mandel2          -2.6%     +4.2%      0.01      0.01     +0.0%
           nucleic2          -2.0%    -16.3%      0.11      0.11     +0.0%
               para          -2.6%    -20.0%    -11.8%    -11.7%     +0.0%
             parser          -2.5%    -17.9%      0.05      0.05     +0.0%
             prolog          -2.6%    -13.0%      0.00      0.00     +0.0%
             puzzle          -2.6%     +2.2%     +0.8%     +0.8%     +0.0%
            sorting          -2.6%    -35.9%      0.00      0.00     +0.0%
           treejoin          -2.6%    -52.2%     -9.8%     -9.9%     +0.0%
    --------------------------------------------------------------------------------
                Min          -2.7%    -52.2%    -11.8%    -11.7%    -33.3%
                Max          -1.8%     +4.2%    +10.5%    +10.5%     +7.7%
     Geometric Mean          -2.5%     -2.8%     -0.4%     -0.5%     -0.4%
    
    Things to note
    
    * Binary sizes are smaller. I don't know why, but it's good.
    
    * Allocation is sometiemes a *lot* smaller. I believe that all the big numbers
      (I checked treejoin, gg, sorting) arise from one place, namely a function
      GHC.IO.Encoding.UTF8.utf8_decode, which is strict in two Buffers both of
      which have several arugments.  Not w/w'ing both arguments (which is what
      we did before) has a big effect.  So the big win in actually somewhat
      accidental, gained by removing the "unboxing strategy" code.
    
    * A couple of benchmarks allocate slightly more.  This turns out
      to be due to reboxing (integrate).  But the biggest increase is
      mandel2, and *that* turned out also to be a somewhat accidental
      loss of CSE, and pointed the way to doing better CSE: see Trac
      #7596.
    
    * Runtimes are never very reliable, but seem to improve very slightly.
    
    All in all, a good piece of work.  Thank you Ilya!

 compiler/basicTypes/Demand.lhs     | 1229 +++++++++++++++++++++++++++++-------
 compiler/basicTypes/Id.lhs         |   65 +-
 compiler/basicTypes/IdInfo.lhs     |   54 +-
 compiler/basicTypes/MkId.lhs       |   45 +-
 compiler/coreSyn/CoreArity.lhs     |    8 +-
 compiler/coreSyn/CoreLint.lhs      |    7 +-
 compiler/coreSyn/CorePrep.lhs      |   24 +-
 compiler/coreSyn/CoreTidy.lhs      |    4 +-
 compiler/coreSyn/MkCore.lhs        |   11 +-
 compiler/coreSyn/PprCore.lhs       |   12 +-
 compiler/iface/BinIface.hs         |  100 +---
 compiler/iface/IfaceSyn.lhs        |   22 +-
 compiler/iface/MkIface.lhs         |    8 +-
 compiler/iface/TcIface.lhs         |   22 +-
 compiler/main/TidyPgm.lhs          |   28 +-
 compiler/prelude/primops.txt.pp    |   10 +-
 compiler/simplCore/FloatOut.lhs    |    8 +-
 compiler/simplCore/SetLevels.lhs   |   23 +-
 compiler/simplCore/SimplCore.lhs   |   21 +-
 compiler/simplCore/Simplify.lhs    |    5 +-
 compiler/specialise/SpecConstr.lhs |   30 +-
 compiler/stranal/DmdAnal.lhs       | 1077 ++++++++++---------------------
 compiler/stranal/WorkWrap.lhs      |   58 +--
 compiler/stranal/WwLib.lhs         |  131 +++--
 24 files changed, 1658 insertions(+), 1344 deletions(-)

comment:2 Changed 13 months ago by igloo

  • Milestone set to 7.8.1
  • Owner set to simonpj

comment:3 Changed 3 months ago by nomeata

As suggested, I made CSE much more aggressive by floating out much more expressions, so that Just x is CSE’ed in

module T7596 where

f :: Maybe Int -> (Bool, Bool)
f Nothing = (True, True)
f _ = (False, True)
{-# NOINLINE f #-}

g :: Maybe Int -> (Bool, Bool)
g Nothing = (True, True)
g _ = (False, True)
{-# NOINLINE g #-}

foo :: Int -> Bool
foo x = case f (Just x) of
    (a, b) -> case g (Just x) of
        (p,q) -> a && b && p && q

It hardy helps, though:

            Min          -0.6%    -14.1%    -33.0%    -33.9%    -20.0%
            Max          +0.6%   +178.2%    +85.2%    +86.1%    +50.0%
 Geometric Mean          -0.2%    +14.1%    +13.1%    +12.5%     +0.8%

But that is no surprise; it CSE’s dictionary access functions like lvl1 = GHC.Classes.<= @ a sc – not much to win here. Preventing aggresive floating of partial applications... slightly better, but still horrible:

            Min          -0.7%    -14.1%    -53.0%    -55.4%    -20.0%
            Max          +0.5%   +178.2%    +25.9%    +25.9%    +50.0%
 Geometric Mean          -0.4%    +13.6%    -16.8%    -18.1%     +0.6%

For example spectral/sorting: +123.1% increase, due to the expression reverse rev shared in this code

insertSort []       = []
insertSort (x:xs)   = trins [] [x] xs
  where
    trins :: Ord a => [a] -> [a] -> [a] -> [a]

    trins rev []     (y:ys)         = trins [] ((reverse rev) ++ [y]) ys
    trins rev xs     []             = (reverse rev) ++ xs
    trins rev (x:xs) (y:ys) | x < y = trins (x:rev) xs (y:ys)
                            | True  = trins [] (reverse rev ++ (y:x:xs)) ys

Clearly CSE’ing something from different branches can never be a win (besides potentially code size). But that is something the current CSE code cannot check, right?

And even in the example from the ticket description, it is not stupid not to do CSE: With some luck the first I# allocated will be free before the next GC run, causing no copying. After CSE it stays alive for longer, causing more work.

comment:4 Changed 2 months ago by nomeata

With aggressive floating out bounded by multi-way-cases, the changes are less dramatic:

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
        knights          +0.5%    -23.1%      0.01      0.01      0.0%
        mandel2          +0.4%    +46.6%      0.00      0.00      0.0%
--------------------------------------------------------------------------------
            Min          +0.3%    -23.1%    -20.2%    -19.7%    -20.0%
            Max          +1.3%    +46.6%     +1.9%     +1.9%     +3.6%
 Geometric Mean          +0.5%     -0.0%     -4.1%     -4.2%     -0.3%

Will investigate what happened with mandel2 now.

comment:5 Changed 2 months ago by nomeata

Hmm, tricky. The Increase in allocation happens in check_perim (which produces quite ugly code; if this were real code I’d suggest the author to put in some strictness annotation on the floats...). There are multiple calls to point_color, which call np resp. nq on the coordinates of its arguments. These calculations are shared. How does that increase allocations? Hard to tell, even with ticky-ticky-profiling.

But it seems that there is still stuff floating past multi-way cases, and it is hard to avoid. If there is a nested expression (C1 (C2 x)), and (C1 (C2 x)) is worth floating on its own, it is floated past multi-way-cases, preserving existing behavior. But if C2 x is also optimistically floated out, it is now outside the multi-way case and might be CSEd there with values from other branches, even if C1 lvl itself gets floated back in. But then, I’m not entirely sure that this is main reason for allocation increase.

comment:6 Changed 2 months ago by nomeata

I wrote a small wiki page with some of my findings, if you work on CSE you might find some of it useful: MoreCSE. Not many hard insights there, though.

Note: See TracTickets for help on using tickets.