Opened 22 months ago

Closed 15 months ago

Last modified 11 months ago

#11564 closed bug (fixed)

Possible overzealous unfolding

Reported by: simonmar Owned by:
Priority: high Milestone: 8.0.2
Component: Compiler Version: 8.0.1
Keywords: Inlining Cc: niteria
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1900
Wiki Page:

Description

I was investigating why (>>=) in the Haxl monad is being inlined more than I would expect, and I ran into something I don't fully understand and that looks dubious.

Start from this standalone example:

{-# LANGUAGE ScopedTypeVariables, ExistentialQuantification #-}
module Haxl where

import Data.IORef
import Control.Exception

newtype GenHaxl u a = GenHaxl
  { unHaxl :: Int -> IORef () -> IO (Result u a) }

data Result u a
  = Done a
  | Throw SomeException
  | Blocked (Cont u a)

data Cont u a
  = forall b. Cont u b :>>= (b -> GenHaxl u a)
  | forall b. (b -> a) :<$> (Cont u b)

instance Monad (GenHaxl u) where
  return a = GenHaxl $ \_env _ref -> return (Done a)
  GenHaxl m >>= k = GenHaxl $ \env ref -> do
    e <- m env ref
    case e of
      Done a       -> unHaxl (k a) env ref
      Throw e      -> return (Throw e)
      Blocked cont -> return (Blocked (cont :>>= k))

instance Functor (GenHaxl u)
instance Applicative (GenHaxl u)

(It could be simplified further, but I've intentionally used the exact definition of >>= that Haxl uses, to be sure I'm not investigating the wrong thing.)

Compile like this:

ghc -O -c Haxl.hs

and look at the .hi file:

ghc --show-iface Haxl.hi

and you should see this:

ea159c3b107c307a4e76cd310efcc8bc
  $fMonadGenHaxl2 ::
    GenHaxl u a
    -> (a -> GenHaxl u b)
    -> Int
    -> IORef ()
    -> State# RealWorld
    -> (# State# RealWorld, Result u b #)
  {- Arity: 5, HasNoCafRefs,
     Strictness: <C(C(C(S(SS)))),1*C1(C1(C1(U(U,1*U))))><L,U><L,U><L,U><S,U>,
     Unfolding: InlineRule (5, True, False)
                (\ @ u
                   @ a
                   @ b
                   (ds :: GenHaxl u a)
                   (k :: a -> GenHaxl u b)
                   (env :: Int)
                   (ref :: IORef ())
                   (s :: State# RealWorld)[OneShot] ->
                 case (ds `cast` (N:GenHaxl[0] <u>_P <a>_R) env ref)
                        `cast`
                      (N:IO[0] <Result u a>_R)
                        s of ds1 { (#,#) ipv ipv1 ->
                 case ipv1 of wild {
                   Done a1
                   -> ((k a1) `cast` (N:GenHaxl[0] <u>_P <b>_R) env ref)
                        `cast`
                      (N:IO[0] <Result u b>_R)
                        ipv
                   Throw e -> (# ipv, Throw @ u @ b e #)
                   Blocked cont
                   -> (# ipv, Blocked @ u @ b (:>>= @ u @ b @ a cont k) #) } }) -}

That right there is the definition of >>=. Note that it has an InlineRule, which means it will be unconditionally unfolded pretty much everywhere. I don't think this is right: there's no benefit to inlining it unconditionally.
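To see the effect from a call site, a sketch like the following (the module and function names are mine, purely for illustration) can be compiled with -O -ddump-simpl; the body above gets copied into the caller no matter what -funfolding-use-threshold is set to there:

module Client where

import Haxl

-- With -O -ddump-simpl, the simplified Core for 'chain' contains the body
-- of the stable unfolding shown above, because the InlineRule fires at
-- call sites regardless of -funfolding-use-threshold.
chain :: GenHaxl u a -> (a -> GenHaxl u b) -> GenHaxl u b
chain m k = m >>= k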

I delved in a bit more, and it seems this unfolding arises during worker-wrapper. Before WW we have

a_sVM
[LclId,
 Arity=5,
 Str=DmdType <C(C(C(S(SS)))),1*C1(C1(C1(U(U,1*U))))><L,U><L,U><L,U><S,U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=IF_ARGS [0 60 0 0 0] 94 60}]
a_sVM =
  \ @ u_XQR
    @ a_aPN
    @ b_aPO
    ds_dQP [Dmd=<C(C(C(S(SS)))),1*C1(C1(C1(U(U,1*U))))>]
    k_aEC
    env_aED
    ref_aEE
    s_aVC [Dmd=<S,U>, OS=OneShot] ->
    case (((ds_dQP `cast` ...) env_aED ref_aEE) `cast` ...) s_aVC
    of _ [Occ=Dead, Dmd=<L,A>]
    { (# ipv_aVF [Dmd=<S,U>], ipv1_aVG [Dmd=<S,1*U>] #) ->
    case ipv1_aVG of _ [Occ=Dead, Dmd=<L,A>] {
      Done a_aEG ->
        ((((k_aEC a_aEG) `cast` ...) env_aED ref_aEE) `cast` ...) ipv_aVF;
      Throw e_aEH -> (# ipv_aVF, Haxl.Throw e_aEH #);
      Blocked cont_aEI ->
        (# ipv_aVF, Haxl.Blocked (Haxl.:>>= cont_aEI k_aEC) #)
    }
    }

and after WW we have

a_sVM
[LclId,
 Arity=5,
 Str=DmdType <C(C(C(S(SS)))),1*C1(C1(C1(U(U,1*U))))><L,U><L,U><L,U><S,U>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True,
         Guidance=ALWAYS_IF(arity=5,unsat_ok=True,boring_ok=False)
         Tmpl= \ @ u_XQR
                 @ a_aPN
                 @ b_aPO
                 ds_dQP [Occ=Once]
                 k_aEC [Occ=Once*]
                 env_aED
                 ref_aEE
                 s_aVC [Occ=Once, OS=OneShot] ->
                 case (((ds_dQP `cast` ...) env_aED ref_aEE) `cast` ...) s_aVC
                 of _ [Occ=Dead]
                 { (# ipv_aVF [Occ=Once*], ipv1_aVG [Occ=Once!] #) ->
                 case ipv1_aVG of _ [Occ=Dead] {
                   Done a_aEG [Occ=Once] ->
                     ((((k_aEC a_aEG) `cast` ...) env_aED ref_aEE) `cast` ...) ipv_aVF;
                   Throw e_aEH [Occ=Once] -> (# ipv_aVF, Haxl.Throw e_aEH #);
                   Blocked cont_aEI [Occ=Once] ->
                     (# ipv_aVF, Haxl.Blocked (Haxl.:>>= cont_aEI k_aEC) #)
                 }
                 }}]
a_sVM =
  \ @ u_XQR
    @ a_aPN
    @ b_aPO
    ds_dQP [Dmd=<C(C(C(S(SS)))),1*C1(C1(C1(U(U,1*U))))>]
    k_aEC
    env_aED
    ref_aEE
    s_aVC [Dmd=<S,U>, OS=OneShot] ->
    case (((ds_dQP `cast` ...) env_aED ref_aEE) `cast` ...) s_aVC
    of _ [Occ=Dead, Dmd=<L,A>]
    { (# ipv_aVF [Dmd=<S,U>], ipv1_aVG [Dmd=<S,1*U>] #) ->
    case ipv1_aVG of _ [Occ=Dead, Dmd=<L,A>] {
      Done a_aEG ->
        ((((k_aEC a_aEG) `cast` ...) env_aED ref_aEE) `cast` ...) ipv_aVF;
      Throw e_aEH -> (# ipv_aVF, Haxl.Throw e_aEH #);
      Blocked cont_aEI ->
        (# ipv_aVF, Haxl.Blocked (Haxl.:>>= cont_aEI k_aEC) #)
    }
    }

For some unknown reason, this binding has acquired an always-on unfolding. There's no wrapper, we're just unfolding the whole thing.

Simon, can you shed any light here? I would like to tune unfolding sizes to reduce code bloat in our codebase, but because this unfolding is unconditional, -funfolding-use-threshold has no effect on it; the only alternative is NOINLINE, which is too blunt.
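For reference, the blunt workaround I mean is a method-level pragma on the instance, roughly like this (a sketch against the example module above):

instance Monad (GenHaxl u) where
  return a = GenHaxl $ \_env _ref -> return (Done a)
  -- NOINLINE suppresses the stable unfolding entirely, but it also blocks
  -- the cases where inlining (>>=) genuinely helps.
  {-# NOINLINE (>>=) #-}
  GenHaxl m >>= k = GenHaxl $ \env ref -> do
    e <- m env ref
    case e of
      Done a       -> unHaxl (k a) env ref
      Throw e'     -> return (Throw e')
      Blocked cont -> return (Blocked (cont :>>= k))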

Change History (15)

comment:1 Changed 22 months ago by simonpj

Here's the code in WorkWrap.tryWW

  | not loop_breaker
  , Just stable_unf <- certainlyWillInline dflags fn_unf
  = return [ (fn_id `setIdUnfolding` stable_unf, rhs) ]
        -- Note [Don't w/w inline small non-loop-breaker, or INLINE, things]
        -- NB: use idUnfolding because we don't want to apply
        --     this criterion to a loop breaker!

The note says

Note [Don't w/w inline small non-loop-breaker things]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In general, we refrain from w/w-ing *small* functions, which are not
loop breakers, because they'll inline anyway.  But we must take care:
it may look small now, but get to be big later after other inlining
has happened.  So we take the precaution of adding an INLINE pragma to
any such functions.

I made this change when I observed a big function at the end of
compilation with a useful strictness signature but no w-w.  (It was
small during demand analysis, we refrained from w/w, and then got big
when something was inlined in its rhs.) When I measured it on nofib,
it didn't make much difference; just a few percent improved allocation
on one benchmark (bspt/Euclid.space).  But nothing got worse.

Also look at certainlyWillInline.

Now, maybe the size calculation is bad and is treating something as small that isn't. The calculation in certainlyWillInline is saying, I think, that the size of the RHS is no bigger than the size of the call (i.e. (arity+1)*10).
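In other words, the comparison being described is roughly this (a simplified sketch of the check as described here, not the actual GHC source):

-- Size of a saturated call: 10 per argument plus 10 for the call itself,
-- i.e. (arity + 1) * 10 as described above.
callSize :: Int -> Int
callSize arity = (arity + 1) * 10

-- The question certainlyWillInline is (roughly) asking: is the RHS no
-- bigger than what a call site would have paid anyway?
smallEnoughToInlineAnyway :: Int -> Int -> Bool
smallEnoughToInlineAnyway rhsSize arity = rhsSize <= callSize arity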

Does that help?

comment:2 Changed 22 months ago by simonmar

Ok, I understand the reason: we're saying "this function is strict, but it's small enough to be inlined anyway, so there's no point in worker/wrappering it; we'll just inline it instead." In my case the size is 94 and the arity is 5, so it falls under the threshold.

The calculation is a bit generous (I found a bug, diff coming soon), but even so there's still something wrong here. The problem is that the decision is made using the value of -funfolding-use-threshold at the definition site, which means that flag can no longer influence the decision at the call site.
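Concretely, for the binding above (a rough reconstruction of the arithmetic; the exact formula and the default threshold of 60 are my assumptions, not quoted from GHC):

rhsSize, arity, useThreshold :: Int
rhsSize      = 94    -- the "Guidance=IF_ARGS [...] 94 60" from the dump above
arity        = 5
useThreshold = 60    -- -funfolding-use-threshold in effect when compiling Haxl.hs

certainToInline :: Bool
certainToInline = rhsSize - (arity + 1) * 10 <= useThreshold
-- 94 - 60 = 34 <= 60, so the binding gets the always-on unfolding at its
-- definition site, and the call site's threshold never gets a say.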

Presumably we need the INLINE pragma because otherwise we might lose the opportunity to take advantage of the strictness. But what goes wrong if we make it an INLINABLE pragma instead?
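For context, the distinction being asked about, in an unrelated toy example (illustrative only, not from the ticket):

-- INLINE: the original RHS is kept as a stable unfolding and substituted
-- at saturated call sites regardless of the size heuristics.
f :: Int -> Int
f x = x * 2 + 1
{-# INLINE f #-}

-- INLINABLE: the unfolding is recorded in the interface file so other
-- modules can inline or specialise it, but whether a given call site
-- actually inlines it is still decided by the usual size heuristics, so
-- -funfolding-use-threshold would still apply there.
g :: Int -> Int
g x = x * 2 + 1
{-# INLINABLE g #-}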

comment:3 Changed 22 months ago by simonmar

Differential Rev(s): Phab:D1900

comment:4 Changed 22 months ago by Simon Marlow <marlowsd@…>

In 51a3392/ghc:

sizeExpr: fix a bug in the size calculation

There were two bugs here:

* We weren't ignoring Cast in size_up_app
* An application of a non-variable wasn't being charged correctly

The result was that some things looked too cheap.  In my case I had
things like

    ((f x) `cast` ...) y

which was given size 21 instead of 30, and this had knock-on effects
elsewhere that caused some large code bloat.

Test Plan:
* nofib runs (todo)
* validate

Reviewers: simonpj, austin, bgamari, erikd

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1900

GHC Trac Issues: #11564
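For intuition, the first bullet of the commit message corresponds to something like this toy sketch (not the real sizeExpr; the charge of 10 per argument is my assumption):

-- Toy expression type and sizing, to illustrate the fix: when walking down
-- the function position of an application, look through casts so that
-- ((f x) `cast` co) y is charged the same as f x y.
data Expr = Var String | App Expr Expr | Cast Expr

appSize :: Expr -> Int -> Int
appSize e nArgs = case e of
  Cast e' -> appSize e' nArgs        -- the fix: ignore the cast
  App f _ -> appSize f (nArgs + 1)   -- count each value argument
  Var _   -> 10 * (nArgs + 1)        -- charge the call plus its arguments
-- For ((f x) `cast` co) y this gives 30, the size mentioned in the commit
-- message, instead of undercounting it.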

comment:5 Changed 22 months ago by simonmar

Status: new → merge

I pushed the fix and opened #11568 for the regression. We may want to merge, but we should understand the regression before doing so.

comment:6 Changed 21 months ago by simonmar

Owner: simonpj deleted
Status: merge → new

I reverted the change, because it caused perf test regressions (that were apparently not reported by Travis): https://perf.haskell.org/ghc/#revision/51a33924fc118d9b6c1db556c75c0d010ef95e18

test                            before         change     after
tests/alloc/T3064               297834384      +3.18%     307311896   bytes
tests/alloc/T5631               1164415992     +108.27%   2425081664  bytes
tests/alloc/T7257               1654893352     -14.50%    1414893352  bytes
tests/alloc/T783                489042024      +3.47%     505987776   bytes
tests/alloc/T9020               703211744      +16.45%    818924416   bytes
tests/alloc/T9872d              535824528      +3.09%     552399688   bytes
tests/alloc/T9961               731573736      +5.10%     768861280   bytes
tests/alloc/haddock.Cabal       10472403176    +4.52%     10945830216 bytes
tests/alloc/haddock.compiler    59679619080    +4.56%     62402376816 bytes

Need to investigate these before pushing.

comment:7 Changed 21 months ago by simonmar

I looked at the -dverbose-core2core output for T5631 and it's identical between the before and after compilers, so this is a difference in the performance of GHC itself after this change.

Last edited 21 months ago by simonmar

comment:8 Changed 21 months ago by simonpj

Well yes, I guess that's maybe what tests/alloc/T5631 measures? I don't even know where that test is.

But why would the compiler allocate twice as much? If -dverbose-core2core is identical, it can't be that the program being compiled is getting bigger. Very mysterious.

comment:9 Changed 17 months ago by simonmar

I tried this again; the results are a bit different:

bytes allocated value is too high:
    Expected    T5631(normal) bytes allocated: 1124068664 +/-5%
    Lower bound T5631(normal) bytes allocated: 1067865230 
    Upper bound T5631(normal) bytes allocated: 1180272098 
    Actual      T5631(normal) bytes allocated: 1384316488 
    Deviation   T5631(normal) bytes allocated:       23.2 %

bytes allocated value is too high:
    Expected    T9020(optasm) bytes allocated: 698401736 +/-10%
    Lower bound T9020(optasm) bytes allocated: 628561562 
    Upper bound T9020(optasm) bytes allocated: 768241910 
    Actual      T9020(optasm) bytes allocated: 843289480 
    Deviation   T9020(optasm) bytes allocated:      20.7 %

bytes allocated value is too low:
(If this is because you have improved GHC, please
update the test so that GHC doesn't regress again)
    Expected    T7257(normal) bytes allocated: 1654893248 +/-5%
    Lower bound T7257(normal) bytes allocated: 1572148585 
    Upper bound T7257(normal) bytes allocated: 1737637911 
    Actual      T7257(normal) bytes allocated: 1414893248 
    Deviation   T7257(normal) bytes allocated:      -14.5 %

On nofib:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
   k-nucleotide          +0.5%   +106.7%     +4.9%     +4.9%      0.0%
--------------------------------------------------------------------------------
            Min          +0.4%     -0.3%     -8.2%     -8.3%     -1.8%
            Max          +0.7%   +106.7%     +7.9%     +7.8%     +2.6%
 Geometric Mean          +0.6%     +0.7%     +0.0%     +0.0%     -0.0%

Compile Allocations
        -1 s.d.                -----            -0.1%
        +1 s.d.                -----            +1.4%
        Average                -----            +0.7%

Compile Times
        -1 s.d.                -----           -10.6%
        +1 s.d.                -----            +8.7%
        Average                -----            -1.4%

To investigate:

  • k-nucleotide
  • compiler perf on T5631 and T9020

comment:10 Changed 17 months ago by Simon Marlow <marlowsd@…>

In a47b62cb/ghc:

Second attempt to fix sizeExpr

Summary:
Background:
* sizeExpr was calculating expressions like ((e `cast` T) x) wrongly
* Fixing it caused regressions in compile performance, and one nofib
  program (k-nucleotide)

I managed to fix the source of the compiler regressions.  I think it was
due to traceTc not being inlined, which I fixed in a more robust way by
putting an export list on TcRnMonad.

The k-nucleotide regression is more difficult.  I don't think anything
is actually going wrong, but this program has been highly tuned and is
quite sensitive to changing in inlining behaviour.  I managed to recover
most of the performance by manual lambda-lifting which makes it a bit
less fragile, but the end result was a bit slower.  I don't think this
is disastrous, the program is pretty horrible to begin with and we could
probably make a faster one by starting from scratch.

Test Plan: validate, nofib

Reviewers: simonpj, bgamari, niteria, austin, erikd

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2338

GHC Trac Issues: #11564
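For readers unfamiliar with the transformation mentioned, manual lambda-lifting looks roughly like this (an illustrative example of the technique, not code from k-nucleotide):

{-# LANGUAGE BangPatterns #-}

-- Before: 'go' is a local closure that captures 'n'; its size and inlining
-- behaviour depend on the context it is compiled in.
sumTo :: Int -> Int
sumTo n = go 0 0
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

-- After manual lambda-lifting: the loop takes 'n' explicitly and lives at
-- the top level, making the code less sensitive to small changes in the
-- inliner's decisions.
sumTo' :: Int -> Int
sumTo' n = goLifted n 0 0

goLifted :: Int -> Int -> Int -> Int
goLifted n !acc !i
  | i > n     = acc
  | otherwise = goLifted n (acc + i) (i + 1)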

comment:11 Changed 17 months ago by simonmar

Status: new → merge

comment:12 Changed 16 months ago by mpickering

Keywords: Inlining added

comment:13 Changed 15 months ago by Ben Gamari <ben@…>

In 498009a/ghc:

Second attempt to fix sizeExpr

Summary:
Background:
* sizeExpr was calculating expressions like ((e `cast` T) x) wrongly
* Fixing it caused regressions in compile performance, and one nofib
  program (k-nucleotide)

I managed to fix the source of the compiler regressions.  I think it was
due to traceTc not being inlined, which I fixed in a more robust way by
putting an export list on TcRnMonad.

The k-nucleotide regression is more difficult.  I don't think anything
is actually going wrong, but this program has been highly tuned and is
quite sensitive to changing in inlining behaviour.  I managed to recover
most of the performance by manual lambda-lifting which makes it a bit
less fragile, but the end result was a bit slower.  I don't think this
is disastrous, the program is pretty horrible to begin with and we could
probably make a faster one by starting from scratch.

Test Plan: validate, nofib

Reviewers: simonpj, bgamari, niteria, austin, erikd

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2338

GHC Trac Issues: #11564

(cherry picked from commit a47b62cb36853d03c77ef63b3208b3d869fb687e)

comment:14 Changed 15 months ago by bgamari

Milestone: 8.0.2
Resolution: fixed
Status: merge → closed
Version: 8.1 → 8.0.1

comment:15 Changed 11 months ago by bgamari

For the record, this was merged to 8.0.2 as 498009a904a1e8910f9e0e527f6eb6c8073c8a76.
