Opened 4 years ago

Closed 4 years ago

Last modified 7 weeks ago

#5113 closed bug (fixed)

Huge performance regression of 7.0.2, 7.0.3 and HEAD over 7.0.1 and 6.12 (MonoLocalBinds)

Reported by: daniel.is.fischer Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.0.3
Keywords: performance, MonoLocalBinds Cc:
Operating System: Linux Architecture: x86
Type of failure: Runtime performance bug Test Case: perf/should_run/T5113
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Exploring ways to make show for Double and Float faster, I encountered a terrible performance regression. Time increased about 20-fold, allocation about 75-fold.

With 6.12.3 and 7.0.1, the new code is nearly twice as fast as the old and allocates less than a third. With 7.0.2, 7.0.3 and the HEAD (7.1.20110329), the new code is over ten times slower than the old and allocates more than twenty times as much.

In 7.0.3's core at least some local bindings appear as polymorphic functions which are inlined by 7.0.1. And indeed, compiling with -XMonoLocalBinds rectifies matters. Nevertheless, that shouldn't happen.

Attachments (2)

Test.hs (779 bytes) - added by daniel.is.fischer 4 years ago.
small example of related behaviour of 7.0.3
display.tar.gz (4.6 KB) - added by daniel.is.fischer 4 years ago.
bundle exhibiting example of undesirable behaviour

Download all attachments as: .zip

Change History (14)

comment:1 Changed 4 years ago by daniel.is.fischer

Namely, the culprit is

inc i = unsafeRead darr i >>= \k -> unsafeWrite darr i (k+1)

darr is an STUArray s Int Int, yet without MonoLocalBinds or a type signature, 7.0.3 produces

let {
   $winc_s21z
     :: forall (m_a1mK :: * -> *).
        GHC.Base.Monad m_a1mK =>
        (forall i_a1my.
         GHC.Arr.Ix i_a1my =>
         Data.Array.Base.STUArray
           s_a1ln i_a1my GHC.Types.Int
         -> GHC.Types.Int
         -> m_a1mK GHC.Types.Int)
        -> (forall i_a1mk.
            GHC.Arr.Ix i_a1mk =>
            Data.Array.Base.STUArray
              s_a1ln i_a1mk GHC.Types.Int
            -> GHC.Types.Int
            -> GHC.Types.Int
            -> m_a1mK ())
        -> GHC.Types.Int
        -> m_a1mK ()
   [LclId, Arity=4, Str=DmdType SLLL]
   $winc_s21z =
     \ (@ m_a1mK::* -> *)
       (ww3_s1YM :: GHC.Base.Monad m_a1mK)
       (ww4_s1YS
          :: forall i_a1my.
             GHC.Arr.Ix i_a1my =>
             Data.Array.Base.STUArray
               s_a1ln i_a1my GHC.Types.Int
             -> GHC.Types.Int
             -> m_a1mK GHC.Types.Int)
       (ww5_s1YT
          :: forall i_a1mk.
             GHC.Arr.Ix i_a1mk =>
             Data.Array.Base.STUArray
               s_a1ln i_a1mk GHC.Types.Int
             -> GHC.Types.Int
             -> GHC.Types.Int
             -> m_a1mK ())
       (w1_s1YV :: GHC.Types.Int) ->
       GHC.Base.>>=
         @ m_a1mK
         ww3_s1YM
         @ GHC.Types.Int
         @ ()
         (ww4_s1YS
            @ GHC.Types.Int
            GHC.Arr.$fIxInt
            marr_s21x
            w1_s1YV)
         (\ (k_a169 :: GHC.Types.Int) ->
            ww5_s1YT
              @ GHC.Types.Int
              GHC.Arr.$fIxInt
              marr_s21x
              w1_s1YV
              (case k_a169 of _ { GHC.Types.I# x1_X1G6 ->
               GHC.Types.I# (GHC.Prim.+# x1_X1G6 1)
               })) }

Give that a type signature inc :: Int -> ST s () (requires moving stuff outside the runST) and you're good again.

comment:2 Changed 4 years ago by simonpj

I'm lost. The inc you give above is indeed polymorphic in both the monad and the index type, and (unless specialised or inlined) will probably perform much worse than the polymorphic vresion.

Can you give a concrete module that, when compiled, gives much worse code with HEAD than 6.12?

Thanks

Simon

comment:3 Changed 4 years ago by daniel.is.fischer

Yes, it is polymorphic, but it's a local binding inside a runST, only ever used at the one specific type coming from the enclosing ST-action. Up to and including 7.0.1, it was automatically inlined, resulting in

case GHC.Prim.readIntArray#
       @ s_aKv marr#_aLG sc1_sSa sc_sS9
of _ { (# s2#1_aPz, e#_aPA #) ->
case GHC.Prim.writeIntArray#
       @ s_aKv
       marr#_aLG
       sc1_sSa
       (GHC.Prim.+# e#_aPA 1)
       s2#1_aPz
of s2#2_aQ2 { __DEFAULT ->

at the appropriate places. 7.0.2, 7.0.3 and HEAD (29.3. - I've not yet built a newer HEAD to see whether it changed) produce the polymorphic worker that's then called.

As for specific modules, I've not yet managed to produce a small example exhibiting exactly that, but for the small Test.hs module, 7.0.3 produces bad (polymorphic) core while HEAD (and 7.0.1) do fine. However, for the display bundle (the interesting module is DispFloat), HEAD also produces the polymorphic $winc.

Changed 4 years ago by daniel.is.fischer

small example of related behaviour of 7.0.3

Changed 4 years ago by daniel.is.fischer

bundle exhibiting example of undesirable behaviour

comment:4 Changed 4 years ago by simonpj

  • Resolution set to fixed
  • Status changed from new to closed
  • Test Case set to perf/should_run/T5113

The HEAD seems fine with your Test file, which is good. I've turned it into a performance test so we'll see if it ever goes bad on us.

Simon

comment:5 Changed 3 years ago by simonpj@…

commit 33683ba9367e03b6b051f1c0c694fd8bf244a759

Author: Simon Peyton Jones <[email protected]>
Date:   Mon Feb 11 08:38:12 2013 +0000

    Extra comment about the fix to Trac #5113

 compiler/specialise/Specialise.lhs |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

comment:6 Changed 3 years ago by simonpj

  • difficulty set to Unknown

This patch is the crucial one that fixes the original problem

commit b5c18c91da911a7729563207c7b95f7e452cca7e
Author: Simon Peyton Jones <[email protected]>
Date:   Fri Feb 8 17:29:40 2013 +0000

    Fix an old and egregious specialisation bug (Trac #5113)
    
    The specialiser needs to know if a dictionay has some structure,
    so that it can decide whether to specialise a function. Eg
     (A)    let d = $dfblah d1
            in ....(f d)....
    
     (B)    \d. ....(f d)....
    
    In (A) it's probably worth specialising f; in (B) it isn't.
    Previously we were relying on d's unfolding, but the specialiser
    does cloning as it goes, which discards the unfolding. So we
    were simply discarding all specialisations for functions with
    local dictionary bindings!  This bug seems to have been there
    for a long time.
    
    This is what originally caused Trac #5113.  Then we went through a phase
    where local bindings were not generalised, and that meant there was
    no locally overloaded f to specialise; so the performance problem appeared
    to be fixed.  But now we are generalising local bindings again, so it
    re-appeared.
    
    This patch fixes the original problem.

 compiler/specialise/Specialise.lhs |  390 ++++++++++++++++++++----------------
 1 files changed, 214 insertions(+), 176 deletions(-)

comment:7 Changed 7 weeks ago by bgamari

Unfortunately it seems that simonpj's fix for #10527, 07a1f32e8bacecd450112607df3fdf39e553c91e, again breaks this testcase,

=====> T5113(normal) 2426 of 4449 [0, 2, 0]
cd ./perf/should_run &&  "/home/ben/trees/ghc/ghc-7.10/inplace/bin/ghc-stage2" -o T5113 T5113.hs -fforce-recomp -dcore-lint -dcmm-lint -dno-debug-output -no-user-package-db -rtsopts -fno-warn-tabs -fno-ghci-history  -O > T5113.comp.stderr 2>&1
cd ./perf/should_run && ./T5113  +RTS -V0 -tT5113.stats --machine-readable -RTS   </dev/null > T5113.run.stdout 2> T5113.run.stderr
bytes allocated value is too high:
    Expected    T5113(normal) bytes allocated:   8000000 +/-5%
    Lower bound T5113(normal) bytes allocated:   7600000
    Upper bound T5113(normal) bytes allocated:   8400000
    Actual      T5113(normal) bytes allocated: 806747568
    Deviation   T5113(normal) bytes allocated:    9984.3 %
*** unexpected stat test failure for T5113(normal)

The dump-simpl output for the testcase both before and after the patch can be found here.

It seems that the Monad and Applicative instances for ST are no longer being inlined. Specifically these top-level bindings are present in the bad case but appear to be inlined and optimized away in the good case,

poly_$dApplicative_r54y :: forall s_a1fk. Applicative (ST s_a1fk)
[GblId, Str=DmdType]
poly_$dApplicative_r54y =
  \ (@ s_a1fk) ->
    GHC.ST.$fApplicativeST @ s_a1fk (GHC.ST.$fFunctorST @ s_a1fk)

poly_$dMonad_r54z :: forall s_a1fk. Monad (ST s_a1fk)
[GblId, Str=DmdType]
poly_$dMonad_r54z =
  \ (@ s_a1fk) ->
    GHC.ST.$fMonadST @ s_a1fk (poly_$dApplicative_r54y @ s_a1fk)

poly_$dMArray_r54A
  :: forall s_a1fk. MArray (STUArray s_a1fk) Int (ST s_a1fk)
[GblId, Str=DmdType]
poly_$dMArray_r54A =
  \ (@ s_a1fk) ->
    Data.Array.Base.$fMArraySTUArrayIntST
      @ s_a1fk (poly_$dMonad_r54z @ s_a1fk)
Last edited 7 weeks ago by bgamari (previous) (diff)

comment:8 Changed 7 weeks ago by bgamari

The above bindings are introduced during FloatOut in both the good and bad cases, as expected. The cases diverge in the very next pass (SimplMode {Phase = 2 [main], inline, rules, eta-expand, case-of-case}) where they are inlined in the good case but not in the bad. This of course isn't terribly surprising but I thought I'd write it down anyways.

comment:9 Changed 7 weeks ago by bgamari

It seems that this all comes down to a specialisation rule not firing. Namely this one: Rule fired: SPEC note @ (ST s) which happens twice in the good case and never in the bad case. This rule applies to,

note_s2QQ
  :: forall (m_a29f :: * -> *).
     MArray (STUArray s_a1fk) Int m_a29f =>
     Int -> Int -> m_a29f ()
Last edited 7 weeks ago by bgamari (previous) (diff)

comment:10 Changed 7 weeks ago by bgamari

This is odd as the good and bad cases have identical output from -ddump-rules and I can clearly see the rule in the Core output in both cases. For some reason it just never fires in the bad case.

comment:11 Changed 7 weeks ago by bgamari

It seems to me like the fix for #10527 is a bit questionable. I may be missing something but I don't see how arguments could possibly get to the rebuilder (which is responsible for trying rewrite rules) as we now simply perform a bunch of substitutions on them with substExprS instead of passing them tosimplExpr.

Surely I'm missing something as one would think this would result in massive a performance regression if it really were the case. Simon, perhaps you can explain how you intended this to work?

comment:12 Changed 7 weeks ago by simonpj

The new substExprS was dropping rules and specialisations, which is terribly wrong.

So I backed out and fixed #10527 another way; see the commit comments.

I'm getting a small number of apparently-unrelated validate failures, but I figure it's better to push this so that Ben can try it out.

Simon

Note: See TracTickets for help on using tickets.