Opened 3 years ago

Closed 5 days ago

#9630 closed bug (fixed)

compile-time performance regression (probably due to Generics)

Reported by: hvr Owned by: dfeuer
Priority: high Milestone: 8.2.1
Component: Compiler Version: 7.9
Keywords: deriving-perf, Generics Cc: gidyn, kolmodin, erikd, michalt, RyanGlScott, akst, dfeuer, asr
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #9583, #10293, #13059, #10818 Differential Rev(s): Phab:D3656
Wiki Page:

Description (last modified by hvr)

This is a continuation of #9583

Compiling latest Cabal from Git's Language.Haskell.Extension module shows a significant performance regression with respect to GHC 7.8.3:

$ uname -a
Linux duo 3.17.0-031700rc6-generic #201409211935 SMP Sun Sep 21 23:37:11 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ git describe 
cabal-install_merged-1912-g7161e4c

#############################################################################

$ /opt/ghc/7.8.3/bin/ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3

$ /opt/ghc/7.8.3/bin/ghc -Rghc-timing -O -c Language/Haskell/Extension.hs 
<<ghc: 5228751760 bytes, 543 GCs, 
       34135677/114940864 avg/max bytes residency (14 samples),
       273M in use, 0.00 INIT (0.03 elapsed),
       4.96 MUT (5.28 elapsed), 3.26 GC (3.47 elapsed) :ghc>>

#############################################################################

$ /opt/ghc/head/bin/ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.9.20140923

$ /opt/ghc/head/bin/ghc -Rghc-timing -O -c Language/Haskell/Extension.hs 
<<ghc: 125917854120 bytes, 3181 GCs, 
       349136895/833496072 avg/max bytes residency (30 samples), 
       2096M in use, 0.001 INIT (0.001 elapsed),
       133.003 MUT (142.730 elapsed), 73.241 GC (78.234 elapsed) :ghc>>

Or in other words, this represents a ~7.7x "memory in use" and ~26x elapsed-time regression.
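For reference, those ratios can be recomputed from the two -Rghc-timing reports above (a quick sketch; memRatio/timeRatio are just illustrative names, and elapsed time is taken as MUT + GC):

```haskell
-- "in use" memory: 273M under 7.8.3 vs 2096M under HEAD.
memRatio :: Double
memRatio = 2096 / 273                             -- ~7.7

-- Elapsed MUT + GC: (5.28 + 3.47)s vs (142.730 + 78.234)s.
timeRatio :: Double
timeRatio = (142.730 + 78.234) / (5.28 + 3.47)    -- ~25, in line with the
                                                  -- "~26x" quoted above
```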

Attachments (2)

tracetcduh (160.1 KB) - added by dfeuer 7 weeks ago.
tracecsduh (2.4 KB) - added by dfeuer 7 weeks ago.


Change History (68)

comment:1 Changed 3 years ago by hvr

Description: modified (diff)

description line-wrapped for readability

comment:2 Changed 3 years ago by dreixel

What makes you think it's due to Generics? Not much changed there since 7.8...

comment:3 Changed 3 years ago by dreixel

Oh, I didn't notice that I was CC-ed on that other ticket. I will have a look.

comment:4 Changed 3 years ago by hvr

Priority: normal → high

Increasing priority, as Phab:D183 has landed and is therefore starting to affect buildbots and developers working on GHC HEAD.

comment:5 Changed 3 years ago by simonpj

I'm happy to support, but I currently believe that this isn't a GHC shortcoming. I'm willing to revise that view if evidence emerges! A good starting point might be the amount of code generated by Edward's example in comment 5 of #9583

comment:6 Changed 3 years ago by dreixel

I'm happy to investigate the problem, but it will take a few days, so we might need some temporary workaround. In fact, because of this I can't even build HEAD at the moment (my VM runs out of memory).

Can the Binary instances be disabled in Cabal for a while?

comment:7 Changed 3 years ago by Herbert Valerio Riedel <hvr@…>

In 3ecca02516af5de803e4ff667c8c969c5bffb35f/ghc:

Update `binary` submodule in an attempt to address #9630

This removes a couple of `INLINE` pragmas from the generics helper
classes. With this change the compile times and memory usage should
go back to the previous GHC 7.8.3 situation.

This has been submitted upstream as https://github.com/kolmodin/binary/pull/62
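For context, the helper instances in question follow the standard GHC.Generics pattern. Here is a runnable sketch of that shape (GCount, T3, and fieldCount are illustrative stand-ins, not binary's actual class or methods), with a note on where the removed INLINE pragmas sat:

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}

import GHC.Generics

-- A stand-in for binary's generics helper class: instead of
-- serialising, it just counts the fields it visits.
class GCount f where
  gcount :: f p -> Int

-- binary's instances each carried an INLINE pragma on the method
-- (e.g. {-# INLINE gput #-}). A constructor with n fields is
-- represented as a tree of n-1 (:*:) nodes wrapped in M1 layers, so
-- inlining at every layer blows up the intermediate Core; dropping
-- the pragmas restored 7.8.3-level compile times.
instance GCount a => GCount (M1 i c a) where
  gcount = gcount . unM1

instance (GCount a, GCount b) => GCount (a :*: b) where
  gcount (x :*: y) = gcount x + gcount y

instance GCount (K1 i a) where
  gcount _ = 1

data T3 = T3 () () () deriving Generic

fieldCount :: Int
fieldCount = gcount (from (T3 () () ()))
```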

comment:8 Changed 3 years ago by simonpj

Might it be worth leaving a few comments in the GBinary class to highlight that innocently inlining those methods is dangerous?

And we still don't really understand why, I think. gget looks as if it has lots of cpp-generated code, but the others look small. Unless the things they call are in turn INLINE.

So, a good fix, but it might be worth more investigation to figure out the best way to solve this in the end. I think Pedro's goal was to eliminate all the intermediate structure.

Simon

comment:9 Changed 2 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

So the short-term problem here is fixed, but the long-term fixes are still needed, so I'm bumping this to 7.12 for future work by Pedro.

comment:10 Changed 2 years ago by darchon

I have another data-point for this performance bug: http://hackage.haskell.org/package/unbound-generics. Compiling the following example: https://github.com/lambdageek/unbound-generics/blob/master/examples/F.hs

$ uname -a
Darwin wlan022084.mobiel.utwente.nl 14.1.0 Darwin Kernel Version 14.1.0: 
Thu Feb 26 19:26:47 PST 2015; 
root:xnu-2782.10.73~1/RELEASE_X86_64 x86_64

#############################################################################

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4

$ ghc -O -c -Rghc-timing -fforce-recomp -Wnot F.hs 
<<ghc: 1721650904 bytes, 3317 GCs, 
       16624472/43436104 avg/max bytes residency (12 samples), 112M in use, 
       0.00 INIT (0.00 elapsed), 1.85 MUT (2.22 elapsed),
       1.52 GC (2.03 elapsed) :ghc>>

#############################################################################

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.0.20150323

$ ghc -O -c -Rghc-timing -fforce-recomp -Wnot F.hs 
<<ghc: 13422435616 bytes, 21356 GCs, 
       46569528/80396192 avg/max bytes residency (25 samples), 221M in use, 
       0.000 INIT (0.001 elapsed), 11.911 MUT (12.895 elapsed), 
       10.298 GC (10.973 elapsed) :ghc>>

The interesting parts of running with -v2 are:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4

$ ghc -O -c -Rghc-timing -fforce-recomp -Wnot -v2 F.hs 
[...]
Result size of Simplifier
  = {terms: 2,583, types: 8,513, coercions: 1,371}
*** Specialise:
Result size of Specialise
  = {terms: 2,583, types: 8,513, coercions: 1,371}
*** Float out(FOS {Lam = Just 0, Consts = True, PAPs = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              PAPs = False})
  = {terms: 2,923, types: 9,081, coercions: 1,371}
*** Float inwards:
Result size of Float inwards
  = {terms: 2,923, types: 9,081, coercions: 1,371}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 6,490, types: 31,862, coercions: 17,447}
[...]

#############################################################################
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.0.20150323

$ ghc -O -c -Rghc-timing -fforce-recomp -Wnot -v2 F.hs 
[...]
Result size of Simplifier
  = {terms: 2,028, types: 5,759, coercions: 1,207}
*** Specialise:
Result size of Specialise
  = {terms: 11,945, types: 48,810, coercions: 9,282}
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}):
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 14,293, types: 68,412, coercions: 9,282}
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 22,244, types: 103,329, coercions: 35,450}
[...]

Running ghc-7.10 with -fno-specialise, we get:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.0.20150323

$ ghc -O -c -Rghc-timing -fforce-recomp -Wnot F.hs 
<<ghc: 1635521672 bytes, 3140 GCs, 
       19135611/46926832 avg/max bytes residency (12 samples), 124M in use, 
       0.000 INIT (0.001 elapsed), 1.891 MUT (2.121 elapsed), 
       1.853 GC (2.051 elapsed) :ghc>>

These are the same performance characteristics as with 7.8.4.

Last edited 2 years ago by darchon (previous) (diff)

comment:11 Changed 2 years ago by simonpj

I would love for someone to investigate this:

  • What exactly is happening?
  • Is it due to over-aggressive inlining in the binary library? comment:7 suggests that might be so.
  • Are there underlying issues in GHC itself? comment:10 suggests so.
  • If so, is the issue only that GHC is specialising too much? Or is it doing so uselessly or redundantly? And is it doing so by itself, or in response to user pragmas?
  • Even if GHC is behaving precisely as advertised, what should we do to stop this kind of code blow-up happening, or tell the user that it is happening, and give some clue about what to do?

We are a bit stuck here until someone has the time and motivation to get some data.

Thanks

Simon

comment:12 Changed 2 years ago by goldfire

Here is some relevant data (thanks to Michal Terepeta). These numbers were obtained while compiling haskell-src-exts:

(I'm choosing the -fno-call-arity results here for better comparison against 7.8, which lacked the -fcall-arity feature.)

** HEAD with -fno-call-arity

	Sun Apr 12 18:16 2015 Time and Allocation Profiling Report  (Final)

	   ghc +RTS -p -RTS [...] -fno-call-arity

	total time  =      113.71 secs   (113714 ticks @ 1000 us, 1 processor)
	total alloc = 121,884,896,720 bytes  (excludes profiling overheads)

COST CENTRE       MODULE          %time %alloc

SimplTopBinds     SimplCore        37.2   36.6
CoreTidy          HscMain           6.0    7.3
lintAnnots        CoreLint          5.8    6.5
pprNativeCode     AsmCodeGen        4.1    4.8
OccAnal           SimplCore         3.6    3.8
occAnalBind.assoc OccurAnal         2.9    3.2
StgCmm            HscMain           2.9    2.6
RegAlloc          AsmCodeGen        2.6    3.4
FloatOutwards     SimplCore         2.6    2.3
regLiveness       AsmCodeGen        2.5    2.8
tc_rn_src_decls   TcRnDriver        2.4    1.9
Simplify          SimplCore         2.4    0.3
sink              CmmPipeline       2.1    2.2
NewStranal        SimplCore         1.7    2.1
genMachCode       AsmCodeGen        1.4    1.4
layoutStack       CmmPipeline       1.4    1.4
NativeCodeGen     CodeOutput        1.1    1.2
FloatInwards      SimplCore         1.1    1.4
do_block          Hoopl.Dataflow    1.0    0.6
Digraph.scc       Digraph           0.8    1.3
** 7.8.4

	Sun Apr 12 15:41 2015 Time and Allocation Profiling Report  (Final)

	   ghc +RTS -p -RTS [...]

	total time  =       93.11 secs   (93112 ticks @ 1000 us, 1 processor)
	total alloc = 103,135,975,120 bytes  (excludes profiling overheads)

COST CENTRE                 MODULE         %time %alloc

SimplTopBinds               SimplCore       38.5   37.4
pprNativeCode               AsmCodeGen       6.2    7.2
StgCmm                      HscMain          3.9    4.2
RegAlloc                    AsmCodeGen       3.7    5.1
occAnalBind.assoc           OccurAnal        3.3    3.6
OccAnal                     SimplCore        3.3    3.6
regLiveness                 AsmCodeGen       3.1    3.4
FloatOutwards               SimplCore        2.9    2.4
sink                        CmmPipeline      2.8    2.8
Simplify                    SimplCore        2.6    0.3
tc_rn_src_decls             TcRnDriver       2.4    2.1
genMachCode                 AsmCodeGen       1.9    2.0
NewStranal                  SimplCore        1.8    2.1
layoutStack                 CmmPipeline      1.8    1.8
Core2Core                   HscMain          1.3    1.2
deSugar                     HscMain          1.1    1.1
do_block                    Hoopl.Dataflow   1.1    0.7
CoreTidy                    HscMain          1.0    1.1
CorePrep                    HscMain          1.0    1.1
Digraph.scc                 Digraph          0.9    1.5
versioninfo                 MkIface          0.9    1.0
zonkEvBndr_zonkTcTypeToType TcHsSyn          0.6    1.4

HEAD with -fno-specialise and -fno-call-arity:

 ghc +RTS -p -RTS  [...] -fno-specialise -fno-call-arity

total time  =       89.93 secs   (89928 ticks @ 1000 us, 1 processor)
total alloc = 93,495,685,792 bytes  (excludes profiling overheads)

COST CENTRE       MODULE          %time %alloc

SimplTopBinds     SimplCore        38.7   38.6
pprNativeCode     AsmCodeGen        5.1    5.9
StgCmm            HscMain           3.7    3.3
occAnalBind.assoc OccurAnal         3.2    3.6
OccAnal           SimplCore         3.2    3.6
tc_rn_src_decls   TcRnDriver        3.1    2.5
RegAlloc          AsmCodeGen        3.1    4.2
regLiveness       AsmCodeGen        3.0    3.5
FloatOutwards     SimplCore         2.7    2.4
sink              CmmPipeline       2.6    2.7
Simplify          SimplCore         2.5    0.1
NewStranal        SimplCore         1.9    2.3
genMachCode       AsmCodeGen        1.8    1.7
layoutStack       CmmPipeline       1.6    1.7
NativeCodeGen     CodeOutput        1.4    1.5
FloatInwards      SimplCore         1.2    1.6
CoreTidy          HscMain           1.2    1.2
deSugar           HscMain           1.2    1.2
do_block          Hoopl.Dataflow    1.1    0.7
CorePrep          HscMain           1.0    1.1
versioninfo       MkIface           0.9    1.0
Parser            HscMain           0.9    1.2
Digraph.scc       Digraph           0.9    1.5
canEvVar          TcCanonical       0.7    1.0

And now with only -fno-specialise:

 ghc +RTS -p -RTS [...] -fno-specialise

total time  =      109.78 secs   (109784 ticks @ 1000 us, 1 processor)
total alloc = 124,469,615,048 bytes  (excludes profiling overheads)

COST CENTRE       MODULE       %time %alloc

SimplTopBinds     SimplCore     35.4   32.4
CallArity         SimplCore     13.4   20.7
pprNativeCode     AsmCodeGen     4.1    4.5
OccAnal           SimplCore      3.1    3.1
StgCmm            HscMain        3.0    2.5
occAnalBind.assoc OccurAnal      3.0    3.1
RegAlloc          AsmCodeGen     2.6    3.2
tc_rn_src_decls   TcRnDriver     2.5    1.8
regLiveness       AsmCodeGen     2.4    2.6
Simplify          SimplCore      2.3    0.1
FloatOutwards     SimplCore      2.3    1.8
sink              CmmPipeline    2.2    2.0
genMachCode       AsmCodeGen     1.5    1.3
NewStranal        SimplCore      1.5    1.7
layoutStack       CmmPipeline    1.3    1.3
NativeCodeGen     CodeOutput     1.2    1.1
FloatInwards      SimplCore      1.0    1.2
Digraph.scc       Digraph        0.8    1.2

comment:13 Changed 2 years ago by goldfire

One thing I see in the (first set of) numbers above is that lintAnnots takes up a lot of time! It might be a problem with misattributed cost centres, but then at least the cost centres should be sorted out.

The call-arity problem is being tracked in #10293.

comment:14 Changed 2 years ago by goldfire

Michal describes how he got the numbers pasted in comment:12:

Build GHC HEAD/7.8.4 with build flavour prof, then get the haskell-src-exts sources, install the dependencies, and finally add +RTS -p -RTS to the cabal file and compile it; the resulting ghc.prof contains the profiling information.

comment:15 Changed 2 years ago by gidyn

CoreTidy seems to be another culprit for investigation.

comment:16 Changed 2 years ago by gidyn

Cc: gidyn added

comment:17 in reply to:  13 Changed 2 years ago by thomie

Replying to goldfire:

One thing I see in the (first set of) numbers above is that lintAnnots takes up a lot of time!

Please be aware of #10007: "Fix misattribution of Cost Centre profiles to lintAnnots".

comment:18 Changed 23 months ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:19 Changed 19 months ago by ezyang

Keywords: deriving-perf added

comment:20 Changed 19 months ago by kolmodin

Cc: kolmodin added

comment:21 Changed 19 months ago by ezyang

I did some investigation related to this bug on GHC 7.10.3-ish and binary 0.7.5.0 (yes it's old, but I didn't see any more recent relevant commits; and it's what GHC is bootstrapping with). Here are some partial findings:

  1. I can trigger bad constant factors with this data type:
    {-# LANGUAGE DeriveGeneric #-}
    module A where
    import Data.Binary
    import GHC.Generics
    data T = T
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     deriving Generic
    instance Binary T
    
    ezyang@sabre:~/Dev/labs$ time $DEV/ghc-7.10-frontend-plugins/usr/bin/ghc --make Bin.hs -O2  -fforce-recomp -fliberate-case-threshold=1000 
    [1 of 1] Compiling A                ( Bin.hs, Bin.o )
    
    real	0m7.369s
    user	0m6.408s
    sys	0m0.224s
    ezyang@sabre:~/Dev/labs$ time $DEV/ghc-7.10-frontend-plugins/usr/bin/ghc --make Bin.hs -O0  -fforce-recomp -fliberate-case-threshold=1000 
    [1 of 1] Compiling A                ( Bin.hs, Bin.o )
    
    real	0m0.881s
    user	0m0.776s
    sys	0m0.080s
    
  2. The problem gets worse if you have explicit field names, something like a 2x factor.
    data T = T
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     () () () () () () () () () ()
     deriving Generic
    
    -- versus
    
    data T = T {
        f0 :: (),
        f1 :: (),
        f2 :: (),
        f3 :: (),
        f4 :: (),
        f5 :: (),
        f6 :: (),
        f7 :: (),
        f8 :: (),
        f9 :: (),
        f10 :: (),
        f11 :: (),
        f12 :: (),
        f13 :: (),
        f14 :: (),
        f15 :: (),
        f16 :: (),
        f17 :: (),
        f18 :: (),
        f19 :: (),
        f20 :: (),
        f21 :: (),
        f22 :: (),
        f23 :: (),
        f24 :: (),
        f25 :: (),
        f26 :: (),
        f27 :: (),
        f28 :: (),
        f29 :: (),
        f30 :: (),
        f31 :: (),
        f32 :: (),
        f33 :: (),
        f34 :: (),
        f35 :: (),
        f36 :: (),
        f37 :: (),
        f38 :: (),
        f39 :: ()
        }
        deriving (Generic)
    
    ezyang@sabre:~/Dev/labs$ time $DEV/ghc-7.10-frontend-plugins/usr/bin/ghc --make Bin.hs -O  -fforce-recomp -fliberate-case-threshold=1000
    [1 of 1] Compiling A                ( Bin.hs, Bin.o )
    
    real	0m2.649s
    user	0m2.520s
    sys	0m0.080s
    ezyang@sabre:~/Dev/labs$ time $DEV/ghc-7.10-frontend-plugins/usr/bin/ghc --make Bin.hs -O  -fforce-recomp -fliberate-case-threshold=1000
    [1 of 1] Compiling A                ( Bin.hs, Bin.o )
    
    real	0m4.990s
    user	0m4.824s
    sys	0m0.144s
    
    It's NOT just generic deriving; the generic deriving is still pretty speedy, honestly.
  3. If you inline all of Binary's class definitions into the same file, things speed up. Comparing the two cases, it seems that when we cross-module compile, GHC does more inlining and more optimization, and it takes longer. This is something like a 30% factor.

comment:22 Changed 19 months ago by ezyang

Here is a minimized test-case with no dependencies, which may be useful for diagnosing:

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}
{-# OPTIONS_GHC -fno-warn-missing-methods #-}
module Gen where

import GHC.Generics
import Control.Monad
import Data.Monoid

data PairS a = PairS a !(() -> ())

newtype PutM a = Put { unPut :: PairS a }

-- Use of this writer monad seems to be important; IO speeds it up
type Put = PutM ()
-- type Put = IO ()

-- binary has INLINE pragmas on most of the instances but you can still
-- trigger bad behavior without them.
instance Functor PutM where
        fmap f m = Put $ let PairS a w = unPut m in PairS (f a) w

-- Just to appease AMP
instance Applicative PutM where
        pure  = return
        (<*>) = ap

instance Monad PutM where
    return a = Put $ PairS a id

    m >>= k  = Put $
        let PairS a w  = unPut m
            PairS b w' = unPut (k a)
        in PairS b (w . w')

class GBinary f where
    gput :: f t -> Put
    -- Forcing the dictionary to have two elements hurts
    -- the optimizer a lot.
    not_used :: f t

instance GBinary a => GBinary (M1 i c a) where
    gput = gput . unM1

instance Binary a => GBinary (K1 i a) where
    gput = put . unK1

instance (GBinary a, GBinary b) => GBinary (a :*: b) where
    gput (x :*: y) = gput x >> gput y

class Binary t where
    put :: t -> Put

instance Binary () where
    put () = return ()

data T = T () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
           () () () () () () () () () ()
    deriving Generic

-- Trigger specialization
tput :: T -> Put
tput = gput . from

On my machine, it takes 2.8s to build -O2, and 0.9s to build -O0. There are a few important ways you can tweak this:

  1. This also exhibits the "out-of-line more expensive" behavior; moving the code out into a separate module jumps compile time from 2.7s to 5.2s.
  2. If you replace PutM () with IO (), compile time goes from 2.7s to 1.9s.
  3. Removing not_used pushes compile time from 2.7s to 1.5s. This DOES NOT stack with (2). So having to deal with dictionary records seems to make things slower.
Last edited 19 months ago by ezyang (previous) (diff)

comment:23 Changed 19 months ago by ezyang

I managed to massively speed up Binary by applying (3). So I think we have our culprit. Here's the PR: https://github.com/kolmodin/binary/pull/95

(To be clear, this doesn't fix GHC's regression, but it does work around it pretty effectively, it would seem.)
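The PR's change is structural: GBinary's two methods move into two single-method classes, so each generic dictionary is a bare function rather than a record. A minimal runnable sketch of the same shape (GPut, P, encodeP, and the String-based "encoding" are illustrative stand-ins, not binary's real types):

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}

import GHC.Generics

-- Before the change (sketch): one class with both methods, forcing a
-- two-field dictionary record at every node of the generic tree:
--
--   class GBinary f where
--     gput :: f p -> Put
--     gget :: Get (f p)
--
-- After: one method per class. Only the "put" half is sketched here.
class GPut f where
  gput :: f p -> String     -- String stands in for binary's Put

instance GPut a => GPut (M1 i c a) where
  gput = gput . unM1

instance (GPut a, GPut b) => GPut (a :*: b) where
  gput (x :*: y) = gput x ++ gput y

instance GPut (K1 i a) where
  gput _ = "u"              -- pretend every field encodes to one unit

data P = P () () deriving Generic

encodeP :: P -> String
encodeP = gput . from
```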

Last edited 19 months ago by ezyang (previous) (diff)

comment:24 Changed 19 months ago by ezyang

Status: new → patch

comment:25 Changed 19 months ago by simonpj

Thank you for helping to characterise this. More please! E.g. is it the type checker, or all the way down the pipeline? Do terms get big? Why does putting Binary's classes in the same file help? Why does removing not_used help?

One possible reason: see Scrap your type applications Section 2.3. Generics, I suspect, uses lots of nested types!
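Simon's nested-types point can be seen directly: the generic representation of an n-field constructor is a tree of (:*:) applications wrapped in M1 metadata, so the types GHC manipulates grow with the field count. A small probe (T2, T4, and repSize2/repSize4 are hypothetical names; the length of the printed TypeRep is only a crude proxy for type size):

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Proxy (Proxy (..))
import Data.Typeable (typeRep)
import GHC.Generics

data T2 = T2 () ()       deriving Generic
data T4 = T4 () () () () deriving Generic

-- Length of the fully-reduced representation type, printed.
repSize2, repSize4 :: Int
repSize2 = length (show (typeRep (Proxy :: Proxy (Rep T2 ()))))
repSize4 = length (show (typeRep (Proxy :: Proxy (Rep T4 ()))))
-- repSize4 is substantially larger than repSize2: each extra field
-- adds another S1/K1 layer and another (:*:) node to the Rep type.
```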

comment:26 Changed 19 months ago by simonpj

Splitting binary as done in comment:23 will improve matters. As you say it doesn't fix GHC; indeed it'll make the problem less egregious.

Just to check, then: does comment:22 give a standalone case, independent of the binary library, that demonstrates the effect of splitting the Binary class into two?

comment:27 Changed 19 months ago by ezyang

Yes, comment:22 gives a standalone case of how making a type class only have one method improves time. I can try to characterize this further.

comment:28 Changed 19 months ago by bgamari

Owner: simonpj deleted
Status: patch → new

comment:29 Changed 18 months ago by thomie

Keywords: Generics added

comment:30 Changed 18 months ago by bgamari

Milestone: 8.0.1 → 8.2.1

Punting to 8.2.

comment:31 Changed 16 months ago by erikd

Cc: erikd added

comment:32 Changed 16 months ago by michalt

Cc: michalt added

comment:33 Changed 15 months ago by RyanGlScott

Cc: RyanGlScott added

comment:34 Changed 15 months ago by thoughtpolice

Just as an aside, I also applied the trick from comment:23 to cereal, and the results were almost unbelievable. The example given (comment:22) actually exhausts the simplifier's tick budget without it, so the trick essentially turns a non-compiling program into one that compiles in less than a second.

https://github.com/GaloisInc/cereal/commit/94c87592d010fdb381c6a2dc963bbfcef99c3a27

The result on the binary-serialise-cbor suite with this change is incredible. Just another data point - it would be useful to minimize cereal akin to binary in comment:22 and see the difference.

comment:35 Changed 14 months ago by akst

Cc: akst added

comment:36 Changed 10 months ago by simonmar

In case any more motivation is needed here: we have had to stop using deriving Generic and find alternative methods due to this problem, so it's an actual blocker for some use cases of deriving Generic.

Do we know what the problem is? Is it fixable?

comment:37 Changed 10 months ago by simonpj

We don't seem to have a well-characterised idea of what is actually going wrong here. As I said in comment:25 we need more info. It really just needs someone to pay a bit of time and attention to identify exactly what is happening.

comment:38 Changed 7 months ago by bgamari

Milestone: 8.2.1 → 8.4.1

Sadly we seem to have lost steam on this. Bumping off to 8.4.

comment:39 Changed 5 months ago by RyanGlScott

See also #13059 for an example of GHC.Generics-related memory blowup in the store library.

comment:40 Changed 3 months ago by dfeuer

Cc: dfeuer added

It appears that much or all of the regression in the original example occurred in b9e49d3e9580e13d89efd1f779cb76f610e0d6e0. Before, I get

dfeuer@squirrel:~/src/cabal> ~/src/ghc-iproute-before/inplace/bin/ghc-stage2 -Rghc-timing -O -c Cabal/Language/Haskell/Extension.hs 
<<ghc: 5795390224 bytes, 640 GCs, 37133197/104324936 avg/max bytes residency (16 samples), 248M in use, 0.001 INIT (0.001 elapsed), 3.981 MUT (4.258 elapsed), 2.622 GC (2.639 elapsed) :ghc>>

After,

dfeuer@squirrel:~/src/cabal> ~/src/ghc-iproute-after/inplace/bin/ghc-stage2 -Rghc-timing -O -c Cabal/Language/Haskell/Extension.hs 
<<ghc: 126972407680 bytes, 3337 GCs, 340164097/914871104 avg/max bytes residency (36 samples), 2652M in use, 0.001 INIT (0.001 elapsed), 82.951 MUT (118.315 elapsed), 52.254 GC (52.317 elapsed) :ghc>>

Jeepers!

comment:41 Changed 3 months ago by dfeuer

Note: b9e49d3e9580e13d89efd1f779cb76f610e0d6e0 added 1024 specialization rules to -ddump-rules for this module.

comment:42 Changed 3 months ago by simonpj

1024 specialization rules to -ddump-rules for this module.

Really? I compiled the module in comment:22 with HEAD and got no rules at all! What am I doing wrong?

Result size of Tidy Core
  = {terms: 2,582, types: 63,214, coercions: 20,656, joins: 0/34}

Simon

comment:43 in reply to:  42 Changed 3 months ago by dfeuer

Replying to simonpj:

1024 specialization rules to -ddump-rules for this module.

Really? I compiled the module in comment:22 with HEAD and got no rules at all! What am I doing wrong?

Result size of Tidy Core
  = {terms: 2,582, types: 63,214, coercions: 20,656, joins: 0/34}

Simon

Yes, it seems that a subsequent change removed those, so maybe that's not relevant anymore.

comment:44 Changed 3 months ago by simonpj

OK, so is there now a regression wrt 7.8 or not? If not let's close. If so, let's investigate.

comment:45 Changed 3 months ago by dfeuer

Milestone: 8.4.1 → 8.2.1

This appears to be fixed in HEAD, but I haven't checked the 8.2 release. I checked both the reproduction in comment:22 and the Cabal module that originally triggered this. I don't know why I didn't check that before going digging through history. Whoops. I'll add a perf test case based on Edward's reproduction.

comment:46 Changed 3 months ago by bgamari

Owner: set to dfeuer

How is this going, David?

comment:47 Changed 3 months ago by asr

Cc: asr added

comment:48 Changed 2 months ago by dfeuer

I think the original issue is fixed. To be extra clear, the original issue was that it was very slow to compile Language.Haskell.Extension from Cabal. I was able to reproduce this using Cabal at commit eaf2aae18603965ff668384b391fcaa14de19823. The problem there appeared in GHC commit b9e49d3e9580e13d89efd1f779cb76f610e0d6e0, which added 1024 specialization rules for said module. The problem is now gone.

It looks to me as though ezyang's test case in comment:22 doesn't actually relate to the regression; I don't see it moving terribly much from 7.8 to 7.10 or from 7.10 to 8.2. It seems to be looking at other issues. I will try to see how those have shifted.

Stupid me. I forgot to work with the old binary version.

Last edited 2 months ago by dfeuer (previous) (diff)

comment:49 Changed 2 months ago by dfeuer

A bit of good news: I built the commit of cereal from just before its GSerialize class was split in two. I confirmed that the test case in https://github.com/GaloisInc/cereal/pull/54 caused tick exhaustion in 8.0.2. I found that it worked just fine with a recent GHC build, and continued to do so even when I substantially increased the size of the test case.

comment:50 Changed 2 months ago by simonpj

OK, so we conclude that the original problem is now fixed? (Albeit we don't quite know how.)

If so, great; but let's add a regression test.

comment:51 in reply to:  50 Changed 8 weeks ago by dfeuer

Replying to simonpj:

OK, so we conclude that the original problem is now fixed? (Albeit we don't quite know how.)

If so, great; but let's add a regression test.

I'm not sure. There were two changes to binary relating to this ticket: several INLINEs were removed, and the generic classes were split. Are we hoping to be able to reintroduce the inlines and merge the classes? I don't think we're in a position to do that right now (I tried, and compiling Cabal got quite slow). Are we hoping just to be able to merge the classes? I'll try to see how that looks next.

comment:52 Changed 8 weeks ago by simonpj

Are we hoping to be able to reintroduce the inlines and merge the classes?

Well, the original Description says that compiling Cabal with GHC HEAD (7.9) was 26x slower than compiling with GHC 7.8. That's a big difference. How long does it take to compile (the same) Cabal with GHC 8.2? That's the question at hand. If it's still 26x slower (and your comment suggests that it is, albeit maybe not 26x), I'd love to dig a bit into why.

comment:53 in reply to:  42 Changed 8 weeks ago by dfeuer

Argh! I was stupid again, in a different way. When I was trying out the old-style generics for binary, I forgot that my stage1 was 8.0.2, which is no good for that. When I used GHC 8.2.1 as the stage1, it built just fine. Using the merged classes and inlines to compile Language.Haskell.Extension takes 9 seconds, whereas with the split class and no inlines it takes 7. So I think this actually is fixed now. I'll put together a test case.

comment:54 Changed 8 weeks ago by dfeuer

Nope, nope, nope. Actually, there still seems to be a problem, but I think it's been masked by the early inline patch. Specifically, removing the INLINEs makes the merged class problematic.

comment:55 Changed 7 weeks ago by dfeuer

I pored over -ddump-ds, and I think I see what's going on. Suppose we have (cereal-style, but without any INLINEs)

class Serialize t where
    put :: Putter t
    get :: Get t

    default put :: (Generic t, GSerialize (Rep t)) => Putter t
    put = gPut . from

    default get :: (Generic t, GSerialize (Rep t)) => Get t
    get = to <$> gGet


data T = T
 () () () () ()
  deriving Generic
instance Serialize T

What we want to do is build a single GSerialize (Rep T) dictionary and share it between put and get. But when we desugar instance Serialize T, we end up building a GSerialize (Rep T) dictionary in the definition of put and another one in the definition of get.

The problem even occurs without DefaultSignatures; writing the instance by hand using gPut and gGet is sufficient to demonstrate the problem:

instance Serialize T where
  put = gPut . from
  get = to <$> gGet

bgamari indicates that instances are supposed to be cached. Indeed, if I write

putT :: Putter T
putT = gPut . from

getT :: Get T
getT = to <$> gGet

at the top level, then all is well.

comment:56 Changed 7 weeks ago by dfeuer

Here's a truly bare-bones, Haskell 98 reproduction:

module Duh where

class Duh a where
  duh :: Integer -> a
  noDuh :: a -> Integer

class GenDuh a where
  gduh :: Integer -> a
  gnoDuh :: a -> Integer

instance GenDuh () where
  gduh = const ()
  gnoDuh = const 1

instance GenDuh a => GenDuh (Maybe a) where
  gduh 0 = Nothing
  gduh n = Just (gduh (n - 1))

  gnoDuh Nothing = 0
  gnoDuh (Just x) = 1 + gnoDuh x

data T = T (Maybe (Maybe ()))

instance Duh T where
  duh i = T (gduh i)
  noDuh (T m) = gnoDuh m

GHC 8.2.1rc2 with -O2 -ddump-ds produces

-- RHS size: {terms: 13, types: 18, coercions: 0, joins: 0/2}
$cnoDuh_aRo :: T -> Integer
[LclId]
$cnoDuh_aRo
  = let {
      $dGenDuh_a2nn :: GenDuh (Maybe ())
      [LclId]
      $dGenDuh_a2nn = Duh.$fGenDuhMaybe @ () Duh.$fGenDuh() } in
    let {
      $dGenDuh_aRs :: GenDuh (Maybe (Maybe ()))
      [LclId]
      $dGenDuh_aRs = Duh.$fGenDuhMaybe @ (Maybe ()) $dGenDuh_a2nn } in
    \ (ds_d2ow :: T) ->
      case ds_d2ow of { T m_azN ->
      gnoDuh @ (Maybe (Maybe ())) $dGenDuh_aRs m_azN
      }

-- RHS size: {terms: 11, types: 14, coercions: 0, joins: 0/2}
$cduh_aRi :: Integer -> T
[LclId]
$cduh_aRi
  = let {
      $dGenDuh_a2nl :: GenDuh (Maybe ())
      [LclId]
      $dGenDuh_a2nl = Duh.$fGenDuhMaybe @ () Duh.$fGenDuh() } in
    let {
      $dGenDuh_aRm :: GenDuh (Maybe (Maybe ()))
      [LclId]
      $dGenDuh_aRm = Duh.$fGenDuhMaybe @ (Maybe ()) $dGenDuh_a2nl } in
    \ (i_azM :: Integer) ->
      Duh.T (gduh @ (Maybe (Maybe ())) $dGenDuh_aRm i_azM)

We solve the GenDuh (Maybe (Maybe ())) constraint twice, and build its dictionary twice. I'll attach -ddump-tc-trace.
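For reference, here is a self-contained version of the Duh module (plain Haskell 98, module name and helper names are mine) with the workaround from comment:55 applied: the class methods are routed through top-level bindings, so the GenDuh (Maybe (Maybe ())) dictionary is solved once and shared rather than rebuilt per method.

```haskell
module Main where

class GenDuh a where
  gduh :: Integer -> a
  gnoDuh :: a -> Integer

instance GenDuh () where
  gduh = const ()
  gnoDuh = const 1

instance GenDuh a => GenDuh (Maybe a) where
  gduh 0 = Nothing
  gduh n = Just (gduh (n - 1))

  gnoDuh Nothing = 0
  gnoDuh (Just x) = 1 + gnoDuh x

data T = T (Maybe (Maybe ())) deriving Show

-- Workaround: each top-level binding is elaborated against a single
-- GenDuh (Maybe (Maybe ())) dictionary, which the desugarer then
-- shares between both class methods.
duhT :: Integer -> T
duhT i = T (gduh i)

noDuhT :: T -> Integer
noDuhT (T m) = gnoDuh m

main :: IO ()
main = mapM_ (print . noDuhT . duhT) [0, 1, 2]
```

Note that the depth of the built value is bounded by the type, so noDuhT (duhT n) saturates at 3 for n >= 2.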

Changed 7 weeks ago by dfeuer

Attachment: tracetcduh added

Changed 7 weeks ago by dfeuer

Attachment: tracecsduh added

comment:57 Changed 7 weeks ago by dfeuer

My Duh test case appears to go bad in 32973bf3c2f6fe00e01b44a63ac1904080466938 ("Major patch to add -fwarn-redundant-constraints").

comment:58 Changed 7 weeks ago by RyanGlScott

I've confirmed this as well. Before that commit, this is the -ddump-ds output:

$cnoDuh_axu :: T -> Integer
[LclId, Str=DmdType]
$cnoDuh_axu =
  \ (ds_dTM :: T) ->
    case ds_dTM of _ [Occ=Dead] { T m_avB ->
    gnoDuh @ (Maybe (Maybe ())) $dGenDuh_axt m_avB
    }

$cduh_axo :: Integer -> T
[LclId, Str=DmdType]
$cduh_axo =
  \ (i_avA :: Integer) ->
    Duh.T (gduh @ (Maybe (Maybe ())) $dGenDuh_axt i_avA)

But after that commit:

$cnoDuh_axw :: T -> Integer
[LclId, Str=DmdType]
$cnoDuh_axw =
  let {
    $dGenDuh_aNk :: GenDuh (Maybe ())
    [LclId, Str=DmdType]
    $dGenDuh_aNk = Duh.$fGenDuhMaybe @ () Duh.$fGenDuh() } in
  let {
    $dGenDuh_axC :: GenDuh (Maybe (Maybe ()))
    [LclId, Str=DmdType]
    $dGenDuh_axC = Duh.$fGenDuhMaybe @ (Maybe ()) $dGenDuh_aNk } in
  \ (ds_dTV :: T) ->
    case ds_dTV of _ [Occ=Dead] { T m_avB ->
    gnoDuh @ (Maybe (Maybe ())) $dGenDuh_axC m_avB
    }

$cduh_axp :: Integer -> T
[LclId, Str=DmdType]
$cduh_axp =
  let {
    $dGenDuh_aNi :: GenDuh (Maybe ())
    [LclId, Str=DmdType]
    $dGenDuh_aNi = Duh.$fGenDuhMaybe @ () Duh.$fGenDuh() } in
  let {
    $dGenDuh_axv :: GenDuh (Maybe (Maybe ()))
    [LclId, Str=DmdType]
    $dGenDuh_axv = Duh.$fGenDuhMaybe @ (Maybe ()) $dGenDuh_aNi } in
  \ (i_avA :: Integer) ->
    Duh.T (gduh @ (Maybe (Maybe ())) $dGenDuh_axv i_avA)

comment:59 Changed 7 weeks ago by simonpj

I can see that you get two copies of the same code, which CSE will presumably get rid of. That might be worth trying to fix.

But why does it lead to a 26x increase in compilation time? I'd expect it to be unnoticeable.

comment:60 in reply to:  59 Changed 7 weeks ago by dfeuer

Replying to simonpj:

I can see that you get two copies of the same code, which CSE will presumably get rid of. That might be worth trying to fix.

But why does it lead to a 26x increase in compilation time? I'd expect it to be unnoticeable.

Well, it's not always going to cause serious problems, and I think this ticket touches several underlying issues, but I bet it can cause serious trouble. Suppose you have a class with ten methods, all using generic defaults. Desugaring an instance then builds ten copies of the generic dictionary, one per method. Depending on how inlining goes, I suspect that could be quite bad.
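To make the scaling concern concrete, here is a hypothetical sketch (class and method names are mine, using DefaultSignatures as a stand-in for the full Generic machinery): every defaulted method of C independently demands the GHelper dictionary, so an n-method class causes the constraint to be solved n times during desugaring.

```haskell
{-# LANGUAGE DefaultSignatures #-}
module Main where

-- Stand-in for the generic-representation class (GSerialize in the
-- cereal example above).
class GHelper a where
  g1 :: a -> Int
  g2 :: a -> Int
  g3 :: a -> Int

class C a where
  m1 :: a -> Int
  default m1 :: GHelper a => a -> Int
  m1 = g1

  m2 :: a -> Int
  default m2 :: GHelper a => a -> Int
  m2 = g2

  m3 :: a -> Int
  default m3 :: GHelper a => a -> Int
  m3 = g3

data Foo = Foo Int

instance GHelper Foo where
  g1 (Foo n) = n
  g2 (Foo n) = n + 1
  g3 (Foo n) = n * 2

-- Each defaulted method is desugared separately, so (before the fix)
-- the GHelper Foo constraint is solved once per method rather than
-- once per instance.
instance C Foo

main :: IO ()
main = print (m1 (Foo 10), m2 (Foo 10), m3 (Foo 10))
```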

comment:61 Changed 6 weeks ago by dfeuer

Commit 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 (Stop the specialiser generating loopy code) had a big positive impact here. My test case (as yet unpublished) compiles over twice as quickly. Edward's (when the specialization is cross-module) compiles something like one and a half times as quickly, and the unused method no longer hurts it (in fact it helps it slightly; perhaps the single-method class optimization burns a little time). So it looks pretty likely that this bug is squashed. I'll try to check a few more things before adding a test case and calling it done. I suspect it's still worth looking into the doubled constraint solving that came up in comment:55, but perhaps that should be a separate (lower priority) ticket.

comment:62 Changed 5 weeks ago by dfeuer

Hmmmm... While 2b74bd9d8b4c6b20f3e8d9ada12e7db645cc3c19 seems to improve matters considerably, it doesn't fix the whole issue. ezyang's reproduction in comment:22 can be modified to make things blow up after the commit, and as far back as 7.8 at least. Specifically:

  • Increase the size of the definition of T, leaving it in the Gen module.
  • Move the definition of tput to a separate GenSpec module.

This will lead to simplifier tick exhaustion. If the definition of T (or even just the derivation of Generic T) is moved to GenSpec, then compilation succeeds.

The number of terms seems to blow up several simplifier passes after specialization. I don't yet know why, but it smells like inlining.

comment:63 Changed 5 weeks ago by dfeuer

Actually, push that all the way back to 7.4.1, which produces a warning instead of an error for tick exhaustion. I don't have easy access to any earlier versions. So I guess part of this problem is a lot older than previously realized.

comment:64 Changed 5 weeks ago by dfeuer

Differential Rev(s): Phab:D3656

I've added a test demonstrating improvement from the specializer change. It's not as precise as I'd like. A variation (moving the definition of T into the T9630a module while leaving the specialization in T9630) has more dramatic effects, with simplifier tick exhaustion in the older GHC. But as mentioned, a similar modification of Edward's test case does not see any improvement.

comment:65 Changed 5 weeks ago by dfeuer

FYI, the case that fails with tick exhaustion for (apparently) absolutely everything from 7.4 onward is in this gist. Ryan says it worked fine in 7.2, but I doubt it'll be useful to try to figure out just what made that happen.

comment:66 Changed 5 days ago by dfeuer

Resolution: fixed
Status: new → closed

As the regression seems to be fixed, I've opened #13993 to address the longer-standing issue and am closing this ticket.

Note: See TracTickets for help on using tickets.