Opened 5 years ago

Last modified 10 months ago

#7450 new bug

Regression in optimisation time of functions with many patterns (6.12 to 7.4)?

Reported by: iustin Owned by: bgamari
Priority: normal Milestone:
Component: Compiler Version: 7.6.1
Keywords: Cc: HairyDude, RyanGlScott
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D1012
Wiki Page:

Description

In our project, we build (via TH) some non-trivial data structures, including custom serialisation. After growing the number of constructors significantly, compiling the same code with GHC 7.4/7.6 is about 10 times slower than with GHC 6.12.

Narrowing this down with a simpler test case, it turns out that TH is not the cause; the slowdown comes from optimising the generated code.

The attached Generator.hs file generates a simple data structure (our project uses a much smaller number of constructors, but with more complex data types, and the result is the same):

data F
  = F1 {fldF1 :: Int} |
    F2 {fldF2 :: Int} |
    F3 {fldF3 :: Int} |
    F4 {fldF4 :: Int} |
    F5 {fldF5 :: Int} |
  … deriving (Eq, Show, Read)

prettyF :: F -> String
prettyF (F1 v) = show v
prettyF (F2 v) = show v
…

The main.hs file just calls this splice and then does a trivial use of prettyF.

Compiling this with a variable number of constructors, with -O:

Constructors   6.12     7.6
50             2.45s    4.10s
100            4.40s    10.60s
200            8.45s    33.30s
400            16.90s   121.00s
800            35.95s   514.50s

As you can see, 6.12 looks linear in the number of constructors, whereas 7.6 no longer is (at least above 100 constructors).

Compiling without -O makes things sane again:

Constructors   6.12     7.6
50             1.40s    1.97s
100            2.45s    2.70s
200            4.50s    4.95s
400            8.95s    9.55s
800            18.25s   19.10s

After some more investigation, it turns out that our function (prettyF) alone, with no automatic instances, is cheap; deriving Eq/Show is also cheap; but Read is very expensive (all numbers with GHC 7.6):

Constructors   No instances   Eq      Show     Read
50             0.75s          0.90s   1.20s    2.95s
100            0.85s          1.00s   1.70s    6.80s
200            1.20s          1.45s   2.85s    19.15s
400            2.05s          2.50s   5.40s    64.45s
800            4.30s          5.40s   11.65s   259.40s

So I believe this is about the optimisation of complex functions with many patterns/case statements. 6.12 seems nicely linear, whereas 7.4/7.6 exhibit much worse behaviour.

This might be a known issue, or even a non-issue ("Don't use Read for complex types"), but I thought it better to report it. I don't have the skills to examine the Core output and see whether it differs between 6.12 and 7.6, sorry.

Attachments (3)

Generator.hs (1.2 KB) - added by iustin 5 years ago.
TH generator file
main.hs (117 bytes) - added by iustin 5 years ago.
main file (trivial)
ghc.prof (47.1 KB) - added by bgamari 2 years ago.
Profile from ghc compiling 1000-constructor type with non-occurring cost centres removed


Change History (37)

Changed 5 years ago by iustin

Attachment: Generator.hs added

TH generator file

Changed 5 years ago by iustin

Attachment: main.hs added

main file (trivial)

comment:1 Changed 5 years ago by igloo

difficulty: Unknown
Resolution: duplicate
Status: new → closed

Thanks for the report. This looks like a duplicate of #7258.

comment:2 Changed 5 years ago by simonpj

Resolution: duplicate
Status: closed → new

It may be a dup of #7258, but if it's true that the issue is solely concerned with deriving(Read) that would localise it much more than saying it's to do with DynFlags.lhs.

So if I understand it, simply having your data type with lots of constructors and deriving(Read) is enough to trigger this non-linear behaviour?

Simon

comment:3 Changed 5 years ago by igloo

Lots of named fields; see W2.hs in #7258.

comment:4 in reply to:  2 Changed 5 years ago by iustin

Replying to simonpj:

It may be a dup of #7258, but if it's true that the issue is solely concerned with deriving(Read) that would localise it much more than saying it's to do with DynFlags.lhs.

So if I understand it, simply having your data type with lots of constructors and deriving(Read) is enough to trigger this non-linear behaviour?

Indeed. The following conditions seem to be required:

  • many constructors (> ~100)
  • constructors should be record constructors, not normal ones
  • optimisations must be turned on

For normal constructors, I can't trigger it with even ~1500 constructors (the scaling is slightly worse than linear, but still very acceptable, e.g. 30s for 1600 normal constructors versus 259s for 800 record constructors).

comment:5 Changed 5 years ago by simonpj

One reason that the derived code was big was this. The derived Read code has lots of this

   do { ...
      ; Ident "foo" <- lexP
      ; Punc "=" <- lexP
      ;  ...

Each of these failable pattern matches generates a case expression with a call to error and an error string. This is very wasteful. Better instead to define in GHC.Read:

  expectP :: L.Lexeme -> ReadPrec ()

and use it thus

   do { ...
      ; expectP (Ident "foo")
      ; expectP (Punc "=")
      ;  ...

This makes the code significantly shorter. Without -O, and 200 constructors, the compiler itself allocates half as much as before.

This may or may not address the non-linearity, but it certainly improves Read instances.
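The contrast can be sketched with the library combinators themselves. This is a schematic desugaring, not the exact Core GHC emits, and the fail string stands in for the compiler-generated pattern-match error message:

```haskell
import Text.Read (Lexeme (..), ReadPrec, lexP, readPrec_to_S)

-- Roughly what the failable bind   Ident "foo" <- lexP   turns into:
-- a case over the lexeme with the string match inlined and a failure
-- branch carrying an error string, repeated per field and constructor.
matchFoo :: ReadPrec ()
matchFoo = lexP >>= \lx ->
  case lx of
    Ident "foo" -> return ()
    _           -> fail "Pattern match failure in do expression"
```

With expectP the lexeme value is built once and compared inside the library, so the derived code carries one small call per token instead of an expanded case.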

commit 52e43004f63276c1342933e40a673ad25cf2113a
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Dec 21 17:39:33 2012 +0000

    Use expectP in deriving( Read )
    
    Note [Use expectP]   in TcGenDeriv
    ~~~~~~~~~~~~~~~~~~
    Note that we use
       expectP (Ident "T1")
    rather than
       Ident "T1" <- lexP
    The latter desugars to inline code for matching the Ident and the
    string, and this can be very voluminous. The former is much more
    compact.  Cf Trac #7258, although that also concerned non-linearity in
    the occurrence analyser, a separate issue.

There is an accompanying patch to base:

commit d9b6b25a30bfdaefb69c29dedb30eed06ae71e61
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Dec 21 17:40:08 2012 +0000

    Define GHC.Read.expectP and Text.Read.Lex.expect
    
    They are now used by TcGenDeriv

>---------------------------------------------------------------

 GHC/Read.lhs     |   29 ++++++++++++++++-------------
 Text/Read/Lex.hs |    7 ++++++-
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/GHC/Read.lhs b/GHC/Read.lhs
index c5024fc..c542274 100644
--- a/GHC/Read.lhs
+++ b/GHC/Read.lhs
@@ -32,7 +32,7 @@ module GHC.Read
   , lexDigits
 
   -- defining readers
-  , lexP
+  , lexP, expectP
   , paren
   , parens
   , list
@@ -270,12 +270,15 @@ lexP :: ReadPrec L.Lexeme
 -- ^ Parse a single lexeme
 lexP = lift L.lex
 
+expectP :: L.Lexeme -> ReadPrec ()
+expectP lexeme = lift (L.expect lexeme)
+
 paren :: ReadPrec a -> ReadPrec a
 -- ^ @(paren p)@ parses \"(P0)\"
 --      where @p@ parses \"P0\" in precedence context zero
-paren p = do L.Punc "(" <- lexP
-             x          <- reset p
-             L.Punc ")" <- lexP
+paren p = do expectP (L.Punc "(")
+             x <- reset p
+             expectP (L.Punc ")")
              return x
 
 parens :: ReadPrec a -> ReadPrec a
@@ -292,7 +295,7 @@ list :: ReadPrec a -> ReadPrec [a]
 -- using the usual square-bracket syntax.
 list readx =
   parens
-  ( do L.Punc "[" <- lexP
+  ( do expectP (L.Punc "[")
        (listRest False +++ listNext)
   )
  where
@@ -408,12 +411,12 @@ parenthesis-like objects such as (...) and [...] can be an argument to
 instance Read a => Read (Maybe a) where
   readPrec =
     parens
-    (do L.Ident "Nothing" <- lexP
+    (do expectP (L.Ident "Nothing")
         return Nothing
      +++
      prec appPrec (
-        do L.Ident "Just" <- lexP
-           x              <- step readPrec
+        do expectP (L.Ident "Just")
+           x <- step readPrec
            return (Just x))
     )
 
@@ -427,7 +430,7 @@ instance Read a => Read [a] where
 
 instance  (Ix a, Read a, Read b) => Read (Array a b)  where
     readPrec = parens $ prec appPrec $
-               do L.Ident "array" <- lexP
+               do expectP (L.Ident "array")
                   theBounds <- step readPrec
                   vals   <- step readPrec
                  return (array theBounds vals)
@@ -504,9 +507,9 @@ instance (Integral a, Read a) => Read (Ratio a) where
   readPrec =
     parens
     ( prec ratioPrec
-      ( do x            <- step readPrec
-           L.Symbol "%" <- lexP
-           y            <- step readPrec
+      ( do x <- step readPrec
+           expectP (L.Symbol "%")
+           y <- step readPrec
            return (x % y)
       )
     )
@@ -543,7 +546,7 @@ wrap_tup :: ReadPrec a -> ReadPrec a
 wrap_tup p = parens (paren p)
 
 read_comma :: ReadPrec ()
-read_comma = do { L.Punc "," <- lexP; return () }
+read_comma = expectP (L.Punc ",")
 
 read_tup2 :: (Read a, Read b) => ReadPrec (a,b)
 -- Reads "a , b"  no parens!
diff --git a/Text/Read/Lex.hs b/Text/Read/Lex.hs
index f5a07f1..8a64e21 100644
--- a/Text/Read/Lex.hs
+++ b/Text/Read/Lex.hs
@@ -22,7 +22,7 @@ module Text.Read.Lex
   , numberToInteger, numberToRational, numberToRangedRational
 
   -- lexer
-  , lex
+  , lex, expect
   , hsLex
   , lexChar
 
@@ -144,6 +144,11 @@ numberToRational (MkDecimal iPart mFPart mExp)
 lex :: ReadP Lexeme
 lex = skipSpaces >> lexToken
 
+expect :: Lexeme -> ReadP ()
+expect lexeme = do { skipSpaces 
+                   ; thing <- lexToken
+                   ; if thing == lexeme then return () else pfail }
+
 hsLex :: ReadP String
 -- ^ Haskell lexer: returns the lexed string, rather than the lexeme
 hsLex = do skipSpaces
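To illustrate the intended style outside the patch itself: once expectP is exported from GHC.Read, a hand-written instance can use it exactly as the new derived code does. Wrap is a made-up type for this sketch, and 10 is appPrec:

```haskell
import GHC.Read (expectP)
import Text.Read (Lexeme (..), Read (..), parens, prec, step)

data Wrap = Wrap Int deriving Show

-- A Read instance in the new derived style: expectP instead of the
-- failable bind   Ident "Wrap" <- lexP.
instance Read Wrap where
  readPrec = parens $ prec 10 $ do  -- 10 is appPrec
    expectP (Ident "Wrap")
    Wrap <$> step readPrec
```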

comment:6 Changed 5 years ago by simonpj

OK with these changes I now get this:

#constructors   6.12.3 Alloc (MBytes)   6.12.3 Time (s)   HEAD Alloc (MBytes)   HEAD Time (s)
40              1075                    1.7
80              1646                    4                 2184                  5
160             3217                    8                 4862                  10
320             6385                    16                12242                 23
640             12766                   34                35009                 60

So it still looks quite a bit less well-behaved than 6.12.3, for reasons I don't yet understand. But better than before.

comment:7 Changed 5 years ago by igloo

Milestone: 7.8.1

comment:8 Changed 4 years ago by thoughtpolice

Milestone: 7.8.3 → 7.10.1

Moving to 7.10.1

comment:9 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

Changed 2 years ago by bgamari

Attachment: ghc.prof added

Profile from ghc compiling 1000-constructor type with non-occurring cost centres removed

comment:10 Changed 2 years ago by bgamari

I've done some testing; notes can be found here: https://github.com/bgamari/ghc-T7450

I have also attached a profile of GHC churning through a Read instance for a 1000-constructor ADT above.

comment:11 Changed 2 years ago by simonpj

I could not find any notes at that link.

What does -dshow-passes show?

Simon

comment:12 Changed 2 years ago by bgamari

Ahh, yes, sorry; at this point the above repository is really just a couple of scripts for characterising the problem.

-dshow-passes points at the specialiser as much of the issue. Take the case of a 1000-constructor ADT: with GHC from master, compilation takes roughly 75 seconds on my laptop. Of this, roughly 20 seconds is spent in the specialiser and another 20 seconds in the assembler. The rest of the time is split between various simplifier iterations, with no single iteration taking more than a couple of seconds.

If I disable specialisation, the compilation time drops to 52 seconds, with the largest portion of that being a 7-second simplifier iteration.

comment:13 Changed 2 years ago by bgamari

Here is GHC 7.11 output annotated with relative timestamps for a 1000-constructor ADT.


comment:14 Changed 2 years ago by bgamari

I think I finally found it (after struggling far too long wondering why the profiles were so useless, and eventually finding that I had misspelled OPTIONS_GHC). Specialise.pair_fvs shows up very strongly in the profile, in particular from the Rec equation of bind_fvs. Still trying to reason through what is happening.


comment:15 Changed 2 years ago by simonpj

Ben, good work. I think you said that the culprit is consDictBind. Here are some ideas:

  • snocDictBinds is only ever called with a singleton argument. Nuke it in favour of snocDictBind.
  • The calls to snocDictBind in specBind have just called flattenDictBinds, which stupidly throws away all the free-variable information cached in the DictBind. Better to keep it! To achieve that:
    • snocDictBind should take a DictBind not a CoreBind
    • flattenDictBinds should return a DictBind not a CoreBind. This is the place we are discarding the free var info. Instead, keep it!

Does that make sense?
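The shape of the suggestion can be modelled outside GHC. Everything below is a toy stand-in (the real CoreBind/DictBind are Core types and pair_fvs computes genuine free variables); the point is only where the cached set is kept or lost:

```haskell
import qualified Data.Set as Set

type Var      = String
type Bind     = (Var, [Var])          -- binder, variables its RHS mentions
type DictBind = (Bind, Set.Set Var)   -- binding plus cached free-variable set

-- The (expensive) recomputation we want to avoid repeating.
pairFvs :: Bind -> DictBind
pairFvs b@(_, rhs) = (b, Set.fromList rhs)

-- Old scheme: flattening returns bare Binds, discarding the cache,
-- so any later cons must call pairFvs all over again.
flattenDiscard :: [DictBind] -> [Bind]
flattenDiscard = map fst

-- Suggested scheme: keep DictBinds throughout; consing is free.
consKeep :: DictBind -> [DictBind] -> [DictBind]
consKeep = (:)
```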

comment:16 Changed 2 years ago by bgamari

Indeed most of the time spent in the specialiser is spent in consDictBind.

With respect to snocDictBinds: I believe final_binds in specBind can actually be of length greater than one, so it seems worthwhile to keep it.

Otherwise it seems fairly clear. I just hacked up a quick attempt; we'll see if it validates.

comment:17 Changed 2 years ago by bgamari

Differential Rev(s): Phab:D1012
Owner: set to bgamari

Seems to work. Opened a Differential. We'll see how it looks.

comment:18 Changed 2 years ago by bgamari

Phab:D1012 appears to give a moderate improvement in compile time in this case. Take, for instance, a 1024-constructor testcase,

module Mod(F(..)) where
data F
  = F0 { fldF0 :: ()}
  | F1 { fldF1 :: ()}
  ...
  | F1023 {fldF1023 :: ()}
  deriving (Read)

With 7.11 prior to the change, ghc -O runs for about 140 seconds; after the change, about 100 seconds. Moreover, there is a substantial reduction in heap allocations. The previously non-linear performance degradation can be seen in the table below.

        allocated (MBytes)                                  time (seconds)
n       Pre-D1012             Post-D1012           Delta    Pre-D1012          Post-D1012       Delta
16      1484.58 ± 0.75        1470.53 ± 0.77       -1%      1.75 ± 0.03        1.69 ± 0.02      -3%
64      5466.96 ± 1.53        5248.22 ± 1.64       -4%      6.62 ± 0.25        5.96 ± 0.06      -10%
128     11282.75 ± 1.55       10406.51 ± 1.99      -8%      13.33 ± 0.82       12.00 ± 0.63     -10%
256     24680.84 ± 2.32       21137.36 ± 2.36      -14%     26.68 ± 0.53       23.69 ± 0.81     -11%
512     58628.48 ± 30.32      44412.63 ± 0.36      -24%     67.61 ± 2.69       47.99 ± 0.67     -29%
1024    156753.21 ± 374.40    99523.84 ± 121.99    -37%     144.91 ± 4.07      102.19 ± 0.60    -29%
2048    478481.02 ± 287.23    248791.64 ± 233.84   -48%     379.93 ± 6.77      236.63 ± 3.51    -38%
4096    1685030.63 ± 2917.98  759242.66 ± 42.84    -55%     1359.73 ± 21.62    650.83 ± 5.65    -52%

comment:19 Changed 2 years ago by bgamari

Status: new → patch

comment:20 Changed 2 years ago by simonpj

OK so it's now looking roughly linear, which is great.

It still seems like a lot for what started life as a pretty small program. What does the profile show now?

Simon

comment:21 Changed 2 years ago by Ben Gamari <ben@…>

In 4681f55970cabc6e33591d7e698621580818f9a2/ghc:

Specialise: Avoid unnecessary recomputation of free variable information

When examining compile times for code with large ADTs (particularly those with
many record constructors), I found that the specialiser contributed
disproportionately to the compiler runtime. Some profiling suggested that
the a great deal of time was being spent in `pair_fvs` being called from
`consDictBind`.

@simonpj pointed out that `flattenDictBinds` as called by `specBind` was
unnecessarily discarding cached free variable information, which then needed to
be recomputed by `pair_fvs`.

Here I refactor the specializer to retain the free variable cache whenever
possible.

**Open Questions**

 * I used `fst` in a couple of places to extract the bindings from a `DictBind`.
   Perhaps this is a sign that `DictBind` has outgrown its type synonym status?

Test Plan: validate

Reviewers: austin, simonpj

Reviewed By: simonpj

Subscribers: thomie, bgamari, simonpj

Differential Revision: https://phabricator.haskell.org/D1012

GHC Trac Issues: #7450

comment:22 Changed 2 years ago by bgamari

I finally had a chance to look further into this. It seems that after the fix CmmPipeline.doSRTs may still be non-linear.

With 2048 constructors,

        total time  =      119.25 secs   (119253 ticks @ 1000 us, 1 processor)
        total alloc = 101,753,041,080 bytes  (excludes profiling overheads)

COST CENTRE          MODULE       %time %alloc

lintAnnots           CoreLint      12.5   20.4
doSRTs               CmmPipeline   11.6   16.9
FloatOutwards        SimplCore      6.3    2.0
simplLetUnfolding    Simplify       4.5    2.7
pprNativeCode        AsmCodeGen     4.3    5.0
completeBind         Simplify       4.1    2.0
StgCmm               HscMain        3.3    2.7
...

With 4096 constructors,

        total time  =      372.17 secs   (372168 ticks @ 1000 us, 1 processor)
        total alloc = 335,122,544,120 bytes  (excludes profiling overheads)

COST CENTRE          MODULE       %time %alloc      
                                                    
doSRTs               CmmPipeline   23.0   33.0      
lintAnnots           CoreLint      16.9   26.5      
FloatOutwards        SimplCore      7.8    1.6      
completeBind         Simplify       4.0    1.2      
simplLetUnfolding    Simplify       3.9    1.6      
pprNativeCode        AsmCodeGen     3.3    3.3      
...

comment:23 Changed 2 years ago by bgamari

In particular, most of this appears to come from CmmBuildInfoTables.procpointSRT.bitmap.

comment:24 Changed 2 years ago by bgamari

Indeed, the implementation of intsToBitmap had non-linear behaviour due to the accumulation of nested map thunks. I've opened Phab:D1041 to resolve this.
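The thunk build-up described here follows a general pattern, illustrated below in isolation. chunkSlow/chunkFast are invented names modelling the shape of the loop, not the real intsToBitmap:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Split ascending bit positions into word-sized chunks of in-word offsets.

-- Quadratic shape: each step rebuilds the tail with a fresh
-- map (subtract size), so after k chunks the tail is k nested map thunks.
chunkSlow :: Int -> [Int] -> [[Int]]
chunkSlow _    []    = []
chunkSlow size slots = chunk : chunkSlow size (map (subtract size) rest)
  where (chunk, rest) = span (< size) slots

-- Linear shape: keep absolute positions, strictly accumulate the offset,
-- and subtract once per element as each chunk is emitted.
chunkFast :: Int -> [Int] -> [[Int]]
chunkFast size = go 0
  where
    go _    []    = []
    go !off slots = map (subtract off) chunk : go (off + size) rest
      where (chunk, rest) = span (< off + size) slots
```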

With this fix the profile for the 4096 constructor testcase looks like this,

	total time  =      245.76 secs   (245764 ticks @ 1000 us, 1 processor)
	total alloc = 255,033,667,528 bytes  (excludes profiling overheads)

COST CENTRE              MODULE              %time %alloc

lintAnnots               CoreLint             19.8   34.1
FloatOutwards            SimplCore             9.7    2.1
flatten.lookup           CmmBuildInfoTables    4.8    5.2
simplLetUnfolding        Simplify              4.1    2.2
completeBind             Simplify              3.9    1.6

comment:25 Changed 2 years ago by bgamari

ticket:9669#comment:13 contains a summary of the performance impact of Phab:D1012 and Phab:D1041 on this ticket and the potentially-related #7258.


comment:26 Changed 2 years ago by Ben Gamari <ben@…>

In b29633f5cf310824f3e34716e9261162ced779d3/ghc:

Bitmap: Fix thunk explosion

Previously we would build up another `map (-N)` thunk
for every word in the bitmap. Now we strictly accumulate the position
and carry out a single ``map (`subtract` accum)``.

`Bitmap.intsToBitmap` showed up in the profile while compiling a 
testcase of #7450 (namely a program containing a record type with large 
number of fields which derived `Read`). The culprit was 
`CmmBuildInfoTables.procpointSRT.bitmap`. On the testcase (with 4096 
fields), the profile previously looked like,

```
	total time  =      307.94 secs   (307943 ticks @ 1000 us, 1 processor)
	total alloc = 336,797,868,056 bytes  (excludes profiling overheads)

COST CENTRE              MODULE              %time %alloc

lintAnnots               CoreLint             17.2   25.8
procpointSRT.bitmap      CmmBuildInfoTables   11.3   25.2
FloatOutwards            SimplCore             7.5    1.6
flatten.lookup           CmmBuildInfoTables    4.0    3.9
...
```

After this fix it looks like,
```
	total time  =      256.88 secs   (256876 ticks @ 1000 us, 1 processor)
	total alloc = 255,033,667,448 bytes  (excludes profiling overheads)

COST CENTRE              MODULE              %time %alloc

lintAnnots               CoreLint             20.3   34.1
FloatOutwards            SimplCore             9.1    2.1
flatten.lookup           CmmBuildInfoTables    4.8    5.2
pprNativeCode            AsmCodeGen            3.7    4.3
simplLetUnfolding        Simplify              3.6    2.2
StgCmm                   HscMain               3.6    2.1
```

Signed-off-by: Ben Gamari <ben@smart-cactus.org>

Test Plan: Validate

Reviewers: austin, simonpj

Reviewed By: simonpj

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1041

GHC Trac Issues: #7450

comment:27 Changed 2 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:28 Changed 2 years ago by bgamari

Judging by the summary in ticket:9669#comment:13, there may still be some non-linear behaviour here. I need to look into this further.

comment:29 Changed 2 years ago by bgamari

Owner: bgamari deleted
Status: patch → new

comment:30 Changed 2 years ago by bgamari

Owner: set to bgamari

comment:31 Changed 22 months ago by thomie

Milestone: 8.0.1

comment:32 Changed 22 months ago by HairyDude

Cc: HairyDude added

comment:33 Changed 15 months ago by RyanGlScott

Cc: RyanGlScott added

comment:34 Changed 10 months ago by Ben Gamari <ben@…>

In b1726c11/ghc:

Bitmap: Use foldl' instead of foldr

These are producing StgWords so foldl' is the natural choice. I'm not
sure how I didn't notice this when I wrote D1041.

Test Plan: Validate

Reviewers: austin, simonmar

Reviewed By: simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2955

GHC Trac Issues: #7450
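The idiom in this commit is general: when a fold's accumulator is a strict, bounded value such as a machine word, foldl' forces it at each step instead of building a thunk chain. A minimal sketch (bitsToWord is an invented name, not the GHC function):

```haskell
import Data.Bits (setBit)
import Data.List (foldl')
import Data.Word (Word64)

-- Build a word from a list of bit positions, forcing the accumulator
-- at every step so no chain of setBit thunks accumulates.
bitsToWord :: [Int] -> Word64
bitsToWord = foldl' setBit 0
```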