Opened 6 years ago

Last modified 12 months ago

#7258 new bug

Compiling DynFlags is jolly slow

Reported by: simonpj Owned by: simonpj
Priority: normal Milestone:
Component: Compiler Version: 7.6.1
Keywords: deriving-perf Cc: iustin@…, heisenbug, pho@…, bgamari, asr, gidyn, HairyDude, RyanGlScott, trommler
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Compiling DynFlags really takes a long time these days.

Ian thinks that it's due to the Read and Show instances he has added (see the attached W2.hs).

Simon M suggests: instead of using Read/Show, you could generate some code in mkDerivedConstants to use ReadP and Outputable, which should be much smaller and faster.
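
For readers without the attachment, W2.hs is essentially a record type with a large number of fields and derived Read and Show instances. A cut-down, hand-written sketch of that shape (illustrative only, not the attached file verbatim):

module W2 where

-- Illustrative only: the attached W2.hs has far more fields, which is what
-- makes the derived Read/Show code so large.
data DT = DT
  { field0 :: Int
  , field1 :: Int
  , field2 :: Int
  , field3 :: Int
  , field4 :: Int
  } deriving (Read, Show)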

This ticket is

  • To see if we can speed up compilation of DynFlags
  • To check WHY it is so slow. Are there any lessons we can learn or ways to make it compile faster? Is it tickling some asymptotically-bad corner of the compiler?

Simon

Attachments (24)

W2.hs (6.4 KB) - added by simonpj 6 years ago.
Exhibits long compilation times
overview_heap_profile.png (85.6 KB) - added by igloo 6 years ago.
overview_optimised_heap_profile.png (70.6 KB) - added by igloo 6 years ago.
profiles.csv (1 bytes) - added by tdammers 13 months ago.
Sample GHC profile data for W2.hs style record type, comparing just the record type, derived Read, derived Show, and a hand-written getLine-based applicative record construction function
ghc-large-record-types.png (70.8 KB) - added by tdammers 13 months ago.
Performance profiling data for large record types.
examples.zip (2.7 KB) - added by tdammers 13 months ago.
Code examples for the large records performance plot
ghc-large-record-types-optimized.png (54.3 KB) - added by tdammers 13 months ago.
Performance graph for various examples, compiling with -O
ghc-large-record-types-optimized-O2.png (55.6 KB) - added by tdammers 13 months ago.
Performance graph for various examples, compiling with -O2
ghc-large-record-types-cmm-size.png (65.3 KB) - added by tdammers 13 months ago.
Size of C-- for large record examples
ghc-large-record-types-cmm-per-core.png (38.1 KB) - added by tdammers 13 months ago.
C-- / Core size ratio for large record modules
ghc-large-record-types-core-size.png (80.4 KB) - added by tdammers 13 months ago.
Core size for large record modules
examples100.zip (5.9 KB) - added by tdammers 13 months ago.
100-field record examples
ghc-stage2-7258-deeper-SCC.prof (129.1 KB) - added by tdammers 13 months ago.
Profiling output with more SCCs added
ghc-large-record-types-optimized-2.png (46.1 KB) - added by tdammers 13 months ago.
Performance graph for various examples, compiling with -O, using UniqSet (as in comment #67)
numTerms.png (32.4 KB) - added by tdammers 13 months ago.
Generated Core, number of terms
simplSize.png (30.9 KB) - added by tdammers 13 months ago.
Generated Core, total size (in LOC)
stgSize.png (31.8 KB) - added by tdammers 13 months ago.
Generated STG, total size (in LOC)
cmmSize.png (23.1 KB) - added by tdammers 13 months ago.
Generated C--, total size (in LOC)
totalTime.png (24.2 KB) - added by tdammers 13 months ago.
Total execution time (CPU, seconds)
t-read-scc-time-abs.png (30.6 KB) - added by tdammers 13 months ago.
Largest profiling cost centres (in absolute terms) for the read example
t-getline-appl-scc-time-abs.png (26.2 KB) - added by tdammers 13 months ago.
Largest profiling cost centres (in absolute terms) for the getline-appl example
t-show-scc-time-abs.png (27.8 KB) - added by tdammers 13 months ago.
Largest profiling cost centres (in absolute terms) for the show example
stgVarsSizeRatio.png (24.8 KB) - added by tdammers 13 months ago.
Free variables / other STG ratio
t-10-dump.zip (52.8 KB) - added by tdammers 13 months ago.
Various GHC dumps for 10-field getline-appl and read-appl examples. Read misbehaves, getLine doesn't.


Change History (117)

Changed 6 years ago by simonpj

Attachment: W2.hs added

Exhibits long compilation times

comment:1 Changed 6 years ago by simonpj

Owner: set to igloo

comment:2 Changed 6 years ago by igloo

Milestone: 7.8.1

Changed 6 years ago by igloo

Attachment: overview_heap_profile.png added

comment:3 Changed 6 years ago by igloo

    Fri Oct 19 20:23 2012 Time and Allocation Profiling Report  (Final)

       ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -c W2.hs

    total time  =       26.08 secs   (26080 ticks @ 1000 us, 1 processor)
    total alloc = 14,601,152,424 bytes  (excludes profiling overheads)

COST CENTRE       MODULE         %time %alloc

OccAnal           SimplCore       27.0   24.5
RegAlloc          AsmCodeGen      19.2   22.4
pprNativeCode     AsmCodeGen      16.8   19.9
regLiveness       AsmCodeGen       8.0    6.3
genMachCode       AsmCodeGen       5.0    5.4
StgCmm            HscMain          4.3    3.5
occAnalBind.assoc OccurAnal        3.3    3.3
tops              CmmPipeline      3.3    3.4
SimplTopBinds     SimplCore        1.7    1.6
do_block          Hoopl.Dataflow   1.1    0.8

Changed 6 years ago by igloo

comment:4 Changed 6 years ago by igloo

    Fri Oct 19 20:32 2012 Time and Allocation Profiling Report  (Final)

       ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -O -c W2.hs

    total time  =       76.26 secs   (76258 ticks @ 1000 us, 1 processor)
    total alloc = 45,073,993,600 bytes  (excludes profiling overheads)

COST CENTRE       MODULE      %time %alloc

OccAnal           SimplCore    49.7   45.3
CoreTidy          HscMain      22.3   31.8
SimplTopBinds     SimplCore    12.9    9.9
pprNativeCode     AsmCodeGen    2.0    1.8
RegAlloc          AsmCodeGen    1.9    2.0
sink2             CmmPipeline   1.5    1.8
StgCmm            HscMain       1.0    0.8
occAnalBind.assoc OccurAnal     1.0    1.0

comment:5 Changed 6 years ago by igloo

                                                     individual     inherited   
COST CENTRE               MODULE      no.  entries  %time %alloc   %time %alloc

getProxies.fwd_pe         OccurAnal  1776     9226    6.7    0.0    25.5   23.7 
 getProxies.fwd_pe.add1   OccurAnal  1779  9733280    5.8    1.6    18.8   23.7 
  getProxies.fwd_pe.add2  OccurAnal  1781  9733280   13.1   22.1    13.1   22.1 

(without -O)

comment:6 Changed 6 years ago by igloo

Owner: changed from igloo to simonpj

Simon, I think the conclusion was that you'd look into this. Please bounce it back to me if you'd like me to investigate further.

comment:7 Changed 6 years ago by igloo

See also #7450

comment:8 Changed 6 years ago by iustin

Cc: iustin@… added

comment:9 Changed 6 years ago by PHO

Cc: pho@… added

comment:10 Changed 6 years ago by simonpj@…

commit 52e43004f63276c1342933e40a673ad25cf2113a

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Dec 21 17:39:33 2012 +0000

    Use expectP in deriving( Read )
    
    Note [Use expectP]   in TcGenDeriv
    ~~~~~~~~~~~~~~~~~~
    Note that we use
       expectP (Ident "T1")
    rather than
       Ident "T1" <- lexP
    The latter desugars to inline code for matching the Ident and the
    string, and this can be very voluminous. The former is much more
    compact.  Cf Trac #7258, although that also concerned non-linearity in
    the occurrence analyser, a separate issue.

 compiler/prelude/PrelNames.lhs    |    3 +-
 compiler/typecheck/TcGenDeriv.lhs |   43 ++++++++++++++++++++++--------------
 2 files changed, 28 insertions(+), 18 deletions(-)
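
As a hand-written illustration of the two styles that Note [Use expectP] contrasts (this is not the code GHC generates, and it assumes expectP is imported from GHC.Read):

import Text.Read (ReadPrec, Lexeme (Ident), lexP)
import GHC.Read (expectP)

data T1 = T1

-- Old style: a failable pattern match on the lexeme. This desugars to a case
-- on the token plus a string comparison, repeated for every constructor/field.
readT1_old :: ReadPrec T1
readT1_old = do
  Ident "T1" <- lexP
  return T1

-- New style: a single call to expectP, which is much more compact.
readT1_new :: ReadPrec T1
readT1_new = do
  expectP (Ident "T1")
  return T1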

comment:11 Changed 6 years ago by simonpj@…

commit 9ea2b6666cb6684279a120c688e8557bcef3dc73

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Mon Dec 24 11:34:51 2012 +0000

    Simplify the binder-swap transformation
    
    The occurrence analyser implements the "binder-swap" transformation,
    described in Note [Binder swap] in OccAnal. For some reason I had
    implemented an extremely complicated version, I believe intended to get
    as much as possible done in a single simplifier pass.  But it turned
    out (Trac #7258) that the 'getProxies' bit of this complicated code
    scaled rather non-linearly, and all by itself could consume half of
    the entire compile time.
    
    The patch dramatically simplifies the transformation, so that
    we simply swizzle
         case x of y { I# v -> e }
    to
         case x of y { I# v -> let x = y in e }
    
    I can't see any reason not to do this
    
      * Compiler allocation for #7258 with 200 fields goes down by 25%
        and compile time by 20%
    
      * The nofib figures do not budge
    
      * Quite a bit of complicated code goes away

 compiler/simplCore/OccurAnal.lhs |  255 +++++++-------------------------------
 1 files changed, 46 insertions(+), 209 deletions(-)

comment:12 Changed 6 years ago by simonpj

OK, I have verified that the changes to the occurrence analyser (above) make essentially zero difference to the nofib numbers. It's a very worthwhile simplification, because it completely gets rid of the getProxies stuff that was eating all the time before.

Alas, compiling W2 is still non-linear. Here's the allocation by the stage-2 compiler

  • 50 fields: 1.01 Gbyte
  • 100 fields: 2.98 Gbyte
  • 200 fields: 9.64 Gbyte
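
A rough fit of those three points: each doubling of the field count multiplies allocation by about 3 (2.98/1.01 ≈ 3.0, 9.64/2.98 ≈ 3.2), which corresponds to roughly n^1.6 growth, clearly super-linear though not quite quadratic.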

This is after including the improvements to the derived Read code in #7450.

So something is still wrong. Need to do some profiling to find out.

There are some very deeply nested lambdas, which lead to SRTs of ever-increasing size, so there is definitely a quadratic effect there. But I'm not sure that is all.

comment:13 Changed 6 years ago by simonmar

I can't compile DynFlags on my 256MB Raspberry Pi. Could we do something about having less auto-generated code in this module please?

comment:14 Changed 6 years ago by dterei

Cc: dterei added

comment:15 Changed 5 years ago by thoughtpolice

Milestone: 7.8.3 → 7.10.1

Moving to 7.10.1

comment:16 Changed 4 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:17 Changed 4 years ago by thomie

Type of failure: None/Unknown → Compile-time performance bug

comment:18 Changed 4 years ago by bgamari

Cc: bgamari added

comment:19 Changed 4 years ago by asr

Cc: asr added

comment:20 Changed 3 years ago by gidyn

Cc: gidyn added

comment:21 Changed 3 years ago by bgamari

I strongly suspect that some of the fixes I've made while working on #7450 will help here, although I have yet to test this.

comment:22 Changed 3 years ago by bgamari

ticket:9669#comment:13 contains a summary of the performance impact of Phab:D1012 and Phab:D1040 on this and potentially-related ticket #7450.

comment:23 Changed 3 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:24 Changed 3 years ago by ezyang

Keywords: deriving-perf added

comment:25 Changed 3 years ago by thomie

Milestone: 8.0.1

comment:26 Changed 3 years ago by HairyDude

Cc: HairyDude added

comment:27 Changed 2 years ago by RyanGlScott

Cc: RyanGlScott added

comment:28 Changed 13 months ago by trommler

Cc: trommler added

comment:29 Changed 13 months ago by simonpj

Tobias found that the bottlenecks on W2 were

  • Pretty-printing of assembly code
  • Register allocation

He has some improvements in progress.

comment:30 Changed 13 months ago by dfeuer

comment:31 Changed 13 months ago by simonpj

Apparently this is an example of non-linearity in the pretty-printer library.

See Bartosz's summary here, and #14005 (which is about switching over to a new pretty-printing library).

Last edited 13 months ago by simonpj

comment:32 Changed 13 months ago by heisenbug

Cc: heisenbug added

comment:33 Changed 13 months ago by tdammers

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations so that it only includes the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

I believe there are further optimization opportunities here, such as:

  • changing the current list-based code to also use set/map operations
  • moving the register class check into the candidates part
  • precalculating the list of candidates (not sure about this one though)
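
As a rough sketch of the shape of that change, using Data.Set and made-up names rather than GHC's actual UniqSet/UniqFM code:

import qualified Data.Set as Set

type Reg = Int  -- hypothetical register representation, for illustration only

-- Old shape: for every allocated register, scan the whole keep-list,
-- an O(n*m), Cartesian-product style traversal.
spillCandidatesList :: [Reg] -> [Reg] -> [Reg]
spillCandidatesList allocated keep =
  [ r | r <- allocated, r `notElem` keep ]

-- New shape: build a set once and take the difference, O((n+m) log m).
spillCandidatesSet :: [Reg] -> [Reg] -> [Reg]
spillCandidatesSet allocated keep =
  Set.toList (Set.fromList allocated `Set.difference` Set.fromList keep)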

comment:34 Changed 13 months ago by tdammers

http://git.haskell.org/ghc.git/commitdiff/375d4bd0fc2afd72617bc827bf63b5eeb24f2f7c

This one makes crucial parts of the pretty-printing code stricter; the modified functions score better in profiling, but the overall effect on compilation times seems to be minimal.
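
For context, "stricter" here means the usual strict-accumulator style of change. A generic illustration (not the actual patch):

{-# LANGUAGE BangPatterns #-}

-- Forcing the accumulator at every step keeps it evaluated instead of
-- building a long chain of thunks that is only forced at the end.
totalWidth :: [String] -> Int
totalWidth = go 0
  where
    go !acc []       = acc
    go !acc (s : ss) = go (acc + length s) ss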

comment:35 in reply to:  33 ; Changed 13 months ago by heisenbug

Replying to tdammers:

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations so that it only includes the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

Hmmm, what do you think about keeping the Cartesian product for sizes below a cut-off? Manipulating the sets surely also has a (large) constant factor built in.

I believe there are further optimization opportunities here, such as:

  • changing the current list-based code to also use set/map operations
  • moving the register class check into the candidates part
  • precalculating the list of candidates (not sure about this one though)

comment:36 in reply to:  35 Changed 13 months ago by tdammers

Replying to heisenbug:

Replying to tdammers:

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations so that it only includes the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

Hmmm, what do you think about keeping the Cartesian product for sizes below a cut-off? Manipulating the sets surely also has a (large) constant factor built in.

Could be worth it, but I'm a bit skeptical - the additional overhead of unpacking the sets into lists could easily cancel out the performance benefit, and it would make the code more complicated. I'll give it a try though.

comment:37 Changed 13 months ago by bgamari

While chatting with tdammers about this I had a peek at the Core; it seems that 5000 of the 7000 terms in the simplified Core of W2 are in creadPrec_rdaO. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.

comment:38 Changed 13 months ago by dterei

Cc: dterei removed

comment:39 Changed 13 months ago by bgamari

Hmmm, what do you think about keeping the Cartesian product for sizes below a cut-off? Manipulating the sets surely also has a (large) constant factor built in.

I actually suspected this in the past, but dfeuer did some measurements in the context of ListSetUtils and found that sets aren't measurably different. Don't let that stop you from quickly trying though.

comment:40 in reply to:  37 Changed 13 months ago by tdammers

Replying to bgamari:

While chatting with tdammers about this I had a peek at the Core; it seems that 5000 of the 7000 terms in the simplified Core of W2 are in creadPrec_rdaO. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.

I've just done a bit more profiling, and it turns out that removing the derived Read instance from the W3 example (W2 extended to 500 fields) cuts overall compilation time from 25.5 seconds to 3.6 seconds. So it seems that the edge case isn't just "constructors with many fields", but rather "derived Read instances for constructors with many fields".

comment:41 Changed 13 months ago by tdammers

On further thought, #14364 is a bit of a stop-gap measure; generating more efficient Read instances is of course a good thing, but the register allocator edge case remains, and hand-written code (or TH-generated code) that follows a similar pattern will still hit it.

comment:42 Changed 13 months ago by tdammers

A few things I have tried that do not seem to make a difference:

  • Pre-filtering the assig list to retain only the "interesting" entries (InReg and InBoth) and passing the filtered list around separately. In theory this would get rid of unnecessary list-filtering passes; in practice the overhead of maintaining a second list in parallel cancels out anything we may have won.
  • Collecting the InReg and InBoth candidate lists in a single pass. The idea behind this was that we would visit each list entry just once, putting it in one bin or the other, rather than traversing the list of candidates twice. In practice, however, there is no measurable performance benefit.
  • Merging the filtering predicates (checking whether a candidate is InReg or InBoth, and checking whether the register class matches). The idea here was that doing both checks in one pass might help, but it doesn't; performance is exactly the same.

Changed 13 months ago by tdammers

Attachment: profiles.csv added

Sample GHC profile data for W2.hs style record type, comparing just the record type, derived Read, derived Show, and a hand-written getLine-based applicative record construction function

Changed 13 months ago by tdammers

Attachment: ghc-large-record-types.png added

Performance profiling data for large record types.

comment:43 Changed 13 months ago by tdammers

Latest profiling results:

The test module defines a record type called DT, with N fields named field0 through field{N-1}; the performance metric is the "total time" value from the profiler output for GHC compiling the module on its own. The codenames refer to different versions of the test module (all of them defining the same DT type):

  • nothing: just a raw record type, no instances or functions. This is the baseline.
  • show: a derived Show instance (deriving (Show))
  • read: a derived Read instance (deriving (Read))
  • read-appl: a hand-written Read instance using applicative syntax (DT <$> a1 <*> a2 <*> a3 <*> ...)
  • getline: a hand-written function consisting of a series of N monadic binds followed by a return of an N-argument call to the DT constructor. This is the same shape as what the derived Read and Show instances produce internally.
  • getline-appl: a hand-written function that reads fields from stdin using getLine and applicatively combines them into a value of our record type: DT <$> (read <$> getLine) <*> (read <$> getLine) <*> ...
  • getline-standalone: a set of N functions performing one getLine call each: getlD :: Int -> IO Int; getlD i = read <$> getLine
  • show-standalone: a set of N functions, each constructing a string by combining an integer argument with the function's index

The -standalone and nothing flavors were added as controls.

Looking at the plot, the pattern is pretty obvious: the applicative versions show performance patterns very similar to the control, while the three versions that use many monadic binds all exhibit clearly nonlinear performance.

Changed 13 months ago by tdammers

Attachment: examples.zip added

Code examples for the large records performance plot

comment:44 Changed 13 months ago by tdammers

So apparently, good shape (linear-ish performance):

D <$> a1 <*> a2 <*> a2 <*> ... <*> aN

Bad shape (exponential-ish performance):

do
  v1 <- a1
  v2 <- a2
  v3 <- a3
  ...
  vN <- aN
  return $ D v1 v2 v3 ... vN
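
A small self-contained version of the two shapes, with three fields instead of N and a hypothetical readField helper standing in for the per-field parser (matching the attached examples only in spirit):

data D = D Int Int Int
  deriving Show

readField :: IO Int
readField = fmap read getLine

-- Good shape: a single applicative chain; compile time stays roughly linear
-- in the number of fields.
goodShape :: IO D
goodShape = D <$> readField <*> readField <*> readField

-- Bad shape: a cascade of monadic binds; each lambda captures all earlier
-- results, and compile time blows up as the field count grows.
badShape :: IO D
badShape = do
  v1 <- readField
  v2 <- readField
  v3 <- readField
  return (D v1 v2 v3)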

comment:45 Changed 13 months ago by tdammers

Core for monadic-bind style Read instance:

[1 of 1] Compiling D                ( examples/t-10-read.hs, examples/t-10-read.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 373, types: 324, coercions: 0}

-- RHS size: {terms: 269, types: 160, coercions: 0}
$creadPrec_r1Hl
$creadPrec_r1Hl =
  parens
    (prec
       (I# 11#)
       (>>
          $fMonadReadPrec
          (expectP (Ident (unpackCString# "DT"#)))
          (>>
             $fMonadReadPrec
             (expectP (Punc (unpackCString# "{"#)))
             (>>
                $fMonadReadPrec
                (expectP (Ident (unpackCString# "field0"#)))
                (>>
                   $fMonadReadPrec
                   (expectP (Punc (unpackCString# "="#)))
                   (>>=
                      $fMonadReadPrec
                      (reset (readPrec $fReadInt))
                      (\ a1_a1tq ->
                         >>
                           $fMonadReadPrec
                           (expectP (Punc (unpackCString# ","#)))
                           (>>
                              $fMonadReadPrec
                              (expectP (Ident (unpackCString# "field2"#)))
                              (>>
                                 $fMonadReadPrec
                                 (expectP (Punc (unpackCString# "="#)))
                                 (>>=
                                    $fMonadReadPrec
                                    (reset (readPrec $fReadInt))
                                    (\ a2_a1tr ->
                                       >>
                                         $fMonadReadPrec
                                         (expectP (Punc (unpackCString# ","#)))
                                         (>>
                                            $fMonadReadPrec
                                            (expectP (Ident (unpackCString# "field3"#)))
                                            (>>
                                               $fMonadReadPrec
                                               (expectP (Punc (unpackCString# "="#)))
                                               (>>=
                                                  $fMonadReadPrec
                                                  (reset (readPrec $fReadInt))
                                                  (\ a3_a1ts ->
                                                     >>
                                                       $fMonadReadPrec
                                                       (expectP (Punc (unpackCString# ","#)))
                                                       (>>
                                                          $fMonadReadPrec
                                                          (expectP
                                                             (Ident (unpackCString# "field4"#)))
                                                          (>>
                                                             $fMonadReadPrec
                                                             (expectP (Punc (unpackCString# "="#)))
                                                             (>>=
                                                                $fMonadReadPrec
                                                                (reset (readPrec $fReadInt))
                                                                (\ a4_a1tt ->
                                                                   >>
                                                                     $fMonadReadPrec
                                                                     (expectP
                                                                        (Punc
                                                                           (unpackCString# ","#)))
                                                                     (>>
                                                                        $fMonadReadPrec
                                                                        (expectP
                                                                           (Ident
                                                                              (unpackCString#
                                                                                 "field5"#)))
                                                                        (>>
                                                                           $fMonadReadPrec
                                                                           (expectP
                                                                              (Punc
                                                                                 (unpackCString#
                                                                                    "="#)))
                                                                           (>>=
                                                                              $fMonadReadPrec
                                                                              (reset
                                                                                 (readPrec
                                                                                    $fReadInt))
                                                                              (\ a5_a1tu ->
                                                                                 >>
                                                                                   $fMonadReadPrec
                                                                                   (expectP
                                                                                      (Punc
                                                                                         (unpackCString#
                                                                                            ","#)))
                                                                                   (>>
                                                                                      $fMonadReadPrec
                                                                                      (expectP
                                                                                         (Ident
                                                                                            (unpackCString#
                                                                                               "field6"#)))
                                                                                      (>>
                                                                                         $fMonadReadPrec
                                                                                         (expectP
                                                                                            (Punc
                                                                                               (unpackCString#
                                                                                                  "="#)))
                                                                                         (>>=
                                                                                            $fMonadReadPrec
                                                                                            (reset
                                                                                               (readPrec
                                                                                                  $fReadInt))
                                                                                            (\ a6_a1tv ->
                                                                                               >>
                                                                                                 $fMonadReadPrec
                                                                                                 (expectP
                                                                                                    (Punc
                                                                                                       (unpackCString#
                                                                                                          ","#)))
                                                                                                 (>>
                                                                                                    $fMonadReadPrec
                                                                                                    (expectP
                                                                                                       (Ident
                                                                                                          (unpackCString#
                                                                                                             "field7"#)))
                                                                                                    (>>
                                                                                                       $fMonadReadPrec
                                                                                                       (expectP
                                                                                                          (Punc
                                                                                                             (unpackCString#
                                                                                                                "="#)))
                                                                                                       (>>=
                                                                                                          $fMonadReadPrec
                                                                                                          (reset
                                                                                                             (readPrec
                                                                                                                $fReadInt))
                                                                                                          (\ a7_a1tw ->
                                                                                                             >>
                                                                                                               $fMonadReadPrec
                                                                                                               (expectP
                                                                                                                  (Punc
                                                                                                                     (unpackCString#
                                                                                                                        ","#)))
                                                                                                               (>>
                                                                                                                  $fMonadReadPrec
                                                                                                                  (expectP
                                                                                                                     (Ident
                                                                                                                        (unpackCString#
                                                                                                                           "field8"#)))
                                                                                                                  (>>
                                                                                                                     $fMonadReadPrec
                                                                                                                     (expectP
                                                                                                                        (Punc
                                                                                                                           (unpackCString#
                                                                                                                              "="#)))
                                                                                                                     (>>=
                                                                                                                        $fMonadReadPrec
                                                                                                                        (reset
                                                                                                                           (readPrec
                                                                                                                              $fReadInt))
                                                                                                                        (\ a8_a1tx ->
                                                                                                                           >>
                                                                                                                             $fMonadReadPrec
                                                                                                                             (expectP
                                                                                                                                (Punc
                                                                                                                                   (unpackCString#
                                                                                                                                      ","#)))
                                                                                                                             (>>
                                                                                                                                $fMonadReadPrec
                                                                                                                                (expectP
                                                                                                                                   (Ident
                                                                                                                                      (unpackCString#
                                                                                                                                         "field9"#)))
                                                                                                                                (>>
                                                                                                                                   $fMonadReadPrec
                                                                                                                                   (expectP
                                                                                                                                      (Punc
                                                                                                                                         (unpackCString#
                                                                                                                                            "="#)))
                                                                                                                                   (>>=
                                                                                                                                      $fMonadReadPrec
                                                                                                                                      (reset
                                                                                                                                         (readPrec
                                                                                                                                            $fReadInt))
                                                                                                                                      (\ a9_a1ty ->
                                                                                                                                         >>
                                                                                                                                           $fMonadReadPrec
                                                                                                                                           (expectP
                                                                                                                                              (Punc
                                                                                                                                                 (unpackCString#
                                                                                                                                                    ","#)))
                                                                                                                                           (>>
                                                                                                                                              $fMonadReadPrec
                                                                                                                                              (expectP
                                                                                                                                                 (Ident
                                                                                                                                                    (unpackCString#
                                                                                                                                                       "field10"#)))
                                                                                                                                              (>>
                                                                                                                                                 $fMonadReadPrec
                                                                                                                                                 (expectP
                                                                                                                                                    (Punc
                                                                                                                                                       (unpackCString#
                                                                                                                                                          "="#)))
                                                                                                                                                 (>>=
                                                                                                                                                    $fMonadReadPrec
                                                                                                                                                    (reset
                                                                                                                                                       (readPrec
                                                                                                                                                          $fReadInt))
                                                                                                                                                    (\ a10_a1tz ->
                                                                                                                                                       >>
                                                                                                                                                         $fMonadReadPrec
                                                                                                                                                         (expectP
                                                                                                                                                            (Punc
                                                                                                                                                               (unpackCString#
                                                                                                                                                                  "}"#)))
                                                                                                                                                         (return
                                                                                                                                                            $fMonadReadPrec
                                                                                                                                                            (DT
                                                                                                                                                               a1_a1tq
                                                                                                                                                               a2_a1tr
                                                                                                                                                               a3_a1ts
                                                                                                                                                               a4_a1tt
                                                                                                                                                               a5_a1tu
                                                                                                                                                               a6_a1tv
                                                                                                                                                               a7_a1tw
                                                                                                                                                               a8_a1tx
                                                                                                                                                               a9_a1ty
                                                                                                                                                               a10_a1tz)))))))))))))))))))))))))))))))))))))))))))))

Rec {
-- RHS size: {terms: 5, types: 1, coercions: 0}
$fReadDT
$fReadDT =
  C:Read
    $creadsPrec_r1I5
    $creadList_r1I6
    $creadPrec_r1Hl
    $creadListPrec_r1I7

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadsPrec_r1I5
$creadsPrec_r1I5 = $dmreadsPrec $fReadDT

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadList_r1I6
$creadList_r1I6 = readListDefault $fReadDT

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadListPrec_r1I7
$creadListPrec_r1I7 = readListPrecDefault $fReadDT
end Rec }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 =
  \ ds_d1H8 ->
    case ds_d1H8
    of _
    { DT ds1_d1H9 ds2_d1Ha ds3_d1Hb ds4_d1Hc ds5_d1Hd ds6_d1He ds7_d1Hf
         ds8_d1Hg ds9_d1Hh ds10_d1Hi ->
    ds9_d1Hh
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 =
  \ ds_d1GX ->
    case ds_d1GX
    of _
    { DT ds1_d1GY ds2_d1GZ ds3_d1H0 ds4_d1H1 ds5_d1H2 ds6_d1H3 ds7_d1H4
         ds8_d1H5 ds9_d1H6 ds10_d1H7 ->
    ds8_d1H5
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 =
  \ ds_d1GM ->
    case ds_d1GM
    of _
    { DT ds1_d1GN ds2_d1GO ds3_d1GP ds4_d1GQ ds5_d1GR ds6_d1GS ds7_d1GT
         ds8_d1GU ds9_d1GV ds10_d1GW ->
    ds7_d1GT
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 =
  \ ds_d1GB ->
    case ds_d1GB
    of _
    { DT ds1_d1GC ds2_d1GD ds3_d1GE ds4_d1GF ds5_d1GG ds6_d1GH ds7_d1GI
         ds8_d1GJ ds9_d1GK ds10_d1GL ->
    ds6_d1GH
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 =
  \ ds_d1Gq ->
    case ds_d1Gq
    of _
    { DT ds1_d1Gr ds2_d1Gs ds3_d1Gt ds4_d1Gu ds5_d1Gv ds6_d1Gw ds7_d1Gx
         ds8_d1Gy ds9_d1Gz ds10_d1GA ->
    ds5_d1Gv
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 =
  \ ds_d1Gf ->
    case ds_d1Gf
    of _
    { DT ds1_d1Gg ds2_d1Gh ds3_d1Gi ds4_d1Gj ds5_d1Gk ds6_d1Gl ds7_d1Gm
         ds8_d1Gn ds9_d1Go ds10_d1Gp ->
    ds4_d1Gj
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 =
  \ ds_d1G4 ->
    case ds_d1G4
    of _
    { DT ds1_d1G5 ds2_d1G6 ds3_d1G7 ds4_d1G8 ds5_d1G9 ds6_d1Ga ds7_d1Gb
         ds8_d1Gc ds9_d1Gd ds10_d1Ge ->
    ds3_d1G7
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 =
  \ ds_d1FT ->
    case ds_d1FT
    of _
    { DT ds1_d1FU ds2_d1FV ds3_d1FW ds4_d1FX ds5_d1FY ds6_d1FZ ds7_d1G0
         ds8_d1G1 ds9_d1G2 ds10_d1G3 ->
    ds2_d1FV
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 =
  \ ds_d1FI ->
    case ds_d1FI
    of _
    { DT ds1_d1FJ ds2_d1FK ds3_d1FL ds4_d1FM ds5_d1FN ds6_d1FO ds7_d1FP
         ds8_d1FQ ds9_d1FR ds10_d1FS ->
    ds10_d1FS
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 =
  \ ds_d1Fx ->
    case ds_d1Fx
    of _
    { DT ds1_d1Fy ds2_d1Fz ds3_d1FA ds4_d1FB ds5_d1FC ds6_d1FD ds7_d1FE
         ds8_d1FF ds9_d1FG ds10_d1FH ->
    ds1_d1Fy
    }

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1I8
$trModule1_r1I8 = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1I9
$trModule2_r1I9 = TrNameS "D"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1I8 $trModule2_r1I9

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1Ia
$tc'DT1_r1Ia = TrNameS "'DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT =
  TyCon
    9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Ia

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1Ib
$tcDT1_r1Ib = TrNameS "DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT =
  TyCon
    14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Ib

And for applicative-style Read implementation:

[1 of 1] Compiling D                ( examples/t-10-read-appl.hs, examples/t-10-read-appl.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 272, types: 324, coercions: 0}

-- RHS size: {terms: 168, types: 160, coercions: 0}
$creadsPrec_r1Gi
$creadsPrec_r1Gi =
  \ p_aNu ->
    readP_to_S
      (<*
         $fApplicativeReadP
         (*>
            $fApplicativeReadP
            (string (unpackCString# "DT{"#))
            (<*>
               $fApplicativeReadP
               (<*>
                  $fApplicativeReadP
                  (<*>
                     $fApplicativeReadP
                     (<*>
                        $fApplicativeReadP
                        (<*>
                           $fApplicativeReadP
                           (<*>
                              $fApplicativeReadP
                              (<*>
                                 $fApplicativeReadP
                                 (<*>
                                    $fApplicativeReadP
                                    (<*>
                                       $fApplicativeReadP
                                       (<$>
                                          $fFunctorReadP
                                          DT
                                          (*>
                                             $fApplicativeReadP
                                             (string (unpackCString# "field0="#))
                                             (readS_to_P (readsPrec $fReadInt p_aNu))))
                                       (*>
                                          $fApplicativeReadP
                                          (string (unpackCString# ","#))
                                          (*>
                                             $fApplicativeReadP
                                             (string (unpackCString# "field1="#))
                                             (readS_to_P (readsPrec $fReadInt p_aNu)))))
                                    (*>
                                       $fApplicativeReadP
                                       (string (unpackCString# ","#))
                                       (*>
                                          $fApplicativeReadP
                                          (string (unpackCString# "field2="#))
                                          (readS_to_P (readsPrec $fReadInt p_aNu)))))
                                 (*>
                                    $fApplicativeReadP
                                    (string (unpackCString# ","#))
                                    (*>
                                       $fApplicativeReadP
                                       (string (unpackCString# "field3="#))
                                       (readS_to_P (readsPrec $fReadInt p_aNu)))))
                              (*>
                                 $fApplicativeReadP
                                 (string (unpackCString# ","#))
                                 (*>
                                    $fApplicativeReadP
                                    (string (unpackCString# "field4="#))
                                    (readS_to_P (readsPrec $fReadInt p_aNu)))))
                           (*>
                              $fApplicativeReadP
                              (string (unpackCString# ","#))
                              (*>
                                 $fApplicativeReadP
                                 (string (unpackCString# "field5="#))
                                 (readS_to_P (readsPrec $fReadInt p_aNu)))))
                        (*>
                           $fApplicativeReadP
                           (string (unpackCString# ","#))
                           (*>
                              $fApplicativeReadP
                              (string (unpackCString# "field6="#))
                              (readS_to_P (readsPrec $fReadInt p_aNu)))))
                     (*>
                        $fApplicativeReadP
                        (string (unpackCString# ","#))
                        (*>
                           $fApplicativeReadP
                           (string (unpackCString# "field7="#))
                           (readS_to_P (readsPrec $fReadInt p_aNu)))))
                  (*>
                     $fApplicativeReadP
                     (string (unpackCString# ","#))
                     (*>
                        $fApplicativeReadP
                        (string (unpackCString# "field8="#))
                        (readS_to_P (readsPrec $fReadInt p_aNu)))))
               (*>
                  $fApplicativeReadP
                  (string (unpackCString# ","#))
                  (*>
                     $fApplicativeReadP
                     (string (unpackCString# "field9="#))
                     (readS_to_P (readsPrec $fReadInt p_aNu))))))
         (string (unpackCString# "}"#)))

Rec {
-- RHS size: {terms: 5, types: 1, coercions: 0}
$fReadDT
$fReadDT =
  C:Read
    $creadsPrec_r1Gi
    $creadList_r1GT
    $creadPrec_r1GU
    $creadListPrec_r1GV

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadList_r1GT
$creadList_r1GT = $dmreadList $fReadDT

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadPrec_r1GU
$creadPrec_r1GU = $dmreadPrec $fReadDT

-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadListPrec_r1GV
$creadListPrec_r1GV = $dmreadListPrec $fReadDT
end Rec }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 =
  \ ds_d1G5 ->
    case ds_d1G5
    of _
    { DT ds1_d1G6 ds2_d1G7 ds3_d1G8 ds4_d1G9 ds5_d1Ga ds6_d1Gb ds7_d1Gc
         ds8_d1Gd ds9_d1Ge ds10_d1Gf ->
    ds9_d1Ge
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 =
  \ ds_d1FU ->
    case ds_d1FU
    of _
    { DT ds1_d1FV ds2_d1FW ds3_d1FX ds4_d1FY ds5_d1FZ ds6_d1G0 ds7_d1G1
         ds8_d1G2 ds9_d1G3 ds10_d1G4 ->
    ds8_d1G2
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 =
  \ ds_d1FJ ->
    case ds_d1FJ
    of _
    { DT ds1_d1FK ds2_d1FL ds3_d1FM ds4_d1FN ds5_d1FO ds6_d1FP ds7_d1FQ
         ds8_d1FR ds9_d1FS ds10_d1FT ->
    ds7_d1FQ
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 =
  \ ds_d1Fy ->
    case ds_d1Fy
    of _
    { DT ds1_d1Fz ds2_d1FA ds3_d1FB ds4_d1FC ds5_d1FD ds6_d1FE ds7_d1FF
         ds8_d1FG ds9_d1FH ds10_d1FI ->
    ds6_d1FE
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 =
  \ ds_d1Fn ->
    case ds_d1Fn
    of _
    { DT ds1_d1Fo ds2_d1Fp ds3_d1Fq ds4_d1Fr ds5_d1Fs ds6_d1Ft ds7_d1Fu
         ds8_d1Fv ds9_d1Fw ds10_d1Fx ->
    ds5_d1Fs
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 =
  \ ds_d1Fc ->
    case ds_d1Fc
    of _
    { DT ds1_d1Fd ds2_d1Fe ds3_d1Ff ds4_d1Fg ds5_d1Fh ds6_d1Fi ds7_d1Fj
         ds8_d1Fk ds9_d1Fl ds10_d1Fm ->
    ds4_d1Fg
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 =
  \ ds_d1F1 ->
    case ds_d1F1
    of _
    { DT ds1_d1F2 ds2_d1F3 ds3_d1F4 ds4_d1F5 ds5_d1F6 ds6_d1F7 ds7_d1F8
         ds8_d1F9 ds9_d1Fa ds10_d1Fb ->
    ds3_d1F4
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 =
  \ ds_d1EQ ->
    case ds_d1EQ
    of _
    { DT ds1_d1ER ds2_d1ES ds3_d1ET ds4_d1EU ds5_d1EV ds6_d1EW ds7_d1EX
         ds8_d1EY ds9_d1EZ ds10_d1F0 ->
    ds2_d1ES
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 =
  \ ds_d1EF ->
    case ds_d1EF
    of _
    { DT ds1_d1EG ds2_d1EH ds3_d1EI ds4_d1EJ ds5_d1EK ds6_d1EL ds7_d1EM
         ds8_d1EN ds9_d1EO ds10_d1EP ->
    ds10_d1EP
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 =
  \ ds_d1Eu ->
    case ds_d1Eu
    of _
    { DT ds1_d1Ev ds2_d1Ew ds3_d1Ex ds4_d1Ey ds5_d1Ez ds6_d1EA ds7_d1EB
         ds8_d1EC ds9_d1ED ds10_d1EE ->
    ds1_d1Ev
    }

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1GW
$trModule1_r1GW = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1GX
$trModule2_r1GX = TrNameS "D"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1GW $trModule2_r1GX

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1GY
$tc'DT1_r1GY = TrNameS "'DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT =
  TyCon
    9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1GY

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1GZ
$tcDT1_r1GZ = TrNameS "DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT =
  TyCon
    14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1GZ

comment:46 Changed 13 months ago by tdammers

getLine examples:
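For reference, the Haskell source behind the two dumps below presumably has roughly the following shape. This is a sketch reconstructed from the Core, not the literal contents of examples.zip; the record field names and the name getlD' are illustrative.

module D where

data DT = DT
  { field1  :: Int
  , field2  :: Int
  , field3  :: Int
  , field4  :: Int
  , field5  :: Int
  , field6  :: Int
  , field7  :: Int
  , field8  :: Int
  , field9  :: Int
  , field10 :: Int
  }

-- t-10-getline.hs: monadic-bind style
getlD :: IO DT
getlD = do
  f1  <- read <$> getLine
  f2  <- read <$> getLine
  f3  <- read <$> getLine
  f4  <- read <$> getLine
  f5  <- read <$> getLine
  f6  <- read <$> getLine
  f7  <- read <$> getLine
  f8  <- read <$> getLine
  f9  <- read <$> getLine
  f10 <- read <$> getLine
  return (DT f1 f2 f3 f4 f5 f6 f7 f8 f9 f10)

-- t-10-getline-appl.hs: applicative style (named getlD' here only so both
-- variants fit in one sketch; the real module presumably also calls it getlD)
getlD' :: IO DT
getlD' =
  DT <$> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)
     <*> (read <$> getLine)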

monadic-bind style:

[1 of 1] Compiling D                ( examples/t-10-getline.hs, examples/t-10-getline.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 182, types: 231, coercions: 0}

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 =
  \ ds_d1JC ->
    case ds_d1JC
    of _
    { DT ds1_d1JD ds2_d1JE ds3_d1JF ds4_d1JG ds5_d1JH ds6_d1JI ds7_d1JJ
         ds8_d1JK ds9_d1JL ds10_d1JM ->
    ds10_d1JM
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 =
  \ ds_d1Jr ->
    case ds_d1Jr
    of _
    { DT ds1_d1Js ds2_d1Jt ds3_d1Ju ds4_d1Jv ds5_d1Jw ds6_d1Jx ds7_d1Jy
         ds8_d1Jz ds9_d1JA ds10_d1JB ->
    ds9_d1JA
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 =
  \ ds_d1Jg ->
    case ds_d1Jg
    of _
    { DT ds1_d1Jh ds2_d1Ji ds3_d1Jj ds4_d1Jk ds5_d1Jl ds6_d1Jm ds7_d1Jn
         ds8_d1Jo ds9_d1Jp ds10_d1Jq ->
    ds8_d1Jo
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 =
  \ ds_d1J5 ->
    case ds_d1J5
    of _
    { DT ds1_d1J6 ds2_d1J7 ds3_d1J8 ds4_d1J9 ds5_d1Ja ds6_d1Jb ds7_d1Jc
         ds8_d1Jd ds9_d1Je ds10_d1Jf ->
    ds7_d1Jc
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 =
  \ ds_d1IU ->
    case ds_d1IU
    of _
    { DT ds1_d1IV ds2_d1IW ds3_d1IX ds4_d1IY ds5_d1IZ ds6_d1J0 ds7_d1J1
         ds8_d1J2 ds9_d1J3 ds10_d1J4 ->
    ds6_d1J0
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 =
  \ ds_d1IJ ->
    case ds_d1IJ
    of _
    { DT ds1_d1IK ds2_d1IL ds3_d1IM ds4_d1IN ds5_d1IO ds6_d1IP ds7_d1IQ
         ds8_d1IR ds9_d1IS ds10_d1IT ->
    ds5_d1IO
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 =
  \ ds_d1Iy ->
    case ds_d1Iy
    of _
    { DT ds1_d1Iz ds2_d1IA ds3_d1IB ds4_d1IC ds5_d1ID ds6_d1IE ds7_d1IF
         ds8_d1IG ds9_d1IH ds10_d1II ->
    ds4_d1IC
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 =
  \ ds_d1In ->
    case ds_d1In
    of _
    { DT ds1_d1Io ds2_d1Ip ds3_d1Iq ds4_d1Ir ds5_d1Is ds6_d1It ds7_d1Iu
         ds8_d1Iv ds9_d1Iw ds10_d1Ix ->
    ds3_d1Iq
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 =
  \ ds_d1Ic ->
    case ds_d1Ic
    of _
    { DT ds1_d1Id ds2_d1Ie ds3_d1If ds4_d1Ig ds5_d1Ih ds6_d1Ii ds7_d1Ij
         ds8_d1Ik ds9_d1Il ds10_d1Im ->
    ds2_d1Ie
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 =
  \ ds_d1I1 ->
    case ds_d1I1
    of _
    { DT ds1_d1I2 ds2_d1I3 ds3_d1I4 ds4_d1I5 ds5_d1I6 ds6_d1I7 ds7_d1I8
         ds8_d1I9 ds9_d1Ia ds10_d1Ib ->
    ds1_d1I2
    }

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1EM
$trModule1_r1EM = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1Ke
$trModule2_r1Ke = TrNameS "D"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1EM $trModule2_r1Ke

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1Kf
$tc'DT1_r1Kf = TrNameS "'DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT =
  TyCon
    9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Kf

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1Kg
$tcDT1_r1Kg = TrNameS "DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT =
  TyCon
    14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Kg

-- RHS size: {terms: 93, types: 82, coercions: 0}
getlD
getlD =
  >>=
    $fMonadIO
    (<$> $fFunctorIO (read $fReadInt) getLine)
    (\ field1_aGE ->
       >>=
         $fMonadIO
         (<$> $fFunctorIO (read $fReadInt) getLine)
         (\ field12_aGF ->
            >>=
              $fMonadIO
              (<$> $fFunctorIO (read $fReadInt) getLine)
              (\ field13_aGG ->
                 >>=
                   $fMonadIO
                   (<$> $fFunctorIO (read $fReadInt) getLine)
                   (\ field14_aGH ->
                      >>=
                        $fMonadIO
                        (<$> $fFunctorIO (read $fReadInt) getLine)
                        (\ field15_aGI ->
                           >>=
                             $fMonadIO
                             (<$> $fFunctorIO (read $fReadInt) getLine)
                             (\ field16_aGJ ->
                                >>=
                                  $fMonadIO
                                  (<$> $fFunctorIO (read $fReadInt) getLine)
                                  (\ field17_aGK ->
                                     >>=
                                       $fMonadIO
                                       (<$> $fFunctorIO (read $fReadInt) getLine)
                                       (\ field18_aGL ->
                                          >>=
                                            $fMonadIO
                                            (<$> $fFunctorIO (read $fReadInt) getLine)
                                            (\ field19_aGM ->
                                               >>=
                                                 $fMonadIO
                                                 (<$> $fFunctorIO (read $fReadInt) getLine)
                                                 (\ field20_aGN ->
                                                    return
                                                      $fMonadIO
                                                      (DT
                                                         field1_aGE
                                                         field12_aGF
                                                         field13_aGG
                                                         field14_aGH
                                                         field15_aGI
                                                         field16_aGJ
                                                         field17_aGK
                                                         field18_aGL
                                                         field19_aGM
                                                         field20_aGN)))))))))))



vs. applicative:

[1 of 1] Compiling D                ( examples/t-10-getline-appl.hs, examples/t-10-getline-appl.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 160, types: 264, coercions: 0}

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 =
  \ ds_d1Je ->
    case ds_d1Je
    of _
    { DT ds1_d1Jf ds2_d1Jg ds3_d1Jh ds4_d1Ji ds5_d1Jj ds6_d1Jk ds7_d1Jl
         ds8_d1Jm ds9_d1Jn ds10_d1Jo ->
    ds10_d1Jo
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 =
  \ ds_d1J3 ->
    case ds_d1J3
    of _
    { DT ds1_d1J4 ds2_d1J5 ds3_d1J6 ds4_d1J7 ds5_d1J8 ds6_d1J9 ds7_d1Ja
         ds8_d1Jb ds9_d1Jc ds10_d1Jd ->
    ds9_d1Jc
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 =
  \ ds_d1IS ->
    case ds_d1IS
    of _
    { DT ds1_d1IT ds2_d1IU ds3_d1IV ds4_d1IW ds5_d1IX ds6_d1IY ds7_d1IZ
         ds8_d1J0 ds9_d1J1 ds10_d1J2 ->
    ds8_d1J0
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 =
  \ ds_d1IH ->
    case ds_d1IH
    of _
    { DT ds1_d1II ds2_d1IJ ds3_d1IK ds4_d1IL ds5_d1IM ds6_d1IN ds7_d1IO
         ds8_d1IP ds9_d1IQ ds10_d1IR ->
    ds7_d1IO
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 =
  \ ds_d1Iw ->
    case ds_d1Iw
    of _
    { DT ds1_d1Ix ds2_d1Iy ds3_d1Iz ds4_d1IA ds5_d1IB ds6_d1IC ds7_d1ID
         ds8_d1IE ds9_d1IF ds10_d1IG ->
    ds6_d1IC
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 =
  \ ds_d1Il ->
    case ds_d1Il
    of _
    { DT ds1_d1Im ds2_d1In ds3_d1Io ds4_d1Ip ds5_d1Iq ds6_d1Ir ds7_d1Is
         ds8_d1It ds9_d1Iu ds10_d1Iv ->
    ds5_d1Iq
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 =
  \ ds_d1Ia ->
    case ds_d1Ia
    of _
    { DT ds1_d1Ib ds2_d1Ic ds3_d1Id ds4_d1Ie ds5_d1If ds6_d1Ig ds7_d1Ih
         ds8_d1Ii ds9_d1Ij ds10_d1Ik ->
    ds4_d1Ie
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 =
  \ ds_d1HZ ->
    case ds_d1HZ
    of _
    { DT ds1_d1I0 ds2_d1I1 ds3_d1I2 ds4_d1I3 ds5_d1I4 ds6_d1I5 ds7_d1I6
         ds8_d1I7 ds9_d1I8 ds10_d1I9 ->
    ds3_d1I2
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 =
  \ ds_d1HO ->
    case ds_d1HO
    of _
    { DT ds1_d1HP ds2_d1HQ ds3_d1HR ds4_d1HS ds5_d1HT ds6_d1HU ds7_d1HV
         ds8_d1HW ds9_d1HX ds10_d1HY ->
    ds2_d1HQ
    }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 =
  \ ds_d1HD ->
    case ds_d1HD
    of _
    { DT ds1_d1HE ds2_d1HF ds3_d1HG ds4_d1HH ds5_d1HI ds6_d1HJ ds7_d1HK
         ds8_d1HL ds9_d1HM ds10_d1HN ->
    ds1_d1HE
    }

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1Ey
$trModule1_r1Ey = TrNameS "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1JQ
$trModule2_r1JQ = TrNameS "D"#

-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1Ey $trModule2_r1JQ

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1JR
$tc'DT1_r1JR = TrNameS "'DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT =
  TyCon
    9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1JR

-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1JS
$tcDT1_r1JS = TrNameS "DT"#

-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT =
  TyCon
    14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1JS

-- RHS size: {terms: 71, types: 115, coercions: 0}
getlD
getlD =
  <*>
    $fApplicativeIO
    (<*>
       $fApplicativeIO
       (<*>
          $fApplicativeIO
          (<*>
             $fApplicativeIO
             (<*>
                $fApplicativeIO
                (<*>
                   $fApplicativeIO
                   (<*>
                      $fApplicativeIO
                      (<*>
                         $fApplicativeIO
                         (<*>
                            $fApplicativeIO
                            (<$> $fFunctorIO DT (<$> $fFunctorIO (read $fReadInt) getLine))
                            (<$> $fFunctorIO (read $fReadInt) getLine))
                         (<$> $fFunctorIO (read $fReadInt) getLine))
                      (<$> $fFunctorIO (read $fReadInt) getLine))
                   (<$> $fFunctorIO (read $fReadInt) getLine))
                (<$> $fFunctorIO (read $fReadInt) getLine))
             (<$> $fFunctorIO (read $fReadInt) getLine))
          (<$> $fFunctorIO (read $fReadInt) getLine))
       (<$> $fFunctorIO (read $fReadInt) getLine))
    (<$> $fFunctorIO (read $fReadInt) getLine)



comment:47 Changed 13 months ago by simonpj

Ha. Reminds me of #13253.

comment:48 Changed 13 months ago by simonpj

Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true before. Ditto getline-appl seems worse than getline (which is invisible).

comment:49 Changed 13 months ago by simonpj

Here's a mystery.

  • I took t-10-getline.hs and t-10-getline-appl.hs from examples.zip just before comment:44
  • I compiled both with -O -dshow-passes -ddump-simpl.
  • Both produced identical Core. In the early stages getline-appl is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early)

I am at a loss for how to reconcile these results with the dramatically worse compile times for getline compared with getline-appl reported in the graph plot in comment:43.

Tobias, any ideas?

comment:50 in reply to:  48 Changed 13 months ago by tdammers

Replying to simonpj:

Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true before. Ditto getline-appl seems worse than getline (which is invisible).

Yes, it is correct; the difference between this new graph and the previous one is that the new one uses -O2, while the old one doesn't use any optimizations. The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples: RegAlloc-linear and ppr, which are the most serious offenders in the unoptimized run, account for less than 1% of execution time in the optimized case.

comment:51 in reply to:  49 Changed 13 months ago by tdammers

Replying to simonpj:

Here's a mystery.

  • I took t-10-getline.hs and t-10-getline-appl.hs from examples.zip just before comment:44
  • I compiled both with -O -dshow-passes -ddump-simpl.
  • Both produced identical Core. In the early stages getline-appl is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early)

I am at a loss for how to reconcile these results with the dramatically worse compile times for getline compared with getline-appl reported in the graph plot in comment:43.

Tobias, any ideas?

Hmm, maybe -O vs. -O2? I'll run a few trials myself.

Other thought, though probably not really an explanation: the t-10-XXXX examples are maybe too small to produce useful metrics; the performance differences only start to become dramatic past 100 fields or so.

comment:52 Changed 13 months ago by simonpj

the t-10-XXXX examples are maybe too small to produce useful metrics,

I wasn't measuring performance; I was just eye-balling the (identical) code.

The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples

OK, so there are two perhaps-different problems

  • With -O compiling DynFlags is jolly slow. (The original bug report.)
  • Without -O we get non-linear stuff going on (comment:43)

I think the -O case is the higher-priority item. (Though I accept that compiling without -O is supposed to be fast.)

I would stick to -O for now, not -O2. The latter involves SpecConstr and LiberateCase, both of which can generate a lot of code. Again, let's not ignore -O2, but let's nail -O first.

So what do the -O graphs look like?

comment:53 Changed 13 months ago by tdammers

I don't have -O graphs yet; unfortunately, it takes a good while to generate those (I aborted the -O2 run after it went all night and only made it to 420 fields). I'll start up a job and report back. One-shot experiments do suggest, though, that pretty-printing and register allocation aren't going to be bottlenecks here either.

comment:54 Changed 13 months ago by Ben Gamari <ben@…>

In 2a4c24e4/ghc:

Make layLeft and reduceDoc stricter (#7258)

Making the pretty-printer based assembly output stricter in
strategically chosen locations produces a minor performance improvement
when compiling large derived Read instance (on the order of 5-10%).

Reviewers: austin, bgamari

Reviewed By: bgamari

Subscribers: rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D4111

comment:55 Changed 13 months ago by Ben Gamari <ben@…>

In df636682/ghc:

Performance improvements linear regAlloc (#7258)

When allocating and potentially spilling registers, we need to check
the desired allocations against current allocations to decide where we
can spill to, cq. which allocations we can toss and if so, how.
Previously, this was done by walking the Cartesian product of the
current allocations (`assig`) and the allocations to keep (`keep`),
which has quadratic complexity. This patch introduces two improvements:

1. pre-filter the `assig` list, because we are only interested in two
types of allocations (in register, and in register+memory), which will
only make up a small and constant portion of the list; and
2. use set / map operations instead of lists, which reduces algorithmic
complexity.

Reviewers: austin, bgamari

Reviewed By: bgamari

Subscribers: rwbarton, thomie

Differential Revision: https://phabricator.haskell.org/D4109

Changed 13 months ago by tdammers

Performance graph for various examples, compiling with -O

Changed 13 months ago by tdammers

Performance graph for various examples, compiling with -O2

comment:56 Changed 13 months ago by tdammers

Updated performance graph: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-optimized.png (this one uses -O, not -O2)

Key observation: read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.

A sample profile, like the one for 400-field read, shows that the bottleneck is no longer ASM output or register allocation once -O is involved:

	Thu Oct 26 01:25 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs

	total time  =      128.62 secs   (128624 ticks @ 1000 us, 1 processor)
	total alloc = 37,659,351,320 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        77.8   23.6
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74             9.6   30.0
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             3.0   10.4
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    1.6    5.0
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.1    4.2
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    1.0    4.7
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             0.9    4.2
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   0.7    2.9
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          0.6    1.9
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    0.5    1.4
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.4    1.3
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 0.4    1.6
CommonSubExpr   SimplCore   compiler/simplCore/SimplCore.hs:462:40-56             0.2    1.1
deSugar         HscMain     compiler/main/HscMain.hs:511:7-44                     0.2    1.1

comment:57 Changed 13 months ago by simonpj

read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.

Very good!

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        77.8   23.6
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74             9.6   30.0
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             3.0   10.4

What does -dshow-passes show about the program size? I.e. are we getting very large Core?

So sink in C-- seems very slow. Nearly 4 bytes in 5 of all allocation is in this pass alone. Is this a non-linear algorithmic effect? E.g. if you double the size of the input, does the sink time go up non-linearly while SimplTopBinds goes up linearly? That's my guess.

(Related question: does sink show up as a big item when compiling "normal" programs?)

Guessing: the "shape" of the C-- code makes sink behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from?

This is good... fixing these performance bumps will likely help ALL programs, even ones that don't hit the really bad spots.

comment:58 in reply to:  57 Changed 13 months ago by tdammers

Replying to simonpj:

read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.

Very good!

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        77.8   23.6
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74             9.6   30.0
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             3.0   10.4

What does -dshow-passes show about the program size? I.e. are we getting very large Core?

For a 100-line read example:

Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1
Using binary package database: /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d/package.cache
package flags []
loading package database /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d
wired-in package ghc-prim mapped to ghc-prim-0.5.2.0
wired-in package integer-gmp mapped to integer-gmp-1.0.1.0
wired-in package base mapped to base-4.11.0.0
wired-in package rts mapped to rts
wired-in package template-haskell mapped to template-haskell-2.13.0.0
wired-in package ghc mapped to ghc-8.3
wired-in package dph-seq not found.
wired-in package dph-par not found.
*** Checking old interface for D (use -ddump-hi-diffs for more details):
*** Parser [D]:
!!! Parser [D]: finished in 3.22 milliseconds, allocated 3.170 megabytes
*** Renamer/typechecker [D]:
!!! Renamer/typechecker [D]: finished in 124.33 milliseconds, allocated 72.473 megabytes
*** Desugar [D]:
Result size of Desugar (after optimization)
  = {terms: 2,667, types: 11,534, coercions: 0, joins: 0/0}
!!! Desugar [D]: finished in 174.52 milliseconds, allocated 59.456 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 7,375, types: 18,990, coercions: 6,881, joins: 0/913}
Result size of Simplifier iteration=2
  = {terms: 3,722, types: 13,604, coercions: 1,351, joins: 0/198}
Result size of Simplifier iteration=3
  = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
Result size of Simplifier
  = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
!!! Simplifier [D]: finished in 1306.06 milliseconds, allocated 749.648 megabytes
*** Specialise [D]:
Result size of Specialise
  = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
!!! Specialise [D]: finished in 19.96 milliseconds, allocated 17.006 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [D]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 4,923, types: 15,504, coercions: 1,342, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [D]: finished in 122.82 milliseconds, allocated 103.551 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0}
Result size of Simplifier
  = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 216.40 milliseconds, allocated 122.592 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0}
Result size of Simplifier
  = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 183.50 milliseconds, allocated 131.494 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
Result size of Simplifier
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 183.89 milliseconds, allocated 126.382 megabytes
*** Float inwards [D]:
Result size of Float inwards
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Float inwards [D]: finished in 8.67 milliseconds, allocated 4.238 megabytes
*** Called arity analysis [D]:
Result size of Called arity analysis
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Called arity analysis [D]: finished in 5.66 milliseconds, allocated 5.974 megabytes
*** Simplifier [D]:
Result size of Simplifier
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 67.09 milliseconds, allocated 62.910 megabytes
*** Demand analysis [D]:
Result size of Demand analysis
  = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Demand analysis [D]: finished in 170.27 milliseconds, allocated 91.193 megabytes
*** Worker Wrapper binds [D]:
Result size of Worker Wrapper binds
  = {terms: 4,541, types: 14,225, coercions: 1,342, joins: 0/3}
!!! Worker Wrapper binds [D]: finished in 5.09 milliseconds, allocated 3.614 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
Result size of Simplifier
  = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 351.93 milliseconds, allocated 219.389 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [D]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [D]: finished in 268.13 milliseconds, allocated 280.555 megabytes
*** Common sub-expression [D]:
Result size of Common sub-expression
  = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0}
!!! Common sub-expression [D]: finished in 38.50 milliseconds, allocated 48.593 megabytes
*** Float inwards [D]:
Result size of Float inwards
  = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0}
!!! Float inwards [D]: finished in 5.37 milliseconds, allocated 3.369 megabytes
*** Simplifier [D]:
Result size of Simplifier
  = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 140.01 milliseconds, allocated 98.855 megabytes
*** Demand analysis [D]:
Result size of Demand analysis
  = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! Demand analysis [D]: finished in 77.63 milliseconds, allocated 86.076 megabytes
*** CoreTidy [D]:
Result size of Tidy Core
  = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! CoreTidy [D]: finished in 54.47 milliseconds, allocated 69.051 megabytes
Created temporary directory: /tmp/ghc27776_0
*** CorePrep [D]:
Result size of CorePrep
  = {terms: 3,569, types: 13,155, coercions: 1,342, joins: 0/204}
!!! CorePrep [D]: finished in 25.73 milliseconds, allocated 32.097 megabytes
*** Stg2Stg:
*** CodeGen [D]:
!!! CodeGen [D]: finished in 2109.26 milliseconds, allocated 1630.642 megabytes
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc27776_0/ghc_2.c
*** Deleting temp dirs:

At 200 lines, we're looking at:

Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1
Using binary package database: /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d/package.cache
package flags []
loading package database /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d
wired-in package ghc-prim mapped to ghc-prim-0.5.2.0
wired-in package integer-gmp mapped to integer-gmp-1.0.1.0
wired-in package base mapped to base-4.11.0.0
wired-in package rts mapped to rts
wired-in package template-haskell mapped to template-haskell-2.13.0.0
wired-in package ghc mapped to ghc-8.3
wired-in package dph-seq not found.
wired-in package dph-par not found.
*** Checking old interface for D (use -ddump-hi-diffs for more details):
*** Parser [D]:
!!! Parser [D]: finished in 3.80 milliseconds, allocated 6.207 megabytes
*** Renamer/typechecker [D]:
!!! Renamer/typechecker [D]: finished in 230.12 milliseconds, allocated 138.058 megabytes
*** Desugar [D]:
Result size of Desugar (after optimization)
  = {terms: 5,267, types: 43,034, coercions: 0, joins: 0/0}
!!! Desugar [D]: finished in 289.66 milliseconds, allocated 201.767 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 14,575, types: 57,790, coercions: 13,681, joins: 0/1,813}
Result size of Simplifier iteration=2
  = {terms: 7,322, types: 47,104, coercions: 2,651, joins: 0/398}
Result size of Simplifier iteration=3
  = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
Result size of Simplifier
  = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
!!! Simplifier [D]: finished in 5652.80 milliseconds, allocated 2700.163 megabytes
*** Specialise [D]:
Result size of Specialise
  = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
!!! Specialise [D]: finished in 62.79 milliseconds, allocated 58.261 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [D]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 9,723, types: 50,904, coercions: 2,642, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [D]: finished in 401.80 milliseconds, allocated 374.441 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0}
Result size of Simplifier
  = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 754.47 milliseconds, allocated 393.159 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0}
Result size of Simplifier
  = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 579.29 milliseconds, allocated 411.892 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
Result size of Simplifier
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 630.25 milliseconds, allocated 382.851 megabytes
*** Float inwards [D]:
Result size of Float inwards
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Float inwards [D]: finished in 8.17 milliseconds, allocated 8.300 megabytes
*** Called arity analysis [D]:
Result size of Called arity analysis
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Called arity analysis [D]: finished in 21.61 milliseconds, allocated 11.860 megabytes
*** Simplifier [D]:
Result size of Simplifier
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 362.68 milliseconds, allocated 190.967 megabytes
*** Demand analysis [D]:
Result size of Demand analysis
  = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Demand analysis [D]: finished in 551.36 milliseconds, allocated 339.529 megabytes
*** Worker Wrapper binds [D]:
Result size of Worker Wrapper binds
  = {terms: 8,941, types: 48,325, coercions: 2,642, joins: 0/3}
!!! Worker Wrapper binds [D]: finished in 7.70 milliseconds, allocated 11.542 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1
  = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
Result size of Simplifier
  = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 1307.95 milliseconds, allocated 727.784 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [D]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [D]: finished in 1393.21 milliseconds, allocated 1078.391 megabytes
*** Common sub-expression [D]:
Result size of Common sub-expression
  = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0}
!!! Common sub-expression [D]: finished in 227.12 milliseconds, allocated 194.184 megabytes
*** Float inwards [D]:
Result size of Float inwards
  = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0}
!!! Float inwards [D]: finished in 11.63 milliseconds, allocated 6.545 megabytes
*** Simplifier [D]:
Result size of Simplifier
  = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 590.60 milliseconds, allocated 349.746 megabytes
*** Demand analysis [D]:
Result size of Demand analysis
  = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! Demand analysis [D]: finished in 485.19 milliseconds, allocated 330.909 megabytes
*** CoreTidy [D]:
Result size of Tidy Core
  = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! CoreTidy [D]: finished in 293.13 milliseconds, allocated 258.898 megabytes
Created temporary directory: /tmp/ghc18725_0
*** CorePrep [D]:
Result size of CorePrep
  = {terms: 6,969, types: 46,155, coercions: 2,642, joins: 0/404}
!!! CorePrep [D]: finished in 163.49 milliseconds, allocated 113.842 megabytes
*** Stg2Stg:
*** CodeGen [D]:
!!! CodeGen [D]: finished in 23041.19 milliseconds, allocated 12665.144 megabytes
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc18725_0/ghc_2.c
*** Deleting temp dirs:

Not sure whether this qualifies as "unusually large". -ddump-cmm shows that the C-- for a 100-field read instance is more than twice as large as that for the 100-field getline example (39234 lines vs 14895), but size alone doesn't explain the huge difference.

However, at 300 fields, the read example produces 237541 lines of C--, while getline grows roughly linearly, to 47312 lines, meaning that there is definitely some sort of nonlinearity going on wrt C-- code size.

So sink in C-- seems very slow. Nearly 4 bytes in 5 of all allocation is in this pass alone. Is this a non-linear algorithmic effect? E.g. if you double the size of the input, does the sink time go up non-linearly while SimplTopBinds goes up linearly? That's my guess.

Yes, in fact it does, quite clearly so:

10-field example:

	Wed Oct 25 14:57 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-10-read.hs

	total time  =        0.19 secs   (188 ticks @ 1000 us, 1 processor)
	total alloc = 134,109,112 bytes  (excludes profiling overheads)

COST CENTRE        MODULE      SRC                                                 %time %alloc

SimplTopBinds      SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            35.6   28.7
tc_rn_src_decls    TcRnDriver  compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7)      9.0    9.0
OccAnal            SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     4.8    5.5
FloatOutwards      SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             4.3    5.1
pprNativeCode      AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    3.7    3.5
tcRnImports        TcRnDriver  compiler/typecheck/TcRnDriver.hs:240:20-50            3.7    3.6
StgCmm             HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          3.7    2.1
sink               CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)         3.2    1.5
setSessionDynFlags GHC         compiler/main/GHC.hs:(578,1)-(584,16)                 3.2    4.7
NativeCodeGen      CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 2.1    0.9
NewStranal         SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             2.1    2.9
bin_tycldecls      HscTypes    compiler/main/HscTypes.hs:1085:52-57                  1.6    1.0
initGhcMonad       GHC         compiler/main/GHC.hs:(493,1)-(503,25)                 1.6    4.7
MAIN               MAIN        <built-in>                                            1.1    0.3
fixStgRegisters    AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:566:17-42            1.1    0.2
genMachCode        AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.1    1.3
shortcutBranches   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:694:17-54            1.1    0.2
elimCommonBlocks   CmmPipeline compiler/cmm/CmmPipeline.hs:(71,13)-(72,76)           1.1    1.0
CAF                PrimOp      <entire-module>                                       1.1    0.5
CAF                PrelInfo    <entire-module>                                       1.1    1.0
deSugar            HscMain     compiler/main/HscMain.hs:511:7-44                     1.1    0.9
CorePrep           HscMain     compiler/main/HscMain.hs:(1313,24)-(1314,57)          1.1    0.6
withCleanupSession GHC         compiler/main/GHC.hs:(466,1)-(475,37)                 1.1    0.8
regLiveness        AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    0.5    2.0
RegAlloc-linear    AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    0.5    2.4
layoutStack        CmmPipeline compiler/cmm/CmmPipeline.hs:(97,13)-(99,40)           0.5    1.0
CoreTidy           HscMain     compiler/main/HscMain.hs:1253:27-67                   0.5    1.0

100 fields:

	Wed Oct 25 15:00 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-100-read.hs

	total time  =        3.10 secs   (3103 ticks @ 1000 us, 1 processor)
	total alloc = 2,397,379,664 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            38.9   36.6
sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        16.0    7.3
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             7.9   10.8
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     4.3    5.8
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    4.1    5.3
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    3.8    4.6
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.8    2.5
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             2.8    4.8
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.8    1.9
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   1.8    2.9
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.8    1.6
tc_rn_src_decls TcRnDriver  compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7)      1.3    1.3
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 1.0    1.7
CommonSubExpr   SimplCore   compiler/simplCore/SimplCore.hs:462:40-56             0.8    1.2
deSugar         HscMain     compiler/main/HscMain.hs:511:7-44                     0.8    1.3

400 fields:

	Thu Oct 26 01:25 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs

	total time  =      128.62 secs   (128624 ticks @ 1000 us, 1 processor)
	total alloc = 37,659,351,320 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        77.8   23.6
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74             9.6   30.0
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             3.0   10.4
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    1.6    5.0
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.1    4.2
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    1.0    4.7
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             0.9    4.2
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   0.7    2.9
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          0.6    1.9
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    0.5    1.4
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.4    1.3
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 0.4    1.6
CommonSubExpr   SimplCore   compiler/simplCore/SimplCore.hs:462:40-56             0.2    1.1
deSugar         HscMain     compiler/main/HscMain.hs:511:7-44                     0.2    1.1

(Related question: does sink show up as a big item when compiling "normal" programs?)

It does, but nowhere near this dominantly. E.g. getline, 400 fields:

	Thu Oct 26 02:54 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-getline.hs

	total time  =       14.13 secs   (14126 ticks @ 1000 us, 1 processor)
	total alloc = 12,479,030,504 bytes  (excludes profiling overheads)

COST CENTRE   MODULE      SRC                                                %time %alloc

sink          CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)       31.7    3.3
SimplTopBinds SimplCore   compiler/simplCore/SimplCore.hs:761:39-74           24.7   39.0
OccAnal       SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)    7.2    9.7
NewStranal    SimplCore   compiler/simplCore/SimplCore.hs:480:40-63            6.5   10.4
CoreTidy      HscMain     compiler/main/HscMain.hs:1253:27-67                  5.2    7.6
FloatOutwards SimplCore   compiler/simplCore/SimplCore.hs:471:40-66            4.1    6.2
StgCmm        HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)         2.0    1.6
deSugar       HscMain     compiler/main/HscMain.hs:511:7-44                    2.0    3.3
CommonSubExpr SimplCore   compiler/simplCore/SimplCore.hs:462:40-56            1.9    3.2
CorePrep      HscMain     compiler/main/HscMain.hs:(1313,24)-(1314,57)         1.4    2.5
Specialise    SimplCore   compiler/simplCore/SimplCore.hs:486:40-50            0.8    1.3
StgCse        SimplStg    compiler/simplStg/SimplStg.hs:(88,14)-(91,44)        0.5    1.3

Guessing: the "shape" of the C-- code makes sink behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from?

Not yet, but I have a hunch that it's simply a matter of producing too much of it. Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields? That seems like a plausible explanation so far.
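To run that kind of doubling experiment one needs the t-N-read.hs inputs at several sizes. A sketch of the sort of generator that could produce them (hypothetical: the actual script used to build examples.zip is not part of this ticket and may well differ):

-- gen-read.hs: emit a module with an N-field record deriving (Read, Show).
-- Usage (illustrative): runghc gen-read.hs 400 > t-400-read.hs
import Data.List (intercalate)
import System.Environment (getArgs)

main :: IO ()
main = do
  [nStr] <- getArgs
  let n      = read nStr :: Int
      fields = [ "field" ++ show i ++ " :: Int" | i <- [1 .. n] ]
  putStrLn "module D where"
  putStrLn ""
  putStrLn "data DT = DT"
  putStrLn ("  { " ++ intercalate "\n  , " fields)
  putStrLn "  } deriving (Read, Show)"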

comment:59 Changed 13 months ago by simonpj

Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields?

But then all the C-- passes would get slower, wouldn't they? Still, it'd be easy to measure the code size of

  • Core
  • C--

as the number of fields increases.

My guess is that there's an algorithmic blow-up (quadratic or worse) in sink, so that it starts to dominate when there is a very large CmmProc.
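(A small sketch of how the Core sizes could be extracted mechanically, assuming the "Result size … {terms: …}" lines have the shape shown in comment:58; -dshow-passes writes to stderr, hence the 2>&1 in the usage line. The C-- side can simply be a line count of the -ddump-cmm output, as measured above.)

-- term-sizes.hs: pull the "terms:" counts out of -dshow-passes output.
-- Usage (illustrative):
--   ghc -O -fforce-recomp -dshow-passes -c t-100-read.hs 2>&1 | runghc term-sizes.hs
import Data.Char (isDigit)
import Data.List (isPrefixOf, tails)

-- Extract the number after "terms:" from a line such as
--   "  = {terms: 2,667, types: 11,534, coercions: 0, joins: 0/0}"
termCount :: String -> Maybe Int
termCount line =
  case [ drop (length "terms:") t | t <- tails line, "terms:" `isPrefixOf` t ] of
    (rest:_) -> Just (read (filter isDigit (takeWhile numeric rest)))
    []       -> Nothing
  where
    numeric c = isDigit c || c == ',' || c == ' '

main :: IO ()
main = do
  s <- getContents
  mapM_ print [ n | Just n <- map termCount (lines s) ]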

comment:60 Changed 13 months ago by alexbiehl

Tobias, I had a quick glimpse over CmmSink and `skipped` looks suspicious:

For every assignment l = e which is not sunk l is consed to skipped.

Further, for each assignment l = e we check a) is l element of skipped and b) is there any register in skipped mentioned in e:

        cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments]
                        || l `elem` skipped
                        || not (okToInline dflags rhs node)

This sounds accidentally quadratic to me. Considering the amount of code generated, this might be a thing.
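To make the claim concrete, here is a stripped-down model of the same shape (illustrative only, not GHC code): each of the N assignments does an elem (plus a regsUsedIn walk) over a skipped list that can itself grow to O(N) entries, so the pass as a whole can do O(N^2) work on a CmmProc with N assignments.

-- A miniature of the accidental O(n^2) pattern: a left-to-right pass that
-- consults, and conses onto, a plain list at every step.
process :: [Int] -> [Int]
process = go []
  where
    go skipped []       = reverse skipped
    go skipped (l : ls)
      | l `elem` skipped = go skipped ls        -- O(length skipped) per element...
      | otherwise        = go (l : skipped) ls  -- ...and skipped grows linearly,
                                                -- so the total work is quadratic.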

Last edited 13 months ago by alexbiehl

comment:61 Changed 13 months ago by simonpj

Indeed. Actually, any use of elem on a long list is suspicious; c.f. ListSetOps.unionLists, which warns if it sees long lists.

Maybe we could have ListSetOps.checkedElem, which checks for over-long lists and warns, and change all uses of elem into checkedElem. There may be more of these land-mines!
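A sketch of what such a helper could look like (hypothetical: ListSetOps has no checkedElem today, the threshold of 100 is an arbitrary placeholder, and Debug.Trace stands in for whatever warning machinery GHC would actually use):

import Debug.Trace (trace)

-- Like elem, but complain if the list is suspiciously long.
checkedElem :: Eq a => a -> [a] -> Bool
checkedElem x xs
  | overLong xs = trace "checkedElem: over-long list" (x `elem` xs)
  | otherwise   = x `elem` xs
  where
    -- avoids forcing the whole spine just to decide whether to warn
    overLong = not . null . drop 100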

Changed 13 months ago by tdammers

Size of C-- for large record examples

comment:62 Changed 13 months ago by tdammers

More data! https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-cmm-size.png shows C-- code size growth as the number of fields grows for the same generated examples as in the other graphs. getline and getline-appl generate the exact same C--, so there's no point including both in the graph; they are the only ones that grow linearly among the "interesting" examples.

comment:63 Changed 13 months ago by simonpj

Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big!

What happens to the size of Core? (Neglect types and coercions which are erased.)

And finally does the Core-to-Cmm size ratio go up?

comment:64 in reply to:  61 Changed 13 months ago by tdammers

Replying to simonpj:

Indeed. Actually any use of elem on a long list is suspicious. c.f. ListSetOpts.unionLists which warns if it sees long lists.

Maybe we could have ListSetOps.checkedElem which checks for over-long lists and warns. And change all uses of elem into checkedElem. There may be more of these land-mines!

What about using a different kind of data structure, like a map or similar, that has logarithmic lookup performance? That would work even if we cannot reduce the size of the list; on the other hand, if the list is short-lived, constructing the map might introduce prohibitive overhead.

comment:65 in reply to:  63 Changed 13 months ago by tdammers

Replying to simonpj:

Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big!

What happens to the size of Core? (Neglect types and coercions which are erased.)

And finally does the Core-to-Cmm size ratio go up?

I'll check that next; unfortunately, this means I have to run all those examples again...

comment:66 Changed 13 months ago by simonpj

What about using a different kind of data structure like a map or similar

Of course, that's the first thing we should do, once we are convinced that's where the time is going! (We only have suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library.

comment:67 in reply to:  66 Changed 13 months ago by alexbiehl

Replying to simonpj:

What about using a different kind of data structure like a map or similar

Of course, that's the first thing we should do, once we are convinced that's where the time is going! (We only have suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library.

LocalReg is Uniquable so something like this should work:

type LocalRegSet = UniqSet LocalReg

emptyLocalRegSet :: LocalRegSet
emptyLocalRegSet = emptyUniqSet

nullLocalRegSet :: LocalRegSet -> Bool
nullLocalRegSet = isEmptyUniqSet

elemLocalRegSet :: LocalReg -> LocalRegSet -> Bool
elemLocalRegSet = elementOfUniqSet

insertLocalRegSet :: LocalRegSet -> LocalReg -> LocalRegSet
insertLocalRegSet = addOneToUniqSet

cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments]
                || l `elemLocalRegSet` skipped
                || not (okToInline dflags rhs node)

regsUsedIn :: LocalRegSet -> CmmExpr -> Bool
regsUsedIn ls _ | nullLocalRegSet ls = False
regsUsedIn ls e = wrapRecExpf f e False
  where f (CmmReg (CmmLocal l))      _ | l `elemLocalRegSet` ls = True
        f (CmmRegOff (CmmLocal l) _) _ | l `elemLocalRegSet` ls = True
        f _ z = z

comment:68 Changed 13 months ago by tdammers

Interesting: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-cmm-per-core.png

For the three offenders (read, show, read-appl), the C--/Core ratio goes *up* as the total code size grows, while for the well-behaved examples it goes down. The latter is expected (constant overhead + linear code size); the former hints at some sort of nonlinearity.

Changed 13 months ago by tdammers

C-- / Core size ratio for large record modules

comment:69 Changed 13 months ago by simonpj

That is indeed interesting.

  • The bad cases are read, read-appl and show (comment:56)
  • They all generate very big Cmm (see the png just before comment:62)
  • They all generate an unusually large Core-to-Cmm ratio (comment:68)
  • This very large Cmm makes sink go bananas.

Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)?

I'm guessing that the Core size does not scale badly. If that is so, our questions would be:

  • Why do we generate so much Cmm per unit of Core
  • Can we make sink scale better

Both are worth working on.

comment:70 Changed 13 months ago by alexbiehl

I tried the suggestion in comment:67 and got these results for compiling W2.hs:

Vanilla:

        Sun Oct 29 19:58 2017 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
        total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC                                                 %time %alloc

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        55.7   15.9
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            19.5   30.6
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             4.2    9.0
RegAlloc-linear AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    4.0   11.1
pprNativeCode   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.8    6.3
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63             1.6    3.7
OccAnal         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.5    3.5
StgCmm          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          1.2    2.4
regLiveness     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.2    1.9
genMachCode     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.9    1.8
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 0.9    2.1
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67                   0.8    1.9

skipped as UniqSet:

        Sun Oct 29 19:18 2017 Time and Allocation Profiling Report  (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
        total alloc = 15,772,184,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE         SRC                                                 %time %alloc

SimplTopBinds   SimplCore      compiler/simplCore/SimplCore.hs:761:39-74            38.3   29.0
sink            CmmPipeline    compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        13.2   20.3
RegAlloc-linear AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    8.3   10.5
FloatOutwards   SimplCore      compiler/simplCore/SimplCore.hs:471:40-66             8.1    8.5
pprNativeCode   AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    5.4    5.9
NewStranal      SimplCore      compiler/simplCore/SimplCore.hs:480:40-63             3.1    3.5
OccAnal         SimplCore      compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.9    3.3
StgCmm          HscMain        compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.3    2.3
regLiveness     AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    2.1    1.8
NativeCodeGen   CodeOutput     compiler/main/CodeOutput.hs:171:18-78                 1.7    2.0
genMachCode     AsmCodeGen     compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.6    1.7
CoreTidy        HscMain        compiler/main/HscMain.hs:1253:27-67                   1.4    1.8
foldNodesBwdOO  Hoopl.Dataflow compiler/cmm/Hoopl/Dataflow.hs:(397,1)-(403,17)       1.1    0.8

Good results, I think, though I need to check whether allocation goes up significantly in common cases.

Last edited 13 months ago by alexbiehl

Changed 13 months ago by tdammers

Core size for large record modules

comment:71 in reply to:  69 Changed 13 months ago by tdammers

Replying to simonpj:

Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)?

I'm guessing that the Core size does not scale badly. If that was so our questions would be

  • Why do we generate so much Cmm per unit of Core
  • Can we make sink scale better

Both are worth working on.

https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-core-size.png suggests that while Core size has a slight nonlinearity to it, the behavior is roughly the same for all examples, so I believe your guess that Core size doesn't scale as badly as C-- size does is justified.

Last edited 13 months ago by tdammers

Changed 13 months ago by tdammers

Attachment: examples100.zip added

100-field record examples

comment:72 in reply to:  70 Changed 13 months ago by tdammers

Replying to alexbiehl:

I tried the suggestion in comment:67 and got these results for compiling W2.hs:

[...]

Good results I think. Though I need to check if allocation goes up significantly in common cases.

You could try running the examples in examples100.zip (https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/examples100.zip) to see whether any of them performs worse; that would be a good first indication.

Changed 13 months ago by tdammers

Profiling output with more SCC's added

comment:73 Changed 13 months ago by tdammers

Profiling a bit deeper (see https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-stage2-7258-deeper-SCC.prof) hints at a possible nonlinearity in tryToInline, particularly when recursing through the keep function.
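For reference, the extra cost centres in that profile are presumably just manual SCC pragmas placed around the suspect expressions; a minimal reminder of the technique, on a stand-in function rather than the actual CmmSink code, is:

-- Build with profiling (e.g. ghc -O -prof Foo.hs) and run with +RTS -p;
-- the pragma introduces a named cost centre around the annotated expression.
keepEveryOther :: [Int] -> [Int]
keepEveryOther xs = {-# SCC "keep_loop" #-} go xs
  where
    go (a : _ : rest) = a : go rest
    go short          = short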

Changed 13 months ago by tdammers

Performance graph for various examples, compiling with -O, using UniqSet (as in comment #67)

comment:74 Changed 13 months ago by tdammers

Modifications as proposed in comment:67 / comment:70 appear to improve performance for the "bad" cases without significantly harming the "good" examples.

comment:75 Changed 13 months ago by bgamari

During the call, Simon hypothesized that we should keep the free-variable lists in mind when comparing STG. I stated that I didn't think we currently had a way to print closures' free-variable lists. I've checked, and this isn't quite true: the pretty-printer will print free-variable lists when -dppr-debug is passed. I've broken this out into its own flag in Phab:D4140.

Changed 13 months ago by tdammers

Attachment: numTerms.png added

Generated Core, number of terms

Changed 13 months ago by tdammers

Attachment: simplSize.png added

Generated Core, total size (in LOC)

Changed 13 months ago by tdammers

Attachment: stgSize.png added

Generated STG, total size (in LOC)

Changed 13 months ago by tdammers

Attachment: cmmSize.png added

Generated C--, total size (in LOC)

Changed 13 months ago by tdammers

Attachment: totalTime.png added

Total execution time (CPU, seconds)

comment:76 Changed 13 months ago by tdammers

Some better data extraction code for profiles generated earlier yields the above graphs. All of these were generated using the optimization as implemented in D4145 (using a UniqSet instead of a list). Observations:

  • The number of terms in generated Core scales in a perfectly linear fashion for all examples.
  • The size of the generated Core in lines of code, however, scales non-linearly, which tells us that the size of individual terms grows with input size. This behavior seems to be identical across all 4 examples, though, so it does not explain the extreme performance differences.
  • Somewhere between Core and STG, the show and read examples start to misbehave more than the other two.
  • Between STG and C--, things get utterly dramatic: show and read explode, while getline-appl and nothing are reduced back to near-linear behavior.

comment:77 Changed 13 months ago by tdammers

It seems that for the misbehaving examples, the offending cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionately in the read and show examples, while it remains below 40% for the well-behaved ones.

comment:78 Changed 13 months ago by simonpj

the offending cost centre is Stg2Stg

Very interesting! Keep digging :-)

comment:79 Changed 13 months ago by Ben Gamari <ben@…>

In 4353756/ghc:

CmmSink: Use an IntSet instead of a list

CmmProcs which have *lots* of local variables take a considerable
amount of time in CmmSink. This was noticed by @tdammers in #7258
while compiling files with large records (~200-400 fields).

Before:

```
        Sun Oct 29 19:58 2017 Time and Allocation Profiling Report (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
        total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

COST CENTRE     MODULE      SRC %time %alloc

sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        55.7   15.9
SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74 19.5   30.6
FloatOutwards   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66 4.2    9.0
RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    4.0   11.1
pprNativeCode   AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.8    6.3
NewStranal      SimplCore   compiler/simplCore/SimplCore.hs:480:40-63 1.6    3.7
OccAnal         SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.5    3.5
StgCmm          HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62)          1.2    2.4
regLiveness     AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.2    1.9
genMachCode     AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.9    1.8
NativeCodeGen   CodeOutput  compiler/main/CodeOutput.hs:171:18-78 0.9    2.1
CoreTidy        HscMain     compiler/main/HscMain.hs:1253:27-67 0.8    1.9
```

After:

```
        Sun Oct 29 19:18 2017 Time and Allocation Profiling Report (Final)

           ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

        total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
        total alloc = 15,772,184,488 bytes  (excludes profiling overheads)

COST CENTRE     MODULE         SRC %time %alloc

SimplTopBinds   SimplCore   compiler/simplCore/SimplCore.hs:761:39-74            38.3   29.0
sink            CmmPipeline compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        13.2   20.3
RegAlloc-linear AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    8.3   10.5
FloatOutwards   SimplCore compiler/simplCore/SimplCore.hs:471:40-66             8.1    8.5
pprNativeCode   AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    5.4    5.9
NewStranal      SimplCore compiler/simplCore/SimplCore.hs:480:40-63             3.1    3.5
OccAnal         SimplCore compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.9    3.3
StgCmm          HscMain compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.3    2.3
regLiveness     AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    2.1    1.8
NativeCodeGen   CodeOutput     compiler/main/CodeOutput.hs:171:18-78 1.7    2.0
genMachCode     AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.6    1.7
CoreTidy        HscMain        compiler/main/HscMain.hs:1253:27-67 1.4    1.8
foldNodesBwdOO  Hoopl.Dataflow compiler/cmm/Hoopl/Dataflow.hs:(397,1)-(403,17)       1.1    0.8
```

Reviewers: austin, bgamari, simonmar

Reviewed By: bgamari

Subscribers: duog, rwbarton, thomie, tdammers

GHC Trac Issues: #7258

Differential Revision: https://phabricator.haskell.org/D4145

comment:80 in reply to:  77 Changed 13 months ago by alexbiehl

Replying to tdammers:

It seems that for the misbehaving examples, the offending cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionately in the read and show examples, while it stays below 40% for the well-behaved ones.

Since you don't compile the examples with profiling on, the only "heavy" transformation in Stg2Stg is StgCse. You can try -fno-stg-cse to see if it has an impact on the timings.
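
For a quick per-module experiment, the flag can also go into the file itself, e.g. (sketch; assuming the module under test is W2):

{-# OPTIONS_GHC -fno-stg-cse #-}
module W2 where
-- unchanged body of the module; compare compile times against a build
-- without the pragma (or pass -fno-stg-cse on the GHC command line)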

Last edited 13 months ago by alexbiehl (previous) (diff)

comment:81 Changed 13 months ago by simonpj

Very curious. I see in comment:79

Before
        total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
        total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

After
        total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
        total alloc = 15,772,184,488 bytes  (excludes profiling overheads)

I don't think I have ever before seen a program that actually allocates more (15G instead of 14G) and yet runs in half the time. Usually allocation and runtime are more or less correlated. But apparently not in this case.

comment:82 Changed 13 months ago by tdammers

Stg2Stg turns out to be a dead end. The performance loss I observed there is entirely due to formatting and dumping about 1.3 GiB worth of STG when compiling with -ddump-stg; when compiling without -ddump-stg, Stg2Stg doesn't show up in the profile at all (0.0 on everything).

Re-running the profiling script without dumps shows different culprits: tryToInline and simpl_bind. E.g. the 400-field read example:

	Fri Nov  3 08:43 2017 Time and Allocation Profiling Report  (Final)

	   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c generated/t-400-read.hs

	total time  =       59.05 secs   (59049 ticks @ 1000 us, 1 processor)
	total alloc = 75,366,446,544 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE     SRC                                                 %time %alloc

simpl_bind                     Simplify   compiler/simplCore/Simplify.hs:141:83-101            29.4   22.0
tryToInline                    CmmSink    compiler/cmm/CmmSink.hs:410:3-41                     24.4   32.4
RegAlloc-linear                AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)   10.6   14.1
FloatOutwards                  SimplCore  compiler/simplCore/SimplCore.hs:471:40-66             8.8    6.9
pprNativeCode                  AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    4.0    4.7
OccAnal                        SimplCore  compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.3    2.2
StgCmm                         HscMain    compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.1    1.8
regLiveness                    AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.9    1.3
NativeCodeGen                  CodeOutput compiler/main/CodeOutput.hs:171:18-78                 1.5    1.6
genMachCode                    AsmCodeGen compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.4    1.3
CoreTidy                       HscMain    compiler/main/HscMain.hs:1253:27-67                   1.3    1.4
dmdAnal'-Case-single_prod_ctor DmdAnal    compiler/stranal/DmdAnal.hs:(251,5)-(276,68)          1.3    1.5

Changed 13 months ago by tdammers

Attachment: t-read-scc-time-abs.png added

Largest profiling cost centres (in absolute terms) for the read example

Changed 13 months ago by tdammers

Largest profiling cost centres (in absolute terms) for the getline-appl example

Changed 13 months ago by tdammers

Attachment: t-show-scc-time-abs.png added

Largest profiling cost centres (in absolute terms) for the show example

comment:83 Changed 13 months ago by tdammers

Comparing the latest 3 graphs, we can see that tryToInline scales more badly than the other cost centres, at least in the misbehaving examples, while the other top cost centres seem to scale roughly proportionally. simpl_bind is the largest one in most cases, but tryToInline is "more nonlinear" for the show and read examples, surpassing it between 150 and 200 fields in the show example and (educated guess) around 500 fields in the read example.

comment:84 Changed 13 months ago by simonpj

tryToInline is in CmmSink, which we already know is working too hard.

There may well be algorithmic improvements that could be made in CmmSink.

But, in addition, I would still like to understand why the STG-to-Cmm ratio climbs so sharply. If we have giant Cmm then any Cmm pass is going to take a long time. Fixing the non-linearity would be good, I agree; but generating less Cmm would also be a big win.

Why is it so big?

My guess: we are generating very deeply-nested closures that have a lot of free variables. E.g.

let x = let a5 = blah
            q = ...K a1 a2 a3 a4 a5 ...
        in ...

Here the closure for q has five free variables (a1-a5), while that for x has four (a1-a4). So the code for x will slurp lots of variables out of x's closure only to store them again in q's. And if that is nested we'll get O(n**2) code size.

But I'm speculating. Facts would be good. Let's measure STG size with and without counting the free vars of each closure (which are stored in the STG tree) and see.
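
As a back-of-the-envelope check of the O(n**2) guess (illustration only): if the k-th nested closure captures all the earlier binders, the free-variable slots that have to be loaded and re-stored add up quadratically.

-- total free-variable slots across n nested closures, if closure k
-- captures the k earlier binders: 0 + 1 + ... + (n-1) = n*(n-1)/2
closureSlots :: Int -> Int
closureSlots n = sum [0 .. n - 1]

-- closureSlots 10  == 45
-- closureSlots 400 == 79800   -- ~40x more fields, ~1770x more slots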

comment:85 Changed 13 months ago by tdammers

Here's some evidence (compiled with -O):

https://ghc.haskell.org/trac/ghc/raw-attachment/ticket/7258/stgVarsSizeRatio.png

The 3 misbehaving examples (read, show, and to a lesser degree read-appl) show linear growth of the free-variables/other-STG ratio (which suggests roughly quadratic growth of the overall STG size, assuming linear growth of the base STG size), while the well-behaved examples maintain a more-or-less constant ratio.

A quick glance at the generated STG suggests that there is indeed some deep nesting of closures going on.

Last edited 13 months ago by bgamari (previous) (diff)

Changed 13 months ago by tdammers

Attachment: stgVarsSizeRatio.png added

Free variables / other STG ratio

comment:86 Changed 13 months ago by tdammers

I've narrowed it down a bit more.

This behaves nicely:

module D where

import Text.ParserCombinators.ReadP as ReadP
import Control.Monad.State as State
import Data.Char (ord)

data DT = DT
    { field0 :: Int
    , field2 :: Int
    , field3 :: Int
    , field4 :: Int
    , field5 :: Int
    , field6 :: Int
    , field7 :: Int
    , field8 :: Int
    , field9 :: Int
    , field10 :: Int
    }
getlD :: IO DT
getlD = DT
    <$> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)
    <*> (read <$> getLine)

But this doesn't:

module D where

import Text.ParserCombinators.ReadP as ReadP
import Control.Monad.State as State
import Data.Char (ord)

data DT = DT
    { field0 :: Int
    , field2 :: Int
    , field3 :: Int
    , field4 :: Int
    , field5 :: Int
    , field6 :: Int
    , field7 :: Int
    , field8 :: Int
    , field9 :: Int
    , field10 :: Int
    }
readD :: ReadP DT
readD = DT
    <$> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)
    <*> (ord <$> ReadP.get)

Changed 13 months ago by tdammers

Attachment: t-10-dump.zip added

Various GHC dumps for 10-field getline-appl and read-appl examples. Read misbehaves, getLine doesn't.

comment:87 Changed 13 months ago by tdammers

Further evidence, based on the observation that the problem is caused by certain kinds of CPS-style code:

module Nested
where

-- A data type capturing the relevant structure of the P type from ReadP
data A a = A (Int -> A a) | N a

-- This implementation produces deeply nested Core:
f10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int)
f10 =
  A (\i0 ->
  A (\i1 ->
  A (\i2 ->
  A (\i3 ->
  A (\i4 ->
  A (\i5 ->
  A (\i6 ->
  A (\i7 ->
  A (\i8 ->
  A (\i9 ->
  N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
  ))))))))))


-- This implementation produces flat Core.
g10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int)
g10 = a0
  where
    {-#NOINLINE a10#-}
    a10 = \(i0, i1, i2, i3, i4, i5, i6, i7, i8) i9 -> N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
    {-#NOINLINE a9#-}
    a9  = \(i0, i1, i2, i3, i4, i5, i6, i7) i8 -> A (a10 (i0, i1, i2, i3, i4, i5, i6, i7, i8))
    {-#NOINLINE a8#-}
    a8  = \(i0, i1, i2, i3, i4, i5, i6) i7 -> A (a9 (i0, i1, i2, i3, i4, i5, i6, i7))
    {-#NOINLINE a7#-}
    a7  = \(i0, i1, i2, i3, i4, i5) i6 -> A (a8 (i0, i1, i2, i3, i4, i5, i6))
    {-#NOINLINE a6#-}
    a6  = \(i0, i1, i2, i3, i4) i5 -> A (a7 (i0, i1, i2, i3, i4, i5))
    {-#NOINLINE a5#-}
    a5  = \(i0, i1, i2, i3) i4 -> A (a6 (i0, i1, i2, i3, i4))
    {-#NOINLINE a4#-}
    a4  = \(i0, i1, i2) i3 -> A (a5 (i0, i1, i2, i3))
    {-#NOINLINE a3#-}
    a3  = \(i0, i1) i2 -> A (a4 (i0, i1, i2))
    {-#NOINLINE a2#-}
    a2  = \i0 i1 -> A (a3 (i0, i1))
    {-#NOINLINE a1#-}
    a1  = \i0 -> A (a2 i0)
    {-#NOINLINE a0#-}
    a0  = A a1

This module tries to provoke "tupling up the free variables" in g10, and to compare it against the naive implementation f10. I do not yet know whether this would actually produce any performance benefits, though.

The NOINLINE pragmas are needed, by the way; otherwise GHC decides to inline the aN functions, and we end up with a mess similar to f10.
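
To make the connection to ReadP explicit, here is a rough sketch of a bind for the A type above (not the library's actual code, just the same style as ReadP's P type); once such a bind is inlined, a chain of binds unfolds into exactly the f10 shape, with every earlier binder becoming a free variable of every later closure:

bindA :: A a -> (a -> A b) -> A b
bindA (A f) k = A (\i -> f i `bindA` k)  -- re-wraps the continuation in a
                                         -- new closure at every step
bindA (N x) k = k x

-- e.g. this chain ...
pair :: A (Int, Int)
pair = A N `bindA` \i0 ->
       A N `bindA` \i1 ->
       N (i0, i1)
-- ... reduces to  A (\i0 -> A (\i1 -> N (i0, i1))), i.e. the f10 shape.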

comment:88 Changed 13 months ago by simonpj

Hmm. The Cmm for g10 is actually a lot bigger than that for f10, and looked to me as if it too would scale quadratically. Are f10 and g10 accurate reflections of the "scale well" and "scale badly" versions of Read?

I can see why things might scale badly, if we have a lot of

p1 >>= \x ->
p2 >>= \y ->
p3 >>= \z ->
return (K x y z)

And f10 has that structure, albeit with data constructors rather than functions. BUT: if that's it, why doesn't (>>=) inline, so we get a case expression instead??

So if f10 is reasonably accurate, I can see two ways forward:

1 Do something in the back end to generate less code.

2 Generate different code. I'm still intrigued (and a bit disbelieving) why a different exposition of Read generates code that scales better.

Concerning (1) the kind of thing I had in mind was more like this

f10 =
  A (\i0 ->
  A (\i1 ->
  A (\i2 ->
  A (\i3 ->
  A (\i4 -> let t = (i0,i1,i2,i3,i4) in
  A (\i5 ->
  A (\i6 ->
  A (\i7 ->
  A (\i8 ->
  A (\i9 ->
  case t of (i0,i1,i2,i3,i4) ->
     N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)

Now no function closure gets more than 5 free variables. I imagine one could do this as an STG-to-STG transformation if we decided that this was really the core issue.

A tantalising point is this. We have something like this:

   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah...

The list before the \ are the free vars of the closure.

Inside the code for f4 we have f4's function closure in hand, with i0... in it. Rather than capturing i0, i1... as the free vars of f5, we could just store a pointer to f4's function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say

   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [f4, i4] \i5 -> ...blah...

with f4 and i4 being the free vars of f5's closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where all of f4's free variables are also free in f5. (If not all of them are, you might get a space leak by closing over f4.)
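
At the source level the intended effect is roughly the following (just an analogy with made-up names; the real change would live in the STG-to-C-- code generator, and Env stands in for f4's own closure):

data Env = Env { e0, e1, e2, e3 :: Int }   -- plays the role of f4's closure

-- today (conceptually): the inner function captures all five binders
mkInnerFlat :: Int -> Int -> Int -> Int -> Int -> (Int -> Int)
mkInnerFlat i0 i1 i2 i3 i4 = \i5 -> i0 + i1 + i2 + i3 + i4 + i5

-- with parent-closure sharing (conceptually): the inner function captures
-- only the parent environment plus the newest binder, so closures stop
-- growing as the nesting gets deeper
mkInnerShared :: Env -> Int -> (Int -> Int)
mkInnerShared env i4 = \i5 -> e0 env + e1 env + e2 env + e3 env + i4 + i5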

comment:89 in reply to:  88 Changed 12 months ago by tdammers

Replying to simonpj:

Hmm. The Cmm for g10 is actually a lot bigger than that for f10, and looked to me as if it too would scale quadratically.

It would, but for different reasons, and I believe I can rewrite it to not scale quadratically (but haven't gotten around to it yet).

Are f10 and g10 accurate reflections of the "scale well" and "scale badly" versions of Read?

No; we don't actually have a "scale well" version of Read yet. All the implementations of Read that I have produced so far are in the "scale badly" category, although the degree to which they scale badly differs somewhat. We have two kinds of "scale well" examples:

  1. Trivial ones, where rather than nesting code such that chains of closures appear, we just call read or something else N times, but in benign ways; these examples do not have much explanatory value, and I merely included them to establish a performance baseline.
  2. Well-behaved nested / chained examples, such as the getline-appl one; however, none of these do anything equivalent to read.

I can see why things might scale badly, if we have a lot of

p1 >>= \x ->
p2 >>= \y ->
p3 >>= \z ->
return (K x y z)

And f10 has that structure, albeit with data constructors rather than functions. BUT: if that's it, why doesn't (>>=) inline, so we get a case expression instead??

I believe the (>>=) does in fact inline; f10 is the kind of shape you get when you inline the (>>=) from ReadP. (>>=) from other monadic types does not introduce this kind of shape, which would explain why Read misbehaves but getLine over IO does not.

So if f10 is reasonably accurate, I can see two ways forward:

1 Do something in the back end to generate less code.

2 Generate different code. I'm still intrigued (and a bit disbelieving) why a different exposition of Read generates code that scales better.

See above; Read always misbehaves here, so your disbelief is entirely justified. The bad behavior seems to be related to the actual implementation of (>>=), not so much whether the (>>=) gets inlined. Or, more accurately, inlining (>>=) actually *is* the problem; if we didn't inline it, then read-appl would compile to something equivalent to the generic applicative example, which behaves nicely.

Concerning (1) the kind of thing I had in mind was more like this

f10 =
  A (\i0 ->
  A (\i1 ->
  A (\i2 ->
  A (\i3 ->
  A (\i4 -> let t = (i0,i1,i2,i3,i4) in
  A (\i5 ->
  A (\i6 ->
  A (\i7 ->
  A (\i8 ->
  A (\i9 ->
  case t of (i0,i1,i2,i3,i4) ->
     N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)

Now no function closure gets more than 5 free variables. I imagine one could do this as a STG-to-STG transformation if we decided that this was really the core issue.

Further experimentation shows that this doesn't actually achieve anything; the resulting Core is *exactly the same*, and if you put both functions into the same compilation unit, they get optimized into expressing one in terms of the other (I'm guessing here that CSE is smart enough to figure out that they are equivalent). I have written three versions: f10 as-is, g10 using the suggested 5-tuple optimization, and h10 using a nested-tuple optimization, and I end up with Core that literally has these lines:

f10 = A f1

g10 = f10

h10 = f10

A tantalising point is this. We have something like this:

   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [i0,i1,i2,i3, i4] \i5 -> ...blah...

The list before the \ are the free vars of the closure.

Inside the code for f4 we have f4's function closure in hand, with i0... in it. Rather than capturing i0, i1... as the free vars of f5, we could just store a pointer to f4's function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say

   let f4 = [i0,i1,i2,i3] \i4 ->
            let f5 = [f4, i4] \i5 -> ...blah...

with f4 and i4 being the free vars of f5's closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where all of f4's free variables are also free in f5. (If not all of them are, you might get a space leak by closing over f4.)

comment:91 Changed 12 months ago by simonpj

Indeed, alexbiehl is on the way to implementing the "tantalising point" of comment:89. In a phone call today we agreed

  • That sharing the parent closure is a good thing to try; and not too hard.
  • That we should start
    • A ticket to track progress on it: #14461
    • A wiki page to explain how it all works. I have made a start: NestedClosures
Last edited 12 months ago by simonpj (previous) (diff)

comment:92 Changed 12 months ago by tdammers

Progress made so far:

  • Reduced algorithmic complexity in register spilling (33)
  • Minor improvements in the pretty-based code writer (34)
  • Refactored generated Read code (comment:37 / #14364)
  • Observed that Read instances generate long chains of monadic binds over a monad that introduces nested closures when binding (ReadP, specifically); and that deeply nested closures in general cause the register allocator to repeatedly unpack and re-pack free variables, which leads to quadratic behavior. The fix for this has been split off into #14461.

Things left to do:

  • Replace pretty with something that doesn't share its nonlinearities (#14005)
  • Once #14461 is done, it may be worth digging further; right now, though, the nested-closures nonlinearity makes it difficult to isolate other bottlenecks.

comment:93 Changed 12 months ago by simonpj

Great summary, thanks.

Note: See TracTickets for help on using tickets.