Opened 5 years ago

Last modified 8 hours ago

#7258 new bug

Compiling DynFlags is jolly slow

Reported by: simonpj Owned by: simonpj
Priority: normal Milestone:
Component: Compiler Version: 7.6.1
Keywords: deriving-perf Cc: iustin@…, heisenbug, pho@…, bgamari, asr, gidyn, HairyDude, RyanGlScott, trommler
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

Compiling DynFlags really takes a long time these days.

Ian thinks that it's due to the Read and Show instances he has added (see attached W2.hs).

Simon M suggests: instead of using Read/Show, you could generate some code in mkDerivedConstants to use ReadP and Outputable, which should be much smaller and faster.

This ticket is

  • To see if we can speed up compilation of DynFlags
  • To check WHY it is so slow. Are there any lessons we can learn or ways to make it compile faster? Is it tickling some asymptotically-bad corner of the compiler?

Simon
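
The attached W2.hs is the actual reproduction; the shape of the problem is roughly a record with many fields deriving Read and Show. A minimal sketch of that shape (illustrative field names, not the attached file, which has far more fields):

```haskell
module Main where

-- Sketch of the shape of W2.hs: a record with many fields whose
-- derived Read/Show instances dominate compile time (only three
-- fields here; the attached file has far more).
data Settings = Settings
  { verbosity :: Int
  , optimise  :: Bool
  , hcSuffix  :: String
  } deriving (Read, Show, Eq)

main :: IO ()
main = do
  let s = Settings { verbosity = 3, optimise = True, hcSuffix = "hc" }
  -- Each field adds its own chunk of parsing code to the derived readPrec.
  print (read (show s) == s)
```

The derived instances round-trip as expected; the cost is in the size of the code GHC generates per field.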

Attachments (3)

W2.hs (6.4 KB) - added by simonpj 5 years ago.
Exhibits long compilation times
overview_heap_profile.png (85.6 KB) - added by igloo 5 years ago.
overview_optimised_heap_profile.png (70.6 KB) - added by igloo 5 years ago.


Change History (42)

Changed 5 years ago by simonpj

Attachment: W2.hs added

Exhibits long compilation times

comment:1 Changed 5 years ago by simonpj

Owner: set to igloo

comment:2 Changed 5 years ago by igloo

Milestone: 7.8.1

Changed 5 years ago by igloo

Attachment: overview_heap_profile.png added

comment:3 Changed 5 years ago by igloo

    Fri Oct 19 20:23 2012 Time and Allocation Profiling Report  (Final)

       ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -c W2.hs

    total time  =       26.08 secs   (26080 ticks @ 1000 us, 1 processor)
    total alloc = 14,601,152,424 bytes  (excludes profiling overheads)

COST CENTRE       MODULE         %time %alloc

OccAnal           SimplCore       27.0   24.5
RegAlloc          AsmCodeGen      19.2   22.4
pprNativeCode     AsmCodeGen      16.8   19.9
regLiveness       AsmCodeGen       8.0    6.3
genMachCode       AsmCodeGen       5.0    5.4
StgCmm            HscMain          4.3    3.5
occAnalBind.assoc OccurAnal        3.3    3.3
tops              CmmPipeline      3.3    3.4
SimplTopBinds     SimplCore        1.7    1.6
do_block          Hoopl.Dataflow   1.1    0.8

Changed 5 years ago by igloo

comment:4 Changed 5 years ago by igloo

    Fri Oct 19 20:32 2012 Time and Allocation Profiling Report  (Final)

       ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -O -c W2.hs

    total time  =       76.26 secs   (76258 ticks @ 1000 us, 1 processor)
    total alloc = 45,073,993,600 bytes  (excludes profiling overheads)

COST CENTRE       MODULE      %time %alloc

OccAnal           SimplCore    49.7   45.3
CoreTidy          HscMain      22.3   31.8
SimplTopBinds     SimplCore    12.9    9.9
pprNativeCode     AsmCodeGen    2.0    1.8
RegAlloc          AsmCodeGen    1.9    2.0
sink2             CmmPipeline   1.5    1.8
StgCmm            HscMain       1.0    0.8
occAnalBind.assoc OccurAnal     1.0    1.0

comment:5 Changed 5 years ago by igloo

                                                     individual     inherited   
COST CENTRE               MODULE      no.  entries  %time %alloc   %time %alloc

getProxies.fwd_pe         OccurAnal  1776     9226    6.7    0.0    25.5   23.7 
 getProxies.fwd_pe.add1   OccurAnal  1779  9733280    5.8    1.6    18.8   23.7 
  getProxies.fwd_pe.add2  OccurAnal  1781  9733280   13.1   22.1    13.1   22.1 

(without -O)

comment:6 Changed 5 years ago by igloo

Owner: changed from igloo to simonpj

Simon, I think the conclusion was that you'd look into this. Please bounce it back to me if you'd like me to investigate further.

comment:7 Changed 5 years ago by igloo

See also #7450

comment:8 Changed 5 years ago by iustin

Cc: iustin@… added

comment:9 Changed 5 years ago by PHO

Cc: pho@… added

comment:10 Changed 5 years ago by simonpj@…

commit 52e43004f63276c1342933e40a673ad25cf2113a

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Dec 21 17:39:33 2012 +0000

    Use expectP in deriving( Read )
    
    Note [Use expectP]   in TcGenDeriv
    ~~~~~~~~~~~~~~~~~~
    Note that we use
       expectP (Ident "T1")
    rather than
       Ident "T1" <- lexP
The latter desugars to inline code for matching the Ident and the
    string, and this can be very voluminous. The former is much more
    compact.  Cf Trac #7258, although that also concerned non-linearity in
    the occurrence analyser, a separate issue.

 compiler/prelude/PrelNames.lhs    |    3 +-
 compiler/typecheck/TcGenDeriv.lhs |   43 ++++++++++++++++++++++--------------
 2 files changed, 28 insertions(+), 18 deletions(-)
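
The two styles the note contrasts can be written by hand. A minimal sketch (the type T1 and its hand-written instance are illustrative, not the actual derived code):

```haskell
import GHC.Read (expectP)
import Text.Read (readPrec, parens)
import Text.Read.Lex (Lexeme (Ident))

data T1 = T1 deriving (Show, Eq)

-- Compact style: a single expectP call per keyword, instead of the
-- voluminous inline matching code that  Ident "T1" <- lexP  desugars to.
instance Read T1 where
  readPrec = parens (expectP (Ident "T1") >> return T1)

main :: IO ()
main = print (read "T1" :: T1)
```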

comment:11 Changed 5 years ago by simonpj@…

commit 9ea2b6666cb6684279a120c688e8557bcef3dc73

Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Mon Dec 24 11:34:51 2012 +0000

    Simplify the binder-swap transformation
    
    The occurrence analyser implements the "binder-swap" transformation,
    described in Note [Binder swap] in OccAnal. For some reason I had
implemented an extremely complicated version, I believe intended to get
as much as possible done in a single simplifier pass.  But it turned
    out (Trac #7258) that the 'getProxies' bit of this complicated code
    scaled rather non-linearly, and all by itself could consume half of
    the entire compile time.
    
    The patch dramatically simplifies the transformation, so that
    we simply swizzle
         case x of y { I# v -> e }
    to
         case x of y { I# v -> let x = y in e }
    
    I can't see any reason not to do this
    
      * Compiler allocation for #7258 with 200 fields goes down by 25%
        and compile time by 20%
    
      * The nofib figures do not budge
    
      * Quite a bit of complicated code goes away

 compiler/simplCore/OccurAnal.lhs |  255 +++++++-------------------------------
 1 files changed, 46 insertions(+), 209 deletions(-)

comment:12 Changed 5 years ago by simonpj

OK, I have verified that the changes to the occurrence analyser (above) make essentially zero difference to nofib numbers. It's a very worthwhile simplification, because it completely gets rid of the getProxies stuff that was eating all the time before.

Alas, compiling W2 is still non-linear. Here's the allocation by the stage-2 compiler

  • 50 fields: 1.01 Gbyte
  • 100 fields: 2.98 Gbyte
  • 200 fields: 9.64 Gbyte

This is after including the improvements to the derived Read code in #7450.

So something is still wrong. Need to do some profiling to find out.

There are some very deeply nested lambdas, which lead to SRTs of ever-increasing size, so there is definitely a quadratic effect there. But I'm not sure that is all.
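
The scaling above can be reproduced by generating W2-style modules of varying size. A throwaway generator along these lines (hypothetical helper, not part of the ticket):

```haskell
-- Emit a module with n record fields deriving Read/Show, for measuring
-- compiler allocation as n grows (50/100/200 fields above).
genModule :: Int -> String
genModule n = unlines $
     [ "module W2 where", "", "data Settings = Settings" ]
  ++ [ sep i ++ " field" ++ show i ++ " :: Int" | i <- [1 .. n] ]
  ++ [ "  } deriving (Read, Show)" ]
  where
    sep 1 = "  {"
    sep _ = "  ,"

main :: IO ()
main = putStr (genModule 3)
```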

comment:13 Changed 5 years ago by simonmar

I can't compile DynFlags on my 256MB Raspberry Pi. Could we do something about having less auto-generated code in this module please?

comment:14 Changed 5 years ago by dterei

Cc: dterei added

comment:15 Changed 3 years ago by thoughtpolice

Milestone: 7.8.3 → 7.10.1

Moving to 7.10.1

comment:16 Changed 3 years ago by thoughtpolice

Milestone: 7.10.1 → 7.12.1

Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.

comment:17 Changed 3 years ago by thomie

Type of failure: None/Unknown → Compile-time performance bug

comment:18 Changed 3 years ago by bgamari

Cc: bgamari added

comment:19 Changed 3 years ago by asr

Cc: asr added

comment:20 Changed 2 years ago by gidyn

Cc: gidyn added

comment:21 Changed 2 years ago by bgamari

I strongly suspect that some of the fixes that I've made while working on #7450 will help here, although I have yet to test this.

comment:22 Changed 2 years ago by bgamari

ticket:9669#comment:13 contains a summary of the performance impact of Phab:D1012 and Phab:D1040 on this and potentially-related ticket #7450.

comment:23 Changed 2 years ago by thoughtpolice

Milestone: 7.12.1 → 8.0.1

Milestone renamed

comment:24 Changed 22 months ago by ezyang

Keywords: deriving-perf added

comment:25 Changed 21 months ago by thomie

Milestone: 8.0.1

comment:26 Changed 21 months ago by HairyDude

Cc: HairyDude added

comment:27 Changed 14 months ago by RyanGlScott

Cc: RyanGlScott added

comment:28 Changed 3 days ago by trommler

Cc: trommler added

comment:29 Changed 35 hours ago by simonpj

Tobias found that the bottlenecks on W2 were

  • Pretty-printing of assembly code
  • Register allocation

He has some improvements in progress.

comment:30 Changed 35 hours ago by dfeuer

comment:31 Changed 35 hours ago by simonpj

Apparently this is an example of non-linearity in the pretty-printer library.

See Bartosz's summary here, and #14005 (which is about switching over to a new pretty-printing library).
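
One way non-linearity typically shows up in pretty-printer libraries is deeply left-nested composition. An illustrative sketch using the pretty package (whether this is the exact shape Bartosz measured is an assumption):

```haskell
import Prelude hiding ((<>))
import Text.PrettyPrint

-- A deeply left-nested chain ((... (x <> x) ...) <> x): the shape a
-- naive renderer re-traverses repeatedly, giving non-linear cost.
leftNested :: Int -> Doc
leftNested n = foldl (<>) empty (replicate n (text "x"))

main :: IO ()
main = putStrLn (render (leftNested 5))
```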

Last edited 35 hours ago by simonpj

comment:32 Changed 32 hours ago by heisenbug

Cc: heisenbug added

comment:33 Changed 16 hours ago by tdammers

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

I believe there are further optimization opportunities here, such as:

  • changing the current list-based code to also use set/map operations
  • moving the register class check into the candidates part
  • precalculating the list of candidates (not sure about this one though)
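
The quadratic-to-near-linear change can be illustrated with ordinary Data.Set (the commit uses GHC's UniqSet / UniqFM; this sketch only shows the shape of the rewrite, with hypothetical names):

```haskell
import qualified Data.Set as Set

-- Quadratic: for each assigned register, scan the whole keep-list.
survivorsList :: [Int] -> [Int] -> [Int]
survivorsList assig keep = [ r | r <- assig, r `elem` keep ]

-- O(n log n): build the sets once and intersect them.
survivorsSet :: [Int] -> [Int] -> [Int]
survivorsSet assig keep =
  Set.toAscList (Set.fromList assig `Set.intersection` Set.fromList keep)

main :: IO ()
main = print (survivorsSet [1, 5, 9, 12] [5, 12, 20])
```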

comment:34 Changed 16 hours ago by tdammers

http://git.haskell.org/ghc.git/commitdiff/375d4bd0fc2afd72617bc827bf63b5eeb24f2f7c

This one makes crucial parts of the pretty-printing code stricter; the modified functions score better in profiling, but the overall effect on compilation times seems to be minimal.

comment:35 in reply to:  33 ; Changed 15 hours ago by heisenbug

Replying to tdammers:

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

Hmmm, what do you think about keeping the Cartesian product below a certain cut-off size? Manipulating the sets surely also has a (large) constant factor built in.

I believe there are further optimization opportunities here, such as:

  • changing the current list-based code to also use set/map operations
  • moving the register class check into the candidates part
  • precalculating the list of candidates (not sure about this one though)

comment:36 in reply to:  35 Changed 12 hours ago by tdammers

Replying to heisenbug:

Replying to tdammers:

This commit:

http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5

...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would take a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

Hmmm, what do you think about keeping the Cartesian product below a certain cut-off size? Manipulating the sets surely also has a (large) constant factor built in.

Could be worth it, but I'm a bit skeptical: the additional overhead of unpacking the sets into lists could easily cancel out the performance benefit, and it would make the code more complicated. I'll give it a try, though.
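
The cut-off idea under discussion might look roughly like this (a sketch with Data.Set standing in for UniqSet; the threshold of 8 is arbitrary):

```haskell
import qualified Data.Set as Set

-- Hybrid: plain list scan for small inputs, set intersection otherwise.
survivors :: Int -> [Int] -> [Int] -> [Int]
survivors cutoff assig keep
  | length assig <= cutoff = [ r | r <- assig, r `elem` keep ]
  | otherwise =
      Set.toAscList (Set.fromList assig `Set.intersection` Set.fromList keep)

main :: IO ()
main = print (survivors 8 [1, 2, 3] [2, 3, 4])
```

Whether the list branch ever wins in practice is exactly the measurement question raised here.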

comment:37 Changed 9 hours ago by bgamari

While chatting with tdammers about this I had a peek at the core; it seems that 5,000 of the 7,000 terms in the simplified core of W2 are in creadPrec_rdaO. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.

comment:38 Changed 9 hours ago by dterei

Cc: dterei removed

comment:39 Changed 8 hours ago by bgamari

Hmmm, what do you think about keeping the Cartesian product below a certain cut-off size? Manipulating the sets surely also has a (large) constant factor built in.

I actually suspected this in the past, but dfeuer did some measurements in the context of ListSetUtils and found that sets aren't measurably different. Don't let that stop you from quickly trying though.
