Opened 6 years ago
Last modified 15 months ago
#7258 new bug
Compiling DynFlags is jolly slow
Reported by: simonpj | Owned by: simonpj
Priority: normal | Milestone:
Component: Compiler | Version: 7.6.1
Keywords: deriving-perf | Cc: iustin@…, heisenbug, pho@…, bgamari, asr, gidyn, HairyDude, RyanGlScott, trommler
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s):
Wiki Page:
Description
Compiling DynFlags really takes a long time these days. Ian thinks that it's due to the Read and Show instances he has added (see the attached W2.hs).
Simon M suggests: instead of using Read/Show, you could generate some code in mkDerivedConstants to use ReadP and Outputable, which should be much smaller and faster.
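As a rough sketch of that direction (not the actual proposal; the Settings type, field names, and record format below are invented for illustration), a generated ReadP-based parser sidesteps the derived-Read machinery entirely:

import Text.ParserCombinators.ReadP

-- Hypothetical stand-in for a big generated record type.
data Settings = Settings { sTopDir :: String, sVerbosity :: Int }
  deriving Show

-- A direct ReadP parser: a small, flat sequence of primitive parsers,
-- with no per-constructor pattern-matching code to compile.
parseSettings :: ReadP Settings
parseSettings = do
  _   <- string "Settings{sTopDir="
  dir <- munch1 (/= ',')
  _   <- string ",sVerbosity="
  v   <- readS_to_P reads
  _   <- string "}"
  return (Settings dir v)

-- Usage: readP_to_S parseSettings "Settings{sTopDir=/opt/ghc,sVerbosity=3}"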
This ticket is:
- To see if we can speed up compilation of DynFlags
- To check WHY it is so slow. Are there any lessons we can learn or ways to make it compile faster? Is it tickling some asymptotically-bad corner of the compiler?
Simon
Attachments (24)
Change History (117)
Changed 6 years ago by
comment:1 Changed 6 years ago by
Owner: set to igloo
comment:2 Changed 6 years ago by
Milestone: → 7.8.1
Changed 6 years ago by
Attachment: overview_heap_profile.png added
comment:3 Changed 6 years ago by
Fri Oct 19 20:23 2012 Time and Allocation Profiling Report (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -c W2.hs

   total time  = 26.08 secs (26080 ticks @ 1000 us, 1 processor)
   total alloc = 14,601,152,424 bytes (excludes profiling overheads)

COST CENTRE        MODULE          %time %alloc
OccAnal            SimplCore        27.0   24.5
RegAlloc           AsmCodeGen       19.2   22.4
pprNativeCode      AsmCodeGen       16.8   19.9
regLiveness        AsmCodeGen        8.0    6.3
genMachCode        AsmCodeGen        5.0    5.4
StgCmm             HscMain           4.3    3.5
occAnalBind.assoc  OccurAnal         3.3    3.3
tops               CmmPipeline       3.3    3.4
SimplTopBinds      SimplCore         1.7    1.6
do_block           Hoopl.Dataflow    1.1    0.8
Changed 6 years ago by
Attachment: overview_optimised_heap_profile.png added
comment:4 Changed 6 years ago by
Fri Oct 19 20:32 2012 Time and Allocation Profiling Report (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/ian/ghc/git/ghc/inplace/lib -fforce-recomp -O -c W2.hs

   total time  = 76.26 secs (76258 ticks @ 1000 us, 1 processor)
   total alloc = 45,073,993,600 bytes (excludes profiling overheads)

COST CENTRE        MODULE       %time %alloc
OccAnal            SimplCore     49.7   45.3
CoreTidy           HscMain       22.3   31.8
SimplTopBinds      SimplCore     12.9    9.9
pprNativeCode      AsmCodeGen     2.0    1.8
RegAlloc           AsmCodeGen     1.9    2.0
sink2              CmmPipeline    1.5    1.8
StgCmm             HscMain        1.0    0.8
occAnalBind.assoc  OccurAnal      1.0    1.0
comment:5 Changed 6 years ago by
                                                      individual      inherited
COST CENTRE             MODULE     no.   entries   %time %alloc   %time %alloc
getProxies.fwd_pe       OccurAnal  1776     9226     6.7    0.0    25.5   23.7
getProxies.fwd_pe.add1  OccurAnal  1779  9733280     5.8    1.6    18.8   23.7
getProxies.fwd_pe.add2  OccurAnal  1781  9733280    13.1   22.1    13.1   22.1
(without -O)
comment:6 Changed 6 years ago by
Owner: changed from igloo to simonpj
Simon, I think the conclusion was that you'd look into this. Please bounce it back to me if you'd like me to investigate further.
comment:8 Changed 6 years ago by
Cc: iustin@… added
comment:9 Changed 6 years ago by
Cc: pho@… added
comment:10 Changed 6 years ago by
commit 52e43004f63276c1342933e40a673ad25cf2113a
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Fri Dec 21 17:39:33 2012 +0000

    Use expectP in deriving( Read )

    Note [Use expectP] in TcGenDeriv
    ~~~~~~~~~~~~~~~~~~
    Note that we use
        expectP (Ident "T1")
    rather than
        Ident "T1" <- lexP
    The latter desugars to inline code for matching the Ident and the
    string, and this can be very voluminous. The former is much more
    compact. Cf Trac #7258, although that also concerned non-linearity
    in the occurrence analyser, a separate issue.

 compiler/prelude/PrelNames.lhs    |  3 +-
 compiler/typecheck/TcGenDeriv.lhs | 43 ++++++++++++++++++++++--------------
 2 files changed, 28 insertions(+), 18 deletions(-)
comment:11 Changed 6 years ago by
commit 9ea2b6666cb6684279a120c688e8557bcef3dc73
Author: Simon Peyton Jones <simonpj@microsoft.com>
Date:   Mon Dec 24 11:34:51 2012 +0000

    Simplify the binder-swap transformation

    The occurrence analyser implements the "binder-swap" transformation,
    described in Note [Binder swap] in OccAnal. For some reason I had
    implemented an extremely complicated version, I believe intended to
    get as much as possible done in a single simplifier pass. But it
    turned out (Trac #7258) that the 'getProxies' bit of this complicated
    code scaled rather non-linearly, and all by itself could consume half
    of the entire compile time.

    The patch dramatically simplifies the transformation, so that we
    simply swizzle
        case x of y { I# v -> e }
    to
        case x of y { I# v -> let x = y in e }
    I can't see any reason not to do this.

    * Compiler allocation for #7258 with 200 fields goes down by 25%
      and compile time by 20%
    * The nofib figures do not budge
    * Quite a bit of complicated code goes away

 compiler/simplCore/OccurAnal.lhs | 255 +++++++-------------------------------
 1 files changed, 46 insertions(+), 209 deletions(-)
comment:12 Changed 6 years ago by
OK, I have verified that the changes to the occurrence analyser (above) make essentially zero difference to the nofib numbers. It's a very worthwhile simplification, because it completely gets rid of the getProxies stuff that was eating all the time before.
Alas, compiling W2 is still non-linear. Here's the allocation by the stage-2 compiler
- 50 fields: 1.01 Gbyte
- 100 fields: 2.98 Gbyte
- 200 fields: 9.64 Gbyte
This is after including the improvements to the derived Read code in #7450.
So something is still wrong: each doubling of the field count roughly triples allocation, i.e. growth of roughly n^1.6. Need to do some profiling to find out where.
There are some very deeply nested lambdas, which lead to SRTs of ever-increasing size, so there is definitely a quadratic effect there. But I'm not sure that is all.
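For intuition, the derived code has the shape sketched below (a hypothetical three-field fragment, not the actual derived output): each >>= introduces a lambda that closes over all earlier binders, so the free-variable sets, and with them the SRTs, grow with nesting depth.

import Text.ParserCombinators.ReadPrec (ReadPrec)
import Text.Read (readPrec)

data DT = DT Int Int Int

-- Hypothetical fragment mirroring the shape of a derived readPrec.
readDT :: ReadPrec DT
readDT =
  readPrec >>= \a1 ->   -- this closure captures nothing yet
  readPrec >>= \a2 ->   -- captures a1
  readPrec >>= \a3 ->   -- captures a1 and a2
  return (DT a1 a2 a3)  -- the innermost closure captures all N binders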
comment:13 Changed 6 years ago by
I can't compile DynFlags on my 256MB Raspberry Pi. Could we do something about having less auto-generated code in this module please?
comment:14 Changed 6 years ago by
Cc: dterei added
comment:16 Changed 4 years ago by
Milestone: 7.10.1 → 7.12.1
Moving to 7.12.1 milestone; if you feel this is an error and should be addressed sooner, please move it back to the 7.10.1 milestone.
comment:17 Changed 4 years ago by
Type of failure: None/Unknown → Compile-time performance bug
comment:18 Changed 4 years ago by
Cc: bgamari added
comment:19 Changed 4 years ago by
Cc: asr added
comment:20 Changed 4 years ago by
Cc: gidyn added
comment:21 Changed 4 years ago by
I strongly suspect that some of the fixes I've made while working on #7450 will help here, although I have yet to test that.
comment:22 Changed 4 years ago by
ticket:9669#comment:13 contains a summary of the performance impact of Phab:D1012 and Phab:D1040 on this and potentially-related ticket #7450.
comment:24 Changed 3 years ago by
Keywords: deriving-perf added
comment:25 Changed 3 years ago by
Milestone: 8.0.1
comment:26 Changed 3 years ago by
Cc: HairyDude added
comment:27 Changed 2 years ago by
Cc: RyanGlScott added
comment:28 Changed 16 months ago by
Cc: trommler added
comment:29 Changed 16 months ago by
Tobias found that the bottlenecks on W2 were
- Pretty-printing of assembly code
- Register allocation
He has some improvements in progress.
comment:30 Changed 16 months ago by
Related Tickets: → #13426
comment:31 Changed 16 months ago by
Related Tickets: #13426
Apparently this is an example of non-linearity in the pretty-printer library.
See Bartosz's summary here, and #14005 (which is about switching over to a new pretty-printing library).
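For intuition, here is a generic illustration of how such non-linearity typically arises (plain String appends standing in for the pretty-printer's document type; this is not the library's actual code):

-- Quadratic: foldl builds ((("" ++ "x") ++ "x") ++ ...), and each (++)
-- re-traverses everything accumulated so far.
leftNested :: Int -> String
leftNested n = foldl (++) "" (replicate n "x")

-- Linear: right-nested appends touch each character once.
rightNested :: Int -> String
rightNested n = foldr (++) "" (replicate n "x")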
comment:32 Changed 16 months ago by
Cc: heisenbug added
comment:33 follow-up: 35 Changed 16 months ago by
This commit:
http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5
...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would compute a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.
I believe there are further optimization opportunities here (see the sketch after this list), such as:
- changing the current list-based code to also use set/map operations
- moving the register class check into the candidates part
- precalculating the list of candidates (not sure about this one though)
comment:34 Changed 16 months ago by
http://git.haskell.org/ghc.git/commitdiff/375d4bd0fc2afd72617bc827bf63b5eeb24f2f7c
This one makes crucial parts of the pretty-printing code stricter; the modified functions score better in profiling, but the overall effect on compilation times seems to be minimal.
comment:35 follow-up: 36 Changed 16 months ago by
Replying to tdammers:
This commit:
http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5
...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would compute a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.
Hmmm, what do you think about keeping the Cartesian product below a small cut-off size? Manipulating the sets surely also has a (large) constant factor built in.
I believe there are further optimization opportunities here, such as:
- changing the current list-based code to also use set/map operations
- moving the register class check into the candidates part
- precalculating the list of candidates (not sure about this one though)
comment:36 Changed 16 months ago by
Replying to heisenbug:
Replying to tdammers:
This commit:
http://git.haskell.org/ghc.git/commitdiff/19a2ba3ea436b60466fc4022f53a6631e41b87a5
...reduces complexity in register allocation spilling from quadratic to logarithmic. The old version would compute a Cartesian product of the list of allocated registers (assig) and the list of registers to keep; the new version uses set operations instead, based on UniqSet / UniqFM. I've also moved things around a little to pre-filter the list of allocations to only include the interesting entries in the first place, reducing the number of list items from linear to (as far as I can tell) constant.

Hmmm, what do you think about keeping the Cartesian product below a small cut-off size? Manipulating the sets surely also has a (large) constant factor built in.
Could be worth it, but I'm a bit skeptical - the additional overhead of unpacking the sets into lists could easily cancel out the performance benefit, and it would make the code more complicated. I'll give it a try though.
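For reference, the hybrid being discussed might look something like the following sketch (Data.Set again standing in for UniqSet; the cut-off of 8 is an arbitrary placeholder that would need measuring):

import qualified Data.Set as Set

-- Use the cheap list scan when 'keep' is tiny, the set difference otherwise.
difference :: Ord a => [a] -> [a] -> [a]
difference xs keep
  | length keep < cutoff = [ x | x <- xs, x `notElem` keep ]
  | otherwise            = filter (`Set.notMember` keepSet) xs
  where
    keepSet = Set.fromList keep
    cutoff  = 8  -- placeholder; the right value is an empirical question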
comment:37 follow-up: 40 Changed 16 months ago by
While chatting with tdammers about this I had a peek at the Core; it seems that 5,000 of the 7,000 terms in the simplified Core of W2 are in creadPrec_rdaO. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.
comment:38 Changed 16 months ago by
Cc: dterei removed
comment:39 Changed 16 months ago by
Hmmm, what do you think about keeping the Cartesian product below a small cut-off size? Manipulating the sets surely also has a (large) constant factor built in.

I actually suspected this in the past, but dfeuer did some measurements in the context of ListSetUtils and found that sets aren't measurably different. Don't let that stop you from quickly trying, though.
comment:40 Changed 16 months ago by
Replying to bgamari:
While chatting with tdammers about this I had a peek at the Core; it seems that 5,000 of the 7,000 terms in the simplified Core of W2 are in creadPrec_rdaO. Moreover, much of this is repetition. I've proposed an approach for dealing with this in #14364.
I've just done a bit more profiling, and it turns out that removing the derived Read instance from the W3 example (W2 extended to 500 fields) cuts overall compilation time down from 25.5 seconds to 3.6. So it seems that the edge case isn't just "constructors with many fields", but rather "derived Read instances for constructors with many fields".
comment:41 Changed 16 months ago by
On further thought, #14364 is a bit of a stop-gap measure; generating more efficient Read instances is of course a good thing, but the register allocator edge case remains, and hand-written code (or TH-generated code) that follows a similar pattern will still hit it.
comment:42 Changed 16 months ago by
A few things I have tried that do not seem to make a difference:
- Pre-filtering the assig list to retain only the "interesting" entries (InReg and InBoth), and passing the filtered list around separately. While in theory this would get rid of unnecessary list-filtering passes, in practice the overhead of maintaining a second list in parallel cancels out anything we may have won.
- Collecting the InReg and InBoth candidate lists in a single pass (see the sketch after this list). The idea was that we would visit each list entry just once, putting it in one bin or the other, rather than traversing the list of candidates twice. In practice, however, there is no measurable performance benefit.
- Merging the filtering predicates (checking whether a candidate is InReg or InBoth, and checking whether the register class matches). The idea here was that doing both checks in one pass might help, but it doesn't: the performance is exactly the same.
Changed 16 months ago by
Attachment: profiles.csv added
Sample GHC profile data for W2.hs style record type, comparing just the record type, derived Read, derived Show, and a hand-written getLine-based applicative record construction function
Changed 16 months ago by
Attachment: ghc-large-record-types.png added
Performance profiling data for large record types.
comment:43 Changed 16 months ago by
Latest profiling results:
- graph plot: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types.png
- raw results: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/profiles.csv
- example code: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/examples.zip
The test module defines a record type of N fields, named field0 through field{N-1}, called DT; the performance metric is the "total time" value from the profiler output for GHC compiling the module on its own. The codenames refer to different versions of the test module (all of them defining the same DT type, though):
- nothing: just a raw record type, no instances or functions. This is the baseline.
- show: a derived Show instance (deriving (Show))
- read: a derived Read instance (deriving (Read))
- read-appl: a hand-written Read instance using applicative syntax (DT <$> a1 <*> a2 <*> a3 <*> ...)
- getline: a hand-written function consisting of a series of N monadic binds followed by a return of an N-argument call to the DT constructor. This is the same shape as what the derived Read and Show instances produce internally.
- getline-appl: a hand-written function that reads fields from stdin using getLine and applicatively combines them into a value of our record type: DT <$> (read <$> getLine) <*> (read <$> getLine) <*> ...
- getline-standalone: a set of N functions performing one getLine call each: getlD :: Int -> IO Int; getlD i = read <$> getLine
- show-standalone: a set of N functions, each constructing a string by combining an integer argument with the function's index
The -standalone and nothing flavors were added as controls.
Looking at the plot, the pattern is pretty obvious: the applicative versions show performance patterns very similar to the control, while the three versions that use many monadic binds all exhibit clearly nonlinear performance.
Changed 16 months ago by
Attachment: examples.zip added
Code examples for the large records performance plot
comment:44 Changed 16 months ago by
So apparently, good shape (linear-ish performance):
D <$> a1 <*> a2 <*> a2 <*> ... <*> aN
Bad shape (exponential-ish performance):
do v1 <- a1
   v2 <- a2
   v3 <- a3
   ...
   vN <- aN
   return $ D v1 v2 v3 ... vN
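To make the two shapes concrete, here are hypothetical 3-field versions (runnable as-is; the type D and the getLine-based bodies are invented for illustration):

data D = D Int Int Int deriving Show

-- Good shape: a flat chain of applicative combinators.
goodShape :: IO D
goodShape = D <$> (read <$> getLine)
              <*> (read <$> getLine)
              <*> (read <$> getLine)

-- Bad shape: N nested monadic binds, each lambda capturing all earlier
-- results, ending in a single N-argument constructor call.
badShape :: IO D
badShape = do
  v1 <- read <$> getLine
  v2 <- read <$> getLine
  v3 <- read <$> getLine
  return (D v1 v2 v3)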
comment:45 Changed 16 months ago by
Core for monadic-bind style Read instance:
[1 of 1] Compiling D ( examples/t-10-read.hs, examples/t-10-read.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 373, types: 324, coercions: 0}

-- RHS size: {terms: 269, types: 160, coercions: 0}
$creadPrec_r1Hl
$creadPrec_r1Hl =
  parens (prec (I# 11#)
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "DT"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "{"#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field0"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a1_a1tq ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field2"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a2_a1tr ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field3"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a3_a1ts ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field4"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a4_a1tt ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field5"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a5_a1tu ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field6"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a6_a1tv ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field7"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a7_a1tw ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field8"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a8_a1tx ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field9"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a9_a1ty ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# ","#)))
  (>> $fMonadReadPrec (expectP (Ident (unpackCString# "field10"#)))
  (>> $fMonadReadPrec (expectP (Punc (unpackCString# "="#)))
  (>>= $fMonadReadPrec (reset (readPrec $fReadInt)) (\ a10_a1tz ->
  >> $fMonadReadPrec (expectP (Punc (unpackCString# "}"#)))
  (return $fMonadReadPrec
    (DT a1_a1tq a2_a1tr a3_a1ts a4_a1tt a5_a1tu a6_a1tv a7_a1tw a8_a1tx a9_a1ty a10_a1tz)))))))))))))))))))))))))))))))))))))))))))))

Rec {
-- RHS size: {terms: 5, types: 1, coercions: 0}
$fReadDT
$fReadDT = C:Read $creadsPrec_r1I5 $creadList_r1I6 $creadPrec_r1Hl $creadListPrec_r1I7
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadsPrec_r1I5
$creadsPrec_r1I5 = $dmreadsPrec $fReadDT
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadList_r1I6
$creadList_r1I6 = readListDefault $fReadDT
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadListPrec_r1I7
$creadListPrec_r1I7 = readListPrecDefault $fReadDT
end Rec }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 = \ ds_d1H8 -> case ds_d1H8 of _ { DT ds1_d1H9 ds2_d1Ha ds3_d1Hb ds4_d1Hc ds5_d1Hd ds6_d1He ds7_d1Hf ds8_d1Hg ds9_d1Hh ds10_d1Hi -> ds9_d1Hh }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 = \ ds_d1GX -> case ds_d1GX of _ { DT ds1_d1GY ds2_d1GZ ds3_d1H0 ds4_d1H1 ds5_d1H2 ds6_d1H3 ds7_d1H4 ds8_d1H5 ds9_d1H6 ds10_d1H7 -> ds8_d1H5 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 = \ ds_d1GM -> case ds_d1GM of _ { DT ds1_d1GN ds2_d1GO ds3_d1GP ds4_d1GQ ds5_d1GR ds6_d1GS ds7_d1GT ds8_d1GU ds9_d1GV ds10_d1GW -> ds7_d1GT }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 = \ ds_d1GB -> case ds_d1GB of _ { DT ds1_d1GC ds2_d1GD ds3_d1GE ds4_d1GF ds5_d1GG ds6_d1GH ds7_d1GI ds8_d1GJ ds9_d1GK ds10_d1GL -> ds6_d1GH }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 = \ ds_d1Gq -> case ds_d1Gq of _ { DT ds1_d1Gr ds2_d1Gs ds3_d1Gt ds4_d1Gu ds5_d1Gv ds6_d1Gw ds7_d1Gx ds8_d1Gy ds9_d1Gz ds10_d1GA -> ds5_d1Gv }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 = \ ds_d1Gf -> case ds_d1Gf of _ { DT ds1_d1Gg ds2_d1Gh ds3_d1Gi ds4_d1Gj ds5_d1Gk ds6_d1Gl ds7_d1Gm ds8_d1Gn ds9_d1Go ds10_d1Gp -> ds4_d1Gj }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 = \ ds_d1G4 -> case ds_d1G4 of _ { DT ds1_d1G5 ds2_d1G6 ds3_d1G7 ds4_d1G8 ds5_d1G9 ds6_d1Ga ds7_d1Gb ds8_d1Gc ds9_d1Gd ds10_d1Ge -> ds3_d1G7 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 = \ ds_d1FT -> case ds_d1FT of _ { DT ds1_d1FU ds2_d1FV ds3_d1FW ds4_d1FX ds5_d1FY ds6_d1FZ ds7_d1G0 ds8_d1G1 ds9_d1G2 ds10_d1G3 -> ds2_d1FV }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 = \ ds_d1FI -> case ds_d1FI of _ { DT ds1_d1FJ ds2_d1FK ds3_d1FL ds4_d1FM ds5_d1FN ds6_d1FO ds7_d1FP ds8_d1FQ ds9_d1FR ds10_d1FS -> ds10_d1FS }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 = \ ds_d1Fx -> case ds_d1Fx of _ { DT ds1_d1Fy ds2_d1Fz ds3_d1FA ds4_d1FB ds5_d1FC ds6_d1FD ds7_d1FE ds8_d1FF ds9_d1FG ds10_d1FH -> ds1_d1Fy }
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1I8
$trModule1_r1I8 = TrNameS "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1I9
$trModule2_r1I9 = TrNameS "D"#
-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1I8 $trModule2_r1I9
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1Ia
$tc'DT1_r1Ia = TrNameS "'DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Ia
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1Ib
$tcDT1_r1Ib = TrNameS "DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Ib
And for applicative-style Read implementation:
[1 of 1] Compiling D ( examples/t-10-read-appl.hs, examples/t-10-read-appl.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 272, types: 324, coercions: 0}

-- RHS size: {terms: 168, types: 160, coercions: 0}
$creadsPrec_r1Gi
$creadsPrec_r1Gi =
  \ p_aNu ->
  readP_to_S
  (<* $fApplicativeReadP
  (*> $fApplicativeReadP (string (unpackCString# "DT{"#))
  (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP (<*> $fApplicativeReadP
  (<$> $fFunctorReadP DT (*> $fApplicativeReadP (string (unpackCString# "field0="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field1="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field2="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field3="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field4="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field5="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field6="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field7="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field8="#)) (readS_to_P (readsPrec $fReadInt p_aNu)))))
  (*> $fApplicativeReadP (string (unpackCString# ","#)) (*> $fApplicativeReadP (string (unpackCString# "field9="#)) (readS_to_P (readsPrec $fReadInt p_aNu))))))
  (string (unpackCString# "}"#)))

Rec {
-- RHS size: {terms: 5, types: 1, coercions: 0}
$fReadDT
$fReadDT = C:Read $creadsPrec_r1Gi $creadList_r1GT $creadPrec_r1GU $creadListPrec_r1GV
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadList_r1GT
$creadList_r1GT = $dmreadList $fReadDT
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadPrec_r1GU
$creadPrec_r1GU = $dmreadPrec $fReadDT
-- RHS size: {terms: 2, types: 1, coercions: 0}
$creadListPrec_r1GV
$creadListPrec_r1GV = $dmreadListPrec $fReadDT
end Rec }

-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 = \ ds_d1G5 -> case ds_d1G5 of _ { DT ds1_d1G6 ds2_d1G7 ds3_d1G8 ds4_d1G9 ds5_d1Ga ds6_d1Gb ds7_d1Gc ds8_d1Gd ds9_d1Ge ds10_d1Gf -> ds9_d1Ge }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 = \ ds_d1FU -> case ds_d1FU of _ { DT ds1_d1FV ds2_d1FW ds3_d1FX ds4_d1FY ds5_d1FZ ds6_d1G0 ds7_d1G1 ds8_d1G2 ds9_d1G3 ds10_d1G4 -> ds8_d1G2 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 = \ ds_d1FJ -> case ds_d1FJ of _ { DT ds1_d1FK ds2_d1FL ds3_d1FM ds4_d1FN ds5_d1FO ds6_d1FP ds7_d1FQ ds8_d1FR ds9_d1FS ds10_d1FT -> ds7_d1FQ }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 = \ ds_d1Fy -> case ds_d1Fy of _ { DT ds1_d1Fz ds2_d1FA ds3_d1FB ds4_d1FC ds5_d1FD ds6_d1FE ds7_d1FF ds8_d1FG ds9_d1FH ds10_d1FI -> ds6_d1FE }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 = \ ds_d1Fn -> case ds_d1Fn of _ { DT ds1_d1Fo ds2_d1Fp ds3_d1Fq ds4_d1Fr ds5_d1Fs ds6_d1Ft ds7_d1Fu ds8_d1Fv ds9_d1Fw ds10_d1Fx -> ds5_d1Fs }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 = \ ds_d1Fc -> case ds_d1Fc of _ { DT ds1_d1Fd ds2_d1Fe ds3_d1Ff ds4_d1Fg ds5_d1Fh ds6_d1Fi ds7_d1Fj ds8_d1Fk ds9_d1Fl ds10_d1Fm -> ds4_d1Fg }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 = \ ds_d1F1 -> case ds_d1F1 of _ { DT ds1_d1F2 ds2_d1F3 ds3_d1F4 ds4_d1F5 ds5_d1F6 ds6_d1F7 ds7_d1F8 ds8_d1F9 ds9_d1Fa ds10_d1Fb -> ds3_d1F4 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 = \ ds_d1EQ -> case ds_d1EQ of _ { DT ds1_d1ER ds2_d1ES ds3_d1ET ds4_d1EU ds5_d1EV ds6_d1EW ds7_d1EX ds8_d1EY ds9_d1EZ ds10_d1F0 -> ds2_d1ES }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 = \ ds_d1EF -> case ds_d1EF of _ { DT ds1_d1EG ds2_d1EH ds3_d1EI ds4_d1EJ ds5_d1EK ds6_d1EL ds7_d1EM ds8_d1EN ds9_d1EO ds10_d1EP -> ds10_d1EP }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 = \ ds_d1Eu -> case ds_d1Eu of _ { DT ds1_d1Ev ds2_d1Ew ds3_d1Ex ds4_d1Ey ds5_d1Ez ds6_d1EA ds7_d1EB ds8_d1EC ds9_d1ED ds10_d1EE -> ds1_d1Ev }
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1GW
$trModule1_r1GW = TrNameS "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1GX
$trModule2_r1GX = TrNameS "D"#
-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1GW $trModule2_r1GX
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1GY
$tc'DT1_r1GY = TrNameS "'DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1GY
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1GZ
$tcDT1_r1GZ = TrNameS "DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1GZ
comment:46 Changed 16 months ago by
getLine examples:
monadic-bind style:
[1 of 1] Compiling D ( examples/t-10-getline.hs, examples/t-10-getline.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 182, types: 231, coercions: 0}

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 = \ ds_d1JC -> case ds_d1JC of _ { DT ds1_d1JD ds2_d1JE ds3_d1JF ds4_d1JG ds5_d1JH ds6_d1JI ds7_d1JJ ds8_d1JK ds9_d1JL ds10_d1JM -> ds10_d1JM }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 = \ ds_d1Jr -> case ds_d1Jr of _ { DT ds1_d1Js ds2_d1Jt ds3_d1Ju ds4_d1Jv ds5_d1Jw ds6_d1Jx ds7_d1Jy ds8_d1Jz ds9_d1JA ds10_d1JB -> ds9_d1JA }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 = \ ds_d1Jg -> case ds_d1Jg of _ { DT ds1_d1Jh ds2_d1Ji ds3_d1Jj ds4_d1Jk ds5_d1Jl ds6_d1Jm ds7_d1Jn ds8_d1Jo ds9_d1Jp ds10_d1Jq -> ds8_d1Jo }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 = \ ds_d1J5 -> case ds_d1J5 of _ { DT ds1_d1J6 ds2_d1J7 ds3_d1J8 ds4_d1J9 ds5_d1Ja ds6_d1Jb ds7_d1Jc ds8_d1Jd ds9_d1Je ds10_d1Jf -> ds7_d1Jc }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 = \ ds_d1IU -> case ds_d1IU of _ { DT ds1_d1IV ds2_d1IW ds3_d1IX ds4_d1IY ds5_d1IZ ds6_d1J0 ds7_d1J1 ds8_d1J2 ds9_d1J3 ds10_d1J4 -> ds6_d1J0 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 = \ ds_d1IJ -> case ds_d1IJ of _ { DT ds1_d1IK ds2_d1IL ds3_d1IM ds4_d1IN ds5_d1IO ds6_d1IP ds7_d1IQ ds8_d1IR ds9_d1IS ds10_d1IT -> ds5_d1IO }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 = \ ds_d1Iy -> case ds_d1Iy of _ { DT ds1_d1Iz ds2_d1IA ds3_d1IB ds4_d1IC ds5_d1ID ds6_d1IE ds7_d1IF ds8_d1IG ds9_d1IH ds10_d1II -> ds4_d1IC }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 = \ ds_d1In -> case ds_d1In of _ { DT ds1_d1Io ds2_d1Ip ds3_d1Iq ds4_d1Ir ds5_d1Is ds6_d1It ds7_d1Iu ds8_d1Iv ds9_d1Iw ds10_d1Ix -> ds3_d1Iq }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 = \ ds_d1Ic -> case ds_d1Ic of _ { DT ds1_d1Id ds2_d1Ie ds3_d1If ds4_d1Ig ds5_d1Ih ds6_d1Ii ds7_d1Ij ds8_d1Ik ds9_d1Il ds10_d1Im -> ds2_d1Ie }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 = \ ds_d1I1 -> case ds_d1I1 of _ { DT ds1_d1I2 ds2_d1I3 ds3_d1I4 ds4_d1I5 ds5_d1I6 ds6_d1I7 ds7_d1I8 ds8_d1I9 ds9_d1Ia ds10_d1Ib -> ds1_d1I2 }
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1EM
$trModule1_r1EM = TrNameS "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1Ke
$trModule2_r1Ke = TrNameS "D"#
-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1EM $trModule2_r1Ke
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1Kf
$tc'DT1_r1Kf = TrNameS "'DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1Kf
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1Kg
$tcDT1_r1Kg = TrNameS "DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1Kg
-- RHS size: {terms: 93, types: 82, coercions: 0}
getlD
getlD =
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field1_aGE ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field12_aGF ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field13_aGG ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field14_aGH ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field15_aGI ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field16_aGJ ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field17_aGK ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field18_aGL ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field19_aGM ->
  >>= $fMonadIO (<$> $fFunctorIO (read $fReadInt) getLine) (\ field20_aGN ->
  return $fMonadIO
    (DT field1_aGE field12_aGF field13_aGG field14_aGH field15_aGI field16_aGJ field17_aGK field18_aGL field19_aGM field20_aGN)))))))))))
vs. applicative:
[1 of 1] Compiling D ( examples/t-10-getline-appl.hs, examples/t-10-getline-appl.o )

==================== Tidy Core ====================
Result size of Tidy Core = {terms: 160, types: 264, coercions: 0}

-- RHS size: {terms: 5, types: 12, coercions: 0}
field10
field10 = \ ds_d1Je -> case ds_d1Je of _ { DT ds1_d1Jf ds2_d1Jg ds3_d1Jh ds4_d1Ji ds5_d1Jj ds6_d1Jk ds7_d1Jl ds8_d1Jm ds9_d1Jn ds10_d1Jo -> ds10_d1Jo }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field9
field9 = \ ds_d1J3 -> case ds_d1J3 of _ { DT ds1_d1J4 ds2_d1J5 ds3_d1J6 ds4_d1J7 ds5_d1J8 ds6_d1J9 ds7_d1Ja ds8_d1Jb ds9_d1Jc ds10_d1Jd -> ds9_d1Jc }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field8
field8 = \ ds_d1IS -> case ds_d1IS of _ { DT ds1_d1IT ds2_d1IU ds3_d1IV ds4_d1IW ds5_d1IX ds6_d1IY ds7_d1IZ ds8_d1J0 ds9_d1J1 ds10_d1J2 -> ds8_d1J0 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field7
field7 = \ ds_d1IH -> case ds_d1IH of _ { DT ds1_d1II ds2_d1IJ ds3_d1IK ds4_d1IL ds5_d1IM ds6_d1IN ds7_d1IO ds8_d1IP ds9_d1IQ ds10_d1IR -> ds7_d1IO }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field6
field6 = \ ds_d1Iw -> case ds_d1Iw of _ { DT ds1_d1Ix ds2_d1Iy ds3_d1Iz ds4_d1IA ds5_d1IB ds6_d1IC ds7_d1ID ds8_d1IE ds9_d1IF ds10_d1IG -> ds6_d1IC }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field5
field5 = \ ds_d1Il -> case ds_d1Il of _ { DT ds1_d1Im ds2_d1In ds3_d1Io ds4_d1Ip ds5_d1Iq ds6_d1Ir ds7_d1Is ds8_d1It ds9_d1Iu ds10_d1Iv -> ds5_d1Iq }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field4
field4 = \ ds_d1Ia -> case ds_d1Ia of _ { DT ds1_d1Ib ds2_d1Ic ds3_d1Id ds4_d1Ie ds5_d1If ds6_d1Ig ds7_d1Ih ds8_d1Ii ds9_d1Ij ds10_d1Ik -> ds4_d1Ie }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field3
field3 = \ ds_d1HZ -> case ds_d1HZ of _ { DT ds1_d1I0 ds2_d1I1 ds3_d1I2 ds4_d1I3 ds5_d1I4 ds6_d1I5 ds7_d1I6 ds8_d1I7 ds9_d1I8 ds10_d1I9 -> ds3_d1I2 }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field2
field2 = \ ds_d1HO -> case ds_d1HO of _ { DT ds1_d1HP ds2_d1HQ ds3_d1HR ds4_d1HS ds5_d1HT ds6_d1HU ds7_d1HV ds8_d1HW ds9_d1HX ds10_d1HY -> ds2_d1HQ }
-- RHS size: {terms: 5, types: 12, coercions: 0}
field0
field0 = \ ds_d1HD -> case ds_d1HD of _ { DT ds1_d1HE ds2_d1HF ds3_d1HG ds4_d1HH ds5_d1HI ds6_d1HJ ds7_d1HK ds8_d1HL ds9_d1HM ds10_d1HN -> ds1_d1HE }
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule1_r1Ey
$trModule1_r1Ey = TrNameS "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0}
$trModule2_r1JQ
$trModule2_r1JQ = TrNameS "D"#
-- RHS size: {terms: 3, types: 0, coercions: 0}
$trModule
$trModule = Module $trModule1_r1Ey $trModule2_r1JQ
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tc'DT1_r1JR
$tc'DT1_r1JR = TrNameS "'DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tc'DT
$tc'DT = TyCon 9521127001609462311## 17424978011088396301## $trModule $tc'DT1_r1JR
-- RHS size: {terms: 2, types: 0, coercions: 0}
$tcDT1_r1JS
$tcDT1_r1JS = TrNameS "DT"#
-- RHS size: {terms: 5, types: 0, coercions: 0}
$tcDT
$tcDT = TyCon 14693474152448962618## 5168028270650093369## $trModule $tcDT1_r1JS
-- RHS size: {terms: 71, types: 115, coercions: 0}
getlD
getlD =
  <*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO (<*> $fApplicativeIO
  (<$> $fFunctorIO DT (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine))
  (<$> $fFunctorIO (read $fReadInt) getLine)
comment:48 follow-up: 50 Changed 16 months ago by
Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true before. Ditto getline-appl seems worse than getline (which is invisible).
comment:49 follow-up: 51 Changed 16 months ago by
Here's a mystery.
- I took t-10-getline.hs and t-10-getline-appl.hs from examples.zip just before comment:44
- I compiled both with -O -dshow-passes -ddump-simpl.
- Both produced identical Core. In the early stages getline-appl is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early.)
I am at a loss for how to reconcile these results with the dramatically worse compile times for getline compared with getline-appl reported in the graph plot in comment:43.
Tobias, any ideas?
comment:50 Changed 16 months ago by
Replying to simonpj:
Is the labelling in the graph after comment:46 correct? read-appl looks worst by a long way, which wasn't true before. Ditto getline-appl seems worse than getline (which is invisible).
Yes, it is correct; the difference between this new graph and the previous one is that the new one uses -O2, while the old one doesn't use any optimizations. The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples: RegAlloc-linear and ppr, which are the most serious offenders in the unoptimized run, account for less than 1% of execution time in the optimized case.
comment:51 Changed 16 months ago by
Replying to simonpj:
Here's a mystery.
- I took t-10-getline.hs and t-10-getline-appl.hs from examples.zip just before comment:44
- I compiled both with -O -dshow-passes -ddump-simpl.
- Both produced identical Core. In the early stages getline-appl is a little bigger, but they become identical after the first run of the simplifier following specialise/float-out. (I.e. early.)

I am at a loss for how to reconcile these results with the dramatically worse compile times for getline compared with getline-appl reported in the graph plot in comment:43.

Tobias, any ideas?
Hmm, maybe -O vs. -O2? I'll run a few trials myself.
Other thought, though probably not really an explanation: the t-10-XXXX examples are maybe too small to produce useful metrics; the performance differences only start to get dramatic past 100 fields or so.
comment:52 Changed 16 months ago by
the t-10-XXXX examples are maybe too small to produce useful metrics,
I wasn't measuring performance; I was just eye-balling the (identical) code.
The profiler breakdown looks radically different between the optimized and unoptimized versions, for *all* examples
OK, so there are two perhaps-different problems
- With -O, compiling DynFlags is jolly slow. (The original bug report.)
- Without -O, we get non-linear stuff going on (comment:43).
I think the -O case is the higher-priority item. (Though I accept that compiling without -O is supposed to be fast.)
I would stick to -O for now, not -O2. The latter involves SpecConstr and LiberateCase, both of which can generate a lot of code. Again, let's not ignore -O2, but let's nail -O first.
So what do the -O graphs look like?
comment:53 Changed 16 months ago by
I don't have -O graphs yet; unfortunately it takes a good while to generate them (I aborted the -O2 run after it had run all night and only made it to 420 fields). I'll start up a job and report back. One-shot experiments do suggest, though, that pretty-printing and register allocation aren't going to be bottlenecks here either.
Changed 16 months ago by
Attachment: ghc-large-record-types-optimized.png added
Performance graph for various examples, compiling with -O
Changed 16 months ago by
Attachment: ghc-large-record-types-optimized-O2.png added
Performance graph for various examples, compiling with -O2
comment:56 Changed 16 months ago by
Updated performance graph: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-optimized.png (this one uses -O, not -O2)
Key observation: read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.
A sample profile, like the one below for the 400-field read example, shows that the bottleneck is no longer ASM output or register allocation once -O is involved:
Thu Oct 26 01:25 2017 Time and Allocation Profiling Report (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs

   total time  = 128.62 secs (128624 ticks @ 1000 us, 1 processor)
   total alloc = 37,659,351,320 bytes (excludes profiling overheads)

COST CENTRE      MODULE       SRC                                                  %time %alloc
sink             CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)         77.8   23.6
SimplTopBinds    SimplCore    compiler/simplCore/SimplCore.hs:761:39-74              9.6   30.0
FloatOutwards    SimplCore    compiler/simplCore/SimplCore.hs:471:40-66              3.0   10.4
RegAlloc-linear  AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)     1.6    5.0
OccAnal          SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)      1.1    4.2
pprNativeCode    AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)     1.0    4.7
NewStranal       SimplCore    compiler/simplCore/SimplCore.hs:480:40-63              0.9    4.2
CoreTidy         HscMain      compiler/main/HscMain.hs:1253:27-67                    0.7    2.9
StgCmm           HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)           0.6    1.9
regLiveness      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)     0.5    1.4
genMachCode      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)     0.4    1.3
NativeCodeGen    CodeOutput   compiler/main/CodeOutput.hs:171:18-78                  0.4    1.6
CommonSubExpr    SimplCore    compiler/simplCore/SimplCore.hs:462:40-56              0.2    1.1
deSugar          HscMain      compiler/main/HscMain.hs:511:7-44                      0.2    1.1
comment:57 follow-up: 58 Changed 16 months ago by
read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.
Very good!
sink           CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)   77.8  23.6
SimplTopBinds  SimplCore    compiler/simplCore/SimplCore.hs:761:39-74        9.6  30.0
FloatOutwards  SimplCore    compiler/simplCore/SimplCore.hs:471:40-66        3.0  10.4
What does -dshow-passes show about the program size? I.e. are we getting very large Core?
So sink in C-- seems very slow. Nearly four-fifths of all compile time goes to this pass alone. Is this a non-linear algorithmic effect? E.g. if you double the size of the input, does the sink time go up non-linearly while SimplTopBinds goes up linearly? That's my guess.

(Related question: does sink show up as a big item when compiling "normal" programs?)
Guessing: the "shape" of the C-- code makes sink behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from?
This is good... fixing these performance bumps will likely help ALL programs, even ones that don't hit the really bad spots.
comment:58 Changed 16 months ago by
Replying to simonpj:
read, read-appl and show "explode" while all the other examples, including getline-appl and getline, behave properly.
Very good!
sink           CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)   77.8  23.6
SimplTopBinds  SimplCore    compiler/simplCore/SimplCore.hs:761:39-74        9.6  30.0
FloatOutwards  SimplCore    compiler/simplCore/SimplCore.hs:471:40-66        3.0  10.4

What does -dshow-passes show about the program size? I.e. are we getting very large Core?
For a 100-field read example:
Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1
Using binary package database: /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d/package.cache
package flags []
loading package database /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d
wired-in package ghc-prim mapped to ghc-prim-0.5.2.0
wired-in package integer-gmp mapped to integer-gmp-1.0.1.0
wired-in package base mapped to base-4.11.0.0
wired-in package rts mapped to rts
wired-in package template-haskell mapped to template-haskell-2.13.0.0
wired-in package ghc mapped to ghc-8.3
wired-in package dph-seq not found.
wired-in package dph-par not found.
*** Checking old interface for D (use -ddump-hi-diffs for more details):
*** Parser [D]:
!!! Parser [D]: finished in 3.22 milliseconds, allocated 3.170 megabytes
*** Renamer/typechecker [D]:
!!! Renamer/typechecker [D]: finished in 124.33 milliseconds, allocated 72.473 megabytes
*** Desugar [D]:
Result size of Desugar (after optimization) = {terms: 2,667, types: 11,534, coercions: 0, joins: 0/0}
!!! Desugar [D]: finished in 174.52 milliseconds, allocated 59.456 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 7,375, types: 18,990, coercions: 6,881, joins: 0/913}
Result size of Simplifier iteration=2 = {terms: 3,722, types: 13,604, coercions: 1,351, joins: 0/198}
Result size of Simplifier iteration=3 = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
Result size of Simplifier = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
!!! Simplifier [D]: finished in 1306.06 milliseconds, allocated 749.648 megabytes
*** Specialise [D]:
Result size of Specialise = {terms: 3,319, types: 12,598, coercions: 1,342, joins: 0/1}
!!! Specialise [D]: finished in 19.96 milliseconds, allocated 17.006 megabytes
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]:
Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) = {terms: 4,923, types: 15,504, coercions: 1,342, joins: 0/0}
!!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: finished in 122.82 milliseconds, allocated 103.551 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0}
Result size of Simplifier = {terms: 4,115, types: 13,888, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 216.40 milliseconds, allocated 122.592 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0}
Result size of Simplifier = {terms: 4,115, types: 13,282, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 183.50 milliseconds, allocated 131.494 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
Result size of Simplifier = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 183.89 milliseconds, allocated 126.382 megabytes
*** Float inwards [D]:
Result size of Float inwards = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Float inwards [D]: finished in 8.67 milliseconds, allocated 4.238 megabytes
*** Called arity analysis [D]:
Result size of Called arity analysis = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Called arity analysis [D]: finished in 5.66 milliseconds, allocated 5.974 megabytes
*** Simplifier [D]:
Result size of Simplifier = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 67.09 milliseconds, allocated 62.910 megabytes
*** Demand analysis [D]:
Result size of Demand analysis = {terms: 4,523, types: 14,200, coercions: 1,342, joins: 0/0}
!!! Demand analysis [D]: finished in 170.27 milliseconds, allocated 91.193 megabytes
*** Worker Wrapper binds [D]:
Result size of Worker Wrapper binds = {terms: 4,541, types: 14,225, coercions: 1,342, joins: 0/3}
!!! Worker Wrapper binds [D]: finished in 5.09 milliseconds, allocated 3.614 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
Result size of Simplifier = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 351.93 milliseconds, allocated 219.389 megabytes
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]:
Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) = {terms: 4,530, types: 14,215, coercions: 1,342, joins: 0/0}
!!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: finished in 268.13 milliseconds, allocated 280.555 megabytes
*** Common sub-expression [D]:
Result size of Common sub-expression = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0}
!!! Common sub-expression [D]: finished in 38.50 milliseconds, allocated 48.593 megabytes
*** Float inwards [D]:
Result size of Float inwards = {terms: 3,740, types: 13,522, coercions: 1,342, joins: 0/0}
!!! Float inwards [D]: finished in 5.37 milliseconds, allocated 3.369 megabytes
*** Simplifier [D]:
Result size of Simplifier = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! Simplifier [D]: finished in 140.01 milliseconds, allocated 98.855 megabytes
*** Demand analysis [D]:
Result size of Demand analysis = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! Demand analysis [D]: finished in 77.63 milliseconds, allocated 86.076 megabytes
*** CoreTidy [D]:
Result size of Tidy Core = {terms: 2,952, types: 12,337, coercions: 1,342, joins: 0/0}
!!! CoreTidy [D]: finished in 54.47 milliseconds, allocated 69.051 megabytes
Created temporary directory: /tmp/ghc27776_0
*** CorePrep [D]:
Result size of CorePrep = {terms: 3,569, types: 13,155, coercions: 1,342, joins: 0/204}
!!! CorePrep [D]: finished in 25.73 milliseconds, allocated 32.097 megabytes
*** Stg2Stg:
*** CodeGen [D]:
!!! CodeGen [D]: finished in 2109.26 milliseconds, allocated 1630.642 megabytes
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc27776_0/ghc_2.c
*** Deleting temp dirs:
At 200 fields, we're looking at:
Glasgow Haskell Compiler, Version 8.3.20171016, stage 2 booted by GHC version 8.0.1
Using binary package database: /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d/package.cache
package flags []
loading package database /home/tobias/well-typed/devel/ghc/inplace/lib/package.conf.d
wired-in package ghc-prim mapped to ghc-prim-0.5.2.0
wired-in package integer-gmp mapped to integer-gmp-1.0.1.0
wired-in package base mapped to base-4.11.0.0
wired-in package rts mapped to rts
wired-in package template-haskell mapped to template-haskell-2.13.0.0
wired-in package ghc mapped to ghc-8.3
wired-in package dph-seq not found.
wired-in package dph-par not found.
*** Checking old interface for D (use -ddump-hi-diffs for more details):
*** Parser [D]:
!!! Parser [D]: finished in 3.80 milliseconds, allocated 6.207 megabytes
*** Renamer/typechecker [D]:
!!! Renamer/typechecker [D]: finished in 230.12 milliseconds, allocated 138.058 megabytes
*** Desugar [D]:
Result size of Desugar (after optimization) = {terms: 5,267, types: 43,034, coercions: 0, joins: 0/0}
!!! Desugar [D]: finished in 289.66 milliseconds, allocated 201.767 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 14,575, types: 57,790, coercions: 13,681, joins: 0/1,813}
Result size of Simplifier iteration=2 = {terms: 7,322, types: 47,104, coercions: 2,651, joins: 0/398}
Result size of Simplifier iteration=3 = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
Result size of Simplifier = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
!!! Simplifier [D]: finished in 5652.80 milliseconds, allocated 2700.163 megabytes
*** Specialise [D]:
Result size of Specialise = {terms: 6,519, types: 45,098, coercions: 2,642, joins: 0/1}
!!! Specialise [D]: finished in 62.79 milliseconds, allocated 58.261 megabytes
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]:
Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) = {terms: 9,723, types: 50,904, coercions: 2,642, joins: 0/0}
!!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) [D]: finished in 401.80 milliseconds, allocated 374.441 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0}
Result size of Simplifier = {terms: 8,115, types: 47,688, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 754.47 milliseconds, allocated 393.159 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0}
Result size of Simplifier = {terms: 8,115, types: 46,482, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 579.29 milliseconds, allocated 411.892 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
Result size of Simplifier = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 630.25 milliseconds, allocated 382.851 megabytes
*** Float inwards [D]:
Result size of Float inwards = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Float inwards [D]: finished in 8.17 milliseconds, allocated 8.300 megabytes
*** Called arity analysis [D]:
Result size of Called arity analysis = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Called arity analysis [D]: finished in 21.61 milliseconds, allocated 11.860 megabytes
*** Simplifier [D]:
Result size of Simplifier = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 362.68 milliseconds, allocated 190.967 megabytes
*** Demand analysis [D]:
Result size of Demand analysis = {terms: 8,923, types: 48,300, coercions: 2,642, joins: 0/0}
!!! Demand analysis [D]: finished in 551.36 milliseconds, allocated 339.529 megabytes
*** Worker Wrapper binds [D]:
Result size of Worker Wrapper binds = {terms: 8,941, types: 48,325, coercions: 2,642, joins: 0/3}
!!! Worker Wrapper binds [D]: finished in 7.70 milliseconds, allocated 11.542 megabytes
*** Simplifier [D]:
Result size of Simplifier iteration=1 = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
Result size of Simplifier = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 1307.95 milliseconds, allocated 727.784 megabytes
*** Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]:
Result size of Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) = {terms: 8,930, types: 48,315, coercions: 2,642, joins: 0/0}
!!! Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) [D]: finished in 1393.21 milliseconds, allocated 1078.391 megabytes
*** Common sub-expression [D]:
Result size of Common sub-expression = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0}
!!! Common sub-expression [D]: finished in 227.12 milliseconds, allocated 194.184 megabytes
*** Float inwards [D]:
Result size of Float inwards = {terms: 7,340, types: 46,922, coercions: 2,642, joins: 0/0}
!!! Float inwards [D]: finished in 11.63 milliseconds, allocated 6.545 megabytes
*** Simplifier [D]:
Result size of Simplifier = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! Simplifier [D]: finished in 590.60 milliseconds, allocated 349.746 megabytes
*** Demand analysis [D]:
Result size of Demand analysis = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! Demand analysis [D]: finished in 485.19 milliseconds, allocated 330.909 megabytes
*** CoreTidy [D]:
Result size of Tidy Core = {terms: 5,752, types: 44,537, coercions: 2,642, joins: 0/0}
!!! CoreTidy [D]: finished in 293.13 milliseconds, allocated 258.898 megabytes
Created temporary directory: /tmp/ghc18725_0
*** CorePrep [D]:
Result size of CorePrep = {terms: 6,969, types: 46,155, coercions: 2,642, joins: 0/404}
!!! CorePrep [D]: finished in 163.49 milliseconds, allocated 113.842 megabytes
*** Stg2Stg:
*** CodeGen [D]:
!!! CodeGen [D]: finished in 23041.19 milliseconds, allocated 12665.144 megabytes
*** Assembler:
*** Deleting temp files:
Warning: deleting non-existent /tmp/ghc18725_0/ghc_2.c
*** Deleting temp dirs:
Not sure whether this qualifies as "unusually large". -ddump-cmm shows that the C-- for a 100-field read instance is more than twice as large as that for the 100-field getline example (39234 lines vs 14895), but size alone doesn't explain the huge difference.

However, at 300 fields, the read example produces 237541 lines of C--, while getline grows roughly linearly, to 47312 lines. So there is definitely some nonlinearity in C-- code size: tripling the field count multiplies read's C-- output by about six, but getline's by only about three.
So sink in C-- seems very slow. Nearly four-fifths of all compile time goes to this pass alone. Is this a non-linear algorithmic effect? E.g. if you double the size of the input, does the sink time go up non-linearly while SimplTopBinds goes up linearly? That's my guess.
Yes, in fact it does, quite clearly so:
10-field example:
Wed Oct 25 14:57 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-10-read.hs

total time  =        0.19 secs   (188 ticks @ 1000 us, 1 processor)
total alloc = 134,109,112 bytes  (excludes profiling overheads)

COST CENTRE         MODULE       SRC                                                 %time %alloc
SimplTopBinds       SimplCore    compiler/simplCore/SimplCore.hs:761:39-74            35.6   28.7
tc_rn_src_decls     TcRnDriver   compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7)      9.0    9.0
OccAnal             SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     4.8    5.5
FloatOutwards       SimplCore    compiler/simplCore/SimplCore.hs:471:40-66             4.3    5.1
pprNativeCode       AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    3.7    3.5
tcRnImports         TcRnDriver   compiler/typecheck/TcRnDriver.hs:240:20-50            3.7    3.6
StgCmm              HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)          3.7    2.1
sink                CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)         3.2    1.5
setSessionDynFlags  GHC          compiler/main/GHC.hs:(578,1)-(584,16)                 3.2    4.7
NativeCodeGen       CodeOutput   compiler/main/CodeOutput.hs:171:18-78                 2.1    0.9
NewStranal          SimplCore    compiler/simplCore/SimplCore.hs:480:40-63             2.1    2.9
bin_tycldecls       HscTypes     compiler/main/HscTypes.hs:1085:52-57                  1.6    1.0
initGhcMonad        GHC          compiler/main/GHC.hs:(493,1)-(503,25)                 1.6    4.7
MAIN                MAIN         <built-in>                                            1.1    0.3
fixStgRegisters     AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:566:17-42            1.1    0.2
genMachCode         AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.1    1.3
shortcutBranches    AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:694:17-54            1.1    0.2
elimCommonBlocks    CmmPipeline  compiler/cmm/CmmPipeline.hs:(71,13)-(72,76)           1.1    1.0
CAF                 PrimOp       <entire-module>                                       1.1    0.5
CAF                 PrelInfo     <entire-module>                                       1.1    1.0
deSugar             HscMain      compiler/main/HscMain.hs:511:7-44                     1.1    0.9
CorePrep            HscMain      compiler/main/HscMain.hs:(1313,24)-(1314,57)          1.1    0.6
withCleanupSession  GHC          compiler/main/GHC.hs:(466,1)-(475,37)                 1.1    0.8
regLiveness         AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    0.5    2.0
RegAlloc-linear     AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    0.5    2.4
layoutStack         CmmPipeline  compiler/cmm/CmmPipeline.hs:(97,13)-(99,40)           0.5    1.0
CoreTidy            HscMain      compiler/main/HscMain.hs:1253:27-67                   0.5    1.0
100 fields:
Wed Oct 25 15:00 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-100-read.hs

total time  =        3.10 secs   (3103 ticks @ 1000 us, 1 processor)
total alloc = 2,397,379,664 bytes  (excludes profiling overheads)

COST CENTRE      MODULE       SRC                                                 %time %alloc
SimplTopBinds    SimplCore    compiler/simplCore/SimplCore.hs:761:39-74            38.9   36.6
sink             CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        16.0    7.3
FloatOutwards    SimplCore    compiler/simplCore/SimplCore.hs:471:40-66             7.9   10.8
OccAnal          SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     4.3    5.8
pprNativeCode    AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    4.1    5.3
RegAlloc-linear  AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    3.8    4.6
StgCmm           HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.8    2.5
NewStranal       SimplCore    compiler/simplCore/SimplCore.hs:480:40-63             2.8    4.8
regLiveness      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.8    1.9
CoreTidy         HscMain      compiler/main/HscMain.hs:1253:27-67                   1.8    2.9
genMachCode      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.8    1.6
tc_rn_src_decls  TcRnDriver   compiler/typecheck/TcRnDriver.hs:(493,4)-(555,7)      1.3    1.3
NativeCodeGen    CodeOutput   compiler/main/CodeOutput.hs:171:18-78                 1.0    1.7
CommonSubExpr    SimplCore    compiler/simplCore/SimplCore.hs:462:40-56             0.8    1.2
deSugar          HscMain      compiler/main/HscMain.hs:511:7-44                     0.8    1.3
400 fields:
Thu Oct 26 01:25 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-read.hs

total time  =      128.62 secs   (128624 ticks @ 1000 us, 1 processor)
total alloc = 37,659,351,320 bytes  (excludes profiling overheads)

COST CENTRE      MODULE       SRC                                                 %time %alloc
sink             CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        77.8   23.6
SimplTopBinds    SimplCore    compiler/simplCore/SimplCore.hs:761:39-74             9.6   30.0
FloatOutwards    SimplCore    compiler/simplCore/SimplCore.hs:471:40-66             3.0   10.4
RegAlloc-linear  AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    1.6    5.0
OccAnal          SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.1    4.2
pprNativeCode    AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    1.0    4.7
NewStranal       SimplCore    compiler/simplCore/SimplCore.hs:480:40-63             0.9    4.2
CoreTidy         HscMain      compiler/main/HscMain.hs:1253:27-67                   0.7    2.9
StgCmm           HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)          0.6    1.9
regLiveness      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    0.5    1.4
genMachCode      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.4    1.3
NativeCodeGen    CodeOutput   compiler/main/CodeOutput.hs:171:18-78                 0.4    1.6
CommonSubExpr    SimplCore    compiler/simplCore/SimplCore.hs:462:40-56             0.2    1.1
deSugar          HscMain      compiler/main/HscMain.hs:511:7-44                     0.2    1.1
(Related question: does `sink` show up as a big item when compiling "normal" programs?)
It does, but nowhere near this dominantly. E.g. getline, 400 fields:
Thu Oct 26 02:54 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c examples/t-400-getline.hs

total time  =       14.13 secs   (14126 ticks @ 1000 us, 1 processor)
total alloc = 12,479,030,504 bytes  (excludes profiling overheads)

COST CENTRE    MODULE       SRC                                                 %time %alloc
sink           CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        31.7    3.3
SimplTopBinds  SimplCore    compiler/simplCore/SimplCore.hs:761:39-74            24.7   39.0
OccAnal        SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     7.2    9.7
NewStranal     SimplCore    compiler/simplCore/SimplCore.hs:480:40-63             6.5   10.4
CoreTidy       HscMain      compiler/main/HscMain.hs:1253:27-67                   5.2    7.6
FloatOutwards  SimplCore    compiler/simplCore/SimplCore.hs:471:40-66             4.1    6.2
StgCmm         HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.0    1.6
deSugar        HscMain      compiler/main/HscMain.hs:511:7-44                     2.0    3.3
CommonSubExpr  SimplCore    compiler/simplCore/SimplCore.hs:462:40-56             1.9    3.2
CorePrep       HscMain      compiler/main/HscMain.hs:(1313,24)-(1314,57)          1.4    2.5
Specialise     SimplCore    compiler/simplCore/SimplCore.hs:486:40-50             0.8    1.3
StgCse         SimplStg     compiler/simplStg/SimplStg.hs:(88,14)-(91,44)         0.5    1.3
Guessing: the "shape" of the C-- code makes `sink` behave very badly. Can you characterise what that shape is, and where the algorithmic badness comes from?
Not yet, but I have a hunch that it's simply a matter of producing too much of it. Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields? That seems like a plausible explanation so far.
comment:59 Changed 16 months ago by
> Maybe there is quadratic behavior there, where the number of lines of C-- grows proportionally with the square of the number of fields?
But then all the C-- passes would get slower, wouldn't they? Still it'd be easy to measure code size of
- Core
- C--
as the number of fields increases.
My guess is that there's an algorithmic blow-up (quadratic or worse) in `sink`, so that it starts to dominate when there is a very large `CmmProc`.
comment:60 Changed 16 months ago by
Tobias, I took a quick glance over CmmSink and `skipped` looks suspicious: for every assignment `l = e` which is not sunk, `l` is consed onto `skipped`. Further, for each assignment `l = e` we check (a) whether `l` is an element of `skipped`, and (b) whether any register in `skipped` is mentioned in `e`:
cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments]
             || l `elem` skipped
             || not (okToInline dflags rhs node)
This sounds accidentally quadratic to me. Considering the amount of code generated, this could well matter.
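To make the suspicion concrete, here is a minimal sketch of the pattern (a toy model only, not the actual CmmSink code):

-- Toy model of the suspected accidental quadratic: walking n assignments
-- while consing every skipped binder onto a plain list and testing
-- membership with 'elem' does O(i) work at step i, so O(n^2) overall.
walk :: [Int]    -- skipped so far
     -> [Int]    -- assignments still to process
     -> [Int]
walk skipped []     = skipped
walk skipped (l:ls)
  | l `elem` skipped = walk skipped ls         -- linear scan every time
  | otherwise        = walk (l : skipped) ls   -- skipped only ever grows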
comment:61 follow-up: 64 Changed 16 months ago by
Indeed. Actually any use of `elem` on a long list is suspicious; c.f. `ListSetOps.unionLists`, which warns if it sees long lists.

Maybe we could have `ListSetOps.checkedElem`, which checks for over-long lists and warns, and change all uses of `elem` into `checkedElem`. There may be more of these land-mines!
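For concreteness, a sketch of what such a `checkedElem` might look like (the threshold and the use of `Debug.Trace` are illustrative assumptions; real compiler code would use GHC's own warning machinery):

import Debug.Trace (trace)

-- Hypothetical checkedElem: behaves exactly like elem, but warns when the
-- list is suspiciously long. The threshold (100) is arbitrary; note that
-- 'null (drop 100 xs)' checks the length without forcing the whole spine.
checkedElem :: Eq a => a -> [a] -> Bool
checkedElem x xs
  | null (drop 100 xs) = x `elem` xs
  | otherwise          = trace "checkedElem: over-long list" (x `elem` xs)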
Changed 16 months ago by
Attachment: ghc-large-record-types-cmm-size.png added
Size of C-- for large record examples
comment:62 Changed 16 months ago by
More data! https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-cmm-size.png shows C-- code size growth as the number of fields grows for the same generated examples as in the other graphs. getline and getline-appl generate the exact same C--, so there's no point including both in the graph; they are the only ones that grow linearly among the "interesting" examples.
comment:63 follow-up: 65 Changed 16 months ago by
Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big!
What happens to the size of Core? (Neglect types and coercions which are erased.)
And finally does the Core-to-Cmm size ratio go up?
comment:64 Changed 16 months ago by
Replying to simonpj:
> Indeed. Actually any use of `elem` on a long list is suspicious; c.f. `ListSetOps.unionLists`, which warns if it sees long lists.
>
> Maybe we could have `ListSetOps.checkedElem`, which checks for over-long lists and warns, and change all uses of `elem` into `checkedElem`. There may be more of these land-mines!
What about using a different kind of data structure, like a map or something similar with logarithmic lookup performance? That would work even if we cannot reduce the size of the list; though if the list is short-lived, constructing the map might introduce prohibitive overhead.
comment:65 Changed 16 months ago by
Replying to simonpj:
> Interesting. I can't tell whether that's as non-linear as the runtime, but the bad cases are clearly big!
>
> What happens to the size of Core? (Neglect types and coercions which are erased.)
>
> And finally does the Core-to-Cmm size ratio go up?
I'll check that next; unfortunately this means I have to run all those examples again...
comment:66 follow-up: 67 Changed 16 months ago by
> What about using a different kind of data structure, like a map or something similar?

Of course, that's the first thing we should do, once we are convinced that's where the time is going! (We only have suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library.
comment:67 Changed 16 months ago by
Replying to simonpj:
> > What about using a different kind of data structure, like a map or something similar?
>
> Of course, that's the first thing we should do, once we are convinced that's where the time is going! (We only have suggestive evidence so far.) Small maps should be cheap; that's the business of the containers library.
`LocalReg` is `Uniquable`, so something like this should work:
type LocalRegSet = UniqSet LocalReg

emptyLocalRegSet :: LocalRegSet
emptyLocalRegSet = emptyUniqSet

nullLocalRegSet :: LocalRegSet -> Bool
nullLocalRegSet = isEmptyUniqSet

elemLocalRegSet :: LocalReg -> LocalRegSet -> Bool
elemLocalRegSet = elementOfUniqSet

insertLocalRegSet :: LocalRegSet -> LocalReg -> LocalRegSet
insertLocalRegSet = addOneToUniqSet

cannot_inline = skipped `regsUsedIn` rhs -- Note [dependent assignments]
             || l `elemLocalRegSet` skipped
             || not (okToInline dflags rhs node)

regsUsedIn :: LocalRegSet -> CmmExpr -> Bool
regsUsedIn ls _ | nullLocalRegSet ls = False
regsUsedIn ls e = wrapRecExpf f e False
  where f (CmmReg (CmmLocal l))      _ | l `elemLocalRegSet` ls = True
        f (CmmRegOff (CmmLocal l) _) _ | l `elemLocalRegSet` ls = True
        f _ z = z
comment:68 Changed 16 months ago by
Interesting: https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-cmm-per-core.png
For the three offenders (read, show, read-appl), the C--/Core ratio goes *up* as the total code size grows, while for the well-behaved examples it goes down. The latter is expected (constant overhead plus linear code size); the former hints at some sort of nonlinearity.
Changed 16 months ago by
Attachment: ghc-large-record-types-cmm-per-core.png added
C-- / Core size ratio for large record modules
comment:69 follow-up: 71 Changed 16 months ago by
That is indeed interesting.
- The bad cases are `read`, `read-appl` and `show` (comment:56)
- They all generate very big Cmm (see the png just before comment:62)
- They all have an unusually large Core-to-Cmm ratio (comment:68)
- This very large Cmm makes `sink` go bananas.
Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)?
I'm guessing that the Core size does not scale badly. If that is so, our questions would be
- Why do we generate so much Cmm per unit of Core?
- Can we make `sink` scale better?
Both are worth working on.
comment:70 follow-up: 72 Changed 16 months ago by
I tried the suggestion in comment:67 and got these results for compiling `W2.hs`.
Vanilla:
Sun Oct 29 19:58 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

COST CENTRE      MODULE       SRC                                                 %time %alloc
sink             CmmPipeline  compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        55.7   15.9
SimplTopBinds    SimplCore    compiler/simplCore/SimplCore.hs:761:39-74            19.5   30.6
FloatOutwards    SimplCore    compiler/simplCore/SimplCore.hs:471:40-66             4.2    9.0
RegAlloc-linear  AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    4.0   11.1
pprNativeCode    AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    2.8    6.3
NewStranal       SimplCore    compiler/simplCore/SimplCore.hs:480:40-63             1.6    3.7
OccAnal          SimplCore    compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     1.5    3.5
StgCmm           HscMain      compiler/main/HscMain.hs:(1426,13)-(1427,62)          1.2    2.4
regLiveness      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.2    1.9
genMachCode      AsmCodeGen   compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    0.9    1.8
NativeCodeGen    CodeOutput   compiler/main/CodeOutput.hs:171:18-78                 0.9    2.1
CoreTidy         HscMain      compiler/main/HscMain.hs:1253:27-67                   0.8    1.9
`skipped` as UniqSet:
Sun Oct 29 19:18 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -RTS -B/Users/alexbiehl/git/ghc/inplace/lib /Users/alexbiehl/Downloads/W2.hs -fforce-recomp -O2

total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
total alloc = 15,772,184,488 bytes  (excludes profiling overheads)

COST CENTRE      MODULE          SRC                                                 %time %alloc
SimplTopBinds    SimplCore       compiler/simplCore/SimplCore.hs:761:39-74            38.3   29.0
sink             CmmPipeline     compiler/cmm/CmmPipeline.hs:(104,13)-(105,59)        13.2   20.3
RegAlloc-linear  AsmCodeGen      compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)    8.3   10.5
FloatOutwards    SimplCore       compiler/simplCore/SimplCore.hs:471:40-66             8.1    8.5
pprNativeCode    AsmCodeGen      compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    5.4    5.9
NewStranal       SimplCore       compiler/simplCore/SimplCore.hs:480:40-63             3.1    3.5
OccAnal          SimplCore       compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.9    3.3
StgCmm           HscMain         compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.3    2.3
regLiveness      AsmCodeGen      compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    2.1    1.8
NativeCodeGen    CodeOutput      compiler/main/CodeOutput.hs:171:18-78                 1.7    2.0
genMachCode      AsmCodeGen      compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.6    1.7
CoreTidy         HscMain         compiler/main/HscMain.hs:1253:27-67                   1.4    1.8
foldNodesBwdOO   Hoopl.Dataflow  compiler/cmm/Hoopl/Dataflow.hs:(397,1)-(403,17)       1.1    0.8
Good results, I think. Though I need to check whether allocation goes up significantly in common cases.
Changed 16 months ago by
Attachment: ghc-large-record-types-core-size.png added
Core size for large record modules
comment:71 Changed 16 months ago by
Replying to simonpj:
> Is the large Cmm accounted for by the big Core-to-Cmm ratio, or do these examples also have Core sizes that scale badly (look at term size only)?
>
> I'm guessing that the Core size does not scale badly. If that is so, our questions would be
>
> - Why do we generate so much Cmm per unit of Core?
> - Can we make `sink` scale better?
>
> Both are worth working on.
https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-large-record-types-core-size.png suggests that while Core size has a slight nonlinearity to it, the behavior is roughly the same for all examples, so I believe your guess that Core size doesn't scale as badly as C-- size does is justified.
comment:72 Changed 16 months ago by
Replying to alexbiehl:
> I tried the suggestion in comment:67 and got these results for compiling `W2.hs`
>
> [...]
>
> Good results, I think. Though I need to check whether allocation goes up significantly in common cases.
You could try running the examples in examples100.zip (https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/examples100.zip) to see whether any of them performs worse; that would be a good first indication.
Changed 16 months ago by
Attachment: ghc-stage2-7258-deeper-SCC.prof added
Profiling output with more SCCs added
comment:73 Changed 16 months ago by
Profiling a bit deeper (see https://ghc.haskell.org/trac/ghc/attachment/ticket/7258/ghc-stage2-7258-deeper-SCC.prof) hints at a possible nonlinearity in `tryToInline`, particularly when recursing through the `keep` function.
Changed 16 months ago by
Attachment: ghc-large-record-types-optimized-2.png added
Performance graph for various examples, compiling with -O, using UniqSet (as in comment #67)
comment:74 Changed 16 months ago by
comment:75 Changed 16 months ago by
During the call Simon hypothesized that we should keep the free-variable lists in mind when comparing STG. I stated that I didn't think we currently had a way to print closures' free-variable lists. I've checked, and this isn't quite true: the pretty-printer will print free-variable lists when `-dppr-debug` is passed. I've broken this out into its own flag in Phab:D4140.
comment:76 Changed 16 months ago by
Some better data extraction code for profiles generated earlier yields the above graphs. All of these were generated using the optimization as implemented in D4145 (using a UniqSet instead of a list). Observations:
- The number of terms in generated Core scales in a perfectly linear fashion for all examples.
- The size of the generated Core in lines of code, however, scales non-linearly, which tells us that the size of individual terms grows with input size. This behavior seems to be identical across all 4 examples though, so it does not explain the extreme performance differences.
- Somewhere between Core and STG, the `show` and `read` examples misbehave worse than the other two.
- Between STG and C--, things get utterly dramatic: `show` and `read` explode, while `getline-appl` and `nothing` are reduced back to near-linear behavior.
comment:77 follow-up: 80 Changed 16 months ago by
It seems that for the misbehaving examples, the offending cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionately in the `read` and `show` examples, while it remains below 40% for the well-behaved ones.
comment:78 Changed 16 months ago by
> the offending cost centre is Stg2Stg
Very interesting! Keep digging :-)
comment:80 Changed 16 months ago by
Replying to tdammers:
> It seems that for the misbehaving examples, the offending cost centre is Stg2Stg. As code size increases, CPU time for this cost centre grows disproportionately in the `read` and `show` examples, while it remains below 40% for the well-behaved ones.

Since you don't compile the examples with profiling on, the only "heavy" transformation in Stg2Stg is StgCse. You can try `-fno-stg-cse` to see if it has an impact on the timings.
comment:81 Changed 16 months ago by
Very curious. I see in comment:79
Before:

total time  =       26.00 secs   (25996 ticks @ 1000 us, 1 processor)
total alloc = 14,921,627,912 bytes  (excludes profiling overheads)

After:

total time  =       13.31 secs   (13307 ticks @ 1000 us, 1 processor)
total alloc = 15,772,184,488 bytes  (excludes profiling overheads)
I don't think I have ever before seen a program that actually allocates more (15G instead of 14G) and yet runs in half the time. Usually allocation and runtime are more or less correlated. But apparently not in this case.
comment:82 Changed 16 months ago by
Stg2Stg turns out to be a dead end. The performance loss I observed there is entirely due to formatting and dumping about 1.3 GiB worth of STG when compiling with -ddump-stg; when compiling without -ddump-stg, `Stg2Stg` doesn't show up in the profile at all (0.0 on everything).
Re-running the profiling script without dumps shows different culprits: `tryToInline` and `simpl_bind`. E.g. the 400-field `read` example:
Fri Nov 3 08:43 2017 Time and Allocation Profiling Report  (Final)

   ghc-stage2 +RTS -p -h -RTS -B/home/tobias/well-typed/devel/ghc/inplace/lib -B/home/tobias/well-typed/devel/ghc/inplace/lib -O -fforce-recomp -c generated/t-400-read.hs

total time  =       59.05 secs   (59049 ticks @ 1000 us, 1 processor)
total alloc = 75,366,446,544 bytes  (excludes profiling overheads)

COST CENTRE                     MODULE      SRC                                                 %time %alloc
simpl_bind                      Simplify    compiler/simplCore/Simplify.hs:141:83-101            29.4   22.0
tryToInline                     CmmSink     compiler/cmm/CmmSink.hs:410:3-41                     24.4   32.4
RegAlloc-linear                 AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(658,27)-(660,55)   10.6   14.1
FloatOutwards                   SimplCore   compiler/simplCore/SimplCore.hs:471:40-66             8.8    6.9
pprNativeCode                   AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(529,37)-(530,65)    4.0    4.7
OccAnal                         SimplCore   compiler/simplCore/SimplCore.hs:(739,22)-(740,67)     2.3    2.2
StgCmm                          HscMain     compiler/main/HscMain.hs:(1426,13)-(1427,62)          2.1    1.8
regLiveness                     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(591,17)-(593,52)    1.9    1.3
NativeCodeGen                   CodeOutput  compiler/main/CodeOutput.hs:171:18-78                 1.5    1.6
genMachCode                     AsmCodeGen  compiler/nativeGen/AsmCodeGen.hs:(580,17)-(582,62)    1.4    1.3
CoreTidy                        HscMain     compiler/main/HscMain.hs:1253:27-67                   1.3    1.4
dmdAnal'-Case-single_prod_ctor  DmdAnal     compiler/stranal/DmdAnal.hs:(251,5)-(276,68)          1.3    1.5
Changed 16 months ago by
Attachment: t-read-scc-time-abs.png added

Largest profiling cost centres (in absolute terms) for the `read` example
Changed 16 months ago by
Attachment: t-getline-appl-scc-time-abs.png added

Largest profiling cost centres (in absolute terms) for the `getline-appl` example
Changed 16 months ago by
Attachment: t-show-scc-time-abs.png added

Largest profiling cost centres (in absolute terms) for the `show` example
comment:83 Changed 16 months ago by
Comparing the latest 3 graphs, we can see that `tryToInline` scales worse than the other cost centres, at least in the misbehaving examples, while the other top cost centres seem to scale roughly proportionally. `simpl_bind` is the largest one in most cases, but `tryToInline` is "more nonlinear" for the `show` and `read` examples, surpassing it between 150 and 200 fields in the `show` example and (educated guess) around 500 fields in the `read` example.
comment:84 Changed 16 months ago by
`tryToInline` is in `CmmSink`, which we already know is working too hard. There may well be algorithmic improvements to be had in `CmmSink`.

But, in addition, I would still like to understand why the STG-to-Cmm ratio climbs so sharply. If we have giant Cmm then any Cmm pass is going to take a long time. Fixing the non-linearity would be good, I agree; but generating less Cmm would also be a big win.
Why is it so big?
My guess: we are generating very deeply-nested closures that have a lot of free variables. E.g.
let x = let a5 = blah
            q  = ...K a1 a2 a3 a4 a5 ...
        in ...
Here the closure for `q` has 5 free variables (a1-a5), while that for `x` has four (a1-a4). So the code for `x` will slurp lots of variables out of `x`'s closure only to store them again in `q`'s. And if that is nested, the i-th closure copies O(i) variables, so n levels of nesting copy 1 + 2 + ... + n of them: O(n^2) code size.
But I'm speculating. Facts would be good. Let's measure STG size with and without counting the free vars of each closure (which are stored in the STG tree) and see.
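Here is a sketch of that measurement over a toy STG-like tree (the `ToyStg` type is an assumption for illustration; GHC's real STG stores each closure's free-variable list in the tree, as noted above):

-- Toy stand-in for STG: each closure records its free variables.
data ToyStg = Closure [String] [ToyStg]   -- fv list, sub-expressions
            | Leaf

-- Size ignoring the stored fv lists...
sizeNoFvs :: ToyStg -> Int
sizeNoFvs Leaf           = 1
sizeNoFvs (Closure _ bs) = 1 + sum (map sizeNoFvs bs)

-- ...and size counting them. If the ratio of the two grows with the number
-- of record fields, the fv lists are where the quadratic growth is hiding.
sizeWithFvs :: ToyStg -> Int
sizeWithFvs Leaf             = 1
sizeWithFvs (Closure fvs bs) = 1 + length fvs + sum (map sizeWithFvs bs)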
comment:85 Changed 16 months ago by
Here's some evidence (compiled with `-O`):

The 3 misbehaving examples (read, show, and to a lesser degree read-appl) show linear growth of the free-variables/other-STG ratio, while the well-behaved examples maintain a more-or-less constant ratio. Assuming the base STG size grows linearly, a linearly growing ratio means the overall STG size grows roughly quadratically.
A quick glance at the generated STG suggests that there is indeed some deep nesting of closures going on.
comment:86 Changed 16 months ago by
I've narrowed it down a bit more.
This behaves nicely:
module D where

import Text.ParserCombinators.ReadP as ReadP
import Control.Monad.State as State
import Data.Char (ord)

data DT = DT
  { field0  :: Int
  , field2  :: Int
  , field3  :: Int
  , field4  :: Int
  , field5  :: Int
  , field6  :: Int
  , field7  :: Int
  , field8  :: Int
  , field9  :: Int
  , field10 :: Int
  }

getlD :: IO DT
getlD = DT <$> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
           <*> (read <$> getLine)
But this doesn't:
module D where

import Text.ParserCombinators.ReadP as ReadP
import Control.Monad.State as State
import Data.Char (ord)

data DT = DT
  { field0  :: Int
  , field2  :: Int
  , field3  :: Int
  , field4  :: Int
  , field5  :: Int
  , field6  :: Int
  , field7  :: Int
  , field8  :: Int
  , field9  :: Int
  , field10 :: Int
  }

readD :: ReadP DT
readD = DT <$> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
           <*> (ord <$> ReadP.get)
Changed 16 months ago by
Attachment: t-10-dump.zip added
Various GHC dumps for 10-field getline-appl and read-appl examples. Read misbehaves, getLine doesn't.
comment:87 Changed 16 months ago by
Further evidence, based on the observation that the problem is caused by certain kinds of CPS-style code:
module Nested where

-- A data type capturing the relevant structure of the P type from ReadP
data A a = A (Int -> A a)
         | N a

-- This implementation produces deeply nested Core:
f10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int)
f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 ->
      A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 ->
        N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
      ))))))))))

-- This implementation produces flat Core.
g10 :: A (Int, Int, Int, Int, Int, Int, Int, Int, Int, Int)
g10 = a0
  where
    {-# NOINLINE a10 #-}
    a10 = \(i0, i1, i2, i3, i4, i5, i6, i7, i8) i9 ->
            N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
    {-# NOINLINE a9 #-}
    a9 = \(i0, i1, i2, i3, i4, i5, i6, i7) i8 ->
           A (a10 (i0, i1, i2, i3, i4, i5, i6, i7, i8))
    {-# NOINLINE a8 #-}
    a8 = \(i0, i1, i2, i3, i4, i5, i6) i7 ->
           A (a9 (i0, i1, i2, i3, i4, i5, i6, i7))
    {-# NOINLINE a7 #-}
    a7 = \(i0, i1, i2, i3, i4, i5) i6 ->
           A (a8 (i0, i1, i2, i3, i4, i5, i6))
    {-# NOINLINE a6 #-}
    a6 = \(i0, i1, i2, i3, i4) i5 ->
           A (a7 (i0, i1, i2, i3, i4, i5))
    {-# NOINLINE a5 #-}
    a5 = \(i0, i1, i2, i3) i4 ->
           A (a6 (i0, i1, i2, i3, i4))
    {-# NOINLINE a4 #-}
    a4 = \(i0, i1, i2) i3 ->
           A (a5 (i0, i1, i2, i3))
    {-# NOINLINE a3 #-}
    a3 = \(i0, i1) i2 ->
           A (a4 (i0, i1, i2))
    {-# NOINLINE a2 #-}
    a2 = \i0 i1 -> A (a3 (i0, i1))
    {-# NOINLINE a1 #-}
    a1 = \i0 -> A (a2 i0)
    {-# NOINLINE a0 #-}
    a0 = A a1
This module tries to provoke "tupling up the free variables" in `g10`, and to compare it to the naive implementation `f10`. I do not know yet whether this would actually produce any performance benefits though.
The NOINLINE pragmas are needed, by the way; otherwise GHC decides to inline the `a` functions, and we end up with a mess similar to `f10`.
comment:88 follow-up: 89 Changed 16 months ago by
Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and looked to me as if it too would scale quadratically. Are `f10` and `g10` accurate reflections of the "scale well" and "scale badly" versions of Read?
I can see why things might scale badly, if we have a lot of
p1 >>= \x ->
p2 >>= \y ->
p3 >>= \z ->
return (K x y z)
And `f10` has that structure, albeit with data constructors rather than functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a case expression instead??
So if `f10` is reasonably accurate, I can see two ways forward:

1. Do something in the back end to generate less code.
2. Generate different code. I'm still intrigued (and a bit disbelieving) why a different exposition of Read generates code that scales better.
Concerning (1), the kind of thing I had in mind was more like this:
f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 ->
        let t = (i0,i1,i2,i3,i4) in
        A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 ->
          case t of (i0,i1,i2,i3,i4) ->
            N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
Now no function closure gets more than 5 free variables. I imagine one could do this as an STG-to-STG transformation if we decided that this was really the core issue.
A tantalising point is this. Suppose we have something like:
let f4 = [i0,i1,i2,i3] \i4 ->
      let f5 = [i0,i1,i2,i3,i4] \i5 ->
            ...blah...
The list before each `\` holds the free vars of the closure.
Inside the code for `f4` we have `f4`'s function closure in hand, with `i0...` in it. Rather than capturing `i0, i1...` as the free vars of `f5`, we could just store a pointer to `f4`'s function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say

let f4 = [i0,i1,i2,i3] \i4 ->
      let f5 = [f4, i4] \i5 ->
            ...blah...

with `f4` and `i4` being the free vars of `f5`'s closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where all of `f4`'s free variables are also free in `f5`. (If not all were, you might get a space leak by closing over `f4`.)
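To spell out the two capture schemes with explicit environments (a toy model only; the code generator does not literally build these types):

-- Env4 plays the role of f4's function closure.
data Env4 = Env4 { i0, i1, i2, i3 :: Int }

-- Copying scheme (the status quo): f5's closure re-stores i0..i3 as well
-- as its own i4, so k levels of nesting store O(k^2) variables in total.
data Env5Copy = Env5Copy Int Int Int Int Int   -- i0, i1, i2, i3, i4

-- Sharing scheme (the proposal): f5 closes over f4's whole closure plus
-- its own new variable, so every closure is constant-sized. The price is
-- retaining all of Env4, which is exactly the space-leak caveat above.
data Env5Share = Env5Share Env4 Int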
comment:89 Changed 15 months ago by
Replying to simonpj:
> Hmm. The `Cmm` for `g10` is actually a lot bigger than that for `f10`, and looked to me as if it too would scale quadratically.
It would, but for different reasons, and I believe I can rewrite it to not scale quadratically (but haven't gotten around to it yet).
> Are `f10` and `g10` accurate reflections of the "scale well" and "scale badly" versions of Read?
No; we don't actually have a "scale well" version of `Read` yet. All the implementations of `Read` that I have produced so far are in the "scale badly" category, although the degree to which they scale badly differs somewhat. We have two kinds of "scale well" examples:

- Trivial ones, where rather than nesting code such that chains of closures appear, we just call `read` or something else N times, but in benign ways; these examples do not have much explanatory value, and I merely included them to establish a performance baseline.
- Well-behaved nested / chained examples, such as the `getline-appl` one; however, none of these do anything equivalent to `read`.
> I can see why things might scale badly, if we have a lot of
>
>     p1 >>= \x ->
>     p2 >>= \y ->
>     p3 >>= \z ->
>     return (K x y z)
>
> And `f10` has that structure, albeit with data constructors rather than functions. BUT: if that's it, why doesn't `(>>=)` inline, so we get a case expression instead??
I believe the `(>>=)` does in fact inline; `f10` is the kind of shape you get when you inline the `(>>=)` from `ReadP`. The `(>>=)` from other monadic types does not introduce this kind of shape, which would explain why `Read` misbehaves but `getLine` over `IO` does not.
> So if `f10` is reasonably accurate, I can see two ways forward:
>
> 1. Do something in the back end to generate less code.
> 2. Generate different code. I'm still intrigued (and a bit disbelieving) why a different exposition of Read generates code that scales better.
See above; `Read` always misbehaves here, so your disbelief is entirely justified. The bad behavior seems to be related to the actual implementation of `(>>=)`, not so much to whether the `(>>=)` gets inlined. Or, more accurately, inlining `(>>=)` actually *is* the problem; if we didn't inline it, then `read-appl` would compile to something equivalent to the generic applicative example, which behaves nicely.
> Concerning (1), the kind of thing I had in mind was more like this:
>
>     f10 = A (\i0 -> A (\i1 -> A (\i2 -> A (\i3 -> A (\i4 ->
>             let t = (i0,i1,i2,i3,i4) in
>             A (\i5 -> A (\i6 -> A (\i7 -> A (\i8 -> A (\i9 ->
>               case t of (i0,i1,i2,i3,i4) ->
>                 N (i0, i1, i2, i3, i4, i5, i6, i7, i8, i9)
>
> Now no function closure gets more than 5 free variables. I imagine one could do this as an STG-to-STG transformation if we decided that this was really the core issue.
Further experimentation shows that this doesn't actually achieve anything; the resulting Core is *exactly the same*, and if you put both functions into the same compilation unit, they get optimized into expressing one in terms of the other (I'm guessing here that CSE is smart enough to figure out that they are equivalent). I have written 3 versions of `f10`: one as-is, one (`g10`) using the suggested 5-tuple optimization, and one (`h10`) using a nested-tuple optimization, and I end up with Core that literally contains these lines:
f10 = A f1
g10 = f10
h10 = f10
> A tantalising point is this. Suppose we have something like:
>
>     let f4 = [i0,i1,i2,i3] \i4 ->
>           let f5 = [i0,i1,i2,i3,i4] \i5 ->
>                 ...blah...
>
> The list before each `\` holds the free vars of the closure.
>
> Inside the code for `f4` we have `f4`'s function closure in hand, with `i0...` in it. Rather than capturing `i0, i1...` as the free vars of `f5`, we could just store a pointer to `f4`'s function closure (it's a kind of tuple, after all) and fetch its free vars from there. It's as if we wanted to say
>
>     let f4 = [i0,i1,i2,i3] \i4 ->
>           let f5 = [f4, i4] \i5 ->
>                 ...blah...
>
> with `f4` and `i4` being the free vars of `f5`'s closure. This would be a deeper change to the code generator, but it could make a lot of sense in cases where all of `f4`'s free variables are also free in `f5`. (If not all were, you might get a space leak by closing over `f4`.)
comment:90 Changed 15 months ago by
comment:91 Changed 15 months ago by
Indeed, alexbiehl is on the way to implementing the "tantalising point" of comment:89. In a phone call today we agreed:

- That sharing the parent closure is a good thing to try; and not too hard.
- That we should start:
  - A ticket to track progress on it: #14461
  - A wiki page to explain how it all works. I have made a start: NestedClosures
comment:92 Changed 15 months ago by
Progress made so far:
- Reduced algorithmic complexity in register spilling (comment:33)
- Minor improvements in the pretty-based code writer (comment:34)
- Refactored the generated `Read` code (comment:37 / ticket #14364)
- Observed that `Read` instances generate long chains of monadic binds over a monad that introduces nested closures when binding (`ReadP`, specifically); and that deeply nested closures in general cause the register allocator to repeatedly unpack and re-pack free variables, which leads to quadratic behavior. The fix for this has been split off into #14461.
Things left to do:
Exhibits long compilation times