Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#10293 closed bug (fixed)

CallArity taking 20% of compile time

Reported by: simonpj Owned by: nomeata
Priority: high Milestone: 7.10.2
Component: Compiler Version: 7.10.1
Keywords: Cc: michal.terepeta@…, gidyn, scpmw@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description

The CallArity analysis can, apparently, take 20% of compile time. That's a lot for one analysis that is on by default with -O.

Michael Terepeta writes: Out of curiosity I had a look at compiling haskell-src-exts since that takes quite a while. I've used ghc HEAD and 7.8.4 (both built with BuildFlavour=prof & bootstrapped with a standard ghc 7.8.4) and it's interesting -- the current HEAD takes quite a bit longer and allocates way more than 7.8.4. One of the main things that stand out is the CallArity analysis (which IIRC was not there in 7.8.4). So unless I messed something up with measuring, the analysis seem to be pretty expensive.

HEAD

	Sun Apr 12 15:52 2015 Time and Allocation Profiling Report  (Final)

	  ghc +RTS -p -RTS [...]

	total time  =      147.84 secs   (147841 ticks @ 1000 us, 1 processor)
	total alloc = 172,378,600,408 bytes  (excludes profiling overheads)

COST CENTRE       MODULE       %time %alloc

SimplTopBinds     SimplCore     32.4   28.8
CallArity         SimplCore     18.4   25.6
lintAnnots        CoreLint       4.5    4.6
CoreTidy          HscMain        4.5    5.1
pprNativeCode     AsmCodeGen     3.2    3.4
OccAnal           SimplCore      3.2    3.1
occAnalBind.assoc OccurAnal      2.6    2.5
StgCmm            HscMain        2.3    1.9
Simplify          SimplCore      2.1    0.2
RegAlloc          AsmCodeGen     2.1    2.4
FloatOutwards     SimplCore      2.0    1.6
regLiveness       AsmCodeGen     1.9    1.9
tc_rn_src_decls   TcRnDriver     1.8    1.3
sink              CmmPipeline    1.7    1.5
NewStranal        SimplCore      1.3    1.5
genMachCode       AsmCodeGen     1.1    1.0
layoutStack       CmmPipeline    1.0    1.0

HEAD with -fno-call-arity

	Sun Apr 12 18:16 2015 Time and Allocation Profiling Report  (Final)

	  ghc +RTS -p -RTS [...] -fno-call-arity

	total time  =      113.71 secs   (113714 ticks @ 1000 us, 1 processor)
	total alloc = 121,884,896,720 bytes  (excludes profiling overheads)

COST CENTRE       MODULE          %time %alloc

SimplTopBinds     SimplCore        37.2   36.6
CoreTidy          HscMain           6.0    7.3
lintAnnots        CoreLint          5.8    6.5
pprNativeCode     AsmCodeGen        4.1    4.8
OccAnal           SimplCore         3.6    3.8
occAnalBind.assoc OccurAnal         2.9    3.2
StgCmm            HscMain           2.9    2.6
RegAlloc          AsmCodeGen        2.6    3.4
FloatOutwards     SimplCore         2.6    2.3
regLiveness       AsmCodeGen        2.5    2.8
tc_rn_src_decls   TcRnDriver        2.4    1.9
Simplify          SimplCore         2.4    0.3
sink              CmmPipeline       2.1    2.2
NewStranal        SimplCore         1.7    2.1
genMachCode       AsmCodeGen        1.4    1.4
layoutStack       CmmPipeline       1.4    1.4
NativeCodeGen     CodeOutput        1.1    1.2
FloatInwards      SimplCore         1.1    1.4
do_block          Hoopl.Dataflow    1.0    0.6
Digraph.scc       Digraph           0.8    1.3

GHC 7.8.4

	Sun Apr 12 15:41 2015 Time and Allocation Profiling Report  (Final)

	  ghc +RTS -p -RTS [...]

	total time  =       93.11 secs   (93112 ticks @ 1000 us, 1 processor)
	total alloc = 103,135,975,120 bytes  (excludes profiling overheads)

COST CENTRE                 MODULE         %time %alloc

SimplTopBinds               SimplCore       38.5   37.4
pprNativeCode               AsmCodeGen       6.2    7.2
StgCmm                      HscMain          3.9    4.2
RegAlloc                    AsmCodeGen       3.7    5.1
occAnalBind.assoc           OccurAnal        3.3    3.6
OccAnal                     SimplCore        3.3    3.6
regLiveness                 AsmCodeGen       3.1    3.4
FloatOutwards               SimplCore        2.9    2.4
sink                        CmmPipeline      2.8    2.8
Simplify                    SimplCore        2.6    0.3
tc_rn_src_decls             TcRnDriver       2.4    2.1
genMachCode                 AsmCodeGen       1.9    2.0
NewStranal                  SimplCore        1.8    2.1
layoutStack                 CmmPipeline      1.8    1.8
Core2Core                   HscMain          1.3    1.2
deSugar                     HscMain          1.1    1.1
do_block                    Hoopl.Dataflow   1.1    0.7
CoreTidy                    HscMain          1.0    1.1
CorePrep                    HscMain          1.0    1.1
Digraph.scc                 Digraph          0.9    1.5
versioninfo                 MkIface          0.9    1.0
zonkEvBndr_zonkTcTypeToType TcHsSyn          0.6    1.4

Change History (24)

comment:1 Changed 3 years ago by simonpj

Milestone: 7.12.1
Owner: set to nomeata
Priority: normalhigh

comment:2 Changed 3 years ago by nomeata

Thanks for the report. I did test its impact on GHC and nofib, and ensured that it would run reasonably fast, but it’s not unlikely that there are classes of programs where it goes through the roof. I’ll have a look, but I’ll first have to catch up a post-vaction backlog.

comment:3 Changed 3 years ago by nomeata

Cc: michal.terepeta@… added

Michal, with what module of haskell-src-exts can this behavior be observed? How exactly did you invoke ghc to get that output?

comment:4 Changed 3 years ago by nomeata

The culprit is callArityRecEnv , which analyzes recursive groups and calls calledWith with quadratic complexity.

I might have an idea for way to shortcut this. Or maybe the code should revert to a quick and conservative conversation when analyzing a very large recursive group – the largest here is has 754 mutually recursive definitions.

I’ll need to ponder this some more.

comment:5 Changed 3 years ago by George

This is good news, any chance we can get it in 7.10.2?

comment:6 Changed 3 years ago by nomeata

Possibly, if it validates without problems... current state on Phab:D843. I yet have to run nofib and see if I broke something.

comment:7 Changed 3 years ago by Joachim Breitner <mail@…>

In 9654a7cf8580bc3a027bf8b39c06d916050c446d/ghc:

Call Arity: Trade precision for performance in large mutually recursive groups

Sometimes (especial with derived Data instances, it seems), one can have
very large mutually recursive bindings. Calculating the Call Arity
analysis result with full precision is an expensive operation in these
case. So above a certain threshold (25, for no good reason besides
intuition), skip this calculation and assume the recursion is not
linear, which is a conservative result.

With this, the Call Arity analysis accounts for 3.7% of the compile time
of haskell-src-exts. Fixes #10293

Differential Revision: https://phabricator.haskell.org/D843

comment:8 Changed 3 years ago by nomeata

Resolution: fixed
Status: newclosed

I pushed a fix. Michael Terepeta, can yon confirm that it fixes the problem for you?

comment:9 Changed 3 years ago by nomeata

Status: closedmerge

Oh, someone said that this might be something we want in 7.10.2. Marking as merge, deferring this decision to the release manager.

comment:10 Changed 3 years ago by gidyn

Has there been any testing to determine the effect on the speed of compiled code whose compilation was significantly speeded up? The Phab only mentions nofib times, whose compile times aren't effected anyway.

comment:11 Changed 3 years ago by gidyn

Hard-coding the limit of 25 seems rash when there's only one data point (haskell-src-exts). This should probably be configurable at run-time until its had a lot more use out in the wild to see if this really is the best value in practice.

comment:12 Changed 3 years ago by nomeata

Has there been any testing to determine the effect on the speed of compiled code whose compilation was significantly speeded up? The Phab only mentions nofib times, whose compile times aren't effected anyway.

No, unfortunately we lack good benchmarks for this. haskell-src-exts does not define any, and even if it would, there is little infrastructure to run and compare such benchmarks easily.

Again intuition, but: Those cases that blew up tend to be large and complex, so the chances that everything fits together in a way for Call Arity to succeed are small.

Hard-coding the limit of 25 seems rash when there's only one data point (haskell-src-exts). This should probably be configurable at run-time until its had a lot more use out in the wild to see if this really is the best value in practice.

People were asking for a fix that can go into 7.10.2, so this is a rash move. I believe the limit is still higher than the lowest reasonable limit, so we should be fine – but again without evidence.

Also, it’s always annoying to add flags-dependent values to pure code with no DynFlags around, as it is the case here.

comment:13 Changed 3 years ago by gidyn

Cc: gidyn added

comment:14 Changed 3 years ago by gidyn

Compile-time setting (i.e. when building GHC)? At least that way they'll be some option for folks to play around with it if they think it's too high/low without actually changing the source code.

comment:15 Changed 3 years ago by nomeata

Compile-time setting (i.e. when building GHC)? At least that way they'll be some option for folks to play around with it if they think it's too high/low without actually changing the source code.

That doesn’t gain you much over changing a number in the source file.

Also, so far the existence of such folk is just hypothetical. Much more likely is that someone will fine data points for one or the other side of the limit („this code compiles too slowly“, „Call Arity did not fire even though it could“) and report that. Until there are such reports, it is safe to assume that it simply doesn’t matter.

comment:16 Changed 3 years ago by michalt

Thanks for CCing me and sorry for the delay (unfortunately the email landed in spam). Anyway, I've rebuilt ghc (build flavor prof) and I can confirm that your changes improve the performance significantly:

	Thu Apr 16 21:48 2015 Time and Allocation Profiling Report  (Final)

	   ghc +RTS -p -RTS [...]

	total time  =      121.66 secs   (121660 ticks @ 1000 us, 1 processor)
	total alloc = 132,375,235,176 bytes  (excludes profiling overheads)

COST CENTRE       MODULE       %time %alloc

SimplTopBinds     SimplCore     38.8   37.5
lintAnnots        CoreLint       5.5    6.0
CoreTidy          HscMain        5.2    6.7
pprNativeCode     AsmCodeGen     3.8    4.4
OccAnal           SimplCore      3.8    4.0
occAnalBind.assoc OccurAnal      3.1    3.3
StgCmm            HscMain        2.8    2.4
Simplify          SimplCore      2.5    0.3
RegAlloc          AsmCodeGen     2.5    3.1
FloatOutwards     SimplCore      2.4    2.1
regLiveness       AsmCodeGen     2.3    2.5
tc_rn_src_decls   TcRnDriver     2.3    1.7
sink              CmmPipeline    2.1    2.0
CallArity         SimplCore      1.7    3.1
NewStranal        SimplCore      1.6    1.9
genMachCode       AsmCodeGen     1.4    1.3
layoutStack       CmmPipeline    1.2    1.3
NativeCodeGen     CodeOutput     1.1    1.1
FloatInwards      SimplCore      1.0    1.2
Digraph.scc       Digraph        0.7    1.2

So it's still slower than GHC 7.8.4, but now CallArity is pretty low in the list. :-)

(For the record, steps to reproduce: build GHC with build flavor prof and add +RTS -p -RTS to the cabal file of haskell-src-exts, then compile it. The ghc.prof file contains the information.)

comment:17 Changed 3 years ago by nomeata

A hint for further investigation: There might more code in GHC that does not handle a mutually recursive group with 700-elements well.

comment:18 Changed 3 years ago by simonpj

Cc: scpmw@… added

I really want to know why lintAnnots is so high.

Peter can you investigate? Or someone?

Simon

comment:19 Changed 3 years ago by George

Type of failure: None/UnknownCompile-time performance bug

comment:20 Changed 3 years ago by scpmw

Okay, I'll have a look at it. Might be another misattribution along the lines of #10007.

comment:21 Changed 3 years ago by goldfire

The lintAnnots thing may more properly belong in #9630, or perhaps a new ticket...

comment:22 Changed 3 years ago by simonpj

Milestone: 7.12.17.10.2

Let's see if we can get the current patch in 7.10.2 (email to ghc-devs on 7/5/15).

comment:23 Changed 3 years ago by thoughtpolice

Status: mergeclosed

Merged to ghc-7.10.

comment:24 Changed 3 years ago by AlekseyKliger

I've put together a Gist (by excising code from unbound-generics) that runs into this issue. I've not tried the committed fix yet (don't have a GHC tree around), but perhaps someone else would like to play around with it:

https://gist.github.com/lambdageek/c05cd379fe649600070c

     $ ghc -c -O Alpha.hs
     $ ghc -c -v -O Calc.hs
     # note Calc.hs hangs on "Called arity analysis"

To make the compilation take longer, add some more branches to the Expr datatype - Generics are pretty good for building giant blobs of mutually recursive functions.

Note: See TracTickets for help on using tickets.