Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#10370 closed bug (fixed)

Compile time regression in OpenGLRaw

Reported by: michalt Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler Version: 7.10.1
Keywords: Cc: paolo.veronelli@…, simonpj@…, ghc.haskell.org@…, michalt
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:

Description (last modified by simonpj)

As reported by Paolo Veronelli:

Hello, I'm using ghc 7.10.1 to compile OpenGLRaw which is now impossible with -O1 and -O2 due to "ghc : out of memory error" on a 4GB linux host. The file making memory explode is Graphics.Rendering.OpenGL.Raw.Functions. With -O0 it uses 600 MB.

The file is really huge, but I could compile it with prior versions of ghc.

I've had a look and indeed the file takes way more memory to compile with HEAD. Looking at the heap profiles, it seems that the problem is in GHC itself (space leak somewhere?). I'll attach the heap profiles (from compiling the single module itself, i.e., touching the file and running prof GHC on it).

Paolo also says: ere is another file , which is small, which cannot be compiled within 4GB memory.

https://raw.githubusercontent.com/benl23x5/gloss/master/gloss-examples/raster/Fluid/src-repa/Stage/Linear.hs

Attachments (5)

module-hc.png (161.1 KB) - added by michalt 3 years ago.
profiled with "-RTS -hc -p"
module-hd.png (138.1 KB) - added by michalt 3 years ago.
profiled with "-RTS -hd -p"
module-hy.png (176.3 KB) - added by michalt 3 years ago.
profiled with "-RTS -hy -p"
ghc-show-passes-with-revert (4.6 KB) - added by michalt 3 years ago.
ghc-show-passes-without-revert (4.6 KB) - added by michalt 3 years ago.

Download all attachments as: .zip

Change History (22)

Changed 3 years ago by michalt

Attachment: module-hc.png added

profiled with "-RTS -hc -p"

Changed 3 years ago by michalt

Attachment: module-hd.png added

profiled with "-RTS -hd -p"

Changed 3 years ago by michalt

Attachment: module-hy.png added

profiled with "-RTS -hy -p"

comment:1 Changed 3 years ago by michalt

I've had another quick look at this and compiled profiled GHC with addition of {-# OPTIONS_GHC -fprof-auto #-} to simplCore/Simplify.hs. The results are interesting:

	Sun May  3 17:33 2015 Time and Allocation Profiling Report  (Final)

	   ghc +RTS -hc -L100 -p -RTS [...] Graphics.Rendering.OpenGL.Raw.Functions

	total time  =       42.38 secs   (42385 ticks @ 1000 us, 1 processor)
	total alloc = 65,222,063,200 bytes  (excludes profiling overheads)

COST CENTRE               MODULE       %time %alloc

tc_rn_src_decls           TcRnDriver    15.1   10.4
completeCall.(...)        Simplify       9.5   22.0
simplCast.add_coerce.arg' Simplify       9.1   21.8
pprNativeCode             AsmCodeGen     4.2    3.2
Parser                    HscMain        3.3    4.2
RegAlloc                  AsmCodeGen     3.0    2.7
regLiveness               AsmCodeGen     2.5    2.0
simplLam                  Simplify       2.5    1.3
OccAnal                   SimplCore      2.5    1.8
StgCmm                    HscMain        2.3    1.2
CallArity                 SimplCore      2.2    3.4
FloatOutwards             SimplCore      1.7    1.1
sink                      CmmPipeline    1.7    1.1
simplAlts                 Simplify       1.5    0.8
genMachCode               AsmCodeGen     1.5    0.9
deSugar                   HscMain        1.5    1.3
simplAlt                  Simplify       1.4    0.7
simplUnfolding            Simplify       1.3    0.8
layoutStack               CmmPipeline    1.2    0.9
completeCall.info         Simplify       1.2    0.6
completeBind              Simplify       1.2    0.6
NativeCodeGen             CodeOutput     1.2    0.8
rebuildCall               Simplify       1.1    1.0
Simplify                  SimplCore      1.1    0.0

Looking at git blame the Simplify.simplCast didn't change much recently... Next week I'll try to have a look at Core generated by 7.8.4 and HEAD (maybe some stage before simplifier generates something weird?).

comment:2 Changed 3 years ago by liyang

Cc: ghc.haskell.org@… added

comment:3 Changed 3 years ago by rwbarton

I created a synthetic test case that exhibits the same space blow-up:

modu = unlines $
       "module Out where" :
       "import Control.Monad (forever)" :
       [ var ++ " :: IO (); " ++ var ++ " = forever $ putStrLn \"" ++ var ++ "\"" |
         n <- [1..3000], let var = "a" ++ show n ]
main = writeFile "Out.hs" modu

The string literals seem to be important, I didn't manage to find a variation that blew up without them.

comment:4 Changed 3 years ago by simonpj

Sven Panne says: To alleviate the pain a bit, I've uploaded a new version of OpenGLRaw (2.5.0.0) to Hackage, containing 2 improvements:

  • foreign import "dynamic"s with the same signature are re-used, cutting down their number from 3062 to 864.
  • Those foreign import "dynamic"s live in a separate module now.

Travis CI seems to be happy with these changes (the VMs there don't have much memory, either), although Haddock still seems to eat memory like hell. But that's a different story...

comment:5 Changed 3 years ago by simonpj

This may or may not be related, but Greg Weber reports: We have observed issues with compile-time inlining taking much longer in newer versions of GHC in some cases: https://github.com/larskuhtz/toCaseFoldBuildTimes. This particular issue was reported to the text repo: https://github.com/bos/text/issues/116

The former link has a helpful collection of associated data.

Last edited 3 years ago by simonpj (previous) (diff)

comment:6 Changed 3 years ago by simonpj

Description: modified (diff)

comment:7 Changed 3 years ago by michalt

Cc: michalt added

@rwbarton Thanks a lot for a smaller use case!

I had a brief look at generated Core (for both 7.8.4 and HEAD) and couldn't really find anything suspicious. Also core stats seem similar -- we generate almost identical number of terms/types/coercions. (although I'm far from an expert so take that with a grain of salt)

The good news is that I managed to narrow it down to b8392ae76a6d39c57be94b5ba041c450ab479e2b What I did: initially I just wanted to bisect but couldn't really get GHC to build before Jan 2015. So I checkout a commit from early January (the problem is reproducible there) and reverted the above patch (at which point the problem is no longer reproducible).

I'll try to look into that commit tomorrow.

comment:8 Changed 3 years ago by michalt

I'm a bit stuck. I've had a look at that commit, but I don't really understand simplifier well enough to find the reason of the blow-up.

Any ideas are welcome! :-)

comment:9 Changed 3 years ago by simonpj

Can you give the output of -dshow-passes before and after reverting the patch? That should show if there is a big change in code size. Thanks.

Changed 3 years ago by michalt

Attachment: ghc-show-passes-with-revert added

Changed 3 years ago by michalt

comment:10 Changed 3 years ago by michalt

tl;dr: No difference

If anyone wants to try to reproduce: checkout b0379819e46796047c1574a6abccf186afd27afa and then revert b8392ae76a6d39c57be94b5ba041c450ab479e2b

comment:11 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In c89bd681d34d3339771ebdde8aa468b1d9ab042b/ghc:

Fix quadratic behaviour in tidyOccName

In the test program from comment:3 of Trac #10370, it turned out
that 25% of all compile time was going in OccName.tidyOccName!

It was all becuase the algorithm for finding an unused OccName
had a quadratic case.

This patch fixes it.  THe effect is pretty big:

Before:
	total time  =       34.30 secs   (34295 ticks @ 1000 us, 1 processor)
	total alloc = 15,496,011,168 bytes  (excludes profiling overheads)

After
	total time  =       25.41 secs   (25415 ticks @ 1000 us, 1 processor)
	total alloc = 11,812,744,816 bytes  (excludes profiling overheads)

comment:12 Changed 3 years ago by Simon Peyton Jones <simonpj@…>

In 45d9a15c4b85a2ed89579106bdafd84accf2cb39/ghc:

Fix a huge space leak in the mighty Simplifier

This long-standing, terrible, adn somewhat subtle bug was exposed
by Trac #10370, thanks to Reid Barton's brilliant test case (comment:3).

The effect is large on the Trac #10370 test.
Here is what the profile report says:

Before:
 total time  =       24.35 secs   (24353 ticks @ 1000 us, 1 processor)
 total alloc = 11,864,360,816 bytes  (excludes profiling overheads)

After:
 total time  =       21.16 secs   (21160 ticks @ 1000 us, 1 processor)
 total alloc = 7,947,141,136 bytes  (excludes profiling overheads)

The /combined/ effect of the tidyOccName fix, plus this one, is dramtic
for Trac #10370.  Here is what +RTS -s says:

Before:
  15,490,210,952 bytes allocated in the heap
   1,783,919,456 bytes maximum residency (20 sample(s))

  MUT     time   30.117s  ( 31.383s elapsed)
  GC      time   90.103s  ( 90.107s elapsed)
  Total   time  120.843s  (122.065s elapsed)

After:
   7,928,671,936 bytes allocated in the heap
      52,914,832 bytes maximum residency (25 sample(s))

  MUT     time   13.912s  ( 15.110s elapsed)
  GC      time    6.809s  (  6.808s elapsed)
  Total   time   20.789s  ( 21.954s elapsed)

- Heap allocation halved
- Residency cut by a factor of more than 30.
- ELapsed time cut by a factor of 6

Not bad!

The details
~~~~~~~~~~~
The culprit was SimplEnv.mkCoreSubst, which used mapVarEnv to do some
impedence-matching from the substitituion used by the simplifier to
the one used by CoreSubst.  But the impedence-mactching was recursive!

  mk_subst tv_env cv_env id_env
    = CoreSubst.mkSubst in_scope tv_env cv_env (mapVarEnv fiddle id_env)

  fiddle (DoneEx e)          = e
  fiddle (DoneId v)          = Var v
  fiddle (ContEx tv cv id e) = CoreSubst.substExpr (mk_subst tv cv id) e

Inside fiddle, in the ContEx case, we may do another whole level of
fiddle.  And so on.  Moreover, UniqFM (which is built on Data.IntMap) is
strict, so the fiddling is done eagerly.  I didn't wok through all the
details but the result is a gargatuan blow-up of entirely unnecessary work.

Laziness would make this go away, I think, but I don't want to mess
with IntMap.  And in any case, the impedence matching is a royal pain.

In the end I simply ceased trying to use CoreSubst.substExpr in the
simplifier, and instead just use simplExpr.  That does mean bit of
duplication; e.g.  new code for simplRules.  But it's not a big deal
and it's far more direct and easy to reason about.

A bit of knock-on refactoring:

 * Data type ArgSummary moves to CoreUnfold.

 * interestingArg moves from CoreUnfold to SimplUtils, and gets a
   SimplEnv argument which can be used when we encounter a variable.

 * simplLamBndrs, addBndrRules move from SimplEnv to Simplify
   (because they now calls simplUnfolding, simplRules resp)

 * SimplUtils.substExpr, substUnfolding, mkCoreSubst die completely

 * In Simplify some several functions that were previously pure
   substitution-based functions are now monadic:
     - addBndrRules, simplRule
     - addCoerce, add_coerce in simplCast

 * In case 2c of Simplify.rebuildCase, there was a pretty disgusting
   expression-substitution taking place for 'rhs'; and we really don't
   want to make that monadic becuase 'rhs' can be big.
   Solution: reduce the arity of the rules for seq.
   See Note [User-defined RULES for seq] in MkId.

comment:13 Changed 3 years ago by simonpj

The commit messages tell the story. Reid's example in comment:3 showed up TWO separate, substantial non-linear performance holes in GHC, both of which I have now fixed. Reid, your example was amazingly helpful.

Austin I have not added a perf/compiler test case. Could you do so please? I wasn't sure that a 20-second compile time was acceptable. I'm leaving the ticket open for that reason.

Simon

comment:14 Changed 3 years ago by thoughtpolice

Milestone: 7.10.2
Status: newmerge

Yeah, so we want something smaller, we don't need the 20 second variant. We can probably just cut Reid's test case in half or even a third and check that in, double checking the behavior is still fixed (which is obviously pretty easy). I'll get on it shortly.

I'm also moving this to 7.10.2 - the tidyOccName fix already applies cleanly in my tree. The evil simplifier bug applies modulo a small change in SimplCore - so this is an excellent bugfix to go in!

comment:15 Changed 2 years ago by thoughtpolice

Resolution: fixed
Status: mergeclosed

Merged to ghc-7.10.

comment:16 Changed 2 years ago by bgamari

I'ved opened Phab:D1352 to add a performance test for this issue.

comment:17 Changed 2 years ago by Ben Gamari <ben@…>

In c2fab842/ghc:

Add testcase for #10370

Test Plan: Validate.

Reviewers: austin

Reviewed By: austin

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D1352

GHC Trac Issues: #10370
Note: See TracTickets for help on using tickets.