Opened 3 years ago

Closed 3 years ago

Last modified 2 months ago

#10397 closed bug (fixed)

Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks

Reported by: TobyGoodwin Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler Version: 7.8.4
Keywords: performance Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: see ticket
Blocked By: Blocking:
Related Tickets: #14226 Differential Rev(s): Phab:D892 Phab:D896
Wiki Page:

Description

I've been studying a compiler performance problem in ghc-7.8.4 which is not seen in ghc-7.6.3. Profiling shows the compiler spends 74.3% of its time in elimCommonBlocks (in compiler/cmm/CmmCommonBlockElim.hs).

I came across the problem in real code, and rwbarton on #ghc was able to turn it into a sensible test case, which can be found at http://static.tobold.org/ghc-prof/code/

It's not really a regression, since elimCommonBlocks is not enabled by default in 7.6. It can be turned on with -fnew-codegen, which demonstrates a similar slowdown. However, 7.8 is worse than 7.6, and 7.10 slightly worse still, for reasons that I haven't explored. Compile times for RegBig.hs are:

7.6.3 -O2                 17s
7.6.3 -O2 -fnew-codegen  177s
7.8.4 -O2                230s
7.10.1 -O2               241s

The problem obviously stems from the applicative expression with lots of branches, Foo <$> a <*> b <*> c <*> .... It seems that some level of inlining occurs which means that each extra term in this expression doubles the size of the code. Incidentally, this blowup only occurs when both RegBig.hs and Form.hs are compiled with -O2. I haven't looked into why the blowup occurs, although I do see that a similar blowup happens with 7.6.
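For illustration, here is a hypothetical miniature of the shape involved; the real test case (RegBig.hs / Form.hs, linked above) has many more fields, and the names here are made up:

```haskell
-- Hypothetical miniature of the problematic shape; each additional
-- <*> field roughly doubles the post-inlining code size.
data Foo = Foo Int Int Int Int
  deriving (Eq, Show)

mkFoo :: Maybe Foo
mkFoo = Foo <$> Just 1 <*> Just 2 <*> Just 3 <*> Just 4
```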

...
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 16,525, types: 31,503, coercions: 73}
Result size of Simplifier iteration=2
  = {terms: 29,070, types: 45,079, coercions: 73}
Result size of Simplifier iteration=3
  = {terms: 50,139, types: 62,057, coercions: 73}
Result size of Simplifier iteration=4
  = {terms: 84,172, types: 83,805, coercions: 73}
Result size of Simplifier
  = {terms: 84,172, types: 83,805, coercions: 73}
...

I've looked in some detail at what happens next, tinkering with elimCommonBlocks to try to understand why it runs so slowly in this case.

There are two major problems. First, elimCommonBlocks is worse than O(n²), and the inlining blowup mentioned above results in a very large input: 51695 blocks in this example. (Across the project of 50 or so files where this problem originated, I see the input to elimCommonBlocks is normally a few hundred blocks at most.)

Secondly, elimCommonBlocks tries to avoid comparing whole blocks all the time by computing a hash value over part of each block. This fails to produce sufficiently distinct hash values in this case.

The hash must obviously omit the unique block label, and in practice all labels, and various other things, are omitted. In this case, presumably because the code is blown up from a tiny input, many, many blocks are similar, and their differences are elided by the hash function. For example, 8177 blocks share the hash value 1094.

These 8177 similar blocks all end up in a normal list, which is searched with Data.List.find and the full-block comparison function! Eventually this process concludes that the vast majority are in fact different: for all its hard work, elimCommonBlocks only finds about a dozen common blocks in total.
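In sketch form, the scheme described above looks roughly like this (a standalone sketch with hypothetical names, not GHC's actual code):

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.List as List

-- Bucket elements by hash; within a bucket, search linearly with the
-- full equality test. With 8177 blocks landing in one bucket, the
-- List.find below degenerates into a quadratic scan.
dedupeByHash :: (a -> Int) -> (a -> a -> Bool) -> [a] -> [(a, a)]
dedupeByHash hash eq = go IntMap.empty
  where
    go _    []     = []
    go seen (b:bs) =
      let h      = hash b
          bucket = IntMap.findWithDefault [] h seen
      in case List.find (eq b) bucket of
           Just rep -> (b, rep) : go seen bs                     -- duplicate found
           Nothing  -> go (IntMap.insert h (b : bucket) seen) bs -- new representative
```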

Two blocks which both hash to 1094 are these:

       cCHc:
           _se0q::P64 = R1;
           _c1grM::P64 = _se0q::P64 & 7;
           if (_c1grM::P64 >= 2) goto c1grR; else goto c1grS;

       cWK7:
           _sig3::P64 = R1;
           _c1iXk::P64 = _sig3::P64 & 7;
           if (_c1iXk::P64 >= 2) goto c1iXp; else goto c1iXq;

I suggest the following steps need to be taken.

1. Mitigation

Since this is a real problem in the 7.8 and 7.10 series, it needs a simple but effective mitigation. I suggest a test that the input to elimCommonBlocks is not "too big", where 1000 seems a reasonable cutoff. Another simple mitigation would be to disable common block elimination altogether (after all, -O2 is usually understood to mean "make my code go faster even if it gets bigger").
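The cutoff mitigation could be as small as this (the threshold is the one suggested above; the names are illustrative, not the actual patch):

```haskell
-- Skip an optimisation pass entirely when its input is "too big";
-- 1000 blocks is the cutoff proposed in this ticket.
withCutoff :: Int -> ([a] -> [a]) -> [a] -> [a]
withCutoff limit pass blocks
  | length blocks > limit = blocks      -- input too large: leave it alone
  | otherwise             = pass blocks
```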

I've tried both these possibilities with 7.8.4. The trivial patches are http://static.tobold.org/ghc-prof/0001-cbe-cutoff.patch and http://static.tobold.org/ghc-prof/0001-no-cbe.patch

Compilation times for the test case RegBig.hs are as follows, and there is no difference in size in the .o file.

7.8.4 -O2                        230s
cbe only if input < 1000 blocks   88s
no common block elimination       86s

2. Investigation

I focused on elimCommonBlocks thinking that the inline blowup seen earlier in the compilation is reasonable. On reflection, this is probably not so, and I think someone better acquainted with ghc internals should investigate this.

3. Optimization

(CmmCommonBlockElim.hs already carries the comment -- TODO: Use optimization fuel.)

Since elimCommonBlocks is essentially doing a nub, it will never perform better than O(n²). In fact, it's currently much worse than that: the comparison function changes as duplicate blocks are found, which in turn may create new duplicates, so we have to iterate the nub until no more duplicates are found.
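Structurally, that iteration is a fixpoint computation, roughly (a generic sketch, not GHC's code):

```haskell
-- Re-run a pass until it stops changing its input; each elimination
-- pass may redirect labels and thereby create new duplicates for the
-- next pass to find.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint step x
  | x' == x   = x
  | otherwise = fixpoint step x'
  where x' = step x
```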

I did experiment with improving the hash function. It's possible to include target labels, at the expense of needing to recompute block hash values on each iteration (and, currently, some highly ugly code to convert labels to integers). This considerably reduces the length of some of the hash bucket lists, and so offers a significant speedup:

ghc-7.8.4 -O2  230s
"hack"         170s

Patch for this is http://static.tobold.org/ghc-prof/0001-cbe-hacks.patch

This looks like a promising approach. Including target labels doesn't eliminate all the large hash buckets, so there is scope for further improvements along these lines. I don't see any reason why the hash function should not be as discerning as the equality function, with just occasional accidental clashes.

I've also pondered whether a more radical rewrite of elimCommonBlocks might pay dividends. It's a commonplace that, if you don't need to preserve order, map head . group . sort is a faster nub. Is the order of blocks significant? I'm not sure; they all seem to end with either goto or call. I'd assume that call then falls through to the next block, but I'm not sure. Anyway, presumably we could zip [0..] first, and sort again afterwards, and that should still be asymptotically faster than nub.
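The order-preserving variant runs in O(n log n), assuming the elements have an Ord instance (a standalone sketch, not GHC code):

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- An O(n log n) nub: tag each element with its position, sort by value,
-- keep the first occurrence in each group (sortOn is stable), then
-- restore the original order by position.
ordNub :: Ord a => [a] -> [a]
ordNub xs = map snd
          . sortOn fst
          . map head
          . groupBy ((==) `on` snd)
          . sortOn snd
          $ zip [0 :: Int ..] xs
```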

Of course, this presupposes that blocks could be (sensibly? efficiently?) made an instance of Ord. (If they were, hash buckets could be stored in an ordered data structure and binary search employed, making lookups O(log n) instead of linear.)

I may be able to find time to pursue some of these avenues, but as a very novice ghc hacker, I'd need some strategic direction here.

Any thoughts?

Attachments (1)

ghc10397.tar (10.0 KB) - added by TobyGoodwin 6 months ago.
rwbarton's simple code to reproduce the compiler performance problem


Change History (42)

comment:1 Changed 3 years ago by carter

If an Ord instance can be defined sensibly, perhaps on a newtype wrapping of that block skeleton, I'd be in favor of that. It doesn't have to be a canonical ordering, merely a valid total ordering living on a new type.

comment:2 Changed 3 years ago by nomeata

Maybe the blocks could be grouped not by hash, but rather by the blocks they might possibly jump to. Then for each block B one only needs to compare the blocks that can jump to B. When identical blocks are found, the corresponding may-jump-to sets can simply be unioned.

OTOH, this seems relatively obvious, so there must be a reason why it is not done this way.

comment:3 Changed 3 years ago by nomeata

Oh wow, just looked at the test case you provided. It’s really small:

  19 Form.hs
  46 RegBig.hs
  52 RegSmall.hs
  17 Y.hs
 134 total
but indeed takes very long. We should certainly improve the situation here.

comment:4 Changed 3 years ago by rwbarton

The source is very small, but the simplifier blows things up quite a lot, even though I have not used any INLINE directives. That is probably worth a separate issue. However, even after that blow-up the Core program size is not so unreasonably large that it should take several minutes to compile. (That is, it is unreasonably large for the specific source program which produced it, but not unreasonably large for a different, genuinely large source program.)

comment:5 Changed 3 years ago by nomeata

Right. But the inliner issue needs to be recorded – will you file a separate bug about this?

Then we need a test case that exercises the elimCommonBlocks only. Let’s see if I can write some nonsensical code that is basically only a large number of almost equal blocks.

comment:6 in reply to:  5 Changed 3 years ago by rwbarton

Replying to nomeata:

Right. But the inliner issue needs to be recorded – will you file a separate bug about this?

Done, #10421.

comment:7 Changed 3 years ago by nomeata

Despite the rule that one should set up a proper benchmark first, and then start hacking, I am now validating a patch that groups the blocks not only by hash, but also by the list of successor labels. These should be equal for all equal blocks, so we do not lose anything, but this should also give much smaller lists to search through linearly, and they can be updated with a new substitution without traversing the whole heap.
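The grouping key described here might be sketched like this (types simplified; Label stood in by a type parameter, names hypothetical):

```haskell
import qualified Data.Map.Strict as Map

-- Bucket blocks by (hash, successor labels); equal blocks necessarily
-- share both, so no candidates are lost, but the buckets to be searched
-- linearly get much smaller.
bucketBySuccessors :: Ord lbl => [(Int, [lbl], blk)] -> Map.Map (Int, [lbl]) [blk]
bucketBySuccessors =
  foldr (\(h, succs, b) -> Map.insertWith (++) (h, succs) [b]) Map.empty
```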

I can probably report if that helps tomorrow.

There are more ways to improve the code (e.g. not repeatedly building this map, but rather working with one), but before I do that I really should get some benchmarks for this issue :-)

comment:8 Changed 3 years ago by nomeata

I have Phab:D892, but I haven’t done any measurements yet; for now it’s “Good night”.

comment:9 Changed 3 years ago by nomeata

Compiling your RegBig.hs. Without elimCommonBlocks:

<<ghc: 30310299032 bytes, 1461 GCs, 80330740/286438424 avg/max bytes residency (30 samples), 786M in use, 0.001 INIT (0.000 elapsed), 20.887 MUT (33.307 elapsed), 10.015 GC (10.013 elapsed) :ghc>>

Before:

<<ghc: 97671621696 bytes, 2579 GCs, 77685892/300385928 avg/max bytes residency (31 samples), 777M in use, 0.001 INIT (0.001 elapsed), 134.310 MUT (146.546 elapsed), 10.639 GC (10.637 elapsed) :ghc>>

after my patch:

<<ghc: 69883688328 bytes, 3038 GCs, 83489944/273420096 avg/max bytes residency (39 samples), 750M in use, 0.001 INIT (0.001 elapsed), 55.414 MUT (67.757 elapsed), 14.896 GC (14.892 elapsed) :ghc>>

So it still takes long, but the time spent in elimCommonBlocks is cut down by ⅔.

comment:10 Changed 3 years ago by nomeata

According to a profiled run, elimCommonBlocks now accounts for 24.4% of the runtime, second to sequenceBlocks in AsmCodeGen with 59.5% (which I did not look at).

elimCommonBlocks still spends most of the time comparing blocks, most of the time uselessly: It iterates, and in each iteration compares all blocks with the same hash key (hash + list of successor labels) again. This is pointless: If the list of successor labels has not changed, then the blocks cannot suddenly be equal.

I have an idea how to avoid that. Let’s see if it pays off...

comment:11 Changed 3 years ago by nomeata

Looks like it does: Down to 8.52% of the runtime. Cut it in half by replacing

groupBy :: (a -> Int) -> [a] -> [[a]]
groupBy f xs = eltsUFM $ listToUFM_C (++) [(f x, [x]) | x <- xs]

with

groupBy :: (a -> Int) -> [a] -> [[a]]
groupBy f xs = eltsUFM $ List.foldl' go emptyUFM xs
  where go m x = alterUFM (Just . maybe [x] (x:)) m (f x) 

to avoid the risk that listToUFM_C would ++ the lists the wrong way.
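For illustration, here is the same foldl'-with-alter pattern over plain Data.IntMap instead of GHC's UniqFM (a sketch of the idea, not the actual patch):

```haskell
import qualified Data.IntMap.Strict as IntMap
import Data.List (foldl')

-- Buckets are built by prepending, so each bucket comes out in reverse
-- input order: predictable, unlike relying on which way listToUFM_C
-- happens to (++) colliding entries.
groupByKey :: (a -> Int) -> [a] -> [[a]]
groupByKey f = IntMap.elems . foldl' go IntMap.empty
  where go m x = IntMap.alter (Just . maybe [x] (x :)) (f x) m
```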

Last edited 3 years ago by nomeata (previous) (diff)

comment:12 Changed 3 years ago by nomeata

Differential Rev(s): Phab:D892
Status: new → patch

comment:13 Changed 3 years ago by nomeata

While working on this I found another bottleneck in the code generator, #10422. With that fixed¹ as well, we go from

<<ghc: 97671622904 bytes, 3397 GCs, 80319390/302142488 avg/max bytes residency (30 samples), 788M in use, 0.001 INIT (0.001 elapsed), 131.423 MUT (143.189 elapsed), 9.869 GC (9.877 elapsed) :ghc>>

to

<<ghc: 41423234208 bytes, 2190 GCs, 77441625/272967128 avg/max bytes residency (38 samples), 750M in use, 0.001 INIT (0.001 elapsed), 31.470 MUT (42.974 elapsed), 13.051 GC (13.059 elapsed) :ghc>>

Yay!

¹ patch pending validation, maybe I accidentally broke everything

comment:14 Changed 3 years ago by nomeata

And just to share the joy, here a side-by-side of the profile output, before and after:

          Sat May 16 18:12 2015 Time and Allocation Profiling Rep|          Sat May 16 18:06 2015 Time and Allocation Profiling Rep
                                                                 |  
             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskel|             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskel
                                                                 |  
          total time  =      198.47 secs   (198467 ticks @ 1000 u|          total time  =       14.22 secs   (14225 ticks @ 1000 us
          total alloc = 97,018,068,880 bytes  (excludes profiling|          total alloc = 14,686,848,016 bytes  (excludes profiling
                                                                 |  
  COST CENTRE      MODULE       %time %alloc                     |  COST CENTRE      MODULE          %time %alloc                  
                                                                 |  
  elimCommonBlocks CmmPipeline   80.5   69.1                     |  elimCommonBlocks CmmPipeline      19.9   21.2                  
  sequenceBlocks   AsmCodeGen    13.6   19.2                     |  SimplTopBinds    SimplCore        10.6   10.2                  
  SimplTopBinds    SimplCore      0.8    1.5                     |  pprNativeCode    AsmCodeGen        7.6    8.5                  
  pprNativeCode    AsmCodeGen     0.7    1.3                     |  regLiveness      AsmCodeGen        7.6    6.8                  
  regLiveness      AsmCodeGen     0.5    1.0                     |  StgCmm           HscMain           5.6    5.3                  
  RegAlloc         AsmCodeGen     0.4    1.1                     |  RegAlloc         AsmCodeGen        5.5    7.1                  

comment:15 Changed 3 years ago by carter

can these patches apply cleanly to the 7.10 branch? if so i'd like to nominate them for 7.10.2!

comment:16 Changed 3 years ago by nomeata

A quick git rebase --onto origin/ghc-7.10 master goes through fine (ignoring the test suite number update, which should probably be re-done anyways). But let’s first get them into master :-)

comment:17 Changed 3 years ago by Joachim Breitner <mail@…>

In c256357242ee2dd282fd0516260edccbb7617244/ghc:

Speed up elimCommonBlocks by grouping blocks also by outgoing labels

This is an attempt to improve the situation described in #10397, where
the linear scan of possible candidates for commoning up is far too
expensive. There is (ever) more room for improvement, but this is a
start.

Differential Revision: https://phabricator.haskell.org/D892

comment:18 Changed 3 years ago by Joachim Breitner <mail@…>

In 8e4dc8fb63b8d3bfee485c1c830776f3ed704f4d/ghc:

Greatly speed up nativeCodeGen/seqBlocks

When working on #10397, I noticed that "reorder" in
nativeCodeGen/seqBlocks took more than 60% of the time. With this
refactoring, it does not even show up in the profile any more. This
fixes #10422.

Differential Revision: https://phabricator.haskell.org/D893

comment:19 Changed 3 years ago by nomeata

Status: patch → merge

comment:20 Changed 3 years ago by carter

Milestone: 7.10.2

setting a milestone so it's on the table for 7.10.2


comment:22 Changed 3 years ago by simonpj

That's a tremendous improvement, thanks Joachim.

That said, 20% of compilation time in one, relatively minor, optimisation is far too much. I wonder how we could improve matters further? For example, the bigger the block the less likely it is to be identical, but the more expensive it is to compare (I guess). Maybe we can cut off at some block size?

Also why are you comparing (hash, list of successor labels) rather than just including the successor labels in the hash?

Simon

comment:23 Changed 3 years ago by nomeata

Note that this is a rather extreme test case. I’m sure that for a lot of optimizations we can construct a case where they account for most of the compilation time. If it were 20% on a real-world example, I’d fully agree. Right now, I’m fine with the current state of affairs.

For example, the bigger the block the less likely it is to be identical, but the more expensive it is to compare (I guess). Maybe we can cut off at some block size?

I doubt it. The bigger the blocks, the more likely it is that the hash is different.

Also why are you comparing (hash, list of successor labels) rather than just including the successor labels in the hash?

Because the successor labels change (or rather the equality on them does) as we common up blocks; the hash is calculated exactly once. I guess you are implicitly asking for more comments; I’ve added this to the explanatory text already there.

I wonder if it is faster to hash the list of labels and group them using a single level IntMap instead of the current trie based on nested IntMap.... That’s worth a try.

comment:24 in reply to:  23 Changed 3 years ago by nomeata

Replying to nomeata:

I wonder if it is faster to hash the list of labels and group them using a single level IntMap instead of the current trie based on nested IntMap.... That’s worth a try.

no, it’s not. And profiling shows why: the time is still spent mostly in eqBlockBodyWith. With my refactoring, the number of calls to that function is minimal anyway, so the only ways to improve the situation are

  • a better hash function (it currently does not take operators into account, and maybe it can also take “internal” functions into account)
  • no linear search, by using a trie for blocks, similar to how we do it for Core expressions.
  • lose precision.

I’ll add some debug output to see what these blocks that cannot be merged look like.

comment:25 Changed 3 years ago by nomeata

<rant>Geez, how do I print these CmmBlocks? The lack of ubiquitous Outputable or Show instances can be quite a time waster.... ah import PprCmm () helps.</rant>

Indeed, the hash function is simply not fine-grained enough. Including the uniques of local registers yields

             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskell/ghc|             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskell/gh
                                                                      |  
          total time  =       13.83 secs   (13831 ticks @ 1000 us, 1 p|          total time  =       11.79 secs   (11791 ticks @ 1000 us, 1 
          total alloc = 14,684,289,032 bytes  (excludes profiling over|          total alloc = 11,894,920,976 bytes  (excludes profiling ove
                                                                      |  
  COST CENTRE      MODULE          %time %alloc                       |  COST CENTRE      MODULE          %time %alloc
                                                                      |  
  elimCommonBlocks CmmPipeline      17.4   21.2                       |  SimplTopBinds    SimplCore        12.7   12.6                      
  SimplTopBinds    SimplCore        11.1   10.2                       |  regLiveness      AsmCodeGen        9.4    8.4                      
  regLiveness      AsmCodeGen        7.7    6.8                       |  pprNativeCode    AsmCodeGen        8.3   10.5                      
  pprNativeCode    AsmCodeGen        7.0    8.5                       |  RegAlloc         AsmCodeGen        7.1    8.8                      
  RegAlloc         AsmCodeGen        5.9    7.1                       |  StgCmm           HscMain           6.9    6.5                      
  StgCmm           HscMain           5.6    5.3                       |  sink             CmmPipeline       6.3    5.9                      
  sink             CmmPipeline       5.3    4.7                       |  genMachCode      AsmCodeGen        3.9    3.7                      
  genMachCode      AsmCodeGen        3.5    3.0                       |  layoutStack      CmmPipeline       3.9    4.0                      
  layoutStack      CmmPipeline       3.5    3.2                       |  do_block         Hoopl.Dataflow    3.2    1.9                      
  do_block         Hoopl.Dataflow    2.7    1.5                       |  postorderDfs     CmmUtils          2.9    2.4                      
  NativeCodeGen    CodeOutput        2.4    2.2                       |  NativeCodeGen    CodeOutput        2.7    2.7                      
  postorderDfs     CmmUtils          2.3    2.0                       |  elimCommonBlocks CmmPipeline       2.5    2.8                      
  sequenceBlocks   AsmCodeGen        2.0    1.8                       |  sequenceBlocks   

which should now finally be sufficient. I'll create a DR for validation and then push this.

I also replaced my fancy trie by a plain Data.Map. It turned out to be not performance critical, so let’s remove the custom code.

comment:26 Changed 3 years ago by nomeata

Differential Rev(s): Phab:D892 → Phab:D892 Phab:D896

comment:27 Changed 3 years ago by simonpj

Yay, great stuff.

comment:28 Changed 3 years ago by Joachim Breitner <mail@…>

In 73f836f5d57a3106029b573c42f83d2039d21d89/ghc:

CmmCommonBlockElim: Improve hash function

Previously, the hash function used to cut down the number of block
comparisons did not take local registers into account, causing far too
many similar, but different blocks to be considered candidates for the
(expensive!) comparison.

Adding register to the hash takes CmmCommonBlockElim's share of the
runtime of the example in #10397 from 17% to 2.5%, and eliminates all
unwanted hash collisions.

This patch also replaces the fancy trie by a plain Data.Map. It turned
out to be not performance critical, so this simplifies the code.

Differential Revision: https://phabricator.haskell.org/D896

comment:29 Changed 3 years ago by thoughtpolice

Resolution: fixed
Status: merge → closed

Merged to ghc-7.10.

comment:30 Changed 3 years ago by nomeata

Status: closed → merge

I guess 73f836f5d57a3106029b573c42f83d2039d21d89, a further improvement over the previous patch, should also go into 7.10.

comment:31 Changed 3 years ago by TobyGoodwin

Fantastic work on this. Can I just point out that it is real-world code? It's the yesod backend for this web page and I stumbled across this problem simply by trying to upgrade from 7.6 to 7.8.

Which brings me to another point: it's great that this will be in 7.10. Is there any chance of getting it into 7.8 too? Most of the distros (Debian, Ubuntu LTS, even Fedora) have not made the jump to 7.8 yet, and when they do I'd hate for anyone else to have the same experience that I did, of a massive slowdown in compile times.

comment:32 Changed 3 years ago by simonmar

@nomeata: great job on these perf improvements!

comment:33 Changed 3 years ago by thoughtpolice

Joachim, I also picked this last one over to ghc-7.10, so these will all hit 7.10.2.

@TobyGoodwin - Unfortunately, we aren't going to be doing any more releases of the 7.8.x series at this point, since the branch is quite out of date. In theory, all of the current performance fixes Joachim has made should still apply to the 7.8 branch, although there will probably be conflicts from things like whitespace changes and the tabs->spaces conversion that happened.

I'm not sure what the policy for upstream distributions here is, especially if they still aren't even shipping 7.8 to users - and if they aren't, can they not use 7.10? I have no clue what timeline most distros use for upgrading major versions of things like GHC. Luckily, for Debian, Joachim is the maintainer, so I'm sure he'll know the deal. :)

comment:34 Changed 3 years ago by thoughtpolice

Status: merge → closed

comment:35 Changed 10 months ago by rwbarton

Hi TobyGoodwin, do you still have the reproducer for this issue? I referenced it in #10421, not expecting it to disappear... should have reuploaded it myself.

Changed 6 months ago by TobyGoodwin

Attachment: ghc10397.tar added

rwbarton's simple code to reproduce the compiler performance problem

comment:36 Changed 6 months ago by TobyGoodwin

Argh! (1) I only just saw this message a couple of days ago. Not sure why I didn't get / read / act upon email from trac.

Argh! (2) I never meant to delete it from my website, I must have just been trigger-happy one day. And for some reason it had never been added to the git repo that backs the site.

After a deal of searching, I did manage to find a copy. I've reinstated it on my website, and have attached it to this ticket.

comment:37 Changed 6 months ago by bgamari

Thanks TobyGoodwin!

comment:38 Changed 6 months ago by simonpj

So is the problem solved (by comment:28) or not?

comment:39 Changed 2 months ago by bgamari

Unfortunately the commit from comment:28 severely reduces our ability to CBE. To call it "broken" may be a bit strong, but it's certainly a regression, which I am tracking as #14226.

comment:40 Changed 2 months ago by simonpj

severely reduces our ability to CBE

Do you mean: it makes un-equal blocks that should be considered equal? Can you give an example? (For truly un-equal blocks we want to restrict our CBE: they aren't equal!)

comment:41 Changed 2 months ago by bgamari

I should correct my statement from comment:39 as I misunderstood the intent of the patch from comment:28:

The commit does not change the behavior of CBE; rather, it only adjusts the hash function used for "rough" comparison to more closely reflect the equality relation used later.

However, I would argue that our CBE is still significantly less aggressive than it should be. Let's continue this discussion on #14226.
