Opened 3 years ago

Closed 3 years ago

Last modified 2 months ago

#10397 closed bug (fixed)

Compiler performance regression 7.6 -> 7.8 in elimCommonBlocks

Reported by: TobyGoodwin Owned by:
Priority: normal Milestone: 7.10.2
Component: Compiler Version: 7.8.4
Keywords: performance Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case: see ticket
Blocked By: Blocking:
Related Tickets: #14226 Differential Rev(s): Phab:D892 Phab:D896
Wiki Page:

Description

I've been studying a compiler performance problem in ghc-7.8.4 which is not seen in ghc-7.6.3. Profiling shows the compiler spends 74.3% of its time in elimCommonBlocks (in compiler/cmm/CmmCommonBlockElim.hs).

I came across the problem in real code, and rwbarton on #ghc was able to turn it into a sensible test case, which can be found at http://static.tobold.org/ghc-prof/code/

It's not really a regression, since elimCommonBlocks is not enabled by default in 7.6. It can be turned on with -fnew-codegen, which demonstrates a similar slowdown. However, 7.8 is worse than 7.6, and 7.10 slightly worse still, for reasons that I haven't explored. Compile times for RegBig.hs are:

7.6.3 -O2                 17s
7.6.3 -O2 -fnew-codegen  177s
7.8.4 -O2                230s
7.10.1 -O2               241s

The problem obviously stems from the applicative expression with lots of branches, Foo <$> a <*> b <*> c <*> .... It seems that some level of inlining occurs which means that each extra term in this expression doubles the size of the code. Incidentally, this blowup only occurs when both RegBig.hs and Form.hs are compiled with -O2. I haven't looked into why the blowup occurs, although I do see that a similar blowup happens with 7.6.
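For illustration, here is a hypothetical miniature of the shape involved; the real test case (RegBig.hs / Form.hs, linked above) has many more fields, and the names here are made up:

```haskell
-- Hypothetical miniature of the problematic shape; each additional
-- <*> field roughly doubles the post-inlining code size.
data Foo = Foo Int Int Int Int
  deriving (Eq, Show)

mkFoo :: Maybe Foo
mkFoo = Foo <$> Just 1 <*> Just 2 <*> Just 3 <*> Just 4
```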

...
*** Simplifier:
Result size of Simplifier iteration=1
  = {terms: 16,525, types: 31,503, coercions: 73}
Result size of Simplifier iteration=2
  = {terms: 29,070, types: 45,079, coercions: 73}
Result size of Simplifier iteration=3
  = {terms: 50,139, types: 62,057, coercions: 73}
Result size of Simplifier iteration=4
  = {terms: 84,172, types: 83,805, coercions: 73}
Result size of Simplifier
  = {terms: 84,172, types: 83,805, coercions: 73}
...

I've looked in some detail at what happens next, tinkering with elimCommonBlocks to try to understand why it runs so slowly in this case.

There are two major problems. First, elimCommonBlocks is worse than O(n²), and the inlining blowup mentioned above results in a very large input: 51695 blocks in this example. (Across the project of 50 or so files where this problem originated, I see the input to elimCommonBlocks is normally a few hundred blocks at most.)

Secondly, elimCommonBlocks tries to avoid comparing whole blocks all the time by computing a hash value over part of each block. This fails to produce sufficiently distinct hash values in this case.

The hash must obviously omit the unique block label, and in practice all labels, and various other things, are omitted. In this case, presumably because the code is blown up from a tiny input, many, many blocks are similar, and their differences are elided by the hash function. For example, 8177 blocks share the hash value 1094.

These 8177 similar blocks all end up in a normal list, which is searched with Data.List.find and the full-block comparison function! Eventually this process concludes that the vast majority are in fact different: for all its hard work, elimCommonBlocks only finds about a dozen common blocks in total.
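In sketch form, the scheme described above looks roughly like this (a standalone sketch with hypothetical names, not GHC's actual code):

```haskell
import qualified Data.IntMap.Strict as IntMap
import qualified Data.List as List

-- Bucket elements by hash; within a bucket, search linearly with the
-- full equality test. With 8177 blocks landing in one bucket, the
-- List.find below degenerates into a quadratic scan.
dedupeByHash :: (a -> Int) -> (a -> a -> Bool) -> [a] -> [(a, a)]
dedupeByHash hash eq = go IntMap.empty
  where
    go _    []     = []
    go seen (b:bs) =
      let h      = hash b
          bucket = IntMap.findWithDefault [] h seen
      in case List.find (eq b) bucket of
           Just rep -> (b, rep) : go seen bs                     -- duplicate found
           Nothing  -> go (IntMap.insert h (b : bucket) seen) bs -- new representative
```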

Two blocks which both hash to 1094 are these:

       cCHc:
           _se0q::P64 = R1;
           _c1grM::P64 = _se0q::P64 & 7;
           if (_c1grM::P64 >= 2) goto c1grR; else goto c1grS;

       cWK7:
           _sig3::P64 = R1;
           _c1iXk::P64 = _sig3::P64 & 7;
           if (_c1iXk::P64 >= 2) goto c1iXp; else goto c1iXq;

I suggest the following steps need to be taken.

1. Mitigation

Since this is a real problem in the 7.8 and 7.10 series, it needs a simple but effective mitigation. I suggest a test that the input to elimCommonBlocks is not "too big", where 1000 seems a reasonable cutoff. Another simple mitigation would be to disable common block elimination altogether (after all, -O2 is usually understood to mean "make my code go faster even if it gets bigger").
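The cutoff mitigation could be as small as this (the threshold is the one suggested above; the names are illustrative, not the actual patch):

```haskell
-- Skip an optimisation pass entirely when its input is "too big";
-- 1000 blocks is the cutoff proposed in this ticket.
withCutoff :: Int -> ([a] -> [a]) -> [a] -> [a]
withCutoff limit pass blocks
  | length blocks > limit = blocks      -- input too large: leave it alone
  | otherwise             = pass blocks
```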

I've tried both these possibilities with 7.8.4. The trivial patches are http://static.tobold.org/ghc-prof/0001-cbe-cutoff.patch and http://static.tobold.org/ghc-prof/0001-no-cbe.patch

Compilation times for the test case RegBig.hs are as follows, and there is no difference in size in the .o file.

7.8.4 -O2                        230s
cbe only if input < 1000 blocks   88s
no common block elimination       86s

2. Investigation

I focused on elimCommonBlocks thinking that the inline blowup seen earlier in the compilation is reasonable. On reflection, this is probably not so, and I think someone better acquainted with ghc internals should investigate this.

3. Optimization

(CmmCommonBlockElim.hs already carries the comment -- TODO: Use optimization fuel.)

Since elimCommonBlocks is essentially doing a nub, it will never perform better than O(n²). In fact, it's currently much worse than that: the comparison function changes as duplicate blocks are found, which in turn may create new duplicates, so we have to iterate the nub until no more duplicates are found.
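Structurally, that iteration is a fixpoint computation, roughly (a generic sketch, not GHC's code):

```haskell
-- Re-run a pass until it stops changing its input; each elimination
-- pass may redirect labels and thereby create new duplicates for the
-- next pass to find.
fixpoint :: Eq a => (a -> a) -> a -> a
fixpoint step x
  | x' == x   = x
  | otherwise = fixpoint step x'
  where x' = step x
```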

I did experiment with improving the hash function. It's possible to include target labels, at the expense of needing to recompute block hash values on each iteration (and, currently, some highly ugly code to convert labels to integers). This considerably reduces the length of some of the hash bucket lists, and so offers a significant speedup:

ghc-7.8.4 -O2  230s
"hack"         170s

Patch for this is http://static.tobold.org/ghc-prof/0001-cbe-hacks.patch

This looks like a promising approach. Including target labels doesn't eliminate all the large hash buckets, so there is scope for further improvements along these lines. I don't see any reason why the hash function should not be as discerning as the equality function, with just occasional accidental clashes.

I've also pondered whether a more radical rewrite of elimCommonBlocks might pay dividends. It's a commonplace that, if you don't need to preserve order, map head . group . sort is a faster nub. Is the order of blocks significant? I'm not sure; they all seem to end with either goto or call. I'd assume that call then falls through to the next block, but I'm not sure. Anyway, presumably we could zip [0..] first, and sort again afterwards, and that should still be asymptotically faster than nub.
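The order-preserving variant runs in O(n log n), assuming the elements have an Ord instance (a standalone sketch, not GHC code):

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- An O(n log n) nub: tag each element with its position, sort by value,
-- keep the first occurrence in each group (sortOn is stable), then
-- restore the original order by position.
ordNub :: Ord a => [a] -> [a]
ordNub xs = map snd
          . sortOn fst
          . map head
          . groupBy ((==) `on` snd)
          . sortOn snd
          $ zip [0 :: Int ..] xs
```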

Of course, this presupposes that blocks could be (sensibly? efficiently?) made an instance of Ord. (If they were, hash buckets could be stored in an ordered data structure and binary search employed, making lookups O(log n) instead of linear.)

I may be able to find time to pursue some of these avenues, but as a very novice ghc hacker, I'd need some strategic direction here.

Any thoughts?

Attachments (1)

ghc10397.tar (10.0 KB) - added by TobyGoodwin 6 months ago.
rwbarton's simple code to reproduce the compiler performance problem


Change History (42)

comment:1 Changed 3 years ago by carter

If an Ord instance can be defined sensibly, perhaps on a newtype wrapping of that block skeleton, I'd be in favor of that. It doesn't have to be a canonical ordering, merely a valid total ordering living on a new type.

comment:2 Changed 3 years ago by nomeata

Maybe the blocks could be grouped not by hash, but rather by the blocks they might possibly jump to. Then for each block B one only needs to compare the blocks that can jump to B. When identical blocks are found, the corresponding may-jump-to sets can simply be unioned.

OTOH, this seems relatively obvious, so there must be a reason why it is not done this way.

comment:3 Changed 3 years ago by nomeata

Oh wow, just looked at the test case you provided. It’s really small:

  19 Form.hs
  46 RegBig.hs
  52 RegSmall.hs
  17 Y.hs
 134 total
but indeed takes very long. We should certainly improve the situation here.

comment:4 Changed 3 years ago by rwbarton

The source is very small, but the simplifier blows things up quite a lot, even though I have not used any INLINE directives. That is probably worth a separate issue. However, even after that blow-up the Core program size is not so unreasonably large that it should take several minutes to compile. (That is, it is unreasonably large for the specific source program which produced it, but not unreasonably large for a different, genuinely large source program.)

comment:5 Changed 3 years ago by nomeata

Right. But the inliner issue needs to be recorded – will you file a separate bug about this?

Then we need a test case that exercises the elimCommonBlocks only. Let’s see if I can write some nonsensical code that is basically only a large number of almost equal blocks.

comment:6 in reply to:  5 Changed 3 years ago by rwbarton

Replying to nomeata:

Right. But the inliner issue needs to be recorded – will you file a separate bug about this?

Done, #10421.

comment:7 Changed 3 years ago by nomeata

Despite the rule that one should set up a proper benchmark first, and then start hacking, I am now validating a patch that groups the blocks not only by hash, but also by the list of successor labels. These should be equal for all equal blocks, so we do not lose anything, but this should also give much smaller lists to search through linearly, and they can be updated with a new substitution without traversing the whole heap.
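The grouping key described here might be sketched like this (types simplified; Label stood in by a type parameter, names hypothetical):

```haskell
import qualified Data.Map.Strict as Map

-- Bucket blocks by (hash, successor labels); equal blocks necessarily
-- share both, so no candidates are lost, but the buckets to be searched
-- linearly get much smaller.
bucketBySuccessors :: Ord lbl => [(Int, [lbl], blk)] -> Map.Map (Int, [lbl]) [blk]
bucketBySuccessors =
  foldr (\(h, succs, b) -> Map.insertWith (++) (h, succs) [b]) Map.empty
```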

I can probably report if that helps tomorrow.

There are more ways to improve the code (e.g. not repeatedly building this map, but rather working with one), but before I do that I really should get some benchmarks for this issue :-)

comment:8 Changed 3 years ago by nomeata

I have Phab:D892, but I haven’t done any measurements yet; for now it’s “Good night”.

comment:9 Changed 3 years ago by nomeata

Compiling your RegBig.hs. Without elimCommonBlocks:

<<ghc: 30310299032 bytes, 1461 GCs, 80330740/286438424 avg/max bytes residency (30 samples), 786M in use, 0.001 INIT (0.000 elapsed), 20.887 MUT (33.307 elapsed), 10.015 GC (10.013 elapsed) :ghc>>

Before:

<<ghc: 97671621696 bytes, 2579 GCs, 77685892/300385928 avg/max bytes residency (31 samples), 777M in use, 0.001 INIT (0.001 elapsed), 134.310 MUT (146.546 elapsed), 10.639 GC (10.637 elapsed) :ghc>>

after my patch:

<<ghc: 69883688328 bytes, 3038 GCs, 83489944/273420096 avg/max bytes residency (39 samples), 750M in use, 0.001 INIT (0.001 elapsed), 55.414 MUT (67.757 elapsed), 14.896 GC (14.892 elapsed) :ghc>>

So it still takes long, but the time spent in elimCommonBlocks is cut down by ⅔.

comment:10 Changed 3 years ago by nomeata

According to a profiled run, elimCommonBlocks now accounts for 24.4% of the runtime, second to sequenceBlocks in AsmCodeGen with 59.5% (which I did not look at).

elimCommonBlocks still spends most of the time comparing blocks, most of the time uselessly: It iterates, and in each iteration compares all blocks with the same hash key (hash + list of successor labels) again. This is pointless: If the list of successor labels has not changed, then the blocks cannot suddenly be equal.

I have an idea how to avoid that. Let’s see if it pays off...

comment:11 Changed 3 years ago by nomeata

Looks like it does: Down to 8.52% of the runtime. Cut it in half by replacing

groupBy :: (a -> Int) -> [a] -> [[a]]
groupBy f xs = eltsUFM $ listToUFM_C (++) [(f x, [x]) | x <- xs]

with

groupBy :: (a -> Int) -> [a] -> [[a]]
groupBy f xs = eltsUFM $ List.foldl' go emptyUFM xs
  where go m x = alterUFM (Just . maybe [x] (x:)) m (f x) 

to avoid the risk that listToUFM_C would ++ the lists the wrong way.
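For illustration, here is the same foldl'-with-alter pattern over plain Data.IntMap instead of GHC's UniqFM (a sketch of the idea, not the actual patch):

```haskell
import qualified Data.IntMap.Strict as IntMap
import Data.List (foldl')

-- Buckets are built by prepending, so each bucket comes out in reverse
-- input order: predictable, unlike relying on which way listToUFM_C
-- happens to (++) colliding entries.
groupByKey :: (a -> Int) -> [a] -> [[a]]
groupByKey f = IntMap.elems . foldl' go IntMap.empty
  where go m x = IntMap.alter (Just . maybe [x] (x :)) (f x) m
```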

Last edited 3 years ago by nomeata (previous) (diff)

comment:12 Changed 3 years ago by nomeata

Differential Rev(s): Phab:D892
Status: new → patch

comment:13 Changed 3 years ago by nomeata

While working on this I found another bottleneck in the code generator, #10422. With that fixed¹ as well, we go from

<<ghc: 97671622904 bytes, 3397 GCs, 80319390/302142488 avg/max bytes residency (30 samples), 788M in use, 0.001 INIT (0.001 elapsed), 131.423 MUT (143.189 elapsed), 9.869 GC (9.877 elapsed) :ghc>>

to

<<ghc: 41423234208 bytes, 2190 GCs, 77441625/272967128 avg/max bytes residency (38 samples), 750M in use, 0.001 INIT (0.001 elapsed), 31.470 MUT (42.974 elapsed), 13.051 GC (13.059 elapsed) :ghc>>

Yay!

¹ patch pending validation, maybe I accidentally broke everything

comment:14 Changed 3 years ago by nomeata

And just to share the joy, here a side-by-side of the profile output, before and after:

          Sat May 16 18:12 2015 Time and Allocation Profiling Rep|          Sat May 16 18:06 2015 Time and Allocation Profiling Rep
                                                                 |  
             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskel|             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskel
                                                                 |  
          total time  =      198.47 secs   (198467 ticks @ 1000 u|          total time  =       14.22 secs   (14225 ticks @ 1000 us
          total alloc = 97,018,068,880 bytes  (excludes profiling|          total alloc = 14,686,848,016 bytes  (excludes profiling
                                                                 |  
  COST CENTRE      MODULE       %time %alloc                     |  COST CENTRE      MODULE          %time %alloc                  
                                                                 |  
  elimCommonBlocks CmmPipeline   80.5   69.1                     |  elimCommonBlocks CmmPipeline      19.9   21.2                  
  sequenceBlocks   AsmCodeGen    13.6   19.2                     |  SimplTopBinds    SimplCore        10.6   10.2                  
  SimplTopBinds    SimplCore      0.8    1.5                     |  pprNativeCode    AsmCodeGen        7.6    8.5                  
  pprNativeCode    AsmCodeGen     0.7    1.3                     |  regLiveness      AsmCodeGen        7.6    6.8                  
  regLiveness      AsmCodeGen     0.5    1.0                     |  StgCmm           HscMain           5.6    5.3                  
  RegAlloc         AsmCodeGen     0.4    1.1                     |  RegAlloc         AsmCodeGen        5.5    7.1                  

comment:15 Changed 3 years ago by carter

can these patches apply cleanly to the 7.10 branch? if so i'd like to nominate them for 7.10.2!

comment:16 Changed 3 years ago by nomeata

A quick git rebase --onto origin/ghc-7.10 master goes through fine (ignoring the test suite number update, which should probably be re-done anyways). But let’s first get them into master :-)

comment:17 Changed 3 years ago by Joachim Breitner <mail@…>

In c256357242ee2dd282fd0516260edccbb7617244/ghc:

Speed up elimCommonBlocks by grouping blocks also by outgoing labels

This is an attempt to improve the situation described in #10397, where
the linear scan of possible candidates for commoning up is far too
expensive. There is (ever) more room for improvement, but this is a
start.

Differential Revision: https://phabricator.haskell.org/D892

comment:18 Changed 3 years ago by Joachim Breitner <mail@…>

In 8e4dc8fb63b8d3bfee485c1c830776f3ed704f4d/ghc:

Greatly speed up nativeCodeGen/seqBlocks

When working on #10397, I noticed that "reorder" in
nativeCodeGen/seqBlocks took more than 60% of the time. With this
refactoring, it does not even show up in the profile any more. This
fixes #10422.

Differential Revision: https://phabricator.haskell.org/D893

comment:19 Changed 3 years ago by nomeata

Status: patch → merge

comment:20 Changed 3 years ago by carter

Milestone: 7.10.2

setting a milestone so it's on the table for 7.10.2


comment:22 Changed 3 years ago by simonpj

That's a tremendous improvement, thanks Joachim.

That said, 20% of compilation time in one, relatively minor, optimisation is far too much. I wonder how we could improve matters further? For example, the bigger the block the less likely it is to be identical, but the more expensive it is to compare (I guess). Maybe we can cut off at some block size?

Also why are you comparing (hash, list of successor labels) rather than just including the successor labels in the hash?

Simon

comment:23 Changed 3 years ago by nomeata

Note that this is a rather extreme test case. I’m sure that for a lot of optimizations we can construct a case where they account for most of the compilation time. If it were 20% on a real-world example, I’d fully agree. Right now, I’m fine with the current state of affairs.

For example, the bigger the block the less likely it is to be identical, but the more expensive it is to compare (I guess). Maybe we can cut off at some block size?

I doubt it. The bigger the blocks, the more likely it is that the hash is different.

Also why are you comparing (hash, list of successor labels) rather than just including the successor labels in the hash?

Because the successor labels change (or rather the equality on them does) as we common up blocks; the hash is calculated exactly once. I guess you are implicitly asking for more comments; I’ve added this to the explanatory text already there.

I wonder if it is faster to hash the list of labels and group them using a single level IntMap instead of the current trie based on nested IntMap.... That’s worth a try.

comment:24 in reply to:  23 Changed 3 years ago by nomeata

Replying to nomeata:

I wonder if it is faster to hash the list of labels and group them using a single level IntMap instead of the current trie based on nested IntMap.... That’s worth a try.

no, it’s not. And profiling shows why: the time is still spent mostly in eqBlockBodyWith. With my refactoring, the number of calls to that function is minimal anyway, so the only ways to improve the situation are

  • a better hash function (it currently does not take operators into account, and maybe it can also take “internal” functions into account)
  • no linear search, by using a trie for blocks, similar to how we do it for Core expressions.
  • lose precision.

I’ll add some debug output to see what these blocks that cannot be merged look like.

comment:25 Changed 3 years ago by nomeata

<rant>Geez, how do I print these CmmBlocks? The lack of ubiquitous Outputable or Show instances can be quite a time waster.... ah import PprCmm () helps.</rant>

Indeed, the hash function is simply not fine-grained enough. Including the uniques of local registers yields

             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskell/ghc|             ghc-stage2 +RTS -t -p -RTS -B/home/jojo/build/haskell/gh
                                                                      |  
          total time  =       13.83 secs   (13831 ticks @ 1000 us, 1 p|          total time  =       11.79 secs   (11791 ticks @ 1000 us, 1 
          total alloc = 14,684,289,032 bytes  (excludes profiling over|          total alloc = 11,894,920,976 bytes  (excludes profiling ove
                                                                      |  
  COST CENTRE      MODULE          %time %alloc                       |  COST CENTRE      MODULE          %time %alloc
                                                                      |  
  elimCommonBlocks CmmPipeline      17.4   21.2                       |  SimplTopBinds    SimplCore        12.7   12.6                      
  SimplTopBinds    SimplCore        11.1   10.2                       |  regLiveness      AsmCodeGen        9.4    8.4                      
  regLiveness      AsmCodeGen        7.7    6.8                       |  pprNativeCode    AsmCodeGen        8.3   10.5                      
  pprNativeCode    AsmCodeGen        7.0    8.5                       |  RegAlloc         AsmCodeGen        7.1    8.8                      
  RegAlloc         AsmCodeGen        5.9    7.1                       |  StgCmm           HscMain           6.9    6.5                      
  StgCmm           HscMain           5.6    5.3                       |  sink             CmmPipeline       6.3    5.9                      
  sink             CmmPipeline       5.3    4.7                       |  genMachCode      AsmCodeGen        3.9    3.7                      
  genMachCode      AsmCodeGen        3.5    3.0                       |  layoutStack      CmmPipeline       3.9    4.0                      
  layoutStack      CmmPipeline       3.5    3.2                       |  do_block         Hoopl.Dataflow    3.2    1.9                      
  do_block         Hoopl.Dataflow    2.7    1.5                       |  postorderDfs     CmmUtils          2.9    2.4                      
  NativeCodeGen    CodeOutput        2.4    2.2                       |  NativeCodeGen    CodeOutput        2.7    2.7                      
  postorderDfs     CmmUtils          2.3    2.0                       |  elimCommonBlocks CmmPipeline       2.5    2.8                      
  sequenceBlocks   AsmCodeGen        2.0    1.8                       |  sequenceBlocks   

which should now finally be sufficient. I'll create a DR for validation and then push this.

I also replaced my fancy trie by a plain Data.Map. It turned out to be not performance critical, so let’s remove the custom code.

comment:26 Changed 3 years ago by nomeata

Differential Rev(s): Phab:D892 → Phab:D892 Phab:D896

comment:27 Changed 3 years ago by simonpj

Yay, great stuff.

comment:28 Changed 3 years ago by Joachim Breitner <mail@…>

In 73f836f5d57a3106029b573c42f83d2039d21d89/ghc:

CmmCommonBlockElim: Improve hash function

Previously, the hash function used to cut down the number of block
comparisons did not take local registers into account, causing far too
many similar, but different blocks to be considered candidates for the
(expensive!) comparison.

Adding register to the hash takes CmmCommonBlockElim's share of the
runtime of the example in #10397 from 17% to 2.5%, and eliminates all
unwanted hash collisions.

This patch also replaces the fancy trie by a plain Data.Map. It turned
out to be not performance critical, so this simplifies the code.

Differential Revision: https://phabricator.haskell.org/D896

comment:29 Changed 3 years ago by thoughtpolice

Resolution: fixed
Status: merge → closed

Merged to ghc-7.10.

comment:30 Changed 3 years ago by nomeata

Status: closed → merge

I guess 73f836f5d57a3106029b573c42f83d2039d21d89, a further improvement over the previous patch, should also go into 7.10.

comment:31 Changed 3 years ago by TobyGoodwin

Fantastic work on this. Can I just point out that it is real-world code? It's the yesod backend for this web page and I stumbled across this problem simply by trying to upgrade from 7.6 to 7.8.

Which brings me to another point: it's great that this will be in 7.10. Is there any chance of getting it into 7.8 too? Most of the distros (Debian, Ubuntu LTS, even Fedora) have not made the jump to 7.8 yet, and when they do I'd hate for anyone else to have the same experience that I did, of a massive slowdown in compile times.

comment:32 Changed 3 years ago by simonmar

@nomeata: great job on these perf improvements!

comment:33 Changed 3 years ago by thoughtpolice

Joachim, I also picked this last one over to ghc-7.10, so these will all hit 7.10.2.

@TobyGoodwin - Unfortunately, we aren't going to be doing any more releases of the 7.8.x series at this point, since the branch is quite out of date. In theory, all of the current performance fixes Joachim has made should still apply to the 7.8 branch, although there will probably be conflicts from things like whitespace changes and the tabs->spaces conversion that happened.

I'm not sure what the policy for upstream distributions here is, especially if they still aren't even shipping 7.8 to users - and if they aren't, can they not use 7.10? I have no clue what timeline most distros use for upgrading major versions of things like GHC. Luckily, for Debian, Joachim is the maintainer, so I'm sure he'll know the deal. :)

comment:34 Changed 3 years ago by thoughtpolice

Status: merge → closed

comment:35 Changed 10 months ago by rwbarton

Hi TobyGoodwin, do you still have the reproducer for this issue? I referenced it in #10421, not expecting it to disappear... should have reuploaded it myself.

Changed 6 months ago by TobyGoodwin

Attachment: ghc10397.tar added

rwbarton's simple code to reproduce the compiler performance problem

comment:36 Changed 6 months ago by TobyGoodwin

Argh! (1) I only just saw this message a couple of days ago. Not sure why I didn't get / read / act upon email from trac.

Argh! (2) I never meant to delete it from my website, I must have just been trigger-happy one day. And for some reason it had never been added to the git repo that backs the site.

After a deal of searching, I did manage to find a copy. I've reinstated it on my website, and have attached it to this ticket.

comment:37 Changed 6 months ago by bgamari

Thanks TobyGoodwin!

comment:38 Changed 6 months ago by simonpj

So is the problem solved (by comment:28) or not?

comment:39 Changed 2 months ago by bgamari

Unfortunately the commit from comment:28 severely reduces our ability to CBE. To call it "broken" may be a bit strong, but it's certainly a regression, which I am tracking as #14226.

comment:40 Changed 2 months ago by simonpj

severely reduces our ability to CBE

Do you mean: it makes un-equal blocks that should be considered equal? Can you give an example? (For truly un-equal blocks we want to restrict our CBE: they aren't equal!)

comment:41 Changed 2 months ago by bgamari

I should correct my statement from comment:39 as I misunderstood the intent of the patch from comment:28:

The commit does not change the behavior of CBE; rather, it only adjusts the hash function used for "rough" comparison to more closely reflect the equality relation used later.

However, I would argue that our CBE is still significantly less aggressive than it should be. Let's continue this discussion on #14226.
