Opened 8 months ago

Last modified 3 months ago

#13535 new bug

vector test suite uses excessive memory on GHC 8.2

Reported by: RyanGlScott Owned by: bgamari
Priority: high Milestone: 8.4.1
Component: Compiler Version: 8.1
Keywords: Cc: George, asr, bgamari
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #10800 Differential Rev(s):
Wiki Page:

Description

First noticed here. I haven't managed to boil this down to a test case with no dependencies yet, so for the time being, this requires vector. To reproduce, follow these steps:

$ git clone https://github.com/erikd/vector
$ cd vector/
$ cabal install --only-dependencies --enable-tests -w /opt/ghc/8.2.1/bin/ghc
$ cabal configure --enable-tests -w /opt/ghc/8.2.1/bin/ghc
$ cabal test

When building vector-tests-O2, GHC will stall when compiling the Tests.Vector module. On machines with modest memory allowances (e.g., the machines used on Travis CI), GHC will be killed with an out-of-memory error after trying to compile Tests.Vector for a while.

Attachments (5)

Vector.hs (19.1 KB) - added by dfeuer 5 months ago.
Reduced a little more
ds-diff (23.3 KB) - added by dfeuer 5 months ago.
dump-ds diff for offending patch
ds.diff (23.3 KB) - added by dfeuer 5 months ago.
Proper extension on the diff
Smaller-Vector.hs (5.2 KB) - added by dfeuer 5 months ago.
smaller-ds.diff (23.3 KB) - added by dfeuer 5 months ago.

Download all attachments as: .zip

Change History (47)

comment:1 Changed 8 months ago by RyanGlScott

In particular, limiting GHC's memory to 4 GB is sufficient:

$ bash -c 'ulimit -v 4000000; cabal test'
...
Preprocessing test suite 'vector-tests-O2' for vector-0.12.0.1...
[6 of 7] Compiling Tests.Vector     ( tests/Tests/Vector.hs, dist/build/vector-tests-O2/vector-tests-O2-tmp/Tests/Vector.o )

tests/Tests/Vector.hs:9:1: warning: [-Wunused-imports]
    The import of ‘Data.Foldable’ is redundant
      except perhaps to import instances from ‘Data.Foldable’
    To import instances alone, use: import Data.Foldable()
  |
9 | import Data.Foldable (Foldable(foldMap))
  | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/Tests/Vector.hs:25:1: warning: [-Wunused-imports]
    The import of ‘Data.Monoid’ is redundant
      except perhaps to import instances from ‘Data.Monoid’
    To import instances alone, use: import Data.Monoid()
   |
25 | import Data.Monoid
   | ^^^^^^^^^^^^^^^^^^

tests/Tests/Vector.hs:29:1: warning: [-Wunused-imports]
    The import of ‘Data.Functor.Identity’ is redundant
      except perhaps to import instances from ‘Data.Functor.Identity’
    To import instances alone, use: import Data.Functor.Identity()
   |
29 | import Data.Functor.Identity
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests/Tests/Vector.hs:438:5: warning: [-Wname-shadowing]
    This binding for ‘limitUnfolds’ shadows the existing binding
      imported from ‘Utilities’ at tests/Tests/Vector.hs:5:1-24
    |
438 |     limitUnfolds f (theirs, ours)
    |     ^^^^^^^^^^^^
ghc: out of memory

comment:2 Changed 8 months ago by rwbarton

Looks like a regular "Core program gets very large" issue. The Core program got very large in 8.0.1 too (I didn't try earlier versions) but now it gets even more very large.

I tried with the versions of GHC before and after the Join points commit since I have them lying around. Already before the Join points commit the Core program grows to be very large quickly (actually, even more so than with HEAD).

comment:3 Changed 8 months ago by George

Cc: george added

comment:4 Changed 8 months ago by asr

Cc: asr added

comment:5 Changed 8 months ago by bgamari

Cc: bgamari added

comment:6 Changed 7 months ago by bgamari

Note that #10800 is another ticket reporting a compile-time regression in vector's testsuite between GHC 7.8 and 7.10. While it was apparently resolved in GHC 8.0, we never were able to come to a conclusion on what fixed it.

comment:7 Changed 7 months ago by bgamari

I setup a test environment to build just tests/Utilities.hs, tests/Boilerplater.hs, and tests/Tests/Vector.hs with both GHC 8.3 (13131ce9165b4e5e5193dc381f6f3d021e53792f and 8.0.2. In addition to building Tests.Vector as-is, I also built it enabling each of the testcases in the tests list individually (that is, commenting out all but one). The result is interesting: While compilation on master is significantly faster than 8.0.2 in the individual cases, it is significantly slower when all of the testcases are enabled. Clearly there is something non-linear going on here. I'm on the case.

Building Utilities, Boilerplater, and Tests.Vector of vector 1d208ee9e3a252941ebd112e14e8cd5a982ac2bb

GHC 8.3 (13131ce9165b4e5e5193dc381f6f3d021e53792f
nothing:                                   <<ghc:   2925044864 bytes,   278 GCs,  22310773/  86050104 avg/max bytes residency (10 samples),  185M in use, 0.001 INIT (0.001 elapsed),   3.107 MUT   (3.960 elapsed),   1.025 GC  (1.047 elapsed) :ghc>>
Data.Vector.Vector (Bool)                  <<ghc:  25467716144 bytes,  1155 GCs,  89898567/ 202234368 avg/max bytes residency (19 samples),  558M in use, 0.000 INIT (0.000 elapsed),  23.601 MUT  (24.429 elapsed),   9.281 GC  (9.277 elapsed) :ghc>>
Data.Vector.Vector (Int)                   <<ghc:  25538028952 bytes,  1060 GCs,  87592922/ 201496320 avg/max bytes residency (19 samples),  530M in use, 0.000 INIT (0.001 elapsed),  23.226 MUT  (24.239 elapsed),   8.672 GC  (8.674 elapsed) :ghc>>
Data.Vector.Primitive.Vector (Int)         <<ghc:  21588867344 bytes,  1004 GCs,  78872219/ 167736032 avg/max bytes residency (18 samples),  468M in use, 0.000 INIT (0.000 elapsed),  20.046 MUT  (20.788 elapsed),   7.783 GC  (7.779 elapsed) :ghc>>
Data.Vector.Primitive.Vector (Double)      <<ghc:  22386494072 bytes,  1070 GCs,  81997699/ 179105312 avg/max bytes residency (18 samples),  501M in use, 0.001 INIT (0.001 elapsed),  21.594 MUT  (22.527 elapsed),   8.253 GC  (8.260 elapsed) :ghc>>
Data.Vector.Storable.Vector (Int)          <<ghc:  22169976680 bytes,   972 GCs,  80817260/ 175301760 avg/max bytes residency (18 samples),  476M in use, 0.000 INIT (0.000 elapsed),  21.496 MUT  (22.450 elapsed),   8.426 GC  (8.433 elapsed) :ghc>>
Data.Vector.Storable.Vector (Double)       <<ghc:  23029038616 bytes,  1001 GCs,  85430763/ 193090296 avg/max bytes residency (18 samples),  521M in use, 0.001 INIT (0.001 elapsed),  22.058 MUT  (23.002 elapsed),   8.746 GC  (8.769 elapsed) :ghc>>
Data.Vector.Unboxed.Vector ()              <<ghc:  17260324632 bytes,   797 GCs,  64556858/ 149284504 avg/max bytes residency (16 samples),  410M in use, 0.001 INIT (0.005 elapsed),  13.861 MUT  (14.588 elapsed),   4.881 GC  (5.020 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Bool)          <<ghc:  22619684880 bytes,   918 GCs,  78584919/ 172329120 avg/max bytes residency (18 samples),  479M in use, 0.001 INIT (0.000 elapsed),  16.857 MUT  (17.451 elapsed),   6.200 GC  (6.195 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Int)           <<ghc:  22019118280 bytes,  1077 GCs,  80830924/ 183396408 avg/max bytes residency (18 samples),  489M in use, 0.000 INIT (0.000 elapsed),  15.595 MUT  (16.157 elapsed),   6.095 GC  (6.093 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Double)        <<ghc:  22359631568 bytes,  1046 GCs,  82354182/ 189338112 avg/max bytes residency (18 samples),  508M in use, 0.000 INIT (0.000 elapsed),  16.458 MUT  (17.004 elapsed),   6.365 GC  (6.361 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Int,Bool)      <<ghc:  34451441464 bytes,  1265 GCs, 111090192/ 271240872 avg/max bytes residency (21 samples),  700M in use, 0.000 INIT (0.000 elapsed),  23.294 MUT  (24.014 elapsed),   9.330 GC  (9.324 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Int,Bool,Int)  <<ghc:  43613769744 bytes,  1304 GCs, 129477158/ 317303208 avg/max bytes residency (23 samples),  813M in use, 0.001 INIT (0.000 elapsed),  29.817 MUT  (30.697 elapsed),  12.156 GC (12.147 elapsed) :ghc>>
everything                                 <<ghc: 264695216008 bytes,  4003 GCs, 547291435/1356463272 avg/max bytes residency (35 samples), 3566M in use, 0.000 INIT (0.000 elapsed), 227.028 MUT (236.855 elapsed),  97.713 GC (97.751 elapsed) :ghc>>


GHC 8.0.2
nothing                                    <<ghc:   3602875384 bytes,  424 GCs,   19427090/  80066928 avg/max bytes residency (10 samples),  172M in use, 0.000 INIT (0.001 elapsed),   0.011 MUT   (2.709 elapsed),   0.001 GC  (0.696 elapsed) :ghc>>
Data.Vector.Vector (Bool)                  <<ghc:  36799624512 bytes, 1640 GCs,  126755494/ 312669744 avg/max bytes residency (23 samples),  790M in use, 0.000 INIT (0.001 elapsed),   0.219 MUT  (27.706 elapsed),   0.069 GC (13.049 elapsed) :ghc>>
Data.Vector.Vector (Int)                   <<ghc:  36339886904 bytes, 1640 GCs,  122810903/ 305389720 avg/max bytes residency (24 samples),  758M in use, 0.000 INIT (0.001 elapsed),   0.242 MUT  (26.717 elapsed),   0.054 GC (12.381 elapsed) :ghc>>
Data.Vector.Primitive.Vector (Int)         <<ghc:  37367710936 bytes, 1765 GCs,  127647151/ 255413768 avg/max bytes residency (23 samples),  710M in use, 0.000 INIT (0.001 elapsed),   0.076 MUT  (28.191 elapsed),   0.036 GC (13.306 elapsed) :ghc>>
Data.Vector.Primitive.Vector (Double)      <<ghc:  41169666648 bytes, 1936 GCs,  136122135/ 336152832 avg/max bytes residency (24 samples),  837M in use, 0.000 INIT (0.001 elapsed),   0.149 MUT  (38.908 elapsed),   0.051 GC (17.125 elapsed) :ghc>>
Data.Vector.Storable.Vector (Int)          <<ghc:  40219735080 bytes, 1978 GCs,  132890651/ 362367016 avg/max bytes residency (24 samples),  858M in use, 0.000 INIT (0.002 elapsed),   0.875 MUT  (43.129 elapsed),   0.393 GC (18.668 elapsed) :ghc>>
Data.Vector.Storable.Vector (Double)       <<ghc:  44209977992 bytes, 1993 GCs,  140871502/ 369615896 avg/max bytes residency (25 samples),  887M in use, 0.000 INIT (0.001 elapsed),   0.155 MUT  (41.027 elapsed),   0.093 GC (18.374 elapsed) :ghc>>
Data.Vector.Unboxed.Vector ()              <<ghc:  29436191560 bytes, 1532 GCs,  102427352/ 241113128 avg/max bytes residency (21 samples),  636M in use, 0.000 INIT (0.001 elapsed),   0.278 MUT  (29.596 elapsed),   0.102 GC (11.599 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Bool)          <<ghc:  40109059768 bytes, 1923 GCs,  126872809/ 251075736 avg/max bytes residency (24 samples),  662M in use, 0.000 INIT (0.001 elapsed),   0.204 MUT  (38.656 elapsed),   0.044 GC (15.936 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Int)           <<ghc:  38385729456 bytes, 1772 GCs,  126917981/ 319870600 avg/max bytes residency (23 samples),  838M in use, 0.000 INIT (0.000 elapsed),   0.285 MUT  (36.855 elapsed),   0.075 GC (15.675 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Double)        <<ghc:  41411280800 bytes, 1807 GCs,  134424736/ 373355704 avg/max bytes residency (24 samples),  890M in use, 0.000 INIT (0.001 elapsed),   0.296 MUT  (40.076 elapsed),   0.180 GC (17.301 elapsed) :ghc>>
Data.Vector.Unboxed.Vector (Int,Bool)      <<ghc:  64510301216 bytes, 2629 GCs,  175595458/ 458265816 avg/max bytes residency (28 samples), 1148M in use, 0.000 INIT (0.001 elapsed),   0.354 MUT  (60.415 elapsed),   0.154 GC (26.855 elapsed) :ghc>> 
Data.Vector.Unboxed.Vector (Int,Bool,Int)  <<ghc:  82592190776 bytes, 2955 GCs,  211131511/ 486148096 avg/max bytes residency (29 samples), 1259M in use, 0.000 INIT (0.001 elapsed),   0.440 MUT  (73.214 elapsed),   0.192 GC (33.230 elapsed) :ghc>>
everything                                 <<ghc:  65784761184 bytes, 2602 GCs,  180664844/ 403016968 avg/max bytes residency (29 samples),  994M in use, 0.000 INIT (0.001 elapsed),   0.469 MUT  (52.120 elapsed),   0.259 GC (24.801 elapsed) :ghc>>

comment:8 Changed 7 months ago by bgamari

Indeed in most phases of simplification the program with all testcases enabled is roughly three to five times larger in master than it was in 8.0.2. After CorePrep, however, the program is nearly a factor of 8 larger,

# 8.0.2
Result size of CorePrep
  = {terms: 118,005, types: 138,524, coercions: 50,256}

# master
Result size of CorePrep
  = {terms: 796,181,
     types: 907,739,
     coercions: 1,094,846,
     joins: 3,382/34,300}

It may also be interesting to note that if you sum the allocations of all of the testcases individually, you get something a bit higher than the allocations required to compile the "everything" case. I wonder if

So, there are two (possibly related) questions here:

  1. Why did the Core get so much larger from 8.0 to 8.2?
  2. How was GHC 8.0 able to compile the "everything" case with less than a factor of two of the allocations of the individual cases, while 8.2 requires a factor of five?

comment:9 Changed 7 months ago by bgamari

Comparing the compilation of just the Data.Vector.Vector (Bool) with GHC 8.0.2 and 8.2 is also quite interesting: starting early in simplification we see that 8.2's core is typically less than half(!) the size of 8.0's (perhaps some goodness due to early inlining?). For instance,

# 8.0.2
  = {terms: 271,375, types: 552,012, coercions: 226,814}

# 8.2
Result size of Simplifier iteration=1
  = {terms: 105,610,
     types: 209,276,
     coercions: 135,191,
     joins: 299/5,710}

This remains the case throughout most of simplification; however, it seems that either Core Tidy or Core Prep cleans things up in 8.0, resulting in rather similar program sizes,

# 8.0.2
Result size of CorePrep
  = {terms: 56,693, types: 73,297, coercions: 24,938}
!!! CorePrep [Tests.Vector]: finished in 135.38 milliseconds, allocated 97.842 megabytes

# 8.2
Result size of CorePrep
  = {terms: 69,730,
     types: 85,131,
     coercions: 125,647,
     joins: 248/2,453}
!!! CorePrep [Tests.Vector]: finished in 132.14 milliseconds, allocated 106.584 megabytes

Seeing that the Core sizes are comparable when compiling a single testcase, question (2) is becoming quite interesting.

comment:10 Changed 7 months ago by bgamari

As I suspected, there appears to be a large number of duplicate top-level bindings in the code that gets passed Core Tidy in 8.2. In particular, I count at least 25 copies of the same binding,

$s$wfoldlM_loop :: State# RealWorld-> [] Bool-> Int#-> MutableArray# RealWorld Bool-> Int#-> Int#-> (#,#) (TupleRep ([] RuntimeRep)) LiftedRep (State# RealWorld) (Vector Bool)
{- Core Size{terms=209 types=244 cos=466 vbinds=7 jbinds=0} -}
$s$wfoldlM_loop = ...

I see similar amounts of repetition for foldlM variants instantiated at other types. I'll need to work out why CSE fails to catch this tomorrow.

Last edited 7 months ago by bgamari (previous) (diff)

comment:11 Changed 7 months ago by simonpj

Can you describe how to reproduce this test case?

comment:12 Changed 7 months ago by bgamari

You should be able to reproduce this with,

$ git clone git://github.com/haskell/vector
$ cd vector
$ git checkout 1d208ee9e3a252941ebd112e14e8cd5a982ac2bb
$ cabal install . --allow-newer=base,directory,process,haskeline,time,unix --disable-library-profiling --enable-test
$ time ghc -no-link \
    -i \
    -idist/build/vector-tests-O2/vector-tests-O2-tmp \
    -itests \
    -idist/build/vector-tests-O2/autogen \
    -idist/build/global-autogen \
    -Idist/build/vector-tests-O2/autogen \
    -Idist/build/global-autogen \
    -Idist/build/vector-tests-O2/vector-tests-O2-tmp \
    -optP-include \
    -optPdist/build/vector-tests-O2/autogen/cabal_macros.h \
    -XHaskell2010 \
    -XCPP \
    -XScopedTypeVariables \
    -XPatternGuards \
    -XMultiParamTypeClasses \
    -XFlexibleContexts \
    -XRank2Types \
    -XTypeSynonymInstances \
    -XTypeFamilies \
    -XTemplateHaskell \
    tests/Tests/Vector \
    -O2 \
    -Wall \
    -fno-warn-orphans \
    -fno-warn-missing-signatures \
    -Wno-name-shadowing \
    -Wno-unused-binds \
    -Wno-unused-matches \
    -Wno-unused-imports \
    -ddump-to-file \
    -ddump-simpl \
    -dsuppress-idinfo \
    -fforce-recomp \
    -v \
    +RTS -t 2>&1 | tee log

Then grep for ^\$s\$wfoldlM in the dump-simpl output. You should see a number of top-level bindings of the type that I mention in comment:10. In my case they are all exactly of size,

-- RHS size: {terms: 209, types: 244, coercions: 466, joins: 0/7}

and the right-hand sides are identical.

It seems that the bindings in question are being excluded from CSE by CSE.noCSE. I suspect they may be join points, but I'm still trying to work out the particulars.

comment:13 Changed 7 months ago by bgamari

Indeed I've confirmed that the bindings in question are join points. I don't see any reason why top-level join points can't be CSE'd. I'll try fixing this.

comment:14 Changed 7 months ago by bgamari

Please excuse the nonsense in comment:13. My trace mislead me into believing that the bindings in question were join points when they were not (and now that I look again at this, I don't believe that such top-level join points could even arise).

The problem is instead that the not (isAlwaysActive (idInlineActivation id)) condition in noCSE fires for the bindings in question. This appears to be due to the INLINE[0] pragma applied to these bindings by worker/wrapper (namely in WorkWrap.splitFun.wrap_id).

Last edited 7 months ago by bgamari (previous) (diff)

comment:15 Changed 7 months ago by simonpj

I'm still a bit boggled by how all these identical functions get created in the first place. I have not yet had a chance to gaze at the repro case. Do you understand why?

comment:16 Changed 7 months ago by bgamari

So it turns out that the failure of CSE isn't actually a regression. I can observe the same repeated bindings produced by 8.0.2.

Regardless, I think the CSE failure should probably be considered a bug. Afterall, worker/wrapper originally applies this inline pragma to ensure that the wrapper doesn't inline before any possible specialisations have had a chance to do so. If we CSE'd in this case we aren't going to hide any specialisation opportunities as the bindings we are CSE'ing are all identical and should therefore have the same available specialisations.

comment:17 Changed 7 months ago by bgamari

simonpj, the identical functions come into existence by inlining into numerous distinct call-sites of foldlM (which has foldM_loop let-bound in its unfolding).

Last edited 7 months ago by bgamari (previous) (diff)

comment:18 Changed 7 months ago by simonpj

OK, but all those call sites must be applied to the same combining-function argument (else the loop would be different in each case). So we just have lots of all to foldlM k z, for a particular k and z?

Could we CSE those calls before inlining the foldlM?

comment:19 Changed 7 months ago by bgamari

Could we CSE those calls before inlining the foldlM?

It seems like what is happening is this,

  1. The specialiser creates an INLINE specialisation,

$sunstream :: Bundle Vector Int -> New Vector Int

which has foldlM (and consequently foldlM_loop) let-bound within it

  1. FloatOut gets run, but since we are in an InlineCtxt we don't float the foldlM_loops out to top-level where CSE might be able to consolidate them (incidentally, is Note [FloatOut inside INLINE] in SetLevels still correct? It doesn't seem to be referenced from anywhere)
  2. The specialisation is inlined into dozens of callsites
  3. We do lots of expensive, redundant simplification
  4. Eventually we do CSE, but even then we don't eliminate the redundant top-level foldlM_loops for the reason discussed in comment:16

So, to answer the original question, we might be able to CSE those calls before inlining foldlM, but we'll need to change something about the above situation.

Last edited 7 months ago by bgamari (previous) (diff)

comment:20 Changed 7 months ago by bgamari

(incidentally, is Note [FloatOut inside INLINE] in SetLevels still correct? It doesn't seem to be referenced from anywhere)

It actually appears that the logic associated with this note was refactored in,

commit c0fe534fbf3d3e30a6f35b355da86e6f18601348
Author: simonpj@microsoft.com <unknown>
Date:   Tue Apr 22 11:50:03 2008 +0000

    Fix a long-standing bug in FloatOut

and eventually removed in,

commit 72462499b891d5779c19f3bda03f96e24f9554ae
Author: simonpj@microsoft.com <unknown>
Date:   Thu Oct 29 14:30:51 2009 +0000

    The Big INLINE Patch: totally reorganise way that INLINE pragmas work

I suspect the note is now incorrect (although it's a bit unclear why point (2) of comment:19 stands in this case).

Last edited 7 months ago by bgamari (previous) (diff)

comment:21 Changed 7 months ago by simonpj

So it turns out that the failure of CSE isn't actually a regression. I can observe the same repeated bindings produced by 8.0.2.

But nevertheless, regardless of CSE, things got worse between 8.0 and 8.2 and we do not yet know why.

comment:22 Changed 7 months ago by simonpj

ToDo:

  1. Vector library author: do we really need to INLINE unstream and friends, or would INLINABLE do?
  1. Why is 8.2 worse than 8.0?
  1. Maybe we could do a better job of CSE as discussed in #13589.

comment:23 Changed 7 months ago by bgamari

Owner: set to bgamari

comment:24 Changed 6 months ago by George

with ghc 8.2.1-rc2 on a Mac, now succeeds in less than 4gb , watching Activity Monitor, Memory max usage seems between 3.5 and 4gb:

ulimit -v 4000000; cabal test
...
Preprocessing test suite 'vector-tests-O2' for vector-0.12.0.1...
...
[4 of 7] Compiling Tests.Vector     ( tests/Tests/Vector.hs, dist/build/vector-tests-O2/vector-tests-O2-tmp/Tests/Vector.o )
...
tests/Tests/Vector.hs:438:5: warning: [-Wname-shadowing]
    This binding for ‘limitUnfolds’ shadows the existing binding
      imported from ‘Utilities’ at tests/Tests/Vector.hs:5:1-24
      (and originally defined at tests/Utilities.hs:347:1-12)
    |
438 |     limitUnfolds f (theirs, ours)
    |     ^^^^^^^^^^^^
[5 of 7] Compiling Tests.Move       ( tests/Tests/Move.hs, dist/build/vector-tests-O2/vector-tests-O2-tmp/Tests/Move.o )
...
[7 of 7] Compiling Main             ( tests/Main.hs, dist/build/vector-tests-O2/vector-tests-O2-tmp/Main.o )
Linking dist/build/vector-tests-O2/vector-tests-O2 ...
Running 2 test suites...
Test suite vector-tests-O0: RUNNING...
Test suite vector-tests-O0: PASS
Test suite logged to: dist/test/vector-0.12.0.1-vector-tests-O0.log
Test suite vector-tests-O2: RUNNING...
Test suite vector-tests-O2: PASS
Test suite logged to: dist/test/vector-0.12.0.1-vector-tests-O2.log
2 of 2 test suites (2 of 2 test cases) passed.
bash-3.2$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.2.0.20170507
Last edited 6 months ago by George (previous) (diff)

comment:25 Changed 6 months ago by RyanGlScott

4 GB was a somewhat arbitrary number I picked. On my machine, compiling the tests on GHC 8.0.2 took about 1.6 GB, whereas with GHC 8.2.1, it took about 6.7 GB.

comment:26 Changed 6 months ago by RyanGlScott

I've finally managed to identify the commit that caused this memory spike: 1c4a39d3a8d36803382792ff78b4709794358883 (Prioritise class-level equality costraints).

comment:27 Changed 6 months ago by RyanGlScott

I've also confirmed that reverting 1c4a39d3a8d36803382792ff78b4709794358883 from the current master (d39a3409acd3c40fb018ec1c114f15d3ecef6ef9) makes GHC able to compile the vector test suite again, so that is a viable workaround.

comment:28 Changed 6 months ago by bgamari

Great work Ryan! That is a great hint.

comment:29 Changed 6 months ago by bgamari

It is a bit surprising to hear that this patch is the culprit, however, as it appears to be solely a typechecker change. Perhaps there is some change in the generated Core which then triggers some latent simplifier bug.

comment:30 Changed 5 months ago by dfeuer

FWIW, this is what I see on HEAD right now, compiling with

~/src/ghc-clean-june28/inplace/bin/ghc-stage2  -no-link -fbuilding-cabal-package -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir dist/build/vector-tests-O2/vector-tests-O2-tmp -odir dist/build/vector-tests-O2/vector-tests-O2-tmp -hidir dist/build/vector-tests-O2/vector-tests-O2-tmp -stubdir dist/build/vector-tests-O2/vector-tests-O2-tmp -i -idist/build/vector-tests-O2/vector-tests-O2-tmp -itests -idist/build/autogen -Idist/build/autogen -Idist/build/vector-tests-O2/vector-tests-O2-tmp -optP-include -optPdist/build/autogen/cabal_macros.h -hide-all-packages -package-db dist/package.conf.inplace -package HUnit-1.6.0.0 -package QuickCheck-2.9.2 -package base -package random-1.1 -package template-haskell -package test-framework-0.8.1.1 -package test-framework-hunit-0.3.0.2 -package test-framework-quickcheck2-0.3.0.3 -package transformers-0.5.2.0 -package vector-0.12.0.1 -XHaskell2010 -XCPP -XScopedTypeVariables -XPatternGuards -XMultiParamTypeClasses -XFlexibleContexts -XRank2Types -XTypeSynonymInstances -XTypeFamilies -XTemplateHaskell tests/Tests/Vector.hs -O2 -fno-warn-orphans -fno-warn-missing-signatures -Rghc-timing
DV Bool:
<<ghc: 20949749104 bytes, 908 GCs, 85137861/201300312 avg/max bytes residency (18 samples), 550M in use, 0.001 INIT (0.001 elapsed), 16.766 MUT (17.304 elapsed), 7.197 GC (7.203 elapsed) :ghc>>

DV Int:
<<ghc: 20880225088 bytes, 986 GCs, 86497889/198602480 avg/max bytes residency (19 samples), 552M in use, 0.001 INIT (0.001 elapsed), 16.368 MUT (16.871 elapsed), 7.422 GC (7.415 elapsed) :ghc>>

DVP Int:
<<ghc: 17623866944 bytes, 806 GCs, 74242958/170729104 avg/max bytes residency (17 samples), 462M in use, 0.001 INIT (0.001 elapsed), 14.402 MUT (14.861 elapsed), 6.131 GC (6.150 elapsed) :ghc>>

DVP Double:
<<ghc: 18468475504 bytes, 872 GCs, 78279555/177860440 avg/max bytes residency (18 samples), 492M in use, 0.001 INIT (0.001 elapsed), 15.021 MUT (15.457 elapsed), 6.609 GC (6.603 elapsed) :ghc>>

DVS Int:
<<ghc: 18872431048 bytes, 1046 GCs, 76865901/183460000 avg/max bytes residency (18 samples), 488M in use, 0.001 INIT (0.001 elapsed), 15.192 MUT (15.736 elapsed), 6.655 GC (6.651 elapsed) :ghc>>

DVS Double:
<<ghc: 19785957096 bytes, 917 GCs, 80582965/199247800 avg/max bytes residency (18 samples), 516M in use, 0.001 INIT (0.001 elapsed), 15.735 MUT (16.301 elapsed), 6.882 GC (6.908 elapsed) :ghc>>

DVU ():
<<ghc: 14525404952 bytes, 727 GCs, 62579804/151799480 avg/max bytes residency (16 samples), 421M in use, 0.001 INIT (0.001 elapsed), 11.827 MUT (12.210 elapsed), 4.514 GC (4.509 elapsed) :ghc>>

DVU Bool:
<<ghc: 18936248568 bytes, 962 GCs, 79517335/183260976 avg/max bytes residency (18 samples), 512M in use, 0.001 INIT (0.001 elapsed), 14.761 MUT (15.275 elapsed), 6.348 GC (6.340 elapsed) :ghc>>

DVU Int:
<<ghc: 17814993240 bytes, 891 GCs, 72855136/165527616 avg/max bytes residency (18 samples), 463M in use, 0.001 INIT (0.001 elapsed), 14.094 MUT (14.503 elapsed), 5.989 GC (5.983 elapsed) :ghc>>

DVU Double:
<<ghc: 18153061952 bytes, 1018 GCs, 76748433/177534552 avg/max bytes residency (18 samples), 497M in use, 0.001 INIT (0.001 elapsed), 14.585 MUT (15.029 elapsed), 6.340 GC (6.340 elapsed) :ghc>>

DVU (Int, Bool):
<<ghc: 28661798488 bytes, 1120 GCs, 103585683/238779464 avg/max bytes residency (20 samples), 667M in use, 0.001 INIT (0.001 elapsed), 21.903 MUT (22.544 elapsed), 9.687 GC (9.690 elapsed) :ghc>>

DVU (Int, Bool, Int):
<<ghc: 35820203960 bytes, 1458 GCs, 129650246/294742648 avg/max bytes residency (22 samples), 811M in use, 0.001 INIT (0.001 elapsed), 26.240 MUT (27.016 elapsed), 12.636 GC (12.650 elapsed) :ghc>>

Total allocations for all individual:
       250492415944

all together:
<<ghc: 228679556144 bytes, 3540 GCs, 530024080/1227152400 avg/max bytes residency (35 samples), 3529M in use, 0.001 INIT (0.001 elapsed), 179.387 MUT (188.302 elapsed), 84.584 GC (84.509 elapsed) :ghc>>

So it currently allocates somewhat less compiling all together than compiling separately. bgamari's table for 8.0.2 looks like it has the wrong number for the "all together" case, so I'm not sure how to compare.

Last edited 5 months ago by dfeuer (previous) (diff)

comment:31 Changed 5 months ago by simonpj

David will summarise by saying

  • Compile this module X with these flags
  • Compiler allocations (a) pre offending patch (b) post offending patch (c) HEAD
  • Tidy Core sizes for (a)-(c); i.e. is it slower because we are compiling more code or because we are taking long to compile it?

Changed 5 months ago by dfeuer

Attachment: Vector.hs added

Reduced a little more

comment:32 Changed 5 months ago by dfeuer

Starting with vector 1d208ee9e3a252941ebd112e14e8cd5a982ac2bb, replace tests/Test/Vector.hs with the attached file. Compile using

-no-link -fbuilding-cabal-package -static -dynamic-too -dynosuf dyn_o \
-dynhisuf dyn_hi -outputdir dist/build/vector-tests-O2/vector-tests-O2-tmp \
-odir dist/build/vector-tests-O2/vector-tests-O2-tmp -hidir \
dist/build/vector-tests-O2/vector-tests-O2-tmp -stubdir \
dist/build/vector-tests-O2/vector-tests-O2-tmp -i \
-idist/build/vector-tests-O2/vector-tests-O2-tmp -itests \
-idist/build/autogen -Idist/build/autogen \
-Idist/build/vector-tests-O2/vector-tests-O2-tmp \
-optP-include -optPdist/build/autogen/cabal_macros.h -hide-all-packages \
-package-db dist/package.conf.inplace -package HUnit-1.6.0.0 \
-package QuickCheck-2.9.2 -package base -package random-1.1 -package \
template-haskell -package test-framework-0.8.1.1 -package \
test-framework-hunit-0.3.0.2 -package test-framework-quickcheck2-0.3.0.3 \
-package transformers-0.5.2.0 -package vector-0.12.0.1 -XHaskell2010 \
-XCPP -XScopedTypeVariables -XPatternGuards -XMultiParamTypeClasses \
-XFlexibleContexts -XRank2Types -XTypeSynonymInstances -XTypeFamilies \
-XTemplateHaskell tests/Tests/Vector.hs -O2 -fno-warn-orphans \
-fno-warn-missing-signatures -Rghc-timing -v

In the below, the "offending patch" is 1c4a39d3a8d36803382792ff78b4709794358883 and "recent HEAD" is 7097f94df0c6667be2343306ffcda152fa22abcf.

Allocations:

  1. Before the offending patch: 9,778,021,752
  2. After the offending patch: 125,401,972,904 (13x worse)
  3. Recent HEAD: 60,079,129,264 (6x worse)

Tidy core size (terms)

  1. Before offending: 20,934
  2. After offending: 170,128
  3. Recent HEAD: 170,023

Maximum reported core size (terms) before Tidy Core:

  1. Before offending: 45,213
  2. After offending: 1,358,264
  3. Recent HEAD: 503,803

So we are clearly generating far more code, and while we eliminate a lot of it, we're still left with an awful lot.

Last edited 5 months ago by dfeuer (previous) (diff)

Changed 5 months ago by dfeuer

Attachment: ds-diff added

dump-ds diff for offending patch

comment:33 Changed 5 months ago by dfeuer

I've attached a diff of the results of -ddump-ds -dsuppress-uniques before and after the offending patch. The change looks pretty small, but something is blowing everything up.

Changed 5 months ago by dfeuer

Attachment: ds.diff added

Proper extension on the diff

Changed 5 months ago by dfeuer

Attachment: Smaller-Vector.hs added

Changed 5 months ago by dfeuer

Attachment: smaller-ds.diff added

comment:34 Changed 5 months ago by dfeuer

I cut down the test case some more, to Smaller-Vector.hs, with desugar dump in smaller-ds.diff. The relative performance contrast isn't quite as strong, but it's still almost five times as bad in HEAD as before the badness started.

comment:35 Changed 4 months ago by bgamari

Milestone: 8.2.18.2.2

We are going to punt this to 8.2.2.

comment:36 Changed 4 months ago by bgamari

Milestone: 8.2.28.4.1
Priority: highesthigh

Actually, perhaps 8.4.

comment:37 Changed 4 months ago by George

Given the difficulty of fixing compiler bugs of excessive memory use, does it make sense to introduce a test to prevent future regressions:

bash -c 'ulimit -v <appropriate limti for current HEAD.; cabal test'

We could then change the limit as bugs get fixed or improvements as made. This assumes that this test would catch a lot of such errors. If people think this isn't the case then maybe such a test would not be worthwhile.

comment:38 Changed 4 months ago by George

Cc: George added; george removed

comment:39 Changed 4 months ago by bgamari

Well, we have a testsuite which includes a variety of performance tests. However, they tend to be "microbenchmarks" are don't approach the scale of vector. Unfortunately, these tests IMHO already require far more effort to maintain than they give us benefit due to the need for manual bumping of thresholds. The problem is summarized in #12758 (which jared-w is working on fixing this summer).

I do think there is room for more compiler performance tests, but I think they should generally be something that we monitor, not something that we actively need to bump thresholds on. As far as using the vector testsuite for such a testcase is concerned, I'd skeptical of including it wholesale. Rather, I think we would be better off building a minimized version of vector's testcase (which will hopefully fall out naturally as a result of working on this ticket) and including that. Testcases with lots of dependencies are generally problematic as you get breakage due to interface changes, spurious performance fluctuations due to changes in dependencies, and the pain of having to track down where in your testcase GHC is regressing.

comment:40 Changed 4 months ago by lippling

(I'm not exactly sure if this is the right ticket)

I tried to compile one of our server applications with 8.2.1 (which compiles fine with 8.0.2).

The compilation runs smooth, but when it reaches a specific file, the RAM usage goes up to > 20GB pretty fast on my 16GB machine and the GHC process gets terminated with an stack overflow error.

comment:41 in reply to:  40 Changed 3 months ago by bgamari

Replying to lippling:

(I'm not exactly sure if this is the right ticket)

I tried to compile one of our server applications with 8.2.1 (which compiles fine with 8.0.2).

The compilation runs smooth, but when it reaches a specific file, the RAM usage goes up to > 20GB pretty fast on my 16GB machine and the GHC process gets terminated with an stack overflow error.

If you can take the time to reduce your program to a couple of shareable modules it would be appreciated if you could open a new ticket with your repro.

Note: See TracTickets for help on using tickets.