Opened 2 years ago

Last modified 7 hours ago

#9221 new bug

(super!) linear slowdown of parallel builds on 40 core machine

Reported by: carter Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 7.8.2
Keywords: Cc: slyfox, tibbe, gidyn, nh2, kolmodin, erikd, kazu-yamamoto, scpmw, mboes
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #910, #8224 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

I'm seeing slowdowns in parallel builds of a (simple!!) 6-module project when I build it on a 40-core server that I'm using for work. For any given ghc invocation with -jn, once n>10 I start to see a super-linear slowdown as a function of n.

Here are some basic numbers:

compile time
-j1 0m2.693s
-j4 0m2.507s
-j10 0m2.763s
-j25 0m12.634s
-j30 0m39.154s
-j40 0m57.511s
-j60 2m21.821s

These timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / Setup.hs

According to the linux utility latencytop, 100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.

The timing in the -j40 case stayed the same even when ghc was also passed -O0 (plus -fforce-recomp to ensure it did the same work).

A bit of experimentation makes me believe that *any* cabalized project on a 40 core machine will exhibit this performance issue.

cabal clean
cabal configure --ghc-options="-j"
cabal build -j1

should be enough to trigger the lock contention.

That said, I'll try to cook up a minimal repro that I can share the source for posthaste.

Attachments (6)

test-ghc-cabal.sh (515 bytes) - added by slyfox 2 years ago.
a good benchmark for cabal build
graph1.png (24.7 KB) - added by hvr 2 years ago.
Plot of data in comment:2
bench.log (10.9 KB) - added by gintas 2 years ago.
Benchmark log (semaphore value & capability count)
overload.log (3.7 KB) - added by gintas 2 years ago.
Experiment with 32 capabilities on a 16-core machine
heapsize.log (12.6 KB) - added by gintas 2 years ago.
Experiment with multiple -A values on the same invocation
synth.bash (494 bytes) - added by slyfox 4 days ago.
synth.bash - perfectly parallel workload for ghc


Change History (72)

comment:1 Changed 2 years ago by carter

So my understanding is that the parUpsweep function in compiler/main/GhcMake.hs is the main culprit / the thing to focus on for this performance issue, right? (It's literally the only code triggered by the -jN ghc invocation.)

Changed 2 years ago by slyfox

a good benchmark for cabal build

comment:2 follow-up: Changed 2 years ago by slyfox

I've noticed it as well. In superficial tests, -jN with N>4 does not make much sense even on an 8-core box.

Attached a cabal build test (takes ~50 seconds to build). Use as:

$ cd $ghc-source
./test-ghc-cabal.sh >build-log

Results for my machine:

RUN jobs: 1

real    0m51.745s
user    0m49.825s
sys     0m1.553s
RUN jobs: 2

real    0m34.254s
user    0m57.075s
sys     0m6.278s
RUN jobs: 3

real    0m31.670s
user    1m7.058s
sys     0m10.970s
RUN jobs: 4

real    0m32.008s
user    1m10.548s
sys     0m18.194s
RUN jobs: 5

real    0m32.329s
user    1m15.384s
sys     0m27.939s
RUN jobs: 6

real    0m33.993s
user    1m25.190s
sys     0m41.473s
RUN jobs: 7

real    0m35.410s
user    1m32.354s
sys     0m51.201s
RUN jobs: 8

real    0m36.111s
user    1m42.945s
sys     1m1.740s
RUN jobs: 9

real    0m37.426s
user    1m49.708s
sys     1m7.805s
RUN jobs: 10

real    0m40.149s
user    2m0.625s
sys     1m13.054s
RUN jobs: 11

real    0m44.515s
user    2m18.503s
sys     1m21.783s
RUN jobs: 12

real    0m44.393s
user    2m25.161s
sys     1m26.875s
RUN jobs: 13

real    0m47.298s
user    2m44.370s
sys     1m29.611s
RUN jobs: 14

real    0m52.647s
user    3m16.386s
sys     1m37.780s
RUN jobs: 15

real    0m54.757s
user    3m18.954s
sys     1m45.547s
RUN jobs: 16

real    0m58.655s
user    3m49.732s
sys     1m49.191s

Notice how sys time creeps up, eating the performance gain for N>4.

comment:3 Changed 2 years ago by slyfox

  • Cc slyfox added

Changed 2 years ago by hvr

Plot of data in comment:2

comment:4 in reply to: ↑ 2 Changed 2 years ago by hvr

Replying to slyfox:

Notice how sys time creeps up, eating the performance gain for N>4.

It's also quite visible in the quick R plot I made:

Plot of data in comment:2

comment:5 Changed 2 years ago by tibbe

  • Cc tibbe added

comment:6 Changed 2 years ago by slyfox

The box I performed the tests on is an 8-CPU one (4 cores x 2 threads each):

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz

comment:7 Changed 2 years ago by hvr

comment:8 Changed 2 years ago by thoughtpolice

  • Milestone set to 7.10.1

Moving to 7.10.1.

comment:9 Changed 2 years ago by gintas

I think I know what's going on here. If you look at parUpsweep in compiler/main/GhcMake.hs, its argument n_jobs is used in two places: one is the initial value of the par_sem semaphore used to limit parallelization, and the other is a call to setNumCapabilities. The latter seems to be the cause of the slowdown.

Note that setNumCapabilities is only invoked if the previous count of capabilities was 1. I used that to control for both settings independently, and it turns out that the runtime overhead is mostly independent of the semaphore value and highly influenced by capability count.

I ran some experiments on a 16-CPU VM (picked a larger one deliberately to make the differences more pronounced). Running with jobs=4 & caps=4, a test took 37s walltime, jobs=4 & caps=16 took 51s, jobs=4 & caps=32 took 114s (344s of MUT and 1021s of GC!). The figures are very similar for jobs=16 and jobs=64. See attached log for more details (-sstderr output).

It looks like the runtime GC is just inefficient when running with many capabilities, even if many physical cores are available. I'll try a few experiments to verify that this is a general pattern that is not specific to the GhcMake implementation.

Logic and a few experiments indicate that it does not help walltime to set the number of jobs (semaphore value) higher than the number of capabilities, so there's not much we can do about those two parameters in the parUpsweep implementation other than capping n_jobs at some constant (probably <= 8).
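As a rough, hypothetical sketch of the two uses of n_jobs described above (the real code is parUpsweep in compiler/main/GhcMake.hs; error handling and result collection are omitted):

import Control.Concurrent (forkIO, getNumCapabilities, setNumCapabilities)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Hypothetical skeleton, not the actual GhcMake implementation.
parUpsweepSketch :: Int -> [IO ()] -> IO ()
parUpsweepSketch n_jobs compileActions = do
  -- Use 1: n_jobs seeds the semaphore that caps concurrent module compilations.
  par_sem <- newQSem n_jobs
  -- Use 2: n_jobs also becomes the capability count (only if it was 1 before);
  -- this is the part identified above as the source of the slowdown.
  n_caps <- getNumCapabilities
  if n_caps == 1 then setNumCapabilities n_jobs else return ()
  mapM_ (\act -> forkIO (bracket_ (waitQSem par_sem) (signalQSem par_sem) act))
        compileActions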

Changed 2 years ago by gintas

Benchmark log (semaphore value & capability count)

comment:10 Changed 2 years ago by gintas

While looking around I found #3758, and it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:

real    1m3.596s
user    6m31.072s
sys     3m10.732s

With -qg:

real    0m47.747s
user    1m33.352s
sys     0m2.024s

However, for smaller -j values -qg slightly increases walltime:

2 jobs: 44s without -qg, 46s with -qg
4 jobs: 37s vs 40s
8 jobs: 37s vs 41s
15 jobs: 42s vs 44s
16 jobs: 49s vs 44s (walltime crossover point for this 16-core machine)

comment:11 Changed 2 years ago by gintas

Also #8224 looks like it could be related.

comment:12 Changed 2 years ago by simonmar

We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. +RTS -A32m at least.

Check whether your machines really have N cores, or if N/2 of those are hyperthreaded cores. I don't recommend including hyperthreaded cores in the -N value you give to GHC; it's better to leave them free to soak up any extra load while letting the GC synchronize properly.

@carter, when you said "100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.", how did you discover that? Which lock in particular?

comment:13 Changed 2 years ago by carter

Yeah, I definitely found that making the nursery larger helped A LOT; I'd done -A50m or -A100m on a few occasions, though I must admit I didn't think to do that for running GHC itself!

Good point about hyperthreads, that would match my experience of only seeing good perf up to about 15-20ish for -N. This does raise the problem that the default -N flag should probably be set to the number of physical cores rather than hyperthreaded cores; would that be a reasonable patch for me to write for GHC? (Though of course there's the matter of correctly detecting that info in a portable way!)
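As a hypothetical sketch of that idea (the divide-by-two heuristic simply assumes two hardware threads per physical core; as noted above, detecting the real topology portably is the hard part):

import GHC.Conc (getNumProcessors, setNumCapabilities)

-- Assumes 2 hardware threads per physical core; getNumProcessors counts
-- logical (hyperthreaded) CPUs.
capsForPhysicalCores :: IO ()
capsForPhysicalCores = do
  logical <- getNumProcessors
  setNumCapabilities (max 1 (logical `div` 2))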

I unfortunately don't have access to that machine anymore (or presently any SSH access to a mega-core-sized box), so I can't repro it to dig in deeper. I was using some CLI tool whose name escapes me right now [edit: I'll try to dig up what the CLI tool's name was when I have the chance].

Last edited 2 years ago by carter (previous) (diff)

comment:14 Changed 2 years ago by gintas

I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.

-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...

Looks like the sweet spot is somewhere in the middle. How about building ghc --with-rtsopts=-A16m by default? Are there any downsides to that? On my machine the "total memory in use" overhead for that is less than 2x (475MB vs 265MB) and it really helps especially in the degenerate cases.

(The machine I was testing on had 16 real cores without any hyperthreading.)

Changed 2 years ago by gintas

Experiment with 32 capabilities on a 16-core machine

Changed 2 years ago by gintas

Experiment with multiple -A values on the same invocation

comment:15 Changed 2 years ago by gidyn

  • Cc gidyn added

comment:16 Changed 2 years ago by simonmar

You should never have more capabilities than cores.

It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.

Personally I'd like to know why -j is hardly providing any benefit beyond 2-3 cores, regardless of heap size settings.

comment:17 Changed 22 months ago by nh2

  • Cc nh2 added

comment:18 Changed 22 months ago by nh2

@simonmar how much do you think this is affected by #8224?

comment:19 Changed 22 months ago by simonmar

@nh2 I don't think anyone has fully investigated what's going on in #8224.

comment:20 Changed 20 months ago by kolmodin

  • Cc kolmodin added

comment:21 Changed 19 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

(Pushing back to 7.12 pending investigation on this ticket and #8224).

comment:22 Changed 16 months ago by erikd

  • Cc erikd added

comment:23 follow-up: Changed 16 months ago by ezyang

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) which all the threads are hitting, it doesn't make sense to try to fix the compiler to scale; you should just run multiple processes of the compiler in parallel (like GHC's build does). Guaranteed scaling!

Do people agree with this viewpoint? Disagree? If we take this viewpoint seriously, we should spend more time improving GHC's ability to output the dependency analysis for other tools to then build.

comment:24 Changed 16 months ago by simonpj

Alternatively, give each thread its own NameCache. To do that, if one thread consumes a ModIface produced by another, it'd need to run over it, looking up all the Names to give them the Unique local to that thread. Not so very different to having entirely separate processes communicating through files; and the latter is perhaps easier to think about.

So, yes.

comment:25 Changed 16 months ago by simonmar

@ezyang: I absolutely agree. Parallelism within a single process is always going to be harder to scale than separate processes. That said, there might still be low-hanging fruit (e.g. better default heap settings) for parallel compilation.

I very much like -j in GHCi though - we cut the time to load a large project by 3x just by turning on -j in GHCi with some suitable heap settings (-A256m).

comment:26 Changed 16 months ago by yongqli

Will adding -j9 to the ghc-options: field of a cabal project mitigate this in the meantime, until this is fixed?

comment:27 Changed 13 months ago by bgamari

For future readers, highlighting-kate would likely be an excellent package to benchmark with as it has a large number of largely independent modules.

comment:28 Changed 13 months ago by slyfox

The self-contained test contains the 195 modules needed to build highlighting-kate:

http://code.haskell.org/~slyfox/T9221__highlighting-kate-build-benchmark.tar.gz

To run it you need to tweak the 'ghc=' path in mk.bash. On my 8-core box the best results are achieved by setting -A128M (or larger).

The test is handy for running 'perf record' on. My results for ./mk.bash -j8 -A128M:

$ perf record -g ./mk.bash -j8 -A128M
$ perf report -g

+   15,42%     0,78%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,08%     1,62%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   12,70%     0,38%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
+   10,91%     6,47%  ghc_worker  ghc-stage2               [.] clE_info
+   10,50%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,14%     1,58%  ghc_worker  [kernel.vmlinux]         [k] __schedule
+    8,60%     0,00%  ghc_worker  [unknown]                [.] 0x48032822f800c748
+    5,85%     5,64%  ghc_worker  ghc-stage2               [.] evacuate
+    3,47%     0,94%  ghc_worker  ghc-stage2               [.] c7F_info
+    2,91%     0,69%  ghc_worker  [kernel.vmlinux]         [k] pick_next_task_fair
+    2,70%     0,00%  ghc_worker  [unknown]                [.] 0x0000000000000004
+    2,63%     2,28%  ghc_worker  ghc-stage2               [.] c2k_info
+    2,55%     0,00%  ghc_worker  [unknown]                [.] 0x2280f98148088b48
+    1,90%     0,00%  ghc_worker  [unknown]                [.] 0x834807e283da8948
+    1,90%     0,00%  as          ld-2.21.so               [.] _dl_sysdep_start
+    1,89%     0,00%  as          ld-2.21.so               [.] dl_main
+    1,83%     0,15%  as          [kernel.vmlinux]         [k] page_fault
+    1,81%     0,47%  as          ld-2.21.so               [.] _dl_relocate_object

If perf does not lie, most of the time is spent cycling over sleeping kernel threads. clE_info is an 'INFO_TABLE(stg_BLACKHOLE,1,0,BLACKHOLE,"BLACKHOLE","BLACKHOLE")'.

comment:29 Changed 13 months ago by bgamari

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

comment:30 in reply to: ↑ 23 Changed 13 months ago by nh2

Replying to ezyang:

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) ...

Do people agree with this viewpoint? Disagree?

I disagree. Threads should almost always be more efficient to use, as they allow us to efficiently/easily share resources when that makes things faster, but that doesn't mean we have to share everything. Processes force us to share nothing. If building is faster with separate processes, then we should be able to achieve the same speed with threads by simply not sharing the thing that makes it slow and that processes force us not to share.

However, I wouldn't be surprised if this isn't even the problem here.

Replying to slyfox:

If perf does not lie most of the time is spent cycling over sleeping kernel threads

This sounds much more like the problem.

If I had to make a guess (and based on the very limited look I had into this issue last year) it feels like we are accidentally busy polling something somewhere.

When I run some non-build Haskell stuff with +RTS -N18 on the current generation of 18-core AWS instances, with many more Haskell threads than needed for building a 200-module project, and with shorter thread lifetimes than in this case (e.g. let's say building a module takes around 0.5 seconds), that stuff scales pretty nicely, much better than ghc's --make scales here. This makes me think that we might simply be doing something wrong in the parallel upsweep code, and that the rest (compiler, runtime etc.) is doing quite OK.

comment:31 follow-up: Changed 13 months ago by simonmar

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

comment:32 Changed 13 months ago by bgamari

simonmar, can ThreadScope help point one in the direction of which blackhole is to blame? I was under the impression that blackhole performance debugging is an area where our current tools fall a little short, but maybe I've missed something.

Somewhere around here I have the beginnings of a patch adding support for profiling of blackhole block events but I wonder if our new DWARF capabilities might be a better fit for this sort of performance work.

comment:33 in reply to: ↑ 31 Changed 13 months ago by slyfox

Replying to bgamari:

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

At best I get a ~2.5x speedup (-j8 -A128M):

  • -j1 : 41.1s
  • -j8 -A128M : 17.4s

Replying to simonmar:

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

Tried to run as: ./mk.bash -j8 +RTS -A128m -l:

http://code.haskell.org/~slyfox/T9221-A128M-j8-l.eventlog

I see a chain of 'thread yields', 'blocked on an MVar' there, but can't draw any conclusions.

Can I somehow get the callers of the blackhole? I guess 'perf report -G' will lie as stacks are lost, but here is its output:

  Children      Self  Command     Shared Object            Symbol
+   15,92%     0,81%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,54%     1,69%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   13,11%     0,40%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
-   10,85%     6,43%  ghc_worker  ghc-stage2               [.] clI_info
   - 40,76% clI_info (stg_BLACKHOLE_info)
      + 2,23% c8E2_info (ghc_Pretty_reduceDoc_info)
      + 1,81% c9p9_info (ghc_Pretty_vcatzugo_info)
      + 1,65% cJw6_info (ghczmprim_GHCziClasses_divIntzh_info)
      + 1,62% c8zA_info (...others)
      + 1,62% ca_info
      + 1,39% apic_timer_interrupt
      + 1,34% c4nL_info
      + 1,33% c3ZK_info
      + 1,27% c2k_info
      + 1,04% cmKx_info
      + 1,03% cp2g_info
        0,92% cp_info
      + 0,89% shV8_info
      + 0,88% strlen
      + 0,88% c9r1_info
      + 0,87% r63b_info
      + 0,79% s1P5_info
      + 0,73% c4IG_info
      + 0,69% c4uY_info
      + 0,63% r117_info
      + 0,62% sw4Y_info
        0,62% cqIM_info
        0,61% caEA_info
      + 0,58% c2l_info
        0,57% cyqt_info
      + 0,52% c9xp_info
     33,29% 0x480328785000c748
   + 1,82% 0x4b7202f9834807e1
   + 1,54% 0x4808588b48f8e083
   + 0,87% 0x438b482677000003
   + 0,87% 0x8d3b49307507c1f6
   + 0,82% 0x4640cd24fffc498b
   + 0,81% 0xf07f98348fc498b
   + 0,75% 0x9b820f02f8
   + 0,65% 0x24ff07e283da8948
   + 0,63% 0x11d820f0af883
   + 0,61% 0xf4b8b4807438b48
   + 0,53% 0x58c48349677202f8
   + 0,50% 0x48074b8b48d88948
+   10,79%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,44%     1,60%  ghc_worker  [kernel.vmlinux]         [k] __schedule  

comment:34 Changed 13 months ago by thomie

comment:35 Changed 13 months ago by kazu-yamamoto

  • Cc kazu-yamamoto added

comment:36 Changed 13 months ago by simonmar

@bgamari, correct - ThreadScope won't tell us anything about the origin of blackholes, but it would confirm whether there is a lot of blocking on blackholes. Ideally we should see all the capabilities fully busy when compiling a perfectly parallel workload. Note that dependencies between modules reduce the amount of parallelism available in general, so for a good test it would be best to construct an example with plenty of parallelism.

@slyfox, I don't believe perf is telling us anything sensible here.
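For concreteness, a minimal sketch of such a perfectly parallel workload generator (the synth.bash attached later presumably serves the same purpose; the module count, the src/ directory, and the module bodies here are arbitrary and assume src/ already exists):

import Control.Monad (forM_)

-- Writes 100 mutually independent modules under src/, so ghc --make -jN
-- has no dependency edges limiting the available parallelism.
main :: IO ()
main = forM_ [1 .. 100 :: Int] $ \i ->
  writeFile ("src/M" ++ show i ++ ".hs") $ unlines
    [ "module M" ++ show i ++ " where"
    , "value :: Int"
    , "value = length (show [1 .. " ++ show (100000 + i) ++ "])"
    ]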

comment:37 Changed 13 months ago by bgamari

  • Cc scpmw added

@scpmw, would it be possible to identify the expressions being blackholed (assuming that is indeed what is happening here) from perf's output? It seems like this might be a good use-case for our new DWARF support if so.

comment:38 Changed 13 months ago by slyfox

Done the following:

  • patched GHC [1] to print only blackholes in '+RTS -Da' and hide unresolvable addresses as [addr]
  • extended blackhole printer to dump '->indirectee' contents as a heap object (is it safe?)
  • added synchronization to printClosure to save stderr from multiple threads
  • linked ghc-stage2 with debug runtime
  • ran as '-j8 +RTS -A128M -Da'
  • sorted output by most frequent calls ('sort l | uniq -c | sort -n -k1 -r | head -n30')

Here is the result:

4179540 BLACKHOLE(Object [addr] = ghc-prim:GHC.Types.I#([addr])\n)\n
2031781 BLACKHOLE(Object [addr] = ghc:CLabel.IdLabel([addr], [addr], [addr])\n)\n
1782769 BLACKHOLE(Object [addr] = FUN/1(<ghc_X86ziInstr_zdfInstructionInstrzuzdcjumpDestsOfInstr_info>) )\n
 911023 BLACKHOLE(Object [addr] = FUN/1(<s8t1_info>, [addr]) )\n
 474894 BLACKHOLE(Object [addr] = ghc:TypeRep.TyConApp([addr], [addr])\n)\n
 380862 BLACKHOLE(Object [addr] = FUN/2(<ghczmprim_GHCziClasses_zdfOrdIntzuzdcmax_info>) )\n
 370195 BLACKHOLE(Object [addr] = ghc-prim:GHC.Tuple.(,,)([addr], [addr], [addr])\n)\n
 352184 BLACKHOLE(Object [addr] = PAP/2([addr], [addr])\n)\n
 282341 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], <ghc_Var_mkCoVar1_closure>, [addr], [addr], [addr])\n)\n
 190979 BLACKHOLE(Object [addr] = ghc:X86.Instr.JMP([addr], [addr])\n)\n
 189194 BLACKHOLE(Object [addr] = ghc:X86.Instr.JXX([addr], [addr])\n)\n
 176401 BLACKHOLE(Object [addr] = ghc:Unique.MkUnique([addr])\n)\n
 130071 BLACKHOLE(Object [addr] = PAP/1([addr], [addr])\n)\n
 127332 BLACKHOLE(Object [addr] = ghc:VarEnv.InScope([addr], [addr])\n)\n
 121511 BLACKHOLE(Object [addr] = ghc:TypeRep.FunTy([addr], [addr])\n)\n
 118142 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_globalRegMaybe_info>) )\n
 106457 BLACKHOLE(Object [addr] = ghc:StgCmmMonad.HeapUsage([addr], [addr])\n)\n
  84274 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.Map.Base.Bin([addr], [addr], [addr], [addr], [addr])\n)\n
  83945 BLACKHOLE(Object [addr] = ghc:Module.Module([addr], [addr])\n)\n
  71331 BLACKHOLE(Object [addr] = ghc:TrieMap.SingletonMap([addr], [addr])\n)\n
  69775 BLACKHOLE(Object [addr] = ghc:StgCmmClosure.LFThunk([addr], [addr], [addr], [addr], [addr])\n)\n
  69754 BLACKHOLE(Object [addr] = FUN/2(<spvK_info>, [addr], [addr]) )\n
  69396 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Tip([addr], [addr])\n)\n
  66129 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Nil()\n)\n
  58339 BLACKHOLE(Object [addr] = ghc:HsPat.ConPatOut([addr], [addr], [addr], [addr], <ghc_TcEvidence_emptyTcEvBinds_closure>, [addr], [addr])\n)\n
  55027 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_callerSaves_info>) )\n
  53318 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], [addr], [addr], [addr], [addr])\n)\n
  53142 BLACKHOLE(Object [addr] = FUN/2(<sfXN_info>, [addr]) )\n
  53057 BLACKHOLE(Object [addr] = FUN/2(<ghc_Lexer_zdfMonadPzuzdczgzg_info>) )\n
  34197 BLACKHOLE(Object [addr] = ghc:Name.Name([addr], [addr], [addr], [addr])\n)\n

Is it expected to see I# here?

A small part of the BLACKHOLE patch [1]:

@@ -221,7 +223,7 @@ printClosure( StgClosure *obj )
     case BLACKHOLE:
             debugBelch("BLACKHOLE(");
-            printPtr((StgPtr)((StgInd*)obj)->indirectee);
-            debugBelch(")\n");
+            printObj((StgPtr)((StgInd*)obj)->indirectee);
+            debugBelch(")\\n");
             break;

comment:39 Changed 13 months ago by simonmar

I talked with @bgamari about this a bit today. What I think you're seeing in the above data is entries of *indirections*, not blackholes. Indirections and blackholes look the same, except that in a true black hole, the indirectee points to a TSO rather than a value. Looking for the cases where the indirectee is a TSO will tell you how many times we get blocked on a black hole.

comment:40 Changed 12 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:41 Changed 12 months ago by slyfox

Didn't try to inspect thunks for TSOs yet. Another datapoint:

Built the whole of GHC with SRC_HC_OPTS += -feager-blackholing and ran it on the same source from comment:28 with -j${i} +RTS -A128M >/dev/null.

The best wall-clock (real) result is achieved at -j8 (for a box with 8 logical cores!).

$ for i in `seq 1 40`; do echo "running $i"; time ./mk.bash -j${i} +RTS -A128M >/dev/null; done


running 1

real    0m45.510s
user    0m41.844s
sys     0m2.699s
running 2

real    0m26.609s
user    0m42.121s
sys     0m4.227s
running 3

real    0m20.976s
user    0m47.452s
sys     0m5.892s
running 4

real    0m18.391s
user    0m48.771s
sys     0m7.716s
running 5

real    0m16.504s
user    0m53.015s
sys     0m9.339s
running 6

real    0m16.108s
user    0m59.391s
sys     0m12.256s
running 7

real    0m15.845s
user    1m3.952s
sys     0m15.632s
running 8

real    0m15.163s
user    1m9.096s
sys     0m14.940s
running 9

real    0m15.693s
user    1m11.332s
sys     0m17.059s
running 10

real    0m16.155s
user    1m13.646s
sys     0m18.863s
running 11

real    0m16.595s
user    1m16.520s
sys     0m19.622s
running 12

real    0m19.407s
user    1m18.191s
sys     0m22.702s
running 13

real    0m17.047s
user    1m18.216s
sys     0m22.033s
running 14

real    0m17.668s
user    1m19.592s
sys     0m24.237s
running 15

real    0m18.087s
user    1m21.217s
sys     0m26.945s
running 16

real    0m18.304s
user    1m21.447s
sys     0m28.425s
running 17

real    0m18.912s
user    1m23.788s
sys     0m29.615s
running 18

real    0m19.715s
user    1m24.838s
sys     0m32.569s
running 19

real    0m20.534s
user    1m25.534s
sys     0m31.648s
running 20

real    0m21.440s
user    1m27.836s
sys     0m42.114s

comment:42 Changed 11 months ago by rrnewton

If blackholing *is* the culprit then I wonder how localized the strictification will need to be...

Ezyang's multi-process suggestion seems appealing. If it's a forkProcess approach, is it possible to load most of the relevant state into memory before forking a child process per module? That could avoid or reduce communication through files or shared pages.

The granularity of per-module compiles is big, so maybe processes would be ok. Also, there's the nice side effect that the GCs of child processes are disentangled, removing that particular scaling bottleneck.
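A hypothetical sketch of that shape (requires the unix package; compileOne is a stand-in, and results would still have to come back through .hi/.o files, as with fully separate processes):

import Control.Monad (forM, forM_)
import System.Posix.Process (forkProcess, getProcessStatus)

-- Load shared state once in the parent, then fork one child per module;
-- children inherit the parent's heap copy-on-write and have independent GCs.
compileWithForks :: sharedState -> [FilePath] -> IO ()
compileWithForks shared modules = do
  pids <- forM modules $ \m -> forkProcess (compileOne shared m)
  forM_ pids (getProcessStatus True False)        -- wait for every child
  where
    compileOne _ m = putStrLn ("compiling " ++ m)  -- placeholder for real work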

Last edited 11 months ago by rrnewton (previous) (diff)

comment:43 Changed 10 months ago by bgamari

  • Description modified (diff)

comment:44 Changed 9 months ago by mboes

  • Cc mboes added

comment:45 Changed 8 months ago by bgamari

  • Milestone changed from 8.0.1 to 8.2.1
  • Priority changed from high to normal

I intend to look into this, but it doesn't look like anything will happen for 8.0, and indeed it's not terribly high priority given that GHC's -j option isn't very widely used.

comment:46 Changed 8 months ago by nh2

given that GHC's -j option isn't very widely used

I think it's only not used so widely because it doesn't work well yet due to this bug - I have worked on 3 projects of 3 companies that would significantly benefit from it.

comment:47 Changed 8 months ago by slyfox

I don't know about other users, but starting from ghc-7.8 every Gentoo package is built with -j4 on machines with enough CPU cores. I'm also thinking about building the initial ghc-cabal for the GHC build in parallel.

comment:48 Changed 6 months ago by dobenour

I remember reading somewhere that the RTS has a global lock for allocating sufficiently large objects. Could that be the problem?

comment:50 Changed 4 weeks ago by slyfox

As -feager-blackholing helps in this case, how would one gather statistics on closures needlessly computed multiple times?

One 'unsafeDupable' example that might be heavyweight when computed multiple times:

http://git.haskell.org/ghc.git/blob/HEAD:/libraries/base/GHC/Fingerprint.hs#l45
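A toy stand-in (not the actual Fingerprint code) showing why unsafeDupablePerformIO work can be repeated: without the noDuplicate# protection that unsafePerformIO adds, two capabilities entering the same thunk at once may both run the computation, so the trace below can fire more than once:

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)
import System.IO.Unsafe (unsafeDupablePerformIO)

expensive :: Int
expensive = unsafeDupablePerformIO $ do
  putStrLn "computing..."          -- may be printed more than once under -threaded
  return (sum [1 .. 10000000])
{-# NOINLINE expensive #-}

main :: IO ()
main = do
  replicateM_ 4 (forkIO (print expensive))
  threadDelay 1000000              -- crude wait for the forked threads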

comment:51 Changed 3 weeks ago by rrnewton

@slyfox -- can we get a fixed-up version of that blackhole debugging patch merged? That seems really great.

Ideally I'd like to put GHC in a debugging mode where, the first time a thread blocks on a black hole, the program crashes with a stack trace ;-). Edit, rinse, repeat until they're all gone.

comment:52 Changed 3 weeks ago by bgamari

Ideally I'd like to put GHC in a debugging mode where, the first time a thread blocks on a black hole, the program crashes with a stack trace ;-). Edit, rinse, repeat until they're all gone.

You may be interested in Phab:D1216, which builds on the DWARF-based statistical profiling support that I have in the works to provide profiling of black-hole entry events. As written, this patch only captures the head of the stack, although this could be changed to provide a deeper stack trace.

comment:53 Changed 12 days ago by slyfox

I've experimented a bit more with trying to pin down where slowdown comes from.

Some observations:

Observation 1. -j <K> not only allows <K> modules to be compiled at the same time, but also enables:

  • <K> Capabilities
  • and <K> garbage collection threads

I've locally removed the Capability adjustment from -j handling and used -j <K> +RTS -N. With that, performance does not get as bad with increasing K. That makes sense: the GC OS threads don't fight over the same cache.

It would be nice if +RTS -N took precedence over the -j option.

Observation 2. [Warning: I have no idea how parallel GC works].

The more GC threads we have, the higher the chance that one of the GC threads will finish scanning its part of the heap and will sit in a sched_yield() loop on a free core while the main GC thread waits for the other threads to complete their useful work.

I found this out by changing yieldThread() to print its caller. The vast majority of calls come from any_work():

static rtsBool
any_work (void)
{
    int g;
    gen_workspace *ws;

    gct->any_work++;

    write_barrier();

    // scavenge objects in compacted generation
    if (mark_stack_bd != NULL && !mark_stack_empty()) {
        return rtsTrue;
    }

    // Check for global work in any gen.  We don't need to check for
    // local work, because we have already exited scavenge_loop(),
    // which means there is no local work for this thread.
    for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) {
        ws = &gct->gens[g];
        if (ws->todo_large_objects) return rtsTrue;
        if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
        if (ws->todo_overflow) return rtsTrue;
    }

#if defined(THREADED_RTS)
    if (work_stealing) {
        uint32_t n;
        // look for work to steal
        for (n = 0; n < n_gc_threads; n++) {
            if (n == gct->thread_index) continue;
            for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) {
                ws = &gc_threads[n]->gens[g];
                if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
            }
        }
    }
#endif

    gct->no_work++;
#if defined(THREADED_RTS)
    yieldThread("any_work");
#endif

    return rtsFalse;
}

I need to dig more into how the parallel GC traverses the heap to understand how much of a problem this is.

comment:54 follow-up: Changed 9 days ago by simonmar

@slyfox the GC threads all spin in this any_work() loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.

comment:55 in reply to: ↑ 54 Changed 8 days ago by slyfox

Replying to simonmar:

@slyfox the GC threads all spin in this any_work() loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.

Aha, thanks! I've read through http://community.haskell.org/~simonmar/papers/parallel-gc.pdf and the paper does not look outdated.

The crucial detail: work stealing is enabled by default only for gen 1 and up. As the default nursery is tiny, we don't steal from it. That's why I see poor GC parallelism with large nurseries for this compilation workload:

[ The numbers below are for a 4-CPU system, not the 8-CPU one I used to post before. I'll redo the tests in a few days once I get access to it. ]

-qb1/-qb0 comparison:

-qb1 (default):

$ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS -l

  55,162,983,632 bytes allocated in the heap
     639,999,688 bytes copied during GC
      88,807,552 bytes maximum residency (2 sample(s))
       4,100,072 bytes maximum slop
            3379 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        20 colls,    20 par    5.617s   1.582s     0.0791s    0.1410s
  Gen  1         2 colls,     1 par    0.227s   0.081s     0.0405s    0.0500s

  Parallel GC work balance: 22.34% (serial 0%, perfect 100%)

  TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.027s  (  0.026s elapsed)
  MUT     time   59.245s  ( 18.963s elapsed)
  GC      time    5.843s  (  1.663s elapsed)
  EXIT    time    0.032s  (  0.034s elapsed)
  Total   time   65.174s  ( 20.686s elapsed)

  Alloc rate    931,102,797 bytes per MUT second

  Productivity  91.0% of total user, 91.8% of total elapsed

gc_alloc_block_sync: spin=10830; yield=5
whitehole_spin: 0
gen[0].sync: spin=6297; yield=5
gen[1].sync: spin=4240; yield=3

real	0m20.863s
user	1m5.752s
sys	0m4.724s

-qb0:

  55,154,416,472 bytes allocated in the heap
     840,077,056 bytes copied during GC
     135,018,976 bytes maximum residency (3 sample(s))
       5,739,552 bytes maximum slop
            2375 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        31 colls,    31 par    3.078s   0.859s     0.0277s    0.0682s
  Gen  1         3 colls,     2 par    0.607s   0.177s     0.0591s    0.0947s

  Parallel GC work balance: 74.48% (serial 0%, perfect 100%)

  TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.023s  (  0.023s elapsed)
  MUT     time   59.476s  ( 18.325s elapsed)
  GC      time    3.685s  (  1.036s elapsed)
  EXIT    time    0.027s  (  0.028s elapsed)
  Total   time   63.244s  ( 19.412s elapsed)

  Alloc rate    927,345,481 bytes per MUT second

  Productivity  94.1% of total user, 94.5% of total elapsed

gc_alloc_block_sync: spin=58847; yield=38
whitehole_spin: 0
gen[0].sync: spin=126024; yield=104
gen[1].sync: spin=34648; yield=31

real	0m19.575s
user	1m6.804s
sys	0m1.760s

Note how GC parallelism increases from ~20% to ~75% for this huge nursery of 768M.

Something still hinders mutator parallelism. ThreadScope says threads are usually blocked on MVars for long periods of time (on the order of 0.1-0.2 seconds). Perhaps those are waits for external gcc/gas to assemble modules.

comment:56 follow-up: Changed 7 days ago by simonmar

Yes, perhaps we should default to -qb0 when -A is larger than some threshold.

The mutator parallelism looks not too bad (3.24 out of 4). Was there enough parallelism in the program you were compiling to do better than that?

The MVars are probably just the compilation manager: it fires up a thread for every module, and they wait on the results of compiling the modules they depend on.

comment:57 in reply to: ↑ 56 Changed 7 days ago by slyfox

Replying to simonmar:

Yes, perhaps we should default to -qb0 when -A is larger than some threshold.

The mutator parallelism looks not too bad (3.24 out of 4). Was there enough parallelism in the program you were compiling to do better than that?

I assumed there was, as it's ~100 independent Haskell modules, but some of them are disproportionately large.

I got hold of a 24-core VM to run the same benchmark on. At first it manages to reach full 24-HEC utilisation, but at the tail it clearly does not have enough parallelism:

-qb1: http://code.haskell.org/~slyfox/T9221/ghc-stage2.eventlog.N24.j24.png

-qb0: http://code.haskell.org/~slyfox/T9221/ghc-stage2.eventlog.N24.j24.A256M.qb0.png

The MVars are probably just the compilation manager: it fires up a thread for every module, and they wait on the results of compiling the modules they depend on.

Yes, that's more likely. It would be cool to have a labelMVar, similar to the existing labelThread, to see the IDs/names of the blocking MVars.
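Until something like that exists, a rough workaround sketch: pair MVars with a label and emit an eventlog marker via Debug.Trace.traceEventIO just before blocking, so ThreadScope at least shows which named MVar a thread is about to wait on (LabeledMVar and takeLabeledMVar are made-up names):

import Control.Concurrent.MVar (MVar, takeMVar)
import Debug.Trace (traceEventIO)

data LabeledMVar a = LabeledMVar String (MVar a)

takeLabeledMVar :: LabeledMVar a -> IO a
takeLabeledMVar (LabeledMVar name mv) = do
  traceEventIO ("takeMVar: " ++ name)   -- shows up in the eventlog / ThreadScope
  takeMVar mv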

Changed 4 days ago by slyfox

synth.bash - perfectly parallel workload for ghc

comment:58 Changed 35 hours ago by Sergei Trofimovich <siarheit@…>

In 9d175605/ghc:

GhcMake: limit Capability count to CPU count in parallel mode

In Trac #9221 one of the problems with using a high --jobs=<N>
is the number of mutator (or GC) threads we create.

We use userspace spinning-and-yielding (see ACQUIRE_SPIN_LOCK)
to access the work-stealing queues. When
N-worker-threads > N-CPUs, the fraction of time during which
the thread holding the spin lock gets descheduled by the kernel
increases. That causes other threads to waste CPU time
before giving up the CPU.

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Test Plan:
ghc --make -j8 and -j80 have comparable sys time
on an 8-core system.

Reviewers: austin, gintas, bgamari, simonmar

Reviewed By: bgamari, simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2482

GHC Trac Issues: #9221

comment:59 Changed 35 hours ago by Sergei Trofimovich <siarheit@…>

In a5d26f2/ghc:

rts: enable parallel GC scan of large (32M+) allocation area

Parallel GC does not scan a large allocation area (-A)
effectively, as it does not do work stealing from the nursery
by default.

That leads to a large imbalance when only one of the threads
overflows its allocation area: most of the GC threads finish
quickly (as there is not much to collect) and sit idle
waiting while a single GC thread finishes the scan of the single
allocation area for that thread.

The patch enables work stealing (equivalent of -qb0) for
allocation areas of -A32M or higher.

Tested on a highlighting-kate package from Trac #9221

On an 8-core machine the difference is around 5% of
wall-clock time. On a 24-core VM the speedup is 20%.

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Test Plan: measured wall time and GC parallelism on highlighting-kate build

Reviewers: austin, bgamari, erikd, simonmar

Reviewed By: bgamari, simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2483

GHC Trac Issues: #9221

comment:60 Changed 29 hours ago by carter

Very cool work!

Do we want to limit the number of capabilities to the number of hyperthreaded cores or to the number of physical cores?

Likewise, another opportunity for improvement might be making sure the capabilities (when CPU pinning is enabled for the RTS) are spread out maximally across the available CPU cores. I believe this or a similar issue was discussed / observed by Ryan Yates / fryguybob previously, but I'm not sure whether we had worked out a portable way to decide this or not.

comment:61 follow-up: Changed 27 hours ago by slyfox

[warning: not a NUMA expert]

Tl;DR:

I think it depends on what exactly we hit as a bottleneck.

I have a suspicion that we saturate RAM bandwidth, not the CPU's ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references, and one of the ways to speed GHC up is either to increase memory locality or to decrease heap usage.

Long version:

For a while I tried to figure out why exactly I don't see perfect scaling of ghc --make on my box.

It's easy to see/compare with the synth.bash +RTS -A256M -RTS benchmark run with -j1 / -j.

I don't have hard evidence, but I suspect the bottleneck is not the hyperthreads/real-core execution engines but the RAM bandwidth limit on the CPU-to-memory path. One of the hints is perf stat:

$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

 Performance counter stats for './synth.bash -j +RTS -sstderr -A256M -qb0 -RTS':

     3 248 577 545      cache-references                                              (28,64%)
       740 590 736      cache-misses              #   22,797 % of all cache refs      (42,93%)
   390 025 361 812      cycles                                                        (57,18%)
   171 496 925 132      instructions              #    0,44  insn per cycle                                              (71,45%)
    33 736 976 296      branches                                                      (71,47%)
         1 061 039      faults                                                      
             1 524      migrations                                                  
            67 895      mem-loads                                                     (71,42%)
    27 652 025 890      mem-stores                                                    (14,27%)

      15,131608490 seconds time elapsed

22% of all cache refs are misses. A huge number. I think it dominates performance (assuming memory access is ~100 times slower than CPU cache access), but I have no hard evidence :)

I have 4 cores with 2 hyperthreads each and get the best performance from -j8, not the -j4 one would expect from hyperthread instruction retirement:

-j1: 55s; -j4: 18s; -j6: 15s; -j8: 14.2s; -j10: 15.0s

./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

  66,769,724,456 bytes allocated in the heap
   1,658,350,288 bytes copied during GC
     127,385,728 bytes maximum residency (5 sample(s))
       1,722,080 bytes maximum slop
            2389 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        31 colls,    31 par    6.535s   0.831s     0.0268s    0.0579s
  Gen  1         5 colls,     4 par    1.677s   0.225s     0.0449s    0.0687s

  Parallel GC work balance: 80.03% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   87.599s  ( 12.868s elapsed)
  GC      time    8.212s  (  1.056s elapsed)
  EXIT    time    0.013s  (  0.015s elapsed)
  Total   time   95.841s  ( 13.942s elapsed)

  Alloc rate    762,222,437 bytes per MUT second

  Productivity  91.4% of total user, 92.4% of total elapsed

gc_alloc_block_sync: 83395
whitehole_spin: 0
gen[0].sync: 280927
gen[1].sync: 134537

real    0m14.070s
user    1m44.835s
sys     0m2.899s

I've noticed that building GHC with -fno-worker-wrapper -fno-spec-constr makes GHC 4% faster (-j8); memory allocation is 7% lower (bug #11565 is likely at fault), which also hints at memory throughput as a bottleneck.

The conclusion:

AFAIU, to make the most of GHC we should thus aim for a number of active threads capable of saturating all the memory I/O channels the machine has (but not many more).

perf bench mem all suggests RAM bandwidth is in the range of 2-32GB/s depending on how bad the workload is. I would assume GHC's workload is very non-linear (and thus bad).

comment:62 Changed 27 hours ago by rrnewton

Would people agree that we should not have spinlocks in the system without some back-off strategy? E.g. switch to sleeping/blocking after some number of iterations.

Even a constant upper bound could rein in worst-case behaviors. Apparently OSes provide some notion of "adaptive lock" that has other nice features, e.g.: http://stackoverflow.com/questions/19863734/what-is-pthread-mutex-adaptive-np
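As a Haskell-level illustration of the spin-then-block idea (the RTS spinlocks in question are C, and the iteration bound here is arbitrary):

import Control.Concurrent (yield)
import Control.Concurrent.MVar (MVar, takeMVar, tryTakeMVar)

-- Spin a bounded number of times, then fall back to a blocking take, so a
-- contended lock cannot keep burning a core indefinitely.
acquireWithBackoff :: MVar () -> IO ()
acquireWithBackoff lock = go (1000 :: Int)
  where
    go 0 = takeMVar lock                 -- stop spinning, block properly
    go n = do
      m <- tryTakeMVar lock
      case m of
        Just () -> return ()
        Nothing -> yield >> go (n - 1)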

comment:63 in reply to: ↑ 61 Changed 26 hours ago by carter

Replying to slyfox:

[warning: not a NUMA expert]

Tl;DR:

I think it depends on what exactly we hit as a bottleneck.

I have a suspicion that we saturate RAM bandwidth, not the CPU's ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references, and one of the ways to speed GHC up is either to increase memory locality or to decrease heap usage.

That's exactly why I'm wondering if hyperthreading is messing with us! Each pair of hyperthreads shares the same L1/L2 cache, so if we're memory limited that might be triggering a higher rate of cache thrash. Also, in some cases when the number of capabilities is below the number of cores, I think we pin two capabilities to the same physical core needlessly. I need to dig up those references and revisit that though :)


comment:64 Changed 13 hours ago by simonmar

GHC has a +RTS --numa option that you might want to try.

Memory bandwidth *might* be a scaling issue; it depends a lot on what hardware you're using (how many memory channels, and are they populated with DIMMs?).

You could always verify this by comparing the scaling that you get with N independent GHC processes vs. ghc -jN.

comment:65 Changed 8 hours ago by slyfox

I used the following GNUmakefile for ./synth.bash to compare separate processes:

OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
        ~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
        $(RM) $(OBJECTS)

.PHONY: clean

CPU topology:

$ lstopo-no-graphics 
Machine (30GB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31122 MB
node 0 free: 28003 MB
node distances:
node   0 
  0:  10

Separate processes:

$ make clean; time make -j1
real    1m2.561s
user    0m56.523s
sys     0m5.560s

$ make clean; time taskset --cpu-list 0-3 make -j4

real    0m18.756s
user    1m7.758s
sys     0m6.460s

$ make clean; time make -j4

real    0m18.936s
user    1m7.549s
sys     0m6.857s

$ make clean; time make -j6
real    0m17.365s
user    1m32.107s
sys     0m9.155s

$ make clean; time make -j8

real    0m15.964s
user    1m52.058s
sys     0m9.929s

The speedup compared to -j1 is almost exactly 4x, but it is only reached at -j values higher than 4. Using CPU affinity makes things better at -j4.

$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS

real    0m51.702s
user    0m50.840s
sys     0m0.844s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS

real    0m17.526s
user    1m6.978s
sys     0m1.412s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS

real    0m17.007s
user    1m4.867s
sys     0m1.508s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS

real    0m13.829s
user    1m44.295s
sys     0m2.669s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS

real    0m14.597s
user    1m43.145s
sys     0m3.285s

The speedup compared to -j1 is around 3.5x, again only reached at -j values higher than 4. Using CPU affinity makes things worse at -j4.

In absolute times ghc --make -j is slightly better than separate processes, due to less startup(?) overhead. But something else slowly creeps up and we don't see a 4x factor.

It's more visible on the 24-core VM; I'll post those results in a few minutes.

comment:66 Changed 7 hours ago by slyfox

24-core VM.

CPU topology:

$ lstopo-no-graphics 
Machine (118GB)
  Package L#0 + L3 L#0 (30MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#12)
      PU L#13 (P#13)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#14)
      PU L#15 (P#15)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#16)
      PU L#17 (P#17)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#18)
      PU L#19 (P#19)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#20)
      PU L#21 (P#21)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#22)
      PU L#23 (P#23)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 120881 MB
node 0 free: 120192 MB
node distances:
node   0 
  0:  10 

(I would not trust numactl output).

Separate processes:

$ make clean; time make -j1

real    1m33.147s
user    1m20.836s
sys     0m11.556s

$ make clean; time make -j10

real    0m11.275s
user    1m29.800s
sys     0m12.856s

$ make clean; time make -j12

real    0m10.537s
user    1m36.276s
sys     0m16.948s

$ make clean; time make -j14

real    0m9.117s
user    1m39.132s
sys     0m18.332s

$ make clean; time make -j20

real    0m8.498s
user    2m7.064s
sys     0m17.912s

$ make clean; time make -j22

real    0m7.468s
user    2m9.808s
sys     0m18.592s

$ make clean; time make -j24

real    0m7.336s
user    2m15.936s
sys     0m19.004s

$ make clean; time make -j26

real    0m7.433s
user    2m17.612s
sys     0m19.648s

$ make clean; time make -j28

real    0m7.554s
user    2m17.760s
sys     0m19.564s

$ make clean; time make -j30

real    0m7.563s
user    2m16.776s
sys     0m21.104s

The numbers jump slightly from run to run, but the gist is that the best performance is around -j24, not -j12.

Single process:

$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS

real    1m15.214s
user    1m14.060s
sys     0m0.984s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS

real    0m11.275s
user    1m21.708s
sys     0m2.912s

$ ./synth.bash -j10 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.279s
user    1m25.184s
sys     0m3.664s

$ ./synth.bash -j12 +RTS -sstderr -A256M -qb0 -RTS

real    0m9.605s
user    1m32.688s
sys     0m4.292s

$ ./synth.bash -j14 +RTS -sstderr -A256M -qb0 -RTS

real    0m9.144s
user    1m40.288s
sys     0m4.964s

$ ./synth.bash -j16 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.003s
user    1m51.916s
sys     0m6.604s

$ ./synth.bash -j20 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.215s
user    2m7.924s
sys     0m8.208s

$ ./synth.bash -j22 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.483s
user    2m13.440s
sys     0m10.456s

$ ./synth.bash -j24 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.985s
user    2m18.028s
sys     0m10.780s

$ ./synth.bash -j32 +RTS -sstderr -A256M -qb0 -RTS

real    0m12.636s
user    2m32.312s
sys     0m14.508s

Here we see the best numbers around -j12, and those are worse than the multi-process run.

From perf record it's not very clear what is happening.

I'll try to get a 64-core VM next week and see if the effect is more clearly visible there.
