Opened 11 months ago

Last modified 4 days ago

#9221 new bug

(super!) linear slowdown of parallel builds on 40 core machine

Reported by: carter Owned by:
Priority: high Milestone: 7.12.1
Component: Compiler Version: 7.8.2
Keywords: Cc: slyfox, tibbe, gidyn, nh2, kolmodin, erikd
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #910 Differential Revisions:

Description

I'm seeing slowdowns in parallel builds of a (simple!) 6-module project when I build it on a 40-core server I'm using for work. For any given GHC invocation with -jn, once n > 10 I start to see a superlinear slowdown as a function of n.

Here are some basic numbers:

at -j1:  0m2.693s
at -j4:  0m2.507s
at -j10: 0m2.763s
at -j25: 0m12.634s
at -j30: 0m39.154s
at -j40: 0m57.511s
at -j60: 2m21.821s

These timings are another 2-4x worse if GHC is invoked indirectly via cabal-install / Setup.hs.

According to the Linux utility latencytop, 100% of GHC's CPU time was spent on user-space lock contention during the -j40 invocation.

The timing in the -j40 case stayed the same even when GHC was also passed -O0 (and -fforce-recomp, to ensure it did the same work).

A bit of experimentation makes me believe that *ANY* cabalized project on a 40-core machine will exhibit this perf issue.

cabal clean ; cabal configure --ghc-options="-j" ; cabal build -j1

should be enough to trigger the lock contention.

That said, I'll try to cook up a minimal repro whose source I can share posthaste.
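
A rough timing harness along these lines (the project here is just a placeholder, not the actual work project) should reproduce the shape of the numbers above:

# sketch: time a clean cabal build at several -j values, run from the root of any cabalized project
for n in 1 4 10 25 40 60; do
    cabal clean
    cabal configure --ghc-options="-j$n"
    echo "== ghc -j$n =="
    time cabal build -j1
done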

Attachments (5)

test-ghc-cabal.sh (515 bytes) - added by slyfox 11 months ago.
a good benchmark for cabal build
graph1.png (24.7 KB) - added by hvr 11 months ago.
Plot of data in comment:2
bench.log (10.9 KB) - added by gintas 8 months ago.
Benchmark log (semaphore value & capability count)
overload.log (3.7 KB) - added by gintas 8 months ago.
Experiment with 32 capabilities on a 16-core machine
heapsize.log (12.6 KB) - added by gintas 8 months ago.
Experiment with multiple -A values on the same invocation


Change History (31)

comment:1 Changed 11 months ago by carter

So my understanding is that the parUpsweep function in compiler/main/GhcMake.hs is the main culprit / the thing to focus on for this performance issue, right? (It's literally the only code path triggered by the -jN ghc invocation.)

Changed 11 months ago by slyfox

a good benchmark for cabal build

comment:2 follow-up: Changed 11 months ago by slyfox

I've noticed it as well. In superficial tests, -jN with N > 4 does not make much sense even on an 8-core box.

I've attached a cabal build test (it takes ~50 seconds to build). Use it as:

$ cd $ghc-source
$ ./test-ghc-cabal.sh >build-log

Results for my machine:

RUN jobs: 1

real    0m51.745s
user    0m49.825s
sys     0m1.553s
RUN jobs: 2

real    0m34.254s
user    0m57.075s
sys     0m6.278s
RUN jobs: 3

real    0m31.670s
user    1m7.058s
sys     0m10.970s
RUN jobs: 4

real    0m32.008s
user    1m10.548s
sys     0m18.194s
RUN jobs: 5

real    0m32.329s
user    1m15.384s
sys     0m27.939s
RUN jobs: 6

real    0m33.993s
user    1m25.190s
sys     0m41.473s
RUN jobs: 7

real    0m35.410s
user    1m32.354s
sys     0m51.201s
RUN jobs: 8

real    0m36.111s
user    1m42.945s
sys     1m1.740s
RUN jobs: 9

real    0m37.426s
user    1m49.708s
sys     1m7.805s
RUN jobs: 10

real    0m40.149s
user    2m0.625s
sys     1m13.054s
RUN jobs: 11

real    0m44.515s
user    2m18.503s
sys     1m21.783s
RUN jobs: 12

real    0m44.393s
user    2m25.161s
sys     1m26.875s
RUN jobs: 13

real    0m47.298s
user    2m44.370s
sys     1m29.611s
RUN jobs: 14

real    0m52.647s
user    3m16.386s
sys     1m37.780s
RUN jobs: 15

real    0m54.757s
user    3m18.954s
sys     1m45.547s
RUN jobs: 16

real    0m58.655s
user    3m49.732s
sys     1m49.191s

Notice how sys time creeps up, eating the performance gains, for N > 4.

comment:3 Changed 11 months ago by slyfox

  • Cc slyfox added

Changed 11 months ago by hvr

Plot of data in comment:2

comment:4 in reply to: ↑ 2 Changed 11 months ago by hvr

Replying to slyfox:

Notice how sys time creeps up, eating the performance gains, for N > 4.

It's also quite visible in the quick R plot I made:

Plot of data in comment:2

comment:5 Changed 11 months ago by tibbe

  • Cc tibbe added

comment:6 Changed 11 months ago by slyfox

The box I performed the tests on is an 8-CPU one (4 cores x 2 threads each):

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz

comment:7 Changed 11 months ago by hvr

comment:8 Changed 11 months ago by thoughtpolice

  • Milestone set to 7.10.1

Moving to 7.10.1.

comment:9 Changed 8 months ago by gintas

I think I know what's going on here. If you look at parUpsweep in compiler/main/GhcMake.hs, its argument n_jobs is used in two places: one is the initial value of the par_sem semaphore used to limit parallelization, and the other is a call to setNumCapabilities. The latter seems to be the cause of the slowdown.

Note that setNumCapabilities is only invoked if the previous count of capabilities was 1. I used that to control for both settings independently, and it turns out that the runtime overhead is mostly independent of the semaphore value and highly influenced by capability count.

I ran some experiments on a 16-CPU VM (picked a larger one deliberately to make the differences more pronounced). Running with jobs=4 & caps=4, a test took 37s walltime, jobs=4 & caps=16 took 51s, jobs=4 & caps=32 took 114s (344s of MUT and 1021s of GC!). The figures are very similar for jobs=16 and jobs=64. See attached log for more details (-sstderr output).

It looks like the runtime GC is just inefficient when running with many capabilities, even if many physical cores are available. I'll try a few experiments to verify that this is a general pattern that is not specific to the GhcMake implementation.

Logic and a few experiments indicate that it does not help walltime to set the number of jobs (semaphore value) higher than the number of capabilities, so there's not much we can do about those two parameters in the parUpsweep implementation other than capping n_jobs at some constant (probably <= 8).
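
For reference, the two knobs can be varied independently from the command line with something like the following (a sketch only; Main.hs stands in for the benchmark's module set). Because parUpsweep only calls setNumCapabilities when the capability count is still 1, pinning the count with +RTS -N leaves -j controlling just the semaphore:

# sketch: pin the capability count with +RTS -N, vary only the job semaphore with -j
ghc --make Main.hs -fforce-recomp -j4 +RTS -N4 -sstderr -RTS    # jobs=4, caps=4
ghc --make Main.hs -fforce-recomp -j4 +RTS -N16 -sstderr -RTS   # jobs=4, caps=16
ghc --make Main.hs -fforce-recomp -j4 +RTS -N32 -sstderr -RTS   # jobs=4, caps=32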

Changed 8 months ago by gintas

Benchmark log (semaphore value & capability count)

comment:10 Changed 8 months ago by gintas

While looking around I found #3758, and it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:

real 1m3.596s
user 6m31.072s
sys 3m10.732s

With -qg:

real 0m47.747s
user 1m33.352s
sys 0m2.024s

However, for smaller -j values -qg slightly increases walltime:

2 jobs: 44s without -qg, 46s with -qg
4 jobs: 37s vs 40s
8 jobs: 37s vs 41s
15 jobs: 42s vs 44s
16 jobs: 49s vs 44s (walltime crossover point for this 16-core machine)
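
For anyone wanting to try this, the flag can be passed to the compiler's own runtime on the command line, along these lines (a sketch; Main.hs is a placeholder):

# disable parallel GC in GHC's own runtime for one invocation
ghc --make Main.hs -j24 +RTS -qg -RTS

# or through cabal, mirroring the command in the ticket description
cabal configure --ghc-options="-j +RTS -qg -RTS"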

comment:11 Changed 8 months ago by gintas

Also #8224 looks like it could be related.

comment:12 Changed 8 months ago by simonmar

We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores; e.g. +RTS -A32m at least.

Check whether your machines really have N cores, or whether N/2 of those are hyperthreaded cores. I don't recommend including hyperthreaded cores in the -N value you give to GHC; it's better to leave them free to soak up any extra load while letting the GC synchronize properly.

@carter, when you said "100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.", how did you discover that? Which lock in particular?
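
A minimal sketch of that suggestion, assuming a machine with 2-way hyperthreading (the /2 below is that assumption):

# use only the physical cores for -j, together with a 32MB nursery
PHYS=$(( $(nproc) / 2 ))
ghc --make Main.hs -j$PHYS +RTS -A32m -RTS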

comment:13 Changed 8 months ago by carter

Yeah, I definitely found that making the nursery larger helped A LOT; I'd done -A50m or -A100m on a few occasions, though I must admit I didn't think to do that for running GHC itself!

Good point about hyperthreads; that would match my experience of only seeing good perf up to about 15-20ish for -N.
This does suggest that the default -N should probably be set to the number of physical cores rather than hyperthreaded cores. Would that be a reasonable patch for me to write for GHC? (Though of course there's the matter of correctly detecting that info in a portable way!)

I unfortunately don't have access to that machine anymore (or presently any SSH access to a mega-core-sized box), so I can't repro it to dig in deeper. I was using some CLI tool whose name escapes me right now. [edit: I'll try to dig up what the CLI tool's name was when I have the chance]

Last edited 8 months ago by carter

comment:14 Changed 8 months ago by gintas

I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.

-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...

Looks like the sweet spot is somewhere in the middle. How about building ghc --with-rtsopts=-A16m by default? Are there any downsides to that? On my machine the "total memory in use" overhead for that is less than 2x (475MB vs 265MB) and it really helps especially in the degenerate cases.

(The machine I was testing on had 16 real cores without any hyperthreading.)
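
For reference, the -A sweep in the attached heapsize.log could be reproduced with something like the following (a sketch; Main.hs stands in for the benchmark's modules):

# rebuild the same module set under several nursery sizes, collecting RTS stats each time
for a in 1m 4m 16m 32m 64m 128m; do
    echo "== -A$a =="
    ghc --make Main.hs -fforce-recomp -j16 +RTS -A$a -sstderr -RTS
done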

Changed 8 months ago by gintas

Experiment with 32 capabilities on a 16-core machine

Changed 8 months ago by gintas

Experiment with multiple -A values on the same invocation

comment:15 Changed 8 months ago by gidyn

  • Cc gidyn added

comment:16 Changed 8 months ago by simonmar

You should never have more capabilities than cores.

It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.

Personally I'd like to know why -j is hardly providing any benefit beyond 2-3 cores, regardless of heap size settings.

comment:17 Changed 7 months ago by nh2

  • Cc nh2 added

comment:18 Changed 7 months ago by nh2

@simonmar
How much do you think this is affected by #8224?

comment:19 Changed 7 months ago by simonmar

@nh2 I don't think anyone has fully investigated what's going on in #8224.

comment:20 Changed 5 months ago by kolmodin

  • Cc kolmodin added

comment:21 Changed 4 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

(Pushing back to 7.12 pending investigation on this ticket and #8224).

comment:22 Changed 13 days ago by erikd

  • Cc erikd added

comment:23 Changed 13 days ago by ezyang

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) which all the threads are hitting, it doesn't make sense to try to fix the compiler to scale; you should just run multiple processes of the compiler in parallel (like GHC's build does). Guaranteed scaling!

Do people agree with this viewpoint? Disagree? If we take this viewpoint seriously, we should spend more time improving GHC's ability to output its dependency analysis so that other tools can drive the build.
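
For what it's worth, GHC can already emit make-style dependency information that an external build driver could consume; a rough sketch, with file names assumed:

# emit Makefile-style dependencies that an external parallel build tool
# could use to schedule independent ghc -c jobs
ghc -M -dep-makefile .depend Main.hs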

comment:24 Changed 13 days ago by simonpj

Alternatively, give each thread its own NameCache. To do that, if one thread consumes a ModIface produced by another, it'd need to run over it, looking up all the Names to give them the Unique local to that thread. Not so very different to having entirely separate processes communicating through files; and the latter is perhaps easier to think about.

So, yes.

comment:25 Changed 11 days ago by simonmar

@ezyang: I absolutely agree. Parallelism within a single process is always going to be harder to scale than separate processes. That said, there might still be low-hanging fruit (e.g. better default heap settings) for parallel compilation.

I very much like -j in GHCi though - we cut the time to load a large project by 3x just by turning on -j in GHCi with some suitable heap settings (-A256m).
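
Roughly what that looks like in practice (the exact values here are assumptions, not the project's actual settings):

# parallel module loading in GHCi with a larger nursery
ghci -j8 +RTS -A256m -RTS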

comment:26 Changed 4 days ago by yongqli

Will adding -j9 to the ghc-options: field of a cabal project mitigate this in the meantime, until this is fixed?
