Opened 2 years ago

Last modified 7 hours ago

#9221 new bug

(super!) linear slowdown of parallel builds on 40 core machine

Reported by: carter Owned by:
Priority: normal Milestone: 8.2.1
Component: Compiler Version: 7.8.2
Keywords: Cc: slyfox, tibbe, gidyn, nh2, kolmodin, erikd, kazu-yamamoto, scpmw, mboes
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug Test Case:
Blocked By: Blocking:
Related Tickets: #910, #8224 Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

I'm seeing slowdowns in parallel builds of a (simple!!) 6-module project when I build it on a 40-core server that I'm using for work. For any given ghc invocation with -jn, once n>10 I start to see a super-linear slowdown as a function of n.

Here are some basic numbers:

compile time
-j1 0m2.693s
-j4 0m2.507s
-j10 0m2.763s
-j25 0m12.634s
-j30 0m39.154s
-j40 0m57.511s
-j60 2m21.821s

These timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / Setup.hs

According to the linux utility latencytop, 100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.

The timing in the -j40 case stayed the same even when ghc was also passed -O0 (plus -fforce-recomp to ensure it did the same work).

A bit of experimentation makes me believe that *any* cabalized project on a 40 core machine will exhibit this performance issue.

cabal clean
cabal configure --ghc-options="-j"
cabal build -j1

should be enough to trigger the lock contention.

That said, I'll try to cook up a minimal repro that I can share the source for posthaste.

Attachments (6)

test-ghc-cabal.sh (515 bytes) - added by slyfox 2 years ago.
a good benchmark for cabal build
graph1.png (24.7 KB) - added by hvr 2 years ago.
Plot of data in comment:2
bench.log (10.9 KB) - added by gintas 2 years ago.
Benchmark log (semaphore value & capability count)
overload.log (3.7 KB) - added by gintas 2 years ago.
Experiment with 32 capabilities on a 16-core machine
heapsize.log (12.6 KB) - added by gintas 2 years ago.
Experiment with multiple -A values on the same invocation
synth.bash (494 bytes) - added by slyfox 4 days ago.
synth.bash - perfectly parallel workload for ghc


Change History (72)

comment:1 Changed 2 years ago by carter

So my understanding is that the parUpsweep function in compiler/main/GhcMake.hs is the main culprit / the thing to focus on for this performance issue, right? (It's literally the only code triggered by the -jN ghc invocation.)

Changed 2 years ago by slyfox

a good benchmark for cabal build

comment:2 follow-up: Changed 2 years ago by slyfox

I've noticed it as well. In superficial tests, -jN with N>4 does not make much sense even on an 8-core box.

Attached a cabal build test (takes ~50 seconds to build). Use as:

$ cd $ghc-source
./test-ghc-cabal.sh >build-log

Results for my machine:

RUN jobs: 1

real    0m51.745s
user    0m49.825s
sys     0m1.553s
RUN jobs: 2

real    0m34.254s
user    0m57.075s
sys     0m6.278s
RUN jobs: 3

real    0m31.670s
user    1m7.058s
sys     0m10.970s
RUN jobs: 4

real    0m32.008s
user    1m10.548s
sys     0m18.194s
RUN jobs: 5

real    0m32.329s
user    1m15.384s
sys     0m27.939s
RUN jobs: 6

real    0m33.993s
user    1m25.190s
sys     0m41.473s
RUN jobs: 7

real    0m35.410s
user    1m32.354s
sys     0m51.201s
RUN jobs: 8

real    0m36.111s
user    1m42.945s
sys     1m1.740s
RUN jobs: 9

real    0m37.426s
user    1m49.708s
sys     1m7.805s
RUN jobs: 10

real    0m40.149s
user    2m0.625s
sys     1m13.054s
RUN jobs: 11

real    0m44.515s
user    2m18.503s
sys     1m21.783s
RUN jobs: 12

real    0m44.393s
user    2m25.161s
sys     1m26.875s
RUN jobs: 13

real    0m47.298s
user    2m44.370s
sys     1m29.611s
RUN jobs: 14

real    0m52.647s
user    3m16.386s
sys     1m37.780s
RUN jobs: 15

real    0m54.757s
user    3m18.954s
sys     1m45.547s
RUN jobs: 16

real    0m58.655s
user    3m49.732s
sys     1m49.191s

Notice how sys time creeps up, eating the performance gain for N>4.

comment:3 Changed 2 years ago by slyfox

  • Cc slyfox added

Changed 2 years ago by hvr

Plot of data in comment:2

comment:4 in reply to: ↑ 2 Changed 2 years ago by hvr

Replying to slyfox:

Notice how sys time creeps up, eating the performance gain for N>4.

It's also quite visible in the quick R plot I made:

Plot of data in comment:2

comment:5 Changed 2 years ago by tibbe

  • Cc tibbe added

comment:6 Changed 2 years ago by slyfox

The box I performed the tests on is an 8-CPU one (4 cores x 2 threads each):

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz

comment:7 Changed 2 years ago by hvr

comment:8 Changed 2 years ago by thoughtpolice

  • Milestone set to 7.10.1

Moving to 7.10.1.

comment:9 Changed 2 years ago by gintas

I think I know what's going on here. If you look at parUpsweep in compiler/main/GhcMake.hs, its argument n_jobs is used in two places: one is the initial value of the par_sem semaphore used to limit parallelization, and the other is a call to setNumCapabilities. The latter seems to be the cause of the slowdown.

Note that setNumCapabilities is only invoked if the previous count of capabilities was 1. I used that to control for both settings independently, and it turns out that the runtime overhead is mostly independent of the semaphore value and highly influenced by capability count.

I ran some experiments on a 16-CPU VM (picked a larger one deliberately to make the differences more pronounced). Running with jobs=4 & caps=4, a test took 37s walltime, jobs=4 & caps=16 took 51s, jobs=4 & caps=32 took 114s (344s of MUT and 1021s of GC!). The figures are very similar for jobs=16 and jobs=64. See attached log for more details (-sstderr output).

It looks like the runtime GC is just inefficient when running with many capabilities, even if many physical cores are available. I'll try a few experiments to verify that this is a general pattern that is not specific to the GhcMake implementation.

Logic and a few experiments indicate that it does not help walltime to set the number of jobs (semaphore value) higher than the number of capabilities, so there's not much we can do about those two parameters in the parUpsweep implementation other than capping n_jobs at some constant (probably <= 8).
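As a rough, hypothetical sketch of the two uses of n_jobs described above (the real code is parUpsweep in compiler/main/GhcMake.hs; error handling and result collection are omitted):

import Control.Concurrent (forkIO, getNumCapabilities, setNumCapabilities)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (bracket_)

-- Hypothetical skeleton, not the actual GhcMake implementation.
parUpsweepSketch :: Int -> [IO ()] -> IO ()
parUpsweepSketch n_jobs compileActions = do
  -- Use 1: n_jobs seeds the semaphore that caps concurrent module compilations.
  par_sem <- newQSem n_jobs
  -- Use 2: n_jobs also becomes the capability count (only if it was 1 before);
  -- this is the part identified above as the source of the slowdown.
  n_caps <- getNumCapabilities
  if n_caps == 1 then setNumCapabilities n_jobs else return ()
  mapM_ (\act -> forkIO (bracket_ (waitQSem par_sem) (signalQSem par_sem) act))
        compileActions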

Changed 2 years ago by gintas

Benchmark log (semaphore value & capability count)

comment:10 Changed 2 years ago by gintas

While looking around I found #3758, and it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:

real    1m3.596s
user    6m31.072s
sys     3m10.732s

With -qg:

real    0m47.747s
user    1m33.352s
sys     0m2.024s

However, for smaller -j values -qg slightly increases walltime:

2 jobs: 44s without -qg, 46s with -qg
4 jobs: 37s vs 40s
8 jobs: 37s vs 41s
15 jobs: 42s vs 44s
16 jobs: 49s vs 44s (walltime crossover point for this 16-core machine)

comment:11 Changed 2 years ago by gintas

Also #8224 looks like it could be related.

comment:12 Changed 2 years ago by simonmar

We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. +RTS -A32m at least.

Check whether your machines really have N cores, or if N/2 of those are hyperthreaded cores. I don't recommend including hyperthreaded cores in the -N value you give to GHC; it's better to leave them free to soak up any extra load while letting the GC synchronize properly.

@carter, when you said "100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.", how did you discover that? Which lock in particular?

comment:13 Changed 2 years ago by carter

Yeah, I definitely found that making the nursery larger helped A LOT; I'd done -A50m or -A100m on a few occasions, though I must admit I didn't think to do that for running GHC itself!

Good point about hyperthreads, that would match my experience of only seeing good perf up to about 15-20ish for -N. This does raise the problem that the default -N flag should probably be set to the number of physical cores rather than hyperthreaded cores; would that be a reasonable patch for me to write for GHC? (Though of course there's the matter of correctly detecting that info in a portable way!)
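As a hypothetical sketch of that idea (the divide-by-two heuristic simply assumes two hardware threads per physical core; as noted above, detecting the real topology portably is the hard part):

import GHC.Conc (getNumProcessors, setNumCapabilities)

-- Assumes 2 hardware threads per physical core; getNumProcessors counts
-- logical (hyperthreaded) CPUs.
capsForPhysicalCores :: IO ()
capsForPhysicalCores = do
  logical <- getNumProcessors
  setNumCapabilities (max 1 (logical `div` 2))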

I unfortunately don't have access to that machine anymore (or presently any SSH access to a mega-core-sized box), so I can't repro it to dig in deeper. I was using some CLI tool whose name escapes me right now [edit: I'll try to dig up what the CLI tool's name was when I have the chance].

Last edited 2 years ago by carter (previous) (diff)

comment:14 Changed 2 years ago by gintas

I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.

-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16 core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...

Looks like the sweet spot is somewhere in the middle. How about building ghc --with-rtsopts=-A16m by default? Are there any downsides to that? On my machine the "total memory in use" overhead for that is less than 2x (475MB vs 265MB) and it really helps especially in the degenerate cases.

(The machine I was testing on had 16 real cores without any hyperthreading.)

Changed 2 years ago by gintas

Experiment with 32 capabilities on a 16-core machine

Changed 2 years ago by gintas

Experiment with multiple -A values on the same invocation

comment:15 Changed 2 years ago by gidyn

  • Cc gidyn added

comment:16 Changed 2 years ago by simonmar

You should never have more capabilities than cores.

It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.

Personally I'd like to know why -j is hardly providing any benefit beyond 2-3 cores, regardless of heap size settings.

comment:17 Changed 22 months ago by nh2

  • Cc nh2 added

comment:18 Changed 22 months ago by nh2

@simonmar how much do you think this is affected by #8224?

comment:19 Changed 22 months ago by simonmar

@nh2 I don't think anyone has fully investigated what's going on in #8224.

comment:20 Changed 20 months ago by kolmodin

  • Cc kolmodin added

comment:21 Changed 19 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

(Pushing back to 7.12 pending investigation on this ticket and #8224).

comment:22 Changed 16 months ago by erikd

  • Cc erikd added

comment:23 follow-up: Changed 16 months ago by ezyang

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) which all the threads are hitting, it doesn't make sense to try to fix the compiler to scale; you should just run multiple processes of the compiler in parallel (like GHC's build does). Guaranteed scaling!

Do people agree with this viewpoint? Disagree? If we take this viewpoint seriously, we should spend more time improving GHC's ability to output the dependency analysis for other tools to then build.

comment:24 Changed 16 months ago by simonpj

Alternatively, give each thread its own NameCache. To do that, if one thread consumes a ModIface produced by another, it'd need to run over it, looking up all the Names to give them the Unique local to that thread. Not so very different to having entirely separate processes communicating through files; and the latter is perhaps easier to think about.

So, yes.

comment:25 Changed 16 months ago by simonmar

@ezyang: I absolutely agree. Parallelism within a single process is always going to be harder to scale than separate processes. That said, there might still be low-hanging fruit (e.g. better default heap settings) for parallel compilation.

I very much like -j in GHCi though - we cut the time to load a large project by 3x just by turning on -j in GHCi with some suitable heap settings (-A256m).

comment:26 Changed 16 months ago by yongqli

Will adding -j9 to the ghc-options: field of a cabal project mitigate this in the meantime, until this is fixed?

comment:27 Changed 13 months ago by bgamari

For future readers, highlighting-kate would likely be an excellent package to benchmark with as it has a large number of largely independent modules.

comment:28 Changed 13 months ago by slyfox

The self-contained test contains the 195 modules needed to build highlighting-kate:

http://code.haskell.org/~slyfox/T9221__highlighting-kate-build-benchmark.tar.gz

To run it you need to tweak the 'ghc=' path in mk.bash. On my 8-core box the best results are achieved by setting -A128M (or larger).

The test is handy for running 'perf record' on. My results for ./mk.bash -j8 -A128M:

$ perf record -g ./mk.bash -j8 -A128M
$ perf report -g

+   15,42%     0,78%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,08%     1,62%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   12,70%     0,38%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
+   10,91%     6,47%  ghc_worker  ghc-stage2               [.] clE_info
+   10,50%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,14%     1,58%  ghc_worker  [kernel.vmlinux]         [k] __schedule
+    8,60%     0,00%  ghc_worker  [unknown]                [.] 0x48032822f800c748
+    5,85%     5,64%  ghc_worker  ghc-stage2               [.] evacuate
+    3,47%     0,94%  ghc_worker  ghc-stage2               [.] c7F_info
+    2,91%     0,69%  ghc_worker  [kernel.vmlinux]         [k] pick_next_task_fair
+    2,70%     0,00%  ghc_worker  [unknown]                [.] 0x0000000000000004
+    2,63%     2,28%  ghc_worker  ghc-stage2               [.] c2k_info
+    2,55%     0,00%  ghc_worker  [unknown]                [.] 0x2280f98148088b48
+    1,90%     0,00%  ghc_worker  [unknown]                [.] 0x834807e283da8948
+    1,90%     0,00%  as          ld-2.21.so               [.] _dl_sysdep_start
+    1,89%     0,00%  as          ld-2.21.so               [.] dl_main
+    1,83%     0,15%  as          [kernel.vmlinux]         [k] page_fault
+    1,81%     0,47%  as          ld-2.21.so               [.] _dl_relocate_object

If perf does not lie, most of the time is spent cycling over sleeping kernel threads. clE_info is an 'INFO_TABLE(stg_BLACKHOLE,1,0,BLACKHOLE,"BLACKHOLE","BLACKHOLE")'.

comment:29 Changed 13 months ago by bgamari

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

comment:30 in reply to: ↑ 23 Changed 13 months ago by nh2

Replying to ezyang:

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) ...

Do people agree with this viewpoint? Disagree?

I disagree. Threads should almost always be more efficient to use, as they allow us to efficiently/easily share resources when that makes things faster, but that doesn't mean we have to share everything. Processes force us to share nothing. If building is faster with separate processes, then we should be able to achieve the same speed with threads by simply not sharing the thing that makes it slow and that processes force us not to share.

However, I wouldn't be surprised if this isn't even the problem here.

Replying to slyfox:

If perf does not lie most of the time is spent cycling over sleeping kernel threads

This sounds much more like the problem.

If I had to make a guess (and based on the very limited look I had into this issue last year) it feels like we are accidentally busy polling something somewhere.

When I run some non-build Haskell stuff with +RTS -N18 on the current generation of 18-core AWS instances, with many more Haskell threads than needed for building a 200-module project, and with shorter thread lifetimes than in this case (e.g. let's say building a module takes around 0.5 seconds), that stuff scales pretty nicely, much better than ghc's --make scales here. This makes me think that we might simply be doing something wrong in the parallel upsweep code, and that the rest (compiler, runtime etc.) is doing quite OK.

comment:31 follow-up: Changed 13 months ago by simonmar

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

comment:32 Changed 13 months ago by bgamari

simonmar, can ThreadScope help point one in the direction of which blackhole is to blame? I was under the impression that blackhole performance debugging is an area where our current tools fall a little short, but maybe I've missed something.

Somewhere around here I have the beginnings of a patch adding support for profiling of blackhole block events but I wonder if our new DWARF capabilities might be a better fit for this sort of performance work.

comment:33 in reply to: ↑ 31 Changed 13 months ago by slyfox

Replying to bgamari:

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

At best I get a ~2.5x speedup (-j8 -A128M):

  • -j1 : 41.1s
  • -j8 -A128M : 17.4s

Replying to simonmar:

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

Tried to run as: ./mk.bash -j8 +RTS -A128m -l:

http://code.haskell.org/~slyfox/T9221-A128M-j8-l.eventlog

I see a chain of 'thread yields', 'blocked on an MVar' there, but can't draw any conclusions.

Can I somehow get the callers of the blackhole? I guess 'perf report -G' will lie as stacks are lost, but here is its output:

  Children      Self  Command     Shared Object            Symbol
+   15,92%     0,81%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,54%     1,69%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   13,11%     0,40%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
-   10,85%     6,43%  ghc_worker  ghc-stage2               [.] clI_info
   - 40,76% clI_info (stg_BLACKHOLE_info)
      + 2,23% c8E2_info (ghc_Pretty_reduceDoc_info)
      + 1,81% c9p9_info (ghc_Pretty_vcatzugo_info)
      + 1,65% cJw6_info (ghczmprim_GHCziClasses_divIntzh_info)
      + 1,62% c8zA_info (...others)
      + 1,62% ca_info
      + 1,39% apic_timer_interrupt
      + 1,34% c4nL_info
      + 1,33% c3ZK_info
      + 1,27% c2k_info
      + 1,04% cmKx_info
      + 1,03% cp2g_info
        0,92% cp_info
      + 0,89% shV8_info
      + 0,88% strlen
      + 0,88% c9r1_info
      + 0,87% r63b_info
      + 0,79% s1P5_info
      + 0,73% c4IG_info
      + 0,69% c4uY_info
      + 0,63% r117_info
      + 0,62% sw4Y_info
        0,62% cqIM_info
        0,61% caEA_info
      + 0,58% c2l_info
        0,57% cyqt_info
      + 0,52% c9xp_info
     33,29% 0x480328785000c748
   + 1,82% 0x4b7202f9834807e1
   + 1,54% 0x4808588b48f8e083
   + 0,87% 0x438b482677000003
   + 0,87% 0x8d3b49307507c1f6
   + 0,82% 0x4640cd24fffc498b
   + 0,81% 0xf07f98348fc498b
   + 0,75% 0x9b820f02f8
   + 0,65% 0x24ff07e283da8948
   + 0,63% 0x11d820f0af883
   + 0,61% 0xf4b8b4807438b48
   + 0,53% 0x58c48349677202f8
   + 0,50% 0x48074b8b48d88948
+   10,79%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,44%     1,60%  ghc_worker  [kernel.vmlinux]         [k] __schedule  

comment:34 Changed 13 months ago by thomie

comment:35 Changed 13 months ago by kazu-yamamoto

  • Cc kazu-yamamoto added

comment:36 Changed 13 months ago by simonmar

@bgamari, correct - ThreadScope won't tell us anything about the origin of blackholes, but it would confirm whether there is a lot of blocking on blackholes. Ideally we should see all the capabilities fully busy when compiling a perfectly parallel workload. Note that dependencies between modules reduce the amount of parallelism available in general, so for a good test it would be best to construct an example with plenty of parallelism.

@slyfox, I don't believe perf is telling us anything sensible here.
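For concreteness, a minimal sketch of such a perfectly parallel workload generator (the synth.bash attached later presumably serves the same purpose; the module count, the src/ directory, and the module bodies here are arbitrary and assume src/ already exists):

import Control.Monad (forM_)

-- Writes 100 mutually independent modules under src/, so ghc --make -jN
-- has no dependency edges limiting the available parallelism.
main :: IO ()
main = forM_ [1 .. 100 :: Int] $ \i ->
  writeFile ("src/M" ++ show i ++ ".hs") $ unlines
    [ "module M" ++ show i ++ " where"
    , "value :: Int"
    , "value = length (show [1 .. " ++ show (100000 + i) ++ "])"
    ]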

comment:37 Changed 13 months ago by bgamari

  • Cc scpmw added

@scpmw, would it be possible to identify the expressions being blackholed (assuming that is indeed what is happening here) from perf's output? It seems like this might be a good use-case for our new DWARF support if so.

comment:38 Changed 13 months ago by slyfox

Done the following:

  • patched GHC [1] to print only blackholes in '+RTS -Da' and hide unresolvable addresses as [addr]
  • extended blackhole printer to dump '->indirectee' contents as a heap object (is it safe?)
  • added synchronization to printClosure to save stderr from multiple threads
  • linked ghc-stage2 with debug runtime
  • ran as '-j8 +RTS -A128M -Da'
  • sorted output by most frequent calls ('sort l | uniq -c | sort -n -k1 -r | head -n30')

Here is the result:

4179540 BLACKHOLE(Object [addr] = ghc-prim:GHC.Types.I#([addr])\n)\n
2031781 BLACKHOLE(Object [addr] = ghc:CLabel.IdLabel([addr], [addr], [addr])\n)\n
1782769 BLACKHOLE(Object [addr] = FUN/1(<ghc_X86ziInstr_zdfInstructionInstrzuzdcjumpDestsOfInstr_info>) )\n
 911023 BLACKHOLE(Object [addr] = FUN/1(<s8t1_info>, [addr]) )\n
 474894 BLACKHOLE(Object [addr] = ghc:TypeRep.TyConApp([addr], [addr])\n)\n
 380862 BLACKHOLE(Object [addr] = FUN/2(<ghczmprim_GHCziClasses_zdfOrdIntzuzdcmax_info>) )\n
 370195 BLACKHOLE(Object [addr] = ghc-prim:GHC.Tuple.(,,)([addr], [addr], [addr])\n)\n
 352184 BLACKHOLE(Object [addr] = PAP/2([addr], [addr])\n)\n
 282341 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], <ghc_Var_mkCoVar1_closure>, [addr], [addr], [addr])\n)\n
 190979 BLACKHOLE(Object [addr] = ghc:X86.Instr.JMP([addr], [addr])\n)\n
 189194 BLACKHOLE(Object [addr] = ghc:X86.Instr.JXX([addr], [addr])\n)\n
 176401 BLACKHOLE(Object [addr] = ghc:Unique.MkUnique([addr])\n)\n
 130071 BLACKHOLE(Object [addr] = PAP/1([addr], [addr])\n)\n
 127332 BLACKHOLE(Object [addr] = ghc:VarEnv.InScope([addr], [addr])\n)\n
 121511 BLACKHOLE(Object [addr] = ghc:TypeRep.FunTy([addr], [addr])\n)\n
 118142 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_globalRegMaybe_info>) )\n
 106457 BLACKHOLE(Object [addr] = ghc:StgCmmMonad.HeapUsage([addr], [addr])\n)\n
  84274 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.Map.Base.Bin([addr], [addr], [addr], [addr], [addr])\n)\n
  83945 BLACKHOLE(Object [addr] = ghc:Module.Module([addr], [addr])\n)\n
  71331 BLACKHOLE(Object [addr] = ghc:TrieMap.SingletonMap([addr], [addr])\n)\n
  69775 BLACKHOLE(Object [addr] = ghc:StgCmmClosure.LFThunk([addr], [addr], [addr], [addr], [addr])\n)\n
  69754 BLACKHOLE(Object [addr] = FUN/2(<spvK_info>, [addr], [addr]) )\n
  69396 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Tip([addr], [addr])\n)\n
  66129 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Nil()\n)\n
  58339 BLACKHOLE(Object [addr] = ghc:HsPat.ConPatOut([addr], [addr], [addr], [addr], <ghc_TcEvidence_emptyTcEvBinds_closure>, [addr], [addr])\n)\n
  55027 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_callerSaves_info>) )\n
  53318 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], [addr], [addr], [addr], [addr])\n)\n
  53142 BLACKHOLE(Object [addr] = FUN/2(<sfXN_info>, [addr]) )\n
  53057 BLACKHOLE(Object [addr] = FUN/2(<ghc_Lexer_zdfMonadPzuzdczgzg_info>) )\n
  34197 BLACKHOLE(Object [addr] = ghc:Name.Name([addr], [addr], [addr], [addr])\n)\n

Is it expected to see I# here?

A small part of the BLACKHOLE patch [1]:

@@ -221,7 +223,7 @@ printClosure( StgClosure *obj )
     case BLACKHOLE:
             debugBelch("BLACKHOLE(");
-            printPtr((StgPtr)((StgInd*)obj)->indirectee);
-            debugBelch(")\n");
+            printObj((StgPtr)((StgInd*)obj)->indirectee);
+            debugBelch(")\\n");
             break;

comment:39 Changed 13 months ago by simonmar

I talked with @bgamari about this a bit today. What I think you're seeing in the above data is entries of *indirections*, not blackholes. Indirections and blackholes look the same, except that in a true black hole, the indirectee points to a TSO rather than a value. Looking for the cases where the indirectee is a TSO will tell you how many times we get blocked on a black hole.

comment:40 Changed 12 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:41 Changed 12 months ago by slyfox

Didn't try to inspect thunks for TSOs yet. Another datapoint:

Built the whole of GHC with SRC_HC_OPTS += -feager-blackholing and ran it on the same source from comment:28 with -j${i} +RTS -A128M >/dev/null.

The best wall-clock (real) result is achieved at -j8 (for a box with 8 logical cores!).

$ for i in `seq 1 40`; do echo "running $i"; time ./mk.bash -j${i} +RTS -A128M >/dev/null; done


running 1

real    0m45.510s
user    0m41.844s
sys     0m2.699s
running 2

real    0m26.609s
user    0m42.121s
sys     0m4.227s
running 3

real    0m20.976s
user    0m47.452s
sys     0m5.892s
running 4

real    0m18.391s
user    0m48.771s
sys     0m7.716s
running 5

real    0m16.504s
user    0m53.015s
sys     0m9.339s
running 6

real    0m16.108s
user    0m59.391s
sys     0m12.256s
running 7

real    0m15.845s
user    1m3.952s
sys     0m15.632s
running 8

real    0m15.163s
user    1m9.096s
sys     0m14.940s
running 9

real    0m15.693s
user    1m11.332s
sys     0m17.059s
running 10

real    0m16.155s
user    1m13.646s
sys     0m18.863s
running 11

real    0m16.595s
user    1m16.520s
sys     0m19.622s
running 12

real    0m19.407s
user    1m18.191s
sys     0m22.702s
running 13

real    0m17.047s
user    1m18.216s
sys     0m22.033s
running 14

real    0m17.668s
user    1m19.592s
sys     0m24.237s
running 15

real    0m18.087s
user    1m21.217s
sys     0m26.945s
running 16

real    0m18.304s
user    1m21.447s
sys     0m28.425s
running 17

real    0m18.912s
user    1m23.788s
sys     0m29.615s
running 18

real    0m19.715s
user    1m24.838s
sys     0m32.569s
running 19

real    0m20.534s
user    1m25.534s
sys     0m31.648s
running 20

real    0m21.440s
user    1m27.836s
sys     0m42.114s

comment:42 Changed 11 months ago by rrnewton

If blackholing *is* the culprit then I wonder how localized the strictification will need to be...

Ezyang's multi-process suggestion seems appealing. If it's a forkProcess approach, is it possible to load most of the relevant state into memory before forking a child process per module? That could avoid or reduce communication through files or shared pages.

The granularity of per-module compiles is big, so maybe processes would be ok. Also, there's the nice side effect that the GCs of child processes are disentangled, removing that particular scaling bottleneck.
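A hypothetical sketch of that shape (requires the unix package; compileOne is a stand-in, and results would still have to come back through .hi/.o files, as with fully separate processes):

import Control.Monad (forM, forM_)
import System.Posix.Process (forkProcess, getProcessStatus)

-- Load shared state once in the parent, then fork one child per module;
-- children inherit the parent's heap copy-on-write and have independent GCs.
compileWithForks :: sharedState -> [FilePath] -> IO ()
compileWithForks shared modules = do
  pids <- forM modules $ \m -> forkProcess (compileOne shared m)
  forM_ pids (getProcessStatus True False)        -- wait for every child
  where
    compileOne _ m = putStrLn ("compiling " ++ m)  -- placeholder for real work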

Last edited 11 months ago by rrnewton (previous) (diff)

comment:43 Changed 10 months ago by bgamari

  • Description modified (diff)

comment:44 Changed 9 months ago by mboes

  • Cc mboes added

comment:45 Changed 8 months ago by bgamari

  • Milestone changed from 8.0.1 to 8.2.1
  • Priority changed from high to normal

I intend to look into this, but it doesn't look like anything will happen for 8.0, and indeed it's not terribly high priority given that GHC's -j option isn't very widely used.

comment:46 Changed 8 months ago by nh2

given that GHC's -j option isn't very widely used

I think it's only not used so widely because it doesn't work well yet due to this bug - I have worked on 3 projects of 3 companies that would significantly benefit from it.

comment:47 Changed 8 months ago by slyfox

I don't know about other users, but starting from ghc-7.8 every Gentoo package is built with -j4 on machines with enough CPU cores. I'm also thinking about building the initial ghc-cabal for the GHC build in parallel.

comment:48 Changed 6 months ago by dobenour

I remember reading somewhere that the RTS has a global lock for allocating sufficiently large objects. Could that be the problem?

comment:50 Changed 4 weeks ago by slyfox

As -feager-blackholing helps in this case, how would one gather statistics on closures needlessly computed multiple times?

One 'unsafeDupable' example that might be heavyweight when computed multiple times:

http://git.haskell.org/ghc.git/blob/HEAD:/libraries/base/GHC/Fingerprint.hs#l45
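A toy stand-in (not the actual Fingerprint code) showing why unsafeDupablePerformIO work can be repeated: without the noDuplicate# protection that unsafePerformIO adds, two capabilities entering the same thunk at once may both run the computation, so the trace below can fire more than once:

import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)
import System.IO.Unsafe (unsafeDupablePerformIO)

expensive :: Int
expensive = unsafeDupablePerformIO $ do
  putStrLn "computing..."          -- may be printed more than once under -threaded
  return (sum [1 .. 10000000])
{-# NOINLINE expensive #-}

main :: IO ()
main = do
  replicateM_ 4 (forkIO (print expensive))
  threadDelay 1000000              -- crude wait for the forked threads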

comment:51 Changed 3 weeks ago by rrnewton

@slyfox -- can we get a fixed-up version of that blackhole debugging patch merged? That seems really great.

Ideally I'd like to put GHC in a debugging mode where, the first time a thread blocks on a black hole, the program crashes with a stack trace ;-). Edit, rinse, repeat until they're all gone.

comment:52 Changed 3 weeks ago by bgamari

Ideally I'd like to put GHC in a debugging mode where, the first time a thread blocks on a black hole, the program crashes with a stack trace ;-). Edit, rinse, repeat until they're all gone.

You may be interested in Phab:D1216, which builds on the DWARF-based statistical profiling support that I have in the works to provide profiling of black-hole entry events. As written, this patch only captures the head of the stack, although this could be changed to provide a deeper stack trace.

comment:53 Changed 12 days ago by slyfox

I've experimented a bit more with trying to pin down where slowdown comes from.

Some observations:

Observation 1. -j <K> not only allows <K> modules to be compiled at the same time, but also enables:

  • <K> Capabilities
  • and <K> garbage collection threads

I've locally removed the Capability adjustment from -j handling and used -j <K> +RTS -N. With that, performance does not get as bad with increasing K. That makes sense: the GC OS threads don't fight over the same cache.

It would be nice if +RTS -N took precedence over the -j option.

Observation 2. [Warning: I have no idea how parallel GC works].

The more GC threads we have, the higher the chance that one of the GC threads will finish scanning its part of the heap and will sit in a sched_yield() loop on a free core while the main GC thread waits for the other threads to complete their useful work.

I found this out by changing yieldThread() to print its caller. The vast majority of calls come from any_work():

static rtsBool
any_work (void)
{
    int g;
    gen_workspace *ws;

    gct->any_work++;

    write_barrier();

    // scavenge objects in compacted generation
    if (mark_stack_bd != NULL && !mark_stack_empty()) {
        return rtsTrue;
    }

    // Check for global work in any gen.  We don't need to check for
    // local work, because we have already exited scavenge_loop(),
    // which means there is no local work for this thread.
    for (g = 0; g < (int)RtsFlags.GcFlags.generations; g++) {
        ws = &gct->gens[g];
        if (ws->todo_large_objects) return rtsTrue;
        if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
        if (ws->todo_overflow) return rtsTrue;
    }

#if defined(THREADED_RTS)
    if (work_stealing) {
        uint32_t n;
        // look for work to steal
        for (n = 0; n < n_gc_threads; n++) {
            if (n == gct->thread_index) continue;
            for (g = RtsFlags.GcFlags.generations-1; g >= 0; g--) {
                ws = &gc_threads[n]->gens[g];
                if (!looksEmptyWSDeque(ws->todo_q)) return rtsTrue;
            }
        }
    }
#endif

    gct->no_work++;
#if defined(THREADED_RTS)
    yieldThread("any_work");
#endif

    return rtsFalse;
}

I need to dig more into how the parallel GC traverses the heap to understand how much of a problem this is.

comment:54 follow-up: Changed 9 days ago by simonmar

@slyfox the GC threads all spin in this any_work() loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.

comment:55 in reply to: ↑ 54 Changed 8 days ago by slyfox

Replying to simonmar:

@slyfox the GC threads all spin in this any_work() loop looking for work they can steal from other threads. It's how the GC work gets distributed amongst the available worker threads.

Aha, thanks! I've read through http://community.haskell.org/~simonmar/papers/parallel-gc.pdf and the paper does not look outdated.

The crucial detail: work stealing is enabled by default only for gen 1 and up. As the default nursery is tiny, we don't steal from it. That's why I see poor GC parallelism with large nurseries for this compilation workload:

[ The numbers below are for a 4-CPU system, not the 8-CPU one I used to post before. I'll redo the tests in a few days once I get access to it. ]

-qb1/-qb0 comparison:

-qb1 (default):

$ time ./mk.bash -j4 +RTS -N4 -RTS +RTS -sstderr -A768M -RTS -O0 -j4 +RTS -l

  55,162,983,632 bytes allocated in the heap
     639,999,688 bytes copied during GC
      88,807,552 bytes maximum residency (2 sample(s))
       4,100,072 bytes maximum slop
            3379 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        20 colls,    20 par    5.617s   1.582s     0.0791s    0.1410s
  Gen  1         2 colls,     1 par    0.227s   0.081s     0.0405s    0.0500s

  Parallel GC work balance: 22.34% (serial 0%, perfect 100%)

  TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.027s  (  0.026s elapsed)
  MUT     time   59.245s  ( 18.963s elapsed)
  GC      time    5.843s  (  1.663s elapsed)
  EXIT    time    0.032s  (  0.034s elapsed)
  Total   time   65.174s  ( 20.686s elapsed)

  Alloc rate    931,102,797 bytes per MUT second

  Productivity  91.0% of total user, 91.8% of total elapsed

gc_alloc_block_sync: spin=10830; yield=5
whitehole_spin: 0
gen[0].sync: spin=6297; yield=5
gen[1].sync: spin=4240; yield=3

real	0m20.863s
user	1m5.752s
sys	0m4.724s

-qb0:

  55,154,416,472 bytes allocated in the heap
     840,077,056 bytes copied during GC
     135,018,976 bytes maximum residency (3 sample(s))
       5,739,552 bytes maximum slop
            2375 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        31 colls,    31 par    3.078s   0.859s     0.0277s    0.0682s
  Gen  1         3 colls,     2 par    0.607s   0.177s     0.0591s    0.0947s

  Parallel GC work balance: 74.48% (serial 0%, perfect 100%)

  TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.023s  (  0.023s elapsed)
  MUT     time   59.476s  ( 18.325s elapsed)
  GC      time    3.685s  (  1.036s elapsed)
  EXIT    time    0.027s  (  0.028s elapsed)
  Total   time   63.244s  ( 19.412s elapsed)

  Alloc rate    927,345,481 bytes per MUT second

  Productivity  94.1% of total user, 94.5% of total elapsed

gc_alloc_block_sync: spin=58847; yield=38
whitehole_spin: 0
gen[0].sync: spin=126024; yield=104
gen[1].sync: spin=34648; yield=31

real	0m19.575s
user	1m6.804s
sys	0m1.760s

Note how GC parallelism increases from ~20% to ~75% for this huge nursery of 768M.

Something still hinders mutator parallelism. ThreadScope says threads are usually blocked on MVars for long periods of time (on the order of 0.1-0.2 seconds). Perhaps those are waits for external gcc/gas to assemble modules.

comment:56 follow-up: Changed 7 days ago by simonmar

Yes, perhaps we should default to -qb0 when -A is larger than some threshold.

The mutator parallelism looks not too bad (3.24 out of 4). Was there enough parallelism in the program you were compiling to do better than that?

The MVars are probably just the compilation manager: it fires up a thread for every module, and they wait on the results of compiling the modules they depend on.

comment:57 in reply to: ↑ 56 Changed 7 days ago by slyfox

Replying to simonmar:

Yes, perhaps we should default to -qb0 when -A is larger than some threshold.

The mutator parallelism looks not too bad (3.24 out of 4). Was there enough parallelism in the program you were compiling to do better than that?

I assumed there was, as it's ~100 independent Haskell modules, but some of them are disproportionately large.

I got hold of a 24-core VM to run the same benchmark on. At first it manages to reach full 24-HEC utilisation, but at the tail it clearly does not have enough parallelism:

-qb1: http://code.haskell.org/~slyfox/T9221/ghc-stage2.eventlog.N24.j24.png

-qb0: http://code.haskell.org/~slyfox/T9221/ghc-stage2.eventlog.N24.j24.A256M.qb0.png

The MVars are probably just the compilation manager: it fires up a thread for every module, and they wait on the results of compiling the modules they depend on.

Yes, that's more likely. It would be cool to have a labelMVar, similar to the existing labelThread, to see the IDs/names of the blocking MVars.
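Until something like that exists, a rough workaround sketch: pair MVars with a label and emit an eventlog marker via Debug.Trace.traceEventIO just before blocking, so ThreadScope at least shows which named MVar a thread is about to wait on (LabeledMVar and takeLabeledMVar are made-up names):

import Control.Concurrent.MVar (MVar, takeMVar)
import Debug.Trace (traceEventIO)

data LabeledMVar a = LabeledMVar String (MVar a)

takeLabeledMVar :: LabeledMVar a -> IO a
takeLabeledMVar (LabeledMVar name mv) = do
  traceEventIO ("takeMVar: " ++ name)   -- shows up in the eventlog / ThreadScope
  takeMVar mv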

Changed 4 days ago by slyfox

synth.bash - perfectly parallel workload for ghc

comment:58 Changed 35 hours ago by Sergei Trofimovich <siarheit@…>

In 9d175605/ghc:

GhcMake: limit Capability count to CPU count in parallel mode

In Trac #9221 one of the problems with using a high --jobs=<N>
is the number of mutator (or GC) threads we create.

We use userspace spinning-and-yielding (see ACQUIRE_SPIN_LOCK)
to access the work-stealing queues. When
N-worker-threads > N-CPUs, the fraction of time during which
the thread holding the spin lock gets descheduled by the kernel
increases. That causes other threads to waste CPU time
before giving up the CPU.

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Test Plan:
ghc --make -j8 and -j80 have comparable sys time
on an 8-core system.

Reviewers: austin, gintas, bgamari, simonmar

Reviewed By: bgamari, simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2482

GHC Trac Issues: #9221

comment:59 Changed 35 hours ago by Sergei Trofimovich <siarheit@…>

In a5d26f2/ghc:

rts: enable parallel GC scan of large (32M+) allocation area

Parallel GC does not scan a large allocation area (-A)
effectively, as it does not do work stealing from the nursery
by default.

That leads to a large imbalance when only one of the threads
overflows its allocation area: most of the GC threads finish
quickly (as there is not much to collect) and sit idle
waiting while a single GC thread finishes the scan of the single
allocation area for that thread.

The patch enables work stealing (equivalent of -qb0) for
allocation areas of -A32M or higher.

Tested on a highlighting-kate package from Trac #9221

On an 8-core machine the difference is around 5% of
wall-clock time. On a 24-core VM the speedup is 20%.

Signed-off-by: Sergei Trofimovich <siarheit@google.com>

Test Plan: measured wall time and GC parallelism on highlighting-kate build

Reviewers: austin, bgamari, erikd, simonmar

Reviewed By: bgamari, simonmar

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2483

GHC Trac Issues: #9221

comment:60 Changed 29 hours ago by carter

Very cool work!

Do we want to limit the number of capabilities to the number of hyperthreaded cores or to the number of physical cores?

Likewise, another opportunity for improvement might be making sure the capabilities (when CPU pinning is enabled for the RTS) are spread out maximally across the available CPU cores. I believe this or a similar issue was discussed / observed by Ryan Yates / fryguybob previously, but I'm not sure whether we had worked out a portable way to decide this or not.

comment:61 follow-up: Changed 27 hours ago by slyfox

[warning: not a NUMA expert]

Tl;DR:

I think it depends on what exactly we hit as a bottleneck.

I have a suspicion that we saturate RAM bandwidth, not the CPU's ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references, and one of the ways to speed GHC up is either to increase memory locality or to decrease heap usage.

Long version:

For a while I tried to figure out why exactly I don't see perfect scaling of ghc --make on my box.

It's easy to see/compare with the synth.bash +RTS -A256M -RTS benchmark run with -j1 / -j.

I don't have hard evidence, but I suspect the bottleneck is not the hyperthreads/real-core execution engines but the RAM bandwidth limit on the CPU-to-memory path. One of the hints is perf stat:

$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations,mem-loads,mem-stores ./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

 Performance counter stats for './synth.bash -j +RTS -sstderr -A256M -qb0 -RTS':

     3 248 577 545      cache-references                                              (28,64%)
       740 590 736      cache-misses              #   22,797 % of all cache refs      (42,93%)
   390 025 361 812      cycles                                                        (57,18%)
   171 496 925 132      instructions              #    0,44  insn per cycle                                              (71,45%)
    33 736 976 296      branches                                                      (71,47%)
         1 061 039      faults                                                      
             1 524      migrations                                                  
            67 895      mem-loads                                                     (71,42%)
    27 652 025 890      mem-stores                                                    (14,27%)

      15,131608490 seconds time elapsed

22% of all cache refs are misses. A huge number. I think it dominates performance (assuming memory access is ~100 times slower than CPU cache access), but I have no hard evidence :)

I have 4 cores with 2 hyperthreads each and get the best performance from -j8, not the -j4 one would expect from hyperthread instruction retirement:

-j1: 55s; -j4: 18s; -j6: 15s; -j8: 14.2s; -j10: 15.0s

./synth.bash -j +RTS -sstderr -A256M -qb0 -RTS

  66,769,724,456 bytes allocated in the heap
   1,658,350,288 bytes copied during GC
     127,385,728 bytes maximum residency (5 sample(s))
       1,722,080 bytes maximum slop
            2389 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        31 colls,    31 par    6.535s   0.831s     0.0268s    0.0579s
  Gen  1         5 colls,     4 par    1.677s   0.225s     0.0449s    0.0687s

  Parallel GC work balance: 80.03% (serial 0%, perfect 100%)

  TASKS: 21 (1 bound, 20 peak workers (20 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time   87.599s  ( 12.868s elapsed)
  GC      time    8.212s  (  1.056s elapsed)
  EXIT    time    0.013s  (  0.015s elapsed)
  Total   time   95.841s  ( 13.942s elapsed)

  Alloc rate    762,222,437 bytes per MUT second

  Productivity  91.4% of total user, 92.4% of total elapsed

gc_alloc_block_sync: 83395
whitehole_spin: 0
gen[0].sync: 280927
gen[1].sync: 134537

real    0m14.070s
user    1m44.835s
sys     0m2.899s

I've noticed that building GHC with -fno-worker-wrapper -fno-spec-constr makes GHC 4% faster (-j8); memory allocation is 7% lower (bug #11565 is likely at fault), which also hints at memory throughput as a bottleneck.

The conclusion:

AFAIU, to make the most of GHC we should thus aim for a number of active threads capable of saturating all the memory I/O channels the machine has (but not many more).

perf bench mem all suggests RAM bandwidth is in the range of 2-32GB/s depending on how bad the workload is. I would assume GHC's workload is very non-linear (and thus bad).

comment:62 Changed 27 hours ago by rrnewton

Would people agree that we should not have spinlocks in the system without some back-off strategy? E.g. switch to sleeping/blocking after some number of iterations.

Even a constant upper bound could rein in worst-case behaviors. Apparently OSes provide some notion of "adaptive lock" that has other nice features, e.g.: http://stackoverflow.com/questions/19863734/what-is-pthread-mutex-adaptive-np
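As a Haskell-level illustration of the spin-then-block idea (the RTS spinlocks in question are C, and the iteration bound here is arbitrary):

import Control.Concurrent (yield)
import Control.Concurrent.MVar (MVar, takeMVar, tryTakeMVar)

-- Spin a bounded number of times, then fall back to a blocking take, so a
-- contended lock cannot keep burning a core indefinitely.
acquireWithBackoff :: MVar () -> IO ()
acquireWithBackoff lock = go (1000 :: Int)
  where
    go 0 = takeMVar lock                 -- stop spinning, block properly
    go n = do
      m <- tryTakeMVar lock
      case m of
        Just () -> return ()
        Nothing -> yield >> go (n - 1)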

comment:63 in reply to: ↑ 61 Changed 26 hours ago by carter

Replying to slyfox:

[warning: not a NUMA expert]

Tl;DR:

I think it depends on what exactly we hit as a bottleneck.

I have a suspicion that we saturate RAM bandwidth, not the CPU's ability to retire instructions due to hyperthreads. Basically GHC does too many non-local references, and one of the ways to speed GHC up is either to increase memory locality or to decrease heap usage.

That's exactly why I'm wondering if hyperthreading is messing with us! Each pair of hyperthreads shares the same L1/L2 cache, so if we're memory limited that might be triggering a higher rate of cache thrash. Also, in some cases when the number of capabilities is below the number of cores, I think we pin two capabilities to the same physical core needlessly. I need to dig up those references and revisit that though :)


comment:64 Changed 13 hours ago by simonmar

GHC has a +RTS --numa option that you might want to try.

Memory bandwidth *might* be a scaling issue; it depends a lot on what hardware you're using (how many memory channels, and are they populated with DIMMs?).

You could always verify this by comparing the scaling that you get with N independent GHC processes vs. ghc -jN.

comment:65 Changed 8 hours ago by slyfox

I used the following GNUmakefile for ./synth.bash to compare separate processes:

OBJECTS := $(patsubst %.hs,%.o,$(wildcard src/*.hs))

all: $(OBJECTS)

src/%.o: src/%.hs
        ~/dev/git/ghc-perf/inplace/bin/ghc-stage2 -c +RTS -A256M -RTS $< -o $@

clean:
        $(RM) $(OBJECTS)

.PHONY: clean

CPU topology:

$ lstopo-no-graphics 
Machine (30GB)
  Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 31122 MB
node 0 free: 28003 MB
node distances:
node   0 
  0:  10

Separate processes:

$ make clean; time make -j1
real    1m2.561s
user    0m56.523s
sys     0m5.560s

$ make clean; time taskset --cpu-list 0-3 make -j4

real    0m18.756s
user    1m7.758s
sys     0m6.460s

$ make clean; time make -j4

real    0m18.936s
user    1m7.549s
sys     0m6.857s

$ make clean; time make -j6
real    0m17.365s
user    1m32.107s
sys     0m9.155s

$ make clean; time make -j8

real    0m15.964s
user    1m52.058s
sys     0m9.929s

The speedup compared to -j1 is almost exactly 4x, but it is only reached at -j values higher than 4. Using CPU affinity makes things better at -j4.

$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS

real    0m51.702s
user    0m50.840s
sys     0m0.844s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -RTS

real    0m17.526s
user    1m6.978s
sys     0m1.412s

$ ./synth.bash -j4 +RTS -sstderr -A256M -qb0 -qa -RTS

real    0m17.007s
user    1m4.867s
sys     0m1.508s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS

real    0m13.829s
user    1m44.295s
sys     0m2.669s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -qa -RTS

real    0m14.597s
user    1m43.145s
sys     0m3.285s

The speedup compared to -j1 is around 3.5x, again only reached at -j values higher than 4. Using CPU affinity makes things worse at -j4.

In absolute times ghc --make -j is slightly better than separate processes, due to less startup(?) overhead. But something else slowly creeps up and we don't see a 4x factor.

It's more visible on the 24-core VM; I'll post those results in a few minutes.

comment:66 Changed 7 hours ago by slyfox

24-core VM.

CPU topology:

$ lstopo-no-graphics 
Machine (118GB)
  Package L#0 + L3 L#0 (30MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#1)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#2)
      PU L#3 (P#3)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#4)
      PU L#5 (P#5)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#6)
      PU L#7 (P#7)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#8)
      PU L#9 (P#9)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#10)
      PU L#11 (P#11)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#12)
      PU L#13 (P#13)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#14)
      PU L#15 (P#15)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#16)
      PU L#17 (P#17)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#18)
      PU L#19 (P#19)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#20)
      PU L#21 (P#21)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#22)
      PU L#23 (P#23)

$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 120881 MB
node 0 free: 120192 MB
node distances:
node   0 
  0:  10 

(I would not trust numactl output).

Separate processes:

$ make clean; time make -j1

real    1m33.147s
user    1m20.836s
sys     0m11.556s

$ make clean; time make -j10

real    0m11.275s
user    1m29.800s
sys     0m12.856s

$ make clean; time make -j12

real    0m10.537s
user    1m36.276s
sys     0m16.948s

$ make clean; time make -j14

real    0m9.117s
user    1m39.132s
sys     0m18.332s

$ make clean; time make -j20

real    0m8.498s
user    2m7.064s
sys     0m17.912s

$ make clean; time make -j22

real    0m7.468s
user    2m9.808s
sys     0m18.592s

$ make clean; time make -j24

real    0m7.336s
user    2m15.936s
sys     0m19.004s

$ make clean; time make -j26

real    0m7.433s
user    2m17.612s
sys     0m19.648s

$ make clean; time make -j28

real    0m7.554s
user    2m17.760s
sys     0m19.564s

$ make clean; time make -j30

real    0m7.563s
user    2m16.776s
sys     0m21.104s

The numbers jump slightly from run to run, but the gist is that the best performance is around -j24, not -j12.

Single process:

$ ./synth.bash -j1 +RTS -sstderr -A256M -qb0 -RTS

real    1m15.214s
user    1m14.060s
sys     0m0.984s

$ ./synth.bash -j8 +RTS -sstderr -A256M -qb0 -RTS

real    0m11.275s
user    1m21.708s
sys     0m2.912s

$ ./synth.bash -j10 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.279s
user    1m25.184s
sys     0m3.664s

$ ./synth.bash -j12 +RTS -sstderr -A256M -qb0 -RTS

real    0m9.605s
user    1m32.688s
sys     0m4.292s

$ ./synth.bash -j14 +RTS -sstderr -A256M -qb0 -RTS

real    0m9.144s
user    1m40.288s
sys     0m4.964s

$ ./synth.bash -j16 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.003s
user    1m51.916s
sys     0m6.604s

$ ./synth.bash -j20 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.215s
user    2m7.924s
sys     0m8.208s

$ ./synth.bash -j22 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.483s
user    2m13.440s
sys     0m10.456s

$ ./synth.bash -j24 +RTS -sstderr -A256M -qb0 -RTS

real    0m10.985s
user    2m18.028s
sys     0m10.780s

$ ./synth.bash -j32 +RTS -sstderr -A256M -qb0 -RTS

real    0m12.636s
user    2m32.312s
sys     0m14.508s

Here we see the best numbers around -j12, and those are worse than the multi-process run.

From perf record it's not very clear what is happening.

I'll try to get a 64-core VM next week and see if the effect is more clearly visible there.
