Opened 23 months ago

Last modified 2 months ago

#9221 new bug

(super!) linear slowdown of parallel builds on 40 core machine

Reported by: carter
Owned by:
Priority: normal
Milestone: 8.2.1
Component: Compiler
Version: 7.8.2
Keywords:
Cc: slyfox, tibbe, gidyn, nh2, kolmodin, erikd, kazu-yamamoto, scpmw, mboes
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: Compile-time performance bug
Test Case:
Blocked By:
Blocking:
Related Tickets: #910, #8224
Differential Rev(s):
Wiki Page:

Description (last modified by bgamari)

I'm seeing slowdowns in parallel builds of a (simple!!) 6-module project when I build it on a 40-core server that I'm using for work. For any given ghc invocation with -jn, once n > 10 I start to see a super-linear slowdown as a function of n.

Here are some basic numbers:

compile time
-j1 0m2.693s
-j4 0m2.507s
-j10 0m2.763s
-j25 0m12.634s
-j30 0m39.154s
-j40 0m57.511s
-j60 2m21.821s

These timings are another 2-4x worse if ghc is invoked indirectly via cabal-install / Setup.hs.

According to the Linux utility latencytop, 100% of ghc's CPU time was spent on user-space lock contention during the -j40 invocation.

The timing in the -j40 case stayed the same even when ghc was also passed -O0 (plus -fforce-recomp to ensure it did the same work).

A bit of experimentation makes me believe that *any* cabalized project on a 40-core machine will exhibit this performance issue.

cabal clean
cabal configure --ghc-options="-j"
cabal build -j1

should be enough to trigger the lock contention.

That said, I'll try to cook up a minimal repro that I can share the source for posthaste.

Attachments (5)

test-ghc-cabal.sh (515 bytes) - added by slyfox 22 months ago.
a good benchmark for cabal build
graph1.png (24.7 KB) - added by hvr 22 months ago.
Plot of data in comment:2
bench.log (10.9 KB) - added by gintas 20 months ago.
Benchmark log (semaphore value & capability count)
overload.log (3.7 KB) - added by gintas 19 months ago.
Experiment with 32 capabilities on a 16-core machine
heapsize.log (12.6 KB) - added by gintas 19 months ago.
Experiment with multiple -A values on the same invocation


Change History (54)

comment:1 Changed 23 months ago by carter

So my understanding is that the parUpsweep function in compiler/main/GhcMake.hs is the main culprit / object to focus on for this performance issue, right? (It's literally the only thing that's tripped by the -jN ghc invocation.)

Changed 22 months ago by slyfox

a good benchmark for cabal build

comment:2 follow-up: Changed 22 months ago by slyfox

I've noticed it as well. In superficial tests, -jN for N>4 does not make much sense even on an 8-core box.

Attached a cabal build test (takes ~50 seconds to build). Use as:

$ cd $ghc-source
$ ./test-ghc-cabal.sh >build-log

Results for my machine:

RUN jobs: 1

real    0m51.745s
user    0m49.825s
sys     0m1.553s
RUN jobs: 2

real    0m34.254s
user    0m57.075s
sys     0m6.278s
RUN jobs: 3

real    0m31.670s
user    1m7.058s
sys     0m10.970s
RUN jobs: 4

real    0m32.008s
user    1m10.548s
sys     0m18.194s
RUN jobs: 5

real    0m32.329s
user    1m15.384s
sys     0m27.939s
RUN jobs: 6

real    0m33.993s
user    1m25.190s
sys     0m41.473s
RUN jobs: 7

real    0m35.410s
user    1m32.354s
sys     0m51.201s
RUN jobs: 8

real    0m36.111s
user    1m42.945s
sys     1m1.740s
RUN jobs: 9

real    0m37.426s
user    1m49.708s
sys     1m7.805s
RUN jobs: 10

real    0m40.149s
user    2m0.625s
sys     1m13.054s
RUN jobs: 11

real    0m44.515s
user    2m18.503s
sys     1m21.783s
RUN jobs: 12

real    0m44.393s
user    2m25.161s
sys     1m26.875s
RUN jobs: 13

real    0m47.298s
user    2m44.370s
sys     1m29.611s
RUN jobs: 14

real    0m52.647s
user    3m16.386s
sys     1m37.780s
RUN jobs: 15

real    0m54.757s
user    3m18.954s
sys     1m45.547s
RUN jobs: 16

real    0m58.655s
user    3m49.732s
sys     1m49.191s

Notice how sys time creeps up, eating the performance gain, for N>4.

comment:3 Changed 22 months ago by slyfox

  • Cc slyfox added

Changed 22 months ago by hvr

Plot of data in comment:2

comment:4 in reply to: ↑ 2 Changed 22 months ago by hvr

Replying to slyfox:

Notice how sys time creeps up taking over performance gain on N>4.

It's also quite visible in the quick R plot I made:

Plot of data in comment:2

comment:5 Changed 22 months ago by tibbe

  • Cc tibbe added

comment:6 Changed 22 months ago by slyfox

The box I performed the tests on is an 8-CPU one (4 cores x 2 threads each):

Intel(R) Core(TM) i7-2700K CPU @ 3.50GHz

comment:7 Changed 22 months ago by hvr

comment:8 Changed 22 months ago by thoughtpolice

  • Milestone set to 7.10.1

Moving to 7.10.1.

comment:9 Changed 20 months ago by gintas

I think I know what's going on here. If you look at parUpsweep in compiler/main/GhcMake.hs, its argument n_jobs is used in two places: one is the initial value of the par_sem semaphore used to limit parallelization, and the other is a call to setNumCapabilities. The latter seems to be the cause of the slowdown.

Note that setNumCapabilities is only invoked if the previous count of capabilities was 1. I used that to control for both settings independently, and it turns out that the runtime overhead is mostly independent of the semaphore value and highly influenced by capability count.

I ran some experiments on a 16-CPU VM (picked a larger one deliberately to make the differences more pronounced). Running with jobs=4 & caps=4, a test took 37s walltime, jobs=4 & caps=16 took 51s, jobs=4 & caps=32 took 114s (344s of MUT and 1021s of GC!). The figures are very similar for jobs=16 and jobs=64. See attached log for more details (-sstderr output).

It looks like the runtime GC is just inefficient when running with many capabilities, even if many physical cores are available. I'll try a few experiments to verify that this is a general pattern that is not specific to the GhcMake implementation.

Logic and a few experiments indicate that it does not help walltime to set the number of jobs (semaphore value) higher than the number of capabilities, so there's not much we can do about those two parameters in the parUpsweep implementation other than capping n_jobs at some constant (probably <= 8).
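For readers without the tree checked out, the relevant logic has roughly this shape. This is a paraphrased sketch, not the actual GhcMake.hs source; names are abbreviated and thread joining is omitted:

import Control.Concurrent (forkIO, getNumCapabilities, setNumCapabilities)
import Control.Concurrent.QSem (newQSem, waitQSem, signalQSem)
import Control.Exception (bracket_)
import Control.Monad (when)

-- Sketch of parUpsweep's two uses of n_jobs.
parUpsweep :: Int -> [IO ()] -> IO ()
parUpsweep n_jobs compiles = do
    -- Use 1: a quantity semaphore caps concurrent module compilations.
    par_sem <- newQSem n_jobs
    -- Use 2: grow the RTS capability count to match, but only if the
    -- process started single-threaded. This is the call identified
    -- above as the source of the slowdown.
    n_caps <- getNumCapabilities
    when (n_caps == 1) $ setNumCapabilities n_jobs
    -- One worker per module; the semaphore enforces the -j limit.
    mapM_ (\compile -> forkIO $
              bracket_ (waitQSem par_sem) (signalQSem par_sem) compile)
          compiles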

Changed 20 months ago by gintas

Benchmark log (semaphore value & capability count)

comment:10 Changed 20 months ago by gintas

While looking around I found #3758, and it appears that +RTS -qg (disable parallel GC) helps a lot with the superlinear overhead. For example, the benchmark above with jobs=24 & caps=24 without -qg took:

real    1m3.596s
user    6m31.072s
sys     3m10.732s

With -qg:

real    0m47.747s
user    1m33.352s
sys     0m2.024s

However, for smaller -j values -qg slightly increases walltime:

2 jobs:  44s without -qg, 46s with -qg
4 jobs:  37s vs 40s
8 jobs:  37s vs 41s
15 jobs: 42s vs 44s
16 jobs: 49s vs 44s (walltime crossover point for this 16-core machine)
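(For concreteness, testing the effect just means forwarding the flag to GHC's own RTS; a hypothetical invocation, not the exact benchmark command:

$ ghc --make -j24 +RTS -qg -RTS Main.hs

and the same again without the '+RTS -qg -RTS' part for the baseline.)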

comment:11 Changed 20 months ago by gintas

Also #8224 looks like it could be related.

comment:12 Changed 20 months ago by simonmar

We probably want to be running with larger heap sizes when there are lots of cores, to counteract the synchronization overhead of stop-the-world GC across many cores. e.g. +RTS -A32m at least.

Check whether your machines really have N cores, or if N/2 of those are hyperthreaded cores. I don't recommend including hyperthreaded cores in the -N value you give to GHC; it's better to leave them free to soak up any extra load while letting the GC synchronize properly.
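(On Linux, one quick way to check for hyperthreading is lscpu, e.g.:

$ lscpu | grep -E 'Socket|Core|Thread'

which reports cores per socket and threads per core; 2 threads per core means half of the advertised CPUs are hyperthreads.)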

@carter, when you said "100% of ghc's cpu time was spent on user-space lock contention when I did the -j40 invocation.", how did you discover that? Which lock in particular?

comment:13 Changed 20 months ago by carter

Yeah, I definitely found that making the nursery larger helped A LOT; I'd done -A50m or -A100m on a few occasions, though I must admit I didn't think to do that for running GHC itself!

Good point about hyperthreads, that would match my experience of only seeing good perf up to about 15-20ish for -N. This does raise the problem that the default -N flag should probably be set to the number of physical cores rather than hyperthreaded cores; would that be a reasonable patch for me to write for GHC? (Though of course there's the matter of correctly detecting that info in a portable way!)

I unfortunately don't have access to that machine anymore (or presently any SSH access to a mega-core-sized box), so I can't repro it to dig in deeper. I was using some CLI tool whose name escapes me right now [edit: I'll try to dig up what the CLI tool's name was when I have the chance].

Last edited 20 months ago by carter

comment:14 Changed 19 months ago by gintas

I ran some experiments with -A and it does help a lot with performance, but also increases peak memory usage. I observed continuous improvement all the way from -A1m to -A128m in terms of walltime (41s to 36s), but "total memory in use" also went up from 265MB to 2182MB. Not sure where the sweet spot is.

-A seems to help especially if the number of capabilities exceeds the number of cores. With 32 capabilities on a 16-core machine, a -qg run took 50s, -A128m took 41s (still a penalty over 36s, but not nearly as bad) and a vanilla run took almost 2min. Of course, total memory use with -A128m went up to 4388m...

Looks like the sweet spot is somewhere in the middle. How about building ghc --with-rtsopts=-A16m by default? Are there any downsides to that? On my machine the "total memory in use" overhead for that is less than 2x (475MB vs 265MB) and it really helps especially in the degenerate cases.

(The machine I was testing on had 16 real cores without any hyperthreading.)
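(For reference, baking an RTS default into a binary at link time uses GHC's -with-rtsopts flag; a hypothetical invocation, independent of how GHC's own build system would wire it up:

$ ghc -rtsopts "-with-rtsopts=-A16m" -o prog Main.hs

after which prog runs with -A16m unless it is overridden on its command line.)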

Changed 19 months ago by gintas

Experiment with 32 capabilities on a 16-core machine

Changed 19 months ago by gintas

Experiment with multiple -A values on the same invocation

comment:15 Changed 19 months ago by gidyn

  • Cc gidyn added

comment:16 Changed 19 months ago by simonmar

You should never have more capabilities than cores.

It's hard to know where to set the default on the memory-vs-time tradeoff curve. GHC has typically been quite conservative here.

Personally I'd like to know why -j is hardly providing any benefit beyond 2-3 cores, regardless of heap size settings.

comment:17 Changed 18 months ago by nh2

  • Cc nh2 added

comment:18 Changed 18 months ago by nh2

@simonmar How much do you think this is affected by #8224?

comment:19 Changed 18 months ago by simonmar

@nh2 I don't think anyone has fully investigated what's going on in #8224.

comment:20 Changed 16 months ago by kolmodin

  • Cc kolmodin added

comment:21 Changed 15 months ago by thoughtpolice

  • Milestone changed from 7.10.1 to 7.12.1

(Pushing back to 7.12 pending investigation on this ticket and #8224).

comment:22 Changed 12 months ago by erikd

  • Cc erikd added

comment:23 follow-up: Changed 12 months ago by ezyang

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) which all the threads are hitting, it doesn't make sense to try to fix the compiler to scale; you should just run multiple processes of the compiler in parallel (like GHC's build does). Guaranteed scaling!

Do people agree with this viewpoint? Disagree? If we take this viewpoint seriously, we should spend more time improving GHC's ability to output the dependency analysis for other tools to then build.

comment:24 Changed 12 months ago by simonpj

Alternatively, give each thread its own NameCache. To do that, if one thread consumes a ModIface produced by another, it'd need to run over it, looking up all the Names to give them the Unique local to that thread. Not so very different to having entirely separate processes communicating through files; and the latter is perhaps easier to think about.

So, yes.
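As an illustration only, here is a toy model of the per-thread cache idea, with made-up types that are nothing like GHC's real NameCache:

import Data.IORef
import qualified Data.Map.Strict as M

type Unique    = Int
type NameCache = IORef (M.Map String Unique, Unique)

-- Each worker thread owns its own cache, so interning never touches
-- shared mutable state.
newNameCache :: IO NameCache
newNameCache = newIORef (M.empty, 0)

-- Re-intern a name against this thread's cache. Running this over
-- every Name in a ModIface produced by another thread would yield
-- Uniques that are consistent locally.
internName :: NameCache -> String -> IO Unique
internName nc s = do
    (m, next) <- readIORef nc
    case M.lookup s m of
        Just u  -> return u
        Nothing -> do
            writeIORef nc (M.insert s next m, next + 1)
            return next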

comment:25 Changed 12 months ago by simonmar

@ezyang: I absolutely agree. Parallelism within a single process is always going to be harder to scale than separate processes. That said, there might still be low-hanging fruit (e.g. better default heap settings) for parallel compilation.

I very much like -j in GHCi though - we cut the time to load a large project by 3x just by turning on -j in GHCi with some suitable heap settings (-A256m).

comment:26 Changed 11 months ago by yongqli

Will adding -j9 to the ghc-options: field of a cabal project mitigate this in the meantime until this is fixed?
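(That is, something like this in the .cabal file:

library
  ghc-options: -j9

with the caveat from the data above that values much beyond the physical core count can backfire.)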

comment:27 Changed 9 months ago by bgamari

For future readers, highlighting-kate would likely be an excellent package to benchmark with as it has a large number of largely independent modules.

comment:28 Changed 9 months ago by slyfox

The self-contained test contains 195 modules to build highlighting-kate:

http://code.haskell.org/~slyfox/T9221__highlighting-kate-build-benchmark.tar.gz

To run it you need to tweak the 'ghc=' path in mk.bash. On my 8-core box the best results are achieved by setting -A128M (or larger).

The test is handy for running 'perf record' on. My results for ./mk.bash -j8 -A128M:

$ perf record -g ./mk.bash -j8 -A128M
$ perf report -g

+   15,42%     0,78%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,08%     1,62%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   12,70%     0,38%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
+   10,91%     6,47%  ghc_worker  ghc-stage2               [.] clE_info
+   10,50%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,14%     1,58%  ghc_worker  [kernel.vmlinux]         [k] __schedule
+    8,60%     0,00%  ghc_worker  [unknown]                [.] 0x48032822f800c748
+    5,85%     5,64%  ghc_worker  ghc-stage2               [.] evacuate
+    3,47%     0,94%  ghc_worker  ghc-stage2               [.] c7F_info
+    2,91%     0,69%  ghc_worker  [kernel.vmlinux]         [k] pick_next_task_fair
+    2,70%     0,00%  ghc_worker  [unknown]                [.] 0x0000000000000004
+    2,63%     2,28%  ghc_worker  ghc-stage2               [.] c2k_info
+    2,55%     0,00%  ghc_worker  [unknown]                [.] 0x2280f98148088b48
+    1,90%     0,00%  ghc_worker  [unknown]                [.] 0x834807e283da8948
+    1,90%     0,00%  as          ld-2.21.so               [.] _dl_sysdep_start
+    1,89%     0,00%  as          ld-2.21.so               [.] dl_main
+    1,83%     0,15%  as          [kernel.vmlinux]         [k] page_fault
+    1,81%     0,47%  as          ld-2.21.so               [.] _dl_relocate_object

If perf does not lie, most of the time is spent cycling over sleeping kernel threads. clE_info is an 'INFO_TABLE(stg_BLACKHOLE,1,0,BLACKHOLE,"BLACKHOLE","BLACKHOLE")'.

comment:29 Changed 9 months ago by bgamari

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

comment:30 in reply to: ↑ 23 Changed 9 months ago by nh2

Replying to ezyang:

I was chatting with one of my colleagues about this problem recently, and they said something very provocative: if GHC is not scaling because there is some global mutable state (e.g. the NameCache) ...

Do people agree with this viewpoint? Disagree?

I disagree. Threads should almost always be more efficient: they let us share resources cheaply and easily when sharing makes things faster, but that doesn't mean we have to share all the things. Processes force us to share nothing. If building is faster with separate processes, then we should be able to achieve the same speed with threads by simply not sharing the thing that makes it slow and that processes force us not to share.

However, I wouldn't be surprised if this isn't even the problem here.

Replying to slyfox:

If perf does not lie most of the time is spent cycling over sleeping kernel threads

This sounds much more like the problem.

If I had to make a guess (and based on the very limited look I had into this issue last year) it feels like we are accidentally busy polling something somewhere.

When I run some non-build Haskell stuff with +RTS -N18 on the current generation of 18-core AWS instances, with many more Haskell threads than needed for building a 200-module project, and with shorter thread lifetimes than in this case (e.g. let's say building a module takes around 0.5 seconds), that stuff scales pretty nicely, much better than ghc's --make scales here. This makes me think that we might simply be doing something wrong in the parallel upsweep code, and that the rest (compiler, runtime etc.) is doing quite OK.

comment:31 follow-up: Changed 9 months ago by simonmar

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

comment:32 Changed 9 months ago by bgamari

simonmar, can ThreadScope help point one in the direction of which blackhole is to blame? I was under the impression that blackhole performance debugging is an area where our current tools fall a little short, but maybe I've missed something.

Somewhere around here I have the beginnings of a patch adding support for profiling of blackhole block events, but I wonder if our new DWARF capabilities might be a better fit for this sort of performance work.

comment:33 in reply to: ↑ 31 Changed 9 months ago by slyfox

Replying to bgamari:

slyfox, out of curiosity what sort of parallel speed-up did you observe in that test?

At best I get a ~2.5x speedup (-j8 -A128M):

  • -j1 : 41.1s
  • -j8 -A128M : 17.4s

Replying to simonmar:

It sounds like there might be a lot of threads hitting blackholes, with consequent context-switching churn. Finding out which blackhole would be good - probably something in a shared data structure (NameCache or the FastString table, perhaps). ThreadScope would be a good next step.

Tried to run as: ./mk.bash -j8 +RTS -A128m -l:

http://code.haskell.org/~slyfox/T9221-A128M-j8-l.eventlog

I see a chain of 'thread yields', 'blocked on an MVar' there, but can't draw any conclusions.

Can I somehow get the callers of the blackhole? I guess 'perf report -G' will lie since the stacks are lost, but here is its output:

  Children      Self  Command     Shared Object            Symbol
+   15,92%     0,81%  ghc_worker  libc-2.21.so             [.] __sched_yield
+   15,54%     1,69%  ghc_worker  [kernel.vmlinux]         [k] entry_SYSCALL_64
+   13,11%     0,40%  ghc_worker  [kernel.vmlinux]         [k] sys_sched_yield
-   10,85%     6,43%  ghc_worker  ghc-stage2               [.] clI_info
   - 40,76% clI_info (stg_BLACKHOLE_info)
      + 2,23% c8E2_info (ghc_Pretty_reduceDoc_info)
      + 1,81% c9p9_info (ghc_Pretty_vcatzugo_info)
      + 1,65% cJw6_info (ghczmprim_GHCziClasses_divIntzh_info)
      + 1,62% c8zA_info (...others)
      + 1,62% ca_info
      + 1,39% apic_timer_interrupt
      + 1,34% c4nL_info
      + 1,33% c3ZK_info
      + 1,27% c2k_info
      + 1,04% cmKx_info
      + 1,03% cp2g_info
        0,92% cp_info
      + 0,89% shV8_info
      + 0,88% strlen
      + 0,88% c9r1_info
      + 0,87% r63b_info
      + 0,79% s1P5_info
      + 0,73% c4IG_info
      + 0,69% c4uY_info
      + 0,63% r117_info
      + 0,62% sw4Y_info
        0,62% cqIM_info
        0,61% caEA_info
      + 0,58% c2l_info
        0,57% cyqt_info
      + 0,52% c9xp_info
     33,29% 0x480328785000c748
   + 1,82% 0x4b7202f9834807e1
   + 1,54% 0x4808588b48f8e083
   + 0,87% 0x438b482677000003
   + 0,87% 0x8d3b49307507c1f6
   + 0,82% 0x4640cd24fffc498b
   + 0,81% 0xf07f98348fc498b
   + 0,75% 0x9b820f02f8
   + 0,65% 0x24ff07e283da8948
   + 0,63% 0x11d820f0af883
   + 0,61% 0xf4b8b4807438b48
   + 0,53% 0x58c48349677202f8
   + 0,50% 0x48074b8b48d88948
+   10,79%     0,38%  ghc_worker  [kernel.vmlinux]         [k] schedule
+    9,44%     1,60%  ghc_worker  [kernel.vmlinux]         [k] __schedule  

comment:34 Changed 9 months ago by thomie

comment:35 Changed 9 months ago by kazu-yamamoto

  • Cc kazu-yamamoto added

comment:36 Changed 9 months ago by simonmar

@bgamari, correct - ThreadScope won't tell us anything about the origin of blackholes, but it would confirm whether there is a lot of blocking on blackholes. Ideally we should see all the capabilities fully busy when compiling a perfectly parallel workload. Note that dependencies between modules reduce the amount of parallelism available in general, so for a good test it would be best to construct an example with plenty of parallelism.
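One cheap way to get such a maximally parallel module graph is to generate it; a throwaway generator (a hypothetical helper, not attached to this ticket) could look like:

import Data.List (intercalate)
import System.Environment (getArgs)

-- Write module Mi exporting a value whose evaluation does some work.
gen :: Int -> IO ()
gen i = writeFile ("M" ++ show i ++ ".hs") $ unlines
    [ "module M" ++ show i ++ " (v) where"
    , "v :: Int"
    , "v = sum [1 .. " ++ show (100000 + i) ++ "]"
    ]

-- Usage: e.g. 'runghc Gen.hs 200', then 'ghc --make -jN Main.hs'.
main :: IO ()
main = do
    [nStr] <- getArgs
    let n = read nStr :: Int
    mapM_ gen [1 .. n]
    -- Main imports every Mi, but no Mi depends on another, so the
    -- module graph is perfectly parallel.
    writeFile "Main.hs" $ unlines $
           "module Main where"
         : [ "import qualified M" ++ show i | i <- [1 .. n] ]
        ++ [ "main :: IO ()"
           , "main = print (sum [" ++
               intercalate ", " [ "M" ++ show i ++ ".v" | i <- [1 .. n] ]
               ++ "])"
           ]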

@slyfox, I don't believe perf is telling us anything sensible here.

comment:37 Changed 9 months ago by bgamari

  • Cc scpmw added

@scpmw, would it be possible to identify the expressions being blackholed (assuming that is indeed what is happening here) from perf's output? It seems like this might be a good use-case for our new DWARF support if so.

comment:38 Changed 9 months ago by slyfox

Done the following:

  • patched GHC [1] to print only blackholes in '+RTS -Da' and hide unresolvable addresses as [addr]
  • extended blackhole printer to dump '->indirectee' contents as a heap object (is it safe?)
  • added synchronization to printClosure to save stderr from multiple threads
  • linked ghc-stage2 with debug runtime
  • ran as '-j8 +RTS -A128M -Da'
  • sorted output by most frequent calls ('sort l | uniq -c | sort -n -k1 -r | head -n30')

Here is the result:

4179540 BLACKHOLE(Object [addr] = ghc-prim:GHC.Types.I#([addr])\n)\n
2031781 BLACKHOLE(Object [addr] = ghc:CLabel.IdLabel([addr], [addr], [addr])\n)\n
1782769 BLACKHOLE(Object [addr] = FUN/1(<ghc_X86ziInstr_zdfInstructionInstrzuzdcjumpDestsOfInstr_info>) )\n
 911023 BLACKHOLE(Object [addr] = FUN/1(<s8t1_info>, [addr]) )\n
 474894 BLACKHOLE(Object [addr] = ghc:TypeRep.TyConApp([addr], [addr])\n)\n
 380862 BLACKHOLE(Object [addr] = FUN/2(<ghczmprim_GHCziClasses_zdfOrdIntzuzdcmax_info>) )\n
 370195 BLACKHOLE(Object [addr] = ghc-prim:GHC.Tuple.(,,)([addr], [addr], [addr])\n)\n
 352184 BLACKHOLE(Object [addr] = PAP/2([addr], [addr])\n)\n
 282341 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], <ghc_Var_mkCoVar1_closure>, [addr], [addr], [addr])\n)\n
 190979 BLACKHOLE(Object [addr] = ghc:X86.Instr.JMP([addr], [addr])\n)\n
 189194 BLACKHOLE(Object [addr] = ghc:X86.Instr.JXX([addr], [addr])\n)\n
 176401 BLACKHOLE(Object [addr] = ghc:Unique.MkUnique([addr])\n)\n
 130071 BLACKHOLE(Object [addr] = PAP/1([addr], [addr])\n)\n
 127332 BLACKHOLE(Object [addr] = ghc:VarEnv.InScope([addr], [addr])\n)\n
 121511 BLACKHOLE(Object [addr] = ghc:TypeRep.FunTy([addr], [addr])\n)\n
 118142 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_globalRegMaybe_info>) )\n
 106457 BLACKHOLE(Object [addr] = ghc:StgCmmMonad.HeapUsage([addr], [addr])\n)\n
  84274 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.Map.Base.Bin([addr], [addr], [addr], [addr], [addr])\n)\n
  83945 BLACKHOLE(Object [addr] = ghc:Module.Module([addr], [addr])\n)\n
  71331 BLACKHOLE(Object [addr] = ghc:TrieMap.SingletonMap([addr], [addr])\n)\n
  69775 BLACKHOLE(Object [addr] = ghc:StgCmmClosure.LFThunk([addr], [addr], [addr], [addr], [addr])\n)\n
  69754 BLACKHOLE(Object [addr] = FUN/2(<spvK_info>, [addr], [addr]) )\n
  69396 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Tip([addr], [addr])\n)\n
  66129 BLACKHOLE(Object [addr] = 0tT640fErehCGZtZRn6YbE:Data.IntMap.Base.Nil()\n)\n
  58339 BLACKHOLE(Object [addr] = ghc:HsPat.ConPatOut([addr], [addr], [addr], [addr], <ghc_TcEvidence_emptyTcEvBinds_closure>, [addr], [addr])\n)\n
  55027 BLACKHOLE(Object [addr] = FUN/1(<ghc_CodeGenziPlatformziX86zu64_callerSaves_info>) )\n
  53318 BLACKHOLE(Object [addr] = ghc:Var.Id([addr], [addr], [addr], [addr], [addr], [addr])\n)\n
  53142 BLACKHOLE(Object [addr] = FUN/2(<sfXN_info>, [addr]) )\n
  53057 BLACKHOLE(Object [addr] = FUN/2(<ghc_Lexer_zdfMonadPzuzdczgzg_info>) )\n
  34197 BLACKHOLE(Object [addr] = ghc:Name.Name([addr], [addr], [addr], [addr])\n)\n

Is it expected to see I# here?

small part of BLACKHOLE patch [1]:

@@ -221,7 +223,7 @@ printClosure( StgClosure *obj )
     case BLACKHOLE:
             debugBelch("BLACKHOLE(");
-            printPtr((StgPtr)((StgInd*)obj)->indirectee);
-            debugBelch(")\n");
+            printObj((StgPtr)((StgInd*)obj)->indirectee);
+            debugBelch(")\\n");
             break;

comment:39 Changed 8 months ago by simonmar

I talked with @bgamari about this a bit today. What I think you're seeing in the above data is entries of *indirections*, not blackholes. Indirections and blackholes look the same, except that in a true black hole, the indirectee points to a TSO rather than a value. Looking for the cases where the indirectee is a TSO will tell you how many times we get blocked on a black hole.
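A sketch of how the -Da patch from comment:38 could make that distinction (closure-type and macro names per the RTS headers of that era; an approximation, not a tested change):

case BLACKHOLE: {
    StgClosure *ind = UNTAG_CLOSURE(((StgInd*)obj)->indirectee);
    switch (get_itbl(ind)->type) {
    case TSO:             /* a thread is parked directly on it */
    case BLOCKING_QUEUE:  /* one or more threads are queued on it */
        debugBelch("true BLACKHOLE (blocked thread)\n");
        break;
    default:              /* merely an indirection to a value */
        debugBelch("indirection\n");
        break;
    }
    break;
}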

comment:40 Changed 8 months ago by thoughtpolice

  • Milestone changed from 7.12.1 to 8.0.1

Milestone renamed

comment:41 Changed 8 months ago by slyfox

Didn't try to inspect thunks for TSOs yet. Another datapoint:

Built the whole of GHC with SRC_HC_OPTS += -feager-blackholing and ran it on the same source from comment:28 with -j${i} +RTS -A128M >/dev/null.

The best wall-clock (real) result is achieved at -j8 (for a box with 8 logical cores!):

$ for i in `seq 1 40`; do echo "running $i"; time ./mk.bash -j${i} +RTS -A128M >/dev/null; done


running 1

real    0m45.510s
user    0m41.844s
sys     0m2.699s
running 2

real    0m26.609s
user    0m42.121s
sys     0m4.227s
running 3

real    0m20.976s
user    0m47.452s
sys     0m5.892s
running 4

real    0m18.391s
user    0m48.771s
sys     0m7.716s
running 5

real    0m16.504s
user    0m53.015s
sys     0m9.339s
running 6

real    0m16.108s
user    0m59.391s
sys     0m12.256s
running 7

real    0m15.845s
user    1m3.952s
sys     0m15.632s
running 8

real    0m15.163s
user    1m9.096s
sys     0m14.940s
running 9

real    0m15.693s
user    1m11.332s
sys     0m17.059s
running 10

real    0m16.155s
user    1m13.646s
sys     0m18.863s
running 11

real    0m16.595s
user    1m16.520s
sys     0m19.622s
running 12

real    0m19.407s
user    1m18.191s
sys     0m22.702s
running 13

real    0m17.047s
user    1m18.216s
sys     0m22.033s
running 14

real    0m17.668s
user    1m19.592s
sys     0m24.237s
running 15

real    0m18.087s
user    1m21.217s
sys     0m26.945s
running 16

real    0m18.304s
user    1m21.447s
sys     0m28.425s
running 17

real    0m18.912s
user    1m23.788s
sys     0m29.615s
running 18

real    0m19.715s
user    1m24.838s
sys     0m32.569s
running 19

real    0m20.534s
user    1m25.534s
sys     0m31.648s
running 20

real    0m21.440s
user    1m27.836s
sys     0m42.114s

comment:42 Changed 6 months ago by rrnewton

If blackholing *is* the culprit then I wonder how localized the strictification will need to be...

Ezyang's multi-process suggestion seems appealing. If it's a forkProcess approach, is it possible to load most of the relevant state into memory before forking a child process per module? That could avoid or reduce communication through files or shared pages.

The granularity of per-module compiles is big, so maybe processes would be ok. Also, there's the nice side effect that the GCs of child processes are disentangled, removing that particular scaling bottleneck.
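A toy sketch of that shape (hypothetical names; forkProcess has well-known caveats under the threaded RTS, so a real implementation would need care):

import System.Posix.Process (forkProcess, getProcessStatus)

-- Build the shared state once, then give every module its own child
-- process: each child inherits a copy-on-write snapshot of the parent
-- heap, and its GC is completely independent of its siblings'.
compileAllForked :: IO env -> (env -> FilePath -> IO ()) -> [FilePath] -> IO ()
compileAllForked setup compileOne mods = do
    env  <- setup
    -- (A real version would throttle to at most -jN children at once.)
    pids <- mapM (forkProcess . compileOne env) mods
    -- Block until every child has exited.
    mapM_ (getProcessStatus True False) pids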

Last edited 6 months ago by rrnewton

comment:43 Changed 6 months ago by bgamari

  • Description modified (diff)

comment:44 Changed 5 months ago by mboes

  • Cc mboes added

comment:45 Changed 4 months ago by bgamari

  • Milestone changed from 8.0.1 to 8.2.1
  • Priority changed from high to normal

I intend to look into this, but it doesn't look like anything will happen for 8.0, and indeed it's not terribly high priority given that GHC's -j option isn't very widely used.

comment:46 Changed 4 months ago by nh2

given that GHC's -j option isn't very widely used

I think it's only not used so widely because it doesn't work well yet due to this bug; I have worked on 3 projects at 3 companies that would significantly benefit from it.

comment:47 Changed 4 months ago by slyfox

Don't know about other users, but starting from ghc-7.8 every Gentoo package is built with -j4 on machines with enough CPU cores. I'm also thinking about building the initial ghc-cabal for the GHC build in parallel.

comment:48 Changed 2 months ago by dobenour

I remember reading somewhere that the RTS has a global lock for allocating sufficiently large objects. Could that be the problem?
