For a number of months now, we have been designing and developing a new garbage collector for GHC, with the primary goal of improving scaling on multicores. The current garbage collector isn't exactly a slouch: it is a parallel generational design, and after going to some effort to improve locality we managed to achieve some respectable scaling results on up to 8 cores, and the Intel CnC project has been able to achieve even greater speedups (as much as 22x on a 32-core machine) with carefully-tuned benchmarks.
Still, we recognise that the garbage collector is often the bottleneck where scaling is concerned. Tuning a program for scaling usually involves reducing its demands on the memory subsystem and hence reducing the amount of GC that happens. Here's how the current GC impedes scaling:
This is a screenshot from ThreadScope, showing part of the timeline of the execution of a parallel Haskell program - in this case, the minimax program for searching the game tree of 4x4 noughts and crosses. Each bar represents the execution of a single CPU, and the thin orange/green bars are the GCs. Every GC stops all the threads and performs the GC in parallel, before restarting the computation again. This is known as "stop the world" parallal GC - compared to other techniques such as concurrent GC it's easier to implement and generally gives good performance. Many industrial-strength virtual machines use stop-the-world parallel GC in their server products, because it gives the best throughput. However, the diagram clearly shows that we're wasting time here: the GCs aren't well-balanced (the green parts show the actual work, the blank parts are idle time). We deliberately don't load-balance the work in young-generation collections, because doing so impacts locality and is worse for overall performance (we discussed this in the ICFP'09 paper and gave measurements). So we end up wasting some of our processor bandwidth, and hence losing some scaling.
Not only that, but the cost of the all-core synchronisation becomes a bottleneck as the number of cores increases, and the need for rapid synchronisations causes problems if the OS decides to borrow one of the cores to do something else for a while: the symptom of this problem has been dubbed the "last core parallel slowdown" in Haskell.
Here's what our new GC looks like on the same program:
The heap organisation in fact hasn't changed much: there are still per-core young generations and a single old generation, but the improvement is that the per-core young generations can be collected independently without stopping the other threads. Collecting the old generation - the "global heap" - still requires stopping all the threads, but since this is the old generation, global collections are much less frequent than local collections.
Since there are fewer synchronisations and less wasted processor bandwidth, the overall throughput is higher - only about 10% on 8 cores with this example, but there are other (less picturesque) examples that improve more, and it's still early days and we have a lot of tuning to do. We expect the benefits to be greater on larger numbers of cores, and furthermore the "last core slowdown" should be finally banished.
Designs like this have appeared in the literature (starting with Doligez/Leroy POPL'93), and ours borrows ideas from several of these earlier designs while adding a few novel twists of our own. We plan to write up the design properly once all the tuning is done and we have some solid results. I expect it to land in GHC HEAD in a few months time, and it should be in the autumn 2011 major release of GHC.