Posts by author simonmar

The new code generator is nearly ready to go live

I've just spent another week working on the Glorious New Code Generator, and I'm pleased to say it's now in a state that we can think about switching over.

Just to recap, the "new code generator" is a replacement for the STG->Cmm phase of the compiler, using a new intermediate representation based on Hoopl , and separating out the mechanics of layout out stack frames from code generation. It's mainly an architectural change with little visible benefit, but it will be a big improvement in terms of flexibility and modularity inside the compiler. We'll be able to express many optimisations that were hard before, e.g. #1498.

Furthermore, many bits of cruft will go away, such as large chunks of the horrible CoreToStg pass, and the SRT analysis. These things are replaced by much cleaner analyses of the generated code using Hoopl.

For more details on the new code generator see Commentary/Compiler/NewCodeGen, although that page needs a lot of updating after my recent hacking.

The current status in more detail:

  • everything works, with the possible exception of profiling. I've just got a stage2 compiler working, and the whole test suite passes with no new failures.
  • generated code is on average as good as before, sometimes better and sometimes worse. I have a few examples of tight loops that get better with the new code gen, but of course for complex low-level loopy code LLVM will still be necessary to get really good code. On the other hand, GHC can take advantage of its better knowledge about aliasing, and I expect to make more use of this in the future. There's quite a bit of scope for generating better code and adding cool Hoopl-based optimisations later. Things are definitely looking up for low-level performance in GHC.
  • compile times will get a bit worse, at least initially. This is where I've been spending a lot of effort: when I started work on the new code generator it was adding more than 100% to compile times, and I've got that down to about 5% for optimised compilation, 10-15% for non-optimised. Compile times will improve further when we switch over properly and we can start removing a lot of cruft.

We now have a 7.6 branch, so my plan is to fix the one remaining blocker (profiling) and then switch over. There will no doubt be breakage that hasn't shown up so far, but we have until 7.8 to find and fix it.

EDIT: I should mention that the new code generator is the work of many people, including Norman Ramsey, Simon Peyton Jones, John Dias, Edward Yang and probably others I've forgotten.

  • Posted: 2012-08-03 15:13 (Updated: 2012-08-03 15:27)
  • Author: simonmar
  • Categories: (none)
  • Comments (2)

An overhaul of stack management, and some performance improvements

I just committed a patch I've been working on for a few days that revamps the way thread stacks are managed in the runtime system. There'll be some nice performance improvements coming in 7.2.1 as a result of this.

Background

In GHC 7.0 and earlier, a thread is represented by a TSO object in the heap. The TSO contains:

  • some thread-specific state and flags
  • a couple of link fields for linking the TSO onto lists
  • the stack, growing backwards from the end of the TSO

when the stack overflows, as long as the maximum stack size (+RTS -K<size>) has not yet been reached, then the runtime system would create a new TSO double the size of the old one, copy the stack and the thread state into the new object, and mark the old TSO as a ThreadRelocated pointing to the new one. This last step is important, because the TSO might be reachable from other places: a ThreadId is basically a pointer to the TSO, for instance.

This ThreadRelocated thing is a nuisance, because it means every time we derefernce a TSO pointer, we have to check for ThreadRelocated in a little loop. This loop was enshrined in a function deRefTSO() in the RTS, and was called from various places. Over the years we've had several bugs caused by a missing deRefTSO.

Overheads of the old way

Obviously there's a certain amount of overhead associated with copying the whole stack every time we need to enlarge the TSO: every stack word is written approximately twice. What's worse, though, is that the entire stack is traversed during every GC, if the thread is active (threads that haven't run since the last GC are marked "clean" and not traversed). If the thread has a deep stack, this makes minor GCs expensive, which shows up as a high GC overhead. A common workaround is to use a larger allocation area, e.g. +RTS -A4m, but that has the disadvantage of making the allocation area larger than the L2 cache, which also hurts performance.

A better way: stack chunks

The solution to the repeated traversal problem is to identify the parts of the stack that are unmodified since the last GC, and not traverse them. The GC already does a lot of this kind of thing - it's called a "write barrier", and it's a basic requirement for generational GC.

There are two ways we could do this: either do a card-marking trick like we do for mutable arrays, or we could divide the stack up into multiple objects and mark each object as either clean or dirty. The card-marking option would have been fairly complicated to administer, so we discarded that idea. In contrast, having multiple objects containing parts of the stack was relatively straightforward.

Perhaps we should have done this ages ago. I put it off because I thought it would be complex - we have lots of places in the RTS that traverse stacks. It turns out that the majority of these places don't care about stack chunks, which is nice.

Death to ThreadRelocated

The obvious solution to the ThreadRelocated problem is to separate the stack from the TSO, and that's exactly what we've done. Now the TSO contains the thread state only, and it points to a separate STACK object for the topmost stack chunk, containing the stack pointer and the stack itself. The result is that a fragile invariant goes away, which is a very good thing.

Why didn't we do this before? Well, in fact I did try making this change a long time ago, and found that it made performance worse for some thread benchmarks. I put it down to the extra cache-line fill required to access the stack pointer in the separate STACK object. However, this time the results look pretty good - I'm not entirely sure why, maybe it's the combination of doing this with stack chunks, or maybe I did something wrong last time, who knows (or cares!).

Performance

Here are some performance figures from a small collection of thread benchmarks (http://darcs.haskell.org/nofib/smp). This is comparing yesterday's HEAD with my working branch:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed
--------------------------------------------------------------------------------
    callback001        +289.3%    -13.5%     +0.8%     +1.2%
    callback002        +302.0%    -13.5%     -1.3%     -1.3%
          sieve        +329.2%     -0.2%     -1.6%     -1.3%
     threads001        +292.4%     +0.0%     +4.2%     +4.5%
     threads003        +298.9%     -0.5%    -36.5%    -36.4%
     threads006        +304.1%     -2.2%    -66.4%    -66.4%
     threads007        +409.7%     +0.0%     +2.4%     +2.5%
--------------------------------------------------------------------------------
            Min        +289.3%    -13.5%    -66.4%    -66.4%
            Max        +409.7%     +0.0%     +4.2%     +4.5%
 Geometric Mean        +316.3%     -4.5%    -19.3%    -19.2%

Ignore the size figures, it's because the HEAD build was using -split-objs.

We can see that several benchmarks display no difference or get very slightly worse, but a couple (threads003 and threads006) see a big jump in performance. threads003 is an old variant on threadring, and threads006 is a microbenchmark for throwTo. I don't fully understand the difference in performance here, but typically I don't investigate when things go faster, only when they go slower.

GHC itself got a little faster: 1-2% when compiling Cabal.

But what you really wanted to know was... what about threadring, right? Well, it currently goes a few percent slower with the new changes. However, I notice that it's allocating a lot more than before, so I think there's something suspicious going on. I shall investigate...

Update: some more results

BinaryTrees, from the shootout:

  • before: 64.26s
  • after: 17.61s

So I guess the problem with BinaryTrees was more to do with repeated stack traversals than it was to do with a low mortality rate. Nice that we can get good performance from this program without tweaking the heap settings at all now.

Hackage benchmarks, from David Peixotto's fibon benchmark suite:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           Agum        +306.5%     +3.5%     +0.8%     +0.9%     +0.0%
          Bzlib          -----     -----     -----     -----     -----
           Cpsa        +125.1%     +2.1%     -2.2%     -2.3%     +0.0%
         Crypto          -----     -----     -----     -----     -----
            Fgl        +408.9%     +0.1%     +2.0%     +2.0%     -5.4%
            Fst        +137.7%     -0.0%     -4.4%     -4.3%     +0.0%
         Funsat        +151.4%     +0.2%     +1.7%     +1.8%     -2.0%
             Gf         13990k 13885653k      9.03      9.60   116736k
          HaLeX        +195.9%     +0.0%     +1.2%     +1.1%     +0.0%
          Happy        +135.0%     +3.3%     -5.3%     -5.4%    -13.7%
         Hgalib        +236.7%     +1.0%     +0.7%     +0.7%     +0.0%
    Palindromes        +243.1%    -10.1%     +3.4%     +3.4%     +2.1%
          Pappy        +148.9%     +0.5%     +3.7%     +3.8%    -17.7%
     QuickCheck        +139.5%     -0.1%     -1.6%     -1.6%     +0.0%
          Regex        +182.1%     +0.0%     +3.5%     +3.5%    -21.4%
          Simgi        +159.3%     +0.0%     +6.9%     +6.9%     +0.0%
   TernaryTrees        +519.4%     -2.0%    -32.4%    -32.7%    -14.3%
          Xsact        +246.4%     +1.2%     +0.9%     +0.9%     -0.6%
--------------------------------------------------------------------------------
            Min        +125.1%    -10.1%    -32.4%    -32.7%    -21.4%
            Max        +519.4%     +3.5%     +6.9%     +6.9%     +2.1%
 Geometric Mean        +207.6%     -0.1%     -1.9%     -1.9%     -5.2%

One or two nice ones in there: TernaryTrees, Happy, Fst all look good. Simgi probably deserves further investigation.

Notice how the overall memory size is reducing too, by 5% on average.

Update 2: threadring results

threadring 10000000, without -threaded:

  • 6.12.3: 1.50s
  • 7.0.1: 2.05s
  • HEAD with stack patches: 1.67s

So the stack patches recovered almost all the performance lost between 6.12.3 and 7.0.1 on non-threaded threadring. The performance lost in 7.0.1 was due to the BLACKHOLE changes and some other changes to fix pathological cases of bad MVar performance.

More interesting are the -threaded numbers:

  • 6.12.3: 5.86s
  • 7.0.1: 3.00s
  • HEAD with stack patches: 2.31s

The new code is very clearly winning here. I spent a little while with the perf tool on Linux trying to understand the results, but wasn't able to fully account for it, sadly.

  • Posted: 2010-12-15 16:29 (Updated: 2010-12-22 16:19)
  • Author: simonmar
  • Categories: (none)
  • Comments (3)

First results from GHC's new garbage collector

For a number of months now, we have been designing and developing a new garbage collector for GHC, with the primary goal of improving scaling on multicores. The current garbage collector isn't exactly a slouch: it is a parallel generational design, and after going to some effort to improve locality we managed to achieve some respectable scaling results on up to 8 cores, and the Intel CnC project has been able to achieve even greater speedups (as much as 22x on a 32-core machine) with carefully-tuned benchmarks.

Still, we recognise that the garbage collector is often the bottleneck where scaling is concerned. Tuning a program for scaling usually involves reducing its demands on the memory subsystem and hence reducing the amount of GC that happens. Here's how the current GC impedes scaling:

This is a screenshot from ThreadScope, showing part of the timeline of the execution of a parallel Haskell program - in this case, the minimax program for searching the game tree of 4x4 noughts and crosses. Each bar represents the execution of a single CPU, and the thin orange/green bars are the GCs. Every GC stops all the threads and performs the GC in parallel, before restarting the computation again. This is known as "stop the world" parallal GC - compared to other techniques such as concurrent GC it's easier to implement and generally gives good performance. Many industrial-strength virtual machines use stop-the-world parallel GC in their server products, because it gives the best throughput. However, the diagram clearly shows that we're wasting time here: the GCs aren't well-balanced (the green parts show the actual work, the blank parts are idle time). We deliberately don't load-balance the work in young-generation collections, because doing so impacts locality and is worse for overall performance (we discussed this in the ICFP'09 paper and gave measurements). So we end up wasting some of our processor bandwidth, and hence losing some scaling.

Not only that, but the cost of the all-core synchronisation becomes a bottleneck as the number of cores increases, and the need for rapid synchronisations causes problems if the OS decides to borrow one of the cores to do something else for a while: the symptom of this problem has been dubbed the "last core parallel slowdown" in Haskell.

Here's what our new GC looks like on the same program:

The heap organisation in fact hasn't changed much: there are still per-core young generations and a single old generation, but the improvement is that the per-core young generations can be collected independently without stopping the other threads. Collecting the old generation - the "global heap" - still requires stopping all the threads, but since this is the old generation, global collections are much less frequent than local collections.

Since there are fewer synchronisations and less wasted processor bandwidth, the overall throughput is higher - only about 10% on 8 cores with this example, but there are other (less picturesque) examples that improve more, and it's still early days and we have a lot of tuning to do. We expect the benefits to be greater on larger numbers of cores, and furthermore the "last core slowdown" should be finally banished.

Designs like this have appeared in the literature (starting with Doligez/Leroy POPL'93), and ours borrows ideas from several of these earlier designs while adding a few novel twists of our own. We plan to write up the design properly once all the tuning is done and we have some solid results. I expect it to land in GHC HEAD in a few months time, and it should be in the autumn 2011 major release of GHC.

Debian 6.0 "Squeeze" frozen... announcement mentions GHC

The announcement for the recent freeze of Debian 6.0 specifically mentions GHC 6.12 alongside Python, Perl, and GCC in the list of compilers and interpreters for "common languages". A milestone reached? (thanks to Reuben Thomas for sending us the link)

  • Posted: 2010-08-25 13:32 (Updated: 2010-08-25 13:33)
  • Author: simonmar
  • Categories: (none)
  • Comments (1)

The GHC blog

As an experiment, we are moving the GHC blog from http://ghcmutterings.wordpress.com/ to here.

  • Posted: 2010-08-24 13:20 (Updated: 2010-08-24 13:22)
  • Author: simonmar
  • Categories: (none)
  • Comments (0)