Version 5 (modified by guest, 11 years ago) (diff)


The GHC Garbage Collector Notes

These are my notes of the Glasgow Haskell Compiler’s Garbage Calloector made over my period of internship at Microsoft Research in Summer 2006. These notes are in process of constantly being updated as I study the system further. The objective of my work at MSRC is to implement a parallel GC for Haskell – one that will allow multiple threads to simultaneously garbage collect.


Lets dive right into the working of things. GHC has an abstraction called capabilities. A capability is a 1 or more OS threads. They may or maynot have processor affinities on multiprocessor machines. Each capability can run multiple Haskell threads. These Haskell threads are what are known as interpreter threads or green threads or user threads in the terminology of other systems. The OS is not aware of their presence and the switching of context between these threads is controlled purely by the Haskell runtime system. The runtime system, abbreviated as RTS is something we will keep referring to again and again.

Runtime System

The RTS is located in the folder “rts” of the ghc tree. It consists mostly of C code that is compiled into the resulting Haskell executable so that the required runtime services for the executable are packaged in. This is unlike .Net, JVM, Chez Scheme and other systems where the binaries rely heavily on the runtime support provided by their host VMs and thus the binaries cannot be independently deployed onto machines that that don’t have whole or part of the host VM services. Haskell executables are designed to be standalone executable requiring only standard OS services and do not usually require language support binaries.

The tradeoff is in the fact that every Haskell binary has the RTS compiled into it, making Haskell binaries rather large. The RTS consists of facilities like the support of user threads (or Haskell threads), garbage collection etc. We are interested in focusing on Garbage Collection. However before we get into the GC, let us look at how that is connected to the rest of the system.

The Scheduler

Most of the interesting things related to scheduling and multithreading in Haskell center around the function schedule() that is define in Schedule.c. This is the part of schedule that take a thread from the run and decides what to do with it.

static Capability * schedule (Capability *initialCapability, Task *task)

In schedule() is a pretty classical scheduler loop. I have stripped away several parts of the code here to get down to the essentials.

    t = popRunQueue(cap);
    prev_what_next = t->what_next;

    switch (prev_what_next) {
    case ThreadKilled:
    case ThreadComplete:
        /* Thread already finished, return to scheduler. */
        ret = ThreadFinished;
    case ThreadRunGHC:
        StgRegTable *r;
        r = StgRun((StgFunPtr) stg_returnToStackTop, &cap->r);
        cap = regTableToCapability(r);
        ret = r->rRet;
    case ThreadInterpret:
        cap = interpretBCO(cap);
        ret = cap->r.rRet;
        barf("schedule: invalid what_next field");

The scheduler picks up a thread off the run queand decides what to do with it. If it is runnable, then it calles the function StgRun() to run it. At the end of the code block, the variable “ret” is set to indicate why the the thread stopped.

Haskell threads are not time-sliced via a timer (potentially a time rinterrupt) the way OS threads are [cross check if there is some time sliced mechanism]. Instead they are interreupted by certain commonly occuring events. Due to the lazy nature of Haskell thunks need to be created and values need to be computed very often. Hence the execution of a thread entails lots of of memory allocation. One of the ways the execution of a thread is interrupted is when a thread has run out of space in its current block - it then returns control back to the scheduler.

A GHC block is a 4k page that is page aligned for the OS VM system.

Here is what the scheduler does with the "ret" -

    switch (ret) {
    case HeapOverflow:
        ready_to_gc = scheduleHandleHeapOverflow(cap,t);

    case StackOverflow:

    case ThreadYielding:
        if (scheduleHandleYield(cap, t, prev_what_next)) {
            // shortcut for switching between compiler/interpreter:
            goto run_thread; 

    case ThreadBlocked:

    case ThreadFinished:
        if (scheduleHandleThreadFinished(cap, task, t)) return cap;

      barf("schedule: invalid thread return code %d", (int)ret);

The scheduleHandleHeapOverflow(cap,t) call decides to give the thread another block, (or a set of blocks if the thread was asking for allocation of a large object (a large object is one that is larger than a block). If the scheduleHandleHeapOverflow() function feels that there aren't enough free blocks left, it decides to Garbage Collect. This is the point at which everything else stops and the GC kicks in.

Stepping into the GC

The part that we are interested in is the Garbage Collector. The main entry point into the GC is the GarbageCollect() function defined in GC.c.

The existing GC in GHC is a single threaded one. When the RTS detects memory pressure the GC stops all the Haskell threads and then one thread that does the garbage collection and then resumes all the other suspended threads. On a multiprocessor machine such a design is obviously a bottle neck and it is desirable to garbage collect using multiple parallel threads.

GC Data Structures

Blocks and MegaBlocks

The GC allocates memory from the OS in 1 Mb sized chunks, called mega blocks, which it then divides into pages and manages itself. Each 4k page is called a block and is associated with a block descripter (the abbrevaition BD is used a lot). The BDs for all the blocks on the are stroed at the start of the mega block.

The BD keeps information about a block. This the BD defintion from blocks.h in the RTS.

typedef struct bdescr_ {
  StgPtr start;                 /* start addr of memory */
  StgPtr free;                  /* first free byte of memory */
  struct bdescr_ *link;         /* used for chaining blocks together */
  union { 
      struct bdescr_ *back;     /* used (occasionally) for doubly-linked lists*/
      StgWord *bitmap;
  } u;
  unsigned int gen_no;          /* generation */
  struct step_ *step;           /* step */
  StgWord32 blocks;             /* no. of blocks (if grp head, 0 otherwise) */
  StgWord32 flags;              /* block is in to-space */
#if SIZEOF_VOID_P == 8
  StgWord32 _padding[2];
  StgWord32 _padding[0];
} bdescr;

Most of the fields ina BD are self explanatory. Let me add a few quick decriptions however. Each bdescr or BD maintains its start address and a "free" pointer used to indicate the next free location in the block. Each block also knows about what generation it belongs to and has a pointer to the step it belongs to (more about generations and steps later).

If a large object is allocated and a block is a part of a large object, then the first block is has a count of the number of blocks that are part of the object. The link list of blocks making up the object is maintained by the link pointer. [This may not be entirely correct - I will come back to this later].






Motivation for Parallelization

The essential idea is best described here: [refer to the flood paper]

It is helpful to be aware of copy collection and mark-compact collection before you read the above paper. The Richard Jones and Raphael Lins text on Garbage Collection is a recommended resource.

For our garbage collector, we are yet to work out the details of how Gen0 should be compacted. The best idea for later generations is some variant of the work stealing approach proposed in the above paper. This of course might change in the course of the project.

Measurement of Block Distance while Scavenging

Here are some plots of block distance against the collection number and the average block distance and the collection number. Block distance is defined to be the number of links one has to follow from the scan_bd to reach the hp_bd in a step during garbage collection. If the block distance in 2 then it means that there is atleast one independent block in between the pointers that can be taken up by another thread.

The essential idea behind work stealing is that free threads can steal work from busy threads. The work is essentially the work of scavenging live objects. hp_bd points to the top of the to-space where the next free object can go. scan_bd points to the block where the next object to be scanned is. All objects between scan_bd and hp_bd are objects that are yet to be scanned. A free thread essentially steal a block of objects in this range and can scan them, essentially reducing the load of the busy thread.

The following program was used to generate some the graphs below. Changing the treeDepth and the nTrees values below one can get the program to have different memory profiles.

import System

treeDepth = 17
nTrees = 40

makeList 0 d = []
makeList n d = d : (makeList (n-1) d)

main :: IO ()
main = if (recVal (makeList nTrees treeDepth)) < 10
       then print "###"
       else print "##"

data Tree a = L a
	    | B (Tree a) (Tree a)

makeTree 0 = L 1
makeTree n = B (makeTree (n-1)) (makeTree (n-1))

sumTree (L x) = x
sumTree (B t1 t2) = 1 + (sumTree t1) + (sumTree t2)

treeVal n rest = let tr1 = makeTree n in
		     sumTree(tr1) + sumTree(makeTree n) + (recVal rest) + sumTree(tr1)

recVal [] = 0
recVal (x:xs) = treeVal x xs

Here are some plots:

Here is how to interpret the graphs. The label ‘#’ on an axis indicates that it is time where each tick is a garbage collection. The label ‘live_objs’ indicates the total number of live objects encountered during this collection. This is not the total number of live objects in the system but only those in the generations currently collected. The value ‘block_dist’ indicates the maximum block distance encountered during a collection.

The value `avg_block_dist’ indicates the average block distance encountered during a collection. If you think about the block distance a bit you realize that it essentially starts from zero increases and decreases during the duration of scavenging and finally becomes zero when the scan point catches up with the heap pointer. We wanted to measure approximately the area under this region as a indication of the average chance of parallelization. Further to make the measurement a little less fine grained, it was taken only when a new block was allocated to the to-space. This value can be considered indicative of how much parallelization is possible on average during that GC run. [At least I hope so]

Here are similar plots of some programs in the nofib test suite that is available in the GHC source tree.

Plots of real/fulsom (with input 8)

Plots of real/pic (with input 20000)

Plots of real/fem (with fem.stdin)

fem did not do an G1 collections.

Compiling GHC

I think it is useful to have a small section about compiling GHC. I ran into several magic problems getting GHC to build. I don’t fully understand the reasons for some of the fantastic sounding ones, however its worth mentioning them. Some of these are just general discipline guidelines, but are useful to keep in mind.

Problem 1

Make sure that you do actually have the latest versions of everything involved. These include:

  • Darcs (a version control system that GHC is shifting towards)
  • GHC source (get it using Darcs)
  • Alex (the Lexer generator)
  • Happy (the Parser generator)

Since the compilation of Haskell is very Unix styled, on Windows (I work using a Win XP machine), one needs to add several unix tools to windows. I don’t know why GHC building doesn’t target SFU on windows yet – maybe that’s something that they just haven’t got down to doing yet. Here are the unix-ish components that you need:

  • MinGW (The compiler set, please get the latest – they have a downloading installer available somewhere, I used that one. I had several magic problems with many packaged binaries that I found on the net).
  • MSYS (These maybe downloaded as binaries from the Msys site).

The GHC compiler notes explains all of this rather well.

Problem 2

Make sure your path is right. Make sure you get the right versions of things in your path. As an example, I have SFU binaries in my path before I had the MSYS ones they apparently don’t like each other too much.

Problem 3

Watch out for what files you edit. There are some files in the build process are parsed by a simplistic C preprocessor and will not understand C++ style comments. Here are a list of these files:

[get file list]

Problem 4

This is a classical magic problem. I got GHC to build on my desktop but it would not my laptop. As a matter of fact when I type in “autoreconf” the machine freezes up after about 30 seconds. Very puzzling.

This behavior would often end with power cycling the laptop and was rather frustrating for a while. After studying the process with procexp, filemon and some standard monitoring tools it seemed like the “Logitech LVPrcSrv module” related to my Logitech webcam seems to hog the CPU. Since I wasn’t using the webcam, I killed the process and the build went through fine. At this point I can’t guess at what the relationship is or why there should be one.

Roshan James (rpjames [at] cs [dot] indiana [dot] edu)