Version 18 (modified by guest, 8 years ago) (diff)


The GHC Garbage Collector Notes

These are my notes of the Glasgow Haskell Compiler’s Garbage Calloector made over my period of internship at Microsoft Research in Summer 2006. These notes are in process of constantly being updated as I study the system further. The objective of my work at MSRC is to implement a parallel GC for Haskell – one that will allow multiple threads to simultaneously garbage collect.


Lets dive right into the working of things. GHC has an abstraction called capabilities. A capability is a 1 or more OS threads. They may or maynot have processor affinities on multiprocessor machines. Each capability can run multiple Haskell threads. These Haskell threads are what are known as interpreter threads or green threads or user threads in the terminology of other systems. The OS is not aware of their presence and the switching of context between these threads is controlled purely by the Haskell runtime system. The runtime system, abbreviated as RTS is something we will keep referring to again and again.

Runtime System

The RTS is located in the folder “rts” of the ghc tree. It consists mostly of C code that is compiled into the resulting Haskell executable so that the required runtime services for the executable are packaged in. This is unlike .Net, JVM, Chez Scheme and other systems where the binaries rely heavily on the runtime support provided by their host VMs and thus the binaries cannot be independently deployed onto machines that that don’t have whole or part of the host VM services. Haskell executables are designed to be standalone executable requiring only standard OS services and do not usually require language support binaries.

The tradeoff is in the fact that every Haskell binary has the RTS compiled into it, making Haskell binaries rather large. The RTS consists of facilities like the support of user threads (or Haskell threads), garbage collection etc. We are interested in focusing on Garbage Collection. However before we get into the GC, let us look at how that is connected to the rest of the system.

The Scheduler

Most of the interesting things related to scheduling and multithreading in Haskell center around the function schedule() that is define in Schedule.c. This is the part of schedule that take a thread from the run and decides what to do with it.

static Capability * schedule (Capability *initialCapability, Task *task)

In schedule() is a pretty classical scheduler loop. I have stripped away several parts of the code here to get down to the essentials.

    t = popRunQueue(cap);
    prev_what_next = t->what_next;

    switch (prev_what_next) {
    case ThreadKilled:
    case ThreadComplete:
        /* Thread already finished, return to scheduler. */
        ret = ThreadFinished;
    case ThreadRunGHC:
        StgRegTable *r;
        r = StgRun((StgFunPtr) stg_returnToStackTop, &cap->r);
        cap = regTableToCapability(r);
        ret = r->rRet;
    case ThreadInterpret:
        cap = interpretBCO(cap);
        ret = cap->r.rRet;
        barf("schedule: invalid what_next field");

The scheduler picks up a thread off the run queand decides what to do with it. If it is runnable, then it calles the function StgRun?() to run it. At the end of the code block, the variable “ret” is set to indicate why the the thread stopped.

Haskell threads are not time-sliced via a timer (potentially a time rinterrupt) the way OS threads are [cross check if there is some time sliced mechanism]. Instead they are interreupted by certain commonly occuring events. Due to the lazy nature of Haskell thunks need to be created and values need to be computed very often. Hence the execution of a thread entails lots of of memory allocation. One of the ways the execution of a thread is interrupted is when a thread has run out of space in its current block - it then returns control back to the scheduler.

I stand corrected about the above - We do have a time-slice mechanism: the timer interrupt (see Timer.c) sets the context_switch flag, which causes the running thread to return to the scheduler the next time a heap check fails (at the end of the current nursery block). When a heap check fails, the thread doesn't necessarily always return to the scheduler: as long as the context_switch flag isn't set, and there is another block in the nursery, it resets Hp and HpLim? to point to the new block, and continues.

A GHC block is a 4k page that is page aligned for the OS VM system.

Here is what the scheduler does with the "ret" -

    switch (ret) {
    case HeapOverflow:
        ready_to_gc = scheduleHandleHeapOverflow(cap,t);

    case StackOverflow:

    case ThreadYielding:
        if (scheduleHandleYield(cap, t, prev_what_next)) {
            // shortcut for switching between compiler/interpreter:
            goto run_thread; 

    case ThreadBlocked:

    case ThreadFinished:
        if (scheduleHandleThreadFinished(cap, task, t)) return cap;

      barf("schedule: invalid thread return code %d", (int)ret);

The scheduleHandleHeapOverflow(cap,t) call decides to give the thread another block, (or a set of blocks if the thread was asking for allocation of a large object (a large object is one that is larger than a block). If the scheduleHandleHeapOverflow() function feels that there aren't enough free blocks left, it decides to Garbage Collect. This is the point at which everything else stops and the GC kicks in.

Stepping into the GC

The part that we are interested in is the Garbage Collector. The main entry point into the GC is the GarbageCollect?() function defined in GC.c.

The existing GC in GHC is a single threaded one. When the RTS detects memory pressure the GC stops all the Haskell threads and then one thread that does the garbage collection and then resumes all the other suspended threads. On a multiprocessor machine such a design is obviously a bottle neck and it is desirable to garbage collect using multiple parallel threads.

GC Data Structures

Blocks and Mega Blocks

The GC allocates memory from the OS in 1 Mb sized chunks, called mega blocks, which it then divides into pages and manages itself. Each 4k page is called a block and is associated with a block descripter (the abbrevaition BD is used a lot). The BDs for all the blocks on the are stroed at the start of the mega block.

The BD keeps information about a block. This the BD defintion from blocks.h in the RTS.

typedef struct bdescr_ {
  StgPtr start;                 /* start addr of memory */
  StgPtr free;                  /* first free byte of memory */
  struct bdescr_ *link;         /* used for chaining blocks together */
  union { 
      struct bdescr_ *back;     /* used (occasionally) for doubly-linked lists*/
      StgWord *bitmap;
  } u;
  unsigned int gen_no;          /* generation */
  struct step_ *step;           /* step */
  StgWord32 blocks;             /* no. of blocks (if grp head, 0 otherwise) */
  StgWord32 flags;              /* block is in to-space */
#if SIZEOF_VOID_P == 8
  StgWord32 _padding[2];
  StgWord32 _padding[0];
} bdescr;

Most of the fields ina BD are self explanatory. Let me add a few quick decriptions however. Each bdescr or BD maintains its start address and a "free" pointer used to indicate the next free location in the block. Each block also knows about what generation it belongs to and has a pointer to the step it belongs to (more about generations and steps later).

If a large object is allocated and a block is a part of a large object, then the first block is has a count of the number of blocks that are part of the object. The link list of blocks making up the object is maintained by the link pointer. [This may not be entirely correct - I will come back to this later].


The GHC GC is a generational collector. The number of generations is set to 2 by default and they are referred to as gen 0 and gen 1. gen 0 has never objects, objects that survive collections in gen 0 are promoted to gen 1. Older objects reside in gen 1.

Command Line Switches

The number of generations in an exceution of a compiled haskell program can be changed by by using the command line switch -G<n>, where n is the number of generations. This is an RTS switch and so has to be used as follows -

main.exe +RTS -G5 -RTS 

Where main.exe is the compiled program and we want it to have 5 generations. More about the RTS switches can be found here Runtime Control.

On the topic of command line switches, for a compiler hacker this is also interesting - Debugging the compiler


GHC generations are divided into steps. The number of steps per generations is a configurable value as well. The last generation, the oldest one has only one step. By default GHC compiled programs have 2 steps per generation. Since GHC usually has only 2 generations, it has 2 steps in gen 0 and 1 step in gen 1. Steps of a generation are referred to by their index number starting from 0.

Garbage collection happens at the level of generations. So when you say that you are collecting gen 0, you are essentially collecting all the steps in gen 0. If you are collecting gen 4, then you are collecting all the steps in gen 0 to gen 4. Objects that survive a collection are promoted into the next higher step, or into the next higher generation if the object is already in the highest step of its generation.

The reasoning behing having steps is approximately this - if there were no steps and a surviving a gen0 collection meant automativ promotion to gen1, then it means that at the time when gen0 happened several very new objects (which are potentially shortlived as well) get promoted to gen1. gen1 collections are expensive since higher generations are larger and hence these objects will reside in memory a long time before they get collected. On the other had if gen0 were divided into two steps then a gen0step0 object hops to gen0setp1 on its first survival and needs to survive yet another GC before it is promoted to gen1. [This I think is a very neat idea. My current understanding is that the .Net GC does have steps. Is this correct?]

That said, lets look at the data structures. This is what a generation looks like -

typedef struct generation_ {
  unsigned int   no;			/* generation number */
  step *         steps;			/* steps */
  unsigned int   n_steps;		/* number of steps */
  unsigned int   max_blocks;		/* max blocks in step 0 */
  bdescr        *mut_list;      	/* mut objects in this gen (not G0)*/

  /* temporary use during GC: */
  bdescr        *saved_mut_list;

  /* stats information */
  unsigned int collections;
  unsigned int failed_promotions;
} generation;

This is what a step looks like -

typedef struct step_ {
  unsigned int         no;		/* step number */
  bdescr *             blocks;		/* blocks in this step */
  unsigned int         n_blocks;	/* number of blocks */
  struct step_ *       to;		/* destination step for live objects */
  struct generation_ * gen;		/* generation this step belongs to */
  unsigned int         gen_no;          /* generation number (cached) */
  bdescr *             large_objects;	/* large objects (doubly linked) */
  unsigned int         n_large_blocks;  /* no. of blocks used by large objs */
  int                  is_compacted;	/* compact this step? (old gen only) */

  /* During GC, if we are collecting this step, blocks and n_blocks
   * are copied into the following two fields.  After GC, these blocks
   * are freed. */
  bdescr *     old_blocks;	        /* bdescr of first from-space block */
  unsigned int n_old_blocks;		/* number of blocks in from-space */

  /* temporary use during GC: */
  StgPtr       hp;			/* next free locn in to-space */
  StgPtr       hpLim;			/* end of current to-space block */
  bdescr *     hp_bd;			/* bdescr of current to-space block */
  StgPtr       scavd_hp;		/* ... same as above, but already */
  StgPtr       scavd_hpLim;		/*     scavenged.  */
  bdescr *     scan_bd;			/* block currently being scanned */
  StgPtr       scan;			/* scan pointer in current block */
  bdescr *     new_large_objects;    	/* large objects collected so far */
  bdescr *     scavenged_large_objects; /* live large objs after GC (d-link) */
  unsigned int n_scavenged_large_blocks;/* size of above */
  bdescr *     bitmap;  		/* bitmap for compacting collection */
} step;

These definitions can be found in rts\storage.h.

Here is a little digram of the data structure formed by the generations and steps. The global variable 'generations' is a an array of pointers to generations. Each generation has 'steps' as a pointer array to its steps.






Roshan James (rpjames [at] cs [dot] indiana [dot] edu)