The Register Allocator
The register allocator is responsible for assigning real/hardware regs (hregs) to each of the virtual regs (vregs) present in the code emitted by the native code generator. It also inserts spill/reload instructions to save vregs to the stack in situations where not enough hregs are available.
GHC currently provides three register allocation algorithms, one which does simple linear scan and two version of graph coloring. Support for linear scan is likely to be removed in a subequent version.
- Linear scan
The linear allocator is turned on by default. This is what you get when you compile with
-fasm. The linear allocator does a single pass through the code, allocating registers on a first-come-first-served basis. It is quick, and does a reasonable job for code with little register pressure.
This algorithm has no look-ahead. If say, a particular hreg will be clobbered by a function call, it does not know to avoid allocating to it in the code before the call, and subsequently inserts more spill/reload instructions than strictly needed.
- Graph coloring (enabled with
The graph coloring algorithm operates on the code for a whole function at a time. From each function it extracts a register conflict graph which has a node for every vreg and an edge between two vregs if they are in use at the same time and thus cannot share the same hreg. The algorithm tries to assign hregs (imagined as colors) to the nodes so that no two adjacent nodes share the same color, if it can't then it inserts spill code, rebuilds the graph and tries again.
Graph coloring tends to do better than the linear allocator because the conflict graph helps it avoid the look-ahead problem. The coloring allocator also tries harder to allocate the source and destination of reg-to-reg move instructions to the same hreg. This is done by coalescing (merging) move-related nodes. If this succeeds then the associated moves can be erased.
- Graph coloring with iterative coalescing (enabled with
Iterative coalescing is an improvement over regular graph coloring whereby coalescing passes are interleaved with coloring passes. Iterative coalescing does a better job than regular graph coloring, but is slower because it must alternate between the coloring and coalescing of nodes.
For an outline of the code see Commentary/Compiler/Backends/NCG/RegisterAllocator/Code
If you decide to do some hacking on the register allocator then take a look at (at least) these papers first:
Iterated Register Coalescing
George, Appel, 1996
Decribes the core graph coloring algorithm used.
A Generalised Algorithm for Graph-Coloring Register Allocation
Smith, Ramsey, Holloway, 2004
For a decription of how to deal with overlapping register sets, which aren't fully implemented. Explains what the
triv functions are for.
Design and Implementation of a Graph Coloring Register Allocator for GCC
For an overview of techniques for inserting spill code.
Register pressure in Haskell code
Present GHC compiled code places very little pressure on the register set. Even on x86 with only 3 allocable registers, most modules do not need spill/reloads. This is a mixed blessing - on one hand the conflict graphs are small so we can avoid performance problems related to how the graph is represented, on the other hand it can be hard to find code to test against. Register pressure is expected to increase as the Stg->Cmm transform improves.
In the meantime, here are some good sources for test code:
Only a few nofib benchmarks create spills with
-O2, two are
- Turn on profiling
Register pressure increases significantly when the module is compiled with profiling. checkSpills.report gives tuples of
(spills, reloads, reg-reg-moves)present in output code generated by the three algorithms when compiled with
-O2 -prof. Left to right are the stats for the linear, graph coloring and iterative coalescing algorithms. Note that most modules compile with no spill/reloads inserted, but a few (notably
real/compress2/Encode) need several hundred.
I've found it useful to maintain three darcs repos when working on the allocator.
-Onotfor fast compilation during hacking,
ghc-HEAD-proffor testing with profiling turned on, and
ghc-HEAD-validatefor running the validate script. Patches are created in
work, pushed into
checkSpillsis used to compile the nofib benchmarks with the most register pressure. Once we're happy that the performance is ok, the patch is then pushed into
validatefor validation before pushing to the main repo on
- SHA from darcs
SHA1.lhsmodule from the darcs source, compiled with
-O2creates the most register pressure out of any Haskell code that I'm aware of. When compiling SHA1, GHC inlines several worker functions and the native code block that computes the hash ends up being around 1700 instructions long. vregs that live in the middle of the block have in the order of 30 conflict neighbors. (evidently, the conflict graph is too large for most of the graphviz layout algorithms to cope with)
For these reasons,
SHA1.lhscan be treated as a good worst-case input to the allocator. In fact, the current linear allocator cannot compile it with
-O2 -profon x86 as it runs out of stack slots, which are allocated from a static pool. Make sure to test any changes to the allocator against this module.
- Turn on
Breaking the allocator can result in compiled programs crashing randomly (if you're lucky) or producing the wrong output. Make sure to always turn on
-fasm-lint. Doing this makes the allocator call
GraphOps.validateGraphafter every spill/color stage.
validateGraphchecks that all the edges point to valid nodes, that no conflicting nodes have the same color, and if the graph is supposed to be colored then all nodes are really colored.
- Some useful dump flags
Shows the code and conflict graph after ever spill/color stage. Also shows spill costs, and what registers were coalesced.
Gives statistics about how many spills/reloads/reg-reg-moves are in the output program.
Gives the final output code.
Diverts dump output to files. This can be used to get dumps from each module in a nofib benchmark.
cd nofib/real/anna make EXTRA_HC_OPTS="-O2 -fregs-iterative -ddump-to-file -ddump-asm-regalloc-stages"
- Visualisation of conflict graphs
Graphviz, available from http://www.graphviz.org can be used to make nice visualisations of the register conflict graphs. Use
-ddump-asm-regalloc-stages, and copy one of the graph descriptions into a new file
circo -Tpng niceGraph.dot -o niceGraph.png
Here's two from
checkSpills.hs is a nasty, throw away script which can be used to automate the comparison of allocation algorithms. Copy it and a list of test like checkSpills.tests to the top level nofib directory, compile and run. It will build the nofib benchmarks in the list 6 times each, once each with each of the allocators to extract spill counts, and then once again to get compile timings which are unperterbed by the space leaks introduced by compiling with debugging turned on. It's only needed if you're hacking on the allocator, parses the nofib make output directly, and is likely to rot - which is why it isn't included in the main source tree.
Runtime performance of the graph coloring allocator is proportional to the size of the conflict graph and the number of build/spill cycles needed to obtain a coloring. Most functions have graphs < 100 nodes and generate no spills, so register allocation is a small fraction of overall compile time.
These are some ideas for improving the current allocator, most potentially useful first.
- Work lists for iterative coalescing.
The iterative coalescing alternates between scanning the graph for trivially colorable (triv) nodes and perforing coalescing. When two nodes are coalesced, other nodes that are not adjacent to the coalesced nodes do not change and do not need to be rescanned straight away. Runtime performance of the iterative coalescer could probably be improved by keeping a work-list of "nodes that might have become trivially colorable", to help find nodes that won't have changed.
- Improve spill code generator/cleaner.
When spilling a particular vreg, the current spill code generator simply inserts a spill after each def and a reload before each use. This quickly reduces the density of conflicts in the graph, but produces inefficient code because more spill/reloads are inserted than strictly nessesary. Good code is recovered by the spill cleaner which runs after allocation and removes spill/reload instructions that aren't nessesary. Some things to try:
- Spill coalescing
No attempt is currently made to share spill slots between different vregs. Each named vreg is spilled to its own static spill slot on the C stack. The amount of stack space needed could be reduced by sharing spill slots between vregs so long as their live ranges do not overlap.
- Try to split live ranges before spilling
If a live range has several use/defs then we could insert fresh reg-reg moves to break it up into several smaller live ranges. We then might get away with spilling just one section instead of the whole range. Not sure if this would be a win over the current situation. We would need spill-coalescing to be implemented before this so that we don't require an extra slot for each new live range.
As the spill cleaner walks through the code it builds a mapping of which slots and registers hold the same value. On each reload instruction, if the slot and reg are known to already have the same value then the reload can be erased. This mapping could be extended with constants, so that if a vreg holding a constant value cannot be allocated a hreg, the constant value can be rematerialized instead of being spilled/reloaded to a stack slot.
- Spill coalescing
- Revisit choosing of spill candidates
If the graph cannot be colored then a node/vreg must be chosen to be potentially spilled. Chaitin's forumula says to calculate the spill cost by adding up the number of uses and defs of that vreg and divide by the degree of the node. In the code that I've tested against, it's been better to just choose the live range that lives the longest. Perhaps this is because the 'real' spill cost would depend on the spills/reloads actually inserted, not a simple count of use/defs. Perhaps choosing the longest live range is just better for the particular kind of code that GHC generates.
- Revisit trivColorable / aliasing of register sets
For the architectures currently supported, x86, x86_64 and ppc, the native code generator currently emits code using only two register classes
RcDouble. As these classes are disjoint (ie, none of the regs from one class alias with with regs from another), checking whether a node of a certain class is trivially colorable reduces to counting up the number of neighbours of that class.
If the NCG starts to use aliasing register classes eg: both 32bit
RcFloats and 64bit
RcDoubles on sparc; combinations of 8, 16, and 32 bit integers on x86 / x86_x6 or usage of sse / altivec regs in different modes, then this can be supported via the method described in [Smith et al]. The allocator was designed with this in mind - ie, by passing a function to test if a node is trivially colorable as a parameter to the coloring function - and there is already a description of the register set for x86 in compiler/nativeGen/RegArchX86.hs, but the native code generator doesn't currently emit code to test it against.
- checkSpills.report (49.1 KB) - added by 10 years ago.
checkSpills.tests (1.5 KB) - added by 10 years ago.
list of working nofib benchmarks to run with checkSpills.hs
checkSpills.hs (10.2 KB) - added by 10 years ago.
script to help test allocator performance over nofib suite
graph-colored.png (195.5 KB) - added by 10 years ago.
example colored register conflict graph
graph.png (81.5 KB) - added by 10 years ago.
example colored register conflict graph
graph.dot (4.4 KB) - added by 10 years ago.
graphviz source for graph.png
graph-colored.dot (3.6 KB) - added by 10 years ago.
graphviz source for graph-colored.png
Download all attachments as: .zip