Version 8 (modified by simonmar, 5 years ago)

NOTE: this page relates to the new GHC build system, due to be added to GHC in April 2009. For documentation of the "old" build system, see Building/Old/Using and Building/BuildSystem.

The GHC Build System Architecture

This section contains information you need to know in order to understand and modify the GHC build system. The build system is non-standard in various ways (to be explained shortly), and is decidedly non-trivial: do not attempt to modify it without having a grasp of the concepts that follow!

It's difficult to document a system that is full of details and subject to constant change. The approach we've adopted here is to split the documentation in two:

  • The high-level architectural design, the stuff that is less likely to change, is documented here. Occasionally we'll include direct links to source files to illustrate the details.
  • The low-level technical details, such as the order of arguments to a macro, and the names of specific variables, are documented as comments in the build-system code. Hopefully this way the documentation is more likely to stay up to date.

Historical note: this is the third major revision of the GHC build system. The first incarnation was based on "jmake", a derivative of X11's "imake", which is based on using the C preprocessor to add macro capabilities and #include to plain make. The second incarnation used GNU make's extensions for including makefiles (but lost the ability to use macros, since at the time GNU make didn't have support for general macros). In this third revision, we use even more of GNU make's extensions, and we make a fundamental change to the design, as described in the next section.

Overall structure and important files

ghc.mk
This is where you should start reading: ghc.mk is the main file in the build system which ties together all the other build-system files. It uses make's include directive to include all the files in mk/*.mk, rules/*.mk, and all the other ghc.mk files elsewhere in the tree.

Makefile
The top-level Makefile recursively invokes make on ghc.mk according to the phase ordering idiom.
rules/*.mk
Each .mk file in the rules directory corresponds to a single macro that can be called using make's $(call ...) expression. For example, the build-package macro is in rules/build-package.mk.
config.mk.in
The configuration information for the build system, processed by configure to produce mk/config.mk. Settings can be overridden by creating a local file mk/build.mk (see Build configuration).
compiler/ghc.mk, rts/ghc.mk, etc.
Most subdirectories of the source tree have a ghc.mk file which contains the instructions for building the components in that directory. Note: these ghc.mk files cannot be invoked individually; they should only be included by the top-level ghc.mk.

Idioms

Each of the following subsections describes one of the idioms that we use in the build system. There are a handful of such idioms, and when you've understood them all you'll be able to understand most of the code you'll find in the build system. We'll describe the idioms first, and then get on to the specifics of how we build GHC.

Idiom: non-recursive make

Build systems for large projects often use the technique commonly known as "recursive make", where there is a separate Makefile in each directory that is capable of building that part of the system. The Makefiles may share some common infrastructure and configuration by using GNU make's include directive; this is exactly what the previous GHC build system did. However, this design has a number of flaws, as described in Peter Miller's Recursive Make Considered Harmful.

The GHC build system adopts the non-recursive make idiom. That is, apart from the exceptions noted below (the top-level Makefile and the stub Makefiles), we never invoke make from inside a Makefile, and the whole build system is effectively a single giant Makefile.

This gives us the following advantages:

  • Specifying dependencies between different parts of the tree is easy. In this way, we can accurately specify many dependencies that we could not in the old recursive-make system. This makes it much more likely that when you say "make" after modifying parts of the tree or pulling new patches, the build system will bring everything up-to-date in the correct order, and leave you with a working system.
  • More parallelism: dependencies are more fine-grained, and there is no need to build separate parts of the system in sequence, so the overall effect is that we have more parallelism in the build.

Doesn't this sacrifice modularity? No - we can still split the build system into separate files, using GNU make's include.

Specific notes related to this idiom:

  • Individual directories usually have a ghc.mk file which contains the build instructions for that directory.
  • Other parts of the build system are in mk/*.mk and rules/*.mk.
  • The top-level ghc.mk file includes all the other *.mk files in the tree. The top-level Makefile invokes make on ghc.mk (this is the only recursive invocation of make; see the "phase ordering" idiom below).
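
As a rough illustration of what a single make instance buys us, the fragment below sketches a top-level file that includes the per-directory files and then states a dependency that crosses directories. It is not the real ghc.mk: the include list is abbreviated, and both file paths in the final rule are invented for the example.

```make
# Sketch of a top-level ghc.mk (abbreviated; not the real file).
include mk/config.mk
include rules/*.mk
include compiler/ghc.mk
include rts/ghc.mk

# With everything in one make instance, a rule may name files from
# different directories. Both paths here are hypothetical.
compiler/stage1/build/example.o : rts/dist/build/example.h
```

In a recursive-make design the two directories would be handled by separate make processes, and a dependency like the last line could not be written down directly.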

Idiom: stub makefiles

It's all very well having a single giant Makefile that knows how to build everything in the right order, but sometimes you want to build just part of the system. When working on GHC itself, we might want to build just the compiler, for example. In the recursive make system we would do cd ghc and then make. In the non-recursive system we can still achieve this by specifying the target with something like `make ghc/stage1/build/ghc`, but that's not so convenient.

Our second idiom therefore supports the cd ghc; make idiom, just as with recursive make. To achieve this we put a tiny stub Makefile in each directory whose job it is to invoke the main Makefile, specifying the appropriate target(s) for that directory. These stub Makefiles follow a simple pattern:

dir = libraries/base
TOP = ../..
include $(TOP)/mk/sub-makefile.mk

where mk/sub-makefile.mk knows how to recursively invoke the giant top-level make.
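
To make the mechanism concrete, here is a minimal sketch of what a file playing the role of mk/sub-makefile.mk could look like. This is an assumption for illustration, not the contents of the real file: it supposes that every directory's ghc.mk defines targets named all_dir and clean_dir, and it forwards only those two goals.

```make
# Hypothetical sketch; the real mk/sub-makefile.mk differs in detail.
# The including stub Makefile is expected to have set:
#   dir = path of this directory relative to the top of the tree
#   TOP = relative path from this directory back to the top

.PHONY: default all clean
default : all

# Forward the goal to the top-level Makefile as <goal>_<dir>,
# e.g. "make all" in libraries/base becomes "make all_libraries/base".
all clean :
	$(MAKE) -C $(TOP) $@_$(dir)
```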

Idiom: standard targets (all, clean, etc.)

We want an all target that builds everything, but we also want a way to build individual components (say, everything in rts/). This is achieved by having a separate "all" target for each directory, named all_directory. For example in rts/ghc.mk we might have this:

all : all_rts
.PHONY: all_rts
all_rts : ...dependencies...

When the top level make includes all these ghc.mk files, it will see that target all depends on all_rts, all_ghc, ...etc...; so make all will make all of these. But the individual targets are still available. In particular, you can say

  • make all_rts (anywhere) to build everything in the RTS directory
  • make all (anywhere) to build everything
  • make, with no explicit target, makes the default target in the current directory's stub Makefile, which in turn makes the target all_dir, where dir is the current directory.

Other standard targets such as clean, install, and so on use the same technique. There are pre-canned macros to define your "all" and "clean" targets; take a look in rules/all-target.mk and rules/clean-target.mk.
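
The per-directory "all" pattern is easy to capture in a macro (macros are described under "Idiom: macros"). The helper below is a hypothetical sketch written for this page; it is not the contents of rules/all-target.mk, and the library path in the example call is made up.

```make
# Hypothetical helper, not the real rules/all-target.mk.
# $1 = directory
# $2 = the files that "all" in that directory should build
define sketch-all-target
all : all_$1
.PHONY: all_$1
all_$1 : $2
endef

# Illustrative use for rts/ (the path is invented):
$(eval $(call sketch-all-target,rts,rts/dist/build/libexample.a))
```

After this call, make all_rts builds just the listed file, and make all builds it along with every other directory's all_ target.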

Idiom: stages

What do we use to compile GHC? GHC itself, of course. In a complete build we actually build GHC twice: once using the GHC version that is installed, and then again using the GHC we just built. To be clear about which GHC we are talking about, we number them:

  • Stage 0 is the GHC you have installed. The "GHC you have installed" is also called "the bootstrap compiler".
  • Stage 1 is the first GHC we build, using stage 0. Stage 1 is then used to build the packages.
  • Stage 2 is the second GHC we build, using stage 1. This is the one we normally install when you say make install.
  • Stage 3 is optional, but is sometimes built to test stage 2.

Stage 1 does not support interactive execution (GHCi) or Template Haskell. The reason is that when running byte code we must dynamically link the packages, and only in stage 2 and later can we guarantee that the packages we dynamically link are compatible with those that GHC was built against (because they are the very same packages).

Idiom: distdir

Often we want to build a component multiple times in different ways. For example:

  • certain libraries (e.g. Cabal) are required by GHC, so we build them once with the bootstrapping compiler, and again with stage 1 once that is built.
  • GHC itself is built multiple times (stage 1, stage 2, maybe stage 3)
  • some tools (e.g. ghc-pkg) are also built once with the bootstrapping compiler, and then again using stage 1 later.

In order to support multiple builds in a directory, we place all generated files in a subdirectory, called the "distdir". The distdir can be anything at all; for example in compiler/ we name our distdirs after the stage (stage1, stage2 etc.). When there is only a single build in a directory, by convention we usually call the distdir simply "dist".

There is a related concept called ways, which includes profiling and dynamic-linking. Multiple ways are currently part of the same "build" and use the same distdir, but in the future we might unify these concepts and give each way its own distdir.

Idiom: interaction with Cabal

Many of the components of the GHC build system are also Cabal packages, with package metadata defined in a foo.cabal file. For the GHC build system we need to extract that metadata and use it to build the package. This is done by the program ghc-cabal (in utils/ghc-cabal in the GHC source tree). This program reads foo.cabal and produces package-data.mk containing the package metadata in the form of makefile bindings that we can use directly.

We adhere to the following rule: ghc-cabal generates only makefile variable bindings, such as

  HS_SRCS = Foo.hs Bar.hs

ghc-cabal never generates makefile rules, macros, macro invocations etc. All the makefile code is therefore contained in fixed, editable .mk files.

Idiom: variable names

Now that our build system is one giant Makefile, all our variables share the same namespace. Where previously we might have had a variable that contained a list of the Haskell source files called HS_SRCS, now we have one of these for each directory (and indeed each build, or distdir) in the source tree, so we have to give them all different names.

The idiom that we use for distinguishing variable names is to prepend the directory name and the distdir to the variable. So for example the list of Haskell sources in the directory utils/hsc2hs would be in the variable utils/hsc2hs_dist_HS_SRCS (make doesn't mind slashes in variable names). The pattern is: directory_distdir_variable.
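
A small self-contained sketch of the naming pattern follows. The source file names are invented, and the second directory is chosen only to show that two HS_SRCS lists can coexist in one namespace.

```make
# Illustrative bindings; the file lists are made up.
utils/hsc2hs_dist_HS_SRCS   = Main.hs
libraries/base_dist_HS_SRCS = Foo.hs Bar.hs

# Each directory/distdir pair has its own variable, so neither
# assignment overwrites the other.
$(info $(utils/hsc2hs_dist_HS_SRCS))
$(info $(libraries/base_dist_HS_SRCS))

all : ;
```

Running make on this file prints Main.hs and then Foo.hs Bar.hs while the makefile is being read.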

Idiom: macros

The build system makes extensive use of GNU make macros. A macro is defined in GNU make using define, e.g.

define build-package
# args: $1 = directory, $2 = distdir
... makefile code to build a package ...
endef

(for example, see rules/build-package.mk), and is invoked like this:

$(eval $(call build-package,libraries/base,dist))

(this invocation would be in libraries/base/ghc.mk).

Note that eval works like this: its argument is expanded as normal, and then the result is interpreted by make as makefile code. This means the body of the define gets expanded twice. Typically this means we need to use $$ instead of $ everywhere in the body of define.

Now, the build-package macro may need to define local variables. There is no support for local variables in macros, but we can define variables which are guaranteed to not clash with other variables by preceding their names with a string that is unique to this macro call. A convenient unique string to use is directory_distdir_; this is unique as long as we only call each macro with a given directory/build pair once. Most macros in the GHC build system take the directory and build as the first two arguments for exactly this reason. For example, here's an excerpt from the build-prog macro:

define build-prog
# $1 = dir
# $2 = distdir
# $3 = GHC stage to use (0 == bootstrapping compiler)

$1_$2_INPLACE = $$(INPLACE_BIN)/$$($1_$2_PROG)
...

So if build-prog is called with utils/hsc2hs and dist for the first two arguments, after expansion make would see this:

utils/hsc2hs_dist_INPLACE = $(INPLACE_BIN)/$(utils/hsc2hs_dist_PROG)

The idiom of $$($1_$2_VAR) is very common throughout the build system - get used to reading it! Note that the only time we use a single $ in the body of define is to refer to the parameters $1, $2, and so on.
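
The double expansion can be tried in isolation with the following self-contained file. The macro name show-prog and the values it assigns are invented for this demonstration; only the $1_$2_ naming pattern comes from the build system.

```make
# Stand-alone demo of $$ inside define; all names are invented.
define show-prog
# $1 = dir, $2 = distdir
$1_$2_PROG = hsc2hs
$1_$2_INPLACE = inplace/bin/$$($1_$2_PROG)
endef

# call substitutes $1 and $2 and turns $$ into $;
# eval then reads the result as ordinary makefile code.
$(eval $(call show-prog,utils/hsc2hs,dist))

all :
	@echo $(utils/hsc2hs_dist_INPLACE)
```

Running make prints inplace/bin/hsc2hs. Had the body used a single $ in front of ($1_$2_PROG), the reference would have been expanded during call, before the PROG variable existed, and INPLACE would have ended up as just inplace/bin/.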

Idiom: phase ordering

NB. you need to understand this section if either (a) you are modifying parts of the build system that include automatically-generated Makefile code, or (b) you need to understand why we have a top-level Makefile that recursively invokes make.

The main hitch with non-recursive make arises when parts of the build system are automatically-generated. The automatically-generated parts of our build system fall into two main categories:

  • Dependencies: we use ghc -M to generate make-dependencies for Haskell source files, and similarly gcc -M to do the same for C files. The dependencies are normally generated into a file .depend, which is included as normal.
  • Makefile bindings generated from .cabal package descriptions. See "Idiom: interaction with Cabal".

Now, we also want to be able to use make to build these files, since they have complex dependencies themselves. For example, in order to build package-data.mk we need to first build ghc-cabal etc.; similarly, a .depend file needs to be re-generated if any of the source files have changed.

GNU make has a clever strategy for handling this kind of scenario. It first reads all the included Makefiles, and then tries to build each one if it is out-of-date, using the rules in the Makefiles themselves. When it has brought all the included Makefiles up-to-date, it restarts itself to read the newly-generated Makefiles.

This works fine, unless there are dependencies between the Makefiles. For example in the GHC build, the .depend file for a package cannot be generated until package-data.mk has been generated and make has been restarted to read in its contents, because it is the package-data.mk file that tells us which modules are in the package. But make always makes all the included Makefiles before restarting - it doesn't know how to restart itself earlier when there is a dependency between included Makefiles.

Consider the following Makefile, saved as fail.mk:

all :

include inc1.mk

inc1.mk : fail.mk
	echo "X = C" >$@

include inc2.mk

inc2.mk : inc1.mk
	echo "Y = $(X)" >$@

Now try it:

$ make -f fail.mk
fail.mk:3: inc1.mk: No such file or directory
fail.mk:8: inc2.mk: No such file or directory
echo "X = C" >inc1.mk
echo "Y = " >inc2.mk
make: Nothing to be done for `all'.

make built both inc1.mk and inc2.mk without restarting itself between the two (even though we added a dependency on inc1.mk from inc2.mk).

The solution we adopt in the GHC build system is as follows. We have two Makefiles, the first a wrapper around the second.

# top-level Makefile
% :
	$(MAKE) -f inc.mk PHASE=0 just-makefiles
	$(MAKE) -f inc.mk $@

# inc.mk

include inc1.mk

ifeq "$(PHASE)" "0"

inc1.mk : inc.mk
	echo "X = C" >$@

else

include inc2.mk

inc2.mk : inc1.mk
	echo "Y = $(X)" >$@

endif

just-makefiles:
	@: # do nothing

clean :
	rm -f inc1.mk inc2.mk

Each time make is invoked, we recursively invoke make in several phases:

  • Phase 0: invoke inc.mk with PHASE=0. This brings inc1.mk up-to-date (and only inc1.mk).
  • Final phase: invoke inc.mk again (with PHASE unset). Now we can be sure that inc1.mk is up-to-date and proceed to generate inc2.mk. If this changes inc2.mk, then make automatically re-invokes itself, repeating the final phase.

We could instead have abandoned make's automatic re-invocation mechanism altogether, and used three explicit phases (0, 1, and final), but in practice it's very convenient to use the automatic re-invocation when there are no problematic dependencies.

Note that the inc1.mk rule is only enabled in phase 0, so that if we accidentally call inc.mk without first performing phase 0, we will either get a failure (if inc1.mk doesn't exist), or otherwise make will not update inc1.mk if it is out-of-date.

In the case of the GHC build system we need 4 such phases; see the comments in the top-level ghc.mk for details.

This approach is not at all pretty, and re-invoking make every time is slow, but we don't know of a better workaround for this problem.

Idiom: no double-colon rules

Make has a special type of rule of the form target :: prerequisites, with the behaviour that all double-colon rules for a given target are executed if the target needs to be rebuilt. This style was popular for things like "all" and "clean" targets in the past, but it's not really necessary - see the "all" idiom above - so we don't use double-colon rules in the GHC build system, and this means there's one fewer makeism you need to know about.

Idiom: the vanilla way

Libraries can be built in several different "ways"; for example, "profiling" and "dynamic" are two ways. Each way has a short tag associated with it; "p" and "dyn" are the tags for profiling and dynamic respectively. In previous GHC build systems, the "normal" way didn't have a name; it was just always built. Now we explicitly call it the "vanilla" way and use the tag "v" to refer to it.

This means that the GhcLibWays variable, which lists the ways in which the libraries are built, must include "v" if you want the vanilla way to be built (this is included in the default setup, of course).
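
For instance, a local mk/build.mk that asks for the vanilla and profiling ways might contain a line like the one below. This is a sketch of the setting only; the default value in your tree may differ.

```make
# mk/build.mk (local overrides)
# Build the libraries the vanilla way ("v") and the profiling way ("p").
# Leaving out "v" would mean the vanilla way is not built.
GhcLibWays = v p
```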

Idiom: whitespace

make has a rather ad-hoc approach to whitespace. Most of the time it ignores it, e.g.

FOO = bar

sets FOO to "bar", not " bar". However, sometimes whitespace is significant, and calling macros is one example. For example, we used to have a call

$(call all-target, $$($1_$2_INPLACE))

and this passed " $$($1_$2_INPLACE)" as the argument to all-target. This in turn generated

.PHONY: all_ inplace/bin/ghc-asm

which caused an infinite loop, as make continually thought that ghc-asm was out-of-date, rebuilt it, reinvoked make, and then thought it was out of date again.

The moral of the story is: avoid whitespace unless you're sure it'll be OK!
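
The effect is easy to reproduce outside the build system. In the following self-contained file, the macro show is invented for the demonstration; it brackets its argument so that stray spaces become visible.

```make
# Invented macro: wraps its argument in brackets.
show = [$1]

$(info $(call show,foo))            # prints [foo]
$(info $(call show, foo))           # prints [ foo]
$(info $(call show,$(strip  foo ))) # prints [foo]

all : ;
```

The second call keeps the space after the comma as part of the argument, which is exactly what went wrong in the all-target call above. When an argument's spacing is outside your control, wrapping it in $(strip ...) is one way to be safe.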

Idiom: platform names

There are three platforms of interest when building GHC:

  • $(BUILDPLATFORM): The build platform.
    The platform on which we are doing this build.
  • $(HOSTPLATFORM): The host platform.
    The platform on which these binaries will run.
  • $(TARGETPLATFORM): The target platform.
    The platform for which this compiler will generate code.

These platforms are set when running the configure script, using the --build, --host, and --target options. The mk/project.mk file, which is generated by configure from project.mk.in, defines several symbols related to the platform settings.

We don't currently support build and host being different, because the build process creates binaries that are both run during the build, and also installed.

If host and target are different, then we are building a cross-compiler. For GHC, this means a compiler which will generate intermediate .hc files to port to the target architecture for bootstrapping. The libraries and stage 2 compiler will be built as HC files for the target system (see Porting GHC for details).
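
The two conditions described above could be tested in makefile code along the following lines. This is a hypothetical sketch using only the three platform variables named earlier; it is not taken from the build system, and the messages are invented.

```make
# Hypothetical checks; not taken from the GHC build system.

# Build and host must match, since binaries built here are also run here.
ifneq "$(BUILDPLATFORM)" "$(HOSTPLATFORM)"
$(error build and host platforms differ; this is not supported)
endif

# Host and target differing means we are building a cross-compiler.
ifneq "$(HOSTPLATFORM)" "$(TARGETPLATFORM)"
$(info building a cross-compiler for $(TARGETPLATFORM))
endif
```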

More details on when to use BUILD, HOST or TARGET can be found in the comments in project.mk.in.