wiki:Commentary/Compiler/CmmType

Version 1 (modified by p_tanski, 7 years ago)

Note To Reader

This page is more detailed than usual because you may need to work with Cmm as a programming language. Cmm is the basis for the future of GHC and of native code generation, so if you are interested in hacking on Cmm, this page should at least shorten your learning curve. A finer point: if you have read the Compiler pipeline wiki page, or glanced at the diagram there, you may have noticed that whether you work backward from an intermediate C (Haskell-C, "HC", .hc) file or from an assembler file, you reach Cmm before you reach the STG language, the Simplifier or anything else. In other words, really low-level debugging is easier if you know what Cmm is about. Cmm also offers opportunities for small, easy hacks, such as minor optimisations and new Cmm Primitive Operations.

A portion of the RTS is written in Cmm: rts/Apply.cmm, rts/Exception.cmm, rts/HeapStackCheck.cmm, rts/PrimOps.cmm, rts/StgMiscClosures.cmm, rts/StgStartup.cmm and rts/StgStdThunks.cmm. (For notes related to PrimOps.cmm see the PrimOps page; for much of the rest, see the HaskellExecution page.) Cmm is optimised before GHC outputs either HC or assembler. Both the C backend (which compiles HC pretty-printed by compiler/cmm/PprC.hs) and the Native Code Generator (NCG) backends are closely tied to the data representations and transformations performed in Cmm. Within GHC, Cmm plays a role roughly similar to that of the intermediate Register Transfer Language (RTL) in GCC.

Table of Contents

  1. Additions in Cmm
  2. Compiling Cmm with GHC
  3. Basic Cmm
    1. Code Blocks in Cmm
    2. Variables, Registers and Types
      1. Local Registers
      2. Global Registers and Hints
      3. Declaration and Initialisation
      4. Memory Access
    3. Literals and Labels
    4. Sections and Directives
    5. Expressions
    6. Statements and Calls
    7. Operators and Primitive Operations
      1. Operators
      2. Primitive Operations
  4. Cmm Design: Observations and Areas for Potential Improvement

The Cmm language

Cmm is the GHC implementation of the C-- language; .cmm is also the file extension of Cmm source files (see What the hell is a .cmm file?). The GHC Code Generator (CodeGen) compiles the STG program into C-- code, represented by the Cmm data type. This data type follows the definition of C-- fairly closely, but there are some notable differences. For a discussion of the Cmm implementation that notes most of those differences, see the Basic Cmm section, below.

Additions in Cmm

Although both Cmm and C-- allow foreign calls, the .cmm syntax includes the following form:

foreign "C" cfunctionname(R1) [R2];

The [R2] part is the set of registers that must be saved across the call.

Other additions to C-- are noted throughout the Basic Cmm section, below.

Compiling Cmm with GHC

GHC can compile .cmm files with minimal user effort. To compile .cmm files, simply invoke the main GHC driver, but remember to:

  • add the option -dcmm-lint if you have handwritten Cmm code;
  • add appropriate includes, especially includes/Cmm.h, if you are using Cmm macros or GHC defines for certain types, such as W_ for bits32 or bits64 (depending on the machine word size). Cmm.h is in the includes directory of every GHC distribution, e.g., /usr/local/lib/ghc-6.6/includes; and,
  • if you do include GHC header files, remember to pass the code through the C preprocessor by adding the -cpp option.

For additional fun, you may pass GHC the -keep-s-file option to keep the temporary assembler file in your compile directory. For example:

ghc -cpp -dcmm-lint -keep-s-file -c Foo.cmm -o Foo.o
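
As an illustration only, the checklist above can be encoded in a small Python helper that assembles the driver command line. The function name ghc_cmm_command and its parameters are hypothetical. The flags are exactly those named in this section, and the helper only builds the argument list; it does not invoke GHC.

```python
def ghc_cmm_command(source, handwritten=True, uses_ghc_headers=True, keep_asm=False):
    """Build the argument list for compiling one .cmm file with the GHC driver.

    Hypothetical helper: it mirrors the checklist in the text and nothing more.
    """
    if not source.endswith(".cmm"):
        raise ValueError("expected a .cmm source file")
    args = ["ghc"]
    if uses_ghc_headers:
        # GHC header files (such as Cmm.h) need the C preprocessor.
        args.append("-cpp")
    if handwritten:
        # Lint handwritten Cmm code.
        args.append("-dcmm-lint")
    if keep_asm:
        # Keep the temporary assembler file in the compile directory.
        args.append("-keep-s-file")
    # Derive the object file name by swapping the .cmm extension for .o.
    obj = source[: -len(".cmm")] + ".o"
    args += ["-c", source, "-o", obj]
    return args


if __name__ == "__main__":
    print(" ".join(ghc_cmm_command("Foo.cmm", keep_asm=True)))
```

With keep_asm=True and the other defaults, the helper reproduces the example command shown above.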

This will only work with very basic Cmm files. GHC currently provides no -keep-cmm-file option, and -keep-tmp-files does not save a .cmm file. If you are thinking of redirecting the output of -ddump-cmm instead, beware: that output contains lines of equals signs and lines of dashes separating Cmm Blocks and Basic Blocks, and these are unparseable. The parser also cannot handle const items inside a section. For example, the parser will fail on the first 0 or alphabetic token after const:

section "data" {
    rOG_closure:
        const rOG_info;	// parse error `rOG_info'
        const 0;	// parse error `0'
        const 0;
        const 0;
}

Although GHC's Cmm pretty printer outputs the parenthesised argument list that standard C-- places after procedure names, i.e., (), the Cmm parser fails at the ( token. For example:

__stginit_Main_() {	// parse error `('
    cUX:
        Sp = Sp + 4;
        jump (I32[Sp + (-4)]);
}
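
Before feeding redirected -ddump-cmm output back to GHC, it can help to see where the obstacles described so far occur. The following Python sketch is a hypothetical pre-flight check, not part of GHC: the names find_parse_obstacles, SEPARATOR, CONST and PROC_HEADER are invented here, and the exact shape of the separator lines (a run of at least four = or - characters at the start of a line) is an assumption. It only reports line numbers and does not attempt to repair the dump.

```python
import re

# Assumption: a separator line starts with a run of at least four '=' or '-'.
SEPARATOR = re.compile(r"^\s*[=-]{4,}")
# A 'const' item, which the text says the Cmm parser rejects.
CONST = re.compile(r"^\s*const\b")
# A name at column 0 followed by '(' looks like a C--style procedure header.
PROC_HEADER = re.compile(r"^([A-Za-z_][A-Za-z0-9_.$]*)\s*\(")
# Statement keywords that may legitimately be followed by '('.
KEYWORDS = {"if", "switch"}


def find_parse_obstacles(dump_text):
    """Return (line_number, kind) pairs for lines the Cmm parser is said to reject."""
    findings = []
    for number, line in enumerate(dump_text.splitlines(), start=1):
        if SEPARATOR.match(line):
            findings.append((number, "separator"))
        elif CONST.match(line):
            findings.append((number, "const"))
        else:
            header = PROC_HEADER.match(line)
            # All-uppercase names such as INFO_TABLE are skipped because
            # handwritten RTS files use them as special reserved tokens.
            if header and not header.group(1).isupper() and header.group(1) not in KEYWORDS:
                findings.append((number, "paren-header"))
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "========",                      # assumed separator shape
        'section "data" {',
        "    rOG_closure:",
        "        const rOG_info;",
        "}",
        "__stginit_Main_() {",
        "    cUX:",
        "        Sp = Sp + 4;",
        "}",
    ])
    for number, kind in find_parse_obstacles(sample):
        print(number, kind)
```

The check is deliberately line-based and conservative: it flags candidates for manual editing rather than claiming that the remaining lines will parse.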

The Cmm procedure names in rts/PrimOps.cmm are not followed by a (possibly empty) parenthesised argument list; all their arguments are Global (STG) Registers anyway (see Variables, Registers and Types, below). Do not be confused by the procedure definitions in other handwritten .cmm files in the RTS, such as rts/Apply.cmm: the all-uppercase forms that look like procedure invocations are special reserved tokens in compiler/cmm/CmmLex.x and compiler/cmm/CmmParse.y. For example, INFO_TABLE is one of the tokens accepted by the info production of the Happy grammar in compiler/cmm/CmmParse.y:

info	:: { ExtFCode (CLabel, [CmmLit],[CmmLit]) }
	: 'INFO_TABLE' '(' NAME ',' INT ',' INT ',' INT ',' STRING ',' STRING ')'
		-- ptrs, nptrs, closure type, description, type
		{ stdInfo $3 $5 $7 0 $9 $11 $13 }
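
To make the token order of that production concrete, here is a rough Python approximation using a regular expression. It is a sketch under stated assumptions, not how GHC parses Cmm: the names INFO_TABLE_FORM and parse_info_table are hypothetical, INT is assumed to be a plain decimal literal, STRING is assumed to contain no escaped quotes, and the input is assumed to have already passed through the C preprocessor so that any macros have become integers.

```python
import re

# Approximates the token sequence of the info production:
# 'INFO_TABLE' '(' NAME ',' INT ',' INT ',' INT ',' STRING ',' STRING ')'
# Field names follow the comment in the grammar:
# ptrs, nptrs, closure type, description, type.
INFO_TABLE_FORM = re.compile(
    r'\bINFO_TABLE\s*\(\s*'
    r'(?P<name>[A-Za-z_][A-Za-z0-9_.$]*)\s*,\s*'
    r'(?P<ptrs>\d+)\s*,\s*'
    r'(?P<nptrs>\d+)\s*,\s*'
    r'(?P<closure_type>\d+)\s*,\s*'
    r'"(?P<description>[^"]*)"\s*,\s*'
    r'"(?P<type>[^"]*)"\s*\)'
)


def parse_info_table(text):
    """Return the six fields of the first INFO_TABLE(...) form, or None."""
    match = INFO_TABLE_FORM.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # The three INT tokens become Python integers.
    for key in ("ptrs", "nptrs", "closure_type"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    # The closure name and values below are made up for illustration.
    print(parse_info_table('INFO_TABLE(hypothetical_thunk, 1, 0, 23, "desc", "ty")'))
```

A form whose arguments do not match this shape, for example one that still contains an unexpanded macro name where an integer is expected, yields None in this sketch.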

GHC's Cmm parser also cannot parse nested code blocks. For example:

s22Q_ret() {
	s22Q_info {  	// parse error `{'
		const Main_main_srt-s22Q_info+24;
		const 0;
		const 2228227;
	}
    c23f:
	R2 = base_GHCziHandle_stdout_closure;
	R3 = 10;
	Sp = Sp + 4;    /* Stack pointer */
	jump base_GHCziIO_zdwhPutChar_info;
}

The example in section 4.6.2 of the C-- specification, "Procedures as section contents", will not parse in Cmm either:

section "data" { 
	const PROC = 3; 	// parse error `PROC'
	bits32[] {p_end, PROC}; // parse error `[' (only bits8[] is allowed)
				// parse error `{' (no {...} variable initialisation)

	p (bits32 i) {	// parse error `{' (Cmm thinks "p (bits32 i)" is a statement)
		loop: 
			i = i-1; 
		if (i >= 0) { goto loop ; }	// no parse error 
						// (if { ... } else { ... } *is* parseable)
		return; 
	} 
	p_end: 
} 

Note that if p (bits32 i) { ... } were written as a Cmm-parseable procedure, p { ... }, the parse error would instead occur at the final closing curly bracket of the enclosing section, that is, the last bracket in section "data" { ... p { ... } }.

Basic Cmm

Cmm is a high-level assembler with a C-like syntax. This section describes Cmm by working up from assembler, whereas the C-- papers and specification work down from C. At the least, you should know what a "high level" assembler is; see What is a High Level Assembler?. Cmm differs from other high-level assemblers in that it was designed as a semi-portable intermediate language for compilers; most others are designed to make the tedium of assembly language more convenient and intelligible to humans. If you are completely new to C--, I highly recommend these papers listed on the C-- Papers page: