Version 6 (modified by p_tanski, 7 years ago)

Note To Reader

This page was written in more detail than usual because you may need to know how to work with Cmm as a programming language. Cmm is the basis for the future of GHC and of Native Code Generation, and if you are interested in hacking on Cmm, this page may at least help reduce your learning curve. As a finer detail, if you read the Compiler pipeline wiki page or glanced at the diagram there, you may have noticed that whether you work backward from an intermediate C (Haskell-C "HC", .hc) file or from an Assembler file, you get to Cmm before you get to the STG language, the Simplifier or anything else. In other words, for really low-level debugging you may have an easier time if you know what Cmm is about. Cmm also offers opportunities for small and easy hacks, such as little optimisations and new Cmm Primitive Operations.

A portion of the RTS is written in Cmm: rts/Apply.cmm, rts/Exception.cmm, rts/HeapStackCheck.cmm, rts/PrimOps.cmm, rts/StgMiscClosures.cmm, rts/StgStartup.cmm and rts/StgStdThunks.cmm. (For notes related to PrimOps.cmm see the PrimOps page; for much of the rest, see the HaskellExecution page.) Cmm is optimised before GHC outputs either HC or Assembler. The C compiler (working from HC, which is pretty printed by compiler/cmm/PprC.hs) and the Native Code Generator (NCG) Backends are closely tied to the data representations and transformations performed in Cmm. In GHC, Cmm plays a role roughly similar to that of the intermediate Register Transfer Language (RTL) in GCC.

Table of Contents

  1. Additions in Cmm
  2. Compiling Cmm with GHC
  3. Basic Cmm
    1. Code Blocks in Cmm
    2. Variables, Registers and Types
      1. Local Registers
      2. Global Registers and Hints
      3. Declaration and Initialisation
      4. Memory Access
    3. Literals and Labels
    4. Sections and Directives
    5. Expressions
    6. Statements and Calls
    7. Operators and Primitive Operations
      1. Operators
      2. Primitive Operations
  4. Cmm Design: Observations and Areas for Potential Improvement

The Cmm language

Cmm is the GHC implementation of the C-- language; .cmm is also the file extension of Cmm source code files (see What the hell is a .cmm file?). The GHC Code Generator (CodeGen) compiles the STG program into C-- code, represented by the Cmm data type. This data type follows the definition of `C--` pretty closely, but there are some notable differences. For a discussion of the Cmm implementation noting most of those differences, see the Basic Cmm section, below.

Additions in Cmm

Although both Cmm and C-- allow foreign calls, the .cmm syntax includes the following form:

foreign "C" cfunctionname(R1) [R2];

The [R2] part is the (set of) register(s) that you need to save over the call.

Other additions to C-- are noted throughout the Basic Cmm section, below.

Compiling Cmm with GHC

GHC is able to compile .cmm files with a minimum of user effort. To compile .cmm files, simply invoke the main GHC driver but remember to:

  • add the option -dcmm-lint if you have handwritten Cmm code;
  • add appropriate includes, especially includes/Cmm.h if you are using Cmm macros or GHC defines for certain types, such as W_ for bits32 or bits64 (depending on the machine word size)--Cmm.h is in the includes directory of every GHC distribution, e.g., /usr/local/lib/ghc-6.6/includes; and,
  • if you do include GHC header files, remember to pass the code through the C preprocessor by adding the -cpp option.

For additional fun, you may pass GHC the -keep-s-file option to keep the temporary assembler file in your compile directory. For example:

ghc -cpp -dcmm-lint -keep-s-file -c Foo.cmm -o Foo.o

This will only work with very basic Cmm files. GHC currently provides no -keep-cmm-file option, and -keep-tmp-files does not save a .cmm file. If you are thinking about redirecting the output of -ddump-cmm instead, beware: that output contains lines of equals signs and lines of dashes separating Cmm Blocks and Basic Blocks, and these are unparseable. The parser also cannot handle const sections. For example, the parser will fail on the first 0 or alphabetic token after const:

section "data" {
    rOG_closure:
        const rOG_info;	// parse error `rOG_info'
        const 0;	// parse error `0'
        const 0;
        const 0;
}

Although GHC's Cmm pretty printer outputs the C-- standard parenthesised list of arguments after procedure names, i.e., (), the Cmm parser will fail at the ( token. For example:

__stginit_Main_() {	// parse error `('
    cUX:
        Sp = Sp + 4;
        jump (I32[Sp + (-4)]);
}

The Cmm procedure names in rts/PrimOps.cmm are not followed by a (possibly empty) parenthesised list of arguments; all their arguments are Global (STG) Registers anyway (see Variables, Registers and Types, below). Don't be confused by the procedure definitions in other handwritten .cmm files in the RTS, such as rts/Apply.cmm: all-uppercase procedure invocations are special reserved tokens in compiler/cmm/CmmLex.x and compiler/cmm/CmmParse.y. For example, INFO_TABLE is parsed as one of the tokens in the info production of the Happy grammar:

info	:: { ExtFCode (CLabel, [CmmLit],[CmmLit]) }
	: 'INFO_TABLE' '(' NAME ',' INT ',' INT ',' INT ',' STRING ',' STRING ')'
		-- ptrs, nptrs, closure type, description, type
		{ stdInfo $3 $5 $7 0 $9 $11 $13 }

GHC's Cmm parser also cannot parse nested code blocks. For example:

s22Q_ret() {
	s22Q_info {  	// parse error `{'
		const Main_main_srt-s22Q_info+24;
		const 0;
		const 2228227;
	}
    c23f:
	R2 = base_GHCziHandle_stdout_closure;
	R3 = 10;
	Sp = Sp + 4;    /* Stack pointer */
	jump base_GHCziIO_zdwhPutChar_info;
}

The C-- specification example in section 4.6.2, "Procedures as section contents" also will not parse in Cmm:

section "data" { 
	const PROC = 3; 	// parse error `PROC'
	bits32[] {p_end, PROC}; // parse error `[' (only bits8[] is allowed)
				// parse error `{' (no {...} variable initialisation)

	p (bits32 i) {	// parse error `{' (Cmm thinks "p (bits32 i)" is a statement)
		loop: 
			i = i-1; 
		if (i >= 0) { goto loop ; }	// no parse error 
						// (if { ... } else { ... } *is* parseable)
		return; 
	} 
	p_end: 
} 

Note that if p (bits32 i) { ... } were written as a Cmm-parseable procedure, as p { ... }, the parse error would occur at the closing curly bracket for the section "data" { ... p { ... } }<- here.

Basic Cmm

Cmm is a high level assembler with a syntax style similar to C. This section describes Cmm by working up from assembler, whereas the C-- papers and specification work down from C. At the least, you should know what a "high level" assembler is; see What is a High Level Assembler?. Cmm differs from other high level assembler languages in that it was designed to be a semi-portable intermediate language for compilers; most other high level assemblers are designed to make the tedium of assembly language more convenient and intelligible to humans. If you are completely new to C--, I highly recommend the papers listed on the C-- Papers page.

Cmm is not a standalone C-- compiler; it is an implementation of C-- embedded in GHC. One difference between Cmm and a C-- compiler like Quick C-- is that Cmm uses the C preprocessor (cpp). Cpp lets Cmm integrate with C code, especially the C header defines in the includes directory, and among many other consequences it makes the C-- import and export statements irrelevant; in fact, according to compiler/cmm/CmmParse.y they are ignored. The most significant action taken by the Cmm modules in the Compiler is to optimise Cmm, through compiler/cmm/CmmOpt.hs. The Cmm Optimiser generally runs a few simplification passes over primitive Cmm operations, inlines simple Cmm expressions that do not contain global registers (these would be left to one of the Backends, which currently cannot handle inlines with global registers) and performs a simple loop optimisation.
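To give a feel for what a "simplification pass" over primitive operations means, here is a minimal sketch in Haskell. It is not code from compiler/cmm/CmmOpt.hs: the Expr type and the simplify function are hypothetical stand-ins invented for this illustration, and the real optimiser works on GHC's own expression and MachOp types.

```haskell
-- Toy expression type; NOT GHC's real Cmm expression type.
-- Every name in this block is hypothetical.
data Expr
  = Lit Integer        -- integer literal
  | Reg String         -- a named register, e.g. "Sp" or "R1"
  | Add Expr Expr
  | Sub Expr Expr
  deriving (Eq, Show)

-- One bottom-up pass: fold operations whose operands are both literals
-- and drop additions or subtractions of zero.
simplify :: Expr -> Expr
simplify (Add a b) =
  case (simplify a, simplify b) of
    (Lit x, Lit y) -> Lit (x + y)
    (e, Lit 0)     -> e
    (Lit 0, e)     -> e
    (a', b')       -> Add a' b'
simplify (Sub a b) =
  case (simplify a, simplify b) of
    (Lit x, Lit y) -> Lit (x - y)
    (e, Lit 0)     -> e
    (a', b')       -> Sub a' b'
simplify e = e
```

For instance, simplify (Add (Reg "Sp") (Add (Lit 4) (Lit (-4)))) reduces the inner addition to Lit 0 and then drops it, leaving Reg "Sp".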

Code Blocks in Cmm

The Haskell representation of Cmm separates contiguous code into:

  • modules (compilation units; a .cmm file); and
  • basic blocks

Cmm modules contain static data elements (see Literals and Labels) and Basic Blocks, collected together in Cmm, a type synonym for GenCmm, defined in compiler/cmm/Cmm.hs:

newtype GenCmm d i = Cmm [GenCmmTop d i]
 
type Cmm = GenCmm CmmStatic CmmStmt

data GenCmmTop d i
  = CmmProc
     [d]               -- Info table, may be empty
     CLabel            -- Used to generate both info & entry labels
     [LocalReg]        -- Argument locals live on entry (C-- procedure params)
     [GenBasicBlock i] -- Code, may be empty.  The first block is
                       -- the entry point.  The order is otherwise initially
                       -- unimportant, but at some point the code gen will
                       -- fix the order.

                       -- the BlockId of the first block does not give rise
                       -- to a label.  To jump to the first block in a Proc,
                       -- use the appropriate CLabel.

  -- some static data.
  | CmmData Section [d] -- constant values only

type CmmTop = GenCmmTop CmmStatic CmmStmt

CmmStmt is described in Statements and Calls;
Section is described in Sections and Directives;
the static data in [d] is [CmmStatic] from the type synonym Cmm;
CmmStatic is described in Literals and Labels.
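As a rough illustration of how these types fit together, the sketch below rebuilds the same shape with simplified stand-ins and walks a module. CLabel, LocalReg, Section and GenBasicBlock are replaced here by hypothetical placeholder definitions (the real ones live in GHC's own modules), and procLabels and staticCount are names invented for this example.

```haskell
-- Simplified stand-ins for GHC-internal types (hypothetical placeholders).
type CLabel   = String
type LocalReg = String
data Section  = Text | Data | ReadOnlyData deriving (Eq, Show)
data GenBasicBlock i = BasicBlock Int [i]

-- Same shape as the definitions quoted from compiler/cmm/Cmm.hs.
newtype GenCmm d i = Cmm [GenCmmTop d i]

data GenCmmTop d i
  = CmmProc [d] CLabel [LocalReg] [GenBasicBlock i]
  | CmmData Section [d]

-- Labels of all procedures in a module, in the order they appear.
procLabels :: GenCmm d i -> [CLabel]
procLabels (Cmm tops) = [lbl | CmmProc _ lbl _ _ <- tops]

-- Number of static data items held in CmmData sections.
staticCount :: GenCmm d i -> Int
staticCount (Cmm tops) = sum [length ds | CmmData _ ds <- tops]
```

Because GenCmm is parameterised over the static-data type d and the statement type i, both helpers work unchanged for any instantiation, including the Cmm synonym.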

Basic Blocks and Procedures

Cmm procedures are represented by the first constructor in GenCmmTop d i:

    CmmProc [d] CLabel [LocalReg] [GenBasicBlock i]

For a description of Cmm labels and the CLabel data type, see the subsection Literals and Labels, below.

Cmm Basic Blocks are labeled blocks of Cmm code ending in an explicit jump. Sections (see Sections and Directives) have no jumps--in Cmm, Sections cannot contain nested Procedures (see, e.g., Compiling Cmm with GHC). Basic Blocks represent parts of Procedures. The data type GenBasicBlock and the type synonym CmmBasicBlock encapsulate Basic Blocks; they are defined in compiler/cmm/Cmm.hs:

data GenBasicBlock i = BasicBlock BlockId [i]

type CmmBasicBlock = GenBasicBlock CmmStmt

newtype BlockId = BlockId Unique
  deriving (Eq,Ord)

instance Uniquable BlockId where
  getUnique (BlockId u) = u
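The property that a Basic Block ends in an explicit transfer of control can be sketched as a small check. This is only an illustration under assumptions: Stmt is a hypothetical miniature statement type (GHC's real CmmStmt is richer), BlockId wraps an Int instead of a Unique, and endsInTransfer and unterminated are names made up for this example.

```haskell
-- Hypothetical miniature statement type; not GHC's CmmStmt.
data Stmt
  = Assign String String   -- register := expression, both as plain text
  | Jump String            -- explicit jump to a label
  | Branch Int             -- branch to another block in the same procedure
  deriving (Eq, Show)

-- Stand-in for BlockId, which wraps a Unique in GHC.
newtype BlockId = BlockId Int deriving (Eq, Ord, Show)

data GenBasicBlock i = BasicBlock BlockId [i]

-- True if the final statement of the block transfers control.
endsInTransfer :: GenBasicBlock Stmt -> Bool
endsInTransfer (BasicBlock _ stmts) =
  case reverse stmts of
    (Jump _   : _) -> True
    (Branch _ : _) -> True
    _              -> False

-- Ids of the blocks in a procedure body that fall off the end.
unterminated :: [GenBasicBlock Stmt] -> [BlockId]
unterminated blocks =
  [bid | b@(BasicBlock bid _) <- blocks, not (endsInTransfer b)]
```
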

The BlockId data type simply carries a Unique with each Basic Block. For descriptions of Unique, see

Variables, Registers and Types

Like other high level assembly languages, all variables in C-- are machine registers, separated into different types according to bit length (8, 16, 32, 64, 80, 128) and register type (integral or floating point). The C-- standard specifies little more type information about a register than its bit length: there are no distinguishing types for signed or unsigned integrals, or for "pointers" (registers holding a memory address). A C-- standard compiler supports additional information on the type of a register value through compiler hints. In a foreign call, a "signed" bits8 would be sign-extended and may be passed as a 32-bit value. Cmm diverges from the C-- specification on this point somewhat (see below). C-- and Cmm do not represent special registers, such as a Condition Register (CR) or floating point unit (FPU) status and control register (FPSCR on the PowerPC, MXCSR on Intel x86 processors), as these are a matter for the Backends.

C-- and Cmm hide the actual number of registers available on a particular machine by assuming an "infinite" supply of registers. A backend, such as the NCG or the C compiler in GHC, will later optimise the number of registers used and assign the Cmm variables to actual machine registers; the NCG temporarily stores any overflow in a small memory stack called the spill stack, while the C compiler relies on C's own runtime system. In its Haskell source, GHC represents Cmm registers with three data types: LocalReg, GlobalReg and CmmReg. LocalRegs and GlobalRegs are collected together in a single Cmm data type:

data CmmReg 
  = CmmLocal  LocalReg
  | CmmGlobal GlobalReg
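A minimal sketch of working with this sum type follows. The LocalReg and GlobalReg definitions here are cut-down stand-ins chosen for the example (the real LocalReg carries a Unique and a MachRep, and the real GlobalReg has many more constructors), and splitRegs is a hypothetical helper, not a GHC function.

```haskell
import Data.Either (partitionEithers)

-- Cut-down stand-ins, assumed for illustration only.
newtype LocalReg = LocalReg Int deriving (Eq, Show)
data GlobalReg = Sp | Hp | VanillaReg Int deriving (Eq, Show)

data CmmReg
  = CmmLocal  LocalReg
  | CmmGlobal GlobalReg
  deriving (Eq, Show)

-- Split a list of registers into locals and globals,
-- keeping the original order within each group.
splitRegs :: [CmmReg] -> ([LocalReg], [GlobalReg])
splitRegs = partitionEithers . map toEither
  where
    toEither (CmmLocal  l) = Left l
    toEither (CmmGlobal g) = Right g
```

Pattern matching on CmmLocal and CmmGlobal in this way is how any pass can treat the two kinds of register differently.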

Local Registers

Local Registers exist within the scope of a Procedure:

data LocalReg
  = LocalReg !Unique MachRep

For a list of references with information on Unique, see the Basic Blocks and Procedures section, above.

A MachRep, the type of a machine register, is defined in compiler/cmm/MachOp.hs:

data MachRep
  = I8          -- integral type, 8 bits wide (a byte)
  | I16         -- integral type, 16 bits wide
  | I32         -- integral type, 32 bits wide
  | I64         -- integral type, 64 bits wide
  | I128        -- integral type, 128 bits wide (an integral vector register)
  | F32         -- floating point type, 32 bits wide (float)
  | F64         -- floating point type, 64 bits wide (double)
  | F80         -- extended double-precision, used in x86 native codegen only.

There is currently no register for floating point vectors, such as F128. The types of Cmm variables are defined in the Happy parser file compiler/cmm/CmmParse.y and the Alex lexer file compiler/cmm/CmmLex.x. (Happy and Alex will compile these into CmmParse.hs and CmmLex.hs, respectively.) Cmm recognises the following C-- types as parseable tokens, listed next to their corresponding defines in includes/Cmm.h and their STG types:

Cmm Token   Cmm.h #define          STG type
bits8       I8                     StgChar or StgWord8
bits16      I16                    StgWord16
bits32      I32, CInt, CLong       StgWord32; StgWord (depending on architecture)
bits64      I64, CInt, CLong, L_   StgWord64; StgWord (depending on architecture)
float32     F_                     StgFloat
float64     D_                     StgDouble
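
The correspondence between the type tokens and MachRep can be sketched as a pair of lookup functions. This is an illustration only, not the code in CmmParse.y or CmmLex.x: tokenRep and repBits are hypothetical names, and the token list is limited to the six tokens shown in the table.

```haskell
-- MachRep constructors as quoted from compiler/cmm/MachOp.hs.
data MachRep = I8 | I16 | I32 | I64 | I128 | F32 | F64 | F80
  deriving (Eq, Show)

-- Width in bits of each representation (hypothetical helper).
repBits :: MachRep -> Int
repBits I8   = 8
repBits I16  = 16
repBits I32  = 32
repBits I64  = 64
repBits I128 = 128
repBits F32  = 32
repBits F64  = 64
repBits F80  = 80

-- Map a Cmm type token from the table to its representation
-- (hypothetical helper; unknown tokens give Nothing).
tokenRep :: String -> Maybe MachRep
tokenRep "bits8"   = Just I8
tokenRep "bits16"  = Just I16
tokenRep "bits32"  = Just I32
tokenRep "bits64"  = Just I64
tokenRep "float32" = Just F32
tokenRep "float64" = Just F64
tokenRep _         = Nothing
```

For example, fmap repBits (tokenRep "float64") evaluates to Just 64, while a token that is not in the table yields Nothing.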