Changes between Version 11 and Version 12 of Commentary/Compiler/Backends/NCG


Timestamp: May 7, 2011 4:11:46 PM
Author: igloo

For other information related to this page, see:
 * [http://darcs.haskell.org/ghc/docs/comm/the-beast/ncg.html the Old GHC Commentary: Native Code Generator] page (comments regarding Maximal Munch and register allocation optimisations are mostly still valid)
 * [wiki:BackEndNotes] for optimisation ideas regarding the current NCG
 * [wiki:Commentary/Compiler/CmmType The Cmm language] (the NCG code works from Haskell's implementation of C-- and many optimisations in the NCG relate to Cmm)
 * [wiki:Commentary/Compiler/Backends/NCG/RegisterAllocator The register allocator].

On some platforms (currently x86 and x86_64, with possibly bitrotted support for PowerPC and Sparc), GHC can generate assembly code directly. The NCG is enabled by default on supported platforms.

The NCG has always been something of a second-class citizen inside GHC, an unloved child, rather. This means that its integration into the compiler as a whole is rather clumsy, which brings some problems described below. That apart, the NCG proper is fairly cleanly designed, as target-independent as it reasonably can be, and so should not be difficult to retarget.

NOTE! The native code generator was largely rewritten as part of the C-- backend changes, around May 2004. Unfortunately the rest of this document still refers to the old version, and was written with relation to the CVS head as of end-Jan 2002. Some of it is relevant, some of it isn't.

=== Files, Parts ===

After GHC has produced [wiki:Commentary/Compiler/CmmType Cmm] (use -ddump-cmm or -ddump-opt-cmm to view), the Native Code Generator (NCG) transforms Cmm into architecture-specific assembly code.  The NCG is located in [[GhcFile(compiler/nativeGen)]] and is separated into eight modules:

 1. define, manage and allocate the real registers available on the target system;
 1. pretty-print the Haskell-assembler to GNU AS (GAS) assembler code.

== Overview ==

The top-level code generator function is
{{{
absCtoNat :: AbstractC -> UniqSM (SDoc, Pretty.Doc)
}}}
The returned `SDoc` is for debugging, so is empty unless you specify `-ddump-stix`. The `Pretty.Doc` bit is the final assembly code. Translation involves three main phases, the first and third of which are target-independent.

==== Translation into the Stix representation ====

Stix is a simple tree-like RTL-style language, in which you can mention:
 * An infinite number of temporary, virtual registers.
 * The STG "magic" registers (`MagicId`), such as the heap and stack pointers.
 * Literals and low-level machine ops (`MachOp`).
 * Simple address computations.
 * Reads and writes of: memory, virtual regs, and various STG regs.
 * Labels and `if ... goto ...` style control-flow.

Stix has two main associated types:
 * `StixStmt` -- trees executed for their side effects: assignments, control transfers, and auxiliary junk such as segment changes and literal data.
 * `StixExpr` -- trees which denote a value.

Translation into Stix is almost completely target-independent. The only dependencies needed are knowledge of word size and endianness, used when generating code to deal with half-word fields in info tables. This could be abstracted out easily enough. Also, the Stix translation needs to know which `MagicId`s map to registers on the given target, and which are stored in offsets from `BaseReg`.

After initial Stix generation, the trees are cleaned up with constant-folding and a little copy-propagation ("Stix inlining", as the code misleadingly calls it). We take the opportunity to translate `MagicId`s which are stored in memory on the given target into suitable memory references. Those which are stored in registers are left alone. There is also a half-hearted attempt to lift literal strings to the top level in cases where nested strings have been observed to give incorrect code in the past.
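
The constant-folding step can be pictured with a toy version. The `Expr` type and `cfold` function below are illustrative stand-ins, not GHC's actual Stix types: the real pass works over `StixExpr` trees and also performs the `MagicId` rewriting described above.

```haskell
module Main where

-- Toy stand-in for a Stix-style expression tree (not GHC's real StixExpr).
data Expr
  = Lit Int
  | Reg String              -- a (virtual or magic) register
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Bottom-up constant folding: fold the children first, then
-- combine when both children have become literals.
cfold :: Expr -> Expr
cfold (Add a b) = combine Add (+) (cfold a) (cfold b)
cfold (Mul a b) = combine Mul (*) (cfold a) (cfold b)
cfold e         = e

combine :: (Expr -> Expr -> Expr) -> (Int -> Int -> Int)
        -> Expr -> Expr -> Expr
combine _ op (Lit x) (Lit y) = Lit (x `op` y)
combine c _  a       b       = c a b

main :: IO ()
main = print (cfold (Add (Mul (Lit 2) (Lit 4)) (Reg "Hp")))
```

Registers block folding, so `2*4 + Hp` folds only the literal subtree, leaving `Add (Lit 8) (Reg "Hp")`.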

Primitive machine-level operations will already be phrased in terms of `MachOp`s in the presented Abstract C, and these are passed through unchanged. We comment only that the `MachOp`s have been chosen so as to be easy to implement on all targets, and their meaning is intended to be unambiguous, and the same on all targets, regardless of word size or endianness.

'''A note on `MagicId`s'''. Those which are assigned to registers on the current target are left unmodified. Those which are not are stored in memory as offsets from `BaseReg` (which is assumed to permanently have the value (`&MainCapability.r`)), so the constant folder calculates the offsets and inserts suitable loads/stores. One complication is that not all archs have `BaseReg` itself in a register, so for those (sparc), we instead generate the address as an offset from the static symbol `MainCapability`, since the register table lives in there.

Finally, `BaseReg` does occasionally itself get mentioned in Stix expression trees, and in this case what is denoted is precisely (`&MainCapability.r`), not, as in all other cases, the value of memory at some offset from the start of the register table. Since what it denotes is an r-value and not an l-value, assigning `BaseReg` is meaningless, so the machinery checks to ensure this never happens. All these details are taken into account by the constant folder.

==== Instruction selection ====

This is the only majorly target-specific phase. It turns Stix statements and expressions into sequences of `Instr`, a data type which is different for each architecture. `Instr`, unsurprisingly, has various supporting types, such as `Reg`, `Operand`, `Imm`, etc. The generated instructions may refer to specific machine registers, or to arbitrary virtual registers, either those created within the instruction selector, or those mentioned in the Stix passed to it.

The instruction selectors live in `MachCode.lhs`. The core functions, for each target, are:

{{{
getAmode :: StixExpr -> NatM Amode
getRegister :: StixExpr -> NatM Register
assignMem_IntCode :: PrimRep -> StixExpr -> StixExpr -> NatM InstrBlock
assignReg_IntCode :: PrimRep -> StixReg -> StixExpr -> NatM InstrBlock
}}}

The insn selectors use the "maximal munch" algorithm. The bizarrely-misnamed `getRegister` translates expressions. A simplified version of its type is:
{{{
getRegister :: StixExpr -> NatM (OrdList Instr, Reg)
}}}
That is: it (monadically) turns a `StixExpr` into a sequence of instructions and a register, with the meaning that after executing the (possibly empty) sequence of instructions, the (possibly virtual) register will hold the resulting value. The real situation is complicated by the presence of fixed registers, and is detailed below.

Maximal munch is a greedy algorithm and is known not to give globally optimal code sequences, but it is good enough, and fast and simple. Early incarnations of the NCG used something more sophisticated, but that is long gone now.
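
A toy rendition of maximal munch may make the idea concrete. Everything here (`Expr`, `Instr`, `munch`, the instruction set) is invented for illustration; the real selector works over `StixExpr` and an arch-specific `Instr`, and threads its fresh-register supply through the `NatM` monad rather than explicitly:

```haskell
module Main where

-- Invented source and target types, purely for illustration.
data Expr = Lit Int | Temp Int | Add Expr Expr deriving (Eq, Show)

data Instr
  = LI   Int Int       -- LI   r, n    : r := n
  | ADDI Int Int Int   -- ADDI r, a, n : r := a + n
  | ADD  Int Int Int   -- ADD  r, a, b : r := a + b
  deriving (Eq, Show)

-- munch u e = (code, resultReg, u'), where u supplies fresh vreg numbers.
-- Clauses are ordered biggest-pattern-first ("maximal munch"): an
-- expression plus a literal is munched as a single ADDI before the
-- generic two-register ADD clause gets a look in.
munch :: Int -> Expr -> ([Instr], Int, Int)
munch u (Temp r)        = ([], r, u)
munch u (Lit n)         = ([LI u n], u, u + 1)
munch u (Add e (Lit n)) = let (c, a, u') = munch u e
                          in (c ++ [ADDI u' a n], u', u' + 1)
munch u (Add e1 e2)     = let (c1, a, u1) = munch u e1
                              (c2, b, u2) = munch u1 e2
                          in (c1 ++ c2 ++ [ADD u2 a b], u2, u2 + 1)

main :: IO ()
main = print (munch 100 (Add (Temp 0) (Lit 1)))
```

So `Temp 0 + 1` becomes one `ADDI` into a fresh vreg, rather than a `LI` followed by an `ADD`.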
     104 
     105Similarly, `getAmode` translates a value, intended to denote an address, into a sequence of insns leading up to a (processor-specific) addressing mode. This stuff could be done using the general `getRegister` selector, but would necessarily generate poorer code, because the calculated address would be forced into a register, which might be unnecessary if it could partially or wholly be calculated using an addressing mode. 
     106 
     107Finally, `assignMem_IntCode` and `assignReg_IntCode` create instruction sequences to calculate a value and store it in the given register, or at the given address. Because these guys translate a statement, not a value, they just return a sequence of insns and no associated register. Floating-point and 64-bit integer assignments have analogous selectors. 
     108 
     109Apart from the complexities of fixed vs floating registers, discussed below, the instruction selector is as simple as it can be. It looks long and scary but detailed examination reveals it to be fairly straightforward. 

==== Register allocation ====

The register allocator, `AsmRegAlloc.lhs`, takes sequences of `Instr`s which mention a mixture of real and virtual registers, and returns a modified sequence referring only to real ones. It is gloriously and entirely target-independent. Well, that's not exactly true; rather, it regards `Instr` (instructions) and `Reg` (virtual and real registers) as abstract types, to which it has the following interface:
{{{
insnFuture :: Instr -> InsnFuture
regUsage :: Instr -> RegUsage
patchRegs :: Instr -> (Reg -> Reg) -> Instr
}}}
`insnFuture` is used to (re)construct the graph of all possible control transfers between the insns to be allocated. `regUsage` returns the sets of registers read and written by an instruction. And `patchRegs` is used to apply the allocator's final decision on the virtual-to-real reg mapping to an instruction.

Clearly these three functions have to be written anew for each architecture. They are defined in `RegAllocInfo.lhs`. Think twice, no, thrice, before modifying them: making false claims about insn behaviour will lead to hard-to-find register allocation errors.

`AsmRegAlloc.lhs` contains detailed comments about how the allocator works. Here is a summary. The head honcho
{{{
allocUsingTheseRegs :: [Instr] -> [Reg] -> (Bool, [Instr])
}}}
takes a list of instructions and a list of real registers available for allocation, and maps as many of the virtual regs in the input into real ones as it can. The returned `Bool` indicates whether or not it was successful. If so, that's the end of it. If not, the caller of `allocUsingTheseRegs` will attempt spilling. More of that later. What `allocUsingTheseRegs` does is:
 * Implicitly number each instruction by its position in the input list.
 * Using `insnFuture`, create the set of all flow edges -- possible control transfers -- within this set of insns.
 * Using `regUsage` and iterating around the flow graph from the previous step, calculate, for each virtual register, the set of flow edges on which it is live.
 * Make a real-register commitment map, which gives the set of edges for which each real register is committed (in use). These sets are initially empty. For each virtual register, attempt to find a real register whose current commitment does not intersect that of the virtual register -- ie, one which is uncommitted on all edges that the virtual reg is live. If successful, this means the vreg can be assigned to the realreg, so add the vreg's set to the realreg's commitment.
 * If all the vregs were assigned to a realreg, use `patchRegs` to apply the mapping to the insns themselves.
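
The commitment-map step can be sketched in miniature. The code below is an illustrative reconstruction from the description above, not GHC's: vreg live ranges are supplied directly as sets of flow-edge numbers, and a vreg may share a realreg with anything already committed there as long as their edge sets don't intersect.

```haskell
module Main where

-- Toy commitment-map allocation; all names are illustrative, not GHC's.
import qualified Data.Set as S

type Edge = Int
type VReg = String
type RReg = String

-- For each vreg (with its set of live edges), find a realreg whose
-- committed edge set is disjoint from it; Nothing means allocation failed
-- (the real allocator would then fall back to the spilling machinery).
allocate :: [RReg] -> [(VReg, S.Set Edge)] -> Maybe [(VReg, RReg)]
allocate rregs vregs = go (map (\r -> (r, S.empty)) rregs) vregs
  where
    go _ [] = Just []
    go commit ((v, live) : rest) =
      case break (\(_, used) -> S.null (used `S.intersection` live)) commit of
        (_, []) -> Nothing                       -- no rreg free on all edges
        (pre, (r, used) : post) -> do
          rest' <- go (pre ++ (r, used `S.union` live) : post) rest
          return ((v, r) : rest')

main :: IO ()
main = print (allocate ["eax", "ebx"]
                       [ ("v1", S.fromList [1, 2])
                       , ("v2", S.fromList [2, 3])   -- clashes with v1
                       , ("v3", S.fromList [4])      -- can reuse v1's rreg
                       ])
```

Here `v2` overlaps `v1` on edge 2, so it gets a second realreg, while `v3` is live only on edge 4 and can share the first.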

=== Spilling ===

If `allocUsingTheseRegs` fails, a baroque mechanism comes into play. We now know that much simpler schemes are available to do the same thing and give better results. Anyways:

The logic above `allocUsingTheseRegs`, in `doGeneralAlloc` and `runRegAllocate`, observes that allocation has failed with some set R of real registers. So it runs the allocator a second time over the code, but removes (typically) two registers from R before doing so. This naturally fails too, but returns a partially-allocated sequence. `doGeneralAlloc` then inserts spill code into the sequence, and finally re-runs `allocUsingTheseRegs`, but supplying the original, unadulterated R. This is guaranteed to succeed, since the two registers previously removed from R are sufficient to allocate all the spill/restore instructions added.

Because x86 is very short of registers, and in the worst case needs three removed from R, a softly-softly approach is used. `doGeneralAlloc` first tries with zero regs removed from R, then if that fails with one, then two, etc. This means `allocUsingTheseRegs` may get run several times before a successful arrangement is arrived at. `findReservedRegs` cooks up the sets of spill registers to try with.

The resulting machinery is complicated and the generated spill code is appalling. The saving grace is that spills are very rare, so it doesn't matter much. I did not invent this -- I inherited it.

=== Dealing with common cases fast ===

The entire reg-alloc mechanism described so far is general and correct, but expensive overkill for many simple code blocks. So to begin with we use `doSimpleAlloc`, which attempts to do something simple. It exploits the observation that if the total number of virtual registers does not exceed the number of real ones available, we can simply dole out a new realreg each time we see mention of a new vreg, with no regard for control flow. `doSimpleAlloc` therefore attempts this in a single pass over the code. It gives up if it runs out of real regs or sees any condition which renders the above observation invalid (fixed reg uses, for example).

This clever hack handles the majority of code blocks quickly. It was copied from the previous reg-allocator (the Mattson/Partain/Marlow/Gill one).
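
The core of the `doSimpleAlloc` idea can be sketched as below, under the simplifying assumptions just stated (no fixed regs, no regard for control flow); the names are illustrative, not GHC's:

```haskell
module Main where

-- Toy single-pass allocation: hand out a fresh realreg the first time
-- each vreg is mentioned; give up (Nothing) if real registers run out.
import qualified Data.Map as M

type VReg = String
type RReg = String

simpleAlloc :: [RReg] -> [VReg] -> Maybe (M.Map VReg RReg)
simpleAlloc rregs = go rregs M.empty
  where
    go _ m [] = Just m
    go free m (v : vs)
      | v `M.member` m = go free m vs          -- seen before: reuse mapping
    go [] _ _ = Nothing                        -- out of real registers
    go (r : free) m (v : vs) = go free (M.insert v r m) vs

main :: IO ()
main = print (simpleAlloc ["eax", "ebx"] ["v1", "v2", "v1"])
```

The real version also bails out on any instruction that invalidates the observation, e.g. one with fixed-register uses.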

== Complications, observations, and possible improvements ==

=== Real vs virtual registers in the instruction selectors ===

The instruction selectors for expression trees, namely `getRegister`, are complicated by the fact that some expressions can only be computed into a specific register, whereas the majority can be computed into any register. We take x86 as an example, but the problem applies to all archs.

Terminology: `rreg` means real register, a real machine register. `vreg` means one of an infinite set of virtual registers. The type `Reg` is the sum of `rreg` and `vreg`. The instruction selector generates sequences with unconstrained use of vregs, leaving the register allocator to map them all into rregs.

Now, where was I? Oh yes. We return to the type of `getRegister`, which, despite its name, selects instructions to compute the value of an expression tree.
{{{
getRegister :: StixExpr -> NatM Register

data Register
  = Fixed   PrimRep Reg InstrBlock
  | Any     PrimRep (Reg -> InstrBlock)

type InstrBlock -- sequence of instructions
}}}

At first this looks eminently reasonable (apart from the stupid name). `getRegister`, and nobody else, knows whether or not a given expression has to be computed into a fixed rreg or can be computed into any rreg or vreg. In the first case, it returns `Fixed` and indicates which rreg the result is in. In the second case it defers committing to any specific target register by returning a function from `Reg` to `InstrBlock`, and the caller can specify the target reg as it sees fit.

Unfortunately, that forces `getRegister`'s callers (usually itself) to use a clumsy and confusing idiom in the common case where they do not care what register the result winds up in. The reason is that although a value might be computed into a fixed rreg, we are forbidden (on pain of segmentation fault :) from subsequently modifying the fixed reg. This and other rules are recorded in "Rules of the game" inside `MachCode.lhs`.

Why can't fixed registers be modified post-hoc? Consider a simple expression like `Hp+1`. Since the heap pointer `Hp` is definitely in a fixed register, call it R, `getRegister` on the subterm `Hp` will simply return `Fixed`, with an empty instruction sequence and R. But we can't just emit an increment instruction for R, because that trashes `Hp`; instead we first have to copy it into a fresh vreg and increment that.

With all that in mind, consider now writing a `getRegister` clause for terms of the form `(1 + E)`. Contrived, yes, but it illustrates the matter. First we do `getRegister` on `E`. Now we are forced to examine what comes back.
{{{
getRegister (OnePlus e)
   = getRegister e           `thenNat`   \ e_result ->
     case e_result of
        Fixed _ e_fixed e_code
           -> returnNat (Any IntRep (\dst -> e_code ++ [MOV e_fixed dst, INC dst]))
        Any _ e_any
           -> returnNat (Any IntRep (\dst -> e_any dst ++ [INC dst]))
}}}
This seems unreasonably cumbersome, yet the instruction selector is full of such idioms. A good example of the complexities induced by this scheme is shown by `trivialCode` for x86 in `MachCode.lhs`. This deals with general integer dyadic operations on x86 and has numerous cases. It was difficult to get right.

An alternative suggestion is to simplify the type of `getRegister` to this:
{{{
getRegister :: StixExpr -> NatM (InstrBlock, VReg)
type VReg = .... a vreg ...
}}}
and then we could safely write
{{{
getRegister (OnePlus e)
   = getRegister e        `thenNat`  \ (e_code, e_vreg) ->
     returnNat (e_code ++ [INC e_vreg], e_vreg)
}}}
which is about as straightforward as you could hope for. Unfortunately, it requires `getRegister` to insert moves of values which naturally compute into an rreg, into a vreg. Consider:
{{{
1 + ccall some-C-fn
}}}
On x86 the ccall result is returned in rreg `%eax`. The resulting sequence, prior to register allocation, would be:
{{{
# push args
call some-C-fn
# move %esp to nuke args
movl   %eax, %vreg
incl   %vreg
}}}
If, as is likely, `%eax` is not held live beyond this point for any other purpose, the move into a fresh register is pointless; we'd have been better off leaving the value in `%eax` as long as possible.

The simplified `getRegister` story is attractive. It would clean up the instruction selectors significantly and make it simpler to write new ones. The only drawback is that it generates redundant register moves. I suggest that eliminating these should be the job of the register allocator. Indeed:

 * There has been some work on this already ("Iterated register coalescing", George and Appel), so this isn't a new idea.

 * You could argue that the existing scheme inappropriately blurs the boundary between the instruction selector and the register allocator. The instruction selector should .. well .. just select instructions, without having to futz around worrying about what kind of registers subtrees get generated into. Register allocation should be ''entirely'' the domain of the register allocator, with the proviso that it should endeavour to allocate registers so as to minimise the number of non-redundant reg-reg moves in the final output.

== Selecting insns for 64-bit values/loads/stores on 32-bit platforms ==

Note that this stuff doesn't apply on 64-bit archs, since the `getRegister` mechanism applies there. The relevant functions are:
{{{
assignMem_I64Code :: StixExpr -> StixExpr -> NatM InstrBlock
assignReg_I64Code :: StixReg  -> StixExpr -> NatM InstrBlock
iselExpr64        :: StixExpr -> NatM ChildCode64

data ChildCode64     -- a.k.a "Register64"
   = ChildCode64
        InstrBlock   -- code
        VRegUnique   -- unique for the lower 32-bit temporary
}}}

`iselExpr64` is the 64-bit, plausibly-named analogue of `getRegister`, and `ChildCode64` is the analogue of `Register`. The aim here was to generate working 64-bit code as simply as possible. To this end, I used the simplified `getRegister` scheme described above, in which `iselExpr64` generates its results into two vregs which can always safely be modified afterwards.

Virtual registers are, unsurprisingly, distinguished by their `Unique`s. There is a small difficulty in how to know what the vreg for the upper 32 bits of a value is, given the vreg for the lower 32 bits. The simple solution adopted is to say that any low-32 vreg may also have a hi-32 counterpart which shares the same unique, but is otherwise regarded as a separate entity. `getHiVRegFromLo` gets one from the other.
{{{
data VRegUnique
   = VRegUniqueLo Unique          -- lower part of a split quantity
   | VRegUniqueHi Unique          -- upper part thereof
}}}
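
A minimal sketch of this pairing trick, with `Unique` simplified to a plain `Int` (the real `getHiVRegFromLo` works over GHC's actual `Unique` type):

```haskell
module Main where

-- Simplified: GHC's Unique is an abstract type, not an Int.
type Unique = Int

data VRegUnique
  = VRegUniqueLo Unique    -- lower 32 bits of a split 64-bit quantity
  | VRegUniqueHi Unique    -- upper 32 bits thereof
  deriving (Eq, Show)

-- The hi half shares the lo half's unique but is a distinct register.
getHiVRegFromLo :: VRegUnique -> VRegUnique
getHiVRegFromLo (VRegUniqueLo u) = VRegUniqueHi u
getHiVRegFromLo (VRegUniqueHi _) = error "getHiVRegFromLo: already a hi vreg"

main :: IO ()
main = print (getHiVRegFromLo (VRegUniqueLo 42))
```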
Apart from that, 64-bit code generation is really simple. The sparc and x86 versions are almost copy-n-pastes of each other, with minor adjustments for endianness. The generated code isn't wonderful but is certainly acceptable, and it works.

== Shortcomings and inefficiencies in the register allocator ==

=== Redundant reconstruction of the control flow graph ===

The allocator goes to considerable computational expense to construct all the flow edges in the group of instructions it's allocating for, by using the `insnFuture` function in the `Instr` pseudo-abstract type.

This is really silly, because all that information is present at the abstract C stage, but is thrown away in the translation to Stix. So a good thing to do is to modify that translation to produce a directed graph of Stix straight-line code blocks, and to preserve that structure through the insn selector, so the allocator can see it.

This would eliminate the fragile, hacky, arch-specific `insnFuture` mechanism, and probably make the whole compiler run measurably faster. Register allocation is a fair chunk of the time of non-optimising compilation (10% or more), and reconstructing the flow graph is an expensive part of reg-alloc. It would probably accelerate the vreg liveness computation too.

=== Really ridiculous method for doing spilling ===

This is a more ambitious suggestion, but ... reg-alloc should be reimplemented, using the scheme described in "Quality and speed in linear-scan register allocation" (Traub, Holloway and Smith). For straight-line code blocks, this gives an elegant one-pass algorithm for assigning registers and creating the minimal necessary spill code, without the need for reserving spill registers ahead of time.

I tried it in Rigr, replacing the previous spiller which used the current GHC scheme described above, and it cut the number of spill loads and stores by a factor of eight. Not to mention being simpler, easier to understand and very fast.

The Traub paper also describes how to extend their method to multiple basic blocks, which will be needed for GHC. It comes down to reconciling multiple vreg-to-rreg mappings at points where control flow merges.
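
The flavour of the one-pass idea for straight-line code can be sketched as follows. This is an illustrative toy, not the paper's (or GHC's) algorithm: live intervals are given up front as (start, end) instruction indices, expired intervals return their realreg to the free pool, and a vreg is marked spilled when no realreg is free:

```haskell
module Main where

-- Toy linear-scan-style allocation over precomputed live intervals.
import Data.List (sortOn, partition)

type VReg = String
type RReg = String

linearScan :: [RReg] -> [(VReg, (Int, Int))] -> [(VReg, Maybe RReg)]
linearScan rregs intervals = go (sortOn (fst . snd) intervals) rregs []
  where
    -- active holds (endPoint, vreg, rreg) for intervals still live.
    go [] _ _ = []
    go ((v, (s, e)) : rest) free active =
      let (expired, active') = partition (\(e', _, _) -> e' < s) active
          free' = free ++ [r | (_, _, r) <- expired]   -- reclaim rregs
      in case free' of
           []           -> (v, Nothing) : go rest free' active'    -- spill
           (r : free'') -> (v, Just r) : go rest free'' ((e, v, r) : active')

main :: IO ()
main = print (linearScan ["eax"]
                         [ ("v1", (0, 1)), ("v2", (2, 5)), ("v3", (3, 4)) ])
```

With a single realreg, `v1` and `v2` share it (their intervals don't overlap) while `v3` must spill. The paper's real contribution is doing this while also emitting minimal spill code on the fly, which the sketch omits.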

=== Redundant-move support for revised instruction selector suggestion ===

As mentioned above, simplifying the instruction selector will require the register allocator to try and allocate source and destination vregs to the same rreg in reg-reg moves, so as to make as many as possible go away. Without that, the revised insn selector would generate worse code than at present. I know this stuff has been done but know nothing about it. The linear-scan reg-alloc paper mentioned above does indeed mention a bit about it in the context of single basic blocks, but I don't know if that's sufficient.

== x86 arcana that you should know about ==

The main difficulty with x86 is that many instructions have fixed register constraints, which can occasionally make reg-alloc fail completely. And the FPU doesn't have the flat register model which the reg-alloc abstraction (implicitly) assumes.

Our strategy is: do a good job for the common small subset, that is, integer loads, stores, address calculations, basic ALU ops (+, -, and, or, xor), and jumps. That covers the vast majority of executed insns. And indeed we do do a good job, with a loss of less than 2% compared with gcc.

Initially we tried to handle integer instructions with awkward register constraints (mul, div, shifts by non-constant amounts) via various jigglings of the spiller et al. This never worked robustly, and putting platform-specific tweaks in the generic infrastructure is a big No-No. (Not quite true; shifts by a non-constant amount are still done by a giant kludge, and should be moved into this new framework.)

Fortunately, all such insns are rare. So the current scheme is to pretend that they don't have any such constraints. This fiction is carried all the way through the register allocator. When the insn finally comes to be printed, we emit a sequence which copies the operands through memory (`%esp`-relative), satisfying the constraints of the real instruction. This localises the gruesomeness to just one place. Here, for example, is the code generated for integer division of `%esi` by `%ecx`:
{{{
# BEGIN IQUOT %ecx, %esi
pushl $0
pushl %eax
pushl %edx
pushl %ecx
movl  %esi, %eax
cltd
idivl 0(%esp)
movl  %eax, 12(%esp)
popl %edx
popl %edx
popl %eax
popl %esi
# END   IQUOT %ecx, %esi
}}}
This is not quite as appalling as it seems, if you consider that the division itself typically takes 16+ cycles, whereas the rest of the insns probably go through in about 1 cycle each.

This trick is taken to extremes for FP operations.

All notions of the x86 FP stack and its insns have been removed. Instead, we pretend, to the instruction selector and register allocator, that x86 has six floating point registers, `%fake0` .. `%fake5`, which can be used in the usual flat manner. We further claim that x86 has floating point instructions very similar to SPARC and Alpha, that is, a simple 3-operand register-register arrangement. Code generation and register allocation proceed on this basis.

When we come to print out the final assembly, our convenient fiction is converted to dismal reality. Each fake instruction is independently converted to a series of real x86 instructions. `%fake0` .. `%fake5` are mapped to `%st(0)` .. `%st(5)`. To do reg-reg arithmetic operations, the two operands are pushed onto the top of the FP stack, the operation done, and the result copied back into the relevant register. When one of the operands is also the destination, we emit a slightly less scummy translation. There are only six `%fake` registers because two are needed for the translation, and x86 has eight in total.

The translation is inefficient but is simple and it works. A cleverer translation would handle a sequence of insns, simulating the FP stack contents, would not impose a fixed mapping from `%fake` to `%st` regs, and hopefully could avoid most of the redundant reg-reg moves of the current translation.
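
The shape of the per-instruction expansion can be sketched like so. The pushes shift the stack, so source and destination slot numbers must be adjusted as the comments note; the opcode spellings and operand forms here are illustrative, not GHC's actual output:

```haskell
module Main where

-- Illustrative expansion of a flat 3-operand fake FP instruction
-- (dst := src1 `op` src2 over %fake0..%fake5, i.e. %st(0)..%st(5))
-- into stack-based x87 code.
data FakeOp = FADD | FMUL deriving Show

expand :: FakeOp -> Int -> Int -> Int -> [String]
expand op d s1 s2 =
  [ "fld   " ++ st s1          -- push a copy of the first operand
  , "fld   " ++ st (s2 + 1)    -- push the second; the stack has shifted by one
  , opcode op                  -- combine the two stack tops, popping one
  , "fstp  " ++ st (d + 1)     -- pop the result into the destination slot
  ]
  where
    st n = "%st(" ++ show n ++ ")"
    opcode FADD = "faddp %st, %st(1)"
    opcode FMUL = "fmulp %st, %st(1)"

main :: IO ()
main = mapM_ putStrLn (expand FADD 0 1 2)
```

Each fake insn temporarily deepens the stack by two, which is why two of the eight x87 slots must stay free and only six `%fake` registers exist.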

There are, however, two unforeseen bad side effects:

 * This doesn't work properly, because it doesn't observe the normal conventions for x86 FP code generation. It turns out that each of the 8 elements in the x86 FP register stack has a tag bit which indicates whether or not that register is notionally in use. If you do an FPU operation which happens to read a tagged-as-empty register, you get an x87 FPU (stack invalid) exception, which is normally handled by the FPU without passing it to the OS: the program keeps going, but the resulting FP values are garbage. The OS can ask for the FPU to pass it FP stack-invalid exceptions, but it usually doesn't.

 Anyways: inside NCG-created x86 FP code this all works fine. However, the NCG's fiction of a flat register set does not operate the x87 register stack in the required stack-like way. When control returns to a gcc-generated world, the stack tag bits soon cause stack exceptions, and thus garbage results.

 The only fix I could think of -- and it is horrible -- is to clear all the tag bits just before the next STG-level entry, in chunks of code which use FP insns. `i386_insert_ffrees` inserts the relevant `ffree` insns into such code blocks. It depends critically on `is_G_instr` to detect such blocks.

 * It's very difficult to read the generated assembly and reason about it when debugging, because there's so much clutter. We print the fake insns as comments in the output, and that helps a bit.

== Generating code for ccalls ==

For reasons I don't really understand, the instruction selectors for generating calls to C (`genCCall`) have proven surprisingly difficult to get right, and soaked up a lot of debugging time. As a result, I have once again opted for schemes which are simple and not too difficult to argue as correct, even if they don't generate excellent code.

The sparc ccall generator in particular forces all arguments into temporary virtual registers before moving them to the final out-registers (`%o0` .. `%o5`). This creates some unnecessary reg-reg moves. The reason is explained in a comment in the code.

== Duplicate implementation for many STG macros ==

This has been discussed at length already. It has caused a couple of nasty bugs due to subtle untracked divergence in the macro translations. The macro-expander really should be pushed up into the Abstract C phase, so the problem can't happen.

Doing so would have the added benefit that the NCG could be used to compile more "ways" -- well, at least the 'p' profiling way.

== How to debug the NCG without losing your sanity/hair/cool ==

Last, but definitely not least ...

The usual syndrome is that some program, when compiled via C, works, but not when compiled via the NCG. Usually the problem is fairly simple to fix, once you find the specific code block which has been mistranslated. But the latter can be nearly impossible, since most modules generate at least hundreds and often thousands of them.

My solution: cheat.

Because the via-C and native routes diverge only late in the day, it is not difficult to construct a 1-1 correspondence between basic blocks on the two routes. So, if the program works via C but not via the NCG, do the following:
 * Recompile `AsmCodeGen.lhs` in the afflicted compiler with `-DDEBUG_NCG`, so that it inserts `___ncg_debug_markers` into the assembly it emits.
 * Using a binary search on modules, find the module which is causing the problem.
 * Compile that module to assembly code, with identical flags, twice, once via C and once via the NCG. Call the outputs `ModuleName.s-gcc` and `ModuleName.s-nat`. Check that the latter does indeed have `___ncg_debug_markers` in it; otherwise the next steps fail.
 * Build (with a working compiler) the program `utils/debugNCG/diff_gcc_nat`.
 * Run: `diff_gcc_nat ModuleName.s`. This will construct the 1-1 correspondence, and emits on stdout a cppable assembly output. Place this in a file -- I always call it `synth.S`. Note, the capital S is important; otherwise it won't get cpp'd. You can feed this file directly to ghc and it will automatically get cpp'd; you don't have to do so yourself.
 * By messing with the `#define`s at the top of `synth.S`, do a binary search to find the incorrect block. Keep a careful record of where you are in the search; it is easy to get confused. Remember also that multiple blocks may be wrong, which also confuses matters. Finally, I usually start off by re-checking that I can build the executable with all the `#define`s set to 0 and then all set to 1. This ensures you won't get halfway through the search and then get stuck due to some snafu with gcc-specific literals. Usually I set `UNMATCHED_GCC` to 1 all the time, and this bit should contain only literal data. `UNMATCHED_NAT` should be empty.

`diff_gcc_nat` was known to work correctly last time I used it, in December 01, for both x86 and sparc. If it doesn't work, due to changes in assembly syntax, or whatever, make it work. The investment is well worth it. Searching for the incorrect block(s) any other way is a total time waster.