wiki:LightweightCCSCallStack

Version 5 (modified by mnislaih, 6 years ago) (diff)

--

We have been talking about this for a long time and never come around to actually implement it. During the Hackathon Freiburg'07 Alexey Rodriguez and me (Pepe Iborra) have been talking about how to do this.

The Basic Idea

The only known way to obtain a call trace in a lazy language which resembles in some way a strict call trace is by reusing the Cost Center Stack framework. In order to have backtraces one does not need the full CCS machinery. In particular there is no need for keeping track of costs and live objects in heap. But CCS are still too heavyweight because one needs to annotate thunks with the current CCS when they are allocated. Current implementation of CCS stores this, together with cost data, in the info table. Our proposal is that, since all we need is a Word, we can move this to the header for AP and PAP closures. The interpreter will make use of this new field whereas compiled code can safely ignore it. Therefore this proposal does not annotate Thunk closures, only AP and PAP closures.

The Details

Storing the stack of CCS

For creating the CCS, we will follow the semantics found in [Samson 97], ignoring all the cost assignment stuff. As ghc currently does, we follow the extensions in Jarvis 93 but ignoring the bits about collapsing the stack traces.

In principle, the current CCS configuration would be stored in a global variable in the C side, living in the interpreter. However, For reasons that will become clear when we talk about case statements, we are going to need a stack of CCS. Hence the current CCS will be a pointer to the head of this stack, and the stack will consist of pointers to rows in the table.

A CCS configuration is a list of tuples of (tick #, module). As expected, we keep a table with all the possible CCS configurations; the current CCS conf. is a word pointing at one of these entries. The actual data structure used for this table should be the one described by Jarvis.

To make all of this clear, we have:

  • A table of possible CCS configurations
  • A current CCS register, being a pointer to a row in the table

We define the following primitive operations on these structures:

  • PUSH_CENTER Pushes a new cost center in the current CCS. For this it needs to append the cost center to the current configuration, look at the table for this configuration and possibly extend the table if the configuration is not there.

We will take advantage of ticks to guide the insertion of Cost Centers in the current CCS. For this, we will distinguish ticks that are placed at the beginning of top level function declarations (How? Perhaps with a distinguished BCI?) When the evaluator sees one such tick, it will call PUSH_CENTER with the current tick.

Annotating thunks

There are two closure types that we must consider: APs and PAPs. APs are only created and dealt with by the bytecode interpreter, whereas PAPs interact with the compiled code too. These closures must be annotated with the CCS configuration at the moment of their allocation, so that this CCS can be temporarily installed when the closure is being entered. The proposal is to extend their headers with an extra word to store this.

  • Allocation of AP and PAP closures in the bytecode interpreter: look at the current CCS and store it in the closure header.
  • Entering AP closures: Following the semantics of [Samson 97], we must
  1. Extract the SCC annotated in the closure and push it in the CCS stack
  2. Deallocate the AP closure and enter it
  3. When we are done and before leaving, pop the CCS stack.
  • Entering PAP closures: propagate the CCS annotation in the closure to the AP being allocated. (I am assuming that PAPs are never entered directly, even if the application is saturated)

Case statements

Following the semantics, we must restore the CCS after the scrutinee of a case statement has been entered. -prof compiled code does this by pushing the CCS to the stack and having the case continuation restore it before proceeding to the case alternative.

For us it will be more troublesome since we have to deal with non -prof compiled code, which we cannot modify. Case continuations in this code do not know about the CCS. Only BCOs can modify the CCS. The scenario we have to deal with is the compiled code entering a BCO with the right number of arguments in the scrutinee of a case statement. Since that call involves a context switch to the bytecode interpreter, maybe there is an opportunity to detect and handle CCS restoring in the next switch (it seems very unlikely though).

AP/PAP/Thunk updates

When entering a thunk-like closure we load the CCS recorded in the thunk, and when the thunk is updated we must restore the initial CCS.

But not all thunk-like closures, right? Well, even though we do not annotate Thunk closures and therefore there is no recorded CCS to load, we still have to restore the CCS after they have been forced.

We can handle APs easily since we are in the bytecode interpreter, PAPs might prove a bit more difficult but probably manageable, since it looks like the code for handling them is in the RTS. But how to handle Thunks being forced in the compiled code? Even if we could install special update frames, who is going to stack-push a copy of the original CCS before entering the Thunk?

QUESTION: Do we really need to restore the CCS after an update? The semantics in Samson does not say so, and the restore would anyways be redundant since thunks are forced only by case statements, and those already do a restore. However, currently GHC does the restore after an update:

 HS:
     2	boolfun x y = let {-# NOINLINE z #-}  
     3		              z = show y 
     4		          in if x then "2"++z else "3"++z
     5	
     6	
 STG:
     8	....
     9	boolfun_rdc =
    10	   CCS_SUBSUMED sat-only \r srt:(0,*bitmap*) [$dShow_s12l x_s12u y_s12q]
    11	       let {
    12	         z_s12t =
    13	             CCCS \u []
    14	                 case $dShow_s12l of tpl_s13v {
    15	                   GHC.Show.:DShow tpl1_s13w tpl2_s12r tpl3_s13x -> tpl2_s12r y_s12q;
    16	                 };
    17	       } in ...
    18	
 CMM: 
    20	.....
    21	
    22	-- code for z_s12t
    23	s12t_entry() {
    24	
    25	 ...
    26	       I32[Sp + (-16)] = stg_upd_frame_info;  -- Push the update frame
    27	       I32[Sp + (-4)] = R1;                   
    28	       I32[Sp + (-12)] = I32[CCCS];           -- Store current CCS
    29	       I32[CCCS] = I32[R1 + 4];               -- Load thunk CCS
    30	
    31	 //  Here comes the code for the case statement
    32	       I32[Sp + (-24)] = I32[CCCS];           -- Store current CCS
    33	       I32[Sp + (-20)] = I32[R1 + 20];        -- Free Vars
    34	       R1 = I32[R1 + 16];                     
    35	       I32[Sp + (-28)] = s13v_info;           -- case continuation
    36	       Sp = Sp + (-28);
    37	       jump I32[R1];                          -- enter(evaluate) the dictionary for Show
    38	}
    39	
    40	-- code for the case continuation
    41	s13v_ret() {
    42	...
    43	       I32[CCCS] = I32[Sp + 4]; -- Restore the CCS found in the 1st stack slot
    44	                                -- this will be the original CCS, put there by line 32
    45	       R1 = I32[R1 + 16];
    46	       R2 = I32[Sp + 8];
    47	       Sp = Sp + 12;
    48	       jump stg_ap_p_fast; 

After line 36, the stack looks as follows

-4 : -> R1 (Thunk z_s12t ?)
-8 :
-12: CCCS
-16: Update Frame
-20: Free Vars
-24: CCCS
-28: Cont case

The example shows that -prof code is doing a CCS restore after the update, which to me indeed seems redundant. What am I missing? If it is not necessary, that means one less problem for us.

What we expect to obtain

We cannot observe compiled code for two reasons: it has no ticks that can guide cost center introduction, and the thunks it allocates lack CCS annotations. Therefore our backtraces will be restricted to interpreted code; this is not too bad in my opinion(pepe).

In principle the overhead incurred should be acceptable, but we are worried that ticks will interact badly with dup'ing/pop'ing around case statements scrutinees, given the huge number of those showing up in the bytecode (as revealed by -ddump-bcos).

Caveats

The boxing trick

CCS doesn't do exactly what it is supposed to do. It gives strange results for CAFs and for some higher order code. The latter can be minimised by a helper simple program transformation that boxes higher order arguments in function calls. For example:

f x = error "f"

id x = x

main = id f 3

you'll see the stack <main,id,f>, when it should be <main,f>, because lexically f occurs in the body of main, not in id. By doing the boxing trick the body of main becomes "let f' = f in id f' 3", which recovers the desired stack trace.

double = scc "double" \x. ...
map    = scc "map"    \f. \xs. ... f x ...
main   = scc "main"   map double xs

Here double is called by main, not by map. So we must "capture" the stack with double when we pass it to map, like this:

 main  = scc "main"  map (box double) xs
     where box x = x

Or

 main  = scc "main"  let double' = double in map double' xs

Indirections mess with the boxing trick

Simon: IIRC, there's a problem when updating a thunk with an indirection to a FUN. In this case, the CCS that was stored in the thunk is lost, but we really need it when entering the FUN. So the idea is to overwrite the thunk with a permanent indirection (IND_PERM) containing the CCS. However, I've just checked and I don't think the current RTS does this, which looks like a bug - it might have got lost in the conversion to eval/apply. The trouble is, without this the boxing trick won't work (because the box is a thunk that evaluates to a FUN). I don't know of a good way to fix this right now.

CAFs

Handling of CAFs is a problem for any approach to stack traces in Haskell. The problem is as follows. Constant applicative forms(top level expressions with no arguments) are evaluated only the first time they are accessed, as by the dynamic semantics of Haskell. But if they are called more than once, why should the first caller be blamed for the cost? Instead, the costs are assigned to an artificial cost center called CAF("SUB" in Samson). For example:

 g = scc g 
     let j = \x. double x in
     \h. h j;

 main = 
    let h = \f. f 21 in
    \s. scc main g h;

here j is a function closure, free in the right hand side of g. j gets its cost centre when g is first evaluated, and it'll be <CAF,g>. This is wrong: g might be called from a deeper context (e.g. <TOP,main>, in the definition of main), and we need to record that when the RHS of j is executing it is part of this deeper context.

In our context this still applies, since though we do not have the problem of "sharing" the costs, CAFs cannot be assigned the right stack trace. A non option is changing the dynamic semantics so that CAFs are re-evaluated every time.

A bigger example (test12):

constr   = scc constr \x. \f. f x;
deconstr = \p. let f = \x. x in p f;

g = scc g 
   let j = \x. double x in
   let p = constr j in
   \h. scc g1 let p1 = p in h p1;

main = 
  scc main 
   let h = \p. deconstr p c in
   \x. scc main1 g h;

Now, both p and j are free in the lambda expression in the definition of g. Also, j is free in p. This gives us a problem: we can't box j and p, because that doesn't box the j that is free in p. By the time p has been evaluated, it has captured j with a a fixed stack, so it's too late. Note that we cannot box j in the body of p because p is updatable, hence the box would capture the context only in the first call to j via p, producing incorrect contexts in all the forthcoming calls to j via p.

We have to transform p so as to abstract j (test14).

g = scc g 
   let j = \x. double x in
   let p = \j. constr j in
   \h. scc g1 let j1 = j; p1 = p j1 in h p1;

this works, but it's bad: we lost sharing.

We can simplify a bit, lifting out the lets:

id       = \x . x; 
constr   = \x. \f. f x;

j = \x. double x;
p = constr j;

main = 
  scc main 
   let h = \p. p id c in
   \x. scc main1 h p;

The problem is that we can't use the boxing trick for the argument j in the definition of p, because at the point where the argument is passed, we aren't inside a lambda and therefore boxing doesn't have any effect: it can't capture the context. We can abstract j as a parameter, but that might lose sharing. We may know nothing about the function constr, so we can't inline it or know whether it is safe to eta-expand.

Dictionaries

Desugaring of type classes to dictionary code brings up new case statements to evaluate the dictionaries. As an example, the expression

{{{boolfun x y = let {-# NOINLINE z #-}

z = show y in

if x then "2"++z else "3"++z

}}}

desugars to something like

boolfun dShow x y = case dShow of 
	GHC.Show.:DShow showsPrec show showList -> 
		let z = ... in
		if ...

which includes an extra case statement for the Show dictionary. Since we are not interested in cost assignment, we should be able to ignore case statements related to dictionaries.

References

  • [Samson 97] P. Samson and S.P.J. - Formally based profiling for Higher-Order languages
  • [Jarvis 93] R.G.Morgan and S.A.Jarvis - Profiling large scale lazy functional programs