Changes between Version 3 and Version 4 of LightweightCCSCallStack


Ignore:
Timestamp:
Oct 29, 2007 12:02:34 PM (6 years ago)
Author:
mnislaih
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • LightweightCCSCallStack

    v3 v4  
    55 
    66The only known way to obtain a call trace in a lazy language which resembles in some way a strict call trace is by reusing the Cost Center Stack framework. In order to have backtraces one does not need the full CCS machinery. In particular there is no need for keeping track of costs and live objects in heap.  
    7 But CCS are still too heavyweight because one needs to annotate thunks with the current CCS when they are allocated. Current implementation of CCS stores this in the info table. Our proposal is that, since all we need is a Word, we can move this to the header for AP and PAP closures. The interpreter will make use of this new field whereas compiled code can safely ignore it. 
    8  
     7But CCS are still too heavyweight because one needs to annotate thunks with the current CCS when they are allocated. Current implementation of CCS stores this, together with cost data, in the info table. Our proposal is that, since all we need is a Word, we can move this to the header for AP and PAP closures. The interpreter will make use of this new field whereas compiled code can safely ignore it. Therefore this proposal does not annotate Thunk closures, only AP and PAP closures. 
    98 
    109= The Details = 
     
    1211== Storing the stack of CCS == 
    1312 
    14 For creating the CCS, we will follow the semantics found in [Samson 97] and extended in [Jarvis 93], ignoring all the cost assignment stuff.  
    15 In principle, the current CCS configuration would be stored in a global variable in the C side, living in the interpreter.  However, For reasons that will become clear when we talk about case statements, we are going to need a stack of CCCS. Hence the current CCS will be a pointer to the head of this stack, and the stack will consist of pointers to rows in the table. A CCS configuration is a list of tuples of (tick #, module). As expected, we also keep a table of possible CCS configurations; the current CCS conf. is a word pointing at one of these entries. 
     13For creating the CCS, we will follow the semantics found in [Samson 97], ignoring all the cost assignment stuff. As ghc currently does, we follow the extensions in Jarvis 93 but ignoring the bits  about collapsing the stack traces. 
     14 
     15In principle, the current CCS configuration would be stored in a global variable in the C side, living in the interpreter.  However, For reasons that will become clear when we talk about case statements, we are going to need a stack of CCS. Hence the current CCS will be a pointer to the head of this stack, and the stack will consist of pointers to rows in the table.  
     16 
     17A CCS configuration is a list of tuples of (tick #, module). As expected, we keep a table with all the possible CCS configurations; the current CCS conf. is a word pointing at one of these entries. The actual data structure used for this table should be the one described by Jarvis. 
    1618 
    1719To make all of this clear, we have: 
    1820 
    19  * A current CCS stack 
    20  * A pointer to the top of this stack 
    2121 * A table of possible CCS configurations 
     22 * A current CCS register, being a pointer to a row in the table 
    2223 
    2324We define the following primitive operations on these structures: 
    2425 
    25  * PUSH_CENTER Pushes a new cost center in the current CCS at the top of the CCS stack. For this it needs to look at the table for the target configuration and possibly extend it if it does not already exist 
    26  * DUP_STACK Duplicates the current top of the stack. Used before entering the scrutinee in a case statement 
    27  * PUSH_STACK Pushes a CCS in the stack. Used when deallocating APs. 
    28  * POP_STACK  Discards the top of the stack out. Used after entering the scrutinee of a case statement 
     26 * PUSH_CENTER Pushes a new cost center in the current CCS. For this it needs to append the cost center to the current configuration, look at the table for this configuration and possibly extend the table if the configuration is not there. 
    2927 
    3028We will take advantage of ticks to guide the insertion of Cost Centers in the current CCS. For this, we will distinguish ticks that are placed at the beginning of top level function declarations (How? Perhaps with a distinguished BCI?) When the evaluator sees one such tick, it will call PUSH_CENTER with the current tick. 
     
    3230== Annotating thunks == 
    3331 
    34 There are two closure types that we must consider: APs and PAPs. APs are only created and seen by interpreted code (really?) whereas PAPs interact with compiled code too. These closures must be annotated with the CCS configuration at the moment of their allocation. We propose to simply extend their headers with an extra word to store this.  
    35  
    36 We will have to change the following bits of code for this: 
    37  
    38 * Allocating AP and PAP closures. There are two scenarios we consider: 
    39  * Allocating a PAP/AP from scratch: for this look at the current CCS (top of the CCS stack) and store it in the closure header. 
    40  * Building a PAP/AP on top of a previous PAP: for this propagate the CCS stored in the base PAP, extend with an extra segment and store the result in the new closure header. The extra segment added is the identifier of the current function, which we can obtain by looking at the current CCS configuration and keeping the last element in the list of identifiers.  
    41 {{{ 
    42 Example:  
    43 We have the base PAP annotated with CCS '''[main,f,g,h]'''. 
    44 The current CCS configuration at the moment of deallocating the base PAP is '''[main,f,j]'''. 
    45 We deallocate the base PAP, allocate the new AP closure, and annotate it with CCS '''[main,f,g,h,j]'''. 
    46 }}} 
    47 * Deallocating AP closures: Following the semantics of [Samson 97], we must 
     32There are two closure types that we must consider: APs and PAPs. APs are only created and dealt with by the bytecode interpreter, whereas PAPs interact with the compiled code too. These closures must be annotated with the CCS configuration at the moment of their allocation, so that this CCS can be temporarily installed when the closure is being entered. The proposal is to extend their headers with an extra word to store this. 
     33 
     34 * Allocation of AP and PAP closures in the bytecode interpreter: look at the current CCS and store it in the closure header. 
     35 * Entering AP closures: Following the semantics of [Samson 97], we must 
    4836 1. Extract the SCC annotated in the closure and push it in the CCS stack 
    4937 1. Deallocate the AP closure and enter it 
    5038 1. When we are done and before leaving, pop the CCS stack. 
    5139 
    52 * Deallocating PAP closures: propagate the CCS annotation in the closure to the AP being allocated. (I am assuming that PAPs are never entered directly, even if the application is saturated) 
    53  
    54 == Handling case statements == 
    55  
    56 Following the semantics, we must restore the CCS after the scrutinee of a case statement has been entered. I.e. we should duplicate the CCS stack top before entering the scrutinee, and pop out before continuing to the alternatives. We are not very sure on how to do this, but the current plan is the following.  
    57  
    58  1. When we see a PUSH_ALT instruction, we do a DUP_STACK in the CCS stack. 
    59  1. After that, we know that the scrutinee will be entered, with a continuation. We modify the code generated for this continuation so that it will pushes a POP_CCS in the GHC stack. 
    60  
    61 POP_CCS will be some piece of code that will pop the CCS stack.  
    62 (I am not very sure how this part works since Alexey is not here and my knowledge of all this is very vague). 
    63  
    64  
    65 = Further Details = 
    66  
    67 The CCS table should be implemented efficiently following the data structure advices given in [Samson 97]. Hopefully there is already code for this in the profiling infrastructure. 
     40 * Entering PAP closures: propagate the CCS annotation in the closure to the AP being allocated. (I am assuming that PAPs are never entered directly, even if the application is saturated) 
     41 
     42== Case statements == 
     43 
     44Following the semantics, we must restore the CCS after the scrutinee of a case statement has been entered. -prof compiled code does this by pushing the CCS to the stack and having the case continuation restore it before proceeding to the case alternative.  
     45 
     46For us it will be more troublesome since we have to deal with non -prof compiled code, which we cannot modify. Case continuations in this code do not know about the CCS. 
     47Only BCOs can modify the CCS. The scenario we have to deal with is the compiled code entering a BCO with the right number of arguments in the scrutinee of a case statement. Since that call involves a context switch to the bytecode interpreter, maybe there is an opportunity to detect and handle CCS restoring in the next switch (it seems very unlikely though). 
     48 
     49== AP/PAP/Thunk updates == 
     50 
     51When entering a thunk-like closure we load the CCS recorded in the thunk, and when the thunk is updated we must restore the initial CCS.  
     52 
     53''But not all thunk-like closures, right?'' Well, even though we do not ''annotate'' Thunk closures and therefore there is no recorded CCS to load, we still have to restore the CCS after they have been forced.  
     54 
     55We can handle APs easily since we are in the bytecode interpreter, PAPs might prove a bit more difficult but probably manageable, since it looks like the code for handling them is in the RTS. But how to handle Thunks being forced in the compiled code? Even if we could install special update frames, who is going to stack-push a copy of the original CCS before entering the Thunk? 
     56 
     57QUESTION: Do we really need to restore the CCS after an update? The semantics in Samson does not say so, and the restore would anyways be redundant since thunks are forced only by case statements, and those already do a restore. However, currently GHC does the restore after an update: 
     58 
     59{{{ 
     60 HS: 
     61     2  boolfun x y = let {-# NOINLINE z #-}   
     62     3                        z = show y  
     63     4                    in if x then "2"++z else "3"++z 
     64     5   
     65     6   
     66 STG: 
     67     8  .... 
     68     9  boolfun_rdc = 
     69    10     CCS_SUBSUMED sat-only \r srt:(0,*bitmap*) [$dShow_s12l x_s12u y_s12q] 
     70    11         let { 
     71    12           z_s12t = 
     72    13               CCCS \u [] 
     73    14                   case $dShow_s12l of tpl_s13v { 
     74    15                     GHC.Show.:DShow tpl1_s13w tpl2_s12r tpl3_s13x -> tpl2_s12r y_s12q; 
     75    16                   }; 
     76    17         } in ... 
     77    18   
     78 CMM:  
     79    20  ..... 
     80    21   
     81    22  -- code for z_s12t 
     82    23  s12t_entry() { 
     83    24   
     84    25   ... 
     85    26         I32[Sp + (-16)] = stg_upd_frame_info;  -- Push the update frame 
     86    27         I32[Sp + (-4)] = R1;                    
     87    28         I32[Sp + (-12)] = I32[CCCS];           -- Store current CCS 
     88    29         I32[CCCS] = I32[R1 + 4];               -- Load thunk CCS 
     89    30   
     90    31   //  Here comes the code for the case statement 
     91    32         I32[Sp + (-24)] = I32[CCCS];           -- Store current CCS 
     92    33         I32[Sp + (-20)] = I32[R1 + 20];        -- Free Vars 
     93    34         R1 = I32[R1 + 16];                      
     94    35         I32[Sp + (-28)] = s13v_info;           -- case continuation 
     95    36         Sp = Sp + (-28); 
     96    37         jump I32[R1];                          -- enter(evaluate) the dictionary for Show 
     97    38  } 
     98    39   
     99    40  -- code for the case continuation 
     100    41  s13v_ret() { 
     101    42  ... 
     102    43         I32[CCCS] = I32[Sp + 4]; -- Restore the CCS found in the 1st stack slot 
     103    44                                  -- this will be the original CCS, put there by line 32 
     104    45         R1 = I32[R1 + 16]; 
     105    46         R2 = I32[Sp + 8]; 
     106    47         Sp = Sp + 12; 
     107    48         jump stg_ap_p_fast;  
     108}}} 
     109 
     110After line 36, the stack looks as follows 
     111{{{ 
     112-4 : -> R1 (Thunk z_s12t ?) 
     113-8 : 
     114-12: CCCS 
     115-16: Update Frame 
     116-20: Free Vars 
     117-24: CCCS 
     118-28: Cont case 
     119}}} 
     120 
     121The example shows that -prof code is doing a CCS restore after the update, which to me indeed seems redundant. What am I missing? If it is not necessary, that means one less problem for us. 
    68122 
    69123= What we expect to obtain = 
     
    72126 
    73127In principle the overhead incurred should be acceptable, but we are worried that ticks will interact badly with dup'ing/pop'ing around case statements scrutinees, given the huge number of those showing up in the bytecode (as revealed by -ddump-bcos). 
     128 
     129= Caveats = 
     130== The boxing trick == 
     131CCS doesn't do exactly what it is supposed to do. It gives strange results for CAFs and for some higher order code. The latter can be minimised by a helper simple program transformation that boxes higher order arguments in function calls. For example: 
     132{{{ 
     133f x = error "f" 
     134 
     135id x = x 
     136 
     137main = id f 3 
     138}}} 
     139 
     140you'll see the stack <main,id,f>, when it should be <main,f>, because lexically f occurs in the body of main, not in id. By doing the boxing trick the body of main becomes "let f' = f in id f' 3", which recovers the desired stack trace. 
     141 
     142{{{ 
     143double = scc "double" \x. ... 
     144map    = scc "map"    \f. \xs. ... f x ... 
     145main   = scc "main"   map double xs 
     146}}} 
     147Here double is called by main, not by map.  So we must "capture" the stack with double when we pass it to map, like this: 
     148{{{ 
     149 main  = scc "main"  map (box double) xs 
     150     where box x = x 
     151}}} 
     152Or 
     153{{{ 
     154 main  = scc "main"  let double' = double in map double' xs 
     155}}} 
     156 
     157 
     158=== Indirections mess with the boxing trick === 
     159 
     160Simon: IIRC, there's a problem when updating a thunk with an indirection to a FUN.  In this case, the CCS that was stored in the thunk is lost, but we really need it when entering the FUN.  So the idea is to overwrite the thunk with a permanent indirection (IND_PERM) containing the CCS.  However, I've just checked and I don't think the current RTS does this, which looks like a bug - it might have got lost in the conversion to eval/apply.  The trouble is, without this the boxing trick won't work (because the box is a thunk that evaluates to a FUN).  I don't know of a good way to fix this right now. 
     161 
     162== CAFs == 
     163 
     164Handling of CAFs is a problem for any approach to stack traces in Haskell. 
     165The problem is as follows. Constant applicative forms(top level expressions with no arguments) are evaluated only the first time they are accessed, as by the dynamic semantics of Haskell. But if they are called more than once, why should the first caller be blamed for the cost? Instead, the costs are assigned to an artificial cost center called CAF("SUB" in Samson). For example: 
     166{{{ 
     167 g = scc g  
     168     let j = \x. double x in 
     169     \h. h j; 
     170 
     171 main =  
     172    let h = \f. f 21 in 
     173    \s. scc main g h; 
     174}}} 
     175 
     176here `j` is a function closure, free in the right hand side of `g`. `j` gets its cost centre when `g` is first evaluated, and it'll be `<CAF,g>`.  This is wrong: `g` might be called from a deeper context (e.g. `<TOP,main>`, in the definition of `main`), and we need to record that when the RHS of `j` is executing it is part of this deeper context. 
     177 
     178In our context this still applies, since though we do not have the problem of "sharing" the costs, CAFs cannot be assigned the right stack trace. A non option is changing the dynamic semantics so that CAFs are re-evaluated every time. 
     179 
     180A bigger example (test12): 
     181{{{ 
     182constr   = scc constr \x. \f. f x; 
     183deconstr = \p. let f = \x. x in p f; 
     184 
     185g = scc g  
     186   let j = \x. double x in 
     187   let p = constr j in 
     188   \h. scc g1 let p1 = p in h p1; 
     189 
     190main =  
     191  scc main  
     192   let h = \p. deconstr p c in 
     193   \x. scc main1 g h; 
     194}}} 
     195Now, both p and j are free in the lambda expression in the definition 
     196of g.  Also, j is free in p.  This gives us a problem: we can't box j 
     197and p, because that doesn't box the j that is free in p.  By the time 
     198p has been evaluated, it has captured j with a a fixed stack, so it's 
     199too late. Note that we cannot box j in the body of p because p is updatable,  
     200hence the box would capture the context only in the first call to j via p,  
     201producing incorrect contexts in all the forthcoming calls to j via p. 
     202 
     203We have to transform p so as to abstract j (test14). 
     204{{{ 
     205g = scc g  
     206   let j = \x. double x in 
     207   let p = \j. constr j in 
     208   \h. scc g1 let j1 = j; p1 = p j1 in h p1; 
     209}}} 
     210this works, but it's bad: we lost sharing.   
     211 
     212We can simplify a bit, lifting out the lets: 
     213{{{ 
     214id       = \x . x;  
     215constr   = \x. \f. f x; 
     216 
     217j = \x. double x; 
     218p = constr j; 
     219 
     220main =  
     221  scc main  
     222   let h = \p. p id c in 
     223   \x. scc main1 h p; 
     224}}} 
     225The problem is that we can't use the boxing trick for the argument j 
     226in the definition of p, because at the point where the argument is 
     227passed, we aren't inside a lambda and therefore boxing doesn't have 
     228any effect: it can't capture the context.  We can abstract j as a 
     229parameter, but that might lose sharing.  We may know nothing about the 
     230function constr, so we can't inline it or know whether it is safe to 
     231eta-expand. 
     232 
     233== Dictionaries == 
     234Desugaring of type classes to dictionary code brings up new case statements to evaluate the dictionaries. As an example, the expression 
     235 
     236{{{boolfun x y = let {-# NOINLINE z #-} 
     237                     z = show y in  
     238                 if x then "2"++z else "3"++z 
     239}}} 
     240 
     241desugars to something like 
     242 
     243{{{ 
     244boolfun dShow x y = case dShow of  
     245        GHC.Show.:DShow showsPrec show showList ->  
     246                let z = ... in 
     247                if ... 
     248}}} 
     249 
     250which includes an extra case statement for the Show dictionary. 
     251Since we are not interested in cost assignment, we should be able 
     252to ignore case statements related to dictionaries. 
    74253 
    75254= References =