Changes between Version 13 and Version 14 of Frisby2013Q1


Timestamp:
Feb 13, 2013 4:16:49 PM (2 years ago)
Author:
nfrisby
Comment:

--

  • Frisby2013Q1

    v13 v14  
    3434 * The original conjecture was that doing it would save allocation: a dynamically allocated closure becomes a static top-level function. 
    3535 * Max Bolingbroke did a quick implementation of this idea some years ago (~mid 2000s), but it seems it was abandoned. I don't know why. 
     36 
     37=== Notes for Write-up === 
     38 
     39  * puzzle's $fEnumItemType.$cenumFromThen gives a nice demonstration: a cascade of let-no-escapes becomes a series of top-level functions, all tail-calls 
    3640 
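To make the note above concrete, here is a hypothetical Haskell sketch (not the actual code from puzzle; the names are invented) of the shape in question: a cascade of local functions that occur only in tail-call position, i.e. let-no-escapes, next to the equivalent program after floating them to the top level, where every call is a plain tail call.

```haskell
-- Before floating: a cascade of let-no-escapes.  Each local function is
-- only ever tail-called, so GHC can compile the lets as let-no-escapes.
cascadeLocal :: Int -> Int
cascadeLocal x =
  let step1 a = step2 (a + 1)   -- tail-calls step2 only
      step2 b = step3 (b * 2)   -- tail-calls step3 only
      step3 c = c - 1
  in step1 x

-- After floating: the same cascade as top-level functions, all tail calls.
step1' :: Int -> Int
step1' a = step2' (a + 1)

step2' :: Int -> Int
step2' b = step3' (b * 2)

step3' :: Int -> Int
step3' c = c - 1

cascadeFloated :: Int -> Int
cascadeFloated = step1'
```

Both versions compute the same result; the difference is only in where the functions are bound and hence how they are compiled.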
    3741=== Method === 
     
    138142  * yy - do not restrict application of the binding's free variables 
    139143 
    140 I have yet to see clear results; there's no variant that's bested the others on most programs' runtime. 
    141  
    142 This is even after I developed some bash script (I'm so sorry) to transpose the NoFib loops; instead of running the entire NoFib suite for one set of switches and then running it again for the next set of switches, and so on, I build all the variants, and then run each variant's version of each program sequentially. I intend for this to reduce noise by improving the time locality of the measurements of the same test. Even so, the noise was bad. 
    143  
    144 So I turned my attention to allocation instead, for now. Roughly, we expect that more floating means (barely) less allocation but worse runtime (by how much?) because some known calls become unknown calls. But, eg, going from nn -> yn --- ie floating functions that undersaturate free variables instead of not floating them --- caused worse allocation! This investigation led to [#MitigatingLNEAbstraction]. 
    145  
    146 Based on that example, it occurred to me that we should only restrict the binding's saturation of its *known* free variables. For example, I was not floating a binding because its RHS applied a free variable, even though that free variable was lambda bound. That decision has no benefit, and indeed was causing knock-on effects that increase allocation (eg [#MitigatingLNEAbstraction]). 
     144There was no variant that bested the others on most programs' runtime. And the data was so noisy that it was difficult to choose which tests to investigate. I eventually developed a bash script (I'm so sorry) to transpose the NoFib loops: instead of running the entire NoFib suite for one set of switches, then running it again for the next set of switches, and so on, I build all the variants and then run each variant's version of each program sequentially. The intent is to reduce noise by improving the time locality of the measurements of the same test. Even so, the noise in Runtime was bad. Eventually, I turned the iterations up to the 40-50 range and found some steadiness. To my surprise, a couple of tests had the best Runtime if we *do* abstract over functions with fast calls! This happens; more on it later (cf [#puzzle-time-issue]). 
     145 
     146I also saw some surprises in allocation. Roughly, we expect that more floating means (barely) less allocation but worse runtime (by how much?) because some known calls become unknown calls. But, eg, going from nn -> yn --- ie floating functions that undersaturate free variables instead of not floating them --- caused worse allocation! This investigation led to [#MitigatingLNEAbstraction]. 
     147 
     148Based on that example, it occurred to me that we should only restrict the binding's saturation of its *known* free variables... duh. For example, we should float a binding even if its RHS exactly applies a free variable, when that free variable is lambda-bound. Not floating in that case has no benefit, and indeed was causing knock-on effects that increase allocation (eg [#MitigatingLNEAbstraction]). 
     149 
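As a concrete illustration of that point, here is a hypothetical Haskell sketch (invented names, not code from any NoFib program): a local function whose RHS exactly applies a free variable that is lambda-bound. Since the call to that variable is already an unknown call, abstracting over it and floating the function to the top level loses nothing.

```haskell
-- Before floating: 'go' closes over the lambda-bound 'k'.  The call
-- 'k y' is already an unknown call, so restricting floating because of
-- it has no benefit.
eachLocal :: (Int -> Int) -> [Int] -> [Int]
eachLocal k xs =
  let go []     = []
      go (y:ys) = k y : go ys   -- exact application of lambda-bound 'k'
  in go xs

-- After floating: 'go' is abstracted over 'k' and moved to the top
-- level.  The call to 'k' is no worse than before.
goTop :: (Int -> Int) -> [Int] -> [Int]
goTop _ []     = []
goTop k (y:ys) = k y : goTop k ys

eachFloated :: (Int -> Int) -> [Int] -> [Int]
eachFloated = goTop
```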
     150After the allocation findings led to some code tweaks, I reran the Runtime tests with high iteration counts. hpg and puzzle were odd cases where ny did the best, by far. Both have low run times, so I cranked the iterations up to 200. hpg's results changed drastically, which I haven't yet resolved. But puzzle's remained; see [#puzzle-time-issue]. 
    147151 
    148152I have yet to determine whether the preservation of fast entries is worth the trouble --- I certainly hope so... the parameter sweeps have taken a lot of time! 
    149153 
    150 To enable further measurements, I have identified the semantics of some ticky counters, cf [#TickyCounters]. 
     154To enable further measurements, I have identified the semantics of some ticky counters, cf [#TickyCounters], and started resurrecting useful ones that are no longer enabled. 
    151155 
    152156==== Mitigating LNE Abstraction ==== 
     
    193197=== Discovered Benefits of LLF === 
    194198 
    195 We didn't see as much decrease in allocation as we would have liked. 
    196  
    197   * nice demonstration of the basic effect in puzzle 
     199We haven't seen as much decrease in allocation as we would have liked, but there have been some nice benefits: 
     200 
     201==== Creates Inliner Opportunities ==== 
     202 
     203Floating functions to the top-level creates more opportunities for the inliner. We've found two ways. 
     204 
    198205  * #7663 - simulates having the inliner discount for free variables like it discounts for parameters 
    199   * Floating functions to the top-level creates more opportunities for the simplifier. 
     206 
     207  * It also decreases the size of functions by floating out internal let-bindings (eg big join points). 
     208 
     209Both of these have been observed on puzzle, with commit feec91b71, it-thunk-limit=10, protect-last-arg. We get a big improvement in both allocation (-15.1%) and runtime (-1.4%) by allowing fast entries to be abstracted over. Oddly, if we additionally disallow undersaturated known calls from being abstracted over, we get another runtime boost (up to -3.9%). These are both unfortunate from the fast-entry perspective, but demonstrate a big win. 
     210 
     211In particular, the worker for the derived equality function for the `StateType` contains a join-point. When the join-point is floated, the worker's Guidance goes from 
     212 
     213{{{IF_ARGS [70 70 80 0 60 60 60 0] 360 20}}} 
     214 
     215to 
     216 
     217{{{IF_ARGS [130 0 0 0 60 0 0 0] 220 20}}} 
     218 
     219while the floated join point itself gets a Guidance of 
     220 
     221{{{IF_ARGS [170 160 0 60 120 0 0] 300 60}}} 
     222 
     223The loss of parameter discounts may be bad, but the reduction in size exemplifies a good thing. 
     224 
     225But there's a bigger change in puzzle's main loop: `$wtransfer` gets a 28% reduction in allocation. Furthermore, one of its contained letrecs gets a 56% reduction. This results in a 15% reduction for the whole program. 
     226 
     227TODO: I need ticky to track LNEs in order to pin down what's happening there. 
     228 
     229==== Creates Simplifier Opportunities ==== 
     230 
     231Floating functions to the top-level creates more opportunities for the simplifier. 
     232 
     233Abstracted from boyer2 (where `f` is a join point): 
    200234 
    201235{{{ 
     
    203237}}} 
    204238 
    205    The let prevents the cases from being merged. Since LLF is so aggressive, it floats f when it otherwise wouldn't be. 
     239The let prevents the cases from being merged. Since LLF is so aggressive, it floats f when it otherwise wouldn't be, enabling the case merge. 
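A hypothetical Haskell sketch of that shape (invented names, not the actual boyer2 code): a local function defined between two cases on the same scrutinee blocks the case merge; once the function is floated to the top level, the two cases can be merged into one.

```haskell
-- Before floating: the 'let' for 'f' sits between two cases on 'mx',
-- so the simplifier cannot merge them.
blocked :: Maybe Int -> Int
blocked mx =
  case mx of
    Nothing -> 0
    Just x  ->
      let f y = y * 2 in        -- local function keeps the cases apart
      case mx of
        Nothing -> f 1          -- unreachable, but not merged away
        Just _  -> f x

-- After floating 'f' to the top level, the cases merge into one.
fTop :: Int -> Int
fTop y = y * 2

merged :: Maybe Int -> Int
merged mx =
  case mx of
    Nothing -> 0
    Just x  -> fTop x
```

The two definitions are semantically equal; the merged form simply has one fewer case and no dead alternative.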
    206240 
    207241=== Miscellaneous Findings ===