Changes between Version 18 and Version 19 of Frisby2013Q1


Timestamp:
Feb 22, 2013 7:30:09 PM (2 years ago)
Author:
nfrisby
Comment:

--

  • Frisby2013Q1

    v18 v19  
    1414 
    1515  * emacs scroll-all-mode 
     16      * use C-o to fix up horizontal alignment without skewing the cursors 
    1617 
    1718  * navigate -ddump-*'s output via the *** headers 
     
    1920  * lots of information: -dverbose-core2core + -ddump-inlinings 
    2021    * simplCore/Simplify has many pprTraces that are commented out 
     22 
     23  * ghc --show-iface M.hi will dump its definitions 
     24      * the unfolding of a wrapper is not shown, because it is not actually in the `.hi` file; `TcIface.tcIfaceWrapper` reconstructs it from the wrapper's type, the worker, and the worker's demand information 
    2125 
    2226  * diff -w can highlight major changes 
     
    2428    * -dsuppress-uniques or sed -r 's/_[[:alnum:]]{2,4}//g' (or leave out the underscore) 
    2529      * removes ''most'' uniques 
    26       * sed is handy also for diffing .ticky files 
     30      * this sed is handy also for diffing .ticky files 
     31 
     32  * a strictly demanded let and a thunk with an unlifted type both become cases in the STG. 
    2733 
    2834=== Ticky Counters === 
     
    373379 
    374380Changing the let-no-escape thunks to \void -> ... closures upstream would then subject the binding to more optimisations. Formerly, its non-lambda status meant that inlining it, eg, would lose sharing. 
     381 
     382== Late Strictness/WW == 
     383 
     384There are two core-to-core passes related to demand (= strictness & usage & CPR): 
     385 
     386  * the demand analyzer pass (stranal/DmdAnal) and 
     387  * the worker-wrapper split (stranal/WorkWrap). 
     388 
     389The split currently happens once in the pipeline, and the demand analysis happens immediately before it. 
     390 
     391{{{ 
     392sat 
     393vectorise 
     394specialize 
     395float-out 
     396float-in 
     397simpl 
     398HERE 
     399float-out 
     400cse 
     401float-in 
     402liberate-case;simpl 
     403spec-constr 
     404simpl 
     405}}} 
     406 
     407This pair of passes can be enabled/disabled by the `-fstrictness` flag. 
     408 
     409Additionally, the demand analyzer can optionally be run before each execution of selected simplifier phases using the `-fstrictness-before` flag. A few direct invocations of the simplifier, eg after vectorisation, are not affected by this flag. 
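
As a reminder of what the split produces, here is a hand-written sketch of a worker and wrapper for a small function that is strict in both arguments and returns a constructed `Int`. The names `sumTo`, `sumToWrapper` and `sumToWorker` are made up for illustration; GHC generates its own names (`$w...`) and performs this on Core, not on source.

{{{
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, (+#), (-#))

-- Original function: strict in both arguments, returns a boxed Int.
sumTo :: Int -> Int -> Int
sumTo 0 acc = acc
sumTo n acc = sumTo (n - 1) (acc + n)

-- Hypothetical wrapper: unboxes the strict arguments, calls the worker,
-- and reboxes the result. It is small, so it can be inlined at call sites.
sumToWrapper :: Int -> Int -> Int
sumToWrapper (I# n) (I# acc) = I# (sumToWorker n acc)
{-# INLINE sumToWrapper #-}

-- Hypothetical worker: the loop runs entirely on unboxed values,
-- so no Int boxes are allocated per iteration.
sumToWorker :: Int# -> Int# -> Int#
sumToWorker 0# acc = acc
sumToWorker n  acc = sumToWorker (n -# 1#) (acc +# n)
}}}

Loaded into GHCi, `sumTo 10 0` and `sumToWrapper 10 0` give the same result; the wrapper's shape is determined by the function's type and its demand information.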
     410 
     411=== The Idea === 
     412 
     413The demands change as we optimize Core terms. The passes are careful to remove the demand info annotation when transforming a term in a way that invalidates the current demand info. Doing so, however, can hinder downstream optimizations. Running the strictness analyzer a second time may therefore be helpful. 
     414 
     415=== The Design === 
     416 
     417I added two flags, `-flate-strictness` and `-flate-wwsplit`. 
     418 
     419  * `-flate-strictness` invokes the demand analyzer after !SpecConstr. 
     420  * `-flate-wwsplit` implies `-flate-strictness` and immediately follows it with a worker-wrapper split. 
     421 
     422I placed them arbitrarily in the pipeline. However, phase ordering is known to have an impact, so it may be worth trying them between the other passes as well. 
     423 
     424==== Initial Core Lint error ==== 
     425 
     426Enabling either flag caused a core lint error when compiling `GHC.Float`. (I anticipate that `-fstrictness-before` would have done the same.) Some cleverness in the `.hi` files was surreptitiously creating an ill-typed unfolding for `GHC.Real.even`. Here's why. 
     427 
     428The unfolding for a wrapper function is not actually stored in a `.hi` file. This unfolding can be reconstructed based on the type of the function and the demand info it had when the split was performed. Thus only the name of the worker is stored in the `.hi` file and `TcIface.tcIfaceWrapper` rebuilds the wrapper's unfolding. 
     429 
     430However, re-running the demand analyzer after the worker-wrapper split may change the wrapper's demand info. In the program that triggered the Core Lint error, the new demand info results in a worker with an arity of 4 while the old info had an arity of 5. 
     431 
     432For now, I've merely disabled the `.hi` cleverness; storing the wrapper's actual unfolding works fine. 
     433 
     434=== The initial measurements === 
     435 
     436Using different flag combinations, I built the libraries and ran nofib tests. 
     437 
     438  1. baseline        = -O1 
     439  1. late-strictness = -O1 -flate-strictness 
     440  1. late-wwsplit    = -O1 -flate-wwsplit 
     441 
     442NB the libraries are usually compiled with -O2. 
     443 
     444Allocation changes: 
     445 
     446  * Allocation for late-strictness and late-wwsplit is always the same. 
     447 
     448  * The big changes are all good, but nothing truly spectacular. 
     449 
     450{{{ 
     451knights  2258392    -5.70% 
     452fulsom   335718008  -2.50% 
     453scs      1030151712 -1.60% 
     454simple   226413112  -1.20% 
     455pic      3528968    -0.20% 
     456gamteb   59846096   -0.10% 
     457gg       9159680    -0.10% 
     458 
     459ansi     128632      0.10% 
     460awards   292416      0.10% 
     461expert   373048      0.10% 
     462pretty   145640      0.10% 
     463rfib     115688      0.10% 
     464grep     72992       0.20% 
     465mkhprog  3371224     0.20% 
     466scc      59568       0.20% 
     467tak      110136      0.30% 
     468maillist 92431136    0.50% 
     469 
     470}}} 
     471 
     472Repeatable Elapsed time changes: 
     473 
     474{{{ 
     475              baseline   baseline2  late-strictness  late-wwsplit 
     476atom          3.66       0.10%       3.00%            -1.10% 
     477compress2     1.97       0.10%      -3.80%            -3.40% 
     478cryptarithm1  2.36       0.10%      -1.90%             2.90% 
     479fft           0.23       0.00%      -2.30%           -10.40% 
     480fft2          0.28      -0.20%       0.10%             2.10% 
     481genfft        0.21      -0.30%       1.90%             5.30% 
     482hpg           0.47       0.10%      -3.00%             2.80% 
     483integer       5.69      -0.20%      -0.70%             2.90% 
     484integrate     0.66       0.30%      -0.20%            -4.10% 
     485life          1.85       0.10%       1.90%             1.30% 
     486para          1.65      -0.40%       0.20%            -3.40% 
     487scs           3.59       0.60%      -0.90%           -12.50% 
     488transform     1.86      -0.20%      -0.20%             2.20% 
     489treejoin      1.61       0.40%       1.10%             2.30% 
     490wave4main     1.18      -0.10%       1.90%             0.40% 
     491}}} 
     492 
     493=== Analysis === 
     494 
     495==== Allocation ==== 
     496 
     497  * knights - `possibleMoves`'s letrec includes a join point. In the baseline there's no demand info for it, but late-strictness records that it's strict in its first argument. Thus after CoreTidy, calls `$j (case sNo {I# x -> I# (x -# 1)})` become `case sNo {I# x -> $j (I# (x -# 1))}`. By STG, the `-# 1` gets floated out, and the sat thunk's closure is instead just the `I#` constructor's closure. This has many benefits, including saving a word, since the free variable is no longer needed. The ww-split version ends up doing approximately the same thing; it just makes the change more explicit in the Core and STG. 
     498 
     499  * fulsom - A constructor argument `(let a = case fv1 of D# x -> case fv2 of D# y -> D# (x +## y) in GHC.Float.timesDouble a a)` becomes `case fv1 of D# x -> case fv2 of D# y -> let a = x +## y in D# (a *## a)`. The let at an unlifted type is actually a case, and hence not allocated, eliminating a two-free-variable closure altogether. Moreover, the entry count of `GHC.Float.timesDouble` is halved. 
     500 
     501  * scs - a hot recursive LNE (let-no-escape) has some lets converted to cases, since their strict demand is now identified. Allocation drops from 79,751,836 to 77,930,374. 
     502 
     503  * scs - `GHC.Float.$w$sfloatToDigits1` allocates 95% as much as before (1070916 on 14098 entries). 
     504 
     505  * scs - `GHC.Float.$wfromRat''` no longer allocates, saving 170315 of allocation on 37304 entries. 
     506 
     507  * simple - a number of lets become cases, partly because `revised_temperature` is newly found strict in its second argument. 
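
The knights and fulsom changes above can be mimicked at the source level. This is only a rough sketch: the function names are invented, `joinPoint` is a stand-in for the real `$j`, and at `-O` GHC may well compile each lazy/strict pair to the same code. It is meant to show the shape of the rewrite, not what the compiler literally emits.

{{{
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Double (D#), Int (I#), (+##), (*##), (-#))

-- fulsom, before: `a` is a lifted let, ie a heap-allocated thunk
-- with two free variables, and the product uses boxed multiplication.
squareOfSumLazy :: Double -> Double -> Double
squareOfSumLazy fv1 fv2 =
  let a = case fv1 of D# x -> case fv2 of D# y -> D# (x +## y)
  in a * a

-- fulsom, after: the unboxing cases are hoisted, and `a` is a let at an
-- unlifted type, which is evaluated immediately rather than allocated.
squareOfSumStrict :: Double -> Double -> Double
squareOfSumStrict fv1 fv2 =
  case fv1 of
    D# x -> case fv2 of
      D# y -> let a = x +## y in D# (a *## a)

-- knights: a hypothetical stand-in for the join point, strict in its argument.
joinPoint :: Int -> Int
joinPoint n = n * 2

-- Before: the argument is passed as a thunk.
callLazy :: Int -> Int
callLazy sNo = joinPoint (case sNo of I# x -> I# (x -# 1#))

-- After: the case is moved outside the call, so the argument
-- is just an `I#` constructor application.
callStrict :: Int -> Int
callStrict sNo = case sNo of I# x -> joinPoint (I# (x -# 1#))
}}}

Both versions of each pair compute the same value; the strict versions differ only in when the scrutinee is evaluated, which is safe precisely because the demand info says the value will be needed anyway.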
     508 
     509==== Runtime ==== 
     510 
     511TODO