Changes between Version 18 and Version 19 of Frisby2013Q1


Timestamp:
Feb 22, 2013 7:30:09 PM (2 years ago)
Author:
nfrisby
Comment:

--

  • Frisby2013Q1

    v18 v19  
    1414
    1515  * emacs scroll-all-mode
     16      * use C-o to fix up horizontal alignment without skewing the cursors
    1617
    1718  * navigate -ddump-*'s output via the *** headers
     
    1920  * lots of information: -dverbose-core2core + -ddump-inlinings
    2021    * simplCore/Simplify has many pprTraces that are commented out
     22
     23  * ghc --show-iface M.hi will dump its definitions
     24      * the unfolding of a wrapper is not shown, because it is not actually stored in the `.hi` file; `TcIface.tcIfaceWrapper` reconstructs it from the wrapper's type, the worker, and the worker's demand information
    2125
    2226  * diff -w can highlight major changes
     
    2428    * -dsuppress-uniques or sed -r 's/_[[:alnum:]]{2,4}//g' (or leave out the underscore)
    2529      * removes ''most'' uniques
    26       * sed is handy also for diffing .ticky files
     30      * this sed is handy also for diffing .ticky files
     31
     32  * a strictly demanded let and a thunk with an unlifted type both become cases in the STG.
    2733
    2834=== Ticky Counters ===
     
    373379
    374380Changing the let-no-escape thunks to \void -> ... closures upstream would then subject the binding to more optimisations. Formerly, its non-lambda status meant that inlining it, eg, would lose sharing.
     381
     382== Late Strictness/WW ==
     383
     384There are two core-to-core passes related to demand (= strictness & usage & CPR):
     385
     386  * the demand analyzer pass (stranal/DmdAnal) and
     387  * the worker-wrapper split (stranal/WorkWrap).
     388
     389The split currently happens once in the pipeline, and the demand analysis happens immediately before it.
     390
     391{{{
     392sat
     393vectorise
     394specialize
     395float-out
     396float-in
     397simpl
     398HERE
     399float-out
     400cse
     401float-in
     402liberate-case;simpl
     403spec-constr
     404simpl
     405}}}
     406
     407This pair of passes can be enabled/disabled by the `-fstrictness` flag.
     408
     409Additionally, the demand analyzer can optionally be run before each execution of some arbitrary phases of the simplifier using the `-fstrictness-before` flag. A few direct invocations of the simplifier, eg, after vectorisation, are not affected by this flag.
     410
     411=== The Idea ===
     412
     413The demands change as we optimize Core terms. The passes are careful to remove the demand info annotation when transforming a term in a way that invalidates the current demand info. Doing so, however, can hinder downstream optimizations. Running the strictness analyzer a second time may therefore be helpful.
     414
     415=== The Design ===
     416
     417I added two flags, `-flate-strictness` and `-flate-wwsplit`.
     418
     419  * `-flate-strictness` invokes the demand analyzer after !SpecConstr.
     420  * `-flate-wwsplit` implies `-flate-strictness` and immediately follows it with a worker-wrapper split.
     421
     422I placed them arbitrarily in the pipeline. But phase-ordering is known to have an impact, so maybe we should try running in between all other passes.
     423
     424==== Initial Core Lint error ====
     425
     426Enabling either flag caused a core lint error when compiling `GHC.Float`. (I anticipate that `-fstrictness-before` would have done the same.) Some cleverness in the `.hi` files was surreptitiously creating an ill-typed unfolding for `GHC.Real.even`. Here's why.
     427
     428The unfolding for a wrapper function is not actually stored in a `.hi` file. This unfolding can be reconstructed based on the type of the function and the demand info it had when the split was performed. Thus only the name of the worker is stored in the `.hi` file and `TcIface.tcIfaceWrapper` rebuilds the wrapper's unfolding.
     429
     430However, re-running the demand analyzer after the worker wrapper split may change the wrapper's demand info. In the program offending Core Lint, the new demand info results in a worker with an arity of 4 while the old info had an arity of 5.
     431
     432For now, I've merely disabled the `.hi` cleverness; storing the wrapper's actual unfolding in the `.hi` file works fine.
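
To make the wrapper/worker relationship concrete, here is a hand-written sketch of a split. The names `sumSq`, `wSumSq` and `sumSqWrapper` are hypothetical and are not taken from GHC or from `GHC.Real`; the code only illustrates why a wrapper's unfolding is determined by the function's type plus the demand info used at split time, and so why rebuilding it from ''different'' demand info can yield a call to the worker that no longer type checks.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, (*#), (+#))

-- The original function: strict in both (boxed) arguments.
sumSq :: Int -> Int -> Int
sumSq (I# x) (I# y) = I# ((x *# x) +# (y *# y))

-- Hand-written "worker": takes the unboxed components directly.
-- Its argument list is dictated by the demand info: two strict,
-- unboxable arguments give two Int# parameters.
wSumSq :: Int# -> Int# -> Int#
wSumSq x y = (x *# x) +# (y *# y)

-- Hand-written "wrapper": unboxes exactly as the demand info says,
-- then calls the worker. Given the type of sumSq and that demand
-- info, this body is fully determined, which is what makes it
-- possible to rebuild it instead of storing it.
sumSqWrapper :: Int -> Int -> Int
sumSqWrapper a b =
  case a of
    I# x -> case b of
      I# y -> I# (wSumSq x y)

main :: IO ()
main = print (sumSq 3 4, sumSqWrapper 3 4)
```

If the demand info were later changed so that, say, only the first argument counted as strict, a wrapper rebuilt from that info would pass the worker a differently shaped argument list than the worker actually accepts; that is the kind of mismatch Core Lint reports.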
     433
     434=== The initial measurements ===
     435
     436Using different flag combinations, I built the libraries and ran nofib tests.
     437
     438  1. baseline        = -O1
     439  1. late-strictness = -O1 -flate-strictness
     440  1. late-wwsplit    = -O1 -flate-wwsplit
     441
     442NB the libraries are usually compiled with -O2.
     443
     444Allocation changes:
     445
     446  * Allocation for late-strictness and late-wwsplit is always the same.
     447
     448  * The big changes are all good, but nothing truly spectacular.
     449
     450{{{
     451knights  2258392    -5.70%
     452fulsom   335718008  -2.50%
     453scs      1030151712 -1.60%
     454simple   226413112  -1.20%
     455pic      3528968    -0.20%
     456gamteb   59846096   -0.10%
     457gg       9159680    -0.10%
     458
     459ansi     128632      0.10%
     460awards   292416      0.10%
     461expert   373048      0.10%
     462pretty   145640      0.10%
     463rfib     115688      0.10%
     464grep     72992       0.20%
     465mkhprog  3371224     0.20%
     466scc      59568       0.20%
     467tak      110136      0.30%
     468maillist 92431136    0.50%
     469
     470}}}
     471
     472Repeatable Elapsed time changes:
     473
     474{{{
     475              baseline   baseline2  late-strictness  wwsplit
     476atom          3.66       0.10%       3.00%            -1.10%
     477compress2     1.97       0.10%      -3.80%            -3.40%
     478cryptarithm1  2.36       0.10%      -1.90%             2.90%
     479fft           0.23       0.00%      -2.30%           -10.40%
     480fft2          0.28      -0.20%       0.10%             2.10%
     481genfft        0.21      -0.30%       1.90%             5.30%
     482hpg           0.47       0.10%      -3.00%             2.80%
     483integer       5.69      -0.20%      -0.70%             2.90%
     484integrate     0.66       0.30%      -0.20%            -4.10%
     485life          1.85       0.10%       1.90%             1.30%
     486para          1.65      -0.40%       0.20%            -3.40%
     487scs           3.59       0.60%      -0.90%           -12.50%
     488transform     1.86      -0.20%      -0.20%             2.20%
     489treejoin      1.61       0.40%       1.10%             2.30%
     490wave4main     1.18      -0.10%       1.90%             0.40%
     491}}}
     492
     493=== Analysis ===
     494
     495==== Allocation ====
     496
     497  * knights - `possibleMoves`'s letrec includes a join point: in the baseline there's no demand info, but late-strictness records that it's strict in its first argument. Thus after CoreTidy, calls `$j (case sNo {I# x -> I# (x -# 1)})` become `case sNo {I# x -> $j (I# (x -# 1))}`. By STG, the `-# 1` gets floated out, and the sat thunk's closure is instead just the `I#` constructor's closure. This has many benefits, including saving us a word since we don't need the free variable. This is approximately what the ww-split version ends up doing, too; ww-split just makes the change more explicit in the Core and STG.
     498
     499  * fulsom - A constructor argument `(let a = case fv1 of D# x -> case fv2 of D# y -> D# (x +## y) in GHC.Float.timesDouble a a)` becomes `case fv1 of D# x -> case fv2 of D# y -> let a = x +## y in D# (a *## a)`. The let at an unlifted type is actually a case, and hence not allocated, eliminating a two-free-variable closure altogether. Moreover, the entry count of `GHC.Float.timesDouble` is halved.
     500
     501  * scs - a hot recursive LNE has some lets converted to cases, since their strict demand is identified. Its allocation drops from 79,751,836 to 77,930,374.
     502
     503  * scs - `GHC.Float.$w$sfloatToDigits1` allocates 95% as much as before (1070916 on 14098 entries).
     504
     505  * scs - `GHC.Float.$wfromRat''` no longer allocates, saving 170315 allocation on 37304 entries.
     506
     507  * simple - a number of lets become cases, partly because `revised_temperature` is newly found strict in its second arg
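
The fulsom item above can be mimicked by hand in source Haskell. This is only a sketch of the two shapes: `squareSumLazy` and `squareSumStrict` are hypothetical names, and the real change happens in Core inside GHC rather than in user code. With optimisation on, GHC may well turn the first form into the second by itself.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Double (D#), (*##), (+##))

-- Before: `a` is a lifted let, so (without further optimisation) it is
-- a heap-allocated thunk whose closure captures fv1 and fv2.
squareSumLazy :: Double -> Double -> Double
squareSumLazy fv1 fv2 =
  let a = case fv1 of D# x -> case fv2 of D# y -> D# (x +## y)
  in a * a

-- After: the unboxing cases are outermost, and `a` is bound at the
-- unlifted type Double#. A let at an unlifted type is evaluated
-- strictly (it is really a case), so no thunk is allocated for it.
squareSumStrict :: Double -> Double -> Double
squareSumStrict fv1 fv2 =
  case fv1 of
    D# x -> case fv2 of
      D# y -> let a = x +## y in D# (a *## a)

main :: IO ()
main = print (squareSumLazy 1.5 2.5, squareSumStrict 1.5 2.5)
```

Both functions compute the same value; the difference is only in when the sum is evaluated and whether a closure has to be built for it.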
     508
     509==== Runtime ====
     510
     511TODO