wiki:LateDmd

Version 6 (modified by nfrisby, 8 months ago)

Notes about running demand analysis a second time, late in the pipeline.

Commit c080f727ba5f83921b842fcff71e9066adbdc250

The numbers quoted on this wiki page used ef017944600cf4e153aad686a6a78bfb48dea67a as the base commit. After measuring, I rebased my patch onto 33c880b43ed72d77f6b1d95d5ccefbd376c78c78.

The corresponding testsuite commit is [a7920ef6eefa5578c89b7cda0d6be207ee38c502/testsuite]

Commit notes

The -flate-dmd-anal flag runs the demand analysis a second time just before CorePrep. It's not on by default yet, but we hope -O2 will eventually imply it, perhaps even for the GHC 7.8 release.

The bulk of this patch merely simplifies the treatment of wrappers in interface files.

TODO

  • Update the documentation to explain -flate-dmd-anal.
  • Ask the community for help in determining if we should make -O2 imply -flate-dmd-anal.

Relation to other tickets

Some tickets document runtime bugs that can be cleaned up by running the demand analyzer (followed by a simplifier run) a second time at the end of the pipeline: #4941, #5302, #6087. Possibly #6070 as well? Others?

Removing the clever .hi files scheme

Running the demand analyzer twice breaks some expectations of the .hi file format. Prior to this commit, GHC did not store a wrapper's body in the .hi file; it regenerated the body from the wrapper's strictness signature and worker id. Now, instead, the body is simply encoded just like any other InlineStable.
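
To make the terminology concrete, here is a hand-written sketch of the shape of a worker/wrapper split. The names sumTo, wSumTo and sumToWrapped are hypothetical, and GHC performs this split on Core rather than on source code, so this only illustrates the idea and is not the compiler's actual output.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, isTrue#, (+#), (-#), (==#))

-- Original function: strict in both arguments.
-- (Like the worker below, it does not terminate for negative n.)
sumTo :: Int -> Int -> Int
sumTo acc 0 = acc
sumTo acc n = sumTo (acc + n) (n - 1)

-- Hand-written "worker": operates on unboxed Int# values only.
wSumTo :: Int# -> Int# -> Int#
wSumTo acc n
  | isTrue# (n ==# 0#) = acc
  | otherwise          = wSumTo (acc +# n) (n -# 1#)

-- Hand-written "wrapper": unboxes the arguments, calls the worker,
-- and boxes the result. Its body is fully determined by the strictness
-- information and the worker's name, which is why it could be
-- regenerated on import instead of being stored.
sumToWrapped :: Int -> Int -> Int
sumToWrapped (I# acc) (I# n) = I# (wSumTo acc n)
{-# INLINE sumToWrapped #-}
```

Loaded into GHCi, sumTo 0 10 and sumToWrapped 0 10 give the same result; the wrapper is small and meant to be inlined at call sites so that callers talk to the worker directly.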

This change…

  1. simplifies a special case; there's plenty of knock-on code elimination from no longer having ids in UnfoldingSource,
  2. increases the size of .hi files (see below),
  3. accordingly increases compile time a bit (e.g. about +1% over nofib),
  4. accommodates the late demand analysis (see below), and
  5. similarly accommodates the -ffun-to-thunk flag.

Simplifying the .hi scheme was the easiest way to enable -flate-dmd-anal and make -ffun-to-thunk safe to use. It is possible to revert to the clever .hi scheme. It will, however, require some care in order to safely interoperate with -flate-dmd-anal, -ffun-to-thunk, and any future work that similarly affects the accuracy of the clever .hi file scheme's regeneration phase.

Effect on .hi file size

Removing the clever .hi file scheme for wrappers results as expected in an increase of .hi file size.

In $TOPDIR/libraries, there's an extra 569,509 bytes of .hi file.

Here are the files that grew by more than 10K.

(bytes growth,file)
(11103,"base/dist-install/build/GHC/Arr.hi")
(12479,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")
(12756,"binary/dist-install/build/Data/Binary/Class.hi")
(15727,"random/dist-install/build/System/Random.hi")
(29348,"base/dist-install/build/Data/Data.hi")
(30497,"template-haskell/dist-install/build/Language/Haskell/TH/Syntax.hi")
(37081,"Cabal/Cabal/dist-install/build/Distribution/PackageDescription.hi")
(64200,"ghc-prim/dist-install/build/GHC/Classes.hi")

Here are the files that grew by more than 10% (growth is given as a fraction).

(0.10163132137030995,"Cabal/Cabal/dist-install/build/Distribution/Simple/Bench.hi")
(0.1067165410638649,"hoopl/dist-install/build/Compiler/Hoopl/XUtil.hi")
(0.11125552378476736,"base/dist-install/build/Control/Monad.hi")
(0.11311653959856854,"time/dist-install/build/Data/Time/Calendar/Private.hi")
(0.12166183143643532,"transformers/dist-install/build/Data/Functor/Compose.hi")
(0.1584435579816642,"hoopl/dist-install/build/Compiler/Hoopl/Combinators.hi")
(0.21422422135168143,"ghc-prim/dist-install/build/GHC/Classes.hi")
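
Lists like the two above can be produced from per-file sizes of two builds. The following is only a sketch of one way to do that: the function names are hypothetical, and it assumes the fractional growth is measured relative to the old size, which this page does not state.

```haskell
import Data.List (sortOn)

-- File sizes in bytes, keyed by path relative to $TOPDIR/libraries.
type Sizes = [(FilePath, Integer)]

-- Files present in both builds, as (path, old size, new size).
paired :: Sizes -> Sizes -> [(FilePath, Integer, Integer)]
paired old new = [ (p, o, n) | (p, n) <- new, Just o <- [lookup p old] ]

-- Files whose absolute growth exceeds the threshold, smallest growth first.
bigByteGrowth :: Integer -> Sizes -> Sizes -> [(Integer, FilePath)]
bigByteGrowth threshold old new =
  sortOn fst [ (n - o, p) | (p, o, n) <- paired old new, n - o > threshold ]

-- Files whose growth relative to the old size (an assumption) exceeds
-- the threshold, smallest growth first. Empty old files are skipped.
bigRelGrowth :: Double -> Sizes -> Sizes -> [(Double, FilePath)]
bigRelGrowth threshold old new =
  sortOn fst
    [ (g, p)
    | (p, o, n) <- paired old new
    , o > 0
    , let g = fromIntegral (n - o) / fromIntegral o
    , g > threshold
    ]
```

For example, mapM_ print (bigByteGrowth 10000 old new) prints tuples in the same (growth, file) shape as the lists above.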

Accommodation of -flate-dmd-anal and -ffun-to-thunk

The clever .hi scheme caused CoreLint errors when combined with -flate-dmd-anal. Irresponsibly, I cannot remember the recipe for reproducing this bug. It was triggered in one of three ways: building GHC, running nofib, or running ./validate.

Similar to -flate-dmd-anal, abandoning the clever .hi scheme lets us safely import code compiled with/without -ffun-to-thunk from a module compiled without/with -ffun-to-thunk. I can explain this one.

  • Compile A.hs with -ffun-to-thunk
  • Compile B.hs, which imports A, without -ffun-to-thunk

If demand analysis removes all the value arguments from a function f in A.hs and B.hs uses that function, compilation of B.hs will crash. The problem is that the regeneration of the body of f in B will attempt to apply f to a realWorld# argument because there is no -ffun-to-thunk flag. However, f no longer accepts any arguments, since it was compiled with -ffun-to-thunk. Boom.

(The -flate-dmd-anal bug was similar, but more involved.)
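
The -ffun-to-thunk mismatch described above can be modelled by hand. In this sketch all names are hypothetical, and the unit value () stands in for the realWorld# dummy argument; it shows the two worker shapes and why a wrapper regenerated under the wrong assumption cannot work.

```haskell
-- The function from A.hs: its only value argument is never used.
f :: Int -> Integer
f _ = product [1 .. 20]

-- Split WITHOUT -ffun-to-thunk: the worker keeps a dummy argument
-- so that it remains a function.
workerFun :: () -> Integer
workerFun () = product [1 .. 20]

wrapperFun :: Int -> Integer
wrapperFun _ = workerFun ()

-- Split WITH -ffun-to-thunk: all value arguments are removed,
-- so the worker becomes a thunk.
workerThunk :: Integer
workerThunk = product [1 .. 20]

wrapperThunk :: Int -> Integer
wrapperThunk _ = workerThunk

-- The crash scenario: B regenerates the wrapper as if the flag were off,
-- but A's worker was compiled with the flag on. The regenerated body
-- would be the following, which is ill-typed because workerThunk is
-- not a function:
--
--   wrapperBad _ = workerThunk ()
```

Storing the wrapper's body in the .hi file avoids the problem, because B then sees exactly the wrapper that A's compilation produced.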

-flate-dmd-anal

-flate-dmd-anal adds a second demand analysis, followed by another invocation of the simplifier, just before CorePrep. Cf. #7782.
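
A minimal sketch of opting a single module into the late pass, assuming a GHC build that includes this commit. The module contents (mean and its helper step) are hypothetical; only the flag itself comes from this page. The sketch does not show whether the late pass changes the code generated for this particular function; comparing -ddump-simpl output or allocation figures with and without the flag is the way to find out.

```haskell
{-# OPTIONS_GHC -flate-dmd-anal #-}
-- Per-module opt-in. The command-line equivalent would be, e.g.:
--   ghc -O2 -flate-dmd-anal Mean.hs

import Data.List (foldl')

-- Arithmetic mean via a single pass with a (sum, count) accumulator.
-- Returns NaN for the empty list, since that case divides 0 by 0.
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0) xs

    step :: (Double, Int) -> Double -> (Double, Int)
    step (a, k) x = (a + x, k + 1)
```

Putting the flag in an OPTIONS_GHC pragma keeps the experiment local to one module, which is convenient when comparing against modules compiled without it.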

Effect on .hi file size and .a file size

The second demand analysis generates more worker/wrapper splits, so it also generates larger .hi files and larger .o files. The numbers in this section measure the difference between -O2 -flate-dmd-anal and -O2 -fno-late-dmd-anal, on my 64-bit Mac OS X machine.

It's based on the size of the .hi and .a files in $TOPDIR/libraries.

configuration   .hi bytes    .a bytes
no late-dmd
late-dmd
difference       +552,057    +684,696

These are the big .hi changes over 10K.

(growth bytes,  module)
(35807,"base/dist-install/build/Data/Data.hi")
(54562,"template-haskell/dist-install/build/Language/Haskell/TH/Syntax.hi")
(59000,"Cabal/Cabal/dist-install/build/Distribution/PackageDescription.hi")
(69900,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")

These are the big .hi changes over 10%.

(growth%,  module)
(0.10158001494608733,"haskeline/dist-install/build/System/Console/Haskeline/Command.hi")
(0.10499966324675034,"hoopl/dist-install/build/Compiler/Hoopl/MkGraph.hi")
(0.11207246180884142,"haskeline/dist-install/build/System/Console/Haskeline/Command/Undo.hi")
(0.11254620966637761,"transformers/dist-install/build/Control/Applicative/Lift.hi")
(0.11394046020649104,"base/dist-install/build/GHC/Event/Thread.hi")
(0.11417453220731909,"dph/dph-lifted-base/dist-install/build/Data/Array/Parallel/PArray/Reference.hi")
(0.11493796526054591,"hoopl/dist-install/build/Compiler/Hoopl/XUtil.hi")
(0.11842105263157894,"dph/dph-prim-seq/dist-install/build/Data/Array/Parallel/Unlifted/Sequential/Extracts.hi")
(0.1252496671105193,"base/dist-install/build/Control/Concurrent/QSemN.hi")
(0.13623208379272325,"base/dist-install/build/Numeric.hi")
(0.174892616905746,"haskeline/dist-install/build/System/Console/Haskeline/Backend/DumbTerm.hi")
(0.17564356435643563,"base/dist-install/build/Data/Ratio.hi")
(0.1764402762032361,"base/dist-install/build/Control/Concurrent/QSem.hi")
(0.2952895972676818,"dph/dph-lifted-copy/dist-install/build/Data/Array/Parallel/Lifted/TH/Repr.hi")
(0.3762859126952084,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")

These are the big .a changes over 10K.

growth bytes  module
      -19408  libHSdph-prim-par-0.8.0.1.a
      -16976  libHSdph-prim-seq-0.8.0.1.a
       10440  libHShoopl-3.10.0.0.a
       11120  libHStransformers-0.3.0.0.a
       11472  libHSold-time-1.1.0.1.a
       22584  libHStime-1.4.0.2.a
       30168  libHSdph-lifted-copy-0.8.0.1.a
       35224  libHSvector-0.9.1.a
       44448  libHScontainers-0.5.0.0.a
       48408  libHShaskeline-0.7.0.4.a
      115104  libHStemplate-haskell-2.9.0.0.a
      120936  libHSbase-4.7.0.0.a
      237088  libHSCabal-1.17.0.a

New performance numbers

I'm using commit

I use these abbreviations in the following tables

00 - no late dmd analysis on either libs or nofib tests
10 - late demand analysis on libs, but not on nofib tests
11 - late demand analysis on both libs and nofib tests

build.mk included

DYNAMIC_BY_DEFAULT   = NO
DYNAMIC_GHC_PROGRAMS = NO

SRC_HC_OPTS     = -O -H64m
GhcStage1HcOpts = -O -fasm
GhcStage2HcOpts = -O2 -fasm
GhcHcOpts       = -Rghc-timing
GhcLibHcOpts    = -O2

2.7 GHz Core i7 MacBook Pro, 16 GB, 64-bit

Binary Sizes

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
        -1 s.d.                -----           +0.4%           +0.4%
        +1 s.d.                -----           +0.7%           +0.7%
        Average                -----           +0.6%           +0.6%
mode=norm

Allocations

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
       cichelli             80307264           +0.0%          -22.9%
        mandel2              1041544           +0.0%          -21.4%
reverse-complem            150153040          -13.2%          -13.2%
          fasta            401153024           -9.1%           -9.1%
      integrate            474063360           +0.0%           -5.1%
   k-nucleotide           4125099504           -0.0%           -4.8%
        knights              1968072           +0.0%           -3.8%
         fulsom            323486224           +0.0%           -2.6%
      transform            696343224           +0.0%           -2.4%

       nucleic2             87567072           +0.0%           +3.4%
   cryptarithm2             24028936           +0.0%           +4.2%

        -1 s.d.                -----           -1.9%           -4.8%
        +1 s.d.                -----           +1.5%           +3.1%
        Average                -----           -0.2%           -0.9%

Run Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
           life                 0.23          -13.0%          -13.0%

   binary-trees                 0.61           +6.3%           +5.9%

        -1 s.d.                -----           -3.5%           -4.1%
        +1 s.d.                -----           +2.9%           +2.3%
        Average                -----           -0.4%           -0.9%

Elapsed Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
      compress2                 0.23          -14.2%          -17.7%
      typecheck                 0.20           +2.0%           -8.9%
           life                 0.26          -12.3%           -6.2%
         simple                 0.24           -9.0%           -4.9%

            hpg                 0.21           -1.9%           +6.7%
reverse-complem                 0.27          +13.5%          +12.8%

        -1 s.d.                -----           -5.7%           -5.6%
        +1 s.d.                -----           +4.2%           +4.3%
        Average                -----           -0.9%           -0.8%

Old performance numbers

NB These were from April 2013.

Here are the effects on nofib allocation. Run time didn't seem to change as drastically. The "X/Y" column headers mean "library-flags/test-flags" given to GHC when compiling the respective bit.

Allocations

-------------------------------------------------------------------------------
        Program                O2/O2     late-dmd+O2/O2    late-dmd+O2/late-dmd+O2
-------------------------------------------------------------------------------
   cryptarithm2             25078168           +0.0%           +8.0%
       nucleic2             98331744           +0.0%           +3.2%

       cichelli             80310632           +0.0%          -22.9%
          fasta            401159024           -9.1%           -9.1%
         fulsom            321427240           +0.0%           -2.6%
   k-nucleotide           4125102928           -0.0%           -4.8%
        knights              2037984           +0.0%           -3.7%
        mandel2              1041840           +0.0%          -21.4%
        parstof              3103208           +0.0%           -1.4%
reverse-complem            155188304          -12.8%          -12.8%
         simple            226412800           -0.0%           -1.0%

All other changes in allocation are less than 1%. Note that the late pass improves a couple of tests significantly just via changes in the base libraries.

For cryptarithm2 (cf. remarks in #4941):

  • a 4% increase in allocation is due to reboxing
  • another 4% is due to dead closures, because the fix in #4962 isn't working for some reason.
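
For readers unfamiliar with the term, the following hand-written sketch shows how reboxing can arise after a worker/wrapper split. The function names are hypothetical and this is not the code from cryptarithm2; it only illustrates the general mechanism.

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, isTrue#, (>#))

-- Original function: strict in n (it is compared against 0),
-- but n is also stored in a list, which needs a boxed Int.
consIfPositive :: Int -> [Int] -> [Int]
consIfPositive n xs = if n > 0 then n : xs else xs

-- Hand-written worker: receives n unboxed. To put it in the list,
-- the worker has to allocate a fresh I# box, even though the caller
-- may already have had a boxed copy of the same value.
wConsIfPositive :: Int# -> [Int] -> [Int]
wConsIfPositive n xs
  | isTrue# (n ># 0#) = I# n : xs   -- reboxing happens here
  | otherwise         = xs

-- Hand-written wrapper: unboxes n and calls the worker.
consIfPositiveWW :: Int -> [Int] -> [Int]
consIfPositiveWW (I# n) xs = wConsIfPositive n xs
```

Both versions return the same lists; the difference is only in allocation, which is why reboxing shows up in the Allocations table rather than as a change in behaviour.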

For nucleic2, in var_most_distant_atom, a let-bound function is inlined after w/w, and as a result numerous closures grow by a significant amount. I'm not sure where to lay the blame for this. Note, however, that just making nucleic2's data types use strict !Float fields changes its allocation by -72.4%, so maybe this "bad practice" corner case is a small issue.
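
As an illustration of the strict-field change mentioned above, here is a sketch with hypothetical types; these are not nucleic2's actual definitions, and the sketch makes no claim about reproducing the -72.4% figure.

```haskell
-- Lazy fields: each field may hold an unevaluated thunk.
data PtLazy = PtLazy Float Float Float

-- Strict fields: each field is evaluated when the constructor is applied,
-- so code that pattern matches on PtStrict never finds a thunk inside.
data PtStrict = PtStrict !Float !Float !Float

normSqLazy :: PtLazy -> Float
normSqLazy (PtLazy x y z) = x * x + y * y + z * z

normSqStrict :: PtStrict -> Float
normSqStrict (PtStrict x y z) = x * x + y * y + z * z

-- Building a new point forces the three products immediately,
-- instead of allocating a thunk for each of them.
scaleStrict :: Float -> PtStrict -> PtStrict
scaleStrict k (PtStrict x y z) = PtStrict (k * x) (k * y) (k * z)
```

The only source-level difference is the ! annotations on the fields, which is what makes this a cheap experiment to try on a benchmark.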