wiki:LateDmd

Version 9 (modified by nfrisby, 8 months ago)


Notes about running demand analysis a second time, late in the pipeline.

Commit c080f727ba5f83921b842fcff71e9066adbdc250

The numbers quoted on this wiki page were measured with ef017944600cf4e153aad686a6a78bfb48dea67a as the base commit. After measuring, I rebased my patch onto 33c880b43ed72d77f6b1d95d5ccefbd376c78c78.

The corresponding testsuite commit is [a7920ef6eefa5578c89b7cda0d6be207ee38c502/testsuite]

Commit notes

The -flate-dmd-anal flag runs the demand analysis a second time just before CorePrep. It's not on by default yet, but we hope -O2 will eventually imply it, perhaps even for the GHC 7.8 release.

The bulk of this patch merely simplifies the treatment of wrappers in interface files.

TODO

  • Update the documentation to explain -flate-dmd-anal.
  • Ask the performance czars and community for help in determining if we should make -O2 imply -flate-dmd-anal.

Relation to other tickets

There are some tickets documenting runtime bugs that can be cleaned up by running the demand analyzer (followed by a simplifier run) a second time at the end of the pipeline: #4941, #5302, #6087. Possibly #6070 as well? Others?

Removing the clever .hi files scheme

Running the demand analyzer twice breaks some expectations of the .hi file format. Prior to this commit, GHC regenerated a wrapper's body from its strictness signature and worker id. Now, instead, the body is simply encoded just like any other InlineStable.
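For readers who have not seen the split written out, the following is a hand-written sketch of the shape of a worker/wrapper pair. It assumes a function that is strict in two Int arguments; the names sumSquares and wSumSquares are hypothetical (GHC produces such pairs in Core and chooses the names itself). The wrapper does nothing beyond unboxing, calling the worker, and reboxing, which is why its body could previously be regenerated from the strictness signature and the worker id alone.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, (*#), (+#))

-- Worker (hypothetical name): operates on unboxed machine integers only.
wSumSquares :: Int# -> Int# -> Int#
wSumSquares x y = (x *# x) +# (y *# y)

-- Wrapper: unboxes both arguments, calls the worker, reboxes the result.
-- It is small and meant to be inlined at call sites.
sumSquares :: Int -> Int -> Int
sumSquares (I# x) (I# y) = I# (wSumSquares x y)
{-# INLINE sumSquares #-}

main :: IO ()
main = print (sumSquares 3 4) -- prints 25
```

After this commit, a wrapper such as sumSquares would be stored in the .hi file with its full body, like any other InlineStable unfolding.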

This change…

  1. simplifies a special case; there's plenty of knock-on code elimination from no longer having ids in UnfoldingSource,
  2. increases the size of .hi files (see below),
  3. accordingly increases compile time a bit (e.g. about +1% over nofib),
  4. accommodates the late demand analysis (see below),
  5. similarly accommodates the -ffun-to-thunk flag.

Simplifying the .hi scheme was the easiest way to enable -flate-dmd-anal and make -ffun-to-thunk safe to use. It is possible to revert to the clever .hi scheme. It will, however, require some care in order to safely interoperate with -flate-dmd-anal, -ffun-to-thunk, and any future work that similarly affects the accuracy of the clever .hi file scheme's regeneration phase.

Effect on .hi file size

Removing the clever .hi file scheme for wrappers results, as expected, in larger .hi files.

In $TOPDIR/libraries, the .hi files grow by a total of 569,509 bytes.

Here are the files that grew by more than 10K.

(bytes growth,file)
(11103,"base/dist-install/build/GHC/Arr.hi")
(12479,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")
(12756,"binary/dist-install/build/Data/Binary/Class.hi")
(15727,"random/dist-install/build/System/Random.hi")
(29348,"base/dist-install/build/Data/Data.hi")
(30497,"template-haskell/dist-install/build/Language/Haskell/TH/Syntax.hi")
(37081,"Cabal/Cabal/dist-install/build/Distribution/PackageDescription.hi")
(64200,"ghc-prim/dist-install/build/GHC/Classes.hi")

Here are the files that grew by more than 10%.

(growth fraction,file)
(0.10163132137030995,"Cabal/Cabal/dist-install/build/Distribution/Simple/Bench.hi")
(0.1067165410638649,"hoopl/dist-install/build/Compiler/Hoopl/XUtil.hi")
(0.11125552378476736,"base/dist-install/build/Control/Monad.hi")
(0.11311653959856854,"time/dist-install/build/Data/Time/Calendar/Private.hi")
(0.12166183143643532,"transformers/dist-install/build/Data/Functor/Compose.hi")
(0.1584435579816642,"hoopl/dist-install/build/Compiler/Hoopl/Combinators.hi")
(0.21422422135168143,"ghc-prim/dist-install/build/GHC/Classes.hi")

Main Benefit of Removal

The clever .hi scheme caused CoreLint errors when combined with -flate-dmd-anal. I irresponsibly cannot remember the recipe for this bug. It was triggered in one of three ways: building GHC, running nofib, or running ./validate.

Similar to -flate-dmd-anal, abandoning the clever .hi scheme lets us safely import code compiled with/without -ffun-to-thunk from a module compiled without/with -ffun-to-thunk. I can explain this one.

  • Compile A.hs with -ffun-to-thunk
  • Compile B.hs, which imports A, without -ffun-to-thunk

If demand analysis removes all the value arguments from a function f in A.hs and B.hs uses that function, compilation of B.hs will crash. The problem is that the regeneration of the body of f in B will attempt to apply f to a realWorld# argument because there is no -ffun-to-thunk flag. However, f no longer accepts any arguments, since it was compiled with -ffun-to-thunk. Boom.

(The -flate-dmd-anal bug was similar, but more involved.)
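The -ffun-to-thunk mismatch described above can be mimicked in source Haskell. This is only a sketch under stated assumptions: all names are hypothetical, the bodies are placeholders, and the real transformation happens in Core rather than in source. It shows the two shapes a worker can take once every value argument has been removed, and the wrapper that matches each shape.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (RealWorld, State#, realWorld#)

-- Worker shape WITHOUT -ffun-to-thunk: a dummy state-token argument
-- is kept so that the worker is still a function.
workerAsFunction :: State# RealWorld -> Int
workerAsFunction _ = sum [1 .. 100]

-- Worker shape WITH -ffun-to-thunk: no arguments are left, so the
-- worker is a thunk.
workerAsThunk :: Int
workerAsThunk = sum [1 .. 100]

-- Wrapper an importer would regenerate when it assumes no -ffun-to-thunk.
wrapperNoFunToThunk :: Int -> Int
wrapperNoFunToThunk _ = workerAsFunction realWorld#

-- Wrapper that matches a worker compiled with -ffun-to-thunk.
wrapperFunToThunk :: Int -> Int
wrapperFunToThunk _ = workerAsThunk

-- The mismatch pairs the first wrapper with the second worker:
--   broken _ = workerAsThunk realWorld#
-- which applies a non-function to an argument and is rejected.

main :: IO ()
main = print (wrapperNoFunToThunk 0, wrapperFunToThunk 0) -- prints (5050,5050)
```

Storing the wrapper body in the .hi file sidesteps the problem, because the importing module no longer has to guess which of the two shapes the worker has.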

-flate-dmd-anal

-flate-dmd-anal adds a second demand analysis with a subsequent invocation of the simplifier just before CorePrep. Cf #7782
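As a minimal sketch of how the flag can be switched on for a single module, the example below uses an OPTIONS_GHC pragma. The module contents (the mean function) are a hypothetical placeholder, and no claim is made that the late analysis changes the code generated for this particular program.

```haskell
{-# OPTIONS_GHC -O2 -flate-dmd-anal #-}

import Data.List (foldl')

-- Hypothetical example function: arithmetic mean with a pair of
-- accumulators (running sum and element count).
mean :: [Double] -> Double
mean xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (a, c) x = (a + x, c + 1)

main :: IO ()
main = print (mean [1, 2, 3, 4]) -- prints 2.5
```

The same effect can be had for a whole build by passing -O2 -flate-dmd-anal on the ghc command line instead of using the pragma.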

Effect on .hi file size and .a file size

The second demand analysis generates more worker/wrapper splits, so it also generates larger .hi files and larger .o files. The numbers in this section measure the difference between -O2 -flate-dmd-anal and -O2 -fno-late-dmd-anal on my 64-bit Mac OS X machine.

It's based on the size of the .hi and .a files in $TOPDIR/libraries.

              .hi bytes    .a bytes
no late-dmd
late-dmd
difference     +552,057    +684,696

These are the big .hi changes over 10K.

(growth bytes,  module)
(35807,"base/dist-install/build/Data/Data.hi")
(54562,"template-haskell/dist-install/build/Language/Haskell/TH/Syntax.hi")
(59000,"Cabal/Cabal/dist-install/build/Distribution/PackageDescription.hi")
(69900,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")

These are the big .hi changes over 10%.

(growth fraction,  module)
(0.10158001494608733,"haskeline/dist-install/build/System/Console/Haskeline/Command.hi")
(0.10499966324675034,"hoopl/dist-install/build/Compiler/Hoopl/MkGraph.hi")
(0.11207246180884142,"haskeline/dist-install/build/System/Console/Haskeline/Command/Undo.hi")
(0.11254620966637761,"transformers/dist-install/build/Control/Applicative/Lift.hi")
(0.11394046020649104,"base/dist-install/build/GHC/Event/Thread.hi")
(0.11417453220731909,"dph/dph-lifted-base/dist-install/build/Data/Array/Parallel/PArray/Reference.hi")
(0.11493796526054591,"hoopl/dist-install/build/Compiler/Hoopl/XUtil.hi")
(0.11842105263157894,"dph/dph-prim-seq/dist-install/build/Data/Array/Parallel/Unlifted/Sequential/Extracts.hi")
(0.1252496671105193,"base/dist-install/build/Control/Concurrent/QSemN.hi")
(0.13623208379272325,"base/dist-install/build/Numeric.hi")
(0.174892616905746,"haskeline/dist-install/build/System/Console/Haskeline/Backend/DumbTerm.hi")
(0.17564356435643563,"base/dist-install/build/Data/Ratio.hi")
(0.1764402762032361,"base/dist-install/build/Control/Concurrent/QSem.hi")
(0.2952895972676818,"dph/dph-lifted-copy/dist-install/build/Data/Array/Parallel/Lifted/TH/Repr.hi")
(0.3762859126952084,"template-haskell/dist-install/build/Language/Haskell/TH/Lib.hi")

These are the big .a changes over 10K.

growth bytes  module
      -19408  libHSdph-prim-par-0.8.0.1.a
      -16976  libHSdph-prim-seq-0.8.0.1.a
       10440  libHShoopl-3.10.0.0.a
       11120  libHStransformers-0.3.0.0.a
       11472  libHSold-time-1.1.0.1.a
       22584  libHStime-1.4.0.2.a
       30168  libHSdph-lifted-copy-0.8.0.1.a
       35224  libHSvector-0.9.1.a
       44448  libHScontainers-0.5.0.0.a
       48408  libHShaskeline-0.7.0.4.a
      115104  libHStemplate-haskell-2.9.0.0.a
      120936  libHSbase-4.7.0.0.a
      237088  libHSCabal-1.17.0.a

New performance numbers

I'm using commit

I use these abbreviations in the following tables

00 - no late dmd analysis on either libs or nofib tests
10 - late demand analysis on libs, but not on nofib tests
11 - late demand analysis on both libs and nofib tests

build.mk included

SRC_HC_OPTS     = -O -H64m
GhcStage1HcOpts = -O -fasm
GhcStage2HcOpts = -O2 -fasm
GhcHcOpts       = -Rghc-timing
GhcLibHcOpts    = -O2

SplitObjs          = NO

DYNAMIC_BY_DEFAULT   = NO
DYNAMIC_GHC_PROGRAMS = NO

The changes in binary size were the same on my two test platforms so far (both 64-bit). It looks like we are essentially seeing the effects of an increase in the size of the base library. The smallest programs increased by +1.1% in both 10 and 11. Other programs usually had ~0.1% difference in 10 and 11. nucleic2 has about a +1% from 10 to 11, but that is a known anomaly (cf. the discussion in "Old performance numbers" below).

Binary Sizes

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
        -1 s.d.                -----           +0.4%           +0.4%
        +1 s.d.                -----           +0.7%           +0.7%
        Average                -----           +0.6%           +0.6%

2.7 GHz Core i7 MacBook Pro, 16 GB, 64-bit

mode=norm NoFibRuns=30
Allocations

-- NB nucleic2 and cryptarithm2 are explained in the "Old performance numbers" section below.

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
       cichelli             80307264           +0.0%          -22.9%
        mandel2              1041544           +0.0%          -21.4%
reverse-complem            150153040          -13.2%          -13.2%
          fasta            401153024           -9.1%           -9.1%
      integrate            474063360           +0.0%           -5.1%
   k-nucleotide           4125099504           -0.0%           -4.8%
        knights              1968072           +0.0%           -3.8%
         fulsom            323486224           +0.0%           -2.6%
      transform            696343224           +0.0%           -2.4%

       -- everything else changed less

       nucleic2             87567072           +0.0%           +3.4%
   cryptarithm2             24028936           +0.0%           +4.2%

        -1 s.d.                -----           -1.9%           -4.8%
        +1 s.d.                -----           +1.5%           +3.1%
        Average                -----           -0.2%           -0.9%

Run Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
           life                 0.23          -13.0%          -13.0%

       -- everything else changed less

   binary-trees                 0.61           +6.3%           +5.9%

        -1 s.d.                -----           -3.5%           -4.1%
        +1 s.d.                -----           +2.9%           +2.3%
        Average                -----           -0.4%           -0.9%

Elapsed Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
      compress2                 0.23          -14.2%          -17.7%
      typecheck                 0.20           +2.0%           -8.9%
           life                 0.26          -12.3%           -6.2%
         simple                 0.24           -9.0%           -4.9%

       -- everything else changed less

            hpg                 0.21           -1.9%           +6.7%
reverse-complem                 0.27          +13.5%          +12.8%

        -1 s.d.                -----           -5.7%           -5.6%
        +1 s.d.                -----           +4.2%           +4.3%
        Average                -----           -0.9%           -0.8%

really big many-core server, 48 GB, 64-bit

mode=norm NoFibRuns=30
Allocations

-- NB nucleic2 and cryptarithm2 are explained in the "Old performance numbers" section below.

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
       cichelli             80307264           +0.0%          -22.9%
        mandel2              1041544           +0.0%          -21.4%
reverse-complem            150153040          -13.2%          -13.2%
          fasta            401153024           -9.1%           -9.1%
      integrate            474063360           +0.0%           -5.1%
   k-nucleotide           4125099504           -0.0%           -4.8%
        knights              1968072           +0.0%           -3.8%
         fulsom            323486224           +0.0%           -2.6%
      transform            696343224           +0.0%           -2.4%
            ida            128551480           +0.0%           -1.2%
        parstof              3102544           +0.0%           -1.4%
         simple            226411568           -0.0%           -1.0%

       -- everything else changed less

           bspt             12285840           +0.0%           +1.2%
       nucleic2             87567496           +0.0%           +3.4%
   cryptarithm2             24028936           +0.0%           +4.2%

        -1 s.d.                -----           -1.9%           -4.8%
        +1 s.d.                -----           +1.5%           +3.1%
        Average                -----           -0.2%           -0.9%

Run Time


-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
         simple                 0.27           -2.6%           -6.4%
      transform                 0.39           -1.3%           -5.1%
          fasta                 0.59           -2.5%           -4.7%

       -- everything else changed less

          kahan                 0.30           +3.6%           +3.9%
   binary-trees                 0.88           +7.2%           +6.9%
      typecheck                 0.24           +8.3%           +8.3%
         hidden                 0.49           +4.1%          +10.2%

        -1 s.d.                -----           -1.7%           -3.0%
        +1 s.d.                -----           +2.9%           +3.5%
        Average                -----           +0.6%           +0.2%

Elapsed Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
         simple                 0.27           -2.6%           -6.8%
      transform                 0.39           -1.3%           -5.1%
          fasta                 0.59           -2.7%           -3.7%

       -- everything else changed less

   binary-trees                 0.88           +7.3%           +6.9%
      typecheck                 0.24           +8.3%           +8.3%
         hidden                 0.49           +4.1%          +10.1%

        -1 s.d.                -----           -1.6%           -2.9%
        +1 s.d.                -----           +3.1%           +3.6%
        Average                -----           +0.7%           +0.3%

mode=slow NoFibRuns=30
Allocations

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
       cichelli             80307264           +0.0%          -22.9%
        mandel2              1041544           +0.0%          -21.4%
reverse-complem           1500677840          -13.2%          -13.2%
          fasta           4005660304           -9.1%           -9.1%
      integrate            948063920           +0.0%           -5.1%
   k-nucleotide          41144014840           +0.0%           -4.9%
         fulsom            323486224           +0.0%           -2.6%
      transform           1389145136           +0.0%           -2.4%
         genfft           1796463848           +0.0%           -1.2%
            ida            733628984           +0.0%           -1.0%
        parstof              3102544           +0.0%           -1.4%
         simple            226411568           -0.0%           -1.0%

       -- everything else changed less

           bspt             12285840           +0.0%           +1.2%
       nucleic2             87567496           +0.0%           +3.4%
   cryptarithm2             24028936           +0.0%           +4.2%

        -1 s.d.                -----           -1.9%           -4.7%
        +1 s.d.                -----           +1.5%           +3.1%
        Average                -----           -0.2%           -0.9%

Run Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
         mandel                 0.22           -9.1%           -9.1%
      transform                 0.80           -0.3%           -8.7%
reverse-complem                 1.39           -5.9%           -6.1%
         simple                 0.26           -1.4%           -5.2%
          fasta                 5.84           -3.9%           -4.2%
    gen_regexps                 1.01           -4.6%           -4.7%

       -- everything else changed less

      paraffins                 1.00           +0.2%           +3.4%
      typecheck                 0.49          +10.2%           +8.2%
         hidden                 0.49           +4.1%          +10.2%

        -1 s.d.                -----           -2.6%           -3.3%
        +1 s.d.                -----           +2.9%           +2.7%
        Average                -----           +0.1%           -0.3%

Elapsed Time

-------------------------------------------------------------------------------
        Program                   00              10              11
-------------------------------------------------------------------------------
         mandel                 0.22           -9.1%           -9.1%
      transform                 0.80           +0.0%           -8.5%
reverse-complem                 1.39           -5.9%           -5.8%
         simple                 0.27           -2.1%           -5.2%
          fasta                 5.86           -3.9%           -4.2%
    gen_regexps                 1.01           -4.5%           -4.6%

       -- everything else changed less

      paraffins                 1.00           +0.2%           +3.7%
      typecheck                 0.49          +10.2%           +8.2%
         hidden                 0.49           +4.5%          +10.2%

        -1 s.d.                -----           -2.6%           -3.2%
        +1 s.d.                -----           +2.9%           +2.8%
        Average                -----           +0.1%           -0.3%

Old performance numbers

NB These were from April 2013.

Here are the effects on nofib. Run time didn't seem to change as drastically. The "X/Y" column headers mean "library-flags/test-flags" given to GHC when compiling the respective bit.

Allocations

-------------------------------------------------------------------------------
        Program                O2/O2     late-dmd+O2/O2    late-dmd+O2/late-dmd+O2
-------------------------------------------------------------------------------
   cryptarithm2             25078168           +0.0%           +8.0%
       nucleic2             98331744           +0.0%           +3.2%

       -- everything else changed less

       cichelli             80310632           +0.0%          -22.9%
          fasta            401159024           -9.1%           -9.1%
         fulsom            321427240           +0.0%           -2.6%
   k-nucleotide           4125102928           -0.0%           -4.8%
        knights              2037984           +0.0%           -3.7%
        mandel2              1041840           +0.0%          -21.4%
        parstof              3103208           +0.0%           -1.4%
reverse-complem            155188304          -12.8%          -12.8%
         simple            226412800           -0.0%           -1.0%

All other changes are less than 1% in allocation. Note that it improves a couple of tests significantly just via changes in the base libraries.

For cryptarithm2 (cf. remarks in #4941):

  • a 4% increase in allocation is due to reboxing,
  • another 4% is due to dead closures, because the fix in #4962 isn't working for some reason.
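To make "reboxing" concrete, here is a hand-written sketch of how a worker can end up allocating a box that the caller already had. It is not taken from cryptarithm2; the names record, wStep and step are hypothetical, and the worker/wrapper pair is written by hand to imitate what GHC would produce in Core.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Int#, isTrue#, (>#))

-- Stand-in for any function that needs its argument boxed,
-- here because it stores the value in a list.
record :: Int -> [Int] -> [Int]
record n acc = n : acc
{-# NOINLINE record #-}

-- Worker: the argument arrives unboxed because the original function
-- is strict in it (it is compared against 0). Passing it on to
-- `record` needs a box again, so a fresh `I# x` is allocated here.
wStep :: Int# -> [Int] -> [Int]
wStep x acc =
  if isTrue# (x ># 0#)
    then record (I# x) acc -- reboxing: extra allocation
    else acc

-- Wrapper: unboxes the caller's Int and discards the original box.
step :: Int -> [Int] -> [Int]
step (I# x) acc = wStep x acc
{-# INLINE step #-}

main :: IO ()
main = print (step 3 [], step (-1) [7]) -- prints ([3],[7])
```

Without the split, the caller's original box would simply have been passed through to record, so the split trades one unboxed argument for one additional allocation on this path.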

For nucleic2, in var_most_distant_atom, a let-bound function is inlined after w/w, and hence grows numerous closures by a significant amount. I'm not sure where to lay the blame for this. Note, however, that just making nucleic2's data types use strict !Float fields changes its allocation by -72.4%, so maybe this "bad practice" corner case is a small issue.
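The strict-field change mentioned for nucleic2 amounts to adding bang annotations to the data declarations. The sketch below uses hypothetical point types rather than the actual nucleic2 definitions, and it only illustrates the syntax and the semantic difference; it does not reproduce the measured allocation change.

```haskell
-- Lazy fields: each field may hold an unevaluated thunk.
data PtLazy = PtLazy Float Float Float

-- Strict fields: each field is forced when the constructor is applied,
-- so consumers know the components are already evaluated. With
-- optimisation, GHC may additionally unbox such fields.
data PtStrict = PtStrict !Float !Float !Float

sq :: Float -> Float
sq d = d * d

distSqLazy :: PtLazy -> PtLazy -> Float
distSqLazy (PtLazy x1 y1 z1) (PtLazy x2 y2 z2) =
  sq (x1 - x2) + sq (y1 - y2) + sq (z1 - z2)

distSqStrict :: PtStrict -> PtStrict -> Float
distSqStrict (PtStrict x1 y1 z1) (PtStrict x2 y2 z2) =
  sq (x1 - x2) + sq (y1 - y2) + sq (z1 - z2)

main :: IO ()
main = do
  print (distSqLazy (PtLazy 0 0 0) (PtLazy 1 2 2))       -- prints 9.0
  print (distSqStrict (PtStrict 0 0 0) (PtStrict 1 2 2)) -- prints 9.0
```

Both versions compute the same results; the difference lies in when the fields are evaluated and how much heap the intermediate values occupy.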