Changes between Version 12 and Version 13 of Commentary/Compiler/RecompilationAvoidance


Ignore:
Timestamp:
Jun 20, 2008 1:33:23 PM (7 years ago)
Author:
simonmar
Comment:

Update for new fingerprint story

Legend:

Unmodified
Added
Removed
Modified
  • Commentary/Compiler/RecompilationAvoidance

    v12 v13  
    1111   is not necessary.
    1212
    13 If both of these hold, GHC stops compilation early, because the existing object code and interface are still valid.  In GHCi and `--make`, we must generate the `ModDetails` from the `ModIface`, but this is easily done by calling `MkIface.typecheckIface`.
     13If both of these hold, GHC stops compilation early, because the
     14existing object code and interface are still valid.  In GHCi and
     15`--make`, we must generate the `ModDetails` from the `ModIface`, but
     16this is easily done by calling `MkIface.typecheckIface`.
    1417
    1518== Example ==
     
    6871=== GHCi and `--make` ===
    6972
    70 The simple fact is that when you make a small change to a large program, it is often not necessary to recompile
    71 every module that depends directly or indirectly on something that changed.  In GHCi and `--make`, GHC considers every module in the program in dependency order, and decides whether it needs to be recompiled, or whether the existing object code and interface will do.
     73The simple fact is that when you make a small change to a large
     74program, it is often not necessary to recompile every module that
     75depends directly or indirectly on something that changed.  In GHCi and
     76`--make`, GHC considers every module in the program in dependency
     77order, and decides whether it needs to be recompiled, or whether the
     78existing object code and interface will do.
    7279
    7380=== `make` ===
     
    7582`make` works by checking the timestamps on dependencies and
    7683recompiling things when the dependencies are newer.  Dependency lists
    77 for `make` look like this:
     84for `make` look like this (generated by `ghc -M`):
    7885
    7986{{{
     
    101108    as before, there is no need to modify the `.hi` file.  If the `.hi`
    102109    file is not modified by the compilation, then `make` will notice
    103     and not recompile `B` or `C`.  This is an important optimisation.
     110    and not recompile `B` or `C`, or indeed `A`.  This is an important
     111    optimisation.
    104112
    105113  * Suppose the change to `D` did cause a change in the interface
    106114    (e.g. the type of `f` changed).  Now, `make` will recompile both
    107     `B` and `C`.  Now, suppose that the interfaces to `B` and `C`
     115    `B` and `C`.  Suppose that the interfaces to `B` and `C`
    108116    remain the same: B's interface says only that it re-exports `D.f`,
    109117    so the fact that `f` has a new type does not affect `B`'s
     
    142150== How does it work? ==
    143151
     152We use [http://en.wikipedia.org/wiki/Fingerprint_%28computing%29 fingerprints] to uniquely identify the interface exposed by a module,
     153and to detect when it changes.  In particular, we currently use
     154128-bit hashes produced by the MD5 algorithm (see
     155[[GhcFile(compiler/utils/Fingerprint.hsc)]]).
     156
    144157An [wiki:Commentary/Compiler/IfaceFiles interface file] contains:
    145158
    146  * The module ''version'' (more about this below)
     159 * Various fingerprints:
     160   * The ''interface hash'', which depends on the entire contents of the
     161     interface file.  This is used to detect whether we should
     162     update the interface on disk after recompiling the module.  If the
     163     interface didn't change at all, then we don't want to touch the
     164     on-disk version because that would cause `make` to perform more
     165     compilations.
     166   * The ''ABI hash'', which depends on everything that the module
     167     exposes about its implementation: think of this as a hash of
     168     ''exports'' and ''decls''.
     169   * The ''export-list hash'', which depends on the contents of the
     170     export list (a hash of ''exports'').
     171   * The ''orphan hash'', which depends on all the orphan
     172     instances/rules in the, and the orphan hashes of all orphan
     173     modules below this module in the dependency tree (see "Orphans"
     174     later).
    147175 * ''exports'': what the module exports
     176 * ''dependencies'': modules and packages that this module depends on
     177 * ''usages'': what specific entities the module depends on
    148178 * ''decls'': what the module defines
    149  * ''dependencies'' and ''usages'': what the module depends on
    150179 * various other stuff, but the above are the important bits
    151180
    152 The module version starts at 1, and is increased each time the module
    153 is compiled and either the ''exports'' or ''decls'' changes.
     181To look at the contents of an interface, use `ghc --show-iface`.  For
     182example, here's the output of `ghc --show-iface D.hi` for the module
     183`D` in our example:
     184
     185{{{
     186interface main:D 6090
     187  interface hash: 413dacc4c360257e9fb06ad0c13d9fc9
     188  ABI hash: 0c5278c6f22844f996006259c9f551c8
     189  export-list hash: cb9dd0d414976d16451bdfe51a021d7d
     190  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e
     191  where
     192export main:D T f h
     193module dependencies:
     194package dependencies: base integer ghc-prim
     195orphans: base:GHC.Base base:GHC.Num
     196family instance modules:
     197import base:GHC.Num 7a6f0c12ee2413f4c07165103924bd61
     198import base:Prelude ae91aa1798ed1ac514cde3dc4c921717
     199f4be8645a3e4099b4be0cfc42e976ed7
     200  data T a b
     201      RecFlag NonRecursive
     202      Generics: no
     203      {- abstract -}
     204      FamilyInstance: none
     2050bf838776ef3bc5671e369aec15c3b16
     206  f :: GHC.Base.Int -> GHC.Base.Int
     207aa9b8cdef43b6852bcc6f61f6ad6c584
     208  h :: GHC.Base.Int -> GHC.Base.Int
     209}}}
     210
     211Lines beginning `import` are the ''usages'', and after the usages are
     212the decls.
    154213
    155214=== Deciding whether to recompile ===
     
    179238
    180239First, the direct imports of the current module are resolved to
    181 `Module`s using `findModule` (a `Module` contains a module name and a
     240`Module`s using `Finder.findModule` (a `Module` contains a module name and a
    182241package identifier).  If any of those `Module`s are not listed amongst
    183242the dependencies of the old interface file, then either:
     
    190249and we must recompile.
    191250
    192 Second, the ''usages'' of the module are checked.  The usages in the
    193 interface file contains a list of every name that was ''used'' during
    194 typechecking: that is, everything external that is referred to by the
    195 source code.
     251Second, the ''usages'' of the module are checked.  The usages contains
     252two types of information:
     253
     254 * for a module that was imported, the export-list fingerprint of the
     255   imported module is recorded.  If any of the modules we imported now
     256   has a different export list we must recompile, so we check the
     257   current export-list fingerprints against those recorded in the
     258   usages.
     259
     260 * for every external name mentioned in the source code, the
     261   fingerprint of that name is recorded in the usages.  This is so
     262   that if we mention for example an external function `M.f`, we'll
     263   recompile if `M.f`'s type has changed, or anything referred to
     264   by `M.f`'s type has changed, or `M.f`'s unfolding has changed
     265   (when -O is on), and so on.
    196266
    197267The interface files for everything in the usages are read (they'll
     
    240310{{{
    241311$ ghc --show-iface D.hi
    242 interface main:D 3 6081 where
     312interface main:D 6090
     313  interface hash: 0a7e886588b3799d909cca39be4b9232
     314  ABI hash: 8d5cfe1723f32a5b53ded43bf9a1e55b
     315  export-list hash: cb9dd0d414976d16451bdfe51a021d7d
     316  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e
     317  where
    243318export main:D T f h
    244 module dependencies:
    245 package dependencies: base
    246 3 f :: GHC.Base.Int -> GHC.Base.Int
     319674f7fa7c2b13b368042f409007b1f29
     320  data T a b
     321      RecFlag NonRecursive
     322      Generics: no
     323      = C1 :: forall a b. a -> T a b Stricts: _ |
     324        C2 :: forall a b. b -> T a b Stricts: _
     325      FamilyInstance: none
     326790791c346a0d5965feece84894360a6
     327  f :: GHC.Base.Int -> GHC.Base.Int
    247328    {- Arity: 1 HasNoCafRefs Strictness: U(L)m
    248329       Unfolding: (__inline_me (\ x :: GHC.Base.Int ->
    249330                                GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -}
    250 h :: GHC.Base.Int -> GHC.Base.Int
    251   {- Arity: 1 HasNoCafRefs Strictness: U(L)m
    252      Unfolding: (\ x :: GHC.Base.Int ->
    253                  case @ GHC.Base.Int x of wild { GHC.Base.I# x1 ->
    254                  GHC.Base.I# (GHC.Prim.+# x1 3) }) -}
    255 data T a b
    256     RecFlag NonRecursive
    257     Generics: no
    258     = C1 :: forall a b. a -> T a b Stricts: _ |
    259       C2 :: forall a b. b -> T a b Stricts: _
    260     FamilyInstance: none
    261 }}}
    262 
    263 Note from the first line that this is version 3.  The first change was
    264 to the definition of `D.f` above, and the second change we made just
    265 now was to add the INLINE pragma to `D.f`.  The version of `D.f` is
    266 also 3 (it has changed twice), and the version of `D.h` is still 1
    267 (the version number is omitted if it is 1).
     331a0a944e487ebec76031e0672e70bc923
     332  h :: GHC.Base.Int -> GHC.Base.Int
     333    {- Arity: 1 HasNoCafRefs Strictness: U(L)m
     334       Unfolding: (\ x :: GHC.Base.Int ->
     335                   case @ GHC.Base.Int x of wild { GHC.Base.I# x1 ->
     336                   GHC.Base.I# (GHC.Prim.+# x1 3) }) -}
     337}}}
    268338
    269339Note that the unfolding of `D.f` mentions `D.h`.
     
    273343{{{
    274344$ ghc -O --show-iface D.hi
    275 interface main:D 4 6081 where
     345interface main:D 6090
     346  interface hash: 55385e568aa80955acbd1b7370041890
     347  ABI hash: accc0413d94e27c90dff8427f4aafe6e
     348  export-list hash: cb9dd0d414976d16451bdfe51a021d7d
     349  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e
     350  where
    276351export main:D T f h
    277 module dependencies:
    278 package dependencies: base
    279 4 h :: GHC.Base.Int -> GHC.Base.Int
     352674f7fa7c2b13b368042f409007b1f29
     353  data T a b
     354      RecFlag NonRecursive
     355      Generics: no
     356      = C1 :: forall a b. a -> T a b Stricts: _ |
     357        C2 :: forall a b. b -> T a b Stricts: _
     358      FamilyInstance: none
     3597f2bf159cceae5306ce709db96720f4a
     360  f :: GHC.Base.Int -> GHC.Base.Int
     361    {- Arity: 1 HasNoCafRefs Strictness: U(L)m
     362       Unfolding: (__inline_me (\ x :: GHC.Base.Int ->
     363                                GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -}
     3645c53713aa59b760e51e6d47173f95b4e
     365  h :: GHC.Base.Int -> GHC.Base.Int
    280366    {- Arity: 1 HasNoCafRefs Strictness: U(L)m
    281367       Unfolding: (\ x :: GHC.Base.Int ->
    282368                   case @ GHC.Base.Int x of wild { GHC.Base.I# x1 ->
    283369                   GHC.Base.I# (GHC.Prim.+# x1 4) }) -}
    284 4 f :: GHC.Base.Int -> GHC.Base.Int
    285     {- Arity: 1 HasNoCafRefs Strictness: U(L)m
    286        Unfolding: (__inline_me (\ x :: GHC.Base.Int ->
    287                                 GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -}
    288 data T a b
    289     RecFlag NonRecursive
    290     Generics: no
    291     = C1 :: forall a b. a -> T a b Stricts: _ |
    292       C2 :: forall a b. b -> T a b Stricts: _
    293     FamilyInstance: none
    294 }}}
    295 
    296 The version of `D.h` jumped to 4: things that change get the new
    297 module version, rather than just increasing by 1.  This is so that
    298 if you remove something and then add it back in again, the version
    299 number does not go down.
    300 
    301 However, the version of `D.f` also changed.  This is vital, because
    302 anything that referred to `D.f` must be recompiled, because it may now
    303 see the new unfolding for `D.h`.
    304 
    305 In summary:
    306 
    307  * The version of an entity is bumped whenever either
    308    * the entity's definition changed, or
    309    * the version of anything that the definition refers to changed.
    310  * This means that when recording usages, we only have to record
    311    usages on things that are directly mentioned, not their
    312    transitive closure.
    313 
    314 == Alternative design choices ==
    315 
    316 === Fingerprints instead of versions ===
    317 
    318 Version numbers in this context have a couple of problems:
    319 
    320  * New version numbers are calculated from the old version numbers, so we have to
    321    keep the old interface around to generate the new one.  (but we need to keep
    322    the old interface around anyway, because we might avoid touching it if it
    323    doesn't change).
    324 
    325  * Removing an interface file can have disastrous effects, as all its version
    326    numbers will be reset when it is next compiled.  To be fair, this doesn't seem
    327    to affect many people (that we know of).
    328 
    329  * We want to detect changes in package dependencies (see #1372): right now, we only
    330    record a dependency on the package.  The right thing to do is to record a dependency
    331    on the module that we import, so we can detect when it changes.  But packages are
    332    generally compiled from scratch and then installed, so typically all the version
    333    numbers will be 1, and our dependency is not telling us when the package has changed.
    334 
    335 Using fingerprints or hashes instead of version numbers is morally the right thing to do.  A fingerprint can be calculated independently of the previous version of the interface interface, and it solves the package dependency problem.
    336 
    337 === Module-granularity instead of entity-granularity ===
    338 
    339 Fingerprints or hashes are larger than version numbers (typically 128 bits instead of 32), so we might not want to put a fingerprint on every entity.
    340 
    341 Also, it's worth thinking about whether tracking changes at a lower level of granularity would result in a system that is simpler to implement, understand, and get right.  The obvious choice is to track modules instead of entities.  Clearly this loses some resolution, and will therefore entail more unnecessary recompilation, but how much?  Let's look at how it would work, and then some examples.
    342 
    343 Every module will have a single version number (or fingerprint), which is increased whenever there is a change to either the exports or the decls (but not the usages) in the interface, or the version of any module referred to by the exports or decls changes.  The usages of the module lists the modules and versions referred to by the source code (as before, except that we only record modules not entities).
    344 
    345 '''Examples'''.
    346 
    347 Suppose we change the definition of `T` in module D. 
    348 
    349  * We recompile D, and D's version will change from 1 to 2, because its decls have changed. 
    350  * B will be recompiled, because its usages lists D (since it exports `D.f`).  Since its exports lists
    351    `D.f`, and D's version has changed, B's version will also increase to 2.
    352  * C will be recompiled, but its version stays the same, since it has no exports or decls.
    353  * A will be recompiled, because its usage list includes D, and D's version changed.  A's version will
    354    stay the same, however.
    355 
    356 So in this case modifying `D.T` has forced recompilation of B and A, although under the current scheme neither of these modules would be recompiled.
    357 
    358 
    359 
     370}}}
     371
     372The fingerprint for `D.h` has changed, because we changed its
     373definition.  The fingerprint for `D.f` has also changed, because it
     374depends on `D.h`.  And consequently, the ABI hash has changed, and so
     375has the interface hash (although the export hash and orphan hash are
     376still the same).
     377
     378Why did the fingerprint for `D.f` have to change?  This is vital,
     379because anything that referred to `D.f` must be recompiled, because it
     380may now see the new unfolding for `D.h`.
     381
     382So the fingerprint of an entity represents not just the definition of
     383the entity itself, but also the definitions of all the entities
     384reachable from it - its transitive closure.  The consequence of this
     385is that when recording usages we only have to record the fingerprints
     386of entities that were referred to directly in the source code, because
     387the transitive nature of the fingerprint means that we'll recompile if
     388anything reachable from these entities changes.
     389
     390=== How does fingerprinting work? ===
     391
     392We calculate fingerprints by serialising the data to be fingerprinted
     393using the `Binary` module, and then running the md5 algorithm over the
     394serlialised data.  When the data contains external `Name`s, the
     395serialiser emits the fingerprint of the `Name`; this is the way that
     396the fingerprint of a declaration can be made to depend on the
     397fingerprints of the things it mentions.
     398
     399=== Mutually recursive groups of entities ===
     400
     401When fingerprinting a recursive group of entities, we fingerprint the
     402group as a whole.  If any of the definitions changes, the fingerprint
     403of every entity in the group changes.
     404
     405=== Instances ===
     406
     407Instances are tricky in Haskell, because they aren't imported or
     408exported explicitly.  Haskell requires that any instance defined in a
     409module directly or indirectly imported by the current module is
     410visible.  So how do we track instances for recompilation, such that if
     411a relevant instance is changed, added, or removed anywhere beneath the
     412current module we will trigger a recompilation?
     413
     414Here's how it works.  For each instance we pick a distinguished entity
     415to attach the instance to - possibly the class itself, or a type
     416constructor mentioned in the instance.  The entity we pick must be
     417defined in the current module; if there are none to pick, then the
     418instance is an orphan (more about those in the section on Orphans,
     419below).
     420
     421Having picked the distinguished entity, when fingerprinting that
     422entity we include the instances.  For example, consider an instance
     423for class C at type T.  Any module that could use this instance must
     424depend (directly or indirectly) on both C and T, so it doesn't matter
     425whether we attach the instance to C or T - either way it will be
     426included in the fingerprint of something that the module depends on.
     427In this way we can be sure that if someone adds a new instance, or
     428removes an existing instance, if the instance is relevant to a module
     429then it will affect the fingerprint of something that the module
     430depends on, and hence will trigger recompilation.
     431
     432This is perfectly safe, but implementing it exactly as described above
     433would lead to unnecessary recompilation.  For example
     434
     435{{{
     436module A (T) where
     437import B (C)
     438data T = ...
     439instance C t where ...
     440}}}
     441
     442now the instance `C T` will be attached to `T`.  But we really don't
     443want to include `C`'s fingerprint in `T`'s fingerprint, because that would
     444register an unnecessary dependency of `T` on `C`: every time `C`
     445changed, we'd recompile anything that depended on `T`.  The dependency
     446is unnecessary, because we know anything that needs the instance will
     447already depend on both `C` and `T`.
     448
     449So instead of creating a fingerprint that depends on the instance
     450declaration and everything it mentions, we fingerprint ''just the
     451declaration itself''.  This works by fingerprinting each name in the
     452declaration as its literal string, rather than the fingerprint of the
     453name.
     454
     455But what about the ''contents'' of the instance declaration?  With
     456optimisation on, the compiler may inline instances and their methods,
     457so if an instance changes we need to trigger recompilation for any
     458module that actually used the previous version of the instance.  This
     459works as follows.  An instance declaration looks like this
     460
     461{{{
     462instance GHC.Base.Eq [GHC.Bool.Bool] = GHC.Base.$f10
     463}}}
     464
     465Here `GHC.Base.$f10` is the "dictionary function", i.e. the value
     466representing this dictionary.  `GHC.Base.$f10` has a declaration of
     467its own, in the same interface, complete with a fingerprint.
     468
     469{{{
     470c3e94597bf9a532e094067b08c216493
     471  $f10 :: GHC.Base.Eq GHC.Bool.Bool
     472    {- HasNoCafRefs Strictness: m
     473       Unfolding: (GHC.Base.:DEq @ GHC.Bool.Bool GHC.Base.==2
     474       GHC.Base.$s$dm/=1)
     475}}}
     476
     477So we ensure that any module that used this instance registers the
     478fingerprint of `GHC.Base.$f10` in its usages (this is the single
     479exception to the rule that the usages only contains names that are
     480explicitly mentioned in the source file).
     481
     482=== Orphans ===
     483
     484What if we have no declaration to attach the instance to?  Instances
     485with no obvious parent are called ''orphans'', and GHC must read the
     486interface for any module that contains orphan instances below the
     487current module, just in case those instances are relevant when
     488compiling the current module.
     489
     490Orphans require special treatment in the recompilation checker.
     491
     492 * Every module has an ''orphan hash'', which is a fingerprint of all
     493   the orphan instances (and rules) in the current module.
     494
     495 * The ''export hash'' depends on the ''orphan hash'' of the current
     496   module, and all modules below the current module in the dependency
     497   tree.  This models the fact that all instances defined in modules
     498   below the current module are available to importers of this module.
     499
     500So if we add, delete, or modify an orphan instance, the orphan hash of
     501the current module will change, and so will the export hash of the
     502current module.  This will trigger recompilation of modules that
     503import the current module, which will cause their export hashes to
     504change, and so on up the dependency tree. 
     505
     506This means a lot of recompilation, but it is at least safe.  The trick
     507is to avoid orphan instances as far as possible, which is why GHC has
     508the warning flag `-fwarn-orphans`.
     509
     510=== Rules ===
     511
     512RULEs are treated very much like instances: they are attached to one
     513particular parent declaration, and if a suitable parent cannot be
     514found, they become orphans and are handled in the same way as orphan
     515instances.
     516
     517=== On ordering ===
     518
     519When fingerprinting a collection of things, for example the export
     520list, we must be careful to use a canonical ordering for the
     521collection.  Otherwise, if we recompile the module without making any
     522changes, we might get a different fingerprint due to accidental
     523reordering of the elements.
     524
     525Why would we get accidental reordering?  GHC relies heavily on
     526"uniques" internally (see [[GhcFile(basicTypes/Unique.lhs)]]): every
     527entity has a unique, and uniques are assigned semi-randomly.  Asking
     528for the contents of a `UniqSet` or `UniqFM` will return the elements in
     529order of their uniques, which may vary from run to run of the
     530compiler.
     531
     532The solution is to sort the elements using a stable ordering, such as
     533lexicographic ordering.