Changes between Version 12 and Version 13 of Commentary/Compiler/RecompilationAvoidance


Ignore:
Timestamp:
Jun 20, 2008 1:33:23 PM (7 years ago)
Author:
simonmar
Comment:

Update for new fingerprint story

Legend:

Unmodified
Added
Removed
Modified
  • Commentary/Compiler/RecompilationAvoidance

    v12 v13  
    1111   is not necessary. 
    1212 
    13 If both of these hold, GHC stops compilation early, because the existing object code and interface are still valid.  In GHCi and `--make`, we must generate the `ModDetails` from the `ModIface`, but this is easily done by calling `MkIface.typecheckIface`. 
     13If both of these hold, GHC stops compilation early, because the 
     14existing object code and interface are still valid.  In GHCi and 
     15`--make`, we must generate the `ModDetails` from the `ModIface`, but 
     16this is easily done by calling `MkIface.typecheckIface`. 
    1417 
    1518== Example == 
     
    6871=== GHCi and `--make` === 
    6972 
    70 The simple fact is that when you make a small change to a large program, it is often not necessary to recompile 
    71 every module that depends directly or indirectly on something that changed.  In GHCi and `--make`, GHC considers every module in the program in dependency order, and decides whether it needs to be recompiled, or whether the existing object code and interface will do. 
     73The simple fact is that when you make a small change to a large 
     74program, it is often not necessary to recompile every module that 
     75depends directly or indirectly on something that changed.  In GHCi and 
     76`--make`, GHC considers every module in the program in dependency 
     77order, and decides whether it needs to be recompiled, or whether the 
     78existing object code and interface will do. 
    7279 
    7380=== `make` === 
     
    7582`make` works by checking the timestamps on dependencies and 
    7683recompiling things when the dependencies are newer.  Dependency lists 
    77 for `make` look like this: 
     84for `make` look like this (generated by `ghc -M`): 
    7885 
    7986{{{ 
     
    101108    as before, there is no need to modify the `.hi` file.  If the `.hi` 
    102109    file is not modified by the compilation, then `make` will notice 
    103     and not recompile `B` or `C`.  This is an important optimisation. 
     110    and not recompile `B` or `C`, or indeed `A`.  This is an important 
     111    optimisation. 
    104112 
    105113  * Suppose the change to `D` did cause a change in the interface 
    106114    (e.g. the type of `f` changed).  Now, `make` will recompile both 
    107     `B` and `C`.  Now, suppose that the interfaces to `B` and `C` 
     115    `B` and `C`.  Suppose that the interfaces to `B` and `C` 
    108116    remain the same: B's interface says only that it re-exports `D.f`, 
    109117    so the fact that `f` has a new type does not affect `B`'s 
     
    142150== How does it work? == 
    143151 
     152We use [http://en.wikipedia.org/wiki/Fingerprint_%28computing%29 fingerprints] to uniquely identify the interface exposed by a module, 
     153and to detect when it changes.  In particular, we currently use 
     154128-bit hashes produced by the MD5 algorithm (see 
     155[[GhcFile(compiler/utils/Fingerprint.hsc)]]). 
     156 
    144157An [wiki:Commentary/Compiler/IfaceFiles interface file] contains: 
    145158 
    146  * The module ''version'' (more about this below) 
     159 * Various fingerprints: 
     160   * The ''interface hash'', which depends on the entire contents of the 
     161     interface file.  This is used to detect whether we should 
     162     update the interface on disk after recompiling the module.  If the 
     163     interface didn't change at all, then we don't want to touch the 
     164     on-disk version because that would cause `make` to perform more 
     165     compilations. 
     166   * The ''ABI hash'', which depends on everything that the module 
     167     exposes about its implementation: think of this as a hash of 
     168     ''exports'' and ''decls''. 
     169   * The ''export-list hash'', which depends on the contents of the 
     170     export list (a hash of ''exports''). 
     171   * The ''orphan hash'', which depends on all the orphan 
     172     instances/rules in the, and the orphan hashes of all orphan 
     173     modules below this module in the dependency tree (see "Orphans" 
     174     later). 
    147175 * ''exports'': what the module exports 
     176 * ''dependencies'': modules and packages that this module depends on 
     177 * ''usages'': what specific entities the module depends on 
    148178 * ''decls'': what the module defines 
    149  * ''dependencies'' and ''usages'': what the module depends on 
    150179 * various other stuff, but the above are the important bits 
    151180 
    152 The module version starts at 1, and is increased each time the module 
    153 is compiled and either the ''exports'' or ''decls'' changes. 
     181To look at the contents of an interface, use `ghc --show-iface`.  For 
     182example, here's the output of `ghc --show-iface D.hi` for the module 
     183`D` in our example: 
     184 
     185{{{ 
     186interface main:D 6090 
     187  interface hash: 413dacc4c360257e9fb06ad0c13d9fc9 
     188  ABI hash: 0c5278c6f22844f996006259c9f551c8 
     189  export-list hash: cb9dd0d414976d16451bdfe51a021d7d 
     190  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e 
     191  where 
     192export main:D T f h 
     193module dependencies: 
     194package dependencies: base integer ghc-prim 
     195orphans: base:GHC.Base base:GHC.Num 
     196family instance modules: 
     197import base:GHC.Num 7a6f0c12ee2413f4c07165103924bd61 
     198import base:Prelude ae91aa1798ed1ac514cde3dc4c921717 
     199f4be8645a3e4099b4be0cfc42e976ed7 
     200  data T a b 
     201      RecFlag NonRecursive 
     202      Generics: no 
     203      {- abstract -} 
     204      FamilyInstance: none 
     2050bf838776ef3bc5671e369aec15c3b16 
     206  f :: GHC.Base.Int -> GHC.Base.Int 
     207aa9b8cdef43b6852bcc6f61f6ad6c584 
     208  h :: GHC.Base.Int -> GHC.Base.Int 
     209}}} 
     210 
     211Lines beginning `import` are the ''usages'', and after the usages are 
     212the decls. 
    154213 
    155214=== Deciding whether to recompile === 
     
    179238 
    180239First, the direct imports of the current module are resolved to 
    181 `Module`s using `findModule` (a `Module` contains a module name and a 
     240`Module`s using `Finder.findModule` (a `Module` contains a module name and a 
    182241package identifier).  If any of those `Module`s are not listed amongst 
    183242the dependencies of the old interface file, then either: 
     
    190249and we must recompile. 
    191250 
    192 Second, the ''usages'' of the module are checked.  The usages in the 
    193 interface file contains a list of every name that was ''used'' during 
    194 typechecking: that is, everything external that is referred to by the 
    195 source code. 
     251Second, the ''usages'' of the module are checked.  The usages contains 
     252two types of information: 
     253 
     254 * for a module that was imported, the export-list fingerprint of the 
     255   imported module is recorded.  If any of the modules we imported now 
     256   has a different export list we must recompile, so we check the 
     257   current export-list fingerprints against those recorded in the 
     258   usages. 
     259 
     260 * for every external name mentioned in the source code, the 
     261   fingerprint of that name is recorded in the usages.  This is so 
     262   that if we mention for example an external function `M.f`, we'll 
     263   recompile if `M.f`'s type has changed, or anything referred to 
     264   by `M.f`'s type has changed, or `M.f`'s unfolding has changed 
     265   (when -O is on), and so on. 
    196266 
    197267The interface files for everything in the usages are read (they'll 
     
    240310{{{ 
    241311$ ghc --show-iface D.hi 
    242 interface main:D 3 6081 where 
     312interface main:D 6090 
     313  interface hash: 0a7e886588b3799d909cca39be4b9232 
     314  ABI hash: 8d5cfe1723f32a5b53ded43bf9a1e55b 
     315  export-list hash: cb9dd0d414976d16451bdfe51a021d7d 
     316  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e 
     317  where 
    243318export main:D T f h 
    244 module dependencies: 
    245 package dependencies: base 
    246 3 f :: GHC.Base.Int -> GHC.Base.Int 
     319674f7fa7c2b13b368042f409007b1f29 
     320  data T a b 
     321      RecFlag NonRecursive 
     322      Generics: no 
     323      = C1 :: forall a b. a -> T a b Stricts: _ | 
     324        C2 :: forall a b. b -> T a b Stricts: _ 
     325      FamilyInstance: none 
     326790791c346a0d5965feece84894360a6 
     327  f :: GHC.Base.Int -> GHC.Base.Int 
    247328    {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
    248329       Unfolding: (__inline_me (\ x :: GHC.Base.Int -> 
    249330                                GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -} 
    250 h :: GHC.Base.Int -> GHC.Base.Int 
    251   {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
    252      Unfolding: (\ x :: GHC.Base.Int -> 
    253                  case @ GHC.Base.Int x of wild { GHC.Base.I# x1 -> 
    254                  GHC.Base.I# (GHC.Prim.+# x1 3) }) -} 
    255 data T a b 
    256     RecFlag NonRecursive 
    257     Generics: no 
    258     = C1 :: forall a b. a -> T a b Stricts: _ | 
    259       C2 :: forall a b. b -> T a b Stricts: _ 
    260     FamilyInstance: none 
    261 }}} 
    262  
    263 Note from the first line that this is version 3.  The first change was 
    264 to the definition of `D.f` above, and the second change we made just 
    265 now was to add the INLINE pragma to `D.f`.  The version of `D.f` is 
    266 also 3 (it has changed twice), and the version of `D.h` is still 1 
    267 (the version number is omitted if it is 1). 
     331a0a944e487ebec76031e0672e70bc923 
     332  h :: GHC.Base.Int -> GHC.Base.Int 
     333    {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
     334       Unfolding: (\ x :: GHC.Base.Int -> 
     335                   case @ GHC.Base.Int x of wild { GHC.Base.I# x1 -> 
     336                   GHC.Base.I# (GHC.Prim.+# x1 3) }) -} 
     337}}} 
    268338 
    269339Note that the unfolding of `D.f` mentions `D.h`. 
     
    273343{{{ 
    274344$ ghc -O --show-iface D.hi 
    275 interface main:D 4 6081 where 
     345interface main:D 6090 
     346  interface hash: 55385e568aa80955acbd1b7370041890 
     347  ABI hash: accc0413d94e27c90dff8427f4aafe6e 
     348  export-list hash: cb9dd0d414976d16451bdfe51a021d7d 
     349  orphan hash: 693e9af84d3dfcc71e640e005bdc5e2e 
     350  where 
    276351export main:D T f h 
    277 module dependencies: 
    278 package dependencies: base 
    279 4 h :: GHC.Base.Int -> GHC.Base.Int 
     352674f7fa7c2b13b368042f409007b1f29 
     353  data T a b 
     354      RecFlag NonRecursive 
     355      Generics: no 
     356      = C1 :: forall a b. a -> T a b Stricts: _ | 
     357        C2 :: forall a b. b -> T a b Stricts: _ 
     358      FamilyInstance: none 
     3597f2bf159cceae5306ce709db96720f4a 
     360  f :: GHC.Base.Int -> GHC.Base.Int 
     361    {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
     362       Unfolding: (__inline_me (\ x :: GHC.Base.Int -> 
     363                                GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -} 
     3645c53713aa59b760e51e6d47173f95b4e 
     365  h :: GHC.Base.Int -> GHC.Base.Int 
    280366    {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
    281367       Unfolding: (\ x :: GHC.Base.Int -> 
    282368                   case @ GHC.Base.Int x of wild { GHC.Base.I# x1 -> 
    283369                   GHC.Base.I# (GHC.Prim.+# x1 4) }) -} 
    284 4 f :: GHC.Base.Int -> GHC.Base.Int 
    285     {- Arity: 1 HasNoCafRefs Strictness: U(L)m 
    286        Unfolding: (__inline_me (\ x :: GHC.Base.Int -> 
    287                                 GHC.Base.plusInt (D.h x) (GHC.Base.I# 1))) -} 
    288 data T a b 
    289     RecFlag NonRecursive 
    290     Generics: no 
    291     = C1 :: forall a b. a -> T a b Stricts: _ | 
    292       C2 :: forall a b. b -> T a b Stricts: _ 
    293     FamilyInstance: none 
    294 }}} 
    295  
    296 The version of `D.h` jumped to 4: things that change get the new 
    297 module version, rather than just increasing by 1.  This is so that 
    298 if you remove something and then add it back in again, the version 
    299 number does not go down. 
    300  
    301 However, the version of `D.f` also changed.  This is vital, because 
    302 anything that referred to `D.f` must be recompiled, because it may now 
    303 see the new unfolding for `D.h`. 
    304  
    305 In summary: 
    306  
    307  * The version of an entity is bumped whenever either  
    308    * the entity's definition changed, or 
    309    * the version of anything that the definition refers to changed. 
    310  * This means that when recording usages, we only have to record 
    311    usages on things that are directly mentioned, not their 
    312    transitive closure. 
    313  
    314 == Alternative design choices == 
    315  
    316 === Fingerprints instead of versions === 
    317  
    318 Version numbers in this context have a couple of problems: 
    319  
    320  * New version numbers are calculated from the old version numbers, so we have to 
    321    keep the old interface around to generate the new one.  (but we need to keep 
    322    the old interface around anyway, because we might avoid touching it if it 
    323    doesn't change). 
    324  
    325  * Removing an interface file can have disastrous effects, as all its version 
    326    numbers will be reset when it is next compiled.  To be fair, this doesn't seem 
    327    to affect many people (that we know of). 
    328  
    329  * We want to detect changes in package dependencies (see #1372): right now, we only 
    330    record a dependency on the package.  The right thing to do is to record a dependency 
    331    on the module that we import, so we can detect when it changes.  But packages are 
    332    generally compiled from scratch and then installed, so typically all the version 
    333    numbers will be 1, and our dependency is not telling us when the package has changed. 
    334  
    335 Using fingerprints or hashes instead of version numbers is morally the right thing to do.  A fingerprint can be calculated independently of the previous version of the interface interface, and it solves the package dependency problem. 
    336  
    337 === Module-granularity instead of entity-granularity === 
    338  
    339 Fingerprints or hashes are larger than version numbers (typically 128 bits instead of 32), so we might not want to put a fingerprint on every entity. 
    340  
    341 Also, it's worth thinking about whether tracking changes at a lower level of granularity would result in a system that is simpler to implement, understand, and get right.  The obvious choice is to track modules instead of entities.  Clearly this loses some resolution, and will therefore entail more unnecessary recompilation, but how much?  Let's look at how it would work, and then some examples. 
    342  
    343 Every module will have a single version number (or fingerprint), which is increased whenever there is a change to either the exports or the decls (but not the usages) in the interface, or the version of any module referred to by the exports or decls changes.  The usages of the module lists the modules and versions referred to by the source code (as before, except that we only record modules not entities). 
    344  
    345 '''Examples'''. 
    346  
    347 Suppose we change the definition of `T` in module D.   
    348  
    349  * We recompile D, and D's version will change from 1 to 2, because its decls have changed.   
    350  * B will be recompiled, because its usages lists D (since it exports `D.f`).  Since its exports lists 
    351    `D.f`, and D's version has changed, B's version will also increase to 2. 
    352  * C will be recompiled, but its version stays the same, since it has no exports or decls. 
    353  * A will be recompiled, because its usage list includes D, and D's version changed.  A's version will 
    354    stay the same, however. 
    355  
    356 So in this case modifying `D.T` has forced recompilation of B and A, although under the current scheme neither of these modules would be recompiled. 
    357  
    358  
    359  
     370}}} 
     371 
     372The fingerprint for `D.h` has changed, because we changed its 
     373definition.  The fingerprint for `D.f` has also changed, because it 
     374depends on `D.h`.  And consequently, the ABI hash has changed, and so 
     375has the interface hash (although the export hash and orphan hash are 
     376still the same). 
     377 
     378Why did the fingerprint for `D.f` have to change?  This is vital, 
     379because anything that referred to `D.f` must be recompiled, because it 
     380may now see the new unfolding for `D.h`. 
     381 
     382So the fingerprint of an entity represents not just the definition of 
     383the entity itself, but also the definitions of all the entities 
     384reachable from it - its transitive closure.  The consequence of this 
     385is that when recording usages we only have to record the fingerprints 
     386of entities that were referred to directly in the source code, because 
     387the transitive nature of the fingerprint means that we'll recompile if 
     388anything reachable from these entities changes. 
     389 
     390=== How does fingerprinting work? === 
     391 
     392We calculate fingerprints by serialising the data to be fingerprinted 
     393using the `Binary` module, and then running the md5 algorithm over the 
     394serlialised data.  When the data contains external `Name`s, the 
     395serialiser emits the fingerprint of the `Name`; this is the way that 
     396the fingerprint of a declaration can be made to depend on the 
     397fingerprints of the things it mentions. 
     398 
     399=== Mutually recursive groups of entities === 
     400 
     401When fingerprinting a recursive group of entities, we fingerprint the 
     402group as a whole.  If any of the definitions changes, the fingerprint 
     403of every entity in the group changes. 
     404 
     405=== Instances === 
     406 
     407Instances are tricky in Haskell, because they aren't imported or 
     408exported explicitly.  Haskell requires that any instance defined in a 
     409module directly or indirectly imported by the current module is 
     410visible.  So how do we track instances for recompilation, such that if 
     411a relevant instance is changed, added, or removed anywhere beneath the 
     412current module we will trigger a recompilation? 
     413 
     414Here's how it works.  For each instance we pick a distinguished entity 
     415to attach the instance to - possibly the class itself, or a type 
     416constructor mentioned in the instance.  The entity we pick must be 
     417defined in the current module; if there are none to pick, then the 
     418instance is an orphan (more about those in the section on Orphans, 
     419below). 
     420 
     421Having picked the distinguished entity, when fingerprinting that 
     422entity we include the instances.  For example, consider an instance 
     423for class C at type T.  Any module that could use this instance must 
     424depend (directly or indirectly) on both C and T, so it doesn't matter 
     425whether we attach the instance to C or T - either way it will be 
     426included in the fingerprint of something that the module depends on. 
     427In this way we can be sure that if someone adds a new instance, or 
     428removes an existing instance, if the instance is relevant to a module 
     429then it will affect the fingerprint of something that the module 
     430depends on, and hence will trigger recompilation. 
     431 
     432This is perfectly safe, but implementing it exactly as described above 
     433would lead to unnecessary recompilation.  For example 
     434 
     435{{{ 
     436module A (T) where 
     437import B (C) 
     438data T = ... 
     439instance C t where ... 
     440}}} 
     441 
     442now the instance `C T` will be attached to `T`.  But we really don't 
     443want to include `C`'s fingerprint in `T`'s fingerprint, because that would 
     444register an unnecessary dependency of `T` on `C`: every time `C` 
     445changed, we'd recompile anything that depended on `T`.  The dependency 
     446is unnecessary, because we know anything that needs the instance will 
     447already depend on both `C` and `T`. 
     448 
     449So instead of creating a fingerprint that depends on the instance 
     450declaration and everything it mentions, we fingerprint ''just the 
     451declaration itself''.  This works by fingerprinting each name in the 
     452declaration as its literal string, rather than the fingerprint of the 
     453name. 
     454 
     455But what about the ''contents'' of the instance declaration?  With 
     456optimisation on, the compiler may inline instances and their methods, 
     457so if an instance changes we need to trigger recompilation for any 
     458module that actually used the previous version of the instance.  This 
     459works as follows.  An instance declaration looks like this 
     460 
     461{{{ 
     462instance GHC.Base.Eq [GHC.Bool.Bool] = GHC.Base.$f10 
     463}}} 
     464 
     465Here `GHC.Base.$f10` is the "dictionary function", i.e. the value 
     466representing this dictionary.  `GHC.Base.$f10` has a declaration of 
     467its own, in the same interface, complete with a fingerprint. 
     468 
     469{{{ 
     470c3e94597bf9a532e094067b08c216493 
     471  $f10 :: GHC.Base.Eq GHC.Bool.Bool 
     472    {- HasNoCafRefs Strictness: m 
     473       Unfolding: (GHC.Base.:DEq @ GHC.Bool.Bool GHC.Base.==2 
     474       GHC.Base.$s$dm/=1) 
     475}}} 
     476 
     477So we ensure that any module that used this instance registers the 
     478fingerprint of `GHC.Base.$f10` in its usages (this is the single 
     479exception to the rule that the usages only contains names that are 
     480explicitly mentioned in the source file). 
     481 
     482=== Orphans === 
     483 
     484What if we have no declaration to attach the instance to?  Instances 
     485with no obvious parent are called ''orphans'', and GHC must read the 
     486interface for any module that contains orphan instances below the 
     487current module, just in case those instances are relevant when 
     488compiling the current module. 
     489 
     490Orphans require special treatment in the recompilation checker. 
     491 
     492 * Every module has an ''orphan hash'', which is a fingerprint of all 
     493   the orphan instances (and rules) in the current module. 
     494 
     495 * The ''export hash'' depends on the ''orphan hash'' of the current 
     496   module, and all modules below the current module in the dependency 
     497   tree.  This models the fact that all instances defined in modules 
     498   below the current module are available to importers of this module. 
     499 
     500So if we add, delete, or modify an orphan instance, the orphan hash of 
     501the current module will change, and so will the export hash of the 
     502current module.  This will trigger recompilation of modules that 
     503import the current module, which will cause their export hashes to 
     504change, and so on up the dependency tree.   
     505 
     506This means a lot of recompilation, but it is at least safe.  The trick 
     507is to avoid orphan instances as far as possible, which is why GHC has 
     508the warning flag `-fwarn-orphans`. 
     509 
     510=== Rules === 
     511 
     512RULEs are treated very much like instances: they are attached to one 
     513particular parent declaration, and if a suitable parent cannot be 
     514found, they become orphans and are handled in the same way as orphan 
     515instances. 
     516 
     517=== On ordering === 
     518 
     519When fingerprinting a collection of things, for example the export 
     520list, we must be careful to use a canonical ordering for the 
     521collection.  Otherwise, if we recompile the module without making any 
     522changes, we might get a different fingerprint due to accidental 
     523reordering of the elements. 
     524 
     525Why would we get accidental reordering?  GHC relies heavily on 
     526"uniques" internally (see [[GhcFile(basicTypes/Unique.lhs)]]): every 
     527entity has a unique, and uniques are assigned semi-randomly.  Asking 
     528for the contents of a `UniqSet` or `UniqFM` will return the elements in 
     529order of their uniques, which may vary from run to run of the 
     530compiler. 
     531 
     532The solution is to sort the elements using a stable ordering, such as 
     533lexicographic ordering.