Changes between Version 44 and Version 45 of Backpack

Aug 31, 2016 10:14:20 PM (21 months ago)

refer to the most up-to-date things


  • Backpack

    v44 v45  
    1 [ Backpack] is a proposal for retrofitting Haskell with an applicative, mix-in module system. The theory of Backpack is developed in the paper and its accompanying technical appendix. The purpose of this page is to give an overview of the design in GHC.
     1[ Backpack] is a proposal for retrofitting Haskell with an applicative, mix-in module system. The theory of Backpack is developed in the paper and its accompanying technical appendix. The purpose of this page is to give pointers to where you can learn more about its design in GHC.
    3 == Motivation ==
     3Here are the main documents:
    5 === Large-scale modularity ===
     5* [ Backpack specification]. This is the full user-facing specification, which also describes the GHC interface (not intended to be used by end-users, but an important IR.)
     6* [ Backpack implementation guide]. This is a half-baked guide to review the Backpack patchset.
    7 Large-scale modularity refers to the modularization of software into libraries, which are built upon other libraries.  A package manager offers a limited degree of flexibility by permitting a library to built against varying *versions* of its dependencies.  Backpack seeks to solve the following problems related to programming with libraries:
     8You might also be interested in the [wiki:Commentary/Compiler/Packages commentary pages about packages] and [wiki:CabalDependency].
    9 1. **I want to write a library that works with ByteString, Text and String, but I only want to write it once.**  Today, I may have to maintain multiple versions of the package: `foo-bytestring`, `foo-text`, `foo-string`, each specialized against a specific string representation.  (:TODO: is this a good example, or do people want to write their library differently for these different types?) Similar situations occur when a library want to support multiple "backends". This problem is exacerbated when someone else writes another library which builds on top of `foo`; now they have to write three versions of the package. It is better if the library can be written once, and parametrized on a signature describing strings, allowing users to fill in their own string implementation.
    10    * Here are examples of libraries with "multiple drivers":
    11      * [ satchmo] (satchmo-backends, satchmo-funsat, satchmo-toysat)
    12      * [ Chart] (Chart-cairo, Chart-diagrams , Chart-gtk).
    13      * [ FTGL] (FTGL-bytestring)
    14      * [ HDBC] (HDBC-mysql, HDBC-odbc, HDBC-postgresql, HDBC-session, HDBC-sqlite3)
    15      * [ MuCheck] (MuCheck-HUnit, MuCheck-Hspec, MuCheck-QuickCheck, MuCheck-SmallCheck)
    16      * [ Shellac] (Shellac-compatline, Shellac-editline, Shellac-haskeline, Shellac-readline)
    17      * (and more)
    18    * A close variant: I want to depend on some fancy full featured system, but I want people to be able to compile me against a simpler variant
    19      * [ Shake] provides an interface for writing build systems. I may not want to be coupled to Shake specifically and be able to use whatever build system I want.
    20      * Test frameworks? (Though, it's hard to think of a case where you want to swap out components)
    21    * A common comment is, "Don't type classes work for this case?"  Common problems with using type classes:
    22      * Ambiguity: type class resolution must be done with respect to a type parameter, even when there is no natural one.  In some cases, a proxy type must be used when there is no natural type. This is exacerbated with multiparameter type classes, when some methods may not have enough types in their signature to uniquely identify an instance.
    23      * Newtype: type class resolution is done purely based on types; no way to parametrize over implementation. You must newtype in this situation.
    24      * Multiple parameters: without associated types, a multiparameter type class must be used when an interface requires multiple types. These types must be threaded through all functions which use this type signature; an annoying amount of plumbing.
    25      * Type classes infect call sites: if you have a data type from an type class associated type, and want to refer to it from another data type `T`, any function using T must also remember to add the type class constraint to their site.
    26      * Lack of specialization: type classes mostly must always be done in dictionary passing style (with inlining, sometimes the dictionary can be inlined, but don't count on it)
    27    * Why only at the package level? Bringing it down to the module level (and even finer) is the subject of small-scale modularity.
     10== Backpack-related tickets ==
     12Backpack-related tickets are marked with keyword 'backpack'. If the ticket is assigned to ezyang, it means he's planning on working on it.
    31 === Small-scale modularity ===
    33 Today, functions and types in Haskell can be parametrized over types.  However, there is no direct way to parametrize over an implementation (type classes are an indirect method of parametrization based on type resolution), and parametrization over a type requires a type parameter to be explicitly plumbed through all use sites of the code.
    35 == Implementation ==
    37 Backpack consists of two major parts:
    39 1. A way to write //signatures// (`hsig` files), which specify the interface of a module without providing an implementation. (Think of them as a generalization of `hs-boot` files.)
    41 2.The frontend (implemented in a library) which implements //mix-in linking// to specify implementation of signatures.  Linking produces a build plan
    43 While it is hypothetically possible to use `hsig`s without the Backpack frontend, your user experience will be a lot more pleasant using the frontend.  We provide a few ways to use Backpack:
    45 1. `ghc-backpack` is a minimal frontend that parses Backpack files, a compact syntax for Backpack use inspired by the Backpack paper, and then invokes GHC to compile them.  This frontend is primarily intended to be used for testing; it essentially is a convenient way to play around with Backpack files without having to write a full Cabal file / go through the Cabal tool.
    47 2. `Cabal` knows how to translate Cabal files into the Backpack IR, and is able to compile any Backpack components for which it has source.  It is roughly analogous to `ghc-backpack`, except that it knows how to do proper, Cabalized builds.
    49 3. `cabal-install`, via its dependency on the Cabal library, can translate Cabal files to the Backpack IR, and subsequently is able to compile Backpack components spread over multiple components.
    51 The most up-to-date description for how `hsig` files are implemented can be found in `docs/backpack/algorithm.tex`, but here are some miscellaneous GHC-specific notes, including the full-stack description.
    53 === The Backpack library frontend ===
    55 The Backpack IR is a representation of the **pre-shape** of a Backpack component; the IR is sufficient to determine what instantiated components must be built, and how a component is instantiated.  (This is an abstraction of the package shape in the Backpack paper, which incorporates both module identities as well as the identities for entities defined in the modules.  Entity information is not needed for build planning, so the Backpack IR does not include it. You DON'T have to look at import declarations because signatures are per package, not per module.)
    57 Here is the IR type:
    59 {{{
    60 -- Basic data types. This is the NON-recursive UnitId formulation.
    61 data ModuleName  = ModuleName  String
    62 data ComponentId = ComponentId String
    63 data Module = Module { moduleUnitId :: UnitId
    64                      , moduleName :: ModuleName }
    65 data UnitId = UnitId { unitIdComponentId :: ComponentId
    66                      , unitIdInsts :: [(ModuleName, Module)] }
    67             | Hole
    69 -- The intermediate representation of a component
    70 data Component = Component {
    71   -- The UnitId of the component in question.  Invariant: every
    72   -- instantiation is H -> hole:H
    73   unitId :: UnitId,
    74   -- The direct dependencies of the component. They are the things
    75   -- that I need to build to build this.  These are a "renamed"
    76   -- version of includes.
    77   instantiatedDepends :: [UnitId],
    78   -- The modules from the includes which are "in scope".  Does not
    79   -- include requirements.
    80   -- TODO: Argue about whether or not it should be done this way.
    81   importedModules :: [(ModuleName, Module)],
    82   -- The exported modules of the component.  Local modules will
    83   -- have moduleUnitId == unitId, but there may also be reexports.
    84   -- Invariant: there are no HOLEs are in this list; you can determine
    85   -- what the holes of a package are by looking at unitId
    86   exposedModules :: [(ModuleName, Module)]
    87 }
    88 }}}
    90 Invariant: every hole the non-unitId fields is //bound// by a hole in `unitId`. (An indefinite unit has the )
    92 **Example.**
    94 Consider some old-style Backpack files:
    96 {{{
    97 package p where
    98   include q
    99   signature A
    100   module M
    101 package q where
    102   signature B
    103   module Q
    104 }}}
    106 Let's consider `q` first.  Here's what the data structure gets filled as:
    108 {{{
    109 Component {
    110   unitId = q(B -> hole:B),
    111   instantiatedDepends = [], -- no includes
    112   importedModules = [], -- nothing external in scope
    113   exposedModules = [(Q, q(B -> hole:B):Q)]
    114   -- Note this doesn't contain any of the holes. You can look
    115   -- at the unitId to get that information.
    116 }
    117 }}}
    119 Now consider `p`:
    121 {{{
    122 Component {
    123   unitId = p(A -> hole:A, B -> hole:B),
    124   instantiatedDepends = [q(B -> hole:B)],
    125   importedModules = [(Q, q(B -> hole:B):Q)]
    126 }
    127 }}}
    129 Note that if instead of `include q`, we had `include q requires B as B2`, then the `instantiatedDepends` is `[q(B -> hole:B2)]`. Note that this is //bound// by the `unitId` field, which is now `p(A -> hole:A, B2 -> hole:B2)`.
    131 On `importedModules`: it only shows things that were exposed by the included packages. We need to compute this when we do mix-in linking. (It could include `A`, but maybe not.) Why doesn't it have requirements?  Because you always know where the requirements are: for every requirement of a unit, we have to compile an `hsig` file for it, even if there is no local one: you have to make a blank one.
    133 :TODO: SPJ, maybe we can pair the module names with where the include comes from. At the moment, there's no way to say, "Where it came form." (EZY: Mix-in linking, you want to glom them together.)
    135 :TODO: Scott: I'd like to just be able to substitute, and have the form of `Component` stay the same.
    137 :TODO: exposedModules is a bad name, because it isn't the full set of what things are brought into scope.
    139 :TODO: Why not stuff the exposes into the `UnitId`? That lets us get rid of `importedModules` and `exposedModules`. Moving the data structure around. (EZY: Equality doesn't have the `exposedModules`). Is `exposedModules` a function of `unitIdComponentId` and `unitIdInsts`? No, you need the source code. It's not a function you want to compute lightly. (`Component` is derived off of `UnitId`.)
    141 **How to compile?**
    143 Intuitively, the algorithm for compiling a `UnitId` goes as follows:
    145 1. Lookup the `Component` corresponding to the `ComponentId` of the `UnitId`.
    146 2. Substitute according to `unitIdInsts` in all fields of `Component`. E.g., if you are instantiating `A -> base:A`, replace every occurrence of `hole:A` with `base:A`.
    147 3. Recursively compile every `instantiatedDepends`.
    148 4. Compile this `Component` with the correct flags derived from this data structure.
    150 === The GHC interface ===
    152 How do you compile a `Component`; that is to say, what flags are passed to GHC to compile an instantiated unit?  There are a few things that need to be passed:
    154 1. The chosen STRING unit ID (which is to be used for linker symbols).  Done in old versions of GHC with `-package-name` (or more recently `-this-package-key` and `-this-unit-id`.
    155 2. The full unit ID data structure.  My suggestion is that this is given in two parts: `-this-component-id` (a new flag) and `-sig-of` (shipped already in GHC 7.10)
    156 3. The set of modules that are in scope, from `importModules`.  :TODO: There are two ways to do this: a series of `-package-id "p (A as B)"` flags (which mimic a source-level include declaration, or manually specifying each of the modules in scope (some new flag).  Besides possibly being quite long, the latter is attractive because the elaboration to the Backpack IR needs to compute the set of modules in scope (so it knows how to instantiate things.) The former is attractive because it works without modification on old versions of GHC.
    157 4. The set of requirements that are in scope, from `instantiatedDepends`.  These indicate what other signatures need to be merged into the local `hsig`s.
    159 === Cabal syntax ===
    161 :TODO: More context
    163 I think for now we should resurrect the original Cabal syntax.  So, we add the following fields:
    165 {{{
    166 exposed-signatures: H1 H2
    167 reexported-modules: X as H3 -- NB: Cabal must know what modules are in scope!
    168 }}}
    170 Primary complication is permitting a build-depends to be included twice, with different dependencies.  We could put this inline into build-depends but I think it's more wise to just add a new field `includes:` which accepts a newline separated list of includes with renamings:
    172 {{{
    173 build-depends: foo >= 2.0
    174 includes:
    175   foo requires (H as H1)
    176   foo requires (H as H2)
    177 }}}
    179 If a package `bar` is not explicitly mentioned in `includes`, it is assumed to have been included as `include bar`.
    181 === Cabal changes ===
    183 * No longer a one-to-one mapping between Cabal file components, and actually components to build.  Cabal must call into the Backpack library to figure out what all the local, instantiated components to build should be (and what the flags should be.)
    184 * Must adjust register so that it can register multiple components (one for each internal one that was enabled for build)
    185 * build command needs to be componentized, so that a client can request a SPECIFIC component be built
    187 :TODO: This overlaps with Duncan's proposal, where Cabal "configures" everything fromt he getgo.
    189 === cabal-install changes ===
    191 * After version resolution, look at Cabal files, and use Backpack library to expand into a build-plan that also includes instantiated componetns. Then go and build them. (Benefit of having cabal-install call the Backpack library: no need for Setup executable to communicate to Cabal how it should go.)
    193 == GHC specific notes ==
    195 === Holes ===
    197 The big new typechecking feature of Backpack is to allow a user to replace a concrete module (`hs`) with a signature (`hsig`), whose implementation can be filled in later.  Both `hs` and `hsig` files compile to an `hi` file, but interface files generated by `hsig` files are different in the following ways:
    199   * `mi_sig_of` is non-empty, recording the **implementing module** of the interface in question.  For example, when you run `ghc -c A.hsig -sig-of other-pkg:A`, the produced `ModIface` has `main:A` for the `mi_module`, but `other-pkg:A` for `sem_mod`. (Normal interfaces have `Nothing` in `mi_sig_of`.)  When the implementing module is unknown, the recorded module is a fake module abscribed to the fictitious `hole` package; e.g. `hole:A`.
    201   * `mi_hsc_src` is `HsigFile`.
    203   * Like `hi-boot` files compiled from `hs-boot`, these `hi` files contain no unfoldings, can have abstract data types, etc.
    205 Internally, we often refer to what is recorded in `mi_sig_of` as the "semantic module"; this is the module that is used for `Name`s that come to the module; `mi_module` is the "identity module" which uniquely identifies an interface file.
    207 === Signature merging ===
    209 Unlike regular modules, you cannot simply *hide* a hole: it is a requirement that always must be fulfilled before you can compile the module in question. Instead, if you define an `A.hsig` for an `A` which is already required by another unit we included, the requirements of that unit are *merged* with the requirements of this unit.
    211 Presently, this merging process is carried out by `mergeRequirements`, which successively adds inherited required types/values from included units to the type/environment of the `hsig` file being compiled.  Any requirements for which we don't have a local `hsig` implicitly have a "blank" signature file which collects together the merged requirements.
    213 :TODO: Unclear how to handle dependency cycles here.
    215 === Lazy interface loading ===
    217 In paper Backpack, unification and substitution is performed eagerly on the type environment as soon as we discover that some type equality holds (e.g. during shaping).  In GHC, the type environment is *lazily* loaded, and thus this substitution must be deferred until we actually load this interface.  This has two consequences for GHC's interface loading code:
    219 1. We may want the types for a module `p(A -> HOLE:B):M`, but we only have an interface file for `p(A -> HOLE:A):M`. In this case, we have to rename the interface file from `A -> HOLE:A` to `A -> HOLE:B` before we typecheck and load it in.  This is handled by `computeInterface`, which calls `rnModIface` to apply the substitution.
    221 2. Inside an interface file, we may refer to a Name `hole:A.T`.  During shaping, we may have discovered that this Name actually is unified with `impl:A.T`; so when we are typechecking the interface file, we must use the real name and not stodgily adhere to the old one.  The `eps_shape` data structure records these unifications, and a few data types (interface file renaming, interface type-checking, and even type-checking a signature) abide by this substitution.
    223 === The database ===
    225 The package database contains both entries for old-fashioned definite units/packages, and also entries for indefinite unit, which contain code but no objects.  Conventionally, the unit ID for an indefinite is simply the unit ID with all of the requirements filled in with `hole`s.
    227 === Unit IDs ===
    229 A unit ID is a recursive data structure which is defined to be a component ID (specified by Cabal) plus a mapping from module names to modules, where a module is a a unit ID plus a module name. In common parlance, components and units are the same, but we've adopted the convention that a "unit ID" also includes an instantiation, while a component does not.  The component ID represents "source code"; we call it a component and not a unit because it coincides with the preexisting Cabal notion of a component.
    231 In some situations, we need serialize a unit ID into a compact, deterministic string, for use in linker symbols and file paths, as a full unit ID could be quite long (it is AT LEAST linear in the number of holes in a unit).  Ideally, the serialization format would be private to GHC (similarly to how z-encoding is GHC private), but the encoding leaks at least to the file path that GHC stores compilation results at... would be nice if there a way to avoid this problem.
    233 == Reading ==
    235 * [ GHC Proposal] (like a specification)
    236 * [ The Backpack documentation directory]. You'll have to build them (just `make`), but here are the files of interest:
    237   * algorithm.tex specifies the abstract core Backpack algorithms for GHC.
    238   * backpack-manual.tex is a user-facing manual for using Backpack.
    239   * backpack-impl.tex is an old but interesting technical document describing many of the technical tradeoffs in our design.
    240 * [ What's a module system good for anyway?] Motivates //why// you might care about Backpack
    241 * The [wiki:Commentary/Compiler/Packages commentary pages about packages],
    242 see also [wiki:CabalDependency]
    24416== Discarded approaches ==
    27143**Backpack smarts directly in GHC.** (Obviously) it makes more sense to put it in a library, which both GHC and Cabal can use.  There are a lot of technical difficulties of making this actually work well, but it "makes the most sense."
    273 == Backpack-related tickets ==
    275 Backpack-related tickets are marked with keyword 'backpack'. If the ticket is assigned to ezyang, it means he's planning on working on it.
    277 [[TicketQuery(keywords=backpack,format=table,col=type|summary|priority|owner,status=new,order=priority)]]
    27945== Future things to think about ==