wiki:Commentary/GSoCMultipleInstances

Version 22 (modified by phischu, 22 months ago) (diff)

--

Introduction

Cabal and GHC do not support multiple instances of the same package version installed at the same time. If a second instance of a package version is installed it is overwritten on the file system as well as in the PackageDB. This causes packages that depended upon the overwritten instance to break. The idea is to never overwrite an installed package. As already discussed in http://hackage.haskell.org/trac/ghc/wiki/Commentary/Packages/MultiInstances(this wiki entry) the following changes need to be made:

  • Cabal should install packages to a different location
  • ghc-pkg should add instances to the PackageDB instead of overwriting them
  • ghc --make and ghci should shadow the additional instances according to some rule of thumb
  • cabal-install should still find an InstallPlan?
  • some form of garbage collection should be invented

Changing the Install Location

Cabal does not support multiple instances of the same package version installed at the same time. Instead of installing them next to each other Cabal overwrites the previous instance with the same version.

Currently the library part of packages is installed to $prefix/lib/$pkgid/$compiler. For example the GLUT package of version 2.3.0.0 when compiled with GHC 7.4.1 when installed globally lands in /usr/local/lib/GLUT-2.3.0.0/ghc-7.4.1/. This is the default path. It is completely customizable by the user. In order to allow multiple instances of this package to coexist we need to change the install location to a path that is unique for each instance. Several ways to accomplish this have been discussed:

  1. The InstalledPackageId? is part of the path:

It currently contains the ABI hash but we are discussing to change it. Because it will always uniquely identify an installed package it is a good choice to be part of the path. Currently there is a data directory per default under $prefix/share/$pkgid/. This path needs to be known before the package is built because it is baked into Paths_foo.hs. If we introduce a new variable $installedpkgid and it contains the ABI hash its value is only known after compilation so it can not be used in the path for the data.

  1. A Cabal hash is part of the path:

We want to have Cabal compute a hash of all the information needed to build a package. We also want to avoid rebuilding a package if a package with the same hash is already present. Because of this there should only ever be one installed package with a certain Cabal hash on a machine so the Cabal hash would be a good choice to be part of the path.

  1. A unique number is part of the path:

A unique number could be the number of packages installed for example /usr/local/lib/GLUT-2.3.0.0-87 or the number of instances of this version installed for example /usr/local/lib/GLUT-2.3.0.0-2 or a random number for example /usr/local/lib/GLUT-2.3.0.0-83948393212. The advantages I see are that not much information is needed to come up with the file path and this seems to be robust against other design decisions we make now or in the future.

Dependency resolution in Cabal and GHC

Currently if GHC is invoked by the user it does some adhoc form of dependency resolution. The most common case of this is using ghci. If there are multiple instances of the same package in the PackageDBStack the policy used to select a single one prefers DBs higher in the stack. It then prefers packages with a higher version. We need a third criterium if there are multiple packages with the same version in the same PackageDB. Ideas:

  • build a complex solver into GHC
  • random
  • dependencies with the highest versions
  • order in the PackageDB
  • latest

Picking the most recently installed instance seems like the best idea right now. There are at least two ways to track which of the installed instances was most recently installed. In either you add a timestamp or the count of instances to InstalledPackageInfo?. Tracking the count means that you would lose the possibility to migrate packages between machines. So we want to track timestamps. The user should be informed about ambiguities and how they are resolved.

Currently if Cabal is asked to configure a package from a Setup.hs script without using cabal-install some adhoc dependency resolution takes place too.

Garbage Collection

It should be possible to have a garbage collection remove unneeded packages. It has to be interactive because there might be dependencies not known to Cabal and ghc-pkg. Sandboxes are useful for the user to keep track of what should be removable without causing too much damage.

Identifying packages

The InstalledPackageId? currently uniquely identifies an installed package and should to so in the future. It currently consists of the package name, the version and the abihash, for example GLUT-2.3.0.0-70c7b988404c00401d762b8eca475e5c. The ABI hash is used to discriminate between instances of the same package version, to avoid recompilation if a dependency has changed but its ABI has not and as a sanity check for GHC to refuse compilation rather than to produce garbage. The InstalledPackageId? as currently defined is unsuitable to uniquely identify installed package instances.

Dependency resolution in cabal-install

"so I see two general options for communicating knowledge about build flavors to the solver:

(1) "the direct way": i.e., all info is available to ghc-pkg and can be communicated back to Cabal and therefore the solver the solver can therefore figure out if a particular package is suitable to use or not, in advance

(2) "the agnostic way" this is based on the idea that the solver at first doesn't consider installed packages at all. it'll just do resolution on the source packages available. taking all build parameters into account, Cabal hashes will be computed. these can then be compared to hashes of installed packages. reusing installed packages instead of rebuilding them is then an optimization of the install plan. this doesn't require that ghc-pkg is actually directly aware of all the build parameters, as long as the hash computation is robust." -- kosmikus

The options are to support either both by putting all info into InstalledPackageInfo? or to support only (2) by just putting a hash into InstalledPackageInfo?. The disadvantage of supporting both is that InstalledPackageInfo? would have to change more often. This could be fixed by making InstalledPackageInfo? extensible. The advantages are that the additional info might be useful for other tools and that more complex rules for compatibility are possible for example non-profiling libs can depend on profiling libs. It would also be better for showing the user how two instances differ. The disadvantage of going for only (2) is that it is a big change and might cause problems with other Haskell implementations. Also if a package only exists installed and not in source form it is completely ignored.

The Cabal hash

We hash the build configuration of a package that is to be built. cabal-install uses this hash to check if a package is already installed.

A build configuration consists of the following:

The Cabal hashes of all the package instances that are actually used for compilation. This is the environment. It is available in the installedPkgs field of LocalBuildInfo? which is available in every step after configuration. It can also be extracted from an InstallPlan? after dependency resolution.

The compiler, its version and its arguments and the tools and their version and their arguments. Available from LocalBuildInfo? also. More specifically: compiler, withPrograms, withVanillaLib, withProfLib, withSharedLib, withDynExe, withProfExe, withOptimization, withGHCiLib, splitObjs, stripExes. And a lot more. [Like what?]

The source code. This is necessary because if the source code changes the result of compilation changes. For released packages i would assume that the version number uniquely identifies the source code. A hash of the source code should be available from hackage to avoid downloading the source code. For an unreleased package we need to find all the source files that are needed for building it. Including non-haskell source files. One way is to ask a source tarball to be built as if the package was released and then hash all the sources included in that.

OS dependencies are not taken into account because i think it would be very hard.

Released and Unreleased packages

If we cabal install a package that is released on hackage we call this a clean install. If we cabal install an unreleased package we call this a dirty install. Clean installs are mainly used to bring a package into scope for ghci and to install applications. While they can be used to satisfy dependencies this is discouraged. For released packages the set of source files needed for compilation is known. For unreleased packages this is currently not the case.

Separating storage and selection of packages

Currently the two concepts of storing package instances (cabal store) and selecting package instances for building (environment) are conflated into a PackageDB. Sandboxes are used as a workaround to create multiple different environments. But they also create multiple places to store installed packages. The disadvantages of this are disk usage, compilation time and one might lose the overview. Also if the multi-instance restriction is not lifted sandboxes will eventually suffer from the same unintended breakage of packages as non-sandboxed PackageDBs. There should be a separation between the set of all installed packages called the cabal store and a subset of these called an environment. While the cabal store can contain multiple instances of the same package version an environment needs to be consistent. An environment is consistent if for every package version it contains only one instance of that package version.

First class environments

It would be nice if we had some explicit notion of an environment.

Other

The ABI hash becomes a field of InstalledPackageInfo?. Some code in GHC needs to be adjusted to use this new field instead. [You mean this is a change in behaviour? What about packages that don't have one?]

What about builtin packages like ghc-prim, base, rts and so on?

This assumption does not hold in a multi user environment with multiple local package databases. This is a problem when building.

Open Questions

Inplace Registration?

Who has assumptions about the directory layout of installed packages?

Executables?

Haddock?

Installation Planner?

Custom Builds and BuildHooks??

Other Compilers, backwards compatibility?

What is ComponentLocalBuildInfo? for?