A fat interface file is like a normal hi file, but with extra information so that GHC can *recompile* the associated module without having to parse, rename and typecheck the source code again. The proposed command-line interface is something like (subject to bikeshedding):
    ghc -c A.hs -fno-code -fwrite-buildable-interface    # produces a .hi-build file, then
    ghc -c A.hi-build                                     # completes building, as if ghc -c A.hs had been called
The primary motivation for this is to support partially compiling indefinite packages, which cannot be compiled to object code because some of their dependencies haven't been provided yet. However, there are some other cases where this might be useful:
Supercompilation requires having the source of all bindings; buildable interfaces can make this information available, even beyond what storing inlinings might provide
A buildable interface file can be used to build variants of a module, e.g. profiled or differently optimized builds, even "on the fly" by GHC if necessary.
GHCi can take advantage of information from buildable interface files to give more detailed information about otherwise "compiled" modules
And maybe more we haven't thought of yet.
Note: we DO run the desugarer before we write out the buildable interface (so we are essentially serializing a ModGuts, not a TcGblEnv). This does mean that you can't rebuild a buildable interface with hpc (that's done during desugaring) but that seems like a small price to pay.
Question: should we tidy the Core bindings before serializing them out? This is only relevant for determining whether A.hi files need to be rebuilt when A.hi-build changes: if we can avoid needless churn on A.hi-build, some A.hi files may not need to be rebuilt. However, we'd have to be pretty bang-on certain that the extra tidying phase wouldn't change how we end up compiling things, which is unclear, because tidying will drop things that might have been profitable during simplification but are no longer profitable now. So the answer is probably no.
Tidying does a number of things, listed at the top of TidyPgm.
I have not thought this through fully, but my guess is that most of them aren't what we want for a fat interface file; e.g. dropping rules; the whole findExternalIds thing; injecting implicit bindings, etc.
But we probably do need one aspect of tidying, namely to rename things. Consider \x_12. \x_14. x_12, where the _n part is the unique. When we serialise into an interface file we drop the unique, so it's important that the OccNames don't shadow.
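To make the shadowing concern concrete, here is a comment-only sketch (the binder spellings are illustrative, not GHC output):

```haskell
-- Before any renaming, the two binders are distinguished only by their uniques:
--
--   \x_12 -> \x_14 -> x_12
--
-- Interface serialisation drops the uniques, so the term would read back as
--
--   \x -> \x -> x        -- the inner x now shadows the outer one: a different term!
--
-- Renaming the OccNames apart (as tidying does) avoids the problem:
--
--   \x -> \x1 -> x
```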
Edward Z. Yang changed title from Implement "buildable" interface files which can be directly compiled without source to Implement "fat" interface files which can be directly compiled without source
Posting a little update about the tiresome root main function: the root main function is a special implicit binding that is added to Haskell programs during type checking and serves as the "well known" entry point to enter a Haskell program. Currently, it is named :Main.main (i.e. ZCMain_main): unusually, it is a binding in a Haskell program that can have a different Module than the module actually being compiled.
Because of this, it has to be treated specially when we serialize it out to an interface file: namely, we have to know /which/ binding is the root main binding (so it doesn't clobber the user-written main binding), as interface file serialization doesn't record Module, only the OccName.
In our last phone call, we proposed two fixes, but I don't think either of them will work:
Move the binding to be injected during tidying, rather than typechecking. This is tiresome because main must have type IO a, where a can be anything we want. We'd have to extract out this type to make a well-formed Core binding (this is marginally easier in the typechecker, where the current code does unification to pull it out). We are also only given the RdrName of the function which is supposed to be main; we have to consult the GlobalRdrEnv to turn this into an actual Name. So we also have to make sure to preserve this information until the final tidying so that we can point to the correct main function, which means yet another thing to record in ModGuts. Finally, we can't really eliminate checkMain, since we need to give errors when appropriate. So this is a large amount of work for a questionable amount of benefit.
Rename the main OccName to something special, e.g. $main, so we can identify it in the interface file. If we do this, it means the name that you must refer to for a Haskell program for main, ZCMain_main_closure, has to be renamed to ZCMain_zdmain_closure. This is pretty gratuitous and will break any C-Haskell bridges for no good reason. So I'd like not to do that.
I think my current plan is to just have a flag in IfaceBinding which says if this is "main" or not, and then we just typecheck it accordingly one way or another. A small amount of code, no penalty for normal interface files, and very simple.
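A minimal, self-contained sketch of that idea, using stand-in types rather than GHC's real IfaceSyn (all names here are hypothetical):

```haskell
-- Hypothetical sketch: tag each serialised top-level binding so the reader
-- knows whether it is the root main binding, and rebuild it in the well-known
-- :Main module rather than the module being compiled.
data RootMainFlag = IsRootMain | NotRootMain
  deriving (Eq, Show)

-- Which module should the binder be reconstructed in?
bindingModule :: String        -- module currently being compiled
              -> RootMainFlag
              -> String
bindingModule _       IsRootMain  = ":Main"   -- the special root main module
bindingModule thisMod NotRootMain = thisMod   -- ordinary bindings stay put
```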
Here is a comment about the extra things that we seem to need to serialize to a ModIface, beyond IfaceBindings:
mg_rdr_env :: GlobalRdrEnv. Originally, I claimed that this was only used by the vectoriser; however, I was slightly mistaken: it is ALSO used to generate error messages / debug information in the simplifier and interface generation code. For example, we don't warn about orphans until MkIface, and to give nice qualified names we need a GlobalRdrEnv. I'm investigating whether we can move orphan warnings earlier, but in general, if we don't serialize the GlobalRdrEnv, we will pay a little in how good the error messages are (maybe this is OK for the simplifier monad, which I don't believe gives any user-visible messages using this).
mg_hpc_info :: HpcInfo and mg_foreign :: ForeignStubs are directly passed through to code generation during Tidy, so we can't get rid of them. So my inclination is to create a field in ModIface like mi_cg_info :: Maybe CgIface, which contains just these two bits (a rough sketch follows after the update below).
I managed to eliminate the usage fields in ModGuts by computing usage information earlier, during desugaring. This seems to work quite well.
UPDATE: Actually, the problem of serializing ForeignStubs is quite vexing: it contains a pair of SDocs which can't easily be serialized. I guess we'll need to define some sort of data type representing foreign stubs which we then can code generate later? Harrumph!
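To make the two comments above concrete, here is a rough, hypothetical sketch of what mi_cg_info and a serialisable stand-in for ForeignStubs might look like. These are not GHC definitions, and whether plain strings are structured enough for the stubs is exactly the open question:

```haskell
-- Hypothetical bundle of the code-generation-only bits that Tidy passes
-- straight through; this would hang off ModIface as  mi_cg_info :: Maybe CgIface
-- (Nothing for ordinary, thin interfaces).
data CgIface = CgIface
  { cgi_hpc_info :: HpcInfoStub        -- stand-in for GHC's HpcInfo
  , cgi_foreign  :: IfaceForeignStubs  -- serialisable replacement for ForeignStubs
  }

data HpcInfoStub = NoHpcInfoStub | HpcInfoStub Int   -- stand-in only

-- Instead of a pair of SDocs, keep the stub text (or something more structured)
-- so the SDocs can be rebuilt when compilation is resumed.
data IfaceForeignStubs
  = IfaceNoStubs
  | IfaceStubs
      { ifs_header :: [String]  -- lines destined for the C header stub
      , ifs_source :: [String]  -- lines destined for the C source stub
      }
```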
Let's talk about recompilation avoidance and how the user-facing interface for fat interfaces work.
Ideally, compilation to a fat interface (-fno-code -fwrite-fat-interface) should be similar to compilation to an interface (-fno-code -fwrite-interface), including recompilation avoidance. This includes the following properties:
Incremental compile. If A imports B, and I've built B.hi-fat, I should also be able to build A.hi-fat. (This implies that an hi-fat file should be loadable to get TyThings.)
If I build A.hi-fat and then modify A.hs, asking GHC to build A.hi-fat again should rebuild it; similarly if the TyThings of any of its dependencies changed. However, if nothing has changed, I shouldn't have to rebuild A.hi-fat.
If I am compiling an A.hi-fat to A.o and A.hi (e.g. finishing the compilation), it's unreasonable to expect GHC to handle the case where the A.hi-fat is out of date relative to the A.hs or any of its dependencies. However, if it is out of date, GHC should detect this and give an error, rather than generate some incorrect code.
The mental model I have is that we can think of an hi-fat file as consisting of two parts: a types-only hi file that would have been the result of an -fno-code -fwrite-interface (so, a properly Tidied interface that is good for typechecking and then loading in for later typechecking, but without any unfoldings), as well as the fat bits which can be used to reconstruct a ModGuts and finish compilation. (Of course, you'd really like this all to be ONE file. Should look at this carefully and make sure it works!) Importantly, we DON'T care about fingerprinting the bindings and stuff for compilation; only the typechecking information.
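A sketch of that mental model, with stand-in types for the interface and the extra payload (none of these names exist in GHC):

```haskell
-- Hypothetical single-file layout: the ordinary tidied, typecheck-facing
-- interface, plus the "fat bits" needed to reconstruct a ModGuts and finish
-- compilation.  Only the thin part would participate in fingerprinting.
data FatIface = FatIface
  { fiThin :: ThinIface  -- exactly what -fno-code -fwrite-interface would emit
  , fiFat  :: FatBits    -- untidied Core bindings, foreign stubs, HPC info, ...
  }

data ThinIface = ThinIface  -- stand-in for GHC's ModIface
data FatBits   = FatBits    -- stand-in for the serialised ModGuts payload
```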
We have to be careful not to confuse hi files and hi-fat files in GHC. If I'm typechecking for an hi-fat file, I want to preferentially use hi-fat files; however, if I'm actually building a module, I really want hi files (so I get the right unfoldings). We have a somewhat similar situation in Backpack: I really don't want the types-only definitions polluting the EPT, because then things won't optimize.
Here is where things get tricky, because when we typecheck a module and then desugar it, we are going to use the typecheck-only hi files, but when we BUILD the ModGuts, we're going to want to use the actually built hi files. How do we tell if these are actually in sync? We're going to need some "typecheck-only" hash which we can use to test for consistency. Hmm! ModIface should have something close to this already, but we have to check.
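For illustration only (record and field names hypothetical, and the real mechanism may differ), the consistency check could amount to comparing per-dependency hashes recorded at .hi-fat time against the .hi files used when finishing the build:

```haskell
import GHC.Fingerprint (Fingerprint)

-- What the fat interface could remember about each dependency it typechecked
-- against: just a hash covering the typecheck-facing part of that .hi file.
data FatDep = FatDep
  { fatDepModule :: String       -- dependency's module name
  , fatDepTcHash :: Fingerprint  -- "typecheck-only" hash seen at .hi-fat time
  }

-- Before resuming compilation against fully built .hi files, check that the
-- typecheck-facing part of each dependency hasn't changed underneath us.
depStillInSync :: FatDep -> Fingerprint -> Bool
depStillInSync dep currentTcHash = fatDepTcHash dep == currentTcHash
```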
This is a really cool idea. I'd like to encourage putting as much info as possible into these hi-fat files (or potentially make that an option). For example, type information for every sub expression, exact names for identifiers, and SrcLoc info. Why? Because this could also potentially allow for querying information about libraries without compiling them. For example, you might jump into some library code by using "go-to-definition" in an IDE. It would be excellent to then be able to immediately get info about the code you've jumped into, via these fat hi files. I don't know much about GHC internals, so maybe this doesn't make much sense. But, something to consider!
Hmmm... as you know, I'm pretty strongly opposed to the idea of saving source code, and this just seems like another variant of saving source code. When the instantiations of the holes are available, just compile the source code against the now-concrete dependencies. Until the concrete instantiations are available, you can typecheck but do nothing else. There are good reasons to want to do it this way - the main one being that "cabal install" should be an effect-free operation, except insofar as it makes things available in GHCi. If "cabal install" has a side effect, then it becomes very hard to explain the user interface, because we have to take into account the state of the system somehow, and I'm sure this will cause problems for users. We really want "cabal install" to depend on its inputs and nothing more. I understand that the motivation is to compile "the same thing that we typechecked" in some sense, but the right way to ensure that is to make sure you give "cabal install" the same inputs that it had when you typechecked, that is, move the guarantee of consistency to a higher level.
I don't object to the idea of fat interface files per se, but I wonder whether it's a feature that will pay for itself.
I would agree that the primary payoff of this feature is Backpack; without it, I think there is much less motivation. I think the "external core" use-case could be quite compelling, but there's more work to do extracting the interface file parser into its own library.
Do fat interfaces violate cabal install "depending on its inputs, and nothing more"? When I try to answer this question, I find myself asking, "what distinguishes these from normal interface files which we save for packages that are already installed?" Let me recap Duncan's new plan for implementing cabal install so that it is minimally affected by the existing database of packages:
We first solve dependencies, without making any reference to the existing package database. This allows us to compute a number of IPIDs of configured packages.
Then, we *improve* these configured packages into installed packages simply by looking up their IPIDs in the package database.
When we compile, we begin as if we had already compiled the already installed packages, and finish up building and installing the rest of the packages in the plan.
The emphasis seems to be removing side effects from step (1); but of course we do want to avoid rebuilding things that are already built, so (2) is "effectful" but in a benign way.
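As a toy illustration of steps (1) and (2) (hypothetical names, nothing to do with cabal-install's actual types), "improvement" is just a lookup against the installed database, so the existing state can only let us skip work, never change the plan:

```haskell
import qualified Data.Set as Set

type IPID = String  -- installed package ID computed during solving

data PlanItem = Configured IPID | PreInstalled IPID
  deriving (Eq, Show)

-- Step (1) computed the IPIDs without looking at the package database;
-- step (2) merely marks the ones that already exist there.
improve :: Set.Set IPID -> [IPID] -> [PlanItem]
improve installedDb plan =
  [ if ipid `Set.member` installedDb then PreInstalled ipid else Configured ipid
  | ipid <- plan ]
```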
I think this plan, which avoids side-effects during dependency resolution, works for Backpack fat interfaces too. Here's how it looks:
Once again, we solve dependencies. This gives a number of component IDs for all configured components.
We improve these configured components into installed components. Some of these will map to pre-compiled components, but others will map to *indefinite components* which were installed with fat interface files.
When we compile, we begin as if we had compiled all the installed packages, and partially compiled all the indefinite ones, and just "finish" up the rest of the build according to the plan.
There is extra complexity with ensuring that instantiated units get deduplicated correctly (arising from Backpack's applicativity) but I really think that fat interface files support the method that you are looking for.
Let me also point out the other big difference between fat interface files and source files: GHC *always* knows how to compile fat interface files. You cannot have any sort of custom build system associated with them. They're just a big pile of inlinings; there's no external C code, no preprocessing, etc. This means, for example, that if you want an indefinite component with some C code, the C code gets built ONCE when you initially install the indefinite package, not each time you instantiate the component.
I'd really like to get you on board with this new plan, so let me know if you want to chat about it some more.
Yes ok, I agree that you could do this and retain the property that cabal-install is deterministic. However, since you also need to support building from source (because the user could just delete the package database and get the same results) isn't this scheme just a very elaborate way to avoid re-typechecking things in some cases? That is, it doesn't add any new functionality, it just saves repeated work.
Does this feature pay its way? It's still not clear to me. It feels like something we would have to consider in lots of places, like hs-boot files, a kind of tax on people who want to write build systems and tools that work with packages.
I wouldn't be surprised if there were tricky technical problems with implementing it fully too. What about core-to-core compiler plugins? What is the equivalent of ghc -M for fat interface files?
This would mean that we couldn't have C code in a package that depends on something provided by a hole too. Maybe you don't want to support that use case, but given that without fat interfaces it would "just work", it seems a shame to lose it.
Essentially I suppose I worry about losing the simple compilation model that we currently have in the name of saving some time.
Edward can correct me but I think it's about more than time/efficiency. To start again from source you need to have all your import paths, CPP settings, etc etc recorded so that you can replay them. You need to copy all the source files, including any C bits, into the installation tree. There's a LOT of front-end stuff to worry about.
But this way all that is gone. We simply have Core; and any C bits are compiled to .o files. Nice!
I completely agree that we shouldn't save source files and try to replay compilation from there - indeed I'm suggesting that we shouldn't record/replay anything at all, and just start from the cabal source packages each time we need to compile or typecheck.
Maybe a simple example will help focus this discussion. Imagine we have packages A and B, where A has a B-shaped hole. We can do
cabal install A
which just typechecks A, and we can do
cabal install A B
which compiles A and B. Now, "cabal install A B" must work and produce the same results regardless of whether we previously did "cabal install A", because cabal is just a function from source packages to compiled packages. So the only thing we can do in response to "cabal install A" is to cache something that might be useful later, which is what the fat interface files are: they let us avoid some of the work that we will need to do in a future "cabal install A B" for any B.
Since we have to support "cabal install A B" without a previous "cabal install A", there's no new functionality in saving fat interfaces.
Aha! SM, your latest comment pinpointed a misunderstanding. You said that "cabal install A B" must work whether or not "cabal install A" was done previously, thus fat interface files are not necessary. But in fact, even in the case of "cabal install A B", we use fat interfaces to compile an instantiated version of A: you first compile A to a fat interface, and THEN instantiate it and finish the compilation.
Why is it set up this way? There was a confluence of reasons that pushed us in this direction:
We want GHC(i) to be able to compile Backpack code directly, without requiring Cabal in the loop.
Conversely, we don't want to bake Backpack support into cabal-install; if we do, then alternate package managers like Stack also have to port their own support. (A cabal-install which knows, in "cabal install A B", that A is instantiated with B, knows too much about Backpack for its own good!)
We always need to typecheck A by itself *anyway*, because it may be ill-typed in a way that you won't discover when you instantiate it. The canonical example is:
    unit A where
      signature A where
        data T = T
      signature B where
        data T
      module M where
        import qualified A
        import qualified B
        x = A.T :: B.T
A shouldn't type-check, because there is no reason to believe A.T and B.T are the same type. But if some unit B fills both signatures A and B with the same T, it is actually possible to compile A instantiated with B.
OK, let me answer some of the other questions more directly:
It feels like something we would have to consider in lots of places, like hs-boot files, a kind of tax on people who want to write build systems and tools that work with packages.
I agree, it does make things more complicated. But you never have to interact with fat interfaces unless you want to build a Backpack package: they all go away once you instantiate things. So I think the complexity here is somewhat opt-in. It's the same deal with some sort of external core format: you don't have to deal with it unless you want to generate external core or compile it.
I wouldn't be surprised if there were tricky technical problems with implementing it fully too. What about core-to-core compiler plugins? What is the equivalent of ghc -M for fat interface files?
Core-to-core plugins run when compiling from hi-fat to hi; that should just work out of the box. As for ghc -M, fat interface files are "essentially" external core input files, i.e. they come with full information about their dependency structure. So you could just write a ghc -M which would tell you what order to build hi-fat files in, the same way you can do it for hs files. (I have not implemented ghc -M, but I have implemented --make's analysis to work on fat interface files.)
This would mean that we couldn't have C code in a package that depends on something provided by a hole too. Maybe you don't want to support that use case, but given that without fat interfaces it would "just work", it seems a shame to lose it.
We are giving this up. But I am not losing too much sleep over it; the preprocessor for the C code would already have to know how to mangle the Haskell symbols to point to the right location.
I'm missing some context here, e.g. I don't know what your plans are with respect to using backpack code with regular GHC(i), and how cabal-install can be ignorant to whether a package contains backpack code. Hence I don't understand how (1) and (2) force you to save fat interface files.
I understand (3), but it still doesn't seem to force you to save a fat interface file. Why not typecheck first and then compile again from source?
Let's think about it from a user perspective. Right now, if I want to compile a simple Haskell file, I don't need to futz around with Cabal: I can just ask GHC to build it for me with ghc --make A.hs. So we want a similar experience with Backpack: if someone writes a simple Backpack file, they should be able to build it with ghc --backpack p.bkp, without futzing around with Cabal.
This is MUCH more important for Backpack programming "in the small" (à la ML functors and C++ templates) rather than Backpack programming "in the large" (à la replacing package dependencies). I think it's important that we get a story that works well for both cases; even though the Backpack design is highly oriented towards large-scale modularity, it is much easier to get users using the small-scale features first. (Plus, there's interesting stuff to be done with alternate syntax building on Backpack but better suited for small-scale modularity.) But we can't ask them to use Cabal for that!
Re (2), how can you let cabal-install work without knowing about Backpack? If it doesn't know about Backpack units, then it can only operate as it does now: building a graph of packages (packages that use Backpack are still obligated to state, in a build-depends, what external packages they use). So when we say "cabal install A B", we mean, "install package A" and "install package B". What does it mean to install an indefinite package? Typecheck it and install the fat interface files (so we can typecheck against it). What does it mean to install a package which instantiates an indefinite package? Compile it, and also compile anything that it instantiates. With fat interfaces, GHC can manage compiling the instantiated units, since it can just resume that compilation!
How about installing the results? If Backpack were generative, every package which instantiates indefinite units would just get their own private copy of any other units it instantiated, and together they would constitute the package you would install. Fat interface files work perfectly for this case. But since Backpack is applicative, we also want identical instantiated units to be shared. Here is where you might object, since I am cheating here, but what seems easiest is to have GHC and Cabal (but not cabal-install) conspire to deduplicate instantiated units "behind the scenes".
You might say that this will work really poorly for traditional distribution package managers. If C and D both instantiate A(B), where do the "install files" for A(B) live? The only reasonable place to put them in a distro is in their own package, implying that the distro packages you'd create for packages that use Backpack would have to know how everything is instantiated. But traditional distribution package managers (e.g. deb, rpm) already can't deal with having both foo-0.1 and foo-0.2 installed at the same time. So it's no surprise that they'll have some trouble dealing with Backpack, which subsumes this mode of use. And if you do have a distro that can deal with it (e.g. Nix), you can either bundle private copies of the instantiated units (and rely on GHC's linker to deduplicate them later, following how C++ compilers handle this), or you can do as you suggest and run enough of Backpack to figure out how things are instantiated, and then make a build task for each instantiated unit. I think you are right that we will want to build this eventually; but I don't want it to be the default, and I want something simpler for casual use.
First, I absolutely agree that we should be able to use Backpack without Cabal; indeed it must be so, since Cabal is just a layer on top of GHC.

The original intuition with Backpack was that building fully instantiated packages can work exactly as it does now - we build everything from the leaves up to the root in dependency order, no extra mechanisms are needed. Furthermore, when dealing with indefinite packages, we can generate interface files but not code.

I'm still not seeing anything in this that requires fat interface files.

Let me re-answer your questions: What does it mean to install an indefinite package? We install just the interface files, so that we can typecheck against it. What does it mean to install a package that instantiates an indefinite package? Just build and install it! Then we have to build the package that it instantiates, which is exactly how cabal-install works now.

I think your story is more complicated. You said "GHC can manage compiling the instantiated units, since it can just resume that compilation!" but that's blurring the boundary between Cabal and GHC, since suddenly GHC has to go and compile *and install* things. I can imagine this getting really messy. Better to have Cabal manage building and installing things at the package level, like it does now.

GHC will need to know when it is just typechecking something vs. compiling it against definite dependencies. Hence, Cabal will also need a new command (or something) to do this - it's a new mode of operation, and the user needs to be in control, so there's no escaping this being visible at the Cabal level, I believe.
Let me re-answer your questions: What does it mean to install an indefinite package? We install just the interface files, so that we can typecheck against it
We need more than that. We need to compile the indefinite package to code when we fill its holes. So we need access to its source code in some form.
One way to do that is to keep the original Haskell source code (pre-cpp, pre-everything) and typecheck it from scratch, being (a) careful to keep a copy of the source code and (b) ever so careful to replay exactly the front-end compiler flags that were used the first time round. We could do this. But it's just easier to snapshot the Core bindings that we already have in our hand. With unfoldings in interface files we already do this; it's mainly a question of keeping all unfoldings, not just some.
It's not an efficiency issue. If it's easier to compile again from source (remembering all the C bits etc) then let's do that. To me it looks harder. And that's the core issue to debate.
We need more than that. We need to compile the indefinite package to code when we fill its holes. So we need access to its source code in some form.
One way to do that is to keep the original Haskell source code (pre-cpp, pre-everything) and typecheck it from scratch, being (a) careful to keep a copy of the source code and (b) ever so careful to replay exactly the front-end compiler flags that were used the first time round. We could do this. But it's just easier to snapshot the Core bindings that we already have in our hand. With unfoldings in interface files we already do this; it's mainly a question of keeping all unfoldings, not just some.
So that's exactly what I think we should not do, we don't need to record and replay anything at all because when we instantiate the package it's a completely separate compilation.
So just to be clear, the plan for building package A & B where A has a B-shaped hole would be:
typecheck A alone, install (thin) interface files
build B from source, install code + interface files
build A from source against B, install code + interface files
And any of these steps can be omitted if we've already done it before and cached the results in the package database, which we can discover by looking up the package key. Isn't this much simpler than saving fat interface files? What am I missing?
I think the difference in opinion stems from whether or not it's reasonable to ask the user to have built the non-leaves prior to actually building the unit in question. In the case that it is necessary, the only way the user is going to be able to do this in practice is with external tooling, e.g. cabal-install. But this was true for pre-Backpack packages too; no one really can install a package without help from Cabal. And this seems to be fine.
But there are other things for which it's not reasonable to ask the user to have run Cabal beforehand. For example, I can work on a multi-module Haskell program without using Cabal, and it would pretty horrible user-experience if I had to "install" every module before I could use it in another program.
The crux is this: Backpack units are more like modules than packages, as far as user experience is concerned.
For example: today, if containers publishes Data.Map which has a map data type, I can use it immediately in an hs file, no fussing about with Cabal. If we translate this to Backpack, we get a data-map Backpack unit, which has a hole for the type of keys. It would be bad user experience to force a user to use Cabal to build and install all the instantiated copies of data-map they might want to use before they can build their application. To add insult to injury, if the key type in question is in the package they are working on, they would have to separate it out into a new unit containing just their type definition, so that it could be installed first and then used to instantiate data-map, before they can go ahead and build the package they're thinking about.
So, I think your intuition is right when holes are roughly "package" shaped and Backpack units approximate packages. But I think it's quite important to have reasonable non-Cabal workflow when the holes are smaller, which forces us to have GHC instantiate things on the fly, which leads us to fat interfaces.
(I can also expand on why I think small-scale Backpack use is important, and there's also another argument that saying that the user invokes GHC with commands like "build A from source against B" makes for poor language specification.)
I think the difference in opinion stems from whether or not it's reasonable to ask the user to have built the non-leaves prior to actually building the unit in question. In the case that it is necessary, the only way the user is going to be able to do this in practice is with external tooling, e.g. cabal-install. But this was true for pre-Backpack packages too; no one really can install a package without help from Cabal. And this seems to be fine.
I'm not completely sure I understand what "built the non-leaves" means, but I expect all of the packages (and instantiations thereof) that are dependencies of the current set of modules being built, are already built and installed in the package database. Which as you say, is exactly as it is now.
But there are other things for which it's not reasonable to ask the user to have run Cabal beforehand. For example, I can work on a multi-module Haskell program without using Cabal, and it would pretty horrible user-experience if I had to "install" every module before I could use it in another program.
Absolutely. I expect to be able to do ghc --make on a collection of modules and signatures in a Backpack world, and provided all the external dependencies are satisfied in the package database, it should just work.
For example: today, if containers publishes Data.Map which has a map data type, I can use it immediately in an hs file, no fussing about with Cabal. If we translate this to Backpack, we get a data-map Backpack unit, which has a hole for the type of keys. It would be bad user experience to force a user to use Cabal to build and install all the instantiated copies of data-map they might want to use before they can build their application. To add insult to injury, if the key type in question is in the package they are working on, they would have to separate it out into a new unit containing just their type definition, so that it could be installed first and then used to instantiate data-map, before they can go ahead and build the package they're thinking about.
Yes - I do believe that to instantiate Data.Map you should have to build it (from the original source package) and install the instantiated result in the package database. But our tooling is going to do this for us automatically. This story is simple, fits with our current compilation model, and retains the logical separation between GHC (for building a collection of modules) and Cabal (for building and installing packages).
I agree that this might be annoying from a user perspective if they aren't using the higher level tool, so I can see why you wanted to solve the problem differently. But isn't it going to be strange if building my program also requires a long sequence of compilations of external modules? And where do the results go? GHC has no business installing stuff, so they have to go in some local sandbox-like setup, which means that in some other project when you want the same instantiation you don't get to reuse the results. This is exactly what the package database is for! It caches compositions of packages.
I think this relates to what you said earlier:
Here is where you might object, since I am cheating here, but what seems easiest is to have GHC and Cabal (but not cabal-install) conspire to deduplicate instantiated units "behind the scenes".
I'm not sure exactly what mechanism you have in mind, but if Cabal and GHC have to conspire, what happens when you're using GHC by itself? This does seem extremely murky to me. Why not just use the package database, and have Cabal manage the installation of instantiated packages?
So while I agree with you that it's important to have a story for using GHC(i) directly, in practice this isn't going to be the way people want to use it, and furthermore the more we try to make this a smooth user experience, the more we will end up putting features that should be in the higher-level tools into GHC itself, leading to a mess. So I think we should retain the current clear division of labour: stack/cabal know about collections of packages, Cabal-the-library knows about building + installing single packages, and GHC knows about building collections of modules.
SM, I think you have convinced me that eventually, we will need a version of cabal-install that is Backpack-aware, and has a work item in its install plan for each instantiated Backpack package.
I am also going to claim this implementation of cabal-install is not going to be "simple". Let me first state a principle which we have both agreed is desirable:

**It should be possible to build a Backpack unit without Cabal** (...assuming that all external dependencies have been built already.)
This principle forces a few choices:
**We need a separate file format for Backpack.** Suppose that we don't have a separate file format for Backpack, e.g. that all the Backpack information is recorded in a Cabal file (as it was in our older design from a year ago). GHC can't parse a Cabal file, but it needs to know specifically what unit provides any module it wants to import. How can you get this info to GHC?

In our old design, you could pass a pile of -package flags with renamings to make sure every module name points to the correct module from the correct instantiated unit. Obviously this is completely unusable by hand; you'd need a tool which knows about Backpack to generate these flags. So much for using Backpack without Cabal! My advisor complained a lot about this, and that was the impetus that led us down the route to fat interface files.

You could write any Backpack-specific information to its own file and pass that to GHC. This is a lot better! Now the only extra information you might need to pass is how the holes are to be instantiated (in case the unit is indefinite but you want to compile it).
**It is GHC, not Cabal, who knows how to instantiate units.** Simple corollary of the above. And I also assume that we don't want to duplicate code, so this code for instantiation should only live in one place. Notice that in the old design, it was Cabal that had the Backpack smarts! But I think this was the wrong choice: GHC should have the Backpack smarts.
**Cabal needs to ask GHC for information about how it instantiated units.** Since the code for unit instantiation lives in GHC, but each instantiated unit needs to get installed to the database (a Cabal concern), it follows that Cabal needs to call GHC in order to get information about how units are instantiated.
This is directly relevant to cabal-install. Traditionally, cabal-install takes the package to be built, resolves the dependency graph, and then builds the dependency graph (now an install plan) in topological order. But if GHC, not Cabal, does instantiation, we don't know the full set of instantiated units a Backpack package needs until after we have configured it. Essentially, as we process packages in the install plan, after we configure, we have to ask GHC to tell us what other dependencies are needed, and then splice those dependencies into the install plan as extra targets that must be built. This is further complicated by the fact that the Cabal library may be embedded in a custom Setup script, in which case we have to define an interchange format between Cabal and cabal-install to communicate this information. I don't think this is the wrong way to do it, but it will take work to implement.
Let me argue for fat interface files one more time. One of my top priorities at this stage is to get Backpack working "just enough" so that we can roll it out and let users (including me) play with it and use it for real applications. So if there is a way to get things working today (maybe not entirely the Right(TM) way) in a way that doesn't impair our ability to make it work right later, I would quite like to take it. Fat interfaces are a way to get Backpack integrated with the Cabal ecosystem without having to rewrite cabal-install's install plan runner, make it easier to experiment with Backpack, have the property, "If you don't care about it, you don't have to", and, most importantly, are mostly implemented at this point.
This discussion has also reminded me that there is still an elephant in the room with regards to recompilation avoidance in Backpack. In a package model, GHC will not recompile if the package key of the depended-upon package has not changed. Since we don't generally expect package keys to change on an edit/recompile cycle, it's unsound to keep around build files from an old compilation. This means that you end up doing a lot of recompiling with Cabal today when you change a package that is depended upon by other packages you are building. In fact, we want something like this: we don't *care* about recompilation avoidance when we are installing, but if we are edit/rebuilding, we want accurate recompilation avoidance for the entire set of packages which we may be editing and building. This sounds exactly like the "local package database" that we end up creating when compiling Backpack units using fat interfaces.
Well sure, I'm happy to move command-line flags into a file of some kind with sensible syntax.
It is GHC, not Cabal, who knows how to instantiate units.
In my (perhaps naive) view of the way this works, "instantiating a unit" is just compiling modules in dependency order, exactly as GHC does today. This is the case that we already have covered; all of the new stuff is in typechecking modules without concrete dependencies.
So yes - GHC instantiates units. But installing the instantiated unit and recording it in the package database is Cabal's job. The package database can record that we have an instance of A, and that it was built by instantiating A with B.
Cabal needs to ask GHC for information about how it instantiated units.
I don't think so. Why can't it look in the package database?
On your point about shipping something usable in a timely manner, by all means forge ahead. I don't want to block you, especially when you've done a lot more thinking about this problem and implementing it than I have! But I am very keen that we end up with a story that (a) retains the property of cabal-install depending only on its inputs, (b) keeps a sensible separation of layers between GHC, Cabal, and {cabal-install,stack}.
In my (perhaps naive) view of the way this works, "instantiating a unit" is just compiling modules in dependency order, exactly as GHC does today.
No, more has to be done. For example, imagine packages foo and bar set up this way:
    -- package foo
    unit foo where
      signature H
      module P

    -- package bar
    unit h-impl where
      module A
    unit bar where
      include foo
      include h-impl (A as H)
At the end of the day, we need to compile an instance of foo compiled against h-impl, e.g. foo(H -> h-impl:A). How do we know that this is the case? Backpack specifies an ALGORITHM for figuring it out, which involves taking package bar, and then doing analysis on the includes (renames and all) to figure out what units are providing modules that other units require. If you support cross-package mutual recursion, this algorithm can be quite tricky indeed, since you need the business with infinite regular trees. So it seems quite desirable for this logic to live in GHC.
So no, Cabal cannot just figure out how to instantiate units by looking at the package database, without also implementing a key part of the Backpack algorithms. (In the old design, Cabal DID implement this algorithm, but we really want to get away from that.)
We have decided to can fat interfaces for Backpack. There might be other uses for this feature, but this patchset will have to be picked up when there's another customer for the use case.
I'm working on what is essentially a Core -> "something else" compiler, implemented as a Core plugin. The problem is that it really needs to do whole-program compilation, since it can't link against normal compiled Haskell modules.
Something like this seems like it would enable us to pretend that GHC is a whole-program compiler, which would be very helpful for this (admittedly unusual) use case. In that respect it's similar to the "supercompilation" use case.
(The current workaround is to export "reusable" code as TH quotes. This is pretty bad for a number of reasons, not least that everything is effectively inlined at the use site.)
Nix-style correct-by-construction caching. As a first step towards some "incremental compilation nirvana" that I'll refrain from gesticulating wildly and vaguely about, I would like to be able to cache GHC's output between pipeline stages with a conventional file-based build system. All the serializing and deserializing and filesystem access will be slow, but it will also be correct, and a less radical departure from what GHC does than going straight for an in-memory approach.
For the record, there has been recent demand from the ghcide side of things for a feature like this. Specifically, ghcide wants a way to reduce compilation effort due to evaluation of TemplateHaskell splices. Currently they use -fobject-code to compile these since they can preserve the object files. However, this is significantly more compilation effort than necessary. It would be better to rather use the bytecode interpreter, but we currently have no way to preserve bytecode.
My suggestion is that we introduce a feature like this, allowing ghcide to instruct GHC to produce an interface file containing the module's Core. We can then use that Core to avoid typechecking/desugaring/simplification in later recompilations, jumping right into bytecode generation.
@ezyang's initial implementation can be found here: 13615ca4e4bf759f323de22a3d182b06c4050f38