Parallelise 'cabal build'
Short description: This project aims to add support for module-level parallel builds to Cabal, a system for building and packaging Haskell libraries and programs. I suggest a novel solution to this problem in the context of GHC/Cabal that, unlike existing approaches, will make more effective use of multi-core CPUs.
Additional info: https://github.com/haskell/cabal/issues/976
Haskell users have long yearned for true module-level parallel builds. It is frustrating for owners of modern multi-core machines not to be able to use them fully to speed up the compilation cycle. Unfortunately, the existing approaches to this problem have not brought the expected speed benefits, due to some properties of GHC's internal architecture.
GHC is a so-called semi-whole-program optimising compiler (it inlines across module boundaries), so compiling a single module may require loading interface files for a number of other modules. When compiling the whole program in batch mode ('ghc --make'), GHC can cache this information in RAM. However, when GHC compiles a single module at a time ('ghc -c'), it has to load, deserialise and typecheck all the interface files that the module refers to anew each time. Since all existing solutions for parallel compilation are implemented on top of 'ghc -c', this overhead cancels out the speed-up that parallel compilation brings (see below for experimental results).
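As a rough illustration of why this overhead matters, here is a back-of-the-envelope model (my own simplification, not GHC internals) counting interface-file loads in the worst case where each module depends on all the modules before it:

```haskell
-- Worst-case model: with one fresh 'ghc -c' process per module, the i-th
-- module must re-read the interfaces of the i modules before it, while
-- 'ghc --make' reads each interface once and caches it in RAM.
interfaceLoads :: Int -> (Int, Int)
interfaceLoads n = (perProcessLoads, cachedLoads)
  where
    perProcessLoads = sum [0 .. n - 1]  -- 'ghc -c': quadratic in n
    cachedLoads     = n - 1             -- 'ghc --make': linear in n

main :: IO ()
main = print (interfaceLoads 100)  -- (4950,99)
```

For a 100-module project the toy model gives 4950 interface loads for repeated 'ghc -c' invocations versus 99 for a caching batch compiler; real dependency graphs are sparser, but the asymmetry is the point.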
I propose to solve this problem by adding a "compile server" mode to GHC. To efficiently compile Haskell code in parallel, the build system will start a number of persistent 'ghc --server' processes (usually one per core), among which it will then distribute compile jobs. Since each persistent 'ghc --server' process will be used for compiling multiple source files, it will be able to cache loaded interface files, thus avoiding the overhead described above.
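The worker-pool idea can be sketched in plain Concurrent Haskell. This is only a model of the scheduling side (all names are mine, and "compiling" is simulated); the key property it demonstrates is that each worker's cache persists across the jobs it handles, just as a 'ghc --server' process would keep interface files loaded between compilations:

```haskell
import Control.Concurrent
import Control.Monad
import Data.IORef
import qualified Data.Set as Set

-- Run n persistent workers over a shared queue of compile jobs.
-- Each worker owns a cache (an IORef) that survives between jobs,
-- modelling a 'ghc --server' process caching loaded interfaces.
runWorkers :: Int -> [String] -> IO [String]
runWorkers n jobs = do
  queue    <- newMVar jobs
  results  <- newMVar []
  finished <- replicateM n newEmptyMVar
  forM_ finished $ \fin -> forkIO $ do
    cache <- newIORef Set.empty          -- persists across this worker's jobs
    let loop = do
          next <- modifyMVar queue $ \js -> case js of
                    []       -> return ([], Nothing)
                    (j:rest) -> return (rest, Just j)
          case next of
            Nothing -> putMVar fin ()
            Just j  -> do
              modifyIORef' cache (Set.insert j)  -- "load interface" once
              modifyMVar_ results (return . (j:))
              loop
    loop
  mapM_ takeMVar finished
  readMVar results

main :: IO ()
main = do
  out <- runWorkers 4 ["A", "B", "C", "D", "E"]
  print (Set.fromList out)
```

Completion order is nondeterministic, but every job is processed exactly once; in the real design the queue would be fed in dependency order by the build system.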
Unlike some previous work in this direction, the "compile server" mode will be used only as a back-end for compiling a single program or library, not as a persistent daemon, which should make it easier to implement.
OpenShake - Neil Mitchell's Shake tool can compile a multi-module Haskell program in parallel by extracting a dependency graph with 'ghc -M' and then starting multiple 'ghc -c' processes. Unfortunately, Neil found that this approach doesn't make the build faster. Quoting the Shake paper: "Building the same project with 'ghc --make' takes 7.69 seconds, compared to Shake with 11.83 seconds on one processor and 7.41 seconds on four processors." The reason for this phenomenon was explained above.
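For reference, the scheduling side of this approach is straightforward. Given a dependency graph like the one 'ghc -M' emits, modules can be grouped into "waves" that are safe to compile in parallel (a sketch of my own, not Shake's or GHC's code):

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Group modules into parallel "waves": each wave contains only modules
-- whose dependencies all appear in earlier waves. Input is a list of
-- (module, direct dependencies) pairs, as extractable from 'ghc -M'.
waves :: [(String, [String])] -> [[String]]
waves g = go (Map.fromList g) Set.empty
  where
    go pending doneSet
      | Map.null pending = []
      | null ready       = error "cycle in dependency graph"
      | otherwise        = ready : go pending' (foldr Set.insert doneSet ready)
      where
        ready    = [ m | (m, ds) <- Map.toList pending
                       , all (`Set.member` doneSet) ds ]
        pending' = foldr Map.delete pending ready

main :: IO ()
main = print (waves [ ("Main", ["A", "B"])
                    , ("A", ["C"]), ("B", ["C"]), ("C", []) ])
-- [["C"],["A","B"],["Main"]]
```

The scheduling itself is cheap; as the Shake measurements show, the bottleneck is the per-process interface-loading overhead of 'ghc -c', not the dependency analysis.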
This project naturally lends itself to decomposition into two parts, which correspond to the periods before and after the midterm evaluation.
Changes to GHC - I'll start by adding a "compile server" mode to GHC and then extend my ghc-parmake tool to use it as a back-end (a standalone "parallel make" tool for Haskell/GHC that is not integrated with Cabal is quite useful by itself). I'll try to implement 'ghc --server' by using the GHC API first, if possible. There already exists a prototype "ghc server" written by Evan Laforge, so I won't be starting from scratch.
Changes to Cabal - In Cabal, we'd like to support both parallel builds (module-level parallelism) and parallel installs (package-level parallelism). While extending 'cabal build' with support for parallelism should be relatively simple, integrating it with parallel installs will require using IPC. A single coordinating 'cabal install -j N' process will spawn a number of 'setup.exe build --semaphore=/path/to/semaphore' children, and each child will be building at most N modules simultaneously. An added benefit of this approach is that nothing special will have to be done to support custom setup scripts.
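The slot-sharing idea can be modelled in-process with a QSem. The real design would use an OS-level named semaphore shared between the coordinating 'cabal install' process and its setup.exe children; this sketch only illustrates the invariant that each "module compilation" must hold one of a fixed number of shared slots:

```haskell
import Control.Concurrent
import Control.Concurrent.QSem
import Control.Exception (bracket_)
import Control.Monad

-- Hold one of the shared slots for the duration of an action.
-- bracket_ guarantees the slot is released even on exceptions.
withSlot :: QSem -> IO a -> IO a
withSlot sem = bracket_ (waitQSem sem) (signalQSem sem)

main :: IO ()
main = do
  sem  <- newQSem 2                 -- at most 2 compilations at once
  done <- newMVar (0 :: Int)
  fins <- forM ["A", "B", "C", "D"] $ \_m -> do
    fin <- newEmptyMVar
    _ <- forkIO $ withSlot sem $ do
      threadDelay 1000              -- stand-in for compiling a module
      modifyMVar_ done (return . (+ 1))
      putMVar fin ()
    return fin
  mapM_ takeMVar fins
  readMVar done >>= print           -- 4
```

With an OS-level semaphore in place of QSem, the same acquire/release discipline works across process boundaries, which is exactly what lets parallel installs and parallel builds share one global job limit.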
There are some alternative ways to approach the problem of parallel module-level builds that I decided against for this project. I'll briefly discuss the rationale for rejecting them.
Use ghc -c - As discussed above, this has already been done, but it doesn't bring the expected speed improvements due to the absence of interface file caching.
Make ghc -c as fast as ghc --make - This requires a considerable amount of GHC hacking, and is in my view less likely to succeed, since I have much less experience working with the GHC codebase than with Cabal's. Moreover, improvements that make both 'ghc -c' and 'ghc --make' faster won't actually benefit the project - we're only interested in the relative speed-up vs. 'ghc --make', and making the cost of deserialising/typechecking interface files negligible is expected to be quite challenging.
Parallelise ghc --make itself - Again, deemed too risky because I don't have much experience working on GHC code. Additionally, this has been attempted by Thomas Schilling in the past, with negative results - the parallel patches slowed down serial compilation too much. I should note that if someone implements this approach in the future, it should be possible to change 'cabal build -j' to use it as a back-end.
The main risk is that there will be unforeseen obstacles preventing the implementation of a "compile server" mode for GHC. I don't have much experience with GHC hacking, so I purposefully chose a more limited form of 'ghc --server' - a back-end for a single parallel compilation instead of a more general compile server that sits in the background the whole time and is used to submit jobs to. This is, after all, exactly the same functionality that GHC itself uses for implementing 'ghc --make' - the only existing limitation is that compilations can't be distributed across several different OS processes. Other than that, I'm quite confident in my abilities to implement this design.
'ghc --make -j' vs. 'ghc --server'
There is an alternative proposal for implementing parallel builds by parallelising 'ghc --make' directly. In case both proposals get accepted, I'll refocus my project a bit to pay more attention to the parts where they differ:
- As discussed above, the changes to Cabal that I'll need to implement should be possible to adapt to support both approaches. This will free the other student from having to think about extending Cabal.
- There are some other opportunities for parallelising 'cabal build', such as compiling normal and profiling versions of libraries and running Haddock in parallel. I'll make sure to take a look at that.
- With some extensions, the "compile server" mode can be made useful by itself for enabling fast incremental builds. This is an important issue in practice; I've been told that "yesod devel" implements a special-purpose compile daemon to make incremental builds faster. I will try to extend 'ghc --server' in this direction. An extended "compile server" mode can also be useful for IDE implementors.
- Last but not least, my project will serve as a hedge against the risk of failure of the other project. Even if the other project succeeds, it's still unclear which approach will give the best speed-up. Simon Marlow's 2005 SMP-GHC paper reports a speed-up of 1.32 for 'ghc --make' parallelised using Concurrent Haskell. Of course, there may be ways to improve on that result.
Approximate time plan
Before the coding period starts
Get acquainted with GHC source code and the development process.
June 17 - Aug 2
Extend GHC with support for a single-compilation "compile server".
Rework ghc-parmake to use 'ghc --server' as a back-end.
Aug 2 - Sep 27
Integrate the new version of ghc-parmake with 'cabal build'.
Polishing, bugfixing, testing.
Deliverables
A set of patches for GHC (or a program written using the GHC API) that implements a single-compilation "compile server" mode.
A version of ghc-parmake that uses this modified version of GHC as a back-end.
A set of patches that add support for parallel module-level builds to Cabal.
Benefits for the Haskell community
A significant share of Haskell developers have multi-core machines, and fast parallel builds will directly benefit them. Unlike package-level parallelism, this project will have a much broader impact, since it will help with ongoing development and not only the initial installation of packages.
This project will also bring Cabal one step closer to being a true build system (right now it delegates all module-level dependency management to the underlying compiler), which is important for its future evolution.
I'm a master's student in Computer Science at Umeå University, Sweden, and an experienced Haskell and C++ programmer. I'm an active contributor to the Cabal project; since 2009 I've also been involved in the Haskell Platform project as the maintainer of the Windows installer.
I've already successfully participated in Google Summer of Code twice (in 2011 and 2012), also working on Haskell/Cabal projects. I've been supporting and extending my patches even after the end of the summer periods; both sets of patches have been merged into the mainline Cabal repo; the first one was included in an official release.
This project is a direct continuation of my work during GSoC 2011 on building multiple Cabal packages in parallel ('cabal install -j'). Implementing infrastructure projects that benefit a large number of people and support the growth of the language is fun and rewarding.