Darcs: primitive patches version 3

Petr Ročkai

Short description: Darcs, a revision control system, uses so-called patches to represent changes to individual version-controlled files, where the "primitive" patches are the lowest level of this representation, capturing notions like "hunks" (akin to what diff(1) produces), token replace and file and directory addition/removal. I propose to implement a different representation of these primitive patches, hoping to improve both performance and flexibility of darcs and to facilitate future development.

About the project

In last two years, significant performance improvements have been made in the darcs codebase, most notably in the handling of the local repository access. Nevertheless, many issues remain with handling large repositories and many are very hard or even impossible to address with the current primitive patch format. I am proposing a project to write a new implementation of primitive patches, based on the research and discussions on this topic that had taken place since the release of darcs 2.2. (See e.g. http://wiki.darcs.net/Ideas/AddAddConflicts and http://web.mornfall.net/blog/patch_formats.html.)

A proper implementation of new primitive patch format should bring a number of benefits: better overall performance, faster merging in most cases, making unrelated-repository merges possible and even quite straightforward. Other than these more direct benefits, a new format should enable a number of future improvements as well, including better on-disk format and improvements in network operation speed, which is another sore point of darcs as it stands.

Implementation

My aim is to build on existing research. While a lot of thinking has already gone into this, I believe that it is now time to prototype an implementation: without "feedback" that an actual implementation will generate, it is hard to assess the relative benefits of different approaches. So far, we can only hypothesize about advantages of new primitive formats, and while the consensus is that substantial improvements can be derived from this change, these are yet to be quantified.

As far as the actual implementation is concerned, Ganesh has been working on improvements to the internal interfaces around the patch format implementation in darcs: I fully intend to leverage this work, which should make it reasonably easy to add a new (primitive) patch format without compromising stability of existing code. Of course, to realize many of the benefits of the changes, other parts of the code will need to be modified, sometimes even substantially. Nevertheless, I believe it should be possible to integrate this work into darcs within its current "stable" release series, as long as the new patch format stays disabled by default. The collateral code refactoring will hopefully bring some benefits even to code paths of existing patch formats.

I fully realize the difficulties of introducing incompatible format changes in a version control system. Nevertheless, the darcs project is, as far as I can tell, unanimous in the decision that a new patch format is required to enable further improvements in darcs. Even though the overall patch format consists of two semi-independent layers (primitive patches on one hand, changset and conflict representation on the other), changing either will still lead to an incompatible format. Therefore, we would like to switch both at once, to avoid the overhead (both for developers and for users) of two incompatible changes in a row. In this sense, a new implementation of primitive patches is a stepping stone for a new implementation of the higher, conflict-representing layer. Unfortunately, doing both in a single summer is, in my opinion, completely unrealistic. Of these two layers, I think the primitive patches are better understood and are more likely to be a success. The opinions about the right representation for conflicts are more diverse, and no single direction can so far clearly guarantee an improvement that would justify a format change.

Concrete Goals

The main, specific goal of the project is an implementation of primitive patches with following traits:

  • file content and file location (existence) are tracked separately
  • the on-disk format allows either detached or at least seekable storage of hunk (chunk) content
  • allow more efficient application of patches (no newline scanning in hunks)
  • allow efficient encoding of binary deltas and binary files in general
  • commuting patches only loads as much of the patch as is required: in the most common case, the hunk content does not need to be present in memory

Moreover, depending on how the schedule turns out, I would like to add at least some of the following:

  • the hunk content can be shared  with (hashed) pristine cache
  • hunk (chunk) move patches
  • tracking objects other than binary blobs and text files (abstract lists, sets, maps)
  • indentation patches

Either way, the result of this project is a prototype implementation of the format. Nevertheless, it should be tested as thoroughly as possible and ideally, it would be ready to be combined with a new conflict/changeset representation that is planned. Therefore, I do not expect the outputs of this project to be directly available to end users as a part of a stable release in the timeframe of the project. However, a future darcs release should derive significant benefits from such an implementation being available, and, ideally, tested in advance.

Scaling back

While I believe that at least the core objectives outlined above can be done within the project timeframe, I acknowledge that the project expands into a territory not entirely well-known or mapped, and it may encounter unexpected roadblocks. I believe that the most important area to cover is the decoupling of file name from file content (in other words, tracking files by UUIDs instead of their names and mapping the UUIDs to filesystem objects independently, as covered by the material linked above). Therefore, I am making this the first priority, with the remaining objectives getting priority in the listed order.

Considering the prototype nature of the implementation, it should be no harm to the project if it is partial, in the sense of only covering a part of the requirements, since the goal is not to deploy the code in a stable release right away but to provide a foundation for stable code to build upon. The extensibility and viability for future development is a priority over feature completeness.

Why?

I am a long-time darcs user and developer. Even though the Haskell community has seen a shift from using darcs to using git and github for hosting, I still believe that darcs has a lot of potential. Several darcs concepts are unique in the version control world, and could offer genuine advantages to the tool, even though presently, this is hampered by implementation issues. Darcs has been gifted with a great initial idea, but has suffered a number of poor design decisions throughout its life. The representation of primitive patches is one such issue that holds darcs back from realizing its full potential. The goal of my project is to reverse this drawback.