Inter-File Branching in PERFORCE

Inter-File Branching™

A Practical Method for Representing Variants

Christopher Seiwald
Perforce Software

based on a paper presented at the
Sixth International Workshop
on Software Configuration Management

Berlin, 25th-26th March 1996


Abstract

Contemporary software configuration management (SCM) systems identify variants in the same namespace that identifies revisions. The variants -- the alternate implementations of a configuration item that must exist in parallel -- and revisions -- the iterative refinements that each variant takes on over time -- form a two-dimensional version tree for a configuration item. So typically a configuration item will have two names: one that names the item and another that names the version.

This paper presents an alternate approach where the identification of a variant is moved into the name of the configuration item, leaving the version namespace only a linear set of revisions. Because this method has been realized in a working system where the configuration items are software source files, it is called Inter-File Branching™. Branching is the act of creating variants, files are the configuration items, and inter-file reflects the fact that variants are separate files.

1. Introduction

This paper describes a new mechanism for handling file variants. First it describes the two major facets of the current method -- the way the variants are named and how they are created -- and cites their practical limitations. It then shows how a new approach called Inter-File Branching (IFB) does not suffer the same limitations, and describes the new mechanism in detail. It concludes with some empirical observations about using IFB.

2. Version-Tree Branching

2.1. The Version Names the Variant

A file is typically identified by two names: the name of the file itself and the name of the version of the file. The version itself decomposes into two pieces of information: the name of the variant and the name of the revision of the variant. The variant selects one of any alternate implementations that exist in parallel, while the revision selects one of the implementations of a variant that has evolved over time [4]. Usually, the first revision of a new variant is actually some revision of another variant; this is called the branch point. Joined at branch points, the variants and their revisions form a version tree, with each variant a branch in the tree and each revision a node on that branch.

In SCCS [5] and RCS [8] (and the many dozens of systems built upon these tools), the variants (in fact called "branches") are named with pairs of numbers and the revision is named by another pair of numbers. The "trunk branch" has an empty name. For example, 1.2.2.1.1.3 represents the 1.3 revision of variant 1.2.2.1, which itself is revision 2.1 of variant 1.2, which itself is revision 1.2 of the trunk. Obviously, remembering the significance of these numbers can become daunting for users, and so symbolic names are often applied to important versions.
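The decomposition described above can be sketched in a few lines of Python. This is only an illustration of the pairing as this paper describes it, not the parser of SCCS or RCS; the function names `split_version` and `ancestry` are hypothetical.

```python
def split_version(version):
    """Split a dotted version into (variant, revision), pairing numbers as described above."""
    parts = version.split(".")
    if len(parts) < 2 or len(parts) % 2 != 0:
        raise ValueError("expected an even number of dotted components")
    variant = ".".join(parts[:-2])   # the empty string names the trunk
    revision = ".".join(parts[-2:])  # the final pair names the revision
    return variant, revision


def ancestry(version):
    """List (variant, revision) pairs from the given version back to the trunk."""
    chain = []
    while version:
        variant, revision = split_version(version)
        chain.append((variant, revision))
        version = variant  # the variant name is itself a version of its parent
    return chain
```

Calling `ancestry("1.2.2.1.1.3")` walks the same chain as the prose example, ending at the trunk, whose variant name is empty.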

Atria's ClearCase product [1] dispenses with revision numbers altogether and exposes a hierarchical version name to the user. For example, main/release2/bugfix/3 represents revision 3 of the bugfix branch, which is derived (at some point) from the release2 branch, which is derived from the main trunk. This naming is much more mnemonic than SCCS's numbering scheme, with the arguable liability that the branch point isn't explicitly named.

The approach of representing variants within the version namespace has several shortcomings:

Two Hierarchies. Each variant of a file is identified by the product of two hierarchical names -- the file's name and the variant name. While it is a subjective point, having files be identified by two hierarchies can leave users unable to visualize configurations. In theory, SCM systems are functional enough that users don't actually need such a convenience, but in practice engineers dealing with complex software products and their multiple releases often need to comprehend the "shape" of their product's file tree.

The picture can get especially murky if variants are created on-demand. Such a mechanism stores an update to a version not as a new revision of the same variant but as the first revision of a newly-created variant. The result of this is that, after some time, many files have many incidental variants, and important configurations may include differently named variants for different files.

Disparate semantics. Having one namespace for files and another for the variants of those files poses a dilemma for the user of existing SCM systems because those two namespaces are supported by different semantics. Specifically, the support for automated delta merging is normally tied into the version namespace: one can only merge deltas from one variant to another. (A delta is the change that leads from one revision to its successor.) On the other hand, only one variant of a file can be surfaced to the user at a time.

Thus if the user plans to merge deltas from one variant to another, they must be variants of a single file. On the other hand, if the user wants to access both variants at the same time then they must be different files. There are several real-world examples where this unsupported combination is desirable: in client/server systems, where client and server implementations of a protocol machine are similar but not identical; in multi-platform systems, where operating system specific files are similar but not identical; in systems migrating from one programming language to another, where perhaps C and C++ implementations must be kept in sync for the time being. In all these cases the files are variants of each other but by common practice they are in fact separate files.

Mainline-Centric. The representation of the version namespace assumes that one variant is the main branch from which all others descend. This mainline-centric view is often applicable to the early stages of a single software project, but rarely fits the structure of existing, mature products (especially those which began with poor SCM). These products are likely to have several "main lines" of the same source files, with active, disparate development on each.

The awkwardness of developing on multiple main variants can force users to set up separate repositories for files that would otherwise be variants of each other within the same repository. Doing so generally forgoes all the automated merging and tracking support afforded by the SCM system.

2.2. Creating Variants

Coupled with the common method for representing variants is the common method for creating them: for each file, the user selects an existing version and marks that as the starting point for a new variant. Typically, the existing version is marked with some sort of symbolic name, so that the user need not individually list each file's variant and revision. The new variant is similarly given a symbolic name, to make for easy future reference.

Users often talk about "a branch" to mean all files sharing a common variant name. But the variant creation does not support the semantics of a collective: it only creates variants of existing files, and not any files that may be created in the future. If the changes in one branch are to be merged into another, users then must take an extra step to create the new variants of any newly created files. More importantly, since the information used to create the variants was nothing more than a marked version for each file, there is no external means of telling if a new file should have a corresponding variant made as well.

To address this problem some SCM systems treat containers as configuration items and create variants of them as well. But this scheme can add considerable complexity and confusion: in addition to creating more namespaces that play into the identification of individual versions, it still fails to automate the creation of a variant when a new file is created.

3. Inter-File Branching (IFB)

IFB is an alternate approach that offers potential advantages over version-tree branching. The kernel concept is that creating a variant should be akin to the process fork in UNIX: a new file is created, related to the original in initial content and past history but with an unencumbered future.

This model is similar to what users without any SCM support typically must do when faced with the need to branch their software: they copy the files off and worry about any merging later. The difference between this primitive method and IFB is that the latter automates both the initial copying and the subsequent merging. In essence IFB is mostly a matter of extending the practical support for variants -- merging deltas and tracking merging history -- to span separate files.

With IFB different variants are differently named files. The file's name incorporates both what is normally considered the file name as well as the variant name, although the exact naming is left to the user. For example it is possible to have a file /release2.1/database/src/btree.c be a variant of /patch2.1.5/database/src/btree.c or /x/y/z be a variant of /a/b/c/d.

IFB builds upon a familiar practice: copying things to make variants of them. In doing so it gives users a natural way to express the relationship of a new variant to the old: by giving it a similar name. For example, sometimes a user will simply copy a file foo.c into foo.c.bak. A perhaps better example is when the user copies mainline/src/* into bob/src/*. Bob's files are variants of the mainline, and their names imply their relationship.

If this copying happens within the SCM repository rather than in a user's private workspace, the result is a repository namespace that is natural for the user to visualize. IFB gives users a single handle to identify the variant of a file, which makes the "shape" of configurations plain for the users to see. As in the example above, Bob knows that the configuration that describes his work is simply "what's in bob/src/* in the repository."

By treating variants as separate files, an SCM system supporting IFB can provide automated merging and tracking between any pair of files. The semantics for variants can even be applied post hoc to files that weren't originally related to each other. Thus it is possible to take two files (perhaps from before they were under SCM control) and begin treating them as variants of each other.

With IFB, variants have different names. In fact, the names don't even need to be similar. This makes it possible to create a variant of a whole configuration (i.e. variants of all of the files in a configuration), completely reorganizing the namespace in the process by naming the newly created variants in the desired naming scheme. This can be useful in a variety of situations, most notably when software is branched into another project as a form of code sharing: the new project may very well need its copy of the software "reshaped" to fit its needs.

IFB relates variants through the namespace of the files. This makes it possible to relate whole configurations by expressing the relationship of the names of their component files. If this relationship expression leverages a natural organization of software, it can be quite compact. For example, to say that Bob's code is a variant of the mainline code is "bob/src/* is a variant of mainline/src/*". This expression relates not only the existing files but also any future files that may be created within the namespace.

Because the version namespace no longer holds the name of the variant, it can be reduced to a linear set of revision numbers (beginning at 1) for each file. This means that important configurations are often the tip (highest numbered, most recent) revisions of files within a certain part of the namespace. For example, Bob's latest work is the tip revisions of the files named bob/src/*.
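As a minimal sketch of selecting a configuration by name alone, assume the repository is modeled as a dictionary from file name to a list of revision contents, so that revision n is the n-th list entry. The function name `tip_configuration` and this data layout are assumptions made for illustration, and the wildcard here follows Python's `fnmatch` rules rather than any particular SCM syntax.

```python
from fnmatch import fnmatchcase


def tip_configuration(repository, pattern):
    """Map each file whose name matches the pattern to its tip (highest) revision number."""
    return {
        name: len(revisions)
        for name, revisions in repository.items()
        if revisions and fnmatchcase(name, pattern)
    }
```

With this model, Bob's latest work is `tip_configuration(repository, "bob/src/*")`; no variant name has to be supplied separately.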

3.1. Inter-File Branching Algorithm

The semantics which support this automation of merging stem from the following premise: that given a source file ("source"), any or all of its individual deltas can be meaningfully merged into the contents of a target file ("target"). This includes the initial delta that brings a file from empty content to its first revision. The distinguishing feature between the source and target files is their names. The user selects a source, some subset of its deltas, and the target. The following logic is then applied:

3.2. Integration History

The record of deltas that have been merged from one file into another forms the integration history for that ordered pair of files. It serves two very important functions. First, if the integration history shows that any deltas have been merged from one file into another, it indicates that the target is a variant of the source. Second, the integration history can be used to compute what deltas have yet to be merged. Having this information relieves the user of having to select the source deltas for every merging.
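The two functions of integration history can be illustrated with a small Python sketch. It assumes that delta n of a file is the change producing revision n (delta 1 being the step from empty content to the first revision); the class and method names are hypothetical and not taken from any implementation.

```python
from collections import defaultdict


class IntegrationHistory:
    """Per ordered (source, target) pair, remember which source deltas were merged."""

    def __init__(self):
        self._merged = defaultdict(set)  # (source, target) -> set of delta numbers

    def record(self, source, target, deltas):
        """Note that the given source deltas have been merged into the target."""
        self._merged[(source, target)].update(deltas)

    def is_variant(self, source, target):
        """The target is a variant of the source if any delta has been merged."""
        return bool(self._merged.get((source, target)))

    def unmerged(self, source, target, source_head):
        """Deltas of the source, up to its head revision, not yet merged into the target."""
        done = self._merged.get((source, target), set())
        return [d for d in range(1, source_head + 1) if d not in done]
```

Because the history is keyed by an ordered pair, merges from A into B and from B into A are tracked independently.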

Integration history can also be used as an audit trail and can make for comprehensive reporting. Through transitive closure, it is possible to compute for any revision of any file whatever deltas it incorporates from other files through first, second, third, etc. generation merges.
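The transitive closure mentioned above might be computed as follows. This sketch assumes the history is available as a dictionary mapping a target delta, written as a (file, revision) tuple, to the list of source deltas merged to create it; that layout and the name `incorporated_deltas` are illustrative assumptions.

```python
def incorporated_deltas(records, name, revision):
    """Return all deltas of other lineage folded into name at revision, over any number of generations."""
    own = [(name, r) for r in range(1, revision + 1)]  # the file's own deltas
    seen = set(own)
    stack = list(own)
    while stack:
        delta = stack.pop()
        for merged in records.get(delta, ()):
            if merged not in seen:
                seen.add(merged)
                stack.append(merged)  # follow second, third, ... generation merges
    return seen - set(own)
```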

3.3. Branch Views

If variant files are to be related by their names, something must express that relationship. This is the job of branch views. A branch view is a named, one-to-one mapping between the names of two sets of files: the source and the target of the intended branch. To apply the branch view, the name of every known file is mapped as a candidate source through the view. The resulting source/target pairs define the files to be branched or merged.
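A branch view can be approximated in Python as a named list of prefix mappings. The prefix-based matching, the class name `BranchView`, and its methods are assumptions chosen to keep the sketch short; the paper does not prescribe a mapping syntax.

```python
class BranchView:
    """A named mapping from source file names to target file names."""

    def __init__(self, name, mappings):
        self.name = name
        self.mappings = list(mappings)  # (source_prefix, target_prefix) pairs

    def map_source(self, path):
        """Return the target name for a candidate source, or None if the view excludes it."""
        for source_prefix, target_prefix in self.mappings:
            if path.startswith(source_prefix):
                return target_prefix + path[len(source_prefix):]
        return None

    def apply(self, known_files):
        """Project every known file name through the view, yielding source/target pairs."""
        pairs = []
        for path in sorted(known_files):
            target = self.map_source(path)
            if target is not None:
                pairs.append((path, target))
        return pairs
```

For instance, `BranchView("bob", [("mainline/src/", "bob/src/")])` expresses "bob/src/* is a variant of mainline/src/*", and it covers files created under mainline/src/ after the view was defined, since the mapping is evaluated only when applied.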

There are two important facets of this mechanism: first, even though each branch view allows only one-to-one mappings between sources and targets in the view, there can be multiple branch views with the same sources and/or targets. Thus the same source can be branched into many different targets or, perhaps less intuitively, the same target can be the variant of more than one source. Second, the mapping is applied only at the user's request, rather than as potential sources are created or updated. That is, only when the user requests to create or merge target files is the branch view applied.

Branch views merely document that a user had intended for a set of targets to reflect a set of sources. Actual merging is through the application of the branch view. On the user's request, the names of all existing files are projected through the view to produce a list of source/target pairs. For each pair the integration history is considered and the complementary integrations are proposed to the user. These proposed changes -- creating the first revision of new variants, creating new revisions of existing variants, or deleting variants whose sources had been deleted -- attempt to make the target set reflect the deltas in the source set. It is of course the user's choice as to whether these changes are to be incorporated wholesale, modified somewhat, or ignored altogether. It is rare that the user wishes to keep two configurations exactly identical.
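The proposal step can be sketched as below. This is a simplified illustration under several assumptions: `heads` maps each file name to its head revision number, a head of 0 stands for a deleted or nonexistent file, and `merged` maps an ordered (source, target) pair to the set of source deltas already merged. The function name and the action labels are hypothetical.

```python
def propose_integrations(pairs, heads, merged):
    """Suggest, per source/target pair, the change that would bring the target up to date."""
    proposals = []
    for source, target in pairs:
        source_head = heads.get(source, 0)
        target_exists = heads.get(target, 0) > 0
        done = merged.get((source, target), set())
        if source_head == 0:
            # Source is gone: propose deleting a target that was its variant.
            if target_exists and done:
                proposals.append(("delete", source, target, []))
            continue
        pending = [d for d in range(1, source_head + 1) if d not in done]
        if not pending:
            continue  # target already reflects every source delta
        action = "merge" if target_exists else "branch"
        proposals.append((action, source, target, pending))
    return proposals
```

The result is only a list of suggestions; as the text notes, the user remains free to accept, modify, or ignore each one.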

3.4. Computing the Merge Basis

Previously we mentioned that integration history, which records individually what deltas of a source file have been merged to a target file, can be used to compute exactly what deltas have yet to be merged. A product of this computation is useful as the version to form the basis of a three-way file diff/merge: the revision of the source immediately prior to the first delta yet to be merged.

The usual method for reconciling changes made to a file is to find a common ancestor and use that as a basis for the diff/merge process. This process involves computing the textual difference between the basis and each of the variant's revisions, and then merging those differences. This has been implemented in a wide variety of tools, from RCS's simple rcsmerge program to elaborate GUIs that colorfully illustrate the lines of text that have changed in each version.

Most systems use the original branch point as the common ancestor. Unfortunately, as the text of the two variants evolves further away from the ancestor, more and more of the changes in the variants appear to be in conflict or overlapping with each other. Thus each time the user merges the variants he must revisit any textual conflicts that had been resolved in previous merges. Another approach is to record the highest delta of the source that has been merged, as ClearCase does using hyperlinks. But this assumes that deltas will always be merged in order. Sometimes, for example, it is desirable to merge a single delta containing a bug fix immediately, while leaving the previous and subsequent deltas for later.

Tracking the individual deltas merged makes it possible always to produce the optimum basis for merging, namely the revision of the source immediately prior to the first delta to be merged. This minimizes the difference between the basis and the source head, and consequently minimizes the complexity of the merge.
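Given the set of source deltas already merged, the basis rule stated above reduces to a short search. In this sketch delta n is assumed to be the change producing revision n, and revision 0 stands for empty content; the function name `merge_basis` is hypothetical.

```python
def merge_basis(merged, source_head):
    """Return the source revision just before the first unmerged delta, or None if nothing remains."""
    for delta in range(1, source_head + 1):
        if delta not in merged:
            return delta - 1  # 0 means the basis is empty content
    return None
```

For example, if deltas 1, 2, 3 and 5 of a source at revision 7 have been merged, the first outstanding delta is 4 and the basis is revision 3, even though a later delta was merged out of order.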

3.5. "Pure" Integrations

Because the integration history individually tracks the deltas that are merged from one variant to another, a special advantage can be gained by recognizing pure integrations. A pure integration is one in which the new revision of the target incorporates only deltas from a single source. This is important in that a delta which is the result of a pure integration does not need to be merged back into the source.

Without integration history, the fact is lost that a target variant's new delta is only the merging of deltas from another variant. Any subsequent automated merge would find the new delta and attempt to merge it back into the source. If the delta was originally merged literally, the actual merge process should be a no-op and of little consequence to the user. But if the delta required conflict resolution or other editing when it was originally merged into the target, the user will have to revisit that resolution when merging the delta back.

With an integration history that records pure integrations, automated merges can avoid ever merging a delta back into its source. In practice, the user must provide some hint that the integration was kept pure: during conflict resolution the user has the opportunity to make other changes which may, in fact, need to be merged back. In the end, the user must actually confirm that the integration was kept pure, and the integration history can then record the fact. This record can then be used to suppress an automated attempt to merge the delta back into the source.
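One way to picture the suppression is to flag each integration record with the user's confirmation and filter on it when proposing merges in the reverse direction. The record layout and the names `Integration` and `candidates_for_merge` are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Integration:
    source: str
    target: str
    target_revision: int  # revision of the target created by this merge
    pure: bool            # True only if the user confirmed no other edits were made


def candidates_for_merge(history, origin, destination, origin_head, already_merged):
    """Deltas of origin worth proposing for a merge into destination."""
    # Deltas of origin that are nothing but confirmed-pure merges from destination.
    pure_from_destination = {
        rec.target_revision
        for rec in history
        if rec.pure and rec.target == origin and rec.source == destination
    }
    return [
        d for d in range(1, origin_head + 1)
        if d not in already_merged and d not in pure_from_destination
    ]
```

A delta produced by an integration that was not confirmed pure is still proposed, since it may contain edits the source has never seen.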

3.6. Namespace Management

If the hierarchical namespace of files is to serve double-duty as the name of variants, then room should be left in the file name for the name of the variant. This is largely a matter of individual use, but we find for simple cases that putting the variant name at the beginning of the file name renders the full file tree easy to grasp. For example, it is fairly easy for users to understand that the main line of development is under /main/src/... while the latest release has been branched into /release2.1/src/..., etc.

3.7. Client Views

It is important that the name of the files in the SCM repository need not match the name of the file when in the hands (or filesystem) of the user. If the name must encode the variant, then it is desirable to be able to strip that information from the file name when the file is delivered to the users. For this a client view can be used: each client workspace has a view (much like a branch view) that projects the names of the files in the repository onto the user's local filesystem. In this way, a single entity (the client view) specifies not only the files but also the variants that a user wishes to see. Such a view can bring two different variants onto the client merely by giving them distinct names on the client.
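A client view can be sketched in the same spirit as a branch view: an ordered list of prefix mappings from repository names to local names. The prefix matching and the function name `project_to_client` are assumptions for illustration only.

```python
def project_to_client(client_view, repository_paths):
    """Map repository file names to local names; files outside the view are omitted."""
    local = {}
    for path in repository_paths:
        for repo_prefix, local_prefix in client_view:
            if path.startswith(repo_prefix):
                # Swap the variant-bearing prefix for the name the user wants locally.
                local[path] = local_prefix + path[len(repo_prefix):]
                break
    return local
```

A view such as `[("/release2.1/src/", "src/"), ("/main/src/", "main-src/")]` strips the variant name from the release files while still bringing the main-line variant onto the same client under a distinct local name.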

3.8. Variants as Virtual Copies

In practice, if branching a file means copying it in the repository then storage space can be a concern. A simple antidote is to perform a virtual copy, where the newly branched file makes use of the contents of the original file. This requires a level of indirection between the repository namespace and the actual underlying object store. When the branch is extended by adding a new revision, the branched file can acquire its own separate entity in the object store.

Supporting variants with virtual copies reduces the space requirement of a variant to be merely a record for the newly created variant that points to the original file. This is, more or less, no greater than the cost of a traditional variant.

This virtual copy can also be used when one variant is explicitly synchronized with another. If the user wishes to fold a branch back into a trunk and make the two identical, both can reference the same content at that point.
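The level of indirection between namespace and object store can be illustrated with a toy repository. Everything here (the class `Repository`, its methods, and the list-based object store) is a hypothetical model meant only to show that a branch adds a record, not a second copy of the content.

```python
class Repository:
    """Toy repository: file names point at entries in a shared object store."""

    def __init__(self):
        self._objects = []    # object store: index -> content
        self._revisions = {}  # file name -> list of object indexes, one per revision

    def add_revision(self, name, content):
        """Store new content and append it as the next revision of the file."""
        self._objects.append(content)
        self._revisions.setdefault(name, []).append(len(self._objects) - 1)
        return len(self._revisions[name])

    def branch(self, source, target):
        """Virtual copy: the target's new revision reuses the source tip's stored object."""
        tip = self._revisions[source][-1]
        self._revisions.setdefault(target, []).append(tip)
        return len(self._revisions[target])

    def content(self, name, revision):
        return self._objects[self._revisions[name][revision - 1]]

    def stored_objects(self):
        return len(self._objects)
```

Branching leaves `stored_objects()` unchanged; only when the branched file later receives its own revision does the store grow. The same `branch` call models the explicit synchronization case, where a trunk revision simply references the branch's content.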

4. Discussion

4.1. Comparison with Existing Technology

IFB has been directly contrasted in this paper with the version tree model found in popular systems such as RCS and ClearCase. There are two other approaches to variant support that deserve mention.

4.1.1. Baseline Model

The baseline model [2] can be viewed as a much simplified attempt at IFB. Strictly speaking, a baseline is just a configuration that serves as the basis for future changes, but in common practice the configuration's contents are copied and the newly copied files become the actual baseline. In this respect, it resembles IFB's copying, because the new baseline itself has a name that distinguishes it from the original configuration.

Compared to IFB, the baseline model is limited in three ways: first, the relationship of the names is fixed -- usually the baselines are named, while the files within the baselines have names that are fixed across baselines. Second, a baseline is some predetermined set of files, usually a "project" or "product". It is not possible to create variants of just one or two files. Third, integration history, if present at all, is fairly limited with baselines. It doesn't, for example, track individual deltas merged from one variant to another.

4.1.2. Change Set Model/ICE Version Set Model

The change set model [7] and the version set model in ICE [10] are practically the antithesis of IFB. In both of these models, variants and revisions do not form a tree of named versions for each file. Instead, each file has a pool of deltas that can arbitrarily be applied to produce a (possibly senseless) version. With the change set model, the combination of deltas to be applied is called a "change set", and a change set itself can be recorded as part of a named configuration. With ICE's version set model, the deltas have attributes (called "features") and are selected by finding those deltas with the right combination of attributes.

While both the change set model and ICE's version set model probably compare well with the version tree model, they are markedly different from IFB. IFB seeks to keep variants distinct in the repository namespace, so that users can view and handle them as independent files. Change sets and version sets, on the other hand, pile all variants of a file into a single entity and require users to select (or create) the desired version with every interaction. To put it on a continuum: with IFB important configurations can be identified with merely the names of the component files; with version trees there is a fair chance that important configurations are not on the trunk, and so the variants must be explicitly named; with change sets and version sets, important configurations surely require explicit specification of the change set or feature list.

4.2. Empirical Evaluation

An earlier version of the IFB model was implemented at Ingres Corporation as part of an in-house SCM system called Piccolo [6]. It lacked the branch views and the detail in the integration history of the current model. Thus creating variants in the first place was cumbersome as the user had to enumerate specific files to be branched, and sometimes the system would (iteratively) insist that a delta be merged back after being merged into another file.

Nonetheless, IFB proved to be a solid foundation for carrying out parallel development and releases. Several hundred developers and other engineers spread across three continents used Piccolo's branching to support perhaps a dozen products totalling 12,000 files. At any one time, a dozen simultaneous releases were in progress.

For the most part, branches were made by perturbing the first component of the name of each file stored in the repository. The main line of descent was the development line and was called "main". From this were dozens of branches for individual developers or groups, each named after the developer or project. Also from the main line were branches at every major release, named after the release. These major releases became sub-main lines themselves, and from them patch branches were sprouted.

One of the unexpected benefits of IFB was that because branches were expressed as a relationship between the names of two arbitrary files, developers were free to create branches which had a different tree structure than the main line of code. This allowed them to build and maintain their product in an environment customized to their needs. For example, one group might be responsible for directories that were "far apart" in the name space of the main line. As a convenience, they could bring these directories together in their group's branch.

The current model for IFB, as described here, is implemented in a commercially available system, PERFORCE. PERFORCE is a deliberate attempt to build upon the semantics of IFB while remedying the deficiencies found in the Piccolo implementation. PERFORCE is new, but early experience shows the model to be flexible, usable, and fairly complete. We expect that the record of integration history could prove to be a bounty of useful information in a large, mature software product, but we have no history for such a product yet.

5. Conclusion

The popularity of traditional variant handling belies a fundamentally unnatural model. Even though experienced users of current SCM systems understand the uses of a combined variant and revision namespace, the weaknesses in the foundations of this approach eventually lead to a collection of gaps in its total capability. IFB, on the other hand, begins with a natural model -- copying software files and renaming them -- and finishes it with a collection of techniques that make its model usable. Given the opportunity, IFB could rival the functionality of traditional variant handling.

6. Bibliography

[1] Atria Corporation, ClearCase Concepts Manual. Natick, Massachusetts, 1994.

[2] Bersoff, Henderson, Siegel, Software Configuration Management. Prentice-Hall, 1980.

[3] Susan Dart, The Past, Present, and Future of Configuration Management. Technical Report CMU/SEI-92-TR-8, Software Engineering Institute, Carnegie Mellon University, 1992.

[4] Stephen MacKay, The State of the Art in Concurrent, Distributed Configuration Management. Proceedings of the 5th International Workshop on Software Configuration Management, Seattle, WA, 1995.

[5] Rochkind, The Source Code Control System. IEEE Transactions on Software Engineering, Volume SE-1, December 1975.

[6] Roger Rohrbach and Christopher Seiwald, Galileo: A Software Maintenance Environment. Proceedings of the International Workshop on Software Version and Configuration Control, Grassau, 1988.

[7] Software Maintenance and Development Systems (SMDS), Aide-de-Camp Users Manual. Concord, Massachusetts, 1993.

[8] Walter Tichy, RCS -- a system for version control. Software -- Practice and Experience, July 1985.

[9] Walter Tichy, Tools for Software Configuration Management. Proceedings of the International Workshop on Software Version and Configuration Control, Grassau, 1988.

[10] Andreas Zeller, Smooth Operations with Square Operators -- The Version Set Model in ICE. Informatik-Bericht No. 95-08, Technical University of Braunschweig, Germany, 1995.


Copyright 1996 Perforce Software. Comments to info@perforce.com.
Last updated: September 18, 1996