PEP: 374
Title: Migrating from svn to Mercurial
Version: $Revision$
Last-Modified: $Date$
Author: Brett Cannon, Dirkjan Ochtman
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 07-Nov-2008
Post-History: 07-Nov-2008
              22-Jan-2009

.. warning::
    This PEP is in the draft stages and is still under active
    development in terms of the transition plan even though Hg is the
    chosen DVCS.


Motivation
==========

Python has been using a centralized version control system (VCS;
first CVS, now Subversion) for years to great effect. Having a master
copy of the official version of Python provides people with a single
place to always get the official Python source code. It has also
allowed for the storage of the history of the language, mostly for
help with development, but also for posterity. And of course the V in
VCS is very helpful when developing.

But a centralized version control system has its drawbacks. First and
foremost, in order to have the benefits of version control with
Python in a seamless fashion, one must be a "core developer" (i.e.
someone with commit privileges on the master copy of Python). People
who are not core developers but who wish to work with Python's
revision tree, e.g. anyone writing a patch for Python or creating a
custom version, do not have direct tool support for revisions. This
can be quite a limitation, since these non-core developers cannot
easily do basic tasks such as reverting changes to a previously saved
state, creating branches, publishing their changes with full revision
history, etc. For non-core developers, the last safe tree state is
one the Python developers happen to set, and this prevents safe
development. This second-class citizenship is a hindrance to people
who wish to contribute to Python with a patch of any complexity and
want a way to incrementally save their progress to make their
development lives easier.

There is also the issue of having to be online to be able to commit
one's work.
Because centralized VCSs keep a central copy that stores all
revisions, one must have Internet access in order for their revisions
to be stored; no Net, no commit. This can be annoying if you happen
to be traveling and lack Internet access. There is also the situation
of someone wishing to contribute to Python but having a bad Internet
connection, where committing is time-consuming and expensive and it
might work out better to do it in a single step.

Another drawback to a centralized VCS is that a common use case is
for a developer to revise patches in response to review comments.
This is more difficult with a centralized model because there's no
place to contain intermediate work. It's either all checked in or
none of it is checked in. In the centralized VCS, it's also very
difficult to track changes to the trunk as they are committed, while
you're working on your feature or bug fix branch. This increases the
risk that such branches will grow stale or outdated, or that merging
them into the trunk will generate too many conflicts to be easily
resolved.

Lastly, there is the issue of maintenance of Python. At any one time
there is at least one major version of Python under development (at
the time of this writing there are two). For each major version of
Python under development there is at least the maintenance version of
the last minor version and the in-development minor version (e.g.
with 2.6 just released, that means that both 2.6 and 2.7 are being
worked on). Once a release is done, a branch is created between the
code bases where changes in one version do not (but could) belong in
the other version. As of right now there is no natural support for
this branch in time in central VCSs; you must use tools that simulate
the branching. Tracking merges is similarly painful for developers,
as revisions often need to be merged between four active branches
(e.g. 2.6 maintenance, 3.0 maintenance, 2.7 development, 3.1
development).
In this case, VCSs such as Subversion only handle this through arcane
third party tools.

Distributed VCSs (DVCSs) solve all of these problems. While one can
keep a master copy of a revision tree, anyone is free to copy that
tree for their own use. This gives everyone the power to commit
changes to their copy, online or offline. It also more naturally ties
into the idea of branching in the history of a revision tree for
maintenance and the development of new features bound for Python.
DVCSs also provide a great many additional features that centralized
VCSs don't or can't provide.

This PEP explores the possibility of changing Python's use of
Subversion to any of the currently popular DVCSs, in order to gain
the benefits outlined above. This PEP does not guarantee that a
switch to a DVCS will occur at the conclusion of this PEP. It is
quite possible that no clear winner will be found and that svn will
continue to be used. If this happens, this PEP will be revisited and
revised in the future as the state of DVCSs evolves.


Choice of DVCS
==============

This PEP included a thorough investigation of three DVCSs as options
for migration, with substantial work from Barry Warsaw, Alexandre
Vassalotti and Stephen Turnbull. That comparison has been moved to
`DvcsComparison`_, and this PEP now includes more information on the
migration to Mercurial.

.. _DvcsComparison: http://wiki.python.org/moin/DvcsComparison

At PyCon 2009, a decision was made to go with Mercurial.

The choice to go with Mercurial was made for four important reasons:

* According to a small survey, Python developers are more interested
  in using Mercurial than in Bazaar or Git.

* Mercurial is written in Python, which is congruent with the
  python-dev tendency to 'eat their own dogfood'.

* Mercurial is significantly faster than bzr (it's slower than git,
  though by a much smaller difference).

* Mercurial is easier to learn for SVN users than bzr.


Although all of these points can be debated, in the end a
pronouncement from the BDFL was made to go with hg as the chosen DVCS
for the Python project.


Transition Plan
===============

Introduction
------------

To make the most of hg, I (Dirkjan) want to make a high-fidelity
conversion, such that (a) as much of the svn metadata as possible is
retained, and (b) all metadata is converted to formats that are
common in Mercurial. This way, tools written for Mercurial can be
optimally used. In order to do this, I want to use the hgsubversion
software to do an initial conversion. This hg extension is focused on
providing high-quality conversion from Subversion to Mercurial for
use in two-way correspondence, meaning it doesn't throw away as much
available metadata as other solutions.

Such a conversion also seems like a good time to reconsider the
contents of the repository and determine if some things are still
valuable. In this spirit, in the following sections I propose
discarding some of the older metadata.

Branch strategy
---------------

Mercurial has two basic ways of using branches: cloned branches,
where each branch is kept in a separate directory, and named
branches, where each revision keeps metadata to note on which branch
it belongs. The former makes it easier to distinguish branches, at
the expense of requiring more disk space on the client. The latter
makes it a little easier to switch between branches, but often has
somewhat unintuitive results for people (though this has been getting
better in recent versions of Mercurial).

For Python, I think it would work well to have cloned branches and
keep most things separate. This is predicated on the assumption that
most people work on just one (or maybe two) branches at a time.
Branches can be exposed separately, though I would advocate merging
old (and tagged!) branches into mainline so that people can easily
revert to older releases.
At what age of a release this should be done can be debated (a
natural point might be when the branch becomes unsupported, e.g. 2.4
at the release of 2.6).

Converting branches
-------------------

There are quite a lot of branches in SVN's branches directory. I
propose to clean this up a bit, by employing the following strategy:

* Keep all release (maintenance) branches

* Discard branches that haven't been touched in 18 months, unless
  someone indicates there's still interest in such a branch

* Keep branches that have been touched in the last 18 months, unless
  someone indicates the branch can be deprecated

Converting tags
---------------

The SVN tags directory contains a lot of old stuff. Some of these are
not, in fact, full tags, but contain only a smaller subset of the
repository. I think we should keep all release tags, and consider
other tags for inclusion based on requests from the developer
community. I'd like to consider unifying the release tag naming
scheme to make some things more consistent, if people feel that won't
create too many problems.

Author map
----------

In order to provide user names the way they are common in hg (in the
'First Last <email>' format), we need an author map to map cvs and
svn user names to real names and their email addresses. I have a
complete version of such a map in my `migration tools repository`_.
Some of the email addresses in it may be out of date; that's bound to
happen, although it would be nice to have as many people as possible
review it for stale addresses. The current version also still seems
to contain some encoding problems.

.. _migration tools repository: http://hg.xavamedia.nl/cpython/pymigr/

Generating .hgignore
--------------------

The .hgignore file can be used in Mercurial repositories to help
ignore files that are not eligible for version control. It does this
by employing several possible forms of pattern matching.
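
As a rough illustration of pattern matching in the glob and regexp styles, the following Python sketch mimics how such ignore rules might be evaluated. It is not Mercurial's implementation: the ``(syntax, pattern)`` pair format and the function names are assumptions made for this example, and Python's ``fnmatch`` only approximates Mercurial's glob semantics.

```python
import fnmatch
import re


def compile_ignore(patterns):
    """Compile (syntax, pattern) pairs into (syntax, regex) pairs.

    The "glob" and "regexp" labels mirror the two pattern styles; the
    pair-list input is an assumption of this sketch, not the on-disk
    .hgignore format.
    """
    compiled = []
    for syntax, pattern in patterns:
        if syntax == "glob":
            # fnmatch.translate converts a shell-style glob to a regex
            # that must match the whole string it is applied to.
            compiled.append(("glob", re.compile(fnmatch.translate(pattern))))
        elif syntax == "regexp":
            compiled.append(("regexp", re.compile(pattern)))
        else:
            raise ValueError("unknown syntax: %r" % (syntax,))
    return compiled


def is_ignored(path, compiled):
    """Return True if the slash-separated path matches any pattern."""
    basename = path.rsplit("/", 1)[-1]
    for syntax, regex in compiled:
        if syntax == "glob":
            # Try the full path and the final component, so that a
            # pattern like "*.pyc" catches files in any directory.
            if regex.match(path) or regex.match(basename):
                return True
        elif regex.search(path):
            # Regexps are unanchored: they may match anywhere in the path.
            return True
    return False


# Hypothetical rules: compiled bytecode anywhere, plus a build directory.
rules = compile_ignore([("glob", "*.pyc"), ("regexp", r"^build/")])
print(is_ignored("Lib/os.pyc", rules))
print(is_ignored("build/temp/a.o", rules))
print(is_ignored("Lib/os.py", rules))
```
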
The current Python repository already includes a rudimentary
.hgignore file to help with using the hg mirrors.

It might be useful to have the .hgignore be generated automatically
from svn:ignore properties. This would make sure all historic
revisions also have useful ignore information (though one could argue
ignoring isn't really relevant to just checking out an old revision).

Revlog reordering
-----------------

As an optional optimization technique, we should consider trying a
reordering pass on the revlogs (internal Mercurial files) resulting
from the conversion. In some cases this results in dramatic decreases
in on-disk repository size.

Other repositories
------------------

Richard Tew has indicated that he'd like the Stackless repository to
also be converted. What other projects in the svn.python.org
repository should be converted? Do we want to convert the peps
repository? distutils? others?


Infrastructure
==============

hg-ssh
------

Developers should access the repositories through ssh, similar to the
current setup. Public keys can be used to grant people access to a
shared hg@ account. An hgwebdir instance should also be set up for
easy browsing and read-only access. Some facility for
sandboxes/incubator repositories could be discussed.

Hooks
-----

A number of hooks are currently in use. The hg equivalents for these
should be developed and deployed. The following hooks are being used:

* check whitespace: a hook to reject commits in case the whitespace
  doesn't match the rules for the Python codebase. Should be
  straightforward to re-implement from the current version. Open
  issue: do we check only the tip after each push, or do we check
  every commit in a changegroup?

* commit mails: we can leverage the notify extension for this

* buildbots: both the regular and the community build masters must be
  notified. Fortunately buildbot includes support for hg. I've also
  implemented this for Mercurial itself, so I don't expect problems
  here.
* check contributors: in the current setup, all changesets bear the
  username of committers, who must have signed the contributor
  agreement. In a DVCS, the committers are not necessarily the same
  people who push, so the identity of the pusher does not tell us
  whether the committer is a contributor. If we keep a list of
  registered contributors, a hook could check each committer against
  it.

hgwebdir
--------

A more or less stock hgwebdir installation should be set up. We might
want to come up with a style to match the Python website. It may also
be useful to build a quick extension to augment the URL rev parser so
that it can also take r[0-9]+ args and come up with the matching hg
revision.
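
The lookup behind such an extension could be sketched as follows. This is only an illustration of the r[0-9]+ idea, not an hgwebdir extension: the function name, the ``svn_to_hg`` dictionary, and the changeset ids in the usage lines are hypothetical, and how the mapping from svn revision numbers to hg changesets would be produced by the conversion is left open here.

```python
import re

# Matches arguments such as "r68121": the letter r followed only by digits.
SVN_REV = re.compile(r"r([0-9]+)")


def resolve_rev(arg, svn_to_hg):
    """Translate an svn-style revision argument into an hg revision.

    `svn_to_hg` is assumed to be a dict mapping svn revision numbers
    (ints) to hg changeset identifiers (strings). Arguments that do
    not look like r[0-9]+ are returned unchanged, so ordinary hg
    revision specifiers keep working.
    """
    match = SVN_REV.fullmatch(arg)
    if match is None:
        return arg
    svn_rev = int(match.group(1))
    try:
        return svn_to_hg[svn_rev]
    except KeyError:
        raise LookupError(
            "no hg changeset recorded for svn revision r%d" % svn_rev
        )


# Hypothetical mapping with made-up changeset ids, for illustration only.
example_map = {68121: "a1b2c3d4e5f6", 68122: "0f9e8d7c6b5a"}
print(resolve_rev("r68121", example_map))
print(resolve_rev("tip", example_map))
```

Passing non-matching arguments through unchanged keeps the sketch additive: existing URLs that use hg revision specifiers would behave as before.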