PEP: 374
Title: Migrating from svn to Mercurial
Version: $Revision$
Last-Modified: $Date$
Author: Brett Cannon <brett@python.org>,
        Dirkjan Ochtman <dirkjan@ochtman.nl>
Status: Active
Type: Process
Content-Type: text/x-rst
Created: 07-Nov-2008
Post-History: 07-Nov-2008
              22-Jan-2009

.. warning::
   This PEP is in the draft stages and is still under active
   development in terms of the transition plan even though Hg is the
   chosen DVCS.


Motivation
==========

Python has been using a centralized version control system (VCS;
first CVS, now Subversion) for years to great effect. Having a master
copy of the official version of Python provides people with a single
place to always get the official Python source code. It has also
allowed for the storage of the history of the language, mostly for
help with development, but also for posterity. And of course the V in
VCS is very helpful when developing.

But a centralized version control system has its drawbacks. First and
foremost, in order to have the benefits of version control with
Python in a seamless fashion, one must be a "core developer" (i.e.
someone with commit privileges on the master copy of Python). People
who are not core developers but who wish to work with Python's
revision tree, e.g. anyone writing a patch for Python or creating a
custom version, do not have direct tool support for revisions. This
can be quite a limitation, since these non-core developers cannot
easily do basic tasks such as reverting changes to a previously
saved state, creating branches, publishing one's changes with full
revision history, etc. For non-core developers, the last safe tree
state is one the Python developers happen to set, and this prevents
safe development. This second-class citizenship is a hindrance to
people who wish to contribute to Python with a patch of any
complexity and want a way to incrementally save their progress to
make their development lives easier.

There is also the issue of having to be online to be able to commit
one's work. Because centralized VCSs keep a central copy that stores
all revisions, one must have Internet access in order for their
revisions to be stored; no Net, no commit. This can be annoying if
you happen to be traveling and lack Internet access. It also affects
anyone who wishes to contribute to Python over a poor Internet
connection, where each commit is time-consuming and expensive and it
might work out better to send all the work in a single step.

Another drawback to a centralized VCS is that a common use case is
for a developer to revise patches in response to review comments.
This is more difficult with a centralized model because there's no
place to contain intermediate work. It's either all checked in or
none of it is checked in. With a centralized VCS, it is also very
difficult to track changes to the trunk as they are committed while
you are working on your feature or bug fix branch. This increases
the risk that such branches will grow stale or outdated, or that
merging them into the trunk will generate too many conflicts to be
easily resolved.

Lastly, there is the issue of maintenance of Python. At any one time
there is at least one major version of Python under development (at
the time of this writing there are two). For each major version of
Python under development there is at least the maintenance version
of the last minor version and the in-development minor version (e.g.
with 2.6 just released, that means that both 2.6 and 2.7 are being
worked on). Once a release is done, a branch is created between the
code bases where changes in one version do not (but could) belong in
the other version. As of right now there is no natural support for
this branch in time in central VCSs; you must use tools that
simulate the branching. Tracking merges is similarly painful for
developers, as revisions often need to be merged between four active
branches (e.g. 2.6 maintenance, 3.0 maintenance, 2.7 development,
3.1 development). VCSs such as Subversion handle this case only
through arcane third-party tools.

Distributed VCSs (DVCSs) solve all of these problems. While one can
keep a master copy of a revision tree, anyone is free to copy that
tree for their own use. This gives everyone the power to commit
changes to their copy, online or offline. It also more naturally
ties into the idea of branching in the history of a revision tree
for maintenance and the development of new features bound for
Python. DVCSs also provide a great many additional features that
centralized VCSs don't or can't provide.

This PEP explores the possibility of changing Python's use of Subversion
to any of the currently popular DVCSs, in order to gain
the benefits outlined above. This PEP does not guarantee that a switch
to a DVCS will occur at the conclusion of this PEP. It is quite
possible that no clear winner will be found and that svn will continue
to be used. If this happens, this PEP will be revisited and revised in
the future as the state of DVCSs evolves.


Choice of DVCS
==============

This PEP included a thorough investigation of three DVCSs as options for
migration, with substantial work from Barry Warsaw, Alexandre Vassalotti and
Stephen Turnbull. That comparison has been moved to `DvcsComparison`_, and
this PEP now includes more information on the migration to Mercurial.

.. _DvcsComparison: http://wiki.python.org/moin/DvcsComparison

At PyCon 2009, a `decision
<http://mail.python.org/pipermail/python-dev/2009-March/087931.html>`_
was made to go with Mercurial.

The choice to go with Mercurial was made for four important reasons:

* According to a small survey, Python developers are more interested in
  using Mercurial than in Bazaar or Git.

* Mercurial is written in Python, which is congruent with the python-dev
  tendency to 'eat their own dogfood'.

* Mercurial is significantly faster than bzr (it's slower than git, though
  by a much smaller difference).

* Mercurial is easier to learn for SVN users than bzr.

Although all of these points can be debated, in the end a pronouncement from
the BDFL was made to go with hg as the chosen DVCS for the Python project.


Transition Plan
===============


Introduction
------------

To make the most of hg, I (Dirkjan) want to make a high-fidelity conversion,
such that (a) as much of the svn metadata as possible is retained, and (b) all
metadata is converted to formats that are common in Mercurial. This way, tools
written for Mercurial can be optimally used. In order to do this, I want to use
the `hgsubversion <http://bitbucket.org/durin42/hgsubversion>`_ software to do
an initial conversion. This hg extension is focused on providing high-quality
conversion from Subversion to Mercurial for use in two-way correspondence,
meaning it doesn't throw away as much available metadata as other solutions.

Such a conversion also seems like a good time to reconsider the contents of
the repository and determine if some things are still valuable. In this spirit,
in the following sections I propose discarding some of the older metadata.

Branch strategy
---------------

Mercurial has two basic ways of using branches: cloned branches, where each
branch is kept in a separate directory, and named branches, where each revision
keeps metadata to note on which branch it belongs. The former makes it easier
to distinguish branches, at the expense of requiring more disk space on the
client. The latter makes it a little easier to switch between branches, but
often has somewhat unintuitive results for people (though this has been
getting better in recent versions of Mercurial).

For Python, I think it would work well to have cloned branches and keep most
things separate. This is predicated on the assumption that most people work on
just one (or maybe two) branches at a time. Branches can be exposed separately,
though I would advocate merging old (and tagged!) branches into mainline so
that people can easily revert to older releases. At what age of a release this
should be done can be debated (a natural point might be when the branch
becomes unsupported, e.g. 2.4 at the release of 2.6).

Converting branches
-------------------

There are quite a lot of branches in SVN's branches directory. I propose to
clean this up a bit, by employing the following strategy:

* Keep all release (maintenance) branches
* Discard branches that haven't been touched in 18 months, unless someone
  indicates there's still interest in such a branch
* Keep branches that have been touched in the last 18 months, unless someone
  indicates the branch can be deprecated
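
The three rules above can be sketched as a single decision function. This is
only an illustration and not part of the migration tools: the function name,
its parameters, and the choice of 548 days as "18 months" are all assumptions
made for the example::

    from datetime import datetime, timedelta

    # Assumption: "18 months" is approximated as 548 days.
    STALE_AFTER = timedelta(days=548)

    def keep_branch(name, last_touched, now, release_branches,
                    keep_requests=(), deprecated=()):
        """Return True if the branch should be converted.

        keep_requests holds stale branches someone still cares about;
        deprecated holds recent branches someone has said can go.
        """
        # Rule 1: release (maintenance) branches are always kept.
        if name in release_branches:
            return True
        recently_touched = (now - last_touched) <= STALE_AFTER
        if recently_touched:
            # Rule 3: keep recent branches unless flagged as deprecated.
            return name not in deprecated
        # Rule 2: discard stale branches unless someone asked to keep them.
        return name in keep_requests

Given the last commit date of each branch, calling ``keep_branch`` once per
branch would yield the list of branches to hand to the conversion tool.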

Converting tags
---------------

The SVN tags directory contains a lot of old stuff. Some of these are not, in
fact, full tags, but contain only a smaller subset of the repository. I think
we should keep all release tags, and consider other tags for inclusion based
on requests from the developer community. I'd like to consider unifying the
release tag naming scheme to make some things more consistent, if people feel
that won't create too many problems.

Author map
----------

In order to provide user names the way they are common in hg (in the 'First Last
<user@example.org>' format), we need an author map to map cvs and svn user
names to real names and their email addresses. I have a complete version of such
a map in my `migration tools repository`_. The email addresses in it might be
out of date; that's bound to happen, although it would be nice to try and
have as many people as possible review it for addresses that are out of date.
The current version also still seems to contain some encoding problems.

.. _migration tools repository: http://hg.xavamedia.nl/cpython/pymigr/
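
As a rough sketch of how such a map might be consumed, the following assumes
a plain-text file with one ``svnname = First Last <email>`` entry per line and
``#`` comments. That layout, the function names, and the sample user are
assumptions for illustration; the file in the migration tools repository may
be organized differently::

    def parse_author_map(text):
        """Parse 'svnname = First Last <email>' lines into a dict."""
        authors = {}
        for raw in text.splitlines():
            line = raw.strip()
            # Skip blank lines and comments.
            if not line or line.startswith("#"):
                continue
            svn_name, sep, hg_name = line.partition("=")
            if not sep:
                raise ValueError("malformed author map line: %r" % raw)
            authors[svn_name.strip()] = hg_name.strip()
        return authors

    def hg_author(authors, svn_name):
        """Return the hg-style author, or the bare svn name if unmapped."""
        return authors.get(svn_name, svn_name)

    # Hypothetical entry, not taken from the real map.
    sample = "# comment\njdoe = Jane Doe <jane@example.org>\n"
    authors = parse_author_map(sample)

Falling back to the bare svn name for unmapped users keeps the conversion
running, at the cost of leaving a few old-style names in the history.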

Generating .hgignore
--------------------

The .hgignore file can be used in Mercurial repositories to help ignore files
that are not eligible for version control. It does this by employing several
possible forms of pattern matching. The current Python repository already
includes a rudimentary .hgignore file to help with using the hg mirrors.

It might be useful to have the .hgignore be generated automatically from
svn:ignore properties. This would make sure all historic revisions also have
useful ignore information (though one could argue ignoring isn't really
relevant to just checking out an old revision).
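
A minimal sketch of such a generator is shown below. It assumes the
svn:ignore values have already been collected into a dictionary that maps
each directory (relative to the repository root, with an empty string for the
root) to its property text; the function name and the sample data are
hypothetical. Note that svn:ignore patterns apply to a single directory while
glob patterns in .hgignore are not anchored to the root, so prefixing the
directory name is only an approximation::

    def hgignore_from_svn_ignore(props):
        """Build .hgignore text from {directory: svn:ignore value}."""
        lines = ["syntax: glob"]
        for directory in sorted(props):
            for pattern in props[directory].splitlines():
                pattern = pattern.strip()
                if not pattern:
                    continue
                if directory in ("", "."):
                    # Property set on the repository root.
                    lines.append(pattern)
                else:
                    lines.append(directory.rstrip("/") + "/" + pattern)
        return "\n".join(lines) + "\n"

    # Hypothetical property values, for illustration only.
    props = {"": "*.pyc\nbuild\n", "Doc": "build\n\ntools/sphinx\n"}
    text = hgignore_from_svn_ignore(props)

To give historic revisions their own ignore information, the same function
would have to be run against the properties of each converted revision rather
than just the latest one.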

Revlog reordering
-----------------

As an optional optimization technique, we should consider trying a reordering
pass on the revlogs (internal Mercurial files) resulting from the conversion.
In some cases this results in dramatic decreases in on-disk repository size.

Other repositories
------------------

Richard Tew has indicated that he'd like the Stackless repository to also be
converted. What other projects in the svn.python.org repository should be
converted? Do we want to convert the peps repository? distutils? others?


Infrastructure
==============

hg-ssh
------

Developers should access the repositories through ssh, similar to the current
setup. Public keys can be used to grant people access to a shared hg@ account.
A hgwebdir instance should also be set up for easy browsing and read-only
access. Some facility for sandboxes/incubator repositories could be discussed.

Hooks
-----

A number of hooks are currently in use. The hg equivalents for these should be
developed and deployed. The following hooks are being used:

* check whitespace: a hook to reject commits in case the whitespace doesn't
  match the rules for the Python codebase. Should be straightforward to
  re-implement from the current version. Open issue: do we check only the tip
  after each push, or do we check every commit in a changegroup?

* commit mails: we can leverage the notify extension for this

* buildbots: both the regular and the community build masters must be notified.
  Fortunately buildbot includes support for hg. I've also implemented this for
  Mercurial itself, so I don't expect problems here.

* check contributors: in the current setup, all changesets bear the username of
  committers, who must have signed the contributor agreement. In a DVCS, the
  people who commit are not necessarily the people who push, so the identity
  of the pusher no longer tells us whether the committer is a contributor. A
  hook could still check each committer against a list of registered
  contributors, if we keep such a list.
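
The core checks behind the whitespace and contributor hooks could look
roughly like the sketch below. It deliberately leaves out the Mercurial hook
plumbing and only shows the per-file and per-changeset tests. The function
names are hypothetical, and the two whitespace rules shown (no tabs in
indentation, no trailing whitespace) are stand-ins for whatever the existing
hook actually enforces::

    import re

    def whitespace_problems(filename, text):
        """Return (line number, reason) pairs for a Python source file."""
        problems = []
        if not filename.endswith(".py"):
            return problems
        for number, line in enumerate(text.splitlines(), 1):
            indent = line[:len(line) - len(line.lstrip())]
            if "\t" in indent:
                problems.append((number, "tab in indentation"))
            if line != line.rstrip():
                problems.append((number, "trailing whitespace"))
        return problems

    _EMAIL = re.compile(r"<([^<>]+)>")

    def is_registered(author, contributors):
        """Check a 'First Last <email>' author against known emails.

        contributors is assumed to be a set of lower-cased addresses.
        """
        match = _EMAIL.search(author)
        return bool(match) and match.group(1).lower() in contributors

Whether these checks run on the tip only or on every changeset in a
changegroup is the open issue noted above; the functions themselves do not
depend on that choice.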

hgwebdir
--------

A more or less stock hgwebdir installation should be set up. We might want to
come up with a style to match the Python website. It may also be useful to
build a quick extension to augment the URL rev parser so that it can also take
``r[0-9]+`` args and come up with the matching hg revision.
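
The lookup such an extension needs can be sketched as follows. This is not
hgweb code: it assumes a mapping from svn revision numbers to hg changeset
identifiers has been prepared from the conversion metadata, and the function
name and the identifier used in the example are placeholders::

    import re

    _SVN_REV = re.compile(r"r([0-9]+)\Z")

    def resolve_rev(arg, svn_to_hg):
        """Translate 'r12345' to an hg revision; pass anything else through."""
        match = _SVN_REV.match(arg)
        if match is None:
            # Not an svn-style revision; let hg interpret it as usual.
            return arg
        svn_rev = int(match.group(1))
        try:
            return svn_to_hg[svn_rev]
        except KeyError:
            raise LookupError("no hg changeset for svn revision %d" % svn_rev)

    # Placeholder mapping; real identifiers would come from the conversion.
    svn_to_hg = {68000: "0123456789ab"}
    node = resolve_rev("r68000", svn_to_hg)

Passing unrecognized arguments through unchanged means ordinary hg
identifiers such as ``tip`` keep working in URLs.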