PEP: 438
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <holger@merlinux.eu>, Carl Meyer <carl@oddbird.net>
Discussions-To: catalog-sig@python.org
Status: Draft
Type: Process
Content-Type: text/x-rst
Created: 15-Mar-2013
Post-History:


Abstract
========

This PEP proposes a backward-compatible two-phase transition process
to make installing from the pypi.python.org (PyPI) package index
faster, simpler and more robust. To ease the transition and minimize
client-side friction, **no changes to distutils or existing
installation tools are required in order to benefit from the first
transition phase, which will result in faster, more reliable installs
for most existing packages**.

The first transition phase implements an easy and explicit means for a
package maintainer to control which release file links are served to
present-day installation tools. The first phase also includes the
implementation of analysis tools for present-day packages, to support
communication with package maintainers and the automated setting of
default modes for controlling release file links. The first phase
will also default newly-registered projects on PyPI to only serve
links to release files that were uploaded to PyPI.

The second transition phase concerns end-user installation tools,
which shall by default install only release files that are hosted on
PyPI, tell the user if external release files exist, and offer a
choice to automatically use those external files.


Rationale
=========

.. _history:

History and motivations for external hosting
--------------------------------------------

When PyPI went online, it offered release registration but had no
facility to host release files itself. When hosting was added, no
automated downloading tool existed yet. When Philip Eby implemented
automated downloading (through setuptools), he made the choice to
allow people to use download hosts of their choice. Discovery of
externally-hosted packages was implemented as follows:

#. The PyPI ``simple/`` index for a package contains all links
   scraped from that package's long_description metadata for any
   release. Links in the "Download-URL" and "Home-page" metadata
   fields are given ``rel=download`` and ``rel=homepage`` attributes,
   respectively.

#. Any of these links whose target is a file whose name appears to be
   in the form of an installable source or binary distribution, with
   name in the form "packagename-version.ARCHIVEEXT", is considered a
   potential installation candidate by installation tools.

#. Similarly, any link suffixed with an "#egg=packagename-version"
   fragment is considered an installation candidate.

#. Additionally, the ``rel=homepage`` and ``rel=download`` links are
   crawled by installation tools and, if HTML, are themselves scraped
   for release-file links in the above formats.

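The two link-matching rules in the list above can be sketched roughly
as follows. This is a simplified illustration, not the algorithm of
any particular installer: the function name ``is_install_candidate``
and the set of archive extensions are assumptions made for this
example, and real tools also normalize names and parse versions.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of archive extensions; real tools keep their own lists.
ARCHIVE_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tgz", ".zip", ".egg")


def is_install_candidate(url, package):
    """Return True if ``url`` looks like a release file for ``package``."""
    parts = urlsplit(url)
    # Rule: an "#egg=packagename-version" fragment marks a candidate.
    if parts.fragment.startswith("egg=" + package + "-"):
        return True
    # Rule: the file name has the form "packagename-version.ARCHIVEEXT".
    filename = posixpath.basename(parts.path)
    if not filename.startswith(package + "-"):
        return False
    return filename.endswith(ARCHIVE_EXTENSIONS)
```

Under these assumptions, a link such as
``http://example.com/dl/foo-1.0.tar.gz`` is a candidate for ``foo``,
while a link to an HTML page is not; such a page only matters if it
carries ``rel=homepage`` or ``rel=download`` and is therefore crawled.
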
Today, most packages released on PyPI host their release files on
PyPI, but a small percentage (XXX need updated data) rely on external
hosting.

There are many reasons [2]_ why people have chosen external
hosting. To cite just a few:

- release processes and scripts have been developed already and upload
  to external sites

- it takes too long to upload large files from some places in the
  world

- export restrictions, e.g. for crypto-related software

- company policies which require offering open source packages through
  the company's own sites

- problems with integrating uploading to PyPI into one's release
  process (because of release policies)

- desiring download statistics different from those maintained by PyPI

- perceived poor reliability of PyPI

- not being aware that PyPI offers file hosting

Irrespective of the present-day validity of these reasons, there is
clearly a history behind why people choose to host files externally,
and for some time it was even the only way to do things. This PEP
takes the position that there are at least some valid reasons for
external hosting.

Problem
-------

**Today, Python package installers (pip, easy_install, buildout, and
others) often need to query many non-PyPI URLs even if there are no
externally hosted files**. Apart from querying pypi.python.org's
simple index pages, an installer also crawls all homepages and
download pages ever specified with any release of a package. The need
for installers to crawl external sites slows down installation and
makes for a brittle and unreliable installation process. Those sites
and packages also don't take part in the :pep:`381` mirroring
infrastructure, further decreasing reliability and speed of automated
installation processes around the world.

Most packages are hosted directly on pypi.python.org [1]_. Even for
these packages, installers still crawl their homepage and
download-url, if specified. Many package uploaders are not aware that
specifying the "homepage" or "download-url" in their package metadata
will needlessly slow down the installation process for all users.

Relying on third-party sites also opens up more attack vectors for
injecting malicious packages into sites using automated installs. A
simple attack might just involve getting hold of an old, now-unused
homepage domain and placing malicious packages there. Moreover,
performing a man-in-the-middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages into
the installation site. As many homepages and download locations are
using HTTP and not HTTPS, such attacks are not hard to launch. Such
MITM attacks can easily happen even for packages which never intended
to host files externally, as their homepages are contacted by
installers anyway.

There is currently no way for package maintainers to avoid
external-link crawling, other than removing all homepage/download url
metadata for all historic releases. While a script [3]_ has been
written to perform this action, it is not a good general solution
because it removes useful metadata from PyPI releases.

Even if the sites referenced by "Homepage" and "Download-URL" links
were not scraped for further links, there is no obvious way under the
current system for a package owner to link to an installable file from
a long_description metadata field (which is shown as package
documentation on ``/pypi/PKG``) without installation tools
automatically considering that file a candidate for installation.
Conversely, there is no way to explicitly register multiple external
release files without putting them in metadata fields.


Goals
-----

These are the goals to be achieved by implementation of this PEP:

* Package owners should be able to explicitly control which files are
  presented by PyPI to installer tools as installation
  candidates. Installation should not be slowed down or made less
  reliable by extensive and unnecessary crawling of links that package
  owners did not explicitly nominate as installation files.

* It should remain possible for package owners to choose to host their
  release files on their own hosting, external to PyPI. It should be
  easy for a user to request the installation of such releases using
  automated installer tools.

* Automated installer tools should not install externally-hosted
  packages **by default**, but only when explicitly authorized to do
  so by the user. When tools refuse to install such a package by
  default, they should tell the user exactly which external link(s)
  they would need to follow, and what option(s) the user can provide
  to authorize the tool to follow those links. PyPI should provide all
  necessary metadata for installer tools to implement this easily and
  within a single request/reply interaction.

* Migration from the status quo to the above points should be gradual
  and minimize breakage. This includes tooling that makes it easy for
  package owners with an existing release process that uploads to
  non-PyPI hosting to also upload those release files to PyPI.


Solution / two transition phases
================================

The first transition phase introduces a "hosting-mode" field for each
project on PyPI, allowing package owners explicit control of which
release file links are served to present-day installation tools in the
machine-readable ``simple/`` index. Once individual early adopters
have successfully switched hosting modes, the first transition phase
will set a default hosting mode for existing packages, based on
automated analysis. **Maintainers will be notified one month ahead of
any such automated change**. At completion of the first transition
phase, **all present-day existing release and installation processes
and tools are expected to continue working**. Any remaining errors or
problems are expected to only relate to installation of individual
packages and can be easily corrected by package maintainers, or by
PyPI admins if maintainers are not reachable.

Also in the first phase, each link served in the ``simple/`` index
will be explicitly marked as ``rel="internal"`` (hosted by the index
itself) or ``rel="external"`` (linking to an external site that is not
part of the index).

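As an illustration of how a client might read these markers, the
following sketch uses the standard library ``html.parser`` module to
group the links of an index page by their ``rel`` value. The class
name ``RelLinkParser`` and the sample page (including its URLs) are
hypothetical; this PEP does not prescribe the exact markup of the
page beyond the ``rel`` attributes.

```python
from html.parser import HTMLParser


class RelLinkParser(HTMLParser):
    """Collect the href of every <a> tag, grouped by its rel attribute."""

    def __init__(self):
        super().__init__()
        self.links = {"internal": [], "external": []}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel")
        href = attributes.get("href")
        # Links with any other rel value (or none) are ignored here.
        if href and rel in self.links:
            self.links[rel].append(href)


# Hypothetical index page for a project named "foo".
PAGE = """
<a rel="internal" href="../../packages/foo-1.0.tar.gz">foo-1.0.tar.gz</a>
<a rel="external" href="http://example.com/dl/foo-1.1.tar.gz">foo-1.1</a>
<a rel="homepage" href="http://example.com/">home</a>
"""

parser = RelLinkParser()
parser.feed(PAGE)
```

After ``feed()``, ``parser.links`` holds one internal and one external
link; the ``rel="homepage"`` link is left out of both groups.
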
In the second transition phase, PyPI client installation tools shall
be updated to install only ``rel="internal"`` packages by default,
unless a user specifies option(s) to permit installing from external
links.

Maintainers of packages which currently host release files on non-PyPI
sites shall receive instructions and tools to ease "re-hosting" of
their historic and future package release files. This re-hosting tool
MUST be available before automated hosting-mode changes are announced
to package maintainers.


Implementation
==============

Hosting modes
-------------

The foundation of the first transition phase is the introduction of
three "modes" of PyPI hosting for a package, affecting which links are
generated for the ``simple/`` index. These modes are implemented via
changes to the algorithm for generating the machine-readable
``simple/`` index, and require no changes to installation tools.

The modes are:

- ``pypi-scrape-crawl``: no change from the current situation of
  generating machine-readable links for installation tools, as
  outlined in the history_.

- ``pypi-scrape``: for a package in this mode, links to be added to
  the ``simple/`` index are still scraped from package
  metadata. However, the "Home-page" and "Download-URL" links are
  given ``rel=ext-homepage`` and ``rel=ext-download`` attributes
  instead of ``rel=homepage`` and ``rel=download``. The effect of this
  (with no change in installation tools necessary) is that these links
  will not be followed and scraped for further candidate links by
  present-day installation tools: only installable files directly
  hosted from PyPI or linked directly from PyPI metadata will be
  considered for installation. Installation tools MAY evolve to offer
  an option to use the new rel-attribution to crawl external pages but
  MUST NOT default to it.

- ``pypi-explicit``: for a package in this mode, only links to release
  files uploaded to PyPI, and external links to release files
  explicitly nominated by the package owner (via a new interface
  exposed by PyPI) will be added to the ``simple/`` index.

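The effect of the three modes on index generation can be sketched as
below. This is only an illustration under stated assumptions: the
``Project`` record, its field names, the function name
``simple_index_links`` and the sample URLs are all hypothetical, and
the sketch assumes that owner-nominated links are served only in
``pypi-explicit`` mode, which this PEP does not spell out.

```python
from collections import namedtuple

# Hypothetical record of what PyPI knows about a project.
Project = namedtuple(
    "Project",
    "uploaded_files scraped_links explicit_links home_page download_url",
)


def simple_index_links(project, mode):
    """Return (href, rel) pairs for the simple/ index under ``mode``."""
    # Files uploaded to PyPI are served in every mode.
    links = [(href, "internal") for href in project.uploaded_files]
    if mode == "pypi-explicit":
        # Only external links nominated by the owner are added.
        links += [(href, "external") for href in project.explicit_links]
        return links
    if mode not in ("pypi-scrape", "pypi-scrape-crawl"):
        raise ValueError("unknown hosting mode: %r" % (mode,))
    # Both scrape modes serve links scraped from package metadata.
    links += [(href, "external") for href in project.scraped_links]
    # Only pypi-scrape-crawl keeps the rel values that tools crawl.
    prefix = "" if mode == "pypi-scrape-crawl" else "ext-"
    if project.home_page:
        links.append((project.home_page, prefix + "homepage"))
    if project.download_url:
        links.append((project.download_url, prefix + "download"))
    return links


# Hypothetical example project.
foo = Project(
    uploaded_files=["https://pypi.example/packages/foo-1.0.tar.gz"],
    scraped_links=["http://example.com/foo-1.1.tar.gz"],
    explicit_links=["http://example.com/foo-1.2.tar.gz"],
    home_page="http://example.com/",
    download_url=None,
)
```

For ``foo``, ``pypi-scrape`` yields the homepage link with
``ext-homepage``, so present-day tools will not crawl it, whereas
``pypi-explicit`` yields only the uploaded file and the nominated
external file.
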
Thus the hope is that eventually all projects on PyPI can be migrated
to the ``pypi-explicit`` mode, while preserving the ability to install
release files hosted externally via installer tools. Deprecation of
hosting modes to eventually only allow the ``pypi-explicit`` mode is
NOT REGULATED by this PEP but is expected to become feasible some time
after successful implementation of the transition phases described in
this PEP. It is expected that deprecation requires **a new process to
deal with abandoned packages**, because the maintainers of some
still-popular packages are unreachable.


First transition phase (PyPI)
-----------------------------

The proposed solution consists of multiple implementation and
communication steps:

#. Implement in PyPI the three modes described above, with an
   interface for package owners to select the mode for each package
   and register explicit external file URLs.

#. For packages in all modes, label all links in the ``simple/`` index
   with ``rel="internal"`` or ``rel="external"``, to make it easier
   for client tools to distinguish the types of links in the second
   transition phase.

#. Default all newly-registered packages to ``pypi-explicit`` mode
   (package owners can still switch to the other modes as desired).

#. Determine (via an automated analysis tool) which packages have all
   installable files available on PyPI itself (group A), which have
   all installable files linked directly from PyPI metadata (group B),
   and which have installable versions available that are linked only
   from external homepage/download HTML pages (group C).

#. Send mail to maintainers of projects in group A that their project
   will be automatically configured to ``pypi-explicit`` mode in one
   month, and similarly to maintainers of projects in group B that
   their project will be automatically configured to ``pypi-scrape``
   mode. Inform them that this change is not expected to affect
   installability of their project at all, but will result in faster
   and safer installs for their users. Encourage them to set this
   mode themselves sooner to benefit their users.

#. Send mail to maintainers of packages in group C that their package
   hosting mode is ``pypi-scrape-crawl``, list the URLs which
   currently are crawled, and suggest that they either re-host their
   packages directly on PyPI and switch to ``pypi-explicit``, or at
   least provide direct links to release files in PyPI metadata and
   switch to ``pypi-scrape``. Provide instructions and tools to help
   with these transitions.


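The grouping used in the analysis and notification steps above might
be computed along these lines. This is a sketch only: the function
name ``classify_project`` and its inputs (sets of installable version
strings per source) are assumptions for illustration, and group B is
read here as "every version is available either on PyPI or via a
direct metadata link".

```python
# Hosting mode announced for each group in the steps above.
# Group C projects remain in pypi-scrape-crawl, the current behaviour.
MODE_FOR_GROUP = {
    "A": "pypi-explicit",
    "B": "pypi-scrape",
    "C": "pypi-scrape-crawl",
}


def classify_project(on_pypi, direct, crawled):
    """Return the analysis group ("A", "B" or "C") for one project.

    Each argument is a set of version strings installable from that
    source: files hosted on PyPI, files linked directly from PyPI
    metadata, and files found by crawling homepage/download pages.
    """
    all_versions = on_pypi | direct | crawled
    if all_versions <= on_pypi:
        return "A"
    if all_versions <= (on_pypi | direct):
        return "B"
    # At least one version is reachable only by crawling.
    return "C"
```

For example, a project whose version ``2.0`` can only be found by
crawling its homepage falls into group C and keeps the
``pypi-scrape-crawl`` mode until its maintainer acts.

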
Second transition phase (installer tools)
-----------------------------------------

For the second transition phase, maintainers of installation tools are
asked to release two updates.

The first update shall warn clearly when externally-hosted release
files (that is, files whose link is ``rel="external"``) are selected
for download, state exactly which projects and URLs are affected, and
warn that in future versions externally-hosted downloads will be
disabled by default.

The second update should change the default mode to allow only
installation of ``rel="internal"`` package files, and allow
installation of externally-hosted packages only when the user supplies
an option (ideally an option specifying exactly which external domains
are to be trusted as download sources). When download of an
externally-hosted package is disallowed, the user should be notified,
with instructions for how to make the install succeed and a warning
about the implication (that a file will be downloaded from a site that
is not part of the package index).


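A minimal sketch of the default behaviour after the second update
could look like this. The function names ``select_links`` and
``describe_blocked`` and the ``allowed_hosts`` parameter are
hypothetical; they stand in for whatever option an installation tool
chooses to offer and do not correspond to an existing command-line
flag.

```python
from urllib.parse import urlsplit


def select_links(links, allowed_hosts=()):
    """Split (href, rel) pairs into usable and blocked download links.

    ``allowed_hosts`` models a hypothetical user option naming the
    external hosts that are trusted as download sources.
    """
    usable, blocked = [], []
    for href, rel in links:
        if rel == "internal":
            usable.append(href)
        elif rel == "external":
            if urlsplit(href).hostname in allowed_hosts:
                usable.append(href)
            else:
                blocked.append(href)
        # Links with other rel values are not download candidates here.
    return usable, blocked


def describe_blocked(project, blocked):
    """Build the notice shown when external links were skipped."""
    hosts = sorted({urlsplit(href).hostname or "" for href in blocked})
    lines = ["%s: skipped externally hosted files:" % project]
    lines += ["  " + href for href in blocked]
    lines.append("To install them, allow these hosts: " + ", ".join(hosts))
    lines.append("These files are not hosted on the package index.")
    return "\n".join(lines)
```

With no hosts allowed, only ``rel="internal"`` links are returned as
usable and the notice names each skipped URL; passing
``allowed_hosts=("example.com",)`` would admit external files served
from that host.

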
Open Questions / tasks
======================

- Should we introduce some form of PyPI API versioning in this PEP?
  (It might complicate matters and delay the implementation, but is
  often seen as good practice.)

- Do another round of discussions with installation tool authors and
  see about incorporating their feedback. There is one known issue in
  particular from Philip J. Eby, who considers a host-based pattern
  matching algorithm preferable to interpreting "rel" attributes.


References
==========

.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted,
   http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html
   (XXX need to update this data for all easy_install-supported formats)

.. [2] Marc-Andre Lemburg, reasons for external hosting,
   http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for
   all releases,
   http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html


Acknowledgments
===============

Philip Eby for precise information and the basic ideas to implement
the transition via server-side changes only.

Donald Stufft for pushing away from external hosting and offering to
implement both a Pull Request for the necessary PyPI changes and the
analysis tool to drive the transition phase 1.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for
thinking through issues regarding getting rid of "external hosting".


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: