python-peps/pep-0420.txt

PEP: 420
Title: Implicit Namespace Packages
Version: $Revision$
Last-Modified: $Date$
Author: Eric V. Smith <eric@trueblade.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 19-Apr-2012
Python-Version: 3.3
Post-History:

Abstract
========

Namespace packages are a mechanism for splitting a single Python package
across multiple directories on disk.  In current Python versions, an algorithm
to compute the packages ``__path__`` must be formulated.  With the enhancement
proposed here, the import machinery itself will construct the list of
directories that make up the package.  This PEP builds upon previous work,
documented in PEP 382 and PEP 402.  Those PEPs have since been rejected in
favor of this one.  An implementation of this PEP is at [1]_.

Terminology
===========

Within this PEP:

 * "package" refers to Python packages as defined by Python's import
   statement.
 * "distribution" refers to separately installable sets of Python
   modules as stored in the Python package index, and installed by
   distutils or setuptools.
 * "vendor package" refers to groups of files installed by an
   operating system's packaging mechanism (e.g. Debian or Redhat
   packages install on Linux systems).
 * "regular package" refers to packages as they are implemented in
   Python 3.2 and earlier.
 * "portion" refers to a set of files in a single directory (possibly
   stored in a zip file) that contribute to a namespace package.
 * "legacy portion" refers to a portion that uses ``__path__``
   manipulation in order to implement namespace packages.

This PEP defines a new type of package, the "namespace package".

Namespace packages today
========================

Python currently provides ``pkgutil.extend_path`` to denote a package
as a namespace package.  The recommended way of using it is to put::

    from pkgutil import extend_path
    __path__ = extend_path(__path__, __name__)

in the package's ``__init__.py``.  Every distribution needs to provide
the same contents in its ``__init__.py``, so that ``extend_path`` is
invoked independent of which portion of the package gets imported
first.  As a consequence, the package's ``__init__.py`` cannot
practically define any names as it depends on the order of the package
fragments on ``sys.path`` to determine which portion is imported
first.  As a special feature, ``extend_path`` reads files named
``<packagename>.pkg`` which allows declaration of additional portions.

setuptools provides a similar function named
``pkg_resources.declare_namespace`` that is used in the form::

    import pkg_resources
    pkg_resources.declare_namespace(__name__)

In the portion's ``__init__.py``, no assignment to ``__path__`` is
necessary, as ``declare_namespace`` modifies the package ``__path__``
through ``sys.modules``.  As a special feature, ``declare_namespace``
also supports zip files, and registers the package name internally so
that future additions to ``sys.path`` by setuptools can properly add
additional portions to each package.

setuptools allows declaring namespace packages in a distribution's
``setup.py``, so that distribution developers don't need to put the
magic ``__path__`` modification into ``__init__.py`` themselves.

See PEP 402's "The Problem" section [2]_ for more details on the
motivation for namespace packages.  Note that PEP 402 has been
rejected, but the motivating use cases are still valid.

Rationale
=========

The current imperative approach to namespace packages has lead to
multiple slightly-incompatible mechanisms for providing namespace
packages.  For example, pkgutil supports ``*.pkg`` files; setuptools
doesn't.  Likewise, setuptools supports inspecting zip files, and
supports adding portions to its ``_namespace_packages`` variable,
whereas pkgutil doesn't.

Namespace packages are designed to support being split across multiple
directories (and hence found via multiple ``sys.path`` entries).  In
this configuration, it doesn't matter if multiple portions all provide
an ``__init__.py`` file, so long as each portion correctly initializes
the namespace package.  However, Linux distribution vendors (amongst
others) prefer to combine the separate portions and install them all
into the *same* filesystem directory.  This creates a potential for
conflict, as the portions are now attempting to provide the *same*
file on the target system - something that is not allowed by many
package managers.  Allowing implicit namespace packages means that the
requirement to provide an ``__init__.py`` file can be dropped
completely, and affected portions can be installed into a common
directory or split across multiple directories as distributions see
fit.

Specification
=============

Regular packages will continue to have an ``__init__.py`` and will
reside in a single directory.

Namespace packages cannot contain an ``__init__.py``.  As a
consequence, ``pkgutil.extend_path`` and
``pkg_resources.declare_namespace`` become obsolete for purposes of
namespace package creation.  There will be no marker file or directory
for specifying a namespace package.

During import processing, the import machinery will continue to
iterate over each directory in the parent path as it does in Python
3.2.  While looking for a module or package named "foo", for each
directory in the parent path:

 * If ``<directory>/foo/__init__.py`` is found, a regular package is
   imported and returned.

 * If not, but ``<directory>/foo.{py,pyc,so,pyd}`` is found, a module
   is imported and returned.

 * If not, but ``<directory>/foo`` is found and is a directory, it is
   recorded and the scan continues with the next directory in the
   parent path.

 * Otherwise the scan continues with the next directory in the parent
   path.

If the scan completes without returning a module or package, and at
least one directory was recorded, then a namespace package is created.
The new namespace package:

 * Has a ``__path__`` attribute set to an iterable of the path strings
   that were found and recorded during the scan.

 * Does not have a ``__file__`` attribute.

Note that if "import foo" is executed and "foo" is found as a
namespace package (using the above rules), then "foo" is immediately
created as a package.  The creation of the namespace package is not
deferred until a sub-level import occurs.

A namespace package is not fundamentally different from a regular
package.  It is just a different way of creating packages.  Once a
namespace package is created, there is no functional difference
between it and a regular package.

Dynamic path computation
------------------------

A namespace package's ``__path__`` will be recomputed if the value of
the parent path changes. In order for this feature to work, the parent
path must be modified in-place, not replaced with a new object. For
example, for top-level namespace packages, this will work::

    sys.path.append('new-dir')

But this will not::

    sys.path = sys.path + ['new-dir']

Impact on import finders and loaders
------------------------------------

PEP 302 defines "finders" that are called to search path elements.
These finders' ``find_module`` methods return either a "loader" object
or ``None``.

For a finder to contribute to namespace packages, it must implement a
new ``find_loader(fullname)`` method.  ``fullname`` has the same
meaning as for ``find_module``.  ``find_loader`` always returns a
2-tuple of ``(loader, <iterable-of-path-entries>)``.  ``loader`` may
be ``None``, in which case ``<iterable-of-path-entries>`` (which may
be empty) is added to the list of recorded path entries and path
searching continues.  If ``loader`` is not ``None``, it is immediately
used to load a module or regular package.

Even if ``loader`` is returned and is not ``None``,
``<iterable-of-path-entries>`` must still contain the path entries for
the package.  This allows code such as ``pkgutil.extend_path()`` to
compute path entries for packages that it does not load.

Note that multiple path entries per finder are allowed.  This is to
support the case where a finder discovers multiple namespace portions
for a given ``fullname``.  Many finders will support only a single
namespace package portion per ``find_loader`` call, in which case this
iterable will contain only a single string.

The import machinery will call ``find_loader`` if it exists, else fall
back to ``find_module``.  Legacy finders which implement
``find_module`` but not ``find_loader`` will be unable to contribute
portions to a namespace package.

The specification expands PEP 302 loaders to include an optional method called
``module_repr()`` which if present, is used to generate module object reprs.
See the section below for further details.

Differences between namespace packages and regular packages
-----------------------------------------------------------

Namespace packages and regular packages are very similar. The
differences are:

 * Portions of namespace packages need not all come from the same
   directory structure, or even from the same loader. Regular packages
   are self-contained: all parts live in the same directory hierarchy.

 * Namespace packages have no ``__file__`` attribute.

 * Namespace packages' ``__path__`` attribute is a read-only iterable
   of strings, which is automatically updated when the parent path is
   modified.

 * Namespace packages have no ``__init__.py`` module.

 * Namespace packages have a different type of object for their
   ``__loader__`` attribute.


Namespace packages in the standard library
------------------------------------------

It is possible, and this PEP explicitly allows, that parts of the
standard library be implemented as namespace packages.  When and if
any standard library packages become namespace packages is outside the
scope of this PEP.


Migrating from legacy namespace packages
----------------------------------------

As described above, prior to this PEP ``pkgutil.extend_path()`` was
used by legacy portions to create namespace packages.  Because it is
likely not practical for all existing portions of a namespace package
to be migrated to this PEP at once, ``extend_path()`` will be modified
to also recognize PEP 420 namespace packages.  This will allow some
portions of a namespace to be legacy portions while others are
migrated to PEP 420.  These hybrid namespace packages will not have
the dynamic path computation that normal namespace packages have,
since ``extend_path()`` never provided this functionality in the past.


Packaging Implications
======================

Multiple portions of a namespace package can be installed into the
same directory, or into separate directories.  For this section,
suppose there are two portions which define "foo.bar" and "foo.baz".
"foo" itself is a namespace package.

If these are installed in the same location, a single directory "foo"
would be in a directory that is on ``sys.path``.  Inside "foo" would
be two directories, "bar" and "baz".  If "foo.bar" is removed (perhaps
by an OS package manager), care must be taken not to remove the
"foo/baz" or "foo" directories.  Note that in this case "foo" will be
a namespace package (because it lacks an ``__init__.py``), even though
all of its portions are in the same directory.

Note that "foo.bar" and "foo.baz" can be installed into the same "foo"
directory because they will not have any files in common.

If the portions are installed in different locations, two different
"foo" directories would be in directories that are on ``sys.path``.
"foo/bar" would be in one of these sys.path entries, and "foo/baz"
would be in the other.  Upon removal of "foo.bar", the "foo/bar" and
corresponding "foo" directories can be completely removed.  But
"foo/baz" and its corresponding "foo" directory cannot be removed.

It is also possible to have the "foo.bar" portion installed in a
directory on ``sys.path``, and have the "foo.baz" portion provided in
a zip file, also on ``sys.path``.

Discussion
==========

At PyCon 2012, we had a discussion about namespace packages at which
PEP 382 and PEP 402 were rejected, to be replaced by this PEP [3]_.

There is no intention to remove support of regular packages.  If a
developer knows that her package will never be a portion of a
namespace package, then there is a performance advantage to it being a
regular package (with an ``__init__.py``).  Creation and loading of a
regular package can take place immediately when it is located along
the path.  With namespace packages, all entries in the path must be
scanned before the package is created.

Note that an ImportWarning will no longer be raised for a directory
lacking an ``__init__.py`` file.  Such a directory will now be
imported as a namespace package, whereas in prior Python versions an
ImportWarning would be raised.

Nick Coghlan presented a list of his objections to this proposal [4]_.
They are:

  1. Implicit package directories go against the Zen of Python.

  2. Implicit package directories pose awkward backwards compatibility
     challenges.

  3. Implicit package directories introduce ambiguity into filesystem
     layouts.

  4. Implicit package directories will permanently entrench current
     newbie-hostile behavior in ``__main__``.

Nick gave a detailed response [5]_, which is summarized here:

  1. The practicality of this PEP wins over other proposals and the
     status quo.

  2. Minor backward compatibility issues are okay, as long as they are
     properly documented.

  3. This will be addressed in PEP 395.

  4. This will also be addressed in PEP 395.

The inclusion of namespace packages in the standard library was
motivated by Martin v. Löwis, who wanted the ``encodings`` package to
become a namespace package [6]_.  While this PEP allows for standard
library packages to become namespaces, it defers a decision on
``encodings``.

``find_module`` versus ``find_loader``
--------------------------------------

An early draft of this PEP specified a change to the ``find_module``
method in order to support namespace packages.  It would be modified
to return a string in the case where a namespace package portion was
discovered.

However, this caused a problem with existing code outside of the
standard library which calls ``find_module``.  Because this code would
not be upgraded in concert with changes required by this PEP, it would
fail when it would receive unexpected return values from
``find_module``.  Because of this incompatibility, this PEP now
specifies that finders that want to provide namespace portions must
implement the ``find_loader`` method, described above.

The use case for supporting multiple portions per ``find_loader`` call
is given in [7]_.


Module reprs
============

Previously, module reprs were hard coded based on assumptions about a module's
``__file__`` attribute.  If this attribute existed and was a string, it was
assumed to be a file system path, and the module object's repr would include
this in its value.  The only exception was that PEP 302 reserved missing
``__file__`` attributes to built-in modules, and in CPython, this assumption
was baked into the module object's implementation.  Because of this
restriction, some module contained contrived ``__file__`` values that did not
reflect file system paths, and which could cause unexpected problems later
(e.g. ``os.path.join()`` on a non-path ``__file__`` would return gibberish).

This PEP relaxes this constraint, and leaves the setting of ``__file__`` to
the purview of the loader producing the module.  Loaders may opt to leave
``__file__`` unset if no file system path is appropriate.  Loaders may also
set additional reserved attributes on the module if useful.  This means that
the definitive way to determine the origin of a module is to check its
``__loader__`` attribute.

For example, namespace packages as described in this PEP will have no
``__file__`` attribute because no corresponding file exists.  In order to
provide flexibility and descriptiveness in the reprs of such modules, a new
optional protocol is added to PEP 302 loaders.  Loaders can implement a
``module_repr()`` method which takes a single argument, the module object.
This method should return the string to be used verbatim as the repr of the
module.  The rules for producing a module repr are now standardized as:

 * If the module has an ``__loader__`` and that loader has a ``module_repr()``
   method, call it with a single argument, which is the module object.  The
   value returned is used as the module's repr.
 * Exceptions from ``module_repr()`` are ignored, and the following steps
   are used instead.
 * If the module has an ``__file__`` attribute, this is used as part of the
   module's repr.
 * If the module has no ``__file__`` but does have an ``__loader__``, then the
   loader's repr is used as part of the module's repr.
 * Otherwise, just use the module's ``__name__`` in the repr.


References
==========

.. [1] PEP 420 branch (http://hg.python.org/features/pep-420)

.. [2] PEP 402's description of use cases for namespace packages
       (http://www.python.org/dev/peps/pep-0402/#the-problem)

.. [3] PyCon 2012 Namespace Package discussion outcome
       (http://mail.python.org/pipermail/import-sig/2012-March/000421.html)

.. [4] Nick Coghlan's objection to the lack of marker files or directories
       (http://mail.python.org/pipermail/import-sig/2012-March/000423.html)

.. [5] Nick Coghlan's response to his initial objections
       (http://mail.python.org/pipermail/import-sig/2012-April/000464.html)

.. [6] Martin v. Löwis's suggestion to make ``encodings`` a namespace
       package
       (http://mail.python.org/pipermail/import-sig/2012-May/000540.html)

.. [7] Use case for multiple portions per ``find_loader`` call
       (http://mail.python.org/pipermail/import-sig/2012-May/000585.html)

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: