New Import Hooks PEP

Just van Rossum 2002-12-20 13:07:24 +00:00
parent 1ce5ccea6e
commit 650c382d0e
1 changed files with 467 additions and 0 deletions

pep-0302.txt Normal file

@@ -0,0 +1,467 @@
PEP: 302
Title: New Import Hooks
Version: $Revision$
Last-Modified: $Date$
Author: Just van Rossum <just@letterror.com>,
Paul Moore <gustav@morpheus.demon.co.uk>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 19-Dec-2002
Python-Version: 2.3
Post-History: 19-Dec-2002

Abstract

This PEP proposes to add a new set of import hooks that offer better
customization of the Python import mechanism. Contrary to the
current __import__ hook, a new-style hook can be injected into the
existing scheme, allowing for a finer grained control of how modules
are found and how they are loaded.

Motivation

The only way to customize the import mechanism is currently to
override the builtin __import__ function. However, overriding
__import__ has many problems. To begin with:
- An __import__ replacement needs to *fully* reimplement the entire
import mechanism, or call the original __import__ before or after
the custom code.
- It has very complex semantics and responsibilities.
- __import__ gets called even for modules that are already in
sys.modules, which is almost never what you want, unless you're
writing some sort of monitoring tool.
The situation gets worse when you need to extend the import
mechanism from C: it's currently impossible, apart from hacking
Python's import.c or reimplementing much of import.c from scratch.
There is a fairly long history of tools written in Python that allow
extending the import mechanism in various ways, based on the
__import__ hook. The Standard Library includes two such tools:
ihooks.py (by GvR) and imputil.py (Greg Stein), but perhaps the most
famous is iu.py by Gordon McMillan, available as part of his
Installer [1] package. Their usefulness is somewhat limited because
they are written in Python; bootstrapping issues need to be worked
around as you can't load the module containing the hook with the
hook itself. So if you want the entire Standard Library to be
loadable from an import hook, the hook must be written in C.

Use cases

This section lists several existing applications that depend on
import hooks. Among these, a lot of duplicate work was done that
could have been saved if there had been a more flexible import hook
at the time. This PEP should make life a lot easier for similar
projects in the future.
Extending the import mechanism is needed when you want to load
modules that are stored in a non-standard way. Examples include
modules that are bundled together in an archive; byte code that is
not stored in a pyc formatted file; modules that are loaded from a
database over a network.
The work on this PEP was partly triggered by the implementation of
PEP 273 [2], which adds imports from Zip archives as a builtin
feature to Python. While the PEP itself was widely accepted as a
must-have feature, the implementation left a few things to be desired.
For one thing it went to great lengths to integrate itself with
import.c, adding lots of code that was either specific for Zip file
imports or *not* specific to Zip imports, yet was not generally
useful (or even desirable) either. Yet the PEP 273 implementation
can hardly be blamed for this: it is simply extremely hard to do,
given the current state of import.c.
Packaging applications for end users is a typical use case for
import hooks, if not *the* typical use case. Distributing lots of
source or pyc files around is not always appropriate (let alone a
separate Python installation), so there is a frequent desire to
package all needed modules in a single file. So frequent in fact
that multiple solutions have been implemented over the years.
The oldest one is included with the Python source code: Freeze [3].
It puts marshalled byte code into static objects in C source code.
Freeze's "import hook" is hard wired into import.c, and has a couple
of issues. Later solutions include Fredrik Lundh's Squeeze [4],
Gordon McMillan's Installer [1] and Thomas Heller's py2exe [5].
MacPython ships with a tool called BuildApplication.
Squeeze, Installer and py2exe use an __import__ based scheme (py2exe
currently uses Installer's iu.py, Squeeze used ihooks.py); MacPython
has two Mac-specific import hooks hard wired into import.c that are
similar to the Freeze hook. The hooks proposed in this PEP enable
us (at least in theory; it's not a short term goal) to get rid of
the hard coded hooks in import.c, and would allow the
__import__-based tools to get rid of most of their import.c
emulation code.
Before work on the design and implementation of this PEP was
started, a new BuildApplication-like tool for MacOSX prompted one of
the authors of this PEP (JvR) to expose the table of frozen modules
to Python, in the imp module. The main reason was to be able to use
the freeze import hook (avoiding fancy __import__ support), yet to
also be able to supply a set of modules at runtime. This resulted
in sf patch #642578 [6], which was mysteriously accepted (mostly
because nobody seemed to care either way ;-). Yet it becomes
completely superfluous if this PEP is accepted, as the PEP offers a
much nicer and more general way to do the same thing.

Rationale

While experimenting with alternative implementation ideas to get
builtin Zip import, it was discovered that achieving this is
possible with only a fairly small number of changes to import.c.
This made it possible to factor out the Zip-specific stuff into a new source
file, while at the same time creating a *general* new import hook
scheme: the one you're reading about now.
An earlier design allowed non-string objects on sys.path. Such an
object would have the necessary methods to handle an import. This
has two disadvantages: 1) it breaks code that assumes all items on
sys.path are strings; 2) it is not compatible with the PYTHONPATH
environment variable. The latter is directly needed for Zip
imports. A compromise came from Jython: allow string *subclasses*
on sys.path, which would then act as importer objects. This avoids
some breakage, and seems to work well for Jython (where it is used
to load modules from .jar files), but it was perceived as an "ugly
hack".
This led to a more elaborate scheme (mostly copied from McMillan's
iu.py) in which each candidate in a list is asked whether it can
handle the sys.path item, until one is found that can. This list of
candidates is a new object in the sys module: sys.path_hooks.
Traversing sys.path_hooks for each path item for each new import can
be expensive, so the results are cached in another new object in the
sys module: sys.path_importer_cache. It maps sys.path entries to
importer objects.
To minimize the impact on import.c as well as to avoid adding extra
overhead, it was chosen to not add an explicit hook and importer
object for the existing file system import logic (as iu.py has), but
to simply fall back to the builtin logic if no hook on
sys.path_hooks could handle the path item. If this is the case, a
None value is stored in sys.path_importer_cache, again to avoid
repeated lookups. (Later we can go further and add a real importer
object for the builtin mechanism; for now, the None fallback scheme
should suffice.)
A question was raised: what about importers that don't need *any*
entry on sys.path? (Builtin and frozen modules fall into that
category.) Again, Gordon McMillan to the rescue: iu.py contains a
thing he calls the "metapath". In this PEP's implementation, it's a
list of importer objects that is traversed *before* sys.path. This
list is yet another new object in the sys module: sys.meta_path.
Currently, this list is empty by default, and frozen and builtin
module imports are done after traversing sys.meta_path, but still
before sys.path. (Again, later we can add real frozen, builtin and
sys.path importer objects on sys.meta_path, allowing for some extra
flexibility, but this could be done as a "phase 2" project, possibly
for Python 2.4. It would be the finishing touch as then *every*
import would go through sys.meta_path, making it the central import
dispatcher.)
As a bonus, the idea from the second paragraph of this section was
implemented after all: a sys.path item may *be* an importer object.
This use is discouraged for general purpose code, but it's very
convenient, for experimentation as well as for projects of which
it's known that no component wrongly assumes that sys.path items are
strings.

Specification part 1: The Importer Protocol

This PEP introduces a new protocol: the "Importer Protocol". It is
important to understand the context in which the protocol operates,
so here is a brief overview of the outer shells of the import
mechanism.
When an import statement is encountered, the interpreter looks up
the __import__ function in the builtin name space. __import__ is
then called with four arguments, amongst which are the name of the
module being imported (may be a dotted name) and a reference to the
current global namespace.
The builtin __import__ function (known as PyImport_ImportModuleEx in
import.c) will then check to see whether the module doing the import
is a package by looking for a __path__ variable in the current
global namespace. If it is indeed a package, it first tries to do
the import relative to the package. For example if a package named
"spam" does "import eggs", it will first look for a module named
"spam.eggs". If that fails, the import continues as an absolute
import: it will look for a module named "eggs". Dotted name imports
work pretty much the same: if package "spam" does "import
eggs.bacon", first "spam.eggs.bacon" is tried, and only if that
fails "eggs.bacon" is tried.
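
The lookup order described above can be sketched as a small helper.
This is only an illustration of the rule as stated in this section,
not code from import.c; the function name and its parameters are
hypothetical.

```python
def candidate_names(name, importing_package=None):
    """Return the fully qualified names tried for "import name", in order.

    importing_package is the name of the package doing the import, or
    None if the importing module is not a package (illustrative only).
    """
    candidates = []
    if importing_package is not None:
        # The package-relative attempt comes first.
        candidates.append(importing_package + "." + name)
    # The absolute import is the fallback (or the only attempt).
    candidates.append(name)
    return candidates


relative_then_absolute = candidate_names("eggs.bacon", "spam")
```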
Deeper down in the mechanism, a dotted name import is split up by
its components. For "import spam.ham", first an "import spam" is
done, and only when that succeeds is "ham" imported as a submodule
of "spam".
The Importer Protocol operates at this level of *individual*
imports. By the time an importer gets a request for "spam.ham",
module "spam" has already been imported.
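
As a rough illustration of this splitting, the helper below lists the
individual imports performed for a dotted name, in order. The
function name is hypothetical and the sketch ignores the relative
import step described earlier.

```python
def individual_imports(dotted_name):
    """List the fully qualified names imported, parent packages first."""
    parts = dotted_name.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


steps = individual_imports("spam.eggs.ham")
```

An importer is only asked about one entry of such a list at a time.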
The protocol involves two objects: an importer and a loader. An
importer object has a single method:
importer.find_module(fullname)
This method returns a loader object if the module was found, or None
if it wasn't. If find_module() raises an exception, it will be
propagated to the caller, aborting the import.
A loader object also has one method:
loader.load_module(fullname)
This method returns the loaded module. In many cases the importer
and loader can be one and the same object: importer.find_module()
would just return self.
The 'fullname' argument of both methods is the fully qualified
module name, for example "spam.eggs.ham". As explained above, when
importer.find_module("spam.eggs.ham") is called, "spam.eggs" has
already been imported and added to sys.modules. However, the
find_module() method isn't necessarily always called during an
actual import: meta tools that analyze import dependencies (such as
freeze, Installer or py2exe) don't actually load modules, so an
importer shouldn't *depend* on the parent package being available in
sys.modules.
The load_module() method has a few responsibilities that it must
fulfill *before* it runs any code:
- It must create the module object. From Python this can be done
via the new.module() function, the imp.new_module() function or
via the module type object; from C with the PyModule_New()
function or the PyImport_AddModule() function. The latter also
does the following step:
- It must add the module to sys.modules. This is crucial because
the module code may (directly or indirectly) import itself; adding
it to sys.modules beforehand prevents unbounded recursion in the
worst case and multiple loading in the best.
- The __file__ attribute must be set. This must be a string, but it
may be a dummy value, for example "<frozen>". The privilege of
not having a __file__ attribute at all is reserved for builtin
modules.
- If it's a package, the __path__ variable must be set. This must
be a list, but may be empty if __path__ has no further
significance to the importer (more on this later).
- It should add an __importer__ attribute to the module, set to the
loader object. This is mostly for introspection, but can be used
for importer-specific extras, for example getting data associated
with an importer.
If the module is a Python module (as opposed to a builtin module or
a dynamically loaded extension), it should execute the module's
code in the module's global name space (module.__dict__).
Here is a minimal pattern for a load_module() method:
    # Assumes "import imp, sys" at module level; _get_code() is supplied
    # by the importer and returns an (ispkg, code object) pair.
    def load_module(self, fullname):
        ispkg, code = self._get_code(fullname)
        mod = imp.new_module(fullname)
        sys.modules[fullname] = mod
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__importer__ = self
        if ispkg:
            mod.__path__ = []
        exec code in mod.__dict__
        return mod
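
For experimentation, the pattern can be fleshed out into a complete
importer/loader pair. The sketch below is an assumption-laden
illustration, not part of the proposed implementation: the
DictImporter class and the "pep302_demo" module name are made up, the
module is created via the module type object (types.ModuleType), and
the function form of exec is used. The methods are called directly
rather than through the import statement.

```python
import sys
import types


class DictImporter(object):
    """Hypothetical importer/loader serving modules from a dict."""

    def __init__(self, sources):
        # Maps fully qualified module name -> (is_package, source text).
        self.sources = sources

    def find_module(self, fullname):
        # Importer and loader are the same object, so return self.
        if fullname in self.sources:
            return self
        return None

    def _get_code(self, fullname):
        ispkg, source = self.sources[fullname]
        filename = "<%s>" % self.__class__.__name__
        return ispkg, compile(source, filename, "exec")

    def load_module(self, fullname):
        ispkg, code = self._get_code(fullname)
        mod = types.ModuleType(fullname)
        # Register the module before running any of its code.
        sys.modules[fullname] = mod
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__importer__ = self
        if ispkg:
            mod.__path__ = []
        exec(code, mod.__dict__)
        return mod


importer = DictImporter({"pep302_demo": (False, "answer = 6 * 7\n")})
loader = importer.find_module("pep302_demo")
module = loader.load_module("pep302_demo")
```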

Specification part 2: Registering Hooks

There are two types of import hooks: Meta hooks and Path hooks.
Meta hooks are called at the start of import processing, before any
other import processing (so that meta hooks can override sys.path
processing, or frozen modules, or even builtin modules). To
register a meta hook, simply add the importer object to
sys.meta_path (the list of registered meta hooks).
Path hooks are called as part of sys.path (or package.__path__)
processing, at the point where their associated path item is
encountered. A path hook can be registered in either of two ways:
- By simply including an importer object directly on the path.
This approach is discouraged for general purpose hooks, as
existing code may not be expecting non-strings to exist on
sys.path.
- By registering an importer factory in sys.path_hooks.
sys.path_hooks is a list of callables, which will be checked in
sequence to determine if they can handle a given path item. The
callable is called with one argument, the path item. The callable
must raise ImportError if it is unable to handle the path item, and
return an importer object if it can handle the path item. The
callable is typically the class of the import hook, and hence the
class __init__ method is called. (This is also the reason why it
should raise ImportError: an __init__ method can't return anything.
This would be possible with a __new__ method in a new style class,
but we don't want to require anything about how a hook is
implemented.)
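
A minimal sketch of such a callable, written as a class whose
__init__ rejects path items it cannot handle. The class name and the
".demo" suffix rule are purely hypothetical; a real hook would
inspect the path item itself (for example, check that it is a Zip
archive).

```python
class SuffixImporter(object):
    """Hypothetical path hook claiming only path items ending in '.demo'."""

    def __init__(self, path_item):
        # __init__ cannot return a value, so refusal is signalled by
        # raising ImportError.
        if not isinstance(path_item, str) or not path_item.endswith(".demo"):
            raise ImportError("cannot handle path item %r" % (path_item,))
        self.path_item = path_item

    def find_module(self, fullname):
        # A real importer would look for fullname inside self.path_item.
        return None


accepted = SuffixImporter("bundle.demo")
try:
    SuffixImporter("/usr/lib/python")
    rejected = False
except ImportError:
    rejected = True
```

Registering the hook would then amount to appending the class itself
to sys.path_hooks.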
The results of path hook checks are cached in
sys.path_importer_cache, which is a dictionary mapping path entries
to importer objects. The cache is checked before sys.path_hooks is
scanned. If it is necessary to force a rescan of sys.path_hooks, it
is possible to manually clear all or part of
sys.path_importer_cache.
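
The interplay between the hook list and the cache can be sketched as
follows. This is an illustration under stated assumptions rather than
the C implementation: get_importer and demo_hook are hypothetical
names, and the hook list and cache are passed in as plain arguments
instead of being read from the sys module. A None entry records that
no hook could handle the item, as described in the Rationale.

```python
def get_importer(path_item, path_hooks, importer_cache):
    """Return the importer for path_item, or None for the builtin logic."""
    try:
        # The cache is checked before the hooks are scanned.
        return importer_cache[path_item]
    except KeyError:
        pass
    importer = None
    for hook in path_hooks:
        try:
            importer = hook(path_item)
            break
        except ImportError:
            continue
    # Cache the outcome, including None, to avoid repeated scans.
    importer_cache[path_item] = importer
    return importer


calls = []


def demo_hook(path_item):
    # Hypothetical hook: records each call, accepts only '.demo' items.
    calls.append(path_item)
    if not path_item.endswith(".demo"):
        raise ImportError(path_item)
    return "importer for " + path_item


cache = {}
hooks = [demo_hook]
first = get_importer("bundle.demo", hooks, cache)
second = get_importer("bundle.demo", hooks, cache)        # from the cache
fallback = get_importer("/usr/lib/python", hooks, cache)  # no hook applies
```

Deleting an entry from the cache (or clearing it) makes the next
lookup for that path item scan the hooks again.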
Just like sys.path itself, the new sys variables must have specific
types:
sys.meta_path and sys.path_hooks must be Python lists.
sys.path_importer_cache must be a Python dict.
Modifying these variables in place is allowed, as is replacing them
with new objects.
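
Putting the two hook types together, the order in which importers are
consulted might be sketched like this. The names find_loader,
FixedImporter and the get_importer parameter are hypothetical; the
handling of builtin and frozen modules and of the builtin file system
logic is deliberately left out, so a None result here only means "no
hook found the module".

```python
def find_loader(fullname, meta_path, path, get_importer):
    """Return a loader from the hooks, or None if no hook finds fullname."""
    # Meta hooks come first, before any path processing.
    for importer in meta_path:
        loader = importer.find_module(fullname)
        if loader is not None:
            return loader
    # Then each path item is tried in order.
    for item in path:
        if hasattr(item, "find_module"):
            importer = item                # importer object on the path
        else:
            importer = get_importer(item)  # via path hooks and the cache
        if importer is None:
            continue                       # left to the builtin logic
        loader = importer.find_module(fullname)
        if loader is not None:
            return loader
    return None


class FixedImporter(object):
    """Hypothetical importer that 'finds' a fixed set of module names."""

    def __init__(self, names):
        self.names = set(names)

    def find_module(self, fullname):
        if fullname in self.names:
            return self
        return None


meta = FixedImporter(["a"])
on_path = FixedImporter(["a", "b"])
hooked = FixedImporter(["c"])
lookup = {"bundle.demo": hooked}.get   # stands in for the hook machinery
search_path = [on_path, "/usr/lib/python", "bundle.demo"]
```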

Packages and the role of __path__

If a module has a __path__ attribute, the import mechanism will
treat it as a package. The __path__ variable is used instead of
sys.path when importing submodules of the package. The rules for
sys.path therefore also apply to pkg.__path__. So sys.path_hooks is
also consulted when pkg.__path__ is traversed and importer objects
as path items are also allowed (yet, are discouraged for the same
reasons as they are discouraged on sys.path, at least for general
purpose code). Meta importers don't necessarily use sys.path at all
to do their work and therefore may also ignore the value of
pkg.__path__. In this case it is still advised to set it to a list,
which can be empty.

Integration with the 'imp' module

The new import hooks are not easily integrated in the existing
imp.find_module() and imp.load_module() calls. It's questionable
whether it's possible at all without breaking code; it is better to
simply add a new function to the imp module. The meaning of the
existing imp.find_module() and imp.load_module() calls changes from:
"they expose the builtin import mechanism" to "they expose the basic
*unhooked* builtin import mechanism". They simply won't invoke any
import hooks. A new imp module function is proposed under the name
"find_module2", which is used as in the following pattern:
    loader = imp.find_module2(fullname, path)
    if loader is not None:
        loader.load_module(fullname)
In the case of a "basic" import, one that the imp.find_module() function
would handle, the loader object would be a wrapper for the current
output of imp.find_module(), and loader.load_module() would call
imp.load_module() with that output.
Note that this wrapper is currently not yet implemented, although a
Python prototype exists in the test_importhooks.py script (the
ImpWrapper class) included with the patch.

Open Issues

The new hook method allows for the possibility of objects other than
strings appearing on sys.path. Existing code is entitled to assume
that sys.path only contains strings (the Python documentation states
this). It is not clear if this will cause significant breakage. In
particular, it is much less clear that code is entitled to assume
that sys.path contains a list of *directory names* - most code which
assumes that sys.path items are strings also relies on this extra
assumption, and so could be considered as broken (or at least "not
robust") already.
Modules often need supporting data files to do their job,
particularly in the case of complex packages or full applications.
Current practice is generally to locate such files via sys.path (or
a package.__path__ attribute). This approach will not work, in
general, for modules loaded via an import hook.
There are a number of possible ways to address this problem:
- "Don't do that". If a package needs to locate data files via its
__path__, it is not suitable for loading via an import hook. The
package can still be located on a directory in sys.path, as at
present, so this should not be seen as a major issue.
- Locate data files from a standard location, rather than relative
to the module file. A relatively simple approach (which is
supported by distutils) would be to locate data files based on
sys.prefix (or sys.exec_prefix). For example, looking in
os.path.join(sys.prefix, "data", package_name).
- Import hooks could offer a standard way of getting at data files
relative to the module file. The standard zipimport object
provides a method get_data(name) which returns the content of the
"file" called name, as a string. To allow modules to get at the
importer object, zipimport also adds an attribute "__importer__"
to the module, containing the zipimport object used to load the
module. If such an approach is used, it is important that client
code takes care not to break if the get_data method (or the
__importer__ attribute) is not available, so it is not clear that
this approach offers a general answer to the problem.
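
As an illustration of the defensive style that the last option would
require of client code, the helper below falls back to a default when
either the __importer__ attribute or the get_data method is missing.
The function name, the FakeImporter class and the file name are
hypothetical; only get_data(name) and __importer__ come from the text
above.

```python
import types


def read_module_data(module, name, default=None):
    """Fetch data through the module's importer, if it supports that."""
    importer = getattr(module, "__importer__", None)
    get_data = getattr(importer, "get_data", None)
    if get_data is None:
        # No importer, or an importer without get_data(): do not break.
        return default
    return get_data(name)


class FakeImporter(object):
    """Hypothetical importer exposing a get_data() method."""

    def get_data(self, name):
        return {"table.txt": "1 2 3\n"}[name]


hooked = types.ModuleType("hooked")
hooked.__importer__ = FakeImporter()
plain = types.ModuleType("plain")       # no __importer__ attribute

from_hook = read_module_data(hooked, "table.txt")
from_plain = read_module_data(plain, "table.txt", default="")
```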
Requiring loaders to set the module's __importer__ attribute means
that the loader will not get thrown away once the load is complete.
This increases memory usage, and stops loaders from being
lightweight, "throwaway" objects. As loader objects are not
required to offer any useful functionality (any such functionality,
such as the zipimport get_data() method mentioned above, is
optional) it is not clear that the __importer__ attribute will be
helpful, in practice.
On the other hand, importer objects are mostly permanent, as they
live or are kept alive on sys.meta_path, sys.path_importer_cache or
sys.path, so for a loader to keep a reference to the importer costs
us nothing extra. Whether loaders will ever need to carry so much
independent state for this to become a real issue is questionable.

Implementation

A C implementation is available as SourceForge patch 652586.
http://www.python.org/sf/652586

References

[1] Installer by Gordon McMillan
http://www.mcmillan-inc.com/install1.html
[2] PEP 273, Import Modules from Zip Archives, Ahlstrom
http://www.python.org/peps/pep-0273.html
[3] The Freeze tool
Tools/freeze/ in a Python source distribution
[4] Squeeze
http://starship.python.net/crew/fredrik/ipa/squeeze.htm
[5] py2exe by Thomas Heller
http://py2exe.sourceforge.net/
[6] imp.set_frozenmodules() patch
http://www.python.org/sf/642578

Copyright

This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: