New Import Hooks PEP
parent
1ce5ccea6e
commit
650c382d0e
@ -0,0 +1,467 @
PEP: 302
Title: New Import Hooks
Version: $Revision$
Last-Modified: $Date$
Author: Just van Rossum <just@letterror.com>,
        Paul Moore <gustav@morpheus.demon.co.uk>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 19-Dec-2002
Python-Version: 2.3
Post-History: 19-Dec-2002


Abstract

This PEP proposes to add a new set of import hooks that offer better
customization of the Python import mechanism. Unlike the current
__import__ hook, a new-style hook can be injected into the existing
scheme, allowing for finer-grained control of how modules are found
and how they are loaded.


Motivation

Currently, the only way to customize the import mechanism is to
override the builtin __import__ function. However, overriding
__import__ has many problems. To begin with:

- An __import__ replacement needs to *fully* reimplement the entire
  import mechanism, or call the original __import__ before or after
  the custom code.

- It has very complex semantics and responsibilities.

- __import__ gets called even for modules that are already in
  sys.modules, which is almost never what you want, unless you're
  writing some sort of monitoring tool.

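The last point in the list above can be observed directly. The
following is only a minimal sketch, written for current Python 3
(where the builtin lives in the "builtins" module); the names demo and
tracing_import are illustrative and not part of this proposal.

```python
import builtins


def demo():
    """Record the calls a replacement __import__ receives for "sys"."""
    calls = []
    original_import = builtins.__import__

    def tracing_import(name, *args, **kwargs):
        # Record the request, then defer to the real implementation.
        calls.append(name)
        return original_import(name, *args, **kwargs)

    builtins.__import__ = tracing_import
    try:
        # "sys" is always present in sys.modules, yet each import
        # statement below still goes through the replacement function.
        import sys
        import sys
    finally:
        # Always restore the original hook.
        builtins.__import__ = original_import
    return [name for name in calls if name == "sys"]
```

Calling demo() returns one entry per import statement, even though no
module is actually loaded by either of them.
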
The situation gets worse when you need to extend the import
mechanism from C: it's currently impossible, apart from hacking
Python's import.c or reimplementing much of import.c from scratch.

There is a fairly long history of tools written in Python that allow
extending the import mechanism in various ways, based on the
__import__ hook. The Standard Library includes two such tools:
ihooks.py (by GvR) and imputil.py (Greg Stein), but perhaps the most
famous is iu.py by Gordon McMillan, available as part of his
Installer [1] package. Their usefulness is somewhat limited because
they are written in Python; bootstrapping issues need to be worked
around, as you can't load the module containing the hook with the
hook itself. So if you want the entire Standard Library to be
loadable from an import hook, the hook must be written in C.


Use cases

This section lists several existing applications that depend on
import hooks. Among these, a lot of duplicate work was done that
could have been saved if there had been a more flexible import hook
at the time. This PEP should make life a lot easier for similar
projects in the future.

Extending the import mechanism is needed when you want to load
modules that are stored in a non-standard way. Examples include
modules that are bundled together in an archive; byte code that is
not stored in a pyc formatted file; modules that are loaded from a
database over a network.

The work on this PEP was partly triggered by the implementation of
PEP 273 [2], which adds imports from Zip archives as a builtin
feature to Python. While the PEP itself was widely accepted as a
must-have feature, the implementation left a few things to be
desired. For one thing, it went to great lengths to integrate itself
with import.c, adding lots of code that was either specific to Zip
file imports or *not* specific to Zip imports, yet was not generally
useful (or even desirable) either. Yet the PEP 273 implementation
can hardly be blamed for this: it is simply extremely hard to do,
given the current state of import.c.

Packaging applications for end users is a typical use case for
import hooks, if not *the* typical use case. Distributing lots of
source or pyc files around is not always appropriate (let alone a
separate Python installation), so there is a frequent desire to
package all needed modules in a single file. So frequent, in fact,
that multiple solutions have been implemented over the years.

The oldest one is included with the Python source code: Freeze [3].
It puts marshalled byte code into static objects in C source code.
Freeze's "import hook" is hard wired into import.c, and has a couple
of issues. Later solutions include Fredrik Lundh's Squeeze [4],
Gordon McMillan's Installer [1] and Thomas Heller's py2exe [5].
MacPython ships with a tool called BuildApplication.

Squeeze, Installer and py2exe use an __import__ based scheme (py2exe
currently uses Installer's iu.py, Squeeze used ihooks.py). MacPython
has two Mac-specific import hooks hard wired into import.c, which
are similar to the Freeze hook. The hooks proposed in this PEP
enable us (at least in theory; it's not a short term goal) to get
rid of the hard coded hooks in import.c, and would allow the
__import__-based tools to get rid of most of their import.c
emulation code.

Before work on the design and implementation of this PEP was
started, a new BuildApplication-like tool for Mac OS X prompted one
of the authors of this PEP (JvR) to expose the table of frozen
modules to Python, in the imp module. The main reason was to be able
to use the freeze import hook (avoiding fancy __import__ support),
yet to also be able to supply a set of modules at runtime. This
resulted in sf patch #642578 [6], which was mysteriously accepted
(mostly because nobody seemed to care either way ;-). Yet it becomes
completely superfluous if this PEP is accepted, as the PEP offers a
much nicer and more general way to do the same thing.


Rationale

While experimenting with alternative implementation ideas to get
builtin Zip import, it was discovered that achieving this is
possible with only a fairly small number of changes to import.c.
This made it possible to factor out the Zip-specific stuff into a
new source file, while at the same time creating a *general* new
import hook scheme: the one you're reading about now.

An earlier design allowed non-string objects on sys.path. Such an
object would have the necessary methods to handle an import. This
has two disadvantages: 1) it breaks code that assumes all items on
sys.path are strings; 2) it is not compatible with the PYTHONPATH
environment variable. The latter is directly needed for Zip
imports. A compromise came from Jython: allow string *subclasses*
on sys.path, which would then act as importer objects. This avoids
some breakage, and seems to work well for Jython (where it is used
to load modules from .jar files), but it was perceived as an "ugly
hack".

This led to a more elaborate scheme (mostly copied from McMillan's
iu.py) in which each candidate in a list is asked whether it can
handle the sys.path item, until one is found that can. This list of
candidates is a new object in the sys module: sys.path_hooks.

Traversing sys.path_hooks for each path item for each new import can
be expensive, so the results are cached in another new object in the
sys module: sys.path_importer_cache. It maps sys.path entries to
importer objects.

To minimize the impact on import.c as well as to avoid adding extra
overhead, it was chosen to not add an explicit hook and importer
object for the existing file system import logic (as iu.py has), but
to simply fall back to the builtin logic if no hook on
sys.path_hooks could handle the path item. If this is the case, a
None value is stored in sys.path_importer_cache, again to avoid
repeated lookups. (Later we can go further and add a real importer
object for the builtin mechanism; for now, the None fallback scheme
should suffice.)

A question was raised: what about importers that don't need *any*
entry on sys.path? (Builtin and frozen modules fall into that
category.) Again, Gordon McMillan to the rescue: iu.py contains a
thing he calls the "metapath". In this PEP's implementation, it's a
list of importer objects that is traversed *before* sys.path. This
list is yet another new object in the sys module: sys.meta_path.
Currently, this list is empty by default, and frozen and builtin
module imports are done after traversing sys.meta_path, but still
before sys.path. (Again, later we can add real frozen, builtin and
sys.path importer objects on sys.meta_path, allowing for some extra
flexibility, but this could be done as a "phase 2" project, possibly
for Python 2.4. It would be the finishing touch, as then *every*
import would go through sys.meta_path, making it the central import
dispatcher.)

As a bonus, the idea from the second paragraph of this section was
implemented after all: a sys.path item may *be* an importer object.
This use is discouraged for general purpose code, but it's very
convenient for experimentation, as well as for projects in which it
is known that no component wrongly assumes that sys.path items are
strings.


Specification part 1: The Importer Protocol

This PEP introduces a new protocol: the "Importer Protocol". It is
important to understand the context in which the protocol operates,
so here is a brief overview of the outer shells of the import
mechanism.

When an import statement is encountered, the interpreter looks up
the __import__ function in the builtin name space. __import__ is
then called with four arguments, amongst which are the name of the
module being imported (may be a dotted name) and a reference to the
current global namespace.

The builtin __import__ function (known as PyImport_ImportModuleEx in
import.c) will then check to see whether the module doing the import
is a package by looking for a __path__ variable in the current
global namespace. If it is indeed a package, it first tries to do
the import relative to the package. For example if a package named
"spam" does "import eggs", it will first look for a module named
"spam.eggs". If that fails, the import continues as an absolute
import: it will look for a module named "eggs". Dotted name imports
work pretty much the same: if package "spam" does "import
eggs.bacon", first "spam.eggs.bacon" is tried, and only if that
fails "eggs.bacon" is tried.

Deeper down in the mechanism, a dotted name import is split up by
its components. For "import spam.ham", first an "import spam" is
done, and only when that succeeds is "ham" imported as a submodule
of "spam".

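The two rules just described (package-relative before absolute, and
component-by-component processing of dotted names) can be written down
as a small sketch. It models only the name handling described in this
section, not the real code in import.c, and the function names
candidate_names and import_steps are hypothetical.

```python
def candidate_names(name, importing_package=None):
    """Return the fully qualified names to try, in order.

    If the importing module is a package, the package-relative name is
    tried first; the absolute name is always tried last.
    """
    candidates = []
    if importing_package:
        candidates.append(importing_package + "." + name)
    candidates.append(name)
    return candidates


def import_steps(fullname):
    """Split a dotted name into the individual imports, in order."""
    parts = fullname.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
```

For instance, candidate_names("eggs.bacon", "spam") lists
"spam.eggs.bacon" before "eggs.bacon", and import_steps("spam.ham")
lists "spam" before "spam.ham".
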
The Importer Protocol operates at this level of *individual*
imports. By the time an importer gets a request for "spam.ham",
module "spam" has already been imported.

The protocol involves two objects: an importer and a loader. An
importer object has a single method:

    importer.find_module(fullname)

This method returns a loader object if the module was found, or None
if it wasn't. If find_module() raises an exception, it will be
propagated to the caller, aborting the import.

A loader object also has one method:

    loader.load_module(fullname)

This method returns the loaded module. In many cases the importer
and loader can be one and the same object: importer.find_module()
would just return self.

The 'fullname' argument of both methods is the fully qualified
module name, for example "spam.eggs.ham". As explained above, when
importer.find_module("spam.eggs.ham") is called, "spam.eggs" has
already been imported and added to sys.modules. However, the
find_module() method isn't necessarily always called during an
actual import: meta tools that analyze import dependencies (such as
freeze, Installer or py2exe) don't actually load modules, so an
importer shouldn't *depend* on the parent package being available in
sys.modules.

The load_module() method has a few responsibilities that it must
fulfill *before* it runs any code:

- It must create the module object. From Python this can be done
  via the new.module() function, the imp.new_module() function or
  via the module type object; from C with the PyModule_New()
  function or the PyImport_AddModule() function. The latter also
  does the following step:

- It must add the module to sys.modules. This is crucial because
  the module code may (directly or indirectly) import itself; adding
  it to sys.modules beforehand prevents unbounded recursion in the
  worst case and multiple loading in the best.

- The __file__ attribute must be set. This must be a string, but it
  may be a dummy value, for example "<frozen>". The privilege of
  not having a __file__ attribute at all is reserved for builtin
  modules.

- If it's a package, the __path__ variable must be set. This must
  be a list, but may be empty if __path__ has no further
  significance to the importer (more on this later).

- It should add an __importer__ attribute to the module, set to the
  loader object. This is mostly for introspection, but can be used
  for importer-specific extras, for example getting data associated
  with an importer.

If the module is a Python module (as opposed to a builtin module or
a dynamically loaded extension), it should execute the module's
code in the module's global name space (module.__dict__).

Here is a minimal pattern for a load_module() method:

    import imp
    import sys

    def load_module(self, fullname):
        # _get_code() is importer-specific: it returns a package flag
        # and the module's code object.
        ispkg, code = self._get_code(fullname)
        mod = imp.new_module(fullname)
        # Register the module before running any of its code.
        sys.modules[fullname] = mod
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__importer__ = self
        if ispkg:
            mod.__path__ = []
        exec code in mod.__dict__
        return mod


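As an illustration of how the importer and loader roles fit together,
here is a minimal sketch of an importer that serves modules from an
in-memory dictionary and acts as its own loader. It is written for
current Python 3 (types.ModuleType and the exec() function instead of
imp.new_module() and the exec statement) and calls the two protocol
methods by hand rather than through the import statement. The class
DictImporter, the module name pep302_demo and the cleanup on failure
are assumptions of this sketch, not part of the specification.

```python
import sys
import types


class DictImporter:
    """Toy importer that is also its own loader (hypothetical example).

    `sources` maps fully qualified module names to source strings; a
    name counts as a package if another key starts with "<name>.".
    """

    def __init__(self, sources):
        self.sources = sources

    def find_module(self, fullname):
        # Return a loader (here: self) if the name is known, else None.
        return self if fullname in self.sources else None

    def load_module(self, fullname):
        code = compile(self.sources[fullname], "<DictImporter>", "exec")
        is_package = any(k.startswith(fullname + ".") for k in self.sources)
        mod = types.ModuleType(fullname)
        # Register before executing, so self-imports find the module.
        sys.modules[fullname] = mod
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__importer__ = self
        if is_package:
            mod.__path__ = []
        try:
            exec(code, mod.__dict__)
        except BaseException:
            # Cleanup is an addition of this sketch, not required above.
            del sys.modules[fullname]
            raise
        return mod


importer = DictImporter({"pep302_demo": "answer = 6 * 7\n"})
loader = importer.find_module("pep302_demo")
module = loader.load_module("pep302_demo")
```

After these calls the module is registered in sys.modules under its
full name and carries the __file__ and __importer__ attributes set by
load_module().
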
Specification part 2: Registering Hooks

There are two types of import hooks: Meta hooks and Path hooks.
Meta hooks are called at the start of import processing, before any
other import processing (so that meta hooks can override sys.path
processing, or frozen modules, or even builtin modules). To
register a meta hook, simply add the importer object to
sys.meta_path (the list of registered meta hooks).

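The dispatch order for meta hooks can be sketched as follows. To keep
the example independent of the running interpreter, a plain list
stands in for sys.meta_path and a callable stands in for the rest of
the import processing; find_loader and PrefixImporter are hypothetical
names used only for this illustration.

```python
def find_loader(fullname, meta_path, fallback):
    """Ask each meta hook in turn; use `fallback` if none claims it."""
    for importer in meta_path:
        loader = importer.find_module(fullname)
        if loader is not None:
            return loader
    # Stand-in for the builtin, frozen and sys.path processing.
    return fallback(fullname)


class PrefixImporter:
    """Hypothetical meta hook claiming every module under one prefix."""

    def __init__(self, prefix):
        self.prefix = prefix

    def find_module(self, fullname):
        return self if fullname.startswith(self.prefix) else None


meta_path = []                      # stands in for sys.meta_path
hook = PrefixImporter("virtual_")
meta_path.append(hook)              # registration is a plain list append
```

A name the hook claims never reaches the fallback; any other name
passes through all registered hooks and is then handled as before.
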
Path hooks are called as part of sys.path (or package.__path__)
processing, at the point where their associated path item is
encountered. A path hook can be registered in either of two ways:

- By simply including an importer object directly on the path.
  This approach is discouraged for general purpose hooks, as
  existing code may not be expecting non-strings to exist on
  sys.path.

- By registering an importer factory in sys.path_hooks.

sys.path_hooks is a list of callables, which will be checked in
sequence to determine if they can handle a given path item. The
callable is called with one argument, the path item. The callable
must raise ImportError if it is unable to handle the path item, and
return an importer object if it can handle the path item. The
callable is typically the class of the import hook, and hence the
class __init__ method is called. (This is also the reason why it
should raise ImportError: an __init__ method can't return anything.
This would be possible with a __new__ method in a new-style class,
but we don't want to require anything about how a hook is
implemented.)

The results of path hook checks are cached in
sys.path_importer_cache, which is a dictionary mapping path entries
to importer objects. The cache is checked before sys.path_hooks is
scanned. If it is necessary to force a rescan of sys.path_hooks, it
is possible to manually clear all or part of
sys.path_importer_cache.

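The interplay between the hook list and the cache can be sketched as
below. A list and a dict stand in for sys.path_hooks and
sys.path_importer_cache so that the example does not touch the real
interpreter state; get_importer and ZipLikeImporter are hypothetical
names, and the ".zip" suffix test is merely a placeholder for whatever
check a real hook would perform.

```python
def get_importer(path_item, path_hooks, importer_cache):
    """Return the importer for `path_item`, or None for builtin logic."""
    if path_item in importer_cache:
        return importer_cache[path_item]
    importer = None
    for hook in path_hooks:
        try:
            importer = hook(path_item)
            break
        except ImportError:
            # This hook cannot handle the item; try the next one.
            continue
    # None is cached too, so the hooks are not rescanned for this item.
    importer_cache[path_item] = importer
    return importer


class ZipLikeImporter:
    """Hypothetical hook: its __init__ rejects items it cannot handle."""

    def __init__(self, path_item):
        if not path_item.endswith(".zip"):
            raise ImportError("not a zip archive: %r" % (path_item,))
        self.archive = path_item


path_hooks = [ZipLikeImporter]      # stands in for sys.path_hooks
cache = {}                          # stands in for sys.path_importer_cache
zip_importer = get_importer("/tmp/lib.zip", path_hooks, cache)
dir_importer = get_importer("/tmp/lib", path_hooks, cache)
```

Because the cache is consulted first, a later change to the hook list
has no effect on path items that were already looked up; deleting the
corresponding cache entry forces a rescan.
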
Just like sys.path itself, the new sys variables must have specific
types:

    sys.meta_path and sys.path_hooks must be Python lists.
    sys.path_importer_cache must be a Python dict.

Modifying these variables in place is allowed, as is replacing them
with new objects.


Packages and the role of __path__

If a module has a __path__ attribute, the import mechanism will
treat it as a package. The __path__ variable is used instead of
sys.path when importing submodules of the package. The rules for
sys.path therefore also apply to pkg.__path__. So sys.path_hooks is
also consulted when pkg.__path__ is traversed, and importer objects
as path items are also allowed (yet are discouraged for the same
reasons as they are discouraged on sys.path, at least for general
purpose code). Meta importers don't necessarily use sys.path at all
to do their work and therefore may also ignore the value of
pkg.__path__. In this case it is still advised to set it to a list,
which can be empty.


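The choice between sys.path and a package's __path__ can be expressed
in a few lines. This is a sketch only: a dict stands in for
sys.modules, a list stands in for sys.path, and the function name
search_path and the example directories are made up for the
illustration.

```python
import types


def search_path(fullname, modules, default_path):
    """Pick the path list used to find `fullname` (illustrative only)."""
    parent_name, _, _ = fullname.rpartition(".")
    if not parent_name:
        return default_path          # top-level import: use sys.path
    parent = modules[parent_name]    # parent must be imported already
    return parent.__path__


pkg = types.ModuleType("pkg")
pkg.__path__ = ["/opt/app/pkg"]
modules = {"pkg": pkg}               # stands in for sys.modules
```

Whichever list is selected, its items are then processed in the same
way, including the lookup through the path hooks.
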
Integration with the 'imp' module

The new import hooks are not easily integrated in the existing
imp.find_module() and imp.load_module() calls. It's questionable
whether it's possible at all without breaking code; it is better to
simply add a new function to the imp module. The meaning of the
existing imp.find_module() and imp.load_module() calls changes from:
"they expose the builtin import mechanism" to "they expose the basic
*unhooked* builtin import mechanism". They simply won't invoke any
import hooks. A new imp module function is proposed under the name
"find_module2", which is used as in the following pattern:

    loader = imp.find_module2(fullname, path)
    if loader is not None:
        loader.load_module(fullname)

In the case of a "basic" import, one that the imp.find_module()
function would handle, the loader object would be a wrapper for the
current output of imp.find_module(), and loader.load_module() would
call imp.load_module() with that output.

Note that this wrapper is currently not yet implemented, although a
Python prototype exists in the test_importhooks.py script (the
ImpWrapper class) included with the patch.


Open Issues

The new hook method allows for the possibility of objects other than
strings appearing on sys.path. Existing code is entitled to assume
that sys.path only contains strings (the Python documentation states
this). It is not clear if this will cause significant breakage. It
is much less clear, however, that code is entitled to assume that
sys.path contains a list of *directory names*; most code that
assumes that sys.path items are strings also relies on this extra
assumption, and so could be considered broken (or at least "not
robust") already.

Modules often need supporting data files to do their job,
particularly in the case of complex packages or full applications.
Current practice is generally to locate such files via sys.path (or
a package.__path__ attribute). This approach will not work, in
general, for modules loaded via an import hook.

There are a number of possible ways to address this problem:

- "Don't do that". If a package needs to locate data files via its
  __path__, it is not suitable for loading via an import hook. The
  package can still be located on a directory in sys.path, as at
  present, so this should not be seen as a major issue.

- Locate data files from a standard location, rather than relative
  to the module file. A relatively simple approach (which is
  supported by distutils) would be to locate data files based on
  sys.prefix (or sys.exec_prefix). For example, looking in
  os.path.join(sys.prefix, "data", package_name).

- Import hooks could offer a standard way of getting at data files
  relative to the module file. The standard zipimport object
  provides a method get_data(name) which returns the content of the
  "file" called name, as a string. To allow modules to get at the
  importer object, zipimport also adds an attribute "__importer__"
  to the module, containing the zipimport object used to load the
  module. If such an approach is used, it is important that client
  code takes care not to break if the get_data method (or the
  __importer__ attribute) is not available, so it is not clear that
  this approach offers a general answer to the problem.

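The defensive style called for in the last option might look like the
sketch below. It assumes only what is described above (an optional
__importer__ attribute with an optional get_data(name) method) and
falls back to reading a file next to the module. The function
read_package_data and the class FakeArchiveImporter are hypothetical
and exist only to make the example self-contained.

```python
import os
import types


def read_package_data(module, name):
    """Read a data file for `module`, preferring get_data() if present."""
    importer = getattr(module, "__importer__", None)
    get_data = getattr(importer, "get_data", None)
    if get_data is not None:
        return get_data(name)
    # Fallback: treat the data file as a sibling of the module file.
    base = os.path.dirname(getattr(module, "__file__", "") or "")
    with open(os.path.join(base, name), "rb") as f:
        return f.read()


class FakeArchiveImporter:
    """Hypothetical stand-in for an importer that offers get_data()."""

    def __init__(self, files):
        self.files = files

    def get_data(self, name):
        return self.files[name]


mod = types.ModuleType("demo")
mod.__importer__ = FakeArchiveImporter({"config.txt": b"debug=1\n"})
data = read_package_data(mod, "config.txt")
```

Client code written this way keeps working for modules loaded from
the file system, at the cost of carrying both code paths.
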
Requiring loaders to set the module's __importer__ attribute means
that the loader will not get thrown away once the load is complete.
This increases memory usage, and stops loaders from being
lightweight, "throwaway" objects. As loader objects are not
required to offer any useful functionality (any such functionality,
such as the zipimport get_data() method mentioned above, is
optional) it is not clear that the __importer__ attribute will be
helpful in practice.

On the other hand, importer objects are mostly permanent, as they
live or are kept alive on sys.meta_path, sys.path_importer_cache or
sys.path, so for a loader to keep a reference to the importer costs
us nothing extra. Whether loaders will ever need to carry so much
independent state for this to become a real issue is questionable.


Implementation

A C implementation is available as SourceForge patch 652586.
http://www.python.org/sf/652586


References

[1] Installer by Gordon McMillan
    http://www.mcmillan-inc.com/install1.html

[2] PEP 273, Import Modules from Zip Archives, Ahlstrom
    http://www.python.org/peps/pep-0273.html

[3] The Freeze tool
    Tools/freeze/ in a Python source distribution

[4] Squeeze
    http://starship.python.net/crew/fredrik/ipa/squeeze.htm

[5] py2exe by Thomas Heller
    http://py2exe.sourceforge.net/

[6] imp.set_frozenmodules() patch
    http://www.python.org/sf/642578


Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: