New Import Hooks PEP

2002-12-20 13:07:24 +00:00 · 2002-12-20 13:07:24 +00:00 · 650c382d0e
parent 1ce5ccea6e
commit 650c382d0e
1 changed files with 467 additions and 0 deletions
--- a/pep-0302.txt
+++ b/pep-0302.txt
@@ -0,0 +1,467 @@
+PEP: 302
+Title: New Import Hooks
+Version: $Revision$
+Last-Modified: $Date$
+Author: Just van Rossum <just@letterror.com>,
+    Paul Moore <gustav@morpheus.demon.co.uk>
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 19-Dec-2002
+Python-Version: 2.3
+Post-History: 19-Dec-2002
+
+
+Abstract
+
+    This PEP proposes to add a new set of import hooks that offer better
+    customization of the Python import mechanism.  Contrary to the
+    current __import__ hook, a new-style hook can be injected into the
+    existing scheme, allowing finer-grained control of how modules
+    are found and how they are loaded.
+
+
+Motivation
+
+    The only way to customize the import mechanism is currently to
+    override the builtin __import__ function.  However, overriding
+    __import__ has many problems.  To begin with:
+
+    - An __import__ replacement needs to *fully* reimplement the entire
+      import mechanism, or call the original __import__ before or after
+      the custom code.
+
+    - It has very complex semantics and responsibilities.
+
+    - __import__ gets called even for modules that are already in
+      sys.modules, which is almost never what you want, unless you're
+      writing some sort of monitoring tool.
+
+    The situation gets worse when you need to extend the import
+    mechanism from C: it's currently impossible, apart from hacking
+    Python's import.c or reimplementing much of import.c from scratch.
+
+    There is a fairly long history of tools written in Python that allow
+    extending the import mechanism in various ways, based on the
+    __import__ hook.  The Standard Library includes two such tools:
+    ihooks.py (by GvR) and imputil.py (Greg Stein), but perhaps the most
+    famous is iu.py by Gordon McMillan, available as part of his
+    Installer [1] package.  Their usefulness is somewhat limited because
+    they are written in Python; bootstrapping issues need to be worked
+    around as you can't load the module containing the hook with the
+    hook itself.  So if you want the entire Standard Library to be
+    loadable from an import hook, the hook must be written in C.
+
+
+Use cases
+
+    This section lists several existing applications that depend on
+    import hooks.  Among these, a lot of duplicate work was done that
+    could have been saved if there had been a more flexible import hook
+    at the time.  This PEP should make life a lot easier for similar
+    projects in the future.
+
+    Extending the import mechanism is needed when you want to load
+    modules that are stored in a non-standard way.  Examples include
+    modules that are bundled together in an archive; byte code that is
+    not stored in a pyc formatted file; modules that are loaded from a
+    database over a network.
+
+    The work on this PEP was partly triggered by the implementation of
+    PEP 273 [2], which adds imports from Zip archives as a builtin
+    feature to Python.  While the PEP itself was widely accepted as a
+    must-have feature, the implementation left a few things to be
+    desired.  For one thing, it went to great lengths to integrate
+    itself with import.c, adding lots of code that was either specific
+    to Zip file imports or *not* specific to Zip imports, yet was not
+    generally useful (or even desirable) either.  Yet the PEP 273
+    implementation can hardly be blamed for this: it is simply
+    extremely hard to do, given the current state of import.c.
+
+    Packaging applications for end users is a typical use case for
+    import hooks, if not *the* typical use case.  Distributing lots of
+    source or pyc files around is not always appropriate (let alone a
+    separate Python installation), so there is a frequent desire to
+    package all needed modules in a single file.  So frequent in fact
+    that multiple solutions have been implemented over the years.
+
+    The oldest one is included with the Python source code: Freeze [3].
+    It puts marshalled byte code into static objects in C source code.
+    Freeze's "import hook" is hard wired into import.c, and has a couple
+    of issues.  Later solutions include Fredrik Lundh's Squeeze [4],
+    Gordon McMillan's Installer [1] and Thomas Heller's py2exe [5].
+    MacPython ships with a tool called BuildApplication.
+
+    Squeeze, Installer and py2exe use an __import__ based scheme (py2exe
+    currently uses Installer's iu.py, Squeeze used ihooks.py); MacPython
+    has two Mac-specific import hooks hard wired into import.c that are
+    similar to the Freeze hook.  The hooks proposed in this PEP enable
+    us (at least in theory; it's not a short term goal) to get rid of
+    the hard coded hooks in import.c, and would allow the
+    __import__-based tools to get rid of most of their import.c
+    emulation code.
+
+    Before work on the design and implementation of this PEP was
+    started, a new BuildApplication-like tool for MacOSX prompted one of
+    the authors of this PEP (JvR) to expose the table of frozen modules
+    to Python, in the imp module.  The main reason was to be able to use
+    the freeze import hook (avoiding fancy __import__ support), yet to
+    also be able to supply a set of modules at runtime.  This resulted
+    in sf patch #642578 [6], which was mysteriously accepted (mostly
+    because nobody seemed to care either way ;-).  Yet it becomes
+    completely superfluous once this PEP is accepted, as the PEP offers
+    a much nicer and more general way to do the same thing.
+
+
+Rationale
+
+    While experimenting with alternative implementation ideas to get
+    builtin Zip import, it was discovered that achieving this is
+    possible with only a fairly small number of changes to import.c.
+    This allowed the Zip-specific stuff to be factored out into a new
+    source file, while at the same time creating a *general* new import
+    hook scheme: the one you're reading about now.
+
+    An earlier design allowed non-string objects on sys.path.  Such an
+    object would have the necessary methods to handle an import.  This
+    has two disadvantages: 1) it breaks code that assumes all items on
+    sys.path are strings; 2) it is not compatible with the PYTHONPATH
+    environment variable.  The latter is directly needed for Zip
+    imports.  A compromise came from Jython: allow string *subclasses*
+    on sys.path, which would then act as importer objects.  This avoids
+    some breakage, and seems to work well for Jython (where it is used
+    to load modules from .jar files), but it was perceived as an "ugly
+    hack".
+
+    This led to a more elaborate scheme (mostly copied from McMillan's
+    iu.py), in which each candidate in a list is asked in turn whether
+    it can handle the sys.path item, until one is found that can.  This
+    list of candidates is a new object in the sys module:
+    sys.path_hooks.
+
+    Traversing sys.path_hooks for each path item for each new import can
+    be expensive, so the results are cached in another new object in the
+    sys module: sys.path_importer_cache.  It maps sys.path entries to
+    importer objects.
+
+    To minimize the impact on import.c as well as to avoid adding extra
+    overhead, it was chosen to not add an explicit hook and importer
+    object for the existing file system import logic (as iu.py has), but
+    to simply fall back to the builtin logic if no hook on
+    sys.path_hooks could handle the path item.  If this is the case, a
+    None value is stored in sys.path_importer_cache, again to avoid
+    repeated lookups.  (Later we can go further and add a real importer
+    object for the builtin mechanism; for now, the None fallback scheme
+    should suffice.)
+
+    A question was raised: what about importers that don't need *any*
+    entry on sys.path? (Builtin and frozen modules fall into that
+    category.)  Again, Gordon McMillan to the rescue: iu.py contains a
+    thing he calls the "metapath".  In this PEP's implementation, it's a
+    list of importer objects that is traversed *before* sys.path.  This
+    list is yet another new object in the sys module: sys.meta_path.
+    Currently, this list is empty by default, and frozen and builtin
+    module imports are done after traversing sys.meta_path, but still
+    before sys.path.  (Again, later we can add real frozen, builtin and
+    sys.path importer objects on sys.meta_path, allowing for some extra
+    flexibility, but this could be done as a "phase 2" project, possibly
+    for Python 2.4.  It would be the finishing touch as then *every*
+    import would go through sys.meta_path, making it the central import
+    dispatcher.)
+
+    As a bonus, the idea from the second paragraph of this section was
+    implemented after all: a sys.path item may *be* an importer object.
+    This use is discouraged for general purpose code, but it's very
+    convenient, for experimentation as well as for projects of which
+    it's known that no component wrongly assumes that sys.path items are
+    strings.
+
+
+Specification part 1: The Importer Protocol
+
+    This PEP introduces a new protocol: the "Importer Protocol".  It is
+    important to understand the context in which the protocol operates,
+    so here is a brief overview of the outer shells of the import
+    mechanism.
+
+    When an import statement is encountered, the interpreter looks up
+    the __import__ function in the builtin name space.  __import__ is
+    then called with four arguments, amongst which are the name of the
+    module being imported (may be a dotted name) and a reference to the
+    current global namespace.
+
+    The builtin __import__ function (known as PyImport_ImportModuleEx in
+    import.c) will then check to see whether the module doing the import
+    is a package by looking for a __path__ variable in the current
+    global namespace.  If it is indeed a package, it first tries to do
+    the import relative to the package.  For example if a package named
+    "spam" does "import eggs", it will first look for a module named
+    "spam.eggs".  If that fails, the import continues as an absolute
+    import: it will look for a module named "eggs".  Dotted name imports
+    work pretty much the same: if package "spam" does "import
+    eggs.bacon", first "spam.eggs.bacon" is tried, and only if that
+    fails "eggs.bacon" is tried.
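+
+    The lookup order described above can be sketched as follows.  This
+    is an illustrative model only, not the code in import.c; the
+    function name candidate_names and its parameters are made up for
+    this example, and it models only the relative-then-absolute order
+    described in this section.
+

```python
def candidate_names(name, importing_package=None):
    """List the module names tried, in order, for ``import name``.

    importing_package is the name of the package doing the import,
    or None when the importing module is not a package.
    """
    candidates = []
    if importing_package is not None:
        # First attempt: relative to the importing package.
        candidates.append(importing_package + "." + name)
    # Fallback (or only attempt): the absolute name.
    candidates.append(name)
    return candidates


print(candidate_names("eggs.bacon", importing_package="spam"))
```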
+
+    Deeper down in the mechanism, a dotted name import is split up by
+    its components.  For "import spam.ham", first an "import spam" is
+    done, and only when that succeeds is "ham" imported as a submodule
+    of "spam".
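+
+    As a rough illustration of this splitting, the helper below lists
+    the individual imports performed for a dotted name, in order.  The
+    name individual_imports is hypothetical and exists only for this
+    sketch.
+

```python
def individual_imports(dotted_name):
    """Return the fully qualified names imported, in order."""
    parts = dotted_name.split(".")
    # "spam.ham" -> ["spam", "spam.ham"]: each prefix is imported
    # before the next component is attempted.
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(individual_imports("spam.ham"))
```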
+
+    The Importer Protocol operates at this level of *individual*
+    imports.  By the time an importer gets a request for "spam.ham",
+    module "spam" has already been imported.
+
+    The protocol involves two objects: an importer and a loader.  An
+    importer object has a single method:
+
+        importer.find_module(fullname)
+
+    This method returns a loader object if the module was found, or None
+    if it wasn't.  If find_module() raises an exception, it will be
+    propagated to the caller, aborting the import.
+
+    A loader object also has one method:
+
+        loader.load_module(fullname)
+
+    This method returns the loaded module.  In many cases the importer
+    and loader can be one and the same object: importer.find_module()
+    would just return self.
+
+    The 'fullname' argument of both methods is the fully qualified
+    module name, for example "spam.eggs.ham".  As explained above, when
+    importer.find_module("spam.eggs.ham") is called, "spam.eggs" has
+    already been imported and added to sys.modules.  However, the
+    find_module() method isn't necessarily always called during an
+    actual import: meta tools that analyze import dependencies (such as
+    freeze, Installer or py2exe) don't actually load modules, so an
+    importer shouldn't *depend* on the parent package being available in
+    sys.modules.
+
+    The load_module() method has a few responsibilities that it must
+    fulfill *before* it runs any code:
+
+    - It must create the module object.  From Python this can be done
+      via the new.module() function, the imp.new_module() function or
+      via the module type object; from C with the PyModule_New()
+      function or the PyImport_AddModule() function.  The latter also
+      does the following step:
+
+    - It must add the module to sys.modules.  This is crucial because
+      the module code may (directly or indirectly) import itself; adding
+      it to sys.modules beforehand prevents unbounded recursion in the
+      worst case and multiple loading in the best.
+
+    - The __file__ attribute must be set.  This must be a string, but it
+      may be a dummy value, for example "<frozen>".  The privilege of
+      not having a __file__ attribute at all is reserved for builtin
+      modules.
+
+    - If it's a package, the __path__ variable must be set.  This must
+      be a list, but may be empty if __path__ has no further
+      significance to the importer (more on this later).
+
+    - It should add an __importer__ attribute to the module, set to the
+      loader object.  This is mostly for introspection, but can be used
+      for importer-specific extras, for example getting data associated
+      with an importer.
+
+    If the module is a Python module (as opposed to a builtin module or
+    a dynamically loaded extension), it should execute the module's
+    code in the module's global name space (module.__dict__).
+
+    Here is a minimal pattern for a load_module() method:
+
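+        import imp
+        import sys
+
+        def load_module(self, fullname):
+            # _get_code() is importer-specific: returns (ispkg, code)
+            ispkg, code = self._get_code(fullname)
+            mod = imp.new_module(fullname)
+            # register before executing, so self-imports find the module
+            sys.modules[fullname] = mod
+            mod.__file__ = "<%s>" % self.__class__.__name__
+            mod.__importer__ = self
+            if ispkg:
+                mod.__path__ = []
+            exec code in mod.__dict__
+            return mod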
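+
+    The same responsibilities are shown below as a complete,
+    self-contained sketch written in current Python syntax (exec() as a
+    function, types.ModuleType in place of imp.new_module()).  The
+    DictImporter class, its sources dict and the module name
+    pep302_demo are hypothetical.  The two protocol methods are called
+    directly here rather than through an import statement.
+

```python
import sys
import types


class DictImporter:
    """Hypothetical importer serving module source code from a dict."""

    def __init__(self, sources):
        # sources maps a full module name to (is_package, source_text).
        self.sources = sources

    def find_module(self, fullname):
        # Importer and loader are one and the same object here.
        return self if fullname in self.sources else None

    def load_module(self, fullname):
        ispkg, source = self.sources[fullname]
        code = compile(source, "<%s>" % fullname, "exec")
        mod = types.ModuleType(fullname)
        # Register the module before running any of its code.
        sys.modules[fullname] = mod
        mod.__file__ = "<%s>" % self.__class__.__name__
        mod.__importer__ = self
        if ispkg:
            mod.__path__ = []
        exec(code, mod.__dict__)
        return mod


importer = DictImporter({"pep302_demo": (False, "answer = 6 * 7\n")})
loader = importer.find_module("pep302_demo")
module = loader.load_module("pep302_demo")
print(module.answer)
```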
+
+
+Specification part 2: Registering Hooks
+
+    There are two types of import hooks: Meta hooks and Path hooks.
+    Meta hooks are called at the start of import processing, before any
+    other import processing (so that meta hooks can override sys.path
+    processing, or frozen modules, or even builtin modules).  To
+    register a meta hook, simply add the importer object to
+    sys.meta_path (the list of registered meta hooks).
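+
+    A minimal sketch of how a list of meta hooks might be consulted is
+    given below.  A plain list stands in for sys.meta_path, and the
+    names find_loader and OnlyKnows are hypothetical; the actual
+    traversal lives in import.c.
+

```python
def find_loader(fullname, meta_path):
    """Ask each meta hook in turn; return the first loader, or None."""
    for importer in meta_path:
        loader = importer.find_module(fullname)
        if loader is not None:
            return loader
    # No meta hook claimed the module: the caller would continue
    # with builtin, frozen and sys.path processing.
    return None


class OnlyKnows:
    """Hypothetical meta hook that claims a single module name."""

    def __init__(self, name):
        self.name = name

    def find_module(self, fullname):
        return self if fullname == self.name else None


meta_path = [OnlyKnows("spam"), OnlyKnows("eggs")]
print(find_loader("eggs", meta_path) is meta_path[1])
```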
+
+    Path hooks are called as part of sys.path (or package.__path__)
+    processing, at the point where their associated path item is
+    encountered.  A path hook can be registered in either of two ways:
+
+    - By simply including an importer object directly on the path.
+      This approach is discouraged for general purpose hooks, as
+      existing code may not be expecting non-strings to exist on
+      sys.path.
+
+    - By registering an importer factory in sys.path_hooks.
+
+    sys.path_hooks is a list of callables, which will be checked in
+    sequence to determine if they can handle a given path item.  The
+    callable is called with one argument, the path item.  The callable
+    must raise ImportError if it is unable to handle the path item, and
+    return an importer object if it can handle the path item.  The
+    callable is typically the class of the import hook, and hence the
+    class __init__ method is called.  (This is also the reason why it
+    should raise ImportError: an __init__ method can't return anything.
+    This would be possible with a __new__ method in a new style class,
+    but we don't want to require anything about how a hook is
+    implemented.)
+
+    The results of path hook checks are cached in
+    sys.path_importer_cache, which is a dictionary mapping path entries
+    to importer objects.  The cache is checked before sys.path_hooks is
+    scanned.  If it is necessary to force a rescan of sys.path_hooks, it
+    is possible to manually clear all or part of
+    sys.path_importer_cache.
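+
+    The interplay between the hook list and the cache can be sketched
+    as below.  A plain list and dict stand in for sys.path_hooks and
+    sys.path_importer_cache, and the names importer_for_path_item and
+    ZipLikeImporter are hypothetical.  A cached None means "no hook
+    applies, use the builtin logic", as described in the Rationale.
+

```python
def importer_for_path_item(path_item, path_hooks, path_importer_cache):
    """Return the importer for path_item, or None for the fallback."""
    if path_item in path_importer_cache:
        return path_importer_cache[path_item]
    importer = None
    for hook in path_hooks:
        try:
            importer = hook(path_item)
            break
        except ImportError:
            # This hook cannot handle the item; try the next one.
            continue
    # Cache the result, including None, to avoid repeated lookups.
    path_importer_cache[path_item] = importer
    return importer


class ZipLikeImporter:
    """Hypothetical hook accepting only path items ending in '.zip'."""

    calls = 0  # counts how often the hook is consulted

    def __init__(self, path_item):
        ZipLikeImporter.calls += 1
        if not path_item.endswith(".zip"):
            raise ImportError("not a zip archive: %r" % path_item)
        self.archive = path_item

    def find_module(self, fullname):
        return None  # a real hook would look inside the archive


hooks, cache = [ZipLikeImporter], {}
first = importer_for_path_item("lib.zip", hooks, cache)
second = importer_for_path_item("lib.zip", hooks, cache)
print(first is second)
```

+    Deleting an entry from the cache dict forces the hooks to be
+    consulted again for that path item on the next lookup.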
+
+    Just like sys.path itself, the new sys variables must have specific
+    types:
+
+        sys.meta_path and sys.path_hooks must be Python lists.
+        sys.path_importer_cache must be a Python dict.
+
+    Modifying these variables in place is allowed, as is replacing them
+    with new objects.
+
+
+Packages and the role of __path__
+
+    If a module has a __path__ attribute, the import mechanism will
+    treat it as a package.  The __path__ variable is used instead of
+    sys.path when importing submodules of the package.  The rules for
+    sys.path therefore also apply to pkg.__path__.  So sys.path_hooks is
+    also consulted when pkg.__path__ is traversed and importer objects
+    as path items are also allowed (yet, are discouraged for the same
+    reasons as they are discouraged on sys.path, at least for general
+    purpose code).  Meta importers don't necessarily use sys.path at all
+    to do their work and therefore may also ignore the value of
+    pkg.__path__.  In this case it is still advised to set it to a list,
+    which can be empty.
+
+
+Integration with the 'imp' module
+
+    The new import hooks are not easily integrated in the existing
+    imp.find_module() and imp.load_module() calls.  It's questionable
+    whether it's possible at all without breaking code; it is better to
+    simply add a new function to the imp module.  The meaning of the
+    existing imp.find_module() and imp.load_module() calls changes from:
+    "they expose the builtin import mechanism" to "they expose the basic
+    *unhooked* builtin import mechanism".  They simply won't invoke any
+    import hooks.  A new imp module function is proposed under the name
+    "find_module2", which is used as in the following pattern:
+
+        loader = imp.find_module2(fullname, path)
+        if loader is not None:
+            loader.load_module(fullname)
+
+    In the case of a "basic" import, one that the imp.find_module()
+    function would handle, the loader object would be a wrapper for the
+    current output of imp.find_module(), and loader.load_module() would
+    call imp.load_module() with that output.
+
+    Note that this wrapper is currently not yet implemented, although a
+    Python prototype exists in the test_importhooks.py script (the
+    ImpWrapper class) included with the patch.
+
+
+Open Issues
+
+    The new hook method allows for the possibility of objects other than
+    strings appearing on sys.path.  Existing code is entitled to assume
+    that sys.path only contains strings (the Python documentation states
+    this).  It is not clear if this will cause significant breakage.  In
+    particular, it is much less clear that code is entitled to assume
+    that sys.path contains a list of *directory names* - most code that
+    assumes that sys.path items are strings also relies on this extra
+    assumption, and so could be considered as broken (or at least "not
+    robust") already.
+
+    Modules often need supporting data files to do their job,
+    particularly in the case of complex packages or full applications.
+    Current practice is generally to locate such files via sys.path (or
+    a package.__path__ attribute).  This approach will not work, in
+    general, for modules loaded via an import hook.
+
+    There are a number of possible ways to address this problem:
+
+    - "Don't do that".  If a package needs to locate data files via its
+      __path__, it is not suitable for loading via an import hook.  The
+      package can still be located on a directory in sys.path, as at
+      present, so this should not be seen as a major issue.
+
+    - Locate data files from a standard location, rather than relative
+      to the module file.  A relatively simple approach (which is
+      supported by distutils) would be to locate data files based on
+      sys.prefix (or sys.exec_prefix).  For example, looking in
+      os.path.join(sys.prefix, "data", package_name).
+
+    - Import hooks could offer a standard way of getting at data files
+      relative to the module file.  The standard zipimport object
+      provides a method get_data(name) which returns the content of the
+      "file" called name, as a string.  To allow modules to get at the
+      importer object, zipimport also adds an attribute "__importer__"
+      to the module, containing the zipimport object used to load the
+      module.  If such an approach is used, it is important that client
+      code takes care not to break if the get_data method (or the
+      __importer__ attribute) is not available, so it is not clear that
+      this approach offers a general answer to the problem.
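+
+    Client code following the last approach might be written
+    defensively, along the lines of the sketch below.  The helper
+    read_data, the FakeImporter class and the file name table.txt are
+    hypothetical; only the __importer__ attribute and the get_data(name)
+    method come from the text above.  The fallback to a file next to
+    __file__ is an assumption about what the client would otherwise do.
+

```python
import os
import types


def read_data(module, name):
    """Read a data file for module, preferring importer.get_data()."""
    importer = getattr(module, "__importer__", None)
    get_data = getattr(importer, "get_data", None)
    if get_data is not None:
        return get_data(name)
    # Neither __importer__ nor get_data is available: fall back to
    # a file stored next to the module on the file system.
    path = os.path.join(os.path.dirname(module.__file__), name)
    with open(path, "rb") as f:
        return f.read()


class FakeImporter:
    """Hypothetical importer that offers the optional get_data()."""

    def get_data(self, name):
        return "from importer: " + name


# A stand-in for a module loaded through an import hook.
demo_module = types.SimpleNamespace(
    __importer__=FakeImporter(), __file__="<FakeImporter>")
print(read_data(demo_module, "table.txt"))
```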
+
+    Requiring loaders to set the module's __importer__ attribute means
+    that the loader will not get thrown away once the load is complete.
+    This increases memory usage, and stops loaders from being
+    lightweight, "throwaway" objects.  As loader objects are not
+    required to offer any useful functionality (any such functionality,
+    such as the zipimport get_data() method mentioned above, is
+    optional) it is not clear that the __importer__ attribute will be
+    helpful, in practice.
+
+    On the other hand, importer objects are mostly permanent, as they
+    live or are kept alive on sys.meta_path, sys.path_importer_cache or
+    sys.path, so for a loader to keep a reference to the importer costs
+    us nothing extra.  Whether loaders will ever need to carry so much
+    independent state for this to become a real issue is questionable.
+
+
+Implementation
+
+    A C implementation is available as SourceForge patch 652586.
+    http://www.python.org/sf/652586
+
+
+References
+
+    [1] Installer by Gordon McMillan
+    http://www.mcmillan-inc.com/install1.html
+    [2] PEP 273, Import Modules from Zip Archives, Ahlstrom
+    http://www.python.org/peps/pep-0273.html
+    [3] The Freeze tool
+    Tools/freeze/ in a Python source distribution
+    [4] Squeeze
+    http://starship.python.net/crew/fredrik/ipa/squeeze.htm
+    [5] py2exe by Thomas Heller
+    http://py2exe.sourceforge.net/
+    [6] imp.set_frozenmodules() patch
+    http://www.python.org/sf/642578
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+End: