PEP: 451
Title: A ModuleSpec Type for the Import System
Version: $Revision$
Last-Modified: $Date$
Author: Eric Snow
Discussions-To: import-sig@python.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 8-Aug-2013
Python-Version: 3.4
Post-History: 8-Aug-2013, 28-Aug-2013, 18-Sep-2013, 24-Sep-2013, 4-Oct-2013
Resolution:


Abstract
========

This PEP proposes to add a new class to importlib.machinery called
"ModuleSpec".  It will provide all the import-related information used
to load a module and will be available without needing to load the
module first.  Finders will directly provide a module's spec instead of
a loader (which they will continue to provide indirectly).  The import
machinery will be adjusted to take advantage of module specs, including
using them to load modules.


Terms and Concepts
==================

The changes in this proposal are an opportunity to make several
existing terms and concepts more clear, whereas currently they are
(unfortunately) ambiguous.  New concepts are also introduced in this
proposal.  Finally, it's worth explaining a few other existing terms
with which people may not be so familiar.  For the sake of context, here
is a brief summary of all three groups of terms and concepts.  A more
detailed explanation of the import system is found at
[import_system_docs]_.

name
----

In this proposal, a module's "name" refers to its fully-qualified name,
meaning the fully-qualified name of the module's parent (if any) joined
to the simple name of the module by a period.

finder
------

A "finder" is an object that identifies the loader that the import
system should use to load a module.  Currently this is accomplished by
calling the finder's find_module() method, which returns the loader.

Finders are strictly responsible for providing the loader, which they do
through their find_module() method.  The import system then uses that
loader to load the module.

loader
------

A "loader" is an object that is used to load a module during import.
Currently this is done by calling the loader's load_module() method.  A
loader may also provide APIs for getting information about the modules
it can load, as well as about data from sources associated with such a
module.

Right now loaders (via load_module()) are responsible for certain
boilerplate, import-related operations.  These are:

1. perform some (module-related) validation;
2. create the module object;
3. set import-related attributes on the module;
4. "register" the module to sys.modules;
5. exec the module;
6. clean up in the event of failure while loading the module.

This all takes place during the import system's call to
Loader.load_module().

origin
------

This is a new term and concept.  The idea of it exists subtly in the
import system already, but this proposal makes the concept explicit.

"origin" in an import context means the system (or resource within a
system) from which a module originates.  For the purposes of this
proposal, "origin" is also a string which identifies such a resource or
system.  "origin" is applicable to all modules.

For example, the origin for built-in and frozen modules is the
interpreter itself.  The import system already identifies this origin as
"built-in" and "frozen", respectively.  This is demonstrated in the
following module repr: "<module 'sys' (built-in)>".

In fact, the module repr is already a relatively reliable, though
implicit, indicator of a module's origin.  Other modules also indicate
their origin through other means, as described in the entry for
"location".

It is up to the loader to decide on how to interpret and use a module's
origin, if at all.

location
--------

This is a new term.  However the concept already exists clearly in the
import system, as associated with the ``__file__`` and ``__path__``
attributes of modules, as well as the name/term "path" elsewhere.

A "location" is a resource or "place", rather than a system at large,
from which a module is loaded.  It qualifies as an "origin".  Examples
of locations include filesystem paths and URLs.  A location is
identified by the name of the resource, but may not necessarily identify
the system to which the resource pertains.  In such cases the loader
would have to identify the system itself.

In contrast to other kinds of module origin, a location cannot be
inferred by the loader just by the module name.  Instead, the loader
must be provided with a string to identify the location, usually by the
finder that generates the loader.  The loader then uses this information
to locate the resource from which it will load the module.  In theory
you could load the module at a given location under various names.

The most common examples of locations in the import system are the
files from which source and extension modules are loaded.  For these
modules the location is identified by the string in the ``__file__``
attribute.  Although ``__file__`` isn't particularly accurate for some
modules (e.g. zipped), it is currently the only way that the import
system indicates that a module has a location.

A module that has a location may be called "locatable".

cache
-----

The import system stores compiled modules in the __pycache__ directory
as an optimization.  This module cache that we use today was provided by
PEP 3147.  For this proposal, the relevant API for module caching is the
``__cached__`` attribute of modules and the cache_from_source() function
in importlib.util.  Loaders are responsible for putting modules into the
cache (and loading out of the cache).  Currently the cache is only used
for compiled source modules.  However, loaders may take advantage of
the module cache for other kinds of modules.

package
-------

The concept does not change, nor does the term.  However, the
distinction between modules and packages is mostly superficial.
Packages *are* modules.
They simply have a ``__path__`` attribute and import may add
attributes bound to submodules.  The typically perceived difference is a
source of confusion.  This proposal explicitly de-emphasizes the
distinction between packages and modules where it makes sense to do so.


Motivation
==========

The import system has evolved over the lifetime of Python.  In late 2002
PEP 302 introduced standardized import hooks via finders and
loaders and sys.meta_path.  The importlib module, introduced
with Python 3.1, now exposes a pure Python implementation of the APIs
described by PEP 302, as well as of the full import system.  It is now
much easier to understand and extend the import system.  While a benefit
to the Python community, this greater accessibility also presents a
challenge.

As more developers come to understand and customize the import system,
any weaknesses in the finder and loader APIs will be more impactful.  So
the sooner we can address any such weaknesses in the import system, the
better...and there are a couple we can take care of with this proposal.

Firstly, any time the import system needs to save information about a
module we end up with more attributes on module objects that are
generally only meaningful to the import system.  It would be nice to
have a per-module namespace in which to put future import-related
information and to pass around within the import system.  Secondly,
there's an API void between finders and loaders that causes undue
complexity when encountered.  The PEP 420 (namespace packages)
implementation had to work around this.  The complexity surfaced again
during recent efforts on a separate proposal. [ref_files_pep]_

The `finder`_ and `loader`_ sections above detail the current
responsibilities of both.  Notably, loaders are not required to provide
any of the functionality of their load_module() through other methods.
Thus, though the import-related information about a module is likely
available without loading the module, it is not otherwise exposed.

Furthermore, the requirements associated with load_module() are
common to all loaders and mostly are implemented in exactly the same
way.  This means every loader has to duplicate the same boilerplate
code.  importlib.util provides some tools that help with this, but
it would be more helpful if the import system simply took charge of
these responsibilities.  The trouble is that this would limit the degree
of customization that load_module() could easily continue to
facilitate.

More importantly, while a finder *could* provide the information that
the loader's load_module() would need, it currently has no consistent
way to get it to the loader.  This is a gap between finders and loaders
which this proposal aims to fill.

Finally, when the import system calls a finder's find_module(), the
finder makes use of a variety of information about the module that is
useful outside the context of the method.  Currently the options are
limited for persisting that per-module information past the method call,
since it only returns the loader.  Popular options for this limitation
are to store the information in a module-to-info mapping somewhere on
the finder itself, or store it on the loader.

Unfortunately, loaders are not required to be module-specific.  On top
of that, some of the useful information finders could provide is
common to all finders, so ideally the import system could take care of
those details.  This is the same gap as before between finders and
loaders.

As an example of complexity attributable to this flaw, the
implementation of namespace packages in Python 3.3 (see PEP 420) added
FileFinder.find_loader() because there was no good way for
find_module() to provide the namespace search locations.

The answer to this gap is a ModuleSpec object that contains the
per-module information and takes care of the boilerplate functionality
involved with loading the module.


Specification
=============

The goal is to address the gap between finders and loaders while
changing as little of their semantics as possible.  Though some
functionality and information is moved to the new ModuleSpec type,
their behavior should remain the same.  However, for the sake of clarity
the finder and loader semantics will be explicitly identified.

Here is a high-level summary of the changes described by this PEP.  More
detail is available in later sections.

importlib.machinery.ModuleSpec (new)
------------------------------------

A specification for a module's import-system-related state.  See the
`ModuleSpec`_ section below for a more detailed description.

* ModuleSpec(name, loader, \*, origin=None, loader_state=None, is_package=None)

Attributes:

* name - a string for the fully-qualified name of the module.
* loader - the loader to use for loading.
* origin - the name of the place from which the module is loaded,
  e.g. "built-in" for built-in modules and the filename for modules
  loaded from source.
* submodule_search_locations - list of strings for where to find
  submodules, if a package (None otherwise).
* loader_state - a container of extra module-specific data for use
  during loading.
* cached (property) - a string for where the compiled module should be
  stored.
* parent (RO-property) - the fully-qualified name of the package to
  which the module belongs as a submodule (or None).
* has_location (RO-property) - a flag indicating whether or not the
  module's "origin" attribute refers to a location.

Instance Methods:

* module_repr() - provide a repr string for the spec'ed module;
  non-locatable modules will use their origin (e.g. "built-in").
* init_module_attrs(module) - set any of a module's import-related
  attributes that aren't already set.

importlib.util Additions
------------------------

These are ModuleSpec factory functions, meant as a convenience for
finders.  See the `Factory Functions`_ section below for more detail.

* spec_from_file_location(name, location, \*, loader=None,
  submodule_search_locations=None) - build a spec from file-oriented
  information and loader APIs.
* spec_from_loader(name, loader, \*, origin=None, is_package=None) -
  build a spec with missing information filled in by using loader APIs.

Other API Additions
-------------------

* importlib.find_spec(name, path=None) will work exactly the same as
  importlib.find_loader() (which it replaces), but return a spec instead
  of a loader.

For loaders:

* importlib.abc.Loader.exec_module(module) will execute a module in its
  own namespace.  It replaces importlib.abc.Loader.load_module(), taking
  over its module execution functionality.
* importlib.abc.Loader.create_module(spec) (optional) will return the
  module to use for loading.

For modules:

* Module objects will have a new attribute: ``__spec__``.

API Changes
-----------

* InspectLoader.is_package() will become optional.

Deprecations
------------

* importlib.abc.MetaPathFinder.find_module()
* importlib.abc.PathEntryFinder.find_module()
* importlib.abc.PathEntryFinder.find_loader()
* importlib.abc.Loader.load_module()
* importlib.abc.Loader.module_repr()
* importlib.util.set_package()
* importlib.util.set_loader()
* importlib.find_loader()

Removals
--------

These were introduced prior to Python 3.4's release, so they can simply
be removed.

* importlib.abc.Loader.init_module_attrs()
* importlib.util.module_to_load()

Other Changes
-------------

* The import system implementation in importlib will be changed to make
  use of ModuleSpec.
* importlib.reload() will make use of ModuleSpec.
* A module's import-related attributes (other than ``__spec__``) will no
  longer be used directly by the import system during that module's
  import.  However, this does not impact use of those attributes
  (e.g. ``__path__``) when loading other modules (e.g. submodules).
* Import-related attributes should no longer be added to modules
  directly, except by the import system.
* The module type's ``__repr__()`` will be a thin wrapper around a pure
  Python implementation which will leverage ModuleSpec.
* The spec for the ``__main__`` module will reflect the appropriate
  name and origin.

Backward-Compatibility
----------------------

* If a finder does not define find_spec(), a spec is derived from
  the loader returned by find_module().
* PathEntryFinder.find_loader() still takes priority over
  find_module().
* Loader.load_module() is used if exec_module() is not defined.

What Will not Change?
---------------------

* The syntax and semantics of the import statement.
* Existing finders and loaders will continue to work normally.
* The import-related module attributes will still be initialized with
  the same information.
* Finders will still create loaders (now storing them in specs).
* Loader.load_module(), if a loader defines it, will have all the
  same requirements and may still be called directly.
* Loaders will still be responsible for module data APIs.
* importlib.reload() will still overwrite the import-related attributes.

Responsibilities
----------------

Here's a quick breakdown of where responsibilities lie after this PEP.

finders:

* create loader
* create spec

loaders:

* create module (optional)
* execute module

ModuleSpec:

* orchestrate module loading
* boilerplate for module loading, including managing sys.modules and
  setting import-related attributes
* create module if loader doesn't
* call loader.exec_module(), passing in the module in which to exec
* contain all the information the loader needs to exec the module
* provide the repr for modules


What Will Existing Finders and Loaders Have to Do Differently?
==============================================================

Immediately?  Nothing.  The status quo will be deprecated, but will
continue working.  However, here are the things that the authors of
finders and loaders should change relative to this PEP:

* Implement find_spec() on finders.
* Implement exec_module() on loaders, if possible.

The ModuleSpec factory functions in importlib.util are intended to be
helpful for converting existing finders.  spec_from_loader() and
spec_from_file_location() are both straightforward utilities in this
regard.  In the case where loaders already expose methods for creating
and preparing modules, spec_from_module() may be useful to the
corresponding finder.

For existing loaders, exec_module() should be a relatively direct
conversion from the non-boilerplate portion of load_module().  In some
uncommon cases the loader should also implement create_module().


ModuleSpec Users
================

ModuleSpec objects have 3 distinct target audiences: Python itself,
import hooks, and normal Python users.

Python will use specs in the import machinery, in interpreter startup,
and in various standard library modules.  Some modules are
import-oriented, like pkgutil, and others are not, like pickle and
pydoc.  In all cases, the full ModuleSpec API will get used.

Import hooks (finders and loaders) will make use of the spec in specific
ways.  First of all, finders may use the spec factory functions in
importlib.util to create spec objects.  They may also directly adjust
the spec attributes after the spec is created.  Secondly, the finder may
bind additional information to the spec (in loader_state) for the
loader to consume during module creation/execution.  Finally, loaders
will make use of the attributes on a spec when creating and/or executing
a module.

Python users will be able to inspect a module's ``__spec__`` to get
import-related information about the object.  Generally, Python
applications and interactive users will not be using the ``ModuleSpec``
factory functions nor any of the instance methods.
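
As a rough illustration of that last audience, the sketch below inspects
a module's ``__spec__`` from ordinary user code.  It assumes an
interpreter in which this proposal is implemented (Python 3.4 or later)
and uses the standard library's ``json`` package only as a convenient
example.  Note one assumption that differs from the text of this PEP: in
released Python versions find_spec() is exposed as
``importlib.util.find_spec()`` rather than ``importlib.find_spec()``.

```python
import importlib.util
import json

# Every imported module carries the spec it was loaded from.
spec = json.__spec__

print(spec.name)          # fully-qualified module name
print(spec.origin)        # for a source module, the path to its file
print(spec.has_location)  # whether origin refers to a location
print(spec.parent)        # name of the package the module belongs to
print(spec.submodule_search_locations)  # set for packages, else None

# A spec can also be looked up by name.  In released Python this
# function lives in importlib.util (an assumption relative to the PEP
# text, which places it at the top level of importlib).
found = importlib.util.find_spec("json.decoder")
print(found.name, found.parent)
```

None of the factory functions or instance methods are needed for this
kind of read-only inspection.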


How Loading Will Work
=====================

Here is an outline of what ModuleSpec does during loading::

   def load(self):
       # Legacy loaders: fall back to load_module(), which does its
       # own boilerplate.
       if not hasattr(self.loader, 'exec_module'):
           module = self.loader.load_module(self.name)
           self.init_module_attrs(module)
           return sys.modules[self.name]

       # Let the loader create the module, or use the default type.
       module = None
       if hasattr(self.loader, 'create_module'):
           module = self.loader.create_module(self)
       if module is None:
           module = ModuleType(self.name)
       self.init_module_attrs(module)

       # Register before executing; unregister again on failure.
       sys.modules[self.name] = module
       try:
           self.loader.exec_module(module)
       except BaseException:
           try:
               del sys.modules[self.name]
           except KeyError:
               pass
           raise
       return sys.modules[self.name]

Note: no "load" method is actually implemented as part of the public
ModuleSpec API.

These steps are exactly what Loader.load_module() is already
expected to do.  Loaders will thus be simplified since they will only
need to implement exec_module().

Note that we must return the module from sys.modules.  During loading
the module may have replaced itself in sys.modules.  Since we don't have
a post-import hook API to accommodate the use case, we have to deal with
it.  However, in the replacement case we do not worry about setting the
import-related module attributes on the object.  The module writer is on
their own if they are doing this.


ModuleSpec
==========

Attributes
----------

Each of the following names is an attribute on ModuleSpec objects.  A
value of None indicates "not set".  This contrasts with module
objects where the attribute simply doesn't exist.  Most of the
attributes correspond to the import-related attributes of modules.  Here
is the mapping.  The reverse of this mapping is used by
ModuleSpec.init_module_attrs().

========================== ==============
On ModuleSpec              On Modules
========================== ==============
name                       __name__
loader                     __loader__
parent                     __package__
origin                     __file__*
cached                     __cached__*,**
submodule_search_locations __path__**
loader_state               \-
has_location               \-
========================== ==============

| \* Set on the module only if spec.has_location is true.
| \*\* Set on the module only if the spec attribute is not None.

While parent and has_location are read-only properties, the remaining
attributes can be replaced after the module spec is created and even
after import is complete.  This allows for unusual cases where directly
modifying the spec is the best option.  However, typical use should not
involve changing the state of a module's spec.

**origin**

"origin" is a string for the name of the place from which the module
originates.  See `origin`_ above.  Aside from the informational value,
it is also used in module_repr().  In the case of a spec where
"has_location" is true, ``__file__`` is set to the value of "origin".
For built-in modules "origin" would be set to "built-in".

**has_location**

As explained in the `location`_ section above, many modules are
"locatable", meaning there is a corresponding resource from which the
module will be loaded and that resource can be described by a string.
In contrast, non-locatable modules can't be loaded in this fashion, e.g.
built-in modules and modules dynamically created in code.  For these,
the name is the only way to access them, so they have an "origin" but
not a "location".

"has_location" is true if the module is locatable.  In that case the
spec's origin is used as the location and ``__file__`` is set to
spec.origin.  If additional location information is required (e.g.
zipimport), that information may be stored in spec.loader_state.

"has_location" may be implied from the existence of a get_data() method
on the loader.

Incidentally, not all locatable modules will be cache-able, but most
will.
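
The difference between an "origin" and a "location" can be seen with a
short sketch.  It assumes an interpreter that implements this proposal;
the module name ``example_mod`` and the path ``/tmp/example_mod.py`` are
hypothetical and the file does not need to exist, since building the
spec here only manipulates strings.

```python
import importlib.util
import sys

# A built-in module has an origin but no location, so the import
# system does not set __file__ on it.
builtin_spec = sys.__spec__
print(builtin_spec.origin)        # "built-in"
print(builtin_spec.has_location)  # False
print(hasattr(sys, "__file__"))   # False

# A spec built from a (hypothetical) file path is locatable: its
# origin doubles as the location that would end up in __file__.
file_spec = importlib.util.spec_from_file_location(
    "example_mod", "/tmp/example_mod.py")
print(file_spec.origin)
print(file_spec.has_location)     # True
print(file_spec.cached)           # where compiled bytecode would be stored
```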

**submodule_search_locations**

The list of location strings, typically directory paths, in which to
search for submodules.  If the module is a package this will be set to
a list (even an empty one).  Otherwise it is None.

The name of the corresponding module attribute, ``__path__``, is
relatively ambiguous.  Instead of mirroring it, we use a more explicit
attribute name that makes the purpose clear.

**loader_state**

A finder may set loader_state to any value to provide additional
data for the loader to use during loading.  A value of None is the
default and indicates that there is no additional data.  Otherwise it
can be set to any object, such as a dict, list, or
types.SimpleNamespace, containing the relevant extra information.

For example, zipimporter could use it to pass the zip archive name
to the loader directly, rather than needing to derive it from origin
or create a custom loader for each find operation.

loader_state is meant for use by the finder and corresponding loader.
It is not guaranteed to be a stable resource for any other use.

Factory Functions
-----------------

**spec_from_file_location(name, location, \*, loader=None, submodule_search_locations=None)**

Build a spec from file-oriented information and loader APIs.

* "origin" will be set to the location.
* "has_location" will be set to True.
* "cached" will be set to the result of calling cache_from_source().

* "origin" can be deduced from loader.get_filename() (if "location" is
  not passed in).
* "loader" can be deduced from suffix if the location is a filename.
* "submodule_search_locations" can be deduced from loader.is_package()
  and from os.path.dirname(location) if location is a filename.

**spec_from_loader(name, loader, \*, origin=None, is_package=None)**

Build a spec with missing information filled in by using loader APIs.

* "has_location" can be deduced from loader.get_data.
* "origin" can be deduced from loader.get_filename().
* "submodule_search_locations" can be deduced from loader.is_package()
  and from os.path.dirname(location) if location is a filename.

**spec_from_module(module, loader=None)**

Build a spec based on the import-related attributes of an existing
module.  The spec attributes are set to the corresponding import-related
module attributes.  See the table in `Attributes`_.

Omitted Attributes and Methods
------------------------------

There is no "PathModuleSpec" subclass of ModuleSpec that separates out
has_location, cached, and submodule_search_locations.  While that might
make the separation cleaner, module objects don't have that distinction.
ModuleSpec will support both cases equally well.

While "is_package" would be a simple additional attribute (aliasing
self.submodule_search_locations is not None), it perpetuates the
artificial (and mostly erroneous) distinction between modules and
packages.

Conceivably, a ModuleSpec.load() method could optionally take a list of
modules with which to interact instead of sys.modules.  That
capability is left out of this PEP, but may be pursued separately at
some other time, including relative to PEP 406 (import engine).

Likewise load() could be leveraged to implement multi-version
imports.  While interesting, doing so is outside the scope of this
proposal.

Others:

* Add ModuleSpec.submodules (RO-property) - returns possible submodules
  relative to the spec.
* Add ModuleSpec.loaded (RO-property) - the module in sys.modules, if
  any.
* Add ModuleSpec.data - a descriptor that wraps the data API of the
  spec's loader.
* Also see [cleaner_reload_support]_.

Backward Compatibility
----------------------

ModuleSpec doesn't have any.  This would be a different story if
Finder.find_module() were to return a module spec instead of loader.
In that case, specs would have to act like the loader that would have
been returned instead.  Doing so would be relatively simple, but is an
unnecessary complication.  It was part of earlier versions of this PEP.

Subclassing
-----------

Subclasses of ModuleSpec are allowed, but should not be necessary.
Simply setting loader_state or adding functionality to a custom
finder or loader will likely be a better fit and should be tried first.
However, as long as a subclass still fulfills the requirements of the
import system, objects of that type are completely fine as the return
value of Finder.find_spec().


Existing Types
==============

Module Objects
--------------

Other than adding ``__spec__``, none of the import-related module
attributes will be changed or deprecated, though some of them could be;
any such deprecation can wait until Python 4.

A module's spec will not be kept in sync with the corresponding
import-related attributes.  Though they may differ, in practice they
will typically be the same.

One notable exception is the case where a module is run as a script by
using the ``-m`` flag.  In that case ``module.__spec__.name`` will
reflect the actual module name while ``module.__name__`` will be
``__main__``.

A module's spec is not guaranteed to be identical between two modules
with the same name.  Likewise there is no guarantee that successive
calls to importlib.find_spec() will return the same object or even an
equivalent object, though at least the latter is likely.

Finders
-------

Finders are still responsible for creating the loader.  That loader will
now be stored in the module spec returned by find_spec() rather
than returned directly.  As is currently the case without the PEP, if a
loader would be costly to create, that loader can be designed to defer
the cost until later.

**MetaPathFinder.find_spec(name, path=None)**

**PathEntryFinder.find_spec(name)**

Finders will return ModuleSpec objects when find_spec() is
called.  This new method replaces find_module() and
find_loader() (in the PathEntryFinder case).  If a finder does
not have find_spec(), find_module() and find_loader() are
used instead, for backward-compatibility.

Adding yet another similar method to finders is a case of practicality.
find_module() could be changed to return specs instead of loaders.
This is tempting because the import APIs have suffered enough,
especially considering PathEntryFinder.find_loader() was just
added in Python 3.3.  However, the extra complexity and a
less-than-explicit method name aren't worth it.

Namespace Packages
------------------

Currently a path entry finder may return (None, portions) from
find_loader() to indicate it found part of a possible namespace
package.  To achieve the same effect, find_spec() must return a spec
with "loader" set to None (a.k.a. not set) and with
submodule_search_locations set to the same portions as were provided by
find_loader().  It's up to PathFinder how to handle such specs.

Loaders
-------

**Loader.exec_module(module)**

Loaders will have a new method, exec_module().  Its only job is to
"exec" the module and consequently populate the module's namespace.  It
is not responsible for creating or preparing the module object, nor for
any cleanup afterward.  It has no return value.  exec_module() will be
used during both loading and reloading.

exec_module() should properly handle the case where it is called more
than once.  For some kinds of modules this may mean raising ImportError
every time after the first time the method is called.  This is
particularly relevant for reloading, where some kinds of modules do not
support in-place reloading.

**Loader.create_module(spec)**

Loaders may also implement create_module() that will return a
new module to exec.  It may return None to indicate that the default
module creation code should be used.  One use case, though atypical, for
create_module() is to provide a module that is a subclass of the builtin
module type.  Most loaders will not need to implement create_module().

create_module() should properly handle the case where it is called more
than once for the same spec/module.  This may include returning None or
raising ImportError.
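
The division of labor between find_spec(), create_module() and
exec_module() can be sketched with a tiny in-memory importer.  This is
only an illustration under stated assumptions, not part of the proposal:
``SOURCES``, ``DictLoader``, ``DictFinder`` and the module name
``pep451_demo`` are hypothetical, and the ``target`` parameter on
find_spec() comes from the released importlib API rather than from the
signature given in this PEP.

```python
import importlib.abc
import importlib.util
import sys

# Hypothetical in-memory "source store" standing in for a real resource.
SOURCES = {"pep451_demo": "GREETING = 'hello from ' + __name__\n"}


class DictLoader(importlib.abc.Loader):
    def create_module(self, spec):
        # Returning None asks the import system to create the module.
        return None

    def exec_module(self, module):
        # Only populate the namespace; sys.modules handling and the
        # import-related attributes are left to the import system.
        exec(SOURCES[module.__name__], module.__dict__)


class DictFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, name, path=None, target=None):
        if name not in SOURCES:
            return None  # not ours; let the next finder try
        # The finder creates the loader and stores it in the spec.
        return importlib.util.spec_from_loader(
            name, DictLoader(), origin="in-memory")


sys.meta_path.insert(0, DictFinder())

import pep451_demo

print(pep451_demo.GREETING)         # hello from pep451_demo
print(pep451_demo.__spec__.origin)  # in-memory
```

Because the loader exposes neither get_filename() nor is_package(), the
resulting spec is non-locatable and is not treated as a package.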

.. note::

   exec_module() and create_module() should not set any import-related
   module attributes.  The fact that load_module() does is a design flaw
   that this proposal aims to correct.

Other changes:

PEP 420 introduced the optional module_repr() loader method to limit
the amount of special-casing in the module type's ``__repr__()``.  Since
this method is part of ModuleSpec, it will be deprecated on loaders.
However, if it exists on a loader it will be used exclusively.

The Loader.init_module_attrs() method, added prior to Python 3.4's
release, will be removed in favor of the same method on ModuleSpec.

However, InspectLoader.is_package() will not be deprecated even
though the same information is found on ModuleSpec.  ModuleSpec
can use it to populate its own is_package if that information is
not otherwise available.  Still, it will be made optional.

In addition to executing a module during loading, loaders will still be
directly responsible for providing APIs concerning module-related data.


Other Changes
=============

* The various finders and loaders provided by importlib will be
  updated to comply with this proposal.
* The spec for the ``__main__`` module will reflect how the interpreter
  was started.  For instance, with ``-m`` the spec's name will be that
  of the run module, while ``__main__.__name__`` will still be
  "__main__".
* We will add importlib.find_spec() to mirror importlib.find_loader()
  (which becomes deprecated).
* importlib.reload() is changed to use ModuleSpec.load().
* importlib.reload() will now make use of the per-module import lock.


Reference Implementation
========================

A reference implementation will be available at
http://bugs.python.org/issue18864.


Open Issues
===========

\* Impact on some kinds of lazy loading modules. [lazy_import_concerns]_

This should not be an issue since the PEP does not change the semantics
of this behavior.


Implementation Notes
====================

\* The implementation of this PEP needs to be cognizant of its impact on
pkgutil (and setuptools).  pkgutil has some generic function-based
extensions to PEP 302 which may break if importlib starts wrapping
loaders without the tools' knowledge.

\* Other modules to look at: runpy (and pythonrun.c), pickle, pydoc,
inspect.

For instance, pickle should be updated in the ``__main__`` case to look
at ``module.__spec__.name``.


References
==========

.. [ref_files_pep]
   http://mail.python.org/pipermail/import-sig/2013-August/000658.html

.. [import_system_docs]
   http://docs.python.org/3/reference/import.html

.. [cleaner_reload_support]
   https://mail.python.org/pipermail/import-sig/2013-September/000735.html

.. [lazy_import_concerns]
   https://mail.python.org/pipermail/python-dev/2013-August/128129.html


Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: