PEP: 3147
Title: PYC Repository Directories
Version: $Revision$
Last-Modified: $Date$
Author: Barry Warsaw <barry@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2009-12-16
Python-Version: 3.2
Post-History: 2010-01-30, 2010-02-25


Abstract
========

This PEP describes an extension to Python's import mechanism which
improves sharing of Python source code files among multiple installed
versions of the Python interpreter. It does this by allowing more
than one byte compilation file (.pyc file) to be co-located with the
Python source file (.py file). The extension described here can also
be used to support different Python compilation caches, such as JIT
output that may be produced by an Unladen Swallow [1]_ enabled
CPython.


Background
==========

CPython compiles its source code into "byte code", and for performance
reasons, it caches this byte code on the file system whenever the
source file changes. This makes loading of Python modules much
faster because the compilation phase can be bypassed. When your
source file is `foo.py`, CPython caches the byte code in a `foo.pyc`
file right next to the source.

Byte code files contain two 32-bit numbers followed by the marshaled
[2]_ code object. The 32-bit numbers represent a magic number and a
timestamp. The magic number changes whenever Python changes the byte
code format, e.g. by adding new byte codes to its virtual machine.
This ensures that pyc files built for previous versions of the VM
won't cause problems. The timestamp is used to make sure that the pyc
file is not older than the py file that was used to create it. When
either the magic number or the timestamp does not match, the py file
is recompiled and a new pyc file is written.

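
To make the header layout concrete, here is a minimal sketch that
packs and inspects the eight-byte header described above. It assumes
little-endian fields and treats a "matching" timestamp as equality
with the source's modification time. The names `HEADER`,
`read_header` and `is_matching` are illustrative only and are not
part of CPython, and the sketch follows only the layout described in
this PEP.

```python
import struct

# Assumed layout, following the description above: a 4-byte magic number,
# then a 4-byte little-endian timestamp, then the marshaled code object.
HEADER = struct.Struct("<4sI")


def read_header(data):
    """Return (magic, mtime) from the first eight bytes of pyc data."""
    magic, mtime = HEADER.unpack_from(data)
    return magic, mtime


def is_matching(data, current_magic, source_mtime):
    """Return True when both header fields agree with the current state.

    Assumption: "matching" means the stored timestamp equals the source
    file's modification time truncated to whole seconds.
    """
    magic, mtime = read_header(data)
    return magic == current_magic and mtime == int(source_mtime)
```

Given the bytes of a pyc file, `is_matching(data, magic, mtime)`
answers whether the cached byte code may be reused. Under this
layout the marshaled code object starts at offset `HEADER.size`.
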

In practice, it is well known that pyc files are not compatible across
Python major releases. A reading of import.c [3]_ in the Python
source code shows that within recent memory, every new CPython major
release has bumped the pyc magic number.


Rationale
=========

Linux distributions such as Ubuntu [4]_ and Debian [5]_ provide more
than one Python version at the same time to their users. For example,
Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1,
with Python 2.6 being the default.

This causes a conflict for Python source files installed by the
system (including third party packages), because you cannot compile a
single Python source file for more than one Python version at a time.
Thus if your system wanted to install a `/usr/share/python/foo.py`, it
could not create a `/usr/share/python/foo.pyc` file usable across all
installed Python versions.

Furthermore, in order to ease the burden on operating system packagers
for these distributions, the distribution packages do not contain
Python version numbers [6]_; they are shared across all Python
versions installed on the system. Putting Python version numbers in
the packages would be a maintenance nightmare, since all the packages
- *and their dependencies* - would have to be updated every time a new
Python release was added to or removed from the distribution. Because
of the sheer number of packages available, this amount of work is
infeasible.

Even C extensions can be source compatible across multiple versions of
Python. Compiled extension modules are usually not compatible though,
and PEP 384 [7]_ has been proposed to address this by defining a
stable ABI for extension modules.

Because these distributions cannot share pyc files, elaborate
mechanisms have been developed to put the resulting pyc files in
non-shared locations while the source code is still shared. Examples
include the symlink-based Debian regimes python-support [8]_ and
python-central [9]_. These approaches make for much more complicated,
fragile, inscrutable, and fragmented policies for delivering Python
applications to a wide range of users. Arguably more users get Python
from their operating system vendor than from upstream tarballs. Thus,
solving this pyc sharing problem for CPython is a high priority for
such vendors.

This PEP proposes a solution to this problem.


Proposal
========

Python's import machinery is extended to write and search for byte
code cache files in a single directory inside every Python package
directory. This directory will be called `__pycache__`.

Further, pyc file names will contain a magic string that
differentiates the Python version they were compiled for. This allows
multiple byte compiled cache files to co-exist for a single Python
source file.

This scheme has the added benefit of reducing the clutter in a Python
package directory.

What would this look like in practice?

Let's say we have a Python package named `alpha` which contains a
sub-package named `beta`. The source directory layout might look like
this::

    alpha/
        __init__.py
        one.py
        two.py
        beta/
            __init__.py
            three.py
            four.py

After byte compiling this package with Python 3.2, you would see the
following layout::

    alpha/
        __pycache__/
            __init__.cpython-32.pyc
            one.cpython-32.pyc
            two.cpython-32.pyc
        __init__.py
        one.py
        two.py
        beta/
            __pycache__/
                __init__.cpython-32.pyc
                three.cpython-32.pyc
                four.cpython-32.pyc
            __init__.py
            three.py
            four.py

Let's say that two new versions of Python are installed, one is Python
3.3 and another is Unladen Swallow. After byte compilation, the file
system would look like this::

    alpha/
        __pycache__/
            __init__.cpython-32.pyc
            __init__.cpython-33.pyc
            __init__.unladen-10.pyc
            one.cpython-32.pyc
            one.cpython-33.pyc
            one.unladen-10.pyc
            two.cpython-32.pyc
            two.cpython-33.pyc
            two.unladen-10.pyc
        __init__.py
        one.py
        two.py
        beta/
            __pycache__/
                __init__.cpython-32.pyc
                __init__.cpython-33.pyc
                __init__.unladen-10.pyc
                three.cpython-32.pyc
                three.cpython-33.pyc
                three.unladen-10.pyc
                four.cpython-32.pyc
                four.cpython-33.pyc
                four.unladen-10.pyc
            __init__.py
            three.py
            four.py

As you can see, as long as the Python version identifier string is
unique, any number of pyc files can co-exist. These identifier
strings are described in more detail below.

A nice property of this layout is that the `__pycache__` directories
can generally be ignored, such that a normal directory listing would
show something like this::

    alpha/
        __pycache__/
        __init__.py
        one.py
        two.py
        beta/
            __pycache__/
            __init__.py
            three.py
            four.py

This is much less cluttered than even today's Python.


Python behavior
===============

When Python searches for a module to import (say `foo`), it may find
one of several situations. As per current Python rules, the term
"matching pyc" means that the magic number matches the current
interpreter's magic number, and the source file is not newer than the
`pyc` file.


Case 1: The first import
------------------------

When Python is asked to import module `foo`, it searches for a
`foo.py` file (or `foo` package, but that's not important for this
discussion) along its `sys.path`. When Python locates the `foo.py`
file it will look for a `__pycache__` directory in the directory where
it found the `foo.py`. If the `__pycache__` directory is missing,
Python will create it. Then it will parse and byte compile the
`foo.py` file and save the byte code in `__pycache__/foo.<magic>.pyc`,
where <magic> is defined by the Python implementation, but will be a
human readable string such as `cpython-32`.


Case 2: The second import
-------------------------

When Python is asked to import module `foo` a second time (in a
different process of course), it will again search for the `foo.py`
file along its `sys.path`. When Python locates the `foo.py` file, it
looks for a matching `__pycache__/foo.<magic>.pyc` and finding this,
it reads the byte code and continues as usual.


Case 3: __pycache__/foo.<magic>.pyc with no source
--------------------------------------------------

It's possible that the `foo.py` file somehow got removed, while
leaving the cached pyc file still on the file system. If the
`__pycache__/foo.<magic>.pyc` file exists, but the `foo.py` file used
to create it does not, Python will raise an `ImportError` when asked
to import `foo`. In other words, by default, Python will not support
importing a module unless the source file exists.

Python users who want to deploy sourceless imports are instructed to
create a custom importer that supports this behavior. Options include
importing pycs from a zip file, or locating pyc files where the py
source file would have existed. (See the Open Issues section for more
discussion.)


Case 4: legacy pyc files
------------------------

Python will ignore all legacy pyc files. In other words, if a
`foo.pyc` file exists next to the `foo.py` file, it will be ignored in
all cases, including sourceless deployments. Python users wishing to
support this use case can create a custom importer.

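
The four cases can be summarized in a short sketch. It is
illustrative only: `tagged_cache_path` and `import_action` are
hypothetical names, magic number checks are omitted, and file
modification times stand in for the timestamp stored in the pyc file.

```python
import os


def tagged_cache_path(source_path, tag):
    """Return the __pycache__ path for source_path (hypothetical helper)."""
    directory, filename = os.path.split(source_path)
    module = os.path.splitext(filename)[0]
    name = "{}.{}.pyc".format(module, tag)
    return os.path.join(directory, "__pycache__", name)


def import_action(source_path, tag):
    """Describe what the importer would do for one module."""
    if not os.path.exists(source_path):
        # Case 3: a cached file without its source never satisfies the import.
        return "ImportError"
    cached = tagged_cache_path(source_path, tag)
    # Case 4: a legacy foo.pyc next to foo.py is simply never consulted.
    if (os.path.exists(cached)
            and os.path.getmtime(cached) >= os.path.getmtime(source_path)):
        # Case 2: the source is not newer than the cache, so reuse it.
        return "load " + cached
    # Case 1 (or a stale cache): compile and write the tagged file.
    return "compile to " + cached
```

Calling `import_action('alpha/one.py', 'cpython-32')` on the layout
from the Proposal section would report a load from
`alpha/__pycache__/one.cpython-32.pyc` as long as that file is not
older than `one.py`.
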

Flow chart
==========

Here is a flow chart describing how modules are loaded:

.. image:: pep-3147-1.png
   :scale: 75


Magic identifiers
=================

pyc files inside of the `__pycache__` directories contain a magic
identifier in their file names. These are mnemonic tags for the
actual magic numbers used by the importer. For example, for Python
3.2, we could use the hexlified [10]_ magic number as a unique
identifier::

    >>> from binascii import hexlify
    >>> from imp import get_magic
    >>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
    'foo.580c0d0a.pyc'

This isn't particularly human friendly though. Instead, this PEP
proposes to add a mapping between internal magic numbers and a
user-friendly tag. Newer versions of Python can add to this mapping
so that all later Pythons know the mapping between tags and magic
numbers. By convention, the tag will contain the Python
implementation name and version nickname, where the nickname is
generally the major version number and minor version number. Magic
numbers should not change between Python micro releases, but some
other convention can be used for changes in magic number between
pre-release development versions.

For example, CPython 3.2 would have a magic identifier tag of
`cpython-32` and write pyc files like this: `foo.cpython-32.pyc`.
When the `-O` flag is used, it would write `foo.cpython-32.pyo`. For
backports of this feature to Python 2, when the `-U` flag is used, a
file such as `foo.cpython-27u.pyc` can be written.

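
As a rough sketch of such a mapping, the table below associates a
magic number with its tag and builds the cache file name from it.
`MAGIC_TAGS` and `cache_filename` are hypothetical names, and the
single entry reuses the magic value from the example above as a
stand-in for whatever number CPython 3.2 ends up using.

```python
# Hypothetical table; the only entry copies the value from the example above.
MAGIC_TAGS = {
    bytes.fromhex("580c0d0a"): "cpython-32",
}


def cache_filename(module, magic, optimized=False):
    """Build a tagged cache file name such as 'foo.cpython-32.pyc'."""
    tag = MAGIC_TAGS[magic]  # raises KeyError for an unregistered magic number
    extension = ".pyo" if optimized else ".pyc"
    return "{}.{}{}".format(module, tag, extension)
```

With this table, `cache_filename('foo', bytes.fromhex('580c0d0a'))`
gives `foo.cpython-32.pyc`, and passing `optimized=True` gives the
`.pyo` name that the `-O` flag would produce.
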

Alternative Python implementations
==================================

Alternative Python implementations such as Jython [11]_, IronPython
[12]_, PyPy [13]_, Pynie [14]_, and Unladen Swallow can also use the
`__pycache__` directory to store whatever compilation artifacts make
sense for their platforms. For example, Jython could store the class
file for the module in `__pycache__/foo.jython-32.class`.


Implementation strategy
=======================

This feature is targeted for Python 3.2, solving the problem for that
and all future versions. It may be back-ported to Python 2.7.
Vendors are free to backport the changes to earlier distributions as
they see fit.


Effects on existing code
========================

Adoption of this PEP will affect existing code and idioms, both inside
Python and outside. This section enumerates some of these effects.


__file__
---------

In Python 3, when you import a module, its `__file__` attribute points
to its source `py` file (in Python 2, it points to the `pyc` file). A
package's `__file__` points to the `py` file for its `__init__.py`.
E.g.::

    >>> import foo
    >>> foo.__file__
    'foo.py'
    # baz is a package
    >>> import baz
    >>> baz.__file__
    'baz/__init__.py'

The implementation of this PEP must ensure that `__file__` continues
to refer to the same directory level as it does today, so that common
idioms which locate files relative to `__file__` continue to work.

As part of this PEP, we will add a `__cached__` attribute to modules,
which will always point to the actual `pyc` file that was read or
written. When the environment variable `$PYTHONDONTWRITEBYTECODE` is
set, or the `-B` option is given, or if the source lives on a
read-only filesystem, then the `__cached__` attribute will point to
the location where the `pyc` file *would* have been written. This
location of course includes the `__pycache__` subdirectory in its
path.

For alternative Python implementations which do not support `pyc`
files, the `__cached__` attribute may point to whatever information
makes sense. E.g. on Jython, this might be the `.class` file for the
module: `__pycache__/foo.jython-32.class`. Some implementations may
use multiple compiled files to create the module, in which case
`__cached__` may be a tuple. The exact contents of `__cached__` are
Python implementation specific.

Alternative implementations for which this scheme does not make sense
should set the `__cached__` attribute to `None`.


File extension checks
---------------------

There exists some code which checks for files ending in `.pyc` and
simply chops off the last character to find the matching `.py` file.
This code will obviously fail once this PEP is implemented.

To support this use case, we'll add two new functions to the `imp`
module [15]_:

* `imp.cache_from_source(py_path)` -> `pyc_path`
* `imp.source_from_cache(pyc_path)` -> `py_path`

Alternative implementations are free to override these functions to
return reasonable values based on their own support for this PEP.

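
A minimal sketch of how such a pair of functions might behave is
shown here. It is not the `imp` implementation. The proposed
functions take only a path, so the explicit `tag` parameter is an
assumption made to keep the sketch independent of the running
interpreter, and only `.pyc` names are handled.

```python
import os


def cache_from_source(py_path, tag):
    """Map 'pkg/foo.py' to 'pkg/__pycache__/foo.<tag>.pyc'."""
    directory, filename = os.path.split(py_path)
    module, _ = os.path.splitext(filename)
    name = "{}.{}.pyc".format(module, tag)
    return os.path.join(directory, "__pycache__", name)


def source_from_cache(pyc_path):
    """Map 'pkg/__pycache__/foo.<tag>.pyc' back to 'pkg/foo.py'."""
    cache_dir, filename = os.path.split(pyc_path)
    directory, cache_name = os.path.split(cache_dir)
    parts = filename.split(".")
    # Expect exactly module, tag and extension inside a __pycache__ directory.
    if cache_name != "__pycache__" or len(parts) != 3 or parts[2] != "pyc":
        raise ValueError("not a __pycache__ file name: {!r}".format(pyc_path))
    return os.path.join(directory, parts[0] + ".py")
```

Code that currently strips the trailing `c` from a `.pyc` path could
call a function like `source_from_cache()` instead, which also
rejects legacy paths that do not live in a `__pycache__` directory.
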

PEP 302 loaders
---------------

Loaders defined by PEP 302 [16]_ have a `.get_filename()` method which
returns the `__file__` for a module. As part of this PEP, we will
extend this API to include a new method, `.get_paths()`, which will
return a 2-tuple containing the path to the source file and the path
to where the matching `pyc` file is (or would be).


Backports
---------

For versions of Python earlier than 3.2 (and possibly 2.7), it is
possible to backport this PEP. However, in Python 3.2 (and possibly
2.7), this behavior will be turned on by default, and in fact, it will
replace the old behavior. Backports will need to support the old
layout by default. We suggest that backports enable PEP 3147 support
through an environment variable called `$PYTHONCACHEDIR` or the
command line switch `-Xcachedir`.


Alternatives
============

PEP 304
-------

There is some overlap between the goals of this PEP and PEP 304 [17]_,
which has been withdrawn. However, PEP 304 would allow a user to
create a shadow file system hierarchy in which to store `pyc` files.
This concept of a shadow hierarchy for `pyc` files could be used to
satisfy the aims of this PEP. Although PEP 304 does not indicate
why it was withdrawn, shadow directories have a number of problems.
The location of the shadow `pyc` files would not be easily discovered
and would depend on the proper and consistent use of the
`$PYTHONBYTECODE` environment variable both by the system and by end
users. There are also global implications, meaning that while the
system might want to shadow `pyc` files, users might not want to, but
the PEP defines only an all-or-nothing approach.

As an example of the problem, a common (though fragile) Python idiom
for locating data files is to do something like this::

    from os.path import dirname, join
    import foo.bar
    data_file = join(dirname(foo.bar.__file__), 'my.dat')

This would be problematic since `foo.bar.__file__` will give the
location of the `pyc` file in the shadow directory, and it may not be
possible to find the `my.dat` file relative to the source directory
from there.


Fat byte compilation files
--------------------------

An earlier version of this PEP described "fat" Python byte code files.
These files would contain the equivalent of multiple `pyc` files in a
single `pyf` file, with a lookup table keyed off the appropriate magic
number. This was an extensible file format so that the first 5
parallel Python implementations could be supported fairly efficiently,
but with extension lookup tables available to scale `pyf` byte code
objects as large as necessary.

The fat byte compilation files were fairly complex, and inherently
introduced difficult race conditions, so the current simplification of
using directories was suggested. The same problem applies to using
zip files as the fat pyc file format.


Multiple file extensions
------------------------

The PEP author also considered an approach where multiple thin byte
compiled files lived in the same place, but used different file
extensions to designate the Python version. E.g. foo.pyc25,
foo.pyc26, foo.pyc31 etc. This was rejected because of the clutter
involved in writing so many different files. The multiple extension
approach also makes it more difficult (and an ongoing task) to update
any tools that are dependent on the file extension.


Reference implementation
========================

A pure-Python reference implementation will be written using
importlib [18]_, which may need some modifications to its API and
abstract base classes. Once the semantics are agreed upon and the
implementation details are settled, we'll port this to the C
implementation in `import.c`. We will have extensive tests that
guarantee that the pure-Python implementation and the built-in
implementation remain in sync.


Open issues
===========

Byte code only packages
-----------------------

Some users of Python distribute packages containing only the byte code
files (pyc). The use cases for this are to make it more difficult for
end-users to view the source code, and to reduce maintenance burdens
when end users casually edit the source files.

This PEP currently proposes no default support for bytecode-only
packages. The primary motivation for this is that we can reduce stat
calls if the importer only looks for .py files, making Python start-up
and import faster.

The question is how to balance the requirements of bytecode-only users
against the more universally beneficial faster start-up times that
come from requiring source files. Should all Python users pay the
extra stat call penalty in the general case for a minority use case by
default?

There are several ways out of this. Should we decide that it's
important enough to support bytecode-only packages, the semantics
would be as follows:

* If there is a traditional, non-magic-tagged .pyc file in the
  location where a .py file should be found, it will satisfy the
  import.
* The `__file__` attribute of the module will point to the .pyc file.
* The `__cached__` attribute of the module will point to the .pyc file
  too.
* The existence of a matching `__pycache__/foo.<magic>.pyc` file
  without the source py file will *not* satisfy the import. This
  means that if the source file is removed, the pyc file will be
  ignored (unlike in today's implementation).

Other ways to satisfy the bytecode-only packagers' requirements would
have less impact on the general Python user population, and include:

* Add a `-X` switch and/or environment variable to enable
  the bytecode-only search algorithm.
* Let those who want more protection against casual py hackers package
  their code in a zip file, which is supported today.
* Provide a custom importer supporting bytecode-only packages, which
  would have to be enabled explicitly by the application. Either
  Python would provide such a custom importer or it would be left to
  third parties to implement.
* Add a marker to a package's `__init__.py` file to enable
  bytecode-only imports for everything else in the package.


__cached__ vs. __compiled__
----------------------------

Guido says: "I still prefer __compiled__ over __cached__ but I don't
feel strong about it."

Barry likes `__cached__` because the more general term seems to fit
in better with future possible use cases such as JIT output from
Unladen Swallow.


References
==========

.. [1] PEP 3146

.. [2] The marshal module:
   http://www.python.org/doc/current/library/marshal.html

.. [3] import.c:
   http://svn.python.org/view/python/branches/py3k/Python/import.c?view=markup

.. [4] Ubuntu: http://www.ubuntu.com

.. [5] Debian: http://www.debian.org

.. [6] Debian Python Policy:
   http://www.debian.org/doc/packaging-manuals/python-policy/

.. [7] PEP 384

.. [8] python-support:
   http://wiki.debian.org/DebianPythonFAQ#Whatispython-support.3F

.. [9] python-central:
   http://wiki.debian.org/DebianPythonFAQ#Whatispython-central.3F

.. [10] binascii.hexlify():
   http://www.python.org/doc/current/library/binascii.html#binascii.hexlify

.. [11] Jython: http://www.jython.org/

.. [12] IronPython: http://ironpython.net/

.. [13] PyPy: http://codespeak.net/pypy/dist/pypy/doc/

.. [14] Pynie: http://code.google.com/p/pynie/

.. [15] imp: http://www.python.org/doc/current/library/imp.html

.. [16] PEP 302

.. [17] PEP 304

.. [18] importlib: http://docs.python.org/3.1/library/importlib.html


ACKNOWLEDGMENTS
===============

Barry Warsaw's original idea was for fat Python byte code files.
Martin von Loewis reviewed an early draft of the PEP and suggested the
simplification to store traditional `pyc` and `pyo` files in a
directory. Many other people reviewed early versions of this PEP and
provided useful feedback including but not limited to:

* David Malcolm
* Josselin Mouette
* Matthias Klose
* Michael Hudson
* Michael Vogt
* Piotr Ożarowski
* Scott Kitterman
* Toshio Kuratomi


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: