Updates to PYC Repository Directories, reflecting current thinking on the approach.
Barry Warsaw 2010-02-25 20:39:11 +00:00
parent 9d361d1e78
commit d3b8603bd9
3 changed files with 356 additions and 198 deletions

Two binary files not shown; one is an image that changed from 54 KiB to 47 KiB.

@ -8,7 +8,7 @@ Type: Standards Track
Content-Type: text/x-rst
Created: 2009-12-16
Python-Version: 3.2
Post-History:
Post-History: 2010-01-30, 2010-02-25
Abstract
@ -17,49 +17,75 @@ Abstract
This PEP describes an extension to Python's import mechanism which
improves sharing of Python source code files among multiple installed
different versions of the Python interpreter. It does this by
allowing many different byte compilation files (.pyc files) to be
allowing more than one byte compilation file (.pyc files) to be
co-located with the Python source file (.py file). The extension
described here can also be used to support different Python
compilation caches, such as JIT output that may be produced by an
Unladen Swallow [1]_ enabled C Python.
Background
==========
CPython compiles its source code into "byte code", and for performance
reasons, it caches this byte code on the file system whenever the
source file changes. This makes loading of Python modules much
faster because the compilation phase can be bypassed. When your
source file is `foo.py`, CPython caches the byte code in a `foo.pyc`
file right next to the source.
Byte code files contain two 32-bit numbers followed by the marshaled
[2]_ code object. The 32-bit numbers represent a magic number and a
timestamp. The magic number changes whenever Python changes the byte
code format, e.g. by adding new byte codes to its virtual machine.
This ensures that pyc files built for previous versions of the VM
won't cause problems. The timestamp is used to make sure that the pyc
file is not older than the py file that was used to create it. When
either the magic number or timestamp does not match, the py file is
recompiled and a new pyc file is written.
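As a rough illustration of the layout just described, the sketch below builds and parses an in-memory byte code file with a magic number, a timestamp, and a marshaled code object. The function names are hypothetical, and the little-endian byte order for the timestamp is an assumption based on CPython's behavior; later CPython releases use a longer header, so this applies only to the format described in this PEP.

```python
import marshal
import struct


def build_legacy_pyc(magic, mtime, code):
    """Assemble magic (4 bytes) + timestamp (32-bit) + marshaled code object."""
    return magic + struct.pack("<I", mtime) + marshal.dumps(code)


def parse_legacy_pyc(data):
    """Split the bytes back into (magic, timestamp, code object)."""
    magic = data[:4]
    (mtime,) = struct.unpack("<I", data[4:8])
    code = marshal.loads(data[8:])
    return magic, mtime, code


def needs_recompile(data, expected_magic, source_mtime):
    """True when either the magic number or the timestamp does not match."""
    magic, mtime, _ = parse_legacy_pyc(data)
    return magic != expected_magic or mtime != source_mtime
```

When `needs_recompile()` returns true, the interpreter would recompile the py file and write a fresh pyc, as the paragraph above describes.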
In practice, it is well known that pyc files are not compatible across
Python major releases. A reading of import.c [3]_ in the Python
source code proves that within recent memory, every new CPython major
release has bumped the pyc magic number.
Rationale
=========
Linux distributions such as Ubuntu [2]_ and Debian [3]_ provide more
Linux distributions such as Ubuntu [4]_ and Debian [5]_ provide more
than one Python version at the same time to their users. For example,
Ubuntu 9.10 Karmic Koala can install Python 2.5, 2.6, and 3.1, with
Python 2.6 being the default.
Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1,
with Python 2.6 being the default.
In order to ease the burden on operating system packagers for these
distributions, the distribution packages do not contain Python version
numbers [4]_; they are shared across all Python versions installed on
the system. Putting Python version numbers in the packages would be a
maintenance nightmare, since all the packages - *and their
dependencies* - would have to be updated every time a new Python
release was added or removed from the distribution. Because of the
sheer number of packages available, this amount of work is infeasible.
This causes a conflict for Python source files installed by the
system (including third party packages), because you cannot compile a
single Python source file for more than one Python version at a time.
Thus if your system wanted to install a `/usr/share/python/foo.py`, it
could not create a `/usr/share/python/foo.pyc` file usable across all
installed Python versions.
For pure Python modules, sharing is possible because upstream
maintainers typically support multiple versions of Python in a source
compatible way. In practice though, it is well known that pyc files
are not compatible across Python major releases. A reading of
import.c [5]_ in the Python source code proves that within recent
memory, every new CPython major release has bumped the pyc magic
number.
Furthermore, in order to ease the burden on operating system packagers
for these distributions, the distribution packages do not contain
Python version numbers [6]_; they are shared across all Python
versions installed on the system. Putting Python version numbers in
the packages would be a maintenance nightmare, since all the packages
- *and their dependencies* - would have to be updated every time a new
Python release was added or removed from the distribution. Because of
the sheer number of packages available, this amount of work is
infeasible.
Even C extensions can be source compatible across multiple versions of
Python. Compiled extension modules are usually not compatible though,
and PEP 384 [6]_ has been proposed to address this by defining a
and PEP 384 [7]_ has been proposed to address this by defining a
stable ABI for extension modules.
Because the distributions cannot share pyc files, elaborate mechanisms
have been developed to put the resulting pyc files in non-shared
locations while the source code is still shared. Examples include the
symlink-based Debian regimes python-support [7]_ and python-central
[8]_. These approaches make for much more complicated, fragile,
inscrutable, and fragmented policies for delivering Python
Because these distributions cannot share pyc files, elaborate
mechanisms have been developed to put the resulting pyc files in
non-shared locations while the source code is still shared. Examples
include the symlink-based Debian regimes python-support [8]_ and
python-central [9]_. These approaches make for much more complicated,
fragile, inscrutable, and fragmented policies for delivering Python
applications to a wide range of users. Arguably more users get Python
from their operating system vendor than from upstream tarballs. Thus,
solving this pyc sharing problem for CPython is a high priority for
@ -71,29 +97,106 @@ This PEP proposes a solution to this problem.
Proposal
========
Python's import machinery is extended to search for byte code cache
files in a directory co-located with the source file, but with an
extension 'pyr'. The pyr directory contains individual files with the
cached byte compilation of the source code, identical to current pyc
and pyo files. The files inside the pyr directory retain their file
extensions, but the base name is replaced by the hexlified [10]_ magic
number of the Python version the byte code is compatible with.
Python's import machinery is extended to write and search for byte
code cache files in a single directory inside every Python package
directory. This directory will be called `__pycache__`.
The file extension pyr was chosen because 'r' is a mnemonic for
'repository', and there appears to be no prior uses of the extension
[9]_.
Further, pyc files will contain a magic string that
differentiates the Python version they were compiled for. This allows
multiple byte compiled cache files to co-exist for a single Python
source file.
For example, a module `foo` with source code in `foo.py` and byte
compiled with Python 2.5, Python 2.6, Python 2.6 `-O`, Python 2.6
`-U`, and Python 3.1 would have the following file system layout::
This scheme has the added benefit of reducing the clutter in a Python
package directory.
foo.py
foo.pyr/
f2b30a0d.pyc # Python 2.5
f2d10a0d.pyc # Python 2.6
f2d10a0d.pyo # Python 2.6 -O
f2d20a0d.pyc # Python 2.6 -U
0c4f0a0d.pyc # Python 3.1
What would this look like in practice?
Let's say we have a Python package named `alpha` which contains a
sub-package named `beta`. The source directory layout might look like
this::
alpha/
__init__.py
one.py
two.py
beta/
__init__.py
three.py
four.py
After byte compiling this package with Python 3.2, you would see the
following layout::
alpha/
__pycache__
__init__.cpython-32.pyc
one.cpython-32.pyc
two.cpython-32.pyc
__init__.py
one.py
two.py
beta/
__pycache__
__init__.cpython-32.pyc
three.cpython-32.pyc
four.cpython-32.pyc
__init__.py
three.py
four.py
Let's say that two new versions of Python are installed: one is Python
3.3 and the other is Unladen Swallow. After byte compilation, the file
system would look like this::
alpha/
__pycache__
__init__.cpython-32.pyc
__init__.cpython-33.pyc
__init__.unladen-10.pyc
one.cpython-32.pyc
one.cpython-33.pyc
one.unladen-10.pyc
two.cpython-32.pyc
two.cpython-33.pyc
two.unladen-10.pyc
__init__.py
one.py
two.py
beta/
__pycache__
__init__.cpython-32.pyc
__init__.cpython-33.pyc
__init__.unladen-10.pyc
three.cpython-32.pyc
three.cpython-33.pyc
three.unladen-10.pyc
four.cpython-32.pyc
four.cpython-33.pyc
four.unladen-10.pyc
__init__.py
three.py
four.py
As you can see, as long as the Python version identifier string is
unique, any number of pyc files can co-exist. These identifier
strings are described in more detail below.
A nice property of this layout is that the `__pycache__` directories
can generally be ignored, such that a normal directory listing would
show something like this::
alpha/
__pycache__
__init__.py
one.py
two.py
beta/
__pycache__
__init__.py
three.py
four.py
This is much less cluttered than even today's Python.
Python behavior
@ -105,56 +208,105 @@ one of several situations. As per current Python rules, the term
interpreter's magic number, and the source file is not newer than the
`pyc` file.
When Python finds a `foo.py` file for which no `foo.pyc` file or
`foo.pyr` directory exists, Python will by default load the `foo.py`
file and write a `foo.pyc` file next to the source file. This is
unchanged from current behavior.
When the Python executable is given a `-R` flag, or the environment
variable `$PYTHONPYR` is set, then Python will create a `foo.pyr`
directory and write a `pyc` file to that directory with the hexlified
magic number as the base name.
Case 1: The first import
------------------------
If during import, Python finds an existing `pyc` file but no `pyr`
directory, and the `$PYTHONPYR` environment variable is not set, then
the `pyc` file is loaded as normal and no `pyr` directory is created.
When Python is asked to import module `foo`, it searches for a
`foo.py` file (or `foo` package, but that's not important for this
discussion) along its `sys.path`. When Python locates the `foo.py`
file it will look for a `__pycache__` directory in the directory where
it found the `foo.py`. If the `__pycache__` directory is missing,
Python will create it. Then it will parse and byte compile the
`foo.py` file and save the byte code in `__pycache__/foo.<magic>.pyc`,
where <magic> is defined by the Python implementation, but will be a
human readable string such as `cpython-32`.
If during import, Python finds a `pyr` directory with a matching `pyc`
file, *regardless of whether `$PYTHONPYR` is set or not*, then
`foo.pyr/<magic>.pyc` is loaded and import completes successfully.
Thus a matching `pyc` file inside a `pyr` directory always takes
precedence over a sibling `pyc` file.
If during import, Python finds a `pyr` directory that does not contain
a matching `pyc` file, and no sibling `foo.pyc` file exists, Python
will load the source file and write a sibling `foo.pyc` file, unless
the `-R` flag is given in which case a `foo.pyr/<magic>.pyc` file will
be written.
Case 2: The second import
-------------------------
Here is a flowchart illustrating the rules.
When Python is asked to import module `foo` a second time (in a
different process of course), it will again search for the `foo.py`
file along its `sys.path`. When Python locates the `foo.py` file, it
looks for a matching `__pycache__/foo.<magic>.pyc` and finding this,
it reads the byte code and continues as usual.
Case 3: __pycache__/foo.pyc with no source
------------------------------------------
It's possible that the `foo.py` file somehow got removed, while
leaving the cached pyc file still on the file system. If the
`__pycache__/foo.<magic>.pyc` file exists, but the `foo.py` file used to
create it does not, Python will raise an `ImportError` when asked to
import foo. In other words, by default, Python will not support
importing a module unless the source file exists.
Python users who want to deploy sourceless imports are instructed to
create a custom importer that supports this behavior. Options include
importing pycs from a zip file, or locating pyc files where the py
source file would have existed.
Case 4: legacy pyc files
------------------------
Python will ignore all legacy pyc files. In other words, if a
`foo.pyc` file exists next to the `foo.py` file, it will be ignored in
all cases, including sourceless deployments. Python users wishing to
support this use case can create a custom importer.
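The four cases above can be sketched as a single decision function. This is only an illustration: `plan_import` and the `TAG` value are hypothetical names, and the sketch checks only whether files exist, leaving out the magic number and timestamp comparison that decides whether a cached file actually matches.

```python
import os

TAG = "cpython-32"  # assumed magic tag; each implementation defines its own


def plan_import(source_path, tag=TAG):
    """Return (action, pyc_path) for a module, following Cases 1-4."""
    directory, filename = os.path.split(source_path)
    base = os.path.splitext(filename)[0]
    pyc_path = os.path.join(
        directory, "__pycache__", "{}.{}.pyc".format(base, tag))
    if not os.path.exists(source_path):
        # Case 3: a cached pyc without its source is not importable by default.
        return ("import-error", None)
    if os.path.exists(pyc_path):
        # Case 2: a cached file for this tag exists, so it is loaded.
        return ("load", pyc_path)
    # Case 1: nothing cached for this tag yet, so compile and write pyc_path.
    # Case 4: a legacy sibling foo.pyc is never consulted on any branch.
    return ("compile", pyc_path)
```

A real importer would also create the `__pycache__` directory when it is missing before writing the file for the `compile` action.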
Flow chart
==========
Here is a flow chart describing how modules are loaded:
.. image:: pep-3147-1.png
:scale: 75
Effects on non-conforming Python versions
=========================================
Magic identifiers
=================
Python implementations which don't know anything about `pyr`
directories will ignore them. This means that they will read and
write `pyc` files as usual. A conforming implementation will still
prefer any existing `foo.pyr/<magic>.pyc` file over an existing
sibling `pyc` file.
pyc files inside of the `__pycache__` directories contain a magic
identifier in their file names. These are mnemonic tags for the
actual magic numbers used by the importer. For example, for Python
3.2, we could use the hexlified [10]_ magic number as a unique
identifier::
The one possible conflicting state is where a sibling `pyc` file
exists, but its magic number does not match.
>>> from binascii import hexlify
>>> from imp import get_magic
>>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
'foo.580c0d0a.pyc'
In the default case, when Python finds a `pyc` file with a
non-matching magic number, it simply overwrites the `pyc` file with
the new byte code and magic number. In the absence of the `-R` flag,
this remains unchanged. When the `-R` flag was given, the
non-matching sibling `pyc` file is ignored - it is neither removed nor
overwritten - and a `foo.pyr/<magic>.pyc` file is written instead.
This isn't particularly human friendly though. Instead, this PEP
proposes to add a mapping between internal magic numbers and a
user-friendly tag. Newer versions of Python can add to this mapping
so that all later Pythons know the mapping between tags and magic
numbers. By convention, the tag will contain the Python
implementation name and version nickname, where the nickname is
generally the major version number and minor version number. Magic
numbers should not change between Python micro releases, but some
other convention can be used for changes in magic number between
pre-release development versions.
For example, CPython 3.2 would have a magic identifier tag of
`cpython-32` and write pyc files like this: `foo.cpython-32.pyc`.
When the `-O` flag is used, it would write `foo.cpython-32.pyo`. For
backports of this feature to Python 2, when the `-U` flag is used, a
file such as `foo.cpython-27u.pyc` can be written.
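One way to picture the mapping between magic numbers and tags is a small lookup table. The sketch below is illustrative only: `MAGIC_TAGS` and `cache_file_name` are hypothetical names, the single table entry pairs the magic number from the hexlify example with the `cpython-32` tag, and falling back to the hexlified magic number for unknown values is an assumption of this sketch rather than something the PEP specifies.

```python
from binascii import hexlify

# Illustrative table; a real interpreter would ship its own mapping.
MAGIC_TAGS = {
    b"\x58\x0c\r\n": "cpython-32",
}


def cache_file_name(module, magic, optimized=False):
    """Build the cache file name for a module from its magic number."""
    tag = MAGIC_TAGS.get(magic)
    if tag is None:
        # Assumed fallback: use the raw hexlified magic number as the tag.
        tag = hexlify(magic).decode("ascii")
    extension = "pyo" if optimized else "pyc"  # -O writes pyo files
    return "{}.{}.{}".format(module, tag, extension)
```

With this table, `cache_file_name("foo", b"\x58\x0c\r\n")` yields `foo.cpython-32.pyc`, and passing `optimized=True` yields `foo.cpython-32.pyo`.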
Alternative Python implementations
==================================
Alternative Python implementations such as Jython [11]_, IronPython
[12]_, PyPy [13]_, Pynie [14]_, and Unladen Swallow can also use the
`__pycache__` directory to store whatever compilation artifacts make
sense for their platforms. For example, Jython could store the class
file for the module in `__pycache__/foo.jython-32.class`.
Implementation strategy
@ -166,13 +318,97 @@ Vendors are free to backport the changes to earlier distributions as
they see fit.
Effects on existing code
========================
Adoption of this PEP will affect existing code and idioms, both inside
Python and outside. This section enumerates some of these effects.
__file__
---------
In Python 3, when you import a module, its `__file__` attribute points
to its source `py` file (in Python 2, it points to the `pyc` file). A
package's `__file__` points to the `py` file for its `__init__.py`.
E.g.::
>>> import foo
>>> foo.__file__
'foo.py'
# baz is a package
>>> import baz
>>> baz.__file__
'baz/__init__.py'
The implementation of this PEP would have to ensure that the same
directory level is returned from `__file__` as it currently does so
that the common idiom above continues to work.
As part of this PEP, we will add a `__cached__` attribute to modules,
which will always point to the actual `pyc` file that was read or
written. When the environment variable `$PYTHONDONTWRITEBYTECODE` is
set, or the `-B` option is given, or if the source lives on a
read-only filesystem, then the `__cached__` attribute will point to
the location that the `pyc` file *would* have been written to if it
didn't exist. This location of course includes the `__pycache__`
subdirectory in its path.
For alternative Python implementations which do not support `pyc`
files, the `__cached__` attribute may point to whatever
version-specific binary file was read for the module code. E.g. on
Jython, this might be the `.class` file for the module:
`__pycache__/foo.jython-32.class`. Alternative implementations for
which this scheme does not make sense should set the `__cached__`
attribute to `None`.
File extension checks
---------------------
There exists some code which checks for files ending in `.pyc` and
simply chops off the last character to find the matching `.py` file.
This code will obviously fail once this PEP is implemented.
To support this use case, we'll add two new functions to the `imp`
module [15]_:
* `imp.cache_from_source(py_path)` -> `pyc_path`
* `imp.source_from_cache(pyc_path)` -> `py_path`
Alternative implementations are free to override these functions to
return reasonable values based on their own support for this PEP.
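A minimal sketch of what this pair of functions might do, reading `cache_from_source` as mapping a source path to its cache path and `source_from_cache` as the reverse. The bodies below are assumptions for illustration, not the proposed implementation; they use `posixpath` so the results look the same on every platform, and the default tag is hypothetical.

```python
import posixpath

TAG = "cpython-32"  # assumed magic tag for the running interpreter


def cache_from_source(py_path, tag=TAG):
    """Map path/to/foo.py to path/to/__pycache__/foo.<tag>.pyc."""
    head, tail = posixpath.split(py_path)
    base = posixpath.splitext(tail)[0]
    return posixpath.join(head, "__pycache__", "{}.{}.pyc".format(base, tag))


def source_from_cache(pyc_path):
    """Map path/to/__pycache__/foo.<tag>.pyc back to path/to/foo.py."""
    head, tail = posixpath.split(pyc_path)
    parent, cache_dir = posixpath.split(head)
    parts = tail.rsplit(".", 2)
    if cache_dir != "__pycache__" or len(parts) != 3 or parts[2] != "pyc":
        raise ValueError("not a __pycache__ pyc path: {!r}".format(pyc_path))
    return posixpath.join(parent, parts[0] + ".py")
```

Code that used to chop the last character off a `.pyc` name could call `source_from_cache()` instead. Later Python releases provide functions with these names in `importlib.util`; the sketch does not depend on them.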
PEP 302 loaders
---------------
Loaders defined by PEP 302 [16]_ have a `.get_filename()` method which
returns the `__file__` for a module. As part of this PEP, we will
extend this API to include a new method, `.get_paths()`, which will
return a 2-tuple containing the path to the source file and the path
to where the matching `pyc` file is (or would be).
Backports
---------
For versions of Python earlier than 3.2 (and possibly 2.7), it is
possible to backport this PEP. However, in Python 3.2 (and possibly
2.7), this behavior will be turned on by default, and in fact, it will
replace the old behavior. Backports will need to support the old
layout by default. We suggest supporting PEP 3147 through the use of
an environment variable called `$PYTHONCACHEDIR` or the command line
switch `-Xcachedir` to enable the feature.
Alternatives
============
PEP 304
-------
There is some overlap between the goals of this PEP and PEP 304 [12]_,
There is some overlap between the goals of this PEP and PEP 304 [17]_,
which has been withdrawn. However PEP 304 would allow a user to
create a shadow file system hierarchy in which to store `pyc` files.
This concept of a shadow hierarchy for `pyc` files could be used to
@ -197,37 +433,6 @@ location of the `pyc` file in the shadow directory, and it may not be
possible to find the `my.dat` file relative to the source directory
from there.
On the other hand, this PEP keeps all byte code artifacts co-located
with the source file. Some adjustment will have to be made for the
fact that the `pyc` file lives in a subdirectory. For example, in
current Python, when you import a module, its `__file__` attribute
points to its `pyc` file. A package's `__file__` points to the `pyc`
file for its `__init__.py`. E.g.::
>>> import foo
>>> foo.__file__
'foo.pyc'
# baz is a package
>>> import baz
>>> baz.__file__
'baz/__init__.pyc'
The implementation of this PEP would have to ensure that the same
directory level is returned from `__file__` as it does without the
`pyr` directory, so that the common idiom above continues to work::
>>> import foo
>>> foo.__file__
'foo.pyr'
# baz is a package
>>> import baz
>>> baz.__file__
'baz/__init__.pyr'
Note that some existing Python code only checks for `.py` and `.pyc`
file extensions (and possibly `.pyo`). These would have to be
extended to also check for `.pyr` extensions.
Fat byte compilation files
--------------------------
@ -240,8 +445,10 @@ parallel Python implementations could be supported fairly efficiently,
but with extension lookup tables available to scale `pyf` byte code
objects as large as necessary.
The fat byte compilation files were fairly complex, so the current
simplification of using directories was suggested.
The fat byte compilation files were fairly complex, and inherently
introduced difficult race conditions, so the current simplification of
using directories was suggested. The same problem applies to using
zip files as the fat pyc file format.
Multiple file extensions
@ -256,49 +463,16 @@ approach makes it more difficult (and an ongoing task) to update any
tools that are dependent on the file extension.
Open questions
==============
* Are there any concurrency issues added by this PEP, above those that
already exist? For example, what if two Python processes attempt to
write the same `<magic>.pyc` file? Is that any different than two
Python processes trying to write to the same `foo.pyc` file?
Current thinking is that there isn't since the exclusive open
mechanism currently used, will still be used to open `pyc` files
inside a `pyr` directory.
* How do the imp [13]_ and importlib [14]_ modules need to be updated
to conform to the `pyr` directories?
* What about `py` source files that are compatible with most but not
all installed Python versions. We might need a way to say "this py
file should be hidden from Python versions X.Y or earlier". There
are three options:
- Use file system tricks to only share py files that are actually
sharable in all installed Python versions (e.g. different search
directories for Python X.Y and Python X.Z).
- Introduce Python syntax that is legal before __future__ imports
and is evaluated to determine if the py file is compatible,
raising an `ImportError('no module named foo')` if not.
- Add an optional metadata file co-located with the py file that
declares which Python versions it is compatible with.
How does this requirement interact with PEP 382 namespace packages [15]_?
* Are there any opportunities for also sharing extension modules
(.so/.dll files) in a `pyr` directory?
* Would a moratorium on byte code changes, similar to the language
moratorium described in PEP 3003 [16]_ be a better approach to
pursue, and would that solve the problem for vendors? At the time
of this writing, PEP 3003 is silent on the issue.
Reference implementation
========================
TBD
A pure-Python reference implementation will be written using
importlib [18]_, which may need some modifications to its API and
abstract base classes. Once the semantics are agreed upon and the
implementation details are settled, we'll port this to the C
implementation in `import.c`. We will have extensive tests that
guarantee that the pure-Python implementation and the built-in
implementation remain in sync.
References
@ -306,41 +480,45 @@ References
.. [1] PEP 3146
.. [2] Ubuntu: <http://www.ubuntu.com>
.. [2] The marshal module:
http://www.python.org/doc/current/library/marshal.html
.. [3] Debian: <http://www.debian.org>
.. [4] Debian Python Policy:
http://www.debian.org/doc/packaging-manuals/python-policy/
.. [5] import.c:
.. [3] import.c:
http://svn.python.org/view/python/branches/py3k/Python/import.c?view=markup
.. [6] PEP 384
.. [4] Ubuntu: <http://www.ubuntu.com>
.. [7] python-support:
.. [5] Debian: <http://www.debian.org>
.. [6] Debian Python Policy:
http://www.debian.org/doc/packaging-manuals/python-policy/
.. [7] PEP 384
.. [8] python-support:
http://wiki.debian.org/DebianPythonFAQ#Whatispython-support.3F
.. [8] python-central:
.. [9] python-central:
http://wiki.debian.org/DebianPythonFAQ#Whatispython-central.3F
.. [9] http://www.filesuffix.com/?m=search&e=pyr&submit=Search
.. [10] binascii.hexlify():
http://www.python.org/doc/current/library/binascii.html#binascii.hexlify
.. [11] The marshal module:
http://www.python.org/doc/current/library/marshal.html
.. [11] Jython: http://www.jython.org/
.. [12] PEP 304:
.. [12] IronPython: http://ironpython.net/
.. [13] imp: http://www.python.org/doc/current/library/imp.html
.. [13] PyPy: http://codespeak.net/pypy/dist/pypy/doc/
.. [14] importlib: http://docs.python.org/3.1/library/importlib.html
.. [14] Pynie: http://code.google.com/p/pynie/
.. [15] PEP 382
.. [15] imp: http://www.python.org/doc/current/library/imp.html
.. [16] PEP 3003
.. [16] PEP 302
.. [17] PEP 304
.. [18] importlib: http://docs.python.org/3.1/library/importlib.html
ACKNOWLEDGMENTS
@ -350,7 +528,7 @@ Barry Warsaw's original idea was for fat Python byte code files.
Martin von Loewis reviewed an early draft of the PEP and suggested the
simplification to store traditional `pyc` and `pyo` files in a
directory. Many other people reviewed early versions of this PEP and
provided useful feedback including:
provided useful feedback including but not limited to:
* David Malcolm
* Josselin Mouette
@ -368,26 +546,6 @@ Copyright
This document has been placed in the public domain.
Notes from python-dev
=====================
The python-dev discussion has been very fruitful. Here are some
in-progress notes from that thread which still needs to be reconciled
into the body of the PEP.
* Rarity of the use of this feature. Important for distros but
probably much less so for individual users (who may never even see
these things).
* Sibling vs folder-per-folder. Do performance measurements. Do stat
calls outweigh everything else? We need to do an analysis of the
current implementation as a baseline.
* Magic numbers in file names are magical; no one really knows the
mappings. Maybe we should use magic strings (with a lookup table?),
e.g. 'foo.cython-27.py'
* Modules should unambiguously name their __source__ and __cache__
file names. __file__ is ambiguous.
..
Local Variables: