PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: 7-Feb-2003
Introduction
Pickling new-style objects in Python 2.2 is done somewhat clumsily
and causes pickle size to bloat compared to classic class
instances. This PEP documents a new pickle protocol in Python 2.3
that takes care of this and many other pickle issues.
There are two sides to specifying a new pickle protocol: the byte
stream constituting pickled data must be specified, and the
interface between objects and the pickling and unpickling engines
must be specified. This PEP focuses on API issues, although it
may occasionally touch on byte stream format details to motivate a
choice. The pickle byte stream format is documented formally by
the standard library module pickletools.py (already checked into
CVS for Python 2.3).
This PEP attempts to fully document the interface between pickled
objects and the pickling process, highlighting additions by
specifying "new in this PEP". (The interface to invoke pickling
or unpickling is not covered fully, except for the changes to the
API for specifying the pickling protocol to picklers.)
Motivation
Pickling new-style objects causes serious pickle bloat. For
example,
    import pickle

    class C(object): # Omit "(object)" for classic class
        pass

    x = C()
    x.foo = 42
    print len(pickle.dumps(x, 1))
The binary pickle for the classic object consumed 33 bytes, and for
the new-style object 86 bytes.
The reasons for the bloat are complex, but are mostly caused by
the fact that new-style objects use __reduce__ in order to be
picklable at all. After ample consideration we've concluded that
the only way to reduce pickle sizes for new-style objects is to
add new opcodes to the pickle protocol. The net result is that
with the new protocol, the pickle size in the above example is 35
bytes (two extra bytes are used at the start to indicate the
protocol version, although this isn't strictly necessary).
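The effect can still be observed in a current CPython, where every
class is new-style. The following is only a sketch and not part of
this PEP: it assumes Python 3 syntax, and because exact byte counts
differ between releases it compares relative sizes only.

```python
import pickle

class C(object):
    pass

x = C()
x.foo = 42

# Pickle the same instance under protocols 0, 1 and 2.
sizes = {proto: len(pickle.dumps(x, proto)) for proto in (0, 1, 2)}
for proto in sorted(sizes):
    print("protocol %d: %d bytes" % (proto, sizes[proto]))
```

Protocol 2 should report the smallest size of the three for this
object.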
Protocol versions
Previously, pickling (but not unpickling) distinguished between
text mode and binary mode. By design, binary mode is a
superset of text mode, and unpicklers don't need to know in
advance whether an incoming pickle uses text mode or binary mode.
The virtual machine used for unpickling is the same regardless of
the mode; certain opcodes simply aren't used in text mode.
Retroactively, text mode is now called protocol 0, and binary mode
protocol 1. The new protocol is called protocol 2. In the
tradition of pickling protocols, protocol 2 is a superset of
protocol 1. But just so that future pickling protocols aren't
required to be supersets of the oldest protocols, a new opcode is
inserted at the start of a protocol 2 pickle indicating that it is
using protocol 2. To date, each release of Python has been able to
read pickles written by all previous releases. Of course pickles
written under protocol N can't be read by versions of Python
earlier than the one that introduced protocol N.
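As an illustration (a sketch assuming Python 3 syntax; the variable
names are hypothetical), the following shows that the unpickler is
never told which protocol was used, and that only a protocol 2
pickle starts with the protocol-indicating opcode:

```python
import pickle

record = {"name": "spam", "count": 3}

for proto in (0, 1, 2):
    data = pickle.dumps(record, proto)
    # loads() is not told which protocol wrote the pickle.
    assert pickle.loads(data) == record

# A protocol 2 pickle begins with the byte 0x80 followed by one byte
# holding the protocol number; protocols 0 and 1 have no such header.
header = pickle.dumps(record, 2)[:2]
has_proto_opcode = [pickle.dumps(record, p)[:1] == b"\x80"
                    for p in (0, 1, 2)]
```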
Several functions, methods and constructors used for pickling used
to take a positional argument named 'bin' which was a flag,
defaulting to 0, indicating binary mode. This argument is renamed
to 'proto' and now gives the protocol number, still defaulting to 0.
It so happens that passing 2 for the 'bin' argument in previous
Python versions had the same effect as passing 1. Nevertheless, a
special case is added here: passing a negative number selects the
highest protocol version supported by a particular implementation.
This works in previous Python versions, too.
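A minimal sketch of the negative-number convention, assuming Python 3
syntax. The constant pickle.HIGHEST_PROTOCOL is an assumption here
(it is not named in this section); it is used only to check which
protocol was selected.

```python
import pickle

# A negative protocol number asks for the highest supported protocol.
data = pickle.dumps([1, 2, 3], -1)

# For protocol 2 and later, the second byte of the pickle is the
# protocol number that was actually used.
selected = data[1]
```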
The pickle.py module has supported passing the 'bin' value as a
keyword argument rather than a positional argument. (This is not
recommended, since cPickle only accepts positional arguments, but
it works...) Passing 'bin' as a keyword argument is deprecated,
and a PendingDeprecationWarning is issued in this case. You have
to invoke the Python interpreter with -Wa or a variation on that
to see PendingDeprecationWarning messages. In Python 2.4, the
warning class may be upgraded to DeprecationWarning.
Security issues
In previous versions of Python, unpickling would do a "safety
check" on certain operations, refusing to call functions or
constructors that weren't marked as "safe for unpickling" by
either having an attribute __safe_for_unpickling__ set to 1, or by
being registered in a global registry, copy_reg.safe_constructors.
This feature gives a false sense of security: nobody has ever done
the necessary, extensive, code audit to prove that unpickling
untrusted pickles cannot invoke unwanted code, and in fact bugs in
the Python 2.2 pickle.py module make it easy to circumvent these
security measures.
We firmly believe that, on the Internet, it is better to know that
you are using an insecure protocol than to trust a protocol to be
secure whose implementation hasn't been thoroughly checked. Even
high quality implementations of widely used protocols are
routinely found flawed; Python's pickle implementation simply
cannot make such guarantees without a much larger time investment.
Therefore, as of Python 2.3, all safety checks on unpickling are
officially removed, and replaced with this warning:
*** Do not unpickle data received from an untrusted or
unauthenticated source ***
The same warning applies to previous Python versions, despite the
presence of safety checks there.
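To make the warning concrete, here is a deliberately harmless
sketch (Python 3 syntax assumed; the class name is hypothetical).
It shows that loading a pickle can call a callable chosen by
whoever wrote the pickle, rather than merely rebuilding data:

```python
import pickle

class Innocent(object):
    # __reduce__ may name any picklable callable; the unpickler
    # calls it with the given arguments.
    def __reduce__(self):
        return (len, ("abc",))

payload = pickle.dumps(Innocent(), 2)

# Loading does not rebuild an Innocent instance: it calls len("abc").
result = pickle.loads(payload)
```

Here the callable is len(), but nothing in the pickle format
restricts the choice to harmless functions.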
Extended __reduce__ API
There are several APIs that a class can use to control pickling.
Perhaps the most popular of these are __getstate__ and
__setstate__; but the most powerful one is __reduce__. (There's
also __getinitargs__, and we're adding __getnewargs__ below.)
There are two ways to provide __reduce__ functionality: a class
can implement a __reduce__ method, or a reduce function can be
declared in copy_reg (copy_reg.dispatch_table maps classes to
functions). The return values are interpreted exactly the same,
though, and we'll refer to these collectively as __reduce__.
IMPORTANT: a classic class cannot provide __reduce__
functionality. It must use __getinitargs__ and/or __getstate__ to
customize pickling. These are described below.
__reduce__ must return either a string or a tuple. If it returns
a string, the object's state is not pickled; instead, the pickle
stores a reference to an equivalent object, looked up by name.
Surprisingly, the string returned by __reduce__ should be the
object's local name (relative to its module); the pickle module
searches the module namespace to determine the object's module.
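A sketch of the string form, assuming Python 3 syntax; the names
_Missing and MISSING are hypothetical. The object is a module-level
singleton, so returning its local name makes the unpickler hand back
the very same object instead of a copy:

```python
import pickle

class _Missing(object):
    def __reduce__(self):
        # The local name under which this object is stored in its module.
        return "MISSING"

MISSING = _Missing()

# The pickle holds only a module name and "MISSING", not any state.
restored = pickle.loads(pickle.dumps(MISSING, 2))
```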
The rest of this section is concerned with the tuple returned by
__reduce__. It is a variable length tuple. Only the first two
items (function and arguments) are required. The remaining items
may be None or left off from the end. The last two items are new
in this PEP. The items are, in order:
    function       A callable object (not necessarily a function) called
                   to create the initial version of the object; state
                   may be added to the object later to fully reconstruct
                   the pickled state. This function must itself be
                   picklable. See the section about __newobj__ for a
                   special case (new in this PEP) here.

    arguments      A tuple giving the argument list for the function.
                   As a special case, designed for Zope 2's
                   ExtensionClass, this may be None; in that case,
                   function should be a class or type, and
                   function.__basicnew__() is called to create the
                   initial version of the object. This exception is
                   deprecated.

    state          Additional state. If this is not None, the state is
                   pickled, and obj.__setstate__(state) will be called
                   when unpickling. If no __setstate__ method is
                   defined, a default implementation is provided, which
                   assumes that state is a dictionary mapping instance
                   variable names to their values, and calls
                   obj.__dict__.update(state) or, if the update() call
                   fails, "for k, v in state.items(): setattr(obj, k, v)".

    listitems      New in this PEP. If this is not None, it should be
                   an iterator (not a sequence!) yielding successive
                   list items. These list items will be pickled, and
                   appended to the object using either obj.append(item)
                   or obj.extend(list_of_items). This is primarily used
                   for list subclasses, but may be used by other classes
                   as long as they have append() and extend() methods
                   with the appropriate signature. (Whether append() or
                   extend() is used depends on which pickle protocol
                   version is used as well as the number of items to
                   append, so both must be supported.)

    dictitems      New in this PEP. If this is not None, it should be
                   an iterator (not a sequence!) yielding successive
                   dictionary items, which should be tuples of the form
                   (key, value). These items will be pickled, and
                   stored to the object using obj[key] = value. This is
                   primarily used for dict subclasses, but may be used
                   by other classes as long as they implement
                   __setitem__.
Note: in Python 2.2 and before, when using cPickle, state would be
pickled if present even if it is None; the only safe way to avoid
the __setstate__ call was to return a two-tuple from __reduce__.
(But pickle.py would not pickle state if it was None.) In Python
2.3, __setstate__ will never be called when __reduce__ returns a
state with value None.
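The following sketch shows a __reduce__ that uses the state and
listitems positions of the tuple. It assumes Python 3 syntax, and
the class TaggedList is hypothetical. dictitems is passed as None;
a dict subclass would supply an iterator of (key, value) pairs in
that position instead.

```python
import pickle

class TaggedList(list):
    def __init__(self, tag=None):
        list.__init__(self)
        self.tag = tag

    def __reduce__(self):
        return (
            TaggedList,          # function: called to create the object
            (),                  # arguments for that call
            {"tag": self.tag},   # state: applied to the instance __dict__
            iter(self),          # listitems: must be an iterator
            None,                # dictitems: not used here
        )

original = TaggedList("fruit")
original.extend(["apple", "pear"])
restored = pickle.loads(pickle.dumps(original, 2))
```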
A __reduce__ implementation that needs to work both under Python
2.2 and under Python 2.3 could check the variable
pickle.format_version to determine whether to use the listitems
and dictitems features. If this value is >= "2.0" then they are
supported. If not, any list or dict items should be incorporated
somehow in the 'state' return value; the __setstate__ method
should be prepared to accept list or dict items as part of the
state (how this is done is up to the application).
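One way such a check might look is sketched below (Python 3 syntax
assumed; the class Bag and the "items" key are hypothetical choices,
since the text leaves the fallback representation to the
application):

```python
import pickle

# Compare as strings, following the text above.
HAVE_ITEMS = pickle.format_version >= "2.0"

class Bag(list):
    def __reduce__(self):
        if HAVE_ITEMS:
            # Let the pickler emit the list items itself.
            return (Bag, (), None, iter(self))
        # Fallback: carry the items inside the state instead.
        return (Bag, (), {"items": list(self)})

    def __setstate__(self, state):
        # Only reached on the fallback path, where state is not None.
        self.extend(state["items"])

restored = pickle.loads(pickle.dumps(Bag([1, 2]), 2))
```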
Customizing pickling absent a __reduce__ implementation
If no __reduce__ implementation is available for a particular
class, there are three cases that need to be considered
separately, because they are handled differently:
1. classic class instances, all protocols
2. new-style class instances, protocols 0 and 1
3. new-style class instances, protocol 2
Types implemented in C are considered new-style classes. However,
except for the common built-in types, these need to provide a
__reduce__ implementation in order to be picklable with protocols
0 or 1. Protocol 2 supports built-in types providing
__getnewargs__, __getstate__ and __setstate__ as well.
Case 1: pickling classic class instances
This case is the same for all protocols, and is unchanged from
Python 2.1.
For classic classes, __reduce__ is not used. Instead, classic
classes can customize their pickling by providing methods named
__getstate__, __setstate__ and __getinitargs__. Absent these, a
default pickling strategy for classic class instances is
implemented that works as long as all instance variables are
picklable. This default strategy is documented in terms of
default implementations of __getstate__ and __setstate__.
The primary way to customize pickling of classic class instances
is by specifying __getstate__ and/or __setstate__ methods. It is
fine if a class implements one of these but not the other, as long
as it is compatible with the default version.
The __getstate__ method
The __getstate__ method should return a picklable value
representing the object's state without referencing the object
itself. If no __getstate__ method exists, a default
implementation is used that returns self.__dict__.
The __setstate__ method
The __setstate__ method should take one argument; it will be
called with the value returned by __getstate__ (or its default
implementation).
If no __setstate__ method exists, a default implementation is
provided that assumes the state is a dictionary mapping instance
variable names to values. The default implementation tries two
things:
- First, it tries to call self.__dict__.update(state).
- If the update() call fails with a RuntimeError exception, it
calls setattr(self, key, value) for each (key, value) pair in
the state dictionary. This only happens when unpickling in
restricted execution mode (see the rexec standard library
module).
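A typical __getstate__/__setstate__ pair is sketched below. Note the
assumptions: it is written in Python 3 syntax, where classic classes
no longer exist, so the class shown is new-style; the two hooks are
used in the same way. The class Connection and its attributes are
hypothetical.

```python
import pickle

class Connection(object):
    def __init__(self, host):
        self.host = host
        self.cache = {}      # derived data, not worth pickling

    def __getstate__(self):
        # Return a picklable value that leaves out the cache.
        state = self.__dict__.copy()
        del state["cache"]
        return state

    def __setstate__(self, state):
        # Called with whatever __getstate__ returned.
        self.__dict__.update(state)
        self.cache = {}      # rebuild what __getstate__ left out

conn = Connection("example.org")
conn.cache["greeting"] = "hello"
restored = pickle.loads(pickle.dumps(conn, 1))
```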
The __getinitargs__ method
The __setstate__ method (or its default implementation) requires
that a new object already exists so that its __setstate__ method
can be called. The point is to create a new object that isn't
fully initialized; in particular, the class's __init__ method
should not be called if possible.
These are the possibilities:
- Normally, the following trick is used: create an instance of a
trivial classic class (one without any methods or instance
variables) and then use __class__ assignment to change its
class to the desired class. This creates an instance of the
desired class with an empty __dict__ whose __init__ has not
been called.
- However, if the class has a method named __getinitargs__, the
above trick is not used, and a class instance is created by
using the tuple returned by __getinitargs__ as an argument
list to the class constructor. This is done even if
__getinitargs__ returns an empty tuple -- a __getinitargs__
method that returns () is not equivalent to not having
__getinitargs__ at all. __getinitargs__ *must* return a
tuple.
- In restricted execution mode, the trick from the first bullet
doesn't work; in this case, the class constructor is called
with an empty argument list if no __getinitargs__ method
exists. This means that for instances of a classic class to be
unpickled in restricted execution mode, the class must either
implement __getinitargs__ or its constructor (i.e., its
__init__ method) must be callable without arguments.
Case 2: pickling new-style class instances using protocols 0 or 1
This case is unchanged from Python 2.2. For better pickling of
new-style class instances when backwards compatibility is not an
issue, protocol 2 should be used; see case 3 below.
New-style classes, whether implemented in C or in Python, inherit
a default __reduce__ implementation from the universal base class
'object'.
This default __reduce__ implementation is not used for those
built-in types for which the pickle module has built-in support.
Here's a full list of those types:
- Concrete built-in types: NoneType, bool, int, float, complex,
str, unicode, tuple, list, dict. (Complex is supported by
virtue of a __reduce__ implementation registered in copy_reg.)
In Jython, PyStringMap is also included in this list.
- Classic instances.
- Classic class objects, Python function objects, built-in
function and method objects, and new-style type objects (==
new-style class objects). These are pickled by name, not by
value: at unpickling time, a reference to an object with the
same name (the fully qualified module name plus the variable
name in that module) is substituted.
The default __reduce__ implementation will fail at pickling time
for built-in types not mentioned above.
For new-style classes implemented in Python, the default
__reduce__ implementation works as follows:
Let D be the class of the object to be pickled. First, find the
nearest base class that is implemented in C (either as a
built-in type or as a type defined by an extension class). Call
this base class B.
Unless B is the class 'object', instances of class B must be
picklable, either by having built-in support (as defined in the
above three bullet points), or by having a non-default
__reduce__ implementation. B must not be the same class as D
(if it were, it would mean that D is not implemented in Python).
The new object is created at unpickling time using the following
code:
    obj = B.__new__(D, state)
    B.__init__(obj, state)
where state is a value computed at pickling time as follows:
    state = B(obj)
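The two formulas can be tried by hand. The sketch below assumes
Python 3 syntax and picks B = list as the C-implemented base; it
performs the steps manually and then, for comparison, round-trips
the same object through protocol 1.

```python
import pickle

class D(list):          # D is implemented in Python
    pass

B = list                # nearest base class implemented in C
obj = D([1, 2, 3])

# Pickling time: capture the base-class view of the object.
state = B(obj)

# Unpickling time: allocate a D, then let B initialize it from state.
rebuilt = B.__new__(D, state)
B.__init__(rebuilt, state)

# The same round trip through the pickle module, using protocol 1.
restored = pickle.loads(pickle.dumps(obj, 1))
```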
Objects for which this default __reduce__ implementation is used
can customize it by defining __getstate__ and/or __setstate__
methods. These work almost the same as described for classic
classes above, except that if __getstate__ returns an object (of
any type) whose value is considered false (e.g. None, or a number
that is zero, or an empty sequence or mapping), this state is not
pickled and __setstate__ will not be called at all.
Note that this strategy ignores slots. New-style classes that
define slots and don't define __getstate__ in the same class that
defines the slots automatically have a __getstate__ method added
that raises TypeError.
Case 3: pickling new-style class instances using protocol 2
Under protocol 2, the default __reduce__ implementation inherited
from the 'object' base class is *ignored*. Instead, a different
default implementation is used, which allows more efficient
pickling of new-style class instances than possible with protocols
0 or 1, at the cost of backward incompatibility with Python 2.2.
The customization uses three special methods: __getstate__,
__setstate__ and __getnewargs__. It is fine if a class implements
one or more but not all of these, as long as it is compatible with
the default implementations.
The __getstate__ method
The __getstate__ method should return a picklable value
representing the object's state without referencing the object
itself. If no __getstate__ method exists, a default
implementation is used which is described below.
There's a subtle difference between classic and new-style
classes here: if a classic class's __getstate__ returns None,
self.__setstate__(None) will be called as part of unpickling.
But if a new-style class's __getstate__ returns None, its
__setstate__ won't be called at all as part of unpickling.
If no __getstate__ method exists, a default state is assumed.
There are several cases:
- For a new-style class that has an instance __dict__ and no
__slots__, the default state is self.__dict__.
- For a new-style class that has no instance __dict__ and no
__slots__, the default state is None.
- For a new-style class that has an instance __dict__ and
__slots__, the default state is a tuple consisting of two
dictionaries: the first being self.__dict__, and the second
being a dictionary mapping slot names to slot values. Only
slots that have a value are included in the latter.
- For a new-style class that has __slots__ and no instance
__dict__, the default state is a tuple whose first item is
None and whose second item is a dictionary mapping slot names
to slot values, as described in the previous bullet.
Note that new-style classes that define slots and don't define
__getstate__ in the same class that defines the slots
automatically have a __getstate__ method added that raises
TypeError. Protocol 2 ignores this __getstate__ method
(recognized by the specific text of the error message).
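The slots-only case can be observed as follows. This is a sketch
assuming Python 3 syntax and a current CPython; the class Point is
hypothetical, and __reduce_ex__ is used here only as a convenient
way to look at the default state, not as an interface defined in
this section.

```python
import pickle

class Point(object):
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)

# Protocol 2 collects the slot values without help from the class.
q = pickle.loads(pickle.dumps(p, 2))

# The default state: None for the missing __dict__, then the slots.
default_state = p.__reduce_ex__(2)[2]

# Protocols 0 and 1 refuse, because their default ignores slots.
try:
    pickle.dumps(p, 1)
    old_protocol_failed = False
except TypeError:
    old_protocol_failed = True
```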
The __setstate__ method
The __setstate__ method should take one argument; it will be
called with the value returned by __getstate__ or with the default
state described above if no __getstate__ method is defined.
If no __setstate__ method exists, a default implementation is
provided that can handle the state returned by the default
__getstate__, described above.
The __getnewargs__ method
Like for classic classes, the __setstate__ method (or its
default implementation) requires that a new object already
exists so that its __setstate__ method can be called.
In protocol 2, a new pickling opcode is used that causes a new
object to be created as follows:
    obj = C.__new__(C, *args)
where args is either the empty tuple, or the tuple returned by
the __getnewargs__ method, if defined. __getnewargs__ must
return a tuple. The absence of a __getnewargs__ method is
equivalent to the existence of one that returns ().
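A sketch of __getnewargs__ in use, assuming Python 3 syntax; the
class Pair is hypothetical. Its __new__ requires two arguments, so
the unpickler must be told what to pass:

```python
import pickle

class Pair(tuple):
    def __new__(cls, a, b):
        return tuple.__new__(cls, (a, b))

    def __getnewargs__(self):
        # Must be a tuple; it becomes *args in Pair.__new__(Pair, *args).
        return (self[0], self[1])

original = Pair("left", "right")
restored = pickle.loads(pickle.dumps(original, 2))
```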
The __newobj__ unpickling function
When the unpickling function returned by __reduce__ (the first
item of the returned tuple) has the name __newobj__, something
special happens for pickle protocol 2. An unpickling function
named __newobj__ is assumed to have the following semantics:
    def __newobj__(cls, *args):
        return cls.__new__(cls, *args)
Pickle protocol 2 special-cases an unpickling function with this
name, and emits a pickling opcode that, given 'cls' and 'args',
will return cls.__new__(cls, *args) without also pickling a
reference to __newobj__ (this is the same pickling opcode used by
protocol 2 for a new-style class instance when no __reduce__
implementation exists). This is the main reason why protocol 2
pickles are much smaller than classic pickles. Of course, the
pickling code cannot verify that a function named __newobj__
actually has the expected semantics. If you use an unpickling
function named __newobj__ that returns something different, you
deserve what you get.
It is safe to use this feature under Python 2.2; there's nothing
in the recommended implementation of __newobj__ that depends on
Python 2.3.
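The special case can be seen in the pickle bytes. In the sketch
below (Python 3 syntax assumed; the class Temperature is
hypothetical), __reduce__ returns a function named __newobj__ with
the semantics given above, and the protocol 2 pickle ends up with no
reference to that function:

```python
import pickle

def __newobj__(cls, *args):
    return cls.__new__(cls, *args)

class Temperature(object):
    def __init__(self, degrees):
        self.degrees = degrees

    def __reduce__(self):
        # The first argument is the class; the state is applied afterwards.
        return (__newobj__, (Temperature,), {"degrees": self.degrees})

t = Temperature(21)
data = pickle.dumps(t, 2)

# The function name is not stored; only the class and the state are.
mentions_newobj = b"__newobj__" in data
restored = pickle.loads(data)
```

Since __init__ is not called when unpickling, the restored object
gets its degrees attribute from the state alone.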
The extension registry
Protocol 2 supports a new mechanism to reduce the size of pickles.
When class instances (classic or new-style) are pickled, the full
name of the class (module name including package name, and class
name) is included in the pickle. Especially for applications that
generate many small pickles, this is a lot of overhead that has to
be repeated in each pickle. For large pickles, when using
protocol 1, repeated references to the same class name are
compressed using the "memo" feature; but each class name must be
spelled in full at least once per pickle, and this causes a lot of
overhead for small pickles.
The extension registry allows one to represent the most frequently
used names by small integers, which are pickled very efficiently:
an extension code in the range 1-255 requires only two bytes
including the opcode, one in the range 256-65535 requires only
three bytes including the opcode.
One of the design goals of the pickle protocol is to make pickles
"context-free": as long as you have installed the modules
containing the classes referenced by a pickle, you can unpickle
it, without needing to import any of those classes ahead of time.
Unbridled use of extension codes could jeopardize this desirable
property of pickles. Therefore, the main use of extension codes
is reserved for a set of codes to be standardized by some
standard-setting body. This being Python, the standard-setting
body is the PSF. From time to time, the PSF will decide on a
table mapping extension codes to class names (or occasionally
names of other global objects; functions are also eligible). This
table will be incorporated in the next Python release(s).
However, for some applications, like Zope, context-free pickles
are not a requirement, and waiting for the PSF to standardize
some codes may not be practical. Two solutions are offered for
such applications.
First of all, a few ranges of extension codes are reserved for
private use. Any application can register codes in these ranges.
Two applications exchanging pickles using codes in these ranges
need to have some out-of-band mechanism to agree on the mapping
between extension codes and names.
Second, some large Python projects (e.g. Zope) can be assigned a
range of extension codes outside the "private use" range that they
can assign as they see fit.
The extension registry is defined as a mapping between extension
codes and names. When an extension code is unpickled, it ends up
producing an object, but this object is obtained by interpreting
the name as a module name followed by a class (or function) name. The
mapping from names to objects is cached. It is quite possible
that certain names cannot be imported; that should not be a
problem as long as no pickle containing a reference to such names
has to be unpickled. (The same issue already exists for direct
references to such names in pickles that use protocols 0 or 1.)
Here is the proposed initial assignment of extension code ranges:

    First   Last  Count  Purpose

        0      0      1  Reserved -- will never be used
        1    127    127  Reserved for Python standard library
      128    191     64  Reserved for Zope
      192    239     48  Reserved for 3rd parties
      240    255     16  Reserved for private use (will never be assigned)
      256    MAX    MAX  Reserved for future assignment

MAX stands for 2147483647, or 2**31-1. This is a hard limitation
of the protocol as currently defined.
At the moment, no specific extension codes have been assigned.
Extension registry API
The extension registry is maintained as private global variables
in the copy_reg module. The following three functions are defined
in this module to manipulate the registry:
    add_extension(module, name, code)
        Register an extension code. The module and name arguments
        must be strings; code must be an int in the inclusive range 1
        through MAX. This must either register a new (module, name)
        pair to a new code, or be a redundant repeat of a previous
        call that was not canceled by a remove_extension() call; a
        (module, name) pair may not be mapped to more than one code,
        nor may a code be mapped to more than one (module, name)
        pair. (XXX Aliasing may actually cause a problem for this
        requirement; we'll see as we go.)

    remove_extension(module, name, code)
        Arguments are as for add_extension(). Remove a previously
        registered mapping between (module, name) and code.

    clear_extension_cache()
        The implementation of extension codes may use a cache to speed
        up loading objects that are named frequently. This cache can
        be emptied (removing references to cached objects) by calling
        this function.

Note that the API does not enforce the standard range assignments.
It is up to applications to respect these.
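The three functions might be used as sketched below. The
assumptions: Python 3 syntax, where the module is spelled copyreg
rather than copy_reg; a hypothetical class Account; and code 240,
the first value in the private-use range of the table above.

```python
import copyreg   # spelled copy_reg in the Python versions this PEP targets
import pickle

class Account(object):
    def __init__(self, balance):
        self.balance = balance

CODE = 240       # private use; both sides must agree on it out of band

acct = Account(10)
plain = pickle.dumps(acct, 2)        # class name spelled out in full

copyreg.add_extension(Account.__module__, "Account", CODE)
try:
    compact = pickle.dumps(acct, 2)  # class referenced by its code
    restored = pickle.loads(compact)
finally:
    # Undo the registration so the sketch leaves no global state behind.
    copyreg.remove_extension(Account.__module__, "Account", CODE)
    copyreg.clear_extension_cache()
```

After the registration is removed, the compact pickle can no longer
be loaded in this process, which is the loss of context-freeness
discussed earlier.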
The copy module
Traditionally, the copy module has supported an extended subset of
the pickling APIs for customizing the copy() and deepcopy()
operations.
In particular, besides checking for a __copy__ or __deepcopy__
method, copy() and deepcopy() have always looked for __reduce__,
and for classic classes, have looked for __getinitargs__,
__getstate__ and __setstate__.
In Python 2.2, the default __reduce__ inherited from 'object' made
copying simple new-style classes possible, but slots and various
other special cases were not covered.
In Python 2.3, several changes are made to the copy module:
- The four- and five-argument return values of __reduce__ are
supported.
- Before looking for a __reduce__ method, the
copy_reg.dispatch_table is consulted, just like for pickling.
- When the __reduce__ method is inherited from object, it is
(unconditionally) replaced by a better one that uses the same
APIs as pickle protocol 2: __getnewargs__, __getstate__, and
__setstate__, handling list and dict subclasses, and handling
slots.
As a consequence of the last change, certain new-style classes
that were copyable under Python 2.2 are not copyable under Python
2.3. (These classes are also not picklable using pickle protocol
2.) A minimal example of such a class:
    class C(object):
        def __new__(cls, a):
            return object.__new__(cls)
The problem only occurs when __new__ is overridden and has at
least one mandatory argument in addition to the class argument.
To fix this, a __getnewargs__ method should be added that returns
the appropriate argument tuple (excluding the class).
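Both the problem and the fix are sketched below, assuming Python 3
syntax; the class names Broken and Fixed are hypothetical.

```python
import copy

class Broken(object):
    def __new__(cls, a):
        return object.__new__(cls)

class Fixed(object):
    def __new__(cls, a):
        self = object.__new__(cls)
        self.a = a
        return self

    def __getnewargs__(self):
        # The arguments for __new__, excluding the class itself.
        return (self.a,)

# Copying calls Broken.__new__(Broken) with no further arguments.
try:
    copy.copy(Broken(1))
    broken_copied = True
except TypeError:
    broken_copied = False

clone = copy.copy(Fixed(1))
```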
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: