PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Active
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: None

Introduction

Pickling new-style objects in Python 2.2 is done somewhat clumsily
and causes pickle size to bloat compared to classic class
instances. This PEP documents a new pickle protocol that takes
care of this and many other pickle issues.

There are two sides to specifying a new pickle protocol: the byte
stream constituting pickled data must be specified, and the
interface between objects and the pickling and unpickling engines
must be specified. This PEP focuses on API issues, although it
may occasionally touch on byte stream format details to motivate a
choice. The pickle byte stream format is documented formally by
the standard library module pickletools.py (already checked into
CVS for Python 2.3).

This PEP attempts to fully document the interface between pickled
objects and the pickling process, highlighting additions by
specifying "new in this PEP". (The interface to invoke pickling
or unpickling is not covered fully, except for the changes to the
API for specifying the pickling protocol to picklers.)

Motivation

Pickling new-style objects causes serious pickle bloat. For
example, the binary pickle for a classic object with one instance
variable takes up 33 bytes; a new-style object with one instance
variable takes up 86 bytes. This was measured as follows:

    import pickle

    class C(object):  # Omit "(object)" for classic class
        pass
    x = C()
    x.foo = 42
    print len(pickle.dumps(x, 1))

The reasons for the bloat are complex, but are mostly caused by
the fact that new-style objects use __reduce__ in order to be
picklable at all. After ample consideration we've concluded that
the only way to reduce pickle sizes for new-style objects is to
add new opcodes to the pickle protocol. The net result is that
with the new protocol, the pickle size in the above example is 35
bytes (two extra bytes are used at the start to indicate the
protocol version, although this isn't strictly necessary).

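As a rough cross-check on a current interpreter, the measurement
above can be repeated for each of protocols 0, 1 and 2. This is only
a sketch and it assumes Python 3, where every class is new-style, so
the byte counts will not match the Python 2.2 figures quoted above.
The name "sizes" is illustrative.

```python
import pickle

class C:                      # Python 3: every class is new-style
    pass

x = C()
x.foo = 42

# Pickle the same object under protocols 0, 1 and 2 and record the sizes.
sizes = {proto: len(pickle.dumps(x, proto)) for proto in (0, 1, 2)}
for proto, size in sorted(sizes.items()):
    print("protocol %d: %d bytes" % (proto, size))
```

The protocol 2 pickle should come out clearly smaller than the
protocol 1 pickle, even though it starts with the two-byte protocol
marker.
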
Protocol versions

Previously, pickling (but not unpickling) has distinguished
between text mode and binary mode. By design, text mode is a
subset of binary mode, and unpicklers don't need to know in
advance whether an incoming pickle uses text mode or binary mode.
The virtual machine used for unpickling is the same regardless of
the mode; certain opcodes simply aren't used in text mode.

Retroactively, text mode is called protocol 0, and binary mode is
called protocol 1. The new protocol is called protocol 2. In the
tradition of pickling protocols, protocol 2 is a superset of
protocol 1. But just so that future pickling protocols aren't
required to be supersets of the oldest protocols, a new opcode is
inserted at the start of a protocol 2 pickle indicating that it is
using protocol 2.

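The leading protocol marker can be observed with the pickletools
module mentioned in the Introduction. The sketch below assumes
Python 3 and uses pickletools.genops() to read the first opcode of a
pickle; the helper name first_opcode is hypothetical.

```python
import pickle
import pickletools

def first_opcode(data):
    """Return (name, argument) of the first opcode in a pickle."""
    opcode, arg, _pos = next(pickletools.genops(data))
    return opcode.name, arg

payload = {"answer": 42}
for proto in (0, 1, 2):
    print(proto, first_opcode(pickle.dumps(payload, proto)))

# The unpickler needs no protocol argument: it reads whatever it is given.
assert pickle.loads(pickle.dumps(payload, 2)) == payload
```

Only the protocol 2 pickle begins with the PROTO opcode; pickles
written with protocols 0 and 1 start directly with data opcodes.
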
Several functions, methods and constructors used for pickling used
to take a positional argument named 'bin' which was a flag,
defaulting to 0, indicating binary mode. This argument is renamed
to 'proto' and now gives the protocol number, defaulting to 0.

It so happens that passing 2 for the 'bin' argument in previous
Python versions had the same effect as passing 1. Nevertheless, a
special case is added here: passing a negative number selects the
highest protocol version supported by a particular implementation.
This works in previous Python versions, too.

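A short illustration of the negative-number special case follows. It
assumes Python 3, and it compares against pickle.HIGHEST_PROTOCOL, a
constant of today's pickle module that this text does not itself
define.

```python
import pickle

data = [1, 2.5, "three"]

# A negative protocol number asks for the newest protocol available.
newest = pickle.dumps(data, -1)
explicit = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)

print(newest == explicit)
```
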
The pickle.py module has supported passing the 'bin' value as a
keyword argument rather than a positional argument. (This is not
recommended, since cPickle only accepts positional arguments, but
it works...) Passing 'bin' as a keyword argument is deprecated,
and a PendingDeprecationWarning is issued in this case. You have
to invoke the Python interpreter with -Wa or a variation on that
to see PendingDeprecationWarning messages. In Python 2.4, the
warning class may be upgraded to DeprecationWarning.

Security issues

In previous versions of Python, unpickling would do a "safety
check" on certain operations, refusing to call functions or
constructors that weren't marked as "safe for unpickling" by
either having an attribute __safe_for_unpickling__ set to 1, or by
being registered in a global registry, copy_reg.safe_constructors.

This feature gives a false sense of security: nobody has ever done
the necessary, extensive, code audit to prove that unpickling
untrusted pickles cannot invoke unwanted code, and in fact bugs in
the Python 2.2 pickle.py module make it easy to circumvent these
security measures.

We firmly believe that, on the Internet, it is better to know that
you are using an insecure protocol than to trust a protocol to be
secure whose implementation hasn't been thoroughly checked. Even
high quality implementations of widely used protocols are
routinely found flawed; Python's pickle implementation simply
cannot make such guarantees without a much larger time investment.
Therefore, as of Python 2.3, all safety checks on unpickling are
officially removed, and replaced with this warning:

    *** Do not unpickle data received from an untrusted or
    unauthenticated source ***

The same warning applies to previous Python versions, despite the
presence of safety checks there.

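To make the warning concrete, here is a deliberately harmless sketch
of why a pickle must be treated like code. It assumes Python 3; the
class name Sneaky is hypothetical, and operator.add merely stands in
for whatever callable the author of a pickle chooses to name.

```python
import operator
import pickle

class Sneaky:
    """Hypothetical class whose pickle asks the unpickler to call a function."""

    def __reduce__(self):
        # (callable, args): the unpickler will call operator.add(40, 2).
        return (operator.add, (40, 2))

result = pickle.loads(pickle.dumps(Sneaky(), 2))
print(result)   # the return value of the call, not a Sneaky instance
```

The unpickler calls whichever importable callable the pickle names,
with arguments the pickle supplies, which is why no after-the-fact
check can make untrusted pickles safe.
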
Extended __reduce__ API

There are several APIs that a class can use to control pickling.
Perhaps the most popular of these are __getstate__ and
__setstate__; but the most powerful one is __reduce__. (There's
also __getinitargs__, and we're adding __getnewargs__ below.)

There are two ways to provide __reduce__ functionality: a class
can implement a __reduce__ method, or a reduce function can be
declared in copy_reg (copy_reg.dispatch_table maps classes to
functions). The return values are interpreted exactly the same,
though, and we'll refer to these collectively as __reduce__.

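The registry route can be sketched as follows. The sketch assumes
Python 3, where the copy_reg module is spelled copyreg; the Interval
class and reduce_interval function are hypothetical names used only
for illustration.

```python
import copyreg
import pickle

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

def reduce_interval(iv):
    # Same return convention as a __reduce__ method: (callable, args).
    return (Interval, (iv.lo, iv.hi))

# Declare the reduce function in the registry instead of on the class.
copyreg.pickle(Interval, reduce_interval)

clone = pickle.loads(pickle.dumps(Interval(1, 5), 2))
print(clone.lo, clone.hi)
```

This route is useful when the class cannot be modified, for example
because it comes from another package.
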
IMPORTANT: a classic class cannot provide __reduce__
functionality. It must use __getinitargs__ and/or __getstate__ to
customize pickling. These are described below.

__reduce__ must return either a string or a tuple. If it returns
a string, the object's state is not pickled; instead, the pickle
stores a reference to an equivalent object, looked up by name.
Surprisingly, the string returned by __reduce__ should be the
object's local name (relative to its module); the pickle module
searches the module namespace to determine the object's module.

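The string form suits singletons that should unpickle to the very
same object. A minimal sketch, assuming Python 3 and a module-level
object bound to the hypothetical name MISSING:

```python
import pickle

class _Sentinel:
    def __reduce__(self):
        # The local name under which this object lives in its module.
        return "MISSING"

MISSING = _Sentinel()

# The pickle stores only a by-name reference, so loading it yields
# the existing object rather than a copy.
restored = pickle.loads(pickle.dumps(MISSING, 2))
print(restored is MISSING)
```
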
The rest of this section is concerned with the tuple returned by
__reduce__. It is a variable length tuple. Only the first two
items (function and arguments) are required. The remaining items
may be None or left off from the end. The last two items are new
in this PEP. The items are, in order:

function    A callable object (not necessarily a function) called
            to create the initial version of the object; state
            may be added to the object later to fully reconstruct
            the pickled state. This function must itself be
            picklable. See the section about __newobj__ for a
            special case (new in this PEP) here.

arguments   A tuple giving the argument list for the function.
            As a special case, designed for Zope 2's
            ExtensionClass, this may be None; in that case,
            function should be a class or type, and
            function.__basicnew__() is called to create the
            initial version of the object. This exception is
            deprecated.

state       Additional state. If this is not None, the state is
            pickled, and obj.__setstate__(state) will be called when
            unpickling. If no __setstate__ method is defined, a
            default implementation is provided, which assumes
            that state is a dictionary mapping instance variable
            names to their values, and calls
            obj.__dict__.update(state) or, if the update() call
            fails, "for k, v in state.items(): setattr(obj, k, v)".

listitems   New in this PEP. If this is not None, it should be
            an iterator (not a sequence!) yielding successive
            list items. These list items will be pickled, and
            appended to the object using either obj.append(item)
            or obj.extend(list_of_items). This is primarily used
            for list subclasses, but may be used by other classes
            as long as they have append() and extend() methods
            with the appropriate signature. (Whether append() or
            extend() is used depends on which pickle protocol
            version is used as well as the number of items to
            append, so both must be supported.)

dictitems   New in this PEP. If this is not None, it should be
            an iterator (not a sequence!) yielding successive
            dictionary items, which should be tuples of the form
            (key, value). These items will be pickled, and
            stored to the object using obj[key] = value. This is
            primarily used for dict subclasses, but may be used
            by other classes as long as they implement
            __setitem__.

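The five-item tuple can be sketched with a small container that is
neither a list nor a dict subclass. The sketch assumes Python 3; the
Bag class and its attribute names are hypothetical.

```python
import pickle

class Bag:
    """Hypothetical container with list-like and dict-like contents."""

    def __init__(self, label):
        self.label = label
        self.items = []
        self.index = {}

    # Methods the unpickler needs for listitems.
    def append(self, item):
        self.items.append(item)

    def extend(self, items):
        self.items.extend(items)

    # Method the unpickler needs for dictitems.
    def __setitem__(self, key, value):
        self.index[key] = value

    def __reduce__(self):
        return (Bag,                        # function
                (self.label,),              # arguments
                None,                       # state: nothing extra
                iter(self.items),           # listitems: an iterator
                iter(self.index.items()))   # dictitems: (key, value) pairs

bag = Bag("demo")
bag.extend([1, 2, 3])
bag["spam"] = "eggs"

clone = pickle.loads(pickle.dumps(bag, 2))
print(clone.label, clone.items, clone.index)
```

Because the pickler streams the items from the iterators, the object
never has to build a second, temporary copy of its contents just to
be pickled.
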
Note: in Python 2.2 and before, when using cPickle, state would be
pickled if present even if it is None; the only safe way to avoid
the __setstate__ call was to return a two-tuple from __reduce__.
(But pickle.py would not pickle state if it was None.) In Python
2.3, __setstate__ will never be called when __reduce__ returns a
state with value None.

A __reduce__ implementation that needs to work both under Python
2.2 and under Python 2.3 could check the variable
pickle.format_version to determine whether to use the listitems
and dictitems features. If this value is >= "2.0" then they are
supported. If not, any list or dict items should be incorporated
somehow in the 'state' return value; the __setstate__ method
should be prepared to accept list or dict items as part of the
state (how this is done is up to the application).

XXX Refactoring needed

The following sections should really be reorganized according to
the following cases:

1. classic classes, all protocols

2. new-style classes, protocols 0 and 1

3. new-style classes, protocol 2

The __newobj__ unpickling function

When the unpickling function returned by __reduce__ (the first
item of the returned tuple) has the name __newobj__, something
special happens for pickle protocol 2. An unpickling function
named __newobj__ is assumed to have the following semantics:

    def __newobj__(cls, *args):
        return cls.__new__(cls, *args)

Pickle protocol 2 special-cases an unpickling function with this
name, and emits a pickling opcode that, given 'cls' and 'args',
will return cls.__new__(cls, *args) without also pickling a
reference to __newobj__. This is the main reason why protocol 2
pickles are so much smaller than classic pickles. Of course, the
pickling code cannot verify that a function named __newobj__
actually has the expected semantics. If you use an unpickling
function named __newobj__ that returns something different, you
deserve what you get.

It is safe to use this feature under Python 2.2; there's nothing
in the recommended implementation of __newobj__ that depends on
Python 2.3.

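The effect of the special case can be observed by disassembling
pickles made with protocols 1 and 2. The sketch below assumes
Python 3, whose copyreg module happens to ship a function named
__newobj__ with the semantics shown above; the Point class and the
opcode_names helper are hypothetical.

```python
import copyreg
import pickle
import pickletools

class Point:
    def __new__(cls, x=0, y=0):
        self = object.__new__(cls)
        self.x, self.y = x, y
        return self

    def __reduce__(self):
        # The first item is a function named __newobj__; the argument
        # tuple is (cls, *args) for cls.__new__(cls, *args).
        return (copyreg.__newobj__, (Point, self.x, self.y))

def opcode_names(data):
    """List the opcode names that make up a pickle."""
    return [op.name for op, _arg, _pos in pickletools.genops(data)]

p = Point(3, 4)
proto1 = pickle.dumps(p, 1)   # pickles a reference to __newobj__
proto2 = pickle.dumps(p, 2)   # emits the dedicated opcode instead
clone = pickle.loads(proto2)

print(opcode_names(proto2))
print(len(proto1), len(proto2))
```

Under protocol 1 the function is pickled by name like any other
global; under protocol 2 its name does not appear in the byte stream
at all.
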
The __getstate__ and __setstate__ methods

When there is no __reduce__ for an object, the primary way to
customize pickling is by specifying __getstate__ and/or
__setstate__ methods. These are supported for classic classes as
well as for new-style classes for which no __reduce__ exists.

When __reduce__ exists, __getstate__ is not called (unless your
__reduce__ implementation calls it), but __setstate__ will be
called with the third item from the tuple returned by __reduce__,
if not None.

There's a subtle difference between classic and new-style classes
here: if a classic class's __getstate__ returns None,
self.__setstate__(None) will be called as part of unpickling. But
if a new-style class's __getstate__ returns None, its __setstate__
won't be called at all as part of unpickling.

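The new-style half of this rule can be checked directly. The sketch
assumes Python 3, where classic classes no longer exist, so only the
new-style behavior is shown; the class name Quiet and its call
counter are hypothetical.

```python
import pickle

class Quiet:
    setstate_calls = 0        # class-level counter, for illustration only

    def __getstate__(self):
        return None           # no state to pickle

    def __setstate__(self, state):
        type(self).setstate_calls += 1

clone = pickle.loads(pickle.dumps(Quiet(), 2))
print(Quiet.setstate_calls)
```
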
The __getstate__ method is supposed to return a picklable version
of an object's state that does not reference the object itself.
If no __getstate__ method exists, a default state is assumed.
There are several cases:

- For a classic class, the default state is self.__dict__.

- For a new-style class that has an instance __dict__ and no
  __slots__, the default state is self.__dict__.

- For a new-style class that has no instance __dict__ and no
  __slots__, the default state is None.

- For a new-style class that has an instance __dict__ and
  __slots__, the default state is a tuple consisting of two
  dictionaries: the first being self.__dict__, and the second
  being a dictionary mapping slot names to slot values. Only
  slots that have a value are included in the latter.

- For a new-style class that has __slots__ and no instance
  __dict__, the default state is a tuple whose first item is None
  and whose second item is a dictionary mapping slot names to slot
  values, as described in the previous bullet.

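The last case above can be inspected without pickling anything, by
looking at the third item of the tuple that the protocol 2 reduce
machinery produces. This is a sketch assuming Python 3, where
obj.__reduce_ex__(2) exposes that tuple; the class name Slotted is
hypothetical.

```python
class Slotted:
    __slots__ = ("a", "b")

s = Slotted()
s.a = 1                       # slot "b" is left unset on purpose

# Index 2 of the reduce tuple is the state item.
state = s.__reduce_ex__(2)[2]
print(state)
```

The state is a two-tuple whose first item is None, because there is
no instance __dict__, and whose second item maps only the slot that
has a value.
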
The __setstate__ method should take one argument; it will be called
with the value returned by __getstate__ or with the default state
described above if no __getstate__ method is defined.

If no __setstate__ method exists, a default implementation is
provided that can handle the state returned by the default
__getstate__.

It is fine if a class implements one of these but not the other,
as long as it is compatible with the default version.

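A common use of this freedom is a class that defines only
__getstate__, to leave something out of the pickle, and relies on
the default __setstate__ behavior. A minimal sketch, assuming
Python 3; the Session class and its attributes are hypothetical.

```python
import pickle

class Session:
    def __init__(self, user):
        self.user = user
        self.cache = {"expensive": "result"}   # not worth pickling

    def __getstate__(self):
        # Return a plain dict, which the default __setstate__
        # behavior knows how to restore.
        state = self.__dict__.copy()
        del state["cache"]
        return state

clone = pickle.loads(pickle.dumps(Session("guido"), 2))
print(clone.user, hasattr(clone, "cache"))
```
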
New-style classes inherit a default __reduce__ implementation
from the ultimate base class 'object'. This implementation is not
used for protocol 2, and then the last four bullets above apply.
For protocols 0 and 1, the default implementation looks for a
__getstate__ method, and if none exists, it uses a simpler default
strategy:

- If there is an instance __dict__, the state is self.__dict__.

- Otherwise, the state is None (and __setstate__ will not be
  called).

Note that this strategy ignores slots. New-style classes that
define slots and don't define __getstate__ in the same class that
defines the slots automatically have a __getstate__ method added
that raises TypeError. Protocol 2 ignores this __getstate__
method (recognized by the specific text of the error message).

The __getinitargs__ and __getnewargs__ methods

The __setstate__ method (or its default implementation) requires
that a new object already exists so that its __setstate__ method
can be called. The point is to create a new object that isn't
fully initialized; in particular, the class's __init__ method
should not be called if possible.

The way this is done differs between classic and new-style
classes.

For classic classes, these are the possibilities:

- Normally, the following trick is used: create an instance of a
  trivial classic class (one without any methods or instance
  variables) and then use __class__ assignment to change its class
  to the desired class. This creates an instance of the desired
  class with an empty __dict__ whose __init__ has not been called.

- However, if the class has a method named __getinitargs__, the
  above trick is not used, and a class instance is created by
  using the tuple returned by __getinitargs__ as an argument list
  to the class constructor. This is done even if __getinitargs__
  returns an empty tuple -- a __getinitargs__ method that returns
  () is not equivalent to not having __getinitargs__ at all.
  __getinitargs__ *must* return a tuple.

- In restricted execution mode, the trick from the first bullet
  doesn't work; in this case, the class constructor is called with
  an empty argument list if no __getinitargs__ method exists.
  This means that in order for a classic class to be unpicklable
  in restricted mode, it must either implement __getinitargs__ or
  its constructor (i.e., its __init__ method) must be callable
  without arguments.

For new-style classes, these are the possibilities:

- When using protocol 0 or 1, a default __reduce__ implementation
  is normally inherited from the ultimate base class
  'object'. This implementation finds the nearest base class that
  is implemented in C (either as a built-in type or as a type
  defined by an extension class). Calling this base class B and
  the class of the object to be pickled C, the new object is
  created at unpickling time using the following code:

      obj = B.__new__(C, state)
      B.__init__(obj, state)

  where state is a value computed at pickling time as follows:

      state = B(obj)

  This only works when B is not C, and only for certain classes
  B. It does work for the following built-in classes: int, long,
  float, complex, str, unicode, tuple, list, dict; and this is its
  main redeeming factor.

- When using protocol 2, the default __reduce__ implementation
  inherited from 'object' is ignored. Instead, a new pickling
  opcode is generated that causes a new object to be created as
  follows:

      obj = C.__new__(C, *args)

  where args is either the empty tuple, or the tuple returned by
  the __getnewargs__ method, if defined.

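The protocol 2 case can be sketched with a subclass of an immutable
built-in type, whose value has to be supplied to __new__. The sketch
assumes Python 3; the Celsius class and its init_calls counter are
hypothetical and exist only to show that __init__ is not run again
when unpickling.

```python
import pickle

class Celsius(float):
    init_calls = 0            # class-level counter, for illustration only

    def __new__(cls, degrees):
        return super().__new__(cls, degrees)

    def __init__(self, degrees):
        type(self).init_calls += 1

    def __getnewargs__(self):
        # Arguments for Celsius.__new__(Celsius, *args) at unpickling time.
        return (float(self),)

t = Celsius(21.5)             # runs __new__ and __init__ once
clone = pickle.loads(pickle.dumps(t, 2))

print(type(clone).__name__, float(clone), Celsius.init_calls)
```
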
TBD

The rest of this PEP is still under construction!

Copyright

This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: