PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Active
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: None
Introduction
Pickling new-style objects in Python 2.2 is done somewhat clumsily
and causes pickle size to bloat compared to classic class
instances. This PEP documents a new pickle protocol that takes
care of this and many other pickle issues.
There are two sides to specifying a new pickle protocol: the byte
stream constituting pickled data must be specified, and the
interface between objects and the pickling and unpickling engines
must be specified. This PEP focuses on API issues, although it
may occasionally touch on byte stream format details to motivate a
choice. The pickle byte stream format is documented formally by
the standard library module pickletools.py (already checked into
CVS for Python 2.3).
This PEP attempts to fully document the interface between pickled
objects and the pickling process, highlighting additions by
specifying "new in this PEP". (The interface to invoke pickling
or unpickling is not covered fully, except for the changes to the
API for specifying the pickling protocol to picklers.)
Motivation
Pickling new-style objects causes serious pickle bloat. For
example, the binary pickle for a classic object with one instance
variable takes up 33 bytes; a new-style object with one instance
variable takes up 86 bytes. This was measured as follows:
    import pickle

    class C(object): # Omit "(object)" for classic class
        pass

    x = C()
    x.foo = 42
    print len(pickle.dumps(x, 1))
The reasons for the bloat are complex, but are mostly caused by
the fact that new-style objects use __reduce__ in order to be
picklable at all. After ample consideration we've concluded that
the only way to reduce pickle sizes for new-style objects is to
add new opcodes to the pickle protocol. The net result is that
with the new protocol, the pickle size in the above example is 35
(two extra bytes are used at the start to indicate the protocol
version, although this isn't strictly necessary).
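For readers without a Python 2 interpreter, the comparison can be
approximated on a current Python 3, where every class is new-style
(so the classic-class figure cannot be reproduced). This is a rough
sketch, not the measurement used for this PEP; the exact byte counts
depend on the Python version and on the module in which C is
defined, so none are quoted here:

```python
import pickle

class C(object):  # on Python 3 this is always a new-style class
    pass

x = C()
x.foo = 42

# Pickle the same instance under protocols 0, 1 and 2 and record sizes.
sizes = dict((proto, len(pickle.dumps(x, proto))) for proto in (0, 1, 2))
for proto in sorted(sizes):
    print("protocol %d: %d bytes" % (proto, sizes[proto]))
```

The protocol 2 pickle should come out markedly smaller than the
protocol 1 pickle of the same object.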
Protocol versions
Previously, pickling (but not unpickling) has distinguished
between text mode and binary mode. By design, text mode is a
subset of binary mode, and unpicklers don't need to know in
advance whether an incoming pickle uses text mode or binary mode.
The virtual machine used for unpickling is the same regardless of
the mode; certain opcodes simply aren't used in text mode.
Retroactively, text mode is called protocol 0, and binary mode is
called protocol 1. The new protocol is called protocol 2. In the
tradition of pickling protocols, protocol 2 is a superset of
protocol 1. But just so that future pickling protocols aren't
required to be supersets of the oldest protocols, a new opcode is
inserted at the start of a protocol 2 pickle indicating that it is
using protocol 2.
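The protocol marker can be observed with pickletools. The following
is a small sketch for a current Python 3 (an assumption; this PEP
targets Python 2.3), and first_opcode is a hypothetical helper
introduced only for this illustration:

```python
import pickle
import pickletools

def first_opcode(data):
    """Return (name, argument) of the first opcode in a pickle."""
    opcode, arg, _pos = next(pickletools.genops(data))
    return opcode.name, arg

value = [1, 2, 3]
for proto in (0, 1, 2):
    # Only the protocol 2 pickle starts with the PROTO opcode.
    print(proto, first_opcode(pickle.dumps(value, proto)))
```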
Several functions, methods and constructors used for pickling used
to take a positional argument named 'bin' which was a flag,
defaulting to 0, indicating binary mode. This argument is renamed
to 'proto' and now gives the protocol number, defaulting to 0.
It so happens that passing 2 for the 'bin' argument in previous
Python versions had the same effect as passing 1. Nevertheless, a
special case is added here: passing a negative number selects the
highest protocol version supported by a particular implementation.
This works in previous Python versions, too.
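As a brief sketch of the negative-number convention, assuming a
current Python 3 (where the highest protocol is well above 2), the
protocol actually chosen can be read back from the second byte of
the pickle, which follows the one-byte PROTO opcode (0x80):

```python
import pickle

data = pickle.dumps({"answer": 42}, -1)

# For protocol 2 and later the pickle starts with the PROTO opcode
# (0x80) followed by a single byte holding the protocol number.
selected = data[1]
print(selected == pickle.HIGHEST_PROTOCOL)
```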
The pickle.py module has supported passing the 'bin' value as a
keyword argument rather than a positional argument. (This is not
recommended, since cPickle only accepts positional arguments, but
it works...) Passing 'bin' as a keyword argument is deprecated,
and a PendingDeprecationWarning is issued in this case. You have
to invoke the Python interpreter with -Wa or a variation on that
to see PendingDeprecationWarning messages. In Python 2.4, the
warning class may be upgraded to DeprecationWarning.
Security issues
In previous versions of Python, unpickling would do a "safety
check" on certain operations, refusing to call functions or
constructors that weren't marked as "safe for unpickling" by
either having an attribute __safe_for_unpickling__ set to 1, or by
being registered in a global registry, copy_reg.safe_constructors.
This feature gives a false sense of security: nobody has ever done
the necessary, extensive, code audit to prove that unpickling
untrusted pickles cannot invoke unwanted code, and in fact bugs in
the Python 2.2 pickle.py module make it easy to circumvent these
security measures.
We firmly believe that, on the Internet, it is better to know that
you are using an insecure protocol than to trust a protocol to be
secure whose implementation hasn't been thoroughly checked. Even
high quality implementations of widely used protocols are
routinely found flawed; Python's pickle implementation simply
cannot make such guarantees without a much larger time investment.
Therefore, as of Python 2.3, all safety checks on unpickling are
officially removed, and replaced with this warning:
*** Do not unpickle data received from an untrusted or
unauthenticated source ***
The same warning applies to previous Python versions, despite the
presence of safety checks there.
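The reason is easy to demonstrate: a pickle names a callable and
its arguments, and the unpickler calls it. The sketch below is a
deliberately harmless illustration for a current Python 3; Probe is
a hypothetical class, and len stands in for whatever callable a
hostile pickle might name instead:

```python
import pickle

class Probe(object):
    """Hypothetical class whose pickle asks for len("abc") to be called."""
    def __reduce__(self):
        return (len, ("abc",))

data = pickle.dumps(Probe(), 2)

# Unpickling does not rebuild a Probe; it calls len("abc").
result = pickle.loads(data)
print(result)
```

The value that comes back is whatever the named callable returned,
not an instance of Probe.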
Extended __reduce__ API
There are several APIs that a class can use to control pickling.
Perhaps the most popular of these are __getstate__ and
__setstate__; but the most powerful one is __reduce__. (There's
also __getinitargs__, and we're adding __getnewargs__ below.)
There are two ways to provide __reduce__ functionality: a class
can implement a __reduce__ method, or a reduce function can be
declared in copy_reg (copy_reg.dispatch_table maps classes to
functions). The return values are interpreted exactly the same,
though, and we'll refer to these collectively as __reduce__.
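As an illustration of the second route, here is a minimal sketch
for a current Python 3, where the module is named copyreg rather
than copy_reg. Money and reduce_money are hypothetical names
invented for this example:

```python
import copyreg
import pickle

class Money(object):
    """Hypothetical class that defines no pickling hooks of its own."""
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

def reduce_money(m):
    # Same return convention as a __reduce__ method:
    # (function, arguments).
    return (Money, (m.amount, m.currency))

# Register the reduce function; this adds an entry to
# copyreg.dispatch_table.
copyreg.pickle(Money, reduce_money)

copy = pickle.loads(pickle.dumps(Money(5, "EUR"), 2))
print(copy.amount, copy.currency)
```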
IMPORTANT: a classic class cannot provide __reduce__
functionality. It must use __getinitargs__ and/or __getstate__ to
customize pickling. These are described below.
__reduce__ must return either a string or a tuple. If it returns
a string, the object's state is not pickled; instead, the pickle
stores a reference to an equivalent object, looked up by name.
Surprisingly, the string returned by __reduce__ should be the
object's local name (relative to its module); the pickle module
searches the module namespace to determine the object's module.
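The string form suits module-level singletons. A minimal sketch
for a current Python 3 follows; _Missing and MISSING are
hypothetical names, and the example assumes it is run as a script
or importable module so that the name can be found again in the
module namespace:

```python
import pickle

class _Missing(object):
    """Hypothetical sentinel type with exactly one instance."""
    def __reduce__(self):
        # Local name of the global that holds the instance.
        return "MISSING"

MISSING = _Missing()

copy = pickle.loads(pickle.dumps(MISSING, 2))
print(copy is MISSING)
```

Because only the name is pickled, unpickling yields the existing
object rather than a new copy.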
The rest of this section is concerned with the tuple returned by
__reduce__. It is a variable length tuple. Only the first two
items (function and arguments) are required. The remaining items
may be None or left off from the end. The last two items are new
in this PEP. The items are, in order:
function    A callable object (not necessarily a function) called
            to create the initial version of the object; state
            may be added to the object later to fully reconstruct
            the pickled state. This function must itself be
            picklable. See the section about __newobj__ for a
            special case (new in this PEP) here.
arguments   A tuple giving the argument list for the function.
            As a special case, designed for Zope 2's
            ExtensionClass, this may be None; in that case,
            function should be a class or type, and
            function.__basicnew__() is called to create the
            initial version of the object. This exception is
            deprecated.
state       Additional state. If this is not None, the state is
            pickled, and obj.__setstate__(state) will be called
            when unpickling. If no __setstate__ method is
            defined, a default implementation is provided, which
            assumes that state is a dictionary mapping instance
            variable names to their values, and calls
            obj.__dict__.update(state) or, if the update() call
            fails, "for k, v in state.items(): obj[k] = v".
listitems   New in this PEP. If this is not None, it should be
            an iterator (not a sequence!) yielding successive
            list items. These list items will be pickled, and
            appended to the object using either obj.append(item)
            or obj.extend(list_of_items). This is primarily used
            for list subclasses, but may be used by other classes
            as long as they have append() and extend() methods
            with the appropriate signature. (Whether append() or
            extend() is used depends on which pickle protocol
            version is used as well as the number of items to
            append, so both must be supported.)
dictitems   New in this PEP. If this is not None, it should be
            an iterator (not a sequence!) yielding successive
            dictionary items, which should be tuples of the form
            (key, value). These items will be pickled, and
            stored to the object using obj[key] = value. This is
            primarily used for dict subclasses, but may be used
            by other classes as long as they implement
            __setitem__.
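The five items can be seen working together in the following
sketch, written for a current Python 3. Bag is a hypothetical
container invented for this example; it is neither a list nor a
dict subclass, but provides append(), extend() and __setitem__ as
required above:

```python
import pickle

class Bag(object):
    """Hypothetical container with a list-like and a dict-like part."""
    def __init__(self):
        self.label = None
        self.items = []
        self.index = {}

    def append(self, item):
        self.items.append(item)

    def extend(self, items):
        self.items.extend(items)

    def __setitem__(self, key, value):
        self.index[key] = value

    def __reduce__(self):
        return (
            Bag,                       # function: creates an empty Bag
            (),                        # arguments for the function
            {"label": self.label},     # state: merged into __dict__,
                                       # since no __setstate__ exists
            iter(self.items),          # listitems: append()/extend()
            iter(self.index.items()),  # dictitems: obj[key] = value
        )

bag = Bag()
bag.label = "demo"
bag.extend([1, 2, 3])
bag["a"] = 1

copy = pickle.loads(pickle.dumps(bag, 2))
print(copy.label, copy.items, copy.index)
```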
Note: in Python 2.2 and before, when using cPickle, state would be
pickled if present even if it is None; the only safe way to avoid
the __setstate__ call was to return a two-tuple from __reduce__.
(But pickle.py would not pickle state if it was None.) In Python
2.3, __setstate__ will never be called when __reduce__ returns a
state with value None.
A __reduce__ implementation that needs to work both under Python
2.2 and under Python 2.3 could check the variable
pickle.format_version to determine whether to use the listitems
and dictitems features. If this value is >= "2.0" then they are
supported. If not, any list or dict items should be incorporated
somehow in the 'state' return value; the __setstate__ method
should be prepared to accept list or dict items as part of the
state (how this is done is up to the application).
XXX Refactoring needed
The following sections should really be reorganized according to
the following cases:
1. classic classes, all protocols
2. new-style classes, protocols 0 and 1
3. new-style classes, protocol 2
The __newobj__ unpickling function
When the unpickling function returned by __reduce__ (the first
item of the returned tuple) has the name __newobj__, something
special happens for pickle protocol 2. An unpickling function
named __newobj__ is assumed to have the following semantics:
    def __newobj__(cls, *args):
        return cls.__new__(cls, *args)
Pickle protocol 2 special-cases an unpickling function with this
name, and emits a pickling opcode that, given 'cls' and 'args',
will return cls.__new__(cls, *args) without also pickling a
reference to __newobj__. This is the main reason why protocol 2
pickles are so much smaller than classic pickles. Of course, the
pickling code cannot verify that a function named __newobj__
actually has the expected semantics. If you use an unpickling
function named __newobj__ that returns something different, you
deserve what you get.
It is safe to use this feature under Python 2.2; there's nothing
in the recommended implementation of __newobj__ that depends on
Python 2.3.
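The effect of the special case can be checked by listing the
opcodes of a protocol 2 pickle. The sketch below assumes a current
Python 3; Point is a hypothetical class, and __newobj__ is defined
locally with the semantics given above:

```python
import pickle
import pickletools

def __newobj__(cls, *args):
    return cls.__new__(cls, *args)

class Point(object):
    """Hypothetical class that is fully initialized by __new__."""
    def __new__(cls, x=0, y=0):
        self = object.__new__(cls)
        self.x, self.y = x, y
        return self

    def __reduce__(self):
        # The first argument is the class; the rest go to __new__.
        return (__newobj__, (Point, self.x, self.y))

data = pickle.dumps(Point(3, 4), 2)
opcodes = [op.name for op, _arg, _pos in pickletools.genops(data)]
copy = pickle.loads(data)
print("NEWOBJ" in opcodes, copy.x, copy.y)
```

The pickle contains a NEWOBJ opcode and no reference to the
function __newobj__ itself.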
The __getstate__ and __setstate__ methods
When there is no __reduce__ for an object, the primary way to
customize pickling is by specifying __getstate__ and/or
__setstate__ methods. These are supported for classic classes as
well as for new-style classes for which no __reduce__ exists.
When __reduce__ exists, __getstate__ is not called (unless your
__reduce__ implementation calls it), but __setstate__ will be
called with the third item from the tuple returned by __reduce__,
if not None.
There's a subtle difference between classic and new-style classes
here: if a classic class's __getstate__ returns None,
self.__setstate__(None) will be called as part of unpickling. But
if a new-style class's __getstate__ returns None, its __setstate__
won't be called at all as part of unpickling.
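The new-style half of this rule can be checked with a short sketch
on a current Python 3 (the classic-class half cannot, since Python
3 has no classic classes). Stateless is a hypothetical class that
counts calls to its __setstate__:

```python
import pickle

class Stateless(object):
    """Hypothetical class whose __getstate__ returns None."""
    setstate_calls = 0

    def __getstate__(self):
        return None  # nothing worth saving

    def __setstate__(self, state):
        Stateless.setstate_calls += 1

copy = pickle.loads(pickle.dumps(Stateless(), 2))
print(Stateless.setstate_calls)
```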
The __getstate__ method is supposed to return a picklable version
of an object's state that does not reference the object itself.
If no __getstate__ method exists, a default state is assumed.
There are several cases:
- For a classic class, the default state is self.__dict__.
- For a new-style class that has an instance __dict__ and no
__slots__, the default state is self.__dict__.
- For a new-style class that has no instance __dict__ and no
__slots__, the default state is None.
- For a new-style class that has an instance __dict__ and
__slots__, the default state is a tuple consisting of two
dictionaries: the first being self.__dict__, and the second
being a dictionary mapping slot names to slot values. Only
slots that have a value are included in the latter.
- For a new-style class that has __slots__ and no instance
__dict__, the default state is a tuple whose first item is None
and whose second item is a dictionary mapping slot names to slot
values, as described in the previous bullet.
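The new-style cases above can be inspected without pickling
anything, by looking at the third item of the tuple that
__reduce_ex__ returns for protocol 2. This is a sketch for a
current Python 3; Plain, Slotted, Both and default_state are
hypothetical names:

```python
class Plain(object):
    pass

class Slotted(object):
    __slots__ = ("x", "y")

class Both(Slotted):  # inherits the slots, gains an instance __dict__
    pass

def default_state(obj):
    """Third item of the protocol 2 reduce tuple: the state."""
    return obj.__reduce_ex__(2)[2]

plain = Plain()
plain.a = 1

slotted = Slotted()
slotted.x = 1          # slot y is left without a value

both = Both()
both.x = 1
both.z = 2

print(default_state(plain))
print(default_state(slotted))
print(default_state(both))
```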
The __setstate__ method should take one argument; it will be
called with the value returned by __getstate__ or with the default
state described above if no __getstate__ method is defined.
If no __setstate__ method exists, a default implementation is
provided that can handle the state returned by the default
__getstate__.
It is fine if a class implements one of these but not the other,
as long as it is compatible with the default version.
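A common use of the pair is to leave a transient attribute out of
the pickle and rebuild it on unpickling. The following is a minimal
sketch for a current Python 3; Connection and its attributes are
hypothetical:

```python
import pickle

class Connection(object):
    """Hypothetical class with one attribute that is not pickled."""
    def __init__(self, host):
        self.host = host
        self.cache = {}  # transient; rebuilt on demand

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["cache"]  # leave the transient part out
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.cache = {}     # start again with an empty cache

conn = Connection("example.org")
conn.cache["key"] = "value"

copy = pickle.loads(pickle.dumps(conn, 2))
print(copy.host, copy.cache)
```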
New-style classes inherit a default __reduce__ implementation
from the ultimate base class 'object'. This implementation is not
used for protocol 2, and then the last four bullets above apply. For
protocols 0 and 1, the default implementation looks for a
__getstate__ method, and if none exists, it uses a simpler default
strategy:
- If there is an instance __dict__, the state is self.__dict__.
- Otherwise, the state is None (and __setstate__ will not be
called).
Note that this strategy ignores slots. New-style classes that
define slots and don't define __getstate__ in the same class that
defines the slots automatically have a __getstate__ method added
that raises TypeError. Protocol 2 ignores this __getstate__
method (recognized by the specific text of the error message).
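The visible outcome, that a class with slots and no __getstate__
can be pickled only with protocol 2, still holds on a current
Python 3, although the way the TypeError is produced there may
differ from the mechanism described above. In this sketch Slotted
and can_reduce are hypothetical names:

```python
class Slotted(object):
    __slots__ = ("x",)

def can_reduce(obj, proto):
    """Report whether obj can be reduced for the given protocol."""
    try:
        obj.__reduce_ex__(proto)
    except TypeError:
        return False
    return True

s = Slotted()
s.x = 1

for proto in (0, 1, 2):
    print(proto, can_reduce(s, proto))
```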
The __getinitargs__ and __getnewargs__ methods
The __setstate__ method (or its default implementation) requires
that a new object already exists so that its __setstate__ method
can be called. The point is to create a new object that isn't
fully initialized; in particular, the class's __init__ method
should not be called if possible.
The way this is done differs between classic and new-style
classes.
For classic classes, these are the possibilities:
- Normally, the following trick is used: create an instance of a
trivial classic class (one without any methods or instance
variables) and then use __class__ assignment to change its class
to the desired class. This creates an instance of the desired
class with an empty __dict__ whose __init__ has not been called.
- However, if the class has a method named __getinitargs__, the
above trick is not used, and a class instance is created by
using the tuple returned by __getinitargs__ as an argument list
to the class constructor. This is done even if __getinitargs__
returns an empty tuple -- a __getinitargs__ method that returns
() is not equivalent to not having __getinitargs__ at all.
__getinitargs__ *must* return a tuple.
- In restricted execution mode, the trick from the first bullet
doesn't work; in this case, the class constructor is called with
an empty argument list if no __getinitargs__ method exists.
This means that in order for instances of a classic class to be
unpickled in restricted mode, the class must either implement
__getinitargs__ or its constructor (i.e., its __init__ method)
must be callable without arguments.
For new-style classes, these are the possibilities:
- When using protocol 0 or 1, a default __reduce__ implementation
is normally inherited from the ultimate base class
'object'. This implementation finds the nearest base class that
is implemented in C (either as a built-in type or as a type
defined by an extension class). Calling this base class B and
the class of the object to be pickled C, the new object is
created at unpickling time using the following code:
    obj = B.__new__(C, state)
    B.__init__(obj, state)
where state is a value computed at pickling time as follows:
    state = B(obj)
This only works when B is not C, and only for certain classes
B. It does work for the following built-in classes: int, long,
float, complex, str, unicode, tuple, list, dict; and this is its
main redeeming factor.
- When using protocol 2, the default __reduce__ implementation
inherited from 'object' is ignored. Instead, a new pickling
opcode is generated that causes a new object to be created as
follows:
    obj = C.__new__(C, *args)
where args is either the empty tuple, or the tuple returned by
the __getnewargs__ method, if defined.
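The protocol 2 case can be sketched as follows on a current Python
3. Version is a hypothetical tuple subclass whose __new__ takes two
arguments, so it needs __getnewargs__ to be re-created; the counter
shows that __init__ is not called during unpickling:

```python
import pickle

class Version(tuple):
    """Hypothetical immutable type built from two integers."""
    init_calls = 0

    def __new__(cls, major, minor):
        return tuple.__new__(cls, (major, minor))

    def __init__(self, major, minor):
        Version.init_calls += 1

    def __getnewargs__(self):
        # Arguments passed to Version.__new__ when unpickling.
        return (self[0], self[1])

v = Version(2, 3)  # calls __new__ and then __init__
copy = pickle.loads(pickle.dumps(v, 2))
print(copy, type(copy).__name__, Version.init_calls)
```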
TBD
The rest of this PEP is still under construction!
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: