772 lines
34 KiB
Plaintext
772 lines
34 KiB
Plaintext
PEP: 307
|
||
Title: Extensions to the pickle protocol
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Guido van Rossum, Tim Peters
|
||
Status: Draft
|
||
Type: Standards Track
|
||
Content-Type: text/plain
|
||
Created: 31-Jan-2003
|
||
Post-History: 7-Feb-2003
|
||
|
||
|
||
Introduction
|
||
|
||
Pickling new-style objects in Python 2.2 is done somewhat clumsily
|
||
and causes pickle size to bloat compared to classic class
|
||
instances. This PEP documents a new pickle protocol in Python 2.3
|
||
that takes care of this and many other pickle issues.
|
||
|
||
There are two sides to specifying a new pickle protocol: the byte
|
||
stream constituting pickled data must be specified, and the
|
||
interface between objects and the pickling and unpickling engines
|
||
must be specified. This PEP focuses on API issues, although it
|
||
may occasionally touch on byte stream format details to motivate a
|
||
choice. The pickle byte stream format is documented formally by
|
||
the standard library module pickletools.py (already checked into
|
||
CVS for Python 2.3).
|
||
|
||
This PEP attempts to fully document the interface between pickled
|
||
objects and the pickling process, highlighting additions by
|
||
specifying "new in this PEP". (The interface to invoke pickling
|
||
or unpickling is not covered fully, except for the changes to the
|
||
API for specifying the pickling protocol to picklers.)
|
||
|
||
|
||
Motivation
|
||
|
||
Pickling new-style objects causes serious pickle bloat. For
|
||
example,
|
||
|
||
class C(object): # Omit "(object)" for classic class
|
||
pass
|
||
x = C()
|
||
x.foo = 42
|
||
print len(pickle.dumps(x, 1))
|
||
|
||
The binary pickle for the classic object consumed 33 bytes, and for
|
||
the new-style object 86 bytes.
|
||
|
||
The reasons for the bloat are complex, but are mostly caused by
|
||
the fact that new-style objects use __reduce__ in order to be
|
||
picklable at all. After ample consideration we've concluded that
|
||
the only way to reduce pickle sizes for new-style objects is to
|
||
add new opcodes to the pickle protocol. The net result is that
|
||
with the new protocol, the pickle size in the above example is 35
|
||
(two extra bytes are used at the start to indicate the protocol
|
||
version, although this isn't strictly necessary).
|
||
|
||
|
||
Protocol versions
|
||
|
||
Previously, pickling (but not unpickling) distinguished between
|
||
text mode and binary mode. By design, binary mode is a
|
||
superset of text mode, and unpicklers don't need to know in
|
||
advance whether an incoming pickle uses text mode or binary mode.
|
||
The virtual machine used for unpickling is the same regardless of
|
||
the mode; certain opcodes simply aren't used in text mode.
|
||
|
||
Retroactively, text mode is now called protocol 0, and binary mode
|
||
protocol 1. The new protocol is called protocol 2. In the
|
||
tradition of pickling protocols, protocol 2 is a superset of
|
||
protocol 1. But just so that future pickling protocols aren't
|
||
required to be supersets of the oldest protocols, a new opcode is
|
||
inserted at the start of a protocol 2 pickle indicating that it is
|
||
using protocol 2. To date, each release of Python has been able to
|
||
read pickles written by all previous releases. Of course pickles
|
||
written under protocol N can't be read by versions of Python
|
||
earlier than the one that introduced protocol N.
|
||
|
||
Several functions, methods and constructors used for pickling used
|
||
to take a positional argument named 'bin' which was a flag,
|
||
defaulting to 0, indicating binary mode. This argument is renamed
|
||
to 'protocol' and now gives the protocol number, still defaulting
|
||
to 0.
|
||
|
||
It so happens that passing 2 for the 'bin' argument in previous
|
||
Python versions had the same effect as passing 1. Nevertheless, a
|
||
special case is added here: passing a negative number selects the
|
||
highest protocol version supported by a particular implementation.
|
||
This works in previous Python versions, too, and so can be used to
|
||
select the highest protocol available in a way that's both backward
|
||
and forward compatible. In addition, a new module constant
|
||
HIGHEST_PROTOCOL is supplied by both pickle and cPickle, equal to
|
||
the highest protocol number the module can read. This is cleaner
|
||
than passing -1, but cannot be used before Python 2.3.
|
||
|
||
The pickle.py module has supported passing the 'bin' value as a
|
||
keyword argument rather than a positional argument. (This is not
|
||
recommended, since cPickle only accepts positional arguments, but
|
||
it works...) Passing 'bin' as a keyword argument is deprecated,
|
||
and a PendingDeprecationWarning is issued in this case. You have
|
||
to invoke the Python interpreter with -Wa or a variation on that
|
||
to see PendingDeprecationWarning messages. In Python 2.4, the
|
||
warning class may be upgraded to DeprecationWarning.
|
||
|
||
|
||
Security issues
|
||
|
||
In previous versions of Python, unpickling would do a "safety
|
||
check" on certain operations, refusing to call functions or
|
||
constructors that weren't marked as "safe for unpickling" by
|
||
either having an attribute __safe_for_unpickling__ set to 1, or by
|
||
being registered in a global registry, copy_reg.safe_constructors.
|
||
|
||
This feature gives a false sense of security: nobody has ever done
|
||
the necessary, extensive, code audit to prove that unpickling
|
||
untrusted pickles cannot invoke unwanted code, and in fact bugs in
|
||
the Python 2.2 pickle.py module make it easy to circumvent these
|
||
security measures.
|
||
|
||
We firmly believe that, on the Internet, it is better to know that
|
||
you are using an insecure protocol than to trust a protocol to be
|
||
secure whose implementation hasn't been thoroughly checked. Even
|
||
high quality implementations of widely used protocols are
|
||
routinely found flawed; Python's pickle implementation simply
|
||
cannot make such guarantees without a much larger time investment.
|
||
Therefore, as of Python 2.3, all safety checks on unpickling are
|
||
officially removed, and replaced with this warning:
|
||
|
||
*** Do not unpickle data received from an untrusted or
|
||
unauthenticated source ***
|
||
|
||
The same warning applies to previous Python versions, despite the
|
||
presence of safety checks there.
|
||
|
||
|
||
Extended __reduce__ API
|
||
|
||
There are several APIs that a class can use to control pickling.
|
||
Perhaps the most popular of these are __getstate__ and
|
||
__setstate__; but the most powerful one is __reduce__. (There's
|
||
also __getinitargs__, and we're adding __getnewargs__ below.)
|
||
|
||
There are several ways to provide __reduce__ functionality: a
|
||
class can implement a __reduce__ method or a __reduce_ex__ method
|
||
(see next section), or a reduce function can be declared in
|
||
copy_reg (copy_reg.dispatch_table maps classes to functions). The
|
||
return values are interpreted exactly the same, though, and we'll
|
||
refer to these collectively as __reduce__.
|
||
|
||
IMPORTANT: pickling of classic class instances does not look for a
|
||
__reduce__ or __reduce_ex__ method or a reduce function in the
|
||
copy_reg dispatch table, so that a classic class cannot provide
|
||
__reduce__ functionality in the sense intended here. A classic
|
||
class must use __getinitargs__ and/or __gestate__ to customize
|
||
pickling. These are described below.
|
||
|
||
__reduce__ must return either a string or a tuple. If it returns
|
||
a string, this is an object whose state is not to be pickled, but
|
||
instead a reference to an equivalent object referenced by name.
|
||
Surprisingly, the string returned by __reduce__ should be the
|
||
object's local name (relative to its module); the pickle module
|
||
searches the module namespace to determine the object's module.
|
||
|
||
The rest of this section is concerned with the tuple returned by
|
||
__reduce__. It is a variable size tuple, of length 2 through 5.
|
||
The first two items (function and arguments) are required. The
|
||
remaining items are optional and may be left off from the end;
|
||
giving None for the value of an optional item acts the same as
|
||
leaving it off. The last two items are new in this PEP. The items
|
||
are, in order:
|
||
|
||
function Required.
|
||
A callable object (not necessarily a function) called
|
||
to create the initial version of the object; state
|
||
may be added to the object later to fully reconstruct
|
||
the pickled state. This function must itself be
|
||
picklable. See the section about __newobj__ for a
|
||
special case (new in this PEP) here.
|
||
|
||
arguments Required.
|
||
A tuple giving the argument list for the function.
|
||
As a special case, designed for Zope 2's
|
||
ExtensionClass, this may be None; in that case,
|
||
function should be a class or type, and
|
||
function.__basicnew__() is called to create the
|
||
initial version of the object. This exception is
|
||
deprecated.
|
||
|
||
Unpickling invokes function(*arguments) to create an initial object,
|
||
called obj below. If the remaining items are left off, that's the end
|
||
of unpickling for this object and obj is the result. Else obj is
|
||
modified at unpickling time by each item specified, as follows.
|
||
|
||
state Optional.
|
||
Additional state. If this is not None, the state is
|
||
pickled, and obj.__setstate__(state) will be called
|
||
when unpickling. If no __setstate__ method is
|
||
defined, a default implementation is provided, which
|
||
assumes that state is a dictionary mapping instance
|
||
variable names to their values. The default
|
||
implementation calls
|
||
|
||
obj.__dict__.update(state)
|
||
|
||
or, if the update() call fails,
|
||
|
||
for k, v in state.items():
|
||
setattr(obj, k, v)
|
||
|
||
listitems Optional, and new in this PEP.
|
||
If this is not None, it should be an iterator (not a
|
||
sequence!) yielding successive list items. These list
|
||
items will be pickled, and appended to the object using
|
||
either obj.append(item) or obj.extend(list_of_items).
|
||
This is primarily used for list subclasses, but may
|
||
be used by other classes as long as they have append()
|
||
and extend() methods with the appropriate signature.
|
||
(Whether append() or extend() is used depend on which
|
||
pickle protocol version is used as well as the number
|
||
of items to append, so both must be supported.)
|
||
|
||
dictitems Optional, and new in this PEP.
|
||
If this is not None, it should be an iterator (not a
|
||
sequence!) yielding successive dictionary items, which
|
||
should be tuples of the form (key, value). These items
|
||
will be pickled, and stored to the object using
|
||
obj[key] = value. This is primarily used for dict
|
||
subclasses, but may be used by other classes as long
|
||
as they implement __setitem__.
|
||
|
||
Note: in Python 2.2 and before, when using cPickle, state would be
|
||
pickled if present even if it is None; the only safe way to avoid
|
||
the __setstate__ call was to return a two-tuple from __reduce__.
|
||
(But pickle.py would not pickle state if it was None.) In Python
|
||
2.3, __setstate__ will never be called at unpickling time when
|
||
__reduce__ returns a state with value None at pickling time.
|
||
|
||
A __reduce__ implementation that needs to work both under Python
|
||
2.2 and under Python 2.3 could check the variable
|
||
pickle.format_version to determine whether to use the listitems
|
||
and dictitems features. If this value is >= "2.0" then they are
|
||
supported. If not, any list or dict items should be incorporated
|
||
somehow in the 'state' return value, and the __setstate__ method
|
||
should be prepared to accept list or dict items as part of the
|
||
state (how this is done is up to the application).
|
||
|
||
|
||
The __reduce_ex__ API
|
||
|
||
It is sometimes useful to know the protocol version when
|
||
implementing __reduce__. This can be done by implementing a
|
||
method named __reduce_ex__ instead of __reduce__. __reduce_ex__,
|
||
when it exists, is called in preference over __reduce__ (you may
|
||
still provide __reduce__ for backwards compatibility). The
|
||
__reduce_ex__ method will be called with a single integer
|
||
argument, the protocol version.
|
||
|
||
The 'object' class implements both __reduce__ and __reduce_ex__;
|
||
however, if a subclass overrides __reduce__ but not __reduce_ex__,
|
||
the __reduce_ex__ implementation detects this and calls
|
||
__reduce__.
|
||
|
||
|
||
Customizing pickling absent a __reduce__ implementation
|
||
|
||
If no __reduce__ implementation is available for a particular
|
||
class, there are three cases that need to be considered
|
||
separately, because they are handled differently:
|
||
|
||
1. classic class instances, all protocols
|
||
|
||
2. new-style class instances, protocols 0 and 1
|
||
|
||
3. new-style class instances, protocol 2
|
||
|
||
Types implemented in C are considered new-style classes. However,
|
||
except for the common built-in types, these need to provide a
|
||
__reduce__ implementation in order to be picklable with protocols
|
||
0 or 1. Protocol 2 supports built-in types providing
|
||
__getnewargs__, __getstate__ and __setstate__ as well.
|
||
|
||
|
||
Case 1: pickling classic class instances
|
||
|
||
This case is the same for all protocols, and is unchanged from
|
||
Python 2.1.
|
||
|
||
For classic classes, __reduce__ is not used. Instead, classic
|
||
classes can customize their pickling by providing methods named
|
||
__getstate__, __setstate__ and __getinitargs__. Absent these, a
|
||
default pickling strategy for classic class instances is
|
||
implemented that works as long as all instance variables are
|
||
picklable. This default strategy is documented in terms of
|
||
default implementations of __getstate__ and __setstate__.
|
||
|
||
The primary ways to customize pickling of classic class instances
|
||
is by specifying __getstate__ and/or __setstate__ methods. It is
|
||
fine if a class implements one of these but not the other, as long
|
||
as it is compatible with the default version.
|
||
|
||
The __getstate__ method
|
||
|
||
The __getstate__ method should return a picklable value
|
||
representing the object's state without referencing the object
|
||
itself. If no __getstate__ method exists, a default
|
||
implementation is used that returns self.__dict__.
|
||
|
||
The __setstate__ method
|
||
|
||
The __setstate__ method should take one argument; it will be
|
||
called with the value returned by __getstate__ (or its default
|
||
implementation).
|
||
|
||
If no __setstate__ method exists, a default implementation is
|
||
provided that assumes the state is a dictionary mapping instance
|
||
variable names to values. The default implementation tries two
|
||
things:
|
||
|
||
- First, it tries to call self.__dict__.update(state).
|
||
|
||
- If the update() call fails with a RuntimeError exception, it
|
||
calls setattr(self, key, value) for each (key, value) pair in
|
||
the state dictionary. This only happens when unpickling in
|
||
restricted execution mode (see the rexec standard library
|
||
module).
|
||
|
||
The __getinitargs__ method
|
||
|
||
The __setstate__ method (or its default implementation) requires
|
||
that a new object already exists so that its __setstate__ method
|
||
can be called. The point is to create a new object that isn't
|
||
fully initialized; in particular, the class's __init__ method
|
||
should not be called if possible.
|
||
|
||
These are the possibilities:
|
||
|
||
- Normally, the following trick is used: create an instance of a
|
||
trivial classic class (one without any methods or instance
|
||
variables) and then use __class__ assignment to change its
|
||
class to the desired class. This creates an instance of the
|
||
desired class with an empty __dict__ whose __init__ has not
|
||
been called.
|
||
|
||
- However, if the class has a method named __getinitargs__, the
|
||
above trick is not used, and a class instance is created by
|
||
using the tuple returned by __getinitargs__ as an argument
|
||
list to the class constructor. This is done even if
|
||
__getinitargs__ returns an empty tuple -- a __getinitargs__
|
||
method that returns () is not equivalent to not having
|
||
__getinitargs__ at all. __getinitargs__ *must* return a
|
||
tuple.
|
||
|
||
- In restricted execution mode, the trick from the first bullet
|
||
doesn't work; in this case, the class constructor is called
|
||
with an empty argument list if no __getinitargs__ method
|
||
exists. This means that in order for a classic class to be
|
||
unpicklable in restricted execution mode, it must either
|
||
implement __getinitargs__ or its constructor (i.e., its
|
||
__init__ method) must be callable without arguments.
|
||
|
||
|
||
Case 2: pickling new-style class instances using protocols 0 or 1
|
||
|
||
This case is unchanged from Python 2.2. For better pickling of
|
||
new-style class instances when backwards compatibility is not an
|
||
issue, protocol 2 should be used; see case 3 below.
|
||
|
||
New-style classes, whether implemented in C or in Python, inherit
|
||
a default __reduce__ implementation from the universal base class
|
||
'object'.
|
||
|
||
This default __reduce__ implementation is not used for those
|
||
built-in types for which the pickle module has built-in support.
|
||
Here's a full list of those types:
|
||
|
||
- Concrete built-in types: NoneType, bool, int, float, complex,
|
||
str, unicode, tuple, list, dict. (Complex is supported by
|
||
virtue of a __reduce__ implementation registered in copy_reg.)
|
||
In Jython, PyStringMap is also included in this list.
|
||
|
||
- Classic instances.
|
||
|
||
- Classic class objects, Python function objects, built-in
|
||
function and method objects, and new-style type objects (==
|
||
new-style class objects). These are pickled by name, not by
|
||
value: at unpickling time, a reference to an object with the
|
||
same name (the fully qualified module name plus the variable
|
||
name in that module) is substituted.
|
||
|
||
The default __reduce__ implementation will fail at pickling time
|
||
for built-in types not mentioned above, and for new-style classes
|
||
implemented in C: if they want to be picklable, they must supply
|
||
a custom __reduce__ implementation under protocols 0 and 1.
|
||
|
||
For new-style classes implemented in Python, the default
|
||
__reduce__ implementation (copy_reg._reduce) works as follows:
|
||
|
||
Let D be the class on the object to be pickled. First, find the
|
||
nearest base class that is implemented in C (either as a
|
||
built-in type or as a type defined by an extension class). Call
|
||
this base class B, and the class of the object to be pickled D.
|
||
Unless B is the class 'object', instances of class B must be
|
||
picklable, either by having built-in support (as defined in the
|
||
above three bullet points), or by having a non-default
|
||
__reduce__ implementation. B must not be the same class as D
|
||
(if it were, it would mean that D is not implemented in Python).
|
||
|
||
The callable produced by the default __reduce__ is
|
||
copy_reg._reconstructor, and its arguments tuple is
|
||
(D, B, basestate), where basestate is None if B is the builtin
|
||
object class, and basestate is
|
||
|
||
basestate = B(obj)
|
||
|
||
if B is not the builtin object class. This is geared toward
|
||
pickling subclasses of builtin types, where, for example,
|
||
list(some_list_subclass_instance) produces "the list part" of
|
||
the list subclass instance.
|
||
|
||
The object is recreated at unpickling time by
|
||
copy_reg._reconstructor, like so:
|
||
|
||
obj = B.__new__(D, basestate)
|
||
B.__init__(obj, basestate)
|
||
|
||
Objects using the default __reduce__ implementation can customize
|
||
it by defining __getstate__ and/or __setstate__ methods. These
|
||
work almost the same as described for classic classes above, except
|
||
that if __getstate__ returns an object (of any type) whose value is
|
||
considered false (e.g. None, or a number that is zero, or an empty
|
||
sequence or mapping), this state is not pickled and __setstate__
|
||
will not be called at all. If __getstate__ exists and returns a
|
||
true value, that value becomes the third element of the tuple
|
||
returned by the default __reduce__, and at unpickling time the
|
||
value is passed to __setstate__. If __getstate__ does not exist,
|
||
but obj.__dict__ exists, then obj.__dict__ becomes the third
|
||
element of the tuple returned by __reduce__, and again at
|
||
unpickling time the value is passed to obj.__setstate__. The
|
||
default __setstate__ is the same as that for classic classes,
|
||
described above.
|
||
|
||
Note that this strategy ignores slots. Instances of new-style
|
||
classes that have slots but no __getstate__ method cannot be
|
||
pickled by protocols 0 and 1; the code explicitly checks for
|
||
this condition.
|
||
|
||
Note that pickling new-style class instances ignores __getinitargs__
|
||
if it exists (and under all protocols). __getinitargs__ is
|
||
useful only for classic classes.
|
||
|
||
|
||
Case 3: pickling new-style class instances using protocol 2
|
||
|
||
Under protocol 2, the default __reduce__ implementation inherited
|
||
from the 'object' base class is *ignored*. Instead, a different
|
||
default implementation is used, which allows more efficient
|
||
pickling of new-style class instances than possible with protocols
|
||
0 or 1, at the cost of backward incompatibility with Python 2.2
|
||
(meaning no more than that a protocol 2 pickle cannot be unpickled
|
||
before Python 2.3).
|
||
|
||
The customization uses three special methods: __getstate__,
|
||
__setstate__ and __getnewargs__ (note that __getinitargs__ is again
|
||
ignored). It is fine if a class implements one or more but not all
|
||
of these, as long as it is compatible with the default
|
||
implementations.
|
||
|
||
The __getstate__ method
|
||
|
||
The __getstate__ method should return a picklable value
|
||
representing the object's state without referencing the object
|
||
itself. If no __getstate__ method exists, a default
|
||
implementation is used which is described below.
|
||
|
||
There's a subtle difference between classic and new-style
|
||
classes here: if a classic class's __getstate__ returns None,
|
||
self.__setstate__(None) will be called as part of unpickling.
|
||
But if a new-style class's __getstate__ returns None, its
|
||
__setstate__ won't be called at all as part of unpickling.
|
||
|
||
If no __getstate__ method exists, a default state is computed.
|
||
There are several cases:
|
||
|
||
- For a new-style class that has no instance __dict__ and no
|
||
__slots__, the default state is None.
|
||
|
||
- For a new-style class that has an instance __dict__ and no
|
||
__slots__, the default state is self.__dict__.
|
||
|
||
- For a new-style class that has an instance __dict__ and
|
||
__slots__, the default state is a tuple consisting of two
|
||
dictionaries: self.__dict__, and a dictionary mapping slot
|
||
names to slot values. Only slots that have a value are
|
||
included in the latter.
|
||
|
||
- For a new-style class that has __slots__ and no instance
|
||
__dict__, the default state is a tuple whose first item is
|
||
None and whose second item is a dictionary mapping slot names
|
||
to slot values described in the previous bullet.
|
||
|
||
The __setstate__ method
|
||
|
||
The __setstate__ method should take one argument; it will be
|
||
called with the value returned by __getstate__ or with the
|
||
default state described above if no __getstate__ method is
|
||
defined.
|
||
|
||
If no __setstate__ method exists, a default implementation is
|
||
provided that can handle the state returned by the default
|
||
__getstate__, described above.
|
||
|
||
The __getnewargs__ method
|
||
|
||
Like for classic classes, the __setstate__ method (or its
|
||
default implementation) requires that a new object already
|
||
exists so that its __setstate__ method can be called.
|
||
|
||
In protocol 2, a new pickling opcode is used that causes a new
|
||
object to be created as follows:
|
||
|
||
obj = C.__new__(C, *args)
|
||
|
||
where C is the class of the pickled object, and args is either
|
||
the empty tuple, or the tuple returned by the __getnewargs__
|
||
method, if defined. __getnewargs__ must return a tuple. The
|
||
absence of a __getnewargs__ method is equivalent to the existence
|
||
of one that returns ().
|
||
|
||
|
||
The __newobj__ unpickling function
|
||
|
||
When the unpickling function returned by __reduce__ (the first
|
||
item of the returned tuple) has the name __newobj__, something
|
||
special happens for pickle protocol 2. An unpickling function
|
||
named __newobj__ is assumed to have the following semantics:
|
||
|
||
def __newobj__(cls, *args):
|
||
return cls.__new__(cls, *args)
|
||
|
||
Pickle protocol 2 special-cases an unpickling function with this
|
||
name, and emits a pickling opcode that, given 'cls' and 'args',
|
||
will return cls.__new__(cls, *args) without also pickling a
|
||
reference to __newobj__ (this is the same pickling opcode used by
|
||
protocol 2 for a new-style class instance when no __reduce__
|
||
implementation exists). This is the main reason why protocol 2
|
||
pickles are much smaller than classic pickles. Of course, the
|
||
pickling code cannot verify that a function named __newobj__
|
||
actually has the expected semantics. If you use an unpickling
|
||
function named __newobj__ that returns something different, you
|
||
deserve what you get.
|
||
|
||
It is safe to use this feature under Python 2.2; there's nothing
|
||
in the recommended implementation of __newobj__ that depends on
|
||
Python 2.3.
|
||
|
||
|
||
The extension registry
|
||
|
||
Protocol 2 supports a new mechanism to reduce the size of pickles.
|
||
|
||
When class instances (classic or new-style) are pickled, the full
|
||
name of the class (module name including package name, and class
|
||
name) is included in the pickle. Especially for applications that
|
||
generate many small pickles, this is a lot of overhead that has to
|
||
be repeated in each pickle. For large pickles, when using
|
||
protocol 1, repeated references to the same class name are
|
||
compressed using the "memo" feature; but each class name must be
|
||
spelled in full at least once per pickle, and this causes a lot of
|
||
overhead for small pickles.
|
||
|
||
The extension registry allows one to represent the most frequently
|
||
used names by small integers, which are pickled very efficiently:
|
||
an extension code in the range 1-255 requires only two bytes
|
||
including the opcode, one in the range 256-65535 requires only
|
||
three bytes including the opcode.
|
||
|
||
One of the design goals of the pickle protocol is to make pickles
|
||
"context-free": as long as you have installed the modules
|
||
containing the classes referenced by a pickle, you can unpickle
|
||
it, without needing to import any of those classes ahead of time.
|
||
|
||
Unbridled use of extension codes could jeopardize this desirable
|
||
property of pickles. Therefore, the main use of extension codes
|
||
is reserved for a set of codes to be standardized by some
|
||
standard-setting body. This being Python, the standard-setting
|
||
body is the PSF. From time to time, the PSF will decide on a
|
||
table mapping extension codes to class names (or occasionally
|
||
names of other global objects; functions are also eligible). This
|
||
table will be incorporated in the next Python release(s).
|
||
|
||
However, for some applications, like Zope, context-free pickles
|
||
are not a requirement, and waiting for the PSF to standardize
|
||
some codes may not be practical. Two solutions are offered for
|
||
such applications.
|
||
|
||
First, a few ranges of extension codes are reserved for private
|
||
use. Any application can register codes in these ranges.
|
||
Two applications exchanging pickles using codes in these ranges
|
||
need to have some out-of-band mechanism to agree on the mapping
|
||
between extension codes and names.
|
||
|
||
Second, some large Python projects (e.g. Zope) can be assigned a
|
||
range of extension codes outside the "private use" range that they
|
||
can assign as they see fit.
|
||
|
||
The extension registry is defined as a mapping between extension
|
||
codes and names. When an extension code is unpickled, it ends up
|
||
producing an object, but this object is gotten by interpreting the
|
||
name as a module name followed by a class (or function) name. The
|
||
mapping from names to objects is cached. It is quite possible
|
||
that certain names cannot be imported; that should not be a
|
||
problem as long as no pickle containing a reference to such names
|
||
has to be unpickled. (The same issue already exists for direct
|
||
references to such names in pickles that use protocols 0 or 1.)
|
||
|
||
Here is the proposed initial assigment of extension code ranges:
|
||
|
||
First Last Count Purpose
|
||
|
||
0 0 1 Reserved -- will never be used
|
||
1 127 127 Reserved for Python standard library
|
||
128 191 64 Reserved for Zope
|
||
192 239 48 Reserved for 3rd parties
|
||
240 255 16 Reserved for private use (will never be assigned)
|
||
256 MAX MAX Reserved for future assignment
|
||
|
||
MAX stands for 2147483647, or 2**31-1. This is a hard limitation
|
||
of the protocol as currently defined.
|
||
|
||
At the moment, no specific extension codes have been assigned yet.
|
||
|
||
|
||
Extension registry API
|
||
|
||
The extension registry is maintained as private global variables
|
||
in the copy_reg module. The following three functions are defined
|
||
in this module to manipulate the registry:
|
||
|
||
add_extension(module, name, code)
|
||
Register an extension code. The module and name arguments
|
||
must be strings; code must be an int in the inclusive range 1
|
||
through MAX. This must either register a new (module, name)
|
||
pair to a new code, or be a redundant repeat of a previous
|
||
call that was not canceled by a remove_extension() call; a
|
||
(module, name) pair may not be mapped to more than one code,
|
||
nor may a code be mapped to more than one (module, name)
|
||
pair. (XXX Aliasing may actually cause a problem for this
|
||
requirement; we'll see as we go.)
|
||
|
||
remove_extension(module, name, code)
|
||
Arguments are as for add_extension(). Remove a previously
|
||
registered mapping between (module, name) and code.
|
||
|
||
clear_extension_cache()
|
||
The implementation of extension codes may use a cache to speed
|
||
up loading objects that are named frequently. This cache can
|
||
be emptied (removing references to cached objects) by calling
|
||
this method.
|
||
|
||
Note that the API does not enforce the standard range assignments.
|
||
It is up to applications to respect these.
|
||
|
||
|
||
The copy module
|
||
|
||
Traditionally, the copy module has supported an extended subset of
|
||
the pickling APIs for customizing the copy() and deepcopy()
|
||
operations.
|
||
|
||
In particular, besides checking for a __copy__ or __deepcopy__
|
||
method, copy() and deepcopy() have always looked for __reduce__,
|
||
and for classic classes, have looked for __getinitargs__,
|
||
__getstate__ and __setstate__.
|
||
|
||
In Python 2.2, the default __reduce__ inherited from 'object' made
|
||
copying simple new-style classes possible, but slots and various
|
||
other special cases were not covered.
|
||
|
||
In Python 2.3, several changes are made to the copy module:
|
||
|
||
- __reduce_ex__ is supported (and always called with 2 as the
|
||
protocol version argument).
|
||
|
||
- The four- and five-argument return values of __reduce__ are
|
||
supported.
|
||
|
||
- Before looking for a __reduce__ method, the
|
||
copy_reg.dispatch_table is consulted, just like for pickling.
|
||
|
||
- When the __reduce__ method is inherited from object, it is
|
||
(unconditionally) replaced by a better one that uses the same
|
||
APIs as pickle protocol 2: __getnewargs__, __getstate__, and
|
||
__setstate__, handling list and dict subclasses, and handling
|
||
slots.
|
||
|
||
As a consequence of the latter change, certain new-style classes
|
||
that were copyable under Python 2.2 are not copyable under Python
|
||
2.3. (These classes are also not picklable using pickle protocol
|
||
2.) A minimal example of such a class:
|
||
|
||
class C(object):
|
||
def __new__(cls, a):
|
||
return object.__new__(cls)
|
||
|
||
The problem only occurs when __new__ is overridden and has at
|
||
least one mandatory argument in addition to the class argument.
|
||
|
||
To fix this, a __getnewargs__ method should be added that returns
|
||
the appropriate argument tuple (excluding the class).
|
||
|
||
|
||
Pickling Python longs
|
||
|
||
Pickling and unpickling Python longs takes time quadratic in
|
||
the number of digits, in protocols 0 and 1. Under protocol 2,
|
||
new opcodes support linear-time pickling and unpickling of longs.
|
||
|
||
|
||
Pickling bools
|
||
|
||
Protocol 2 introduces new opcodes for pickling True and False
|
||
directly. Under protocols 0 and 1, bools are pickled as integers,
|
||
using a trick in the representation of the integer in the pickle
|
||
so that an unpickler can recognize that a bool was intended. That
|
||
trick consumed 4 bytes per bool pickled. The new bool opcodes
|
||
consume 1 byte per bool.
|
||
|
||
|
||
Pickling small tuples
|
||
|
||
Protocol 2 introduces new opcodes for more-compact pickling of
|
||
tuples of lengths 1, 2 and 3. Protocol 1 previously introduced
|
||
an opcode for more-compact pickling of empty tuples.
|
||
|
||
|
||
Protocol identification
|
||
|
||
Protocol 2 introduces a new opcode, with which all protocol 2
|
||
pickles begin, identifying that the pickle is protocol 2.
|
||
Attempting to unpickle a protocol 2 pickle under older versions
|
||
of Python will therefore raise an "unknown opcode" exception
|
||
immediately.
|
||
|
||
|
||
Pickling of large lists and dicts
|
||
|
||
Protocol 1 pickles large lists and dicts "in one piece", which
|
||
minimizes pickle size, but requires that unpickling create a temp
|
||
object as large as the object being unpickled. Part of the
|
||
protocol 2 changes break large lists and dicts into pieces of no
|
||
more than 1000 elements each, so that unpickling needn't create
|
||
a temp object larger than needed to hold 1000 elements. This
|
||
isn't part of protocol 2, however: the opcodes produced are still
|
||
part of protocol 1. __reduce__ implementations that return the
|
||
optional new listitems or dictitems iterators also benefit from
|
||
this unpickling temp-space optimization.
|
||
|
||
|
||
Copyright
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
End:
|