PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Active
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: None


Introduction

    Pickling new-style objects in Python 2.2 is done somewhat clumsily
    and causes pickle size to bloat compared to classic class
    instances. This PEP documents a new pickle protocol that takes
    care of this and many other pickle issues.

    There are two sides to specifying a new pickle protocol: the byte
    stream constituting pickled data must be specified, and the
    interface between objects and the pickling and unpickling engines
    must be specified. This PEP focuses on API issues, although it
    may occasionally touch on byte stream format details to motivate
    a choice. The pickle byte stream format is documented formally by
    the standard library module pickletools.py (already checked into
    CVS for Python 2.3).

    This PEP attempts to fully document the interface between pickled
    objects and the pickling process, highlighting additions by
    specifying "new in this PEP". (The interface to invoke pickling
    or unpickling is not covered fully, except for the changes to the
    API for specifying the pickling protocol to picklers.)


Motivation

    Pickling new-style objects causes serious pickle bloat. For
    example, the binary pickle for a classic object with one instance
    variable takes up 33 bytes; a new-style object with one instance
    variable takes up 86 bytes. This was measured as follows:

        import pickle

        class C(object): # Omit "(object)" for classic class
            pass
        x = C()
        x.foo = 42
        print len(pickle.dumps(x, 1))  # 1 selects binary mode

    The reasons for the bloat are complex, but they mostly come down
    to the fact that new-style objects use __reduce__ in order to be
    picklable at all. After ample consideration we've concluded that
    the only way to reduce pickle sizes for new-style objects is to
    add new opcodes to the pickle protocol.
    The net result is that with the new protocol, the pickle size in
    the above example is 35 (two extra bytes are used at the start to
    indicate the protocol version, although this isn't strictly
    necessary).


Protocol versions

    Previously, pickling (but not unpickling) has distinguished
    between text mode and binary mode. By design, text mode is a
    subset of binary mode, and unpicklers don't need to know in
    advance whether an incoming pickle uses text mode or binary mode.
    The virtual machine used for unpickling is the same regardless of
    the mode; certain opcodes simply aren't used in text mode.

    Retroactively, text mode is called protocol 0, and binary mode is
    called protocol 1. The new protocol is called protocol 2. In the
    tradition of pickling protocols, protocol 2 is a superset of
    protocol 1. But just so that future pickling protocols aren't
    required to be supersets of the oldest protocols, a new opcode is
    inserted at the start of a protocol 2 pickle indicating that it
    is using protocol 2.

    Several functions, methods and constructors used for pickling
    used to take a positional argument named 'bin' which was a flag,
    defaulting to 0, indicating binary mode. This argument is renamed
    to 'proto' and now gives the protocol number, defaulting to 0.

    It so happens that passing 2 for the 'bin' argument in previous
    Python versions had the same effect as passing 1. Nevertheless, a
    special case is added here: passing a negative number selects the
    highest protocol version supported by a particular
    implementation. This works in previous Python versions, too.

    The pickle.py module has supported passing the 'bin' value as a
    keyword argument rather than a positional argument. (This is not
    recommended, since cPickle only accepts positional arguments, but
    it works...) Passing 'bin' as a keyword argument is deprecated,
    and a PendingDeprecationWarning is issued in this case. You have
    to invoke the Python interpreter with -Wa or a variation on that
    to see PendingDeprecationWarning messages.
    In Python 2.4, the warning class may be upgraded to
    DeprecationWarning.


Security issues

    In previous versions of Python, unpickling would do a "safety
    check" on certain operations, refusing to call functions or
    constructors that weren't marked as "safe for unpickling" by
    either having an attribute __safe_for_unpickling__ set to 1, or
    by being registered in a global registry,
    copy_reg.safe_constructors.

    This feature gives a false sense of security: nobody has ever
    done the necessary, extensive, code audit to prove that
    unpickling untrusted pickles cannot invoke unwanted code, and in
    fact bugs in the Python 2.2 pickle.py module make it easy to
    circumvent these security measures.

    We firmly believe that, on the Internet, it is better to know
    that you are using an insecure protocol than to trust a protocol
    to be secure whose implementation hasn't been thoroughly checked.
    Even high quality implementations of widely used protocols are
    routinely found flawed; Python's pickle implementation simply
    cannot make such guarantees without a much larger time
    investment. Therefore, as of Python 2.3, all safety checks on
    unpickling are officially removed, and replaced with this
    warning:

    *** Do not unpickle data received from an untrusted or
    unauthenticated source ***

    The same warning applies to previous Python versions, despite the
    presence of safety checks there.


Extended __reduce__ API

    There are several APIs that a class can use to control pickling.
    Perhaps the most popular of these are __getstate__ and
    __setstate__; but the most powerful one is __reduce__. (There's
    also __getinitargs__, and we're adding __getnewargs__ below.)

    There are two ways to provide __reduce__ functionality: a class
    can implement a __reduce__ method, or a reduce function can be
    declared in copy_reg (copy_reg.dispatch_table maps classes to
    functions). The return values are interpreted exactly the same,
    though, and we'll refer to these collectively as __reduce__.
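
    As a rough illustration of the two mechanisms just described, the
    sketch below gives one class a __reduce__ method and registers a
    reduce function for another class through copy_reg.pickle(). The
    names Point, Interval and reduce_interval are hypothetical and
    exist only for this example. The import fallback is an assumption
    made for readers on Python 3, where copy_reg is named copyreg.

```python
import pickle

try:
    import copy_reg  # module name used in this PEP (Python 2)
except ImportError:
    import copyreg as copy_reg  # assumption: Python 3 spelling


class Point(object):
    """Hypothetical class that provides its own __reduce__ method."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # (function, arguments): unpickling calls Point(self.x, self.y).
        return (Point, (self.x, self.y))


class Interval(object):
    """Hypothetical class that defines no __reduce__ method itself."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi


def reduce_interval(obj):
    # A reduce function uses the same return convention as __reduce__.
    return (Interval, (obj.lo, obj.hi))


# Declare the reduce function; this adds it to copy_reg.dispatch_table.
copy_reg.pickle(Interval, reduce_interval)

# Round-trip one instance of each class through the pickler.
p = pickle.loads(pickle.dumps(Point(1, 2)))
i = pickle.loads(pickle.dumps(Interval(3, 4)))
```

    In both cases unpickling calls the returned callable with the
    returned argument tuple, so the two approaches are
    interchangeable from the unpickler's point of view.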

    __reduce__ must return either a string or a tuple. If it returns
    a string, the object's state is not pickled; instead, the pickle
    stores a reference to an equivalent object, looked up by name.
    Surprisingly, the string returned by __reduce__ should be the
    object's local name (relative to its module); the pickle module
    searches the module namespace to determine the object's module.

    The rest of this section is concerned with the tuple returned by
    __reduce__. It is a variable length tuple. Only the first two
    items (function and arguments) are required. The remaining items
    may be None or left off from the end. The last two items are new
    in this PEP. The items are, in order:

    function     A callable object (not necessarily a function)
                 called to create the initial version of the object;
                 state may be added to the object later to fully
                 reconstruct the pickled state. This function must
                 itself be picklable. See the section about
                 __newobj__ for a special case (new in this PEP)
                 here.

    arguments    A tuple giving the argument list for the function.
                 As a special case, designed for Zope 2's
                 ExtensionClass, this may be None; in that case,
                 function should be a class or type, and
                 function.__basicnew__() is called to create the
                 initial version of the object. This exception is
                 deprecated.

    state        Additional state. If this is not None, the state is
                 pickled, and obj.__setstate__(state) will be called
                 when unpickling. If no __setstate__ method is
                 defined, a default implementation is provided, which
                 assumes that state is a dictionary mapping instance
                 variable names to their values, and calls
                 obj.__dict__.update(state) or "for k, v in
                 state.items(): setattr(obj, k, v)", if the update()
                 call fails.

    listitems    New in this PEP. If this is not None, it should be
                 an iterator (not a sequence!) yielding successive
                 list items. These list items will be pickled, and
                 appended to the object using either obj.append(item)
                 or obj.extend(list_of_items).
                 This is primarily used for list subclasses, but may
                 be used by other classes as long as they have
                 append() and extend() methods with the appropriate
                 signature. (Whether append() or extend() is used
                 depends on which pickle protocol version is used as
                 well as the number of items to append, so both must
                 be supported.)

    dictitems    New in this PEP. If this is not None, it should be
                 an iterator (not a sequence!) yielding successive
                 dictionary items, which should be tuples of the form
                 (key, value). These items will be pickled, and
                 stored to the object using obj[key] = value. This is
                 primarily used for dict subclasses, but may be used
                 by other classes as long as they implement
                 __setitem__.

    Note: in Python 2.2 and before, when using cPickle, state would
    be pickled if present even if it is None; the only safe way to
    avoid the __setstate__ call was to return a two-tuple from
    __reduce__. (But pickle.py would not pickle state if it was
    None.) In Python 2.3, __setstate__ will never be called when
    __reduce__ returns a state with value None.

    A __reduce__ implementation that needs to work both under Python
    2.2 and under Python 2.3 could check the variable
    pickle.format_version to determine whether to use the listitems
    and dictitems features. If this value is >= "2.0" then they are
    supported. If not, any list or dict items should be incorporated
    somehow in the 'state' return value; the __setstate__ method
    should be prepared to accept list or dict items as part of the
    state (how this is done is up to the application).


The __newobj__ unpickling function

    When the unpickling function returned by __reduce__ (the first
    item of the returned tuple) has the name __newobj__, something
    special happens for pickle protocol 2.
    An unpickling function named __newobj__ is assumed to have the
    following semantics:

        def __newobj__(cls, *args):
            return cls.__new__(cls, *args)

    Pickle protocol 2 special-cases an unpickling function with this
    name, and emits a pickling opcode that, given 'cls' and 'args',
    will return cls.__new__(cls, *args) without also pickling a
    reference to __newobj__. This is the main reason why protocol 2
    pickles are so much smaller than classic pickles. Of course, the
    pickling code cannot verify that a function named __newobj__
    actually has the expected semantics. If you use an unpickling
    function named __newobj__ that returns something different, you
    deserve what you get.

    It is safe to use this feature under Python 2.2; there's nothing
    in the recommended implementation of __newobj__ that depends on
    Python 2.3.


TBD

    The rest of this PEP is still under construction!


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: