PEP: 307
Title: Extensions to the pickle protocol
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum, Tim Peters
Status: Active
Type: Standards Track
Content-Type: text/plain
Created: 31-Jan-2003
Post-History: None


Introduction

    Pickling new-style objects in Python 2.2 is done somewhat clumsily
    and causes pickle size to bloat compared to classic class
    instances.  This PEP documents a new pickle protocol that takes
    care of this and many other pickle issues.

    There are two sides to specifying a new pickle protocol: the byte
    stream constituting pickled data must be specified, and the
    interface between objects and the pickling and unpickling engines
    must be specified.  This PEP focuses on API issues, although it
    may occasionally touch on byte stream format details to motivate a
    choice.  The pickle byte stream format is documented formally by
    the standard library module pickletools.py (already checked into
    CVS for Python 2.3).


Motivation

    Pickling new-style objects causes serious pickle bloat.  For
    example, the binary pickle for a classic object with one instance
    variable takes up 33 bytes; a new-style object with one instance
    variable takes up 86 bytes.  This was measured as follows:

        import pickle

        class C(object): # Omit "(object)" for classic class
            pass
        x = C()
        x.foo = 42
        print len(pickle.dumps(x, 1))  # 1 selects the binary pickle

    The reasons for the bloat are complex, but it is mostly caused by
    the fact that new-style objects use __reduce__ in order to be
    picklable at all.  After ample consideration we've concluded that
    the only way to reduce pickle sizes for new-style objects is to
    add new opcodes to the pickle protocol.  The net result is that
    with the new protocol, the pickle size in the above example is 35
    bytes (two extra bytes are used at the start to indicate the
    protocol version, although this isn't strictly necessary).
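
    The measurement above uses Python 2 syntax.  As a rough sketch
    for a current Python 3 interpreter (an assumption; this PEP
    targets Python 2.3), where every class is new-style, the same
    effect can be observed by comparing protocol 1 with protocol 2.
    Exact byte counts depend on the interpreter and on the module and
    class names, so none are quoted here.

```python
import pickle


class C:  # in Python 3 every class is new-style
    pass


x = C()
x.foo = 42

# Pickle the same instance with the old binary protocol and with protocol 2.
size_proto1 = len(pickle.dumps(x, 1))
size_proto2 = len(pickle.dumps(x, 2))

print("protocol 1:", size_proto1, "bytes")
print("protocol 2:", size_proto2, "bytes")
```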


Protocol versions

    Previously, pickling (but not unpickling) has distinguished
    between text mode and binary mode.  By design, text mode is a
    subset of binary mode, and unpicklers don't need to know in
    advance whether an incoming pickle uses text mode or binary mode.
    The virtual machine used for unpickling is the same regardless of
    the mode; certain opcodes simply aren't used in text mode.

    Retroactively, text mode is called protocol 0, and binary mode is
    called protocol 1.  The new protocol is called protocol 2.  In the
    tradition of pickling protocols, protocol 2 is a superset of
    protocol 1.  But just so that future pickling protocols aren't
    required to be supersets of the oldest protocols, a new opcode is
    inserted at the start of a protocol 2 pickle indicating that it is
    using protocol 2.
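
    A minimal sketch of these rules, written for Python 3 (an
    assumption; byte strings are spelled b"..." there).  The same
    value is pickled under each protocol, the unpickler is never told
    which one was used, and only the protocol 2 pickle begins with
    the marker opcode.  pickle.PROTO is the name the current standard
    library gives to that opcode; it is not defined in this text.

```python
import pickle

data = {"spam": [1, 2, 3], "eggs": ("a", "b")}

# Pickle the same value under protocols 0 (text), 1 (binary) and 2.
pickles = {proto: pickle.dumps(data, proto) for proto in (0, 1, 2)}

# loads() is not told the protocol; it works it out from the stream.
restored = {proto: pickle.loads(blob) for proto, blob in pickles.items()}

# Only protocol 2 starts with the PROTO opcode followed by a version byte.
has_marker = {proto: blob[:1] == pickle.PROTO for proto, blob in pickles.items()}
version_byte = pickles[2][1]
```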

    Several functions, methods and constructors used for pickling used
    to take a positional argument named 'bin' which was a flag,
    defaulting to 0, indicating binary mode.  This argument is renamed
    to 'proto' and now gives the protocol number, defaulting to 0.

    It so happens that passing 2 for the 'bin' argument in previous
    Python versions had the same effect as passing 1.  Nevertheless, a
    special case is added here: passing a negative number selects the
    highest protocol version supported by a particular implementation.
    This works in previous Python versions, too.
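
    The negative-number rule can be sketched as follows for a current
    Python 3 interpreter (an assumption).  The protocol number is
    passed positionally, and pickle.HIGHEST_PROTOCOL is a constant
    from the current standard library that this text does not define.

```python
import pickle

value = ["spam", 42]

# A negative protocol number asks for the newest protocol available.
newest = pickle.dumps(value, -1)
explicit = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)

same_bytes = newest == explicit
roundtrip = pickle.loads(newest)
```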


Security issues

    In previous versions of Python, unpickling would do a "safety
    check" on certain operations, refusing to call functions or
    constructors that weren't marked as "safe for unpickling" by
    either having an attribute __safe_for_unpickling__ set to 1, or by
    being registered in a global registry, copy_reg.safe_constructors.

    This feature gives a false sense of security: nobody has ever done
    the necessary, extensive, code audit to prove that unpickling
    untrusted pickles cannot invoke unwanted code, and in fact bugs in
    the Python 2.2 pickle.py module make it easy to circumvent these
    security measures.

    We firmly believe that, on the Internet, it is better to know that
    you are using an insecure protocol than to trust a protocol to be
    secure whose implementation hasn't been thoroughly checked.  Even
    high quality implementations of widely used protocols are
    routinely found flawed; Python's pickle implementation simply
    cannot make such guarantees without a much larger time investment.
    Therefore, as of Python 2.3, all safety checks on unpickling are
    officially removed, and replaced with this warning:

      *** Do not unpickle data received from an untrusted or
          unauthenticated source ***
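
    To illustrate why the warning matters, here is a deliberately
    harmless sketch for Python 3 (an assumption).  The class name
    Innocent is hypothetical, and sorted() stands in for whatever
    importable callable an attacker might choose; the point is only
    that loading a pickle calls what the pickle names.

```python
import pickle


class Innocent:
    """Looks harmless, but controls what the unpickler will call."""

    def __reduce__(self):
        # (callable, args): the unpickler calls callable(*args).
        # sorted() is a harmless stand-in for any importable callable.
        return (sorted, ([3, 1, 2],))


blob = pickle.dumps(Innocent())

# Loading calls sorted([3, 1, 2]); no Innocent instance comes back.
result = pickle.loads(blob)
```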


Extended __reduce__ API

    There are several APIs that a class can use to control pickling.
    Perhaps the most popular of these are __getstate__ and
    __setstate__; but the most powerful one is __reduce__.  (There's
    also __getinitargs__, and we're adding __getnewargs__ below.)

    There are two ways to provide __reduce__ functionality: a class
    can implement a __reduce__ method, or a reduce function can be
    declared in copy_reg (copy_reg.dispatch_table maps classes to
    functions).  The return values are interpreted exactly the same,
    though, and we'll refer to these collectively as __reduce__.
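
    The second route can be sketched as follows.  This assumes
    Python 3, where the module is spelled copyreg, and uses its
    pickle() helper to fill dispatch_table.  The names Celsius,
    reduce_celsius and calls are hypothetical.

```python
import copyreg  # spelled copy_reg in Python 2
import pickle


class Celsius:
    def __init__(self, degrees):
        self.degrees = degrees


calls = []  # records each use of the reduce function


def reduce_celsius(obj):
    """Reduce function: same return convention as a __reduce__ method."""
    calls.append(obj.degrees)
    return (Celsius, (obj.degrees,))


# Register the reduce function for the class in copyreg.dispatch_table.
copyreg.pickle(Celsius, reduce_celsius)

copy = pickle.loads(pickle.dumps(Celsius(21.5)))
```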

    __reduce__ must return either a string or a tuple.  If it returns
    a string, the object's state is not pickled; instead, the pickle
    holds a reference to an equivalent object, referenced by name.
    Surprisingly, the string returned by __reduce__ should be the
    object's local name (relative to its module); the pickle module
    searches the module namespace to determine the object's module.
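
    A small sketch of the string form, assuming Python 3 and a
    hypothetical module-level singleton named MISSING.  Because only
    the name is recorded, unpickling hands back the very same object.

```python
import pickle


class _Missing:
    """Type of a module-level singleton that must stay unique."""

    def __reduce__(self):
        # Return the local name under which this object can be found.
        return "MISSING"


MISSING = _Missing()

# The pickle records only a reference to the name, not any state.
copy = pickle.loads(pickle.dumps(MISSING))
```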

    The rest of this section is concerned with the tuple returned by
    __reduce__.  It is a variable length tuple.  Only the first two
    items (function and arguments) are required.  The remaining items
    may be None or left off from the end.  The last two items are new
    in this PEP.  The items are, in order:

    function     A callable object (not necessarily a function) called
                 to create the initial version of the object; state
                 may be added to the object later to fully reconstruct
                 the pickled state.  This function must itself be
                 picklable.  See the section about __newobj__ for a
                 special case (new in this PEP) here.

    arguments    A tuple giving the argument list for the function.
                 As a special case, designed for Zope 2's
                 ExtensionClass, this may be None; in that case,
                 function should be a class or type, and
                 function.__basicnew__() is called to create the
                 initial version of the object.  This exception is
                 deprecated.

    state        Additional state.  If this is not None, the state is
                 pickled, and obj.__setstate__(state) will be called
                 when unpickling.  If no __setstate__ method is
                 defined, a default implementation is provided, which
                 assumes that state is a dictionary mapping instance
                 variable names to their values, and calls
                 obj.__dict__.update(state) or, if that update() call
                 fails, runs
                 "for k, v in state.items(): setattr(obj, k, v)".

    listitems    New in this PEP.  If this is not None, it should be
                 an iterator (not a sequence!) yielding successive
                 list items.  These list items will be pickled, and
                 appended to the object using either obj.append(item)
                 or obj.extend(list_of_items).  This is primarily used
                 for list subclasses, but may be used by other classes
                 as long as they have append() and extend() methods
                 with the appropriate signature.  (Whether append() or
                 extend() is used depends on which pickle protocol
                 version is used as well as the number of items to
                 append, so both must be supported.)

    dictitems    New in this PEP.  If this is not None, it should be
                 an iterator (not a sequence!) yielding successive
                 dictionary items, which should be tuples of the form
                 (key, value).  These items will be pickled, and
                 stored to the object using obj[key] = value.  This is
                 primarily used for dict subclasses, but may be used
                 by other classes as long as they implement
                 __setitem__.
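
    The tuple items above can be sketched as follows, assuming
    Python 3.  TaggedList and Registry are hypothetical classes.  The
    first relies on the default state handling and on listitems, the
    second on dictitems; unused trailing items are None or left off.

```python
import pickle


class TaggedList(list):
    """A list subclass carrying one extra instance variable."""

    def __init__(self, tag=None):
        super().__init__()
        self.tag = tag

    def __reduce__(self):
        # function, arguments, state, listitems (an iterator)
        return (TaggedList, (), {"tag": self.tag}, iter(self))


class Registry(dict):
    """A dict subclass restored through obj[key] = value."""

    def __reduce__(self):
        # function, arguments, state (None), listitems (None), dictitems
        return (Registry, (), None, None, iter(self.items()))


numbers = TaggedList("primes")
numbers.extend([2, 3, 5, 7])
registry = Registry(a=1, b=2)

numbers_copy = pickle.loads(pickle.dumps(numbers, 2))
registry_copy = pickle.loads(pickle.dumps(registry, 2))
```

    Neither class defines __setstate__, so the state dictionary of
    TaggedList is applied by the default implementation described
    under "state".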

    Note: in Python 2.2 and before, when using cPickle, state would be
    pickled if present even if it is None; the only safe way to avoid
    the __setstate__ call was to return a two-tuple from __reduce__.
    (But pickle.py would not pickle state if it was None.)  In Python
    2.3, __setstate__ will never be called when __reduce__ returns a
    state with value None.
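
    A short sketch of this rule, assuming a current Python 3
    interpreter, which behaves as described for Python 2.3.  Probe is
    a hypothetical class that records every state handed to its
    __setstate__ method.

```python
import pickle


class Probe:
    seen = []  # records every state passed to __setstate__

    def __init__(self, state=None):
        self.state = state

    def __reduce__(self):
        # Three-item form: function, arguments, state.
        return (Probe, (), self.state)

    def __setstate__(self, state):
        Probe.seen.append(state)


pickle.loads(pickle.dumps(Probe(None)))      # state is None: no call
pickle.loads(pickle.dumps(Probe({"n": 1})))  # state is pickled and passed on
```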


The __newobj__ unpickling function

    When the unpickling function returned by __reduce__ (the first
    item of the returned tuple) has the name __newobj__, something
    special happens for pickle protocol 2.  An unpickling function
    named __newobj__ is assumed to have the following semantics:

      def __newobj__(cls, *args):
          return cls.__new__(cls, *args)

    Pickle protocol 2 special-cases an unpickling function with this
    name, and emits a pickling opcode that, given 'cls' and 'args',
    will return cls.__new__(cls, *args) without also pickling a
    reference to __newobj__.  This is the main reason why protocol 2
    pickles are so much smaller than classic pickles.  Of course, the
    pickling code cannot verify that a function named __newobj__
    actually has the expected semantics.  If you use an unpickling
    function named __newobj__ that returns something different, you
    deserve what you get.
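
    The special case can be observed with the sketch below, which
    assumes Python 3.  There the standard library provides a function
    with exactly these semantics as copyreg.__newobj__, and
    pickletools.genops() lists the opcodes of a pickle.  Point is a
    hypothetical class.

```python
import copyreg  # spelled copy_reg in Python 2
import pickle
import pickletools


class Point:
    init_calls = 0  # counts how often __init__ runs

    def __init__(self, x, y):
        Point.init_calls += 1
        self.x, self.y = x, y

    def __reduce__(self):
        # copyreg.__newobj__(cls, *args) returns cls.__new__(cls, *args).
        return (copyreg.__newobj__, (Point,), {"x": self.x, "y": self.y})


p = Point(1, 2)
blob1 = pickle.dumps(p, 1)
blob2 = pickle.dumps(p, 2)

# List the opcode names used by each pickle.
ops1 = [op.name for op, arg, pos in pickletools.genops(blob1)]
ops2 = [op.name for op, arg, pos in pickletools.genops(blob2)]

q = pickle.loads(blob2)
```

    Under protocol 1 the pickle carries a reference to the function
    named __newobj__; under protocol 2 that reference is replaced by
    a single opcode.  Since only __new__ is called, __init__ does not
    run again on unpickling.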


TBD

    The rest of this PEP is still under construction!


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: