PEP: 3154
Title: Pickle protocol version 4
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2011-08-11
Python-Version: 3.4
Post-History: http://mail.python.org/pipermail/python-dev/2011-August/112821.html
Resolution: TBD


Abstract
========

Data serialized using the pickle module must be portable across Python
versions.  It should also support the latest language features as well
as implementation-specific features.  For this reason, the pickle
module knows about several protocols (currently numbered from 0 to 3),
each of which appeared in a different Python version.  Using a
low-numbered protocol version allows exchanging data with old Python
versions, while using a high-numbered protocol allows access to newer
features and sometimes more efficient resource use (both CPU time
required for (de)serializing, and disk size / network bandwidth
required for data transfer).


Rationale
=========

The latest protocol, coincidentally named protocol 3, appeared with
Python 3.0 and supports the new incompatible features in the language
(mainly, unicode strings by default and the new bytes object).  The
opportunity was not taken at the time to improve the protocol in other
ways.

This PEP is an attempt to foster a number of incremental improvements
in a new pickle protocol version.  The PEP process is used in order to
gather as many improvements as possible, because the introduction of a
new pickle protocol should be a rare occurrence.


Proposed changes
================

Framing
-------

Traditionally, when unpickling an object from a stream (by calling
``load()`` rather than ``loads()``), many small ``read()`` calls can be
issued on the file-like object, with a potentially huge performance
impact.

Protocol 4, by contrast, features binary framing.
The general structure of a pickle is thus the following::

    +------+------+
    | 0x80 | 0x04 |     protocol header (2 bytes)
    +------+------+-----------+
    | MM MM MM MM MM MM MM MM | frame size (8 bytes, little-endian)
    +------+------------------+
    | .... |     first frame contents (M bytes)
    +------+------+-----------+
    | NN NN NN NN NN NN NN NN | frame size (8 bytes, little-endian)
    +------+------------------+
    | .... |     second frame contents (N bytes)
    +------+
      etc.

To keep the implementation simple, it is forbidden for a pickle opcode
to straddle frame boundaries.  The pickler takes care not to produce
such pickles, and the unpickler refuses them.  Also, there is no "last
frame" marker.  The last frame is simply the one which ends with a STOP
opcode.

A well-written C implementation doesn't need additional memory copies
for the framing layer, preserving general (un)pickling efficiency.

.. note::
   How the pickler decides to partition the pickle stream into frames
   is an implementation detail.  For example, "closing" a frame as soon
   as it reaches ~64 KiB is a reasonable choice for both performance
   and pickle size overhead.

Binary encoding for all opcodes
-------------------------------

The GLOBAL opcode, which is still used in protocol 3, uses the
so-called "text" mode of the pickle protocol, which involves looking
for newlines in the pickle stream.  It also complicates the
implementation of binary framing.

Protocol 4 forbids use of the GLOBAL opcode and replaces it with
STACK_GLOBAL, a new opcode which takes its operand from the stack.

Serializing more "lookupable" objects
-------------------------------------

By default, pickle is only able to serialize module-global functions
and classes.  Supporting other kinds of objects, such as unbound
methods [4]_, is a common request.  Actually, third-party support for
some of them, such as bound methods, is implemented in the
multiprocessing module [5]_.
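
As a hedged illustration of the dotted-name lookup this section
proposes, the following sketch assumes an interpreter that already
implements protocol 4 (Python 3.4 or later, per the header of this
PEP).  The choice of ``json.decoder.JSONDecoder.decode`` is only an
assumption made for the example: it is a convenient standard-library
function whose ``__qualname__`` contains a dot.  The sketch round-trips
that function and records which opcodes the pickler emitted::

    import json.decoder
    import pickle
    import pickletools

    # In Python 3 an "unbound method" is a plain function whose
    # __qualname__ is dotted ("JSONDecoder.decode").  This particular
    # function is only an example target, not something the PEP names.
    target = json.decoder.JSONDecoder.decode

    # Serialize by reference (module name + qualified name), protocol 4.
    payload = pickle.dumps(target, protocol=4)

    # Collect the names of the opcodes present in the pickle stream.
    opcode_names = [op.name for op, arg, pos in pickletools.genops(payload)]

    # True if the name lookup went through the stack-based opcode.
    uses_stack_global = "STACK_GLOBAL" in opcode_names

    # Unpickling resolves the dotted name again and returns the
    # very same function object.
    restored = pickle.loads(payload)

On such an interpreter, ``restored is target`` is expected to hold and
``uses_stack_global`` is expected to be true, while the text-mode GLOBAL
opcode should not appear in ``opcode_names``.
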

The ``__qualname__`` attribute from :pep:`3155` makes it possible to
look up many more objects by name.  Making the STACK_GLOBAL opcode
accept dot-separated names would allow the standard pickle
implementation to support all those kinds of objects.

64-bit opcodes for large objects
--------------------------------

Current protocol versions export object sizes for various built-in
types (str, bytes) as 32-bit ints.  This forbids serialization of large
data [1]_.  New opcodes are required to support very large bytes and
str objects.

Native opcodes for sets and frozensets
--------------------------------------

Many common built-in types (such as str, bytes, dict, list, tuple) have
dedicated opcodes to improve resource consumption when serializing and
deserializing them; however, sets and frozensets don't.  Adding such
opcodes would be an obvious improvement.  Also, dedicated set support
could help remove the current impossibility of pickling
self-referential sets [2]_.

Calling __new__ with keyword arguments
--------------------------------------

Currently, classes whose __new__ mandates the use of keyword-only
arguments cannot be pickled (or, rather, unpickled) [3]_.  Both a new
special method (``__getnewargs_ex__`` ?) and a new opcode (NEWOBJ_EX ?)
are needed.

Better string encoding
----------------------

Short str objects currently have their length coded as a 4-byte
integer, which is wasteful.  A specific opcode with a 1-byte length
would make many pickles smaller.


Summary of new opcodes
======================

These reflect the state of the proposed implementation (thanks mostly
to Alexandre Vassalotti's work):

* ``SHORT_BINUNICODE``: push a UTF-8 encoded str object with a one-byte
  size prefix (therefore less than 256 bytes long).
* ``BINUNICODE8``: push a UTF-8 encoded str object with an eight-byte
  size prefix (for strings longer than 2**32 bytes, which therefore
  cannot be serialized using ``BINUNICODE``).
* ``BINBYTES8``: push a bytes object with an eight-byte size prefix
  (for bytes objects longer than 2**32 bytes, which therefore cannot be
  serialized using ``BINBYTES``).
* ``EMPTY_SET``: push a new empty set object on the stack.
* ``ADDITEMS``: add the topmost stack items to the set (to be used with
  ``EMPTY_SET``).
* ``EMPTY_FROZENSET``: push a new empty frozenset object on the stack.
* ``FROZENSET``: create a frozenset object from the topmost stack
  items, and push it on the stack.
* ``NEWOBJ_EX``: take the three topmost stack items ``cls``, ``args``
  and ``kwargs``, and push the result of calling
  ``cls.__new__(*args, **kwargs)``.
* ``STACK_GLOBAL``: take the two topmost stack items ``module_name``
  and ``qualname``, and push the result of looking up the dotted
  ``qualname`` in the module named ``module_name``.


Alternative ideas
=================

Prefetching
-----------

Serhiy Storchaka suggested replacing framing with a special PREFETCH
opcode (with a 2- or 4-byte argument) to declare known pickle chunks
explicitly.  Large data may be pickled outside such chunks.  A naïve
unpickler should be able to skip the PREFETCH opcode and still decode
pickles properly, but good error handling would require checking that
the PREFETCH length falls on an opcode boundary.


Acknowledgments
===============

In alphabetical order:

* Alexandre Vassalotti, for starting the second PEP 3154
  implementation [6]_
* Serhiy Storchaka, for discussing the framing proposal [6]_
* Stefan Mihaila, for starting the first PEP 3154 implementation as a
  Google Summer of Code project mentored by Alexandre Vassalotti [7]_.


References
==========

.. [1] "pickle not 64-bit ready":
   http://bugs.python.org/issue11564

.. [2] "Cannot pickle self-referencing sets":
   http://bugs.python.org/issue9269

.. [3] "pickle/copyreg doesn't support keyword only arguments in __new__":
   http://bugs.python.org/issue4727

.. [4] "pickle should support methods":
   http://bugs.python.org/issue9276

.. [5] Lib/multiprocessing/forking.py:
   http://hg.python.org/cpython/file/baea9f5f973c/Lib/multiprocessing/forking.py#l54

.. [6] Implement PEP 3154, by Alexandre Vassalotti
   http://bugs.python.org/issue17810

.. [7] Implement PEP 3154, by Stefan Mihaila
   http://bugs.python.org/issue15642


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: