PEP: 3154
Title: Pickle protocol version 4
Version: $Revision$
Last-Modified: $Date$
Author: Antoine Pitrou <solipsis@pitrou.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2011-08-11
Python-Version: 3.3
Post-History:
Resolution: TBD
Abstract
========
Data serialized using the pickle module must be portable across Python
versions. It should also support the latest language features as well
as implementation-specific features. For this reason, the pickle
module knows about several protocols (currently numbered from 0 to 3),
each of which appeared in a different Python version. Using a
low-numbered protocol version allows exchanging data with old Python
versions, while using a high-numbered protocol allows access to newer
features and sometimes more efficient resource use (both CPU time
required for (de)serializing, and disk size / network bandwidth
required for data transfer).
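As a rough illustration of this trade-off, the sketch below pickles the
same object with the oldest and the newest existing protocols, using the
``protocol`` argument of ``pickle.dumps()``. The sample payload and the
variable names are made up for the example::

    import pickle

    data = list(range(100))  # arbitrary sample payload

    # Protocol 0 is the original text-based format, understood by the
    # oldest Python versions.
    old_style = pickle.dumps(data, protocol=0)
    # Protocol 3 is a binary format that requires Python 3.0 or later.
    new_style = pickle.dumps(data, protocol=3)

    # Both round-trip to an equal object ...
    assert pickle.loads(old_style) == pickle.loads(new_style) == data
    # ... but the binary protocol is more compact for this payload.
    print(len(old_style), len(new_style))

The exact sizes depend on the payload; the point is only that the
protocol number is a compatibility versus efficiency choice made by the
caller.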
Rationale
=========
The most recent protocol, coincidentally named protocol 3, appeared
with Python 3.0 and supports the new incompatible features in the
language (mainly, Unicode strings by default and the new bytes
object). The opportunity was not taken at the time to improve the
protocol in other ways.
This PEP is an attempt to foster a number of small incremental
improvements in a future new protocol version. The PEP process is
used in order to gather as many improvements as possible, because the
introduction of a new protocol version should be a rare occurrence.
Proposed changes
================
Framing
-------
Traditionally, when unpickling an object from a stream (by calling
``load()`` rather than ``loads()``), many small ``read()``
calls can be issued on the file-like object, with a potentially huge
performance impact.
Protocol 4, by contrast, features binary framing. The general structure
of a pickle is thus the following::

    +------+------+
    | 0x80 | 0x04 |               protocol header (2 bytes)
    +------+------+-----------+
    | AA BB CC DD EE FF GG HH |   frame size (8 bytes, little-endian)
    +------+------------------+
    | .... |                      first frame contents (N bytes)
    +------+------+-----------+
    | AA BB CC DD EE FF GG HH |   frame size (8 bytes, little-endian)
    +------+------------------+
    | .... |                      second frame contents (N bytes)
    +------+
      etc.

To keep the implementation simple, it is forbidden for a pickle opcode
to overlap frame boundaries. The pickler takes care not to produce such
pickles, and the unpickler refuses them.
How the pickler decides frame sizes is an implementation detail.
A simple heuristic committing the current frame as soon as it reaches
64 KiB seems sufficient.
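The framing rules above can be sketched as follows. This is only an
illustration of the layout described in this section, not a description
of what any pickle implementation writes: the function names
``write_frames`` and ``read_frames`` are hypothetical, the two-byte
protocol header is left out, and each item passed to the writer is
assumed to be one complete encoded opcode::

    import io
    import struct

    FRAME_SIZE_TARGET = 64 * 1024  # commit a frame once it reaches 64 KiB


    def write_frames(out, opcodes):
        """Group whole encoded opcodes into length-prefixed frames."""
        frame = bytearray()
        for op in opcodes:
            frame += op  # an opcode is never split across two frames
            if len(frame) >= FRAME_SIZE_TARGET:
                out.write(struct.pack("<Q", len(frame)) + frame)
                frame.clear()
        if frame:
            out.write(struct.pack("<Q", len(frame)) + frame)


    def read_frames(stream):
        """Yield each frame's contents using two read() calls per frame."""
        while True:
            prefix = stream.read(8)
            if not prefix:
                return  # clean end of stream
            if len(prefix) != 8:
                raise ValueError("truncated frame size")
            (size,) = struct.unpack("<Q", prefix)
            body = stream.read(size)
            if len(body) != size:
                raise ValueError("truncated frame")
            yield body

With this shape, the reader issues two ``read()`` calls per frame
instead of one or more per opcode, which is the performance benefit the
section is after.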
Binary encoding for all opcodes
-------------------------------
The GLOBAL opcode, which is still used in protocol 3, uses the
so-called "text" mode of the pickle protocol, which involves looking
for newlines in the pickle stream. It also complicates the implementation
of binary framing.
Protocol 4 forbids use of the GLOBAL opcode and replaces it with
GLOBAL_STACK, a new opcode which takes its operand from the stack.
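To see the text-mode encoding in question, the following sketch pickles
a built-in function with protocol 3 and lists the opcodes with the
standard ``pickletools`` module. The choice of ``len`` as the sample
object is arbitrary::

    import pickle
    import pickletools

    data = pickle.dumps(len, protocol=3)

    # The module and attribute names follow the GLOBAL opcode (b"c") as
    # two newline-terminated text lines.
    print(data)

    # pickletools.genops() yields (opcode, argument, position) triples.
    names = [op.name for op, arg, pos in pickletools.genops(data)]
    print(names)

An unpickler has to scan for the two newlines to find where the GLOBAL
operands end, whereas an opcode that takes its operands from the stack
needs no such scanning.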
Serializing more "lookupable" objects
-------------------------------------
By default, pickle is only able to serialize module-global functions and
classes. Supporting other kinds of objects, such as unbound methods [4]_,
is a common request. In fact, support for some of them, such as bound
methods, is already implemented outside of pickle, in the multiprocessing
module [5]_.
The ``__qualname__`` attribute from :pep:`3155` makes it possible to
look up many more objects by name. Making the GLOBAL_STACK opcode accept
dot-separated names, or adding a special GETATTR opcode, would allow the
standard pickle implementation to support all those kinds of objects.
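A minimal sketch of the dot-separated lookup idea, assuming the module
name and the qualified name are both available to the unpickler. The
helper name ``lookup`` is hypothetical and is not part of the pickle
module::

    import importlib


    def lookup(module_name, qualname):
        """Resolve a dot-separated name by walking attributes from a module."""
        obj = importlib.import_module(module_name)
        for part in qualname.split("."):
            obj = getattr(obj, part)
        return obj


    # str.upper is not a module-level global, but its qualified name
    # records the path needed to find it again.
    method = str.upper
    found = lookup("builtins", method.__qualname__)
    assert found is method

Names containing ``<locals>`` (objects defined inside a function) cannot
be resolved this way, so such a scheme would still not cover every
object.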
64-bit opcodes for large objects
--------------------------------
Current protocol versions export object sizes for various built-in
types (str, bytes) as 32-bit ints. This forbids serialization of
large data [1]_. New opcodes are required to support very large bytes
and str objects.
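The limitation comes down to the width of the length field. The sketch
below uses the ``struct`` module to show it, assuming an unsigned
little-endian 32-bit field; the helper ``encode_length`` is hypothetical,
and a real protocol would signal the field width with distinct opcodes
rather than by guessing::

    import struct

    MAX_32BIT = 2**32 - 1  # largest size an unsigned 32-bit field can hold


    def encode_length(n):
        """Return a little-endian length field: 4 bytes if possible, else 8."""
        if n <= MAX_32BIT:
            return struct.pack("<I", n)
        return struct.pack("<Q", n)


    # A 4 GiB object no longer fits in the 32-bit field ...
    try:
        struct.pack("<I", 2**32)
    except struct.error as exc:
        print("32-bit field overflow:", exc)

    # ... but fits easily in a 64-bit one.
    print(len(encode_length(2**32)))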
Native opcodes for sets and frozensets
--------------------------------------
Many common built-in types (such as str, bytes, dict, list, tuple)
have dedicated opcodes to improve resource consumption when
serializing and deserializing them; however, sets and frozensets
don't. Adding such opcodes would be an obvious improvement. Also,
dedicated set support could help remove the current impossibility of
pickling self-referential sets [2]_.
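The lack of a dedicated opcode is visible in the opcode stream. In the
sketch below, a set pickled with protocol 3 is rebuilt by looking up the
``set`` class and calling it on a list, which shows up as GLOBAL and
REDUCE opcodes. The sample set is arbitrary::

    import pickle
    import pickletools

    data = pickle.dumps({1, 2, 3}, protocol=3)

    # pickletools.genops() yields (opcode, argument, position) triples.
    names = [op.name for op, arg, pos in pickletools.genops(data)]
    print(names)

    assert pickle.loads(data) == {1, 2, 3}

By comparison, a list or a dict is created directly by its own opcodes,
without going through a global lookup and a call.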
Calling __new__ with keyword arguments
--------------------------------------
Currently, classes whose ``__new__`` mandates the use of keyword-only
arguments cannot be pickled (or, rather, unpickled) [3]_. Both a new
special method (``__getnewargs_ex__`` ?) and a new opcode (NEWOBJEX ?)
are needed.
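The sketch below illustrates the problem and one possible shape for the
special method. It assumes the tentative name ``__getnewargs_ex__``
returns a pair of positional and keyword arguments; the ``Point`` class
and the ``rebuild`` helper are hypothetical, and ``rebuild`` only mimics
what a new opcode might do at unpickling time::

    class Point:
        """A class whose __new__ accepts keyword-only arguments."""

        def __new__(cls, *, x, y):
            self = super().__new__(cls)
            self.x, self.y = x, y
            return self

        def __getnewargs_ex__(self):
            # Tentative hook: return (positional args, keyword args).
            return (), {"x": self.x, "y": self.y}


    def rebuild(obj):
        """Mimic what an unpickler could do with the (args, kwargs) pair."""
        cls = type(obj)
        args, kwargs = obj.__getnewargs_ex__()
        return cls.__new__(cls, *args, **kwargs)


    # Positional arguments alone cannot recreate the object ...
    try:
        Point.__new__(Point, 1, 2)
    except TypeError as exc:
        print("positional call failed:", exc)

    # ... whereas passing the keyword arguments works.
    copy = rebuild(Point(x=1, y=2))
    assert (copy.x, copy.y) == (1, 2)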
Better string encoding
----------------------
Short str objects currently have their length coded as a 4-byte
integer, which is wasteful. A specific opcode with a 1-byte length
would make many pickles smaller.
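The overhead can be checked directly. The sketch below compares the
protocol 3 encoding of a short string with a hypothetical short form
that uses a 1-byte length; ``SHORT_OPCODE`` is a placeholder byte chosen
for the example, not an assigned opcode::

    import pickle
    import struct

    text = "abc"
    payload = text.encode("utf-8")
    data = pickle.dumps(text, protocol=3)

    # Protocol 3: BINUNICODE opcode (b"X") followed by a 4-byte length.
    long_form = b"X" + struct.pack("<I", len(payload)) + payload
    assert long_form in data

    # Hypothetical short form: same idea with a 1-byte length (max 255).
    SHORT_OPCODE = b"?"  # placeholder byte, not an assigned opcode
    short_form = SHORT_OPCODE + struct.pack("<B", len(payload)) + payload

    # Bytes saved for each string of at most 255 encoded bytes.
    print(len(long_form) - len(short_form))

The saving is three bytes per short string, which adds up in pickles
dominated by small strings such as attribute names and dictionary keys.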
Acknowledgments
===============
(...)
References
==========
.. [1] "pickle not 64-bit ready":
   http://bugs.python.org/issue11564
.. [2] "Cannot pickle self-referencing sets":
   http://bugs.python.org/issue9269
.. [3] "pickle/copyreg doesn't support keyword only arguments in __new__":
   http://bugs.python.org/issue4727
.. [4] "pickle should support methods":
   http://bugs.python.org/issue9276
.. [5] Lib/multiprocessing/forking.py:
   http://hg.python.org/cpython/file/baea9f5f973c/Lib/multiprocessing/forking.py#l54
Copyright
=========
This document has been placed in the public domain.
..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: