Pickle 4 changes:

- add framing
- change BINGLOBAL to Alexandre Vassalotti's GLOBAL_STACK
Antoine Pitrou 2013-04-26 22:57:06 +02:00
parent 2d013833a9
commit b26857967d
1 changed files with 60 additions and 31 deletions

@@ -42,11 +42,67 @@ used in order to gather as many improvements as possible, because the
introduction of a new protocol version should be a rare occurrence.
Improvements in discussion
==========================
Proposed changes
================
64-bit compatibility for large objects
--------------------------------------
Framing
-------
Traditionally, when unpickling an object from a stream (by calling
``load()`` rather than ``loads()``), many small ``read()``
calls can be issued on the file-like object, with a potentially huge
performance impact.
Protocol 4, by contrast, features binary framing. The general structure
of a pickle is thus the following::

   +------+------+
   | 0x80 | 0x04 |             protocol header (2 bytes)
   +------+------+-----------+
   | AA BB CC DD EE FF GG HH | frame size (8 bytes, little-endian)
   +------+------------------+
   | .... |                    first frame contents (N bytes)
   +------+------+-----------+
   | AA BB CC DD EE FF GG HH | frame size (8 bytes, little-endian)
   +------+------------------+
   | .... |                    second frame contents (N bytes)
   +------+
     etc.
To keep the implementation simple, it is forbidden for a pickle opcode
to overlap frame boundaries. The pickler takes care not to produce such
pickles, and the unpickler refuses them.
How the pickler decides frame sizes is an implementation detail.
A simple heuristic committing the current frame as soon as it reaches
64 KiB seems sufficient.
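The framed layout can be illustrated with a minimal sketch. It is not the pickle module's implementation. The helper names (``write_framed``, ``iter_frames``, ``read_exactly``) are hypothetical, the stream is assumed to be buffered so that ``read(n)`` returns ``n`` bytes unless the data ends, and treating end of stream as the end of the pickle is an assumption made only for the example.

```python
import io
import struct

PROTO_OPCODE = 0x80                 # first byte of the 2-byte protocol header
FRAME_SIZE = struct.Struct("<Q")    # 8-byte little-endian unsigned frame size


def write_framed(out, frames, protocol=4):
    """Write the protocol header, then each frame prefixed by its size."""
    out.write(bytes([PROTO_OPCODE, protocol]))
    for payload in frames:
        out.write(FRAME_SIZE.pack(len(payload)))
        out.write(payload)


def read_exactly(stream, n):
    """Read exactly n bytes from a buffered stream or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("stream ended in the middle of a frame")
    return data


def iter_frames(stream):
    """Yield the contents of each frame, using two read() calls per frame."""
    header = read_exactly(stream, 2)
    if header[0] != PROTO_OPCODE:
        raise ValueError("missing protocol header")
    while True:
        prefix = stream.read(FRAME_SIZE.size)
        if not prefix:
            return  # assumption: end of stream marks the end of the pickle
        if len(prefix) != FRAME_SIZE.size:
            raise EOFError("truncated frame size")
        (size,) = FRAME_SIZE.unpack(prefix)
        yield read_exactly(stream, size)


buf = io.BytesIO()
write_framed(buf, [b"abc", b"", b"defgh"])
frames = list(iter_frames(io.BytesIO(buf.getvalue())))
```

Because no opcode may straddle a frame boundary, an unpickler built this way could decode each yielded frame entirely from memory, without issuing further reads on the underlying file.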
Binary encoding for all opcodes
-------------------------------
The GLOBAL opcode, which is still used in protocol 3, uses the
so-called "text" mode of the pickle protocol, which involves looking
for newlines in the pickle stream. It also complicates the implementation
of binary framing.
Protocol 4 forbids use of the GLOBAL opcode and replaces it with
GLOBAL_STACK, a new opcode which takes its operand from the stack.
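The difference is visible with the standard ``pickletools`` module. The sketch below assumes CPython 3.4 or later, where the opcode proposed here as GLOBAL_STACK is exposed under the name ``STACK_GLOBAL``. The ``opcode_names`` helper is hypothetical and exists only for this illustration.

```python
import collections
import pickle
import pickletools


def opcode_names(data):
    """Return the names of the opcodes found in a pickle, in order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


# Protocol 3 writes the module and class names as newline-terminated text.
p3 = pickle.dumps(collections.OrderedDict, protocol=3)

# Protocol 4 pushes both names as length-prefixed strings and then emits
# an opcode that takes them from the stack.
p4 = pickle.dumps(collections.OrderedDict, protocol=4)

print(p3)
print(opcode_names(p3))
print(opcode_names(p4))
```

In the protocol 3 output the names follow the ``c`` opcode and are delimited by ``\n`` bytes, which is the newline scanning the text above refers to.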
Serializing more "lookupable" objects
-------------------------------------
By default, pickle is only able to serialize module-global functions and
classes. Supporting other kinds of objects, such as unbound methods [4]_,
is a common request. Actually, third-party support for some of them, such
as bound methods, is implemented in the multiprocessing module [5]_.
The ``__qualname__`` attribute from :pep:`3155` makes it possible to
look up many more objects by name. Making the GLOBAL_STACK opcode accept
dot-separated names, or adding a special GETATTR opcode, would allow the
standard pickle implementation to support all those kinds of objects.
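A lookup by dot-separated name might work roughly as follows. This is a sketch of the idea only, not the unpickler's actual code. The ``resolve_qualname`` function is hypothetical, and refusing names that contain ``<locals>`` is an assumption based on the fact that such objects cannot be reached by attribute access.

```python
import importlib
import json.decoder


def resolve_qualname(module_name, qualname):
    """Import module_name, then follow qualname one attribute at a time."""
    obj = importlib.import_module(module_name)
    for part in qualname.split("."):
        if part == "<locals>":
            # Objects defined inside a function body are not reachable by name.
            raise AttributeError(
                "cannot look up local object %r in %r" % (qualname, module_name))
        obj = getattr(obj, part)
    return obj


# A function defined in a class body carries a dotted __qualname__.
method = json.decoder.JSONDecoder.decode
found = resolve_qualname(method.__module__, method.__qualname__)
```

With a single-component name this reduces to the module-global lookup that pickle already performs, so the dotted form is a strict generalization.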
64-bit opcodes for large objects
--------------------------------
Current protocol versions export object sizes for various built-in
types (str, bytes) as 32-bit ints. This forbids serialization of
@@ -71,33 +127,6 @@ arguments can not be pickled (or, rather, unpickled) [3]_. Both a new
special method (``__getnewargs_ex__`` ?) and a new opcode (NEWOBJEX ?)
are needed.
Serializing more "lookupable" objects
-------------------------------------
For some kinds of objects, it only makes sense to serialize them by name
(for example classes and functions). By default, pickle is only able to
serialize module-global functions and classes by name. Supporting other
kinds of objects, such as unbound methods [4]_, is a common request.
Actually, third-party support for some of them, such as bound methods,
is implemented in the multiprocessing module [5]_.
:pep:`3155` now makes it possible to look up many more objects by name.
Generalizing the GLOBAL opcode to accept dot-separated names, or adding
a special GETATTR opcode, would allow the standard pickle implementation
to support, in an efficient way, all those kinds of objects.
Binary encoding for all opcodes
-------------------------------
The GLOBAL opcode, which is still used in protocol 3, uses the
so-called "text" mode of the pickle protocol, which involves looking
for newlines in the pickle stream. Looking for newlines is difficult
to optimize on a non-seekable stream, and therefore a new version of
GLOBAL (BINGLOBAL?) could use a binary encoding instead.
It seems that all other opcodes emitted when using protocol 3 already
use binary encoding.
Better string encoding
----------------------