352 lines
16 KiB
Plaintext
352 lines
16 KiB
Plaintext
PEP: 296
|
||
Title: The Buffer Problem
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: xscottg at yahoo.com (Scott Gilbert)
|
||
Status: Draft
|
||
Type: Standards Track
|
||
Created: 12-Jul-2002
|
||
Python-Version: 2.3
|
||
Post-History:
|
||
|
||
|
||
Abstract
|
||
|
||
This PEP proposes the creation of a new standard type and builtin
|
||
constructor called 'bytes'. The bytes object is an efficiently
|
||
stored array of bytes with some additional characteristics that
|
||
set it apart from several implementations that are similar.
|
||
|
||
|
||
Rationale
|
||
|
||
Python currently has many objects that implement something akin to
|
||
the bytes object of this proposal. For instance the standard
|
||
string, buffer, array, and mmap objects are all very similar in
|
||
some regards to the bytes object. Additionally, several
|
||
significant third party extensions have created similar objects to
|
||
try and fill similar needs. Frustratingly, each of these objects
|
||
is too narrow in scope and is missing critical features to make it
|
||
applicable to a wider category of problems.
|
||
|
||
|
||
Specification
|
||
|
||
The bytes object has the following important characteristics:
|
||
|
||
1. Efficient underlying array storage via the standard C type "unsigned
|
||
char". This allows fine grain control over how much memory is
|
||
allocated. With the alignment restrictions designated in the next
|
||
item, it is trivial for low level extensions to cast the pointer
|
||
to a different type as needed.
|
||
|
||
Also, since the object is implemented as an array of bytes, it is
|
||
possible to pass the bytes object to the extensive library of
|
||
routines already in the standard library that presently work with
|
||
strings. For instance, the bytes object in conjunction with the
|
||
struct module could be used to provide a complete replacement for
|
||
the array module using only Python script.
|
||
|
||
If an unusual platform comes to light, one where there isn't a
|
||
native unsigned 8 bit type, the object will do its best to
|
||
represent itself at the Python script level as though it were an
|
||
array of 8 bit unsigned values. It is doubtful whether many
|
||
extensions would handle this correctly, but Python script could be
|
||
portable in these cases.
|
||
|
||
2. Alignment of the allocated byte array is whatever is promised by the
|
||
platform implementation of malloc. A bytes object created from an
|
||
extension can be supplied that provides any arbitrary alignment as
|
||
the extension author sees fit.
|
||
|
||
This alignment restriction should allow the bytes object to be
|
||
used as storage for all standard C types - including PyComplex
|
||
objects or other structs of standard C type types. Further
|
||
alignment restrictions can be provided by extensions as necessary.
|
||
|
||
3. The bytes object implements a subset of the sequence operations
|
||
provided by string/array objects, but with slightly different
|
||
semantics in some cases. In particular, a slice always returns a
|
||
new bytes object, but the underlying memory is shared between the
|
||
two objects. This type of slice behavior has been called creating
|
||
a "view". Additionally, repetition and concatenation are
|
||
undefined for bytes objects and will raise an exception.
|
||
|
||
As these objects are likely to find use in high performance
|
||
applications, one motivation for the decision to use view slicing
|
||
is that copying between bytes objects should be very efficient and
|
||
not require the creation of temporary objects. The following code
|
||
illustrates this:
|
||
|
||
# create two 10 Meg bytes objects
|
||
b1 = bytes(10000000)
|
||
b2 = bytes(10000000)
|
||
|
||
# copy from part of one to another with out creating a 1 Meg temporary
|
||
b1[2000000:3000000] = b2[4000000:5000000]
|
||
|
||
Slice assignment where the rvalue is not the same length as the
|
||
lvalue will raise an exception. However, slice assignment will
|
||
work correctly with overlapping slices (typically implemented with
|
||
memmove).
|
||
|
||
4. The bytes object will be recognized as a native type by the pickle and
|
||
cPickle modules for efficient serialization. (In truth, this is
|
||
the only requirement that can't be implemented via a third party
|
||
extension.)
|
||
|
||
Partial solutions to address the need to serialize the data stored
|
||
in a bytes-like object without creating a temporary copy of the
|
||
data into a string have been implemented in the past. The tofile
|
||
and fromfile methods of the array object are good examples of
|
||
this. The bytes object will support these methods too. However,
|
||
pickling is useful in other situations - such as in the shelve
|
||
module, or implementing RPC of Python objects, and requiring the
|
||
end user to use two different serialization mechanisms to get an
|
||
efficient transfer of data is undesirable.
|
||
|
||
XXX: Will try to implement pickling of the new bytes object in
|
||
such a way that previous versions of Python will unpickle it as a
|
||
string object.
|
||
|
||
When unpickling, the bytes object will be created from memory
|
||
allocated from Python (via malloc). As such, it will lose any
|
||
additional properties that an extension supplied pointer might
|
||
have provided (special alignment, or special types of memory).
|
||
|
||
XXX: Will try to make it so that C subclasses of bytes type can
|
||
supply the memory that will be unpickled into. For instance, a
|
||
derived class called PageAlignedBytes would unpickle to memory
|
||
that is also page aligned.
|
||
|
||
On any platform where an int is 32 bits (most of them), it is
|
||
currently impossible to create a string with a length larger than
|
||
can be represented in 31 bits. As such, pickling to a string will
|
||
raise an exception when the operation is not possible.
|
||
|
||
At least on platforms supporting large files (many of them),
|
||
pickling large bytes objects to files should be possible via
|
||
repeated calls to the file.write() method.
|
||
|
||
5. The bytes type supports the PyBufferProcs interface, but a bytes object
|
||
provides the additional guarantee that the pointer will not be
|
||
deallocated or reallocated as long as a reference to the bytes
|
||
object is held. This implies that a bytes object is not resizable
|
||
once it is created, but allows the global interpreter lock (GIL)
|
||
to be released while a separate thread manipulates the memory
|
||
pointed to if the PyBytes_Check(...) test passes.
|
||
|
||
This characteristic of the bytes object allows it to be used in
|
||
situations such as asynchronous file I/O or on multiprocessor
|
||
machines where the pointer obtained by PyBufferProcs will be used
|
||
independently of the global interpreter lock.
|
||
|
||
Knowing that the pointer can not be reallocated or freed after the
|
||
GIL is released gives extension authors the capability to get true
|
||
concurrency and make use of additional processors for long running
|
||
computations on the pointer.
|
||
|
||
6. In C/C++ extensions, the bytes object can be created from a supplied
|
||
pointer and destructor function to free the memory when the
|
||
reference count goes to zero.
|
||
|
||
The special implementation of slicing for the bytes object allows
|
||
multiple bytes objects to refer to the same pointer/destructor.
|
||
As such, a refcount will be kept on the actual
|
||
pointer/destructor. This refcount is separate from the refcount
|
||
typically associated with Python objects.
|
||
|
||
XXX: It may be desirable to expose the inner refcounted object as an
|
||
actual Python object. If a good use case arises, it should be possible
|
||
for this to be implemented later with no loss to backwards compatibility.
|
||
|
||
7. It is also possible to signify the bytes object as readonly, in this
|
||
case it isn't actually mutable, but does provide the other features of a
|
||
bytes object.
|
||
|
||
8. The bytes object keeps track of the length of its data with a Python
|
||
LONG_LONG type. Even though the current definition for PyBufferProcs
|
||
restricts the length to be the size of an int, this PEP does not propose
|
||
to make any changes there. Instead, extensions can work around this limit
|
||
by making an explicit PyBytes_Check(...) call, and if that succeeds they
|
||
can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer call to
|
||
get the pointer and full length of the object as a LONG_LONG.
|
||
|
||
The bytes object will raise an exception if the standard PyBufferProcs
|
||
mechanism is used and the size of the bytes object is greater than can be
|
||
represented by an integer.
|
||
|
||
From Python scripting, the bytes object will be subscriptable with longs
|
||
so the 32 bit int limit can be avoided.
|
||
|
||
There is still a problem with the len() function as it is PyObject_Size()
|
||
and this returns an int as well. As a workaround, the bytes object will
|
||
provide a .length() method that will return a long.
|
||
|
||
9. The bytes object can be constructed at the Python scripting level by
|
||
passing an int/long to the bytes constructor with the number of bytes to
|
||
allocate. For example:
|
||
|
||
b = bytes(100000) # alloc 100K bytes
|
||
|
||
The constructor can also take another bytes object. This will be useful
|
||
for the implementation of unpickling, and in converting a read-write bytes
|
||
object into a read-only one. An optional second argument will be used to
|
||
designate creation of a readonly bytes object.
|
||
|
||
10. From the C API, the bytes object can be allocated using any of the
|
||
following signatures:
|
||
|
||
PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
|
||
PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly
|
||
void (*dest)(void *ptr, void *user), void* user);
|
||
|
||
In the PyBytes_FromPointer(...) function, if the dest function pointer is
|
||
passed in as NULL, it will not be called. This should only be used for
|
||
creating bytes objects from statically allocated space.
|
||
|
||
The user pointer has been called a closure in other places. It is a
|
||
pointer that the user can use for whatever purposes. It will be passed to
|
||
the destructor function on cleanup and can be useful for a number of
|
||
things. If the user pointer is not needed, NULL should be passed instead.
|
||
|
||
11. The bytes type will be a new style class as that seems to be where all
|
||
standard Python types are headed.
|
||
|
||
|
||
Contrast to existing types
|
||
|
||
The most common way to work around the lack of a bytes object has been to
|
||
simply use a string object in its place. Binary files, the struct/array
|
||
modules, and several other examples exist of this. Putting aside the
|
||
style issue that these uses typically have nothing to do with text
|
||
strings, there is the real problem that strings are not mutable, so direct
|
||
manipulation of the data returned in these cases is not possible. Also,
|
||
numerous optimizations in the string module (such as caching the hash
|
||
value or interning the pointers) mean that extension authors are on very
|
||
thin ice if they try to break the rules with the string object.
|
||
|
||
The buffer object seems like it was intended to address the purpose that
|
||
the bytes object is trying fulfill, but several shortcomings in its
|
||
implementation [1] have made it less useful in many common cases. The
|
||
buffer object made a different choice for its slicing behavior (it returns
|
||
new strings instead of buffers for slicing and other operations), and it
|
||
doesn't make many of the promises on alignment or being able to release
|
||
the GIL that the bytes object does.
|
||
|
||
Also in regards to the buffer object, it is not possible to simply replace
|
||
the buffer object with the bytes object and maintain backwards
|
||
compatibility. The buffer object provides a mechanism to take the
|
||
PyBufferProcs supplied pointer of another object and present it as its
|
||
own. Since the behavior of the other object can not be guaranteed to
|
||
follow the same set of strict rules that a bytes object does, it can't be
|
||
used in places that a bytes object could.
|
||
|
||
The array module supports the creation of an array of bytes, but it does
|
||
not provide a C API for supplying pointers and destructors to extension
|
||
supplied memory. This makes it unusable for constructing objects out of
|
||
shared memory, or memory that has special alignment or locking for things
|
||
like DMA transfers. Also, the array object does not currently pickle.
|
||
Finally since the array object allows its contents to grow, via the extend
|
||
method, the pointer can be changed if the GIL is not held while using it.
|
||
|
||
Creating a buffer object from an array object has the same problem of
|
||
leaving an invalid pointer when the array object is resized.
|
||
|
||
The mmap object caters to its particular niche, but does not attempt to
|
||
solve a wider class of problems.
|
||
|
||
Finally, any third party extension can not implement pickling without
|
||
creating a temporary object of a standard python type. For example in the
|
||
Numeric community, it is unpleasant that a large array can't pickle
|
||
without creating a large binary string to duplicate the array data.
|
||
|
||
|
||
Backward Compatibility
|
||
|
||
The only possibility for backwards compatibility problems that the author
|
||
is aware of are in previous versions of Python that try to unpickle data
|
||
containing the new bytes type.
|
||
|
||
|
||
Reference Implementation
|
||
|
||
XXX: Actual implementation is in progress, but changes are still possible
|
||
as this PEP gets further review.
|
||
|
||
The following new files will be added to the Python baseline:
|
||
|
||
Include/bytesobject.h # C interface
|
||
Objects/bytesobject.c # C implementation
|
||
Lib/test/test_bytes.py # unit testing
|
||
Doc/lib/libbytes.tex # documentation
|
||
|
||
The following files will also be modified:
|
||
|
||
Include/Python.h # adding bytesmodule.h include file
|
||
Python/bltinmodule.c # adding the bytes type object
|
||
Modules/cPickle.c # adding bytes to the standard types
|
||
Lib/pickle.py # adding bytes to the standard types
|
||
|
||
It is possible that several other modules could be cleaned up and
|
||
implemented in terms of the bytes object. The mmap module comes to mind
|
||
first, but as noted above it would be possible to reimplement the array
|
||
module as a pure Python module. While it is attractive that this PEP
|
||
could actually reduce the amount of source code by some amount, the author
|
||
feels that this could cause unnecessary risk for breaking existing
|
||
applications and should be avoided at this time.
|
||
|
||
|
||
Additional Notes/Comments
|
||
|
||
- Guido van Rossum wondered whether it would make sense to be able
|
||
to create a bytes object from a mmap object. The mmap object
|
||
appears to support the requirements necessary to provide memory
|
||
for a bytes object. (It doesn't resize, and the pointer is valid
|
||
for the lifetime of the object.) As such, a method could be added
|
||
to the mmap module such that a bytes object could be created
|
||
directly from a mmap object. An initial stab at how this would be
|
||
implemented would be to use the PyBytes_FromPointer() function
|
||
described above and pass the mmap_object as the user pointer. The
|
||
destructor function would decref the mmap_object for cleanup.
|
||
|
||
- Todd Miller notes that it may be useful to have two new functions:
|
||
PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer that are
|
||
similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but
|
||
support getting a LONG_LONG length in addition to the void* pointer.
|
||
These functions would allow extension authors to work transparently with
|
||
bytes object (that support LONG_LONG lengths) and most other buffer like
|
||
objects (which only support int lengths). These functions could be in
|
||
lieu of, or in addition to, creating a specific PyByte_GetReadBuffer() and
|
||
PyBytes_GetWriteBuffer() functions.
|
||
|
||
XXX: The author thinks this is very a good idea as it paves the way for
|
||
other objects to eventually support large (64 bit) pointers, and it should
|
||
only affect abstract.c and abstract.h. Should this be added above?
|
||
|
||
- It was generally agreed that abusing the segment count of the
|
||
PyBufferProcs interface is not a good hack to work around the 31 bit
|
||
limitation of the length. If you don't know what this means, then you're
|
||
in good company. Most code in the Python baseline, and presumably in many
|
||
third party extensions, punt when the segment count is not 1.
|
||
|
||
|
||
References
|
||
|
||
[1] The buffer interface
|
||
http://mail.python.org/pipermail/python-dev/2000-October/009974.html
|
||
|
||
|
||
Copyright
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
End:
|