2002-07-22 17:04:00 -04:00
|
|
|
|
PEP: 296
|
2002-08-02 14:05:59 -04:00
|
|
|
|
Title: Adding a bytes Object Type
|
2002-07-22 17:04:00 -04:00
|
|
|
|
Version: $Revision$
|
|
|
|
|
Last-Modified: $Date$
|
|
|
|
|
Author: xscottg at yahoo.com (Scott Gilbert)
|
2007-05-18 13:41:31 -04:00
|
|
|
|
Status: Withdrawn
|
2002-07-22 17:04:00 -04:00
|
|
|
|
Type: Standards Track
|
|
|
|
|
Created: 12-Jul-2002
|
|
|
|
|
Python-Version: 2.3
|
|
|
|
|
Post-History:
|
|
|
|
|
|
|
|
|
|
|
2003-09-07 09:55:30 -04:00
|
|
|
|
Notice
|
|
|
|
|
|
2007-05-18 13:41:31 -04:00
|
|
|
|
This PEP is withdrawn by the author (in favor of PEP 358).
|
2003-09-07 09:55:30 -04:00
|
|
|
|
|
|
|
|
|
|
2002-07-22 17:04:00 -04:00
|
|
|
|
Abstract
|
|
|
|
|
|
|
|
|
|
This PEP proposes the creation of a new standard type and builtin
|
|
|
|
|
constructor called 'bytes'. The bytes object is an efficiently
|
|
|
|
|
stored array of bytes with some additional characteristics that
|
|
|
|
|
set it apart from several implementations that are similar.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rationale
|
|
|
|
|
|
|
|
|
|
Python currently has many objects that implement something akin to
|
|
|
|
|
the bytes object of this proposal. For instance the standard
|
|
|
|
|
string, buffer, array, and mmap objects are all very similar in
|
|
|
|
|
some regards to the bytes object. Additionally, several
|
|
|
|
|
significant third party extensions have created similar objects to
|
|
|
|
|
try and fill similar needs. Frustratingly, each of these objects
|
|
|
|
|
is too narrow in scope and is missing critical features to make it
|
|
|
|
|
applicable to a wider category of problems.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Specification
|
|
|
|
|
|
|
|
|
|
The bytes object has the following important characteristics:
|
|
|
|
|
|
|
|
|
|
1. Efficient underlying array storage via the standard C type "unsigned
|
|
|
|
|
char". This allows fine grain control over how much memory is
|
|
|
|
|
allocated. With the alignment restrictions designated in the next
|
|
|
|
|
item, it is trivial for low level extensions to cast the pointer
|
|
|
|
|
to a different type as needed.
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2002-07-22 17:04:00 -04:00
|
|
|
|
Also, since the object is implemented as an array of bytes, it is
|
|
|
|
|
possible to pass the bytes object to the extensive library of
|
|
|
|
|
routines already in the standard library that presently work with
|
|
|
|
|
strings. For instance, the bytes object in conjunction with the
|
|
|
|
|
struct module could be used to provide a complete replacement for
|
|
|
|
|
the array module using only Python script.
|
|
|
|
|
|
|
|
|
|
If an unusual platform comes to light, one where there isn't a
|
|
|
|
|
native unsigned 8 bit type, the object will do its best to
|
|
|
|
|
represent itself at the Python script level as though it were an
|
|
|
|
|
array of 8 bit unsigned values. It is doubtful whether many
|
|
|
|
|
extensions would handle this correctly, but Python script could be
|
|
|
|
|
portable in these cases.
|
|
|
|
|
|
|
|
|
|
2. Alignment of the allocated byte array is whatever is promised by the
|
|
|
|
|
platform implementation of malloc. A bytes object created from an
|
|
|
|
|
extension can be supplied that provides any arbitrary alignment as
|
|
|
|
|
the extension author sees fit.
|
|
|
|
|
|
|
|
|
|
This alignment restriction should allow the bytes object to be
|
|
|
|
|
used as storage for all standard C types - including PyComplex
|
|
|
|
|
objects or other structs of standard C type types. Further
|
|
|
|
|
alignment restrictions can be provided by extensions as necessary.
|
|
|
|
|
|
|
|
|
|
3. The bytes object implements a subset of the sequence operations
|
|
|
|
|
provided by string/array objects, but with slightly different
|
|
|
|
|
semantics in some cases. In particular, a slice always returns a
|
|
|
|
|
new bytes object, but the underlying memory is shared between the
|
|
|
|
|
two objects. This type of slice behavior has been called creating
|
|
|
|
|
a "view". Additionally, repetition and concatenation are
|
|
|
|
|
undefined for bytes objects and will raise an exception.
|
|
|
|
|
|
|
|
|
|
As these objects are likely to find use in high performance
|
|
|
|
|
applications, one motivation for the decision to use view slicing
|
|
|
|
|
is that copying between bytes objects should be very efficient and
|
|
|
|
|
not require the creation of temporary objects. The following code
|
|
|
|
|
illustrates this:
|
|
|
|
|
|
|
|
|
|
# create two 10 Meg bytes objects
|
|
|
|
|
b1 = bytes(10000000)
|
|
|
|
|
b2 = bytes(10000000)
|
|
|
|
|
|
|
|
|
|
# copy from part of one to another with out creating a 1 Meg temporary
|
|
|
|
|
b1[2000000:3000000] = b2[4000000:5000000]
|
|
|
|
|
|
|
|
|
|
Slice assignment where the rvalue is not the same length as the
|
|
|
|
|
lvalue will raise an exception. However, slice assignment will
|
|
|
|
|
work correctly with overlapping slices (typically implemented with
|
|
|
|
|
memmove).
|
|
|
|
|
|
|
|
|
|
4. The bytes object will be recognized as a native type by the pickle and
|
|
|
|
|
cPickle modules for efficient serialization. (In truth, this is
|
|
|
|
|
the only requirement that can't be implemented via a third party
|
|
|
|
|
extension.)
|
|
|
|
|
|
|
|
|
|
Partial solutions to address the need to serialize the data stored
|
|
|
|
|
in a bytes-like object without creating a temporary copy of the
|
|
|
|
|
data into a string have been implemented in the past. The tofile
|
|
|
|
|
and fromfile methods of the array object are good examples of
|
|
|
|
|
this. The bytes object will support these methods too. However,
|
|
|
|
|
pickling is useful in other situations - such as in the shelve
|
|
|
|
|
module, or implementing RPC of Python objects, and requiring the
|
|
|
|
|
end user to use two different serialization mechanisms to get an
|
|
|
|
|
efficient transfer of data is undesirable.
|
|
|
|
|
|
|
|
|
|
XXX: Will try to implement pickling of the new bytes object in
|
|
|
|
|
such a way that previous versions of Python will unpickle it as a
|
|
|
|
|
string object.
|
|
|
|
|
|
|
|
|
|
When unpickling, the bytes object will be created from memory
|
|
|
|
|
allocated from Python (via malloc). As such, it will lose any
|
|
|
|
|
additional properties that an extension supplied pointer might
|
|
|
|
|
have provided (special alignment, or special types of memory).
|
|
|
|
|
|
|
|
|
|
XXX: Will try to make it so that C subclasses of bytes type can
|
|
|
|
|
supply the memory that will be unpickled into. For instance, a
|
|
|
|
|
derived class called PageAlignedBytes would unpickle to memory
|
|
|
|
|
that is also page aligned.
|
|
|
|
|
|
|
|
|
|
On any platform where an int is 32 bits (most of them), it is
|
|
|
|
|
currently impossible to create a string with a length larger than
|
|
|
|
|
can be represented in 31 bits. As such, pickling to a string will
|
|
|
|
|
raise an exception when the operation is not possible.
|
|
|
|
|
|
|
|
|
|
At least on platforms supporting large files (many of them),
|
|
|
|
|
pickling large bytes objects to files should be possible via
|
|
|
|
|
repeated calls to the file.write() method.
|
|
|
|
|
|
|
|
|
|
5. The bytes type supports the PyBufferProcs interface, but a bytes object
|
|
|
|
|
provides the additional guarantee that the pointer will not be
|
|
|
|
|
deallocated or reallocated as long as a reference to the bytes
|
|
|
|
|
object is held. This implies that a bytes object is not resizable
|
|
|
|
|
once it is created, but allows the global interpreter lock (GIL)
|
|
|
|
|
to be released while a separate thread manipulates the memory
|
|
|
|
|
pointed to if the PyBytes_Check(...) test passes.
|
|
|
|
|
|
|
|
|
|
This characteristic of the bytes object allows it to be used in
|
|
|
|
|
situations such as asynchronous file I/O or on multiprocessor
|
|
|
|
|
machines where the pointer obtained by PyBufferProcs will be used
|
|
|
|
|
independently of the global interpreter lock.
|
|
|
|
|
|
|
|
|
|
Knowing that the pointer can not be reallocated or freed after the
|
|
|
|
|
GIL is released gives extension authors the capability to get true
|
|
|
|
|
concurrency and make use of additional processors for long running
|
|
|
|
|
computations on the pointer.
|
|
|
|
|
|
|
|
|
|
6. In C/C++ extensions, the bytes object can be created from a supplied
|
|
|
|
|
pointer and destructor function to free the memory when the
|
|
|
|
|
reference count goes to zero.
|
|
|
|
|
|
|
|
|
|
The special implementation of slicing for the bytes object allows
|
|
|
|
|
multiple bytes objects to refer to the same pointer/destructor.
|
|
|
|
|
As such, a refcount will be kept on the actual
|
|
|
|
|
pointer/destructor. This refcount is separate from the refcount
|
|
|
|
|
typically associated with Python objects.
|
|
|
|
|
|
|
|
|
|
XXX: It may be desirable to expose the inner refcounted object as an
|
|
|
|
|
actual Python object. If a good use case arises, it should be possible
|
|
|
|
|
for this to be implemented later with no loss to backwards compatibility.
|
|
|
|
|
|
|
|
|
|
7. It is also possible to signify the bytes object as readonly, in this
|
|
|
|
|
case it isn't actually mutable, but does provide the other features of a
|
|
|
|
|
bytes object.
|
|
|
|
|
|
|
|
|
|
8. The bytes object keeps track of the length of its data with a Python
|
|
|
|
|
LONG_LONG type. Even though the current definition for PyBufferProcs
|
|
|
|
|
restricts the length to be the size of an int, this PEP does not propose
|
|
|
|
|
to make any changes there. Instead, extensions can work around this limit
|
|
|
|
|
by making an explicit PyBytes_Check(...) call, and if that succeeds they
|
|
|
|
|
can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer call to
|
|
|
|
|
get the pointer and full length of the object as a LONG_LONG.
|
|
|
|
|
|
|
|
|
|
The bytes object will raise an exception if the standard PyBufferProcs
|
|
|
|
|
mechanism is used and the size of the bytes object is greater than can be
|
|
|
|
|
represented by an integer.
|
|
|
|
|
|
|
|
|
|
From Python scripting, the bytes object will be subscriptable with longs
|
|
|
|
|
so the 32 bit int limit can be avoided.
|
|
|
|
|
|
|
|
|
|
There is still a problem with the len() function as it is PyObject_Size()
|
|
|
|
|
and this returns an int as well. As a workaround, the bytes object will
|
|
|
|
|
provide a .length() method that will return a long.
|
|
|
|
|
|
|
|
|
|
9. The bytes object can be constructed at the Python scripting level by
|
|
|
|
|
passing an int/long to the bytes constructor with the number of bytes to
|
|
|
|
|
allocate. For example:
|
|
|
|
|
|
|
|
|
|
b = bytes(100000) # alloc 100K bytes
|
|
|
|
|
|
|
|
|
|
The constructor can also take another bytes object. This will be useful
|
|
|
|
|
for the implementation of unpickling, and in converting a read-write bytes
|
|
|
|
|
object into a read-only one. An optional second argument will be used to
|
|
|
|
|
designate creation of a readonly bytes object.
|
|
|
|
|
|
|
|
|
|
10. From the C API, the bytes object can be allocated using any of the
|
|
|
|
|
following signatures:
|
|
|
|
|
|
|
|
|
|
PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
|
|
|
|
|
PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly
|
|
|
|
|
void (*dest)(void *ptr, void *user), void* user);
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2002-07-22 17:04:00 -04:00
|
|
|
|
In the PyBytes_FromPointer(...) function, if the dest function pointer is
|
|
|
|
|
passed in as NULL, it will not be called. This should only be used for
|
|
|
|
|
creating bytes objects from statically allocated space.
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2002-07-22 17:04:00 -04:00
|
|
|
|
The user pointer has been called a closure in other places. It is a
|
|
|
|
|
pointer that the user can use for whatever purposes. It will be passed to
|
|
|
|
|
the destructor function on cleanup and can be useful for a number of
|
|
|
|
|
things. If the user pointer is not needed, NULL should be passed instead.
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2002-07-22 17:04:00 -04:00
|
|
|
|
11. The bytes type will be a new style class as that seems to be where all
|
|
|
|
|
standard Python types are headed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Contrast to existing types
|
|
|
|
|
|
|
|
|
|
The most common way to work around the lack of a bytes object has been to
|
|
|
|
|
simply use a string object in its place. Binary files, the struct/array
|
|
|
|
|
modules, and several other examples exist of this. Putting aside the
|
|
|
|
|
style issue that these uses typically have nothing to do with text
|
|
|
|
|
strings, there is the real problem that strings are not mutable, so direct
|
|
|
|
|
manipulation of the data returned in these cases is not possible. Also,
|
|
|
|
|
numerous optimizations in the string module (such as caching the hash
|
|
|
|
|
value or interning the pointers) mean that extension authors are on very
|
|
|
|
|
thin ice if they try to break the rules with the string object.
|
|
|
|
|
|
|
|
|
|
The buffer object seems like it was intended to address the purpose that
|
|
|
|
|
the bytes object is trying fulfill, but several shortcomings in its
|
|
|
|
|
implementation [1] have made it less useful in many common cases. The
|
|
|
|
|
buffer object made a different choice for its slicing behavior (it returns
|
|
|
|
|
new strings instead of buffers for slicing and other operations), and it
|
|
|
|
|
doesn't make many of the promises on alignment or being able to release
|
|
|
|
|
the GIL that the bytes object does.
|
|
|
|
|
|
|
|
|
|
Also in regards to the buffer object, it is not possible to simply replace
|
|
|
|
|
the buffer object with the bytes object and maintain backwards
|
|
|
|
|
compatibility. The buffer object provides a mechanism to take the
|
|
|
|
|
PyBufferProcs supplied pointer of another object and present it as its
|
|
|
|
|
own. Since the behavior of the other object can not be guaranteed to
|
|
|
|
|
follow the same set of strict rules that a bytes object does, it can't be
|
|
|
|
|
used in places that a bytes object could.
|
|
|
|
|
|
|
|
|
|
The array module supports the creation of an array of bytes, but it does
|
|
|
|
|
not provide a C API for supplying pointers and destructors to extension
|
|
|
|
|
supplied memory. This makes it unusable for constructing objects out of
|
|
|
|
|
shared memory, or memory that has special alignment or locking for things
|
|
|
|
|
like DMA transfers. Also, the array object does not currently pickle.
|
|
|
|
|
Finally since the array object allows its contents to grow, via the extend
|
|
|
|
|
method, the pointer can be changed if the GIL is not held while using it.
|
|
|
|
|
|
|
|
|
|
Creating a buffer object from an array object has the same problem of
|
|
|
|
|
leaving an invalid pointer when the array object is resized.
|
|
|
|
|
|
|
|
|
|
The mmap object caters to its particular niche, but does not attempt to
|
|
|
|
|
solve a wider class of problems.
|
|
|
|
|
|
|
|
|
|
Finally, any third party extension can not implement pickling without
|
2016-07-11 11:14:08 -04:00
|
|
|
|
creating a temporary object of a standard Python type. For example, in the
|
2002-07-22 17:04:00 -04:00
|
|
|
|
Numeric community, it is unpleasant that a large array can't pickle
|
|
|
|
|
without creating a large binary string to duplicate the array data.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Backward Compatibility
|
|
|
|
|
|
|
|
|
|
The only possibility for backwards compatibility problems that the author
|
|
|
|
|
is aware of are in previous versions of Python that try to unpickle data
|
|
|
|
|
containing the new bytes type.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Reference Implementation
|
|
|
|
|
|
|
|
|
|
XXX: Actual implementation is in progress, but changes are still possible
|
|
|
|
|
as this PEP gets further review.
|
|
|
|
|
|
|
|
|
|
The following new files will be added to the Python baseline:
|
|
|
|
|
|
|
|
|
|
Include/bytesobject.h # C interface
|
|
|
|
|
Objects/bytesobject.c # C implementation
|
|
|
|
|
Lib/test/test_bytes.py # unit testing
|
|
|
|
|
Doc/lib/libbytes.tex # documentation
|
|
|
|
|
|
|
|
|
|
The following files will also be modified:
|
|
|
|
|
|
|
|
|
|
Include/Python.h # adding bytesmodule.h include file
|
|
|
|
|
Python/bltinmodule.c # adding the bytes type object
|
|
|
|
|
Modules/cPickle.c # adding bytes to the standard types
|
|
|
|
|
Lib/pickle.py # adding bytes to the standard types
|
|
|
|
|
|
|
|
|
|
It is possible that several other modules could be cleaned up and
|
|
|
|
|
implemented in terms of the bytes object. The mmap module comes to mind
|
|
|
|
|
first, but as noted above it would be possible to reimplement the array
|
|
|
|
|
module as a pure Python module. While it is attractive that this PEP
|
|
|
|
|
could actually reduce the amount of source code by some amount, the author
|
|
|
|
|
feels that this could cause unnecessary risk for breaking existing
|
|
|
|
|
applications and should be avoided at this time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Additional Notes/Comments
|
|
|
|
|
|
|
|
|
|
- Guido van Rossum wondered whether it would make sense to be able
|
|
|
|
|
to create a bytes object from a mmap object. The mmap object
|
|
|
|
|
appears to support the requirements necessary to provide memory
|
|
|
|
|
for a bytes object. (It doesn't resize, and the pointer is valid
|
|
|
|
|
for the lifetime of the object.) As such, a method could be added
|
|
|
|
|
to the mmap module such that a bytes object could be created
|
|
|
|
|
directly from a mmap object. An initial stab at how this would be
|
|
|
|
|
implemented would be to use the PyBytes_FromPointer() function
|
|
|
|
|
described above and pass the mmap_object as the user pointer. The
|
|
|
|
|
destructor function would decref the mmap_object for cleanup.
|
|
|
|
|
|
|
|
|
|
- Todd Miller notes that it may be useful to have two new functions:
|
|
|
|
|
PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer that are
|
|
|
|
|
similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but
|
|
|
|
|
support getting a LONG_LONG length in addition to the void* pointer.
|
|
|
|
|
These functions would allow extension authors to work transparently with
|
|
|
|
|
bytes object (that support LONG_LONG lengths) and most other buffer like
|
|
|
|
|
objects (which only support int lengths). These functions could be in
|
|
|
|
|
lieu of, or in addition to, creating a specific PyByte_GetReadBuffer() and
|
|
|
|
|
PyBytes_GetWriteBuffer() functions.
|
|
|
|
|
|
|
|
|
|
XXX: The author thinks this is very a good idea as it paves the way for
|
|
|
|
|
other objects to eventually support large (64 bit) pointers, and it should
|
|
|
|
|
only affect abstract.c and abstract.h. Should this be added above?
|
|
|
|
|
|
|
|
|
|
- It was generally agreed that abusing the segment count of the
|
|
|
|
|
PyBufferProcs interface is not a good hack to work around the 31 bit
|
|
|
|
|
limitation of the length. If you don't know what this means, then you're
|
|
|
|
|
in good company. Most code in the Python baseline, and presumably in many
|
|
|
|
|
third party extensions, punt when the segment count is not 1.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
References
|
|
|
|
|
|
|
|
|
|
[1] The buffer interface
|
2017-06-11 15:02:39 -04:00
|
|
|
|
https://mail.python.org/pipermail/python-dev/2000-October/009974.html
|
2002-07-22 17:04:00 -04:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Copyright
|
|
|
|
|
|
|
|
|
|
This document has been placed in the public domain.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Local Variables:
|
|
|
|
|
mode: indented-text
|
|
|
|
|
indent-tabs-mode: nil
|
|
|
|
|
sentence-end-double-space: t
|
|
|
|
|
fill-column: 70
|
|
|
|
|
End:
|