PEP: 3137
Title: Immutable Bytes and Mutable Buffer
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Sep-2007
Python-Version: 3.0
Post-History: 26-Sep-2007

Introduction
============

After releasing Python 3.0a1 with a mutable bytes type, pressure
mounted to add a way to represent immutable bytes.  Gregory P. Smith
proposed a patch that would allow making a bytes object temporarily
immutable by requesting that the data be locked using the new buffer
API from PEP 3118.  This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
make the bytes type immutable (by crudely removing all mutating APIs)
and fix the fall-out in the test suite.  This showed that there aren't
all that many places that depend on the mutability of bytes, with the
exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array
module as an ersatz mutable bytes type is far from ideal, and
recalling a proposal put forward earlier by Talin, I floated the
suggestion to have both a mutable and an immutable bytes type.  (This
had been brought up before, but until seeing the evidence of Jeffrey's
patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old
PyString implementation, stripped down to remove locale support and
implicit conversions to/from Unicode, for the immutable bytes type,
and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but
needs to be specified more precisely.  Hence this PEP.

Advantages
==========

One advantage of having an immutable bytes type is that code objects
can use these.  It also makes it possible to efficiently create hash
tables using bytes for keys; this may be useful when parsing protocols
like HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in Python
2.x will be easier using the new design than using the original 3.0
design with mutable bytes; simply replace ``str`` with ``bytes`` and
change '...' literals into b'...' literals.

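As a rough illustration of the hash-table use case, the sketch below
collects a few HTTP-style header lines into a dict keyed by bytes.
The function name ``parse_headers`` and the sample data are
hypothetical and exist only for this example.

```python
def parse_headers(raw):
    """Split raw header lines into a dict keyed by lower-cased bytes names."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if not line:
            continue  # skip the empty piece after the final CRLF
        name, _, value = line.partition(b":")
        # bytes is immutable and hashable, so it can serve as a dict key
        headers[name.strip().lower()] = value.strip()
    return headers


raw = b"Host: example.org\r\nContent-Length: 42\r\n"
headers = parse_headers(raw)
print(headers[b"host"])
```

No decoding to text is needed to look up a header, which is the point
of allowing bytes as keys.
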
Naming
======

I propose the following type names at the Python level:

- ``bytes`` is an immutable array of bytes (PyString)

- ``buffer`` is a mutable array of bytes (PyBytes)

- ``memoryview`` is a bytes view on another object (PyMemory)

The old type named ``buffer`` is so similar to the new type
``memoryview``, introduced by PEP 3118, that it is redundant.  The rest
of this PEP doesn't discuss the functionality of ``memoryview``; it is
just mentioned here to justify getting rid of the old ``buffer`` type
so we can reuse its name for the mutable bytes type.

While eventually it makes sense to change the C API names, this PEP
maintains the old C API names, which should be familiar to all.

Literal Notations
=================

The b'...' notation introduced in Python 3.0a1 returns an immutable
bytes object, whatever variation is used.  To create a mutable bytes
buffer object, use buffer(b'...') or buffer([...]).  The latter may
use a list of integers in range(256).

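A minimal sketch of the two spellings follows.  It uses
``bytearray``, the name under which released Python 3 ships the
mutable type, as a stand-in for ``buffer``; that substitution is an
assumption of the example, not part of this draft.

```python
immutable = b"abc"                    # a literal always yields immutable bytes
from_literal = bytearray(b"abc")      # mutable copy of a bytes literal
from_ints = bytearray([97, 98, 99])   # mutable object from ints in range(256)

try:
    bytearray([256])                  # 256 is outside range(256)
except ValueError:
    out_of_range_rejected = True
else:
    out_of_range_rejected = False
```
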
Functionality
=============

PEP 3118 Buffer API
-------------------

Both bytes and buffer implement the PEP 3118 buffer API.  The bytes
type only implements read-only requests; the buffer type allows
writable and data-locked requests as well.  The element data type is
always 'B' (i.e. unsigned byte).

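From Python code, the difference between read-only and writable
requests can be observed through ``memoryview``.  The sketch below
assumes ``bytearray`` as the stand-in for this PEP's ``buffer``;
data-locked requests are not shown.

```python
backing = bytearray(b"abc")
ro_view = memoryview(b"abc")      # bytes only grants read-only access
rw_view = memoryview(backing)     # the mutable type grants writable access

print(ro_view.readonly, rw_view.readonly)
print(ro_view.format, rw_view.format)   # element type of both views

rw_view[0] = ord("x")             # the write goes through to the backing object

try:
    ro_view[0] = ord("x")
except TypeError:
    write_refused = True
else:
    write_refused = False
```
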
Constructors
------------

There are four forms of constructors applicable to both bytes and
buffer, plus a fifth form that applies to buffer only:

- ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``,
  ``buffer(<buffer>)``: simple copying constructors, with the note
  that ``bytes(<bytes>)`` might return its (immutable) argument.

- ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>,
  <encoding>[, <errors>])``: encode a text string.  Note that the
  ``str.encode()`` method returns an *immutable* bytes object.
  The <encoding> argument is mandatory; <errors> is optional.

- ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a
  bytes or buffer object from anything implementing the PEP 3118
  buffer API.

- ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``:
  construct an immutable bytes or mutable buffer object from a
  stream of integers in range(256).

- ``buffer(<int>)``: construct a zero-initialized buffer of a given
  length.

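The constructor forms listed above can be sketched as follows, again
assuming ``bytearray`` as the stand-in for this PEP's ``buffer``.  The
variable names are illustrative only.

```python
copied = bytes(bytearray(b"abc"))          # copying constructor
encoded = bytes("caf\u00e9", "utf-8")      # text plus a mandatory encoding
from_view = bytearray(memoryview(b"abc"))  # anything exporting the buffer API
from_ints = bytes(range(65, 68))           # iterable of ints in range(256)
zeroed = bytearray(4)                      # zero-initialized, length 4

try:
    bytes("abc")                           # str without an encoding
except TypeError:
    encoding_required = True
else:
    encoding_required = False
```
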
Comparisons
-----------

The bytes and buffer types are comparable with each other and
orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.

Comparing either type to a str object raises an exception.  This
turned out to be necessary to catch common mistakes.

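A small sketch of the mixed-type chain, assuming ``bytearray`` as the
stand-in for this PEP's ``buffer``.  Only an ordering comparison
against str is exercised; how ``==`` against str behaves may differ
between a given interpreter and this draft, so it is left out.

```python
# bytes and the mutable type compare by content, across both types
chain_holds = b"abc" == bytearray(b"abc") < b"abd"

try:
    b"abc" < "abc"          # ordering bytes against str
except TypeError:
    ordering_with_str_fails = True
else:
    ordering_with_str_fails = False
```
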
Slicing
-------

Slicing a bytes object returns a bytes object.  Slicing a buffer
object returns a buffer object.

Slice assignment to a mutable buffer object accepts anything that
implements the PEP 3118 buffer API, or an iterable of integers in
range(256).

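The slicing rules can be sketched like this, assuming ``bytearray`` as
the stand-in for this PEP's ``buffer``:

```python
data = b"hello world"
buf = bytearray(data)

bytes_slice = data[:5]      # slicing bytes gives bytes
buffer_slice = buf[:5]      # slicing the mutable type gives the mutable type

buf[:5] = b"HELLO"                  # source implements the buffer API
buf[6:] = [87, 79, 82, 76, 68]      # source is an iterable of ints ("WORLD")
```
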
Indexing
--------

Indexing bytes and buffer returns small ints (like the bytes type in
3.0a1, and like lists or array.array('B')).

Assignment to an item of a mutable buffer object accepts an int in
range(256).  (To assign from a bytes sequence, use a slice
assignment.)

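A short sketch of indexing and item assignment, assuming ``bytearray``
as the stand-in for this PEP's ``buffer``:

```python
data = b"abc"
first = data[0]             # an int, not a length-1 bytes object

buf = bytearray(data)
buf[0] = 65                 # item assignment takes an int in range(256)
buf[1:2] = b"BB"            # a bytes sequence goes in via slice assignment

try:
    buf[0] = b"A"           # a bytes object is not accepted as an item
except TypeError:
    item_must_be_int = True
else:
    item_must_be_int = False
```
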
Str() and Repr()
----------------

The str() and repr() functions return the same thing for these
objects.  The repr() of a bytes object returns a b'...' style literal.
The repr() of a buffer returns a string of the form "buffer(b'...')".

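The sketch below checks these rules with ``bytearray`` standing in for
this PEP's ``buffer``.  Because of that assumption, the repr() of the
mutable object reads ``bytearray(b'...')`` rather than
``buffer(b'...')``.  It also assumes the interpreter is run without
the ``-bb`` option, which turns str() of a bytes object into an error.

```python
data = b"abc"
buf = bytearray(data)

bytes_repr = repr(data)     # a b'...' style literal
buffer_repr = repr(buf)     # the type name wrapped around a b'...' literal
same_for_bytes = str(data) == repr(data)
same_for_buffer = str(buf) == repr(buf)
```
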
Operators
---------

The following operators are implemented by the bytes and buffer types,
except where mentioned:

- ``b1 + b2``: concatenation.  With mixed bytes/buffer operands,
  the return type is that of the first argument (this seems arbitrary
  until you consider how ``+=`` works).

- ``b1 += b2``: mutates b1 if it is a buffer object.

- ``b * n``, ``n * b``: repetition; n must be an integer.

- ``b *= n``: mutates b if it is a buffer object.

- ``b1 in b2``, ``b1 not in b2``: substring test; b1 can be any
  object implementing the PEP 3118 buffer API.

- ``i in b``, ``i not in b``: single-byte membership test; i must
  be an integer (if it is a length-1 bytes array, it is considered
  to be a substring test, with the same outcome).

- ``len(b)``: the number of bytes.

- ``hash(b)``: the hash value; only implemented by the bytes type.

Note that the % operator is *not* implemented.  It does not appear
worth the complexity.

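The operator rules above can be sketched as follows, assuming
``bytearray`` as the stand-in for this PEP's ``buffer``.  The ``%``
operator is deliberately not exercised.

```python
b1 = b"ab"
buf = bytearray(b"cd")

left_bytes = b1 + buf       # result type follows the first operand: bytes
left_buffer = buf + b1      # result type follows the first operand: mutable

alias = buf
buf += b"ef"                # in-place: alias sees the change

b2 = b1
b2 += b"!"                  # bytes is immutable: b2 is rebound, b1 untouched

repeated = 2 * b"ab"
has_substring = b"bc" in b"abcd"
has_byte = 98 in b"abc"     # 98 is ord("b")

try:
    hash(buf)               # only the immutable type is hashable
except TypeError:
    buffer_unhashable = True
else:
    buffer_unhashable = False
```
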
Methods
-------

The following methods are implemented by bytes as well as buffer, with
similar semantics.  They accept anything that implements the PEP 3118
buffer API for bytes arguments, and return the same type as the object
whose method is called ("self")::

    .capitalize(), .center(), .count(), .decode(), .endswith(),
    .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
    .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
    .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
    .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
    .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
    .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python
2.x, with the exclusion of .encode().  The signatures and semantics
are the same too.  However, whenever character classes like letter,
whitespace, lower case are used, the ASCII definitions of these
classes are used.  (The Python 2.x str type uses the definitions from
the current locale, settable through the locale module.)  The
.encode() method is left out because of the more strict definitions of
encoding and decoding in Python 3000: encoding always takes a Unicode
string and returns a bytes sequence, and decoding always takes a bytes
sequence and returns a Unicode string.

In addition, both types implement the class method ``.fromhex()``,
which constructs an object from a string containing hexadecimal values
(with or without spaces between the bytes).

The buffer type implements these additional methods from the
MutableSequence ABC (see PEP 3119):

  .extend(), .insert(), .append(), .reverse(), .pop(), .remove().

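A brief sketch of the method rules, assuming ``bytearray`` as the
stand-in for this PEP's ``buffer``.  It covers the "same type as self"
rule, the ASCII-only character classes, ``.fromhex()``, and the
mutable-sequence methods.

```python
upper_bytes = b"abc".upper()                    # bytes in, bytes out
upper_buffer = bytearray(b"abc").upper()        # mutable in, mutable out
pieces = bytearray(b"a,b").split(b",")          # argument may be plain bytes

# 0xE9 is a letter in Latin-1 but not in ASCII, so it is left alone
non_ascii_is_alpha = b"\xe9".isalpha()
non_ascii_upper = b"\xe9".upper()

with_spaces = bytes.fromhex("61 62 63")
without_spaces = bytearray.fromhex("616263")

buf = bytearray(b"abc")
buf.append(100)             # abcd
buf.extend(b"ef")           # abcdef
buf.insert(0, 95)           # _abcdef
last = buf.pop()            # removes and returns the final byte
buf.remove(95)              # abcde
buf.reverse()               # edcba
```
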
Bytes and the Str Type
----------------------

Like the bytes type in Python 3.0a1, and unlike the relationship
between str and unicode in Python 2.x, any attempt to mix bytes (or
buffer) objects and str objects without specifying an encoding will
raise a TypeError exception.  This is the case even for simply
comparing a bytes or buffer object to a str object (even violating the
general rule that comparing objects of different types for equality
should just return False).

Conversions between bytes or buffer objects and str objects must
always be explicit, using an encoding.  There are two equivalent APIs:
``str(b, <encoding>[, <errors>])`` is equivalent to
``b.decode(<encoding>[, <errors>])``, and
``bytes(s, <encoding>[, <errors>])`` is equivalent to
``s.encode(<encoding>[, <errors>])``.

There is one exception: we can convert from bytes (or buffer) to str
without specifying an encoding by writing ``str(b)``.  This produces
the same result as ``repr(b)``.  This exception is necessary because
of the general promise that *any* object can be printed, and printing
is just a special case of conversion to str.  There is however no
promise that printing a bytes object interprets the individual bytes
as characters (unlike in Python 2.x).

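The two equivalent conversion APIs and the ``str(b)`` exception can be
sketched as below.  The sample text is arbitrary, and the sketch
assumes the interpreter is run without the ``-bb`` option.

```python
text = "caf\u00e9"

data = bytes(text, "utf-8")             # same as text.encode("utf-8")
round_trip = str(data, "utf-8")         # same as data.decode("utf-8")

try:
    data + text                         # mixing without an encoding
except TypeError:
    mixing_fails = True
else:
    mixing_fails = False

printable = str(b"abc")                 # no encoding: falls back to repr()
```
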
The str type currently implements the PEP 3118 buffer API.  While this
is perhaps occasionally convenient, it is also potentially confusing,
because the bytes accessed via the buffer API represent a
platform-dependent encoding: depending on the platform byte order and
a compile-time configuration option, the encoding could be UTF-16-BE,
UTF-16-LE, UTF-32-BE, or UTF-32-LE.  Worse, a different implementation
of the str type might completely change the bytes representation,
e.g. to UTF-8, or even make it impossible to access the data as a
contiguous array of bytes at all.  Therefore, the PEP 3118 buffer API
will be removed from the str type.

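As a sketch of the intended end state, the example below checks that a
str object cannot be wrapped in a ``memoryview`` and shows how an
explicit encoding pins down the byte layout instead.  It assumes an
interpreter in which the removal described above has taken effect.

```python
try:
    memoryview("abc")           # str no longer exports the buffer API
except TypeError:
    str_exports_buffer = False
else:
    str_exports_buffer = True

# With an explicit encoding the resulting bytes are well defined
# and do not depend on the platform or build options.
little = "\u00e9".encode("utf-16-le")
big = "\u00e9".encode("utf-16-be")
utf8 = "\u00e9".encode("utf-8")
```
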
Pickling
--------

Left as an exercise for the reader.

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: