PEP: 3137
Title: Immutable Bytes and Mutable Buffer
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Sep-2007
Python-Version: 3.0
Post-History: 26-Sep-2007, 30-Sep-2007

Introduction
============

After releasing Python 3.0a1 with a mutable bytes type, pressure
mounted to add a way to represent immutable bytes. Gregory P. Smith
proposed a patch that would allow making a bytes object temporarily
immutable by requesting that the data be locked using the new buffer
API from PEP 3118. This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
make the bytes type immutable (by crudely removing all mutating APIs)
and fix the fall-out in the test suite. This showed that there aren't
all that many places that depend on the mutability of bytes, with the
exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array
module as an ersatz mutable bytes type is far from ideal, and
recalling a proposal put forward earlier by Talin, I floated the
suggestion to have both a mutable and an immutable bytes type. (This
had been brought up before, but until seeing the evidence of Jeffrey's
patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old
PyString implementation, stripped down to remove locale support and
implicit conversions to/from Unicode, for the immutable bytes type,
and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but
needs to be specified more precisely. Hence this PEP.

Advantages
==========

One advantage of having an immutable bytes type is that code objects
can use these. It also makes it possible to efficiently create hash
tables using bytes for keys; this may be useful when parsing protocols
like HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in Python
2.x will be easier using the new design than using the original 3.0
design with mutable bytes; simply replace ``str`` with ``bytes`` and
change '...' literals into b'...' literals.
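As an illustration of the hash-table advantage (a minimal sketch, not
part of the original proposal), the snippet below keys a dictionary by
bytes objects. The ``headers`` table and the ``lookup`` helper are
hypothetical names invented for this example.

```python
# Hypothetical header table keyed by immutable bytes objects.
headers = {
    b"Content-Type": b"text/html",
    b"Content-Length": b"42",
}


def lookup(raw_line):
    """Split a raw b'Name: value' line and look the name up in the table."""
    name, _, _ = raw_line.partition(b":")
    return headers.get(name.strip())


result = lookup(b"Content-Length: 42\r\n")
print(result)  # b'42'

# A mutable bytearray cannot serve as a key because it is unhashable.
try:
    {bytearray(b"Content-Type"): b"text/html"}
except TypeError as exc:
    print("bytearray key rejected:", exc)
```

No decoding step is needed: the raw protocol bytes are used as keys
directly.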

Naming
======

I propose the following type names at the Python level:

- ``bytes`` is an immutable array of bytes (PyString)

- ``bytearray`` is a mutable array of bytes (PyBytes)

- ``memoryview`` is a bytes view on another object (PyMemoryView)

The old type named ``buffer`` is so similar to the new type
``memoryview``, introduced by PEP 3118, that it is redundant. The rest
of this PEP doesn't discuss the functionality of ``memoryview``; it is
just mentioned here to justify getting rid of the old ``buffer`` type.
(An earlier version of this PEP proposed ``buffer`` as the new name
for PyBytes; in the end this name was deemed too confusing given the
many other uses of the word buffer.)

While eventually it makes sense to change the C API names, this PEP
maintains the old C API names, which should be familiar to all.

Summary
-------

Here's a simple ASCII-art table summarizing the type names in various
Python versions::

    +--------------+-------------+------------+--------------------------+
    | C name       | 2.x    repr | 3.0a1 repr | 3.0a2               repr |
    +--------------+-------------+------------+--------------------------+
    | PyUnicode    | unicode u'' | str     '' | str                   '' |
    | PyString     | str      '' | str8   s'' | bytes                b'' |
    | PyBytes      | N/A         | bytes  b'' | bytearray bytearray(b'') |
    | PyBuffer     | buffer      | buffer     | N/A                      |
    | PyMemoryView | N/A         | memoryview | memoryview         <...> |
    +--------------+-------------+------------+--------------------------+

Literal Notations
=================

The b'...' notation introduced in Python 3.0a1 returns an immutable
bytes object, whatever variation is used. To create a mutable array
of bytes, use bytearray(b'...') or bytearray([...]). The latter form
takes a list of integers in range(256).
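A short sketch of the notations described above, assuming a released
Python 3 interpreter. The variable names are illustrative only.

```python
# Every spelling of the b'...' literal yields an immutable bytes object.
immutable = b'abc'
assert immutable == b"abc" == b"""abc"""
assert type(immutable) is bytes

# Two ways to obtain a mutable array of bytes.
mutable_from_literal = bytearray(b'abc')
mutable_from_ints = bytearray([97, 98, 99])
assert mutable_from_literal == mutable_from_ints == immutable

# Only the bytearray can be changed in place.
mutable_from_ints[0] = 65
print(mutable_from_ints)  # bytearray(b'Abc')

# Integers outside range(256) are rejected.
try:
    bytearray([256])
except ValueError as exc:
    print("rejected:", exc)
```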

Functionality
=============

PEP 3118 Buffer API
-------------------

Both bytes and bytearray implement the PEP 3118 buffer API. The bytes
type only implements read-only requests; the bytearray type allows
writable and data-locked requests as well. The element data type is
always 'B' (i.e. unsigned byte).
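The read-only versus writable distinction can be observed from Python
through ``memoryview``. This is a sketch assuming a released Python 3
interpreter; data-locked requests are not shown.

```python
backing = bytearray(b"abc")
ro_view = memoryview(b"abc")   # bytes: read-only export
rw_view = memoryview(backing)  # bytearray: writable export

print(ro_view.readonly, ro_view.format)  # True B
print(rw_view.readonly, rw_view.format)  # False B

# Writing through the view changes the underlying bytearray.
rw_view[0] = ord("X")
print(backing)  # bytearray(b'Xbc')

# Writing through a view on bytes is refused.
try:
    ro_view[0] = ord("X")
except TypeError as exc:
    print("read-only:", exc)
```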

Constructors
------------

There are five forms of constructors, applicable to both bytes and
bytearray:

- ``bytes(<bytes>)``, ``bytes(<bytearray>)``, ``bytearray(<bytes>)``,
  ``bytearray(<bytearray>)``: simple copying constructors, with the
  note that ``bytes(<bytes>)`` might return its (immutable)
  argument, but ``bytearray(<bytearray>)`` always makes a copy.

- ``bytes(<str>, <encoding>[, <errors>])``, ``bytearray(<str>,
  <encoding>[, <errors>])``: encode a text string. Note that the
  ``str.encode()`` method returns an *immutable* bytes object. The
  <encoding> argument is mandatory; <errors> is optional.
  <encoding> and <errors>, if given, must be ``str`` instances.

- ``bytes(<memory view>)``, ``bytearray(<memory view>)``: construct
  a bytes or bytearray object from anything that implements the PEP
  3118 buffer API.

- ``bytes(<iterable of ints>)``, ``bytearray(<iterable of ints>)``:
  construct a bytes or bytearray object from a stream of integers in
  range(256).

- ``bytes(<int>)``, ``bytearray(<int>)``: construct a
  zero-initialized bytes or bytearray object of a given length.
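The constructor forms listed above can be sketched as follows,
assuming a released Python 3 interpreter. All variable names are
illustrative.

```python
src = b"abc"
original = bytearray(src)

# Copying constructors: a bytearray copy is always a distinct object.
copy_bytes = bytes(original)
copy_array = bytearray(original)
assert copy_array == original and copy_array is not original

# Encoding a text string: the encoding is mandatory, errors is optional.
encoded = bytes("café", "utf-8")
encoded_lenient = bytearray("café", "ascii", "replace")
print(encoded)          # b'caf\xc3\xa9'
print(encoded_lenient)  # bytearray(b'caf?')

# From any object that implements the buffer API.
from_view = bytes(memoryview(b"xyz"))

# From an iterable of integers in range(256).
from_ints = bytes(range(65, 70))
print(from_ints)  # b'ABCDE'

# Zero-initialized object of a given length.
zeros = bytearray(4)
print(zeros)  # bytearray(b'\x00\x00\x00\x00')

# Omitting the encoding for a str argument is an error.
try:
    bytes("text")
except TypeError as exc:
    print("encoding required:", exc)
```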

Comparisons
-----------

The bytes and bytearray types are comparable with each other and
orderable, so that e.g. b'abc' == bytearray(b'abc') < b'abd'.

Comparing either type to a str object for equality returns False
regardless of the contents of either operand. Ordering comparisons
with str raise TypeError. This is all conformant to the standard
rules for comparison and ordering between objects of incompatible
types.

(**Note:** in Python 3.0a1, comparing a bytes instance with a str
instance would raise TypeError, on the premise that this would catch
the occasional mistake quicker, especially in code ported from Python
2.x. However, a long discussion on the python-3000 list pointed out
so many problems with this that it is clearly a bad idea, to be rolled
back in 3.0a2 regardless of the fate of the rest of this PEP.)
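The comparison rules above can be checked with a few lines. This
sketch assumes a released Python 3 interpreter started without the
``-b`` or ``-bb`` options, which add a warning or an error to
bytes/str equality tests.

```python
# bytes and bytearray compare and order by content.
chained = b'abc' == bytearray(b'abc') < b'abd'
print(chained)  # True

# Equality against str is always False, whatever the contents.
eq_bytes = (b'abc' == 'abc')
eq_array = (bytearray(b'abc') == 'abc')
print(eq_bytes, eq_array)  # False False

# Ordering against str raises TypeError.
try:
    b'abc' < 'abd'
    ordering_failed = False
except TypeError:
    ordering_failed = True
print(ordering_failed)  # True
```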

Slicing
-------

Slicing a bytes object returns a bytes object. Slicing a bytearray
object returns a bytearray object.

Slice assignment to a bytearray object accepts anything that
implements the PEP 3118 buffer API, or an iterable of integers in
range(256).
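A brief sketch of the slicing behavior, assuming a released Python 3
interpreter. The names ``data`` and ``buf`` are illustrative.

```python
data = b"hello world"
buf = bytearray(data)

# A slice has the same type as the object it was taken from.
assert type(data[:5]) is bytes
assert type(buf[:5]) is bytearray

# Slice assignment from an object implementing the buffer API.
buf[0:5] = b"HELLO"

# Slice assignment from an iterable of integers in range(256).
buf[6:] = [87, 79, 82, 76, 68]
print(buf)  # bytearray(b'HELLO WORLD')

# The replacement may differ in length from the slice it replaces.
buf[5:6] = memoryview(b"__")
print(buf)  # bytearray(b'HELLO__WORLD')
```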

Indexing
--------

Indexing bytes and bytearray returns small ints (like the bytes type in
3.0a1, and like lists or array.array('B')).

Assignment to an item of a bytearray object accepts an int in
range(256). (To assign from a bytes sequence, use a slice
assignment.)
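The indexing rules can be sketched as follows, assuming a released
Python 3 interpreter.

```python
data = b"abc"
buf = bytearray(data)

# Indexing yields integers, not length-1 byte strings.
first, last = data[0], buf[-1]
print(first, last)  # 97 99
assert isinstance(first, int)

# Item assignment takes an int in range(256).
buf[0] = 65

# Assigning a bytes object to a single item is rejected ...
try:
    buf[1] = b"B"
except TypeError as exc:
    print("item assignment needs an int:", exc)

# ... so use a slice assignment instead.
buf[1:2] = b"B"
print(buf)  # bytearray(b'ABc')
```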

Str() and Repr()
----------------

The str() and repr() functions return the same thing for these
objects. The repr() of a bytes object returns a b'...' style literal.
The repr() of a bytearray returns a string of the form "bytearray(b'...')".
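For example (a sketch assuming a released Python 3 interpreter run
without the ``-b`` or ``-bb`` options, which flag ``str()`` calls on
bytes objects):

```python
b = b"abc"
ba = bytearray(b"abc")

print(repr(b))   # b'abc'
print(repr(ba))  # bytearray(b'abc')

# str() gives the same text as repr() for both types.
same_for_bytes = str(b) == repr(b)
same_for_bytearray = str(ba) == repr(ba)
print(same_for_bytes, same_for_bytearray)  # True True
```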

Operators
---------

The following operators are implemented by the bytes and bytearray
types, except where mentioned:

- ``b1 + b2``: concatenation. With mixed bytes/bytearray operands,
  the return type is that of the first argument (this seems arbitrary
  until you consider how ``+=`` works).

- ``b1 += b2``: mutates b1 if it is a bytearray object.

- ``b * n``, ``n * b``: repetition; n must be an integer.

- ``b *= n``: mutates b if it is a bytearray object.

- ``b1 in b2``, ``b1 not in b2``: substring test; b1 can be any
  object implementing the PEP 3118 buffer API.

- ``i in b``, ``i not in b``: single-byte membership test; i must
  be an integer (a length-1 bytes or bytearray object is instead
  treated as a substring test, with the same outcome).

- ``len(b)``: the number of bytes.

- ``hash(b)``: the hash value; only implemented by the bytes type.

Note that the % operator is *not* implemented. It does not appear
worth the complexity.
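The operators listed above can be exercised as follows. This is a
sketch assuming a released Python 3 interpreter; the ``%`` operator is
deliberately left out.

```python
a = b"ab"
m = bytearray(b"cd")

# Mixed concatenation takes the type of the first operand.
assert type(a + m) is bytes
assert type(m + a) is bytearray

# += mutates a bytearray in place ...
alias = m
m += a
assert alias is m and m == bytearray(b"cdab")

# ... but rebinds the name for an immutable bytes object.
old = a
a += b"!"
assert old == b"ab" and a == b"ab!"

# Repetition, including the in-place form on a bytearray.
assert b"x" * 3 == 3 * b"x" == b"xxx"
m *= 2
assert alias is m and len(m) == 8

# Substring test and single-byte membership test.
assert b"da" in m
assert 100 in m   # 100 is the byte value of 'd'
assert b"d" in m  # length-1 operand, same outcome

# Only bytes is hashable.
hash(a)
try:
    hash(m)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False
print(bytearray_hashable)  # False
```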

Methods
-------

The following methods are implemented by bytes as well as bytearray, with
similar semantics. They accept anything that implements the PEP 3118
buffer API for bytes arguments, and return the same type as the object
whose method is called ("self")::

    .capitalize(), .center(), .count(), .decode(), .endswith(),
    .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
    .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
    .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
    .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
    .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
    .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python
2.x, with the exclusion of .encode(). The signatures and semantics
are the same too. However, whenever character classes like letter,
whitespace, lower case are used, the ASCII definitions of these
classes are used. (The Python 2.x str type uses the definitions from
the current locale, settable through the locale module.) The
.encode() method is left out because of the more strict definitions of
encoding and decoding in Python 3000: encoding always takes a Unicode
string and returns a bytes sequence, and decoding always takes a bytes
sequence and returns a Unicode string.

In addition, both types implement the class method ``.fromhex()``,
which constructs an object from a string containing hexadecimal values
(with or without spaces between the bytes).

The bytearray type implements these additional methods from the
MutableSequence ABC (see PEP 3119):

    .extend(), .insert(), .append(), .reverse(), .pop(), .remove().
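A few of these methods in use, as a sketch assuming a released
Python 3 interpreter:

```python
# Results have the type of the object whose method is called.
assert type(b"abc".upper()) is bytes
assert type(bytearray(b"abc").upper()) is bytearray

# Bytes arguments may be any object implementing the buffer API.
parts = b"a,b".split(bytearray(b","))
print(parts)  # [b'a', b'b']

# Character classes follow ASCII only: 0xE9 is not treated as a letter.
assert not b"\xe9".isalpha()
assert b"\xe9".upper() == b"\xe9"

# fromhex() accepts hex digits with or without spaces between bytes.
assert bytes.fromhex("dead beef") == bytes.fromhex("deadbeef")
assert type(bytearray.fromhex("00ff")) is bytearray

# MutableSequence methods exist only on bytearray.
buf = bytearray(b"abc")
buf.append(100)     # bytearray(b'abcd')
buf.extend(b"ef")   # bytearray(b'abcdef')
buf.insert(0, 95)   # bytearray(b'_abcdef')
buf.reverse()       # bytearray(b'fedcba_')
popped = buf.pop()  # removes and returns the last byte, 95
buf.remove(102)     # removes the first byte equal to 102 ('f')
print(buf)          # bytearray(b'edcba')
```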

Bytes and the Str Type
----------------------

Like the bytes type in Python 3.0a1, and unlike the relationship
between str and unicode in Python 2.x, attempts to mix bytes (or
bytearray) objects and str objects without specifying an encoding will
raise a TypeError exception. (However, comparing bytes/bytearray and
str objects for equality will simply return False; see the section on
Comparisons above.)

Conversions between bytes or bytearray objects and str objects must
always be explicit, using an encoding. There are two pairs of
equivalent APIs:
``str(b, <encoding>[, <errors>])`` is equivalent to
``b.decode(<encoding>[, <errors>])``, and
``bytes(s, <encoding>[, <errors>])`` is equivalent to
``s.encode(<encoding>[, <errors>])``.
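The equivalences above can be verified directly. This is a sketch
assuming a released Python 3 interpreter; the sample strings are
arbitrary.

```python
# Mixing bytes and str without an encoding raises TypeError.
try:
    b"abc" + "def"
    mixing_failed = False
except TypeError:
    mixing_failed = True
print(mixing_failed)  # True

text = "naïve"
raw = text.encode("utf-8")

# Encoding: bytes(s, enc) is equivalent to s.encode(enc).
assert bytes(text, "utf-8") == raw

# Decoding: str(b, enc) is equivalent to b.decode(enc).
assert str(raw, "utf-8") == raw.decode("utf-8") == text

# The optional errors argument behaves the same way in both spellings.
bad = b"caf\xe9"
assert str(bad, "ascii", "replace") == bad.decode("ascii", "replace")
```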

There is one exception: we can convert from bytes (or bytearray) to str
without specifying an encoding by writing ``str(b)``. This produces
the same result as ``repr(b)``. This exception is necessary because
of the general promise that *any* object can be printed, and printing
is just a special case of conversion to str. There is however no
promise that printing a bytes object interprets the individual bytes
as characters (unlike in Python 2.x).

The str type currently implements the PEP 3118 buffer API. While this
is perhaps occasionally convenient, it is also potentially confusing,
because the bytes accessed via the buffer API represent a
platform-dependent encoding: depending on the platform byte order and
a compile-time configuration option, the encoding could be UTF-16-BE,
UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation
of the str type might completely change the bytes representation,
e.g. to UTF-8, or even make it impossible to access the data as a
contiguous array of bytes at all. Therefore, the PEP 3118 buffer API
will be removed from the str type.
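In released Python 3 interpreters this removal is visible from Python
code: a str cannot be wrapped in a ``memoryview``, so the caller has
to pick an encoding explicitly. A minimal sketch under that
assumption:

```python
# str no longer exports a buffer, so memoryview() refuses it.
try:
    memoryview("text")
    str_exports_buffer = True
except TypeError:
    str_exports_buffer = False
print(str_exports_buffer)  # False

# Choosing the encoding explicitly makes the byte layout well defined.
view = memoryview("text".encode("utf-16-le"))
print(view.nbytes)  # 8
```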

The ``basestring`` Type
-----------------------

The ``basestring`` type will be removed from the language. Code that
used to say ``isinstance(x, basestring)`` should be changed to use
``isinstance(x, str)`` instead.

Pickling
--------

Left as an exercise for the reader.

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: