315 lines
12 KiB
ReStructuredText
315 lines
12 KiB
ReStructuredText
PEP: 3137
|
||
Title: Immutable Bytes and Mutable Buffer
|
||
Version: $Revision$
|
||
Last-Modified: $Date$
|
||
Author: Guido van Rossum <guido@python.org>
|
||
Status: Final
|
||
Type: Standards Track
|
||
Content-Type: text/x-rst
|
||
Created: 26-Sep-2007
|
||
Python-Version: 3.0
|
||
Post-History: 26-Sep-2007, 30-Sep-2007
|
||
|
||
Introduction
|
||
============
|
||
|
||
After releasing Python 3.0a1 with a mutable bytes type, pressure
|
||
mounted to add a way to represent immutable bytes. Gregory P. Smith
|
||
proposed a patch that would allow making a bytes object temporarily
|
||
immutable by requesting that the data be locked using the new buffer
|
||
API from :pep:`3118`. This did not seem the right approach to me.
|
||
|
||
Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
|
||
make the bytes type immutable (by crudely removing all mutating APIs)
|
||
and fix the fall-out in the test suite. This showed that there aren't
|
||
all that many places that depend on the mutability of bytes, with the
|
||
exception of code that builds up a return value from small pieces.
|
||
|
||
Thinking through the consequences, and noticing that using the array
|
||
module as an ersatz mutable bytes type is far from ideal, and
|
||
recalling a proposal put forward earlier by Talin, I floated the
|
||
suggestion to have both a mutable and an immutable bytes type. (This
|
||
had been brought up before, but until seeing the evidence of Jeffrey's
|
||
patch I wasn't open to the suggestion.)
|
||
|
||
Moreover, a possible implementation strategy became clear: use the old
|
||
PyString implementation, stripped down to remove locale support and
|
||
implicit conversions to/from Unicode, for the immutable bytes type,
|
||
and keep the new PyBytes implementation as the mutable bytes type.
|
||
|
||
The ensuing discussion made it clear that the idea is welcome but
|
||
needs to be specified more precisely. Hence this PEP.
|
||
|
||
Advantages
|
||
==========
|
||
|
||
One advantage of having an immutable bytes type is that code objects
|
||
can use these. It also makes it possible to efficiently create hash
|
||
tables using bytes for keys; this may be useful when parsing protocols
|
||
like HTTP or SMTP which are based on bytes representing text.
|
||
|
||
Porting code that manipulates binary data (or encoded text) in Python
|
||
2.x will be easier using the new design than using the original 3.0
|
||
design with mutable bytes; simply replace ``str`` with ``bytes`` and
|
||
change '...' literals into b'...' literals.
|
||
|
||
Naming
|
||
======
|
||
|
||
I propose the following type names at the Python level:
|
||
|
||
- ``bytes`` is an immutable array of bytes (PyString)
|
||
|
||
- ``bytearray`` is a mutable array of bytes (PyBytes)
|
||
|
||
- ``memoryview`` is a bytes view on another object (PyMemory)
|
||
|
||
The old type named ``buffer`` is so similar to the new type
|
||
``memoryview``, introduce by :pep:`3118`, that it is redundant. The rest
|
||
of this PEP doesn't discuss the functionality of ``memoryview``; it is
|
||
just mentioned here to justify getting rid of the old ``buffer`` type.
|
||
(An earlier version of this PEP proposed ``buffer`` as the new name
|
||
for PyBytes; in the end this name was deemed to confusing given the
|
||
many other uses of the word buffer.)
|
||
|
||
While eventually it makes sense to change the C API names, this PEP
|
||
maintains the old C API names, which should be familiar to all.
|
||
|
||
Summary
|
||
-------
|
||
|
||
Here's a simple ASCII-art table summarizing the type names in various
|
||
Python versions::
|
||
|
||
+--------------+-------------+------------+--------------------------+
|
||
| C name | 2.x repr | 3.0a1 repr | 3.0a2 repr |
|
||
+--------------+-------------+------------+--------------------------+
|
||
| PyUnicode | unicode u'' | str '' | str '' |
|
||
| PyString | str '' | str8 s'' | bytes b'' |
|
||
| PyBytes | N/A | bytes b'' | bytearray bytearray(b'') |
|
||
| PyBuffer | buffer | buffer | N/A |
|
||
| PyMemoryView | N/A | memoryview | memoryview <...> |
|
||
+--------------+-------------+------------+--------------------------+
|
||
|
||
Literal Notations
|
||
=================
|
||
|
||
The b'...' notation introduced in Python 3.0a1 returns an immutable
|
||
bytes object, whatever variation is used. To create a mutable array
|
||
of bytes, use bytearray(b'...') or bytearray([...]). The latter form
|
||
takes a list of integers in range(256).
|
||
|
||
Functionality
|
||
=============
|
||
|
||
PEP 3118 Buffer API
|
||
-------------------
|
||
|
||
Both bytes and bytearray implement the :pep:`3118` buffer API. The bytes
|
||
type only implements read-only requests; the bytearray type allows
|
||
writable and data-locked requests as well. The element data type is
|
||
always 'B' (i.e. unsigned byte).
|
||
|
||
Constructors
|
||
------------
|
||
|
||
There are four forms of constructors, applicable to both bytes and
|
||
bytearray:
|
||
|
||
- ``bytes(<bytes>)``, ``bytes(<bytearray>)``, ``bytearray(<bytes>)``,
|
||
``bytearray(<bytearray>)``: simple copying constructors, with the
|
||
note that ``bytes(<bytes>)`` might return its (immutable)
|
||
argument, but ``bytearray(<bytearray>)`` always makes a copy.
|
||
|
||
- ``bytes(<str>, <encoding>[, <errors>])``, ``bytearray(<str>,
|
||
<encoding>[, <errors>])``: encode a text string. Note that the
|
||
``str.encode()`` method returns an *immutable* bytes object. The
|
||
<encoding> argument is mandatory; <errors> is optional.
|
||
<encoding> and <errors>, if given, must be ``str`` instances.
|
||
|
||
- ``bytes(<memory view>)``, ``bytearray(<memory view>)``: construct
|
||
a bytes or bytearray object from anything that implements the PEP
|
||
3118 buffer API.
|
||
|
||
- ``bytes(<iterable of ints>)``, ``bytearray(<iterable of ints>)``:
|
||
construct a bytes or bytearray object from a stream of integers in
|
||
range(256).
|
||
|
||
- ``bytes(<int>)``, ``bytearray(<int>)``: construct a
|
||
zero-initialized bytes or bytearray object of a given length.
|
||
|
||
Comparisons
|
||
-----------
|
||
|
||
The bytes and bytearray types are comparable with each other and
|
||
orderable, so that e.g. b'abc' == bytearray(b'abc') < b'abd'.
|
||
|
||
Comparing either type to a str object for equality returns False
|
||
regardless of the contents of either operand. Ordering comparisons
|
||
with str raise TypeError. This is all conformant to the standard
|
||
rules for comparison and ordering between objects of incompatible
|
||
types.
|
||
|
||
(**Note:** in Python 3.0a1, comparing a bytes instance with a str
|
||
instance would raise TypeError, on the premise that this would catch
|
||
the occasional mistake quicker, especially in code ported from Python
|
||
2.x. However, a long discussion on the python-3000 list pointed out
|
||
so many problems with this that it is clearly a bad idea, to be rolled
|
||
back in 3.0a2 regardless of the fate of the rest of this PEP.)
|
||
|
||
Slicing
|
||
-------
|
||
|
||
Slicing a bytes object returns a bytes object. Slicing a bytearray
|
||
object returns a bytearray object.
|
||
|
||
Slice assignment to a bytearray object accepts anything that
|
||
implements the :pep:`3118` buffer API, or an iterable of integers in
|
||
range(256).
|
||
|
||
Indexing
|
||
--------
|
||
|
||
Indexing bytes and bytearray returns small ints (like the bytes type in
|
||
3.0a1, and like lists or array.array('B')).
|
||
|
||
Assignment to an item of a bytearray object accepts an int in
|
||
range(256). (To assign from a bytes sequence, use a slice
|
||
assignment.)
|
||
|
||
Str() and Repr()
|
||
----------------
|
||
|
||
The str() and repr() functions return the same thing for these
|
||
objects. The repr() of a bytes object returns a b'...' style literal.
|
||
The repr() of a bytearray returns a string of the form "bytearray(b'...')".
|
||
|
||
Operators
|
||
---------
|
||
|
||
The following operators are implemented by the bytes and bytearray
|
||
types, except where mentioned:
|
||
|
||
- ``b1 + b2``: concatenation. With mixed bytes/bytearray operands,
|
||
the return type is that of the first argument (this seems arbitrary
|
||
until you consider how ``+=`` works).
|
||
|
||
- ``b1 += b2``: mutates b1 if it is a bytearray object.
|
||
|
||
- ``b * n``, ``n * b``: repetition; n must be an integer.
|
||
|
||
- ``b *= n``: mutates b if it is a bytearray object.
|
||
|
||
- ``b1 in b2``, ``b1 not in b2``: substring test; b1 can be any
|
||
object implementing the :pep:`3118` buffer API.
|
||
|
||
- ``i in b``, ``i not in b``: single-byte membership test; i must
|
||
be an integer (if it is a length-1 bytes array, it is considered
|
||
to be a substring test, with the same outcome).
|
||
|
||
- ``len(b)``: the number of bytes.
|
||
|
||
- ``hash(b)``: the hash value; only implemented by the bytes type.
|
||
|
||
Note that the % operator is *not* implemented. It does not appear
|
||
worth the complexity.
|
||
|
||
Methods
|
||
-------
|
||
|
||
The following methods are implemented by bytes as well as bytearray, with
|
||
similar semantics. They accept anything that implements the :pep:`3118`
|
||
buffer API for bytes arguments, and return the same type as the object
|
||
whose method is called ("self")::
|
||
|
||
.capitalize(), .center(), .count(), .decode(), .endswith(),
|
||
.expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
|
||
.islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
|
||
.lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
|
||
.rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
|
||
.splitlines(), .startswith(), .strip(), .swapcase(), .title(),
|
||
.translate(), .upper(), .zfill()
|
||
|
||
This is exactly the set of methods present on the str type in Python
|
||
2.x, with the exclusion of .encode(). The signatures and semantics
|
||
are the same too. However, whenever character classes like letter,
|
||
whitespace, lower case are used, the ASCII definitions of these
|
||
classes are used. (The Python 2.x str type uses the definitions from
|
||
the current locale, settable through the locale module.) The
|
||
.encode() method is left out because of the more strict definitions of
|
||
encoding and decoding in Python 3000: encoding always takes a Unicode
|
||
string and returns a bytes sequence, and decoding always takes a bytes
|
||
sequence and returns a Unicode string.
|
||
|
||
In addition, both types implement the class method ``.fromhex()``,
|
||
which constructs an object from a string containing hexadecimal values
|
||
(with or without spaces between the bytes).
|
||
|
||
The bytearray type implements these additional methods from the
|
||
MutableSequence ABC (see :pep:`3119`):
|
||
|
||
.extend(), .insert(), .append(), .reverse(), .pop(), .remove().
|
||
|
||
Bytes and the Str Type
|
||
----------------------
|
||
|
||
Like the bytes type in Python 3.0a1, and unlike the relationship
|
||
between str and unicode in Python 2.x, attempts to mix bytes (or
|
||
bytearray) objects and str objects without specifying an encoding will
|
||
raise a TypeError exception. (However, comparing bytes/bytearray and
|
||
str objects for equality will simply return False; see the section on
|
||
Comparisons above.)
|
||
|
||
Conversions between bytes or bytearray objects and str objects must
|
||
always be explicit, using an encoding. There are two equivalent APIs:
|
||
``str(b, <encoding>[, <errors>])`` is equivalent to
|
||
``b.decode(<encoding>[, <errors>])``, and
|
||
``bytes(s, <encoding>[, <errors>])`` is equivalent to
|
||
``s.encode(<encoding>[, <errors>])``.
|
||
|
||
There is one exception: we can convert from bytes (or bytearray) to str
|
||
without specifying an encoding by writing ``str(b)``. This produces
|
||
the same result as ``repr(b)``. This exception is necessary because
|
||
of the general promise that *any* object can be printed, and printing
|
||
is just a special case of conversion to str. There is however no
|
||
promise that printing a bytes object interprets the individual bytes
|
||
as characters (unlike in Python 2.x).
|
||
|
||
The str type currently implements the :pep:`3118` buffer API. While this
|
||
is perhaps occasionally convenient, it is also potentially confusing,
|
||
because the bytes accessed via the buffer API represent a
|
||
platform-depending encoding: depending on the platform byte order and
|
||
a compile-time configuration option, the encoding could be UTF-16-BE,
|
||
UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation
|
||
of the str type might completely change the bytes representation,
|
||
e.g. to UTF-8, or even make it impossible to access the data as a
|
||
contiguous array of bytes at all. Therefore, the :pep:`3118` buffer API
|
||
will be removed from the str type.
|
||
|
||
The ``basestring`` Type
|
||
-----------------------
|
||
|
||
The ``basestring`` type will be removed from the language. Code that
|
||
used to say ``isinstance(x, basestring)`` should be changed to use
|
||
``isinstance(x, str)`` instead.
|
||
|
||
Pickling
|
||
--------
|
||
|
||
Left as an exercise for the reader.
|
||
|
||
Copyright
|
||
=========
|
||
|
||
This document has been placed in the public domain.
|
||
|
||
|
||
..
|
||
Local Variables:
|
||
mode: indented-text
|
||
indent-tabs-mode: nil
|
||
sentence-end-double-space: t
|
||
fill-column: 70
|
||
coding: utf-8
|
||
End:
|