PEP 3137: immutable bytes and mutable buffer.
parent 1ea1be4b61
commit 8eb885a264
@@ -96,6 +96,7 @@ Index by Category
S 3116 New I/O Stutzbach, Verdone, GvR
S 3134 Exception Chaining and Embedded Tracebacks Yee
S 3135 New Super Spealman, Delaney
S 3137 Immutable Bytes and Mutable Buffer GvR
S 3141 A Type Hierarchy for Numbers Yasskin

Finished PEPs (done, implemented in Subversion)
@@ -509,6 +510,7 @@ Numerical Index
S 3134 Exception Chaining and Embedded Tracebacks Yee
S 3135 New Super Spealman, Delaney
SR 3136 Labeled break and continue Chisholm
S 3137 Immutable Bytes and Mutable Buffer GvR
S 3141 A Type Hierarchy for Numbers Yasskin
@@ -0,0 +1,239 @@
PEP: 3137
Title: Immutable Bytes and Mutable Buffer
Version: $Revision$
Last-Modified: $Date$
Author: Guido van Rossum <guido@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-May-2007
Python-Version: 3.0
Post-History: 26-May-2007

Introduction
============

After releasing Python 3.0a1 with a mutable bytes type, pressure
mounted to add a way to represent immutable bytes.  Gregory P. Smith
proposed a patch that would allow making a bytes object temporarily
immutable by requesting that the data be locked using the new buffer
API from PEP 3118.  This did not seem the right approach to me.

Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
make the bytes type immutable (by crudely removing all mutating APIs)
and fix the fall-out in the test suite.  This showed that there aren't
all that many places that depend on the mutability of bytes, with the
exception of code that builds up a return value from small pieces.

Thinking through the consequences, and noticing that using the array
module as an ersatz mutable bytes type is far from ideal, and
recalling a proposal put forward earlier by Talin, I floated the
suggestion to have both a mutable and an immutable bytes type.  (This
had been brought up before, but until seeing the evidence of Jeffrey's
patch I wasn't open to the suggestion.)

Moreover, a possible implementation strategy became clear: use the old
PyString implementation, stripped down to remove locale support and
implicit conversions to/from Unicode, for the immutable bytes type,
and keep the new PyBytes implementation as the mutable bytes type.

The ensuing discussion made it clear that the idea is welcome but
needs to be specified more precisely.  Hence this PEP.
|
||||
|
||||
Advantages
==========

One advantage of having an immutable bytes type is that code objects
can use these.  It also makes it possible to efficiently create hash
tables using bytes for keys; this may be useful when parsing protocols
like HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in Python
2.x will be easier using the new design than using the original 3.0
design with mutable bytes; simply replace ``str`` with ``bytes`` and
change '...' literals into b'...' literals.
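As a minimal sketch of the hash-table point, assuming a modern Python 3
interpreter (where the mutable type proposed here as ``buffer`` is
spelled ``bytearray``), the hypothetical ``parse_headers`` helper below
keys a dict by immutable bytes.  The helper name and the sample input
are illustrative only and are not part of this PEP.

```python
def parse_headers(raw):
    """Hypothetical helper: map lower-cased header names to values, all as bytes."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if not line:
            continue  # skip the empty piece after the final CRLF
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_headers(b"Host: example.org\r\nContent-Length: 42\r\n")
host = headers[b"host"]

# The mutable type cannot serve as a dict key, since it is unhashable.
try:
    hash(bytearray(b"host"))
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False
```

Lookups use ``b'...'`` literals directly, with no decoding step between
the wire format and the table.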
|
||||
|
||||
Naming
======

I propose the following type names at the Python level:

- ``bytes`` is an immutable array of bytes (PyString)

- ``buffer`` is a mutable array of bytes (PyBytes)

- ``memoryview`` is a bytes view on another object (PyMemory)

The old type named ``buffer`` is so similar to the new type
``memoryview``, introduced by PEP 3118, that it is redundant.  The rest
of this PEP doesn't discuss the functionality of ``memoryview``; it is
just mentioned here to justify getting rid of the old ``buffer`` type
so we can reuse its name for the mutable bytes type.

While eventually it makes sense to change the C API names, this PEP
maintains the old C API names, which should be familiar to all.

Literal Notations
=================

The b'...' notation introduced in Python 3.0a1 returns an immutable
bytes object, whatever variation is used.  To create a mutable bytes
buffer object, use buffer(b'...') or buffer([...]).  The latter may
use a list of integers in range(256).
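The two spellings can be sketched as follows.  This assumes a modern
Python 3 interpreter, where the mutable type proposed here as
``buffer`` is available under the name ``bytearray``; the variable
names are illustrative.

```python
immutable = b"abc"                       # literal: always immutable bytes
from_literal = bytearray(b"abc")         # mutable copy of a bytes literal
from_ints = bytearray([97, 98, 99])      # mutable, from integers in range(256)

# Only the mutable object supports item assignment.
from_literal[0] = ord("X")

# Integers outside range(256) are rejected.
try:
    bytearray([256])
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False
```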
|
||||
|
||||
Functionality
=============

PEP 3118 Buffer API
-------------------

Both bytes and buffer support the PEP 3118 buffer API.  The bytes type
only supports read-only requests; the buffer type allows writable and
data-locked requests as well.  The element data type is always 'B'
(i.e. unsigned byte).
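The read-only versus writable distinction can be observed from Python
through ``memoryview``.  The sketch below assumes a modern Python 3
interpreter with ``bytearray`` standing in for the proposed ``buffer``
type; it does not attempt to show data-locked requests.

```python
source = bytearray(b"abc")
read_only_view = memoryview(b"abc")      # bytes exports a read-only buffer
writable_view = memoryview(source)       # the mutable type exports a writable one

# Writing through the writable view changes the underlying object.
writable_view[0] = 65

# Writing through the read-only view is refused.
try:
    read_only_view[0] = 65
    read_only_write_allowed = True
except TypeError:
    read_only_write_allowed = False
```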
|
||||
|
||||
Constructors
------------

There are four forms of constructors applicable to both bytes and
buffer, plus a fifth form specific to buffer:

- ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``,
  ``buffer(<buffer>)``: simple copying constructors, with the note
  that ``bytes(<bytes>)`` might return its (immutable) argument.

- ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>,
  <encoding>[, <errors>])``: encode a text string.  Note that the
  ``str.encode()`` method returns an *immutable* bytes object.
  The <encoding> argument is mandatory; <errors> is optional.

- ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a
  bytes or buffer object from anything that supports the PEP 3118
  buffer API.

- ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``:
  construct an immutable bytes or mutable buffer object from a
  stream of integers in range(256).

- ``buffer(<int>)``: construct a zero-initialized buffer of a given
  length.
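The constructor forms listed above can be sketched as follows,
assuming a modern Python 3 interpreter in which ``bytearray`` plays the
role of the proposed ``buffer`` type.  All variable names are
illustrative.

```python
original = b"abc"

# Copying constructors.
bytes_copy = bytes(original)
mutable_copy = bytearray(original)

# Encoding constructors: the encoding is mandatory, errors is optional.
encoded = bytes("h\u00e9llo", "utf-8")
encoded_mutable = bytearray("h\u00e9llo", "utf-8", "strict")

# From any object supporting the buffer API.
from_view = bytes(memoryview(b"xyz"))

# From an iterable of integers in range(256).
from_ints = bytes(range(65, 70))

# Zero-initialized mutable object of a given length.
zeros = bytearray(4)

# Omitting the encoding for a text string is an error.
try:
    bytes("abc")
    encoding_optional = True
except TypeError:
    encoding_optional = False
```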
|
||||
|
||||
Comparisons
-----------

The bytes and buffer types are comparable with each other and
orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.

Comparing either type to a str object raises an exception.  This
turned out to be necessary to catch common mistakes.
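A short sketch of the mixed comparison, assuming a modern Python 3
interpreter with ``bytearray`` standing in for ``buffer``.  Only the
ordering case is shown for str, because released interpreters do not
necessarily raise for equality tests against str the way this draft
specifies.

```python
immutable = b"abc"
mutable = bytearray(b"abc")

# Mixed equality and ordering, as in the example from the text.
chained = immutable == mutable < b"abd"

# Ordering against a text string is rejected.
try:
    immutable < "abd"
    ordering_with_str_allowed = True
except TypeError:
    ordering_with_str_allowed = False
```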
|
||||
|
||||
Slicing
-------

Slicing a bytes object returns a bytes object.  Slicing a buffer
object returns a buffer object.

Slice assignment to a mutable buffer object accepts anything that
supports the PEP 3118 buffer API, or an iterable of integers in
range(256).
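The slicing rules can be illustrated with the sketch below, which
assumes a modern Python 3 interpreter and uses ``bytearray`` in place
of the proposed ``buffer`` type.

```python
data = b"hello world"
buf = bytearray(data)

head = data[:5]      # slicing bytes gives bytes
tail = buf[6:]       # slicing the mutable type gives the mutable type

# Slice assignment from objects supporting the buffer API.
buf[0:5] = b"HELLO"
buf[6:] = memoryview(b"there")
after_buffer_api = bytes(buf)

# Slice assignment from an iterable of integers in range(256).
buf[-5:] = [87, 79, 82, 76, 68]
```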
|
||||
|
||||
Indexing
--------

**Open Issue:** I'm undecided on whether indexing bytes and buffer
objects should return small ints (like the bytes type in 3.0a1, and
like lists or array.array('B')), or bytes/buffer objects of length 1
(like the str type).  The latter (str-like) approach will ease porting
code from Python 2.x; but it makes it harder to extract values from a
bytes array.

Assignment to an item of a mutable buffer object accepts an int in
range(256); if we choose the str-like approach for indexing above, it
also accepts an object implementing the PEP 3118 buffer API, if it has
length 1.
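For comparison, the sketch below shows the small-int alternative as it
behaves in a modern Python 3 interpreter, with ``bytearray`` standing
in for ``buffer``.  It illustrates one of the two options under
discussion and is not a statement of what this draft decides.

```python
data = b"abc"
buf = bytearray(data)

first = data[0]          # int-style indexing: a small integer
first_slice = data[0:1]  # a length-1 bytes object is still available by slicing

# Item assignment takes an int in range(256).
buf[0] = 65

# Values outside range(256) are rejected.
try:
    buf[0] = 256
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False
```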
|
||||
|
||||
Str() and Repr()
----------------

The str() and repr() functions return the same thing for these
objects.  The repr() of a bytes object returns a b'...' style literal.
The repr() of a buffer returns a string of the form "buffer(b'...')".
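A brief sketch, assuming a modern Python 3 interpreter run without the
``-b`` command-line option.  Because the mutable type is named
``bytearray`` there, its repr reads ``bytearray(b'...')`` rather than
the ``buffer(b'...')`` form given above.

```python
immutable = b"abc"
mutable = bytearray(b"abc")

immutable_repr = repr(immutable)
mutable_repr = repr(mutable)

# str() and repr() agree for both types.
same_for_immutable = str(immutable) == repr(immutable)
same_for_mutable = str(mutable) == repr(mutable)
```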
|
||||
|
||||
Methods
-------

The following methods are supported by bytes as well as buffer, with
similar semantics.  They accept anything that implements the PEP 3118
buffer API for bytes arguments, and return the same type as the object
whose method is called ("self")::

    .capitalize(), .center(), .count(), .decode(), .endswith(),
    .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
    .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
    .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
    .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
    .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
    .translate(), .upper(), .zfill()

This is exactly the set of methods present on the str type in Python
2.x, with the exclusion of .encode().  The signatures and semantics
are the same too.  However, whenever character classes like letter,
whitespace, lower case are used, the ASCII definitions of these
classes are used.  (The Python 2.x str type uses the definitions from
the current locale, settable through the locale module.)  The
.encode() method is left out because of the more strict definitions of
encoding and decoding in Python 3000: encoding always takes a Unicode
string and returns a bytes sequence, and decoding always takes a bytes
sequence and returns a Unicode string.
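The "same type as self" rule and the ASCII-only character classes can
be sketched as follows, assuming a modern Python 3 interpreter with
``bytearray`` in place of the proposed ``buffer`` type.

```python
word = b"hello world"
buf = bytearray(word)

title_immutable = word.title()   # bytes in, bytes out
title_mutable = buf.title()      # mutable in, mutable out
parts = buf.split(b" ")          # each piece has the type of self

# Arguments may be any object supporting the buffer API.
replaced = word.replace(bytearray(b"world"), b"there")

# Character classes are ASCII-only: byte 0xE9 (e-acute in Latin-1)
# is neither alphabetic nor affected by case conversion.
e_acute = "\u00e9".encode("latin-1")
is_alpha = e_acute.isalpha()
upper = e_acute.upper()
```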
|
||||
|
||||
Bytes and the Str Type
----------------------

Like the bytes type in Python 3.0a1, and unlike the relationship
between str and unicode in Python 2.x, any attempt to mix bytes (or
buffer) objects and str objects without specifying an encoding will
raise a TypeError exception.  This is the case even for simply
comparing a bytes or buffer object to a str object (even violating the
general rule that comparing objects of different types for equality
should just return False).

Conversions between bytes or buffer objects and str objects must
always be explicit, using an encoding.  There are two equivalent APIs:
``str(b, <encoding>[, <errors>])`` is equivalent to
``b.decode(<encoding>[, <errors>])``, and
``bytes(s, <encoding>[, <errors>])`` is equivalent to
``s.encode(<encoding>[, <errors>])``.

There is one exception: we can convert from bytes (or buffer) to str
without specifying an encoding by writing ``str(b)``.  This produces
the same result as ``repr(b)``.  This exception is necessary because
of the general promise that *any* object can be printed, and printing
is just a special case of conversion to str.  There is however no
promise that printing a bytes object interprets the individual bytes
as characters (unlike in Python 2.x).

The str type currently supports the PEP 3118 buffer API.  While this is
perhaps occasionally convenient, it is also potentially confusing,
because the bytes accessed via the buffer API represent a
platform-dependent encoding: depending on the platform byte order and
a compile-time configuration option, the encoding could be UTF-16-BE,
UTF-16-LE, UTF-32-BE, or UTF-32-LE.  Worse, a different implementation
of the str type might completely change the bytes representation,
e.g. to UTF-8, or even make it impossible to access the data as a
contiguous array of bytes at all.  Therefore, support for the PEP 3118
buffer API will be removed from the str type.
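The explicit-conversion rules of this section can be sketched as
follows, assuming a modern Python 3 interpreter.  The sample text and
variable names are illustrative.

```python
text = "caf\u00e9"

# Text to bytes: the constructor form and the method form agree.
raw = bytes(text, "utf-8")
raw_via_method = text.encode("utf-8")

# Bytes to text: again two equivalent spellings.
roundtrip = str(raw, "utf-8")
roundtrip_via_method = raw.decode("utf-8")

# Mixing the two kinds of object without an encoding is an error.
try:
    b"abc" + "def"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# A text string does not export the buffer API.
try:
    memoryview(text)
    str_has_buffer_api = True
except TypeError:
    str_has_buffer_api = False
```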
|
||||
|
||||
Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: