PEP 3137: immutable bytes and mutable buffer.

2007-09-26 21:55:16 +00:00 · 2007-09-26 21:55:16 +00:00 · 8eb885a264
parent 1ea1be4b61
commit 8eb885a264
2 changed files with 241 additions and 0 deletions
--- a/pep-0000.txt
+++ b/pep-0000.txt
@@ -96,6 +96,7 @@ Index by Category
 S  3116  New I/O                                      Stutzbach, Verdone, GvR
 S  3134  Exception Chaining and Embedded Tracebacks   Yee
 S  3135  New Super                                    Spealman, Delaney
+ S  3137  Immutable Bytes and Mutable Buffer           GvR
 S  3141  A Type Hierarchy for Numbers                 Yasskin

 Finished PEPs (done, implemented in Subversion)
@@ -509,6 +510,7 @@ Numerical Index
 S  3134  Exception Chaining and Embedded Tracebacks   Yee
 S  3135  New Super                                    Spealman, Delaney
 SR 3136  Labeled break and continue                   Chisholm
+ S  3137  Immutable Bytes and Mutable Buffer           GvR
 S  3141  A Type Hierarchy for Numbers                 Yasskin


--- a/pep-3137.txt
+++ b/pep-3137.txt
@@ -0,0 +1,239 @@
+PEP: 3137
+Title: Immutable Bytes and Mutable Buffer
+Version: $Revision$
+Last-Modified: $Date$
+Author: Guido van Rossum <guido@python.org>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 26-Sep-2007
+Python-Version: 3.0
+Post-History: 26-Sep-2007
+
+Introduction
+============
+
+After releasing Python 3.0a1 with a mutable bytes type, pressure
+mounted to add a way to represent immutable bytes.  Gregory P. Smith
+proposed a patch that would allow making a bytes object temporarily
+immutable by requesting that the data be locked using the new buffer
+API from PEP 3118.  This did not seem the right approach to me.
+
+Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
+make the bytes type immutable (by crudely removing all mutating APIs)
+and fix the fall-out in the test suite.  This showed that there aren't
+all that many places that depend on the mutability of bytes, with the
+exception of code that builds up a return value from small pieces.
+
+Thinking through the consequences, and noticing that using the array
+module as an ersatz mutable bytes type is far from ideal, and
+recalling a proposal put forward earlier by Talin, I floated the
+suggestion to have both a mutable and an immutable bytes type.  (This
+had been brought up before, but until seeing the evidence of Jeffrey's
+patch I wasn't open to the suggestion.)
+
+Moreover, a possible implementation strategy became clear: use the old
+PyString implementation, stripped down to remove locale support and
+implicit conversions to/from Unicode, for the immutable bytes type,
+and keep the new PyBytes implementation as the mutable bytes type.
+
+The ensuing discussion made it clear that the idea is welcome but
+needs to be specified more precisely.  Hence this PEP.
+
+Advantages
+==========
+
+One advantage of having an immutable bytes type is that code objects
+can use these.  It also makes it possible to efficiently create hash
+tables using bytes for keys; this may be useful when parsing protocols
+like HTTP or SMTP which are based on bytes representing text.
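
The hash-table use case can be sketched on released Python 3. This is an illustration under stated assumptions, not part of the proposal: ``bytearray`` (the name the mutable type eventually shipped under) stands in for this draft's ``buffer``, and the header lines are made-up sample data.

```python
# Immutable bytes are hashable, so they can be used directly as dict keys.
# The header lines below are invented sample data.
raw_headers = [b"Host: example.org", b"Content-Length: 42"]

headers = {}
for line in raw_headers:
    name, _, value = line.partition(b": ")
    headers[name.lower()] = value

# The mutable type is not hashable ("bytearray" stands in for "buffer").
try:
    hash(bytearray(b"host"))
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False
```

Looking up ``headers[b"host"]`` then works without decoding the key to text first.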
+
+Porting code that manipulates binary data (or encoded text) in Python
+2.x will be easier using the new design than using the original 3.0
+design with mutable bytes; simply replace ``str`` with ``bytes`` and
+change '...' literals into b'...' literals.
+
+Naming
+======
+
+I propose the following type names at the Python level:
+
+  - ``bytes`` is an immutable array of bytes (PyString)
+
+  - ``buffer`` is a mutable array of bytes (PyBytes)
+
+  - ``memoryview`` is a bytes view on another object (PyMemory)
+
+The old type named ``buffer`` is so similar to the new type
+``memoryview``, introduced by PEP 3118, that it is redundant.  The rest
+of this PEP doesn't discuss the functionality of ``memoryview``; it is
+just mentioned here to justify getting rid of the old ``buffer`` type
+so we can reuse its name for the mutable bytes type.
+
+While eventually it makes sense to change the C API names, this PEP
+maintains the old C API names, which should be familiar to all.
+
+Literal Notations
+=================
+
+The b'...' notation introduced in Python 3.0a1 returns an immutable
+bytes object, whatever variation is used.  To create a mutable bytes
+buffer object, use buffer(b'...') or buffer([...]).  The latter may
+use a list of integers in range(256).
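
A short sketch of the two notations, runnable on released Python 3. It assumes ``bytearray`` as a stand-in for the ``buffer`` name used in this draft; the variable names are illustrative.

```python
immutable = b"abc"                    # a b'...' literal is immutable bytes
from_literal = bytearray(b"abc")      # mutable copy of a bytes literal
from_ints = bytearray([97, 98, 99])   # mutable, from ints in range(256)

from_literal[0] = ord("x")            # in-place mutation is allowed here

try:
    immutable[0] = ord("x")           # item assignment is rejected on bytes
    bytes_item_assignment_ok = True
except TypeError:
    bytes_item_assignment_ok = False
```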
+
+Functionality
+=============
+
+PEP 3118 Buffer API
+-------------------
+
+Both bytes and buffer support the PEP 3118 buffer API.  The bytes type
+only supports read-only requests; the buffer type allows writable and
+data-locked requests as well.  The element data type is always 'B'
+(i.e. unsigned byte).
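
The read-only versus writable distinction can be observed through ``memoryview`` on released Python 3. This is a sketch that assumes ``bytearray`` as a stand-in for this draft's ``buffer``; data-locked requests are not shown.

```python
ro_view = memoryview(b"abc")               # bytes: read-only export
rw_view = memoryview(bytearray(b"abc"))    # mutable type: writable export

formats = (ro_view.format, rw_view.format)            # element type codes
readonly_flags = (ro_view.readonly, rw_view.readonly)

# Writing through a view changes the underlying mutable object.
backing = bytearray(b"abc")
with memoryview(backing) as view:
    view[0] = ord("x")
```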
+
+Constructors
+------------
+
+There are five forms of constructors.  The first four apply to both
+bytes and buffer; the last applies to buffer only:
+
+  - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``,
+    ``buffer(<buffer>)``: simple copying constructors, with the note
+    that ``bytes(<bytes>)`` might return its (immutable) argument.
+
+  - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>,
+    <encoding>[, <errors>])``: encode a text string.  Note that the
+    ``str.encode()`` method returns an *immutable* bytes object.
+    The <encoding> argument is mandatory; <errors> is optional.
+
+  - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a
+    bytes or buffer object from anything that supports the PEP 3118
+    buffer API.
+
+  - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``:
+    construct an immutable bytes or mutable buffer object from a
+    stream of integers in range(256).
+
+  - ``buffer(<int>)``: construct a zero-initialized buffer of a given
+    length.
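
The constructor forms listed above can be exercised on released Python 3. This is a sketch under the assumption that ``bytearray`` stands in for this draft's ``buffer``; the sample values are arbitrary.

```python
text = "caf\u00e9"

copy_bytes = bytes(bytearray(b"abc"))           # copying constructor
copy_buffer = bytearray(b"abc")                 # copying constructor
encoded = bytes(text, "utf-8")                  # encode a text string
encoded_mutable = bytearray(text, "utf-8", "strict")  # with explicit errors
from_view = bytes(memoryview(b"abcdef")[2:4])   # from a buffer API object
from_ints = bytes(range(65, 70))                # from an iterable of ints
zeroed = bytearray(4)                           # zero-initialized, length 4

# The encoding argument is mandatory when the source is a str.
try:
    bytes(text)
    missing_encoding_accepted = True
except TypeError:
    missing_encoding_accepted = False
```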
+
+Comparisons
+-----------
+
+The bytes and buffer types are comparable with each other and
+orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.
+
+Comparing either type to a str object raises an exception.  This
+turned out to be necessary to catch common mistakes.
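
A sketch of the comparison rules on released Python 3, assuming ``bytearray`` as a stand-in for this draft's ``buffer``. Only the ordering comparison against str is shown raising: in released Python 3 an equality test between bytes and str returns False by default (the ``-b`` command-line option turns it into a warning), which differs from what this draft specifies.

```python
# Mixed bytes/bytearray comparisons, including chained ordering.
chained = b"abc" == bytearray(b"abc") < b"abd"

# Ordering bytes against str raises TypeError.
try:
    b"abc" < "abc"
    ordering_raises = False
except TypeError:
    ordering_raises = True
```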
+
+Slicing
+-------
+
+Slicing a bytes object returns a bytes object.  Slicing a buffer
+object returns a buffer object.
+
+Slice assignment to a mutable buffer object accepts anything that
+supports the PEP 3118 buffer API, or an iterable of integers in
+range(256).
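
Slicing and slice assignment can be sketched as follows on released Python 3, with ``bytearray`` assumed as a stand-in for this draft's ``buffer``.

```python
data = b"hello world"
buf = bytearray(data)

bytes_slice = data[:5]            # slicing bytes gives bytes
buffer_slice = buf[:5]            # slicing the mutable type gives a copy

buf[:5] = b"HELLO"                # assign from bytes
buf[6:] = memoryview(b"there")    # assign from a buffer API object
buf[5:6] = [45, 45]               # assign from ints; length may change
```

After these assignments ``buf`` holds ``HELLO--there``, while ``buffer_slice`` still holds the original ``hello`` because it was copied before the mutation.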
+
+Indexing
+--------
+
+**Open Issue:** I'm undecided on whether indexing bytes and buffer
+objects should return small ints (like the bytes type in 3.0a1, and
+like lists or array.array('B')), or bytes/buffer objects of length 1
+(like the str type).  The latter (str-like) approach will ease porting
+code from Python 2.x; but it makes it harder to extract values from a
+bytes array.
+
+Assignment to an item of a mutable buffer object accepts an int in
+range(256); if we choose the str-like approach for indexing above, it
+also accepts an object implementing the PEP 3118 buffer API, if it has
+length 1.
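
For comparison, released Python 3 settled on the small-int behavior; a length-1 slice gives the str-like result. The sketch below shows both, assuming ``bytearray`` as a stand-in for this draft's ``buffer``. It does not resolve the open issue as stated in this draft.

```python
data = b"abc"
buf = bytearray(data)

as_int = data[0]        # indexing yields an int
as_bytes = data[0:1]    # a length-1 slice yields a bytes object

buf[0] = 120            # item assignment takes an int in range(256)

try:
    buf[1] = 256        # outside range(256)
    out_of_range_accepted = True
except ValueError:
    out_of_range_accepted = False
```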
+
+Str() and Repr()
+----------------
+
+The str() and repr() functions return the same thing for these
+objects.  The repr() of a bytes object returns a b'...' style literal.
+The repr() of a buffer returns a string of the form "buffer(b'...')".
+
+Methods
+-------
+
+The following methods are supported by bytes as well as buffer, with
+similar semantics.  They accept anything that implements the PEP 3118
+buffer API for bytes arguments, and return the same type as the object
+whose method is called ("self")::
+
+  .capitalize(), .center(), .count(), .decode(), .endswith(),
+  .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
+  .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
+  .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
+  .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
+  .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
+  .translate(), .upper(), .zfill()
+
+This is exactly the set of methods present on the str type in Python
+2.x, with the exclusion of .encode().  The signatures and semantics
+are the same too.  However, whenever character classes like letter,
+whitespace, lower case are used, the ASCII definitions of these
+classes are used.  (The Python 2.x str type uses the definitions from
+the current locale, settable through the locale module.)  The
+.encode() method is left out because of the more strict definitions of
+encoding and decoding in Python 3000: encoding always takes a Unicode
+string and returns a bytes sequence, and decoding always takes a bytes
+sequence and returns a Unicode string.
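
A few of these rules can be checked on released Python 3. This is a sketch that assumes ``bytearray`` as a stand-in for this draft's ``buffer``; the sample byte strings are arbitrary.

```python
# Character classes follow ASCII: byte 0xE9 is neither a letter nor cased.
upper_ascii_only = b"caf\xe9".upper()
high_byte_is_alpha = b"\xe9".isalpha()

# Results have the type of "self".
parts = bytearray(b"a b").split()             # list of bytearray objects
joined = b"-".join([bytearray(b"a"), b"b"])   # bytes, since self is bytes

# Arguments may be any object supporting the buffer API.
found = b"hello".find(memoryview(b"ll"))

# There is no .encode() on bytes.
has_encode = hasattr(b"", "encode")
```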
+
+Bytes and the Str Type
+----------------------
+
+Like the bytes type in Python 3.0a1, and unlike the relationship
+between str and unicode in Python 2.x, any attempt to mix bytes (or
+buffer) objects and str objects without specifying an encoding will
+raise a TypeError exception.  This is the case even for simply
+comparing a bytes or buffer object to a str object (even violating the
+general rule that comparing objects of different types for equality
+should just return False).
+
+Conversions between bytes or buffer objects and str objects must
+always be explicit, using an encoding.  There are two equivalent APIs:
+``str(b, <encoding>[, <errors>])`` is equivalent to
+``b.decode(<encoding>[, <errors>])``, and
+``bytes(s, <encoding>[, <errors>])`` is equivalent to
+``s.encode(<encoding>[, <errors>])``.
+  
+There is one exception: we can convert from bytes (or buffer) to str
+without specifying an encoding by writing ``str(b)``.  This produces
+the same result as ``repr(b)``.  This exception is necessary because
+of the general promise that *any* object can be printed, and printing
+is just a special case of conversion to str.  There is however no
+promise that printing a bytes object interprets the individual bytes
+as characters (unlike in Python 2.x).
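
The explicit conversions and the ``str(b)`` exception can be checked on released Python 3, where decoding pairs ``str(b, <encoding>)`` with ``b.decode(<encoding>)`` and encoding pairs ``bytes(s, <encoding>)`` with ``s.encode(<encoding>)``. This is a sketch with arbitrary sample data, assuming Python runs without the ``-b`` option.

```python
raw = b"caf\xc3\xa9"
text = "caf\u00e9"

# Decoding: bytes -> str, always with an explicit encoding.
decoded_via_str = str(raw, "utf-8")
decoded_via_method = raw.decode("utf-8")

# Encoding: str -> bytes, always with an explicit encoding.
encoded_via_bytes = bytes(text, "utf-8")
encoded_via_method = text.encode("utf-8")

# Mixing the two types without an encoding is rejected.
try:
    raw + text
    implicit_mix_allowed = True
except TypeError:
    implicit_mix_allowed = False

# str(b) without an encoding gives the repr, not decoded characters.
printable = str(raw)
```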
+
+The str type currently supports the PEP 3118 buffer API.  While this is
+perhaps occasionally convenient, it is also potentially confusing,
+because the bytes accessed via the buffer API represent a
+platform-dependent encoding: depending on the platform byte order and
+a compile-time configuration option, the encoding could be UTF-16-BE,
+UTF-16-LE, UTF-32-BE, or UTF-32-LE.  Worse, a different implementation
+of the str type might completely change the bytes representation,
+e.g. to UTF-8, or even make it impossible to access the data as a
+contiguous array of bytes at all.  Therefore, support for the PEP 3118
+buffer API will be removed from the str type.
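
On released Python 3 the outcome can be observed directly: str does not export a buffer, so byte-level access requires choosing an encoding first. This is a sketch; the sample string and the choice of UTF-16-LE are arbitrary.

```python
# str does not support the buffer API, so memoryview() rejects it.
try:
    memoryview("text")
    str_exports_buffer = True
except TypeError:
    str_exports_buffer = False

# An explicit encoding makes the byte layout unambiguous.
encoded_view = memoryview("text".encode("utf-16-le"))
```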
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End: