diff --git a/pep-0000.txt b/pep-0000.txt
index 0e2aeeb14..1129e0a30 100644
--- a/pep-0000.txt
+++ b/pep-0000.txt
@@ -96,6 +96,7 @@ Index by Category
  S   3116  New I/O                                      Stutzbach, Verdone, GvR
  S   3134  Exception Chaining and Embedded Tracebacks   Yee
  S   3135  New Super                                    Spealman, Delaney
+ S   3137  Immutable Bytes and Mutable Buffer           GvR
  S   3141  A Type Hierarchy for Numbers                 Yasskin
 
  Finished PEPs (done, implemented in Subversion)
@@ -509,6 +510,7 @@ Numerical Index
  S   3134  Exception Chaining and Embedded Tracebacks   Yee
  S   3135  New Super                                    Spealman, Delaney
  SR  3136  Labeled break and continue                   Chisholm
+ S   3137  Immutable Bytes and Mutable Buffer           GvR
  S   3141  A Type Hierarchy for Numbers                 Yasskin
diff --git a/pep-3137.txt b/pep-3137.txt
new file mode 100644
index 000000000..cb5f1c83b
--- /dev/null
+++ b/pep-3137.txt
@@ -0,0 +1,239 @@
+PEP: 3137
+Title: Immutable Bytes and Mutable Buffer
+Version: $Revision$
+Last-Modified: $Date$
+Author: Guido van Rossum
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 26-May-2007
+Python-Version: 3.0
+Post-History: 26-may-2007
+
+Introduction
+============
+
+After releasing Python 3.0a1 with a mutable bytes type, pressure
+mounted to add a way to represent immutable bytes. Gregory P. Smith
+proposed a patch that would allow making a bytes object temporarily
+immutable by requesting that the data be locked using the new buffer
+API from PEP 3118. This did not seem the right approach to me.
+
+Jeffrey Yasskin, with the help of Adam Hupp, then prepared a patch to
+make the bytes type immutable (by crudely removing all mutating APIs)
+and fix the fall-out in the test suite. This showed that there aren't
+all that many places that depend on the mutability of bytes, with the
+exception of code that builds up a return value from small pieces.
+
+Thinking through the consequences, and noticing that using the array
+module as an ersatz mutable bytes type is far from ideal, and
+recalling a proposal put forward earlier by Talin, I floated the
+suggestion to have both a mutable and an immutable bytes type. (This
+had been brought up before, but until seeing the evidence of Jeffrey's
+patch I wasn't open to the suggestion.)
+
+Moreover, a possible implementation strategy became clear: use the old
+PyString implementation, stripped down to remove locale support and
+implicit conversions to/from Unicode, for the immutable bytes type,
+and keep the new PyBytes implementation as the mutable bytes type.
+
+The ensuing discussion made it clear that the idea is welcome but
+needs to be specified more precisely. Hence this PEP.
+
+Advantages
+==========
+
+One advantage of having an immutable bytes type is that code objects
+can use these. It also makes it possible to efficiently create hash
+tables using bytes for keys; this may be useful when parsing protocols
+like HTTP or SMTP which are based on bytes representing text.
+
+Porting code that manipulates binary data (or encoded text) in Python
+2.x will be easier using the new design than using the original 3.0
+design with mutable bytes; simply replace ``str`` with ``bytes`` and
+change '...' literals into b'...' literals.
+
+Naming
+======
+
+I propose the following type names at the Python level:
+
+ - ``bytes`` is an immutable array of bytes (PyString)
+
+ - ``buffer`` is a mutable array of bytes (PyBytes)
+
+ - ``memoryview`` is a bytes view on another object (PyMemory)
+
+The old type named ``buffer`` is so similar to the new type
+``memoryview``, introduced by PEP 3118, that it is redundant. The rest
+of this PEP doesn't discuss the functionality of ``memoryview``; it is
+just mentioned here to justify getting rid of the old ``buffer`` type
+so we can reuse its name for the mutable bytes type.
+
+While eventually it makes sense to change the C API names, this PEP
+maintains the old C API names, which should be familiar to all.
+
+Literal Notations
+=================
+
+The b'...' notation introduced in Python 3.0a1 returns an immutable
+bytes object, whatever variation is used. To create a mutable bytes
+buffer object, use buffer(b'...') or buffer([...]). The latter may
+use a list of integers in range(256).
+
+Functionality
+=============
+
+PEP 3118 Buffer API
+-------------------
+
+Both bytes and buffer support the PEP 3118 buffer API. The bytes type
+only supports read-only requests; the buffer type allows writable and
+data-locked requests as well. The element data type is always 'B'
+(i.e. unsigned byte).
+
+Constructors
+------------
+
+There are four forms of constructors applicable to both bytes and
+buffer, plus a fifth form that applies to buffer only:
+
+ - ``bytes(<bytes>)``, ``bytes(<buffer>)``, ``buffer(<bytes>)``,
+   ``buffer(<buffer>)``: simple copying constructors, with the note
+   that ``bytes(<bytes>)`` might return its (immutable) argument.
+
+ - ``bytes(<str>, <encoding>[, <errors>])``, ``buffer(<str>,
+   <encoding>[, <errors>])``: encode a text string. Note that the
+   ``str.encode()`` method returns an *immutable* bytes object.
+   The <encoding> argument is mandatory; <errors> is optional.
+
+ - ``bytes(<memory view>)``, ``buffer(<memory view>)``: construct a
+   bytes or buffer object from anything that supports the PEP 3118
+   buffer API.
+
+ - ``bytes(<iterable of ints>)``, ``buffer(<iterable of ints>)``:
+   construct an immutable bytes or mutable buffer object from a
+   stream of integers in range(256).
+
+ - ``buffer(<int>)``: construct a zero-initialized buffer of a given
+   length.
+
+Comparisons
+-----------
+
+The bytes and buffer types are comparable with each other and
+orderable, so that e.g. b'abc' == buffer(b'abc') < b'abd'.
+
+Comparing either type to a str object raises an exception. This
+turned out to be necessary to catch common mistakes.
+
+Slicing
+-------
+
+Slicing a bytes object returns a bytes object. Slicing a buffer
+object returns a buffer object.
+
+Slice assignment to a mutable buffer object accepts anything that
+supports the PEP 3118 buffer API, or an iterable of integers in
+range(256).
+
+Indexing
+--------
+
+**Open Issue:** I'm undecided on whether indexing bytes and buffer
+objects should return small ints (like the bytes type in 3.0a1, and
+like lists or array.array('B')), or bytes/buffer objects of length 1
+(like the str type). The latter (str-like) approach will ease porting
+code from Python 2.x; but it makes it harder to extract values from a
+bytes array.
+
+Assignment to an item of a mutable buffer object accepts an int in
+range(256); if we choose the str-like approach for indexing above, it
+also accepts an object implementing the PEP 3118 buffer API, if it has
+length 1.
+
+Str() and Repr()
+----------------
+
+The str() and repr() functions return the same thing for these
+objects. The repr() of a bytes object returns a b'...' style literal.
+The repr() of a buffer returns a string of the form "buffer(b'...')".
+
+Methods
+-------
+
+The following methods are supported by bytes as well as buffer, with
+similar semantics. They accept anything that implements the PEP 3118
+buffer API for bytes arguments, and return the same type as the object
+whose method is called ("self")::
+
+    .capitalize(), .center(), .count(), .decode(), .endswith(),
+    .expandtabs(), .find(), .index(), .isalnum(), .isalpha(), .isdigit(),
+    .islower(), .isspace(), .istitle(), .isupper(), .join(), .ljust(),
+    .lower(), .lstrip(), .partition(), .replace(), .rfind(), .rindex(),
+    .rjust(), .rpartition(), .rsplit(), .rstrip(), .split(),
+    .splitlines(), .startswith(), .strip(), .swapcase(), .title(),
+    .translate(), .upper(), .zfill()
+
+This is exactly the set of methods present on the str type in Python
+2.x, with the exclusion of .encode(). The signatures and semantics
+are the same too. However, whenever character classes like letter,
+whitespace, lower case are used, the ASCII definitions of these
+classes are used. (The Python 2.x str type uses the definitions from
+the current locale, settable through the locale module.) The
+.encode() method is left out because of the more strict definitions of
+encoding and decoding in Python 3000: encoding always takes a Unicode
+string and returns a bytes sequence, and decoding always takes a bytes
+sequence and returns a Unicode string.
+
+Bytes and the Str Type
+----------------------
+
+Like the bytes type in Python 3.0a1, and unlike the relationship
+between str and unicode in Python 2.x, any attempt to mix bytes (or
+buffer) objects and str objects without specifying an encoding will
+raise a TypeError exception. This is the case even for simply
+comparing a bytes or buffer object to a str object (even violating the
+general rule that comparing objects of different types for equality
+should just return False).
+
+Conversions between bytes or buffer objects and str objects must
+always be explicit, using an encoding. There are two equivalent APIs:
+``str(b, <encoding>[, <errors>])`` is equivalent to
+``b.decode(<encoding>[, <errors>])``, and
+``bytes(s, <encoding>[, <errors>])`` is equivalent to
+``s.encode(<encoding>[, <errors>])``.
+
+There is one exception: we can convert from bytes (or buffer) to str
+without specifying an encoding by writing ``str(b)``. This produces
+the same result as ``repr(b)``. This exception is necessary because
+of the general promise that *any* object can be printed, and printing
+is just a special case of conversion to str. There is however no
+promise that printing a bytes object interprets the individual bytes
+as characters (unlike in Python 2.x).
+
+The str type currently supports the PEP 3118 buffer API. While this is
+perhaps occasionally convenient, it is also potentially confusing,
+because the bytes accessed via the buffer API represent a
+platform-dependent encoding: depending on the platform byte order and
+a compile-time configuration option, the encoding could be UTF-16-BE,
+UTF-16-LE, UTF-32-BE, or UTF-32-LE. Worse, a different implementation
+of the str type might completely change the bytes representation,
+e.g. to UTF-8, or even make it impossible to access the data as a
+contiguous array of bytes at all. Therefore, support for the PEP 3118
+buffer API will be removed from the str type.
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End: