From cd7901e86ee4ed90c30608ac8772130004fed2ab Mon Sep 17 00:00:00 2001
From: Neil Schemenauer
Date: Wed, 22 Feb 2006 20:40:03 +0000
Subject: [PATCH] Add 'The "bytes" object' PEP.

---
 pep-0358.txt | 215 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 pep-0358.txt

diff --git a/pep-0358.txt b/pep-0358.txt
new file mode 100644
index 000000000..949e40d76
--- /dev/null
+++ b/pep-0358.txt
@@ -0,0 +1,215 @@
+PEP: 358
+Title: The "bytes" object
+Version: $Revision$
+Last-Modified: $Date$
+Author: Neil Schemenauer
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 15-Feb-2006
+Python-Version: 2.5
+Post-History:
+
+
+Abstract
+========
+
+This PEP outlines the introduction of a raw bytes sequence object.
+Adding the bytes object is one step in the transition to Unicode-based
+str objects.
+
+
+Motivation
+==========
+
+Python's current string objects are overloaded. They serve to hold
+both sequences of characters and sequences of bytes. This overloading
+of purpose leads to confusion and bugs. In future versions of Python,
+string objects will be used for holding character data. The bytes object
+will fulfil the role of a byte container. Eventually the unicode
+built-in will be renamed to str and the existing str object will be
+removed.
+
+
+Specification
+=============
+
+A bytes object stores a mutable sequence of integers that are in the
+range 0 to 255. Unlike string objects, indexing a bytes object returns
+an integer. Assigning an element using an object that is not an integer
+causes a TypeError exception. Assigning an element a value outside
+the range 0 to 255 causes a ValueError exception. The __len__ method of
+bytes returns the number of integers stored in the sequence (i.e. the
+number of bytes).
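The element semantics described above can be sketched with Python 3's bytearray. Using bytearray as a stand-in is an assumption made for illustration only: the bytes object proposed here is not available in the Python version this PEP targets, and the variable names below are hypothetical.

```python
# Sketch only: Python 3's bytearray stands in for the proposed mutable
# bytes object; the stand-in is an assumption, not part of the PEP.
data = bytearray([10, 20, 30])

first = data[0]        # indexing yields an int, not a one-character string
length = len(data)     # number of stored integers (i.e. bytes)

data[1] = 255          # values in the range 0 to 255 are accepted

try:
    data[0] = "a"      # not an integer
    type_error_raised = False
except TypeError:
    type_error_raised = True

try:
    data[0] = 256      # outside the range 0 to 255
    value_error_raised = False
except ValueError:
    value_error_raised = True
```

Both rejected assignments leave the sequence unchanged, so only the assignment of 255 is visible afterwards.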
+
+The constructor of the bytes object has the following signature:
+
+    bytes([initialiser[, encoding]])
+
+If no arguments are provided then an object containing zero elements is
+created and returned. The initialiser argument can be a string or a
+sequence of integers. The pseudo-code for the constructor is:
+
+    def bytes(initialiser=[], encoding=None):
+        if isinstance(initialiser, basestring):
+            if isinstance(initialiser, unicode):
+                if encoding is None:
+                    encoding = sys.getdefaultencoding()
+                initialiser = initialiser.encode(encoding)
+            initialiser = [ord(c) for c in initialiser]
+        elif encoding is not None:
+            raise TypeError("explicit encoding invalid for non-string "
+                            "initialiser")
+        create bytes object and fill with integers from initialiser
+        return bytes object
+
+The __repr__ method returns a string that can be evaluated to generate a
+new bytes object containing the same sequence of integers. The sequence
+is represented by a list of ints. For example:
+
+    >>> repr(bytes([10, 20, 30]))
+    'bytes([10, 20, 30])'
+
+The object has a decode method equivalent to the decode method of the
+str object. The object has a classmethod fromhex that takes a string of
+characters from the set [0-9a-fA-F ] and returns a bytes object (similar
+to binascii.unhexlify).
+For example:
+
+    >>> bytes.fromhex('5c5350ff')
+    bytes([92, 83, 80, 255])
+    >>> bytes.fromhex('5c 53 50 ff')
+    bytes([92, 83, 80, 255])
+
+The object has a hex method that does the reverse conversion (similar to
+binascii.hexlify):
+
+    >>> bytes([92, 83, 80, 255]).hex()
+    '5c5350ff'
+
+The bytes object has methods similar to the list object:
+
+    __add__
+    __contains__
+    __delitem__
+    __delslice__
+    __eq__
+    __ge__
+    __getitem__
+    __getslice__
+    __gt__
+    __hash__
+    __iadd__
+    __imul__
+    __iter__
+    __le__
+    __len__
+    __lt__
+    __mul__
+    __ne__
+    __reduce__
+    __reduce_ex__
+    __repr__
+    __rmul__
+    __setitem__
+    __setslice__
+    append
+    count
+    extend
+    index
+    insert
+    pop
+    remove
+
+
+Out of scope issues
+===================
+
+* If we provide a literal syntax for bytes then it should look distinctly
+  different from the syntax for literal strings. Also, a new type, even
+  built-in, is much less drastic than a new literal (which requires
+  lexer and parser support in addition to everything else). Since there
+  appears to be no immediate need for a literal representation,
+  designing and implementing one is out of the scope of this PEP.
+
+* Python 3k will have a much different I/O subsystem. Deciding how that
+  I/O subsystem will work and interact with the bytes object is out of
+  the scope of this PEP.
+
+* It has been suggested that a special method named __bytes__ be added
+  to the language to allow objects to be converted into byte arrays.
+  This decision is out of scope.
+
+
+Unresolved issues
+=================
+
+* Perhaps the bytes object should be implemented as an extension module
+  until we are more sure of the design (similar to how the set object
+  was prototyped).
+
+* Should the bytes object implement the buffer interface? Probably, but
+  we need to look into the implications of that (e.g. regex operations
+  on byte arrays).
+
+* Should the object implement __reversed__ and reverse? Should it
+  implement sort?
+
+* Need to clarify what some of the methods do. How are comparisons
+  done? Hashing? Pickling and marshalling?
+
+
+Questions and answers
+=====================
+
+Q: Why have the optional encoding argument when the encode method of
+   Unicode objects does the same thing?
+
+A: In the current version of Python, the encode method returns a str
+   object and we cannot change that without breaking code. The construct
+   bytes(s.encode(...)) is expensive because it has to copy the byte
+   sequence multiple times. Also, Python generally provides two ways of
+   converting an object of type A into an object of type B: ask an A
+   instance to convert itself to a B, or ask the type B to create a new
+   instance from an A. Depending on what A and B are, both APIs make
+   sense; sometimes reasons of decoupling require that A can't know
+   about B, in which case you have to use the latter approach; sometimes
+   B can't know about A, in which case you have to use the former.
+
+
+Q: Why does bytes ignore the encoding argument if the initialiser is a
+   str?
+
+A: There is no sane meaning that the encoding can have in that case.
+   str objects *are* byte arrays and they know nothing about the
+   encoding of character data they contain. We need to assume that the
+   programmer has provided a str object that already uses the desired
+   encoding. If you need something other than a pure copy of the bytes
+   then you need to first decode the string. For example:
+
+       bytes(s.decode(encoding1), encoding2)
+
+
+Q: Why not have the encoding argument default to Latin-1 (or some other
+   encoding that covers the entire byte range) rather than ASCII?
+
+A: The system default encoding for Python is ASCII. It seems least
+   confusing to use that default. Also, in Py3k, using Latin-1 as
+   the default might not be what users expect. For example, they might
+   prefer a Unicode encoding. Any default will not always work as
+   expected. At least ASCII will complain loudly if you try to encode
+   non-ASCII data.
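The encoding answers above can be illustrated with a short sketch. It again assumes Python 3's bytearray as a stand-in for the proposed type, so details differ from the pseudo-code constructor (for instance, bytearray requires an explicit encoding for text). The sample text and variable names are hypothetical.

```python
# Sketch only: Python 3's bytearray stands in for the proposed bytes
# object; this is an assumption, not the constructor specified above.
text = "caf\u00e9"  # contains one non-ASCII character

# Latin-1 covers the whole byte range, so encoding succeeds silently,
# whether or not Latin-1 was the encoding the caller intended.
latin1 = bytearray(text, "latin-1")

# ASCII refuses non-ASCII data instead of guessing.
try:
    bytearray(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Changing the encoding of existing byte data: decode first, then
# construct the new object with the target encoding.
utf8 = bytearray(latin1.decode("latin-1"), "utf-8")
```

The last statement mirrors the bytes(s.decode(encoding1), encoding2) idiom: the byte data is turned back into text before a different encoding is applied.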
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   End: