diff --git a/pep-0358.txt b/pep-0358.txt
index 6401260ab..77884c74c 100644
--- a/pep-0358.txt
+++ b/pep-0358.txt
@@ -13,9 +13,16 @@ Post-History:
 
 Abstract
 
-    This PEP outlines the introduction of a raw bytes sequence object.
-    Adding the bytes object is one step in the transition to Unicode
-    based str objects.
+    This PEP outlines the introduction of a raw bytes sequence type.
+    Adding the bytes type is one step in the transition to Unicode
+    based str objects which will be introduced in Python 3.0.
+
+    The PEP describes how the bytes type should work in Python 2.6, as
+    well as how it should work in Python 3.0. (Occasionally there are
+    differences because in Python 2.6, we have two string types, str
+    and unicode, while in Python 3.0 we will only have one string
+    type, whose name will be str but whose semantics will be like the
+    2.6 unicode type.)
 
 
 Motivation
@@ -33,39 +40,48 @@ Specification
 
     A bytes object stores a mutable sequence of integers that are in
     the range 0 to 255. Unlike string objects, indexing a bytes
-    object returns an integer. Assigning an element using a object
-    that is not an integer causes a TypeError exception. Assigning an
-    element to a value outside the range 0 to 255 causes a ValueError
-    exception. The .__len__() method of bytes returns the number of
-    integers stored in the sequence (i.e. the number of bytes).
+    object returns an integer. Assigning or comparing an object that
+    is not an integer to an element causes a TypeError exception.
+    Assigning an element to a value outside the range 0 to 255 causes
+    a ValueError exception. The .__len__() method of bytes returns
+    the number of integers stored in the sequence (i.e. the number of
+    bytes).
 
     The constructor of the bytes object has the following signature:
 
-        bytes([initialiser[, [encoding]])
+        bytes([initializer[, encoding]])
 
-    If no arguments are provided then an object containing zero elements
-    is created and returned. The initialiser argument can be a string,
-    a sequence of integers, or a single integer. The pseudo-code for the
-    constructor is:
+    If no arguments are provided then a bytes object containing zero
+    elements is created and returned. The initializer argument can be
+    a string (in 2.6, either str or unicode), an iterable of integers,
+    or a single integer. The pseudo-code for the constructor
+    (optimized for clear semantics, not for speed) is:
 
-        def bytes(initialiser=[], encoding=None):
-            if isinstance(initialiser, int): # In 2.6, (int, long)
-                initialiser = [0]*initialiser
-            elif isinstance(initialiser, basestring):
-                if isinstance(initialiser, unicode): # In 3.0, always
+        def bytes(initializer=0, encoding=None):
+            if isinstance(initializer, int): # In 2.6, (int, long)
+                initializer = [0]*initializer
+            elif isinstance(initializer, basestring):
+                if isinstance(initializer, unicode): # In 3.0, always
                     if encoding is None:
                         # In 3.0, raise TypeError("explicit encoding required")
                         encoding = sys.getdefaultencoding()
-                    initialiser = initialiser.encode(encoding)
-                initialiser = [ord(c) for c in initialiser]
+                    initializer = initializer.encode(encoding)
+                initializer = [ord(c) for c in initializer]
             else:
                 if encoding is not None:
-                    raise TypeError("explicit encoding invalid for non-string "
-                                    "initialiser")
-            # Create bytes object and fill with integers from initialiser
-            # while ensuring each integer is in range(256); initialiser
-            # can be any iterable
-            return bytes object
+                    raise TypeError("no encoding allowed for this initializer")
+                tmp = []
+                for c in initializer:
+                    if not isinstance(c, int):
+                        raise TypeError("initializer must be iterable of ints")
+                    if not 0 <= c < 256:
+                        raise ValueError("initializer element out of range")
+                    tmp.append(c)
+                initializer = tmp
+            new =
+            for i, c in enumerate(initializer):
+                new[i] = c
+            return new
 
     The .__repr__() method returns a string that can be evaluated to
     generate a new bytes object containing the same sequence of
@@ -76,13 +92,10 @@ Specification
         'bytes([0x0a, 0x14, 0x1e])'
 
     The object has a .decode() method equivalent to the .decode()
-    method of the str object. (This is redundant since it can also be
-    decoded by calling unicode(b, ) (in 2.6) or str(b,
-    ) (in 3.0); do we need encode/decode methods? In a
-    sense the spelling using a constructor is cleaner.) The object
-    has a classmethod .fromhex() that takes a string of characters
-    from the set [0-9a-zA-Z ] and returns a bytes object (similar to
-    binascii.unhexlify). For example:
+    method of the str object. The object has a classmethod .fromhex()
+    that takes a string of characters from the set [0-9a-zA-Z ] and
+    returns a bytes object (similar to binascii.unhexlify). For
+    example:
 
         >>> bytes.fromhex('5c5350ff')
         bytes([92, 83, 80, 255]])
@@ -96,102 +109,118 @@ Specification
         '5c5350ff'
 
     The bytes object has some methods similar to list method, and
-    others similar to str methods:
+    others similar to str methods. Here is a complete list of
+    methods, with their approximate signatures:
 
-        __add__
-        __contains__ (with int arg, like list; with bytes arg, like str)
-        __delitem__
-        __delslice__
-        __eq__
-        __ge__
-        __getitem__
-        __getslice__
-        __gt__
-        __iadd__
-        __imul__
-        __iter__
-        __le__
-        __len__
-        __lt__
-        __mul__
-        __ne__
-        __reduce__
-        __reduce_ex__
-        __repr__
-        __reversed__
-        __rmul__
-        __setitem__
-        __setslice__
-        append
-        count
-        decode
-        endswith
-        extend
-        find
-        index
-        insert
-        join
-        partition
-        pop
-        remove
-        replace
-        rindex
-        rpartition
-        split
-        startswith
-        reverse
-        rfind
-        rindex
-        rsplit
-        translate
+        .__add__(bytes) -> bytes
+        .__contains__(int | bytes) -> bool
+        .__delitem__(int | slice) -> None
+        .__delslice__(int, int) -> None
+        .__eq__(bytes) -> bool
+        .__ge__(bytes) -> bool
+        .__getitem__(int | slice) -> int | bytes
+        .__getslice__(int, int) -> bytes
+        .__gt__(bytes) -> bool
+        .__iadd__(bytes) -> bytes
+        .__imul__(int) -> bytes
+        .__iter__() -> iterator
+        .__le__(bytes) -> bool
+        .__len__() -> int
+        .__lt__(bytes) -> bool
+        .__mul__(int) -> bytes
+        .__ne__(bytes) -> bool
+        .__reduce__(...) -> ...
+        .__reduce_ex__(...) -> ...
+        .__repr__() -> str
+        .__reversed__() -> bytes
+        .__rmul__(int) -> bytes
+        .__setitem__(int | slice, int | iterable[int]) -> None
+        .__setslice__(int, int, iterable[int]) -> None
+        .append(int) -> None
+        .count(int) -> int
+        .decode(str) -> str | unicode # in 3.0, only str
+        .endswith(bytes) -> bool
+        .extend(iterable[int]) -> None
+        .find(bytes) -> int
+        .index(bytes | int) -> int
+        .insert(int, int) -> None
+        .join(iterable[bytes]) -> bytes
+        .partition(bytes) -> (bytes, bytes, bytes)
+        .pop([int]) -> int
+        .remove(int) -> None
+        .replace(bytes, bytes) -> bytes
+        .rindex(bytes | int) -> int
+        .rpartition(bytes) -> (bytes, bytes, bytes)
+        .split(bytes) -> list[bytes]
+        .startswith(bytes) -> bool
+        .reverse() -> None
+        .rfind(bytes) -> int
+        .rindex(bytes | int) -> int
+        .rsplit(bytes) -> list[bytes]
+        .translate(bytes, [bytes]) -> bytes
 
     Note the conspicuous absence of .isupper(), .upper(), and friends.
-    There is no __hash__ because the object is mutable. There is no
-    usecase for a .sort() method.
+    (But see "Open Issues" below.) There is no .__hash__() because
+    the object is mutable. There is no use case for a .sort() method.
 
-    The bytes also supports the buffer interface, supporting reading
-    and writing binary (but not character) data.
+    The bytes type also supports the buffer interface, supporting
+    reading and writing binary (but not character) data.
 
-Out of scope issues
+Out of Scope Issues
 
-    * If we provide a literal syntax for bytes then it should look
-      distinctly different than the syntax for literal strings. Also, a
-      new type, even built-in, is much less drastic than a new literal
-      (which requires lexer and parser support in addition to everything
-      else). Since there appears to be no immediate need for a literal
-      representation, designing and implementing one is out of the scope
-      of this PEP. (Hmm... A b"..." literal accepting only ASCII
-      values is likely to be added to 3.0; not clear about 2.6. This
-      needs a PEP.)
+    * Python 3k will have a much different I/O subsystem. Deciding
+      how that I/O subsystem will work and interact with the bytes
+      object is out of the scope of this PEP. The expectation however
+      is that binary I/O will read and write bytes, while text I/O
+      will read strings. Since the bytes type supports the buffer
+      interface, the existing binary I/O operations in Python 2.6 will
+      support bytes objects.
 
-    * Python 3k will have a much different I/O subsystem. Deciding how
-      that I/O subsystem will work and interact with the bytes object is
-      out of the scope of this PEP.
-
-    * It has been suggested that a special method named __bytes__ be
-      added to language to allow objects to be converted into byte
+    * It has been suggested that a special method named .__bytes__()
+      be added to the language to allow objects to be converted into byte
       arrays. This decision is out of scope.
 
 
-Unresolved issues
+Open Issues
 
-    * Need to specify the methods more carefully.
+    * The .decode() method is redundant since a bytes object b can
+      also be decoded by calling unicode(b, ) (in 2.6) or
+      str(b, ) (in 3.0). Do we need encode/decode methods
+      at all? In a sense the spelling using a constructor is cleaner.
+
+    * Need to specify the methods still more carefully.
+
+    * Pickling and marshalling support need to be specified.
 
     * Should all those list methods really be implemented?
 
+    * There is growing support for a b"..." literal. Here's a brief
+      spec. Each invocation of b"..." produces a new bytes object
+      (this is unlike "..." but similar to [...] and {...}). Inside
+      the literal, only ASCII characters and non-Unicode backslash
+      escapes are allowed; non-ASCII characters not specified as
+      escapes are rejected by the compiler regardless of the source
+      encoding. The resulting object's value is the same as if
+      bytes(map(ord, "...")) were called.
+
     * A case could be made for supporting .ljust(), .rjust(), .center()
       with a mandatory second argument.
 
     * A case could be made for supporting .split() with a mandatory
       argument.
 
-    * How should pickling and marshalling work?
-
-    * I probably forgot a few things.
+    * A case could even be made for supporting .islower(), .isupper(),
+      .isspace(), .isalpha(), .isalnum(), .isdigit() and the
+      corresponding conversions (.lower() etc.), using the ASCII
+      definitions for letters, digits and whitespace. If this is
+      accepted, the cases for .ljust(), .rjust(), .center() and
+      .split() become much stronger, and they should have default
+      arguments as well, using an ASCII space or all ASCII whitespace
+      (for .split()).
 
 
-Questions and answers
+Frequently Asked Questions
 
     Q: Why have the optional encoding argument when the encode method of
        Unicode objects does the same thing.
@@ -209,7 +238,7 @@ Questions and answers
        in which case you have to use the former.
 
 
-    Q: Why does bytes ignore the encoding argument if the initialiser is
+    Q: Why does bytes ignore the encoding argument if the initializer is
        a str? (This only applies to 2.6.)
 
     A: There is no sane meaning that the encoding can have in that case.
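
A note on the constructor pseudo-code in the Specification hunk: it can be tried out with a small Python 3 sketch. The sketch follows only the 3.0 branch of the pseudo-code, where every string is Unicode and an explicit encoding is therefore required. The helper name make_bytes is hypothetical, and the built-in bytearray stands in for the proposed mutable bytes type; both are assumptions of this sketch, not something the patch defines.

```python
def make_bytes(initializer=0, encoding=None):
    """Illustrative only: mirrors the 3.0 branch of the PEP's pseudo-code.

    `make_bytes` is a hypothetical name; `bytearray` is used here as a
    stand-in for the mutable bytes type described in the patch.
    """
    if isinstance(initializer, int):
        # A single integer n yields n zero-valued elements.
        initializer = [0] * initializer
    elif isinstance(initializer, str):
        # 3.0 rule from the pseudo-code comment: no default encoding.
        if encoding is None:
            raise TypeError("explicit encoding required")
        initializer = list(initializer.encode(encoding))
    elif encoding is not None:
        # Any other iterable must not be combined with an encoding.
        raise TypeError("no encoding allowed for this initializer")

    checked = []
    for c in initializer:
        if not isinstance(c, int):
            raise TypeError("initializer must be iterable of ints")
        if not 0 <= c < 256:
            raise ValueError("initializer element out of range")
        checked.append(c)
    return bytearray(checked)
```

For example, make_bytes(3) gives three zero bytes and make_bytes("abc", "ascii") gives the ASCII codes of the string, while make_bytes("abc") and make_bytes([256]) raise TypeError and ValueError respectively, matching the error cases in the pseudo-code.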