Another update, clarifying (I hope) the method signatures and mentioning
other stuff that came up over dinner.
Guido van Rossum 2007-02-23 04:31:15 +00:00
parent 3ed8bc79de
commit 9ca7df06ee
1 changed files with 136 additions and 107 deletions

@ -13,9 +13,16 @@ Post-History:
Abstract
This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode
based str objects.
This PEP outlines the introduction of a raw bytes sequence type.
Adding the bytes type is one step in the transition to Unicode
based str objects which will be introduced in Python 3.0.
The PEP describes how the bytes type should work in Python 2.6, as
well as how it should work in Python 3.0. (Occasionally there are
differences because in Python 2.6, we have two string types, str
and unicode, while in Python 3.0 we will only have one string
type, whose name will be str but whose semantics will be like the
2.6 unicode type.)
Motivation
@ -33,39 +40,48 @@ Specification
A bytes object stores a mutable sequence of integers that are in
the range 0 to 255. Unlike string objects, indexing a bytes
object returns an integer. Assigning an element using a object
that is not an integer causes a TypeError exception. Assigning an
element to a value outside the range 0 to 255 causes a ValueError
exception. The .__len__() method of bytes returns the number of
integers stored in the sequence (i.e. the number of bytes).
object returns an integer. Assigning or comparing an object that
is not an integer to an element causes a TypeError exception.
Assigning an element to a value outside the range 0 to 255 causes
a ValueError exception. The .__len__() method of bytes returns
the number of integers stored in the sequence (i.e. the number of
bytes).
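The element rules above can be tried out in a current Python 3 interpreter. This is only a rough sketch: it assumes that bytearray is a reasonable stand-in for the mutable bytes type described in this draft, and the helper name assignment_error is made up for the illustration.

```python
# Assumption: bytearray stands in for the draft's mutable bytes type.
def assignment_error(buf, index, value):
    """Return the name of the exception raised by buf[index] = value, or None."""
    try:
        buf[index] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


buf = bytearray([10, 20, 30])
first = buf[0]                                 # indexing yields an int
length = len(buf)                              # number of stored integers
not_an_int = assignment_error(buf, 0, "a")     # non-integer value
out_of_range = assignment_error(buf, 0, 256)   # integer outside 0..255
in_range = assignment_error(buf, 0, 255)       # accepted, buf[0] becomes 255
```

The sketch does not cover the comparison rule: a current interpreter evaluates buf[0] == "a" as False instead of raising TypeError, so that part of the draft cannot be reproduced this way.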
The constructor of the bytes object has the following signature:
bytes([initialiser[, [encoding]])
bytes([initializer[, encoding]])
If no arguments are provided then an object containing zero elements
is created and returned. The initialiser argument can be a string,
a sequence of integers, or a single integer. The pseudo-code for the
constructor is:
If no arguments are provided then a bytes object containing zero
elements is created and returned. The initializer argument can be
a string (in 2.6, either str or unicode), an iterable of integers,
or a single integer. The pseudo-code for the constructor
(optimized for clear semantics, not for speed) is:
def bytes(initialiser=[], encoding=None):
    if isinstance(initialiser, int): # In 2.6, (int, long)
        initialiser = [0]*initialiser
    elif isinstance(initialiser, basestring):
        if isinstance(initialiser, unicode): # In 3.0, always
def bytes(initializer=0, encoding=None):
    if isinstance(initializer, int): # In 2.6, (int, long)
        initializer = [0]*initializer
    elif isinstance(initializer, basestring):
        if isinstance(initializer, unicode): # In 3.0, always
            if encoding is None:
                # In 3.0, raise TypeError("explicit encoding required")
                encoding = sys.getdefaultencoding()
            initialiser = initialiser.encode(encoding)
        initialiser = [ord(c) for c in initialiser]
            initializer = initializer.encode(encoding)
        initializer = [ord(c) for c in initializer]
    else:
        if encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
    # Create bytes object and fill with integers from initialiser
    # while ensuring each integer is in range(256); initialiser
    # can be any iterable
    return bytes object
            raise TypeError("no encoding allowed for this initializer")
        tmp = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            tmp.append(c)
        initializer = tmp
    new = <new bytes object of length len(initializer)>
    for i, c in enumerate(initializer):
        new[i] = c
    return new
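For readers who want to run the constructor logic, here is a hedged Python 3 transcription of the new pseudo-code. It is a sketch, not the reference implementation: the function name make_bytes is hypothetical, bytearray is assumed as the storage type, and the 3.0 branch is taken throughout (one string type, explicit encoding required).

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical Python 3 transcription of the pseudo-code constructor."""
    if isinstance(initializer, int):
        # A single integer asks for that many zero bytes.
        values = [0] * initializer
    elif isinstance(initializer, str):
        # 3.0 semantics: a text string always needs an explicit encoding.
        if encoding is None:
            raise TypeError("explicit encoding required")
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # Assumption: bytearray plays the role of the new bytes object.
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new


empty = make_bytes()
zeros = make_bytes(3)
encoded = make_bytes("\u00e9", "utf-8")
from_ints = make_bytes(iter([1, 2, 255]))
```

Like the pseudo-code, this favours clear semantics over speed: the iterable is validated element by element before anything is stored.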
The .__repr__() method returns a string that can be evaluated to
generate a new bytes object containing the same sequence of
@ -76,13 +92,10 @@ Specification
'bytes([0x0a, 0x14, 0x1e])'
The object has a .decode() method equivalent to the .decode()
method of the str object. (This is redundant since it can also be
decoded by calling unicode(b, <encoding>) (in 2.6) or str(b,
<encoding>) (in 3.0); do we need encode/decode methods? In a
sense the spelling using a constructor is cleaner.) The object
has a classmethod .fromhex() that takes a string of characters
from the set [0-9a-fA-F ] and returns a bytes object (similar to
binascii.unhexlify). For example:
method of the str object. The object has a classmethod .fromhex()
that takes a string of characters from the set [0-9a-fA-F ] and
returns a bytes object (similar to binascii.unhexlify). For
example:
>>> bytes.fromhex('5c5350ff')
bytes([92, 83, 80, 255])
@ -96,102 +109,118 @@ Specification
'5c5350ff'
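The hex round trip can be checked in a current Python 3 interpreter. The sketch below assumes that bytearray.fromhex() and bytearray.hex() behave like the methods this draft has in mind, and that the '5c5350ff' output above comes from the reverse conversion; neither assumption is confirmed by the text shown here.

```python
import binascii

# Assumption: bytearray stands in for the draft's bytes type.
data = bytearray.fromhex("5c5350ff")
spaced = bytearray.fromhex("5c 53 50 ff")        # spaces between byte pairs are ignored
text = data.hex()                                # reverse conversion to a hex string
via_binascii = binascii.unhexlify("5c5350ff")    # the function the draft compares against
```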
The bytes object has some methods similar to list methods, and
others similar to str methods:
others similar to str methods. Here is a complete list of
methods, with their approximate signatures:
__add__
__contains__ (with int arg, like list; with bytes arg, like str)
__delitem__
__delslice__
__eq__
__ge__
__getitem__
__getslice__
__gt__
__iadd__
__imul__
__iter__
__le__
__len__
__lt__
__mul__
__ne__
__reduce__
__reduce_ex__
__repr__
__reversed__
__rmul__
__setitem__
__setslice__
append
count
decode
endswith
extend
find
index
insert
join
partition
pop
remove
replace
rindex
rpartition
split
startswith
reverse
rfind
rindex
rsplit
translate
.__add__(bytes) -> bytes
.__contains__(int | bytes) -> bool
.__delitem__(int | slice) -> None
.__delslice__(int, int) -> None
.__eq__(bytes) -> bool
.__ge__(bytes) -> bool
.__getitem__(int | slice) -> int | bytes
.__getslice__(int, int) -> bytes
.__gt__(bytes) -> bool
.__iadd__(bytes) -> bytes
.__imul__(int) -> bytes
.__iter__() -> iterator
.__le__(bytes) -> bool
.__len__() -> int
.__lt__(bytes) -> bool
.__mul__(int) -> bytes
.__ne__(bytes) -> bool
.__reduce__(...) -> ...
.__reduce_ex__(...) -> ...
.__repr__() -> str
.__reversed__() -> bytes
.__rmul__(int) -> bytes
.__setitem__(int | slice, int | iterable[int]) -> None
.__setslice__(int, int, iterable[int]) -> None
.append(int) -> None
.count(int) -> int
.decode(str) -> str | unicode # in 3.0, only str
.endswith(bytes) -> bool
.extend(iterable[int]) -> None
.find(bytes) -> int
.index(bytes | int) -> int
.insert(int, int) -> None
.join(iterable[bytes]) -> bytes
.partition(bytes) -> (bytes, bytes, bytes)
.pop([int]) -> int
.remove(int) -> None
.replace(bytes, bytes) -> bytes
.rindex(bytes | int) -> int
.rpartition(bytes) -> (bytes, bytes, bytes)
.split(bytes) -> list[bytes]
.startswith(bytes) -> bool
.reverse() -> None
.rfind(bytes) -> int
.rindex(bytes | int) -> int
.rsplit(bytes) -> list[bytes]
.translate(bytes, [bytes]) -> bytes
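A few of these signatures can be exercised with bytearray in a current Python 3 interpreter. This is an illustration under the assumption that bytearray approximates the type described here; return types may differ in detail from the draft.

```python
# Assumption: bytearray approximates the draft's bytes type.
buf = bytearray(b"key=value")

# List-like methods operate on single integers.
buf.append(0x21)               # add one byte
buf.extend([0x3F])             # add bytes from an iterable of ints
last = buf.pop()               # remove and return the last byte as an int

# Str-like methods operate on byte sequences.
head, sep, tail = buf.partition(b"=")
pos = buf.find(b"value")

# __contains__ accepts both flavours.
has_int = 0x3D in buf          # int argument, like list
has_sub = b"val" in buf        # bytes argument, like str
```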
Note the conspicuous absence of .isupper(), .upper(), and friends.
There is no __hash__ because the object is mutable. There is no
usecase for a .sort() method.
(But see "Open Issues" below.) There is no .__hash__() because
the object is mutable. There is no use case for a .sort() method.
The bytes also supports the buffer interface, supporting reading
and writing binary (but not character) data.
The bytes type also supports the buffer interface, supporting
reading and writing binary (but not character) data.
Out of scope issues
Out of Scope Issues
* If we provide a literal syntax for bytes then it should look
distinctly different than the syntax for literal strings. Also, a
new type, even built-in, is much less drastic than a new literal
(which requires lexer and parser support in addition to everything
else). Since there appears to be no immediate need for a literal
representation, designing and implementing one is out of the scope
of this PEP. (Hmm... A b"..." literal accepting only ASCII
values is likely to be added to 3.0; not clear about 2.6. This
needs a PEP.)
* Python 3k will have a much different I/O subsystem. Deciding
how that I/O subsystem will work and interact with the bytes
object is out of the scope of this PEP. The expectation however
is that binary I/O will read and write bytes, while text I/O
will read strings. Since the bytes type supports the buffer
interface, the existing binary I/O operations in Python 2.6 will
support bytes objects.
* Python 3k will have a much different I/O subsystem. Deciding how
that I/O subsystem will work and interact with the bytes object is
out of the scope of this PEP.
* It has been suggested that a special method named __bytes__ be
added to the language to allow objects to be converted into byte
* It has been suggested that a special method named .__bytes__()
be added to the language to allow objects to be converted into byte
arrays. This decision is out of scope.
Unresolved issues
Open Issues
* Need to specify the methods more carefully.
* The .decode() method is redundant since a bytes object b can
also be decoded by calling unicode(b, <encoding>) (in 2.6) or
str(b, <encoding>) (in 3.0). Do we need encode/decode methods
at all? In a sense the spelling using a constructor is cleaner.
* Need to specify the methods still more carefully.
* Pickling and marshalling support need to be specified.
* Should all those list methods really be implemented?
* There is growing support for a b"..." literal. Here's a brief
spec. Each invocation of b"..." produces a new bytes object
(this is unlike "..." but similar to [...] and {...}). Inside
the literal, only ASCII characters and non-Unicode backslash
escapes are allowed; non-ASCII characters not specified as
escapes are rejected by the compiler regardless of the source
encoding. The resulting object's value is the same as if
bytes(map(ord, "...")) were called.
* A case could be made for supporting .ljust(), .rjust(),
.center() with a mandatory second argument.
* A case could be made for supporting .split() with a mandatory
argument.
* How should pickling and marshalling work?
* I probably forgot a few things.
* A case could even be made for supporting .islower(), .isupper(),
.isspace(), .isalpha(), .isalnum(), .isdigit() and the
corresponding conversions (.lower() etc.), using the ASCII
definitions for letters, digits and whitespace. If this is
accepted, the cases for .ljust(), .rjust(), .center() and
.split() become much stronger, and they should have default
arguments as well, using an ASCII space or all ASCII whitespace
(for .split()).
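The equivalence stated in the b"..." item above can be checked in a current Python 3 interpreter, where a b"..." literal exists. Treat this as a sketch only: today's literal yields an immutable bytes object, so the "new object per invocation" rule from the brief spec is not demonstrated, and the helper name compiles is invented for the example.

```python
def compiles(source):
    """Return True if source compiles as an expression, False on SyntaxError."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The literal's value matches bytes(map(ord, "...")) for ASCII text and escapes.
literal = b"abc\n\x00"
spelled_out = bytes(map(ord, "abc\n\x00"))

# ASCII characters are accepted; a raw non-ASCII character is rejected.
ascii_ok = compiles('b"abc"')
non_ascii_ok = compiles('b"\u00e9"')
```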
Questions and answers
Frequently Asked Questions
Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?
@ -209,7 +238,7 @@ Questions and answers
in which case you have to use the former.
Q: Why does bytes ignore the encoding argument if the initialiser is
Q: Why does bytes ignore the encoding argument if the initializer is
a str? (This only applies to 2.6.)
A: There is no sane meaning that the encoding can have in that case.