Another update, clarifying (I hope) the method signatures and mentioning
other stuff that came up over dinner.
parent 3ed8bc79de
commit 9ca7df06ee
243
pep-0358.txt
|
@ -13,9 +13,16 @@ Post-History:
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
|
|
||||||
This PEP outlines the introduction of a raw bytes sequence object.
|
This PEP outlines the introduction of a raw bytes sequence type.
|
||||||
Adding the bytes object is one step in the transition to Unicode
|
Adding the bytes type is one step in the transition to Unicode
|
||||||
based str objects.
|
based str objects which will be introduced in Python 3.0.
|
||||||
|
|
||||||
|
The PEP describes how the bytes type should work in Python 2.6, as
|
||||||
|
well as how it should work in Python 3.0. (Occasionally there are
|
||||||
|
differences because in Python 2.6, we have two string types, str
|
||||||
|
and unicode, while in Python 3.0 we will only have one string
|
||||||
|
type, whose name will be str but whose semantics will be like the
|
||||||
|
2.6 unicode type.)
|
||||||
|
|
||||||
|
|
||||||
Motivation
|
Motivation
|
||||||
|
@ -33,39 +40,48 @@ Specification
|
||||||
|
|
||||||
A bytes object stores a mutable sequence of integers that are in
|
A bytes object stores a mutable sequence of integers that are in
|
||||||
the range 0 to 255. Unlike string objects, indexing a bytes
|
the range 0 to 255. Unlike string objects, indexing a bytes
|
||||||
object returns an integer. Assigning an element using a object
|
object returns an integer. Assigning or comparing an object that
|
||||||
that is not an integer causes a TypeError exception. Assigning an
|
is not an integer to an element causes a TypeError exception.
|
||||||
element to a value outside the range 0 to 255 causes a ValueError
|
Assigning an element to a value outside the range 0 to 255 causes
|
||||||
exception. The .__len__() method of bytes returns the number of
|
a ValueError exception. The .__len__() method of bytes returns
|
||||||
integers stored in the sequence (i.e. the number of bytes).
|
the number of integers stored in the sequence (i.e. the number of
|
||||||
|
bytes).
|
||||||
|
|
||||||
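The element semantics described above can be tried out with a rough sketch. It assumes Python 3's built-in bytearray as a stand-in for the proposed mutable bytes type (that substitution is an assumption of this example, not part of the PEP text), and try_assign is a hypothetical helper. Only assignment is exercised here, not comparison.

```python
# Assumption: bytearray (Python 3) stands in for the proposed mutable bytes type.
data = bytearray([10, 20, 30])

first = data[0]       # indexing yields an int, not a one-character string
length = len(data)    # number of integers stored in the sequence


def try_assign(buf, index, value):
    """Hypothetical helper: name of the exception raised by buf[index] = value, or None."""
    try:
        buf[index] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


print(first, length)               # 10 3
print(try_assign(data, 0, "a"))    # TypeError (not an integer)
print(try_assign(data, 0, 256))    # ValueError (outside 0..255)
print(try_assign(data, 0, 255))    # None (assignment succeeded)
```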
The constructor of the bytes object has the following signature:
|
The constructor of the bytes object has the following signature:
|
||||||
|
|
||||||
bytes([initialiser[, [encoding]])
|
bytes([initializer[, encoding]])
|
||||||
|
|
||||||
If no arguments are provided then an object containing zero elements
|
If no arguments are provided then a bytes object containing zero
|
||||||
is created and returned. The initialiser argument can be a string,
|
elements is created and returned. The initializer argument can be
|
||||||
a sequence of integers, or a single integer. The pseudo-code for the
|
a string (in 2.6, either str or unicode), an iterable of integers,
|
||||||
constructor is:
|
or a single integer. The pseudo-code for the constructor
|
||||||
|
(optimized for clear semantics, not for speed) is:
|
||||||
|
|
||||||
def bytes(initialiser=[], encoding=None):
|
def bytes(initializer=0, encoding=None):
|
||||||
    if isinstance(initialiser, int): # In 2.6, (int, long)
|
    if isinstance(initializer, int): # In 2.6, (int, long)
|
||||||
        initialiser = [0]*initialiser
|
        initializer = [0]*initializer
|
||||||
    elif isinstance(initialiser, basestring):
|
    elif isinstance(initializer, basestring):
|
||||||
        if isinstance(initialiser, unicode): # In 3.0, always
|
        if isinstance(initializer, unicode): # In 3.0, always
|
||||||
            if encoding is None:
|
            if encoding is None:
|
||||||
                # In 3.0, raise TypeError("explicit encoding required")
|
                # In 3.0, raise TypeError("explicit encoding required")
|
||||||
                encoding = sys.getdefaultencoding()
|
                encoding = sys.getdefaultencoding()
|
||||||
            initialiser = initialiser.encode(encoding)
|
            initializer = initializer.encode(encoding)
|
||||||
        initialiser = [ord(c) for c in initialiser]
|
        initializer = [ord(c) for c in initializer]
|
||||||
    else:
|
    else:
|
||||||
        if encoding is not None:
|
        if encoding is not None:
|
||||||
            raise TypeError("explicit encoding invalid for non-string "
|
            raise TypeError("no encoding allowed for this initializer")
|
||||||
                            "initialiser")
|
        tmp = []
|
||||||
    # Create bytes object and fill with integers from initialiser
|
        for c in initializer:
|
||||||
    # while ensuring each integer is in range(256); initialiser
|
            if not isinstance(c, int):
|
||||||
    # can be any iterable
|
                raise TypeError("initializer must be iterable of ints")
|
||||||
    return bytes object
|
            if not 0 <= c < 256:
|
||||||
|
                raise ValueError("initializer element out of range")
|
||||||
|
            tmp.append(c)
|
||||||
|
        initializer = tmp
|
||||||
|
    new = <new bytes object of length len(initializer)>
|
||||||
|
    for i, c in enumerate(initializer):
|
||||||
|
        new[i] = c
|
||||||
|
    return new
|
||||||
|
|
||||||
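For readers who want to run the new-side constructor logic, here is a minimal sketch. It assumes the Python 3.0 branch of the pseudo-code (a single str type, so an explicit encoding is always required for strings) and uses bytearray as the "new bytes object" placeholder. The name make_bytes is hypothetical and chosen only to avoid shadowing the built-in.

```python
def make_bytes(initializer=0, encoding=None):
    """Sketch of the constructor pseudo-code, Python 3.0 branch.

    Assumption: bytearray stands in for the proposed mutable bytes type.
    """
    if isinstance(initializer, int):
        # A single integer N yields N zero-valued elements.
        initializer = [0] * initializer
    elif isinstance(initializer, str):
        if encoding is None:
            raise TypeError("explicit encoding required")
        # Encoding a str yields bytes; list() turns it into ints.
        initializer = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        tmp = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            tmp.append(c)
        initializer = tmp
    new = bytearray(len(initializer))
    for i, c in enumerate(initializer):
        new[i] = c
    return new


print(list(make_bytes()))                 # []
print(list(make_bytes(3)))                # [0, 0, 0]
print(list(make_bytes("abc", "ascii")))   # [97, 98, 99]
print(list(make_bytes([10, 20, 30])))     # [10, 20, 30]
```

Like the pseudo-code, the sketch favors clear semantics over speed: it validates every element before allocating the result.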
The .__repr__() method returns a string that can be evaluated to
|
The .__repr__() method returns a string that can be evaluated to
|
||||||
generate a new bytes object containing the same sequence of
|
generate a new bytes object containing the same sequence of
|
||||||
|
@ -76,13 +92,10 @@ Specification
|
||||||
'bytes([0x0a, 0x14, 0x1e])'
|
'bytes([0x0a, 0x14, 0x1e])'
|
||||||
|
|
||||||
The object has a .decode() method equivalent to the .decode()
|
The object has a .decode() method equivalent to the .decode()
|
||||||
method of the str object. (This is redundant since it can also be
|
method of the str object. The object has a classmethod .fromhex()
|
||||||
decoded by calling unicode(b, <encoding>) (in 2.6) or str(b,
|
that takes a string of characters from the set [0-9a-fA-F ] and
|
||||||
<encoding>) (in 3.0); do we need encode/decode methods? In a
|
returns a bytes object (similar to binascii.unhexlify). For
|
||||||
sense the spelling using a constructor is cleaner.) The object
|
example:
|
||||||
has a classmethod .fromhex() that takes a string of characters
|
|
||||||
from the set [0-9a-fA-F ] and returns a bytes object (similar to
|
|
||||||
binascii.unhexlify). For example:
|
|
||||||
|
|
||||||
>>> bytes.fromhex('5c5350ff')
|
>>> bytes.fromhex('5c5350ff')
|
||||||
bytes([92, 83, 80, 255])
|
bytes([92, 83, 80, 255])
|
||||||
|
@ -96,102 +109,118 @@ Specification
|
||||||
'5c5350ff'
|
'5c5350ff'
|
||||||
|
|
||||||
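The hex round trip can be sketched with today's Python 3, again assuming bytearray as a stand-in for the proposed type. The .hex() method used for the reverse direction exists on bytearray in Python 3.5 and later; whether it matches the exact method this part of the PEP intends is an assumption, since the surrounding lines are not shown in the diff.

```python
import binascii

raw = bytearray.fromhex("5c5350ff")
print(list(raw))    # [92, 83, 80, 255]
print(raw.hex())    # 5c5350ff

# Spaces between byte pairs are accepted.
print(bytearray.fromhex("5c 53 50 ff") == raw)          # True

# binascii.unhexlify performs the same conversion, without space support.
print(binascii.unhexlify("5c5350ff") == bytes(raw))     # True

# Characters that are not hexadecimal digits are rejected.
try:
    bytearray.fromhex("zz")
    rejected = False
except ValueError:
    rejected = True
print(rejected)     # True
```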
The bytes object has some methods similar to list methods, and
|
The bytes object has some methods similar to list methods, and
|
||||||
others similar to str methods:
|
others similar to str methods. Here is a complete list of
|
||||||
|
methods, with their approximate signatures:
|
||||||
|
|
||||||
__add__
|
.__add__(bytes) -> bytes
|
||||||
__contains__ (with int arg, like list; with bytes arg, like str)
|
.__contains__(int | bytes) -> bool
|
||||||
__delitem__
|
.__delitem__(int | slice) -> None
|
||||||
__delslice__
|
.__delslice__(int, int) -> None
|
||||||
__eq__
|
.__eq__(bytes) -> bool
|
||||||
__ge__
|
.__ge__(bytes) -> bool
|
||||||
__getitem__
|
.__getitem__(int | slice) -> int | bytes
|
||||||
__getslice__
|
.__getslice__(int, int) -> bytes
|
||||||
__gt__
|
.__gt__(bytes) -> bool
|
||||||
__iadd__
|
.__iadd__(bytes) -> bytes
|
||||||
__imul__
|
.__imul__(int) -> bytes
|
||||||
__iter__
|
.__iter__() -> iterator
|
||||||
__le__
|
.__le__(bytes) -> bool
|
||||||
__len__
|
.__len__() -> int
|
||||||
__lt__
|
.__lt__(bytes) -> bool
|
||||||
__mul__
|
.__mul__(int) -> bytes
|
||||||
__ne__
|
.__ne__(bytes) -> bool
|
||||||
__reduce__
|
.__reduce__(...) -> ...
|
||||||
__reduce_ex__
|
.__reduce_ex__(...) -> ...
|
||||||
__repr__
|
.__repr__() -> str
|
||||||
__reversed__
|
.__reversed__() -> bytes
|
||||||
__rmul__
|
.__rmul__(int) -> bytes
|
||||||
__setitem__
|
.__setitem__(int | slice, int | iterable[int]) -> None
|
||||||
__setslice__
|
.__setslice__(int, int, iterable[int]) -> None
|
||||||
append
|
.append(int) -> None
|
||||||
count
|
.count(int) -> int
|
||||||
decode
|
.decode(str) -> str | unicode # in 3.0, only str
|
||||||
endswith
|
.endswith(bytes) -> bool
|
||||||
extend
|
.extend(iterable[int]) -> None
|
||||||
find
|
.find(bytes) -> int
|
||||||
index
|
.index(bytes | int) -> int
|
||||||
insert
|
.insert(int, int) -> None
|
||||||
join
|
.join(iterable[bytes]) -> bytes
|
||||||
partition
|
.partition(bytes) -> (bytes, bytes, bytes)
|
||||||
pop
|
.pop([int]) -> int
|
||||||
remove
|
.remove(int) -> None
|
||||||
replace
|
.replace(bytes, bytes) -> bytes
|
||||||
rindex
|
.rindex(bytes | int) -> int
|
||||||
rpartition
|
.rpartition(bytes) -> (bytes, bytes, bytes)
|
||||||
split
|
.split(bytes) -> list[bytes]
|
||||||
startswith
|
.startswith(bytes) -> bool
|
||||||
reverse
|
.reverse() -> None
|
||||||
rfind
|
.rfind(bytes) -> int
|
||||||
rindex
|
.rindex(bytes | int) -> int
|
||||||
rsplit
|
.rsplit(bytes) -> list[bytes]
|
||||||
translate
|
.translate(bytes, [bytes]) -> bytes
|
||||||
|
|
||||||
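To give a feel for how the list-like and str-like halves of this method list combine, here is a short sketch. It assumes Python 3's bytearray as a stand-in for the proposed type; the signatures in the list above are approximate, so small differences from bytearray are possible.

```python
buf = bytearray(b"abc")

# List-like methods mutate the object in place and take integers.
buf.append(100)           # buf is now b"abcd"
buf.extend([101, 102])    # b"abcdef"
buf.insert(0, 33)         # b"!abcdef"
last = buf.pop()          # removes and returns 102
buf.remove(33)            # b"abcde"
buf.reverse()             # b"edcba"
print(last, bytes(buf))   # 102 b'edcba'

# str-like methods take bytes arguments and return new objects.
print(buf.find(b"cb"))                                    # 2
print(buf.startswith(b"ed"), buf.endswith(b"a"))          # True True
print(buf.replace(b"e", b"E") == b"Edcba")                # True
print(buf.partition(b"c") == (b"ed", b"c", b"ba"))        # True
print(buf.split(b"c") == [b"ed", b"ba"])                  # True
print(bytearray(b"-").join([b"a", b"b"]) == b"a-b")       # True

# __contains__ accepts an int (like list) or bytes (like str).
print(100 in buf, b"dc" in buf)                           # True True
```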
Note the conspicuous absence of .isupper(), .upper(), and friends.
|
Note the conspicuous absence of .isupper(), .upper(), and friends.
|
||||||
There is no __hash__ because the object is mutable. There is no
|
(But see "Open Issues" below.) There is no .__hash__() because
|
||||||
usecase for a .sort() method.
|
the object is mutable. There is no use case for a .sort() method.
|
||||||
|
|
||||||
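The two absences noted above (no hashing, no .sort()) can be observed with a small sketch that assumes Python 3's bytearray as a stand-in for the proposed type. When a sorted copy is wanted, the built-in sorted() works on any iterable of integers.

```python
buf = bytearray(b"cab")

# A mutable object cannot safely serve as a dict key, so hashing fails.
try:
    hash(buf)
    hashable = True
except TypeError:
    hashable = False
print(hashable)                 # False

# No in-place sort method is provided.
print(hasattr(buf, "sort"))     # False

# sorted() returns a list of ints that can be wrapped again.
print(bytes(sorted(buf)))       # b'abc'
```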
The bytes also supports the buffer interface, supporting reading
|
The bytes type also supports the buffer interface, supporting
|
||||||
and writing binary (but not character) data.
|
reading and writing binary (but not character) data.
|
||||||
|
|
||||||
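As a sketch of what buffer-interface support means in practice, the example below reads binary data directly into a preallocated object and writes it back out. It assumes Python 3's bytearray as a stand-in for the proposed type, and io.BytesIO as a substitute for a real binary file.

```python
import io

buf = bytearray(4)                              # four zero bytes
stream = io.BytesIO(b"\x01\x02\x03\x04\x05")

# readinto() fills the existing buffer instead of allocating a new object.
count = stream.readinto(buf)
print(count, list(buf))                         # 4 [1, 2, 3, 4]

# A memoryview exposes the same memory for writing without copying.
view = memoryview(buf)
view[0] = 255
print(list(buf))                                # [255, 2, 3, 4]

# Binary write operations accept the object directly.
out = io.BytesIO()
out.write(buf)
print(out.getvalue() == bytes(buf))             # True
```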
|
|
||||||
Out of scope issues
|
Out of Scope Issues
|
||||||
|
|
||||||
* If we provide a literal syntax for bytes then it should look
|
* Python 3k will have a much different I/O subsystem. Deciding
|
||||||
distinctly different than the syntax for literal strings. Also, a
|
how that I/O subsystem will work and interact with the bytes
|
||||||
new type, even built-in, is much less drastic than a new literal
|
object is out of the scope of this PEP. The expectation however
|
||||||
(which requires lexer and parser support in addition to everything
|
is that binary I/O will read and write bytes, while text I/O
|
||||||
else). Since there appears to be no immediate need for a literal
|
will read strings. Since the bytes type supports the buffer
|
||||||
representation, designing and implementing one is out of the scope
|
interface, the existing binary I/O operations in Python 2.6 will
|
||||||
of this PEP. (Hmm... A b"..." literal accepting only ASCII
|
support bytes objects.
|
||||||
values is likely to be added to 3.0; not clear about 2.6. This
|
|
||||||
needs a PEP.)
|
|
||||||
|
|
||||||
* Python 3k will have a much different I/O subsystem. Deciding how
|
* It has been suggested that a special method named .__bytes__()
|
||||||
that I/O subsystem will work and interact with the bytes object is
|
be added to the language to allow objects to be converted into byte
|
||||||
out of the scope of this PEP.
|
|
||||||
|
|
||||||
* It has been suggested that a special method named __bytes__ be
|
|
||||||
added to the language to allow objects to be converted into byte
|
|
||||||
arrays. This decision is out of scope.
|
arrays. This decision is out of scope.
|
||||||
|
|
||||||
|
|
||||||
Unresolved issues
|
Open Issues
|
||||||
|
|
||||||
* Need to specify the methods more carefully.
|
* The .decode() method is redundant since a bytes object b can
|
||||||
|
also be decoded by calling unicode(b, <encoding>) (in 2.6) or
|
||||||
|
str(b, <encoding>) (in 3.0). Do we need encode/decode methods
|
||||||
|
at all? In a sense the spelling using a constructor is cleaner.
|
||||||
|
|
||||||
|
* Need to specify the methods still more carefully.
|
||||||
|
|
||||||
|
* Pickling and marshalling support need to be specified.
|
||||||
|
|
||||||
* Should all those list methods really be implemented?
|
* Should all those list methods really be implemented?
|
||||||
|
|
||||||
|
* There is growing support for a b"..." literal. Here's a brief
|
||||||
|
spec. Each invocation of b"..." produces a new bytes object
|
||||||
|
(this is unlike "..." but similar to [...] and {...}). Inside
|
||||||
|
the literal, only ASCII characters and non-Unicode backslash
|
||||||
|
escapes are allowed; non-ASCII characters not specified as
|
||||||
|
escapes are rejected by the compiler regardless of the source
|
||||||
|
encoding. The resulting object's value is the same as if
|
||||||
|
bytes(map(ord, "...")) were called.
|
||||||
|
|
||||||
* A case could be made for supporting .ljust(), .rjust(),
|
* A case could be made for supporting .ljust(), .rjust(),
|
||||||
.center() with a mandatory second argument.
|
.center() with a mandatory second argument.
|
||||||
|
|
||||||
* A case could be made for supporting .split() with a mandatory
|
* A case could be made for supporting .split() with a mandatory
|
||||||
argument.
|
argument.
|
||||||
|
|
||||||
* How should pickling and marshalling work?
|
* A case could even be made for supporting .islower(), .isupper(),
|
||||||
|
.isspace(), .isalpha(), .isalnum(), .isdigit() and the
|
||||||
* I probably forgot a few things.
|
corresponding conversions (.lower() etc.), using the ASCII
|
||||||
|
definitions for letters, digits and whitespace. If this is
|
||||||
|
accepted, the cases for .ljust(), .rjust(), .center() and
|
||||||
|
.split() become much stronger, and they should have default
|
||||||
|
arguments as well, using an ASCII space or all ASCII whitespace
|
||||||
|
(for .split()).
|
||||||
|
|
||||||
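Two of the open issues above can be illustrated with Python 3 as it exists today, which is only an approximation of this draft: the .decode() redundancy, and the proposed b"..." literal. Note one difference from the draft spec: in Python 3 a b"..." literal produces an immutable bytes object, while the mutable type is bytearray. The compiles helper is hypothetical.

```python
# Issue: .decode() versus decoding through the str constructor.
encoded = bytearray("caf\u00e9".encode("utf-8"))
print(encoded.decode("utf-8") == str(encoded, "utf-8"))     # True

# Issue: the literal's value equals bytes(map(ord, "...")).
literal = b"abc\x00\xff"
print(literal == bytes(map(ord, "abc\x00\xff")))            # True


def compiles(source):
    """Hypothetical helper: report whether source compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# ASCII characters and backslash escapes are accepted.
print(compiles('b"abc\\xff"'))       # True
# A non-ASCII character that is not written as an escape is rejected.
print(compiles('b"caf\u00e9"'))      # False
```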
|
|
||||||
Questions and answers
|
Frequently Asked Questions
|
||||||
|
|
||||||
Q: Why have the optional encoding argument when the encode method of
|
Q: Why have the optional encoding argument when the encode method of
|
||||||
Unicode objects does the same thing?
|
Unicode objects does the same thing?
|
||||||
|
@ -209,7 +238,7 @@ Questions and answers
|
||||||
in which case you have to use the former.
|
in which case you have to use the former.
|
||||||
|
|
||||||
|
|
||||||
Q: Why does bytes ignore the encoding argument if the initialiser is
|
Q: Why does bytes ignore the encoding argument if the initializer is
|
||||||
a str? (This only applies to 2.6.)
|
a str? (This only applies to 2.6.)
|
||||||
|
|
||||||
A: There is no sane meaning that the encoding can have in that case.
|
A: There is no sane meaning that the encoding can have in that case.
|
||||||
|
|