Another update, clarifying (I hope) the method signatures and mentioning
other stuff that came up over dinner.
parent
3ed8bc79de
commit
9ca7df06ee
243
pep-0358.txt
@@ -13,9 +13,16 @@ Post-History:
|
|||
|
||||
Abstract
|
||||
|
||||
This PEP outlines the introduction of a raw bytes sequence object.
|
||||
Adding the bytes object is one step in the transition to Unicode
|
||||
based str objects.
|
||||
This PEP outlines the introduction of a raw bytes sequence type.
|
||||
Adding the bytes type is one step in the transition to Unicode
|
||||
based str objects which will be introduced in Python 3.0.
|
||||
|
||||
The PEP describes how the bytes type should work in Python 2.6, as
|
||||
well as how it should work in Python 3.0. (Occasionally there are
|
||||
differences because in Python 2.6, we have two string types, str
|
||||
and unicode, while in Python 3.0 we will only have one string
|
||||
type, whose name will be str but whose semantics will be like the
|
||||
2.6 unicode type.)
|
||||
|
||||
|
||||
Motivation
|
||||
|
@@ -33,39 +40,48 @@ Specification
|
|||
|
||||
A bytes object stores a mutable sequence of integers that are in
|
||||
the range 0 to 255. Unlike string objects, indexing a bytes
|
||||
object returns an integer. Assigning an element using a object
|
||||
that is not an integer causes a TypeError exception. Assigning an
|
||||
element to a value outside the range 0 to 255 causes a ValueError
|
||||
exception. The .__len__() method of bytes returns the number of
|
||||
integers stored in the sequence (i.e. the number of bytes).
|
||||
object returns an integer. Assigning or comparing an object that
|
||||
is not an integer to an element causes a TypeError exception.
|
||||
Assigning an element to a value outside the range 0 to 255 causes
|
||||
a ValueError exception. The .__len__() method of bytes returns
|
||||
the number of integers stored in the sequence (i.e. the number of
|
||||
bytes).
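The element rules above can be tried out in present-day Python, under the assumption that the built-in bytearray is a fair stand-in for the mutable type this PEP calls bytes. The helper try_assign is hypothetical and exists only for this sketch. The comparison rule is left out because bytearray does not enforce it.

```python
# Assumption: bytearray stands in for the mutable "bytes" type of this PEP.
data = bytearray([10, 20, 30])

first = data[0]     # indexing yields an int, not a length-1 sequence
length = len(data)  # number of integers stored, i.e. the number of bytes


def try_assign(value):
    """Hypothetical helper: assign to data[0] and report what happened."""
    try:
        data[0] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


non_int_result = try_assign("a")      # not an integer
out_of_range_result = try_assign(256) # integer outside 0..255
in_range_result = try_assign(255)     # largest legal value
```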
|
||||
|
||||
The constructor of the bytes object has the following signature:
|
||||
|
||||
bytes([initialiser[, [encoding]])
|
||||
bytes([initializer[, encoding]])
|
||||
|
||||
If no arguments are provided then an object containing zero elements
|
||||
is created and returned. The initialiser argument can be a string,
|
||||
a sequence of integers, or a single integer. The pseudo-code for the
|
||||
constructor is:
|
||||
If no arguments are provided then a bytes object containing zero
|
||||
elements is created and returned. The initializer argument can be
|
||||
a string (in 2.6, either str or unicode), an iterable of integers,
|
||||
or a single integer. The pseudo-code for the constructor
|
||||
(optimized for clear semantics, not for speed) is:
|
||||
|
||||
def bytes(initialiser=[], encoding=None):
|
||||
if isinstance(initialiser, int): # In 2.6, (int, long)
|
||||
initialiser = [0]*initialiser
|
||||
elif isinstance(initialiser, basestring):
|
||||
if isinstance(initialiser, unicode): # In 3.0, always
|
||||
def bytes(initializer=0, encoding=None):
|
||||
if isinstance(initializer, int): # In 2.6, (int, long)
|
||||
initializer = [0]*initializer
|
||||
elif isinstance(initializer, basestring):
|
||||
if isinstance(initializer, unicode): # In 3.0, always
|
||||
if encoding is None:
|
||||
# In 3.0, raise TypeError("explicit encoding required")
|
||||
encoding = sys.getdefaultencoding()
|
||||
initialiser = initialiser.encode(encoding)
|
||||
initialiser = [ord(c) for c in initialiser]
|
||||
initializer = initializer.encode(encoding)
|
||||
initializer = [ord(c) for c in initializer]
|
||||
else:
|
||||
if encoding is not None:
|
||||
raise TypeError("explicit encoding invalid for non-string "
|
||||
"initialiser")
|
||||
# Create bytes object and fill with integers from initialiser
|
||||
# while ensuring each integer is in range(256); initialiser
|
||||
# can be any iterable
|
||||
return bytes object
|
||||
raise TypeError("no encoding allowed for this initializer")
|
||||
tmp = []
|
||||
for c in initializer:
|
||||
if not isinstance(c, int):
|
||||
raise TypeError("initializer must be iterable of ints")
|
||||
if not 0 <= c < 256:
|
||||
raise ValueError("initializer element out of range")
|
||||
tmp.append(c)
|
||||
initializer = tmp
|
||||
new = <new bytes object of length len(initializer)>
|
||||
for i, c in enumerate(initializer):
|
||||
new[i] = c
|
||||
return new
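As a rough illustration, the new pseudo-code can be turned into a runnable function. This is only a sketch of the 3.0 branch (one string type, explicit encoding required). The name make_bytes is hypothetical, and bytearray is assumed as the stand-in for the mutable bytes object.

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical sketch of the constructor pseudo-code (3.0 semantics)."""
    if isinstance(initializer, int):
        # A single integer asks for that many zero-valued elements.
        values = [0] * initializer
    elif isinstance(initializer, str):
        if encoding is None:
            raise TypeError("explicit encoding required")
        # Encoding a str yields ints directly in Python 3, so no ord() step.
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # Assumption: bytearray plays the role of the new bytes object.
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new
```

For instance, make_bytes(3) gives three zero bytes, make_bytes("hi", "ascii") encodes the text first, and make_bytes([10, 20, 300]) raises ValueError.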
|
||||
|
||||
The .__repr__() method returns a string that can be evaluated to
|
||||
generate a new bytes object containing the same sequence of
|
||||
|
@@ -76,13 +92,10 @@ Specification
|
|||
'bytes([0x0a, 0x14, 0x1e])'
|
||||
|
||||
The object has a .decode() method equivalent to the .decode()
|
||||
method of the str object. (This is redundant since it can also be
|
||||
decoded by calling unicode(b, <encoding>) (in 2.6) or str(b,
|
||||
<encoding>) (in 3.0); do we need encode/decode methods? In a
|
||||
sense the spelling using a constructor is cleaner.) The object
|
||||
has a classmethod .fromhex() that takes a string of characters
|
||||
from the set [0-9a-zA-Z ] and returns a bytes object (similar to
|
||||
binascii.unhexlify). For example:
|
||||
method of the str object. The object has a classmethod .fromhex()
|
||||
that takes a string of characters from the set [0-9a-zA-Z ] and
|
||||
returns a bytes object (similar to binascii.unhexlify). For
|
||||
example:
|
||||
|
||||
>>> bytes.fromhex('5c5350ff')
|
||||
bytes([92, 83, 80, 255])
|
||||
|
@@ -96,102 +109,118 @@ Specification
|
|||
'5c5350ff'
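The hex round trip shown in these examples can be reproduced today, assuming bytearray as the stand-in for the mutable bytes type. The .hex() method used here is the one present-day Python provides, which may not match the spelling this PEP settles on.

```python
# Assumption: bytearray stands in for the PEP's mutable bytes type.
raw = bytearray.fromhex("5c5350ff")
values = list(raw)      # the integers stored in the object
round_trip = raw.hex()  # back to a hex string

# Spaces between byte pairs are accepted, as the character set above suggests.
spaced = bytearray.fromhex("5c 53 50 ff")
```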
|
||||
|
||||
The bytes object has some methods similar to list methods, and
|
||||
others similar to str methods:
|
||||
others similar to str methods. Here is a complete list of
|
||||
methods, with their approximate signatures:
|
||||
|
||||
__add__
|
||||
__contains__ (with int arg, like list; with bytes arg, like str)
|
||||
__delitem__
|
||||
__delslice__
|
||||
__eq__
|
||||
__ge__
|
||||
__getitem__
|
||||
__getslice__
|
||||
__gt__
|
||||
__iadd__
|
||||
__imul__
|
||||
__iter__
|
||||
__le__
|
||||
__len__
|
||||
__lt__
|
||||
__mul__
|
||||
__ne__
|
||||
__reduce__
|
||||
__reduce_ex__
|
||||
__repr__
|
||||
__reversed__
|
||||
__rmul__
|
||||
__setitem__
|
||||
__setslice__
|
||||
append
|
||||
count
|
||||
decode
|
||||
endswith
|
||||
extend
|
||||
find
|
||||
index
|
||||
insert
|
||||
join
|
||||
partition
|
||||
pop
|
||||
remove
|
||||
replace
|
||||
rindex
|
||||
rpartition
|
||||
split
|
||||
startswith
|
||||
reverse
|
||||
rfind
|
||||
rindex
|
||||
rsplit
|
||||
translate
|
||||
.__add__(bytes) -> bytes
|
||||
.__contains__(int | bytes) -> bool
|
||||
.__delitem__(int | slice) -> None
|
||||
.__delslice__(int, int) -> None
|
||||
.__eq__(bytes) -> bool
|
||||
.__ge__(bytes) -> bool
|
||||
.__getitem__(int | slice) -> int | bytes
|
||||
.__getslice__(int, int) -> bytes
|
||||
.__gt__(bytes) -> bool
|
||||
.__iadd__(bytes) -> bytes
|
||||
.__imul__(int) -> bytes
|
||||
.__iter__() -> iterator
|
||||
.__le__(bytes) -> bool
|
||||
.__len__() -> int
|
||||
.__lt__(bytes) -> bool
|
||||
.__mul__(int) -> bytes
|
||||
.__ne__(bytes) -> bool
|
||||
.__reduce__(...) -> ...
|
||||
.__reduce_ex__(...) -> ...
|
||||
.__repr__() -> str
|
||||
.__reversed__() -> bytes
|
||||
.__rmul__(int) -> bytes
|
||||
.__setitem__(int | slice, int | iterable[int]) -> None
|
||||
.__setslice__(int, int, iterable[int]) -> None
|
||||
.append(int) -> None
|
||||
.count(int) -> int
|
||||
.decode(str) -> str | unicode # in 3.0, only str
|
||||
.endswith(bytes) -> bool
|
||||
.extend(iterable[int]) -> None
|
||||
.find(bytes) -> int
|
||||
.index(bytes | int) -> int
|
||||
.insert(int, int) -> None
|
||||
.join(iterable[bytes]) -> bytes
|
||||
.partition(bytes) -> (bytes, bytes, bytes)
|
||||
.pop([int]) -> int
|
||||
.remove(int) -> None
|
||||
.replace(bytes, bytes) -> bytes
|
||||
.rindex(bytes | int) -> int
|
||||
.rpartition(bytes) -> (bytes, bytes, bytes)
|
||||
.split(bytes) -> list[bytes]
|
||||
.startswith(bytes) -> bool
|
||||
.reverse() -> None
|
||||
.rfind(bytes) -> int
|
||||
.rindex(bytes | int) -> int
|
||||
.rsplit(bytes) -> list[bytes]
|
||||
.translate(bytes, [bytes]) -> bytes
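A few of these signatures can be exercised side by side to show the mix of list-like and str-like behaviour. This sketch assumes present-day bytearray as the stand-in; its results come back as bytearray where the list above says bytes.

```python
# Assumption: bytearray stands in for the PEP's mutable bytes type.
buf = bytearray(b"key=value")

# List-like methods take and return integers.
buf.append(0x21)     # '!'
buf.extend([0x3F])   # '?'
popped = buf.pop()   # removes and returns the last integer

# __contains__ accepts either an int (like list) or a bytes value (like str).
has_int = 0x3D in buf    # '='
has_sub = b"val" in buf

# Str-like methods take and return byte sequences.
head, sep, tail = buf.partition(b"=")
pos = buf.find(b"value")
joined = bytearray(b"-").join([b"a", b"b"])
```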
|
||||
|
||||
Note the conspicuous absence of .isupper(), .upper(), and friends.
|
||||
There is no __hash__ because the object is mutable. There is no
|
||||
usecase for a .sort() method.
|
||||
(But see "Open Issues" below.) There is no .__hash__() because
|
||||
the object is mutable. There is no use case for a .sort() method.
|
||||
|
||||
The bytes also supports the buffer interface, supporting reading
|
||||
and writing binary (but not character) data.
|
||||
The bytes type also supports the buffer interface, supporting
|
||||
reading and writing binary (but not character) data.
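What buffer-interface support buys can be seen in a short sketch, assuming bytearray as the stand-in and io.BytesIO as an example binary stream: data is read into, and written from, the object without an intermediate string.

```python
import io

# Assumption: bytearray stands in for the PEP's mutable bytes type.
buf = bytearray(5)

# Reading: the stream fills the existing buffer in place.
stream = io.BytesIO(b"hello world")
n = stream.readinto(buf)

# A memoryview exposes the same memory for element-wise writes.
view = memoryview(buf)
view[0] = ord("J")
view.release()

# Writing: the buffer is passed directly to a binary write call.
out = io.BytesIO()
out.write(buf)
```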
|
||||
|
||||
|
||||
Out of scope issues
|
||||
Out of Scope Issues
|
||||
|
||||
* If we provide a literal syntax for bytes then it should look
|
||||
distinctly different than the syntax for literal strings. Also, a
|
||||
new type, even built-in, is much less drastic than a new literal
|
||||
(which requires lexer and parser support in addition to everything
|
||||
else). Since there appears to be no immediate need for a literal
|
||||
representation, designing and implementing one is out of the scope
|
||||
of this PEP. (Hmm... A b"..." literal accepting only ASCII
|
||||
values is likely to be added to 3.0; not clear about 2.6. This
|
||||
needs a PEP.)
|
||||
* Python 3k will have a much different I/O subsystem. Deciding
|
||||
how that I/O subsystem will work and interact with the bytes
|
||||
object is out of the scope of this PEP. The expectation however
|
||||
is that binary I/O will read and write bytes, while text I/O
|
||||
will read strings. Since the bytes type supports the buffer
|
||||
interface, the existing binary I/O operations in Python 2.6 will
|
||||
support bytes objects.
|
||||
|
||||
* Python 3k will have a much different I/O subsystem. Deciding how
|
||||
that I/O subsystem will work and interact with the bytes object is
|
||||
out of the scope of this PEP.
|
||||
|
||||
* It has been suggested that a special method named __bytes__ be
|
||||
added to language to allow objects to be converted into byte
|
||||
* It has been suggested that a special method named .__bytes__()
|
||||
be added to the language to allow objects to be converted into byte
|
||||
arrays. This decision is out of scope.
|
||||
|
||||
|
||||
Unresolved issues
|
||||
Open Issues
|
||||
|
||||
* Need to specify the methods more carefully.
|
||||
* The .decode() method is redundant since a bytes object b can
|
||||
also be decoded by calling unicode(b, <encoding>) (in 2.6) or
|
||||
str(b, <encoding>) (in 3.0). Do we need encode/decode methods
|
||||
at all? In a sense the spelling using a constructor is cleaner.
|
||||
|
||||
* Need to specify the methods still more carefully.
|
||||
|
||||
* Pickling and marshalling support need to be specified.
|
||||
|
||||
* Should all those list methods really be implemented?
|
||||
|
||||
* There is growing support for a b"..." literal. Here's a brief
|
||||
spec. Each invocation of b"..." produces a new bytes object
|
||||
(this is unlike "..." but similar to [...] and {...}). Inside
|
||||
the literal, only ASCII characters and non-Unicode backslash
|
||||
escapes are allowed; non-ASCII characters not specified as
|
||||
escapes are rejected by the compiler regardless of the source
|
||||
encoding. The resulting object's value is the same as if
|
||||
bytes(map(ord, "...")) were called.
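The value rule for the proposed literal can be checked in present-day Python, where b"..." exists but produces an immutable object, so the point about a new object per invocation is not demonstrated here. The helper compiles is hypothetical and only reports whether a source string is accepted by the compiler.

```python
# The literal's value matches bytes(map(ord, "...")) for ASCII text and
# non-Unicode backslash escapes.
literal = b"abc\x00\xff"
from_ord = bytes(map(ord, "abc\x00\xff"))
same = literal == from_ord


def compiles(source):
    """Hypothetical helper: True if the expression compiles."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


ascii_ok = compiles('b"abc"')
# A non-ASCII character that is not written as an escape is rejected.
non_ascii_ok = compiles('b"caf\u00e9"')
```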
|
||||
|
||||
* A case could be made for supporting .ljust(), .rjust(),
|
||||
.center() with a mandatory second argument.
|
||||
|
||||
* A case could be made for supporting .split() with a mandatory
|
||||
argument.
|
||||
|
||||
* How should pickling and marshalling work?
|
||||
|
||||
* I probably forgot a few things.
|
||||
* A case could even be made for supporting .islower(), .isupper(),
|
||||
.isspace(), .isalpha(), .isalnum(), .isdigit() and the
|
||||
corresponding conversions (.lower() etc.), using the ASCII
|
||||
definitions for letters, digits and whitespace. If this is
|
||||
accepted, the cases for .ljust(), .rjust(), .center() and
|
||||
.split() become much stronger, and they should have default
|
||||
arguments as well, using an ASCII space or all ASCII whitespace
|
||||
(for .split()).
|
||||
|
||||
|
||||
Questions and answers
|
||||
Frequently Asked Questions
|
||||
|
||||
Q: Why have the optional encoding argument when the encode method of
|
||||
Unicode objects does the same thing.
|
||||
|
@@ -209,7 +238,7 @@ Questions and answers
|
|||
in which case you have to use the former.
|
||||
|
||||
|
||||
Q: Why does bytes ignore the encoding argument if the initialiser is
|
||||
Q: Why does bytes ignore the encoding argument if the initializer is
|
||||
a str? (This only applies to 2.6.)
|
||||
|
||||
A: There is no sane meaning that the encoding can have in that case.
|
||||
|