PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>, Guido van Rossum <guido@google.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.6, 3.0
Post-History:


Abstract

This PEP outlines the introduction of a raw bytes sequence type.
Adding the bytes type is one step in the transition to Unicode-based
str objects, which will be introduced in Python 3.0.

The PEP describes how the bytes type should work in Python 2.6, as
well as how it should work in Python 3.0. (Occasionally there are
differences because in Python 2.6, we have two string types, str
and unicode, while in Python 3.0 we will only have one string
type, whose name will be str but whose semantics will be like the
2.6 unicode type.)


Motivation

Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This
overloading of purpose leads to confusion and bugs. In future
versions of Python, string objects will be used for holding
character data. The bytes object will fulfil the role of a byte
container. Eventually the unicode built-in will be renamed to str
and the str object will be removed.


Specification

A bytes object stores a mutable sequence of integers that are in
the range 0 to 255. Unlike string objects, indexing a bytes
object returns an integer. Assigning or comparing an object that
is not an integer to an element causes a TypeError exception.
Assigning a value outside the range 0 to 255 to an element causes
a ValueError exception. The .__len__() method of bytes returns
the number of integers stored in the sequence (i.e. the number of
bytes).

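As an illustration only, the element rules above can be tried out
with Python 3's built-in bytearray. It is used here as a stand-in,
which is an assumption: bytearray is not the type specified by this
PEP, but it is likewise a mutable sequence of integers in the range
0 to 255. The helper try_assign is hypothetical.

```python
# Stand-in for the proposed type: using bytearray is an assumption,
# not the object this PEP specifies.
b = bytearray([10, 20, 30])

first = b[0]          # indexing yields an int, not a 1-character string
b[1] = 255            # in-range assignment succeeds
length = len(b)       # number of stored integers


def try_assign(seq, value):
    """Return the name of the exception raised by seq[0] = value, or None."""
    try:
        seq[0] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


out_of_range = try_assign(b, 256)   # value outside 0..255
wrong_type = try_assign(b, "a")     # value that is not an integer
```

The sketch does not cover comparison against non-integers, where
the stand-in's behaviour may differ from the rule stated above.
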
The constructor of the bytes object has the following signature:

    bytes([initializer[, encoding]])

If no arguments are provided then a bytes object containing zero
elements is created and returned. The initializer argument can be
a string (in 2.6, either str or unicode), an iterable of integers,
or a single integer (which produces that many zero-valued
elements). The pseudo-code for the constructor (optimized for
clear semantics, not for speed) is:

    def bytes(initializer=0, encoding=None):
        if isinstance(initializer, int): # In 2.6, (int, long)
            initializer = [0]*initializer
        elif isinstance(initializer, basestring):
            if isinstance(initializer, unicode): # In 3.0, always
                if encoding is None:
                    # In 3.0, raise TypeError("explicit encoding required")
                    encoding = sys.getdefaultencoding()
                initializer = initializer.encode(encoding)
            initializer = [ord(c) for c in initializer]
        else:
            if encoding is not None:
                raise TypeError("no encoding allowed for this initializer")
            tmp = []
            for c in initializer:
                if not isinstance(c, int):
                    raise TypeError("initializer must be iterable of ints")
                if not 0 <= c < 256:
                    raise ValueError("initializer element out of range")
                tmp.append(c)
            initializer = tmp
        new = <new bytes object of length len(initializer)>
        for i, c in enumerate(initializer):
            new[i] = c
        return new

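The pseudo-code above cannot be run as written, so the following is
a minimal runnable sketch of its 3.0 branch only. The function name
make_bytes is hypothetical, and bytearray is used as a stand-in for
the "new bytes object" placeholder; both are assumptions made for
illustration.

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical sketch of the 3.0 constructor semantics."""
    if isinstance(initializer, int):
        # A single integer asks for that many zero-valued elements.
        values = [0] * initializer
    elif isinstance(initializer, str):
        # In 3.0 every string is Unicode, so an encoding is mandatory.
        if encoding is None:
            raise TypeError("explicit encoding required")
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # bytearray stands in for the new bytes object (an assumption).
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new


empty = make_bytes()
zeros = make_bytes(3)
encoded = make_bytes("abc", "ascii")
from_ints = make_bytes([1, 2, 255])
```
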
The .__repr__() method returns a string that can be evaluated to
generate a new bytes object containing the same sequence of
integers. The sequence is represented by a list of ints using
hexadecimal notation. For example:

    >>> repr(bytes([10, 20, 30]))
    'bytes([0x0a, 0x14, 0x1e])'

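The representation described above could be produced along the
following lines. This is a sketch only: the helper pep358_repr is
hypothetical, and Python 3's own bytes repr uses a different
format.

```python
def pep358_repr(data):
    """Hypothetical helper: format data as a list of two-digit hex ints."""
    return "bytes([" + ", ".join("0x%02x" % c for c in data) + "])"


text = pep358_repr(bytes([10, 20, 30]))

# Evaluating the string rebuilds an equal object, which is the
# property the specification asks of .__repr__().
round_trip = eval(text)
```
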
The object has a .decode() method equivalent to the .decode()
method of the str object. The object has a classmethod .fromhex()
that takes a string of characters from the set [0-9a-fA-F ] and
returns a bytes object (similar to binascii.unhexlify). For
example:

    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])

The object has a .hex() method that does the reverse conversion
(similar to binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

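Python 3's built-in bytes type provides methods with these names,
so the round trip can be checked directly, alongside the binascii
functions the text compares them to. Treat this as an illustration
of the intended behaviour rather than of the type specified here.

```python
import binascii

spaced = bytes.fromhex("5c 53 50 ff")   # spaces between bytes are skipped
packed = bytes.fromhex("5c5350ff")
hex_text = packed.hex()                 # reverse conversion

# binascii offers the same conversions as module-level functions.
# Note that hexlify returns bytes rather than str in Python 3.
via_unhexlify = binascii.unhexlify("5c5350ff")
via_hexlify = binascii.hexlify(packed).decode("ascii")
```
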
The bytes object has some methods similar to list methods, and
others similar to str methods. Here is a complete list of
methods, with their approximate signatures:

    .__add__(bytes) -> bytes
    .__contains__(int | bytes) -> bool
    .__delitem__(int | slice) -> None
    .__delslice__(int, int) -> None
    .__eq__(bytes) -> bool
    .__ge__(bytes) -> bool
    .__getitem__(int | slice) -> int | bytes
    .__getslice__(int, int) -> bytes
    .__gt__(bytes) -> bool
    .__iadd__(bytes) -> bytes
    .__imul__(int) -> bytes
    .__iter__() -> iterator
    .__le__(bytes) -> bool
    .__len__() -> int
    .__lt__(bytes) -> bool
    .__mul__(int) -> bytes
    .__ne__(bytes) -> bool
    .__reduce__(...) -> ...
    .__reduce_ex__(...) -> ...
    .__repr__() -> str
    .__reversed__() -> bytes
    .__rmul__(int) -> bytes
    .__setitem__(int | slice, int | iterable[int]) -> None
    .__setslice__(int, int, iterable[int]) -> None
    .append(int) -> None
    .count(int) -> int
    .decode(str) -> str | unicode # in 3.0, only str
    .endswith(bytes) -> bool
    .extend(iterable[int]) -> None
    .find(bytes) -> int
    .index(bytes | int) -> int
    .insert(int, int) -> None
    .join(iterable[bytes]) -> bytes
    .partition(bytes) -> (bytes, bytes, bytes)
    .pop([int]) -> int
    .remove(int) -> None
    .replace(bytes, bytes) -> bytes
    .reverse() -> None
    .rfind(bytes) -> int
    .rindex(bytes | int) -> int
    .rpartition(bytes) -> (bytes, bytes, bytes)
    .rsplit(bytes) -> list[bytes]
    .split(bytes) -> list[bytes]
    .startswith(bytes) -> bool
    .translate(bytes, [bytes]) -> bytes

Note the conspicuous absence of .isupper(), .upper(), and friends.
(But see "Open Issues" below.) There is no .__hash__() because
the object is mutable. There is no use case for a .sort() method.

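To give a feel for the mix of list-like and str-like methods, and
for the consequence of having no .__hash__(), here is a short
sketch. It again assumes Python 3's bytearray as a stand-in; its
method set is not identical to the list above.

```python
b = bytearray(b"abc")

# List-like methods mutate the object in place.
b.append(100)             # add one integer
b.extend([101, 102])      # add several integers
last = b.pop()            # remove and return the last integer
b.reverse()

# str-like methods return new objects.
head, sep, tail = bytearray(b"key=value").partition(b"=")
position = bytearray(b"hello").find(b"ll")

# A mutable object is unhashable, so it cannot serve as a dict key.
try:
    hash(b)
    hashable = True
except TypeError:
    hashable = False

has_sort = hasattr(b, "sort")
```
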
The bytes type also supports the buffer interface, supporting
reading and writing binary (but not character) data.

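What buffer support buys in practice can be sketched as follows:
an I/O call fills an existing object in place, and a view reads
and writes the same memory without copying. The example assumes
Python 3's bytearray, io.BytesIO and memoryview, none of which are
specified by this PEP.

```python
import io

buf = bytearray(4)                     # four zero-valued elements

# Writing into the buffer: a binary stream fills it in place.
count = io.BytesIO(b"spam and eggs").readinto(buf)

# Reading from the buffer: a memoryview exposes the same memory.
view = memoryview(buf)
first_two = bytes(view[:2])

# Writing through the view changes the underlying object.
view[0] = ord("S")
view.release()
```
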

Out of Scope Issues

* Python 3k will have a much different I/O subsystem. Deciding
  how that I/O subsystem will work and interact with the bytes
  object is out of the scope of this PEP. The expectation, however,
  is that binary I/O will read and write bytes, while text I/O
  will read strings. Since the bytes type supports the buffer
  interface, the existing binary I/O operations in Python 2.6 will
  support bytes objects.

* It has been suggested that a special method named .__bytes__()
  be added to the language to allow objects to be converted into
  byte arrays. This decision is out of scope.

* A bytes literal of the form b"..." is also proposed. This is
  the subject of PEP 3112.

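The expectation stated in the first item, that binary I/O deals in
bytes while text I/O deals in strings, can be pictured with the
sketch below. It uses Python 3's io module purely as an
illustration; the design of the I/O subsystem remains outside this
PEP.

```python
import io

# Six bytes: "h", a two-byte UTF-8 sequence for e-acute, "llo".
raw = io.BytesIO("h\u00e9llo".encode("utf-8"))

binary_data = raw.read()               # binary I/O yields bytes

raw.seek(0)
text_layer = io.TextIOWrapper(raw, encoding="utf-8")
text_data = text_layer.read()          # text I/O yields a string
```
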

Open Issues

* The .decode() method is redundant since a bytes object b can
  also be decoded by calling unicode(b, <encoding>) (in 2.6) or
  str(b, <encoding>) (in 3.0). Do we need encode/decode methods
  at all? In a sense the spelling using a constructor is cleaner.

* The methods still need to be specified more carefully.

* Pickling and marshalling support need to be specified.

* Should all those list methods really be implemented?

* Now that a b"..." literal exists, shouldn't repr() return one?

* A case could be made for supporting .ljust(), .rjust(),
  .center() with a mandatory second argument.

* A case could be made for supporting .split() with a mandatory
  argument.

* A case could even be made for supporting .islower(), .isupper(),
  .isspace(), .isalpha(), .isalnum(), .isdigit() and the
  corresponding conversions (.lower() etc.), using the ASCII
  definitions for letters, digits and whitespace. If this is
  accepted, the cases for .ljust(), .rjust(), .center() and
  .split() become much stronger, and they should have default
  arguments as well, using an ASCII space or all ASCII whitespace
  (for .split()).

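For readers weighing the issues above, the sketch below shows the
two decoding spellings side by side, followed by what padding and
splitting look like when the fill byte and separator are passed
explicitly. It relies on Python 3's bytes type as a stand-in and
does not settle any of the open questions.

```python
data = bytes([104, 105])

# Two spellings of the same conversion.
via_method = data.decode("ascii")       # ask the object to decode itself
via_constructor = str(data, "ascii")    # ask str to build from the object

# Padding and splitting with the second argument given explicitly.
padded = b"ab".ljust(4, b".")
fields = b"a,b,c".split(b",")
```
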

Frequently Asked Questions

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code. The
   construct bytes(s.encode(...)) is expensive because it has to
   copy the byte sequence multiple times. Also, Python generally
   provides two ways of converting an object of type A into an
   object of type B: ask an A instance to convert itself to a B, or
   ask the type B to create a new instance from an A. Depending on
   what A and B are, both APIs make sense; sometimes reasons of
   decoupling require that A can't know about B, in which case you
   have to use the latter approach; sometimes B can't know about A,
   in which case you have to use the former.

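The two conversion styles named in the answer look like this in
Python 3, where str plays the role of A and bytes the role of B.
This is only an illustration of the API shape; it says nothing
about the copying cost discussed for 2.6.

```python
text = "caf\u00e9"

asked_source = text.encode("utf-8")     # ask the A instance for a B
asked_target = bytes(text, "utf-8")     # ask type B to build from an A
```
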

Q: Why does bytes ignore the encoding argument if the initializer is
   a str? (This only applies to 2.6.)

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain. We need to assume that
   the programmer has provided a str object that already uses the
   desired encoding. If you need something other than a pure copy of
   the bytes then you need to first decode the string. For example:

       bytes(s.decode(encoding1), encoding2)

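The decode-then-encode idiom above can be exercised in Python 3,
where an immutable bytes object stands in for the 2.6 str (an
assumption made for this sketch). The sample data and the choice
of Latin-1 and UTF-8 are illustrative only.

```python
raw_latin1 = b"caf\xe9"                 # text already encoded as Latin-1

# Re-encoding requires going through character data first.
as_text = raw_latin1.decode("latin-1")
raw_utf8 = bytes(as_text, "utf-8")

# Without an encoding, the constructor makes a pure copy of the bytes.
plain_copy = bytes(raw_latin1)
```
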

Q: Why not have the encoding argument default to Latin-1 (or some
   other encoding that covers the entire byte range) rather than
   ASCII?

A: The system default encoding for Python is ASCII. It seems least
   confusing to use that default. Also, in Py3k, using Latin-1 as
   the default might not be what users expect. For example, they
   might prefer a Unicode encoding. Any default will not always
   work as expected. At least ASCII will complain loudly if you try
   to encode non-ASCII data.

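The difference between a loud failure and a silent guess can be
seen in this short sketch, written against Python 3's bytes
constructor with the encoding passed explicitly. The sample string
is an assumption chosen to contain one non-ASCII character.

```python
text = "caf\u00e9"

# ASCII refuses data it cannot represent.
try:
    bytes(text, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc.reason

# Latin-1 and UTF-8 both succeed, but produce different bytes, so
# either one as a silent default would surprise some users.
latin1_bytes = bytes(text, "latin-1")
utf8_bytes = bytes(text, "utf-8")
```
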

Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: