PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>, Guido van Rossum <guido@google.com>
Status: Accepted
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.6, 3.0
Post-History:


Abstract

This PEP outlines the introduction of a raw bytes sequence type.
Adding the bytes type is one step in the transition to Unicode-based
str objects, which will be introduced in Python 3.0.

The PEP describes how the bytes type should work in Python 2.6, as
well as how it should work in Python 3.0.  (Occasionally there are
differences because in Python 2.6, we have two string types, str
and unicode, while in Python 3.0 we will only have one string
type, whose name will be str but whose semantics will be like the
2.6 unicode type.)


Motivation

Python's current string objects are overloaded.  They serve to hold
both sequences of characters and sequences of bytes.  This
overloading of purpose leads to confusion and bugs.  In future
versions of Python, string objects will be used for holding
character data.  The bytes object will fulfil the role of a byte
container.  Eventually the unicode built-in will be renamed to str
and the str object will be removed.


Specification

A bytes object stores a mutable sequence of integers that are in
the range 0 to 255.  Unlike string objects, indexing a bytes
object returns an integer.  Assigning or comparing an object that
is not an integer to an element causes a TypeError exception.
Assigning an element to a value outside the range 0 to 255 causes
a ValueError exception.  The .__len__() method of bytes returns
the number of integers stored in the sequence (i.e. the number of
bytes).
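The element rules above can be tried out in released Python 3, where
the mutable byte sequence is spelled bytearray rather than bytes.  A
minimal sketch, assuming bytearray is an acceptable stand-in for the
mutable bytes object described here:

```python
# Assumption: bytearray (the mutable byte type in released Python 3)
# stands in for the mutable bytes object this PEP describes.
buf = bytearray([72, 105])

first = buf[0]          # indexing yields an int, not a length-1 string
buf[0] = 104            # in-place assignment of a value in range 0..255

try:
    buf[0] = 256        # outside the range 0 to 255
except ValueError as exc:
    range_error = exc

try:
    buf[0] = "h"        # not an integer
except TypeError as exc:
    type_error = exc

length = len(buf)       # number of stored integers, i.e. bytes
```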

The constructor of the bytes object has the following signature:

    bytes([initializer[, encoding]])

If no arguments are provided then a bytes object containing zero
elements is created and returned.  The initializer argument can be
a string (in 2.6, either str or unicode), an iterable of integers,
or a single integer.  The pseudo-code for the constructor
(optimized for clear semantics, not for speed) is:

    def bytes(initializer=0, encoding=None):
        if isinstance(initializer, int): # In 2.6, (int, long)
            initializer = [0]*initializer
        elif isinstance(initializer, basestring):
            if isinstance(initializer, unicode): # In 3.0, always
                if encoding is None:
                    # In 3.0, raise TypeError("explicit encoding required")
                    encoding = sys.getdefaultencoding()
                initializer = initializer.encode(encoding)
            initializer = [ord(c) for c in initializer]
        else:
            if encoding is not None:
                raise TypeError("no encoding allowed for this initializer")
            tmp = []
            for c in initializer:
                if not isinstance(c, int):
                    raise TypeError("initializer must be iterable of ints")
                if not 0 <= c < 256:
                    raise ValueError("initializer element out of range")
                tmp.append(c)
            initializer = tmp
        new = <new bytes object of length len(initializer)>
        for i, c in enumerate(initializer):
            new[i] = c
        return new
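The pseudo-code can be turned into something runnable under a few
assumptions.  The sketch below follows the 3.0 branch only (a str
initializer always needs an explicit encoding), uses bytearray from
released Python 3 as the "new bytes object", and wraps everything in a
hypothetical helper named make_bytes that is not part of this PEP:

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical helper mirroring the pseudo-code with 3.0 semantics."""
    if isinstance(initializer, int):
        # A single integer requests that many zero bytes.
        values = [0] * initializer
    elif isinstance(initializer, str):
        if encoding is None:
            raise TypeError("explicit encoding required")
        # Encoding yields bytes; list() turns them into integers.
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # Assumption: bytearray plays the role of the new mutable object.
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new
```

For instance, make_bytes(3) gives three zero bytes, while
make_bytes("abc") raises TypeError because no encoding was given.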

The .__repr__() method returns a string containing a bytes literal,
which can be evaluated to generate a new bytes object:

    >>> bytes([10, 20, 30])
    b'\n\x14\x1e'
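The round trip through the repr can be checked directly.  This sketch
uses the immutable bytes type of released Python 3, whose repr is a
bare bytes literal, and ast.literal_eval as a safe substitute for
eval(); both choices are assumptions made for illustration:

```python
import ast

original = bytes([10, 20, 30])
text = repr(original)               # the bytes literal as a string
rebuilt = ast.literal_eval(text)    # evaluate the literal back to bytes
```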

The object has a .decode() method equivalent to the .decode()
method of the str object.  The object has a classmethod .fromhex()
that takes a string of characters from the set [0-9a-fA-F ] and
returns a bytes object (similar to binascii.unhexlify).  For
example:

    >>> bytes.fromhex('5c5350ff')
    b'\\SP\xff'
    >>> bytes.fromhex('5c 53 50 ff')
    b'\\SP\xff'

The object has a .hex() method that does the reverse conversion
(similar to binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'
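The relationship to binascii can be seen side by side.  A short
sketch, assuming released Python 3, where bytes.fromhex() and .hex()
exist and binascii.hexlify() returns bytes rather than a string:

```python
import binascii

raw = bytes([92, 83, 80, 255])

# Hex text to bytes: spaces between byte pairs are ignored by fromhex().
compact = bytes.fromhex("5c5350ff")
spaced = bytes.fromhex("5c 53 50 ff")
via_binascii = binascii.unhexlify("5c5350ff")

# Bytes to hex text: .hex() returns str, hexlify() returns bytes.
hex_text = raw.hex()
hexlified = binascii.hexlify(raw)
```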

The bytes object has some methods similar to list methods, and
others similar to str methods.  Here is a complete list of
methods, with their approximate signatures:

    .__add__(bytes) -> bytes
    .__contains__(int | bytes) -> bool
    .__delitem__(int | slice) -> None
    .__delslice__(int, int) -> None
    .__eq__(bytes) -> bool
    .__ge__(bytes) -> bool
    .__getitem__(int | slice) -> int | bytes
    .__getslice__(int, int) -> bytes
    .__gt__(bytes) -> bool
    .__iadd__(bytes) -> bytes
    .__imul__(int) -> bytes
    .__iter__() -> iterator
    .__le__(bytes) -> bool
    .__len__() -> int
    .__lt__(bytes) -> bool
    .__mul__(int) -> bytes
    .__ne__(bytes) -> bool
    .__reduce__(...) -> ...
    .__reduce_ex__(...) -> ...
    .__repr__() -> str
    .__reversed__() -> bytes
    .__rmul__(int) -> bytes
    .__setitem__(int | slice, int | iterable[int]) -> None
    .__setslice__(int, int, iterable[int]) -> None
    .append(int) -> None
    .count(int) -> int
    .decode(str) -> str | unicode # in 3.0, only str
    .endswith(bytes) -> bool
    .extend(iterable[int]) -> None
    .find(bytes) -> int
    .index(bytes | int) -> int
    .insert(int, int) -> None
    .join(iterable[bytes]) -> bytes
    .partition(bytes) -> (bytes, bytes, bytes)
    .pop([int]) -> int
    .remove(int) -> None
    .replace(bytes, bytes) -> bytes
    .rindex(bytes | int) -> int
    .rpartition(bytes) -> (bytes, bytes, bytes)
    .split(bytes) -> list[bytes]
    .startswith(bytes) -> bool
    .reverse() -> None
    .rfind(bytes) -> int
    .rsplit(bytes) -> list[bytes]
    .translate(bytes, [bytes]) -> bytes

Note the conspicuous absence of .isupper(), .upper(), and friends.
(But see "Open Issues" below.)  There is no .__hash__() because
the object is mutable.  There is no use case for a .sort() method.
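A few of the listed methods in action, again using bytearray from
released Python 3 as an assumed stand-in for the mutable bytes object.
The list-like methods work on integers and modify the object in place;
the str-like methods take and return byte sequences:

```python
buf = bytearray(b"key=value")

# List-like, in-place operations on integers.
buf.append(33)                  # 33 is ord("!")
buf.extend([10, 13])
last = buf.pop()                # removes and returns the final 13
buf.remove(10)                  # removes the first element equal to 10
buf.insert(0, 32)               # put a space in front
del buf[0]                      # and delete it again

# str-like operations on byte sequences.
head, sep, tail = buf.partition(b"=")
position = buf.find(b"value")
joined = bytearray(b", ").join([b"a", b"b"])
swapped = buf.replace(b"key", b"name")

# A mutable object does not support hashing.
try:
    hash(buf)
    hashable = True
except TypeError:
    hashable = False
```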

The bytes type also supports the buffer interface, supporting
reading and writing binary (but not character) data.
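What buffer support means in practice can be sketched with released
Python 3.  The example assumes bytearray as the mutable byte type and
uses io.BytesIO as an in-memory substitute for a binary file:

```python
import io

buf = bytearray(4)                          # four zero bytes
stream = io.BytesIO(b"\x01\x02\x03\x04\x05")

# Reading: the stream fills the existing buffer without creating a copy.
count = stream.readinto(buf)

# A memoryview exposes the same memory, so writes go through to buf.
view = memoryview(buf)
view[0] = 255

# Writing: any object supporting the buffer interface can be written.
out = io.BytesIO()
out.write(buf)
```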


Out of Scope Issues

* Python 3k will have a much different I/O subsystem.  Deciding
  how that I/O subsystem will work and interact with the bytes
  object is out of the scope of this PEP.  The expectation, however,
  is that binary I/O will read and write bytes, while text I/O
  will read strings.  Since the bytes type supports the buffer
  interface, the existing binary I/O operations in Python 2.6 will
  support bytes objects.

* It has been suggested that a special method named .__bytes__()
  be added to the language to allow objects to be converted into byte
  arrays.  This decision is out of scope.

* A bytes literal of the form b"..." is also proposed.  This is
  the subject of PEP 3112.


Open Issues

* The .decode() method is redundant since a bytes object b can
  also be decoded by calling unicode(b, <encoding>) (in 2.6) or
  str(b, <encoding>) (in 3.0).  Do we need encode/decode methods
  at all?  In a sense the spelling using a constructor is cleaner.

* Need to specify the methods still more carefully.

* Pickling and marshalling support need to be specified.

* Should all those list methods really be implemented?

* A case could be made for supporting .ljust(), .rjust(),
  .center() with a mandatory second argument.

* A case could be made for supporting .split() with a mandatory
  argument.

* A case could even be made for supporting .islower(), .isupper(),
  .isspace(), .isalpha(), .isalnum(), .isdigit() and the
  corresponding conversions (.lower() etc.), using the ASCII
  definitions for letters, digits and whitespace.  If this is
  accepted, the cases for .ljust(), .rjust(), .center() and
  .split() become much stronger, and they should have default
  arguments as well, using an ASCII space or all ASCII whitespace
  (for .split()).
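For comparison only, here is how released Python 3 behaves on two of
these points.  This shows one possible outcome and is not a decision
made by this PEP; the sample data is an assumption chosen for
illustration:

```python
data = bytes([99, 97, 102, 195, 169])   # UTF-8 encoding of "caf" + e-acute

# The two decoding spellings give the same result.
by_method = data.decode("utf-8")
by_constructor = str(data, "utf-8")

# Case conversion uses ASCII definitions: non-ASCII bytes are untouched.
upper = data.upper()

# Defaults use ASCII whitespace (split) and an ASCII space (ljust).
fields = b"a b\tc".split()
padded = b"ab".ljust(4)
```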


Frequently Asked Questions

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code.  The
   construct bytes(s.encode(...)) is expensive because it has to
   copy the byte sequence multiple times.  Also, Python generally
   provides two ways of converting an object of type A into an
   object of type B: ask an A instance to convert itself to a B, or
   ask the type B to create a new instance from an A.  Depending on
   what A and B are, both APIs make sense; sometimes reasons of
   decoupling require that A can't know about B, in which case you
   have to use the latter approach; sometimes B can't know about A,
   in which case you have to use the former.
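The two conversion directions described in this answer look as follows
in released Python 3, where str plays the role of A and bytes the role
of B.  The sample text is an assumption chosen for illustration:

```python
text = "caf\u00e9"

# Ask the A instance (a str) to convert itself to a B.
asked_instance = text.encode("utf-8")

# Ask type B (bytes) to create a new instance from an A.
asked_type = bytes(text, "utf-8")
```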


Q: Why does bytes ignore the encoding argument if the initializer is
   a str?  (This only applies to 2.6.)

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain.  We need to assume that
   the programmer has provided a str object that already uses the
   desired encoding.  If you need something other than a pure copy of
   the bytes then you need to first decode the string.  For example:

       bytes(s.decode(encoding1), encoding2)
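The same decode-then-encode step can be tried in released Python 3,
where the closest equivalent of a 2.6 str holding encoded text is a
bytes object.  The concrete encodings and sample data are assumptions
chosen for illustration:

```python
s = "caf\u00e9".encode("latin-1")       # b"caf\xe9", Latin-1 encoded data
encoding1, encoding2 = "latin-1", "utf-8"

# Decode with the encoding the data is in, then build bytes in the new one.
recoded = bytes(s.decode(encoding1), encoding2)

# Without decoding first, the constructor can only make a pure copy.
copied = bytes(s)
```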


Q: Why not have the encoding argument default to Latin-1 (or some
   other encoding that covers the entire byte range) rather than
   ASCII?

A: The system default encoding for Python is ASCII.  It seems least
   confusing to use that default.  Also, in Py3k, using Latin-1 as
   the default might not be what users expect.  For example, they
   might prefer a Unicode encoding.  Any default will not always
   work as expected.  At least ASCII will complain loudly if you try
   to encode non-ASCII data.
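The difference between a loud failure and a silent surprise can be
seen with any non-ASCII text.  A small sketch using released Python 3;
the sample text is an assumption chosen for illustration:

```python
text = "caf\u00e9"

# ASCII refuses the non-ASCII character outright.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Latin-1 and UTF-8 both succeed, but produce different bytes, so a
# silent default could easily be the wrong one.
latin1 = text.encode("latin-1")
utf8 = text.encode("utf-8")
```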


Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: