PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>, Guido van Rossum <guido@python.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 15-Feb-2006
Python-Version: 2.6, 3.0
Post-History:


Update
======

This PEP has partially been superseded by :pep:`3137`.


Abstract
========

This PEP outlines the introduction of a raw bytes sequence type.
Adding the bytes type is one step in the transition to
Unicode-based str objects which will be introduced in Python 3.0.

The PEP describes how the bytes type should work in Python 2.6, as
well as how it should work in Python 3.0.  (Occasionally there are
differences because in Python 2.6, we have two string types, str
and unicode, while in Python 3.0 we will only have one string
type, whose name will be str but whose semantics will be like the
2.6 unicode type.)


Motivation
==========

Python's current string objects are overloaded.  They serve to hold
both sequences of characters and sequences of bytes.  This
overloading of purpose leads to confusion and bugs.  In future
versions of Python, string objects will be used for holding
character data.  The bytes object will fulfil the role of a byte
container.  Eventually the unicode type will be renamed to str
and the old str type will be removed.


Specification
=============

A bytes object stores a mutable sequence of integers that are in
the range 0 to 255.  Unlike string objects, indexing a bytes
object returns an integer.  Assigning an object that is not an
integer to an element, or comparing an element with such an
object, causes a ``TypeError`` exception.  Assigning a value
outside the range 0 to 255 to an element causes a ``ValueError``
exception.  The ``.__len__()`` method of bytes returns the number
of integers stored in the sequence (i.e. the number of bytes).

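The element rules above can be sketched with Python 3's built-in
``bytearray``.  Using ``bytearray`` as a stand-in for the mutable type
described here is an assumption of this example (:pep:`3137` later
assigned that name to the mutable variant); the variable names are
illustrative only.

```python
# Assumption: bytearray stands in for the mutable bytes type of this PEP.
buf = bytearray([72, 105])

first = buf[0]            # indexing yields an int, not a 1-character string
buf[1] = 33               # an in-range integer can be assigned

try:
    buf[0] = "H"          # not an integer
except TypeError as exc:
    type_error = exc

try:
    buf[0] = 256          # outside the range 0 to 255
except ValueError as exc:
    value_error = exc

length = len(buf)         # number of stored integers
```

Both failed assignments leave the buffer unchanged, so ``length`` still
counts two elements.
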
The constructor of the bytes object has the following signature::

    bytes([initializer[, encoding]])

If no arguments are provided then a bytes object containing zero
elements is created and returned.  The initializer argument can be
a string (in 2.6, either str or unicode), an iterable of integers,
or a single integer.  The pseudo-code for the constructor
(optimized for clear semantics, not for speed) is::

    def bytes(initializer=0, encoding=None):
        if isinstance(initializer, int): # In 2.6, int -> (int, long)
            initializer = [0]*initializer
        elif isinstance(initializer, basestring):
            if isinstance(initializer, unicode): # In 3.0, "if True"
                if encoding is None:
                    # In 3.0, raise TypeError("explicit encoding required")
                    encoding = sys.getdefaultencoding()
                initializer = initializer.encode(encoding)
            initializer = [ord(c) for c in initializer]
        else:
            if encoding is not None:
                raise TypeError("no encoding allowed for this initializer")
            tmp = []
            for c in initializer:
                if not isinstance(c, int):
                    raise TypeError("initializer must be iterable of ints")
                if not 0 <= c < 256:
                    raise ValueError("initializer element out of range")
                tmp.append(c)
            initializer = tmp
        new = <new bytes object of length len(initializer)>
        for i, c in enumerate(initializer):
            new[i] = c
        return new

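The initializer forms handled by the pseudo-code can be tried against
Python 3's ``bytearray``, which is assumed here to follow the 3.0
branch of the rules (text requires an explicit encoding).  This is an
illustrative sketch, not part of the specification.

```python
# Assumption: bytearray stands in for the mutable bytes type of this PEP.
empty = bytearray()                         # no arguments: zero elements
zeros = bytearray(3)                        # single integer: that many zero bytes
from_ints = bytearray([104, 105])           # iterable of integers in 0..255
from_text = bytearray("h\u00e9", "utf-8")   # text plus an explicit encoding

try:
    bytearray("hi")                         # text without an encoding
except TypeError:
    missing_encoding_rejected = True

try:
    bytearray([104, 300])                   # element out of range
except ValueError:
    out_of_range_rejected = True
```
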
The ``.__repr__()`` method returns a string that can be evaluated to
generate a new bytes object containing a bytes literal::

    >>> bytes([10, 20, 30])
    b'\n\x14\x1e'

The object has a ``.decode()`` method equivalent to the ``.decode()``
method of the str object.  The object has a classmethod ``.fromhex()``
that takes a string of characters from the set ``[0-9a-fA-F ]`` and
returns a bytes object (similar to ``binascii.unhexlify``).  For
example::

    >>> bytes.fromhex('5c5350ff')
    b'\\SP\xff'
    >>> bytes.fromhex('5c 53 50 ff')
    b'\\SP\xff'

The object has a ``.hex()`` method that does the reverse conversion
(similar to ``binascii.hexlify``)::

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

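As a small check of the two conversions, the sketch below round-trips
a value through ``.fromhex()`` and ``.hex()`` and compares the results
with the ``binascii`` functions mentioned above.  It assumes a Python 3
interpreter in which both methods are available.

```python
import binascii

raw = bytes.fromhex("5c 53 50 ff")   # spaces between byte pairs are ignored
text = raw.hex()                     # back to lowercase hex digits
round_trip = bytes.fromhex(text)

# The binascii helpers perform the same conversions; hexlify returns bytes.
same_as_unhexlify = binascii.unhexlify("5c5350ff") == raw
same_as_hexlify = binascii.hexlify(raw) == text.encode("ascii")
```
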
The bytes object has some methods similar to list methods, and
others similar to str methods.  Here is a complete list of
methods, with their approximate signatures::

    .__add__(bytes) -> bytes
    .__contains__(int | bytes) -> bool
    .__delitem__(int | slice) -> None
    .__delslice__(int, int) -> None
    .__eq__(bytes) -> bool
    .__ge__(bytes) -> bool
    .__getitem__(int | slice) -> int | bytes
    .__getslice__(int, int) -> bytes
    .__gt__(bytes) -> bool
    .__iadd__(bytes) -> bytes
    .__imul__(int) -> bytes
    .__iter__() -> iterator
    .__le__(bytes) -> bool
    .__len__() -> int
    .__lt__(bytes) -> bool
    .__mul__(int) -> bytes
    .__ne__(bytes) -> bool
    .__reduce__(...) -> ...
    .__reduce_ex__(...) -> ...
    .__repr__() -> str
    .__reversed__() -> bytes
    .__rmul__(int) -> bytes
    .__setitem__(int | slice, int | iterable[int]) -> None
    .__setslice__(int, int, iterable[int]) -> None
    .append(int) -> None
    .count(int) -> int
    .decode(str) -> str | unicode  # in 3.0, only str
    .endswith(bytes) -> bool
    .extend(iterable[int]) -> None
    .find(bytes) -> int
    .index(bytes | int) -> int
    .insert(int, int) -> None
    .join(iterable[bytes]) -> bytes
    .partition(bytes) -> (bytes, bytes, bytes)
    .pop([int]) -> int
    .remove(int) -> None
    .replace(bytes, bytes) -> bytes
    .rpartition(bytes) -> (bytes, bytes, bytes)
    .split(bytes) -> list[bytes]
    .startswith(bytes) -> bool
    .reverse() -> None
    .rfind(bytes) -> int
    .rindex(bytes | int) -> int
    .rsplit(bytes) -> list[bytes]
    .translate(bytes, [bytes]) -> bytes

Note the conspicuous absence of ``.isupper()``, ``.upper()``, and friends.
(But see "Open Issues" below.)  There is no ``.__hash__()`` because
the object is mutable.  There is no use case for a ``.sort()`` method.

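To illustrate the split between list-like and str-like methods, the
following sketch again uses Python 3's ``bytearray`` as an assumed
stand-in for the mutable type.  Its behaviour may differ in detail
from the approximate signatures listed above.

```python
# Assumption: bytearray stands in for the mutable bytes type of this PEP.
buf = bytearray(b"key=value")

# List-like methods operate on single integers.
buf.append(33)                 # add the byte 33 at the end
last = buf.pop()               # remove and return it again
buf.extend([10])               # add a trailing newline byte
buf.remove(10)                 # and drop it

# str-like methods operate on byte sequences.
head, sep, tail = buf.partition(b"=")
joined = bytearray(b", ").join([b"a", b"b"])
position = buf.find(b"value")

try:
    hash(buf)                  # mutable, hence no usable __hash__
except TypeError:
    unhashable = True
```
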
The bytes type also supports the buffer interface, supporting
reading and writing binary (but not character) data.

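One way to picture buffer-interface support is a binary read that
fills an existing object in place.  The sketch assumes Python 3, with
``bytearray`` as the mutable type and ``io.BytesIO`` and ``memoryview``
as the tools that exercise the buffer; none of these names come from
this PEP.

```python
import io

# Assumption: bytearray stands in for the mutable bytes type of this PEP.
buf = bytearray(4)

# A binary stream can write directly into the existing storage.
stream = io.BytesIO(b"\x01\x02\x03\x04\x05")
count = stream.readinto(buf)   # fills at most len(buf) bytes

# A memoryview exposes the same storage without copying it.
view = memoryview(buf)
view[0] = 255                  # the change is visible through buf
```
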

Out of Scope Issues
===================

* Python 3k will have a much different I/O subsystem.  Deciding
  how that I/O subsystem will work and interact with the bytes
  object is out of the scope of this PEP.  The expectation however
  is that binary I/O will read and write bytes, while text I/O
  will read strings.  Since the bytes type supports the buffer
  interface, the existing binary I/O operations in Python 2.6 will
  support bytes objects.

* It has been suggested that a special method named ``.__bytes__()``
  be added to the language to allow objects to be converted into
  byte arrays.  This decision is out of scope.

* A bytes literal of the form ``b"..."`` is also proposed.  This is
  the subject of :pep:`3112`.


Open Issues
===========

* The ``.decode()`` method is redundant since a bytes object ``b`` can
  also be decoded by calling ``unicode(b, <encoding>)`` (in 2.6) or
  ``str(b, <encoding>)`` (in 3.0).  Do we need encode/decode methods
  at all?  In a sense the spelling using a constructor is cleaner.

* Need to specify the methods still more carefully.

* Pickling and marshalling support need to be specified.

* Should all those list methods really be implemented?

* A case could be made for supporting ``.ljust()``, ``.rjust()``,
  ``.center()`` with a mandatory second argument.

* A case could be made for supporting ``.split()`` with a mandatory
  argument.

* A case could even be made for supporting ``.islower()``, ``.isupper()``,
  ``.isspace()``, ``.isalpha()``, ``.isalnum()``, ``.isdigit()`` and the
  corresponding conversions (``.lower()`` etc.), using the ASCII
  definitions for letters, digits and whitespace.  If this is
  accepted, the cases for ``.ljust()``, ``.rjust()``, ``.center()`` and
  ``.split()`` become much stronger, and they should have default
  arguments as well, using an ASCII space or all ASCII whitespace
  (for ``.split()``).

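The first item in the list above contrasts two spellings of decoding.
As a sketch, assuming a Python 3 interpreter (the sample data is
illustrative), both give the same result:

```python
data = b"caf\xc3\xa9"                  # UTF-8 encoded text

via_method = data.decode("utf-8")      # ask the bytes object to convert itself
via_constructor = str(data, "utf-8")   # ask str to build itself from the bytes
```
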

Frequently Asked Questions
==========================

**Q:** Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?

**A:** In the current version of Python, the encode method returns a str
object and we cannot change that without breaking code.  The
construct ``bytes(s.encode(...))`` is expensive because it has to
copy the byte sequence multiple times.  Also, Python generally
provides two ways of converting an object of type A into an
object of type B: ask an A instance to convert itself to a B, or
ask the type B to create a new instance from an A.  Depending on
what A and B are, both APIs make sense; sometimes reasons of
decoupling require that A can't know about B, in which case you
have to use the latter approach; sometimes B can't know about A,
in which case you have to use the former.


**Q:** Why does bytes ignore the encoding argument if the initializer is
a str?  (This only applies to 2.6.)

**A:** There is no sane meaning that the encoding can have in that case.
str objects *are* byte arrays and they know nothing about the
encoding of character data they contain.  We need to assume that
the programmer has provided a str object that already uses the
desired encoding.  If you need something other than a pure copy of
the bytes then you need to first decode the string.  For example::

    bytes(s.decode(encoding1), encoding2)

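The decode-then-encode pattern can be sketched in Python 3 terms,
where the byte string plays the role of the 2.6 ``str``.  The sample
data and the choice of Latin-1 and UTF-8 are assumptions made for
illustration.

```python
latin1_data = b"caf\xe9"                # four bytes, encoded as Latin-1

text = latin1_data.decode("latin-1")    # first decode to character data
utf8_data = bytes(text, "utf-8")        # then re-encode via the constructor
```

The result holds the same text in a different byte representation.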

**Q:** Why not have the encoding argument default to Latin-1 (or some
other encoding that covers the entire byte range) rather than
ASCII?

**A:** The system default encoding for Python is ASCII.  It seems least
confusing to use that default.  Also, in Py3k, using Latin-1 as
the default might not be what users expect.  For example, they
might prefer a Unicode encoding.  Any default will not always
work as expected.  At least ASCII will complain loudly if you try
to encode non-ASCII data.

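A short sketch of the "complain loudly" argument, assuming a Python 3
interpreter and an illustrative sample string: ASCII refuses non-ASCII
text, while Latin-1 and UTF-8 both succeed silently but produce
different bytes.

```python
text = "caf\u00e9"

try:
    text.encode("ascii")           # non-ASCII character: raises
except UnicodeEncodeError:
    ascii_failed = True

latin1 = text.encode("latin-1")    # succeeds silently
utf8 = text.encode("utf-8")        # also succeeds, with different bytes
```
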

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: