PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract

    This PEP outlines the introduction of a raw bytes sequence object.
    Adding the bytes object is one step in the transition to Unicode
    based str objects.


Motivation

    Python's current string objects are overloaded. They serve to hold
    both sequences of characters and sequences of bytes. This
    overloading of purpose leads to confusion and bugs. In future
    versions of Python, string objects will be used for holding
    character data. The bytes object will fulfil the role of a byte
    container. Eventually the unicode built-in will be renamed to str
    and the str object will be removed.


Specification

    A bytes object stores a mutable sequence of integers that are in
    the range 0 to 255. Unlike string objects, indexing a bytes object
    returns an integer. Assigning an element using an object that is
    not an integer causes a TypeError exception. Assigning an element
    to a value outside the range 0 to 255 causes a ValueError
    exception. The __len__ method of bytes returns the number of
    integers stored in the sequence (i.e. the number of bytes).
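
    The element rules above can be tried out today with a stand-in.
    The sketch below assumes that Python 3's built-in bytearray behaves
    like the object proposed here for indexing, assignment and len();
    that is an assumption made for illustration, not part of this
    proposal. The helper try_assign is hypothetical.

```python
# Assumption: bytearray is used only as a stand-in for the proposed object.
data = bytearray([10, 20, 30])

first = data[0]       # indexing yields an int, not a 1-character string
length = len(data)    # number of stored integers, i.e. number of bytes


def try_assign(seq, index, value):
    """Assign seq[index] = value; return the exception name, or None."""
    try:
        seq[index] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


non_int_result = try_assign(data, 0, "a")       # not an integer
out_of_range_result = try_assign(data, 0, 256)  # outside 0 to 255
in_range_result = try_assign(data, 0, 255)      # accepted
```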

    The constructor of the bytes object has the following signature:

        bytes([initialiser[, encoding]])

    If no arguments are provided then an object containing zero elements
    is created and returned. The initialiser argument can be a string
    or a sequence of integers. The pseudo-code for the constructor is:

        def bytes(initialiser=[], encoding=None):
            if isinstance(initialiser, basestring):
                if isinstance(initialiser, unicode):
                    if encoding is None:
                        encoding = sys.getdefaultencoding()
                    initialiser = initialiser.encode(encoding)
                initialiser = [ord(c) for c in initialiser]
            elif encoding is not None:
                raise TypeError("explicit encoding invalid for non-string "
                                "initialiser")
            create bytes object and fill with integers from initialiser
            return bytes object
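
    For readers who want something executable, the pseudo-code can be
    approximated as follows. This is only a sketch under several
    assumptions: it targets Python 3, where str plays the role of
    unicode and the immutable bytes type plays the role of the old
    8-bit str; bytearray stands in for the proposed object; and
    make_bytes is a hypothetical name. Note also that
    sys.getdefaultencoding() returns 'utf-8' on Python 3 rather than
    ASCII.

```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical helper mirroring the constructor pseudo-code.

    Assumptions: Python 3 str ~ 'unicode', Python 3 bytes ~ 8-bit 'str',
    and bytearray ~ the proposed mutable bytes object.
    """
    if isinstance(initialiser, str):
        # Character data must be encoded before it can become bytes.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        values = list(initialiser.encode(encoding))
    elif isinstance(initialiser, bytes):
        # 8-bit string: copied as-is; the encoding argument is ignored.
        values = list(initialiser)
    else:
        if encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        values = list(initialiser)
    # bytearray itself rejects values outside the range 0 to 255.
    return bytearray(values)


empty = make_bytes()
from_text = make_bytes("abc", "ascii")
from_raw = make_bytes(b"\xff", "ascii")
from_ints = make_bytes([1, 2, 3])
```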

    The __repr__ method returns a string that can be evaluated to
    generate a new bytes object containing the same sequence of
    integers. The sequence is represented by a list of ints. For
    example:

        >>> repr(bytes([10, 20, 30]))
        'bytes([10, 20, 30])'
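
    The round-trip property can be illustrated with a small stand-in
    class. ListReprBytes is a hypothetical name, and the sketch assumes
    Python 3, where evaluating the repr string produces the immutable
    built-in bytes type rather than the mutable object described here.

```python
class ListReprBytes(bytearray):
    """Hypothetical stand-in whose repr uses the list-of-ints form."""

    def __repr__(self):
        return "bytes(%r)" % (list(self),)


original = ListReprBytes([10, 20, 30])
text = repr(original)

# Evaluating the repr rebuilds an object holding the same integers.
# Assumption: on Python 3 the result is the immutable built-in bytes.
rebuilt = eval(text)
```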

    The object has a decode method equivalent to the decode method of
    the str object. The object has a classmethod fromhex that takes a
    string of characters from the set [0-9a-fA-F ] and returns a bytes
    object (similar to binascii.unhexlify). For example:

        >>> bytes.fromhex('5c5350ff')
        bytes([92, 83, 80, 255])
        >>> bytes.fromhex('5c 53 50 ff')
        bytes([92, 83, 80, 255])

    The object has a hex method that does the reverse conversion
    (similar to binascii.hexlify):

        >>> bytes([92, 83, 80, 255]).hex()
        '5c5350ff'
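
    The relationship between fromhex, hex and the binascii functions
    can be sketched as below. The sketch assumes Python 3, where
    bytearray happens to provide fromhex and hex methods of the same
    names; it is offered as an illustration and not as the
    implementation of this proposal.

```python
import binascii

# Assumption: bytearray stands in for the proposed object.
packed = bytearray.fromhex("5c5350ff")
spaced = bytearray.fromhex("5c 53 50 ff")   # spaces between pairs are allowed

as_ints = list(packed)
hex_text = packed.hex()

# The binascii equivalents; unhexlify does not accept embedded spaces.
via_unhexlify = bytearray(binascii.unhexlify("5c5350ff"))
via_hexlify = binascii.hexlify(packed).decode("ascii")
```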

    The bytes object has methods similar to the list object:

        __add__
        __contains__
        __delitem__
        __delslice__
        __eq__
        __ge__
        __getitem__
        __getslice__
        __gt__
        __hash__
        __iadd__
        __imul__
        __iter__
        __le__
        __len__
        __lt__
        __mul__
        __ne__
        __reduce__
        __reduce_ex__
        __repr__
        __rmul__
        __setitem__
        __setslice__
        append
        count
        extend
        index
        insert
        pop
        remove
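
    A short tour of the list-like operations follows. It again assumes
    Python 3's bytearray as a stand-in, so it shows how such methods
    behave on an existing mutable byte sequence rather than defining
    how the proposed object must behave.

```python
# Assumption: bytearray stands in for the proposed object.
buf = bytearray([1, 2, 3])

buf.append(4)               # 1 2 3 4
buf.extend([5, 2])          # 1 2 3 4 5 2
buf.insert(0, 0)            # 0 1 2 3 4 5 2

count_of_two = buf.count(2)
index_of_three = buf.index(3)
popped = buf.pop()          # removes and returns the last element
buf.remove(0)               # removes the first occurrence of 0
has_four = 4 in buf         # __contains__

del buf[0:2]                # slice deletion
buf[0:1] = [7, 8]           # slice assignment from a sequence of ints

joined = buf + bytearray([9])   # __add__
doubled = buf * 2               # __mul__
```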


Out of scope issues

    * If we provide a literal syntax for bytes then it should look
      distinctly different from the syntax for literal strings. Also,
      a new type, even built-in, is much less drastic than a new
      literal (which requires lexer and parser support in addition to
      everything else). Since there appears to be no immediate need
      for a literal representation, designing and implementing one is
      out of the scope of this PEP.

    * Python 3k will have a much different I/O subsystem. Deciding how
      that I/O subsystem will work and interact with the bytes object
      is out of the scope of this PEP.

    * It has been suggested that a special method named __bytes__ be
      added to the language to allow objects to be converted into byte
      arrays. This decision is out of scope.


Unresolved issues

    * Perhaps the bytes object should be implemented as an extension
      module until we are more sure of the design (similar to how the
      set object was prototyped).

    * Should the bytes object implement the buffer interface? Probably,
      but we need to look into the implications of that (e.g. regex
      operations on byte arrays).

    * Should the object implement __reversed__ and reverse? Should it
      implement sort?

    * Need to clarify what some of the methods do. How are comparisons
      done? Hashing? Pickling and marshalling?


Questions and answers

    Q: Why have the optional encoding argument when the encode method of
       Unicode objects does the same thing?

    A: In the current version of Python, the encode method returns a str
       object and we cannot change that without breaking code. The
       construct bytes(s.encode(...)) is expensive because it has to
       copy the byte sequence multiple times. Also, Python generally
       provides two ways of converting an object of type A into an
       object of type B: ask an A instance to convert itself to a B, or
       ask the type B to create a new instance from an A. Depending on
       what A and B are, both APIs make sense; sometimes reasons of
       decoupling require that A can't know about B, in which case you
       have to use the latter approach; sometimes B can't know about A,
       in which case you have to use the former.


    Q: Why does bytes ignore the encoding argument if the initialiser is
       a str?

    A: There is no sane meaning that the encoding can have in that case.
       str objects *are* byte arrays and they know nothing about the
       encoding of character data they contain. We need to assume that
       the programmer has provided a str object that already uses the
       desired encoding. If you need something other than a pure copy of
       the bytes then you need to first decode the string. For example:

           bytes(s.decode(encoding1), encoding2)
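
    The decode-then-construct idiom can be sketched in runnable form.
    The sketch assumes Python 3, where the immutable bytes type plays
    the part of the 8-bit str; transcode is a hypothetical helper name
    and the sample text is made up for illustration.

```python
def transcode(raw, source_encoding, target_encoding):
    """Hypothetical helper: re-encode raw bytes via a decode step."""
    text = raw.decode(source_encoding)     # bytes -> characters
    return bytes(text, target_encoding)    # characters -> bytes


# Made-up sample: "cafe" with an acute accent on the final letter.
latin1_data = "caf\u00e9".encode("latin-1")
utf8_data = transcode(latin1_data, "latin-1", "utf-8")
```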


    Q: Why not have the encoding argument default to Latin-1 (or some
       other encoding that covers the entire byte range) rather than
       ASCII?

    A: The system default encoding for Python is ASCII. It seems least
       confusing to use that default. Also, in Py3k, using Latin-1 as
       the default might not be what users expect. For example, they
       might prefer a Unicode encoding. Any default will not always
       work as expected. At least ASCII will complain loudly if you try
       to encode non-ASCII data.
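
    The difference between a default that complains and one that
    silently succeeds can be seen in a few lines. The sketch assumes
    Python 3 string semantics; encode_or_error is a hypothetical
    helper and the sample text is made up.

```python
def encode_or_error(text, encoding):
    """Hypothetical helper: return encoded bytes or the exception name."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


sample = "na\u00efve"   # contains one non-ASCII character

ascii_result = encode_or_error(sample, "ascii")      # fails loudly
latin1_result = encode_or_error(sample, "latin-1")   # silently succeeds
```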


Copyright

    This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: