PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract

This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode
based str objects.


Motivation

Python's current string objects are overloaded.  They serve to hold
both sequences of characters and sequences of bytes.  This
overloading of purpose leads to confusion and bugs.  In future
versions of Python, string objects will be used for holding
character data.  The bytes object will fulfil the role of a byte
container.  Eventually the unicode built-in will be renamed to str
and the str object will be removed.


Specification

A bytes object stores a mutable sequence of integers that are in the
range 0 to 255.  Unlike string objects, indexing a bytes object
returns an integer.  Assigning an element using an object that is not
an integer causes a TypeError exception.  Assigning an element to a
value outside the range 0 to 255 causes a ValueError exception.  The
__len__ method of bytes returns the number of integers stored in the
sequence (i.e. the number of bytes).

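The behaviour described above can be tried out today with Python 3's
built-in bytearray, which is used here only as a stand-in for the
proposed mutable bytes object (an assumption; this PEP does not define
bytearray).  The helper try_assign is hypothetical and exists only to
capture the exception type.

```python
# Python 3 sketch: bytearray stands in for the mutable bytes object
# described above (an assumption, not the PEP's own type).
buf = bytearray([10, 20, 30])

first = buf[0]          # indexing yields an int, not a 1-character string
buf[1] = 255            # an in-range integer assignment succeeds
length = len(buf)       # number of bytes stored


def try_assign(target, index, value):
    """Return the name of the exception raised by an assignment, or None."""
    try:
        target[index] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


non_int_error = try_assign(buf, 0, "a")    # element is not an integer
range_error = try_assign(buf, 0, 256)      # integer outside 0..255
```

Here non_int_error is the string "TypeError" and range_error is
"ValueError", matching the two failure cases listed in the paragraph.
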
The constructor of the bytes object has the following signature:

    bytes([initialiser[, encoding]])

If no arguments are provided then an object containing zero elements
is created and returned.  The initialiser argument can be a string
or a sequence of integers.  The pseudo-code for the constructor is:

    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if isinstance(initialiser, unicode):
                if encoding is None:
                    encoding = sys.getdefaultencoding()
                initialiser = initialiser.encode(encoding)
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object

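For readers who want something executable, the constructor logic can
be approximated in Python 3.  This is a rough sketch under several
assumptions: make_bytes is a hypothetical name, Python 3's str plays
the role of unicode, bytes/bytearray play the role of the 8-bit str,
and the result is a bytearray rather than the type proposed here.
Note also that sys.getdefaultencoding() returns "utf-8" on Python 3,
not ASCII.

```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical helper mirroring the constructor pseudo-code."""
    if isinstance(initialiser, str):
        # Character data: encode it, falling back to the default encoding.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        initialiser = initialiser.encode(encoding)
    elif isinstance(initialiser, (bytes, bytearray)):
        # 8-bit string data: copied as-is, any encoding is ignored,
        # as in the basestring branch of the pseudo-code.
        pass
    elif encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # bytearray checks that every element is an int in range(256).
    return bytearray(initialiser)


empty = make_bytes()
from_text = make_bytes("abc")
from_latin1 = make_bytes("\u00e9", "latin-1")
from_ints = make_bytes([1, 2, 3])
```

Calling make_bytes([1], "ascii") raises TypeError, which corresponds
to the elif branch of the pseudo-code.
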
The __repr__ method returns a string that can be evaluated to
generate a new bytes object containing the same sequence of
integers.  The sequence is represented by a list of ints.  For
example:

    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'

The object has a decode method equivalent to the decode method of
the str object.  The object has a classmethod fromhex that takes a
string of characters from the set [0-9a-fA-F ] and returns a bytes
object (similar to binascii.unhexlify).  For example:

    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])

The object has a hex method that does the reverse conversion
(similar to binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

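The hex round trip can be checked against binascii in Python 3.  The
sketch below again uses bytearray as a stand-in for the proposed type
(an assumption); bytearray.fromhex and bytearray.hex are existing
Python 3 methods with the behaviour described above.

```python
import binascii

# Spaces between byte pairs are skipped, as in the second example above.
data = bytearray.fromhex("5c 53 50 ff")
as_list = list(data)
hex_text = data.hex()
round_trip = bytearray.fromhex(hex_text)

# The binascii functions mentioned in the text give matching results.
same_as_unhexlify = binascii.unhexlify("5c5350ff") == data
same_as_hexlify = binascii.hexlify(data) == hex_text.encode("ascii")

# Characters outside the hexadecimal set are rejected.
try:
    bytearray.fromhex("zz")
    rejects_non_hex = False
except ValueError:
    rejects_non_hex = True
```
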
The bytes object has methods similar to the list object:

    __add__
    __contains__
    __delitem__
    __delslice__
    __eq__
    __ge__
    __getitem__
    __getslice__
    __gt__
    __hash__
    __iadd__
    __imul__
    __iter__
    __le__
    __len__
    __lt__
    __mul__
    __ne__
    __reduce__
    __reduce_ex__
    __repr__
    __rmul__
    __setitem__
    __setslice__
    append
    count
    extend
    index
    insert
    pop
    remove

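A few of the list-like methods named above can be exercised on a
Python 3 bytearray, used here purely as an illustrative stand-in (an
assumption; the final method set of the proposed type may differ).

```python
buf = bytearray([1, 2, 3])

buf.append(4)             # 1, 2, 3, 4
buf.extend([5, 2])        # 1, 2, 3, 4, 5, 2
buf.insert(0, 9)          # 9, 1, 2, 3, 4, 5, 2
last = buf.pop()          # removes and returns the trailing 2
buf.remove(3)             # 9, 1, 2, 4, 5
twos = buf.count(2)       # occurrences of the value 2
where = buf.index(4)      # position of the value 4
has_five = 5 in buf       # __contains__
doubled = buf * 2         # __mul__ returns a new object
buf += bytearray([7])     # __iadd__ extends in place
del buf[0]                # __delitem__
```

One difference worth noting: a Python 3 bytearray is unhashable, so
the __hash__ entry in the list has no working counterpart in this
stand-in.
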

Out of scope issues

* If we provide a literal syntax for bytes then it should look
  distinctly different from the syntax for literal strings.  Also, a
  new type, even built-in, is much less drastic than a new literal
  (which requires lexer and parser support in addition to everything
  else).  Since there appears to be no immediate need for a literal
  representation, designing and implementing one is out of the scope
  of this PEP.

* Python 3k will have a much different I/O subsystem.  Deciding how
  that I/O subsystem will work and interact with the bytes object is
  out of the scope of this PEP.

* It has been suggested that a special method named __bytes__ be
  added to the language to allow objects to be converted into byte
  arrays.  This decision is out of scope.


Unresolved issues

* Perhaps the bytes object should be implemented as an extension
  module until we are more certain of the design (similar to how the
  set object was prototyped).

* Should the bytes object implement the buffer interface?  Probably,
  but we need to look into the implications of that (e.g. regex
  operations on byte arrays).

* Should the object implement __reversed__ and reverse?  Should it
  implement sort?

* Need to clarify what some of the methods do.  How are comparisons
  done?  Hashing?  Pickling and marshalling?

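To make the buffer-interface and reversal questions concrete, here is
how the analogous operations behave on a Python 3 bytearray.  This is
only an illustration of one possible answer, taken from a different
type (an assumption); it does not settle the questions for the object
proposed here.

```python
import re

buf = bytearray(b"key=42;key=7")

# Buffer interface: a memoryview shares storage with the byte array.
view = memoryview(buf)
view[0] = ord("K")            # the write is visible through buf
view.release()

# Regex operations work directly on the byte array with a bytes pattern.
matches = re.findall(rb"=(\d+)", buf)

# Reversal: both the reversed() built-in and an in-place reverse exist.
backwards = bytes(reversed(buf[:3]))
buf.reverse()
```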

Questions and answers

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code.  The
   construct bytes(s.encode(...)) is expensive because it has to
   copy the byte sequence multiple times.  Also, Python generally
   provides two ways of converting an object of type A into an
   object of type B: ask an A instance to convert itself to a B, or
   ask the type B to create a new instance from an A.  Depending on
   what A and B are, both APIs make sense; sometimes reasons of
   decoupling require that A can't know about B, in which case you
   have to use the latter approach; sometimes B can't know about A,
   in which case you have to use the former.

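The two conversion directions described in the answer both exist in
Python 3, which makes for a small illustration.  The sketch assumes
Python 3 semantics, where str is the character type and bytes is the
byte type; it is not the Python 2.5 behaviour this PEP targets.

```python
text = "caf\u00e9"

# Ask the A instance (a string) to convert itself to a B (bytes).
via_instance = text.encode("utf-8")

# Ask the type B (bytes) to create a new instance from an A.
via_type = bytes(text, "utf-8")

# The mutable variant accepts the same two arguments.
via_mutable_type = bytearray(text, "utf-8")
```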

Q: Why does bytes ignore the encoding argument if the initialiser is
   a str?

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain.  We need to assume that
   the programmer has provided a str object that already uses the
   desired encoding.  If you need something other than a pure copy of
   the bytes then you need to first decode the string.  For example:

       bytes(s.decode(encoding1), encoding2)

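The decode-then-construct idiom can be run as-is in Python 3, where
the 8-bit string s becomes a bytes object.  The encodings chosen here
(Latin-1 and UTF-8) are arbitrary picks for the example.

```python
# "café" encoded as Latin-1: one byte per character.
raw_latin1 = b"caf\xe9"

# A pure copy keeps the bytes unchanged; no encoding is involved.
plain_copy = bytes(raw_latin1)

# Transcoding requires decoding first, then constructing with the
# target encoding, as in bytes(s.decode(encoding1), encoding2).
raw_utf8 = bytes(raw_latin1.decode("latin-1"), "utf-8")
```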

Q: Why not have the encoding argument default to Latin-1 (or some
   other encoding that covers the entire byte range) rather than
   ASCII?

A: The system default encoding for Python is ASCII.  It seems least
   confusing to use that default.  Also, in Py3k, using Latin-1 as
   the default might not be what users expect.  For example, they
   might prefer a Unicode encoding.  Any default will not always
   work as expected.  At least ASCII will complain loudly if you try
   to encode non-ASCII data.

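The "complain loudly" point can be seen in a short Python 3 sketch:
ASCII refuses non-ASCII text, while Latin-1 and UTF-8 both succeed
silently but produce different bytes, so neither is a safe guess as a
default.  The sample string is an arbitrary choice for the example.

```python
text = "na\u00efve"

# ASCII fails loudly instead of guessing.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Two plausible defaults that both "work" yet disagree on the result.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
```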

Copyright

This document has been placed in the public domain.



..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: