Add 'The "bytes" object' PEP.
parent
b92889b332
commit
cd7901e86e
@@ -0,0 +1,215 @@
PEP: 358
Title: The "bytes" object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract
========

This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode-based
str objects.


Motivation
==========

Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This overloading
of purpose leads to confusion and bugs. In future versions of Python,
string objects will be used for holding character data. The bytes object
will fulfil the role of a byte container. Eventually the unicode
built-in will be renamed to str and the str object will be removed.


Specification
=============

A bytes object stores a mutable sequence of integers that are in the
range 0 to 255. Unlike string objects, indexing a bytes object returns
an integer. Assigning an element using an object that is not an integer
causes a TypeError exception. Assigning an element to a value outside
the range 0 to 255 causes a ValueError exception. The __len__ method of
bytes returns the number of integers stored in the sequence (i.e. the
number of bytes).

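The element rules above can be illustrated with a short sketch. It assumes a current Python 3 interpreter and uses the built-in bytearray as a stand-in, since bytearray is also a mutable sequence of integers in the range 0 to 255. It is not the object this PEP specifies, and the helper try_assign is a hypothetical name introduced only for this illustration.

```python
# Assumption: bytearray stands in for the mutable bytes object proposed here.
buf = bytearray([10, 20, 30])

first = buf[0]   # indexing yields an int, not a one-character string
buf[1] = 255     # an integer inside 0..255 can be assigned


def try_assign(value):
    """Return the name of the exception raised by buf[0] = value, or None."""
    try:
        buf[0] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


wrong_type = try_assign("a")    # a non-integer is rejected
out_of_range = try_assign(256)  # an integer outside 0..255 is rejected
length = len(buf)               # number of stored integers (number of bytes)
```

With bytearray, the rejected assignments raise TypeError and ValueError respectively, which matches the behaviour described for the proposed object.
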
The constructor of the bytes object has the following signature:

    bytes([initialiser[, encoding]])

If no arguments are provided then an object containing zero elements is
created and returned. The initialiser argument can be a string or a
sequence of integers. The pseudo-code for the constructor is:

    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if isinstance(initialiser, unicode):
                if encoding is None:
                    encoding = sys.getdefaultencoding()
                initialiser = initialiser.encode(encoding)
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object

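The constructor pseudo-code can be approximated as runnable code. The following is a minimal sketch under stated assumptions, not the proposed implementation: it targets Python 3, where str plays the role of the unicode type and bytes plays the role of the old str type, it returns a bytearray as a stand-in for the proposed object, and make_bytes is a hypothetical name. Note also that sys.getdefaultencoding() returns 'utf-8' on Python 3, not the ASCII default this PEP assumes.

```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical Python 3 approximation of the constructor pseudo-code."""
    if isinstance(initialiser, str):
        # Character data (the PEP's "unicode" case) must be encoded first.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        return bytearray(initialiser.encode(encoding))
    if isinstance(initialiser, (bytes, bytearray)):
        # Byte strings (the PEP's "str" case) are copied; encoding is ignored.
        return bytearray(initialiser)
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # Any other initialiser is treated as a sequence of integers in 0..255.
    return bytearray(initialiser)
```

For example, make_bytes("abc", "ascii") and make_bytes([97, 98, 99]) produce equal results, while make_bytes([97], "ascii") raises TypeError.
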
The __repr__ method returns a string that can be evaluated to generate a
new bytes object containing the same sequence of integers. The sequence
is represented by a list of ints. For example:

    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'

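The round-trip property of __repr__ can be sketched with a small subclass. This assumes Python 3; PepBytes is a hypothetical name chosen so the built-in bytes type is not shadowed, and it subclasses bytearray purely for illustration.

```python
class PepBytes(bytearray):
    """Hypothetical stand-in whose repr follows the format described above."""

    def __repr__(self):
        # Represent the contents as a list of ints inside a constructor call.
        return f"PepBytes({list(self)!r})"


original = PepBytes([10, 20, 30])
text = repr(original)

# Evaluating the repr rebuilds an object holding the same integers.
rebuilt = eval(text, {"PepBytes": PepBytes})
```
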
The object has a decode method equivalent to the decode method of the
str object. The object has a classmethod fromhex that takes a string of
characters from the set [0-9a-fA-F ] and returns a bytes object (similar
to binascii.unhexlify). For example:

    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])

The object has a hex method that does the reverse conversion (similar to
binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

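The hex conversions can be tried out on a current Python 3 interpreter, where bytearray already offers fromhex and hex methods. The sketch below uses bytearray as a stand-in for the proposed object and compares the results with the binascii functions mentioned above.

```python
import binascii

# fromhex accepts hexadecimal digit pairs, optionally separated by spaces.
compact = bytearray.fromhex("5c5350ff")
spaced = bytearray.fromhex("5c 53 50 ff")
values = list(compact)

# hex performs the reverse conversion and returns a str of hex digits.
hex_text = compact.hex()

# The binascii equivalents work on byte strings instead of str.
unhexlified = binascii.unhexlify(b"5c5350ff")
hexlified = binascii.hexlify(bytes(compact))
```
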
The bytes object has methods similar to the list object:

    __add__
    __contains__
    __delitem__
    __delslice__
    __eq__
    __ge__
    __getitem__
    __getslice__
    __gt__
    __hash__
    __iadd__
    __imul__
    __iter__
    __le__
    __len__
    __lt__
    __mul__
    __ne__
    __reduce__
    __reduce_ex__
    __repr__
    __rmul__
    __setitem__
    __setslice__
    append
    count
    extend
    index
    insert
    pop
    remove


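Several of the list-like methods named above can be exercised in a short sketch. It assumes Python 3 and again uses bytearray as a stand-in for the proposed object; the exact semantics of the proposed methods may differ.

```python
buf = bytearray([1, 2, 3])

buf.append(4)           # add one integer at the end
buf.extend([5, 6])      # add several integers at the end
buf.insert(0, 0)        # insert the value 0 at position 0
last = buf.pop()        # remove and return the final integer
buf.remove(3)           # remove the first occurrence of the value 3
buf[1:3] = [9, 9, 9]    # slice assignment may change the length
del buf[0]              # delete a single element

nines = buf.count(9)    # how many elements equal 9
where = buf.index(4)    # position of the first element equal to 4
has_five = 5 in buf     # membership test on integer values

doubled = buf * 2                # repetition
joined = buf + bytearray([7])    # concatenation
```
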
Out of scope issues
===================

* If we provide a literal syntax for bytes then it should look distinctly
  different from the syntax for literal strings. Also, a new type, even
  built-in, is much less drastic than a new literal (which requires
  lexer and parser support in addition to everything else). Since there
  appears to be no immediate need for a literal representation,
  designing and implementing one is out of the scope of this PEP.

* Python 3k will have a much different I/O subsystem. Deciding how that
  I/O subsystem will work and interact with the bytes object is out of
  the scope of this PEP.

* It has been suggested that a special method named __bytes__ be added
  to the language to allow objects to be converted into byte arrays. This
  decision is out of scope.


Unresolved issues
=================

* Perhaps the bytes object should be implemented as an extension module
  until we are more sure of the design (similar to how the set object
  was prototyped).

* Should the bytes object implement the buffer interface? Probably, but
  we need to look into the implications of that (e.g. regex operations
  on byte arrays).

* Should the object implement __reversed__ and reverse? Should it
  implement sort?

* Need to clarify what some of the methods do. How are comparisons
  done? Hashing? Pickling and marshalling?


Questions and answers
=====================

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code. The construct
   bytes(s.encode(...)) is expensive because it has to copy the byte
   sequence multiple times. Also, Python generally provides two ways of
   converting an object of type A into an object of type B: ask an A
   instance to convert itself to a B, or ask the type B to create a new
   instance from an A. Depending on what A and B are, both APIs make
   sense; sometimes reasons of decoupling require that A can't know
   about B, in which case you have to use the latter approach; sometimes
   B can't know about A, in which case you have to use the former.


Q: Why does bytes ignore the encoding argument if the initialiser is a
   str?

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain. We need to assume that the
   programmer has provided a str object that already uses the desired
   encoding. If you need something other than a pure copy of the bytes
   then you need to first decode the string. For example:

       bytes(s.decode(encoding1), encoding2)


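The decode-then-encode idiom in the answer above can be sketched in Python 3 terms. This assumes that the Python 3 bytes type plays the role of the str object discussed here and that bytearray stands in for the proposed object; the sample data and the two encodings are chosen only for illustration.

```python
# "café" stored as Latin-1 bytes; the byte string itself carries no
# record of which encoding produced it.
raw_latin1 = b"caf\xe9"

# A pure copy keeps the bytes exactly as they are.
plain_copy = bytearray(raw_latin1)

# To change the encoding, decode to character data first, then build the
# byte container with the target encoding.
text = raw_latin1.decode("latin-1")
raw_utf8 = bytearray(text, "utf-8")
```
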
Q: Why not have the encoding argument default to Latin-1 (or some other
   encoding that covers the entire byte range) rather than ASCII?

A: The system default encoding for Python is ASCII. It seems least
   confusing to use that default. Also, in Py3k, using Latin-1 as
   the default might not be what users expect. For example, they might
   prefer a Unicode encoding. Any default will not always work as
   expected. At least ASCII will complain loudly if you try to encode
   non-ASCII data.


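The difference between a strict ASCII default and a permissive Latin-1 default can be seen in a short sketch. It assumes Python 3, where str holds character data, and the sample text is chosen only for illustration.

```python
text = "caf\u00e9"  # contains one non-ASCII character

# ASCII refuses the non-ASCII character instead of guessing.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Latin-1 and UTF-8 both succeed silently but produce different bytes,
# so neither is an obviously safe default.
latin1_bytes = text.encode("latin-1")
utf8_bytes = text.encode("utf-8")
```
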
Copyright
=========

This document has been placed in the public domain.



..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: