Add 'The "bytes" object' PEP.
parent b92889b332
commit cd7901e86e

@ -0,0 +1,215 @
PEP: 358
Title: The "bytes" object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract
========

This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode-based
str objects.


Motivation
==========

Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This overloading
of purpose leads to confusion and bugs. In future versions of Python,
string objects will be used for holding character data. The bytes object
will fulfil the role of a byte container. Eventually the unicode
built-in will be renamed to str and the current 8-bit str object will be
removed.


Specification
=============

A bytes object stores a mutable sequence of integers that are in the
range 0 to 255. Unlike string objects, indexing a bytes object returns
an integer. Assigning an element using an object that is not an integer
causes a TypeError exception. Assigning an element to a value outside
the range 0 to 255 causes a ValueError exception. The __len__ method of
bytes returns the number of integers stored in the sequence (i.e. the
number of bytes).

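The indexing and assignment rules above can be tried out with a
stand-in. The sketch below uses Python 3's built-in bytearray, which is
an assumption made purely for illustration: it is not the type this PEP
specifies, but it is likewise a mutable sequence of integers in the
range 0 to 255 and raises the exceptions described here. The helper
name try_assign is hypothetical.

```python
# Illustration only: bytearray (Python 3) stands in for the proposed
# bytes object; it is not the implementation described by this PEP.
data = bytearray([10, 20, 30])

first = data[0]       # indexing yields an int, not a one-character string
length = len(data)    # number of stored integers (i.e. the number of bytes)

data[1] = 255         # within 0..255, so the assignment is accepted


def try_assign(value):
    """Return the name of the exception raised by data[0] = value, or None."""
    try:
        data[0] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


not_an_int = try_assign("a")    # assigning a non-integer
out_of_range = try_assign(256)  # assigning a value outside 0..255
```

Under these assumptions, not_an_int reports a TypeError and out_of_range
reports a ValueError, matching the two failure cases in the paragraph
above.
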
The constructor of the bytes object has the following signature:

    bytes([initialiser[, encoding]])

If no arguments are provided then an object containing zero elements is
created and returned. The initialiser argument can be a string or a
sequence of integers. The pseudo-code for the constructor is:

    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if isinstance(initialiser, unicode):
                if encoding is None:
                    encoding = sys.getdefaultencoding()
                initialiser = initialiser.encode(encoding)
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object

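For readers who want something executable, the branches of the
pseudo-code can be approximated in Python 3. This is a sketch under
stated assumptions, not the PEP's implementation: make_bytes is a
hypothetical name, bytearray stands in for the proposed type, Python 3's
str plays the role of unicode, and Python 3's bytes plays the role of
the 8-bit str. Note also that sys.getdefaultencoding() returns "utf-8"
on Python 3, whereas this PEP assumes an ASCII default.

```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical helper mirroring the branches of the pseudo-code."""
    if isinstance(initialiser, str):
        # Corresponds to the "unicode" branch: text must be encoded first.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        return bytearray(initialiser.encode(encoding))
    if isinstance(initialiser, (bytes, bytearray)):
        # Corresponds to the 8-bit str branch: the bytes are copied as-is
        # and any encoding argument is ignored.
        return bytearray(initialiser)
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # Any other iterable of integers; bytearray enforces the 0..255 range.
    return bytearray(initialiser)


empty = make_bytes()
from_ints = make_bytes([10, 20, 30])
from_text = make_bytes("abc", "ascii")
from_raw = make_bytes(b"\xff", "ascii")  # encoding ignored for raw bytes
```
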
The __repr__ method returns a string that can be evaluated to generate a
new bytes object containing the same sequence of integers. The sequence
is represented by a list of ints. For example:

    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'

The object has a decode method equivalent to the decode method of the
str object. The object has a classmethod fromhex that takes a string of
characters from the set [0-9a-fA-F ] and returns a bytes object (similar
to binascii.unhexlify). For example:

    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])

The object has a hex method that does the reverse conversion (similar to
binascii.hexlify):

    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'

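The hex round trip can be checked against the binascii functions that
the text compares it to. The sketch below again assumes Python 3's
bytearray as a stand-in for the proposed type; its fromhex and hex
methods happen to follow the behaviour described here, but that is an
illustration rather than a statement about this PEP's implementation.

```python
import binascii

# Spaces between byte pairs are skipped by fromhex.
raw = bytearray.fromhex("5c 53 50 ff")
values = list(raw)
hex_text = raw.hex()

# The binascii functions the text refers to, for comparison.
same_values = list(binascii.unhexlify(b"5c5350ff"))
same_text = binascii.hexlify(raw).decode("ascii")
```
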
The bytes object has methods similar to the list object:

    __add__
    __contains__
    __delitem__
    __delslice__
    __eq__
    __ge__
    __getitem__
    __getslice__
    __gt__
    __hash__
    __iadd__
    __imul__
    __iter__
    __le__
    __len__
    __lt__
    __mul__
    __ne__
    __reduce__
    __reduce_ex__
    __repr__
    __rmul__
    __setitem__
    __setslice__
    append
    count
    extend
    index
    insert
    pop
    remove

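A few of the list-like methods named above can be exercised as
follows. As before, Python 3's bytearray is used as an assumed stand-in
for the proposed type; the comments show the contents after each step.

```python
b = bytearray([10, 20, 30])
b.append(40)            # [10, 20, 30, 40]
b.extend([50, 60])      # [10, 20, 30, 40, 50, 60]
b.insert(0, 5)          # [5, 10, 20, 30, 40, 50, 60]
last = b.pop()          # removes and returns the final element
b.remove(20)            # [5, 10, 30, 40, 50]

has_30 = 30 in b        # __contains__
where_40 = b.index(40)  # position of the first 40
count_5 = b.count(5)    # number of occurrences of 5
doubled = b * 2         # __mul__ repeats the sequence
del b[0]                # __delitem__: [10, 30, 40, 50]
```
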
Out of scope issues
===================

* If we provide a literal syntax for bytes then it should look distinctly
  different from the syntax for literal strings. Also, a new type, even
  built-in, is much less drastic than a new literal (which requires
  lexer and parser support in addition to everything else). Since there
  appears to be no immediate need for a literal representation,
  designing and implementing one is out of the scope of this PEP.

* Python 3k will have a much different I/O subsystem. Deciding how that
  I/O subsystem will work and interact with the bytes object is out of
  the scope of this PEP.

* It has been suggested that a special method named __bytes__ be added
  to the language to allow objects to be converted into byte arrays.
  This decision is out of scope.


Unresolved issues
=================

* Perhaps the bytes object should be implemented as an extension module
  until we are more sure of the design (similar to how the set object
  was prototyped).

* Should the bytes object implement the buffer interface? Probably, but
  we need to look into the implications of that (e.g. regex operations
  on byte arrays).

* Should the object implement __reversed__ and reverse? Should it
  implement sort?

* Need to clarify what some of the methods do. How are comparisons
  done? Hashing? Pickling and marshalling?


Questions and answers
=====================

Q: Why have the optional encoding argument when the encode method of
   Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str
   object and we cannot change that without breaking code. The construct
   bytes(s.encode(...)) is expensive because it has to copy the byte
   sequence multiple times. Also, Python generally provides two ways of
   converting an object of type A into an object of type B: ask an A
   instance to convert itself to a B, or ask the type B to create a new
   instance from an A. Depending on what A and B are, both APIs make
   sense; sometimes reasons of decoupling require that A can't know
   about B, in which case you have to use the latter approach; sometimes
   B can't know about A, in which case you have to use the former.


Q: Why does bytes ignore the encoding argument if the initialiser is a
   str?

A: There is no sane meaning that the encoding can have in that case.
   str objects *are* byte arrays and they know nothing about the
   encoding of character data they contain. We need to assume that the
   programmer has provided a str object that already uses the desired
   encoding. If you need something other than a pure copy of the bytes
   then you need to first decode the string. For example:

       bytes(s.decode(encoding1), encoding2)


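The decode-then-encode step in the answer above can be sketched in
Python 3, where a bytes value plays the role of the 8-bit str and
bytearray is an assumed stand-in for the proposed type. The sample data
and variable names are hypothetical.

```python
# "caf\u00e9" stored as Latin-1: the last character is one byte (0xE9).
latin1_data = b"caf\xe9"

# Python 3 spelling of bytes(s.decode(encoding1), encoding2):
text = latin1_data.decode("latin-1")
utf8_data = bytearray(text, "utf-8")

# A pure copy keeps the original byte values unchanged.
plain_copy = bytearray(latin1_data)
```

Here the plain copy keeps the four original bytes, while the transcoded
result has five, because UTF-8 encodes the accented character as two
bytes.
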
Q: Why not have the encoding argument default to Latin-1 (or some other
   encoding that covers the entire byte range) rather than ASCII?

A: The system default encoding for Python is ASCII. It seems least
   confusing to use that default. Also, in Py3k, using Latin-1 as
   the default might not be what users expect. For example, they might
   prefer a Unicode encoding. Any default will not always work as
   expected. At least ASCII will complain loudly if you try to encode
   non-ASCII data.


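The trade-off in this answer can be seen with a short experiment. The
sketch assumes Python 3 and uses bytearray as a stand-in for the
proposed type; the sample text is hypothetical.

```python
text = "caf\u00e9"   # contains one non-ASCII character

# ASCII refuses to guess and raises an error.
try:
    bytearray(text, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc.reason

# Other encodings succeed silently but produce different byte sequences.
latin1_bytes = bytearray(text, "latin-1")
utf8_bytes = bytearray(text, "utf-8")
```

Latin-1 and UTF-8 both accept the text without complaint yet disagree
about the resulting bytes, so whichever were chosen as the default
would be wrong for some users without any warning.
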
Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: