Reformat.

Neil Schemenauer 2006-02-22 20:49:37 +00:00
parent 2da9d64f5e
commit 61b80831fe
1 changed files with 139 additions and 141 deletions

@@ -12,197 +12,195 @@ Post-History:
Abstract
========
This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode based
str objects.
This PEP outlines the introduction of a raw bytes sequence object.
Adding the bytes object is one step in the transition to Unicode
based str objects.
Motivation
==========
Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This overloading
of purpose leads to confusion and bugs. In future versions of Python,
string objects will be used for holding character data. The bytes object
will fulfil the role of a byte container. Eventually the unicode
built-in will be renamed to str and the str object will be removed.
Python's current string objects are overloaded. They serve to hold
both sequences of characters and sequences of bytes. This
overloading of purpose leads to confusion and bugs. In future
versions of Python, string objects will be used for holding
character data. The bytes object will fulfil the role of a byte
container. Eventually the unicode built-in will be renamed to str
and the str object will be removed.
Specification
=============
A bytes object stores a mutable sequence of integers that are in the
range 0 to 255. Unlike string objects, indexing a bytes object returns
an integer. Assigning an element using an object that is not an integer
causes a TypeError exception. Assigning an element to a value outside
the range 0 to 255 causes a ValueError exception. The __len__ method of
bytes returns the number of integers stored in the sequence (i.e. the
number of bytes).
A bytes object stores a mutable sequence of integers that are in the
range 0 to 255. Unlike string objects, indexing a bytes object
returns an integer. Assigning an element using an object that is not
an integer causes a TypeError exception. Assigning an element to a
value outside the range 0 to 255 causes a ValueError exception. The
__len__ method of bytes returns the number of integers stored in the
sequence (i.e. the number of bytes).
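
As a rough illustration of these element rules, Python 3's built-in bytearray can serve as a stand-in. This is an assumption made for the example only; bytearray is not the object this PEP specifies, but it follows the same indexing and assignment rules described above:

```python
# Sketch only: bytearray is assumed as a stand-in for the proposed
# mutable bytes object.
data = bytearray([10, 20, 30])

first = data[0]        # indexing yields an int, not a 1-character string
length = len(data)     # number of stored integers (i.e. bytes)


def try_assign(seq, index, value):
    """Return the name of the exception raised by seq[index] = value, or None."""
    try:
        seq[index] = value
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


not_an_int = try_assign(data, 0, "a")   # value is not an integer
too_large = try_assign(data, 0, 256)    # value is outside 0..255
in_range = try_assign(data, 0, 255)     # accepted, data[0] becomes 255
```

The helper try_assign is hypothetical and exists only to record which exception, if any, each assignment raises.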
The constructor of the bytes object has the following signature:
The constructor of the bytes object has the following signature:
    bytes([initialiser[, encoding]])
    bytes([initialiser[, encoding]])
If no arguments are provided then an object containing zero elements is
created and returned. The initialiser argument can be a string or a
sequence of integers. The pseudo-code for the constructor is:
If no arguments are provided then an object containing zero elements
is created and returned. The initialiser argument can be a string
or a sequence of integers. The pseudo-code for the constructor is:
    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if isinstance(initialiser, unicode):
                if encoding is None:
                    encoding = sys.getdefaultencoding()
                initialiser = initialiser.encode(encoding)
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object
    def bytes(initialiser=[], encoding=None):
        if isinstance(initialiser, basestring):
            if isinstance(initialiser, unicode):
                if encoding is None:
                    encoding = sys.getdefaultencoding()
                initialiser = initialiser.encode(encoding)
            initialiser = [ord(c) for c in initialiser]
        elif encoding is not None:
            raise TypeError("explicit encoding invalid for non-string "
                            "initialiser")
        create bytes object and fill with integers from initialiser
        return bytes object
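
The pseudo-code above is written against Python 2 names (basestring, unicode). A minimal runnable sketch of the same control flow in Python 3 terms might look like the following. The function name make_bytes is hypothetical, and the mapping is an assumption: Python 3 str plays the role of unicode, Python 3 bytes plays the role of the old 8-bit str, and bytearray stands in for the proposed object. Note also that sys.getdefaultencoding() returns 'utf-8' on Python 3, not ASCII.

```python
import sys


def make_bytes(initialiser=(), encoding=None):
    """Hypothetical Python 3 rendering of the constructor pseudo-code."""
    if isinstance(initialiser, str):
        # Character data: encode it, falling back to the default encoding.
        if encoding is None:
            encoding = sys.getdefaultencoding()
        initialiser = initialiser.encode(encoding)
    elif isinstance(initialiser, bytes):
        # Already raw bytes (the old 8-bit str case): encoding is ignored.
        pass
    elif encoding is not None:
        raise TypeError("explicit encoding invalid for non-string initialiser")
    # bytearray itself rejects non-integers and values outside 0..255.
    return bytearray(initialiser)
```

Calling make_bytes() with no arguments returns an empty object, matching the rule that no arguments produce zero elements.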
The __repr__ method returns a string that can be evaluated to generate a
new bytes object containing the same sequence of integers. The sequence
is represented by a list of ints. For example:
The __repr__ method returns a string that can be evaluated to
generate a new bytes object containing the same sequence of
integers. The sequence is represented by a list of ints. For
example:
    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'
    >>> repr(bytes([10, 20, 30]))
    'bytes([10, 20, 30])'
The object has a decode method equivalent to the decode method of the
str object. The object has a classmethod fromhex that takes a string of
characters from the set [0-9a-fA-F ] and returns a bytes object (similar
to binascii.unhexlify). For example:
The object has a decode method equivalent to the decode method of
the str object. The object has a classmethod fromhex that takes a
string of characters from the set [0-9a-fA-F ] and returns a bytes
object (similar to binascii.unhexlify). For example:
    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c5350ff')
    bytes([92, 83, 80, 255])
    >>> bytes.fromhex('5c 53 50 ff')
    bytes([92, 83, 80, 255])
The object has a hex method that does the reverse conversion (similar to
binascii.hexlify):
The object has a hex method that does the reverse conversion
(similar to binascii.hexlify):
    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'
    >>> bytes([92, 83, 80, 255]).hex()
    '5c5350ff'
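
The fromhex and hex conversions can be tried out with Python 3's bytearray, which is assumed here as a stand-in for the proposed type, alongside the binascii functions the text compares them to:

```python
import binascii

# Sketch only: bytearray is assumed as a stand-in for the proposed type.
packed = bytearray.fromhex("5c 53 50 ff")   # spaces between byte pairs are skipped
values = list(packed)

hex_text = packed.hex()                      # the reverse conversion
same_via_binascii = binascii.hexlify(packed).decode("ascii")
round_trip = bytearray(binascii.unhexlify(hex_text))
```

One difference worth noting: binascii.hexlify returns a byte string rather than text, which is why the sketch decodes its result before comparing.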
The bytes object has methods similar to the list object:
The bytes object has methods similar to the list object:
__add__
__contains__
__delitem__
__delslice__
__eq__
__ge__
__getitem__
__getslice__
__gt__
__hash__
__iadd__
__imul__
__iter__
__le__
__len__
__lt__
__mul__
__ne__
__reduce__
__reduce_ex__
__repr__
__rmul__
__setitem__
__setslice__
append
count
extend
index
insert
pop
remove
__add__
__contains__
__delitem__
__delslice__
__eq__
__ge__
__getitem__
__getslice__
__gt__
__hash__
__iadd__
__imul__
__iter__
__le__
__len__
__lt__
__mul__
__ne__
__reduce__
__reduce_ex__
__repr__
__rmul__
__setitem__
__setslice__
append
count
extend
index
insert
pop
remove
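
A short sketch of the list-like methods in use, again assuming Python 3's bytearray as a stand-in for the proposed object (it supports most, though not necessarily all, of the methods listed above):

```python
# Sketch only: bytearray is assumed as a stand-in for the proposed type.
buf = bytearray([1, 2, 3])

buf.append(4)           # [1, 2, 3, 4]
buf.extend([5, 2])      # [1, 2, 3, 4, 5, 2]
buf.insert(0, 0)        # [0, 1, 2, 3, 4, 5, 2]
last = buf.pop()        # removes and returns the trailing 2
buf.remove(3)           # drops the first 3, leaving [0, 1, 2, 4, 5]

twos = buf.count(2)     # occurrences of the value 2
where_four = buf.index(4)
has_five = 5 in buf     # __contains__
doubled = buf * 2       # __mul__ returns a new, longer sequence

buf[1:3] = [9, 9]       # slice assignment, leaving [0, 9, 9, 4, 5]
del buf[0]              # __delitem__, leaving [9, 9, 4, 5]
```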
Out of scope issues
===================
* If we provide a literal syntax for bytes then it should look distinctly
different than the syntax for literal strings. Also, a new type, even
built-in, is much less drastic than a new literal (which requires
lexer and parser support in addition to everything else). Since there
appears to be no immediate need for a literal representation,
designing and implementing one is out of the scope of this PEP.
* If we provide a literal syntax for bytes then it should look
distinctly different than the syntax for literal strings. Also, a
new type, even built-in, is much less drastic than a new literal
(which requires lexer and parser support in addition to everything
else). Since there appears to be no immediate need for a literal
representation, designing and implementing one is out of the scope
of this PEP.
* Python 3k will have a much different I/O subsystem. Deciding how that
I/O subsystem will work and interact with the bytes object is out of
the scope of this PEP.
* Python 3k will have a much different I/O subsystem. Deciding how
that I/O subsystem will work and interact with the bytes object is
out of the scope of this PEP.
* It has been suggested that a special method named __bytes__ be added
to the language to allow objects to be converted into byte arrays. This
decision is out of scope.
* It has been suggested that a special method named __bytes__ be
added to the language to allow objects to be converted into byte
arrays. This decision is out of scope.
Unresolved issues
=================
* Perhaps the bytes object should be implemented as an extension module
until we are more sure of the design (similar to how the set object
was prototyped).
* Perhaps the bytes object should be implemented as an extension
module until we are more sure of the design (similar to how the
set object was prototyped).
* Should the bytes object implement the buffer interface? Probably, but
we need to look into the implications of that (e.g. regex operations
on byte arrays).
* Should the bytes object implement the buffer interface? Probably,
but we need to look into the implications of that (e.g. regex
operations on byte arrays).
* Should the object implement __reversed__ and reverse? Should it
implement sort?
* Should the object implement __reversed__ and reverse? Should it
implement sort?
* Need to clarify what some of the methods do. How are comparisons
done? Hashing? Pickling and marshalling?
* Need to clarify what some of the methods do. How are comparisons
done? Hashing? Pickling and marshalling?
Questions and answers
=====================
Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?
Q: Why have the optional encoding argument when the encode method of
Unicode objects does the same thing?
A: In the current version of Python, the encode method returns a str
object and we cannot change that without breaking code. The construct
bytes(s.encode(...)) is expensive because it has to copy the byte
sequence multiple times. Also, Python generally provides two ways of
converting an object of type A into an object of type B: ask an A
instance to convert itself to a B, or ask the type B to create a new
instance from an A. Depending on what A and B are, both APIs make
sense; sometimes reasons of decoupling require that A can't know
about B, in which case you have to use the latter approach; sometimes
B can't know about A, in which case you have to use the former.
A: In the current version of Python, the encode method returns a str
object and we cannot change that without breaking code. The
construct bytes(s.encode(...)) is expensive because it has to
copy the byte sequence multiple times. Also, Python generally
provides two ways of converting an object of type A into an
object of type B: ask an A instance to convert itself to a B, or
ask the type B to create a new instance from an A. Depending on
what A and B are, both APIs make sense; sometimes reasons of
decoupling require that A can't know about B, in which case you
have to use the latter approach; sometimes B can't know about A,
in which case you have to use the former.
Q: Why does bytes ignore the encoding argument if the initialiser is a
str?
Q: Why does bytes ignore the encoding argument if the initialiser is
a str?
A: There is no sane meaning that the encoding can have in that case.
str objects *are* byte arrays and they know nothing about the
encoding of character data they contain. We need to assume that the
programmer has provided a str object that already uses the desired
encoding. If you need something other than a pure copy of the bytes
then you need to first decode the string. For example:
A: There is no sane meaning that the encoding can have in that case.
str objects *are* byte arrays and they know nothing about the
encoding of character data they contain. We need to assume that
the programmer has provided a str object that already uses the
desired encoding. If you need something other than a pure copy of
the bytes then you need to first decode the string. For example:
bytes(s.decode(encoding1), encoding2)
bytes(s.decode(encoding1), encoding2)
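
The decode-then-re-encode pattern can be sketched in Python 3 terms as follows. The assumptions are that Python 3 bytes plays the role of the old 8-bit str, that bytearray stands in for the proposed type, and that Latin-1 and UTF-8 are arbitrary example choices for encoding1 and encoding2:

```python
# Sketch only, in Python 3 terms; the encodings are arbitrary examples.
raw_latin1 = b"caf\xe9"                  # "café" encoded as Latin-1

text = raw_latin1.decode("latin-1")      # s.decode(encoding1)
raw_utf8 = bytearray(text, "utf-8")      # bytes(..., encoding2)

plain_copy = bytearray(raw_latin1)       # no decode: a pure copy of the bytes
```

The pure copy keeps the single Latin-1 byte 0xE9, while the transcoded result holds the two-byte UTF-8 sequence for the same character.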
Q: Why not have the encoding argument default to Latin-1 (or some other
encoding that covers the entire byte range) rather than ASCII ?
Q: Why not have the encoding argument default to Latin-1 (or some
other encoding that covers the entire byte range) rather than
ASCII?
A: The system default encoding for Python is ASCII. It seems least
confusing to use that default. Also, in Py3k, using Latin-1 as
the default might not be what users expect. For example, they might
prefer a Unicode encoding. Any default will not always work as
expected. At least ASCII will complain loudly if you try to encode
non-ASCII data.
A: The system default encoding for Python is ASCII. It seems least
confusing to use that default. Also, in Py3k, using Latin-1 as
the default might not be what users expect. For example, they
might prefer a Unicode encoding. Any default will not always
work as expected. At least ASCII will complain loudly if you try
to encode non-ASCII data.
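
The "complain loudly" behaviour is easy to observe. In this sketch, which uses Python 3 str.encode and an arbitrary sample string, ASCII refuses non-ASCII input while Latin-1 and UTF-8 both succeed but produce different bytes:

```python
text = "caf\u00e9"   # contains one non-ASCII character

try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

latin1_bytes = text.encode("latin-1")   # succeeds silently
utf8_bytes = text.encode("utf-8")       # also succeeds, with different bytes
```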
Copyright
=========
This document has been placed in the public domain.
This document has been placed in the public domain.