PEP 467: try to streamline the proposal

This commit is contained in:
Nick Coghlan 2014-08-16 15:05:16 +10:00
parent 4820a784db
commit 19a8de5ca0
1 changed files with 119 additions and 116 deletions

View File

@ -20,163 +20,166 @@ that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.
This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make it easier to operate entirely in the binary
domain.
This PEP proposes four small adjustments to the APIs of the ``bytes``,
``bytearray`` and ``memoryview`` types to make it easier to operate entirely
in the binary domain:
Background
==========
To simplify the task of writing the Python 3 documentation, the ``bytes``
and ``bytearray`` types were documented primarily in terms of the way they
differed from the Unicode based Python 3 ``str`` type. Even when I
`heavily revised the sequence documentation
<http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that
simplifying shortcut.
However, it turns out that this approach to the documentation of these types
had a problem: it doesn't adequately introduce users to their hybrid nature,
where they can be manipulated *either* as a "sequence of integers" type,
*or* as ``str``-like types that assume ASCII compatible data.
That oversight has now been corrected, with the binary sequence types now
being documented entirely independently of the ``str`` documentation in
`Python 3.4+ <https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__
The confusion isn't just a documentation issue, however, as there are also
some lingering design quirks from an earlier pre-release design where there
was *no* separate ``bytearray`` type, and instead the core ``bytes`` type
was mutable (with no immutable counterpart).
Finally, additional experience with using the existing Python 3 binary
sequence types in real world applications has suggested it would be
beneficial to make it easier to convert integers to length 1 bytes objects.
* Deprecate passing single integer values to ``bytes`` and ``bytearray``
* Add ``bytes.zeros`` and ``bytearray.zeros`` alternative constructors
* Add ``bytes.byte`` and ``bytearray.byte`` alternative constructors
* Add ``bytes.iterbytes``, ``bytearray.iterbytes`` and
``memoryview.iterbytes`` alternative iterators
Proposals
=========
As a "consistency improvement" proposal, this PEP is actually about a few
smaller micro-proposals, each aimed at improving the usability of the binary
data model in Python 3. Proposals are motivated by one of two main factors:
Deprecation of current "zero-initialised sequence" behaviour
------------------------------------------------------------
* removing remnants of the original design of ``bytes`` as a mutable type
* allowing users to easily convert integer values to a length 1 ``bytes``
object
Currently, the ``bytes`` and ``bytearray`` constructors accept an integer
argument and interpret it as meaning to create a zero-initialised sequence
of the given size::
Alternate Constructors
----------------------
The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that some other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.
Compare::
>>> b"\x03" in bytes([1, 2, 3])
True
>>> 3 in bytes([1, 2, 3])
True
>>> bytes(b"\x03")
b'\x03'
>>> bytes(3)
b'\x00\x00\x00'
>>> bytearray(3)
bytearray(b'\x00\x00\x00')
This PEP proposes that the current handling of integers in the bytes and
bytearray constructors by deprecated in Python 3.5 and targeted for
removal in Python 3.7, being replaced by two more explicit alternate
constructors provided as class methods. The initial python-ideas thread
[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
this constructor behaviour.
This PEP proposes to deprecate that behaviour in Python 3.5, and remove it
entirely in Python 3.6.
Firstly, a ``byte`` constructor is proposed that converts integers
in the range 0 to 255 (inclusive) to a ``bytes`` object::
No other changes are proposed to the existing constructors.
>>> bytes.byte(3)
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')
>>> bytes.byte(512)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)
One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while
``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
``bytearray.byte`` are the counterparts for binary input.
Addition of explicit "zero-initialised sequence" constructors
-------------------------------------------------------------
Secondly, a ``zeros`` constructor is proposed that serves as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::
To replace the deprecated behaviour, this PEP proposes the addition of an
explicit ``zeros`` alternative constructor as a class method on both
``bytes`` and ``bytearray``::
>>> bytes.zeros(3)
b'\x00\x00\x00'
>>> bytearray.zeros(3)
bytearray(b'\x00\x00\x00')
The chosen name here is taken from the corresponding initialisation function
in NumPy (although, as these are sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape tuple)
It will behave just as the current constructors behave when passed a single
integer.
While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
useful duo amongst the new constructors, ``bytes.zeros`` and
`bytearray.byte`` are provided in order to maintain API consistency between
the two types.
The specific choice of ``zeros`` as the alternative constructor name is taken
from the corresponding initialisation function in NumPy (although, as these
are 1-dimensional sequence types rather than N-dimensional matrices, the
constructors take a length as input rather than a shape tuple)
Iteration
---------
Addition of explicit "single byte" constructors
-----------------------------------------------
While iteration over ``bytes`` objects and other binary sequences produces
integers, it is sometimes desirable to iterate over length 1 bytes objects
instead.
As binary counterparts to the text ``chr`` function, this PEP proposes the
addition of an explicit ``byte`` alternative constructor as a class method
on both ``bytes`` and ``bytearray``::
To handle this situation more obviously (and more efficiently) than would be
the case with the ``map(bytes.byte, data)`` construct enabled by the above
constructor changes, this PEP proposes the addition of a new ``iterbytes``
method to ``bytes``, ``bytearray`` and ``memoryview``::
>>> bytes.byte(3)
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')
These methods will only accept integers in the range 0 to 255 (inclusive)::
>>> bytes.byte(512)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)
>>> bytes.byte(1.0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'float' object cannot be interpreted as an integer
The documentation of the ``ord`` builtin will be updated to explicitly note
that ``bytes.byte`` is the inverse operation for binary data, while ``chr``
is the inverse operation for text data.
Behaviourally, ``bytes.byte(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``). The new spelling is
expected to be easier to discover and easier to read (especially when used
in conjunction with indexing operations on binary sequence types).
As a separate method, the new spelling will also work better with higher
order functions like ``map``.
Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------
This PEP proposes that ``bytes``, ``bytearray`` and ``memoryview`` gain an
optimised ``iterbytes`` method that produces length 1 ``bytes`` objects
rather than integers::
for x in data.iterbytes():
# x is a length 1 ``bytes`` object, rather than an integer
Third party types and arbitrary containers of integers that lack the new
method can still be handled by combining ``map`` with the new
``bytes.byte()`` alternate constructor proposed above::
The method can be used with arbitrary buffer exporting objects by wrapping
them in a ``memoryview`` instance first::
for x in map(bytes.byte, data):
for x in memoryview(data).iterbytes():
# x is a length 1 ``bytes`` object, rather than an integer
# This works with *any* container of integers in the range
# 0 to 255 inclusive
For ``memoryview``, the semantics of ``iterbytes()`` are defined such that::
memview.tobytes() == b''.join(memview.iterbytes())
This allows the raw bytes of the memory view to be iterated over without
needing to make a copy, regardless of the defined shape and format.
The main advantage this method offers over the ``map(bytes.byte, data)``
approach is that it is guaranteed *not* to fail midstream with a
``ValueError`` or ``TypeError``. By contrast, when using the ``map`` based
approach, the type and value of the individual items in the iterable are
only checked as they are retrieved and passed through the ``bytes.byte``
constructor.
Open questions
^^^^^^^^^^^^^^
Design discussion
=================
* The fallback case above suggests that this could perhaps be better handled
as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
if defined, but otherwise fell back to ``map(bytes.byte, data)``::
Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------
for x in iterbytes(data):
# x is a length 1 ``bytes`` object, rather than an integer
# This works with *any* container of integers in the range
# 0 to 255 inclusive
Zero-initialised sequences can be created via sequence repetition::
>>> b'\x00' * 3
b'\x00\x00\x00'
>>> bytearray(b'\x00') * 3
bytearray(b'\x00\x00\x00')
However, this was also the case when the ``bytearray`` type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable ``bytes`` type then inherited that feature
when it was introduced in PEP 3137.
This PEP isn't revisiting that original design decision, just changing the
spelling as users sometimes find the current behaviour of the binary sequence
constructors surprising. In particular, there's a reasonable case to be made
that ``bytes(x)`` (where ``x`` is an integer) should behave like the
``bytes.byte(x)`` proposal in this PEP. Providing both behaviours as separate
class methods avoids that ambiguity.
References
==========
.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895
.. [GvR-initial-feedback] https://mail.python.org/pipermail/python-ideas/2014-March/027376.html
.. [1] Initial March 2014 discussion thread on python-ideas
(https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
.. [2] Guido's initial feedback in that thread
(https://mail.python.org/pipermail/python-ideas/2014-March/027376.html)
.. [3] Issue proposing moving zero-initialised sequences to a dedicated API
(http://bugs.python.org/issue20895)
.. [4] Issue proposing to use calloc() for zero-initialised binary sequences
(http://bugs.python.org/issue21644)
.. [5] August 2014 discussion thread on python-dev
(https://mail.python.org/pipermail/python-ideas/2014-March/027295.html)
Copyright