PEP 467: descope dramatically based on Guido's feedback

Nick Coghlan 2014-04-03 22:33:36 +10:00
parent 199858bf8e
commit 56d64df53d
1 changed files with 96 additions and 209 deletions

@ -22,28 +22,35 @@ the binary domain in Python have also evolved over the course of the Python
This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make their behaviour more internally consistent
and to make it easier to operate entirely in the binary domain.
and to make it easier to operate entirely in the binary domain, as well as
changes to their documentation to make it easier to grasp their dual roles
as containers of "arbitrary binary data" and "binary data with ASCII
compatible segments".
Background
==========
Over the course of Python 3's evolution, a number of adjustments have been
made to the core ``bytes`` and ``bytearray`` types as additional practical
experience was gained with using them in code beyond the Python 3 standard
library and test suite. However, to date, these changes have been made
on a relatively ad hoc tactical basis as specific issues were identified,
rather than as part of a systematic review of the APIs of these types. This
approach has allowed inconsistencies to creep into the API design as to which
input types are accepted by different methods. Additional inconsistencies
linger from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart), as well as from the origins of these types in
the text-like behaviour of the Python 2 ``str`` type.
To simplify the task of writing the Python 3 documentation, the ``bytes``
and ``bytearray`` types were documented primarily in terms of the way they
differed from the Unicode based Python 3 ``str`` type. Even when I
`heavily revised the sequence documentation
<http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that
simplifying shortcut.
This PEP aims to provide the missing systematic review, with the goal of
ensuring that wherever feasible (given backwards compatibility constraints)
these current inconsistencies are addressed for the Python 3.5 release.
However, it turns out that this approach to the documentation of these types
has a problem: it doesn't adequately introduce users to their hybrid nature,
where they can be manipulated *either* as a "sequence of integers" type,
*or* as ``str``-like types that assume ASCII compatible data.
In addition to the documentation issues, there are some lingering design
quirks from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart).
Finally, additional experience with using the existing Python 3 binary
sequence types in real world applications has suggested it would be
beneficial to make it easier to convert integers to length 1 bytes objects.
Proposals
@ -55,10 +62,13 @@ the binary data model in Python 3. Proposals are motivated by one of three
factors:
* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an
integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes``
* allowing users to easily convert integer values to a length 1 ``bytes``
object
* consistently applying the following analogies to the type API designs
and documentation:
* ``bytes``: tuple of integers, with additional str-like methods
* ``bytearray``: list of integers, with additional str-like methods
Alternate Constructors
@ -83,95 +93,69 @@ Compare::
b'\x00\x00\x00'
This PEP proposes that the current handling of integers in the bytes and
bytearray constructors be deprecated in Python 3.5 and removed in Python
3.6, being replaced by two more type appropriate alternate constructors
provided as class methods. The initial python-ideas thread [ideas-thread1]_
that spawned this PEP was specifically aimed at deprecating this constructor
behaviour.
bytearray constructors be deprecated in Python 3.5 and targeted for
removal in Python 3.7, being replaced by two more explicit alternate
constructors provided as class methods. The initial python-ideas thread
[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
this constructor behaviour.
For ``bytes``, a ``byte`` constructor is proposed that converts integers
(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
object, converts objects that support the buffer API to bytes, and also
passes through length 1 byte strings unchanged::
Firstly, a ``byte`` constructor is proposed that converts integers
in the range 0 to 255 (inclusive) to a ``bytes`` object::
>>> bytes.byte(3)
b'\x03'
>>> bytes.byte(bytearray(bytes([3])))
b'\x03'
>>> bytes.byte(memoryview(bytes([3])))
b'\x03'
>>> bytes.byte(bytes([3]))
b'\x03'
>>> bytearray.byte(3)
bytearray(b'\x03')
>>> bytes.byte(512)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: bytes must be in range(0, 256)
>>> bytes.byte(b"ab")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: bytes.byte() expected a byte, but buffer of length 2 found
One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while
``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
``bytearray.byte`` are the counterparts for binary input.
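The round trip between indexing, the proposed constructor and ``ord()`` can be approximated in current Python. The following is only a sketch: ``byte`` is a hypothetical stand-in for the proposed ``bytes.byte()``, built on the existing ``bytes([n])`` spelling, and is not the implementation this PEP specifies.

```python
def byte(value):
    """Hypothetical stand-in for the proposed bytes.byte()."""
    # bytes([n]) already raises ValueError when n is outside range(0, 256)
    return bytes([value])


data = b"abc"
first = data[0]                    # indexing a bytes object yields an int
assert first == 97
assert byte(first) == b"a"         # int -> length 1 bytes object
assert ord(byte(first)) == first   # ord() performs the reverse conversion
```

Out-of-range input such as ``byte(512)`` raises ``ValueError`` in this sketch, matching the behaviour shown for the proposed constructor.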
For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
the buffer filled with a particular value (defaulting to ``0``) as a direct
Secondly, a ``zeros`` constructor is proposed that serves as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::
>>> bytearray.from_len(3)
>>> bytes.zeros(3)
b'\x00\x00\x00'
>>> bytearray.zeros(3)
bytearray(b'\x00\x00\x00')
>>> bytearray.from_len(3, 6)
bytearray(b'\x06\x06\x06')
This part of the proposal was covered by an existing issue
[empty-buffer-issue]_ and a variety of names have been proposed
(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
specific name currently proposed was chosen by analogy with
``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
explicit that it is an alternate constructor rather than an in-place
mutation, as well as how it differs from the standard constructor.
The chosen name here is taken from the corresponding initialisation function
in NumPy (although, as these are sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape tuple).
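For comparison, the zero-filled result can already be produced in two ways today: the integer form of the constructor and sequence repetition. The sketch below wraps the latter in a hypothetical ``zeros`` helper that takes the target type as its first argument. The helper name and signature are illustrative assumptions only, since the proposal itself is for class methods.

```python
def zeros(cls, length):
    """Hypothetical stand-in for the proposed bytes.zeros() / bytearray.zeros()."""
    if length < 0:
        # Sequence repetition silently treats negative counts as zero,
        # so reject them explicitly, as the integer constructor does.
        raise ValueError("negative count")
    return cls(b"\x00") * length


# Same result as the current integer constructor behaviour
assert zeros(bytes, 3) == bytes(3) == b"\x00\x00\x00"
assert zeros(bytearray, 3) == bytearray(3) == bytearray(b"\x00\x00\x00")
```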
Open questions
^^^^^^^^^^^^^^
* Should ``bytearray.byte()`` also be added? Or is
``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition
sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
sequences with more than one element? The ``TypeError`` currently proposed
is copied (with slightly improved wording) from the behaviour of ``ord()``
with sequences containing more than one code point, while ``ValueError``
would be more consistent with the existing handling of out-of-range
integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
as individual bytes, but this is currently inconsistent with the main
``bytes`` constructor::
>>> bytes([b"a", b"b", b"c"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bytes' object cannot be interpreted as an integer
Should the ``bytes`` constructor be changed to accept iterables of length 1
bytes objects in addition to iterables of integers? If so, should it
allow a mixture of the two in a single iterable?
While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
useful duo amongst the new constructors, ``bytes.zeros`` and
``bytearray.byte`` are provided in order to maintain API consistency between
the two types.
Iteration
---------
Iteration over ``bytes`` objects and other binary sequences produces
integers. Rather than proposing a new method that would need to be added
not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
to third party types as well, this PEP proposes that iteration to produce
length 1 ``bytes`` objects instead be handled by combining ``map`` with
the new ``bytes.byte()`` alternate constructor proposed above::
While iteration over ``bytes`` objects and other binary sequences produces
integers, it is sometimes desirable to iterate over length 1 bytes objects
instead.
To handle this situation more obviously (and more efficiently) than would be
the case with the ``map(bytes.byte, data)`` construct enabled by the above
constructor changes, this PEP proposes the addition of a new ``iterbytes``
method to ``bytes``, ``bytearray`` and ``memoryview``::
for x in data.iterbytes():
# x is a length 1 ``bytes`` object, rather than an integer
Third party types and arbitrary containers of integers that lack the new
method can still be handled by combining ``map`` with the new
``bytes.byte()`` alternate constructor proposed above::
for x in map(bytes.byte, data):
# x is a length 1 ``bytes`` object, rather than an integer
@ -179,139 +163,42 @@ the new ``bytes.byte()`` alternate constructor proposed above::
# 0 to 255 inclusive
Consistent support for different input types
--------------------------------------------
Open questions
^^^^^^^^^^^^^^
The ``bytes`` and ``bytearray`` methods inspired by the Python 2 ``str``
type generally expect to operate on binary subsequences: other objects
implementing the buffer API. By contrast, the mutating APIs added to
the ``bytearray`` interface expect to operate on individual elements:
integers in the range 0 to 255 (inclusive).
* The fallback case above suggests that this could perhaps be better handled
as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
if defined, but otherwise fell back to ``map(bytes.byte, data)``::
In Python 3.3, the binary search operations (``in``, ``count()``,
``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
accept integers in the range 0 to 255 (inclusive) as their first argument,
in addition to the existing support for binary subsequences.
This results in behaviour like the following in Python 3.3+::
>>> data = bytes([1, 2, 3, 4])
>>> 3 in data
True
>>> b"\x03" in data
True
>>> data.count(3)
1
>>> data.count(b"\x03")
1
>>> data.replace(3, 4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: expected bytes, bytearray or buffer compatible object
>>> data.replace(b"\x03", b"\x04")
b'\x01\x02\x04\x04'
>>> mutable = bytearray(data)
>>> mutable
bytearray(b'\x01\x02\x03\x04')
>>> mutable.append(b"\x05")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: an integer is required
>>> mutable.append(5)
>>> mutable
bytearray(b'\x01\x02\x03\x04\x05')
for x in iterbytes(data):
# x is a length 1 ``bytes`` object, rather than an integer
# This works with *any* container of integers in the range
# 0 to 255 inclusive
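The builtin floated in this open question could look roughly like the sketch below. It is not a specified design: ``iterbytes`` and ``__iterbytes__`` are the hypothetical names used above, the method lookup on the type is an assumption modelled on how other special methods are found, and ``bytes([n])`` stands in for the proposed ``bytes.byte(n)``.

```python
def iterbytes(data):
    """Hypothetical builtin: iterate over data as length 1 bytes objects."""
    # Look the hook up on the type, as is done for other special methods
    method = getattr(type(data), "__iterbytes__", None)
    if method is not None:
        return method(data)
    # Fallback for any iterable of integers in the range 0 to 255 inclusive
    return (bytes([n]) for n in data)


assert list(iterbytes(b"abc")) == [b"a", b"b", b"c"]
assert list(iterbytes(bytearray(b"ab"))) == [b"a", b"b"]
assert list(iterbytes([1, 2, 255])) == [b"\x01", b"\x02", b"\xff"]
```

Because no existing type defines ``__iterbytes__``, every call above takes the fallback path. A type that wanted a faster implementation would supply the hook itself.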
This PEP proposes extending the behaviour of accepting integers as being
equivalent to the corresponding length 1 binary sequence to several other
``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
object for certain parameters. In essence, if a value is an acceptable
input to the new ``bytes.byte`` constructor defined above, then it would
be acceptable in the roles defined here (in addition to any other already
supported inputs):
Documentation clarifications
----------------------------
* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)
In an attempt to clarify the `documentation
<https://docs.python.org/dev/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__
of the ``bytes`` and ``bytearray`` types, the following changes are
proposed:
* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character
* the documentation of the *sequence* behaviour of each type is moved to
the section for that individual type. These sections will be updated to
explicitly make the ``tuple of integers`` and ``list of integers``
analogies, as well as to make it clear that these parts of the API work
with arbitrary binary data
* the current "Bytes and bytearray operations" section will be updated to
"Handling binary data with ASCII compatible segments", and will explicitly
list *all* of the methods that are included.
* clarify that due to their origins in the API of the immutable ``str``
type, even the ``bytearray`` versions of these methods do *not* operate
in place, but instead create a new object.
* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip
* ``partition()`` separator argument
* ``rpartition()`` separator argument
* ``split()`` separator argument
* ``rsplit()`` separator argument
* ``replace()`` old value and new value
In addition to the consistency motive, this approach also makes it easier
to work with the indexing behaviour, as the result of an indexing operation
can more easily be fed back in to other methods.
For ``bytearray``, some additional changes are proposed to the current
integer based operations to ensure they remain consistent with the proposed
constructor changes:
* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
The general principle behind these changes is to restore the flexible
"element-or-subsequence" behaviour seen in the ``str`` API, even though
Python 3 actually represents subsequences and individual elements as
distinct types in the binary domain.
Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------
Several of the ``bytes`` and ``bytearray`` methods have their origins in the
Python 2 ``str`` API. As ``str`` is an immutable type, all of these
operations are defined as returning a *new* instance, rather than operating
in place. This contrasts with methods on other mutable types like ``list``,
where ``list.sort()`` and ``list.reverse()`` operate in-place and return
``None``, rather than creating a new object.
Backwards compatibility constraints make it impractical to change this
behaviour at this point, but it may be appropriate to explicitly call out
this quirk in the documentation for the ``bytearray`` type. It affects the
following methods that could reasonably be expected to operate in-place on
a mutable type:
* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``
Note that the following ``bytearray`` operations *do* operate in place, as
they're part of the mutable sequence API in ``bytearray``, rather than being
inspired by the immutable Python 2 ``str`` API:
* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``
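The difference between the two groups of methods can be observed directly in current Python. This short illustration uses only existing ``bytearray`` behaviour; the variable names are arbitrary.

```python
buf = bytearray(b"abc")

# str-inspired methods return a new object and leave the original untouched
upper = buf.upper()
assert upper == bytearray(b"ABC")
assert upper is not buf
assert buf == bytearray(b"abc")

# Mutable sequence operations modify the existing object in place
alias = buf
buf += b"d"
buf.append(ord("e"))
buf.reverse()
assert alias is buf
assert buf == bytearray(b"edcba")
```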
A patch for at least this part of the proposal will be prepared before
submitting the PEP for approval, as writing out these docs completely may
suggest additional opportunities for API consistency improvements.
References