PEP 467: descope dramatically based on Guido's feedback
parent 199858bf8e
commit 56d64df53d
305
pep-0467.txt
@@ -22,28 +22,35 @@ the binary domain in Python have also evolved over the course of the Python

This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make their behaviour more internally consistent
and to make it easier to operate entirely in the binary domain.
and to make it easier to operate entirely in the binary domain, as well as
changes to their documentation to make it easier to grasp their dual roles
as containers of "arbitrary binary data" and "binary data with ASCII
compatible segments".


Background
==========

Over the course of Python 3's evolution, a number of adjustments have been
made to the core ``bytes`` and ``bytearray`` types as additional practical
experience was gained with using them in code beyond the Python 3 standard
library and test suite. However, to date, these changes have been made
on a relatively ad hoc tactical basis as specific issues were identified,
rather than as part of a systematic review of the APIs of these types. This
approach has allowed inconsistencies to creep into the API design as to which
input types are accepted by different methods. Additional inconsistencies
linger from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart), as well as from the origins of these types in
the text-like behaviour of the Python 2 ``str`` type.
To simplify the task of writing the Python 3 documentation, the ``bytes``
and ``bytearray`` types were documented primarily in terms of the way they
differed from the Unicode based Python 3 ``str`` type. Even when I
`heavily revised the sequence documentation
<http://hg.python.org/cpython/rev/463f52d20314>`__ in 2012, I retained that
simplifying shortcut.

This PEP aims to provide the missing systematic review, with the goal of
ensuring that wherever feasible (given backwards compatibility constraints)
these current inconsistencies are addressed for the Python 3.5 release.
However, it turns out that this approach to the documentation of these types
has a problem: it doesn't adequately introduce users to their hybrid nature,
where they can be manipulated *either* as a "sequence of integers" type,
*or* as ``str``-like types that assume ASCII compatible data.

In addition to the documentation issues, there are some lingering design
quirks from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart).

Finally, additional experience with using the existing Python 3 binary
sequence types in real world applications has suggested it would be
beneficial to make it easier to convert integers to length 1 bytes objects.


Proposals
@@ -55,10 +62,13 @@ the binary data model in Python 3. Proposals are motivated by one of three
factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an
  integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes``
* allowing users to easily convert integer values to a length 1 ``bytes``
  object
* consistently applying the following analogies to the type API designs
  and documentation:

  * ``bytes``: tuple of integers, with additional str-like methods
  * ``bytearray``: list of integers, with additional str-like methods
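As a rough illustration of the two analogies above, the following sketch uses only behaviour that already exists in Python 3 (none of the proposed APIs). The variable names are arbitrary and chosen purely for this example:

```python
# bytes: an immutable sequence of integers, much like a tuple
data = bytes([72, 105])
as_tuple = tuple(data)        # the same values as a plain tuple of ints
first = data[0]               # indexing yields an int, not a bytes object

try:
    data[0] = 74              # item assignment is not supported on bytes
    bytes_are_mutable = True
except TypeError:
    bytes_are_mutable = False

# bytearray: a mutable sequence of integers, much like a list
buf = bytearray(data)
buf[0] = 74                   # in-place item assignment works
buf.append(33)                # as do list-style mutating methods

# Both types additionally offer str-like methods that assume
# ASCII compatible data
upper = data.upper()
```

The ``upper()`` call at the end is the "additional str-like methods" half of each analogy: it only makes sense when the data contains ASCII compatible segments.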


Alternate Constructors
@@ -83,95 +93,69 @@ Compare::
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the bytes and
bytearray constructors be deprecated in Python 3.5 and removed in Python
3.6, being replaced by two more type appropriate alternate constructors
provided as class methods. The initial python-ideas thread [ideas-thread1]_
that spawned this PEP was specifically aimed at deprecating this constructor
behaviour.
bytearray constructors be deprecated in Python 3.5 and targeted for
removal in Python 3.7, being replaced by two more explicit alternate
constructors provided as class methods. The initial python-ideas thread
[ideas-thread1]_ that spawned this PEP was specifically aimed at deprecating
this constructor behaviour.

For ``bytes``, a ``byte`` constructor is proposed that converts integers
(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
object, converts objects that support the buffer API to bytes, and also
passes through length 1 byte strings unchanged::
Firstly, a ``byte`` constructor is proposed that converts integers
in the range 0 to 255 (inclusive) to a ``bytes`` object::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytearray.byte(3)
    bytearray(b'\x03')
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found

One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
The ``ord()`` documentation will also be updated to note that while
``chr()`` is the counterpart for ``str`` input, ``bytes.byte`` and
``bytearray.byte`` are the counterparts for binary input.
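Since ``bytes.byte()`` is only a proposal, the indexing use case can be approximated today with a small helper. The sketch below is a hypothetical stand-in (the module-level function name ``byte`` is an assumption made for this example) covering only the integer case, and it shows ``ord()`` acting as the reverse conversion:

```python
import operator


def byte(value):
    """Hypothetical stand-in for the proposed ``bytes.byte()`` (integers only)."""
    index = operator.index(value)      # rejects non-integer objects with TypeError
    if not 0 <= index <= 255:
        raise ValueError("bytes must be in range(0, 256)")
    return bytes([index])


data = b"abc"
first = byte(data[0])      # data[0] is an int; byte() turns it into a length 1 bytes object
round_trip = ord(first)    # ord() converts the length 1 bytes object back to the int
```

The explicit range check mirrors the error message shown in the examples above; ``bytes([index])`` alone would also reject out-of-range values.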

For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
the buffer filled with a particular value (defaulting to ``0``) as a direct
Secondly, a ``zeros`` constructor is proposed that serves as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    >>> bytes.zeros(3)
    b'\x00\x00\x00'
    >>> bytearray.zeros(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')

This part of the proposal was covered by an existing issue
[empty-buffer-issue]_ and a variety of names have been proposed
(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
specific name currently proposed was chosen by analogy with
``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
explicit that it is an alternate constructor rather than an in-place
mutation, as well as how it differs from the standard constructor.
The chosen name here is taken from the corresponding initialisation function
in NumPy (although, as these are sequence types rather than N-dimensional
matrices, the constructors take a length as input rather than a shape tuple).
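For comparison, the three spellings discussed here can be placed side by side. The ``zeros`` helper below is a hypothetical module-level stand-in for the proposed class methods (its ``kind`` parameter is an assumption made for this example); the other two lines use behaviour that exists in Python 3 today:

```python
def zeros(count, kind=bytes):
    """Hypothetical stand-in for the proposed ``zeros`` alternate constructors."""
    if count < 0:
        raise ValueError("negative count")
    return kind(b"\x00" * count)


via_constructor = bytearray(3)     # integer argument: the behaviour proposed for deprecation
via_repetition = b"\x00" * 3       # sequence repetition, the existing alternative
via_helper = zeros(3, bytearray)   # explicit, NumPy-style spelling
```

All three produce three zero bytes; the difference is purely in how clearly the call site states that intent.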


Open questions
^^^^^^^^^^^^^^

* Should ``bytearray.byte()`` also be added? Or is
  ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition
  sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
  sequences with more than one element? The ``TypeError`` currently proposed
  is copied (with slightly improved wording) from the behaviour of ``ord()``
  with sequences containing more than one code point, while ``ValueError``
  would be more consistent with the existing handling of out-of-range
  integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
  as individual bytes, but this is currently inconsistent with the main
  ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1
  bytes objects in addition to iterables of integers? If so, should it
  allow a mixture of the two in a single iterable?
|
||||
While ``bytes.byte`` and ``bytearray.zeros`` are expected to be the more
|
||||
useful duo amongst the new constructors, ``bytes.zeros`` and
|
||||
`bytearray.byte`` are provided in order to maintain API consistency between
|
||||
the two types.
|
||||
|
||||
|
||||
Iteration
|
||||
---------
|
||||
|
||||
Iteration over ``bytes`` objects and other binary sequences produces
|
||||
integers. Rather than proposing a new method that would need to be added
|
||||
not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
|
||||
to third party types as well, this PEP proposes that iteration to produce
|
||||
length 1 ``bytes`` objects instead be handled by combining ``map`` with
|
||||
the new ``bytes.byte()`` alternate constructor proposed above::
|
||||
While iteration over ``bytes`` objects and other binary sequences produces
|
||||
integers, it is sometimes desirable to iterate over length 1 bytes objects
|
||||
instead.
|
||||
|
||||
To handle this situation more obviously (and more efficiently) than would be
|
||||
the case with the ``map(bytes.byte, data)`` construct enabled by the above
|
||||
constructor changes, this PEP proposes the addition of a new ``iterbytes``
|
||||
method to ``bytes``, ``bytearray`` and ``memoryview``::
|
||||
|
||||
for x in data.iterbytes():
|
||||
# x is a length 1 ``bytes`` object, rather than an integer
|
||||
|
||||
Third party types and arbitrary containers of integers that lack the new
|
||||
method can still be handled by combining ``map`` with the new
|
||||
``bytes.byte()`` alternate constructor proposed above::
|
||||
|
||||
for x in map(bytes.byte, data):
|
||||
# x is a length 1 ``bytes`` object, rather than an integer
|
||||
|
@@ -179,139 +163,42 @@ the new ``bytes.byte()`` alternate constructor proposed above::
        # 0 to 255 inclusive


Consistent support for different input types
--------------------------------------------
Open questions
^^^^^^^^^^^^^^

The ``bytes`` and ``bytearray`` methods inspired by the Python 2 ``str``
type generally expect to operate on binary subsequences: other objects
implementing the buffer API. By contrast, the mutating APIs added to
the ``bytearray`` interface expect to operate on individual elements:
integers in the range 0 to 255 (inclusive).
* The fallback case above suggests that this could perhaps be better handled
  as an ``iterbytes(data)`` *builtin*, that used ``data.__iterbytes__()``
  if defined, but otherwise fell back to ``map(bytes.byte, data)``::

In Python 3.3, the binary search operations (``in``, ``count()``,
``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
accept integers in the range 0 to 255 (inclusive) as their first argument,
in addition to the existing support for binary subsequences.

This results in behaviour like the following in Python 3.3+::

    >>> data = bytes([1, 2, 3, 4])
    >>> 3 in data
    True
    >>> b"\x03" in data
    True
    >>> data.count(3)
    1
    >>> data.count(b"\x03")
    1

    >>> data.replace(3, 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: expected bytes, bytearray or buffer compatible object
    >>> data.replace(b"\x03", b"\x04")
    b'\x01\x02\x04\x04'

    >>> mutable = bytearray(data)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04')
    >>> mutable.append(b"\x05")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required
    >>> mutable.append(5)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04\x05')
    for x in iterbytes(data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
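The builtin floated in this open question could be prototyped in pure Python roughly as follows. This is only a sketch under stated assumptions: ``__iterbytes__`` is the hook named in the question (no existing type defines it), and ``bytes([value])`` stands in for the proposed ``bytes.byte`` in the fallback path:

```python
def iterbytes(data):
    """Hypothetical ``iterbytes`` builtin: iterate over length 1 bytes objects."""
    # Look the hook up on the type, as is usual for special methods
    method = getattr(type(data), "__iterbytes__", None)
    if method is not None:
        return method(data)
    # Fallback for any iterable of integers in the range 0 to 255 inclusive,
    # equivalent in spirit to map(bytes.byte, data)
    return (bytes([value]) for value in data)


chunks = list(iterbytes(b"abc"))
from_list = list(iterbytes([0, 255]))
from_view = list(iterbytes(memoryview(b"hi")))
```

Because the fallback only relies on iteration yielding integers, it handles ``bytes``, ``bytearray``, ``memoryview`` and plain lists of integers alike.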


This PEP proposes extending the behaviour of accepting integers as being
equivalent to the corresponding length 1 binary sequence to several other
``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
object for certain parameters. In essence, if a value is an acceptable
input to the new ``bytes.byte`` constructor defined above, then it would
be acceptable in the roles defined here (in addition to any other already
supported inputs):
Documentation clarifications
----------------------------

* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)
In an attempt to clarify the `documentation
<https://docs.python.org/dev/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>`__
of the ``bytes`` and ``bytearray`` types, the following changes are
proposed:

* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character
* the documentation of the *sequence* behaviour of each type is moved to
  the section for that individual type. These sections will be updated to
  explicitly make the ``tuple of integers`` and ``list of integers``
  analogies, as well as to make it clear that these parts of the API work
  with arbitrary binary data
* the current "Bytes and bytearray operations" section will be updated to
  "Handling binary data with ASCII compatible segments", and will explicitly
  list *all* of the methods that are included.
* clarify that due to their origins in the API of the immutable ``str``
  type, even the ``bytearray`` versions of these methods do *not* operate
  in place, but instead create a new object.

* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip

* ``partition()`` separator argument
* ``rpartition()`` separator argument

* ``split()`` separator argument
* ``rsplit()`` separator argument

* ``replace()`` old value and new value

In addition to the consistency motive, this approach also makes it easier
to work with the indexing behaviour, as the result of an indexing operation
can more easily be fed back in to other methods.

For ``bytearray``, some additional changes are proposed to the current
integer based operations to ensure they remain consistent with the proposed
constructor changes:

* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)

The general principle behind these changes is to restore the flexible
"element-or-subsequence" behaviour seen in the ``str`` API, even though
Python 3 actually represents subsequences and individual elements as
distinct types in the binary domain.


Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------

Several of the ``bytes`` and ``bytearray`` methods have their origins in the
Python 2 ``str`` API. As ``str`` is an immutable type, all of these
operations are defined as returning a *new* instance, rather than operating
in place. This contrasts with methods on other mutable types like ``list``,
where ``list.sort()`` and ``list.reverse()`` operate in-place and return
``None``, rather than creating a new object.

Backwards compatibility constraints make it impractical to change this
behaviour at this point, but it may be appropriate to explicitly call out
this quirk in the documentation for the ``bytearray`` type. It affects the
following methods that could reasonably be expected to operate in-place on
a mutable type:

* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``

Note that the following ``bytearray`` operations *do* operate in place, as
they're part of the mutable sequence API in ``bytearray``, rather than being
inspired by the immutable Python 2 ``str`` API:

* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``
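The contrast between the two groups of methods can be seen directly in current Python 3. This short sketch uses only existing ``bytearray`` behaviour, with one method from each list (the variable names are arbitrary):

```python
buf = bytearray(b"  spam  ")

# str-inspired method: returns a new bytearray and leaves buf untouched
stripped = buf.strip()
unchanged = bytes(buf)       # snapshot showing buf still has its padding

# mutable sequence method: modifies buf in place and returns None
result = buf.reverse()
```

After this runs, ``stripped`` is a separate object from ``buf``, while ``reverse()`` has changed ``buf`` itself.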
A patch for at least this part of the proposal will be prepared before
submitting the PEP for approval, as writing out these docs completely may
suggest additional opportunities for API consistency improvements.


References