PEP: 467
Title: Improved API consistency for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30
Abstract
========
During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.
This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make their behaviour more internally consistent
and to make it easier to operate entirely in the binary domain.
Background
==========
Over the course of Python 3's evolution, a number of adjustments have been
made to the core ``bytes`` and ``bytearray`` types as additional practical
experience was gained with using them in code beyond the Python 3 standard
library and test suite. However, to date, these changes have been made
on a relatively ad hoc tactical basis as specific issues were identified,
rather than as part of a systematic review of the APIs of these types. This
approach has allowed inconsistencies to creep into the API design as to which
input types are accepted by different methods. Additional inconsistencies
linger from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart), as well as from the origins of these types in
the text-like behaviour of the Python 2 ``str`` type.
This PEP aims to provide the missing systematic review, with the goal of
ensuring that wherever feasible (given backwards compatibility constraints)
these current inconsistencies are addressed for the Python 3.5 release.
Proposals
=========
As a "consistency improvement" proposal, this PEP is actually about a number
of smaller micro-proposals, each aimed at improving the self-consistency of
the binary data model in Python 3. Proposals are motivated by one of three
factors:
* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an
  integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes``
  object
Alternate Constructors
----------------------
The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.
Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True
    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'
This PEP proposes that the current handling of integers in the ``bytes`` and
``bytearray`` constructors be deprecated in Python 3.5 and removed in Python
3.6, being replaced by two more type-appropriate alternate constructors
provided as class methods. The initial python-ideas thread [ideas-thread1]_
that spawned this PEP was specifically aimed at deprecating this constructor
behaviour.
For ``bytes``, a ``byte`` constructor is proposed that converts integers
(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
object, converts objects that support the buffer API to bytes, and also
passes through length 1 byte strings unchanged::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found
One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
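The semantics described above can be approximated in current Python with a
small helper. This is only a sketch under stated assumptions: the function
name ``byte`` is hypothetical, ``bytes.byte()`` itself does not exist, and
the error message merely imitates the wording proposed in this PEP.

```python
import operator


def byte(value):
    """Hypothetical stand-in for the proposed ``bytes.byte()``."""
    try:
        # Integers (anything supporting operator.index) are tried first.
        index = operator.index(value)
    except TypeError:
        # Otherwise the value must support the buffer API;
        # memoryview() raises TypeError for anything that does not.
        data = bytes(memoryview(value))
        if len(data) != 1:
            raise TypeError(
                "byte() expected a byte, but buffer of length %d found"
                % len(data)
            )
        return data
    # bytes([n]) raises ValueError for values outside range(0, 256).
    return bytes([index])
```

With this helper, ``ord(byte(n))`` returns ``n`` for any integer in
``range(0, 256)``, which is the round trip the documentation note would
describe.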
For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
the buffer filled with a particular value (defaulting to ``0``) as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')
This part of the proposal was covered by an existing issue
[empty-buffer-issue]_ and a variety of names have been proposed
(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
specific name currently proposed was chosen by analogy with
``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
explicit that it is an alternate constructor rather than an in-place
mutation, as well as how it differs from the standard constructor.
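For comparison, the sequence repetition approach that ``from_len`` is meant
to replace can be wrapped in a helper today. The name
``bytearray_from_len`` is hypothetical and used only for illustration; it
is a sketch of the proposed behaviour, not the proposed implementation.

```python
def bytearray_from_len(length, fill=0):
    """Hypothetical stand-in for the proposed ``bytearray.from_len()``."""
    # bytearray([fill]) checks that fill is in range(0, 256);
    # repetition then produces ``length`` copies of that single byte.
    return bytearray([fill]) * length


zeros = bytearray_from_len(3)      # same contents as bytearray(3) today
sixes = bytearray_from_len(3, 6)   # three bytes, each with the value 6
```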
Open questions
^^^^^^^^^^^^^^
* Should ``bytearray.byte()`` also be added? Or is
  ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition
  sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
  sequences with more than one element? The ``TypeError`` currently proposed
  is copied (with slightly improved wording) from the behaviour of ``ord()``
  with sequences containing more than one code point, while ``ValueError``
  would be more consistent with the existing handling of out-of-range
  integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
  as individual bytes, but this is currently inconsistent with the main
  ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1
  bytes objects in addition to iterables of integers? If so, should it
  allow a mixture of the two in a single iterable?
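To make the last question concrete, the following sketch shows one way a
mixed iterable could be interpreted. The helper name ``from_mixed`` is
hypothetical, and the choice to reject longer buffers with ``TypeError`` is
an assumption rather than something this PEP decides.

```python
def from_mixed(items):
    """Hypothetical helper: build bytes from integers and length 1 buffers."""
    out = bytearray()
    for item in items:
        if isinstance(item, int):
            # append() raises ValueError outside range(0, 256).
            out.append(item)
        else:
            # memoryview() raises TypeError for non-buffer objects.
            chunk = bytes(memoryview(item))
            if len(chunk) != 1:
                raise TypeError("expected a length 1 bytes-like object")
            out += chunk
    return bytes(out)


# For iterables containing only bytes objects, join() already works today.
joined = b"".join([b"a", b"b", b"c"])
mixed = from_mixed([97, b"b", bytearray(b"c")])
```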
Iteration
---------
Iteration over ``bytes`` objects and other binary sequences produces
integers. Rather than proposing a new method that would need to be added
not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
to third party types as well, this PEP proposes that iteration to produce
length 1 ``bytes`` objects instead be handled by combining ``map`` with
the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
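Until such a constructor exists, the same iteration can be written with
constructs that are already available. The variable names below are
illustrative only.

```python
data = b"abc"

# bytes([i]) turns each integer produced by iteration into length 1 bytes.
chunks = [bytes([i]) for i in data]

# Length 1 slices give the same result for sequences that support slicing.
slices = [data[i:i + 1] for i in range(len(data))]
```

The first form works for any iterable of integers in the range 0 to 255
inclusive, while the slicing form depends on the container supporting
slices that return ``bytes``.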
Consistent support for different input types
--------------------------------------------
The ``bytes`` and ``bytearray`` methods inspired by the Python 2 ``str``
type generally expect to operate on binary subsequences: other objects
implementing the buffer API. By contrast, the mutating APIs added to
the ``bytearray`` interface expect to operate on individual elements:
integers in the range 0 to 255 (inclusive).
In Python 3.3, the binary search operations (``in``, ``count()``,
``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
accept integers in the range 0 to 255 (inclusive) as their first argument,
in addition to the existing support for binary subsequences.
This results in behaviour like the following in Python 3.3+::

    >>> data = bytes([1, 2, 3, 4])
    >>> 3 in data
    True
    >>> b"\x03" in data
    True
    >>> data.count(3)
    1
    >>> data.count(b"\x03")
    1
    >>> data.replace(3, 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: expected bytes, bytearray or buffer compatible object
    >>> data.replace(b"\x03", b"\x04")
    b'\x01\x02\x04\x04'
    >>> mutable = bytearray(data)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04')
    >>> mutable.append(b"\x05")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required
    >>> mutable.append(5)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04\x05')
This PEP proposes extending the behaviour of accepting integers as being
equivalent to the corresponding length 1 binary sequence to several other
``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
object for certain parameters. In essence, if a value is an acceptable
input to the new ``bytes.byte`` constructor defined above, then it would
be acceptable in the roles defined here (in addition to any other already
supported inputs):
* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)
* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character
* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip
* ``partition()`` separator argument
* ``rpartition()`` separator argument
* ``split()`` separator argument
* ``rsplit()`` separator argument
* ``replace()`` old value and new value
In addition to the consistency motive, this approach also makes it easier
to work with the indexing behaviour, as the result of an indexing operation
can more easily be fed back into other methods.
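The friction this is meant to remove can be seen in current Python, where
an integer obtained by indexing has to be wrapped before most of the
methods listed above will accept it. The snippet below is a sketch of that
workaround, not of the proposed behaviour.

```python
data = bytes([1, 2, 3, 4])
first = data[0]            # indexing yields the integer 1, not b"\x01"

# Today the integer must be converted back to a length 1 bytes object
# before it can be used as a prefix or as a replacement value.
prefix = bytes([first])
has_prefix = data.startswith(prefix)
replaced = data.replace(bytes([3]), bytes([4]))
```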
For ``bytearray``, some additional changes are proposed to the current
integer based operations to ensure they remain consistent with the proposed
constructor changes:
* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)
The general principle behind these changes is to restore the flexible
"element-or-subsequence" behaviour seen in the ``str`` API, even though
Python 3 actually represents subsequences and individual elements as
distinct types in the binary domain.
Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------
Several of the ``bytes`` and ``bytearray`` methods have their origins in the
Python 2 ``str`` API. As ``str`` is an immutable type, all of these
operations are defined as returning a *new* instance, rather than operating
in place. This contrasts with methods on other mutable types like ``list``,
where ``list.sort()`` and ``list.reverse()`` operate in-place and return
``None``, rather than creating a new object.
Backwards compatibility constraints make it impractical to change this
behaviour at this point, but it may be appropriate to explicitly call out
this quirk in the documentation for the ``bytearray`` type. It affects the
following methods that could reasonably be expected to operate in-place on
a mutable type:
* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``
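A short illustration of the quirk, using ``upper()`` as a representative
of the methods listed above:

```python
buf = bytearray(b"abc")
result = buf.upper()

# upper() returns a new bytearray and leaves the original untouched.
is_new_object = result is not buf
```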
Note that the following ``bytearray`` operations *do* operate in place, as
they're part of the mutable sequence API in ``bytearray``, rather than being
inspired by the immutable Python 2 ``str`` API:
* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``
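By contrast, each of the operations in this second list modifies the
existing object, which can be confirmed by keeping a second reference to
it. The variable names are illustrative only.

```python
buf = bytearray(b"abc")
alias = buf            # second reference to the same object

buf += b"d"            # in-place concatenation
buf.append(0x65)       # appends b"e"
buf.extend(b"fg")
buf.reverse()
buf.remove(0x61)       # removes the first b"a"
last = buf.pop()       # removes and returns the final element as an integer
```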
References
==========
.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895
Copyright
=========
This document has been placed in the public domain.
..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: