PEP: 467
Title: Improved API consistency for bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan <ncoghlan@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-03-30
Python-Version: 3.5
Post-History: 2014-03-30


Abstract
========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series.

This PEP proposes a number of small adjustments to the APIs of the ``bytes``
and ``bytearray`` types to make their behaviour more internally consistent
and to make it easier to operate entirely in the binary domain.


Background
==========

Over the course of Python 3's evolution, a number of adjustments have been
made to the core ``bytes`` and ``bytearray`` types as additional practical
experience was gained with using them in code beyond the Python 3 standard
library and test suite. However, to date, these changes have been made
on a relatively ad hoc tactical basis as specific issues were identified,
rather than as part of a systematic review of the APIs of these types. This
approach has allowed inconsistencies to creep into the API design as to which
input types are accepted by different methods. Additional inconsistencies
linger from an earlier pre-release design where there was *no* separate
``bytearray`` type, and instead the core ``bytes`` type was mutable (with
no immutable counterpart), as well as from the origins of these types in
the text-like behaviour of the Python 2 ``str`` type.

This PEP aims to provide the missing systematic review, with the goal of
ensuring that wherever feasible (given backwards compatibility constraints)
these current inconsistencies are addressed for the Python 3.5 release.


Proposals
=========

As a "consistency improvement" proposal, this PEP is actually about a number
of smaller micro-proposals, each aimed at improving the self-consistency of
the binary data model in Python 3. Proposals are motivated by one of three
factors:

* removing remnants of the original design of ``bytes`` as a mutable type
* more consistently accepting length 1 ``bytes`` objects as input where an
  integer between ``0`` and ``255`` inclusive is expected, and vice-versa
* allowing users to easily convert integer output to a length 1 ``bytes``
  object


Alternate Constructors
----------------------

The ``bytes`` and ``bytearray`` constructors currently accept an integer
argument, but interpret it to mean a zero-filled object of the given length.
This is a legacy of the original design of ``bytes`` as a mutable type,
rather than a particularly intuitive behaviour for users. It has become
especially confusing now that other ``bytes`` interfaces treat integers
and the corresponding length 1 bytes instances as equivalent input.
Compare::

    >>> b"\x03" in bytes([1, 2, 3])
    True
    >>> 3 in bytes([1, 2, 3])
    True

    >>> bytes(b"\x03")
    b'\x03'
    >>> bytes(3)
    b'\x00\x00\x00'

This PEP proposes that the current handling of integers in the ``bytes`` and
``bytearray`` constructors be deprecated in Python 3.5 and removed in Python
3.6, being replaced by two more type appropriate alternate constructors
provided as class methods. The initial python-ideas thread [ideas-thread1]_
that spawned this PEP was specifically aimed at deprecating this constructor
behaviour.

For ``bytes``, a ``byte`` constructor is proposed that converts integers
(as indicated by ``operator.index``) in the appropriate range to a ``bytes``
object, converts objects that support the buffer API to bytes, and also
passes through length 1 byte strings unchanged::

    >>> bytes.byte(3)
    b'\x03'
    >>> bytes.byte(bytearray(bytes([3])))
    b'\x03'
    >>> bytes.byte(memoryview(bytes([3])))
    b'\x03'
    >>> bytes.byte(bytes([3]))
    b'\x03'
    >>> bytes.byte(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: bytes must be in range(0, 256)
    >>> bytes.byte(b"ab")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: bytes.byte() expected a byte, but buffer of length 2 found

One specific use case for this alternate constructor is to easily convert
the result of indexing operations on ``bytes`` and other binary sequences
from an integer to a ``bytes`` object. The documentation for this API
should note that its counterpart for the reverse conversion is ``ord()``.
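
The intended behaviour can be approximated in current Python. The following
is only a rough sketch, not the proposed implementation: the module-level
function name ``byte`` is a hypothetical stand-in for ``bytes.byte()``, and
the exact error messages are assumptions based on the examples above.

```python
import operator


def byte(value):
    """Hypothetical pure-Python stand-in for the proposed bytes.byte()."""
    try:
        # Integers (anything supporting operator.index) are tried first
        number = operator.index(value)
    except TypeError:
        pass
    else:
        if not 0 <= number < 256:
            raise ValueError("bytes must be in range(0, 256)")
        return bytes([number])
    # Otherwise fall back to the buffer API; memoryview() raises TypeError
    # for objects that do not support it (such as str)
    data = bytes(memoryview(value))
    if len(data) != 1:
        raise TypeError(
            "byte() expected a byte, but buffer of length %d found" % len(data)
        )
    return data
```

With this sketch, ``ord(byte(65))`` round-trips back to ``65``, matching the
``ord()`` relationship described above.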

For ``bytearray``, a ``from_len`` constructor is proposed that preallocates
the buffer filled with a particular value (defaulting to ``0``) as a direct
replacement for the current constructor behaviour, rather than having to use
sequence repetition to achieve the same effect in a less intuitive way::

    >>> bytearray.from_len(3)
    bytearray(b'\x00\x00\x00')
    >>> bytearray.from_len(3, 6)
    bytearray(b'\x06\x06\x06')

This part of the proposal was covered by an existing issue
[empty-buffer-issue]_ and a variety of names have been proposed
(``empty_buffer``, ``zeros``, ``zeroes``, ``allnull``, ``fill``). The
specific name currently proposed was chosen by analogy with
``dict.fromkeys()`` and ``itertools.chain.from_iterable()`` to be completely
explicit that it is an alternate constructor rather than an in-place
mutation, as well as how it differs from the standard constructor.
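
For comparison, the sequence repetition approach mentioned above can be
wrapped up as follows. This is a minimal sketch only: the free function
``from_len`` and its ``fill`` parameter name are hypothetical stand-ins for
the proposed ``bytearray.from_len()`` class method.

```python
def from_len(length, fill=0):
    """Hypothetical stand-in for the proposed bytearray.from_len()."""
    if length < 0:
        raise ValueError("negative length")
    # bytearray([fill]) validates that fill is in range(0, 256);
    # sequence repetition then produces the preallocated buffer
    return bytearray([fill]) * length
```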


Open questions
^^^^^^^^^^^^^^

* Should ``bytearray.byte()`` also be added? Or is
  ``bytearray(bytes.byte(x))`` sufficient for that case?
* Should ``bytes.from_len()`` also be added? Or is sequence repetition
  sufficient for that case?
* Should ``bytearray.from_len()`` use a different name?
* Should ``bytes.byte()`` raise ``TypeError`` or ``ValueError`` for binary
  sequences with more than one element? The ``TypeError`` currently proposed
  is copied (with slightly improved wording) from the behaviour of ``ord()``
  with sequences containing more than one code point, while ``ValueError``
  would be more consistent with the existing handling of out-of-range
  integer values.
* ``bytes.byte()`` is defined above as accepting length 1 binary sequences
  as individual bytes, but this is currently inconsistent with the main
  ``bytes`` constructor::

      >>> bytes([b"a", b"b", b"c"])
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: 'bytes' object cannot be interpreted as an integer

  Should the ``bytes`` constructor be changed to accept iterables of length 1
  bytes objects in addition to iterables of integers? If so, should it
  allow a mixture of the two in a single iterable?
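
To make the last question more concrete, the following sketch shows one
possible reading of a constructor that accepts a mixture of integers and
length 1 binary sequences. The helper name ``bytes_from_items`` and its error
message are hypothetical, and nothing here is part of the proposal itself.

```python
def bytes_from_items(items):
    """Hypothetical helper: accept integers and length 1 binary sequences."""
    result = bytearray()
    for item in items:
        if isinstance(item, int):
            # append() raises ValueError outside range(0, 256)
            result.append(item)
        else:
            # memoryview() restricts this branch to buffer API objects
            chunk = bytes(memoryview(item))
            if len(chunk) != 1:
                raise TypeError(
                    "expected an integer or a length 1 binary sequence"
                )
            result += chunk
    return bytes(result)
```

When every item is already a ``bytes`` object, ``b"".join(items)`` achieves
the same result today without any helper.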


Iteration
---------

Iteration over ``bytes`` objects and other binary sequences produces
integers. Rather than proposing a new method that would need to be added
not only to ``bytes``, ``bytearray`` and ``memoryview``, but potentially
to third party types as well, this PEP proposes that iteration to produce
length 1 ``bytes`` objects instead be handled by combining ``map`` with
the new ``bytes.byte()`` alternate constructor proposed above::

    for x in map(bytes.byte, data):
        # x is a length 1 ``bytes`` object, rather than an integer
        # This works with *any* container of integers in the range
        # 0 to 255 inclusive
        ...
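
Until ``bytes.byte()`` exists, a similar effect is available by wrapping each
integer in a single-element list. The generator name ``iter_bytes`` below is
a hypothetical helper used purely for illustration.

```python
def iter_bytes(data):
    """Yield each element of a binary sequence as a length 1 bytes object."""
    for value in data:
        # bytes([value]) builds a length 1 object; bytes(value) would
        # instead create a zero-filled object of that length
        yield bytes([value])
```

For example, ``list(iter_bytes(b"abc"))`` produces three separate length 1
``bytes`` objects rather than three integers.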


Consistent support for different input types
--------------------------------------------

The ``bytes`` and ``bytearray`` methods inspired by the Python 2 ``str``
type generally expect to operate on binary subsequences: other objects
implementing the buffer API. By contrast, the mutating APIs added to
the ``bytearray`` interface expect to operate on individual elements:
integers in the range 0 to 255 (inclusive).

In Python 3.3, the binary search operations (``in``, ``count()``,
``find()``, ``index()``, ``rfind()`` and ``rindex()``) were updated to
accept integers in the range 0 to 255 (inclusive) as their first argument,
in addition to the existing support for binary subsequences.

This results in behaviour like the following in Python 3.3+::

    >>> data = bytes([1, 2, 3, 4])
    >>> 3 in data
    True
    >>> b"\x03" in data
    True
    >>> data.count(3)
    1
    >>> data.count(b"\x03")
    1

    >>> data.replace(3, 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: expected bytes, bytearray or buffer compatible object
    >>> data.replace(b"\x03", b"\x04")
    b'\x01\x02\x04\x04'

    >>> mutable = bytearray(data)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04')
    >>> mutable.append(b"\x05")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: an integer is required
    >>> mutable.append(5)
    >>> mutable
    bytearray(b'\x01\x02\x03\x04\x05')

This PEP proposes extending the behaviour of accepting integers as being
equivalent to the corresponding length 1 binary sequence to several other
``bytes`` and ``bytearray`` methods that currently expect a ``bytes``
object for certain parameters. In essence, if a value is an acceptable
input to the new ``bytes.byte`` constructor defined above, then it would
be acceptable in the roles defined here (in addition to any other already
supported inputs):

* ``startswith()`` prefix(es)
* ``endswith()`` suffix(es)

* ``center()`` fill character
* ``ljust()`` fill character
* ``rjust()`` fill character

* ``strip()`` character to strip
* ``lstrip()`` character to strip
* ``rstrip()`` character to strip

* ``partition()`` separator argument
* ``rpartition()`` separator argument

* ``split()`` separator argument
* ``rsplit()`` separator argument

* ``replace()`` old value and new value

In addition to the consistency motive, this approach also makes it easier
to work with the indexing behaviour, as the result of an indexing operation
can more easily be fed back into other methods.
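
As an illustration of the status quo, the snippet below shows the explicit
conversion currently needed before the result of an indexing operation can be
passed to the methods listed above. It uses only existing behaviour; the
variable names are arbitrary.

```python
data = bytes([1, 2, 3, 4])
element = data[2]  # indexing produces the integer 3

# Search operations already accept the integer directly (Python 3.3+)
count = data.count(element)

# Other methods still need an explicit conversion to a length 1 bytes object
as_bytes = bytes([element])
replaced = data.replace(as_bytes, b"\x04")
pieces = data.split(as_bytes)
has_prefix = data.startswith(bytes([data[0]]))
```

Under this proposal, ``element`` could be passed to ``replace()``,
``split()`` and ``startswith()`` directly, without the intermediate
``as_bytes`` object.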

For ``bytearray``, some additional changes are proposed to the current
integer based operations to ensure they remain consistent with the proposed
constructor changes:

* ``append()``: updated to be consistent with ``bytes.byte()``
* ``remove()``: updated to be consistent with ``bytes.byte()``
* ``+=``: updated to be consistent with ``bytes()`` changes (if any)
* ``extend()``: updated to be consistent with ``bytes()`` changes (if any)

The general principle behind these changes is to restore the flexible
"element-or-subsequence" behaviour seen in the ``str`` API, even though
Python 3 actually represents subsequences and individual elements as
distinct types in the binary domain.


Acknowledgement of surprising behaviour of some ``bytearray`` methods
---------------------------------------------------------------------

Several of the ``bytes`` and ``bytearray`` methods have their origins in the
Python 2 ``str`` API. As ``str`` is an immutable type, all of these
operations are defined as returning a *new* instance, rather than operating
in place. This contrasts with methods on other mutable types like ``list``,
where ``list.sort()`` and ``list.reverse()`` operate in-place and return
``None``, rather than creating a new object.

Backwards compatibility constraints make it impractical to change this
behaviour at this point, but it may be appropriate to explicitly call out
this quirk in the documentation for the ``bytearray`` type. It affects the
following methods that could reasonably be expected to operate in-place on
a mutable type:

* ``center()``
* ``ljust()``
* ``rjust()``
* ``strip()``
* ``lstrip()``
* ``rstrip()``
* ``replace()``
* ``lower()``
* ``upper()``
* ``swapcase()``
* ``title()``
* ``capitalize()``
* ``translate()``
* ``expandtabs()``
* ``zfill()``

Note that the following ``bytearray`` operations *do* operate in place, as
they're part of the mutable sequence API in ``bytearray``, rather than being
inspired by the immutable Python 2 ``str`` API:

* ``+=``
* ``append()``
* ``extend()``
* ``reverse()``
* ``remove()``
* ``pop()``
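
The difference between the two groups of methods can be seen in a short
example. It relies only on existing ``bytearray`` behaviour; the slice
assignment at the end is one common way to emulate an in-place update, not
something this PEP proposes.

```python
buf = bytearray(b"abc")

# str-inspired methods return a new bytearray and leave the original alone
upper = buf.upper()
assert upper == bytearray(b"ABC")
assert upper is not buf
assert buf == bytearray(b"abc")

# Mutable sequence methods modify the object in place and return None
assert buf.reverse() is None
assert buf == bytearray(b"cba")

# Slice assignment writes the result of a str-inspired method back into
# the original object
buf[:] = buf.upper()
assert buf == bytearray(b"CBA")
```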


References
==========

.. [ideas-thread1] https://mail.python.org/pipermail/python-ideas/2014-March/027295.html
.. [empty-buffer-issue] http://bugs.python.org/issue20895


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: