PEP: 467
Title: Minor API improvements for binary sequences
Version: $Revision$
Last-Modified: $Date$
Author: Alyssa Coghlan <ncoghlan@gmail.com>, Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Mar-2014
Python-Version: 3.12
Post-History: 30-Mar-2014, 15-Aug-2014, 16-Aug-2014, 07-Jun-2016, 01-Sep-2016,
              13-Apr-2021, 03-Nov-2021


Abstract
========

This PEP proposes five small adjustments to the APIs of the ``bytes`` and
``bytearray`` types to make it easier to operate entirely in the binary domain:

* Add ``fromsize`` alternative constructor
* Add ``fromint`` alternative constructor
* Add ``ascii`` alternative constructor
* Add ``getbyte`` byte retrieval method
* Add ``iterbytes`` alternative iterator

Rationale
=========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series, for example with :pep:`461`.


Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols.

This area of programming is characterized by a mixture of binary data and
ASCII-compatible segments of text (aka ASCII-encoded text). The addition of
the new constructors, methods, and iterators will aid both in writing new
wire format code, and in porting any remaining Python 2 wire format code.

Common use cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.


Proposals
=========

Addition of explicit "count and byte initialised sequence" constructors
-----------------------------------------------------------------------

To replace the discouraged practice of passing a single integer to the
``bytes`` and ``bytearray`` constructors to create a zero-filled sequence,
this PEP proposes the addition of an explicit ``fromsize`` alternative
constructor as a class method on both ``bytes`` and ``bytearray`` whose first
argument is the count, and whose second argument is the fill byte to use
(defaults to ``\x00``)::

    >>> bytes.fromsize(3)
    b'\x00\x00\x00'
    >>> bytearray.fromsize(3)
    bytearray(b'\x00\x00\x00')
    >>> bytes.fromsize(5, b'\x0a')
    b'\n\n\n\n\n'
    >>> bytearray.fromsize(5, fill=b'\x0a')
    bytearray(b'\n\n\n\n\n')

``fromsize`` will behave just as the current constructors behave when passed a
single integer, while allowing for non-zero fill values when needed.
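
Until such a constructor exists, the described behaviour can be approximated
with sequence repetition. The following is only a sketch: the helper name
``fromsize`` as a free function, its ``cls`` parameter, and the rejection of
multi-byte fill values are assumptions made for illustration, not part of the
proposal.

```python
def fromsize(cls, size, fill=b"\x00"):
    """Hypothetical stand-in for the proposed ``cls.fromsize(size, fill)``."""
    if size < 0:
        # The existing constructors reject negative counts, so do the same.
        raise ValueError("negative count")
    if len(fill) != 1:
        # Assumption: the fill value must be exactly one byte long.
        raise ValueError("fill must be a single byte")
    # Sequence repetition yields ``size`` copies of the fill byte.
    return cls(fill) * size


zeros = fromsize(bytes, 3)                        # same result as bytes(3)
newlines = fromsize(bytearray, 5, fill=b"\x0a")   # five newline bytes
```

With the default fill, the helper returns the same value as ``bytes(3)`` or
``bytearray(3)`` does today.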


Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes
the addition of an explicit ``fromint`` alternative constructor as a class
method on both ``bytes`` and ``bytearray``::

    >>> bytes.fromint(65)
    b'A'
    >>> bytearray.fromint(65)
    bytearray(b'A')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.fromint(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: integer must be in range(0, 256)

    >>> bytes.fromint(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note
that ``bytes.fromint`` is the primary inverse operation for binary data, while
``chr`` is the inverse operation for text data, and that ``bytearray.fromint``
also exists.

Behaviorally, ``bytes.fromint(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``). The new spelling is
expected to be easier to discover and easier to read (especially when used
in conjunction with indexing operations on binary sequence types).

As a separate method, the new spelling will also work better with higher
order functions like ``map``.
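
The ``map`` use case can be tried today with a small wrapper around the
equivalent ``bytes([x])`` spelling. This is a sketch only; the free function
``fromint`` below is a hypothetical stand-in for the proposed class method.

```python
def fromint(value):
    """Hypothetical stand-in for the proposed ``bytes.fromint(value)``."""
    # bytes([value]) raises ValueError for integers outside range(0, 256)
    # and TypeError for non-integers such as 1.0.
    return bytes([value])


codes = [72, 84, 84, 80]
# A named callable can be passed straight to map(); bytes([x]) cannot.
pieces = list(map(fromint, codes))
joined = b"".join(pieces)
```

Here ``pieces`` is a list of four length 1 ``bytes`` objects and ``joined``
is ``b'HTTP'``.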

These new methods intentionally do NOT offer the same level of general integer
support as the existing ``int.to_bytes`` conversion method, which allows
arbitrarily large integers to be converted to arbitrarily long bytes objects. The
restriction to only accept non-negative integers that fit in a single byte means
that no byte order information is needed, and there is no need to handle
negative numbers. The documentation of the new methods will refer readers to
``int.to_bytes`` for use cases where handling of arbitrary integers is needed.
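
For comparison, the existing ``int.to_bytes`` method needs a length, a byte
order, and (for negative values) a signedness flag, none of which matter for
a single unsigned byte. The values below are arbitrary examples chosen for
illustration.

```python
value = 1025  # 0x0401, too large for a single byte

# Multi-byte results depend on the chosen byte order.
big = value.to_bytes(2, byteorder="big")        # b'\x04\x01'
little = value.to_bytes(2, byteorder="little")  # b'\x01\x04'

# Negative numbers additionally require signed=True.
signed = (-2).to_bytes(1, byteorder="big", signed=True)  # b'\xfe'

# A single byte in range(0, 256) has no such ambiguity.
single = bytes([65])  # b'A', the current spelling of bytes.fromint(65)
```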


Addition of "ascii" constructors
--------------------------------

In Python 2, converting an object, such as the integer ``123``, to bytes (aka the
Python 2 ``str``) was as simple as::

    >>> str(123)
    '123'

With Python 3 that became the more verbose::

    >>> b'%d' % 123
    b'123'

or even::

    >>> str(123).encode('ascii')
    b'123'

This PEP proposes that an ``ascii`` method be added to ``bytes`` and ``bytearray``
to handle this use case::

    >>> bytes.ascii(123)
    b'123'

Note that ``bytes.ascii()`` would handle simple ASCII-encodable text correctly,
unlike the ``ascii()`` built-in::

    >>> ascii("hello").encode('ascii')
    b"'hello'"
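
One way to picture the intended result is as ``str()`` followed by an ASCII
encode. The sketch below assumes exactly that; the helper name
``bytes_ascii`` is hypothetical, and the proposal does not spell out how
non-ASCII input would be treated.

```python
def bytes_ascii(obj):
    """Hypothetical stand-in for the proposed ``bytes.ascii(obj)``."""
    # Assumption: the result is the ASCII encoding of str(obj).
    return str(obj).encode("ascii")


number = bytes_ascii(123)      # b'123'
text = bytes_ascii("hello")    # b'hello', with no added quotes

# The ascii() built-in returns a repr-style string, so quotes are included.
quoted = ascii("hello").encode("ascii")  # b"'hello'"
```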


Addition of "getbyte" method to retrieve a single byte
------------------------------------------------------

This PEP proposes that ``bytes`` and ``bytearray`` gain the method ``getbyte``
which will always return a length 1 ``bytes`` object::

    >>> b'abc'.getbyte(0)
    b'a'

If the requested index is out of range, ``IndexError`` is raised::

    >>> b'abc'.getbyte(9)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: index out of range
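
The difference from the existing access styles can be sketched with a helper
function. The name ``getbyte`` as a free function is hypothetical, and
support for negative indices is an assumption carried over from ordinary
indexing.

```python
import operator


def getbyte(data, index):
    """Hypothetical stand-in for the proposed ``data.getbyte(index)``."""
    index = operator.index(index)  # reject non-integer indices
    # Integer indexing raises IndexError for out-of-range positions,
    # whereas the slice data[index:index + 1] would silently return b''.
    value = data[index]
    return bytes([value])


first = getbyte(b"abc", 0)             # b'a'
last = getbyte(bytearray(b"abc"), -1)  # b'c', always bytes

as_int = b"abc"[0]          # 97: plain indexing returns an integer
empty_slice = b"abc"[9:10]  # b'': slicing hides the out-of-range index
```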


Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------

This PEP proposes that ``bytes`` and ``bytearray`` gain an optimised
``iterbytes`` method that produces length 1 ``bytes`` objects rather than
integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer

For example::

    >>> tuple(b"ABC".iterbytes())
    (b'A', b'B', b'C')
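
The same sequence of values can be produced today with a generator built on
slicing. This is an unoptimised sketch, and the free function ``iterbytes``
is a hypothetical stand-in for the proposed method.

```python
def iterbytes(data):
    """Hypothetical stand-in for the proposed ``data.iterbytes()``."""
    for i in range(len(data)):
        # A one-element slice keeps the binary type; bytes() normalises
        # bytearray slices so every item is an immutable bytes object.
        yield bytes(data[i:i + 1])


as_bytes = tuple(iterbytes(b"ABC"))           # (b'A', b'B', b'C')
from_array = list(iterbytes(bytearray(b"AB")))  # [b'A', b'B']
as_ints = list(b"ABC")                        # [65, 66, 67]
```

Plain iteration, shown in ``as_ints``, yields integers instead.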


Design discussion
=================

Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable ``bytes`` type then inherited that feature
when it was introduced in :pep:`3137`.

This PEP isn't revisiting that original design decision, just changing the
spelling as users sometimes find the current behavior of the binary sequence
constructors surprising. In particular, there's a reasonable case to be made
that ``bytes(x)`` (where ``x`` is an integer) should behave like the
``bytes.fromint(x)`` proposal in this PEP. Providing both behaviors as separate
class methods avoids that ambiguity.
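
The two readings of ``bytes(x)`` can be compared directly in current Python.
The variable names below are illustrative only.

```python
x = 3

# Current constructor behaviour: x is a count of zero bytes.
as_count = bytes(x)    # b'\x00\x00\x00'

# The other plausible reading: x is the value of a single byte.
as_value = bytes([x])  # b'\x03'
```

Under this PEP, the first would be spelled ``bytes.fromsize(x)`` and the
second ``bytes.fromint(x)``.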


Omitting the originally proposed builtin function
-------------------------------------------------

When submitted to the Steering Council, this PEP proposed the introduction of
a ``bchr`` builtin (with the same behaviour as ``bytes.fromint``), recreating
the ``ord``/``chr``/``unichr`` trio from Python 2 under a different naming
scheme (``ord``/``bchr``/``chr``).

The SC indicated they didn't think this functionality was needed often enough
to justify offering two ways of doing the same thing, especially when one of
those ways was a new builtin function. That part of the proposal was therefore
dropped as being redundant with the ``bytes.fromint`` alternative constructor.

Developers who use this method frequently will instead have the option to
define their own ``bchr = bytes.fromint`` aliases.


Scope limitation: memoryview
----------------------------

Updating ``memoryview`` with the new item retrieval methods is outside the scope
of this PEP.


References
==========

* `Initial March 2014 discussion thread on python-ideas <https://mail.python.org/pipermail/python-ideas/2014-March/027295.html>`_
* `Guido's initial feedback in that thread <https://mail.python.org/pipermail/python-ideas/2014-March/027376.html>`_
* `Issue proposing moving zero-initialised sequences to a dedicated API <https://github.com/python/cpython/issues/65094>`_
* `Issue proposing to use calloc() for zero-initialised binary sequences <https://github.com/python/cpython/issues/65843>`_
* `August 2014 discussion thread on python-dev <https://mail.python.org/pipermail/python-ideas/2014-March/027295.html>`_
* `June 2016 discussion thread on python-dev <https://mail.python.org/pipermail/python-dev/2016-June/144875.html>`_


Copyright
=========

This document has been placed in the public domain.