PEP: 467
Title: Minor API improvements for binary sequences
Version: $Revision$
Last-Modified: $Date$
Author: Alyssa Coghlan, Ethan Furman
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Mar-2014
Python-Version: 3.12
Post-History: 30-Mar-2014, 15-Aug-2014, 16-Aug-2014, 07-Jun-2016, 01-Sep-2016, 13-Apr-2021, 03-Nov-2021


Abstract
========

This PEP proposes five small adjustments to the APIs of the ``bytes`` and
``bytearray`` types to make it easier to operate entirely in the binary domain:

* Add ``fromsize`` alternative constructor
* Add ``fromint`` alternative constructor
* Add ``ascii`` alternative constructor
* Add ``getbyte`` byte retrieval method
* Add ``iterbytes`` alternative iterator


Rationale
=========

During the initial development of the Python 3 language specification, the
core ``bytes`` type for arbitrary binary data started as the mutable type
that is now referred to as ``bytearray``. Other aspects of operating in
the binary domain in Python have also evolved over the course of the Python
3 series, for example with :pep:`461`.


Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols.

This area of programming is characterized by a mixture of binary data and
ASCII compatible segments of text (aka ASCII-encoded text). The addition of
the new constructors, methods, and iterators will aid both in writing new
wire format code, and in porting any remaining Python 2 wire format code.

Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.


Proposals
=========

Addition of explicit "count and byte initialised sequence" constructors
-----------------------------------------------------------------------

To replace the now discouraged behavior of passing a single integer to the
``bytes`` and ``bytearray`` constructors to create a zero-filled sequence,
this PEP proposes the addition of an explicit ``fromsize`` alternative
constructor as a class method on both ``bytes`` and ``bytearray`` whose first
argument is the count, and whose second argument is the fill byte to use
(defaults to ``\x00``)::

    >>> bytes.fromsize(3)
    b'\x00\x00\x00'
    >>> bytearray.fromsize(3)
    bytearray(b'\x00\x00\x00')
    >>> bytes.fromsize(5, b'\x0a')
    b'\x0a\x0a\x0a\x0a\x0a'
    >>> bytearray.fromsize(5, fill=b'\x0a')
    bytearray(b'\x0a\x0a\x0a\x0a\x0a')

``fromsize`` will behave just as the current constructors behave when passed
a single integer, while allowing for non-zero fill values when needed.


Addition of explicit "single byte" constructors
-----------------------------------------------

As binary counterparts to the text ``chr`` function, this PEP proposes
the addition of an explicit ``fromint`` alternative constructor as a class
method on both ``bytes`` and ``bytearray``::

    >>> bytes.fromint(65)
    b'A'
    >>> bytearray.fromint(65)
    bytearray(b'A')

These methods will only accept integers in the range 0 to 255 (inclusive)::

    >>> bytes.fromint(512)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: integer must be in range(0, 256)

    >>> bytes.fromint(1.0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'float' object cannot be interpreted as an integer

The documentation of the ``ord`` builtin will be updated to explicitly note
that ``bytes.fromint`` is the primary inverse operation for binary data, while
``chr`` is the inverse operation for text data, and that ``bytearray.fromint``
also exists.

Behaviorally, ``bytes.fromint(x)`` will be equivalent to the current
``bytes([x])`` (and similarly for ``bytearray``).
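
As a rough illustration only, the two constructors proposed so far can be
approximated in current Python with the existing spellings. The module-level
helpers ``fromsize`` and ``fromint`` below, and their ``kind`` parameter, are
hypothetical stand-ins for the proposed class methods, and the error message
for an invalid fill value is an assumption:

```python
import operator


def fromsize(count, fill=b"\x00", *, kind=bytes):
    """Approximate the proposed ``fromsize`` via sequence repetition."""
    # Assumption: the fill value must be exactly one byte long.
    if len(fill) != 1:
        raise ValueError("fill must be a single byte")
    return kind(fill) * count


def fromint(value, *, kind=bytes):
    """Approximate the proposed ``fromint`` via the existing ``kind([x])``."""
    # operator.index() rejects floats with TypeError.
    value = operator.index(value)
    # bytes([x]) and bytearray([x]) raise ValueError outside range(0, 256).
    return kind([value])


print(fromsize(3))                 # b'\x00\x00\x00'
print(fromint(65))                 # b'A'
print(list(map(fromint, b"AB")))   # [b'A', b'B']
```
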
The new ``bytes.fromint`` spelling is expected to be easier to discover and
easier to read (especially when used in conjunction with indexing operations
on binary sequence types).

As a separate method, the new spelling will also work better with higher
order functions like ``map``.

These new methods intentionally do NOT offer the same level of general integer
support as the existing ``int.to_bytes`` conversion method, which allows
arbitrarily large integers to be converted to arbitrarily long bytes objects. The
restriction to only accept non-negative integers that fit in a single byte means
that no byte order information is needed, and there is no need to handle
negative numbers. The documentation of the new methods will refer readers to
``int.to_bytes`` for use cases where handling of arbitrary integers is needed.


Addition of "ascii" constructors
--------------------------------

In Python 2, converting an object, such as the integer ``123``, to bytes (aka
the Python 2 ``str``) was as simple as::

    >>> str(123)
    '123'

With Python 3 that became the more verbose::

    >>> b'%d' % 123

or even::

    >>> str(123).encode('ascii')

This PEP proposes that an ``ascii`` method be added to ``bytes`` and
``bytearray`` to handle this use-case::

    >>> bytes.ascii(123)
    b'123'

Note that ``bytes.ascii()`` would handle simple ASCII-encodable text correctly,
unlike the ``ascii()`` built-in::

    >>> ascii("hello").encode('ascii')
    b"'hello'"


Addition of "getbyte" method to retrieve a single byte
------------------------------------------------------

This PEP proposes that ``bytes`` and ``bytearray`` gain the method ``getbyte``
which will always return ``bytes``::

    >>> b'abc'.getbyte(0)
    b'a'

If an index is asked for that doesn't exist, ``IndexError`` is raised::

    >>> b'abc'.getbyte(9)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: index out of range


Addition of optimised iterator methods that produce ``bytes`` objects
---------------------------------------------------------------------

This PEP proposes
that ``bytes`` and ``bytearray`` gain an optimised ``iterbytes`` method that
produces length 1 ``bytes`` objects rather than integers::

    for x in data.iterbytes():
        # x is a length 1 ``bytes`` object, rather than an integer
        ...

For example::

    >>> tuple(b"ABC".iterbytes())
    (b'A', b'B', b'C')


Design discussion
=================

Why not rely on sequence repetition to create zero-initialised sequences?
-------------------------------------------------------------------------

Zero-initialised sequences can be created via sequence repetition::

    >>> b'\x00' * 3
    b'\x00\x00\x00'
    >>> bytearray(b'\x00') * 3
    bytearray(b'\x00\x00\x00')

However, this was also the case when the ``bytearray`` type was originally
designed, and the decision was made to add explicit support for it in the
type constructor. The immutable ``bytes`` type then inherited that feature
when it was introduced in :pep:`3137`.

This PEP isn't revisiting that original design decision, just changing the
spelling, as users sometimes find the current behavior of the binary sequence
constructors surprising. In particular, there's a reasonable case to be made
that ``bytes(x)`` (where ``x`` is an integer) should behave like the
``bytes.fromint(x)`` proposal in this PEP. Providing both behaviors as separate
class methods avoids that ambiguity.


Omitting the originally proposed builtin function
-------------------------------------------------

When submitted to the Steering Council, this PEP proposed the introduction of
a ``bchr`` builtin (with the same behavior as ``bytes.fromint``), recreating
the ``ord``/``chr``/``unichr`` trio from Python 2 under a different naming
scheme (``ord``/``bchr``/``chr``).

The SC indicated they didn't think this functionality was needed often enough
to justify offering two ways of doing the same thing, especially when one of
those ways was a new builtin function. That part of the proposal was therefore
dropped as being redundant with the ``bytes.fromint`` alternative constructor.
Developers who use ``bytes.fromint`` frequently will instead have the option to
define their own ``bchr = bytes.fromint`` aliases.


Scope limitation: memoryview
----------------------------

Updating ``memoryview`` with the new item retrieval methods is outside the scope
of this PEP.


References
==========

* Initial March 2014 discussion thread on python-ideas
* Guido's initial feedback in that thread
* Issue proposing moving zero-initialised sequences to a dedicated API
* Issue proposing to use calloc() for zero-initialised binary sequences
* August 2014 discussion thread on python-dev
* June 2016 discussion thread on python-dev


Copyright
=========

This document has been placed in the public domain.