PEP: 461
Title: Adding % formatting to bytes and bytearray
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 13-Jan-2014
Python-Version: 3.5
Post-History: 14-Jan-2014, 15-Jan-2014, 17-Jan-2014, 22-Feb-2014, 25-Mar-2014,
              27-Mar-2014
Resolution: https://mail.python.org/pipermail/python-dev/2014-March/133621.html


Abstract
========

This PEP proposes adding % formatting operations similar to Python 2's ``str``
type to ``bytes`` and ``bytearray`` [1]_ [2]_.


Rationale
=========

While interpolation is usually thought of as a string operation, there are
cases where interpolation on ``bytes`` or ``bytearrays`` make sense, and the
work needed to make up for this missing functionality detracts from the overall
readability of the code.


Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols [3]_.

This area of programming is characterized by a mixture of binary data and
ASCII compatible segments of text (aka ASCII-encoded text).  Bringing back a
restricted %-interpolation for ``bytes`` and ``bytearray`` will aid both in
writing new wire format code, and in porting Python 2 wire format code.

Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.


Proposed semantics for ``bytes`` and ``bytearray`` formatting
=============================================================

%-interpolation
---------------

All the numeric formatting codes (``d``, ``i``, ``o``, ``u``, ``x``, ``X``,
``e``, ``E``, ``f``, ``F``, ``g``, ``G``, and any that are subsequently added
to Python 3) will be supported, and will work as they do for str, including
the padding, justification and other related modifiers (currently ``#``, ``0``,
``-``, space, and ``+`` (plus any added to Python 3)).  The only
non-numeric codes allowed are ``c``, ``b``, ``a``, and ``s`` (which is a
synonym for b).

For the numeric codes, the only difference between ``str`` and ``bytes`` (or
``bytearray``) interpolation is that the results from these codes will be
ASCII-encoded text, not unicode.  In other words, for any numeric formatting
code ``%x``::

   b"%x" % val

is equivalent to::

   ("%x" % val).encode("ascii")

Examples::

   >>> b'%4x' % 10
   b'   a'

   >>> b'%#4x' % 10
   ' 0xa'

   >>> b'%04X' % 10
   '000A'

``%c`` will insert a single byte, either from an ``int`` in range(256), or from
a ``bytes`` argument of length 1, not from a ``str``.

Examples::

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

``%b`` will insert a series of bytes.  These bytes are collected in one of two
ways:

- input type supports ``Py_buffer`` [4]_?
  use it to collect the necessary bytes

- input type is something else?
  use its ``__bytes__`` method [5]_ ; if there isn't one, raise a ``TypeError``

In particular, ``%b`` will not accept numbers nor ``str``.  ``str`` is rejected
as the string to bytes conversion requires an encoding, and we are refusing to
guess; numbers are rejected because:

- what makes a number is fuzzy (float? Decimal? Fraction? some user type?)

- allowing numbers would lead to ambiguity between numbers and textual
  representations of numbers (3.14 vs '3.14')

- given the nature of wire formats, explicit is definitely better than implicit

``%s`` is included as a synonym for ``%b`` for the sole purpose of making 2/3 code
bases easier to maintain.  Python 3 only code should use ``%b``.

Examples::

    >>> b'%b' % b'abc'
    b'abc'

    >>> b'%b' % 'some string'.encode('utf8')
    b'some string'

    >>> b'%b' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'float'

    >>> b'%b' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'str'


``%a`` will give the equivalent of
``repr(some_obj).encode('ascii', 'backslashreplace')`` on the interpolated
value.  Use cases include developing a new protocol and writing landmarks
into the stream; debugging data going into an existing protocol to see if
the problem is the protocol itself or bad data; a fall-back for a serialization
format; or any situation where defining ``__bytes__`` would not be appropriate
but a readable/informative representation is needed [6]_.

``%r`` is included as a synonym for ``%a`` for the sole purpose of making 2/3
code bases easier to maintain.  Python 3 only code use ``%a`` [7]_.

Examples::

    >>> b'%a' % 3.14
    b'3.14'

    >>> b'%a' % b'abc'
    b"b'abc'"

    >>> b'%a' % 'def'
    b"'def'"


Compatibility with Python 2
===========================

As noted above, ``%s`` and ``%r`` are being included solely to help ease
migration from, and/or have a single code base with, Python 2.  This is
important as there are modules both in the wild and behind closed doors that
currently use the Python 2 ``str`` type as a ``bytes`` container, and hence
are using ``%s`` as a bytes interpolator.

However, ``%b`` and ``%a`` should be used in new, Python 3 only code, so ``%s``
and ``%r`` will immediately be deprecated, but not removed from the 3.x series
[7]_.

Proposed variations
===================

It has been proposed to automatically use ``.encode('ascii','strict')`` for
``str`` arguments to ``%b``.

- Rejected as this would lead to intermittent failures.  Better to have the
  operation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have ``%b`` return the ascii-encoded repr when the
value is a ``str`` (b'%b' % 'abc'  --> b"'abc'").

- Rejected as this would lead to hard to debug failures far from the problem
  site.  Better to have the operation always fail so the trouble-spot can be
  easily fixed.

Originally this PEP also proposed adding format-style formatting, but it was
decided that format and its related machinery were all strictly text (aka
``str``) based, and it was dropped.

Various new special methods were proposed, such as ``__ascii__``,
``__format_bytes__``, etc.; such methods are not needed at this time, but can
be visited again later if real-world use shows deficiencies with this solution.

A competing PEP, :pep:`PEP 460 Add binary interpolation and formatting <460>`,
also exists.


Objections
==========

The objections raised against this PEP were mainly variations on two themes:

- the ``bytes`` and ``bytearray`` types are for pure binary data, with no
  assumptions about encodings

- offering %-interpolation that assumes an ASCII encoding will be an
  attractive nuisance and lead us back to the problems of the Python 2
  ``str``/``unicode`` text model

As was seen during the discussion, ``bytes`` and ``bytearray`` are also used
for mixed binary data and ASCII-compatible segments: file formats such as
``dbf`` and ``pdf``, network protocols such as ``ftp`` and ``email``, etc.

``bytes`` and ``bytearray`` already have several methods which assume an ASCII
compatible encoding.  ``upper()``, ``isalpha()``, and ``expandtabs()`` to name
just a few.  %-interpolation, with its very restricted mini-language, will not
be any more of a nuisance than the already existing methods.

Some have objected to allowing the full range of numeric formatting codes with
the claim that decimal alone would be sufficient.  However, at least two
formats (dbf and pdf) make use of non-decimal numbers.


Footnotes
=========

.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither string.Template, format, nor str.format are under consideration
.. [3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html
.. [4] http://docs.python.org/3/c-api/buffer.html
       examples:  ``memoryview``, ``array.array``, ``bytearray``, ``bytes``
.. [5] http://docs.python.org/3/reference/datamodel.html#object.__bytes__
.. [6] https://mail.python.org/pipermail/python-dev/2014-February/132750.html
.. [7] http://bugs.python.org/issue23467 -- originally ``%r`` was not allowed,
       but was added for consistency during the 3.5 alpha stage.


Copyright
=========

This document has been placed in the public domain.