PEP: 461
Title: Adding % formatting to bytes and bytearray
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 2014-01-13
Python-Version: 3.5
Post-History: 2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22, 2014-03-25
Resolution:


Abstract
========

This PEP proposes adding % formatting operations similar to Python 2's ``str``
type to ``bytes`` and ``bytearray`` [1]_ [2]_.


Rationale
=========

While interpolation is usually thought of as a string operation, there are
cases where interpolation on ``bytes`` or ``bytearray`` makes sense, and the
work needed to make up for this missing functionality detracts from the overall
readability of the code.


Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols [3]_.

This area of programming is characterized by a mixture of binary data and
ASCII compatible segments of text (aka ASCII-encoded text).  Bringing back a
restricted %-interpolation for ``bytes`` and ``bytearray`` will aid both in
writing new wire format code, and in porting Python 2 wire format code.

Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.


Proposed semantics for ``bytes`` and ``bytearray`` formatting
=============================================================

%-interpolation
---------------

All the numeric formatting codes (``d``, ``i``, ``o``, ``u``, ``x``, ``X``,
``e``, ``E``, ``f``, ``F``, ``g``, ``G``, and any that are subsequently added
to Python 3) will be supported, and will work as they do for str, including
the padding, justification and other related modifiers (currently ``#``, ``0``,
``-``, `` `` (space), and ``+`` (plus any added to Python 3)).  The only
non-numeric codes allowed are ``c``, ``s``, and ``a``.

For the numeric codes, the only difference between ``str`` and ``bytes`` (or
``bytearray``) interpolation is that the results from these codes will be
ASCII-encoded text, not unicode.  In other words, for any numeric formatting
code ``%x``::

   b"%x" % val

is equivalent to::

   ("%x" % val).encode("ascii")

Examples::

   >>> b'%4x' % 10
   b'   a'

   >>> b'%#4x' % 10
   b' 0xa'

   >>> b'%04X' % 10
   b'000A'
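
The equivalence stated above can be checked mechanically.  The following is
a minimal sketch, assuming an interpreter that implements this proposal
(CPython 3.5 or later); the ``CASES`` list and the helper name
``matches_str_formatting`` are illustrative and not part of the proposal.

```python
# Illustrative (format, value) pairs covering several numeric codes and
# modifiers; any other numeric code could be added to this list.
CASES = [
    ("%d", 42), ("%5d", 42), ("%-5d", 42), ("%+d", 42), ("% d", 42),
    ("%#o", 8), ("%#x", 255), ("%04X", 10),
    ("%.2f", 3.14159), ("%e", 1234.5), ("%G", 0.000001),
]


def matches_str_formatting(fmt, value):
    """Return True if bytes formatting equals ASCII-encoded str formatting."""
    as_bytes = fmt.encode("ascii") % (value,)
    as_text = (fmt % (value,)).encode("ascii")
    return as_bytes == as_text


# Maps each format string to whether the two spellings agree.
results = {fmt: matches_str_formatting(fmt, value) for fmt, value in CASES}
```

If the proposal is implemented as described, every value in ``results``
should be ``True``.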

``%c`` will insert a single byte, either from an ``int`` in range(256), or from
a ``bytes`` argument of length 1, not from a ``str``.

Examples::

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'
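
The accepted and rejected argument types for ``%c`` can be sketched as
follows, again assuming an interpreter that implements this proposal
(CPython 3.5 or later).  The helper ``try_percent_c`` is a hypothetical name
used only for illustration; the exact exception type raised for a rejected
argument is not specified here, so the sketch catches both ``TypeError`` and
``OverflowError``.

```python
def try_percent_c(value):
    """Return the single formatted byte, or None if the argument is rejected."""
    try:
        return b"%c" % (value,)
    except (TypeError, OverflowError):
        return None


from_int = try_percent_c(48)             # an int in range(256)
from_bytes = try_percent_c(b"a")         # a bytes object of length 1
from_str = try_percent_c("a")            # a str is never accepted
from_big_int = try_percent_c(256)        # outside range(256)
from_long_bytes = try_percent_c(b"ab")   # bytes, but not of length 1
```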

``%s`` is included for two reasons:  1) ``b`` is already a format code for
``format`` numerics (binary), and 2) it will make 2/3 code easier as Python 2.x
code uses ``%s``; however, it is restricted in what it will accept:

- input type supports ``Py_buffer`` [6]_?
  use it to collect the necessary bytes

- input type is something else?
  use its ``__bytes__`` method [7]_; if there isn't one, raise a ``TypeError``

In particular, ``%s`` will not accept numbers (use a numeric format code for
that), nor ``str`` (encode it to ``bytes``).

Examples::

    >>> b'%s' % b'abc'
    b'abc'

    >>> b'%s' % 'some string'.encode('utf8')
    b'some string'

    >>> b'%s' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: b'%s' does not accept numbers, use a numeric code instead

    >>> b'%s' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: b'%s' does not accept 'str', it must be encoded to `bytes`
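
The two rules for ``%s`` can be approximated in pure Python.  This is a
rough sketch under stated assumptions, not the implementation:
``interpolate_s`` and ``Packet`` are hypothetical names, ``memoryview`` is
used as a stand-in for the C-level ``Py_buffer`` check, and the error
message is illustrative.

```python
def interpolate_s(value):
    """Return the bytes that ``%s`` would insert for ``value``."""
    # Rule 1: the object supports the buffer protocol, so copy its bytes.
    try:
        return memoryview(value).tobytes()
    except TypeError:
        pass
    # Rule 2: otherwise fall back to the type's __bytes__ method, if any.
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is None:
        raise TypeError(
            "%s requires a buffer or an object with __bytes__, not "
            + type(value).__name__
        )
    return to_bytes(value)


class Packet:
    """Hypothetical type that defines its own wire representation."""

    def __bytes__(self):
        return b"\x01\x02"


from_buffer = interpolate_s(bytearray(b"abc"))   # via the buffer protocol
from_method = interpolate_s(Packet())            # via __bytes__
```

Numbers and ``str`` satisfy neither rule, so ``interpolate_s(3.14)`` and
``interpolate_s('hello world!')`` both raise ``TypeError``.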


``%a`` will give the equivalent of
``repr(some_obj).encode('ascii', 'backslashreplace')`` on the interpolated
value.  Use cases include developing a new protocol and writing landmarks
into the stream; debugging data going into an existing protocol to see if
the problem is the protocol itself or bad data; a fall-back for a serialization
format; or any situation where defining ``__bytes__`` would not be appropriate
but a readable/informative representation is needed [8]_.

Examples::

    >>> b'%a' % 3.14
    b'3.14'

    >>> b'%a' % b'abc'
    b"b'abc'"

    >>> b'%a' % 'def'
    b"'def'"
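
The equivalence given for ``%a`` can be written out directly.  A minimal
sketch, assuming an interpreter that implements this proposal (CPython 3.5
or later); ``percent_a`` and the ``samples`` list are illustrative names.

```python
def percent_a(obj):
    """Spell out the equivalence stated above for ``%a``."""
    return repr(obj).encode("ascii", "backslashreplace")


# A mix of values, including a str with a non-ASCII character.
samples = [3.14, b"abc", "def", "caf\u00e9", None, [1, "two"]]

# For each sample, does the spelled-out form match the real operator?
agreement = [percent_a(obj) == b"%a" % (obj,) for obj in samples]

# Non-ASCII characters come out as backslash escapes.
escaped = percent_a("caf\u00e9")
```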


Unsupported codes
-----------------

``%r`` (which calls ``__repr__`` and returns a ``str``) is not supported.


Proposed variations
===================

It was suggested to let ``%s`` accept numbers, but since numbers have their own
format codes this idea was discarded.

It has been suggested to use ``%b`` for bytes as well as ``%s``.  This was
rejected as not adding any value either in clarity or simplicity.

It has been proposed to automatically use ``.encode('ascii','strict')`` for
``str`` arguments to ``%s``.

  - Rejected as this would lead to intermittent failures.  Better to have the
    operation always fail so the trouble-spot can be correctly fixed.
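
The intermittent failure mode can be illustrated with a sketch of the
rejected behaviour.  ``percent_s_with_autoencode`` is a hypothetical
function written only for this illustration; it is not proposed anywhere in
this PEP.

```python
def percent_s_with_autoencode(value):
    """Hypothetical %s that silently encodes str as strict ASCII."""
    if isinstance(value, str):
        return value.encode("ascii", "strict")
    return memoryview(value).tobytes()


# Works during testing, when all names happen to be ASCII ...
worked = percent_s_with_autoencode("Smith")

# ... and fails later, when the first non-ASCII name arrives.
try:
    percent_s_with_autoencode("M\u00fcller")
    failed_later = False
except UnicodeEncodeError:
    failed_later = True
```

Whether the call succeeds depends on the data rather than on the code,
which is the kind of failure the rejection is meant to avoid.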

It has been proposed to have ``%s`` return the ascii-encoded repr when the
value is a ``str`` (``b'%s' % 'abc'``  --> ``b"'abc'"``).

  - Rejected as this would lead to hard to debug failures far from the problem
    site.  Better to have the operation always fail so the trouble-spot can be
    easily fixed.

Originally this PEP also proposed adding format-style formatting, but it was
decided that format and its related machinery were all strictly text (aka
``str``) based, and it was dropped.

Various new special methods were proposed, such as ``__ascii__``,
``__format_bytes__``, etc.; such methods are not needed at this time, but can
be revisited later if real-world use shows deficiencies with this solution.

A competing PEP, ``PEP 460 Add binary interpolation and formatting`` [9]_, also
exists.


Objections
==========

The objections raised against this PEP were mainly variations on two themes:

- the ``bytes`` and ``bytearray`` types are for pure binary data, with no
  assumptions about encodings
- offering %-interpolation that assumes an ASCII encoding will be an
  attractive nuisance and lead us back to the problems of the Python 2
  ``str``/``unicode`` text model

As was seen during the discussion, ``bytes`` and ``bytearray`` are also used
for mixed binary data and ASCII-compatible segments: file formats such as
``dbf`` and ``pdf``, network protocols such as ``ftp`` and ``email``, etc.

``bytes`` and ``bytearray`` already have several methods which assume an ASCII
compatible encoding:  ``upper()``, ``isalpha()``, and ``expandtabs()``, to name
just a few.  %-interpolation, with its very restricted mini-language, will not
be any more of a nuisance than the already existing methods.
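
The ASCII assumption in the existing methods is easy to observe.  The
sketch below uses only methods that ``bytes`` already provides; the sample
values are illustrative.

```python
header = b"content-type: text/plain"

# upper() changes only the ASCII letters a-z.
upper = header.upper()

# isalpha() recognises only ASCII letters; the Latin-1 byte for an
# accented letter is not considered alphabetic.
ascii_alpha = b"abc".isalpha()
latin1_alpha = "\u00e9".encode("latin-1").isalpha()

# expandtabs() treats byte 0x09 as an ASCII tab and pads with ASCII spaces.
expanded = b"a\tb".expandtabs(4)
```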

Some have objected to allowing the full range of numeric formatting codes with
the claim that decimal alone would be sufficient.  However, at least two
formats (dbf and pdf) make use of non-decimal numbers.


Footnotes
=========

.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither string.Template, format, nor str.format are under consideration
.. [3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html
.. [4] to use a str object in a bytes interpolation, encode it first
.. [5] %c is not an exception as neither of its possible arguments are str
.. [6] http://docs.python.org/3/c-api/buffer.html
       examples:  ``memoryview``, ``array.array``, ``bytearray``, ``bytes``
.. [7] http://docs.python.org/3/reference/datamodel.html#object.__bytes__
.. [8] https://mail.python.org/pipermail/python-dev/2014-February/132750.html
.. [9] http://python.org/dev/peps/pep-0460/


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: