PEP: 461
Title: Adding % formatting to bytes and bytearray
Author: Ethan Furman <ethan@stoneleaf.us>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 13-Jan-2014
Python-Version: 3.5
Post-History: 14-Jan-2014, 15-Jan-2014, 17-Jan-2014, 22-Feb-2014, 25-Mar-2014,
              27-Mar-2014
Resolution: https://mail.python.org/pipermail/python-dev/2014-March/133621.html

Abstract
========

This PEP proposes adding % formatting operations similar to Python 2's ``str``
type to ``bytes`` and ``bytearray`` [1]_ [2]_.

Rationale
=========

While interpolation is usually thought of as a string operation, there are
cases where interpolation on ``bytes`` or ``bytearray`` makes sense, and the
work needed to make up for this missing functionality detracts from the overall
readability of the code.

Motivation
==========

With Python 3 and the split between ``str`` and ``bytes``, one small but
important area of programming became slightly more difficult, and much more
painful -- wire format protocols [3]_.

This area of programming is characterized by a mixture of binary data and
ASCII compatible segments of text (aka ASCII-encoded text).  Bringing back a
restricted %-interpolation for ``bytes`` and ``bytearray`` will aid both in
writing new wire format code, and in porting Python 2 wire format code.

Common use-cases include ``dbf`` and ``pdf`` file formats, ``email``
formats, and ``FTP`` and ``HTTP`` communications, among many others.

Proposed semantics for ``bytes`` and ``bytearray`` formatting
=============================================================

%-interpolation
---------------

All the numeric formatting codes (``d``, ``i``, ``o``, ``u``, ``x``, ``X``,
``e``, ``E``, ``f``, ``F``, ``g``, ``G``, and any that are subsequently added
to Python 3) will be supported, and will work as they do for ``str``, including
the padding, justification and other related modifiers (currently ``#``, ``0``,
``-``, space, and ``+`` (plus any added to Python 3)).  The only
non-numeric codes allowed are ``c``, ``b``, ``a``, ``s`` (which is a
synonym for ``b``), and ``r`` (which is a synonym for ``a``).

For the numeric codes, the only difference between ``str`` and ``bytes`` (or
``bytearray``) interpolation is that the results from these codes will be
ASCII-encoded text, not Unicode.  In other words, for any numeric formatting
code ``%x``::

    b"%x" % val

is equivalent to::

    ("%x" % val).encode("ascii")

Examples::

    >>> b'%4x' % 10
    b'   a'

    >>> b'%#4x' % 10
    b' 0xa'

    >>> b'%04X' % 10
    b'000A'

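
The equivalence between ``bytes`` and ``str`` interpolation for numeric codes
can be checked directly on Python 3.5 or later.  The following is a minimal
sketch: the ``SAMPLES`` list and the ``numeric_codes_match`` helper are
illustrative names chosen here, not part of this proposal.

.. code-block:: python

    # Illustrative (format, value) pairs; these are not taken from the PEP.
    SAMPLES = [
        ("%d", 42),
        ("%+05d", 42),
        ("%o", 8),
        ("%#x", 255),
        ("%-6X|", 255),
        ("%.3f", 3.14159),
        ("%e", 12345.678),
        ("%g", 0.0001),
    ]


    def numeric_codes_match(samples=SAMPLES):
        """Return True if bytes and str interpolation agree for every sample."""
        for fmt, val in samples:
            via_bytes = fmt.encode("ascii") % val
            via_str = (fmt % val).encode("ascii")
            if via_bytes != via_str:
                return False
        return True


    # A bytearray template behaves the same way but produces a bytearray.
    padded = bytearray(b"%04X") % 10
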
``%c`` will insert a single byte, either from an ``int`` in ``range(256)``, or
from a ``bytes`` argument of length 1, not from a ``str``.

Examples::

    >>> b'%c' % 48
    b'0'

    >>> b'%c' % b'a'
    b'a'

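
The accepted and rejected inputs for ``%c`` can be sketched as follows.  The
``try_format`` helper is a hypothetical name used only for this illustration,
and the exception types shown are those raised by CPython 3.5 and later; this
PEP does not itself specify them.

.. code-block:: python

    def try_format(value):
        """Return the formatted bytes, or the name of the exception raised."""
        try:
            return b"%c" % (value,)
        except (TypeError, OverflowError) as exc:
            return type(exc).__name__


    results = {
        "int in range": try_format(65),         # a single byte, b'A'
        "int upper bound": try_format(255),     # b'\xff'
        "length-1 bytes": try_format(b"\xff"),  # b'\xff'
        "int too large": try_format(256),       # outside range(256)
        "length-2 bytes": try_format(b"ab"),    # not a single byte
        "str": try_format("a"),                 # str is never accepted
    }
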
``%b`` will insert a series of bytes.  These bytes are collected in one of two
ways:

- input type supports ``Py_buffer`` [4]_?
  use it to collect the necessary bytes

- input type is something else?
  use its ``__bytes__`` method [5]_; if there isn't one, raise a ``TypeError``

In particular, ``%b`` will accept neither numbers nor ``str``.  ``str`` is
rejected as the string to bytes conversion requires an encoding, and we are
refusing to guess; numbers are rejected because:

- what makes a number is fuzzy (float? Decimal? Fraction? some user type?)

- allowing numbers would lead to ambiguity between numbers and textual
  representations of numbers (3.14 vs '3.14')

- given the nature of wire formats, explicit is definitely better than implicit

``%s`` is included as a synonym for ``%b`` for the sole purpose of making 2/3
code bases easier to maintain.  Python 3 only code should use ``%b``.

Examples::

    >>> b'%b' % b'abc'
    b'abc'

    >>> b'%b' % 'some string'.encode('utf8')
    b'some string'

    >>> b'%b' % 3.14
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'float'

    >>> b'%b' % 'hello world!'
    Traceback (most recent call last):
    ...
    TypeError: b'%b' does not accept 'str'


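
The two collection paths for ``%b`` (buffer support and ``__bytes__``) can be
sketched as follows.  ``Packet`` and ``rejects`` are hypothetical names
introduced for this illustration only; the buffer-supporting types are the
ones listed in footnote [4]_.

.. code-block:: python

    import array


    class Packet:
        """Hypothetical type that defines __bytes__ but is not a buffer."""

        def __init__(self, payload):
            self.payload = payload

        def __bytes__(self):
            return b"<" + self.payload + b">"


    # Objects supporting the buffer protocol are inserted byte for byte.
    from_buffers = b"%b|%b|%b" % (
        memoryview(b"abc"),
        bytearray(b"def"),
        array.array("B", [103, 104, 105]),  # the bytes for "ghi"
    )

    # Other objects are converted through their __bytes__ method.
    from_method = b"%b" % (Packet(b"data"),)


    def rejects(value):
        """Return True if %b refuses the value with a TypeError."""
        try:
            b"%b" % (value,)
        except TypeError:
            return True
        return False

A ``str`` has to be encoded explicitly before it is interpolated, for example
``b"%b" % ("text".encode("ascii"),)``.
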
``%a`` will give the equivalent of
``repr(some_obj).encode('ascii', 'backslashreplace')`` on the interpolated
value.  Use cases include developing a new protocol and writing landmarks
into the stream; debugging data going into an existing protocol to see if
the problem is the protocol itself or bad data; a fall-back for a serialization
format; or any situation where defining ``__bytes__`` would not be appropriate
but a readable/informative representation is needed [6]_.

``%r`` is included as a synonym for ``%a`` for the sole purpose of making 2/3
code bases easier to maintain.  Python 3 only code should use ``%a`` [7]_.

Examples::

    >>> b'%a' % 3.14
    b'3.14'

    >>> b'%a' % b'abc'
    b"b'abc'"

    >>> b'%a' % 'def'
    b"'def'"


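
The stated equivalence for ``%a`` matters most for non-ASCII text, where the
result is escaped rather than encoded.  A minimal sketch, in which
``A_SAMPLES`` and ``a_matches_repr`` are illustrative names and the sample
values are arbitrary:

.. code-block:: python

    # Arbitrary sample objects, including non-ASCII text and raw bytes.
    A_SAMPLES = ["caf\u00e9", "\u20ac5", b"\xff\x00", 3.14, None, ["\U0001f600"]]


    def a_matches_repr(obj):
        """Compare %a against repr() encoded with backslashreplace."""
        via_format = b"%a" % (obj,)
        via_repr = repr(obj).encode("ascii", "backslashreplace")
        return via_format == via_repr


    # Non-ASCII characters appear as backslash escapes in the result.
    escaped = b"%a" % ("caf\u00e9",)
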
Compatibility with Python 2
===========================

As noted above, ``%s`` and ``%r`` are being included solely to help ease
migration from, and/or have a single code base with, Python 2.  This is
important as there are modules both in the wild and behind closed doors that
currently use the Python 2 ``str`` type as a ``bytes`` container, and hence
are using ``%s`` as a bytes interpolator.

However, ``%b`` and ``%a`` should be used in new, Python 3 only code, so ``%s``
and ``%r`` will immediately be deprecated, but not removed from the 3.x series
[7]_.

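
As a sketch of what single-code-base wire-format code can look like, the
hypothetical ``frame`` helper below uses only ``%s`` and ``%d``, so the same
source is intended to run on Python 2 (where ``str`` is the bytes container)
and on Python 3.5 or later.  The framing layout is invented for illustration
and does not correspond to any real protocol.

.. code-block:: python

    def frame(tag, payload):
        """Hypothetical framing: tag, decimal payload length, then payload."""
        return b"%s %d\r\n%s\r\n" % (tag, len(payload), payload)


    message = frame(b"DATA", b"hello")

    # On Python 3 the compatibility codes give the same results as the
    # preferred spellings.
    same_as_b = (b"%s" % (b"abc",)) == (b"%b" % (b"abc",))
    same_as_a = (b"%r" % ("abc",)) == (b"%a" % ("abc",))
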
Proposed variations
===================

It has been proposed to automatically use ``.encode('ascii','strict')`` for
``str`` arguments to ``%b``.

- Rejected as this would lead to intermittent failures.  Better to have the
  operation always fail so the trouble-spot can be correctly fixed.

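
The intermittent failure mode can be illustrated with a hypothetical
``lenient_b`` helper that mimics the rejected variation.  It is not part of
Python or of this proposal; it exists only to show that such code succeeds or
fails depending on the data it happens to receive.

.. code-block:: python

    def lenient_b(fmt, value):
        """Mimic the rejected variation: ASCII-encode str arguments to %b."""
        if isinstance(value, str):
            value = value.encode("ascii", "strict")
        return fmt % (value,)


    # Works while the data happens to be ASCII...
    ok = lenient_b(b"USER %b\r\n", "alice")

    # ...and fails only when a non-ASCII value eventually arrives.
    try:
        lenient_b(b"USER %b\r\n", "ren\u00e9e")
        failed_late = False
    except UnicodeEncodeError:
        failed_late = True


    def strict_fails(value):
        """With the adopted behaviour, every str fails in the same way."""
        try:
            b"USER %b\r\n" % (value,)
        except TypeError:
            return True
        return False
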
It has been proposed to have ``%b`` return the ASCII-encoded repr when the
value is a ``str`` (``b'%b' % 'abc'`` --> ``b"'abc'"``).

- Rejected as this would lead to hard to debug failures far from the problem
  site.  Better to have the operation always fail so the trouble-spot can be
  easily fixed.

Originally this PEP also proposed adding format-style formatting, but it was
decided that format and its related machinery were all strictly text (aka
``str``) based, and it was dropped.

Various new special methods were proposed, such as ``__ascii__``,
``__format_bytes__``, etc.; such methods are not needed at this time, but can
be visited again later if real-world use shows deficiencies with this solution.

A competing PEP, :pep:`PEP 460 Add binary interpolation and formatting <460>`,
also exists.

Objections
==========

The objections raised against this PEP were mainly variations on two themes:

- the ``bytes`` and ``bytearray`` types are for pure binary data, with no
  assumptions about encodings

- offering %-interpolation that assumes an ASCII encoding will be an
  attractive nuisance and lead us back to the problems of the Python 2
  ``str``/``unicode`` text model

As was seen during the discussion, ``bytes`` and ``bytearray`` are also used
for mixed binary data and ASCII-compatible segments: file formats such as
``dbf`` and ``pdf``, network protocols such as ``ftp`` and ``email``, etc.

``bytes`` and ``bytearray`` already have several methods which assume an
ASCII-compatible encoding: ``upper()``, ``isalpha()``, and ``expandtabs()``, to
name just a few.  %-interpolation, with its very restricted mini-language, will
not be any more of a nuisance than the already existing methods.

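
The ASCII assumption built into the existing methods can be seen in a short
sketch.  The sample data is arbitrary: the UTF-8 encoding of a word containing
one non-ASCII letter, followed by a tab and two ASCII letters.

.. code-block:: python

    data = b"caf\xc3\xa9\tok"

    # Only the ASCII letters are upper-cased; the other bytes pass through.
    upper = data.upper()

    # isalpha() recognises ASCII letters only.
    ascii_alpha = b"abc".isalpha()
    non_ascii_alpha = b"\xc3\xa9".isalpha()

    # expandtabs() treats byte 0x09 as an ASCII tab.
    tabs = b"a\tb".expandtabs(4)

    # %-interpolation makes the same assumption for the text it generates.
    summary = b"%d bytes" % len(data)
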
Some have objected to allowing the full range of numeric formatting codes with
the claim that decimal alone would be sufficient.  However, at least two
formats (dbf and pdf) make use of non-decimal numbers.

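
As an illustration of non-decimal codes in practice, the sketch below produces
two notations used in PDF string objects: hexadecimal strings in angle
brackets and three-digit octal escapes.  The helper names are hypothetical and
the code is not taken from any PDF library.

.. code-block:: python

    def pdf_hex_string(data):
        """Render bytes as a PDF-style hexadecimal string."""
        return b"<%b>" % b"".join(b"%02X" % byte for byte in data)


    def pdf_octal_escape(byte):
        """Render one byte value as a backslash plus three octal digits."""
        return b"\\%03o" % byte


    hex_form = pdf_hex_string(b"Hello")
    octal_form = pdf_octal_escape(0x28)  # the byte for "("
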
Footnotes
=========

.. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting
.. [2] neither ``string.Template``, ``format``, nor ``str.format`` is under
       consideration
.. [3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html
.. [4] http://docs.python.org/3/c-api/buffer.html
       examples: ``memoryview``, ``array.array``, ``bytearray``, ``bytes``
.. [5] http://docs.python.org/3/reference/datamodel.html#object.__bytes__
.. [6] https://mail.python.org/pipermail/python-dev/2014-February/132750.html
.. [7] http://bugs.python.org/issue23467 -- originally ``%r`` was not allowed,
       but was added for consistency during the 3.5 alpha stage.

Copyright
=========

This document has been placed in the public domain.