PEP: 461 Title: Adding % formatting to bytes and bytearray Version: $Revision$ Last-Modified: $Date$ Author: Ethan Furman Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 2014-01-13 Python-Version: 3.5 Post-History: 2014-01-14, 2014-01-15, 2014-01-17, 2014-02-22 Resolution: Abstract ======== This PEP proposes adding % formatting operations similar to Python 2's ``str`` type to ``bytes`` and ``bytearray`` [1]_ [2]_. Rationale ========= While interpolation is usually thought of as a string operation, there are cases where interpolation on ``bytes`` or ``bytearrays`` make sense, and the work needed to make up for this missing functionality detracts from the overall readability of the code. Motivation ========== With Python 3 and the split between ``str`` and ``bytes``, one small but important area of programming became slightly more difficult, and much more painful -- wire format protocols [3]_. This area of programming is characterized by a mixture of binary data and ASCII compatible segments of text (aka ASCII-encoded text). Bringing back a restricted %-interpolation for ``bytes`` and ``bytearray`` will aid both in writing new wire format code, and in porting Python 2 wire format code. Overriding Principles ===================== In order to avoid the problems of auto-conversion and Unicode exceptions that could plague Python 2 code, ``str`` objects will not be supported as interpolation values [4]_ [5]_. Proposed semantics for ``bytes`` and ``bytearray`` formatting ============================================================= %-interpolation --------------- All the numeric formatting codes (such as ``%x``, ``%o``, ``%e``, ``%f``, ``%g``, etc.) will be supported, and will work as they do for str, including the padding, justification and other related modifiers. Example:: >>> b'%4x' % 10 b' a' >>> '%#4x' % 10 ' 0xa' >>> '%04X' % 10 '000A' ``%c`` will insert a single byte, either from an ``int`` in range(256), or from a ``bytes`` argument of length 1, not from a ``str``. Example:: >>> b'%c' % 48 b'0' >>> b'%c' % b'a' b'a' ``%s`` is restricted in what it will accept:: - input type supports ``Py_buffer`` [6]_? use it to collect the necessary bytes - input type is something else? use its ``__bytes__`` method [7]_ ; if there isn't one, raise a ``TypeError`` Examples:: >>> b'%s' % b'abc' b'abc' >>> b'%s' % 3.14 Traceback (most recent call last): ... TypeError: 3.14 has no __bytes__ method, use a numeric code instead >>> b'%s' % 'hello world!' Traceback (most recent call last): ... TypeError: 'hello world' has no __bytes__ method, perhaps you need to encode it? .. note:: Because the ``str`` type does not have a ``__bytes__`` method, attempts to directly use ``'a string'`` as a bytes interpolation value will raise an exception. To use strings they must be encoded or otherwise transformed into a ``bytes`` sequence:: 'a string'.encode('latin-1') ``%a`` will call ``ascii()`` on the interpolated value's ``repr()``. This is intended as a debugging aid, rather than something that should be used in production. Non-ascii values will be encoded to either ``\xnn`` or ``\unnnn`` representation. Unsupported codes ----------------- ``%r`` (which calls ``__repr__`` and returns a '`str`') is not supported. Proposed variations =================== It was suggested to let ``%s`` accept numbers, but since numbers have their own format codes this idea was discarded. It has been proposed to automatically use ``.encode('ascii','strict')`` for ``str`` arguments to ``%s``. - Rejected as this would lead to intermittent failures. Better to have the operation always fail so the trouble-spot can be correctly fixed. It has been proposed to have ``%s`` return the ascii-encoded repr when the value is a ``str`` (b'%s' % 'abc' --> b"'abc'"). - Rejected as this would lead to hard to debug failures far from the problem site. Better to have the operation always fail so the trouble-spot can be easily fixed. Originally this PEP also proposed adding format-style formatting, but it was decided that format and its related machinery were all strictly text (aka ``str``) based, and it was dropped. Various new special methods were proposed, such as ``__ascii__``, ``__format_bytes__``, etc.; such methods are not needed at this time, but can be visited again later if real-world use shows deficiencies with this solution. Objections ========== The objections raised against this PEP were mainly variations on two themes:: - the ``bytes`` and ``bytearray`` types are for pure binary data, with no assumptions about encodings - offering %-interpolation that assumes an ASCII encoding will be an attractive nuisance and lead us back to the problems of the Python 2 ``str``/``unicode`` text model As was seen during the discussion, ``bytes`` and ``bytearray`` are also used for mixed binary data and ASCII-compatible segments: file formats such as ``dbf`` and ``pdf``, network protocols such as ``ftp`` and ``email``, etc. ``bytes`` and ``bytearray`` already have several methods which assume an ASCII compatible encoding. ``upper()``, ``isalpha()``, and ``expandtabs()`` to name just a few. %-interpolation, with its very restricted mini-language, will not be any more of a nuisance than the already existing methdods. Open Questions ============== It has been suggested to use ``%b`` for bytes as well as ``%s``. - Pro: clearly says 'this is bytes'; should be used for new code. - Con: does not exist in Python 2.x, so we would have two ways of doing the same thing, ``%s`` and ``%b``, with no difference between them. Footnotes ========= .. [1] http://docs.python.org/2/library/stdtypes.html#string-formatting .. [2] neither string.Template, format, nor str.format are under consideration .. [3] https://mail.python.org/pipermail/python-dev/2014-January/131518.html .. [4] to use a str object in a bytes interpolation, encode it first .. [5] %c is not an exception as neither of its possible arguments are str .. [6] http://docs.python.org/3/c-api/buffer.html examples: ``memoryview``, ``array.array``, ``bytearray``, ``bytes`` .. [7] http://docs.python.org/3/reference/datamodel.html#object.__bytes__ .. [8] mainly implicit encode/decode, with intermittent errors when the data was not ASCII compatible Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: