Overhaul PEP 460, and add myself as author

2014-01-08 23:38:18 +01:00 · 2014-01-08 23:38:18 +01:00 · 19f33e611b
parent e35a26608c
commit 19f33e611b
1 changed files with 95 additions and 108 deletions
--- a/pep-0460.txt
+++ b/pep-0460.txt
@@ -1,8 +1,8 @@
 PEP: 460
-Title: Add bytes % args and bytes.format(args) to Python 3.5
+Title: Add binary interpolation and formatting
 Version: $Revision$
 Last-Modified: $Date$
-Author: Victor Stinner <victor.stinner@gmail.com>
+Author: Victor Stinner <victor.stinner@gmail.com>, Antoine Pitrou <solipsis@pitrou.net>
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
@@ -13,136 +13,124 @@ Python-Version: 3.5
 Abstract
 ========

-Add ``bytes % args`` operator and ``bytes.format(args)`` method to
-Python 3.5.
+This PEP proposes to add minimal formatting operations to bytes and
+bytearray objects.  The proposed additions are:
+
+* ``bytes % ...`` and ``bytearray % ...`` for percent-formatting,
+  similar in syntax to percent-formatting on ``str`` objects
+  (accepting a single object, a tuple or a dict).
+
+* ``bytes.format(...)`` and ``bytearray.format(...)`` for formatting
+  similar in syntax to ``str.format()`` (accepting positional as well as
+  keyword arguments).


 Rationale
 =========

-``bytes % args`` and ``bytes.format(args)`` have been removed in Python
-2. This operator and this method are requested by Mercurial and Twisted
-developers to ease porting their project on Python 3.
+In Python 2, ``str % args`` and ``str.format(args)`` allow the formatting
+and interpolation of 8-bit strings.  This feature has commonly been used
+for the assembling of protocol messages when protocols are known to use
+a fixed encoding.

-Python 3 suggests to format text first and then encode to bytes. In
-some cases, it does not make sense because arguments are bytes strings.
-Typical usage is a network protocol which is binary, since data are
-send to and received from sockets. For example, SMTP, SIP, HTTP, IMAP,
-POP, FTP are ASCII commands interspersed with binary data.
+Python 3 generally mandates that text be stored and manipulated as Unicode
+(i.e. ``str`` objects, not ``bytes``).  In some cases, though, it makes
+sense to manipulate ``bytes`` objects directly.  Typical usage is binary
+network protocols, where you may want to interpolate and assemble several
+bytes objects (some of them literals, some of them computed) to produce
+complete protocol messages.  For example, protocols such as HTTP or SIP
+have headers with ASCII names and opaque "textual" values using a varying
+and/or sometimes ill-defined encoding.  Moreover, those headers can be
+followed by a binary body... which can be chunked and decorated with ASCII
+headers and trailers!

-Using multiple ``bytes + bytes`` instructions is inefficient because it
-requires temporary buffers and copies which are slow and waste memory.
-Python 3.3 optimizes ``str2 += str2`` but not ``bytes2 += bytes1``.
-
-``bytes % args`` and ``bytes.format(args)`` were asked since 2008, even
-before the first release of Python 3.0 (see issue #3982).
-
-``struct.pack()`` is incomplete. For example, a number cannot be
-formatted as decimal and it does not support padding bytes string.
-
-Mercurial 2.8 still supports Python 2.4.
+While there are reasonably efficient ways to accumulate binary data
+(such as using a ``bytearray`` object, the ``bytes.join`` method or
+even ``io.BytesIO``), none of them leads to the kind of readable and
+intuitive code that is produced by a %-formatted or {}-formatted template
+and a formatting operation.


-Needed and excluded features
-============================
+Binary formatting features
+==========================

-Needed features
+Supported features
+------------------

-* Bytes strings: bytes, bytearray and memoryview types
-* Format integer numbers as decimal
-* Padding with spaces and null bytes
-* "%s" should use the buffer protocol, not str()
+In this proposal, percent-formatting for ``bytes`` and ``bytearray``
+supports the following features:

-The feature set is minimal to keep the implementation as simple as
-possible to limit the cost of the implementation. ``str % args`` and
-``str.format(args)`` are already complex and difficult to maintain, the
-code is heavily optimized.
+* Looking up formatting arguments by position as well as by name (i.e.,
+  ``%s`` as well as ``%(name)s``).
+* ``%s`` will try to get a ``Py_buffer`` on the given value, and fall back
+  on calling ``__bytes__``.  The resulting binary data is inserted at
+  the given point in the string.  This is expected to work with bytes,
+  bytearray and memoryview objects (as well as a couple others such
+  as pathlib's path objects).
+* ``%c`` will accept an integer between 0 and 255, and insert a byte of the
+  given value.

-Excluded features:
+Braces-formatting for ``bytes`` and ``bytearray`` supports the following
+features:

-* no implicit conversion from Unicode to bytes (ex: encode to ASCII or
-  to Latin1)
-* Locale support (``{!n}`` format for numbers). Locales are related to
-  text and usually to an encoding.
-* ``repr()``, ``ascii()``: ``%r``, ``{!r}``, ``%a`` and ``{!a}``
-  formats. ``repr()`` and ``ascii()`` are used to debug, the output is
-  displayed a terminal or a graphical widget. They are more related to
-  text.
-* Attribute access: ``{obj.attr}``
-* Indexing: ``{dict[key]}``
-* Features of struct.pack(). For example, format a number as 32 bit unsigned
-  integer in network endian. The ``struct.pack()`` can be used to prepare
-  arguments, the implementation should be kept simple.
-* Features of int.to_bytes().
-* Features of ctypes.
-* New format protocol like a new ``__bformat__()`` method. Since the
-  list of
-  supported types is short, there is no need to add a new protocol.
-  Other types must be explicitly casted.
-* Alternate format for integer. For example, ``'{|#x}'.format(0x123)``
-  to get ``0x123``. It is more related to debug, and the prefix can be
-  easily be written in the format string (ex: ``0x%x``).
-* Relation with format() and the __format__() protocol. bytes.format()
-  and str.format() are unrelated.
+* All the kinds of argument lookup supported by ``str.format()`` (explicit
+  positional lookup, auto-incremented positional lookup, keyword lookup,
+  attribute lookup, etc.)
+* Insertion of binary data when no modifier or layout is specified
+  (e.g. ``{}``, ``{0}``, ``{name}``).  This has the same semantics as
+  ``%s`` for percent-formatting (see above).
+* The ``c`` modifier will accept an integer between 0 and 255, and insert a
+  byte of the given value (same as ``%c`` above).

-Unknown:
+Unsupported features
+--------------------

-* Format integer to hexadecimal? ``%x`` and ``%X``
-* Format integer to octal? ``%o``
-* Format integer to binary? ``{!b}``
-* Alignment?
-* Truncating? Truncate or raise an error?
-* format keywords? ``b'{arg}'.format(arg=5)``
-* ``str % dict`` ? ``b'%(arg)s' % {'arg': 5)``
-* Floating point number?
-* ``%i``, ``%u`` and ``%d`` formats for integer numbers?
-* Signed number? ``%+i`` and ``%-i``
+All other features present in formatting of ``str`` objects (either
+through the percent operator or the ``str.format()`` method) are
+unsupported.  Those features imply treating the recipient of the
+operator or method as text, which runs counter to the text / bytes
+separation (for example, accepting ``%d`` as a format code would imply
+that the bytes object really is an ASCII-compatible text string).

-
-bytes % args
-============
-
-Formatters:
-
-* ``"%c"``: one byte
-* ``"%s"``: integer or bytes strings
-* ``"%20s"`` pads to 20 bytes with spaces (``b' '``)
-* ``"%020s"`` pads to 20 bytes with zeros (``b'0'``)
-* ``"%\020s"`` pads to 20 bytes with null bytes (``b'\0'``)
-
-
-bytes.format(args)
-==================
-
-Formatters:
-
-* ``"{!c}"``: one byte
-* ``"{!s}"``: integer or bytes strings
-* ``"{!.20s}"`` pads to 20 bytes with spaces (``b' '``)
-* ``"{!.020s}"`` pads to 20 bytes with zeros (``b'0'``)
-* ``"{!\020s}"`` pads to 20 bytes with null bytes (``b'\0'``)
-
-
-Examples
-========
-
-* ``b'a%sc%s' % (b'b', 4)`` gives ``b'abc4'``
-* ``b'a{}c{}'.format(b'b', 4)`` gives ``b'abc4'``
-* ``b'%c'`` % 88`` gives ``b'X``'
-* ``b'%%'`` gives ``b'%'``
+Amongst those unsupported features are not only most type-specific
+format codes, but also the various layout specifiers such as padding
+or alignment.  Besides, ``str`` objects are not acceptable as arguments
+to the formatting operations, even when using e.g. the ``%s`` format code.


 Criticisms
 ==========

 * The development cost and maintenance cost.
-* In 3.3 encoding to ascii or latin1 is as fast as memcpy
-* Developers must work around the lack of bytes%args and
-  bytes.format(args) anyway to support Python 3.0-3.4
-* bytes.join() is consistently faster than format to join bytes strings.
-* Formatting functions can be implemented in a third party module
+* In 3.3 encoding to ASCII or latin-1 is as fast as memcpy (but it still
+  creates a separate object).
+* Developers will have to work around the lack of binary formatting anyway,
+  if they want to support Python 3.4 and earlier.
+* bytes.join() is consistently faster than format to join bytes strings
+  (XXX *is it?*).
+* Formatting functions could be implemented in a third party module,
+  rather than added to builtin types.


+Other proposals
+===============
+
+A new datatype
+--------------
+
+It was proposed to create a new datatype specialized for "network
+programming".  The authors of this PEP believe this is counter-productive.
+Python 3 already has several major types dedicated to manipulation of
+binary data: ``bytes``, ``bytearray``, ``memoryview``, ``io.BytesIO``.
+
+Adding yet another type would make things more confusing for users, and
+interoperability between libraries more painful (also potentially
+sub-optimal, due to the necessary conversions).
+
+Moreover, not one type would be needed, but two: one immutable type (to
+allow for hashing), and one mutable type (as efficient accumulation is
+often necessary when working with network messages).
+
 References
 ==========

@@ -172,4 +160,3 @@ This document has been placed in the public domain.
   fill-column: 70
   coding: utf-8
   End:
-
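
As a rough illustration of the revised text, the sketch below emulates two pieces of the proposal with facilities that already exist: the lookup order described for ``%s`` (buffer protocol first, then ``__bytes__``) and the range check described for ``%c``, together with the three accumulation approaches the Rationale names (``bytes.join``, ``bytearray`` and ``io.BytesIO``). The names ``as_binary``, ``byte_from_int``, ``Token`` and the ``assemble_*`` functions are hypothetical and do not come from the PEP; this is an assumption-laden approximation, not the proposed implementation.

```python
import io


def as_binary(value):
    """Hypothetical helper approximating the lookup order described for %s."""
    try:
        # First try the buffer protocol (bytes, bytearray, memoryview, ...).
        return bytes(memoryview(value))
    except TypeError:
        pass
    # Fall back on __bytes__, looked up on the type like other special methods.
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is None:
        # str and int end up here: neither exposes a buffer nor __bytes__.
        raise TypeError(
            "cannot interpolate {!r} object into bytes".format(type(value).__name__)
        )
    return to_bytes(value)


def byte_from_int(value):
    """Hypothetical helper approximating %c: one byte from an int in 0..255."""
    if not 0 <= value <= 255:
        raise ValueError("byte value must be in range(0, 256)")
    return bytes([value])


class Token:
    """Hypothetical object that is only convertible through __bytes__."""

    def __bytes__(self):
        return b"tok"


def assemble_join(parts):
    # bytes.join builds the result in a single pass over the parts.
    return b"".join(as_binary(part) for part in parts)


def assemble_bytearray(parts):
    # A bytearray grows in place, then is frozen into an immutable bytes.
    buffer = bytearray()
    for part in parts:
        buffer += as_binary(part)
    return bytes(buffer)


def assemble_bytesio(parts):
    # io.BytesIO offers a file-like interface over an in-memory buffer.
    stream = io.BytesIO()
    for part in parts:
        stream.write(as_binary(part))
    return stream.getvalue()
```

All three ``assemble_*`` functions return the same bytes for the same parts; they differ only in how the intermediate buffer is managed, which is the readability trade-off the Rationale contrasts with a formatted template.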