Overhaul PEP 460, and add myself as author

2014-01-08 23:38:18 +01:00 · 2014-01-08 23:38:18 +01:00 · 19f33e611b
parent e35a26608c
commit 19f33e611b
1 changed files with 95 additions and 108 deletions
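
For context, the revised Rationale in this commit names three existing ways to accumulate binary data (``bytearray``, ``bytes.join`` and ``io.BytesIO``). The following is a minimal sketch of those three approaches, not code from the PEP; the request line, header values and variable names are hypothetical and chosen only for illustration.

```python
import io

# Hypothetical request pieces, chosen only for illustration.
host = b"example.org"
body = b"hello"
length = str(len(body)).encode("ascii")  # decimal length as ASCII digits

parts = [
    b"POST / HTTP/1.1\r\n",
    b"Host: ", host, b"\r\n",
    b"Content-Length: ", length, b"\r\n",
    b"\r\n",
    body,
]

# 1. bytes.join: concatenate all parts in a single call.
msg_join = b"".join(parts)

# 2. bytearray: a mutable buffer extended in place.
buf = bytearray()
for part in parts:
    buf += part
msg_bytearray = bytes(buf)

# 3. io.BytesIO: file-like accumulation of the same parts.
stream = io.BytesIO()
for part in parts:
    stream.write(part)
msg_bytesio = stream.getvalue()
```

All three produce the same message, but none of them shows the overall shape of the message the way a single template would, which is the readability argument the new Rationale makes.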
--- a/pep-0460.txt
+++ b/pep-0460.txt
@@ -1,8 +1,8 @@
 PEP: 460
-Title: Add bytes % args and bytes.format(args) to Python 3.5
+Title: Add binary interpolation and formatting
 Version: $Revision$
 Last-Modified: $Date$
-Author: Victor Stinner <victor.stinner@gmail.com>
+Author: Victor Stinner <victor.stinner@gmail.com>, Antoine Pitrou <solipsis@pitrou.net>
 Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
@@ -13,136 +13,124 @@ Python-Version: 3.5
 Abstract
 ========
-Add ``bytes % args`` operator and ``bytes.format(args)`` method to
+This PEP proposes to add minimal formatting operations to bytes and
-Python 3.5.
+bytearray objects.  The proposed additions are:
 * ``bytes % ...`` and ``bytearray % ...`` for percent-formatting,
  similar in syntax to percent-formatting on ``str`` objects
  (accepting a single object, a tuple or a dict).
 * ``bytes.format(...)`` and ``bytearray.format(...)`` for a formatting
  similar in syntax to ``str.format()`` (accepting positional as well as
  keyword arguments).
 Rationale
 =========
-``bytes % args`` and ``bytes.format(args)`` have been removed in Python
+In Python 2, ``str % args`` and ``str.format(args)`` allow the formatting
-2. This operator and this method are requested by Mercurial and Twisted
+and interpolation of 8-bit strings.  This feature has commonly been used
-developers to ease porting their project on Python 3.
+for the assembling of protocol messages when protocols are known to use
 a fixed encoding.
-Python 3 suggests to format text first and then encode to bytes. In
+Python 3 generally mandates that text be stored and manipulated as unicode
-some cases, it does not make sense because arguments are bytes strings.
+(i.e. ``str`` objects, not ``bytes``).  In some cases, though, it makes
-Typical usage is a network protocol which is binary, since data are
+sense to manipulate ``bytes`` objects directly.  Typical usage is binary
-sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP,
+network protocols, where you may want to interpolate and assemble several
-POP, FTP are ASCII commands interspersed with binary data.
+bytes objects (some of them literals, some of them computed) to produce
 complete protocol messages.  For example, protocols such as HTTP or SIP
 have headers with ASCII names and opaque "textual" values using a varying
 and/or sometimes ill-defined encoding.  Moreover, those headers can be
 followed by a binary body... which can be chunked and decorated with ASCII
 headers and trailers!
-Using multiple ``bytes + bytes`` instructions is inefficient because it
+While there are reasonably efficient ways to accumulate binary data
-requires temporary buffers and copies which are slow and waste memory.
+(such as using a ``bytearray`` object, the ``bytes.join`` method or
-Python 3.3 optimizes ``str2 += str2`` but not ``bytes2 += bytes1``.
+even ``io.BytesIO``), none of them leads to the kind of readable and
-
+intuitive code that is produced by a %-formatted or {}-formatted template
-``bytes % args`` and ``bytes.format(args)`` have been requested since 2008, even
+and a formatting operation.
 before the first release of Python 3.0 (see issue #3982).
 ``struct.pack()`` is incomplete. For example, a number cannot be
 formatted as decimal and it does not support padding byte strings.
 Mercurial 2.8 still supports Python 2.4.
-Needed and excluded features
+Binary formatting features
-============================
+==========================
-Needed features
+Supported features
 ------------------
-* Bytes strings: bytes, bytearray and memoryview types
+In this proposal, percent-formatting for ``bytes`` and ``bytearray``
-* Format integer numbers as decimal
+supports the following features:
 * Padding with spaces and null bytes
 * "%s" should use the buffer protocol, not str()
-The feature set is minimal to keep the implementation as simple as
+* Looking up formatting arguments by position as well as by name (i.e.,
-possible to limit the cost of the implementation. ``str % args`` and
+  ``%s`` as well as ``%(name)s``).
-``str.format(args)`` are already complex and difficult to maintain, the
+* ``%s`` will try to get a ``Py_buffer`` on the given value, and fall back
-code is heavily optimized.
+  on calling ``__bytes__``.  The resulting binary data is inserted at
  the given point in the string.  This is expected to work with bytes,
  bytearray and memoryview objects (as well as a couple others such
  as pathlib's path objects).
 * ``%c`` will accept an integer between 0 and 255, and insert a byte of the
  given value.
-Excluded features:
+Braces-formatting for ``bytes`` and ``bytearray`` supports the following
 features:
-* no implicit conversion from Unicode to bytes (ex: encode to ASCII or
+* All the kinds of argument lookup supported by ``str.format()`` (explicit
-  to Latin1)
+  positional lookup, auto-incremented positional lookup, keyword lookup,
-* Locale support (``{!n}`` format for numbers). Locales are related to
+  attribute lookup, etc.)
-  text and usually to an encoding.
+* Insertion of binary data when no modifier or layout is specified
-* ``repr()``, ``ascii()``: ``%r``, ``{!r}``, ``%a`` and ``{!a}``
+  (e.g. ``{}``, ``{0}``, ``{name}``).  This has the same semantics as
-  formats. ``repr()`` and ``ascii()`` are used to debug, the output is
+  ``%s`` for percent-formatting (see above).
-  displayed in a terminal or a graphical widget. They are more related to
+* The ``c`` modifier will accept an integer between 0 and 255, and insert a
-  text.
+  byte of the given value (same as ``%c`` above).
 * Attribute access: ``{obj.attr}``
 * Indexing: ``{dict[key]}``
 * Features of struct.pack(). For example, format a number as a 32-bit
  unsigned integer in network byte order. ``struct.pack()`` can be used to
  prepare arguments; the implementation should be kept simple.
 * Features of int.to_bytes().
 * Features of ctypes.
 * New format protocol like a new ``__bformat__()`` method. Since the
  list of supported types is short, there is no need to add a new
  protocol. Other types must be explicitly cast.
 * Alternate format for integer. For example, ``'{|#x}'.format(0x123)``
  to get ``0x123``. It is more related to debugging, and the prefix can
  easily be written in the format string (ex: ``0x%x``).
 * Relation with format() and the __format__() protocol. bytes.format()
  and str.format() are unrelated.
-Unknown:
+Unsupported features
 --------------------
-* Format integer to hexadecimal? ``%x`` and ``%X``
+All other features present in formatting of ``str`` objects (either
-* Format integer to octal? ``%o``
+through the percent operator or the ``str.format()`` method) are
-* Format integer to binary? ``{!b}``
+unsupported.  Those features imply treating the recipient of the
-* Alignment?
+operator or method as text, which goes counter to the text / bytes
-* Truncating? Truncate or raise an error?
+separation (for example, accepting ``%d`` as a format code would imply
-* format keywords? ``b'{arg}'.format(arg=5)``
+that the bytes object really is an ASCII-compatible text string).
 * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
 * Floating point number?
 * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
 * Signed number? ``%+i`` and ``%-i``
-
+Amongst those unsupported features are not only most type-specific
-bytes % args
+format codes, but also the various layout specifiers such as padding
-============
+or alignment.  Besides, ``str`` objects are not acceptable as arguments
-
+to the formatting operations, even when using e.g. the ``%s`` format code.
 Formatters:
 * ``"%c"``: one byte
 * ``"%s"``: integer or bytes strings
 * ``"%20s"`` pads to 20 bytes with spaces (``b' '``)
 * ``"%020s"`` pads to 20 bytes with zeros (``b'0'``)
 * ``"%\020s"`` pads to 20 bytes with null bytes (``b'\0'``)
 bytes.format(args)
 ==================
 Formatters:
 * ``"{!c}"``: one byte
 * ``"{!s}"``: integer or bytes strings
 * ``"{!.20s}"`` pads to 20 bytes with spaces (``b' '``)
 * ``"{!.020s}"`` pads to 20 bytes with zeros (``b'0'``)
 * ``"{!\020s}"`` pads to 20 bytes with null bytes (``b'\0'``)
 Examples
 ========
 * ``b'a%sc%s' % (b'b', 4)`` gives ``b'abc4'``
 * ``b'a{}c{}'.format(b'b', 4)`` gives ``b'abc4'``
 * ``b'%c' % 88`` gives ``b'X'``
 * ``b'%%' % ()`` gives ``b'%'``
 Criticisms
 ==========
 * The development cost and maintenance cost.
-* In 3.3 encoding to ascii or latin1 is as fast as memcpy
+* In 3.3 encoding to ASCII or latin-1 is as fast as memcpy (but it still
-* Developers must work around the lack of bytes%args and
+  creates a separate object).
-  bytes.format(args) anyway to support Python 3.0-3.4
+* Developers will have to work around the lack of binary formatting anyway,
-* bytes.join() is consistently faster than format to join bytes strings.
+  if they want to support Python 3.4 and earlier.
-* Formatting functions can be implemented in a third party module
+* bytes.join() is consistently faster than format to join bytes strings
  (XXX *is it?*).
 * Formatting functions could be implemented in a third party module,
  rather than added to builtin types.
 Other proposals
 ===============
 A new datatype
 --------------
 It was proposed to create a new datatype specialized for "network
 programming".  The authors of this PEP believe this is counter-productive.
 Python 3 already has several major types dedicated to manipulation of
 binary data: ``bytes``, ``bytearray``, ``memoryview``, ``io.BytesIO``.
 Adding yet another type would make things more confusing for users, and
 interoperability between libraries more painful (also potentially
 sub-optimal, due to the necessary conversions).
 Moreover, not one type would be needed, but two: one immutable type (to
 allow for hashing), and one mutable type (as efficient accumulation is
 often necessary when working with network messages).
 References
 ==========
@@ -172,4 +160,3 @@ This document has been placed in the public domain.
   fill-column: 70
   coding: utf-8
   End:
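
The "Supported features" text added by this commit describes how ``%s`` and ``%c`` would treat their arguments. The sketch below approximates those rules in pure Python so they can be tried out; it is an illustration under stated assumptions, not the proposed implementation. The helper names ``to_binary`` and ``byte_from_int`` and the ``Token`` class are hypothetical, and the buffer lookup is approximated with ``memoryview``.

```python
def to_binary(value):
    """Approximate the proposed ``%s`` rule (hypothetical helper).

    Try the buffer protocol first, then fall back on ``__bytes__``.
    ``str`` arguments are rejected, as the proposal requires.
    """
    if isinstance(value, str):
        raise TypeError("str is not accepted; encode it explicitly first")
    try:
        # memoryview() succeeds only for objects exposing a buffer.
        return bytes(memoryview(value))
    except TypeError:
        pass
    to_bytes = getattr(type(value), "__bytes__", None)
    if to_bytes is None:
        raise TypeError(
            "%r supports neither the buffer protocol nor __bytes__"
            % type(value).__name__
        )
    return to_bytes(value)


def byte_from_int(value):
    """Approximate the proposed ``%c`` rule (hypothetical helper)."""
    if not 0 <= value <= 255:
        raise ValueError("%c requires an integer between 0 and 255")
    return bytes([value])


class Token:
    """Hypothetical type that is convertible only through ``__bytes__``."""

    def __init__(self, raw):
        self._raw = raw

    def __bytes__(self):
        return self._raw


# Assemble a message by hand, in the way a template such as
# b'%s: %s%c' would under the rules sketched above.
line = b"".join([
    to_binary(b"X-Id"),
    b": ",
    to_binary(Token(b"abc")),
    byte_from_int(10),
])
```

Rejecting ``str`` up front mirrors the statement in "Unsupported features" that ``str`` objects are not acceptable arguments even with the ``%s`` format code.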