2007-03-04 01:05:29 -05:00
|
|
|
PEP: 3112
|
2007-02-24 00:42:52 -05:00
|
|
|
Title: Bytes literals in Python 3000
|
|
|
|
Version: $Revision$
|
|
|
|
Last-Modified: $Date$
|
|
|
|
Author: Jason Orendorff <jason.orendorff@gmail.com>
|
|
|
|
Status: Accepted
|
2007-03-04 01:35:06 -05:00
|
|
|
Type: Standards Track
|
2007-02-24 00:42:52 -05:00
|
|
|
Content-Type: text/x-rst
|
|
|
|
Requires: 358
|
|
|
|
Created: 23-Feb-2007
|
|
|
|
Python-Version: 3.0
|
|
|
|
Post-History: 23-Feb-2007
|
|
|
|
|
|
|
|
|
|
|
|
Abstract
|
|
|
|
========
|
|
|
|
|
|
|
|
This PEP proposes a literal syntax for the ``bytes`` objects
|
|
|
|
introduced in PEP 358. The purpose is to provide a convenient way to
|
|
|
|
spell ASCII strings and arbitrary binary data.
|
|
|
|
|
|
|
|
|
|
|
|
Motivation
|
|
|
|
==========
|
|
|
|
|
|
|
|
Existing spellings of an ASCII string in Python 3000 include:
|
|
|
|
|
|
|
|
bytes('Hello world', 'ascii')
|
|
|
|
'Hello world'.encode('ascii')
|
|
|
|
|
|
|
|
The proposed syntax is:
|
|
|
|
|
|
|
|
b'Hello world'
|
|
|
|
|
|
|
|
Existing spellings of an 8-bit binary sequence in Python 3000 include:
|
|
|
|
|
|
|
|
bytes([0x7f, 0x45, 0x4c, 0x46, 0x01, 0x01, 0x01, 0x00])
|
|
|
|
bytes('\x7fELF\x01\x01\x01\0', 'latin-1')
|
|
|
|
'7f454c4601010100'.decode('hex')
|
|
|
|
|
|
|
|
The proposed syntax is:
|
|
|
|
|
|
|
|
b'\x7f\x45\x4c\x46\x01\x01\x01\x00'
|
|
|
|
b'\x7fELF\x01\x01\x01\0'
|
|
|
|
|
|
|
|
In both cases, the advantages of the new syntax are brevity, some
|
|
|
|
small efficiency gain, and the detection of encoding errors at compile
|
|
|
|
time rather than at runtime. The brevity benefit is especially felt
|
|
|
|
when using the string-like methods of bytes objects:
|
|
|
|
|
|
|
|
lines = bdata.split(bytes('\n', 'ascii')) # existing syntax
|
|
|
|
lines = bdata.split(b'\n') # proposed syntax
|
|
|
|
|
|
|
|
And when converting code from Python 2.x to Python 3000:
|
|
|
|
|
|
|
|
sok.send('EXIT\r\n') # Python 2.x
|
|
|
|
sok.send('EXIT\r\n'.encode('ascii')) # Python 3000 existing
|
|
|
|
sok.send(b'EXIT\r\n') # proposed
|
|
|
|
|
|
|
|
|
|
|
|
Grammar Changes
|
|
|
|
===============
|
|
|
|
|
|
|
|
The proposed syntax is an extension of the existing string
|
|
|
|
syntax. [#stringliterals]_
|
|
|
|
|
2007-03-04 01:05:29 -05:00
|
|
|
The new syntax for strings, including the new bytes literal, is::
|
2007-02-24 00:42:52 -05:00
|
|
|
|
|
|
|
stringliteral: [stringprefix] (shortstring | longstring)
|
|
|
|
stringprefix: "b" | "r" | "br" | "B" | "R" | "BR" | "Br" | "bR"
|
|
|
|
shortstring: "'" shortstringitem* "'" | '"' shortstringitem* '"'
|
|
|
|
longstring: "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
|
|
|
|
shortstringitem: shortstringchar | escapeseq
|
|
|
|
longstringitem: longstringchar | escapeseq
|
|
|
|
shortstringchar:
|
|
|
|
<any source character except "\" or newline or the quote>
|
|
|
|
longstringchar: <any source character except "\">
|
|
|
|
escapeseq: "\" NL
|
|
|
|
| "\\" | "\'" | '\"'
|
|
|
|
| "\a" | "\b" | "\f" | "\n" | "\r" | "\t" | "\v"
|
|
|
|
| "\ooo" | "\xhh"
|
|
|
|
| "\uxxxx" | "\Uxxxxxxxx" | "\N{name}"
|
|
|
|
|
|
|
|
The following additional restrictions apply only to bytes literals
|
|
|
|
(``stringliteral`` tokens with ``b`` or ``B`` in the
|
|
|
|
``stringprefix``):
|
|
|
|
|
|
|
|
- Each ``shortstringchar`` or ``longstringchar`` must be a character
|
|
|
|
between 1 and 127 inclusive, regardless of any encoding
|
|
|
|
declaration [#encodings]_ in the source file.
|
|
|
|
|
|
|
|
- The Unicode-specific escape sequences ``\u``*xxxx*,
|
|
|
|
``\U``*xxxxxxxx*, and ``\N{``*name*``}`` are unrecognized in
|
|
|
|
Python 2.x and forbidden in Python 3000.
|
|
|
|
|
|
|
|
Adjacent bytes literals are subject to the same concatenation rules as
|
|
|
|
adjacent string literals. [#concat]_ A bytes literal adjacent to a
|
|
|
|
string literal is an error.
|
|
|
|
|
|
|
|
|
|
|
|
Semantics
|
|
|
|
=========
|
|
|
|
|
|
|
|
Each evaluation of a bytes literal produces a new ``bytes`` object.
|
|
|
|
The bytes in the new object are the bytes represented by the
|
2007-03-04 01:05:29 -05:00
|
|
|
``shortstringitem`` or ``longstringitem`` parts of the literal, in the
|
2007-02-24 00:42:52 -05:00
|
|
|
same order.
|
|
|
|
|
|
|
|
|
|
|
|
Rationale
|
|
|
|
=========
|
|
|
|
|
|
|
|
The proposed syntax provides a cleaner migration path from Python 2.x
|
|
|
|
to Python 3000 for most code involving 8-bit strings. Preserving the
|
|
|
|
old 8-bit meaning of a string literal is usually as simple as adding a
|
|
|
|
``b`` prefix. The one exception is Python 2.x strings containing
|
|
|
|
bytes >127, which must be rewritten using escape sequences.
|
|
|
|
Transcoding a source file from one encoding to another, and fixing up
|
|
|
|
the encoding declaration, should preserve the meaning of the program.
|
|
|
|
Python 2.x non-Unicode strings violate this principle; Python 3000
|
|
|
|
bytes literals shouldn't.
|
|
|
|
|
|
|
|
A string literal with a ``b`` in the prefix is always a syntax error
|
|
|
|
in Python 2.5, so this syntax can be introduced in Python 2.6, along
|
|
|
|
with the ``bytes`` type.
|
|
|
|
|
|
|
|
A bytes literal produces a new object each time it is evaluated, like
|
|
|
|
list displays and unlike string literals. This is necessary because
|
|
|
|
bytes literals, like lists and unlike strings, are
|
|
|
|
mutable. [#eachnew]_
|
|
|
|
|
|
|
|
|
|
|
|
Reference Implementation
|
|
|
|
========================
|
|
|
|
|
|
|
|
Thomas Wouters has checked an implementation into the Py3K branch,
|
|
|
|
r53872.
|
|
|
|
|
|
|
|
|
|
|
|
References
|
|
|
|
==========
|
|
|
|
|
|
|
|
.. [#stringliterals]
|
|
|
|
http://www.python.org/doc/current/ref/strings.html
|
|
|
|
|
|
|
|
.. [#encodings]
|
|
|
|
http://www.python.org/doc/current/ref/encodings.html
|
|
|
|
|
|
|
|
.. [#concat]
|
|
|
|
http://www.python.org/doc/current/ref/string-catenation.html
|
|
|
|
|
|
|
|
.. [#eachnew]
|
|
|
|
http://mail.python.org/pipermail/python-3000/2007-February/005779.html
|
|
|
|
|
|
|
|
|
|
|
|
Copyright
|
|
|
|
=========
|
|
|
|
|
|
|
|
This document has been placed in the public domain.
|