Some typo fixes in PEP 3138; also add variables footer.

Georg Brandl 2012-09-30 08:55:27 +02:00
parent caf06e1290
commit 9c83bacdff
1 changed file with 146 additions and 127 deletions

@ -13,11 +13,11 @@ Post-History: 05-May-2008, 05-Jun-2008
Abstract
========
This PEP proposes a new string representation form for Python 3000. In
Python prior to Python 3000, the repr() built-in function converted
arbitrary objects to printable ASCII strings for debugging and logging.
For Python 3000, a wider range of characters, based on the Unicode
standard, should be considered 'printable'.
This PEP proposes a new string representation form for Python 3000.
In Python prior to Python 3000, the repr() built-in function converted
arbitrary objects to printable ASCII strings for debugging and
logging. For Python 3000, a wider range of characters, based on the
Unicode standard, should be considered 'printable'.
Motivation
@ -28,8 +28,8 @@ algorithm.
- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
characters(>=0x80) to '\\xXX'.
- Convert other non-printable characters(0x00-0x1f, 0x7f) and
non-ASCII characters (>= 0x80) to '\\xXX'.
- Backslash-escape quote characters (apostrophe, ') and add the quote
character at the beginning and the end.
@ -39,177 +39,184 @@ For Unicode strings, the following additional conversions are done.
- Convert leading surrogate pair characters without trailing character
(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
- Convert 16-bit characters(>=0x100) to '\\uXXXX'.
- Convert 16-bit characters (>= 0x100) to '\\uXXXX'.
- Convert 21-bit characters(>=0x10000) and surrogate pair characters to
'\\U00xxxxxx'.
- Convert 21-bit characters (>= 0x10000) and surrogate pair characters
to '\\U00xxxxxx'.
This algorithm converts any string to printable ASCII, and repr() is
used as a handy and safe way to print strings for debugging or for
logging. Although all non-ASCII characters are escaped, this does not
matter when most of the string's characters are ASCII. But for other
logging. Although all non-ASCII characters are escaped, this does not
matter when most of the string's characters are ASCII. But for other
languages, such as Japanese where most characters in a string are not
ASCII, this is very inconvenient.
We can use ``print(aJapaneseString)`` to get a readable string, but we
don't have a similar workaround for printing strings from collections
such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr() to
build the string to be printed, so the resulting strings are always
hex-escaped. Or when ``open(japaneseFilename)`` raises an exception, the
error message is something like ``IOError: [Errno 2] No such file or
directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr()
to build the string to be printed, so the resulting strings are always
hex-escaped. Or when ``open(japaneseFilename)`` raises an exception,
the error message is something like ``IOError: [Errno 2] No such file
or directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
Python 3000 has a lot of nice features for non-Latin users such as
non-ASCII identifiers, so it would be helpful if Python could also progress
in a similar way for printable output.
non-ASCII identifiers, so it would be helpful if Python could also
progress in a similar way for printable output.
Some users might be concerned that such output will mess up their
console if they print binary data like images. But this is unlikely to
happen in practice because bytes and strings are different types in
console if they print binary data like images. But this is unlikely
to happen in practice because bytes and strings are different types in
Python 3000, so printing an image to the console won't mess it up.
This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected.
This issue was once discussed by Hye-Shik Chang [1]_, but was rejected.
Specification
=============
- Add a new function to the Python C API ``int Py_UNICODE_ISPRINTABLE
(Py_UNICODE ch)``. This function returns 0 if repr() should escape the
Unicode character ``ch``; otherwise it returns 1. Characters that should
be escaped are defined in the Unicode character database as:
(Py_UNICODE ch)``. This function returns 0 if repr() should escape
the Unicode character ``ch``; otherwise it returns 1. Characters
that should be escaped are defined in the Unicode character database
as:
* Cc (Other, Control)
* Cf (Other, Format)
* Cs (Other, Surrogate)
* Co (Other, Private Use)
* Cn (Other, Not Assigned)
* Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
* Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029').
* Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
this category should be escaped to avoid ambiguity.
* Cc (Other, Control)
* Cf (Other, Format)
* Cs (Other, Surrogate)
* Co (Other, Private Use)
* Cn (Other, Not Assigned)
* Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
* Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR
('\\u2029').
* Zs (Separator, Space) other than ASCII space ('\\x20'). Characters
in this category should be escaped to avoid ambiguity.
- The algorithm to build repr() strings should be changed to:
* Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
* Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
* Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.
* Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to
'\\xXX'.
* Convert leading surrogate pair characters without trailing character
(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
* Convert leading surrogate pair characters without trailing
character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to
'\\uXXXX'.
* Convert non-printable characters(Py_UNICODE_ISPRINTABLE() returns 0)
to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
* Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns
0) to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
* Backslash-escape quote characters (apostrophe, 0x27) and add quote
character at the beginning and the end.
* Backslash-escape quote characters (apostrophe, 0x27) and add a
quote character at the beginning and the end.
- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
default.
- Set the Unicode error-handler for sys.stderr to 'backslashreplace'
by default.
- Add a new function to the Python C API ``PyObject *PyObject_ASCII
(PyObject *o)``. This function converts any python object to a string
using PyObject_Repr() and then hex-escapes all non-ASCII characters.
``PyObject_ASCII()`` generates the same string as ``PyObject_Repr()``
in Python 2.
(PyObject *o)``. This function converts any python object to a
string using PyObject_Repr() and then hex-escapes all non-ASCII
characters. ``PyObject_ASCII()`` generates the same string as
``PyObject_Repr()`` in Python 2.
- Add a new built-in function, ``ascii()``. This function converts any
python object to a string using repr() and then hex-escapes all non-ASCII
characters. ``ascii()`` generates the same string as ``repr()`` in
Python 2.
- Add a new built-in function, ``ascii()``. This function converts
any python object to a string using repr() and then hex-escapes all
non-ASCII characters. ``ascii()`` generates the same string as
``repr()`` in Python 2.
- Add ``'%a'`` string format operator. ``'%a'`` converts any python
- Add a ``'%a'`` string format operator. ``'%a'`` converts any python
object to a string using repr() and then hex-escapes all non-ASCII
characters. The ``'%a'`` format operator generates the same string as
``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the
characters. The ``'%a'`` format operator generates the same string
as ``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the
``string.format()`` method and add ``'%A'`` operator to the
PyUnicode_FromFormat(). They converts any object to an ASCII string
PyUnicode_FromFormat(). They convert any object to an ASCII string
as ``'%a'`` string format operator.
- Add an ``isprintable()`` method to the string type. ``str.isprintable()``
returns False if repr() should escape any character in the string;
otherwise returns True. The ``isprintable()`` method calls the
``Py_UNICODE_ISPRINTABLE()`` function internally.
- Add an ``isprintable()`` method to the string type.
``str.isprintable()`` returns False if repr() would escape any
character in the string; otherwise returns True. The
``isprintable()`` method calls the ``Py_UNICODE_ISPRINTABLE()``
function internally.
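
The rules in this section can be tried out from Python code.  The
following is a minimal sketch, not the C implementation: the helper
name ``is_printable_sketch`` is hypothetical, and it approximates
``Py_UNICODE_ISPRINTABLE()`` by looking up the general category with
``unicodedata.category()``.  It assumes a Python 3 interpreter in which
``ascii()``, ``'%a'``, ``'!a'`` and ``str.isprintable()`` are
available.

```python
import unicodedata


def is_printable_sketch(ch):
    """Approximate the printability rule for a single character.

    Hypothetical helper for illustration only; the real check is done
    in C by Py_UNICODE_ISPRINTABLE().
    """
    if ch == " ":
        # ASCII space is the one Zs character that is not escaped.
        return True
    # Cc, Cf, Cs, Co, Cn, Zl, Zp and Zs all start with "C" or "Z".
    return unicodedata.category(ch)[0] not in ("C", "Z")


# A mix of printable and non-printable characters.
samples = ["a", " ", "\n", "\x7f", "\u00a0", "\u2028", "\u3042", "\ud800"]
for ch in samples:
    # ascii() keeps the output readable on any console encoding.
    print(ascii(ch), is_printable_sketch(ch), ch.isprintable())

japanese = "\u65e5\u672c\u8a9e"

# repr() leaves printable non-ASCII characters untouched ...
repr_keeps_text = repr(japanese) == "'" + japanese + "'"
# ... while ascii(), '%a' and '!a' hex-escape them.
escaped_forms = [ascii(japanese), "%a" % japanese, "{!a}".format(japanese)]

print(repr_keeps_text)
print(escaped_forms)
```

For the sample characters, the sketch and ``str.isprintable()`` are
expected to agree, since both follow the category list given above.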
Rationale
=========
The repr() in Python 3000 should be Unicode not ASCII based, just like
Python 3000 strings. Also, conversion should not be affected by the
locale setting, because the locale is not necessarily the same as the
output device's locale. For example, it is common for a daemon process
to be invoked in an ASCII setting, but writes UTF-8 to its log files.
Also, web applications might want to report the error information in
more readable form based on the HTML page's encoding.
The repr() in Python 3000 should be Unicode, not ASCII based, just
like Python 3000 strings. Also, conversion should not be affected by
the locale setting, because the locale is not necessarily the same as
the output device's locale. For example, it is common for a daemon
process to be invoked in an ASCII setting, but writes UTF-8 to its log
files. Also, web applications might want to report the error
information in more readable form based on the HTML page's encoding.
Characters not supported by the user's console could be hex-escaped on
printing, by the Unicode encoder's error-handler. If the error-handler
of the output file is 'backslashreplace', such characters are
hex-escaped without raising UnicodeEncodeError. For example, if your default
encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'. If
your encoding is ISO-8859-1, 'Hello ¢' will be printed.
printing, by the Unicode encoder's error-handler. If the
error-handler of the output file is 'backslashreplace', such
characters are hex-escaped without raising UnicodeEncodeError. For
example, if the default encoding is ASCII, ``print('Hello ¢')`` will
print 'Hello \\xa2'. If the encoding is ISO-8859-1, 'Hello ¢' will be
printed.
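
The effect of the 'backslashreplace' error-handler described above can
be reproduced without a console by encoding the string directly.  This
is only an illustrative sketch; the variable names are made up for the
example.

```python
text = "Hello \u00a2"  # 'Hello' followed by a CENT SIGN

# ASCII cannot represent U+00A2, so the handler substitutes an escape.
ascii_bytes = text.encode("ascii", errors="backslashreplace")

# ISO-8859-1 can represent U+00A2, so nothing needs to be escaped.
latin1_bytes = text.encode("iso-8859-1", errors="backslashreplace")

print(ascii_bytes)   # b'Hello \\xa2'  (backslash, 'x', 'a', '2')
print(latin1_bytes)  # b'Hello \xa2'   (the single byte 0xA2)
```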
The default error-handler for sys.stdout is 'strict'. Other applications
reading the output might not understand hex-escaped characters, so
unsupported characters should be trapped when writing. If you need to
escape unsupported characters, you should explicitly change the
error-handler. Unlike sys.stdout, sys.stderr doesn't raise
The default error-handler for sys.stdout is 'strict'. Other
applications reading the output might not understand hex-escaped
characters, so unsupported characters should be trapped when writing.
If unsupported characters must be escaped, the error-handler should be
changed explicitly. Unlike sys.stdout, sys.stderr doesn't raise
UnicodeEncodeError by default, because the default error-handler is
'backslashreplace'. So printing error messeges containing non-ASCII
characters to sys.stderr will not raise an exception. Also, information
about uncaught exceptions (exception object, traceback) are printed by
the interpreter without raising exceptions.
'backslashreplace'. So printing error messages containing non-ASCII
characters to sys.stderr will not raise an exception. Also,
information about uncaught exceptions (exception object, traceback) is
printed by the interpreter without raising exceptions.
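
The difference between the two error-handlers can be sketched with
in-memory streams standing in for sys.stdout and sys.stderr.  The
helper ``write_with`` is a hypothetical name, and the ASCII encoding is
an assumption chosen to force the unsupported-character case.

```python
import io


def write_with(errors):
    """Write non-ASCII text to an ASCII stream and return the bytes.

    Hypothetical helper: the in-memory stream stands in for a console
    or log file whose encoding cannot represent the text.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write("caf\u00e9")
    stream.flush()
    return raw.getvalue()


# 'strict' (the sys.stdout default) traps the unsupported character.
try:
    write_with("strict")
    strict_result = "no error"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# 'backslashreplace' (the sys.stderr default) escapes it instead.
escaped = write_with("backslashreplace")

print(strict_result)
print(escaped)
```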
Alternate Solutions
-------------------
To help debugging in non-Latin languages without changing repr(), other
suggestions were made.
To help debugging in non-Latin languages without changing repr(),
other suggestions were made.
- Supply a tool to print lists or dicts.
Strings to be printed for debugging are not only contained by lists or
dicts, but also in many other types of object. File objects contain a
file name in Unicode, exception objects contain a message in Unicode,
etc. These strings should be printed in readable form when repr()ed.
It is unlikely to be possible to implement a tool to print all
possible object types.
Strings to be printed for debugging are not only contained by lists
or dicts, but also in many other types of object. File objects
contain a file name in Unicode, exception objects contain a message
in Unicode, etc. These strings should be printed in readable form
when repr()ed. It is unlikely to be possible to implement a tool to
print all possible object types.
- Use sys.displayhook and sys.excepthook.
For interactive sessions, we can write hooks to restore hex escaped
characters to the original characters. But these hooks are called only
when printing the result of evaluating an expression entered in an
interactive Python session, and doesn't work for the ``print()`` function,
for non-interactive sessions or for ``logging.debug("%r", ...)``, etc.
characters to the original characters. But these hooks are called
only when printing the result of evaluating an expression entered in
an interactive Python session, and don't work for the ``print()``
function, for non-interactive sessions or for ``logging.debug("%r",
...)``, etc.
- Subclass sys.stdout and sys.stderr.
It is difficult to implement a subclass to restore hex-escaped
characters since there isn't enough information left by the time it's
a string to undo the escaping correctly in all cases. For example,
``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
there is no chance to tell file objects apart.
characters since there isn't enough information left by the time
it's a string to undo the escaping correctly in all cases. For
example, ``print("\\"+"u0041")`` should be printed as '\\u0041', not
'A'. But there is no chance to tell file objects apart.
- Make the encoding used by unicode_repr() adjustable, and make the
existing repr() the default.
With adjustable repr(), the result of using repr() is unpredictable
and would make it impossible to write correct code involving repr().
And if current repr() is the default, then the old convention remains
intact and users may expect ASCII strings as the result of repr().
Third party applications or libraries could be confused when a custom
repr() function is used.
And if current repr() is the default, then the old convention
remains intact and users may expect ASCII strings as the result of
repr(). Third party applications or libraries could be confused
when a custom repr() function is used.
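
The ambiguity mentioned for the "subclass sys.stdout and sys.stderr"
suggestion can be shown in a few lines.  This sketch uses the
'backslashreplace' error-handler as a stand-in for hex-escaping; the
variable names are illustrative only.

```python
# Six characters: a backslash followed by "u3042".
literal = "\\" + "u3042"
# One character: HIRAGANA LETTER A.
real = "\u3042"

out_literal = literal.encode("ascii", "backslashreplace")
out_real = real.encode("ascii", "backslashreplace")

# Two different strings produce identical escaped output, so a stream
# wrapper that only sees the output cannot know which one to restore.
print(literal != real)
print(out_literal == out_real)
```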
Backwards Compatibility
=======================
Changing repr() may break some existing code, especially testing code.
Five of Python's regression tests fail with this modification. If you
need repr() strings without non-ASCII character as Python 2, you can use
the following function. ::
Five of Python's regression tests fail with this modification. If you
need repr() strings without non-ASCII character as Python 2, you can
use the following function. ::
def repr_ascii(obj):
return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
@ -221,25 +228,25 @@ UnicodeEncodeError. ::
log.write(repr(data)) # UnicodeEncodeError will be raised
# if data contains unsupported characters.
To avoid exceptions being raised, you can explicitly specify the error-
handler. ::
To avoid exceptions being raised, you can explicitly specify the
error-handler. ::
log = open("logfile", "w", errors="backslashreplace")
log.write(repr(data)) # Unsupported characters will be escaped.
For a console that uses a Unicode-based encoding, for example, en_US.
utf8 or de_DE.utf8, the backslashescape trick doesn't work and all
printable characters are not escaped. This will cause a problem of
similarly drawing characters in Western, Greek and Cyrillic languages.
These languages use similar (but different) alphabets (descended from a
common ancestor) and contain letters that look similar but have
different character codes. For example, it is hard to distinguish Latin
'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual
representation, of course, very much depends on the fonts used but
usually these letters are almost indistinguishable.) To avoid the
problem, the user can adjust the terminal encoding to get a result
suitable for their environment.
For a console that uses a Unicode-based encoding, for example,
en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and
all printable characters are not escaped. This will cause a problem
of similarly drawing characters in Western, Greek and Cyrillic
languages. These languages use similar (but different) alphabets
(descended from a common ancestor) and contain letters that look
similar but have different character codes. For example, it is hard
to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'.
(The visual representation, of course, very much depends on the fonts
used but usually these letters are almost indistinguishable.) To
avoid the problem, the user can adjust the terminal encoding to get a
result suitable for their environment.
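
When look-alike letters are a concern, the strings can still be told
apart programmatically.  The following sketch assumes a Python 3
interpreter with ``ascii()`` and the standard ``unicodedata`` module.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"  # Cyrillic look-alikes of a, e, o

# The strings render almost identically but are not equal.
same = latin == cyrillic

# ascii() exposes the difference regardless of the font in use.
latin_ascii = ascii(latin)
cyrillic_ascii = ascii(cyrillic)

print(same)
print(latin_ascii, cyrillic_ascii)
for ch in cyrillic:
    print(ascii(ch), unicodedata.name(ch))
```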
Rejected Proposals
@ -252,20 +259,21 @@ Rejected Proposals
idea. [2]_
- Use character names to escape characters, instead of hex character
codes. For example, ``repr('\u03b1')`` can be converted to ``"\N{GREEK
SMALL LETTER ALPHA}"``.
codes. For example, ``repr('\u03b1')`` can be converted to
``"\N{GREEK SMALL LETTER ALPHA}"``.
Using character names can be very verbose compared to hex-escape.
e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR
KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.
Using character names can be very verbose compared to hex-escape.
e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE
UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED
FORM}"``.
- Default error-handler of sys.stdout should be 'backslashreplace'.
Stuff written to stdout might be consumed by another program that
might misinterpret the \ escapes. For interactive session, it is
possible to make 'backslashreplace' error-handler to default, but may
add confusion of the kind "it works in interactive mode but not when
redirecting to a file".
might misinterpret the \\ escapes. For interactive sessions, it is
possible to make the 'backslashreplace' error-handler the default,
but this may add confusion of the kind "it works in interactive mode
but not when redirecting to a file".
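
The verbosity of name-based escapes, raised for the character-name
proposal above, can be measured with a short sketch.  It uses the
'namereplace' error-handler, which was added to the standard library
in a later Python release and is not part of this proposal; it is used
here only as a convenient way to produce ``\N{...}`` escapes.

```python
for ch in ("\u03b1", "\ufbf9"):
    # Hex escape, as produced by 'backslashreplace'.
    hex_form = ch.encode("ascii", "backslashreplace").decode("ascii")
    # Name escape, as produced by 'namereplace' (assumes Python 3.5+).
    name_form = ch.encode("ascii", "namereplace").decode("ascii")
    print(len(hex_form), len(name_form), name_form)

alpha_named = "\u03b1".encode("ascii", "namereplace")
```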
Implementation
@ -288,3 +296,14 @@ Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: