New cleaned-up version from Atsuo.

Guido van Rossum 2008-06-05 17:28:44 +00:00
parent c9332b34c2
commit 2840a2d09b
1 changed files with 65 additions and 61 deletions

@ -7,7 +7,7 @@ Status: Accepted
Type: Standards Track Type: Standards Track
Content-Type: text/x-rst Content-Type: text/x-rst
Created: 05-May-2008 Created: 05-May-2008
Post-History: Post-History: 05-May-2008, 05-Jun-2008
Abstract Abstract
@ -60,8 +60,8 @@ error message is something like ``IOError: [Errno 2] No such file or
directory: '\u65e5\u672c\u8a9e'``, which isn't helpful. directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
Python 3000 has a lot of nice features for non-Latin users such as Python 3000 has a lot of nice features for non-Latin users such as
non-ASCII identifiers, so it would be helpful if Python could also non-ASCII identifiers, so it would be helpful if Python could also progress
progress in a similar way for printable output. in a similar way for printable output.
Some users might be concerned that such output will mess up their Some users might be concerned that such output will mess up their
console if they print binary data like images. But this is unlikely to console if they print binary data like images. But this is unlikely to
@ -79,48 +79,57 @@ Specification
Unicode character ``ch``; otherwise it returns 1. Characters that should Unicode character ``ch``; otherwise it returns 1. Characters that should
be escaped are defined in the Unicode character database as: be escaped are defined in the Unicode character database as:
* Cc (Other, Control) * Cc (Other, Control)
* Cf (Other, Format) * Cf (Other, Format)
* Cs (Other, Surrogate) * Cs (Other, Surrogate)
* Co (Other, Private Use) * Co (Other, Private Use)
* Cn (Other, Not Assigned) * Cn (Other, Not Assigned)
* Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028'). * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
* Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029'). * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029').
* Zs (Separator, Space) other than ASCII space('\\x20'). Characters in * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
this category should be escaped to avoid ambiguity. this category should be escaped to avoid ambiguity.
- The algorithm to build repr() strings should be changed to: - The algorithm to build repr() strings should be changed to:
* Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'. * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
* Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'. * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.
* Convert leading surrogate pair characters without trailing character * Convert leading surrogate pair characters without trailing character
(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'. (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
* Convert non-printable characters(Py_UNICODE_ISPRINTABLE() returns 0) * Convert non-printable characters(Py_UNICODE_ISPRINTABLE() returns 0)
to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'. to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
* Backslash-escape quote characters (apostrophe, 0x27) and add quote * Backslash-escape quote characters (apostrophe, 0x27) and add quote
character at the beginning and the end. character at the beginning and the end.
- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by - Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
default. default.
- Add a new function to the Python C API ``PyObject *PyObject_ASCII
(PyObject *o)``. This function converts any python object to a string
using PyObject_Repr() and then hex-escapes all non-ASCII characters.
``PyObject_ASCII()`` generates the same string as ``PyObject_Repr()``
in Python 2.
- Add a new built-in function, ``ascii()``. This function converts any
python object to a string using repr() and then hex-escapes all non-ASCII
characters. ``ascii()`` generates the same string as ``repr()`` in
Python 2.
- Add ``'%a'`` string format operator. ``'%a'`` converts any python - Add ``'%a'`` string format operator. ``'%a'`` converts any python
object to a string using repr() and then hex-escapes all non-ASCII object to a string using repr() and then hex-escapes all non-ASCII
characters. The ``'%a'`` format operator generates the same string as characters. The ``'%a'`` format operator generates the same string as
``'%r'`` in Python 2. ``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the
``string.format()`` method and add ``'%A'`` operator to the
- Add a new built-in function, ``ascii()``. This function converts any PyUnicode_FromFormat(). They convert any object to an ASCII string
python object to a string using repr() and then hex-escapes all non- in the same way as the ``'%a'`` string format operator.
ASCII characters. ``ascii()`` generates the same string as ``repr()``
in Python 2.
- Add an ``isprintable()`` method to the string type. ``str.isprintable()`` - Add an ``isprintable()`` method to the string type. ``str.isprintable()``
returns False if repr() should escape any character in the string; returns False if repr() should escape any character in the string;
otherwise returns True. The ``isprintable()`` method calls the otherwise returns True. The ``isprintable()`` method calls the
`` Py_UNICODE_ISPRINTABLE()`` function internally. ``Py_UNICODE_ISPRINTABLE()`` function internally.
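The escaping rule and the new conversions listed above can be sketched in Python 3 as follows. This is only an illustration: ``should_escape()`` and ``ESCAPED_CATEGORIES`` are hypothetical names that do not appear in the PEP, and the helper approximates ``Py_UNICODE_ISPRINTABLE()`` with ``unicodedata.category()``.

```python
import unicodedata

# Unicode general categories that the specification says should be escaped.
ESCAPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}

def should_escape(ch):
    """Hypothetical helper: True if ``ch`` counts as non-printable."""
    if ch == " ":  # ASCII space is the one Zs character that is kept as-is
        return False
    return unicodedata.category(ch) in ESCAPED_CATEGORIES

text = "\u65e5\u672c\u8a9e"           # three CJK ideographs

new_repr = repr(text)                 # printable non-ASCII characters are kept
ascii_repr = ascii(text)              # every non-ASCII character is hex-escaped
percent_a = "%a" % text               # same result as ascii()
bang_a = "{!a}".format(text)          # same result as ascii()

# str.isprintable() reports whether repr() would leave the string unescaped.
ideographs_printable = text.isprintable()
wide_space_printable = "\u3000".isprintable()   # IDEOGRAPHIC SPACE, category Zs

print(ascii_repr)
```

Here ``ascii()``, ``'%a'`` and ``'!a'`` all produce the same ASCII-only string, while ``repr()`` keeps the ideographs intact.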
Rationale Rationale
@ -136,18 +145,21 @@ more readable form based on the HTML page's encoding.
Characters not supported by the user's console could be hex-escaped on Characters not supported by the user's console could be hex-escaped on
printing, by the Unicode encoder's error-handler. If the error-handler printing, by the Unicode encoder's error-handler. If the error-handler
of the output file is 'backslashreplace', such characters are hex- of the output file is 'backslashreplace', such characters are
escaped without raising UnicodeEncodeError. For example, if your default hex-escaped without raising UnicodeEncodeError. For example, if your default
encoding is ASCII, ``print('Hello ¢')`` will prints 'Hello \\xa2'. If encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'. If
your encoding is ISO-8859-1, 'Hello ¢' will be printed. your encoding is ISO-8859-1, 'Hello ¢' will be printed.
Default error-handler of sys.stdout is 'strict'. Other applications The default error-handler for sys.stdout is 'strict'. Other applications
reading the output might not understand hex-escaped characters, so reading the output might not understand hex-escaped characters, so
unsupported characters should be trapped when writing. If you need to unsupported characters should be trapped when writing. If you need to
escape unsupported characters, you should change error-handler escape unsupported characters, you should explicitly change the
explicitly. For sys.stderr, default error-handler is set to error-handler. Unlike sys.stdout, sys.stderr doesn't raise
'backslashreplace' and printing exceptions or error messages won't UnicodeEncodeError by default, because the default error-handler is
be failed. 'backslashreplace'. So printing error messages containing non-ASCII
characters to sys.stderr will not raise an exception. Also, information
about uncaught exceptions (exception object, traceback) is printed by
the interpreter without raising exceptions.
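The difference between the two error-handlers can be sketched with plain ``str.encode()`` calls and an in-memory stream. The in-memory ``TextIOWrapper`` is only a stand-in for a console with an ASCII encoding; the variable names are illustrative and not part of the PEP.

```python
import io

text = "Hello \u00a2"   # 'Hello' followed by CENT SIGN

# 'strict' (the default for sys.stdout) raises on unsupported characters.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' (the default for sys.stderr) hex-escapes them instead.
escaped = text.encode("ascii", "backslashreplace")

# The same handler attached to a text stream, standing in for a console.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii", errors="backslashreplace")
stream.write(text)
stream.flush()

# ISO-8859-1 can represent the character, so nothing needs escaping.
latin1 = text.encode("iso-8859-1")

print(strict_failed, escaped, raw.getvalue(), latin1)
```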
Alternate Solutions Alternate Solutions
------------------- -------------------
@ -169,15 +181,15 @@ suggestions were made.
For interactive sessions, we can write hooks to restore hex escaped For interactive sessions, we can write hooks to restore hex escaped
characters to the original characters. But these hooks are called only characters to the original characters. But these hooks are called only
when printing the result of evaluating an expression entered in an when printing the result of evaluating an expression entered in an
interactive Python session, and don't work for the print() function, interactive Python session, and don't work for the ``print()`` function,
for non-interactive sessions or for logging.debug("%r", ...), etc. for non-interactive sessions or for ``logging.debug("%r", ...)``, etc.
- Subclass sys.stdout and sys.stderr. - Subclass sys.stdout and sys.stderr.
It is difficult to implement a subclass to restore hex-escaped It is difficult to implement a subclass to restore hex-escaped
characters since there isn't enough information left by the time it's characters since there isn't enough information left by the time it's
a string to undo the escaping correctly in all cases. For example, `` a string to undo the escaping correctly in all cases. For example,
print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But ``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
there is no chance to tell file objects apart. there is no chance to tell file objects apart.
- Make the encoding used by unicode_repr() adjustable, and make the - Make the encoding used by unicode_repr() adjustable, and make the
@ -199,45 +211,37 @@ Five of Python's regression tests fail with this modification. If you
need repr() strings without non-ASCII characters, as in Python 2, you can use need repr() strings without non-ASCII characters, as in Python 2, you can use
the following function. :: the following function. ::
def repr_ascii(obj): def repr_ascii(obj):
return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII") return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
For logging or for debugging, the following code can raise For logging or for debugging, the following code can raise
UnicodeEncodeError. :: UnicodeEncodeError. ::
log = open("logfile", "w") log = open("logfile", "w")
log.write(repr(data)) # UnicodeEncodeError will be raised log.write(repr(data)) # UnicodeEncodeError will be raised
# if data contains unsupported characters. # if data contains unsupported characters.
To avoid exceptions being raised, you can explicitly specify the error- To avoid exceptions being raised, you can explicitly specify the error-
handler. :: handler. ::
log = open("logfile", "w", errors="backslashreplace") log = open("logfile", "w", errors="backslashreplace")
log.write(repr(data)) # Unsupported characters will be escaped. log.write(repr(data)) # Unsupported characters will be escaped.
For a console that uses a Unicode-based encoding, for example, For a console that uses a Unicode-based encoding, for example,
en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and
no printable characters are escaped. This will cause a problem with no printable characters are escaped. This will cause a problem with
similar-looking characters in Western, Greek and Cyrillic languages. similar-looking characters in Western, Greek and Cyrillic languages.
These languages use similar (but different) alphabets (descended from These languages use similar (but different) alphabets (descended from a
the common ancestor) and contain letters that look similar but have common ancestor) and contain letters that look similar but have
different character codes. For example, it is hard to distinguish Latin different character codes. For example, it is hard to distinguish Latin
'a', 'e' and 'o' from Cyrillic '\u0430', '\u0435' and '\u043e'. (The visual 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual
representation, of course, very much depends on the fonts used but representation, of course, very much depends on the fonts used but
usually these letters are almost indistinguishable.) To avoid the usually these letters are almost indistinguishable.) To avoid the
problem, the user can adjust the terminal encoding to get a result problem, the user can adjust the terminal encoding to get a result
suitable for their environment. suitable for their environment.
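As a small illustration of the look-alike problem, the sketch below compares the Latin and Cyrillic letters mentioned above. Using ``ascii()`` and ``unicodedata.name()`` to tell them apart is an aside for debugging, not something this section prescribes; the variable names are illustrative.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"   # Cyrillic letters that resemble a, e, o

# The strings look alike in most fonts but compare unequal.
strings_differ = latin != cyrillic

# ascii() makes the difference visible regardless of the console encoding.
latin_ascii = ascii(latin)
cyrillic_ascii = ascii(cyrillic)

# Character names identify each look-alike unambiguously.
cyrillic_names = [unicodedata.name(ch) for ch in cyrillic]

print(latin_ascii, cyrillic_ascii)
print(cyrillic_names)
```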
Open Issues
===========
- Is the ``ascii()`` function necessary, or is it sufficient to document
how to do it? If necessary, should ``ascii()`` belong to the builtin
namespace?
Rejected Proposals Rejected Proposals
================== ==================
@ -248,10 +252,10 @@ Rejected Proposals
idea. [2]_ idea. [2]_
- Use character names to escape characters, instead of hex character - Use character names to escape characters, instead of hex character
codes. For example, ``repr('\u03b1')`` can be converted to codes. For example, ``repr('\u03b1')`` can be converted to ``"\N{GREEK
``"\N{GREEK SMALL LETTER ALPHA}"``. SMALL LETTER ALPHA}"``.
Using character names can be very verbose compared to hex-escape. Using character names can be very verbose compared to hex-escape.
e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR
KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``. KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.
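The verbosity argument can be checked with a short sketch. ``name_escape()`` is a hypothetical helper written for this comparison; it is not an existing or proposed API.

```python
import unicodedata

def name_escape(ch):
    """Hypothetical helper: render ``ch`` as a \\N{...} escape."""
    return "\\N{" + unicodedata.name(ch) + "}"

alpha = "\u03b1"
hex_form = ascii(alpha)           # hex escape, quoted
name_form = name_escape(alpha)    # name-based escape

# The ligature from the example above has a far longer name than hex escape.
long_hex_form = ascii("\ufbf9")
long_name_form = name_escape("\ufbf9")

print(hex_form, name_form)
print(len(long_hex_form), len(long_name_form))
```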
@ -273,7 +277,7 @@ http://bugs.python.org/issue2630
References References
========== ==========
.. [1] Multibyte string on string::string_print .. [1] Multibyte string on string\::string_print
(http://bugs.python.org/issue479898) (http://bugs.python.org/issue479898)
.. [2] [Python-3000] Displaying strings containing unicode escapes .. [2] [Python-3000] Displaying strings containing unicode escapes