New cleaned-up version from Atsuo.
parent c9332b34c2
commit 2840a2d09b

pep-3138.txt (78 lines changed)
@@ -7,7 +7,7 @@ Status: Accepted
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 05-May-2008
-Post-History:
+Post-History: 05-May-2008, 05-Jun-2008
 
 
 Abstract
@@ -60,8 +60,8 @@ error message is something like ``IOError: [Errno 2] No such file or
 directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
 
 Python 3000 has a lot of nice features for non-Latin users such as
-non-ASCII identifiers, so it would be helpful if Python could also
-progress in a similar way for printable output.
+non-ASCII identifiers, so it would be helpful if Python could also progress
+in a similar way for printable output.
 
 Some users might be concerned that such output will mess up their
 console if they print binary data like images. But this is unlikely to
@@ -99,7 +99,7 @@ Specification
 (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
 
 * Convert non-printable characters(Py_UNICODE_ISPRINTABLE() returns 0)
-to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
+to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.
 
 * Backslash-escape quote characters (apostrophe, 0x27) and add quote
 character at the beginning and the end.
@@ -107,20 +107,29 @@ Specification
 - Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
 default.
 
+- Add a new function to the Python C API ``PyObject *PyObject_ASCII
+(PyObject *o)``. This function converts any python object to a string
+using PyObject_Repr() and then hex-escapes all non-ASCII characters.
+``PyObject_ASCII()`` generates the same string as ``PyObject_Repr()``
+in Python 2.
+
+- Add a new built-in function, ``ascii()``. This function converts any
+python object to a string using repr() and then hex-escapes all non-ASCII
+characters. ``ascii()`` generates the same string as ``repr()`` in
+Python 2.
+
 - Add ``'%a'`` string format operator. ``'%a'`` converts any python
 object to a string using repr() and then hex-escapes all non-ASCII
 characters. The ``'%a'`` format operator generates the same string as
-``'%r'`` in Python 2.
-
-- Add a new built-in function, ``ascii()``. This function converts any
-python object to a string using repr() and then hex-escapes all non-
-ASCII characters. ``ascii()`` generates the same string as ``repr()``
-in Python 2.
+``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the
+``string.format()`` method and add ``'%A'`` operator to the
+PyUnicode_FromFormat(). They converts any object to an ASCII string
+as ``'%a'`` string format operator.
 
 - Add an ``isprintable()`` method to the string type. ``str.isprintable()``
 returns False if repr() should escape any character in the string;
 otherwise returns True. The ``isprintable()`` method calls the
-`` Py_UNICODE_ISPRINTABLE()`` function internally.
+``Py_UNICODE_ISPRINTABLE()`` function internally.
 
 
 Rationale
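
The Specification entries touched by the hunk above can be tried from the Python side. The following is a minimal sketch, assuming a Python 3 interpreter that provides ``ascii()``, ``'%a'``, ``'!a'`` and ``str.isprintable()`` as the PEP describes; the sample string and variable names are illustrative only and do not come from the commit.

```python
# Sample text mixing a Latin-1 letter and three CJK characters.
text = "caf\u00e9 \u65e5\u672c\u8a9e"

as_repr = repr(text)              # printable non-ASCII characters are kept
as_ascii = ascii(text)            # non-ASCII characters are hex-escaped
via_percent = "%a" % (text,)      # '%a' format operator
via_format = "{!a}".format(text)  # '!a' conversion flag

# Only the ASCII-safe form is printed, so this works on any console encoding.
print(as_ascii)  # 'caf\xe9 \u65e5\u672c\u8a9e'

# isprintable() reports whether repr() would leave every character unescaped.
printable = "abc \u65e5".isprintable()      # True
not_printable = "tab\there".isprintable()   # False: TAB is escaped by repr()
```

All three ASCII-producing spellings are expected to return the same string, which is what lets ``ascii()`` stand in for the Python 2 behaviour of ``repr()``.
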
@@ -136,18 +145,21 @@ more readable form based on the HTML page's encoding.
 
 Characters not supported by the user's console could be hex-escaped on
 printing, by the Unicode encoder's error-handler. If the error-handler
-of the output file is 'backslashreplace', such characters are hex-
-escaped without raising UnicodeEncodeError. For example, if your default
-encoding is ASCII, ``print('Hello ¢')`` will prints 'Hello \\xa2'. If
+of the output file is 'backslashreplace', such characters are
+hex-escaped without raising UnicodeEncodeError. For example, if your default
+encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'. If
 your encoding is ISO-8859-1, 'Hello ¢' will be printed.
 
-Default error-handler of sys.stdout is 'strict'. Other applications
+The default error-handler for sys.stdout is 'strict'. Other applications
 reading the output might not understand hex-escaped characters, so
 unsupported characters should be trapped when writing. If you need to
-escape unsupported characters, you should change error-handler
-explicitly. For sys.stderr, default error-handler is set to
-'backslashreplace' and printing exceptions or error messages won't
-be failed.
+escape unsupported characters, you should explicitly change the
+error-handler. Unlike sys.stdout, sys.stderr doesn't raise
+UnicodeEncodingError by default, because the default error-handler is
+'backslashreplace'. So printing error messeges containing non-ASCII
+characters to sys.stderr will not raise an exception. Also, information
+about uncaught exceptions (exception object, traceback) are printed by
+the interpreter without raising exceptions.
 
 Alternate Solutions
 -------------------
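
The difference between the 'strict' and 'backslashreplace' error-handlers described in the hunk above can be reproduced without touching the real console. This sketch assumes in-memory ASCII streams built with ``io.TextIOWrapper`` are a fair stand-in for ``sys.stdout`` and ``sys.stderr``; the helper ``write_ascii`` is hypothetical and exists only for this illustration. Note that the exception actually raised by Python is ``UnicodeEncodeError``.

```python
import io


def write_ascii(text, errors):
    """Write text to an in-memory ASCII stream and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


message = "Hello \u00a2"  # 'Hello' followed by the cent sign

# A stderr-like stream: the unsupported character is hex-escaped.
stderr_like = write_ascii(message, "backslashreplace")  # b'Hello \\xa2'

# A stdout-like stream: the unsupported character is trapped when writing.
try:
    write_ascii(message, "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# With ISO-8859-1 the character is representable, so nothing is escaped.
latin1_bytes = message.encode("iso-8859-1")  # b'Hello \xa2'
```
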
@@ -169,15 +181,15 @@ suggestions were made.
 For interactive sessions, we can write hooks to restore hex escaped
 characters to the original characters. But these hooks are called only
 when printing the result of evaluating an expression entered in an
-interactive Python session, and doesn't work for the print() function,
-for non-interactive sessions or for logging.debug("%r", ...), etc.
+interactive Python session, and doesn't work for the ``print()`` function,
+for non-interactive sessions or for ``logging.debug("%r", ...)``, etc.
 
 - Subclass sys.stdout and sys.stderr.
 
 It is difficult to implement a subclass to restore hex-escaped
 characters since there isn't enough information left by the time it's
-a string to undo the escaping correctly in all cases. For example, ``
-print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
+a string to undo the escaping correctly in all cases. For example,
+``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
 there is no chance to tell file objects apart.
 
 - Make the encoding used by unicode_repr() adjustable, and make the
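
The information loss behind the "Subclass sys.stdout and sys.stderr" alternative in the hunk above can be shown with plain string encoding. This is a sketch under one assumption: it uses U+65E5 rather than the PEP's ``u0041`` example, because 'A' is ASCII and is never escaped by the 'backslashreplace' handler. Variable names are illustrative.

```python
# Text that literally contains a backslash followed by "u65e5" (six characters).
literal_text = "\\" + "u65e5"
# A single character, U+65E5.
real_char = "\u65e5"

# Both produce identical bytes once written to an ASCII stream that uses
# the 'backslashreplace' error-handler.
out_literal = literal_text.encode("ascii", errors="backslashreplace")
out_char = real_char.encode("ascii", errors="backslashreplace")

# A naive hook that undoes the escaping turns the literal text into the
# single character, which is not what was originally written.
restored = out_literal.decode("unicode_escape")
```

Since ``out_literal`` and ``out_char`` are the same bytes, a stream subclass that sees only the output has no way to decide which of the two inputs to restore.
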
@@ -220,24 +232,16 @@ For a console that uses a Unicode-based encoding, for example, en_US.
 utf8 or de_DE.utf8, the backslashescape trick doesn't work and all
 printable characters are not escaped. This will cause a problem of
 similarly drawing characters in Western, Greek and Cyrillic languages.
-These languages use similar (but different) alphabets (descended from
-the common ancestor) and contain letters that look similar but have
+These languages use similar (but different) alphabets (descended from a
+common ancestor) and contain letters that look similar but have
 different character codes. For example, it is hard to distinguish Latin
-'a', 'e' and 'o' from Cyrillic '\u0430', '\u0435' and '\u043e'. (The visual
+'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual
 representation, of course, very much depends on the fonts used but
 usually these letters are almost indistinguishable.) To avoid the
 problem, the user can adjust the terminal encoding to get a result
 suitable for their environment.
 
 
-Open Issues
-===========
-
-- Is the ``ascii()`` function necessary, or is it sufficient to document
-how to do it? If necessary, should ``ascii()`` belong to the builtin
-namespace?
-
-
 Rejected Proposals
 ==================
 
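
The look-alike problem described in the hunk above is easy to check programmatically. A minimal sketch, assuming Python 3 and the standard ``unicodedata`` module; using ``ascii()`` to tell the strings apart is one possible approach, not something the commit prescribes.

```python
import unicodedata

latin = "aeo"
cyrillic = "\u0430\u0435\u043e"  # renders almost identically in most fonts

# The strings look the same but compare unequal.
are_equal = latin == cyrillic          # False

# ascii() exposes the difference regardless of the terminal encoding.
latin_ascii = ascii(latin)             # 'aeo'
cyrillic_ascii = ascii(cyrillic)       # '\u0430\u0435\u043e'

# The Unicode character names confirm which script each letter belongs to.
names = [unicodedata.name(ch) for ch in cyrillic]
```
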
@@ -248,8 +252,8 @@ Rejected Proposals
 idea. [2]_
 
 - Use character names to escape characters, instead of hex character
-codes. For example, ``repr('\u03b1')`` can be converted to
-``"\N{GREEK SMALL LETTER ALPHA}"``.
+codes. For example, ``repr('\u03b1')`` can be converted to ``"\N{GREEK
+SMALL LETTER ALPHA}"``.
 
 Using character names can be very verbose compared to hex-escape.
 e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR
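
The verbosity argument in the rejected proposal above can be measured directly. This sketch assumes Python 3 and the standard ``unicodedata`` module; ``name_escape`` is a hypothetical helper written for this comparison and is not part of Python or of the PEP.

```python
import unicodedata


def name_escape(ch):
    """Build a \\N{...} escape for a single character (illustrative helper)."""
    return "\\N{" + unicodedata.name(ch) + "}"


alpha = "\u03b1"
alpha_name = unicodedata.name(alpha)  # 'GREEK SMALL LETTER ALPHA'
same_char = "\N{GREEK SMALL LETTER ALPHA}" == alpha  # both spell U+03B1

ligature = "\ufbf9"
hex_form = ascii(ligature)[1:-1]   # strip the surrounding quotes
name_form = name_escape(ligature)  # much longer than the hex form

print(len(hex_form), len(name_form))
```
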
@@ -273,7 +277,7 @@ http://bugs.python.org/issue2630
 References
 ==========
 
-.. [1] Multibyte string on string::string_print
+.. [1] Multibyte string on string\::string_print
 (http://bugs.python.org/issue479898)
 
 .. [2] [Python-3000] Displaying strings containing unicode escapes