New version from Atsuo.
parent
1254bb5028
commit
4f64d25025
231
pep-3138.txt
@@ -9,11 +9,12 @@ Content-Type: text/x-rst
Created: 05-May-2008
Post-History:


Abstract
========

This PEP proposes new string representation form for Python 3000. In
Python prior to Python 3000, the repr() built-in function converts
This PEP proposes a new string representation form for Python 3000. In
Python prior to Python 3000, the repr() built-in function converted
arbitrary objects to printable ASCII strings for debugging and logging.
For Python 3000, a wider range of characters, based on the Unicode
standard, should be considered 'printable'.
@@ -28,30 +29,39 @@ algorithm.
- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
  characters(>=0x80) to '\\xXX'.
  characters(>=0x80) to '\\xXX'.

- Backslash-escape quote characters(' or ") and add quote character at
  head and tail.
- Backslash-escape quote characters (apostrophe, ') and add the quote
  character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

- Convert leading surrogate pair characters without trailing character
  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

- Convert 16-bit characters(>=0x100) to '\\uXXXX'.

- Convert 21-bit characters(>=0x10000) and surrogate pair characters to
  '\\U00xxxxxx'.
  '\\U00xxxxxx'.

This algorithm converts any string to printable ASCII, and repr() is
used as handy and safe way to print strings for debugging or for
used as a handy and safe way to print strings for debugging or for
logging. Although all non-ASCII characters are escaped, this does not
matter when most of the string's characters are ASCII. But for other
languages, such as Japanese where most characters in a string are not
ASCII, this is very inconvenient. Python 3000 has a lot of nice features
for non-Latin users such as non-ASCII identifiers, so it would be
helpful if Python could also progress in a similar way for printable
output.
ASCII, this is very inconvenient.

We can use ``print(aJapaneseString)`` to get a readable string, but we
don't have a similar workaround for printing strings from collections
such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr() to
build the string to be printed, so the resulting strings are always
hex-escaped. Or when ``open(japaneseFilemame)`` raises an exception, the
error message is something like ``IOError: [Errno 2] No such file or
directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
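
The contrast described above can be reproduced in a current Python 3 interpreter, where ``ascii()`` gives the hex-escaped form that ``repr()`` produced in Python 2. This is only a minimal sketch; the variable names are illustrative and do not come from the PEP.

```python
# Illustrative names only; they are assumptions, not part of the PEP.
a_japanese_string = "\u65e5\u672c\u8a9e"          # the three characters of "Japanese"
list_of_japanese_strings = [a_japanese_string]

# str() of a list calls repr() on each element, so this is what
# print(list_of_japanese_strings) would try to write.
readable = str(list_of_japanese_strings)

# ascii() hex-escapes every non-ASCII character, as repr() did in Python 2.
escaped = ascii(list_of_japanese_strings)

print(escaped)
```

With the behavior proposed here, ``readable`` keeps the original characters, while ``escaped`` shows what a Python 2 user would have seen.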

Python 3000 has a lot of nice features for non-Latin users such as
non-ASCII identifiers, so it would be helpful if Python could also
progress in a similar way for printable output.

Some users might be concerned that such output will mess up their
console if they print binary data like images. But this is unlikely to
@@ -64,22 +74,53 @@ This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected.
Specification
=============

- Add a new function to the Python C API ``int PY_UNICODE_ISPRINTABLE
  (Py_UNICODE ch)``. This function returns 0 if repr() should escape the
  Unicode character ``ch``; otherwise it returns 1. Characters that should
  be escaped are defined in the Unicode character database as:

  * Cc (Other, Control)
  * Cf (Other, Format)
  * Cs (Other, Surrogate)
  * Co (Other, Private Use)
  * Cn (Other, Not Assigned)
  * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
  * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029').
  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
    this category should be escaped to avoid ambiguity.

- The algorithm to build repr() strings should be changed to:

  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

  * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
    '\\xXX'.
  * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.

  * Convert leading surrogate pair characters without trailing character
    (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
  * Convert leading surrogate pair characters without trailing character
    (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

  * Convert Unicode whitespace other than ASCII space('\\x20'), and
    control characters (categories Z* and C* in the Unicode database),
    to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
  * Convert non-printable characters(PY_UNICODE_ISPRINTABLE() returns 0)
    to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.

- Set the Unicode error-handler for sys.stdout and sys.stderr to
  'backslashreplace' by default.
  * Backslash-escape quote characters (apostrophe, 0x27) and add quote
    character at the beginning and the end.

- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
  default.

- Add ``'%a'`` string format operator. ``'%a'`` converts any python
  object to a string using repr() and then hex-escapes all non-ASCII
  characters. The ``'%a'`` format operator generates the same string as
  ``'%r'`` in Python 2.

- Add a new built-in function, ``ascii()``. This function converts any
  python object to a string using repr() and then hex-escapes all non-
  ASCII characters. ``ascii()`` generates the same string as ``repr()``
  in Python 2.

- Add an ``isprintable()`` method to the string type. ``str.isprintable()``
  returns False if repr() should escape any character in the string;
  otherwise returns True. The ``isprintable()`` method calls the
  `` PY_UNICODE_ISPRINTABLE()`` function internally.
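
The category rule in the specification above can be approximated in pure Python with the standard ``unicodedata`` module. This is a hedged sketch, not the C implementation: ``is_printable`` is a hypothetical helper name, and it relies on the observation that the listed categories are exactly the ``C*`` and ``Z*`` groups, with ASCII space as the one exception. The last two lines show ``ascii()`` and ``'%a'`` as they exist in current Python 3.

```python
import unicodedata

def is_printable(ch):
    """Return True if repr() would leave the single character ch unescaped.

    Hypothetical helper that mirrors the category list in the specification.
    """
    if ch == " ":
        # ASCII space is the only separator that is not escaped.
        return True
    # Cc, Cf, Cs, Co, Cn are all of the C* categories; Zl, Zp, Zs are all
    # of the Z* categories, so checking the first letter is sufficient.
    return unicodedata.category(ch)[0] not in ("C", "Z")

samples = ["a", " ", "\n", "\u2028", "\u3000", "\u65e5"]
results = {ascii(ch): is_printable(ch) for ch in samples}

# ascii() and '%a' both produce the hex-escaped, Python 2 style form.
escaped_by_function = ascii("Hello \xa2")
escaped_by_operator = "%a" % "Hello \xa2"
```

For the sample characters, the helper agrees with the built-in ``str.isprintable()`` method.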


Rationale
@@ -90,44 +131,29 @@ Python 3000 strings. Also, conversion should not be affected by the
locale setting, because the locale is not necessarily the same as the
output device's locale. For example, it is common for a daemon process
to be invoked in an ASCII setting, but writes UTF-8 to its log files.
Also, web applications might want to report the error information in
more readable form based on the HTML page's encoding.

Characters not supported by user's console are hex-escaped on printing,
by the Unicode encoders' error-handler. If the error-handler of the
output file is 'backslashreplace', such characters are hex-escaped
without raising UnicodeEncodeError. For example, if your default
encoding is ASCII, ``print('¢')`` will prints '\\xa2'. If your encoding
is ISO-8859-1, '¢' will be printed.


Printable characters
--------------------

The Unicode standard doesn't define Non-printable characters, so we must
create our own definition. Here we propose to define Non-printable
characters as follows.

- Non-printable ASCII characters as Python 2.

- Broken surrogate pair characters.

- Characters defined in the Unicode character database as

  * Cc (Other, Control)
  * Cf (Other, Format)
  * Cs (Other, Surrogate)
  * Co (Other, Private Use)
  * Cn (Other, Not Assigned)
  * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
  * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
    this category should be escaped to avoid ambiguity.
Characters not supported by the user's console could be hex-escaped on
printing, by the Unicode encoder's error-handler. If the error-handler
of the output file is 'backslashreplace', such characters are hex-
escaped without raising UnicodeEncodeError. For example, if your default
encoding is ASCII, ``print('Hello ¢')`` will prints 'Hello \\xa2'. If
your encoding is ISO-8859-1, 'Hello ¢' will be printed.
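
The effect of the error-handler can be checked without a console by encoding the string directly. The sketch below uses ``str.encode()`` to stand in for what an output file would do; the variable names are illustrative.

```python
text = "Hello \xa2"  # "Hello" followed by the cent sign, U+00A2

# ASCII cannot represent U+00A2, so 'backslashreplace' substitutes an escape.
ascii_bytes = text.encode("ascii", "backslashreplace")

# ISO-8859-1 can represent it, so the character is encoded as a single byte.
latin1_bytes = text.encode("iso-8859-1", "backslashreplace")

# With the 'strict' handler, the same ASCII encode raises UnicodeEncodeError.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```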

Default error-handler of sys.stdout is 'strict'. Other applications
reading the output might not understand hex-escaped characters, so
unsupported characters should be trapped when writing. If you need to
escape unsupported characters, you should change error-handler
explicitly. For sys.stderr, default error-handler is set to
'backslashreplace' and printing exceptions or error messages won't
be failed.
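
The two defaults can be compared side by side with in-memory streams. This is a sketch under stated assumptions: ``write_to_ascii_stream`` is a hypothetical helper, and the ASCII-encoded ``io.TextIOWrapper`` merely imitates a console that cannot display the characters being written.

```python
import io

def write_to_ascii_stream(text, errors):
    """Write text to an in-memory ASCII stream and return the bytes written.

    Hypothetical helper; errors is the error-handler name for the stream.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# Like sys.stderr in this proposal: the message is escaped, not lost.
stderr_like = write_to_ascii_stream(
    "No such file: \u65e5\u672c\u8a9e", "backslashreplace"
)

# Like sys.stdout in this proposal: unsupported characters are trapped.
try:
    write_to_ascii_stream("\u65e5\u672c\u8a9e", "strict")
    stdout_like_failed = False
except UnicodeEncodeError:
    stdout_like_failed = True
```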

Alternate Solutions
-------------------

To help debugging in non-Latin languages without changing repr(), other
suggestion were made.
suggestions were made.

- Supply a tool to print lists or dicts.

@@ -142,9 +168,9 @@ suggestion were made.

  For interactive sessions, we can write hooks to restore hex escaped
  characters to the original characters. But these hooks are called only
  when the result of evaluating an expression entered in an interactive
  Python session, and doesn't work for the print() function or for
  non-interactive sessions.
  when printing the result of evaluating an expression entered in an
  interactive Python session, and doesn't work for the print() function,
  for non-interactive sessions or for logging.debug("%r", ...), etc.

- Subclass sys.stdout and sys.stderr.

@@ -154,34 +180,91 @@ suggestion were made.
  print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
  there is no chance to tell file objects apart.

- Make the encoding used by unicode_repr() adjustable.
- Make the encoding used by unicode_repr() adjustable, and make the
  existing repr() the default.

  There is no benefit preserving the current repr() behavior to make
  application/library authors aware of non-ASCII repr(). And selecting
  an encoding on printing is more flexible than having a global setting.


Open Issues
===========

- A lot of people use UTF-8 for their encoding, for example, en_US.utf8
  and de_DE.utf8. In such cases, the backslashescape trick doesn't work.
  With adjustable repr(), the result of using repr() is unpredictable
  and would make it impossible to write correct code involving repr().
  And if current repr() is the default, then the old convention remains
  intact and users may expect ASCII strings as the result of repr().
  Third party applications or libraries could be confused when a custom
  repr() function is used.


Backwards Compatibility
=======================

Changing repr() may break some existing codes, especially testing code.
Five of Python's regression test fail with this modification. If you
Changing repr() may break some existing code, especially testing code.
Five of Python's regression tests fail with this modification. If you
need repr() strings without non-ASCII character as Python 2, you can use
following function.
the following function. ::

::
    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
For logging or for debugging, the following code can raise
UnicodeEncodeError. ::

    log = open("logfile", "w")
    log.write(repr(data))     # UnicodeEncodeError will be raised
                              # if data contains unsupported characters.

To avoid exceptions being raised, you can explicitly specify the error-
handler. ::

    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.


For a console that uses a Unicode-based encoding, for example, en_US.
utf8 or de_DE.utf8, the backslashescape trick doesn't work and all
printable characters are not escaped. This will cause a problem of
similarly drawing characters in Western, Greek and Cyrillic languages.
These languages use similar (but different) alphabets (descended from
the common ancestor) and contain letters that look similar but have
different character codes. For example, it is hard to distinguish Latin
'a', 'e' and 'o' from Cyrillic '\u0430', '\u0435' and '\u043e'. (The visual
representation, of course, very much depends on the fonts used but
usually these letters are almost indistinguishable.) To avoid the
problem, the user can adjust the terminal encoding to get a result
suitable for their environment.
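
When look-alike letters are suspected, the hex-escaped form still tells them apart regardless of the font. The following is a small sketch using ``ascii()`` as it exists in current Python 3; the variable names are illustrative.

```python
latin = "aeo"
cyrillic = "\u0430\u0435\u043e"  # renders almost identically to "aeo"

# The two strings are different even though they look alike on screen.
same_text = latin == cyrillic

# ascii() makes the difference visible independently of the terminal.
latin_escaped = ascii(latin)
cyrillic_escaped = ascii(cyrillic)

print(latin_escaped, cyrillic_escaped)
```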


Open Issues
===========

- Is the ``ascii()`` function necessary, or is it sufficient to document
  how to do it? If necessary, should ``ascii()`` belong to the builtin
  namespace?


Rejected Proposals
==================

- Add encoding and errors arguments to the builtin print() function,
  with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

  Complicated to implement, and in general, this is not seen as a good
  idea. [2]_

- Use character names to escape characters, instead of hex character
  codes. For example, ``repr('\u03b1')`` can be converted to
  ``"\N{GREEK SMALL LETTER ALPHA}"``.

  Using character names can be very verbose compared to hex-escape.
  e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR
  KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.

- Default error-handler of sys.stdout should be 'backslashreplace'.

  Stuff written to stdout might be consumed by another program that
  might misinterpret the \ escapes. For interactive session, it is
  possible to make 'backslashreplace' error-handler to default, but may
  add confusion of the kind "it works in interactive mode but not when
  redirecting to a file".
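
The verbosity argument against the character-name proposal above can be illustrated with the standard ``unicodedata`` module. This is only a sketch of what such escaping might look like: ``escape_with_names`` is a hypothetical helper, not something the PEP defines, and it would raise ``ValueError`` for characters that have no name in the database.

```python
import unicodedata

def escape_with_names(text):
    """Replace each non-ASCII character with a named escape (illustrative)."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # unicodedata.name() raises ValueError for unnamed characters.
            parts.append("\\N{%s}" % unicodedata.name(ch))
    return "".join(parts)

by_name = escape_with_names("\u03b1")

# The hex-escaped form of the same character, for comparison.
by_hex = "\u03b1".encode("ascii", "backslashreplace").decode("ascii")

print(by_name, by_hex)
```

Even for a short name such as GREEK SMALL LETTER ALPHA, the named form is several times longer than the hex escape.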


Reference Implementation
========================

@@ -194,6 +277,8 @@ References
.. [1] Multibyte string on string::string_print
   (http://bugs.python.org/issue479898)

.. [2] [Python-3000] Displaying strings containing unicode escapes
   (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)

Copyright
=========