PEP: 3138
Title: String representation in Python 3000
Version: $Revision$
Last-Modified: $Date$
Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 05-May-2008
Post-History: 05-May-2008, 05-Jun-2008


Abstract
========

This PEP proposes a new string representation form for Python 3000.
In Python prior to Python 3000, the repr() built-in function converted
arbitrary objects to printable ASCII strings for debugging and
logging. For Python 3000, a wider range of characters, based on the
Unicode standard, should be considered 'printable'.


Motivation
==========

The current repr() converts 8-bit strings to ASCII using the following
algorithm.

- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

- Convert other non-printable characters (0x00-0x1f, 0x7f) and
  non-ASCII characters (>= 0x80) to '\\xXX'.

- Backslash-escape quote characters (apostrophe, ') and add the quote
  character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

- Convert leading surrogate pair characters without trailing character
  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

- Convert 16-bit characters (>= 0x100) to '\\uXXXX'.

- Convert 21-bit characters (>= 0x10000) and surrogate pair characters
  to '\\U00xxxxxx'.

This algorithm converts any string to printable ASCII, and repr() is
used as a handy and safe way to print strings for debugging or for
logging. Although all non-ASCII characters are escaped, this does not
matter when most of the string's characters are ASCII. But for other
languages, such as Japanese where most characters in a string are not
ASCII, this is very inconvenient.

We can use ``print(aJapaneseString)`` to get a readable string, but we
don't have a similar workaround for printing strings from collections
such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr()
to build the string to be printed, so the resulting strings are always
hex-escaped. Or when ``open(japaneseFilename)`` raises an exception,
the error message is something like ``IOError: [Errno 2] No such file
or directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.

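The difference described above can be seen in a small sketch. It
assumes an interpreter that already implements this PEP (Python 3.0 or
later); the variable names are illustrative only. The ``ascii()``
built-in, introduced in the Specification, is used here to reproduce
the old, fully escaped form for comparison.

```python
# A list holding one Japanese string, as in print(listOfJapaneseStrings).
names = ["\u65e5\u672c\u8a9e"]

# With this PEP, repr() leaves printable non-ASCII characters alone.
readable = repr(names)

# ascii() yields the Python 2 style output, where every non-ASCII
# character is hex-escaped.
escaped = ascii(names)

# Only the escaped form is printed, so the sketch also runs on a
# console that cannot display Japanese text.
print(escaped)
```

The list is printed in its escaped form, while ``readable`` holds the
same list with the original characters intact.
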

Python 3000 has a lot of nice features for non-Latin users such as
non-ASCII identifiers, so it would be helpful if Python could also
progress in a similar way for printable output.

Some users might be concerned that such output will mess up their
console if they print binary data like images. But this is unlikely
to happen in practice because bytes and strings are different types in
Python 3000, so printing an image to the console won't mess it up.

This issue was once discussed by Hye-Shik Chang [1]_, but was rejected.


Specification
=============

- Add a new function to the Python C API
  ``int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)``. This function returns
  0 if repr() should escape the Unicode character ``ch``; otherwise it
  returns 1. Characters that should be escaped are defined in the
  Unicode character database as:

  * Cc (Other, Control)
  * Cf (Other, Format)
  * Cs (Other, Surrogate)
  * Co (Other, Private Use)
  * Cn (Other, Not Assigned)
  * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
  * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR
    ('\\u2029').
  * Zs (Separator, Space) other than ASCII space ('\\x20'). Characters
    in this category should be escaped to avoid ambiguity.

- The algorithm to build repr() strings should be changed to:

  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

  * Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to
    '\\xXX'.

  * Convert leading surrogate pair characters without trailing
    character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to
    '\\uXXXX'.

  * Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns
    0) to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.

  * Backslash-escape quote characters (apostrophe, 0x27) and add a
    quote character at the beginning and the end.

- Set the Unicode error-handler for sys.stderr to 'backslashreplace'
  by default.

- Add a new function to the Python C API
  ``PyObject *PyObject_ASCII(PyObject *o)``. This function converts
  any Python object to a string using PyObject_Repr() and then
  hex-escapes all non-ASCII characters. ``PyObject_ASCII()`` generates
  the same string as ``PyObject_Repr()`` in Python 2.

- Add a new built-in function, ``ascii()``. This function converts
  any Python object to a string using repr() and then hex-escapes all
  non-ASCII characters. ``ascii()`` generates the same string as
  ``repr()`` in Python 2.

- Add a ``'%a'`` string format operator. ``'%a'`` converts any Python
  object to a string using repr() and then hex-escapes all non-ASCII
  characters. The ``'%a'`` format operator generates the same string
  as ``'%r'`` in Python 2. Also, add the ``'!a'`` conversion flag to
  the ``str.format()`` method and the ``'%A'`` operator to
  PyUnicode_FromFormat(). They convert any object to an ASCII string
  in the same way as the ``'%a'`` string format operator.

- Add an ``isprintable()`` method to the string type.
  ``str.isprintable()`` returns False if repr() would escape any
  character in the string; otherwise returns True. The
  ``isprintable()`` method calls the ``Py_UNICODE_ISPRINTABLE()``
  function internally.

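The category rule in the first item above can be approximated in pure
Python. This is only a sketch: ``is_printable_sketch`` and
``ESCAPED_CATEGORIES`` are hypothetical names, not part of the
proposal, and the real check is done in C by
``Py_UNICODE_ISPRINTABLE()``. The sketch relies on the standard
``unicodedata`` module and compares its answer with
``str.isprintable()`` for a few sample characters.

```python
import unicodedata

# General categories that the Specification lists as "should be escaped".
ESCAPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable_sketch(ch):
    """Hypothetical pure-Python approximation of Py_UNICODE_ISPRINTABLE()."""
    if ch == " ":
        # ASCII space is the only Zs character that is not escaped.
        return True
    return unicodedata.category(ch) not in ESCAPED_CATEGORIES


# Letter, space, control, Japanese letter, no-break space, line
# separator, zero width space (Cf) and a lone surrogate (Cs).
samples = ["a", " ", "\n", "\u3042", "\u00a0", "\u2028", "\u200b", "\ud800"]

for ch in samples:
    # ascii() keeps the output safe on any console.
    print(ascii(ch), is_printable_sketch(ch), ch.isprintable())
```

For each sample the two boolean columns are expected to agree, since
both checks read the same Unicode character database.
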
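The three Python-level spellings of the ASCII conversion listed in the
Specification (``ascii()``, ``'%a'`` and ``'!a'``) can be compared
directly. The following is a minimal sketch, assuming an interpreter
that implements this PEP; the sample string and variable names are
illustrative.

```python
text = "caf\u00e9 \u65e5\u672c\u8a9e"

via_builtin = ascii(text)          # new built-in function
via_percent = "%a" % (text,)       # new '%a' format operator
via_format = "{!a}".format(text)   # new '!a' conversion flag

# All three produce the same pure-ASCII string.
print(via_builtin)
print(via_builtin == via_percent == via_format)
```

Characters below 0x100 are written as ``\xXX`` and the others as
``\uXXXX``, which matches the Python 2 repr() of the same text.
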
Rationale
=========

The repr() in Python 3000 should be Unicode based, not ASCII based,
just like Python 3000 strings. Also, conversion should not be affected
by the locale setting, because the locale is not necessarily the same
as the output device's locale. For example, it is common for a daemon
process to be invoked in an ASCII setting, but to write UTF-8 to its
log files. Also, web applications might want to report the error
information in more readable form based on the HTML page's encoding.

Characters not supported by the user's console could be hex-escaped on
printing, by the Unicode encoder's error-handler. If the
error-handler of the output file is 'backslashreplace', such
characters are hex-escaped without raising UnicodeEncodeError. For
example, if the default encoding is ASCII, ``print('Hello ¢')`` will
print 'Hello \\xa2'. If the encoding is ISO-8859-1, 'Hello ¢' will be
printed.

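The effect of the 'backslashreplace' error-handler in the example
above can be checked without touching the console, by encoding the
string by hand. This is a small sketch; the variable names are
illustrative.

```python
message = "Hello \u00a2"   # 'Hello ' followed by CENT SIGN

# ASCII cannot represent U+00A2, so the handler writes an escape.
ascii_bytes = message.encode("ascii", "backslashreplace")

# ISO-8859-1 can represent it, so the handler is never invoked.
latin1_bytes = message.encode("iso-8859-1", "backslashreplace")

print(ascii_bytes.decode("ascii"))
print(len(ascii_bytes), len(latin1_bytes))
```

The ASCII result is longer because the single unsupported character is
replaced by the four-character escape ``\xa2``.
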

The default error-handler for sys.stdout is 'strict'. Other
applications reading the output might not understand hex-escaped
characters, so unsupported characters should be trapped when writing.
If unsupported characters must be escaped, the error-handler should be
changed explicitly. Unlike sys.stdout, sys.stderr doesn't raise
UnicodeEncodeError by default, because the default error-handler is
'backslashreplace'. So printing error messages containing non-ASCII
characters to sys.stderr will not raise an exception. Also,
information about uncaught exceptions (exception object, traceback) is
printed by the interpreter without raising exceptions.

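The contrast between the two handlers can be reproduced with in-memory
streams. The sketch below does not use the real sys.stdout or
sys.stderr; ``write_with`` is a hypothetical helper that wraps a
``io.BytesIO`` buffer in an ASCII-encoded ``io.TextIOWrapper`` with the
requested error-handler.

```python
import io


def write_with(errors):
    """Write a Japanese string to an ASCII stream using the given handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write("\u65e5\u672c\u8a9e")
    stream.flush()
    return raw.getvalue()


# 'strict' behaves like the default sys.stdout: the write is trapped.
try:
    write_with("strict")
    strict_result = "written"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# 'backslashreplace' behaves like the default sys.stderr.
escaped = write_with("backslashreplace")

print(strict_result)
print(escaped)
```

The strict stream refuses the text, while the 'backslashreplace' stream
accepts it and stores the escaped form.
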

Alternate Solutions
-------------------

To help debugging in non-Latin languages without changing repr(),
other suggestions were made.

- Supply a tool to print lists or dicts.

  Strings to be printed for debugging are not only contained by lists
  or dicts, but also in many other types of object. File objects
  contain a file name in Unicode, exception objects contain a message
  in Unicode, etc. These strings should be printed in readable form
  when repr()ed. It is unlikely to be possible to implement a tool to
  print all possible object types.

- Use sys.displayhook and sys.excepthook.

  For interactive sessions, we can write hooks to restore hex-escaped
  characters to the original characters. But these hooks are called
  only when printing the result of evaluating an expression entered in
  an interactive Python session, and don't work for the ``print()``
  function, for non-interactive sessions or for ``logging.debug("%r",
  ...)``, etc.


- Subclass sys.stdout and sys.stderr.

  It is difficult to implement a subclass to restore hex-escaped
  characters since there isn't enough information left by the time
  it's a string to undo the escaping correctly in all cases. For
  example, ``print("\\"+"u0041")`` should be printed as '\\u0041', not
  'A'. But a file object has no way to tell the two cases apart.

- Make the encoding used by unicode_repr() adjustable, and make the
  existing repr() the default.

  With adjustable repr(), the result of using repr() is unpredictable
  and would make it impossible to write correct code involving repr().
  And if current repr() is the default, then the old convention
  remains intact and users may expect ASCII strings as the result of
  repr(). Third party applications or libraries could be confused
  when a custom repr() function is used.

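The ambiguity behind the "Subclass sys.stdout and sys.stderr" item can
be shown with two different inputs that reach the stream as identical
text. This is an illustrative sketch; it uses U+3042 rather than the
'A' of the example above, because 'A' itself is never escaped, and it
uses ``ascii()`` to stand in for the old escaping repr().

```python
# Escaped output produced for a real non-ASCII character.
escaped_text = ascii("\u3042")

# Ordinary text that merely looks like an escape sequence:
# a quote, a backslash, the letters "u3042", and a quote.
literal_text = "'" + "\\" + "u3042" + "'"

# A stream subclass only ever sees the characters, and they are equal.
print(escaped_text)
print(literal_text)
print(escaped_text == literal_text)
```

Since both strings are the same sequence of characters, a file object
that tried to undo the escaping would corrupt the second one.
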

Backwards Compatibility
=======================

Changing repr() may break some existing code, especially testing code.
Five of Python's regression tests fail with this modification. If you
need repr() strings without non-ASCII characters, as in Python 2, you
can use the following function. ::

  def repr_ascii(obj):
      return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

For logging or for debugging, the following code can raise
UnicodeEncodeError. ::

  log = open("logfile", "w")
  log.write(repr(data))     # UnicodeEncodeError will be raised
                            # if data contains unsupported characters.

To avoid exceptions being raised, you can explicitly specify the
error-handler. ::

  log = open("logfile", "w", errors="backslashreplace")
  log.write(repr(data))  # Unsupported characters will be escaped.

For a console that uses a Unicode-based encoding, for example,
en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and
no printable characters are escaped. This will cause a problem
with similar-looking characters in Western, Greek and Cyrillic
languages. These languages use similar (but different) alphabets
(descended from a common ancestor) and contain letters that look
similar but have different character codes. For example, it is hard
to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'.
(The visual representation, of course, very much depends on the fonts
used but usually these letters are almost indistinguishable.) To
avoid the problem, the user can adjust the terminal encoding to get a
result suitable for their environment.

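When look-alike letters are a concern, the ``ascii()`` built-in from
the Specification offers a way to make the difference visible
regardless of the terminal encoding. The following is a small sketch
with illustrative variable names.

```python
latin = "aeo"
cyrillic = "\u0430\u0435\u043e"   # Cyrillic letters that resemble a, e, o

# The strings look alike on most consoles but compare unequal.
print(latin == cyrillic)

# ascii() shows the code points, so the two cannot be confused.
print(ascii(latin))
print(ascii(cyrillic))
```

The Latin string is printed unchanged, while every Cyrillic letter
appears as a ``\uXXXX`` escape.
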

Rejected Proposals
==================

- Add encoding and errors arguments to the builtin print() function,
  with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

  Complicated to implement, and in general, this is not seen as a good
  idea. [2]_

- Use character names to escape characters, instead of hex character
  codes. For example, ``repr('\u03b1')`` can be converted to
  ``"\N{GREEK SMALL LETTER ALPHA}"``.

  Using character names can be very verbose compared to hex-escape.
  For example, ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE
  UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED
  FORM}"``.

- Default error-handler of sys.stdout should be 'backslashreplace'.

  Stuff written to stdout might be consumed by another program that
  might misinterpret the \\ escapes. For interactive sessions, it is
  possible to make the 'backslashreplace' error-handler the default,
  but this may add confusion of the kind "it works in interactive mode
  but not when redirecting to a file".

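The verbosity argument against character names can be checked with the
standard ``unicodedata`` module. This is a sketch only: it builds the
``\N{...}`` spelling by hand for the two characters mentioned above and
compares its length with the hex-escaped form from ``ascii()``.

```python
import unicodedata

for ch in ("\u03b1", "\ufbf9"):
    # Hex-escaped form, including the surrounding quotes.
    hex_form = ascii(ch)
    # Name-based form, built by hand from the character database.
    name_form = "'\\N{%s}'" % unicodedata.name(ch)
    print(hex_form, len(hex_form))
    print(name_form, len(name_form))
```

For both characters the name-based form is several times longer than
the hex escape.
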

Implementation
==============

The author wrote a patch in http://bugs.python.org/issue2630; this was
committed to the Python 3.0 branch in revision 64138 on 06-11-2008.


References
==========

.. [1] Multibyte string on string\::string_print
   (http://bugs.python.org/issue479898)

.. [2] [Python-3000] Displaying strings containing unicode escapes
   (https://mail.python.org/pipermail/python-3000/2008-April/013366.html)

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: