Add PEP 3138 by Atsuo Ishimoto, String representation in Python 3000.
This commit is contained in:
parent
dcd2305766
commit
4bb3cc5475
|
@ -98,6 +98,7 @@ Index by Category
|
|||
S 3108 Standard Library Reorganization Cannon
|
||||
S 3134 Exception Chaining and Embedded Tracebacks Yee
|
||||
S 3135 New Super Spealman, Delaney
|
||||
S 3138 String representation in Python 3000 Ishimoto
|
||||
|
||||
Finished PEPs (done, implemented in Subversion)
|
||||
|
||||
|
@ -515,6 +516,7 @@ Numerical Index
|
|||
S 3135 New Super Spealman, Delaney
|
||||
SR 3136 Labeled break and continue Chisholm
|
||||
SA 3137 Immutable Bytes and Mutable Buffer GvR
|
||||
S 3138 String representation in Python 3000 Ishimoto
|
||||
SA 3141 A Type Hierarchy for Numbers Yasskin
|
||||
|
||||
|
||||
|
@ -582,6 +584,7 @@ Owners
|
|||
Hodgson, Neil neilh@scintilla.org
|
||||
Hudson, Michael mwh@python.net
|
||||
Hylton, Jeremy jeremy@alum.mit.edu
|
||||
Ishimoto, Atsuo ishimoto at gembook.org
|
||||
Jansen, Jack jack@cwi.nl
|
||||
Jewett, Jim J. jimjjewett@users.sourceforge.net
|
||||
Jones, Richard richard@python.org
|
||||
|
|
|
@ -0,0 +1,195 @@
|
|||
PEP: 3138
|
||||
Title: String representation in Python 3000
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 05-May-2008
|
||||
Post-History:
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
This PEP proposes new string representation form for Python 3000. In
|
||||
Python prior to Python 3000, the repr() built-in function converts
|
||||
arbitrary objects to printable ASCII strings for debugging and logging.
|
||||
For Python 3000, a wider range of characters, based on the Unicode
|
||||
standard, should be considered 'printable'.
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
The current repr() converts 8-bit strings to ASCII using following
|
||||
algorithm.
|
||||
|
||||
- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
|
||||
|
||||
- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
|
||||
characters(>=0x80) to '\\xXX'.
|
||||
|
||||
- Backslash-escape quote characters(' or ") and add quote character at
|
||||
head and tail.
|
||||
|
||||
For Unicode strings, the following additional conversions are done.
|
||||
|
||||
- Convert leading surrogate pair characters without trailing character
|
||||
(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
|
||||
|
||||
- Convert 16-bit characters(>=0x100) to '\\uXXXX'.
|
||||
|
||||
- Convert 21-bit characters(>=0x10000) and surrogate pair characters to
|
||||
'\\U00xxxxxx'.
|
||||
|
||||
This algorithm converts any string to printable ASCII, and repr() is
|
||||
used as handy and safe way to print strings for debugging or for
|
||||
logging. Although all non-ASCII characters are escaped, this does not
|
||||
matter when most of the string's characters are ASCII. But for other
|
||||
languages, such as Japanese where most characters in a string are not
|
||||
ASCII, this is very inconvenient. Python 3000 has a lot of nice features
|
||||
for non-Latin users such as non-ASCII identifiers, so it would be
|
||||
helpful if Python could also progress in a similar way for printable
|
||||
output.
|
||||
|
||||
Some users might be concerned that such output will mess up their
|
||||
console if they print binary data like images. But this is unlikely to
|
||||
happen in practice because bytes and strings are different types in
|
||||
Python 3000, so printing an image to the console won't mess it up.
|
||||
|
||||
This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected.
|
||||
|
||||
Specification
|
||||
=============
|
||||
|
||||
- The algorithm to build repr() strings should be changed to:
|
||||
|
||||
* Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
|
||||
|
||||
* Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
|
||||
'\\xXX'.
|
||||
|
||||
* Convert leading surrogate pair characters without trailing character
|
||||
(0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
|
||||
|
||||
* Convert Unicode whitespace other than ASCII space('\\x20') and
|
||||
control characters (categories Z* and C* in the Unicode database)
|
||||
to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.
|
||||
|
||||
- Set the Unicode error-handler for sys.stdout and sys.stderr to
|
||||
'backslashreplace' by default.
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
The repr() in Python 3000 should be Unicode not ASCII based, just like
|
||||
Python 3000 strings. Also, conversion should not be affected by the
|
||||
locale setting, because the locale is not necessarily the same as the
|
||||
output device's locale. For example, it is common for a daemon process
|
||||
to be invoked in an ASCII setting, but writes UTF-8 to its log files.
|
||||
|
||||
Characters not supported by user's console are hex-escaped on printing,
|
||||
by the Unicode encoders' error-handler. If the error-handler of the
|
||||
output file is 'backslashreplace', such characters are hex-escaped
|
||||
without raising UnicodeEncodeError. For example, if your default
|
||||
encoding is ASCII, ``print('¢')`` will prints '\\xa2'. If your encoding
|
||||
is ISO-8859-1, '' will be printed.
|
||||
|
||||
Printable characters
|
||||
--------------------
|
||||
|
||||
The Unicode standard doesn't define Non-printable characters, so we must
|
||||
create our own definition. Here we propose to define Non-printable
|
||||
characters as follows.
|
||||
|
||||
- Non-printable ASCII characters as Python 2.
|
||||
|
||||
- Broken surrogate pair characters.
|
||||
|
||||
- Characters defined in the Unicode character database as
|
||||
|
||||
* Cc (Other, Control)
|
||||
* Cf (Other, Format)
|
||||
* Cs (Other, Surrogate)
|
||||
* Co (Other, Private Use)
|
||||
* Cn (Other, Not Assigned)
|
||||
* Zl Separator, Line ('\\u2028', LINE SEPARATOR)
|
||||
* Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
|
||||
* Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
|
||||
this category should be escaped to avoid ambiguity.
|
||||
|
||||
Alternate Solutions
|
||||
-------------------
|
||||
|
||||
To help debugging in non-Latin languages without changing repr(), other
|
||||
suggestion were made.
|
||||
|
||||
- Supply a tool to print lists or dicts.
|
||||
|
||||
Strings to be printed for debugging are not only contained by lists or
|
||||
dicts, but also in many other types of object. File objects contain a
|
||||
file name in Unicode, exception objects contain a message in Unicode,
|
||||
etc. These strings should be printed in readable form when repr()ed.
|
||||
It is unlikely to be possible to implement a tool to print all
|
||||
possible object types.
|
||||
|
||||
- Use sys.displayhook and sys.excepthook.
|
||||
|
||||
For interactive sessions, we can write hooks to restore hex escaped
|
||||
characters to the original characters. But these hooks are called only
|
||||
when the result of evaluating an expression entered in an interactive
|
||||
Python session, and doesn't work for the print() function or for
|
||||
non-interactive sessions.
|
||||
|
||||
- Subclass sys.stdout and sys.stderr.
|
||||
|
||||
It is difficult to implement a subclass to restore hex-escaped
|
||||
characters since there isn't enough information left by the time it's
|
||||
a string to undo the escaping correctly in all cases. For example, ``
|
||||
print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
|
||||
there is no chance to tell file objects apart.
|
||||
|
||||
- Make the encoding used by unicode_repr() adjustable.
|
||||
|
||||
There is no benefit preserving the current repr() behavior to make
|
||||
application/library authors aware of non-ASCII repr(). And selecting
|
||||
an encoding on printing is more flexible than having a global setting.
|
||||
|
||||
|
||||
Open Issues
|
||||
===========
|
||||
|
||||
- A lot of people use UTF-8 for their encoding, for example, en_US.utf8
|
||||
and de_DE.utf8. In such cases, the backslashescape trick doesn't work.
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
=======================
|
||||
|
||||
Changing repr() may break some existing codes, especially testing code.
|
||||
Five of Python's regression test fail with this modification. If you
|
||||
need repr() strings without non-ASCII character as Python 2, you can use
|
||||
following function.
|
||||
|
||||
::
|
||||
|
||||
def repr_ascii(obj):
|
||||
return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
|
||||
|
||||
Reference Implementation
|
||||
========================
|
||||
|
||||
http://bugs.python.org/issue2630
|
||||
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
.. [1] Multibyte string on string::string_print
|
||||
(http://bugs.python.org/issue479898)
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
Loading…
Reference in New Issue