Add PEP 3138 by Atsuo Ishimoto, String representation in Python 3000.

2008-05-05 21:50:12 +00:00 · 2008-05-05 21:50:12 +00:00 · 4bb3cc5475
parent dcd2305766
commit 4bb3cc5475
2 changed files with 198 additions and 0 deletions
--- a/pep-0000.txt
+++ b/pep-0000.txt
@ -98,6 +98,7 @@ Index by Category
 S  3108  Standard Library Reorganization              Cannon
 S  3134  Exception Chaining and Embedded Tracebacks   Yee
 S  3135  New Super                                    Spealman, Delaney
+ S  3138  String representation in Python 3000         Ishimoto

 Finished PEPs (done, implemented in Subversion)

@ -515,6 +516,7 @@ Numerical Index
 S  3135  New Super                                    Spealman, Delaney
 SR 3136  Labeled break and continue                   Chisholm
 SA 3137  Immutable Bytes and Mutable Buffer           GvR
+ S  3138  String representation in Python 3000         Ishimoto
 SA 3141  A Type Hierarchy for Numbers                 Yasskin


@ -582,6 +584,7 @@ Owners
    Hodgson, Neil            neilh@scintilla.org
    Hudson, Michael          mwh@python.net
    Hylton, Jeremy           jeremy@alum.mit.edu
+    Ishimoto, Atsuo          ishimoto at gembook.org
    Jansen, Jack             jack@cwi.nl
    Jewett, Jim J.           jimjjewett@users.sourceforge.net
    Jones, Richard           richard@python.org
--- a/pep-3138.txt
+++ b/pep-3138.txt
@ -0,0 +1,195 @@
+PEP: 3138
+Title: String representation in Python 3000
+Version: $Revision$
+Last-Modified: $Date$
+Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 05-May-2008
+Post-History:
+
+Abstract
+========
+
+This PEP proposes new string representation form for Python 3000. In
+Python prior to Python 3000, the repr() built-in function converts
+arbitrary objects to printable ASCII strings for debugging and logging.
+For Python 3000, a wider range of characters, based on the Unicode
+standard, should be considered 'printable'.
+
+Motivation
+==========
+
+The current repr() converts 8-bit strings to ASCII using following
+algorithm.
+
+- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
+
+- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
+  characters(>=0x80) to '\\xXX'.
+
+- Backslash-escape quote characters(' or ") and add quote character at
+  head and tail.
+
+For Unicode strings, the following additional conversions are done.
+
+- Convert leading surrogate pair characters without trailing character
+  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
+
+- Convert 16-bit characters(>=0x100) to '\\uXXXX'.
+
+- Convert 21-bit characters(>=0x10000) and surrogate pair characters to
+  '\\U00xxxxxx'.
+
+This algorithm converts any string to printable ASCII, and repr() is
+used as handy and safe way to print strings for debugging or for
+logging. Although all non-ASCII characters are escaped, this does not
+matter when most of the string's characters are ASCII. But for other
+languages, such as Japanese where most characters in a string are not
+ASCII, this is very inconvenient. Python 3000 has a lot of nice features
+for non-Latin users such as non-ASCII identifiers, so it would be
+helpful if Python could also progress in a similar way for printable
+output.
+
+Some users might be concerned that such output will mess up their
+console if they print binary data like images. But this is unlikely to
+happen in practice because bytes and strings are different types in
+Python 3000, so printing an image to the console won't mess it up.
+
+This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected.
+
+Specification
+=============
+
+- The algorithm to build repr() strings should be changed to:
+
+  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
+
+  * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
+    '\\xXX'.
+
+  * Convert leading surrogate pair characters without trailing character
+    (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
+
+  * Convert Unicode whitespace other than ASCII space('\\x20') and
+    control characters (categories Z* and C* in the Unicode database)
+    to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.
+
+- Set the Unicode error-handler for sys.stdout and sys.stderr to
+  'backslashreplace' by default.
+
+Rationale
+=========
+
+The repr() in Python 3000 should be Unicode not ASCII based, just like
+Python 3000 strings. Also, conversion should not be affected by the
+locale setting, because the locale is not necessarily the same as the
+output device's locale. For example, it is common for a daemon process
+to be invoked in an ASCII setting, but writes UTF-8 to its log files.
+
+Characters not supported by user's console are hex-escaped on printing,
+by the Unicode encoders' error-handler. If the error-handler of the
+output file is 'backslashreplace', such characters are hex-escaped
+without raising UnicodeEncodeError. For example, if your default
+encoding is ASCII, ``print('¢')`` will prints '\\xa2'. If your encoding
+is ISO-8859-1, '' will be printed.
+
+Printable characters
+--------------------
+
+The Unicode standard doesn't define Non-printable characters, so we must
+create our own definition. Here we propose to define Non-printable
+characters as follows.
+
+- Non-printable ASCII characters as Python 2.
+
+- Broken surrogate pair characters.
+
+- Characters defined in the Unicode character database as
+
+  * Cc (Other, Control)
+  * Cf (Other, Format)
+  * Cs (Other, Surrogate)
+  * Co (Other, Private Use)
+  * Cn (Other, Not Assigned)
+  * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
+  * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
+  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
+    this category should be escaped to avoid ambiguity.
+
+Alternate Solutions
+-------------------
+
+To help debugging in non-Latin languages without changing repr(), other
+suggestion were made.
+
+- Supply a tool to print lists or dicts.
+
+ Strings to be printed for debugging are not only contained by lists or
+ dicts, but also in many other types of object. File objects contain a
+ file name in Unicode, exception objects contain a message in Unicode,
+ etc. These strings should be printed in readable form when repr()ed.
+ It is unlikely to be possible to implement a tool to print all
+ possible object types.
+
+- Use sys.displayhook and sys.excepthook.
+
+ For interactive sessions, we can write hooks to restore hex escaped
+ characters to the original characters. But these hooks are called only
+ when the result of evaluating an expression entered in an interactive
+ Python session, and doesn't work for the print() function or for
+ non-interactive sessions.
+
+- Subclass sys.stdout and sys.stderr.
+
+ It is difficult to implement a subclass to restore hex-escaped
+ characters since there isn't enough information left by the time it's
+ a string to undo the escaping correctly in all cases. For example, ``
+ print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
+ there is no chance to tell file objects apart.
+
+- Make the encoding used by unicode_repr() adjustable.
+
+ There is no benefit preserving the current repr() behavior to make
+ application/library authors aware of non-ASCII repr(). And selecting
+ an encoding on printing is more flexible than having a global setting.
+
+
+Open Issues
+===========
+
+- A lot of people use UTF-8 for their encoding, for example, en_US.utf8
+  and de_DE.utf8. In such cases, the backslashescape trick doesn't work.
+
+
+Backwards Compatibility
+=======================
+
+Changing repr() may break some existing codes, especially testing code.
+Five of Python's regression test fail with this modification. If you
+need repr() strings without non-ASCII character as Python 2, you can use
+following function.
+
+::
+
+ def repr_ascii(obj):
+     return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
+
+Reference Implementation
+========================
+
+http://bugs.python.org/issue2630
+
+
+References
+==========
+
+.. [1] Multibyte string on string::string_print
+       (http://bugs.python.org/issue479898)
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.