From 9c83bacdff5dc20d6f1060e2d9d3eed47231d8a0 Mon Sep 17 00:00:00 2001 From: Georg Brandl Date: Sun, 30 Sep 2012 08:55:27 +0200 Subject: [PATCH] Some typo fixes in PEP 3138; also add variables footer. --- pep-3138.txt | 273 +++++++++++++++++++++++++++------------------------ 1 file changed, 146 insertions(+), 127 deletions(-) diff --git a/pep-3138.txt b/pep-3138.txt index a911d9bbc..df7578488 100644 --- a/pep-3138.txt +++ b/pep-3138.txt @@ -13,11 +13,11 @@ Post-History: 05-May-2008, 05-Jun-2008 Abstract ======== -This PEP proposes a new string representation form for Python 3000. In -Python prior to Python 3000, the repr() built-in function converted -arbitrary objects to printable ASCII strings for debugging and logging. -For Python 3000, a wider range of characters, based on the Unicode -standard, should be considered 'printable'. +This PEP proposes a new string representation form for Python 3000. +In Python prior to Python 3000, the repr() built-in function converted +arbitrary objects to printable ASCII strings for debugging and +logging. For Python 3000, a wider range of characters, based on the +Unicode standard, should be considered 'printable'. Motivation @@ -28,8 +28,8 @@ algorithm. - Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'. -- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII - characters(>=0x80) to '\\xXX'. +- Convert other non-printable characters(0x00-0x1f, 0x7f) and + non-ASCII characters (>= 0x80) to '\\xXX'. - Backslash-escape quote characters (apostrophe, ') and add the quote character at the beginning and the end. @@ -39,177 +39,184 @@ For Unicode strings, the following additional conversions are done. - Convert leading surrogate pair characters without trailing character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'. -- Convert 16-bit characters(>=0x100) to '\\uXXXX'. +- Convert 16-bit characters (>= 0x100) to '\\uXXXX'. -- Convert 21-bit characters(>=0x10000) and surrogate pair characters to - '\\U00xxxxxx'. +- Convert 21-bit characters (>= 0x10000) and surrogate pair characters + to '\\U00xxxxxx'. This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for -logging. Although all non-ASCII characters are escaped, this does not -matter when most of the string's characters are ASCII. But for other +logging. Although all non-ASCII characters are escaped, this does not +matter when most of the string's characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient. We can use ``print(aJapaneseString)`` to get a readable string, but we don't have a similar workaround for printing strings from collections -such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr() to -build the string to be printed, so the resulting strings are always -hex-escaped. Or when ``open(japaneseFilemame)`` raises an exception, the -error message is something like ``IOError: [Errno 2] No such file or -directory: '\u65e5\u672c\u8a9e'``, which isn't helpful. +such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr() +to build the string to be printed, so the resulting strings are always +hex-escaped. Or when ``open(japaneseFilemame)`` raises an exception, +the error message is something like ``IOError: [Errno 2] No such file +or directory: '\u65e5\u672c\u8a9e'``, which isn't helpful. Python 3000 has a lot of nice features for non-Latin users such as -non-ASCII identifiers, so it would be helpful if Python could also progress -in a similar way for printable output. +non-ASCII identifiers, so it would be helpful if Python could also +progress in a similar way for printable output. Some users might be concerned that such output will mess up their -console if they print binary data like images. But this is unlikely to -happen in practice because bytes and strings are different types in +console if they print binary data like images. But this is unlikely +to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won't mess it up. -This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected. +This issue was once discussed by Hye-Shik Chang [1]_, but was rejected. Specification ============= - Add a new function to the Python C API ``int Py_UNICODE_ISPRINTABLE - (Py_UNICODE ch)``. This function returns 0 if repr() should escape the - Unicode character ``ch``; otherwise it returns 1. Characters that should - be escaped are defined in the Unicode character database as: + (Py_UNICODE ch)``. This function returns 0 if repr() should escape + the Unicode character ``ch``; otherwise it returns 1. Characters + that should be escaped are defined in the Unicode character database + as: - * Cc (Other, Control) - * Cf (Other, Format) - * Cs (Other, Surrogate) - * Co (Other, Private Use) - * Cn (Other, Not Assigned) - * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028'). - * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029'). - * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in - this category should be escaped to avoid ambiguity. + * Cc (Other, Control) + * Cf (Other, Format) + * Cs (Other, Surrogate) + * Co (Other, Private Use) + * Cn (Other, Not Assigned) + * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028'). + * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR + ('\\u2029'). + * Zs (Separator, Space) other than ASCII space ('\\x20'). Characters + in this category should be escaped to avoid ambiguity. - The algorithm to build repr() strings should be changed to: - * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'. + * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'. - * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'. + * Convert non-printable ASCII characters (0x00-0x1f, 0x7f) to + '\\xXX'. - * Convert leading surrogate pair characters without trailing character - (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'. + * Convert leading surrogate pair characters without trailing + character (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to + '\\uXXXX'. - * Convert non-printable characters(Py_UNICODE_ISPRINTABLE() returns 0) - to 'xXX', '\\uXXXX' or '\\U00xxxxxx'. + * Convert non-printable characters (Py_UNICODE_ISPRINTABLE() returns + 0) to 'xXX', '\\uXXXX' or '\\U00xxxxxx'. - * Backslash-escape quote characters (apostrophe, 0x27) and add quote - character at the beginning and the end. + * Backslash-escape quote characters (apostrophe, 0x27) and add a + quote character at the beginning and the end. -- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by - default. +- Set the Unicode error-handler for sys.stderr to 'backslashreplace' + by default. - Add a new function to the Python C API ``PyObject *PyObject_ASCII - (PyObject *o)``. This function converts any python object to a string - using PyObject_Repr() and then hex-escapes all non-ASCII characters. - ``PyObject_ASCII()`` generates the same string as ``PyObject_Repr()`` - in Python 2. + (PyObject *o)``. This function converts any python object to a + string using PyObject_Repr() and then hex-escapes all non-ASCII + characters. ``PyObject_ASCII()`` generates the same string as + ``PyObject_Repr()`` in Python 2. -- Add a new built-in function, ``ascii()``. This function converts any - python object to a string using repr() and then hex-escapes all non-ASCII - characters. ``ascii()`` generates the same string as ``repr()`` in - Python 2. +- Add a new built-in function, ``ascii()``. This function converts + any python object to a string using repr() and then hex-escapes all + non-ASCII characters. ``ascii()`` generates the same string as + ``repr()`` in Python 2. -- Add ``'%a'`` string format operator. ``'%a'`` converts any python +- Add a ``'%a'`` string format operator. ``'%a'`` converts any python object to a string using repr() and then hex-escapes all non-ASCII - characters. The ``'%a'`` format operator generates the same string as - ``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the + characters. The ``'%a'`` format operator generates the same string + as ``'%r'`` in Python 2. Also, add ``'!a'`` conversion flags to the ``string.format()`` method and add ``'%A'`` operator to the - PyUnicode_FromFormat(). They converts any object to an ASCII string + PyUnicode_FromFormat(). They convert any object to an ASCII string as ``'%a'`` string format operator. -- Add an ``isprintable()`` method to the string type. ``str.isprintable()`` - returns False if repr() should escape any character in the string; - otherwise returns True. The ``isprintable()`` method calls the - ``Py_UNICODE_ISPRINTABLE()`` function internally. +- Add an ``isprintable()`` method to the string type. + ``str.isprintable()`` returns False if repr() would escape any + character in the string; otherwise returns True. The + ``isprintable()`` method calls the ``Py_UNICODE_ISPRINTABLE()`` + function internally. Rationale ========= -The repr() in Python 3000 should be Unicode not ASCII based, just like -Python 3000 strings. Also, conversion should not be affected by the -locale setting, because the locale is not necessarily the same as the -output device's locale. For example, it is common for a daemon process -to be invoked in an ASCII setting, but writes UTF-8 to its log files. -Also, web applications might want to report the error information in -more readable form based on the HTML page's encoding. +The repr() in Python 3000 should be Unicode, not ASCII based, just +like Python 3000 strings. Also, conversion should not be affected by +the locale setting, because the locale is not necessarily the same as +the output device's locale. For example, it is common for a daemon +process to be invoked in an ASCII setting, but writes UTF-8 to its log +files. Also, web applications might want to report the error +information in more readable form based on the HTML page's encoding. Characters not supported by the user's console could be hex-escaped on -printing, by the Unicode encoder's error-handler. If the error-handler -of the output file is 'backslashreplace', such characters are -hex-escaped without raising UnicodeEncodeError. For example, if your default -encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'. If -your encoding is ISO-8859-1, 'Hello ¢' will be printed. +printing, by the Unicode encoder's error-handler. If the +error-handler of the output file is 'backslashreplace', such +characters are hex-escaped without raising UnicodeEncodeError. For +example, if the default encoding is ASCII, ``print('Hello ¢')`` will +print 'Hello \\xa2'. If the encoding is ISO-8859-1, 'Hello ¢' will be +printed. -The default error-handler for sys.stdout is 'strict'. Other applications -reading the output might not understand hex-escaped characters, so -unsupported characters should be trapped when writing. If you need to -escape unsupported characters, you should explicitly change the -error-handler. Unlike sys.stdout, sys.stderr doesn't raise +The default error-handler for sys.stdout is 'strict'. Other +applications reading the output might not understand hex-escaped +characters, so unsupported characters should be trapped when writing. +If unsupported characters must be escaped, the error-handler should be +changed explicitly. Unlike sys.stdout, sys.stderr doesn't raise UnicodeEncodingError by default, because the default error-handler is -'backslashreplace'. So printing error messeges containing non-ASCII -characters to sys.stderr will not raise an exception. Also, information -about uncaught exceptions (exception object, traceback) are printed by -the interpreter without raising exceptions. +'backslashreplace'. So printing error messages containing non-ASCII +characters to sys.stderr will not raise an exception. Also, +information about uncaught exceptions (exception object, traceback) is +printed by the interpreter without raising exceptions. Alternate Solutions ------------------- -To help debugging in non-Latin languages without changing repr(), other -suggestions were made. +To help debugging in non-Latin languages without changing repr(), +other suggestions were made. - Supply a tool to print lists or dicts. - Strings to be printed for debugging are not only contained by lists or - dicts, but also in many other types of object. File objects contain a - file name in Unicode, exception objects contain a message in Unicode, - etc. These strings should be printed in readable form when repr()ed. - It is unlikely to be possible to implement a tool to print all - possible object types. + Strings to be printed for debugging are not only contained by lists + or dicts, but also in many other types of object. File objects + contain a file name in Unicode, exception objects contain a message + in Unicode, etc. These strings should be printed in readable form + when repr()ed. It is unlikely to be possible to implement a tool to + print all possible object types. - Use sys.displayhook and sys.excepthook. For interactive sessions, we can write hooks to restore hex escaped - characters to the original characters. But these hooks are called only - when printing the result of evaluating an expression entered in an - interactive Python session, and doesn't work for the ``print()`` function, - for non-interactive sessions or for ``logging.debug("%r", ...)``, etc. + characters to the original characters. But these hooks are called + only when printing the result of evaluating an expression entered in + an interactive Python session, and don't work for the ``print()`` + function, for non-interactive sessions or for ``logging.debug("%r", + ...)``, etc. - Subclass sys.stdout and sys.stderr. It is difficult to implement a subclass to restore hex-escaped - characters since there isn't enough information left by the time it's - a string to undo the escaping correctly in all cases. For example, - ``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But - there is no chance to tell file objects apart. + characters since there isn't enough information left by the time + it's a string to undo the escaping correctly in all cases. For + example, ``print("\\"+"u0041")`` should be printed as '\\u0041', not + 'A'. But there is no chance to tell file objects apart. - Make the encoding used by unicode_repr() adjustable, and make the existing repr() the default. With adjustable repr(), the result of using repr() is unpredictable and would make it impossible to write correct code involving repr(). - And if current repr() is the default, then the old convention remains - intact and users may expect ASCII strings as the result of repr(). - Third party applications or libraries could be confused when a custom - repr() function is used. + And if current repr() is the default, then the old convention + remains intact and users may expect ASCII strings as the result of + repr(). Third party applications or libraries could be confused + when a custom repr() function is used. Backwards Compatibility ======================= Changing repr() may break some existing code, especially testing code. -Five of Python's regression tests fail with this modification. If you -need repr() strings without non-ASCII character as Python 2, you can use -the following function. :: +Five of Python's regression tests fail with this modification. If you +need repr() strings without non-ASCII character as Python 2, you can +use the following function. :: def repr_ascii(obj): return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII") @@ -221,25 +228,25 @@ UnicodeEncodeError. :: log.write(repr(data)) # UnicodeEncodeError will be raised # if data contains unsupported characters. -To avoid exceptions being raised, you can explicitly specify the error- -handler. :: +To avoid exceptions being raised, you can explicitly specify the +error-handler. :: log = open("logfile", "w", errors="backslashreplace") log.write(repr(data)) # Unsupported characters will be escaped. -For a console that uses a Unicode-based encoding, for example, en_US. -utf8 or de_DE.utf8, the backslashescape trick doesn't work and all -printable characters are not escaped. This will cause a problem of -similarly drawing characters in Western, Greek and Cyrillic languages. -These languages use similar (but different) alphabets (descended from a -common ancestor) and contain letters that look similar but have -different character codes. For example, it is hard to distinguish Latin -'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The visual -representation, of course, very much depends on the fonts used but -usually these letters are almost indistinguishable.) To avoid the -problem, the user can adjust the terminal encoding to get a result -suitable for their environment. +For a console that uses a Unicode-based encoding, for example, +en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and +all printable characters are not escaped. This will cause a problem +of similarly drawing characters in Western, Greek and Cyrillic +languages. These languages use similar (but different) alphabets +(descended from a common ancestor) and contain letters that look +similar but have different character codes. For example, it is hard +to distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. +(The visual representation, of course, very much depends on the fonts +used but usually these letters are almost indistinguishable.) To +avoid the problem, the user can adjust the terminal encoding to get a +result suitable for their environment. Rejected Proposals @@ -252,20 +259,21 @@ Rejected Proposals idea. [2]_ - Use character names to escape characters, instead of hex character - codes. For example, ``repr('\u03b1')`` can be converted to ``"\N{GREEK - SMALL LETTER ALPHA}"``. + codes. For example, ``repr('\u03b1')`` can be converted to + ``"\N{GREEK SMALL LETTER ALPHA}"``. - Using character names can be very verbose compared to hex-escape. - e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR - KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``. + Using character names can be very verbose compared to hex-escape. + e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE + UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED + FORM}"``. - Default error-handler of sys.stdout should be 'backslashreplace'. Stuff written to stdout might be consumed by another program that - might misinterpret the \ escapes. For interactive session, it is - possible to make 'backslashreplace' error-handler to default, but may - add confusion of the kind "it works in interactive mode but not when - redirecting to a file". + might misinterpret the \\ escapes. For interactive sessions, it is + possible to make the 'backslashreplace' error-handler the default, + but this may add confusion of the kind "it works in interactive mode + but not when redirecting to a file". Implementation @@ -288,3 +296,14 @@ Copyright ========= This document has been placed in the public domain. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: