diff --git a/pep-0000.txt b/pep-0000.txt index 749ae0a11..f7f884dab 100644 --- a/pep-0000.txt +++ b/pep-0000.txt @@ -105,7 +105,7 @@ Index by Category S 345 Metadata for Python Software Packages 1.2 Jones P 347 Migrating the Python CVS to Subversion von Löwis S 348 Exception Reorganization for Python 3.0 Cannon - S 349 Generalized String Coercion Schemenauer + S 349 Allow str() to return unicode strings Schemenauer S 754 IEEE 754 Floating Point Special Values Warnes Finished PEPs (done, implemented in CVS) @@ -393,7 +393,7 @@ Numerical Index SR 346 User Defined ("with") Statements Coghlan P 347 Migrating the Python CVS to Subversion von Löwis S 348 Exception Reorganization for Python 3.0 Cannon - S 349 Generalized String Coercion Schemenauer + S 349 Allow str() to return unicode strings Schemenauer SR 666 Reject Foolish Indentation Creighton S 754 IEEE 754 Floating Point Special Values Warnes I 3000 Python 3.0 Plans Kuchling, Cannon diff --git a/pep-0349.txt b/pep-0349.txt index 9bc322b1b..9c35b0502 100644 --- a/pep-0349.txt +++ b/pep-0349.txt @@ -1,5 +1,5 @@ PEP: 349 -Title: Generalised String Coercion +Title: Allow str() to return unicode strings Version: $Revision$ Last-Modified: $Date$ Author: Neil Schemenauer @@ -7,20 +7,18 @@ Status: Draft Type: Standards Track Content-Type: text/plain Created: 02-Aug-2005 -Post-History: +Post-History: 06-Aug-2005 Python-Version: 2.5 Abstract - This PEP proposes the introduction of a new built-in function, - text(), that provides a way of generating a string representation - of an object without forcing the result to be a particular string - type. In addition, the behavior %s format specifier would be - changed to call text() on the argument. These two changes would - make it easier to write library code that can be used by - applications that use only the str type and by others that also use - the unicode type. + This PEP proposes to change the str() built-in function so that it + can return unicode strings. 
This change would make it easier to + write code that works with either string type and would also make + some existing code handle unicode strings. The C function + PyObject_Str() would remain unchanged and the function + PyString_New() would be added instead. Rationale @@ -64,51 +62,35 @@ Rationale object; an operation traditionally accomplished by using the str() built-in function. - Using str() makes the code not Unicode-safe. Replacing a str() - call with a unicode() call makes the code not str-stable. Using a - string format almost accomplishes the goal but not quite. - Consider the following code: - - def text(obj): - return '%s' % obj - - It behaves as desired except if 'obj' is not a basestring instance - and needs to return a Unicode representation of itself. In that - case, the string format will attempt to coerce the result of - __str__ to a str instance. Defining a __unicode__ method does not - help since it will only be called if the right-hand operand is a - unicode instance. Using a unicode instance for the right-hand - operand does not work because the function is no longer str-stable - (i.e. it will coerce everything to unicode). + Using the current str() function makes the code not Unicode-safe. + Replacing a str() call with a unicode() call makes the code not + str-stable. Changing str() so that it could return unicode + instances would solve this problem. As a further benefit, some code + that is currently not Unicode-safe because it uses str() would + become Unicode-safe. Specification - A Python implementation of the text() built-in follows: + A Python implementation of the str() built-in follows: - def text(s): + def str(s): """Return a nice string representation of the object. The - return value is a basestring instance. + return value is a str or unicode instance. 
""" - if isinstance(s, basestring): + if type(s) is str or type(s) is unicode: return s r = s.__str__() - if not isinstance(r, basestring): + if not isinstance(r, (str, unicode)): raise TypeError('__str__ returned non-string') return r - Note that it is currently possible, although not very useful, to - write __str__ methods that return unicode instances. - - The %s format specifier for str objects would be changed to call - text() on the argument. Currently it calls str() unless the - argument is a unicode instance (in which case the object is - substituted as is and the % operation returns a unicode instance). - The following function would be added to the C API and would be the - equivalent of the text() function: + equivalent to the str() built-in (ideally it be called PyObject_Str, + but changing that function could cause a massive number of + compatibility problems): - PyObject *PyObject_Text(PyObject *o); + PyObject *PyString_New(PyObject *); A reference implementation is available on Sourceforge [1] as a patch. @@ -116,52 +98,36 @@ Specification Backwards Compatibility - The change to the %s format specifier would result in some % - operations returning a unicode instance rather than raising a - UnicodeDecodeError exception. It seems unlikely that the change - would break currently working code. + Some code may require that str() returns a str instance. In the + standard library, only one such case has been found so far. The + function email.header_decode() requires a str instance and the + email.Header.decode_header() function tries to ensure this by + calling str() on its argument. The code was fixed by changing + the line "header = str(header)" to: + + if isinstance(header, unicode): + header = header.encode('ascii') + + Whether this is truly a bug is questionable since decode_header() + really operates on byte strings, not character strings. Code that + passes it a unicode instance could itself be considered buggy. 
Alternative Solutions - Rather than adding the text() built-in, if PEP 246 were - implemented then adapt(s, basestring) could be equivalent to - text(s). The advantage would be one less built-in function. The - problem is that PEP 246 is not implemented. + A new built-in function could be added instead of changing str(). + Doing so would introduce virtually no backwards compatibility + problems. However, since the compatibility problems are expected to + be rare, changing str() seems preferable to adding a new built-in. - Fredrik Lundh has suggested [2] that perhaps a new slot should be - added (e.g. __text__), that could return any kind of string that's - compatible with Python's text model. That seems like an - attractive idea but many details would still need to be worked - out. - - Instead of providing the text() built-in, the %s format specifier - could be changed and a string format could be used instead of - calling text(). However, it seems like the operation is important - enough to justify a built-in. - - Instead of providing the text() built-in, the basestring type - could be changed to provide the same functionality. That would - possibly be confusing behaviour for an abstract base type. - - Some people have suggested [3] that an easier migration path would - be to change the default encoding to be UTF-8. Code that is not - Unicode safe would then encode Unicode strings as UTF-8 and - operate on them as str instances, rather than raising a - UnicodeDecodeError exception. Other code would assume that str - instances were encoded using UTF-8 and decode them if necessary. - While that solution may work for some applications, it seems - unsuitable as a general solution. For example, some applications - get string data from many different sources and assuming that all - str instances were encoded using UTF-8 could easily introduce - subtle bugs. + The basestring type could be changed to have the proposed behaviour, + rather than changing str(). 
However, that would be confusing + behaviour for an abstract base type. References - [1] http://www.python.org/sf/1159501 - [2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html - [3] http://blog.ianbicking.org/illusive-setdefaultencoding.html + [1] http://www.python.org/sf/1266570 Copyright
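
The revised Specification in this patch describes str() as passing exact str and unicode instances through unchanged and otherwise returning whatever string type __str__ produces. The block below is a minimal sketch of that behaviour, not the reference implementation from the Sourceforge patch. The names proposed_str, text_type, and _builtin_str are hypothetical and were chosen so the real built-in is not shadowed; the fallback for text_type is an assumption added so the sketch also runs on interpreters that have no separate unicode type.

```python
# Sketch of the str() semantics proposed in the patch, under assumed names.
# `proposed_str`, `text_type`, and `_builtin_str` are hypothetical helpers,
# not part of the PEP or of any Python release.
try:
    text_type = unicode  # interpreters with a separate unicode type
except NameError:
    text_type = str      # assumption: fall back where str is the text type

_builtin_str = str


def proposed_str(s):
    """Return a str or unicode representation of s.

    Unlike the current built-in, the result of __str__ is not coerced
    to a str instance.
    """
    # Exact str or unicode instances are returned as they are.
    if type(s) is _builtin_str or type(s) is text_type:
        return s
    r = s.__str__()
    # Any string type is accepted; anything else is an error.
    if not isinstance(r, (_builtin_str, text_type)):
        raise TypeError('__str__ returned non-string')
    return r
```

With this sketch, an object whose __str__ returns a unicode instance gets that instance back unchanged, which is the case the Rationale calls Unicode-safe, while objects that return a str keep returning a str.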