diff --git a/pep-0000.txt b/pep-0000.txt
index aec25900b..1ab6a2f20 100644
--- a/pep-0000.txt
+++ b/pep-0000.txt
@@ -105,6 +105,7 @@ Index by Category
  S   345  Metadata for Python Software Packages 1.2     Jones
  I   347  Migrating the Python CVS to Subversion        von Löwis
  S   348  Exception Reorganization for Python 3.0       Cannon
+ S   349  Generalized String Coercion                   Schemenauer
  S   754  IEEE 754 Floating Point Special Values        Warnes
 
  Finished PEPs (done, implemented in CVS)
@@ -392,6 +393,7 @@ Numerical Index
  SR  346  User Defined ("with") Statements              Coghlan
  I   347  Migrating the Python CVS to Subversion        von Löwis
  S   348  Exception Reorganization for Python 3.0       Cannon
+ S   349  Generalized String Coercion                   Schemenauer
  SR  666  Reject Foolish Indentation                    Creighton
  S   754  IEEE 754 Floating Point Special Values        Warnes
  I  3000  Python 3.0 Plans                              Kuchling, Cannon
diff --git a/pep-0349.txt b/pep-0349.txt
new file mode 100644
index 000000000..b8cdeebf3
--- /dev/null
+++ b/pep-0349.txt
@@ -0,0 +1,175 @@
+PEP: 349
+Title: Generalized String Coercion
+Version: $Revision$
+Last-Modified: $Date$
+Author: Neil Schemenauer
+Status: Draft
+Type: Standards Track
+Content-Type: text/plain
+Created: 02-Aug-2005
+Post-History:
+Python-Version: 2.5
+
+
+Abstract
+
+    This PEP proposes the introduction of a new built-in function,
+    text(), that provides a way of generating a string representation
+    of an object.  This function would make it easier to write library
+    code that processes string data without forcing the use of a
+    particular string type.
+
+
+Rationale
+
+    Python has had a Unicode string type for some time now but use of
+    it is not yet widespread.  There is a large amount of Python code
+    that assumes that string data is represented as str instances.
+    The long term plan for Python is to phase out the str type and use
+    unicode for all string data.  Clearly, a smooth migration path
+    must be provided.
+
+    We need to upgrade existing libraries, written for str instances,
+    so that they can operate in an all-unicode string world.
+    We can't change to an all-unicode world until all essential
+    libraries are ready for it.  Upgrading the libraries in one
+    shot does not seem feasible.  A more realistic strategy is to
+    individually make the libraries capable of operating on unicode
+    strings while preserving their current all-str environment
+    behaviour.
+
+    First, we need to be able to write code that can accept unicode
+    instances without attempting to coerce them to str instances.  Let
+    us label such code as Unicode-safe.  Unicode-safe libraries can be
+    used in an all-unicode world.
+
+    Second, we need to be able to write code that, when provided only
+    str instances, will not create unicode results.  Let us label such
+    code as str-stable.  Libraries that are str-stable can be used by
+    libraries and applications that are not yet Unicode-safe.
+
+    Sometimes it is simple to write code that is both str-stable and
+    Unicode-safe.  For example, the following function just works:
+
+        def appendx(s):
+            return s + 'x'
+
+    That's not too surprising since the unicode type is designed to
+    make the task easier.  The principle is that when str and unicode
+    instances meet, the result is a unicode instance.  One notable
+    difficulty arises when code requires a string representation of an
+    object, an operation traditionally accomplished by using the str()
+    built-in function.
+
+    Using str() makes the code not Unicode-safe.  Replacing a str()
+    call with a unicode() call makes the code not str-stable.  Using a
+    string format almost accomplishes the goal but not quite.
+    Consider the following code:
+
+        def text(obj):
+            return '%s' % obj
+
+    It behaves as desired except if 'obj' is not a basestring instance
+    and needs to return a Unicode representation of itself.  In that
+    case, the string format will attempt to coerce the result of
+    __str__ to a str instance.  Defining a __unicode__ method does not
+    help since it will only be called if the left-hand operand is a
+    unicode instance.  Using a unicode instance for the left-hand
+    operand does not work because the function is no longer str-stable
+    (i.e. it will coerce everything to unicode).
+
+
+Specification
+
+    A Python implementation of the text() built-in follows:
+
+        def text(s):
+            """Return a nice string representation of the object.  The
+            return value is a basestring instance.
+            """
+            if isinstance(s, basestring):
+                return s
+            r = s.__str__()
+            if not isinstance(r, basestring):
+                raise TypeError('__str__ returned non-string')
+            return r
+
+    Note that it is currently possible, although not very useful, to
+    write __str__ methods that return unicode instances.
+
+    The %s format specifier for str objects would be changed to call
+    text() on the argument.  Currently it calls str() unless the
+    argument is a unicode instance (in which case the object is
+    substituted as is and the % operation returns a unicode instance).
+
+    The following function would be added to the C API and would be the
+    equivalent of the text() function:
+
+        PyObject *PyObject_Text(PyObject *o);
+
+    A reference implementation is available on Sourceforge [1] as a
+    patch.
+
+
+Backwards Compatibility
+
+    The change to the %s format specifier would result in some %
+    operations returning a unicode instance rather than raising a
+    UnicodeDecodeError exception.  It seems unlikely that the change
+    would break currently working code.
+
+
+Alternative Solutions
+
+    Rather than adding the text() built-in, if PEP 246 were
+    implemented then adapt(s, basestring) could be equivalent to
+    text(s).  The advantage would be one less built-in function.  The
+    problem is that PEP 246 is not implemented.
+
+    Fredrik Lundh has suggested [2] that perhaps a new slot should be
+    added (e.g. __text__), that could return any kind of string that's
+    compatible with Python's text model.  That seems like an
+    attractive idea but many details would still need to be worked
+    out.
+
+    Instead of providing the text() built-in, the %s format specifier
+    could be changed and a string format could be used instead of
+    calling text().  However, it seems like the operation is important
+    enough to justify a built-in.
+
+    Instead of providing the text() built-in, the basestring type
+    could be changed to provide the same functionality.  That would
+    possibly be confusing behaviour for an abstract base type.
+
+    Some people have suggested [3] that an easier migration path would
+    be to change the default encoding to be UTF-8.  Code that is not
+    Unicode-safe would then encode Unicode strings as UTF-8 and
+    operate on them as str instances, rather than raising a
+    UnicodeDecodeError exception.  Other code would assume that str
+    instances were encoded using UTF-8 and decode them if necessary.
+    While that solution may work for some applications, it seems
+    unsuitable as a general solution.  For example, some applications
+    get string data from many different sources and assuming that all
+    str instances were encoded using UTF-8 could easily introduce
+    subtle bugs.
+
+
+References
+
+    [1] http://www.python.org/sf/1159501
+    [2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
+    [3] http://blog.ianbicking.org/illusive-setdefaultencoding.html
+
+
+Copyright
+
+    This document has been placed in the public domain.
+
+
+
+Local Variables:
+mode: indented-text
+indent-tabs-mode: nil
+sentence-end-double-space: t
+fill-column: 70
+End: