New version of PEP 349. Propose that str() be changed rather than
adding a new built-in function.
parent 35b61a7c94
commit fb89a4ee52
|
@ -105,7 +105,7 @@ Index by Category
|
||||||
S 345 Metadata for Python Software Packages 1.2 Jones
|
S 345 Metadata for Python Software Packages 1.2 Jones
|
||||||
P 347 Migrating the Python CVS to Subversion von Löwis
|
P 347 Migrating the Python CVS to Subversion von Löwis
|
||||||
S 348 Exception Reorganization for Python 3.0 Cannon
|
S 348 Exception Reorganization for Python 3.0 Cannon
|
||||||
S 349 Generalized String Coercion Schemenauer
|
S 349 Allow str() to return unicode strings Schemenauer
|
||||||
S 754 IEEE 754 Floating Point Special Values Warnes
|
S 754 IEEE 754 Floating Point Special Values Warnes
|
||||||
|
|
||||||
Finished PEPs (done, implemented in CVS)
|
Finished PEPs (done, implemented in CVS)
|
||||||
|
@ -393,7 +393,7 @@ Numerical Index
|
||||||
SR 346 User Defined ("with") Statements Coghlan
|
SR 346 User Defined ("with") Statements Coghlan
|
||||||
P 347 Migrating the Python CVS to Subversion von Löwis
|
P 347 Migrating the Python CVS to Subversion von Löwis
|
||||||
S 348 Exception Reorganization for Python 3.0 Cannon
|
S 348 Exception Reorganization for Python 3.0 Cannon
|
||||||
S 349 Generalized String Coercion Schemenauer
|
S 349 Allow str() to return unicode strings Schemenauer
|
||||||
SR 666 Reject Foolish Indentation Creighton
|
SR 666 Reject Foolish Indentation Creighton
|
||||||
S 754 IEEE 754 Floating Point Special Values Warnes
|
S 754 IEEE 754 Floating Point Special Values Warnes
|
||||||
I 3000 Python 3.0 Plans Kuchling, Cannon
|
I 3000 Python 3.0 Plans Kuchling, Cannon
|
||||||
|
|
122
pep-0349.txt
|
@ -1,5 +1,5 @@
|
||||||
PEP: 349
|
PEP: 349
|
||||||
Title: Generalised String Coercion
|
Title: Allow str() to return unicode strings
|
||||||
Version: $Revision$
|
Version: $Revision$
|
||||||
Last-Modified: $Date$
|
Last-Modified: $Date$
|
||||||
Author: Neil Schemenauer <nas@arctrix.com>
|
Author: Neil Schemenauer <nas@arctrix.com>
|
||||||
|
@ -7,20 +7,18 @@ Status: Draft
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
Content-Type: text/plain
|
Content-Type: text/plain
|
||||||
Created: 02-Aug-2005
|
Created: 02-Aug-2005
|
||||||
Post-History:
|
Post-History: 06-Aug-2005
|
||||||
Python-Version: 2.5
|
Python-Version: 2.5
|
||||||
|
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
|
|
||||||
This PEP proposes the introduction of a new built-in function,
|
This PEP proposes to change the str() built-in function so that it
|
||||||
text(), that provides a way of generating a string representation
|
can return unicode strings. This change would make it easier to
|
||||||
of an object without forcing the result to be a particular string
|
write code that works with either string type and would also make
|
||||||
type. In addition, the behavior %s format specifier would be
|
some existing code handle unicode strings. The C function
|
||||||
changed to call text() on the argument. These two changes would
|
PyObject_Str() would remain unchanged and the function
|
||||||
make it easier to write library code that can be used by
|
PyString_New() would be added instead.
|
||||||
applications that use only the str type and by others that also use
|
|
||||||
the unicode type.
|
|
||||||
|
|
||||||
|
|
||||||
Rationale
|
Rationale
|
||||||
|
@ -64,51 +62,35 @@ Rationale
|
||||||
object; an operation traditionally accomplished by using the str()
|
object; an operation traditionally accomplished by using the str()
|
||||||
built-in function.
|
built-in function.
|
||||||
|
|
||||||
Using str() makes the code not Unicode-safe. Replacing a str()
|
Using the current str() function makes the code not Unicode-safe.
|
||||||
call with a unicode() call makes the code not str-stable. Using a
|
Replacing a str() call with a unicode() call makes the code not
|
||||||
string format almost accomplishes the goal but not quite.
|
str-stable. Changing str() so that it could return unicode
|
||||||
Consider the following code:
|
instances would solve this problem. As a further benefit, some code
|
||||||
|
that is currently not Unicode-safe because it uses str() would
|
||||||
def text(obj):
|
become Unicode-safe.
|
||||||
    return '%s' % obj
|
|
||||||
|
|
||||||
It behaves as desired except if 'obj' is not a basestring instance
|
|
||||||
and needs to return a Unicode representation of itself. In that
|
|
||||||
case, the string format will attempt to coerce the result of
|
|
||||||
__str__ to a str instance. Defining a __unicode__ method does not
|
|
||||||
help since it will only be called if the right-hand operand is a
|
|
||||||
unicode instance. Using a unicode instance for the right-hand
|
|
||||||
operand does not work because the function is no longer str-stable
|
|
||||||
(i.e. it will coerce everything to unicode).
|
|
||||||
|
|
||||||
|
|
||||||
Specification
|
Specification
|
||||||
|
|
||||||
A Python implementation of the text() built-in follows:
|
A Python implementation of the str() built-in follows:
|
||||||
|
|
||||||
def text(s):
|
def str(s):
|
||||||
    """Return a nice string representation of the object. The
|
    """Return a nice string representation of the object. The
|
||||||
    return value is a basestring instance.
|
    return value is a str or unicode instance.
|
||||||
    """
|
    """
|
||||||
    if isinstance(s, basestring):
|
    if type(s) is str or type(s) is unicode:
|
||||||
        return s
|
        return s
|
||||||
    r = s.__str__()
|
    r = s.__str__()
|
||||||
    if not isinstance(r, basestring):
|
    if not isinstance(r, (str, unicode)):
|
||||||
        raise TypeError('__str__ returned non-string')
|
        raise TypeError('__str__ returned non-string')
|
||||||
    return r
|
    return r
|
||||||
|
|
||||||
Note that it is currently possible, although not very useful, to
|
|
||||||
write __str__ methods that return unicode instances.
|
|
||||||
|
|
||||||
The %s format specifier for str objects would be changed to call
|
|
||||||
text() on the argument. Currently it calls str() unless the
|
|
||||||
argument is a unicode instance (in which case the object is
|
|
||||||
substituted as is and the % operation returns a unicode instance).
|
|
||||||
|
|
||||||
The following function would be added to the C API and would be the
|
The following function would be added to the C API and would be the
|
||||||
equivalent of the text() function:
|
equivalent to the str() built-in (ideally it would be called PyObject_Str,
|
||||||
|
but changing that function could cause a massive number of
|
||||||
|
compatibility problems):
|
||||||
|
|
||||||
PyObject *PyObject_Text(PyObject *o);
|
PyObject *PyString_New(PyObject *);
|
||||||
|
|
||||||
A reference implementation is available on Sourceforge [1] as a
|
A reference implementation is available on Sourceforge [1] as a
|
||||||
patch.
|
patch.
|
||||||
|
@ -116,52 +98,36 @@ Specification
|
||||||
|
|
||||||
Backwards Compatibility
|
Backwards Compatibility
|
||||||
|
|
||||||
The change to the %s format specifier would result in some %
|
Some code may require that str() returns a str instance. In the
|
||||||
operations returning a unicode instance rather than raising a
|
standard library, only one such case has been found so far. The
|
||||||
UnicodeDecodeError exception. It seems unlikely that the change
|
function email.header_decode() requires a str instance and the
|
||||||
would break currently working code.
|
email.Header.decode_header() function tries to ensure this by
|
||||||
|
calling str() on its argument. The code was fixed by changing
|
||||||
|
the line "header = str(header)" to:
|
||||||
|
|
||||||
|
if isinstance(header, unicode):
|
||||||
|
    header = header.encode('ascii')
|
||||||
|
|
||||||
|
Whether this is truly a bug is questionable since decode_header()
|
||||||
|
really operates on byte strings, not character strings. Code that
|
||||||
|
passes it a unicode instance could itself be considered buggy.
|
||||||
|
|
||||||
|
|
||||||
Alternative Solutions
|
Alternative Solutions
|
||||||
|
|
||||||
Rather than adding the text() built-in, if PEP 246 were
|
A new built-in function could be added instead of changing str().
|
||||||
implemented then adapt(s, basestring) could be equivalent to
|
Doing so would introduce virtually no backwards compatibility
|
||||||
text(s). The advantage would be one less built-in function. The
|
problems. However, since the compatibility problems are expected to
|
||||||
problem is that PEP 246 is not implemented.
|
be rare, changing str() seems preferable to adding a new built-in.
|
||||||
|
|
||||||
Fredrik Lundh has suggested [2] that perhaps a new slot should be
|
The basestring type could be changed to have the proposed behaviour,
|
||||||
added (e.g. __text__), that could return any kind of string that's
|
rather than changing str(). However, that would be confusing
|
||||||
compatible with Python's text model. That seems like an
|
behaviour for an abstract base type.
|
||||||
attractive idea but many details would still need to be worked
|
|
||||||
out.
|
|
||||||
|
|
||||||
Instead of providing the text() built-in, the %s format specifier
|
|
||||||
could be changed and a string format could be used instead of
|
|
||||||
calling text(). However, it seems like the operation is important
|
|
||||||
enough to justify a built-in.
|
|
||||||
|
|
||||||
Instead of providing the text() built-in, the basestring type
|
|
||||||
could be changed to provide the same functionality. That would
|
|
||||||
possibly be confusing behaviour for an abstract base type.
|
|
||||||
|
|
||||||
Some people have suggested [3] that an easier migration path would
|
|
||||||
be to change the default encoding to be UTF-8. Code that is not
|
|
||||||
Unicode safe would then encode Unicode strings as UTF-8 and
|
|
||||||
operate on them as str instances, rather than raising a
|
|
||||||
UnicodeDecodeError exception. Other code would assume that str
|
|
||||||
instances were encoded using UTF-8 and decode them if necessary.
|
|
||||||
While that solution may work for some applications, it seems
|
|
||||||
unsuitable as a general solution. For example, some applications
|
|
||||||
get string data from many different sources and assuming that all
|
|
||||||
str instances were encoded using UTF-8 could easily introduce
|
|
||||||
subtle bugs.
|
|
||||||
|
|
||||||
|
|
||||||
References
|
References
|
||||||
|
|
||||||
[1] http://www.python.org/sf/1159501
|
[1] http://www.python.org/sf/1266570
|
||||||
[2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
|
|
||||||
[3] http://blog.ianbicking.org/illusive-setdefaultencoding.html
|
|
||||||
|
|
||||||
|
|
||||||
Copyright
|
Copyright
|
||||||
|
|
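Reviewer's sketch (not part of the commit): the str() behaviour specified on the new side of this diff can be tried out in Python 3 under some stated assumptions. Python 2's str is modelled here by bytes and Python 2's unicode by Python 3's str. The names proposed_str and Label are hypothetical and appear in neither the PEP nor the patch. This is a rough illustration of the pass-through logic, not the reference implementation.

```python
# Python 3 sketch of the str() behaviour proposed on the new side of this diff.
# Assumption: Python 2 `str` is modelled by `bytes`, and Python 2 `unicode`
# by Python 3 `str`. The name `proposed_str` is hypothetical.


def proposed_str(s):
    """Return a string representation without forcing one string type."""
    # Exact type checks, as in the new specification: instances of the two
    # built-in string types are returned unchanged, while instances of
    # subclasses still go through __str__.
    if type(s) is bytes or type(s) is str:
        return s
    r = s.__str__()
    if not isinstance(r, (bytes, str)):
        raise TypeError('__str__ returned non-string')
    return r


class Label:
    """Hypothetical object whose __str__ hands back whatever it holds."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


# Byte strings and character strings both pass through without coercion.
print(repr(proposed_str(b'abc')))
print(repr(proposed_str(Label('caf\u00e9'))))
print(repr(proposed_str(Label(b'abc'))))
```

The exact type checks mirror the change from isinstance(s, basestring) to type(s) is str or type(s) is unicode in the Specification hunk. A subclass of either string type is therefore no longer returned as is. Its __str__ method decides the result.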