New version of PEP 349. Propose that str() be changed rather than
adding a new built-in function.
parent
35b61a7c94
commit
fb89a4ee52
|
@ -105,7 +105,7 @@ Index by Category
|
|||
S 345 Metadata for Python Software Packages 1.2 Jones
|
||||
P 347 Migrating the Python CVS to Subversion von Löwis
|
||||
S 348 Exception Reorganization for Python 3.0 Cannon
|
||||
S 349 Generalized String Coercion Schemenauer
|
||||
S 349 Allow str() to return unicode strings Schemenauer
|
||||
S 754 IEEE 754 Floating Point Special Values Warnes
|
||||
|
||||
Finished PEPs (done, implemented in CVS)
|
||||
|
@ -393,7 +393,7 @@ Numerical Index
|
|||
SR 346 User Defined ("with") Statements Coghlan
|
||||
P 347 Migrating the Python CVS to Subversion von Löwis
|
||||
S 348 Exception Reorganization for Python 3.0 Cannon
|
||||
S 349 Generalized String Coercion Schemenauer
|
||||
S 349 Allow str() to return unicode strings Schemenauer
|
||||
SR 666 Reject Foolish Indentation Creighton
|
||||
S 754 IEEE 754 Floating Point Special Values Warnes
|
||||
I 3000 Python 3.0 Plans Kuchling, Cannon
|
||||
|
|
122
pep-0349.txt
|
@ -1,5 +1,5 @@
|
|||
PEP: 349
|
||||
Title: Generalised String Coercion
|
||||
Title: Allow str() to return unicode strings
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Neil Schemenauer <nas@arctrix.com>
|
||||
|
@ -7,20 +7,18 @@ Status: Draft
|
|||
Type: Standards Track
|
||||
Content-Type: text/plain
|
||||
Created: 02-Aug-2005
|
||||
Post-History:
|
||||
Post-History: 06-Aug-2005
|
||||
Python-Version: 2.5
|
||||
|
||||
|
||||
Abstract
|
||||
|
||||
This PEP proposes the introduction of a new built-in function,
|
||||
text(), that provides a way of generating a string representation
|
||||
of an object without forcing the result to be a particular string
|
||||
type. In addition, the behavior of the %s format specifier would be
|
||||
changed to call text() on the argument. These two changes would
|
||||
make it easier to write library code that can be used by
|
||||
applications that use only the str type and by others that also use
|
||||
the unicode type.
|
||||
This PEP proposes to change the str() built-in function so that it
|
||||
can return unicode strings. This change would make it easier to
|
||||
write code that works with either string type and would also make
|
||||
some existing code handle unicode strings. The C function
|
||||
PyObject_Str() would remain unchanged and the function
|
||||
PyString_New() would be added instead.
|
||||
|
||||
|
||||
Rationale
|
||||
|
@ -64,51 +62,35 @@ Rationale
|
|||
object; an operation traditionally accomplished by using the str()
|
||||
built-in function.
|
||||
|
||||
Using str() makes the code not Unicode-safe. Replacing a str()
|
||||
call with a unicode() call makes the code not str-stable. Using a
|
||||
string format almost accomplishes the goal but not quite.
|
||||
Consider the following code:
|
||||
|
||||
def text(obj):
|
||||
return '%s' % obj
|
||||
|
||||
It behaves as desired except if 'obj' is not a basestring instance
|
||||
and needs to return a Unicode representation of itself. In that
|
||||
case, the string format will attempt to coerce the result of
|
||||
__str__ to a str instance. Defining a __unicode__ method does not
|
||||
help since it will only be called if the right-hand operand is a
|
||||
unicode instance. Using a unicode instance for the right-hand
|
||||
operand does not work because the function is no longer str-stable
|
||||
(i.e. it will coerce everything to unicode).
|
||||
Using the current str() function makes the code not Unicode-safe.
|
||||
Replacing a str() call with a unicode() call makes the code not
|
||||
str-stable. Changing str() so that it could return unicode
|
||||
instances would solve this problem. As a further benefit, some code
|
||||
that is currently not Unicode-safe because it uses str() would
|
||||
become Unicode-safe.
|
||||
|
||||
|
||||
Specification
|
||||
|
||||
A Python implementation of the text() built-in follows:
|
||||
A Python implementation of the str() built-in follows:
|
||||
|
||||
def text(s):
|
||||
def str(s):
|
||||
"""Return a nice string representation of the object. The
|
||||
return value is a basestring instance.
|
||||
return value is a str or unicode instance.
|
||||
"""
|
||||
if isinstance(s, basestring):
|
||||
if type(s) is str or type(s) is unicode:
|
||||
return s
|
||||
r = s.__str__()
|
||||
if not isinstance(r, basestring):
|
||||
if not isinstance(r, (str, unicode)):
|
||||
raise TypeError('__str__ returned non-string')
|
||||
return r
|
||||
|
||||
Note that it is currently possible, although not very useful, to
|
||||
write __str__ methods that return unicode instances.
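The proposed str() behaviour can be sketched in runnable form. This is an
illustration only, not the reference implementation: it assumes Python 3,
where bytes stands in for the Python 2 str type and str stands in for the
Python 2 unicode type. The names proposed_str and Label are hypothetical.

```python
def proposed_str(obj):
    """Model of the proposed str(): return a bytes or str instance.

    Assumption: Python 3 bytes plays the role of the Python 2 str type,
    and Python 3 str plays the role of the Python 2 unicode type.
    """
    # Exact string types pass through untouched, so neither kind of
    # string is coerced into the other.
    if type(obj) is bytes or type(obj) is str:
        return obj
    r = obj.__str__()
    # Either string type is an acceptable result; anything else is an error.
    if not isinstance(r, (bytes, str)):
        raise TypeError('__str__ returned non-string')
    return r


class Label(object):
    """Hypothetical object whose text representation is non-ASCII."""

    def __str__(self):
        return u'caf\xe9'


print(type(proposed_str(b'abc')).__name__)
print(type(proposed_str(Label())).__name__)
print(proposed_str(42))
```

The point of the sketch is that a byte string argument comes back as a
byte string, while an object whose __str__ yields character data is no
longer forced through an encoding step.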
|
||||
|
||||
The %s format specifier for str objects would be changed to call
|
||||
text() on the argument. Currently it calls str() unless the
|
||||
argument is a unicode instance (in which case the object is
|
||||
substituted as is and the % operation returns a unicode instance).
|
||||
|
||||
The following function would be added to the C API and would be the
|
||||
equivalent of the text() function:
|
||||
equivalent to the str() built-in (ideally it would be called PyObject_Str,
|
||||
but changing that function could cause a massive number of
|
||||
compatibility problems):
|
||||
|
||||
PyObject *PyObject_Text(PyObject *o);
|
||||
PyObject *PyString_New(PyObject *);
|
||||
|
||||
A reference implementation is available on Sourceforge [1] as a
|
||||
patch.
|
||||
|
@ -116,52 +98,36 @@ Specification
|
|||
|
||||
Backwards Compatibility
|
||||
|
||||
The change to the %s format specifier would result in some %
|
||||
operations returning a unicode instance rather than raising a
|
||||
UnicodeDecodeError exception. It seems unlikely that the change
|
||||
would break currently working code.
|
||||
Some code may require that str() returns a str instance. In the
|
||||
standard library, only one such case has been found so far. The
|
||||
function email.header_decode() requires a str instance and the
|
||||
email.Header.decode_header() function tries to ensure this by
|
||||
calling str() on its argument. The code was fixed by changing
|
||||
the line "header = str(header)" to:
|
||||
|
||||
if isinstance(header, unicode):
|
||||
header = header.encode('ascii')
|
||||
|
||||
Whether this is truly a bug is questionable since decode_header()
|
||||
really operates on byte strings, not character strings. Code that
|
||||
passes it a unicode instance could itself be considered buggy.
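The shape of that fix can be sketched as a standalone helper. This is not
the email package's code: it assumes Python 3, where str stands in for the
Python 2 unicode type and bytes for the Python 2 str type, and the name
ensure_byte_header is hypothetical.

```python
def ensure_byte_header(header):
    """Return the header as a byte string, encoding only when needed.

    Assumption: Python 3 str plays the role of the Python 2 unicode type.
    """
    # Character strings are encoded explicitly rather than relying on
    # str() to coerce them; byte strings pass through unchanged.
    if isinstance(header, str):
        header = header.encode('ascii')
    return header


print(ensure_byte_header(u'Subject'))
print(ensure_byte_header(b'Subject'))
```

A header containing non-ASCII characters would raise UnicodeEncodeError
here, which makes the failure explicit at the point where a byte string is
actually required.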
|
||||
|
||||
|
||||
Alternative Solutions
|
||||
|
||||
Rather than adding the text() built-in, if PEP 246 were
|
||||
implemented then adapt(s, basestring) could be equivalent to
|
||||
text(s). The advantage would be one less built-in function. The
|
||||
problem is that PEP 246 is not implemented.
|
||||
A new built-in function could be added instead of changing str().
|
||||
Doing so would introduce virtually no backwards compatibility
|
||||
problems. However, since the compatibility problems are expected to be
|
||||
rare, changing str() seems preferable to adding a new built-in.
|
||||
|
||||
Fredrik Lundh has suggested [2] that perhaps a new slot should be
|
||||
added (e.g. __text__), that could return any kind of string that's
|
||||
compatible with Python's text model. That seems like an
|
||||
attractive idea but many details would still need to be worked
|
||||
out.
|
||||
|
||||
Instead of providing the text() built-in, the %s format specifier
|
||||
could be changed and a string format could be used instead of
|
||||
calling text(). However, it seems like the operation is important
|
||||
enough to justify a built-in.
|
||||
|
||||
Instead of providing the text() built-in, the basestring type
|
||||
could be changed to provide the same functionality. That would
|
||||
possibly be confusing behaviour for an abstract base type.
|
||||
|
||||
Some people have suggested [3] that an easier migration path would
|
||||
be to change the default encoding to be UTF-8. Code that is not
|
||||
Unicode safe would then encode Unicode strings as UTF-8 and
|
||||
operate on them as str instances, rather than raising a
|
||||
UnicodeDecodeError exception. Other code would assume that str
|
||||
instances were encoded using UTF-8 and decode them if necessary.
|
||||
While that solution may work for some applications, it seems
|
||||
unsuitable as a general solution. For example, some applications
|
||||
get string data from many different sources and assuming that all
|
||||
str instances were encoded using UTF-8 could easily introduce
|
||||
subtle bugs.
|
||||
The basestring type could be changed to have the proposed behaviour,
|
||||
rather than changing str(). However, that would be confusing
|
||||
behaviour for an abstract base type.
|
||||
|
||||
|
||||
References
|
||||
|
||||
[1] http://www.python.org/sf/1159501
|
||||
[2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
|
||||
[3] http://blog.ianbicking.org/illusive-setdefaultencoding.html
|
||||
[1] http://www.python.org/sf/1266570
|
||||
|
||||
|
||||
Copyright
|
||||
|