2005-08-04 22:59:00 -04:00
PEP: 349
Title: Generalised String Coercion
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 02-Aug-2005
Post-History:
Python-Version: 2.5


Abstract

This PEP proposes the introduction of a new built-in function,
text(), that provides a way of generating a string representation
of an object without forcing the result to be a particular string
type. In addition, the behaviour of the %s format specifier would
be changed to call text() on the argument. These two changes would
make it easier to write library code that can be used by
applications that use only the str type and by others that also use
the unicode type.


Rationale

Python has had a Unicode string type for some time now but use of
it is not yet widespread. There is a large amount of Python code
that assumes that string data is represented as str instances.
The long term plan for Python is to phase out the str type and use
unicode for all string data. Clearly, a smooth migration path
must be provided.

We need to upgrade existing libraries, written for str instances,
so that they are capable of operating in an all-unicode string
world. We can't change to an all-unicode world until all essential
libraries are capable of it. Upgrading the libraries in one shot
does not seem feasible. A more realistic strategy is to make the
libraries capable of operating on unicode strings one at a time
while preserving their current behaviour in an all-str environment.

First, we need to be able to write code that can accept unicode
instances without attempting to coerce them to str instances. Let
us label such code as Unicode-safe. Unicode-safe libraries can be
used in an all-unicode world.

Second, we need to be able to write code that, when provided only
str instances, will not create unicode results. Let us label such
code as str-stable. Libraries that are str-stable can be used by
libraries and applications that are not yet Unicode-safe.

Sometimes it is simple to write code that is both str-stable and
Unicode-safe. For example, the following function just works:

    def appendx(s):
        return s + 'x'

That's not too surprising since the unicode type is designed to
make the task easier. The principle is that when str and unicode
instances meet, the result is a unicode instance. One notable
difficulty arises when code requires a string representation of an
object, an operation traditionally accomplished by using the str()
built-in function.

Using str() makes the code not Unicode-safe. Replacing a str()
call with a unicode() call makes the code not str-stable. Using a
string format almost accomplishes the goal but not quite.
Consider the following code:

    def text(obj):
        return '%s' % obj

It behaves as desired except if 'obj' is not a basestring instance
and needs to return a Unicode representation of itself. In that
case, the string format will attempt to coerce the result of
__str__ to a str instance. Defining a __unicode__ method does not
help since it will only be called if the left-hand operand (the
format string) is a unicode instance. Using a unicode instance for
the format string does not work because the function is no longer
str-stable (i.e. it will coerce everything to unicode).


Specification

A Python implementation of the text() built-in follows:

    def text(s):
        """Return a nice string representation of the object. The
        return value is a basestring instance.
        """
        # str and unicode instances are returned unchanged.
        if isinstance(s, basestring):
            return s
        # Call __str__ directly so that a unicode result is not
        # coerced to str.
        r = s.__str__()
        if not isinstance(r, basestring):
            raise TypeError('__str__ returned non-string')
        return r

Note that it is currently possible, although not very useful, to
write __str__ methods that return unicode instances.
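
As an illustration only, the following sketch exercises a copy of
the text() function on the three kinds of input it distinguishes: a
string, an object whose __str__ returns a string, and an object
whose __str__ returns something else. The Point and Broken classes
and the string_types alias are hypothetical names introduced for
this example; the alias is an assumption that lets the same code
run on interpreters that lack basestring.

```python
try:
    string_types = basestring   # Python 2: base of str and unicode
except NameError:
    string_types = str          # assumption: single text type otherwise


def text(s):
    """Copy of the text() sketch, using the string_types alias."""
    if isinstance(s, string_types):
        return s
    r = s.__str__()
    if not isinstance(r, string_types):
        raise TypeError('__str__ returned non-string')
    return r


class Point(object):
    """Hypothetical class with an ordinary __str__ method."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __str__(self):
        return '(%d, %d)' % (self.x, self.y)


class Broken(object):
    """Hypothetical class whose __str__ returns a non-string."""
    def __str__(self):
        return 42


plain = 'abc'
wide = u'abc'
same_plain = text(plain)        # strings pass through unchanged
same_wide = text(wide)
point_text = text(Point(1, 2))  # falls back to __str__

try:
    text(Broken())
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)
```

The point of interest is that a string argument comes back as the
very same object, so a str stays a str and a unicode stays a
unicode.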

The %s format specifier for str objects would be changed to call
text() on the argument. Currently it calls str() unless the
argument is a unicode instance (in which case the object is
substituted as is and the % operation returns a unicode instance).
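
The proposed behaviour can probably be approximated in existing
code by applying text() to the argument before formatting, since
the % operation already returns a unicode instance when it is
handed one. The sketch below assumes exactly that; the helper
format_s and the Name class are hypothetical names that are not
part of this proposal, and format_s handles only a template
containing a single %s.

```python
try:
    string_types = basestring   # Python 2: base of str and unicode
except NameError:
    string_types = str          # assumption: single text type otherwise


def text(s):
    """Copy of the text() sketch, using the string_types alias."""
    if isinstance(s, string_types):
        return s
    r = s.__str__()
    if not isinstance(r, string_types):
        raise TypeError('__str__ returned non-string')
    return r


def format_s(template, arg):
    """Hypothetical helper: format one %s the way the PEP proposes."""
    return template % (text(arg),)


class Name(object):
    """Hypothetical class whose __str__ returns what it was given."""
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value


ascii_result = format_s('Hello, %s!', Name('world'))
wide_result = format_s('Hello, %s!', Name(u'w\xf6rld'))
```

With a str-returning __str__ the result has the type of the
template; with a unicode-returning __str__ the result is a unicode
instance instead of an encoding error.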

The following function would be added to the C API and would be the
equivalent of the text() function:

    PyObject *PyObject_Text(PyObject *o);

A reference implementation is available on SourceForge [1] as a
patch.


Backwards Compatibility

The change to the %s format specifier would result in some %
operations returning a unicode instance rather than raising a
UnicodeDecodeError exception. It seems unlikely that the change
would break currently working code.


Alternative Solutions

Rather than adding the text() built-in, if PEP 246 were
implemented then adapt(s, basestring) could be equivalent to
text(s). The advantage would be one fewer built-in function. The
problem is that PEP 246 is not implemented.

Fredrik Lundh has suggested [2] that perhaps a new slot should be
added (e.g. __text__), that could return any kind of string that's
compatible with Python's text model. That seems like an
attractive idea but many details would still need to be worked
out.

Instead of providing the text() built-in, the %s format specifier
could be changed and a string format could be used instead of
calling text(). However, it seems like the operation is important
enough to justify a built-in.

Instead of providing the text() built-in, the basestring type
could be changed to provide the same functionality. That would
possibly be confusing behaviour for an abstract base type.

Some people have suggested [3] that an easier migration path would
be to change the default encoding to be UTF-8. Code that is not
Unicode-safe would then encode Unicode strings as UTF-8 and
operate on them as str instances, rather than raising a
UnicodeDecodeError exception. Other code would assume that str
instances were encoded using UTF-8 and decode them if necessary.
While that solution may work for some applications, it seems
unsuitable as a general solution. For example, some applications
get string data from many different sources and assuming that all
str instances were encoded using UTF-8 could easily introduce
subtle bugs.
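
A short experiment suggests why a blanket UTF-8 assumption is
fragile. The sample word, the helper decodes_as_utf8 and the
variable names are illustrative assumptions only. The sketch
encodes one word in two ways and then treats the bytes as if the
encoding were known.

```python
original = u'caf\xe9'                       # four characters
latin1_bytes = original.encode('latin-1')   # e.g. from a legacy source
utf8_bytes = original.encode('utf-8')


def decodes_as_utf8(data):
    """Hypothetical helper: report whether data is valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# Latin-1 bytes from another source are not valid UTF-8.
latin1_is_valid_utf8 = decodes_as_utf8(latin1_bytes)

# The opposite mistake does not fail; it silently yields wrong text.
mojibake = utf8_bytes.decode('latin-1')

# Byte-oriented code sees a different length than the text has,
# so slicing can cut a character in half.
char_count = len(original)
byte_count = len(utf8_bytes)
truncated_is_valid_utf8 = decodes_as_utf8(utf8_bytes[:4])
```

The first mistake at least fails loudly. The other two pass
without any exception, which is the kind of subtle bug the
paragraph above warns about.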


References

[1] http://www.python.org/sf/1159501
[2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
[3] http://blog.ianbicking.org/illusive-setdefaultencoding.html


Copyright

This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: