PEP: 349
Title: Allow str() to return unicode strings
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Deferred
Type: Standards Track
Content-Type: text/plain
Created: 02-Aug-2005
Post-History: 06-Aug-2005
Python-Version: 2.5


Abstract

This PEP proposes to change the str() built-in function so that it
can return unicode strings. This change would make it easier to
write code that works with either string type and would also make
some existing code handle unicode strings. The C function
PyObject_Str() would remain unchanged and the function
PyString_New() would be added instead.


Rationale

Python has had a Unicode string type for some time now but use of
it is not yet widespread. There is a large amount of Python code
that assumes that string data is represented as str instances.
The long term plan for Python is to phase out the str type and use
unicode for all string data. Clearly, a smooth migration path
must be provided.

Existing libraries, written for str instances, need to be upgraded
so that they can operate in an all-unicode string world. We can't
change to an all-unicode world until all essential libraries are
capable of it. Upgrading the libraries in one shot does not seem
feasible. A more realistic strategy is to make the libraries
capable of operating on unicode strings one at a time, while
preserving their current behaviour in an all-str environment.

First, we need to be able to write code that can accept unicode
instances without attempting to coerce them to str instances. Let
us label such code as Unicode-safe. Unicode-safe libraries can be
used in an all-unicode world.

Second, we need to be able to write code that, when provided only
str instances, will not create unicode results. Let us label such
code as str-stable. Libraries that are str-stable can be used by
libraries and applications that are not yet Unicode-safe.

Sometimes it is simple to write code that is both str-stable and
Unicode-safe. For example, the following function just works:

    def appendx(s):
        return s + 'x'

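As a quick check of both properties, the sketch below calls appendx()
with each string type and inspects the type of the result. It assumes
Python 2 semantics. The try/except alias for unicode is not part of
this PEP; it is an assumption added only so that the snippet can also
be loaded on an interpreter that has no separate unicode type.

```python
# Assumption: Python 2 semantics.  The alias below is not part of the
# PEP; it only lets the sketch load where no unicode type exists.
try:
    unicode
except NameError:
    unicode = str


def appendx(s):
    return s + 'x'


# str in, str out: the function is str-stable.
str_result = appendx('abc')
assert type(str_result) is str

# unicode in, unicode out: the function is Unicode-safe.
unicode_result = appendx(u'abc')
assert type(unicode_result) is unicode
```

The function returns a str for str input and a unicode instance for
unicode input, without any explicit type checks.
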
That's not too surprising since the unicode type is designed to
make the task easier. The principle is that when str and unicode
instances meet, the result is a unicode instance. One notable
difficulty arises when code requires a string representation of an
object, an operation traditionally accomplished by using the str()
built-in function.

Using the current str() function makes the code not Unicode-safe,
since it coerces unicode instances to str. Replacing a str() call
with a unicode() call makes the code not str-stable, since the
result is then always a unicode instance. Changing str() so that
it could return unicode instances would solve this problem. As a
further benefit, some code that is currently not Unicode-safe
because it uses str() would become Unicode-safe.


Specification

A Python implementation of the str() built-in follows:

    def str(s):
        """Return a nice string representation of the object. The
        return value is a str or unicode instance.
        """
        if type(s) is str or type(s) is unicode:
            return s
        r = s.__str__()
        if not isinstance(r, (str, unicode)):
            raise TypeError('__str__ returned non-string')
        return r

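To illustrate the specification, the sketch below repeats the logic
above under the hypothetical name proposed_str (so that the real
built-in is not shadowed) and applies it to a few objects. The classes
Name and Broken are made up for this example, and the unicode alias is
an assumption for interpreters without a separate unicode type.

```python
# Assumption: Python 2 semantics; the alias is not part of the PEP.
try:
    unicode
except NameError:
    unicode = str


def proposed_str(s):
    """Hypothetical stand-in for the proposed str() built-in."""
    # Exact str and unicode instances are returned unchanged.
    if type(s) is str or type(s) is unicode:
        return s
    r = s.__str__()
    # Either string type is accepted as the result of __str__.
    if not isinstance(r, (str, unicode)):
        raise TypeError('__str__ returned non-string')
    return r


class Name(object):
    """Made-up class whose __str__ returns whatever text it holds."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


class Broken(object):
    """Made-up class whose __str__ returns a non-string."""
    def __str__(self):
        return 42


same = u'abc'
passed_through = proposed_str(same)         # returned as-is
int_text = proposed_str(3)                  # int.__str__ gives a str
name_text = proposed_str(Name(u'caf\xe9'))  # unicode result is kept

try:
    proposed_str(Broken())
    raised = False
except TypeError:
    raised = True
```

Under Python 2, the current built-in would instead try to encode the
unicode value returned by Name.__str__ using the default encoding,
which typically fails for the non-ASCII character.
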
The following function would be added to the C API and would be
equivalent to the str() built-in (ideally it would be called
PyObject_Str, but changing that function could cause a massive
number of compatibility problems):

    PyObject *PyString_New(PyObject *);

A reference implementation is available on Sourceforge [1] as a
patch.


Backwards Compatibility

Some code may require that str() returns a str instance. In the
standard library, only one such case has been found so far. The
function email.header_decode() requires a str instance and the
email.Header.decode_header() function tries to ensure this by
calling str() on its argument. The code was fixed by changing
the line "header = str(header)" to:

    if isinstance(header, unicode):
        header = header.encode('ascii')

Whether this is truly a bug is questionable since decode_header()
really operates on byte strings, not character strings. Code that
passes it a unicode instance could itself be considered buggy.

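The effect of that fix can be sketched in isolation. The helper name
to_byte_header below is hypothetical and is not part of the email
package; it only wraps the two replacement lines so that their
behaviour can be observed. The unicode alias is an assumption for
interpreters without a separate unicode type.

```python
# Assumption: Python 2 semantics; the alias is not part of the PEP.
try:
    unicode
except NameError:
    unicode = str


def to_byte_header(header):
    """Hypothetical helper wrapping the two-line fix shown above."""
    if isinstance(header, unicode):
        header = header.encode('ascii')
    return header


# ASCII-only unicode input is converted to a byte string.
ascii_header = to_byte_header(u'Subject: hello')

# Non-ASCII unicode input cannot be encoded and is rejected.
try:
    to_byte_header(u'Subject: caf\xe9')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

ASCII-only unicode input comes back as a byte string, while non-ASCII
input raises UnicodeEncodeError at the point of the call.
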

Alternative Solutions

A new built-in function could be added instead of changing str().
Doing so would introduce virtually no backwards compatibility
problems. However, since the compatibility problems are expected to
be rare, changing str() seems preferable to adding a new built-in.

The basestring type could be changed to have the proposed behaviour,
rather than changing str(). However, that would be confusing
behaviour for an abstract base type.


References

[1] http://www.python.org/sf/1266570


Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: