diff --git a/pep-0100.txt b/pep-0100.txt index 1117c614f..f93b78292 100644 --- a/pep-0100.txt +++ b/pep-0100.txt @@ -5,117 +5,122 @@ Last-Modified: $Date$ Author: mal@lemburg.com (Marc-André Lemburg) Status: Final Type: Standards Track +Content-Type: text/x-rst Created: 10-Mar-2000 Python-Version: 2.0 Post-History: Historical Note +=============== - This document was first written by Marc-Andre in the pre-PEP days, - and was originally distributed as Misc/unicode.txt in Python - distributions up to and included Python 2.1. The last revision of - the proposal in that location was labeled version 1.7 (CVS - revision 3.10). Because the document clearly serves the purpose - of an informational PEP in the post-PEP era, it has been moved - here and reformatted to comply with PEP guidelines. Future - revisions will be made to this document, while Misc/unicode.txt - will contain a pointer to this PEP. +This document was first written by Marc-Andre in the pre-PEP days, +and was originally distributed as Misc/unicode.txt in Python +distributions up to and including Python 2.1. The last revision of +the proposal in that location was labeled version 1.7 (CVS +revision 3.10). Because the document clearly serves the purpose +of an informational PEP in the post-PEP era, it has been moved +here and reformatted to comply with PEP guidelines. Future +revisions will be made to this document, while Misc/unicode.txt +will contain a pointer to this PEP. - -Barry Warsaw, PEP editor +-Barry Warsaw, PEP editor Introduction +============ - The idea of this proposal is to add native Unicode 3.0 support to - Python in a way that makes use of Unicode strings as simple as - possible without introducing too many pitfalls along the way. +The idea of this proposal is to add native Unicode 3.0 support to +Python in a way that makes use of Unicode strings as simple as +possible without introducing too many pitfalls along the way. - Since this goal is not easy to achieve -- strings being one of the - most fundamental objects in Python -- we expect this proposal to - undergo some significant refinements. +Since this goal is not easy to achieve -- strings being one of the +most fundamental objects in Python -- we expect this proposal to +undergo some significant refinements. - Note that the current version of this proposal is still a bit - unsorted due to the many different aspects of the Unicode-Python - integration. +Note that the current version of this proposal is still a bit +unsorted due to the many different aspects of the Unicode-Python +integration. - The latest version of this document is always available at: +The latest version of this document is always available at: +http://starship.python.net/~lemburg/unicode-proposal.txt - http://starship.python.net/~lemburg/unicode-proposal.txt +Older versions are available as: +http://starship.python.net/~lemburg/unicode-proposal-X.X.txt - Older versions are available as: - - http://starship.python.net/~lemburg/unicode-proposal-X.X.txt - - [ed. note: new revisions should be made to this PEP document, - while the historical record previous to version 1.7 should be - retrieved from MAL's url, or Misc/unicode.txt] +[ed. 
note: new revisions should be made to this PEP document, +while the historical record previous to version 1.7 should be +retrieved from MAL's url, or Misc/unicode.txt] Conventions +=========== - - In examples we use u = Unicode object and s = Python string +- In examples we use u = Unicode object and s = Python string - - 'XXX' markings indicate points of discussion (PODs) +- 'XXX' markings indicate points of discussion (PODs) General Remarks +=============== - - Unicode encoding names should be lower case on output and - case-insensitive on input (they will be converted to lower case - by all APIs taking an encoding name as input). +- Unicode encoding names should be lower case on output and + case-insensitive on input (they will be converted to lower case + by all APIs taking an encoding name as input). - - Encoding names should follow the name conventions as used by the - Unicode Consortium: spaces are converted to hyphens, e.g. 'utf - 16' is written as 'utf-16'. +- Encoding names should follow the name conventions as used by the + Unicode Consortium: spaces are converted to hyphens, e.g. 'utf + 16' is written as 'utf-16'. - - Codec modules should use the same names, but with hyphens - converted to underscores, e.g. utf_8, utf_16, iso_8859_1. +- Codec modules should use the same names, but with hyphens + converted to underscores, e.g. ``utf_8``, ``utf_16``, ``iso_8859_1``. Unicode Default Encoding +======================== - The Unicode implementation has to make some assumption about the - encoding of 8-bit strings passed to it for coercion and about the - encoding to use as default for conversion of Unicode to strings when - no specific encoding is given. This encoding is called <default encoding> throughout this text. +The Unicode implementation has to make some assumption about the +encoding of 8-bit strings passed to it for coercion and about the +encoding to use as default for conversion of Unicode to strings when +no specific encoding is given. This encoding is called <default encoding> throughout this text. - For this, the implementation maintains a global which can be set - in the site.py Python startup script. Subsequent changes are not - possible. The <default encoding> can be set and queried using the - two sys module APIs: +For this, the implementation maintains a global which can be set +in the site.py Python startup script. Subsequent changes are not +possible. The <default encoding> can be set and queried using the +two sys module APIs: - sys.setdefaultencoding(encoding) - --> Sets the <default encoding> used by the Unicode implementation. - encoding has to be an encoding which is supported by the - Python installation, otherwise, a LookupError is raised. +``sys.setdefaultencoding(encoding)`` + Sets the <default encoding> used by the Unicode implementation. + encoding has to be an encoding which is supported by the + Python installation, otherwise, a ``LookupError`` is raised. - Note: This API is only available in site.py! It is - removed from the sys module by site.py after usage. + Note: This API is only available in site.py! It is + removed from the sys module by site.py after usage. - sys.getdefaultencoding() - --> Returns the current <default encoding>. +``sys.getdefaultencoding()`` + Returns the current <default encoding>. - If not otherwise defined or set, the <default encoding> defaults - to 'ascii'. This encoding is also the startup default of Python - (and in effect before site.py is executed). +If not otherwise defined or set, the <default encoding> defaults +to 'ascii'. This encoding is also the startup default of Python +(and in effect before site.py is executed). 
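+For illustration, a sitecustomize.py module along the lines below
+could derive the <default encoding> from the locale, much like the
+disabled site.py code described in the next paragraph (a sketch, not
+part of the proposal; remember that sys.setdefaultencoding() is
+removed from sys once site.py has finished)::
+
+    # sitecustomize.py -- hypothetical example
+    import sys, locale
+
+    # Derive a candidate encoding from the OS locale settings.
+    candidate = locale.getdefaultlocale()[1] or 'ascii'
+    try:
+        u''.encode(candidate)    # probe: is a codec available for it?
+    except LookupError:
+        candidate = 'ascii'      # unknown or unsupported: fall back
+    sys.setdefaultencoding(candidate)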
- Note that the default site.py startup module contains disabled - optional code which can set the <default encoding> according to - the encoding defined by the current locale. The locale module is - used to extract the encoding from the locale default settings - defined by the OS environment (see locale.py). If the encoding - cannot be determined, is unknown or unsupported, the code defaults - to setting the <default encoding> to 'ascii'. To enable this - code, edit the site.py file or place the appropriate code into the - sitecustomize.py module of your Python installation. +Note that the default site.py startup module contains disabled +optional code which can set the <default encoding> according to +the encoding defined by the current locale. The locale module is +used to extract the encoding from the locale default settings +defined by the OS environment (see locale.py). If the encoding +cannot be determined, is unknown or unsupported, the code defaults +to setting the <default encoding> to 'ascii'. To enable this +code, edit the site.py file or place the appropriate code into the +sitecustomize.py module of your Python installation. Unicode Constructors +==================== - Python should provide a built-in constructor for Unicode strings - which is available through __builtins__: +Python should provide a built-in constructor for Unicode strings +which is available through ``__builtins__``:: u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"]) @@ -123,249 +128,266 @@ Unicode Constructors u = ur'<raw-unicode-escape encoded Python string>' - With the 'unicode-escape' encoding being defined as: +With the 'unicode-escape' encoding being defined as: - - all non-escape characters represent themselves as Unicode - ordinal (e.g. 'a' -> U+0061). +- all non-escape characters represent themselves as Unicode + ordinal (e.g. 'a' -> U+0061). - - all existing defined Python escape sequences are interpreted as - Unicode ordinals; note that \xXXXX can represent all Unicode - ordinals, and \OOO (octal) can represent Unicode ordinals up to - U+01FF. +- all existing defined Python escape sequences are interpreted as + Unicode ordinals; note that ``\xXXXX`` can represent all Unicode + ordinals, and ``\OOO`` (octal) can represent Unicode ordinals up to + U+01FF. - - a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax - error to have fewer than 4 digits after \u. +- a new escape sequence, ``\uXXXX``, represents U+XXXX; it is a syntax + error to have fewer than 4 digits after ``\u``. - For an explanation of possible values for errors see the Codec - section below. +For an explanation of possible values for errors see the Codec +section below. - Examples: +Examples:: - u'abc' -> U+0061 U+0062 U+0063 - u'\u1234' -> U+1234 - u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+005c + u'abc' -> U+0061 U+0062 U+0063 + u'\u1234' -> U+1234 + u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000a - The 'raw-unicode-escape' encoding is defined as follows: +The 'raw-unicode-escape' encoding is defined as follows: - - \uXXXX sequence represent the U+XXXX Unicode character if and - only if the number of leading backslashes is odd +- ``\uXXXX`` sequence represents the U+XXXX Unicode character if and + only if the number of leading backslashes is odd - - all other characters represent themselves as Unicode ordinal - (e.g. 'b' -> U+0062) +- all other characters represent themselves as Unicode ordinal + (e.g. 'b' -> U+0062) - Note that you should provide some hint to the encoding you used to - write your programs as pragma line in one the first few comment - lines of the source file (e.g. '# source file encoding: latin-1'). 
- If you only use 7-bit ASCII then everything is fine and no such - notice is needed, but if you include Latin-1 characters not - defined in ASCII, it may well be worthwhile including a hint since - people in other countries will want to be able to read your source - strings too. +Note that you should provide some hint to the encoding you used to +write your programs as a pragma line in one of the first few comment +lines of the source file (e.g. '# source file encoding: latin-1'). +If you only use 7-bit ASCII then everything is fine and no such +notice is needed, but if you include Latin-1 characters not +defined in ASCII, it may well be worthwhile including a hint since +people in other countries will want to be able to read your source +strings too. Unicode Type Object +=================== - Unicode objects should have the type UnicodeType with type name - 'unicode', made available through the standard types module. +Unicode objects should have the type UnicodeType with type name +'unicode', made available through the standard types module. Unicode Output +============== - Unicode objects have a method .encode([encoding=<default encoding>]) - which returns a Python string encoding the Unicode string using the - given scheme (see Codecs). +Unicode objects have a method .encode([encoding=<default encoding>]) +which returns a Python string encoding the Unicode string using the +given scheme (see Codecs). - print u := print u.encode() # using the <default encoding> +:: - str(u) := u.encode() # using the <default encoding> + print u := print u.encode() # using the <default encoding> - repr(u) := "u%s" % repr(u.encode('unicode-escape')) + str(u) := u.encode() # using the <default encoding> - Also see Internal Argument Parsing and Buffer Interface for - details on how other APIs written in C will treat Unicode objects. + repr(u) := "u%s" % repr(u.encode('unicode-escape')) + +Also see Internal Argument Parsing and Buffer Interface for +details on how other APIs written in C will treat Unicode objects. Unicode Ordinals +================ - Since Unicode 3.0 has a 32-bit ordinal character set, the - implementation should provide 32-bit aware ordinal conversion - APIs: +Since Unicode 3.0 has a 32-bit ordinal character set, the +implementation should provide 32-bit aware ordinal conversion +APIs:: - ord(u[:1]) (this is the standard ord() extended to work with Unicode - objects) - --> Unicode ordinal number (32-bit) + ord(u[:1]) (this is the standard ord() extended to work with Unicode + objects) + --> Unicode ordinal number (32-bit) - unichr(i) - --> Unicode object for character i (provided it is 32-bit); - ValueError otherwise + unichr(i) + --> Unicode object for character i (provided it is 32-bit); + ValueError otherwise - Both APIs should go into __builtins__ just like their string - counterparts ord() and chr(). +Both APIs should go into ``__builtins__`` just like their string +counterparts ``ord()`` and ``chr()``. - Note that Unicode provides space for private encodings. Usage of - these can cause different output representations on different - machines. This problem is not a Python or Unicode problem, but a - machine setup and maintenance one. +Note that Unicode provides space for private encodings. Usage of +these can cause different output representations on different +machines. This problem is not a Python or Unicode problem, but a +machine setup and maintenance one. Comparison & Hash Value +======================= - Unicode objects should compare equal to other objects after these - other objects have been coerced to Unicode. 
For strings this - means that they are interpreted as Unicode string using the - <default encoding>. +Unicode objects should compare equal to other objects after these +other objects have been coerced to Unicode. For strings this +means that they are interpreted as a Unicode string using the +<default encoding>. - Unicode objects should return the same hash value as their ASCII - equivalent strings. Unicode strings holding non-ASCII values are - not guaranteed to return the same hash values as the default - encoded equivalent string representation. +Unicode objects should return the same hash value as their ASCII +equivalent strings. Unicode strings holding non-ASCII values are +not guaranteed to return the same hash values as the default +encoded equivalent string representation. - When compared using cmp() (or PyObject_Compare()) the - implementation should mask TypeErrors raised during the conversion - to remain in synch with the string behavior. All other errors - such as ValueErrors raised during coercion of strings to Unicode - should not be masked and passed through to the user. +When compared using ``cmp()`` (or ``PyObject_Compare()``) the +implementation should mask ``TypeErrors`` raised during the conversion +to remain in synch with the string behavior. All other errors +such as ``ValueErrors`` raised during coercion of strings to Unicode +should not be masked but passed through to the user. - In containment tests ('a' in u'abc' and u'a' in 'abc') both sides - should be coerced to Unicode before applying the test. Errors - occurring during coercion (e.g. None in u'abc') should not be - masked. +In containment tests ('a' in u'abc' and u'a' in 'abc') both sides +should be coerced to Unicode before applying the test. Errors +occurring during coercion (e.g. None in u'abc') should not be +masked. Coercion +======== - Using Python strings and Unicode objects to form new objects - should always coerce to the more precise format, i.e. Unicode - objects. +Using Python strings and Unicode objects to form new objects +should always coerce to the more precise format, i.e. Unicode +objects. - u + s := u + unicode(s) +:: - s + u := unicode(s) + u + u + s := u + unicode(s) - All string methods should delegate the call to an equivalent - Unicode object method call by converting all involved strings to - Unicode and then applying the arguments to the Unicode method of - the same name, e.g. + s + u := unicode(s) + u - string.join((s,u),sep) := (s + sep) + u +All string methods should delegate the call to an equivalent +Unicode object method call by converting all involved strings to +Unicode and then applying the arguments to the Unicode method of +the same name, e.g. - sep.join((s,u)) := (s + sep) + u +:: - For a discussion of %-formatting w/r to Unicode objects, see - Formatting Markers. + string.join((s,u),sep) := (s + sep) + u + + sep.join((s,u)) := (s + sep) + u + +For a discussion of %-formatting w/r to Unicode objects, see +Formatting Markers. Exceptions +========== - UnicodeError is defined in the exceptions module as a subclass of - ValueError. It is available at the C level via - PyExc_UnicodeError. All exceptions related to Unicode - encoding/decoding should be subclasses of UnicodeError. +``UnicodeError`` is defined in the exceptions module as a subclass of +``ValueError``. It is available at the C level via +``PyExc_UnicodeError``. All exceptions related to Unicode +encoding/decoding should be subclasses of ``UnicodeError``. 
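+To make the coercion and error rules above concrete, a hypothetical
+interactive session (assuming the <default encoding> is 'ascii'; the
+exact error message is illustrative)::
+
+    >>> u'abc' + 'def'           # str operand is coerced via unicode()
+    u'abcdef'
+    >>> 'a' in u'abc'            # both sides coerced before the test
+    1
+    >>> u'abc' + '\xe9'          # non-ASCII byte: coercion fails
+    Traceback (most recent call last):
+      ...
+    UnicodeError: ASCII decoding error: ordinal not in range(128)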
Codecs (Coder/Decoders) Lookup +============================== - A Codec (see Codec Interface Definition) search registry should be - implemented by a module "codecs": +A Codec (see Codec Interface Definition) search registry should be +implemented by a module "codecs":: - codecs.register(search_function) + codecs.register(search_function) - Search functions are expected to take one argument, the encoding - name in all lower case letters and with hyphens and spaces - converted to underscores, and return a tuple of functions - (encoder, decoder, stream_reader, stream_writer) taking the - following arguments: +Search functions are expected to take one argument, the encoding +name in all lower case letters and with hyphens and spaces +converted to underscores, and return a tuple of functions +(encoder, decoder, stream_reader, stream_writer) taking the +following arguments: - encoder and decoder: +encoder and decoder + These must be functions or methods which have the same + interface as the ``.encode``/``.decode`` methods of Codec instances + (see Codec Interface). The functions/methods are expected to + work in a stateless mode. - These must be functions or methods which have the same - interface as the .encode/.decode methods of Codec instances - (see Codec Interface). The functions/methods are expected to - work in a stateless mode. +stream_reader and stream_writer + These need to be factory functions with the following + interface:: - stream_reader and stream_writer: + factory(stream,errors='strict') - These need to be factory functions with the following - interface: + The factory functions must return objects providing the + interfaces defined by ``StreamWriter``/``StreamReader`` resp. (see + Codec Interface). Stream codecs can maintain state. - factory(stream,errors='strict') + Possible values for errors are defined in the Codec section + below. - The factory functions must return objects providing the - interfaces defined by StreamWriter/StreamReader resp. (see - Codec Interface). Stream codecs can maintain state. +In case a search function cannot find a given encoding, it should +return None. - Possible values for errors are defined in the Codec section - below. +Aliasing support for encodings is left to the search functions to +implement. - In case a search function cannot find a given encoding, it should - return None. +The codecs module will maintain an encoding cache for performance +reasons. Encodings are first looked up in the cache. If not +found, the list of registered search functions is scanned. If no +codecs tuple is found, a LookupError is raised. Otherwise, the +codecs tuple is stored in the cache and returned to the caller. - Aliasing support for encodings is left to the search functions to - implement. +To query the Codec instance the following API should be used:: - The codecs module will maintain an encoding cache for performance - reasons. Encodings are first looked up in the cache. If not - found, the list of registered search functions is scanned. If no - codecs tuple is found, a LookupError is raised. Otherwise, the - codecs tuple is stored in the cache and returned to the caller. + codecs.lookup(encoding) - To query the Codec instance the following API should be used: - - codecs.lookup(encoding) - - This will either return the found codecs tuple or raise a - LookupError. +This will either return the found codecs tuple or raise a +``LookupError``. 
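+As an illustration of the registry protocol, a search function for a
+hypothetical 'identity' codec might look as follows (a sketch only;
+all names are made up for the example)::
+
+    import codecs
+
+    def search(encoding):
+        # The registry passes the name lower-cased, with hyphens
+        # and spaces already converted to underscores.
+        if encoding != 'identity':
+            return None     # not handled here; try the next function
+
+        class Identity(codecs.Codec):
+            # Stateless: both methods return (output, items consumed).
+            def encode(self, input, errors='strict'):
+                return (str(input), len(input))
+            def decode(self, input, errors='strict'):
+                return (unicode(input), len(input))
+
+        class Reader(Identity, codecs.StreamReader):
+            pass
+
+        class Writer(Identity, codecs.StreamWriter):
+            pass
+
+        c = Identity()
+        return (c.encode, c.decode, Reader, Writer)
+
+    codecs.register(search)
+
+After registration, ``codecs.lookup('identity')`` returns the tuple
+above and caches it as described.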
Standard Codecs +=============== - Standard codecs should live inside an encodings/ package directory - in the Standard Python Code Library. The __init__.py file of that - directory should include a Codec Lookup compatible search function - implementing a lazy module based codec lookup. +Standard codecs should live inside an encodings/ package directory +in the Standard Python Code Library. The ``__init__.py`` file of that +directory should include a Codec Lookup compatible search function +implementing a lazy module based codec lookup. - Python should provide a few standard codecs for the most relevant - encodings, e.g. +Python should provide a few standard codecs for the most relevant +encodings, e.g. - 'utf-8': 8-bit variable length encoding - 'utf-16': 16-bit variable length encoding (little/big endian) - 'utf-16-le': utf-16 but explicitly little endian - 'utf-16-be': utf-16 but explicitly big endian - 'ascii': 7-bit ASCII codepage - 'iso-8859-1': ISO 8859-1 (Latin 1) codepage - 'unicode-escape': See Unicode Constructors for a definition - 'raw-unicode-escape': See Unicode Constructors for a definition - 'native': Dump of the Internal Format used by Python +:: - Common aliases should also be provided per default, e.g. - 'latin-1' for 'iso-8859-1'. + 'utf-8': 8-bit variable length encoding + 'utf-16': 16-bit variable length encoding (little/big endian) + 'utf-16-le': utf-16 but explicitly little endian + 'utf-16-be': utf-16 but explicitly big endian + 'ascii': 7-bit ASCII codepage + 'iso-8859-1': ISO 8859-1 (Latin 1) codepage + 'unicode-escape': See Unicode Constructors for a definition + 'raw-unicode-escape': See Unicode Constructors for a definition + 'native': Dump of the Internal Format used by Python - Note: 'utf-16' should be implemented by using and requiring byte - order marks (BOM) for file input/output. +Common aliases should also be provided per default, e.g. +'latin-1' for 'iso-8859-1'. - All other encodings such as the CJK ones to support Asian scripts - should be implemented in separate packages which do not get - included in the core Python distribution and are not a part of - this proposal. +Note: 'utf-16' should be implemented by using and requiring byte +order marks (BOM) for file input/output. + +All other encodings such as the CJK ones to support Asian scripts +should be implemented in separate packages which do not get +included in the core Python distribution and are not a part of +this proposal. Codecs Interface Definition +=========================== - The following base class should be defined in the module "codecs". - They provide not only templates for use by encoding module - implementors, but also define the interface which is expected by - the Unicode implementation. +The following base classes should be defined in the module "codecs". +They provide not only templates for use by encoding module +implementors, but also define the interface which is expected by +the Unicode implementation. - Note that the Codec Interface defined here is well suitable for a - larger range of applications. The Unicode implementation expects - Unicode objects on input for .encode() and .write() and character - buffer compatible objects on input for .decode(). Output of - .encode() and .read() should be a Python string and .decode() must - return an Unicode object. +Note that the Codec Interface defined here is well suited for a +larger range of applications. 
The Unicode implementation expects +Unicode objects on input for ``.encode()`` and ``.write()`` and character +buffer compatible objects on input for ``.decode()``. Output of +``.encode()`` and ``.read()`` should be a Python string and ``.decode()`` must +return an Unicode object. - First, we have the stateless encoders/decoders. These do not work - in chunks as the stream codecs (see below) do, because all - components are expected to be available in memory. +First, we have the stateless encoders/decoders. These do not work +in chunks as the stream codecs (see below) do, because all +components are expected to be available in memory. + +:: class Codec: @@ -415,13 +437,15 @@ Codecs Interface Definition """ - StreamWriter and StreamReader define the interface for stateful - encoders/decoders which work on streams. These allow processing - of the data in chunks to efficiently use memory. If you have - large strings in memory, you may want to wrap them with cStringIO - objects and then use these codecs on them to be able to do chunk - processing as well, e.g. to provide progress information to the - user. +``StreamWriter`` and ``StreamReader`` define the interface for stateful +encoders/decoders which work on streams. These allow processing +of the data in chunks to efficiently use memory. If you have +large strings in memory, you may want to wrap them with ``cStringIO`` +objects and then use these codecs on them to be able to do chunk +processing as well, e.g. to provide progress information to the +user. + +:: class StreamWriter(Codec): @@ -593,544 +617,641 @@ Codecs Interface Definition return getattr(self.stream,name) - Stream codec implementors are free to combine the StreamWriter and - StreamReader interfaces into one class. Even combining all these - with the Codec class should be possible. +Stream codec implementors are free to combine the ``StreamWriter`` and +``StreamReader`` interfaces into one class. Even combining all these +with the Codec class should be possible. - Implementors are free to add additional methods to enhance the - codec functionality or provide extra state information needed for - them to work. The internal codec implementation will only use the - above interfaces, though. +Implementors are free to add additional methods to enhance the +codec functionality or provide extra state information needed for +them to work. The internal codec implementation will only use the +above interfaces, though. - It is not required by the Unicode implementation to use these base - classes, only the interfaces must match; this allows writing - Codecs as extension types. +It is not required by the Unicode implementation to use these base +classes, only the interfaces must match; this allows writing +Codecs as extension types. - As guideline, large mapping tables should be implemented using - static C data in separate (shared) extension modules. That way - multiple processes can share the same data. +As a guideline, large mapping tables should be implemented using +static C data in separate (shared) extension modules. That way +multiple processes can share the same data. - A tool to auto-convert Unicode mapping files to mapping modules - should be provided to simplify support for additional mappings - (see References). +A tool to auto-convert Unicode mapping files to mapping modules +should be provided to simplify support for additional mappings +(see References). Whitespace +========== - The .split() method will have to know about what is considered - whitespace in Unicode. 
+The ``.split()`` method will have to know about what is considered +whitespace in Unicode. Case Conversion +=============== - Case conversion is rather complicated with Unicode data, since - there are many different conditions to respect. See +Case conversion is rather complicated with Unicode data, since +there are many different conditions to respect. See - http://www.unicode.org/unicode/reports/tr13/ + http://www.unicode.org/unicode/reports/tr13/ - for some guidelines on implementing case conversion. +for some guidelines on implementing case conversion. - For Python, we should only implement the 1-1 conversions included - in Unicode. Locale dependent and other special case conversions - (see the Unicode standard file SpecialCasing.txt) should be left - to user land routines and not go into the core interpreter. +For Python, we should only implement the 1-1 conversions included +in Unicode. Locale dependent and other special case conversions +(see the Unicode standard file SpecialCasing.txt) should be left +to user land routines and not go into the core interpreter. - The methods .capitalize() and .iscapitalized() should follow the - case mapping algorithm defined in the above technical report as - closely as possible. +The methods ``.capitalize()`` and ``.iscapitalized()`` should follow the +case mapping algorithm defined in the above technical report as +closely as possible. Line Breaks +=========== - Line breaking should be done for all Unicode characters having the - B property as well as the combinations CRLF, CR, LF (interpreted - in that order) and other special line separators defined by the - standard. +Line breaking should be done for all Unicode characters having the +B property as well as the combinations CRLF, CR, LF (interpreted +in that order) and other special line separators defined by the +standard. - The Unicode type should provide a .splitlines() method which - returns a list of lines according to the above specification. See - Unicode Methods. +The Unicode type should provide a ``.splitlines()`` method which +returns a list of lines according to the above specification. See +Unicode Methods. Unicode Character Properties +============================ - A separate module "unicodedata" should provide a compact interface - to all Unicode character properties defined in the standard's - UnicodeData.txt file. +A separate module "unicodedata" should provide a compact interface +to all Unicode character properties defined in the standard's +UnicodeData.txt file. - Among other things, these properties provide ways to recognize - numbers, digits, spaces, whitespace, etc. +Among other things, these properties provide ways to recognize +numbers, digits, spaces, whitespace, etc. - Since this module will have to provide access to all Unicode - characters, it will eventually have to contain the data from - UnicodeData.txt which takes up around 600kB. For this reason, the - data should be stored in static C data. This enables compilation - as shared module which the underlying OS can shared between - processes (unlike normal Python code modules). +Since this module will have to provide access to all Unicode +characters, it will eventually have to contain the data from +UnicodeData.txt which takes up around 600kB. For this reason, the +data should be stored in static C data. This enables compilation +as a shared module which the underlying OS can share between +processes (unlike normal Python code modules). 
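+By way of example, the kind of lookups such a module would support
+(function names follow the unicodedata module as it eventually
+shipped; treat the snippet as a sketch)::
+
+    import unicodedata
+
+    # Properties are taken from the UnicodeData.txt database.
+    unicodedata.category(u'A')        # 'Lu' -- letter, uppercase
+    unicodedata.numeric(u'\u00bd')    # 0.5  -- VULGAR FRACTION ONE HALF
+    unicodedata.decimal(u'9')         # 9
+    unicodedata.bidirectional(u' ')   # 'WS' -- whitespace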
- There should be a standard Python interface for accessing this - information so that other implementors can plug in their own - possibly enhanced versions, e.g. ones that do decompressing of the - data on-the-fly. +There should be a standard Python interface for accessing this +information so that other implementors can plug in their own +possibly enhanced versions, e.g. ones that do decompressing of the +data on-the-fly. Private Code Point Areas +======================== - Support for these is left to user land Codecs and not explicitly - integrated into the core. Note that due to the Internal Format - being implemented, only the area between \uE000 and \uF8FF is - usable for private encodings. +Support for these is left to user land Codecs and not explicitly +integrated into the core. Note that due to the Internal Format +being implemented, only the area between ``\uE000`` and ``\uF8FF`` is +usable for private encodings. Internal Format +=============== - The internal format for Unicode objects should use a Python - specific fixed format <PythonUnicode> implemented as 'unsigned - short' (or another unsigned numeric type having 16 bits). Byte - order is platform dependent. +The internal format for Unicode objects should use a Python +specific fixed format <PythonUnicode> implemented as 'unsigned +short' (or another unsigned numeric type having 16 bits). Byte +order is platform dependent. - This format will hold UTF-16 encodings of the corresponding - Unicode ordinals. The Python Unicode implementation will address - these values as if they were UCS-2 values. UCS-2 and UTF-16 are - the same for all currently defined Unicode character points. - UTF-16 without surrogates provides access to about 64k characters - and covers all characters in the Basic Multilingual Plane (BMP) of - Unicode. +This format will hold UTF-16 encodings of the corresponding +Unicode ordinals. The Python Unicode implementation will address +these values as if they were UCS-2 values. UCS-2 and UTF-16 are +the same for all currently defined Unicode character points. +UTF-16 without surrogates provides access to about 64k characters +and covers all characters in the Basic Multilingual Plane (BMP) of +Unicode. - It is the Codec's responsibility to ensure that the data they pass - to the Unicode object constructor respects this assumption. The - constructor does not check the data for Unicode compliance or use - of surrogates. +It is the Codec's responsibility to ensure that the data they pass +to the Unicode object constructor respects this assumption. The +constructor does not check the data for Unicode compliance or use +of surrogates. - Future implementations can extend the 32 bit restriction to the - full set of all UTF-16 addressable characters (around 1M - characters). +Future implementations can extend the 32 bit restriction to the +full set of all UTF-16 addressable characters (around 1M +characters). - The Unicode API should provide interface routines from - <PythonUnicode> to the compiler's wchar_t which can be 16 or 32 - bit depending on the compiler/libc/platform being used. +The Unicode API should provide interface routines from +<PythonUnicode> to the compiler's wchar_t which can be 16 or 32 +bit depending on the compiler/libc/platform being used. - Unicode objects should have a pointer to a cached Python string - object holding the object's value using the <default encoding>. This is needed for performance and internal parsing - (see Internal Argument Parsing) reasons. The buffer is filled - when the first conversion request to the <default encoding> is - issued on the object. 
+Unicode objects should have a pointer to a cached Python string +object holding the object's value using the <default encoding>. This is needed for performance and internal parsing +(see Internal Argument Parsing) reasons. The buffer is filled +when the first conversion request to the <default encoding> is +issued on the object. - Interning is not needed (for now), since Python identifiers are - defined as being ASCII only. +Interning is not needed (for now), since Python identifiers are +defined as being ASCII only. - codecs.BOM should return the byte order mark (BOM) for the format - used internally. The codecs module should provide the following - additional constants for convenience and reference (codecs.BOM - will either be BOM_BE or BOM_LE depending on the platform): +``codecs.BOM`` should return the byte order mark (BOM) for the format +used internally. The codecs module should provide the following +additional constants for convenience and reference (``codecs.BOM`` +will either be ``BOM_BE`` or ``BOM_LE`` depending on the platform):: - BOM_BE: '\376\377' - (corresponds to Unicode U+0000FEFF in UTF-16 on big endian - platforms == ZERO WIDTH NO-BREAK SPACE) + BOM_BE: '\376\377' + (corresponds to Unicode U+0000FEFF in UTF-16 on big endian + platforms == ZERO WIDTH NO-BREAK SPACE) - BOM_LE: '\377\376' - (corresponds to Unicode U+0000FFFE in UTF-16 on little endian - platforms == defined as being an illegal Unicode character) + BOM_LE: '\377\376' + (corresponds to Unicode U+0000FFFE in UTF-16 on little endian + platforms == defined as being an illegal Unicode character) - BOM4_BE: '\000\000\376\377' - (corresponds to Unicode U+0000FEFF in UCS-4) + BOM4_BE: '\000\000\376\377' + (corresponds to Unicode U+0000FEFF in UCS-4) - BOM4_LE: '\377\376\000\000' - (corresponds to Unicode U+0000FFFE in UCS-4) + BOM4_LE: '\377\376\000\000' + (corresponds to Unicode U+0000FFFE in UCS-4) - Note that Unicode sees big endian byte order as being "correct". - The swapped order is taken to be an indicator for a "wrong" - format, hence the illegal character definition. +Note that Unicode sees big endian byte order as being "correct". +The swapped order is taken to be an indicator for a "wrong" +format, hence the illegal character definition. - The configure script should provide aid in deciding whether Python - can use the native wchar_t type or not (it has to be a 16-bit - unsigned type). +The configure script should provide aid in deciding whether Python +can use the native ``wchar_t`` type or not (it has to be a 16-bit +unsigned type). Buffer Interface +================ - Implement the buffer interface using the <defenc> Python string - object as basis for bf_getcharbuf and the internal buffer for - bf_getreadbuf. If bf_getcharbuf is requested and the - <defenc> object does not yet exist, it is created first. +Implement the buffer interface using the <defenc> Python string +object as basis for ``bf_getcharbuf`` and the internal buffer for +``bf_getreadbuf``. If ``bf_getcharbuf`` is requested and the +<defenc> object does not yet exist, it is created first. - Note that as special case, the parser marker "s#" will not return - raw Unicode UTF-16 data (which the bf_getreadbuf returns), but - instead tries to encode the Unicode object using the default - encoding and then returns a pointer to the resulting string object - (or raises an exception in case the conversion fails). This was - done in order to prevent accidentely writing binary data to an - output stream which the other end might not recognize. 
+Note that as a special case, the parser marker "s#" will not return +raw Unicode UTF-16 data (which the ``bf_getreadbuf`` returns), but +instead tries to encode the Unicode object using the default +encoding and then returns a pointer to the resulting string object +(or raises an exception in case the conversion fails). This was +done in order to prevent accidentally writing binary data to an +output stream which the other end might not recognize. - This has the advantage of being able to write to output streams - (which typically use this interface) without additional - specification of the encoding to use. +This has the advantage of being able to write to output streams +(which typically use this interface) without additional +specification of the encoding to use. - If you need to access the read buffer interface of Unicode - objects, use the PyObject_AsReadBuffer() interface. +If you need to access the read buffer interface of Unicode +objects, use the ``PyObject_AsReadBuffer()`` interface. - The internal format can also be accessed using the - 'unicode-internal' codec, e.g. via u.encode('unicode-internal'). +The internal format can also be accessed using the +'unicode-internal' codec, e.g. via ``u.encode('unicode-internal')``. Pickle/Marshalling +================== - Should have native Unicode object support. The objects should be - encoded using platform independent encodings. +Should have native Unicode object support. The objects should be +encoded using platform independent encodings. - Marshal should use UTF-8 and Pickle should either choose - Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as - encoding. Using UTF-8 instead of UTF-16 has the advantage of - eliminating the need to store a BOM mark. +Marshal should use UTF-8 and Pickle should either choose +Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as +encoding. Using UTF-8 instead of UTF-16 has the advantage of +eliminating the need to store a BOM mark. Regular Expressions +=================== - Secret Labs AB is working on a Unicode-aware regular expression - machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4 - internal character buffers. +Secret Labs AB is working on a Unicode-aware regular expression +machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4 +internal character buffers. - Also see +Also see - http://www.unicode.org/unicode/reports/tr18/ + http://www.unicode.org/unicode/reports/tr18/ - for some remarks on how to treat Unicode REs. +for some remarks on how to treat Unicode REs. Formatting Markers +================== - Format markers are used in Python format strings. If Python - strings are used as format strings, the following interpretations - should be in effect: +Format markers are used in Python format strings. If Python +strings are used as format strings, the following interpretations +should be in effect:: - '%s': For Unicode objects this will cause coercion of the - whole format string to Unicode. Note that you should use - a Unicode format string to start with for performance - reasons. + '%s': For Unicode objects this will cause coercion of the + whole format string to Unicode. Note that you should use + a Unicode format string to start with for performance + reasons. - In case the format string is an Unicode object, all parameters are - coerced to Unicode first and then put together and formatted - according to the format string. Numbers are first converted to - strings and then to Unicode. 
+In case the format string is an Unicode object, all parameters are +coerced to Unicode first and then put together and formatted +according to the format string. Numbers are first converted to +strings and then to Unicode. - '%s': Python strings are interpreted as Unicode - string using the <default encoding>. Unicode objects are - taken as is. +:: - All other string formatters should work accordingly. + '%s': Python strings are interpreted as Unicode + string using the <default encoding>. Unicode objects are + taken as is. - Example: +All other string formatters should work accordingly. + +Example:: u"%s %s" % (u"abc", "abc") == u"abc abc" Internal Argument Parsing +========================= - These markers are used by the PyArg_ParseTuple() APIs: +These markers are used by the ``PyArg_ParseTuple()`` APIs: - "U": Check for Unicode object and return a pointer to it +"U" + Check for Unicode object and return a pointer to it - "s": For Unicode objects: return a pointer to the object's - buffer (which uses the <default encoding>). +"s" + For Unicode objects: return a pointer to the object's + buffer (which uses the <default encoding>). - "s#": Access to the default encoded version of the Unicode object - (see Buffer Interface); note that the length relates to - the length of the default encoded string rather than the - Unicode object length. +"s#" + Access to the default encoded version of the Unicode object + (see Buffer Interface); note that the length relates to + the length of the default encoded string rather than the + Unicode object length. - "t#": Same as "s#". +"t#" + Same as "s#". - "es": - Takes two parameters: encoding (const char *) and buffer - (char **). +"es" + Takes two parameters: encoding (``const char *``) and buffer + (``char **``). - The input object is first coerced to Unicode in the usual - way and then encoded into a string using the given - encoding. + The input object is first coerced to Unicode in the usual + way and then encoded into a string using the given + encoding. - On output, a buffer of the needed size is allocated and - returned through *buffer as NULL-terminated string. The - encoded may not contain embedded NULL characters. The - caller is responsible for calling PyMem_Free() to free the - allocated *buffer after usage. + On output, a buffer of the needed size is allocated and + returned through ``*buffer`` as a NULL-terminated string. The + encoded string may not contain embedded NULL characters. The + caller is responsible for calling ``PyMem_Free()`` to free the + allocated ``*buffer`` after usage. - "es#": - Takes three parameters: encoding (const char *), buffer - (char **) and buffer_len (int *). +"es#" + Takes three parameters: encoding (``const char *``), buffer + (``char **``) and buffer_len (``int *``). - The input object is first coerced to Unicode in the usual - way and then encoded into a string using the given - encoding. + The input object is first coerced to Unicode in the usual + way and then encoded into a string using the given + encoding. - If *buffer is non-NULL, *buffer_len must be set to - sizeof(buffer) on input. Output is then copied to *buffer. + If ``*buffer`` is non-NULL, ``*buffer_len`` must be set to + ``sizeof(buffer)`` on input. Output is then copied to ``*buffer``. - If *buffer is NULL, a buffer of the needed size is - allocated and output copied into it. *buffer is then - updated to point to the allocated memory area. The caller - is responsible for calling PyMem_Free() to free the - allocated *buffer after usage. 
+ If ``*buffer`` is NULL, a buffer of the needed size is + allocated and output copied into it. ``*buffer`` is then + updated to point to the allocated memory area. The caller + is responsible for calling ``PyMem_Free()`` to free the + allocated ``*buffer`` after usage. - In both cases *buffer_len is updated to the number of - characters written (excluding the trailing NULL-byte). - The output buffer is assured to be NULL-terminated. + In both cases ``*buffer_len`` is updated to the number of + characters written (excluding the trailing NULL-byte). + The output buffer is assured to be NULL-terminated. - Examples: +Examples: - Using "es#" with auto-allocation: +Using "es#" with auto-allocation:: - static PyObject * - test_parser(PyObject *self, - PyObject *args) - { - PyObject *str; - const char *encoding = "latin-1"; - char *buffer = NULL; - int buffer_len = 0; + static PyObject * + test_parser(PyObject *self, + PyObject *args) + { + PyObject *str; + const char *encoding = "latin-1"; + char *buffer = NULL; + int buffer_len = 0; - if (!PyArg_ParseTuple(args, "es#:test_parser", - encoding, &buffer, &buffer_len)) - return NULL; - if (!buffer) { - PyErr_SetString(PyExc_SystemError, - "buffer is NULL"); - return NULL; - } - str = PyString_FromStringAndSize(buffer, buffer_len); - PyMem_Free(buffer); - return str; + if (!PyArg_ParseTuple(args, "es#:test_parser", + encoding, &buffer, &buffer_len)) + return NULL; + if (!buffer) { + PyErr_SetString(PyExc_SystemError, + "buffer is NULL"); + return NULL; } + str = PyString_FromStringAndSize(buffer, buffer_len); + PyMem_Free(buffer); + return str; + } - Using "es" with auto-allocation returning a NULL-terminated string: +Using "es" with auto-allocation returning a NULL-terminated string:: - static PyObject * - test_parser(PyObject *self, - PyObject *args) - { - PyObject *str; - const char *encoding = "latin-1"; - char *buffer = NULL; + static PyObject * + test_parser(PyObject *self, + PyObject *args) + { + PyObject *str; + const char *encoding = "latin-1"; + char *buffer = NULL; - if (!PyArg_ParseTuple(args, "es:test_parser", - encoding, &buffer)) - return NULL; - if (!buffer) { - PyErr_SetString(PyExc_SystemError, - "buffer is NULL"); - return NULL; - } - str = PyString_FromString(buffer); - PyMem_Free(buffer); - return str; + if (!PyArg_ParseTuple(args, "es:test_parser", + encoding, &buffer)) + return NULL; + if (!buffer) { + PyErr_SetString(PyExc_SystemError, + "buffer is NULL"); + return NULL; } + str = PyString_FromString(buffer); + PyMem_Free(buffer); + return str; + } - Using "es#" with a pre-allocated buffer: +Using "es#" with a pre-allocated buffer:: - static PyObject * - test_parser(PyObject *self, - PyObject *args) - { - PyObject *str; - const char *encoding = "latin-1"; - char _buffer[10]; - char *buffer = _buffer; - int buffer_len = sizeof(_buffer); + static PyObject * + test_parser(PyObject *self, + PyObject *args) + { + PyObject *str; + const char *encoding = "latin-1"; + char _buffer[10]; + char *buffer = _buffer; + int buffer_len = sizeof(_buffer); - if (!PyArg_ParseTuple(args, "es#:test_parser", - encoding, &buffer, &buffer_len)) - return NULL; - if (!buffer) { - PyErr_SetString(PyExc_SystemError, - "buffer is NULL"); - return NULL; - } - str = PyString_FromStringAndSize(buffer, buffer_len); - return str; + if (!PyArg_ParseTuple(args, "es#:test_parser", + encoding, &buffer, &buffer_len)) + return NULL; + if (!buffer) { + PyErr_SetString(PyExc_SystemError, + "buffer is NULL"); + return NULL; } + str = PyString_FromStringAndSize(buffer, 
buffer_len); + return str; + } File/Stream Output +================== - Since file.write(object) and most other stream writers use the - "s#" or "t#" argument parsing marker for querying the data to - write, the default encoded string version of the Unicode object - will be written to the streams (see Buffer Interface). +Since file.write(object) and most other stream writers use the +"s#" or "t#" argument parsing marker for querying the data to +write, the default encoded string version of the Unicode object +will be written to the streams (see Buffer Interface). - For explicit handling of files using Unicode, the standard stream - codecs as available through the codecs module should be used. +For explicit handling of files using Unicode, the standard stream +codecs as available through the codecs module should be used. - The codecs module should provide a short-cut - open(filename,mode,encoding) available which also assures that - mode contains the 'b' character when needed. +The codecs module should provide a short-cut +open(filename,mode,encoding) which also assures that +mode contains the 'b' character when needed. File/Stream Input +================= - Only the user knows what encoding the input data uses, so no - special magic is applied. The user will have to explicitly - convert the string data to Unicode objects as needed or use the - file wrappers defined in the codecs module (see File/Stream - Output). +Only the user knows what encoding the input data uses, so no +special magic is applied. The user will have to explicitly +convert the string data to Unicode objects as needed or use the +file wrappers defined in the codecs module (see File/Stream +Output). Unicode Methods & Attributes +============================ - All Python string methods, plus: +All Python string methods, plus:: - .encode([encoding=<default encoding>][,errors="strict"]) - --> see Unicode Output + .encode([encoding=<default encoding>][,errors="strict"]) + --> see Unicode Output - .splitlines([include_breaks=0]) - --> breaks the Unicode string into a list of (Unicode) lines; - returns the lines with line breaks included, if - include_breaks is true. See Line Breaks for a - specification of how line breaking is done. + .splitlines([include_breaks=0]) + --> breaks the Unicode string into a list of (Unicode) lines; + returns the lines with line breaks included, if + include_breaks is true. See Line Breaks for a + specification of how line breaking is done. Code Base +========= - We should use Fredrik Lundh's Unicode object implementation as - basis. It already implements most of the string methods needed - and provides a well written code base which we can build upon. +We should use Fredrik Lundh's Unicode object implementation as a +basis. It already implements most of the string methods needed +and provides a well written code base which we can build upon. - The object sharing implemented in Fredrik's implementation should - be dropped. +The object sharing implemented in Fredrik's implementation should +be dropped. Test Cases +========== - Test cases should follow those in Lib/test/test_string.py and - include additional checks for the Codec Registry and the Standard - Codecs. +Test cases should follow those in Lib/test/test_string.py and +include additional checks for the Codec Registry and the Standard +Codecs. 
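+As a closing illustration of the stream codecs and file wrappers
+discussed above, a round trip through ``codecs.open()`` might look
+like this (a sketch; assumes the utf-8 stream codec is installed)::
+
+    import codecs
+
+    # Write Unicode through a UTF-8 StreamWriter; binary mode is
+    # used so no newline translation mangles the encoded bytes.
+    f = codecs.open('test.txt', 'wb', encoding='utf-8')
+    f.write(u'abc\u1234')
+    f.close()
+
+    # Read it back through the matching StreamReader.
+    f = codecs.open('test.txt', 'rb', encoding='utf-8')
+    assert f.read() == u'abc\u1234'
+    f.close()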
References +========== - Unicode Consortium: - http://www.unicode.org/ +* Unicode Consortium: http://www.unicode.org/ - Unicode FAQ: - http://www.unicode.org/unicode/faq/ +* Unicode FAQ: http://www.unicode.org/unicode/faq/ - Unicode 3.0: - http://www.unicode.org/unicode/standard/versions/Unicode3.0.html +* Unicode 3.0: http://www.unicode.org/unicode/standard/versions/Unicode3.0.html - Unicode-TechReports: - http://www.unicode.org/unicode/reports/techreports.html +* Unicode-TechReports: http://www.unicode.org/unicode/reports/techreports.html - Unicode-Mappings: - ftp://ftp.unicode.org/Public/MAPPINGS/ +* Unicode-Mappings: ftp://ftp.unicode.org/Public/MAPPINGS/ - Introduction to Unicode (a little outdated by still nice to read): - http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html +* Introduction to Unicode (a little outdated but still nice to read): + http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html - For comparison: - Introducing Unicode to ECMAScript (aka JavaScript) -- - http://www-4.ibm.com/software/developer/library/internationalization-support.html +* For comparison: + Introducing Unicode to ECMAScript (aka JavaScript) -- + http://www-4.ibm.com/software/developer/library/internationalization-support.html - IANA Character Set Names: - ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets +* IANA Character Set Names: + ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets - Discussion of UTF-8 and Unicode support for POSIX and Linux: - http://www.cl.cam.ac.uk/~mgk25/unicode.html +* Discussion of UTF-8 and Unicode support for POSIX and Linux: + http://www.cl.cam.ac.uk/~mgk25/unicode.html - Encodings: +* Encodings: - Overview: - http://czyborra.com/utf/ + * Overview: http://czyborra.com/utf/ - UCS-2: - http://www.uazone.org/multiling/unicode/ucs2.html + * UCS-2: http://www.uazone.org/multiling/unicode/ucs2.html - UTF-7: - Defined in RFC2152, e.g. - http://www.uazone.org/multiling/ml-docs/rfc2152.txt + * UTF-7: Defined in RFC2152, e.g. + http://www.uazone.org/multiling/ml-docs/rfc2152.txt - UTF-8: - Defined in RFC2279, e.g. - https://tools.ietf.org/html/rfc2279 + * UTF-8: Defined in RFC2279, e.g. + https://tools.ietf.org/html/rfc2279 - UTF-16: - http://www.uazone.org/multiling/unicode/wg2n1035.html + * UTF-16: http://www.uazone.org/multiling/unicode/wg2n1035.html History of this Proposal +======================== - [ed. note: revisions prior to 1.7 are available in the CVS history - of Misc/unicode.txt from the standard Python distribution. All - subsequent history is available via the CVS revisions on this - file.] +[ed. note: revisions prior to 1.7 are available in the CVS history +of Misc/unicode.txt from the standard Python distribution. All +subsequent history is available via the CVS revisions on this +file.] - 1.7: Added note about the changed behaviour of "s#". - 1.6: Changed <defencstr> to <defenc> since this is the name used in the - implementation. Added notes about the usage of <defenc> in - the buffer protocol implementation. - 1.5: Added notes about setting the <default encoding>. Fixed some - typos (thanks to Andrew Kuchling). Changed <defencobj> to - <defencstr>. - 1.4: Added note about mixed type comparisons and contains tests. - Changed treating of Unicode objects in format strings (if - used with '%s' % u they will now cause the format string to - be coerced to Unicode, thus producing a Unicode object on - return). Added link to IANA charset names (thanks to Lars - Marius Garshol). Added new codec methods .readline(), - .readlines() and .writelines(). 
- 1.3: Added new "es" and "es#" parser markers - 1.2: Removed POD about codecs.open() - 1.1: Added note about comparisons and hash values. Added note about - case mapping algorithms. Changed stream codecs .read() and - .write() method to match the standard file-like object - methods (bytes consumed information is no longer returned by - the methods) - 1.0: changed encode Codec method to be symmetric to the decode method - (they both return (object, data consumed) now and thus become - interchangeable); removed __init__ method of Codec class (the - methods are stateless) and moved the errors argument down to - the methods; made the Codec design more generic w/r to type - of input and output objects; changed StreamWriter.flush to - StreamWriter.reset in order to avoid overriding the stream's - .flush() method; renamed .breaklines() to .splitlines(); - renamed the module unicodec to codecs; modified the File I/O - section to refer to the stream codecs. - 0.9: changed errors keyword argument definition; added 'replace' error - handling; changed the codec APIs to accept buffer like - objects on input; some minor typo fixes; added Whitespace - section and included references for Unicode characters that - have the whitespace and the line break characteristic; added - note that search functions can expect lower-case encoding - names; dropped slicing and offsets in the codec APIs - 0.8: added encodings package and raw unicode escape encoding; untabified - the proposal; added notes on Unicode format strings; added - .breaklines() method - 0.7: added a whole new set of codec APIs; added a different - encoder lookup scheme; fixed some names - 0.6: changed "s#" to "t#"; changed <defencstr> to holding - a real Python string object; changed Buffer Interface to - delegate requests to <defencstr>'s buffer interface; removed - the explicit reference to the unicodec.codecs dictionary (the - module can implement this in way fit for the purpose); - removed the settable default encoding; move UnicodeError from - unicodec to exceptions; "s#" not returns the internal data; - passed the UCS-2/UTF-16 checking from the Unicode constructor - to the Codecs - 0.5: moved sys.bom to unicodec.BOM; added sections on case mapping, - private use encodings and Unicode character properties - 0.4: added Codec interface, notes on %-formatting, changed some encoding - details, added comments on stream wrappers, fixed some - discussion points (most important: Internal Format), - clarified the 'unicode-escape' encoding, added encoding - references - 0.3: added references, comments on codec modules, the internal format, - bf_getcharbuffer and the RE engine; added 'unicode-escape' - encoding proposed by Tim Peters and fixed repr(u) accordingly - 0.2: integrated Guido's suggestions, added stream codecs and file - wrapping - 0.1: first version +1.7 +--- + +* Added note about the changed behaviour of "s#". + +1.6 +--- + +* Changed <defencstr> to <defenc> since this is the name used in the + implementation. +* Added notes about the usage of <defenc> in + the buffer protocol implementation. + +1.5 +--- + +* Added notes about setting the <default encoding>. +* Fixed some typos (thanks to Andrew Kuchling). +* Changed <defencobj> to <defencstr>. + +1.4 +--- + +* Added note about mixed type comparisons and contains tests. +* Changed treating of Unicode objects in format strings (if + used with ``'%s' % u`` they will now cause the format string to + be coerced to Unicode, thus producing a Unicode object on + return). +* Added link to IANA charset names (thanks to Lars + Marius Garshol). 
+* Added new codec methods ``.readline()``, + ``.readlines()`` and ``.writelines()``. + +1.3 +--- + +* Added new "es" and "es#" parser markers + +1.2 +--- + +* Removed POD about ``codecs.open()`` + +1.1 +--- + +* Added note about comparisons and hash values. +* Added note about case mapping algorithms. +* Changed stream codecs ``.read()`` and ``.write()`` methods + to match the standard file-like object + methods (bytes consumed information is no longer returned by + the methods) + +1.0 +--- + +* changed encode Codec method to be symmetric to the decode method + (they both return (object, data consumed) now and thus become + interchangeable); +* removed ``__init__`` method of Codec class (the + methods are stateless) and moved the errors argument down to + the methods; +* made the Codec design more generic w/r to type + of input and output objects; +* changed ``StreamWriter.flush`` to ``StreamWriter.reset`` in order to + avoid overriding the stream's ``.flush()`` method; +* renamed ``.breaklines()`` to ``.splitlines()``; +* renamed the module unicodec to codecs; +* modified the File I/O section to refer to the stream codecs. + +0.9 +--- + +* changed errors keyword argument definition; +* added 'replace' error handling; +* changed the codec APIs to accept buffer like + objects on input; +* some minor typo fixes; +* added Whitespace section and included references for Unicode characters that + have the whitespace and the line break characteristic; +* added note that search functions can expect lower-case encoding names; +* dropped slicing and offsets in the codec APIs + +0.8 +--- + +* added encodings package and raw unicode escape encoding; +* untabified the proposal; +* added notes on Unicode format strings; +* added ``.breaklines()`` method + +0.7 +--- + +* added a whole new set of codec APIs; +* added a different encoder lookup scheme; +* fixed some names + +0.6 +--- + +* changed "s#" to "t#"; +* changed <defencstr> to holding + a real Python string object; +* changed Buffer Interface to + delegate requests to <defencstr>'s buffer interface; +* removed the explicit reference to the unicodec.codecs dictionary (the + module can implement this in a way fit for the purpose); +* removed the settable default encoding; +* move ``UnicodeError`` from unicodec to exceptions; +* "s#" now returns the internal data; +* passed the UCS-2/UTF-16 checking from the Unicode constructor + to the Codecs + +0.5 +--- + +* moved ``sys.bom`` to ``unicodec.BOM``; +* added sections on case mapping, private use encodings and Unicode + character properties + +0.4 +--- + +* added Codec interface, notes on %-formatting; +* changed some encoding details; +* added comments on stream wrappers, +* fixed some discussion points (most important: Internal Format), +* clarified the 'unicode-escape' encoding, added encoding + references + +0.3 +--- + +* added references, comments on codec modules, the internal format, + bf_getcharbuffer and the RE engine; +* added 'unicode-escape' + encoding proposed by Tim Peters and fixed repr(u) accordingly + +0.2 +--- + +* integrated Guido's suggestions, added stream codecs and file wrapping + +0.1 +--- + +* first version - -Local Variables: -mode: indented-text -indent-tabs-mode: nil -End: +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + End: