Add PEP 400: Deprecate codecs.StreamReader and codecs.StreamWriter

2011-07-07 14:08:47 +02:00 · 2011-07-07 14:08:47 +02:00 · e0589b22b2
parent 50097f94fc
commit e0589b22b2
1 changed files with 323 additions and 0 deletions
--- a/pep-0400.txt
+++ b/pep-0400.txt
@@ -0,0 +1,323 @@
+PEP: 400
+Title: Deprecate codecs.StreamReader and codecs.StreamWriter
+Version: $Revision$
+Last-Modified: $Date$
+Author: Victor Stinner <victor.stinner@haypocalc.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 28-May-2011
+Python-Version: 3.3
+
+
+Abstract
+========
+
+io.TextIOWrapper and codecs.StreamReaderWriter offer the same API
+[#f1]_. TextIOWrapper has more features and is faster than
+StreamReaderWriter. Duplicated code means that bugs must be fixed
+twice and that subtle differences may arise between the two
+implementations.
+
+The codecs module was introduced in Python 2.0 (see PEP 100). The
+io module was introduced in Python 2.6 and 3.0 (see PEP 3116), and
+reimplemented in C in Python 2.7 and 3.1.
+
+
+Motivation
+==========
+
+When the Python I/O model was updated for 3.0, the concept of a
+"stream-with-known-encoding" was introduced in the form of
+io.TextIOWrapper. As this class is critical to the performance of
+text-based I/O in Python 3, it has an optimised C implementation
+which is used by CPython by default. Many corner cases in handling
+buffering, stateful codecs and universal newlines have been dealt with
+since the release of Python 3.0.
+
+This new interface overlaps heavily with the legacy
+codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter
+interfaces that were part of the original codec interface design in
+PEP 100. These interfaces are organised around the principle of an
+encoding with an associated stream (i.e. the reverse of the arrangement
+in the io module), so the original PEP 100 design required that codec
+writers provide appropriate StreamReader and StreamWriter
+implementations in addition to the core codec encode() and decode()
+methods. This places a heavy burden on codec authors providing these
+specialised implementations to correctly handle many of the corner
+cases that have now been dealt with by io.TextIOWrapper. While deeper
+integration between the codec and the stream allows for additional
+optimisations in theory, these optimisations have in practice either
+not been carried out or else the associated code duplication means
+that the corner cases that have been fixed in io.TextIOWrapper are
+still not handled correctly in the various StreamReader and
+StreamWriter implementations.
+
+Accordingly, this PEP proposes that:
+
+* codecs.open() be updated to delegate to the builtin open() in Python
+  3.3;
+* the legacy codecs.Stream* interfaces, including the streamreader and
+  streamwriter attributes of codecs.CodecInfo, be deprecated in Python
+  3.3 and removed in Python 3.4.
+
+
+Rationale
+=========
+
+StreamReader and StreamWriter issues
+''''''''''''''''''''''''''''''''''''
+
+ * StreamReader is unable to translate newlines.
+ * StreamReaderWriter handles reads using StreamReader and writes
+   using StreamWriter. These two classes may be inconsistent. To stay
+   consistent, flush() must be called after each write, which slows
+   down interlaced read-write.
+ * StreamWriter doesn't support "line buffering" (flush if the input
+   text contains a newline).
+ * StreamReader classes of the CJK encodings (e.g. GB18030) don't
+   support universal newlines, only UNIX newlines ('\\n').
+ * StreamReader and StreamWriter are stateful codecs but don't expose
+   functions to control their state (getstate() or setstate()). Each
+   codec has to implement corner cases, see "Issue with stateful
+   codecs".
+ * StreamReader and StreamWriter are very similar to IncrementalDecoder
+   and IncrementalEncoder; some code is duplicated for stateful codecs
+   (e.g. UTF-16).
+ * Each codec has to reimplement its own StreamReader and StreamWriter
+   class, even if it's trivial (just call the encoder/decoder).
+ * codecs.open(filename, "r") creates an io.TextIOWrapper object.
+ * No codec implements an optimized method in StreamReader or
+   StreamWriter based on the specificities of the codec.
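
A minimal sketch of the first point in the list above, using only the standard library: the same CRLF-terminated bytes are read once through a codecs.StreamReader and once through io.TextIOWrapper. The sample data and variable names are made up for illustration.

```python
import codecs
import io

# Sample input with Windows line endings (illustrative data only).
raw = "first\r\nsecond\r\n".encode("utf-8")

# codecs.StreamReader decodes the bytes but leaves line endings untouched.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
codecs_text = reader.read()

# io.TextIOWrapper with newline=None translates "\r\n" to "\n" on read.
wrapper = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
io_text = wrapper.read()

print(repr(codecs_text))
print(repr(io_text))
```

On the io side, newline=None is what enables universal newline translation; the codecs reader takes no comparable parameter.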
+
+Other issues in the bug tracker:
+
+ * `Issue #5445 <http://bugs.python.org/issue5445>`_ (2009-03-08):
+   codecs.StreamWriter.writelines problem when passed generator
+ * `Issue #7262 <http://bugs.python.org/issue7262>`_ (2009-11-04):
+   codecs.open() + eol (windows)
+ * `Issue #8260 <http://bugs.python.org/issue8260>`_ (2010-03-29):
+   When I use codecs.open(...) and f.readline() follow up by f.read()
+   return bad result
+ * `Issue #8630 <http://bugs.python.org/issue8630>`_ (2010-05-05):
+   Keepends param in codec readline(s)
+ * `Issue #10344 <http://bugs.python.org/issue10344>`_ (2010-11-06):
+   codecs.readline doesn't care buffering
+ * `Issue #11461 <http://bugs.python.org/issue11461>`_ (2011-03-10):
+   Reading UTF-16 with codecs.readline() breaks on surrogate pairs
+ * `Issue #12446 <http://bugs.python.org/issue12446>`_ (2011-06-30):
+   StreamReader Readlines behavior odd
+ * `Issue #12508 <http://bugs.python.org/issue12508>`_ (2011-07-06):
+   Codecs Anomaly
+ * `Issue #12512 <http://bugs.python.org/issue12512>`_ (2011-07-07):
+   codecs: StreamWriter issues with stateful codecs after a seek or
+   with append mode
+
+TextIOWrapper features
+''''''''''''''''''''''
+
+ * TextIOWrapper supports any kind of newline, including newline
+   translation (to UNIX newlines), for both reading and writing.
+ * TextIOWrapper reuses incremental encoders and decoders (no
+   duplication of code).
+ * The io module (TextIOWrapper) is faster than the codecs module
+   (StreamReader). It is implemented in C, whereas codecs is
+   implemented in Python.
+ * TextIOWrapper has a readahead algorithm which speeds up small
+   reads, such as reading character by character or line by line (io
+   is 10 to 25 times faster than codecs on these operations).
+ * TextIOWrapper has a write buffer.
+ * TextIOWrapper.tell() is optimized.
+ * TextIOWrapper supports random access (read+write) using a single
+   class, which makes it possible to optimize interlaced read-write
+   (but no such optimization is implemented).
+
+TextIOWrapper issues
+''''''''''''''''''''
+
+ * `Issue #12213 <http://bugs.python.org/issue12213>`_ (2011-05-30):
+   BufferedRandom, BufferedRWPair: issues with interlaced read-write
+ * `Issue #12215 <http://bugs.python.org/issue12215>`_ (2011-05-30):
+   TextIOWrapper: issues with interlaced read-write
+
+Possible improvements of StreamReader and StreamWriter
+''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+It would be possible to add functions to StreamReader and StreamWriter
+to give access to the state of the codec. It would then be possible to
+fix issues with stateful codecs in a base class instead of having to
+fix them in each stateful StreamReader and StreamWriter class.
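
For comparison, the incremental codec classes already expose this kind of state access through getstate() and setstate(). The sketch below assumes the behaviour of CPython's utf-16 incremental encoder and shows how saving and restoring the state controls whether a BOM is emitted. Variable names are illustrative.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-16")()

# Remember the pristine state, in which the BOM has not been emitted yet.
initial_state = encoder.getstate()

first = encoder.encode("a")    # BOM followed by the encoded character
second = encoder.encode("b")   # encoded character only, no second BOM

# Restoring the saved state makes the encoder emit a BOM again.
encoder.setstate(initial_state)
third = encoder.encode("c")

boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(first[:2] in boms, len(first), len(second), len(third))
```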
+
+It would be possible to change StreamReader and StreamWriter to make
+them use IncrementalDecoder and IncrementalEncoder.
+
+A codec can implement variants which are optimized for the specific
+encoding or intercept certain stream methods to add functionality or
+improve the encoding/decoding performance. TextIOWrapper cannot
+implement such optimizations, but TextIOWrapper uses incremental
+encoders and decoders and uses read and write buffers, so the overhead
+of incomplete inputs is low or nil.
+
+A lot more could be done for other variable-length encodings,
+e.g. UTF-8, since these often have problems near the end of a read due
+to missing bytes. The UTF-32-BE/LE codecs could simply multiply the
+character position by 4 to get the byte position.
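
Both observations can be illustrated with a short sketch. The first part feeds a UTF-8 byte string to an incremental decoder in chunks that split multi-byte sequences, which is the "missing bytes near the end of a read" situation. The second part computes a byte offset in UTF-32-LE data from a character index. The sample strings and chunk boundaries are arbitrary choices for illustration.

```python
import codecs

# "\u00e9" takes 2 bytes and "\u20ac" takes 3 bytes in UTF-8.
data = "\u00e9\u20ac".encode("utf-8")
chunks = [data[:1], data[1:3], data[3:]]  # split inside both characters

decoder = codecs.getincrementaldecoder("utf-8")()
# Incomplete trailing bytes are buffered until the next chunk arrives.
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))
text = "".join(pieces)

# With a fixed-width encoding, byte offset = 4 * character index.
encoded = "h\u00e9llo".encode("utf-32-le")
char_index = 3
byte_offset = 4 * char_index
char_at_offset = encoded[byte_offset:byte_offset + 4].decode("utf-32-le")

print(pieces, text, char_at_offset)
```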
+
+
+Usage of StreamReader and StreamWriter
+''''''''''''''''''''''''''''''''''''''
+
+These classes are rarely used directly; they are mostly used
+indirectly through codecs.open(). They are not used in the Python 3
+standard library (except in the codecs module itself).
+
+Some projects implement their own codec with StreamReader and
+StreamWriter, but don't use these classes.
+
+
+Backwards Compatibility
+=======================
+
+Keep the public API, codecs.open
+''''''''''''''''''''''''''''''''
+
+codecs.open() can be replaced by the builtin open() function. open()
+has a similar API but also has more options.
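
A minimal sketch of such a replacement, assuming a UTF-8 text file in a temporary directory (the path and contents are hypothetical). The newline and errors arguments are examples of the options open() accepts.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name

    # Instead of codecs.open(path, "w", encoding="utf-8"):
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("caf\u00e9\n")

    # Instead of codecs.open(path, "r", encoding="utf-8"):
    with open(path, "r", encoding="utf-8", errors="strict", newline=None) as f:
        content = f.read()

print(repr(content))
```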
+
+codecs.open() was the only way to open a text file in Unicode mode
+until Python 2.6. Many Python 2 programs use this function. Removing
+codecs.open() would mean more work to port programs from Python 2 to
+Python 3, especially for projects using the same code base for both
+Python versions (without using the 2to3 program).
+
+codecs.open() is kept for backward compatibility with Python 2.
+
+
+Deprecate StreamReader and StreamWriter
+'''''''''''''''''''''''''''''''''''''''
+
+Instantiating StreamReader or StreamWriter must raise a
+DeprecationWarning in Python 3.3. Defining a subclass doesn't raise a
+DeprecationWarning.
+
+codecs.open() will be changed to reuse the builtin open() function
+(TextIOWrapper).
+
+EncodedFile(), StreamRandom, StreamReader, StreamReaderWriter and
+StreamWriter will be removed in Python 3.4.
+
+
+Issue with stateful codecs
+==========================
+
+It is difficult to use a stateful codec correctly with a stream. Some
+cases are supported by the codecs module, while io has no remaining
+known bugs related to stateful codecs. The main difference between the
+codecs and the io modules is that, for the codecs module, bugs have to
+be fixed in the StreamReader and/or StreamWriter classes of each
+codec, whereas in io they can be fixed once in io.TextIOWrapper. Here
+are some examples of issues with stateful codecs.
+
+Stateful codecs
+'''''''''''''''
+
+Python supports the following stateful codecs:
+
+ * cp932
+ * cp949
+ * cp950
+ * euc_jis_2004
+ * euc_jisx0213
+ * euc_jp
+ * euc_kr
+ * gb18030
+ * gbk
+ * hz
+ * iso2022_jp
+ * iso2022_jp_1
+ * iso2022_jp_2
+ * iso2022_jp_2004
+ * iso2022_jp_3
+ * iso2022_jp_ext
+ * iso2022_kr
+ * shift_jis
+ * shift_jis_2004
+ * shift_jisx0213
+ * utf_8_sig
+ * utf_16
+ * utf_32
+
+Read and seek(0)
+''''''''''''''''
+
+::
+
+    with open(filename, 'w+', encoding='utf-16') as f:
+        f.write('abc')
+        f.write('def')
+        f.seek(0)
+        assert f.read() == 'abcdef'
+        f.seek(0)
+        assert f.read() == 'abcdef'
+
+The io and codecs modules support this use case correctly.
+
+seek(n)
+'''''''
+
+::
+
+    with open(filename, 'w', encoding='utf-16') as f:
+        f.write('abc')
+        pos = f.tell()
+    with open(filename, 'w', encoding='utf-16') as f:
+        f.seek(pos)
+        f.write('def')
+        f.seek(0)
+        f.write('###')
+    with open(filename, 'r', encoding='utf-16') as f:
+        assert f.read() == '###def'
+
+The io module supports this use case, whereas codecs fails because it
+writes a new BOM on the second write (issue #12512).
+
+Append mode
+'''''''''''
+
+::
+
+    with open(filename, 'w', encoding='utf-16') as f:
+        f.write('abc')
+    with open(filename, 'a', encoding='utf-16') as f:
+        f.write('def')
+    with open(filename, 'r', encoding='utf-16') as f:
+        assert f.read() == 'abcdef'
+
+The io module supports this use case, whereas codecs fails because it
+writes a new BOM on the second write (issue #12512).
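
The append-mode case can be checked with a self-contained variant of the snippet above. This sketch assumes a temporary file (the snippet leaves filename undefined) and inspects the raw bytes to confirm that the io module writes a single BOM.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "append.txt")  # hypothetical file name

    with open(filename, "w", encoding="utf-16") as f:
        f.write("abc")
    with open(filename, "a", encoding="utf-16") as f:
        f.write("def")

    # Read the file back both as raw bytes and as decoded text.
    with open(filename, "rb") as f:
        raw = f.read()
    with open(filename, "r", encoding="utf-16") as f:
        text = f.read()

# Count BOMs of either byte order in the raw data.
bom_count = raw.count(codecs.BOM_UTF16_LE) + raw.count(codecs.BOM_UTF16_BE)
print(text, len(raw), bom_count)
```

A second BOM in the middle of the file would add two bytes to the length and would appear in the decoded text as an extra U+FEFF character.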
+
+
+Links
+=====
+
+ * `PEP 100: Python Unicode Integration
+   <http://www.python.org/dev/peps/pep-0100/>`_
+ * `PEP 3116 <http://www.python.org/dev/peps/pep-3116/>`_
+ * `Issue #8796: Deprecate codecs.open()
+   <http://bugs.python.org/issue8796>`_
+ * `[python-dev] Deprecate codecs.open() and StreamWriter/StreamReader
+   <http://mail.python.org/pipermail/python-dev/2011-May/111591.html>`_
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+Footnotes
+=========
+
+.. [#f1] StreamReaderWriter has two more attributes than
+         TextIOWrapper, reader and writer.
+