Add PEP 400: Deprecate codecs.StreamReader and codecs.StreamWriter
This commit is contained in:
parent
50097f94fc
commit
e0589b22b2
|
@ -0,0 +1,323 @@
|
|||
PEP: 400
|
||||
Title: Deprecate codecs.StreamReader and codecs.StreamWriter
|
||||
Version: $Revision$
|
||||
Last-Modified: $Date$
|
||||
Author: Victor Stinner <victor.stinner@haypocalc.com>
|
||||
Status: Draft
|
||||
Type: Standards Track
|
||||
Content-Type: text/x-rst
|
||||
Created: 28-May-2011
|
||||
Python-Version: 3.3
|
||||
|
||||
|
||||
Abstract
|
||||
========
|
||||
|
||||
io.TextIOWrapper and codecs.StreamReaderWriter offer the same API
|
||||
[#f1]_. TextIOWrapper has more features and is faster than
|
||||
StreamReaderWriter. Duplicate code means that bugs should be fixed
|
||||
twice and that we may have subtle differences between the two
|
||||
implementations.
|
||||
|
||||
The codecs modules was introduced in Python 2.0, see the PEP 100. The
|
||||
io module was introduced in Python 2.6 and 3.0 (see the PEP 3116), and
|
||||
reimplemented in C in Python 2.7 and 3.1.
|
||||
|
||||
|
||||
Motivation
|
||||
==========
|
||||
|
||||
When the Python I/O model was updated for 3.0, the concept of a
|
||||
"stream-with-known-encoding" was introduced in the form of
|
||||
io.TextIOWrapper. As this class is critical to the performance of
|
||||
text-based I/O in Python 3, this module has an optimised C version
|
||||
which is used by CPython by default. Many corner cases in handling
|
||||
buffering, stateful codecs and universal newlines have been dealt with
|
||||
since the release of Python 3.0.
|
||||
|
||||
This new interface overlaps heavily with the legacy
|
||||
codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter
|
||||
interfaces that were part of the original codec interface design in
|
||||
PEP 100. These interfaces are organised around the principle of an
|
||||
encoding with an associated stream (i.e. the reverse of arrangement in
|
||||
the io module), so the original PEP 100 design required that codec
|
||||
writers provide appropriate StreamReader and StreamWriter
|
||||
implementations in addition to the core codec encode() and decode()
|
||||
methods. This places a heavy burden on codec authors providing these
|
||||
specialised implementations to correctly handle many of the corner
|
||||
cases that have now been dealt with by io.TextIOWrapper. While deeper
|
||||
integration between the codec and the stream allows for additional
|
||||
optimisations in theory, these optimisations have in practice either
|
||||
not been carried out and else the associated code duplication means
|
||||
that the corner cases that have been fixed in io.TextIOWrapper are
|
||||
still not handled correctly in the various StreamReader and
|
||||
StreamWriter implementations.
|
||||
|
||||
Accordingly, this PEP proposes that:
|
||||
|
||||
* codecs.open() be updated to delegate to the builtin open() in Python
|
||||
3.3;
|
||||
* the legacy codecs.Stream* interfaces, including the streamreader and
|
||||
streamwriter attributes of codecs.CodecInfo be deprecated in Python
|
||||
3.3 and removed in Python 3.4.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
StreamReader and StreamWriter issues
|
||||
''''''''''''''''''''''''''''''''''''
|
||||
|
||||
* StreamReader is unable to translate newlines.
|
||||
* StreamReaderWriter handles reads using StreamReader and writes
|
||||
using StreamWriter. These two classes may be inconsistent. To stay
|
||||
consistent, flush() must be called after each write which slows
|
||||
down interlaced read-write.
|
||||
* StreamWriter doesn't support "line buffering" (flush if the input
|
||||
text contains a newline).
|
||||
* StreamReader classes of the CJK encodings (e.g. GB18030) don't
|
||||
support universal newlines, only UNIX newlines ('\\n').
|
||||
* StreamReader and StreamWriter are stateful codecs but don't expose
|
||||
functions to control their state (getstate() or setstate()). Each
|
||||
codec has to implement corner cases, see "Issue with stateful
|
||||
codecs".
|
||||
* StreamReader and StreamWriter are very similar to IncrementalReader
|
||||
and IncrementalEncoder, some code is duplicated for stateful codecs
|
||||
(e.g. UTF-16).
|
||||
* Each codec has to reimplement its own StreamReader and StreamWriter
|
||||
class, even if it's trivial (just call the encoder/decoder).
|
||||
* codecs.open(filename, "r") creates a io.TextIOWrapper object.
|
||||
* No codec implements an optimized method in StreamReader or
|
||||
StreamWriter based on the specificities of the codec.
|
||||
|
||||
Other issues in the bug tracker:
|
||||
|
||||
* `Issue #5445 <http://bugs.python.org/issue5445>`_ (2009-03-08):
|
||||
codecs.StreamWriter.writelines problem when passed generator
|
||||
* `Issue #7262: <http://bugs.python.org/issue7262>`_ (2009-11-04):
|
||||
codecs.open() + eol (windows)
|
||||
* `Issue #8260 <http://bugs.python.org/issue8260>`_ (2010-03-29):
|
||||
When I use codecs.open(...) and f.readline() follow up by f.read()
|
||||
return bad result
|
||||
* `Issue #8630 <http://bugs.python.org/issue8630>`_ (2010-05-05):
|
||||
Keepends param in codec readline(s)
|
||||
* `Issue #10344 <http://bugs.python.org/issue10344>`_ (2010-11-06):
|
||||
codecs.readline doesn't care buffering
|
||||
* `Issue #11461 <http://bugs.python.org/issue11461>`_ (2011-03-10):
|
||||
Reading UTF-16 with codecs.readline() breaks on surrogate pairs
|
||||
* `Issue #12446 <http://bugs.python.org/issue12446>`_ (2011-06-30):
|
||||
StreamReader Readlines behavior odd
|
||||
* `Issue #12508 <http://bugs.python.org/issue12508>`_ (2011-07-06):
|
||||
Codecs Anomaly
|
||||
* `Issue #12512 <http://bugs.python.org/issue12512>`_ (2011-07-07):
|
||||
codecs: StreamWriter issues with stateful codecs after a seek or
|
||||
with append mode
|
||||
|
||||
TextIOWrapper features
|
||||
''''''''''''''''''''''
|
||||
|
||||
* TextIOWrapper supports any kind of newline, including translating
|
||||
newlines (to UNIX newlines), to read and write.
|
||||
* TextIOWrapper reuses incremental encoders and decoders (no
|
||||
duplication of code).
|
||||
* The io module (TextIOWrapper) is faster than the codecs module
|
||||
(StreamReader). It is implemented in C, whereas codecs is
|
||||
implemented in Python.
|
||||
* TextIOWrapper has a readahead algorithm which speeds up small
|
||||
reads: read character by character or line by line (io is 10x
|
||||
through 25x faster than codecs on these operations).
|
||||
* TextIOWrapper has a write buffer.
|
||||
* TextIOWrapper.tell() is optimized.
|
||||
* TextIOWrapper supports random access (read+write) using a single
|
||||
class which permit to optimize interlaced read-write (but no such
|
||||
optimization is implemented).
|
||||
|
||||
TextIOWrapper issues
|
||||
''''''''''''''''''''
|
||||
|
||||
* `Issue #12213 <http://bugs.python.org/issue12213>`_ (2011-05-30):
|
||||
BufferedRandom, BufferedRWPair: issues with interlaced read-write
|
||||
* `Issue #12215 <http://bugs.python.org/issue12215>`_ (2011-05-30):
|
||||
TextIOWrapper: issues with interlaced read-write
|
||||
|
||||
Possible improvements of StreamReader and StreamWriter
|
||||
''''''''''''''''''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
It would be possible to add functions to StreamReader and StreamWriter
|
||||
to give access to the state of codec. And so it would be possible fix
|
||||
issues with stateful codecs in a base class instead of having to fix
|
||||
them is each stateful StreamReader and StreamWriter classes.
|
||||
|
||||
It would be possible to change StreamReader and StreamWriter to make
|
||||
them use IncrementalDecoder and IncrementalEncoder.
|
||||
|
||||
A codec can implement variants which are optimized for the specific
|
||||
encoding or intercept certain stream methods to add functionality or
|
||||
improve the encoding/decoding performance. TextIOWrapper cannot
|
||||
implement such optimization, but TextIOWrapper uses incremental
|
||||
encoders and decoders and uses read and write buffers, so the overhead
|
||||
of incomplete inputs is low or nul.
|
||||
|
||||
A lot more could be done for other variable length encoding codecs,
|
||||
e.g. UTF-8, since these often have problems near the end of a read due
|
||||
to missing bytes. The UTF-32-BE/LE codecs could simply multiply the
|
||||
character position by 4 to get the byte position.
|
||||
|
||||
|
||||
Usage of StreamReader and StreamWriter
|
||||
''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
These classes are rarely used directly, but indirectly using
|
||||
codecs.open(). They are not used in Python 3 standard library (except
|
||||
in the codecs module).
|
||||
|
||||
Some projects implement their own codec with StreamReader and
|
||||
StreamWriter, but don't use these classes.
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
=======================
|
||||
|
||||
Keep the public API, codecs.open
|
||||
''''''''''''''''''''''''''''''''
|
||||
|
||||
codecs.open() can be replaced by the builtin open() function. open()
|
||||
has a similar API but has also more options.
|
||||
|
||||
codecs.open() was the only way to open a text file in Unicode mode
|
||||
until Python 2.6. Many Python 2 programs uses this function. Removing
|
||||
codecs.open() implies more work to port programs from Python 2 to
|
||||
Python 3, especially projets using the same code base for the two
|
||||
Python versions (without using 2to3 program).
|
||||
|
||||
codecs.open() is kept for backward compatibility with Python 2.
|
||||
|
||||
|
||||
Deprecate StreamReader and StreamWriter
|
||||
'''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
Instanciate StreamReader or StreamWriter must raise a
|
||||
DeprecationWarning in Python 3.3. Implement a subclass don't raise a
|
||||
DeprecationWarning.
|
||||
|
||||
codecs.open() will be changed to reuse the builtin open() function
|
||||
(TextIOWrapper).
|
||||
|
||||
EncodedFile(), StreamRandom, StreamReader, StreamReaderWriter and
|
||||
StreamWriter will be removed in Python 3.4.
|
||||
|
||||
|
||||
Issue with stateful codecs
|
||||
==========================
|
||||
|
||||
It is difficult to use correctly a stateful codec with a stream. Some
|
||||
cases are supported by the codecs module, while io has no more known
|
||||
bug related to stateful codecs. The main difference between the codecs
|
||||
and the io module is that bugs have to be fixed in StreamReader and/or
|
||||
StreamWriter classes of each codec for the codecs module, whereas bugs
|
||||
can be fixed only once in io.TextIOWrapper. Here are some examples of
|
||||
issues with stateful codecs.
|
||||
|
||||
Stateful codecs
|
||||
'''''''''''''''
|
||||
|
||||
Python supports the following stateful codecs:
|
||||
|
||||
* cp932
|
||||
* cp949
|
||||
* cp950
|
||||
* euc_jis_2004
|
||||
* euc_jisx2003
|
||||
* euc_jp
|
||||
* euc_kr
|
||||
* gb18030
|
||||
* gbk
|
||||
* hz
|
||||
* iso2022_jp
|
||||
* iso2022_jp_1
|
||||
* iso2022_jp_2
|
||||
* iso2022_jp_2004
|
||||
* iso2022_jp_3
|
||||
* iso2022_jp_ext
|
||||
* iso2022_kr
|
||||
* shift_jis
|
||||
* shift_jis_2004
|
||||
* shift_jisx0213
|
||||
* utf_8_sig
|
||||
* utf_16
|
||||
* utf_32
|
||||
|
||||
Read and seek(0)
|
||||
''''''''''''''''
|
||||
|
||||
::
|
||||
|
||||
with open(filename, 'w', encoding='utf-16') as f:
|
||||
f.write('abc')
|
||||
f.write('def')
|
||||
f.seek(0)
|
||||
assert f.read() == 'abcdef'
|
||||
f.seek(0)
|
||||
assert f.read() == 'abcdef'
|
||||
|
||||
The io and codecs modules support this usecase correctly.
|
||||
|
||||
seek(n)
|
||||
'''''''
|
||||
|
||||
::
|
||||
|
||||
with open(filename, 'w', encoding='utf-16') as f:
|
||||
f.write('abc')
|
||||
pos = f.tell()
|
||||
with open(filename, 'w', encoding='utf-16') as f:
|
||||
f.seek(pos)
|
||||
f.write('def')
|
||||
f.seek(0)
|
||||
f.write('###')
|
||||
with open(filename, 'r', encoding='utf-16') as f:
|
||||
assert f.read() == '###def'
|
||||
|
||||
The io module supports this usecase, whereas codecs fails because it
|
||||
writes a new BOM on the second write (issue #12512).
|
||||
|
||||
Append mode
|
||||
'''''''''''
|
||||
|
||||
::
|
||||
|
||||
with open(filename, 'w', encoding='utf-16') as f:
|
||||
f.write('abc')
|
||||
with open(filename, 'a', encoding='utf-16') as f:
|
||||
f.write('def')
|
||||
with open(filename, 'r', encoding='utf-16') as f:
|
||||
assert f.read() == 'abcdef'
|
||||
|
||||
The io module supports this usecase, whereas codecs fails because it
|
||||
writes a new BOM on the second write (issue #12512).
|
||||
|
||||
|
||||
Links
|
||||
=====
|
||||
|
||||
* `PEP 100: Python Unicode Integration
|
||||
<http://www.python.org/dev/peps/pep-0100/>`_
|
||||
* `PEP 3116 <http://www.python.org/dev/peps/pep-3116/>`_
|
||||
* `Issue #8796: Deprecate codecs.open()
|
||||
<http://bugs.python.org/issue8796>`_
|
||||
* `[python-dev] Deprecate codecs.open() and StreamWriter/StreamReader
|
||||
<http://mail.python.org/pipermail/python-dev/2011-May/111591.html>`_
|
||||
|
||||
|
||||
Copyright
|
||||
=========
|
||||
|
||||
This document has been placed in the public domain.
|
||||
|
||||
|
||||
Footnotes
|
||||
=========
|
||||
|
||||
.. [#f1] StreamReaderWriter has two more attributes than
|
||||
TextIOWrapper, reader and writer.
|
||||
|
Loading…
Reference in New Issue