PEP: 400
Title: Deprecate codecs.StreamReader and codecs.StreamWriter
Version: $Revision$
Last-Modified: $Date$
Author: Victor Stinner <vstinner@python.org>
Status: Deferred
Type: Standards Track
Content-Type: text/x-rst
Created: 28-May-2011
Python-Version: 3.3


Abstract
========

io.TextIOWrapper and codecs.StreamReaderWriter offer the same API
[#f1]_. TextIOWrapper has more features and is faster than
StreamReaderWriter. Duplicate code means that bugs have to be fixed
twice and that subtle differences may arise between the two
implementations.

The codecs module was introduced in Python 2.0 (see :pep:`100`).
The io module was introduced in Python 2.6 and 3.0 (see :pep:`3116`),
and reimplemented in C in Python 2.7 and 3.1.

PEP Deferral
============

Further exploration of the concepts covered in this PEP has been deferred
for lack of a current champion interested in promoting the goals of the PEP
and collecting and incorporating feedback, and with sufficient available
time to do so effectively.

Motivation
==========

When the Python I/O model was updated for 3.0, the concept of a
"stream-with-known-encoding" was introduced in the form of
io.TextIOWrapper. As this class is critical to the performance of
text-based I/O in Python 3, the io module has an optimised C version
which is used by CPython by default. Many corner cases in handling
buffering, stateful codecs and universal newlines have been dealt with
since the release of Python 3.0.

This new interface overlaps heavily with the legacy
codecs.StreamReader, codecs.StreamWriter and codecs.StreamReaderWriter
interfaces that were part of the original codec interface design in
:pep:`100`. These interfaces are organised around the principle of an
encoding with an associated stream (i.e. the reverse of the arrangement
in the io module), so the original :pep:`100` design required that codec
writers provide appropriate StreamReader and StreamWriter
implementations in addition to the core codec encode() and decode()
methods. This places a heavy burden on codec authors providing these
specialised implementations to correctly handle many of the corner
cases (see `Appendix A <PEP 400 Appendix A_>`_) that have now been
dealt with by io.TextIOWrapper. While deeper integration between the
codec and the stream allows for additional optimisations in theory,
these optimisations have in practice either not been carried out, or
else the associated code duplication means that the corner cases that
have been fixed in io.TextIOWrapper are still not handled correctly in
the various StreamReader and StreamWriter implementations.

Accordingly, this PEP proposes that:

* codecs.open() be updated to delegate to the builtin open() in Python
  3.3;
* the legacy codecs.Stream* interfaces, including the streamreader and
  streamwriter attributes of codecs.CodecInfo, be deprecated in Python
  3.3.


Rationale
=========

StreamReader and StreamWriter issues
''''''''''''''''''''''''''''''''''''

* StreamReader is unable to translate newlines.
* StreamWriter doesn't support "line buffering" (flush if the input
  text contains a newline).
* StreamReader classes of the CJK encodings (e.g. GB18030) only
  support UNIX newlines ('\\n').
* StreamReader and StreamWriter are stateful codecs but don't expose
  functions to control their state (getstate() or setstate()). Each
  codec has to handle corner cases, see `Appendix A <PEP 400 Appendix A_>`_.
* StreamReader and StreamWriter are very similar to IncrementalDecoder
  and IncrementalEncoder, so some code is duplicated for stateful codecs
  (e.g. UTF-16).
* Each codec has to reimplement its own StreamReader and StreamWriter
  class, even if it's trivial (just call the encoder/decoder).
* codecs.open(filename, "r") creates an io.TextIOWrapper object when no
  encoding is given.
* No codec implements an optimized method in StreamReader or
  StreamWriter based on the specificities of the codec.

Issues in the bug tracker:

* `Issue #5445 <http://bugs.python.org/issue5445>`_ (2009-03-08):
  codecs.StreamWriter.writelines problem when passed generator
* `Issue #7262 <http://bugs.python.org/issue7262>`_ (2009-11-04):
  codecs.open() + eol (windows)
* `Issue #8260 <http://bugs.python.org/issue8260>`_ (2010-03-29):
  When I use codecs.open(...) and f.readline() follow up by f.read()
  return bad result
* `Issue #8630 <http://bugs.python.org/issue8630>`_ (2010-05-05):
  Keepends param in codec readline(s)
* `Issue #10344 <http://bugs.python.org/issue10344>`_ (2010-11-06):
  codecs.readline doesn't care buffering
* `Issue #11461 <http://bugs.python.org/issue11461>`_ (2011-03-10):
  Reading UTF-16 with codecs.readline() breaks on surrogate pairs
* `Issue #12446 <http://bugs.python.org/issue12446>`_ (2011-06-30):
  StreamReader Readlines behavior odd
* `Issue #12508 <http://bugs.python.org/issue12508>`_ (2011-07-06):
  Codecs Anomaly
* `Issue #12512 <http://bugs.python.org/issue12512>`_ (2011-07-07):
  codecs: StreamWriter issues with stateful codecs after a seek or
  with append mode
* `Issue #12513 <http://bugs.python.org/issue12513>`_ (2011-07-07):
  codec.StreamReaderWriter: issues with interlaced read-write

TextIOWrapper features
''''''''''''''''''''''

* TextIOWrapper supports any kind of newline, including translating
  newlines (to UNIX newlines), both when reading and when writing.
* TextIOWrapper reuses the incremental encoders and decoders of the
  codecs module (no duplication of code).
* The io module (TextIOWrapper) is faster than the codecs module
  (StreamReader). It is implemented in C, whereas codecs is
  implemented in Python.
* TextIOWrapper has a readahead algorithm which speeds up small
  reads: read character by character or line by line (io is 10 to
  25 times faster than codecs on these operations).
* TextIOWrapper has a write buffer.
* TextIOWrapper.tell() is optimized.
* TextIOWrapper supports random access (read+write) using a single
  class, which makes it possible to optimize interlaced read-write
  (but no such optimization is implemented).
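As a small illustration of the newline handling listed above, the
following sketch wraps in-memory byte streams in io.TextIOWrapper. The
sample data and variable names are made up for the example; only the io
calls come from the standard library.

```python
import io

# Reading: newline=None enables universal newlines, so '\r\n' and '\r'
# are both translated to '\n'.
raw = io.BytesIO("one\r\ntwo\rthree\n".encode("utf-8"))
reader = io.TextIOWrapper(raw, encoding="utf-8", newline=None)
lines = reader.readlines()

# Writing: newline='\r\n' translates every '\n' written to '\r\n'.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="utf-8", newline="\r\n")
writer.write("one\ntwo\n")
writer.flush()
written = sink.getvalue()
```

Both directions are handled by the same class, independently of the
codec in use.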

TextIOWrapper issues
''''''''''''''''''''

* `Issue #12215 <http://bugs.python.org/issue12215>`_ (2011-05-30):
  TextIOWrapper: issues with interlaced read-write

Possible improvements of StreamReader and StreamWriter
''''''''''''''''''''''''''''''''''''''''''''''''''''''

By adding codec state read/write functions to the StreamReader and
StreamWriter classes, it will become possible to fix issues with
stateful codecs in a base class instead of in each stateful
StreamReader and StreamWriter class.

It would be possible to change StreamReader and StreamWriter to make
them use IncrementalDecoder and IncrementalEncoder.
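A minimal sketch of that idea, assuming a hypothetical helper
decode_chunks() that is not part of the codecs module: the stream
handling is written once, and the codec-specific work is left to the
incremental decoder returned by codecs.getincrementaldecoder().

```python
import codecs
import io

def decode_chunks(stream, encoding, chunk_size=4096, errors="strict"):
    """Yield decoded text from a binary stream (hypothetical helper)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    while True:
        chunk = stream.read(chunk_size)
        # final=True on the last (empty) read lets the decoder report
        # truncated input instead of silently dropping it.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text
        if not chunk:
            break

# Tiny chunks split multi-byte sequences; the decoder buffers the
# incomplete bytes between calls.
data = "h\u00e9llo \u20ac".encode("utf-8")
result = "".join(decode_chunks(io.BytesIO(data), "utf-8", chunk_size=1))
```

The same helper works unchanged for a stateful codec such as UTF-16,
because the BOM handling lives in the incremental decoder.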

A codec can implement variants which are optimized for the specific
encoding or intercept certain stream methods to add functionality or
improve the encoding/decoding performance. TextIOWrapper cannot
implement such optimizations, but it uses incremental encoders and
decoders as well as read and write buffers, so the overhead of
incomplete inputs is low or nil.

A lot more could be done for other variable-length encoding codecs,
e.g. UTF-8, since these often have problems near the end of a read due
to missing bytes. The UTF-32-BE/LE codecs could simply multiply the
character position by 4 to get the byte position.
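The UTF-32 remark can be checked with a few lines. This is only a
sketch: byte_offset() is a hypothetical helper and the sample string is
arbitrary.

```python
text = "na\u00efve \U0001d11e clef"
data = text.encode("utf-32-le")  # no BOM, always 4 bytes per code point

def byte_offset(char_index):
    """Byte position of a character index in UTF-32-LE/BE data."""
    return 4 * char_index

# Locate a non-BMP character by index and slice its bytes directly.
i = text.index("\U0001d11e")
chunk = data[byte_offset(i):byte_offset(i + 1)]
recovered = chunk.decode("utf-32-le")
```

No such shortcut exists for UTF-8, where the number of bytes per
character varies.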


Usage of StreamReader and StreamWriter
''''''''''''''''''''''''''''''''''''''

These classes are rarely used directly; they are mostly used indirectly
through codecs.open(). They are not used in the Python 3 standard library
(except in the codecs module).

Some projects implement their own codec with StreamReader and
StreamWriter, but don't use these classes.


Backwards Compatibility
=======================

Keep the public API, codecs.open
''''''''''''''''''''''''''''''''

codecs.open() can be replaced by the builtin open() function. open()
has a similar API but also has more options. Both functions return
file-like objects (same API).
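The sketch below shows the two calls side by side on a temporary file
with Windows line endings. The file content is sample data invented for
the example. One visible difference is that codecs.open() opens the
underlying file in binary mode, so newlines are not translated, whereas
the builtin open() translates them by default.

```python
import codecs
import os
import tempfile
import warnings

# Write a small file with Windows line endings (sample data for the sketch).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("caf\u00e9\r\nth\u00e9\r\n".encode("utf-8"))

try:
    with warnings.catch_warnings():
        # Recent Python versions may warn that codecs.open() is deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        with codecs.open(path, "r", encoding="utf-8") as f:
            legacy_text = f.read()

    # Same call shape with the builtin; newline translation is on by default.
    with open(path, "r", encoding="utf-8") as f:
        builtin_text = f.read()
finally:
    os.remove(path)
```

Code that relies on seeing the raw '\r\n' sequences can pass newline=''
to open() to disable the translation.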

codecs.open() was the only way to open a text file in Unicode mode
until Python 2.6. Many Python 2 programs use this function. Removing
codecs.open() implies more work to port programs from Python 2 to
Python 3, especially projects using the same code base for the two
Python versions (without using the 2to3 program).

codecs.open() is kept for backward compatibility with Python 2.


Deprecate StreamReader and StreamWriter
'''''''''''''''''''''''''''''''''''''''

Instantiating StreamReader or StreamWriter must emit a DeprecationWarning in
Python 3.3. Defining a subclass doesn't emit a DeprecationWarning.
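One way to obtain this behaviour is to emit the warning from __init__.
The sketch below uses a hypothetical LegacyStreamReader class as a
stand-in for codecs.StreamReader; it is not the implementation proposed
for the codecs module, only an illustration of the rule stated above.

```python
import io
import warnings

class LegacyStreamReader:
    """Hypothetical stand-in for codecs.StreamReader (illustration only)."""

    def __init__(self, stream, errors="strict"):
        # Warn at instantiation time; stacklevel=2 points at the caller.
        warnings.warn(
            "LegacyStreamReader is deprecated, use io.TextIOWrapper",
            DeprecationWarning,
            stacklevel=2,
        )
        self.stream = stream
        self.errors = errors

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    # Defining a subclass runs no __init__, so nothing is emitted here.
    class MyCodecReader(LegacyStreamReader):
        pass

    after_subclassing = len(caught)

    # Creating an instance triggers the warning.
    MyCodecReader(io.BytesIO())
    after_instantiation = len(caught)
```

Codec authors can therefore keep shipping StreamReader subclasses
without warnings; only code that actually instantiates them is notified.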

codecs.open() will be changed to reuse the builtin open() function
(TextIOWrapper) to read and write text files.


Alternative Approach
====================

An alternative to the deprecation of the codecs.Stream* classes is to rename
codecs.open() to codecs.open_stream(), and to create a new codecs.open()
function that reuses open() and therefore io.TextIOWrapper.


.. _PEP 400 Appendix A:

Appendix A: Issues with stateful codecs
=======================================

It is difficult to use a stateful codec correctly with a stream. Some
cases are supported by the codecs module, while io has no remaining known
bugs related to stateful codecs. The main difference between the codecs
and the io module is that bugs have to be fixed in the StreamReader and/or
StreamWriter classes of each codec for the codecs module, whereas bugs
can be fixed only once in io.TextIOWrapper. Here are some examples of
issues with stateful codecs.

Stateful codecs
'''''''''''''''

Python supports the following stateful codecs:

* cp932
* cp949
* cp950
* euc_jis_2004
* euc_jisx0213
* euc_jp
* euc_kr
* gb18030
* gbk
* hz
* iso2022_jp
* iso2022_jp_1
* iso2022_jp_2
* iso2022_jp_2004
* iso2022_jp_3
* iso2022_jp_ext
* iso2022_kr
* shift_jis
* shift_jis_2004
* shift_jisx0213
* utf_8_sig
* utf_16
* utf_32

Read and seek(0)
''''''''''''''''

::

    # 'w+' so that the file can be read back after writing
    with open(filename, 'w+', encoding='utf-16') as f:
        f.write('abc')
        f.write('def')
        f.seek(0)
        assert f.read() == 'abcdef'
        f.seek(0)
        assert f.read() == 'abcdef'

The io and codecs modules support this use case correctly.

seek(n)
'''''''

::

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
        pos = f.tell()
    with open(filename, 'w', encoding='utf-16') as f:
        f.seek(pos)
        f.write('def')
        f.seek(0)
        f.write('###')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == '###def'

The io module supports this use case, whereas codecs fails because it
writes a new BOM on the second write (`issue #12512
<http://bugs.python.org/issue12512>`_).

Append mode
'''''''''''

::

    with open(filename, 'w', encoding='utf-16') as f:
        f.write('abc')
    with open(filename, 'a', encoding='utf-16') as f:
        f.write('def')
    with open(filename, 'r', encoding='utf-16') as f:
        assert f.read() == 'abcdef'

The io module supports this use case, whereas codecs fails because it
writes a new BOM on the second write (`issue #12512
<http://bugs.python.org/issue12512>`_).
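The duplicated BOM can be reproduced without any file by driving the
UTF-16 incremental encoder by hand. The sketch assumes CPython's utf_16
incremental encoder, for which setstate(0) means that the BOM has
already been written; the variable names are illustrative.

```python
import codecs

make_encoder = codecs.getincrementalencoder("utf-16")

# First "session": a fresh encoder emits the BOM before the text.
first = make_encoder().encode("abc")

# Naive append: another fresh encoder emits a second BOM.
naive_tail = make_encoder().encode("def")
naive_result = (first + naive_tail).decode("utf-16")

# State-aware append: mark the BOM as already written before encoding.
# Assumption: 0 is the "BOM already written" state of CPython's utf_16 encoder.
resumed = make_encoder()
resumed.setstate(0)
fixed_tail = resumed.encode("def")
fixed_result = (first + fixed_tail).decode("utf-16")
```

In the naive case the second BOM is decoded as a U+FEFF character in
the middle of the text. Adjusting the encoder state before appending is
the kind of fix that can be made once in a shared stream class rather
than in every codec.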


Links
=====

* :pep:`PEP 100: Python Unicode Integration <100>`
* :pep:`PEP 3116: New I/O <3116>`
* `Issue #8796: Deprecate codecs.open()
  <http://bugs.python.org/issue8796>`_
* `[python-dev] Deprecate codecs.open() and StreamWriter/StreamReader
  <https://mail.python.org/pipermail/python-dev/2011-May/111591.html>`_


Copyright
=========

This document has been placed in the public domain.


Footnotes
=========

.. [#f1] StreamReaderWriter has two more attributes than
   TextIOWrapper: reader and writer.