PEP 540: open() error handler remains strict

Victor Stinner 2017-12-08 01:36:37 +01:00
parent 3b263c0ae1
commit 22b31e0e82
1 changed file with 16 additions and 14 deletions

@ -14,9 +14,9 @@ Python-Version: 3.7
Abstract
========
Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding
with the ``surrogateescape`` error handler. This mode is enabled by
default in the POSIX locale, but otherwise disabled by default.
Add a new UTF-8 mode to ignore the locale and use the UTF-8 encoding.
This mode is enabled by default in the POSIX locale, but otherwise
disabled by default.
Also add a "strict" UTF-8 mode which uses the ``strict`` error handler,
instead of ``surrogateescape``, with the UTF-8 encoding.
@ -65,10 +65,8 @@ locale coercion is ineffective.
Passthrough undecodable bytes: surrogateescape
----------------------------------------------
Using UTF-8 is nice, until you read the first file encoded to a
different encoding. When using the ``strict`` error handler, which is
the default, Python 3 raises a ``UnicodeDecodeError`` on the first
undecodable byte.
When using the ``strict`` error handler, which is the default, Python 3
raises a ``UnicodeDecodeError`` on the first undecodable byte.
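
As a minimal sketch of this behavior (the sample bytes are an illustrative
assumption, not taken from the PEP), decoding Latin-1 data as UTF-8 with the
default ``strict`` handler fails at the first undecodable byte:

```python
# Hypothetical sample: "café" encoded as Latin-1, where 0xE9 is "é".
# In UTF-8, 0xE9 starts a 3-byte sequence, so this input is malformed.
data = b"caf\xe9\n"

try:
    data.decode("utf-8")  # errors="strict" is the default
    strict_error = None
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first undecodable byte.
    strict_error = exc

print(strict_error)
```

The exception carries the offset of the offending byte in its ``start``
attribute, which is why the failure is reported on the first bad byte rather
than at the end of the input.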
Unix command line tools like ``cat`` or ``grep`` and most Python 2
applications simply do not have this class of bugs: they don't decode
@ -79,12 +77,16 @@ the ``surrogateescape`` error handler (:pep:`383`). It allows to process
data "as bytes" but uses Unicode in practice (undecodable bytes are
stored as surrogate characters).
For an application written as a Unix "pipe" tool like ``grep``, taking
input on stdin and writing output to stdout, ``surrogateescape`` lets
undecodable bytes pass through unchanged.
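
The round trip can be sketched as follows. This is only an illustration with
made-up sample bytes; it uses the standard ``bytes.decode()`` and
``str.encode()`` methods with the ``surrogateescape`` handler from :pep:`383`:

```python
# Hypothetical sample: Latin-1 bytes that are not valid UTF-8.
data = b"caf\xe9\n"

# Decoding never fails: the undecodable byte 0xE9 is stored as the
# lone surrogate U+DCE9 (0xDC00 + 0xE9).
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# Encoding with the strict handler rejects the lone surrogate.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True
```

The input bytes come back unchanged, which is the "passthrough" property a
pipe-style tool relies on.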
The UTF-8 mode uses the ``surrogateescape`` error handler for ``stdin``
and ``stdout`` since these streams are commonly associated with Unix
command line tools.
However, users have different expectations for files. Files are expected
to be properly encoded. Python is expected to fail early when ``open()``
is called with the wrong options, like opening a JPEG picture in text
mode. The ``open()`` default error handler remains ``strict`` for these
reasons.
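
A short sketch of the difference, assuming a temporary file that holds the
first four bytes of a typical JPEG header (the payload and variable names are
illustrative only). Note that the error surfaces when the data is read and
decoded, not in the ``open()`` call itself:

```python
import os
import tempfile

# Hypothetical stand-in for a binary file: typical first bytes of a JPEG.
payload = b"\xff\xd8\xff\xe0"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Default error handler (strict): reading fails on the first bad byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        failed_early = False
    except UnicodeDecodeError:
        failed_early = True

    # Opting in explicitly: undecodable bytes become lone surrogates.
    with open(path, encoding="utf-8", errors="surrogateescape") as f:
        escaped = f.read()
finally:
    os.unlink(path)
```

Code that really wants the passthrough behavior for a file can still request
it with ``errors="surrogateescape"``; it is just not the default.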
The UTF-8 encoding used with the ``surrogateescape`` error handler is a
compromise between correctness and usability.
Strict UTF-8 for correctness
----------------------------
@ -155,7 +157,7 @@ Encoding and error handler
============================  =======================  ==========================  ==========================
Function                      Default                  UTF-8 mode or POSIX locale  Strict UTF-8 mode
============================  =======================  ==========================  ==========================
open()                        locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
open()                        locale/strict            **UTF-8**/strict            **UTF-8**/strict
os.fsdecode(), os.fsencode()  locale/surrogateescape   **UTF-8**/surrogateescape   **UTF-8**/surrogateescape
sys.stdin, sys.stdout         locale/strict            **UTF-8/surrogateescape**   **UTF-8**/strict
sys.stderr                    locale/backslashreplace  **UTF-8**/backslashreplace  **UTF-8**/backslashreplace
@ -180,7 +182,7 @@ On Windows, the encodings and error handlers are different:
============================  =======================  ==========================  ==========================  ==========================
Function                      Default                  Legacy Windows FS encoding  UTF-8 mode                  Strict UTF-8 mode
============================  =======================  ==========================  ==========================  ==========================
open()                        mbcs/strict              mbcs/strict                 **UTF-8/surrogateescape**   **UTF-8**/strict
open()                        mbcs/strict              mbcs/strict                 **UTF-8**/strict            **UTF-8**/strict
os.fsdecode(), os.fsencode()  UTF-8/surrogatepass      **mbcs/replace**            UTF-8/surrogatepass         UTF-8/surrogatepass
sys.stdin, sys.stdout         UTF-8/surrogateescape    UTF-8/surrogateescape       UTF-8/surrogateescape       **UTF-8/strict**
sys.stderr                    UTF-8/backslashreplace   UTF-8/backslashreplace      UTF-8/backslashreplace      UTF-8/backslashreplace