PEP 538: update for python-dev & implementation feedback

- PYTHONCOERCECLOCALE=0 now also disables the library warning
- PEP just refers to locale-aware/locale-independent components, without specifically limiting that to C/C++ components
2017-03-13 15:06:48 +10:00 · 2017-03-13 15:06:48 +10:00 · 48f355fc28
parent 8ae8b612d4
commit 48f355fc28
1 changed files with 86 additions and 55 deletions
--- a/pep-0538.txt
+++ b/pep-0538.txt
@@ -19,7 +19,7 @@ Abstract

 An ongoing challenge with Python 3 on \*nix systems is the conflict between
 needing to use the configured locale encoding by default for consistency with
-other C/C++ components in the same process and those invoked in subprocesses,
+other locale-aware components in the same process and in subprocesses,
 and the fact that the standard C locale (as defined in POSIX:2001) typically
 implies a default text encoding of ASCII, which is entirely inadequate for the
 development of networked services and client applications in a multilingual
@@ -33,8 +33,8 @@ This is a good approach for cases where network encoding interoperability
 is a more important concern than local encoding interoperability.

 However, it comes at the cost of making CPython's encoding assumptions diverge
-from those of other C and C++ components in the same process, as well as those
-of components running in subprocesses that share the same environment.
+from those of other locale-aware components in the same process, as well as
+those of components running in subprocesses that share the same environment.

 It also requires changes to the internals of how CPython itself works, rather
 than using existing configuration settings that are supported by Python
@@ -55,11 +55,12 @@ changed such that:
  ``PYTHONUTF8`` (if not otherwise set) will be set to ``1``
 * if the subsequent runtime initialization process detects that the legacy
  ``C`` locale remains active (e.g. none of ``C.UTF-8``, ``C.utf8`` or ``UTF-8``
-  are available, locale coercion is disabled, or the runtime is embedded in an
-  application other than the main CPython binary), and the ``PYTHONUTF8``
-  feature defined in PEP 540 is also disabled (or not implemented), it  will
-  emit a warning on stderr that use of the legacy ``C`` locale's default ASCII
-  text encoding may cause various Unicode compatibility issues
+  are available, or the runtime is embedded in an application other than the
+  main CPython binary), locale coercion is not explicitly disabled, and the
+  ``PYTHONUTF8`` feature defined in PEP 540 is also disabled (or not
+  implemented), it will emit a warning on stderr that use of the legacy
+  ``C`` locale's default ASCII text encoding may cause various Unicode
+  compatibility issues

 With this change, any \*nix platform that does *not* offer at least one of the
 ``C.UTF-8``, ``C.utf8`` or ``UTF-8`` locales as part of its standard
@@ -163,7 +164,7 @@ definition.

 Another common failure case is developers specifying ``LANG=C`` in order to
 see otherwise translated user interface messages in English, rather than the
-more narrowly scoped ``LC_MESSAGES=C``.
+more narrowly scoped ``LC_MESSAGES=C`` or ``LANGUAGE=en``.


 Relationship with other PEPs
@@ -175,20 +176,20 @@ solution:

 * PEP 540 proposes to entirely decouple CPython's default text encoding from
  the C locale system in that case, allowing text handling inconsistencies to
-  arise between CPython and other C/C++ components running in the same process
-  and in subprocesses. This approach aims to make CPython behave less like a
-  locale-aware C/C++ application, and more like C/C++ independent language
+  arise between CPython and other locale-aware components running in the same
+  process and in subprocesses. This approach aims to make CPython behave less
+  like a locale-aware application, and more like locale-independent language
  runtimes like the JVM, .NET CLR, Go, Node.js, and Rust
 * this PEP proposes to override the legacy C locale with a more recently
  defined locale that uses UTF-8 as its default text encoding. This means that
  the text encoding override will apply not only to CPython, but also to any
-  locale aware extension modules loaded into the current process, as well as to
-  locale aware C/C++ applications invoked in subprocesses that inherit their
+  locale-aware extension modules loaded into the current process, as well as to
+  locale-aware applications invoked in subprocesses that inherit their
  environment from the parent process. This approach aims to retain CPython's
-  traditional strong support for integration with other components written
-  in C and C++, while actively helping to push forward the adoption and
-  standardisation of the C.UTF-8 locale as a Unicode-aware replacement for
-  the legacy C locale in the wider C/C++ ecosystem
+  traditional strong support for integration with other locale-aware components
+  while also actively helping to push forward the adoption and standardisation
+  of the C.UTF-8 locale as a Unicode-aware replacement for the legacy C locale
+  in the wider C/C++ ecosystem

 After reviewing both PEPs, it became clear that they didn't actually conflict
 at a technical level, and the proposal in PEP 540 offered a superior option in
@@ -197,9 +198,9 @@ reference behaviour for platforms where the notion of a "locale encoding"
 doesn't make sense (for example, embedded systems running MicroPython rather
 than the CPython reference interpreter).

-Meanwhile, this PEP offered improved compatibility with other C/C++ components,
-and an approach more amenable to being backported to Python 3.6 by downstream
-redistributors.
+Meanwhile, this PEP offered improved compatibility with other locale-aware
+components, and an approach more amenable to being backported to Python 3.6
+by downstream redistributors.

 As a result, this PEP was amended to refer to PEP 540 as a complementary
 solution that offered improved behaviour both when locale coercion triggered,
@@ -323,7 +324,7 @@ proposed solution:
  even running with ``-Werror`` won't turn it into a runtime exception
 * any changes made will use *existing* configuration options

-To minimize the negative impact on systems currently correctly configured to
+Minimizing the negative impact on systems currently correctly configured to
 use GB-18030 or another partially ASCII compatible universal encoding leads to
 an additional design principle:

@@ -345,7 +346,8 @@ run as a standalone command line application.

 It further proposes to emit a warning on stderr if the legacy ``C`` locale
 is in effect at the point where the language runtime itself is initialized,
-and the PEP 540 UTF-8 encoding override is also disabled, in order to warn
+the explicit environmental flag to disable locale coercion is not set, and
+the PEP 540 UTF-8 encoding override is also disabled, in order to warn
 system and application integrators that they're running CPython in an
 unsupported configuration.

@@ -358,7 +360,7 @@ reconfigure the C locale before any locale dependent operations are executed
 in the process.

 This means that it can change the locale settings not only for the CPython
-runtime, but also for any other C/C++ components running in the current
+runtime, but also for any other locale-aware components running in the current
 process (e.g. as part of extension modules), as well as in subprocesses that
 inherit their environment from the current process.

@@ -409,31 +411,37 @@ When this locale coercion is activated, the following warning will be
 printed on stderr, with the warning containing whichever locale was
 successfully configured::

-    Python detected LC_CTYPE=C, LC_ALL & LANG set to C.UTF-8 (set another
+    Python detected LC_CTYPE=C: LC_ALL & LANG coerced to C.UTF-8 (set another
    locale or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

 When falling back to the ``UTF-8`` locale, the message would be slightly
 different::

-    Python detected LC_CTYPE=C, LC_CTYPE set to UTF-8 (set another locale
+    Python detected LC_CTYPE=C: LC_CTYPE coerced to UTF-8 (set another locale
    or PYTHONCOERCECLOCALE=0 to disable this locale coercion behaviour).

 In combination with PEP 540, this locale coercion will mean that the standard
-Python binary *and* locale aware C/C++ extensions should once again "just work"
+Python binary *and* locale-aware extensions should once again "just work"
 in the three main failure cases we're aware of (missing locale
 settings, SSH forwarding of unknown locales, and developers explicitly
 requesting ``LANG=C``), as long as the target platform provides at least one
 of the candidate UTF-8 based environments.

-If ``PYTHONCOERCECLOCALE=0`` is set, or none of the candidate locales is
-successfully configured, then initialization will continue as usual in the C
-locale and the Unicode compatibility warning described in the next section will
-be emitted just as it would for any other application.
+If none of the candidate locales are successfully configured, then
+initialization will continue in the C locale and the Unicode compatibility
+warning described in the next section will be emitted just as it would for
+any other application.
+
+If ``PYTHONCOERCECLOCALE=0`` is explicitly set, initialization will continue in
+the C locale and the Unicode compatibility warning described in the next
+section will be automatically suppressed.

 The interpreter will always check for the ``PYTHONCOERCECLOCALE`` environment
-variable (even when running under the ``-E`` or ``-I`` switches), as the locale
-coercion check necessarily takes place before any command line argument
-processing.
+variable at startup (even when running under the ``-E`` or ``-I`` switches),
+as the locale coercion check necessarily takes place before any command line
+argument processing. For consistency, the runtime check to determine whether
+or not to suppress the locale compatibility warning will be similarly
+independent of these settings.


 Changes to the runtime initialization process
@@ -446,12 +454,12 @@ doing so would introduce inconsistencies in decoded text, even in the context
 of the standalone Python interpreter binary.

 Accordingly, when ``Py_Initialize`` is called and CPython detects that the
-configured locale is still the default ``C`` locale *and* the ``PYTHONUTF8``
-feature from PEP 540 is disabled, the following warning will
-be issued::
+configured locale is still the default ``C`` locale, ``PYTHONCOERCECLOCALE=0``
+is not set, *and* the ``PYTHONUTF8`` feature from PEP 540 is disabled (or not
+implemented), the following warning will be issued::

   Python runtime initialized with LC_CTYPE=C (a locale with default ASCII
-   encoding), which may cause Unicode compatibility problems. Using C.UTF-8
+   encoding), which may cause Unicode compatibility problems. Using C.UTF-8,
   C.utf8, or UTF-8 (if available) as alternative Unicode-compatible
   locales is recommended.

@@ -461,9 +469,10 @@ Instead, the warning informs both system and application integrators that
 they're running Python 3 in a configuration that we don't expect to work
 properly.

-The second sentence providing recommendations would be conditionally compiled
-based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8`` on \*BSD
-systems.
+The second sentence providing recommendations may eventually be conditionally
+compiled based on the operating system (e.g. recommending ``LC_CTYPE=UTF-8``
+on \*BSD systems), but the initial implementation will just use the common
+generic message shown above.


 New build-time configuration options
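
Reviewer note: the conditions under which the library warning is emitted after this change (legacy ``C`` locale still active, coercion not explicitly disabled, PEP 540 UTF-8 mode off) can be summarised in a minimal Python sketch. The function name, its parameters, and the use of a plain dictionary for the environment are assumptions made for illustration only; CPython performs the real check in C during runtime initialization.

```python
import os


def should_emit_c_locale_warning(lc_ctype, environ=None, utf8_mode=False):
    """Hypothetical sketch of the warning decision described in this commit.

    lc_ctype  -- name of the active LC_CTYPE locale (assumed input)
    environ   -- mapping of environment variables (defaults to os.environ)
    utf8_mode -- True if the PEP 540 UTF-8 mode is enabled (assumed input)
    """
    if environ is None:
        environ = os.environ
    # The warning only applies while the legacy C locale is still active.
    legacy_c_locale = lc_ctype == "C"
    # Per the commit message, PYTHONCOERCECLOCALE=0 also disables the warning.
    coercion_disabled = environ.get("PYTHONCOERCECLOCALE") == "0"
    return legacy_c_locale and not coercion_disabled and not utf8_mode


if __name__ == "__main__":
    print(should_emit_c_locale_warning("C", {}))
    print(should_emit_c_locale_warning("C", {"PYTHONCOERCECLOCALE": "0"}))
    print(should_emit_c_locale_warning("C.UTF-8", {}))
```

Passing the environment in as a mapping keeps the sketch independent of the real process environment, mirroring the PEP's point that this check ignores the ``-E`` and ``-I`` switches.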
@@ -490,6 +499,15 @@ Platform Support Changes

 A new "Legacy C Locale" section will be added to PEP 11 that states:

+* as of CPython 3.7, \*nix platforms are expected to provide at least one of
+  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
+  ``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
+  Any Unicode related integration problems that occur only in that locale and
+  cannot be reproduced in an appropriately configured non-ASCII locale will be
+  closed as "won't fix".
+
+If PEP 540 is also implemented, then this section would instead state:
+
 * as of CPython 3.7, the legacy C locale is only supported when operating in
  "UTF-8" mode. Any Unicode handling issues that occur only in that locale
  and cannot be reproduced in an appropriately configured non-ASCII locale will
@@ -497,9 +515,9 @@ A new "Legacy C Locale" section will be added to PEP 11 that states:
 * as of CPython 3.7, \*nix platforms are expected to provide at least one of
  ``C.UTF-8`` (full locale), ``C.utf8`` (full locale) or ``UTF-8`` (
  ``LC_CTYPE``-only locale) as an alternative to the legacy ``C`` locale.
-  Any Unicode related integration problems with C/C++ extensions that occur
-  only in that locale and cannot be reproduced in an appropriately configured
-  non-ASCII locale will be closed as "won't fix".
+  Any Unicode related integration problems with other locale-aware components
+  that occur only in that locale and cannot be reproduced in an appropriately
+  configured non-ASCII locale will be closed as "won't fix".


 Rationale
@@ -524,9 +542,9 @@ The challenge for CPython has been the fact that in addition to being used for
 network service development, it is also extensively used as an embedded
 scripting language in larger applications, and as a desktop application
 development language, where it is more important to be consistent with other
-C/C++ components sharing the same process, as well as with the user's desktop
-locale settings, than it is with the emergent conventions of modern network
-service development.
+locale-aware components sharing the same process, as well as with the user's
+desktop locale settings, than it is with the emergent conventions of modern
+network service development.

 The core premise of this PEP is that for *all* of these use cases, the
 assumption of ASCII implied by the default "C" locale is the wrong choice,
@@ -677,6 +695,10 @@ and would hence also make the last example display the desired output::
      | iconv -f GB18030 -t UTF-8
    ℙƴ☂ℌøἤ

+However, such a change would have broader implications than the C locale
+specific changes currently proposed, so it is considered out of scope for this
+PEP.
+

 Dropping official support for ASCII based text handling in the legacy C locale
 ------------------------------------------------------------------------------
@@ -689,10 +711,13 @@ them to text (C/C++, Python 2.x, Ruby, etc), or else to ignore the nominal
 C/C++ locale encoding entirely and assume the use of either UTF-8 (PEP 540,
 Rust, Go, Node.js, etc) or UTF-16-LE (JVM, .NET CLR).

-While this PEP ensures that developers that need to do so can still opt-in to
-running their Python code in the legacy C locale, it also makes clear that we
-*don't* expect Python 3's Unicode handling to be reliable in that configuration,
-and the recommended alternative is to use a more appropriate locale setting.
+While this PEP ensures that developers who genuinely need to do so can still
+opt in to running their Python code in the legacy C locale (either by setting
+``PYTHONCOERCECLOCALE=0`` or running a custom build configured with
+``--without-c-locale-coercion``), it also makes it clear that we *don't*
+expect Python 3's Unicode handling to be completely reliable in that
+configuration, and the recommended alternative is to use a more appropriate
+locale setting (or PEP 540's UTF-8 mode, if that is available).


 Providing implicit locale coercion only when running standalone
@@ -743,10 +768,10 @@ components in the current process, and components written in arbitrary
 languages in subprocesses.

 Setting ``LC_ALL`` to ``C.UTF-8`` imposes a locale setting override on all
-C/C++ components in the current process and in any subprocesses that inherit
-the current environment. This is important to handle cases where the problem
-has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a system
-where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
+locale-aware components in the current process and in any subprocesses that
+inherit the current environment. This is important to handle cases where the
+problem has arisen from a setting like ``LC_CTYPE=UTF-8`` being provided on a
+system where no ``UTF-8`` locale is defined (e.g. when a Mac OS X ssh client is
 configured to forward locale settings, and the user logs into a Linux server).

 Setting ``LANG`` to ``C.UTF-8`` ensures that even components that only check
@@ -797,6 +822,9 @@ conjunction with the changes needed to also provide a suitable locale by
 default, or else specifically for platforms where such a locale is already
 consistently available.

+At least the Fedora project is planning to pursue this approach for the
+upcoming Fedora 26 release [19_].
+

 Backporting to other 3.x releases
 ---------------------------------
@@ -909,6 +937,9 @@ References
 .. [18] GitHub branch diff for ``ncoghlan:pep538-coerce-c-locale``
   (https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale)

+.. [19] Fedora 26 change proposal for locale coercion backport
+   (https://fedoraproject.org/wiki/Changes/python3_c.utf-8_locale)
+
 Copyright
 =========