Add "Supporting Unicode Identifiers" by Martin v. LÃ¶wis as PEP 3131.

2007-05-01 20:34:25 +00:00 · 2007-05-01 20:34:25 +00:00 · 644c0cbc31
parent 6b77a7f8e2
commit 644c0cbc31
2 changed files with 135 additions and 0 deletions
--- a/pep-0000.txt
+++ b/pep-0000.txt
@ -129,6 +129,7 @@ Index by Category
 S  3128  BList: A Faster List-like Type               Stutzbach
 S  3129  Class Decorators                             Winter
 S  3130  Access to Current Module/Class/Function      Jewett
+ S  3131  Supporting Non-ASCII Identifiers             von Löwis
 S  3141  A Type Hierarchy for Numbers                 Yasskin

 Finished PEPs (done, implemented in Subversion)
@ -500,6 +501,7 @@ Numerical Index
 S  3128  BList: A Faster List-like Type               Stutzbach
 S  3129  Class Decorators                             Winter
 S  3130  Access to Current Module/Class/Function      Jewett
+ S  3131  Supporting Non-ASCII Identifiers             von Löwis
 S  3141  A Type Hierarchy for Numbers                 Yasskin


--- a/pep-3131.txt
+++ b/pep-3131.txt
@ -0,0 +1,133 @@
+PEP: 3131
+Title: Supporting Non-ASCII Identifiers
+Version: $Revision$
+Last-Modified: $Date$
+Author: Martin v. Löwis <martin@v.loewis.de>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 1-May-2007
+Python-Version: 3.0
+Post-History:
+
+
+Abstract
+========
+
+This PEP suggests to support non-ASCII letters (such as accented characters,
+Cyrillic, Greek, Kanji, etc.) in Python identifiers.
+
+Rationale
+=========
+
+Python code is written by many people in the world who are not familiar with the
+English language, or even well-acquainted with the Latin writing system.  Such
+developers often desire to define classes and functions with names in their
+native languages, rather than having to come up with an (often incorrect)
+English translation of the concept they want to name.
+
+For some languages, common transliteration systems exist (in particular, for the
+Latin-based writing systems).  For other languages, users have larger
+difficulties to use Latin to write their native words.
+
+Common Objections
+=================
+
+Some objections are often raised against proposals similar to this one.
+
+People claim that they will not be able to use a library if to do so they have
+to use characters they cannot type on their keyboards.  However, it is the
+choice of the designer of the library to decide on various constraints for using
+the library: people may not be able to use the library because they cannot get
+physical access to the source code (because it is not published), or because
+licensing prohibits usage, or because the documentation is in a language they
+cannot understand.  A developer wishing to make a library widely available needs
+to make a number of explicit choices (such as publication, licensing, language
+of documentation, and language of identifiers).  It should always be the choice
+of the author to make these decisions - not the choice of the language
+designers.
+
+In particular, projects wishing to have wide usage probably might want to
+establish a policy that all identifiers, comments, and documentation is written
+in English (see the GNU coding style guide for an example of such a policy).
+Restricting the language to ASCII-only identifiers does not enforce comments and
+documentation to be English, or the identifiers actually to be English words, so
+an additional policy is necessary, anyway.
+
+Specification of Language Changes
+=================================
+
+The syntax of identifiers in Python will be based on the Unicode standard annex
+UAX-31 [1]_, with elaboration and changes as defined below.
+
+Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
+are the same as in Python 2.5.  This specification only introduces additional
+characters from outside the ASCII range.  For other characters, the
+classification uses the version of the Unicode Character Database as included in
+the ``unicodedata`` module.
+
+The identifier syntax is ``<ID_Start> <ID_Continue>*``.
+
+``ID_Start`` is defined as all characters having one of the general categories
+uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier
+letters (Lm), other letters (Lo), letter numbers (Nl), plus the underscore (XXX
+what are "stability extensions" listed in UAX 31).
+
+``ID_Continue`` is defined as all characters in ``ID_Start``, plus nonspacing
+marks (Mn), spacing combining marks (Mc), decimal number (Nd), and connector
+punctuations (Pc).
+
+All identifiers are converted into the normal form NFC while parsing; comparison
+of identifiers is based on NFC.
+
+Policy Specification
+====================
+
+As an addition to the Python Coding style, the following policy is prescribed:
+All identifiers in the Python standard library MUST use ASCII-only identifiers,
+and SHOULD use English words wherever feasible.
+
+As an option, this specification can be applied to Python 2.x.  In that case,
+ASCII-only identifiers would continue to be represented as byte string objects
+in namespace dictionaries; identifiers with non-ASCII characters would be
+represented as Unicode strings.
+
+Implementation
+==============
+
+The following changes will need to be made to the parser:
+
+1. If a non-ASCII character is found in the UTF-8 representation of the source
+   code, a forward scan is made to find the first ASCII non-identifier character
+   (e.g. a space or punctuation character)
+
+2. The entire UTF-8 string is passed to a function to normalize the string to
+   NFC, and then verify that it follows the identifier syntax. No such callout
+   is made for pure-ASCII identifiers, which continue to be parsed the way they
+   are today.
+
+3. If this specification is implemented for 2.x, reflective libraries (such as
+   pydoc) must be verified to continue to work when Unicode strings appear in
+   ``__dict__`` slots as keys.
+
+References
+==========
+
+.. [1] http://www.unicode.org/reports/tr31/
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End: