Add "Supporting Unicode Identifiers" by Martin v. Löwis as PEP 3131.
parent 6b77a7f8e2
commit 644c0cbc31
@@ -129,6 +129,7 @@ Index by Category
 S  3128  BList: A Faster List-like Type                Stutzbach
 S  3129  Class Decorators                              Winter
 S  3130  Access to Current Module/Class/Function       Jewett
 S  3131  Supporting Non-ASCII Identifiers              von Löwis
 S  3141  A Type Hierarchy for Numbers                  Yasskin

 Finished PEPs (done, implemented in Subversion)

@@ -500,6 +501,7 @@ Numerical Index
 S  3128  BList: A Faster List-like Type                Stutzbach
 S  3129  Class Decorators                              Winter
 S  3130  Access to Current Module/Class/Function       Jewett
 S  3131  Supporting Non-ASCII Identifiers              von Löwis
 S  3141  A Type Hierarchy for Numbers                  Yasskin

@@ -0,0 +1,133 @@
PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:


Abstract
========

This PEP proposes supporting non-ASCII letters (such as accented characters,
Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not familiar with the
English language, or even well acquainted with the Latin writing system. Such
developers often want to define classes and functions with names in their
native languages, rather than having to come up with an (often incorrect)
English translation of the concept they want to name.

For some languages, common transliteration systems exist (in particular, for the
Latin-based writing systems). For other languages, users have greater
difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if doing so requires
characters they cannot type on their keyboards. However, it is up to the
designer of the library to decide on the various constraints for using the
library: people may not be able to use the library because they cannot get
physical access to the source code (because it is not published), or because
licensing prohibits usage, or because the documentation is in a language they
cannot understand. A developer wishing to make a library widely available needs
to make a number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). These decisions should always
be the author's to make, not the language designers'.

In particular, projects wishing to have wide usage may want to establish a
policy that all identifiers, comments, and documentation are written in English
(see the GNU coding style guide for an example of such a policy). Restricting
the language to ASCII-only identifiers does not force comments and
documentation to be in English, or the identifiers actually to be English
words, so an additional policy is necessary anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode standard annex
UAX-31 [1]_, with elaboration and changes as defined below.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
are the same as in Python 2.5. This specification only introduces additional
characters from outside the ASCII range. For these other characters, the
classification uses the version of the Unicode Character Database as included in
the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general categories
uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier
letters (Lm), other letters (Lo), letter numbers (Nl), plus the underscore (XXX
what are "stability extensions" listed in UAX 31).

``ID_Continue`` is defined as all characters in ``ID_Start``, plus nonspacing
marks (Mn), spacing combining marks (Mc), decimal numbers (Nd), and connector
punctuation (Pc).

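
The category lists above can be checked directly against the ``unicodedata``
module. The following is only a minimal sketch of that check, not the parser's
actual implementation: the function names (``is_id_start``, ``is_id_continue``,
``is_identifier``) are made up for illustration, keywords are not excluded, and
the open question about UAX 31 "stability extensions" is left unaddressed.

```python
import unicodedata

# General categories the draft lists for ID_Start (underscore handled separately).
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Additional categories the draft allows in ID_Continue.
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """Return True if ch may begin an identifier under the draft rules."""
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """Return True if ch may appear after the first character."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_identifier(name):
    """Check name against the syntax <ID_Start> <ID_Continue>*."""
    if not name:
        return False
    return is_id_start(name[0]) and all(is_id_continue(c) for c in name[1:])
```

Under these rules a name such as ``Löwis`` or ``переменная`` is accepted, while
a name starting with a digit or containing a currency symbol is rejected.
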
All identifiers are converted into the normal form NFC while parsing; comparison
of identifiers is based on NFC.

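
To illustrate why normalization matters, the sketch below compares two
spellings of the same visible name, one using a precomposed character and one
using a base letter plus a combining mark. The helper name
``normalize_identifier`` is hypothetical; only ``unicodedata.normalize`` is a
real library call.

```python
import unicodedata

composed = "caf\u00e9"       # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)


def normalize_identifier(name):
    """Return the NFC form of an identifier, as the parser would store it."""
    return unicodedata.normalize("NFC", name)


# The raw strings differ, but their NFC forms are identical, so both
# spellings would refer to the same identifier.
same_identifier = normalize_identifier(composed) == normalize_identifier(decomposed)
```
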
Policy Specification
====================

As an addition to the Python coding style, the following policy is prescribed:
All identifiers in the Python standard library MUST be ASCII-only, and SHOULD
use English words wherever feasible.

As an option, this specification can be applied to Python 2.x. In that case,
ASCII-only identifiers would continue to be represented as byte string objects
in namespace dictionaries; identifiers with non-ASCII characters would be
represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of the source
   code, a forward scan is made to find the first ASCII non-identifier character
   (e.g. a space or punctuation character).

2. The entire UTF-8 string is passed to a function to normalize the string to
   NFC, and then verify that it follows the identifier syntax. No such callout
   is made for pure-ASCII identifiers, which continue to be parsed the way they
   are today.

3. If this specification is implemented for 2.x, reflective libraries (such as
   pydoc) must be verified to continue to work when Unicode strings appear in
   ``__dict__`` slots as keys.

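
Steps 1 and 2 above could be sketched roughly as follows. This is an
illustration written in Python rather than the tokenizer's C code, and every
name in it (``scan_token``, ``read_identifier``, the category sets) is an
assumption made for the example. It also assumes the caller only invokes it at
a position where an identifier may start.

```python
import unicodedata

# ASCII bytes that may appear in an identifier (same as in Python 2.5).
ASCII_IDENT_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def scan_token(source, start):
    """Step 1: find the first ASCII non-identifier byte at or after start.

    Bytes >= 0x80 belong to multi-byte UTF-8 sequences and are always consumed.
    """
    end = start
    while end < len(source) and (
        source[end] >= 0x80 or source[end] in ASCII_IDENT_BYTES
    ):
        end += 1
    return end


def read_identifier(source, start):
    """Return (identifier, end_offset) for the UTF-8 bytes in source."""
    end = scan_token(source, start)
    raw = source[start:end]
    if all(b < 0x80 for b in raw):
        # Pure-ASCII identifiers skip the callout and are handled as before.
        return raw.decode("ascii"), end
    # Step 2: normalize to NFC, then verify the identifier syntax.
    name = unicodedata.normalize("NFC", raw.decode("utf-8"))
    first_ok = name[0] == "_" or unicodedata.category(name[0]) in START
    rest_ok = all(unicodedata.category(c) in CONTINUE for c in name[1:])
    if not (first_ok and rest_ok):
        raise SyntaxError("invalid character in identifier: %r" % name)
    return name, end
```

For example, calling ``read_identifier("größe = 1".encode("utf-8"), 0)`` scans
up to the space and returns the normalized name together with the byte offset
where scanning stopped.
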
References
==========

.. [1] http://www.unicode.org/reports/tr31/


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: