PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:


Abstract
========

This PEP proposes supporting non-ASCII letters (such as accented characters,
Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not
familiar with the English language, or even well-acquainted with the
Latin writing system. Such developers often desire to define classes
and functions with names in their native languages, rather than having
to come up with an (often incorrect) English translation of the
concept they want to name. When identifiers are written in their native
language, the clarity and maintainability of the code improve among
speakers of that language.

For some languages, common transliteration systems exist (in particular, for the
Latin-based writing systems). For other languages, users have greater
difficulty using Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so they have
to use characters they cannot type on their keyboards. However, it is the
choice of the designer of the library to decide on various constraints for using
the library: people may not be able to use the library because they cannot get
physical access to the source code (because it is not published), or because
licensing prohibits usage, or because the documentation is in a language they
cannot understand. A developer wishing to make a library widely available needs
to make a number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers). It should always be the choice
of the author to make these decisions, not the choice of the language
designers.

In particular, projects wishing to have wide usage may want to
establish a policy that all identifiers, comments, and documentation are written
in English (see the GNU coding style guide for an example of such a policy).
Restricting the language to ASCII-only identifiers does not force comments and
documentation to be in English, or the identifiers to be English words, so
an additional policy is necessary anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode standard annex
UAX-31 [1]_, with elaboration and changes as defined below.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
are the same as in Python 2.5. This specification only introduces additional
characters from outside the ASCII range. For other characters, the
classification uses the version of the Unicode Character Database as included in
the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter
numbers (Nl), plus the underscore and characters carrying the
Other_ID_Start property.

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal numbers
(Nd), connector punctuation (Pc), and characters carrying the
Other_ID_Continue property.

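The two character classes above can be sketched in Python on top of
``unicodedata.category()``. This is only an illustration of the rules as
written in this section, not the parser's implementation. The function names
are hypothetical. Because ``unicodedata`` does not expose the Other_ID_Start
and Other_ID_Continue properties, the two small sets below are hand-copied
samples standing in for them, and they are assumptions rather than complete
lists.

```python
import unicodedata

# General categories named in the specification above.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

# ASSUMPTION: unicodedata has no Other_ID_Start / Other_ID_Continue lookup,
# so these incomplete, hand-copied samples stand in for the real properties.
OTHER_ID_START = {"\u2118", "\u212e", "\u309b", "\u309c"}
OTHER_ID_CONTINUE = {"\u00b7", "\u0387"}


def is_id_start(ch):
    """True if ch may begin an identifier under the rules sketched above."""
    return (ch == "_"
            or unicodedata.category(ch) in START_CATEGORIES
            or ch in OTHER_ID_START)


def is_id_continue(ch):
    """True if ch may appear after the first character of an identifier."""
    return (is_id_start(ch)
            or unicodedata.category(ch) in CONTINUE_CATEGORIES
            or ch in OTHER_ID_CONTINUE)


def is_identifier(name):
    """Check name against the pattern <ID_Start> <ID_Continue>*."""
    return (len(name) > 0
            and is_id_start(name[0])
            and all(is_id_continue(ch) for ch in name[1:]))
```

With these definitions, ``is_identifier("переменная")`` and
``is_identifier("_x1")`` hold, while ``is_identifier("1x")`` does not. The
built-in ``str.isidentifier()`` in current Python releases follows the rules
as finally adopted, which may differ in detail from this draft.
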

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

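The effect of NFC normalization can be seen with
``unicodedata.normalize()``. The sketch below is illustrative only. The helper
name ``same_identifier`` is hypothetical, and the example spellings are not
taken from this PEP.

```python
import unicodedata

composed = "caf\u00e9"      # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_identifier(a, b):
    """Compare two identifier spellings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two spellings differ as raw strings, but
``same_identifier(composed, decomposed)`` is true, so under this rule both
would name the same variable.
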

A non-normative HTML file listing all valid identifier characters for
Unicode 4.1 can be found at
http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-331.html.

Policy Specification
====================

As an addition to the Python Coding style, the following policy is prescribed:
All identifiers in the Python standard library MUST be ASCII-only,
and SHOULD use English words wherever feasible.

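A project adopting the ASCII-only part of this policy could check it
mechanically. The following is a minimal sketch, assuming a Python version in
which the standard ``tokenize`` module accepts non-ASCII names and
``str.isascii()`` is available (3.7 or later). The function name
``non_ascii_names`` is hypothetical.

```python
import io
import tokenize


def non_ascii_names(source):
    """Yield (line, column, name) for each NAME token with non-ASCII characters."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            yield tok.start[0], tok.start[1], tok.string
```

Only identifier tokens are reported; non-ASCII text inside string literals
and comments is ignored, since the policy above concerns identifiers.
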

As an option, this specification can be applied to Python 2.x. In that case,
ASCII-only identifiers would continue to be represented as byte string objects
in namespace dictionaries; identifiers with non-ASCII characters would be
represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of
   the source code, a forward scan is made to find the first ASCII
   non-identifier character (e.g. a space or punctuation character).

2. The entire UTF-8 string is passed to a function to normalize the
   string to NFC, and then verify that it follows the identifier
   syntax. No such callout is made for pure-ASCII identifiers, which
   continue to be parsed the way they are today. The Unicode database
   must start including the Other_ID_{Start|Continue} property.

3. If this specification is implemented for 2.x, reflective libraries
   (such as pydoc) must be verified to continue to work when Unicode
   strings appear in ``__dict__`` slots as keys.

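Steps 1 and 2 above can be sketched in Python over a UTF-8 byte buffer. This
is an illustration of the described control flow, not the actual tokenizer
code. The names ``scan_identifier``, ``ASCII_ID_BYTES`` and ``verify`` are
hypothetical, and the caller is assumed to invoke the function only at a
position where an identifier may start.

```python
import unicodedata

# ASCII bytes that may appear inside an identifier.
ASCII_ID_BYTES = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"
)


def scan_identifier(buf, start, verify):
    """Return (identifier, end offset) for the token beginning at buf[start].

    buf is UTF-8 encoded source. verify is a caller-supplied check of a
    normalized str against the identifier syntax (an assumption of this
    sketch; the PEP does not name such a function).
    """
    end = start
    saw_non_ascii = False
    # Step 1: scan forward to the first ASCII non-identifier byte.
    while end < len(buf):
        byte = buf[end]
        if byte >= 0x80:
            saw_non_ascii = True
        elif byte not in ASCII_ID_BYTES:
            break
        end += 1
    text = buf[start:end].decode("utf-8")
    if not saw_non_ascii:
        # Pure-ASCII identifiers skip the callout entirely.
        return text, end
    # Step 2: normalize to NFC, then verify the identifier syntax.
    text = unicodedata.normalize("NFC", text)
    if not verify(text):
        raise SyntaxError("invalid identifier: %r" % text)
    return text, end
```

For experimentation, ``str.isidentifier`` can stand in for ``verify``, for
example ``scan_identifier("größe = 1".encode("utf-8"), 0, str.isidentifier)``.
That built-in follows the rules of current Python rather than this draft, so
it is only an approximation.
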

Open Issues
===========

John Nagle suggested consideration of Unicode Technical Standard #39,
[2]_, which discusses security mechanisms for Unicode identifiers.
It's not clear how that can precisely apply to this PEP; possible
consequences are:

* warn about characters listed as "restricted" in xidmodifications.txt
* warn about identifiers using mixed scripts
* somehow perform Confusable Detection

In the latter two approaches, it's not clear how precisely the
algorithm should work. For mixed scripts, certain kinds of mixing
should probably be allowed - are these the "Common" and "Inherited"
scripts mentioned in section 5? For Confusable Detection, it seems one
needs two identifiers to compare them for confusion - is it possible
to somehow apply it to a single identifier only, and warn?

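To make the mixed-script question concrete, here is a rough sketch of what
such a warning check might look like. It is not an implementation of UTS #39.
The ``unicodedata`` module does not expose the Unicode Script property, so the
first word of each character's name is used as a crude stand-in, which is an
assumption of this example. The function names are hypothetical.

```python
import unicodedata


def script_guesses(identifier):
    """Return the set of guessed scripts for the letters in identifier.

    ASSUMPTION: the first word of the Unicode character name approximates
    the Script property, which unicodedata does not provide.
    """
    scripts = set()
    for ch in identifier:
        if not unicodedata.category(ch).startswith("L"):
            # Digits, underscores and combining marks are skipped, much as
            # "Common" and "Inherited" characters would be.
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier):
    """True if the letters of identifier appear to come from several scripts."""
    return len(script_guesses(identifier)) > 1
```

An identifier such as ``"p\u0430ypal"``, which contains a Cyrillic "а", is
flagged, while ``"paypal"`` and ``"x_1"`` are not. The heuristic also flags
legitimate combinations, such as Japanese text mixing Kanji and Hiragana,
which illustrates why some kinds of mixing would have to be allowed.
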

References
==========

.. [1] http://www.unicode.org/reports/tr31/

.. [2] http://www.unicode.org/reports/tr39/


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: