python-peps/pep-3131.txt

PEP: 3131
Title: Supporting Non-ASCII Identifiers
Version: $Revision$
Last-Modified: $Date$
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Accepted
Type: Standards Track
Content-Type: text/x-rst
Created: 1-May-2007
Python-Version: 3.0
Post-History:


Abstract
========

This PEP suggests to support non-ASCII letters (such as accented characters,
Cyrillic, Greek, Kanji, etc.) in Python identifiers.

Rationale
=========

Python code is written by many people in the world who are not
familiar with the English language, or even well-acquainted with the
Latin writing system.  Such developers often desire to define classes
and functions with names in their native languages, rather than having
to come up with an (often incorrect) English translation of the
concept they want to name. By using identifiers in their native
language, code clarity and maintainability of the code among
speakers of that language improves.

For some languages, common transliteration systems exist (in particular, for the
Latin-based writing systems).  For other languages, users have larger
difficulties to use Latin to write their native words.

Common Objections
=================

Some objections are often raised against proposals similar to this one.

People claim that they will not be able to use a library if to do so they have
to use characters they cannot type on their keyboards.  However, it is the
choice of the designer of the library to decide on various constraints for using
the library: people may not be able to use the library because they cannot get
physical access to the source code (because it is not published), or because
licensing prohibits usage, or because the documentation is in a language they
cannot understand.  A developer wishing to make a library widely available needs
to make a number of explicit choices (such as publication, licensing, language
of documentation, and language of identifiers).  It should always be the choice
of the author to make these decisions - not the choice of the language
designers.

In particular, projects wishing to have wide usage probably might want to
establish a policy that all identifiers, comments, and documentation is written
in English (see the GNU coding style guide for an example of such a policy).
Restricting the language to ASCII-only identifiers does not enforce comments and
documentation to be English, or the identifiers actually to be English words, so
an additional policy is necessary, anyway.

Specification of Language Changes
=================================

The syntax of identifiers in Python will be based on the Unicode standard annex
UAX-31 [1]_, with elaboration and changes as defined below.

Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
are the same as in Python 2.5.  This specification only introduces additional
characters from outside the ASCII range.  For other characters, the
classification uses the version of the Unicode Character Database as included in
the ``unicodedata`` module.

The identifier syntax is ``<ID_Start> <ID_Continue>*``.

``ID_Start`` is defined as all characters having one of the general
categories uppercase letters (Lu), lowercase letters (Ll), titlecase
letters (Lt), modifier letters (Lm), other letters (Lo), letter
numbers (Nl), the underscore, and characters carrying the
Other_ID_Start property.

``ID_Continue`` is defined as all characters in ``ID_Start``, plus
nonspacing marks (Mn), spacing combining marks (Mc), decimal number
(Nd), connector punctuations (Pc), and characters carryig the
Other_ID_Continue property.

All identifiers are converted into the normal form NFC while parsing;
comparison of identifiers is based on NFC.

A non-normative HTML file listing all valid identifier characters for
Unicode 4.1 can be found at
http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-331.html.

Policy Specification
====================

As an addition to the Python Coding style, the following policy is
prescribed: All identifiers in the Python standard library MUST use
ASCII-only identifiers, and SHOULD use English words wherever feasible
(in many cases, abbreviations and technical terms are used which
aren't English). In addition, string literals and comments must also
be in ASCII. The only exceptions are (a) test cases testing the
non-ASCII features, and (b) names of authors. Authors whose names are
not based on the latin alphabet MUST provide a latin transliteration
of their names.

As an option, this specification can be applied to Python 2.x.  In
that case, ASCII-only identifiers would continue to be represented as
byte string objects in namespace dictionaries; identifiers with
non-ASCII characters would be represented as Unicode strings.

Implementation
==============

The following changes will need to be made to the parser:

1. If a non-ASCII character is found in the UTF-8 representation of
   the source code, a forward scan is made to find the first ASCII
   non-identifier character (e.g. a space or punctuation character)

2. The entire UTF-8 string is passed to a function to normalize the
   string to NFC, and then verify that it follows the identifier
   syntax. No such callout is made for pure-ASCII identifiers, which
   continue to be parsed the way they are today. The Unicode database
   must start including the Other_ID_{Start|Continue} property.

3. If this specification is implemented for 2.x, reflective libraries
   (such as pydoc) must be verified to continue to work when Unicode
   strings appear in ``__dict__`` slots as keys.

Open Issues
===========

John Nagle suggested consideration of Unicode Technical Standard #39,
[2]_, which discusses security mechanisms for Unicode identifiers.
It's not clear how that can precisely apply to this PEP; possible
consequences are

 * warn about characters listed as "restricted" in xidmodifications.txt
 * warn about identifiers using mixed scripts
 * somehow perform Confusable Detection

In the latter two approaches, it's not clear how precisely the
algorithm should work. For mixed scripts, certain kinds of mixing
should probably allowed - are these the "Common" and "Inherited"
scripts mentioned in section 5? For Confusable Detection, it seems one
needs two identifiers to compare them for confusion - is it possible
to somehow apply it to a single identifier only, and warn?

In follow-up discussion, it turns out that John Nagle actually
meant to suggest UTR#36, level "Highly Restrictive", [3]_.

Several people suggested to allow and ignore formatting control
characters (general category Cf), as is done in Java, JavaScript, and
C#. It's not clear whether this would improve things (it might
for RTL languages); if there is a need, these can be added
later.


References
==========

.. [1] http://www.unicode.org/reports/tr31/
.. [2] http://www.unicode.org/reports/tr39/
.. [3] http://www.unicode.org/reports/tr36/

Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: