PEP: 3131 Title: Supporting Non-ASCII Identifiers Version: $Revision$ Last-Modified: $Date$ Author: Martin v. Löwis Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 1-May-2007 Python-Version: 3.0 Post-History: Abstract ======== This PEP suggests to support non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Python identifiers. Rationale ========= Python code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have larger difficulties to use Latin to write their native words. Common Objections ================= Some objections are often raised against proposals similar to this one. People claim that they will not be able to use a library if to do so they have to use characters they cannot type on their keyboards. However, it is the choice of the designer of the library to decide on various constraints for using the library: people may not be able to use the library because they cannot get physical access to the source code (because it is not published), or because licensing prohibits usage, or because the documentation is in a language they cannot understand. A developer wishing to make a library widely available needs to make a number of explicit choices (such as publication, licensing, language of documentation, and language of identifiers). It should always be the choice of the author to make these decisions - not the choice of the language designers. In particular, projects wishing to have wide usage probably might want to establish a policy that all identifiers, comments, and documentation is written in English (see the GNU coding style guide for an example of such a policy). Restricting the language to ASCII-only identifiers does not enforce comments and documentation to be English, or the identifiers actually to be English words, so an additional policy is necessary, anyway. Specification of Language Changes ================================= The syntax of identifiers in Python will be based on the Unicode standard annex UAX-31 [1]_, with elaboration and changes as defined below. Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.5. This specification only introduces additional characters from outside the ASCII range. For other characters, the classification uses the version of the Unicode Character Database as included in the ``unicodedata`` module. The identifier syntax is `` *``. ``ID_Start`` is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), plus the underscore (XXX what are "stability extensions" listed in UAX 31). ``ID_Continue`` is defined as all characters in ``ID_Start``, plus nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), and connector punctuations (Pc). All identifiers are converted into the normal form NFC while parsing; comparison of identifiers is based on NFC. Policy Specification ==================== As an addition to the Python Coding style, the following policy is prescribed: All identifiers in the Python standard library MUST use ASCII-only identifiers, and SHOULD use English words wherever feasible. As an option, this specification can be applied to Python 2.x. In that case, ASCII-only identifiers would continue to be represented as byte string objects in namespace dictionaries; identifiers with non-ASCII characters would be represented as Unicode strings. Implementation ============== The following changes will need to be made to the parser: 1. If a non-ASCII character is found in the UTF-8 representation of the source code, a forward scan is made to find the first ASCII non-identifier character (e.g. a space or punctuation character) 2. The entire UTF-8 string is passed to a function to normalize the string to NFC, and then verify that it follows the identifier syntax. No such callout is made for pure-ASCII identifiers, which continue to be parsed the way they are today. 3. If this specification is implemented for 2.x, reflective libraries (such as pydoc) must be verified to continue to work when Unicode strings appear in ``__dict__`` slots as keys. References ========== .. [1] http://www.unicode.org/reports/tr31/ Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 coding: utf-8 End: