PEP: 258
Title: DPS Generic Implementation Details
Version: $Revision$
Last-Modified: $Date$
Author: dgoodger@bigfoot.com (David Goodger)
Discussions-To: doc-sig@python.org
Status: Draft
Type: Standards Track
Created: 31-May-2001
Post-History:


Abstract

    This PEP documents generic implementation details for a Python
    Docstring Processing System (DPS).  The rationale and high-level
    concepts of the DPS are documented in PEP 256, "Docstring
    Processing System Framework" [1].


Specification

Docstring Extraction Rules
==========================

    1. If the '__all__' variable is present in the module being
       documented, only identifiers listed in '__all__' are examined
       for docstrings.  In the absence of '__all__', all identifiers
       are examined, except those whose names are private (names begin
       with '_' but don't begin and end with '__').

    2. Docstrings are string literal expressions, and are recognized
       in the following places within Python modules:

       a) At the beginning of a module, class definition, or function
          definition, after any comments.  This is the standard for
          Python __doc__ attributes.

       b) Immediately following a simple assignment at the top level
          of a module, class definition, or __init__ method
          definition, after any comments.  See "Attribute Docstrings"
          below.

       c) Additional string literals found immediately after the
          docstrings in (a) and (b) will be recognized, extracted, and
          concatenated.  See "Additional Docstrings" below.

    3. Python modules must be parsed by the docstring processing
       system, not imported.  There are security reasons for not
       importing untrusted code.  Also, docstrings are to be
       recognized in places where the bytecode compiler ignores string
       literal expressions (2b and 2c above), meaning importing the
       module will lose these docstrings.  Of course, standard Python
       parsing tools such as the 'parser' library module should be
       used.

    Since attribute docstrings and additional docstrings are not
    recognized by the Python bytecode compiler, no namespace pollution
    or performance degradation will result from their use.  (The
    initial parsing of a module may take a slight performance hit.)

Attribute Docstrings
--------------------

    XXX A description of attribute docstrings would be appropriate in
    PEP 257, "Docstring Conventions".

    (This is a simplified version of PEP 224 [3] by Marc-Andre
    Lemburg.)

    A string literal immediately following an assignment statement is
    interpreted by the docstring extraction machinery as the docstring
    of the target of the assignment statement, under the following
    conditions:

    1. The assignment must be in one of the following contexts:

       a) At the top level of a module (i.e., not inside a loop or
          conditional): a module attribute.

       b) At the top level of a class definition: a class attribute.

       c) At the top level of a class' '__init__' method definition:
          an instance attribute.

       Since each of the above contexts is at the top level (i.e.,
       just inside the outermost suite of a definition), it may be
       necessary to place dummy assignments for attributes assigned
       conditionally or in a loop.  Blank lines may be used after
       attribute docstrings to emphasize the connection between the
       assignment and the docstring.

    2. The assignment must be to a single target, not to a list or a
       tuple of targets.

    3. The form of the target:

       a) For contexts 1a and 1b above, the target must be a simple
          identifier (not a dotted identifier, a subscripted
          expression, or a sliced expression).

       b) For context 1c above, the target must be of the form
          'self.attrib', where 'self' matches the '__init__' method's
          first parameter (the instance parameter) and 'attrib' is a
          simple identifier as in 3a.

    Examples::

        g = 'module attribute (global variable)'
        """This is g's docstring."""

        class AClass:

            c = 'class attribute'
            """This is AClass.c's docstring."""

            def __init__(self):
                self.i = 'instance attribute'
                """This is self.i's docstring."""

Additional Docstrings
---------------------

    XXX A description of additional docstrings would be appropriate in
    PEP 257, "Docstring Conventions" [4].

    Many programmers would like to make extensive use of docstrings
    for API documentation.  However, docstrings do take up space in
    the running program, so some of these programmers are reluctant to
    'bloat up' their code.  Also, not all API documentation is
    applicable to interactive environments, where __doc__ would be
    displayed.

    The docstring processing system's extraction tools will
    concatenate all string literal expressions which appear at the
    beginning of a definition or after a simple assignment.  Only the
    first strings in definitions will be available as __doc__, and can
    be used for brief usage text suitable for interactive sessions;
    subsequent string literals and all attribute docstrings are
    ignored by the Python bytecode compiler and may contain more
    extensive API information.

    Example::

        def function(arg):
            """This is __doc__, function's docstring."""
            """
            This is an additional docstring, ignored by the bytecode
            compiler, but extracted by the docstring processing system.
            """
            pass

    Issue: This breaks 'from __future__ import' statements in Python
    2.1 for multiple module docstrings.  Resolution?

    1. Should we search for docstrings after a __future__ statement?
       Very ugly.

    2. Redefine __future__ statements to allow multiple preceding
       string literals?

    3. Or should we not even worry about this?  There shouldn't be
       __future__ statements in production code, after all.  Modules
       with __future__ statements will have to put up with the
       single-docstring limitation.

Choice of Docstring Format
==========================

    Rather than force everyone to use a single docstring format,
    multiple input formats are allowed by the processing system.  A
    special variable, __docformat__, may appear at the top level of a
    module before any function or class definitions.  Over time or
    through decree, a standard format or set of formats should emerge.

    The __docformat__ variable is a string containing the name of the
    format being used, a case-insensitive string matching the input
    parser's module or package name (i.e., the same name as required
    to 'import' the module or package), or a registered alias.  If no
    __docformat__ is specified, the default format is 'plaintext' for
    now; this may be changed to the standard format once determined.

    The __docformat__ string may contain an optional second field,
    separated from the format name (first field) by a single space: a
    case-insensitive language identifier as defined in RFC 1766 [5].
    A typical language identifier consists of a 2-letter language code
    from ISO 639 [6] (3-letter codes used only if no 2-letter code
    exists; RFC 1766 is currently being revised to allow 3-letter
    codes).  If no language identifier is specified, the default is
    'en' for English.  The language identifier is passed to the parser
    and can be used for language-dependent markup features.

DPS Structure
=============

    - package 'dps'

      - function 'dps.main()' (in 'dps/__init__.py')

      - package 'dps.parsers'

        - module 'dps.parsers.model'; see 'Input Parser API' below.

      - package 'dps.formatters'

        - module 'dps.formatters.model'; see 'Output Formatter API'
          below.

      - package 'dps.languages'

        - module 'dps.languages.en' (English)

        - others to be added

      - utility modules: 'dps.statemachine'

Command-Line Interface
======================

    XXX To be determined.

System Python API
=================

    XXX To be determined.

Input Parser API
================

    Each input parser is a module or package exporting a 'Parser'
    class, with the following interface:

        class Parser:

            def __init__(self, inputstring, errors='warn', language='en'):
                """Initialize the Parser instance."""

            def parse(self):
                """Return a DOM tree, the parsed input string."""

    XXX This needs a lot of work.  What is required for this API?

    A model 'Parser' class implementing the full interface along with
    utility functions can be found in the 'dps.parsers.model' module.

Output Formatter API
====================

    Each output formatter is a module or package exporting a
    'Formatter' class, with the following interface:

        class Formatter:

            def __init__(self, domtree, language='en', showwarnings=0):
                """Initialize the Formatter instance."""

            def format(self):
                """
                Return a formatted string representation of the DOM tree.
                """

    XXX This also needs a lot of work.  What is required for this API?

    A model 'Formatter' class implementing the full interface along
    with utility functions can be found in the 'dps.formatters.model'
    module.

Language Module API
===================

    Language modules will contain language-dependent strings and
    mappings.  They will be named for their language identifier (as
    defined in 'Choice of Docstring Format' above), converting dashes
    to underscores.

    XXX Specifics to be determined.

Intermediate Data Structure
===========================

    A single intermediate data structure is used internally by the
    docstring processing system.  This data structure is a DOM tree
    whose schema is documented in an XML DTD (eXtensible Markup
    Language Document Type Definition), which comes in three parts:

    - the Python Plaintext Document Interface DTD, ppdi.dtd [7],

    - the Generic Plaintext Document Interface DTD, gpdi.dtd [8],

    - and the OASIS Exchange Table Model, soextbl.dtd [9].

    The DTD defines a rich set of elements, suitable for any input
    syntax or output format.  The input parser and the output
    formatter share the same intermediate data structure.
    The processing system may do transformations on the data from the
    input parser before passing it on to the output formatter.  The
    DTD retains all information necessary to reconstruct the original
    input text, or a reasonable facsimile thereof.

    XXX Specifics (about the DOM tree) to be determined.

Output Management
=================

    XXX To be determined.  Type of output: filesystem only, or
    in-memory data structure too?  File/directory naming & structure
    conventions.  In-memory data structure should follow filesystem
    naming; file/directory == leaf/node.  Use a directory hierarchy
    rather than long file names (long file names were one of the
    reasons pythondoc couldn't run on MacOS).


References and Footnotes

    [1] http://python.sf.net/peps/pep-0256.html

    [2] http://www.python.org/sigs/doc-sig/

    [3] http://python.sf.net/peps/pep-0224.html

    [4] http://python.sf.net/peps/pep-0257.html

    [5] http://www.rfc-editor.org/rfc/rfc1766.txt

    [6] http://lcweb.loc.gov/standards/iso639-2/englangn.html

    [7] http://docstring.sf.net/spec/ppdi.dtd

    [8] http://docstring.sf.net/spec/gpdi.dtd

    [9] http://docstring.sf.net/spec/soextblx.dtd


Project Web Site

    A SourceForge project has been set up for this work at
    http://docstring.sf.net.


Copyright

    This document has been placed in the public domain.


Acknowledgements

    This document borrows ideas from the archives of the Python
    Doc-SIG [2].  Thanks to all members past & present.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: