PEP: 258
Title: Docutils Design Specification
Version: $Revision$
Last-Modified: $Date$
Author: goodger@users.sourceforge.net (David Goodger)
Discussions-To: doc-sig@python.org
Status: Draft
Type: Standards Track
Requires: 256, 257
Created: 31-May-2001
Post-History: 13-Jun-2001


Abstract

This PEP documents design issues and implementation details for
Docutils, a Python Docstring Processing System (DPS). The rationale
and high-level concepts of a DPS are documented in PEP 256, "Docstring
Processing System Framework" [1]. Also see PEP 256 for a "Roadmap to
the Docstring PEPs".

Docutils is being designed modularly so that any of its components
can be replaced easily. In addition, Docutils is not limited to the
processing of Python docstrings; it processes standalone documents as
well, in several contexts.

No changes to the core Python language are required by this PEP. Its
deliverables consist of a package for the standard library and its
documentation.


Specification

Docutils Project Model
======================

::

                     +--------------------------+
                     | Docutils:                |
                     | docutils.core.Publisher, |
                     | docutils.core.publish()  |
                     +--------------------------+
                      /                        \
                     /                          \
            1,3,5,7 /                            \ 8,10
           +--------+                            +--------+
           | READER | =========================> | WRITER |
           +--------+                            +--------+
            /  ||  \                               /    \
           /   ||   \                             /      \
        2 /  4 ||    \ 6                       9 /        \ 11
    +-----+ +--------+   +-------------+ +------------+ +-----+
    | I/O | | PARSER |...| reader      | | writer     | | I/O |
    +-----+ +--------+   | transforms  | | transforms | +-----+
                         |             | |            |
                         | - docinfo   | | - system   |
                         | - titles    | |   messages |
                         | - linking   | | - final    |
                         | - lookups   | |   checks   |
                         | - reader-   | | - writer-  |
                         |   specific  | |   specific |
                         | - parser-   | | - etc.     |
                         |   specific  | +------------+
                         | - layout    |
                         |  (stylist)  |
                         | - etc.      |
                         +-------------+

The numbers indicate the path a document's data takes through the
code. Double-width lines between reader & parser and between reader &
writer indicate that data sent along these paths should be standard
(pure & unextended) Docutils doc trees.
Single-width lines signify that internal tree extensions or completely
unrelated representations are possible, but they must be supported at
both ends.


Publisher
---------

The "docutils.core" module contains a "Publisher" facade class and
"publish" convenience function. Publisher encapsulates the high-level
logic of a Docutils system. The Publisher.publish() method first calls
its Reader, which reads data from its source I/O, parses and
transforms the data, and returns it. Publisher.publish() then passes
the resulting document tree to its Writer, which further transforms
the document before translating it to the final output format and
writing the formatted data to its destination I/O.

Calling the "publish" function (or instantiating a "Publisher" object)
with component names will result in default behavior. For custom
behavior (setting component options), create custom component objects
first, and pass *them* to publish/Publisher.


Readers
-------

Readers understand the input context (where the data is coming from),
send the whole input or discrete "chunks" to the parser, and provide
the context to bind the chunks together back into a cohesive whole.
Using transforms_, Readers also resolve references, footnote numbers,
interpreted text processing, and anything else that requires
context-sensitive computation.

Each reader is a module or package exporting a "Reader" class with a
"read" method. The base "Reader" class can be found in the
docutils/readers/__init__.py module.

Most Readers will have to be told what parser to use. So far (see the
list of examples below), only the Python Source Reader (PySource;
still incomplete) will be able to determine the parser on its own.

Responsibilities:

- Get input text from the source I/O.

- Pass the input text to the parser, along with a fresh doctree root.

- Run transforms over the doctree(s).

Examples:

- Standalone (Raw/Plain): Just read a text file and process it. The
  reader needs to be told which parser to use.
  The "Standalone Reader" has been implemented in
  docutils/readers/standalone.py.

- Python Source: See `Python Source Reader`_ below. This Reader is
  currently in development in the Docutils sandbox.

- Email: RFC-822 headers, quoted excerpts, signatures, MIME parts.

- PEP: RFC-822 headers, "PEP xxxx" and "RFC xxxx" conversion to URIs.
  Either interpret PEPs' indented sections or convert existing PEPs to
  reStructuredText (or both?).

  The "PEP Reader" is being implemented in docutils/readers/pep.py.

- Wiki: Global reference lookups of "wiki links" incorporated into
  transforms. (CamelCase only or unrestricted?) Lazy indentation?

- Web Page: As standalone, but recognize meta fields as meta tags.
  Support for templates of some sort? (After "<body>", before
  "</body>"?)

- FAQ: Structured "question & answer(s)" constructs.

- Compound document: Merge chapters into a book. Master TOC file?


Parsers
-------

Parsers analyze their input and produce a Docutils `document tree`_.
They don't know or care anything about the source or destination of
the data.

Each input parser is a module or package exporting a "Parser" class
with a "parse" method. The base "Parser" class can be found in the
docutils/parsers/__init__.py module.

Responsibilities: Given raw input text and a doctree root node,
populate the doctree by parsing the input text.

Example: The only parser implemented so far is for the
reStructuredText markup. It is implemented in the
docutils/parsers/rst/ package.


Transforms
----------

Transforms change the document tree from one form to another, add to
the tree, or prune it. Transforms are run by Reader and Writer
objects. Some transforms are Reader-specific, some are
Parser-specific, and others are Writer-specific. The choice and order
of transforms is specified in the Reader and Writer objects.

Each transform is a class in a module in the docutils/transforms
package, a subclass of docutils.transforms.Transform.
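
As a rough illustration of the description above, the following
sketch shows one way a transform could be structured. It does not use
the real Docutils API: the "Node" class, the "apply()" method name,
the "StripComments" transform, and the "run_transforms()" helper are
all hypothetical stand-ins invented for this example.

```python
class Node:
    """Hypothetical stand-in for a doctree node (not docutils.nodes)."""

    def __init__(self, name, children=None, text=""):
        self.name = name
        self.children = list(children or [])
        self.text = text


class Transform:
    """Hypothetical base class: a transform modifies a doctree in place."""

    def __init__(self, doctree):
        self.doctree = doctree

    def apply(self):
        raise NotImplementedError("subclasses must override apply()")


class StripComments(Transform):
    """Example transform that prunes every node named "comment"."""

    def apply(self):
        self._prune(self.doctree)

    def _prune(self, node):
        # Drop comment children, then recurse into what remains.
        node.children = [c for c in node.children if c.name != "comment"]
        for child in node.children:
            self._prune(child)


def run_transforms(doctree, transform_classes):
    """Apply transforms in the order a Reader or Writer has chosen."""
    for transform_class in transform_classes:
        transform_class(doctree).apply()
    return doctree


doctree = Node("document", [
    Node("paragraph", text="Visible text."),
    Node("comment", text="Internal note."),
    Node("section", [Node("comment"), Node("title", text="Heading")]),
])
run_transforms(doctree, [StripComments])
print([child.name for child in doctree.children])
```

The list passed to "run_transforms()" plays the role of the "choice
and order of transforms" that a Reader or Writer would specify.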

Responsibilities:

- Modify a doctree in-place, either purely transforming one structure
  into another, or adding new structures based on the doctree and/or
  external data.

Examples (in the docutils/transforms/ package):

- frontmatter.DocInfo: Conversion of document metadata (bibliographic
  information).

- references.Hyperlinks: Resolution of hyperlinks.

- parts.Contents: Generates a table of contents for a document.

- document.Merger: Combining multiple populated doctrees into one (not
  yet implemented or fully understood).

- document.Splitter: Splits a document into a tree-structure of
  subdocuments, perhaps by section. It will have to transform
  references appropriately. (Neither implemented nor remotely
  understood.)

- universal.Pending: Handles transforms that must be executed at
  specific stages of processing.

- components.Filter: Includes or excludes elements which depend on a
  specific Docutils component (triggered by the universal.Pending
  transform).


Writers
-------

Writers produce the final output (HTML, XML, TeX, etc.). Writers
translate the internal document tree structure into the final data
format, possibly running Writer-specific transforms_ first.

Each writer is a module or package exporting a "Writer" class with a
"write" method. The base "Writer" class can be found in the
docutils/writers/__init__.py module.

Responsibilities:

- Run transforms over the doctree(s).

- Translate doctree(s) into specific output formats.

  - Transform references into format-native forms.

- Write the translated output to the destination I/O.

Examples:

- XML: Various forms, such as:

  - DocBook (being implemented in the Docutils sandbox).

  - Raw doctree XML (accessible via "doctree.asdom().toxml()"; no
    Writer component implemented yet).

- HTML (XHTML implemented as docutils/writers/html4css1.py).

- PDF (a ReportLab interface is being developed in the Docutils
  sandbox).

- TeX

- Docutils-native pseudo-XML (implemented as
  docutils/writers/pseudoxml.py, used for testing).

- Plain text

- reStructuredText?


I/O
---

I/O classes provide a uniform API for low-level input and output.
Subclasses will exist for a variety of input/output mechanisms. I/O
classes are currently in the preliminary stages; there's a lot of work
yet to be done. Issues:

- Looking at the list of writers, it seems that only HTML would
  require anything other than monolithic output. Perhaps "Writer"
  variants, one for each output distribution type?

- How to represent a multi-file document (files & directories) in the
  API?

Responsibilities:

- Read data from the input source and/or write data to the output
  destination.

Examples of input sources:

- A single file on disk or a stream (implemented as
  docutils.io.FileIO).

- Multiple files on disk (MultiFileIO?).

- Python source files: modules and packages.

- Python strings, as received from a client application (implemented
  as docutils.io.StringIO).

Examples of output destinations:

- A single file on disk or a stream (implemented as
  docutils.io.FileIO).

- A tree of directories and files on disk.

- A Python string, returned to a client application (implemented as
  docutils.io.StringIO).

- A single tree-shaped data structure in memory.

- Some other set of data structures in memory.


Docutils Package Structure
==========================

- Package "docutils".

  - Class "Component" is a base class for Docutils components.

  - Module "docutils.core" contains facade class "Publisher" and
    convenience function "publish()". See `Publisher`_ above.

  - Module "docutils.frontend" provides command-line and option
    processing for Docutils front-ends.

  - Module "docutils.io" provides a uniform API for low-level input
    and output.

  - Module "docutils.nodes" contains the Docutils document tree
    element class library plus Visitor pattern base classes. See
    `Document Tree`_ below.

  - Module "docutils.optik" provides option parsing and command-line
    help; from Greg Ward's http://optik.sf.net/ project, included for
    convenience.

  - Module "docutils.roman" contains Roman numeral conversion
    routines.

  - Module "docutils.statemachine" contains a finite state machine
    specialized for regular-expression-based text filters. The
    reStructuredText parser implementation is based on this module.

  - Module "docutils.urischemes" contains a mapping of known URI
    schemes ("http", "ftp", "mailto", etc.).

  - Module "docutils.utils" contains utility functions and classes,
    including a logger class ("Reporter"; see `Error Handling`_
    below).

  - Package "docutils.parsers": markup parsers_.

    - Function "get_parser_class(parser_name)" returns a parser
      module by name. Class "Parser" is the base class of specific
      parsers. (docutils/parsers/__init__.py)

    - Package "docutils.parsers.rst": the reStructuredText parser.

    - Alternate markup parsers may be added.

  - Package "docutils.readers": context-aware input readers.

    - Function "get_reader_class(reader_name)" returns a reader
      module by name or alias. Class "Reader" is the base class of
      specific readers. (docutils/readers/__init__.py)

    - Module "docutils.readers.standalone" reads independent document
      files.

    - Module "docutils.readers.pep" reads PEPs (Python Enhancement
      Proposals).

    - Readers to be added for: Python source code (structure &
      docstrings), PEPs, email, FAQ, and perhaps Wiki and others.

  - Package "docutils.writers": output format writers.

    - Function "get_writer_class(writer_name)" returns a writer
      module by name. Class "Writer" is the base class of specific
      writers. (docutils/writers/__init__.py)

    - Module "docutils.writers.pseudoxml" is a simple internal
      document tree writer; it writes indented pseudo-XML.

    - Module "docutils.writers.html4css1" is a simple HyperText
      Markup Language document tree writer for HTML 4.01 and CSS1.

    - Writers to be added: HTML 3.2 or 4.01-loose, XML (various
      forms, such as DocBook and the raw internal doctree), PDF, TeX,
      plaintext, reStructuredText, and perhaps others.

  - Package "docutils.transforms": tree transform classes.

    - Class "Transform" is the base class of specific transforms; see
      `Transform API`_ below. (docutils/transforms/__init__.py)

    - Each module contains related transform classes.

  - Package "docutils.languages": Language modules contain
    language-dependent strings and mappings. They are named for their
    language identifier (as defined in `Choice of Docstring Format`_
    below), converting dashes to underscores.

    - Function "get_language(language_code)" returns the matching
      language module. (docutils/languages/__init__.py)

    - Module "docutils.languages.en" (English).

    - Other languages to be added.


Front-End Tools
===============

@@@ To be determined.

@@@ Document tools & summarize their command-line interfaces.


Document Tree
=============

A single intermediate data structure is used internally by Docutils,
in the interfaces between components; it is defined in the
docutils.nodes module. It is not required that this data structure be
used *internally* by any of the components, just *between* components.

Custom node types are allowed, provided that either (A) a transform
converts them to standard Docutils nodes before they reach the Writer
proper, or (B) the custom node is explicitly supported by certain
Writers, and is wrapped in a filtered "pending" node. An example of
condition A is the `Python Source Reader`_ (see below), where a
"stylist" transform converts custom nodes. The HTML "<meta>" tag is an
example of condition B; it is supported by the HTML Writer but not by
others. The reStructuredText ".. meta::" directive creates a "pending"
node, which contains knowledge that the embedded "meta" node can only
be handled by HTML-compatible writers. The "pending" node is resolved
by the "transforms.components.Filter" transform, which checks that the
calling writer supports HTML; if it doesn't, the "meta" node is
removed from the document.

The document tree data structure is similar to a DOM tree, but with
specific node names (classes) instead of DOM's generic nodes.
The schema is documented in an XML DTD (eXtensible Markup Language
Document Type Definition), which comes in two parts:

- the Docutils Generic DTD, docutils.dtd [2], and

- the OASIS Exchange Table Model, soextblx.dtd [3].

The DTD defines a rich set of elements, suitable for many input and
output formats. The DTD retains all information necessary to
reconstruct the original input text, or a reasonable facsimile
thereof.

See "The Docutils Document Tree" [4] for details (incomplete).


Error Handling
==============

When the parser encounters an error in markup, it inserts a system
message (DTD element "system_message"). There are five levels of
system messages:

- Level-0, "DEBUG": an internal reporting issue. There is no effect
  on the processing. Level-0 system messages are handled separately
  from the others.

- Level-1, "INFO": a minor issue that can be ignored. There is little
  or no effect on the processing. Typically level-1 system messages
  are not reported.

- Level-2, "WARNING": an issue that should be addressed. If ignored,
  there may be minor problems with the output. Typically level-2
  system messages are reported but do not halt processing.

- Level-3, "ERROR": a major issue that should be addressed. If
  ignored, the output will contain unpredictable errors. Typically
  level-3 system messages are reported but do not halt processing.

- Level-4, "SEVERE": a critical error that must be addressed. If
  ignored, the output will contain severe errors. Typically level-4
  system messages are turned into exceptions which halt processing.

Although the initial message levels were devised independently, they
have a strong correspondence to VMS error condition severity levels
[5]; the names in quotes for levels 1 through 4 were borrowed from
VMS. Error handling has since been influenced by the log4j project
[6].
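
The typical handling of the five levels can be sketched as a small
reporter with two thresholds. This is only an illustration under
stated assumptions, not the actual "Reporter" class in docutils.utils:
the names "SimpleReporter", "SystemMessageError", "report_level", and
"halt_level", and the default threshold values, are hypothetical
choices made for this example.

```python
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "SEVERE")


class SystemMessageError(Exception):
    """Hypothetical exception raised when a message halts processing."""

    def __init__(self, level, text):
        super().__init__("%s/%d: %s" % (LEVELS[level], level, text))
        self.level = level


class SimpleReporter:
    """Hypothetical reporter; thresholds follow the "typical" behavior."""

    def __init__(self, report_level=2, halt_level=4, debug=False):
        self.report_level = report_level  # assumed default: WARNING and up
        self.halt_level = halt_level      # assumed default: SEVERE only
        self.debug = debug
        self.reported = []

    def system_message(self, level, text):
        if level == 0:
            # DEBUG messages are handled separately from the others.
            if self.debug:
                self.reported.append((level, text))
            return
        if level >= self.halt_level:
            raise SystemMessageError(level, text)
        if level >= self.report_level:
            self.reported.append((level, text))


reporter = SimpleReporter()
reporter.system_message(1, "Minor issue, ignored.")
reporter.system_message(2, "Issue that should be addressed.")
reporter.system_message(3, "Major issue.")
halted = None
try:
    reporter.system_message(4, "Critical error.")
except SystemMessageError as error:
    halted = str(error)
print([LEVELS[level] for level, _ in reporter.reported])
print(halted)
```

Keeping the reporting and halting thresholds separate lets a front-end
tighten or relax either one without touching the code that emits the
messages.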


Python Source Reader
====================

The Python Source Reader ("PySource") is the Docutils component that
reads Python source files, extracts docstrings in context, then
parses, links, and assembles the docstrings into a cohesive whole. It
is a major and non-trivial component, currently under experimental
development in the Docutils sandbox. High-level design issues are
presented here.


Processing Model
----------------

This model will evolve over time, incorporating experience and
discoveries.

1. The PySource Reader uses an I/O class to read in some Python
   packages and modules, into a tree of strings.

2. The Python modules are parsed, converting the tree of strings into
   a tree of abstract syntax trees.

3. The abstract syntax trees are converted into an internal
   representation of the packages/modules. Docstrings are extracted,
   as well as code structure details. See `AST Mining`_ below.
   Namespaces are constructed for lookup in step 6.

4. One at a time, the docstrings are parsed, producing standard
   Docutils doctrees.

5. PySource assembles all the individual docstrings' doctrees into a
   Python-specific custom Docutils tree paralleling the
   package/module/class structure; this is a custom Reader-specific
   internal representation (see the Docutils Python Source DTD [7]).
   Namespaces must be merged: Python identifiers, hyperlink targets.

6. Cross-references from docstrings (interpreted text) to Python
   identifiers are resolved according to the Python namespace lookup
   rules. See `Identifier Cross-References`_ below.

7. A "Stylist" transform is applied to the custom doctree, custom
   nodes are rendered using standard nodes as primitives, and a
   standard document tree is emitted. See `Stylist Transforms`_
   below.

8. Other transforms are applied to the standard doctree.

9. The standard doctree is sent to a Writer, which translates the
   document into a concrete format (HTML, PDF, etc.).

10. The Writer uses an I/O class to write the resulting data to its
    destination (disk file, directories and files, etc.).


AST Mining
----------

Abstract Syntax Tree mining code will be written that scans a parsed
Python module, and returns an ordered tree containing the names,
docstrings (including attribute and additional docstrings; see below),
and additional info (in parentheses below) of all of the following
objects:

- packages
- modules
- module attributes (+ initial values)
- classes (+ inheritance)
- class attributes (+ initial values)
- instance attributes (+ initial values)
- methods (+ parameters & defaults)
- functions (+ parameters & defaults)

(Extract comments too? For example, comments at the start of a module
would be a good place for bibliographic field lists.)

In order to evaluate interpreted text cross-references, namespaces for
each of the above will also be required.

See python-dev/docstring-develop thread "AST mining", started on
2001-08-14.


Docstring Extraction Rules
--------------------------

1. What to examine:

   a) If the "__all__" variable is present in the module being
      documented, only identifiers listed in "__all__" are examined
      for docstrings.

   b) In the absence of "__all__", all identifiers are examined,
      except those whose names are private (names begin with "_" but
      don't begin and end with "__").

   c) 1a and 1b can be overridden by a parameter or command-line
      option.

2. Where:

   Docstrings are string literal expressions, and are recognized in
   the following places within Python modules:

   a) At the beginning of a module, function definition, class
      definition, or method definition, after any comments. This is
      the standard for Python __doc__ attributes.

   b) Immediately following a simple assignment at the top level of a
      module, class definition, or __init__ method definition, after
      any comments. See "Attribute Docstrings" below.

   c) Additional string literals found immediately after the
      docstrings in (a) and (b) will be recognized, extracted, and
      concatenated. See "Additional Docstrings" below.

   d) @@@ 2.2-style "properties" with attribute docstrings?

3. How:

   Whenever possible, Python modules should be parsed by Docutils, not
   imported. There are several reasons:

   - Importing untrusted code is inherently insecure.

   - Information from the source is lost when using introspection to
     examine an imported module, such as comments and the order of
     definitions.

   - Docstrings are to be recognized in places where the bytecode
     compiler ignores string literal expressions (2b and 2c above),
     meaning importing the module will lose these docstrings.

   Of course, standard Python parsing tools such as the "parser"
   library module should be used.

   When the Python source code for a module is not available
   (i.e. only the .pyc file exists) or for C extension modules, to
   access docstrings the module can only be imported, and any
   limitations must be lived with.

Since attribute docstrings and additional docstrings are ignored by
the Python bytecode compiler, no namespace pollution or runtime bloat
will result from their use. They are not assigned to __doc__ or to
any other attribute. The initial parsing of a module may take a
slight performance hit.


Attribute Docstrings
````````````````````

(This is a simplified version of PEP 224 [8] by Marc-Andre Lemburg.)

A string literal immediately following an assignment statement is
interpreted by the docstring extraction machinery as the docstring of
the target of the assignment statement, under the following
conditions:

1. The assignment must be in one of the following contexts:

   a) At the top level of a module (i.e., not nested inside a compound
      statement such as a loop or conditional): a module attribute.

   b) At the top level of a class definition: a class attribute.

   c) At the top level of the "__init__" method definition of a class:
      an instance attribute.

   Since each of the above contexts is at the top level (i.e., in the
   outermost suite of a definition), it may be necessary to place
   dummy assignments for attributes assigned conditionally or in a
   loop.

2. The assignment must be to a single target, not to a list or a tuple
   of targets.

3. The form of the target:

   a) For contexts 1a and 1b above, the target must be a simple
      identifier (not a dotted identifier, a subscripted expression,
      or a sliced expression).

   b) For context 1c above, the target must be of the form
      "self.attrib", where "self" matches the "__init__" method's
      first parameter (the instance parameter) and "attrib" is a
      simple identifier as in 3a.

Blank lines may be used after attribute docstrings to emphasize the
connection between the assignment and the docstring.

Examples::

    g = 'module attribute (module-global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            self.i = 'instance attribute'
            """This is self.i's docstring."""


Additional Docstrings
`````````````````````

(This idea was adapted from PEP 216, Docstring Format [9], by Moshe
Zadka.)

Many programmers would like to make extensive use of docstrings for
API documentation. However, docstrings do take up space in the
running program, so some of these programmers are reluctant to "bloat
up" their code. Also, not all API documentation is applicable to
interactive environments, where __doc__ would be displayed.

The docstring processing system's extraction tools will concatenate
all string literal expressions which appear at the beginning of a
definition or after a simple assignment. Only the first strings in
definitions will be available as __doc__, and can be used for brief
usage text suitable for interactive sessions; subsequent string
literals and all attribute docstrings are ignored by the Python
bytecode compiler and may contain more extensive API information.
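
The concatenation rule for definitions can be sketched with Python's
standard "ast" module, which parses source without importing it. This
is a minimal sketch and not the Docutils implementation: the helper
names "leading_strings()" and "extract_docstrings()", the use of "ast"
in place of the "parser" module mentioned earlier, and the choice to
join the pieces with a blank line are all assumptions made for
illustration. Attribute docstrings are not handled here.

```python
import ast
import inspect


def leading_strings(statements):
    """Return the consecutive string literal expressions opening a suite."""
    found = []
    for statement in statements:
        if (isinstance(statement, ast.Expr)
                and isinstance(statement.value, ast.Constant)
                and isinstance(statement.value.value, str)):
            # cleandoc() strips the indentation shared by docstring lines.
            found.append(inspect.cleandoc(statement.value.value))
        else:
            break  # the first non-string statement ends the docstrings
    return found


def extract_docstrings(source):
    """Map each function or class name to its concatenated docstrings."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # Joining with a blank line is an assumption of this sketch.
            result[node.name] = "\n\n".join(leading_strings(node.body))
    return result


SAMPLE = '''
def function(arg):
    """Brief usage text, available as __doc__."""
    """
    More extensive API information, ignored by the bytecode compiler.
    """
    pass
'''

docs = extract_docstrings(SAMPLE)
print(docs["function"])
```

Because the source is parsed rather than imported, the second string
literal is still visible to the extraction code even though it never
reaches __doc__.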

Example::

    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the bytecode
        compiler, but extracted by the Docutils.
        """
        pass

Issue: This breaks "from __future__ import" statements in Python 2.1
for multiple module docstrings. The Python Reference Manual
specifies:

    A future statement must appear near the top of the module. The
    only lines that can appear before a future statement are:

    * the module docstring (if any),
    * comments,
    * blank lines, and
    * other future statements.

Resolution?

1. Should we search for docstrings after a __future__ statement? Very
   ugly.

2. Redefine __future__ statements to allow multiple preceding string
   literals?

3. Or should we not even worry about this? There shouldn't be
   __future__ statements in production code, after all. Will modules
   with __future__ statements simply have to put up with the
   single-docstring limitation?


Choice of Docstring Format
--------------------------

Rather than force everyone to use a single docstring format, multiple
input formats are allowed by the processing system. A special
variable, __docformat__, may appear at the top level of a module
before any function or class definitions. Over time or through
decree, a standard format or set of formats should emerge.

The __docformat__ variable is a string containing the name of the
format being used, a case-insensitive string matching the input
parser's module or package name (i.e., the same name as required to
"import" the module or package), or a registered alias. If no
__docformat__ is specified, the default format is "plaintext" for now;
this may be changed to the standard format once determined.

The __docformat__ string may contain an optional second field,
separated from the format name (first field) by a single space: a
case-insensitive language identifier as defined in RFC 1766 [10].
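
The two-field layout of __docformat__ can be illustrated with a small
parsing helper. This is a sketch only: the function name
"parse_docformat()" is hypothetical, normalizing both fields to lower
case is just one way to honor the case-insensitivity rule, and the
"en" default for a missing language field follows this section's
stated default of English.

```python
def parse_docformat(value=None, default_format="plaintext",
                    default_language="en"):
    """Split a __docformat__ string into (format_name, language)."""
    if value is None or not value.strip():
        # No __docformat__ given: fall back to both defaults.
        return default_format, default_language
    fields = value.split()
    if len(fields) > 2:
        raise ValueError("expected at most two fields: %r" % (value,))
    # Both fields are case-insensitive, so compare in lower case.
    format_name = fields[0].lower()
    language = fields[1].lower() if len(fields) == 2 else default_language
    return format_name, language


print(parse_docformat("reStructuredText en"))
print(parse_docformat("plaintext"))
print(parse_docformat())
```

A caller could then use the first element to select a parser and pass
the second element along to it.
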
A typical language identifier consists of a 2-letter language code
from ISO 639 [11] (3-letter codes used only if no 2-letter code
exists; RFC 1766 is currently being revised to allow 3-letter codes).
If no language identifier is specified, the default is "en" for
English. The language identifier is passed to the parser and can be
used for language-dependent markup features.


Identifier Cross-References
---------------------------

In Python docstrings, interpreted text is used to classify and mark up
program identifiers, such as the names of variables, functions,
classes, and modules. If the identifier alone is given, its role is
inferred implicitly according to the Python namespace lookup rules.
For functions and methods (even when dynamically assigned),
parentheses ('()') may be included::

    This function uses `another()` to do its work.

For class, instance and module attributes, dotted identifiers are used
when necessary. For example (using reStructuredText markup)::

    class Keeper(Storer):

        """
        Extend `Storer`. Class attribute `instances` keeps track of
        the number of `Keeper` objects instantiated.
        """

        instances = 0
        """How many `Keeper` objects are there?"""

        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of instances.

            Keep count in `self.instances`, data in `self.data`.
            """
            Storer.__init__(self)
            self.instances += 1

            self.data = []
            """Store data in a list, most recent last."""

        def storedata(self, data):
            """
            Extend `Storer.storedata()`; append new `data` to a list
            (in `self.data`).
            """
            self.data.append(data)

Each of the identifiers quoted with backquotes ("`") will become
references to the definitions of the identifiers themselves.


Stylist Transforms
------------------

Stylist transforms are specialized transforms specific to a Reader.
The PySource Reader doesn't have to make any decisions as to style; it
just produces a logically constructed document tree, parsed and
linked, including custom node types.
Stylist transforms understand the custom nodes created by the Reader
and convert them into standard Docutils nodes.

Multiple Stylist transforms may be implemented and one can be chosen
at runtime (through a "--style" or "--stylist" command-line option).
Each Stylist transform implements a different layout or style; thus
the name. They decouple the context-understanding part of the Reader
from the layout-generating part of processing, resulting in a more
flexible and robust system. This also serves to "separate style from
content", the SGML/XML ideal.

By keeping the piece of code that does the styling small and modular,
it becomes much easier for people to roll their own styles. The
"barrier to entry" is too high with existing tools; extracting the
stylist code will lower the barrier considerably.


References and Footnotes

[1] PEP 256, Docstring Processing System Framework, Goodger
    http://www.python.org/peps/pep-0256.html

[2] http://docutils.sourceforge.net/spec/docutils.dtd

[3] http://docutils.sourceforge.net/spec/soextblx.dtd

[4] http://docutils.sourceforge.net/spec/doctree.txt

[5] http://www.openvms.compaq.com:8000/73final/5841/5841pro_027.html#error_cond_severity

[6] http://jakarta.apache.org/log4j/

[7] http://docutils.sourceforge.net/spec/pysource.dtd

[8] PEP 224, Attribute Docstrings, Lemburg
    http://www.python.org/peps/pep-0224.html

[9] PEP 216, Docstring Format, Zadka
    http://www.python.org/peps/pep-0216.html

[10] http://www.rfc-editor.org/rfc/rfc1766.txt

[11] http://lcweb.loc.gov/standards/iso639-2/englangn.html

[12] http://www.python.org/sigs/doc-sig/


Project Web Site

A SourceForge project has been set up for this work at
http://docutils.sourceforge.net/.


Copyright

This document has been placed in the public domain.


Acknowledgements

This document borrows ideas from the archives of the Python Doc-SIG
[12]. Thanks to all members past & present.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
fill-column: 70
sentence-end-double-space: t
End: