2005-02-12 17:02:05 -05:00
|
|
|
|
PEP: 339
|
2006-03-02 17:04:09 -05:00
|
|
|
|
Title: Design of the CPython Compiler
|
2005-02-12 17:02:05 -05:00
|
|
|
|
Version: $Revision$
|
|
|
|
|
Last-Modified: $Date$
|
|
|
|
|
Author: Brett Cannon <brett@python.org>
|
2011-01-17 19:37:50 -05:00
|
|
|
|
Status: Withdrawn
|
2005-02-12 17:02:05 -05:00
|
|
|
|
Type: Informational
|
|
|
|
|
Content-Type: text/x-rst
|
|
|
|
|
Created: 02-Feb-2005
|
2006-03-02 17:04:09 -05:00
|
|
|
|
Post-History:
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
|
|
|
|
|
2011-01-17 19:37:50 -05:00
|
|
|
|
.. note::
|
|
|
|
|
This PEP has been withdrawn and moved to the Python
|
|
|
|
|
developer's guide.
|
|
|
|
|
|
|
|
|
|
|
2005-02-12 17:02:05 -05:00
|
|
|
|
Abstract
|
2006-03-02 17:04:09 -05:00
|
|
|
|
--------
|
|
|
|
|
|
|
|
|
|
Historically (through 2.4), compilation from source code to bytecode
|
|
|
|
|
involved two steps:
|
|
|
|
|
|
|
|
|
|
1. Parse the source code into a parse tree (Parser/pgen.c)
|
|
|
|
|
2. Emit bytecode based on the parse tree (Python/compile.c)
|
|
|
|
|
|
|
|
|
|
Historically, this is not how a standard compiler works. The usual
|
|
|
|
|
steps for compilation are:
|
|
|
|
|
|
|
|
|
|
1. Parse source code into a parse tree (Parser/pgen.c)
|
|
|
|
|
2. Transform parse tree into an Abstract Syntax Tree (Python/ast.c)
|
|
|
|
|
3. Transform AST into a Control Flow Graph (Python/compile.c)
|
|
|
|
|
4. Emit bytecode based on the Control Flow Graph (Python/compile.c)
|
|
|
|
|
|
|
|
|
|
Starting with Python 2.5, the above steps are now used. This change
|
|
|
|
|
was done to simplify compilation by breaking it into three steps.
|
2006-06-03 13:20:12 -04:00
|
|
|
|
The purpose of this document is to outline how the latter three steps
|
2006-03-02 17:04:09 -05:00
|
|
|
|
of the process works.
|
|
|
|
|
|
|
|
|
|
This document does not touch on how parsing works beyond what is needed
|
|
|
|
|
to explain what is needed for compilation. It is also not exhaustive
|
|
|
|
|
in terms of the how the entire system works. You will most likely need
|
|
|
|
|
to read some source to have an exact understanding of all details.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Parse Trees
|
|
|
|
|
-----------
|
|
|
|
|
|
2016-07-11 11:14:08 -04:00
|
|
|
|
Python's parser is an LL(1) parser mostly based on the
|
2006-03-02 17:04:09 -05:00
|
|
|
|
implementation laid out in the Dragon Book [Aho86]_.
|
|
|
|
|
|
|
|
|
|
The grammar file for Python can be found in Grammar/Grammar with the
|
|
|
|
|
numeric value of grammar rules are stored in Include/graminit.h. The
|
|
|
|
|
numeric values for types of tokens (literal tokens, such as ``:``,
|
|
|
|
|
numbers, etc.) are kept in Include/token.h). The parse tree made up of
|
|
|
|
|
``node *`` structs (as defined in Include/node.h).
|
|
|
|
|
|
|
|
|
|
Querying data from the node structs can be done with the following
|
|
|
|
|
macros (which are all defined in Include/token.h):
|
|
|
|
|
|
|
|
|
|
- ``CHILD(node *, int)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Returns the nth child of the node using zero-offset indexing
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``RCHILD(node *, int)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Returns the nth child of the node from the right side; use
|
|
|
|
|
negative numbers!
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``NCH(node *)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Number of children the node has
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``STR(node *)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
String representation of the node; e.g., will return ``:`` for a
|
|
|
|
|
COLON token
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``TYPE(node *)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
The type of node as specified in ``Include/graminit.h``
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``REQ(node *, TYPE)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Assert that the node is the type that is expected
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``LINENO(node *)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
retrieve the line number of the source code that led to the
|
|
|
|
|
creation of the parse rule; defined in Python/ast.c
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
To tie all of this example, consider the rule for 'while'::
|
|
|
|
|
|
|
|
|
|
while_stmt: 'while' test ':' suite ['else' ':' suite]
|
|
|
|
|
|
|
|
|
|
The node representing this will have ``TYPE(node) == while_stmt`` and
|
|
|
|
|
the number of children can be 4 or 7 depending on if there is an 'else'
|
|
|
|
|
statement. To access what should be the first ':' and require it be an
|
2022-01-21 06:03:51 -05:00
|
|
|
|
actual ':' token, ``(REQ(CHILD(node, 2), COLON)``.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Abstract Syntax Trees (AST)
|
|
|
|
|
---------------------------
|
|
|
|
|
|
|
|
|
|
The abstract syntax tree (AST) is a high-level representation of the
|
|
|
|
|
program structure without the necessity of containing the source code;
|
2006-03-03 16:45:55 -05:00
|
|
|
|
it can be thought of as an abstract representation of the source code. The
|
2006-03-02 17:04:09 -05:00
|
|
|
|
specification of the AST nodes is specified using the Zephyr Abstract
|
|
|
|
|
Syntax Definition Language (ASDL) [Wang97]_.
|
|
|
|
|
|
|
|
|
|
The definition of the AST nodes for Python is found in the file
|
|
|
|
|
Parser/Python.asdl .
|
|
|
|
|
|
|
|
|
|
Each AST node (representing statements, expressions, and several
|
|
|
|
|
specialized types, like list comprehensions and exception handlers) is
|
|
|
|
|
defined by the ASDL. Most definitions in the AST correspond to a
|
|
|
|
|
particular source construct, such as an 'if' statement or an attribute
|
|
|
|
|
lookup. The definition is independent of its realization in any
|
|
|
|
|
particular programming language.
|
|
|
|
|
|
|
|
|
|
The following fragment of the Python ASDL construct demonstrates the
|
|
|
|
|
approach and syntax::
|
|
|
|
|
|
|
|
|
|
module Python
|
|
|
|
|
{
|
2017-03-23 18:57:19 -04:00
|
|
|
|
stmt = FunctionDef(identifier name, arguments args, stmt* body,
|
|
|
|
|
expr* decorators)
|
|
|
|
|
| Return(expr? value) | Yield(expr value)
|
|
|
|
|
attributes (int lineno)
|
2006-03-02 17:04:09 -05:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
The preceding example describes three different kinds of statements;
|
|
|
|
|
function definitions, return statements, and yield statements. All
|
|
|
|
|
three kinds are considered of type stmt as shown by '|' separating the
|
|
|
|
|
various kinds. They all take arguments of various kinds and amounts.
|
|
|
|
|
|
|
|
|
|
Modifiers on the argument type specify the number of values needed; '?'
|
|
|
|
|
means it is optional, '*' means 0 or more, no modifier means only one
|
|
|
|
|
value for the argument and it is required. FunctionDef, for instance,
|
|
|
|
|
takes an identifier for the name, 'arguments' for args, zero or more
|
|
|
|
|
stmt arguments for 'body', and zero or more expr arguments for
|
|
|
|
|
'decorators'.
|
|
|
|
|
|
|
|
|
|
Do notice that something like 'arguments', which is a node type, is
|
|
|
|
|
represented as a single AST node and not as a sequence of nodes as with
|
2017-03-24 17:11:33 -04:00
|
|
|
|
stmt as one might expect.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
All three kinds also have an 'attributes' argument; this is shown by the
|
|
|
|
|
fact that 'attributes' lacks a '|' before it.
|
|
|
|
|
|
|
|
|
|
The statement definitions above generate the following C structure type::
|
|
|
|
|
|
|
|
|
|
typedef struct _stmt *stmt_ty;
|
|
|
|
|
|
|
|
|
|
struct _stmt {
|
|
|
|
|
enum { FunctionDef_kind=1, Return_kind=2, Yield_kind=3 } kind;
|
|
|
|
|
union {
|
|
|
|
|
struct {
|
|
|
|
|
identifier name;
|
|
|
|
|
arguments_ty args;
|
|
|
|
|
asdl_seq *body;
|
|
|
|
|
} FunctionDef;
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
struct {
|
|
|
|
|
expr_ty value;
|
|
|
|
|
} Return;
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
struct {
|
|
|
|
|
expr_ty value;
|
|
|
|
|
} Yield;
|
|
|
|
|
} v;
|
|
|
|
|
int lineno;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
Also generated are a series of constructor functions that allocate (in
|
|
|
|
|
this case) a stmt_ty struct with the appropriate initialization. The
|
|
|
|
|
'kind' field specifies which component of the union is initialized. The
|
|
|
|
|
FunctionDef() constructor function sets 'kind' to FunctionDef_kind and
|
|
|
|
|
initializes the 'name', 'args', 'body', and 'attributes' fields.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Memory Management
|
|
|
|
|
-----------------
|
|
|
|
|
|
|
|
|
|
Before discussing the actual implementation of the compiler, a discussion of
|
|
|
|
|
how memory is handled is in order. To make memory management simple, an arena
|
|
|
|
|
is used. This means that a memory is pooled in a single location for easy
|
|
|
|
|
allocation and removal. What this gives us is the removal of explicit memory
|
|
|
|
|
deallocation. Because memory allocation for all needed memory in the compiler
|
|
|
|
|
registers that memory with the arena, a single call to free the arena is all
|
|
|
|
|
that is needed to completely free all memory used by the compiler.
|
|
|
|
|
|
|
|
|
|
In general, unless you are working on the critical core of the compiler, memory
|
|
|
|
|
management can be completely ignored. But if you are working at either the
|
|
|
|
|
very beginning of the compiler or the end, you need to care about how the arena
|
|
|
|
|
works. All code relating to the arena is in either Include/pyarena.h or
|
|
|
|
|
Python/pyarena.c .
|
|
|
|
|
|
|
|
|
|
PyArena_New() will create a new arena. The returned PyArena structure will
|
|
|
|
|
store pointers to all memory given to it. This does the bookkeeping of what
|
|
|
|
|
memory needs to be freed when the compiler is finished with the memory it used.
|
|
|
|
|
That freeing is done with PyArena_Free(). This needs to only be called in
|
|
|
|
|
strategic areas where the compiler exits.
|
|
|
|
|
|
|
|
|
|
As stated above, in general you should not have to worry about memory
|
|
|
|
|
management when working on the compiler. The technical details have been
|
|
|
|
|
designed to be hidden from you for most cases.
|
|
|
|
|
|
2006-03-03 16:45:55 -05:00
|
|
|
|
The only exception comes about when managing a PyObject. Since the rest
|
|
|
|
|
of Python uses reference counting, there is extra support added
|
|
|
|
|
to the arena to cleanup each PyObject that was allocated. These cases
|
|
|
|
|
are very rare. However, if you've allocated a PyObject, you must tell
|
|
|
|
|
the arena about it by calling PyArena_AddPyObject().
|
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
Parse Tree to AST
|
|
|
|
|
-----------------
|
|
|
|
|
|
|
|
|
|
The AST is generated from the parse tree (see Python/ast.c) using the
|
|
|
|
|
function ``PyAST_FromNode()``.
|
|
|
|
|
|
|
|
|
|
The function begins a tree walk of the parse tree, creating various AST
|
|
|
|
|
nodes as it goes along. It does this by allocating all new nodes it
|
|
|
|
|
needs, calling the proper AST node creation functions for any required
|
|
|
|
|
supporting functions, and connecting them as needed.
|
|
|
|
|
|
|
|
|
|
Do realize that there is no automated nor symbolic connection between
|
|
|
|
|
the grammar specification and the nodes in the parse tree. No help is
|
|
|
|
|
directly provided by the parse tree as in yacc.
|
|
|
|
|
|
2006-03-03 16:45:55 -05:00
|
|
|
|
For instance, one must keep track of which node in the parse tree
|
|
|
|
|
one is working with (e.g., if you are working with an 'if' statement
|
|
|
|
|
you need to watch out for the ':' token to find the end of the conditional).
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
The functions called to generate AST nodes from the parse tree all have
|
|
|
|
|
the name ast_for_xx where xx is what the grammar rule that the function
|
|
|
|
|
handles (alias_for_import_name is the exception to this). These in turn
|
|
|
|
|
call the constructor functions as defined by the ASDL grammar and
|
|
|
|
|
contained in Python/Python-ast.c (which was generated by
|
|
|
|
|
Parser/asdl_c.py) to create the nodes of the AST. This all leads to a
|
|
|
|
|
sequence of AST nodes stored in asdl_seq structs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Function and macros for creating and using ``asdl_seq *`` types as found
|
|
|
|
|
in Python/asdl.c and Include/asdl.h:
|
|
|
|
|
|
|
|
|
|
- ``asdl_seq_new()``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Allocate memory for an asdl_seq for the specified length
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``asdl_seq_GET()``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Get item held at a specific position in an asdl_seq
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``asdl_seq_SET()``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Set a specific index in an asdl_seq to the specified value
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``asdl_seq_LEN(asdl_seq *)``
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Return the length of an asdl_seq
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
If you are working with statements, you must also worry about keeping
|
|
|
|
|
track of what line number generated the statement. Currently the line
|
|
|
|
|
number is passed as the last parameter to each stmt_ty function.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Control Flow Graphs
|
|
|
|
|
-------------------
|
|
|
|
|
|
|
|
|
|
A control flow graph (often referenced by its acronym, CFG) is a
|
|
|
|
|
directed graph that models the flow of a program using basic blocks that
|
|
|
|
|
contain the intermediate representation (abbreviated "IR", and in this
|
|
|
|
|
case is Python bytecode) within the blocks. Basic blocks themselves are
|
|
|
|
|
a block of IR that has a single entry point but possibly multiple exit
|
|
|
|
|
points. The single entry point is the key to basic blocks; it all has
|
|
|
|
|
to do with jumps. An entry point is the target of something that
|
|
|
|
|
changes control flow (such as a function call or a jump) while exit
|
|
|
|
|
points are instructions that would change the flow of the program (such
|
|
|
|
|
as jumps and 'return' statements). What this means is that a basic
|
|
|
|
|
block is a chunk of code that starts at the entry point and runs to an
|
|
|
|
|
exit point or the end of the block.
|
2017-03-24 17:11:33 -04:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
As an example, consider an 'if' statement with an 'else' block. The
|
|
|
|
|
guard on the 'if' is a basic block which is pointed to by the basic
|
|
|
|
|
block containing the code leading to the 'if' statement. The 'if'
|
|
|
|
|
statement block contains jumps (which are exit points) to the true body
|
|
|
|
|
of the 'if' and the 'else' body (which may be NULL), each of which are
|
|
|
|
|
their own basic blocks. Both of those blocks in turn point to the
|
|
|
|
|
basic block representing the code following the entire 'if' statement.
|
|
|
|
|
|
|
|
|
|
CFGs are usually one step away from final code output. Code is directly
|
|
|
|
|
generated from the basic blocks (with jump targets adjusted based on the
|
|
|
|
|
output order) by doing a post-order depth-first search on the CFG
|
|
|
|
|
following the edges.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
AST to CFG to Bytecode
|
|
|
|
|
----------------------
|
|
|
|
|
|
|
|
|
|
With the AST created, the next step is to create the CFG. The first step
|
|
|
|
|
is to convert the AST to Python bytecode without having jump targets
|
|
|
|
|
resolved to specific offsets (this is calculated when the CFG goes to
|
|
|
|
|
final bytecode). Essentially, this transforms the AST into Python
|
|
|
|
|
bytecode with control flow represented by the edges of the CFG.
|
|
|
|
|
|
|
|
|
|
Conversion is done in two passes. The first creates the namespace
|
|
|
|
|
(variables can be classified as local, free/cell for closures, or
|
|
|
|
|
global). With that done, the second pass essentially flattens the CFG
|
|
|
|
|
into a list and calculates jump offsets for final output of bytecode.
|
|
|
|
|
|
|
|
|
|
The conversion process is initiated by a call to the function
|
|
|
|
|
``PyAST_Compile()`` in Python/compile.c . This function does both the
|
|
|
|
|
conversion of the AST to a CFG and
|
|
|
|
|
outputting final bytecode from the CFG. The AST to CFG step is handled
|
|
|
|
|
mostly by two functions called by PyAST_Compile(); PySymtable_Build() and
|
|
|
|
|
compiler_mod() . The former is in Python/symtable.c while the latter is in
|
|
|
|
|
Python/compile.c .
|
|
|
|
|
|
|
|
|
|
PySymtable_Build() begins by entering the starting code block for the
|
|
|
|
|
AST (passed-in) and then calling the proper symtable_visit_xx function
|
|
|
|
|
(with xx being the AST node type). Next, the AST tree is walked with
|
|
|
|
|
the various code blocks that delineate the reach of a local variable
|
|
|
|
|
as blocks are entered and exited using symtable_enter_block() and
|
|
|
|
|
symtable_exit_block(), respectively.
|
|
|
|
|
|
|
|
|
|
Once the symbol table is created, it is time for CFG creation, whose
|
|
|
|
|
code is in Python/compile.c . This is handled by several functions
|
|
|
|
|
that break the task down by various AST node types. The functions are
|
|
|
|
|
all named compiler_visit_xx where xx is the name of the node type (such
|
|
|
|
|
as stmt, expr, etc.). Each function receives a ``struct compiler *``
|
|
|
|
|
and xx_ty where xx is the AST node type. Typically these functions
|
|
|
|
|
consist of a large 'switch' statement, branching based on the kind of
|
|
|
|
|
node type passed to it. Simple things are handled inline in the
|
|
|
|
|
'switch' statement with more complex transformations farmed out to other
|
|
|
|
|
functions named compiler_xx with xx being a descriptive name of what is
|
|
|
|
|
being handled.
|
|
|
|
|
|
|
|
|
|
When transforming an arbitrary AST node, use the VISIT() macro.
|
|
|
|
|
The appropriate compiler_visit_xx function is called, based on the value
|
|
|
|
|
passed in for <node type> (so ``VISIT(c, expr, node)`` calls
|
|
|
|
|
``compiler_visit_expr(c, node)``). The VISIT_SEQ macro is very similar,
|
|
|
|
|
but is called on AST node sequences (those values that were created as
|
|
|
|
|
arguments to a node that used the '*' modifier). There is also
|
|
|
|
|
VISIT_SLICE() just for handling slices.
|
|
|
|
|
|
|
|
|
|
Emission of bytecode is handled by the following macros:
|
|
|
|
|
|
|
|
|
|
- ``ADDOP()``
|
2006-03-03 16:45:55 -05:00
|
|
|
|
add a specified opcode
|
2006-03-02 17:04:09 -05:00
|
|
|
|
- ``ADDOP_I()``
|
|
|
|
|
add an opcode that takes an argument
|
|
|
|
|
- ``ADDOP_O(struct compiler *c, int op, PyObject *type, PyObject *obj)``
|
|
|
|
|
add an opcode with the proper argument based on the position of the
|
|
|
|
|
specified PyObject in PyObject sequence object, but with no handling of
|
|
|
|
|
mangled names; used for when you
|
|
|
|
|
need to do named lookups of objects such as globals, consts, or
|
|
|
|
|
parameters where name mangling is not possible and the scope of the
|
|
|
|
|
name is known
|
|
|
|
|
- ``ADDOP_NAME()``
|
|
|
|
|
just like ADDOP_O, but name mangling is also handled; used for
|
|
|
|
|
attribute loading or importing based on name
|
|
|
|
|
- ``ADDOP_JABS()``
|
|
|
|
|
create an absolute jump to a basic block
|
|
|
|
|
- ``ADDOP_JREL()``
|
|
|
|
|
create a relative jump to a basic block
|
|
|
|
|
|
|
|
|
|
Several helper functions that will emit bytecode and are named
|
|
|
|
|
compiler_xx() where xx is what the function helps with (list, boolop,
|
|
|
|
|
etc.). A rather useful one is compiler_nameop().
|
|
|
|
|
This function looks up the scope of a variable and, based on the
|
|
|
|
|
expression context, emits the proper opcode to load, store, or delete
|
|
|
|
|
the variable.
|
|
|
|
|
|
|
|
|
|
As for handling the line number on which a statement is defined, is
|
|
|
|
|
handled by compiler_visit_stmt() and thus is not a worry.
|
|
|
|
|
|
|
|
|
|
In addition to emitting bytecode based on the AST node, handling the
|
|
|
|
|
creation of basic blocks must be done. Below are the macros and
|
|
|
|
|
functions used for managing basic blocks:
|
|
|
|
|
|
|
|
|
|
- ``NEW_BLOCK()``
|
|
|
|
|
create block and set it as current
|
|
|
|
|
- ``NEXT_BLOCK()``
|
|
|
|
|
basically NEW_BLOCK() plus jump from current block
|
|
|
|
|
- ``compiler_new_block()``
|
|
|
|
|
create a block but don't use it (used for generating jumps)
|
|
|
|
|
|
|
|
|
|
Once the CFG is created, it must be flattened and then final emission of
|
|
|
|
|
bytecode occurs. Flattening is handled using a post-order depth-first
|
|
|
|
|
search. Once flattened, jump offsets are backpatched based on the
|
|
|
|
|
flattening and then a PyCodeObject file is created. All of this is
|
|
|
|
|
handled by calling assemble() .
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Introducing New Bytecode
|
|
|
|
|
------------------------
|
|
|
|
|
|
|
|
|
|
Sometimes a new feature requires a new opcode. But adding new bytecode is
|
|
|
|
|
not as simple as just suddenly introducing new bytecode in the AST ->
|
|
|
|
|
bytecode step of the compiler. Several pieces of code throughout Python depend
|
|
|
|
|
on having correct information about what bytecode exists.
|
|
|
|
|
|
|
|
|
|
First, you must choose a name and a unique identifier number. The official
|
|
|
|
|
list of bytecode can be found in Include/opcode.h . If the opcode is to take
|
|
|
|
|
an argument, it must be given a unique number greater than that assigned to
|
2006-08-29 03:19:18 -04:00
|
|
|
|
``HAVE_ARGUMENT`` (as found in Include/opcode.h).
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
Once the name/number pair
|
|
|
|
|
has been chosen and entered in Include/opcode.h, you must also enter it into
|
2008-06-07 20:24:21 -04:00
|
|
|
|
Lib/opcode.py and Doc/library/dis.rst .
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
With a new bytecode you must also change what is called the magic number for
|
|
|
|
|
.pyc files. The variable ``MAGIC`` in Python/import.c contains the number.
|
2006-03-03 16:45:55 -05:00
|
|
|
|
Changing this number will lead to all .pyc files with the old MAGIC
|
|
|
|
|
to be recompiled by the interpreter on import.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
Finally, you need to introduce the use of the new bytecode. Altering
|
2006-03-03 16:45:55 -05:00
|
|
|
|
Python/compile.c and Python/ceval.c will be the primary places to change.
|
|
|
|
|
But you will also need to change the 'compiler' package. The key files
|
|
|
|
|
to do that are Lib/compiler/pyassem.py and Lib/compiler/pycodegen.py .
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
If you make a change here that can affect the output of bytecode that
|
|
|
|
|
is already in existence and you do not change the magic number constantly, make
|
|
|
|
|
sure to delete your old .py(c|o) files! Even though you will end up changing
|
|
|
|
|
the magic number if you change the bytecode, while you are debugging your work
|
|
|
|
|
you will be changing the bytecode output without constantly bumping up the
|
|
|
|
|
magic number. This means you end up with stale .pyc files that will not be
|
|
|
|
|
recreated. Running
|
|
|
|
|
``find . -name '*.py[co]' -exec rm -f {} ';'`` should delete all .pyc files you
|
|
|
|
|
have, forcing new ones to be created and thus allow you test out your new
|
|
|
|
|
bytecode properly.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Code Objects
|
|
|
|
|
------------
|
|
|
|
|
|
2006-03-03 16:45:55 -05:00
|
|
|
|
The result of ``PyAST_Compile()`` is a PyCodeObject which is defined in
|
2006-03-02 17:04:09 -05:00
|
|
|
|
Include/code.h . And with that you now have executable Python bytecode!
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-03 16:45:55 -05:00
|
|
|
|
The code objects (byte code) is executed in Python/ceval.c . This file
|
|
|
|
|
will also need a new case statement for the new opcode in the big switch
|
|
|
|
|
statement in PyEval_EvalFrameEx().
|
|
|
|
|
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
Important Files
|
|
|
|
|
---------------
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
+ Parser/
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- Python.asdl
|
|
|
|
|
ASDL syntax file
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- asdl.py
|
|
|
|
|
"An implementation of the Zephyr Abstract Syntax Definition
|
|
|
|
|
Language." Uses SPARK_ to parse the ASDL files.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2017-03-23 18:57:19 -04:00
|
|
|
|
- asdl_c.py
|
|
|
|
|
"Generate C code from an ASDL description." Generates
|
|
|
|
|
Python/Python-ast.c and Include/Python-ast.h .
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2017-03-23 18:57:19 -04:00
|
|
|
|
- spark.py
|
|
|
|
|
SPARK_ parser generator
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
+ Python/
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- Python-ast.c
|
|
|
|
|
Creates C structs corresponding to the ASDL types. Also
|
2016-05-13 00:07:07 -04:00
|
|
|
|
contains code for marshaling AST nodes (core ASDL types have
|
|
|
|
|
marshaling code in asdl.c). "File automatically generated by
|
|
|
|
|
Parser/asdl_c.py". This file must be committed separately
|
2016-05-03 04:18:02 -04:00
|
|
|
|
after every grammar change is committed since the __version__
|
|
|
|
|
value is set to the latest grammar change revision number.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- asdl.c
|
|
|
|
|
Contains code to handle the ASDL sequence type. Also has code
|
|
|
|
|
to handle marshalling the core ASDL types, such as number and
|
|
|
|
|
identifier. used by Python-ast.c for marshaling AST nodes.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- ast.c
|
|
|
|
|
Converts Python's parse tree into the abstract syntax tree.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- ceval.c
|
|
|
|
|
Executes byte code (aka, eval loop).
|
2006-03-03 16:45:55 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- compile.c
|
|
|
|
|
Emits bytecode based on the AST.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- symtable.c
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Generates a symbol table from AST.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- pyarena.c
|
|
|
|
|
Implementation of the arena memory manager.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- import.c
|
|
|
|
|
Home of the magic number (named ``MAGIC``) for bytecode versioning
|
2017-03-23 18:57:19 -04:00
|
|
|
|
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
+ Include/
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- Python-ast.h
|
|
|
|
|
Contains the actual definitions of the C structs as generated by
|
|
|
|
|
Python/Python-ast.c .
|
|
|
|
|
"Automatically generated by Parser/asdl_c.py".
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- asdl.h
|
|
|
|
|
Header for the corresponding Python/ast.c .
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- ast.h
|
|
|
|
|
Declares PyAST_FromNode() external (from Python/ast.c).
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- code.h
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Header file for Objects/codeobject.c; contains definition of
|
|
|
|
|
PyCodeObject.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- symtable.h
|
2017-03-23 18:57:19 -04:00
|
|
|
|
Header for Python/symtable.c . struct symtable and
|
|
|
|
|
PySTEntryObject are defined here.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- pyarena.h
|
|
|
|
|
Header file for the corresponding Python/pyarena.c .
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- opcode.h
|
|
|
|
|
Master list of bytecode; if this file is modified you must modify
|
|
|
|
|
several other files accordingly (see "`Introducing New Bytecode`_")
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
+ Objects/
|
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- codeobject.c
|
|
|
|
|
Contains PyCodeObject-related code (originally in
|
|
|
|
|
Python/compile.c).
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
+ Lib/
|
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- opcode.py
|
|
|
|
|
One of the files that must be modified if Include/opcode.h is.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
- compiler/
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
* pyassem.py
|
|
|
|
|
One of the files that must be modified if Include/opcode.h is
|
|
|
|
|
changed.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
2016-05-03 04:18:02 -04:00
|
|
|
|
* pycodegen.py
|
|
|
|
|
One of the files that must be modified if Include/opcode.h is
|
|
|
|
|
changed.
|
2006-03-02 17:04:09 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Known Compiler-related Experiments
|
|
|
|
|
----------------------------------
|
|
|
|
|
|
|
|
|
|
This section lists known experiments involving the compiler (including
|
|
|
|
|
bytecode).
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
|
|
|
|
Skip Montanaro presented a paper at a Python workshop on a peephole optimizer
|
|
|
|
|
[#skip-peephole]_.
|
|
|
|
|
|
|
|
|
|
Michael Hudson has a non-active SourceForge project named Bytecodehacks
|
|
|
|
|
[#Bytecodehacks]_ that provides functionality for playing with bytecode
|
|
|
|
|
directly.
|
|
|
|
|
|
|
|
|
|
An opcode to combine the functionality of LOAD_ATTR/CALL_FUNCTION was created
|
|
|
|
|
named CALL_ATTR [#CALL_ATTR]_. Currently only works for classic classes and
|
|
|
|
|
for new-style classes rough benchmarking showed an actual slowdown thanks to
|
|
|
|
|
having to support both classic and new-style classes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
References
|
|
|
|
|
----------
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
.. [Aho86] Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman.
|
|
|
|
|
`Compilers: Principles, Techniques, and Tools`,
|
|
|
|
|
http://www.amazon.com/exec/obidos/tg/detail/-/0201100886/104-0162389-6419108
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
.. [Wang97] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Chris
|
|
|
|
|
S. Serra. `The Zephyr Abstract Syntax Description Language.`_
|
|
|
|
|
In Proceedings of the Conference on Domain-Specific Languages, pp.
|
|
|
|
|
213--227, 1997.
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
.. _The Zephyr Abstract Syntax Description Language.:
|
2013-11-10 15:00:13 -05:00
|
|
|
|
http://www.cs.princeton.edu/research/techreps/TR-554-97
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
2006-03-02 17:04:09 -05:00
|
|
|
|
.. _SPARK: http://pages.cpsc.ucalgary.ca/~aycock/spark/
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
|
|
|
|
.. [#skip-peephole] Skip Montanaro's Peephole Optimizer Paper
|
|
|
|
|
(http://www.foretec.com/python/workshops/1998-11/proceedings/papers/montanaro/montanaro.html)
|
|
|
|
|
|
|
|
|
|
.. [#Bytecodehacks] Bytecodehacks Project
|
|
|
|
|
(http://bytecodehacks.sourceforge.net/bch-docs/bch/index.html)
|
|
|
|
|
|
|
|
|
|
.. [#CALL_ATTR] CALL_ATTR opcode
|
2018-07-21 19:57:17 -04:00
|
|
|
|
(https://bugs.python.org/issue709744)
|
2005-02-12 17:02:05 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
..
|
|
|
|
|
Local Variables:
|
|
|
|
|
mode: indented-text
|
|
|
|
|
indent-tabs-mode: nil
|
|
|
|
|
sentence-end-double-space: t
|
2006-03-02 17:04:09 -05:00
|
|
|
|
fill-column: 80
|
2005-02-12 17:02:05 -05:00
|
|
|
|
End:
|