PEP: 432
Title: Simplifying the CPython startup sequence
Version: $Revision$
Last-Modified: $Date$
Author: Nick Coghlan
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 28-Dec-2012


Abstract
========

This PEP proposes a mechanism for simplifying the startup sequence for
CPython, making it easier to modify the initialisation behaviour of the
reference interpreter executable, as well as making it easier to control
CPython's startup behaviour when creating an alternate executable or
embedding it as a Python execution engine inside a larger application.


Proposal Summary
================

This PEP proposes that CPython move to an explicit 2-phase initialisation
process, where a preliminary interpreter is put in place with limited OS
interaction capabilities early in the startup sequence. This essential core
remains in place while all of the configuration settings are determined,
until a final configuration call takes those settings and finishes
bootstrapping the interpreter immediately before executing the main module.

As a concrete use case to help guide any design changes, and to solve a known
problem where the appropriate defaults for system utilities differ from those
for running user scripts, this PEP also proposes the creation and
distribution of a separate system Python (``spython``) executable which, by
default, ignores user site directories and environment variables, and does
not implicitly set ``sys.path[0]`` based on the current directory or the
script being executed.


Background
==========

Over time, CPython's initialisation sequence has become progressively more
complicated, offering more options, as well as performing more complex tasks
(such as configuring the Unicode settings for OS interfaces in Python 3 as
well as bootstrapping a pure Python implementation of the import system).

Much of this complexity is accessible only through the ``Py_Main`` and
``Py_Initialize`` APIs, offering embedding applications little opportunity
for customisation. This creeping complexity also makes life difficult for
maintainers, as much of the configuration needs to take place prior to the
``Py_Initialize`` call, meaning much of the Python C API cannot be used
safely.

A number of proposals are on the table for even *more* sophisticated
startup behaviour, such as better control over ``sys.path`` initialisation
(easily adding additional directories on the command line in a cross-platform
fashion, as well as controlling the configuration of ``sys.path[0]``), easier
configuration of utilities like coverage tracing when launching Python
subprocesses, and easier control of the encoding used for the standard IO
streams when embedding CPython in a larger application.

Rather than attempting to bolt such behaviour onto an already complicated
system, this PEP proposes to instead simplify the status quo *first*, with
the aim of making these further feature requests easier to implement.


Key Concerns
============

There are a couple of key concerns that any change to the startup sequence
needs to take into account.


Maintainability
---------------

The current CPython startup sequence is difficult to understand, and even
more difficult to modify. It is not clear what state the interpreter is in
while much of the initialisation code executes, leading to behaviour such
as lists, dictionaries and Unicode values being created prior to the call
to ``Py_Initialize`` when the ``-X`` or ``-W`` options are used [1_].

By moving to a 2-phase startup sequence, developers should only need to
understand which features are not available in the core bootstrapping state,
as the vast majority of the configuration process will now take place in
that state.

By basing the new design on a combination of C structures and Python
dictionaries, it should also be easier to modify the system in the
future to add new configuration options.


Performance
-----------

CPython is used heavily to run short scripts where the runtime is dominated
by the interpreter initialisation time. Any changes to the startup sequence
should minimise their impact on the startup overhead. (Given that the
overhead is dominated by IO operations, this is not currently expected to
cause any significant problems).


The Status Quo
==============

Much of the configuration of CPython is currently handled through C level
global variables::

    Py_IgnoreEnvironmentFlag
    Py_HashRandomizationFlag
    _Py_HashSecretInitialized
    _Py_HashSecret
    Py_BytesWarningFlag
    Py_DebugFlag
    Py_InspectFlag
    Py_InteractiveFlag
    Py_OptimizeFlag
    Py_DontWriteBytecodeFlag
    Py_NoUserSiteDirectory
    Py_NoSiteFlag
    Py_UnbufferedStdioFlag
    Py_VerboseFlag

For the above variables, the conversion of command line options and
environment variables to C global variables is handled by ``Py_Main``,
so each embedding application must set those appropriately in order to
change them from their defaults.

Some configuration can only be provided as OS level environment variables::

    PYTHONHASHSEED
    PYTHONSTARTUP
    PYTHONPATH
    PYTHONHOME
    PYTHONCASEOK
    PYTHONIOENCODING

Additional configuration is handled via separate API calls::

    Py_SetProgramName() (call before Py_Initialize())
    Py_SetPath() (optional, call before Py_Initialize())
    Py_SetPythonHome() (optional, call before Py_Initialize()???)
    PySys_SetArgv[Ex]() (call after Py_Initialize())

The ``Py_InitializeEx()`` API also accepts a boolean flag to indicate
whether or not CPython's signal handlers should be installed.

Finally, some interactive behaviour (such as printing the introductory
banner) is triggered only when standard input is reported as a terminal
connection by the operating system.

Also see more detailed notes at [1_].


Proposal
========

(Note: details here are still very much in flux, but preliminary feedback
is appreciated anyway)


Core Interpreter Initialisation
-------------------------------

The only configuration that currently absolutely needs to be in place
before even the interpreter core can be initialised is the seed for the
randomised hash algorithm. However, there are a couple of settings needed
there: whether or not hash randomisation is enabled at all, and if it's
enabled, whether or not to use a specific seed value.

The proposed API for this step in the startup sequence is::

    void Py_BeginInitialization(Py_CoreConfig *config);

Like Py_Initialize, this part of the new API treats initialisation failures
as fatal errors. While that's still not particularly embedding friendly,
the operations in this step *really* shouldn't be failing, and changing them
to return error codes instead of aborting would be an even larger task than
the one already being proposed.

The new Py_CoreConfig struct holds the settings required for preliminary
configuration::

    typedef struct {
        int use_hash_seed;
        size_t hash_seed;
    } Py_CoreConfig;

To "disable" hash randomisation, set "use_hash_seed" and pass a hash seed of
zero. (This seems reasonable to me, but there may be security implications
I'm overlooking. If so, adding a separate flag or switching to a 3-valued
"no randomisation", "fixed hash seed" and "randomised hash" option is easy.)

The core configuration settings pointer may be NULL, in which case the
default behaviour of randomised hashes with a random seed will be used.

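
As an illustration only, the three cases described above (NULL or unset
configuration, a seed of zero, and a non-zero seed) could be told apart as
in the following sketch. The ``Py_CoreConfig`` struct is copied from the
proposal; the ``hash_mode`` enum and ``classify_hash_config()`` are
hypothetical names invented for this example and are not part of the
proposed API.

```c
#include <stddef.h>

/* Struct layout as given in the proposal text. */
typedef struct {
    int use_hash_seed;
    size_t hash_seed;
} Py_CoreConfig;

/* Hypothetical classification, used here purely for illustration. */
typedef enum {
    HASH_RANDOMISED,  /* interpreter picks a random seed (the default) */
    HASH_DISABLED,    /* use_hash_seed set with a seed of zero */
    HASH_FIXED_SEED   /* use_hash_seed set with a non-zero seed */
} hash_mode;

/* Hypothetical helper: map a (possibly NULL) core config to a mode. */
static hash_mode classify_hash_config(const Py_CoreConfig *config)
{
    if (config == NULL || !config->use_hash_seed) {
        return HASH_RANDOMISED;
    }
    return config->hash_seed == 0 ? HASH_DISABLED : HASH_FIXED_SEED;
}
```

Under these assumptions, an embedding application that wants reproducible
hashing would fill in ``use_hash_seed = 1`` and a non-zero ``hash_seed``
before passing the struct to ``Py_BeginInitialization``.
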
A new query API will allow code to determine if the interpreter is in the
bootstrapping state between core initialisation and the completion of the
initialisation process::

    int Py_IsInitializing();

While in the initialising state, the interpreter should be fully functional
except that:

* compilation is not allowed (as the parser is not yet configured properly)
* The following attributes in the ``sys`` module are all either missing or
  ``None``:

  * ``sys.path``
  * ``sys.argv``
  * ``sys.executable``
  * ``sys.base_exec_prefix``
  * ``sys.base_prefix``
  * ``sys.exec_prefix``
  * ``sys.prefix``
  * ``sys.warnoptions``
  * ``sys.flags``
  * ``sys.dont_write_bytecode``
  * ``sys.stdin``
  * ``sys.stdout``

* The filesystem encoding is not yet defined
* The IO encoding is not yet defined
* CPython signal handlers are not yet installed
* only builtin and frozen modules may be imported (due to above limitations)
* ``sys.stderr`` is set to a temporary IO object using unbuffered binary
  mode
* The ``warnings`` module is not yet initialised
* The ``__main__`` module does not yet exist

The main things made available by this step will be the core Python
datatypes, in particular dictionaries, lists and strings. This allows them
to be used safely for all of the remaining configuration steps (unlike the
status quo).

In addition, the current thread will possess a valid Python thread state,
allowing any further configuration data to be stored.

Any call to Py_BeginInitialization() must have a matching call to
Py_Finalize(). It is acceptable to skip calling Py_EndInitialization() in
between (e.g. if attempting to read the configuration settings fails).


Determining the remaining configuration settings
------------------------------------------------

The next step in the initialisation sequence is to determine the full
settings needed to complete the process. No changes are made to the
interpreter state at this point.

The core API for this step is::

    int Py_ReadConfiguration(PyObject *config);

The config argument should be a pointer to a Python dictionary. For any
supported configuration setting already in the dictionary, CPython will
sanity check the supplied value, but otherwise accept it as correct.

Unlike Py_Initialize and Py_BeginInitialization, this call will raise an
exception and report an error return rather than exhibiting fatal errors if
a problem is found with the config data.

Any supported configuration setting which is not already set will be
populated appropriately. The default configuration can be overridden
entirely by setting the value *before* calling Py_ReadConfiguration. The
provided value will then also be used in calculating any settings derived
from that value.

Alternatively, settings may be overridden *after* the Py_ReadConfiguration
call (this can be useful if an embedding application wants to adjust
a setting rather than replace it completely, such as removing
``sys.path[0]``).


Supported configuration settings
--------------------------------

At least the following configuration settings will be supported::

    raw_argv (list of str, default = retrieved from OS APIs)
    argv (list of str, default = derived from raw_argv)
    warnoptions (list of str, default = derived from raw_argv and environment)
    xoptions (list of str, default = derived from raw_argv and environment)
    program_name (str, default = retrieved from OS APIs)
    executable (str, default = derived from program_name)
    home (str, default = complicated!)
    prefix (str, default = complicated!)
    exec_prefix (str, default = complicated!)
    base_prefix (str, default = complicated!)
    base_exec_prefix (str, default = complicated!)
    path (list of str, default = complicated!)
    io_encoding (str, default = derived from environment or OS APIs)
    fs_encoding (str, default = derived from OS APIs)
    skip_signal_handlers (boolean, default = derived from environment or False)
    ignore_environment (boolean, default = derived from environment or False)
    dont_write_bytecode (boolean, default = derived from environment or False)
    no_site (boolean, default = derived from environment or False)
    no_user_site (boolean, default = derived from environment or False)


Completing the interpreter initialisation
-----------------------------------------

The final step in the process is to actually put the configuration settings
into effect and finish bootstrapping the interpreter up to full operation::

    int Py_EndInitialization(PyObject *config);

Like Py_ReadConfiguration, this call will raise an exception and report an
error return rather than exhibiting fatal errors if a problem is found with
the config data.

After a successful call, Py_IsInitializing() will be false, while
Py_IsInitialized() will become true. The caveats described above for the
interpreter during the initialisation phase will no longer hold.


Stable ABI
----------

All of the APIs proposed in this PEP are excluded from the stable ABI, as
embedding a Python interpreter involves a much higher degree of coupling
than merely writing an extension.


Backwards Compatibility
-----------------------

Backwards compatibility will be preserved primarily by ensuring that
Py_ReadConfiguration() interrogates all the previously defined configuration
settings stored in global variables and environment variables.

One acknowledged incompatibility is that some environment variables which
are currently read lazily may instead be read once during interpreter
initialisation. As the PEP matures, these will be discussed in more detail
on a case by case basis.

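
To make the idea of deriving a default from the existing globals and
environment more concrete, the following is a minimal sketch of how one
boolean setting might be resolved. It is not CPython's implementation:
``derive_flag_default()`` is a hypothetical helper, and the rule that a
non-empty environment value enables the flag is an assumption modelled on
how variables such as ``PYTHONDONTWRITEBYTECODE`` are documented to behave.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of CPython or of the API proposed here.
 *
 * global_flag:        current value of the legacy C global
 *                     (e.g. Py_DontWriteBytecodeFlag)
 * env_value:          result of getenv() for the matching variable,
 *                     or NULL if the variable is unset
 * ignore_environment: non-zero if environment variables must be ignored
 *
 * Returns 1 if the setting should default to true, 0 otherwise.
 */
static int derive_flag_default(int global_flag, const char *env_value,
                               int ignore_environment)
{
    if (global_flag) {
        /* An embedding application already set the legacy global. */
        return 1;
    }
    if (ignore_environment) {
        /* Environment is not trusted; fall back to False. */
        return 0;
    }
    /* Assumed rule: any non-empty value enables the flag. */
    return env_value != NULL && env_value[0] != '\0';
}
```

A caller could, for example, pass ``Py_DontWriteBytecodeFlag``,
``getenv("PYTHONDONTWRITEBYTECODE")`` and ``Py_IgnoreEnvironmentFlag`` to
obtain a default for ``dont_write_bytecode``. Reading the environment once
in this way is also where the lazy-read incompatibility noted above would
arise.
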

A System Python Executable
==========================

When executing system utilities with administrative access to a system, many
of the default behaviours of CPython are undesirable, as they may allow
untrusted code to execute with elevated privileges. The most problematic
aspects are the fact that user site directories are enabled,
environment variables are trusted and that the directory containing the
executed file is placed at the beginning of the import path.

Currently, providing a separate executable with different default behaviour
would be prohibitively hard to maintain. One of the goals of this PEP is to
make it possible to replace much of the hard to maintain bootstrapping code
with more normal CPython code, as well as making it easier for a separate
application to make use of key components of ``Py_Main``. Including this
change in the PEP is designed to help avoid acceptance of a design that
sounds good in theory but proves to be problematic in practice.

One final aspect not addressed by the general embedding changes above is
the current inaccessibility of the core logic for deciding between the
different execution modes supported by CPython::

    * script execution
    * directory/zipfile execution
    * command execution ("-c" switch)
    * module or package execution ("-m" switch)
    * execution from stdin (non-interactive)
    * interactive stdin


Implementation
==============

None as yet. Once I have a reasonably solid plan of attack, I intend to work
on a reference implementation as a feature branch in my BitBucket sandbox
[2_].


References
==========

.. [1] CPython interpreter initialization notes
   (http://wiki.python.org/moin/CPythonInterpreterInitialization)

.. [2] BitBucket Sandbox
   (https://bitbucket.org/ncoghlan/cpython_sandbox)


Copyright
=========

This document has been placed in the public domain.