Further PEP 432 updates

- describe all 4 proposed initialisation phases
- more detailed interface for hash seed handling
- move ignore environment flag into core config
- consistently use American spelling of initialize
- preliminary concept for main execution API
- misc notes on status quo
2012-12-30 23:39:20 +10:00
parent 1be4a026ed
commit 4281048f3b
1 changed files with 181 additions and 65 deletions
--- a/pep-0432.txt
+++ b/pep-0432.txt
@ -15,7 +15,7 @@ Abstract
 ========

 This PEP proposes a mechanism for simplifying the startup sequence for
-CPython, making it easier to modify the initialisation behaviour of the
+CPython, making it easier to modify the initialization behaviour of the
 reference interpreter executable, as well as making it easier to control
 CPython's startup behaviour when creating an alternate executable or
 embedding it as a Python execution engine inside a larger application.
@ -25,15 +25,24 @@ resolution for most of these should become clearer as the reference
 implementation is developed.


-Proposal Summary
-================
+Proposal
+========

-This PEP proposes that CPython move to an explicit 2-phase initialisation
+This PEP proposes that CPython move to an explicit multi-phase initialization
 process, where a preliminary interpreter is put in place with limited OS
 interaction capabilities early in the startup sequence. This essential core
 remains in place while all of the configuration settings are determined,
 until a final configuration call takes those settings and finishes
-bootstrapping the interpreter immediately before executing the main module.
+bootstrapping the interpreter immediately before locating and executing
+the main module.
+
+In the new design, the interpreter will move through the following
+well-defined phases during the startup sequence:
+
+* Pre-Initialization - no interpreter available
+* Initialization - limited interpreter available
+* Pre-Main - full interpreter available, __main__ related metadata incomplete
+* Main Execution - normal interpreter operation

 As a concrete use case to help guide any design changes, and to solve a known
 problem where the appropriate defaults for system utilities differ from those
@ -46,20 +55,21 @@ script being executed.
 To keep the implementation complexity under control, this PEP does *not*
 propose wholesale changes to the way the interpreter state is accessed at
 runtime, nor does it propose changes to the way subinterpreters are
-created after the main interpreter has already been initialised. Changing
-the order in which the existing initialisation steps occur to make the
-startup sequence easier to maintain is already a substantial change, and
+created after the main interpreter has already been initialized. Changing
+the order in which the existing initialization steps occur in order to make
+the startup sequence easier to maintain is already a substantial change, and
 attempting to make those other changes at the same time will make the
 change significantly more invasive and much harder to review. However, such
 proposals may be suitable topics for follow-on PEPs or patches - one key
 benefit of this PEP is decreasing the coupling between the internal storage
-model and the configuration interface.
+model and the configuration interface, so such changes should be easier
+once this PEP has been implemented.


 Background
 ==========

-Over time, CPython's initialisation sequence has become progressively more
+Over time, CPython's initialization sequence has become progressively more
 complicated, offering more options, as well as performing more complex tasks
 (such as configuring the Unicode settings for OS interfaces in Python 3 as
 well as bootstrapping a pure Python implementation of the import system).
@ -72,7 +82,7 @@ maintainers, as much of the configuration needs to take place prior to the
 safely.

 A number of proposals are on the table for even *more* sophisticated
-startup behaviour, such as better control over ``sys.path`` initialisation
+startup behaviour, such as better control over ``sys.path`` initialization
 (easily adding additional directories on the command line in a cross-platform
 fashion, as well as controlling the configuration of ``sys.path[0]``), easier
 configuration of utilities like coverage tracing when launching Python
@ -96,14 +106,14 @@ Maintainability

 The current CPython startup sequence is difficult to understand, and even
 more difficult to modify. It is not clear what state the interpreter is in
-while much of the initialisation code executes, leading to behaviour such
+while much of the initialization code executes, leading to behaviour such
 as lists, dictionaries and Unicode values being created prior to the call
 to ``Py_Initialize`` when the ``-X`` or ``-W`` options are used [1_].

-By moving to a 2-phase startup sequence, developers should only need to
-understand which features are not available in the core bootstrapping state,
-as the vast majority of the configuration process will now take place in
-that state.
+By moving to an explicitly multi-phase startup sequence, developers should
+only need to understand which features are not available in the core
+bootstrapping state, as the vast majority of the configuration process
+will now take place in that state.

 By basing the new design on a combination of C structures and Python
 dictionaries, it should also be easier to modify the system in the
@ -114,7 +124,7 @@ Performance
 -----------

 CPython is used heavily to run short scripts where the runtime is dominated
-by the interpreter initialisation time. Any changes to the startup sequence
+by the interpreter initialization time. Any changes to the startup sequence
 should minimise their impact on the startup overhead.

 Experience with the importlib migration suggests that the startup time is
@ -141,15 +151,15 @@ builds)::
 Improvements in the import system and the Unicode support already resulted
 in a more than 30% improvement in startup time in Python 3.3 relative to
 3.2. Python 3.3 is still slightly slower to start than Python 2.7 due to the
-additional infrastructure that needs to be put in place to support the Unicode
-based text model.
+additional infrastructure that needs to be put in place to support the
+Unicode based text model.

 This PEP is not expected to have any significant effect on the startup time,
-as it is aimed primarily at *reordering* the existing initialisation
+as it is aimed primarily at *reordering* the existing initialization
 sequence, without making substantial changes to the individual steps.

 However, if this simple check suggests that the proposed changes to the
-initialisation sequence may pose a performance problem, then a more
+initialization sequence may pose a performance problem, then a more
 sophisticated microbenchmark will be developed to assist in investigation.


@ -198,7 +208,7 @@ be able to control the following aspects of the final interpreter state:
  * ``no_site`` (don't implicitly import site during startup)
  * ``ignore_environment`` (whether environment vars are used during config)
  * ``verbose`` (enable all sorts of random output)
-  * ``bytes_warning``
+  * ``bytes_warning`` (warnings/errors for implicit str/bytes interaction)
  * ``quiet`` (disable banner output even if verbose is also enabled or
    stdin is a tty and the interpreter is launched in interactive mode)
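
The flag names in this list can be inspected from a running interpreter. The following is a minimal sketch, assuming the names map directly to attributes of ``sys.flags``; the helper ``current_flags`` is hypothetical and only for illustration.

```python
import sys

# Flag names copied from the list above; each is looked up on sys.flags.
FLAG_NAMES = ["no_site", "ignore_environment", "verbose", "bytes_warning", "quiet"]


def current_flags():
    """Return the listed startup flags of this interpreter as a plain dict."""
    return {name: getattr(sys.flags, name) for name in FLAG_NAMES}


if __name__ == "__main__":
    for name, value in current_flags().items():
        print(f"{name} = {value}")
```

Running the script with options such as ``-S`` or ``-E`` should change the corresponding values.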

@ -219,7 +229,7 @@ be able to control the following aspects of the final interpreter state:
 Note that this just covers settings that are currently configurable in some
 manner when using the main CPython executable. While this PEP aims to make
 adding additional configuration settings easier in the future, it
-deliberately avoids any new settings of its own.
+deliberately avoids adding any new settings of its own.


 The Status Quo
@ -238,13 +248,14 @@ Ignoring Environment Variables
 ------------------------------

 The ``-E`` command line option allows all environment variables to be
-ignored when initialising the Python interpreter. An embedding application
+ignored when initializing the Python interpreter. An embedding application
 can enable this behaviour by setting ``Py_IgnoreEnvironmentFlag`` before
 calling ``Py_Initialize()``.

 In the CPython source code, the ``Py_GETENV`` macro implicitly checks this
 flag, and always produces ``NULL`` if it is set.
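
The effect of the macro can be modelled in a few lines of Python. This is only an illustrative sketch: ``py_getenv`` is a hypothetical stand-in, ``None`` plays the role of ``NULL``, and the optional ``environ`` argument exists purely so the behaviour can be tried without touching the real environment.

```python
import os


def py_getenv(name, ignore_environment, environ=None):
    """Model of the Py_GETENV behaviour described above.

    Returns None (standing in for C's NULL) whenever the ignore flag is set.
    """
    if ignore_environment:
        return None
    env = os.environ if environ is None else environ
    return env.get(name)


fake_env = {"PYTHONVERBOSE": "1"}
print(py_getenv("PYTHONVERBOSE", 0, fake_env))  # value is visible
print(py_getenv("PYTHONVERBOSE", 1, fake_env))  # flag hides the value
```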

+<TBD: I believe PYTHONCASEOK is checked regardless of this setting >
 <TBD: Does -E also ignore Windows registry keys? >


@ -266,7 +277,8 @@ rather than in ``Py_Initialize()``).

 The new configuration API should make it straightforward for an
 embedding application to reuse the ``PYTHONHASHSEED`` processing with
-a text based configuration setting provided by other means.
+a text based configuration setting provided by other means (e.g. a
+config file or separate environment variable).


 Locating Python and the standard library
@ -301,7 +313,7 @@ Configuring ``sys.path``
 An embedding application may call ``Py_SetPath()`` prior to
 ``Py_Initialize()`` to completely override the calculation of
 ``sys.path``. It is not straightforward to only allow *some* of the
-calculations, as modifying ``sys.path`` after initialisation is
+calculations, as modifying ``sys.path`` after initialization is
 already complete means those modifications will not be in effect
 when standard library modules are imported during the startup sequence.

@ -332,10 +344,10 @@ for a given Python executable on a given system:

 (Note: you can see similar information using ``-m site`` instead of ``-c``,
 but this is slightly misleading as it calls ``os.path.abspath`` on all of the
-path entries (making relative path entries look absolute), and also causes
-problems in the last case, as on Python versions prior to 3.3, explicitly
-importing site will carry out the path modifications ``-S`` avoids, while on
-3.3+ combining ``-m site`` with ``-S`` currently fails)
+path entries, making relative path entries look absolute. Using the ``site``
+module also causes problems in the last case, as on Python versions prior to
+3.3, explicitly importing site will carry out the path modifications ``-S``
+avoids, while on 3.3+ combining ``-m site`` with ``-S`` currently fails)
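
To see why absolutized output can mislead, the short sketch below applies ``os.path.abspath`` to a few sample entries. The entries and the helper ``absolutize`` are illustrative assumptions, not taken from any real ``sys.path``.

```python
import os


def absolutize(entries):
    """Apply os.path.abspath to every entry, hiding which ones were relative."""
    return [os.path.abspath(entry) for entry in entries]


# An empty string and a bare directory name are both relative entries.
raw_entries = ["", "lib", os.path.join(os.sep, "usr", "lib")]
for raw, shown in zip(raw_entries, absolutize(raw_entries)):
    print(repr(raw), "->", repr(shown))
```

After the conversion, all three entries look absolute, so the reader can no longer tell that the first two depend on the current working directory.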

 The calculation of ``sys.path[0]`` is comparatively straightforward:

@ -386,7 +398,7 @@ However, the ``runpy`` module does provide roughly equivalent logic in
 Other configuration settings
 ----------------------------

-TBD: Cover the initialisation of the following in more detail:
+TBD: Cover the initialization of the following in more detail:

 * The initial warning system state:
  * ``sys.warnoptions``
@ -419,7 +431,7 @@ TBD: Cover the initialisation of the following in more detail:
  * ``no_site`` (don't implicitly import site during startup)
  * ``ignore_environment`` (whether environment vars are used during config)
  * ``verbose`` (enable all sorts of random output)
-  * ``bytes_warning`` (This may be obsolete in Py3k...)
+  * ``bytes_warning`` (warnings/errors for implicit str/bytes interaction)
  * ``quiet`` (disable banner output even if verbose is also enabled or
    stdin is a tty and the interpreter is launched in interactive mode)

@ -428,15 +440,15 @@ TBD: Cover the initialisation of the following in more detail:
 Much of the configuration of CPython is currently handled through C level
 global variables::

-    Py_BytesWarningFlag
+    Py_BytesWarningFlag (-b)
    Py_DebugFlag (-d option)
    Py_InspectFlag (-i option, PYTHONINSPECT)
-    Py_InteractiveFlag
+    Py_InteractiveFlag (property of stdin, cannot be overridden)
    Py_OptimizeFlag (-O option, PYTHONOPTIMIZE)
    Py_DontWriteBytecodeFlag (-B option, PYTHONDONTWRITEBYTECODE)
    Py_NoUserSiteDirectory (-s option, PYTHONNOUSERSITE)
    Py_NoSiteFlag (-S option)
-    Py_UnbufferedStdioFlag
+    Py_UnbufferedStdioFlag (-u, PYTHONUNBUFFERED)

    Py_VerboseFlag (-v option, PYTHONVERBOSE)

 For the above variables, the conversion of command line options and
@ -463,34 +475,63 @@ first comment line in the main script)
 Also see detailed sequence of operations notes at [1_]


-Proposal
-========
+Design Details
+==============

 (Note: details here are still very much in flux, but preliminary feedback
 is appreciated anyway)

 The main theme of this proposal is to create the interpreter state for
 the main interpreter *much* earlier in the startup process. This will allow
-most of the CPython API to be used during the remainder of the initialisation
+most of the CPython API to be used during the remainder of the initialization
 process, potentially simplifying a number of operations that currently need
 to rely on basic C functionality rather than being able to use the richer
 data structures provided by the CPython C API.

+In the following, the term "embedding application" also covers the standard
+CPython command line application.

-Core Interpreter Initialisation
-------------------------------

-The only configuration that currently absolutely needs to be in place
-before even the interpreter core can be initialised is a flag indicating
-whether or not to use a specific seed value for the randomised hashes, and
-if so, the specific value for the seed (a seed value of zero disables
-randomised hashing).
+Startup Phases
+--------------
+
+Four distinct phases are proposed:
+
+* Pre-Initialization - no interpreter is available. Embedding application
+  determines the settings required to create the core interpreter and
+  moves to the next phase by calling ``Py_BeginInitialization``.
+* Initialization - a limited interpreter is available. Embedding application
+  determines and applies the settings required to complete the initialization
+  process by calling ``Py_ReadConfiguration`` and ``Py_EndInitialization``.
+* Pre-Main - the full interpreter is available, but ``__main__`` related
+  metadata is incomplete.
+* Main Execution - normal interpreter operation
+
+All 4 phases will be used by the standard CPython interpreter and the
+proposed System Python interpreter. Other embedding applications may
+choose to skip the step of executing code in the ``__main__`` module.
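
The phase ordering can be pictured as a small state machine. The sketch below is a toy model in Python, not a CPython API: the class and method names are hypothetical, and raising ``RuntimeError`` on an out-of-order call is an assumption standing in for the fatal error the text describes for a repeated ``Py_BeginInitialization`` call.

```python
from enum import Enum


class Phase(Enum):
    PRE_INITIALIZATION = 1  # no interpreter available
    INITIALIZATION = 2      # limited interpreter available
    PRE_MAIN = 3            # full interpreter, __main__ metadata incomplete
    MAIN_EXECUTION = 4      # normal interpreter operation


class StartupModel:
    """Toy model of the four startup phases and their transitions."""

    def __init__(self):
        self.phase = Phase.PRE_INITIALIZATION

    def _advance(self, expected, new):
        # Assumption: any out-of-order transition is rejected outright.
        if self.phase is not expected:
            raise RuntimeError(f"cannot enter {new.name} from {self.phase.name}")
        self.phase = new

    def begin_initialization(self):
        self._advance(Phase.PRE_INITIALIZATION, Phase.INITIALIZATION)

    def end_initialization(self):
        self._advance(Phase.INITIALIZATION, Phase.PRE_MAIN)

    def run_main(self):
        self._advance(Phase.PRE_MAIN, Phase.MAIN_EXECUTION)

    def is_initializing(self):
        return self.phase is Phase.INITIALIZATION

    def is_initialized(self):
        return self.phase in (Phase.PRE_MAIN, Phase.MAIN_EXECUTION)
```

An embedding application that skips ``__main__`` execution would, in this model, simply never call ``run_main``.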
+
+Pre-Initialization Phase
+------------------------
+
+The pre-initialization phase is where an embedding application determines
+the settings which are absolutely required before the interpreter can be
+initialized at all. Currently, the only configuration settings in this
+category are those related to the randomised hash algorithm - the hash
+algorithms must be consistent for the lifetime of the process, and so they
+must be in place before the core interpreter is created.
+
+The specific settings needed are a flag indicating whether or not to use a
+specific seed value for the randomised hashes, and if so, the specific value
+for the seed (a seed value of zero disables randomised hashing). In addition,
+the question of whether or not to consider environment variables must be
+addressed early.

 The proposed API for this step in the startup sequence is::

    void Py_BeginInitialization(Py_CoreConfig *config);

-Like Py_Initialize, this part of the new API treats initialisation failures
+Like Py_Initialize, this part of the new API treats initialization failures
 as fatal errors. While that's still not particularly embedding friendly,
 the operations in this step *really* shouldn't be failing, and changing them
 to return error codes instead of aborting would be an even larger task than
@ -500,16 +541,50 @@ The new Py_CoreConfig struct holds the settings required for preliminary
 configuration::

    typedef struct {
+        int ignore_environment;
        int use_hash_seed;
        unsigned long hash_seed;
    } Py_CoreConfig;

-To disable hash randomisation, set "use_hash_seed" and pass a hash seed of
-zero. (This is the same approach already used when interpreting the
-``PYTHONHASHSEED`` environment variable)
+The core configuration settings pointer may be ``NULL``, in which case the
+default values are ``ignore_environment = 0`` and ``use_hash_seed = -1``.

-The core configuration settings pointer may be NULL, in which case the
-default behaviour of randomised hashes with a random seed will be used.
+``ignore_environment`` controls the processing of all Python related
+environment variables. If the flag is zero, then environment variables are
+processed normally. Otherwise, all Python-specific environment variables
+are considered undefined (exceptions may be made for some OS specific
+environment variables, such as those used on Mac OS X to communicate
+between the App bundle and the main Python binary).
+
+``use_hash_seed`` controls the configuration of the randomised hash
+algorithm. If it is zero, then randomised hashes with a random seed will
+be used. If it is positive, then the value in ``hash_seed`` will be used
+to seed the random number generator. If the ``hash_seed`` is zero in this
+case, then the randomised hashing is disabled completely.
+
+If ``use_hash_seed`` is negative (and ``ignore_environment`` is zero),
+then CPython will inspect the ``PYTHONHASHSEED`` environment variable. If it
+is not set, is set to the empty string, or to the value ``"random"``, then
+randomised hashes with a random seed will be used. If it is set to the string
+``"0"`` the randomised hashing will be disabled. Otherwise, the hash seed is
+expected to be a string representation of an integer in the range
+``[0; 4294967295]``.
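
The precedence between the struct fields and the environment variable can be sketched in Python as follows. This is a model of the rules as described, not the implementation: ``resolve_hash_seed`` and its return format are hypothetical, and falling back to a random seed when ``use_hash_seed`` is negative but the environment is ignored is an assumption, since the text does not spell out that case.

```python
import re

MAX_SEED = 4294967295  # upper bound of the documented seed range


def resolve_hash_seed(use_hash_seed=-1, hash_seed=0,
                      ignore_environment=0, environ=None):
    """Return (mode, seed) with mode in {'random', 'disabled', 'fixed'}."""
    environ = {} if environ is None else environ
    if use_hash_seed > 0:
        # Explicit seed supplied by the embedding application.
        return ("disabled", 0) if hash_seed == 0 else ("fixed", hash_seed)
    if use_hash_seed == 0 or ignore_environment:
        # Assumption: negative flag plus ignored environment means random.
        return ("random", None)
    text = environ.get("PYTHONHASHSEED")
    if text is None or text in ("", "random"):
        return ("random", None)
    if not re.fullmatch(r"[0-9]+", text) or int(text) > MAX_SEED:
        raise ValueError(f"invalid PYTHONHASHSEED: {text!r}")
    seed = int(text)
    return ("disabled", 0) if seed == 0 else ("fixed", seed)
```

The defaults of the function mirror the defaults stated for a ``NULL`` configuration pointer.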
+
+To make it easier for embedding applications to use the ``PYTHONHASHSEED``
+processing with a different data source, the following helper function
+will be added to the C API::
+
+    int Py_ReadHashSeed(char *seed_text,
+                        int *use_hash_seed,
+                        unsigned long *hash_seed);
+
+This function accepts a seed string in ``seed_text`` and converts it to
+the appropriate flag and seed values. If ``seed_text`` is ``NULL``,
+the empty string or the value ``"random"``, both ``use_hash_seed`` and
+``hash_seed`` will be set to zero. Otherwise, ``use_hash_seed`` will be set to
+``1`` and the seed text will be interpreted as an integer and reported as
+``hash_seed``. On success the function will return zero. A non-zero return
+value indicates an error (most likely in the conversion to an integer).
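
The conversion rules for this helper can be modelled in Python. The sketch below is illustrative only: ``read_hash_seed`` is a hypothetical stand-in that returns a ``(status, use_hash_seed, hash_seed)`` tuple instead of writing through pointers. Rejecting values above 4294967295 and reporting zeros on failure are assumptions, as the description does not cover either point.

```python
import re

MAX_SEED = 4294967295  # assumed upper bound, taken from the PYTHONHASHSEED range


def read_hash_seed(seed_text):
    """Return (status, use_hash_seed, hash_seed); status 0 means success."""
    if seed_text is None or seed_text in ("", "random"):
        # Random seeding requested: both outputs are zero.
        return (0, 0, 0)
    if not re.fullmatch(r"[0-9]+", seed_text):
        return (-1, 0, 0)  # not a plain decimal integer
    seed = int(seed_text)
    if seed > MAX_SEED:
        return (-1, 0, 0)  # assumption: out-of-range seeds are errors
    return (0, 1, seed)


for text in (None, "random", "0", "42", "not-a-number"):
    print(repr(text), "->", read_hash_seed(text))
```

An embedding application could feed this kind of helper from a config file entry and copy the two output values into its core configuration.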

 The aim is to keep this initial level of configuration as small as possible
 in order to keep the bootstrapping environment consistent across
@ -519,14 +594,14 @@ to ``Py_EndInitialization()`` rather than in the core configuration.

 A new query API will allow code to determine if the interpreter is in the
 bootstrapping state between the creation of the interpreter state and the
-completion of the bulk of the initialisation process::
+completion of the bulk of the initialization process::

    int Py_IsInitializing();

 Attempting to call ``Py_BeginInitialization()`` again when
 ``Py_IsInitializing()`` or ``Py_IsInitialized()`` is true is a fatal error.

-While in the initialising state, the interpreter should be fully functional
+While in the initializing state, the interpreter should be fully functional
 except that:

 * compilation is not allowed (as the parser and compiler are not yet
@ -551,7 +626,7 @@ except that:
 * only builtin and frozen modules may be imported (due to above limitations)
 * ``sys.stderr`` is set to a temporary IO object using unbuffered binary
  mode
-* The ``warnings`` module is not yet initialised
+* The ``warnings`` module is not yet initialized
 * The ``__main__`` module does not yet exist

 <TBD: identify any other notable missing functionality>
@ -573,7 +648,7 @@ between (e.g. if attempting to read the configuration settings fails)
 Determining the remaining configuration settings
 ------------------------------------------------

-The next step in the initialisation sequence is to determine the full
+The next step in the initialization sequence is to determine the full
 settings needed to complete the process. No changes are made to the
 interpreter state at this point. The core API for this step is::

@ -630,11 +705,12 @@ At least the following configuration settings will be supported::
    <TBD: at least more from sys.flags need to go here>


-Completing the interpreter initialisation
+Completing the interpreter initialization
 -----------------------------------------

-The final step in the process is to actually put the configuration settings
-into effect and finish bootstrapping the interpreter up to full operation::
+The final step in the initialization process is to actually put the
+configuration settings into effect and finish bootstrapping the interpreter
+up to full operation::

    int Py_EndInitialization(PyObject *config);

@ -648,7 +724,48 @@ is fully populated.

 After a successful call, Py_IsInitializing() will be false, while
 Py_IsInitialized() will become true. The caveats described above for the
-interpreter during the initialisation phase will no longer hold.
+interpreter during the initialization phase will no longer hold.
+
+However, some metadata related to the ``__main__`` module may still be
+incomplete:
+
+* ``sys.argv[0]`` may not yet have its final value
+  * it will be ``-m`` when executing a module or package with CPython
+  * it will be the same as ``sys.path[0]`` rather than the location of
+    the ``__main__`` module when executing a valid ``sys.path`` entry
+    (typically a zipfile or directory)
+* the metadata in the ``__main__`` module will still indicate it is a
+  builtin module
+
+
+Executing the main module
+-------------------------
+
+<TBD>
+
+Initial thought is that hiding the various options behind a single API
+would make that API too complicated, so 3 separate APIs is more likely::
+
+    Py_RunPathAsMain
+    Py_RunModuleAsMain
+    Py_RunStreamAsMain
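
As a rough Python-level analogy for two of these entry points, the standard ``runpy`` module already separates running a filesystem path from running a module by name. The sketch below is not the proposed C API: the wrapper names are hypothetical, and the mapping from ``Py_RunPathAsMain`` and ``Py_RunModuleAsMain`` to ``runpy`` is an assumption made for illustration.

```python
import os
import runpy
import tempfile


def run_path_as_main(path):
    """Run the file at ``path`` as ``__main__`` and return its globals."""
    return runpy.run_path(path, run_name="__main__")


def run_module_as_main(module_name):
    """Locate ``module_name`` on sys.path and run it as ``__main__``."""
    return runpy.run_module(module_name, run_name="__main__", alter_sys=True)


# Demonstrate the path variant with a throwaway script.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo.py")
    with open(script, "w") as f:
        f.write("name = __name__\nanswer = 6 * 7\n")
    result = run_path_as_main(script)

print(result["name"], result["answer"])
```

Keeping the two cases as separate functions mirrors the reasoning above: each one needs different inputs, so a single combined entry point would have to carry options that only apply to one mode.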
+
+
+Internal Storage of Configuration Data
+--------------------------------------
+
+The interpreter state will be updated to include details of the configuration
+settings supplied during initialization by extending the interpreter state
+object with an embedded copy of the ``Py_CoreConfig`` struct and an
+additional ``PyObject`` pointer to hold a reference to a copy of the
+supplied configuration dictionary.
+
+For debugging purposes, the copied configuration dictionary will be
+exposed as ``sys._configuration``. It will include additional keys for
+the fields in the ``Py_CoreConfig`` struct.
+
+These are *snapshots* of the initial configuration settings. They are not
+consulted by the interpreter during runtime.


 Stable ABI
@ -670,7 +787,7 @@ locations.

 One acknowledged incompatiblity is that some environment variables which
 are currently read lazily may instead be read once during interpreter
-initialisation. As the PEP matures, these will be discussed in more detail
+initialization. As the PEP matures, these will be discussed in more detail
 on a case by case basis. The environment variables which are currently
 known to be looked up dynamically are:

@ -680,7 +797,7 @@ known to be looked up dynamically are:
 * ``PYTHONINSPECT``: ``os.environ['PYTHONINSPECT']`` will still be checked
  after execution of the ``__main__`` module terminates

-The ``Py_Initialize()`` style of initialisation will continue to be
+The ``Py_Initialize()`` style of initialization will continue to be
 supported. It will use (at least some elements of) the new API
 internally, but will continue to exhibit the same behaviour as it
 does today, ensuring that ``sys.argv`` is not populated until a subsequent
@ -691,7 +808,7 @@ continue to do so, and will also support being called prior to

 To minimise unnecessary code churn, and to ensure the backwards compatibility
 is well tested, the main CPython executable may continue to use some elements
-of the old style initialisation API. (very much TBC)
+of the old style initialization API. (very much TBC)


 A System Python Executable
@ -712,8 +829,8 @@ application to make use of key components of ``Py_Main``. Including this
 change in the PEP is designed to help avoid acceptance of a design that
 sounds good in theory but proves to be problematic in practice.

-One final aspect not addressed by the general embedding changes above is
-the current inaccessibility of the core logic for deciding between the
+Better supporting this kind of "alternate CLI" is the main reason for the
+proposed changes to better expose the core logic for deciding between the
 different execution modes supported by CPython:

 * script execution
@ -723,7 +840,6 @@ different execution modes supported by CPython:
 * execution from stdin (non-interactive)
 * interactive stdin

-<TBD: concrete proposal for better exposing the __main__ execution step>

 Implementation
 ==============