PEP 657: Update the public API and the opt-out mechanism (#1959)

2021-05-12 21:33:09 +01:00 · 2021-05-12 21:33:09 +01:00 · 27c75f67c4
parent bcf1f22b20
commit 27c75f67c4
1 changed files with 74 additions and 61 deletions
--- a/pep-0657.rst
+++ b/pep-0657.rst
@@ -17,16 +17,17 @@ Abstract
 ========

 This PEP proposes adding a mapping from each bytecode instruction to the start
-and end column offsets of the line that generated them. This data will be used
-to improve tracebacks displayed by the CPython interpreter in order to improve
-the debugging experience. The PEP also proposes adding APIs that allow other
-tools (such as coverage analysis tools, profilers, tracers, debuggers) to
-consume this information from code objects.
+and end column offsets of the line that generated them as well as the end line
+number. This data will be used to improve tracebacks displayed by the CPython
+interpreter in order to improve the debugging experience. The PEP also proposes
+adding APIs that allow other tools (such as coverage analysis tools, profilers,
+tracers, debuggers) to consume this information from code objects.

 Motivation
 ==========

-The primary motivation for this PEP is to improve the feedback presented about the location of errors to aid with debugging.
+The primary motivation for this PEP is to improve the feedback presented about
+the location of errors to aid with debugging.

 Python currently keeps a mapping of bytecode to line numbers from compilation.
 The interpreter uses this mapping to point to the source line associated with
@@ -150,51 +151,55 @@ instruction. This will have an impact on the size of ``pyc`` files on disk and
 the size of code objects in memory. The authors of this proposal have chosen
 the data types in a way that tries to minimize this impact. The proposed
 overhead is storing two ``uint8_t`` (one for the start offset and one for the
-end offset) for every bytecode instruction.
+end offset) and the end line information for every bytecode instruction (in
+the same encoded fashion as the start line is stored currently).

 As an illustrative example to gauge the impact of this change, we have
-calculated that this change will increase the size of the standard library’s
-pyc files by 22% (6MB) from 28.4MB to 34.7MB. The overhead in memory usage will be
-the same (assuming the *full standard library* is loaded into the same
-program). We believe that this is a very acceptable number since the order of
-magnitude of the overhead is very small, especially considering the storage
-size and memory capabilities of modern computers. Additionally, in general the
-memory size of a Python program is not dominated by code objects. To check this
-assumption we have executed the test suite of several popular PyPI projects
-(including NumPy, pytest, Django and Cython) as well as several applications
-(Black, pylint, mypy executed over either mypy or the standard library) and we
-found that code objects represent normally 3-6% of the average memory size of
-the program.
+calculated that including the start and end offsets will increase the size of
+the standard library’s pyc files by 22% (6MB) from 28.4MB to 34.7MB. The
+overhead in memory usage will be the same (assuming the *full standard library*
+is loaded into the same program). We believe that this is a very acceptable
+number since the order of magnitude of the overhead is very small, especially
+considering the storage size and memory capabilities of modern computers.
+Additionally, in general the memory size of a Python program is not dominated
+by code objects. To check this assumption we have executed the test suite of
+several popular PyPI projects (including NumPy, pytest, Django and Cython) as
+well as several applications (Black, pylint, mypy executed over either mypy or
+the standard library) and we found that code objects normally represent 3-6% of
+the average memory size of the program.

 We understand that the extra cost of this information may not be acceptable for
-some users, so we propose an opt-out mechanism when Python is executed in
-"opt-2" optimized mode (``python -OO``), which will cause pyc files to not include
-the extra information.
+some users, so we propose an opt-out mechanism which will cause generated code
+objects to not have the extra information while also allowing pyc files to not
+include the extra information.


 Specification
 =============

-In order to have enough information to correctly resolve the location within a
-given line where an error was raised, a map linking bytecode instructions and
-column offsets (start and end offset) is needed. This is similar in fashion to
-how line numbers are currently linked to bytecode instructions.
+In order to have enough information to correctly resolve the location
+within a given line where an error was raised, a map linking bytecode
+instructions to column offsets (start and end offset) and end line numbers
+is needed. This is similar in fashion to how line numbers are currently linked
+to bytecode instructions.

-The following changes will be performed as part of the implementation of this PEP:
+The following changes will be performed as part of the implementation of
+this PEP:

 * The offset information will be exposed to Python via a new attribute in the
-  code object class called ``co_col_offsets`` that will return a sequence of
-  two-element tuples (containing the start offsets and end offsets) or None if
-  the code object was created without the offset information. 
-* Two new C-API functions, ``PyCode_Addr2StartOffset`` and
-  ``PyCode_Addr2EndOffset`` will be added that can obtain the start and end
-  offsets respectively given the index of a bytecode instruction. These
-  functions will return 0 if the offset information is not available. 
-* A new private (underscore prefixed) C-API constructor for code objects will
-  be added that takes a bytes object containing the start offsets in the even
-  position and the end offsets in the odd positions. Old constructors will be
-  left untouched for backwards compatibility and will create code objects
-  without the new field.
+  code object class called ``co_positions`` that will return a sequence of
+  four-element tuples containing the full location of every instruction
+  (including start line, end line, start column offset and end column offset)
+  or ``None`` if the code object was created without the offset information.
+* Three new C-API functions, ``PyCode_Addr2EndLine``, ``PyCode_Addr2StartOffset``
+  and ``PyCode_Addr2EndOffset``, will be added that can obtain the end line, the
+  start column offset and the end column offset respectively given the index
+  of a bytecode instruction. These functions will return 0 if the information
+  is not available.
+
+The internal storage, compression and encoding of the information is left as an
+implementation detail and can be changed at any point as long as the public API
+remains unchanged.
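
As a minimal sketch of how a tool might consume this API from Python (an illustration, not part of the patch): the helper name ``instruction_positions`` is hypothetical. The text above describes ``co_positions`` as an attribute, while CPython 3.11 and later expose it as a method, ``code.co_positions()``. The sketch therefore accepts both forms and returns ``None`` on interpreters that have neither.

```python
def instruction_positions(code):
    """Return a list of (start_line, end_line, start_col, end_col) tuples,
    one per instruction, or None if no position data is exposed."""
    positions = getattr(code, "co_positions", None)
    if positions is None:
        # Older interpreter, or a code object created without the data.
        return None
    if callable(positions):
        # CPython 3.11+ exposes co_positions as a method returning an iterator.
        positions = positions()
    return list(positions)


code = compile("x = 1 / 0", "<demo>", "exec")
positions = instruction_positions(code)
if positions is None:
    print("no position information available")
else:
    for start_line, end_line, start_col, end_col in positions:
        print(start_line, end_line, start_col, end_col)
```

Individual fields may themselves be ``None`` when the interpreter has no data for a given instruction, so consumers should check each value before using it.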

 Offset semantics
 ^^^^^^^^^^^^^^^^
@@ -209,14 +214,12 @@ We believe this is an acceptable compromise as line lengths in Python tend to
 be much lower than this limit (a query of the top 100 packages in PyPI shows
 that less than 0.01% of lines were longer than 255 characters).

-Maintaining the current behavior, only a single line will be displayed in
-tracebacks. For instructions that span multiple lines (the end offset and the
-start offset belong to different lines), the end offset will be set to 0
-(meaning it is unavailable). If the start offset is not 0, this will be
-interpreted by the displaying code as if the range spans from the starting
-offset to the end of the line. The actual end offset cannot be calculated at
-compile time since the compiler does not know how many characters “the end of
-the line” actually represents.
+As specified previously, the underlying storage of the offsets should be
+considered an implementation detail, as the public APIs to obtain these values
+will return either C ``int`` types or Python ``int`` objects. This allows
+better compression/encoding to be implemented in the future if bigger ranges
+need to be supported. This PEP proposes to start with this simpler version and
+defer improvements to future work.

 Displaying tracebacks
 ^^^^^^^^^^^^^^^^^^^^^
@@ -294,27 +297,37 @@ Will be displayed as::
            ^^^
    ZeroDivisionError: division by zero

+Maintaining the current behavior, only a single line will be displayed
+in tracebacks. For instructions that span multiple lines (the end offset
+and the start offset belong to different lines), the end line number must
+be inspected to know if the end offset applies to the same line as the
+starting offset.
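
The check described above can be sketched as follows (an illustration, not part of the patch). The function names are hypothetical, and two things are assumptions: offsets are treated as plain character indexes with an exclusive end, and an instruction that continues onto a later line is underlined up to the end of the displayed line.

```python
def caret_span(line_text, start_line, end_line, start_col, end_col):
    """Return the (start, end) columns to underline on the displayed line."""
    if end_line != start_line:
        # Assumption: the end offset belongs to a later line, so it cannot be
        # applied here; underline to the end of the visible line instead.
        end_col = len(line_text.rstrip())
    return start_col, end_col


def underline(line_text, start_line, end_line, start_col, end_col):
    """Build the caret line that would be printed under line_text."""
    start, end = caret_span(line_text, start_line, end_line, start_col, end_col)
    return " " * start + "^" * (end - start)


source_line = "    x = 1 / 0"
print(source_line)
print(underline(source_line, 1, 1, 8, 13))
```

Without the end line number, a displaying tool could not tell whether an end offset smaller than the start offset is an error or simply refers to a different line.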

 Opt-out mechanism
 ^^^^^^^^^^^^^^^^^

-To offer an opt-out mechanism for those users that care about the storage and
-memory overhead, the functionality will be deactivated along with the extra
-information when Python is executed in "opt-2" optimized mode (``python -OO``)
-resulting in ``pyc`` files not having the overhead associated with the extra
-required data.
+To offer an opt-out mechanism for those users that care about the
+storage and memory overhead and to allow third party tools and other
+programs that are currently parsing tracebacks to catch up, the following
+methods will be provided to deactivate this feature:

-To allow third party tools and other programs that are currently parsing
-tracebacks to catch up and to allow users to deactivate the new feature, the
-following methods will be provided to deactivate displaying the new highlight
-carets (but not to avoid to storing the data, users will need to use Python in
-"opt-2" optimized mode for that):
+* A new environment variable: ``PYNODEBUGRANGES``.
+* A new command line option for the dev mode: ``python -Xnodebugranges``.

-* A new environment variable: ``PY_DEACTIVATE_TRACEBACK_RANGES``
-* A new command line option for the dev mode: ``python -Xnotracebackranges``.
+If either of these methods is used, the Python compiler will **not** populate
+code objects with the new information (``None`` will be used instead) and any
+unmarshalled code objects that contain the extra information will have it stripped
+away and replaced with ``None``. Either method allows users to:

-These flags will be removed in the next version of the Python interpreter
-(counting from the version that releases this feature).
+* Create smaller ``pyc`` files by using one of the two methods when said files
+  are created.
+* Avoid loading the extra information from ``pyc`` files that were created with
+  the extra information in the first place.
+
+Doing this has a **very small** performance hit as the interpreter state needs
+to be fetched when code objects are created to look up the configuration.
+Creating code objects is not a performance sensitive operation so this should
+not be a concern.
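
The behaviour described above can be modelled in a few lines of Python (an illustration, not part of the patch, and not how the interpreter implements it). The function names are hypothetical; the variable and option names are copied as spelled in this revision, and treating any non-empty value of the environment variable as "set" is an assumption.

```python
def debug_ranges_disabled(environ, xoptions):
    """Return True if either opt-out method from this revision is in use."""
    # Assumption: a non-empty value counts as "set".
    env_set = bool(environ.get("PYNODEBUGRANGES"))
    return env_set or "nodebugranges" in xoptions


def effective_positions(stored_positions, environ, xoptions):
    """Model what a code object would expose after the opt-out is applied."""
    if debug_ranges_disabled(environ, xoptions):
        # Newly compiled and unmarshalled code objects both end up with None.
        return None
    return stored_positions


stored = [(1, 1, 4, 9)]
print(effective_positions(stored, {}, {}))
print(effective_positions(stored, {"PYNODEBUGRANGES": "1"}, {}))
print(effective_positions(stored, {}, {"nodebugranges": True}))
```

In a running CPython process the two mappings would correspond roughly to ``os.environ`` and ``sys._xoptions``; they are passed in explicitly here so the sketch stays self-contained.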

 Backwards Compatibility
 =======================