PEP 657: Update the public API and the opt-out mechanism (#1959)
parent bcf1f22b20
commit 27c75f67c4
135
pep-0657.rst
@@ -17,16 +17,17 @@ Abstract
 ========

 This PEP proposes adding a mapping from each bytecode instruction to the start
-and end column offsets of the line that generated them. This data will be used
-to improve tracebacks displayed by the CPython interpreter in order to improve
-the debugging experience. The PEP also proposes adding APIs that allow other
-tools (such as coverage analysis tools, profilers, tracers, debuggers) to
-consume this information from code objects.
+and end column offsets of the line that generated them as well as the end line
+number. This data will be used to improve tracebacks displayed by the CPython
+interpreter in order to improve the debugging experience. The PEP also proposes
+adding APIs that allow other tools (such as coverage analysis tools, profilers,
+tracers, debuggers) to consume this information from code objects.

 Motivation
 ==========

-The primary motivation for this PEP is to improve the feedback presented about the location of errors to aid with debugging.
+The primary motivation for this PEP is to improve the feedback presented about
+the location of errors to aid with debugging.

 Python currently keeps a mapping of bytecode to line numbers from compilation.
 The interpreter uses this mapping to point to the source line associated with
@@ -150,51 +151,55 @@ instruction. This will have an impact on the size of ``pyc`` files on disk and
 the size of code objects in memory. The authors of this proposal have chosen
 the data types in a way that tries to minimize this impact. The proposed
 overhead is storing two ``uint8_t`` (one for the start offset and one for the
-end offset) for every bytecode instruction.
+end offset) and the end line information for every bytecode instruction (in
+the same encoded fashion as the start line is stored currently).

 As an illustrative example to gauge the impact of this change, we have
-calculated that this change will increase the size of the standard library’s
-pyc files by 22% (6MB) from 28.4MB to 34.7MB. The overhead in memory usage will be
-the same (assuming the *full standard library* is loaded into the same
-program). We believe that this is a very acceptable number since the order of
-magnitude of the overhead is very small, especially considering the storage
-size and memory capabilities of modern computers. Additionally, in general the
-memory size of a Python program is not dominated by code objects. To check this
-assumption we have executed the test suite of several popular PyPI projects
-(including NumPy, pytest, Django and Cython) as well as several applications
-(Black, pylint, mypy executed over either mypy or the standard library) and we
-found that code objects represent normally 3-6% of the average memory size of
-the program.
+calculated that including the start and end offsets will increase the size of
+the standard library’s pyc files by 22% (6MB) from 28.4MB to 34.7MB. The
+overhead in memory usage will be the same (assuming the *full standard library*
+is loaded into the same program). We believe that this is a very acceptable
+number since the order of magnitude of the overhead is very small, especially
+considering the storage size and memory capabilities of modern computers.
+Additionally, in general the memory size of a Python program is not dominated
+by code objects. To check this assumption we have executed the test suite of
+several popular PyPI projects (including NumPy, pytest, Django and Cython) as
+well as several applications (Black, pylint, mypy executed over either mypy or
+the standard library) and we found that code objects represent normally 3-6% of
+the average memory size of the program.

 We understand that the extra cost of this information may not be acceptable for
-some users, so we propose an opt-out mechanism when Python is executed in
-"opt-2" optimized mode (``python -OO``), which will cause pyc files to not include
-the extra information.
+some users, so we propose an opt-out mechanism which will cause generated code
+objects to not have the extra information while also allowing pyc files to not
+include the extra information.


 Specification
 =============

-In order to have enough information to correctly resolve the location within a
-given line where an error was raised, a map linking bytecode instructions and
-column offsets (start and end offset) is needed. This is similar in fashion to
-how line numbers are currently linked to bytecode instructions.
+In order to have enough information to correctly resolve the location
+within a given line where an error was raised, a map linking bytecode
+instructions to column offsets (start and end offset) and end line numbers
+is needed. This is similar in fashion to how line numbers are currently linked
+to bytecode instructions.

-The following changes will be performed as part of the implementation of this PEP:
+The following changes will be performed as part of the implementation of
+this PEP:

 * The offset information will be exposed to Python via a new attribute in the
-  code object class called ``co_col_offsets`` that will return a sequence of
-  two-element tuples (containing the start offsets and end offsets) or None if
-  the code object was created without the offset information.
-* Two new C-API functions, ``PyCode_Addr2StartOffset`` and
-  ``PyCode_Addr2EndOffset`` will be added that can obtain the start and end
-  offsets respectively given the index of a bytecode instruction. These
-  functions will return 0 if the offset information is not available.
-* A new private (underscore prefixed) C-API constructor for code objects will
-  be added that takes a bytes object containing the start offsets in the even
-  position and the end offsets in the odd positions. Old constructors will be
-  left untouched for backwards compatibility and will create code objects
-  without the new field.
+  code object class called ``co_positions`` that will return a sequence of
+  four-element tuples containing the full location of every instruction
+  (including start line, end line, start column offset and end column offset)
+  or ``None`` if the code object was created without the offset information.
+* Three new C-API functions, ``PyCode_Addr2EndLine``, ``PyCode_Addr2StartOffset``
+  and ``PyCode_Addr2EndOffset`` will be added that can obtain the end line, the
+  start column offsets and the end column offset respectively given the index
+  of a bytecode instruction. These functions will return 0 if the information
+  is not available.

+The internal storage, compression and encoding of the information is left as an
+implementation detail and can be changed at any point as long as the public API
+remains unchanged.
+
 Offset semantics
 ^^^^^^^^^^^^^^^^
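The ``co_positions`` API introduced by this hunk can be exercised with a short sketch. It rests on one assumption beyond the diff: in CPython 3.11, where the feature shipped, ``co_positions`` is a method returning an iterator, not the attribute described at this commit, so the code accepts either form and returns an empty list when no position data is present. The helper names ``instruction_positions`` and ``spans_multiple_lines`` are hypothetical.

```python
def instruction_positions(code):
    """Return a list of (start_line, end_line, start_col, end_col) tuples.

    Handles three cases: no ``co_positions`` at all (older interpreters),
    the attribute form described in this commit (a sequence or None),
    and the method form of released CPython 3.11+.
    """
    positions = getattr(code, "co_positions", None)
    if positions is None:
        return []
    if callable(positions):
        positions = positions()
    return list(positions)


def spans_multiple_lines(position):
    """True when an instruction ends on a different line than it starts."""
    start_line, end_line, _start_col, _end_col = position
    if start_line is None or end_line is None:
        return False
    return end_line != start_line


def divide(a, b):
    return a / b


for entry in instruction_positions(divide.__code__):
    print(entry, spans_multiple_lines(entry))
```

A traceback printer would use a check like ``spans_multiple_lines`` to decide whether the end column offset applies to the line being displayed.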
@@ -209,14 +214,12 @@ We believe this is an acceptable compromise as line lengths in Python tend to
 be much lower than this limit (a query of the top 100 packages in PyPI shows
 that less than 0.01% of lines were longer than 255 characters).

-Maintaining the current behavior, only a single line will be displayed in
-tracebacks. For instructions that span multiple lines (the end offset and the
-start offset belong to different lines), the end offset will be set to 0
-(meaning it is unavailable). If the start offset is not 0, this will be
-interpreted by the displaying code as if the range spans from the starting
-offset to the end of the line. The actual end offset cannot be calculated at
-compile time since the compiler does not know how many characters “the end of
-the line” actually represents.
+As specified previously, the underlying storage of the offsets should be
+considered an implementation detail, as the public APIs to obtain this values
+will return either C ``int`` types or Python ``int`` objects, which allows to
+implement better compression/encoding in the future if bigger ranges would need
+to be supported. This PEP proposes to start with this simpler version and
+defer improvements to future work.

 Displaying tracebacks
 ^^^^^^^^^^^^^^^^^^^^^
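The 255-character claim in the context lines of this hunk can be checked against any source text with a few lines of Python. This is only a sketch: ``fraction_of_long_lines`` is a hypothetical helper, and it counts characters as the PEP text does, which is an assumption about how the original query measured line length.

```python
def fraction_of_long_lines(source, limit=255):
    """Return the fraction of lines in ``source`` longer than ``limit``."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    long_lines = sum(1 for line in lines if len(line) > limit)
    return long_lines / len(lines)


# One short line and one 306-character line: half of the lines exceed the limit.
sample = "x = 1\n" + "y = '" + "a" * 300 + "'\n"
print(fraction_of_long_lines(sample))
```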
@@ -294,27 +297,37 @@ Will be displayed as::
 ^^^
 ZeroDivisionError: division by zero

+Maintaining the current behavior, only a single line will be displayed
+in tracebacks. For instructions that span multiple lines (the end offset
+and the start offset belong to different lines), the end line number must
+be inspected to know if the end offset applies to the same line as the
+starting offset.

 Opt-out mechanism
 ^^^^^^^^^^^^^^^^^

-To offer an opt-out mechanism for those users that care about the storage and
-memory overhead, the functionality will be deactivated along with the extra
-information when Python is executed in "opt-2" optimized mode (``python -OO``)
-resulting in ``pyc`` files not having the overhead associated with the extra
-required data.
+To offer an opt-out mechanism for those users that care about the
+storage and memory overhead and to allow third party tools and other
+programs that are currently parsing tracebacks to catch up the following
+methods will be provided to deactivate this feature:

-To allow third party tools and other programs that are currently parsing
-tracebacks to catch up and to allow users to deactivate the new feature, the
-following methods will be provided to deactivate displaying the new highlight
-carets (but not to avoid to storing the data, users will need to use Python in
-"opt-2" optimized mode for that):
+* A new environment variable: ``PYNODEBUGRANGES``.
+* A new command line option for the dev mode: ``python -Xnodebugranges``.

-* A new environment variable: ``PY_DEACTIVATE_TRACEBACK_RANGES``
-* A new command line option for the dev mode: ``python -Xnotracebackranges``.
+If any of these methods are used, the Python compiler will **not** populate
+code objects with the new information (``None`` will be used instead) and any
+unmarshalled code objects that contain the extra information will have it stripped
+away and replaced with ``None``). This method allows users to:

-These flags will be removed in the next version of the Python interpreter
-(counting from the version that releases this feature).
+* Create smaller ``pyc`` files by using one of the two methods when said files
+  are created.
+* Don't load the extra information from ``pyc`` files if those were created with
+  the extra information in the first place.
+
+Doing this has a **very small** performance hit as the interpreter state needs
+to be fetched when code objects are created to look up the configuration.
+Creating code objects is not a performance sensitive operation so this should
+not be a concern.

 Backwards Compatibility
 =======================
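The effect of the opt-out can be observed from a child interpreter. The sketch below makes one assumption that differs from this diff: the names proposed at this commit (``PYNODEBUGRANGES``, ``-Xnodebugranges``) were later changed, and released CPython 3.11 spells them ``PYTHONNODEBUGRANGES`` and ``-X no_debug_ranges``, so the released environment variable is used here. ``has_column_info`` is a hypothetical helper, and on interpreters older than 3.11 it reports ``False`` either way because no position data exists.

```python
import os
import subprocess
import sys

# Child program: report whether any instruction of a small function
# carries a start column offset.
CHILD_SOURCE = """
def divide(a, b):
    return a / b

positions = getattr(divide.__code__, "co_positions", None)
entries = list(positions()) if callable(positions) else []
print(any(entry[2] is not None for entry in entries))
"""


def has_column_info(disable_ranges):
    """Run CHILD_SOURCE in a fresh interpreter, optionally opting out."""
    env = dict(os.environ)
    env.pop("PYTHONNODEBUGRANGES", None)  # start from a known state
    if disable_ranges:
        env["PYTHONNODEBUGRANGES"] = "1"  # released spelling of the variable
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


print("default:", has_column_info(False))
print("opted out:", has_column_info(True))
```

Because the variable is read at interpreter startup, the opt-out has to be set in the child's environment before it launches; setting it inside an already running process would have no effect on code compiled by that process.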