python-peps/pep-0237.txt

PEP: 237
Title: Unifying Long Integers and Integers
Version: $Revision$
Last-Modified: $Date$
Author: Moshe Zadka, Guido van Rossum
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 11-Mar-2001
Python-Version: 2.2
Post-History: 16-Mar-2001, 14-Aug-2001, 23-Aug-2001


Abstract
========

Python currently distinguishes between two kinds of integers (ints): regular
or short ints, limited by the size of a C long (typically 32 or 64 bits), and
long ints, which are limited only by available memory.  When operations on
short ints yield results that don't fit in a C long, they raise an error.
There are some other distinctions too.  This PEP proposes to do away with most
of the differences in semantics, unifying the two types from the perspective
of the Python user.


Rationale
=========

Many programs find a need to deal with larger numbers after the fact, and
changing the algorithms later is bothersome.  It can hinder performance in the
normal case, when all arithmetic is performed using long ints whether or not
they are needed.

Having the machine word size exposed to the language hinders portability.  For
examples Python source files and .pyc's are not portable between 32-bit and
64-bit machines because of this.

There is also the general desire to hide unnecessary details from the Python
user when they are irrelevant for most applications. An example is memory
allocation, which is explicit in C but automatic in Python, giving us the
convenience of unlimited sizes on strings, lists, etc.  It makes sense to
extend this convenience to numbers.

It will give new Python programmers (whether they are new to programming in
general or not) one less thing to learn before they can start using the
language.


Implementation
==============

Initially, two alternative implementations were proposed (one by each author):

1. The ``PyInt`` type's slot for a C long will be turned into a::

       union {
           long i;
           struct {
               unsigned long length;
               digit digits[1];
           } bignum;
       };

   Only the ``n-1`` lower bits of the ``long`` have any meaning; the top bit
   is always set.  This distinguishes the ``union``.  All ``PyInt`` functions
   will check this bit before deciding which types of operations to use.

2. The existing short and long int types remain, but operations return
   a long int instead of raising ``OverflowError`` when a result cannot be
   represented as a short int.  A new type, ``integer``, may be introduced
   that is an abstract base type of which both the ``int`` and ``long``
   implementation types are subclassed.  This is useful so that programs can
   check integer-ness with a single test::

       if isinstance(i, integer): ...

After some consideration, the second implementation plan was selected, since
it is far easier to implement, is backwards compatible at the C API level, and
in addition can be implemented partially as a transitional measure.


Incompatibilities
=================

The following operations have (usually subtly) different semantics for short
and for long integers, and one or the other will have to be changed somehow.
This is intended to be an exhaustive list. If you know of any other operation
that differ in outcome depending on whether a short or a long int with the same
value is passed, please write the second author.

- Currently, all arithmetic operators on short ints except ``<<`` raise
  ``OverflowError`` if the result cannot be represented as a short int.  This
  will be changed to return a long int instead. The following operators can
  currently raise ``OverflowError``: ``x+y``, ``x-y``, ``x*y``, ``x**y``,
  ``divmod(x, y)``, ``x/y``, ``x%y``, and ``-x``.  (The last four can only
  overflow when the value ``-sys.maxint-1`` is involved.)

- Currently, ``x<<n`` can lose bits for short ints.  This will be changed to
  return a long int containing all the shifted-out bits, if returning a short
  int would lose bits (where changing sign is considered a special case of
  losing bits).

- Currently, hex and oct literals for short ints may specify negative values;
  for example ``0xffffffff == -1`` on a 32-bit machine.  This will be changed
  to equal ``0xffffffffL`` (``2**32-1``).

- Currently, the ``%u``, ``%x``, ``%X`` and ``%o`` string formatting operators
  and the ``hex()`` and ``oct()`` built-in functions behave differently for
  negative numbers: negative short ints are formatted as unsigned C long,
  while negative long ints are formatted with a minus sign.  This will be
  changed to use the long int semantics in all cases (but without the trailing
  *L* that currently distinguishes the output of ``hex()`` and ``oct()`` for
  long ints).  Note that this means that ``%u`` becomes an alias for ``%d``.
  It will eventually be removed.

- Currently, ``repr()`` of a long int returns a string ending in *L* while
  ``repr()`` of a short int doesn't.  The *L* will be dropped; but not before
  Python 3.0.

- Currently, an operation with long operands will never return a short int.
  This *may* change, since it allows some optimization.  (No changes have been
  made in this area yet, and none are planned.)

- The expression ``type(x).__name__`` depends on whether *x* is a short or a
  long int.  Since implementation alternative 2 is chosen, this difference
  will remain.  (In Python 3.0, we *may* be able to deploy a trick to hide the
  difference, because it *is* annoying to reveal the difference to user code,
  and more so as the difference between the two types is less visible.)

- Long and short ints are handled different by the ``marshal`` module, and by
  the ``pickle`` and ``cPickle`` modules.  This difference will remain (at
  least until Python 3.0).

- Short ints with small values (typically between -1 and 99 inclusive) are
  *interned* -- whenever a result has such a value, an existing short int with
  the same value is returned.  This is not done for long ints with the same
  values.  This difference will remain.  (Since there is no guarantee of this
  interning, it is debatable whether this is a semantic difference -- but code
  may exist that uses ``is`` for comparisons of short ints and happens to work
  because of this interning.  Such code may fail if used with long ints.)


Literals
========

A trailing *L* at the end of an integer literal will stop having any
meaning, and will be eventually become illegal.  The compiler will choose the
appropriate type solely based on the value. (Until Python 3.0, it will force
the literal to be a long; but literals without a trailing *L* may also be
long, if they are not representable as short ints.)


Built-in Functions
==================

The function ``int()`` will return a short or a long int depending on the
argument value.  In Python 3.0, the function ``long()`` will call the function
``int()``; before then, it will continue to force the result to be a long int,
but otherwise work the same way as ``int()``. The built-in name ``long`` will
remain in the language to represent the long implementation type (unless it is
completely eradicated in Python 3.0), but using the ``int()`` function is
still recommended, since it will automatically return a long when needed.


C API
=====

The C API remains unchanged; C code will still need to be aware of the
difference between short and long ints.  (The Python 3.0 C API will probably
be completely incompatible.)

The ``PyArg_Parse*()`` APIs already accept long ints, as long as they are
within the range representable by C ints or longs, so that functions taking C
int or long argument won't have to worry about dealing with Python longs.


Transition
==========

There are three major phases to the transition:

1. Short int operations that currently raise ``OverflowError`` return a long
   int value instead.  This is the only change in this phase.  Literals will
   still distinguish between short and long ints.  The other semantic
   differences listed above (including the behavior of ``<<``) will remain.
   Because this phase only changes situations that currently raise
   ``OverflowError``, it is assumed that this won't break existing code.
   (Code that depends on this exception would have to be too convoluted to be
   concerned about it.)  For those concerned about extreme backwards
   compatibility, a command line option (or a call to the warnings module)
   will allow a warning or an error to be issued at this point, but this is
   off by default.

2. The remaining semantic differences are addressed.  In all cases the long
   int semantics will prevail.  Since this will introduce backwards
   incompatibilities which will break some old code, this phase may require a
   future statement and/or warnings, and a prolonged transition phase.  The
   trailing *L* will continue to be used for longs as input and by
   ``repr()``.

   A. Warnings are enabled about operations that will change their numeric
      outcome in stage 2B, in particular ``hex()`` and ``oct()``, ``%u``,
      ``%x``, ``%X`` and ``%o``, ``hex`` and ``oct`` literals in the
      (inclusive) range ``[sys.maxint+1, sys.maxint*2+1]``, and left shifts
      losing bits.
   B. The new semantic for these operations are implemented. Operations that
      give different results than before will *not* issue a warning.

3. The trailing *L* is dropped from ``repr()``, and made illegal on input.
   (If possible, the ``long`` type completely disappears.) The trailing *L*
   is also dropped from ``hex()`` and ``oct()``.

Phase 1 will be implemented in Python 2.2.

Phase 2 will be implemented gradually, with 2A in Python 2.3 and 2B in
Python 2.4.

Phase 3 will be implemented in Python 3.0 (at least two years after Python 2.4
is released).


OverflowWarning
===============

Here are the rules that guide warnings generated in situations that currently
raise ``OverflowError``.  This applies to transition phase 1.  Historical
note: despite that phase 1 was completed in Python 2.2, and phase 2A in Python
2.3, nobody noticed that OverflowWarning was still generated in Python 2.3.
It was finally disabled in Python 2.4.  The Python builtin
``OverflowWarning``, and the corresponding C API ``PyExc_OverflowWarning``,
are no longer generated or used in Python 2.4, but will remain for the
(unlikely) case of user code until Python 2.5.

- A new warning category is introduced, ``OverflowWarning``.  This is a
  built-in name.

- If an int result overflows, an ``OverflowWarning`` warning is issued, with a
  message argument indicating the operation, e.g. "integer addition".  This
  may or may not cause a warning message to be displayed on ``sys.stderr``, or
  may cause an exception to be raised, all under control of the ``-W`` command
  line and the warnings module.

- The ``OverflowWarning`` warning is ignored by default.

- The ``OverflowWarning`` warning can be controlled like all warnings, via the
  ``-W`` command line option or via the ``warnings.filterwarnings()`` call.
  For example::

      python -Wdefault::OverflowWarning

  cause the ``OverflowWarning`` to be displayed the first time it occurs at a
  particular source line, and::

      python -Werror::OverflowWarning

  cause the ``OverflowWarning`` to be turned into an exception whenever it
  happens.  The following code enables the warning from inside the program::

      import warnings
      warnings.filterwarnings("default", "", OverflowWarning)

  See the python ``man`` page for the ``-W`` option and the ``warnings``
  module documentation for ``filterwarnings()``.

- If the ``OverflowWarning`` warning is turned into an error,
  ``OverflowError`` is substituted.  This is needed for backwards
  compatibility.

- Unless the warning is turned into an exceptions, the result of the operation
  (e.g., ``x+y``) is recomputed after converting the arguments to long ints.


Example
=======

If you pass a long int to a C function or built-in operation that takes an
integer, it will be treated the same as a short int as long as the value fits
(by virtue of how ``PyArg_ParseTuple()`` is implemented).  If the long value
doesn't fit, it will still raise an ``OverflowError``.  For example::

    def fact(n):
        if n <= 1:
        return 1
    return n*fact(n-1)

    A = "ABCDEFGHIJKLMNOPQ"
    n = input("Gimme an int: ")
    print A[fact(n)%17]

For ``n >= 13``, this currently raises ``OverflowError`` (unless the user
enters a trailing *L* as part of their input), even though the calculated
index would always be in ``range(17)``.  With the new approach this code will
do the right thing: the index will be calculated as a long int, but its value
will be in range.


Resolved Issues
===============

These issues, previously open, have been resolved.

- ``hex()`` and ``oct()`` applied to longs will continue to produce a trailing
  *L* until Python 3000.  The original text above wasn't clear about this,
  but since it didn't happen in Python 2.4 it was thought better to leave it
  alone.  BDFL pronouncement here:

  https://mail.python.org/pipermail/python-dev/2006-June/065918.html

- What to do about ``sys.maxint``?  Leave it in, since it is still relevant
  whenever the distinction between short and long ints is still relevant (e.g.
  when inspecting the type of a value).

- Should we remove ``%u`` completely?  Remove it.

- Should we warn about ``<<`` not truncating integers?  Yes.

- Should the overflow warning be on a portable maximum size?  No.


Implementation
==============

The implementation work for the Python 2.x line is completed; phase 1 was
released with Python 2.2, phase 2A with Python 2.3, and phase 2B will be
released with Python 2.4 (and is already in CVS).


Copyright
=========

This document has been placed in the public domain.


..
  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil
  End: