PEP: 371
Title: Addition of the multiprocessing package to the standard library
Version: $Revision$
Last-Modified: $Date$
Author: Jesse Noller, Richard Oudkerk
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 06-May-2008
Python-Version: 2.6 / 3.0
Post-History:


Abstract

    This PEP proposes the inclusion of the pyProcessing [1] package
    into the Python standard library, renamed to "multiprocessing".

    The processing package mimics the standard library threading
    module and its API to provide a process-based approach to
    "threaded programming", allowing end-users to dispatch multiple
    tasks that effectively side-step the global interpreter lock.

    The package also provides server and client functionality
    (processing.Manager) for remote sharing and management of objects
    and tasks, so that applications may not only leverage multiple
    cores on the local machine, but also distribute objects and tasks
    across a cluster of networked machines.

    While the distributed capabilities of the package are beneficial,
    the primary focus of this PEP is the core, threading-like API and
    capabilities of the package.


Rationale

    The current CPython interpreter implements the Global Interpreter
    Lock (GIL) and, barring work in Python 3000 or other currently
    planned versions [2], the GIL will remain as-is within the
    CPython interpreter for the foreseeable future.  While the GIL
    itself enables clean and easy-to-maintain C code for the
    interpreter and extensions, it is frequently an issue for Python
    programmers who are leveraging multi-core machines.

    The GIL prevents more than a single thread from running within
    the interpreter at any given point in time, effectively removing
    Python's ability to take advantage of multi-processor systems.
    While I/O-bound applications do not suffer the same slow-down
    when using threading, they do pay some performance cost due to
    the GIL.

    The pyProcessing package offers a way to side-step the GIL,
    allowing applications within CPython to take advantage of
    multi-core architectures without asking users to completely
    change their programming paradigm (i.e.: dropping threaded
    programming for another "concurrent" approach such as Twisted).

    The processing package offers CPython users a known API (that of
    the threading module), with known semantics and easy scalability.
    The package may become less relevant should the CPython
    interpreter one day enable "true" threading; however, for some
    applications forking an OS process may still be more desirable
    than using lightweight threads, especially on those platforms
    where process creation is fast and optimized.

    For example, a simple threaded application:

        from threading import Thread as worker

        def afunc(number):
            print number * 3

        t = worker(target=afunc, args=(4,))
        t.start()
        t.join()

    The pyprocessing package mirrors the API so closely that the code
    can be switched over with a simple change of the import to:

        from processing import Process as worker

    The code now executes through the processing.Process class.  This
    level of compatibility means that, with a minor (in most cases)
    change in code, users' applications will be able to leverage all
    cores and processors on a given machine for parallel execution.
    In many cases the pyprocessing package is even faster than the
    normal threading approach for I/O-bound programs.  This of course
    takes into account that the pyprocessing package is in optimized
    C code, while the threading module is not.
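    For completeness, the process-based version of the example above,
    which differs only in the import line, reads:

        from processing import Process as worker

        def afunc(number):
            print number * 3

        t = worker(target=afunc, args=(4,))
        t.start()
        t.join()

    On platforms where child processes are spawned rather than forked
    (such as Windows), the start()/join() calls should live under an
    "if __name__ == '__main__':" guard; the example is kept minimal
    here to match the threaded version.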
The "Distributed" Problem

    In the discussion on python-dev about the inclusion of this
    package [3] there was confusion about the intentions of this PEP,
    with some attempting to position it as a solution to the
    "Distributed" problem - frequently comparing the functionality of
    this package with other solutions such as MPI-based
    communication [4], CORBA, or other distributed-object
    approaches [5].

    The "distributed" problem is large and varied.  Each programmer
    working within this domain either has very strong opinions about
    their favorite module/method, or a highly customized problem for
    which no existing solution works.

    The acceptance of this package neither precludes nor discourages
    programmers working on the "distributed" problem from examining
    other solutions for their problem domain.  The intent of
    including this package is to provide entry-level capabilities for
    local concurrency and the basic support needed to spread that
    concurrency across a network of machines - although the two are
    not tightly coupled, and the pyprocessing package could in fact
    be used in conjunction with any of the other solutions, including
    MPI.

    If necessary, it is possible to completely decouple the local
    concurrency abilities of the package from the
    network-capable/shared aspects of the package.  Without serious
    concerns or cause, however, the author of this PEP does not
    recommend that approach.


Performance Comparison

    As we all know - there are "lies, damned lies, and benchmarks".
    These speed comparisons, while aimed at showcasing the
    performance of the pyprocessing package, are by no means
    comprehensive or applicable to all possible use cases or
    environments - especially on platforms where process creation is
    slow.

    All benchmarks were run using the following:

        * 4 Core Intel Xeon CPU @ 3.00GHz
        * 16 GB of RAM
        * Python 2.5.2 compiled on Gentoo Linux (kernel 2.6.18.6)
        * pyProcessing 0.52

    All of the code for this can be downloaded from:
    http://jessenoller.com/code/bench-src.tgz

    The basic method of execution for these benchmarks is the
    run_benchmarks.py script [6], which is simply a wrapper that
    executes a target function single-threaded (linear),
    multi-threaded (via threading), and multi-process (via
    pyprocessing), for a static number of iterations with increasing
    numbers of execution loops and/or threads.

    The run_benchmarks.py script executes each function 100 times,
    picking the best run of those 100 iterations via the timeit
    module.

    First, to identify the overhead of spawning the workers, we
    execute a function which is simply a pass statement (empty):

        cmd: python run_benchmarks.py empty_func.py
        Importing empty_func
        Starting tests ...
        non_threaded (1 iters)  0.000001 seconds
        threaded (1 threads)    0.000796 seconds
        processes (1 procs)     0.000714 seconds

        non_threaded (2 iters)  0.000002 seconds
        threaded (2 threads)    0.001963 seconds
        processes (2 procs)     0.001466 seconds

        non_threaded (4 iters)  0.000002 seconds
        threaded (4 threads)    0.003986 seconds
        processes (4 procs)     0.002701 seconds

        non_threaded (8 iters)  0.000003 seconds
        threaded (8 threads)    0.007990 seconds
        processes (8 procs)     0.005512 seconds

    As you can see, process forking via the pyprocessing package is
    faster than building and then executing the threaded version of
    the code.
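    For illustration, the general shape of such a harness is sketched
    below.  This is a simplification, not the run_benchmarks.py
    source: the real script drives the workers through the timeit
    module and reports the best of 100 runs, whereas this sketch
    times a single pass with time.time(), and all names here are
    placeholders.

        import time
        from threading import Thread
        from processing import Process

        def run_workers(worker_cls, target, count):
            # Create ``count`` workers, start them all, then wait for
            # every one of them to finish, returning the elapsed time.
            workers = [worker_cls(target=target) for i in range(count)]
            start = time.time()
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            return time.time() - start

        def empty_func():
            pass

        if __name__ == '__main__':
            for count in (1, 2, 4, 8):
                print "threaded  (%d threads) %f seconds" % (
                    count, run_workers(Thread, empty_func, count))
                print "processes (%d procs)   %f seconds" % (
                    count, run_workers(Process, empty_func, count))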
    The second test calculates 50,000 Fibonacci numbers inside of
    each thread (isolated and shared-nothing):

        cmd: python run_benchmarks.py fibonacci.py
        Importing fibonacci
        Starting tests ...
        non_threaded (1 iters)  0.195548 seconds
        threaded (1 threads)    0.197909 seconds
        processes (1 procs)     0.201175 seconds

        non_threaded (2 iters)  0.397540 seconds
        threaded (2 threads)    0.397637 seconds
        processes (2 procs)     0.204265 seconds

        non_threaded (4 iters)  0.795333 seconds
        threaded (4 threads)    0.797262 seconds
        processes (4 procs)     0.206990 seconds

        non_threaded (8 iters)  1.591680 seconds
        threaded (8 threads)    1.596824 seconds
        processes (8 procs)     0.417899 seconds

    The third test calculates the sum of all primes below 100000,
    again sharing nothing:

        cmd: run_benchmarks.py crunch_primes.py
        Importing crunch_primes
        Starting tests ...
        non_threaded (1 iters)  0.495157 seconds
        threaded (1 threads)    0.522320 seconds
        processes (1 procs)     0.523757 seconds

        non_threaded (2 iters)  1.052048 seconds
        threaded (2 threads)    1.154726 seconds
        processes (2 procs)     0.524603 seconds

        non_threaded (4 iters)  2.104733 seconds
        threaded (4 threads)    2.455215 seconds
        processes (4 procs)     0.530688 seconds

        non_threaded (8 iters)  4.217455 seconds
        threaded (8 threads)    5.109192 seconds
        processes (8 procs)     1.077939 seconds

    The reason tests two and three focus on pure numeric crunching is
    to showcase how the current threading implementation hinders
    non-I/O applications.  Obviously, these tests could be improved
    to use a queue for coordination of results and chunks of work,
    but that is not required to show the performance of the package
    and the core processing module.

    The next test is an I/O-bound test.  This is normally where we
    would expect a steep improvement from the threading-module
    approach versus a single-threaded approach.  In this case, each
    worker opens a descriptor to lorem.txt, randomly seeks within it
    and writes lines to /dev/null:

        cmd: python run_benchmarks.py file_io.py
        Importing file_io
        Starting tests ...
        non_threaded (1 iters)  0.057750 seconds
        threaded (1 threads)    0.089992 seconds
        processes (1 procs)     0.090817 seconds

        non_threaded (2 iters)  0.180256 seconds
        threaded (2 threads)    0.329961 seconds
        processes (2 procs)     0.096683 seconds

        non_threaded (4 iters)  0.370841 seconds
        threaded (4 threads)    1.103678 seconds
        processes (4 procs)     0.101535 seconds

        non_threaded (8 iters)  0.749571 seconds
        threaded (8 threads)    2.437204 seconds
        processes (8 procs)     0.203438 seconds

    As you can see, pyprocessing is still faster on this I/O
    operation than using multiple threads - and using multiple
    threads is actually slower than the single-threaded execution
    itself.

    Finally, we run a socket-based test to show network I/O
    performance.  This function fetches a URL from a server on the
    LAN, which serves a simple Tomcat error page, and gets the page
    100 times.  The network is otherwise quiet, over a 10G
    connection:

        cmd: python run_benchmarks.py url_get.py
        Importing url_get
        Starting tests ...
        non_threaded (1 iters)  0.124774 seconds
        threaded (1 threads)    0.120478 seconds
        processes (1 procs)     0.121404 seconds

        non_threaded (2 iters)  0.239574 seconds
        threaded (2 threads)    0.146138 seconds
        processes (2 procs)     0.138366 seconds

        non_threaded (4 iters)  0.479159 seconds
        threaded (4 threads)    0.200985 seconds
        processes (4 procs)     0.188847 seconds

        non_threaded (8 iters)  0.960621 seconds
        threaded (8 threads)    0.659298 seconds
        processes (8 procs)     0.298625 seconds

    Here we finally see threaded performance surpass that of
    single-threaded execution, but the pyprocessing package is still
    faster as the number of workers increases.  If you stay with one
    or two threads/workers, the timings for threads and pyprocessing
    are fairly close.
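    As noted above, the CPU-bound tests could be restructured to use
    a queue to hand out chunks of work and collect results.  The
    sketch below shows the general shape of that pattern with the
    pyprocessing package; it is illustrative only and is not taken
    from the benchmark sources (the work item - squaring an integer -
    and all names are placeholders):

        from processing import Process, Queue

        def worker(inq, outq):
            # Pull work items from the input queue until the sentinel
            # None is seen, pushing each result to the output queue.
            while True:
                item = inq.get()
                if item is None:
                    break
                outq.put(item * item)

        if __name__ == '__main__':
            inq, outq = Queue(), Queue()
            procs = [Process(target=worker, args=(inq, outq))
                     for i in range(4)]
            for p in procs:
                p.start()
            for i in range(100):
                inq.put(i)
            for p in procs:
                inq.put(None)       # one sentinel per worker
            results = [outq.get() for i in range(100)]
            for p in procs:
                p.join()
            print sum(results)

    Every object passed through such a queue must be pickled in one
    process and unpickled in the other, which is the source of the
    serialization overhead discussed next.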
    One item of note, however, is that there is an implicit overhead
    within the pyprocessing package's Queue implementation due to
    object serialization.  Alec Thomas provided a short example based
    on the run_benchmarks.py script to demonstrate this overhead
    versus the default Queue implementation:

        cmd: run_bench_queue.py
        non_threaded (1 iters)  0.010546 seconds
        threaded (1 threads)    0.015164 seconds
        processes (1 procs)     0.066167 seconds

        non_threaded (2 iters)  0.020768 seconds
        threaded (2 threads)    0.041635 seconds
        processes (2 procs)     0.084270 seconds

        non_threaded (4 iters)  0.041718 seconds
        threaded (4 threads)    0.086394 seconds
        processes (4 procs)     0.144176 seconds

        non_threaded (8 iters)  0.083488 seconds
        threaded (8 threads)    0.184254 seconds
        processes (8 procs)     0.302999 seconds

    Additional benchmarks can be found in the examples/ directory of
    the pyprocessing package's source distribution.  The examples
    will be included in the package's documentation.


Maintenance

    Richard M. Oudkerk - the author of the pyprocessing package - has
    agreed to maintain the package within Python SVN.  Jesse Noller
    has volunteered to help maintain, document and test the package.


API Naming

    The API of the pyprocessing package is designed to closely mimic
    that of the threading and Queue modules.  It has been proposed
    that, instead of adding the package as-is, it be renamed to be
    PEP 8 compliant.  Since the aim of the package is to be a drop-in
    replacement for the threading module, the authors feel that the
    current API should be kept.  When the threading and Queue modules
    are updated to fully reflect PEP 8, the
    pyprocessing/multiprocessing naming can be revised.


Timing/Schedule

    Some concerns have been raised about the timing/lateness of this
    PEP for the 2.6 and 3.0 releases this year; however, it is felt
    by both the authors and others that the functionality this
    package offers outweighs the risk of inclusion.

    Taking into account the desire not to destabilize Python-core,
    some refactoring of pyprocessing's code "into" Python-core can be
    withheld until the next 2.x/3.x releases.  This means that the
    actual risk to Python-core is minimal, and largely constrained to
    the package itself.


Open Issues

    * All existing tests for the package should be converted to
      unittest format.

    * Existing documentation has to be moved to reST formatting.

    * Verify the code coverage percentage of the existing test suite.

    * Identify any requirements to achieve a 1.0 milestone, if
      required.

    * Verify that the current source tree conforms to standard
      library practices.

    * Rename the top-level package from "pyprocessing" to
      "multiprocessing".

    * Confirm that there are no "default" remote connection
      capabilities; if needed, enable the remote security mechanisms
      by default for those classes which offer remote capabilities.

    * Some of the API (the Queue methods qsize(), task_done() and
      join()) either needs to be added, or the reason for its
      exclusion needs to be identified and documented clearly.

    * Add a "multiprocessing.setExecutable()" method to override the
      default behavior of the package, which is to spawn processes
      using the current executable name rather than the Python
      interpreter.  Note that Mark Hammond has suggested a
      factory-style interface for this [7].

    * Also note that the default process-spawning behavior does not
      make the package compatible with use within IDLE as-is; this
      will be examined as a bug fix or as part of the "setExecutable"
      enhancement.


Closed Issues

    * Reliance on ctypes: The pyprocessing package's reliance on
      ctypes prevents the package from functioning on platforms where
      ctypes is not supported.
      This is not a restriction of this package, but rather of
      ctypes.


References

    [1] The pyProcessing home page:
        http://pyprocessing.berlios.de/

    [2] See Adam Olsen's "safe threading" project:
        http://code.google.com/p/python-safethread/

    [3] See: Addition of "pyprocessing" module to standard lib.
        http://mail.python.org/pipermail/python-dev/2008-May/079417.html

    [4] http://mpi4py.scipy.org/

    [5] See "Cluster Computing":
        http://wiki.python.org/moin/ParallelProcessing

    [6] The original run_benchmarks.py code was published in Python
        Magazine in December 2007: "Python Threads and the Global
        Interpreter Lock" by Jesse Noller.  It has been modified for
        this PEP.

    [7] http://groups.google.com/group/python-dev2/msg/54cf06d15cbcbc34


Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: