PEP: 371
Title: Addition of the Processing module to standard library
Version: $Revision: $
Last-Modified: $Date: $
Author: Jesse Noller <jnoller@gmail.com>,
        Richard Oudkerk <r.m.oudkerk@googlemail.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 06-May-2008
Python-Version: 2.6 / 3.0
Post-History:

Abstract

This PEP proposes the inclusion of the pyProcessing [1] module into the
Python standard library.

The processing module mimics the standard library threading module and API
to provide a process-based approach to "threaded programming" allowing
end-users to dispatch multiple tasks that effectively side-step the global
interpreter lock.
The module also provides server and client modules to provide remote-
sharing and management of objects and tasks so that applications may not
only leverage multiple cores on the local machine, but also distribute
objects and tasks across a cluster of networked machines.
While the distributed capabilities of the module are beneficial, the primary
focus of this PEP is the core threading-like API and capabilities of the
module.

Rationale

The current CPython interpreter implements the Global Interpreter Lock (GIL)
and barring work in Python 3000 or other versions currently planned [2], the
GIL will remain as-is within the CPython interpreter for the foreseeable
future. While the GIL itself enables clean and easy to maintain C code for
the interpreter and extensions base, it is frequently an issue for those
Python programmers who are leveraging multi-core machines.
The GIL itself prevents more than a single thread from running within the
interpreter at any given point in time, effectively removing Python's
ability to take advantage of multi-processor systems. While I/O bound
applications do not suffer the same slow-down when using threading, they do
suffer some performance cost due to the GIL.
The Processing module offers a method to side-step the GIL allowing
applications within CPython to take advantage of multi-core architectures
without asking users to completely change their programming paradigm (i.e.:
dropping threaded programming for another "concurrent" approach - Twisted,
etc).
The Processing module offers CPython users a known API (that of the
threading module), with known semantics and easy-scalability. In the
future, the module might not be as relevant should the CPython interpreter
enable "true" threading, however for some applications, forking an OS
process may sometimes be more desirable than using lightweight threads,
especially on those platforms where process creation is fast/optimized.
For example, a simple threaded application:

    from threading import Thread as worker

    def afunc(number):
        print number * 3

    t = worker(target=afunc, args=(4,))
    t.start()
    t.join()

The pyprocessing module mirrors the API so well that, with a simple change
of the import to:

    from processing import Process as worker

The code now executes through the processing.Process class. This type of
compatibility means that, with a minor (in most cases) change in code,
users' applications will be able to leverage all cores and processors on a
given machine for parallel execution. In many cases the pyprocessing module
is even faster than the normal threading approach for I/O bound programs.
This, of course, takes into account that the pyprocessing module is
implemented in optimized C code, while the threading module is not.
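
The swap described above can be sketched end to end. The example below uses
the name "multiprocessing" (the stdlib rename proposed under Open Issues) and
passes the result back through a queue instead of printing, so the parent can
collect it; this is an illustrative sketch of the API's shape, not code from
the pyprocessing distribution:

    ```python
    from multiprocessing import Process, Queue

    def afunc(number, out):
        # The worker puts its result on a queue so the parent
        # process can read it back after join().
        out.put(number * 3)

    if __name__ == "__main__":
        q = Queue()
        t = Process(target=afunc, args=(4, q))
        t.start()
        t.join()
        result = q.get()
        print(result)
    ```

Replacing Process with threading.Thread (and dropping the queue for a plain
shared list) yields the threaded equivalent, which is exactly the
compatibility this section describes.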

The "Distributed" Problem

In the discussion on Python-Dev about the inclusion of this module [3] there
was confusion about whether this PEP was an attempt to solve the
"Distributed" problem, with commenters frequently comparing the
functionality of this module to other solutions like MPI-based
communication [4], CORBA, or other distributed object approaches [5].
The "distributed" problem is large and varied. Each programmer working
within this domain has either very strong opinions about their favorite
module/method or a highly customized problem for which no existing solution
works.
The acceptance of this module neither precludes nor discourages programmers
working on the "distributed" problem from examining other solutions for
their problem domain. The intent of including this module is to provide
entry-level capabilities for local concurrency and the basic support to
spread that concurrency across a network of machines - although the two are
not tightly coupled, the pyprocessing module could, in fact, be used in
conjunction with any of the other solutions including MPI/etc.
If necessary - it is possible to completely decouple the local concurrency
abilities of the module from the network-capable/shared aspects of the
module. Without serious concerns or cause however, the author of this PEP
does not recommend that approach.
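
To illustrate how the shared-object machinery stays separable from plain
local concurrency, here is a minimal sketch of workers updating state held
by a manager process (again written against the proposed "multiprocessing"
name; the pyprocessing API is analogous). The same manager machinery can
also listen on a socket for remote clients, but nothing in the local usage
below touches the network:

    ```python
    from multiprocessing import Manager, Process

    def record(shared, value):
        # Each worker appends to a list proxied by the manager
        # process; the manager serializes access for us.
        shared.append(value * value)

    if __name__ == "__main__":
        with Manager() as mgr:
            results = mgr.list()
            workers = [Process(target=record, args=(results, n))
                       for n in range(4)]
            for w in workers:
                w.start()
            for w in workers:
                w.join()
            squares = sorted(results)
    ```
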

Performance Comparison

As we all know - there are "lies, damned lies, and benchmarks". These speed
comparisons, while aimed at showcasing the performance of the pyprocessing
module, are by no means comprehensive or applicable to all possible use
cases or environments, especially on platforms where process creation is
slow.
All benchmarks were run using the following:
* 4 Core Intel Xeon CPU @ 3.00GHz
* 16 GB of RAM
* Python 2.5.2 compiled on Gentoo Linux (kernel 2.6.18.6)
* pyProcessing 0.52
All of the code for this can be downloaded from:
http://jessenoller.com/code/bench-src.tgz
The basic method of execution for these benchmarks is in the
run_benchmarks.py script, which is simply a wrapper to execute a target
function through a single threaded (linear), multi-threaded (via threading),
and multi-process (via pyprocessing) function for a static number of
iterations with increasing numbers of execution loops and/or threads.
The run_benchmarks.py script executes each function 100 times, picking the
best run of that 100 iterations via the timeit module.
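
The actual script is in the tarball above; its timing core can be sketched
as a thin timeit wrapper (the function names here are illustrative, not
the script's own):

    ```python
    import timeit

    def best_time(func, runs=100):
        # Time `runs` one-shot executions of func and keep the
        # fastest, mirroring the "best of 100" approach above.
        return min(timeit.Timer(func).repeat(repeat=runs, number=1))

    def empty_func():
        pass

    elapsed = best_time(empty_func, runs=10)
    ```

Taking the minimum of repeated runs, rather than the mean, filters out
scheduler noise from the other processes on the benchmark machine.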
First, to identify the overhead of spawning the workers, we execute a
function that is simply a pass statement (empty):
cmd: python run_benchmarks.py empty_func.py
Importing empty_func
Starting tests ...
non_threaded (1 iters) 0.000001 seconds
threaded (1 threads) 0.000796 seconds
processes (1 procs) 0.000714 seconds
non_threaded (2 iters) 0.000002 seconds
threaded (2 threads) 0.001963 seconds
processes (2 procs) 0.001466 seconds
non_threaded (4 iters) 0.000002 seconds
threaded (4 threads) 0.003986 seconds
processes (4 procs) 0.002701 seconds
non_threaded (8 iters) 0.000003 seconds
threaded (8 threads) 0.007990 seconds
processes (8 procs) 0.005512 seconds
As you can see, spawning processes via the pyprocessing module is faster
than building and then executing the threaded version of the code.
The second test calculates 50000 Fibonacci numbers inside of each thread
(isolated and shared-nothing):
cmd: python run_benchmarks.py fibonacci.py
Importing fibonacci
Starting tests ...
non_threaded (1 iters) 0.195548 seconds
threaded (1 threads) 0.197909 seconds
processes (1 procs) 0.201175 seconds
non_threaded (2 iters) 0.397540 seconds
threaded (2 threads) 0.397637 seconds
processes (2 procs) 0.204265 seconds
non_threaded (4 iters) 0.795333 seconds
threaded (4 threads) 0.797262 seconds
processes (4 procs) 0.206990 seconds
non_threaded (8 iters) 1.591680 seconds
threaded (8 threads) 1.596824 seconds
processes (8 procs) 0.417899 seconds
The third test calculates the sum of all primes below 100000, again sharing
nothing.
cmd: run_benchmarks.py crunch_primes.py
Importing crunch_primes
Starting tests ...
non_threaded (1 iters) 0.495157 seconds
threaded (1 threads) 0.522320 seconds
processes (1 procs) 0.523757 seconds
non_threaded (2 iters) 1.052048 seconds
threaded (2 threads) 1.154726 seconds
processes (2 procs) 0.524603 seconds
non_threaded (4 iters) 2.104733 seconds
threaded (4 threads) 2.455215 seconds
processes (4 procs) 0.530688 seconds
non_threaded (8 iters) 4.217455 seconds
threaded (8 threads) 5.109192 seconds
processes (8 procs) 1.077939 seconds
Tests two and three focus on pure numeric crunching to showcase how the
current threading implementation hinders non-I/O-bound applications.
Obviously, these tests could be improved to use a queue for coordination of
results and chunks of work, but that is not required to show the
performance of the module.
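
For reference, a CPU-bound worker of the shape used in tests two and three
can be sketched as follows; this is an illustrative stand-in, not the
actual crunch_primes.py from the tarball:

    ```python
    def sum_primes(limit):
        # Sum all primes below `limit` with a simple sieve of
        # Eratosthenes; pure CPU work, no I/O, nothing shared.
        sieve = [True] * limit
        sieve[0] = sieve[1] = False
        for i in range(2, int(limit ** 0.5) + 1):
            if sieve[i]:
                sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
        return sum(i for i, is_prime in enumerate(sieve) if is_prime)

    total = sum_primes(100000)
    ```

Because a worker like this never releases the GIL, running several of them
in threads serializes them; running them in processes does not.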
The next test is an I/O bound test. This is normally where we see a steep
improvement in the threading module approach versus a single-threaded
approach. In this case, each worker is opening a descriptor to lorem.txt,
randomly seeking within it and writing lines to /dev/null:
cmd: python run_benchmarks.py file_io.py
Importing file_io
Starting tests ...
non_threaded (1 iters) 0.057750 seconds
threaded (1 threads) 0.089992 seconds
processes (1 procs) 0.090817 seconds
non_threaded (2 iters) 0.180256 seconds
threaded (2 threads) 0.329961 seconds
processes (2 procs) 0.096683 seconds
non_threaded (4 iters) 0.370841 seconds
threaded (4 threads) 1.103678 seconds
processes (4 procs) 0.101535 seconds
non_threaded (8 iters) 0.749571 seconds
threaded (8 threads) 2.437204 seconds
processes (8 procs) 0.203438 seconds
As you can see, pyprocessing is still faster on this I/O operation than
using multiple threads; in fact, using multiple threads is slower than
single-threaded execution itself.
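
A worker of this shape can be sketched as follows; this is an illustrative
stand-in for the file_io.py worker in the tarball (the demo uses a small
temporary file in place of lorem.txt, and os.devnull in place of a
hard-coded /dev/null):

    ```python
    import os
    import random
    import tempfile

    def file_io_worker(path, lines=100):
        # Seek to a random offset in the source file and copy the
        # next line to the null device; returns bytes written.
        size = os.path.getsize(path)
        written = 0
        with open(path, "rb") as src, open(os.devnull, "wb") as sink:
            for _ in range(lines):
                src.seek(random.randrange(size))
                written += sink.write(src.readline())
        return written

    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"lorem ipsum dolor sit amet\n" * 50)
    written = file_io_worker(tmp.name, lines=10)
    os.remove(tmp.name)
    ```
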
Finally, we will run a socket-based test to show network I/O performance.
This function grabs a URL from a server on the LAN that is a simple error
page from Tomcat. It gets the page 100 times. The network is otherwise
idle, and the connection is 10G:
cmd: python run_benchmarks.py url_get.py
Importing url_get
Starting tests ...
non_threaded (1 iters) 0.124774 seconds
threaded (1 threads) 0.120478 seconds
processes (1 procs) 0.121404 seconds
non_threaded (2 iters) 0.239574 seconds
threaded (2 threads) 0.146138 seconds
processes (2 procs) 0.138366 seconds
non_threaded (4 iters) 0.479159 seconds
threaded (4 threads) 0.200985 seconds
processes (4 procs) 0.188847 seconds
non_threaded (8 iters) 0.960621 seconds
threaded (8 threads) 0.659298 seconds
processes (8 procs) 0.298625 seconds
We finally see threaded performance surpass that of single-threaded
execution, but the pyprocessing module is still faster when increasing the
number of workers. If you stay with one or two threads/workers, then the
timing between threads and pyprocessing is fairly close.
Additional benchmarks can be found in the pyprocessing module's source
distribution's examples/ directory.

Maintenance

Richard M. Oudkerk - the author of the pyprocessing module - has agreed to
maintain the module within Python SVN. Jesse Noller has volunteered to help
maintain, document, and test the module.

Timing/Schedule

Some concerns have been raised about the timing/lateness of this PEP for
the 2.6 and 3.0 releases this year; however, it is felt by both the authors
and others that the functionality this module offers outweighs the risk of
inclusion.
However, taking into account the desire not to destabilize python-core, some
refactoring of pyprocessing's code "into" python-core can be withheld until
the next 2.x/3.x releases. This means that the actual risk to python-core
is minimal, and largely constrained to the actual module itself.

Open Issues

* All existing tests for the module should be converted to UnitTest format.
* Existing documentation has to be moved to ReST formatting.
* Verify code coverage percentage of existing test suite.
* Identify any requirements to achieve a 1.0 milestone if required.
* Verify current source tree conforms to standard library practices.
* Rename top-level module from "pyprocessing" to "multiprocessing".
* Confirm no "default" remote connection capabilities, if needed enable the
remote security mechanisms by default for those classes which offer remote
capabilities.
* Some of the API (Queue methods qsize(), task_done() and join()) either
need to be added, or the reason for their exclusion needs to be identified
and documented clearly.
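
For reference, the threading-style semantics at issue in the last item look
like this with the standard library's queue (task_done() and join() let a
producer block until every enqueued item has been processed); whether
processing's Queue should grow the same methods is exactly the open
question:

    ```python
    import threading
    try:
        import Queue as queue  # the module's name at the time of this PEP
    except ImportError:
        import queue           # renamed in Python 3

    q = queue.Queue()
    results = []

    def consumer():
        while True:
            item = q.get()
            results.append(item * 2)
            q.task_done()  # marks one fetched item as processed

    worker = threading.Thread(target=consumer)
    worker.daemon = True
    worker.start()

    for i in range(5):
        q.put(i)
    q.join()  # blocks until task_done() was called for every put()
    ```
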

Closed Issues

* Reliance on ctypes: The pyprocessing module's reliance on ctypes prevents
the module from functioning on platforms where ctypes is not supported.
This is not a restriction of this module, but rather of ctypes.

References

[1] PyProcessing home page
http://pyprocessing.berlios.de/
[2] See Adam Olsen's "safe threading" project
http://code.google.com/p/python-safethread/
[3] See: Addition of "pyprocessing" module to standard lib.
http://mail.python.org/pipermail/python-dev/2008-May/079417.html
[4] http://mpi4py.scipy.org/
[5] See "Cluster Computing"
http://wiki.python.org/moin/ParallelProcessing
[6] The original run_benchmarks.py code was published in Python
Magazine in December 2007: "Python Threads and the Global Interpreter
Lock" by Jesse Noller. It has been modified for this PEP.

Copyright

This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: