PEP: 371
Title: Addition of the Processing module to the standard library
Version: $Revision: $
Last-Modified: $Date: $
Author: Jesse Noller <jnoller@gmail.com>,
        Richard Oudkerk <r.m.oudkerk@googlemail.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 06-May-2008
Python-Version: 2.6 / 3.0
Post-History:

Abstract

This PEP proposes the inclusion of the pyProcessing [1] module into
the Python standard library.

The processing module mimics the standard library threading module
and API to provide a process-based approach to "threaded
programming", allowing end-users to dispatch multiple tasks that
effectively side-step the global interpreter lock.

The module also provides server and client functionality to support
remote sharing and management of objects and tasks, so that
applications may not only leverage multiple cores on the local
machine, but also distribute objects and tasks across a cluster of
networked machines.

While the distributed capabilities of the module are beneficial, the
primary focus of this PEP is the core threading-like API and
capabilities of the module.

Rationale

The current CPython interpreter implements the Global Interpreter
Lock (GIL) and, barring work in Python 3000 or other currently
planned versions [2], the GIL will remain as-is within the CPython
interpreter for the foreseeable future.  While the GIL itself enables
clean and easy-to-maintain C code for the interpreter and extensions,
it is frequently an issue for those Python programmers who are
leveraging multi-core machines.

The GIL itself prevents more than a single thread from running within
the interpreter at any given point in time, effectively removing
Python's ability to take advantage of multi-processor systems.  While
I/O-bound applications do not suffer the same slow-down when using
threading, they do suffer some performance cost due to the GIL.

The Processing module offers a method to side-step the GIL, allowing
applications within CPython to take advantage of multi-core
architectures without asking users to completely change their
programming paradigm (i.e.: dropping threaded programming for another
"concurrent" approach such as Twisted).

The Processing module offers CPython users a known API (that of the
threading module), with known semantics and easy scalability.  In the
future the module might not be as relevant should the CPython
interpreter enable "true" threading; however, for some applications,
forking an OS process may sometimes be more desirable than using
lightweight threads, especially on those platforms where process
creation is fast and optimized.

For example, a simple threaded application:

    from threading import Thread as worker

    def afunc(number):
        print number * 3

    t = worker(target=afunc, args=(4,))
    t.start()
    t.join()

The pyprocessing module mirrors the API so well that, with a simple
change of the import to:

    from processing import Process as worker

The code now executes through the processing.Process class.  This
type of compatibility means that, with a minor (in most cases) change
in code, users' applications will be able to leverage all cores and
processors on a given machine for parallel execution.  In many cases
the pyprocessing module is even faster than the normal threading
approach for I/O-bound programs.  This, of course, takes into account
that the pyprocessing module is in optimized C code, while the
threading module is not.
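
Since Process accepts the same target/args arguments, scaling the
same example out to several workers follows the familiar threading
pattern.  A minimal sketch (reusing afunc from above; the choice of
four workers is arbitrary):

    from processing import Process as worker

    def afunc(number):
        print number * 3

    # One worker per task; each runs in its own OS process.
    workers = [worker(target=afunc, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()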
The "Distributed" Problem
|
|||
|
|
|||
|
In the discussion on Python-Dev about the inclusion of this module [3] there
|
|||
|
was confusion about the intentions this PEP with an attempt to solve the
|
|||
|
"Distributed" problem - frequently comparing the functionality of this
|
|||
|
module with other solutions like MPI-based communication [4], CORBA, or
|
|||
|
other distributed object approaches [5].
|
|||
|
|
|||
|
The "distributed" problem is large and varied. Each programmer working
|
|||
|
within this domain has either very strong opinions about their favorite
|
|||
|
module/method or a highly customized problem for which no existing solution
|
|||
|
works.
|
|||
|
|
|||
|
The acceptance of this module does not preclude or recommend that
|
|||
|
programmers working on the "distributed" problem not examine other solutions
|
|||
|
for their problem domain. The intent of including this module is to provide
|
|||
|
entry-level capabilities for local concurrency and the basic support to
|
|||
|
spread that concurrency across a network of machines - although the two are
|
|||
|
not tightly coupled, the pyprocessing module could in fact, be used in
|
|||
|
conjunction with any of the other solutions including MPI/etc.
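
To make the remote-sharing aspect concrete, the following rough
sketch shows a queue shared over the network in the manager style the
module provides.  The names used here (processing.managers,
BaseManager, register(), get_server(), connect()) are assumptions
based on the API this module's code later exposed in the standard
library; treat this as illustrative only:

    # Illustrative sketch only; the manager API names are assumed.
    import sys
    from Queue import Queue
    from processing.managers import BaseManager

    queue = Queue()

    class QueueManager(BaseManager):
        pass

    # Expose a callable returning the shared object under a type id.
    QueueManager.register('get_queue', callable=lambda: queue)

    if sys.argv[1:] == ['serve']:
        # Server role: listen on TCP port 50000 with a shared key.
        m = QueueManager(address=('', 50000), authkey='abracadabra')
        m.get_server().serve_forever()
    else:
        # Client role; this can run on another machine on the LAN.
        m = QueueManager(address=('localhost', 50000),
                         authkey='abracadabra')
        m.connect()
        q = m.get_queue()   # proxy object for the server's queue
        q.put('work item')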

If necessary, it is possible to completely decouple the local
concurrency abilities of the module from the network-capable/shared
aspects of the module.  Without serious concerns or cause, however,
the author of this PEP does not recommend that approach.

Performance Comparison

As we all know, there are "lies, damned lies, and benchmarks".  These
speed comparisons, while aimed at showcasing the performance of the
pyprocessing module, are by no means comprehensive or applicable to
all possible use cases or environments, especially on those platforms
where process forking is slow.

All benchmarks were run using the following:

* 4 Core Intel Xeon CPU @ 3.00GHz
* 16 GB of RAM
* Python 2.5.2 compiled on Gentoo Linux (kernel 2.6.18.6)
* pyProcessing 0.52

All of the code for this can be downloaded from:
http://jessenoller.com/code/bench-src.tgz

The basic method of execution for these benchmarks is in the
run_benchmarks.py script [6], which is simply a wrapper that executes
a target function through a single-threaded (linear), multi-threaded
(via threading), and multi-process (via pyprocessing) function for a
static number of iterations, with increasing numbers of execution
loops and/or threads.

The run_benchmarks.py script executes each function 100 times and
picks the best run of those 100 iterations via the timeit module.
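
The harness logic itself is straightforward.  A minimal sketch of a
best-of-N measurement in the spirit of run_benchmarks.py (the module
and function names in the usage line are hypothetical; the real
script is in the archive linked above):

    import timeit

    def best_of(stmt, setup, runs=100):
        # One execution per run; the minimum over all runs is the
        # conventional "best of N" figure reported below.
        timer = timeit.Timer(stmt, setup=setup)
        return min(timer.repeat(repeat=runs, number=1))

    # Hypothetical usage:
    print best_of('afunc(4)', 'from empty_func import afunc')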

First, to identify the overhead of spawning the workers, we execute a
function which is simply a pass statement (empty):

    cmd: python run_benchmarks.py empty_func.py
    Importing empty_func
    Starting tests ...
    non_threaded (1 iters)  0.000001 seconds
    threaded (1 threads)    0.000796 seconds
    processes (1 procs)     0.000714 seconds

    non_threaded (2 iters)  0.000002 seconds
    threaded (2 threads)    0.001963 seconds
    processes (2 procs)     0.001466 seconds

    non_threaded (4 iters)  0.000002 seconds
    threaded (4 threads)    0.003986 seconds
    processes (4 procs)     0.002701 seconds

    non_threaded (8 iters)  0.000003 seconds
    threaded (8 threads)    0.007990 seconds
    processes (8 procs)     0.005512 seconds

As you can see, process forking via the pyprocessing module is faster
than building and then executing the threaded version of the code.

The second test calculates 50000 Fibonacci numbers inside of each
thread (isolated and shared nothing):

    cmd: python run_benchmarks.py fibonacci.py
    Importing fibonacci
    Starting tests ...
    non_threaded (1 iters)  0.195548 seconds
    threaded (1 threads)    0.197909 seconds
    processes (1 procs)     0.201175 seconds

    non_threaded (2 iters)  0.397540 seconds
    threaded (2 threads)    0.397637 seconds
    processes (2 procs)     0.204265 seconds

    non_threaded (4 iters)  0.795333 seconds
    threaded (4 threads)    0.797262 seconds
    processes (4 procs)     0.206990 seconds

    non_threaded (8 iters)  1.591680 seconds
    threaded (8 threads)    1.596824 seconds
    processes (8 procs)     0.417899 seconds

The third test calculates the sum of all primes below 100000, again
sharing nothing:

    cmd: python run_benchmarks.py crunch_primes.py
    Importing crunch_primes
    Starting tests ...
    non_threaded (1 iters)  0.495157 seconds
    threaded (1 threads)    0.522320 seconds
    processes (1 procs)     0.523757 seconds

    non_threaded (2 iters)  1.052048 seconds
    threaded (2 threads)    1.154726 seconds
    processes (2 procs)     0.524603 seconds

    non_threaded (4 iters)  2.104733 seconds
    threaded (4 threads)    2.455215 seconds
    processes (4 procs)     0.530688 seconds

    non_threaded (8 iters)  4.217455 seconds
    threaded (8 threads)    5.109192 seconds
    processes (8 procs)     1.077939 seconds

The reason tests two and three focus on pure numeric crunching is to
showcase how the current threading implementation hinders non-I/O
applications.  Obviously, these tests could be improved to use a
queue for coordination of results and chunks of work, but that is not
required to show the performance of the module.
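
For illustration, such queue-based coordination might look like the
following minimal sketch.  It assumes the module's Queue is
importable as processing.Queue and behaves like the threading-style
queue discussed under Open Issues below; sum_primes is a hypothetical
worker:

    from processing import Process, Queue

    def sum_primes(n, results):
        # Naive trial division; purely CPU-bound work.
        total = 0
        for candidate in xrange(2, n):
            if all(candidate % d for d in xrange(2, candidate)):
                total += candidate
        results.put(total)

    results = Queue()
    procs = [Process(target=sum_primes, args=(10000, results))
             for _ in range(4)]
    for p in procs:
        p.start()
    # Drain results before joining so a full queue buffer cannot
    # block the workers.
    answers = [results.get() for _ in procs]
    for p in procs:
        p.join()
    print answers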

The next test is an I/O-bound test.  This is normally where we see a
steep improvement in the threading module approach versus a
single-threaded approach.  In this case, each worker is opening a
descriptor to lorem.txt, randomly seeking within it, and writing
lines to /dev/null:

    cmd: python run_benchmarks.py file_io.py
    Importing file_io
    Starting tests ...
    non_threaded (1 iters)  0.057750 seconds
    threaded (1 threads)    0.089992 seconds
    processes (1 procs)     0.090817 seconds

    non_threaded (2 iters)  0.180256 seconds
    threaded (2 threads)    0.329961 seconds
    processes (2 procs)     0.096683 seconds

    non_threaded (4 iters)  0.370841 seconds
    threaded (4 threads)    1.103678 seconds
    processes (4 procs)     0.101535 seconds

    non_threaded (8 iters)  0.749571 seconds
    threaded (8 threads)    2.437204 seconds
    processes (8 procs)     0.203438 seconds

As you can see, pyprocessing is still faster on this I/O operation
than using multiple threads, and using multiple threads is slower
than the single-threaded execution itself.

Finally, we will run a socket-based test to show network I/O
performance.  This function grabs a URL from a server on the LAN (a
simple error page from Tomcat) and gets the page 100 times.  The
network is otherwise idle, and the connection is 10G:

    cmd: python run_benchmarks.py url_get.py
    Importing url_get
    Starting tests ...
    non_threaded (1 iters)  0.124774 seconds
    threaded (1 threads)    0.120478 seconds
    processes (1 procs)     0.121404 seconds

    non_threaded (2 iters)  0.239574 seconds
    threaded (2 threads)    0.146138 seconds
    processes (2 procs)     0.138366 seconds

    non_threaded (4 iters)  0.479159 seconds
    threaded (4 threads)    0.200985 seconds
    processes (4 procs)     0.188847 seconds

    non_threaded (8 iters)  0.960621 seconds
    threaded (8 threads)    0.659298 seconds
    processes (8 procs)     0.298625 seconds

Here we finally see threaded performance surpass that of
single-threaded execution; the pyprocessing module, however, is still
faster as the number of workers increases.  If you stay with one or
two threads/workers, the timing between threads and pyprocessing is
fairly close.

Additional benchmarks can be found in the pyprocessing module's
source distribution's examples/ directory.

Maintenance

Richard M. Oudkerk, the author of the pyprocessing module, has agreed
to maintain the module within Python SVN.  Jesse Noller has
volunteered to also help maintain, document, and test the module.

Timing/Schedule

Some concerns have been raised about the timing/lateness of this PEP
for the 2.6 and 3.0 releases this year; however, it is felt by both
the authors and others that the functionality this module offers
outweighs the risk of inclusion.

Taking into account the desire not to destabilize python-core, some
refactoring of pyprocessing's code "into" python-core can be withheld
until the next 2.x/3.x releases.  This means that the actual risk to
python-core is minimal, and largely constrained to the module itself.

Open Issues

* All existing tests for the module should be converted to unittest
  format.
* Existing documentation has to be moved to ReST formatting.
* Verify code coverage percentage of the existing test suite.
* Identify any requirements to achieve a 1.0 milestone if required.
* Verify that the current source tree conforms to standard library
  practices.
* Rename the top-level module from "pyprocessing" to
  "multiprocessing".
* Confirm no "default" remote connection capabilities; if needed,
  enable the remote security mechanisms by default for those classes
  which offer remote capabilities.
* Some of the Queue API (the qsize(), task_done() and join() methods)
  either needs to be added, or the reason for its exclusion needs to
  be identified and documented clearly; the threading-style protocol
  these methods support is sketched below.
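
For reference, the threading-style consumer protocol that qsize(),
task_done() and join() support looks like the following minimal
sketch (standard library Queue module, shown only to illustrate the
semantics in question):

    from Queue import Queue
    from threading import Thread

    q = Queue()

    def consumer():
        while True:
            item = q.get()
            # ... process item ...
            q.task_done()   # mark one queued item as finished

    t = Thread(target=consumer)
    t.setDaemon(True)       # do not block interpreter exit
    t.start()

    for i in range(10):
        q.put(i)
    q.join()                # block until every item is task_done()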

Closed Issues

* Reliance on ctypes: the pyprocessing module's reliance on ctypes
  prevents the module from functioning on platforms where ctypes is
  not supported.  This is not a restriction of this module, but
  rather of ctypes.

References

[1] The pyProcessing project home page
    http://pyprocessing.berlios.de/

[2] See Adam Olsen's "safe threading" project
    http://code.google.com/p/python-safethread/

[3] See: Addition of "pyprocessing" module to standard lib.
    http://mail.python.org/pipermail/python-dev/2008-May/079417.html

[4] MPI for Python
    http://mpi4py.scipy.org/

[5] See "Cluster Computing"
    http://wiki.python.org/moin/ParallelProcessing

[6] The original run_benchmarks.py code was published in Python
    Magazine in December 2007: "Python Threads and the Global
    Interpreter Lock" by Jesse Noller.  It has been modified for
    this PEP.

Copyright

This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End: