You can't usefully do this with threads in Python (at least not with CPython, the implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, CPU-bound threads get you only about 90% (less than a single core's worth, because of the contention overhead).
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler for simple cases and often easier to compose; multiprocessing is more flexible and more powerful. concurrent.futures only comes with Python 3.2 or later, but there's a backport for 2.5-3.1 on PyPI.
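For scale, here's a minimal sketch of the futures approach; cpu_heavy and the chunking below are placeholders standing in for your real work and data:

import concurrent.futures

def cpu_heavy(chunk):
    # Placeholder for your real CPU-bound work on one chunk of data.
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    chunks = [range(i * 100000, (i + 1) * 100000) for i in range(8)]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_heavy, chunks))
    print(sum(results))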
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, the sharing part is pretty straightforward:
import multiprocessing

def _init_worker(shared_array, lock):
    # Stash the inherited shared Array and Lock in worker-process globals.
    global _shared_array, _shared_lock
    _shared_array, _shared_lock = shared_array, lock

def _subtask(my_range):
    # Module-level so the Pool can pickle it; uses the inherited shared state.
    return some_expensive_task(_shared_array, _shared_lock, my_range)

class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        # Array needs a real sequence, so it knows the length up front.
        self.big_shared_object = multiprocessing.Array('i', list(giant_iterator_of_ints))
    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        pool = multiprocessing.Pool(5, initializer=_init_worker,
                                    initargs=(self.big_shared_object, lock))
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map(_subtask, my_ranges)
        pool.close()
        pool.join()
        return results
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
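For illustration only, here is one possible shape for some_expensive_task (the name comes from the sketch above; the body is invented), with the lock held around each read-modify-write transaction:

def some_expensive_task(shared_array, lock, my_range):
    # Illustrative: each loop iteration is one "transaction" on the shared data,
    # so the lock is held for the whole read-modify-write, then released.
    total = 0
    for i in my_range:
        with lock:
            value = shared_array[i]
            shared_array[i] = value + 1
        total += value
    return total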
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides usually being simple to describe, this approach, with the right kinds of algorithms and a bit of care, can even let you avoid a lot of locking.)
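split_into_chunks_appropriately is just a placeholder above; one plausible version, which carves the index range into contiguous, nearly equal sub-ranges (one per pool worker by default), might look like this:

def split_into_chunks_appropriately(n, num_chunks=5):
    # Carve range(n) into num_chunks contiguous, nearly equal sub-ranges.
    chunk_size, remainder = divmod(n, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        stop = start + chunk_size + (1 if i < remainder else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges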
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions calling apply_async on each one. But you can also just turn it around: write a single applier function, as a closure around the data, that takes a function and applies it to the data. Then just map that applier over the collection of functions.
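One caveat: anything you pass to Pool.map has to be picklable, so with multiprocessing the "applier" usually ends up as a module-level function that reaches the shared data through the same worker globals used above, rather than a literal closure. A rough, hypothetical sketch (mean_of_data and max_of_data are made up):

def _apply_to_shared(fn):
    # Worker-side applier: hand fn the inherited shared data and lock.
    return fn(_shared_array, _shared_lock)

def mean_of_data(shared_array, lock):
    with lock:
        return sum(shared_array) / float(len(shared_array))

def max_of_data(shared_array, lock):
    with lock:
        return max(shared_array)

# With the same pool/initializer setup as above:
# results = pool.map(_apply_to_shared, [mean_of_data, max_of_data])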
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures (or a Structure of Arrays), and then use the multiprocessing.sharedctypes machinery.
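For instance, if each element were a small record rather than a single int, you might describe it as a ctypes Structure and share an array of those; Point and its fields here are purely illustrative:

import ctypes
from multiprocessing import sharedctypes

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double), ('y', ctypes.c_double)]

# A shared array of 1000 Points; lock=True (the default) wraps it in its own lock.
shared_points = sharedctypes.Array(Point, 1000, lock=True)
with shared_points.get_lock():
    shared_points[0].x = 3.0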
I've also assumed that the results are small enough to just be returned from each subtask and collected into results. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks and making them more or less granular. If you can get away with passing immutable data structures around, or working on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.
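For example, if each subtask only reads its own slice, you can often skip the shared Array and the locks entirely and just hand each worker its own plain chunk of the data, at the cost of copying each chunk once:

import multiprocessing

def process_chunk(chunk):
    # Pure function over its own copy of the data: no shared state, no locks.
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    data = list(range(1000000))
    n = multiprocessing.cpu_count()
    chunks = [data[i * len(data) // n:(i + 1) * len(data) // n] for i in range(n)]
    pool = multiprocessing.Pool(n)
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    print(sum(results))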
And, while you're at it, ask yourself: is there a tool (numpy, pypy, cython, …) that can give a much better performance gain for the same effort?