You can't usefully do this with threads in Python (at least not with CPython, the implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, CPU-bound threads get you only about 90% (less than a single core's worth, because of the contention overhead).
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler for simple cases and often easier to compose; multiprocessing is more flexible and more powerful. concurrent.futures only comes with Python 3.2 or later, but there's a backport for 2.5-3.1 on PyPI.
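For scale, here's a minimal sketch of the futures approach; cpu_heavy and the chunking below are placeholders standing in for your real work and data:

import concurrent.futures

def cpu_heavy(chunk):
    # Placeholder for your real CPU-bound work on one chunk of data.
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    chunks = [range(i * 100000, (i + 1) * 100000) for i in range(8)]
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(cpu_heavy, chunks))
    print(sum(results))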
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, the sharing part is pretty straightforward:
import multiprocessing

def _init_worker(shared_array, lock):
    # Stash the inherited shared Array and Lock in worker-process globals.
    global _shared_array, _shared_lock
    _shared_array, _shared_lock = shared_array, lock

def _subtask(my_range):
    # Module-level so the Pool can pickle it; uses the inherited shared state.
    return some_expensive_task(_shared_array, _shared_lock, my_range)

class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        # Array needs a real sequence, so it knows the length up front.
        self.big_shared_object = multiprocessing.Array('i', list(giant_iterator_of_ints))
    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        pool = multiprocessing.Pool(5, initializer=_init_worker,
                                    initargs=(self.big_shared_object, lock))
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map(_subtask, my_ranges)
        pool.close()
        pool.join()
        return results
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
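For illustration only, here is one possible shape for some_expensive_task (the name comes from the sketch above; the body is invented), with the lock held around each read-modify-write transaction:

def some_expensive_task(shared_array, lock, my_range):
    # Illustrative: each loop iteration is one "transaction" on the shared data,
    # so the lock is held for the whole read-modify-write, then released.
    total = 0
    for i in my_range:
        with lock:
            value = shared_array[i]
            shared_array[i] = value + 1
        total += value
    return total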
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides usually being simple to describe, this approach, with the right kinds of algorithms and a bit of care, can even let you avoid a lot of locking.)
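split_into_chunks_appropriately is just a placeholder above; one plausible version, which carves the index range into contiguous, nearly equal sub-ranges (one per pool worker by default), might look like this:

def split_into_chunks_appropriately(n, num_chunks=5):
    # Carve range(n) into num_chunks contiguous, nearly equal sub-ranges.
    chunk_size, remainder = divmod(n, num_chunks)
    ranges, start = [], 0
    for i in range(num_chunks):
        stop = start + chunk_size + (1 if i < remainder else 0)
        ranges.append(range(start, stop))
        start = stop
    return ranges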
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions calling apply_async on each one. But you can also just turn it around: write a single applier function, as a closure around the data, that takes a function and applies it to the data. Then just map that applier over the collection of functions.
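One caveat: anything you pass to Pool.map has to be picklable, so with multiprocessing the "applier" usually ends up as a module-level function that reaches the shared data through the same worker globals used above, rather than a literal closure. A rough, hypothetical sketch (mean_of_data and max_of_data are made up):

def _apply_to_shared(fn):
    # Worker-side applier: hand fn the inherited shared data and lock.
    return fn(_shared_array, _shared_lock)

def mean_of_data(shared_array, lock):
    with lock:
        return sum(shared_array) / float(len(shared_array))

def max_of_data(shared_array, lock):
    with lock:
        return max(shared_array)

# With the same pool/initializer setup as above:
# results = pool.map(_apply_to_shared, [mean_of_data, max_of_data])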
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures (or a Structure of Arrays), and then use the multiprocessing.sharedctypes machinery.
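For instance, if each element were a small record rather than a single int, you might describe it as a ctypes Structure and share an array of those; Point and its fields here are purely illustrative:

import ctypes
from multiprocessing import sharedctypes

class Point(ctypes.Structure):
    _fields_ = [('x', ctypes.c_double), ('y', ctypes.c_double)]

# A shared array of 1000 Points; lock=True (the default) wraps it in its own lock.
shared_points = sharedctypes.Array(Point, 1000, lock=True)
with shared_points.get_lock():
    shared_points[0].x = 3.0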
I've also assumed that the results are small enough to just be returned from each subtask and collected into results. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks and making them more or less granular. If you can get away with passing immutable data structures around, or working on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.
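For example, if each subtask only reads its own slice, you can often skip the shared Array and the locks entirely and just hand each worker its own plain chunk of the data, at the cost of copying each chunk once:

import multiprocessing

def process_chunk(chunk):
    # Pure function over its own copy of the data: no shared state, no locks.
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    data = list(range(1000000))
    n = multiprocessing.cpu_count()
    chunks = [data[i * len(data) // n:(i + 1) * len(data) // n] for i in range(n)]
    pool = multiprocessing.Pool(n)
    results = pool.map(process_chunk, chunks)
    pool.close()
    pool.join()
    print(sum(results))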
And, while you're at it, ask yourself: is there a tool (numpy, pypy, cython, …) that can give a much better performance gain for the same effort?