import multiprocessing
import pickle

def content_generator(applications, dict):
    for app in applications:
        yield app, dict[app]

with open('abc.pickle', 'rb') as f:
    very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())

pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application,
                                  content_generator(all_applications, very_large_dict)):
    pass  # do some aggregation on result

I have a really large dictionary whose keys are strings (application names) and whose values are information about each application. Since the applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is small, but all the Python processes get killed when the dictionary is too big. I used dmesg to check what went wrong and found they were killed because the machine ran out of memory. I ran top while the pool processes were running and found that each of them occupies the same amount of resident memory (RES), namely 3.4G. This confuses me, since it looks as though the whole dictionary is copied into every spawned process. I thought I had broken the dictionary up by yielding only dict[app] instead of dict, so each worker receives only what is relevant to it. Any thoughts on what I did wrong?

  • A few things you should consider: passing large amounts of data for minimal processing is not a good way to use multiprocessing. Small inputs that require long-running tasks are a much better fit. Your example doesn't sound like a good use case, unless the data is already serialized on disk so you can pass just the path and unpickle it inside the worker. Next, with large data, memory becomes a concern: use a semaphore to limit the number of active tasks so you don't blow through your memory (see the sketch after these comments). Commented Jul 14, 2016 at 20:53
  • Next: You need to consider pipe limits: A pipe is only guaranteed to have atomicity for a certain size, and once you pass that limit, you basically guarantee data corruption (like with large dictionaries): unix.stackexchange.com/questions/11946/… Commented Jul 14, 2016 at 20:54
  • On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space. Commented Jul 14, 2016 at 20:57
  • Also, finally, yes: multiprocessing cannot use shared memory unless you can pass pointers to underlying C data types that sit outside the GIL (such as numpy arrays), and even then you must not modify the contents of the buffer. This means copy operations are inevitable when using multiprocessing with dictionaries. Commented Jul 14, 2016 at 20:57
  • 1
    @AlexanderHuszagh, he's not passing them. He's passing content_generator(all_applications, very_large_dict), which is a generator yielding one (key, value) pair at a time. The generator runs in the main process. Commented Jul 14, 2016 at 21:06
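To make the semaphore suggestion in the first comment concrete, here is a minimal sketch (not from the original thread) of capping the number of outstanding tasks; process_one and produce_items are hypothetical stand-ins for your worker function and data source:

import multiprocessing

MAX_IN_FLIGHT = 8  # at most this many tasks queued/running at once

def process_one(item):
    # hypothetical long-running work on one (key, value) pair
    key, value = item
    return key, len(str(value))

def produce_items():
    # hypothetical stand-in: yield (key, value) pairs one at a time
    for i in range(1000):
        yield str(i), {'payload': i}

if __name__ == '__main__':
    sem = multiprocessing.Semaphore(MAX_IN_FLIGHT)
    pool = multiprocessing.Pool()

    def on_done(result):
        sem.release()  # a slot frees up when a task finishes

    pending = []
    for item in produce_items():
        sem.acquire()  # blocks while MAX_IN_FLIGHT tasks are outstanding
        pending.append(pool.apply_async(process_one, (item,), callback=on_done))

    pool.close()
    pool.join()
    results = [p.get() for p in pending]

The semaphore is acquired in the main process before each submission and released by the result callback, so at most MAX_IN_FLIGHT items are ever pickled and queued at once.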

1 Answer


The comments are becoming impossible to follow, so I'm pasting in my important comment here:

On a Linux-y system, new processes are created by fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating your Pool before creating giant data structures. Then the child processes will inherit a much smaller address space.

Then some answers to questions:

so in python 2.7, there is no way to spawn a new process?

On Linux-y systems, no. The ability to use "spawn" on those was first added in Python 3.4. On Windows systems, "spawn" has always been the only choice (no fork() on Windows).
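For readers on Python 3.4 or later (so not the asker's 2.7 environment), a minimal sketch of explicitly requesting the "spawn" start method looks like this:

import multiprocessing as mp

def worker(x):
    return x * x

if __name__ == '__main__':
    # Available since Python 3.4: freshly spawned workers instead of fork()ed ones,
    # so children do not inherit the parent's (possibly huge) address space.
    ctx = mp.get_context('spawn')
    with ctx.Pool() as pool:
        print(pool.map(worker, range(10)))

With spawn, workers start from a fresh interpreter and inherit nothing from the parent's memory, so the giant dictionary would not be duplicated, but every task argument then has to be pickled and shipped to them explicitly.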

The big dictionary is passed into a function as an argument, and I can only create the pool inside this function. How would I be able to create the pool before the big dictionary?

As simple as this: make these two lines the first two lines in your program:

import multiprocessing
pool = multiprocessing.Pool()

You can create the pool any time you like (just so long as it exists sometime before you actually use it), and worker processes will inherit the entire address space at the time the Pool constructor is invoked.
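Concretely, a minimal sketch of how the question's program could be reorganized along these lines (the body of func_process_application is a hypothetical placeholder, and abc.pickle is the file from the question):

import multiprocessing
import pickle

def func_process_application(item):
    # hypothetical placeholder for the real per-application work
    app, info = item
    return app, len(str(info))

def content_generator(app_dict):
    for app in app_dict:
        yield app, app_dict[app]

if __name__ == '__main__':
    # The essential point: the pool (and its forked workers) is created
    # before the giant dictionary exists in the parent process.
    pool = multiprocessing.Pool()

    with open('abc.pickle', 'rb') as f:
        very_large_dict = pickle.load(f)

    results = []
    for result in pool.imap_unordered(func_process_application,
                                      content_generator(very_large_dict)):
        results.append(result)  # aggregate however you need

    pool.close()
    pool.join()

The workers are forked while the parent still holds only the code, so their inherited address space stays small even after the parent loads the dictionary.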

ANOTHER SUGGESTION

If you're not mutating the dict after it's created, try using this instead:

def content_generator(dict):
    for app in dict:
        yield app, dict[app]

That way you don't have to materialize a giant set of the keys either. Or, even better (if possible), skip all that and iterate directly over the items:

for result in pool.imap_unordered(func_process_application, very_large_dict.iteritems()):
    pass  # do your aggregation on result here

