def content_generator(applications, dict):
for app in applications:
yield(app, dict[app])
with open('abc.pickle', 'r') as f:
very_large_dict = pickle.load(f)
all_applications = set(very_large_dict.keys())
pool = multiprocessing.Pool()
for result in pool.imap_unordered(func_process_application, content_generator(all_applications, very_large_dict)):
do some aggregation on result
I have a really large dictionary whose keys are strings (application names), values are information concerning the application. Since applications are independent, I want to use multiprocessing to process them in parallel. Parallelization works when the dictionary is not that big but all the python processes were killed when the dictionary is too big. I used dmesg to check what went wrong and found they were killed since the machine ran out of memory. I did top when the pool processes are running and found that they all occupy the same amount of resident memory(RES), which is all 3.4G. This confuses me since it seems to have copied the whole dictionaries into the spawned processes. I thought I broke up the dictionary and passing only what is relevant to the spawned process by yielding only dict[app] instead of dict. Any thoughts on what I did wrong?
fork(), so get a copy of the entire parent-process address space at the time they're created. It's "copy on write", so is more of a "virtual" copy than a "real" copy, but still ... ;-) For a start, try creating yourPoolbefore creating giant data structures. Then the child processes will inherit a much smaller address space.content_generator(all_applications, very_large_dict), which is a generator yielding one(key, value)pair at a time. The generator runs in the main process.