I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example.
The summary of the problem is that I have a class that inherits from dict to facilitate parsing of misc. input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser objects through a multiprocessing apply_async call.
Shown below is the minimum working example. Obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some attribute like self.section_delimiter; this is where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) using apply_async, it crashes.
import multiprocessing as mp
import numpy as np

class Parser(dict):
    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)

def some_function(parser):
    pass

if __name__ == "__main__":
    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1
    print("Single serial call works fine")
    some_function(parser)
    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))
    pool.close()
    pool.join()
If you run the code above, you'll get the following output:
Initialize creation/setting
	__init__
	__setitem__
Single serial call works fine
Parallel async call breaks on line 16?
	__setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "test_apply_async.py", line 13, in __setitem__
    self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'
Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but also fill the gap in my understanding of how apply_async and inherited/overridden methods interact.
Let me know if you need any more information.
Thank you very much!
Isaac
Cause
The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default pickle does not call __init__() when deserializing objects. Because of this, self.section_delimiter is not set when the deserializer calls __setitem__() to restore the items in your dictionary, and you get the AttributeError shown in your traceback.
Using just pickle and no multiprocessing gives the same error.
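As a minimal sketch of that pickle-only reproduction (the helper name round_trip and the variables error and empty are made up for illustration; the Parser class is a trimmed copy of the one in the question, without the print calls), the following round-trips one Parser that has an item and one that is empty:

```python
import pickle


class Parser(dict):
    def __init__(self):
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Raises AttributeError if __init__() never ran on this object.
        self.section_delimiter
        dict.__setitem__(self, key, value)


def round_trip(obj):
    # Hypothetical helper: serialize and immediately deserialize with pickle.
    return pickle.loads(pickle.dumps(obj))


parser = Parser()
parser["x"] = 1

# Unpickling creates the object without calling __init__(), then restores
# the dictionary items through __setitem__() before the attributes exist.
try:
    round_trip(parser)
    error = None
except AttributeError as exc:
    error = exc

# With no items, __setitem__() is never called during unpickling.
empty = round_trip(Parser())
```

Here error should end up holding the same AttributeError as in the traceback, while the empty Parser survives the round trip.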
Deserialization will work for an object with no items, and the value of section_delimiter will be restored. So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
Workaround
You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__(). __getnewargs__() returns a tuple of arguments. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__().
Simpler solution
The reason pickle calls __setitem__() on your Parser object is that it is a dictionary. If your Parser is just a class that happens to implement __setitem__() and __getitem__() and has a dictionary to implement those calls, then pickle will not call __setitem__() and serialization will work with no extra code.
So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
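Both approaches can be sketched side by side. This is only an illustration under some assumptions: the class names NewParser and WrappedParser, the _data attribute, the round_trip helper, and the "/" delimiter are invented for the example, and your real __setitem__ logic would replace the placeholder attribute access.

```python
import pickle


class NewParser(dict):
    """Workaround: set the attribute in __new__() so it exists before items are restored."""

    def __new__(cls, section_delimiter="."):
        obj = super().__new__(cls)
        obj.section_delimiter = section_delimiter
        return obj

    def __init__(self, section_delimiter="."):
        # Accept the same argument as __new__(), but do not pass it on to dict.
        super().__init__()

    def __getnewargs__(self):
        # pickle passes this tuple to __new__() when it recreates the object.
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        self.section_delimiter  # already set by __new__()
        dict.__setitem__(self, key, value)


class WrappedParser:
    """Simpler solution: hold a dict instead of inheriting from dict."""

    def __init__(self):
        self.section_delimiter = "."
        self._data = {}

    def __setitem__(self, key, value):
        self.section_delimiter
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]


def round_trip(obj):
    # Hypothetical helper: serialize and immediately deserialize with pickle.
    return pickle.loads(pickle.dumps(obj))


subclassed = NewParser("/")
subclassed["x"] = 1
restored_subclassed = round_trip(subclassed)

wrapped = WrappedParser()
wrapped["x"] = 1
restored_wrapped = round_trip(wrapped)
```

In the first class the delimiter travels through __getnewargs__(), so a non-default value is available when pickle restores the items. In the second, pickle only restores the instance attributes, so __setitem__() is never involved.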