I am attempting to speed up some slow tasks by running them in parallel with Python's multiprocessing module.
However, I have a problem. Each running process must take a piece of data as an argument, and for each piece of data, only one instance may be in use at any one time.
To explain this more clearly:
- Each process connects to an API and requires the use of an API key
- I have a fixed number of API keys
- Two processes must not use the same API key simultaneously
I am stuck and can't figure out a way around this problem. (Other than the "dumb" solution, which I will explain later.)
The issue with multiprocessing is that you define a fixed number of workers which execute as part of a pool. Each worker runs a function which expects to receive some arguments. The arguments are prepared as a list, and each worker is dispatched with one entry from that list.
Imagine a list like this:
[a, b, c, d, e, f, g, h, i, ...]
and a pool of 3 workers.
When the pool first launches, the values a, b, c will be passed to three processes. There is no guarantee about how long each one will take.
It is therefore possible that process 1 finishes first and consumes data d, and that it finishes again before either process 2 or 3 has finished processing data b or c.
If that happens, process 1 will consume data e.
It should now be obvious why putting the API keys into the same list as the rest of the data will not work.
In the above example, processes 1 and 2 will be processing data e and b respectively. If the API keys had been part of the list feeding the processes with data, then elements b and e would (presumably) contain the same API key.
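For reference, the way I am calling the pool is essentially the following (the worker function and data are simplified placeholders):

```python
import multiprocessing

def work(item):
    # Placeholder for the real job: connect to the API and process one item.
    return item.upper()

if __name__ == "__main__":
    data = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]

    # 3 workers pull entries from `data` as they become free.
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(work, data)

    print(results)
```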
Is there a way to explicitly "pin" some data (like an API key) to each process in the pool used by pool.map(), and thus solve this problem?
When creating your multiprocessing.Pool, pass it an initializer function and initargs that include a multiprocessing-safe Queue, and have each worker take a single API key from that queue. Something like this:
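(A minimal sketch of that idea; the key strings and the body of process_item are placeholders for your real keys and API work.)

```python
import multiprocessing

# Each worker process stores the API key it claimed in this module-level variable.
worker_api_key = None

def init_worker(key_queue):
    """Runs once in each worker process: claim exactly one API key."""
    global worker_api_key
    worker_api_key = key_queue.get()

def process_item(item):
    # Placeholder for the real API call, using the key pinned to this worker.
    return f"{item} handled with {worker_api_key}"

if __name__ == "__main__":
    api_keys = ["key-1", "key-2", "key-3"]   # placeholder key values

    key_queue = multiprocessing.Queue()
    for key in api_keys:
        key_queue.put(key)

    # One worker per key: each worker takes one key in its initializer,
    # so no two workers ever hold the same key at the same time.
    with multiprocessing.Pool(processes=len(api_keys),
                              initializer=init_worker,
                              initargs=(key_queue,)) as pool:
        results = pool.map(process_item, ["a", "b", "c", "d", "e", "f", "g"])

    print(results)
```

Because the key is claimed in the initializer rather than passed along with each task, it does not matter which worker picks up which item from the input list; a worker always reuses the one key it was handed at start-up.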