I am trying to send information extracted from lines of a big file to a process running on some server.
To speed this up, I would like to do this with several threads in parallel.
Using the Python 2.7 backport of concurrent.futures, I tried this:
    from concurrent.futures import ThreadPoolExecutor

    f = open("big_file")
    with ThreadPoolExecutor(max_workers=4) as e:
        for line in f:
            e.submit(send_line_function, line)
    f.close()
However, this is problematic: all futures get submitted instantly, so the complete file is read into memory at once and my machine runs out of memory.
My question is: is there an easy way to submit a new future only when a free worker is available?
You could iterate over chunks of the file, waiting for each chunk's futures to finish before reading the next chunk.
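A minimal sketch of that approach, assuming the send_line_function from the question and an illustrative chunksize, could look like this:

    import itertools
    from concurrent.futures import ThreadPoolExecutor, as_completed

    chunksize = 1000  # illustrative tuning knob: lines held in memory at once

    with open("big_file") as f:
        with ThreadPoolExecutor(max_workers=4) as executor:
            # Grouper recipe: pulling from the same file iterator chunksize
            # times yields chunksize consecutive lines; izip_longest is lazy
            # and pads the final short chunk with None.
            # (On Python 3, use itertools.zip_longest, or plain zip if
            # dropping a partial final chunk is acceptable.)
            for chunk in itertools.izip_longest(*[f] * chunksize):
                futures = [executor.submit(send_line_function, line)
                           for line in chunk if line is not None]
                # Block until the whole chunk is done before reading on.
                for future in as_completed(futures):
                    future.result()  # surface any worker exception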
(This is an application of the grouper recipe, which collects items from the iterator f into groups of size chunksize. Note: this does not consume the entire file at once, since zip returns an iterator in Python 3; on Python 2.7 the lazy equivalents are itertools.izip and itertools.izip_longest.)

Now, in the comments you rightly point out that this is not optimal: some worker could take a long time and hold up a whole chunk of jobs.
Usually, if each call to the worker takes roughly the same amount of time, then this is not a big deal. However, here is a way to advance the filehandle on demand: it uses a threading.Condition to notify the sprinkler to advance the filehandle.
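A minimal sketch of that idea, again assuming send_line_function and an illustrative max_pending bound on in-flight lines:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    max_pending = 4            # assumed bound on submitted-but-unfinished lines
    cond = threading.Condition()
    pending = [0]              # mutable cell: Python 2.7 has no nonlocal

    def worker(line):
        try:
            send_line_function(line)
        finally:
            with cond:
                pending[0] -= 1
                cond.notify()  # a slot is free: wake the sprinkler

    def sprinkler(f, executor):
        # Advance the filehandle one line at a time, but only when a
        # worker slot is free, so at most max_pending lines are in memory.
        for line in f:
            with cond:
                while pending[0] >= max_pending:
                    cond.wait()
                pending[0] += 1
            executor.submit(worker, line)

    with open("big_file") as f:
        with ThreadPoolExecutor(max_workers=4) as executor:
            sprinkler(f, executor)

Each worker notifies the Condition when it finishes, and the sprinkler reads the next line only once the number of pending jobs drops below the bound, so a slow worker delays at most one line rather than a whole chunk. (A threading.BoundedSemaphore would also work here and is slightly simpler.)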