I am trying to parallelize a function that gets data from an API, but depending on the arguments I pass, I receive the information in blocks: in some cases I receive 5 blocks, in others 100.
If a block returns None, it means there is no more data. So if block 40 returns None but block 39 returns data, the first 39 blocks contain data and all later blocks are empty. Is it possible, once the code detects that a block is None, to avoid creating threads for the blocks after it, and extract all the blocks that do have data as efficiently as possible?
This is my current code:
import concurrent.futures

blocksData = []
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(func, i): i for i in range(maxBlocks)}
    for future in concurrent.futures.as_completed(futures):
        result = future.result()
        if result is not None:
            blocksData += result
return blocksData

def func(step):
    # Logic: fetch block number `step` from the API
    if len(block) == 0:
        return None
    else:
        return block
Your code creates maxBlocks threads at once, which wastes a lot of work. It is better to create threads in batches, with a pause in between. The batch size determines the balance between wasted threads and performance: the smaller the batch, the less waste, but performance might suffer because the system's resources are not fully utilized.
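To illustrate the batching idea, here is a minimal sketch. fetch_block, BATCH_SIZE, and the simulated block count of 25 are illustrative assumptions, not part of the original code:

```python
import concurrent.futures

BATCH_SIZE = 10  # assumed batch size; tune for your workload


def fetch_block(i):
    # Stand-in for the real API call: pretend only 25 blocks have data.
    return [i] if i < 25 else None


def fetch_all(max_blocks):
    blocks_data = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
        for start in range(0, max_blocks, BATCH_SIZE):
            batch = range(start, min(start + BATCH_SIZE, max_blocks))
            # map() yields results in submission order, so the first None
            # reliably marks the end of the data.
            for result in executor.map(fetch_block, batch):
                if result is None:  # end of data: later batches are never submitted
                    return blocks_data
                blocks_data += result
    return blocks_data
```

With this structure, only the batch that contains the first None block wastes any threads; every batch after it is never submitted.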
Here is some sample code:
Sample output:
Notes
- SIMULATED_BLOCKS_COUNT simulates the actual blocks count the API returns.
- func now takes an additional Event object called the_end. If the API returns None, then func sets this event.
- In main, we monitor the_end: if it is set, then we know we are done.
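A sketch of the approach the notes describe, with assumed names where the notes do not specify them (MAX_BLOCKS, BATCH_SIZE, and the per-block return value are illustrative):

```python
import concurrent.futures
import threading

SIMULATED_BLOCKS_COUNT = 25  # simulates the actual blocks count the API returns
MAX_BLOCKS = 100             # assumed upper bound on blocks to request
BATCH_SIZE = 10              # assumed batch size


def func(step, the_end):
    # Simulated API call: blocks past the end return None.
    if step >= SIMULATED_BLOCKS_COUNT:
        the_end.set()        # no more data -> signal the main loop
        return None
    return [step]            # a real call would return the block's data


def main():
    the_end = threading.Event()
    blocks_data = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
        for start in range(0, MAX_BLOCKS, BATCH_SIZE):
            if the_end.is_set():  # done: skip creating threads for later blocks
                break
            futures = [executor.submit(func, i, the_end)
                       for i in range(start, start + BATCH_SIZE)]
            # Iterate in submission order so the collected blocks stay ordered.
            for future in futures:
                result = future.result()
                if result is not None:
                    blocks_data += result
    return blocks_data
```

Because every future in a batch is resolved before the next batch starts, the_end is guaranteed to be set before the is_set() check that follows the batch containing the last block.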