I am running a web scraping job in which multiple scrapers run concurrently. Sometimes a scraper hangs non-deterministically, and there is nothing I can do about that. After a while the entire script gets stuck. I am running more than 1000 scrapers in total with max_workers=20, so my guess is that eventually all 20 workers end up stuck. What I want is to set a timeout on individual threads, so that if a thread runs for more than 120 seconds it gets killed or cancelled, and the timeout is logged.
I found the pebble library, but interestingly it supports timeouts only for its ProcessPool, not its ThreadPool, and my machine would crash if I used a ProcessPool. Is there a way to implement a timeout on individual threads in Python?
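For reference, this is roughly how I understand the pebble ProcessPool version would look (`scrape` and `urls` are placeholders standing in for my real scraper), but as I said, a ProcessPool is not an option for me:

```python
from concurrent.futures import TimeoutError
from pebble import ProcessPool

def scrape(url):
    ...  # placeholder for my real scraper, which sometimes hangs

with ProcessPool(max_workers=20) as pool:
    # schedule() takes a per-task timeout; pebble terminates the
    # worker process if the task runs longer than 120 seconds
    futures = [pool.schedule(scrape, args=[url], timeout=120) for url in urls]
    for future in futures:
        try:
            print(future.result())
        except TimeoutError:
            print('scraper timed out')
```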
Here is what I tried:
```python
import concurrent.futures

def func(t):
    # Stand-in for a scraper: spins forever when t is truthy,
    # returns immediately otherwise.
    while t:
        c = 1
    return 'yo'

t = [0, 0, 0, 1, 1, 1, 1, 1, 0]
print(t)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = []
    for _ in t:
        future = pool.submit(func, _)
        futures.append(future)
        future.args = str(_)  # stash the argument on the future for logging

    for future in futures:
        try:
            result = future.result(timeout=3)
            print(result + future.args)
        except Exception as e:
            print(e)
            print('timeout' + future.args)
```
It doesn't even print the exception (the TimeoutError carries no message, so print(e) shows nothing useful). The script just hangs after printing this:
```
[0, 0, 0, 1, 1, 1, 1, 1, 0]
yo0
yo0
yo0
timeout1
timeout1
timeout1
timeout1
timeout1
timeout0
```
I also tried adding future.cancel() in the except block (see the variant below), but got the same result. What do I do?
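For completeness, this is roughly what the result loop looked like with the cancel attempt:

```python
for future in futures:
    try:
        result = future.result(timeout=3)
        print(result + future.args)
    except Exception as e:
        print(e)
        print('timeout' + future.args)
        future.cancel()  # tried this, but the script still hangs at the end
```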