I've created a script that uses the `concurrent.futures` library for multithreading, to make it execute faster. If the first function within the script, `get_content_url()`, produced multiple links, the current implementation would work. However, as that function produces only a single link, I don't understand how to use `concurrent.futures` in such a case.

To explain what the first function is doing: when I supply ids from a csv file to `get_content_url()`, it generates a single link using the token collected from the JSON response.

How can I apply `concurrent.futures` within the script in the right way to make the execution faster?
I've tried with:
```python
import csv
import requests
import concurrent.futures
from bs4 import BeautifulSoup

base_link = "https://www.some_website.com/{}"
target_link = "https://www.some_website.com/{}"

def get_content_url(item_id):
    # Collect a token from the JSON response and build the content url from it
    r = requests.get(base_link.format(item_id['id']))
    token = r.json()['token']
    content_url = target_link.format(token)
    yield content_url

def get_content(target_link):
    r = requests.get(target_link)
    soup = BeautifulSoup(r.text, "html.parser")
    try:
        title = soup.select_one("h1#maintitle").get_text(strip=True)
    except Exception:
        title = ""
    print(title)

if __name__ == '__main__':
    with open("IDS.csv", "r") as f:
        reader = csv.DictReader(f)
        with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
            for _id in reader:
                future_to_url = {executor.submit(get_content, item): item for item in get_content_url(_id)}
                concurrent.futures.as_completed(future_to_url)
```
This might be a bit hard to reproduce, since I don't know what's inside `IDS.csv`, and a valid url case is missing in your question, but here's something to play with. I'm mocking the .csv file with `write_fake_ids()`; you can ignore it or remove it, it doesn't get called anywhere in the code.
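Here's a minimal sketch of one way to restructure the script, assuming the same placeholder urls from your question and a CSV with a single `id` column. The idea is that since each id yields exactly one url, the whole per-id pipeline (token fetch plus scrape) becomes one task, so the executor gets one future per id rather than per link. The `process_id()` helper is a name introduced here purely for illustration, and `write_fake_ids()` is the uncalled mock mentioned above:

```python
import csv
import requests
import concurrent.futures
from bs4 import BeautifulSoup

base_link = "https://www.some_website.com/{}"    # placeholder, as in the question
target_link = "https://www.some_website.com/{}"  # placeholder, as in the question

def write_fake_ids(path="IDS.csv", n=20):
    # Mocks the .csv file for testing; never called anywhere in the script.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id"])
        writer.writeheader()
        for i in range(n):
            writer.writerow({"id": i})

def get_content_url(item_id):
    # One id -> one url, so return the url instead of yielding a one-item generator.
    r = requests.get(base_link.format(item_id))
    token = r.json()["token"]
    return target_link.format(token)

def get_content(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    tag = soup.select_one("h1#maintitle")
    return tag.get_text(strip=True) if tag else ""

def process_id(item_id):
    # Hypothetical helper: the whole per-id pipeline runs inside one worker
    # thread, so both network calls for an id happen in parallel across ids.
    return get_content(get_content_url(item_id))

if __name__ == "__main__":
    with open("IDS.csv") as f:
        ids = [row["id"] for row in csv.DictReader(f)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
        # Submit one future per id up front, then collect results as they finish.
        future_to_id = {executor.submit(process_id, i): i for i in ids}
        for future in concurrent.futures.as_completed(future_to_id):
            try:
                print(future_to_id[future], future.result())
            except Exception as e:
                print(future_to_id[future], "failed:", e)
```

The important changes from your attempt are that `get_content_url()` returns its single url instead of yielding it, all futures are submitted before any waiting happens, and `concurrent.futures.as_completed()` is iterated rather than just called, which is what actually collects the results and surfaces any exceptions raised in the workers.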