I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then by newspapers and outputs the collected data into a CSV. Currently, this runs to produce 3 search terms x 3 newspapers over 3 years giving me 9 CSVs in about 10 minutes per CSV.
I would like to use multiprocessing to run each search and newspaper combination simultaneously or at least faster. I've tried to follow other examples on here, but have not been able to successfully implement them. Below is my code so far:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool
def websitesearch(search):
try:
start = list_of_inputs[0]
end = list_of_inputs[1]
newsabbv=list_of_inputs[2]
directory=list_of_inputs[3]
os.chdir(directory)
if search == broad:
specification = "broad"
relPapers = newsabbv
elif search == narrow:
specification = "narrow"
relPapers = newsabbv
elif search == general:
specification = "allarticles"
relPapers = newsabbv
else:
for newspapers in relPapers:
...rest of code here that gets the data and puts it in a list named all_Data...
browser.close()
df = pd.DataFrame(all_Data)
df.to_csv(filename, index=False)
except:
print('error with item')
if __name__ == '__main__':
...Initializing values and things like that go here. This helps with the setup for search...
#These are things that go into the function
start = ["January",2015]
end = ["August",2017]
directory = "STUFF GOES HERE"
newsabbv = all_news_abbv
search_list = [narrow, broad, general]
list_of_inputs = [start,end,newsabbv,directory]
pool = Pool(processes=4)
for search in search_list:
pool.map(websitesearch, search_list)
print(list_of_inputs)
If I add in a print statement in the main() function, it will print, but nothing really ends up happening. I'd appreciate any and all help. I left out the code that gets the values and puts it into a list since its convoluted but I know it works.
Thanks in advance for any and all help! Let me know if there is more information I can provide.
Isaac
EDIT: I have looked into more help online and realize that I misunderstood the purpose of mapping a list to the function using pool.map(fn, list). I have updated my code to reflect my current approach that is still not working. I also moved the initializing values into the main function.
i don't think it can be multiprocessing with your way. Because it's still have queue process there (not queue module) caused by selenium.
The reason is...selenium only can handle one window, cannot handle several window or tab browser at the same time (limitation of the window_handle features). that's means....your multi process only processing data process in memory that send to selenium or crawled by selenium. by try process the crawl of selenium in one script file, will make the selenium as the bottle neck process's source.
the best way to make real multiprocess is:
e.g:
i can give more explanation because this is the main core of the process, and what you want to do on that web, only you that understood.
a. devide the url, multi process means make some process and run it together with all cpu cores, the best way to make it... it's start by devide the input process, in your case maybe the url target (you don't give us, the website target that you want to crawl). but every pages of the website have the different url. just collect all url and devide it to several groups (best practice: your cpu cores - 1)
e.g:
b. send the url to processing with the crawl.py that already you made before (by subprocess, or other module e,g: os.system). make sure you run the crawl.py max == the cpucore.
e.g:
c. get the printed result to the memory of main script
d. continous your script process to process the data that you already got.
with this method:
you can open many browser windows and handle it with the same time, and because of the data processing that crawling from website is slower than data processing in memory, this method at least reducing the bottle neck on data flow. means it's more faster than your method before.
hopelly helpfull...cheers