Selenium Multiprocessing Help in Python 3.4

1.3k Views Asked by At

I am in over my head trying to use Selenium to get the number of results for specific searches on a website. Basically, I'd like to make the process run faster. I have code that works by iterating over search terms and then by newspapers and outputs the collected data into a CSV. Currently, this runs to produce 3 search terms x 3 newspapers over 3 years giving me 9 CSVs in about 10 minutes per CSV.

I would like to use multiprocessing to run each search and newspaper combination simultaneously or at least faster. I've tried to follow other examples on here, but have not been able to successfully implement them. Below is my code so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import pandas as pd
from multiprocessing import Pool

def websitesearch(search):
    try:
        start = list_of_inputs[0]
        end = list_of_inputs[1]
        newsabbv=list_of_inputs[2]
        directory=list_of_inputs[3]
        os.chdir(directory)

        if search == broad:
            specification = "broad"
            relPapers = newsabbv

        elif search == narrow:
            specification = "narrow"
            relPapers = newsabbv

        elif search == general:
            specification = "allarticles"
            relPapers = newsabbv

        else:
            for newspapers in relPapers:

               ...rest of code here that gets the data and puts it in a list named all_Data...

                browser.close()
                df = pd.DataFrame(all_Data)
                df.to_csv(filename, index=False)          

    except:
        print('error with item')



if __name__ == '__main__':
 ...Initializing values and things like that go here. This helps with the setup for search...

    #These are things that go into the function        
    start = ["January",2015]
    end = ["August",2017]
    directory = "STUFF GOES HERE"
    newsabbv = all_news_abbv
    search_list = [narrow, broad, general]

    list_of_inputs = [start,end,newsabbv,directory]    

    pool = Pool(processes=4)
    for search in search_list:
        pool.map(websitesearch, search_list)
        print(list_of_inputs)        

If I add in a print statement in the main() function, it will print, but nothing really ends up happening. I'd appreciate any and all help. I left out the code that gets the values and puts it into a list since its convoluted but I know it works.

Thanks in advance for any and all help! Let me know if there is more information I can provide.

Isaac

EDIT: I have looked into more help online and realize that I misunderstood the purpose of mapping a list to the function using pool.map(fn, list). I have updated my code to reflect my current approach that is still not working. I also moved the initializing values into the main function.

1

There are 1 best solutions below

0
On

i don't think it can be multiprocessing with your way. Because it's still have queue process there (not queue module) caused by selenium.

The reason is...selenium only can handle one window, cannot handle several window or tab browser at the same time (limitation of the window_handle features). that's means....your multi process only processing data process in memory that send to selenium or crawled by selenium. by try process the crawl of selenium in one script file, will make the selenium as the bottle neck process's source.

the best way to make real multiprocess is:

  1. make a script that use selenium to handle that url to crawl by selenium and save it as a file. e.g crawler.py and make sure the script have print command to print the result

e.g:

import -> all modules that you need to run selenium
import sys

url = sys.argv[1] #you will catch the url 

driver = ......#open browser

driver.get(url)
#just continue the script base on your method

print(--the result that you want--)
sys.exit(0)

i can give more explanation because this is the main core of the process, and what you want to do on that web, only you that understood.

  1. make another script file that:

a. devide the url, multi process means make some process and run it together with all cpu cores, the best way to make it... it's start by devide the input process, in your case maybe the url target (you don't give us, the website target that you want to crawl). but every pages of the website have the different url. just collect all url and devide it to several groups (best practice: your cpu cores - 1)

e.g:

import multiprocessing as mp

cpucore=int(mp.cpu_count())-1.

b. send the url to processing with the crawl.py that already you made before (by subprocess, or other module e,g: os.system). make sure you run the crawl.py max == the cpucore.

e.g:

crawler = r'YOUR FILE DIRECTORY\crawler.py'

def devideurl():
    global url1, url2, url3, url4
    make script that result:
    urls1 = groups or list of url
    urls2 = groups or list of url
    urls3 = groups or list of url
    urls4 = groups or list of url

def target1():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout = PIPE)
        #continue the script, base on your need...
        #do you see the combination between python crawler and url?
        #the cmd command will be: python crawler.py "value", the "value" is captured by sys.argv[1] command in crawler.py

def target2():
    for url in url2:
        t1 = subprocess.Popen(['python', crawler, url], stdout = PIPE)
        #continue the script, base on your need...
def target3():
    for url in url1:
        t1 = subprocess.Popen(['python', crawler, url], stdout = PIPE)
        #continue the script, base on your need...
def target4():
    for url in url2:
        t1 = subprocess.Popen(['python', crawler, url], stdout = PIPE)
        #continue the script, base on your need...

cpucore = int(mp.cpu_count())-1
pool = Pool(processes="max is the value of cpucore")
for search in search_list:
    pool.map(target1, devideurl)
    pool.map(target2, devideurl)
    pool.map(target3, devideurl)
    pool.map(target4, devideurl)
    #you can make it, more, depend on your cpu core

c. get the printed result to the memory of main script

d. continous your script process to process the data that you already got.

  1. and the last, make the multiprocess script for the whole process in the main script.

with this method:

you can open many browser windows and handle it with the same time, and because of the data processing that crawling from website is slower than data processing in memory, this method at least reducing the bottle neck on data flow. means it's more faster than your method before.

hopelly helpfull...cheers