How can I make my Python program faster? Is multithreading useful here?


I have a JSON file that I am parsing through in an attempt to see which of the domains in it are live.

The code I have is the following:

import socket

for i in range(len(json_data)):
    print(i)
    if int(json_data[i]['response']['result_count']) > 0:
        for j in range(len(json_data[i]['response']['matches'])):
            try:
                socket.gethostbyname(json_data[i]['response']['matches'][j]['domain'])
            except:
                del json_data[i]['response']['matches'][j]['domain']

I have attempted to use multithreading in the following form:

import threading

def run_half():
    for i in range(0, round(len(json_data) / 2)):
        print(i)
        if int(json_data[i]['response']['result_count']) > 0:
            for j in range(len(json_data[i]['response']['matches'])):
                try:
                    socket.gethostbyname(json_data[i]['response']['matches'][j]['domain'])
                except:
                    del json_data[i]['response']['matches'][j]['domain']

def run_half_2():
    for i in range(round(len(json_data) / 2) + 1, len(json_data)):
        print(i)
        if int(json_data[i]['response']['result_count']) > 0:
            for j in range(len(json_data[i]['response']['matches'])):
                try:
                    socket.gethostbyname(json_data[i]['response']['matches'][j]['domain'])
                except:
                    del json_data[i]['response']['matches'][j]['domain']

t1 = threading.Thread(target=run_half(), args=(10,))
t2 = threading.Thread(target=run_half_2(), args=(10,))

t1.start()
t2.start()

t1.join()
t2.join()

For some reason, I have not noticed any change in the time it takes to run.

Any advice or suggestions would be greatly appreciated. Thank you!

1 Answer


Yes, threading is useful here, as this is a network/IO-bound task: each gethostbyname call spends most of its time waiting on the network, so other threads can make progress in the meantime.

As an aside, the reason your version shows no speedup is that target=run_half() calls the function immediately, while the Thread object is being constructed, so the two halves still run one after the other on the main thread. target should be the function object itself, and args=(10,) should be dropped, since neither function takes arguments.
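
For completeness, that part of your code with just this fixed would be:

    import threading

    # Pass the function objects themselves; calling them here would run
    # them to completion sequentially before the threads ever start.
    t1 = threading.Thread(target=run_half)
    t2 = threading.Thread(target=run_half_2)

    t1.start()
    t2.start()

    t1.join()
    t2.join()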

Rather than splitting the work into two fixed halves as above, a better approach is to treat each hostname check as an individual task and fan the execution out to a pool of workers.

I'd suggest using the thread pool executor provided by the Python standard library (concurrent.futures.ThreadPoolExecutor) to achieve this.

https://docs.python.org/3/library/concurrent.futures.html

The concept is that you fan out each long-running task into a future, and then fan in to collect all the results.

For example (with a minimal long_running_task filled in, assuming the gethostbyname check from your question):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def long_running_task(hostname):
        # One unit of work: does this hostname resolve?
        try:
            socket.gethostbyname(hostname)
            return True
        except socket.gaierror:
            return False

    list_of_work_to_do = ["url1", "url2", "url3"]

    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = []

        # Fan-out work: each submit returns a Future immediately.
        for my_url in list_of_work_to_do:
            future = executor.submit(long_running_task, my_url)
            futures.append(future)

        # Fan-in results: result() blocks until that task finishes.
        results = [future.result() for future in futures]
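
Applied to your JSON, it could look something like the sketch below. It assumes the layout implied by your code (json_data[i]['response']['matches'][j]['domain']) and, rather than deleting keys while iterating, builds a mapping from each domain to whether it resolved:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def resolves(domain):
        # Return True if the domain resolves, False otherwise.
        try:
            socket.gethostbyname(domain)
            return True
        except socket.gaierror:
            return False

    # Flatten the parsed JSON into a plain list of domains to check.
    domains = [
        match['domain']
        for entry in json_data
        if int(entry['response']['result_count']) > 0
        for match in entry['response']['matches']
    ]

    with ThreadPoolExecutor(max_workers=8) as executor:
        # map() fans the checks out across the pool and yields
        # the results back in input order.
        is_live = dict(zip(domains, executor.map(resolves, domains)))

Since each lookup is dominated by network wait rather than CPU, the wall-clock time should drop roughly in proportion to the number of workers, up to the limits of your DNS resolver.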