Threads, processes and sequential (for loop) - speed of execution

I'm a little confused and surprised. I'm training a single RNN model over many different datasets, saving each learned network to a separate file. My PC runs Windows 10 with a 12 GB NVIDIA GPU, 4 physical / 8 logical CPU cores, and 16 GB of RAM. I developed two versions: the first sequential, the second using a ThreadPoolExecutor; both run on the GPU. Strangely, I get a lower training rate with the ThreadPool version: 2.38 trainings/min for the sequential version vs 1.20 trainings/min for the second one.

Running one version at a time, Windows' Task Manager shows an average of about 30% CPU usage and RAM between 40% and 80%. GPU memory is almost completely filled (hovering around 9/10 of the 12 GB dedicated + 8 GB shared) with the ThreadPool version (I see 3 to 8 threads running), versus 50%-100% of dedicated GPU memory with the sequential version. Are those ratios strange? Shouldn't the GPU be faster in parallel, since it is used more uniformly? Any recommendation to substantially increase speed?
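For reference, the GPU memory figures above can also be read programmatically with PyTorch's built-in counters; a minimal sketch (the log_gpu_memory helper is illustrative only and not part of my training code):

import torch

def log_gpu_memory(tag):
    # Report how much of the dedicated GPU memory this process currently holds.
    if not torch.cuda.is_available():
        return
    total = torch.cuda.get_device_properties(0).total_memory
    allocated = torch.cuda.memory_allocated(0)
    reserved = torch.cuda.memory_reserved(0)
    print(f"[{tag}] allocated {allocated / 1e9:.2f} GB, "
          f"reserved {reserved / 1e9:.2f} GB of {total / 1e9:.2f} GB")

Calling it once per dataset (e.g. at the start of master_routine) would show whether the near-full memory in the ThreadPool case comes from many models being resident at once. Here is the ThreadPool version: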

import os
import torch
import concurrent.futures as cf

clean_mem()
cores = int(os.cpu_count() * 3 / 4) - 2          # leave a couple of cores free
dir_base = os.getcwd() + '\\'
conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

while conf['loop']:
    # os.walk yields (dirpath, dirnames, filenames); keep only the filenames
    # of the top level, then filter the '1m' datasets.
    for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
        break
    whole_tit = [ti for ti in whole_tit if '1m' in ti]

    with cf.ThreadPoolExecutor(max_workers=cores) as executor:
        executor.map(parallel_master_routine, whole_tit)
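In this thread-pool version, parallel_master_routine (defined further down) picks up dir_base, conf and device as module-level globals. A minimal alternative sketch that passes them explicitly instead, assuming master_routine has the same signature as in the sequential version (a lambda is fine here because ThreadPoolExecutor, unlike ProcessPoolExecutor, does not need to pickle the callable):

with cf.ThreadPoolExecutor(max_workers=cores) as executor:
    executor.map(lambda tit: master_routine(tit, dir_base, conf, device),
                 whole_tit)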

Versus sequential:

clean_mem()
dir_base = os.getcwd() + '\\'
conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

while conf['loop']:
    for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
        break
    whole_tit = [ti for ti in whole_tit if '1m' in ti]
    # Train one dataset at a time on the same device.
    for tit in whole_tit:
        master_routine(tit, dir_base, conf, device)
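The trainings-per-minute figures quoted above can be reproduced with a simple wall-clock measurement around the loop; a minimal sketch (the timing code itself is not part of the original loop):

import time

start = time.perf_counter()
for tit in whole_tit:
    master_routine(tit, dir_base, conf, device)
elapsed_min = (time.perf_counter() - start) / 60.0
print(f"{len(whole_tit) / elapsed_min:.2f} trainings/min over {len(whole_tit)} datasets")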

And the ProcessPool version:

def parallel_master_routine(tit):
    # Relies on dir_base, conf and device existing as module-level globals.
    master_routine(tit, dir_base, conf, device)

#-----------------------------------------------------
def main():
    clean_mem()
    cores = int(os.cpu_count() * 3 / 4) - 2
    dir_base = os.getcwd() + '\\'
    conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    while conf['loop']:
        for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
            break
        whole_tit = [ti for ti in whole_tit if '1m' in ti]

        with cf.ProcessPoolExecutor(max_workers=cores) as executor:
            executor.map(parallel_master_routine, whole_tit)

if __name__ == '__main__':
    main()
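One Windows-specific caveat with the ProcessPool variant: child processes are started with "spawn", so dir_base, conf and device assigned inside main() are not visible to parallel_master_routine in the workers. A minimal sketch of one way around that, using ProcessPoolExecutor's initializer/initargs to rebuild them in each child (this assumes master_routine and readjson are importable at module level and that conf is a plain picklable dict):

import concurrent.futures as cf
import os
import torch

def init_worker(base, cfg):
    # Runs once in every child process; rebuild the state that main() set up.
    global dir_base, conf, device
    dir_base = base
    conf = cfg
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def parallel_master_routine(tit):
    master_routine(tit, dir_base, conf, device)

if __name__ == '__main__':
    dir_base = os.getcwd() + '\\'
    conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
    cores = int(os.cpu_count() * 3 / 4) - 2
    for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
        break
    whole_tit = [ti for ti in whole_tit if '1m' in ti]
    with cf.ProcessPoolExecutor(max_workers=cores,
                                initializer=init_worker,
                                initargs=(dir_base, conf)) as executor:
        list(executor.map(parallel_master_routine, whole_tit))

Note that each worker process still opens its own CUDA context on the single GPU, so the processes compete for the same device memory.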