I'm a little confused and surprised. I'm training the same RNN model over a large number of different datasets, saving each learned net to a separate file. My PC runs Windows 10 with an Nvidia GPU with 12 GB of dedicated memory, a CPU with 4 physical / 8 logical cores, and 16 GB of RAM. I developed two versions, a sequential one and a ThreadPoolExecutor one, both running the training on the GPU. Strangely, the ThreadPool version gives a lower throughput: 2.38 trainings/min for the sequential version vs 1.20 trainings/min for the ThreadPool version.

Running each version on its own, Windows Task Manager shows an average of about 30% CPU usage and 40%-80% RAM usage. GPU memory is almost completely filled in the ThreadPool case (floating around 9/10 of the 12 GB dedicated + 8 GB shared, with 3 to 8 threads running), versus 50%-100% of the dedicated GPU memory in the sequential case.

Are those ratios strange? Shouldn't the GPU be faster in parallel, since it is being used more uniformly? Any recommendation to substantially increase speed?
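master_routine is not shown here; as a rough idea of what each call does per dataset, here is a minimal sketch (the tensor shapes, the synthetic data, and the short training loop are illustrative assumptions, not my actual code):

import torch

def master_routine_sketch(tit, dir_base, conf, device):
    # conf carries the real hyperparameters; ignored in this sketch
    x = torch.randn(256, 60, 8, device=device)    # stand-in for one '1m' dataset
    y = torch.randn(256, 16, device=device)
    rnn = torch.nn.RNN(8, 16, batch_first=True).to(device)
    opt = torch.optim.Adam(rnn.parameters(), lr=1e-3)
    for _ in range(10):                           # short training loop on the shared GPU
        out, _ = rnn(x)
        loss = torch.nn.functional.mse_loss(out[:, -1, :], y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    torch.save(rnn.state_dict(), dir_base + tit + '.pt')   # one learned net per file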
ThreadPool version:

import os
import torch
import concurrent.futures as cf

clean_mem()                                    # clean_mem() and readjson() are my own helpers
cores = int(os.cpu_count() * 3 / 4) - 2        # leave some logical cores free
dir_base = os.getcwd() + '\\'
conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
while conf['loop']:
    for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
        break                                  # only the top-level file list is needed
    whole_tit = [ti for ti in whole_tit if '1m' in ti]   # keep only the '1m' datasets
    with cf.ThreadPoolExecutor(max_workers=cores) as executor:
        executor.map(parallel_master_routine, whole_tit)
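(parallel_master_routine is the small wrapper defined in the ProcessPool listing further down; all of these threads live in a single Python process, share the GIL, and submit their work to the same cuda:0 device.)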
Versus sequential:
clean_mem()
dir_base = os.getcwd() + '\\'
conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
while conf['loop']:
    for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
        break
    whole_tit = [ti for ti in whole_tit if '1m' in ti]
    for tit in whole_tit:
        master_routine(tit, dir_base, conf, device)
ProcessPool version:

# same imports and helpers as in the ThreadPool version
def parallel_master_routine(tit):
    # wrapper so executor.map only has to pass the file name
    master_routine(tit, dir_base, conf, device)

#-----------------------------------------------------
def main():
    clean_mem()
    cores = int(os.cpu_count() * 3 / 4) - 2
    dir_base = os.getcwd() + '\\'
    conf = readjson(dir_base + 'Dati_Apprendimento\\' + '_Conf_Learn_torch.txt')
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    while conf['loop']:
        for (_, _, whole_tit) in os.walk(dir_base + conf['dirr']):
            break
        whole_tit = [ti for ti in whole_tit if '1m' in ti]
        if __name__ == '__main__':
            with cf.ProcessPoolExecutor(max_workers=cores) as executor:
                executor.map(parallel_master_routine, whole_tit)

if __name__ == '__main__':
    main()
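One thing I'm not sure about in the ProcessPool version: dir_base, conf and device are locals of main(), while parallel_master_routine looks them up as globals, and on Windows each worker process re-imports the module from scratch. A minimal sketch of passing them to the workers explicitly instead (run_pool and the keyword names are illustrative; it assumes master_routine's parameters are named tit, dir_base, conf and device):

import functools
import concurrent.futures as cf

def run_pool(whole_tit, dir_base, conf, device, cores):
    # bind the shared arguments so each spawned worker process receives them explicitly
    worker = functools.partial(master_routine, dir_base=dir_base, conf=conf, device=device)
    with cf.ProcessPoolExecutor(max_workers=cores) as executor:
        # consume the iterator so exceptions raised in the workers are re-raised here
        list(executor.map(worker, whole_tit))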