I have a function that requires a path to a folder containing only images and nothing else. However, I have 3000+ folders that contain images but also other files (random .txt files, RTSTRUCT files, etc.).
For each folder, I was using shutil.copy() to move the images into a tmp folder and then pointing my function at that, but I ran into long runtimes. I tried shortening it with shutil.copyfile() but found that to be slow as well. I tried parallelizing with the multiprocessing library but ran into problems: each subprocess was using the same temporary folder, so there was inadvertent mixing and matching of files!
Any ideas?
import glob
import shutil

# Copy only the image files (names containing "IMG") into the temp folder.
for file in glob.glob(dicom_dir + "/*IMG*"):
    shutil.copy(file, tmp_dicom_folder)

This gives prohibitive runtimes.
Is there a way for me to parallelize without running into this issue with temporary folders? Making a temp folder for each process sounds... messy.
Use symlinks instead of copies. Create a temporary directory for each process you want to run in parallel. If this is CPU-intensive image processing, a little less than one process per CPU is a reasonable starting point, but this could change for all sorts of reasons, such as GPU usage.
List the files you want processed and scatter symlinks to them across those directories. Run a copy of your program on each of these symlinked directories, and each will get a subset of the workload.
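Here is a minimal sketch of that approach using multiprocessing.Pool, assuming Python 3 on a filesystem/OS that supports symlinks. process_images and dicom_dir are hypothetical stand-ins for your own function and source directory:

import glob
import os
import tempfile
from multiprocessing import Pool

dicom_dir = "/path/to/dicom_folders"  # stand-in for your source directory

def process_images(folder):
    ...  # stand-in for your function that expects an images-only folder

def run_chunk(image_paths):
    # Each worker gets its own private temp directory, so nothing is
    # shared between processes; it is deleted automatically on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, path in enumerate(image_paths):
            # Prefix with an index in case basenames repeat across folders.
            link = os.path.join(tmp_dir, f"{i}_{os.path.basename(path)}")
            os.symlink(os.path.abspath(path), link)  # near-instant, no bytes copied
        return process_images(tmp_dir)

if __name__ == "__main__":
    files = glob.glob(dicom_dir + "/*IMG*")
    n_procs = max(1, (os.cpu_count() or 2) - 1)  # a little less than one per CPU
    chunks = [files[i::n_procs] for i in range(n_procs)]  # one subset per process
    with Pool(n_procs) as pool:
        results = pool.map(run_chunk, chunks)

If your function must see one source folder at a time rather than an arbitrary subset of files, the same pattern works per folder: have each worker build its private temp directory from a single folder's *IMG* files and map the pool over the list of folders instead.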