How to partition datasets into n blocks to reduce queue time on a supercomputer?


I have a dataset which includes approximately 2000 digital images. I am using MATLAB to perform some digital image processing to extract trees from the imagery. The script is currently configured to process the images in a parfor loop on n cores.

The challenge:
I have access to processing time on a university-managed supercomputer with approximately 10,000 compute cores. If I submit the entire job for processing, I get put so far back in the tasking queue that a desktop computer could finish the job before the processing even starts on the supercomputer. I have been told by support staff that partitioning the 2000-file dataset into ~100-file jobs will significantly decrease the queue time. What method can I use to perform the tasks in parallel using the parfor loop, while submitting 100 files (of 2000) at a time?

My script is structured in the following way:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
fileIndex = find(~[files.isdir]);

parfor ix = 1:length(fileIndex)
    thisFile = files(fileIndex(ix));
    % Perform the processing on thisFile;
end

There are 2 answers below.

Accepted answer:

Similar to my comment, I would suggest something like the following:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
files = files(~[files.isdir]);

% split the file list into chunks of at most jobSize files
N = length(files); % e.g. 2000
jobSize = 100;
chunkSizes = [jobSize*ones(1, floor(N/jobSize)), mod(N, jobSize)];
chunkSizes = chunkSizes(chunkSizes > 0); % drop the zero-size remainder when N is a multiple of jobSize
jobFiles = mat2cell(files, chunkSizes);
jobNum = length(jobFiles);

% Provide each chunk of files to a worker
parfor jobIdx = 1:jobNum
    thisJob = jobFiles{jobIdx}; % indexing the cell array here lets MATLAB
                                % transfer only this chunk's file data to
                                % the worker

    for fIdx = 1:length(thisJob)
        thisFile = thisJob(fIdx);
        % Perform the processing on each file; as a placeholder,
        % just display the file name
        thisFile.name
    end
end
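
If you additionally want each chunk to enter the scheduler queue as its own job (which is what actually shortens the wait), the same jobFiles chunks could be submitted with the Parallel Computing Toolbox batch function. This is only a minimal sketch, assuming your university cluster has a MATLAB Parallel Server profile configured; the profile name 'myClusterProfile' and the processChunk wrapper function are placeholders, not part of the original script:

c = parcluster('myClusterProfile'); % assumed name of the cluster profile

jobs = cell(jobNum, 1);
for jobIdx = 1:jobNum
    % Submit one independent queue job per ~100-file chunk; processChunk is
    % a hypothetical function that runs the per-file processing over the
    % struct array of files it receives.
    jobs{jobIdx} = batch(c, @processChunk, 0, jobFiles(jobIdx));
    % If processChunk itself contains a parfor loop, also pass the
    % 'Pool' name-value argument so workers are reserved for it.
end

% Optionally block until every chunk has finished.
for jobIdx = 1:jobNum
    wait(jobs{jobIdx});
end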
Second answer:

Let me try to answer the higher-level question of job partitioning to optimize for supercomputer queues. I find that a good rule of thumb is to submit jobs of size sqrt(p) on a machine with p processors, if the goal is to maximize throughput. Of course, this assumes a relatively balanced queue policy, which not every site implements. But most universities don't prioritize large jobs the way DOE facilities do, so this rule should work in your case.
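
Plugging in the numbers from the question (a back-of-the-envelope check; mapping roughly one image to one core is my own assumption, not something stated above):

p = 10000;                 % approximate core count of the university machine
jobSize = round(sqrt(p))   % = 100 cores per job; at about one image per core,
                           % this lines up with the ~100-file chunks the
                           % support staff recommended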

I don't have a mathematical theory behind my rule of thumb, but I've been a heavy user of DOE supercomputers over the past 8 years (100M+ hours personally, allocation owner for 500M+), and until recently I was on staff at one of the DOE sites (albeit one whose queue policy breaks my rule).