How to partition datasets into n blocks to reduce queue time on a supercomputer?


I have a dataset which includes approximately 2000 digital images. I am using MATLAB to perform some digital image processing to extract trees from the imagery. The script is currently configured to process the images in a parfor loop on n cores.

The challenge:
I have access to processing time on a university-managed supercomputer with approximately 10,000 compute cores. If I submit the entire job for processing, I get put so far back in the tasking queue that a desktop computer could finish the job before the processing even starts on the supercomputer. I have been told by support staff that partitioning the 2000-file dataset into ~100-file jobs will significantly decrease the queue time. What method can I use to perform the tasks in parallel using the parfor loop, while submitting 100 files (of 2000) at a time?

My script is structured in the following way:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
fileIndex = find(~[files.isdir]);

parfor ix = 1:length(fileIndex)
    thisFile = files(fileIndex(ix));
    % Perform the processing on thisFile;
end

There are 2 answers below.

Accepted answer:

Similar to my comment, I would suggest something like the following:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
files = files(~[files.isdir]);

% split the file list into chunks of at most jobSize files
N = length(files); % e.g. 2000
jobSize = 100;
chunkSizes = [jobSize*ones(1, floor(N/jobSize)), mod(N, jobSize)];
chunkSizes = chunkSizes(chunkSizes > 0); % drop the zero-size remainder when N is a multiple of jobSize
jobFiles = mat2cell(files, chunkSizes);
jobNum = length(jobFiles);

% Provide each chunk of files to a worker
parfor jobIdx = 1:jobNum
    thisJob = jobFiles{jobIdx}; % indexing the cell array here lets MATLAB
                                % transfer only this chunk's file data to
                                % the worker

    for fIdx = 1:length(thisJob)
        thisFile = thisJob(fIdx);
        % Perform the processing on each file; as a placeholder,
        % just display the file name
        thisFile.name
    end
end
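
If you additionally want each chunk to enter the scheduler queue as its own job (which is what actually shortens the wait), the same jobFiles chunks could be submitted with the Parallel Computing Toolbox batch function. This is only a minimal sketch, assuming your university cluster has a MATLAB Parallel Server profile configured; the profile name 'myClusterProfile' and the processChunk wrapper function are placeholders, not part of the original script:

c = parcluster('myClusterProfile'); % assumed name of the cluster profile

jobs = cell(jobNum, 1);
for jobIdx = 1:jobNum
    % Submit one independent queue job per ~100-file chunk; processChunk is
    % a hypothetical function that runs the per-file processing over the
    % struct array of files it receives.
    jobs{jobIdx} = batch(c, @processChunk, 0, jobFiles(jobIdx));
    % If processChunk itself contains a parfor loop, also pass the
    % 'Pool' name-value argument so workers are reserved for it.
end

% Optionally block until every chunk has finished.
for jobIdx = 1:jobNum
    wait(jobs{jobIdx});
end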
Second answer:

Let me try to answer the higher-level question of job partitioning to optimize for supercomputer queues. I find that a good rule of thumb is to submit jobs of size sqrt(p) on a machine with p processors, if the goal is to maximize throughput. Of course, this assumes a relatively balanced queue policy, which not every site implements. But most universities don't prioritize large jobs the way DOE facilities do, so this rule should work in your case.
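
Plugging in the numbers from the question (a back-of-the-envelope check; mapping roughly one image to one core is my own assumption, not something stated above):

p = 10000;                 % approximate core count of the university machine
jobSize = round(sqrt(p))   % = 100 cores per job; at about one image per core,
                           % this lines up with the ~100-file chunks the
                           % support staff recommended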

I don't have a mathematical theory behind my rule of thumb, but I've been a heavy user of DOE supercomputers over the past 8 years (100M+ hours personally, allocation owner for 500M+), and until recently I was on staff at one of the DOE sites (albeit one whose queue policy breaks my rule).