Matlab: fastest method of reading parts/sequences of a large binary file


I want to read parts of a large (ca. 11 GB) binary file. The currently working solution is to load the entire file (raw_data) with fread(), then crop out the pieces of interest (data).

Question: Is there a faster method in Matlab of reading small parts of a file (1-2% of the total file, partially sequential reads), given something like a binary mask (i.e. a logical index of the specific bytes of interest)? Specifics below.

Notes for my specific case:

  • data of interest (2.6e+7 bytes, or ca. 24 MB) is roughly 2% of raw_data (1.2e+10 bytes or ca. 11 GB)
  • each 600.000 bytes contain ca. 6.500 byte reads, which can be broken down to roughly 1.200 read-skip cycles (such as 'read 10 bytes, skip 5000 bytes').
  • the read instructions for the total file can be broken down into ca. 20.000 similar (but not exactly identical) sets of read-skip cycles (i.e. ca. 20.000 x 1.200 read-skip cycles)
  • The file is read from a GPFS (parallel file system)
  • Ample RAM, the newest Matlab version and all toolboxes are available for the task

My initial idea of an fread-fseek cycle (see pseudocode below) proved to be extraordinarily slower than reading the whole file. Profiling revealed that fread() is the slowest part (it is called over a million times, which is probably obvious to the experts here).

Alternatives I considered: memmapfile() has no feasible way to read multiple small parts, as far as I could find. The MappedTensor library might be the next thing I'd look into. Related questions that didn't help, linked for completeness: 1, 2.

%open file
fi=fopen('data.bin');

%example read-skip data
f_reads = [20  10   6  20  40];  %read this number of bytes
f_skips = [900 6000 40 300 600]; %skip these bytes after each read instruction

data = []; %save the result here
fseek(fi,90000,'bof'); %skip initial bytes until first read

%read the file
for ind=1:numel(f_reads)
  tmp_data = fread(fi,f_reads(ind)); %default precision: one byte per element
  data = [data; tmp_data]; %add newly read bytes to data variable
  fseek(fi,f_skips(ind),'cof'); %skip to next read position
end
fclose(fi);

FYI: To give an overview, and for transparency, I've compiled some plots (below) of the first ca. 6.500 read locations (of my actual data) which, after collapsing into fread-fseek pairs, can be summarized in 1.200 fread-fseek pairs.

[Plots: f_reads (bytes), f_skips (bytes), read locations]


There are 2 best solutions below

0
On BEST ANSWER

I would do two things to speed up your code:

  1. preallocate the data array.
  2. write a C MEX-file to call fread and fseek.

This is a quick test I did to compare using fread and fseek from MATLAB or C:

%% Create large binary file
data = 1:10000000; % 80 MB
fi = fopen('data.bin', 'wb');
fwrite(fi, data, 'double');
fclose(fi);

n_read = 1;
n_skip = 99;

%% Read using MATLAB
tic
fi = fopen('data.bin', 'rb');
fseek(fi, 0, 'eof');
sz = ftell(fi);
sz = floor(sz / (n_read + n_skip));
data = zeros(1, sz);
fseek(fi, 0, 'bof');
for ind = 1:sz
  data(ind) = fread(fi, n_read, 'int8');
  fseek(fi, n_skip, 'cof');
end
toc

%% Read using C MEX-file
mex fread_test_mex.c

tic
data = fread_test_mex('data.bin', n_read, n_skip);
toc

And this is fread_test_mex.c:

#include <stdio.h>
#include <mex.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
   // No testing of inputs...
   // inputs = 'data.bin', 1, 99
   char* fname = mxArrayToString(prhs[0]);
   int n_read = (int)mxGetScalar(prhs[1]);
   int n_skip = (int)mxGetScalar(prhs[2]);
   FILE* fi = fopen(fname, "rb");
   fseek(fi, 0L, SEEK_END);
   int sz = ftell(fi); // note: int limits this test to files under 2 GB
   sz /= n_read + n_skip;
   plhs[0] = mxCreateNumericMatrix(1, sz, mxDOUBLE_CLASS, mxREAL);
   double* data = mxGetPr(plhs[0]);
   fseek(fi, 0L, SEEK_SET);
   signed char buffer[1]; // one int8 value; this test assumes n_read == 1
   for(int ind = 0; ind < sz; ++ind) { // C arrays are zero-based
      fread(buffer, 1, n_read, fi);
      data[ind] = buffer[0];
      fseek(fi, n_skip, SEEK_CUR);
   }
   fclose(fi);
   mxFree(fname); // release the string allocated by mxArrayToString
}

I see this:

Elapsed time is 6.785304 seconds.
Building with 'Xcode with Clang'.
MEX completed successfully.
Elapsed time is 1.376540 seconds.

That is, reading the data is 5x as fast with a C MEX-file. And that time includes loading the MEX-file into memory. A second run is a bit faster (1.14 s) because the MEX-file is already loaded.


In the MATLAB code, if I initialize data = []; and then extend the matrix every time I read like OP does:

tmp = fread(fi, n_read, 'int8');
data = [data, tmp];

then the execution time for that loop was 159 s, with 92.0% of the time spent in the data = [data, tmp] line. Preallocating really is important!
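
For the variable pattern in the question (f_reads / f_skips), the fixed-stride loop in the MEX-file above could be generalized roughly as follows. This is only a sketch under assumptions: read_skip and its parameters are hypothetical names, not part of the code above, and it is plain C without the MEX wrapper. Note also that fseek takes a long offset, which can be 32-bit on some platforms, so an 11 GB file would need a 64-bit seek instead (for example POSIX fseeko or Windows _fseeki64). The output buffer is allocated once by the caller, sized to the sum of the read lengths, so nothing grows inside the loop.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not from the answer above): starting at byte offset
   `start`, read reads[i] bytes and then skip skips[i] bytes, for i = 0..n-1,
   storing everything back to back in the caller-preallocated buffer `out`.
   Returns the number of bytes stored, or -1 on error. */
long read_skip(const char *path, long start,
               const size_t *reads, const long *skips, size_t n,
               unsigned char *out, size_t out_cap)
{
    assert(reads != NULL && skips != NULL && out != NULL);

    FILE *fi = fopen(path, "rb");
    if (fi == NULL) return -1;
    if (fseek(fi, start, SEEK_SET) != 0) { fclose(fi); return -1; }

    size_t pos = 0;                              /* next free slot in out */
    for (size_t i = 0; i < n; ++i) {
        if (pos + reads[i] > out_cap) { fclose(fi); return -1; }
        size_t got = fread(out + pos, 1, reads[i], fi);
        pos += got;
        if (got < reads[i]) break;               /* reached end of file */
        if (fseek(fi, skips[i], SEEK_CUR) != 0) break;
    }
    fclose(fi);
    return (long)pos;
}
```

With the example values from the question, the call would look like read_skip("data.bin", 90000, reads, skips, 5, out, 96), where reads = {20, 10, 6, 20, 40}, skips = {900, 6000, 40, 300, 600} and out holds 20+10+6+20+40 = 96 bytes.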

0
On

I bumped into the same kind of question (reading data from a >4 GB binary file stored with precision 'single' in my case). While trying to adapt your solution to my code, I wondered what motivates the type 'int8' when reading the file, given that the example data was written with type 'double'. Does it need to be that way for the mex file?

I am used to reading a file with the same type it was written/saved with. For example, if I want to read the 7 consecutive elements with indexes 3 to 9 from the variable data (which should actually contain 3:9 because data = 1:10000000), I seek to the appropriate position (according to the number of bytes used to store each element: here 8, because fwrite was used with type 'double'):

fseek(fi, (3-1)*8,'bof')

and I then read my 7 elements from there (the size argument of fread counts elements of the given precision, not bytes):

fread(fi, 7, 'double')

Could you give me a hint on how to adapt the script/mex file so that the output actually corresponds to the content of the variable data before it was saved as binary? Thanks!
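
A possible direction, offered as a sketch rather than a definitive answer: the 'int8' in the benchmark above appears to be there only to time single-byte reads, not to recover the stored values. To get the original values back, the C side can seek in bytes but read in elements of the stored type. The helper name read_doubles is hypothetical, and the code assumes the file holds doubles in the machine's native byte order (for a file written as 'single', float would take the place of double).

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: read `count` doubles starting at the 1-based element
   index `first` (MATLAB-style indexing) from a file of raw doubles.
   Returns the number of elements actually read. */
size_t read_doubles(const char *path, long first, size_t count, double *out)
{
    assert(first >= 1 && out != NULL);

    FILE *fi = fopen(path, "rb");
    if (fi == NULL) return 0;

    /* Seek in bytes: skip (first - 1) elements of sizeof(double) bytes each. */
    if (fseek(fi, (first - 1) * (long)sizeof(double), SEEK_SET) != 0) {
        fclose(fi);
        return 0;
    }

    /* Read in elements: fread takes the element size and the element count. */
    size_t got = fread(out, sizeof(double), count, fi);
    fclose(fi);
    return got;
}
```

For the example in this post, read_doubles("data.bin", 3, 7, out) corresponds to the fseek to (3-1)*8 bytes followed by reading 7 doubles. Inside a MEX-file, out could be the mxGetPr pointer of a preallocated 1-by-7 double matrix.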