Decimating large data files from disk in MATLAB?


I have very large data files (typically 30 GB to 60 GB) in .txt format. I want to find a way to automatically decimate the files without importing them into memory first. My .txt files consist of two columns of data; this is an example file: https://www.dropbox.com/s/87s7qug8aaipj31/RTL5_57.txt

What I have done so far is to import the data into a variable "C" and then downsample it. The problem with this method is that the variable "C" often fills MATLAB's memory capacity before the program has a chance to decimate:

function [] = textscan_EPS(N, D, fileEPS)
% fileEPS: .txt file path
% N: number of lines to read
% D: decimation factor

fid = fopen(fileEPS);
format = '%f\t%f';

C = textscan(fid, format, N, 'CollectOutput', true); % this variable exceeds memory capacity

d = downsample(C{1}, D);
plot(d);

fclose(fid);

end

How can I modify this line:

C = textscan(fid, format, N, 'CollectOutput', true);

so that it effectively decimates the data at this point, importing only every other line, or every 3rd line, etc., of the .txt file from disk into variable "C" in memory?
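One possibility (an untested sketch, assuming every line holds exactly two tab-separated floats and there is no header) is to exploit the fact that textscan treats newlines like any other delimiter, so skip-specifiers (%*f) in the format string can discard whole lines:

```matlab
% Sketch: each application of the format reads one row and skips D-1 rows.
% Assumes a uniform two-column float file such as RTL5_57.txt.
D = 3;                                         % decimation factor: keep every D-th line
format = ['%f%f', repmat('%*f%*f', 1, D - 1)]; % read 2 fields, skip 2*(D-1) fields

fid = fopen(fileEPS);
% N here is the number of format cycles, i.e. the number of KEPT rows;
% textscan consumes roughly N*D lines of the file to produce them.
C = textscan(fid, format, N, 'CollectOutput', true);
fclose(fid);

d = C{1}; % ~N rows, already decimated on the way in
```

This keeps only every D-th line in memory, so C never grows beyond N rows regardless of file size.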

Any help would be much appreciated.

Cheers, Jim

PS An alternative method that I have been playing around with uses "fread", but it encounters the same problem:

function [d] = fread_EPS(N, D, fileEPS)
% N: number of lines to read
% D: decimation factor
% fileEPS: location of .txt file

% Read in the data as characters
fid = fopen(fileEPS);
c = fread(fid, N*19, '*char'); % each line of the .txt has 19 characters

% Parse the characters into floating point numbers
f = sscanf(c, '%f');

% Reshape the data into a two-column format and decimate
format long
d = decimate(flipud(rot90(reshape(f, 2, []))), D); % reshape to 2 rows, rotate 90, flip vertically, decimate

fclose(fid);

end
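The same chunking idea used with textscan also applies to the fread route. A sketch, untested, assuming the fixed 19-characters-per-line layout mentioned in the comment above (so every block boundary falls exactly on a line boundary):

```matlab
% Chunked fread: read a block of characters, parse, keep every D-th row.
fid = fopen(fileEPS);
linesPerBlock = 1e4;   % assumed block size; tune for memory vs. speed
d = [];
while ~feof(fid)
    c = fread(fid, linesPerBlock*19, '*char'); % 19 chars per line, per the original comment
    if isempty(c), break; end
    f = sscanf(c, '%f');
    block = reshape(f, 2, []).';   % two columns per row (same as flipud(rot90(...)))
    d = [d; block(1:D:end, :)];    %#ok<AGROW> keep every D-th row, discard the rest
end
fclose(fid);
```

Only one block plus the accumulated sample is ever in memory, so the 30-60 GB file never has to fit at once.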

2 Answers


I believe that textscan is the way to go, but you may need an intermediate step. Here is what I would do, assuming you can easily read N lines at a time:

  1. Read in N lines with textscan(fileID,formatSpec,N)
  2. Sample from these lines, store the result (file or variable) and drop the rest
  3. As long as there are lines left continue with step 1
  4. Optional, depending on your storage method: combine all stored results into one big sample

It should be possible to read just 1 line at a time and decide whether to keep or discard it. Though this would consume minimal memory, I would read a few thousand lines per cycle to get reasonable performance.
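The steps above can be sketched as follows (untested; fileEPS, N, and D are taken from the question, and the result is combined into one array at the end):

```matlab
% Chunked read-and-sample loop: keep every D-th row of each N-line block.
fid = fopen(fileEPS);
kept = {};                                                   % storage for sampled chunks
while ~feof(fid)
    C = textscan(fid, '%f\t%f', N, 'CollectOutput', true);   % step 1: read N lines
    block = C{1};
    if isempty(block), break; end
    kept{end+1} = block(1:D:end, :);                         % step 2: sample, drop the rest
end                                                          % step 3: repeat until EOF
fclose(fid);
d = vertcat(kept{:});                                        % step 4: combine stored results
```

Peak memory is one N-line block plus the accumulated sample, which is roughly 1/D of the file.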


I ended up writing the code below based on Dennis Jaheruddin's advice. It appears to work well for large .txt files (10 GB to 50 GB). The code is also inspired by another post: Memory map file in MATLAB?

Nlines = 1e3;           % number of lines to sample per cycle
sample_rate = 1/1.5e6;  % data sample rate
DECE = 1;               % decimation factor

start = 40;   % start of plot time
finish = 50;  % end of plot time

TIME = 0:sample_rate:sample_rate*(Nlines - 1);
format = '%f\t%f';
fid = fopen('C:\Users\James Archer\Desktop/RTL5_57.txt');

while ~feof(fid)

  C = textscan(fid, format, Nlines, 'CollectOutput', true);
  d = C{1}; % immediately clear C; at this point you need the memory!
  clearvars C;
  TIME = (TIME(end) + sample_rate):sample_rate:(sample_rate*size(d, 1)) + TIME(end); % shift time along
  if (TIME(end) > start) && (TIME(end) < finish)
      plot(TIME(1:DECE:end), d(1:DECE:end, :)) % plot and decimate
  end
  hold on;
  clearvars d;
end

fclose(fid);

fclose(fid);

Older versions of MATLAB do not handle this code well; the following message appears:

Caught std::exception Exception message is: bad allocation

But MATLAB 2013 works just fine.