I have very large data files (typically 30 GB to 60 GB) in .txt format. I want to find a way to automatically decimate the files without importing them into memory first. My .txt files consist of two columns of data; here is an example file: https://www.dropbox.com/s/87s7qug8aaipj31/RTL5_57.txt
What I have done so far is to import the data into a variable "C" and then downsample it. The problem with this method is that the variable "C" often fills MATLAB's memory capacity before the program has a chance to decimate:
function [] = textscan_EPS(N,D,fileEPS)
%fileEPS: .txt address
%N: number of lines to read
%D: Decimation factor
fid = fopen(fileEPS);
format = '%f\t%f';
C = textscan(fid, format, N, 'CollectOutput', true); % this variable exceeds memory capacity
d = downsample(C{1},D);
plot(d);
fclose(fid);
end
How can I modify this line:
C = textscan(fid, format, N, 'CollectOutput', true);
so that it effectively decimates the data as it is read, i.e. imports only every other line, or every 3rd line, etc., of the .txt file from disk into variable "C" in memory?
Any help would be much appreciated.
Cheers, Jim
PS: An alternative method that I have been playing around with uses "fread", but it encounters the same problem:
function [d] = fread_EPS(N,D,fileEPS)
%N: number of lines to read
%D: decimation factor
%fileEPS: location of the .txt file
%Read in the data as characters
fid = fopen(fileEPS);
c = fread(fid,N*19,'*char'); % each line of the .txt file has 19 characters
%Parse the characters into floating-point numbers
f = sscanf(c,'%f');
%Reshape the data into two-column format, then decimate
format long
d = decimate(flipud(rot90(reshape(f,2,[]))),D); % reshape to 2 rows, rotate 90, flip vertically, then decimate
fclose(fid);
end
I believe that textscan is the way to go, but you may need to take an intermediate step. Here is what I would do, assuming you can easily read N lines at a time:
textscan(fileID,formatSpec,N)
It should be possible to read just 1 line each time and decide whether you want to keep or discard it. Though that would consume minimal memory, I would read a few thousand lines per call to get reasonable performance; see the sketch below.
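For example, here is a minimal sketch of that chunked approach (the loop structure and the name chunked_decimate are my own; N, D and fileEPS are taken from your question):
function d = chunked_decimate(fileEPS, N, D)
%fileEPS: path to the .txt file
%N: number of lines to read per chunk
%D: decimation factor (keep every D-th line)
fid = fopen(fileEPS);
d = [];
while ~feof(fid)
    C = textscan(fid, '%f\t%f', N, 'CollectOutput', true); % read one chunk of N lines
    chunk = C{1};
    if isempty(chunk)
        break;
    end
    d = [d; chunk(1:D:end, :)]; % keep every D-th row of this chunk
end
fclose(fid);
end
If you pick N as a multiple of D, the stride lines up across chunk boundaries, so you get the same uniform subsampling that downsample would give you on the whole file. Growing d inside the loop is fine for a sketch, but on 30-60 GB files you would want to preallocate it (the final size is roughly the total number of lines divided by D).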