Given a .tar archive, Matlab allows one to extract the contained files to disk via the UNTAR command. One can then manipulate the extracted files in the ordinary way.
Issue: When several files are stored in a tarball, they are stored contiguously on disk and, in principle, can be accessed serially. Once the files are extracted, this contiguity no longer holds and file access can become random, hence slow and inefficient.
This is especially critical when the files are many (thousands) and small.
My question: is there any way to access the archived files without the preliminary extraction (in a sort of HDF5 fashion)?
In other words, would it be possible to cache the .tar so as to access the contained files from memory rather than from disk?
(In general, direct .tar manipulation is possible, e.g. in C# with tar-cs, or in Python.)
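To illustrate what "direct .tar manipulation" means outside Matlab, here is a minimal Python sketch using the standard tarfile module. It reads every member of an archive straight from a bytes buffer, with no extraction to disk. The helper names build_demo_tar and read_members are made up for this example; the first one only exists to produce demo data.

```python
import io
import tarfile


def build_demo_tar(files):
    """Build a small .tar in memory from a {name: bytes} dict (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_members(tar_bytes):
    """Return {name: bytes} for every regular file, without extracting to disk."""
    contents = {}
    # "r:" opens an uncompressed tar; the archive lives entirely in memory.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents
```

In practice tar_bytes would come from one sequential read of the archive file (open(path, "rb").read()), which is what keeps the disk access serial.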
After some time I finally worked out a solution that gave me a large speedup (roughly 10x).
In a word: ramdisk (tested on Linux: Ubuntu and CentOS).
Recap:
Since the problem has some generality, let me state it again in a more complete fashion.
Say that I have many small files stored on disk (txt, pict; on the order of millions) which I want to manipulate (e.g. via Matlab).
Working on such files (i.e. loading them or transmitting them over the network) while they are stored individually on disk is tremendously slow, since the disk access is mostly random.
Hence, tarballing the files into archives (e.g. of fixed size) looked to me like a good way to keep the disk access sequential.
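The packing step could look roughly like the Python sketch below. It assumes "fixed size" means a fixed number of files per archive; the function name pack_in_batches, the batch_NNNNN.tar naming scheme and the default of 1000 files are all choices made for this example, not something from my actual setup.

```python
import os
import tarfile


def pack_in_batches(src_dir, out_dir, files_per_archive=1000):
    """Pack the regular files of src_dir into numbered, uncompressed .tar archives."""
    names = sorted(
        n for n in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, n))
    )
    os.makedirs(out_dir, exist_ok=True)
    archives = []
    for start in range(0, len(names), files_per_archive):
        index = start // files_per_archive
        path = os.path.join(out_dir, "batch_%05d.tar" % index)
        with tarfile.open(path, "w") as tar:
            for name in names[start:start + files_per_archive]:
                # arcname keeps only the file name, not the source directory.
                tar.add(os.path.join(src_dir, name), arcname=name)
        archives.append(path)
    return archives
```

Each resulting archive can then be read or transmitted as one contiguous file.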
Problem:
If manipulating the .tar requires a preliminary extraction to disk (as happens with Matlab's UNTAR), the speedup given by sequential disk access is mostly lost.
Workaround:
The tarball (provided it is reasonably small) can be extracted to memory and then processed from there. In Matlab, as I stated in the question, .tar manipulation in memory is not possible, though.
What can be done (equivalently) is untarring to a ramdisk.
In Linux, e.g. Ubuntu, a default ramdisk drive is mounted at /run/shm (tmpfs). Files can be untarred there via Matlab, which then gives extremely fast access.
In other words, a possible workcycle is:
1. untar to /run/shm/mytemp
2. tar again the output to disk
This allowed me to cut my processing time from 8hrs to 40min, with the CPUs fully loaded.
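For readers outside Matlab, the same untar-to-ramdisk workcycle can be sketched in Python. This is an illustration under assumptions, not the code I ran: the function name process_via_ramdisk and the process callback (bytes in, bytes out) are hypothetical, the scratch directory gets a random suffix rather than being exactly /run/shm/mytemp, and the sketch falls back to the system temp directory when /run/shm is missing or not writable.

```python
import os
import shutil
import tarfile
import tempfile


def process_via_ramdisk(tar_in, tar_out, process, ramdisk="/run/shm"):
    """Untar into a scratch dir on the ramdisk, process each file, tar the results."""
    # Assumption: use the ramdisk only if it exists and is writable.
    usable = os.path.isdir(ramdisk) and os.access(ramdisk, os.W_OK)
    work = tempfile.mkdtemp(prefix="mytemp_", dir=ramdisk if usable else None)
    try:
        # Step 1: untar to the ramdisk (the archive is assumed to be trusted).
        with tarfile.open(tar_in, "r") as tar:
            tar.extractall(work)
        # Process every extracted file in place; all I/O now hits memory.
        for root, _dirs, files in os.walk(work):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as fh:
                    data = fh.read()
                with open(path, "wb") as fh:
                    fh.write(process(data))
        # Step 2: tar the output back to disk in one sequential write.
        with tarfile.open(tar_out, "w") as tar:
            for name in sorted(os.listdir(work)):
                tar.add(os.path.join(work, name), arcname=name)
    finally:
        # Always free the ramdisk space, even if processing failed.
        shutil.rmtree(work, ignore_errors=True)
```

A call such as process_via_ramdisk("in.tar", "out.tar", bytes.upper) would upper-case every file of in.tar and write the results to out.tar. Cleaning up in the finally block matters because anything left on a tmpfs keeps occupying RAM.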