Multiple buffers on the same file


The procedure is as follows.

  1. Filter a huge File.txt file (FASTQ format, if you are interested) line by line through file streaming in C (a minimal sketch of such a filter follows this list).

  2. After each filtering process, the output is a filtered_i.txt file.

  3. Repeat steps 1-2 with 1000 different filters.

  4. The expected result is 1000 filtered_i.txt files, with i from 1 to 1000.
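
For reference, a minimal sketch of one such streaming filter in C; the keep_line() predicate and the command-line filter index are hypothetical placeholders for the real filtering logic:

#include <stdio.h>
#include <string.h>

/* Hypothetical predicate standing in for the real filter logic,
   e.g. drop reads that contain an 'N' base. */
static int keep_line(const char *line)
{
    return strchr(line, 'N') == NULL;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <filter-index>\n", argv[0]);
        return 1;
    }

    char outname[64];
    snprintf(outname, sizeof outname, "filtered_%s.txt", argv[1]);

    FILE *in = fopen("File.txt", "r");
    FILE *out = fopen(outname, "w");
    if (!in || !out) {
        perror("fopen");
        return 1;
    }

    char line[4096];                        /* assumes lines shorter than 4 KB */
    while (fgets(line, sizeof line, in)) {  /* stream one line at a time */
        if (keep_line(line))
            fputs(line, out);
    }

    fclose(in);
    fclose(out);
    return 0;
}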

The question is:

Can I run these filtering processes in parallel?

My concern is that multiple buffers would be opened on File.txt if I run them in parallel. Is it safe to do so? Are there any potential drawbacks?

There are 3 answers below.

BEST ANSWER (score 1)

There is no best answer to your problem: here are some potential issues to take into consideration:

  • opening the same file multiple times for reading in the same or multiple processes does not pose any problems per se, but you might run out of file handles either at the process level or at the system level.
  • if the filters use a lot of RAM for their purpose, running too many of them in parallel may cause swapping, which will significantly slow down the whole process.
  • if the file is large but fits in memory, it is likely to stay in the cache, so running filters in sequence would not cause I/O delays, but running them in parallel may take advantage of multiple cores.
  • conversely, if the file does not fit in memory, running filters in parallel should increase overall throughput, especially if they consume the same area of the file at the same time.
  • if the process is I/O bound and the filters can consume one line at a time, calling them as functions in sequence in a simple loop, in a process that reads one line at a time, may be a simple solution (see the sketch after this list). Running multiple such processes in parallel, each handling a subset of all the filters, can further improve throughput.
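
To illustrate that last point, here is a minimal single-pass sketch in C, assuming the filters can be compiled into one program as predicate functions; NUM_FILTERS, dummy_filter and the file names are illustrative placeholders:

#include <stdio.h>
#include <string.h>

#define NUM_FILTERS 1000      /* one output file per filter */

typedef int (*filter_fn)(const char *line);

/* Placeholder: a real build would register 1000 distinct predicates. */
static int dummy_filter(const char *line)
{
    return strchr(line, 'N') == NULL;
}

int main(void)
{
    static filter_fn filters[NUM_FILTERS];
    static FILE *outputs[NUM_FILTERS];
    char name[64], line[4096];

    /* Keeping 1000 output files open at once requires a sufficient
       open-file limit (see the first bullet above). */
    for (int i = 0; i < NUM_FILTERS; i++) {
        filters[i] = dummy_filter;
        snprintf(name, sizeof name, "filtered_%d.txt", i + 1);
        outputs[i] = fopen(name, "w");
        if (!outputs[i]) {
            perror(name);
            return 1;
        }
    }

    FILE *in = fopen("File.txt", "r");
    if (!in) {
        perror("File.txt");
        return 1;
    }

    /* Single pass: each line is read from disk once and offered to every filter. */
    while (fgets(line, sizeof line, in)) {
        for (int i = 0; i < NUM_FILTERS; i++)
            if (filters[i](line))
                fputs(line, outputs[i]);
    }

    fclose(in);
    for (int i = 0; i < NUM_FILTERS; i++)
        fclose(outputs[i]);
    return 0;
}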

As with all optimisation problems, you should test different approaches and measure their performance.

Here is a simple script that runs the 1000 filters as 20 parallel jobs, each executing 50 filters in sequence:

#!/bin/bash
# 20 background jobs, each running 50 of the 1000 filters in sequence
for i in {0..19}; do (for j in {0..49}; do ./filter_$((j*20+i+1)); done) & done
wait    # block until all 20 jobs have finished
Answer (score 7)

I would advise against opening a file multiple times in parallel. This puts a lot of strain on the OS, and if all of your threads are streaming at once, your performance is going to drop significantly because of thrashing. You'd be much better off streaming the file serially, even for large files. If you do want a parallel solution, I'd suggest having one thread act as the "streamer": it reads a certain number of chunks from the file and then passes those chunks off to the other threads.
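
A rough sketch of that streamer/worker layout using POSIX threads (compile with -pthread); the queue size, NUM_WORKERS and the per-line processing are illustrative placeholders:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QUEUE_CAP   1024      /* bounded queue of lines */
#define NUM_WORKERS 4

static char *queue[QUEUE_CAP];
static int q_head, q_tail, q_count, done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void push(char *line)
{
    pthread_mutex_lock(&lock);
    while (q_count == QUEUE_CAP)
        pthread_cond_wait(&not_full, &lock);
    queue[q_tail] = line;
    q_tail = (q_tail + 1) % QUEUE_CAP;
    q_count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static char *pop(void)           /* returns NULL when the stream is exhausted */
{
    pthread_mutex_lock(&lock);
    while (q_count == 0 && !done)
        pthread_cond_wait(&not_empty, &lock);
    char *line = NULL;
    if (q_count > 0) {
        line = queue[q_head];
        q_head = (q_head + 1) % QUEUE_CAP;
        q_count--;
        pthread_cond_signal(&not_full);
    }
    pthread_mutex_unlock(&lock);
    return line;
}

static void *worker(void *arg)
{
    (void)arg;
    char *line;
    while ((line = pop()) != NULL) {
        /* apply one or more filters to the line here */
        free(line);
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, NULL);

    FILE *in = fopen("File.txt", "r");
    if (!in) {
        perror("File.txt");
        return 1;
    }

    char line[4096];
    while (fgets(line, sizeof line, in))   /* the single "streamer" */
        push(strdup(line));
    fclose(in);

    pthread_mutex_lock(&lock);
    done = 1;                              /* wake idle workers so they can exit */
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}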

Answer (score 0)

In any sane operating system, including all the big ones, it is possible and safe for different processes, or different threads of the same process, to open the same file, in parallel, for reading.

Operating systems also cache the file and perform read-ahead, so if two threads/processes read from the same file, the first one will read from disk, the OS will cache it, and the second one will read from cache.

The main thing you should worry about is matching the level of parallelism to the capabilities of the machine (number of processors, memory size) and to the requirements of the filters (whether the filtering threads are I/O bound or CPU bound, how much memory they consume, etc.).
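
For instance, the number of available processors can be queried at run time to pick a sensible degree of parallelism; a small sketch (sysconf(_SC_NPROCESSORS_ONLN) is a widely supported extension on Linux and the BSDs rather than strict POSIX):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of processors currently online: a reasonable upper bound
       on the number of CPU-bound filter jobs to run at once. */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpu < 1)
        ncpu = 1;
    printf("%ld\n", ncpu);
    return 0;
}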

Note that the memory used by the filters is the same memory the OS uses to cache the file, so if the filters take too much memory, you'll get a sort of thrashing where the OS evicts the cached file data and then reloads it each time.