Adding fast random access when reading and parsing a file using csv-parser and Node

I was using the csv-parser library to handle CSV parsing in Node. The file can be huge, ranging from 50,000 to 500,000 lines, maybe even larger. After the file is submitted to the server I have to perform some computations on the CSV, and for that I was thinking of dividing the CSV into chunks which I could then hand to worker threads. Each worker thread would be given a number of lines to skip and would then read the lines after that point, up to a particular limit. I create a read stream and pipe it into csv-parser with the skipLines option set. I tried to benchmark this, but I could see no measurable benefit between skipping lines and not skipping lines; sometimes reading the whole file was even faster than skipping 30,000 lines and reading only the rest.

My guess is that the problem lies with the read stream, which reads the data sequentially and is therefore not well suited to quick random access within the file.

Maybe my benchmarking is wrong?

Here is the piece of code:

const csv = require('csv-parser');
const bench = require('nanobench');

const fs = require('fs');
const parse = (number) => {
    // const results = [];
    fs.createReadStream('test.csv')
        .pipe(csv({ skipLines: number }))
        .on('data', (data) => {})
        .on('end', () => {});
};

const arr = [0, 30000, 15000, 15000/2, 15000/4, 15000/8];

arr.forEach(item => {
    bench(`CSV skip ${item} lines 40 times`, function(b) {
        b.start();
        for(let i = 0; i < 40; i++) parse(item);
        b.end();
    })
})

and here's the output:

# CSV skip 0 lines 40 times
ok ~4.14 ms (0 s + 4139981 ns)

# CSV skip 30000 lines 40 times
ok ~2.05 ms (0 s + 2054537 ns)

# CSV skip 15000 lines 40 times
ok ~2.7 ms (0 s + 2702328 ns)

# CSV skip 7500 lines 40 times
ok ~2.43 ms (0 s + 2434555 ns)

# CSV skip 3750 lines 40 times
ok ~1.97 ms (0 s + 1966652 ns)

# CSV skip 1875 lines 40 times
ok ~2.17 ms (0 s + 2172144 ns)

Is there any better way to achieve what I'm aiming for?

There is 1 answer below.

The problem is that even if you want to skip N lines, the parser still has to read and analyze every byte from the top of the file down to the N-th line. The further from the beginning the first line you actually need is, the more useless work has to be performed (Schlemiel the Painter's algorithm).

You can consider the following logic instead:

  • for each file, start with currentPosition = 0
  • seek to the offset currentPosition + chunkSize
  • from there, read bytes until you encounter a newline or EOF; call the number of bytes read i
  • hand a worker thread the parameters position = currentPosition and size = chunkSize + i
  • continue with currentPosition = currentPosition + size + 1

This way, each chunk will contain a whole number of lines.

In a thread, use the position and size parameters to read the whole chunk and parse it in memory (a sketch of the worker side follows the chunking code below).

In code (with threads.add standing in for however you hand chunks to your worker threads):

const fs = require('fs');

const filename = 'test.csv';
const chunkSize = 99999;            // target chunk size in bytes
const probe = Buffer.alloc(1000);   // scratch buffer used to look for the next newline

const size = fs.statSync(filename).size;
const fd = fs.openSync(filename, 'r');

let currentPos = 0;
while (currentPos < size) {
    let endPos = currentPos + chunkSize;

    if (endPos >= size) {
        // last chunk: take whatever is left
        endPos = size;
    } else {
        // peek at the bytes around the tentative boundary and extend the
        // chunk up to the next newline so no line is cut in half
        const bytesRead = fs.readSync(fd, probe, 0, probe.length, endPos);
        let i = 0;
        while (i < bytesRead && probe[i] !== 0x0A) i++;   // 0x0A is '\n'
        endPos += i;
    }

    // hand this chunk to a worker thread
    threads.add({ filename, position: currentPos, size: endPos - currentPos });

    currentPos = endPos + 1;   // +1 skips the newline between chunks
}

fs.closeSync(fd);
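
And on the worker side, here is a minimal sketch of what that could look like, assuming the chunk boundaries are passed in via the worker_threads module's workerData and reusing the csv-parser library from the question (the worker.js name, the workerData field names and posting back only a row count are illustrative choices, not part of any library API):

// worker.js -- minimal sketch of the worker side; names and fields are illustrative
const { workerData, parentPort } = require('worker_threads');
const fs = require('fs');
const csv = require('csv-parser');

const { filename, position, size } = workerData;

const rows = [];
fs.createReadStream(filename, {
    start: position,                 // first byte of this chunk
    end: position + size - 1         // 'end' is inclusive
})
    .pipe(csv({ headers: false }))   // chunks after the first contain no header row
    .on('data', (row) => rows.push(row))
    .on('end', () => parentPort.postMessage(rows.length));

On the main side, threads.add in the sketch above would then amount to something like new Worker('./worker.js', { workerData: { filename, position: currentPos, size: endPos - currentPos } }). Because the read stream is limited to the chunk's byte range via start and end, only those bytes are read and parsed, which is exactly the random access the skipLines approach could not give you.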