Adding fast random access when reading and parsing a file using csv-parser and Node

I was using the csv-parser library to handle CSV parsing in Node. The file can be huge, ranging from 50,000 to 500,000 lines, maybe even larger. After the file is submitted to the server I have to perform some computations on the CSV, and for that I was thinking of dividing the CSV into chunks which I could then hand to worker threads. Each worker thread would be given a number of lines to skip and would then read the lines after that point, up to a particular limit. I create a read stream and pipe it into csv-parser with the skipLines option set. I tried to benchmark this, but I could see no measurable benefit between skipping lines and not skipping lines; sometimes reading the whole file was even faster than skipping 30,000 lines and reading only the rest.

My guess is that the problem lies with the read stream, which reads the data sequentially and is therefore not well suited to quick random access within the file.

Maybe my benchmarking is wrong?

Here is the piece of code:

const csv = require('csv-parser');
const bench = require('nanobench');

const fs = require('fs');
const parse = (number) => {
    // const results = [];
    fs.createReadStream('test.csv')
        .pipe(csv({ skipLines: number }))
        .on('data', (data) => {})
        .on('end', () => {});
};

const arr = [0, 30000, 15000, 15000/2, 15000/4, 15000/8];

arr.forEach(item => {
    bench(`CSV skip ${item} lines 40 times`, function(b) {
        b.start();
        for(let i = 0; i < 40; i++) parse(item);
        b.end();
    })
})

and here's the output:

# CSV skip 0 lines 40 times
ok ~4.14 ms (0 s + 4139981 ns)

# CSV skip 30000 lines 40 times
ok ~2.05 ms (0 s + 2054537 ns)

# CSV skip 15000 lines 40 times
ok ~2.7 ms (0 s + 2702328 ns)

# CSV skip 7500 lines 40 times
ok ~2.43 ms (0 s + 2434555 ns)

# CSV skip 3750 lines 40 times
ok ~1.97 ms (0 s + 1966652 ns)

# CSV skip 1875 lines 40 times
ok ~2.17 ms (0 s + 2172144 ns)

Is there any better way to achieve what I'm aiming for?

There is 1 answer below.

The problem is that even if you want to skip N lines, the parser still has to read and analyze every byte from the top of the file down to the N-th line. The further from the beginning the first line you actually need is, the more useless work has to be performed (Schlemiel the Painter's algorithm).

You can consider the following logic instead:

  • for each file, start with currentPosition = 0
  • seek to the offset currentPosition + chunkSize
  • from there, read bytes until you encounter a newline or EOF; call the number of bytes read i
  • hand a worker thread the parameters position = currentPosition and size = chunkSize + i
  • continue with currentPosition = currentPosition + size + 1

This way, each chunk will contain a whole number of lines.

In a thread, use the position and size parameters to read the whole chunk and parse it in memory (a sketch of the worker side follows the chunking code below).

In code (with threads.add standing in for however you hand chunks to your worker threads):

const fs = require('fs');

const filename = 'test.csv';
const chunkSize = 99999;            // target chunk size in bytes
const probe = Buffer.alloc(1000);   // scratch buffer used to look for the next newline

const size = fs.statSync(filename).size;
const fd = fs.openSync(filename, 'r');

let currentPos = 0;
while (currentPos < size) {
    let endPos = currentPos + chunkSize;

    if (endPos >= size) {
        // last chunk: take whatever is left
        endPos = size;
    } else {
        // peek at the bytes around the tentative boundary and extend the
        // chunk up to the next newline so no line is cut in half
        const bytesRead = fs.readSync(fd, probe, 0, probe.length, endPos);
        let i = 0;
        while (i < bytesRead && probe[i] !== 0x0A) i++;   // 0x0A is '\n'
        endPos += i;
    }

    // hand this chunk to a worker thread
    threads.add({ filename, position: currentPos, size: endPos - currentPos });

    currentPos = endPos + 1;   // +1 skips the newline between chunks
}

fs.closeSync(fd);
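
And on the worker side, here is a minimal sketch of what that could look like, assuming the chunk boundaries are passed in via the worker_threads module's workerData and reusing the csv-parser library from the question (the worker.js name, the workerData field names and posting back only a row count are illustrative choices, not part of any library API):

// worker.js -- minimal sketch of the worker side; names and fields are illustrative
const { workerData, parentPort } = require('worker_threads');
const fs = require('fs');
const csv = require('csv-parser');

const { filename, position, size } = workerData;

const rows = [];
fs.createReadStream(filename, {
    start: position,                 // first byte of this chunk
    end: position + size - 1         // 'end' is inclusive
})
    .pipe(csv({ headers: false }))   // chunks after the first contain no header row
    .on('data', (row) => rows.push(row))
    .on('end', () => parentPort.postMessage(rows.length));

On the main side, threads.add in the sketch above would then amount to something like new Worker('./worker.js', { workerData: { filename, position: currentPos, size: endPos - currentPos } }). Because the read stream is limited to the chunk's byte range via start and end, only those bytes are read and parsed, which is exactly the random access the skipLines approach could not give you.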