I was using the csv-parser library to handle CSV parsing in Node. The files can be huge, ranging from 50,000 to 500,000 lines, maybe even larger. I have to perform some computation on the CSV after it is submitted to the server, so I was thinking of dividing the CSV into chunks which I could then hand to worker threads. Each worker thread would be given a number of lines to skip and would then read the lines after that point, up to a particular limit. I create a read stream and pipe it into csv-parser with the skipLines option set. I tried to benchmark this, but could find no visible benefit between skipping lines and not skipping them; even reading the whole file was sometimes faster than reading only the last 30,000 lines.
My guess is that the problem is the read stream, which reads the data sequentially and is therefore not suited to quick random access within the file.
Maybe my benchmarking is wrong?
Here is the piece of code:
const csv = require('csv-parser');
const bench = require('nanobench');
const fs = require('fs');

const parse = (number) => {
  fs.createReadStream('test.csv')
    .pipe(csv({ skipLines: number }))
    .on('data', (data) => {})
    .on('end', () => {});
};
const arr = [0, 30000, 15000, 15000 / 2, 15000 / 4, 15000 / 8];
arr.forEach(item => {
  bench(`CSV skip ${item} lines 40 times`, function (b) {
    b.start();
    for (let i = 0; i < 40; i++) parse(item);
    b.end();
  });
});
And here's the output:
# CSV skip 0 lines 40 times
ok ~4.14 ms (0 s + 4139981 ns)
# CSV skip 30000 lines 40 times
ok ~2.05 ms (0 s + 2054537 ns)
# CSV skip 15000 lines 40 times
ok ~2.7 ms (0 s + 2702328 ns)
# CSV skip 7500 lines 40 times
ok ~2.43 ms (0 s + 2434555 ns)
# CSV skip 3750 lines 40 times
ok ~1.97 ms (0 s + 1966652 ns)
# CSV skip 1875 lines 40 times
ok ~2.17 ms (0 s + 2172144 ns)
Is there any better way to do what I am aiming for?
The problem is that even if you want to skip N lines, the parser still has to read and analyze every byte from the top of the file down to the N-th line. The further from the beginning that line is, the more useless work is performed (Schlemiel the Painter's algorithm).
You can consider the following logic instead:

1. Pick a target chunkSize and set currentPosition = 0.
2. Starting at byte currentPosition + chunkSize, read i more bytes until you encounter a newline or EOF.
3. Hand out a chunk with position = currentPosition and size = chunkSize + i.
4. Set currentPosition = currentPosition + size + 1 (the + 1 skips the newline itself) and repeat from step 2 until EOF.
This way, each chunk will contain a whole number of lines.
In a worker thread, use the parameters position and size to read the whole chunk and parse it in memory.