Best way to read from a big CSV file without loading everything into memory using JavaScript


I'm using Atom/Electron to build an app that shows visualisations over video, based on data. Each video has a corresponding CSV file with information for each frame. The videos are about 100 minutes long, so the files contain plenty of data!

The problem I'm having is that it takes a couple of seconds to load and parse the file. Most of the time this is not a problem, but I need to make playlists from parts of videos, and loading the whole CSV file each time a video changes is not a viable option.

I've been looking at file-streaming options such as fast-csv, but I couldn't find a way to start reading from an arbitrary part of the file.

EDIT: The fs documentation (quoted below) describes how to read a byte range. In that case, the question becomes: how can I know which byte corresponds to the position I want in the file?

options can include start and end values to read a range of bytes from the file instead of the entire file. Both start and end are inclusive and start at 0.
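For illustration, a minimal sketch of those options (the file name and byte range are arbitrary), which also shows the difficulty for CSV data:

    const fs = require('fs');

    // Read bytes 1024..2047 of the file (both bounds are inclusive).
    // The catch for CSV: byte 1024 will usually fall mid-row, so the
    // first chunk would still need to be scanned forward to the next
    // newline before it can be parsed safely.
    const stream = fs.createReadStream('video-data.csv', {
      start: 1024,
      end: 2047,
      encoding: 'utf8',
    });

    stream.on('data', (chunk) => {
      console.log(chunk);
    });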

What do you think would be the best and most performant approach to this situation?

Concretely:

Is there a way to start reading a stream from an arbitrary part of a CSV file?

Is there another storage method that would let me solve this problem better?

UPDATE:

In the end, I solved this by storing the data in a file in binary format. Since I know how many columns the file has, I can read straight from the relevant segment of the file without any performance penalty.
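For reference, a minimal sketch of that kind of lookup (the column count, file name, and all-floats row layout are assumptions for illustration, not the actual format used):

    const fs = require('fs');

    // Assumed layout: every frame is one fixed-size record of
    // COLUMNS 64-bit floats, so frame N starts at byte N * bytesPerRow.
    const COLUMNS = 10; // hypothetical column count
    const bytesPerRow = COLUMNS * 8;

    function readFrames(path, firstFrame, frameCount, callback) {
      fs.open(path, 'r', (err, fd) => {
        if (err) return callback(err);
        const buffer = Buffer.alloc(frameCount * bytesPerRow);
        const position = firstFrame * bytesPerRow;
        fs.read(fd, buffer, 0, buffer.length, position, (err, bytesRead) => {
          fs.close(fd, () => {});
          if (err) return callback(err);
          const rows = [];
          const fullRows = Math.floor(bytesRead / bytesPerRow);
          for (let r = 0; r < fullRows; r++) {
            const row = [];
            for (let c = 0; c < COLUMNS; c++) {
              row.push(buffer.readDoubleLE((r * COLUMNS + c) * 8));
            }
            rows.push(row);
          }
          callback(null, rows);
        });
      });
    }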


2 Answers

BEST ANSWER

Per my comment, SQLite seems to be what you're looking for. It may not be your permanent solution in the long run, but it will certainly work for the time being while you decide whether to stick with it or code your own solution.

Inner Workings of SQLite

SQLite is optimized to the core, but it has three major features that make it execute faster than normal disk reads, especially of CSV files:

  1. The entire database (every database you create) is stored in a single file, not spread across multiple files or records.
  2. That file is paged into 1,024-byte (1 KB) chunks, which lets you jump easily around the data.
  3. (Really part of 2) The entire database and paging system is one massive B-tree, so finding any given piece of data usually takes fewer than 10 jumps. In layman's terms: extremely fast!

If you are really interested in understanding the full scope of all of this, I have found no better explanation than this amazing blog post by Julia Evans.
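For the asker's Electron use case, here is a minimal sketch of a range query using the better-sqlite3 Node bindings (the package choice and the table/column names are my assumptions for illustration; any SQLite binding would do):

    const Database = require('better-sqlite3');
    const db = new Database('frames.db');

    // One row per video frame; the primary key makes range
    // lookups an indexed B-tree search rather than a full scan.
    db.exec(`CREATE TABLE IF NOT EXISTS frames (
      frame INTEGER PRIMARY KEY,
      x REAL,
      y REAL,
      label TEXT
    )`);

    // Fetch only the frames needed for one playlist segment.
    const getRange = db.prepare(
      'SELECT * FROM frames WHERE frame BETWEEN ? AND ?'
    );
    const rows = getRange.all(1500, 1750); // e.g. frames 1500-1750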

Possible Drawbacks

Besides the inner workings, SQLite is designed to run client side, on the user's machine. If that is not a viable setup, there are workarounds. SQLite can back a web server, for example, but it really thrives in a standalone or mixed installation. Also remember that every client's computer is different; one computer may process records faster than the next, but in general you do not need to worry, since client-side machines are usually under little load.

  • Standalone requires everything to be on the client's side. This is typically how SQLite is used. I have used it for games in the past, utilizing sqlite4java's API to connect to the database from Java; the API made the whole experience feel like PHP and MySQL on a server. You may need to find other APIs, since SQLite itself is written in C.
  • A mixed installation is set up the same way as standalone, but you code into your program a link to an actual server. For the games I helped make, we tracked things like scores and user data, then periodically passed them to an actual server in the background when we could get a connection. This also works in reverse: you can start the user with nothing, download everything they need on first run, and from then on keep it up to date with what is on the server.

Summary

SQLite will work for what you need, but it may require a little homework to set up in the fashion you need. sqlite4java, for example, is easy to install but confusing to learn because its documentation is so poor; Stack Overflow got me through it, though. SQLite is also a "use it and forget it" type of installation. So, to answer your question: it will handle 25 rows a second like cake; you only need to worry about optimizing your own code, not SQLite.

ANSWER

I would highly recommend Papa Parse for this. It allows streaming a CSV row by row, and each row can be processed as JSON keyed by the headers in the file.

Within a config object passed to the parsing function, you can supply a 'step' parameter: a function that is run for each row of the file as the parser steps through it.
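A minimal sketch of that config in Node (the file name and the handleFrame callback are hypothetical; in a browser you would pass a File object instead of a read stream):

    const fs = require('fs');
    const Papa = require('papaparse');

    Papa.parse(fs.createReadStream('video-data.csv'), {
      header: true, // each row arrives keyed by column name
      step: (results) => {
        // Called once per row; results.data holds the parsed row.
        handleFrame(results.data);
      },
      complete: () => console.log('Done parsing.'),
    });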

Note: Papa Parse can also be configured to use a worker thread for increased performance when handling very large CSVs.

http://papaparse.com/docs