I have a large dataset (a text file of nearly 4 GB) that I would like to work with as a pandas DataFrame. I can read the file, but loading all of the data takes a couple of minutes.
So, I would like to leverage the speed of C using the Cython library.
I am having trouble finding out how to read a text file into a pandas dataframe using Cython.
Any guidance would be helpful.
Read the text file once, then save it in a file format with faster I/O (e.g. HDF5 or pickle) and load from that file afterwards. You'll most likely see a 10x-20x improvement in load time.
There's a rough comparison of I/O speed and disk usage for each file format in the official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations
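As a minimal sketch of the read-once-then-cache idea, assuming a comma-separated file and using pickle as the faster format: the helper name `load_with_cache` and the file names `data.txt` and `data.pkl` are made up for illustration, and the tiny file written here only stands in for the real 4 GB dataset.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(text_path, cache_path, **read_csv_kwargs):
    """Return a DataFrame, parsing the text file only if no cache exists.

    Hypothetical helper: the first call pays the slow text-parsing cost and
    writes a pickle; later calls load the pickle directly.
    """
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_csv(text_path, **read_csv_kwargs)
    df.to_pickle(cache_path)
    return df


# Small stand-in for the real dataset, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "data.txt")
cache_path = os.path.join(tmp_dir, "data.pkl")
with open(text_path, "w") as f:
    f.write("a,b\n1,x\n2,y\n3,z\n")

first = load_with_cache(text_path, cache_path)   # parses the text, writes the pickle
second = load_with_cache(text_path, cache_path)  # loads the pickle, skips parsing
```

This sketch never checks whether the text file has changed since the cache was written, so delete the pickle whenever the source data is updated. Extra keyword arguments such as `sep` or `dtype` are passed through to `pd.read_csv`.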