I have a single pd.DataFrame with 4.3 million rows and 2 columns, like so:
Id Features
27693 [1.6555750043281372e-09, -6.016701912292214e-2...]
27694 [-1.5324687581597672e-32, 1.0946759676292507e-4...]
Features is a column that stores 512-position numpy arrays. I need this structure physically stored on disk so it can be loaded on demand, but I'm not sure of the best way to achieve feasible load times. Currently, my solution is to split the DataFrame into 9 equally sized partitions (~500,000 rows each) and save them to feather files.
Loading these 9 files consistently takes me around 21.609 seconds. Hypothetically, let's say this time-to-load needs to be as fast as it possibly can be, while time-to-save isn't an issue.
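For reference, a minimal sketch of the current approach described above (partition, write to feather, read back and concatenate); the file names and the assumption that Id is a regular column are mine:

```python
import numpy as np
import pandas as pd

# Sketch of the current setup: split the frame into 9 partitions and
# write each one to a feather file (file names are assumptions).
def save_partitions(df: pd.DataFrame, n_parts: int = 9) -> None:
    for i, part in enumerate(np.array_split(df, n_parts)):
        # feather requires a default RangeIndex, so reset it per partition
        part.reset_index(drop=True).to_feather(f"features_{i}.feather")

def load_partitions(n_parts: int = 9) -> pd.DataFrame:
    parts = [pd.read_feather(f"features_{i}.feather") for i in range(n_parts)]
    return pd.concat(parts, ignore_index=True)
```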
Are there better formats or techniques to efficiently load large DataFrames with numpy rows into memory?
It depends on your data: if it's binary, you could use a byte array. If not, convert the features to numpy arrays and save them with pickle (or numpy's native binary format), storing the index and column names separately. A pandas DataFrame carries more serialization overhead than a plain numpy array.
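A minimal sketch of this idea, assuming Id is a column and every Features entry has length 512 (file names and helper names are my own): stack the feature vectors into one contiguous (n_rows, 512) array, save it with numpy, and keep the Ids and column names in a small pickle on the side.

```python
import pickle

import numpy as np
import pandas as pd

def save_fast(df: pd.DataFrame) -> None:
    # Stack the 512-long vectors into a single contiguous 2D array.
    features = np.stack(df["Features"].to_numpy())        # shape (n_rows, 512)
    np.save("features.npy", features)
    # Keep the small metadata (Ids, column names) in a separate pickle.
    with open("meta.pkl", "wb") as f:
        pickle.dump({"ids": df["Id"].to_numpy(), "columns": list(df.columns)}, f)

def load_fast() -> pd.DataFrame:
    # np.load on a contiguous .npy is a single bulk read;
    # mmap_mode="r" would defer reading until rows are touched.
    features = np.load("features.npy")
    with open("meta.pkl", "rb") as f:
        meta = pickle.load(f)
    # Rebuild the original layout: one row per Id, Features holding a 512-vector.
    return pd.DataFrame({"Id": meta["ids"], "Features": list(features)})
```

Reconstructing the object column on load costs a little time; if downstream code can work with the 2D array plus the Id vector directly, skipping that last step is faster still.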