I have some .gz files and they contain data on some timeseries. Naturally, I would like to do some timeseries analysis on this.
I tried this:
import gzip
f = gzip.open('data.csv.gz', 'r')
file_content = f.read()
print(file_content)
But it was still loading after 20 minutes, so I manually stopped it.
My question is: how should I read this? I have some ideas, such as using Dask or Spark, or should I just yield the lines?
I tried searching the internet for industry standards but didn't find a clear answer.
You can use Dask as follows:
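Dask reads gzipped CSVs lazily, so nothing is pulled into memory until you actually compute. A minimal sketch, assuming data.csv.gz is a plain CSV with a header row (blocksize=None is needed because gzip files can't be split into parallel chunks):

import dask.dataframe as dd

# gzip is not a splittable format, so the file is read as a single
# block; operations stay lazy until a result is actually requested
df = dd.read_csv('data.csv.gz', compression='gzip', blocksize=None)
print(df.head())

df.head() only computes a small preview; heavier work stays deferred until you call .compute().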
Apache Spark also supports reading .gz files directly (it might be overkill for small datasets).
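For completeness, a hedged PySpark sketch under the same assumption of a headered CSV (the appName is arbitrary). Spark decompresses .gz transparently, but since gzip isn't splittable the whole file lands in a single partition, which limits parallelism:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('gz-timeseries').getOrCreate()

# Spark handles the gzip decompression itself; a .gz file still
# produces only one partition because the format can't be split
df = spark.read.csv('data.csv.gz', header=True, inferSchema=True)
df.show(5)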
Yielding lines: If you’re writing a function to process the file, you can use a generator that yields lines one at a time. This is memory-efficient, since only one line is held in memory at any moment.
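A minimal sketch of that approach, again assuming data.csv.gz holds text lines. Opening with mode 'rt' gives a decompressed text stream, and iterating the file object reads it line by line:

import gzip

def read_lines(path):
    # 'rt' yields a decompressed text stream; iterating it pulls
    # one line into memory at a time instead of the whole file
    with gzip.open(path, 'rt') as f:
        for line in f:
            yield line.rstrip('\n')

for line in read_lines('data.csv.gz'):
    ...  # process each line here (hypothetical placeholder)

This also explains why your original attempt stalled: f.read() decompresses the entire file into memory in one go, while iterating streams it.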