I am trying to sift through a big database that is compressed as a .zst file. I am aware that I can simply decompress it and then work on the resulting file, but that uses up a lot of space on my SSD and takes 2+ hours, so I would like to avoid that if possible.
Often when I work with large files I stream them line by line with code like
with open(filename) as f:
    for line in f.readlines():
        do_something(line)
I know gzip has this
with gzip.open(filename,'rt') as f:
    for line in f:
        do_something(line)
but it doesn't seem to work with .zst, so I am wondering if there are any libraries that can decompress and stream the decompressed data in a similar way. For example:
with zstlib.open(filename) as f:
    for line in f.zstreadlines():
        do_something(line)
Knowing which package to use and what the corresponding docs are can be a bit confusing, as there appear to be several Python bindings to the actual Zstandard library.
Below, I am referring to the library by Gregory Szorc, which I installed from `conda`'s default channel (even though the docs say to install with `pip`, which I don't do unless there is no other way, as I like my conda environments to remain usable). I am only inferring that this version is the one from G. Szorc, based on the comments in the `__init__.py` file. Thus, I think that the corresponding documentation is here.
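Since it is easy to mix up the various bindings, one way to double-check which distribution is actually installed is to read its packaging metadata. The sketch below is only an illustration: the helper name `describe_distribution` is hypothetical, and it assumes the package is registered under the distribution name `zstandard`, which may not hold for every binding or install method.

```python
from importlib import metadata


def describe_distribution(name):
    """Return basic metadata for an installed distribution, or None.

    Hypothetical helper; `name` is the distribution name as known to
    the packaging metadata (assumed to be "zstandard" in this thread).
    """
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        # Nothing is installed under that distribution name.
        return None
    return {
        "version": dist.version,
        # These fields are optional in the metadata and may be None.
        "author": dist.metadata.get("Author"),
        "home_page": dist.metadata.get("Home-page"),
    }
```

Calling `describe_distribution("zstandard")` should then show the author and home page recorded for the installed package, if the installer wrote that metadata.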
In any case, quick test after install:
Produces:
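A quick round-trip check of this kind could look like the sketch below. It is not the exact test I ran: the helper name `round_trip` and the line format are made up for illustration, and it assumes the opener behaves like `gzip.open`, i.e. it accepts `(path, mode)` with text modes `'wt'` and `'rt'`. As far as I can tell, `zstandard.open` fits that description.

```python
def round_trip(path, opener, n_lines=10_000):
    """Write n_lines text lines with `opener`, then stream them back.

    `opener` is any callable with a gzip.open-like signature
    (path, mode); zstandard.open is assumed to qualify.
    Returns the number of lines read back.
    """
    with opener(path, "wt") as f:
        for i in range(n_lines):
            f.write(f"foo {i} bar\n")

    count = 0
    with opener(path, "rt") as f:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory as a list.
        for line in f:
            if not line.startswith("foo "):
                raise ValueError(f"unexpected line: {line!r}")
            count += 1
    return count
```

With the library discussed here, the call would be something like `round_trip("test.zst", zstandard.open)` after `import zstandard`; passing `gzip.open` instead exercises the same code path with gzip.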
Notes:
1. `mode='rb'`, same as a regular file. The underlying file is always written in binary mode, but if we use text mode for `open`, then according to `open`'s doc, "(...) an `io.TextIOWrapper` if opened for reading or writing in text mode".
2. Iterate over `f`, not `readlines()`.
3. From the inline docstring, they make it sound like `readlines()` returns a list of lines from the file, i.e. the whole thing is slurped into memory. With the iterator, it is more likely that only portions of the file are in memory at any moment (in `zstd`'s buffer).

Addendum
About notes 2 and 3 above: I tested empirically, by changing the number of lines to 100 million and comparing the memory usage of two versions (using `htop`):
Streaming version: no bump in memory usage.
Readlines version: bump in memory usage by a few GBs.
This may be specific to the version installed (1.5.5).
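For anyone who wants to repeat the comparison without watching `htop`, here is a rough in-process sketch using `tracemalloc`. The function names are hypothetical, and the opener is again assumed to have a `gzip.open`-like signature (which I believe `zstandard.open` has). Note that `tracemalloc` only sees allocations made through Python's allocator, so buffers held inside the C library may not be counted; the numbers are therefore not directly comparable to what `htop` reports.

```python
import tracemalloc


def count_streaming(path, opener):
    """Count lines by iterating over the file object."""
    with opener(path, "rt") as f:
        return sum(1 for _ in f)


def count_readlines(path, opener):
    """Count lines via readlines(), which builds a list of all lines."""
    with opener(path, "rt") as f:
        return len(f.readlines())


def peak_memory(read_func, path, opener):
    """Run read_func and return (line_count, peak_traced_bytes)."""
    tracemalloc.start()
    try:
        n_lines = read_func(path, opener)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return n_lines, peak
```

Comparing `peak_memory(count_streaming, path, opener)` with `peak_memory(count_readlines, path, opener)` on the same file should show the second peak growing with the file size, while the first stays roughly flat.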