What's the recommended approach to integrate zipped files into Foundry? I can see 3 options:
1. Unzip on the box (if there is such an option), and use Data Connection to ingest the unzipped files
2. Use some Data Connection plugin (if one exists) to unzip files during ingestion
3. Ingest the zipped files and have some transform unzip them.
Generally I would recommend against 1 & 2. I often even do the opposite of 1 & 2 -- I zip files before ingesting them and never have them in their unzipped form anywhere in a Foundry dataset.
If the files are merely compressed with gzip or bzip2, but are not tarballs, then Foundry allows you to access them transparently, as if they were not compressed at all. For instance, this works for an example dataset into which I uploaded a single file, test1.csv.bz2.
However, this breaks for tarballs or other archiving formats where multiple files are compressed into a single archive. So if you have the option to arrange things so that each file is compressed individually like this, that's the easiest and likely the most efficient way.
Otherwise I would recommend approach 3 -- extract the archives in-memory and then write out whatever results you've computed as parquet into the downstream dataset.
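
To make "extract the archives in-memory" a bit more concrete, here is a minimal sketch using only the Python standard library. It is not Foundry-specific: `rows_from_zip` and `build_demo_archive` are hypothetical helper names, and I'm assuming the archive is a zip containing UTF-8 CSV files. In a real transform the bytes would come from the raw file in the input dataset, and the resulting rows would be written out as parquet rather than collected into a list.

```python
import csv
import io
import zipfile


def rows_from_zip(archive_bytes):
    """Yield (member_name, row_dict) for every CSV inside a zip held in memory.

    Nothing is written to disk: the archive is opened from a BytesIO buffer
    and each member is streamed and parsed directly.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            # Skip directories and anything that is not a CSV (assumption).
            if info.is_dir() or not info.filename.lower().endswith(".csv"):
                continue
            with archive.open(info) as member:
                # Assumes UTF-8 encoded CSVs with a header row.
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    yield info.filename, row


def build_demo_archive():
    """Build a small zip in memory, purely for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("a.csv", "id,name\n1,foo\n2,bar\n")
        archive.writestr("notes.txt", "ignored")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, row in rows_from_zip(build_demo_archive()):
        print(name, row)
```

Because the function is a generator, rows are produced one at a time, so only the compressed archive itself needs to fit in memory, not the fully extracted contents.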