Databricks Autoloader file processing issue


I have zip files in my container, and I get one or more new files every day. As they come in, I want to process them. I have some questions.

  1. Can I use the Databricks Autoloader feature to process zip files? Are zip files supported by Autoloader?

  2. What settings need to be enabled to use Autoloader? I have my container and a SAS token.

  3. Once a zip file is processed (unzipped, with each file inside it read), I should not read that zip file again. How can I do this with Autoloader? Is there a specific setting for it?

  4. Are there any samples available? I'm new to this area and trying to get more info.


There are 2 answers below.

Answer 1:

Unfortunately, processing zip files with Azure Databricks is not possible. Auto Loader supports two modes for detecting new files: directory listing and file notification.

Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
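As a minimal sketch of what that looks like in a Databricks notebook (PySpark, using the notebook's built-in `spark` session); the container path, schema location, and file format here are placeholders rather than anything from the question:

```python
# Minimal Auto Loader source sketch (PySpark). Paths and file format are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")                        # Auto Loader's Structured Streaming source
    .option("cloudFiles.format", "json")         # format of the incoming files (assumption)
    .option("cloudFiles.schemaLocation",         # where Auto Loader keeps the inferred schema
            "abfss://mycontainer@myaccount.dfs.core.windows.net/_schemas/landing")
    # .option("cloudFiles.useNotifications", "true")  # file notification mode instead of directory listing
    .load("abfss://mycontainer@myaccount.dfs.core.windows.net/landing/")
)
```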

Auto Loader can scale from loading data out of storage accounts that contain billions of files needing to be backfilled, to pipelines where millions of files are loaded in an hour.

For more information, you can refer to this Microsoft documentation.

Answer 2:

Autoloader can read compressed files directly. There is no need to unzip them, and no special Autoloader option is required; just configure it the same way as you would for uncompressed files.

Autoloader uses the checkpoint location to keep track of which files it has already processed, so the same file is never read twice.
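A minimal sketch of both points, assuming the files sit in an ABFS container and hold CSV data (the paths, file format, trigger, and table name are placeholders, not anything from the question): the stream is configured as it would be for uncompressed files, and the checkpointLocation option is what prevents a file from being read twice.

```python
# Sketch: Auto Loader configured exactly as for uncompressed files, with a
# checkpoint location so already-processed files are never read again.
# All paths, the file format, and the target table name are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")          # assumption: the files hold CSV data
    .option("cloudFiles.schemaLocation",
            "abfss://mycontainer@myaccount.dfs.core.windows.net/_schemas/landing")
    .load("abfss://mycontainer@myaccount.dfs.core.windows.net/landing/")
)

(
    df.writeStream
    .option("checkpointLocation",                # tracks which files have been ingested
            "abfss://mycontainer@myaccount.dfs.core.windows.net/_checkpoints/landing")
    .trigger(availableNow=True)                  # pick up new files, then stop
    .toTable("bronze.landing_files")             # hypothetical target Delta table
)
```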