How does Databricks Auto Loader identify new files when the cluster is not active?


If my cluster is not active and I upload 50 files to the storage location, where does Auto Loader keep track of these 50 files? Does it use a checkpoint location? If so, how can I set the checkpoint location in cloud storage so that these new files are identified? Can anyone explain the backend process used to identify new files while the cluster is not active?


There are 2 best solutions below

0
On

Auto Loader supports two modes for identifying new files to load: directory listing and file notification.

The Databricks documentation on file detection modes is at: configure-auto-loader-file-detection-modes

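The mode is controlled by Auto Loader options set on the stream.

If cloudFiles.useNotifications is set to false (the default), directory listing mode is used. If it is set to true, Auto Loader uses file notification mode, which relies on a notification and queue service (which one depends on the cloud provider).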

The options supported by each DBR version are listed in the Auto Loader options documentation.

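Whenever an Auto Loader stream starts, Databricks starts a daemon thread that identifies new files by consulting the existing file-tracking state stored in RocksDB at the checkpoint. Nothing runs while the cluster is inactive: in directory listing mode, the files uploaded in the meantime are found by listing the input directory the next time the stream starts, and in file notification mode their events wait in the queue until then. You can specify the checkpoint location on the Auto Loader stream. It is documented among the supported options as checkpointLocation.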
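As a rough illustration of where these options go, here is a minimal PySpark sketch. The helper names `auto_loader_options` and `start_auto_loader`, the argument names, and the choice of the `availableNow` trigger are my own assumptions for the example, not something from the question; the stream itself only runs on a Databricks cluster, where `spark` is already defined.

```python
def auto_loader_options(file_format, schema_location, use_notifications=False):
    """Build reader options for an Auto Loader (cloudFiles) stream.

    Hypothetical helper for illustration only.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # "false" -> directory listing mode, "true" -> file notification mode
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


def start_auto_loader(spark, source_path, target_table, checkpoint_path, options):
    """Start the stream. Needs a Databricks runtime; `spark` is the active session."""
    reader = spark.readStream.format("cloudFiles").options(**options)
    return (
        reader.load(source_path)
        .writeStream
        # The file-tracking state (RocksDB) is kept under this path.
        .option("checkpointLocation", checkpoint_path)
        # Assumption: process everything that is new, then stop the stream.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

On a cluster you would call `start_auto_loader(spark, <source path>, <table name>, <checkpoint path>, auto_loader_options("json", <schema path>))` with your own cloud storage paths. As long as the same checkpoint path is reused on every run, files uploaded while the cluster was down are picked up on the next start.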

0
On

This explains it really well.

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.

In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.
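To make the idea concrete, here is a toy model in plain Python of "list the directory, compare against state in the checkpoint". It is only a sketch of the concept: the JSON file `seen_files.json` and the function names are invented for this example, and the real Auto Loader uses RocksDB and records files as part of stream processing rather than at listing time.

```python
import json
import os
import tempfile


def discover_new_files(source_dir, checkpoint_dir):
    """Return files in source_dir not yet recorded in the checkpoint, then record them."""
    state_path = os.path.join(checkpoint_dir, "seen_files.json")
    seen = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            seen = set(json.load(f))

    current = set(os.listdir(source_dir))
    new_files = sorted(current - seen)

    # Persist the updated state so the next "start" skips these files.
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(state_path, "w") as f:
        json.dump(sorted(seen | current), f)
    return new_files


def demo():
    """Upload 50 files while nothing runs, start, restart, then upload one more."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "landing")
        checkpoint = os.path.join(root, "checkpoint")
        os.makedirs(source)
        for i in range(50):
            open(os.path.join(source, f"file_{i:02d}.json"), "w").close()

        first = discover_new_files(source, checkpoint)   # first start: all 50 are new
        second = discover_new_files(source, checkpoint)  # restart: nothing new
        open(os.path.join(source, "late.json"), "w").close()
        third = discover_new_files(source, checkpoint)   # only the late file
        return len(first), len(second), third
```

The point of the toy is that no process has to be alive while the files arrive. The state on disk plus a fresh listing is enough to work out what is new at the next start.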