If my cluster is not active and I have uploaded 50 files to the storage location, where will Auto Loader list these 50 files? Will it use a checkpoint location? If yes, how can I set the checkpoint location in cloud storage so that these new files are identified? Can anyone please explain the backend process that identifies these new files when my cluster is not active?
How does Databricks Auto Loader identify new files when the cluster is not active?
1.6k Views Asked by Asif Khan At
There are 2 best solutions below

This explains it really well.
As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.
In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.
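To make the idea concrete, here is a toy sketch of checkpoint-based file tracking. It is not Auto Loader's implementation: a JSON file stands in for the RocksDB store, and the function name `discover_new_files` and the file names are made up for illustration. It only shows why files uploaded while nothing is running are still picked up on the next run.

```python
import json
import os
import tempfile


def discover_new_files(listing, checkpoint_path):
    """Return files in `listing` that the checkpoint has not seen, then record them.

    Hypothetical helper: a JSON file plays the role of the key-value store
    that Auto Loader keeps in its checkpoint location.
    """
    seen = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            seen = set(json.load(f))

    new_files = sorted(set(listing) - seen)

    # Persist the updated state so the next run resumes from here.
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(seen | set(new_files)), f)
    return new_files


checkpoint = os.path.join(tempfile.mkdtemp(), "seen_files.json")

# First run: the stream is up and sees two files.
first_run = discover_new_files(["a.json", "b.json"], checkpoint)

# The cluster is stopped, c.json is uploaded, and then the stream starts again.
second_run = discover_new_files(["a.json", "b.json", "c.json"], checkpoint)

print(first_run)
print(second_run)
```

Nothing watches the storage location while the cluster is off. The state lives in the checkpoint, so the second run only has to compare the current listing against what was already recorded.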
Auto Loader supports two modes to identify new files to load: directory listing and file notification. The Databricks documentation on file detection modes is at: configure-auto-loader-file-detection-modes

The mode is controlled by Auto Loader options set on the stream. If cloudFiles.useNotifications is set to false, directory listing mode is used; otherwise Auto Loader uses queues (the queue service depends on the cloud provider). The options supported for each DBR version are listed under Auto Loader options.

Whenever an Auto Loader stream starts, Databricks starts a daemon thread that identifies new files by consulting the existing file-tracking state stored in RocksDB at the checkpoint. Files uploaded while the cluster was not active are therefore picked up the next time the stream starts. You can specify the checkpoint location on the Auto Loader stream; it is documented under the supported options as checkpointLocation.
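A minimal sketch of how these options might be wired together in PySpark. The helper names (`autoloader_options`, `start_stream`) and all paths are hypothetical; `cloudFiles.format`, `cloudFiles.useNotifications`, `cloudFiles.schemaLocation` and `checkpointLocation` are the documented option names. The stream itself only runs on a Databricks cluster, so it is wrapped in a function that takes the `spark` session as a parameter.

```python
def autoloader_options(file_format, schema_location, use_notifications=False):
    """Build the Auto Loader reader options (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,
        # "false" selects directory listing mode, "true" file notification mode.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
        # Where the inferred schema is tracked; assumed here to be needed
        # because no explicit schema is supplied.
        "cloudFiles.schemaLocation": schema_location,
    }


def start_stream(spark, source_path, target_path, checkpoint_path, options):
    """Start an Auto Loader stream; `spark` is the Databricks SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream.format("delta")
        # File-tracking state is kept under this location.
        .option("checkpointLocation", checkpoint_path)
        # Process everything available, then stop (assumes a runtime
        # that supports the availableNow trigger).
        .trigger(availableNow=True)
        .start(target_path)
    )


# Placeholder path; replace with your own cloud storage location.
opts = autoloader_options("json", "/mnt/example/_schemas")
print(opts)
```

On a cluster, the call would look something like `start_stream(spark, "/mnt/example/landing", "/mnt/example/bronze", "/mnt/example/_checkpoints/landing", opts)`. Reusing the same checkpoint path on every run is what lets the stream skip files it has already processed.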