If my cluster is not active and I upload 50 files to the storage location, where does Auto Loader keep track of these 50 files? Does it use a checkpoint location, and if so, how can I set the checkpoint location in cloud storage so that these new files are identified? Can anyone explain the backend process used to identify new files while my cluster is not active?
How does Databricks Auto Loader identify new files when the cluster is not active?
Asked by Asif Khan
Auto Loader supports two modes for identifying new files to load: directory listing mode and file notification mode.
The Databricks documentation on file detection modes is at: configure-auto-loader-file-detection-modes
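The mode is controlled by Auto Loader options set on the stream.
If cloudFiles.useNotifications is set to false (the default), directory listing mode is used; otherwise Auto Loader uses file notification mode, which reads file events from a queue (the queue service depends on the cloud provider in use).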
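As a rough illustration of how that option selects the mode, here is a minimal Python sketch. The helper name `auto_loader_options` and the `json` input format are assumptions made for the example; only the option keys `cloudFiles.format` and `cloudFiles.useNotifications` come from Auto Loader itself.

```python
def auto_loader_options(use_notifications: bool, file_format: str = "json") -> dict:
    """Build the cloudFiles options that select the file detection mode.

    Hypothetical helper for illustration; the file format is an assumption.
    """
    return {
        "cloudFiles.format": file_format,
        # "false" -> directory listing mode, "true" -> file notification mode
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


# Options for each of the two detection modes.
listing_opts = auto_loader_options(use_notifications=False)
notification_opts = auto_loader_options(use_notifications=True)
```

In a Databricks notebook, either dictionary could then be passed to the reader, for example `spark.readStream.format("cloudFiles").options(**listing_opts)`.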
All supported options for each DBR version are documented under Auto Loader options.
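Whenever an Auto Loader stream starts, Databricks starts a daemon thread that identifies new files by comparing what it discovers against the existing file-tracking state, which is stored in RocksDB at the checkpoint location. Nothing is detected while the cluster is inactive; files uploaded in the meantime are picked up the next time the stream starts, because they are not yet recorded in the checkpoint. You can specify the checkpoint location on the Auto Loader stream. It is documented among the supported options as checkpointLocation.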
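To show where the checkpoint location is set, here is a minimal sketch of an Auto Loader stream in Python. The function name `start_auto_loader`, the `json` input format, the Delta sink, and reusing the checkpoint path as the schema location are all assumptions for the example. `spark` is the session that a Databricks notebook provides, and the `availableNow` trigger assumes a reasonably recent runtime.

```python
def start_auto_loader(spark, source_path, target_path, checkpoint_path):
    """Start an Auto Loader stream that tracks processed files in checkpoint_path.

    Hypothetical helper; all paths are supplied by the caller.
    """
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # assumed input format
        # Assumption: keep the inferred schema next to the checkpoint.
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")  # assumed sink format
        # File-tracking state lives here, in cloud storage rather than on the cluster.
        .option("checkpointLocation", checkpoint_path)
        # Process everything that has arrived since the last run, then stop.
        .trigger(availableNow=True)
        .start(target_path)
    )
```

Because the checkpoint lives in cloud storage, it survives cluster restarts. Running this again after the 50 files were uploaded should load only the files that are not yet recorded there.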
This explains it really well.