Databricks AutoLoader - how to handle transactional Spark writes (_SUCCESS file) on Azure Data Lake Storage?


The Databricks Spark write method for parquet files (df.write.parquet) is transactional. After a successful write to Azure Data Lake Storage, a _SUCCESS file is created in the path where the parquet files were written.

Example of a folder on ADLS including the _SUCCESS file: [image: folder listing on ADLS showing the _SUCCESS file]

Is it possible to configure AutoLoader to load parquet files only when the write has completed successfully (the _SUCCESS file appeared in the folder)? In other words, if a folder listed by AutoLoader does not contain a _SUCCESS file, the parquet files in that folder should not be processed by AutoLoader.

I was looking for the right option in the documentation, but it seems none of the available options covers this.
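For context, here is a minimal sketch of the kind of AutoLoader stream I mean (the ADLS input path, schema location, checkpoint location, and target table name are placeholders):

# spark is already available in a Databricks notebook
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "<schema-location>")
      .load("<adls-input-path>"))

(df.writeStream
   .option("checkpointLocation", "<checkpoint-location>")
   .trigger(availableNow=True)
   .toTable("<target-table>"))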


There is 1 answer below.

Answer by DileeprajnarayanThumula:

I agree with @JayashankarGS. The AutoLoader feature in Databricks allows you to automatically load data from a path into a Delta table as new files are added to that path. However, there is no built-in option in AutoLoader to conditionally load only parquet files that have a corresponding _SUCCESS file in the folder.

If you want to ensure that only parquet files from a successful write (one with a _SUCCESS file) are loaded, wrap the AutoLoader logic in a conditional check: if the _SUCCESS file is found, load the parquet files; if it is not found, indicating an incomplete write, skip the loading step.

You can try the following:

parquet_path = "<Path/to/parq_files>"

# Use the Hadoop FileSystem API (via the JVM gateway) to check whether the
# _SUCCESS marker exists in the folder before loading it
hadoop_fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
success_file_exists = hadoop_fs.exists(
    spark._jvm.org.apache.hadoop.fs.Path(parquet_path + "/_SUCCESS")
)

Reference: apache spark - check if file exists
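Building on that check, here is a minimal sketch of how the conditional load could look (the stream options, checkpoint location, and table name are placeholders, not part of the answer above):

if success_file_exists:
    # _SUCCESS marker present: the write completed, so run the AutoLoader load
    (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "parquet")
       .option("cloudFiles.schemaLocation", "<schema-location>")
       .load(parquet_path)
       .writeStream
       .option("checkpointLocation", "<checkpoint-location>")
       .trigger(availableNow=True)
       .toTable("<target-table>"))
else:
    # No _SUCCESS marker: the write did not complete, so skip this run
    print(f"_SUCCESS not found in {parquet_path}, skipping load")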