I'm attempting to use Databricks Auto Loader to load only the CSV files whose names begin with a certain string pattern. The files are located in a directory /Dir1/Dir2 in an Azure blob container.
What I am currently trying looks like this, using the pathGlobFilter option.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("pathGlobFilter", "String_Pattern_*.csv")  # only pick up files with this name prefix
    .load("abfss://<container-path>/Dir1/Dir2/")
)
However, no files are loaded at all. I have verified that the files load properly under normal circumstances without any filter, and a pathGlobFilter of "*.csv" also loads all files, so I believe my glob syntax is incorrect but am unsure how to fix it.
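If it helps as a point of comparison, a non-streaming read along these lines could show which files a given glob actually picks up. This is only a sketch: I'm assuming the batch CSV reader's pathGlobFilter option uses the same matching rules as Auto Loader's, and <container-path> is a placeholder.
# Sketch of a batch-read check (assumes pathGlobFilter behaves the same
# for spark.read as it does for the cloudFiles stream above).
check_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("pathGlobFilter", "String_Pattern_*.csv")
    .load("abfss://<container-path>/Dir1/Dir2/")
)
print(check_df.count())  # 0 here would mean the glob matches nothing for a plain read either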
One other potential complication is that the CSV files are not stored directly in Dir1/Dir2 but in a series of subdirectories. For example, the path to a given CSV file might be "Dir1/Dir2/Dir3/Dir4/String_Pattern_572_638.csv". I have tried setting the filter to "/**/String_Pattern_*.csv" to account for this (see the sketch after the error message below), but I am told that the input path is empty:
details = "Cannot infer schema when the input path `abfss://<container-path>/Dir1/Dir2/` is empty. Please try to start the stream when there are files in the input path, or specify the schema."