How to load only files beginning with a certain string pattern with Databricks Auto Loader


I'm attempting to use Databricks Auto Loader to load only the csv files that begin with a certain string pattern. The files are located in a directory /Dir1/Dir2 in an Azure blob storage container.

The pattern I am currently trying looks like this, using the pathGlobfilter option:

df = (spark.readStream
  .format("cloudFiles")                                   # Auto Loader source
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("cloudFiles.includeExistingFiles", "true")      # also pick up files already in the directory
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", checkpoint_path)   # checkpoint_path is defined elsewhere in the notebook
  .option("pathGlobfilter", "String_Pattern_*.csv")       # only files matching this pattern
  .load("abfss://<container-path>/Dir1/Dir2/")
)

However, no files are loaded at all. I have verified that the files load properly under normal circumstances without any glob filter, and a filter of "*.csv" also loads all files properly, so I believe my glob filter syntax is incorrect, but I am unsure how to fix it.
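For comparison, the "*.csv" variant that does load everything is essentially the same reader with only the filter relaxed (a minimal sketch; checkpoint_path and the container path are the same as above):

# Same stream as above, with the glob filter relaxed to all csv files.
# This variant loads every file, so the option itself seems to be picked up.
df_all = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("pathGlobfilter", "*.csv")
  .load("abfss://<container-path>/Dir1/Dir2/")
)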

One other potential complication is that the csv files are not actually stored directly in Dir1/Dir2 but in a series of subdirectories; for example, the path to a particular csv file might be "Dir1/Dir2/Dir3/Dir4/String_Pattern_572_638.csv". I have tried setting the glob filter to "/**/String_Pattern_*.csv" to account for this (see the sketch after the error message), but I am told that the input path is empty:

details = "Cannot infer schema when the input path `abfss://<container-path>/Dir1/Dir2/` is empty. Please try to start the stream when there are files in the input path, or specify the schema."
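For reference, that attempt was the same reader with only the filter value changed (a sketch; everything else is as above):

# Same stream, with the filter prefixed to try to match files in nested
# subdirectories. This is the variant that raises the error above.
df_nested = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaLocation", checkpoint_path)
  .option("pathGlobfilter", "/**/String_Pattern_*.csv")
  .load("abfss://<container-path>/Dir1/Dir2/")
)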
