I'm attempting to use Databricks Auto Loader to load only the CSV files whose names begin with a certain string pattern. The files are located in a directory /Dir1/Dir2 in an Azure blob container.
What I am currently trying looks like this, using the pathGlobFilter option.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("pathGlobFilter", "String_Pattern_*.csv")  # only pick up files with this name prefix
    .load("abfss://<container-path>/Dir1/Dir2/")
)
However, no files are loaded at all. I have verified that the files load properly under normal circumstances without any filter, and a pathGlobFilter of "*.csv" also loads all files, so I believe my glob syntax is incorrect but am unsure how to fix it.
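If it helps as a point of comparison, a non-streaming read along these lines could show which files a given glob actually picks up. This is only a sketch: I'm assuming the batch CSV reader's pathGlobFilter option uses the same matching rules as Auto Loader's, and <container-path> is a placeholder.
# Sketch of a batch-read check (assumes pathGlobFilter behaves the same
# for spark.read as it does for the cloudFiles stream above).
check_df = (spark.read
    .format("csv")
    .option("header", "true")
    .option("pathGlobFilter", "String_Pattern_*.csv")
    .load("abfss://<container-path>/Dir1/Dir2/")
)
print(check_df.count())  # 0 here would mean the glob matches nothing for a plain read either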
One other potential complication is that the CSV files are not stored directly in Dir1/Dir2 but in a series of subdirectories. For example, the path to a given CSV file might be "Dir1/Dir2/Dir3/Dir4/String_Pattern_572_638.csv". I have tried setting the filter to "/**/String_Pattern_*.csv" to account for this (see the sketch after the error message below), but I am told that the input path is empty:
details = "Cannot infer schema when the input path `abfss://<container-path>/Dir1/Dir2/` is empty. Please try to start the stream when there are files in the input path, or specify the schema."