How to pass Azure ADLS Storage Account Name and Container Name to the spark.readStream


I have two storage accounts (STORAGE_ACCOUNT_A and STORAGE_ACCOUNT_B) under the same resource group, and I have set up a Spark Structured Streaming job with Auto Loader:

df = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.tenantId", "XXXX")
    .option("cloudFiles.subscriptionId", "XXXX")
    .option("cloudFiles.resourceGroup", "XXXX")
    .option("cloudFiles.clientId", "XXXX")
    .option("cloudFiles.clientSecret", "XXXX")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(schema)
    .load(source_path))

But when I point source_path at the other storage account, it fails with the error below.

java.lang.IllegalStateException: The container in the file event {"create":{"bucket":"CONTAINER@STORAGE_ACCOUNT_B","key":"workbench/data/Landing/table/file.csv","size":800,"eventTime":1706107083508,"sequencer":"0000000000000000000000000002def9000000000000279c","newerThan$default$2":false}} is different from expected by the source: CONTAINER@STORAGE_ACCOUNT_A.

Auto Loader is unintentionally consuming file-creation events raised at the resource-group level, which breaks ingestion from the newly migrated storage account. How can we restrict Auto Loader to consume only the file events/queue belonging to the migrated storage account, so that data is processed correctly?
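One approach that is commonly suggested for this situation (an assumption on my part, not something confirmed in this thread) is to pre-provision the Event Grid subscription and Azure Queue Storage queue for each storage account yourself, and pin each stream to its own queue with the `cloudFiles.queueName` option, so Auto Loader skips its automatic notification setup and never sees events from other accounts. The queue name and paths below are hypothetical placeholders; a minimal sketch:

```python
# Sketch: one Auto Loader option dict per storage account, each pinned to
# its own pre-provisioned queue via cloudFiles.queueName. With an explicit
# queue, Auto Loader does not create resource-group-level notification
# resources itself, so events from other storage accounts are not consumed.
def autoloader_options(queue_name: str) -> dict:
    """Build Auto Loader options that read events only from `queue_name`."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.useNotifications": "true",
        "cloudFiles.queueName": queue_name,  # hypothetical queue name
    }

opts_b = autoloader_options("queue-for-storage-account-b")

# Usage on a cluster (not runnable outside Spark):
# df = (spark.readStream.format("cloudFiles")
#       .options(**opts_b)
#       .option("header", "true")
#       .schema(schema)
#       .load(source_path))
```

This keeps one stream (and one checkpoint) per storage account, which also avoids the "container ... is different from expected by the source" check failing when a checkpoint created against STORAGE_ACCOUNT_A is reused for STORAGE_ACCOUNT_B.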


1 Answer

Answered by KKL:

This should really be a comment, but I don't have enough reputation to comment yet.

What you are trying to achieve does not seem to be supported in file notification mode.

You could try one of these two approaches:
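The answer is cut off here, so the two approaches are not shown. One fallback that is often used when file notification mode gets in the way (my assumption, not necessarily what the answerer intended) is directory listing mode, where Auto Loader lists the source path itself instead of consuming queue events, so notification resources at the resource-group level no longer matter:

```python
# Sketch: directory listing mode. With useNotifications off (the default),
# Auto Loader discovers new files by listing the input path, so no Event
# Grid subscription or queue is involved at all.
listing_opts = {
    "cloudFiles.format": "csv",
    # False is the default; set explicitly here for clarity.
    "cloudFiles.useNotifications": "false",
}

# On a cluster (not runnable outside Spark):
# df = (spark.readStream.format("cloudFiles")
#       .options(**listing_opts)
#       .option("header", "true")
#       .schema(schema)
#       .load(source_path))
```

Directory listing is slower on very large directories than notification mode, but it sidesteps the cross-account event problem entirely.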