I am using the following code to read two datasets (stored in a data lake) using SAS tokens. I can read each dataset successfully, but when I join them, an authentication error is raised.
Steps:
1. Set Spark config for the 1st dataset (using the SAS token for the 1st dataset).
2. Read the dataset.
3. Display the dataset.
4. Set Spark config for the 2nd dataset (using the SAS token for the 2nd dataset).
5. Read the dataset.
6. Display the dataset.
7. Join and show the datasets.
This works fine through step 6 but fails at step 7. I believe the cause is step 4: when I set the config for the 2nd dataset, the config for the 1st dataset gets overwritten, so when step 7 performs an action that uses dataset 1, the authentication error occurs.
import json
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test-app").getOrCreate()
datalake_name = "MY_STORAGE_ACCOUNT"
container_name = "MY_CONTAINER_NAME"
################# Common Configs ###########################
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
################ Dataframe 1 Read ##############################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_data"
df1 = spark_session.read.format("parquet").load(target_file_path)
################# Dataframe 2 Read ###############################
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")
target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_data"
df2 = spark_session.read.format("parquet").load(target_file_path)
df3 = df2.join(df1)
df3.show()
Has anyone faced this issue before? What is the correct way to do this? Please advise.
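Setting the Spark configuration for the second dataset overwrites the configuration for the first one, because both SAS tokens are written to the same key (fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net). Spark evaluates lazily, so the join reads the first dataset again using the second token, which is the likely cause of the authentication error. If both datasets are in the same container, generate a single SAS token at the container level.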
Then set the Spark configuration once with the container-level SAS token, read both datasets, and join the two data frames.
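The approach can be sketched roughly as follows, reusing the config keys and folder names from the question. The helper names (`sas_conf`, `abfss_path`, `read_and_join`) are hypothetical and used only for illustration, and the sketch assumes the SAS token was generated at the container level with read and list permissions covering both folders.

```python
def sas_conf(account: str, sas_token: str) -> dict:
    """Build the ABFS fixed-SAS settings for one storage account (hypothetical helper)."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "SAS",
        f"fs.azure.sas.token.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider",
        # The key depends only on the account name, so a second token set
        # for the same account replaces the first one.
        f"fs.azure.sas.fixed.token.{host}": sas_token,
    }


def abfss_path(container: str, account: str, relative_path: str) -> str:
    """Build an abfss:// URI for a path inside a container (hypothetical helper)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


def read_and_join(spark, account: str, container: str, sas_token: str):
    """Set the config once with a container-level token, then read and join."""
    for key, value in sas_conf(account, sas_token).items():
        spark.conf.set(key, value)

    df1 = spark.read.format("parquet").load(
        abfss_path(container, account, "folder_1/test_data"))
    df2 = spark.read.format("parquet").load(
        abfss_path(container, account, "folder_2/test_data"))

    # Same join as in the question: no join condition is given.
    return df2.join(df1)
```

With the session from the question, this would be called as `read_and_join(spark_session, datalake_name, container_name, "CONTAINER LEVEL SAS TOKEN").show()`, where the token string is a placeholder. Because the token is set once and never replaced, both reads are still authorized when the join is finally executed.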
Because a single token now covers both folders, the join should complete successfully.