PySpark: reading and joining two datasets using user-delegated SAS tokens in Databricks


I am using the following code to read two datasets (stored on a data lake) using SAS tokens. I can read each dataset successfully, but when I join them, an authentication error is raised.

Steps:

  1. Set the Spark config for the 1st dataset (using the SAS token for the 1st dataset).
  2. Read the dataset.
  3. Display the dataset.
  4. Set the Spark config for the 2nd dataset (using the SAS token for the 2nd dataset).
  5. Read the dataset.
  6. Display the dataset.
  7. Join and show the datasets.

It works fine up to step 6 but fails at step 7. I believe the reason is that in step 4, when I set the config for the 2nd dataset, the config for the 1st dataset gets overwritten; since Spark reads lazily, step 7 is the first point where an action actually reads dataset 1 again, and that is when the authentication error occurs.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "MY_STORAGE_ACCOUNT"
container_name = "MY_CONTAINER_NAME"

#################  Common Configs  ###########################

spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

################   Dataframe 1 Read   ##############################

spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")

target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_data"
df1 = spark_session.read.format("parquet").load(target_file_path)

#################  Dataframe 2 Read  ###############################

# This overwrites the fixed SAS token that was set for dataset 1 above.
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")

target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_data"
df2 = spark_session.read.format("parquet").load(target_file_path)

# Fails: this action re-reads dataset 1 from storage, but only the second token is configured now.
df3 = df2.join(df1)
df3.show()

Has anyone else faced this issue before? What is the correct way to do this? Please advise.

1 Answer

Answer from Bhavani:

Setting the Spark configuration for the second dataset overwrites the configuration for the first, which is likely the cause of the authentication error when joining the two data frames. If the datasets are in the same container, generate the SAS token at the container level, as shown below:

[Screenshot: generating a container-level SAS token in the Azure portal]

Set the Spark configuration with the generated SAS token and join the two data frames using the code below:

spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.<containerName>.<storageAccountName>.blob.core.windows.net", "<SAS token>")
file_path = "folder1/cars.parquet"
file_path2 = "folder2/mt cars.parquet"
df2 = spark.read.format("parquet").load("wasbs://[email protected]/" + file_path)
df1 = spark.read.format("parquet").load("wasbs://[email protected]/" + file_path2)
df3 = df1.join(df2)
df3.show()
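
If you want to keep the abfss driver from the question, the same fix applies: both folders sit in one container, so a single container-level SAS token covers both reads and the fixed-token config is set exactly once, never overwritten. A minimal sketch reusing the question's placeholder names:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "MY_STORAGE_ACCOUNT"
container_name = "MY_CONTAINER_NAME"

# Same configs as in the question, but with one container-level SAS token
# that covers both folders, set exactly once.
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "CONTAINER-LEVEL SAS TOKEN")

base_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net"
df1 = spark_session.read.parquet(f"{base_path}/folder_1/test_data")
df2 = spark_session.read.parquet(f"{base_path}/folder_2/test_data")

# The join action now authenticates every read with the same token.
df2.join(df1).show()

Since nothing changes fs.azure.sas.fixed.token between the two reads, the join action can authenticate every file read with a valid token.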

Either way, the two data frames join successfully, as shown below:

[Screenshot: output of df3.show() with the joined data]
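
If the two datasets instead lived in different containers and genuinely needed separate tokens, newer hadoop-azure releases can also read the fixed SAS token from a container-scoped key, which would let both tokens coexist instead of overwriting each other. Whether the ABFS driver in your Databricks runtime honors this key format is an assumption to verify; a sketch with hypothetical container names container_one and container_two:

# Container-scoped fixed-token keys; container_one and container_two are
# hypothetical names. Support for this key format depends on the
# hadoop-azure version in your runtime -- verify before relying on it.
spark_session.conf.set(f"fs.azure.sas.fixed.token.container_one.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
spark_session.conf.set(f"fs.azure.sas.fixed.token.container_two.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")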