PySpark: reading and joining two datasets using user-delegated SAS tokens in Databricks


I am using the following code to read two datasets (stored on a data lake) using SAS tokens. I can read each dataset successfully, but when I join them, an authentication error is raised.

Steps:

  1. Set the Spark config for the 1st dataset (using the SAS token for the 1st dataset).
  2. Read the dataset.
  3. Display the dataset.
  4. Set the Spark config for the 2nd dataset (using the SAS token for the 2nd dataset).
  5. Read the dataset.
  6. Display the dataset.
  7. Join and show the datasets.

It works fine up to step 6 but fails at step 7. I believe the reason is that in step 4, when I set the config for the 2nd dataset, the config for the 1st dataset gets overwritten; since Spark reads lazily, step 7 is the first point where an action actually reads dataset 1 again, and that is when the authentication error occurs.

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "MY_STORAGE_ACCOUNT"
container_name = "MY_CONTAINER_NAME"

#################  Common Configs  ###########################

spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")

################   Dataframe 1 Read   ##############################

spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")

target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_1/test_data"
df1 = spark_session.read.format("parquet").load(target_file_path)

#################  Dataframe 2 Read  ###############################

# This overwrites the fixed SAS token that was set for dataset 1 above.
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")

target_file_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net/folder_2/test_data"
df2 = spark_session.read.format("parquet").load(target_file_path)

# Fails: this action re-reads dataset 1 from storage, but only the second token is configured now.
df3 = df2.join(df1)
df3.show()

Has anyone else faced this issue before? What is the correct way to do this? Please advise.

1 Answer

Answer from Bhavani:

Setting the Spark configuration for the second dataset overwrites the configuration for the first, which is likely the cause of the authentication error when joining the two data frames. If the datasets are in the same container, generate the SAS token at the container level, as shown below:

[Screenshot: generating a container-level SAS token in the Azure portal]

Set the Spark configuration with the generated SAS token and join the two data frames using the code below:

spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure.account.auth.type", "SAS")
spark.conf.set("fs.azure.sas.<containerName>.<storageAccountName>.blob.core.windows.net", "<SAS token>")
file_path = "folder1/cars.parquet"
file_path2 = "folder2/mt cars.parquet"
df2 = spark.read.format("parquet").load("wasbs://[email protected]/" + file_path)
df1 = spark.read.format("parquet").load("wasbs://[email protected]/" + file_path2)
df3 = df1.join(df2)
df3.show()
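
If you want to keep the abfss driver from the question, the same fix applies: both folders sit in one container, so a single container-level SAS token covers both reads and the fixed-token config is set exactly once, never overwritten. A minimal sketch reusing the question's placeholder names:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("test-app").getOrCreate()

datalake_name = "MY_STORAGE_ACCOUNT"
container_name = "MY_CONTAINER_NAME"

# Same configs as in the question, but with one container-level SAS token
# that covers both folders, set exactly once.
spark_session.conf.set(f"fs.azure.account.auth.type.{datalake_name}.dfs.core.windows.net", "SAS")
spark_session.conf.set(f"fs.azure.sas.token.provider.type.{datalake_name}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark_session.conf.set(f"fs.azure.sas.fixed.token.{datalake_name}.dfs.core.windows.net", "CONTAINER-LEVEL SAS TOKEN")

base_path = f"abfss://{container_name}@{datalake_name}.dfs.core.windows.net"
df1 = spark_session.read.parquet(f"{base_path}/folder_1/test_data")
df2 = spark_session.read.parquet(f"{base_path}/folder_2/test_data")

# The join action now authenticates every read with the same token.
df2.join(df1).show()

Since nothing changes fs.azure.sas.fixed.token between the two reads, the join action can authenticate every file read with a valid token.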

Either way, the two data frames join successfully, as shown below:

[Screenshot: output of df3.show() with the joined data]
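
If the two datasets instead lived in different containers and genuinely needed separate tokens, newer hadoop-azure releases can also read the fixed SAS token from a container-scoped key, which would let both tokens coexist instead of overwriting each other. Whether the ABFS driver in your Databricks runtime honors this key format is an assumption to verify; a sketch with hypothetical container names container_one and container_two:

# Container-scoped fixed-token keys; container_one and container_two are
# hypothetical names. Support for this key format depends on the
# hadoop-azure version in your runtime -- verify before relying on it.
spark_session.conf.set(f"fs.azure.sas.fixed.token.container_one.{datalake_name}.dfs.core.windows.net", "FIRST SAS TOKEN")
spark_session.conf.set(f"fs.azure.sas.fixed.token.container_two.{datalake_name}.dfs.core.windows.net", "SECOND SAS TOKEN")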