Cannot use Spark with AWS profile


I am a newbie in Spark.

I have set up AWS SSO on my local machine, and it works well.

Here is my test code, which uploads data using the boto3 library:

import os
from pathlib import Path
import boto3

path_obj = Path(file_path)
file_name = path_obj.name

# Use the AWS SSO profile for credentials
os.environ['AWS_PROFILE'] = 'my-profile'

# Upload the file
s3_client = boto3.client('s3')
s3_client.upload_file(file_path, s3_bucket, s3_path + f'/{file_name}')

But I cannot read this file with Spark. Here is my code:

import os
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAll([
    ("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4"),
    ("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com"),
    ("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
    ("spark.executor.cores", "1"),
])
spark = SparkSession.builder \
        .master('local[*]') \
        .config(conf=conf) \
        .appName("Api-app-1") \
        .getOrCreate()

# Build the s3a URI and read the CSV
s3_file_name = os.path.join('s3a://', s3_bucket, s3_path, file_name)
df = spark.read.format("csv").option("header", "true").load(s3_file_name)
df.show()
spark.stop()

It raises an error:

Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))

Could somebody help me with this problem or guide me on how to work with Spark and an AWS profile?

It works if I provide the AWS access keys, but in production we only use the profile for S3 access.
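For reference, something along these lines works when I hard-code the keys (the key values below are placeholders; fs.s3a.access.key / fs.s3a.secret.key are the standard s3a credential settings):

# Works, but hard-codes credentials -- not an option for production.
conf.set("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")      # placeholder
conf.set("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")  # placeholder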

Thank you so much

There is 1 best solution below.

Answered by glory9211:

The line below in your simple Python boto3 script sets the AWS_PROFILE environment variable, which your machine uses to provide credentials for S3:

os.environ['AWS_PROFILE'] = 'my-profile'

You will have to do the same for your Spark job before accessing S3:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ExampleApp") \
    .getOrCreate()

# Set AWS profile
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
hadoop_conf.set("fs.s3a.aws.profile", "your-profile-name")

Alternatively, you can set the environment variable in your shell:

export AWS_PROFILE=your-profile-name
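
If you would rather keep everything in Python, a sketch of the same idea is to set AWS_PROFILE in the driver process before the session is created and combine it with the provider setting above (the hadoop-aws package config from the question is still needed; 'my-profile' is the profile name from the boto3 script):

import os
from pyspark.sql import SparkSession

# Set the profile before the JVM starts so the driver process inherits it;
# ProfileCredentialsProvider resolves the profile name from AWS_PROFILE.
os.environ['AWS_PROFILE'] = 'my-profile'

spark = SparkSession.builder \
    .appName("ExampleApp") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.profile.ProfileCredentialsProvider") \
    .getOrCreate()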