Providing AWS_PROFILE when reading S3 files with Spark


I want my Spark app (Scala) to be able to read S3 files:

spark.read.parquet("s3://my-bucket-name/my-object-key")

On my dev machine, I can access S3 files using awscli with a pre-configured profile in ~/.aws/config or ~/.aws/credentials, like:

aws --profile my-profile s3 ls s3://my-bucket-name/my-object-key

But when trying to read those files from Spark, with the profile provided as an environment variable (AWS_PROFILE), I got the following error:

doesBucketExist on my-bucket-name: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

I also tried providing the profile as a JVM option (-Daws.profile=my-profile), with no luck.
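
For illustration, the two invocations looked roughly like this (the jar and class names here are placeholders):

# Attempt 1: profile via environment variable
AWS_PROFILE=my-profile spark-submit --class my.app.Main my-app.jar

# Attempt 2: profile via a driver JVM option
spark-submit --driver-java-options "-Daws.profile=my-profile" --class my.app.Main my-app.jar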

Thanks for reading.


There are 3 best solutions below


The solution is to set the Spark property fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider. If you can change the code that builds the Spark session, then something like:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .getOrCreate()
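
If the session is already built elsewhere, the same property can be set through its Hadoop configuration instead (a sketch; spark is assumed to be the existing SparkSession):

// Sketch: apply the same setting to an already-built session
spark.sparkContext.hadoopConfiguration.set(
    "fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.profile.ProfileCredentialsProvider")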

The other way is to provide the JVM option -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider.
NOTE the spark.hadoop prefix.
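
The same property can also be passed to spark-submit directly (a sketch; the jar name is a placeholder):

spark-submit \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider \
    my-app.jar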


If problems still arise after setting fs.s3a.aws.credentials.provider to com.amazonaws.auth.profile.ProfileCredentialsProvider and correctly setting AWS_PROFILE, it might be because you are using Hadoop 2, for which the above configuration is not supported.

Therefore, the only workaround I found was to upgrade to Hadoop 3.
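
For an sbt build, that roughly means pinning the Hadoop 3 artifacts (a sketch; 3.2.0 is only an illustration and must match the Hadoop version your Spark distribution runs on):

// build.sbt sketch: hadoop-aws must match Spark's Hadoop version
libraryDependencies ++= Seq(
    "org.apache.hadoop" % "hadoop-client" % "3.2.0" % "provided",
    "org.apache.hadoop" % "hadoop-aws"    % "3.2.0"
)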

Check this post and the Hadoop docs for more information.


I had two issues related to Spark + AWS compatibility.

First off, pyspark didn't see profiles specified in the ~/.aws/config file. I had to move them to ~/.aws/credentials for Spark to even acknowledge that the profile exists, at least from within Jupyter using pyspark.
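
Note that the two files use different section headers, which is easy to trip over when moving a profile (a sketch; the profile name and region are placeholders):

# ~/.aws/config
[profile my-profile]
region = us-east-1

# ~/.aws/credentials
[my-profile]
aws_access_key_id = ...
aws_secret_access_key = ...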

More importantly, SSO/identity-server auth is a recommended way to access AWS from local machines, but an SSO-based setup is not supported by the s3a credential provider chain. It's explained in an answer to another question.
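
One workaround sketch (my own, not from that answer): resolve the SSO credentials outside Spark and hand them to the default s3a chain as plain environment variables. This assumes a recent AWS CLI v2, which provides the export-credentials subcommand:

# Exports AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN
# into the current shell, where the s3a default chain can pick them up
eval "$(aws configure export-credentials --profile my-profile --format env)"
spark-submit my-app.jar  # placeholder jar name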