I am a newbie in Spark.
I have set up AWS SSO on my local machine and it works well.
Here is the code I use to test uploading data with the boto3 library:
import os
from pathlib import Path

import boto3

os.environ['AWS_PROFILE'] = 'my-profile'

path_obj = Path(file_path)
file_name = path_obj.name

# Upload the file
s3_client = boto3.client('s3')
s3_client.upload_file(file_path, s3_bucket, f'{s3_path}/{file_name}')
But I cannot read this file with Spark. Here is my code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.setAll([
    ("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-common:3.3.4"),
    ("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com"),
    ("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
    ("spark.executor.cores", "1"),
])
spark = SparkSession.builder \
    .master('local[*]') \
    .config(conf=conf) \
    .appName("Api-app-1") \
    .getOrCreate()

s3_file_name = os.path.join('s3a://', s3_bucket, s3_path, file_name)
df = spark.read.format("csv").option("header", "true").load(s3_file_name)
df.show()
spark.stop()
It raises an error:
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))
Could somebody help me with this problem or guide me on how to work with Spark and an AWS profile?
It does work if I provide the AWS access key and secret directly (roughly as in the snippet below), but in production we only use the profile for S3 access.
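For reference, this is roughly the key-based setup that works (property names are the standard S3A options; the values here are just placeholders):

conf.set("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
conf.set("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")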
Thank you so much
The line below from your simple Python boto3 script sets the AWS_PROFILE environment variable, which your machine uses to provide credentials for S3:
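os.environ['AWS_PROFILE'] = 'my-profile'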
You will have to do the same in your Spark job before accessing S3, i.e. set the profile before the SparkSession is created, for example:
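import os

# Set the profile before the SparkSession is created so the driver JVM
# (and the AWS SDK it uses) inherits AWS_PROFILE
os.environ['AWS_PROFILE'] = 'my-profile'

spark = SparkSession.builder \
    .master('local[*]') \
    .config(conf=conf) \
    .appName("Api-app-1") \
    .getOrCreate()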
Alternatively, you can set the environment variable on your machine in your shell before launching the job, for example:
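export AWS_PROFILE=my-profile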