Spark Session With Multiple S3 Roles


I have a Spark job that reads files from an S3 bucket, formats them, and places them in another S3 bucket. I'm using the SparkSession `spark.read.csv` and `spark.write.csv` functionality to accomplish this.

When I read the files, I need to use one IAM role (an assumed role), and when I write the files, I need to drop the assumed role and revert to my default role.

Is this possible within the same spark session? And if not, is there another way to do this?

Any and all help is appreciated!

1 Answer

The S3A connector in Hadoop 2.8+ supports per-bucket settings, so you can use different login options for different buckets.
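As a sketch, a per-bucket override might look like the following (the bucket name and keys are placeholders; the `fs.s3a.bucket.<bucket>.*` form overrides the matching base `fs.s3a.*` option for that one bucket, and in Spark these options are set with a `spark.hadoop.` prefix):

```properties
# Base (default) credentials, used for any bucket without an override
fs.s3a.access.key=BASE_ACCESS_KEY
fs.s3a.secret.key=BASE_SECRET_KEY

# Override just for the bucket "source-bucket" (placeholder name)
fs.s3a.bucket.source-bucket.access.key=OTHER_ACCESS_KEY
fs.s3a.bucket.source-bucket.secret.key=OTHER_SECRET_KEY
```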

At some point (possibly around then, certainly by Hadoop 3), the S3A connector gained the AssumedRoleCredentialProvider, which takes a set of full credentials and calls AssumeRole for a given role ARN, then interacts with S3 under that role instead.
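Combining this with per-bucket settings, a minimal sketch for the read side might look like this (the bucket name and role ARN are placeholders):

```properties
# Assume a role only for the source bucket
fs.s3a.bucket.source-bucket.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
fs.s3a.bucket.source-bucket.assumed.role.arn=arn:aws:iam::123456789012:role/read-role

# Full credentials used to make the AssumeRole call itself (the base login)
fs.s3a.bucket.source-bucket.assumed.role.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
```

Buckets without an override (such as the destination bucket) keep using the base `fs.s3a.*` credentials, which gives you the "read as assumed role, write as default role" split within one session.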

It should be a matter of:

  1. Make sure your Hadoop JARs are recent.
  2. Set the base settings with your full login.
  3. Add a per-bucket setting for the source bucket that uses the assumed-role credential provider with the chosen ARN.
  4. Make sure things work from the Hadoop command line before trying to get submitted jobs to work.
  5. Then submit the job.
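The last two steps above might look like this (bucket name and ARN are placeholders, and `my_job.py` is a hypothetical job script; note the `spark.hadoop.` prefix that passes S3A options through Spark):

```
# Step 4: verify the assumed-role read works from the Hadoop CLI first
hadoop fs -ls s3a://source-bucket/

# Step 5: pass the same settings to the Spark job
spark-submit \
  --conf spark.hadoop.fs.s3a.bucket.source-bucket.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
  --conf spark.hadoop.fs.s3a.bucket.source-bucket.assumed.role.arn=arn:aws:iam::123456789012:role/read-role \
  my_job.py
```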