Not able to read from an s3a path on EMR on EKS with PySpark code from JupyterLab

I'm trying to run the following code on the PySpark kernel from EMR on EKS (using a managed endpoint). I tried setting some s3a-related Spark configs, but they don't seem to have any effect.

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("S3 Read Example") \
    .getOrCreate()

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")

# Read data from S3 using the s3a path
s3_path = "s3a://bucket/file.parquet"

df = spark.read \
    .format("parquet") \
    .load(s3_path)

spark.stop()

I get the following error. Can someone help identify the issue?


Py4JJavaError: An error occurred while calling o119.load. : org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://com.bucket.name/file-path/file.snappy.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <com.bucket.name.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]

I tried to apply the following spark-defaults:

spark.conf.set("spark.hadoop.fs.s3a.path.style.access", "true")
spark.conf.set("fs.s3a.acl.default", "BucketOwnerFullControl")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
spark.conf.set("fs.s3a.impl","com.amazon.ws.emr.hadoop.fs.EmrFileSystem")

I'm not sure whether it's a Spark config issue.

How can we read an s3a path on Spark 3 with EMR on EKS?

There is 1 best solution below

Dotted bucket names aren't supported; AWS says they should only be used for websites, not as a store of data you work on in your applications.
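To illustrate why the dots matter, here is a rough sketch of the hostname check behind the error in the question. It assumes the certificate carries a wildcard entry of the form `*.s3.amazonaws.com` and that a wildcard covers exactly one leftmost DNS label; the function names are hypothetical and the matching is simplified compared to what a real TLS library does.

```python
def virtual_hosted_host(bucket: str) -> str:
    """Hostname used when the bucket name is placed in front of the S3 endpoint."""
    return f"{bucket}.s3.amazonaws.com"


def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified certificate name check: '*' stands for exactly one leftmost label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must be equal.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] != "*":
        return p_labels == h_labels
    # The wildcard label must be non-empty and every other label must match exactly.
    return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]


# A bucket without dots adds one label and is covered by the wildcard.
plain_ok = wildcard_matches("*.s3.amazonaws.com", virtual_hosted_host("mybucket"))

# The dotted bucket from the error adds three labels and is not covered.
dotted_ok = wildcard_matches("*.s3.amazonaws.com", virtual_hosted_host("com.bucket.name"))
```

With path-style access the bucket moves into the URL path, so the hostname stays `s3.amazonaws.com`, which the certificate lists directly.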

If you must try to use them, set fs.s3a.path.style.access to true. However, before that, try using EMR's own s3:// connector, which is the one they officially support.
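Both suggestions might look roughly like the sketch below. It assumes the option is passed while the session is being built (with the `spark.hadoop.` prefix so it reaches the Hadoop configuration) and that no session exists yet; on a managed endpoint the kernel may already have created one, in which case the setting would have to go into the endpoint or kernel configuration instead. The helper names are hypothetical; only the `fs.s3a.path.style.access` key and the `s3://` scheme come from the answer.

```python
def s3a_path_style_options() -> dict:
    """Spark options that ask the s3a connector for path-style requests."""
    return {"spark.hadoop.fs.s3a.path.style.access": "true"}


def to_emr_s3_uri(path: str) -> str:
    """Rewrite an s3a:// URI to the s3:// scheme handled by EMR's own connector."""
    prefix = "s3a://"
    return "s3://" + path[len(prefix):] if path.startswith(prefix) else path


def build_session(app_name: str, options: dict):
    """Create a SparkSession with the options applied before getOrCreate()."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Usage, with the placeholder path from the question:
# spark = build_session("S3 Read Example", s3a_path_style_options())
# df = spark.read.parquet(to_emr_s3_uri("s3a://bucket/file.parquet"))
```

Trying the `s3://` path first keeps the job on the connector EMR supports; the path-style option is only relevant if the `s3a://` scheme has to stay.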