Reading a csv.gz file from SageMaker using the PySpark kernel


I am trying to read a compressed CSV file in PySpark, but I am unable to read it with the PySpark kernel in SageMaker.

I can read the same file using pandas when the kernel is conda-python3 (in SageMaker).
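For comparison, here is a minimal sketch of the pandas read that works under the conda-python3 kernel (assuming s3fs is installed so pandas can resolve the s3:// URL):

import pandas as pd

# pandas infers gzip compression from the .gz extension and reads
# directly from S3 via s3fs
file1 = 's3://testdata/output1.csv.gz'
df = pd.read_csv(file1, sep='\t')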

What I tried:

# Attempt to read the gzipped, tab-separated file from S3
file1 = 's3://testdata/output1.csv.gz'
file1_df = spark.read.csv(file1, sep='\t')

Error message:

An error was encountered:
An error occurred while calling 104.csv.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 7FF77313; S3 Extended Request ID: 

Kindly let me know if I am missing anything.

1 Answer

Regarding the AmazonS3Exception: Access Denied (Status Code: 403; Error Code: AccessDenied) error:

There are several Hadoop connectors to S3, but only S3A is actively maintained by the Hadoop project itself. Apache Hadoop's original s3:// client is no longer included in Hadoop, and the s3n:// filesystem client is likewise no longer available; users must migrate to the newer s3a:// connector.
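A minimal sketch of that migration, assuming the S3A connector is on the cluster's classpath; the credential lines are only needed if the notebook's IAM role is not picked up automatically, and the key values are placeholders:

# Supply credentials through the Hadoop configuration if they are not
# resolved automatically (values below are placeholders):
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', '<ACCESS_KEY>')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '<SECRET_KEY>')

# Read the same file through the maintained s3a:// scheme
file1 = 's3a://testdata/output1.csv.gz'
file1_df = spark.read.csv(file1, sep='\t')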

See the Apache S3 Connectors documentation for reference.

PySpark reads .gz files automatically, as described in the official documentation; see the Spark Programming Guide.

# sc.textFile handles gzip-compressed input transparently
file1 = 's3://testdata/output1.csv.gz'
rdd = sc.textFile(file1)
rdd.take(10)  # inspect the first ten lines

To load the file into a DataFrame:

# Spark infers the gzip codec from the .csv.gz extension
df = spark.read.csv(file1, sep='\t')
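Optionally, common reader options can be chained on the same call; whether the file actually has a header row is an assumption here, not something stated in the question:

df = (spark.read
      .option('sep', '\t')
      .option('header', 'true')
      .option('inferSchema', 'true')
      .csv(file1))
df.show(5)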