I am trying to read a compressed CSV file in PySpark, but I am unable to read it in the PySpark kernel in SageMaker.
I can read the same file using pandas when the kernel is conda-python3 (in SageMaker).
What I tried:
file1 = 's3://testdata/output1.csv.gz'
file1_df = spark.read.csv(file1, sep='\t')
Error message :
An error was encountered:
An error occurred while calling 104.csv.
: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 7FF77313; S3 Extended Request ID:
Kindly let me know if I am missing anything.
There are other Hadoop connectors to S3, but only S3A is actively maintained by the Hadoop project itself. Apache Hadoop's original s3:// client is no longer included in Hadoop, and its s3n:// filesystem client is no longer available either: users must migrate to the newer s3a://.
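To illustrate the migration, only the URI scheme needs to change; the bucket and key stay the same (the path below is the asker's example path):

```python
# Rewrite the legacy s3:// URI to use the actively maintained s3a:// connector.
# Bucket and key are taken from the question; substitute your own.
legacy_path = 's3://testdata/output1.csv.gz'
s3a_path = 's3a://' + legacy_path[len('s3://'):]
print(s3a_path)  # s3a://testdata/output1.csv.gz
```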
I have attached a document for your reference: Apache S3 Connectors.
PySpark reads .gz files automatically, according to the documentation the Spark project provides; see the Spark Programming Guide for details.
To load the file into a DataFrame: