I am trying to access a parquet file in an S3 bucket using PySpark running locally via PyCharm. I have the AWS Toolkit configured in PyCharm and my access key and secret key added to ~/.aws/credentials, yet the credentials are not being picked up, and I get the error "Unable to load AWS credentials from any provider in the chain".

import os
import pyspark
from pyspark.sql import SparkSession


# Must be set before the SparkSession is created so the S3A packages are pulled in
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

spark = SparkSession.builder\
            .appName('Pyspark').getOrCreate()

my_df = spark.read\
    .parquet("s3a://<parquet_file_location>")  # using s3:// instead gives me a "no file system" error

my_df.printSchema()

Is there any alternative approach for running PySpark locally and accessing AWS resources?

Also, I should be able to use s3 in the parquet path, but that throws a "file system not found" error. Does any dependency or jar file need to be added to run PySpark locally?

There is one answer below.
If you set the secrets in the AWS_ environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) they will be picked up and then propagated with the job. Otherwise you can set them in spark-defaults.conf with the appropriate spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key entries.
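A minimal sketch of the second approach, passing the same fs.s3a.* properties on the SparkSession builder instead of spark-defaults.conf. Reading the keys from the AWS_ environment variables here is just an illustration, and the bucket path is a placeholder; it assumes the hadoop-aws / aws-java-sdk packages from the question are already on the classpath:

import os
from pyspark.sql import SparkSession

# The "spark.hadoop." prefix forwards these values into the Hadoop
# configuration, which is where the S3A connector reads its credentials.
spark = (
    SparkSession.builder
    .appName("Pyspark")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Placeholder path; replace with the real bucket/key.
df = spark.read.parquet("s3a://<parquet_file_location>")
df.printSchema()

Alternatively, simply exporting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the shell (or in the PyCharm run configuration) before launching is enough, since the S3A default credential chain checks those environment variables.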