I am trying to "debug" a PySpark script on an EMR (EC2) cluster (v7.0.0) by stepping through the code in PyCharm Professional.
The script lives on the master node of the EMR cluster and is run on YARN.
from pyspark import SparkConf, SparkContext

sc_conf = SparkConf()
sc_conf.setAppName(app_name)
sc_conf.setMaster('yarn')
sc = SparkContext(conf=sc_conf)  # creating the context itself works fine
Using a conda-installed pyspark (in its own environment), I can step through the initial part of the code (creating the SparkContext etc.), right up to the point where it starts reading data from S3, at which point I get this error:
com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
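For reference, the step that fails looks roughly like this (the bucket and prefix are placeholders, not the real path):

rdd = sc.textFile("s3://my-bucket/some-prefix/")  # placeholder S3 path
rdd.count()  # the EmrFileSystem class-not-found error is raised once the data is actually read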
I guess this is related to: com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found on PySpark script on AWS EMR
However, if I remove the self-installed Spark, pyspark can no longer be found and the script fails on the very first import:
from pyspark import SparkConf, SparkContext
Is there a way to solve this, or to sort out the configuration?
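My guess is that the script should pick up the Spark installation that EMR itself provides (so that the EMR Hadoop/S3 jars containing EmrFileSystem end up on the classpath) instead of the conda-installed pyspark, roughly along the lines below, but I am not sure this is the right approach (the /usr/lib/spark path and the use of findspark are my assumptions):

import os

# Assumption: EMR installs Spark under /usr/lib/spark on the master node.
os.environ["SPARK_HOME"] = "/usr/lib/spark"

import findspark  # installed into the conda environment
findspark.init()  # adds $SPARK_HOME/python and its bundled py4j to sys.path

from pyspark import SparkConf, SparkContext  # now resolves to the EMR-provided Spark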