Scenario:
I have set up a Spark cluster in my Kubernetes environment:
- Livy Pod for submission of jobs
- Spark Master Pod
- Spark Worker Pod for execution
What I want to achieve is as follows: I have a Jupyter notebook with a PySpark kernel running as a pod in the same environment. On execution of a cell, a Spark session is created and all my code gets executed through Livy POST requests to /statements. I was able to achieve this scenario (a rough sketch of the flow is below the note).
Note: There is no YARN, HDFS, or Hadoop in my environment. I am using only Kubernetes, Spark standalone, and Jupyter.
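For context, this is roughly what happens under the hood when a cell runs; sparkmagic performs these calls for me, and the service name, port and session id below are only illustrative assumptions about my cluster:

import requests

# Illustrative only: "livy" and 8998 are assumed to be the Livy service name and
# its default port inside the cluster; session 0 is assumed to already exist.
livy_url = "http://livy:8998"
headers = {"Content-Type": "application/json"}

# The code from an executed notebook cell is sent as a statement to the session.
resp = requests.post(
    f"{livy_url}/sessions/0/statements",
    json={"code": "spark.range(5).count()"},
    headers=headers,
)
print(resp.json())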
Issue: Now, when I run my PySpark code and it gets executed on the Spark worker, I would like to make the following available in that execution environment:
- environment variables which I have used in the notebook
- pip packages which I have used in the notebook
- or a custom virtualenv in which I could provide all the packages used together.

I am unable to do any of the above.
Things that I have tried out so far: Since I am using sparkmagic, I have tried to set environment variables using the following configuration keys, which I found in the documentation and other answers.
%%configure
{
  "conf": {
    "spark.executorEnv.TESTVAR": "value",
    "spark.appMasterEnv.TESTVAR": "value",
    "spark.driver.TESTVAR": "value",
    "spark.driverenv.TESTVAR": "value",
    "spark.kubernetes.driverenv.TESTVAR": "value",
    "spark.kubernetes.driver.TESTVAR": "value",
    "spark.yarn.executorEnv.TESTVAR": "value",
    "spark.yarn.appMasterEnv.TESTVAR": "value",
    "spark.workerenv.TESTVAR": "value"
  }
}
I have bunched the keys up here for reference (the values shown are placeholders); I tried each of the above options individually.
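This is roughly how I check whether the variable actually reaches the workers (run from the notebook; spark is the session object the PySpark kernel / Livy provides):

import os

# Value on the driver side of the Livy session.
print("driver:", os.environ.get("TESTVAR"))

# Value inside an executor task running on the Spark worker pod.
print("executor:",
      spark.sparkContext.parallelize([0])
           .map(lambda _: os.environ.get("TESTVAR"))
           .collect())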
I have also tried hitting the Livy pod's service name directly with a normal POST request (roughly as sketched below), but the variables are still not getting detected.
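The direct request looked roughly like this; the service name, port and value are placeholders:

import requests

# Assumption: the Livy service is reachable as "livy" on its default port 8998.
livy_url = "http://livy:8998"

session_body = {
    "kind": "pyspark",
    # Passed at session creation time, hoping it ends up on the executors.
    "conf": {"spark.executorEnv.TESTVAR": "dummyvalue"},
}
resp = requests.post(f"{livy_url}/sessions", json=session_body,
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.json())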
After this I tried setting the same property manually in spark-defaults.conf on the Spark cluster (a sketch of what I added is at the end of this post), but that did not work either. I would appreciate any help on the matter. This is also my first SO question, so please let me know if there are any issues with it.
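For completeness, this is roughly the line I added to spark-defaults.conf on the Spark master and worker pods (the value is a placeholder):

spark.executorEnv.TESTVAR    dummyvalue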