Can a PySpark kernel (JupyterHub) run in yarn-client mode?


My Current Setup:

  • Spark EC2 Cluster with HDFS and YARN
  • JupyterHub (0.7.0)
  • PySpark kernel with Python 2.7

The very simple code that I am using for this question:

rdd = sc.parallelize([1, 2])
rdd.collect()

The PySpark kernel works as expected in Spark standalone mode with the following environment variable in its kernel.json file:

"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"

However, when I try to run in yarn-client mode it gets stuck forever, and the output in the JupyterHub logs is:

16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

As described here, I have added the HADOOP_CONF_DIR environment variable to point to the directory containing the Hadoop configuration, and changed the PYSPARK_SUBMIT_ARGS --master property to "yarn-client". I can also confirm that no other jobs are running during this and that the workers are correctly registered.
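
For reference, a quick way to confirm that the kernel process actually received those settings is to print them from a notebook cell (a minimal check; the variable names are the ones set in my kernel.json):

import os

# Confirm the env vars defined in kernel.json reached the kernel process.
print(os.environ.get("HADOOP_CONF_DIR"))      # should point at the Hadoop conf directory
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))  # should contain "--master yarn-client"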

I am under the impression that it is possible to configure a JupyterHub notebook with a PySpark kernel to run on YARN, as other people have done it. If this is indeed the case, what am I doing wrong?


2 Answers

Best Answer

In order to have your PySpark kernel work in YARN mode, you'll have to do some additional configuration:

  1. Configure YARN for a remote connection by copying the hadoop-yarn-server-web-proxy-<version>.jar from your YARN cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your Jupyter instance (you need a local Hadoop installation).

  2. Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/.

  3. Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/.

  4. Set environment variables:

    • export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
    • export SPARK_HOME=<local spark directory>/spark-<version>
    • export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
    • export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
  5. Now you can create your kernel spec in the file /usr/local/share/jupyter/kernels/pyspark/kernel.json:

     {
       "display_name": "pySpark (Spark 2.1.0)",
       "language": "python",
       "argv": [
         "/opt/conda/envs/python35/bin/python",
         "-m",
         "ipykernel",
         "-f",
         "{connection_file}"
       ],
       "env": {
         "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
         "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
         "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
         "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
         "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
       }
     }
    
  6. Relaunch your JupyterHub and you should see the pyspark kernel. Note that the root user usually doesn't have permission to submit YARN jobs because of its low UID, so connect to JupyterHub as a different user. A quick smoke test is shown below.
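
Once the kernel starts, a minimal smoke test (assuming sc was created by pyspark/shell.py through PYTHONSTARTUP, as in the kernel spec above) looks like this:

# Run in a notebook cell; sc is created at kernel startup by pyspark/shell.py.
print(sc.master)                           # should print "yarn"
print(sc.parallelize(range(100)).sum())    # should print 4950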

Other Answer

I hope my case can help you.

I configure the master URL by simply passing it as a parameter:

import findspark
findspark.init()  # locate SPARK_HOME and put pyspark on sys.path
from pyspark import SparkContext

# Pass the master URL ("yarn-client") directly to the SparkContext constructor.
sc = SparkContext("yarn-client", "First App")
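
Note that the "yarn-client" master string is deprecated as of Spark 2.0 in favor of master "yarn". A roughly equivalent Spark 2.x sketch (assuming the same findspark setup) would be:

import findspark
findspark.init()

from pyspark.sql import SparkSession

# In Spark 2.x, use master "yarn"; client deploy mode is the default
# when the session is created inside the notebook process.
spark = (SparkSession.builder
         .master("yarn")
         .appName("First App")
         .getOrCreate())
sc = spark.sparkContext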