I have a Mesos cluster in the cloud with 3 masters and 10 slaves.

All the slaves appear active, but 0 resources are allocated in the Mesos Master UI (Agents detail page).

On the home page I can see 10 activated agents, but 9 are unreachable.

The jobs I try to run on the cluster get stuck in the RUNNING state forever.

Does Spark need to be up and running on every slave (by running start-slave.sh on each one), or does Mesos take care of that? What could be wrong?

No ports are blocked on the machines.

Edit:

It looks like the machine that launches the application is able to connect to the Master:

I0902 18:01:31.472944 14997 zookeeper.cpp:262] A new leading master (UPID=master@X:5000) is detected
I0902 18:01:31.473047 14993 sched.cpp:343] New master detected at master@X:5000
I0902 18:01:31.473348 14993 sched.cpp:363] No credentials provided. Attempting to register without authentication
I0902 18:01:31.475391 14994 sched.cpp:751] Framework registered with 9984df9d-0efb-4f83-bf6a-0cecb19b1a39-0002

It also tries to start a task, but the task gets stuck, and this behavior repeats in a cycle.

Answer by Hassan Ahmadkhani:

Two solutions for this problem:

  1. Install the Hadoop client on all Mesos slaves, then:

    • put spark-x.y.z.tar.gz in HDFS
    • in spark-defaults.conf : spark.executor.uri hdfs://nn:9000/path/spark-x.y.z.tar.gz
    • in spark-env.sh : export SPARK_EXECUTOR_URI=hdfs://nn:9000/path/spark-x.y.z.tar.gz
  2. Put spark-x.y.z.tar.gz in /path/in/os/ on every slave, then:

    • in spark-defaults.conf : spark.executor.uri /path/in/os/spark-x.y.z.tar.gz
    • in spark-env.sh : export SPARK_EXECUTOR_URI=/path/in/os/spark-x.y.z.tar.gz
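
The two options above differ only in where the executor tarball lives, so the settings can be sketched with a small helper. This is a minimal illustration, not part of Spark: the function name `executor_uri_settings` is hypothetical, and it assumes that "spark-conf" and "spark-env" mean conf/spark-defaults.conf and conf/spark-env.sh. The URIs are the placeholders from the answer, not real locations.

```python
from urllib.parse import urlparse


def executor_uri_settings(uri: str) -> dict:
    """Render the config lines for one executor tarball location.

    Hypothetical helper for illustration only; it just formats strings
    and does not check that the tarball exists.
    """
    scheme = urlparse(uri).scheme
    return {
        # Line for conf/spark-defaults.conf (assumed file name).
        "spark-defaults.conf": f"spark.executor.uri {uri}",
        # Line for conf/spark-env.sh (assumed file name).
        "spark-env.sh": f"export SPARK_EXECUTOR_URI={uri}",
        # Option 1 (hdfs://) relies on a Hadoop client on every slave;
        # option 2 (a plain path) relies on the file being on every slave.
        "needs_hadoop_client": scheme == "hdfs",
    }


if __name__ == "__main__":
    for uri in (
        "hdfs://nn:9000/path/spark-x.y.z.tar.gz",  # option 1
        "/path/in/os/spark-x.y.z.tar.gz",          # option 2
    ):
        settings = executor_uri_settings(uri)
        print(settings["spark-defaults.conf"])
        print(settings["spark-env.sh"])
        print("needs Hadoop client on slaves:", settings["needs_hadoop_client"])
```

Whichever option is used, the URI has to be resolvable from the slaves themselves, since that is where the tarball is fetched.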

Otherwise: in the Mesos UI, go to the agent tab -> sandbox -> stderr and check the error details.