I read the Cluster Mode Overview (link: https://spark.apache.org/docs/latest/cluster-overview.html) and I was wondering how the components such as the Driver, the Executors and the Worker nodes map onto the components of the Spark ecosystem, such as Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and the scheduling/cluster managers. Which of these components belong to the Driver, the Executors and the Worker nodes?
So basically my question is whether there is a link between these two figures: the components of Spark (figure 1) and the ecosystem of Spark (figure 2). If so, can somebody please explain to me what belongs to the drivers/executors/worker nodes?
The cluster manager in figure 1 (as mentioned in the question) corresponds to the Standalone Scheduler, YARN and Mesos in figure 2 (as mentioned in the question).
The cluster manager can be any one of the cluster/resource managers like YARN, Mesos, Kubernetes, etc.
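For example, which cluster manager is used is decided by the master URL you give Spark (normally via `spark-submit --master ...`). Here is a minimal Scala sketch, with the master hard-coded purely for illustration and the object name made up:

```scala
import org.apache.spark.sql.SparkSession

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // The master URL tells Spark which cluster/resource manager to use:
    // "local[*]"               -> run everything in one JVM (no cluster manager)
    // "spark://host:7077"      -> Spark's own Standalone scheduler
    // "yarn"                   -> YARN (cluster config picked up from HADOOP_CONF_DIR)
    // "mesos://host:5050"      -> Mesos
    // "k8s://https://host:443" -> Kubernetes
    val spark = SparkSession.builder()
      .appName("master-url-sketch")
      .master("local[*]") // hard-coded only for this sketch; usually supplied by spark-submit
      .getOrCreate()

    println(spark.sparkContext.master)
    spark.stop()
  }
}
```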
Nodes, or worker nodes, are the machines that are part of the cluster on which you want to run your Spark application in a distributed manner. You cannot relate them to anything on the Spark ecosystem diagram: nodes/worker nodes are actual physical machines, like your computer/laptop.
Now, the drivers and executors are the processes that run on the machines that are part of the cluster.
One of the nodes in the cluster is selected as the master/driver node, and this is where the driver process runs. It creates the SparkContext, runs your main method and splits your code up so it can be executed in a distributed fashion by creating jobs, stages and tasks.
The other nodes in the cluster act as worker nodes, and on these nodes executor processes run the tasks assigned to them by the driver process. A rough sketch of this is shown below.
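Here is a minimal Scala sketch of what that looks like from the application's point of view (the object name is made up and the `local[*]` master is only for illustration). Your `main` method runs in the driver process; the action at the end triggers a job, which the driver splits into stages and tasks that the executors on the worker nodes run:

```scala
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // main() runs in the driver process on the driver node.
    val spark = SparkSession.builder()
      .appName("driver-sketch")
      .master("local[*]") // illustrative; on a real cluster this comes from spark-submit
      .getOrCreate()
    val sc = spark.sparkContext // the SparkContext created by the driver

    // The driver only builds the plan here; the 8 partitions will later become 8 tasks.
    val rdd = sc.parallelize(1 to 1000, numSlices = 8)
      .map(_ * 2)      // narrow transformation: stays within the same stage
      .filter(_ % 3 == 0)

    // The action triggers a job; the driver splits it into stages and tasks,
    // and the executor processes on the worker nodes run those tasks.
    val result = rdd.sum()
    println(s"sum = $result, partitions/tasks = ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```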
Now coming to Spark Core: it is the component/framework that makes all of this communication, scheduling and data transfer between the driver node and the worker nodes happen, so you don't have to worry about any of those things and can just focus on your business logic to get the required work done.
Structured Streaming, Spark SQL, MLlib and GraphX are libraries implemented on top of Spark Core as the base, so they give you common functionality that makes your life easier. You would have Spark installed on all the nodes, i.e. the driver node and the worker nodes, and all of these components are available on those nodes by default.
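To make that concrete, here is a small Scala sketch (again with a made-up object name and an illustrative `local[*]` master) using Spark SQL/DataFrames. The higher-level API is different, but under the hood the query is still planned by the driver and executed as ordinary jobs, stages and tasks on the executors, exactly like the RDD code above:

```scala
import org.apache.spark.sql.SparkSession

object EcosystemSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ecosystem-sketch")
      .master("local[*]") // illustrative
      .getOrCreate()
    import spark.implicits._

    // Spark SQL / DataFrames: a higher-level API layered on top of Spark Core.
    val df = Seq(("alice", 34), ("bob", 29), ("carol", 41)).toDF("name", "age")
    df.createOrReplaceTempView("people")

    // The driver plans this query; the executors on the worker nodes run the tasks.
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```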
You cannot map the two figures onto each other exactly, because one shows how a Spark application is executed when you submit your code to the cluster, while the other just shows the various components that the Spark framework as a whole provides.