CPU core allocation in a Dataproc cluster in GCP


I am just trying to understand how CPU cores are allocated and utilised in a Dataproc cluster spun up in GCP.

When we create a Dataproc cluster, there is a master node, and besides that we can configure worker nodes. I observe that Hadoop is installed by default in the Dataproc cluster. Hadoop has the following daemons, which must be running:

  1. NameNode (runs on the master node)
  2. DataNode (runs on every worker node and occupies one CPU core)
  3. SecondaryNameNode (runs on ONE of the worker nodes and occupies one CPU core)

If we run YARN as the cluster manager, the following daemons must be running:

  1. ResourceManager (runs on the master node)
  2. NodeManager (runs on every worker node and occupies one CPU core)

If my understanding of the above statements is accurate, then an N-core worker node in the cluster is left with only N-2 usable cores (one core occupied by the DataNode and one by the NodeManager).

So in the case of an 8-core node, we can utilise only 6 cores, as in the sketch below.
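To make the assumption explicit, here is the per-node arithmetic I have in mind (a minimal sketch of my mental model; the one-core-per-daemon cost is my assumption, not verified Dataproc behaviour):

```python
# Sketch of my mental model; the one-core-per-daemon cost is an
# assumption of mine, not verified Dataproc behaviour.
CORES_PER_NODE = 8

# Daemons I believe run on every worker node, each occupying one core.
PER_WORKER_DAEMONS = ["DataNode", "NodeManager"]

usable_cores = CORES_PER_NODE - len(PER_WORKER_DAEMONS)
print(f"Usable cores per worker: {usable_cores}")  # prints 6
```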

Let us consider a cluster made of 4 worker nodes, each with 8 cores, and YARN as the cluster manager. We submit a Spark application to the cluster, requesting 5 cores per executor and 4 executors. Since YARN is the cluster manager, an ApplicationMaster (AM) will be launched for the Spark app that we submit. This AM will occupy one core on one of the worker nodes.
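For reference, this is roughly how the job is configured (a sketch only: the properties are the standard Spark ones, the app name is made up, and I disable dynamic allocation, which Dataproc enables by default as far as I know, so that the requested executor count stays fixed):

```python
from pyspark.sql import SparkSession

# Rough sketch of the submission settings; the app name is hypothetical.
spark = (
    SparkSession.builder
    .appName("core-allocation-test")
    .config("spark.executor.instances", "4")  # 4 executors requested
    .config("spark.executor.cores", "5")      # 5 cores per executor
    # Dataproc enables dynamic allocation by default (as far as I know);
    # disabling it keeps the executor count fixed at what we request.
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)
```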

So in this case, ideally one executor (with 5 cores) will be created on each worker node. On each worker node, two cores are allocated to the DataNode and the NodeManager, so the total allocated cores per node is 7 (5 for the executor, 1 for the NodeManager and 1 for the DataNode). Hence, one core is left free on each worker node.

Now the AM and the SecondaryNameNode will each be given one core on one of the 4 worker nodes.

So in this context, among the four nodes, two will be running at full capacity of the available CPU cores, and two will each have one core free.
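Tallying that expectation node by node (again just a sketch of what I anticipate; the worker names and the AM/SecondaryNameNode placement are hypothetical):

```python
CORES_PER_NODE = 8
EXECUTOR_CORES = 5

# What I expect each worker to host; worker names are hypothetical.
nodes = {f"worker-{i}": ["DataNode", "NodeManager", "Executor"]
         for i in range(4)}
nodes["worker-0"].append("ApplicationMaster")  # AM lands on some worker
nodes["worker-1"].append("SecondaryNameNode")  # SNN on another worker

for name, procs in nodes.items():
    used = sum(EXECUTOR_CORES if p == "Executor" else 1 for p in procs)
    print(f"{name}: {used}/{CORES_PER_NODE} cores used, "
          f"{CORES_PER_NODE - used} free")
# Expected: two workers at 8/8, two workers with 1 core free each.
```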

The above is what I anticipate.

But in reality, when the application is submitted to the Dataproc cluster, only three executors are spun up (instead of the four we request), and one complete node is dedicated to running the ApplicationMaster (AM).

May I request your help in understanding how the CPU core allocation is carried out here in the Dataproc cluster? In particular, I would like to know where my understanding of the allocation concept goes wrong.

We can even swap the GCP Dataproc cluster for an on-premise cluster if that gives a better demonstration.

Thanks
