Apache Tez tasks on hold at the Application Master

I have a Tez problem: when I run about 14 queries at the same time, some of them are delayed by more than 5 minutes, even though cluster utilization is only 14%.

This is the log message I am referring to:

INFO SessionState: [HiveServer2-Background-Pool: Thread-322319]: Get Query Coordinator (AM)            308.84s

My configuration is the following:

yarn.scheduler.maximum-allocation-mb = 188000
yarn.app.mapreduce.am.resource.mb = 16000
tez.am.resource.memory.mb = 8000
hive.tez.container.size = 8192
tez.runtime.io.sort.mb = 2048
tez.am.launch.cmd-opts = default - .8
tez.runtime.unordered.output.buffer.size-mb = 800
hive.server2.tez.sessions.per.default.queue = 2
tez.session.am.dag.submit.timeout.secs = 900
tez.am.session.min.held.containers = 8
hive.prewarm.enabled = TRUE

This is a 15-node cluster with 254 GB of RAM and 32 cores per node.

Any clue what might be happening? Is the AM sized correctly? I don't get out-of-memory errors, just these long wait times when everything is running at once, even though the queries process only 35 million records combined.

Thanks

Best answer

There is a behavior that is not well explained in the documentation: to really utilize the cluster and all your additional memory configuration, you MUST set up default queues, and you need to specify a queue when you run a query, connect Spark, and so on.

For example, when using Tez, you need to set tez.queue.name={your queue name} to fully utilize the cluster; this lets YARN run the queries in parallel.
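As an illustration, the queue can be set per session with a SET statement before the query. The sketch below is a dry run that only prints the beeline command; the property is spelled tez.queue.name in Tez's configuration, while the HiveServer2 URL, the queue name "etl", and the sample query are hypothetical placeholders.

```shell
# Hypothetical values: replace with your HiveServer2 host and a queue that exists in YARN.
HS2_URL="jdbc:hive2://hs2.example.com:10000/default"
QUEUE="etl"

# Pin this session's Tez jobs to the queue, then run the query.
HIVE_SQL="SET tez.queue.name=${QUEUE}; SELECT 1;"

# Dry run by default: print the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

run beeline -u "$HS2_URL" -e "$HIVE_SQL"
```

Running it with DRY_RUN=0 on a host that has beeline installed would execute the command instead of printing it.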

For Spark, you need to pass --queue {your queue name} when launching pyspark or when submitting jobs with spark-submit.
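A minimal sketch of both Spark invocations follows. It only builds and prints the command lines; the queue name "etl" and the script my_job.py are hypothetical placeholders, and --master yarn is assumed because --queue applies to YARN deployments.

```shell
# Hypothetical values: a queue defined in your YARN scheduler and a placeholder job script.
QUEUE="etl"
APP="my_job.py"

# Batch submission and interactive shell, both targeting the same YARN queue.
SUBMIT_CMD="spark-submit --master yarn --queue $QUEUE $APP"
SHELL_CMD="pyspark --master yarn --queue $QUEUE"

# Print the commands for inspection; run them on a cluster edge node.
echo "$SUBMIT_CMD"
echo "$SHELL_CMD"
```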

To use the above, the queues must first exist in YARN, and you then list the default queues for Tez in the hive.server2.tez.default.queues parameter. It is important to note that you can create queues without listing them as defaults; in that case you need to name the queue explicitly every time, and your queries will not go to any default queue.
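To see how the default queue list interacts with the session setting from the question: HiveServer2 launches hive.server2.tez.sessions.per.default.queue Tez sessions on each queue listed in hive.server2.tez.default.queues. The sketch below just does that multiplication; the queue names are hypothetical, and the value 2 is taken from the question.

```shell
# Hypothetical queue list (as it would appear in hive.server2.tez.default.queues).
DEFAULT_QUEUES="etl,adhoc,reports"
# hive.server2.tez.sessions.per.default.queue, as posted in the question.
SESSIONS_PER_QUEUE=2

# Count the comma-separated queue names.
queue_count=$(printf '%s\n' "$DEFAULT_QUEUES" | tr ',' '\n' | grep -c .)

# Total Tez sessions (one AM each) that HiveServer2 keeps in its pool.
pooled_sessions=$(( queue_count * SESSIONS_PER_QUEUE ))

echo "queues=$queue_count pooled_tez_sessions=$pooled_sessions"
```

If the pool is much smaller than the number of concurrent queries (about 14 in the question), queries may have to wait for a free AM, which would be consistent with the long "Get Query Coordinator (AM)" time; this is an interpretation rather than something confirmed in the thread.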