I am running a job in spark-shell with the following configuration:
--num-executors 15
--driver-memory 15G
--executor-memory 7G
--executor-cores 8
--conf spark.yarn.executor.memoryOverhead=2G
--conf spark.sql.shuffle.partitions=500
--conf spark.sql.autoBroadcastJoinThreshold=-1
--conf spark.executor.memoryOverhead=800
The job is stuck and does not progress. The code does a cross join with filter conditions on a large dataset of 270M rows. I have increased the partitions of the large table (270M rows) to 16,000, and I have converted the small table (100,000 rows) to a broadcast variable.
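For reference, here is a minimal sketch of the pattern I am describing (the paths, column names, and filter conditions are placeholders, not the actual job):

```scala
import org.apache.spark.sql.functions.broadcast

// Large table, ~270M rows, repartitioned to 16,000 partitions.
val large = spark.read.parquet("/data/large_table").repartition(16000)

// Small table, ~100,000 rows, broadcast explicitly
// (autoBroadcastJoinThreshold is set to -1, so automatic broadcast is off).
val small = spark.read.parquet("/data/small_table")

// Cross join followed by filter conditions (placeholder range condition).
val result = large.crossJoin(broadcast(small))
  .filter(large("start") <= small("point") && small("point") < large("end"))

result.write.parquet("/data/output")
```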
I have attached the Spark UI screenshots for the job.
Should I reduce the partitions or increase the executors? Any ideas?
Thanks for helping out.
![spark ui 1][1] ![spark ui 2][2] ![spark ui 3][3]

After 10 hours the status is: tasks 7341/16936 (16624 failed).
Checking the container error logs shows:
Failed while trying to construct the redirect url to the log server. Log Server url may not be configured
java.lang.Exception: Unknown container. Container either has not started or has already completed or doesn't belong to this node at all.
![50% completed UI 1][4] ![50% completed UI 2][5]

[1]: https://i.stack.imgur.com/nqcys.png
[2]: https://i.stack.imgur.com/S2vwL.png
[3]: https://i.stack.imgur.com/81FUn.png
[4]: https://i.stack.imgur.com/h5MTa.png
[5]: https://i.stack.imgur.com/yDfKF.png
It would be helpful if you could mention your cluster configuration.
But since you noted that broadcasting a small table of 1,000 rows works while 100,000 rows does not, you probably need to adjust your memory configuration.
As per your config, I am assuming you have a total of 15 * 7 = 105 GB of executor memory. You can try:
--num-executors 7 --executor-memory 15G
This gives each executor more memory to hold the broadcast variable. Please adjust
--executor-cores
accordingly for proper utilization.
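For example, a revised spark-shell invocation along these lines might look like the following (the --executor-cores value of 4 is only an illustration, so tune it to your node size; also note that your original command sets both spark.yarn.executor.memoryOverhead and spark.executor.memoryOverhead, and you should keep only one):

```
spark-shell \
  --num-executors 7 \
  --driver-memory 15G \
  --executor-memory 15G \
  --executor-cores 4 \
  --conf spark.executor.memoryOverhead=2G \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1
```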