When I run the job locally (from IntelliJ IDEA) the output counts are correct (e.g. 55 rows). But when I submit it to YARN with spark-submit, I get only a few rows back (e.g. 12 rows).
spark2-submit --master yarn --deploy-mode client --num-executors 5 --executor-memory 5G --executor-cores 5 --driver-memory 8G --class com.test.Main --packages com.crealytics:spark-excel_2.11:0.13.1 --driver-class-path /test/ImpalaJDBC41.jar,/test/TCLIServiceClient.jar --jars /test/ImpalaJDBC41.jar,/test/TCLIServiceClient.jar /test/test-1.0-SNAPSHOT.jar
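For reference, a minimal sketch of how the count might be produced, assuming the job reads the workbook through spark-excel and counts rows (path, sheet and option names are placeholders; spark-excel option names vary by version):

```scala
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-count-check").getOrCreate()

    val df = spark.read
      .format("com.crealytics.spark.excel")
      .option("header", "true")          // "useHeader" in older spark-excel releases
      .option("inferSchema", "true")
      .load("hdfs:///test/input.xlsx")   // placeholder path

    // Log the count and partition layout so local and YARN runs can be compared.
    println(s"rows = ${df.count()}, partitions = ${df.rdd.getNumPartitions}")
  }
}
```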
With --master yarn I get only partial rows.
With --master local I can read all rows, but the job fails with: Caused by: java.sql.SQLFeatureNotSupportedException: [Simba][JDBC](10220) Driver not capable.
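A hedged guess, since the full stack trace is not shown: the Simba/Impala JDBC driver rejects some java.sql calls that Spark issues by default (for example transaction-isolation handling). If the job writes through Spark's JDBC data source, setting isolationLevel to NONE is a common workaround; the URL, driver class and table name below are placeholders:

```scala
// Assumption: df is the DataFrame to be written and the target is Impala via ImpalaJDBC41.jar.
df.write
  .format("jdbc")
  .option("url", "jdbc:impala://impala-host:21050/default")  // placeholder URL
  .option("driver", "com.cloudera.impala.jdbc41.Driver")
  .option("dbtable", "target_table")                         // placeholder table
  .option("isolationLevel", "NONE")                          // avoid unsupported transaction-isolation calls
  .mode("append")
  .save()
```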
It seems the job is not able to read all the blocks from HDFS when running on the cluster.
Any help will be much appreciated. Thanks
Since you mention that you get all the rows with a single executor (running --master local), all the partitions are held by the driver process from which you submit the job with spark-submit.
Once your partitions are distributed across the cluster nodes (--master yarn), you lose many of them and cannot read all the HDFS blocks.
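One way to check whether rows are really going missing per partition is to count them partition by partition and compare the local and YARN runs. A small diagnostic sketch, assuming df is the DataFrame your job reads:

```scala
// Count rows in each partition so the local and YARN outputs can be compared.
val perPartition = df.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()

perPartition.foreach { case (idx, n) => println(s"partition $idx -> $n rows") }
println(s"total = ${perPartition.map(_._2).sum}")
```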
The same applies to any other loop with an if condition: generally the outer loop copies the same partition to each node, and a combiner merges the results into a single partition. Please check this.
Hope this helps.