HDFS Excel rows decrease when running the Spark job on YARN


When running the same job locally (from IntelliJ IDEA), the output counts are correct (for example, 55 rows). But when the job is submitted on YARN using spark-submit, only a few of the rows come through (12 rows).

spark2-submit --master yarn --deploy-mode client \
  --num-executors 5 --executor-memory 5G --executor-cores 5 --driver-memory 8G \
  --class com.test.Main \
  --packages com.crealytics:spark-excel_2.11:0.13.1 \
  --driver-class-path /test/ImpalaJDBC41.jar,/test/TCLIServiceClient.jar \
  --jars /test/ImpalaJDBC41.jar,/test/TCLIServiceClient.jar \
  /test/test-1.0-SNAPSHOT.jar

When using --master yarn I get only partial rows. When using --master local I am able to read all the rows, but I get an exception: Caused by: java.sql.SQLFeatureNotSupportedException: [Simba][JDBC](10220) Driver not capable.

It seems the job is not able to read all the blocks from HDFS when running on the cluster.
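For context, the Excel read in the job looks roughly like the sketch below (the path and reader options are placeholders, not the exact values from my code); printing the count right after the read helps narrow down whether rows are already missing at read time:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExcelRowCountCheck")
  .getOrCreate()

// Placeholder HDFS path and reader options; the real job uses its own values.
val df = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///test/input.xlsx")

// Check whether rows are already missing right after the read.
println(s"Rows read: ${df.count()}")
println(s"Partitions: ${df.rdd.getNumPartitions}")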

Any help will be much appreciated. Thanks


There is 1 best solution below.


Since you mention that you get all the rows with a single executor (running with --master local), all the partitions are on the driver machine from which you submit the job with spark-submit.

Once your partitions are distributed across the cluster nodes (--master yarn), you lose many partitions and cannot read all the HDFS blocks.
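One way to check this is to count the rows in each partition when the job runs on YARN; a small diagnostic sketch, assuming df is the DataFrame read from the Excel file:

// Count rows per partition; partitions that come back empty on YARN
// (but not in local mode) point to data that is not being read.
val perPartitionCounts = df.rdd
  .mapPartitionsWithIndex { (idx, rows) => Iterator((idx, rows.size)) }
  .collect()

perPartitionCounts.foreach { case (idx, n) =>
  println(s"partition $idx -> $n rows")
}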

  1. Look into your code: are you using nested loops with an if condition, for example while( while() ), or any other loop with an if condition? Generally the outer loop copies the same partition to each node and the combiner combines the results into a single partition. Please check this.

  2. For the JDBC exception, you need to replace all the NULL values with other values, for example using the .na().fill() method on your final DataFrame, as sketched below. Each column value written should have a CHAR length greater than zero (NULL values have zero length, which is not supported when writing over JDBC).
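A minimal sketch of that null replacement before the JDBC write; the fill values, driver class, URL and table name below are placeholders to adapt to your setup:

// Replace nulls so every value written over JDBC has a non-zero length.
// finalDf is assumed to be your final DataFrame; fill values, driver class,
// URL and table name are placeholders.
val cleaned = finalDf
  .na.fill("N/A")   // string columns
  .na.fill(0)       // numeric columns

cleaned.write
  .format("jdbc")
  .option("driver", "com.cloudera.impala.jdbc41.Driver")
  .option("url", "jdbc:impala://impala-host:21050/default")
  .option("dbtable", "target_table")
  .mode("append")
  .save()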

Hope this helps.