Spark fat jar to run multiple versions on YARN

I have an older version of Spark set up with YARN that I don't want to wipe out, but I still want to use a newer version. I found a couple of posts describing how a fat jar can be used for this.

Many SO posts point to either Maven (officially supported) or sbt to build a fat jar, because it isn't directly available for download. There seem to be multiple Maven plugins for doing it: maven-assembly-plugin, maven-shade-plugin, onejar-maven-plugin, etc.

However, I can't figure out whether I really need a plugin and, if so, which one and how exactly to go about it. I tried compiling the GitHub source directly using 'build/mvn' and 'build/sbt', but the resulting 'spark-assembly_2.11-2.0.2.jar' file is just 283 bytes.

My goal is to run the pyspark shell using the newer version's fat jar, in a similar way as mentioned here.

Answer 1 (2 votes):

The easiest solution (without changing your Spark-on-YARN architecture or having to speak to your YARN admins) is to:

  1. Define a library dependency on Spark 2 in your build system, be it sbt or Maven.

  2. Assemble your Spark application to create a so-called uber jar or fat jar with the Spark libraries inside (see the sketch after this list).
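For the sbt route, a minimal sketch could look like the following (the sbt-assembly plugin version, Spark version, and project name are assumptions for illustration; Spark is deliberately not marked "provided" so that it ends up inside the jar):

    // project/plugins.sbt -- pulls in the sbt-assembly plugin (version is an assumption)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

    // build.sbt -- Spark 2.x is a regular compile dependency on purpose,
    // so the assembled jar carries the newer Spark libraries with it
    name := "spark2-app"
    scalaVersion := "2.11.8"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.2",
      "org.apache.spark" %% "spark-sql"  % "2.0.2"
    )

Running sbt assembly then produces a single fat jar under target/scala-2.11/ that you can ship to the existing YARN cluster.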

It works, and I have personally tested it at least once in a project.

The only (?) downside is that the build takes longer (you have to run sbt assembly, not sbt package) and your Spark application's deployable fat jar is...well...much bigger. That also makes deployment slower, since you have to spark-submit it to YARN over the wire.

All in all, it works but takes longer (which may still be shorter than convincing your admin gods to, say, forget about what is available in commercial offerings like Cloudera's CDH, Hortonworks' HDP, or the MapR distro).
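As a rough sketch of that deployment step (the jar path and main class below are placeholders, not values from the question):

    # submit the assembled fat jar to the existing YARN cluster
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.Main \
      target/scala-2.11/spark2-app-assembly-0.1.jar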

Answer 2 (6 votes):

As of Spark 2.0.0, building a fat assembly jar is no longer supported; you can find more information in Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?

The recommended way in your case (running on YARN) is to create a directory on HDFS containing the contents of Spark's jars/ directory and to add that path to spark-defaults.conf:

spark.yarn.jars    hdfs:///path/to/jars/directory/on/hdfs/*.jar

Then, when you run the pyspark shell, it will use the previously uploaded libraries, so it will behave just like the fat jar from Spark 1.x.
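Put together, the steps could look roughly like this (the HDFS path is a placeholder, and SPARK_HOME is assumed to point at the unpacked Spark 2.x distribution):

    # upload the Spark 2.x jars to HDFS (destination path is a placeholder)
    hdfs dfs -mkdir -p /user/spark/spark2-jars
    hdfs dfs -put $SPARK_HOME/jars/*.jar /user/spark/spark2-jars/

    # conf/spark-defaults.conf of the Spark 2.x installation
    #   spark.yarn.jars    hdfs:///user/spark/spark2-jars/*.jar

    # start the shell against YARN; executors fetch the jars from HDFS
    $SPARK_HOME/bin/pyspark --master yarn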