How to list the JARs in the DSX Spark environment and the JARs loaded into the Spark JVM?


I'm hitting issues trying to use Spark packages, for example:

java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource
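
For context, this is roughly the kind of read that fails. The snippet below is a hypothetical sketch (the URI, database, and collection names are placeholders, not from my actual notebook), assuming a Spark 2.x Python notebook:

from pyspark.sql import SparkSession

# Hypothetical sketch of the failing read; the URI and namespace are
# placeholders, not values from my actual notebook.
spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://example-host:27017/mydb.mycollection")
      .load())

df.printSchema()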

I have listed the files in the lib dir:

!find ~/data/libs/

I can see that my JARs are installed:

/gpfs/fs01/user/xxxx/data/libs/
/gpfs/fs01/user/xxxx/data/libs/scala-2.11
/gpfs/fs01/user/xxxx/data/libs/scala-2.11/mongo-spark-connector_2.11-2.0.0.jar
/gpfs/fs01/user/xxxx/data/libs/scala-2.11/mongo-java-driver-3.2.2.jar
/gpfs/fs01/user/xxxx/data/libs/pixiedust.jar
/gpfs/fs01/user/xxxx/data/libs/spark-csv_2.11-1.3.0.jar

However, the error suggests that Spark is unable to see the JAR.

How can I list the JARs available to Spark?


There are 2 answers below.

BEST ANSWER

I created a scala notebook and ran the following code:

// Recursively walk the classloader chain and collect the URLs of every
// URLClassLoader; other classloader types are skipped, but their parents
// are still visited.
def urlses(cl: ClassLoader): Array[java.net.URL] = cl match {
  case null => Array()
  case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent)
  case _ => urlses(cl.getParent)
}

// Collect the URLs visible from the notebook's classloader and print them,
// filtering out Ivy cache entries to keep the list readable.
val urls = urlses(getClass.getClassLoader)
println(urls.filterNot(_.toString.contains("ivy")).mkString("\n"))

Attribution: https://gist.github.com/jessitron/8376139

Running this highlighted an issue with the JVM loading the MongoDB driver:

error: error while loading <root>, Error accessing /gpfs/fs01/user/xxxx/data/libs/scala-2.11/mongo-java-driver-3.2.2.jar
error: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.

This made me realise that although the JAR file was present in the correct location, it was not being loaded correctly into the JVM.

SECOND ANSWER

The classpath is stored in the environment variable SPARK_DIST_CLASSPATH. The following snippet, run from a Python notebook, yields some duplicates and non-JAR entries, but it also lists the JARs on the classpath.

!ls $(printenv SPARK_DIST_CLASSPATH | sed -e 's/:/ /g')
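
If the raw listing is too noisy, a small Python variation (my own sketch, not anything DSX-specific) can expand each entry and print only the distinct JAR files. It assumes the entries are colon-separated and may contain shell-style wildcards:

import glob
import os

# Sketch: split SPARK_DIST_CLASSPATH on ':', expand any wildcard entries,
# and print each .jar path exactly once, in order of first appearance.
seen = set()
for entry in os.environ.get("SPARK_DIST_CLASSPATH", "").split(":"):
    for path in sorted(glob.glob(entry)) or [entry]:
        if path.endswith(".jar") and path not in seen:
            seen.add(path)
            print(path)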

Note that the classpath depends on the selected Spark version.