I am working with the DeepLearning4j library. I run everything on an HPC cluster and build a jar file that I submit with spark-submit. I am using version 1.0.0-M1.1. Everything worked fine on the CPU, but when I switched to the GPU, I got this error:
Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:openblas:0.3.13-1.5.5 do not match.
Warning: Versions of org.bytedeco:javacpp:1.5.4 and org.bytedeco:opencv:4.5.1-1.5.5 do not match.
22/08/03 21:05:26 INFO BaseImageRecordReader: ImageRecordReader: 1000 label classes inferred using label generator ParentPathLabelGenerator
iterator
data list creator
java.lang.RuntimeException: No CUDA devices were found in system
at org.nd4j.linalg.jcublas.JCublasBackend.canRun(JCublasBackend.java:69)
at org.nd4j.linalg.jcublas.JCublasBackend.isAvailable(JCublasBackend.java:52)
at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:160)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/08/03 21:05:26 WARN Nd4jBackend: Skipped [JCublasBackend] backend (unavailable): java.lang.RuntimeException: No CUDA devices were found in system
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.datavec.image.loader.NativeImageLoader.transformImage(NativeImageLoader.java:670)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:593)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:281)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:256)
at org.datavec.image.loader.NativeImageLoader.asMatrix(NativeImageLoader.java:250)
at org.datavec.image.recordreader.BaseImageRecordReader.next(BaseImageRecordReader.java:247)
at org.datavec.image.recordreader.BaseImageRecordReader.nextRecord(BaseImageRecordReader.java:511)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.initializeUnderlying(RecordReaderDataSetIterator.java:194)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:341)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:421)
at org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(RecordReaderDataSetIterator.java:53)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.entryPoint(NetworkRetrainingMain.java:55)
at com.examples.DeepLearningOnSpark.imageNet_image.streaming.NetworkRetrainingMain.main(NetworkRetrainingMain.java:31)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5095)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:270)
... 25 more
Caused by: org.nd4j.linalg.factory.Nd4jBackend$NoAvailableBackendException: Please ensure that you have an nd4j backend on your classpath. Please see: https://deeplearning4j.konduit.ai/nd4j/backend
at org.nd4j.linalg.factory.Nd4jBackend.load(Nd4jBackend.java:196)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:5092)
... 26 more
My pom.xml is:
<properties>
<dl4j-master.version>1.0.0-M1.1</dl4j-master.version>
<!-- Change the nd4j.backend property to nd4j-cuda-X-platform to use CUDA GPUs -->
<!-- <nd4j.backend>nd4j-cuda-10.2-platform</nd4j.backend> -->
<nd4j.backend>nd4j-cuda-11.0-platform</nd4j.backend>
<java.version>1.8</java.version>
<shadedClassifier>bin</shadedClassifier>
<scala.binary.version>2.11</scala.binary.version>
<maven-compiler-plugin.version>3.8.1</maven-compiler-plugin.version>
<maven.minimum.version>3.3.1</maven.minimum.version>
<exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
<maven-shade-plugin.version>2.4.3</maven-shade-plugin.version>
<jcommon.version>1.0.23</jcommon.version>
<jfreechart.version>1.0.13</jfreechart.version>
<logback.version>1.1.7</logback.version>
<jcommander.version>1.27</jcommander.version>
<spark.version>2.4.8</spark.version>
<jackson.version>2.5.1</jackson.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>${exec-maven-plugin.version}</version>
<executions>
<execution>
<goals>
<goal>exec</goal>
</goals>
</execution>
</executions>
<configuration>
<executable>java</executable>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>${maven-shade-plugin.version}</version>
<configuration>
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>${shadedClassifier}</shadedClassifierName>
<createDependencyReducedPom>true</createDependencyReducedPom>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>org/datanucleus/**</exclude>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
<!-- Added to enable jar creation using mvn command-->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<archive>
<manifest>
<mainClass>fully.qualified.MainClass</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!-- bind to the packaging phase -->
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>${nd4j.backend}</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>org.nd4j</groupId>
<artifactId>nd4j-cuda-11.0</artifactId>
<version>1.0.0-M1.1</version>
</dependency>
<dependency>
<groupId>org.datavec</groupId>
<artifactId>datavec-spark_${scala.binary.version}</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>dl4j-spark_${scala.binary.version}</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>dl4j-spark-parameterserver_${scala.binary.version}</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>com.beust</groupId>
<artifactId>jcommander</artifactId>
<version>${jcommander.version}</version>
</dependency>
<!-- Used for patent classification example -->
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-nlp</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-zoo</artifactId>
<version>${dl4j-master.version}</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-core</artifactId>
<version>1.0.0-M1.1</version>
</dependency>
<dependency>
<groupId>org.deeplearning4j</groupId>
<artifactId>deeplearning4j-cuda-11.0</artifactId>
<version>1.0.0-M1.1</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
</dependencies>
And these are the modules I have loaded:
1) modenv/scs5 (S)
2) Maven/3.6.3
3) Java/1.8.0_161-OpenJDK
4) bzip2/1.0.8-GCCcore-8.3.0
5) ncurses/6.1-GCCcore-8.3.0
6) libreadline/8.0-GCCcore-8.3.0
7) Tcl/8.6.9-GCCcore-8.3.0
8) SQLite/3.29.0-GCCcore-8.3.0
9) XZ/5.2.4-GCCcore-8.3.0
10) GMP/6.1.2-GCCcore-8.3.0
11) libffi/3.2.1-GCCcore-8.3.0
12) Python/3.7.4-GCCcore-8.3.0
13) BigDataFrameworkConfigure/0.0.2
14) Spark/3.0.1-Hadoop-2.7-Java-1.8-Python-3.7.4-GCCcore-8.3.0
15) CUDAcore/11.0.2
16) numactl/2.0.14-GCCcore-10.3.0
17) NVHPC/21.7
18) GCCcore/9.3.0
19) zlib/1.2.11-GCCcore-9.3.0
20) binutils/2.34-GCCcore-9.3.0
21) GCC/9.3.0
22) CUDA/11.0.2-GCC-9.3.0
23) nvidia-nsight/2019.3.1
Could anyone help me, please? Thank you!
If you are using the CUDA backend, make sure the Spark workers are running on machines that actually have a GPU.
Ideally, every machine that can receive a CUDA backend job as a worker should be configured the same way; otherwise you won't see much of a performance benefit.
Those machines should also have the same drivers and the CUDA version the backend expects (11.0 for nd4j-cuda-11.0).
I'm not sure what your system configuration is, but if you do that, you shouldn't have issues with the libraries.
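One way to confirm that a node can see a GPU before submitting the job is to ask the NVIDIA driver directly. The following is only a minimal sketch, not part of DL4J: the class name GpuVisibilityCheck and its methods are hypothetical, and it assumes the nvidia-smi tool is on the PATH of the node being checked. It runs `nvidia-smi -L` and keeps the lines that describe a device. The stack trace in the question is thrown from the main class started by spark-submit, so it is probably worth running this on that node as well as on the workers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DL4J or ND4J.
class GpuVisibilityCheck {

    /** Keeps only lines shaped like "GPU 0: <name> (UUID: ...)", as printed by `nvidia-smi -L`. */
    static List<String> parseGpuLines(String nvidiaSmiOutput) {
        List<String> gpus = new ArrayList<>();
        for (String line : nvidiaSmiOutput.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.matches("GPU \\d+: .*")) {
                gpus.add(trimmed);
            }
        }
        return gpus;
    }

    /** Runs `nvidia-smi -L`; returns an empty list if the tool is missing or exits with an error. */
    static List<String> listVisibleGpus() {
        try {
            Process process = new ProcessBuilder("nvidia-smi", "-L")
                    .redirectErrorStream(true)
                    .start();
            StringBuilder output = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    output.append(line).append('\n');
                }
            }
            int exitCode = process.waitFor();
            return exitCode == 0 ? parseGpuLines(output.toString()) : new ArrayList<String>();
        } catch (IOException e) {
            // Typically means nvidia-smi is not installed or not on the PATH.
            return new ArrayList<>();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        }
    }

    public static void main(String[] args) {
        List<String> gpus = listVisibleGpus();
        if (gpus.isEmpty()) {
            System.out.println("No GPU visible on this node; the CUDA backend is unlikely to load here.");
        } else {
            for (String gpu : gpus) {
                System.out.println(gpu);
            }
        }
    }
}
```

If this reports no GPU on the node where the job starts, the problem is most likely in how the job is scheduled (for example, it was not placed on a GPU node) rather than in the pom.xml.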