Converting Dataframe from Spark to the type used by DL4j

564 Views Asked by atos At 18 September 2018 at 15:13

Is there any convenient way to convert Dataframe from Spark to the type used by DL4j? Currently using Daraframe in algorithms with DL4j I get an error: "type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".

Original Q&A

There are 1 best solutions below

Adam Gibson On 19 September 2018 at 02:20 BEST ANSWER

In general, we use datavec for that. I can point you at examples for that if you want. Dataframes make too many assumptions that make it too brittle to be used for real world deep learning.

Beyond that, a data frame is not typically a good abstraction for representing linear algebra. (It falls down when dealing with images for example)

We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java

But in general, a dataset is just a pair of ndarrays just like numpy. If you have to use spark tools, and want to use ndarrays on the last mile only, then my advice would be to get the dataframe to match some form of schema that is purely numerical, map that to an ndarray "row".

In general, a big reason we do this is because all of our ndarrays are off heap. Spark has many limitations when it comes to working with their data pipelines and using the JVM for things it shouldn't be(matrix math) - we took a different approach that allows us to use gpus and a bunch of other things efficiently.

When we do that conversion, it ends up being: raw data -> numerical representation -> ndarray

What you could do is map dataframes on to a double/float array and then use Nd4j.create(float/doubleArray) or you could also do: someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray),yourLabelINDARray))

That will give you a "dataset" You need a pair of ndarrays matching your input data and a label. The label from there is relative to the kind of problem you're solving whether that be classification or regression though.

Converting Dataframe from Spark to the type used by DL4j

There are 1 best solutions below

Related Questions in SCALA

Related Questions in APACHE-SPARK

Related Questions in DL4J

Trending Questions

Popular # Hahtags

Popular Questions