I am trying to convert a DataFrame to a Dataset, and the Java class structure is as follows:
class A:
public class A {
    private int a;

    public int getA() {
        return a;
    }

    public void setA(int a) {
        this.a = a;
    }
}
class B:
public class B extends A {
    private int b;

    public int getB() {
        return b;
    }

    public void setB(int b) {
        this.b = b;
    }
}
and class C:
public class C {
    private A a;

    public A getA() {
        return a;
    }

    public void setA(A a) {
        this.a = a;
    }
}
and the data in the DataFrame is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
When I apply Encoders.bean[C](classOf[C]) to the DataFrame, the A reference held by C, which should be an instance of B, does not pass the .isInstanceOf[B] check: it returns false. The output of the Dataset is as follows:
+-----+
| a |
+-----+
|[1,2]|
+-----+
How do we get all the fields of A and B under the C object while iterating over it in foreach?
Code:
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructType}

object TestApp extends App {
  implicit val sparkSession = SparkSession.builder()
    .appName("Test-App")
    .config("spark.sql.codegen.wholeStage", value = false)
    .master("local[1]")
    .getOrCreate()

  // Column "a" is an array of structs with two integer fields, a and b
  val schema = new StructType().
    add("a", new ArrayType(new StructType().add("a", IntegerType, true).add("b", IntegerType, true), true))
  val dd = sparkSession.read.schema(schema).json("Test.txt")
  // Map each row to the Java bean C
  val ff = dd.as(Encoders.bean[C](classOf[C]))
  ff.show(truncate = false)
  ff.foreach(f => {
    println(f.getA.get(0).isInstanceOf[A]) //---true
    println(f.getA.get(0).isInstanceOf[B]) //---false
  })
}
Content of the file Test.txt: {"a":[{"a":1,"b":2}]}
Spark Catalyst uses Google reflection to derive the schema from Java beans. Please take a look at JavaTypeInference.scala#inferDataType. This class uses the getters to collect the field names, and the return types of those getters to compute the Spark type. Since class C has a getter named getA() with return type A, and A in turn has a getter getA() with return type int, the schema will be created as struct<a:struct<a:int>>, where struct<a:int> is derived from the getA of class A. Only the declared return type is considered, not the runtime class of the value, so the field b of B never becomes part of the schema.
The solution to this problem that I can think of is -
Output-
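The schema inference described in this answer can be illustrated without Spark. The following is a minimal sketch, assuming that plain bean introspection through java.beans.Introspector approximates how the getters are read. The class name BeanSchemaSketch and the helper properties(...) are hypothetical names introduced for illustration, and the bean classes are nested copies of A, B and C from the question. It is not Spark's actual implementation.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanSchemaSketch {
    // Nested copies of the beans from the question, so the sketch fits in one file.
    public static class A {
        private int a;
        public int getA() { return a; }
        public void setA(int a) { this.a = a; }
    }

    public static class B extends A {
        private int b;
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    public static class C {
        private A a;
        public A getA() { return a; }
        public void setA(A a) { this.a = a; }
    }

    // Hypothetical helper: lists the readable bean properties of a class
    // as "name:declaredType", sorted by name.
    public static List<String> properties(Class<?> beanClass) {
        List<String> result = new ArrayList<>();
        try {
            // Stopping at Object leaves out the "class" property of getClass().
            PropertyDescriptor[] descriptors =
                Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor pd : descriptors) {
                if (pd.getReadMethod() != null) {
                    result.add(pd.getName() + ":" + pd.getPropertyType().getSimpleName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(properties(C.class)); // [a:A]
        System.out.println(properties(A.class)); // [a:int]
        System.out.println(properties(B.class)); // [a:int, b:int]
    }
}
```

Walking C only ever reaches the declared type A, whose single property is a. The property b shows up only when B itself is introspected, which is consistent with the inferred schema struct<a:struct<a:int>>.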