I have a generic Java class:
import java.util.List;

public class Person<T> {
    private String name;
    private List<T> attributes;
    // getters and setters omitted for brevity
}
public class AttributeOne {
    // some fields
}

public class AttributeTwo {
    // some fields
}
and I want to convert a Spark Dataset into a Dataset of Person<T> objects (each record of the input Dataset is converted to one Person<T>).
I first tried passing the generic type T to the Encoder. The code looks something like this:
val ds = inputDS.map(<logic to convert input dataset to Person object>)(Encoders.bean(classOf[Person[T]]))
ds.write.json(...)
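Spelled out as a compilable sketch, the generic attempt has roughly this shape; since the real conversion logic is elided above, it is taken here as a function parameter, and inputDS is assumed to be a Dataset[Row]:

import org.apache.spark.sql.{Dataset, Encoders, Row}

// Sketch of the generic attempt as a helper method. The actual
// row-to-Person conversion is passed in as toPerson because its
// real logic is elided in the snippet above.
def convert[T](input: Dataset[Row])(toPerson: Row => Person[T]): Dataset[Person[T]] =
  input.map(toPerson)(Encoders.bean(classOf[Person[T]]))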
The code compiles and runs without errors, but the output shows that only the name field is encoded successfully; the generic attributes field is not. The output data looks like:
{"name": "name-1", "attributes": [{}, {}, {}]}
{"name": "name-2", "attributes": [{}, {}]}
...
This means the Encoder does recognize how many "attribute" elements there are (the attributes list in the output is not empty), but each element is written as an empty object {}.
I realized that the Encoder needs to know the exact types to generate its serialization/deserialization code, and a generic type parameter T alone does not provide that information.
So I explicitly passed the exact type to the Encoder:
val ds = inputDS.map(<logic to convert input dataset to Person<AttributeOne> object>)(Encoders.bean(classOf[Person[AttributeOne]]))
ds.write.json(...)
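Fully written out, this second attempt looks roughly like the following sketch; the input column name, the attribute construction, and the Person accessors (setName/setAttributes) are assumptions for illustration:

import org.apache.spark.sql.{Dataset, Encoders, Row}

// Hypothetical conversion: assumes inputDS is a Dataset[Row] with a
// "name" column, and that Person exposes bean-style setters.
val ds: Dataset[Person[AttributeOne]] = inputDS.map { (row: Row) =>
  val p = new Person[AttributeOne]()
  p.setName(row.getAs[String]("name"))
  p.setAttributes(new java.util.ArrayList[AttributeOne]()) // populated from the row in real code
  p
}(Encoders.bean(classOf[Person[AttributeOne]]))
ds.write.json("output/path") // hypothetical path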
However, the result is exactly the same as with the generic type T: the name field is encoded but the generic attributes field is not (though the Encoder still recognizes how many "attribute" elements there are).
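The identical behaviour seems consistent with JVM type erasure: whatever type argument is written in the source, the Encoder receives the very same raw Class object, in which attributes is just a List with an unresolved element type. A quick check with the classes above:

// Type arguments are erased at runtime: both expressions evaluate to
// the same java.lang.Class object (the raw Person class).
val c1 = classOf[Person[AttributeOne]]
val c2 = classOf[Person[AttributeTwo]]
println(c1 == c2) // prints true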
Any suggestions?