In Spark SQL, there is only a limited set of `DataType`s available for a schema, and only a limited set of `Encoder`s for converting JVM objects to and from the internal Spark SQL representation.
- In practice, we may hit errors like the following regarding `DataType`, which usually happens in a `DataFrame` with custom types, BUT NOT in a `Dataset[T]` with custom types. Discussion of `DataType` (or UDT) points to *How to define schema for custom type in Spark SQL?* (See the first sketch after this list.)

      java.lang.UnsupportedOperationException: Schema for type xxx is not supported
- In practice, we may hit errors like the following regarding `Encoder`, which usually happens in a `Dataset[T]` with custom types, BUT NOT in a `DataFrame` with custom types. Discussion of `Encoder` points to *How to store custom objects in Dataset?* (See the second sketch after this list.)

      Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
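To make the first case concrete, here is a minimal sketch of how I run into the `DataType` error. `MyPoint` is just a made-up class that Spark's reflection-based schema inference knows nothing about, and `spark` is a local `SparkSession`; the exact error message may vary by Spark version:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical custom type with no corresponding Spark SQL DataType / UDT
class MyPoint(val x: Double, val y: Double)

case class Record(id: Int, point: MyPoint)

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()

// createDataFrame derives a Catalyst schema (DataTypes) for every field of
// Record via reflection; for the MyPoint field this fails at runtime with
// something like:
//   java.lang.UnsupportedOperationException: Schema for type MyPoint is not supported
val df = spark.createDataFrame(Seq(Record(1, new MyPoint(0.0, 1.0))))
```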
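And the second case, reusing the same hypothetical `MyPoint` and `spark` from the sketch above, this time going through the typed `Dataset` path:

```scala
import spark.implicits._

// Building a typed Dataset[MyPoint] needs an implicit Encoder[MyPoint].
// spark.implicits._ only provides encoders for primitives, Products of
// supported types, etc., so this line does not even compile; the compiler
// reports the "Unable to find encoder for type stored in a Dataset" message.
val ds = spark.createDataset(Seq(new MyPoint(0.0, 1.0)))
```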
In my understanding, both touch the internal Spark SQL engine and its optimizer (which is presumably why only a limited set of `DataType`s and `Encoder`s is provided); and both `DataFrame` and `Dataset[T]` are really just `Dataset[A]`, since `DataFrame` is an alias for `Dataset[Row]` in Spark 2.x (a quick sketch of what I mean below), which leads to..
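Here is that sketch; `Person` is a throwaway case class and `spark` is the session from above:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

case class Person(name: String, age: Int)

import spark.implicits._
val ds: Dataset[Person]   = Seq(Person("a", 1)).toDS() // typed, uses Encoder[Person]
val df: DataFrame         = ds.toDF()                  // untyped view, i.e. Dataset[Row]
val back: Dataset[Person] = df.as[Person]              // re-attach the Person encoder
```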
Questions (or, more accurately, confusion)
- Why does the first error appear only in a `DataFrame` but not in a `Dataset[T]`? And the same question, in reverse, for the second error..
- Can creating a UDT solve the 2nd error? Can creating Encoders solve the 1st error? (A sketch of the kind of encoder workaround I mean is below.)
- How should I understand the relation between `DataType` and `Encoder`, and how do they interact with `Dataset` or the Spark SQL engine?
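For the second question, the encoder-side workaround I have in mind looks roughly like the following (again with the hypothetical `MyPoint` and the `spark` session from above); whether this, or defining a UDT, is the intended fix is exactly what I am unsure about:

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// A Kryo-based encoder treats MyPoint as an opaque binary blob, so a
// Dataset[MyPoint] can be created, at the cost of losing column-level
// structure for the type.
implicit val myPointEncoder: Encoder[MyPoint] = Encoders.kryo[MyPoint]

val pointsDs = spark.createDataset(Seq(new MyPoint(0.0, 1.0)))
```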
The intent of this post is to explore these two concepts further and to invite open discussion, so please bear with me if the questions are not too specific.. any sharing of understanding is appreciated. Thanks.