How to create encoder for Option type constructor, e.g. Option[Int]?

4.9k Views Asked by At

Is it possible to use Option[_] member in a case class used with Dataset API? eg. Option[Int]

I tried to find an example but could not find any yet. This can probably be done with with a custom encoder (mapping?) but I could not find an example for that yet.

This might be achievable using Frameless library: https://github.com/adelbertc/frameless but there should be an easy way to get it done with base Spark libraries.

Update

I am using: "org.apache.spark" %% "spark-core" % "1.6.1"

And getting the following error when trying to use an Option[Int]:

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases

Solution update

Since I was prototyping I was just declaring the case class inside the function before the conversion to the Dataset (in my case inside object Main {).

Option types worked just fine when I moved the case class outside of the Main function.

2

There are 2 best solutions below

2
On BEST ANSWER

We only define implicits for a subset of the types we support in SQLImplicits. We should probably consider adding Option[T] for common T as the internal infrastructure does understand Option. You can workaround this by either creating a case class, using a Tuple or constructing the required implicit yourself (though this is using and internal API so may break in future releases).

implicit def optionalInt: org.apache.spark.sql.Encoder[Option[Int]] = org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()

val ds = Seq(Some(1), None).toDS()
5
On

"Support for serializing other types will be added in future releases". Custom encoders aren't supported yet though obviously it's planned. You could try to implement the trait yourself, but there are certainly no official examples.

One option would be to use a Seq[Int] member and ensure it only has at most one value.