I have a use case where I need to read a JSON file or JSON string with Spark as a Dataset[T] in Scala. The JSON has nested elements, and some of those elements are optional. I am able to read the JSON and map it to case classes if I ignore the optional fields, since the schema then matches the case classes.
According to this link and its answer, this works for first-level JSON when the case class has an Option field, but it does not work when the Option field sits inside a nested element.
The JSON string I am using is as below:
val jsonString = """{
"Input" :
{
"field1" : "Test1",
"field2" : "Test2",
"field3Array" : [
{
"key1" : "Key123",
"key2" : ["keyxyz","keyAbc"]
}
]
},
"Output":
{
"field1" : "Test2",
"field2" : "Test3",
"requiredKey" : "RequiredKeyValue",
"field3Array" : [
{
"key1" : "Key123",
"key2" : ["keyxyz","keyAbc"]
}
]
}
}"""
The case classes that I have created are as below:
case class InternalFields(key1: String, key2: Array[String])
case class Input(field1: String, field2: String, field3Array: Array[InternalFields])
case class Output(field1: String, field2: String, requiredKey: String, field3Array: Array[InternalFields])
case class ExternalObject(input: Input, output: Output)
The code I am using to read the jsonString is as below:
import spark.implicits._  // needed for Seq(...).toDS, if not already in scope
val df = spark.read.option("multiline", "true").json(Seq(jsonString).toDS).as[ExternalObject]
The above code works perfectly fine. Now, when I add an optional field to the Output case class (because the JSON string could contain it for some use cases), it throws an error saying that the optional field I specified in the case class is missing.
So, to get around this, I went ahead and tried providing the schema using encoders to see if that works.
After adding the optional field, my case classes changed to the following:
case class InternalFields(key1: String, key2: Array[String])
case class Input(field1: String, field2: String, field3Array: Array[InternalFields])
case class Output(field1: String, field2: String, requiredKey: String, optionalKey: Option[String], field3Array: Array[InternalFields]) // changed
case class ExternalObject(input: Input, output: Output)
There is one additional optional field in the Output case class. Now I am trying to read the jsonString as below:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[ExternalObject].schema
val df = spark.read
  .schema(schema)
  .json(Seq(jsonString).toDS)
  .as[ExternalObject]
When I do df.show or display(df), it gives me an output table in which both the input column and the output column are null.
If I remove that optional field from the case class, then this code also works fine and shows me the expected output.
Is there any way to make this optional field in the inner JSON / inner case class work and cast the data directly to the respective case class inside a Dataset[T]?
Any ideas, guidance, or suggestions that could make it work would be of great help.
The problem is that Spark uses struct types to map a class to a Row: your case classes are turned into a fixed StructType, and every field of that struct is expected to exist as a column. Can Spark create a DataFrame which sometimes has a column c and sometimes not? It cannot. A field being nullable means the key has to exist, but its value can be null.
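For example (a purely illustrative variant of the question's jsonString, with the val name jsonWithExplicitNull chosen just for this sketch), a payload where optionalKey is present but set to an explicit null looks like this:
val jsonWithExplicitNull = """{
  "Input" : {
    "field1" : "Test1",
    "field2" : "Test2",
    "field3Array" : [ { "key1" : "Key123", "key2" : ["keyxyz", "keyAbc"] } ]
  },
  "Output" : {
    "field1" : "Test2",
    "field2" : "Test3",
    "requiredKey" : "RequiredKeyValue",
    "optionalKey" : null,
    "field3Array" : [ { "key1" : "Key123", "key2" : ["keyxyz", "keyAbc"] } ]
  }
}"""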
This is considered valid and is convertible to your structs. For JSON where the key can be missing altogether, I suggest you use a dedicated JSON library (as you know, there are many of them out there) and UDFs, or something similar, to extract what you need from the JSON.
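One way to follow that suggestion is sketched below, assuming json4s is acceptable (it is typically already on Spark's classpath; circe, play-json, etc. would work the same way). It parses the raw string with json4s instead of Spark's JSON reader and builds the Dataset[ExternalObject] in a map, reusing the jsonString and case classes from the question. The val name ds and the top-level key renaming are my own choices for this sketch; the renaming is only needed because json4s matches field names case-sensitively ("Input"/"Output" in the JSON vs. input/output in ExternalObject), whereas Spark's column resolution is case-insensitive by default.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

import spark.implicits._  // for .toDS and the ExternalObject encoder

val ds = Seq(jsonString).toDS.map { raw =>
  implicit val formats: Formats = DefaultFormats
  parse(raw)
    .transformField {            // align top-level keys with the case class field names
      case ("Input", v)  => ("input", v)
      case ("Output", v) => ("output", v)
    }
    .extract[ExternalObject]     // a missing optionalKey simply becomes None
}

ds.show(false)
With this approach the Option field behaves the way you want, because the JSON library, not Spark's schema, decides what "missing" means; the trade-off is that you give up the built-in JSON datasource and its options and do the parsing yourself inside the map.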