|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: string (nullable = true)
I have the above nested schema, and I want to change column z's log field from string to struct, so that the schema becomes:
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: struct (nullable = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)
I'm not using Spark 3 but Spark 2.4.x. I'd prefer the Scala way, but Python works too, since this is a one-time manual job to backfill some past data.
Is there a way to do this with a UDF or some other approach?
I know it's easy to do via from_json, but the nested array of struct is causing issues.
I think it depends on the values in your log column, i.e. on how you want to split the string into two separate fields. The following PySpark code will just "move" your log values into the b and c fields. If string transformations are needed on the log column, the expressions that read the log value have to be changed to include those transformations.