I have a use case where I want to convert a struct field to an Avro record. The struct field originally maps to an Avro type: the input data is Avro files, and the struct field corresponds to a field in the input Avro records.
Below is what I want to achieve, in pseudocode.
Dataset<Row> data = loadInput(); // data is of the form (foo, bar, myStruct), read from Avro files

// do some joins to add more data
data = doJoins(data); // now data is of the form (a, b, myStruct)

// transform Dataset<Row> to Dataset<MyType>
Dataset<MyType> myData = data.map(row -> myUDF(row), encoderOfMyType);

// method `myUDF` definition
MyType myUDF(Row row) {
    String a = row.getAs("a");
    String b = row.getAs("b");
    // MyStruct is the generated Avro class that corresponds to the field myStruct
    MyStruct myStruct = convertToAvro(row.getAs("myStruct"));
    return generateMyType(a, b, myStruct);
}
My question is: how can I implement the convertToAvro method in the above pseudocode?
According to the documentation, the function to_avro encodes a column as binary in Avro format, so it can act as a replacement for the convertToAvro method.
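As a minimal sketch of this step, the example below applies to_avro to a struct column. It assumes Spark 3.x with the external spark-avro module on the classpath; the sample row, the struct fields `id` and `name`, and the column name `mystruct_avro` are hypothetical stand-ins for the joined data in the question.

```java
import static org.apache.spark.sql.avro.functions.to_avro;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ToAvroSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("to-avro-sketch")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical stand-in for the joined data: columns a, b and a struct column mystruct.
        Dataset<Row> data = spark.sql(
            "SELECT 'a1' AS a, 'b1' AS b, named_struct('id', 1, 'name', 'x') AS mystruct");

        // to_avro encodes the struct column as Avro binary, producing a binary column.
        Dataset<Row> withAvro = data.withColumn("mystruct_avro", to_avro(col("mystruct")));

        withAvro.printSchema();
        withAvro.show(false);

        spark.stop();
    }
}
```

The new column holds the Avro-encoded bytes of each struct value, not an instance of the generated MyStruct class.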
To convert the Avro column back, the function from_avro can be used. It takes the binary column and the Avro schema as a JSON string.
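A round trip might look like the sketch below, again assuming Spark 3.x with spark-avro available. The schema string is a hypothetical example matching the made-up struct fields `id` and `name`; with a generated Avro class, `MyStruct.getClassSchema().toString()` would presumably supply the real schema. The same schema is passed to both functions so that the writer and reader agree.

```java
import static org.apache.spark.sql.avro.functions.from_avro;
import static org.apache.spark.sql.avro.functions.to_avro;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FromAvroSketch {
    // Hypothetical Avro schema for the struct column, in JSON format.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"MyStruct\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"int\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("from-avro-sketch")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical input with a struct column whose fields match SCHEMA_JSON.
        Dataset<Row> data = spark.sql(
            "SELECT 'a1' AS a, named_struct('id', 1, 'name', 'x') AS mystruct");

        // Encode the struct as Avro binary using the explicit schema.
        Dataset<Row> encoded =
            data.withColumn("mystruct_avro", to_avro(col("mystruct"), SCHEMA_JSON));

        // Decode the binary column back into a struct using the same schema.
        Dataset<Row> decoded =
            encoded.withColumn("roundtrip", from_avro(col("mystruct_avro"), SCHEMA_JSON));

        decoded.select("a", "roundtrip.id", "roundtrip.name").show(false);

        spark.stop();
    }
}
```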
A word about the UDF: in the question, you performed the transformation to the Avro format within the UDF. I would prefer to include only the actual business logic in the UDF and keep the format transformation outside it, which separates the two concerns. If necessary, you can drop the original column mystruct after creating the Avro column.
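One way to keep the format conversion outside the UDF is a small helper that adds the Avro column and drops the original struct. This is only a sketch under the same assumptions as before (Spark 3.x with spark-avro); the helper name `encodeStruct` and the column name `mystruct_avro` are hypothetical.

```java
import static org.apache.spark.sql.avro.functions.to_avro;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AvroColumnSketch {
    // Adds an Avro-encoded copy of the mystruct column, then drops the original struct column.
    static Dataset<Row> encodeStruct(Dataset<Row> data) {
        return data
            .withColumn("mystruct_avro", to_avro(col("mystruct")))
            .drop("mystruct");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("avro-column-sketch")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical joined data of the form (a, b, mystruct).
        Dataset<Row> data = spark.sql(
            "SELECT 'a1' AS a, 'b1' AS b, named_struct('id', 1, 'name', 'x') AS mystruct");

        // The result has the columns a, b and mystruct_avro; the UDF can then focus on a and b.
        encodeStruct(data).printSchema();

        spark.stop();
    }
}
```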