What's the best way to merge a series of JSON-format binlog records into a Hudi table using Spark?


I have a Hudi table and some JSON-format binlog records, and I want to merge these binlog records into the Hudi table. As we know, binlog records need to be applied in order. What's the best way to do this? Should I traverse each binlog record in order and perform the corresponding operation on the Hudi table, or is there a more elegant way to achieve this?
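For illustration, assume each binlog entry carries an operation type, a monotonically increasing sequence number, and the row payload (field names here are hypothetical, as newline-delimited JSON):

```
{"op": "insert", "seq": 101, "data": {"id": 1, "name": "alice"}}
{"op": "update", "seq": 102, "data": {"id": 1, "name": "alicia"}}
{"op": "delete", "seq": 103, "data": {"id": 2}}
```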


1 answer:

Sumit Singh:

You can use a custom Spark job with ordered processing:

  1. Create a Spark job that reads the binlog records into a DataFrame.
  2. Sort the DataFrame by the binlog sequence number or timestamp.
  3. Apply the corresponding Hudi operation (insert, update, delete) for each record; a sketch follows this list.
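A minimal PySpark sketch of this approach, assuming the record shape from the question (`op`, `seq`, `data`) and hypothetical names for the table, record key, and paths. Rather than iterating row by row, it collapses each key to its final state by sequence number and relies on Hudi's precombine field to resolve ordering within the write:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

# Hudi requires Kryo serialization and its Spark bundle on the classpath
spark = (
    SparkSession.builder
    .appName("binlog-to-hudi")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical input path; each line: {"op": ..., "seq": ..., "data": {...}}
events = (
    spark.read.json("s3://bucket/binlog/")
    .select("data.*", "op", "seq")
)

# Keep only the latest event per record key, so each key carries exactly
# one final operation and ordering across writes no longer matters.
latest = (
    events
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("id").orderBy(F.col("seq").desc())))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

hudi_opts = {
    "hoodie.table.name": "my_table",                  # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",  # hypothetical key
    # precombine field: on key collisions Hudi keeps the row with the larger seq
    "hoodie.datasource.write.precombine.field": "seq",
    "hoodie.datasource.write.operation": "upsert",
}

base_path = "s3://bucket/hudi/my_table"  # hypothetical

# Inserts and updates become a single upsert batch
(latest.filter(F.col("op") != "delete").drop("op")
 .write.format("hudi").options(**hudi_opts)
 .mode("append").save(base_path))

# Deletes go through a separate write with the delete operation
delete_opts = {**hudi_opts, "hoodie.datasource.write.operation": "delete"}
(latest.filter(F.col("op") == "delete").drop("op")
 .write.format("hudi").options(**delete_opts)
 .mode("append").save(base_path))
```

Note the trade-off: collapsing to the last event per key means intermediate states within a batch are never materialized in the table, which is usually fine for CDC merges; if you need every intermediate version, process the binlog in smaller, strictly ordered batches instead.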

You can also look at Hudi's DeltaStreamer (HoodieDeltaStreamer) with a custom source or transformer; it ingests CDC-style data and uses a source-ordering field to resolve out-of-order records, as in the sketch below.
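A rough spark-submit invocation, assuming a Hudi utilities bundle and the same hypothetical paths and field names as above; JsonDFSSource reads newline-delimited JSON from a directory, and --source-ordering-field plays the precombine role:

```
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --op UPSERT \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field seq \
  --target-base-path s3://bucket/hudi/my_table \
  --target-table my_table \
  --hoodie-conf hoodie.datasource.write.recordkey.field=id \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/binlog/
```

In practice JsonDFSSource also needs a schema, typically supplied via --schemaprovider-class with Avro schema files; exact class names and properties vary by Hudi version, so check the docs for the release you run.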