How do I create a new pyspark dataframe that captures only the columns that have changed into a payload?


Alright, here's the situation. I have a Delta table with change data feed enabled in Databricks that looks something like this:

first_name  last_name  favorite_color  food    sport       age  timestamp
John        Doe        Blue            Pizza   Football    28   2023-01-01 12:00:00
John        Smith      Red             Burger  Basketball  35   2023-01-01 12:00:00
Jane        Doe        Green           Sushi   Tennis      28   2023-01-01 12:00:00
Jane        Doe        Green           Sushi   Tennis      29   2023-01-02 12:00:00

It's constantly updating, but I want logic in place that grabs the unique identifier, in this case the first and last name, plus ONLY THE COLUMNS THAT CHANGED, and saves that output in this format:

first_name  last_name  payload
Jane        Doe        {age: 29, timestamp: 1/2/2023 12:00:00}

So in the example above, because only the age and timestamp columns changed for Jane Doe, I want to include just those in a JSON payload.

Another example:

first_name  last_name  favorite_color  food    sport       age  timestamp
John        Doe        Blue            Pizza   Football    28   2023-01-01 12:00:00
John        Smith      Red             Burger  Basketball  35   2023-01-01 12:00:00
Jane        Doe        Green           Sushi   Tennis      28   2023-01-01 12:00:00
Jane        Doe        Green           Sushi   Tennis      29   2023-01-02 12:00:00
John        Smith      Red             Burger  Swimming    35   2023-01-03 12:00:00

Expected output:

first_name  last_name  payload
Jane        Doe        {age: 29, timestamp: 1/2/2023 12:00:00}
John        Smith      {sport: Swimming, timestamp: 1/3/2023 12:00:00}

I am doing all of this in pyspark and am having a hard time getting started.
