Alright, here's the situation. I have a Delta table in Databricks with change data feed enabled that looks something like this:
first_name | last_name | favorite_color | food | sport | age | timestamp |
---|---|---|---|---|---|---|
John | Doe | Blue | Pizza | Football | 28 | 2023-01-01 12:00:00 |
John | Smith | Red | Burger | Basketball | 35 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 28 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 29 | 2023-01-02 12:00:00 |
The table is constantly updating. I want logic in place that, for each change, grabs the unique identifier (in this case the first and last name) and ONLY THE COLUMNS THAT CHANGED, and then saves that output in this format:
first_name | last_name | payload |
---|---|---|
Jane | Doe | {"age": 29, "timestamp": "2023-01-02 12:00:00"} |
So in the example above, because only the age and timestamp columns changed for Jane Doe, only those two values should go into the JSON payload.
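In plain Python terms, the per-row comparison I mean is roughly this (just a sketch with dicts standing in for rows, and a made-up helper name):

```python
import json

def changed_columns(prev_row, new_row, key_cols=("first_name", "last_name")):
    """Return a JSON payload of only the non-key columns whose value changed."""
    payload = {
        col: new_val
        for col, new_val in new_row.items()
        if col not in key_cols and prev_row.get(col) != new_val
    }
    return json.dumps(payload)

prev = {"first_name": "Jane", "last_name": "Doe", "favorite_color": "Green",
        "food": "Sushi", "sport": "Tennis", "age": 28,
        "timestamp": "2023-01-01 12:00:00"}
new = {"first_name": "Jane", "last_name": "Doe", "favorite_color": "Green",
       "food": "Sushi", "sport": "Tennis", "age": 29,
       "timestamp": "2023-01-02 12:00:00"}

print(changed_columns(prev, new))
# {"age": 29, "timestamp": "2023-01-02 12:00:00"}
```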
Another example:
first_name | last_name | favorite_color | food | sport | age | timestamp |
---|---|---|---|---|---|---|
John | Doe | Blue | Pizza | Football | 28 | 2023-01-01 12:00:00 |
John | Smith | Red | Burger | Basketball | 35 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 28 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 29 | 2023-01-02 12:00:00 |
John | Smith | Red | Burger | Swimming | 35 | 2023-01-03 12:00:00 |
first_name | last_name | payload |
---|---|---|
Jane | Doe | {"age": 29, "timestamp": "2023-01-02 12:00:00"} |
John | Smith | {"sport": "Swimming", "timestamp": "2023-01-03 12:00:00"} |
I am doing all of this in PySpark and am having a hard time getting started.