Alright, here's the situation. I have a Delta table in Databricks with change data feed enabled that looks something like this:
first_name | last_name | favorite_color | food | sport | age | timestamp |
---|---|---|---|---|---|---|
John | Doe | Blue | Pizza | Football | 28 | 2023-01-01 12:00:00 |
John | Smith | Red | Burger | Basketball | 35 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 28 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 29 | 2023-01-02 12:00:00 |
The table is constantly updating. I want logic in place that, for each change, grabs the unique identifier (in this case the first and last name) and ONLY THE COLUMNS THAT CHANGED, and then saves that output in this format:
first_name | last_name | payload |
---|---|---|
Jane | Doe | {"age": 29, "timestamp": "2023-01-02 12:00:00"} |
So in the example above, because only the age and timestamp columns changed for Jane Doe, only those two values should go into the JSON payload.
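In plain Python terms, the per-row comparison I mean is roughly this (just a sketch with dicts standing in for rows, and a made-up helper name):

```python
import json

def changed_columns(prev_row, new_row, key_cols=("first_name", "last_name")):
    """Return a JSON payload of only the non-key columns whose value changed."""
    payload = {
        col: new_val
        for col, new_val in new_row.items()
        if col not in key_cols and prev_row.get(col) != new_val
    }
    return json.dumps(payload)

prev = {"first_name": "Jane", "last_name": "Doe", "favorite_color": "Green",
        "food": "Sushi", "sport": "Tennis", "age": 28,
        "timestamp": "2023-01-01 12:00:00"}
new = {"first_name": "Jane", "last_name": "Doe", "favorite_color": "Green",
       "food": "Sushi", "sport": "Tennis", "age": 29,
       "timestamp": "2023-01-02 12:00:00"}

print(changed_columns(prev, new))
# {"age": 29, "timestamp": "2023-01-02 12:00:00"}
```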
Another example:
first_name | last_name | favorite_color | food | sport | age | timestamp |
---|---|---|---|---|---|---|
John | Doe | Blue | Pizza | Football | 28 | 2023-01-01 12:00:00 |
John | Smith | Red | Burger | Basketball | 35 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 28 | 2023-01-01 12:00:00 |
Jane | Doe | Green | Sushi | Tennis | 29 | 2023-01-02 12:00:00 |
John | Smith | Red | Burger | Swimming | 35 | 2023-01-03 12:00:00 |
first_name | last_name | payload |
---|---|---|
Jane | Doe | {"age": 29, "timestamp": "2023-01-02 12:00:00"} |
John | Smith | {"sport": "Swimming", "timestamp": "2023-01-03 12:00:00"} |
I am doing all of this in PySpark and am having a hard time getting started.