Optimizing the computation of rows using PySpark


I'm running into the problem that looping in pandas is not efficient, and I'm new to PySpark, so I would like to ask for ideas or solutions for the 2 tasks illustrated below.

I have a dataset consisting of 700K rows like this:

Date format: MM-DD-YYYY

ID DATE REF_DATE A B C
1 01-31-2020 10-31-2019 3000 0.05 320
2 02-28-2020 10-31-2019 2750 0.04 300
3 03-31-2020 10-31-2019 2600 0.03 270

Tasks/steps:

1. Generate new rows for each id if the condition is satisfied

Condition: if DATE > REF_DATE, add one row per month, with the DATE value decreased by 1 month each time until the newly added DATE equals REF_DATE, keeping the other variables' values constant. This is the expected output, which in reality will have about 30 million rows.

ID DATE REF_DATE A B C
1 01-31-2020 10-31-2019 3000 0.05 320
1 12-31-2019 10-31-2019 3000 0.05 320
1 11-30-2019 10-31-2019 3000 0.05 320
1 10-31-2019 10-31-2019 3000 0.05 320
2 02-28-2020 10-31-2019 2750 0.04 300
2 01-31-2020 10-31-2019 2750 0.04 300
2 12-31-2019 10-31-2019 2750 0.04 300
2 11-30-2019 10-31-2019 2750 0.04 300
2 10-31-2019 10-31-2019 2750 0.04 300
3 03-31-2020 10-31-2019 2600 0.03 270
3 02-28-2020 10-31-2019 2600 0.03 270
3 01-31-2020 10-31-2019 2600 0.03 270
3 12-31-2019 10-31-2019 2600 0.03 270
3 11-30-2019 10-31-2019 2600 0.03 270
3 10-31-2019 10-31-2019 2600 0.03 270

2. Create & compute 3 new columns named D, E, F as follows

IF it's the first observation within the group (by ID):
    E = A * B
    F = C - E
    D = A - F
IF it's the second or a later row within that group:
    E = lag(D) * B
    F = C - E
    D = lag(D) - F

lag(D) here indicates the value of D from the previous row: to process row 2 we use D from row 1, to process row 3 we use D from row 2, and so on.

The expected output will look like this. I show only the first ID as an example; the other IDs are processed in the same manner.

ID DATE REF_DATE A B C D E F
1 01-31-2020 10-31-2019 3000 0.05 320 2830 150 170
1 12-31-2019 10-31-2019 3000 0.05 320 2651.5 141.5 178.5
1 11-30-2019 10-31-2019 3000 0.05 320 2464.1 132.6 187.4
1 10-31-2019 10-31-2019 3000 0.05 320 2267.3 123.2 196.8

P.S.: I need to process and store this whole dataset for other calculations later. Is there any way to do this without loops? Thank you!

I tried using pandas with multithreading, but now I need to do it in PySpark.
