How do I reset pyspark "last" function when condition changes?

44 Views Asked by At

I want to create a new conditional variable "idx_cut" using following condition:

if idx = 1, then idx_cut = thru_dt + 30 (i.e., add 30 days to "thru_dt" date variable)

else: if from_dt > the lag(idx_cut), then idx_cut = thru_dt + 30 else retain idx_cut

This is what I tried:

w = Window.partitionBy('id').orderBy('id', 'from_dt', 'thru_dt')

df = data.withColumn('idx_cut', when((col('idx') ==lit(1)), col('hse_clm_thru_dt')+lit(30)).otherwise(None)). \
                withColumn('idx_cut', when((col('idx_cut').isNotNull()), col('idx_cut')). \
                    otherwise(when((col('from_dt') < last('idx_cut', True).over(w)), last('idx_cut', True).over(w1x)).otherwise((col('thru_dt')+lit(30)))))

and this is the output:

+------------+---------+-----+---------------+---------------+----------+
|          id|   lag_id|  idx|        from_dt|        thru_dt|   idx_cut|
+------------+---------+-----+---------------+---------------+----------+
|           C|     null|    1|     2023-01-02|     2023-01-08|2023-02-07|
|           C|        C|    2|     2023-01-09|     2023-01-09|2023-02-07|
|           C|        C|    3|     2023-02-01|     2023-02-04|2023-02-07|
|           C|        C|    4|     2023-02-06|     2023-02-10|2023-02-07|
|           C|        C|    5|     2023-02-13|     2023-02-14|*2023-03-16*|
|           C|        C|    6|     *2023-02-15*|   2023-02-23|**2023-03-25**|
|           C|        C|    7|     2023-02-24|     2023-03-03|2023-04-02|
|           C|        C|    8|     2023-03-07|     2023-03-11|2023-04-10|
|           C|        C|    9|     2023-03-14|     2023-03-16|2023-04-15|
|           C|        C|   10|     2023-03-20|     2023-03-21|2023-04-20|
|           C|        C|   11|     2023-03-26|     2023-03-31|2023-04-30|
|           C|        C|   12|     2023-04-03|     2023-04-10|2023-05-10|
|           C|        C|   13|     2023-04-15|     2023-04-22|2023-05-22|
|           C|        C|   14|     2023-04-24|     2023-04-28|2023-05-28|
|           C|        C|   15|     2023-05-04|     2023-05-18|2023-06-17|
+------------+---------+---+---------------+---------------+----------+

Unfortunately, this is not the desired output. My code works well up to where idx = 5, at which point it is supposed to reset the value of idx_cut on the sixth row to the lag value "2023-03-16" (But I get "2023-03-25").

The correct output would be the following:

+--+---------+-----+---------------+---------------+----------+
|id|   lag_id|  idx|    from_dt    |        thru_dt|   idx_cut|
+------------+-----+---------------+---------------+----------+
|C |     null|    1|     2023-01-02|     2023-01-08|2023-02-07|
|C |        C|    2|     2023-01-09|     2023-01-09|2023-02-07|
|C |        C|    3|     2023-02-01|     2023-02-04|2023-02-07|
|C |        C|    4|     2023-02-06|     2023-02-10|2023-02-07|
|C |        C|    5|     2023-02-13|     2023-02-14|*2023-03-16*|
|C |        C|    6|     *2023-02-15*|   2023-02-23|**2023-03-16**|
|C |        C|    7|     2023-02-24|     2023-03-03|**2023-03-16**|
|C |        C|    8|     2023-03-07|     2023-03-11|**2023-03-16**|
|C |        C|    9|     2023-03-14|     2023-03-16|**2023-03-16**|
|C |        C|   10|     2023-03-20|     2023-03-21|2023-04-20|
|C |        C|   11|     2023-03-26|     2023-03-31|**2023-04-20**|
|C |        C|   12|     2023-04-03|     2023-04-10|**2023-04-20**|
|C |        C|   13|     2023-04-15|     2023-04-22|**2023-04-20**|
|C |        C|   14|     2023-04-24|     2023-04-28|2023-05-28|
|C |        C|   15|     2023-05-04|     2023-05-18|**2023-05-04**|
+--+---------+-----+---------------+---------------+----------+

Thank you in advance!

0

There are 0 best solutions below