I have a dataset whose output is shown in the attached picture.
I want to create three new columns, start_time_1, start_time_2, and start_time_3, holding the first timestamp at which each code in code1, code2, code3 appeared, based on start_timestamps.
In the attached image, the first group is id=1, and the timestamp at which M, 8, and 6 first appeared is 2023-05-06.
So every time M, 8, or 6 appears in code1, code2, or code3, the new column should be filled with this timestamp. A came a little later, at 2023-08-13, so A's timestamp will be 2023-08-13.
Similarly, in group id=2, D came first at 2023-06-07, so every time D appears in any of code1, code2, or code3, the timestamp will be 2023-06-07.
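To make the expected result concrete, here are some hypothetical rows that follow the same rules (the real data is in the attached image):

```
id | code1 code2 code3 | start_timestamps -> start_time_1 start_time_2 start_time_3
1  |   M     8     6   | 2023-05-06       -> 2023-05-06   2023-05-06   2023-05-06
1  |   A     M     6   | 2023-08-13       -> 2023-08-13   2023-05-06   2023-05-06
2  |   D     D     D   | 2023-06-07       -> 2023-06-07   2023-06-07   2023-06-07
2  |   D     M     8   | 2023-09-01       -> 2023-06-07   2023-09-01   2023-09-01
```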
How do I achieve this in PySpark? Also, please note that the timestamps are not sorted in ascending order, although they happen to be sorted in the example shown.
I tried using a window with unbounded preceding, but I was not able to achieve this result. The problem is that I am not able to search across all the code columns at once: my window function only searches one column, and it doesn't work when I put all the columns in at once.
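What I tried looks roughly like this (simplified); each window partitions on a single code column, so a code that also appears in code2 or code3 never influences start_time_1:

```python
from pyspark.sql import Window, functions as F

# df: the input DataFrame with id, code1..code3, start_timestamps.
w1 = (
    Window.partitionBy("id", "code1")
    .orderBy("start_timestamps")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df = df.withColumn("start_time_1", F.min("start_timestamps").over(w1))
# Limitation: this only looks at code1, so occurrences of the same
# code in code2 or code3 are invisible to start_time_1.
```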
My approach to solving this has been to create a map from each code to the minimum timestamp at which it appeared, and then use that map to populate the start_time_i columns. Below is the code.
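In essence it does the following (shown here as a condensed, runnable sketch; the sample rows are hypothetical stand-ins for the data in the image): explode the three code columns into long format, take the minimum timestamp per (id, code), and join that map back once per code column.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows, consistent with the description above.
df = spark.createDataFrame(
    [
        (1, "M", "8", "6", "2023-05-06"),
        (1, "A", "M", "6", "2023-08-13"),
        (2, "D", "D", "D", "2023-06-07"),
        (2, "D", "M", "8", "2023-09-01"),
    ],
    ["id", "code1", "code2", "code3", "start_timestamps"],
)

code_cols = ["code1", "code2", "code3"]

# Step 1: go long -- one row per (id, code, timestamp) across all code columns.
long_df = df.select(
    "id",
    "start_timestamps",
    F.explode(F.array(*code_cols)).alias("code"),
)

# Step 2: the "map": minimum timestamp per (id, code).
# ISO-formatted date strings compare correctly as strings; cast to
# date/timestamp first if the real column is typed differently.
first_seen = long_df.groupBy("id", "code").agg(
    F.min("start_timestamps").alias("first_ts")
)

# Step 3: join the map back once per code column to fill start_time_1..3.
result = df
for i, c in enumerate(code_cols, start=1):
    result = result.join(
        first_seen.withColumnRenamed("code", c).withColumnRenamed(
            "first_ts", f"start_time_{i}"
        ),
        on=["id", c],
        how="left",
    )

result.show()
```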
Output: