How to iterate over nested groups in pandas dataframe and assign values to a new column?

111 Views Asked by At

I have a data frame:

Id  modification_date   date1   date2   estimate_date
aaa 2022-07-19T13:27:01Z    2022-07-21T20:01:00Z    2022-07-21T22:59:00Z    
aaa 2022-07-20T13:49:21Z    2022-07-21T20:01:00Z    2022-07-21T22:59:00Z    2022-07-22T01:06:05Z
aaa 2022-07-21T14:24:20Z    2022-07-21T14:01:00Z    2022-07-21T22:59:00Z    2022-07-21T19:06:33Z
aaa 2022-07-21T18:52:23Z    2022-07-21T14:01:00Z    2022-07-21T22:59:00Z    
bbb 2022-07-18T21:04:13Z    2022-07-20T13:01:00Z    2022-07-21T00:59:00Z    
bbb 2022-07-19T20:30:20Z    2022-07-20T13:01:00Z    2022-07-21T00:59:00Z    
bbb 2022-07-20T17:37:58Z    2022-07-20T13:01:00Z    2022-07-21T00:59:00Z    
ddd 2022-07-11T14:40:09Z    2022-07-12T22:01:00Z    2022-07-13T00:59:00Z    2022-07-12T23:00:33Z
ddd 2022-07-11T17:29:11Z    2022-07-12T22:01:00Z    2022-07-13T00:59:00Z    2022-07-13T00:37:21Z
ddd 2022-07-13T20:36:11Z    2022-07-12T22:01:00Z    2022-07-13T00:59:00Z    
eee 2022-07-14T14:44:08Z    2022-07-18T17:01:00Z    2022-07-18T19:59:00Z    2022-07-18T21:36:38Z
eee 2022-07-18T19:45:42Z    2022-07-21T17:01:00Z    2022-07-21T19:59:00Z    2022-07-21T17:01:00Z
eee 2022-07-19T20:46:39Z    2022-07-21T17:01:00Z    2022-07-21T19:59:00Z    
eee 2022-07-21T13:35:07Z    2022-07-21T17:01:00Z    2022-07-21T19:59:00Z    2022-07-21T20:01:41Z
eee 2022-07-21T21:37:24Z    2022-07-21T17:01:00Z    2022-07-21T19:59:00Z    

I want to keep the latest modified record, but also, I dont want to miss on the least estimated date for that day's date1 and date2.

What I am doing is first sorting on the basis of modification date on group of id, and then again grouping date1 and date2 together where the last modified record's date1 and date2 group is same. and sorting on that group to get minimum estimation date.

What I am expecting is:

Id  modification_date   date1   date2   new_estimate_date
aaa 2022-07-21T18:52:23Z    2022-07-21T14:01:00Z    2022-07-21T22:59:00Z    2022-07-21T19:06:33Z
bbb 2022-07-20T17:37:58Z    2022-07-20T13:01:00Z    2022-07-21T00:59:00Z    
ddd 2022-07-13T20:36:11Z    2022-07-12T22:01:00Z    2022-07-13T00:59:00Z    2022-07-12T23:00:33Z
eee 2022-07-21T21:37:24Z    2022-07-21T17:01:00Z    2022-07-21T19:59:00Z    2022-07-21T17:01:00Z

I have tried:

def estimatedCT(df):           
        for i, j in df.sort_values('modification_date', ascending=False).groupby('id'):
                # print(j.head(1))
                for k, l in j.sort_values('estimate_date', ascending=True, na_position='last').groupby(['date1', 'date2']):
                        final_val = l['estimate_date'].head(1)
                        print(final_val)
                        if pd.notnull(final_val).all():
                                print(final_val)
                                return final_val
                        else:
                                print(np.nan)
                                return np.nan
                                
df['new_estimate_date'] = df['estimate_date'].apply(lambda x: estimatedCT(x))

But it is somehow giving me values of the last group in every record. Please suggest

0

There are 0 best solutions below