Dataframe column locating min and max value depeding of an ID

584 Views Asked by At

I'm wondering how to optimize a part of code to remove a loop which takes forever since I have around 350 000 IDs. Here is the current code, which is not optimal and takes quite a while. I'm trying to get it working better and if possible removing a loop.

The dataset is made of 4 columns with IDs, start_dates, end_dates and amount. We can have multi rows with same IDs but not the same amount. The main thing is in some rows the dates are not saved in the dataset. In that case we have to find the earlier start_date of the ID and the later end_date and add them to the row where it's not put in the dataframe

Dataset

ID  start_date  end_date    value
ABC 12/10/2010  12/12/2020  8
ABC 01/01/2020  01/04/2021  9
ABC                         43
BCD 14/02/2020  14/03/2020  8

So we should have on the third row the start_date as 12/10/2010 and end date 01/04/2021. In the picture you cant see it but don't forget that BCD start_date could be earlier than ABC but you still use the 12/10/2010 because it is linked to the ID

    for x in df['ID'].unique():
        tmp = df.loc[df['ID'] == x].reset_index()
        df.loc[(df['ID'] == x) & (df['start_date'].isna()), 'start_date'] = tmp['start_date'].min()
        df.loc[(df['ID'] == x) & (df['end_date'].isna()), 'end_date'] = tmp['end_date'].max()

I suppose the code is quite clear about what I am trying to do. But if you have any questions don't hesitate do post them I'll do my best to answer.

1

There are 1 best solutions below

0
On BEST ANSWER

set up the job

import pandas as pd 
data = { 'ID': ['ABC','ABC','ABC','BCD'], 'start_date' : ['12/10/2010', '01/01/2020',None ,'14/02/2020'], 'end_date': ['12/12/2020', '01/01/2021',None ,'14/03/2020'], 'value': [8,9,43,8]}

df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

we get this result

    ID  start_date  end_date    value
0   ABC 2010-12-10  2020-12-12  8
1   ABC 2020-01-01  2021-01-01  9
2   ABC NaT                NaT  43
3   BCD 2020-02-14  2020-03-14  8

do the work

df.start_date = df.groupby('ID')['start_date'].apply(lambda x: x.fillna(x.min()))
df.end_date = df.groupby('ID')['end_date'].apply(lambda x: x.fillna(x.max()))

we get this result

    ID  start_date  end_date    value
0   ABC 2010-12-10  2020-12-12  8
1   ABC 2020-01-01  2021-01-01  9
2   ABC 2010-12-10  2021-01-01  43
3   BCD 2020-02-14  2020-03-14  8