I have 2 separate dataframes, data1 and df_attr.Where for df1, there is hierarchy data, where df1.manager_01_email is at the top of the hierarchy while df1.manager_04_email represents the lowest hierarchy
data1 = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
[["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"],
["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]
data_attr = [["quarter","supervisor_email", "attrition_rate"],
[ ["y2022q1","[email protected]", 0.3], ["y2022q2","[email protected]", 0.6],["y2022q1","[email protected]", 0.25],["y2022q2","[email protected]", 0.1],["y2022q1","[email protected]", 0.7],["y2022q2","[email protected]", 0.35],["y2022q2","[email protected]", np.NaN],["y2022q1","[email protected]", 0.1],["y2022q2","[email protected]", 0.8]]]
df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])
I would need the "attrition_rate" column from data_attr df but due to NaN values and also missing values for attrition_rate, I would need to create a function to handle NaN values
For example, for "employee_id" = 1012, his supervisor_email "[email protected]" has NaN value in df_attr.attrition_rate at y2022q2, I will need to fill it with the attrition_rate value of the manager in the higher hierarchy, but because "[email protected]" is found at both manager_04_email and manager_03_email, we have to then traverse up the next hierarchy which is manager_02_email of "[email protected]" at y2022q2
and for "employeed_id" = 1012, his supervisor_email "[email protected]" is not found in df_attr, so his attrition_rate value is NaN after merging. I will need to fill it with the attrition_rate value of the manager in higher hierarchy, which in this case is "[email protected]" at y2022q1
The outcome I would like to achieve is
outcome = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
[["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1],
["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.25],
["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.1],
["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.6]]]
df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])
I have tried based on the suggestion but am still not able to get it, appreciate any form of help or assistance, thank you.
# Merge dataframes based on quarter and supervisor_email
df_outcome = pd.merge(df1, df_attr, how='left',left_on=['quarter','supervisor_email'],right_on=['quarter','supervisor_email'])
#create function to handle NaNs
def fill_nan_with_higher_manager(row):
if pd.isna(row['attrition_rate']):
for i in range(4, 0, -1):
higher_manager_email = row[f'manager_0{i}_email']
if pd.notna(higher_manager_email):
higher_manager_attrition = df_attr.loc[(df_attr["supervisor_email"] == higher_manager_email) & (df_attr["quarter"] == higher_manager_quarter), 'attrition_rate']
if not higher_manager_attrition.empty:
return higher_manager_attrition.values[0]
return row['attrition_rate']
df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)
You were almost there. You need to handle the cases where you might have NaN. For this, I created a function that will handle these case. You can of course move it outside the
merge_attrition_ratesfunction if you want.This will return