Creating a function to replace NaN value by traversing up hierarchy columns

59 Views Asked by At

I have 2 separate dataframes, data1 and df_attr.Where for df1, there is hierarchy data, where df1.manager_01_email is at the top of the hierarchy while df1.manager_04_email represents the lowest hierarchy

data1 = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          ["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          ["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          ["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          ["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]


data_attr = [["quarter","supervisor_email", "attrition_rate"],
         [ ["y2022q1","[email protected]", 0.3], ["y2022q2","[email protected]", 0.6],["y2022q1","[email protected]", 0.25],["y2022q2","[email protected]", 0.1],["y2022q1","[email protected]", 0.7],["y2022q2","[email protected]", 0.35],["y2022q2","[email protected]", np.NaN],["y2022q1","[email protected]", 0.1],["y2022q2","[email protected]", 0.8]]]


df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

I would need the "attrition_rate" column from data_attr df but due to NaN values and also missing values for attrition_rate, I would need to create a function to handle NaN values

For example, for "employee_id" = 1012, his supervisor_email "[email protected]" has NaN value in df_attr.attrition_rate at y2022q2, I will need to fill it with the attrition_rate value of the manager in the higher hierarchy, but because "[email protected]" is found at both manager_04_email and manager_03_email, we have to then traverse up the next hierarchy which is manager_02_email of "[email protected]" at y2022q2

and for "employeed_id" = 1012, his supervisor_email "[email protected]" is not found in df_attr, so his attrition_rate value is NaN after merging. I will need to fill it with the attrition_rate value of the manager in higher hierarchy, which in this case is "[email protected]" at y2022q1

The outcome I would like to achieve is

outcome = [["quarter","employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email", "mgr_attrition_rate"],
         [["y2022q1",1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.1], 
          ["y2022q1",1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]", 0.25],
          ["y2022q2",1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.1],
          ["y2022q2",1011, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]", 0.35],
          ["y2022q2",1012, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]", 0.6]]]

df_outcome = pd.DataFrame(data=outcome[1], columns=outcome[0])

I have tried based on the suggestion but am still not able to get it, appreciate any form of help or assistance, thank you.

# Merge dataframes based on quarter and supervisor_email
df_outcome = pd.merge(df1, df_attr, how='left',left_on=['quarter','supervisor_email'],right_on=['quarter','supervisor_email'])

#create function to handle NaNs
def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            for i in range(4, 0, -1):
                higher_manager_email = row[f'manager_0{i}_email']
                if pd.notna(higher_manager_email):
                    higher_manager_attrition = df_attr.loc[(df_attr["supervisor_email"] == higher_manager_email) & (df_attr["quarter"] == higher_manager_quarter), 'attrition_rate']
                    if not higher_manager_attrition.empty:
                        return higher_manager_attrition.values[0]
        return row['attrition_rate']

df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)

1

There are 1 best solutions below

1
Serge de Gosson de Varennes On

You were almost there. You need to handle the cases where you might have NaN. For this, I created a function that will handle these case. You can of course move it outside the merge_attrition_rates function if you want.

import pandas as pd
import numpy as np

data1 = [["employee_id", "supervisor_email", "manager_01_email", "manager_02_email","manager_03_email","manager_04_email"],
         [[1011, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"], 
          [1012, "[email protected]", "[email protected]","[email protected]","[email protected]", "[email protected]"],
          [1013, "[email protected]" ,"[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1014, "[email protected]","[email protected]", "[email protected]", "[email protected]", "[email protected]"],
          [1015, "[email protected]", "[email protected]", "[email protected]", "[email protected]","[email protected]"]]]

data_attr = [["supervisor_email", "attrition_rate"],
         [ ["[email protected]", 0.3], ["[email protected]", 0.25],["[email protected]", 0.35],
          ["[email protected]", np.NaN],["[email protected]", 0.1]]]

df1 = pd.DataFrame(data=data1[1], columns=data1[0])
df_attr = pd.DataFrame(data=data_attr[1], columns=data_attr[0])

def merge_attrition_rates(df1, df_attr):
    """
    Merges the 'attrition_rate' column from df_attr into df1 based on supervisor_email.
    Handles NaN values and missing supervisor emails.

    Args:
        df1 (pd.DataFrame): Hierarchy dataframe with columns 'employee_id', 'supervisor_email', and others.
        df_attr (pd.DataFrame): Attributes dataframe with columns 'supervisor_email' and 'attrition_rate'.

    Returns:
        pd.DataFrame: Merged dataframe with the 'attrition_rate' column added to df1.
    """
    df_outcome = pd.merge(df1, df_attr, on='supervisor_email', how='left', suffixes=('', '_supervisor'))

    def fill_nan_with_higher_manager(row):
        if pd.isna(row['attrition_rate']):
            for i in range(4, 0, -1):
                manager_email = row[f'manager_0{i}_email']
                if pd.notna(manager_email):
                    manager_attrition = df_attr.loc[df_attr['supervisor_email'] == manager_email, 'attrition_rate']
                    if not manager_attrition.empty:
                        return manager_attrition.values[0]
        return row['attrition_rate']

    df_outcome['attrition_rate'] = df_outcome.apply(fill_nan_with_higher_manager, axis=1)

    missing_supervisors = df_outcome[df_outcome['attrition_rate'].isna()]['supervisor_email']
    df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'attrition_rate'] = df_outcome.loc[df_outcome['supervisor_email'].isin(missing_supervisors), 'manager_01_email'].map(df_attr.set_index('supervisor_email')['attrition_rate'])

    return df_outcome

df_outcome = merge_attrition_rates(df1, df_attr)
print(df_outcome)

This will return

 employee_id    supervisor_email manager_01_email    manager_02_email  \
0         1011    [email protected]     [email protected]  [email protected]   
1         1012   [email protected]     [email protected]  [email protected]   
2         1013    [email protected]     [email protected]  [email protected]   
3         1014  [email protected]     [email protected]  [email protected]   
4         1015     [email protected]     [email protected]  [email protected]   

     manager_03_email    manager_04_email  attrition_rate  
0     [email protected]    [email protected]            0.10  
1    [email protected]   [email protected]            0.25  
2    [email protected]    [email protected]            0.25  
3  [email protected]  [email protected]            0.35  
4     [email protected]     [email protected]             NaN