Combining non-null values in multiple rows and displaying them in a single row in a pandas df


There are multiple ids, some of them duplicated. For each id, I want to combine the non-NaN values spread across several rows into a single row. But there are other things too. An example:

>>> df

id      val_one  one           val_two   two            val_three  three          val_four  four

11      0        NaN           1         in_progress    0          NaN            0         NaN
11      1        in_progress   0         NaN            0          NaN            0         NaN
22      0        NaN           0         NaN            0          NaN            1        completed
22      0        NaN           1         in_progress    0          canceled       0         NaN
22      1        completed     0         NaN            0          NaN            0         NaN
33      0        NaN           0         NaN            1          completed      0         NaN
33      0        canceled      0         NaN            0          NaN            0         NaN

This is as far as I can get.

First, the 'val_one' and 'one' columns are related, and likewise each other 'val_' column is paired with its status column. The values 'in_progress' and 'completed' mean '1' in the 'val_' columns; 'canceled' and NaN mean '0'.

Secondly, 'in_progress', 'completed' and 'canceled' are actual values, as you can see.

If a row has a '0' and another row with the same id has a '1' in the related column, I want to change the '0' to '1'.

If a row contains 'canceled' and there is no '1' in the associated rows, I want to write 'canceled' in place of the NaN value in the output.

If 'in_progress' appears somewhere, there won't be another 'completed' or 'canceled' for the same id. The same goes for the other two values.
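The status-to-flag rule described above can be sketched in a few lines (a minimal example, not the full solution; the status labels mirror the sample data):

```python
import numpy as np
import pandas as pd

# 'in_progress' and 'completed' imply a flag of 1;
# 'canceled' and NaN imply 0.
status = pd.Series(['in_progress', 'completed', 'canceled', np.nan])
flags = status.isin(['in_progress', 'completed']).astype(int)
print(flags.tolist())  # [1, 1, 0, 0]
```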

I need it to be like this:

id      val_one  one           val_two   two            val_three  three          val_four  four

11      1        in_progress   1         in_progress    0          NaN            0         NaN
22      1        completed     1         in_progress    0          NaN            1        completed
33      0        canceled      0         NaN            1          completed      0         NaN

I created new columns for the '0' and '1' values (I can rename them to the correct names later):

df['val_onee'] = df.groupby(['id'])['val_one'].transform('max')
df['val_twoo'] = df.groupby(['id'])['val_two'].transform('max')
df['val_threee'] = df.groupby(['id'])['val_three'].transform('max')
df['val_fourr'] = df.groupby(['id'])['val_four'].transform('max')

Then I tried something like this, but it didn't work:

df = df[~df.drop('one', axis=1).duplicated(keep=False) | pd.notna(df['one'])]
df = df[~df.drop('two', axis=1).duplicated(keep=False) | pd.notna(df['two'])]
df = df[~df.drop('three', axis=1).duplicated(keep=False) | pd.notna(df['three'])]
df = df[~df.drop('four', axis=1).duplicated(keep=False) | pd.notna(df['four'])]
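For reference, the transform step above can also be applied to all four flag columns in one pass (a sketch on a toy frame with one duplicated id; column names mirror the question):

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicated id
df = pd.DataFrame({
    'id': [11, 11],
    'val_one': [0, 1],
    'one': [np.nan, 'in_progress'],
})

# Broadcast the per-id maximum of every 'val_' column back onto each row
val_cols = [c for c in df.columns if c.startswith('val_')]
df[val_cols] = df.groupby('id')[val_cols].transform('max')
print(df['val_one'].tolist())  # [1, 1]
```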

Thanks in advance for all the help, guys.

Accepted answer by Corralien:

You can use groupby and a dict of aggregation functions:

funcs = {c: 'max' for c in df.select_dtypes(int).columns} \
        | {c: 'first' for c in df.select_dtypes(object).columns}
out = df.groupby('id', as_index=False).agg(funcs)[df.columns]

Output:

>>> out
   id  val_one          one  val_two          two  val_three      three  val_four       four
0  11        1  in_progress        1  in_progress          0       None         0       None
1  22        1    completed        1  in_progress          0   canceled         1  completed
2  33        0     canceled        0         None          1  completed         0       None
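One note on the answer: the dict-union operator `|` used to build `funcs` requires Python 3.9+. On older versions, the same mapping can be built with dict unpacking (a sketch on a toy frame with hypothetical column names):

```python
import pandas as pd

# Toy frame: integer columns and one object column
df = pd.DataFrame({'id': [1, 1], 'val': [0, 1], 'status': ['x', None]})

# Equivalent of `funcs_a | funcs_b` without the 3.9+ dict-union operator
funcs = {**{c: 'max' for c in df.select_dtypes(int).columns},
         **{c: 'first' for c in df.select_dtypes(object).columns}}
print(funcs)  # {'id': 'max', 'val': 'max', 'status': 'first'}
```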