If a row contains at least two not NaN values, split the row into two separate ones


I am trying to convert a dataframe to the desired output format, subject to the requirements below.


Provided requirements:

  • Each row may keep only one non-NaN value (except Trh1 and Trh2, which stay together on one row)
  • For performance reasons, I want to avoid methods that iterate over each row.
  • Only four value columns are shown in the example; the real data has many more.

Example:

Input:

Index  Schema    Column  Trh1  Trh2  Trh3  Trh4
0      schema_1  col_1    NaN  0.01   NaN   NaN
1      schema_2  col_2   0.02  0.03   NaN   NaN
2      schema_3  col_3   0.03  0.04  0.05   NaN
3      schema_4  col_4    NaN   NaN  0.06  0.07

Expected output:

Index  Schema    Column  Trh1  Trh2  Trh3  Trh4
0      schema_1  col_1    NaN  0.01   NaN   NaN
1      schema_2  col_2   0.02  0.03   NaN   NaN
2      schema_3  col_3   0.03  0.04   NaN   NaN
3      schema_3  col_3    NaN   NaN  0.05   NaN
4      schema_4  col_4    NaN   NaN  0.06   NaN
5      schema_4  col_4    NaN   NaN   NaN  0.07

I explored the following approach: "Split row into 2 based on condition pandas". However, that approach only works for splitting a row when there are no NaN values in the two columns.
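For reference, the input above can be reconstructed as a small DataFrame (a sketch; the value columns are assumed to hold floats):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example input, values taken from the table above
df = pd.DataFrame({
    'Index': [0, 1, 2, 3],
    'Schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
    'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
    'Trh1': [np.nan, 0.02, 0.03, np.nan],
    'Trh2': [0.01, 0.03, 0.04, np.nan],
    'Trh3': [np.nan, np.nan, 0.05, 0.06],
    'Trh4': [np.nan, np.nan, np.nan, 0.07],
})
```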


3 Answers

Answer by mozway (accepted)

Handling a "jump" (a row like schema_3, where values beyond Trh1/Trh2 must be split onto new rows):

import numpy as np
import pandas as pd

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))


out = (df
   # flag rows whose only values are in Trh1/Trh2
   .assign(init=lambda d: d[others].isna().all(axis=1))
   [cols+['init']+others]
   # stacking drops the NaNs
   .set_index(cols).stack().to_frame()
   # number the remaining values per original row
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   .set_index('n', append=True)[0]
   .unstack(-2)
   .reset_index()
)

# blank out Trh1/Trh2 on the spilled rows (those without the 'init' flag)
out.loc[out['init'].isna(), special] = np.nan

out = out.drop(columns=['n', 'init'])

out = out.dropna(subset=special+others, how='all')

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04   NaN   NaN
3      2  schema_3  col_3   NaN   NaN  0.05   NaN
5      3  schema_4  col_4   NaN   NaN  0.06   NaN
6      3  schema_4  col_4   NaN   NaN   NaN  0.07

Original answer

You can use reshaping with de-duplication, using stack/unstack:

import pandas as pd

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df
   # stack and remove NaNs
   .set_index(cols).stack().to_frame()
   # deduplicate
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   # reshape to original shape
   .set_index('n', append=True)[0]
   .unstack(-2)
   # cleanup
   .reset_index()
   .drop(columns='n')
)

# add back the rows that were dropped because they had no value
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
                ignore_index=True).sort_values(by='Index') # optional

NB: this requires that there are no duplicate combinations of the initial cols.
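If duplicates are a concern, one possible workaround (a sketch, not part of the answer; `_uid` is a hypothetical helper column) is to make the key unique with a temporary per-row id before reshaping, then drop it afterwards:

```python
import pandas as pd

# Two rows sharing the same (Schema, Column) key
df2 = pd.DataFrame({'Schema': ['s1', 's1'],
                    'Column': ['c1', 'c1'],
                    'Trh3': [0.1, 0.2]})

out = (df2
       .assign(_uid=range(len(df2)))          # unique per-row key
       .set_index(['_uid', 'Schema', 'Column'])
       .stack()                               # no collision thanks to _uid
       .reset_index(name='value')
       .drop(columns=['_uid']))
```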

Or with melt, which might be more memory intensive but also more robust if you have duplicates:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df.melt(cols)
       # drop NAs, except first row per group
       .loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
       # de-duplicate
       .assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
       # reshape
       .pivot(index=cols+['n'], columns='variable', values='value')
       # cleanup
       .reset_index().rename_axis(index=None, columns=None)
      )

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04  0.05   NaN
3      3  schema_4  col_4   NaN   NaN  0.06   NaN
4      3  schema_4  col_4   NaN   NaN   NaN  0.07
Answer by rhug123

This should work as well.

c = ['Trh1', 'Trh2']

(df.set_index(['Schema', 'Column'], append=True)
   .stack()
   .to_frame()
   # number the values per original row, but leave Trh1/Trh2 unnumbered
   # (NaN) so they share one output row
   .assign(cc=lambda x: x.groupby(level=[0]).cumcount()
                         .where(~x.index.isin(c, level=-1)))
   .set_index('cc', append=True)[0]
   .unstack(level=-2)
   .droplevel([0, -1])
   .reset_index())

Output:

     Schema Column  Trh1  Trh2  Trh3  Trh4
0  schema_1  col_1   NaN  0.01   NaN   NaN
1  schema_2  col_2  0.02  0.03   NaN   NaN
2  schema_3  col_3  0.03  0.04   NaN   NaN
3  schema_3  col_3   NaN   NaN  0.05   NaN
4  schema_4  col_4   NaN   NaN  0.06   NaN
5  schema_4  col_4   NaN   NaN   NaN  0.07
Answer by Karthik Palaniappan
import pandas as pd

d1 = pd.DataFrame({'schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
                   'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
                   'trh1': [None, 2, 3, None],
                   'trh2': [1, 3, 4, None],
                   'trh3': [None, None, 5, 6],
                   'trh4': [None, None, None, 7]})

# one frame for trh1/trh2 together, one per remaining column
d2 = pd.concat([d1.drop(['trh3', 'trh4'], axis=1),
                d1[['Column', 'schema', 'trh3']],
                d1[['Column', 'schema', 'trh4']]])

# drop rows with no value at all, then renumber
d2.dropna(subset=d2.columns.difference(['schema', 'Column']), how='all', inplace=True)
d2.reset_index(drop=True, inplace=True)

The idea: separate trh3, trh4, and (trh1 and trh2 combined) into their own frames, concatenate them, remove the all-NaN rows, and reset the index.
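Since the question mentions many more value columns, the manual pd.concat above could be generalized with a list comprehension (a sketch over the same `d1`; the names `keep`, `singles`, and `pieces` are my own, and `keep` assumes trh1/trh2 are the only columns that stay together):

```python
import pandas as pd

d1 = pd.DataFrame({'schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
                   'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
                   'trh1': [None, 2, 3, None],
                   'trh2': [1, 3, 4, None],
                   'trh3': [None, None, 5, 6],
                   'trh4': [None, None, None, 7]})

id_cols = ['schema', 'Column']
keep = ['trh1', 'trh2']                 # columns that stay on one row
singles = [c for c in d1.columns if c not in id_cols + keep]

# one piece for trh1/trh2 together, then one piece per remaining column
pieces = [d1[id_cols + keep]] + [d1[id_cols + [c]] for c in singles]
d2 = (pd.concat(pieces)
        .dropna(subset=keep + singles, how='all')
        .reset_index(drop=True))
```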