If a row contains at least two not NaN values, split the row into two separate ones


I am trying to convert a dataframe to the desired output format, subject to the requirements below.


Provided requirements:

  • Each row may keep only one non-NaN value (except Trh1 and Trh2, which stay together on one row)
  • For performance reasons, I want to avoid methods that iterate over each row.
  • Only four value columns are shown in the example; the real data has many more.

Example:

Input:

Index  Schema    Column  Trh1  Trh2  Trh3  Trh4
0      schema_1  col_1    NaN  0.01   NaN   NaN
1      schema_2  col_2   0.02  0.03   NaN   NaN
2      schema_3  col_3   0.03  0.04  0.05   NaN
3      schema_4  col_4    NaN   NaN  0.06  0.07

Expected output:

Index  Schema    Column  Trh1  Trh2  Trh3  Trh4
0      schema_1  col_1    NaN  0.01   NaN   NaN
1      schema_2  col_2   0.02  0.03   NaN   NaN
2      schema_3  col_3   0.03  0.04   NaN   NaN
3      schema_3  col_3    NaN   NaN  0.05   NaN
4      schema_4  col_4    NaN   NaN  0.06   NaN
5      schema_4  col_4    NaN   NaN   NaN  0.07

I explored the following approach: "Split row into 2 based on condition pandas". However, that approach only works for splitting a row when there are no NaN values in the two columns.
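For reference, the input above can be reconstructed as a small DataFrame (a sketch; the value columns are assumed to hold floats):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example input, values taken from the table above
df = pd.DataFrame({
    'Index': [0, 1, 2, 3],
    'Schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
    'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
    'Trh1': [np.nan, 0.02, 0.03, np.nan],
    'Trh2': [0.01, 0.03, 0.04, np.nan],
    'Trh3': [np.nan, np.nan, 0.05, 0.06],
    'Trh4': [np.nan, np.nan, np.nan, 0.07],
})
```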


3 Answers

Answer by mozway (accepted)

Handling a "jump" (a row like schema_3, where values beyond Trh1/Trh2 must be split onto new rows):

import numpy as np
import pandas as pd

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))


out = (df
   # flag rows whose only values are in Trh1/Trh2
   .assign(init=lambda d: d[others].isna().all(axis=1))
   [cols+['init']+others]
   # stacking drops the NaNs
   .set_index(cols).stack().to_frame()
   # number the remaining values per original row
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   .set_index('n', append=True)[0]
   .unstack(-2)
   .reset_index()
)

# blank out Trh1/Trh2 on the spilled rows (those without the 'init' flag)
out.loc[out['init'].isna(), special] = np.nan

out = out.drop(columns=['n', 'init'])

out = out.dropna(subset=special+others, how='all')

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04   NaN   NaN
3      2  schema_3  col_3   NaN   NaN  0.05   NaN
5      3  schema_4  col_4   NaN   NaN  0.06   NaN
6      3  schema_4  col_4   NaN   NaN   NaN  0.07

Original answer

You can use reshaping with de-duplication, using stack/unstack:

import pandas as pd

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df
   # stack and remove NaNs
   .set_index(cols).stack().to_frame()
   # deduplicate
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   # reshape to original shape
   .set_index('n', append=True)[0]
   .unstack(-2)
   # cleanup
   .reset_index()
   .drop(columns='n')
)

# add back the rows that were dropped because they had no value
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
                ignore_index=True).sort_values(by='Index') # optional

NB: this requires that there are no duplicate combinations of the initial cols.
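If duplicates are a concern, one possible workaround (a sketch, not part of the answer; `_uid` is a hypothetical helper column) is to make the key unique with a temporary per-row id before reshaping, then drop it afterwards:

```python
import pandas as pd

# Two rows sharing the same (Schema, Column) key
df2 = pd.DataFrame({'Schema': ['s1', 's1'],
                    'Column': ['c1', 'c1'],
                    'Trh3': [0.1, 0.2]})

out = (df2
       .assign(_uid=range(len(df2)))          # unique per-row key
       .set_index(['_uid', 'Schema', 'Column'])
       .stack()                               # no collision thanks to _uid
       .reset_index(name='value')
       .drop(columns=['_uid']))
```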

Or with melt, which might be more memory intensive but also more robust if you have duplicates:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df.melt(cols)
       # drop NAs, except first row per group
       .loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
       # de-duplicate
       .assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
       # reshape
       .pivot(index=cols+['n'], columns='variable', values='value')
       # cleanup
       .reset_index().rename_axis(index=None, columns=None)
      )

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04  0.05   NaN
3      3  schema_4  col_4   NaN   NaN  0.06   NaN
4      3  schema_4  col_4   NaN   NaN   NaN  0.07
Answer by rhug123

This should work as well.

c = ['Trh1', 'Trh2']

(df.set_index(['Schema', 'Column'], append=True)
   .stack()
   .to_frame()
   # number the values per original row, but leave Trh1/Trh2 unnumbered
   # (NaN) so they share one output row
   .assign(cc=lambda x: x.groupby(level=[0]).cumcount()
                         .where(~x.index.isin(c, level=-1)))
   .set_index('cc', append=True)[0]
   .unstack(level=-2)
   .droplevel([0, -1])
   .reset_index())

Output:

     Schema Column  Trh1  Trh2  Trh3  Trh4
0  schema_1  col_1   NaN  0.01   NaN   NaN
1  schema_2  col_2  0.02  0.03   NaN   NaN
2  schema_3  col_3  0.03  0.04   NaN   NaN
3  schema_3  col_3   NaN   NaN  0.05   NaN
4  schema_4  col_4   NaN   NaN  0.06   NaN
5  schema_4  col_4   NaN   NaN   NaN  0.07
Answer by Karthik Palaniappan
import pandas as pd

d1 = pd.DataFrame({'schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
                   'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
                   'trh1': [None, 2, 3, None],
                   'trh2': [1, 3, 4, None],
                   'trh3': [None, None, 5, 6],
                   'trh4': [None, None, None, 7]})

# one frame for trh1/trh2 together, one per remaining column
d2 = pd.concat([d1.drop(['trh3', 'trh4'], axis=1),
                d1[['Column', 'schema', 'trh3']],
                d1[['Column', 'schema', 'trh4']]])

# drop rows with no value at all, then renumber
d2.dropna(subset=d2.columns.difference(['schema', 'Column']), how='all', inplace=True)
d2.reset_index(drop=True, inplace=True)

The idea: separate trh3, trh4, and (trh1 and trh2 combined) into their own frames, concatenate them, remove the all-NaN rows, and reset the index.
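Since the question mentions many more value columns, the manual pd.concat above could be generalized with a list comprehension (a sketch over the same `d1`; the names `keep`, `singles`, and `pieces` are my own, and `keep` assumes trh1/trh2 are the only columns that stay together):

```python
import pandas as pd

d1 = pd.DataFrame({'schema': ['schema_1', 'schema_2', 'schema_3', 'schema_4'],
                   'Column': ['col_1', 'col_2', 'col_3', 'col_4'],
                   'trh1': [None, 2, 3, None],
                   'trh2': [1, 3, 4, None],
                   'trh3': [None, None, 5, 6],
                   'trh4': [None, None, None, 7]})

id_cols = ['schema', 'Column']
keep = ['trh1', 'trh2']                 # columns that stay on one row
singles = [c for c in d1.columns if c not in id_cols + keep]

# one piece for trh1/trh2 together, then one piece per remaining column
pieces = [d1[id_cols + keep]] + [d1[id_cols + [c]] for c in singles]
d2 = (pd.concat(pieces)
        .dropna(subset=keep + singles, how='all')
        .reset_index(drop=True))
```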