How to find the consecutive nulls (NaN) in the columns of pandas dataframe?


I have a pandas dataframe like below:

import pandas as pd
nan = float('nan')
data = {'col1': [1, nan, nan, nan, nan, 1, nan, nan], 
        'col2': [1, 1, nan, 1, 0, 0, 1, 0], 
        'col3': [nan, 0, nan, 1, 0, nan, nan, nan], 
        'col4': [1, 0, 0, 1, 0, 1, 1, 1]}
df = pd.DataFrame(data)

df

| col1 | col2 | col3 | col4 |
|------|------|------|------|
| 1    | 1    | NaN  | 1    |
| NaN  | 1    | 0    | 0    |
| NaN  | NaN  | NaN  | 0    |
| NaN  | 1    | 1    | 1    |
| NaN  | 0    | 0    | 0    |
| 1    | 0    | NaN  | 1    |
| NaN  | 1    | NaN  | 1    |
| NaN  | 0    | NaN  | 1    |

I want to count the consecutive null (NaN) values in every column and, if a column has more than two consecutive nulls, get the length of its longest run.

For the above df, I would get:

df_nulls = {'col1': 4, 'col2': 0, 'col3': 3, 'col4': 0}

With the above results, the columns with more than two consecutive nulls should be deleted; in this case, the final dataframe should contain only col2 and col4. I found similar threads, but none resolved this issue. How can I fix this problem? Thanks in advance.
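For reference, the expected run lengths can be sanity-checked with plain Python before reaching for pandas. This is my own sketch (the helper name `max_nan_run` is made up); it uses the `v != v` trick to detect NaN:

```python
from itertools import groupby

nan = float('nan')

def max_nan_run(values):
    """Length of the longest consecutive run of NaNs in a list."""
    runs = [len(list(g)) for is_nan, g in
            groupby(values, key=lambda v: v != v)  # NaN is the only value unequal to itself
            if is_nan]
    return max(runs, default=0)

print(max_nan_run([1, nan, nan, nan, nan, 1, nan, nan]))  # 4 (col1)
print(max_nan_run([1, 0, 0, 1, 0, 1, 1, 1]))              # 0 (col4)
```

Note that col2's single NaN gives a raw run length of 1, which the desired output reports as 0.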


2 Answers

Panda Kim (accepted answer)

Code

transform + max

out = (df
       .transform(lambda x: x.isna().groupby(x.notna().cumsum()).cumsum())
       .max()
       .mask(lambda x: x.eq(1), 0)
       .to_dict()
)

out

{'col1': 4, 'col2': 0, 'col3': 3, 'col4': 0}
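To see how the chain works, here is my own walkthrough of the intermediates for col1 (not part of the answer): `notna().cumsum()` assigns a group id that increments at every non-NaN value, so each NaN run stays inside one group, and the grouped `cumsum` of `isna()` counts up within each run:

```python
import pandas as pd

nan = float('nan')
s = pd.Series([1, nan, nan, nan, nan, 1, nan, nan])  # col1

groups = s.notna().cumsum()               # 1, 1, 1, 1, 1, 2, 2, 2
runs = s.isna().groupby(groups).cumsum()  # 0, 1, 2, 3, 4, 0, 1, 2
print(runs.max())  # 4
```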

Or use agg instead of transform + max:

out = (df
       .agg(lambda x: x.isna().groupby(x.notna().cumsum()).cumsum().max())
       .mask(lambda x: x.eq(1), 0)
       .to_dict()
)

Same result.
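Either way, the resulting dict answers the second half of the question: drop every column whose longest NaN run exceeds two. This last step is my own addition (the `> 2` threshold comes from the question; the mask-to-zero step is unnecessary here because the comparison handles short runs anyway):

```python
import pandas as pd

nan = float('nan')
df = pd.DataFrame({'col1': [1, nan, nan, nan, nan, 1, nan, nan],
                   'col2': [1, 1, nan, 1, 0, 0, 1, 0],
                   'col3': [nan, 0, nan, 1, 0, nan, nan, nan],
                   'col4': [1, 0, 0, 1, 0, 1, 1, 1]})

# longest NaN run per column
out = (df
       .agg(lambda x: x.isna().groupby(x.notna().cumsum()).cumsum().max())
       .to_dict())

# keep only columns with at most two consecutive NaNs
df_clean = df.drop(columns=[c for c, n in out.items() if n > 2])
print(list(df_clean.columns))  # ['col2', 'col4']
```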

Chris Fu
>>> (
...     df.notna().cumsum().apply(
...         lambda s: (
...             s.value_counts(sort=False).pipe(
...                 lambda s: s - (s.index != 0)
...             ).max()
...         )
...     )
... ).replace(1, 0).to_dict()
{'col1': 4, 'col2': 0, 'col3': 3, 'col4': 0}

Edit:

Fixed for the case where df contains leading NaNs.
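A quick check of that leading-NaN case, using col3 from the question (my own walkthrough, not part of the answer): leading NaNs fall into group 0, which contains no non-NaN value, so the `s.index != 0` term subtracts the one non-NaN member from every group except group 0:

```python
import pandas as pd

nan = float('nan')
s = pd.Series([nan, 0, nan, 1, 0, nan, nan, nan])  # col3: starts with NaN

groups = s.notna().cumsum()               # 0, 1, 1, 2, 3, 3, 3, 3
counts = groups.value_counts(sort=False)  # group sizes: {0: 1, 1: 2, 2: 1, 3: 4}
# every group except group 0 contains its leading non-NaN value,
# so subtract 1 from those counts to get pure NaN-run lengths
runs = counts - (counts.index != 0)
print(runs.max())  # 3
```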