Deleting rows in pandas data frame after evaluating all columns

I have a very large pandas DataFrame (more than 100 million rows and thousands of columns). Each row has a unique label as its index, and in most rows only one column contains a value. I want to make a new DataFrame by deleting the rows in which only one column has a value and keeping the rows in which two or more columns have values.

Best answer

You can drop them using dropna:

In [3]:
# sample df
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [0, np.nan, 2, 3, 4], 'b': [0, np.nan, 2, 3, np.nan], 'c': np.arange(5)})
df

Out[3]:
    a   b  c
0   0   0  0
1 NaN NaN  1
2   2   2  2
3   3   3  3
4   4 NaN  4
In [5]:
# keep only the rows that have at least 2 non-NaN values (drop the rest)
df.dropna(thresh=2, axis=0)
Out[5]:
   a   b  c
0  0   0  0
2  2   2  2
3  3   3  3
4  4 NaN  4

Passing thresh=2 specifies that a row must have at least 2 non-NA values to be kept, and axis=0 specifies that the criterion is applied row-wise, so rows (not columns) are dropped.
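
To make it clearer what thresh is counting, here is a minimal sketch that builds the same filter by hand with a boolean mask. It assumes the empty cells are real NaN values (not 0 or empty strings); the names non_na_per_row, kept_by_mask and kept_by_dropna are just illustrative.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the answer
df = pd.DataFrame({
    'a': [0, np.nan, 2, 3, 4],
    'b': [0, np.nan, 2, 3, np.nan],
    'c': np.arange(5),
})

# Count the non-NA values in each row
non_na_per_row = df.notna().sum(axis=1)

# Keep the rows that have at least 2 non-NA values
kept_by_mask = df[non_na_per_row >= 2]

# dropna with thresh=2 should select exactly the same rows
kept_by_dropna = df.dropna(thresh=2, axis=0)

print(kept_by_mask.equals(kept_by_dropna))
```

The mask version is handy if you also want to inspect the per-row counts or use a different condition, but for simply dropping the rows dropna does the same job in one call.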