Filtering row values in pandas by threshold value

3.7k Views Asked by At

I have a pandas correlation matrix dataframe that has hundreds of columns and rows. I want to filter the whole dataframe so that i only get cells that are above a certain value, any row value > .4, for example. I'm not sure what is the best way to do that. I've seen multiple posts that filter by column name, but with hundreds of columns, how would you iterate through each column?

1

There are 1 best solutions below

1
On BEST ANSWER

If want filter only rows use boolean indexing with mask created by gt > with any for check at least one True:

df[df.gt(.4).any(axis=1)]

Or if want filter rows and columns:

m = df.gt(.4)
df.loc[m.any(axis=1), m.any()]

If want only values by condition is possible use where for replace another values to NaNs:

df.where(df.gt(.4))

Sample:

np.random.seed(4137)
df = pd.DataFrame(np.random.rand(5,3), columns=list('ABC'))
df['A'] /=  10
print (df)
          A         B         C
0  0.090262  0.313517  0.319292
1  0.086841  0.346788  0.314144
2  0.044069  0.764295  0.886659
3  0.089739  0.230103  0.594301
4  0.047101  0.132819  0.692846

df1 = df[df.gt(.4).any(axis=1)]
print (df1)
          A         B         C
2  0.044069  0.764295  0.886659
3  0.089739  0.230103  0.594301
4  0.047101  0.132819  0.692846

m = df.gt(.4)
df2 = df.loc[m.any(axis=1), m.any()]
print (df2)
          B         C
2  0.764295  0.886659
3  0.230103  0.594301
4  0.132819  0.692846

df3 = df.where(df.gt(.4))
print (df3)
    A         B         C
0 NaN       NaN       NaN
1 NaN       NaN       NaN
2 NaN  0.764295  0.886659
3 NaN       NaN  0.594301
4 NaN       NaN  0.692846

Detail:

print (m)
       A      B      C
0  False  False  False
1  False  False  False
2  False   True   True
3  False  False   True
4  False  False   True

print (m.any(axis=1))
0    False
1    False
2     True
3     True
4     True
dtype: bool

print (m.any())
A    False
B     True
C     True
dtype: bool