Different results from any(df.isnull()) and pd.isnull(data).any()


I am using the standard Boston housing dataset with pandas, and I have noticed something that bugs me:

When I check for missing values in two different ways, I get two different results, even though I expected them to agree.

Any ideas why this is happening?

Here's my code:

# loading df
import pandas as pd
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, so this needs an older version

boston = load_boston()
boston_data = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_data['price'] = boston.target  # the price column

Now if I run this code:

pd.isnull(boston_data).any()

this is the outcome:

CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
dtype: bool

However, if I run it like this:

any(boston_data.isnull())

it returns: True

Why?


Answer by dramarama:

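pd.isnull(boston_data).any() checks each column for missing values and returns one boolean per column. In your case it is False for every column, so the data has no missing values.

any(boston_data.isnull()) does not look at the values at all. The built-in any() iterates over the DataFrame, which yields the column labels, and it returns True because non-empty strings such as 'CRIM' are truthy. It does not mean that there is a missing value in the DataFrame.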

Answer by JarroVGIT:

There is a difference between the .any() method that comes with pandas and the built-in any() function that comes with Python. If you call df.isnull(), you create a boolean DataFrame of the same shape (True where a value is missing, False elsewhere). Calling pandas' .any() on that DataFrame reduces it column by column by default: it returns a Series with one entry per column, and an entry is True if any value in that column is True. In this case none of the values are missing, so every entry is False.
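To make the reduction concrete, here is a minimal sketch on a made-up frame (the name toy and its values are assumptions for illustration, not the Boston data). It shows the per-column result, the per-row variant, and how chaining .any() twice collapses everything into a single boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "B" has one missing value, column "A" has none.
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, np.nan, 6.0]})

mask = toy.isnull()          # boolean DataFrame with the same shape as toy
per_column = mask.any()      # Series, one bool per column (axis=0 is the default)
per_row = mask.any(axis=1)   # Series, one bool per row
anywhere = mask.any().any()  # single bool for the whole frame

print(per_column.tolist())   # [False, True]
print(per_row.tolist())      # [False, True, False]
print(bool(anywhere))        # True
```

On the frame from the question, the chained form would give a single False, which matches the per-column output shown there.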

The built-in any() function, on the other hand, takes an iterable and returns True if bool(x) is True for at least one element x of that iterable. If you pass it a pd.DataFrame object, it iterates over it, and iterating over a DataFrame yields its column labels rather than its values. You can see this if you iterate over a DataFrame yourself:

import pandas as pd

df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6]})
print(df)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6
for x in df:
    print(f'x is of type {type(x)} and is of value {x}')
# x is of type <class 'str'> and is of value A
# x is of type <class 'str'> and is of value B

So, if you call the builtin any() function, you basically are saying any([bool('A'), bool('B')]), which is the same as any([True, True]) which is True.
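A small sketch of how misleading this can get, using two made-up frames (clean and holes are assumed names, not from the question). The built-in any() says True for a frame with no missing values, and False for a frame that is entirely NaN when its only column label happens to be the falsy integer 0.

```python
import numpy as np
import pandas as pd

# No missing values, but the column labels "A" and "B" are truthy strings.
clean = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
print(list(clean.isnull()))  # ['A', 'B']  (labels, not values)
print(any(clean.isnull()))   # True

# Every value is missing, but the only column label is the integer 0, which is falsy.
holes = pd.DataFrame({0: [np.nan, np.nan]})
print(any(holes.isnull()))                # False
print(bool(holes.isnull().any().any()))   # True (the pandas reduction sees the NaNs)
```

So the result of the built-in any() depends only on the column labels, never on the data.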

Long story short: don't use the built-in any() on a pandas DataFrame, because iterating over a DataFrame yields its column labels, not its values, so the result is not what you'd expect.
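If you want a single yes/no answer for the whole frame, one possible approach is to reduce the underlying boolean array instead. The sketch below is only an illustration: has_missing is a hypothetical helper name, and the two sample frames are made up.

```python
import numpy as np
import pandas as pd


def has_missing(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if at least one cell of df is missing."""
    # to_numpy() gives a 2-D boolean array; ndarray.any() reduces over all cells.
    return bool(df.isnull().to_numpy().any())


complete = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
partial = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

print(has_missing(complete))             # False
print(has_missing(partial))              # True
print(partial.isnull().sum().to_dict())  # {'A': 1, 'B': 0}  (missing count per column)
```

The per-column count from isnull().sum() is often more useful than the single boolean, since it also tells you where the gaps are.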