The difference between comparison to np.nan and isnull()

I supposed that

data[data.agefm.isnull()]

and

data[data.agefm == numpy.nan]

are equivalent. They are not: the first returns the rows where agefm is NaN, as expected, but the second returns an empty DataFrame. I thought that missing values were always equal to np.nan, but that seems to be wrong.

The agefm column has float64 dtype:

(Pdb) data.agefm.describe()
count    2079.000000
mean       20.686388
std         5.002383
min        10.000000
25%        17.000000
50%        20.000000
75%        23.000000
max        46.000000
Name: agefm, dtype: float64

What does data[data.agefm == np.nan] mean exactly?

There are 2 best solutions below

np.nan cannot be compared to np.nan directly with ==: the comparison is always False.

np.nan == np.nan

False

While

np.isnan(np.nan)

True

You could also use

pd.isnull(np.nan)

True
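
Applied to the situation in the question, the difference can be sketched as follows. The DataFrame here is a made-up stand-in (only the column name agefm comes from the question), so treat it as an illustration rather than the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data; only the column name is real.
data = pd.DataFrame({"agefm": [17.0, np.nan, 23.0, np.nan]})

# Elementwise == against np.nan is False for every row, NaN rows included.
eq_mask = data.agefm == np.nan

# isnull() is True exactly where the value is missing.
null_mask = data.agefm.isnull()

print(eq_mask.tolist())    # [False, False, False, False]
print(null_mask.tolist())  # [False, True, False, True]

# The first selection is empty; the second holds the two NaN rows.
print(len(data[eq_mask]), len(data[null_mask]))  # 0 2
```

So data[data.agefm == np.nan] means "select the rows where the mask is True", and that mask is never True.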

Examples

Filters nothing, because every value compares unequal to np.nan:

import numpy as np
import pandas as pd
s = pd.Series([1., np.nan, 2.])
s[s != np.nan]

0    1.0
1    NaN
2    2.0
dtype: float64

Filters out the null value:

s = pd.Series([1., np.nan, 2.])
s[s.notnull()]

0    1.0
2    2.0
dtype: float64

We can also exploit this odd comparison behavior to get what we want. Since np.nan != np.nan is True, s == s is False only where s is NaN:

s = pd.Series([1., np.nan, 2.])
s[s == s]

0    1.0
2    2.0
dtype: float64

Or just use dropna:

s = pd.Series([1., np.nan, 2.])
s.dropna()

0    1.0
2    2.0
dtype: float64
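
As a quick sanity check, the three filtering approaches shown here can be compared directly. This is only a sketch on the same small Series; the variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

by_notnull = s[s.notnull()]  # explicit null check
by_self_eq = s[s == s]       # NaN is the only value not equal to itself
by_dropna = s.dropna()       # dedicated method

# Series.equals compares values and index labels.
print(by_notnull.equals(by_self_eq))  # True
print(by_notnull.equals(by_dropna))   # True

# Counting the missing values uses the same mask idea.
print(int(s.isnull().sum()))          # 1
```

Of the three, notnull() and dropna() state the intent most clearly; s == s works but tends to puzzle readers who don't know the NaN rule.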

NaN is not equal to NaN; in fact, it's not equal to anything, including itself. Its comparison behavior looks similar to the following. [1]

class MyNaN(float):
    def __eq__(self, other):
        return False
    def __ne__(self, other):
        return True
    def __repr__(self):
        return 'nan'
    
x = MyNaN()
print(x)        # nan
print(x == x)   # False
print(x != x)   # True
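
The built-in float NaN can be checked against this sketch using only the standard library. The variable name nan below is just a local name chosen for the example.

```python
import math

nan = float("nan")  # a real NaN, as opposed to the MyNaN illustration

print(nan == nan)        # False
print(nan != nan)        # True
print(nan == 1.0)        # False

# Ordering comparisons involving NaN are False as well.
print(nan < 1.0, nan > 1.0)  # False False

# The reliable scalar test is math.isnan.
print(math.isnan(nan))   # True
```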

So, one possible way to check if there are NaN values is to check if a value doesn't equal itself. For example, if the aim is to find out which rows contain NaN, instead of checking for equality with NaN, check inequality with itself.

An example:

s = pd.Series(['a', np.nan, 'b'])
x = s[s != s]
y = s[s.isnull()]

x.equals(y)       # True

np.isnan vs pd.isnull

np.isnan only works for numeric values, while pd.isnull works for all kinds of datatypes.

s = pd.Series(['a', np.nan, 'b'])

s.isnull()        # OK
np.isnan(s)       # TypeError
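
A slightly broader sketch of the same point, checking a few scalar values one at a time. The list of values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# A mix of types: a float, NaN, None, a missing timestamp, and a string.
values = [1.5, np.nan, None, pd.NaT, "a"]

# pd.isnull accepts all of them and flags the missing ones.
flags = [bool(pd.isnull(v)) for v in values]
print(flags)  # [False, True, True, True, False]

# np.isnan is a numeric ufunc and rejects the string outright.
try:
    np.isnan("a")
    isnan_error = None
except TypeError as exc:
    isnan_error = type(exc).__name__
print(isnan_error)  # TypeError
```

For pandas objects that may hold non-numeric data, isnull (or its alias isna) is therefore the safer default.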

1: Its actual CPython implementation is much more involved. This is only for illustration.