Why can't R sometimes tell the difference between NA and 0?


I am trying to extract the rows of a data frame where the field "var" equals 0.

But I found that NA values were treated as 0:

There are 20 rows with 0 and 809 rows with NA.

There are 81291 rows in total in data frame d.

> length(d$var[d$var == "0"])
[1] 829

> length(d$var[d$var == 0])
[1] 829

The above 829 values include both the 0 and the NA values.

> length(d$var[d$var == "NA"])
[1] 809

> length(d$var[d$var == NA])
[1] 81291

Why does the above code give the number of rows of d?


There are 3 best solutions below

1

x == NA is not the way to test whether the value of some variable x is NA. Use is.na() instead:

> 2 == NA
[1] NA
> is.na(2)
[1] FALSE

Similarly, use is.null() to test whether an object is the NULL object.
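As a small sketch of the difference, assuming a made-up vector x (not the asker's data), comparing with == propagates NA while is.na() always answers TRUE or FALSE:

```r
# Hypothetical example vector; not taken from the question's data.
x <- c(7, 0, NA, 0)

# Comparing with == propagates NA instead of answering TRUE/FALSE.
eq_na <- x == NA
print(eq_na)

# is.na() asks the question directly and never returns NA.
is_missing <- is.na(x)
print(is_missing)

# Count the missing values and the non-missing zeros separately.
n_missing <- sum(is.na(x))
n_zero <- sum(x == 0, na.rm = TRUE)  # na.rm drops the NA comparison result
print(c(n_missing = n_missing, n_zero = n_zero))
```

Here sum() with na.rm = TRUE is just one convenient way to count zeros while ignoring missing values.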

2

One way to evaluate this is the inelegant

length(d$var[(d$var == 0) & (!is.na(d$var))])

(or slightly more compactly, sum(d$var==0 & !is.na(d$var)))

I think your code illustrates some misunderstandings you are having about R syntax. Let's make a compact, reproducible example to illustrate:

d <- data.frame(var=c(7, 0, NA, 0))

As you point out, length(d$var[d$var==0]) will return 3, because NA==0 is evaluated as NA, and indexing a vector with NA returns an NA element instead of dropping it.

When you enclose the value you're looking for in quotation marks, R evaluates it as a string. So length(d$var[d$var == "NA"]) is asking how many elements in d$var are the character string "NA". Since no element of your data set is the string "NA", the only elements you get back are those for which the comparison itself evaluates to NA (because NA == "NA" evaluates to NA), so the length is the number of missing values.

In order to answer your last question, look at what d$var[d$var==NA] returns: a vector of NA of the same length as your original vector. Again, any == comparison with NA evaluates to NA. Since all of the comparisons in that expression are to NA, you'll get back a vector of NAs that is the same length as your original vector.
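The three cases above can be checked on the toy data frame. This is only a sketch; the variable names zeros_and_na, only_na and all_na are introduced here for illustration:

```r
# Toy data frame from the answer above: one 7, two zeros, one NA.
d <- data.frame(var = c(7, 0, NA, 0))

# d$var == 0 is FALSE TRUE NA TRUE; the NA index yields an NA element.
zeros_and_na <- d$var[d$var == 0]

# "NA" is a string, so only the real NA produces an NA comparison result.
only_na <- d$var[d$var == "NA"]

# Every comparison with NA is NA, so every position comes back as NA.
all_na <- d$var[d$var == NA]

print(length(zeros_and_na))  # 3: two zeros plus one NA
print(length(only_na))       # 1: the single missing value
print(length(all_na))        # 4: same length as d$var
```

This mirrors the question's numbers: 829 is the zeros plus the NAs, 809 is the NAs alone, and 81291 is the full length of the column.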

5

Here is the solution that gives the right answer.

length(which(d$var == 0))

The reason you are facing that problem is that the condition in your expression does not give FALSE for the NA values; it gives NA instead. When you use that condition as an index, every element whose condition is not FALSE is returned, so the NA entries come back as well (as NA). In the expression I have given, which() returns only the positions where the condition is TRUE, and hence you get the right answer.
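Since the original goal was to extract rows rather than count them, here is a minimal sketch of three ways to do that. The id column is a hypothetical addition so that the result has more than one column to show:

```r
# Toy data; the id column is hypothetical and only labels the rows.
d <- data.frame(var = c(7, 0, NA, 0), id = 1:4)

# which() keeps only the positions where the condition is TRUE.
idx <- which(d$var == 0)
rows_which <- d[idx, ]

# Equivalent: exclude NA explicitly before comparing.
rows_explicit <- d[!is.na(d$var) & d$var == 0, ]

# subset() treats an NA condition as FALSE, so NA rows are dropped.
rows_subset <- subset(d, var == 0)

print(idx)         # 2 4
print(rows_which)
```

All three select the same two rows here; which one to use is mostly a matter of taste.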