Can somebody explain to me why logical evaluations that resolve to NA produce bogus rows in vector-comparison-based subsets? For example:
employee <- c("Big Shot CEO", "Programmer","Intern","Guy Who Got Fired Last Week")
salary <- c( 10000000, 50000, 0, NA)
emp_salary <- data.frame(employee,salary)
# how many employees paid over 100K?
nrow(emp_salary[salary>100000,]) # Returns 2 instead of 1 -- why?
emp_salary[salary>100000,]
# returns a bogus row of all NA's (not "Guy Who Got Fired")
#        employee salary
# 1  Big Shot CEO  1e+07
# NA         <NA>   <NA>
salary[salary>100000]
# returns:
# [1] 1e+07 NA
NA > 100000 #returns NA
Given this unexpected behavior, what is the preferred way to count employees making over 100K in the above example?
First of all, you probably don't want to cbind() first -- that will coerce all of your variables to character.
Two possible solutions:
subset() automatically excludes cases where the criterion is NA;
summing the logical comparison with na.rm=TRUE counts only the cases where the criterion is TRUE.
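A minimal sketch of the two approaches, reusing the employee and salary vectors from the question. The specific calls (nrow(subset(...)) and sum(..., na.rm = TRUE)) are an assumed reading of the two solutions named above, not necessarily the exact code originally posted; n_subset and n_sum are illustrative names:

```r
employee <- c("Big Shot CEO", "Programmer", "Intern", "Guy Who Got Fired Last Week")
salary <- c(10000000, 50000, 0, NA)
emp_salary <- data.frame(employee, salary)

# 1. subset() treats an NA criterion as FALSE, so the NA row is dropped
n_subset <- nrow(subset(emp_salary, salary > 1e5))

# 2. count the TRUE values of the comparison directly, ignoring NAs
n_sum <- sum(salary > 1e5, na.rm = TRUE)

n_subset  # 1
n_sum     # 1
```

Both give 1 here; the sum() version never builds a data frame, so it is the lighter choice when only the count is needed.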
As for the logic behind the bogus rows:
bigsal <- salary>1e5
is a logical vector which contains NAs, as it must (because there is no way to know whether an NA value satisfies the criterion or not).
As for indexing with NAs, this is probably the most salient bit of documentation (from help("[")): when extracting, an NA index picks an unknown element and so returns NA in the corresponding element of the result. (I also looked at help("[.data.frame") and couldn't see anything more useful.)
The thing to remember is that once the indexing is being done, R no longer has any knowledge that the logical vector was created from the salary column, so there's no way for it to do what you might want, which is to retain the values in the other columns. Here's one way to think about the seemingly strange behaviour of filling in all the columns in the NA row with NAs: if R leaves the row out entirely, that would correspond to the criterion being FALSE. If it retains it (and remember that it can't retain just a few columns and drop the others), then that would correspond to the criterion being TRUE. If the criterion is neither FALSE nor TRUE, then it's hard to see what other behaviour makes sense ...
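To make the three cases concrete, here is a small sketch using the data from the question. The last two lines, which force the NA criterion to FALSE with !is.na() or keep only TRUE positions with which(), are common idioms added here for illustration rather than something taken from the answer; vec_result, df_result, n_notna and n_which are illustrative names:

```r
salary <- c(10000000, 50000, 0, NA)
emp_salary <- data.frame(
  employee = c("Big Shot CEO", "Programmer", "Intern", "Guy Who Got Fired Last Week"),
  salary = salary
)

bigsal <- salary > 1e5            # TRUE FALSE FALSE NA

# On a plain vector, the NA index yields an NA element
vec_result <- salary[bigsal]      # 1e+07 NA

# On a data frame, the NA index yields a whole row of NAs
df_result <- emp_salary[bigsal, ]
nrow(df_result)                   # 2
all(is.na(df_result[2, ]))        # TRUE

# Turning the NA criterion into FALSE drops the row instead
n_notna <- nrow(emp_salary[bigsal & !is.na(bigsal), ])   # 1

# which() returns only the positions where the criterion is TRUE
n_which <- nrow(emp_salary[which(bigsal), ])             # 1
```

Either idiom makes the "treat unknown as FALSE" decision explicit, which is the choice R declines to make on your behalf.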