R: replacing NAs in a data.frame with values in the same position in another dataframe


I have a dataframe with some NA values:

dfa <- data.frame(a=c(1,NA,3,4,5,NA),b=c(1,5,NA,NA,8,9),c=c(7,NA,NA,NA,2,NA))
dfa

I would like to replace the NAs with values in the same position in another dataframe:

dfrepair <- data.frame(a=c(2:7),b=c(6:1),c=c(8:3))
dfrepair

I tried:

dfa1 <- dfa

dfa1 <- ifelse(dfa == NA, dfrepair, dfa)
dfa1

but this did not work.

There are 5 best solutions below


You can do:

dfa <- data.frame(a=c(1,NA,3,4,5,NA),b=c(1,5,NA,NA,8,9),c=c(7,NA,NA,NA,2,NA))
dfrepair <- data.frame(a=c(2:7),b=c(6:1),c=c(8:3))
dfa[is.na(dfa)] <- dfrepair[is.na(dfa)]
dfa

  a b c
1 1 1 7
2 3 5 7
3 3 4 6
4 4 3 5
5 5 8 2
6 7 9 3
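
A small sketch of why this works where the `dfa == NA` attempt did not: any comparison with NA yields NA, whereas is.na() returns a TRUE/FALSE mask. The tiny two-column frames below are made up for illustration and are not the data from the question.

```r
# Comparing with NA never gives TRUE or FALSE, only NA.
x <- c(1, NA, 3)
eq_result <- x == NA   # all NA, so it cannot select anything
na_mask <- is.na(x)    # a usable logical mask

# On a data frame, is.na() returns a logical matrix.
# The same matrix can index both the target and the repair frame.
dfa <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
dfrepair <- data.frame(a = 7:9, b = 4:6)
mask <- is.na(dfa)
dfa[mask] <- dfrepair[mask]
print(dfa)
```

Note that indexing a data frame with a logical matrix goes through as.matrix(), so this form is best suited to frames whose columns all share one type.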

In the tidyverse, you can use purrr::map2_df, which is a strictly bivariate version of mapply that simplifies to a data.frame, and dplyr::coalesce, which replaces NA values in its first argument with the corresponding ones in the second.

library(tidyverse)

dfrepair %>% 
    mutate_all(as.numeric) %>%    # coalesce is strict about types
    map2_df(dfa, ., coalesce)

## # A tibble: 6 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     1     7
## 2     3     5     7
## 3     3     4     6
## 4     4     3     5
## 5     5     8     2
## 6     7     9     3
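
To see what coalesce does on its own, here is a minimal sketch on two plain vectors (the values are invented for illustration). Both vectors are doubles, which sidesteps the type question mentioned in the comment above.

```r
library(dplyr)

# coalesce() keeps each value of x unless it is NA,
# in which case it takes the value at the same position in y.
x <- c(1, NA, 3, NA)
y <- c(10, 20, 30, 40)
filled <- coalesce(x, y)
print(filled)
```

map2_df then simply applies this to each pair of columns in turn.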

We can use Map from base R to replace the NAs column by column, pairing each column of dfa with the matching column of dfrepair:

dfa[] <- Map(function(x,y) {x[is.na(x)] <- y[is.na(x)]; x}, dfa, dfrepair)
dfa
#  a b c
#1 1 1 7
#2 3 5 7
#3 3 4 6
#4 4 3 5
#5 5 8 2
#6 7 9 3
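
If this is needed in more than one place, the same idea could be wrapped in a small function. The sketch below is one possible way to do it: the name fill_na_from and the shape checks are assumptions added for illustration, not part of the answer above.

```r
# Hypothetical helper: fill NAs in `target` from the same cells of `source`.
fill_na_from <- function(target, source) {
  # Both inputs must be data frames with identical shape and column names.
  stopifnot(
    is.data.frame(target), is.data.frame(source),
    identical(dim(target), dim(source)),
    identical(names(target), names(source))
  )
  # Replace column by column so each column keeps its own type.
  target[] <- Map(function(x, y) {
    na <- is.na(x)
    x[na] <- y[na]
    x
  }, target, source)
  target
}

dfa <- data.frame(a = c(1, NA, 3, 4, 5, NA),
                  b = c(1, 5, NA, NA, 8, 9),
                  c = c(7, NA, NA, NA, 2, NA))
dfrepair <- data.frame(a = 2:7, b = 6:1, c = 8:3)
repaired <- fill_na_from(dfa, dfrepair)
print(repaired)
```

Because the function returns a new data frame, the original dfa is left untouched.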

If the columns have different types, the replacement should be done column by column. Another simple way, which replaces the values in place, is:

for(i in seq_along(dfa)) {
    . <- is.na(dfa[[i]])
    dfa[[i]][.] <- dfrepair[[i]][.]
}

Or, additionally using which(), which might improve speed or memory usage in some cases:

for(i in seq_along(dfa)) {
    . <- which(is.na(dfa[[i]]))
    dfa[[i]][.] <- dfrepair[[i]][.]
}
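
As a rough illustration of the mixed-type case, the sketch below runs the loop on a frame with one character column and two numeric columns, then checks that the column classes are unchanged. Indexing with a whole logical matrix would instead go through as.matrix(), which converts mixed columns to a single common type. The variable names classes_before and classes_after are only for this example.

```r
dfa <- data.frame(a = c("A", NA, "B", "C", "D", NA),
                  b = c(1, 5, NA, NA, 8, 9),
                  c = c(7, NA, NA, NA, 2, NA),
                  stringsAsFactors = FALSE)
dfrepair <- data.frame(a = letters[2:7], b = 6:1, c = 8:3,
                       stringsAsFactors = FALSE)

classes_before <- sapply(dfa, class)

# Column-wise replacement: each column is filled from its counterpart only.
for (i in seq_along(dfa)) {
  na <- which(is.na(dfa[[i]]))
  dfa[[i]][na] <- dfrepair[[i]][na]
}

classes_after <- sapply(dfa, class)
print(dfa)
print(classes_after)
```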

Benchmark of the column-wise base R options, using data with mixed column types:

dfa <- data.frame(a=c("A",NA,"B","C","D",NA),b=c(1,5,NA,NA,8,9),c=c(7,NA,NA,NA,2,NA))
dfrepair <- data.frame(a=letters[2:7],b=c(6:1),c=c(8:3))

bench::mark(
akrun = local({dfa[] <- Map(function(x,y) {x[is.na(x)] <- y[is.na(x)]; x}, dfa, dfrepair); dfa}),
GKi = local({for(i in seq_along(dfa)) {. <- is.na(dfa[[i]])
                 dfa[[i]][.] <- dfrepair[[i]][.]}
                 dfa})
)
#  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 akrun        64.4µs 70.3µs    12895.      280B     26.7  5793    12      449ms
#2 GKi          54.8µs   60µs    16347.      280B     28.7  7395    13      452ms
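
With dplyr, coalesce can be applied column by column. It works on vectors rather than on whole data frames, so pair the columns up with Map:

dfa <- data.frame(a=c(1,NA,3,4,5,NA),b=c(1,5,NA,NA,8,9),c=c(7,NA,NA,NA,2,NA))
dfrepair <- data.frame(a=c(2:7),b=c(6:1),c=c(8:3))
library(dplyr)
# as.numeric() cannot convert a whole data frame, so convert and coalesce per column
dfa[] <- Map(function(x, y) coalesce(x, as.numeric(y)), dfa, dfrepair)
dfa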

  a b c
1 1 1 7
2 3 5 7
3 3 4 6
4 4 3 5
5 5 8 2
6 7 9 3

Because much of dplyr is implemented in compiled code, it is fast in many cases. Another important advantage is that coalesce, like many other dplyr functions, has a counterpart of the same name in SQL. Using dplyr, you learn some SQL by coding in R. ;-)