replacing NA with next available number within a group

71 Views Asked by At

I have a relatively large dataset and I want to replace NA value for the price in a specific year and for a specific ID number with an available value in next year within a group for the same ID number. Here is a reproducible example:

ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)

   ID year value
1   1 2000  1000
2   2 2001 20000
3   3 2002 30000
4   2 2002    NA
5   2 2003 40000
6   3 2007    NA
7   1 2001  6000
8   4 2000  4000
9   5 2005    NA
10  5 2006 20000
11  1 2002  7000
12  2 2004 50000
13  2 2005 60000

So, for example for ID=2 we have following value and years:

   ID year value
   2 2001  20000
   2 2002  NA
   2 2003  40000
   2 2004  50000
   2 2005  60000

So in the above case, NA should be replaced with 40000 (Values in next year). And the same story for other IDs. the final result should be in this form:

   ID year value
   1 2000  1000
   1 2001  6000
   1 2002  7000
   2 2001  20000
   2 2002  40000
   2 2003  40000
   2 2004  50000
   2 2005  60000
   3 2007    NA
   4 2000  4000
   5 2005  20000
   5 2006  20000

Please note that for ID=3 since there is no next year available, we want to leave it as is. That's why it's in the form of NA

I appreciate if you can suggest a solution
Thanks

3

There are 3 best solutions below

2
On BEST ANSWER

dplyr solution

library(tidyverse)

data2 <- data %>%
  dplyr::group_by(ID) %>%
  dplyr::arrange(year) %>% 
  dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups:   ID [5]
      ID  year value replaced_value
   <dbl> <dbl> <dbl>          <dbl>
 1     1  2000  1000           1000
 2     4  2000  4000           4000
 3     2  2001 20000          20000
 4     1  2001  6000           6000
 5     3  2002 30000          30000
 6     2  2002    NA          40000
 7     1  2002  7000           7000
 8     2  2003 40000          40000
 9     2  2004 50000          50000
10     5  2005    NA          20000
11     2  2005 60000          60000
12     5  2006 20000          20000
13     3  2007    NA             NA
0
On

You could do:

library(dplyr)

data %>%
  group_by(ID) %>%
  mutate(value = coalesce(value, as.integer(sapply(pmin(year + 1, max(year)), function(x) value[year == x])))) %>%
  arrange(ID, year)

Output:

# A tibble: 13 x 3
# Groups:   ID [5]
      ID  year value
   <dbl> <dbl> <dbl>
 1     1  2000  1000
 2     1  2001  6000
 3     1  2002  7000
 4     2  2001 20000
 5     2  2002 40000
 6     2  2003 40000
 7     2  2004 50000
 8     2  2005 60000
 9     3  2002 30000
10     3  2007    NA
11     4  2000  4000
12     5  2005 20000
13     5  2006 20000

Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:

library(tidyverse)

data %>%
  arrange(ID, year) %>%
  group_by(ID, idx = cumsum(is.na(value))) %>%
  fill(value, .direction = 'up') %>%
  ungroup %>%
  select(-idx)

This is much more straightforward (and likely much faster) in data.table:

library(data.table)

setDT(data)[order(ID, year), ][
  , value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]
0
On

Try this tidyverse approach using a flag to check sequential years and fill() to complete data:

library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>% arrange(ID,year) %>%
  group_by(ID) %>% 
  mutate(Flag=c(1,diff(year))) %>%
  fill(value,.direction = 'downup') %>%
  mutate(value=ifelse(Flag!=1,NA,value)) %>% select(-Flag)

Output:

# A tibble: 13 x 3
# Groups:   ID [5]
      ID  year value
   <dbl> <dbl> <dbl>
 1     1  2000  1000
 2     1  2001  6000
 3     1  2002  7000
 4     2  2001 20000
 5     2  2002 20000
 6     2  2003 40000
 7     2  2004 50000
 8     2  2005 60000
 9     3  2002 30000
10     3  2007    NA
11     4  2000  4000
12     5  2005 20000
13     5  2006 20000