Create a data frame with lagged values as numeric vector using dplyr

94 Views Asked by At

My data looks as follows

data <- data.frame(
  A = c(10,20,30,40,50,60,70,80,90,100),
  B = c(110,120,130,140,150,160,170,180,190,200)
)

I wish to create a new column where row-wise I take each value from column A and concatenate it with the corresponding value from B as well as the lagged 3 values from column B and stores it in a numeric vector.

The final data will look as follows:

    new_data <- data.frame(
  A = c(10,20,30,40,50,60,70,80,90,100),
  B = c(110,120,130,140,150,160,170,180,190,200),
  c = I(list(c(10,110,NA,NA,NA),c(20,120,110,NA,NA),c(30,130,120,110,NA),c(40,140,130,120,110),c(50,150,140,130,120),c(60,160,150,140,130),c(70,170,160,150,140),c(80,180,170,160,150),c(90,190,180,170,160),c(100,200,190,180,170))
))

In my real data, I need to capture up to 200 lagged values so manually typing list(A, lag(B),lag(B,2),lag(B,3).... is not a road I want to go down.

I've tried using purrr::map to achieve my results but I can't get the lagged values into my vector

data %>%
  rowwise() %>% 
  mutate(z = list(c(A,B,map_vec(1:3, ~lag(B,.x, default = NA)))))

The ultimate goal is to calculate the percentage_rank of each value of column A relative to the current row + 3 previous values from column B to identify whether each point of A is an outlier i.e. > 75th percentile.

My approach is to create a numeric vector of values in column C that contains the current value of A and the 4 values from B, apply dplyr::percent_rank to each vector and pluck the percent rank for each index = 1 value (column A value) to get me the percent rank of value A.

Any guidance you can provide would be greatly appreciated.

2

There are 2 best solutions below

6
Andre Wildberg On

EDIT: new data and variable lag-length

Set <- 3 # For e.g. all rows use "Set <- (nrow(data)-1)"

data %>% 
  mutate(setNames(data.frame(sapply(1:!!Set, \(x) 
    lag(B, x))), paste0("b", 1:!!Set))) %>% 
  rowwise() %>% 
  mutate(C = list(c_across(everything()))) %>% 
  select(-matches("b\\d+")) %>% 
  data.frame()
     A   B                       C
1   10 110     10, 110, NA, NA, NA
2   20 120    20, 120, 110, NA, NA
3   30 130   30, 130, 120, 110, NA
4   40 140  40, 140, 130, 120, 110
5   50 150  50, 150, 140, 130, 120
6   60 160  60, 160, 150, 140, 130
7   70 170  70, 170, 160, 150, 140
8   80 180  80, 180, 170, 160, 150
9   90 190  90, 190, 180, 170, 160
10 100 200 100, 200, 190, 180, 170

Using a pre-rowwise lag variable

library(dplyr)

df %>% 
  mutate(b = lag(B)) %>% 
  rowwise() %>% 
  mutate(C = list(c(A, B, b)), b = NULL) %>% 
  data.frame()
   A   B           C
1 10  60  10, 60, NA
2 20  70  20, 70, 60
3 30  80  30, 80, 70
4 40  90  40, 90, 80
5 50 100 50, 100, 90
0
ThomasIsCoding On

With base R, you can try embed like below

L <- 3
transform(
    data,
    C = asplit(cbind(A, B, embed(c(rep(NA, L), head(B, -1)), L)), 1)
)

and you will obtain

     A   B                       C
1   10 110     10, 110, NA, NA, NA
2   20 120    20, 120, 110, NA, NA
3   30 130   30, 130, 120, 110, NA
4   40 140  40, 140, 130, 120, 110
5   50 150  50, 150, 140, 130, 120
6   60 160  60, 160, 150, 140, 130
7   70 170  70, 170, 160, 150, 140
8   80 180  80, 180, 170, 160, 150
9   90 190  90, 190, 180, 170, 160
10 100 200 100, 200, 190, 180, 170