Residuals from linear model with lag predictor by group as a new variable


I am trying to add a new variable (residual) to my original data frame based on the residuals of a first-order autoregressive model using lm().

residuals(lm(Var1 ~ lag(Var1), panel_data))

It is similar to the question "R: Replacement has [x] rows, data has [y] - residuals from a linear model in new variable", but with groups. I already tried the code proposed there, with an added group_by line. However, it produces wrong residuals for the first observation of every group. How can I adapt the following code?

library(dplyr)
library(broom)

panel_data %>% 
  group_by(group) %>%
  lm(Var1 ~ lag(Var1), data = .) %>% 
  augment() %>% 
  select(.rownames, .std.resid) %>% 
  right_join(mutate(panel_data, row = as.character(row_number())), 
             by = c(".rownames" = "row"))
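A likely reason the first residual of each group is off: lm() does not respect group_by(), so lag(Var1) inside the formula is computed over the whole column and the first row of each group borrows the last value of the previous group. The sketch below shows the difference on a small made-up data set; toy, lag_all and lag_grp are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy panel: two groups, three time points each
toy <- tibble(
  group = rep(1:2, each = 3),
  time  = rep(1:3, times = 2),
  Var1  = c(1, 2, 3, 10, 20, 30)
)

toy_lags <- toy %>%
  mutate(lag_all = lag(Var1)) %>%  # lag over the whole column: leaks across groups
  group_by(group) %>%
  mutate(lag_grp = lag(Var1)) %>%  # lag restarts within each group
  ungroup()

# Row 4 is the first observation of group 2:
# lag_all is 3 (the last value of group 1), lag_grp is NA
toy_lags
```

The lag therefore has to be computed inside the grouped data before the model is fitted, and the model has to be fitted once per group.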

An example data set can be constructed as follows:

library(data.table)

# Number of groups
num_groups <- 20

# Number of months
num_months <- 100

panel_data <- data.table(
  group = rep(1:num_groups, each = num_months), # Group IDs
  time = rep(1:num_months, times = num_groups), # Time period
  Var1 = rnorm(num_groups * num_months), # Variable 1
  Var2 = rnorm(num_groups * num_months)  # Variable 2
)

There are 2 best solutions below

Best answer

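Compute the lag within each group, drop the first row of each group (where the lag is NA), fit one model per group, and join the residuals back onto the original data: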

library(dplyr)
library(broom)

panel_data %>% 
  group_by(group) %>%
  mutate(Var1_lag = lag(Var1)) %>%
  filter(!is.na(Var1_lag)) %>%
  do({
    model_data <- .
    augment(lm(Var1 ~ Var1_lag, data = model_data), data = model_data)
  }) %>% 
  right_join(panel_data, by = c("group", "time", "Var1")) %>% 
  select(group, time, Var1, Var1_lag, .resid) %>%
  mutate(.resid = ifelse(is.na(Var1_lag), NA, .resid)) %>% 
  ungroup()
# A tibble: 2,000 × 5
   group  time    Var1 Var1_lag .resid
   <int> <int>   <dbl>    <dbl>  <dbl>
 1     1     2  0.689   -0.269   0.671
 2     1     3  1.21     0.689   1.25 
 3     1     4  2.06     1.21    2.13 
 4     1     5 -0.292    2.06   -0.175
 5     1     6  1.44    -0.292   1.42 
 6     1     7 -0.938    1.44   -0.857
 7     1     8 -1.33    -0.938  -1.39 
 8     1     9 -0.0830  -1.33   -0.163
 9     1    10  0.273   -0.0830  0.266
10     1    11 -0.466    0.273  -0.452
# ℹ 1,990 more rows
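Since do() is marked as superseded in recent dplyr versions, the same per-group fit can also be sketched with group_modify(). This is only an illustration under assumptions: toy is a small made-up panel, the seed is arbitrary, and the join uses group and time only, which assumes that pair identifies a row.

```r
library(dplyr)
library(broom)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical toy panel: three groups, ten time points each
toy <- tibble(
  group = rep(1:3, each = 10),
  time  = rep(1:10, times = 3),
  Var1  = rnorm(30)
)

resid_by_group <- toy %>%
  group_by(group) %>%
  mutate(Var1_lag = lag(Var1)) %>%   # lag within each group
  filter(!is.na(Var1_lag)) %>%       # drop the first row of each group
  # fit one AR(1)-style regression per group and attach its residuals
  group_modify(~ augment(lm(Var1 ~ Var1_lag, data = .x), data = .x)) %>%
  ungroup() %>%
  select(group, time, .resid)

# Keep every original row; the first observation of each group gets NA
result <- left_join(toy, resid_by_group, by = c("group", "time"))
```

Joining on group and time instead of Var1 avoids matching on a floating-point column, and left_join() keeps the rows in their original order.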
Other answer

The answer by TarJae works fine. Another way is to select the residual column first and then merge it back into the initial data set:

panel_data %>% 
  group_by(group) %>%
  mutate(Var1_lag = lag(Var1)) %>%
  filter(!is.na(Var1_lag)) %>%
  do({
    model_data <- .
    augment(lm(Var1 ~ Var1_lag, data = model_data), data = model_data)
  }) %>% 
  select(group, time, Var1, Var1_lag, .resid) %>%
  right_join(panel_data, by = c("group", "time", "Var1")) %>% 
  ungroup()
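A more compact variant, offered as a sketch rather than as a replacement for the answers above, fits the model inside a grouped mutate() and passes na.action = na.exclude, so that residuals() pads the dropped first row of each group with NA and no join is needed. The names toy and resid_ar1 are hypothetical, and the data are assumed to be sorted by time within each group.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical toy panel: three groups, ten time points each
toy <- tibble(
  group = rep(1:3, each = 10),
  time  = rep(1:10, times = 3),
  Var1  = rnorm(30)
)

result <- toy %>%
  arrange(group, time) %>%   # lag() relies on row order within each group
  group_by(group) %>%
  mutate(
    # na.exclude keeps the residual vector as long as the group,
    # with NA where the lagged value is missing
    resid_ar1 = unname(residuals(
      lm(Var1 ~ dplyr::lag(Var1), na.action = na.exclude)
    ))
  ) %>%
  ungroup()
```

The result has one row per original observation, with resid_ar1 set to NA for the first time point of every group.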