Bootstrap panel data: randomly selecting individuals, not the person-waves

84 Views Asked by At

I am trying to replicate the Stata code (available here) in R. The idea is randomly selecting bootstrap sample from the idcode, not the idcode-year and then estimate the parameters from the selected samples and save them in an Stata datafile. Cluster and idcluster options in the Stata code are used to force Stata to randomly select the bootstrap samples from the idcodes, not the idcode-year.

The below R code does the sampling part per @PBulls suggestion. I wonder how to modify the R code so it replicates the Stata code also pasted below.

Here is the R code:

data <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
  dplyr::filter(!is.na(ttl_exp) & !is.na(hours))

panelboot <- function(df) {
  ids <- unique(df[["idcode"]])
  
  data.frame(
    idcode = sample(ids, replace=TRUE),
    nidcode = seq_along(ids)
  ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
}

set.seed(1)
boot1 <- panelboot(data)

## ID 1130 is selected 3 times total, contributing 8 observations each time
fid <- which(boot1[,1] == boot1[1,1])

## The resamples of ID 1130 have new IDs 1, 933, 1671
boot1[fid,2]

Here is the Stata code and its output:

use "nslwork.dta", clear 
*you need below for bootstraping from observation not person-years
tsset, clear
generate long newid = idcode // long is not about the data format but it refers to the variable format 
tsset newid year

capture program drop savemargins
program savemargins, rclass
reg ttl_exp hours
end 
bootstrap _b,  saving(boot_output.dta, replace ) reps(10) cluster(idcode) idcluster(newid) : savemargins 

Here is Stata output:

Bootstrap replications (10): .........10 done

Linear regression                                       Number of obs = 28,467
                                                        Replications  =     10
                                                        Wald chi2(1)  = 194.08
                                                        Prob > chi2   = 0.0000
                                                        R-squared     = 0.0118
                                                        Adj R-squared = 0.0118
                                                        Root MSE      = 4.6267

                              (Replications based on 4,710 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
     ttl_exp | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0512427   .0036782    13.93   0.000     .0440335    .0584519
       _cons |   4.343879   .1302096    33.36   0.000     4.088673    4.599085
------------------------------------------------------------------------------
2

There are 2 best solutions below

0
PBulls On BEST ANSWER

From the comments, here's code that will perform cluster resampling & perform a "pooled OLS":

N_SAMPLES <- 100

df <- haven::read_dta("http://www.stata-press.com/data/r9/nlswork.dta") |>
  dplyr::filter(!is.na(ttl_exp) & !is.na(hours))
 
panelboot <- function(df) {
  ids <- unique(df[["idcode"]])
 
  data.frame(
    idcode = sample(ids, replace=TRUE),
    nidcode = seq_along(ids)
  ) |> dplyr::left_join(df, by = "idcode", relationship = "many-to-many")
}
 
reg <- function(df, id="nidcode") {
  coef(plm::plm(ttl_exp ~ hours, data = df, model="pooling", index=c(id, "year")))
}
 
set.seed(1)
t0 <- reg(df, id="idcode")
t <- t(vapply(seq_len(N_SAMPLES), \(d) reg(panelboot(df)), numeric(2)))
boot_se <- matrixStats::colSds(t)
 
cbind(t0, boot_se)
#>               Estimate  SE
#> (Intercept)   4.343879  0.141799
#> hours         0.051243  0.003935
 
t(vapply(seq_along(t0), \(i) boot::norm.ci(t0=t0[i], t=t[,i])[2:3], numeric(2)))
#>               Normal-based 95% CI
#> (Intercept)   4.044749  4.627206
#> hours         0.043783  0.059056

A few pointers to some other questions that came up:

  • The number of bootstrap samples is set by the length of the object in the first vapply call, I've moved this to the top.
  • The bootstrap draws themselves are being saved as the object t (in line with boot's parameter naming). This is an N*P matrix, with N the number of bootstrap samples (rows) and P the number of bootstrapped parameters (columns). I subsequently calculate the standard deviations of each column (matrixStats::colSds). You were previously filling and possibly overwriting this object very differently, in that case you may have to calculate row-wise SDs or something else entirely.
0
jay.sf On

Actually this is a "cluster bootstrap", where idcode are the clusters.

I'm not sure if you estimate an OLS or a FE model, so I'll show both. Both estimators are available in lfe::felm which is fast and easy to handle (look into help('felm', package=lfe)). In the FE case, in the formula we have to replace the old name of the cluster variable with the new one.

To replicate the Stata code, you can do:

> clboot <- \(fo, data, cluster, reps) {
+   uclu <- unique(data[[cluster]])  ## get unique clusters
+   ## check for sufficient number of clusters
+   if ((l <- length(uclu)) < 42L) {warning(sprintf('very few clusters (%s)', l))}
+   replicate(reps, {  ## replicate B times
+     samp <- sample(uclu, replace=TRUE)  ## draw samples w/ replacement
+     X <- merge(data,  ## leave only sampled clusters, but with new cluster IDs
+                data.frame(idcode=samp, .newid=seq_along(samp)))
+     ## if FE, replace name of cluster variable
+     if (any(grepl('\\|', fo)) && fo[[3]][[3]] != 0) {  
+       fo[[3]][[3]][[
+         grep(cluster, as.character(fo[[3]][[3]]))
+       ]] <- as.name('.newid')
+     }
+     lfe::felm(fo, X)$coefficients  ## estimate FE
+   }, simplify=FALSE) |> do.call(what='cbind') |> t()  ## make matrix
+ }

OLS model

> ## estimating OLS ####
> ols <- lfe::felm(ttl_exp ~ hours + wks_work, dat)
> 
> B <- 299L ## nreps
> set.seed(42)  ## for sake of reproducibility
> bt_ols <- clboot(ols$formula, dat, 'idcode', 299L)  ## using formula of fit
> 
> cbind(Estimate=as.vector(ols$coefficients),  ## estimate comes from fit
+       Boot_SE=matrixStats::colSds(bt_ols))  ## SEs are column SDs of boot
               Estimate     Boot_SE
(Intercept) 0.543404981 0.112434762
hours       0.007563445 0.002802732
wks_work    0.100810715 0.000923662

FE model

> ## estimating FE w/ idcode + year as FE ####
> fe <- lfe::felm(ttl_exp ~ hours + wks_work | idcode + year, dat)
> 
> set.seed(42)  ## for sake of reproducibility
> B <- 299L ## nreps
> bt_fe <- clboot(fe$formula, dat, 'idcode', B)  ## using formula of fit
> 
> cbind(Estimate=as.vector(fe$coefficients),
+       Boot_SE=matrixStats::colSds(bt_fe))
           Estimate      Boot_SE
hours    0.01754546 0.0017847532
wks_work 0.01583750 0.0007092328

Parallelize

To publish, you may want to increase the number of bootstrap replications to sth like 1999L. To speed things up, consider to parallelize the code. On Linux we easily may use parallel::mclapply:

> clboot <- \(fo, data, cluster, reps, ncpu=7L, seed=42L) {
+   ## setting seed is slightly more complicated
+   RNGkind("L'Ecuyer-CMRG")
+   on.exit(RNGkind('default'))
+   set.seed(seed)
+   parallel::mc.reset.stream()
+   ## +++++++++++++++++++++++++++++++++++++++++
+   uclu <- unique(data[[cluster]])
+   if ((l <- length(uclu)) < 42L) {warning(sprintf('very few clusters (%s)', l))}
+   parallel::mclapply(seq_len(reps), \(i) {
+     samp <- sample(uclu, replace=TRUE)
+     X <- merge(data,
+                data.frame(idcode=samp, .newid=seq_along(samp)))
+     ## if FE, replace name of cluster variable
+     if (any(grepl('\\|', fo)) && fo[[3]][[3]] != 0) {
+       fo[[3]][[3]][[
+         grep(cluster, as.character(fo[[3]][[3]]))
+       ]] <- as.name('.newid')
+     }
+     lfe::felm(fo, X)$coefficients
+   }, mc.cores=ncpu) |> do.call(what='cbind') |> t()
+ }
> fe <- lfe::felm(ttl_exp ~ hours + wks_work | idcode + year, dat)
> system.time(bt_fe <- clboot(fe$formula, dat, 'idcode', 1999L, seed=42))
   user  system elapsed 
  1.033   0.548  48.057 
> cbind(Estimate=as.vector(fe$coefficients),  ## estimate comes from fit
+       Boot_SE=matrixStats::colSds(bt_fe))  ## SEs are column SDs of boot
           Estimate      Boot_SE
hours    0.01754546 0.0018040498
wks_work 0.01583750 0.0007021654

Under Windows you can look at parallel::parLapply or the foreach package.

You mentioned you want to save some output for Stata. I'm not sure what exactly, but you can use readstata13::save.dta13.


Data:

> dat <- readstata13::read.dta13('http://www.stata-press.com/data/r9/nlswork.dta')