How to use svydesign data object in R for NESARC-III data OR apply survey weights in SPSS?

182 Views Asked by At

I am using a large national dataset for the first time, specifically a subset of the NESARC-III (N=26960). I have the data in SPSS because I used that software to clean it (creating variables, defining missings, etc.)simply because I am not as proficient in R, but then loaded the cleaned dataset into R for all of my SEM analyses. I have all my SEM analyses completed, but realized after the fact no weighting variables are applied. I am looking for an easy way to apply these either in R or SPSS. Unfortunately, the NESARC-III supplies code for SAS, STATA, and SUDAAN. Below is the documentation for the sampling design variables that need to be applied and the code for the other programs in case that is helpful.

AUDWEIGHT (AUDADIS full‐sample weight) VARSTRAT (Stratum) VARUNIT (PSU)

SUDAAN CODE:

PROC SORT DATA=dsname; BY VARSTRAT VARUNIT; RUN; PROC procname DESIGN=WR DATA=dsname; NEST varstrat varunit / MISSUNIT; WEIGHT audweight;

SAS CODE:

PROC procname DATA=dsname VARMETHOD=TAYLOR; WEIGHT audweight; STRATA varstrat; CLUSTER varunit;

STATA CODE:

svyset varunit [pweight=audweight], strata(varstrat) vce(linear)

After some online research I figured I needed to use the R 'survey' package and use svydesign() to create a survey design object and then I could use that object in all of my already written code. This is what I wrote:

#cleaned dataset from SPSS
d <- read_sav("NESARC-III subset Black and White 10.28.sav") #29,960 obs of 4201 variables

#weighting data:
library(survey)
dweight <- svydesign(id      = ~varunit,
                          strata  = ~varstrat,
                          weights = ~audweight,
                          nest    = TRUE,
                          data    = d)

However, my next code is to subset the data for only the variables I need for the SEM analyses and I get an error that indicates it is not reading the new survey data object:

d_sem <- subset(dweight, select = c("n2eq10a1","n2eq10a2", "n2eq10a3", "n2eq10a4", "n2eq10a5", "n2eq10a6",  "n2eq9AR", "n2eq9BR", "n2eq9DR", "n2eq9FR", "black", "nnbs5", "nnbs6", "nnbs7", "nnbs8", "PYany", "female", "marcohab", "lgbtq", "incomeF", "unemployed", "nage", "varunit", "varstrat", "audweight"))

Error in eval(e, x$variables, parent.frame()) : argument "e" is missing, with no default

Can I not use a survey design object (dweight) for any commands outside of the survey package? All examples I see online use functions like svymean or svyglm which are within the survey package. Or should I be able to use the survey design object (dweight) with any code and I just specified it wrong?

Alternatively, is there a way to apply those 3 variables (audweight, varstrat, varunit) in SPSS? Because then I can just save the weighted dataset and load that into R as d (instead of the raw data I have now). I know weighting the data is simple by going to Data -> Weight Cases -> Weight cases by: audweight, but how do I apply the PSU and Stratum variables?

Thanks in advance! I have never used complex survey design variables before so I apologize if I struggle to answer any follow-up questions.

Example data below:

library(stats)

#function to create min and max so varstrat(stratum) and varunit(PSU) have similar parameters to my same data
rlimpois <- function(n, lambda, lowlimit, toplimit){
  sample(x=lowlimit:toplimit, size=n, 
         prob=dpois(lowlimit:toplimit, lambda), replace=TRUE)
}

#I've never created random data before so not sure if this seed will apply to all variables below
set.seed(9)
audweight <- runif(n= 26960, min=585.66, max=49402.81)
varstrat <- rlimpois(26960, 63.41, 1, 100)
varunit <- rlimpois(26960, 1.62, 1, 3)
dummydat <- data.frame(audweight, varstrat, varunit)

#again, similar parameters to actual data, but random numbers generated
dummydat$d1 <- rnorm(n=26960, mean= 1.12, sd= 0.5)
dummydat$d2 <- rnorm(n=26960, mean= 1.23, sd= 0.49)
dummydat$d3 <- rnorm(n=26960, mean= 1.09, sd= 0.48)
dummydat$d4 <- rnorm(n=26960, mean= 1.47, sd= 0.51)
dummydat$d5 <- rnorm(n=26960, mean= 1.02, sd= 0.52)
dummydat$d6 <- rnorm(n=26960, mean= 1.19, sd= 0.54)
dummydat$r1 <- rnorm(n=26960, mean= 5.19, sd= 1.12)
dummydat$r2 <- rnorm(n=26960, mean= 4.79, sd= 1.05)
dummydat$r3 <- rnorm(n=26960, mean= 5.03, sd= 1.16)
dummydat$r4 <- rnorm(n=26960, mean= 4.98, sd= 1.11)
dummydat$m1 <- rnorm(n=26960, mean= 51.4, sd= 10.58)
dummydat$m2 <- rnorm(n=26960, mean= 50, sd= 10)
dummydat$m3 <- rnorm(n=26960, mean= 50.8, sd= 10.3)
dummydat$m4 <- rnorm(n=26960, mean= 49.2, sd= 10.5)
dummydat$black <- rep(letters[1:2], times = c(4, 2), length.out=26960)
dummydat$female <- rep(letters[6:7], length.out=26960)
2

There are 2 best solutions below

2
On

I believe this subsets your data as you want. Let me know!

library(stats)

#function to create min and max so varstrat(stratum) and varunit(PSU) have similar parameters to my same data
rlimpois <- function(n, lambda, lowlimit, toplimit){
  sample(x=lowlimit:toplimit, size=n, 
         prob=dpois(lowlimit:toplimit, lambda), replace=TRUE)
}

#I've never created random data before so not sure if this seed will apply to all variables below
set.seed(9)
audweight <- runif(n= 26960, min=585.66, max=49402.81)
varstrat <- rlimpois(26960, 63.41, 1, 100)
varunit <- rlimpois(26960, 1.62, 1, 3)
dummydat <- data.frame(audweight, varstrat, varunit)

#again, similar parameters to actual data, but random numbers generated
dummydat$d1 <- rnorm(n=26960, mean= 1.12, sd= 0.5)
dummydat$d2 <- rnorm(n=26960, mean= 1.23, sd= 0.49)
dummydat$d3 <- rnorm(n=26960, mean= 1.09, sd= 0.48)
dummydat$d4 <- rnorm(n=26960, mean= 1.47, sd= 0.51)
dummydat$d5 <- rnorm(n=26960, mean= 1.02, sd= 0.52)
dummydat$d6 <- rnorm(n=26960, mean= 1.19, sd= 0.54)
dummydat$r1 <- rnorm(n=26960, mean= 5.19, sd= 1.12)
dummydat$r2 <- rnorm(n=26960, mean= 4.79, sd= 1.05)
dummydat$r3 <- rnorm(n=26960, mean= 5.03, sd= 1.16)
dummydat$r4 <- rnorm(n=26960, mean= 4.98, sd= 1.11)
dummydat$m1 <- rnorm(n=26960, mean= 51.4, sd= 10.58)
dummydat$m2 <- rnorm(n=26960, mean= 50, sd= 10)
dummydat$m3 <- rnorm(n=26960, mean= 50.8, sd= 10.3)
dummydat$m4 <- rnorm(n=26960, mean= 49.2, sd= 10.5)
dummydat$black <- rep(letters[1:2], times = c(4, 2), length.out=26960)
dummydat$female <- rep(letters[6:7], length.out=26960)


#cleaned dataset from SPSS
d <- dummydat

#weighting data:
library(survey)
dweight <- svydesign(id      = ~varunit,
                     strata  = ~varstrat,
                     weights = ~audweight,
                     nest    = TRUE,
                     data    = d)


d_sem <- dweight[,  c( "audweight", "varstrat","varunit","d1","d2","d3","d4","d5","d6","r1","r2" )] 

> names(d_sem)
[1] "cluster"    "strata"     "has.strata" "prob"       "allprob"    "call"       "variables"  "fpc"        "pps"  

> describe(d_sem$variables$d1)
d_sem$variables$d1 
       n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95 
   26960        0    26960        1    1.123   0.5602   0.3045   0.4930   0.7907   1.1221   1.4550   1.7631   1.9459 

lowest : -0.8532596 -0.6479487 -0.6091726 -0.6073364 -0.5948703, highest:  2.9098212  2.9973350  3.0361524  3.2779159  3.2845548
0
On

I found a solution by way of the lavaan.survey package I just discovered. Rather than trying to use the survey design object as a data frame in all of my SEM models. You first run your unweighted model(s), then you use lavaan.survey as I did below which adjusts the unweighted results based on the survey design criteria. The new weighted model fit provides all the same information, but adjusted. So I will just need to do this after running every unweighted model.

survey.fit1 <- lavaan.survey(fit1, #unweighted sem model
                        dweight, #survey design object
                        estimator="DWLS") #estimator for ordered outcome
summary(survey.fit1)