Imputing missing data for MLM in R


Maybe someone can help me with this question. I conducted a follow-up study and now have to deal with missing data. I am considering how best to impute the missing data for a multilevel model (MLM) in R (e.g., some participants completed the follow-up 2 survey but not the follow-up 1 survey, so I am missing level-1 predictors for my longitudinal analysis).

I read about Multiple Imputation of multilevel data using the pan package (Schafer & Yucel, 2002) and came across the following code:

library(mitml)  # panImpute() is provided by the mitml package, which wraps pan
imp <- panImpute(data, formula = fml, n.burn = 1000, n.iter = 100, m = 5)

Yet I have trouble understanding it completely. Is there maybe another way to impute missing data in R? Or could somebody illustrate the imputation process in a bit more detail? That would be great! Do I have to conduct the imputation for every model I build in my MLM? (e.g., when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)

Thank you in advance


Is there maybe another way to impute missing data in R?

There are other packages. mice is the one I normally use, and it supports multilevel data through dedicated imputation methods (e.g., 2l.pan and 2l.lmer).
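For illustration, here is a minimal sketch of a multilevel imputation with mice, assuming a long-format data frame dat with a cluster identifier id, an incomplete level-1 predictor x, and a complete outcome y (these names are placeholders, not from your study):

    library(mice)

    pred <- make.predictorMatrix(dat)
    pred["x", ] <- 0
    pred["x", "id"] <- -2  # -2 flags the cluster variable for the 2l.* methods
    pred["x", "y"] <- 1    # use y as a fixed-effect predictor of x

    meth <- make.method(dat)
    meth["x"] <- "2l.pan"  # multilevel imputation based on the pan model

    imp <- mice(dat, method = meth, predictorMatrix = pred, m = 5, seed = 1)

The key difference from single-level imputation is the predictor matrix: flagging the cluster variable with -2 makes the imputation model respect the nesting.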

Do I have to conduct the imputation for every model I build in my MLM? (e.g., when I compare whether a random-intercept model or a random-intercept-and-random-slope model fits my data better, do I have to run the imputation code for every model, or do I run it once at the beginning of all my calculations?)

You have to specify the imputation model. Basically, that means telling the software which variables are predicted by which other variables (see the formula sketch after the list below). Since you are comparing models with the same fixed effects and are only changing the random effects (in particular, comparing models with and without random slopes), the imputation model should be the same in both cases. So the workflow is:

  1. perform the imputations;
  2. run the model on all the imputed datasets;
  3. pool the results (typically using Rubin's rules).
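
For panImpute(), the imputation model is written as a formula: the incomplete variables go on the left-hand side, the complete predictors on the right-hand side, and the cluster variable after the | in the random part. A hypothetical example, where x1 and x2 are incomplete level-1 variables, y is a complete outcome, and id is the cluster identifier (all placeholder names):

    fml <- x1 + x2 ~ y + (1 | id)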

You will need to do steps 2 and 3 twice, to end up with two sets of pooled results, one for each model. The software provides functions for all of this, for example with() and pool() in mice, or with() and testEstimates() in mitml.
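Putting it together, here is a sketch of that workflow with mitml and lme4, reusing the placeholder names and the fml from above (y, x1, and id are assumptions about your data, not your actual variables):

    library(mitml)
    library(lme4)

    imp <- panImpute(dat, formula = fml, n.burn = 1000, n.iter = 100, m = 5)
    implist <- mitmlComplete(imp, "all")  # a list of m completed datasets

    # fit each candidate model to every imputed dataset ...
    fit.ri <- with(implist, lmer(y ~ x1 + (1 | id)))       # random intercept only
    fit.rs <- with(implist, lmer(y ~ x1 + (1 + x1 | id)))  # intercept + slope

    # ... and pool each set of results using Rubin's rules
    testEstimates(fit.ri, extra.pars = TRUE)
    testEstimates(fit.rs, extra.pars = TRUE)

Note that the imputation itself runs only once; only the fitting and pooling are repeated per model. mitml also provides testModels() if you want a pooled comparison of the two models.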

Having said all of that, I would advise against choosing your model based on fit statistics; use expert knowledge instead. If you have strong theoretical reasons to expect slopes to vary by group, then include random slopes. If not, then don't include them.