I have an outcome variable that represents the average of several Likert scale scores (0-7), and has been calculated to 1 decimal place. I did consider a linear regression, but it is really not normally distributed. I think an ordinal regression will be best. However, because the variable has been calculated to 1d.p., I cannot transform it to a factor (as required for ordinal regression) prior to imputation with mice. I am hoping to run the imputation, then transform the outcome variable to a factor, and then run and pool ordinal regression models.

I have tried to run the following code for this step, which seemed to work for the first problem:

library(mice)

imp_data <- mice(my_data, seed = 4764638)

imputed_datasets <- lapply(1:5, function(i) complete(imp_data, action = i))

for (i in 1:5) {
  imputed_datasets[[i]]$likert_var <- as.factor(imputed_datasets[[i]]$likert_var)
}

However, I am having trouble running and pooling the ordinal logistic regression models. I can get a single model to work:

library(ordinal)

my_model <- clm(likert_var ~ time + v1 + v2, data = imputed_datasets[[1]])
summary(my_model)

When I try to run them and pool them (using miceadds), I run into problems:

results_list <- list()

for (i in 1:5) {
  my_model <- clm(likert_var ~ time + v1 + v2, data = imputed_datasets[[i]])
  results_list[[i]] <- my_model
}

pooled_results <- pool_mi(results_list)

I think the issue is that I am passing the list of imputed datasets rather than the mids object produced by mice. Is there a way to solve this? Thank you. I am quite new to coding, and I hope I have provided enough information here.


Use imputation cautiously: it can introduce bias into the subsequent modelling if the missing values do not actually follow the imputed pattern, although it can also correct bias when its assumptions are met. You should check whether there are substantial distributional differences between the observed data and the imputed completed data, to see whether imputation makes any practical difference. If values are missing completely at random, dropping them does not harm the consistency of model estimates and the missingness can be ignored. Even if you decide imputation is necessary, you should report complete-case models (without imputation) alongside as a reference.
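One way to run that distributional check, reusing the imp_data object from the question, is mice's built-in observed-versus-imputed plots (a hedged sketch; likert_var is assumed to be one of the variables with missingness):

```r
# Hedged sketch: mice's lattice plots overlay the distribution of imputed
# values (red) on the observed values (blue), so large discrepancies are
# easy to spot.
library(mice)
densityplot(imp_data, ~ likert_var)               # observed vs. imputed densities
stripplot(imp_data, likert_var ~ .imp, pch = 20)  # values per imputation
```

Similar-looking distributions do not prove the imputation model is right, but a large mismatch is a warning sign worth investigating.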

Coercing fractional data into factors, as in imputed_datasets[[i]]$likert_var <- as.factor(imputed_datasets[[i]]$likert_var), is dangerous. You should not do it before imputation either, even if the package allowed factor data. What this line does is assign a separate factor level to every unique numeric value. For example, if the responses to the individual questions are labelled 1, 2, 3, 4, 5, 6, 7, but the means across questions take 10 unique values such as 3, 3.25, 4, 4.5, 4.75, 4.8, 5.1, 5.25, 6.1, 6.3, then as.factor() turns the mean into a factor with 10 levels. This can substantially inflate the number of response levels in the ordinal regression and make estimation difficult. If the Likert mean takes 100 unique values and you have only about 100 observations, the model cannot be identified: there are not enough observations to estimate any coefficients beyond the 99 thresholds (intercepts).
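To see this concretely, here is a small base-R illustration using the ten hypothetical mean values above:

```r
# Hypothetical item means: averaging creates values between the original
# Likert points, and as.factor() gives each unique value its own level.
likert_means <- c(3, 3.25, 4, 4.5, 4.75, 4.8, 5.1, 5.25, 6.1, 6.3)
f <- as.factor(likert_means)
nlevels(f)  # 10 levels, even though the items only had 7 response options
```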

You could use the Likert ratings of the individual questions, instead of their "mean", as the response variable in an ordinal regression without imputation. This produces several ordinal regression models, one for each Likert item. In clm(), it is important to examine the linearity of the linear predictor by introducing squares, cubes, logarithms, interactions, and other functions of the predictors; for example, the formula (time + v1 + v2)^2 adds all pairwise interactions so you can see whether any of them needs to be included. You should check for heteroscedasticity in the latent error term with scale_test() and the scale = ~ ... argument of clm(), and the parallel-lines (proportional odds) assumption with nominal_test() and the nominal = ~ ... argument. After modelling, investigate goodness of fit through the tests in the gofcat package. Alternatively, you could treat the answers to the several questions as repeated measurements of the same quantity clustered within each participant, and fit one single mixed-effects ordinal regression that includes random effects, typically a random intercept varying by participant ID: clmm(likert ~ (time + v1 + v2)^2 + (1 | id), data = ...). Note that clmm() does not accept the scale and nominal arguments, so run those diagnostics on the corresponding fixed-effects clm() fits.
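A hedged sketch of the mixed-model route, assuming hypothetical item columns q1, q2, q3 and a participant identifier id (none of these column names come from the question):

```r
# Sketch under assumptions: q1-q3 are the individual Likert items and id is a
# participant identifier; both are hypothetical names standing in for the
# real columns. Reshape to long format, then fit one clmm() across all items.
library(ordinal)

long_data <- reshape(my_data,
                     varying   = c("q1", "q2", "q3"),  # hypothetical item columns
                     v.names   = "likert",
                     timevar   = "item",
                     idvar     = "id",
                     direction = "long")
long_data$likert <- factor(long_data$likert, ordered = TRUE)

# One model for all items, with a random intercept per participant:
fit <- clmm(likert ~ time + v1 + v2 + (1 | id), data = long_data)
summary(fit)
```

Because the items keep their original 1-7 coding, the factor here has at most 7 levels, avoiding the level-explosion problem described above.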

Alternatively, you could use fractional models applied to the average Likert score as the response. Fractional response estimators fit models to continuous zero-to-one data using probit, logit, heteroskedastic probit, or beta regression. Beta regression can be used only when the endpoints zero and one are excluded, although zero-, one- and zero-and-one-inflated beta regressions exist. See https://www.stata.com/features/overview/fractional-outcome-models/ for an overview, and for an R demonstration, Michael Clark (2019), "Fractional Regression: A quick primer regarding data between zero and one, including zero and one", https://m-clark.github.io/posts/2019-08-20-fractional-regression/. In your case, since the scores run from 0 to 7, you would rescale the response to 0-1 by likert_var / 7. Since your response is neither a probability nor a proportion, interpret the model coefficients carefully: terms such as odds and odds ratios no longer apply.

I am not sure what pool_mi() actually does internally, but comparing (or, equivalently, pooling) coefficients from ordinal regressions (in fact, from any generalized linear model) fitted on different samples is invalid unless you assume the latent error's scale parameter is the same across samples. For different imputed datasets this assumption is plausible, as all the resulting samples are constructed to recover the same population. See Williams, R. (2009). Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociological Methods & Research, 37(4), 531-559. https://doi.org/10.1177/0049124109335735.
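For completeness, if you do decide to pool across imputations, here is a hedged sketch of how miceadds::pool_mi() is typically called: it expects lists of coefficient vectors and their covariance matrices, not the fitted clm objects themselves (results_list is the list built in the question):

```r
# Hedged sketch: pool_mi() takes per-imputation estimates (qhat) and their
# covariance matrices (u), combined with Rubin's rules, rather than the
# fitted model objects.
library(miceadds)

qhat <- lapply(results_list, coef)  # coefficient vectors, one per imputation
u    <- lapply(results_list, vcov) # corresponding covariance matrices

pooled_results <- pool_mi(qhat = qhat, u = u)
summary(pooled_results)
```

Note that for clm models the pooled vector includes the threshold parameters as well as the slopes, so the scale-parameter caveat above still applies.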