I have a dataset with 3 collinear predictors. I extract these predictors and run a principal component analysis (PCA) on them to reduce multicollinearity. I then want to use the result for further modelling.
- Is it incorrect to use the predict function to get the values for the 3 collinear predictors and use the predicted values for further analysis?
- Or, since the first two axes capture the majority of the variance (70% in the demo dataset and 96% in the actual dataset), should I use only the values from the first two axes instead of the 3 predicted values for further analysis?
#Creating sample dataset
df <- data.frame(ani_id = as.factor(1:10), var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))
### Principal Component Analysis
# prcomp() on a data frame takes no `data` argument, so only the numeric columns are passed
myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)
summary(myPCA1)
This was my result from the demo dataset when I ran
> summary(myPCA1)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0355 1.0030 0.9601
Proportion of Variance 0.3574 0.3353 0.3073
Cumulative Proportion  0.3574 0.6927 1.0000
This shows that the first two axes capture almost 70% of the variance.
Now, is it correct to do the following?
## Using predict function to predict the values of the 3 collinear predictors
axes1 <- predict(myPCA1, newdata = df)
head(axes1)
subset1 <- cbind(df, axes1)
names(subset1)
### Removing the original 3 collinear predictors, leaving a dataset with the ID and 3 predictors that are no longer collinear
subset1 <- subset1[, -c(2:4)]
summary(subset1)
## Merge this with the actual dataset for further analysis in linear mixed-effects models
Thanks for helping! :)
PS: I did read https://stats.stackexchange.com/questions/72839/how-to-use-r-prcomp-results-for-prediction/72847#72847
but was still unsure, which is why I am asking here.
Is it incorrect to use the predict function and get the values for the 3 collinear predictors and use the predicted values for further analysis?
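You can, but note what predict returns here: predict(myPCA1, newdata = df) gives the principal component scores, not predicted values of the 3 original predictors. For the data the PCA was fitted on, the values are the same as myPCA1$x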
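As a quick check of that equivalence, here is a minimal sketch on simulated data. The seed and the names `scores_predict`, `scores_x` and `same_scores` are my own choices for illustration, and it assumes `newdata` contains the same rows the PCA was fitted on.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Same shape as the demo dataset in the question
df <- data.frame(ani_id = as.factor(1:10),
                 var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))

myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)

# predict() centres and scales newdata with the stored parameters and
# projects it onto the rotation matrix, which yields PC scores.
# Columns are matched by name, so the extra ani_id column is ignored.
scores_predict <- predict(myPCA1, newdata = df)
scores_x <- myPCA1$x

# Compare the two score matrices, ignoring attributes such as dimnames
same_scores <- isTRUE(all.equal(scores_predict, scores_x,
                                check.attributes = FALSE))
same_scores
```

predict() only becomes necessary when you want scores for rows that were not used to fit the PCA.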
Or since the first two axes capture the majority of variance (70% in the demo dataset and 96% in the actual dataset) Should I use only the values from the first two axes instead of the 3 predicted values for further analysis?
I personally use only the first axis (but only when it explains at least 70% of the variance). However, I don't see any issue with using several; the second axis is orthogonal to the first. My caution would be that you have to understand what each PCA axis represents in terms of your predictor variables (e.g., does predictor 1 increase or decrease along PC1 vs. PC2?). Including a third axis increases the number of predictors in the model, and you have to ask whether the extra 30 percent of the variation is worth including given the risk of overfitting.
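To make those checks concrete, here is a rough sketch on simulated data. The collinear predictors, the seed, the 0.90 threshold and the names `loadings`, `cum_var`, `n_keep`, `reduced` and `score_cor` are assumptions for illustration only, not values from the question.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 500

# Hypothetical collinear predictors: all three share a common signal
base <- rnorm(n)
df <- data.frame(ani_id = as.factor(1:10),
                 var1 = base + rnorm(n, sd = 0.5),
                 var2 = base + rnorm(n, sd = 0.5),
                 var3 = base + rnorm(n, sd = 0.5))

myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)

# Loadings: how each original predictor contributes to each axis
# (the sign shows whether it increases or decreases along that axis)
loadings <- myPCA1$rotation

# Cumulative proportion of variance, as reported by summary(myPCA1)
cum_var <- cumsum(myPCA1$sdev^2) / sum(myPCA1$sdev^2)

# Keep the smallest number of axes reaching the (assumed) 90% threshold
n_keep <- which(cum_var >= 0.90)[1]

# ID column plus the retained PC scores, ready for further modelling
reduced <- cbind(df["ani_id"], myPCA1$x[, seq_len(n_keep), drop = FALSE])

# The scores are uncorrelated with each other: off-diagonals are ~0
score_cor <- cor(myPCA1$x)

loadings
cum_var
head(reduced)
round(score_cor, 10)
```

The threshold is a judgment call; the point is only that the number of axes can be chosen from `cum_var` and interpreted through `loadings` before the scores go into the model.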
Also, I'm not sure whether this is a question for Stack Overflow or Cross Validated.