Principal Component Analysis (collinear predictors) and predict function in R

I have a dataset with 3 collinear predictors. I extract these predictors and run a principal component analysis (PCA) on them to reduce multicollinearity. I then want to use these predictors in further modelling.

  1. Is it incorrect to use the predict function to get the values for the 3 collinear predictors and use the predicted values for further analysis?
  2. Or, since the first two axes capture the majority of the variance (70% in the demo dataset and 96% in the actual dataset), should I use only the values from the first two axes instead of the 3 predicted values for further analysis?

#Creating sample dataset
df<- data.frame(ani_id = as.factor(1:10), var1 = rnorm(500), var2=rnorm(500),var3=rnorm(500))

### Principal Component Analysis
# prcomp's default method has no 'data' argument, so the data frame is passed directly
myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)
summary(myPCA1)

This was the result from the demo dataset:

> summary(myPCA1)
Importance of components:
                          PC1    PC2    PC3
Standard deviation     1.0355 1.0030 0.9601
Proportion of Variance 0.3574 0.3353 0.3073
Cumulative Proportion  0.3574 0.6927 1.0000

This shows that the first two axes capture almost 70% of the variance.

Now is it correct to do the following?

## Using predict function to predict the values of the 3 collinear predictors
axes1 <- predict(myPCA1, newdata = df)
head(axes1)

subset1 <- cbind(df, axes1)
names(subset1)

### Removing the actual 3 collinear predictors to get a dataset with the ID and 3 predictors that are no longer collinear
subset1<- subset1[,-c(2:4)]

summary(subset1)

## Merge this to the actual dataset to use for further analysis in linear mixed effect models

Thanks for helping! :)

PS: I did read https://stats.stackexchange.com/questions/72839/how-to-use-r-prcomp-results-for-prediction/72847#72847 but was still unsure, which is why I am asking here.

There is 1 best solution below

Is it incorrect to use the predict function and get the values for the 3 collinear predictors and use the predicted values for further analysis?

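It is fine to use them, but note that predict() here does not return predicted values of the 3 original predictors. It returns the principal component scores, and for the data the PCA was fitted on, those values are the same as myPCA1$x.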
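
A quick way to check this, as a minimal sketch: it rebuilds the demo data from the question and compares the output of predict() with the scores stored in the fitted object. The set.seed() call and the names scores_predict and max_diff are assumptions added here for illustration.

```r
# Rebuild the demo data from the question; set.seed() is an added assumption
set.seed(42)
df <- data.frame(ani_id = as.factor(1:10),
                 var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))

# Fit the PCA on the three numeric predictors only
myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)

# predict() centres and scales newdata, then rotates it onto the PC axes
scores_predict <- predict(myPCA1, newdata = df)

# Largest absolute difference from the scores stored in the fit
max_diff <- max(abs(unname(scores_predict) - unname(myPCA1$x)))
print(max_diff)  # expected to be zero up to floating-point error
```

predict() becomes useful when newdata holds observations that were not used to fit the PCA; for the original data, myPCA1$x already contains the scores.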

Or, since the first two axes capture the majority of the variance (70% in the demo dataset and 96% in the actual dataset), should I use only the values from the first two axes instead of the 3 predicted values for further analysis?

I personally use only the first axis, but only when it explains at least 70% of the variance. However, I don't see any issue with using several axes, since the second axis is orthogonal to the first. My one caution is that you need to understand what each PCA axis represents in terms of your predictor variables (e.g., does predictor 1 increase or decrease along PC1 versus PC2?). Including a third axis increases the number of predictors in the model, so you have to ask whether the remaining variation (about 30% in the demo dataset) is worth including given the risk of overfitting.
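
To make these points concrete, here is a hedged sketch on the demo data from the question. It inspects the loadings to see how each predictor moves along each axis, recomputes the proportion of variance, confirms that the scores are uncorrelated, and keeps only the first two axes. The seed and the names prop_var, score_cor and pc_data are illustrative assumptions, not part of the original code.

```r
# Rebuild the demo data and PCA from the question; set.seed() is an added assumption
set.seed(42)
df <- data.frame(ani_id = as.factor(1:10),
                 var1 = rnorm(500), var2 = rnorm(500), var3 = rnorm(500))
myPCA1 <- prcomp(df[, -1], center = TRUE, scale. = TRUE)

# Loadings: sign and size show how each predictor moves along each axis
print(myPCA1$rotation)

# Proportion of variance per axis, computed from the standard deviations
prop_var <- myPCA1$sdev^2 / sum(myPCA1$sdev^2)
print(cumsum(prop_var))

# Scores on different axes are uncorrelated, so the collinearity is removed
score_cor <- cor(myPCA1$x)
print(round(score_cor, 10))

# Keep the ID plus the first two axes only (pc_data is a hypothetical name)
pc_data <- cbind(df["ani_id"], myPCA1$x[, 1:2])
head(pc_data)
```

pc_data could then replace the original predictors in the mixed model. Keeping two axes here simply mirrors the question; how many to keep depends on the variance explained in the actual dataset.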

Also, I am not sure whether this is a question for Stack Overflow or Cross Validated.