Dimension reduction using psych::principal() does not work for smaller data

I am trying to get the PCA components from my training data using the function psych::principal().

> library(psych)
> train <- read.csv("mytraindata.csv", header = TRUE)
> train[is.na(train)] <- 0
> train <- sapply(train, as.numeric)
> fit <- principal(train, nfactors = 6, rotate = "promax", missing = TRUE)

Now I want to reduce the dimension of the test data, so I first load it as follows:

> test <- read.csv("mytestdata.csv", header = TRUE)
> test[is.na(test)] <- 0
> test <- sapply(test, as.numeric)

When I apply the fitted model to the first four rows, I get valid output:

> sm <- test[1:4,]
> predict(fit, sm)
       PC1        PC2        PC3        PC4        PC5       PC6
[1,]  2.208531 -0.5038822 -2.6390489  0.4115814  1.7402972  3.213355
[2,] -4.678453 -0.4528760  0.7745650 -1.2372164 -0.3016823 -2.706421
[3,] -1.864383 -2.6386053  0.6979575 -1.3102945 -1.2105619 -2.833270
[4,]  4.334304  3.5953635  1.1665265  2.1359295 -0.2280531  2.326335

However, when I apply the same call to only three rows, it returns NaN:

> sm <- test[1:3,]
> predict(fit, sm)
     PC1 PC2 PC3 PC4 PC5 PC6
[1,] NaN NaN NaN NaN NaN NaN
[2,] NaN NaN NaN NaN NaN NaN
[3,] NaN NaN NaN NaN NaN NaN

I get similar output if I use the training data instead of the test data.

This worries me, because I expected it to work the same way a machine learning model is used to get predictions. Could anyone help me figure out why this is happening?

There is 1 solution below

I found the solution for this problem.

The predict() method for psych objects, predict.psych(), takes an optional third argument (old.data) that supplies the data used for standardization. It needs some data to standardize the new observations before computing the scores. If the third argument is not provided, it uses the second argument, the data being scored, for standardization. Since the second argument contained only a few rows, the standardization failed and the result was NaN.
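
The effect can be reproduced in base R without psych. The sketch below is an assumption about the mechanism, not a trace of the package's internals: it standardizes a tiny hypothetical matrix (sm_demo) with scale(), where one column happens to be constant across the rows, and then multiplies by a made-up weight matrix (w). A column that is constant in three rows but varies in four would explain why test[1:4,] worked while test[1:3,] did not.

```r
# Hypothetical 3-row slice; the second column is constant across rows
# (for example, all zeros after replacing NA with 0).
sm_demo <- matrix(c(1, 2, 3,
                    0, 0, 0,
                    5, 1, 4), nrow = 3)

# Standardizing the slice by its own mean and sd: the constant column
# has sd 0, so (x - mean) / sd becomes 0 / 0 = NaN.
z <- scale(sm_demo)

# Made-up weights standing in for component weights
# (3 variables -> 2 components).
w <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3), nrow = 3)

# Each score is a weighted sum over all columns, so a single NaN column
# turns every entry of the score matrix into NaN.
scores <- z %*% w
print(scores)
```

Checking apply(sm, 2, sd) on the real slice should show whether any column has zero standard deviation.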

If you pass data for standardization, for example the data used to train the model, it returns the reduced matrix. This is also the better style: the documentation of predict.psych() says that using the test data for standardization may lead to confusion (see page 234 of the CRAN psych documentation PDF for details).

predict(fit, sm, train) # the third argument, i.e. the standardization data, should be passed
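
As a rough illustration of why passing the training data helps, the base R sketch below standardizes a small slice using the training set's column means and standard deviations instead of the slice's own. The names train_demo, new_rows, mu, and sdv are hypothetical, the data are simulated, and this is only assumed to resemble what the third argument does, not the package's actual code.

```r
set.seed(1)

# Hypothetical training data: 10 rows, 4 variables.
train_demo <- matrix(rnorm(40), nrow = 10)

# A small slice to score; column 2 is made constant within the slice,
# which would produce NaN if the slice were standardized by itself.
new_rows <- train_demo[1:3, , drop = FALSE]
new_rows[, 2] <- 0

# Column means and standard deviations taken from the training data.
mu  <- colMeans(train_demo)
sdv <- apply(train_demo, 2, sd)

# Standardize the new rows with the training statistics; the training
# sds are non-zero, so no division by zero occurs.
z_new <- scale(new_rows, center = mu, scale = sdv)
print(z_new)
```

Because the statistics come from the training set, the result does not depend on how many rows are scored at once, so even a single row can be standardized this way.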