lda from MASS-package with more predictors than observations


I am fairly new to multivariate statistics and cannot find the answer in the R help pages or in the source of the MASS package, so maybe you can help me.

My data has many predictors (450) and only a few observations (~200). I read that it is not possible to compute an LDA in this situation because the covariance matrix has to be inverted. But I tried it before knowing this, and it works and gives fairly good results. How can that be explained? Does lda select the variables with the highest separation impact beforehand?

I'm using the caret package to add 5-fold cross-validation, after first splitting the data into train (0.8) and test (0.2) sets.

library(caret)

# 5-fold cross-validation on the training split
Validierung <- trainControl(method = "cv", number = 5)
ldaFit1 <- train(Species ~ ., data = train,
                 method = "lda",
                 trControl = Validierung,
                 metric = "Accuracy")

There is 1 answer below


LDA has an internal mechanism to reduce the number of features into a few important latent variables:

Like PCA, LDA uses linear combinations of the predictors to create new axes which are used for the final classification. Unlike PCA, it tries to maximize the differences between the groups whereas PCA does not care about the labels and maximizes the total variance instead.
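To make the contrast concrete, here is a minimal sketch on the built-in iris data (my choice of example data, not the asker's). PCA via prcomp() never sees the labels and returns as many axes as there are predictors, whereas LDA uses the labels and returns at most (number of groups - 1) axes.

```r
library(MASS)

x <- iris[, 1:4]

# PCA: unsupervised, the Species labels are not used
pca <- prcomp(x)

# LDA: supervised, the Species labels define the groups to separate
fit <- lda(x, grouping = iris$Species)

n_pca_axes <- ncol(pca$x)               # one axis per predictor
n_lda_axes <- ncol(predict(fit, x)$x)   # at most (number of groups - 1)

n_pca_axes
n_lda_axes
```

With three species, LDA cannot produce more than two discriminant axes, no matter how many predictors go in.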

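Furthermore, the tol option of MASS::lda is a tolerance for deciding whether a matrix is singular: according to the documentation, variables and linear combinations of unit-variance variables whose variance is less than tol^2 are rejected. With more predictors than observations, many such linear combinations have essentially zero variance, so they are dropped instead of breaking the computation.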
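The situation from the question can be reproduced with a small sketch on simulated data (the sizes, the seed and the group shift are arbitrary assumptions of mine). As far as I can tell from the MASS source, lda() works through a singular value decomposition and keeps only the directions whose singular value exceeds the tolerance, so the fit goes through and I would expect a "variables are collinear" warning rather than an error.

```r
library(MASS)

set.seed(1)
n <- 40   # observations
p <- 100  # predictors, deliberately more than observations

# Simulated data: pure noise, except that the first five predictors
# are shifted in group "b"
grp <- factor(rep(c("a", "b"), each = n / 2))
x <- matrix(rnorm(n * p), nrow = n)
x[grp == "b", 1:5] <- x[grp == "b", 1:5] + 2

# Collect warnings instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  lda(x, grouping = grp),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

msgs              # warnings raised by lda()
dim(fit$scaling)  # one row per predictor, one column per discriminant axis
```

Note that with more predictors than observations the training data can usually be separated almost perfectly, so accuracy on the training data is optimistic. The cross-validation and the held-out test set from the question are the numbers to trust.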

The features are weighted by multiplying the raw data by the scaling coefficient matrix, which gives the data in the LDA-transformed space. In the iris example below, Petal.Width gets the largest weight on the first axis (highest absolute value of LD1 in the scaling matrix), followed by Petal.Length, while Sepal.Length gets the smallest. The second LDA axis hardly matters at all (proportion of trace 0.0088):

library(MASS)

model <- lda(Species ~ ., iris)
model
#> Call:
#> lda(Species ~ ., data = iris)
#> 
#> Prior probabilities of groups:
#>     setosa versicolor  virginica 
#>  0.3333333  0.3333333  0.3333333 
#> 
#> Group means:
#>            Sepal.Length Sepal.Width Petal.Length Petal.Width
#> setosa            5.006       3.428        1.462       0.246
#> versicolor        5.936       2.770        4.260       1.326
#> virginica         6.588       2.974        5.552       2.026
#> 
#> Coefficients of linear discriminants:
#>                     LD1         LD2
#> Sepal.Length  0.8293776  0.02410215
#> Sepal.Width   1.5344731  2.16452123
#> Petal.Length -2.2012117 -0.93192121
#> Petal.Width  -2.8104603  2.83918785
#> 
#> Proportion of trace:
#>    LD1    LD2 
#> 0.9912 0.0088
model$scaling
#>                     LD1         LD2
#> Sepal.Length  0.8293776  0.02410215
#> Sepal.Width   1.5344731  2.16452123
#> Petal.Length -2.2012117 -0.93192121
#> Petal.Width  -2.8104603  2.83918785

Created on 2021-10-04 by the reprex package (v2.0.1)
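The multiplication by the scaling matrix can be checked by hand. The following is a sketch that assumes, based on my reading of predict.lda, that the data are first centred on the prior-weighted mean of the group means before being multiplied by model$scaling. The variable names X, centre, manual_scores and predicted_scores are mine.

```r
library(MASS)

model <- lda(Species ~ ., iris)

# Raw predictors as a numeric matrix, same column order as model$means
X <- as.matrix(iris[, 1:4])

# Prior-weighted mean of the group means (assumed centring used by predict.lda)
centre <- colSums(model$prior * model$means)

# Centre the raw data, then weight the features with the scaling matrix
manual_scores <- scale(X, center = centre, scale = FALSE) %*% model$scaling

# Scores as computed by MASS itself
predicted_scores <- predict(model, iris)$x

all.equal(unname(manual_scores), unname(predicted_scores))
```

If the two sets of scores agree, each observation's position on LD1 and LD2 is just a weighted sum of its centred feature values, with the weights taken from the scaling matrix.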