R Mclust(data, G = 1) giving weird Sigma outputs if one variable is 'too constant'?


I'm trying to estimate the mean and covariance of my data, assuming a single normal distribution, using Mclust(data, G = 1). It works fine most of the time, but if one of the variables is a repeated constant (e.g. all 0s, all 5s, etc.), it affects the covariance in a way I don't understand.

For example, in the code below, introducing column D changes Sigma so that all the diagonal entries are equal and all the off-diagonal entries are zero. If I change one value in D, Sigma goes back to more expected values. Depending on the number of rows, sometimes more than one value has to be changed.

Is there a reason or explanation for this? I'm trying to understand it better so I can predict how to avoid it in cases where one of my variables happens to be 'too constant'. If it's predictable, I could maybe use some logic to manually remove the variable, analyse it as univariate, plop it back in, etc.

Test Demo:

library(mclust)    
testing <- data.frame(A = runif(100, -5.0, 10.0), 
                      B = runif(100, -7.5, 5.0), 
                      C = runif(100, -5.0, 5.0), 
                      D = rep(0,100))
testing$B <- testing$B + testing$A
testing$C <- testing$C - testing$B

Using 3 typical variables:

testing_OP <- Mclust(testing[,1:3], G = 1)
testing_OP$parameters$variance$Sigma
testing_OP$parameters$mean

Outputs:

      A         B         C
A  19.73553  19.58861 -19.75416
B  19.58861  31.11929 -31.57945
C -19.75416 -31.57945  39.59255

   [,1]
A  3.086933
B  2.133667
C -1.980933

Adding the 'too constant' variable:

testing_OP <- Mclust(testing, G = 1)
testing_OP$parameters$variance$Sigma
testing_OP$parameters$mean

Outputs:

         A        B        C        D
A 22.61184  0.00000  0.00000  0.00000
B  0.00000 22.61184  0.00000  0.00000
C  0.00000  0.00000 22.61184  0.00000
D  0.00000  0.00000  0.00000 22.61184

       [,1]
A  3.086933
B  2.133667
C -1.980933
D  0.000000

Changing 'too constant' variable slightly:

testing$D[100] = 1
testing_OP <- Mclust(testing, G = 1)
testing_OP$parameters$variance$Sigma
testing_OP$parameters$mean

Outputs:

             A            B            C           D
A  19.73552599  19.58861034 -19.75416206  0.04663097
B  19.58861034  31.11928871 -31.57945373  0.03878541
C -19.75416206 -31.57945373  39.59255338 -0.06956324
D   0.04663097   0.03878541  -0.06956324  0.00990000

       [,1]
A  3.086933
B  2.133667
C -1.980933
D  0.010000

There are 2 best solutions below

Best answer

You should always check which model BIC selected after fitting. If we run it without the constant last column:

set.seed(111)
testing <- data.frame(A = runif(100, -5.0, 10.0), 
                      B = runif(100, -7.5, 5.0), 
                      C = runif(100, -5.0, 5.0), 
                      D = rep(0,100))
testing$B <- testing$B + testing$A
testing$C <- testing$C - testing$B

testing_OP <- Mclust(testing[,1:3], G = 1)

testing_OP$BIC
Bayesian Information Criterion (BIC): 
        EII       VII       EEI       VEI       EVI       VVI       EEE
1 -1895.339 -1895.339 -1883.817 -1883.817 -1883.817 -1883.817 -1655.214
        VEE       EVE       VVE       EEV       VEV       EVV       VVV
1 -1655.214 -1655.214 -1655.214 -1655.214 -1655.214 -1655.214 -1655.214

Top 3 models based on the BIC criterion: 
    EEE,1     EEV,1     EVE,1 
-1655.214 -1655.214 -1655.214 

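If one of your variables is constant, its variance is zero and so is its covariance with every other variable. The estimated covariance matrix is then singular, which makes most of the models invalid. Refitting with all four columns:

testing_OP <- Mclust(testing, G = 1)
testing_OP$BIC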
Bayesian Information Criterion (BIC): 
        EII       VII EEI VEI EVI VVI EEE VEE EVE VVE EEV VEV EVV VVV
1 -2410.511 -2410.511  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Top 3 models based on the BIC criterion: 
    EII,1     VII,1           
-2410.511 -2410.511        NA
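A minimal base-R sketch of why the other models drop out, assuming the NAs come from the zero variance of D. The name `cov_ml` is made up for this illustration and the data mirror the example above. The maximum-likelihood covariance matrix has a zero row and column, so its determinant is zero and it cannot be inverted, which the diagonal and full-covariance Gaussian models would need.

```r
set.seed(111)
testing <- data.frame(A = runif(100, -5.0, 10.0),
                      B = runif(100, -7.5, 5.0),
                      C = runif(100, -5.0, 5.0),
                      D = rep(0, 100))

n <- nrow(testing)
# Maximum-likelihood covariance: divide by n instead of n - 1
cov_ml <- cov(testing) * (n - 1) / n

cov_ml["D", "D"]  # variance of the constant column
det(cov_ml)       # determinant of the full 4 x 4 matrix
```

Both values should come out as zero here, while the determinant of the 3 x 3 block for A, B and C is strictly positive.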

You are left with only the EII and VII models. From the help page, these are:

"EII" spherical, equal volume

"VII" spherical, unequal volume

So you can also fit an EII or VII model on your non-constant columns and get a Sigma of the same shape:

Mclust(testing[,1:3], G = 1, modelNames = "EII")$parameters$variance$Sigma

         A        B        C
A 30.52413  0.00000  0.00000
B  0.00000 30.52413  0.00000
C  0.00000  0.00000 30.52413
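The single value on the diagonal can be checked by hand. The sketch below assumes that the spherical (EII) fit with one component and no prior uses one shared variance equal to the mean of the per-variable maximum-likelihood variances. It uses the diagonal of the 3-variable Sigma printed in the question, plus a zero for the constant column D. The names `variances` and `sigma_sq` are illustrative only.

```r
# Per-variable ML variances: diagonal of the question's 3-variable Sigma, and 0 for D
variances <- c(A = 19.73553, B = 31.11929, C = 39.59255, D = 0)

# Spherical model: one shared variance, the average over all variables
sigma_sq <- mean(variances)
sigma_sq

# Implied covariance matrix: shared variance on the diagonal, zeros elsewhere
sigma_sq * diag(length(variances))
```

The result is about 22.61184, the value on the diagonal of the question's 4-column Sigma. The constant column simply pulls the shared variance down.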

If you have a column that is constant, it doesn't make sense to estimate a Gaussian distribution from it, let alone a multivariate Gaussian.

Second answer

Including a constant "variable" means that the covariance matrix will have 0s for any pair of variables including the constant. In practice, it doesn't make sense to include a constant in a mixture model (e.g., a model created with Mclust).
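One possible way to act on this is to drop constant columns before fitting. This is only a sketch: `drop_constant_cols` and its `tol` threshold are hypothetical, not part of mclust, and it assumes all columns are numeric with no missing values. The question notes that a single changed value is not always enough, so a suitable threshold would depend on the data.

```r
# Hypothetical helper: keep only columns whose sample variance exceeds a tolerance
drop_constant_cols <- function(df, tol = 1e-12) {
  keep <- vapply(df, function(x) var(x) > tol, logical(1))
  df[, keep, drop = FALSE]
}

toy <- data.frame(A = c(1, 2, 3, 4),
                  B = c(2, 1, 4, 3),
                  D = rep(0, 4))

reduced <- drop_constant_cols(toy)
names(reduced)  # column D is removed
```

The reduced data frame can then be passed to Mclust(reduced, G = 1), and the constant column handled separately.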