Getting AIC of lognormal distributions using scikit's GMM


I have the following data generated using:

from scipy.stats import skewnorm
import matplotlib.pyplot as plt
import numpy as np

numValues = 10000
maxValue = 100
skewness = -5
# negative skew (a < 0) around loc=maxValue; scale defaults to 1
data = skewnorm.rvs(a=skewness, loc=maxValue, size=numValues)

[histogram of the generated data]
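(The histogram above was plotted with something like this, using the already-imported matplotlib:)

plt.hist(data, bins=50)
plt.xlabel("value")
plt.ylabel("count")
plt.show()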

For the Gaussian model, I calculated the AIC and BIC as follows:

from sklearn.mixture import GaussianMixture

# GaussianMixture expects a 2D array, hence the expand_dims
gmm = GaussianMixture(n_components=1).fit(X=np.expand_dims(data, 1))
gmm_aic = gmm.aic(X=np.expand_dims(data, 1))
gmm_bic = gmm.bic(X=np.expand_dims(data, 1))

For the lognormal model, I took the logarithm of the data and fitted a Gaussian to it:

log_gmm = GaussianMixture(n_components=1).fit(X=np.expand_dims(np.log(data), 1))

and then calculated the AIC and BIC for this model on data, not on np.log(data):

log_gmm_aic = log_gmm.aic(X=np.expand_dims(data, 1))
log_gmm_bic = log_gmm.bic(X=np.expand_dims(data, 1))

With this approach, the AIC and BIC values for the two models are:

Gaussian - AIC: 18790.27, BIC: 18804.69

lognormal - AIC: 2231500479614.06, BIC: 2231500479628.48

The values seem incomparable. What am I doing wrong if I want to calculate the lognormal model's AIC and BIC?

1 Answer
The short answer is that you should not compare the AIC of models whose response variables are on different scales. See posts like this or this.

So, for example, you can compare:

# one-component Gaussian fit to the raw data
gmm = GaussianMixture(n_components=1).fit(X=np.expand_dims(data, 1))
print(gmm.aic(X=np.expand_dims(data, 1)))

# two-component fit to the same data, on the same scale
gmm2 = GaussianMixture(n_components=2).fit(X=np.expand_dims(data, 1))
print(gmm2.aic(X=np.expand_dims(data, 1)))

18753.718210403502
17498.87614003799

You can see that the second model has a better (lower) AIC than the first.
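The same like-for-like comparison works for BIC, using the bic method of the same fitted models:

print(gmm.bic(X=np.expand_dims(data, 1)))
print(gmm2.bic(X=np.expand_dims(data, 1)))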

A bit more detail on why you are getting such extreme values. For the non-log model, the estimated parameters are:

gmm.means_
array([[99.2174906]])

gmm.covariances_
array([[[0.38499364]]])

For your log-normal model, the input data is on the log scale, and you can see that the estimated parameters are on the log scale too:

log_gmm.means_
array([[4.59729469]])

log_gmm.covariances_
array([[[4.0321591e-05]]])
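This immediately explains the astronomical AIC you computed: you evaluated a Gaussian with mean ~4.6 and variance ~4e-05 at raw values near 99. A back-of-the-envelope check using the fitted numbers above reproduces the order of magnitude:

mu, var, n = 4.59729469, 4.0321591e-05, 10000
x = 99.2174906  # a typical raw data value

# per-point Gaussian log density at x
log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
print(log_pdf)           # about -1.11e8

# AIC = 2k - 2*logL; the 2k penalty is negligible at this magnitude
print(-2 * n * log_pdf)  # about 2.2e12, the order of the AIC you reported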

The BIC and AIC are derived from the log-likelihood of the model (AIC = 2k - 2*logL, BIC = k*ln(n) - 2*logL), so you have to evaluate them on the same data the model was fitted on, i.e. np.log(data), because you want to know, given a value in log space, what its likelihood of belonging to a cluster is.

However, doing that, you instantly notice that you get a negative BIC:

log_gmm.bic(np.log(np.expand_dims(data,1)))
Out[59]: -73122.88028438632

The reason for this is that your variance is extremely small, so when the densities are estimated (see the source code for GaussianMixture), most of the per-point density values exceed 1. Their logarithms are positive, the total log-likelihood becomes large and positive, and the AIC and BIC therefore come out negative. See this post for more details on pdf values > 1.
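A quick check with the fitted variance above shows how large the densities get; the peak of a Gaussian pdf is 1/sqrt(2*pi*var):

var = 4.0321591e-05

peak = 1 / np.sqrt(2 * np.pi * var)  # maximum possible density value
print(peak)          # about 62.8, far above 1

# each point near the mean contributes up to log(62.8) ~ 4.1 to logL,
# so 10000 points give a large positive logL and hence a negative BIC
print(np.log(peak))  # about 4.14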

Lastly, if you want to know whether your response variable is better modelled on the log scale, you don't need a Gaussian mixture at all. You can use a Q-Q plot, or simply check whether changes in the data are multiplicative or additive.
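For example, a minimal sketch of the Q-Q plot route, using scipy.stats.probplot to compare the raw and log-transformed data against a normal distribution (whichever panel hugs the reference line more closely is the better scale):

from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

stats.probplot(data, dist="norm", plot=ax1)
ax1.set_title("raw data")

stats.probplot(np.log(data), dist="norm", plot=ax2)
ax2.set_title("log(data)")

plt.show()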