I have the following data generated using:
from scipy.stats import skewnorm
import matplotlib.pyplot as plt
import numpy as np
numValues = 10000
maxValue = 100
skewness = -5
data = skewnorm.rvs(a = skewness,loc=maxValue, size=numValues)
for guassian model, I have calculated the AIC from below:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 1).fit(X=np.expand_dims(data,1))
gmm_aic = gmm.aic(X=np.expand_dims(data,1))
gmm_bic = gmm.bic(X=np.expand_dims(data,1))
for lognormal model, I took the logarithm of data and fitted a guassian from below:
log_gmm = GaussianMixture(n_components = 1).fit(X=np.expand_dims(np.log(data),1))
and calculated AIC for this model on the data and not on np.log(data)
log_gmm_aic = log_gmm.aic(X=np.expand_dims(data,1))
log_gmm_bic = log_gmm.bic(X=np.expand_dims(data,1))
This way, my AIC and BIC values in both the cases are:
Guassian - AIC: 18790.27 BIC:18804.69
lognormal - AIC: 2231500479614.06 BIC:2231500479628.48
The values seem incomparable. What am I doing wrong if I want to calculate lognormal model's AIC and BIC?
The short answer is that you should not compare aic of models that have different scales of response variables. See post like this or this.
So for example you can compare
You can see that the first model has a better BIC than the second.
A bit further on why you are getting weird values. For the non-log model, the estimated parameters are:
For your so called log-normal data, your input data is on the log scale and you can see that the estimated parameters are on the log scale:
The BIC and AIC are derived from the log likelihood of the model, and you have to use your input data, because you want to know given a value on the log space, what is it's probability of belonging to a cluster.
However you instantly notice that you get a negative aic:
The reason for this is that your variance is extremely small, so in estimating the probabilities, see source code for gmm, most of your probabilities exceed 1, making the aic positive. See this post for more details on pdf values > 1.
Lastly, if you want to know whether your response variable or distribution is better log-transformed, you don't need to use a gaussian mixture. You can use a qqplot or basically check whether changes are multiplicative or additive.