Outlier detection using Gaussian mixture


I have 5000 data points for each of my 17 features in a numpy array, giving a 5000 x 17 array. I am trying to find the outliers for each feature using a Gaussian mixture, and I am rather confused about the following: 1) How many components should I use for my GaussianMixture? 2) Should I fit the GaussianMixture directly on the 5000 x 17 array, or fit each feature column separately, resulting in 17 GaussianMixture models?

from sklearn import mixture

clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
clf.fit(full_feature_array)

or

clf = mixture.GaussianMixture(n_components=17, covariance_type='full')
clf.fit(full_feature_array)

or

clf = {}
for feature in range(full_feature_array.shape[1]):
    clf[feature] = mixture.GaussianMixture(n_components=1, covariance_type='full')
    clf[feature].fit(full_feature_array[:, feature].reshape(-1, 1))

2 Answers

Accepted Answer

The task of selecting the number of components to model a distribution with a Gaussian mixture model is an instance of model selection. This is not straightforward, and many approaches exist; a good summary can be found at https://en.m.wikipedia.org/wiki/Model_selection . One of the simplest and most widely used is to perform cross validation.
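
For example, here is a minimal cross-validation sketch, assuming your data is in full_feature_array and scanning an arbitrary range of candidate component counts; GaussianMixture.score returns the mean per-sample log-likelihood, so higher is better on held-out data:

from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X_train, X_val = train_test_split(full_feature_array, test_size=0.25, random_state=0)

best_k, best_ll = None, float('-inf')
for k in range(1, 11):  # candidate component counts (arbitrary range)
    gm = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
    gm.fit(X_train)
    ll = gm.score(X_val)  # mean log-likelihood on held-out data
    if ll > best_ll:
        best_k, best_ll = k, ll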

Normally, outliers can be identified as points belonging to the component (or components) with the largest variance. You would call this an unsupervised approach; however, it can still be difficult to decide what the cutoff variance should be.

A better approach, if applicable, is a supervised one: train the GMM on outlier-free data (with outliers manually removed), then classify as outliers any new points with particularly low likelihood scores. A second supervised variant is to train two GMMs (one for outliers and one for inliers, each using model selection) and then perform two-class classification on new data.

Regarding your question about training univariate versus multivariate GMMs: it is difficult to say in general, but for the purposes of outlier detection, univariate GMMs (or, equivalently, multivariate GMMs with diagonal covariance matrices) may be sufficient, and they require training fewer parameters than general multivariate GMMs, so I would start with that.
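
To make the supervised low-likelihood approach above concrete, here is a minimal sketch. X_clean (outlier-free training data) and X_new (new data to score) are hypothetical names, and the component count and percentile cutoff are arbitrary choices:

import numpy as np
from sklearn.mixture import GaussianMixture

# Train only on manually cleaned, outlier-free data (assumption)
gm = GaussianMixture(n_components=3, covariance_type='diag', random_state=0)
gm.fit(X_clean)

# Flag new points whose log-likelihood falls below the
# 1st percentile of the clean training scores
threshold = np.percentile(gm.score_samples(X_clean), 1)
outliers = X_new[gm.score_samples(X_new) < threshold]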

Another Answer

Using a Gaussian mixture model (GMM), any point sitting in a low-density area can be considered an outlier. Perhaps the challenge is how to define "low density"; for example, you can say that anything below the 4th percentile of density is an outlier.

import numpy as np
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=3).fit(X)  # fit the GMM first; 3 components is illustrative
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 4)
anomalies = X[densities < density_threshold]

Regarding choosing the number of components: look into the information criteria provided by AIC and BIC for different numbers of components; they often agree in such cases. The lowest value is best.

gm.bic(X)  # Bayesian information criterion
gm.aic(X)  # Akaike information criterion
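
For instance, a minimal selection loop (assuming X is your 5000 x 17 array; the candidate range of 1 to 8 components is arbitrary):

import numpy as np
from sklearn.mixture import GaussianMixture

bics = [GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X).bic(X)
        for k in range(1, 9)]
best_k = int(np.argmin(bics)) + 1  # component count with the lowest BIC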

Alternatively, BayesianGaussianMixture gives weights of (effectively) zero to clusters that are unnecessary.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

bgm = BayesianGaussianMixture(n_components=8, n_init=10)  # n_components should be large enough
bgm.fit(X)
np.round(bgm.weights_, 2)

output

array([0.5 , 0.3, 0.2 , 0. , 0. , 0. , 0. , 0. ])

So here the Bayesian GMM detected that there are three clusters.