Machine Learning - one class classification/novelty detection/anomaly assessment?


I need a machine learning algorithm that will satisfy the following requirements:

  • The training data are a set of feature vectors, all belonging to the same, "positive" class (as I cannot produce negative data samples).
  • The test data are some feature vectors which might or might not belong to the positive class.
  • The prediction should be a continuous value indicating the "distance" from the positive samples (e.g. 0 means the test sample clearly belongs to the positive class, 1 means it is clearly negative, and 0.3 means it is somewhat positive).

An example: Let's say that the feature vectors are 2D feature vectors.

Positive training data:

  • (0, 1), (0, 2), (0, 3)

Test data:

  • (0, 10) should be an anomaly, but a mild one
  • (1, 0) should be an anomaly, with a higher anomaly "rank" than (0, 10)
  • (1, 10) should be an anomaly, with an even higher anomaly "rank"
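To make the requirement concrete, here is a minimal, hypothetical sketch of the interface I have in mind. The scale-aware (Mahalanobis-style) distance is only one way to make the toy example self-consistent, not a method I am committed to:

```python
import numpy as np

# Hypothetical interface: fit on positive samples only, then score test
# points with a continuous anomaly value (larger = more anomalous).
# A Mahalanobis-style distance is used here only so that the sketch is
# runnable and reproduces the ranking described above.

X_train = np.array([[0, 1], [0, 2], [0, 3]], dtype=float)

mean = X_train.mean(axis=0)
# Small ridge added because the x-variance of the training data is 0.
cov = np.cov(X_train, rowvar=False) + 1e-2 * np.eye(2)
cov_inv = np.linalg.inv(cov)

def score(x):
    """Anomaly score: Mahalanobis distance of x from the training data."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ cov_inv @ d))

print(score([0, 10]), score([1, 0]), score([1, 10]))
assert score([0, 10]) < score([1, 0]) < score([1, 10])  # ranking above
```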
1 Answer
The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:

  • K-nearest neighbors (k-NN) - a simple distance-based method which assumes that normal data samples lie close to other normal samples, while novel samples lie far from the normal points. A Python implementation of k-NN can be found in scikit-learn.
  • Mixture models (e.g. Gaussian Mixture Model) - probabilistic models that estimate the generative probability density function of the data, for instance as a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find the parameters of a probability distribution that describes the samples best. The probability (density) of a new sample then decides whether it belongs to the distribution or is an outlier. scikit-learn implements Gaussian Mixture Models and learns them with the Expectation-Maximization (EM) algorithm.
  • One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, after projecting the data into the so-called "feature space"). scikit-learn has an easy-to-use implementation of one-class SVM, along with a nice example. The example's plot illustrates the boundary one-class SVM finds "around" the normal data samples: [plot: one-class SVM decision boundary enclosing the normal training samples, with outliers falling outside it]
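The k-NN approach from the first bullet can be sketched with scikit-learn's NearestNeighbors; the toy data and the choice of k here are my own assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Normal (positive) training data and test points to score (toy values).
X_train = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.1, 2.5]])
X_test = np.array([[0.0, 2.1], [0.0, 10.0], [1.0, 10.0]])

# Fit on normal data only, then score each test point by its mean
# distance to the k nearest training samples (larger = more anomalous).
k = 2
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
dist, _ = nn.kneighbors(X_test)
scores = dist.mean(axis=1)
print(scores)  # continuous anomaly scores, one per test point
```

This already satisfies the "continuous value" requirement: the score has no fixed upper bound, but you can threshold or rescale it as needed.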
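The mixture-model approach can be sketched as follows; the two synthetic blobs and the number of components are assumptions made for the sake of the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic "normal" data: two Gaussian blobs.
X_train = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(100, 2)),
])

# Fit a 2-component GMM via EM, then use the log-density of a new
# sample as a normality score: low density suggests an outlier.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)
log_density = gmm.score_samples(np.array([[0.1, 0.1], [2.5, 2.5]]))
print(log_density)  # in-distribution point gets a higher log-density
```

Negating the log-density (or thresholding it at some quantile of the training scores) turns this into the continuous anomaly score the question asks for.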
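A minimal one-class SVM sketch, assuming synthetic Gaussian training data and hand-picked gamma/nu values:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(200, 2))  # "normal" samples only

# nu (roughly) bounds the fraction of training points treated as
# outliers; decision_function returns a signed distance to the learned
# boundary (negative = outside the boundary, i.e. anomalous).
oc_svm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)
scores = oc_svm.decision_function(np.array([[0.0, 0.0], [6.0, 6.0]]))
print(scores)
```

Here `decision_function` provides the continuous score requested in the question, while `predict` gives a hard +1/-1 label if one is needed.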