I need a machine learning algorithm that will satisfy the following requirements:
- The training data are a set of feature vectors, all belonging to the same "positive" class (as I cannot produce negative samples).
- The test data are some feature vectors which might or might not belong to the positive class.
- The prediction should be a continuous value indicating the "distance" from the positive samples (i.e., 0 means the test sample clearly belongs to the positive class, 1 means it is clearly negative, and 0.3 means it is somewhat positive).
An example: let's say the feature vectors are 2D (a minimal sketch of the scoring behaviour I have in mind follows the test cases below).
Positive training data:
- (0, 1), (0, 2), (0, 3)
Test data:
- (0, 10) should be an anomaly, but not a pronounced one
- (1, 0) should be an anomaly, with a higher anomaly "rank" than (0, 10)
- (1, 10) should be an anomaly, with an even higher anomaly "rank"
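
For concreteness, here is a minimal sketch of the kind of continuous scoring I am after. It is only an illustration, not a required approach: it fits a Gaussian to the positive samples (with an arbitrary small regularization of 1e-2 on the covariance, since the x-direction has zero variance) and uses the Mahalanobis distance as the anomaly score. Only numpy is assumed.

    import numpy as np

    # Positive training data: all samples lie on the line x = 0.
    X_train = np.array([[0, 1], [0, 2], [0, 3]], dtype=float)

    mu = X_train.mean(axis=0)
    # Sample covariance; add a small ridge so the zero-variance x-direction stays invertible.
    cov = np.cov(X_train, rowvar=False) + 1e-2 * np.eye(X_train.shape[1])
    cov_inv = np.linalg.inv(cov)

    def anomaly_score(x):
        """Mahalanobis distance from the fitted positive-class Gaussian (0 = at the mean)."""
        d = np.asarray(x, dtype=float) - mu
        return float(np.sqrt(d @ cov_inv @ d))

    for test in [(0, 10), (1, 0), (1, 10)]:
        print(test, anomaly_score(test))

With this particular regularization the scores come out at roughly 8.0, 10.2 and 12.8, i.e. the ordering (0, 10) < (1, 0) < (1, 10) that I described above (the ordering does depend on the regularization strength). A monotone squashing such as 1 - exp(-score) could then map the scores into [0, 1] if needed.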
The problem you describe is usually referred to as outlier, anomaly, or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each; as a start, I will list some of the standard ones: