Threshold of anomaly score in scikit-learn's IsolationForest


I'm trying to understand how the contamination parameter affects threshold_, the cutoff at which IsolationForest predicts a sample to be an anomaly.

In the source code for IsolationForest, fit() sets threshold_ with

self.threshold_ = -sp.stats.scoreatpercentile(
    -self.decision_function(X), 100. * (1. - self.contamination))

Then predict() labels a sample as an anomaly with

is_inlier = np.ones(X.shape[0], dtype=int)
is_inlier[self.decision_function(X) <= self.threshold_] = -1

I always thought that only samples with a negative score from decision_function would be predicted as anomalies. But suppose I have 10 scores [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0, -0.1, -0.2, -0.3]. If I set contamination = 0.9, the 9 samples with scores between -0.3 and 0.4 are predicted as anomalies, which means samples with positive scores are flagged as well.
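To make the example concrete, here is a minimal sketch in plain Python that mirrors the two snippets above on my 10 scores. It does not call scikit-learn: score_at_percentile, threshold_for and predict are hypothetical helpers I wrote for illustration, and I am assuming that scipy's scoreatpercentile uses linear interpolation between the two nearest sorted values.

```python
def score_at_percentile(values, per):
    # Hypothetical stand-in for sp.stats.scoreatpercentile, assuming
    # linear interpolation between the two nearest sorted values.
    ordered = sorted(values)
    pos = (per / 100.0) * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def threshold_for(scores, contamination):
    # Same expression as in fit(): negate, take the percentile, negate back.
    negated = [-s for s in scores]
    return -score_at_percentile(negated, 100.0 * (1.0 - contamination))


def predict(scores, threshold):
    # Same rule as in predict(): scores at or below the threshold become -1.
    return [-1 if s <= threshold else 1 for s in scores]


scores = [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0, -0.1, -0.2, -0.3]

threshold = threshold_for(scores, contamination=0.9)
labels = predict(scores, threshold)

print("threshold:", threshold)
print("anomalies:", labels.count(-1), "of", len(scores))
```

With contamination = 0.9 the threshold comes out at roughly 0.41, so every sample except the one scoring 0.5 is labelled -1. With contamination = 0.3 the same helpers give a slightly negative threshold and only the three negative scores are flagged.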

Is the calculation of the anomaly scores somehow affected by the contamination parameter, so that at most a contamination fraction of the scores is negative? Would that in turn mean threshold_ = 0?
