I'm trying to understand how the contamination parameter affects the threshold_ at which a sample is predicted to be an anomaly or not in IsolationForest.
In the source code for IsolationForest, in fit(), threshold_ is set by:
    self.threshold_ = -sp.stats.scoreatpercentile(
        -self.decision_function(X), 100. * (1. - self.contamination))
Then in predict(), a sample is marked as an anomaly by:
    is_inlier = np.ones(X.shape[0], dtype=int)
    is_inlier[self.decision_function(X) <= self.threshold_] = -1
I always thought that only samples with negative scores returned by decision_function would be predicted as anomalies. But say I have 10 scores [0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0, -0.1, -0.2, -0.3]. If I set contamination = 0.9, the 9 samples with scores between -0.3 and 0.4 would be predicted as anomalies, meaning that samples with positive scores are also predicted as anomalies.
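To make the arithmetic concrete, here is a minimal sketch that applies the two snippets above to the 10 example scores. It is not the library's code: the scores are the made-up values from my example rather than real decision_function output, and np.percentile (default linear interpolation) stands in for sp.stats.scoreatpercentile as an assumption on my part.

```python
import numpy as np

# Hypothetical scores from the example above, not real decision_function output.
scores = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.1, 0.0, -0.1, -0.2, -0.3])
contamination = 0.9

# Same formula as in fit(), with np.percentile substituted for
# sp.stats.scoreatpercentile (assumed equivalent for this purpose).
threshold = -np.percentile(-scores, 100.0 * (1.0 - contamination))

# Same rule as in predict(): scores at or below the threshold are labelled -1.
is_inlier = np.ones(scores.shape[0], dtype=int)
is_inlier[scores <= threshold] = -1

n_anomalies = int((is_inlier == -1).sum())
print("threshold:", threshold)
print("labels:", is_inlier)
print("anomalies:", n_anomalies)
```

With these numbers the threshold comes out at roughly 0.41 rather than 0, so every sample except the one scoring 0.5 is labelled -1, including the ones with positive scores.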
Is the calculation of the anomaly scores somehow affected by the contamination parameter, such that at most a contamination fraction of the scores would be negative, which in turn would mean threshold_ = 0?