IsolationForest is always predicting 1


I am working on a project to detect out-of-domain text input using IsolationForest and tf-idf features. Here is a summary of what I have done:

TRAINING:

  • On tfidf:

    • Fit and transform the in-domain dataset using CountVectorizer().
    • Fit a TfidfTransformer() on the output of this CountVectorizer() (the transformer is saved for use at test time).
    • Then transform the training data using the fitted TfidfTransformer().
    • Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for use at test time.
  • On IsolationForest:

    • Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
    • Save the model using joblib.
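
For reference, the training steps above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the corpus `in_domain_texts`, the file names, and the temporary output directory are all placeholders, and every estimator uses its default settings.

```python
import pickle
import tempfile
from pathlib import Path

import joblib
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder in-domain corpus (hypothetical data, for illustration only).
in_domain_texts = [
    "book a flight to paris",
    "cancel my flight booking",
    "change the date of my flight",
    "what is the baggage allowance",
    "book a hotel near the airport",
    "show my upcoming flight",
]

# Step 1: raw token counts for the in-domain data.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(in_domain_texts)

# Steps 2 and 3: fit the tf-idf weighting on the counts and transform them.
tfidf_transformer = TfidfTransformer()
train_features = tfidf_transformer.fit_transform(counts)

# Train the IsolationForest on the sparse tf-idf matrix.
forest = IsolationForest(random_state=0)
forest.fit(train_features)

# Persist everything needed at test time (placeholder file names).
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "vocabulary.pkl", "wb") as f:
    pickle.dump(count_vectorizer.vocabulary_, f)
with open(out_dir / "tfidf.pkl", "wb") as f:
    pickle.dump(tfidf_transformer, f)
joblib.dump(forest, out_dir / "isolation_forest.joblib")
```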

TESTING:

  • Load all of the saved models.
  • Compute the tf-idf features of the current out-of-domain input text by repeating the training steps (transformations only, no fitting).
  • Predict if it is out-of-domain or not, using the saved IsolationForest model.
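
The test-time steps look roughly like the sketch below. It is self-contained, so it trains on a tiny placeholder corpus and simulates the save/load step with an in-memory pickle round trip instead of real files; `train_texts` and `test_texts` are hypothetical. The key point is that the CountVectorizer is rebuilt from the saved vocabulary_ and only transform() is called at test time.

```python
import pickle

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Placeholder training data (hypothetical), fitted as in the training phase.
train_texts = [
    "book a flight to paris",
    "cancel my flight booking",
    "change the date of my flight",
    "what is the baggage allowance",
    "book a hotel near the airport",
    "show my upcoming flight",
]
count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
train_features = tfidf_transformer.fit_transform(
    count_vectorizer.fit_transform(train_texts)
)
forest = IsolationForest(random_state=0).fit(train_features)

# Simulate saving and loading (stands in for the pickle/joblib files).
saved_vocabulary = pickle.loads(pickle.dumps(count_vectorizer.vocabulary_))
saved_tfidf = pickle.loads(pickle.dumps(tfidf_transformer))
saved_forest = pickle.loads(pickle.dumps(forest))

# Rebuild the vectorizer from the saved vocabulary; no fitting at test time.
test_vectorizer = CountVectorizer(vocabulary=saved_vocabulary)

# Placeholder test inputs (hypothetical).
test_texts = ["book a flight to rome", "how do i bake sourdough bread"]
test_counts = test_vectorizer.transform(test_texts)
test_features = saved_tfidf.transform(test_counts)

# predict() returns 1 for inliers and -1 for outliers.
predictions = saved_forest.predict(test_features)
print(predictions)
```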

But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.

What is probably going wrong?

NB: To check whether the tf-idf module is responsible, I also fed the IsolationForest model dummy vectors that mimic the output of the tf-idf transformer. No matter which random vector I provide, I always get 1 as output from IsolationForest. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
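
One way to look beyond the hard 1 / -1 label is to inspect the continuous scores behind it. In scikit-learn, predict() returns -1 only where decision_function() is negative, and decision_function() is score_samples() minus the fitted offset_. The sketch below uses small synthetic dense data (an assumption, not my tf-idf matrix; `X_train` and `X_probe` are made-up names) just to show which values can be compared between training and test inputs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (hypothetical): probes are shifted far from training.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 20))
X_probe = rng.normal(loc=6.0, size=(5, 20))

forest = IsolationForest(random_state=0).fit(X_train)

# Continuous scores: lower means more anomalous.
train_scores = forest.decision_function(X_train)
probe_scores = forest.decision_function(X_probe)
predictions = forest.predict(X_probe)

print("offset_:", forest.offset_)
print("train score range:", train_scores.min(), train_scores.max())
print("probe scores:", probe_scores)
print("probe predictions:", predictions)
```

If the test scores sit in the same range as the training scores, the forest is not separating the inputs at all, which points at the model or features rather than at the thresholding step.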
