I am working on a project to detect out-of-domain text input using IsolationForest and tf-idf features. Here is a summary of what I do:
TRAINING
On tfidf:
- Fit and transform the in-domain dataset using CountVectorizer().
- Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
- Then transform the training data using the fitted TfidfTransformer().
- Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time use.
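For reference, the tf-idf training steps above can be sketched roughly as follows. This is only a minimal sketch: the toy corpus and the variable names are made up for illustration, and the objects are pickled to in-memory bytes instead of files to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical in-domain corpus, for illustration only.
in_domain_texts = [
    "book a flight to london",
    "cancel my flight booking",
    "what is the baggage allowance",
]

# Step 1: fit the vectorizer and get raw token counts.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(in_domain_texts)

# Steps 2 and 3: fit the tf-idf transformer on the counts, then transform them.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts)
train_features = tfidf_transformer.transform(counts)

# Step 4: serialize the vocabulary and the transformer for test time.
# (In practice these bytes would be written to files with pickle.dump.)
vocabulary_bytes = pickle.dumps(count_vectorizer.vocabulary_)
transformer_bytes = pickle.dumps(tfidf_transformer)
```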
On IsolationForest:
- Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
- Save the model using joblib.
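The IsolationForest step looks roughly like the sketch below. The random sparse matrix is only a stand-in for my real tf-idf matrix, and the file name and temporary directory are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import IsolationForest

# Stand-in for the tf-idf matrix of the in-domain data (assumption:
# a sparse, non-negative matrix; the real one has far more columns).
rng = np.random.RandomState(0)
train_features = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)

# Train the detector on the in-domain features only.
detector = IsolationForest(random_state=0)
detector.fit(train_features)

# Persist the fitted model with joblib (placeholder path).
model_path = os.path.join(tempfile.mkdtemp(), "isolation_forest.joblib")
joblib.dump(detector, model_path)
```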
TESTING:
- Load all of the saved models.
- Compute the tf-idf features of the current out-of-domain input text by repeating the training steps (transformations only, no fitting).
- Predict whether it is out-of-domain using the saved IsolationForest model.
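A minimal sketch of the test-time steps, assuming the artifacts were saved as described above. The first half only recreates stand-in artifacts on a made-up toy corpus so that the example runs on its own; all texts and variable names are hypothetical.

```python
import pickle

from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# --- Stand-in for the saved artifacts (toy corpus, illustration only) ---
train_texts = [
    "book a flight to london",
    "cancel my flight booking",
    "what is the baggage allowance",
    "change the seat on my flight",
]
cv = CountVectorizer()
counts = cv.fit_transform(train_texts)
tfidf = TfidfTransformer().fit(counts)
forest = IsolationForest(random_state=0).fit(tfidf.transform(counts))
saved_vocab = pickle.dumps(cv.vocabulary_)
saved_tfidf = pickle.dumps(tfidf)
saved_forest = pickle.dumps(forest)

# --- Test time: load everything and apply transformations only ---
vocab = pickle.loads(saved_vocab)
test_vectorizer = CountVectorizer(vocabulary=vocab)  # reuse the training vocabulary
loaded_tfidf = pickle.loads(saved_tfidf)
loaded_forest = pickle.loads(saved_forest)

test_texts = ["quantum chromodynamics lecture notes"]
test_counts = test_vectorizer.transform(test_texts)   # no fitting here
test_features = loaded_tfidf.transform(test_counts)   # no fitting here

labels = loaded_forest.predict(test_features)             # 1 = inlier, -1 = outlier
scores = loaded_forest.decision_function(test_features)   # negative = outlier
```

One thing this makes visible: tokens that are not in the saved vocabulary are dropped by CountVectorizer, so a test text with no known tokens becomes an all-zero row.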
But I have found that even though the tf-idf features are quite different for each of my test inputs, IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the output of the tf-idf transformer, to check whether the tf-idf module is responsible for this. No matter which random vector I provide, I always get 1 as output from IsolationForest. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
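The dummy-vector check can be sketched as below. Besides predict, it also prints the continuous scores, since those may show more than the label alone. The matrix sizes and densities are arbitrary stand-ins (my real data has 48015 columns), so this is not a reproduction of my actual setup.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
n_features = 500  # arbitrary stand-in for the real feature count

# Stand-in for the sparse tf-idf training matrix.
train_features = sparse_random(
    300, n_features, density=0.01, format="csr", random_state=rng
)
detector = IsolationForest(random_state=0).fit(train_features)

# Dummy dense vectors imitating tf-idf rows.
dummy_vectors = rng.rand(5, n_features)

labels = detector.predict(dummy_vectors)             # 1 = inlier, -1 = outlier
raw = detector.score_samples(dummy_vectors)          # lower = more abnormal
decision = detector.decision_function(dummy_vectors) # raw score minus offset_

print("labels:  ", labels)
print("raw:     ", raw)
print("offset_: ", detector.offset_)
print("decision:", decision)
```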