I am trying to do text classification with TPOT. I know you can save the vocabulary of the TfidfVectorizer, but I am having trouble getting predictions from my fitted model on new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from tpot import TPOTClassifier

x_sentences = ["hello world", "how are you", ...]
y_classes = [1, 2, ...]

# Vectorize the training sentences
tfidfconverter = TfidfVectorizer(max_features=500, min_df=5, max_df=0.7)
X = tfidfconverter.fit_transform(x_sentences).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

# Fit the TPOT pipeline search
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = TPOTClassifier(generations=3, population_size=30, cv=cv, scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# Return prediction for a new sample, transformed with the same vectorizer
sample = tfidfconverter.transform(["hello friend"])
print(model.predict(sample))
I want my model to accept words that are not in the original dataset. I am not sure whether I have to pad the sentences or how I can make it generalize to unseen values; I think reusing the same tfidfconverter should be enough. When I run inference on a new sample, it returns the following error:
ValueError: Not all operators in None supports sparse matrix. Please use "TPOT sparse" for sparse matrix.
It is exactly as the error states: you need to pass an extra parameter to your TPOTClassifier, config_dict='TPOT sparse'. You can read more about it under 'Built-in TPOT Configurations' at http://epistasislab.github.io/tpot/using/
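For reference, here is a minimal sketch of how that change fits into your snippet, assuming the same x_sentences, y_classes, and vectorizer settings as above. The only substantive changes are keeping the TF-IDF matrix sparse (no .toarray()) and passing config_dict='TPOT sparse', so training and prediction use the same representation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from tpot import TPOTClassifier

tfidfconverter = TfidfVectorizer(max_features=500, min_df=5, max_df=0.7)
X = tfidfconverter.fit_transform(x_sentences)  # keep the sparse matrix; no .toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y_classes, test_size=0.2, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 'TPOT sparse' restricts the search to operators that accept sparse input
model = TPOTClassifier(generations=3, population_size=30, cv=cv, scoring='accuracy',
                       verbosity=2, random_state=1, n_jobs=-1, config_dict='TPOT sparse')
model.fit(X_train, y_train)

sample = tfidfconverter.transform(["hello friend"])  # sparse, same representation as the training data
print(model.predict(sample))

As for unseen words: reusing the fitted tfidfconverter at prediction time is the right approach. Tokens that are not in its vocabulary are simply ignored by transform, so no padding is needed.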