I am having an issue where some classes have a success rate of 0% or below 60% given the training set. I was given a list of words to help classify this data, but I am not sure how to use it. I know stop words remove certain words from the data, but can you apply a list of words to a certain class to help the ML algorithm produce a better result?


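I think you're looking for the vocabulary parameter of the scikit-learn vectorizers, which restricts the features to a fixed list of words. The list applies to every document rather than to a single class; the classifier then learns which of those words go with which class. For example, here's a minimal example with CountVectorizer that only counts the words "one" and "two."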

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

raw_data = [
    "one two three four",
    "two four six eight",
    "one three five seven",
    "two six ten twelve",
]
labels = np.array([0, 0, 1, 1])

X_train_raw, X_test_raw, y_train, y_test = train_test_split(raw_data, labels)

vectorizer = CountVectorizer(vocabulary=["one", "two"])

X_train = vectorizer.fit_transform(X_train_raw)
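# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out())
print(X_train.toarray())
# Example output (the rows depend on the random train/test split):
# ['one' 'two']
# [[1 1]
#  [0 1]
#  [0 1]]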

If you don't know what words to use ahead of time, another approach would be to do feature selection on the outputs of a vectorizer.