Using CountVectorizer or TfidfVectorizer, can you do the opposite of stop words, but to apply certain words to a classification?

223 Views Asked by InfernoKun At 04 April 2025 at 06:00

I am having an issue where some classes have a 0% or <60% success rate given the training set. I was given a list of words to help classify data like this, but I am not sure how to use it. I know stop words remove certain words from the data, but can you apply a list of words to a certain class so that the ML algorithm produces a better result?


There is 1 best solution below

Alexander L. Hayes On 03 May 2021 at 14:29

I think you're looking for the vocabulary parameter of the vectorizers. For example, here's a minimal example with CountVectorizer that only uses the words "one" and "two."

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

raw_data = [
    "one two three four",
    "two four six eight",
    "one three five seven",
    "two six ten twelve",
]
labels = np.array([0, 0, 1, 1])

X_train_raw, X_test_raw, y_train, y_test = train_test_split(raw_data, labels)

vectorizer = CountVectorizer(vocabulary=["one", "two"])

X_train = vectorizer.fit_transform(X_train_raw)
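# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())
print(X_train.toarray())
# Example output (the rows depend on the random train/test split):
# ['one' 'two']
# [[1 1]
#  [0 1]
#  [0 1]]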

If you don't know what words to use ahead of time, another approach would be to do feature selection on the outputs of a vectorizer.
