How to identify feature names from indices in a decision tree using scikit-learn’s CountVectorizer?


I have the following data for training a model to detect whether a sentence is about:

  • a cat or dog
  • NOT about a cat or dog

[screenshot of the training data: a text column and a label column]

I ran the following code to train a DecisionTreeClassifier model and then view the tree visualisation:

import os
import random as rn

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fix seeds for reproducibility
seed_num = 1
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(seed_num)
rn.seed(seed_num)

dummy_train = pd.read_csv('dummy_train.csv')

X_train = dummy_train["text"]
y_train = dummy_train["label"]

# Bag-of-words -> term-frequency weighting (no idf) -> decision tree
dt_tree_pipe = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1), binary=True)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', DecisionTreeClassifier(random_state=seed_num,
                                   class_weight={0: 1, 1: 1})),
])

tree_model_fold_1 = dt_tree_pipe.fit(X_train, y_train)

tree.plot_tree(dt_tree_pipe["clf"])

...resulting in the following tree:

[screenshot of the resulting decision tree visualisation]

The first node checks if x[7] is less than or equal to 0.177. How do I find out which word x[7] represents?

I tried the following code but the words returned in the output ("describing" and "the") don't look correct. I would have thought 'cat' and 'dog' would be the two words used to split the data into the positive and negative class.

vect_from_pipe = dt_tree_pipe["vect"]
words = vect_from_pipe.vocabulary_.keys()
print(list(words)[7])
print(list(words)[5])

[screenshot of the output: the words 'describing' and 'the']


2 Answers

Accepted answer (DataJanitor):

In scikit-learn, the term you're looking for is feature names: the names of the columns of the feature matrix that the vectorizer produces, i.e. the inputs the classifier actually sees.

In your code, you're accessing the vocabulary_ attribute of CountVectorizer, which is a dictionary mapping each word to its column index. Dictionary keys come back in insertion order (the order in which the terms were first encountered), not in index order, so converting the keys to a list and taking the 7th or 5th element does not necessarily give you the word at column 7 or 5 of the feature matrix.
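
A quick toy demonstration (a hypothetical two-sentence corpus, not your data) makes the mismatch visible:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, just to illustrate the ordering issue
docs = ["the dog barked", "a cat sat on the mat"]
vect = CountVectorizer().fit(docs)

print(list(vect.vocabulary_.keys()))   # insertion order: the order terms were first seen
print(vect.vocabulary_)                # mapping of term -> column index
print(vect.get_feature_names_out())    # terms ordered by column index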

To get the feature name (word) corresponding to a particular index, use the get_feature_names_out() method of CountVectorizer. It returns an array of feature names ordered by their column indices in the feature matrix.

Use this code instead:

vect_from_pipe = dt_tree_pipe["vect"]
feature_names = vect_from_pipe.get_feature_names_out()
print(feature_names[7])
print(feature_names[5])

This will print the words that correspond to the indices 7 and 5 in your feature matrix. The word at index 7 is the one used in the first split of your decision tree. So, in your case, x[7] in the decision tree corresponds to the word feature_names[7] from your CountVectorizer.
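
If you want to see every split at once rather than looking up indices one at a time, a short sketch along these lines (assuming the fitted dt_tree_pipe from your question) should work:

clf = dt_tree_pipe["clf"]
feature_names = dt_tree_pipe["vect"].get_feature_names_out()

# tree_.feature holds one entry per node: the feature index used to split,
# or -2 for leaf nodes
for node_id, feat_idx in enumerate(clf.tree_.feature):
    if feat_idx >= 0:
        print(f"node {node_id}: splits on '{feature_names[feat_idx]}' "
              f"(x[{feat_idx}]), threshold {clf.tree_.threshold[node_id]:.3f}")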

Answer (Ben Reiniger):

The keys of the vocabulary_ attribute are not in index order; in fact, it is the values of that dictionary that give you the feature indices. From the CountVectorizer documentation:

vocabulary_ : dict
A mapping of terms to feature indices.

Since we already have a pretty good idea what the two features in the tree should be, you can simply check vect_from_pipe.vocabulary_['cat'] and vect_from_pipe.vocabulary_['dog'] to see whether they are 5 and 7. Otherwise, you would reverse the dictionary: look for the entries whose values are 5 and 7 and read off the corresponding keys. But it's easier to just use vect_from_pipe.get_feature_names_out() and look at indices 5 and 7 there. Indeed, it's quite common to pass that directly to plot_tree:

tree.plot_tree(
    dt_tree_pipe[-1],
    feature_names=dt_tree_pipe[:-1].get_feature_names_out(),
)
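
For completeness, a minimal sketch of the dictionary-reversal approach mentioned above (again assuming the dt_tree_pipe from the question):

vocab = dt_tree_pipe["vect"].vocabulary_

# Invert the term -> index mapping so words can be looked up by feature index
index_to_word = {idx: word for word, idx in vocab.items()}
print(index_to_word[5], index_to_word[7])

# Or check the expected words directly
print(vocab.get("cat"), vocab.get("dog"))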