How to identify feature names from indices in a decision tree using scikit-learn’s CountVectorizer?


I have the following data for training a model to detect whether a sentence is about:

  • a cat or dog
  • NOT about a cat or dog

[screenshot of the training data: a text column and a label column]

I ran the following code to train a DecisionTreeClassifier model and then view the tree visualisation:

import os
import random as rn

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fix seeds for reproducibility
seed_num = 1
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(seed_num)
rn.seed(seed_num)

dummy_train = pd.read_csv('dummy_train.csv')

X_train = dummy_train["text"]
y_train = dummy_train["label"]

# Bag-of-words -> term-frequency weighting (no idf) -> decision tree
dt_tree_pipe = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1), binary=True)),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', DecisionTreeClassifier(random_state=seed_num,
                                   class_weight={0: 1, 1: 1})),
])

tree_model_fold_1 = dt_tree_pipe.fit(X_train, y_train)

tree.plot_tree(dt_tree_pipe["clf"])

...resulting in the following tree:

[screenshot of the resulting decision tree visualisation]

The first node checks if x[7] is less than or equal to 0.177. How do I find out which word x[7] represents?

I tried the following code but the words returned in the output ("describing" and "the") don't look correct. I would have thought 'cat' and 'dog' would be the two words used to split the data into the positive and negative class.

vect_from_pipe = dt_tree_pipe["vect"]
words = vect_from_pipe.vocabulary_.keys()
print(list(words)[7])
print(list(words)[5])

[screenshot of the output: the words 'describing' and 'the']


2 Answers

Accepted answer (DataJanitor):

In scikit-learn, the term you're looking for is feature names: the names of the columns of the feature matrix that the vectorizer produces, i.e. the inputs the classifier actually sees.

In your code, you're accessing the vocabulary_ attribute of CountVectorizer, which is a dictionary mapping each word to its column index. Dictionary keys come back in insertion order (the order in which the terms were first encountered), not in index order, so converting the keys to a list and taking the 7th or 5th element does not necessarily give you the word at column 7 or 5 of the feature matrix.
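
A quick toy demonstration (a hypothetical two-sentence corpus, not your data) makes the mismatch visible:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, just to illustrate the ordering issue
docs = ["the dog barked", "a cat sat on the mat"]
vect = CountVectorizer().fit(docs)

print(list(vect.vocabulary_.keys()))   # insertion order: the order terms were first seen
print(vect.vocabulary_)                # mapping of term -> column index
print(vect.get_feature_names_out())    # terms ordered by column index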

To get the feature name (word) corresponding to a particular index, use the get_feature_names_out() method of CountVectorizer. It returns an array of feature names ordered by their column indices in the feature matrix.

Use this code instead:

vect_from_pipe = dt_tree_pipe["vect"]
feature_names = vect_from_pipe.get_feature_names_out()
print(feature_names[7])
print(feature_names[5])

This will print the words that correspond to the indices 7 and 5 in your feature matrix. The word at index 7 is the one used in the first split of your decision tree. So, in your case, x[7] in the decision tree corresponds to the word feature_names[7] from your CountVectorizer.
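
If you want to see every split at once rather than looking up indices one at a time, a short sketch along these lines (assuming the fitted dt_tree_pipe from your question) should work:

clf = dt_tree_pipe["clf"]
feature_names = dt_tree_pipe["vect"].get_feature_names_out()

# tree_.feature holds one entry per node: the feature index used to split,
# or -2 for leaf nodes
for node_id, feat_idx in enumerate(clf.tree_.feature):
    if feat_idx >= 0:
        print(f"node {node_id}: splits on '{feature_names[feat_idx]}' "
              f"(x[{feat_idx}]), threshold {clf.tree_.threshold[node_id]:.3f}")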

Answer (Ben Reiniger):

The keys of the vocabulary_ attribute are not in index order; in fact, it is the values of that dictionary that give you the feature indices. From the CountVectorizer documentation:

vocabulary_ : dict
A mapping of terms to feature indices.

Since we already have a pretty good idea what the two features in the tree should be, you can simply check vect_from_pipe.vocabulary_['cat'] and vect_from_pipe.vocabulary_['dog'] to see whether they are 5 and 7. Otherwise, you would reverse the dictionary: look for the entries whose values are 5 and 7 and read off the corresponding keys. But it's easier to just use vect_from_pipe.get_feature_names_out() and look at indices 5 and 7 there. Indeed, it's quite common to pass that directly to plot_tree:

tree.plot_tree(
    dt_tree_pipe[-1],
    feature_names=dt_tree_pipe[:-1].get_feature_names_out(),
)
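
For completeness, a minimal sketch of the dictionary-reversal approach mentioned above (again assuming the dt_tree_pipe from the question):

vocab = dt_tree_pipe["vect"].vocabulary_

# Invert the term -> index mapping so words can be looked up by feature index
index_to_word = {idx: word for word, idx in vocab.items()}
print(index_to_word[5], index_to_word[7])

# Or check the expected words directly
print(vocab.get("cat"), vocab.get("dog"))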