Python CountVectorizer(): why do we have to assign CountVectorizer() to a variable in order for this to work?

I took this example from the scikit-learn website. Here's the initial code:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# WORKING: assigning a variable "vectorizer" for CountVectorizer()
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()
>>> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# NOT WORKING
X = CountVectorizer().fit_transform(corpus)
CountVectorizer().get_feature_names()
>>> NotFittedError: Vocabulary not fitted or provided

I'm confused at this point. Why do we have to assign CountVectorizer() to a variable if the two snippets appear to do exactly the same thing?

There is 1 answer below.

In the first example, you create one CountVectorizer() object and use it throughout the entire code snippet.

In the second example, the two CountVectorizer() calls create two different objects.
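To see why two constructor calls behave differently from one stored object, here is a minimal sketch that uses a hypothetical stand-in class, TinyVectorizer, instead of scikit-learn. The class, its regex tokenizer, and the RuntimeError it raises are assumptions made for illustration, not scikit-learn's implementation. The point it shows is that learned state lives on the instance, so a second instance starts out empty.

```python
import re


class TinyVectorizer:
    """Hypothetical stand-in for CountVectorizer (illustration only)."""

    def __init__(self):
        # Learned state is stored on this particular instance.
        self.vocabulary_ = None

    @staticmethod
    def _tokenize(doc):
        # Lowercase the text and keep words of two or more characters.
        return re.findall(r"\w\w+", doc.lower())

    def fit_transform(self, docs):
        tokenized = [self._tokenize(doc) for doc in docs]
        self.vocabulary_ = sorted({tok for toks in tokenized for tok in toks})
        # One row of term counts per document.
        return [[toks.count(term) for term in self.vocabulary_] for toks in tokenized]

    def get_feature_names(self):
        if self.vocabulary_ is None:
            raise RuntimeError("Vocabulary not fitted or provided")
        return self.vocabulary_


corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

first = TinyVectorizer()
second = TinyVectorizer()       # a separate object with its own (empty) state
print(first is second)          # each constructor call builds a new object

X = first.fit_transform(corpus)
print(first.get_feature_names())

try:
    second.get_feature_names()  # never fitted, so it has no vocabulary
except RuntimeError as err:
    print("second:", err)
```

Writing CountVectorizer() twice is the same situation as first and second here: the fitted object from the first call is discarded once its result is assigned to X.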

Let's walk through the code.

X = CountVectorizer().fit_transform(corpus)
CountVectorizer().get_feature_names()

In the first line, we create a new CountVectorizer() object, call .fit_transform() on it, and then assign the result of the call to .fit_transform() to X.

In the second line, we create a different CountVectorizer() object and call .get_feature_names() on it. This object is completely independent of the first one; it shares no state with it, including the vocabulary learned by .fit_transform(). Since neither .fit() nor .fit_transform() was ever called on this second object, scikit-learn raises a NotFittedError stating that the vocabulary has not been fitted.