I took this example from the SKLearn website. Here's the initial code:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# WORKING: assigning CountVectorizer() to a variable "vectorizer"
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()
>>> ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
# NOT WORKING
X = CountVectorizer().fit_transform(corpus)
CountVectorizer().get_feature_names()
>>> NotFittedError: Vocabulary not fitted or provided
I'm confused at this point. Why do we have to assign CountVectorizer() to a variable if both snippets are doing exactly the same thing?
In the first example, you create one CountVectorizer() object and use it throughout the entire code snippet. In the second example, the two CountVectorizer() calls create two different objects. Let's walk through the code.
In the first line, we create a new CountVectorizer() object, call .fit_transform() on it, and assign the result of that call to X. In the second line, we create a different CountVectorizer() object and call .get_feature_names() on it. This object is completely independent of the first one we created; it shares no state with the original object. Since .fit_transform() has never been called on this second object, it has no vocabulary, and scikit-learn raises a NotFittedError stating that the vocabulary hasn't been fitted.