Hello, I am trying to build a model of the topics of several small pieces of text. The corpus is composed of comments from a social web page. I have the following structure: first, a list with the documents, as follows:
listComments = ["I like the post", "I hate to use this smartphoneee","iPhone 7 now has the best performance and battery life :)",...]
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=10, ngram_range=(1, 3), analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(listComments)
I used TF-IDF to build a model with those parameters, and then I applied LDA as follows:
# Using Latent Dirichlet Allocation
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 30
n_top_words = 20
lda = LatentDirichletAllocation(n_components=n_topics,  # this parameter was named n_topics before scikit-learn 0.19
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tfidf)
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # indices of the n_top_words largest topic weights, in descending order
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
print("\nTopics in LDA model:")
tf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
y_pred = lda.transform(tfidf)  # lda is already fitted above; transform avoids refitting the model
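For reference, lda.transform returns one row per document and one column per topic, which can be checked with:

print(y_pred.shape)  # -> (len(listComments), n_topics)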
Then I saved the two models, tfidf and LDA, to run the following experiment.
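Since the saving step isn't shown, here is a minimal sketch of one common way to persist both objects, using joblib (which the scikit-learn documentation recommends for estimators); the filenames are just placeholders:

import joblib

# persist the fitted vectorizer and the fitted LDA model
joblib.dump(tfidf_vectorizer, 'tfidf_vectorizer.joblib')
joblib.dump(lda, 'lda_model.joblib')

# later, load them back for the experiment
tfidf_vectorizer = joblib.load('tfidf_vectorizer.joblib')
lda = joblib.load('lda_model.joblib')

Given a new comment, I vectorize it with the same (loaded) vectorizer and transform it with LDA: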
comment = ['the car is blue']
x = tfidf_vectorizer.transform(comment)  # vectorize with the saved TF-IDF model
y = lda.transform(x)
print("this is the prediction", y)
And I got:
this is the prediction [[ 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
0.03333333 0.03333333 0.03333333 0.03333333 0.59419197 0.03333333
0.03333333 0.03333333 0.03333333 0.03333333 0.03333333 0.03333333
0.03333333 0.03333333 0.03333333 0.86124492 0.03333333 0.03333333]]
I don't understand this vector. I did some research, and while I am not sure, I believe it is composed of the probabilities of the comment belonging to each of the n_topics (i.e. 30) topics I used; in that case, my new comment would most likely belong to the topic with the highest component. But this is not very direct, so my main question is: do I need to write a method that returns the index of the highest component of this transformation in order to classify a vector, or does LDA have a method that automatically gives the topic number? Thanks in advance for the support.
First, you chose to look for a number of topics equal to n_topics (= 30). The prediction vector you got is a (30,)-shaped array; each component represents the probability that the comment belongs to the i-th topic.
Remember that LDA is not exclusive: a document can belong to more than one topic. For example, here I can say that your comment belongs to two different topics, with probabilities 0.86 and 0.59.
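To answer the main question: scikit-learn's LatentDirichletAllocation has no method that returns the topic index directly, so you do need to take the index of the highest component yourself. A minimal sketch with NumPy (the 0.1 threshold below is only an illustration):

import numpy as np

# index of the single most likely topic for each document
dominant_topic = np.argmax(y, axis=1)
print("most likely topic:", dominant_topic[0])

# or keep every topic whose weight exceeds a chosen threshold,
# since a comment can be associated with several topics
likely_topics = np.where(y[0] > 0.1)[0]
print("topics above threshold:", likely_topics)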