I'm running an LDA model through gensim. To my understanding, the closer the u_mass coherence score is to zero, the more interpretable the resulting topics are. My u_mass coherence trend looks like this [1]. Since there's a noticeable dip in the bound [2] between 10 and 15 topics, I explored the top words per topic for those same numbers of topics. The topics that come up for 10 to 15 topics are meaningful and interpretable, but I can't understand why my perplexity trend [3] is the exact opposite: according to Latent Dirichlet Allocation by Blei, Ng & Jordan, perplexity should decrease monotonically as the number of topics increases [4]. The C_v score trend also seems off [5].

For preprocessing I perform lemmatization and stopword removal. Additionally, I'm looking at the TF-IDF scores to pick the maximum number of unique words that make up the corpus. Could the opposite perplexity trend be due to a lack of parameter tuning? I would be grateful for any comments on the trends, as well as on how to tune alpha and eta.
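For alpha and eta specifically, is letting gensim learn the priors from the data a reasonable starting point? Something along these lines is what I had in mind (just a sketch reusing the corpus and dictionary from the code below; the num_topics value is a placeholder, and as far as I know alpha='auto' is supported by LdaModel but not by LdaMulticore):

# Sketch only: let gensim learn asymmetric priors instead of the default symmetric ones
lda_auto = gensim.models.LdaModel(
    corpus=train_corpus,
    id2word=id2word,
    num_topics=12,      # placeholder value
    alpha='auto',       # learn an asymmetric document-topic prior from the data
    eta='auto',         # learn an asymmetric topic-word prior from the data
    passes=10,
    iterations=400,
)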
Below is my code:
import gensim
import numpy as np
from gensim import corpora
from gensim.models import CoherenceModel
from sklearn.model_selection import train_test_split

# cleaned_narrative_list and tf_vectorizer (a fitted scikit-learn vectorizer) come from
# the preprocessing step (lemmatization + stopword removal) described above
tokenized_narratives = [text.split() for text in cleaned_narrative_list]
# Take 2800 terms from the fitted vectorizer's vocabulary
# (get_feature_names_out returns them in column order)
top_2800_words = set(tf_vectorizer.get_feature_names_out()[:2800])

# Filter tokenized narratives using those 2800 terms
filtered_tokenized_narratives = [[token for token in text if token in top_2800_words] for text in tokenized_narratives]
min_topics = 2
max_topics = 30
step_size = 2
num_topics_range = range(min_topics, max_topics + 1, step_size)
id2word = corpora.Dictionary(filtered_tokenized_narratives)
corpus = [id2word.doc2bow(text) for text in filtered_tokenized_narratives]
train_corpus, test_corpus = train_test_split(corpus, test_size=0.2)
# Collect metrics across the range of topic counts (visualized later)
umass_coherence_values = []
cv_coherence_values = []
bound_values = []
perplexity_values = []
for num_topics in num_topics_range:
    lda_model = gensim.models.LdaModel(
        corpus=train_corpus,
        id2word=id2word,
        num_topics=num_topics,
        iterations=400,
        chunksize=100,
        passes=10,
        per_word_topics=True,
    )

    # Per-word bound on the held-out set (log_perplexity returns the bound, not the perplexity)
    bound = lda_model.log_perplexity(test_corpus)
    bound_values.append(bound)
    print("bound =", bound)

    # Perplexity derived from the bound
    perplexity = np.exp(-bound)
    perplexity_values.append(perplexity)
    print("perplexity =", perplexity)

    # UMass coherence
    coherence_model_umass = CoherenceModel(model=lda_model, corpus=corpus, dictionary=id2word, coherence='u_mass')
    umass_coherence = coherence_model_umass.get_coherence()
    umass_coherence_values.append(umass_coherence)
    print("u_mass coherence =", umass_coherence)

    # C_v coherence
    coherence_model_cv = CoherenceModel(model=lda_model, texts=filtered_tokenized_narratives, dictionary=id2word, coherence='c_v')
    cv_coherence = coherence_model_cv.get_coherence()
    cv_coherence_values.append(cv_coherence)
    print("c_v coherence =", cv_coherence)
    print("num_topics =", num_topics)
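In case it matters, the trends I'm referring to above are just these lists plotted against num_topics_range, roughly along these lines (matplotlib, simplified):

import matplotlib.pyplot as plt

x = list(num_topics_range)
for values, label in [
    (umass_coherence_values, "u_mass coherence"),
    (cv_coherence_values, "c_v coherence"),
    (bound_values, "per-word bound"),
    (perplexity_values, "perplexity"),
]:
    plt.figure()
    plt.plot(x, values, marker="o")   # one line plot per metric
    plt.xlabel("num_topics")
    plt.ylabel(label)
plt.show()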
gensim.models.LdaModel.log_perplexity doesn't output the perplexity itself but the per-word bound, so I first computed the bound and then the perplexity as perplexity = np.exp(-bound). But since my perplexity trend was the exact opposite of what I expected, I switched to the coherence measures.
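For concreteness, the conversion I'm doing is the snippet below. If I'm reading gensim's own log output correctly, it reports the perplexity estimate as 2 ** (-bound) rather than e ** (-bound), but both transforms are monotonically decreasing in the bound, so the direction of the trend shouldn't depend on which base is used.

# bound is the per-word bound returned by log_perplexity (closer to zero = better fit)
bound = lda_model.log_perplexity(test_corpus)

perplexity_e = np.exp(-bound)    # what I compute above
perplexity_2 = np.exp2(-bound)   # the estimate gensim itself prints alongside the bound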