Extract Word Saliency from Gensim LDA or pyLDAvis

I see that pyLDAvis visualizes each word's saliency under each topic.

But do we have a way to extract each word's saliency under each topic? Or how to calculate each word's saliency directly using Gensim LDA?

So finally, I want to get a pandas DataFrame in which each row represents a word, each column represents a topic, and each value is the word's saliency under the corresponding topic.

Many thanks in advance.

2 Answers

Answer (score 0)

Adding to @gojomo's reply: indeed, there is no direct way of getting the list of most salient words as proposed by Chuang et al. (2012). But there is a library named tmtoolkit that offers a way of extracting this. It provides a function called word_saliency that can give you what you are looking for. The catch is that this function expects you to provide the following items:

  • topic_word_distribution
  • doc_topic_distribution
  • doc_lengths

If you are using Gensim LDA, then providing doc_topic_distribution becomes the main challenge, as Gensim does not expose it out of the box. In that case, you can use the _extract_data method that is part of the pyLDAvis library. As this method is designed specifically for Gensim, you should already have all the parameters it requires. It yields a dictionary containing topic_word_distribution, doc_topic_distribution, and doc_lengths. However, you might want to sort tmtoolkit's output.
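To make the computation concrete, here is a numpy-only sketch of the saliency formula from Chuang et al. (2012) that, to my understanding, underlies tmtoolkit's word_saliency. The three inputs use small synthetic stand-ins for what you would extract from a real model; the variable names are mine, not tmtoolkit's:

```python
import numpy as np

# Synthetic stand-ins for the three items listed above: 3 topics, 5 words, 4 docs.
rng = np.random.default_rng(0)
topic_word = rng.dirichlet(np.ones(5), size=3)   # p(w | t), shape (n_topics, n_words)
doc_topic = rng.dirichlet(np.ones(3), size=4)    # p(t | d), shape (n_docs, n_topics)
doc_lengths = np.array([100, 250, 80, 120])

# Marginal topic probabilities P(t), weighting each document by its length.
p_topic = (doc_topic * doc_lengths[:, None]).sum(axis=0) / doc_lengths.sum()

# Marginal word probabilities P(w) = sum_t P(t) * p(w | t).
p_word = p_topic @ topic_word

# Conditional p(t | w) via Bayes' rule; each column sums to 1.
p_topic_given_word = (topic_word * p_topic[:, None]) / p_word

# saliency(w) = P(w) * sum_t p(t|w) * log(p(t|w) / P(t))   (Chuang et al. 2012)
distinctiveness = (p_topic_given_word
                   * np.log(p_topic_given_word / p_topic[:, None])).sum(axis=0)
saliency = p_word * distinctiveness

print(saliency)  # one saliency value per word in the vocabulary
```

With a real model you would substitute the arrays from pyLDAvis' _extract_data for the synthetic ones; the distinctiveness term is just the KL divergence of p(t|w) from P(t), so saliency is always non-negative.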

A word of caution about tmtoolkit: it is known to downgrade commonly used packages such as numpy and pandas, so it is highly recommended to install it in a virtual environment.

Answer (score 4)

Gensim's LdaModel does not offer out-of-the-box support for this particular 'saliency' calculation from Chuang et al. (2012).

Still, I suspect the model's .get_term_topics() and/or .get_topic_terms() methods provide the proper supporting data for implementing that calculation. In particular, one or the other of those methods might provide the p(w | t) term, but a deeper read of the paper would be required to know for sure. (I suspect the P(t) term might require a separate survey of the training data.)
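As a sketch of how the pieces might fit together into the word-by-topic DataFrame the question asks for: given a topic-word matrix (with a trained model, roughly what model.get_topics() returns) and a separately estimated P(t), each per-topic term of Chuang et al.'s saliency sum can be laid out as one column. The arrays below are synthetic placeholders, not output from a real model:

```python
import numpy as np
import pandas as pd

# Synthetic placeholders: with a trained Gensim model, topic_word would come
# from model.get_topics() and p_topic from a survey of the training data.
rng = np.random.default_rng(1)
vocab = ["apple", "bank", "cell", "data", "euro"]
topic_word = rng.dirichlet(np.ones(len(vocab)), size=3)  # p(w | t)
p_topic = np.array([0.5, 0.3, 0.2])                      # P(t)

p_word = p_topic @ topic_word                            # P(w)
p_t_given_w = topic_word * p_topic[:, None] / p_word     # p(t | w)

# Per-topic term of saliency(w) = P(w) * sum_t p(t|w) * log(p(t|w) / P(t)).
# Individual terms can be negative; each row sums to the (non-negative)
# overall saliency of that word.
contrib = p_word * p_t_given_w * np.log(p_t_given_w / p_topic[:, None])

df = pd.DataFrame(contrib.T, index=vocab,
                  columns=[f"topic_{t}" for t in range(len(p_topic))])
df["saliency"] = df.sum(axis=1)
print(df)
```

Whether these per-topic terms are exactly what the pyLDAvis bars show would need checking against the paper; the row sums, at least, match the overall saliency definition.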

From the class docs:

https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_term_topics

Returns: The relevant topics, represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_topic_terms

Returns: Word ID / probability pairs for the most relevant words generated by the topic.
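Since get_topic_terms returns (word ID, probability) pairs per topic, those pairs can at least be assembled into the word-by-topic DataFrame shape the question asks for. The pairs and the id-to-word mapping below are made up for illustration, and note these values are p(w | t), not yet saliency; the Chuang et al. formula would still have to be applied on top:

```python
import pandas as pd

# Made-up stand-in for calling model.get_topic_terms(t, topn=...) per topic:
# each entry is a list of (word_id, probability) pairs.
topic_terms = {
    0: [(0, 0.40), (1, 0.25), (2, 0.10)],
    1: [(1, 0.35), (3, 0.30), (0, 0.05)],
}
id2word = {0: "apple", 1: "bank", 2: "cell", 3: "data"}  # dictionary stand-in

# Rows: words; columns: topics; missing (word, topic) pairs filled with 0.
df = pd.DataFrame({
    f"topic_{t}": {id2word[w]: p for w, p in pairs}
    for t, pairs in topic_terms.items()
}).fillna(0.0)
print(df)
```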

I hadn't come across this particular 'saliency' calculation before, but if it is popular among LDA users, or of potential general use, and you figure out how to calculate it, it would likely be a welcome contribution to the Gensim project, especially if it can be a simple extra convenience method on LdaModel.