I see that pyLDAvis visualizes each word's saliency under each topic.
But is there a way to extract each word's saliency under each topic, or to calculate it directly from a Gensim LDA model?
Ultimately, I want a pandas DataFrame in which each row represents a word, each column represents a topic, and each cell holds that word's saliency under the corresponding topic.
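For example, something with this layout (placeholder words and topic labels only, no real values):

```python
import pandas as pd

# Purely illustrative layout of the frame I am after: rows are vocabulary
# terms, columns are topics, and each cell would hold that word's saliency
# under that topic (values left empty here).
words = ["apple", "banana", "market"]          # placeholder vocabulary
topics = [f"topic_{k}" for k in range(3)]      # placeholder topic labels
wanted = pd.DataFrame(index=words, columns=topics, dtype=float)
print(wanted)
```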
Many thanks in advance.
Adding to @gojomo's reply: yes, there is no direct way of getting the list of most salient words as proposed by Chuang et al. (2012). However, there is a library named TMToolkit that offers a way of extracting this: it provides a method called word_saliency that can give you what you are looking for. The catch is that this method expects you to provide the following items: the topic-word distribution, the document-topic distribution, and the document lengths.
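As a minimal sketch of the call, assuming word_saliency takes those three arrays in that order (here I fill them with random, row-normalised toy distributions just so the snippet runs; only the shapes matter):

```python
import numpy as np
from tmtoolkit.topicmod.model_stats import word_saliency

# Toy stand-ins for a trained model, used only so the snippet is runnable.
rng = np.random.default_rng(0)
n_topics, n_vocab, n_docs = 5, 100, 20

topic_word_distrib = rng.random((n_topics, n_vocab))
topic_word_distrib /= topic_word_distrib.sum(axis=1, keepdims=True)  # rows sum to 1

doc_topic_distrib = rng.random((n_docs, n_topics))
doc_topic_distrib /= doc_topic_distrib.sum(axis=1, keepdims=True)    # rows sum to 1

doc_lengths = rng.integers(50, 200, size=n_docs)                     # tokens per document

saliency = word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths)
print(saliency.shape)  # one saliency value per vocabulary term -> (100,)
```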
If you are using Gensim LDA, providing the document-topic distribution is the main challenge, as Gensim does not expose it out of the box. In that case, you can use the _extract_data method that is part of the pyLDAvis library. As this method is designed specifically for Gensim, you should already have all the parameters it requires. It yields a dictionary containing the topic-word distribution, the document-topic distribution, and the document lengths. However, you might still want to sort TMToolkit's output yourself.
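Putting the two together, a rough sketch might look like the following. Note that _extract_data is a private pyLDAvis helper, so its import path (pyLDAvis.gensim_models in recent versions) and the dictionary keys I use here ('topic_term_dists', 'doc_topic_dists', 'doc_lengths', 'vocab') may differ between versions; treat this as a starting point, not a guaranteed recipe. The tiny toy corpus is only there to make the snippet self-contained:

```python
import numpy as np
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from pyLDAvis.gensim_models import _extract_data   # private helper, pyLDAvis >= 3.2
from tmtoolkit.topicmod.model_stats import word_saliency

# --- tiny toy corpus so the sketch is self-contained ------------------------
texts = [
    ["cat", "dog", "pet", "dog"],
    ["stock", "market", "trade", "stock"],
    ["dog", "pet", "vet"],
    ["market", "price", "trade"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                     random_state=0, passes=10)

# --- pull the distributions pyLDAvis computes for Gensim models --------------
extracted = _extract_data(lda_model, corpus, dictionary)

saliency = word_saliency(
    np.asarray(extracted["topic_term_dists"]),   # shape (n_topics, n_vocab)
    np.asarray(extracted["doc_topic_dists"]),    # shape (n_docs, n_topics)
    np.asarray(extracted["doc_lengths"]),        # tokens per document
)

# One saliency score per vocabulary term, sorted descending.
saliency_series = (
    pd.Series(saliency, index=extracted["vocab"]).sort_values(ascending=False)
)
print(saliency_series)
```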
A word of caution about TMToolkit: it is notorious for downgrading common packages such as NumPy and pandas, so it is highly recommended to install it in a separate virtual environment.