Number of keywords in text cluster


I'm working with a decently sized data set and want to identify how many topics (k) make sense. I used both NMF and LDA (the sklearn implementations), but the key question is: what is a suitable measure of success? Visually, many of my topics have only a few high-weight keywords (the other weights are near zero), while a few topics show a more bell-shaped distribution of weights. What is the target: a topic with a few high-weight words and the rest low (a spike), or a bell shape, with a gradual reduction of weights over a large number of keywords? This is what NMF gives me:

[NMF: sorted keyword weights per topic]

versus the LDA method, which gives mostly a bell shape (not a curve, obviously, since the weights are discrete):

[LDA: sorted keyword weights per topic]
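For context, this is roughly how I extract the sorted per-topic weight distributions (following the scikit-learn NMF/LDA example linked below; the dataset and parameter values here are just placeholders):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# As in the sklearn example: NMF on tf-idf, LDA on raw term counts
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english").fit_transform(docs)
counts = CountVectorizer(max_df=0.95, min_df=2, stop_words="english").fit_transform(docs)

nmf = NMF(n_components=10, random_state=0).fit(tfidf)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

# Row-normalize and sort each topic's keyword weights: a sharp drop-off
# is the "spike" shape, a slow decay the bell-like shape described above.
for name, H in (("NMF", nmf.components_), ("LDA", lda.components_)):
    H = H / H.sum(axis=1, keepdims=True)
    for k, row in enumerate(H):
        print(name, "topic", k, np.round(np.sort(row)[::-1][:15], 3))
```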

I also use a weighted Jaccard similarity (a set overlap of the keywords, weighted by their topic weights) to compare topics; there are no doubt better methods, but this one is fairly intuitive.
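Roughly, this is what I mean (a minimal sketch; one common definition of weighted Jaccard is the sum of elementwise minima over the sum of elementwise maxima, and the helper names here are mine):

```python
import numpy as np

def weighted_jaccard(x, y):
    """Weighted Jaccard similarity of two nonnegative weight vectors:
    sum(min(x_i, y_i)) / sum(max(x_i, y_i)), in [0, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def pairwise_topic_overlap(H):
    """Pairwise weighted Jaccard between topics, with H the
    topics-x-words matrix (e.g. model.components_ in sklearn)."""
    n = H.shape[0]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = weighted_jaccard(H[i], H[j])
    return sim
```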

Your thoughts on this?

best,

Andreas

The code I'm building on is at https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html?highlight=document%20word%20matrix

1 Answer


There are a few commonly used evaluation metrics that can give a good intuition of the quality of your topic sets in general, as well as of your choice of k (the number of topics). A recent paper by Dieng et al. (Topic Modeling in Embedding Spaces) uses two of the best measures: coherence and diversity. In conjunction, coherence and diversity give an idea of how well-separated the topics are. Coherence measures how similar the words within each topic are, using their co-occurrences in documents, while diversity measures how distinct the topics are from one another, based on the overlap of their top words. If you score low on diversity, the same words are being shared across topics, and you might want to decrease k.
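In case it helps, here is a minimal sketch of both measures under common definitions (diversity as in Dieng et al., coherence as average NPMI over top-word pairs); `topic_word` is the topics-x-words matrix (e.g. `model.components_` in sklearn) and `doc_word` a binary document-term matrix, both names illustrative:

```python
import numpy as np

def topic_diversity(topic_word, top_n=25):
    """Diversity (Dieng et al.): fraction of unique words among the
    top_n words of all topics. 1.0 means no overlap between topics."""
    top = np.argsort(topic_word, axis=1)[:, -top_n:]
    return len(np.unique(top)) / (topic_word.shape[0] * top_n)

def topic_coherence(topic_word, doc_word, top_n=10, eps=1e-12):
    """Average NPMI over top-word pairs of each topic, estimated from
    document co-occurrence. doc_word is a binary docs-x-words array."""
    p_w = doc_word.mean(axis=0)                    # P(word appears in a doc)
    scores = []
    for top in np.argsort(topic_word, axis=1)[:, -top_n:]:
        for a, i in enumerate(top):
            for j in top[a + 1:]:
                p_ij = (doc_word[:, i] * doc_word[:, j]).mean()
                pmi = np.log((p_ij + eps) / (p_w[i] * p_w[j] + eps))
                scores.append(pmi / -np.log(p_ij + eps))   # normalize to [-1, 1]
    return float(np.mean(scores))

# e.g. with fitted models and a CountVectorizer matrix `counts`:
# doc_word = (counts.toarray() > 0).astype(float)
# print(topic_diversity(lda.components_),
#       topic_coherence(lda.components_, doc_word))
```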

There's really no single "best way to choose k," but these kinds of measures can help you decide whether to increase or decrease it.