LDA Topic Modeling Producing Identical/Empty Topics


I am topic modeling two large text documents (around 500-750 KB each) and asking for ten topics, but I keep getting the same two topics repeated across the output. Could this be caused by the small number of documents? Or should I change the alpha/beta parameters?

Here is the code for the model part:


```python
import gensim

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=2,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
```
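On the "small number of documents" question: LDA estimates topics from document-level co-occurrence, so two documents give it very little to distinguish topics with. One common workaround is splitting each large file into paragraph-level pseudo-documents before building the corpus. A minimal sketch (the inline sample strings stand in for the two large files, which are not part of the original post):

```python
def split_into_paragraphs(text):
    """Split raw text on blank lines, dropping empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Inline text standing in for the two large files (hypothetical content):
raw_docs = [
    "First paragraph of document one.\n\nSecond paragraph.",
    "First paragraph of document two.\n\nAnother paragraph.",
]

pseudo_docs = []
for raw in raw_docs:
    pseudo_docs.extend(split_into_paragraphs(raw))

# `pseudo_docs` now holds many short pseudo-documents instead of two
# large ones; tokenize these and build `id2word`/`corpus` from them.
print(len(pseudo_docs))
```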

Here are the topics:

[(0,
  '0.005*"city" + 0.004*"police" + 0.003*"people" + 0.003*"thank" + '
  '0.003*"know" + 0.003*"want" + 0.002*"go" + 0.002*"say" + 0.002*"time" + '
  '0.002*"cop"'),
 (1,
  '0.001*"people" + 0.001*"cop" + 0.001*"city" + 0.001*"want" + 0.001*"go" + '
  '0.001*"police" + 0.001*"thank" + 0.001*"time" + 0.001*"know" + 0.001*"say"'),
 (2,
  '0.001*"people" + 0.001*"police" + 0.001*"city" + 0.001*"thank" + '
  '0.001*"want" + 0.001*"cop" + 0.001*"go" + 0.001*"know" + 0.001*"say" + '
  '0.001*"make"'),
 (3,
  '0.002*"city" + 0.002*"people" + 0.001*"know" + 0.001*"want" + '
  '0.001*"police" + 0.001*"go" + 0.001*"say" + 0.001*"vote" + 0.001*"time" + '
  '0.001*"cop"'),
 (4,
  '0.001*"city" + 0.001*"police" + 0.001*"cop" + 0.001*"people" + 0.001*"go" + '
  '0.001*"thank" + 0.001*"want" + 0.001*"vote" + 0.001*"make" + 0.001*"time"'),
 (5,
  '0.020*"city" + 0.014*"people" + 0.013*"police" + 0.011*"cop" + 0.010*"go" + '
  '0.010*"thank" + 0.009*"want" + 0.009*"know" + 0.008*"say" + 0.006*"time"'),
 (6,
  '0.001*"city" + 0.001*"go" + 0.001*"know" + 0.001*"people" + 0.001*"police" '
  '+ 0.001*"cop" + 0.001*"want" + 0.001*"vote" + 0.000*"say" + 0.000*"time"'),
 (7,
  '0.002*"city" + 0.001*"people" + 0.001*"police" + 0.001*"thank" + 0.001*"go" '
  '+ 0.001*"want" + 0.001*"know" + 0.001*"cop" + 0.001*"vote" + 0.001*"say"'),
 (8,
  '0.003*"city" + 0.003*"people" + 0.003*"police" + 0.002*"thank" + 0.002*"go" '
  '+ 0.002*"know" + 0.002*"vote" + 0.002*"want" + 0.002*"say" + 0.002*"time"'),
 (9,
  '0.017*"people" + 0.014*"city" + 0.012*"police" + 0.010*"go" + 0.010*"thank" '
  '+ 0.010*"want" + 0.009*"know" + 0.009*"say" + 0.009*"vote" + 0.008*"time"')]

The visualization:

```python
# Visualize the topics
import pyLDAvis
import pyLDAvis.gensim

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
```

[screenshot of the pyLDAvis visualization]

I have tried changing the parameters somewhat, but haven't seen different results. It's also hard to find what the normal ranges for the alpha and beta parameters are.
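For reference, gensim's `LdaModel` accepts `alpha` and `eta` (gensim's name for beta) either as the strings `'symmetric'`, `'asymmetric'`, or `'auto'`, or as explicit numeric priors; values below 1.0 encourage sparser document-topic and topic-word distributions. A sketch of what explicit priors might look like (the specific numbers are illustrative assumptions, not tuned recommendations):

```python
# Illustrative hyperparameters for gensim's LdaModel. Values below 1.0
# yield sparser distributions; these numbers are examples, not
# recommendations from the original post.
lda_params = {
    "num_topics": 10,
    "random_state": 100,
    "passes": 10,
    "chunksize": 100,   # larger than 2, so each update sees more documents
    "alpha": 0.05,      # low alpha -> few topics per document
    "eta": 0.01,        # low eta -> few dominant words per topic
}

# These would be passed alongside the corpus, e.g.:
# lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
#                                             id2word=id2word,
#                                             **lda_params)
print(lda_params["alpha"], lda_params["eta"])
```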
