Why does filter_extremes on the gensim Dictionary make it impossible for LdaMulticore to converge?


I am doing topic modeling with Latent Dirichlet Allocation (LDA) using the Gensim library on a dataset of conspiracy-theory documents. I've performed the preprocessing and created the id2word dictionary and the Term Document Frequency (TDF) corpus. However, when I use the filter_extremes function to remove words that are present in most documents, the LDA model runs indefinitely, whereas without this filtering it converges in about a minute. Why is this happening? I would expect that not using filter_extremes would slow things down, since without it the input to the model is much larger (the vocabulary size is 20000, while after calling filter_extremes it drops to 2000).

This is the relevant part of the code (clean_conspiracy.lemmatized is the preprocessed dataset):

import gensim
from gensim import corpora

# Dictionary mapping each unique token to an integer id
id2word = corpora.Dictionary(clean_conspiracy.lemmatized)

collection = clean_conspiracy.lemmatized
# Term Document Frequency: bag-of-words representation of each document
conspiracy_corpus = [id2word.doc2bow(doc) for doc in collection]

id2word.filter_extremes(no_below=15, no_above=0.55, keep_n=3000)

lda_model = gensim.models.LdaMulticore(
    corpus=conspiracy_corpus,
    id2word=id2word,
    num_topics=3,  # Adjust the number of topics as needed
    random_state=100,
    chunksize=1,
    passes=5,
    per_word_topics=True
)
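
For context, the vocabulary sizes quoted above can be checked with len() on the Dictionary, since filter_extremes modifies it in place. A minimal sketch (separate from the script above, numbers are the ones I observed):

print("vocab size before filtering:", len(id2word))  # ~20000 before filtering
id2word.filter_extremes(no_below=15, no_above=0.55, keep_n=3000)
print("vocab size after filtering:", len(id2word))   # ~2000 after filtering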
