Why does filter_extremes on the gensim Dictionary make it impossible for LdaMulticore to converge?


I am doing topic modeling with Latent Dirichlet Allocation (LDA) using the Gensim library on a dataset of conspiracy-theory documents. I've performed the preprocessing and created the id2word dictionary and the Term Document Frequency (TDF) corpus. However, when I use the filter_extremes function to remove words that are present in most documents, the LDA model runs indefinitely, whereas without this filtering it converges in about a minute. Why is this happening? I would expect that not using filter_extremes would slow things down, since without it the input to the model is much larger (the vocabulary size is 20000, while after calling filter_extremes it drops to 2000).

This is the relevant part of the code (clean_conspiracy.lemmatized is the preprocessed dataset):

import gensim
from gensim import corpora

# Dictionary mapping each unique token to an integer id
id2word = corpora.Dictionary(clean_conspiracy.lemmatized)

collection = clean_conspiracy.lemmatized
# Term Document Frequency: bag-of-words representation of each document
conspiracy_corpus = [id2word.doc2bow(doc) for doc in collection]

id2word.filter_extremes(no_below=15, no_above=0.55, keep_n=3000)

lda_model = gensim.models.LdaMulticore(
    corpus=conspiracy_corpus,
    id2word=id2word,
    num_topics=3,  # Adjust the number of topics as needed
    random_state=100,
    chunksize=1,
    passes=5,
    per_word_topics=True
)
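
For context, the vocabulary sizes quoted above can be checked with len() on the Dictionary, since filter_extremes modifies it in place. A minimal sketch (separate from the script above, numbers are the ones I observed):

print("vocab size before filtering:", len(id2word))  # ~20000 before filtering
id2word.filter_extremes(no_below=15, no_above=0.55, keep_n=3000)
print("vocab size after filtering:", len(id2word))   # ~2000 after filtering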
