I tried running Latent Dirichlet Allocation (LDA) on a very large dataset using both the plain LdaModel and LdaMulticore. After two days of execution, I get the following error: "An attempt has been made to start a new process before the current process has finished its bootstrapping phase."
import gensim
from gensim.models.coherencemodel import CoherenceModel

print('started')
Lda = gensim.models.ldamodel.LdaModel
ldamodel = Lda(corpus, num_topics=50, id2word=id2word, passes=40, iterations=100, chunksize=10000, eval_every=None, random_state=100)
print('lda completed')
coherencemodel = CoherenceModel(model=ldamodel, texts=data_ready, dictionary=id2word, coherence='c_v')
print('coherence completed')
coherence_lda = coherencemodel.get_coherence()  # the error is raised here
perplexity_values = ldamodel.log_perplexity(corpus)
All three print statements ran, and the error occurs when the coherence value is assigned to the variable, that is, in the get_coherence() call.
Also, the whole process takes a long time, as the document has around 2,400,000 lines.
I learned from another post that the error can be resolved by using if __name__ == '__main__':
I am new to Python and not sure how to use it in my case, as all the data loading and preprocessing is done in the same file, one step after another.
Any help would be appreciated.
Thanks in advance.
The error is caused by the get_coherence() function, which starts worker processes. You need to wrap the whole code in a main() function and add the if __name__ == "__main__": guard; see: https://github.com/RaRe-Technologies/gensim/issues/2291#issuecomment-447269158
(You can also try it first on some very simple text sample, like this one: https://radimrehurek.com/gensim/models/ldamodel.html)
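The structure can be tried without gensim first. Below is a minimal sketch that uses only the standard library's multiprocessing module as a stand-in for the worker processes that get_coherence() starts. The names load_documents, count_tokens and main are hypothetical placeholders, not part of gensim or of the question's code.

```python
import multiprocessing


def load_documents():
    # Hypothetical stand-in for the data loading and preprocessing steps.
    return [["topic", "model", "text"], ["another", "short", "document"], ["lda"]]


def count_tokens(document):
    # Hypothetical stand-in for the work done inside each worker process.
    return len(document)


def main():
    # Everything that used to run at the top level of the script goes here.
    documents = load_documents()
    print('started')
    # Worker processes are started only from inside main().
    with multiprocessing.Pool(processes=2) as pool:
        counts = pool.map(count_tokens, documents)
    print('workers completed')
    return sum(counts)


if __name__ == '__main__':
    # True only when the file is executed directly; a worker process that
    # re-imports this file skips the call, so main() is not run again.
    total = main()
    print(total)
```

In the question's script, the body of main() would presumably hold the preprocessing, the Lda(...) call and the CoherenceModel steps, unchanged apart from the extra indentation, while the import statements stay at the top of the file.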