I'm working on an STM model (topic modelling) and I'd like to evaluate and verify the model, but I'm not sure how to do it. My code is:
Corpus.STM <- readCorpus(dtm, type = "slam")
Model choice:
BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
str(BestM1.)
plot.searchK(BestM1.)
plot.searchK(BestM2.)
plot.searchK(BestM3.)
#27 seems to be a good choice
#Heldout
set.seed(1)
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1)
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100 )
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing)
heldout.evaluation
#evaluation heldout
labelTopics(stm.mod1)
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25)
cloud(stm.mod1, topic=5)
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)
I'm not sure how to interpret the output of "eval.heldout". Additionally, I want to make sure that the model doesn't overfit, but I'm not sure how to check that.
eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is heldout.evaluation$expected.heldout, which is the average of the per-document held-out log-likelihood values. Unfortunately, there is no unambiguous measure of whether or not the model is "overfit". Your plot.searchK() calls plot the held-out log-likelihood across the different values of K, and if that number decreases as K goes up, one explanation is overfitting.
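If you prefer numbers to reading the plot, something like the following sketch could tabulate the same information. The helper heldout_by_k() is a made-up name, not part of stm. It assumes the searchK() result has a $results element with columns K and heldout; some stm versions store those columns as lists, which is why unlist() is used.

```r
# Hypothetical helper (not part of stm): tabulate held-out log-likelihood by K.
# `results` is assumed to be a data frame with columns `K` and `heldout`,
# like the `$results` element of a searchK() object.
heldout_by_k <- function(results) {
  out <- data.frame(
    K = as.numeric(unlist(results$K)),
    heldout = as.numeric(unlist(results$heldout))
  )
  out <- out[order(out$K), ]
  # Change in held-out log-likelihood relative to the next smaller K.
  # A run of negative values at larger K is consistent with overfitting,
  # but it is not proof of it.
  out$change <- c(NA, diff(out$heldout))
  # Flag the K with the highest held-out log-likelihood.
  out$best <- out$heldout == max(out$heldout)
  rownames(out) <- NULL
  out
}

# Usage with the objects from the question (commented out because it
# needs the fitted objects):
# heldout_by_k(BestM3.$results)
# heldout.evaluation$expected.heldout      # average over documents
# summary(heldout.evaluation$doc.heldout)  # per-document values
```

The "best" flag only marks the highest held-out value among the K you tried. It is one input to the choice of K alongside the other searchK diagnostics, not a rule.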
Sorry to not have a clearer answer but unfortunately there are no hard and fast rules here.