How to calculate the coherence score for an LDA model?


I want to use coherence and perplexity to choose the best K (the number of topics) for topic modeling. Here is a sample of my dataset:

doc_id <- c(1:20)
date <- c(1901:1920)
text <- c("contribut theori microscop microscop percept", "illumin apparatus microscop", "stephenson system homogen immers microscop object", 
"relat apertur power microscop continu", "note proper definit amplifi power len lenssystem", "note proper definit amplifi power len lenssystem",
 "mode vision object wide apertur", "treatis theori microscop", "theori imag format microscop", "maintain oscil pendulum tune fork tube amplifi" ,
"contribut theori microscop microscop percept", "illumin apparatus microscop", "stephenson system homogen immers microscop object", "relat apertur power microscop continu", 
"note proper definit amplifi power len lenssystem", "note proper definit amplifi power len lenssystem", "mode vision object wide apertur", "treatis theori microscop", "theori imag format microscop", "maintain oscil pendulum tune fork tube amplifi" )

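# build the data frame directly so that date is not coerced to character by cbind()
data <- data.frame(doc_id = as.character(doc_id), date = date, text = text, stringsAsFactors = FALSE)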

My code is:

library(tm)           # Corpus, tm_map, DocumentTermMatrix
library(topicmodels)  # LDA

corpus <- Corpus(DataframeSource(data))
# class(corpus) is "SimpleCorpus" "Corpus"

# Preprocessing chain
processedCorpus <- tm_map(corpus, content_transformer(tolower))
# continue from processedCorpus (not corpus) so the lowercasing step is kept
processedCorpus <- tm_map(processedCorpus, removeWords, stopwords("english"))
processedCorpus <- tm_map(processedCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
processedCorpus <- tm_map(processedCorpus, removeNumbers)
processedCorpus <- tm_map(processedCorpus, stemDocument, language = "en")
processedCorpus <- tm_map(processedCorpus, stripWhitespace) 

minimumFrequency <- 2
DTM <- DocumentTermMatrix(processedCorpus, 
                          control = list(bounds = list(global = c(minimumFrequency, Inf))))

# have a look at the number of documents and terms in the matrix
dim(DTM) 

# due to vocabulary pruning, we have empty rows in our DTM
# LDA does not like this. So we remove those docs from the DTM and the metadata
sel_idx <- slam::row_sums(DTM) > 0 
DTM <- DTM[sel_idx, ]
data <- data[sel_idx, ]

Instead of using a specific K in this LDA model:

topicModel <- LDA(DTM, K, method = "Gibbs", control = list(iter = 500, verbose = 25))

Can I write a function that fits LDA models for a range of K values (from 1 to 20?) and reports the coherence and perplexity for each K?
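One way to approach this, sketched under some assumptions: fit one Gibbs model per K with topicmodels, then record perplexity and a coherence score for each. The helper names `umass_coherence` and `evaluate_k` are made up for illustration. The coherence here is a hand-rolled UMass-style score (log ratios of document co-occurrence counts over each topic's top terms), which may not be the coherence measure intended. `LDA()` in topicmodels needs at least two topics, so a sweep would start at K = 2 rather than 1.

```r
# UMass-style coherence for one topic (hypothetical helper).
# top_terms: character vector, ordered from most to least probable.
# dtm_matrix: dense documents x terms count matrix with term column names.
umass_coherence <- function(top_terms, dtm_matrix) {
  present <- dtm_matrix[, top_terms, drop = FALSE] > 0
  # co_docs[m, l] = number of documents containing both terms;
  # the diagonal holds each term's document frequency
  co_docs <- crossprod(present * 1)
  score <- 0
  for (m in seq_along(top_terms)[-1]) {
    for (l in seq_len(m - 1)) {
      score <- score + log((co_docs[m, l] + 1) / co_docs[l, l])
    }
  }
  score
}

# Fit one Gibbs LDA per K and collect perplexity and mean topic coherence
# (hypothetical helper; assumes the topicmodels package is installed and
# that dtm is a tm DocumentTermMatrix without empty rows).
evaluate_k <- function(dtm, k_values, top_n = 5, iter = 500, seed = 1) {
  dtm_matrix <- as.matrix(dtm)  # dense copy; fine for small corpora only
  rows <- lapply(k_values, function(k) {
    model <- topicmodels::LDA(dtm, k = k, method = "Gibbs",
                              control = list(iter = iter, seed = seed))
    phi <- topicmodels::posterior(model)$terms  # topics x terms probabilities
    coherence <- apply(phi, 1, function(p) {
      top <- names(sort(p, decreasing = TRUE))[seq_len(min(top_n, length(p)))]
      umass_coherence(top, dtm_matrix)
    })
    data.frame(k = k,
               # Gibbs models need the data passed explicitly
               perplexity = topicmodels::perplexity(model, newdata = dtm),
               coherence = mean(coherence))
  })
  do.call(rbind, rows)
}

# Usage with the DTM built above:
# results <- evaluate_k(DTM, k_values = 2:10)
```

The result would have one row per K; higher (less negative) coherence and lower perplexity are usually preferred. Perplexity computed on the same documents used for fitting tends to keep improving as K grows, so held-out documents give a fairer comparison when the corpus is large enough.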

In addition, the textmineR function for fitting an LDA model does not work with my DTM. I get:

FitLdaModel(dtm = DTM, k = k, iterations = 500) 

Error in FitLdaModel(dtm = DTM, k = k, iterations = 500) : conversion failed. Please pass an object of class dgCMatrix for dtm
2. stop("conversion failed. Please pass an object of class dgCMatrix for dtm")
1. FitLdaModel(dtm = DTM, k = k, iterations = 500)
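The error suggests that `FitLdaModel()` wants a `dgCMatrix` from the Matrix package, while tm's `DocumentTermMatrix` is a slam triplet matrix. A possible workaround, assuming the DTM exposes the usual triplet fields (`i`, `j`, `v`, `nrow`, `ncol`, `dimnames`), is to rebuild it with `Matrix::sparseMatrix()`. The helper name `dtm_to_dgCMatrix` is hypothetical.

```r
# Convert a tm DocumentTermMatrix (slam triplet format) into a dgCMatrix.
# Hypothetical helper; assumes the Matrix package is installed.
dtm_to_dgCMatrix <- function(dtm) {
  Matrix::sparseMatrix(
    i = dtm$i,               # row (document) index of each non-zero cell
    j = dtm$j,               # column (term) index of each non-zero cell
    x = as.numeric(dtm$v),   # counts, stored as doubles
    dims = c(dtm$nrow, dtm$ncol),
    dimnames = dtm$dimnames  # keeps the document ids and terms
  )
}

# Usage with the DTM from the question (assumes textmineR is installed):
# dtm_sparse <- dtm_to_dgCMatrix(DTM)
# model <- textmineR::FitLdaModel(dtm = dtm_sparse, k = 5, iterations = 500)
```

Keeping the dimnames matters because the term names become the column names of the fitted topic-term matrix.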
