I am a newbie at this, so I apologize if I am asking the obvious here. I ran a bi-term topic modeling algorithm on short text data to discover the topics in it, and I am using the LDAvis package to visualize and understand the results. As I understand it, reading the raw data qualitatively will help me understand what the underlying topics are and what people are talking about. A keyword search over the raw text isn't helping, since keywords typically overlap several topics and selecting one keyword can return a subset that mixes data from several topics. To this end, I want to assign each tweet/document to one topic so that I can analyze them manually and work out what each topic is about, since the customer feedback data I am analyzing typically contains keywords that could indicate several things at once.
I tried taking, for each row of the scores array generated by the BTM package's predict() (shown below), the column with the maximum value as the topic most likely associated with that document, and that ran without errors. However, the resulting assignments don't match what LDAvis shows. For instance, LDAvis shows the keyword "apps" as present only in topic "10", but manually selecting the subset of data containing the keyword "apps" gave two tweets/documents that were assigned to two different topics according to the scores array. Am I doing this correctly? I am currently writing the topics to a CSV file, since I am new to R and haven't figured out how to add the topic to the original data frame (named data) and then write that file to disk.
# Run bi-term topic modeling
library(BTM)
library(data.table)

set.seed(9082374)
model <- BTM(x, k = 10, alpha = 0.1, beta = 0.01, iter = 2000, trace = 100, detailed = TRUE)

# Predict the document/topic probabilities
scores <- predict(model, x)

# Assign each document to its most likely topic (highest probability in each row)
final.topic <- apply(scores, 1, which.max)
fwrite(list(final.topic), file = "topic_max.csv")
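For reference, a minimal sketch of what attaching the assignment back to the original data frame could look like (assuming data has an id column holding the same document identifiers used in x; that column name is an assumption):
# Sketch only: add each document's most likely topic to the original data
# frame `data`, assuming data$id matches the document ids used to fit the model
final.topic <- apply(scores, 1, which.max)   # names of this vector are document ids
data$topic <- final.topic[match(as.character(data$id), names(final.topic))]
fwrite(data, file = "data_with_topics.csv")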
Here is the full code I am running, in particular the LDAvis part, since I had some challenges getting LDAvis to run.
# Create the JSON input for LDAvis
library(LDAvis)

term.table1 <- table(x$text)                      # overall token frequencies
term.table1 <- term.table1[rownames(model$phi)]   # align frequencies with the model vocabulary
docsize <- table(x$id)                            # number of tokens per document
scores <- scores[names(docsize), ]                # align theta rows with doc.length

json <- createJSON(
  phi = t(model$phi),
  theta = scores,
  doc.length = as.integer(docsize),
  vocab = as.character(rownames(model$phi)),
  term.frequency = as.integer(term.table1))
serVis(json)
library(data.table)
final.topic <- apply(scores, 1, which.max)
fwrite(list(final.topic), file = "topic_max.csv")
I found the solution to my problem. Essentially, LDAvis re-numbers the topics: createJSON() sorts them in decreasing order of overall prevalence, so the topic numbers shown in the panel do not have to match the model's topic numbers.
One can use that sort order from the LDAvis visualization (it is recorded in the JSON object that createJSON() returns) to re-order the document/topic and topic/word arrays (theta and phi), so that the topic assignments derived from the scores/theta matrix match what LDAvis shows.
Here is a way to do it.
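A sketch of the idea, assuming the jsonlite package is available to parse the JSON string (LDAvis records its re-ordering in the topic.order element of the JSON created by createJSON()):
# Sketch, assuming jsonlite for parsing the JSON string returned by createJSON()
library(jsonlite)

# topic.order[i] is the model topic shown as topic i in the LDAvis panel
topic.order <- fromJSON(json)$topic.order

# Re-order the document/topic scores (theta) so that column i corresponds to
# the topic labelled i in LDAvis; the same indexing could be applied to the
# topic/word matrix (model$phi[, topic.order]) if needed
scores.ldavis <- scores[, topic.order]

# The most likely topic per document now uses the same numbering as LDAvis
final.topic <- apply(scores.ldavis, 1, which.max)
fwrite(list(doc_id = names(final.topic), topic = final.topic), file = "topic_max_ldavis.csv")
Equivalently, one could keep the scores matrix as is and translate each assignment afterwards with match(final.topic, topic.order).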