I have been trying to do topic modeling on a collection of discussion forum posts in a MOOC. I have tried basic LDA to create topics, and the topics were meaningless. So now I'm looking into seeding my topics to create better topics. I found the seededlda package, which requires a dfm as an input as well as a dictionary of seeded terms. It works well! My issue is figuring out how each document, or forum post, is categorized.
My original data has "userid" as a variable and "post" as the document I'm using for LDA. So far my code looks like this:
text <- introduction_posts$post
# tokenize first; the tokens() argument is remove_numbers (plural)
dfmt <- dfm(tokens(text, remove_numbers = TRUE)) %>%
  dfm_remove(stopwords('en'), min_nchar = 2)
#install.packages("seededlda")
library(seededlda)
slda <- textmodel_seededlda(dfmt,
                            seeded_dict,
                            valuetype = c("glob", "regex", "fixed"),
                            case_insensitive = FALSE,
                            residual = TRUE,
                            weight = 0.01,
                            max_iter = 2000,
                            alpha = NULL,
                            beta = NULL,
                            verbose = quanteda_options("verbose")
)
terms <- terms(slda)
How can I determine which terms go to which user?
When I used the LDA function under the topicmodeling package, I used a document-term matrix defined this way:
posts_dtm <- CreateDtm(doc_vec = introduction_posts$post, # character vector of documents
                       doc_names = introduction_posts$userid_bycourse, # document names
                       ngram_window = c(1, 2), # minimum and maximum n-gram length
                       stopword_vec = c(stopwords::stopwords("en"), # stopwords from tm
                                        stopwords::stopwords(source = "smart")))
which named the documents as it went along. In the end I was able to nicely see which topics went to which participants. But I can't seem to do that with the dfm that the seededlda package uses.
Any help would be appreciated.
It seems to me that this is more about how to construct a dfm using quanteda than about seededlda.
As for seededlda, its topics() does not return a vector with document names, but you can assign the names yourself.
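One way to do this, as a minimal sketch: build the corpus from the data frame so that userid is carried along as a document variable, then pair it with the output of topics(). The toy data frame and dictionary below are made-up stand-ins for your introduction_posts and seeded_dict, and the column names userid and post are assumed from your description.

```r
library(quanteda)
library(seededlda)

# Hypothetical stand-in for the real data: one row per forum post
introduction_posts <- data.frame(
  userid = c("u1", "u2", "u3", "u4"),
  post = c("Hello everyone, my name is Ana and I teach math.",
           "I joined this course to learn statistics for work.",
           "Hi all, my name is Ben, happy to meet you.",
           "Hoping this course helps me learn data analysis."),
  stringsAsFactors = FALSE
)

# Hypothetical seed dictionary: two seeded topics
seeded_dict <- dictionary(list(
  greeting = c("hello", "hi", "name"),
  learning = c("course", "learn*")
))

# corpus() on a data frame keeps the non-text columns as docvars,
# so userid stays attached to each document
corp <- corpus(introduction_posts, text_field = "post")

toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
dfmt <- dfm(toks)
dfmt <- dfm_remove(dfmt, stopwords("en"), min_nchar = 2)

slda <- textmodel_seededlda(dfmt, seeded_dict, residual = TRUE, max_iter = 200)

# topics() gives the most likely topic per document, in dfm row order,
# so it lines up with the userid docvar
post_topics <- data.frame(
  userid = docvars(dfmt, "userid"),
  topic = topics(slda)
)
print(post_topics)
```

If you want the full topic proportions per post rather than only the top topic, the fitted object also stores a document-topic matrix in slda$theta, whose rows follow the same document order.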