R: Remove words systematically from corpus after processing topic model


I am doing topic modeling with the topicmodels package and a corpus consisting of three documents.

model <- LDA(dat_dtm, method = "VEM", k = 3, control = list(alpha = 0.1))

Output:

A LDA_VEM topic model with 3 topics.

After that, I use the terms() function to obtain the top 5 words of each topic.

terms(model, 5)

Output with made-up words:

topic 1   topic 2    topic 3
strong    poor       class
wealth    struggle   middle
money     homeless   money
power     money      sufficient
rich      wealth     wealth

As you can see, the words "money" and "wealth" appear in every topic, but they are not really meaningful for my analysis. So I thought it might be a good idea to remove these words from the whole corpus and fit a new topic model without them. I tried to do this automatically: R should look at the top 20 words of each topic and remove from the corpus every word that appears in the top 20 of all topics. However, I only generated errors because I am not really familiar with the topicmodels package. Obviously, I could just add these words to the stop word list manually, but maybe there is a more systematic way to do it?

Thank you in advance!

Leonardo19 On

I think the easiest way is to make a vector object of the top 20 words and add it to your stop word list.

You can use the tidyverse together with tidytext, which supplies the tidy() method for LDA models, to extract these words for each topic.

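library(tidyverse)
library(tidytext)  # provides the tidy() method for LDA models

remove_words <- model %>% 
  tidy(matrix = "beta") %>%      # one row per topic-term pair with its probability
  group_by(topic) %>%
  slice_max(beta, n = 20) %>%    # top 20 terms within each topic
  pull(term) %>%
  unique()                       # drop terms that were picked for several topics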

Now you have a vector object called remove_words, which should be added to your stop word list before fitting a new topic model. Note that it holds the top 20 words of every topic, not only the words that all topics share.
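If you only want to drop the words that show up in the top n of every topic, as described in the question, one option is to intersect the per-topic top lists. The following is a minimal sketch in base R, not a tested recipe for your data: shared_top_terms() is a hypothetical helper name, and the small beta matrix is made-up toy data standing in for the topic-term probabilities of a fitted model.

```r
# Hypothetical helper: return the terms that rank in the top n of EVERY topic.
# beta is a topics x terms matrix of term probabilities whose column names
# are the terms.
shared_top_terms <- function(beta, n = 20) {
  n <- min(n, ncol(beta))
  top_per_topic <- lapply(seq_len(nrow(beta)), function(k) {
    # sort one topic's probabilities and keep the names of the n largest
    names(sort(beta[k, ], decreasing = TRUE))[seq_len(n)]
  })
  # keep only the terms present in all per-topic lists
  Reduce(intersect, top_per_topic)
}

# Toy data (made up): 3 topics, 5 terms, each row sums to 1
beta <- matrix(
  c(0.30, 0.25, 0.20, 0.15, 0.10,
    0.25, 0.20, 0.05, 0.40, 0.10,
    0.30, 0.20, 0.05, 0.05, 0.40),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("money", "wealth", "strong", "poor", "class"))
)

shared <- shared_top_terms(beta, n = 3)
print(shared)

# With the fitted model from the question, assuming dat_dtm is a
# DocumentTermMatrix built with tm, the same idea would look roughly like:
# beta         <- topicmodels::posterior(model)$terms
# remove_words <- shared_top_terms(beta, n = 20)
# dat_dtm2     <- dat_dtm[, !(tm::Terms(dat_dtm) %in% remove_words)]
# dat_dtm2     <- dat_dtm2[slam::row_sums(dat_dtm2) > 0, ]  # LDA rejects empty documents
# model2       <- LDA(dat_dtm2, method = "VEM", k = 3, control = list(alpha = 0.1))
```

Dropping the columns from the document-term matrix avoids rebuilding the corpus. Adding the same vector to the stop word list and regenerating the matrix should lead to the same result.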

Hope this helps!