Why are these stop words not being removed from my data?

Tokenization of the data

library(dplyr)
library(tidytext)
tidy_text <- data %>%
  unnest_tokens(word, q_content)

Removal of stop words

data("stop_words")
stop_words
tidy_text <- tidy_text %>% anti_join(stop_words, by = "word")
tidy_text %>% count(word, sort = TRUE)

Output showing the 10 most frequent words

 1 im       13012
 2 dont     11197
 3 feel      9168
 4 time      6697
 5 life      4464
 6 ive       4403
 7 people    4233
 8 told      4150
 9 friends   4045
10 love      3281
There is 1 solution below

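As @Maurits Evers explained, the words in your data do not exactly match the entries in stop_words. Your tokens have no apostrophes (im, dont, ive), while stop_words stores these contractions with them (i'm, don't, i've), so anti_join finds nothing to remove. Strip the apostrophes from the words in stop_words before joining. Try: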

library(dplyr)

# Strip apostrophes from the stop words so that "don't" matches "dont"
tidy_text <- tidy_text %>%
  anti_join(
    stop_words %>% mutate(word = gsub("'", "", word)),
    by = "word"
  )

tidy_text %>% count(word, sort = TRUE)
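
To see the mismatch in isolation, here is a minimal sketch on made-up tokens. The names toy_tokens, stop_words_clean, kept_plain and kept_clean are hypothetical and only for illustration. It assumes your tokens lost their apostrophes somewhere before the join, which is what your output suggests.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy tokens that mimic the question's data (apostrophes already gone)
toy_tokens <- tibble(word = c("im", "dont", "ive", "feel", "the"))

# Plain anti_join: "the" matches a stop word exactly and is dropped,
# but "im", "dont" and "ive" do not match "i'm", "don't" and "i've"
kept_plain <- toy_tokens %>%
  anti_join(stop_words, by = "word")

# Strip apostrophes from the stop words first, then drop the duplicates
# that arise because stop_words combines several lexicons
stop_words_clean <- stop_words %>%
  mutate(word = gsub("'", "", word, fixed = TRUE)) %>%
  distinct(word)

# Now the apostrophe-free contractions are removed as well
kept_clean <- toy_tokens %>%
  anti_join(stop_words_clean, by = "word")

print(kept_plain)
print(kept_clean)
```

The alternative is to leave stop_words alone and keep the apostrophes in your text before calling unnest_tokens. Which one is easier depends on where the apostrophes were removed from your data.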