anti_join is not recognizing tidytext stop words in my dataset

I am working on removing stop words from a body of text using the tidytext approach in R (https://www.tidytextmining.com/tidytext.html).

The following example works:

library(tidytext)
library(dplyr)

data(stop_words)
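# paste(c(...)) around a single string is redundant
str_v <- "i've been dancing after midnight, i'd know because it's daylight"

# One row per word, then drop every word that appears in stop_words
tibble(value = str_v) %>%
  unnest_tokens(word, value) %>%
  anti_join(stop_words, by = "word")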

When I apply this method to my own data, it does not throw an error, but the stop words are not removed. Does something invisible need to happen to the structure of the text for the stop words to match? The output rows appear identical to the stop words (lowercased, whitespace squished, etc.), and yet they remain. I'm working with protected data and am unable to share the source material. Any suggestions or advice on this problem would be super helpful, thank you!

There is 1 best solution below

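After struggling with the syntax, it turns out the problem is a punctuation artifact: my text uses a typographic (curly) apostrophe, while the contractions in stop_words use the straight ASCII one. In short: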

"’" != "'"
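
The inequality above can be checked in base R by comparing code points. This is only a small sketch: the variable names curly and straight are made up for illustration, and the curly apostrophe is written as the escape \u2019 so the snippet does not depend on the file's encoding.

```r
curly    <- "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "smart" apostrophe
straight <- "'"       # ASCII apostrophe, as used in stop_words contractions

utf8ToInt(curly)            # 8217
utf8ToInt(straight)         # 39
identical(curly, straight)  # FALSE
```

The two characters look almost the same in most fonts, which is why the tokens appear identical to the stop words on screen.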

I used mutate() with str_replace_all() to swap the curly apostrophes for straight ones in the text column, and now the stop words are removed.

library(stringr)

# Replace the curly apostrophe (U+2019) with the ASCII one before tokenizing
answer <-
  my_data %>%
  mutate(text = str_replace_all(text, "’", "'"))
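
Since the original data cannot be shared, here is a rough end-to-end sketch of the same fix on made-up input. The my_data tibble below is a hypothetical stand-in (the example sentence from the question with curly apostrophes), and the extra replacement of U+2018 is an assumption on my part, not something the original data is known to need. The filter() step is one possible way to spot the offending tokens before fixing them.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for the protected data, with curly apostrophes (U+2019)
my_data <- tibble(
  text = "i\u2019ve been dancing after midnight, i\u2019d know because it\u2019s daylight"
)

# Without normalization the curly-apostrophe contractions survive the anti_join
raw_tokens <- my_data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Diagnostic: remaining tokens containing anything outside printable ASCII
suspicious <- raw_tokens %>%
  filter(str_detect(word, "[^ -~]"))

# Normalize curly single quotes (U+2018 and U+2019) to "'" and try again
clean_tokens <- my_data %>%
  mutate(text = str_replace_all(text, "[\u2018\u2019]", "'")) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

Printing suspicious should show the contractions that failed to match, and clean_tokens should no longer contain them. Text pasted from word processors or web pages often carries other typographic characters too (curly double quotes, non-breaking spaces), so the same diagnostic is worth running on real data.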