Getting accented characters recognized when building a custom stopwords lexicon in R


I'm building a custom stopwords lexicon in R to remove accented characters. I thought that using the Unicode reference would enable this, but it doesn't work, and I'm having trouble thinking of different solutions, especially as some of these could not be covered by running a lexicon from another language.

Current code:

en_custom_stopwords <- bind_rows(data_frame(word = c("8217", "8216", "le", "de", "en", "el", "8221", "8220", "los", "039", "se", 
                                                     "aei", "\\\\U+00E4"), lexicon = c("custom")), stop_words)

This works fine with regular characters.
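
One possible explanation, offered as a sketch rather than a confirmed fix: in R source, `"\\\\U+00E4"` is two literal backslashes followed by the text `U+00E4`, not the character ä. R's own escape for that code point is `"\u00E4"`. The example below uses base R only so it runs without extra packages. The `tokens` data frame is hypothetical sample data, and the `%in%` filter stands in for `dplyr::anti_join(tokens, custom_stopwords, by = "word")`.

```r
# "\u00E4" is R's escape for U+00E4 (a with diaeresis): a single character.
accented <- "\u00E4"

# The string from the question is 8 literal characters, not one accented letter.
literal <- "\\\\U+00E4"
print(nchar(accented))
print(nchar(literal))

# Custom lexicon built with the real escape (shortened word list for brevity).
custom_words <- c("le", "de", "en", "el", "los", "se", "aei", accented)
custom_stopwords <- data.frame(word = custom_words, lexicon = "custom",
                               stringsAsFactors = FALSE)

# Hypothetical tokens standing in for the real tokenised text.
tokens <- data.frame(word = c("le", "chat", "\u00E4", "maison"),
                     stringsAsFactors = FALSE)

# Keep only the tokens that are not in the custom lexicon.
kept <- tokens[!(tokens$word %in% custom_stopwords$word), , drop = FALSE]
print(kept$word)
```

If the accented token still survives, the text may store it in decomposed form (a base letter plus a combining mark). Normalising both the text and the lexicon first, for instance with `stringi::stri_trans_nfc()`, may help in that case.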
