How to apply stopwords accurately in French using R


I'm trying to pull a book with the gutenbergr package and then remove French stopwords. I've been able to do this accurately in English by doing this:

library(gutenbergr)
library(tidytext)
library(dplyr)

# download Oliver Twist and drop the front matter
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
# split the text into one word per row
twistEN <- twistEN %>%
  unnest_tokens(word, text)
# remove English stopwords
data(stop_words)
twistEN <- twistEN %>%
  anti_join(stop_words)
# count word frequencies
countsEN <- twistEN %>%
  count(word, sort = TRUE)
top.en <- countsEN[1:20,]

I can see here that the top 20 words (by frequency) in the English version of Oliver Twist are these:

   word          n
   <chr>     <int>
 1 oliver      746
 2 replied     464
 3 bumble      364
 4 sikes       344
 5 time        329
 6 gentleman   309
 7 jew         294
 8 boy         291
 9 fagin       291
10 dear        277
11 door        238
12 head        226
13 girl        223
14 night       218
15 sir         210
16 lady        209
17 hand        205
18 eyes        204
19 rose        201
20 cried       182

I'm trying to accomplish the same thing with the French version of the same novel:

twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR <- twistFR %>%
  unnest_tokens(word, text)
stop_french <- data.frame(word = stopwords::stopwords("fr"), stringsAsFactors = FALSE)
stop_french <- get_stopwords("fr","snowball")
as.data.frame(stop_french)
twistFR <- twistFR %>%
  anti_join(stop_words, by = c('word')) %>%
  anti_join(stop_french, by = c("word"))
countsFR <- twistFR %>%
  count(word, sort=TRUE)
top.fr <- countsFR[1:20,]

I did alter the code for the French stopwords based on info I found online, and it is removing some stopwords. But this is the list I'm getting:

   word         n
   <chr>    <int>
 1 dit       1375
 2 r         1311
 3 tait      1069
 4 re         898
 5 e          860
 6 qu'il      810
 7 plus       780
 8 a          735
 9 olivier    689
10 si         673
11 bien       656
12 tout       635
13 tre        544
14 d'un       533
15 comme      519
16 c'est      494
17 pr         481
18 pondit     472
19 juif       450
20 monsieur   424

At least half of these words should be caught by a stopword list, and they're not. Is there something I'm doing wrong in my code? I'm new to tidytext, so I'm sure there are better ways to get at this.

2 Answers

I used a few different packages to get what you want. I took the stopwords from tidystopwords, as these are based on the Universal Dependencies language models. But you could use the stopwords from snowball, from the stopwords package, or from the proustr package. You might even combine stopwords from multiple packages, depending on your requirements and on what you consider to be stopwords. All stopword lists are slightly different.
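
For a sense of how much these lists differ, you can compare their sizes directly (a quick sketch using the stopwords package; exact counts depend on your package versions):

length(stopwords::stopwords("fr", source = "snowball"))       # Snowball list, roughly 160 words
length(stopwords::stopwords("fr", source = "stopwords-iso"))  # stopwords-iso list, roughly 700 words
# proustr ships its own French list as well, if installed:
# nrow(proustr::proust_stopwords())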

I use the udpipe package to split the text into separate tokens. This takes longer than unnest_tokens from tidytext (but I use the default options, which include POS tagging and lemmatisation). I find that unnest_tokens doesn't work well with non-English languages.

library(gutenbergr)
library(tidystopwords)
library(udpipe)
library(dplyr)

# get twist in French
twistFR <- gutenberg_download(16023)
# Convert all lines to utf8 (needed on my system)
twistFR$text <- iconv(twistFR$text, to = "UTF-8")


# get french stopwords based on ud language model
my_french_stopswords <- generate_stoplist(lang_name = "French")
my_french_stopswords <- data.frame(word = my_french_stopswords, stringsAsFactors = FALSE)

# download and load the udpipe model for French
ud_model <- udpipe_download_model(language = "french")
ud_model_fr <- udpipe_load_model(ud_model$file_model)

# annotate the text (parallel.cores speeds this up); udpipe_annotate can take a while as it does a lot more than just tokenizing
ud_twistFR <- udpipe_annotate(ud_model_fr, twistFR$text[123:nrow(twistFR)], parallel.cores = 3)

# transform to data.frame
ud_twistFR_df <- data.frame(ud_twistFR, stringsAsFactors = FALSE)

# put tokens in lowercase, remove stopwords and punctuation
ud_twistFR_df <- ud_twistFR_df %>% 
  mutate(token = tolower(token)) %>% 
  anti_join(my_french_stopswords, by = c("token" = "word")) %>% 
  filter(upos != "PUNCT")

# count tokens
ud_countsFR <- ud_twistFR_df %>%
  count(token, sort=TRUE)

ud_countsFR[1:20,]
# A tibble: 20 x 2
   token        n
   <chr>    <int>
 1 pas       1558
 2 dit       1366
 3 m.         915
 4 olivier    843
 5 plus       775
 6 bien       652
 7 répondit   469
 8 juif       435
 9 monsieur   412
10 bumble     367
11 enfant     355
12 sikes      341
13 jeune      336
14 air        290
15 porte      281
16 tête       279
17 encore     278
18 homme      267
19 même       261
20 demanda    257
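
Since udpipe also returns a lemma column, one optional refinement (a sketch, not part of the run above) is to count lemmas instead of raw tokens, so that inflected forms of the same word collapse into a single entry:

ud_countsFR_lemma <- ud_twistFR_df %>%
  count(lemma, sort = TRUE)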

A second answer, from the original poster:

It turns out that my main problem was actually not the stop words. It was that accented characters were coming through as character codes instead of as accents. I applied this:

twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")

And the situation pretty much resolved itself. I also applied the larger stopwords-iso list. Thanks for both of your comments!
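
For completeness, here is how the two fixes slot into the pipeline from the question (a sketch; it assumes the "larger list" is the stopwords-iso source from the stopwords package):

library(gutenbergr)
library(tidytext)
library(dplyr)

twistFR <- gutenberg_download(16023)
# fix the encoding first, so accented words survive tokenization
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
twistFR <- twistFR[123:nrow(twistFR),]
twistFR <- twistFR %>%
  unnest_tokens(word, text)

# the larger stopwords-iso French list
stop_french <- data.frame(word = stopwords::stopwords("fr", source = "stopwords-iso"),
                          stringsAsFactors = FALSE)
twistFR <- twistFR %>%
  anti_join(stop_french, by = "word")

countsFR <- twistFR %>%
  count(word, sort = TRUE)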