Regex doesn't capture numbers written out as words


I'm looking at Oliver Twist in both English and French. I found this tidytext vignette (https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) that provides code to assign a chapter number to each row of text. When I apply it to the English text, it works just fine:

library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
chaptersEN <- twistEN %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

When I then look at chaptersEN, I can see that it's appropriately applied the chapter number on each row. Where I'm running into trouble is with the French text. Here's my code:

twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
chaptersFR <- twistFR %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^chaptitre [\\divxlc]", ignore_case = TRUE)))) %>%
  ungroup()

The problem here is that the chapters aren't named Chapter 1 and Chapter 2; they are named Chapitre Premier, Chapitre Deuxieme. I believe the regex finds each chapter heading by looking at the numeral that follows the word chapter (please correct me if I'm wrong), so it doesn't know what to do when that number is written out as a word. Any ideas on how to assign the chapter number?


Best answer

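Matching rows that begin with an upper-case 'CHAPITRE' is sufficient in this case. Every chapter heading in this text is in capitals, and a case-sensitive match skips ordinary lines that merely begin with the word 'chapitre' (such as line 18979 in the output further down).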

chaptersFR <- twistFR %>%
  mutate(line = row_number(), chapter = cumsum(str_detect(text, regex("^CHAPITRE")))) %>%
  ungroup()

chaptersFR %>% 
  filter(grepl("^chapitre", text, ignore.case = TRUE)) %>%
  head(5)

# A tibble: 5 x 4
  gutenberg_id text               line chapter
         <int> <chr>             <int>   <int>
1        16023 CHAPITRE PREMIER.     1       1
2        16023 CHAPITRE II         124       2
3        16023 CHAPITRE III        604       3
4        16023 CHAPITRE IV.       1006       4
5        16023 CHAPITRE V.        1333       5

chaptersFR %>% 
  filter(grepl("^chapitre", text, ignore.case = TRUE)) %>%
  tail(5)

# A tibble: 5 x 4
  gutenberg_id text                                                            line chapter
         <int> <chr>                                                          <int>   <int>
1        16023 CHAPITRE L.                                                    18443      50
2        16023 CHAPITRE LI.                                                   18973      51
3        16023 chapitre, Olivier se trouvait, à trois heures de l'après-midi, 18979      51
4        16023 CHAPITRE LII                                                   19580      52
5        16023 CHAPITRE LIII.                                                 19989      53
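
To see why the case-sensitive match matters, here is a minimal sketch on a toy vector. The lines are made up to resemble the text (they are not the real download), and the names lines, chapter_strict and chapter_loose are just for illustration.

```r
library(stringr)

# Toy stand-in for twistFR$text (invented lines, not the Gutenberg download)
lines <- c(
  "CHAPITRE PREMIER.",
  "Some text of the first chapter.",
  "CHAPITRE II",
  "chapitre, Olivier se trouvait ...",
  "CHAPITRE III"
)

# Case-sensitive: only the all-caps headings start a new chapter
chapter_strict <- cumsum(str_detect(lines, "^CHAPITRE"))

# Case-insensitive: the ordinary sentence on line 4 is counted as a heading too
chapter_loose <- cumsum(str_detect(lines, regex("^chapitre", ignore_case = TRUE)))

print(chapter_strict)  # 1 1 2 2 3
print(chapter_loose)   # 1 1 2 3 4
```

With the case-insensitive version, every chapter after the stray line would be numbered one too high.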

Second answer

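The short answer: you wrote chaptitre instead of chapitre.

As for what the [\\divxlc] part does, take ^chapitre [\\divxlc] piece by piece:
^ means at the start of the string (here, the start of a row)
chapitre matches the word chapitre; on its own it is case-sensitive, but your regex(..., ignore_case = TRUE) makes it match CHAPITRE as well
the blank matches a single space
and the part [\\divxlc] matches one character: a digit, or one of 'i', 'v', 'x', 'l' or 'c'. In an R string, "\\d" is how you write the regex escape \d, so the class does not match a literal backslash or the letter 'd'

So it matches headings such as chapitre 1, chapitre iv, or (with ignore_case = TRUE) CHAPITRE XII. It does not match CHAPITRE PREMIER., because 'P' is not in the character class, so fixing the typo alone still leaves the first chapter heading uncounted.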
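
A quick way to check what the pattern really matches is to run it on a few sample strings. The sketch below uses the question's pattern with the spelling corrected; the sample strings are invented for illustration. Keep in mind that "\\d" in an R string reaches the regex engine as \d (a digit).

```r
library(stringr)

# The question's pattern with "chapitre" spelled correctly
pattern <- regex("^chapitre [\\divxlc]", ignore_case = TRUE)

# Invented sample lines, one per case worth checking
samples <- c(
  "chapitre 12",        # Arabic numeral
  "CHAPITRE IV.",       # Roman numeral, upper case
  "CHAPITRE PREMIER.",  # number written as a word
  "chapitre, Olivier",  # ordinary sentence, comma instead of a space
  "chaptitre 1"         # the misspelling from the question
)

matches <- str_detect(samples, pattern)
print(matches)  # TRUE TRUE FALSE FALSE FALSE
```

Only the first two samples match, so the heading with the written-out number still needs separate handling.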

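Your regex(..., ignore_case = TRUE) already makes the whole pattern case-insensitive. If you drop that option and only want the c at the start of chapitre to be uppercase or lowercase, you could use this:
^[cC]hapitre [\\divxlc]
Note that this matches Chapitre and chapitre, but not the all-caps CHAPITRE used in this text.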
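
To compare the two ways of handling case, here is a small sketch on invented headings. The last pattern, which adds the word "premier" as an alternative, is my own hypothetical extension and not something from the question or the vignette; it assumes "premier" is the only number written as a word.

```r
library(stringr)

# Invented headings covering the cases discussed above
headings <- c("CHAPITRE PREMIER.", "CHAPITRE II", "Chapitre ii", "chapitre 3")

# Only the first letter may vary in case; the rest must be lower case
first_letter_only <- str_detect(headings, "^[cC]hapitre [\\divxlc]")

# Whole pattern case-insensitive, as in the question
any_case <- str_detect(headings, regex("^chapitre [\\divxlc]", ignore_case = TRUE))

# Hypothetical extension: also accept the ordinal word "premier"
with_premier <- str_detect(
  headings,
  regex("^chapitre ([\\divxlc]|premier)", ignore_case = TRUE)
)

print(first_letter_only)  # FALSE FALSE TRUE TRUE
print(any_case)           # FALSE TRUE TRUE TRUE
print(with_premier)       # TRUE TRUE TRUE TRUE
```

If other ordinals turn up as words in your edition, they would need to be added to the alternation in the same way.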