Export data from Word to R, along with the formatting of each character read

I have a Word .docx file with some coloured characters. I am trying to export this data into a data frame and want to retain the font colour as well. The colours represent important information, so I would like the output to state the colour of each character being read. Are there any R packages that would help me read this?

I have tried converting it into XML, but have had no luck trying to retrieve the text based on the font color. I have also tried the officer package but unfortunately, it doesn't read the font colors.

Sample input would be a docx with characters like this:

[image of the sample characters]

Sample output could look something like:

Character   Underline   Bold   Color
O           No          Yes    Red
%           Yes         Yes    Black
8           Yes         Yes    Green

OR

Red character positions: 1
Green character positions: 3
Underlined character positions: 2, 3
Bold character positions: 1, 2, 3

Best answer:

Note: my test document is about pigs, hence the variable names.

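library(xml2)

# A .docx file is a zip archive; the document body is stored in word/document.xml
pigsin <- read_xml(unz(file.choose(), "word/document.xml"))

# Select every run (w:r) that has a text child (w:t) and convert each one to an R list
text_nodeset <- pigsin |> xml_find_all("//w:r[w:t]") |> as_list()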
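
To see what that XPath selects without opening a real file, here is a minimal sketch on a hand-written fragment. The fragment itself and the names frag, runs, run_text and run_colour are made up for illustration; only the w: namespace URI and the element names (w:r, w:rPr, w:t, w:b, w:u, w:color) come from WordprocessingML.

```r
library(xml2)

# Hypothetical stand-in for word/document.xml: three runs (w:r) with run
# properties (w:rPr) and text (w:t), plus one run that holds only a line break.
frag <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p>',
  '<w:r><w:rPr><w:b/><w:color w:val="FF0000"/></w:rPr><w:t>O</w:t></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t>%</w:t></w:r>',
  '<w:r><w:br/></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="00B050"/></w:rPr><w:t>8</w:t></w:r>',
  '</w:p></w:body></w:document>'
))

# Same XPath as in the answer: only runs that have a w:t child
runs <- xml_find_all(frag, "//w:r[w:t]")

# Text of each run, and the hex colour from w:color (empty string if absent)
run_text   <- xml_find_chr(runs, "string(./w:t)")
run_colour <- xml_find_chr(runs, "string(./w:rPr/w:color/@w:val)")
```

For this fragment, runs has three elements (the run holding only w:br is skipped), and run_colour is an empty string for the run without a w:color element. The answer's text_nodeset is the same kind of selection, converted to nested R lists with as_list().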

This gives you a list with one element per run (a w:r element, i.e. a stretch of text sharing the same formatting) that contains text. Then iterate over the runs to extract the text and formatting values, e.g.:

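lapply(text_nodeset,
       FUN = \(x) {
         # [[ ]] matches names exactly; $ partially matches (e.g. "b" could pick up "bCs")
         rpr <- x[["rPr"]]
         data.frame(
           # strsplit() returns a list, so take its first element to get one row per character
           chars  = strsplit(unlist(x[["t"]]), "")[[1]],
           italic = !is.null(rpr[["i"]]),
           bold   = !is.null(rpr[["b"]]),
           # w:color stores the colour as a hex string in its val attribute
           colour = if (is.null(rpr[["color"]])) "-" else attr(rpr[["color"]], "val"))
       }) |> dplyr::bind_rows()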

gives

   chars italic  bold colour
1      P   TRUE FALSE      -
2      i   TRUE FALSE FF0000
3      g   TRUE FALSE      -
4      P  FALSE FALSE      -
5      A  FALSE FALSE      -
6      G  FALSE FALSE FF0000
7      P  FALSE  TRUE      -
8      o  FALSE  TRUE      -
9      g  FALSE  TRUE      -
10     P  FALSE  TRUE 00B050
11     U  FALSE  TRUE 00B050
...
(# for my silly toy file)
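
The question also asked for underline and for character positions, which the code above does not cover. The following is a rough sketch of one way to add them, not a tested solution for arbitrary documents. The helper name char_formats and the demo fragment are hypothetical. It assumes that underline is stored as w:u with a w:val other than "none", and that bold and italic count as on unless w:val is "0" or "false". Formatting inherited from paragraph or character styles is not resolved.

```r
library(xml2)

# Hypothetical helper: one row per character, formatting read from each run's w:rPr
char_formats <- function(doc) {
  runs <- xml_find_all(doc, "//w:r[w:t]")
  # Toggle properties (w:b, w:i) are treated as on unless w:val is "0" or "false"
  toggle <- function(tag) {
    xml_find_lgl(runs, sprintf(
      "boolean(./w:rPr/w:%s[not(@w:val='0' or @w:val='false')])", tag))
  }
  per_run <- data.frame(
    text      = xml_find_chr(runs, "string(./w:t)"),
    underline = xml_find_lgl(runs, "boolean(./w:rPr/w:u[not(@w:val='none')])"),
    bold      = toggle("b"),
    italic    = toggle("i"),
    colour    = xml_find_chr(runs, "string(./w:rPr/w:color/@w:val)")
  )
  per_run$colour[per_run$colour == ""] <- "-"   # no explicit colour on this run
  # Repeat each run's formatting once per character in that run
  chars <- strsplit(per_run$text, "")
  out <- per_run[rep(seq_len(nrow(per_run)), lengths(chars)), -1]
  out <- cbind(chars = unlist(chars), out)
  rownames(out) <- NULL
  out
}

# Made-up fragment mirroring the question's sample: O (bold, red),
# % (bold, underlined), 8 (bold, underlined, green)
demo <- read_xml(paste0(
  '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">',
  '<w:body><w:p>',
  '<w:r><w:rPr><w:b/><w:color w:val="FF0000"/></w:rPr><w:t>O</w:t></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t>%</w:t></w:r>',
  '<w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="00B050"/></w:rPr><w:t>8</w:t></w:r>',
  '</w:p></w:body></w:document>'
))

fmt <- char_formats(demo)

# Positions in the style of the question's third output format
red_pos       <- which(fmt$colour == "FF0000")
green_pos     <- which(fmt$colour == "00B050")
underline_pos <- which(fmt$underline)
bold_pos      <- which(fmt$bold)
```

On the demo fragment this gives position 1 for red, 3 for green, 2 and 3 for underline, and 1, 2, 3 for bold. For a real file, the same function could be called on the pigsin document read earlier.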