Write a function in R to process docx files


I have a folder that contains *.docx files. I want to turn the script below into some sort of loop or function that reads all of the docx files, but I don't really know how to write an R function. Could someone please guide me?

library(docxtractr)
real_world <- read_docx("C:/folder/doc1.docx")
docx_tbl_count(real_world)
tbls <- docx_extract_all_tbls(real_world)
a <- as.data.frame(tbls)

So ideally it appends a new table every time a new document is extracted.

Thanks Peddie

2 Answers

Answer 1 (accepted)

Edit: For this answer I assumed that the OP did not use the term "function" in the sense of an R function, but simply meant an algorithm to solve the problem.

#### load packages ####
library(docxtractr)
library(plyr)

#### load data ####
# define path of dir
pathto <- "stackoverflow/41251392/example/"
# get path of every .docx-file in dir (pattern is a regular expression, not a glob)
filelist <- list.files(path = pathto, pattern = "\\.docx$", full.names = TRUE)
# read every file with docxtractr::read_docx()
tablelist <- lapply(filelist, read_docx)
# extract every table from every file with docxtractr::docx_extract_all_tbls()
tables <- lapply(tablelist, docx_extract_all_tbls)

#### append data to create one data.frame #### 
# combine extracted tables with plyr::ldply()
ldply(lapply(tables, function(x) {ldply(x, data.frame)}), data.frame)

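The last line is a bit difficult to understand: the inner ldply() combines all tables of one document into a single data.frame, and the outer ldply() combines those per-document data.frames into one. Take a look at ?plyr::ldply.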
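If the nesting is hard to picture, here is a minimal sketch of the same two-step combination in base R, using toy data instead of real docx files. The `tables` object below is a made-up stand-in for the result of docx_extract_all_tbls() (one list element per document, each holding a list of data frames). Note that do.call(rbind, ...) is a swapped-in technique, not what ldply() does internally: it requires all tables to share the same columns, whereas ldply() fills missing columns with NA.

```r
# Toy stand-in for `tables`: one element per document,
# each element is a list of data frames (one per table in that document).
tables <- list(
  list(data.frame(id = 1:2, value = c("a", "b")),
       data.frame(id = 3L, value = "c")),
  list(data.frame(id = 4:5, value = c("d", "e")))
)

# Inner step: stack the tables of each document into one data frame.
per_document <- lapply(tables, function(doc_tables) do.call(rbind, doc_tables))

# Outer step: stack the per-document data frames into one result.
combined <- do.call(rbind, per_document)

print(combined)
```

The inner lapply() plays the role of the inner ldply(), and the final do.call(rbind, ...) plays the role of the outer one.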

Answer 2

I don't know whether your code works as intended. But here I converted it to a function with a path argument, so that you can batch process all docx files under that path (don't put a slash at the end of the path). The default argument is the current working directory:

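library(docxtractr)

docxextr <- function(pathh = ".") {
    # list only the .docx files in the folder
    files <- list.files(path = pathh, pattern = "\\.docx$")
    # collect one data frame per document
    results <- list()
    for (i in files) {
        filen <- sprintf("%s/%s", pathh, i)
        real_world <- read_docx(filen)
        docx_tbl_count(real_world) # didn't understand where this count goes?
        tbls <- docx_extract_all_tbls(real_world)
        # stack the tables of this document (assumes they share the same columns)
        results[[i]] <- do.call(rbind, tbls)
    }
    # return only after the loop, so that every file is processed
    do.call(rbind, results)
}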
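
For a loop-based function like this, two details matter: only the .docx files should be picked up, and the result should be returned after all files have been processed, not from inside the loop. Below is a rough sketch of that pattern which runs without any real docx files. The names `process_folder`, `reader` and `fake_reader` are hypothetical and exist only for illustration; the demo uses empty placeholder files in a temporary folder.

```r
# Hypothetical helper: apply `reader` to every matching file and stack the results.
# `reader` is any function that takes a file path and returns a data frame.
process_folder <- function(path = ".", reader, pattern = "\\.docx$") {
  # full.names = TRUE returns usable paths, so no manual pasting is needed
  files <- list.files(path = path, pattern = pattern, full.names = TRUE)
  # one data frame per file, collected in a list
  results <- lapply(files, reader)
  # combine only after every file has been read
  do.call(rbind, results)
}

# Demo folder with two placeholder .docx files and one unrelated file
demo_dir <- file.path(tempdir(), "docx_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.docx", "b.docx", "notes.txt"))))

# Stand-in reader that only reports the file name (no docx parsing here)
fake_reader <- function(f) data.frame(file = basename(f), stringsAsFactors = FALSE)

out <- process_folder(demo_dir, fake_reader)
print(out)
```

With docxtractr, the reader could be something like function(f) do.call(rbind, docx_extract_all_tbls(read_docx(f))), again assuming that all tables share the same columns.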