Convert scanned PDF to searchable PDF (in R)

I'm trying to convert a series of scanned PDFs into searchable PDFs using the tesseract and pdftools packages. I've completed the first two steps below; now I need to write the OCR results back to a searchable PDF.

  1. Read scanned PDF
  2. Run OCR
  3. Write back to a searchable PDF

eg <- download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", "example.pdf", mode = "wb")

results <- tesseract::ocr_data("example.pdf", engine = "eng")
R> results
# A tibble: 406 x 3
   word        confidence bbox             
   <chr>            <dbl> <chr>            
 1 PFU               96.9 228,181,404,249  
 2 Business          96.2 459,180,847,249  
 3 report            96.2 895,182,1145,259 
 4 |                 52.5 3980,215,3984,222
 5 No.068            91.0 4439,163,4754,237
 6 New               96.0 493,503,1005,687 
 7 customer's        94.6 1069,484,2231,683
 8 development       96.5 2304,483,3714,732
 9 di                90.4 767,763,1009,959 
10 ing               96.3 1754,773,1786,807
# ... with 396 more rows

Alternatively, is there another package or command-line tool I can invoke in R for Windows?

There are 3 best solutions below

0
On

I had a similar need and wrote a small R function that calls the OCRmyPDF command-line tool.

I'm on Ubuntu, so I first installed OCRmyPDF with:

sudo apt install ocrmypdf

Here's the info for installing it on other operating systems.

Then define the function in R:

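    ocr_my_pdf <- function(path_read, ..., path_save = NULL){

      # Resolve the input path relative to the project root
      path_read <- here::here(path_read)
      if(is.null(path_save)){
        # Default output: same file name with an "_ocr" suffix
        path_save <- stringr::str_replace(path_read, '(?i)\\.pdf$','_ocr.pdf')
      } else {
        path_save <- here::here(path_save)
      }

      # Extra flags first, then input and output paths;
      # shQuote() keeps paths with spaces or quotes intact
      sys_args <- c(
        shQuote(as.character(unlist(list(...)))),
        shQuote(path_read),
        shQuote(path_save))
      system2('ocrmypdf', args = sys_args)

    }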

Then call the function on a test PDF with:

    ocr_my_pdf('/home/test.pdf')

Or, with whatever additional arguments you want to pass:

    ocr_my_pdf('test.pdf', '--deskew', '--clean', '--rotate-pages')

Here's the info for available arguments.
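Since the question mentions a series of PDFs, here is a minimal sketch of how the same call could be looped over a folder. The helper names `build_ocr_jobs` and `ocr_pdf_dir`, the `_ocr.pdf` suffix, and the folder arguments are my own assumptions for illustration, not part of OCRmyPDF. It only assumes that `ocrmypdf` is on the PATH and takes the input path followed by the output path, as in the function above.

```r
# Hypothetical helper: pair every PDF in dir_in with an output path in dir_out.
build_ocr_jobs <- function(dir_in, dir_out) {
  inputs <- list.files(dir_in, pattern = "\\.pdf$", ignore.case = TRUE,
                       full.names = TRUE)
  data.frame(
    input  = inputs,
    output = file.path(dir_out, sub("\\.pdf$", "_ocr.pdf", basename(inputs),
                                    ignore.case = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical driver: run ocrmypdf once per file and record the exit codes.
ocr_pdf_dir <- function(dir_in, dir_out, extra_args = character()) {
  if (!nzchar(Sys.which("ocrmypdf"))) stop("ocrmypdf was not found on the PATH")
  dir.create(dir_out, showWarnings = FALSE, recursive = TRUE)
  jobs <- build_ocr_jobs(dir_in, dir_out)
  jobs$status <- vapply(seq_len(nrow(jobs)), function(i) {
    # Extra flags first, then the quoted input and output paths
    system2("ocrmypdf",
            args = c(extra_args, shQuote(jobs$input[i]), shQuote(jobs$output[i])))
  }, integer(1))
  jobs
}
```

Something like `ocr_pdf_dir("scans", "scans_ocr", extra_args = "--deskew")` would then return one row per file, with a status of 0 where the conversion succeeded.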

0
On

Here is one approach based on the RDCOMClient R package. We open the PDF in Word, which runs its built-in OCR while converting it to a Word document. We then have Word save that document as a searchable PDF.

library(RDCOMClient)

path_PDF <- "C:/example.pdf"
path_Word <- "C:/example.docx"

# Download the sample scan to the same path that Word opens below
download.file("https://www.fujitsu.com/global/Images/sv600_c_automatic.pdf", path_PDF, mode = "wb")

################################################################
#### Step 1 : Convert PDF to word document with OCR of Word ####
################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)
doc_Selection <- wordApp$Selection()

##########################################################
#### Step 2 : Convert word document to searchable pdf ####
##########################################################
path_PDF_Searchable <- "C:/example_searchable.pdf"
wordApp[["ActiveDocument"]]$SaveAs(path_PDF_Searchable, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit Word

0
On

If you have eCopy installed on your computer (commercial software, not free), you can use the following function to convert scanned PDFs to searchable PDFs:

ecopy_Scanned_PDF_To_Numeric_PDF <- function(directory_Scanned_PDF, directory_Numeric_PDF)
{
  path_To_BatchConverter <- "C:/Program Files (x86)/Nuance/eCopy PDF Pro Office 6/BatchConverter.com"
  args <- paste0("-I", directory_Scanned_PDF, "\\*.pdf -O", directory_Numeric_PDF, " -Tpdfs -Lfre -W -V1.5 -J -Ao")
  system2(path_To_BatchConverter, args = args)
}

I use this function at my job and it works very well.
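Whichever tool does the conversion, it can be worth checking that the output really has a text layer. A minimal sketch, assuming the pdftools package is installed: `pdftools::pdf_text()` returns one string per page, so a scanned PDF without OCR should come back as empty strings. The helper names `has_text_layer` and `is_searchable_pdf` are hypothetical.

```r
# Hypothetical helper: TRUE when at least one page has non-blank extracted text.
has_text_layer <- function(page_text) {
  any(nzchar(trimws(page_text)))
}

# Hypothetical helper: extract the text of one PDF and test it.
is_searchable_pdf <- function(path) {
  has_text_layer(pdftools::pdf_text(path))
}

# Example (not run): check every PDF in the output directory
# out_files <- list.files(directory_Numeric_PDF, pattern = "\\.pdf$", full.names = TRUE)
# vapply(out_files, is_searchable_pdf, logical(1))
```

This only tells you that some text is present, not that the OCR is accurate.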