Using {readtext} to import text from a PDF file containing text and images: RStudio aborts with no error message


I am trying to import text contained in a PDF file into RStudio, using {readtext}. In the past, this has worked smoothly, and for the most part it still does. However, there are a handful of PDF files that I cannot import: RStudio aborts (no error message!) when I try to read in the file.

Essentially, this is what I am doing:

library(readtext)

# This file is read without problems.
readtext::readtext("pdf_1.pdf")
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df [1 × 2]
#>   doc_id    text
#>   <chr>     <chr>
#> 1 pdf_1.pdf "\"      DEMO\"..."

# This file makes RStudio abort, with no error message.
readtext::readtext("pdf_2.pdf")

The funny thing is that both PDF files are remarkably similar in terms of usage rights, file size, contents (text surrounded by images), and creator. I am using the most recent versions of R and the RStudio IDE, as well as the most recent version of {readtext}, namely V 0.81.

Since I cannot provide the PDF files directly, please allow me to refer you to the following links, where the PDFs can be downloaded.

PDF that I can import: link

PDF that I cannot import: link

Word of advice: Don't spend too much time reading. They are the weekly newspapers of the German anti-lockdown movement, Querdenken. I am importing them into R purely for research purposes. :)

Any help with this is much appreciated. I've run out of ideas.

Answer by Nicolás Velasquez:

This trick simply rewrites the problematic PDF. It uses qpdf::pdf_combine() to "fake combine" the file with nothing else, which still writes out a new PDF, and that new file should be readable by R on your system.

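library(readtext)
library(qpdf)

# "Combine" the single problematic file with nothing else;
# qpdf writes a fresh copy of it to the output path.
pdf_combine(input = "problematic_01.pdf", output = "working_01.pdf")

# Read the rewritten copy instead of the original.
readtext("working_01.pdf")
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df [1 × 2]
#>   doc_id         text
#>   <chr>          <chr>
#> 1 working_01.pdf "\"          \"..."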
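
Since the question mentions a handful of affected files, the same workaround could be wrapped in a small helper. What follows is only a minimal sketch and not part of the original answer: the function name read_pdf_rewritten() and the file names are hypothetical, and it assumes that {qpdf} and {readtext} are installed. It writes the rewritten copy to a temporary file, reads that copy, and then restores the original file name as doc_id.

```r
# Hypothetical helper (not from the original answer): rewrite a PDF with
# qpdf::pdf_combine(), then read the rewritten copy with readtext::readtext().
# Functions are called with pkg:: so no library() calls are needed.
read_pdf_rewritten <- function(path) {
  stopifnot(is.character(path), length(path) == 1, file.exists(path))

  # Write the rewritten copy to a temporary file; the original stays untouched.
  tmp <- tempfile(fileext = ".pdf")
  on.exit(unlink(tmp), add = TRUE)

  qpdf::pdf_combine(input = path, output = tmp)
  out <- readtext::readtext(tmp)

  # Restore the original file name as the document id.
  out$doc_id <- basename(path)
  out
}

# Placeholder file names; only files that actually exist are processed.
pdf_files <- c("problematic_01.pdf", "problematic_02.pdf")
pdf_files <- pdf_files[file.exists(pdf_files)]
texts <- lapply(pdf_files, read_pdf_rewritten)
```

Note that this only automates the rewrite step. It cannot catch the abort itself, because a crash of the R session is not an R error that tryCatch() could handle.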