In the following example, the result is empty for every page in the PDF.
library(pdftools)
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile = file.path(getwd(), basename(url))
download.file(url, destfile, mode = "wb")
# Read the file that was just downloaded; list.files() could match several PDFs
pdf_text(destfile)
I am not sure whether the problem lies in the PDF itself, for example the way it was scanned and saved, which might prevent the text from being read. Is there a workaround for PDF files like this, or a better package/library that I should consider?
I would guess that the issue is that it's a scanned document, so the pages are images with no text layer for pdf_text to read. You will probably need an OCR tool to extract the text from the document. One option would be the tesseract package:
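A minimal sketch of that approach, assuming the tesseract package and its English language data are installed. The 300 dpi setting, the use of tempdir(), and the page_NNN.png file names are my own choices for illustration, not something taken from the question.

```r
library(pdftools)   # pdf_info() and pdf_convert() for rendering pages
library(tesseract)  # ocr() runs the Tesseract engine on image files

url <- "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile <- file.path(tempdir(), basename(url))
download.file(url, destfile, mode = "wb")

# One output file name per page (hypothetical naming scheme)
n_pages <- pdf_info(destfile)$pages
png_names <- file.path(tempdir(), sprintf("page_%03d.png", seq_len(n_pages)))

# Render every page to a PNG image; 300 dpi is an assumed value
pngs <- pdf_convert(destfile, format = "png", dpi = 300, filenames = png_names)

# Run OCR on each image; the result is a character vector, one element per page
eng <- tesseract("eng")
text <- ocr(pngs, engine = eng)

cat(text[1])
```

Recent versions of pdftools also provide pdf_ocr_text(), which wraps the same render-then-OCR steps in a single call if tesseract is installed. OCR output on scanned pages is rarely perfect, so expect to clean up the text afterwards.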