Text mining a newspaper PDF in R?


I am trying to extract all the text from a PDF (a newspaper front page) in R with the following code:

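library(pdftools)
library(tm)  # provides stripWhitespace()
text <- pdftools::pdf_text(pdf = "https://www.nytimes.com/images/2013/06/02/nytfrontpage/scan.pdf")
text <- gsub("\\n", " ", text)                              # replace line breaks with spaces
text <- gsub(pattern = "\\W", replacement = " ", x = text)  # replace non-word characters with spaces
text <- stripWhitespace(text)                               # collapse runs of whitespace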

However, this does not work well because of how the page is laid out: the text is set in multiple columns with many line breaks. As a result, different articles and headlines get interleaved and stitched together, rather than each one appearing as continuous text, one after the other.

For example, the headline "U.S. and China Will Hold Talks About Hacking" instead becomes "U.S. and China AS SYRIANS FIGHT, Will Hold Talks SECTARIAN STRIFE About Hacking". Does anyone know how I might fix the code so that the mined text comes out in a more continuous format?
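One direction that might help, offered as a rough sketch rather than a tested fix for this particular page: instead of flattening the output of pdf_text(), work from word positions. pdftools::pdf_data() returns one data frame per page with each word's x and y coordinates, so words can be grouped by column before being joined. The helper reading_order(), its col_breaks argument (hand-chosen x positions separating the columns) and its line_tol argument are hypothetical names introduced here for illustration; the break values for a real page would have to be found by inspecting the data.

```r
# Sketch: rebuild reading order from word positions instead of flat text.
# Assumes `words` has the columns x, y and text, as in the per-page data
# frames returned by pdftools::pdf_data(). `col_breaks` is a hypothetical,
# hand-chosen vector of x positions that separate the page's columns.
reading_order <- function(words, col_breaks, line_tol = 3) {
  # Assign each word to a column based on its left edge
  col <- findInterval(words$x, col_breaks) + 1
  # Crude line banding: words with similar y values get the same line index
  line <- round(words$y / line_tol)
  # Sort by column first, then top to bottom, then left to right
  ord <- order(col, line, words$x)
  # Join the words of each column into one string
  vapply(split(words$text[ord], col[ord]), paste, character(1), collapse = " ")
}

# Synthetic example: two columns whose lines sit at the same heights
words <- data.frame(
  x    = c(10, 60, 210, 260, 10, 60, 210),
  y    = c(100, 100, 100, 100, 112, 112, 112),
  text = c("U.S.", "and", "AS", "SYRIANS", "China", "Will", "FIGHT,"),
  stringsAsFactors = FALSE
)
result <- reading_order(words, col_breaks = 200)
print(result)

# Possible use on a real page (break positions are assumptions):
# page1 <- pdftools::pdf_data("scan.pdf")[[1]]
# reading_order(page1, col_breaks = c(200, 400))
```

A front page also has headlines that span several columns and articles of different widths, so fixed column breaks will only be a first approximation; the banding by rounded y values can likewise split a line whose words have slightly different baselines.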
