Text mining a newspaper PDF in R?

Asked by James Rider on 24 August 2023 at 16:04

I am trying to extract all the text from a PDF (a newspaper front page) in R with the following code:

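library(pdftools)
library(tm)  # provides stripWhitespace()
# Read the text layer of the PDF (one string per page)
text <- pdftools::pdf_text(pdf = "https://www.nytimes.com/images/2013/06/02/nytfrontpage/scan.pdf")
# Replace line breaks with spaces
text <- gsub("\\n", " ", text)
# Replace non-word characters with spaces
text <- gsub(pattern = "\\W", replacement = " ", x = text)
# Collapse runs of whitespace into single spaces
text <- stripWhitespace(text)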

However, this does not work because of the way the text is laid out on the page and other factors such as the many line breaks. Text from different articles and headlines gets mismatched and stitched together, rather than each article appearing as one continuous block after the other.

For example, the headline "U.S. and China Will Hold Talks About Hacking" instead becomes "U.S. and China AS SYRIANS FIGHT, Will Hold Talks SECTARIAN STRIFE About Hacking". Does anyone know how I might fix the code so that the mined text comes out in a more continuous format?
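
One possible direction, offered as a rough sketch rather than a tested fix for this particular page: instead of flattening whole lines, work from word positions and read the page one column at a time. The helper `read_by_column()` below is hypothetical, and the `col_breaks` and `line_tol` values are assumptions that would have to be tuned by hand. The toy data frame uses made-up coordinates; it only mimics the `x`, `y`, and `text` columns that `pdftools::pdf_data()` returns for each page.

```r
# Hypothetical helper: rebuild reading order from word bounding boxes.
# `words`      data frame with columns x, y, text (one row per word)
# `col_breaks` x-coordinates where a new column starts (assumed, page-specific)
# `line_tol`   largest vertical gap (in points) still treated as the same line
read_by_column <- function(words, col_breaks, line_tol = 2) {
  # Assign each word to a column based on its left edge.
  words$col <- findInterval(words$x, col_breaks)
  per_col <- lapply(split(words, words$col), function(w) {
    # Sort top to bottom, then start a new line at each large vertical gap.
    w <- w[order(w$y), ]
    w$line <- cumsum(c(TRUE, diff(w$y) > line_tol))
    # Within each line, read left to right.
    w <- w[order(w$line, w$x), ]
    paste(w$text, collapse = " ")
  })
  unlist(per_col)
}

# Toy input with invented coordinates: two columns split at x = 150.
words <- data.frame(
  x = c(10, 40, 65, 160, 180, 230, 10, 40, 70, 160, 230, 10, 50),
  y = c(100, 100, 101, 100, 100, 100, 115, 115, 115, 115, 115, 130, 130),
  text = c("U.S.", "and", "China", "AS", "SYRIANS", "FIGHT,",
           "Will", "Hold", "Talks", "SECTARIAN", "STRIFE",
           "About", "Hacking"),
  stringsAsFactors = FALSE
)

columns <- read_by_column(words, col_breaks = 150)
print(columns)
# Column "0": "U.S. and China Will Hold Talks About Hacking"
# Column "1": "AS SYRIANS FIGHT, SECTARIAN STRIFE"
```

On a real file, the input would be something like `pdftools::pdf_data(path)[[1]]` for the first page. Headlines that span several columns, or columns of unequal width, will break this simple rule, so treat it as a starting point only.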
