I'm converting a PDF to text using PyPDF2, and in the extracted text the words run together without spaces. The code is shown below:
import PyPDF2
import textract

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
# Collect the extracted text of every page.
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()
# Fall back to OCR only when PyPDF2 extracted nothing.
if text == "":
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method='tesseract', language='eng')
print(text)
Output:
Topursuegraduatestudiesincomputerscienceandengineering
How can I get the following instead?
To,pursue,graduate,studies,in,computer,science,and,engineering
Please try to add
How does the text look at that stage before the concatenation?
I might have found the reason. Download iText RUPS to inspect the PDF. This tool shows how the content is rendered and placed on the page.
Navigate to
Stream
In the lower right corner you can read
I am not familiar with the PDF spec, but this answer states
My suspicion is that PyPDF2 does not interpret a number as a space. Doing so is probably not that easy, because you have to know how large an offset corresponds to a space character. I had a quick look at other PDFs, and text that contains real spaces instead of numbers in between is read correctly. Please try that.
If this is the problem, your next move could be to iterate over the elements directly, as shown in iText RUPS. It is a bit cumbersome but possible. You can find examples for PyPDF2.
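To illustrate the idea, here is a minimal sketch of how numbers between text fragments could be turned into spaces. It assumes you have already pulled the operands of a text-showing array out of the content stream as a plain Python list of strings and numbers; that extraction step is not shown. In such an array a negative number shifts the following text to the right (for horizontal writing), so a large negative value often marks a word gap. The function name `join_tj_array` and the threshold of 200 are my own assumptions, not part of PyPDF2, and the threshold will likely need tuning per font and document.

```python
from typing import Iterable, Union

# One item of a text-showing array: a text fragment or a positioning number.
TJItem = Union[str, int, float]


def join_tj_array(items: Iterable[TJItem], gap_threshold: float = 200.0) -> str:
    """Join text fragments, inserting a space where a number signals a wide gap.

    gap_threshold is a guess (hypothetical default); small adjustments such as
    kerning stay below it and are ignored.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item >= gap_threshold:
            # A sufficiently negative number moves the next glyph far to the
            # right, which we treat as a word boundary.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
    return "".join(parts)


# Made-up operands resembling what the inspector shows for one line of text.
operands = ["To", -250, "pursue", -250, "gr", 15, "aduate", -250, "studies"]
line = join_tj_array(operands)
print(line)                       # To pursue graduate studies
print(",".join(line.split()))     # To,pursue,graduate,studies
```

Once the spaces are back, the comma-separated form asked for in the question is just a split and join, as the last line shows.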