How to comma-separate words when using the PyPDF2 library


I'm converting a PDF to text using PyPDF2, and in the extracted text some words run together. The code is shown below:

import PyPDF2
import textract

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""

# Extract the text of every page and append it to one string
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()

# Fall back to OCR only if PyPDF2 extracted no text at all
if text == "":
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method='tesseract', language='eng')
print(text)

Output:

Topursuegraduatestudiesincomputerscienceandengineering

How can I get this instead?

To,pursue,graduate,studies,in,computer,science,and,engineering


There is 1 best solution below

Joe On

Please try adding a print call right after the extraction line:

text += pageObj.extractText()
print(pageObj.extractText())

How does the text look at that stage before the concatenation?

I might have found the reason. Download iText RUPS to inspect the PDF. This tool shows how the content is rendered and placed on the page.

Navigate to Stream


In the lower right corner you can read

[screenshot: stream contents shown in iText RUPS]

I am not familiar with the PDF spec, but this answer states:

These numbers adjust the respective text position by that amount. Numbers are expressed in thousandths of a unit of text space. According to the official PDF spec, this "amount shall be subtracted from the current horizontal or vertical coordinate". A positive number therefore moves the next string to the left when writing horizontally. A negative number moves it to the right.
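To make the sign convention concrete, here is a small worked sketch of that arithmetic. The function name `tj_shift` and the sample values are made up for illustration, and the formula is simplified: it ignores character spacing and word spacing and only applies font size and horizontal scaling.

```python
def tj_shift(amount, font_size, horizontal_scaling=1.0):
    """Approximate horizontal shift caused by one number in a text array.

    `amount` is in thousandths of a unit of text space and is subtracted
    from the current position, so a negative amount moves the next string
    to the right and a positive amount moves it to the left.
    (Hypothetical helper; character and word spacing are ignored.)
    """
    return -amount / 1000.0 * font_size * horizontal_scaling


# A negative adjustment opens a gap to the right (looks like a space).
print(tj_shift(-250, 12))  # 3.0

# A positive adjustment pulls the next string to the left (kerning).
print(tj_shift(125, 12))   # -1.5
```

So with a 12-point font, an adjustment of -250 leaves a gap of about 3 units, which is in the range of a typical space width, while small values are more likely kerning.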

My suspicion is that PyPDF2 does not interpret these numbers as spaces. Doing so is probably not that easy, because you have to know how large an adjustment corresponds to the width of a space character.

I had a quick look at other PDFs, and text with spaces instead of numbers in between is read correctly. Please try that.

If this is the problem, your next move could be to iterate over the elements directly, as they are shown in iText RUPS. It is a bit cumbersome but possible. You can find examples for PyPDF2.
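As a rough sketch of what that iteration could do once you have the elements: walk through the mixed list of strings and numbers and insert a space wherever the number is a large negative adjustment. Everything here is an assumption for illustration: the function name `join_tj_operands`, the sample operand list, and especially the threshold of -200, which would need tuning per font. Reading the operands out of the PDF with PyPDF2 is not shown.

```python
def join_tj_operands(operands, space_threshold=-200):
    """Join the strings of one text array, turning large gaps into spaces.

    `operands` is a list mixing text fragments (str) and position
    adjustments (numbers, in thousandths of a unit of text space).
    An adjustment at or below `space_threshold` is treated as a word gap.
    The threshold is a guess, not a value from the PDF spec.
    """
    parts = []
    for item in operands:
        if isinstance(item, (int, float)):
            # Large negative number: next string is moved far to the right.
            if item <= space_threshold:
                parts.append(" ")
            # Small adjustments are treated as kerning and ignored.
        else:
            parts.append(str(item))
    return "".join(parts)


# Hypothetical operands, similar to what a stream inspector might show.
operands = ["To", -278, "pursue", -278, "gr", 15, "aduate", -278, "studies"]

text = join_tj_operands(operands)
print(text)                    # To pursue graduate studies
print(",".join(text.split()))  # To,pursue,graduate,studies
```

Once the spaces are back, the comma-separated form asked for in the question is just a `split` followed by a `join`, as in the last line.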