How to comma separate words when using PyPDF2 library

390 Views Asked by Aayush Sharma At 02 October 2018 at 09:21

I'm converting a PDF to text using PyPDF2, and in the output some words run together. The code is shown below:

import PyPDF2
import textract

filename = 'CS1.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""

# Extract the text of every page and concatenate it
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    print(pageObj)
    text += pageObj.extractText()

# Fall back to OCR only when PyPDF2 returned no text at all
if text == "":
    text = textract.process('/home/ayush/Ayush/1june/pdf_to_text/CS1.pdf', method='tesseract', language='eng')
print(text)

Output:

Topursuegraduatestudiesincomputerscienceandengineering

How can I get this instead?

To,pursue,graduate,studies,in,computer,science,and,engineering


There is 1 best solution below

Joe On 08 October 2018 at 05:52

Please try adding a print call right after the extraction line:

text += pageObj.extractText()
print(pageObj.extractText())

How does the text look at that stage before the concatenation?

I might have found the reason. Download iText RUPS to inspect the PDF. This tool shows how the content is rendered and placed on the page.

Navigate to Stream

In the lower right corner you can read

I am not familiar with the PDF spec, but this answer states

These numbers adjust the respective text position by that amount. Numbers are expressed in thousandths of a unit of text space. According to the official PDF spec, this "amount shall be subtracted from the current horizontal or vertical coordinate". A positive number therefore moves the next string to the left when writing horizontally. A negative number moves it to the right.

My suspicion is that PyPDF2 does not interpret such a number as a space. Doing so is probably not that easy, because you have to know how large an adjustment corresponds to a space character.
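To make the idea concrete, here is a minimal sketch of how such an array of strings and numbers could be turned into spaced text. It does not use PyPDF2 at all; the function name `join_tj_array` and the cut-off `gap_threshold=200` are hypothetical, and the right cut-off depends on the font and the document.

```python
def join_tj_array(items, gap_threshold=200):
    """Join strings from a positioning array, adding a space for large gaps.

    items: strings mixed with numeric adjustments (thousandths of a text unit).
    gap_threshold: hypothetical cut-off, chosen only for illustration.
    """
    parts = []
    for item in items:
        if isinstance(item, (int, float)):
            # A negative number moves the next string to the right, so a
            # large negative adjustment is treated as a word gap here.
            if -item >= gap_threshold and parts and not parts[-1].endswith(" "):
                parts.append(" ")
        else:
            parts.append(str(item))
    return "".join(parts)


# Made-up sample data: small adjustments are kerning, large ones are gaps.
sample = ["To", -250, "pursue", -250, "gr", 15, "aduate", -250, "studies"]
print(join_tj_array(sample))  # prints: To pursue graduate studies
```

A fixed threshold is the simplest possible rule. A more careful version would probably compare the adjustment with the width of the space glyph in the current font.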

I had a quick look at other PDFs, and text with literal spaces instead of numbers in between is read correctly. Please try that.
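Once the extracted text does contain spaces, getting the comma-separated form from the question is a plain split and join. A small sketch, where `comma_separate` is just an illustrative helper name:

```python
def comma_separate(text):
    # split() with no arguments splits on any run of whitespace,
    # including the newlines that often appear in extracted text.
    return ",".join(text.split())


print(comma_separate("To pursue graduate studies in computer science and engineering"))
# prints: To,pursue,graduate,studies,in,computer,science,and,engineering
```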

If this is the problem, your next move could be to iterate over the elements directly, as shown in iText RUPS. It is a bit cumbersome but possible. You can find examples for PyPDF2.
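As a rough sketch of that iteration: as far as I recall, a parsed content stream in PyPDF2 1.x is a list of `(operands, operator)` pairs with the operator as bytes, but treat that shape as an assumption and check it against your version. The code below works on a hand-made list in that shape, so it runs without a PDF; `text_operands` and `sample_ops` are hypothetical names.

```python
def text_operands(operations):
    """Yield (operator_name, operands) for text-showing operators only."""
    for operands, operator in operations:
        # Accept the operator as either bytes or str
        name = operator.decode("ascii") if isinstance(operator, bytes) else operator
        if name in ("Tj", "TJ", "'", '"'):
            yield name, operands


# Made-up operations, mimicking what a PDF inspector shows in a stream
sample_ops = [
    ([], b"BT"),
    (["/F1", 12], b"Tf"),
    ([["To", -250, "pursue"]], b"TJ"),
    (["graduate"], b"Tj"),
    ([], b"ET"),
]

for name, operands in text_operands(sample_ops):
    print(name, operands)
```

Printing the operands this way shows whether the words are separated by space characters or only by numbers.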
