Python PyPDF2 - getting additional spaces when reading text using extractText

I'm trying to extract text from a PDF file that contains address information, as shown below:

CALIFORNIA EYE SPECIALISTS MED GRP INC 1900 W GARVEY AVE S # 335 WEST COVINA CA 91790

and I'm using the logic below to extract the data:

import PyPDF2

f = open('addressPath.pdf', 'rb')  # the file name must be a quoted string
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')

but I'm getting the output below; the extraction introduces additional spaces. Any idea why this is happening?

C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC 19 00 W G A R V EY A VE S # 3 35 WE S T CO VI NA C A 91 7 90

There is 1 best solution below

For a long time, PyPDF2 did not handle spaces at all. In April 2022, I improved the situation with very simple logic, but it was too simple and got many cases wrong.

The contributor pubpub-zz changed that. Today, version 2.1.0 was released, which improves spacing a lot.
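
To illustrate why a too-simple spacing rule produces output like the one in the question, here is a toy model. It is not PyPDF2's actual code; the fragment layout, the coordinates, and the names `join_naive`, `join_by_gap`, and `min_gap` are all made up for illustration. The general idea is that a PDF positions short runs of glyphs by coordinates, so an extractor has to guess where the real spaces are.

```python
from typing import List, Tuple

# Each fragment is (text, x_start, x_end) in arbitrary page units.
# Hypothetical data: a real PDF stores positioning differently.
Fragment = Tuple[str, float, float]


def join_naive(fragments: List[Fragment]) -> str:
    """Put a space between every pair of fragments, ignoring the gap size."""
    return " ".join(text for text, _, _ in fragments)


def join_by_gap(fragments: List[Fragment], min_gap: float = 2.0) -> str:
    """Insert a space only when the horizontal gap exceeds min_gap."""
    parts = []
    prev_end = None
    for text, x_start, x_end in fragments:
        if prev_end is not None and x_start - prev_end > min_gap:
            parts.append(" ")
        parts.append(text)
        prev_end = x_end
    return "".join(parts)


# Made-up layout for "WEST COVINA", split into kerned fragments.
fragments = [
    ("WE", 0.0, 14.0),
    ("S", 14.2, 20.0),
    ("T", 20.1, 26.0),
    ("CO", 30.0, 43.0),   # the only large gap: a real word break
    ("VI", 43.3, 53.0),
    ("NA", 53.2, 66.0),
]

print(join_naive(fragments))   # WE S T CO VI NA
print(join_by_gap(fragments))  # WEST COVINA
```

The naive version reproduces the "WE S T CO VI NA" pattern from the question, while the gap-based version only splits at the wide gap. If you are on an older PyPDF2, upgrading to 2.1.0 or later is the practical fix; in the 2.x releases the same extraction is also available under the newer names `PdfReader`, `reader.pages[0]`, and `extract_text()`.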