Python PyPDF - getting additional spaces when reading text using ExtractText

1.4k Views Asked by sas_python_user At 27 July 2025 at 18:13

I'm trying to extract text from a PDF file that contains address information, as shown below:

CALIFORNIA EYE SPECIALISTS MED GRP INC 1900 W GARVEY AVE S # 335 WEST COVINA CA 91790

and I'm using the logic below to extract the data:

import PyPDF2

# The file name has to be a quoted string
f = open('addressPath.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')

but I'm getting the output below; the logic is introducing additional spaces. Any idea why this is happening?

C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC 19 00 W G A R V EY A VE S # 3 35 WE S T CO VI NA C A 91 7 90


There is 1 solution below

Martin Thoma On 06 June 2022 at 21:21

For a long time, PyPDF2 did not handle spaces at all. In April 2022, I improved the situation with very simple logic, but it was too simple and got many cases wrong.

The contributor pubpub-zz changed that. Today, version 2.1.0 was released, which improves spacing a lot.
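
As a rough sketch of what that means in practice: after upgrading (for example with pip install --upgrade PyPDF2), the extraction from the question could be rewritten along these lines. The helper names version_tuple and extract_lines and the constant MIN_VERSION are made up for illustration; PdfReader, pages and extract_text are the PyPDF2 2.x names for the calls used in the question. Whether the output for a given PDF comes out fully clean still depends on the file, so treat this as a starting point rather than a guaranteed fix.

```python
import re

# Release that the answer above says improved spacing.
MIN_VERSION = (2, 1, 0)


def version_tuple(version):
    """Turn a version string such as '2.1.0' into (2, 1, 0) for comparison."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '2.1' to (2, 1, 0)
        parts.append(0)
    return tuple(parts)


def extract_lines(pdf_path, page_index=0):
    """Return the text of one page as a list of lines (hypothetical helper)."""
    # Imported here so version_tuple can be used without PyPDF2 installed.
    import PyPDF2

    if version_tuple(PyPDF2.__version__) < MIN_VERSION:
        raise RuntimeError("PyPDF2 >= 2.1.0 is needed for the improved spacing")

    reader = PyPDF2.PdfReader(pdf_path)
    page = reader.pages[page_index]
    return page.extract_text().split("\n")
```

With the file from the question, this would be called as extract_lines('addressPath.pdf'). The version check is only a guard so that an old installation fails loudly instead of silently returning the badly spaced text.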
