I'm trying to extract text from a PDF file that contains address information, as shown below:
CALIFORNIA EYE SPECIALISTS MED GRP INC 1900 W GARVEY AVE S # 335 WEST COVINA CA 91790
I'm using the following logic to extract the data:
import PyPDF2

f = open('addressPath.pdf', 'rb')  # the file name must be a quoted string
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')
But I'm getting the output below; the extraction introduces additional spaces. Any idea why this is happening?
C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC 19 00 W G A R V EY A VE S # 3 35 WE S T CO VI NA C A 91 7 90
For a long time, PyPDF2 did not handle spaces at all. In April 2022, I improved the situation with some very simple logic, but it was too simple and got many cases wrong.
The contributor pubpub-zz changed that. Today, version 2.1.0 was released, which improves spacing a lot.
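
To make use of that release, upgrading (for example with `pip install --upgrade PyPDF2`) and re-running the extraction should be enough. The following is only a rough sketch, not an official recipe: the helper names `parse_version`, `has_improved_spacing` and `extract_first_page_lines` are made up for illustration, and the file name `addressPath.pdf` is assumed from the question. It checks whether the installed version is at least 2.1.0 and then extracts the first page using the newer `PdfReader` / `extract_text` names.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (2, 1, 0)  # the release named in the answer above


def parse_version(text):
    """Turn a version string such as '2.1.0' into a tuple like (2, 1, 0)."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    numbers += [0] * (3 - len(numbers))  # pad '2.1' to (2, 1, 0)
    return tuple(numbers)


def has_improved_spacing(installed):
    """Return True if the given version string is 2.1.0 or newer."""
    return parse_version(installed) >= MIN_VERSION


def extract_first_page_lines(pdf_path):
    """Return the text of the first page, split into lines."""
    # Imported here so the version helpers also work without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return reader.pages[0].extract_text().split("\n")


if __name__ == "__main__":
    try:
        installed = version("PyPDF2")
    except PackageNotFoundError:
        print("PyPDF2 is not installed")
    else:
        status = "ok" if has_improved_spacing(installed) else "please upgrade"
        print(f"PyPDF2 {installed}: {status}")
```

With a recent enough version installed, `extract_first_page_lines('addressPath.pdf')` would be the replacement for the snippet in the question. How clean the result is still depends on how the PDF encodes its text, so some stray spaces may remain.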