Extract specific text from PDFs using Python


I have tried several Python libraries to extract specific text from PDFs. In the first PDF (pdf1) I need the text under a heading: everything from "Case 1" up to the bold diamond (◆).

The next PDF (pdf2) has the data in a different format. Here I need the text from "History" to "Examination", and then from "Examination" to "Investigations", written to an Excel file with History and Investigation as columns and the corresponding text in rows. Regex alone does not work for this, because every PDF has a different layout and we need a different kind of text from each one.

Apart from these two, I have 5+ other types of PDFs to process. I have tried pdfminer, pdfplumber, PyMuPDF, pytesseract, textract and GROBID.

sample pdf:sample pdfs

code 1

import pdfplumber
import docx

file='Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'

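# Open the PDF in a context manager so the file is closed afterwards
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_text()  # text of the whole page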

code 2


import docx  # python-docx, needed for docx.Document() below
import fitz  # PyMuPDF

file='Book_EM-Cases-Digest-Vol-2-Pediatric-Emergencies (1).pdf'

docum = docx.Document()
with fitz.open(file) as doc:
    for page in doc:
        text = page.get_text()  # text of the whole page

Both snippets above extract the text of the whole page, but I want only specific text. I know regex could do this, but I have many different kinds of PDFs, and it has become difficult to maintain regexes for all of them.


There are 2 best solutions below

Mohit Mehlawat

Using the PyMuPDF library:

  1. Get the coordinates of the text blocks on the page with Page.get_text('dict').
  2. From those blocks, pick out the coordinates of the text you need and build a rectangle (rect) around it.
  3. Extract the text with Page.get_text(clip=rect, sort=False), where rect is the rectangle enclosing the text you want.
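
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a complete solution: the helper names (find_line_bbox, clip_between, extract_between) are made up for this example, and it assumes both markers sit on the same page of a single-column layout. The structure it walks (blocks, lines, spans, bbox) is what Page.get_text('dict') returns.

```python
def find_line_bbox(page_dict, needle):
    """Return (x0, y0, x1, y1) of the first text line containing `needle`, else None.

    `page_dict` is the dictionary returned by PyMuPDF's Page.get_text("dict").
    """
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            # A line is split into spans (one per font/style run); join them.
            line_text = "".join(span["text"] for span in line["spans"])
            if needle in line_text:
                return tuple(line["bbox"])
    return None


def clip_between(page_rect, start_bbox, end_bbox):
    """Full-width rectangle from the top of the start line to the top of the end line."""
    x0, _, x1, _ = page_rect
    return (x0, start_bbox[1], x1, end_bbox[1])


def extract_between(pdf_path, page_number, start_marker, end_marker):
    """Hypothetical helper: text between two markers on one page, or None if not found."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        page_dict = page.get_text("dict")        # step 1: blocks with coordinates
        start = find_line_bbox(page_dict, start_marker)
        end = find_line_bbox(page_dict, end_marker)
        if start is None or end is None:
            return None
        rect = fitz.Rect(clip_between(tuple(page.rect), start, end))  # step 2
        return page.get_text(clip=rect, sort=False)                   # step 3
```

For the first PDF in the question this might be called as extract_between(file, page_number, "Case 1", "◆"), with page_number being whichever page holds the case. Text that runs across several pages, or PDFs with multi-column layouts, would need extra handling on top of this.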

Luca Foppiano

Grobid is not made for parsing such big PDF documents. It is designed to understand scholarly publications.

Anyway, there is a Python client that can be useful: https://github.com/kermitt2/grobid-client-python. You can use the Huggingface space demo server, https://kermitt2-grobid.hf.space/, and you can parse the output XML with https://pypi.org/project/grobid-tei-xml/

Simple example:


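import grobid_tei_xml

# This snippet comes from inside a class: self.grobid_client is an already
# configured GrobidClient instance and input_path is the path to the PDF.
pdf_file, status, text = self.grobid_client.process_pdf("processFulltextDocument", input_path)

if status == 200:
    # text holds the TEI XML returned by Grobid
    doc = grobid_tei_xml.parse_document_xml(text)

    print(doc.abstract)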