Here's the gist of the code:
    import fitz  # PyMuPDF
    import easyocr
    from PIL import Image

    def extract_text_from_pdf(pdf_path):
        reader = easyocr.Reader(['en'], download_enabled=False)
        pdf_document = fitz.open(pdf_path)
        extracted_text = ""
        for page_number in range(pdf_document.page_count):
            page = pdf_document[page_number]
            # Render at 300 DPI; PDF user space is 72 units per inch.
            resolution = 300
            zoomfactor = resolution / 72.0
            pixmap = page.get_pixmap(matrix=fitz.Matrix(zoomfactor, zoomfactor))
            # tobytes() returns PNG-encoded bytes by default.
            image = pixmap.tobytes()
            result = reader.readtext(image, paragraph=True)
            print(f"Page {page_number + 1} - OCR Result:")
            for detection in result:
                # With paragraph=True each detection is [bounding_box, text].
                extracted_text += detection[1].strip()
        pdf_document.close()
        return extracted_text
The image passed looks something like this:
But the extracted text looks like this:
"account : 1234
url xyz"
The expectation is:
"account : 1234
url : xyz"
Notice that the colon after "url" is missing from the extracted text. Can you suggest a way to resolve this?
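
One way the problem might be narrowed down, as a rough sketch rather than a fix: as far as I understand, calling readtext with paragraph=False returns each detection as (bounding_box, text, confidence), so the per-detection scores can show whether the colon is lost at recognition time or only after paragraph grouping. The helper name low_confidence_detections and the 0.5 threshold below are my own assumptions for illustration, not part of easyocr.

```python
def low_confidence_detections(results, threshold=0.5):
    """Return (text, confidence) pairs whose confidence is below threshold.

    `results` is assumed to have the shape readtext returns with
    paragraph=False: an iterable of (bounding_box, text, confidence).
    The helper name and default threshold are hypothetical.
    """
    flagged = []
    for _bbox, text, confidence in results:
        if confidence < threshold:
            flagged.append((text, float(confidence)))
    return flagged

# Intended use inside the page loop (not run here):
#   detailed = reader.readtext(image, paragraph=False)
#   for text, conf in low_confidence_detections(detailed):
#       print(f"low confidence {conf:.2f}: {text!r}")
```

If the line containing "url" shows up with a low score, the colon is probably being dropped during recognition; if the colon appears in the unmerged detections, the paragraph grouping is the more likely place to look.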
