EasyOCR is not able to extract some of the colon symbols in the image

Here's the gist of the code:

import fitz  # PyMuPDF
import easyocr


def extract_text_from_pdf(pdf_path):
    # Models must already be available locally because download_enabled=False.
    reader = easyocr.Reader(['en'], download_enabled=False)
    extracted_text = ""
    pdf_document = fitz.open(pdf_path)
    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        # Render at 300 DPI; PDF coordinates use 72 units per inch.
        resolution = 300
        zoomfactor = resolution / 72.0
        pixmap = page.get_pixmap(matrix=fitz.Matrix(zoomfactor, zoomfactor))
        image = pixmap.tobytes()  # PNG-encoded bytes, which readtext accepts
        result = reader.readtext(image, paragraph=True)

        print(f"Page {page_number + 1} - OCR Result:")
        for detection in result:
            # With paragraph=True each detection is [bounding_box, text].
            extracted_text += detection[1].strip()

    pdf_document.close()
    return extracted_text

The image passed looks something like this:

[image]

But the extracted text looks like this:

"account : 1234

url xyz"

The expectation is:

"account : 1234

url : xyz"

Notice that the colon after "url" is missing from the extracted text, even though the colon after "account" is read correctly. Can you suggest a way to resolve this?
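One way to narrow this down might be to look at the individual detections instead of the merged paragraph. As far as I can tell, calling readtext with paragraph=False returns one (bounding_box, text, confidence) entry per detected box, so it should show whether the second colon is never detected or is detected and then lost. The helper below is only a sketch: group_into_lines, y_tolerance, and the sample data are names and values I made up for illustration, not part of EasyOCR, and the sample is not real OCR output.

```python
def group_into_lines(detections, y_tolerance=15):
    """Group (bbox, text, confidence) entries into rows by vertical position.

    bbox is assumed to be four [x, y] corner points, the shape EasyOCR
    returns when paragraph=False. y_tolerance is a hypothetical pixel
    threshold and would need tuning for the actual render resolution.
    """
    rows = []
    # Walk the boxes from top to bottom, using each box's top edge.
    ordered = sorted(detections, key=lambda d: min(p[1] for p in d[0]))
    for bbox, text, confidence in ordered:
        top = min(p[1] for p in bbox)
        left = min(p[0] for p in bbox)
        if rows and abs(rows[-1]["top"] - top) <= y_tolerance:
            rows[-1]["items"].append((left, text, confidence))
        else:
            rows.append({"top": top, "items": [(left, text, confidence)]})
    # Within each row, order the boxes from left to right.
    return [[(t, c) for _, t, c in sorted(row["items"])] for row in rows]


# Made-up detections that mimic the situation in the question:
# the first line has a colon box, the second line does not.
sample = [
    ([[10, 10], [80, 10], [80, 30], [10, 30]], "account", 0.98),
    ([[90, 12], [100, 12], [100, 30], [90, 30]], ":", 0.41),
    ([[110, 10], [160, 10], [160, 30], [110, 30]], "1234", 0.99),
    ([[10, 50], [40, 50], [40, 70], [10, 70]], "url", 0.97),
    ([[110, 50], [150, 50], [150, 70], [110, 70]], "xyz", 0.95),
]

for line in group_into_lines(sample):
    print(" ".join(f"{text}({confidence:.2f})" for text, confidence in line))
```

In the real pipeline the input would be reader.readtext(image, paragraph=False) rather than the sample list. If the colon shows up there with a low confidence, the detection step found it and the paragraph merge is the more likely suspect; if it is absent, the detector itself is probably skipping the small glyph.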
