I am trying to extract Hindi text from a PDF. I tried all the methods to exract from the PDF, but none of them worked. There are explanations why it doesn't work, but no answers as such. So, I decided to convert the PDF to an image, and then use pytesseract
to extract texts. I have downloaded the Hindi trained data, however that also gives highly inaccurate text.
That's the actual Hindi text from the PDF (download link):
That's my code so far:
import fitz
filepath = "D:\\BADI KA BANS-Ward No-002.pdf"
doc = fitz.open(filepath)
page = doc.loadPage(3) # number of page
pix = page.getPixmap()
output = "outfile.png"
pix.writePNG(output)
from PIL import Image
import pytesseract
# Include tesseract executable in your path
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Create an image object of PIL library
image = Image.open('outfile.png')
# pass image into pytesseract module
# pytesseract is trained in many languages
image_to_text = pytesseract.image_to_string(image, lang='hin')
# Print the text
print(image_to_text)
That's some output sample:
कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो
पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल
कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९
आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी
एजगल्णब्णस्य (बन्द जगाख्मिणण्य
नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी
चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20०
|सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44
जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष
There is an answer to this question I want to scrape a Hindi(Indian Langage) pdf file with python, which seems to tell how to do this, but provides no explanation whatsoever.
Is there any way to do this?
I will give some ideas how to process your image, but I will limit that to page 3 of the given document, i.e. the page shown in the question.
For converting the PDF page to some image, I used
pdf2image
.For the OCR, I use
pytesseract
, but instead oflang='hin'
, I uselang='Devanagari'
, cf. the Tesseract GitHub. In general, make sure to work through Improving the quality of the output from the Tesseract documentation, especially the page segmentation method.Here's a (lengthy) description of the whole procedure:
pytesseract
to extract the texts.pytesseract
to extract the texts as-is.pytesseract
usinglang='Devaganari'
on the left, and usinglang='eng'
on the right part to improve OCR quality for both.That'd be the whole code:
And, here are the first few lines of the output:
I checked a few texts using manual character-by-character comparison, and thought it looked quite good, but unable to understand Hindi or reading Devanagari script, I can't comment on the overall quality of the OCR. Please let me know!
Annoyingly, the number
9
from the corresponding "card" is falsely extracted as2
. I assume, that happens due to the different font compared to the rest of the text, and in combination withlang='Devanagari'
. Couldn't find a solution for that – without extracting the rectangle separately from the "card".