Python PyTesseract Module returning gibberish from an image


My images contain text on top of a picture. pytesseract.image_to_string() can usually read the text properly, but it also returns a lot of gibberish characters. I'm guessing the picture underneath the text makes Pytesseract think parts of it are text too.

When Pytesseract returns a string, how can I make it include text only when it's certain the text is right? Is there a way for Pytesseract to also return some sort of number telling me how confident it is that the text was scanned accurately?

I know I kinda sound dumb but somebody pls help

Answer by cagataygulten:

You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm (page segmentation mode) options to get a better result.
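As a rough illustration of those two options, the sketch below builds a config string that combines a page segmentation mode with a character whitelist. The helper names `build_config` and `ocr_with_whitelist` are hypothetical, and the default alphabet is only an assumption about what the image contains. `--psm` and `tessedit_char_whitelist` are standard Tesseract options, but as far as I know the whitelist is ignored by the LSTM engine in Tesseract 4.0 and only works again from 4.1, so treat this as something to try rather than a guaranteed fix.

```python
import string


def build_config(psm, whitelist=None):
    """Return a Tesseract config string for the given page segmentation mode.

    If a whitelist is given, Tesseract is asked to output only those characters.
    Spaces or quote characters in the whitelist would need extra quoting,
    which this sketch does not handle.
    """
    config = f"--psm {psm}"
    if whitelist:
        config += f" -c tessedit_char_whitelist={whitelist}"
    return config


def ocr_with_whitelist(image, psm=6,
                       whitelist=string.ascii_uppercase + string.digits):
    """Run OCR with a restricted alphabet (needs pytesseract and Tesseract)."""
    # Imported lazily so build_config can be used without pytesseract installed
    import pytesseract

    return pytesseract.image_to_string(
        image, lang="eng", config=build_config(psm, whitelist)
    )
```

Calling `ocr_with_whitelist(th1, psm=12)` on a preprocessed image would then restrict the output to capital letters and digits; different `psm` values are worth trying, since the best one depends on the layout of the text.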

Unfortunately, it is not that easy. I think the only reliable way is to apply some image preprocessing, and this is my best attempt:

  1. First, I applied some blurring to smooth the image:
 import cv2
 # "image.png" is a placeholder path for the image to scan
 img = cv2.imread("image.png")
 blurred = cv2.blur(img, (5, 5))
  2. Then, to remove everything except the text, I converted the image to grayscale and applied thresholding to keep only the white pixels, which is the text color. I used inverse thresholding to make the text black, which is the optimum condition for Tesseract OCR:
 # Pixels brighter than 239 (the near-white text) become black, everything else white
 gray_blurred = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
 ret, th1 = cv2.threshold(gray_blurred, 239, 255, cv2.THRESH_BINARY_INV)


Then I applied OCR and removed the newline and form feed characters:

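import pytesseract
# --psm 12 treats the image as sparse text with orientation and script detection
txt = pytesseract.image_to_string(th1, lang='eng', config='--psm 12')
# Replace newlines with spaces and drop the form feed character from the output
txt = txt.replace("\n", " ").replace("\x0c", "")
print(txt)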
>>>"WINNING'OLYMPIC  GOLD MEDAL  IT'S MADE OUT OF  RECYCLED ELECTRONICS "
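On the question of getting a number for how certain the OCR is: pytesseract also provides `image_to_data`, which reports a confidence value (0 to 100, or -1 for rows that carry no word) for each recognized word. The sketch below is one possible way to drop low-confidence words. The function names and the threshold of 60 are assumptions chosen for illustration, not part of the answer above, and the right cutoff will depend on your images.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence is at least min_conf.

    `data` is a dict with parallel "text" and "conf" lists, the shape
    returned by pytesseract.image_to_data(..., output_type=Output.DICT).
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may arrive as int, float or str depending on the version,
        # so normalise it with float(); empty rows are skipped
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_text(image, min_conf=60.0, config="--psm 12"):
    """Run OCR and return only the words Tesseract is reasonably sure about."""
    # Imported lazily so confident_words can be used without pytesseract
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, lang="eng", config=config, output_type=Output.DICT
    )
    return " ".join(confident_words(data, min_conf))
```

With the thresholded image from the steps above, `ocr_confident_text(th1)` would return the text with low-confidence fragments removed. Printing the `conf` values next to the words first is a simple way to pick a sensible threshold.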

Related topics:

Pytesser set character whitelist

Pytesseract OCR multiple config options

You can also preprocess your image so that pytesseract works more accurately, and if you want to recognize only meaningful words, you can apply a spell check after OCR:

https://pypi.org/project/pyspellchecker/
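A minimal sketch of that spell-check idea, assuming pyspellchecker is installed: it keeps only the words the dictionary recognizes and discards the rest. The helper names `keep_known_words` and `pyspellchecker_known` are hypothetical, and filtering this way will also drop real words that are missing from the dictionary (names, brands), so it is a heuristic rather than a complete solution.

```python
import re


def keep_known_words(text, find_known):
    """Return the words in `text` that `find_known` reports as dictionary words.

    `find_known` takes a list of lowercase words and returns the subset
    it recognizes.
    """
    # Letters only, with an optional internal apostrophe (e.g. IT'S)
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    known = set(find_known([w.lower() for w in words]))
    return [w for w in words if w.lower() in known]


def pyspellchecker_known(words):
    """Dictionary lookup backed by pyspellchecker (pip install pyspellchecker)."""
    # Imported lazily so keep_known_words can be used with any lookup function
    from spellchecker import SpellChecker

    return SpellChecker().known(words)
```

Usage would look like `" ".join(keep_known_words(txt, pyspellchecker_known))`, where `txt` is the string returned by pytesseract.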