Python PyTesseract Module returning gibberish from an image

618 Views Asked by Charlie At 20 August 2021 at 20:33

My images contain text on top of a picture. pytesseract.image_to_string() usually reads the text properly, but it also returns a ton of gibberish characters. I'm guessing the picture underneath the text makes Pytesseract think parts of it are text too.

When Pytesseract returns a string, how can I make it include only text that it's confident is right? Is there a way for Pytesseract to also return some sort of number telling me how certain it is that the text was scanned accurately?

I know I kinda sound dumb but somebody pls help


There is 1 best solution below

cagataygulten On 20 August 2021 at 20:48

You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm options to get a better result.
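A minimal sketch of how those two options might be combined into one config string. The helper name `build_config` and the sample whitelist are hypothetical; `--psm` and `tessedit_char_whitelist` are Tesseract options, but whether the whitelist is honoured depends on the Tesseract version and engine mode, so treat this as something to try rather than a guaranteed fix.

```python
import string


def build_config(psm: int, whitelist: str) -> str:
    """Build a Tesseract config string with a page segmentation mode and a character whitelist."""
    # The whitelist must not contain spaces or quotes here: pytesseract splits the
    # config string into command-line arguments.
    if any(ch.isspace() or ch in "\"'" for ch in whitelist):
        raise ValueError("whitelist must not contain whitespace or quotes")
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


# Hypothetical whitelist: uppercase letters and digits only.
config = build_config(6, string.ascii_uppercase + string.digits)
print(config)
# Usage (requires Tesseract and a loaded image `img`):
#   txt = pytesseract.image_to_string(img, lang="eng", config=config)
```

Restricting the character set only hides symbols that are not on the list; background shapes can still be misread as whitelisted letters, which is why preprocessing is still worth trying.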

Unfortunately, it is not that easy. I think the only way is to apply some image preprocessing, and this is my best attempt:

First, I applied some blurring to smooth the image:

import cv2

# img is assumed to be a BGR image already loaded, e.g. with cv2.imread()
blurred = cv2.blur(img, (5, 5))

Then, to remove everything except the text, I converted the image to grayscale and applied thresholding to keep only white, which is the text color (I used inverse thresholding to make the text black, which is the optimum condition for Tesseract OCR):

gray_blurred = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
ret, th1 = cv2.threshold(gray_blurred, 239, 255, cv2.THRESH_BINARY_INV)
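To make the effect of the inverse threshold concrete, here is a pure-Python sketch of the rule that cv2.THRESH_BINARY_INV applies to each pixel. The function name `threshold_binary_inv` and the tiny sample image are made up for illustration; in practice cv2.threshold does this on the whole array.

```python
def threshold_binary_inv(pixels, thresh=239, maxval=255):
    """Pure-Python version of the THRESH_BINARY_INV rule for a 2D list of grayscale values."""
    # Pixels brighter than `thresh` become 0 (black); all others become `maxval` (white).
    return [[0 if p > thresh else maxval for p in row] for row in pixels]


# Tiny made-up "image": 250 stands for white text, the rest for background.
sample = [[250, 120], [239, 30]]
print(threshold_binary_inv(sample))  # [[0, 255], [255, 255]]
```

With a cutoff of 239, only near-white pixels survive as black text on a white background, so the value would likely need adjusting for images whose text is not white.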

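Then I applied OCR, replaced the newlines with spaces, and removed the form feed character:

import pytesseract

txt = pytesseract.image_to_string(th1, lang='eng', config='--psm 12')
# Replace newlines with spaces and drop the trailing form feed character
txt = txt.replace("\n", " ").replace("\x0c", "")
print(txt)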
>>>"WINNING'OLYMPIC  GOLD MEDAL  IT'S MADE OUT OF  RECYCLED ELECTRONICS "
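On the question of getting a certainty number: pytesseract.image_to_data() returns a per-word confidence alongside the text, so one option is to keep only words above a cutoff. The sketch below is a supplement to the steps above, not part of the original answer; the helper names, the cutoff of 60, and the sample data are assumptions chosen for illustration.

```python
def confident_words(data, min_conf=60.0):
    """Keep words whose confidence is at least `min_conf` from an image_to_data dict."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may be an int, float or string depending on the pytesseract version;
        # -1 marks layout entries that are not words.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def ocr_confident_text(image, min_conf=60.0, config="--psm 12"):
    """Run Tesseract on `image` and join only the words above the confidence cutoff."""
    # Imported here so that confident_words() can be used without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(
        image, lang="eng", config=config, output_type=pytesseract.Output.DICT
    )
    return " ".join(confident_words(data, min_conf))


# Made-up data in the same shape as the image_to_data dict, for illustration only.
fake = {"text": ["", "GOLD", "x~;", "MEDAL"], "conf": ["-1", 96, 12.5, "91"]}
print(confident_words(fake))  # ['GOLD', 'MEDAL']
```

With the thresholded image from above, this would be called as ocr_confident_text(th1). The right cutoff depends on the images, so it is worth printing the confidences for a few samples before settling on a value.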
