I'm guessing this is because the images I have contain text on top of a picture. pytesseract.image_to_string() can usually read the text properly, but it also returns a ton of gibberish characters. My guess is that the picture underneath the text makes Pytesseract think parts of it are text too.
When Pytesseract returns a string, how can I make it leave out any text unless it's certain that the text is right? Is there a way for Pytesseract to also return some sort of number telling me how confident it is that the text was scanned accurately?
I know I kinda sound dumb but somebody pls help
You can set a character whitelist with the config argument to get rid of gibberish characters, and you can also try different psm (page segmentation mode) options to get a better result.
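As a rough sketch of what that could look like (the helper names `build_tesseract_config` and `ocr_with_config` are made up for illustration, and `image_path` is whatever file you are scanning): the whitelist goes through Tesseract's `tessedit_char_whitelist` variable and the page segmentation mode through `--psm`, both passed in pytesseract's `config` string. Note that the whitelist is reportedly ignored by the LSTM engine in Tesseract 4.0, so this assumes a version where it is honoured.

```python
def build_tesseract_config(whitelist=None, psm=None):
    """Build a string for pytesseract's `config` argument (hypothetical helper)."""
    parts = []
    if psm is not None:
        # Page segmentation mode, e.g. 6 = assume a single uniform block of text.
        parts.append(f"--psm {psm}")
    if whitelist:
        # Only these characters may appear in the output (avoid spaces here,
        # because the config string is split on whitespace).
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def ocr_with_config(image_path, whitelist=None, psm=None):
    # Imported lazily so the config helper can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    config = build_tesseract_config(whitelist, psm)
    return pytesseract.image_to_string(Image.open(image_path), config=config)
```

For example, `ocr_with_config("sign.png", whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", psm=6)` would restrict the output to capital letters and digits; which psm value works best depends on the layout of your images, so it is worth trying a few.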
Unfortunately, it is not that easy. I think the only way is to apply some image preprocessing, and this is my best result:
Then I applied OCR and removed the whitespace characters:
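The snippet originally posted here is not reproduced. Separately, on the question's request for a certainty number: one option (not part of the original answer) is `pytesseract.image_to_data`, which returns a per-word `conf` value alongside the text. The sketch below assumes a threshold of 60, which is an arbitrary starting point, and the function names are hypothetical.

```python
def confident_text(data, min_conf=60):
    """Join the words of an image_to_data dict whose confidence is >= min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # Rows that are not words have conf -1 and empty text; depending on the
        # pytesseract version, conf may arrive as an int, a float, or a string.
        if text.strip() and float(conf) >= min_conf:
            words.append(text.strip())
    return " ".join(words)


def ocr_confident_text(image_path, min_conf=60, config=""):
    # Imported lazily so confident_text can be tried without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config=config,
        output_type=pytesseract.Output.DICT,
    )
    return confident_text(data, min_conf)
```

Raising `min_conf` drops more gibberish but can also drop real words that Tesseract was unsure about, so the threshold needs tuning on your own images.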
Related topics:
Pytesser set character whitelist
Pytesseract OCR multiple config options
You can also try preprocessing your image so that pytesseract works more accurately, and if you want to recognize only meaningful words you can apply a spell check after OCR:
https://pypi.org/project/pyspellchecker/
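A minimal sketch of the spell-check idea, assuming English text and that dropping unknown words (rather than correcting them) is acceptable. `keep_known_words` and `make_dictionary_check` are hypothetical names; the only pyspellchecker calls used are `SpellChecker()` and its `known()` method.

```python
import re


def keep_known_words(text, is_known):
    """Return the words in `text` for which is_known(lowercased word) is true."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if is_known(w.lower())]


def make_dictionary_check():
    # Imported lazily; install with `pip install pyspellchecker`.
    from spellchecker import SpellChecker

    spell = SpellChecker()
    # known() returns the subset of the given words found in the dictionary.
    return lambda word: bool(spell.known([word]))
```

Usage would be something like `keep_known_words(ocr_output, make_dictionary_check())`. Keep in mind that this also throws away names, abbreviations and anything else that is not in the dictionary.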