Tesseract is not finding text in a simple handwriting test. Is there any way to fix this?


I am trying to put together a better solution for automated grading of paper tests. The problem is to extract rectangular areas from a test and do OCR on handwritten input. While handwriting is obviously challenging, this problem is significantly simpler than generically reading handwriting:

  1. The text orientation is known
  2. I can specify exactly what answers I am expecting, and/or the set of characters that are legal.
  3. I would be willing to get a probability from the engine and, if the probability is too low, call in a human to adjudicate (though I would prefer to avoid that).
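The constraints above can be sketched as a small decision step around the OCR call. This is only an illustration under stated assumptions: the question types, the 80.0 threshold, and the function names are hypothetical, and the confidence is assumed to be on Tesseract's 0 to 100 scale. `--psm` and `tessedit_char_whitelist` are real Tesseract options, but how strictly the whitelist is honored depends on the Tesseract version and engine mode.

```python
# Hypothetical per-question-type settings; names and values are illustrative.
QUESTION_TYPES = {
    "integer": {"whitelist": "0123456789", "psm": 7},  # psm 7: single text line
    "letter": {"whitelist": "ABCDE", "psm": 10},       # psm 10: single character
}


def tesseract_config(question_type):
    """Build a Tesseract config string that restricts the legal characters."""
    cfg = QUESTION_TYPES[question_type]
    return f"--psm {cfg['psm']} -c tessedit_char_whitelist={cfg['whitelist']}"


def adjudicate(text, confidence, legal_chars, min_confidence=80.0):
    """Accept an OCR result, or route it to a human reviewer.

    A result is accepted only if it is non-empty, uses legal characters
    exclusively, and meets the (assumed) confidence threshold.
    """
    candidate = text.strip()
    legal = bool(candidate) and all(ch in legal_chars for ch in candidate)
    if legal and confidence >= min_confidence:
        return ("accept", candidate)
    return ("human", None)
```

The config string could be passed to the `tesseract` command line or to a wrapper such as pytesseract through its `config` argument; the threshold would need tuning on real scans.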

Tesseract claims to work on handwriting and runs on Linux and on Windows (using MinGW), so it seemed like a good fit.

I extracted a sample of handwritten data from a form. Here is the sample:

enter image description here

In this case, the border of the rectangle has not been cropped out, but I still expected Tesseract to find my 64. It failed.

When I cropped the bounding box, it worked.

While I can solve the problem in this case, I want to know whether there is anything I can do to improve recognition, because the bounding box seemed innocuous, and I am worried that any trivial noise could ruin detection.
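One way to make the crop less fragile is to erase anything that looks like a ruling line before running OCR. The sketch below is a pure-Python illustration, not a Tesseract feature: it assumes the image has already been binarized into rows of 0 (background) and 1 (ink), and the function name and the 0.6 threshold are hypothetical.

```python
def strip_ruling_lines(pixels, max_ink_fraction=0.6):
    """Whiten rows and columns that are mostly ink (likely box edges).

    `pixels` is a non-empty list of equal-length rows; each entry is
    1 for ink and 0 for background. Returns a new image.
    """
    height, width = len(pixels), len(pixels[0])
    # A row or column that is mostly ink is treated as a ruling line.
    bad_rows = {y for y in range(height)
                if sum(pixels[y]) > max_ink_fraction * width}
    bad_cols = {x for x in range(width)
                if sum(pixels[y][x] for y in range(height)) > max_ink_fraction * height}
    return [[0 if (y in bad_rows or x in bad_cols) else pixels[y][x]
             for x in range(width)]
            for y in range(height)]
```

This only handles straight, axis-aligned borders; a skewed scan or a broken line would need a more robust method, such as morphological line removal in an image library.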

  1. Is there a better open source package I could use?

  2. Is there a way to improve the training for my application? I think I could create a "language" for single letters, and a different language for integers, and load multiple Tesseract engines, each specialized for one kind of question.

  3. Is there a way in the internal API to give it a list of the potential strings or the character set, i.e., hinting to improve accuracy?
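Independent of what the engine itself supports, the list of potential strings from question 3 can be applied as a post-processing step. The sketch below is an assumption about how that might look, not a Tesseract API: it uses Python's standard `difflib` to snap the raw OCR text to the closest expected answer, and the function name and the 0.6 cutoff are hypothetical.

```python
import difflib


def snap_to_expected(ocr_text, expected_answers, cutoff=0.6):
    """Map raw OCR text to the closest expected answer, or None.

    Exact matches win; otherwise the best fuzzy match at or above
    `cutoff` (a difflib similarity ratio in [0, 1]) is returned.
    """
    candidate = ocr_text.strip()
    if candidate in expected_answers:
        return candidate
    matches = difflib.get_close_matches(candidate, expected_answers, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For short numeric answers, fuzzy matching can turn a misread into a plausible but wrong value, so a `None` result (or any non-exact match) is probably better sent to human review than graded automatically.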
