I am trying to put together a better solution for automated grading of paper tests. The problem is to extract rectangular areas from a test and run OCR on the handwritten input. While handwriting is obviously challenging, this problem is significantly simpler than general handwriting recognition:
- The text orientation is known
- I can specify exactly what answers I am expecting, and/or the set of characters that are legal.
- I would accept a confidence score from the engine and, if it is too low, call in a human to adjudicate (though I would prefer to avoid that).
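To make the last two points concrete, here is a minimal sketch of the accept-or-escalate logic I have in mind. It assumes the engine can hand back a text plus a confidence in [0, 1]; the `Reading` type, the function name, and the 0.80 threshold are all hypothetical placeholders, not part of any OCR library.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Reading:
    """Hypothetical container for one OCR result."""
    text: str          # raw string returned by the engine
    confidence: float  # assumed to be normalized to [0, 1]


def grade_or_escalate(reading: Reading,
                      legal_answers: Set[str],
                      min_confidence: float = 0.80) -> Optional[str]:
    """Return the accepted answer, or None if a human should adjudicate."""
    cleaned = reading.text.strip()
    # Reject anything that is not one of the answers this question allows.
    if cleaned not in legal_answers:
        return None
    # Reject legal-looking answers the engine itself is unsure about.
    if reading.confidence < min_confidence:
        return None
    return cleaned
```

Anything that comes back as `None` would go to the human queue; the threshold would have to be tuned on real scans.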
Tesseract claims to work on handwriting and runs on both Linux and Windows (via MinGW), so it seemed like a good fit.
I extracted a sample of handwritten data from a form. Here is the sample:
In this sample, the border of the rectangle has not been cropped out, but I expected Tesseract to find my 64 anyway. It failed.
When I cropped the bounding box, it worked.
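I cropped by hand, but the same effect could presumably be automated. The following is only a rough sketch of one way to do it, assuming the crop has already been binarized into a list of rows (1 = ink, 0 = paper) and that the printed box is axis-aligned; the function name and the 0.8 fraction are made up for illustration.

```python
def strip_box_lines(pixels, line_fraction=0.8):
    """Blank out rows and columns that are mostly ink.

    A row or column that is at least `line_fraction` ink is assumed to be
    part of the printed box rather than handwriting.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0

    # Rows that are almost entirely ink are treated as horizontal box lines.
    ruled_rows = {y for y in range(height)
                  if sum(pixels[y]) >= line_fraction * width}
    # Columns that are almost entirely ink are treated as vertical box lines.
    ruled_cols = {x for x in range(width)
                  if sum(row[x] for row in pixels) >= line_fraction * height}

    # Return a copy with the detected lines erased; other pixels are unchanged.
    return [[0 if (y in ruled_rows or x in ruled_cols) else pixels[y][x]
             for x in range(width)]
            for y in range(height)]
```

This would not cope with a skewed scan or with handwriting that touches the border, which is part of why I am asking about more robust options.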
Although I can work around the problem in this case, I would like to know whether there is anything I can do to make recognition more robust, because the bounding box seemed innocuous and I am worried that any trivial noise could ruin detection.
Is there a better open source package I could use?
Is there a way to improve the training for my application? I think I could create a "language" for single letters and a different one for integers, then load multiple Tesseract engines, each specialized for one kind of question.
Is there a way in the internal API to give it a list of the potential strings or the legal character set, i.e. hinting to improve accuracy?
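To illustrate the kind of hinting I mean, here is a sketch that assumes the third-party Python wrapper `pytesseract` and Tesseract's `tessedit_char_whitelist` config variable. The question-type names and character sets are hypothetical, and I have not verified how well the whitelist behaves on handwriting or across Tesseract versions.

```python
# Hypothetical mapping from question type to the characters that are legal.
QUESTION_CHARSETS = {
    "integer": "0123456789",
    "letter": "ABCDE",
}


def tesseract_config(question_type, psm=7):
    """Build a Tesseract config string for one question type.

    --psm 7 asks Tesseract to treat the image as a single line of text.
    """
    charset = QUESTION_CHARSETS[question_type]
    return f"--psm {psm} -c tessedit_char_whitelist={charset}"


def read_answer(image, question_type):
    """Run OCR on one cropped answer box (requires pytesseract and Tesseract)."""
    import pytesseract  # imported lazily so the config helper works without it
    text = pytesseract.image_to_string(image, config=tesseract_config(question_type))
    return text.strip()
```

For example, `tesseract_config("integer")` gives `--psm 7 -c tessedit_char_whitelist=0123456789`. What I do not know is whether there is something stronger than a character whitelist, such as restricting the output to a fixed list of strings.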