I have trained a Tesseract 4 LSTM model against a set of ~30,000 ground truth images that I generated (as opposed to using "real" images from scanned works, of which I do not have enough to reliably train a model).
The model works well (or at least better than `eng`, on which it is based). The image generation script has several parameters that I can adjust, but I'd like to do that in a more "ordered" way than just eyeballing the output, so I'd like to generate metrics based on accuracy across a (much smaller) set of real-world images.
However, it is not clear to me how you can take a set of line images and ground-truth text files and generate the files required to run `lstmeval` on the new model. How do you generate the data to feed to `lstmeval` when the evaluation images are not related to the images actually used to train the model in the first place?
You can generate the `.lstmf` files needed for the evaluation like this, assuming the evaluation ground truth is in `tesstrain/data/eval-ground-truth`:
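(A sketch based on the standard tesstrain Makefile layout; `MODEL_NAME=eval` is an arbitrary name chosen here so that the Makefile's default `GROUND_TRUTH_DIR` resolves to `data/eval-ground-truth`.)

```shell
# Run from the tesstrain checkout. The "lists" target compiles each
# line image / .gt.txt pair under data/eval-ground-truth into an
# .lstmf file and writes the list files under data/eval/.
make lists MODEL_NAME=eval
```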
This will generate a file `data/eval/all-lstmf`, which contains a list of all the `.lstmf` files generated. `list.eval` contains only a subset, because the ground-truth corpus is partitioned into evaluation and training sets (according to `RATIO_TRAIN`). You can then run `lstmeval`:
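(A sketch; `data/mymodel/mymodel.traineddata` is a placeholder for your trained model's path. Since every evaluation image is unseen by this model, you can point `--eval_listfile` at the full `all-lstmf` list rather than just `list.eval`.)

```shell
# Evaluate the trained model against every evaluation sample.
lstmeval \
  --model data/mymodel/mymodel.traineddata \
  --eval_listfile data/eval/all-lstmf
```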
This produces a report of any differences between the recognized text and the ground truth, followed by a summary line giving the overall character and word error rates. (For demonstration purposes, a mistake was deliberately added to the ground truth of one `.gt.txt` file to provoke an error.) If there are no errors, the reported error rates are simply zero.
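Since the goal is to compare image-generation settings numerically, it may help to pull the error rates out of that summary line programmatically. A minimal sketch, assuming the summary follows `lstmeval`'s usual `Eval Char error rate=..., Word error rate=...` format (the function name is ours, not part of any tool):

```python
import re

# Matches the error-rate figures in lstmeval's final summary line, e.g.
# "At iteration 0, stage 0, Eval Char error rate=0.7074, Word error rate=3.2814"
SUMMARY_RE = re.compile(
    r"Eval Char error rate=(?P<cer>[\d.]+), Word error rate=(?P<wer>[\d.]+)"
)

def parse_lstmeval_summary(output):
    """Return (char_error_rate, word_error_rate), or None if no summary found."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return float(match.group("cer")), float(match.group("wer"))
```

Capturing `lstmeval`'s stderr/stdout for each parameter combination and feeding it through this gives a pair of numbers per run, which is easier to rank than eyeballed output.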