How to Train Tesseract 5 for Amharic Texts in Old Scanned Books

Background

I'm trying to use tesseract 5.3.3 on old scanned books written in Amharic (which uses the Ethiopic script).

Major Shortcomings of amh.traineddata from tesseract

Differences in the Ethiopic character set: old Amharic texts contain Ethiopic characters that are not included in the unicharset of amh.traineddata.

Differences in punctuation style: the old texts use some punctuation marks that are not used in modern Amharic. Even for the marks that are shared, the spacing differs: the old texts always put a space between a punctuation character and both the preceding and the following word, whereas modern Amharic puts no space between a punctuation character and the preceding word.

Very small training_text & wordlist (based on tesseract/langdata_lstm): the amh.training_text and amh.wordlist files used by tesseract (the ones from langdata_lstm) are very small.

(To give you an idea: for tir.traineddata (Tigrinya, another language written in Ethiopic script), the tir.training_text in langdata_lstm has more than 400,000 lines, while amh.training_text has only around 400 lines.)

Other challenges

  • The old Amharic books use typefaces that are no longer in use and are not available as digital fonts.
  • The old Amharic books contain many Ge'ez words (Ge'ez is a liturgical language, comparable to Latin, that is also written in Ethiopic script).
  • The old Amharic books mostly use Ge'ez numerals, while modern Amharic texts use Arabic numerals.

What I've Done So Far

As an experiment, I fine-tuned amh.traineddata (from tessdata_best) for 10,000 iterations, using close to 300 line images and their transcriptions (taken from sample pages of some old Amharic books) together with files from langdata_lstm.
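For reference, the fine-tuning followed the usual tesseract-ocr/tesstrain Makefile workflow, roughly like the sketch below (the model name amh_old and all the paths are placeholders):

    # Fine-tune the stock Amharic model on my own line images.
    # Line images and their .gt.txt transcriptions live in data/amh_old-ground-truth/.
    make training \
        MODEL_NAME=amh_old \
        START_MODEL=amh \
        TESSDATA=/path/to/tessdata_best \
        MAX_ITERATIONS=10000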

The resulting model shows a very satisfactory improvement on some of the challenges mentioned above, especially the punctuation. But it still fails on the characters that are not present in the unicharset of amh.traineddata, and it fails on almost all Ge'ez numerals, even though the training sample pages contain many of them.

What I'm Planning to Do

First I want to train tesseract with large training_text & wordlist files and a complete unicharset file, and then fine-tune the resulting traineddata on sample line images from the old books.
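The two-stage workflow I have in mind looks roughly like the sketch below. Everything in it is tentative: custom_langdata is my own layout (a copy of langdata_lstm/amh/ with the enlarged files), the font and the paths are placeholders, and the script is tesstrain.sh or its Python port tesstrain.py, depending on the source tree.

    # Stage 1: render synthetic line data from an enlarged langdata set.
    # custom_langdata/amh/ mirrors langdata_lstm/amh/ but with a much larger
    # amh.training_text, amh.wordlist and a complete amh.unicharset.
    tesstrain.sh \
        --lang amh \
        --langdata_dir /path/to/custom_langdata \
        --tessdata_dir /path/to/tessdata_best \
        --fonts_dir /path/to/ethiopic_fonts \
        --fontlist "Abyssinica SIL" \
        --linedata_only \
        --output_dir /path/to/amh_synth

    # Stage 2: fine-tune the model trained on this synthetic data using real
    # line images from the old books (same make training setup as above).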

Questions (for now; I'll definitely add more later)

Is there a different path I should take that would get me where I want to go more efficiently?

Regarding training tesseract with large training_text & wordlist files, and also a complete unicharset file:

  • How should I prepare the training_text and wordlist files? (What should these text files contain?)
  • How should I prepare the unicharset file, and how do I pass it to the make training command? (See the sketch right after this list.)
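Here is what I've pieced together so far for the wordlist and the unicharset. It is only a sketch: the norm_mode value is a guess, the paths are placeholders, and handing the result to make training is exactly the part I don't know how to do.

    # Naive wordlist: split the training text on spaces, one unique token per line.
    tr ' ' '\n' < amh.training_text | sort -u > amh.wordlist

    # Extract a unicharset covering every character that occurs in the training text.
    # (norm_mode 1 combines graphemes; I'm not sure this is the right choice for Ethiopic.)
    unicharset_extractor \
        --output_unicharset amh.unicharset \
        --norm_mode 1 \
        amh.training_text

    # Fill in script and direction properties from the langdata script files.
    set_unicharset_properties \
        -U amh.unicharset \
        -O amh.unicharset \
        --script_dir=/path/to/langdata_lstm

    # Open question: how do I get tesstrain's "make training" to use this unicharset?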

Regarding generating the text, image (.tif) and box files from training_text: I've looked at Python scripts that do this job, but I have questions about the proper values for these text2image parameters: --font (what criteria should I use to select the list of fonts?), --leading, --xsize, --ysize, --char_spacing, --exposure, --unicharset_file and --margin.
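For context, this is the shape of the text2image call in my current script. Every numeric value is a guess I'd like feedback on, and "Abyssinica SIL" is just the one Ethiopic font I happen to have installed:

    # Render training pages (and box files) from the training text.
    text2image \
        --text=amh.training_text \
        --outputbase=amh_synth/amh.AbyssinicaSIL.exp0 \
        --font="Abyssinica SIL" \
        --fonts_dir=/path/to/ethiopic_fonts \
        --unicharset_file=amh.unicharset \
        --leading=32 \
        --xsize=3600 \
        --ysize=4800 \
        --char_spacing=0.0 \
        --exposure=0 \
        --margin=100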

And finally, I've observed that the example training line images in tesseract/tesstrain are tightly cropped with minimal space around the text line. Should the line images generated from the training_text file be tightly cropped too?

Thanks for Your Time
