Tesseract Training - Error reading radical code table data/langdata/radical-stroke.txt

I tried to train Tesseract OCR on a specific font, starting from the Polish language model (pol) and my own "ground truth" text. It may be relevant that my ground truth does not contain every character of the Polish character set, because my OCR application does not use all of them.

Tesseract 5.3.2 built on Ubuntu 22.04.

Here is a snippet initializing the training:

TESSDATA_PREFIX=/home/xxx/tesseract/tessdata make training MODEL_NAME=POLcalibri START_MODEL=pol TESSDATA=/home/xxx/tesseract/tessdata MAX_ITERATIONS=1000

The training runs, and at the end the following output appears:

python3 shuffle.py 0 "data/POLcalibri/all-lstmf"
+ head -n 134999 data/POLcalibri/all-lstmf
+ tail -n 15000 data/POLcalibri/all-lstmf
+ '[' '' = Windows_NT ']'
if [ "" = "Windows_NT" ]; then \
    dos2unix "data/POLcalibri/POLcalibri.numbers"; \
    dos2unix "data/POLcalibri/POLcalibri.punc"; \
    dos2unix "data/POLcalibri/POLcalibri.wordlist"; \
    dos2unix "data/langdata/POLcalibri/POLcalibri.config"; \
fi
combine_lang_model \
  --input_unicharset data/POLcalibri/unicharset \
  --script_dir data/langdata \
  --numbers data/POLcalibri/POLcalibri.numbers \
  --puncs data/POLcalibri/POLcalibri.punc \
  --words data/POLcalibri/POLcalibri.wordlist \
  --output_dir data \
   \
  --lang POLcalibri
Failed to read data from: data/POLcalibri/POLcalibri.wordlist
Failed to read data from: data/POLcalibri/POLcalibri.punc
Failed to read data from: data/POLcalibri/POLcalibri.numbers
Loaded unicharset of size 121 from file data/POLcalibri/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = P
Warning: properties incomplete for index 4 = O
Warning: properties incomplete for index 5 = T
Warning: properties incomplete for index 6 = R
Warning: properties incomplete for index 7 = Z
Warning: properties incomplete for index 8 = E
Warning: properties incomplete for index 9 = B
Warning: properties incomplete for index 10 = N
Warning: properties incomplete for index 11 = )
Warning: properties incomplete for index 12 = G
Warning: properties incomplete for index 13 = U
Warning: properties incomplete for index 14 = J
Warning: properties incomplete for index 15 = !
Warning: properties incomplete for index 16 = ,
Warning: properties incomplete for index 17 = W
Warning: properties incomplete for index 18 = C
Warning: properties incomplete for index 19 = Ł
Warning: properties incomplete for index 20 = A
Warning: properties incomplete for index 21 = S
Warning: properties incomplete for index 22 = K
Warning: properties incomplete for index 23 = I
Warning: properties incomplete for index 24 = '
Warning: properties incomplete for index 25 = M
Warning: properties incomplete for index 26 = L
Warning: properties incomplete for index 27 = D
Warning: properties incomplete for index 28 = .
Warning: properties incomplete for index 29 = Ę
Warning: properties incomplete for index 30 = H
Warning: properties incomplete for index 31 = ?
Warning: properties incomplete for index 32 = Y
Warning: properties incomplete for index 33 = "
Warning: properties incomplete for index 34 = Ż
Warning: properties incomplete for index 35 = :
Warning: properties incomplete for index 36 = V
Warning: properties incomplete for index 37 = 6
Warning: properties incomplete for index 38 = 0
Warning: properties incomplete for index 39 = 8
Warning: properties incomplete for index 40 = F
Warning: properties incomplete for index 41 = Ą
Warning: properties incomplete for index 42 = Ć
Warning: properties incomplete for index 43 = Ś
Warning: properties incomplete for index 44 = /
Warning: properties incomplete for index 45 = Ó
Warning: properties incomplete for index 46 = _
Warning: properties incomplete for index 47 = (
Warning: properties incomplete for index 48 = Ń
Warning: properties incomplete for index 49 = ;
Warning: properties incomplete for index 50 = -
Warning: properties incomplete for index 51 = Q
Warning: properties incomplete for index 52 = X
Warning: properties incomplete for index 53 = |
Warning: properties incomplete for index 54 = „
Warning: properties incomplete for index 55 = 2
Warning: properties incomplete for index 56 = 3
Warning: properties incomplete for index 57 = 1
Warning: properties incomplete for index 58 = 7
Warning: properties incomplete for index 59 = 9
Warning: properties incomplete for index 60 = ”
Warning: properties incomplete for index 61 = +
Warning: properties incomplete for index 62 = ]
Warning: properties incomplete for index 63 = [
Warning: properties incomplete for index 64 = 4
Warning: properties incomplete for index 65 = 5
Warning: properties incomplete for index 66 = =
Warning: properties incomplete for index 67 = Ź
Warning: properties incomplete for index 68 = »
Warning: properties incomplete for index 69 = <
Warning: properties incomplete for index 70 = >
Warning: properties incomplete for index 71 = *
Warning: properties incomplete for index 72 = $
Warning: properties incomplete for index 73 = «
Warning: properties incomplete for index 74 = %
Warning: properties incomplete for index 75 = ©
Warning: properties incomplete for index 76 = €
Warning: properties incomplete for index 77 = —
Warning: properties incomplete for index 78 = £
Warning: properties incomplete for index 79 = l
Warning: properties incomplete for index 80 = o
Warning: properties incomplete for index 81 = r
Warning: properties incomplete for index 82 = e
Warning: properties incomplete for index 83 = n
Warning: properties incomplete for index 84 = t
Warning: properties incomplete for index 85 = y
Warning: properties incomplete for index 86 = ń
Warning: properties incomplete for index 87 = c
Warning: properties incomplete for index 88 = z
Warning: properties incomplete for index 89 = k
Warning: properties incomplete for index 90 = m
Warning: properties incomplete for index 91 = b
Warning: properties incomplete for index 92 = s
Warning: properties incomplete for index 93 = a
Warning: properties incomplete for index 94 = j
Warning: properties incomplete for index 95 = d
Warning: properties incomplete for index 96 = g
Warning: properties incomplete for index 97 = ł
Warning: properties incomplete for index 98 = ę
Warning: properties incomplete for index 99 = p
Warning: properties incomplete for index 100 = w
Warning: properties incomplete for index 101 = i
Warning: properties incomplete for index 102 = v
Warning: properties incomplete for index 103 = u
Warning: properties incomplete for index 104 = f
Warning: properties incomplete for index 105 = h
Warning: properties incomplete for index 106 = ó
Warning: properties incomplete for index 107 = x
Warning: properties incomplete for index 108 = ą
Warning: properties incomplete for index 109 = ż
Warning: properties incomplete for index 110 = ś
Warning: properties incomplete for index 111 = q
Warning: properties incomplete for index 112 = ć
Warning: properties incomplete for index 113 = ź
Warning: properties incomplete for index 114 = á
Warning: properties incomplete for index 115 = Ü
Warning: properties incomplete for index 116 = ü
Warning: properties incomplete for index 117 = ’
Warning: properties incomplete for index 118 = Ű
Warning: properties incomplete for index 119 = ű
Warning: properties incomplete for index 120 = Á
Config file is optional, continuing...
Failed to read data from: data/langdata/POLcalibri/POLcalibri.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:309: data/POLcalibri/POLcalibri.traineddata] Error 1

I have no idea how to solve it. A similar issue was raised on GitHub, but it has no solution.

There are 2 best solutions below

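Have you tried downloading radical-stroke.txt into data/langdata/? Your log shows that combine_lang_model could not read it, and that Latin.unicharset is missing from the same directory.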
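A minimal sketch of that download, run from the training directory. It assumes the files are served from the `main` branch of the tesseract-ocr/langdata_lstm repository and that `wget` is installed. The function names `missing_langdata` and `fetch_langdata` are made up for this example.

```shell
#!/bin/sh
# Assumption: langdata files are available from the langdata_lstm repository.
LANGDATA_URL="https://github.com/tesseract-ocr/langdata_lstm/raw/main"

# Print the names of the two files from the log that are absent
# (or empty) in the given langdata directory.
missing_langdata() {
    dir=$1
    for f in radical-stroke.txt Latin.unicharset; do
        [ -s "$dir/$f" ] || echo "$f"
    done
}

# Download whatever missing_langdata reports into the directory.
fetch_langdata() {
    dir=$1
    mkdir -p "$dir"
    for f in $(missing_langdata "$dir"); do
        wget -O "$dir/$f" "$LANGDATA_URL/$f"
    done
}

# Usage, from the directory that contains the Makefile:
#   fetch_langdata data/langdata
```

Checking with `-s` rather than `-e` means an empty file left behind by a failed download is fetched again on the next run.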

BTW: Try to read instructions before posting to SO.

I had the same error, also with Polish.

In my case, the error was caused by using a START_MODEL .traineddata file from the "default" tessdata repository instead of the one from the tessdata_best repository.

From the Tesseract documentation:

tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files which can be used as start_model for certain retraining scenarios for advanced users.
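A rough sketch of swapping in the tessdata_best model. It assumes the model is served from the `main` branch of the tesseract-ocr/tessdata_best repository and that `wget` is installed. The helper names `best_model_url` and `fetch_best_model` are invented for illustration.

```shell
#!/bin/sh
# Assumption: tessdata_best models are available under this URL.
BEST_URL="https://github.com/tesseract-ocr/tessdata_best/raw/main"

# Build the download URL for a language code such as "pol".
best_model_url() {
    echo "$BEST_URL/$1.traineddata"
}

# Download <lang>.traineddata into the given tessdata directory.
# Note: this overwrites an existing file of the same name.
fetch_best_model() {
    lang=$1
    dest=$2
    mkdir -p "$dest"
    wget -O "$dest/$lang.traineddata" "$(best_model_url "$lang")"
}

# Usage with the paths from the question (xxx is the asker's placeholder):
#   fetch_best_model pol /home/xxx/tesseract/tessdata
#   make training MODEL_NAME=POLcalibri START_MODEL=pol \
#       TESSDATA=/home/xxx/tesseract/tessdata MAX_ITERATIONS=1000
```

Back up the existing pol.traineddata first if you still need it for recognition.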