How can I train the EAST text detector on my custom data? There aren't any blogs online that show a step-by-step procedure for doing this. Here is what I currently have.
I have a folder that contains all the images, and a corresponding XML file for each image that tells where the text is located.
Example:
<annotation>
<folder>Dataset</folder>
<filename>FFDDAPMDD1.png</filename>
<path>C:\Users\HPO2KOR\Desktop\Work\venv\Patent\Dataset\Dataset\FFDDAPMDD1.png</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>839</width>
<height>1000</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>522</xmin>
<ymin>29</ymin>
<xmax>536</xmax>
<ymax>52</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>510</xmin>
<ymin>258</ymin>
<xmax>521</xmax>
<ymax>281</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>546</xmin>
<ymin>528</ymin>
<xmax>581</xmax>
<ymax>555</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>523</xmin>
<ymin>646</ymin>
<xmax>555</xmax>
<ymax>674</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>410</xmin>
<ymin>748</ymin>
<xmax>447</xmax>
<ymax>776</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>536</xmin>
<ymin>826</ymin>
<xmax>567</xmax>
<ymax>851</ymax>
</bndbox>
</object>
<object>
<name>text</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>792</xmin>
<ymin>918</ymin>
<xmax>838</xmax>
<ymax>945</ymax>
</bndbox>
</object>
</annotation>
I have also parsed the XML annotations for each of my images into the format used to train YOLO models, one line per image.
Example:
C:\Users\HPO2KOR\...\text\FFDDAPMDD1.png 522,29,536,52,0 510,258,521,281,0 546,528,581,555,0 523,646,555,674,0 410,748,447,776,0 536,826,567,851,0 792,918,838,945,0 660,918,706,943,0 63,1,108,24,0 65,51,110,77,0 65,101,109,126,0 63,151,110,175,0 63,202,109,228,0 63,252,110,276,0 63,303,110,330,0 62,353,110,381,0 65,405,109,434,0 90,457,110,482,0 59,505,101,534,0 64,565,107,590,0 61,616,107,644,0 62,670,103,694,0 62,725,104,753,0 63,778,104,804,0 62,831,100,857,0 87,887,106,912,0 98,919,144,943,0 240,916,284,943,0 378,915,420,943,0 520,918,565,942,0
C:\Users\HPO2KOR\...\text\FFDDAPMDD2.png 91,145,109,171,0 68,192,106,218,0 92,239,111,265,0 69,286,108,311,0 92,333,107,357,0 66,379,110,405,0 90,424,111,451,0 69,472,107,497,0 91,518,109,545,0 66,564,109,590,0 90,613,110,637,0 121,644,140,670,0 279,643,322,671,0 446,645,490,668,0 615,642,661,669,0 786,643,831,667,0 954,643,997,672,0 820,22,866,50,0 823,73,866,103,0
C:\Users\HPO2KOR\...\text\FFDDAPMDD3.png 648,1,698,30,0 68,64,129,91,0 55,144,128,168,0 70,218,129,247,0 56,300,127,326,0 71,377,125,404,0 58,459,127,482,0 109,535,130,560,0 140,568,160,594,0 344,568,382,594,0 563,566,581,591,0 760,568,800,593,0 982,569,1000,591,0
What is the procedure to train the EAST text detector on my custom dataset? I am on Windows.
According to the documentation in the README, custom training the Keras implementation of EAST requires a folder of images with an accompanying text file for each image named gt_IMAGENAME.txt (replace IMAGENAME with the name of the image it maps to). In each text file, "the ground truth is given as separate text files (one per image) where each line specifies the coordinates of one word's bounding box and its transcription in a comma separated format." This quotation is from https://rrc.cvc.uab.es/?ch=4&com=tasks, which is linked from the README via the TensorFlow implementation of EAST at https://github.com/argman/EAST. The bounding box is expressed as the coordinates of its four corners.
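For example, assuming the ground-truth file for FFDDAPMDD1.png would be named gt_FFDDAPMDD1.txt and assuming the clockwise-from-upper-left corner order discussed below, the first two boxes in the XML above would become the lines

522,29,536,29,536,52,522,52,###
510,258,521,258,521,281,510,281,###

where ### stands in for a missing transcription. Note that ICDAR-style loaders usually treat ### boxes as "don't care" regions, so supply the real text instead if you have it.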
You seem to have all the information you need to construct training data in the right format. There could be a tool out there to convert everything, but a quick Python script will work just fine as well. Something like:

- For each image, parse the corresponding XML file.
- For each object tag, use the xmin, xmax, ymin, and ymax values to express the x,y coordinates of all four corners. In image coordinates (y grows downward), the upper-left corner is (xmin, ymin), the upper-right is (xmax, ymin), the lower-right is (xmax, ymax), and the lower-left is (xmin, ymax). The order, based on https://github.com/argman/EAST/blob/master/training_samples/img_1.txt, appears to be clockwise starting from the upper-left corner.
- Write one line per bounding box to gt_IMAGENAME.txt in the form x1, y1, x2, y2, x3, y3, x4, y4, transcription or x1, y1, x2, y2, x3, y3, x4, y4, ### (each line followed by a \n newline). A sketch of such a script is given after this list.
- Finally, run python train.py with all the command line arguments set the way the "execution example" is set up, but change the value after --training_data_path= to your path.
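Here is a minimal sketch of such a conversion script. It assumes the XML files sit next to the images in the Dataset folder from the question and that the gt_*.txt files should be written to the same folder (DATASET_DIR, OUTPUT_DIR and the gt_ naming are assumptions to adjust), and it uses the clockwise-from-upper-left corner order described above, so compare its output against the sample ground truth in the repo before training.

import glob
import os
import xml.etree.ElementTree as ET

# Adjust these to your layout (placeholder paths based on the question).
DATASET_DIR = r"C:\Users\HPO2KOR\Desktop\Work\venv\Patent\Dataset\Dataset"
OUTPUT_DIR = DATASET_DIR  # write gt_*.txt next to the images

for xml_path in glob.glob(os.path.join(DATASET_DIR, "*.xml")):
    root = ET.parse(xml_path).getroot()
    image_name = root.findtext("filename")        # e.g. FFDDAPMDD1.png
    base_name = os.path.splitext(image_name)[0]   # e.g. FFDDAPMDD1

    lines = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin = int(box.findtext("xmin"))
        ymin = int(box.findtext("ymin"))
        xmax = int(box.findtext("xmax"))
        ymax = int(box.findtext("ymax"))

        # Four corners, clockwise from the upper-left (image coordinates,
        # y grows downward). Swap the order here if your reference ground
        # truth uses a different convention.
        corners = [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]

        # No transcriptions are available in this dataset, so "###" is used.
        # ICDAR-style loaders usually treat "###" as a "don't care" region,
        # so put the real text here if you have it.
        lines.append(",".join(str(c) for c in corners) + ",###")

    gt_path = os.path.join(OUTPUT_DIR, "gt_{}.txt".format(base_name))
    with open(gt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    print("wrote", gt_path, "with", len(lines), "boxes")

A quick sanity check is to open one of the generated gt files alongside its image and confirm the boxes line up before kicking off training.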