Extracting text in a structured way not working with Transformer DONUT

I am currently fine-tuning the DONUT transformer (https://huggingface.co/docs/transformers/model_doc/donut) on the following task: I want it to extract only the paragraphs of my text documents, in this format:

<> Text of the paragraph <>" .

For this, I used the Donut notebooks for fine-tuning on document parsing (https://github.com/NielsRogge/Transformers-Tutorials/tree/master/Donut/CORD) and my own dataset of roughly 5,000 training documents (from DocLayNet).

I trained for 20 epochs with a learning rate of 3e-7 and a train batch size of 8. My training and validation losses are decreasing, but my tree edit distance (based on Levenshtein distance) is increasing, whereas I want it close to 0.
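
To make the metric concrete, here is a minimal sketch of a normalized Levenshtein distance between a predicted string and its target. It is an illustration only and not necessarily the exact metric computed in the notebook; the function names `levenshtein` and `normalized_edit_distance` are hypothetical. A score of 0 means a perfect match and 1 means nothing in common.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, target: str) -> float:
    """Edit distance divided by the longer length, so the result is in [0, 1]."""
    longest = max(len(prediction), len(target))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, target) / longest


if __name__ == "__main__":
    # A degenerate prediction made only of quote characters.
    print(normalized_edit_distance('"' * 10, "Text of the paragraph"))
```

With this definition, a prediction that shares no characters with the target scores the maximum of 1, which is consistent with the metric getting worse even while the loss goes down.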

What surprises me most is how bad DONUT's predictions are. It produces output like this:

Prediction: """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" "" "" "" "" "" "" " "" " "" " "" " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "

My question is: did I do something wrong, or is DONUT simply not designed for this task?

Thanks a lot.
