Problems with reproducing the training of the spaCy pipeline


I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm. This pipeline is trained on two datasets:

  1. UD_Italian-ISDT for the CoNLL-U tasks
  2. WikiNER for NER tagging

Where can I find more information about the data used for training? Did they use both the training and dev sets to train the pipeline? Did they group sentences together, as suggested by the spaCy convert command?

Also, how is it possible to train the pipeline on two datasets? Should I first train the pipeline on the first dataset and then train the NER component on the second, or can both be done simultaneously?
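
My current guess, and it is only a guess that I have not found confirmed anywhere for it_core_news_sm, is that both corpora can be converted to .spacy files placed in the same training directory, since the spaCy v3 corpus reader loads every .spacy file in a directory when [paths.train] points at one. A rough sketch of what I mean (the WikiNER file name and format here are hypothetical):

```bash
# Assumption: put both converted corpora in one directory so that a single
# training run uses them together via [paths.train] = "corpus/train".
python -m spacy convert it_isdt-ud-train.conllu corpus/train --converter conllu
python -m spacy convert wikiner_it_train.iob corpus/train --converter iob  # hypothetical WikiNER file in IOB format
```

Is this how the released pipeline was actually trained, or was NER trained separately on top?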

Anyway, for the moment I have trained the pipeline on just the UD_Italian-ISDT dataset for POS tagging (coarse-grained and fine-grained), parsing, lemmatization and morphological analysis, using the config file for the training available here. I used the train set for training and the validation set to evaluate the pipeline, and I obtained results far lower than those claimed by spaCy here. Here are my results:

pos_acc: 0.9020224719
morph_acc: 0.9004449638
tag_acc: 0.9001348315
dep_uas: 0.7801636499
dep_las: 0.7451524919
sents_p: 0.9754816112
sents_r: 0.9875886525
sents_f: 0.9814977974
lemma_acc: 0.9028083577
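
For reference, a rough sketch of the convert + train invocation I used (file names and paths are placeholders, not the exact ones):

```bash
# Convert the UD_Italian-ISDT train/dev splits to spaCy's binary format,
# grouping 10 sentences per doc (one of the things I was unsure about)
python -m spacy convert it_isdt-ud-train.conllu corpus/train --converter conllu --n-sents 10
python -m spacy convert it_isdt-ud-dev.conllu corpus/dev --converter conllu --n-sents 10

# Train with the config file linked above
python -m spacy train config.cfg --output ./output \
    --paths.train corpus/train --paths.dev corpus/dev
```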

Could someone help me with this? Where can I find more information about the training setup, and what could be causing these low scores?


1 Answer


The solution is to merge the subtokens while converting the dataset for training in spaCy (with the --merge-subtokens flag of the convert command). This is because of the way CoNLL-U handles multiword tokens such as "nel", "nella", "dello", "delle" (see here for more information about it).
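
A minimal sketch of the conversion step with that flag (file names are placeholders, adjust paths as needed):

```bash
# Re-convert the treebank, merging CoNLL-U multiword tokens
# (e.g. "nel" = "in" + "il") back into single surface tokens
python -m spacy convert it_isdt-ud-train.conllu corpus/train --converter conllu --merge-subtokens --n-sents 10
python -m spacy convert it_isdt-ud-dev.conllu corpus/dev --converter conllu --merge-subtokens --n-sents 10
```

After re-converting, retrain with the same config; with the tokenization matching the reference setup, the evaluation should come much closer to the published scores.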