I'm trying to reproduce the training of one of the spaCy pipelines for Italian: it_core_news_sm. This pipeline is trained on two datasets:
- UD_Italian-ISDT for the CoNLL-U tasks (tagging, parsing, lemmatization, morphology)
- WikiNER for NER tagging
Where can I find more info about the data used for training? Did they use both the training and dev sets to train the pipeline? Did they group sentences together, as suggested by the spacy convert command?
Also, how is it possible to train the pipeline on two datasets? Should I first train the pipeline on the first dataset and then train the NER component on the second, or can both be done simultaneously?
Anyway, for now I have trained the pipeline on just the UD_Italian-ISDT dataset for POS tagging (coarse-grained and fine-grained), parsing, lemmatization and morphological analysis, using the training config file available here. I used the train set for training and the validation set to evaluate the pipeline, and I obtained results far lower than those reported by spaCy here. Here are my results:
pos_acc: 0.9020224719
morph_acc: 0.9004449638
tag_acc: 0.9001348315
dep_uas: 0.7801636499
dep_las: 0.7451524919
sents_p: 0.9754816112
sents_r: 0.9875886525
sents_f: 0.9814977974
lemma_acc: 0.9028083577
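
For context, my conversion and training steps looked roughly like this (file names, output paths and the --n-sents value are illustrative, not necessarily what was used for the released pipeline):

```bash
# Convert the CoNLL-U files to spaCy's binary training format,
# grouping sentences into documents of 10 sentences each
python -m spacy convert it_isdt-ud-train.conllu ./corpus --converter conllu --n-sents 10
python -m spacy convert it_isdt-ud-dev.conllu ./corpus --converter conllu --n-sents 10

# Train with the config file, pointing paths.train / paths.dev at the converted data
python -m spacy train config.cfg --output ./output \
    --paths.train ./corpus/it_isdt-ud-train.spacy \
    --paths.dev ./corpus/it_isdt-ud-dev.spacy
```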
Could someone help me with this? Where can I find more info about the training setup, and what could be causing these low scores?
The solution is to merge the subtokens while converting the dataset for training in spaCy (with the --merge-subtokens flag of the convert command). This is because of the way CoNLL-U handles multiword tokens such as "nel", "nella", "dello", "delle" (see here for more information about it).
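
A minimal sketch of the re-conversion with subtoken merging (file names and the --n-sents value are illustrative):

```bash
# Re-convert with subtokens merged, so multiword tokens like "nel"
# are kept as a single token in the gold data instead of being split
# into their CoNLL-U subtokens ("in" + "il")
python -m spacy convert it_isdt-ud-train.conllu ./corpus --converter conllu --merge-subtokens --n-sents 10
python -m spacy convert it_isdt-ud-dev.conllu ./corpus --converter conllu --merge-subtokens --n-sents 10
```

After re-converting the corpus this way and retraining with the same config, the token boundaries in the training data match the tokens the pipeline actually produces, and the scores are no longer penalized by the mismatched multiword tokens.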