Extreme performance disparity between training output and evaluate function with flair NLP?


I've trained a custom NER model in flair. After training completed, it printed the following results:

Results:
- F1-score (micro) 0.5714
- F1-score (macro) 0.4831

By class:
SymProp    tp: 13 - fp: 25 - fn: 21 - precision: 0.3421 - recall: 0.3824 - f1-score: 0.3611
SymRel     tp: 3 - fp: 3 - fn: 7 - precision: 0.5000 - recall: 0.3000 - f1-score: 0.3750
Symptom    tp: 46 - fp: 19 - fn: 18 - precision: 0.7077 - recall: 0.7188 - f1-score: 0.7132

Then I ran the evaluate function on the saved model with this code:

from flair.models import SequenceTagger

# Load the final model written at the end of training
tagger = SequenceTagger.load('/content/flairmodels/ner/final-model.pt')

# corpus is the Corpus object created earlier (not shown here)
result, score = tagger.evaluate(corpus.test, mini_batch_size=1, out_path="predictions.txt")
print(result.detailed_results)

This printed:

Results:
- F1-score (micro) 0.9580
- F1-score (macro) 0.9520
By class:
SymProp    tp: 48 - fp: 3 - fn: 4 - precision: 0.9412 - recall: 0.9231 - f1-score: 0.9320
SymRel     tp: 17 - fp: 0 - fn: 2 - precision: 1.0000 - recall: 0.8947 - f1-score: 0.9444
Symptom    tp: 72 - fp: 2 - fn: 1 - precision: 0.9730 - recall: 0.9863 - f1-score: 0.9796

This has confused me a lot. One result is quite bad, whereas the other is excellent. Both runs used a small dataset. Let me know if I'm thoroughly misunderstanding something. Thanks so much.
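As a sanity check on the two reports, the micro F1 and the number of gold entities per class (tp + fn) can be recomputed from the printed counts. This is only a minimal sketch in plain Python: the counts are copied from the two outputs above, and the names `training_report`, `evaluate_report`, `micro_f1` and `gold_entities` are hypothetical helpers, not part of flair.

```python
# Per-class (tp, fp, fn) counts copied from the two reports above.
training_report = {
    "SymProp": (13, 25, 21),
    "SymRel": (3, 3, 7),
    "Symptom": (46, 19, 18),
}
evaluate_report = {
    "SymProp": (48, 3, 4),
    "SymRel": (17, 0, 2),
    "Symptom": (72, 2, 1),
}


def micro_f1(report):
    """Micro F1 from pooled counts: 2*tp / (2*tp + fp + fn)."""
    tp = sum(counts[0] for counts in report.values())
    fp = sum(counts[1] for counts in report.values())
    fn = sum(counts[2] for counts in report.values())
    return 2 * tp / (2 * tp + fp + fn)


def gold_entities(report):
    """Gold (annotated) entities per class, i.e. tp + fn."""
    return {label: tp + fn for label, (tp, fp, fn) in report.items()}


for name, report in [("training", training_report), ("evaluate", evaluate_report)]:
    gold = gold_entities(report)
    # Print the recomputed micro F1, per-class gold counts, and their total.
    print(name, round(micro_f1(report), 4), gold, sum(gold.values()))
```

Both recomputed micro F1 values agree with the printed ones, but the gold counts do not match: 34 / 10 / 64 (108 in total) in the training report versus 52 / 19 / 73 (144 in total) in the evaluate report. If that reading is right, the two reports were probably not computed on the same set of sentences, so the `corpus.test` passed to evaluate may differ from the split that was scored at the end of training.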
