Extreme performance disparity between training output and evaluate function with flair NLP?

Asked by George on 29 June 2025 at 23:40 (192 views)

I've trained a custom NER model in flair. When training completed, it printed the following results:

Results:
- F1-score (micro) 0.5714
- F1-score (macro) 0.4831

By class:
SymProp    tp: 13 - fp: 25 - fn: 21 - precision: 0.3421 - recall: 0.3824 - f1-score: 0.3611
SymRel     tp: 3 - fp: 3 - fn: 7 - precision: 0.5000 - recall: 0.3000 - f1-score: 0.3750
Symptom    tp: 46 - fp: 19 - fn: 18 - precision: 0.7077 - recall: 0.7188 - f1-score: 0.7132

Then I ran the evaluate function with this code:

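from flair.models import SequenceTagger

# Load the final model saved at the end of training
tagger = SequenceTagger.load('/content/flairmodels/ner/final-model.pt')

# corpus is the flair Corpus built earlier (its construction is not shown here)
result, score = tagger.evaluate(corpus.test, mini_batch_size=1, out_path="predictions.txt")

print(result.detailed_results)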

which printed:

Results:
- F1-score (micro) 0.9580
- F1-score (macro) 0.9520
By class:
SymProp    tp: 48 - fp: 3 - fn: 4 - precision: 0.9412 - recall: 0.9231 - f1-score: 0.9320
SymRel     tp: 17 - fp: 0 - fn: 2 - precision: 1.0000 - recall: 0.8947 - f1-score: 0.9444
Symptom    tp: 72 - fp: 2 - fn: 1 - precision: 0.9730 - recall: 0.9863 - f1-score: 0.9796

This has confused me a lot. One result is quite poor, whereas the other is excellent. The dataset is small. Let me know if I'm thoroughly misunderstanding something. Thanks so much.
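One check that needs nothing beyond the numbers in the two reports: recompute the micro and macro F1 from the per-class counts, and add up tp + fn per class, which is the number of gold entities each report was scored against. The sketch below is plain Python arithmetic. The helper names `f1` and `summarize` are made up for illustration and are not part of flair.

```python
# Per-class (tp, fp, fn) counts copied from the two reports in the question.
training_report = {"SymProp": (13, 25, 21), "SymRel": (3, 3, 7), "Symptom": (46, 19, 18)}
evaluate_report = {"SymProp": (48, 3, 4), "SymRel": (17, 0, 2), "Symptom": (72, 2, 1)}


def f1(tp, fp, fn):
    """F1 written directly in terms of counts: 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(report):
    """Return (micro F1, macro F1, gold entity count per class) for one report."""
    tp = sum(counts[0] for counts in report.values())
    fp = sum(counts[1] for counts in report.values())
    fn = sum(counts[2] for counts in report.values())
    micro = f1(tp, fp, fn)  # pool the counts, then score once
    macro = sum(f1(*counts) for counts in report.values()) / len(report)  # average per-class F1
    gold = {label: counts[0] + counts[2] for label, counts in report.items()}  # tp + fn
    return round(micro, 4), round(macro, 4), gold


for name, report in [("training output", training_report), ("evaluate()", evaluate_report)]:
    micro, macro, gold = summarize(report)
    print(name, "micro:", micro, "macro:", macro, "gold:", gold, "total:", sum(gold.values()))
```

Both sets of F1 scores are reproduced by the counts, so each report is internally consistent. The gold totals differ, though: 108 entities (34 / 10 / 64) in the training output versus 144 (52 / 19 / 73) in the `evaluate` call. That suggests the two reports were not computed over the same sentences, so it may be worth checking whether `corpus.test` in the second snippet is the same split the trainer reported on.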
