I've trained a custom NER model in Flair. When training completed, it printed the following results:
Results:
- F1-score (micro) 0.5714
- F1-score (macro) 0.4831
By class:
SymProp tp: 13 - fp: 25 - fn: 21 - precision: 0.3421 - recall: 0.3824 - f1-score: 0.3611
SymRel tp: 3 - fp: 3 - fn: 7 - precision: 0.5000 - recall: 0.3000 - f1-score: 0.3750
Symptom tp: 46 - fp: 19 - fn: 18 - precision: 0.7077 - recall: 0.7188 - f1-score: 0.7132
Then I ran the evaluate function with this code:
from flair.models import SequenceTagger
tagger = SequenceTagger.load('/content/flairmodels/ner/final-model.pt')
result, score = tagger.evaluate(corpus.test, mini_batch_size=1, out_path="predictions.txt")
print(result.detailed_results)
which printed:
Results:
- F1-score (micro) 0.9580
- F1-score (macro) 0.9520
By class:
SymProp tp: 48 - fp: 3 - fn: 4 - precision: 0.9412 - recall: 0.9231 - f1-score: 0.9320
SymRel tp: 17 - fp: 0 - fn: 2 - precision: 1.0000 - recall: 0.8947 - f1-score: 0.9444
Symptom tp: 72 - fp: 2 - fn: 1 - precision: 0.9730 - recall: 0.9863 - f1-score: 0.9796
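As a sanity check on the two reports, the numbers can be recomputed by hand from the tp/fp/fn counts. The sketch below is plain Python with no Flair dependency; the names `training_report`, `evaluate_report`, `micro_f1`, and `gold_counts` are made up for illustration. It pools the counts to reproduce the micro F1 and sums tp + fn per class to get the number of gold entities each report was scored against.

```python
# Counts copied from the two reports above: class -> (tp, fp, fn).
training_report = {
    "SymProp": (13, 25, 21),
    "SymRel": (3, 3, 7),
    "Symptom": (46, 19, 18),
}
evaluate_report = {
    "SymProp": (48, 3, 4),
    "SymRel": (17, 0, 2),
    "Symptom": (72, 2, 1),
}


def micro_f1(report):
    """Micro-averaged F1: pool tp/fp/fn over all classes, then compute F1."""
    tp = sum(counts[0] for counts in report.values())
    fp = sum(counts[1] for counts in report.values())
    fn = sum(counts[2] for counts in report.values())
    return 2 * tp / (2 * tp + fp + fn)


def gold_counts(report):
    """Gold entities per class, i.e. tp + fn."""
    return {label: tp + fn for label, (tp, fp, fn) in report.items()}


for name, report in [("training", training_report), ("evaluate", evaluate_report)]:
    print(name, round(micro_f1(report), 4), gold_counts(report))
```

Both micro F1 values reproduce (0.5714 and 0.9580), but the gold totals differ: 108 entities in the end-of-training report versus 144 in the evaluate call. So the two reports do not appear to have been scored against the same set of gold annotations.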
This has confused me a lot. One result is quite bad, whereas the other is excellent. The dataset is small. Let me know if I'm misunderstanding something. Thanks so much.