Why is the F1 score of NER lower after optimizing CRF hyperparameters on the development set?

from sklearn_crfsuite import CRF, metrics

# X_train, y_train  – training split (per-token feature dicts and OBI label sequences)
# X_dev,   y_dev    – development split, used for hyperparameter tuning
# X_test,  y_test   – held-out test split

# baseline run
baseline = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
baseline.fit(X_train, y_train)
print('Overall Accuracy on Test Set:', baseline.score(X_test, y_test), '\n')
labels = list(baseline.classes_)
labels.remove('O')  # evaluate only the entity labels, not the outside tag
# group the B-/I- variants of each entity type together in the report
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(y_test, baseline.predict(X_test), labels=sorted_labels, digits=4))

I split a dataset, in which each text is represented by a list of (token, POS tag, OBI label) tuples, into a training set, a development set, and a test set in a 0.6:0.2:0.2 ratio, and I am doing Named Entity Recognition (NER) with Conditional Random Fields (CRF) via sklearn_crfsuite. After optimizing the hyperparameters on the development set, the weighted average F1 score on the test set comes out lower than in the baseline run above, where the hyperparameters were simply set by hand. I find this counterintuitive.
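For concreteness, the dev-set tuning step is along these lines (a simplified sketch; the grid values and the weighted-F1 selection criterion here are illustrative, not my exact setup):

# A simplified sketch of the dev-set tuning step; reuses X_train, y_train,
# X_dev, y_dev and sorted_labels from the code above.
from itertools import product
from sklearn_crfsuite import CRF, metrics

best_f1, best_params = -1.0, None
for c1, c2 in product([0.01, 0.05, 0.1, 0.5, 1.0], repeat=2):
    model = CRF(algorithm='lbfgs', c1=c1, c2=c2,
                max_iterations=100, all_possible_transitions=True)
    model.fit(X_train, y_train)
    # select on weighted F1 over the entity labels (O excluded), computed on the dev set
    f1 = metrics.flat_f1_score(y_dev, model.predict(X_dev),
                               average='weighted', labels=sorted_labels)
    if f1 > best_f1:
        best_f1, best_params = f1, (c1, c2)

print('best (c1, c2) on dev:', best_params, 'dev F1:', best_f1)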

I know this question is fairly vague, but any advice on where to look? How can I tell whether this is a normal situation or whether my model has gone wrong somewhere? Should I use a larger development set? Would it be more appropriate to optimize the hyperparameters by cross-validation? Or do I need to go back and revise the features?
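If cross-validation is the way to go, I would presumably follow the randomized-search pattern from the sklearn_crfsuite tutorial, something like the sketch below (untested in my setup; newer scikit-learn releases may need a compatibility workaround for CRF):

# A sketch of the cross-validation alternative, following the randomized search
# shown in the sklearn_crfsuite tutorial.
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import CRF, metrics

crf = CRF(algorithm='lbfgs', max_iterations=100, all_possible_transitions=True)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),   # regularization strengths sampled
    'c2': scipy.stats.expon(scale=0.05),  # from exponential distributions
}
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=sorted_labels)

rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=50,
                        scoring=f1_scorer, n_jobs=-1, verbose=1)
rs.fit(X_train, y_train)  # tune on training folds only, keeping the test set untouched
print('best params:', rs.best_params_, 'best CV F1:', rs.best_score_)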

By the way, I'm aware that a mismatch between the data splits could cause this, so I checked that specifically: the distributions of text length, POS tags, and OBI labels all look sufficiently similar across the training, development, and test sets.
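This is a minimal sketch of the kind of distribution check I mean (the exact comparison I ran may differ; text length and POS tags were compared in the same spirit):

# Compare relative OBI label frequencies across the three splits.
from collections import Counter

def label_distribution(y):
    """Relative frequency of each OBI label over all tokens in a split."""
    counts = Counter(label for sentence in y for label in sentence)
    total = sum(counts.values())
    return {label: round(count / total, 4) for label, count in counts.items()}

for name, y_split in [('train', y_train), ('dev', y_dev), ('test', y_test)]:
    print(name, label_distribution(y_split))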
