I am trying to extend the baseline code in nltk.classify.rte_classify with more features in order to improve the accuracy of the model. It uses MaxentClassifier. My problem is that every time I run my code I get a different accuracy (the results are listed after the code). scikit-learn classifiers usually have a random_state parameter that makes results reproducible, and I want the same for MaxentClassifier. I checked the NLTK documentation but could not find anything similar to random_state.

    import nltk  # needed for nltk.MaxentClassifier below
    import nltk.classify.rte_classify as classify
    from nltk.classify.util import accuracy


    def rte_classifier(algorithm):
        from nltk.corpus import rte as rte_corpus

        train_set = rte_corpus.pairs(['rte1_dev.xml', 'rte2_dev.xml', 'rte3_dev.xml'])
        test_set = rte_corpus.pairs(['rte1_test.xml'])
        featurized_train_set = classify.rte_featurize(train_set)
        featurized_test_set = classify.rte_featurize(test_set)

        # Train the classifier
        print('Training classifier...')
        if algorithm in ['GIS', 'IIS']:  # Use default GIS/IIS MaxEnt algorithm
            clf = nltk.MaxentClassifier.train(featurized_train_set, algorithm)
        else:
            err_msg = (
                "RTEClassifier only supports these algorithms:\n "
                " 'GIS', 'IIS'.\n")
            raise ValueError(err_msg)

        # Evaluate on the held-out test set
        print('Testing classifier...')
        acc = accuracy(clf, featurized_test_set)
        print('Accuracy: %6.4f' % acc)
        return clf


    rte_classifier('GIS')

- 1st run: Accuracy: 0.5929
- 2nd run: Accuracy: 0.5908
- 3rd run: Accuracy: 0.5854
- 4th run: Accuracy: 0.5913
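
To put a number on that spread, here is a small sketch that summarizes the four accuracies listed above. Only the four values come from my runs; the variable names are just for illustration.

```python
from statistics import mean, pstdev

# Accuracies copied from the four runs listed above.
accuracies = [0.5929, 0.5908, 0.5854, 0.5913]

# Best run minus worst run.
spread = max(accuracies) - min(accuracies)

print(f"mean   = {mean(accuracies):.4f}")
print(f"stdev  = {pstdev(accuracies):.4f}")
print(f"spread = {spread:.4f}")
```

For these four runs the gap between the best and worst accuracy is 0.0075, i.e. about 0.75 percentage points.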
The variation in accuracy on this test set may look small, but on my own dataset, which has a large number of features, the difference sometimes reaches 10%.
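
One general Python source of run-to-run differences that I have not ruled out is string hash randomization: the iteration order of a set of strings can change between interpreter launches unless the PYTHONHASHSEED environment variable is fixed. Whether this is what actually makes MaxentClassifier vary is only a guess on my part, not something I found in the NLTK documentation. The sketch below just checks that pinning the seed makes set ordering repeatable; the helper name `set_order` and the sample words are hypothetical.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a small set of strings.
CHILD = "print(list({'premise', 'hypothesis', 'entailment', 'overlap'}))"


def set_order(hash_seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED pinned."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same seed in two separate interpreter launches: the order should match.
print(set_order(0) == set_order(0))
# Different seeds may or may not produce a different order.
print(set_order(1), set_order(2), sep="\n")
```

If this turns out to be relevant, the equivalent for my script would be launching it as `PYTHONHASHSEED=0 python my_script.py` (the script name is a placeholder) and checking whether the accuracy stops changing.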