sklearn RandomizedSearchCV extract confusion matrix for different folds


I am trying to compute an aggregated confusion matrix to evaluate my model:

cv_results = cross_validate(estimator, dataset.data, dataset.target, scoring=scoring,
                cv=Config.CROSS_VALIDATION_FOLDS, n_jobs=N_CPUS, return_train_score=False)

But I don't know how to extract the individual confusion matrices of the different folds. In a scorer I can compute one:

scoring = {
    'cm': make_scorer(confusion_matrix)
}

However, I cannot return the confusion matrix from it, because a scorer has to return a number rather than an array. If I try, I get the following error:

ValueError: scoring must return a number, got [[...]] (<class 'numpy.ndarray'>) instead. (scorer=cm)
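
For context, one way around the number-only restriction is to give every cell of the matrix its own scorer, so that each scorer returns a single integer. The following is a minimal sketch, assuming a binary problem; the synthetic dataset, the LogisticRegression estimator and the helper names tn, fp, fn and tp are placeholders for illustration, not part of the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_validate

# Hypothetical helpers: each one returns a single cell of the 2x2 matrix.
# labels=[0, 1] keeps the matrix 2x2 even if a fold predicts only one class.
def tn(y_true, y_pred): return confusion_matrix(y_true, y_pred, labels=[0, 1])[0, 0]
def fp(y_true, y_pred): return confusion_matrix(y_true, y_pred, labels=[0, 1])[0, 1]
def fn(y_true, y_pred): return confusion_matrix(y_true, y_pred, labels=[0, 1])[1, 0]
def tp(y_true, y_pred): return confusion_matrix(y_true, y_pred, labels=[0, 1])[1, 1]

scoring = {"tn": make_scorer(tn), "fp": make_scorer(fp),
           "fn": make_scorer(fn), "tp": make_scorer(tp)}

# Placeholder data and estimator, only to make the sketch runnable.
X, y = make_classification(n_samples=200, random_state=0)
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            scoring=scoring, cv=5)

# Rebuild one 2x2 matrix per fold from the per-cell scores.
fold_matrices = [
    np.array([[cv_results["test_tn"][i], cv_results["test_fp"][i]],
              [cv_results["test_fn"][i], cv_results["test_tp"][i]]], dtype=int)
    for i in range(5)
]
aggregated = np.sum(fold_matrices, axis=0)
```

Here fold_matrices holds one matrix per fold and aggregated is their element-wise sum. For a multiclass problem this approach would need one scorer per cell, which quickly becomes unwieldy.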

I wondered whether I could store the confusion matrices in a global variable instead, but I had no success using

global cm_list
cm_list.append(confusion_matrix(y_true,y_pred))

in a custom scorer.

Thanks in advance for any advice.


There are 2 best solutions below

Best answer

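The problem was that I could not access the estimator after RandomizedSearchCV had finished, because I did not know that RandomizedSearchCV implements a predict method. Note that this gives a single confusion matrix computed on the data the model was refit on, not one matrix per fold. Here is my solution: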

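from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV

# scorer is a dict of scorers; refit on the first metric it contains
r_search = RandomizedSearchCV(estimator=estimator, param_distributions=param_distributions,
                              n_iter=n_iter, cv=cv, scoring=scorer, n_jobs=n_cpus,
                              refit=next(iter(scorer)))
r_search.fit(X, y_true)
# predict() delegates to the refitted best estimator
y_pred = r_search.predict(X)
cm = confusion_matrix(y_true, y_pred)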
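
If the goal is a confusion matrix built from held-out predictions rather than from predictions on the training data, one option is to pass the tuned estimator to cross_val_predict. This is a minimal sketch, not the method used above: the synthetic dataset, the LogisticRegression estimator and the small parameter grid are assumptions made only to keep it self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, cross_val_predict

# Placeholder data, estimator and search space (assumptions for illustration).
X, y_true = make_classification(n_samples=200, random_state=0)
r_search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3, cv=5, random_state=0,
)
r_search.fit(X, y_true)

# cross_val_predict clones the best estimator and refits it on each training
# split, so every sample is predicted by a model that never saw it.
y_pred_oof = cross_val_predict(r_search.best_estimator_, X, y_true, cv=5)
cm_oof = confusion_matrix(y_true, y_pred_oof)
print(cm_oof)
```

Because the hyperparameters were chosen on the same data, cm_oof is probably still somewhat optimistic; a separate test set or nested cross-validation would give a cleaner estimate.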

Another answer

To get a confusion matrix for each fold, you can call confusion_matrix from the sklearn.metrics module in each iteration (fold); it returns an array. Its inputs are the y_true and y_predict values obtained for that fold.

from sklearn import metrics
print(metrics.confusion_matrix(y_true, y_predict))
[[327582 264313]
 [167523 686735]]
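
The per-fold loop described above could look roughly like the following. It is a sketch under stated assumptions: the synthetic dataset, the LogisticRegression estimator and the StratifiedKFold settings are placeholders, and the names fold_cms and aggregated_cm are made up for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Placeholder data and estimator (assumptions for illustration).
X, y = make_classification(n_samples=200, random_state=0)
estimator = LogisticRegression(max_iter=1000)
labels = np.unique(y)  # fixes the row/column order in every fold

fold_cms = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Fit a fresh copy on the training part of this fold.
    model = clone(estimator).fit(X[train_idx], y[train_idx])
    y_predict = model.predict(X[test_idx])
    # One confusion matrix per fold, computed on the held-out part only.
    fold_cms.append(confusion_matrix(y[test_idx], y_predict, labels=labels))

# Element-wise sum over folds gives the aggregated matrix.
aggregated_cm = np.sum(fold_cms, axis=0)
print(aggregated_cm)
```

Passing labels explicitly keeps every fold's matrix the same shape, so the matrices can be summed even if a class happens to be missing from a fold's predictions.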

Alternatively, if you are using pandas, it provides a crosstab function:

import pandas as pd
df_conf = pd.crosstab(y_true, y_predict, rownames=['Actual'], colnames=['Predicted'], margins=True)
print(df_conf)

Predicted       0       1     All
Actual                           
  0          332553   58491  391044
  1           97283  292623  389906
  All        429836  351114  780950