Sklearn - Multi-class confusion matrix for ordinal data

I've written a model that makes predictions on ordinal data. At the moment I'm evaluating it with the quadratic Cohen's kappa. I'm looking for a way to visualize the results with a confusion matrix and then calculate recall, precision and F1 score while taking the prediction distance into account.

For example, predicting 2 when the actual class was 1 is better than predicting 3 when the actual class was 1.

I've written the following code to plot and calculate the results:

import seaborn as sns
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score

def plot_cm(df, ax):
    # df.x holds the actual labels, df.y the predicted labels
    cf_matrix = confusion_matrix(df.x, df.y, normalize='true', labels=[0, 1, 2, 3, 4, 5, 6, 7, 8])

    ax = sns.heatmap(cf_matrix, linewidths=1, annot=True, ax=ax, fmt='.2f')
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')

    print('Recall score:', recall_score(df.x, df.y, average='weighted', zero_division=0))
    print('Precision score:', precision_score(df.x, df.y, average='weighted', zero_division=0))
    print('F1 score:', f1_score(df.x, df.y, average='weighted', zero_division=0))

[Image: normalized confusion matrix heatmap produced by plot_cm]

Recall score: 0.53505
Precision score: 0.5454783454981732
F1 score: 0.5360650278722704

The visualization is fine; however, the calculation ignores predictions that were "almost" right, e.g. predicting 8 when the actual class was 9.

Is there a way to calculate Recall, Precision and F1 taking into account the ordinal behavior of the data?

There is 1 answer below.


Regular precision (per class) is calculated as the ratio of true positives to everything predicted as that class. Usually a true-positive detection is defined in a binary fashion: you either correctly detected the class or you did not. There is no restriction whatsoever against making the TP detection score for sample i fuzzy (in other words, lightly penalizing close-to-class detections and making the penalty more severe as the difference grows):

TP(i) = max(0, (1 - abs(detected_class(i) - true_class(i))/penalty_factor) )

where TP(i) is the value of the "true positive detection" for sample i, a number in the range [0, 1]. It is reasonable to make penalty_factor equal to the number of classes (it should be larger than 1). By changing it you control how heavily "distant" classes are penalized. For example, if you decide that a difference of more than 3 is enough to consider the sample "not detected", set it to 3; if you set it to 1, you get back the "regular" precision formulation. The max() makes sure the TP score never becomes negative.
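A minimal sketch of this score, assuming integer-encoded ordinal labels (fuzzy_tp is just an illustrative name, not an sklearn function):

import numpy as np

def fuzzy_tp(y_true, y_pred, penalty_factor):
    # Per-sample fuzzy true-positive score in [0, 1]: 1 for an exact match,
    # decaying linearly with the class distance and clipped at 0 once the
    # distance reaches penalty_factor.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.maximum(0, 1 - np.abs(y_pred - y_true) / penalty_factor)

# e.g. fuzzy_tp([1, 1, 1], [1, 2, 5], penalty_factor=3) -> [1.0, 0.667, 0.0]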

Now, to get the denominator right, you need to set it to the count of samples with TP(i) > 0. That is, if you have 100 samples in total, and out of those 5 were detected with a TP score of 1 and 6 got a TP score of 0.5, your precision would be (5 + 6*0.5)/(5 + 6).
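A sketch of that denominator rule, applied to per-sample scores such as those from fuzzy_tp above; the 100-sample split mirrors the worked example, and fuzzy_precision is again only an illustrative name:

import numpy as np

def fuzzy_precision(tp_scores):
    # Sum of the fuzzy TP scores divided by the number of samples
    # whose score is strictly positive.
    tp_scores = np.asarray(tp_scores, dtype=float)
    denom = np.count_nonzero(tp_scores > 0)
    return tp_scores.sum() / denom if denom else 0.0

# 100 samples: 5 with score 1, 6 with score 0.5, the rest with score 0
scores = np.array([1.0] * 5 + [0.5] * 6 + [0.0] * 89)
print(fuzzy_precision(scores))  # (5 + 6*0.5) / (5 + 6) ≈ 0.727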

One issue here is that "precision per class" becomes less meaningful, because every prediction is now somewhat relevant to all classes. If you need an overall precision "weighted" by class (for the unbalanced-class case), you have to factor the weight into the TP score according to the true class of sample i.
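One possible reading of that, shown only as a hedged sketch: weight each sample's TP score by the relative support of its true class (class_weighted_scores is a hypothetical helper, and this is just one way to define the weights):

import numpy as np

def class_weighted_scores(tp_scores, y_true):
    # Multiply each sample's fuzzy TP score by the relative frequency
    # (support) of its true class.
    y_true = np.asarray(y_true)
    tp_scores = np.asarray(tp_scores, dtype=float)
    classes, counts = np.unique(y_true, return_counts=True)
    support = dict(zip(classes, counts / len(y_true)))
    weights = np.array([support[c] for c in y_true])
    return tp_scores * weights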

Employing the same logic, the Recall would be the sum of TP scores over the relevant population, i.e.

R = (sum of (weighted) TP scores) / (total number of samples)
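A sketch under the same assumptions (fuzzy_recall is an illustrative name); with the 100-sample example above this gives (5 + 6*0.5)/100 = 0.08:

import numpy as np

def fuzzy_recall(tp_scores, weights=None):
    # (Optionally weighted) sum of the fuzzy TP scores divided by the
    # total number of samples.
    tp_scores = np.asarray(tp_scores, dtype=float)
    if weights is not None:
        tp_scores = tp_scores * np.asarray(weights, dtype=float)
    return tp_scores.sum() / len(tp_scores)

# Same 100-sample example as above
scores = np.array([1.0] * 5 + [0.5] * 6 + [0.0] * 89)
print(fuzzy_recall(scores))  # 0.08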

And finally, F1 is the harmonic mean of Precision and Recall.
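Putting it together with the example numbers from above (P ≈ 0.727, R = 0.08); fuzzy_f1 is again just an illustrative helper:

def fuzzy_f1(precision, recall):
    # Harmonic mean of the fuzzy precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(fuzzy_f1(8 / 11, 0.08))  # ≈ 0.144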