Train a classification model using the "rpart" and "caret" libraries in R with four classes: how to define accuracy metric

Question

140 Views Asked by Mark At 14 June 2023 at 15:40

The following code trains a classification model using the "rpart" and "caret" libraries in R. It uses the train() function from the "caret" library to train the model with the "rpart" method, specifically using the Gini index for splitting. The trained model is stored in the variable classifier.

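library(rpart)
library(caret)

# Fit a CART model; caret tunes cp over 20 candidate values (tuneLength = 20)
# and passes parms through to rpart() so that splits use the Gini index.
classifier <- train(x = training_set[, names(training_set) != "Target"],
                    y = training_set$Target,
                    method = 'rpart',
                    parms = list(split = "gini"),
                    tuneLength = 20)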

The variable classifier is as follows:

> classifier
CART 

7112 samples
  89 predictor
   4 classes: 'Q1', 'Q2', 'Q3', 'Q4' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 7112, 7112, 7112, 7112, 7112, 7112, ... 
Resampling results across tuning parameters:

  cp            Accuracy   Kappa    
  0.0002343457  0.9536618  0.9382023
  0.0002812148  0.9535851  0.9380999
  0.0003749531  0.9535394  0.9380391
  0.0004686914  0.9539980  0.9386511
  0.0005624297  0.9539678  0.9386110
  0.0006561680  0.9543640  0.9391389
  0.0007499063  0.9540123  0.9386694
  0.0008248969  0.9536724  0.9382163
  0.0010311211  0.9536133  0.9381370
  0.0011248594  0.9532129  0.9376029
  0.0014373203  0.9515384  0.9353684
  0.0029058868  0.9470504  0.9293828
  0.0042182227  0.9388870  0.9184975
  0.0052493438  0.9336715  0.9115402
  0.0082489689  0.9247140  0.8995937
  0.0133108361  0.9169616  0.8892603
  0.0221222347  0.9060093  0.8746638
  0.0380577428  0.8739447  0.8319098
  0.2065991751  0.8156983  0.7544120
  0.3101799775  0.4304355  0.2461903

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.000656168.

So it is a classifier with 4 classes, and the optimal model is selected by means of the accuracy metric.

In binary classification, accuracy is defined as the ratio of the number of correct predictions (true positives and true negatives) to the total number of predictions.

Mathematically, the accuracy can be calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where:

TP (True Positives) represents the number of instances correctly predicted as positive.
TN (True Negatives) represents the number of instances correctly predicted as negative.
FP (False Positives) represents the number of instances predicted as positive but are actually negative (Type I error).
FN (False Negatives) represents the number of instances predicted as negative but are actually positive (Type II error).
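
As a quick check of the formula above, here is a minimal sketch in base R. The four counts are made-up numbers chosen for illustration and do not come from the model in the question.

```r
# Hypothetical confusion counts for a binary classifier (illustrative only).
tp <- 40  # positives correctly predicted as positive
tn <- 45  # negatives correctly predicted as negative
fp <- 5   # negatives wrongly predicted as positive
fn <- 10  # positives wrongly predicted as negative

# Accuracy = correct predictions / all predictions
binary_accuracy <- (tp + tn) / (tp + tn + fp + fn)
print(binary_accuracy)
```

With these counts, 85 of the 100 predictions are correct.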

What is the definition of accuracy used by train for multiclass problems?

Original Q&A

There are 2 best solutions below

Tom Wenseleers On 17 June 2023 at 08:29

In multiclass classification problems, the accuracy is calculated as the total number of correct predictions divided by the total number of predictions, just as in binary classification problems. However, the notion of "correct prediction" now extends beyond just true positives and true negatives, given that there are more than two classes.

That is, in multiclass classification the number of correct predictions is simply the count of instances where the predicted class matches the actual class, irrespective of what that class is. Hence, the accuracy in a multiclass classification problem is just:

Accuracy = (number of correct predictions) / (total number of predictions)

where:

The number of correct predictions represents the number of instances where the predicted class matches the actual class.

The total number of predictions is simply the count of all instances in the dataset.

This is the definition of accuracy used by the train function in the caret package for multiclass problems. In the output you've provided, the accuracy reported for each value of the complexity parameter (cp) is the proportion of held-out instances whose class the model predicted correctly, averaged over the 25 bootstrap resamples. See e.g. this paper for a nice review.
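
The definition above can be sketched in a few lines of base R. The observed and predicted labels below are invented for illustration; only the class names Q1 to Q4 are taken from the question.

```r
# Made-up observed and predicted labels for the four classes in the question.
classes   <- c("Q1", "Q2", "Q3", "Q4")
observed  <- factor(c("Q1", "Q1", "Q2", "Q2", "Q3", "Q3", "Q4", "Q4"),
                    levels = classes)
predicted <- factor(c("Q1", "Q2", "Q2", "Q2", "Q3", "Q4", "Q4", "Q4"),
                    levels = classes)

# A prediction counts as correct when the predicted class equals the
# observed class, whichever class that is.
correct <- predicted == observed
multiclass_accuracy <- sum(correct) / length(correct)
print(multiclass_accuracy)
```

Here 6 of the 8 predictions match, so the accuracy is 0.75. If caret is loaded, `postResample(pred = predicted, obs = observed)` should report the same number as its Accuracy element, though that call is not run in this sketch.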

**Jonathan V. Solórzano** · Accepted Answer · 2023-06-16T20:16:35.750000

For multiclass problems, you just need to extend the same definition of accuracy (i.e., the number of correctly classified observations, summed over all classes, divided by the total number of observations). Here is also a reputable source that defines a multiclass accuracy equation for map classification accuracy assessment: Congalton, 1991. In this article, overall accuracy is defined as being calculated by "dividing the total correct (i.e., the sum of the major diagonal) by the total number of pixels in the error matrix". Thus, for example, for the following confusion matrix where the predicted class is shown in the rows and the observed one in the columns:

Class	1	2	…	q	Total
1	n_11	n_12	…	n_1q	n_1.
2	n_21	n_22	…	n_2q	n_2.
…	…	…	…	…	…
q	n_q1	n_q2	…	n_qq	n_q.
Total	n_.1	n_.2	…	n_.q	n

The overall accuracy would be calculated as the sum of all the n_kk (the diagonal entries, each giving the number of correctly classified observations of class k), divided by the total number of observations (n).
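
The same calculation can be written directly against an error matrix in base R. This is a minimal sketch: the counts are hypothetical, and the layout (predicted classes in rows, observed classes in columns) follows the table above.

```r
classes <- c("Q1", "Q2", "Q3", "Q4")

# Hypothetical 4 x 4 error matrix: rows = predicted, columns = observed.
cm <- matrix(c(50,  2,  1,  0,
                3, 45,  4,  1,
                0,  5, 40,  2,
                1,  0,  3, 43),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = classes, observed = classes))

# Overall accuracy = sum of the major diagonal / total number of observations.
overall_accuracy <- sum(diag(cm)) / sum(cm)
print(overall_accuracy)
```

For this matrix the diagonal sums to 178 out of 200 observations. With real data, a matrix of this shape can be built with `table(predicted, observed)`.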
