Can someone help me figure out how to create a balanced logarithmic loss function in yardstick for use in a tidymodels pipeline?
I looked up the documentation on creating custom metrics and was able to create straightforward custom regression and classification metrics; those weren't too complicated. For custom class probability metrics, however, the documentation suggests looking at the implementation of roc_auc, which wasn't particularly enlightening.
If someone could help me with this I would really appreciate it.
The cross-entropy loss function, also known as Log Loss or logistic loss, is

    L(y, p) = -( y * log(p) + (1 - y) * log(1 - p) )

where y is the true label (0 or 1) and p is the probability estimate that y = 1 (also described in this Python library).
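For concreteness, here is that formula evaluated by hand in base R, averaged over a few observations (the vectors `truth` and `prob` are made-up example data, not from any package):

```r
# Plain (unweighted) log loss, computed directly from the formula.
# `truth` holds the 0/1 labels y, `prob` the estimates p of P(y = 1).
truth <- c(1, 0, 1, 1)
prob  <- c(0.9, 0.2, 0.8, 0.6)

log_loss <- -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
log_loss  # about 0.266
```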
However, a Balanced Log Loss includes a weighting term w(y_i) (as seen here):

    BalancedLogLoss = -(1/N) * Σ_{i=1}^{N} w(y_i) * ( y_i * log(p(y_i)) + (1 - y_i) * log(1 - p(y_i)) )

Where:

- N is the total number of samples.
- y_i is the true label of the i-th sample (0 or 1).
- p(y_i) is the predicted probability of the i-th sample being of class 1.
- w(y_i) is the weight for class y_i, usually defined as the inverse of the class frequency.

Meaning, the general structure of the Log Loss formula is contained within the Balanced Log Loss formula, but the Balanced Log Loss includes a weighting term to account for class imbalance.
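To make the weighting concrete, this base-R sketch evaluates the balanced formula with inverse-class-frequency weights (the data and variable names are made up for illustration; other normalizations of the weights also exist):

```r
# Balanced log loss computed directly from the formula above.
truth <- c(1, 0, 0, 0, 1, 0)              # labels y_i (imbalanced: 2 vs 4)
prob  <- c(0.8, 0.3, 0.1, 0.2, 0.6, 0.4)  # predicted P(y_i = 1)

# w(y_i): inverse of the frequency of each sample's true class
w <- ifelse(truth == 1, 1 / sum(truth == 1), 1 / sum(truth == 0))

balanced_log_loss <- -mean(w * (truth * log(prob) + (1 - truth) * log(1 - prob)))
balanced_log_loss  # about 0.111
```

Note how the two minority-class (y = 1) samples carry twice the weight of each majority-class sample, so confident mistakes on the rare class are penalized more.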
As the fcakyon/balanced-loss Python library summarizes, weighting the loss by class counteracts class imbalance during training.
The `yardstick` package is an R package, part of the `tidymodels` ecosystem, that is used for calculating model performance metrics.

When creating a balanced logarithmic loss function, you should use `new_prob_metric()` rather than `new_class_metric()`. This is because the balanced logarithmic loss function operates on predicted probabilities, which makes it a probability metric. In `yardstick`, `new_class_metric()` is used for metrics that evaluate predictions in terms of class labels (e.g., accuracy, sensitivity, specificity), whereas `new_prob_metric()` is used for metrics that evaluate predicted probabilities (e.g., log loss, ROC AUC).

The pieces of the metric definition are:

- `direction = "minimize"` indicates that smaller values of this metric are better (which is true for log loss, as we want to minimize the error).
- `fun = function(data, lev, model = NULL, w_positive, w_negative) { ... }` defines the function that computes the Balanced Log Loss. It takes five parameters:
  - `data`: a data frame containing the true labels and predicted probabilities.
  - `lev`: the levels of the factor for binary classification; in binary classification there are two levels, e.g., "positive" and "negative".
  - `model`: not used in this function, but it can be passed.
  - `w_positive`: the weight for the positive class.
  - `w_negative`: the weight for the negative class.
- `truth <- as.numeric(data$truth == lev[1])` converts the true labels into a numeric vector where 1 represents the positive class (assumed to be `lev[1]`) and 0 represents the negative class.
- `prob <- data$.pred_1` extracts the predicted probabilities for the positive class from the `.pred_1` column of the data frame.

The Balanced Log Loss is then calculated with an expression that implements the formula by taking the usual log-loss term `(truth * log(prob) + (1 - truth) * log(1 - prob))` for each observation and multiplying it by `w_positive` when the true label is positive and by `w_negative` when it is negative. Finally, the function returns the value of the calculated loss.

This defines a balanced logarithmic loss metric (named with `metric_nm = "balanced_log_loss"`) that operates on predicted probabilities, with the weights `w_positive` and `w_negative` balancing the contribution of each class to the loss; the custom metric can then be used in your tidymodels pipeline.
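Putting the pieces together, here is a sketch (not a definitive implementation) that follows the pattern in yardstick's "Custom performance metrics" vignette: a vectorized `*_vec()` function plus a data-frame method registered with `new_prob_metric()`. The helper name `balanced_log_loss_vec`, the probability clipping, and the inverse-class-frequency defaults for `w_positive`/`w_negative` are my assumptions; `prob_metric_summarizer()` is assumed to be available in your version of yardstick.

```r
library(yardstick)
library(rlang)

# Vector version: `truth` is a two-level factor whose first level is the
# event, `estimate` is the predicted probability of that level.
# Weights default to inverse class frequencies (an assumption); pass
# w_positive / w_negative explicitly to override.
balanced_log_loss_vec <- function(truth, estimate,
                                  w_positive = NULL, w_negative = NULL,
                                  na_rm = TRUE, ...) {
  y <- as.numeric(truth == levels(truth)[1])
  p <- pmin(pmax(estimate, 1e-15), 1 - 1e-15)  # clip to avoid log(0)
  if (is.null(w_positive)) w_positive <- 1 / sum(y == 1)
  if (is.null(w_negative)) w_negative <- 1 / sum(y == 0)
  w <- ifelse(y == 1, w_positive, w_negative)
  # -(1/N) * sum_i w(y_i) * ( y_i log p_i + (1 - y_i) log(1 - p_i) )
  -mean(w * (y * log(p) + (1 - y) * log(1 - p)))
}

# Generic + data-frame method + metric class, so the metric can be used
# on its own or inside metric_set().
balanced_log_loss <- function(data, ...) UseMethod("balanced_log_loss")
balanced_log_loss <- new_prob_metric(balanced_log_loss, direction = "minimize")

balanced_log_loss.data.frame <- function(data, truth, ..., na_rm = TRUE) {
  prob_metric_summarizer(
    name = "balanced_log_loss",
    fn = balanced_log_loss_vec,
    data = data,
    truth = !!enquo(truth),
    ...,
    na_rm = na_rm
  )
}
```

A usage sketch, assuming a predictions data frame `df` with a factor column `truth` (event level first) and a probability column named `.pred_1`:

```r
# metrics <- metric_set(balanced_log_loss, roc_auc)
# metrics(df, truth = truth, .pred_1)
```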