Can someone help me figure out how to create a balanced logarithmic loss function in `yardstick` for use in a `tidymodels` pipeline?

I looked up the documentation on creating custom metrics and was able to create straightforward custom regression and classification metrics; those weren't that complicated. For custom class probability metrics, the documentation suggests looking at the implementation of `roc_auc()`, which wasn't particularly enlightening.

If someone could help me with this I would really appreciate it.
The cross-entropy loss function, also known as Log Loss or logistic loss, is

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

where $y$ is the true label (0 or 1) and $p$ is the probability estimate that $y = 1$ (also described in this Python library).

A Balanced Log Loss, however, includes a weighting term $w(y_i)$ (as seen here):

$$\text{Balanced Log Loss} = -\frac{1}{N}\sum_{i=1}^{N} w(y_i)\left[y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i))\right]$$

Where:

- $N$ is the total number of samples.
- $y_i$ is the true label of the $i$-th sample (0 or 1).
- $p(y_i)$ is the predicted probability of the $i$-th sample being of class 1.
- $w(y_i)$ is the weight for class $y_i$, usually defined as the inverse of the class frequency.

In other words, the general structure of the Log Loss formula is contained within the Balanced Log Loss formula, but the Balanced Log Loss adds a weighting term to account for class imbalance.
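To make the weighting concrete: taking $w$ as the inverse class frequency, a dataset of $N = 10$ samples with 2 positives and 8 negatives gives

$$w(1) = \frac{1}{2/10} = 5, \qquad w(0) = \frac{1}{8/10} = 1.25$$

so each of the rarer positive samples contributes four times the weight of a negative sample to the loss.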
The fcakyon/balanced-loss library summarizes the same idea.
The `yardstick` package is an R package, part of the `tidymodels` ecosystem, that is used for calculating model performance metrics.

When creating a balanced logarithmic loss function, you should use `new_prob_metric()` rather than `new_class_metric()`. This is because the balanced logarithmic loss function operates on predicted probabilities, which makes it a probability metric. In `yardstick`, `new_class_metric()` is used for metrics that evaluate predictions in terms of class labels (e.g., accuracy, sensitivity, specificity), whereas `new_prob_metric()` is used for metrics that evaluate predicted probabilities (e.g., log loss, AUC).

`direction = "minimize"` indicates that smaller values of this metric are better (which is true for log loss, as we want to minimize the error).

`fun = function(data, lev, model = NULL, w_positive, w_negative) { ... }` defines the function that computes the Balanced Log Loss. The function takes five parameters:

- `data`: A data frame containing the true labels and predicted probabilities.
- `lev`: The levels of the factor for binary classification. In binary classification there are two levels, e.g., "positive" and "negative".
- `model`: Not used in this function, but it can be passed.
- `w_positive`: The weight for the positive class.
- `w_negative`: The weight for the negative class.

`truth <- as.numeric(data$truth == lev[1])` converts the true labels into a numeric vector where 1 represents the positive class (assumed to be `lev[1]`) and 0 represents the negative class.

`prob <- data$.pred_1` extracts the predicted probabilities for the positive class from the `.pred_1` column of the data frame.

The Balanced Log Loss is then calculated with an expression along the lines of `loss <- -mean(ifelse(truth == 1, w_positive, w_negative) * (truth * log(prob) + (1 - truth) * log(1 - prob)))`. This expression implements the formula by:

- computing the per-sample log loss term `(truth * log(prob) + (1 - truth) * log(1 - prob))`,
- weighting each term with `w_positive` when the true label is positive and `w_negative` when it is negative, and
- averaging over all samples and negating the result.

Finally, the function returns the value of the calculated `loss`.
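Putting those pieces together, here is a sketch of what the metric definition might look like. The probability clipping with `eps`, the default weights of 1, and the assumption that the positive-class probabilities live in a `.pred_1` column are all choices made for this sketch, not requirements of `yardstick`; depending on your `yardstick` version, wiring the metric into `metric_set()` may additionally require the `metric_vec`-plus-summarizer pattern from the custom-metrics vignette.

```r
library(yardstick)

# Balanced log loss: binary cross-entropy with per-class weights
# to counteract class imbalance.
balanced_log_loss_fun <- function(data, lev, model = NULL,
                                  w_positive = 1, w_negative = 1) {
  # 1 for the positive class (assumed to be lev[1]), 0 otherwise
  truth <- as.numeric(data$truth == lev[1])

  # Predicted probability of the positive class
  prob <- data$.pred_1

  # Clip probabilities so log() never sees exactly 0 or 1
  eps <- 1e-15
  prob <- pmin(pmax(prob, eps), 1 - eps)

  # Per-sample class weights supplied by the caller
  w <- ifelse(truth == 1, w_positive, w_negative)

  loss <- -mean(w * (truth * log(prob) + (1 - truth) * log(1 - prob)))
  loss
}

# Register it as a probability metric that should be minimized
balanced_log_loss <- new_prob_metric(balanced_log_loss_fun,
                                     direction = "minimize")
```

With `w_positive = w_negative = 1` this reduces to ordinary log loss, which is a useful sanity check against `yardstick::mn_log_loss()`.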
And here is how you can use the custom metric in your `tidymodels` pipeline.

This example uses `new_prob_metric()` to define a balanced logarithmic loss function (named with `metric_nm = "balanced_log_loss"`) that operates on predicted probabilities. The weights `w_positive` and `w_negative` are used to balance the contribution of each class to the loss.
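As a usage sketch, the metric can be called directly on a data frame of predictions. This uses the `two_class_example` data that ships with `yardstick`; the renaming of columns to `truth`/`.pred_1` and the inverse-frequency weights are assumptions of this sketch, and `balanced_log_loss()` is assumed to be defined as described above.

```r
library(yardstick)

# two_class_example ships with yardstick: a factor column `truth`
# and the predicted probability of Class1 in `Class1`
data("two_class_example")

# Rename to the columns the custom metric expects
preds <- data.frame(
  truth   = two_class_example$truth,
  .pred_1 = two_class_example$Class1
)

# Inverse-frequency class weights
freqs <- prop.table(table(preds$truth))
w_pos <- 1 / freqs[["Class1"]]
w_neg <- 1 / freqs[["Class2"]]

# Compute the balanced log loss on the predictions
balanced_log_loss(
  preds,
  lev        = levels(preds$truth),
  w_positive = w_pos,
  w_negative = w_neg
)
```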