Balanced log loss function in yardstick


Can someone help me figure out how to create a balanced logarithmic loss function in yardstick for use in a tidymodels pipeline?

I looked up the documentation on creating custom metrics and was able to create straightforward custom regression and classification metrics; those weren't that complicated. For custom class probability metrics, the documentation suggested looking at the implementation of roc_auc, which wasn't particularly enlightening.

If someone could help me with this I would really appreciate it.

1 Answer

The cross-entropy loss function, also known as log loss or logistic loss, is

$$\mathrm{LogLoss}(y, p) = -\big(y \log(p) + (1 - y)\log(1 - p)\big)$$

where $y$ is the true label (0 or 1) and $p$ is the probability estimate that $y = 1$ (also described in this Python library).
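
As a concrete illustration (the numbers are made up), the formula can be computed directly in R; for binary outcomes it agrees with yardstick's built-in mn_log_loss_vec():

# Toy example of the (unweighted) log loss
y <- c(1, 0, 1, 1, 0)            # true labels
p <- c(0.9, 0.2, 0.6, 0.8, 0.4)  # predicted P(y = 1)

-mean(y * log(p) + (1 - y) * log(1 - p))  # ≈ 0.315

# Same value from yardstick: truth as a factor whose *first* level is the event
truth <- factor(ifelse(y == 1, "yes", "no"), levels = c("yes", "no"))
yardstick::mn_log_loss_vec(truth, p)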

However, the Balanced Log Loss includes a weighting term $w(y_i)$ (as seen here):

$$\mathrm{BalancedLogLoss} = -\frac{1}{N}\sum_{i=1}^{N} w(y_i)\,\big(y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i))\big)$$

Where:

  • N is the total number of samples.
  • y_i is the true label of the i-th sample (0 or 1).
  • p(y_i) is the predicted probability of the i-th sample being of class 1.
  • w(y_i) is the weight for class y_i, usually defined as the inverse of the class frequencies.

In other words, the Balanced Log Loss has the same general structure as the Log Loss formula, with an added weighting term to account for class imbalance.
As the fcakyon/balanced-loss library summarizes:

When training dataset labels are imbalanced, one thing to do is to balance the loss across sample classes.

The yardstick package is an R package that is part of the tidymodels ecosystem and is used for calculating model performance metrics.

When creating a balanced logarithmic loss function, you should use new_prob_metric() rather than new_class_metric(), because balanced logarithmic loss operates on predicted probabilities, which makes it a probability metric.

In yardstick, new_class_metric() is used for metrics that evaluate predictions in terms of class labels (e.g., accuracy, sensitivity, specificity), whereas new_prob_metric() is used for metrics that evaluate the predicted probabilities (e.g., log loss, AUC).
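
For reference, with yardstick's built-in two_class_example data, the two kinds of metrics are called like this:

library(yardstick)
data(two_class_example)

# Class metric: compares hard class predictions to the truth
accuracy(two_class_example, truth = truth, estimate = predicted)

# Probability metric: uses the predicted probability of the event class
roc_auc(two_class_example, truth = truth, Class1)

With that distinction in mind, the definition below follows the pattern from yardstick's custom-metrics vignette: a _vec() function that does the computation, an S3 generic tagged with new_prob_metric(), and a data frame method that hands the work off to prob_metric_summarizer() (exported by recent versions of yardstick).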

library(yardstick)
library(rlang)

# Vector version: computes the balanced log loss itself.
# Assumes the first factor level of `truth` is the positive (event) class.
balanced_log_loss_vec <- function(truth, estimate,
                                  w_positive = 1, w_negative = 1, ...) {
  y <- as.numeric(truth == levels(truth)[1])
  # Clamp probabilities away from 0 and 1 so log() stays finite
  p <- pmin(pmax(estimate, 1e-15), 1 - 1e-15)
  -mean(w_positive * y * log(p) + w_negative * (1 - y) * log(1 - p))
}

# S3 generic + metric class, so metric_set() recognises it as a probability metric
balanced_log_loss <- function(data, ...) {
  UseMethod("balanced_log_loss")
}
balanced_log_loss <- new_prob_metric(balanced_log_loss, direction = "minimize")

# Data frame method: hands the work off to yardstick's summarizer
balanced_log_loss.data.frame <- function(data, truth, ...,
                                         w_positive = 1, w_negative = 1,
                                         na_rm = TRUE) {
  prob_metric_summarizer(
    name = "balanced_log_loss",
    fn = balanced_log_loss_vec,
    data = data,
    truth = !!enquo(truth),
    ...,
    na_rm = na_rm,
    # Extra arguments forwarded to balanced_log_loss_vec()
    fn_options = list(w_positive = w_positive, w_negative = w_negative)
  )
}

# You can then use the balanced_log_loss function within tidymodels pipelines.

Walking through this definition:

  • direction = "minimize" indicates that smaller values of this metric are better (which is true for log loss, as we want to minimize the error).

  • balanced_log_loss_vec() is the function that actually computes the Balanced Log Loss. It takes four main arguments:

    • truth: a factor of true class labels; the first factor level is assumed to be the positive (event) class.
    • estimate: a numeric vector of predicted probabilities for the positive class.
    • w_positive: the weight for the positive class.
    • w_negative: the weight for the negative class.
  • y <- as.numeric(truth == levels(truth)[1]) converts the true labels into a numeric vector where 1 represents the positive class and 0 represents the negative class.

  • p <- pmin(pmax(estimate, 1e-15), 1 - 1e-15) clamps the predicted probabilities away from exactly 0 and 1, so that log() never returns -Inf.

  • The Balanced Log Loss is then calculated with this expression:

    -mean(w_positive * y * log(p) + w_negative * (1 - y) * log(1 - p))
    

    This expression implements the formula by:

    • Calculating the log loss for each observation (y * log(p) + (1 - y) * log(1 - p)).
    • Weighting the log loss of each observation by w_positive when the true label is positive, and by w_negative when the true label is negative.
    • Taking the mean of the weighted log losses, which corresponds to the sum in the formula divided by the number of observations (N).
    • Multiplying by -1 to make it a loss (since the formula has a negative sign outside the sum).
  • balanced_log_loss() itself is an S3 generic tagged with new_prob_metric(), which is what lets metric_set() and the rest of tidymodels treat it as a class-probability metric. Its data frame method passes the selected columns to prob_metric_summarizer(), forwards the weights through fn_options, and returns the usual tibble of .metric, .estimator, and .estimate. A quick sanity check follows below.
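
As that sanity check, the _vec() version can be called directly on the toy values from earlier; with the default unit weights it reproduces the ordinary log loss:

# Toy sanity check (same made-up values as before)
truth <- factor(c("1", "0", "1", "1", "0"), levels = c("1", "0"))
p     <- c(0.9, 0.2, 0.6, 0.8, 0.4)

balanced_log_loss_vec(truth, p)                                      # ≈ 0.315, same as the plain log loss
balanced_log_loss_vec(truth, p, w_positive = 0.7, w_negative = 0.3)  # weighted variant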

And here is how you can use the custom metric in your tidymodels pipeline:

# Example of using the custom metric in a tidymodels pipeline
library(tidymodels)

# Define a metric set with the custom metric
metrics <- metric_set(balanced_log_loss, accuracy, roc_auc)

# Estimate performance (assuming the positive class is labelled "1",
# so its probability column from predict() is .pred_1)
workflow_fit %>%
  predict(testing_set, type = "prob") %>%
  bind_cols(predict(workflow_fit, testing_set)) %>%  # adds .pred_class for accuracy
  bind_cols(testing_set) %>%
  metrics(truth = truth, estimate = .pred_class, .pred_1)

This example uses new_prob_metric() together with prob_metric_summarizer() to define a balanced logarithmic loss metric (reported as "balanced_log_loss") that operates on predicted probabilities.
The weights w_positive and w_negative balance the contribution of each class to the loss. Note that metric_set() does not forward metric-specific arguments at evaluation time, so to use non-default weights either change the defaults in balanced_log_loss.data.frame() or wrap balanced_log_loss() in a small function with the weights fixed (and tag the wrapper with new_prob_metric() as well).
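
Since the weighting term is usually the inverse of the class frequencies, one way to derive the weights is sketched below, assuming a hypothetical training_set whose outcome column truth has a positive level "1" (the column and level names are made up):

# Hypothetical sketch: weights as normalised inverse class frequencies
n_pos   <- sum(training_set$truth == "1")
n_neg   <- sum(training_set$truth != "1")
n_total <- n_pos + n_neg

w_positive <- n_total / (2 * n_pos)  # the rarer class gets the larger weight
w_negative <- n_total / (2 * n_neg)

With perfectly balanced classes both weights are 1 and the metric reduces to the ordinary log loss; the computed values can then be baked in as the defaults of balanced_log_loss.data.frame() (or a small wrapper) before building the metric_set().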