How to Handle an Imbalanced Dataset in NER?


I'm doing information extraction using NER. My dataset is mostly from the computer science domain and uses the labels/tags "TUJUAN", "METODE", and "TEMUAN". The problem is that almost 80-90% of the tokens are labeled O, which means they have no meaningful tag. The model's precision and recall are 0, while the accuracy is about 0.78. I use IndoBERT as the model for the NER task.
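As a small illustration of how such numbers can arise, here is a minimal sketch with made-up tags (not the real dataset): a model that predicts O for every token reaches 0.78 accuracy on data that is 78% O, while precision and recall over the entity tags are both 0. The helper `token_metrics` is hypothetical and works at token level; libraries such as seqeval score whole entity spans instead.

```python
def token_metrics(gold, pred, outside="O"):
    """Token-level accuracy plus precision/recall over non-O tags."""
    correct = sum(g == p for g, p in zip(gold, pred))
    accuracy = correct / len(gold)
    # True positives: entity tokens predicted with exactly the right tag
    tp = sum(g == p and g != outside for g, p in zip(gold, pred))
    predicted_entities = sum(p != outside for p in pred)
    gold_entities = sum(g != outside for g in gold)
    precision = tp / predicted_entities if predicted_entities else 0.0
    recall = tp / gold_entities if gold_entities else 0.0
    return accuracy, precision, recall


# Hypothetical data: 78 O tokens and 22 entity tokens
gold = ["O"] * 78 + ["TUJUAN"] * 8 + ["METODE"] * 8 + ["TEMUAN"] * 6
pred = ["O"] * len(gold)  # degenerate model that always predicts O

accuracy, precision, recall = token_metrics(gold, pred)
print(accuracy, precision, recall)
```

So an accuracy close to the share of O tokens, combined with zero precision and recall, suggests the model has collapsed to predicting O everywhere.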


I suspect this happens because my dataset is extremely imbalanced. At first, I wanted to change the loss function in BertForTokenClassification (shown below, from the documentation) to Dice Loss or Focal Loss, as mentioned here, but I don't know how since my Python knowledge is still very weak.

class BertForTokenClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        checkpoint=_CHECKPOINT_FOR_TOKEN_CLASSIFICATION,
        output_type=TokenClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
        expected_output=_TOKEN_CLASS_EXPECTED_OUTPUT,
        expected_loss=_TOKEN_CLASS_EXPECTED_LOSS,
    )
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

        sequence_output = self.dropout(sequence_output)
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

My full code is here

Can I get any help on how to handle my imbalanced dataset for this problem?


3
Sina Salam On

Handling an imbalanced dataset in NER can be challenging, but there are several strategies you can employ.

  1. You can handle class imbalance by assigning different weights to different classes in the Cross-Entropy Loss function.

  2. Focal Loss is designed to address class imbalance by down-weighting well-classified examples. It focuses more on hard, misclassified examples. This can help in cases where the majority class dominates the loss calculation.

  3. Dice Loss is another loss function commonly used for imbalanced datasets. It measures the overlap between predicted and target masks. This loss function tends to work well for tasks like segmentation but can also be adapted for NER.
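To make the down-weighting in point 2 concrete, here is a minimal sketch of the per-token arithmetic using only the standard library. It assumes the usual focal loss form, cross-entropy scaled by `(1 - p) ** gamma`, where `p` is the probability the model assigns to the true tag; the function names and the example probabilities are illustrative, not part of any library.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one token, given the probability assigned to its true tag."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one token: cross-entropy scaled by (1 - p_true) ** gamma."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# An easy token (e.g. an obvious O) versus a hard token (e.g. a missed entity)
for name, p in [("easy", 0.95), ("hard", 0.10)]:
    print(name, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

With `gamma=2.0`, the easy token's loss is scaled by 0.05 ** 2 = 0.0025 and the hard token's by 0.9 ** 2 = 0.81, so the many confidently classified O tokens contribute far less to the total. Setting `gamma=0` recovers plain cross-entropy.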

You can modify your code to implement weighted cross-entropy loss. Adapted from the class above, it could look like the following in Python:

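from typing import Optional, Tuple, Union

import torch
from torch import nn
from transformers import BertModel, BertPreTrainedModel
from transformers.modeling_outputs import TokenClassifierOutput


class BertForTokenClassification(BertPreTrainedModel):
    def __init__(self, config, class_weights=None):
        super().__init__(config)
        self.num_labels = config.num_labels
        # One weight per label id (length must equal num_labels), or None for an unweighted loss
        self.class_weights = class_weights

        self.bert = BertModel(config, add_pooling_layer=False)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Same forward pass as in the original class
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        logits = self.classifier(self.dropout(outputs[0]))

        loss = None
        if labels is not None:
            # Calculate loss with class weights (unweighted if none were given).
            # Positions labelled -100 are still skipped via the default ignore_index.
            loss_weights = None
            if self.class_weights is not None:
                loss_weights = torch.tensor(self.class_weights, dtype=torch.float32, device=logits.device)
            loss_fct = nn.CrossEntropyLoss(weight=loss_weights)
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )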

Looking at the code above, you can pass the class_weights parameter when you initialize your model. You can calculate class weights based on the frequency of each class in your dataset. It could look like the following:

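# Calculate class weights: one entry per label id, so len(class_weights) must equal num_labels
num_labels = 3  # example value; use the number of tags in your own label set
class_weights = [0.1, 0.3, 0.6]  # Adjust these weights based on your dataset
model = BertForTokenClassification.from_pretrained("indobenchmark/indobert-base-p2", num_labels=num_labels, class_weights=class_weights)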

Adjust the values of class_weights according to the distribution of your classes. This approach will give more weight to minority classes, which can help in training with imbalanced datasets.
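One way to derive the weights from the data rather than by hand, as a minimal sketch: count how often each tag occurs in the training labels and take inverse frequencies. The `label2id` mapping and the tiny `train_tags` list are made-up placeholders, and the formula is one common "balanced" heuristic rather than the only option.

```python
from collections import Counter

# Hypothetical label mapping and training tags; replace with your own
label2id = {"O": 0, "TUJUAN": 1, "METODE": 2, "TEMUAN": 3}
train_tags = [
    ["O", "O", "TUJUAN", "O", "O", "O"],
    ["O", "METODE", "METODE", "O", "O", "TEMUAN"],
]

# Count every tag across all sentences
counts = Counter(tag for sentence in train_tags for tag in sentence)
total = sum(counts.values())

# Inverse-frequency weights, ordered by label id so that position i in the
# list matches class i in the loss. This divides by the tag count, so every
# tag in label2id must occur at least once in the training data.
id2label = {i: label for label, i in label2id.items()}
class_weights = [
    total / (len(label2id) * counts[id2label[i]]) for i in range(len(label2id))
]
print(class_weights)
```

The order matters: the weight at index i is applied to label id i, so build the list from the same mapping the model was trained with.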

Check it through and implement it carefully, and do not hesitate to ask if you run into any issues.

As a reference, you can also read more from this link:

A Step-by-Step Guide to handling imbalanced datasets in Python

Following up on your additional questions in the comments:

You asked how to set an ideal class_weight, and whether you should assign a weight to the "O" label.

Setting an ideal class_weight for an imbalanced dataset starts from the distribution of your classes. The goal is to assign higher weights to underrepresented classes and lower weights to overrepresented classes, so that the majority class no longer dominates the loss during training. At the same time, the weights should not be so extreme that they bias the model too much towards the minority classes.

As for whether to assign a weight to the "O" label: nn.CrossEntropyLoss expects the weight tensor to have one entry per class, so "O" needs an entry as well. In many Named Entity Recognition (NER) tasks, the "O" label represents tokens that are not part of any named entity, and these tokens typically constitute the majority class. Since the goal is to address the imbalance, "O" should normally get the smallest weight; giving it a large weight would be counterproductive, because it would push the model even further towards predicting "O". If you have a specific reason to treat the "O" label differently, you can adjust its weight accordingly.

Below is an example of how you can set class weights for a hypothetical dataset with three classes:

# Calculate class frequencies (example values):
# class 0 has 1000 samples, class 1 has 200 samples, and class 2 has 50 samples
class_frequencies = [1000, 200, 50]

# Compute class weights
total_samples = sum(class_frequencies)
class_weights = [total_samples / (len(class_frequencies) * freq) for freq in class_frequencies]

# Adjust weights if necessary
# You might want to manually adjust weights based on the characteristics of your dataset and task requirements

# Example output
print("Class Weights:", class_weights)
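Inverse-frequency weights can become very large for rare classes, which may push the model to over-predict them. One possible adjustment, offered here as an assumption rather than a prescribed method, is to dampen the weights with a square root and cap them. The helper `dampen` and the value of `max_weight` are illustrative choices.

```python
import math

class_frequencies = [1000, 200, 50]  # same example values as above

total_samples = sum(class_frequencies)
raw_weights = [total_samples / (len(class_frequencies) * f) for f in class_frequencies]


def dampen(weights, max_weight=5.0):
    """Square-root the weights to shrink their spread, then cap each at max_weight."""
    return [min(math.sqrt(w), max_weight) for w in weights]


smoothed_weights = dampen(raw_weights)
print("Raw:", raw_weights)
print("Smoothed:", smoothed_weights)
```

The ordering of the classes is preserved (rarer classes still get larger weights), but the ratio between the largest and smallest weight shrinks. Whether this helps is something to check on a validation set.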

Cheers.

0
Sina Salam On

Based on your question posted in the comment section, here is a continuation of the previous answer, although it can also be read independently.

my data follows IOB format so although it has 3 label, it has 6 tags. How much class_weight element it should be?

In NER tasks with IOB (Inside, Outside, Beginning) tagging, every entity type gets two tags, one for the beginning of an entity (B-) and one for the inside of an entity (I-), and there is a single shared tag for tokens that are outside of any entity (O). Each of these tags is a separate class for the model and may be assigned a different weight depending on the imbalance in your dataset.

The rule is that class_weights needs exactly one element per tag the model predicts, i.e. its length must equal num_labels. So if your data follows the IOB format and your label set has 6 tags in total, you should have 6 elements in your class_weights list. If O is a seventh tag in addition to the six B-/I- tags, you need 7 elements.

You could define your class_weights list as in the following code:

# Assuming class frequencies for each tag (example values)
class_frequencies = [1000, 200, 50, 100, 1500, 800]

# Compute class weights
total_samples = sum(class_frequencies)
class_weights = [total_samples / (len(class_frequencies) * freq) for freq in class_frequencies]

# Example output
print("Class Weights:", class_weights)
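To check that the weight list lines up with the tag set, here is a minimal sketch that builds the IOB tags from the three labels in the question. It assumes O is a separate tag in addition to the B-/I- tags, which gives seven classes; the frequencies are hypothetical, and the tag order is an assumption that should be replaced by the order of your own label ids.

```python
entity_types = ["TUJUAN", "METODE", "TEMUAN"]

# IOB tag set: one shared O tag plus B- and I- for every entity type
tags = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")]
label2id = {tag: i for i, tag in enumerate(tags)}

# Hypothetical per-tag counts, in the same order as `tags`
class_frequencies = [5000, 100, 300, 120, 400, 80, 250]
assert len(class_frequencies) == len(tags)

total_samples = sum(class_frequencies)
class_weights = [total_samples / (len(tags) * freq) for freq in class_frequencies]

for tag, weight in zip(tags, class_weights):
    print(f"{tag}: {weight:.3f}")
```

If `len(class_weights)` differs from the model's `num_labels`, `nn.CrossEntropyLoss` will raise an error, so this length check is worth keeping in the training script.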