How can I execute Seq2SeqTrainer's compute_metrics() once to verify the correctness of the function?

I am working on Chinese sequence-to-sequence generation and have the following Hugging Face Transformers code to train a sequence-to-sequence model.

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_testset,
    eval_dataset=tokenized_evalset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)
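
For context, training_args is a Seq2SeqTrainingArguments instance. It is not shown here, but the relevant parts look roughly like the sketch below (values are placeholders): predict_with_generate=True is what makes the trainer hand compute_metrics() generated token IDs instead of logits, and EarlyStoppingCallback requires load_best_model_at_end=True plus a metric_for_best_model.

from transformers import Seq2SeqTrainingArguments

# Sketch with placeholder values, not the exact settings
training_args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",    # run evaluation (and compute_metrics) every epoch
    save_strategy="epoch",          # must match evaluation_strategy for the line below
    predict_with_generate=True,     # pass generated token IDs to compute_metrics
    load_best_model_at_end=True,    # required by EarlyStoppingCallback
    metric_for_best_model="bleu",   # key returned by compute_metrics (prefixed eval_)
)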

And the compute_metrics() function is as follows:

import numpy as np

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 (the ignore index used for label padding) with the real
    # pad token ID so the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    results = {"bleu": result["sacrebleu_score"], "chrf": result["chr_f_score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    results["gen_len"] = np.mean(prediction_lens)
    results = {k: round(v, 4) for k, v in results.items()}
    return results

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    # sacrebleu expects a list of references per prediction, hence the nesting
    labels = [[label.strip()] for label in labels]

    return preds, labels

where the tokenizer is BertTokenizer.from_pretrained('fnlp/bart-base-chinese').
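
(metric is not shown above; given that one compute() call returns both sacrebleu_score and chr_f_score, it is presumably a combined metric along the lines of this sketch — evaluate.combine() resolves the clash between the two metrics' "score" keys by prefixing them with each metric's name.)

import evaluate

# Presumed definition: combining sacrebleu and chrf makes their "score"
# keys clash, so evaluate returns them as sacrebleu_score and chr_f_score
metric = evaluate.combine(["sacrebleu", "chrf"])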

For the training results, the BLEU and chrF++ scores are around 24.126 and 22.440 respectively. However, when I run the following code to evaluate one of the sentence pairs from the same dataset, the BLEU and chrF++ scores come out 10 to 20 points higher. The code lives inside a class, hence the self. in the functions.

def calculate_bleu(self, reference, input_sentence):
    # sacreBLEU with Chinese tokenization
    bleu = self.sacrebleu.compute(predictions=[input_sentence], references=[reference], tokenize='zh')
    return bleu["score"]

def calculate_chrf(self, reference, input_sentence):
    # chrF++ (word_order=2), computed on lowercased text
    chrf = self.chrf.compute(predictions=[input_sentence], references=[reference], word_order=2, lowercase=True)
    return chrf["score"]

where self.chrf and self.sacrebleu are defined as:

self.sacrebleu = evaluate.load("sacrebleu")
self.chrf = evaluate.load("chrf")

where evaluate is the Hugging Face evaluate library (import evaluate).
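
Scoring a single (made-up) pair then looks like this, with evaluator standing in for an instance of that class:

# Hypothetical call; evaluator and both sentences are illustrative only
reference = "今天天气很好。"
hypothesis = "今天的天气不错。"
print(evaluator.calculate_bleu(reference, hypothesis))
print(evaluator.calculate_chrf(reference, hypothesis))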

I want to investigate the reason behind this discrepancy, and I suspect the problem lies within the compute_metrics() function: the latter approach never passes the text through the tokenizer's encode-decode round trip at all.
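
To see what that round trip involves, here is a minimal standalone sketch (the sentence is made up). BertTokenizer decodes Chinese text character by character, so the decoded strings come back with spaces the original sentences never had:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese')
ids = tokenizer("今天天气很好")["input_ids"]
# decode the same way compute_metrics() does
text = tokenizer.batch_decode([ids], skip_special_tokens=True)[0]
print(text)  # "今 天 天 气 很 好" -- note the inserted spaces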

Is there a way to execute compute_metrics() individually? What should I pass as the eval_preds parameter?
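
From reading the Trainer source, eval_preds appears to be a transformers.trainer_utils.EvalPrediction, essentially a (predictions, label_ids) pair of token-ID arrays. Would something like the following sketch be the right way to build one by hand (the input_ids/labels column names and max_length are assumptions about my tokenized dataset)?

import numpy as np
from transformers.trainer_utils import EvalPrediction

# Take a handful of tokenized eval examples (a datasets slice is a dict of lists)
batch = tokenized_evalset[:4]
inputs = tokenizer.pad({"input_ids": batch["input_ids"]}, return_tensors="pt")

# Generate predictions the way Seq2SeqTrainer does with predict_with_generate
pred_ids = model.generate(**inputs, max_length=64)

# Pad labels to a rectangular array with -100, as the data collator would
max_len = max(len(l) for l in batch["labels"])
label_ids = np.full((len(batch["labels"]), max_len), -100)
for i, l in enumerate(batch["labels"]):
    label_ids[i, : len(l)] = l

# .cpu() first if the model sits on a GPU
metrics = compute_metrics(EvalPrediction(predictions=pred_ids.numpy(),
                                         label_ids=label_ids))
print(metrics)

Alternatively, out = trainer.predict(tokenized_evalset.select(range(4))) returns out.predictions and out.label_ids, which can be passed to compute_metrics() the same way.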
