How to make sense of the output of a reward model? How do we know which string it prefers?


While doing RLHF, I trained a reward model on a dataset of chosen and rejected string pairs. The setup is very similar to the Reward Modeling example in the official TRL library.

I used the LLaMA 2 7B model (I tried both the chat and non-chat versions; the behavior is the same). Now I would like to pass an input to the trained reward model and inspect its output. However, I can't make any sense of what the reward model outputs.

For example, I made the input as follows -

chosen = "This is the chosen text."
rejected = "This is the rejected text."
test = {"chosen": chosen, "rejected": rejected}

Then I try -

import torch
import torch.nn as nn

from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM

base_model_id = "./llama2models/Llama-2-7b-chat-hf"
model_id = "./reward_models/Llama-2-7b-chat-hf_rm_inference/checkpoint-500"

# Load the fine-tuned reward model checkpoint as a sequence classifier
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    # num_labels=1,  # gives an error since the model always outputs a tensor of [2, 4096]
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Score each string separately and read the classifier logits as the reward
rewards_chosen = model(**tokenizer(chosen, return_tensors='pt')).logits
print('reward chosen is ', rewards_chosen)

rewards_rejected = model(**tokenizer(rejected, return_tensors='pt')).logits
print('reward rejected is ', rewards_rejected)

# Pairwise reward-modeling loss, as in the TRL example
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss)

And the output looks something like this -

reward chosen is  tensor([[ 2.1758, -8.8359]], dtype=torch.float16)
reward rejected is  tensor([[ 1.0908, -2.2168]], dtype=torch.float16)
tensor(0.0044)

Printing the loss wasn't helpful. I don't see any trend (for example, the loss turning from positive to negative) even if I swap rewards_chosen and rewards_rejected in the formula.

The raw outputs did not yield any insight either. I do not understand how to make sense of rewards_chosen and rewards_rejected. Why is each of them a tensor with two elements instead of one?

I also tried rewards_chosen > rewards_rejected, but that is not helpful since it outputs tensor([[True, False]]).
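As a sanity check (this is just me poking at the loaded model, assuming the checkpoint comes up as a plain LlamaForSequenceClassification with its usual score head), the size of the classification head seems to confirm where the two values come from:

# Sanity check: the classification head determines how many logits come out.
# (Assumes the checkpoint is loaded as a LlamaForSequenceClassification.)
print(model.config.num_labels)    # 2 here; a reward model should have 1
print(model.score.weight.shape)   # [num_labels, hidden_size], e.g. [2, 4096]

# With two logits per sequence, rewards_chosen - rewards_rejected is a
# 2-element vector, so the .mean() in the loss averages over both columns.
# Also, -logsigmoid(x) is always >= 0, so swapping chosen and rejected
# can never make the loss negative; it only changes its magnitude.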

When I try a public reward model (it's only a few megabytes since it's just a PEFT adapter: https://huggingface.co/vincentmin/llama-2-13b-reward-oasst1), I get outputs that make more sense, since it returns a single-element tensor.

Code -

import torch
import torch.nn as nn

from peft import PeftModel, PeftConfig
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModelForCausalLM

peft_model_id = "./llama-2-13b-reward-oasst1"
base_model_id = "/cluster/work/lawecon/Work/raj/llama2models/13b-chat-hf"

config = PeftConfig.from_pretrained(peft_model_id)

# The base model is loaded with a single-label classification head, and the
# reward-model adapter is then applied on top of it
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id,
    num_labels=1,
    # torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

chosen = "prompter: What is your purpose? assistant: My purpose is to assist you."
rejected = "prompter: What is your purpose? assistant: I do not understand you."
test = {"chosen": chosen, "rejected": rejected}

model.eval()
with torch.no_grad():
    rewards_chosen = model(**tokenizer(chosen, return_tensors='pt')).logits
    print('reward chosen is ', rewards_chosen)

    rewards_rejected = model(**tokenizer(rejected, return_tensors='pt')).logits
    print('reward rejected is ', rewards_rejected)

    loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
    print(loss)

Output -

reward chosen is  tensor([[0.6876]])
reward rejected is  tensor([[-0.9243]])
tensor(0.1819)

This output makes more sense to me. But why does my own reward model produce two values instead of one?
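As a side note, with scalar rewards like these the comparison can be read as a Bradley-Terry style preference probability. This is just my own check on the numbers printed above, not something taken from the model card:

import torch

# Scalar rewards copied from the output above
reward_chosen = torch.tensor(0.6876)
reward_rejected = torch.tensor(-0.9243)

# Probability that "chosen" is preferred over "rejected"
p_chosen = torch.sigmoid(reward_chosen - reward_rejected)
print(p_chosen)  # ~0.834

# The pairwise loss is just -log of that probability
print(-torch.nn.functional.logsigmoid(reward_chosen - reward_rejected))  # ~0.1819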


1 Answer

melvindoo:

I've been facing the exact same issue myself, after following the same example in the TRL library! I think there's a mistake in that example; reward models should output single-element tensors, as you suggest, rather than two-element tensors.

I believe that setting num_labels=1 when calling AutoModelForSequenceClassification.from_pretrained is the solution here. This instantiates a model with a single-element output.

I can see that you've commented this out in your example, saying that it "gives an error since the model always outputs a tensor of [2, 4096]". I get no such error, so I'm not sure what's going on for you there.
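For what it's worth, here is roughly what I mean. It's a minimal sketch that reuses the checkpoint paths from your question (the paths themselves are assumptions on my part). If your existing checkpoint was already saved with a two-label head, you may need to pass ignore_mismatched_sizes=True, or retrain with num_labels=1, for it to load cleanly:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

base_model_id = "./llama2models/Llama-2-7b-chat-hf"  # path from your question
model_id = "./reward_models/Llama-2-7b-chat-hf_rm_inference/checkpoint-500"

# A reward model is a sequence classifier with a single "label", so the
# logits come out with shape [batch_size, 1], i.e. one scalar per string.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
    ignore_mismatched_sizes=True,  # only needed if the saved head has 2 labels;
                                   # a mismatched head is re-initialized, so you
                                   # would then want to retrain the reward model
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

reward = model(**tokenizer("This is the chosen text.", return_tensors="pt")).logits
print(reward)  # e.g. tensor([[0.42]]) -- a single scalar reward per input string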