Assume that I want to build a binary classifier using an LLM, which takes an input document x and outputs a label y, where y_w is the correct answer and y_l is the incorrect answer.
Intuitively, I want to maximize p(y_w|x) and minimize p(y_l|x). So what difference does it make if we simply do SFT using the cross-entropy loss, as opposed to using DPO?
Cross-entropy loss:

L_SFT(θ) = -log π_θ(y_w | x)
The loss function in the DPO paper:

L_DPO(θ) = -log σ( β log [π_θ(y_w|x) / π_ref(y_w|x)] - β log [π_θ(y_l|x) / π_ref(y_l|x)] )
In this particular scenario of using an LLM as a classifier, can I say that SFT and DPO are equivalent?
I can see that the loss functions are specified differently, but what does the difference mean from a mathematical/computational perspective? In other words, what is the contribution of the DPO method when we already have SFT? Thanks in advance.
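To make my question concrete, here is a minimal numeric sketch of the two losses as I understand them (the function names, the β value, and the specific log-probability inputs are just illustrative, not from any library). It shows the behavioral difference I am asking about: SFT's cross-entropy only looks at the correct answer's log-probability, while DPO only looks at the margin between the two answers relative to a frozen reference model.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sft_loss(logp_w):
    """Cross-entropy on the correct label: only log p(y_w|x) matters;
    p(y_l|x) never appears in the loss."""
    return -logp_w

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: logistic loss on the reward margin between the chosen (y_w)
    and rejected (y_l) answers, measured relative to a frozen reference
    policy's log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# SFT strictly prefers higher p(y_w|x), regardless of p(y_l|x):
loss_hi = sft_loss(math.log(0.9))
loss_lo = sft_loss(math.log(0.6))

# DPO is unchanged if both log-probs shift by the same amount,
# because only the relative margin over the reference matters:
a = dpo_loss(-1.0, -2.0, -1.0, -2.0)
b = dpo_loss(-0.5, -1.5, -1.0, -2.0)  # both raised by 0.5
```

So, for instance, a model update that raises log π_θ(y_w|x) and log π_θ(y_l|x) by the same amount reduces the SFT loss but leaves the DPO loss exactly where it was, which is part of what I mean by asking whether the two are equivalent in the binary case.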

