Assume that I want to build a binary classifier using an LLM, which takes an input document x and outputs a label y, where y_w is the correct answer and y_l is the incorrect answer.
Intuitively, I want to maximize p(y_w|x) and minimize p(y_l|x). So what difference does it make if we simply do SFT using the cross-entropy loss, as opposed to using DPO?
Cross-entropy loss:

L_SFT(θ) = -log π_θ(y_w | x)
The loss function in the DPO paper:

L_DPO(θ) = -log σ( β log [π_θ(y_w|x) / π_ref(y_w|x)] - β log [π_θ(y_l|x) / π_ref(y_l|x)] )
In this particular scenario of using an LLM as a classifier, can I say that SFT and DPO are equivalent?
I can see that the loss functions are specified differently, but what does the difference mean from a mathematical/computational perspective? In other words, what is the contribution of the DPO method when we already have SFT? Thanks in advance.
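To make my question concrete, here is a minimal numeric sketch of the two losses as I understand them (the function names, the β value, and the specific log-probability inputs are just illustrative, not from any library). It shows the behavioral difference I am asking about: SFT's cross-entropy only looks at the correct answer's log-probability, while DPO only looks at the margin between the two answers relative to a frozen reference model.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sft_loss(logp_w):
    """Cross-entropy on the correct label: only log p(y_w|x) matters;
    p(y_l|x) never appears in the loss."""
    return -logp_w

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: logistic loss on the reward margin between the chosen (y_w)
    and rejected (y_l) answers, measured relative to a frozen reference
    policy's log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

# SFT strictly prefers higher p(y_w|x), regardless of p(y_l|x):
loss_hi = sft_loss(math.log(0.9))
loss_lo = sft_loss(math.log(0.6))

# DPO is unchanged if both log-probs shift by the same amount,
# because only the relative margin over the reference matters:
a = dpo_loss(-1.0, -2.0, -1.0, -2.0)
b = dpo_loss(-0.5, -1.5, -1.0, -2.0)  # both raised by 0.5
```

So, for instance, a model update that raises log π_θ(y_w|x) and log π_θ(y_l|x) by the same amount reduces the SFT loss but leaves the DPO loss exactly where it was, which is part of what I mean by asking whether the two are equivalent in the binary case.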

