Introduction

Direct Preference Optimization (DPO) is a method for fine-tuning large language models by optimizing their parameters directly on human preference data. DPO addresses a weakness of supervised fine-tuning in the later stages of preference alignment: as training becomes more difficult and the gap between chosen and rejected responses narrows, Supervised Fine-Tuning (SFT) may inadvertently increase the reward assigned to incorrect responses. The idea behind DPO is to integrate preference data directly into the model's training process, adjusting the parameters to increase the likelihood of generating outputs similar to those preferred by humans.

Background

In the field of Natural Language Processing (NLP), the introduction of InstructGPT [1] has significantly advanced research in reinforcement learning from human feedback (RLHF). The RLHF stage typically follows supervised fine-tuning (SFT). InstructGPT employs proximal policy optimization (PPO) for preference optimization, but PPO requires training an additional reward model. Training this reward model is inherently unstable, and its quality directly affects the effectiveness of the final optimization. Subsequent research has bypassed the reward-model training step and performs preference training directly. Kahneman-Tversky Optimization (KTO) [2] learns from non-paired preference data, while Identity Preference Optimization (IPO) [3] uses a root-finding MSE loss to incorporate Kullback-Leibler (KL) regularization. Odds Ratio Preference Optimization (ORPO) [4] introduces a reference-model-free odds ratio term that directly contrasts winning and losing responses under the policy model, training jointly with the SFT objective.

Theoretical foundation

Bradley-Terry model

The DPO algorithm [5], true to its name, optimizes directly on preferences: it folds the reward function into the reinforcement learning objective itself, eliminating the need for separate reward modeling. DPO has thus emerged as a promising alternative for aligning Large Language Models (LLMs) with human or AI preferences.

Imagine that a large language model is prompted with a prompt $x$ and produces a pair of answers; we will typically prefer one over the other. The Bradley-Terry model [6] is used here to quantify such preferences. It applies to pairwise comparisons and lets us express the human preference distribution, denoted by $p^{*}$. Like most alignment methods, DPO requires a dataset of paired preferences $\mathcal{D} = \lbrace x^{(i)}, y_{w}^{(i)}, y_{l}^{(i)} \rbrace_{i = 1}^{N}$ sampled from $p^{*}$, where $y_{w}$ and $y_{l}$ denote the preferred and dispreferred responses, respectively. We assume that the preferences are generated by an underlying reward model $r^{*}(x, y)$, which can be parametrized by $r_{\phi}(x, y)$. Specifically, the probability that response $y_{w}$ is preferred over response $y_{l}$ is modeled as

$$ \begin{equation} p^{*}(y_{w} \succ y_{l} \mid x) = \frac{\exp(r^{*}(x,y_{w}))}{\exp(r^{*}(x,y_{w}))+\exp(r^{*}(x,y_{l}))}. \label{eq1} \end{equation} $$
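As a quick sanity check, here is a minimal Python sketch of Eq. \eqref{eq1} with hypothetical reward values; note that the Bradley-Terry probability reduces to a logistic function of the reward difference.

```python
import math

def bradley_terry_prob(reward_w: float, reward_l: float) -> float:
    """Probability that the response with reward `reward_w` is preferred
    over the response with reward `reward_l` under the Bradley-Terry model."""
    # Equivalent to exp(r_w) / (exp(r_w) + exp(r_l)), rewritten as a
    # logistic function of the reward difference for numerical stability.
    return 1.0 / (1.0 + math.exp(-(reward_w - reward_l)))

# Hypothetical rewards for a preferred and a dispreferred response.
print(bradley_terry_prob(2.0, 0.5))  # ~0.82: the higher-reward response wins
```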

Looking at Eq. \eqref{eq1}, one question remains: why use the exponential function?

The exponential function turns ratios of scores into a log-linear model, which is convenient for optimization and computation. It also has a natural probabilistic interpretation: the resulting expression is exactly a softmax over the rewards. Finally, because the exponential function is strictly positive, both terms in the ratio are positive, so the expression is a valid probability even when the rewards themselves are negative.

There is a more general way to see this. Suppose we do not use $\exp(\cdot)$ and instead use a general base $a > 1$, i.e. $a^{x}$; then we get

$$ \begin{equation} p^{*}(y_{w}\succ y_{l} \mid x) = \frac{a^{r^{*}(x,y_{w})}}{a^{r^{*}(x,y_{w})}+a^{r^{*}(x,y_{l})}} \end{equation} $$

However, if we apply the linear transformation $b(x,y) = \log_{a}(\mathrm{e})\, c(x,y)$, then

$$ \begin{equation} a^{b(x,y)} = a^{\log_{a}(\mathrm{e})c(x,y)} = (\mathrm{e}^{\log a})^{\log_{a}(\mathrm{e})c(x,y)} = \mathrm{e}^{c(x,y)} \end{equation} $$

It is not hard to see that for any base $a$, there is a corresponding linear transformation of the scores that brings us back to the exponential function. Together with the convenient properties mentioned above, this is why the exponential form is chosen.
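A short numerical check of this argument (the base $a$ and the scores below are arbitrary choices for illustration):

```python
import math

a = 10.0                      # arbitrary base a > 1
c = {"y_w": 2.0, "y_l": 0.5}  # hypothetical scores c(x, y)

# Linear transformation b(x, y) = log_a(e) * c(x, y)
b = {y: math.log(math.e, a) * v for y, v in c.items()}

# Bradley-Terry probability with base a and transformed scores b ...
p_base_a = a ** b["y_w"] / (a ** b["y_w"] + a ** b["y_l"])
# ... matches the exponential form with the original scores c.
p_exp = math.exp(c["y_w"]) / (math.exp(c["y_w"]) + math.exp(c["y_l"]))

print(abs(p_base_a - p_exp) < 1e-12)  # True
```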

Log-likelihood loss function

With Eq. \eqref{eq1} in hand, we can estimate the parameters of the reward model $r_{\phi}$ by maximum likelihood on the preference dataset $\mathcal{D}$. Taking the negative logarithm of the preference probability and averaging over the dataset gives the loss

$$ \begin{equation} \begin{aligned} \mathcal{L}_ {R}(r_{\phi}, \mathcal{D}) & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \frac{\exp \left(r_{\phi}\left(x, y_w\right)\right)}{\exp \left(r_{\phi}\left(x, y_w\right)\right)+\exp \left(r_{\phi}\left(x, y_l\right)\right)}\right] \\ & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \frac{1}{1+\exp \left(r_{\phi}\left(x, y_l\right)-r_{\phi}\left(x, y_w\right)\right)}\right] \\ & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}\left(x, y_w\right)-r_{\phi}\left(x, y_l\right)\right)\right] \end{aligned} \label{eq4} \end{equation} $$

where $\sigma$ is the logistic (sigmoid) function, $\sigma(x) = {1}/({1 + \exp(-x)})$.
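As an illustration, here is a minimal PyTorch sketch of this loss, assuming the reward model has already produced scalar rewards for a batch of preference pairs (the tensors below are hypothetical):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(rewards_w: torch.Tensor, rewards_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    rewards_w: r_phi(x, y_w) for a batch of preferred responses, shape (B,)
    rewards_l: r_phi(x, y_l) for a batch of dispreferred responses, shape (B,)
    """
    # -log sigma(r_w - r_l), averaged over the batch; logsigmoid is numerically stable.
    return -F.logsigmoid(rewards_w - rewards_l).mean()

# Hypothetical reward scores for a batch of three preference pairs.
loss = pairwise_reward_loss(torch.tensor([1.2, 0.3, 2.0]),
                            torch.tensor([0.4, 0.9, 1.5]))
print(loss)  # smaller when preferred responses receive higher rewards
```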

During the reinforcement learning fine-tuning phase, the learned reward is used to provide feedback to the language model by optimizing the following objective,

$$ \begin{equation} \max_{\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}, y \sim \pi_{\theta}}[r_{\phi}(x,y)] - \beta \mathbb{D}_ {\text{KL}}[\pi_{\theta}(y\mid x) || \pi_{\text{ref}}(y\mid x)] \label{eq5} \end{equation} $$

where $\beta$ is a hyper-parameter that controls the deviation from the baseline reference policy $\pi_{\text{ref}}$, and $\pi_{\theta}$ is the language model policy. The first term, $\mathbb{E}_ {x \sim \mathcal{D}, y \sim \pi_{\theta}}[r_ {\phi}(x,y)]$, reflects our desire to obtain as much reward as possible on the given task, while we simultaneously minimize the second term, $\mathbb{D}_ {\text{KL}}[\pi_{\theta}(y\mid x) || \pi_{\text{ref}}(y\mid x)]$, which measures how far the model distribution has drifted from the reference distribution.
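To make the objective concrete, here is a rough sample-based sketch in PyTorch, assuming we already have per-sample rewards and sequence log-probabilities under both the policy and the reference model (all names and values below are hypothetical):

```python
import torch

def kl_regularized_objective(rewards: torch.Tensor,
                             logprobs_policy: torch.Tensor,
                             logprobs_ref: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    """Sample-based estimate of E[r_phi(x, y)] - beta * KL(pi_theta || pi_ref).

    rewards:         r_phi(x, y) for sampled completions, shape (B,)
    logprobs_policy: log pi_theta(y | x) summed over tokens, shape (B,)
    logprobs_ref:    log pi_ref(y | x) summed over tokens, shape (B,)
    """
    # For samples y ~ pi_theta, log pi_theta(y|x) - log pi_ref(y|x) is an
    # unbiased single-sample estimate of the KL divergence term.
    kl_estimate = logprobs_policy - logprobs_ref
    return (rewards - beta * kl_estimate).mean()

# Hypothetical batch: the objective rises with reward and falls as the
# policy drifts away from the reference model.
obj = kl_regularized_objective(torch.tensor([1.0, 0.5]),
                               torch.tensor([-12.3, -8.7]),
                               torch.tensor([-12.0, -9.1]))
print(obj)
```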

Deriving the optimum of the reward maximization objective

For two probability distributions $P$ and $Q$, the Kullback-Leibler (KL) divergence $\mathbb{D}_{\mathrm{KL}}(P || Q)$ is defined as

$$ \begin{equation} \mathbb{D}_ {\text{KL}}(P || Q)= \sum_{x \in \mathcal{X}} P(x)\log \left( \frac{P(x)}{Q(x)} \right) \end{equation} $$

where $\mathcal{X}$ is the set of possible outcomes of the discrete random variable $x$. The KL divergence measures the information lost when distribution $Q$ is used to approximate distribution $P$; in other words, it quantifies how different $Q$ is from $P$. In the derivation that follows, we also need the partition function $Z(x)$, a normalization term borrowed from statistical physics. Starting from Eq. \eqref{eq5}, we have