Viewing Direct Preference Optimization from a physical perspective

Introduction

Direct Preference Optimization (DPO) fine-tunes LLMs using human feedback directly, bypassing the traditional reinforcement learning pipeline. Unlike methods that require training a separate reward model, DPO integrates preference data straight into the optimization objective—making it both simpler and more stable.

Why does this matter? In later stages of RL training, the gap between "preferred" and "rejected" responses becomes vanishingly small. Supervised Fine-Tuning (SFT) alone can inadvertently amplify rewards for subtly incorrect outputs. DPO addresses this by directly optimizing against pairwise preferences, ensuring the model learns from nuanced distinctions that SFT misses.

Background

In NLP, InstructGPT¹ popularized RLHF: using reinforcement learning to align language models with human preferences. The standard pipeline first trains a reward model, then optimizes the policy against it via PPO. This two-stage approach is fragile, because the reward model's quality directly determines downstream performance.

Recent work has streamlined this pipeline by eliminating the separate reward model entirely:

KTO²: Learns from unpaired preference data (winners without explicit losers)
IPO³: Uses root-finding MSE loss with KL regularization
ORPO⁴: Introduces a reference-free odds ratio term, jointly training with SFT objectives

DPO sits in this lineage—it replaces the complex RL loop with a single log-likelihood objective over pairwise preferences.

Theoretical Foundation

The Bradley-Terry Model

The DPO algorithm eliminates the need for a separate reward model by treating preference optimization as a direct logistic regression problem⁵. At its core lies the Bradley-Terry model, a statistical framework for inferring latent preferences from pairwise comparisons.

Imagine prompting an LLM with input $x$ to generate a pair of responses, of which a human annotator prefers $y_w$ over $y_l$. The Bradley-Terry model⁶ posits a latent reward $r^*(x, y)$ and sets the probability of preferring $y_w$ over $y_l$ to a sigmoid function of their reward difference:

$$ \begin{equation} p^*(y_w \succ y_l \mid x) = \frac{\exp(r^*(x,y_w))}{\exp(r^*(x,y_w))+\exp(r^*(x,y_l))}. \label{eq1} \end{equation} $$

This formulation is elegant for three reasons. First, exponentiation maps any real-valued reward to a positive weight, so the normalized ratio is always a valid probability. Second, the probability depends only on the reward difference, and its logarithm is smooth and concave in that difference, which suits gradient-based optimization. Third, it mirrors the Boltzmann distribution in statistical physics, where state energies $E_i$ determine state probabilities via $\propto \exp(-E_i/k_BT)$, with the reward playing the role of a negative energy. The connection is more than cosmetic: the normalization constant of the Boltzmann distribution, the partition function, reappears in the derivation below.
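As a quick numerical illustration of the Bradley-Terry probability, here is a minimal sketch in plain Python. The function names and the reward values are hypothetical, chosen only for this example; the point is that the softmax-style ratio and the sigmoid of the reward difference give the same number.

```python
import math

def bradley_terry_prob(r_w: float, r_l: float) -> float:
    """P(y_w preferred over y_l) under Bradley-Terry, given scalar rewards."""
    # Subtract the max before exponentiating; the ratio is unchanged by the shift.
    m = max(r_w, r_l)
    e_w = math.exp(r_w - m)
    e_l = math.exp(r_l - m)
    return e_w / (e_w + e_l)

def sigmoid(z: float) -> float:
    """Logistic function, branched to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical rewards for a preferred and a rejected response.
r_w, r_l = 2.0, 0.5
print(bradley_terry_prob(r_w, r_l), sigmoid(r_w - r_l))
```

Only the difference of the two rewards matters, so adding the same constant to both leaves the preference probability unchanged.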

Given this model, we can formulate the learning objective as a maximum likelihood problem over paired preference data $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$. Taking the negative log-likelihood gives us a clean gradient-based objective:

Log-likelihood Loss Function

$$ \begin{equation} \begin{aligned} \mathcal{L}_ {R}(r_{\phi}, \mathcal{D}) & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \frac{\exp \left(r_{\phi}\left(x, y_w\right)\right)}{\exp \left(r_{\phi}\left(x, y_w\right)\right)+\exp \left(r_{\phi}\left(x, y_l\right)\right)}\right] \\ & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \frac{1}{1+\exp \left(r_{\phi}\left(x, y_l\right)-r_{\phi}\left(x, y_w\right)\right)}\right] \\ & =-\mathbb{E}_ {\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_{\phi}\left(x, y_w\right)-r_{\phi}\left(x, y_l\right)\right)\right] \end{aligned} \label{eq2} \end{equation} $$

where $\sigma$ is the logistic function, $\sigma(x) = 1/(1 + \exp(-x))$.
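The loss above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not a training implementation: the rewards are given directly as numbers instead of coming from a model $r_\phi$, and `reward_model_loss` and `toy_pairs` are names invented for the example.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z)) = -log(1 + exp(-z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(pairs):
    """Mean negative log-likelihood over (r_w, r_l) reward pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)

# Hypothetical (reward of preferred, reward of rejected) pairs.
toy_pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
print(reward_model_loss(toy_pairs))
```

A pair with equal rewards contributes $\log 2$ to the loss; the contribution shrinks toward zero as the preferred response's reward pulls ahead.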

To align the policy $\pi_\theta$ with preferences, we maximize expected reward while constraining it to stay close to a reference model $\pi_{\text{ref}}$:

$$ \begin{equation} \max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta}[r_\phi(x,y)] - \beta \mathbb{D}_{\text{KL}}[\pi_\theta(y \mid x) || \pi_{\text{ref}}(y \mid x)] \label{eq3} \end{equation} $$

Here, $\beta$ is a hyperparameter controlling how far $\pi_\theta$ can drift from the reference policy $\pi_{\text{ref}}$. The KL term acts as a regularization anchor—it prevents the model from collapsing onto a single high-reward response and preserves diversity.
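To make the role of $\beta$ concrete, the following sketch evaluates the regularized objective for a single prompt with three candidate responses. Everything here is a toy assumption: the reference policy, the rewards, and the helper names are made up for illustration, and a real policy is a distribution over token sequences, not a three-element list.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total

def regularized_objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref), for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, pi_ref)

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy over 3 responses
rewards = [0.0, 1.0, 2.0]  # hypothetical rewards
greedy = [0.0, 0.0, 1.0]   # all mass on the highest-reward response

for beta in (0.1, 1.0, 10.0):
    stay = regularized_objective(pi_ref, pi_ref, rewards, beta)
    jump = regularized_objective(greedy, pi_ref, rewards, beta)
    print(beta, stay, jump)
```

With a small $\beta$ the collapsed, greedy policy scores higher than staying at the reference; with a large $\beta$ the KL penalty dominates and staying close to $\pi_{\text{ref}}$ wins.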

Deriving the optimum of the reward maximization objective

For two probability distributions $P$ and $Q$, Kullback-Leibler (KL) divergence $\mathbb{D}_{\mathrm{KL}}(P|| Q)$ is defined as

$$ \begin{equation} \mathbb{D}_ {\text{KL}}(P || Q)= \sum_{x \in \mathcal{X}} P(x)\log \left( \frac{P(x)}{Q(x)} \right) \end{equation} $$

where $\mathcal{X}$ is the set of possible outcomes of the discrete random variable $x$. The KL divergence measures the information lost when $Q$ is used to approximate $P$; it is non-negative, equals zero only when $P = Q$, and is not symmetric in its arguments. The derivation below borrows the partition function $Z(x)$ from statistical physics. Starting from Eq. \eqref{eq3}, we have

$$ \begin{equation} \begin{aligned} & \max_ {\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}, y \sim \pi_{\theta}} [r_{\phi}(x, y)] -\beta \mathbb{D}_ {\mathrm{KL}}\left[\pi_{\theta}(y \mid x) \| \pi_{\mathrm{ref}}(y \mid x)\right] \\ & = \max_{\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}, y \sim \pi_{\theta}}[r_{\phi}(x, y)] - \mathbb{E} _ {x \sim \mathcal{D}, y \sim \pi_{\theta}}\left[ \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]\\ & =\max_ {\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}} \mathbb{E}_ {y \sim \pi_{\theta}}\left[r_{\phi}(x, y)-\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right] \\ & =\min_ {\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}} \mathbb{E}_ {y \sim \pi_{\theta}}\left[\log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}-\frac{1}{\beta} r_{\phi}(x, y)\right] \\ & =\min_ {\pi_{\theta}} \mathbb{E}_ {x \sim \mathcal{D}} \mathbb{E}_ {y \sim \pi_{\theta}}\left[\log \frac{\pi_{\theta}(y \mid x)}{\textcolor{red}{\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r_{\phi}(x, y)\right)}}-\log Z(x)\right], \end{aligned}\label{eq7} \end{equation} $$

where the partition function is

$$ \begin{equation} Z (x) = \sum_{y}\pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r_{\phi} (x, y) \right). \end{equation} $$

To interpret the partition function, return to statistical physics. There, $Z$ sums the statistical weights of all possible states of a system and determines its equilibrium properties: $ Z = \sum_{i} \exp\left(-\frac{E_{i}}{k_{B}T}\right)$, where $E_{i}$ is the energy of the $i$-th state, $k_{B}$ is the Boltzmann constant, and $T$ is the temperature of the system. Physically, $Z$ is the normalization factor that turns each state's weight into a probability. In our setting, the goal is a policy $\pi_\theta$ that maximizes the reward $r_\phi(x, y)$ while keeping the KL divergence from the reference distribution $\pi_{\text{ref}}(y \mid x)$ small. The function $Z(x)$ plays the same normalizing role: responses $y$ correspond to states, $r_\phi(x, y)$ to a negative energy, $\beta$ to the temperature $k_B T$, and $\pi_{\text{ref}}$ supplies a prior weight for each state. Dividing by $Z(x)$ turns these weights into a valid distribution over $y$, which lets us rewrite the objective as a single KL divergence. We define

$$ \begin{equation} \begin{aligned} \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) &\exp \left(\frac{1}{\beta} r_{\phi}(x, y)\right) \\ & = \frac{\pi_{\text{ref}}(y \mid x)\exp \left( \frac{1}{\beta}r_{\phi}(x, y) \right)}{\sum_{y}\pi_{\text{ref}}(y \mid x)\exp\left( \frac{1}{\beta}r_{\phi}(x,y) \right)}\\ & = \pi^{*}(y \mid x) \end{aligned} \end{equation} $$
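The distribution $\pi^*$ is a reference-weighted Boltzmann (softmax) distribution with temperature $\beta$. A minimal sketch for a toy response set follows; the three-response setup, the numbers, and the function name `boltzmann_policy` are assumptions made for illustration. The function returns $\log Z(x)$ instead of $Z(x)$ to avoid overflow when $\beta$ is small.

```python
import math

def boltzmann_policy(pi_ref, rewards, beta):
    """Return (pi_star, log_z) with pi_star[y] = pi_ref[y] * exp(r[y] / beta) / Z."""
    # Shift by the largest scaled reward so exp() cannot overflow;
    # the shift is added back when computing log Z.
    shift = max(r / beta for r in rewards)
    weights = [q * math.exp(r / beta - shift) for q, r in zip(pi_ref, rewards)]
    norm = sum(weights)
    pi_star = [w / norm for w in weights]
    log_z = math.log(norm) + shift
    return pi_star, log_z

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy
rewards = [0.0, 1.0, 2.0]  # hypothetical rewards

for beta in (0.1, 1.0, 100.0):
    pi_star, log_z = boltzmann_policy(pi_ref, rewards, beta)
    print(beta, [round(p, 4) for p in pi_star], log_z)
```

At low temperature (small $\beta$) almost all probability moves to the highest-reward response, while at high temperature (large $\beta$) $\pi^*$ stays close to $\pi_{\text{ref}}$, mirroring how a physical system freezes into its ground state or spreads over many states.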

Substituting $\pi^*$ into Eq. \eqref{eq7} gives

$$ \begin{equation} \begin{aligned} \min_{\pi_{\theta}}\mathbb{E}_ {x \sim \mathcal{D}, y \sim \pi_{ \theta}}& \left[ \log \frac{\pi_{\theta}(y \mid x)}{\pi^{*}(y \mid x)} - \log Z(x) \right] \\ & = \min_{\pi_{\theta}}\mathbb{E}_ {x \sim \mathcal{D}} \left[ \mathbb{E}_ {y \sim \pi_{\theta}} \left[ \log \frac{\pi_{\theta}(y \mid x)}{\pi^{*}(y \mid x)} \right] - \log Z(x) \right] \\ & = \min_{\pi_{\theta}}\mathbb{E} _ {x \sim \mathcal{D}}\left[\mathbb{D}_ {\text{KL}}(\pi_{\theta}(y \mid x) ||\pi^{*}(y \mid x)) - \log Z(x)\right]. \end{aligned} \end{equation} $$

Because $Z(x)$ depends only on $x$ and not on $\pi_\theta$, the $\log Z(x)$ term does not affect the minimizer. The KL divergence is non-negative and reaches its minimum of $0$ only when the two distributions are identical, so the optimal solution is

$$ \begin{equation} \pi_{\theta}(y \mid x)=\pi^*(y \mid x)=\frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x) \exp \left(\frac{1}{\beta} r_{\phi}(x, y)\right).\label{eq11} \end{equation} $$
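The closed-form optimum can be sanity-checked numerically on a toy problem. The sketch below assumes a single prompt with three responses and made-up numbers; it compares the regularized objective at $\pi^*$ against randomly drawn policies. It also uses the identity that the objective evaluated at $\pi^*$ equals $\beta \log Z(x)$, which follows from the derivation above.

```python
import math
import random

def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for one prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)
    return expected_reward - beta * kl

def closed_form_policy(pi_ref, rewards, beta):
    """pi_star[y] = pi_ref[y] * exp(r[y] / beta) / Z, returned together with Z."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy
rewards = [0.0, 1.0, 2.0]  # hypothetical rewards
beta = 0.5

pi_star, z = closed_form_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

rng = random.Random(0)

def random_policy(n):
    """Draw a random point on the probability simplex (not uniformly)."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [v / total for v in raw]

# Gap between the closed-form optimum and each random policy; never negative.
gaps = [best - objective(random_policy(3), pi_ref, rewards, beta) for _ in range(1000)]
print(best, beta * math.log(z), min(gaps))
```

Each gap equals $\beta\, \mathbb{D}_{\text{KL}}(\pi \| \pi^*)$ for the sampled policy $\pi$, which is why none of them falls below zero.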

Taking the logarithm of both sides of Eq. \eqref{eq11} and solving for $r_{\phi}$ gives

$$ \begin{equation} \begin{aligned} r_{\phi}(x, y) &=\beta \log \left[\frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} Z(x)\right] \\ & =\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}+\beta \log Z(x). \end{aligned} \label{eq12} \end{equation} $$
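A short numerical check of this reparameterization, again on a hypothetical three-response toy problem: starting from known rewards, we build $\pi^*$, then recover the rewards from the log-ratio $\beta \log (\pi^* / \pi_{\text{ref}})$ plus $\beta \log Z(x)$.

```python
import math

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy
rewards = [0.0, 1.0, 2.0]  # hypothetical rewards
beta = 0.5

# Build the optimal policy from the rewards.
weights = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Log-ratio term alone: beta * log(pi_star / pi_ref).
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]
# Adding beta * log Z back recovers the original rewards.
recovered = [h + beta * math.log(z) for h in implicit]

print(recovered)
# Differences between responses do not depend on the beta * log Z term.
print(implicit[2] - implicit[0], rewards[2] - rewards[0])
```

The log-ratio term differs from the true reward only by a constant that is the same for every response to the prompt, so any comparison between two responses to the same prompt is unaffected.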

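Substituting Eq. \eqref{eq12} into Eq. \eqref{eq2}, the $\beta \log Z(x)$ terms cancel in the reward difference because both responses share the same prompt $x$, and we obtain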

$$ \begin{equation} \begin{aligned} & \mathcal{L}_ {\mathrm{DPO}}\left(\pi_\theta ; \pi_{\mathrm{ref}}\right)= -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi\left(x, y_w\right)-r_\phi\left(x, y_l\right)\right)\right]\\ & = -\mathbb{E}_{\left(x, y_w, y_l\right) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_w \mid x\right)}{\pi_{\mathrm{ref}}\left(y_w \mid x\right)}-\beta \log \frac{\pi_\theta\left(y_l \mid x\right)}{\pi_{\mathrm{ref}}\left(y_l \mid x\right)}\right)\right] \end{aligned} \end{equation} $$
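In practice the DPO loss is computed from sequence log-probabilities, where $\log \pi(y \mid x)$ is the sum of the token log-probabilities of $y$. The following is a minimal sketch in plain Python, assuming those four log-probabilities have already been computed for each pair; the function names, argument names, and the default `beta=0.1` are illustrative choices, not part of the derivation.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple from summed sequence log-probs."""
    # beta * [log(pi_theta/pi_ref)(y_w) - log(pi_theta/pi_ref)(y_l)]
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -log_sigmoid(logits)

def dpo_batch_loss(batch, beta=0.1):
    """Mean DPO loss over rows of (policy_w, policy_l, ref_w, ref_l) log-probs."""
    return sum(dpo_loss(*row, beta=beta) for row in batch) / len(batch)

# Hypothetical log-probabilities for two preference pairs.
batch = [
    (-10.0, -12.0, -11.0, -11.5),
    (-20.0, -19.0, -20.0, -19.0),  # policy identical to the reference
]
print(dpo_batch_loss(batch))
```

When the policy equals the reference model, every pair contributes $\log 2$; the loss drops as the policy raises the log-ratio of the preferred response relative to the rejected one.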

Discussion

Recent literature. More recently, Meng et al. proposed SimPO⁷, a preference optimization algorithm that uses the average log probability of a sequence as an implicit reward, which removes the need for a reference model. This simplifies training and reduces compute and memory cost. The authors report that SimPO outperforms DPO on benchmarks such as AlpacaEval 2 and Arena-Hard.
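For comparison with the DPO loss, here is a rough sketch of a reference-free loss built on the length-normalized log probability. It follows the form given in the SimPO paper, including a target margin `gamma` that is not discussed in this post; the function names and the default values of `beta` and `gamma` are placeholders for illustration only.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(policy_logp_w, len_w, policy_logp_l, len_l, beta=2.0, gamma=0.0):
    """Reference-free pairwise loss using average per-token log probability."""
    # Implicit reward: beta times the average log-probability per token.
    reward_w = beta * policy_logp_w / len_w
    reward_l = beta * policy_logp_l / len_l
    # gamma is a target margin between the two rewards (assumed 0 by default here).
    return -log_sigmoid(reward_w - reward_l - gamma)

# Hypothetical summed log-probs and token counts for one preference pair.
print(simpo_loss(-12.0, 10, -30.0, 15))
```

No reference-model log-probabilities appear in the loss, which is where the memory and compute savings come from.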

References

1. L. Ouyang, J. Wu, X. Jiang, et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
2. K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela, “KTO: Model alignment as prospect theoretic optimization,” arXiv preprint arXiv:2402.01306, 2024.
3. Z. Jiang, X. Huang, and C. Wei, “Preference as reward, maximum preference optimization with importance sampling,” arXiv preprint arXiv:2312.16430, 2023.
4. J. Hong, N. Lee, and J. Thorne, “Reference-free monolithic preference optimization with odds ratio,” arXiv preprint arXiv:2403.07691, 2024.
5. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
6. R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.
7. Y. Meng, M. Xia, and D. Chen, “SimPO: Simple preference optimization with a reference-free reward,” arXiv preprint arXiv:2405.14734, 2024.
