Deep Reinforcement Learning

Introduction

Direct Preference Optimization (DPO) fine-tunes LLMs using human feedback directly, bypassing the traditional reinforcement learning pipeline. Unlike methods that require training a separate reward model, DPO integrates preference data straight into the optimization objective, which makes it both simpler and more stable.

Why does this matter? In later stages of RL training, the gap between "preferred" and "rejected" responses becomes vanishingly small. Supervised Fine-Tuning (SFT) alone can inadvertently amplify rewards for subtly incorrect outputs....
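
To make "integrating preference data into the objective" concrete, here is a minimal sketch of a per-example DPO loss. It assumes the commonly used formulation with a frozen reference model and a scaling factor `beta`. The function names, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from this article. Inputs are summed log-probabilities of whole responses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; tune per task
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The implicit reward margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Negative log-likelihood of the observed preference.
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss drops as the policy moves probability toward the preferred response relative to the reference, so no separately trained reward model is needed in this sketch.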