Chengshuo Dai

The transition from a raw, pre-trained Large Language Model (LLM) to a helpful, harmless, and honest AI assistant involves a critical phase known as alignment. While Supervised Fine-Tuning (SFT) teaches the model the basic format of human interaction (e.g., answering a question rather than merely continuing its text), it does not inherently teach the model to prioritize high-quality, safe, or nuanced responses. SFT simply trains the model to imitate the distribution of its training data. If the SFT dataset contains toxic or factually incorrect examples, the model will learn to reproduce them. To align the model with complex human values and preferences, researchers rely on preference-based optimization techniques, historically dominated by Reinforcement Learning from Human Feedback (RLHF).

The standard RLHF pipeline is notoriously complex and computationally expensive. It begins by collecting a dataset of prompts and generating multiple responses for each prompt using the SFT model. Human annotators then rank these responses based on their quality. This preference data is used to train a separate Reward Model (RM), which learns to assign a scalar score to any given text, essentially acting as a proxy for human judgment.
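To make the reward-model step concrete, here is a minimal sketch of a pairwise loss for one ranked pair. It assumes the Bradley-Terry formulation, which is a common choice for reward modeling but is not spelled out in the paragraph above; the function names `softplus` and `pairwise_rm_loss` and the plain-float inputs are illustrative, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one ranked pair.

    The inputs are the scalar scores the reward model assigns to the
    response the annotator preferred and to the one they ranked lower.
    (Assumption: ranking data has been broken into such pairs.)
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # The loss shrinks as the preferred response is scored further ahead.
    for margin in (-2.0, 0.0, 2.0):
        print(margin, pairwise_rm_loss(margin, 0.0))
```

Minimizing this loss pushes the score of the preferred response above the score of the other one, which is how the RM comes to act as a proxy for the annotators' rankings.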

Once the RM is trained, the core alignment phase begins using a reinforcement learning algorithm, most commonly Proximal Policy Optimization (PPO). In this phase, the LLM acts as the "policy." It generates a response to a prompt, the RM scores the response, and PPO updates the LLM's weights to increase the expected reward. Left unconstrained, the LLM can fall into "reward hacking," generating nonsensical text that exploits loopholes in the RM to get a high score. To counter this, the RLHF objective adds a Kullback-Leibler (KL) divergence penalty that discourages the aligned model from drifting too far from the original SFT model's probability distribution. PPO training is also often unstable and sensitive to hyperparameters, and it is memory-hungry: four separate models must be held in memory at once (the policy model, the reference model, the reward model, and the value model). Together, these factors make hyperparameter tuning a dark art.
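The effect of the KL penalty can be sketched as a small reward-shaping function. This is an illustration under stated assumptions rather than a PPO implementation: the name `kl_shaped_reward`, the coefficient `beta`, and the use of per-token log-probabilities of the sampled response under the policy and the reference (SFT) model are all choices made here for clarity.

```python
from typing import Sequence


def kl_shaped_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward-model score minus a KL penalty for one sampled response.

    Both log-prob sequences refer to the same sampled tokens, scored
    once by the policy and once by the frozen reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio over the sampled tokens: a single-sample
    # estimate of KL(policy || reference) for this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # Same RM score, but the second policy has drifted from the reference.
    print(kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
    print(kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

A response that the policy finds far more likely than the reference model does earns a lower shaped reward, so text that merely exploits the RM becomes less attractive to the optimizer.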

To bypass the immense engineering complexity of RLHF, researchers introduced Direct Preference Optimization (DPO). DPO eliminates both the separately trained Reward Model and the unstable PPO optimization phase, leaving only the policy and a frozen reference model. It does so by mathematically reformulating the RLHF objective. The key insight is that the KL-constrained RLHF objective has a closed-form optimal policy, and that this relationship can be inverted to express the reward in terms of the optimal policy and the reference policy. Substituting this expression into the Bradley-Terry model of human preferences yields a simple classification loss defined directly on the policy.
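The reformulation can be written out in a few lines, following the DPO paper listed in the references. The notation is introduced here for the sketch: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the reference (SFT) model, $r$ the reward, $\beta$ the KL coefficient, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its chosen and rejected responses drawn from a dataset $\mathcal{D}$.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form optimum, with partition function Z(x)
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Inverting for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Bradley-Terry preference model; the \beta \log Z(x) terms cancel in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

% Resulting DPO loss on the trainable policy \pi_\theta
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big) \Big]
```

Because the intractable $Z(x)$ depends only on the prompt, it drops out when two responses to the same prompt are compared, which is what makes the loss computable without a reward model.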

In practice, DPO only requires a dataset of paired preferences: a prompt, a "chosen" (preferred) response, and a "rejected" (less preferred) response. The algorithm fine-tunes the LLM by increasing the likelihood of the chosen response relative to the rejected one, with both measured against the frozen reference model. This implicitly optimizes the same KL-regularized reward objective as PPO-based RLHF, but in a single, stable supervised training stage. DPO has been widely adopted for aligning open-weights models due to its simplicity, stability, and lower computational overhead.
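A minimal per-example sketch of the DPO loss, written in plain Python so the arithmetic is visible. It assumes the four inputs are sequence-level log-probabilities (summed over response tokens) that have already been computed by the policy and by the frozen reference model; the function name `dpo_loss` and the default `beta` are illustrative choices, not a library API.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # hypothetical value; controls deviation from the reference
) -> float:
    """DPO loss for a single (prompt, chosen, rejected) triple."""
    # Implicit rewards, up to the factor beta: how much more likely each
    # response is under the policy than under the reference model.
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), i.e. a binary classification loss.
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: no preference expressed yet.
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has shifted probability toward the chosen response.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In a real training loop the same expression would be averaged over a batch and differentiated with respect to the policy's parameters; the reference log-probabilities are constants.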

However, both RLHF and DPO rely on paired preference data, which is expensive and time-consuming to collect. Annotators must carefully compare two responses to the exact same prompt. To address this data bottleneck, Kahneman-Tversky Optimization (KTO) was developed. KTO draws inspiration from prospect theory in behavioral economics, which posits that humans evaluate outcomes as gains or losses relative to a reference point, and that losses loom larger than gains.

Unlike DPO, KTO does not require paired preferences. It only requires a dataset of prompts and single responses, where each response is simply labeled as "good" (a gain) or "bad" (a loss). The KTO loss function updates the model's weights to make "good" responses more likely and "bad" ones less likely. Each response is measured against a reference point, and gains and losses carry separate weights, so the two kinds of feedback need not be treated symmetrically. This allows developers to use large amounts of existing, unpaired feedback data (e.g., thumbs up/down ratings on a chatbot interface) to align models, significantly reducing the cost and barrier to entry for high-quality LLM alignment.
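The asymmetric treatment of gains and losses can be sketched per example as follows, based on the loss described in the KTO paper listed in the references. Several simplifications are assumed: log-probabilities are sequence-level floats, the reference point `kl_ref_point` is passed in rather than estimated from a batch as the paper does, and the names `kto_loss`, `lambda_d`, and `lambda_u` are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(
    policy_lp: float,
    ref_lp: float,
    desirable: bool,
    kl_ref_point: float,
    beta: float = 0.1,      # hypothetical value
    lambda_d: float = 1.0,  # weight on desirable ("good") examples
    lambda_u: float = 1.0,  # weight on undesirable ("bad") examples
) -> float:
    """KTO-style loss for one (prompt, response, good/bad label) example."""
    # Implied reward: log-ratio of policy to reference for this response.
    log_ratio = policy_lp - ref_lp
    if desirable:
        # A gain: value grows as the response rises above the reference point.
        value = lambda_d * sigmoid(beta * (log_ratio - kl_ref_point))
        return lambda_d - value
    # A loss: value grows as the response falls below the reference point.
    value = lambda_u * sigmoid(beta * (kl_ref_point - log_ratio))
    return lambda_u - value


if __name__ == "__main__":
    # The same policy shift lowers the loss on a good example
    # and raises it on a bad one.
    print(kto_loss(0.0, -1.0, desirable=True, kl_ref_point=0.0))
    print(kto_loss(0.0, -1.0, desirable=False, kl_ref_point=0.0))
```

Setting `lambda_u` larger than `lambda_d` (or the reverse) is one way to weight bad examples more heavily than good ones, or to compensate for an imbalanced mix of thumbs-up and thumbs-down data.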

References:

Direct Preference Optimization: Your Language Model is Secretly a Reward Model - https://arxiv.org/abs/2305.18290
KTO: Model Alignment as Prospect Theoretic Optimization - https://arxiv.org/abs/2402.01306