36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2405.19316","publisher":"arxiv","paperJSON":{"title":"Robust Preference Optimization through Reward Model Distillation","paperID":"2405.19316","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, the empirical evidence suggests that DPO typically assigns implicit rewards that overfit, and trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"preferred ","element":"span"},{"text":"generations to go to zero. In this work, we analyze this phenomenon and use ","element":"span"},{"style":{"fontStyle":"italic"},"text":"distillation ","element":"span"},{"text":"to get a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"family of reward models ","element":"span"},{"text":"that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. 
Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Language model (LM) post-training (or alignment) aims to steer language model policies towards responses that agree with human preferences. Early state-of-the-art approaches have focused on reward learning from human feedback. In this paradigm, preference annotations are used to train reward models, which then guide the optimization of the language model policy through online reinforcement learning (an approach broadly referred to as RLHF). Recent research on offline “Direct Preference Optimization” (DPO; ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"2023","element":"a"},{"text":") and extensions thereof (","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"Tang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":29,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":29,"text":"2024","element":"a"},{"text":"), however, has demonstrated that it is possible to directly optimize policies on the preference data, which (a) bypasses the need for a separate reward model, and (b) uses standard supervised techniques rather than online reinforcement learning, which can be more difficult to optimize. 
These advantages have led to the adoption of offline alignment, and in particular offline DPO, as the post-training algorithm of choice in both smaller-scale academic settings and larger-scale projects such as Llama 3 (","element":"span"},{"href":"#id-4","referenceIndex":1,"text":"AI@Meta","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":1,"text":"2024","element":"a"},{"text":") and OLMo (","element":"span"},{"href":"#id-5","referenceIndex":19,"text":"Groeneveld et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":19,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"While this direct approach to preference optimization is attractive in its simplicity and efficiency, it also raises questions about the effectiveness and robustness of the resulting policies, as well as the broader utility of an explicit reward model beyond online reinforcement learning. In this paper, we argue that explicit reward modeling can, in fact, offer substantial practical and theoretical benefits. In particular, we theoretically show that relying solely on the preference data can be a precarious strategy, with few natural brakes in place to prevent policies trained under the DPO objective from careening off towards degenerate policies when the preference data exhibits certain idiosyncratic properties. 
On the other hand, explicit reward models can easily be regularized and understood—regardless of whether they are Bradley-Terry models (","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"Bradley and ","element":"a"},{"href":"#id-6","referenceIndex":6,"text":"Terry","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"1952","element":"a"},{"text":"), margin-based ranking models (","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"Zhao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":59,"text":"2023","element":"a"},{"text":"), or any other function that correlates well with human preferences (","element":"span"},{"href":"#id-8","referenceIndex":24,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":24,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"Tang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":45,"text":"Swamy et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":45,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"Taking a step back from pure direct preference optimization, we first explore a method that merges the best of both worlds for the offline setting: an efficient reward model distillation algorithm that (i) operates effectively in the offline setting, (ii) makes minimal assumptions about the true, optimal reward we aim to maximize, and (iii) demonstrates greater robustness to the specific distribution of prompt/response data used for policy alignment. 
Drawing inspiration from prior knowledge distillation techniques (","element":"span"},{"href":"#id-10","referenceIndex":21,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":21,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","referenceIndex":41,"text":"Romero et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":41,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":53,"text":"Yang ","element":"a"},{"href":"#id-12","referenceIndex":53,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":53,"text":"2019","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":14,"text":"Furlanello et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":14,"text":"2018","element":"a"},{"text":"), we use the same change of variables trick employed in DPO to express the language model policy in terms of its implicit reward model (","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"2023","element":"a"},{"text":"). 
We then train the policy’s implicit reward model to match our desired, explicit reward via an ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/1-0.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"loss that directly regresses the pairwise differences in target rewards for any two generation pairs (","element":"span"},{"style":{"height":17.39},"width":318.03,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/1-1.png","element":"img","alt":"x, y1) and (x, y2).1 ","inline":true,"padRight":true},{"text":"Here we theoretically establish the equivalence between optimizing this simple distillation loss over a sufficiently diverse offline dataset of unlabeled examples, and optimizing the traditional online RLHF objective with reinforcement learning.","element":"span"}],[{"text":"While this approach adds reward modeling and reward inference ","element":"span"},{"style":{"fontStyle":"italic"},"text":"back ","element":"span"},{"text":"into the pipeline, it still maintains much of the simplicity and efficiency of reward-model-free DPO. Specifically, in our setting, rewards for the training data can be computed offline, once, ahead of time. This computation is completely parallelizable, and reward inference is significantly faster than the autoregressive generation (also known as model rollout) that is done in online settings. Consequently, this allows the policy training framework to be nearly identical to standard DPO, modulo the structure of the data that is fed in. 
This is true regardless of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"size ","element":"span"},{"text":"of the policy that is trained (as the policy’s implicit reward model does not need to be the same size as the explicit reward model) or its hyper-parameters (as the policy’s hyper-parameters do not need to match those of the reward model).","element":"span"}],[{"text":"Reward model distillation, however, is still subject to some of the same challenges facing DPO-style training. In particular, distillation requires having a reliable reward model—but having a reliable reward still requires having a reliable method for extracting a reward model from a potentially noisy preference dataset. To address the uncertainty surrounding what the “right” reward model to optimize against is, we also introduce a pessimistic extension to our approach. This extension aims to maximize the worst-case improvement of our model across a plausible family of reward models (e.g., those sufficiently consistent with annotated preference data). This strategy aligns with that of existing work in conservative offline reinforcement learning (","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"Cheng ","element":"a"},{"href":"#id-14","referenceIndex":9,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"Kumar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"2020","element":"a"},{"text":"). 
We show that this pessimistic objective can be equivalently expressed and optimized by adding a simple additional KL-divergence regularization to the original distillation objective.","element":"span"}],[{"text":"Empirically, we find that reward model distillation, particularly pessimistic reward model distillation, leads to similar performance to prior direct preference optimization methods when the preference datasets used are unbiased. When the preference datasets are biased, however, it leads to significantly ","element":"span"},{"style":{"fontStyle":"italic"},"text":"better ","element":"span"},{"text":"performance when compared to DPO and the Identity Preference Optimization (IPO) framework of ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":"), which was introduced as a more robust alternative to DPO. To further support these empirical observations, we provide an extensive theoretical analysis that both (i) sheds more light on the degenerative tendencies of DPO and issues inherent to its objective, and (ii) highlights relative advantages of our explicitly regularized approaches.","element":"span"}]]},{"heading":"2 Related work","paragraphs":[[{"text":"Recent work in offline alignment has focused on DPO (","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"2023","element":"a"},{"text":") as a simpler alternative for aligning language models from preference data. 
Subsequent work, however, has identified issues with DPO, including weak regularization (","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":") and a tendency to decrease the probability of winning generations during training (","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"Pal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"2024","element":"a"},{"text":"). Similar to some of the findings in this work, ","element":"span"},{"href":"#id-17","referenceIndex":35,"text":"Rafailov et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-17","referenceIndex":35,"text":"2024a","element":"a"},{"text":") theoretically showed the existence of minima to the empirical DPO loss that assign non-zero probability to outputs not present in the training data, which can lead to unexpected model behaviour. Here we further show that not only can such minima place non-zero probability on outputs not present in the training data, but they can also assign ","element":"span"},{"style":{"fontStyle":"italic"},"text":"near","element":"span"},{"text":"-zero probability to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"preferred ","element":"span"},{"text":"responses that do appear in the training data. A number of methods have since explored various avenues for combating these issues. 
These include analyzing the impact of noise on DPO alignment (","element":"span"},{"href":"#id-18","referenceIndex":15,"text":"Gao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":15,"text":"2024a","element":"a"},{"text":"), proposing to update the reference policy during training (","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"Gorbatovski et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":18,"text":"2024","element":"a"},{"text":"), and suggesting a variant of IPO with a per-context margin (","element":"span"},{"href":"#id-20","referenceIndex":2,"text":"Amini et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":2,"text":"2024","element":"a"},{"text":"). Additional research has focused on token-level alignment methods (","element":"span"},{"href":"#id-21","referenceIndex":56,"text":"Zeng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":56,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-22","referenceIndex":36,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":36,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":31,"text":"Mudgal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":31,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":8,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":8,"text":"2024","element":"a"},{"text":") and on developing a unified view of various offline alignment methods (","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"Tang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":48,"text":"2024b","element":"a"},{"text":"). 
Similar to our contributions towards using pessimism with respect to the correct choice of reward model for robust reward model distillation, existing work on offline RLHF has also incorporated various forms of conservative reward penalties (","element":"span"},{"href":"#id-25","referenceIndex":60,"text":"Zhu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":60,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-26","referenceIndex":27,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":27,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-27","referenceIndex":58,"text":"Zhan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":58,"text":"2024","element":"a"},{"text":"). Most relevant to our work on a technical level, concurrent work by ","element":"span"},{"href":"#id-28","referenceIndex":28,"text":"Mao et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-28","referenceIndex":28,"text":"2024","element":"a"},{"text":") also uses an offline loss that leverages targets from a learned reward model for supervision (as opposed to binary preference labels) and has the same form as the simple “distillation loss” analyzed in §","element":"span"},{"href":"#id-29","text":"5.1 ","element":"a"},{"text":"of this paper, but does not explore pessimism. Similarly, the REBEL algorithm of ","element":"span"},{"href":"#id-30","referenceIndex":16,"text":"Gao et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-30","referenceIndex":16,"text":"2024b","element":"a"},{"text":") also studies the same reward regression loss, but in the online setting. 
This work builds upon these findings, and provides further analysis, as well as a solution based on pessimism together with reward distillation.","element":"span"}],[{"text":"As discussed in §","element":"span"},{"text":"1","element":"span"},{"text":", offline settings are attractive since they allow for simple, efficient, and scalable training frameworks. At the same time, while offline alignment methods are popular, recent evidence suggests that online alignment methods such as RLHF (","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"Stiennon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"2020","element":"a"},{"text":"), can still lead to more favorable outcomes (","element":"span"},{"href":"#id-33","referenceIndex":20,"text":"Guo et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":20,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-34","referenceIndex":46,"text":"Tajwar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-34","referenceIndex":46,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-35","referenceIndex":12,"text":"Dong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":12,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-36","referenceIndex":52,"text":"Xu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":52,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-37","referenceIndex":51,"text":"Xiong ","element":"a"},{"href":"#id-37","referenceIndex":51,"text":"et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-37","referenceIndex":51,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":7,"text":"Calandriello et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":7,"text":"2024","element":"a"},{"text":"), especially when high reward outputs have low probability under the base policy. One of the advantages of online settings over offline settings is that many strategies for mitigating over-optimization that are simple to apply in online settings, such as reward shaping, are not as straightforward to apply in offline settings. For example, reward ensembles have been widely investigated recently as a mechanism for tackling reward hacking in online RLHF (","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"Eisenstein et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-39","referenceIndex":13,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-40","referenceIndex":11,"text":"Coste et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":11,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-41","referenceIndex":57,"text":"Zhai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":57,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":38,"text":"Ramé et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":38,"text":"2024","element":"a"},{"text":"), and in the context of multi-objective optimization (","element":"span"},{"href":"#id-43","referenceIndex":30,"text":"Moskovitz et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":30,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":37,"text":"Rame et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-44","referenceIndex":37,"text":"2024","element":"a"},{"text":"). Other notable examples for combating reward over-fitting and optimization by training reward models with regularized training objectives include iterative data smoothing (","element":"span"},{"href":"#id-45","referenceIndex":61,"text":"Zhu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":61,"text":"2024","element":"a"},{"text":"), which uses a trained model to softly label data during RLHF, and reward calibration from demonstrations (","element":"span"},{"href":"#id-46","referenceIndex":40,"text":"Rita ","element":"a"},{"href":"#id-46","referenceIndex":40,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":40,"text":"2024","element":"a"},{"text":"). This work addresses some of the methodological and experimental gap that exists between online and offline methods for RLHF, by allowing for explicitly designed, trained, and regularized reward models (or pessimistic reward model ensembles) to be added back into the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"offline ","element":"span"},{"text":"alignment setting without losing the practical benefits of the offline setting. Also relevant to our work, ","element":"span"},{"href":"#id-43","referenceIndex":30,"text":"Moskovitz et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-43","referenceIndex":30,"text":"2023","element":"a"},{"text":") focus on “composite” rewards in the online setting, with the goal of achieving high task reward while ensuring that every individual component is above some threshold—also by applying a Lagrangian relaxation. 
In this work, we also consider multiple reward models, but we only focus on cases where there is no known, obvious reward decomposition.","element":"span"}],[{"text":"Finally, the question of using a small amount of offline data to learn high-quality policies, instead of online access to reward feedback, has also been widely studied in the offline reinforcement learning (RL) literature. The predominant approach here is to use pessimism, that is, to learn a policy with the highest reward under all plausible environment models consistent with the data, with an extensive theoretical (","element":"span"},{"href":"#id-47","referenceIndex":26,"text":"Liu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":26,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-48","referenceIndex":55,"text":"Zanette ","element":"a"},{"href":"#id-48","referenceIndex":55,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-48","referenceIndex":55,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-49","referenceIndex":50,"text":"Xie et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":50,"text":"2021","element":"a"},{"text":") and empirical (","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"Kumar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"Cheng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-50","referenceIndex":54,"text":"Yu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-50","referenceIndex":54,"text":"2021","element":"a"},{"text":") body of supporting work. 
The key insight in this literature is that without pessimism, the RL algorithm learns undesirable behaviors which are not explicitly ruled out in the training data, and pessimism provides a robust way of preventing such undesirable extrapolations, while still preserving generalization within the support of the data.","element":"span"}]]},{"heading":"3 Preliminaries","paragraphs":[[{"text":"We begin with a brief review of Direct Preference Optimization (DPO) (","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"2023","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The preference alignment problem","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"be an input prompt, ","element":"span"},{"style":{"height":16},"width":207.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-0.png","element":"img","alt":" y ∼ πθ(· | x","inline":true},{"text":") be the language model policy ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-1.png","element":"img","alt":" πθ","inline":true},{"text":"’s response to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", and ","element":"span"},{"style":{"height":16},"width":161.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-2.png","element":"img","alt":" πref(y | x","inline":true},{"text":") a reference policy (such as a pretrained or finetuned language model that is high-performing, but not yet aligned, and often used as the starting point for optimization). 
Given some reward function ","element":"span"},{"style":{"height":15.6},"width":113.12,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-3.png","element":"img","alt":" r∗(x, y","inline":true},{"text":"), the goal of alignment is to solve for the “aligned” policy ","element":"span"},{"style":{"height":16},"width":149.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-4.png","element":"img","alt":" πθ∗(y | x","inline":true},{"text":") that maximizes the following RLHF objective, i.e.,","element":"span"}],[{"id":"id-51","style":{"width":"84%"},"width":1576,"height":63,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":15.6},"width":62.2,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-6.png","element":"img","alt":" µ(x","inline":true},{"text":") is a fixed distribution over prompts, and the KL-divergence term keeps the aligned policy close to the anchoring reference policy, ","element":"span"},{"style":{"height":16},"width":155.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-7.png","element":"img","alt":" πref(y | x","inline":true},{"text":"). 
Here, the reward function ","element":"span"},{"style":{"height":11.38},"width":35.09,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-8.png","element":"img","alt":" r∗ ","inline":true,"padRight":true},{"text":"is typically not known in advance, but rather inferred from collected human preference data in the form of (","element":"span"},{"style":{"height":15.6},"width":311.45,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-9.png","element":"img","alt":"x, yw, yℓ), where x","inline":true,"padRight":true},{"text":"is the prompt, ","element":"span"},{"style":{"height":14.18},"width":147.58,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-10.png","element":"img","alt":" yw is the","inline":true,"padRight":true},{"text":"“winning”, or preferred, response, and ","element":"span"},{"style":{"height":10.8},"width":28.97,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-11.png","element":"img","alt":" yℓ ","inline":true,"padRight":true},{"text":"is the “losing”, or dispreferred, response. 
A common approach is to assume that (","element":"span"},{"style":{"height":10.8},"width":90.66,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-12.png","element":"img","alt":"y1, y2","inline":true},{"text":") follow a Bradley-Terry model (","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"Bradley and Terry","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"1952","element":"a"},{"text":"), under which the probability that ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-13.png","element":"img","alt":"y1","inline":true,"padRight":true},{"text":"is preferred to ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-14.png","element":"img","alt":" y2","inline":true,"padRight":true},{"text":"given the reward function ","element":"span"},{"style":{"height":16},"width":1058.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-15.png","element":"img","alt":" r∗ and prompt x is p∗(y1 ≻ y2 | x) = σ(r∗(x, y1) − r∗(x, y2)),","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":50.93,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-16.png","element":"img","alt":" σ(·","inline":true},{"text":") is the sigmoid function and ","element":"span"},{"style":{"height":9.6},"width":31,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-17.png","element":"img","alt":" ≻","inline":true,"padRight":true},{"text":"denotes preference. 
Under this model, we can use the preference data (","element":"span"},{"style":{"height":16.79},"width":296.85,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-18.png","element":"img","alt":"x, yw, yℓ) ∼ Dpref","inline":true,"padRight":true},{"text":"to estimate ","element":"span"},{"style":{"height":11.38},"width":35.09,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-19.png","element":"img","alt":" r∗","inline":true,"padRight":true},{"text":"via maximum likelihood estimation, i.e.,","element":"span"}],[{"id":"id-54","style":{"width":"76%"},"width":1426,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-20.png","element":"img"}],[{"text":"With ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r̂ ","element":"span"},{"text":"in hand, Eq. (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":") can be optimized using standard reinforcement learning algorithms (","element":"span"},{"href":"#id-52","referenceIndex":42,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":42,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"Stiennon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":10,"text":"2017","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Direct preference optimization","element":"span"}],[{"text":"DPO is a simple approach for offline policy optimization that uses preferences to directly align the language model policy, without training an intermediate reward model. 
Specifically, DPO leverages the fact that the optimal solution to the KL-constrained objective in (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":") takes the form (","element":"span"},{"href":"#id-53","referenceIndex":22,"text":"Korbak et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-53","referenceIndex":22,"text":"2022","element":"a"},{"text":")","element":"span"}],[{"style":{"width":"99%"},"width":1874,"height":157,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-21.png","element":"img"}],[{"text":"function ","element":"span"},{"style":{"height":11.38},"width":35.09,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-22.png","element":"img","alt":" r∗","inline":true,"padRight":true},{"text":"in terms of the optimal policy ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-23.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"text":"that it induces, i.e.,","element":"span"}],[{"id":"id-61","style":{"width":"69%"},"width":1309,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-24.png","element":"img"}],[{"text":"Under the Bradley-Terry model, the likelihood that ","element":"span"},{"style":{"height":12},"width":126.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-25.png","element":"img","alt":" y1 ≻ y2","inline":true,"padRight":true},{"text":"can then be written as","element":"span"}],[{"style":{"width":"73%"},"width":1380,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/3-26.png","element":"img"}],[{"text":"where now ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-0.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"text":"can be directly estimated 
on ","element":"span"},{"style":{"height":15.59},"width":86.96,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-1.png","element":"img","alt":" Dpref","inline":true,"padRight":true},{"text":"following the objective in (","element":"span"},{"href":"#id-54","text":"2","element":"a"},{"text":"), in place of the intermediate reward model ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":17.14},"width":186.76,"height":42.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-2.png","element":"img","alt":" πˆθ(y | x) ∈","inline":true,"padRight":true},{"text":"argmin","element":"span"},{"style":{"height":16.79},"width":285.78,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-3.png","element":"img","alt":"πθ Ldpo(πθ; Dpref","inline":true},{"text":") where","element":"span"}],[{"id":"id-56","style":{"width":"84%"},"width":1579,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-4.png","element":"img"}],[{"text":"As described in §","element":"span"},{"text":"1","element":"span"},{"text":", optimizing ","element":"span"},{"style":{"height":15.99},"width":79.56,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-5.png","element":"img","alt":" Ldpo","inline":true,"padRight":true},{"text":"offers two main advantages over using online RL for Eq. (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":"): (a) there is no need for a separate reward model, and (b) ","element":"span"},{"style":{"height":15.99},"width":79.56,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-6.png","element":"img","alt":" Ldpo","inline":true,"padRight":true},{"text":"is a supervised objective that can be trained offline, which allows for a simpler training setup than online learning. 
Still, ","element":"span"},{"style":{"height":15.99},"width":79.56,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-7.png","element":"img","alt":" Ldpo","inline":true,"padRight":true},{"text":"also has certain pitfalls, as we analyze next.","element":"span"}]]},{"heading":"4 Pitfalls of direct preference optimization","paragraphs":[[{"text":"As argued by ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":"), DPO strongly relies on the Bradley-Terry assumption, which leads to surprising and undesirable consequences when the policy is trained on finite preference data. The root issue is that if we have any two responses ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-8.png","element":"img","alt":" y1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-9.png","element":"img","alt":" y2","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":16},"width":1141.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-10.png","element":"img","alt":" p∗(y1 ≻ y2 | x) = 1, then the Bradley-Terry model dictates that","inline":true},{"style":{"height":16},"width":384.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-11.png","element":"img","alt":"r∗(y1) − r∗(y2) = +∞","inline":true},{"text":", and therefore ","element":"span"},{"style":{"height":16},"width":316.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-12.png","element":"img","alt":" πθ∗(y2 | x) = 0 for","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"any 
","element":"span"},{"text":"finite KL-regularization strength ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-13.png","element":"img","alt":" β","inline":true},{"text":".","element":"span"}],[{"text":"We can illustrate this phenomenon on a broader level with the following example:","element":"span"}],[{"id":"id-55","style":{"fontWeight":"bold"},"text":"Assumption 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose we are given a preference dataset of (context-free) pairs ","element":"span"},{"style":{"height":16.79},"width":445.29,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-14.png","element":"img","alt":" Dpref = {(ywi , yℓi)}ni=1, the","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"pairs ","element":"span"},{"text":"(","element":"span"},{"style":{"height":15.75},"width":115.77,"height":39.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-15.png","element":"img","alt":"ywi , yℓi)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are mutually disjoint in both the elements. 
Further suppose that we optimize the DPO objective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on ","element":"span"},{"style":{"height":15.59},"width":86.95,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-16.png","element":"img","alt":" Dpref","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with a single parameter ","element":"span"},{"style":{"height":15.59},"width":34.71,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-17.png","element":"img","alt":" θy","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-83","style":{"fontWeight":"bold"},"text":"Proposition 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-55","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"text":"(","element":"span"},{"style":{"height":10.8},"width":67.65,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-18.png","element":"img","alt":"y, y′","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":15.13},"width":124.31,"height":37.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-19.png","element":"img","alt":" y = ywi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":11.75},"width":122.06,"height":29.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-20.png","element":"img","alt":" y′ = yℓi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"style":{"fontStyle":"italic"},"text":", we have","element":"span"}],[{"style":{"height":16.3},"width":306.64,"height":40.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-21.png","element":"img","alt":"πθ∗(y)πref(y′) / (πθ∗(y′)πref(y)) → ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", for all global minimizers ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-22.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of the DPO objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-56","text":"6","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"style":{"height":14},"width":66.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-23.png","element":"img","alt":" β >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-55","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", further assume that ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":16},"width":206.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-24.png","element":"img","alt":" < πref(y) <","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-25.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a global minimizer of the DPO objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-56","text":"6","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"iff ","element":"span"},{"style":{"height":16},"width":249.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-26.png","element":"img","alt":" πθ∗(C(yℓ)c) →","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16.15},"width":117.18,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-27.png","element":"img","alt":" πθ∗(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"height":16},"width":205.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-28.png","element":"img","alt":" > 0 ∀i ∈ [n","inline":true},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16},"width":105.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-29.png","element":"img","alt":" C(yℓ)c","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"complement of the set of all responses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that appear as a dispreferred ","element":"span"},{"style":{"height":11.75},"width":30.54,"height":29.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-30.png","element":"img","alt":" 
yℓi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for any ","element":"span"},{"style":{"height":16},"width":97.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-31.png","element":"img","alt":" i ∈ [n","inline":true},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"Additional analysis of the training dynamics of DPO is provided in §","element":"span"},{"text":"7","element":"span"},{"text":". A significant implication of this result is that the set of global optima of the DPO loss includes policies that can shift nearly all probability mass to responses that never appear in the training set—and even assign near-zero probability to all of the training data responses that do in fact correspond to winning generations, ","element":"span"},{"style":{"height":14.19},"width":43.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-32.png","element":"img","alt":" yw","inline":true},{"text":", a phenomenon that has been observed empirically and analyzed theoretically (","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"Pal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":32,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":35,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":35,"text":"2024a","element":"a"},{"text":";","element":"span"},{"href":"#id-22","referenceIndex":36,"text":"b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-34","referenceIndex":46,"text":"Tajwar et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-34","referenceIndex":46,"text":"2024","element":"a"},{"text":").","element":"span"},{"href":"#id-57","style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-33.png","element":"img","alt":"2","inline":true}],[{"text":"Stated differently, Corollary ","element":"span"},{"href":"#id-58","text":"1 ","element":"a"},{"text":"implies that any ","element":"span"},{"style":{"height":11.39},"width":35.82,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-34.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"merely satisfying ","element":"span"},{"style":{"height":16.15},"width":423.21,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-35.png","element":"img","alt":" πθ∗(yℓi) = 0 with πθ∗(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"height":16},"width":196.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-36.png","element":"img","alt":" > 0 ∀i ∈ [n","inline":true},{"text":"] ","element":"span"},{"text":"is a global minimizer of the DPO objective in this setting. 
Though simplistic, the scenario in Assumption ","element":"span"},{"href":"#id-55","text":"1 ","element":"a"},{"text":"is closer to reality than might first be appreciated: in many practical situations we can expect the finite-sample preference data to contain one (or at most a few) preference annotations per example (","element":"span"},{"style":{"height":10.8},"width":131.15,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-37.png","element":"img","alt":"x, y1, y2","inline":true},{"text":"), while the policies ","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-38.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"can have billions of parameters (","element":"span"},{"style":{"height":10.4},"width":74.92,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-39.png","element":"img","alt":"≫ n","inline":true},{"text":"). It is important to note that the supposed regularization term ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-40.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"does not help: it can limit the speed at which the optimizer reaches the degenerate solution, but it cannot alter the final destination. This issue could be viewed as a classic instance of overfitting—but as opposed to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"overpredicting ","element":"span"},{"text":"responses within the training set, we might overfit to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"almost never ","element":"span"},{"text":"producing anything like the “good” responses that do appear within the training set. 
Furthermore, without additional regularization (beyond ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-41.png","element":"img","alt":" β","inline":true},{"text":"), we can expect this degeneration to occur in typical preference datasets.","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-42.png","element":"img"}],[{"text":"The DPO objective only requires that the likelihood of the preferred response is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"relatively ","element":"span"},{"text":"higher than that of the dispreferred response. A peculiarity of this objective is that when preference data is disjoint (i.e., preferred responses never appear as dispreferred responses, and vice versa), certain types of policies (e.g., over-parameterized) will learn to assign 0 probability to all dispreferred responses, with merely ","element":"span"},{"style":{"fontStyle":"italic"},"text":"non-zero ","element":"span"},{"text":"probability to all preferred responses. 
This includes policies that assign ","element":"span"},{"style":{"fontStyle":"italic"},"text":"near-zero ","element":"span"},{"text":"probability to preferred responses, and place all mass on (often degenerate) generations outside the training set.","element":"span"}],[{"id":"id-57","style":{"width":"99%"},"width":1871,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/4-43.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Case study: degenerate DPO optima in a BoW model","element":"span"}],[{"text":"For intuition on why the DPO global optima can include policies where ","element":"span"},{"style":{"height":16},"width":83.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-0.png","element":"img","alt":" π(yw","inline":true},{"text":") may be nearly 0 for all ","element":"span"},{"style":{"height":14.19},"width":93.06,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-1.png","element":"img","alt":" yw in","inline":true,"padRight":true},{"text":"the training set, consider the simplified case where the policy is a bag-of-words model, ","element":"span"},{"style":{"height":16},"width":134.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-2.png","element":"img","alt":" πθ(y) ∝","inline":true,"padRight":true},{"text":"exp (","element":"span"},{"style":{"height":16},"width":116.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-3.png","element":"img","alt":"c(y) · θ","inline":true},{"text":") for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":") representing a vector of counts in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and 
","element":"span"},{"style":{"height":13.19},"width":29.71,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-4.png","element":"img","alt":" θi","inline":true,"padRight":true},{"text":"representing the unnormalized log-probability of token ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". Then we can formally show that DPO optimization monotonically decreases an upper bound on the probability of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"preferred ","element":"span"},{"text":"completion, ˜","element":"span"},{"style":{"height":16},"width":612.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-5.png","element":"img","alt":"πθ(t−1)(yw) ≥ ˜πθ(t)(yw) ≥ πθ(t)(yw).","inline":true}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"Proposition 2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.18},"width":209.64,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-6.png","element":"img","alt":" yw, yℓ ∈ Vn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be preferred versus dispreferred outputs of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":", respectively, with ","element":"span"},{"style":{"height":16},"width":340.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-7.png","element":"img","alt":"πref(yw), πref(yℓ) >","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and corresponding count vectors ","element":"span"},{"style":{"height":16},"width":209.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-8.png","element":"img","alt":" c(yw), c(yℓ).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Let 
","element":"span"},{"text":"log ","element":"span"},{"style":{"height":16},"width":430.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-9.png","element":"img","alt":" πθ(y) = c(y) · θ − nZ(θ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":129.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-10.png","element":"img","alt":"Z(θ) =","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":20.4},"width":117.85,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-11.png","element":"img","alt":"∑Vi eθi","inline":true},{"style":{"fontStyle":"italic"},"text":", with upper bound ","element":"span"},{"text":"log ˜","element":"span"},{"style":{"height":16},"width":359.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-12.png","element":"img","alt":"πθ(y) = c(y) · θ − n","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":15.59},"width":55.14,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-13.png","element":"img","alt":"j θj","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":14.19},"width":56.31,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-14.png","element":"img","alt":" θ(t)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"represent the parameters ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-15.png","element":"img","alt":" π","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"after ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"style":{"fontStyle":"italic"},"text":"steps of gradient descent on ","element":"span"},{"style":{"height":16.79},"width":278.42,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-16.png","element":"img","alt":" Ldpo({yℓ, yw, x}","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", with ","element":"span"},{"style":{"height":14.18},"width":141.22,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-17.png","element":"img","alt":" θ(0) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, we have that ","element":"span"},{"style":{"height":16},"width":198.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-18.png","element":"img","alt":" πθ(t)(yw) ≤","inline":true,"padRight":true},{"text":"˜","element":"span"},{"style":{"height":16},"width":376.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-19.png","element":"img","alt":"πθ(t)(yw) ≤ ˜πθ(t−1)(yw","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"style":{"fontStyle":"italic"},"text":", with strict inequality when ","element":"span"},{"style":{"height":16},"width":332.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-20.png","element":"img","alt":" ||c(yw) − c(yℓ)||0 >","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"If ","element":"span"},{"style":{"height":16},"width":134.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-21.png","element":"img","alt":" πθ(t)(yw","inline":true},{"text":") decreases in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", what other strings become more probable? 
In the following proposition, we show that under the bag-of-words model, DPO optimization moves probability mass away from ","element":"span"},{"style":{"height":14.19},"width":43.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-22.png","element":"img","alt":" yw","inline":true,"padRight":true},{"text":"to sequences that contain only the tokens that maximize the difference between ","element":"span"},{"style":{"height":14.19},"width":43.97,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-23.png","element":"img","alt":" yw","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":28.97,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-24.png","element":"img","alt":" yℓ","inline":true},{"text":".","element":"span"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"Proposition 3. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.18},"width":43.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-25.png","element":"img","alt":" yw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.8},"width":28.97,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-26.png","element":"img","alt":" yℓ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be preferred versus dispreferred outputs of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Let ","element":"span"},{"style":{"height":16},"width":293.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-27.png","element":"img","alt":" ∆ = c(yw) − c(yℓ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the difference in unigram counts. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= [","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, i, . . . , i","element":"span"},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":", for ","element":"span"},{"style":{"height":11.2},"width":61.05,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-28.png","element":"img","alt":" i ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"arg ","element":"span"},{"text":"max ∆","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":37.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-29.png","element":"img","alt":" ∥c","inline":true},{"text":"(ˆ","element":"span"},{"style":{"height":16},"width":170.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-30.png","element":"img","alt":"y)∥1 = n","inline":true},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Then ","element":"span"},{"style":{"height":16},"width":470.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-31.png","element":"img","alt":"log πθ(t)(yw) − log πθ(t)(ˆy) = τ(t)k","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.2},"width":64.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-32.png","element":"img","alt":" k ≤","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and some non-decreasing ","element":"span"},{"style":{"height":14.39},"width":223.93,"height":35.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-33.png","element":"img","alt":" τ : Z+ → R+","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"We have ","element":"span"},{"style":{"height":16},"width":397.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-34.png","element":"img","alt":" k = 0 when c(yw) = c","inline":true},{"text":"(ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":"), and ","element":"span"},{"style":{"height":12},"width":76,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-35.png","element":"img","alt":" k ≪","inline":true,"padRight":true},{"text":"0 when ","element":"span"},{"style":{"height":16},"width":259.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-36.png","element":"img","alt":" ∥c(yw)∥2 ≪ ∥c","inline":true},{"text":"(ˆ","element":"span"},{"style":{"height":16},"width":158.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-37.png","element":"img","alt":"y)∥2 = n","inline":true,"padRight":true},{"text":"(when 
","element":"span"},{"style":{"height":16},"width":77.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-38.png","element":"img","alt":" c(yw","inline":true},{"text":") is dense) and ","element":"span"},{"style":{"height":16},"width":248.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-39.png","element":"img","alt":"∥∆∥2 ≈ ∥∆∥∞","inline":true,"padRight":true},{"text":"(when ∆ is sparse). This implies that when ","element":"span"},{"style":{"height":14.18},"width":166.33,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-40.png","element":"img","alt":" yw and yℓ ","inline":true,"padRight":true},{"text":"are similar, ","element":"span"},{"style":{"height":16},"width":100.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-41.png","element":"img","alt":" πθ(yw","inline":true},{"text":") will degrade more rapidly. Early stopping will therefore have to trade off between reaching the degenerate solution on such cases, and underfitting other cases in which ","element":"span"},{"style":{"height":14.19},"width":166.41,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-42.png","element":"img","alt":" yw and yℓ ","inline":true,"padRight":true},{"text":"are more distinct. Related findings are also reported in ","element":"span"},{"href":"#id-59","referenceIndex":39,"text":"Razin et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-59","referenceIndex":39,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-43.png","element":"img"}]]},{"heading":"5 Uncertainty-aware reward model distillation","paragraphs":[[{"text":"As discussed in the previous section, a core issue in preference optimization is that the true preference distribution ","element":"span"},{"style":{"height":16},"width":238.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-44.png","element":"img","alt":" p∗(y1 ≻ y2 | x","inline":true},{"text":") is not known. Attempting to infer it from finite-sample preference data (which may further be biased or out-of-distribution with respect to the target domain) can then result in a failure to learn reasonable policies. In this section, we propose a regularized, pessimistic approach to direct preference optimization that brings explicit reward modeling back into the picture through a model distillation objective, while still maintaining the simplicity and efficiency of offline alignment methods.","element":"span"}],[{"id":"id-29","style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Reward model distillation","element":"span"}],[{"text":"Suppose for the moment that the reward function ","element":"span"},{"style":{"height":11.38},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-45.png","element":"img","alt":" r∗","inline":true,"padRight":true},{"text":"was in fact known, and did not have to be inferred from sampled preference data. Under this setting, we can then construct a straightforward and efficient offline optimization procedure that is similar in spirit to DPO, but no longer relies directly on a preference dataset. 
Concretely, given unlabeled samples (","element":"span"},{"style":{"height":15.6},"width":222.4,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-46.png","element":"img","alt":"x, y1, y2) ∼ ρ","inline":true,"padRight":true},{"text":"(where the number of samples can be potentially unlimited), we can define a simple squared “distillation” loss that matches the pairwise differences of the explicit reward ","element":"span"},{"style":{"height":11.38},"width":35.09,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-47.png","element":"img","alt":" r∗","inline":true,"padRight":true},{"text":"with those of the implicit policy reward defined by ","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-48.png","element":"img","alt":" πθ","inline":true},{"text":", i.e.,","element":"span"}],[{"id":"id-60","style":{"width":"89%"},"width":1679,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-49.png","element":"img"}],[{"text":"Due to symmetry, here it does not matter if (","element":"span"},{"style":{"height":16},"width":283.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/5-50.png","element":"img","alt":"y1, y2) = (yw, yℓ","inline":true},{"text":"), i.e., preferred vs. dispreferred, or vice versa. Notably, a similar squared-loss objective has also been recently proposed and shown to be effective in ","element":"span"},{"text":"independent work by ","element":"span"},{"href":"#id-28","referenceIndex":28,"text":"Mao et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-28","referenceIndex":28,"text":"2024","element":"a"},{"text":") and ","element":"span"},{"href":"#id-30","referenceIndex":16,"text":"Gao et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-30","referenceIndex":16,"text":"2024b","element":"a"},{"text":"). 
We now first provide further motivation for this approach, before extending it to pessimistic variants in the next section.","element":"span"}],[{"text":"Intuitively, the distillation loss in (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":") seeks to exactly match ","element":"span"},{"style":{"fontStyle":"italic"},"text":"differences ","element":"span"},{"text":"in reward model scores across all generation pairs (","element":"span"},{"style":{"height":10.8},"width":131.15,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-0.png","element":"img","alt":"x, y1, y2","inline":true},{"text":"). It is easy to see that under the Bradley-Terry model, this is equivalent to matching the strength of the preference relationship, ","element":"span"},{"style":{"height":12},"width":126.13,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-1.png","element":"img","alt":" y1 ≻ y2","inline":true},{"text":". Furthermore, by only matching differences, we can still conveniently ignore the log partition term, log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"), in the implicit reward formulation for ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-2.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"as shown in (","element":"span"},{"href":"#id-61","text":"4","element":"a"},{"text":"), as it is constant across different ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"for any given ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
Finally, similar to the motivation in DPO, we can show that minimizing ","element":"span"},{"style":{"height":16},"width":258.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-3.png","element":"img","alt":" Ldistill(r∗, πθ; ρ","inline":true},{"text":") indeed results in an optimally aligned policy ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-4.png","element":"img","alt":" πθ∗","inline":true},{"text":", as long as the data distribution ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-5.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"has sufficient support over the space of prompts and responses.","element":"span"}],[{"id":"id-70","style":{"fontWeight":"bold"},"text":"Theorem 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"denote the set of all possible responses for any model ","element":"span"},{"style":{"height":9.19},"width":51.74,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-6.png","element":"img","alt":" πθ.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"text":"supp(","element":"span"},{"style":{"height":16},"width":281.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-7.png","element":"img","alt":"πref(y | x)) = Y,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"i.e., the reference policy may generate any outcome with non-zero probability. 
Further, let ","element":"span"},{"text":"supp(","element":"span"},{"style":{"height":15.6},"width":240.26,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-8.png","element":"img","alt":"ρ(x, y1, y2)) =","inline":true,"padRight":true},{"text":"supp(","element":"span"},{"style":{"height":16},"width":256.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-9.png","element":"img","alt":"µ(x)) × Y × Y","inline":true},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":16},"width":215.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-10.png","element":"img","alt":" πθ∗(y | x) ∈","inline":true,"padRight":true},{"text":"argmin","element":"span"},{"style":{"height":16.08},"width":302.29,"height":40.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-11.png","element":"img","alt":"πθ Ldistill(r∗, πθ; ρ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a minimizer over all possible policies, of the implicit reward distillation loss in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for which ","element":"span"},{"style":{"height":16},"width":113.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-12.png","element":"img","alt":" r∗(x, y","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is assumed to be deterministic, and finite everywhere. 
Then for any ","element":"span"},{"style":{"height":14},"width":66.74,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-13.png","element":"img","alt":" β >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-14.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"also maximizes the alignment objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The theorem holds for a broad class of data distributions ","element":"span"},{"style":{"height":16},"width":167.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-15.png","element":"img","alt":" ρ(x, y1, y2","inline":true},{"text":"), and makes no assumptions on ","element":"span"},{"style":{"height":11.38},"width":124.64,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-16.png","element":"img","alt":" r∗. For","inline":true,"padRight":true},{"text":"example, it is no longer necessary for it to be defined using a Bradley-Terry model. Notably, it applies to reward models that are much larger and potentially better than the policy model, as the reward model is not used at test time (in contrast to DPO, which ties the size of the policy model used for generation to the size of the implicit reward model that is trained on preferences). In fact, this result can also be seen as a strict generalization of the IPO framework of ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":"), which corresponds to the special case ","element":"span"},{"style":{"height":16},"width":365.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-17.png","element":"img","alt":"r∗(x, y) ≜ 1{y = yw}","inline":true},{"text":", if labeled pairs (","element":"span"},{"style":{"height":10.8},"width":133.07,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-18.png","element":"img","alt":"x, yw, yl","inline":true},{"text":") are provided instead of the unlabeled pairs (","element":"span"},{"style":{"height":10.8},"width":131.16,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-19.png","element":"img","alt":"x, y1, y2","inline":true},{"text":").","element":"span"}],[{"text":"Of course, the true reward ","element":"span"},{"style":{"height":11.39},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-20.png","element":"img","alt":" r∗ ","inline":true,"padRight":true},{"text":"is usually not known in practice. Still, as in standard RLHF, we can construct good proxies by using the preference data to identify plausible target reward models ","element":"span"},{"style":{"height":11.59},"width":58.3,"height":28.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-21.png","element":"img","alt":" rtgt","inline":true},{"text":", further guided by any amount of regularization and inductive bias that we desire. 
Moreover, prior work has found that such explicitly trained reward models are more accurate and generalize better than the implicit reward model defined in (","element":"span"},{"href":"#id-61","text":"4","element":"a"},{"text":") that is learned by DPO (","element":"span"},{"href":"#id-62","referenceIndex":47,"text":"Tang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":47,"text":"2024a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-63","referenceIndex":25,"text":"Lin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-63","referenceIndex":25,"text":"2024","element":"a"},{"text":"). A natural choice is thus to first learn ","element":"span"},{"style":{"height":11.59},"width":58.3,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-22.png","element":"img","alt":" rtgt","inline":true,"padRight":true},{"text":"on the preference data ","element":"span"},{"style":{"height":15.59},"width":86.96,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-23.png","element":"img","alt":" Dpref","inline":true,"padRight":true},{"text":"using standard methods, and then reuse ","element":"span"},{"style":{"height":15.59},"width":86.96,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-24.png","element":"img","alt":" Dpref","inline":true,"padRight":true},{"text":"to distill ","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-25.png","element":"img","alt":" πθ","inline":true},{"text":", which is similar to classical settings in teacher-based model distillation (","element":"span"},{"href":"#id-10","referenceIndex":21,"text":"Hinton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":21,"text":"2015","element":"a"},{"text":"; 
","element":"span"},{"href":"#id-11","referenceIndex":41,"text":"Romero et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":41,"text":"2015","element":"a"},{"text":"). Furthermore, as ","element":"span"},{"style":{"height":11.59},"width":58.29,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-26.png","element":"img","alt":" rtgt","inline":true,"padRight":true},{"text":"is a real-valued model, at a bare minimum it is guaranteed to induce a regularized Bradley-Terry preference distribution ","element":"span"},{"style":{"height":16.79},"width":779.57,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-27.png","element":"img","alt":" ptgt(y1 ≻ y2 | x) > 0, ∀x, y1, y2 ∈ X × Y × Y","inline":true},{"text":", and thereby avoid the degeneracies identified in §","element":"span"},{"text":"4 ","element":"span"},{"text":"for the maximum likelihood estimate under DPO.","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":275,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-28.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Pessimistic reward model distillation","element":"span"}],[{"text":"Choosing a single reward model ","element":"span"},{"style":{"height":11.59},"width":58.29,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-29.png","element":"img","alt":" rtgt","inline":true,"padRight":true},{"text":"for anchoring the LM policy can naturally still lead to degenerate behavior if ","element":"span"},{"style":{"height":11.59},"width":58.29,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-30.png","element":"img","alt":" rtgt","inline":true,"padRight":true},{"text":"is a poor approximation of the true 
","element":"span"},{"style":{"height":11.38},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-31.png","element":"img","alt":" r∗","inline":true,"padRight":true},{"text":"that accurately reflects human preferences. However, we can easily extend our framework to handle uncertainty in the right target reward function by defining a confidence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"set ","element":"span"},{"text":"of ","element":"span"},{"style":{"height":13.2},"width":66.62,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-32.png","element":"img","alt":" k ≥","inline":true,"padRight":true},{"text":"1 plausible target reward models, ","element":"span"},{"style":{"height":20.05},"width":360.18,"height":50.13,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-33.png","element":"img","alt":" S =�r1tgt, . . . , rktgt�,","inline":true,"padRight":true},{"text":"and training ","element":"span"},{"style":{"height":16},"width":155.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-34.png","element":"img","alt":" πθ∗(y | x","inline":true},{"text":") to maximize the following “pessimistic” form of the objective in (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":"):","element":"span"}],[{"id":"id-65","style":{"width":"89%"},"width":1684,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-35.png","element":"img"}],[{"text":"In this pessimistic objective we are no longer optimizing ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-36.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"for a single reward, but optimizing 
","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/6-37.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"to produce generations that are scored favorably on average, even by the worst-case reward model in the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":", relative","element":"span"}],[{"id":"id-67","style":{"width":"99%"},"width":1872,"height":464,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-0.png","element":"img"}],[{"text":"Figure 1: A toy illustration of Theorem ","element":"figcaption","subtype":"caption"},{"href":"#id-64","text":"2","element":"a","subtype":"caption"},{"text":", which states that the optimal ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":115.11,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-1.png","element":"img","alt":" πθ∗ for","inline":true,"padRight":true},{"text":"(","element":"figcaption","subtype":"caption"},{"href":"#id-65","text":"8","element":"a","subtype":"caption"},{"text":") is the policy in ","element":"figcaption","subtype":"caption"},{"style":{"height":15.99},"width":194.18,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-2.png","element":"img","alt":" Pβ(S) with","inline":true,"padRight":true},{"text":"the lowest forward-KL from ","element":"figcaption","subtype":"caption"},{"style":{"height":15.99},"width":342.58,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-3.png","element":"img","alt":" πSFT. The set Pβ(S","inline":true},{"text":") contains a (potentially infinite) set of policies ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":246.62,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-4.png","element":"img","alt":" π1, π2, . . . 
cor-","inline":true,"padRight":true},{"text":"responding to target reward models. Here, ","element":"figcaption","subtype":"caption"},{"style":{"height":9.19},"width":83.88,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-5.png","element":"img","alt":" πSFT","inline":true,"padRight":true},{"text":"assigns equal mass to ","element":"figcaption","subtype":"caption"},{"style":{"height":14.19},"width":281.4,"height":35.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-6.png","element":"img","alt":" yw and yℓ, πMLE","inline":true,"padRight":true},{"text":"is the MLE solution for the DPO objective, which puts all probability mass on ","element":"figcaption","subtype":"caption"},{"style":{"height":14.18},"width":177.98,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-7.png","element":"img","alt":" yw, and π3","inline":true,"padRight":true},{"text":"is the policy in ","element":"figcaption","subtype":"caption"},{"style":{"height":15.99},"width":90.4,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-8.png","element":"img","alt":" Pβ(S","inline":true},{"text":") with lowest forward-KL.","element":"figcaption","subtype":"caption"}],[{"text":"to the generations of the baseline policy ","element":"span"},{"href":"#id-66","style":{"height":15.78},"width":90.5,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-9.png","element":"img","alt":" πref.3","inline":true,"padRight":true},{"text":"When the set ","element":"span"},{"style":{"height":16},"width":163.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-10.png","element":"img","alt":" S = {r∗}","inline":true,"padRight":true},{"text":"consists of only the ground-truth reward, the objective (","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":") is equivalent to standard RLHF 
(","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":"), up to a constant offset independent of ","element":"span"},{"style":{"height":10.8},"width":136.09,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-11.png","element":"img","alt":" θ. More","inline":true,"padRight":true},{"text":"generally, whenever ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"includes a good proxy ","element":"span"},{"style":{"height":6.8},"width":18,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-12.png","element":"img","alt":" �r","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":11.38},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-13.png","element":"img","alt":" r∗","inline":true},{"text":", the pessimistic advantage evaluation ensures that the policy ","element":"span"},{"style":{"height":15.9},"width":40.15,"height":39.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-14.png","element":"img","alt":" π∗θ ","inline":true,"padRight":true},{"text":"that maximizes eq. (","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":") still has a large advantage over ","element":"span"},{"style":{"height":13.59},"width":334.3,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-15.png","element":"img","alt":" πref under all r ∈ S","inline":true},{"text":", including ","element":"span"},{"style":{"height":6.8},"width":18,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-16.png","element":"img","alt":" �r","inline":true},{"text":". 
This use of ","element":"span"},{"text":"pessimism to handle uncertainty in the knowledge of the true reward is related to similar techniques in the offline RL literature (","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"Kumar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":23,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"Cheng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":9,"text":"2022","element":"a"},{"text":").","element":"span"}],[{"text":"For the objective to be meaningful, however, the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"has to be chosen carefully. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is small, it might not include any good proxy for ","element":"span"},{"style":{"height":11.38},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-17.png","element":"img","alt":" r∗","inline":true},{"text":". 
Conversely, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"is too rich, it forces ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-18.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"text":"to be nearly identical to ","element":"span"},{"style":{"height":9.19},"width":61.32,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-19.png","element":"img","alt":" πref","inline":true},{"text":", since any deviation from ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-20.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"might be penalized by some reward model in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":". Consequently, we want to design ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"to be the smallest possible set that contains a reasonable approximation to ","element":"span"},{"style":{"height":11.38},"width":35.08,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-21.png","element":"img","alt":" r∗","inline":true},{"text":".","element":"span"}],[{"text":"Finally, to solve (","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":"), we can reformulate it as an equivalent constrained offline optimization problem, which conveniently admits a similar loss form as (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":"), as shown below:","element":"span"}],[{"id":"id-64","style":{"fontWeight":"bold"},"text":"Theorem 2 ","element":"span"},{"text":"(Pessimistic distillation)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Define the constrained minimizer","element":"span"}],[{"id":"id-68","style":{"width":"74%"},"width":1399,"height":75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-22.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16.39},"width":107.76,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-23.png","element":"img","alt":" Pβ(S)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the set of all possible policies with implicit reward models that are consistent with any target reward model ","element":"span"},{"style":{"height":19.33},"width":138.69,"height":48.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-24.png","element":"img","alt":" ritgt ∈ S","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., ","element":"span"},{"style":{"height":21.12},"width":308.4,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-25.png","element":"img","alt":" Pβ(S) ≜ {πθi}|S|i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16},"width":268.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-26.png","element":"img","alt":" πθi ∝ πref(y | x","inline":true},{"text":") exp ","element":"span"},{"style":{"height":21.37},"width":161.07,"height":53.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-27.png","element":"img","alt":" 1β ritgt(x, y","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then for any ","element":"span"},{"style":{"height":14},"width":68.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-28.png","element":"img","alt":" β >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-29.png","element":"img","alt":"πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"also maximizes the pessimistic alignment objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"To unpack this result, Theorem ","element":"span"},{"href":"#id-64","text":"2 ","element":"a"},{"text":"stipulates that the ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-30.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"that maximizes the pessimistic objective in (","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":") is the policy in ","element":"span"},{"style":{"height":16.39},"width":90.54,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-31.png","element":"img","alt":" Pβ(S","inline":true},{"text":") that is closest in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"forward ","element":"span"},{"text":"KL-divergence to ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-32.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"(see Figure 
","element":"span"},{"href":"#id-67","text":"1","element":"a"},{"text":").","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-33.png","element":"img","alt":"4 ","inline":true,"padRight":true},{"text":"In addition, this policy also maximizes the expected reward of one of the ","element":"span"},{"style":{"height":19.33},"width":135.49,"height":48.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-34.png","element":"img","alt":" ritgt ∈ S","inline":true,"padRight":true},{"text":"(minus the additional weighted reverse KL-divergence ","element":"span"},{"text":"penalty term). Intuitively, the forward KL-divergence term serves the role of biasing the model towards optimizing for reward models that are similar to the implicit reward that ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-35.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"already maximizes. 
Otherwise, there might exist a target reward model ","element":"span"},{"style":{"height":19.32},"width":135.49,"height":48.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-36.png","element":"img","alt":" ritgt ∈ S","inline":true,"padRight":true},{"text":"for which the advantage of ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-37.png","element":"img","alt":" πθ ","inline":true,"padRight":true},{"text":"relative to ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-38.png","element":"img","alt":" πref ","inline":true,"padRight":true},{"text":"will be low, or ","element":"span"},{"text":"even negative (a solution that we would like to avoid).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Optimization","element":"span"}],[{"text":"The constraint in (","element":"span"},{"href":"#id-68","text":"9","element":"a"},{"text":") can be relaxed and approximately optimized by introducing an objective with a Lagrangian-style penalty of strength ","element":"span"},{"style":{"height":9.6},"width":67.74,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-39.png","element":"img","alt":" α >","inline":true,"padRight":true},{"text":"0 on a distillation loss of the form in (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":"), i.e.,","element":"span"}],[{"id":"id-66","style":{"width":"99%"},"width":1873,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/7-40.png","element":"img"}],[{"text":"where for convenience we divide by ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-0.png","element":"img","alt":" 
α","inline":true,"padRight":true},{"text":"and instead optimize","element":"span"},{"style":{"height":8},"width":16,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-1.png","element":"img","alt":"5","inline":true}],[{"id":"id-69","style":{"width":"85%"},"width":1603,"height":78,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-2.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":147.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-3.png","element":"img","alt":" γ = β/α","inline":true},{"text":". In reality, minimizing (","element":"span"},{"href":"#id-69","text":"11","element":"a"},{"text":") for ","element":"span"},{"style":{"height":12.4},"width":64.91,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-4.png","element":"img","alt":" γ >","inline":true,"padRight":true},{"text":"0 is equivalent to solving the constrained optimization problem in (","element":"span"},{"href":"#id-68","text":"9","element":"a"},{"text":") with an implicitly larger set of possible reward models ","element":"span"},{"style":{"height":15.99},"width":123.73,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-5.png","element":"img","alt":" Sγ ⊇ S","inline":true,"padRight":true},{"text":"indexed by ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-6.png","element":"img","alt":" γ","inline":true},{"text":". 
More specifically, ","element":"span"},{"style":{"height":15.99},"width":42.13,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-7.png","element":"img","alt":"Sγ","inline":true,"padRight":true},{"text":"also contains all reward models ˜","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"that are approximately consistent with the anchoring reward models ","element":"span"},{"style":{"height":19.33},"width":58.29,"height":48.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-8.png","element":"img","alt":"ritgt","inline":true,"padRight":true},{"text":"contained in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":", as the following result states.","element":"span"}],[{"id":"id-71","style":{"fontWeight":"bold"},"text":"Proposition 4 ","element":"span"},{"text":"(Soft pessimistic distillation)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume the same conditions as Theorem ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":". 
Then for any ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":12.4},"width":158.04,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-9.png","element":"img","alt":" < γ < ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists a ","element":"span"},{"style":{"height":13.2},"width":65.32,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-10.png","element":"img","alt":" λ ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":203.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-11.png","element":"img","alt":" πθ∗(y | x) ∈","inline":true,"padRight":true},{"text":"argmin","element":"span"},{"style":{"height":16.79},"width":310.1,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-12.png","element":"img","alt":"πθ Lpdistill(S, πθ; ρ","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-13.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a minimizer over all possible policies of the objective ","element":"span"},{"text":"(","element":"span"},{"href":"#id-68","text":"9","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for the effective reward model set","element":"span"}],[{"id":"id-92","style":{"width":"87%"},"width":1631,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-14.png","element":"img"}],[{"text":"As a result, optimizing (","element":"span"},{"href":"#id-69","text":"11","element":"a"},{"text":") even when using the singleton 
","element":"span"},{"style":{"height":16.79},"width":180.32,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-15.png","element":"img","alt":" S = {rtgt}","inline":true,"padRight":true},{"text":"yields an implicitly pessimistic objective, in which the pessimism is over all reward models ˜","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"that are consistent up to ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-16.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":11.59},"width":58.3,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-17.png","element":"img","alt":" rtgt","inline":true},{"text":".","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":274,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-18.png","element":"img"}],[{"id":"id-74","style":{"fontWeight":"bold"},"text":"5.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Pessimistic DPO","element":"span"}],[{"text":"Proposition ","element":"span"},{"href":"#id-71","text":"4 ","element":"a"},{"text":"can also be leveraged to obtain an alternative, implicitly pessimistic, objective that uses DPO directly instead of distillation. 
Consider the following regularized DPO loss:","element":"span"}],[{"style":{"width":"82%"},"width":1541,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-19.png","element":"img"}],[{"text":"Following a similar analysis as in Proposition ","element":"span"},{"href":"#id-71","text":"4","element":"a"},{"text":", we can derive that this implicitly corresponds to maximizing the pessimistic objective in (","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":") for the reward model set","element":"span"}],[{"style":{"width":"75%"},"width":1418,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-20.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.08},"width":223.1,"height":40.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-21.png","element":"img","alt":" rπθ(x, y) ≜ β","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":410.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-22.png","element":"img","alt":" πθ(y | x)/πref(y | x) + β","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") is the implicit reward model defined by ","element":"span"},{"style":{"height":15.99},"width":204.17,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-23.png","element":"img","alt":" πθ. Sγ then","inline":true,"padRight":true},{"text":"corresponds to the set of reward models ","element":"span"},{"style":{"height":10.88},"width":50.61,"height":27.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-24.png","element":"img","alt":" rπθ","inline":true,"padRight":true},{"text":"that are all approximate minimizers of the DPO loss. 
This includes not only the MLE, but also all other estimators that obtain nearly the same loss. In principle, this can be expected to help ameliorate some of the issues of §","element":"span"},{"text":"4","element":"span"},{"text":": since driving the reward to ","element":"span"},{"style":{"height":11.2},"width":71,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-25.png","element":"img","alt":" ±∞","inline":true,"padRight":true},{"text":"only marginally decreases the ","element":"span"},{"style":{"height":15.99},"width":79.56,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-26.png","element":"img","alt":" Ldpo","inline":true,"padRight":true},{"text":"loss past a certain point, the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"will also include finite reward functions ","element":"span"},{"style":{"height":16.08},"width":261.88,"height":40.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-27.png","element":"img","alt":" |rπθ(x, y)| < ∞","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"height":12.4},"width":69.66,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-28.png","element":"img","alt":" γ >","inline":true,"padRight":true},{"text":"0. 
These rewards would then be preferred if they induce a policy with a smaller (forward) KL-divergence to ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-29.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"than the degenerate, infinite rewards.","element":"span"}]]},{"heading":"6 Experimental results","paragraphs":[[{"text":"The main motivation for reward distillation and pessimism is to increase alignment robustness in challenging settings where it is difficult to learn good policies directly from the preference data. To demonstrate the effectiveness of our approach, we run experiments on the popular TL;DR summarization task (","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"Stiennon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"2020","element":"a"},{"text":"; ","element":"span"},{"href":"#id-72","referenceIndex":49,"text":"Völske et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":49,"text":"2017","element":"a"},{"text":"), in which we simulate a scenario where the preference data has a spurious correlation between the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"length ","element":"span"},{"text":"of a summary and whether or not it is preferred.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/8-30.png","element":"img","alt":"6","inline":true,"padRight":true},{"text":"Additionally, we show results for an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"unbiased ","element":"span"},{"text":"setting on TL;DR, as well as for an unbiased setting on Anthropic Helpfulness (","element":"span"},{"href":"#id-73","referenceIndex":5,"text":"Bai et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-73","referenceIndex":5,"text":"2022","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental setup","element":"span"}],[{"text":"We first train an “oracle” reward model on the TL;DR preference data training set (","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"Stiennon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":44,"text":"2020","element":"a"},{"text":") and relabel all preference pairs with this oracle. This enables us to use the oracle reward model for evaluation, without worrying about the gap to true human preferences. After relabeling, longer responses (where longer is defined as ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-0.png","element":"img","alt":" y1","inline":true,"padRight":true},{"text":"having at least 10% more tokens than ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-1.png","element":"img","alt":" y2","inline":true},{"text":") are preferred in 61% of the examples.","element":"span"}],[{"text":"To test the effect of a spurious correlation on preference-based policy optimization, we select a training set of 30K examples from the relabeled data such that the longer output is preferred in ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-2.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"fraction of examples, with ","element":"span"},{"style":{"height":16},"width":587.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-3.png","element":"img","alt":" ρ ∈ {0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 
0.8}.","inline":true,"padRight":true},{"text":"Each such training set is denoted ","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-4.png","element":"img","alt":" Dρ","inline":true},{"text":". At each ","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-5.png","element":"img","alt":" Dρ","inline":true},{"text":", we compare our approach to DPO (","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":34,"text":"2023","element":"a"},{"text":") and IPO (","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":"), which are currently the most commonly used offline alignment methods. 
We test the following variants of distillation and pessimism:","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Distilled DPO ","element":"span"},{"text":"(d-DPO): Trains a reward model ","element":"span"},{"style":{"height":11.59},"width":34.98,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-6.png","element":"img","alt":" rρ","inline":true,"padRight":true},{"text":"on ","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-7.png","element":"img","alt":" Dρ","inline":true},{"text":", and then optimizes ","element":"span"},{"style":{"height":16.79},"width":257.42,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-8.png","element":"img","alt":" Ldistill(rρ, πθ; ρ","inline":true},{"text":").","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Pessimistic DPO ","element":"span"},{"text":"(p-DPO): A pessimistic version of DPO as described in §","element":"span"},{"href":"#id-74","text":"5.3","element":"a"},{"text":", trained on ","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-9.png","element":"img","alt":" Dρ","inline":true},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Pessimistic Distilled DPO ","element":"span"},{"text":"(dp-DPO): Combines the above two by training a reward model ","element":"span"},{"style":{"height":11.59},"width":34.98,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-10.png","element":"img","alt":" rρ","inline":true,"padRight":true},{"text":"on 
","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-11.png","element":"img","alt":"Dρ","inline":true,"padRight":true},{"text":"and optimizing the pessimistic distillation objective (Eq. (","element":"span"},{"href":"#id-69","text":"11","element":"a"},{"text":")) with confidence set ","element":"span"},{"style":{"height":16.79},"width":180.94,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-12.png","element":"img","alt":" S = {rtgt}","inline":true},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Pessimistic Ensemble DPO ","element":"span"},{"text":"(e-DPO): To create ensembles of reward models, we subsample from each ","element":"span"},{"style":{"height":15.59},"width":47.74,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-13.png","element":"img","alt":" Dρ","inline":true,"padRight":true},{"text":"five preference datasets, ","element":"span"},{"style":{"height":15.59},"width":70.74,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-14.png","element":"img","alt":" Dρ,b","inline":true},{"text":", at ","element":"span"},{"style":{"height":16},"width":520.14,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-15.png","element":"img","alt":" b ∈ B = {0.2, 0.4, 0.5, 0.6, 0.8}","inline":true},{"text":", such that the fraction of pairs where the longer response is preferred is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":", and train reward models ","element":"span"},{"style":{"height":11.59},"width":57.98,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-16.png","element":"img","alt":" rρ,b","inline":true,"padRight":true},{"text":"on those subsets. 
Consequently, sensitivity to length should vary across ensemble members. We then apply the same procedure as dp-DPO above, with a confidence set ","element":"span"},{"style":{"height":18.17},"width":250.11,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-17.png","element":"img","alt":" Sρ = {rρ,b}Bb=1","inline":true},{"text":".","element":"span"}],[{"text":"All reward models and policies are initialized from Palm-2-XS (","element":"span"},{"href":"#id-75","referenceIndex":3,"text":"Anil et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-75","referenceIndex":3,"text":"2023","element":"a"},{"text":"). Policies also go through a supervised finetuning step on human-written summaries from the original TL;DR training set (","element":"span"},{"href":"#id-72","referenceIndex":49,"text":"Völske et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":49,"text":"2017","element":"a"},{"text":") prior to alignment, and we term this policy ","element":"span"},{"style":{"height":9.19},"width":83.88,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-18.png","element":"img","alt":" πSFT","inline":true},{"text":". We evaluate performance by sampling summaries for test set prompts, evaluating the average reward according to the oracle reward model, and computing the advantage in average reward compared to ","element":"span"},{"style":{"height":9.19},"width":83.88,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-19.png","element":"img","alt":" πSFT","inline":true,"padRight":true},{"text":"(before alignment). 
We train policies for 10,000 steps with batch size 16 and learning rate 10","element":"span"},{"style":{"height":7.6},"width":40.91,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-20.png","element":"img","alt":"−6","inline":true},{"text":", and reward models for 3k steps with batch size 64 and learning rate 4 ","element":"span"},{"style":{"height":8},"width":31,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-21.png","element":"img","alt":" ×","inline":true,"padRight":true},{"text":"10","element":"span"},{"style":{"height":7.6},"width":40.9,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-22.png","element":"img","alt":"−6","inline":true},{"text":". We use the validation set for model selection during policy training and to choose the following hyperparameters. For all DPO variants, we sweep over ","element":"span"},{"style":{"height":16},"width":287.45,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-23.png","element":"img","alt":" β ∈ {.01, .1, 1, 3,","inline":true,"padRight":true},{"text":"10, 30, 100}","element":"span"},{"text":". For IPO, we sweep over ","element":"span"},{"style":{"height":16},"width":475.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-24.png","element":"img","alt":" τ ∈ {0.01, 0.1, 1, 3, 5, 10, 25}","inline":true},{"text":". 
For all pessimistic methods, we anneal ","element":"span"},{"style":{"height":17.39},"width":619.77,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-25.png","element":"img","alt":" γ = α/β from 10−4 to 10−2 linearly","inline":true,"padRight":true},{"text":"during the 10k training steps (however, in later experiments performed with e-DPO, we found that annealing does not affect performance, and a constant ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-26.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"leads to similar performance; see Figure ","element":"span"},{"href":"#id-76","text":"B.5","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"6.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Results","element":"span"}],[{"text":"We present the results of our experiment in Figure ","element":"span"},{"href":"#id-77","text":"2","element":"a"},{"text":". As can be seen in the plot, the more challenging setting is when ","element":"span"},{"style":{"height":14},"width":104.96,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-27.png","element":"img","alt":" ρ < 0.","inline":true},{"text":"5, which corresponds to a sample of preference annotations in which shorter outputs are generally preferred. This distribution shift is more difficult because, as mentioned, the oracle reward model (trained on human annotations) has a bias in favor of longer outputs (","element":"span"},{"href":"#id-78","referenceIndex":43,"text":"Singhal et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":43,"text":"2023","element":"a"},{"text":"). 
Nevertheless, we get sizable improvements compared to the reference policy ","element":"span"},{"style":{"height":9.19},"width":83.88,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-28.png","element":"img","alt":" πSFT","inline":true,"padRight":true},{"text":"for all length bias values.","element":"span"}],[{"text":"All approaches that invoke distillation (d-DPO, e-DPO, dp-DPO) outperform IPO and DPO (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p < .","element":"span"},{"text":"01 by a Wald test) for ","element":"span"},{"style":{"height":13.6},"width":104.26,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-29.png","element":"img","alt":" ρ ≤ 0.","inline":true},{"text":"5, where shorter responses are preferred. Pessimistic ensemble DPO (e-DPO) performs particularly well in these settings, generally outperforming all methods that use a single reward model. When longer responses are preferred (","element":"span"},{"style":{"height":14},"width":104.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-30.png","element":"img","alt":"ρ > 0.","inline":true},{"text":"6), single-reward distillation (d-DPO) leads to the highest performance, significantly outperforming both DPO and IPO (","element":"span"},{"style":{"fontStyle":"italic"},"text":"p < .","element":"span"},{"text":"01 by a Wald test). Interestingly, p-DPO does not provide empirical benefits relative to the distillation-based methods, indicating that the distillation loss itself is quite important. For the effect of hyperparameter selection, see Figure ","element":"span"},{"href":"#id-79","text":"B.4","element":"a"},{"text":". 
In DPO-based methods, the optimal value of ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-31.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is inversely correlated with the bias; in IPO the same holds for the ","element":"span"},{"style":{"height":6.8},"width":21,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-32.png","element":"img","alt":" τ","inline":true,"padRight":true},{"text":"hyperparameter.","element":"span"}],[{"text":"To better understand the utility of reward ensembles in e-DPO, in particular when ","element":"span"},{"style":{"height":14},"width":104.58,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-33.png","element":"img","alt":" ρ < 0.","inline":true},{"text":"5, we examine the role of each reward model in the ensemble across different biases. Specifically, for e-DPO, we identify for each example, throughout training, the reward model ","element":"span"},{"style":{"height":11.59},"width":57.98,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/9-34.png","element":"img","alt":" rρ,b","inline":true,"padRight":true},{"text":"that best matches the implicit reward of the current","element":"span"}],[{"id":"id-77","style":{"width":"99%"},"width":1872,"height":810,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-0.png","element":"img"}],[{"text":"Figure 2: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Main results","element":"figcaption","subtype":"caption"},{"text":", showing the oracle reward compared to the initial finetuned policy (the oracle reward of the initial finetuned policy is ","element":"figcaption","subtype":"caption"},{"style":{"height":8},"width":73.08,"height":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-1.png","element":"img","alt":" 
≈ −","inline":true},{"text":"1). Error bars correspond to bootstrap 95% confidence intervals for finite sample variance. Ensemble DPO (e-DPO) is significantly better than DPO and IPO in the challenging setup where shorter responses are preferred (","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":104.68,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-2.png","element":"img","alt":"ρ ≤ 0.","inline":true},{"text":"5), and is generally the best-performing method overall in this regime. Distilled DPO (d-DPO) performs best when longer responses are preferred (","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":104.66,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-3.png","element":"img","alt":"ρ > 0.","inline":true},{"text":"6).","element":"figcaption","subtype":"caption"}],[{"text":"policy, i.e., for which reward model is ","element":"span"},{"style":{"height":13.99},"width":106.2,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-4.png","element":"img","alt":" Ldistill","inline":true,"padRight":true},{"text":"minimized on that example (see Eqs. (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":") and (","element":"span"},{"href":"#id-69","text":"11","element":"a"},{"text":")). We find that when the policy is trained on data where shorter responses are preferred (","element":"span"},{"style":{"height":12},"width":84.76,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-5.png","element":"img","alt":"ρ < .","inline":true},{"text":"5), the reward model that best matches the policy often has the opposite bias (","element":"span"},{"style":{"fontStyle":"italic"},"text":"b ","element":"span"},{"text":"is high), and vice versa. 
Thus, the success of e-DPO may be explained by its ability to distill from reward models that do not suffer from the bias in the policy training data, which is particularly helpful when ","element":"span"},{"style":{"height":13.6},"width":84.74,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-6.png","element":"img","alt":" ρ ≤ .","inline":true},{"text":"5 as this bias is also not shared by the oracle RM. We provide the full distribution over reward models for all ","element":"span"},{"style":{"height":14.4},"width":133.9,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-7.png","element":"img","alt":" ρ and β","inline":true,"padRight":true},{"text":"in Appendix ","element":"span"},{"href":"#id-80","text":"B.3","element":"a"},{"text":". Overall, these results demonstrate the efficacy of training a policy by distilling from a reward model in the presence of distribution shifts, and that a careful design of an ensemble to mitigate spurious correlations can lead to further performance gains.","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-8.png","element":"img","alt":"7","inline":true}],[{"style":{"fontWeight":"bold"},"text":"6.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Additional results in an unbiased setting","element":"span"}],[{"text":"To test the ability of our method to perform well on preference tasks where no bias is present, we next run experiments on the Anthropic Helpfulness dataset (","element":"span"},{"href":"#id-73","referenceIndex":5,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-73","referenceIndex":5,"text":"2022","element":"a"},{"text":"). 
We use a Gemini 1.0 Ultra (","element":"span"},{"href":"#id-81","referenceIndex":17,"text":"Gemini ","element":"a"},{"href":"#id-81","referenceIndex":17,"text":"Team","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":17,"text":"2024","element":"a"},{"text":") LLM-as-a-judge model for evaluating win rates of the policies over both the SFT starting point and the best DPO baseline. As shown in Table ","element":"span"},{"href":"#id-82","text":"1","element":"a"},{"text":", in this unbiased setting our distillation objectives can also provide modest gains. Concretely, e-DPO’s win rate against the SFT policy is 65.8%, while DPO’s win rate is 64.2%. Moreover, comparing e-DPO and DPO directly, e-DPO wins in 49.7% of the cases, while DPO wins in 46.9% of the cases (the rest are considered to be ties with no preference relation).","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-9.png","element":"img"}]]},{"heading":"7 Theoretical analysis","paragraphs":[[{"text":"This section characterizes solutions offered by pessimistic DPO and distillation to the issues identified in §","element":"span"},{"text":"4","element":"span"},{"text":", focusing on the simplified scenario in which we optimize with respect to a single preference pair (","element":"span"},{"style":{"height":14.18},"width":93.45,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/10-10.png","element":"img","alt":"yw, yℓ","inline":true},{"text":").","element":"span"}],[{"id":"id-82","style":{"width":"99%"},"width":1872,"height":407,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-0.png","element":"img"}],[{"text":"Table 1: Side-by-side win rates on the Helpfulness dataset (with a Gemini 1.0 Ultra evaluator).","element":"figcaption","subtype":"caption"}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"7.1 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Optima","element":"span"}],[{"text":"In its Lagrangian formulation, pessimistic DPO adds a forward KL term to the DPO objective (§","element":"span"},{"href":"#id-74","text":"5.3","element":"a"},{"text":"). Here we analyze how this additional term affects the optimal policy. For the sake of analysis, we assume that the preference annotations are sampled from the reference distribution, ","element":"span"},{"style":{"height":16},"width":502.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-1.png","element":"img","alt":" µ(x) × πref(y | x) × πref(y | x","inline":true},{"text":"). Then a finite-sample approximation of the forward KL term is","element":"span"}],[{"style":{"width":"44%"},"width":826,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-2.png","element":"img"}],[{"text":"By applying this finite-sample approximation, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p-DPO has a finite optimum, unlike DPO","element":"span"},{"text":", as shown in Proposition ","element":"span"},{"href":"#id-83","text":"1","element":"a"},{"text":". Note that this analysis is limited in two ways: (1) as mentioned, we compute the KL term over the completions in the preference data; (2) we directly optimize the probability ratios ","element":"span"},{"style":{"height":16},"width":384.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-3.png","element":"img","alt":" ψw = πθ(yw)/πref(yw)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":342.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-4.png","element":"img","alt":" ψℓ = πθ(yℓ)/πref(yℓ","inline":true},{"text":"), rather than optimizing them jointly through the parameters. 
For sufficiently expressive ","element":"span"},{"style":{"height":9.19},"width":37.72,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-5.png","element":"img","alt":" πθ","inline":true},{"text":", however, this approximation captures the behavior of the two algorithms reasonably well.","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"Proposition 5. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-6.png","element":"img","alt":"Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"represent a finite-sample approximation to ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-7.png","element":"img","alt":" Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with the empirical forward KL term ","element":"span"},{"text":"ˆ","element":"span"},{"text":"Ω(Θ)","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a fixed ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":16.15},"width":100.52,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-8.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.6},"width":75.87,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-9.png","element":"img","alt":" α >","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", the ","element":"span"},{"text":"argmin","element":"span"},{"style":{"height":11.2},"width":90.32,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-10.png","element":"img","alt":"πθ(yℓ)","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-11.png","element":"img","alt":"Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"min","element":"span"},{"style":{"height":19.2},"width":187.1,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-12.png","element":"img","alt":"(1 − ˆπθ(ywi","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":143.75,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-13.png","element":"img","alt":", ˆπθ(yℓi))","inline":true},{"style":{"fontStyle":"italic"},"text":", with ","element":"span"},{"text":"log ˆ","element":"span"},{"style":{"height":21.37},"width":214.95,"height":53.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-14.png","element":"img","alt":"πθ(yℓi) = − 1β","inline":true,"padRight":true},{"text":"log 
(","element":"span"},{"style":{"height":6.8},"width":65.49,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-15.png","element":"img","alt":"α −","inline":true,"padRight":true},{"text":"1) + log ˆ","element":"span"},{"style":{"height":16.15},"width":100.21,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-16.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") + log ","element":"span"},{"style":{"height":26.04},"width":133.96,"height":65.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-17.png","element":"img","alt":" πref(yℓi )πref(ywi ).","inline":true}],[{"text":"The optimum in Proposition ","element":"span"},{"href":"#id-84","text":"5 ","element":"a"},{"text":"corresponds to log ","element":"span"},{"style":{"height":17.38},"width":231.04,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-18.png","element":"img","alt":" ψw/ψℓ = β−1","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":6.8},"width":62.73,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-19.png","element":"img","alt":"α −","inline":true,"padRight":true},{"text":"1). 
Recall that IPO (","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":") seeks to assign a constant value to this ratio by minimizing (log ","element":"span"},{"style":{"height":22.21},"width":193.24,"height":55.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-20.png","element":"img","alt":"ψwψℓ − τ −1)2","inline":true},{"text":"; the (unconstrained) optima are ","element":"span"},{"text":"identical for ","element":"span"},{"style":{"height":16.19},"width":196.21,"height":40.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-21.png","element":"img","alt":" τ −1 := β−1","inline":true,"padRight":true},{"text":"log(","element":"span"},{"style":{"height":6.8},"width":65.67,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-22.png","element":"img","alt":"α −","inline":true,"padRight":true},{"text":"1), but the loss surfaces are different (see further analysis of this in §","element":"span"},{"href":"#id-85","text":"7.2","element":"a"},{"text":"). 
DPO sets ","element":"span"},{"style":{"height":16.15},"width":160.75,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-23.png","element":"img","alt":" πθ(yℓi) →","inline":true,"padRight":true},{"text":"0, as shown in Corollary ","element":"span"},{"href":"#id-58","text":"1","element":"a"},{"text":"; this is due not only to competition from ","element":"span"},{"style":{"height":16.15},"width":100.52,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-24.png","element":"img","alt":" πθ(ywi","inline":true,"padRight":true},{"text":") but from ","element":"span"},{"text":"DPO penalizing positive probability on ","element":"span"},{"style":{"height":11.75},"width":30.54,"height":29.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-25.png","element":"img","alt":" yℓi","inline":true},{"text":". Analysis of the distilled loss gives a similar result:","element":"span"}],[{"id":"id-93","style":{"fontWeight":"bold"},"text":"Proposition 6. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any fixed ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":16.15},"width":100.52,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-26.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":126.09,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-27.png","element":"img","alt":" β > 0,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the ","element":"span"},{"text":"argmin ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of the distilled DPO objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"min(1 ","element":"span"},{"style":{"height":16.15},"width":140.06,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-28.png","element":"img","alt":" − ˆπθ(ywi","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":16.15},"width":104.49,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-29.png","element":"img","alt":", ˆπθ(yℓi","inline":true},{"text":"))","element":"span"},{"style":{"fontStyle":"italic"},"text":", with ","element":"span"},{"text":"log ˆ","element":"span"},{"style":{"height":21.37},"width":183.96,"height":53.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-30.png","element":"img","alt":"πθ(yℓi) = 1β","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":16.15},"width":320.46,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-31.png","element":"img","alt":"rt(x, yℓi) − rt(x, 
ywi","inline":true,"padRight":true},{"text":")) + log ˆ","element":"span"},{"style":{"height":16.15},"width":100.21,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-32.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") + log ","element":"span"},{"style":{"height":26.04},"width":117.72,"height":65.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-33.png","element":"img","alt":" πref(yℓi )πref(ywi )","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"While the setting is simplistic, the results are comforting: here the additional regularization effects of both distillation and pessimism (in the case of p-DPO) clearly help to avoid degenerate optima.","element":"span"}],[{"id":"id-85","style":{"fontWeight":"bold"},"text":"7.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Transitive closure: p-DPO vs IPO","element":"span"}],[{"text":"As pointed out in §","element":"span"},{"href":"#id-86","text":"7.1","element":"a"},{"text":", both p-DPO and IPO target a constant ratio for log ","element":"span"},{"style":{"height":16},"width":107.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-34.png","element":"img","alt":" ψw/ψl","inline":true},{"text":". Despite the similar form of optima, however, the loss surfaces of the two objectives differ in notable ways. To see this, we consider a simplified setting with three possible outputs, ","element":"span"},{"style":{"height":10.8},"width":161.08,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-35.png","element":"img","alt":" y1, y2, y3","inline":true},{"text":". 
We observe either ","element":"span"},{"style":{"height":16},"width":468.66,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-36.png","element":"img","alt":" D = {(y1 ≺ y2), (y2 ≺ y3)}","inline":true,"padRight":true},{"text":"or ","element":"span"},{"style":{"height":16},"width":360.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-37.png","element":"img","alt":"D = D ∪ {(y1 ≺ y3)}","inline":true},{"text":". If we treat this problem as a multi-arm bandit, the goal is to assign a weight to each arm, which we denote ","element":"span"},{"style":{"height":16},"width":295.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-38.png","element":"img","alt":" ψi = log πθ(yi | x","inline":true},{"text":") + ","element":"span"},{"style":{"height":14},"width":58.26,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-39.png","element":"img","alt":" Zx,","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":13.19},"width":45.2,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-40.png","element":"img","alt":" Zx","inline":true,"padRight":true},{"text":"an underdetermined log-partition function.","element":"span"}],[{"id":"id-94","style":{"fontWeight":"bold"},"text":"Proposition 7. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, i ","element":"span"},{"text":"+ 1) : ","element":"span"},{"style":{"height":16},"width":264.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-41.png","element":"img","alt":" i ∈ 1, 2, . . . 
, n}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n > ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the dataset arising from the transitive closure of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-42.png","element":"img","alt":" πref","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is indifferent to all ","element":"span"},{"text":"(","element":"span"},{"style":{"height":12.39},"width":83.05,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-43.png","element":"img","alt":"yi, yj","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":19.88},"width":123.33,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-44.png","element":"img","alt":" ψ(D)∞ =","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":21.12},"width":140.24,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-45.png","element":"img","alt":"i ψ(D)i −","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":21.12},"width":97.42,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-46.png","element":"img","alt":"i ψ(D)i","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". 
Then ","element":"span"},{"style":{"height":20.68},"width":212.36,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-47.png","element":"img","alt":"ψ(D)∞ = (n −","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":22.18},"width":416.64,"height":55.45,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-48.png","element":"img","alt":"τ −1 > ψ(D)∞ = 2 n−1n τ −1.","inline":true}],[{"text":"Intuitively, the observation of ","element":"span"},{"style":{"height":12},"width":126.08,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-49.png","element":"img","alt":" y1 ≺ y3","inline":true,"padRight":true},{"text":"should increase confidence that ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-50.png","element":"img","alt":" y3","inline":true,"padRight":true},{"text":"is superior to ","element":"span"},{"style":{"height":10.8},"width":48.42,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/11-51.png","element":"img","alt":" y1,","inline":true,"padRight":true},{"text":"but in IPO it has the opposite effect, drawing their scores closer together. While pessimistic DPO also has a target ratio between each preference pair, its loss surface is different: in particular, it does not increase quadratically as we move away from the target. We find empirically that pessimistic DPO is robust to the transitive closure ","element":"span"},{"text":"of preference annotations in the multi-arm bandit setting, as shown in Figure ","element":"span"},{"href":"#id-87","text":"B.2","element":"a"},{"text":". 
As discussed above, DPO will set ","element":"span"},{"style":{"height":14},"width":177.65,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-0.png","element":"img","alt":" ψ1 → −∞","inline":true,"padRight":true},{"text":"because ","element":"span"},{"style":{"height":10.8},"width":35.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-1.png","element":"img","alt":" y1","inline":true,"padRight":true},{"text":"is never preferred. Specifically, in our experiments we solve the p-DPO and IPO objectives for both ","element":"span"},{"style":{"height":16},"width":795.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-2.png","element":"img","alt":" D = {(y1, y2), (y2, y3)} and D = D ∪ {(y1, y3)}","inline":true},{"text":", solving with respect to ","element":"span"},{"style":{"height":16},"width":238.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-3.png","element":"img","alt":" {πθ(yi)}. IPO","inline":true,"padRight":true},{"text":"is solved analytically as a quadratic program; for p-DPO we used projected gradient descent. 
We consider ","element":"span"},{"style":{"height":14},"width":62.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-4.png","element":"img","alt":"β ∈","inline":true,"padRight":true},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"30) and ","element":"span"},{"style":{"height":9.6},"width":63.72,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-5.png","element":"img","alt":" α ∈","inline":true,"padRight":true},{"text":"(5","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"10","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"20","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"50","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"100","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1000). As shown in Figure ","element":"span"},{"href":"#id-87","text":"B.2","element":"a"},{"text":", there are significant differences in the IPO solutions with and without transitive closure, while for p-DPO these differences are imperceptible.","element":"span"}],[{"style":{"width":"99%"},"width":1872,"height":272,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/12-6.png","element":"img"}]]},{"heading":"8 Conclusion","paragraphs":[[{"text":"LM alignment is crucial for deploying safe and helpful assistants, but is difficult due to lack of access to perfect preference oracles. We presented a thorough theoretical analysis of some of the degeneracies that DPO is susceptible to when learning from sampled human preference data. 
Furthermore, our findings suggest that explicit reward modeling remains a powerful vehicle for introducing regularization into post-training. By distilling the reward assigned by a single explicit reward model, or a family of explicit reward models, directly into the implicit reward maximized by our policies using offline data, we demonstrated that we can achieve improved robustness to variations in preference dataset quality, while maintaining the simplicity of offline alignment frameworks. Finally, reward model distillation also results in modest but consistent improvements in performance even in unbiased settings, making it an overall compelling algorithmic modification to offline training.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We thank the anonymous reviewers, Alexander D’Amour, and Chris Dyer for helpful comments and feedback on the manuscript. This research also benefited from discussions with Victor Veitch, Mandar Joshi, Kenton Lee, Kristina Toutanova, David Gaddy, Dheeru Dua, Yuan Zhang, Tianze Shi, and Anastasios Angelopoulos.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-4","text":"AI@Meta. The llama 3 herd of models, 2024. URL ","element":"span"},{"href":"https://arxiv.org/abs/2407.21783","style":{"fontFamily":"monospace"},"text":"https://arxiv.org/abs/2407.21783","element":"a"},{"text":".","element":"span"}],[{"id":"id-20","text":"Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.10571","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-75","text":"Rohan Anil, Andrew M. 
Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak ","element":"span"},{"text":"$3c","element":"span"}],[{"text":"Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report, 2023.","element":"span"}],[{"id":"id-1","text":"Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, ","element":"span"},{"text":"and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Artificial Intelligence and Statistics","element":"span"},{"text":", pages 4447–4455. PMLR, 2024.","element":"span"}],[{"id":"id-73","text":"Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Dassarma, Dawn Drain, ","element":"span"},{"text":"Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, Benjamin Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ArXiv","element":"span"},{"text":", abs/2204.05862, 2022.","element":"span"}],[{"id":"id-6","text":"Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired ","element":"span"},{"text":"comparisons. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Biometrika","element":"span"},{"text":", 39(3/4):324–345, 1952. ISSN 00063444. 
URL ","element":"span"},{"href":"http://www.jstor.org/stable/2334029","style":{"fontFamily":"monospace"},"text":"http://www.jstor.org/stable/ ","element":"a"},{"href":"http://www.jstor.org/stable/2334029","style":{"fontFamily":"monospace"},"text":"2334029","element":"a"},{"text":".","element":"span"}],[{"id":"id-38","text":"Daniele Calandriello, Zhaohan Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, ","element":"span"},{"text":"Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, and Bilal Piot. Human alignment of large language models through online preference optimisation. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 41st International Conference on Machine Learning","element":"span"},{"text":", volume 235 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 5409–5435. PMLR, 21–27 Jul 2024. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v235/calandriello24a.html","style":{"fontFamily":"monospace"},"text":"https://proceedings.mlr.press/v235/calandriello24a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-24","text":"Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, ","element":"span"},{"text":"and Furong Huang. Transfer q star: Principled decoding for llm alignment. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2405.20495","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-14","text":"Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline ","element":"span"},{"text":"reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3852–3878. PMLR, 2022.","element":"span"}],[{"id":"id-31","text":"Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement ","element":"span"},{"text":"learning from human preferences. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 30, 2017.","element":"span"}],[{"id":"id-40","text":"Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate ","element":"span"},{"text":"overoptimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.02743","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-35","text":"Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, ","element":"span"},{"text":"Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2405.07863","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-39","text":"Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D’Amour, DJ Dvijotham, Adam ","element":"span"},{"text":"Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2312.09244","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-13","text":"Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born ","element":"span"},{"text":"again neural networks. 
In Jennifer Dy and Andreas Krause, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 35th International Conference on Machine Learning","element":"span"},{"text":", volume 80 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 1607–1616. PMLR, 10–15 Jul 2018. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v80/furlanello18a.html","style":{"fontFamily":"monospace"},"text":"https://proceedings.mlr.press/v80/furlanello18a.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-18","text":"Yang Gao, Dana Alon, and Donald Metzler. Impact of preference noise on the alignment performance of ","element":"span"},{"text":"generative language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.09824","element":"span"},{"text":", 2024a.","element":"span"}],[{"id":"id-30","text":"Zhaolin Gao, Jonathan Daniel Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten ","element":"span"},{"text":"Joachims, J. Andrew Bagnell, Jason D. Lee, and Wen Sun. REBEL: Reinforcement learning via regressing relative rewards. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ICML 2024 Workshop on Models of Human Feedback for AI Alignment","element":"span"},{"text":", 2024b. URL ","element":"span"},{"href":"https://openreview.net/forum?id=4SKidIUPP6","style":{"fontFamily":"monospace"},"text":"https://openreview.net/forum?id=4SKidIUPP6","element":"a"},{"text":".","element":"span"}],[{"id":"id-81","text":"Gemini Team. Gemini: A family of highly capable multimodal models, 2024. 
URL ","element":"span"},{"href":"https://arxiv.org/abs/2312.11805","style":{"fontFamily":"monospace"},"text":"https://arxiv.org/ ","element":"a"},{"href":"https://arxiv.org/abs/2312.11805","style":{"fontFamily":"monospace"},"text":"abs/2312.11805","element":"a"},{"text":".","element":"span"}],[{"id":"id-19","text":"Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian ","element":"span"},{"text":"Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.09656","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-5","text":"Dirk Groeneveld, Iz Beltagy, Evan Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Jha, ","element":"span"},{"text":"Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, William Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah Smith, and Hannaneh Hajishirzi. OLMo: Accelerating the science of language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","element":"span"},{"text":", pages 15789–15809, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.841. 
URL ","element":"span"},{"href":"https://aclanthology.org/2024.acl-long.841","style":{"fontFamily":"monospace"},"text":"https://aclanthology.org/2024.acl-long.841","element":"a"},{"text":".","element":"span"}],[{"id":"id-33","text":"Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, ","element":"span"},{"text":"Thomas Mesnard, Yao Zhao, Bilal Piot, et al. Direct language model alignment from online AI feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.04792","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-10","text":"Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NIPS Deep Learning and Representation Learning Workshop","element":"span"},{"text":", 2015. URL ","element":"span"},{"href":"http://arxiv.org/abs/1503.02531","style":{"fontFamily":"monospace"},"text":"http://arxiv.org/abs/1503.02531","element":"a"},{"text":".","element":"span"}],[{"id":"id-53","text":"Tomasz Korbak, Ethan Perez, and Christopher Buckley. RL with KL penalties is better viewed as Bayesian ","element":"span"},{"text":"inference. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics: EMNLP 2022","element":"span"},{"text":", pages 1083–1091, 2022.","element":"span"}],[{"id":"id-15","text":"Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline ","element":"span"},{"text":"reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33:1179–1191, 2020.","element":"span"}],[{"id":"id-8","text":"Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, ","element":"span"},{"text":"and Abhinav Rastogi. 
Rlaif: Scaling reinforcement learning from human feedback with ai feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2309.00267","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-63","text":"Yong Lin, Skyler Seto, Maartje ter Hoeve, Katherine Metcalf, Barry-John Theobald, Xuan Wang, Yizhe ","element":"span"},{"text":"Zhang, Chen Huang, and Tong Zhang. On the limited generalization capability of the implicit reward model induced by direct preference optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv: 2409.03650","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-47","text":"Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. ","element":"span"},{"text":"Provably good batch off-policy reinforcement learning without great exploration. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 33: 1264–1274, 2020.","element":"span"}],[{"id":"id-26","text":"Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, and Zhaoran ","element":"span"},{"text":"Wang. Provably mitigating overoptimization in rlhf: Your sft loss is implicitly an adversarial regularizer. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2405.16436","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-28","text":"Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, and Anh Tuan Luu. Don’t forget your reward ","element":"span"},{"text":"values: Language model alignment via value-based calibration. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing","element":"span"},{"text":", pages 17622–17642, Miami, Florida, USA, November 2024. 
Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.976. URL ","element":"span"},{"href":"https://aclanthology.org/2024.emnlp-main.976","style":{"fontFamily":"monospace"},"text":"https://aclanthology.org/2024.emnlp-main.976","element":"a"},{"text":".","element":"span"}],[{"id":"id-3","text":"Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free ","element":"span"},{"text":"reward. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-43","text":"Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D Dragan, ","element":"span"},{"text":"and Stephen McAleer. Confronting reward model overoptimization with constrained RLHF. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.04373","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-23","text":"Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-","element":"span"},{"text":"Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, and Ahmad Beirami. Controlled decoding from language models. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-first International Conference on Machine Learning","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-16","text":"Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: ","element":"span"},{"text":"Fixing failure modes of preference optimisation with DPO-Positive. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.13228","element":"span"},{"text":", 2024.","element":"span"}],[{"text":"Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct ","element":"span"},{"text":"preference optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2403.19159","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-0","text":"Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. ","element":"span"},{"text":"Direct preference optimization: Your language model is secretly a reward model. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2023.","element":"span"}],[{"id":"id-17","text":"Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, W. Bradley Knox, Chelsea ","element":"span"},{"text":"Finn, and Scott Niekum. Scaling laws for reward model overoptimization in direct alignment algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Thirty-eighth Annual Conference on Neural Information Processing Systems","element":"span"},{"text":", 2024a. URL ","element":"span"},{"href":"https://openreview.net/forum?id=pf4OuJyn4Q","style":{"fontFamily":"monospace"},"text":"https: ","element":"a"},{"href":"https://openreview.net/forum?id=pf4OuJyn4Q","style":{"fontFamily":"monospace"},"text":"//openreview.net/forum?id=pf4OuJyn4Q","element":"a"},{"text":".","element":"span"}],[{"id":"id-22","text":"Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From ","element":"span"},{"style":{"height":14.59},"width":115.76,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/15-0.png","element":"img","alt":" r to q∗","inline":true},{"text":": Your language model is secretly a q-function. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.12358","element":"span"},{"text":", 2024b.","element":"span"}],[{"id":"id-44","text":"Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure ","element":"span"},{"text":"Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-42","text":"Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and ","element":"span"},{"text":"Johan Ferret. WARM: On the benefits of weight averaged reward models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2401.12187","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-59","text":"Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, and Boris Hanin. Unintentional ","element":"span"},{"text":"unalignment: Likelihood displacement in direct preference optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv: 2410.08847","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-46","text":"Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, and Olivier Pietquin. ","element":"span"},{"text":"Countering reward over-optimization in LLM with demonstration-guided reinforcement learning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Findings of the Association for Computational Linguistics ACL 2024","element":"span"},{"text":", pages 12447–12472, Bangkok, Thailand and virtual meeting, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.740. 
URL ","element":"span"},{"href":"https://aclanthology.org/2024.findings-acl.740","style":{"fontFamily":"monospace"},"text":"https://aclanthology.org/2024.findings-acl.740","element":"a"},{"text":".","element":"span"}],[{"id":"id-11","text":"Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua ","element":"span"},{"text":"Bengio. FitNets: Hints for thin deep nets. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of ICLR","element":"span"},{"text":", 2015.","element":"span"}],[{"id":"id-52","text":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization ","element":"span"},{"text":"algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1707.06347","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-78","text":"Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length ","element":"span"},{"text":"correlations in RLHF. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2310.03716","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-32","text":"Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario ","element":"span"},{"text":"Amodei, and Paul F Christiano. Learning to summarize with human feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33:3008–3021, 2020.","element":"span"}],[{"id":"id-9","text":"Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. 
A minimaximalist ","element":"span"},{"text":"approach to reinforcement learning from human feedback, 2024.","element":"span"}],[{"id":"id-34","text":"Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, ","element":"span"},{"text":"Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.14367","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-62","text":"Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi ","element":"span"},{"text":"Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, and Will Dabney. Understanding the performance gap between online and offline alignment algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv: 2405.08448","element":"span"},{"text":", 2024a.","element":"span"}],[{"id":"id-2","text":"Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, ","element":"span"},{"text":"Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2402.05749","element":"span"},{"text":", 2024b.","element":"span"}],[{"id":"id-72","text":"Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. TL;DR: Mining Reddit to learn automatic ","element":"span"},{"text":"summarization. In Lu Wang, Jackie Chi Kit Cheung, Giuseppe Carenini, and Fei Liu, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Workshop on New Frontiers in Summarization","element":"span"},{"text":", pages 59–63, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. 
URL ","element":"span"},{"href":"https://aclanthology.org/W17-4508","style":{"fontFamily":"monospace"},"text":"https://aclanthology. ","element":"a"},{"href":"https://aclanthology.org/W17-4508","style":{"fontFamily":"monospace"},"text":"org/W17-4508","element":"a"},{"text":".","element":"span"}],[{"id":"id-49","text":"Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism ","element":"span"},{"text":"for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:6683–6694, 2021.","element":"span"}],[{"id":"id-37","text":"Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative ","element":"span"},{"text":"preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Forty-first International Conference on Machine Learning","element":"span"},{"text":", 2024. URL ","element":"span"},{"href":"https://openreview.net/forum?id=c1AKcA6ry1","style":{"fontFamily":"monospace"},"text":"https://openreview.net/forum? ","element":"a"},{"href":"https://openreview.net/forum?id=c1AKcA6ry1","style":{"fontFamily":"monospace"},"text":"id=c1AKcA6ry1","element":"a"},{"text":".","element":"span"}],[{"id":"id-36","text":"Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. ","element":"span"},{"text":"Is DPO superior to PPO for LLM alignment? a comprehensive study. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.10719","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-12","text":"Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L. Yuille. Training deep neural networks in generations: a ","element":"span"},{"text":"more tolerant teacher educates better students. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence","element":"span"},{"text":", AAAI’19/IAAI’19/EAAI’19. AAAI Press, 2019. ISBN 978-1-57735-809-1. doi: 10.1609/aaai.v33i01.33015628. URL ","element":"span"},{"href":"https://doi.org/10.1609/aaai.v33i01.33015628","style":{"fontFamily":"monospace"},"text":"https://doi.org/ ","element":"a"},{"href":"https://doi.org/10.1609/aaai.v33i01.33015628","style":{"fontFamily":"monospace"},"text":"10.1609/aaai.v33i01.33015628","element":"a"},{"text":".","element":"span"}],[{"id":"id-50","text":"Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. Combo: ","element":"span"},{"text":"Conservative offline model-based policy optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:28954–28967, 2021.","element":"span"}],[{"id":"id-48","text":"Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for ","element":"span"},{"text":"offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:13626–13640, 2021.","element":"span"}],[{"id":"id-21","text":"Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct ","element":"span"},{"text":"preference optimization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2404.11999","element":"span"},{"text":", 2024.","element":"span"}],[{"id":"id-41","text":"Yuanzhao Zhai, Han Zhang, Yu Lei, Yue Yu, Kele Xu, Dawei Feng, Bo Ding, and Huaimin Wang. 
Uncertainty-","element":"span"},{"text":"penalized reinforcement learning from human feedback with diverse reward LoRA ensembles. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2401.00243","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-27","text":"Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, and Wen Sun. Provable offline preference-","element":"span"},{"text":"based reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Twelfth International Conference on Learning Representations","element":"span"},{"text":", 2024. URL ","element":"span"},{"href":"https://openreview.net/forum?id=tVMPfEGT2w","style":{"fontFamily":"monospace"},"text":"https://openreview.net/forum?id=tVMPfEGT2w","element":"a"},{"text":".","element":"span"}],[{"id":"id-7","text":"Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. SLiC-HF: Sequence ","element":"span"},{"text":"likelihood calibration with human feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2305.10425","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-25","text":"Banghua Zhu, Michael Jordan, and Jiantao Jiao. Principled reinforcement learning with human feedback ","element":"span"},{"text":"from pairwise or k-wise comparisons. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 40th International Conference on Machine Learning","element":"span"},{"text":", volume 202 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 43037–43067. PMLR, 23–29 Jul 2023. 
URL ","element":"span"},{"href":"https://proceedings.mlr.press/v202/zhu23f.html","style":{"fontFamily":"monospace"},"text":"https://proceedings.mlr.press/v202/zhu23f.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-45","text":"Banghua Zhu, Michael I Jordan, and Jiantao Jiao. Iterative data smoothing: Mitigating reward overfitting ","element":"span"},{"text":"and overoptimization in rlhf. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2401.16335","element":"span"},{"text":", 2024.","element":"span"}]]},{"heading":"A Proofs","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-83","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-83","text":"1 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-55","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"text":"(","element":"span"},{"style":{"height":15.75},"width":663.18,"height":39.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-0.png","element":"img","alt":"y, y′) such that y = ywi and y′ = yℓi for","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"some ","element":"span"},{"style":{"height":24.43},"width":500.54,"height":61.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-1.png","element":"img","alt":" i, we have πθ∗(y)πref(y′)πθ∗(y′)πref(y) → ∞","inline":true},{"style":{"fontStyle":"italic"},"text":", for all global minimizers ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-2.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"of the DPO objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-56","text":"6","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for any ","element":"span"},{"style":{"height":14},"width":109.44,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-3.png","element":"img","alt":" β > 0.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Since all the preference pairs (","element":"span"},{"style":{"height":10.8},"width":67.64,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-4.png","element":"img","alt":"y, y′","inline":true},{"text":") are mutually disjoint, and ","element":"span"},{"style":{"height":15.59},"width":34.71,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-5.png","element":"img","alt":" θy","inline":true,"padRight":true},{"text":"is specific to each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", the DPO objective over ","element":"span"},{"style":{"height":16.79},"width":640.1,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-6.png","element":"img","alt":" Dpref is convex in ∆ = {∆1, . . . , ∆n}","inline":true},{"text":", where","element":"span"}],[{"style":{"width":"62%"},"width":1165,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-7.png","element":"img"}],[{"text":"Furthermore, the different ∆","element":"span"},{"style":{"height":7.2},"width":11,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-8.png","element":"img","alt":"i","inline":true,"padRight":true},{"text":"are completely independent from each other due to the preference pairs being disjoint, so they can be optimized over separately.","element":"span"}],[{"text":"In particular, for every ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"we have that","element":"span"}],[{"style":{"width":"61%"},"width":1150,"height":61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-9.png","element":"img"}],[{"text":"which implies that ∆","element":"span"},{"style":{"height":16},"width":170.54,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-10.png","element":"img","alt":"∗ = {∞}n 
","inline":true,"padRight":true},{"text":"is the unique global minimizer of the DPO loss over ","element":"span"},{"style":{"height":15.59},"width":86.96,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-11.png","element":"img","alt":" Dpref","inline":true,"padRight":true},{"text":"in the space of ∆’s, and any ","element":"span"},{"style":{"height":11.39},"width":35.82,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-12.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"that is a global minimizer must therefore satisfy","element":"span"}],[{"style":{"width":"61%"},"width":1147,"height":175,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-13.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Corollary ","element":"span"},{"href":"#id-58","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Corollary ","element":"span"},{"text":"(Corollary ","element":"span"},{"href":"#id-58","text":"1 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under Assumption ","element":"span"},{"href":"#id-55","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":", further assume that ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":15.6},"width":502,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-14.png","element":"img","alt":" < πref(y) < 1 for all y. 
Then","inline":true},{"style":{"height":9.19},"width":53.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-15.png","element":"img","alt":"πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a global minimizer of the DPO objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-56","text":"6","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"iff ","element":"span"},{"style":{"height":16},"width":247.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-16.png","element":"img","alt":" πθ∗(C(yℓ)c) →","inline":true,"padRight":true},{"text":"1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16.15},"width":117.18,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-17.png","element":"img","alt":" πθ∗(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"height":16},"width":197.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-18.png","element":"img","alt":" > 0 ∀i ∈ [n","inline":true},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16},"width":104.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-19.png","element":"img","alt":"C(yℓ)c","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the complement of the set of all responses ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"that appear as a dispreferred ","element":"span"},{"style":{"height":11.75},"width":30.54,"height":29.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-20.png","element":"img","alt":" yℓi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for any 
","element":"span"},{"style":{"height":16},"width":97.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-21.png","element":"img","alt":" i ∈ [n","inline":true},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"By the same argument as in the proof of Proposition ","element":"span"},{"href":"#id-83","text":"1","element":"a"},{"text":", all global minimizers ","element":"span"},{"style":{"height":11.38},"width":35.82,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-22.png","element":"img","alt":" θ∗","inline":true,"padRight":true},{"text":"of the DPO objective satisfy ∆","element":"span"},{"style":{"height":15.54},"width":111.45,"height":38.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-23.png","element":"img","alt":"∗i = ∞","inline":true},{"text":", which in turn implies that","element":"span"}],[{"style":{"width":"59%"},"width":1122,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-24.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":16},"width":98.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-25.png","element":"img","alt":" πref(y","inline":true},{"text":") is assumed to satisfy 0 ","element":"span"},{"style":{"height":16},"width":199.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-26.png","element":"img","alt":" < πref(y) <","inline":true,"padRight":true},{"text":"1 for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", this implies that all ","element":"span"},{"style":{"height":11.39},"width":35.81,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-27.png","element":"img","alt":" 
θ∗","inline":true,"padRight":true},{"text":"satisfy","element":"span"}],[{"style":{"width":"99%"},"width":1874,"height":276,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-28.png","element":"img"}],[{"text":"then gives that","element":"span"}],[{"style":{"width":"73%"},"width":1384,"height":171,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/18-29.png","element":"img"}],[{"text":"To prove the converse, let ","element":"span"},{"style":{"height":9.19},"width":44.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-0.png","element":"img","alt":" πθ′","inline":true,"padRight":true},{"text":"be a policy that satisfies ","element":"span"},{"style":{"height":16.15},"width":506.07,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-1.png","element":"img","alt":" πθ′(C(yℓ)c) = 1, with πθ′(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0, ","element":"span"},{"style":{"height":16},"width":127.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-2.png","element":"img","alt":" ∀i ∈ [n","inline":true},{"text":"]. As ","element":"span"},{"style":{"height":16},"width":145.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-3.png","element":"img","alt":"πθ′(y) ≥","inline":true,"padRight":true},{"text":"0 for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", this implies that ","element":"span"},{"style":{"height":19.91},"width":313.67,"height":49.76,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-4.png","element":"img","alt":" πθ′(yℓi ) = 0 ∀i ∈ [n","inline":true},{"text":"]. 
Then, we have","element":"span"}],[{"style":{"width":"56%"},"width":1054,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-5.png","element":"img"}],[{"text":"which by Proposition ","element":"span"},{"href":"#id-83","text":"1 ","element":"a"},{"text":"implies that ","element":"span"},{"style":{"height":9.19},"width":44.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-6.png","element":"img","alt":" πθ′","inline":true,"padRight":true},{"text":"is a global optimum.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-70","style":{"fontWeight":"bold"},"text":"1","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"text":"(Theorem ","element":"span"},{"href":"#id-70","text":"1 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y ","element":"span"},{"style":{"fontStyle":"italic"},"text":"denote the set of all possible responses for any model ","element":"span"},{"style":{"height":9.19},"width":51.75,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-7.png","element":"img","alt":" πθ.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Assume that ","element":"span"},{"text":"supp(","element":"span"},{"style":{"height":16},"width":268.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-8.png","element":"img","alt":"πref(y | x)) = Y","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., the reference policy may generate any outcome with non-zero probability. 
Further, let ","element":"span"},{"text":"supp(","element":"span"},{"style":{"height":16},"width":249.87,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-9.png","element":"img","alt":"ρ(x, y1, y2)) =","inline":true,"padRight":true},{"text":"supp(","element":"span"},{"style":{"height":16},"width":261.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-10.png","element":"img","alt":"µ(x)) × Y × Y","inline":true},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":16},"width":224.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-11.png","element":"img","alt":" πθ∗(y | x) ∈","inline":true,"padRight":true},{"text":"argmin","element":"span"},{"style":{"height":16.08},"width":302.29,"height":40.21,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-12.png","element":"img","alt":"πθ Ldistill(r∗, πθ; ρ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be a minimizer over all possible policies, of the implicit reward distillation loss in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for which ","element":"span"},{"style":{"height":16},"width":113.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-13.png","element":"img","alt":" r∗(x, y","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is assumed to be deterministic, and finite everywhere. 
Then for any ","element":"span"},{"style":{"height":14.4},"width":177.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-14.png","element":"img","alt":" β > 0, πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"also maximizes the alignment objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We know that the optimal policy for the RLHF objective (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":") is given by ","element":"span"},{"style":{"height":16},"width":214.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-15.png","element":"img","alt":" πθ∗(y|x) ∝","inline":true},{"style":{"height":16},"width":134.05,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-16.png","element":"img","alt":"πref(y|x","inline":true},{"text":") exp(","element":"span"},{"style":{"height":16},"width":173.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-17.png","element":"img","alt":"r∗(x, y)/β","inline":true},{"text":"). ","element":"span"},{"text":"Plugging this ","element":"span"},{"text":"policy into the ","element":"span"},{"text":"distillation objective (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":"), ","element":"span"},{"text":"we ","element":"span"},{"text":"see ","element":"span"},{"text":"that ","element":"span"},{"style":{"height":16},"width":547.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-18.png","element":"img","alt":"Ldistill(r∗, πθ∗, ρ) = 0 for all ρ","inline":true},{"text":". 
In fact, the loss is equal to 0 pointwise, meaning that ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-19.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"text":"is a global minimizer of the distillation objective (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":"). Further, let ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-20.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"be some other minimizer of ","element":"span"},{"style":{"height":16},"width":228.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-21.png","element":"img","alt":" Ldistill(r∗, ·, ρ","inline":true},{"text":"). Then ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-22.png","element":"img","alt":"π","inline":true,"padRight":true},{"text":"also has to attain a loss of 0 at all (","element":"span"},{"style":{"height":10.8},"width":108.13,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-23.png","element":"img","alt":"x, y, y′","inline":true},{"text":") in the support of ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-24.png","element":"img","alt":" ρ","inline":true},{"text":", meaning that log ","element":"span"},{"style":{"height":16},"width":150.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-25.png","element":"img","alt":" π(y|x) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":164.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-26.png","element":"img","alt":" π(y′|x) 
=","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":182.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-27.png","element":"img","alt":" πθ∗(y|x) −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":127.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-28.png","element":"img","alt":" πθ∗(y′|x","inline":true},{"text":") for all (","element":"span"},{"style":{"height":10.8},"width":108.14,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-29.png","element":"img","alt":"x, y, y′","inline":true},{"text":") in the support of ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-30.png","element":"img","alt":" ρ","inline":true},{"text":". Consequently, the two policies coincide in the support of ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-31.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"(due to the normalization constraint, there is no additional offset term allowed as the support of ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-32.png","element":"img","alt":"ρ","inline":true,"padRight":true},{"text":"covers all of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Y","element":"span"},{"text":"). 
Finally, noting that the support of the chosen ","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-33.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"is such that ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-34.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"text":"puts no mass outside its support due to the KL constraint in (","element":"span"},{"href":"#id-51","text":"1","element":"a"},{"text":"), we complete the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Theorem ","element":"span"},{"href":"#id-64","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Theorem ","element":"span"},{"text":"(Theorem ","element":"span"},{"href":"#id-64","text":"2 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Define the constrained minimizer","element":"span"}],[{"style":{"width":"49%"},"width":922,"height":74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-35.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16.39},"width":107.76,"height":40.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-36.png","element":"img","alt":" Pβ(S)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the set of all possible policies with implicit reward models that are consistent with any target reward model ","element":"span"},{"style":{"height":19.32},"width":138.69,"height":48.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-37.png","element":"img","alt":" ritgt ∈ S","inline":true},{"style":{"fontStyle":"italic"},"text":", i.e., ","element":"span"},{"style":{"height":21.12},"width":308.4,"height":52.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-38.png","element":"img","alt":" Pβ(S) ≜ {πθi}|S|i=1","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16},"width":268.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-39.png","element":"img","alt":" πθi ∝ πref(y | x","inline":true},{"text":") exp ","element":"span"},{"style":{"height":21.37},"width":161.07,"height":53.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-40.png","element":"img","alt":" 1β ritgt(x, y","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then for any ","element":"span"},{"style":{"height":14},"width":68.34,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-41.png","element":"img","alt":" β >","inline":true,"padRight":true},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"height":9.19},"width":53.75,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-42.png","element":"img","alt":"πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"also maximizes the pessimistic alignment objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-65","text":"8","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"Consider the pessimistic objective:","element":"span"}],[{"style":{"width":"85%"},"width":1601,"height":81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-43.png","element":"img"}],[{"text":"As it is linear in ","element":"span"},{"style":{"height":11.59},"width":58.29,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-44.png","element":"img","alt":" rtgt","inline":true,"padRight":true},{"text":"and convex in ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-45.png","element":"img","alt":" π","inline":true},{"text":", we can switch the order of min and max:","element":"span"}],[{"style":{"width":"85%"},"width":1610,"height":97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-46.png","element":"img"}],[{"text":"Note that every ","element":"span"},{"style":{"height":15.99},"width":135.49,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-47.png","element":"img","alt":" rtgt ∈ 
S","inline":true,"padRight":true},{"text":"can be written in terms of the KL-constrained policy ","element":"span"},{"style":{"height":18.52},"width":72.88,"height":46.3,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-48.png","element":"img","alt":" π∗rtgt","inline":true,"padRight":true},{"text":"it induces, i.e.,","element":"span"}],[{"style":{"width":"71%"},"width":1337,"height":101,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-49.png","element":"img"}],[{"text":"where","element":"span"}],[{"style":{"width":"75%"},"width":1421,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/19-50.png","element":"img"}],[{"text":"which has the form","element":"span"}],[{"style":{"width":"74%"},"width":1387,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.79},"width":144.33,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-1.png","element":"img","alt":" Z(x, rtgt","inline":true},{"text":") is the partition function:","element":"span"}],[{"style":{"width":"99%"},"width":1871,"height":964,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-88","style":{"fontWeight":"bold"},"text":"2","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-88","text":"2 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.18},"width":201.6,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-3.png","element":"img","alt":" yw, yℓ ∈ Vn","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be preferred versus dispreferred outputs of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":", respectively, with ","element":"span"},{"style":{"height":16},"width":339.47,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-4.png","element":"img","alt":" πref(yw), πref(yℓ) >","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and corresponding count vectors ","element":"span"},{"style":{"height":16},"width":209.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-5.png","element":"img","alt":" c(yw), c(yℓ).","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"text":"log ","element":"span"},{"style":{"height":16},"width":142.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-6.png","element":"img","alt":" πθ(y) =","inline":true},{"style":{"height":16},"width":255.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-7.png","element":"img","alt":"c(y) · θ − nZ(θ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"height":16},"width":123.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-8.png","element":"img","alt":" Z(θ) =","inline":true,"padRight":true},{"text":"log 
","element":"span"},{"style":{"height":20.4},"width":117.85,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-9.png","element":"img","alt":"∑Vi eθi","inline":true},{"style":{"fontStyle":"italic"},"text":", with upper bound ","element":"span"},{"text":"log ˜","element":"span"},{"style":{"height":16},"width":337.23,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-10.png","element":"img","alt":"πθ(y) = c(y) · θ − n","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":15.59},"width":55.14,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-11.png","element":"img","alt":"j θj","inline":true},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":14.19},"width":56.31,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-12.png","element":"img","alt":" θ(t)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"represent ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the parameters of ","element":"span"},{"style":{"height":14},"width":147.13,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-13.png","element":"img","alt":" π after t","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"steps of gradient descent on ","element":"span"},{"style":{"height":18.97},"width":541.42,"height":47.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-14.png","element":"img","alt":" Ldpo({yℓ, yw, x}), with θ(0) = 0","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, we have that","element":"span"}],[{"id":"id-90","style":{"width":"41%"},"width":786,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-15.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"with strict inequality when ","element":"span"},{"style":{"height":16},"width":332.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-16.png","element":"img","alt":" ||c(yw) − c(yℓ)||0 >","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"style":{"height":16},"width":413.61,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-17.png","element":"img","alt":" Let ∆ = [c(yw) − c(yℓ","inline":true},{"text":")] and ","element":"span"},{"style":{"height":16},"width":364.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-18.png","element":"img","alt":" ρ = πref(yw)/πref(yℓ","inline":true},{"text":"). ","element":"span"},{"text":"The proposition assumes ","element":"span"},{"style":{"height":16},"width":201.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-19.png","element":"img","alt":" |yw| = |yℓ|","inline":true},{"text":". 
","element":"span"},{"text":"Then ","element":"span"},{"style":{"height":15.99},"width":165.55,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-20.png","element":"img","alt":"Ldpo = −","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":167.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-21.png","element":"img","alt":" σ (β(∆ · θ","inline":true},{"text":") + ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-22.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":16},"width":53.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-23.png","element":"img","alt":" ρ) .","inline":true,"padRight":true},{"text":"The derivative with respect to ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-24.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"is","element":"span"}],[{"style":{"width":"81%"},"width":1524,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-25.png","element":"img"}],[{"text":"Let ","element":"span"},{"style":{"height":18.19},"width":384.23,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-26.png","element":"img","alt":" δt = βp(yℓ ≻ yw; θ(t)).","inline":true,"padRight":true},{"text":"Then,","element":"span"}],[{"style":{"width":"78%"},"width":1477,"height":514,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/20-27.png","element":"img"}],[{"text":"We obtain max","element":"span"},{"style":{"height":28.8},"width":351.75,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-0.png","element":"img","alt":"j (θ(t−1)j + 
δt∆j) =","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":23.52},"width":120.52,"height":58.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-1.png","element":"img","alt":"j θ(t−1)j","inline":true,"padRight":true},{"text":"+ max","element":"span"},{"style":{"height":16.39},"width":102.05,"height":40.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-2.png","element":"img","alt":"j δt∆j","inline":true,"padRight":true},{"text":"from the fact that ","element":"span"},{"style":{"height":14.18},"width":413.65,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-3.png","element":"img","alt":" θ(0) = 0 and therefore","inline":true}],[{"style":{"height":14},"width":62.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-4.png","element":"img","alt":"j ∈","inline":true,"padRight":true},{"text":"arg max ∆ implies ","element":"span"},{"style":{"height":14},"width":62.76,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-5.png","element":"img","alt":" j ∈","inline":true,"padRight":true},{"text":"arg max ","element":"span"},{"style":{"height":14.18},"width":67.09,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-6.png","element":"img","alt":" θ(t′)","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":10.8},"width":73.64,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-7.png","element":"img","alt":" t′ >","inline":true,"padRight":true},{"text":"0. 
The second-to-last step uses ","element":"span"},{"style":{"height":22.8},"width":243.67,"height":56.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-8.png","element":"img","alt":" n = ∑Vj cj(yw","inline":true},{"text":") and ","element":"span"},{"text":"the final step uses ∆","element":"span"},{"style":{"height":15.19},"width":62.09,"height":37.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-9.png","element":"img","alt":"j ≤","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":12.33},"width":13,"height":30.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-10.png","element":"img","alt":"j′","inline":true,"padRight":true},{"text":"∆","element":"span"},{"style":{"height":9.6},"width":20.8,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-11.png","element":"img","alt":"j′","inline":true},{"text":". Finally, we have ","element":"span"},{"style":{"height":16},"width":323.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-12.png","element":"img","alt":" πθ(t)(y) ≤ ˜πθ(t)(yw","inline":true},{"text":") because ","element":"span"},{"style":{"height":16},"width":126.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-13.png","element":"img","alt":" Z(θ) =","inline":true,"padRight":true},{"text":"log ","element":"span"},{"style":{"height":19.14},"width":55.06,"height":47.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-14.png","element":"img","alt":" ∑j","inline":true,"padRight":true},{"text":"exp ","element":"span"},{"style":{"height":15.59},"width":80.79,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-15.png","element":"img","alt":" θj ≥","inline":true,"padRight":true},{"text":"log 
max","element":"span"},{"style":{"height":9.6},"width":13,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-16.png","element":"img","alt":"j","inline":true,"padRight":true},{"text":"exp ","element":"span"},{"style":{"height":15.59},"width":233.37,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-17.png","element":"img","alt":" θj = maxj θj.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"A.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-89","style":{"fontWeight":"bold"},"text":"3","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-89","text":"3 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":14.18},"width":43.97,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-18.png","element":"img","alt":" yw","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.8},"width":28.97,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-19.png","element":"img","alt":" yℓ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be preferred versus dispreferred outputs of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":16},"width":308.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-20.png","element":"img","alt":" ∆ = c(yw) − c(yℓ","inline":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the difference in unigram counts. 
Let ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"= [","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, i, . . . , i","element":"span"},{"text":"]","element":"span"},{"style":{"fontStyle":"italic"},"text":", for ","element":"span"},{"style":{"height":11.2},"width":57.11,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-21.png","element":"img","alt":" i ∈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"arg ","element":"span"},{"text":"max ∆","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"height":16},"width":204.07,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-22.png","element":"img","alt":"∥c(ˆy)∥1 = n","inline":true},{"style":{"fontStyle":"italic"},"text":". Then ","element":"span"},{"style":{"height":16},"width":470.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-23.png","element":"img","alt":" πθ(t)(yw) − πθ(t)(ˆy) = τ(t)k","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for some ","element":"span"},{"style":{"height":13.2},"width":64.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-24.png","element":"img","alt":" k ≤","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and some non-decreasing ","element":"span"},{"style":{"height":14.39},"width":223.94,"height":35.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-25.png","element":"img","alt":" τ : Z+ → R+","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Applying gradient descent with learning rate ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-26.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"to the gradient from Equation (","element":"span"},{"href":"#id-90","text":"30","element":"a"},{"text":"), at each step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"the parameters are,","element":"span"}],[{"style":{"width":"85%"},"width":1608,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-27.png","element":"img"}],[{"text":"Plugging these parameters into the likelihoods,","element":"span"}],[{"style":{"width":"82%"},"width":1541,"height":176,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-28.png","element":"img"}],[{"text":"with ","element":"span"},{"style":{"height":13.2},"width":64.07,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-29.png","element":"img","alt":" k ≤","inline":true,"padRight":true},{"text":"0 by ","element":"span"},{"style":{"height":16},"width":604.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-30.png","element":"img","alt":" c(yw) · ∆ ≤ ||c(yw)||1 × ||∆||∞ = n","inline":true,"padRight":true},{"text":"max ∆","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-71","style":{"fontWeight":"bold"},"text":"4","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-71","text":"4 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume the same conditions as Theorem ","element":"span"},{"href":"#id-70","style":{"fontStyle":"italic"},"text":"1","element":"a"},{"style":{"fontStyle":"italic"},"text":". Then for any ","element":"span"},{"text":"0 ","element":"span"},{"style":{"height":12.4},"width":169.89,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-31.png","element":"img","alt":" < γ < ∞,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"there exists a ","element":"span"},{"style":{"height":13.2},"width":69.73,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-32.png","element":"img","alt":" λ ≥","inline":true,"padRight":true},{"text":"0 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":217.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-33.png","element":"img","alt":" πθ∗(y | x) ∈","inline":true,"padRight":true},{"text":"argmin","element":"span"},{"style":{"height":16.79},"width":310.46,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-34.png","element":"img","alt":"πθ Lpdistill(S, πθ; ρ","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":9.19},"width":53.76,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-35.png","element":"img","alt":" πθ∗","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is a minimizer over all possible policies of the objective ","element":"span"},{"text":"(","element":"span"},{"href":"#id-68","text":"9","element":"a"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", for the effective reward model 
set","element":"span"}],[{"style":{"width":"73%"},"width":1385,"height":112,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-36.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"The proof is a standard Lagrangian duality argument, which we reproduce here for completeness. For two functions ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":") and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":"), let us define","element":"span"}],[{"id":"id-91","style":{"width":"61%"},"width":1162,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-37.png","element":"img"}],[{"text":"Let us also consider the constrained problem","element":"span"}],[{"style":{"width":"66%"},"width":1249,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-38.png","element":"img"}],[{"text":"Suppose by contradiction that ","element":"span"},{"style":{"height":11.38},"width":36.28,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-39.png","element":"img","alt":" z∗","inline":true,"padRight":true},{"text":"is not a minimizer of (","element":"span"},{"href":"#id-91","text":"41","element":"a"},{"text":"). 
Since ","element":"span"},{"style":{"height":11.38},"width":36.28,"height":28.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-40.png","element":"img","alt":" z∗","inline":true,"padRight":true},{"text":"is itself feasible for the constraint by construction, there must then exist another feasible point z′ with ","element":"span"},{"style":{"height":16},"width":214.97,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-41.png","element":"img","alt":" f(z′) < f(z∗","inline":true},{"text":"). Consequently, we further have","element":"span"}],[{"style":{"width":"29%"},"width":561,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-42.png","element":"img"}],[{"text":"where the inequality follows from the feasibility of ","element":"span"},{"style":{"height":6.8},"width":28.29,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-43.png","element":"img","alt":" z′","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-91","text":"41","element":"a"},{"text":"). This contradicts the optimality of ","element":"span"},{"style":{"height":11.39},"width":36.28,"height":28.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-44.png","element":"img","alt":" z∗","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-91","text":"40","element":"a"},{"text":"), meaning that ","element":"span"},{"style":{"height":11.39},"width":36.28,"height":28.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-45.png","element":"img","alt":" z∗","inline":true,"padRight":true},{"text":"must be a minimizer of (","element":"span"},{"href":"#id-91","text":"41","element":"a"},{"text":"). 
Applying this general result with ","element":"span"},{"style":{"height":17.68},"width":414.5,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-46.png","element":"img","alt":" f = βEµ(x)DKL(πref(y |","inline":true},{"style":{"height":16},"width":307.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-47.png","element":"img","alt":"x)∥πθ(y | x)), g =","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":21.97},"width":610.83,"height":54.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-48.png","element":"img","alt":"ritgt∈S Ldistill(ritgt, πθ; ρ), and z = πθ ","inline":true,"padRight":true},{"text":"completes the proof, since we recognize the set ","element":"span"},{"style":{"height":15.99},"width":42.14,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-49.png","element":"img","alt":" Sγ","inline":true,"padRight":true},{"text":"in (","element":"span"},{"href":"#id-92","text":"12","element":"a"},{"text":") to be equivalent to ","element":"span"},{"style":{"height":23.09},"width":509.76,"height":57.73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/21-50.png","element":"img","alt":"�ritgt∈S Ldistill(ritgt, πθ; ρ) ≤ λ","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.8 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-84","style":{"fontWeight":"bold"},"text":"5","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-84","text":"5 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-0.png","element":"img","alt":"Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"represent a finite-sample approximation to ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-1.png","element":"img","alt":" Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"with the empirical forward KL term ","element":"span"},{"text":"ˆ","element":"span"},{"text":"Ω(Θ)","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For a fixed ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":16.15},"width":100.53,"height":40.37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-2.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":9.6},"width":80.66,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-3.png","element":"img","alt":" α >","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", the 
","element":"span"},{"text":"argmin","element":"span"},{"style":{"height":11.2},"width":90.32,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-4.png","element":"img","alt":"πθ(yℓ)","inline":true,"padRight":true},{"text":"ˆ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-5.png","element":"img","alt":"Lpdpo","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"min","element":"span"},{"style":{"height":19.2},"width":187.11,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-6.png","element":"img","alt":"�1 − ˆπθ(ywi","inline":true,"padRight":true},{"text":")","element":"span"},{"style":{"height":19.2},"width":143.75,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-7.png","element":"img","alt":", ˆπθ(yℓi)�","inline":true},{"style":{"fontStyle":"italic"},"text":", with ","element":"span"},{"text":"log ˆ","element":"span"},{"style":{"height":21.37},"width":214.95,"height":53.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-8.png","element":"img","alt":"πθ(yℓi) = − 1β","inline":true,"padRight":true},{"text":"log (","element":"span"},{"style":{"height":6.8},"width":65.5,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-9.png","element":"img","alt":"α −","inline":true,"padRight":true},{"text":"1) + log ˆ","element":"span"},{"style":{"height":16.15},"width":100.21,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-10.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") + log ","element":"span"},{"style":{"height":26.04},"width":133.96,"height":65.11,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-11.png","element":"img","alt":" πref(yℓi )πref(ywi 
).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"We differentiate ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-12.png","element":"img","alt":" Lpdpo","inline":true,"padRight":true},{"text":"with respect to ","element":"span"},{"style":{"height":16},"width":330.06,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-13.png","element":"img","alt":" ψℓ = πθ(yℓ)/πref(yℓ","inline":true},{"text":") with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"implicit, obtaining,","element":"span"}],[{"style":{"width":"78%"},"width":1471,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-14.png","element":"img"}],[{"text":"which is zero when,","element":"span"}],[{"style":{"width":"80%"},"width":1516,"height":383,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-15.png","element":"img"}],[{"text":"By the second-order condition, the critical point is a minimum. The objective ","element":"span"},{"style":{"height":15.99},"width":97.99,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-16.png","element":"img","alt":" Lpdpo","inline":true,"padRight":true},{"text":"is the sum of two components: the negative log sigmoid term for ","element":"span"},{"style":{"height":13.59},"width":38.48,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-17.png","element":"img","alt":" Li","inline":true,"padRight":true},{"text":"and the negative log probability for ","element":"span"},{"text":"ˆ","element":"span"},{"text":"Ω. 
Because each component is a convex function of ","element":"span"},{"style":{"height":15.99},"width":250.07,"height":39.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-18.png","element":"img","alt":" ψi, so is Lpdpo","inline":true},{"text":". As a result, the local minimum log ˆ","element":"span"},{"style":{"height":16},"width":85.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-19.png","element":"img","alt":"πθ(yℓ","inline":true},{"text":") is also a global minimum.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.9 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-93","style":{"fontWeight":"bold"},"text":"6","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-93","text":"6 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any fixed ","element":"span"},{"text":"ˆ","element":"span"},{"style":{"height":16.15},"width":100.52,"height":40.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-20.png","element":"img","alt":"πθ(ywi","inline":true,"padRight":true},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":14.4},"width":112.51,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-21.png","element":"img","alt":" β > 0,","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"the ","element":"span"},{"text":"argmin ","element":"span"},{"style":{"fontStyle":"italic"},"text":"of the distilled DPO ","element":"span"},{"style":{"fontStyle":"italic"},"text":"objective in ","element":"span"},{"text":"(","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"text":"min(1","element":"span"},{"style":{"height":15.75},"width":391.12,"height":39.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-22.png","element":"img","alt":"− ˆπθ(ywi ), ˆπθ(yℓi)), with","inline":true,"padRight":true},{"text":"log ˆ","element":"span"},{"style":{"height":21.37},"width":584.54,"height":53.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-23.png","element":"img","alt":"πθ(yℓi) = 1β (rt(x, yℓi)−rt(x, ywi ))+","inline":true},{"text":"log ˆ","element":"span"},{"style":{"height":15.75},"width":153,"height":39.38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-24.png","element":"img","alt":"πθ(ywi )+","inline":true},{"text":"log ","element":"span"},{"style":{"height":26.05},"width":134.96,"height":65.12,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-25.png","element":"img","alt":" πref(yℓi )πref(ywi 
).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. ","element":"span"},{"text":"This follows directly from differentiating (","element":"span"},{"href":"#id-60","text":"7","element":"a"},{"text":") with respect to ","element":"span"},{"style":{"height":16},"width":120.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-26.png","element":"img","alt":" πθ(y2).","inline":true}],[{"style":{"fontWeight":"bold"},"text":"A.10 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proof of Proposition ","element":"span"},{"href":"#id-94","style":{"fontWeight":"bold"},"text":"7","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Proposition ","element":"span"},{"text":"(Proposition ","element":"span"},{"href":"#id-94","text":"7 ","element":"a"},{"text":"restated)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, i ","element":"span"},{"text":"+ 1) : ","element":"span"},{"style":{"height":16},"width":269.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-27.png","element":"img","alt":" i ∈ 1, 2, . . . , n}","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n > ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D̄ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the dataset arising from the transitive closure of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Assume ","element":"span"},{"style":{"height":9.19},"width":61.33,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-28.png","element":"img","alt":" πref","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is indifferent to all ","element":"span"},{"text":"(","element":"span"},{"style":{"height":12.39},"width":83.06,"height":30.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-29.png","element":"img","alt":"yi, yj","inline":true},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":19.88},"width":131.33,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-30.png","element":"img","alt":" ψ(D)∞ =","inline":true,"padRight":true},{"text":"max","element":"span"},{"style":{"height":21.12},"width":139.72,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-31.png","element":"img","alt":"i ψ(D)i −","inline":true,"padRight":true},{"text":"min","element":"span"},{"style":{"height":21.12},"width":97.42,"height":52.8,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-32.png","element":"img","alt":"i ψ(D)i","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":". Then ","element":"span"},{"style":{"height":20.68},"width":212.37,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-33.png","element":"img","alt":" ψ(D)∞ = (n −","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":22.17},"width":416.63,"height":55.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-34.png","element":"img","alt":"τ^{−1} > ψ(D̄)∞ = 2 ((n−1)/n) τ^{−1}.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", the IPO objective can be minimized at zero, so that ","element":"span"},{"style":{"height":20.68},"width":234.93,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-35.png","element":"img","alt":" ψ(D)∞ = (n −","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":13.38},"width":62.84,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-36.png","element":"img","alt":"τ^{−1}","inline":true},{"text":". ","element":"span"},{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D̄","element":"span"},{"text":", each adjacent pair of completions is separated by ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-37.png","element":"img","alt":" γ","inline":true},{"text":", and the objective is ","element":"span"},{"style":{"height":20.4},"width":424.76,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-38.png","element":"img","alt":"∑_{i=1}^{n−1} (n − i)(iγ − τ^{−1})^2.","inline":true,"padRight":true},{"text":"The minimum is ","element":"span"},{"style":{"height":24.43},"width":549.19,"height":61.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-39.png","element":"img","alt":"γ = [n(n+1)(n−1)/6] / [n^2(n+1)(n−1)/12] τ^{−1} = (2/n) τ^{−1}","inline":true},{"text":", so that ","element":"span"},{"style":{"height":20.68},"width":212.36,"height":51.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-40.png","element":"img","alt":" ψ(D̄)∞ = (n −","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":19.37},"width":363.08,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-41.png","element":"img","alt":"γ = 2 ((n−1)/n) τ^{−1} < (n 
−","inline":true,"padRight":true},{"text":"1)","element":"span"},{"style":{"height":19.88},"width":195.36,"height":49.7,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/22-42.png","element":"img","alt":"τ^{−1} = ψ(D)∞","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n > ","element":"span"},{"text":"2.","element":"span"}]]},{"heading":"B Additional results","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"B.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The effect of distillation and pessimism on likelihood collapse","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-95","text":"B.1 ","element":"a"},{"text":"tests the effect of pessimism in both DPO (DPO vs p-DPO) and distilled DPO (d-DPO vs dp-DPO) on likelihood collapse in the preferred vs. dispreferred generations in the offline data. Note that a subtle point arises from the distinction between regularizing to the reference policy output distribution versus regularizing to the preference data distribution, which are only (asymptotically) identical if preferences are annotated on","element":"span"}],[{"id":"id-95","style":{"width":"102%"},"width":1928,"height":896,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-0.png","element":"img"}],[{"text":"Figure B.1: Pessimism mitigates likelihood collapse. In this case, we penalize the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"sample ","element":"figcaption","subtype":"caption"},{"text":"KL-divergence rather than the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"data ","element":"figcaption","subtype":"caption"},{"text":"KL-divergence, because the samples are not drawn from the preference data distribution. 
Thus the effect of regularization is primarily on ","element":"figcaption","subtype":"caption"},{"text":"kl_loss_ref","element":"figcaption","subtype":"caption"},{"text":", which quantifies this divergence, rather than on the likelihood of the preference data itself, ","element":"figcaption","subtype":"caption"},{"text":"(dis)preferred_log_likelihood","element":"figcaption","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"a sample from the reference policy (in our initial experiments we found slightly better results overall when regularizing to the reference policy output distribution). Figure ","element":"span"},{"href":"#id-95","text":"B.1 ","element":"a"},{"text":"reports results with respect to both distributions, showing that pessimism (a) mitigates the decrease in probability of preferred and dispreferred preference annotations, despite this data not being used in the regularizers (left-most subplots), and (b) mitigates the increase in KL divergence with respect to the reference distribution (third subplot from left), as expected due to the additional regularization term. As argued in §","element":"span"},{"text":"4 ","element":"span"},{"text":"(see also ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"Azar et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"2024","element":"a"},{"text":")), the ","element":"span"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"hyperparameter of DPO does not effectively regularize this KL divergence because the implicit DPO reward model assigns infinite-magnitude rewards.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Transitive closure","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-87","text":"B.2 ","element":"a"},{"text":"shows the results of our multi-arm bandit experiments with p-DPO vs. IPO losses, as described in §","element":"span"},{"href":"#id-85","text":"7.2","element":"a"},{"text":". In this synthetic setup, we solve the p-DPO and IPO objectives for both ","element":"span"},{"style":{"height":16},"width":464.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-2.png","element":"img","alt":" D = {(y1, y2), (y2, y3)} and","inline":true},{"style":{"height":16},"width":324.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-3.png","element":"img","alt":"D̄ = D ∪ {(y1, y3)}","inline":true},{"text":", solving with respect to ","element":"span"},{"style":{"height":16},"width":155.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-4.png","element":"img","alt":" {πθ(yi)}.","inline":true}],[{"id":"id-80","style":{"fontWeight":"bold"},"text":"B.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Distribution over reward models for e-DPO","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-96","text":"B.3 ","element":"a"},{"text":"investigates the reason for the success of e-DPO, especially when 
","element":"span"},{"style":{"height":12},"width":84.76,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-5.png","element":"img","alt":" ρ < .","inline":true},{"text":"5. For every length bias, we track during training, for all training examples, which reward model, ","element":"span"},{"style":{"height":11.59},"width":57.98,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-6.png","element":"img","alt":" rρ,b","inline":true},{"text":", best matches the implicit reward of the current e-DPO policy, and plot the distribution over reward models. The policy matches different reward models in different examples. Moreover, there is an inverse correlation between the data bias for policy training (","element":"span"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-7.png","element":"img","alt":"ρ","inline":true},{"text":") and the data bias for training the reward models (","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":"). This suggests that the ensemble in e-DPO helps because the policy distills from reward models that do not share the data bias of the policy training set.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"B.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Hyperparameters","element":"span"}],[{"text":"Validation set performance across the range of hyperparameter settings is shown in Figure ","element":"span"},{"href":"#id-79","text":"B.4","element":"a"},{"text":". Figure ","element":"span"},{"href":"#id-76","text":"B.5 ","element":"a"},{"text":"also explores the impact of ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/23-8.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"(which is minimal). 
In pilot studies, we found that these results were relatively robust to variation in the random seed, but we did not conduct an extensive investigation of this effect across all methods and hyperparameters due to cost.","element":"span"}],[{"id":"id-87","style":{"width":"99%"},"width":1872,"height":1437,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/24-0.png","element":"img"}],[{"text":"Figure B.2: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Effect of transitive closure on p-DPO and IPO solutions to preference learning in a multi-arm bandit","element":"figcaption","subtype":"caption"},{"text":". Each column shows the learned policy probability for a given arm, based on the preferences ","element":"figcaption","subtype":"caption"},{"style":{"height":12},"width":216.82,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/24-1.png","element":"img","alt":" y1 ≺ y2 ≺ y3","inline":true},{"text":". The top row shows that in p-DPO, the probabilities are not materially affected by the transitive closure ","element":"figcaption","subtype":"caption"},{"style":{"height":12},"width":126.09,"height":30,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/24-2.png","element":"img","alt":" y1 ≺ y3","inline":true},{"text":". The bottom row shows that in IPO, transitive closure causes the probabilities to be compressed. 
In each subfigure, we sweep a range of effective values of ","element":"figcaption","subtype":"caption"},{"style":{"height":13.38},"width":62.84,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/24-3.png","element":"img","alt":" τ^{−1}","inline":true},{"text":", shown on the x-axis.","element":"figcaption","subtype":"caption"}]]},{"heading":"C Compute resources","paragraphs":[[{"text":"We train policies on 32 TPU v3 chips and reward models on 16 TPU v3 chips. 
We obtain roughly 0.1 steps per second when training, for both the policy and reward models.","element":"span"}],[{"id":"id-96","style":{"width":"99%"},"width":1872,"height":904,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-0.png","element":"img"}],[{"text":"Figure B.3: For all training examples, we track which reward model during training best matches the implicit reward of the current e-DPO policy and plot the distribution over reward models, for every length bias, ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":102.55,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-1.png","element":"img","alt":" ρ. We","inline":true,"padRight":true},{"text":"observe that the e-DPO policy matches different reward models across examples during training. Moreover, when the policy is trained with data biased towards preferring short responses, the reward model that was trained on longer responses is by and large preferred and vice versa.","element":"figcaption","subtype":"caption"}],[{"id":"id-79","style":{"width":"99%"},"width":1872,"height":1386,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-2.png","element":"img"}],[{"text":"Figure B.4: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Validation set results ","element":"figcaption","subtype":"caption"},{"text":"across hyperparameters for each method. 
For all methods, different values of ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":21,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-3.png","element":"img","alt":" ρ","inline":true,"padRight":true},{"text":"induce different optimal hyperparameters ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-4.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":13.38},"width":62.84,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/25-5.png","element":"img","alt":" τ^{−1}","inline":true},{"text":".","element":"figcaption","subtype":"caption"}],[{"id":"id-76","style":{"width":"99%"},"width":1872,"height":1671,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/26-0.png","element":"img"}],[{"text":"Figure B.5: Comparing e-DPO with and without annealing of ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/26-1.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"on the development set for all length biases using the best value of ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":23,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/26-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"per bias. 
Annealing ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.19316/images/26-3.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"has minimal effect on performance and is not necessary for the success of e-DPO.","element":"figcaption","subtype":"caption"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]