36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"44867","publisher":"icml","paperJSON":{"title":"ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization","paperID":"44867","avgLineHeight":11.94,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"},{"text":"The code is publicly accessible at ","element":"span"},{"href":"https://github.com/hee-suk-yoon/ConfPO","text":"https://github.com/hee-suk-yoon/ConfPO","element":"a"},{"text":".","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Aligning Large Language Models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF) (","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"Yoon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"2024a","element":"a"},{"text":"), which finetunes a policy using a learned reward model. While effective, this approach introduces substantial computational overhead and is sensitive to reward model misspecification.","element":"span"}],[{"style":{"width":"96%"},"width":910,"height":376,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/0-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 1. 
","element":"figcaption","subtype":"caption"},{"id":"id-15","text":"Highlighted tokens—green for the chosen, red for the ","element":"figcaption","subtype":"caption"},{"text":"rejected—are those regarded low-confidence by the policy model, meaning their predicted probability falls below the sequence average. This strategy surfaces tokens that carry high informational value or mark key decision points, while skipping predictable continuations. For example, in “22” (tokenized as ‘2’, ‘2’), the first digit may be highlighted as uncertain, while the second, being a likely follow-up, is not. The incorrect answer “1” in the rejected response is another such low-confidence token, reflecting both its unpredictability and its role in distinguishing preference.","element":"figcaption","subtype":"caption"}],[{"text":"Direct Alignment Algorithms (DAAs) (","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"Xu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"Ethayarajh et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Hong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"Park et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":") offer a more efficient alternative by using the policy’s own log-probabilities as an implicit reward signal. Among these, Direct Preference Optimization (DPO) (","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"2024b","element":"a"},{"text":") has gained particular attention for optimizing the log-probability gap between chosen and rejected responses. SimPO (","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":"), a recent variant of DPO, improves this framework by removing the need for a reference model and mitigating length bias, resulting in stronger alignment with human preferences. Yet, these methods assume all tokens contribute equally to preference, applying uniform optimization across the entire sequence without considering the varying informativeness of individual tokens.","element":"span"}],[{"text":"A growing body of linguistic and cognitive research shows that words are not equally informative—some contribute little beyond confirming expectations, whereas others compel readers to revise their interpretation. 
Psycholinguistic ","element":"span"},{"style":{"fontStyle":"italic"},"text":"surprisal theory ","element":"span"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":14,"text":"Hale","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":14,"text":"2001","element":"a"},{"text":"; ","element":"span"},{"href":"#id-11","referenceIndex":16,"text":"Jaeger & Levy","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":16,"text":"2006","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":34,"text":"Smith & ","element":"a"},{"href":"#id-12","referenceIndex":34,"text":"Levy","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":34,"text":"2013","element":"a"},{"text":") formalizes this insight by linking a word’s informational value to its contextual predictability: the lower the conditional probability ","element":"span"},{"style":{"height":16},"width":227,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/0-1.png","element":"img","alt":" P(wi|context)","inline":true},{"text":", the higher the","element":"span"}],[{"id":"id-16","style":{"width":"99%"},"width":1944,"height":484,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/1-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Overview of the Confidence-guided Preference Optimization (ConfPO). ","element":"figcaption","subtype":"caption"},{"text":"Existing DAA algorithm (SimPO (","element":"figcaption","subtype":"caption"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a","subtype":"caption"},{"text":", ","element":"figcaption","subtype":"caption"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a","subtype":"caption"},{"text":")) uses all the tokens uniformly for preference alignment. 
ConfPO selectively optimizes on a few tokens based on the policy model’s confidence, with no additional compute overhead.","element":"figcaption","subtype":"caption"}],[{"text":"surprisal and the greater the information conveyed. Modern LLMs embody this principle by explicitly estimating ","element":"span"},{"style":{"height":16},"width":227,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/1-1.png","element":"img","alt":"P(wi|context)","inline":true,"padRight":true},{"text":"for every token. Highly predictable tokens (high probability) add little new information, whereas low-probability tokens are information-rich and often determine how the rest of the sentence is interpreted (","element":"span"},{"href":"#id-13","referenceIndex":26,"text":"Merkx & Frank","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":26,"text":"2021","element":"a"},{"text":"; ","element":"span"},{"href":"#id-14","referenceIndex":33,"text":"Slaats et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":33,"text":"2024","element":"a"},{"text":"). In essence, a small subset of low-confidence (high-surprisal) tokens carries most of the sentence-level information flow. We illustrate this idea in Figure ","element":"span"},{"href":"#id-15","text":"1","element":"a"},{"text":", which shows that tokens with below-average conditional probability tend to be the unexpected, information-bearing tokens, whereas the remaining tokens are generally highly predictable continuations.","element":"span"}],[{"text":"Motivated by these insights, we ask whether preference-relevant learning signals are better captured by selectively optimizing those low-confidence tokens. We provide empirical and theoretical analyses of gradient behavior in Direct Alignment Algorithms and show that a policy model’s own confidence scores form a simple yet reliable proxy for identifying preference-critical tokens. 
Building on these findings, we introduce ","element":"span"},{"style":{"fontWeight":"bold"},"text":"ConfPO ","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"Conf","element":"span"},{"text":"idence-guided ","element":"span"},{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"text":"reference ","element":"span"},{"style":{"fontWeight":"bold"},"text":"O","element":"span"},{"text":"ptimization), a targeted extension of SimPO (","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":") that focuses optimization on the most informative tokens without auxiliary models or extra inference passes.","element":"span"}],[{"text":"As illustrated in Figure ","element":"span"},{"href":"#id-16","text":"2","element":"a"},{"text":", ConfPO dynamically filters out low-impact tokens, focusing updates only on those that meaningfully drive alignment. This targeted optimization strategy not only enhances performance on benchmarks such as AlpacaEval 2 and Arena-Hard, but also mitigates overoptimization by utilizing the KL budget more efficiently.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"In summary, our key contributions are:","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Efficient Selective-Token Alignment. ","element":"span"},{"text":"We propose a method that uses the policy model’s confidence scores to identify tokens most critical for preference learning. Guided by empirical and theoretical gradient analyses of DAAs, this approach enables selective optimization without additional computational overhead (Section ","element":"span"},{"text":"4","element":"span"},{"text":").","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Improved Alignment Performance. 
","element":"span"},{"text":"Across multiple LLMs and alignment benchmarks, ConfPO outperforms DAAs that uniformly train on all tokens (Table ","element":"span"},{"href":"#id-17","text":"1","element":"a"},{"text":").","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Overoptimization Mitigation. ","element":"span"},{"text":"By discarding tokens that contribute little to alignment, ConfPO makes more efficient use of the KL budget, reducing overoptimization (reward hacking) effects (Figure ","element":"span"},{"href":"#id-18","text":"7","element":"a"},{"text":").","element":"span"}]]},{"heading":"2. Related Works","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Reinforcement Learning from Human Feedback (RLHF) ","element":"span"},{"text":"Ensuring that models are helpful, safe, and factually accurate is a central goal of modern AI research (","element":"span"},{"href":"#id-19","referenceIndex":42,"text":"Yoon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":42,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":40,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":41,"text":"2024b","element":"a"},{"text":";","element":"span"},{"href":"#id-22","referenceIndex":43,"text":"c","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":38,"text":"Yoon et al.","element":"a"},{"text":"). In particular, RLHF has become a widely used approach for aligning language models with human preferences. 
Early methods in this area often relied on online reinforcement learning techniques, such as PPO with a learned reward model, to adjust model behavior without requiring human input at every step (","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"2022","element":"a"},{"text":"). However, these approaches are highly sensitive to hyperparameters and can be inefficient due to the need for extensive trajectory sampling. To address these challenges, Direct Alignment Algorithms (DAAs) have been developed as an alternative (","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"Rafailov ","element":"a"},{"href":"#id-3","referenceIndex":31,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"2024b","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Azar et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"Xu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"Ethayarajh ","element":"a"},{"href":"#id-6","referenceIndex":11,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Hong et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"Park et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":"). DAAs eliminate the need for multi-stage training, making the alignment process more efficient.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Token-level DAAs ","element":"span"},{"text":"Sequence-level RLHF assigns a reward signal to the entire sequence, treating all tokens as contributing equally. However, this approach makes it challenging to obtain fine-grained learning signals. To address this limitation, recent efforts have focused on token-level reward signals for RLHF, particularly in DAAs. FIGA (","element":"span"},{"href":"#id-24","referenceIndex":13,"text":"Guo ","element":"a"},{"href":"#id-24","referenceIndex":13,"text":"et al.","element":"a"},{"text":") utilizes an external language model to generate data, which is then used for imitation learning. TDPO (","element":"span"},{"href":"#id-25","referenceIndex":45,"text":"Zeng et al.","element":"a"},{"text":") enhances alignment and diversity by leveraging forward ","element":"span"},{"text":"KL divergence, while SePO (","element":"span"},{"href":"#id-26","referenceIndex":37,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":37,"text":"2024","element":"a"},{"text":") improves weak-to-strong generalization by pre-selecting tokens for preference learning using a weak model. 
T-REG (","element":"span"},{"href":"#id-27","referenceIndex":48,"text":"Zhou ","element":"a"},{"href":"#id-27","referenceIndex":48,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":48,"text":"2024","element":"a"},{"text":"), in contrast, does not rely on external models but instead obtains token-level rewards by multiple forward passes through prompt augmentation. Despite these advancements, existing methods often introduce additional computational complexity, either through external models, extra training, or synthetic data generation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"In this paper, we propose an approach that enables token-level DAA while maintaining the computational efficiency of existing DAAs.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Overoptimization in RLHF ","element":"span"},{"text":"Overoptimization in reward models has been identified as a critical issue in RLHF. ","element":"span"},{"href":"#id-28","referenceIndex":12,"text":"Gao ","element":"a"},{"href":"#id-28","referenceIndex":12,"text":"et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-28","referenceIndex":12,"text":"2023","element":"a"},{"text":") highlighted that reward models in RLHF can suffer from overoptimization, leading to suboptimal alignment. Later, similar concerns were observed in DAAs, where ","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"Rafailov et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-29","referenceIndex":30,"text":"2024a","element":"a"},{"text":") demonstrated that overoptimization can occur even without explicitly training a reward model. To mitigate this issue, ","element":"span"},{"href":"#id-30","referenceIndex":22,"text":"Liu et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-30","referenceIndex":22,"text":"2024","element":"a"},{"text":") proposed using a supervised loss as a regularizer, suggesting that it can help alleviate overoptimization. Additionally, ","element":"span"},{"href":"#id-31","referenceIndex":3,"text":"Anonymous ","element":"a"},{"text":"(","element":"span"},{"href":"#id-31","referenceIndex":3,"text":"2025","element":"a"},{"text":") introduced an approach based on importance sampling, demonstrating its potential to mitigate the overoptimization problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"We show in this paper that our proposed ConfPO tends to mitigate overoptimization due to the efficient use of the KL budget.","element":"span"}]]},{"heading":"3. Background and Problem Formulation","paragraphs":[[{"text":"In this section, we provide an overview of preference learning for aligning large language models (LLMs) with human feedback and Direct Alignment Algorithms (DAAs).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.1. 
RLHF: Preference Learning for Language Models","element":"span"}],[{"text":"Reinforcement ","element":"span"},{"text":"learning ","element":"span"},{"text":"from ","element":"span"},{"text":"human ","element":"span"},{"text":"feedback (RLHF) (","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"Christiano et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":7,"text":"2017","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":27,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"Yoon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"2024a","element":"a"},{"text":") is a commonly adopted approach to aligning large language models (LLMs) with human preferences. By leveraging annotations from humans (or AI systems), RLHF encourages LLMs to produce outputs that more closely match user expectations. 
Formally, let ","element":"span"},{"style":{"height":18.19},"width":441,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-0.png","element":"img","alt":"D = {x(i), yw,(i), yl,(i)}Ni=1 ","inline":true,"padRight":true},{"text":"be a preference dataset where ","element":"span"},{"text":"each entry contains a prompt ","element":"span"},{"style":{"height":14.4},"width":55.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-1.png","element":"img","alt":" x(i)","inline":true},{"text":"alongside two candidate responses: a preferred response ","element":"span"},{"style":{"height":17.6},"width":86.48,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-2.png","element":"img","alt":" yw,(i)","inline":true},{"text":"and an unpreferred response ","element":"span"},{"style":{"height":17.6},"width":73,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-3.png","element":"img","alt":" yl,(i)","inline":true},{"text":". The core of RLHF lies in a reward model ","element":"span"},{"style":{"height":16},"width":125.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-4.png","element":"img","alt":"r∗(x, y)","inline":true},{"text":", which assigns a scalar indicating how well a response ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"aligns with human preferences for a given prompt ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":". 
A widely used model for pairwise preferences is the Bradley-Terry (BT) framework (","element":"span"},{"href":"#id-32","referenceIndex":5,"text":"Bradley & Terry","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":5,"text":"1952","element":"a"},{"text":"), which specifies the probability that ","element":"span"},{"style":{"height":14.21},"width":42,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-5.png","element":"img","alt":" yw","inline":true},{"text":"is preferred over ","element":"span"},{"style":{"height":17.01},"width":28,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-6.png","element":"img","alt":" yl","inline":true,"padRight":true},{"text":"as:","element":"span"}],[{"id":"id-33","style":{"width":"88%"},"width":832,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-7.png","element":"img"}],[{"text":"RLHF proceeds in two stages: training the reward model and then using it to guide a policy model. In the first stage, the reward model ","element":"span"},{"style":{"height":16},"width":107,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-8.png","element":"img","alt":" ˆr(x, y)","inline":true,"padRight":true},{"text":"is obtained via maximum likelihood estimation on the preference data. In the second stage, ","element":"span"},{"style":{"height":16},"width":107,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-9.png","element":"img","alt":"ˆr(x, y)","inline":true,"padRight":true},{"text":"provides a feedback signal that drives optimization of the language model’s policy ","element":"span"},{"style":{"height":9.6},"width":37,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-10.png","element":"img","alt":" πθ","inline":true},{"text":". 
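","element":"span"}],[{"text":"The Bradley-Terry model of Eq. 1 and the first-stage maximum likelihood fit can be illustrated with a minimal sketch. It assumes the standard logistic form of the BT model, in which the preference probability is the sigmoid of the reward difference; the function names are hypothetical and the scalar rewards are placeholders standing in for the outputs of a reward model.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_preference_prob(reward_w, reward_l):
    # Bradley-Terry: probability that y_w is preferred over y_l,
    # computed as sigmoid(r(x, y_w) - r(x, y_l)).
    return sigmoid(reward_w - reward_l)


def reward_model_nll(reward_pairs):
    # Stage one of RLHF: mean negative log-likelihood of the observed
    # preferences. Each pair holds (reward of chosen, reward of rejected).
    total = 0.0
    for r_w, r_l in reward_pairs:
        total += -math.log(bt_preference_prob(r_w, r_l))
    return total / len(reward_pairs)
```

Minimizing reward_model_nll over the parameters of the reward model corresponds to the maximum likelihood estimation described above; equal rewards give a preference probability of 0.5 and a loss of log 2.","element":"span"}],[{"text":"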
Let ","element":"span"},{"style":{"height":13.6},"width":47,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-11.png","element":"img","alt":" Dx","inline":true,"padRight":true},{"text":"represent the distribution of prompts contained within ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". The objective can be written as:","element":"span"}],[{"style":{"width":"95%"},"width":900,"height":82,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":14.59},"width":22.48,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-13.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is a coefficient controlling the penalty for deviating from a reference policy ","element":"span"},{"style":{"height":9.39},"width":70.52,"height":23.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-14.png","element":"img","alt":" πref.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"3.2. Direct Alignment Algorithms","element":"span"}],[{"text":"Direct Alignment Algorithms (DAAs) offer an alternative to traditional RLHF by removing the need for explicit reward model training. 
Instead, these methods directly optimize the policy model using preference data, leveraging implicit signals from the model’s own probability distributions.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Direct Preference Optimization (DPO) ","element":"span"},{"text":"DPO (","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"Rafailov ","element":"a"},{"href":"#id-3","referenceIndex":31,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"2024b","element":"a"},{"text":") reparameterizes the reward function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"using a closed-form expression as:","element":"span"}],[{"id":"id-34","style":{"width":"95%"},"width":892,"height":218,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":9.6},"width":36.48,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-16.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"is the policy model, ","element":"span"},{"style":{"height":9.6},"width":55,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-17.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"is the reference policy, typically the supervised fine-tuned (SFT) model, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"is the partition function. This reward formulation is incorporated into the BT ranking objective (Eq. 
","element":"span"},{"href":"#id-33","text":"1","element":"a"},{"text":"), allowing DPO to express the probability of preference data directly with the policy model, rather than a separate reward model, yielding the following objective:","element":"span"}],[{"id":"id-40","style":{"width":"99%"},"width":934,"height":231,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Simple Preference Optimization (SimPO) ","element":"span"},{"text":"The reward function in Eq. ","element":"span"},{"href":"#id-34","text":"3 ","element":"a"},{"text":"presents several limitations. First, it relies on a reference model ","element":"span"},{"style":{"height":9.6},"width":54.48,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-19.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"during training, which is not required during inference. This creates an inconsistency between the training objective and the generation strategy at inference time. Second, it employs the summed log probabilities of tokens as the reward, which introduces length bias—longer sequences tend to have lower log probabilities. 
As a result, the model may overestimate probabilities for longer sequences to ensure that ","element":"span"},{"style":{"height":14.21},"width":42,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-20.png","element":"img","alt":" yw ","inline":true,"padRight":true},{"text":"is assigned a higher reward than ","element":"span"},{"style":{"height":16.8},"width":40,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/2-21.png","element":"img","alt":" yl.","inline":true}],[{"text":"To mitigate these issues, SimPO (","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":") redefines the reward function as follows:","element":"span"}],[{"id":"id-45","style":{"width":"95%"},"width":900,"height":210,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-0.png","element":"img"}],[{"text":"Using this implicit reward function, the objective for SimPO becomes the following:","element":"span"}],[{"id":"id-41","style":{"width":"99%"},"width":934,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":21.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-2.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"is a target margin introduced by ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":") to ensure that the reward for the preferred response ","element":"span"},{"style":{"height":16},"width":214.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-3.png","element":"img","alt":"rSimPO(x, yw)","inline":true,"padRight":true},{"text":"exceeds the reward for the unpreferred response ","element":"span"},{"style":{"height":17.6},"width":201,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-4.png","element":"img","alt":" rSimPO(x, yl)","inline":true,"padRight":true},{"text":"by at least ","element":"span"},{"style":{"height":10.8},"width":21.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-5.png","element":"img","alt":" γ","inline":true},{"text":". SimPO demonstrates significant improvements on various representative benchmarks (i.e., AlpacaEval 2 (","element":"span"},{"href":"#id-35","referenceIndex":20,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":20,"text":"2023","element":"a"},{"text":") and Arena-Hard (","element":"span"},{"href":"#id-36","referenceIndex":19,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":19,"text":"2024","element":"a"},{"text":")) compared to DPO, whereas many other variants of DPO (","element":"span"},{"href":"#id-37","referenceIndex":44,"text":"Yuan et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":44,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-38","referenceIndex":46,"text":"Zhao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":46,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"Azar ","element":"a"},{"href":"#id-4","referenceIndex":4,"text":"et 
al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"Xu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":36,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"Ethayarajh et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":11,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"Hong ","element":"a"},{"href":"#id-7","referenceIndex":15,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":15,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"Park et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":29,"text":"2024","element":"a"},{"text":") fail to show a consistent performance advantage over the standard DPO. ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Thus, our method is built upon SimPO; however, in Section ","element":"span"},{"href":"#id-39","style":{"fontStyle":"italic","fontWeight":"bold"},"text":"7.2","element":"a"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":", we also demonstrate that our findings generalize to DPO.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3. Problem Formulation: Token Selection for Preference Learning","element":"span"}],[{"text":"Although Direct Alignment Algorithms (DAAs), such as DPO (Eq. ","element":"span"},{"href":"#id-40","text":"4","element":"a"},{"text":") and SimPO (Eq. ","element":"span"},{"href":"#id-41","text":"6","element":"a"},{"text":"), have demonstrated strong performance in aligning LLMs with human preferences, they are typically optimized over all tokens in the training dataset. 
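","element":"span"}],[{"text":"As a point of reference for the uniform, all-token objectives cited above, the SimPO reward (Eq. 5) and loss (Eq. 6) can be sketched in plain Python. This is an illustrative sketch rather than the authors' implementation: the per-token log-probabilities are assumed to be supplied by the policy model, the function names are hypothetical, and the default values of beta and gamma are placeholders.

```python
import math


def simpo_reward(token_logps, beta):
    # Length-normalized implicit reward: beta times the average
    # per-token log-probability of the response under the policy.
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=0.5):
    # Negative log-sigmoid of the reward margin minus the target margin gamma.
    # beta and gamma defaults are placeholders, not values from the paper.
    margin = (
        simpo_reward(chosen_logps, beta)
        - simpo_reward(rejected_logps, beta)
        - gamma
    )
    # -log(sigmoid(margin)) written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Every token of both responses enters the average with equal weight, which is the uniform treatment that the token selection formulation in this section revisits.","element":"span"}],[{"text":"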
However, prior studies (","element":"span"},{"href":"#id-42","referenceIndex":21,"text":"Lin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":21,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-43","referenceIndex":6,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":6,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":18,"text":"Lai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":18,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"Yoon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"2024a","element":"a"},{"text":") have shown that not all tokens contribute equally to preference alignment. Uniformly optimizing over all tokens can introduce noise into the training process.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Selective Token Reward Formulation ","element":"span"},{"text":"To address these challenges, we propose improving DAAs by optimizing only on effective tokens for preference learning. Specifically, we aim to identify and leverage a subset of tokens that are critical to aligning with human preferences, without incurring additional computational costs compared to standard algorithms like SimPO (","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"text":"We extend the implicit reward formulation of SimPO (Eq. ","element":"span"},{"href":"#id-45","text":"5","element":"a"},{"text":") by incorporating a selective token scoring function. 
The reward for a response ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"is redefined as the joint log probability of critical (i.e., selected) tokens, expressed as:","element":"span"}],[{"id":"id-56","style":{"width":"97%"},"width":914,"height":108,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-7.png","element":"img","alt":" s(yi)","inline":true,"padRight":true},{"text":"is a selection function that determines whether to include the token ","element":"span"},{"style":{"height":10.59},"width":28.52,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-8.png","element":"img","alt":" yi","inline":true,"padRight":true},{"text":"for reward calculation, and ","element":"span"},{"style":{"height":16},"width":50,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-9.png","element":"img","alt":" |ys|","inline":true,"padRight":true},{"text":"denotes the total number of selected tokens. Incorporating this selective token-based reward formulation, we refine the preference optimization objective of SimPO (Eq. ","element":"span"},{"href":"#id-41","text":"6","element":"a"},{"text":") as follows:","element":"span"}],[{"style":{"width":"95%"},"width":896,"height":310,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-10.png","element":"img"}],[{"text":"While previous works have proposed token selection strategies, these approaches often involve significant computational overhead. 
Examples include performing multiple forward passes with augmented prompts (","element":"span"},{"href":"#id-27","referenceIndex":48,"text":"Zhou et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":48,"text":"2024","element":"a"},{"text":"), leveraging annotations from powerful LLMs such as GPT-4 (","element":"span"},{"href":"#id-44","referenceIndex":18,"text":"Lai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":18,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"Yoon et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":39,"text":"2024a","element":"a"},{"text":"), or training auxiliary models to guide token selection (","element":"span"},{"href":"#id-26","referenceIndex":37,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":37,"text":"2024","element":"a"},{"text":"). These methods, while effective, are complex and expensive, limiting their practicality.","element":"span"}],[{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"Unlike prior approaches, we aim to derive ","element":"span"},{"style":{"height":16},"width":77.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-11.png","element":"img","alt":" s(yi)","inline":true,"padRight":true},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"directly from the internal signals of the model during training, ensuring that the computational cost remains identical to that of the original DAAs.","element":"span"}]]},{"heading":"4. Revisiting the Gradients of Direct Alignment Algorithms (DAAs)","paragraphs":[[{"text":"In this section, we introduce a series of observations revealing how to use the internal signals of the training policy model to select important tokens for preference learning. 
We focus on SimPO (","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":") due to its state-of-the-art performance, but we show that these observations also hold for DPO (","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":31,"text":"2024b","element":"a"},{"text":") in Section ","element":"span"},{"href":"#id-39","text":"7.2","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gradient of SimPO ","element":"span"},{"text":"As illustrated in ","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"Meng et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-9","referenceIndex":25,"text":"2024","element":"a"},{"text":"), the gradients of SimPO (Eq. ","element":"span"},{"href":"#id-41","text":"6","element":"a"},{"text":") can be written as follows:","element":"span"}],[{"style":{"width":"99%"},"width":932,"height":238,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-12.png","element":"img"}],[{"style":{"width":"100%"},"width":942,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/3-13.png","element":"img"}],[{"style":{"width":"96%"},"width":909,"height":233,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/4-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Observation 1. ","element":"figcaption","subtype":"caption"},{"id":"id-47","text":"Token-level gradient norm values (i.e., ","element":"figcaption","subtype":"caption"},{"style":{"height":14.21},"width":334.52,"height":35.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/44867/images/4-1.png","element":"img","alt":"∥∇θ log πθ(yi|x, y