36:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"45181","publisher":"icml","paperJSON":{"title":"TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization","paperID":"45181","avgLineHeight":11.96,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3c","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Reinforcement Learning from Human Feedback (RLHF) has become a crucial technique for aligning Large Language Models (LLMs) with human preferences and intentions (","element":"span"},{"href":"#id-0","referenceIndex":17,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":17,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":37,"text":"Ziegler et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":37,"text":"2020","element":"a"},{"text":"). This approach has demonstrated significant success in recent LLM advancements (","element":"span"},{"href":"#id-2","referenceIndex":16,"text":"OpenAI et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":16,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":25,"text":"Team et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":25,"text":"2024a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"Grattafiori et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":6,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":26,"text":"Team et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":26,"text":"2024b","element":"a"},{"text":"). 
In typical RLHF workflows, a reward model is first trained using human feedback, and then the Proximal Policy Optimization (PPO) algorithm (","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"2017","element":"a"},{"text":") is employed to fine-tune the policy model. Typically, in these methods, a sequence-level reward is assigned to the last token of a sequence. However, this approach faces challenges, such as the sparse reward problem (i.e., delayed feedback), which leads to instability and sample inefficiency in PPO training (","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"Choshen ","element":"a"},{"href":"#id-7","referenceIndex":3,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":3,"text":"2020","element":"a"},{"text":"). This issue is particularly pronounced in LLM training, where responses are often lengthy and generated at the token level (","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"2023","element":"a"},{"text":"). 
Recent research has suggested that leveraging dense token-level reward models (","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"Yin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"2025","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"Zhong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"2024","element":"a"},{"text":") can help alleviate these issues, improving PPO’s performance in aligning LLMs with human preferences.","element":"span"}],[{"text":"Recent developments in RLHF have centered around creating simpler and more efficient algorithms that eliminate the need for a separate reward model. A notable approach in this direction is Direct Preference Optimization (DPO) (","element":"span"},{"href":"#id-11","referenceIndex":20,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":20,"text":"2023","element":"a"},{"text":"). DPO reparameterizes the reward function in RLHF by directly using preference data to optimize the policy model, bypassing the traditionally required step of training a separate reward model. This reparameterization streamlines the alignment process, making DPO a popular algorithm for LLM alignment. 
While dense token-level reward guidance has proven beneficial for PPO (","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"Yin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"2025","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"Zhong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"2024","element":"a"},{"text":"), its extension to DPO is nontrivial, as DPO is formulated as a sequence-level bandit problem. In this context, the reward is expressed through the policy being optimized, and integrating token-level reward guidance into this framework presents a significant challenge, especially in eliminating the partition function from the loss function.","element":"span"}],[{"text":"To fill this gap, we decompose the sequence-level proximal ","element":"span"},{"text":"policy optimization into a sequence of token-level proximal policy optimization problems and modify them to incorporate token-level reward guidance. We derive a closed-form optimal token-level policy and the corresponding token-level reward for the modified problem. Based on the obtained reward, the Bradley-Terry model, and in particular a new theoretical result for eliminating the partition function, we propose a preference optimization algorithm framework with token-level reward guidance for DPO, which we refer to as TGDPO. 
Additionally, we introduce a practical token-level reward guidance based on the induced DPO reward.","element":"span"}],[{"text":"Extensive experiments are conducted on three instruction following benchmarks: AlpacaEval 2 (","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"2023","element":"a"},{"text":"), MT-Bench (","element":"span"},{"href":"#id-13","referenceIndex":35,"text":"Zheng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":35,"text":"2023","element":"a"},{"text":"), and Arena-Hard (","element":"span"},{"href":"#id-14","referenceIndex":11,"text":"Li et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":11,"text":"2024","element":"a"},{"text":"). TGDPO consistently outperforms existing preference optimization algorithms, achieving improvements of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard compared to the best baseline method. We further demonstrate and analyze the unique advantages of TGDPO. We empirically show that TGDPO achieves satisfactory policies upon loss convergence, which is not commonly observed in conventional preference optimization methods. TGDPO also enables control over convergence speed and is robust to variations in token-level rewards. These properties significantly enhance the efficiency and practicality of the algorithm. 
Our key contributions are outlined below:","element":"span"}],[{"text":"• We decompose the sequence-level PPO into a sequence of token-level proximal policy optimization problems via the upper-bounding approach and derive a closed-form optimal token-level policy for the modified problem, with which the corresponding reward can be represented along with the token-level reward guidance.","element":"span"}],[{"text":"• With the obtained reward, the Bradley-Terry model, and a new result for eliminating the partition function, we propose TGDPO, a preference optimization algorithm framework with token-level reward guidance for DPO. We further introduce a practical token-level reward guidance based on the induced DPO reward.","element":"span"}],[{"text":"• Extensive experiments demonstrate that our TGDPO improves win rates by up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard compared to the best baseline.","element":"span"}]]},{"heading":"2. Related Work","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Reinforcement Learning from Human Feedback. ","element":"span"},{"text":"Reinforcement learning from human feedback (RLHF) has been extensively applied for aligning LLMs with human preferences and values (","element":"span"},{"href":"#id-0","referenceIndex":17,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":17,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":37,"text":"Ziegler et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":37,"text":"2020","element":"a"},{"text":"). The standard RLHF pipeline typically consists of two stages: reward modeling and policy optimization through ","element":"span"},{"text":"reinforcement learning. 
Proximal Policy Optimization (PPO) with on-policy sampling (","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"2017","element":"a"},{"text":") is commonly used for this purpose. However, challenges in effective reward modeling and tuning the PPO algorithm to achieve optimal performance have motivated alternative approaches that bypass the reward modeling step and focus on directly optimizing the policy. The direct preference optimization (DPO) algorithm (","element":"span"},{"href":"#id-11","referenceIndex":20,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":20,"text":"2023","element":"a"},{"text":") is a representative one. DPO explicitly represents the reward function with the optimal policy of the proximal policy optimization problem, thereby avoiding the need for a separate reward model and fine-tuning LLMs directly with human preference. DPO has proven to be both lightweight and stable, showing strong performance in a range of applications (","element":"span"},{"href":"#id-15","referenceIndex":9,"text":"Ivison et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":9,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-16","referenceIndex":27,"text":"Tian et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":27,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-17","referenceIndex":15,"text":"Miao et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":15,"text":"2024","element":"a"},{"text":"). Several variants of DPO have since been proposed, improving its performance. 
For instance, R-DPO (","element":"span"},{"href":"#id-18","referenceIndex":19,"text":"Park et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":19,"text":"2024","element":"a"},{"text":") addresses DPO’s tendency to exploit token length, while SimPO (","element":"span"},{"href":"#id-19","referenceIndex":14,"text":"Meng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":14,"text":"2024","element":"a"},{"text":") aims to better align the objective with the decoding formula and eliminate the need for a reference model. KTO (","element":"span"},{"href":"#id-20","referenceIndex":5,"text":"Ethayarajh et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":5,"text":"2024","element":"a"},{"text":") focuses on optimizing preferences using non-pairwise data. These preference optimization techniques, however, operate at the sequence level and do not shape the reward function of DPO from the token level. In contrast, our approach aims to leverage token-level rewards to guide preference optimization and better align LLMs. A recent work TDPO (","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"Zeng et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":34,"text":"2024","element":"a"},{"text":") tries to provide a token-level understanding of DPO. It explains DPO using a token-level Markov decision process and proposes to incorporate forward KL divergence into the DPO objective. However, like DPO, TDPO still does not consider token-level reward guidance. Our TGDPO, on the other hand, explicitly incorporates token-level reward signals into the preference optimization framework.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"RLHF with Dense Token-Level Reward. ","element":"span"},{"text":"Text generation of LLMs can be modeled as a Markov decision process. 
Sequence-level PPO treats the entire sequence as an action and assigns a reward at the sequence’s end (","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"Schulman ","element":"a"},{"href":"#id-6","referenceIndex":22,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":22,"text":"2017","element":"a"},{"text":"), which results in sparse feedback at the token level. This sparsity hinders the model’s ability to differentiate between preferred and dispreferred tokens within a sequence, leading to training instability (","element":"span"},{"href":"#id-22","referenceIndex":24,"text":"Snell et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-22","referenceIndex":24,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-23","referenceIndex":30,"text":"Xia et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":30,"text":"2024","element":"a"},{"text":"). To mitigate this issue, several techniques have been developed to generate dense token-level rewards, including learning from fine-grained human feedback (","element":"span"},{"href":"#id-24","referenceIndex":29,"text":"Wu ","element":"a"},{"href":"#id-24","referenceIndex":29,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":29,"text":"2023","element":"a"},{"text":"), fine-grained AI feedback (","element":"span"},{"href":"#id-25","referenceIndex":18,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-25","referenceIndex":18,"text":"2024","element":"a"},{"text":"), and grounding preferences at the token or segment level (","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"Yin et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"2025","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"Zhong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"2024","element":"a"},{"text":"). PPO leveraging such fine-grained reward models has shown significant performance improvements. However, extending token-level guidance to DPO is a challenge, as DPO’s reward function is explicitly expressed through the policy being optimized. Incorporating token-level reward guidance into the DPO framework requires overcoming substantial difficulties, especially in eliminating the partition function from the loss function, which remains an open problem. More discussions on closely related work are presented in Appendix ","element":"span"},{"text":"C","element":"span"},{"text":".","element":"span"}]]},{"heading":"3. Preliminary","paragraphs":[[{"text":"Given a human preference dataset ","element":"span"},{"style":{"height":16},"width":403.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-0.png","element":"img","alt":" D = {(x, yw, yl)}, where","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is a prompt, ","element":"span"},{"style":{"height":10.59},"width":40.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-1.png","element":"img","alt":" yw","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.59},"width":26.48,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-2.png","element":"img","alt":" yl","inline":true,"padRight":true},{"text":"are preferred and dispreferred responses respectively, in RLHF a sequence-level reward model ","element":"span"},{"style":{"height":16.8},"width":127,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-3.png","element":"img","alt":" rϕ(x, 
y)","inline":true,"padRight":true},{"text":"is first trained on the preference dataset to assign a higher reward to the preferred response and a lower reward to the dispreferred one. With the trained reward model, sequence-level Proximal Policy Optimization (PPO) solves the following problem to fine-tune LLMs:","element":"span"}],[{"id":"id-26","style":{"width":"99%"},"width":932,"height":224,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":96.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-5.png","element":"img","alt":" DKL[·]","inline":true,"padRight":true},{"text":"is the KL-divergence of two probability distributions, ","element":"span"},{"style":{"height":9.6},"width":36.52,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-6.png","element":"img","alt":" πθ","inline":true,"padRight":true},{"text":"is the language model policy, ","element":"span"},{"style":{"height":9.6},"width":54.48,"height":24,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-7.png","element":"img","alt":" πref","inline":true,"padRight":true},{"text":"is the reference policy, and the positive parameter ","element":"span"},{"style":{"height":14.61},"width":22.48,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-8.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"controls the deviation of ","element":"span"},{"style":{"height":13.6},"width":189.52,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-9.png","element":"img","alt":" πθ from πref","inline":true},{"text":". Equation (","element":"span"},{"href":"#id-26","text":"1","element":"a"},{"text":") can be considered as assigning the reward to a sequence and is referred to as the sequence-level PPO problem in this work. 
It has the issue of sparse reward (delayed feedback) that challenges traditional deep reinforcement learning (","element":"span"},{"href":"#id-27","referenceIndex":1,"text":"Andrychowicz et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":1,"text":"2017","element":"a"},{"text":"). To alleviate the issue, sequence-level PPO with token-level reward guidance is developed to fine-tune LLMs in a fine-grained fashion with dense token-wise rewards (","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"Yang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":31,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"Yin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":33,"text":"2025","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"Zhong et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":36,"text":"2024","element":"a"},{"text":").","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Sequence-Level PPO with Token-Level Reward Guidance. ","element":"span"},{"text":"Text generation of an LLM can be modeled as a Markov Decision Process (MDP). 
Let ","element":"span"},{"style":{"height":9.81},"width":28,"height":24.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-10.png","element":"img","alt":" st","inline":true,"padRight":true},{"text":"be the context for generating the token at time step ","element":"span"},{"style":{"height":13.01},"width":85.52,"height":32.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-11.png","element":"img","alt":" t ≥ 0","inline":true},{"text":", the generated token is denoted as ","element":"span"},{"style":{"height":16},"width":217,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-12.png","element":"img","alt":" at ∼ πθ(·|st)","inline":true},{"text":". For a prompt ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"of the LLM, ","element":"span"},{"style":{"height":9.81},"width":110.48,"height":24.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-13.png","element":"img","alt":"s0 = x","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.8},"width":204.48,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/2-14.png","element":"img","alt":" st = [x, a 0,","inline":true,"padRight":true},{"text":"then it is identified as a preferred token, implying that the state-action should be reinforced, and then it is assigned a larger weight ","element":"span"},{"style":{"height":17.01},"width":323.52,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-2.png","element":"img","alt":" 1 + αˆr([x, y 1","inline":true},{"text":", optimizing the ","element":"span"},{"text":"loss function ","element":"span"},{"style":{"height":16},"width":190,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-15.png","element":"img","alt":" LTGDPO(πθ)","inline":true,"padRight":true},{"text":"would assign an even lower probability to this action.","element":"span"}],[{"text":"• The token 
","element":"span"},{"style":{"height":17.6},"width":30.52,"height":44,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-16.png","element":"img","alt":" ytl","inline":true},{"text":"satisfying ","element":"span"},{"style":{"height":18.19},"width":301.48,"height":45.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-17.png","element":"img","alt":" ˆr([x, y 0","inline":true,"padRight":true},{"text":"is consid- ","element":"span"},{"text":"ered as a preferred token, although it is in the dispreferred response ","element":"span"},{"style":{"height":10.59},"width":27,"height":26.48,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-18.png","element":"img","alt":" yl","inline":true},{"text":". In this case ","element":"span"},{"style":{"height":18.4},"width":364.52,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/icml/45181/images/6-19.png","element":"img","alt":" 1−αˆr([x, y