37:[["$","audio",null,{"id":"tts"}],["$","$L3c",null,{"paperID":"2405.17618","publisher":"arxiv","paperJSON":{"title":"Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales","paperID":"2405.17618","avgLineHeight":11.93,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3d","element":"span"}]]},{"heading":"1. Introduction","paragraphs":[[{"text":"Recent advancements in Large Language Models (LLMs) have shown impressive performance across various natural language processing tasks (","element":"span"},{"href":"#id-0","referenceIndex":10,"text":"Chung et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-0","referenceIndex":10,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-1","referenceIndex":55,"text":"Wei et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":55,"text":"2023","element":"a"},{"text":"), robot control (","element":"span"},{"href":"#id-2","referenceIndex":18,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":18,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-3","referenceIndex":11,"text":"Driess et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":11,"text":"2023","element":"a"},{"text":"), and healthcare (","element":"span"},{"href":"#id-4","referenceIndex":21,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":21,"text":"2023c","element":"a"},{"text":"; ","element":"span"},{"href":"#id-5","referenceIndex":17,"text":"Huang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":17,"text":"2020","element":"a"},{"text":"). However, as these LLMs are typically trained to predict the next word in a provided dataset, they require post-training processing to make them useful for particular tasks. 
Reinforcement Learning from Human Feedback (RLHF) trains LLMs to generate responses aligned with user preferences through human feedback. Reinforcement Learning from AI Feedback (RLAIF), which leverages feedback from well-trained AI models, has also been employed (","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"2023a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2022","element":"a"},{"text":"). Thus, adapting fundamental Reinforcement Learning (RL) algorithms such as REINFORCE (","element":"span"},{"href":"#id-8","referenceIndex":56,"text":"Williams","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":56,"text":"1992","element":"a"},{"text":"), A2C (","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"2016","element":"a"},{"text":"), and PPO (","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"Schul","element":"a"},{"href":"#id-10","referenceIndex":40,"text":"man et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"2017","element":"a"},{"text":") to the fine-tuning of LLMs is an area of active interest (","element":"span"},{"href":"#id-11","referenceIndex":1,"text":"Ahmadian et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":1,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-13","referenceIndex":33,"text":"Rafailov et 
al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":33,"text":"2023","element":"a"},{"text":").","element":"span"}],[{"text":"RL methods (","element":"span"},{"href":"#id-14","referenceIndex":45,"text":"Sutton et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":45,"text":"2000","element":"a"},{"text":"; ","element":"span"},{"href":"#id-15","referenceIndex":43,"text":"Sutton & Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":43,"text":"2018a","element":"a"},{"text":") have led to substantial breakthroughs in tasks such as robot control and game playing. Still, they entail learning instability compared to supervised learning due to factors such as moving targets, high gradient variance, and training value functions. The RL literature has proposed various methods to make the RL process more robust, such as preventing overestimation with Double DQN (","element":"span"},{"href":"#id-16","referenceIndex":48,"text":"van Hasselt et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":48,"text":"2015","element":"a"},{"text":"), reducing variance with Generalized Advantage Estimation (GAE) (","element":"span"},{"href":"#id-17","referenceIndex":41,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":41,"text":"2018","element":"a"},{"text":"), constraining updates to a trust region (","element":"span"},{"href":"#id-18","referenceIndex":39,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":39,"text":"2015","element":"a"},{"text":"; ","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"2017","element":"a"},{"text":"), and encouraging diverse behavior with Soft Actor-Critic (SAC) (","element":"span"},{"href":"#id-19","referenceIndex":14,"text":"Haarnoja et al.","element":"a"},{"text":", 
","element":"span"},{"href":"#id-19","referenceIndex":14,"text":"2018","element":"a"},{"text":"). In addition to the methods devised specifically for RL problems, the RL literature has also adopted supervised learning techniques to make the learning process more robust. For example, ensembles have been used for more accurate value function prediction, while Layer Normalization and Batch Normalization have been employed to constrain predictions for out-of-distribution samples, thereby mitigating overestimation and extrapolation errors.","element":"span"}],[{"text":"RLHF (","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"2023b","element":"a"},{"text":") and RLAIF (","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"2023a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-21","referenceIndex":4,"text":"Byun et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-21","referenceIndex":4,"text":"2024","element":"a"},{"text":") potentially","element":"span"}],[{"style":{"width":"78%"},"width":1518,"height":469,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/1-0.png","element":"img"}],[{"id":"id-25","style":{"fontStyle":"italic"},"text":"Figure 1. ","element":"figcaption","subtype":"caption"},{"text":"Example of reward prediction errors in a trained reward model for TL;DR summarization. 
The generated summary samples (left and middle) are both ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"empty","element":"figcaption","subtype":"caption"},{"text":", yet they receive significantly different rewards. The middle sample receives a higher reward than a sample that contains a summary (right), and its score (6.66) even exceeds the average reward score of SPPO (6.13). The full text for these samples can be found in Appendix ","element":"figcaption","subtype":"caption"},{"href":"#id-22","text":"15","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"text":"introduce additional training challenges. For example, these algorithms often receive feedback from multiple sources (humans or AI models) to align LLMs, and each feedback provider may have different preferences, meaning a sample considered preferable by one provider could be deemed undesirable by another (","element":"span"},{"href":"#id-23","referenceIndex":12,"text":"Ethayarajh et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-23","referenceIndex":12,"text":"2024","element":"a"},{"text":"; ","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"2024","element":"a"},{"text":"). In addition, RLHF and RLAIF often leverage a trained reward model to provide feedback on samples generated by the LLM. This indirection raises the question: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"does the learned reward model provide the correct reward? 
","element":"span"},{"text":"The reward model has prediction errors itself (See Figure ","element":"span"},{"href":"#id-25","text":"1","element":"a"},{"text":"), but as the LLM is trained with RL, its outputs deviate from the reward model’s training dataset, introducing more errors (noise) in the reward model’s predictions for out-of-distribution samples.","element":"span"}],[{"text":"The challenges associated with RL, RLHF, and RLAIF, as mentioned above, can introduce confusion when calculating advantage values in RL algorithms like A2C and PPO. Specifically, an action that should have a positive advantage value may have a negative sign in the next update, depending on which samples (states, actions) are generated and how the batch is composed during advantage normalization. The sign of the advantage determines whether the probability of a corresponding action for a given state increases or decreases in policy gradient algorithms. If the advantages are predicted incorrectly, this can lead to learning in the opposite direction. We hypothesize that these difficulties are similar to noisy classification tasks in supervised learning, where some labels are incorrect.","element":"span"}],[{"text":"In this paper, we leverage a technique developed for classification tasks with noisy labels, employing a robust loss function to enhance the learning procedures of A2C and PPO. We define a symmetric RL loss, whose fundamental mechanism aligns with the robust loss function used in supervised learning (","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":"), to improve the robustness of RL procedure for A2C and PPO (See Section ","element":"span"},{"href":"#id-27","text":"4.3","element":"a"},{"text":"). 
We apply this symmetric RL loss to A2C and ","element":"span"},{"text":"PPO, naming them Symmetric A2C (SA2C) and Symmetric PPO (SPPO), and evaluate their performance across various tasks and model scales.","element":"span"}],[{"text":"First, we assess the performance gains of SA2C and SPPO on Atari games (","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"2016","element":"a"},{"text":"), which have discrete action spaces, as well as on the MuJoCo benchmark (","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"Todorov ","element":"a"},{"href":"#id-28","referenceIndex":47,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"2012","element":"a"},{"text":") and Box2D (","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"Catto","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"2011","element":"a"},{"text":") environments, which have continuous action spaces. For these control tasks, we introduce a noisy reward variant, hypothesizing that it will increase confusion in advantage prediction to better evaluate our method. Additionally, we test our method on RLHF tasks using LLMs, such as IMDB positive sentiment analysis (","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"Maas et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"2011","element":"a"},{"text":") and TL;DR summarization (","element":"span"},{"text":"V","element":"span"},{"href":"#id-31","referenceIndex":49,"text":"ölske ","element":"a"},{"href":"#id-31","referenceIndex":49,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":49,"text":"2017","element":"a"},{"text":"). 
The IMDB task involves generating text with positive sentiment for a given context, and TL;DR is a summarization task in which an LLM is required to summarize content.","element":"span"}],[{"text":"SA2C and SPPO outperform A2C and PPO across diverse control tasks. Notably, both SA2C and SPPO perform well when noise is added to the reward. Additionally, SPPO shows consistent performance improvements across various hyperparameters (Table ","element":"span"},{"href":"#id-32","text":"11","element":"a"},{"text":"). We analyze why SPPO exhibits more robust improvements than SA2C in Section ","element":"span"},{"href":"#id-33","text":"5.4","element":"a"},{"text":". Furthermore, SPPO outperforms PPO in RLHF tasks such as IMDB positive sentiment and TL;DR summarization. SPPO achieves higher reward than PPO in both tasks, and its summaries are significantly better, as measured by the win rate judged by GPT-4 Turbo (","element":"span"},{"text":"gpt-4-turbo-2024-04-09","element":"span"},{"text":").","element":"span"}],[{"text":"In summary, our key contributions are:","element":"span"}],[{"text":"• We propose the symmetric RL loss for A2C and PPO, along with a gradient analysis showing that it aligns with the gradient behavior of robust loss functions used in noisy classification tasks in Section ","element":"span"},{"href":"#id-27","text":"4.3","element":"a"},{"text":".","element":"span"}],[{"text":"• We conduct experiments across various environments and model scales, demonstrating performance improvements that validate the symmetric RL loss for general control tasks and RLHF tasks in Section ","element":"span"},{"text":"5","element":"span"},{"text":".","element":"span"}],[{"text":"• We analyze how PPO can introduce additional confusion in advantage estimates, which justifies using symmetric RL loss (See Section 
","element":"span"},{"href":"#id-33","text":"5.4","element":"a"},{"text":"). This shows that SPPO demonstrates consistent improvement across a range of hyperparameters.","element":"span"}]]},{"heading":"2. Related Work","paragraphs":[[{"text":"We introduce robust loss functions studied in the context of noise in supervised learning classification tasks. ","element":"span"},{"href":"#id-34","referenceIndex":13,"text":"Ghosh ","element":"a"},{"href":"#id-34","referenceIndex":13,"text":"et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-34","referenceIndex":13,"text":"2017","element":"a"},{"text":") prove that, in the presence of a noisy dataset, the mean absolute error (MAE) has a slower learning speed compared to cross-entropy loss (CE), but the model learns more robustly. ","element":"span"},{"href":"#id-35","referenceIndex":57,"text":"Zhang & Sabuncu ","element":"a"},{"text":"(","element":"span"},{"href":"#id-35","referenceIndex":57,"text":"2018","element":"a"},{"text":") propose a generalized cross entropy loss ","element":"span"},{"style":{"height":15.59},"width":42.12,"height":38.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-0.png","element":"img","alt":" Lq","inline":true},{"text":", which becomes CE when ","element":"span"},{"style":{"height":14},"width":111.14,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-1.png","element":"img","alt":" q → 0,","inline":true,"padRight":true},{"text":"and becomes MAE when ","element":"span"},{"style":{"height":14},"width":101.21,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-2.png","element":"img","alt":" q → 1","inline":true},{"text":". 
By adjusting this parameter ","element":"span"},{"style":{"height":14},"width":165.41,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-3.png","element":"img","alt":" 0 ≤ q ≤ 1","inline":true},{"text":", robust learning is achieved in noisy datasets. The symmetric cross entropy (SCE) (","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":") that we mainly refer to suggests a symmetric cross-entropy loss. This loss not only considers the flow of information from the true distribution to the model’s predictions but also incorporates information flowing in the reverse direction. SCE works better than GCE in general, especially for data with high noise rates. ","element":"span"},{"href":"#id-36","referenceIndex":23,"text":"Ma et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-36","referenceIndex":23,"text":"2020","element":"a"},{"text":") introduce various loss functions and classify them into types: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Active Loss ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Passive Loss ","element":"span"},{"text":"functions. They demonstrate that normalizing the loss can help improve robustness. They use a combination of one active loss and one passive loss like SCE. We define a loss function that considers reverse information to match the RL version.","element":"span"}],[{"text":"In the RL literature, ","element":"span"},{"href":"#id-37","referenceIndex":52,"text":"Wang et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-37","referenceIndex":52,"text":"2018","element":"a"},{"text":") propose using a confusion matrix to handle perturbed rewards, predicting surrogate rewards for robust policy updates. 
While this method appears effective for Atari games, later research (","element":"span"},{"href":"#id-38","referenceIndex":9,"text":"Chen et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":9,"text":"2024","element":"a"},{"text":") shows that it does not outperform corresponding baselines in continuous tasks. Additionally, introducing noise in RL has been shown to improve performance in some settings. For instance, ","element":"span"},{"href":"#id-39","referenceIndex":28,"text":"Obando-Ceron et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-39","referenceIndex":28,"text":"2023","element":"a"},{"text":") show that smaller batch sizes improve performance, and ","element":"span"},{"href":"#id-40","referenceIndex":38,"text":"Schaul et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-40","referenceIndex":38,"text":"2022","element":"a"},{"text":") show that policy churn aids exploration. These studies primarily conduct experiments on Atari games, which require navigating many novel states. However, whether noise is beneficial in continuous action spaces remains debated (","element":"span"},{"href":"#id-41","referenceIndex":25,"text":"Mai ","element":"a"},{"href":"#id-41","referenceIndex":25,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-41","referenceIndex":25,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-42","referenceIndex":3,"text":"Byun & Perrault","element":"a"},{"text":", ","element":"span"},{"href":"#id-42","referenceIndex":3,"text":"2024","element":"a"},{"text":"). Our work proposes a robust loss function designed to handle noise (confusion in advantage prediction) without judging whether the noise is beneficial or not.","element":"span"}],[{"text":"Reinforcement Learning from Human Feedback (RLHF) (","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"2022","element":"a"},{"text":"; ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"2023b","element":"a"},{"text":") and Reinforcement ","element":"span"},{"text":"Learning from AI Feedback (RLAIF) (","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"2023a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2022","element":"a"},{"text":") have contributed to the success of large language models (LLMs) by aligning them with user preferences. However, these methods require training a reward model and a value function. Each of these components has prediction errors, and finding appropriate hyperparameters for training requires significant effort. Direct Preference Optimization (DPO) (","element":"span"},{"href":"#id-13","referenceIndex":33,"text":"Rafailov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":33,"text":"2023","element":"a"},{"text":") eliminates the cost associated with the reward model by rearranging the PPO loss for ranking-based feedback (e.g., sample A is preferred over sample B). ","element":"span"},{"href":"#id-23","referenceIndex":12,"text":"Ethayarajh et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-23","referenceIndex":12,"text":"2024","element":"a"},{"text":") propose Kahneman-Tversky Optimization (KTO), which removes the requirement for ranking-based feedback by further modifying the DPO loss, allowing a model to be trained with good or bad labels. Additionally, ","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"Chakraborty et al. 
","element":"a"},{"text":"(","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"2024","element":"a"},{"text":") demonstrate that feedback from diverse people, each with different preferences, makes it difficult for a single reward model to reflect preferences correctly. Recent studies focus on sentence-level feedback (","element":"span"},{"href":"#id-43","referenceIndex":22,"text":"Lightman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-43","referenceIndex":22,"text":"2023","element":"a"},{"text":"; ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-44","referenceIndex":53,"text":"2024","element":"a"},{"text":"), but DPO and KTO cannot utilize sentence-level feedback. Therefore, we propose the reverse RL loss term, which can make PPO in existing RLHF methods more robust.","element":"span"}]]},{"heading":"3. Preliminaries","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"3.1. Reinforcement Learning","element":"span"}],[{"text":"Reinforcement Learning (RL) is formulated as a Markov decision process (MDP) (","element":"span"},{"href":"#id-45","referenceIndex":31,"text":"Puterman","element":"a"},{"text":", ","element":"span"},{"href":"#id-45","referenceIndex":31,"text":"2014","element":"a"},{"text":"; ","element":"span"},{"href":"#id-46","referenceIndex":44,"text":"Sutton & Barto","element":"a"},{"text":", ","element":"span"},{"href":"#id-46","referenceIndex":44,"text":"2018b","element":"a"},{"text":") defined by the tuple ","element":"span"},{"style":{"height":16},"width":410.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-4.png","element":"img","alt":" M = (S, A, P, R, γ, µ)","inline":true},{"text":". 
At each timestep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", an action ","element":"span"},{"style":{"height":13.99},"width":125.96,"height":34.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-5.png","element":"img","alt":" at ∈ A","inline":true,"padRight":true},{"text":"is sampled from an agent’s policy ","element":"span"},{"style":{"height":16},"width":160.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-6.png","element":"img","alt":" πθ(· | st)","inline":true,"padRight":true},{"text":"for a given state ","element":"span"},{"style":{"height":13.19},"width":118.22,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-7.png","element":"img","alt":" st ∈ S","inline":true},{"text":". For the taken action ","element":"span"},{"style":{"height":9.19},"width":33.06,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-8.png","element":"img","alt":" at","inline":true},{"text":", the reward function returns a reward ","element":"span"},{"style":{"height":16},"width":150.79,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-9.png","element":"img","alt":" R(st, at)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"height":12.4},"width":312.02,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-10.png","element":"img","alt":" R : S × A → R","inline":true},{"text":", and the transition probability ","element":"span"},{"style":{"height":16},"width":192.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-11.png","element":"img","alt":"P(· | st, at)","inline":true,"padRight":true},{"text":"determines the next state 
","element":"span"},{"style":{"height":10.79},"width":116.5,"height":26.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-12.png","element":"img","alt":" st+1. γ","inline":true,"padRight":true},{"text":"is the discount factor, and ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-13.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"represents the initial state distribution for ","element":"span"},{"style":{"height":9.19},"width":34.68,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-14.png","element":"img","alt":" s0","inline":true},{"text":". The RL objective is to find the optimal ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-15.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"that maximizes the expected discounted sum of rewards:","element":"span"}],[{"id":"id-47","style":{"width":"90%"},"width":847,"height":159,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-16.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.2. A2C and PPO Algorithms","element":"span"}],[{"text":"The Advantage Actor-Critic (A2C) algorithm (","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"2016","element":"a"},{"text":") is an actor-critic method that combines value-based and policy-based approaches. A2C uses the advantage function ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"to reduce the variance in policy updates. 
The policy ","element":"span"},{"style":{"height":9.19},"width":37.71,"height":22.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-17.png","element":"img","alt":"πθ","inline":true,"padRight":true},{"text":"is updated by following the gradient of the objective function to maximize the sum of rewards (Equation ","element":"span"},{"href":"#id-47","text":"1","element":"a"},{"text":"):","element":"span"}],[{"style":{"width":"87%"},"width":819,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/2-18.png","element":"img"}],[{"text":"Proximal Policy Optimization (PPO) (","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"2017","element":"a"},{"text":") aims to update the policy within a trust region. This is","element":"span"}],[{"style":{"width":"92%"},"width":1791,"height":343,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-0.png","element":"img"}],[{"id":"id-80","style":{"fontStyle":"italic"},"text":"Figure 2. ","element":"figcaption","subtype":"caption"},{"text":"Change of advantage rate (%): The graphs show how often the advantage signs flip in various environments as training progresses. In Atari games, often over 5% of samples change signs, while in MuJoCo tasks, usually over 10% of samples change signs after the advantage normalization. We use 5 different random seeds for CrazyClimber and WizardOfWor, and 30 different random seeds for Ant-v4 and Walker2d-v4. The line is the mean of the change ratio across the seeds, and the shaded area represents standard errors.","element":"figcaption","subtype":"caption"}],[{"text":"achieved through a clipped loss function to ensure that the new policy does not deviate too much from the old policy. 
The PPO loss function can be written as:","element":"span"}],[{"style":{"width":"99%"},"width":932,"height":85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":26.11},"width":320.39,"height":65.27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-2.png","element":"img","alt":" rt(θ) = πθ(at|st) / πθold(at|st)","inline":true,"padRight":true},{"text":"is the probability ratio, and ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-3.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"is a small hyperparameter that controls the clipping range. The advantage function estimates how much better an action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"is compared to the other actions at a given state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Both algorithms increase the probability of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"if the corresponding advantage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0 ","element":"span"},{"text":"and decrease it if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< ","element":"span"},{"text":"0","element":"span"},{"text":". 
In the next section, we introduce the connection between A2C and PPO with the cross-entropy loss for classification and define the symmetric RL loss.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"3.3. Symmetric Cross Entropy","element":"span"}],[{"text":"Symmetric Cross Entropy (SCE) (","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":") is designed for noisy classification datasets. Cross Entropy (CE) loss (Equation ","element":"span"},{"href":"#id-48","text":"4","element":"a"},{"text":") performs effectively when the data is clean; however, it encounters challenges in the presence of noise. Given a true distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"and a predicted distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"is learned based on the information derived from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"according to information theory. However, when ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"is noisy, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"can only approximate the true distribution to a limited extent. 
To address this issue, SCE incorporates information in the opposite direction through Reverse Cross Entropy (RCE) (Equation ","element":"span"},{"href":"#id-48","text":"5","element":"a"},{"text":").","element":"span"}],[{"id":"id-48","style":{"width":"76%"},"width":721,"height":259,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-4.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":264.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-5.png","element":"img","alt":" k ∈ {1, ..., K}","inline":true,"padRight":true},{"text":"is a class and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"is an input. RCE loss has been proven to be robust to a certain amount of noise, but it learns too slowly on its own. Therefore, SCE","element":"span"}],[{"text":"combines CE and RCE losses (Equation ","element":"span"},{"href":"#id-49","text":"6","element":"a"},{"text":"),","element":"span"}],[{"id":"id-49","style":{"width":"68%"},"width":643,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-7.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/3-8.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"are constants determining the contribution of each part. SCE demonstrates performance improvement across various noise ratios and types. As mentioned in the introduction, the RL training process can lead to noisy advantage predictions, so we propose a symmetric RL loss in the next section.","element":"span"}]]},{"heading":"4. 
Approach","paragraphs":[[{"text":"This section introduces the reverse RL loss and proposes the symmetric RL loss for A2C (","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"2016","element":"a"},{"text":") and PPO (","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"Schulman et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"2017","element":"a"},{"text":"), an RL version of Symmetric Cross Entropy (SCE) (","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":"). A2C and PPO training procedures basically increase or decrease the probability of an action depending on the advantage sign, but the advantage prediction involves noise due to several factors. A highly engineered reward function is required to eliminate errors, and the trained reward model has prediction errors in RLHF (","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"2022","element":"a"},{"text":") and RLAIF (","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"Lee et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":19,"text":"2023a","element":"a"},{"text":"; ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"Bai et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-7","referenceIndex":2,"text":"2022","element":"a"},{"text":"). 
Receiving feedback from multiple sources further complicates the training of the reward model (","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"Chakraborty et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-24","referenceIndex":7,"text":"2024","element":"a"},{"text":"). Additionally, the value function also has estimation errors, and the sign of the advantage in advantage normalization depends on how the batch is composed. PPO increases sample efficiency compared to A2C, but the off-policy part can introduce confusion in advantage predictions (See Section ","element":"span"},{"href":"#id-33","text":"5.4","element":"a"},{"text":"). Similar to SCE, which is robust to noisy data, the symmetric RL loss enhances robustness in an RL environment that can introduce noise.","element":"span"}],[{"id":"id-54","style":{"fontWeight":"bold"},"text":"4.1. Reverse Reinforcement Learning Loss","element":"span"}],[{"text":"Given a true (target) distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"and a predicted distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"is noisy, training ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"can be challenging and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"cannot accurately reflect the true distribution. Reverse Cross Entropy (RCE) considers the reverse information from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":". We propose that the reverse RL losses for A2C and PPO also incorporate reverse information to address noisy factors in the RL training procedure. 
The RCE loss (Equation ","element":"span"},{"href":"#id-48","text":"5","element":"a"},{"text":") defines ","element":"span"},{"text":"log 0 = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z < ","element":"span"},{"text":"0 ","element":"span"},{"text":"is some constant for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") = 0 ","element":"span"},{"text":"(","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":"). We adopt the same definition for negative advantages, which is also useful for proving the robustness of the reverse RL losses. For all tasks conducted in this paper, we use ","element":"span"},{"style":{"height":10.8},"width":147.1,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-0.png","element":"img","alt":" Z = −1","inline":true},{"text":". 
Note that the constant terms ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"in Equation ","element":"span"},{"href":"#id-50","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-51","text":"9 ","element":"a"},{"text":"are multiplied together, so we control the impact of the reverse RL loss solely by adjusting ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-2.png","element":"img","alt":" β","inline":true},{"text":". For example, ","element":"span"},{"style":{"height":16},"width":348.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-3.png","element":"img","alt":" (β = 1.0, Z = −1.0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":362.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-4.png","element":"img","alt":" (β = 10.0, Z = −0.1)","inline":true,"padRight":true},{"text":"yield the exact same results.","element":"span"}],[{"text":"Suppose there exist ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"actions and ","element":"span"},{"style":{"height":14.18},"width":56.8,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-5.png","element":"img","alt":" a(i)","inline":true,"padRight":true},{"text":"indicates ","element":"span"},{"style":{"height":13.38},"width":44.17,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-6.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"action. 
","element":"span"},{"style":{"height":21.49},"width":286.5,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-7.png","element":"img","alt":"π(i)θ = πθ(a(i)|s)","inline":true,"padRight":true},{"text":"for a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". Let’s denote the possible action probabilities set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"as ","element":"span"},{"style":{"height":21.49},"width":476.79,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-8.png","element":"img","alt":" πθ(s) = {π(1)θ , π(2)θ , ..., π(k)θ }","inline":true},{"text":". ","element":"span"},{"text":"Note that we discretize the continuous action space for continuous action tasks (","element":"span"},{"href":"#id-52","referenceIndex":46,"text":"Tang & Agrawal","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":46,"text":"2020","element":"a"},{"text":"). One thing we need to note is that when updating a policy, we use advantages instead of label sets in RL. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advantages can have negative values ","element":"span"},{"text":"(negative labels) unlike ordinary labels. We only consider the sign of the advantage","element":"span"},{"text":"1 ","element":"span"},{"text":"because this advantage is the role of the label in supervised learning. 
For a sampled action probability ","element":"span"},{"style":{"height":21.49},"width":59.87,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-9.png","element":"img","alt":" π(i)θ","inline":true,"padRight":true},{"text":"and the corresponding advantage ","element":"span"},{"style":{"height":18.18},"width":275.26,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-10.png","element":"img","alt":" A(s, a(i)) = A(i)","inline":true},{"text":", the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"sample-wise reverse A2C (RA2C) loss ","element":"span"},{"text":"is:","element":"span"}],[{"id":"id-50","style":{"width":"95%"},"width":897,"height":247,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-11.png","element":"img"}],[{"text":"For a positive advantage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", the difference between A2C’s loss ","element":"span"},{"style":{"height":14.8},"width":117.64,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-12.png","element":"img","alt":" A log π","inline":true,"padRight":true},{"text":"and CE loss ","element":"span"},{"text":"1 log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"is that A2C can be considered as CE multiplied by the advantage. In terms of gradients, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is a constant, so A2C reflects the information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"times more strongly than the CE loss. Thus, we also reflect the reverse direction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"times more strongly. 
Similarly, since PPO has ","element":"span"},{"style":{"height":21.49},"width":65.63,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-13.png","element":"img","alt":" π(i)old ","inline":true,"padRight":true},{"text":"term in the loss, the sample-wise reverse ","element":"span"},{"text":"PPO (RPPO) loss just introduces the additional constant ","element":"span"},{"style":{"height":21.49},"width":65.63,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-14.png","element":"img","alt":"π(i)old","inline":true,"padRight":true},{"text":"for a sampled action probability ","element":"span"},{"style":{"height":21.49},"width":59.88,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-15.png","element":"img","alt":" π(i)θ","inline":true,"padRight":true},{"text":"to consider the same amount of reverse information:","element":"span"}],[{"style":{"width":"97%"},"width":914,"height":338,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-16.png","element":"img"}],[{"text":"We define the symmetric RL loss, which consists of the original RL loss (A2C or PPO) and the corresponding reverse ","element":"span"},{"text":"RL loss, in Section ","element":"span"},{"href":"#id-53","text":"4.2","element":"a"},{"text":". We then analyze how these reverse RL losses contribute to RL robustness in Section ","element":"span"},{"href":"#id-27","text":"4.3","element":"a"},{"text":".","element":"span"}],[{"id":"id-53","style":{"fontWeight":"bold"},"text":"4.2. 
Symmetric Reinforcement Learning Loss","element":"span"}],[{"text":"The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Symmetric Reinforcement Learning (SRL) loss ","element":"span"},{"style":{"height":13.19},"width":141.8,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-17.png","element":"img","alt":" Lsrl con-","inline":true,"padRight":true},{"text":"sists of two parts like SCE (Equation ","element":"span"},{"href":"#id-49","text":"6","element":"a"},{"text":"): the original actor loss ","element":"span"},{"style":{"height":13.19},"width":48.56,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-18.png","element":"img","alt":" Lrl","inline":true,"padRight":true},{"text":"(A2C or PPO) and the corresponding reverse RL loss ","element":"span"},{"style":{"height":13.19},"width":70.74,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-19.png","element":"img","alt":" Lrev","inline":true,"padRight":true},{"text":"(RA2C or RPPO). 
","element":"span"},{"style":{"height":13.19},"width":61.2,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-20.png","element":"img","alt":" Lsrl","inline":true,"padRight":true},{"text":"flexibly adjusts the symmetric learning framework with two additional hyperparameters (","element":"span"},{"style":{"height":14.4},"width":273.95,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-21.png","element":"img","alt":"α > 0 and β > 0","inline":true},{"text":") as follows:","element":"span"}],[{"id":"id-51","style":{"width":"67%"},"width":637,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-22.png","element":"img"}],[{"text":"We name A2C and PPO using the symmetric RL loss as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Symmetric A2C (SA2C) ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Symmetric PPO (SPPO)","element":"span"},{"text":", respectively. The meanings of ","element":"span"},{"style":{"height":14.4},"width":124.12,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-23.png","element":"img","alt":" α and β","inline":true,"padRight":true},{"text":"align with SCE, where ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-24.png","element":"img","alt":"α","inline":true,"padRight":true},{"text":"represents the degree of actively training a policy, and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-25.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"serves as auxiliary support to stabilize the entire learning process. In the following section, we analyze the gradient of the two types of losses.","element":"span"}],[{"id":"id-27","style":{"fontWeight":"bold"},"text":"4.3. 
Gradient Analysis","element":"span"}],[{"text":"For an input ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"and the corresponding correct label ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", the cross entropy (CE) loss gradient is ","element":"span"},{"style":{"height":22.17},"width":322.27,"height":55.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-26.png","element":"img","alt":" − 1pθ(k|x)∇θpθ(k|x)","inline":true},{"text":". ","element":"span"},{"text":"Smaller ","element":"span"},{"style":{"height":10},"width":35.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-27.png","element":"img","alt":" pθ","inline":true,"padRight":true},{"text":"values aggressively increase the magnitude of the gradient. CE loss rapidly increases uncertain predictions. If there is no noise, this method is correct, but it may lead to incorrect predictions on noisy datasets and excessive overfitting (","element":"span"},{"href":"#id-35","referenceIndex":57,"text":"Zhang & Sabuncu","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":57,"text":"2018","element":"a"},{"text":"). A2C and PPO losses also have the same issue. For A2C, the gradient is simply multiplied by an advantage ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":24.43},"width":316.36,"height":61.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-28.png","element":"img","alt":" − A(s,a)πθ(a|s)∇θπθ(a|s)","inline":true},{"text":". In the ","element":"span"},{"text":"case of PPO, the magnitude of the gradient tends to increase as the probability of an action decreases. 
Consider a sample that passes the clipping function: the difference between ","element":"span"},{"style":{"height":9.19},"width":65.62,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-29.png","element":"img","alt":"πold","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-30.png","element":"img","alt":" π","inline":true,"padRight":true},{"text":"is within the ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-31.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"bound. As the denominator ","element":"span"},{"style":{"height":9.19},"width":65.63,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-32.png","element":"img","alt":" πold","inline":true,"padRight":true},{"text":"gets smaller, the magnitude of the gradient increases.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Detailed Analysis: ","element":"span"},{"text":"The symmetric RL loss gradient analysis aligns with the analysis of SCE. For simplicity, we set ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-33.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":95.18,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-34.png","element":"img","alt":" β to 1","inline":true,"padRight":true},{"text":"and examine the gradient direction for two types of A2C loss (RL and reverse RL) with respect to the action logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":". 
We use the notation defined in Section ","element":"span"},{"href":"#id-54","text":"4.1 ","element":"a"},{"text":"and introduce the case when ","element":"span"},{"style":{"height":14.98},"width":145.69,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-35.png","element":"img","alt":" A(i) > 0","inline":true},{"text":". For the full derivation including SPPO and ","element":"span"},{"style":{"height":14.98},"width":141.2,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-36.png","element":"img","alt":" A(i) < 0","inline":true},{"text":", please refer to Appendix ","element":"span"},{"text":"A","element":"span"},{"text":". The sample-wise SA2C loss is:","element":"span"}],[{"style":{"width":"68%"},"width":644,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-37.png","element":"img"}],[{"text":"The gradients for each part are:","element":"span"}],[{"style":{"width":"94%"},"width":886,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/4-38.png","element":"img"}],[{"id":"id-64","style":{"fontStyle":"italic"},"text":"Table 1. ","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of PPO and SPPO on Atari games, without and with binary symmetric channel (BSC) noise with a crossover probability of ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"across 5 seeds. 
Full results can be found in Table ","element":"figcaption","subtype":"caption"},{"href":"#id-55","text":"10","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"96%"},"width":905,"height":697,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-0.png","element":"img"}],[{"text":"Thus, the SA2C loss gradient is:","element":"span"}],[{"style":{"width":"95%"},"width":893,"height":387,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-1.png","element":"img"}],[{"text":"For both cases, the gradient directions of the RL (A2C) loss and the reverse RL (RA2C) loss are aligned. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.99},"width":160.74,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-2.png","element":"img","alt":" A(i) > 0","inline":true},{"text":", the gradient of the RA2C loss is ","element":"span"},{"style":{"height":18.18},"width":367.21,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-3.png","element":"img","alt":"−A(i)Zπ(y)(π(y) − 1)","inline":true},{"text":", reaching its maximum magnitude at ","element":"span"},{"style":{"height":14.18},"width":174.04,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-4.png","element":"img","alt":" π(y) = 0.5","inline":true,"padRight":true},{"text":"as a parabolic function. This means that the accelerator helps the probability ","element":"span"},{"style":{"height":14.18},"width":59.88,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-5.png","element":"img","alt":" π(i) ","inline":true,"padRight":true},{"text":"increase most rapidly when the action to take is ambiguous. 
When ","element":"span"},{"style":{"height":15.2},"width":99.49,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-6.png","element":"img","alt":" i ≠ y","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.99},"width":152.52,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-7.png","element":"img","alt":"A(i) > 0","inline":true},{"text":", the probability of actions other than ","element":"span"},{"style":{"height":14.19},"width":56.79,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-8.png","element":"img","alt":" a(i)","inline":true,"padRight":true},{"text":"is reduced, and this reduction is influenced by the confidence of both ","element":"span"},{"style":{"height":14.18},"width":59.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-9.png","element":"img","alt":" π(i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-10.png","element":"img","alt":" π(y)","inline":true},{"text":". Specifically, the gradient of the RA2C loss is ","element":"span"},{"style":{"height":14.18},"width":257.19,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-11.png","element":"img","alt":" −A(i)Zπ(y)π(i)","inline":true},{"text":". 
When both ","element":"span"},{"style":{"height":14.18},"width":59.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-12.png","element":"img","alt":" π(i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.77,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-13.png","element":"img","alt":" π(y)","inline":true,"padRight":true},{"text":"are ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5","element":"span"},{"text":", representing the most ambiguous predictions, the accelerator aids the A2C loss in reducing ","element":"span"},{"style":{"height":14.19},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-14.png","element":"img","alt":" π(y)","inline":true,"padRight":true},{"text":"most effectively. Thus, the RA2C loss helps deviate from ambiguous predictions as an accelerator. SPPO’s loss gradients are also aligned like SA2C and follow the same mechanism (See Appendix ","element":"span"},{"href":"#id-56","text":"B.2","element":"a"},{"text":").","element":"span"}]]},{"heading":"5. Experiments","paragraphs":[[{"text":"To validate the effectiveness of our algorithm, we conduct experiments on various tasks and models of different scales. 
First, we experiment on Atari games (","element":"span"},{"href":"#id-57","referenceIndex":26,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-57","referenceIndex":26,"text":"2013","element":"a"},{"text":") featuring discrete action spaces (Section ","element":"span"},{"href":"#id-58","text":"5.1","element":"a"},{"text":"), as well as MuJoCo benchmark tasks (","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"Todorov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"2012","element":"a"},{"text":") and Box2D","element":"span"}],[{"style":{"width":"72%"},"width":676,"height":383,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-15.png","element":"img"}],[{"text":"tasks (","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"Catto","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"2011","element":"a"},{"text":") (Section ","element":"span"},{"href":"#id-59","text":"5.2","element":"a"},{"text":") with continuous action spaces using Stable-Baselines3 (","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"Raffin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"2021","element":"a"},{"text":"). In these control tasks, we also create a variant of each that introduces reward noise, hypothesizing that it will create more confusion in advantage prediction. SPPO performs better than SA2C for various reverse RL loss hyperparameters ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-16.png","element":"img","alt":" β","inline":true},{"text":". 
We also evaluate our method on IMDB and TL;DR datasets using TRIL (","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"Chang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"2023","element":"a"},{"text":") to determine whether our approach is practical for LLM tasks. We primarily present the experimental results for PPO in the main paper. In the latter part of this section, we analyze why our method works better with PPO than A2C (Section ","element":"span"},{"href":"#id-33","text":"5.4","element":"a"},{"text":"), conduct hyperparameter sensitivity tests, and examine the training cost (Section ","element":"span"},{"href":"#id-62","text":"5.5","element":"a"},{"text":").","element":"span"}],[{"id":"id-58","style":{"fontWeight":"bold"},"text":"5.1. Discrete Action Space Tasks","element":"span"}],[{"text":"We first conduct experiments on Atari games (","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"Mnih et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":27,"text":"2016","element":"a"},{"text":"), whose action spaces are discrete, to evaluate SPPO and SA2C. We select 22 games based on the reported score for A2C in ","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"Schulman et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-10","referenceIndex":40,"text":"2017","element":"a"},{"text":"), focusing on games where the A2C scores are not close to 0, as this allows us to demonstrate meaningful score changes.","element":"span"}],[{"text":"To introduce reward noise, we flip the reward from 0 to 1 or from 1 to 0 with a probability of 10%. We denote this noise setting as a Binary Symmetric Channel (BSC). 
This setting is analogous to a potential problem in ranking-based feedback (","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"Ouyang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":30,"text":"2022","element":"a"},{"text":") from humans or AI, where evaluators may have different preferences, resulting in reversed scores. We observe that SA2C shows marginal improvements (Table ","element":"span"},{"href":"#id-63","text":"7","element":"a"},{"text":"), with a narrow range of effective hyperparameter ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/5-17.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"values. In contrast, SPPO performs well in both noise-free and noisy environments (See Section ","element":"span"},{"href":"#id-33","text":"5.4 ","element":"a"},{"text":"for discussion). Table ","element":"span"},{"href":"#id-64","text":"1 ","element":"a"},{"text":"presents partial results, while the complete results for SPPO, including training curves (Figure ","element":"span"},{"href":"#id-65","text":"4","element":"a"},{"text":"), can be found in Table ","element":"span"},{"href":"#id-55","text":"10","element":"a"},{"text":". SPPO achieves 16 out of 22 wins in noise-free settings and 19 out of 22 wins in noisy settings.","element":"span"}],[{"id":"id-68","style":{"fontStyle":"italic"},"text":"Table 2. ","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of PPO and SPPO on Atari games, without and with binary symmetric channel (BSC) noise with a crossover probability of ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":".","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"across 5 seeds. 
To leverage the reverse RL loss, we discretize the continuous action space. DPPO is added as another baseline (","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":254.74,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-0.png","element":"img","alt":"α = 1.0, β = 0.0","inline":true},{"text":"), and DSPPO is our proposed method. Full results can be found in Tables ","element":"figcaption","subtype":"caption"},{"href":"#id-66","text":"12 ","element":"a","subtype":"caption"},{"text":"and ","element":"figcaption","subtype":"caption"},{"href":"#id-67","text":"13","element":"a","subtype":"caption"},{"text":".","element":"figcaption","subtype":"caption"}],[{"style":{"width":"82%"},"width":1611,"height":482,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-1.png","element":"img"}],[{"id":"id-59","style":{"fontWeight":"bold"},"text":"5.2. Continuous Action Space Tasks","element":"span"}],[{"text":"Next, we perform experiments on the MuJoCo benchmark (","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"Todorov et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-28","referenceIndex":47,"text":"2012","element":"a"},{"text":") and Box2D (","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"Catto","element":"a"},{"text":", ","element":"span"},{"href":"#id-29","referenceIndex":5,"text":"2011","element":"a"},{"text":") continuous action space environments. To utilize the reverse RL loss, we need the probabilities of the actions other than the sampled one. However, conventional RL uses a multivariate Gaussian distribution as the policy, so it cannot provide these probabilities. 
Thus, we discretize the continuous action space (","element":"span"},{"href":"#id-52","referenceIndex":46,"text":"Tang & Agrawal","element":"a"},{"text":", ","element":"span"},{"href":"#id-52","referenceIndex":46,"text":"2020","element":"a"},{"text":"), naming these methods DA2C and DPPO, and add them as additional baseline comparisons.","element":"span"}],[{"text":"Note that, for these tasks, the discretized methods generally work better than the original RL methods like A2C and PPO, provided the continuous action space is discretized with a sufficient number of bins. The discretized distribution can represent more complex distributions than a diagonal Gaussian distribution (where the covariance is diagonal). We apply the reverse RL loss to both DA2C and DPPO.","element":"span"}],[{"text":"Since the reward functions in these environments are highly engineered, we perturb the reward function with Gaussian noise with a mean of 0 and a standard deviation of 0.05. Table ","element":"span"},{"href":"#id-68","text":"2 ","element":"a"},{"text":"shows partial results for SPPO under noise settings. The full experiment results are in Tables ","element":"span"},{"href":"#id-66","text":"12 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-67","text":"13","element":"a"},{"text":". Similar to the Atari game results, SA2C shows tied performance in the noiseless setting and improvements when reward noise is introduced. SPPO consistently shows robust performance gains across a wide range of ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"values for both settings.","element":"span"}],[{"id":"id-84","style":{"fontWeight":"bold"},"text":"5.3. RLHF Tasks","element":"span"}],[{"text":"The final tasks are RLHF tasks to assess our method’s applicability to large language models. 
The first task, IMDB positive sentiment, aims to generate positive sentiment continuations for movie reviews (","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"Maas et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"2011","element":"a"},{"text":"). The ","element":"span"},{"text":"sentiment classifier (","element":"span"},{"href":"#id-69","referenceIndex":37,"text":"Sanh et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-69","referenceIndex":37,"text":"2019","element":"a"},{"text":") is used as a reward model to evaluate how positive a provided text is. The base policy is GPT-2 (","element":"span"},{"href":"#id-70","referenceIndex":32,"text":"Radford et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-70","referenceIndex":32,"text":"2019","element":"a"},{"text":"), which we fine-tune using PPO or SPPO. We evaluate this model based on the reward score and perplexity. SPPO shows improvement in both reward score and perplexity compared to PPO.","element":"span"}],[{"text":"The second RLHF task is TL;DR summarization (","element":"span"},{"text":"V","element":"span"},{"href":"#id-31","referenceIndex":49,"text":"ölske ","element":"a"},{"href":"#id-31","referenceIndex":49,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":49,"text":"2017","element":"a"},{"text":"). The objective is to summarize Reddit posts. 
The reward model is a fine-tuned GPT-J (","element":"span"},{"href":"#id-71","referenceIndex":51,"text":"Wang & ","element":"a"},{"href":"#id-71","referenceIndex":51,"text":"Komatsuzaki","element":"a"},{"text":", ","element":"span"},{"href":"#id-71","referenceIndex":51,"text":"2021","element":"a"},{"text":") with LoRA adapters (","element":"span"},{"href":"#id-72","referenceIndex":16,"text":"Hu et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":16,"text":"2021","element":"a"},{"text":") by ","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"Chang et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"2023","element":"a"},{"text":"). The training dataset for this reward model is the filtered dataset with additional human preference data used in ","element":"span"},{"href":"#id-73","referenceIndex":42,"text":"Stiennon et al. ","element":"a"},{"text":"(","element":"span"},{"href":"#id-73","referenceIndex":42,"text":"2020","element":"a"},{"text":").","element":"span"}],[{"text":"The base policy model is an open-source GPT-J model ","element":"span"},{"text":"(CarperAI/openai_summarize_tldr_sft) ","element":"span"},{"text":"with added LoRA adapters. Note that the open-source GPT-J model outputs empty summaries for most of the evaluation data. Therefore, we report results after 10 epochs of RL updates as an alternative to SFT, since this is when the policy begins to summarize posts consistently. We evaluate SPPO based on reward score, perplexity, and win rate. The win rate is judged by GPT-4 Turbo (","element":"span"},{"href":"#id-74","referenceIndex":29,"text":"OpenAI","element":"a"},{"text":", ","element":"span"},{"href":"#id-74","referenceIndex":29,"text":"2024","element":"a"},{"text":") (","element":"span"},{"text":"gpt-4-turbo-2024-04-09","element":"span"},{"text":") by comparing the generated output and reference text. 
Even though the perplexity of SPPO is slightly higher than that of PPO, there is an improvement in the reward score and a significantly increased win rate.","element":"span"}],[{"text":"We also conduct experiments using Qwen2-0.5B as the policy model with a LoRA adapter, employing the same reward model and hyperparameters: ","element":"span"},{"style":{"height":11.2},"width":129.78,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-3.png","element":"img","alt":" α = 0.5","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":257.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-4.png","element":"img","alt":" β = {0.2, 20.0}","inline":true,"padRight":true},{"text":"for SPPO. Specifically, ","element":"span"},{"style":{"height":14.4},"width":145.21,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-5.png","element":"img","alt":" β = 0.2","inline":true,"padRight":true},{"text":"is used for GPT-J, and ","element":"span"},{"style":{"height":14.4},"width":148.73,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-6.png","element":"img","alt":"β = 20.0","inline":true,"padRight":true},{"text":"is used for SPPO in the continuous tasks. Overall, SPPO outperforms PPO (Figure ","element":"span"},{"href":"#id-75","text":"3","element":"a"},{"text":"). Notably, SPPO demonstrates a significant performance boost with ","element":"span"},{"style":{"height":14.4},"width":153.49,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-7.png","element":"img","alt":" β = 20.0","inline":true},{"text":", although its performance drops sharply after 300 epochs. 
In contrast, PPO and SPPO with ","element":"span"},{"style":{"height":14.4},"width":129.44,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/6-8.png","element":"img","alt":" β = 0.2","inline":true,"padRight":true},{"text":"show a similar performance drop around epoch 900.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Table 3. ","element":"figcaption","subtype":"caption"},{"text":"RM Score indicates the reward model score, Perplexity measures the uncertainty of the model, and Win Rate is judged by GPT-4 Turbo by comparing the generated output and reference text. We use 4 different random seeds for each task.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":928,"height":975,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-0.png","element":"img"}],[{"id":"id-75","style":{"fontStyle":"italic"},"text":"Figure 3. ","element":"figcaption","subtype":"caption"},{"text":"TL;DR summarization results with Qwen2-0.5B across 4 different random seeds. SPPO with ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":145.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-1.png","element":"img","alt":" β = 20.0","inline":true,"padRight":true},{"text":"shows a sharp performance surge, followed by a drop after 300 epochs. A similar drop occurs for SPPO with ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":118.46,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-2.png","element":"img","alt":" β = 0.2","inline":true,"padRight":true},{"text":"and PPO around epoch 900.","element":"figcaption","subtype":"caption"}],[{"text":"In the introduction section, we mention that RLHF or RLAIF have additional errors due to a trained reward model. We check whether the trained reward model in TL;DR has reward prediction errors. 
Figure ","element":"span"},{"href":"#id-25","text":"1 ","element":"a"},{"text":"shows a dramatic example: the generated summary (left) and the middle sample are both ","element":"span"},{"style":{"fontStyle":"italic"},"text":"empty","element":"span"},{"text":", but their rewards show a huge gap. The middle sample scores higher (6.66) than summaries learned with SPPO (6.13). Wrong summaries, such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"empty","element":"span"},{"text":" ones, can score higher than a summarized text (right). These cases are observed very often. This makes the RL training procedure noisier and means that the sign of the advantage changes depending on how the batch is composed. More detailed texts for these samples are available in Appendix ","element":"span"},{"text":"D","element":"span"},{"text":".","element":"span"}],[{"id":"id-33","style":{"fontWeight":"bold"},"text":"5.4. Why SPPO Works Better Than SA2C","element":"span"}],[{"text":"The motivation for using the reverse RL loss is to address the issue of ambiguity in advantage predictions (Section ","element":"span"},{"href":"#id-27","text":"4.3","element":"a"},{"text":"). We hypothesize that the PPO advantage prediction (sign) is less consistent than in A2C during policy updates, but this does not mean that PPO is worse than A2C. There are two main reasons why consistency is not maintained. First, PPO has improved sample efficiency compared to A2C, but after the first epoch, subsequent updates become","element":"span"}],[{"style":{"width":"78%"},"width":739,"height":264,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-3.png","element":"img"}],[{"text":"off-policy, affecting advantage estimates. Second, PPO often uses advantage normalization to keep large advantage values out of policy updates and stabilize the learning process. 
In addition, PPO often uses smaller mini-batch sizes (e.g., 64), whereas A2C uses the entire dataset for policy updates. Many popular RL codebases, such as Stable Baselines3 (","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"Raffin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"2021","element":"a"},{"text":"), RL4LMs (","element":"span"},{"href":"#id-76","referenceIndex":36,"text":"Ramamurthy et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-76","referenceIndex":36,"text":"2023","element":"a"},{"text":"), TRL (","element":"span"},{"href":"#id-77","referenceIndex":50,"text":"von Werra et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-77","referenceIndex":50,"text":"2020","element":"a"},{"text":"), and trlX (","element":"span"},{"href":"#id-78","referenceIndex":15,"text":"Havrilla et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-78","referenceIndex":15,"text":"2023","element":"a"},{"text":") apply advantage normalization to PPO by default, but not to A2C. Our experiments on the usefulness of advantage normalization also show that it increases performance on IMDB by more than it decreases performance on TL;DR (Appendix ","element":"span"},{"href":"#id-79","text":"14","element":"a"},{"text":").","element":"span"}],[{"text":"We examine the ratio of advantage sign changes before and after normalization for PPO in Atari games and MuJoCo tasks (Figure ","element":"span"},{"href":"#id-80","text":"2","element":"a"},{"text":"). This ratio varies across environments. It usually exceeds 5% for Atari games and 10% for MuJoCo and Box2D environments. These sign changes introduce confusion, which makes the reverse RL loss more effective for PPO. 
This observation aligns with our motivation for using symmetric RL loss to handle noisy data, similar to how it is addressed in noisy classification tasks in supervised learning.","element":"span"}],[{"text":"Additionally, since A2C uses the entire dataset (rather than using advantage normalization with small batches) for the policy updates, it introduces less confusion in advantage prediction. As a result, SA2C demonstrates performance comparable to A2C in settings without reward noise (Table ","element":"span"},{"href":"#id-63","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-81","text":"8","element":"a"},{"text":"), and improvements in settings with reward noise (Table ","element":"span"},{"href":"#id-63","text":"7 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-82","text":"9","element":"a"},{"text":"), where advantage estimation is more likely to be confused.","element":"span"}],[{"id":"id-62","style":{"fontWeight":"bold"},"text":"5.5. Hyperparameters and Training Cost","element":"span"}],[{"text":"Although the symmetric RL loss introduces three additional hyperparameters (Equation ","element":"span"},{"href":"#id-53","text":"4.2","element":"a"},{"text":"): ","element":"span"},{"style":{"height":14.4},"width":71.47,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-4.png","element":"img","alt":" α, β","inline":true},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z","element":"span"},{"text":", we simply fix ","element":"span"},{"style":{"height":11.2},"width":137.1,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-5.png","element":"img","alt":" α = 0.5","inline":true,"padRight":true},{"text":"in all experiments to reduce the overall magnitude of the symmetric loss. 
Additionally, since ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-6.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"are constants that are multiplied together, we can fix one and adjust the other. For example, ","element":"span"},{"style":{"height":16},"width":359.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-7.png","element":"img","alt":" (β = 1.0, Z = −1.0)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":374.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-8.png","element":"img","alt":" (β = 10.0, Z = −0.1)","inline":true,"padRight":true},{"text":"yield the same results. In our experiments, we fix ","element":"span"},{"style":{"height":10.8},"width":134.18,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-9.png","element":"img","alt":" Z = −1","inline":true,"padRight":true},{"text":"and adjust ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/7-10.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"to determine the influence of the reverse RL loss.","element":"span"}],[{"text":"We test the sensitivity of ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/8-0.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"for SPPO on Atari games with and without noise in the rewards. Table ","element":"span"},{"href":"#id-32","text":"11 ","element":"a"},{"text":"presents the percentage improvements compared to PPO. We exclude excessively large improvements (e.g., 2000%) to avoid skewing the average. 
These significant improvements typically result from PPO’s training failure, while SPPO remains stable (Gopher and WizardOfWor in Figure ","element":"span"},{"href":"#id-65","text":"4","element":"a"},{"text":"). Fixing ","element":"span"},{"style":{"height":11.2},"width":134.28,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/8-1.png","element":"img","alt":" α = 0.5","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":134.19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/8-2.png","element":"img","alt":"Z = −1","inline":true},{"text":", we vary ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/8-3.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"and observe consistent improvements, demonstrating SPPO’s robustness across hyperparameters. Also, we use the default values of Stable Baselines3 (","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"Raffin ","element":"a"},{"href":"#id-60","referenceIndex":35,"text":"et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"2021","element":"a"},{"text":") for the other RL hyperparameters; more details can be found in Appendix ","element":"span"},{"href":"#id-83","text":"C.1","element":"a"},{"text":".","element":"span"}],[{"text":"The symmetric RL loss introduces the reverse RL loss term, which is essentially another form of cross-entropy that does not significantly increase training time. 
In practice, there is no increase in training time for the continuous tasks discussed in Section ","element":"span"},{"href":"#id-59","text":"5.2 ","element":"a"},{"text":"and the LLM tasks in Section ","element":"span"},{"href":"#id-84","text":"5.3","element":"a"},{"text":", and a 10–20% increase for the Atari games in Section ","element":"span"},{"href":"#id-58","text":"5.1","element":"a"},{"text":".","element":"span"}]]},{"heading":"6. Conclusion","paragraphs":[[{"text":"We present the Symmetric RL loss, inspired by Symmetric Cross Entropy (SCE) (","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"Wang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":54,"text":"2019","element":"a"},{"text":") from supervised learning, to enhance RL robustness. By incorporating reverse information through SCE, we develop SA2C and SPPO, extending the standard A2C and PPO algorithms. We test SA2C and SPPO on various discrete and continuous action space tasks and further evaluate SPPO on RLHF tasks like IMDB positive sentiment and TL;DR summarization. Our results show that SPPO consistently outperforms PPO. We attribute this to PPO’s off-policy components and advantage normalization with small batch sizes, which cause advantage sign changes (confusion). SCE helps stabilize training, addressing these challenges.","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"The authors would like to thank the Ohio Supercomputer Center (","element":"span"},{"href":"#id-85","referenceIndex":6,"text":"Center","element":"a"},{"text":", ","element":"span"},{"href":"#id-85","referenceIndex":6,"text":"1987","element":"a"},{"text":") for providing the computational resources used in this research.","element":"span"}]]},{"heading":"Impact Statement","paragraphs":[[{"text":"This paper presents work whose goal is to advance the field of Machine Learning. 
There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-11","text":"Ahmadian, A., Cremer, C., Gall","element":"span"},{"text":"é, M., Fadaee, M., Kreutzer, J., Pietquin, O., ","element":"span"},{"text":"Ü","element":"span"},{"text":"stün, A., and Hooker, S. Back to basics: Revisiting reinforce style optimization for learning from","element":"span"}],[{"text":"human feedback in llms, 2024.","element":"span"}],[{"id":"id-7","text":"Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., ","element":"span"},{"text":"Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., Mercado, N., DasSarma, N., Lasenby, R., Larson, R., Ringer, S., Johnston, S., Kravec, S., Showk, S. E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S. R., Hatfield-Dodds, Z., Mann, B., Amodei, D., Joseph, N., McCandlish, S., Brown, T., and Kaplan, J. Constitutional ai: Harmlessness from ai feedback, 2022.","element":"span"}],[{"id":"id-42","text":"Byun, J.-S. and Perrault, A. Normality-guided distributional ","element":"span"},{"text":"reinforcement learning for continuous control, 2024. URL ","element":"span"},{"href":"https://arxiv.org/abs/2208.13125","text":"https://arxiv.org/abs/2208.13125","element":"a"},{"text":".","element":"span"}],[{"id":"id-21","text":"Byun, J.-S., Chun, J., Kil, J., and Perrault, A. Ares: ","element":"span"},{"text":"Alternating reinforcement learning and supervised fine-tuning for enhanced multi-modal chain-of-thought reasoning through diverse ai feedback, 2024. 
URL ","element":"span"},{"href":"https://arxiv.org/abs/2407.00087","text":"https:","element":"a"},{"href":"https://arxiv.org/abs/2407.00087","text":"//arxiv.org/abs/2407.00087","element":"a"},{"text":".","element":"span"}],[{"id":"id-29","text":"Catto, E. Box2d, a 2d physics engine for games, 2011. URL ","element":"span"},{"href":"http://box2d.org","text":"http://box2d.org","element":"a"},{"text":".","element":"span"}],[{"id":"id-85","text":"Center, O. S. ","element":"span"},{"text":"Ohio supercomputer center, 1987. ","element":"span"},{"text":"URL ","element":"span"},{"href":"http://osc.edu/ark:/19495/f5s1ph73","text":"http://osc.edu/ark:/19495/f5s1ph73","element":"a"},{"text":".","element":"span"}],[{"id":"id-24","text":"Chakraborty, S., Qiu, J., Yuan, H., Koppel, A., Huang, F., ","element":"span"},{"text":"Manocha, D., Bedi, A. S., and Wang, M. Maxmin-rlhf: Towards equitable alignment of large language models with diverse human preferences, 2024.","element":"span"}],[{"id":"id-61","text":"Chang, J. D., Brantley, K., Ramamurthy, R., Misra, D., ","element":"span"},{"text":"and Sun, W. ","element":"span"},{"text":"Tril: Transformers reinforcement and imitation learning library. ","element":"span"},{"href":"https://github.com/Cornell-RL/tril","text":"https://github.com/","element":"a"},{"href":"https://github.com/Cornell-RL/tril","text":"Cornell-RL/tril","element":"a"},{"text":", 2023.","element":"span"}],[{"id":"id-38","text":"Chen, X., Zhu, Z., and Perrault, A. The distributional ","element":"span"},{"text":"reward critic architecture for perturbed-reward reinforcement learning, 2024. URL ","element":"span"},{"href":"https://arxiv.org/abs/2401.05710","text":"https://arxiv.org/","element":"a"},{"href":"https://arxiv.org/abs/2401.05710","text":"abs/2401.05710","element":"a"},{"text":".","element":"span"}],[{"id":"id-0","text":"Chung, H. 
W., Hou, L., Longpre, S., Zoph, B., Tay, Y., ","element":"span"},{"text":"Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models, 2022.","element":"span"}],[{"id":"id-3","text":"Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, ","element":"span"},{"text":"A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., and Florence, P. Palm-e: An embodied multimodal language model, 2023.","element":"span"}],[{"id":"id-23","text":"Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and ","element":"span"},{"text":"Kiela, D. Kto: Model alignment as prospect theoretic optimization, 2024.","element":"span"}],[{"id":"id-34","text":"Ghosh, A., Kumar, H., and Sastry, P. S. Robust loss ","element":"span"},{"text":"functions under label noise for deep neural networks, 2017.","element":"span"}],[{"id":"id-19","text":"Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-","element":"span"},{"text":"critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.","element":"span"}],[{"id":"id-78","text":"Havrilla, A., Zhuravinskyi, M., Phung, D., Tiwari, A., ","element":"span"},{"text":"Tow, J., Biderman, S., Anthony, Q., and Castricato, L. ","element":"span"},{"text":"trlX: A framework for large scale reinforcement learning from human feedback. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","element":"span"},{"text":", pp. 
8578–8595, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.530. URL ","element":"span"},{"href":"https://aclanthology.org/2023.emnlp-main.530","text":"https://","element":"a"},{"href":"https://aclanthology.org/2023.emnlp-main.530","text":"aclanthology.org/2023.emnlp-main.530","element":"a"},{"text":".","element":"span"}],[{"id":"id-72","text":"Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, ","element":"span"},{"text":"S., and Chen, W. Lora: Low-rank adaptation of large language models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2106.09685, 2021. URL ","element":"span"},{"href":"https://arxiv.org/abs/2106.09685","text":"https://arxiv.org/abs/2106.09685","element":"a"},{"text":".","element":"span"}],[{"id":"id-5","text":"Huang, K., Altosaar, J., and Ranganath, R. Clinicalbert: ","element":"span"},{"text":"Modeling clinical notes and predicting hospital readmission, 2020.","element":"span"}],[{"id":"id-2","text":"Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. ","element":"span"},{"text":"Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022.","element":"span"}],[{"id":"id-6","text":"Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., ","element":"span"},{"text":"Lu, K., Bishop, C., Hall, E., Carbune, V., Rastogi, A., and Prakash, S. Rlaif: Scaling reinforcement learning from human feedback with ai feedback, 2023a.","element":"span"}],[{"id":"id-20","text":"Lee, K., Liu, H., Ryu, M., Watkins, O., Du, Y., Boutilier, ","element":"span"},{"text":"C., Abbeel, P., Ghavamzadeh, M., and Gu, S. S. Aligning text-to-image models using human feedback, 2023b.","element":"span"}],[{"id":"id-4","text":"Lee, P., Bubeck, S., and Petro, J. Benefits, limits, and risks ","element":"span"},{"text":"of gpt-4 as an ai chatbot for medicine. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"New England Journal of Medicine","element":"span"},{"text":", 388(13):1233–1239, 2023c. doi: 10.1056/NEJMsr2214184. URL ","element":"span"},{"href":"https://www.nejm.org/doi/full/10.1056/NEJMsr2214184","text":"https://www.nejm. ","element":"a"},{"href":"https://www.nejm.org/doi/full/10.1056/NEJMsr2214184","text":"org/doi/full/10.1056/NEJMsr2214184","element":"a"},{"text":".","element":"span"}],[{"id":"id-43","text":"Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, ","element":"span"},{"text":"B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step, 2023.","element":"span"}],[{"id":"id-36","text":"Ma, X., Huang, H., Wang, Y., Romano, S., Erfani, S. M., ","element":"span"},{"text":"and Bailey, J. Normalized loss functions for deep learning with noisy labels. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2006.13554, 2020. URL ","element":"span"},{"href":"https://arxiv.org/abs/2006.13554","text":"https://arxiv.org/abs/2006.13554","element":"a"},{"text":".","element":"span"}],[{"id":"id-30","text":"Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., ","element":"span"},{"text":"and Potts, C. Learning word vectors for sentiment analysis. In Lin, D., Matsumoto, Y., and Mihalcea, R. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies","element":"span"},{"text":", pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL ","element":"span"},{"href":"https://aclanthology.org/P11-1015","text":"https://aclanthology.org/P11-1015","element":"a"},{"text":".","element":"span"}],[{"id":"id-41","text":"Mai, V., Mani, K., and Paull, L. Sample efficient deep ","element":"span"},{"text":"reinforcement learning via uncertainty estimation, 2022. 
URL ","element":"span"},{"href":"https://arxiv.org/abs/2201.01666","text":"https://arxiv.org/abs/2201.01666","element":"a"},{"text":".","element":"span"}],[{"id":"id-57","text":"Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., ","element":"span"},{"text":"Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing atari with deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1312.5602, 2013. ","element":"span"},{"text":"URL ","element":"span"},{"href":"http://arxiv.org/abs/1312.5602","text":"http://arxiv.org/","element":"a"},{"href":"http://arxiv.org/abs/1312.5602","text":"abs/1312.5602","element":"a"},{"text":".","element":"span"}],[{"id":"id-9","text":"Mnih, V., Badia, A. P., Mirza, M., Graves, A., ","element":"span"},{"text":"Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1602.01783, 2016. URL ","element":"span"},{"href":"http://arxiv.org/abs/1602.01783","text":"http://arxiv.","element":"a"},{"href":"http://arxiv.org/abs/1602.01783","text":"org/abs/1602.01783","element":"a"},{"text":".","element":"span"}],[{"id":"id-39","text":"Obando-Ceron, J., Bellemare, M. G., and Castro, P. S. Small ","element":"span"},{"text":"batch deep reinforcement learning, 2023. URL ","element":"span"},{"href":"https://arxiv.org/abs/2310.03882","text":"https:","element":"a"},{"href":"https://arxiv.org/abs/2310.03882","text":"//arxiv.org/abs/2310.03882","element":"a"},{"text":".","element":"span"}],[{"id":"id-74","text":"OpenAI. Gpt-4 technical report, 2024. 
URL ","element":"span"},{"href":"https://arxiv.org/abs/2303.08774","text":"https:// ","element":"a"},{"href":"https://arxiv.org/abs/2303.08774","text":"arxiv.org/abs/2303.08774","element":"a"},{"text":".","element":"span"}],[{"id":"id-12","text":"Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, ","element":"span"},{"text":"C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022.","element":"span"}],[{"id":"id-45","text":"Puterman, M. L. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Markov decision processes: discrete stochastic dynamic programming","element":"span"},{"text":". John Wiley & Sons, 2014.","element":"span"}],[{"id":"id-70","text":"Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., ","element":"span"},{"text":"Sutskever, I., et al. Language models are unsupervised multitask learners. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"OpenAI blog","element":"span"},{"text":", 1(8):9, 2019.","element":"span"}],[{"id":"id-13","text":"Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, ","element":"span"},{"text":"C. D., and Finn, C. Direct preference optimization: Your language model is secretly a reward model, 2023.","element":"span"}],[{"id":"id-89","text":"Raffin, A. Rl baselines3 zoo. ","element":"span"},{"href":"https://github.com/DLR-RM/rl-baselines3-zoo","text":"https://github.com/ ","element":"a"},{"href":"https://github.com/DLR-RM/rl-baselines3-zoo","text":"DLR-RM/rl-baselines3-zoo","element":"a"},{"text":", 2020.","element":"span"}],[{"id":"id-60","text":"Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, ","element":"span"},{"text":"M., and Dormann, N. Stable-baselines3: Reliable reinforcement learning implementations. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Machine Learning Research","element":"span"},{"text":", 22(268):1–8, 2021. ","element":"span"},{"text":"URL ","element":"span"},{"href":"http://jmlr.org/papers/v22/20-1364.html","text":"http: ","element":"a"},{"href":"http://jmlr.org/papers/v22/20-1364.html","text":"//jmlr.org/papers/v22/20-1364.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-76","text":"Ramamurthy, R., Ammanabrolu, P., Brantley, K., Hessel, ","element":"span"},{"text":"J., Sifa, R., Bauckhage, C., Hajishirzi, H., and Choi, Y. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization, 2023.","element":"span"}],[{"id":"id-69","text":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, ","element":"span"},{"text":"a distilled version of BERT: smaller, faster, cheaper and lighter. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1910.01108, 2019. URL ","element":"span"},{"href":"http://arxiv.org/abs/1910.01108","text":"http:// ","element":"a"},{"href":"http://arxiv.org/abs/1910.01108","text":"arxiv.org/abs/1910.01108","element":"a"},{"text":".","element":"span"}],[{"id":"id-40","text":"Schaul, T., Barreto, A., Quan, J., and Ostrovski, G. The ","element":"span"},{"text":"phenomenon of policy churn, 2022. URL ","element":"span"},{"href":"https://arxiv.org/abs/2206.00730","text":"https:// ","element":"a"},{"href":"https://arxiv.org/abs/2206.00730","text":"arxiv.org/abs/2206.00730","element":"a"},{"text":".","element":"span"}],[{"id":"id-18","text":"Schulman, J., Levine, S., Abbeel, P., Jordan, M., and ","element":"span"},{"text":"Moritz, P. ","element":"span"},{"text":"Trust region policy optimization. ","element":"span"},{"text":"In Bach, F. and Blei, D. 
(eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 32nd International Conference on Machine Learning","element":"span"},{"text":", volume 37 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pp. 1889–1897, Lille, France, 07–09 Jul 2015. PMLR. URL ","element":"span"},{"href":"https://proceedings.mlr.press/v37/schulman15.html","text":"https://proceedings.mlr.press/v37/ ","element":"a"},{"href":"https://proceedings.mlr.press/v37/schulman15.html","text":"schulman15.html","element":"a"},{"text":".","element":"span"}],[{"id":"id-10","text":"Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and ","element":"span"},{"text":"Klimov, O. Proximal policy optimization algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1707.06347, 2017. URL ","element":"span"},{"href":"http://arxiv.org/abs/1707.06347","text":"http://arxiv. ","element":"a"},{"href":"http://arxiv.org/abs/1707.06347","text":"org/abs/1707.06347","element":"a"},{"text":".","element":"span"}],[{"id":"id-17","text":"Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, ","element":"span"},{"text":"P. High-dimensional continuous control using generalized advantage estimation, 2018.","element":"span"}],[{"id":"id-73","text":"Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., ","element":"span"},{"text":"Voss, C., Radford, A., Amodei, D., and Christiano, P. F. Learning to summarize from human feedback. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/2009.01325, 2020. URL ","element":"span"},{"href":"https://arxiv.org/abs/2009.01325","text":"https://arxiv.org/ ","element":"a"},{"href":"https://arxiv.org/abs/2009.01325","text":"abs/2009.01325","element":"a"},{"text":".","element":"span"}],[{"id":"id-15","text":"Sutton, R. S. and Barto, A. G. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement Learning: An Introduction","element":"span"},{"text":". MIT Press, Cambridge, MA, USA, 2018a. ISBN 978-0262039246.","element":"span"}],[{"id":"id-46","text":"Sutton, R. S. and Barto, A. G. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Reinforcement learning: An introduction","element":"span"},{"text":". MIT press, 2018b.","element":"span"}],[{"id":"id-14","text":"Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, ","element":"span"},{"text":"Y. Policy gradient methods for reinforcement learning with function approximation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", pp. 1057–1063, 2000.","element":"span"}],[{"id":"id-52","text":"Tang, Y. and Agrawal, S. Discretizing continuous action ","element":"span"},{"text":"space for on-policy optimization, 2020.","element":"span"}],[{"id":"id-28","text":"Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics ","element":"span"},{"text":"engine for model-based control. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2012 IEEE/RSJ International Conference on Intelligent Robots and Systems","element":"span"},{"text":", pp. 5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.","element":"span"}],[{"id":"id-16","text":"van Hasselt, H., Guez, A., and Silver, D. Deep reinforce- ","element":"span"},{"text":"ment learning with double q-learning, 2015.","element":"span"}],[{"id":"id-31","text":"V","element":"span"},{"text":"¨olske, M., Potthast, M., Syed, S., and Stein, B. TL;DR: Mining Reddit to learn automatic summarization. In Wang, L., Cheung, J. C. K., Carenini, G., and Liu, F. (eds.), ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Workshop on New Frontiers in Summarization","element":"span"},{"text":", pp. 59–63, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. 
doi: 10.18653/v1/W17-4508. URL ","element":"span"},{"href":"https://aclanthology.org/W17-4508","text":"https: ","element":"a"},{"href":"https://aclanthology.org/W17-4508","text":"//aclanthology.org/W17-4508","element":"a"},{"text":".","element":"span"}],[{"id":"id-77","text":"von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., ","element":"span"},{"text":"Thrush, T., Lambert, N., and Huang, S. ","element":"span"},{"text":"Trl: Transformer reinforcement learning. ","element":"span"},{"href":"https://github.com/huggingface/trl","text":"https://github. ","element":"a"},{"href":"https://github.com/huggingface/trl","text":"com/huggingface/trl","element":"a"},{"text":", 2020.","element":"span"}],[{"id":"id-71","text":"Wang, ","element":"span"},{"text":"B. ","element":"span"},{"text":"and ","element":"span"},{"text":"Komatsuzaki, ","element":"span"},{"text":"A. ","element":"span"},{"text":"GPT-J-6B: ","element":"span"},{"text":"A 6 ","element":"span"},{"text":"Billion ","element":"span"},{"text":"Parameter ","element":"span"},{"text":"Autoregressive ","element":"span"},{"text":"Language Model. ","element":"span"},{"href":"https://github.com/kingoflolz/mesh-transformer-jax","text":"https://github.com/kingoflolz/ ","element":"a"},{"href":"https://github.com/kingoflolz/mesh-transformer-jax","text":"mesh-transformer-jax","element":"a"},{"text":", May 2021.","element":"span"}],[{"id":"id-37","text":"Wang, J., Liu, Y., and Li, B. Reinforcement learning with ","element":"span"},{"text":"perturbed rewards. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1810.01032, 2018. URL ","element":"span"},{"href":"http://arxiv.org/abs/1810.01032","text":"http://arxiv.org/abs/1810.01032","element":"a"},{"text":".","element":"span"}],[{"id":"id-44","text":"Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, ","element":"span"},{"text":"D., Wu, Y., and Sui, Z. 
","element":"span"},{"text":"Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024.","element":"span"}],[{"id":"id-26","text":"Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., and Bailey, J. ","element":"span"},{"text":"Symmetric cross entropy for robust learning with noisy labels. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","element":"span"},{"text":", pp. 322–330, 2019. doi: 10. 1109/ICCV.2019.00041.","element":"span"}],[{"id":"id-1","text":"Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., ","element":"span"},{"text":"Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models, 2023.","element":"span"}],[{"id":"id-8","text":"Williams, R. J. Simple statistical gradient-following algo- ","element":"span"},{"text":"rithms for connectionist reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine learning","element":"span"},{"text":", 8(3-4):229–256, 1992.","element":"span"}],[{"id":"id-35","text":"Zhang, Z. and Sabuncu, M. R. Generalized cross entropy ","element":"span"},{"text":"loss for training deep neural networks with noisy labels. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1805.07836, 2018. URL ","element":"span"},{"href":"http://arxiv.org/abs/1805.07836","text":"http://arxiv. ","element":"a"},{"href":"http://arxiv.org/abs/1805.07836","text":"org/abs/1805.07836","element":"a"},{"text":".","element":"span"}]]},{"heading":"A. 
Gradient of RL loss and reverse RL loss","paragraphs":[[{"text":"Suppose there exist ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"actions, and ","element":"span"},{"style":{"height":14.18},"width":56.79,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-0.png","element":"img","alt":" a(i)","inline":true,"padRight":true},{"text":"indicates the ","element":"span"},{"style":{"height":13.38},"width":44.17,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-1.png","element":"img","alt":" ith","inline":true,"padRight":true},{"text":"action. Let ","element":"span"},{"style":{"height":21.49},"width":277.83,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-2.png","element":"img","alt":" π(i)θ = πθ(a(i)|s)","inline":true,"padRight":true},{"text":"denote the policy for a state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s","element":"span"},{"text":". The set ","element":"span"},{"style":{"height":21.49},"width":493.91,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-3.png","element":"img","alt":" πθ(s) = {π(1)θ , π(2)θ , . . . , π(k)θ }","inline":true,"padRight":true},{"text":"represents the possible action probabilities set for ","element":"span"},{"style":{"height":14.58},"width":108.14,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-4.png","element":"img","alt":" s. A(i)","inline":true,"padRight":true},{"text":"indicates the corresponding ","element":"span"},{"text":"advantage of the sampled action ","element":"span"},{"style":{"height":14.99},"width":268.04,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-5.png","element":"img","alt":" a(i) for s. 
Z < 0","inline":true,"padRight":true},{"text":"is a constant used in the reverse RL loss to handle the computational issue where ","element":"span"},{"style":{"height":14},"width":202.18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-6.png","element":"img","alt":" log 0 = −∞","inline":true},{"text":". For simplicity of notation, we drop ","element":"span"},{"style":{"height":13.2},"width":164.88,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-7.png","element":"img","alt":" θ, s, and a","inline":true,"padRight":true},{"text":"from the policy ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-8.png","element":"img","alt":" π","inline":true},{"text":". Note that ","element":"span"},{"style":{"height":14.19},"width":173.06,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-9.png","element":"img","alt":" A(i) and Z","inline":true,"padRight":true},{"text":"are not involved with the gradient as they are constants with respect to ","element":"span"},{"style":{"height":11.2},"width":29.82,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-10.png","element":"img","alt":" θ.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"A.1. 
A2C Loss","element":"span"}],[{"text":"The derivation of the A2C loss ","element":"span"},{"style":{"height":13.19},"width":72.92,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-11.png","element":"img","alt":" La2c","inline":true,"padRight":true},{"text":"with respect to logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"is presented as follows: For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"63%"},"width":1235,"height":303,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-12.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":15.2},"width":97.82,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-13.png","element":"img","alt":" i ̸= y,","inline":true}],[{"style":{"width":"100%"},"width":940,"height":429,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-14.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"24%"},"width":228,"height":375,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-15.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":15.2},"width":97.83,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-16.png","element":"img","alt":" i ̸= y,","inline":true}],[{"style":{"width":"21%"},"width":201,"height":375,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-17.png","element":"img"}],[{"text":"In summary, we have the following form for 
","element":"span"},{"style":{"height":18.19},"width":265.14,"height":45.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/11-18.png","element":"img","alt":" La2c(π(i), A(i)):","inline":true}],[{"id":"id-88","style":{"fontWeight":"bold"},"text":"A.2. Reverse A2C Loss","element":"span"}],[{"text":"The derivation of the reverse A2C loss ","element":"span"},{"style":{"height":13.19},"width":85.36,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-0.png","element":"img","alt":" Lra2c","inline":true,"padRight":true},{"text":"with respect to logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"is presented as follows:","element":"span"}],[{"style":{"width":"74%"},"width":1451,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-1.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.38},"width":316.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-2.png","element":"img","alt":" i = y and A(i) > 0,","inline":true}],[{"id":"id-86","style":{"width":"36%"},"width":339,"height":496,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-3.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.78},"width":316.42,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-4.png","element":"img","alt":" i ̸= y and A(i) > 0,","inline":true}],[{"text":"For ","element":"span"},{"style":{"height":17.39},"width":306.5,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-5.png","element":"img","alt":" i = y and A(i) < 0","inline":true},{"text":", the only difference from Equation ","element":"span"},{"href":"#id-86","text":"21 ","element":"a"},{"text":"is the negative sign, 
thus:","element":"span"}],[{"id":"id-87","style":{"width":"66%"},"width":1292,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-6.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.79},"width":306.5,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-7.png","element":"img","alt":" i ̸= y and A(i) < 0","inline":true},{"text":", the only difference from Equation ","element":"span"},{"href":"#id-87","text":"22 ","element":"a"},{"text":"is the negative sign, thus:","element":"span"}],[{"style":{"width":"63%"},"width":1239,"height":96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-8.png","element":"img"}],[{"text":"In summary, we have the following form for ","element":"span"},{"style":{"height":18.18},"width":265.14,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/12-9.png","element":"img","alt":" La2c(π(i), A(i)):","inline":true}],[{"style":{"fontWeight":"bold"},"text":"A.3. PPO Loss","element":"span"}],[{"text":"The derivation of the PPO loss ","element":"span"},{"style":{"height":15.59},"width":79.19,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-0.png","element":"img","alt":" Lppo","inline":true,"padRight":true},{"text":"with respect to the logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"is presented as follows. The PPO loss includes a clipping function and a minimum operation. 
When these conditions are not satisfied, there is no gradient.","element":"span"}],[{"text":"The sample-wise PPO loss is:","element":"span"}],[{"style":{"width":"26%"},"width":252,"height":49,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-1.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"28%"},"width":267,"height":307,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-2.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":15.2},"width":97.83,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-3.png","element":"img","alt":" i ̸= y,","inline":true}],[{"style":{"width":"23%"},"width":220,"height":307,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-4.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"A.4. Reverse PPO Loss","element":"span"}],[{"text":"The derivation of the reverse PPO loss ","element":"span"},{"style":{"height":15.59},"width":91.63,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-5.png","element":"img","alt":" Lrppo","inline":true,"padRight":true},{"text":"with respect to logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"is presented as follows. 
As with PPO, the reverse PPO loss only considers samples that pass the clipping function and the minimum operation.","element":"span"}],[{"text":"From Section ","element":"span"},{"href":"#id-88","text":"A.2","element":"a"},{"text":", we have the following form for ","element":"span"},{"style":{"height":21.49},"width":368.53,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-6.png","element":"img","alt":" Lrppo(π(i), A(i), π(i)old):","inline":true}],[{"style":{"width":"80%"},"width":1559,"height":305,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-7.png","element":"img"}]]},{"heading":"B. Gradient Analysis of RL Loss and Reverse RL Loss","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"B.1. Symmetric A2C Gradient Analysis","element":"span"}],[{"text":"The gradient analysis of the symmetric RL loss follows the symmetric cross entropy (SCE) analysis of Wang et al. (2019). We adopt that analysis and extend it to the RL losses. We set ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-8.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-9.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"to 1 for simplicity and evaluate the gradient direction of both the RL and reverse RL losses with respect to the logits ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":". We show that the two losses have the same gradient direction and that the reverse RL loss helps push the policy away from ambiguous predictions, where the probability is around 0.5. We first show how the symmetric A2C (SA2C) loss behaves. 
Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z < ","element":"span"},{"text":"0 ","element":"span"},{"text":"is a constant used in the reverse RL loss to handle ","element":"span"},{"style":{"height":14},"width":212.02,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-10.png","element":"img","alt":" log 0 = −∞.","inline":true}],[{"style":{"width":"58%"},"width":1148,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/13-11.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.38},"width":316.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-0.png","element":"img","alt":" i = y and A(i) > 0,","inline":true}],[{"style":{"width":"52%"},"width":489,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-1.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.79},"width":316.42,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-2.png","element":"img","alt":" i ̸= y and A(i) > 0,","inline":true}],[{"style":{"width":"42%"},"width":402,"height":204,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-3.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.38},"width":316.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-4.png","element":"img","alt":" i = y and A(i) < 0,","inline":true}],[{"style":{"width":"52%"},"width":495,"height":214,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-5.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.78},"width":316.42,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-6.png","element":"img","alt":" i ̸= y and A(i) < 0,","inline":true}],[{"text":"For the above cases, the gradient directions of the RL (A2C) loss and the reverse RL 
(RA2C) loss are the same as SCE gradients. Essentially, the RA2C loss acts as an accelerator. In the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":14.98},"width":141.29,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-7.png","element":"img","alt":" A(i) > 0","inline":true},{"text":", the gradient of the RA2C loss is ","element":"span"},{"style":{"height":18.18},"width":364.45,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-8.png","element":"img","alt":" −A(i)Zπ(y)(π(y) − 1)","inline":true},{"text":", with the largest gradient magnitude at ","element":"span"},{"style":{"height":14.18},"width":172.34,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-9.png","element":"img","alt":" π(y) = 0.5","inline":true,"padRight":true},{"text":"as a parabolic function. In other words, the accelerator helps the probability ","element":"span"},{"style":{"height":14.19},"width":59.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-10.png","element":"img","alt":" π(i) ","inline":true,"padRight":true},{"text":"increase most quickly when it is ambiguous which action to take. 
In the case of ","element":"span"},{"style":{"height":15.2},"width":86.86,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-11.png","element":"img","alt":" i ̸= y","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.99},"width":146.14,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-12.png","element":"img","alt":" A(i) > 0","inline":true},{"text":", the probability of other actions except ","element":"span"},{"style":{"height":14.19},"width":56.79,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-13.png","element":"img","alt":" a(i)","inline":true,"padRight":true},{"text":"is reduced, and this reduction is influenced by the confidence of both ","element":"span"},{"style":{"height":14.18},"width":59.88,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-14.png","element":"img","alt":" π(i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-15.png","element":"img","alt":" π(y)","inline":true},{"text":". Specifically, the gradient of the RA2C loss is ","element":"span"},{"style":{"height":14.18},"width":257.19,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-16.png","element":"img","alt":" −A(i)Zπ(y)π(i)","inline":true},{"text":". 
When both ","element":"span"},{"style":{"height":14.18},"width":59.87,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-17.png","element":"img","alt":" π(i)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-18.png","element":"img","alt":" π(y)","inline":true,"padRight":true},{"text":"are 0.5, indicating the most ambiguous predictions, the accelerator helps the A2C loss reduce ","element":"span"},{"style":{"height":14.18},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-19.png","element":"img","alt":" π(y) ","inline":true,"padRight":true},{"text":"most aggressively.","element":"span"}],[{"text":"When ","element":"span"},{"style":{"height":14.98},"width":143.97,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-20.png","element":"img","alt":" A(i) < 0","inline":true},{"text":", the gradient direction is simply reversed. The behavior of the gradient itself remains the same as when ","element":"span"},{"style":{"height":14.99},"width":141.2,"height":37.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-21.png","element":"img","alt":"A(i) > 0","inline":true},{"text":". In the case of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":", RA2C decreases the probability ","element":"span"},{"style":{"height":14.59},"width":330.64,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-22.png","element":"img","alt":" π(y) more when π(y) ","inline":true,"padRight":true},{"text":"is around 0.5. 
For ","element":"span"},{"style":{"height":15.2},"width":86.86,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-23.png","element":"img","alt":" i ̸= y","inline":true},{"text":", RA2C helps increase ","element":"span"},{"style":{"height":14.18},"width":65.76,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-24.png","element":"img","alt":" π(y) ","inline":true,"padRight":true},{"text":"more when both ","element":"span"},{"style":{"height":14.58},"width":205.55,"height":36.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-25.png","element":"img","alt":" π(i) and π(y) ","inline":true,"padRight":true},{"text":"are ambiguous (both around ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"5","element":"span"},{"text":").","element":"span"}],[{"id":"id-56","style":{"fontWeight":"bold"},"text":"B.2. Symmetric PPO Gradient Analysis","element":"span"}],[{"style":{"width":"16%"},"width":151,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-26.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.38},"width":316.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/14-27.png","element":"img","alt":" i = y and A(i) > 0,","inline":true}],[{"text":"For ","element":"span"},{"style":{"height":17.78},"width":316.42,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-0.png","element":"img","alt":" i ̸= y and A(i) > 0,","inline":true}],[{"style":{"width":"46%"},"width":439,"height":282,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-1.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.38},"width":316.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-2.png","element":"img","alt":" i = y and A(i) < 
0,","inline":true}],[{"style":{"width":"57%"},"width":542,"height":282,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-3.png","element":"img"}],[{"text":"For ","element":"span"},{"style":{"height":17.79},"width":316.42,"height":44.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-4.png","element":"img","alt":" i ≠ y and A(i) < 0,","inline":true}],[{"text":"The mechanism of RPPO is the same as that of RA2C, except for the term ","element":"span"},{"style":{"height":21.49},"width":65.63,"height":53.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/15-5.png","element":"img","alt":" π(i)old","inline":true},{"text":", which does not change the gradient sign. Therefore, ","element":"span"},{"text":"RPPO also helps PPO move away from ambiguous predictions, acting as an accelerator.","element":"span"}]]},{"heading":"C. Experimental Setups and Results","paragraphs":[[{"id":"id-83","style":{"fontWeight":"bold"},"text":"C.1. Hyperparameters","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Atari Games: ","element":"span"},{"text":"We primarily follow the hyperparameter settings of RL Baselines3 Zoo (","element":"span"},{"href":"#id-89","referenceIndex":34,"text":"Raffin","element":"a"},{"text":", ","element":"span"},{"href":"#id-89","referenceIndex":34,"text":"2020","element":"a"},{"text":"). Most hyperparameter values remain unchanged across environments. Only ","element":"span"},{"style":{"height":14.4},"width":125.44,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-0.png","element":"img","alt":" α and β","inline":true,"padRight":true},{"text":"are adjusted for the reverse RL loss. For SA2C without noise, we use ","element":"span"},{"style":{"height":16},"width":307.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-1.png","element":"img","alt":" (α = 0.5, β = 5.0)","inline":true,"padRight":true},{"text":"for all environments. 
For SA2C with noise, we use ","element":"span"},{"style":{"height":16},"width":307.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-2.png","element":"img","alt":" (α = 0.5, β = 1.0)","inline":true,"padRight":true},{"text":"for (Alien, MsPacman, Qbert, TimePilot, VideoPinball, Assault, Gravitar, StarGunner, UpNDown), and ","element":"span"},{"style":{"height":16},"width":307.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-3.png","element":"img","alt":" (α = 0.5, β = 1.0)","inline":true,"padRight":true},{"text":"for the other environments. For SPPO without noise, we use ","element":"span"},{"style":{"height":16},"width":307.63,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-4.png","element":"img","alt":" (α = 0.5, β = 1.0)","inline":true,"padRight":true},{"text":"for all environments. For SPPO with noise, we use ","element":"span"},{"style":{"height":16},"width":327.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-5.png","element":"img","alt":" (α = 0.5, β = 10.0)","inline":true,"padRight":true},{"text":"for all environments. We do not use any GPU for Atari games.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Table 4. 
","element":"figcaption","subtype":"caption"},{"text":"Hyperparameters for Atari games","element":"figcaption","subtype":"caption"}],[{"style":{"width":"86%"},"width":1685,"height":147,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-6.png","element":"img"}],[{"text":"STARGUNNER, TIMEPILOT, UPNDOWN, VIDEOPINBALL) - (","element":"span"},{"style":{"height":13.2},"width":371.02,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-7.png","element":"img","alt":"α = 0.5, β = 5.0) A","inline":true},{"text":"LL ENVIRONMENTS ALL OTHERS EXCEPT THOSE MENTIONED ABOVE","element":"span"}],[{"style":{"width":"86%"},"width":1685,"height":144,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-8.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"MuJoCo and Box2D: ","element":"span"},{"text":"We use n_envs ","element":"span"},{"text":"= 4 ","element":"span"},{"text":"and n_steps ","element":"span"},{"text":"= 8 ","element":"span"},{"text":"for A2C and SA2C. We follow Stable-Baselines3’s default hyperparameters (","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"Raffin et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":35,"text":"2021","element":"a"},{"text":") for other settings. 
Only ","element":"span"},{"style":{"height":14.4},"width":122.94,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-9.png","element":"img","alt":" α and β","inline":true,"padRight":true},{"text":"are adjusted for the reverse RL loss. To keep the table compact, let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"Ant ","element":"span"},{"text":"= 1","element":"span"},{"text":", BipedalWalker ","element":"span"},{"text":"= 2","element":"span"},{"text":", HalfCheetah ","element":"span"},{"text":"= 3","element":"span"},{"text":", Hopper ","element":"span"},{"text":"= 4","element":"span"},{"text":", HumanoidStandup ","element":"span"},{"text":"= 5","element":"span"},{"text":", InvertedDoublePendulum ","element":"span"},{"text":"= 6","element":"span"},{"text":", LunarLanderContinuous ","element":"span"},{"text":"= 7","element":"span"},{"text":", Swimmer ","element":"span"},{"text":"= 8","element":"span"},{"text":", Walker2d ","element":"span"},{"text":"= 9","element":"span"},{"style":{"fontStyle":"italic"},"text":"}","element":"span"},{"text":". We do not use any GPU for these tasks.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Table 5. ","element":"figcaption","subtype":"caption"},{"text":"Hyperparameters for MuJoCo and Box2D environments","element":"figcaption","subtype":"caption"}],[{"style":{"width":"61%"},"width":1202,"height":826,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/16-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"IMDB and TL;DR: ","element":"span"},{"text":"We use the provided implementation (","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"Chang et al.","element":"a"},{"text":", ","element":"span"},{"href":"#id-61","referenceIndex":8,"text":"2023","element":"a"},{"text":") and follow their hyperparameters, with the addition of the advantage normalization step for PPO. 
The scripts used in our experiments are available in the code repository. We use a single NVIDIA A100 (80 GB) for our experiments.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Table 6. ","element":"figcaption","subtype":"caption"},{"text":"Hyperparameters for IMDB positive sentiment and TL;DR summarization","element":"figcaption","subtype":"caption"}],[{"style":{"width":"60%"},"width":1176,"height":544,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/17-0.png","element":"img"}],[{"id":"id-63","style":{"fontWeight":"bold"},"text":"C.2. Experimental Results: A2C and SA2C","element":"span"}],[{"id":"id-81","style":{"fontStyle":"italic"},"text":"Table 8. ","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of A2C and SA2C on MuJoCo benchmark tasks and Box2D environments without Gaussian noise across 30 seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"87%"},"width":1711,"height":712,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/18-0.png","element":"img"}],[{"id":"id-82","style":{"fontStyle":"italic"},"text":"Table 9. 
","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of A2C and SA2C on MuJoCo benchmark tasks and Box2D environments with Gaussian noise (mean 0 and standard deviation 0.05) across 30 seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"86%"},"width":1681,"height":716,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/18-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"C.3. Experimental Results: PPO and SPPO","element":"span"}],[{"id":"id-55","style":{"fontStyle":"italic"},"text":"Table 10. ","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of PPO and SPPO on Atari games, without and with binary symmetric channel (BSC) noise with a crossover probability of 0.1 across 5 seeds.","element":"figcaption","subtype":"caption"}],[{"id":"id-32","style":{"width":"99%"},"width":1944,"height":1477,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/19-0.png","element":"img"}],[{"id":"id-65","style":{"width":"94%"},"width":1838,"height":1984,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/20-0.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Figure 4. 
","element":"figcaption","subtype":"caption"},{"text":"Training curves for SPPO and PPO on Atari games. The blue line indicates the original PPO without any added noise, while the orange line represents SPPO without added noise. The green line indicates PPO with 10% noise, and the red line represents SPPO with 10% noise. We fix ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":119.48,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/20-1.png","element":"img","alt":" α = 0.5","inline":true,"padRight":true},{"text":"for all environments, with ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":118.46,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/20-2.png","element":"img","alt":" β = 1.0","inline":true,"padRight":true},{"text":"for the experiments without noise and ","element":"figcaption","subtype":"caption"},{"style":{"height":13.2},"width":136.89,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/20-3.png","element":"img","alt":" β = 10.0","inline":true,"padRight":true},{"text":"for the experiments with noise.","element":"figcaption","subtype":"caption"}],[{"id":"id-66","style":{"fontStyle":"italic"},"text":"Table 12. ","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of PPO and SPPO on MuJoCo benchmark tasks and Box2D environments without Gaussian noise across 30 seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"87%"},"width":1711,"height":712,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/21-0.png","element":"img"}],[{"id":"id-67","style":{"fontStyle":"italic"},"text":"Table 13. 
","element":"figcaption","subtype":"caption"},{"text":"Mean final scores and standard errors (over the last 10 episodes) of PPO and SPPO on MuJoCo benchmark tasks and Box2D environments with Gaussian noise (mean 0 and standard deviation 0.05) across 30 seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"86%"},"width":1681,"height":716,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/21-1.png","element":"img"}],[{"id":"id-79","style":{"fontWeight":"bold"},"text":"C.4. On and Off Advantage Normalization","element":"span"}]]},{"heading":"D. Examples of Reward Model Errors","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Warning: This section contains harmful language.","element":"span"}],[{"id":"id-22","style":{"fontStyle":"italic"},"text":"Table 15. ","element":"figcaption","subtype":"caption"},{"text":"Examples of errors in a trained reward model: the rewards assigned to empty outputs are inconsistent, and an empty output can receive a higher reward than a non-empty summary. [...] 
indicates omitted content for brevity.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"81%"},"width":1582,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/22-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"style":{"fontWeight":"bold"},"text":"OST","element":"span"},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"S","element":"span"},{"text":"O OKAY","element":"span"},{"text":", I’","element":"span"},{"text":"M FROM ","element":"span"},{"text":"N","element":"span"},{"text":"EW ","element":"span"},{"text":"Y","element":"span"},{"text":"ORK BUT ","element":"span"},{"text":"I ","element":"span"},{"text":"STUDY IN ","element":"span"},{"text":"O","element":"span"},{"text":"REGON FOR MOST OF THE YEAR","element":"span"},{"text":". R","element":"span"},{"text":"ECENTLY A FRIEND OF MINE WHO ","element":"span"},{"text":"I ","element":"span"},{"text":"WAS NOT REALLY CLOSE STARTED FACEBOOK MESSAGING ME","element":"span"},{"text":", ","element":"span"},{"text":"THAT WAS ABOUT ","element":"span"},{"text":"3 ","element":"span"},{"text":"MONTHS AGO","element":"span"},{"text":", ","element":"span"},{"text":"SINCE THEN WE","element":"span"},{"text":"’","element":"span"},{"text":"VE TALKED ALMOST EVERYDAY","element":"span"},{"text":". [...] I ","element":"span"},{"text":"TRIED TO DO JUST THAT BUT SHE TOTALLY GAVE ME THE COLD SHOULDER","element":"span"},{"text":"; ","element":"span"},{"text":"NOT BEING REALLY RESPONSIVE TO HANGING OUT","element":"span"},{"text":", ","element":"span"},{"text":"LEAVING EARLY WHEN WE FINALLY DID ETC","element":"span"},{"text":"... A","element":"span"},{"text":"M ","element":"span"},{"text":"I ","element":"span"},{"text":"WRONG IN MY ORIGINAL ASSUMPTION THAT SHE WAS INTO ME JUST BECAUSE OUT OF THE BLUE SHE STARTED TALKING TO ME A LOT","element":"span"},{"text":"? 
I","element":"span"},{"text":"S SHE TRYING TO PLAY HARD TO GET","element":"span"},{"text":"? A","element":"span"},{"text":"M ","element":"span"},{"text":"I ","element":"span"},{"text":"LOOKING WAY TOO INTO THIS AND MAYBE SHE WAS JUST OCCUPIED THAT WEEKEND","element":"span"},{"text":"? I ","element":"span"},{"text":"REALLY HAVE NO IDEA HOW TO EVALUATE THIS","element":"span"},{"text":". D","element":"span"},{"text":"O ANY OF YOU GUYS HAVE ANY SUGGESTIONS","element":"span"},{"text":"/","element":"span"},{"text":"IDEAS","element":"span"},{"text":"?","element":"span"}],[{"style":{"width":"81%"},"width":1581,"height":161,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/22-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"style":{"fontWeight":"bold"},"text":"OST","element":"span"},{"style":{"fontWeight":"bold"},"text":": ","element":"span"},{"text":"W","element":"span"},{"text":"E WERE FRIENDS FOR ","element":"span"},{"text":"10 ","element":"span"},{"text":"YEARS","element":"span"},{"text":", ","element":"span"},{"text":"BEFORE WE GOT TOGETHER","element":"span"},{"text":". H","element":"span"},{"text":"E THAN TOLD ME ONCE ABOUT HIS TERRIBLE CHILDHOOD","element":"span"},{"text":". 
(H","element":"span"},{"text":"E TOLD ONLY ","element":"span"},{"text":"3 ","element":"span"},{"text":"OF HIS FRIENDS HIS STORY","element":"span"},{"text":") N","element":"span"},{"text":"OW WE","element":"span"},{"text":"’","element":"span"},{"text":"RE A COUPLE FOR QUITE A FEW MONTHS AND WELL","element":"span"},{"text":", ","element":"span"},{"text":"SOMETIMES THERE","element":"span"},{"text":"’","element":"span"},{"text":"S STUFF ","element":"span"},{"text":"I ","element":"span"},{"text":"KNOW THAT REMINDS HIM OF HIS CHILDHOOD","element":"span"},{"text":", ","element":"span"},{"text":"BUT IT","element":"span"},{"text":"’","element":"span"},{"text":"S LIKE HE","element":"span"},{"text":"’","element":"span"},{"text":"S FORGOTTEN THAT HE HAD TOLD ME","element":"span"},{"text":". [...] A","element":"span"},{"text":"ND STUFF LIKE WATCHING ","element":"span"},{"text":"TV","element":"span"},{"text":"SHOWS ABOUT RAISING CHILDREN","element":"span"},{"text":". W","element":"span"},{"text":"E TALK ABOUT HOW WE","element":"span"},{"text":"’","element":"span"},{"text":"RE GOING TO RAISE OURS IN THE FUTURE AND THAT WE WON","element":"span"},{"text":"’","element":"span"},{"text":"T WILL BE AS HORRIBLE AS THE PARENTS ON ","element":"span"},{"text":"TV. (B","element":"span"},{"text":"UT STRIKING","element":"span"},{"text":", ","element":"span"},{"text":"THE THINGS HE THINKS ARE IMPORTANT ARE ALWAYS THE THINGS HIS PARENTS SHOULD HAVE DONE","element":"span"},{"text":", ","element":"span"},{"text":"TO SAVE HIM FROM THE TRAUMATIZING STUFF","element":"span"},{"text":".)I ","element":"span"},{"text":"KNOW HE LIKES TO PUT HIS PROBLEMS FAR AWAY","element":"span"},{"text":". 
B","element":"span"},{"text":"UT ON THE OTHER HAND","element":"span"},{"text":", I’","element":"span"},{"text":"M HIS GIRLFRIEND NOW AND WE","element":"span"},{"text":"’","element":"span"},{"text":"RE PRETTY SERIOUS","element":"span"},{"text":", ","element":"span"},{"text":"ISN","element":"span"},{"text":"’","element":"span"},{"text":"T IT GOOD TO SPEAK ABOUT IT MAYBE JUST ONCE","element":"span"},{"text":", ","element":"span"},{"text":"SO HE KNOWS ","element":"span"},{"text":"I ","element":"span"},{"text":"KNOW HIS SECRET","element":"span"},{"text":"/","element":"span"},{"text":"WON","element":"span"},{"text":"’","element":"span"},{"text":"T TELL","element":"span"},{"text":", ","element":"span"},{"text":"AND MOST OF ALL","element":"span"},{"text":", I’","element":"span"},{"text":"M ALWAYS THERE FOR HIM","element":"span"},{"text":"? W","element":"span"},{"text":"HAT DO YOU THINK","element":"span"},{"text":"?","element":"span"}],[{"style":{"width":"81%"},"width":1581,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/22-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"P","element":"span"},{"style":{"fontWeight":"bold"},"text":"OST","element":"span"},{"style":{"fontWeight":"bold"},"text":":","element":"span"},{"text":"M","element":"span"},{"text":"Y ","element":"span"},{"text":"G","element":"span"},{"text":"RANDMA AND MY AUNT ","element":"span"},{"text":"(","element":"span"},{"text":"HER DAUGHER","element":"span"},{"text":"-","element":"span"},{"text":"IN","element":"span"},{"text":"-","element":"span"},{"text":"AW","element":"span"},{"text":") ","element":"span"},{"text":"HAVEN","element":"span"},{"text":"’","element":"span"},{"text":"T SPOKEN TO EACH OTHER IN YEARS OVER A PHONE THAT DIDN","element":"span"},{"text":"’","element":"span"},{"text":"T GET HUNG UP","element":"span"},{"text":". 
M","element":"span"},{"text":"Y AUNT AND UNCLE SCREEN THEIR CALLS AND FREQUENTLY DO NOT RETURN THEM","element":"span"},{"text":"– ","element":"span"},{"text":"ONE TIME","element":"span"},{"text":", ","element":"span"},{"text":"MY GRANDMA CALLED AND LEFT A MESSAGE THEN THOUGHT SHE HUNG UP THE PHONE","element":"span"},{"text":". A ","element":"span"},{"text":"FEW MINUTES LATER","element":"span"},{"text":"– ","element":"span"},{"text":"MY ","element":"span"},{"text":"G","element":"span"},{"text":"RANDMA WAS TALKING WITH SOMEONE IN HER HOME AND USED THE WORD ","element":"span"},{"text":"¨","element":"span"},{"text":"BITCH","element":"span"},{"text":"¨-- ","element":"span"},{"text":"THIS WAS ALL RECORDED ON MY AUNT AND UNCLE","element":"span"},{"text":"’","element":"span"},{"text":"S ANSWERING MACHINE AND MY AUNT ASSUMED IT WAS ABOUT HER AND HASN","element":"span"},{"text":"’","element":"span"},{"text":"T SPOKEN TO NOR SEEN MY ","element":"span"},{"text":"G","element":"span"},{"text":"RANDMA IN UPWARDS OF ","element":"span"},{"text":"5 ","element":"span"},{"text":"YEARS","element":"span"},{"text":". [...] W","element":"span"},{"text":"HY WASTE TIME THE TIME YOU HAVE WITH SOMONE","element":"span"},{"text":"? W","element":"span"},{"text":"HY CONTINUE TO HOLD A SILLY GRUDGE","element":"span"},{"text":"? T","element":"span"},{"text":"O COMPLICATE MATTERS FURTHER","element":"span"},{"text":", ","element":"span"},{"text":"MY GRANDMA HAS A DAUGHTER WHO LIVES WITH HER AND LIKES TO BE IN OTHER PEOPLES BUSINESS","element":"span"},{"text":"– I ","element":"span"},{"text":"THINK SHE IS ALSO PART OF THE PROBLEM HERE AS SHE WON","element":"span"},{"text":"’","element":"span"},{"text":"T DROP IT EITHER","element":"span"},{"text":". 
G","element":"span"},{"text":"RANDMA IS INNOCENT BUT HAS A DAUGHTER AND DAUGHTER","element":"span"},{"text":"-","element":"span"},{"text":"IN","element":"span"},{"text":"-","element":"span"},{"text":"LAW WHO WON","element":"span"},{"text":"’","element":"span"},{"text":"T GROW UP AND DROP IT","element":"span"}],[{"style":{"width":"81%"},"width":1582,"height":170,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2405.17618/images/22-3.png","element":"img"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]