Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
NeurIPS