38:[["$","audio",null,{"id":"tts"}],["$","$L3d",null,{"paperID":"95452","publisher":"neurips","paperJSON":{"title":"Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model","paperID":"95452","avgLineHeight":10.92,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"$3e","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Reinforcement learning (RL) has seen remarkable success by using expressive deep neural networks to estimate the value function or policy function [","element":"span"},{"href":"#id-0","referenceIndex":1,"text":"1","element":"a"},{"text":"]. However, in deep RL optimization, updating the Q-value function or policy value function can be unstable and introduce significant bias [","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"2","element":"a"},{"text":"]. Since the learning policy is influenced by the Q-value function, any bias in the Q-values affects the learning policy. In online RL, the agent’s interaction with the environment helps mitigate this bias through reward feedback for biased actions. However, in offline RL, the learning relies solely on data from a behavior policy, making information about rewards for states and actions outside the dataset’s distribution unavailable.","element":"span"}],[{"text":"It is commonly observed that during offline RL training, backups using OOD actions often lead to target Q-values being ","element":"span"},{"style":{"fontStyle":"italic"},"text":"overestimated ","element":"span"},{"text":"[","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":"] (see Figure ","element":"span"},{"href":"#id-3","text":"1(","element":"a"},{"text":"a)). As a result, the learning policy tends to prioritize these risky actions during policy improvement. 
This false prioritization accumulates with each training step, ultimately leading to failure in the offline training process [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":", ","element":"span"},{"href":"#id-5","referenceIndex":5,"text":"5","element":"a"},{"text":", ","element":"span"},{"href":"#id-6","referenceIndex":6,"text":"6","element":"a"},{"text":"]. Therefore, addressing Q-value overestimation for OOD actions is crucial for the effective implementation of offline reinforcement learning.","element":"span"}],[{"style":{"width":"98%"},"width":1554,"height":526,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/1-0.png","element":"img"}],[{"id":"id-3","text":"Figure 1: (a) The maximum of the estimated Q-value often occurs in OOD actions due to the ","element":"figcaption","subtype":"caption"},{"text":"instability of the offline RL backup process and the “distribution shift” problem , so the Q-value of the learning policy (yellow line) will diverge from the behavior policy’s action space (blue line) during the training. (b) The red line represents the optimal Q-value within the action space of the dataset, while the blue line depicts the Q-value function of the behavior policy. The gold line corresponds to the Q-value derived from the in-sample Q training algorithm, showcasing a distribution constrained by the behavior policy. On the other hand, the green line illustrates the Q-value resulting from a more conservative Q training process. Although it adopts lower values in OOD actions, the Q-value within in-distribution areas proves excessively pessimistic, failing to approach the optimal Q-value.","element":"figcaption","subtype":"caption"}],[{"text":"Since any bias or error in the Q-value will propagate to the learning policy, it’s crucial to evaluate whether the Q-value is assigned to OOD actions and to apply a pessimistic adjustment to address overestimation. 
Ideally, this adjustment should only target OOD actions. One common way to identify whether the Q-value function is updated by OOD actions is by estimating the uncertainty of the Q-value [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":"] in the action space. However, estimating uncertainty presents significant challenges, especially with high-capacity Q-value function approximators like neural networks [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":"]. If Q-value uncertainty is not accurately estimated, a penalty may be uniformly applied across most actions [","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"7","element":"a"},{"text":"], hindering the optimality of the Q-value function.","element":"span"}],[{"text":"While various methods [","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-8","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":11,"text":"11","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":14,"text":"14","element":"a"},{"text":"] attempt to make pessimistic estimates of the Q-value function, most have not effectively determined which Q-values need constraining or how to pessimistically estimate them with reliable and efficient uncertainty estimates. 
As a result, previous methods often end up being overly conservative in their Q-value estimations [","element":"span"},{"href":"#id-15","referenceIndex":15,"text":"15","element":"a"},{"text":"] or fail to achieve a tight lower confidence bound of the optimal Q-value function. Moreover, some in-sample training [","element":"span"},{"href":"#id-16","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":19,"text":"19","element":"a"},{"text":"] of the Q-value function may lead it to closely mimic the Q-value of the behavior policy (see Figure ","element":"span"},{"href":"#id-3","text":"1(","element":"a"},{"text":"b)), rendering it unable to surpass the performance of the behavior policy, especially when the behavior policy is sub-optimal. Therefore, when balancing the safety of Q-value learning against the recovery of the optimal Q-value, current methods tend to prioritize safe optimization of the Q-value function.","element":"span"}],[{"text":"In this study, we introduce Q-Distribution Guided Q-Learning (QDQ) for offline RL ","element":"span"},{"text":"2","element":"span"},{"text":". The core idea is to estimate Q-value uncertainty directly through bootstrap sampling from the behavior policy’s Q-value distribution. By approximating the behavior policy’s Q-values using the dataset, we train a high-fidelity and efficient distribution learner, the consistency model ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"[20]","element":"a"},{"text":". 
This ensures the quality of the learned Q-value distribution.","element":"span"}],[{"text":"Since the behavior and learning policies share the same set of high-uncertainty actions [","element":"span"},{"href":"#id-5","referenceIndex":5,"text":"5","element":"a"},{"text":"], we can sample from the learned Q-value distribution to estimate uncertainty, identify risky actions, and make the Q target values for these actions more pessimistic. We then create an uncertainty-aware optimization objective to carefully penalize Q-values that may be OOD, ensuring that the constraints are appropriately pessimistic without hindering the Q-value function’s exploration in the in-distribution region. QDQ aims to find the optimal Q-value that exceeds the behavior policy’s optimal Q-value while remaining as pessimistic as possible in the OOD region. Moreover, our pessimistic approach is robust against errors in uncertainty estimation. Our main contributions are as follows:","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Utilization of trajectory-level data with a sliding window","element":"span"},{"text":": We use trajectory-level data with a sliding window approach to create the real truncated Q dataset. Our theoretical analysis (Theorem ","element":"span"},{"href":"#id-21","text":"4.1) ","element":"a"},{"text":"confirms that the generated data has a distribution similar to true Q-values. Additionally, distributions learned from this dataset tend to favor high-reward actions.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Introduction of a consistency model as a distribution learner","element":"span"},{"text":": QDQ introduces the consistency model [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] as the distribution learner for the Q-value. Similar to the diffusion model, the consistency model demonstrates strong capabilities in distribution learning. 
Our theoretical analysis (Theorem ","element":"span"},{"href":"#id-22","text":"4.2) ","element":"a"},{"text":"highlights its consistency and one-step sampling properties, making it an ideal choice for uncertainty estimation.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Risk estimation of Q-values through uncertainty assessment","element":"span"},{"text":": QDQ estimates the risk set of Q-values by evaluating the uncertainty of actions. For Q-values likely to be overestimated and associated with high uncertainty, a pessimistic penalty is applied. For safer Q-values, a mild adjustment based on uncertainty error enhances their robustness.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Uncertainty-aware optimization objective to address conservatism","element":"span"},{"text":": To reduce the overly conservative nature of pessimistic Q-learning in offline RL, QDQ introduces an uncertainty-aware optimization objective. This involves simultaneous optimistic and pessimistic learning of the Q-value. Theoretical (Theorem ","element":"span"},{"href":"#id-23","text":"4.3 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-24","text":"4.4) ","element":"a"},{"text":"and experimental analyses show that this approach effectively mitigates conservatism issues.","element":"span"}]]},{"heading":"2 Background","paragraphs":[[{"text":"Our approach aims to temper the Q-values in OOD areas to mitigate the risk of unpredictable extrapolation errors, leveraging uncertainty estimation. We estimate the uncertainty of Q-values across actions visited by the learning policy using samples from a learned conditional Q-distribution via the consistency model. 
In this section, we provide a concise overview of the problem settings in offline RL and introduce the consistency model.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Fundamentals in offline RL","element":"span"}],[{"text":"The online RL process is shaped by an infinite-horizon Markov decision process (MDP): ","element":"span"},{"style":{"fontStyle":"italic"},"text":"M ","element":"span"},{"text":"= ","element":"span"},{"style":{"height":16.02},"width":288,"height":40.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-0.png","element":"img","alt":"{S, A, P, r, µ0, γ}","inline":true},{"text":". The state space is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"is the action space. The transition dynamics between states are determined by ","element":"span"},{"style":{"height":16},"width":532,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-1.png","element":"img","alt":" P : S × A ↦ ∆(S), where ∆(S)","inline":true,"padRight":true},{"text":"is the support of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S","element":"span"},{"text":". The reward, defined over the whole state and action space, is ","element":"span"},{"style":{"height":14.8},"width":388.48,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-2.png","element":"img","alt":" r : S × A ↦ R, r < ∞","inline":true},{"text":", and can either be deterministic or random. 
","element":"span"},{"style":{"height":16},"width":105,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-3.png","element":"img","alt":" µ0(s0)","inline":true,"padRight":true},{"text":"is the distribution of the initial states ","element":"span"},{"style":{"height":16},"width":211.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-4.png","element":"img","alt":" s0, γ ∈ (0, 1)","inline":true,"padRight":true},{"text":"is the discount factor. The goal of RL is to find the optimal policy ","element":"span"},{"style":{"height":16},"width":227.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-5.png","element":"img","alt":" π : S �→ ∆(a)","inline":true,"padRight":true},{"text":"that yields the highest long-term average return:","element":"span"}],[{"id":"id-43","style":{"width":"98%"},"width":1566,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"is the Q-value function under policy ","element":"span"},{"style":{"height":7.2},"width":22.52,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-7.png","element":"img","alt":" π","inline":true},{"text":". 
The process of obtaining the optimal policy is generally to recover the optimal Q-value function, which maximizes the Q-value function over the whole space ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A","element":"span"},{"text":", and then to obtain either an implicit policy (Q-Learning algorithm [","element":"span"},{"href":"#id-25","referenceIndex":21,"text":"21","element":"a"},{"text":", ","element":"span"},{"href":"#id-26","referenceIndex":22,"text":"22","element":"a"},{"text":", ","element":"span"},{"href":"#id-27","referenceIndex":23,"text":"23","element":"a"},{"text":"]), or a parameterized policy (Actor-Critic algorithm ","element":"span"},{"href":"#id-28","referenceIndex":24,"text":"[24, ","element":"a"},{"href":"#id-29","referenceIndex":25,"text":"25, ","element":"a"},{"href":"#id-30","referenceIndex":26,"text":"26, ","element":"a"},{"href":"#id-31","referenceIndex":27,"text":"27, ","element":"a"},{"href":"#id-32","referenceIndex":28,"text":"28]","element":"a"},{"text":").","element":"span"}],[{"text":"The optimal Q-value function ","element":"span"},{"style":{"height":16},"width":133.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-8.png","element":"img","alt":" Q∗(s, a)","inline":true,"padRight":true},{"text":"can be obtained by minimizing the Bellman residual:","element":"span"}],[{"id":"id-33","style":{"width":"66%"},"width":1062,"height":52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-9.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B ","element":"span"},{"text":"is the Bellman operator defined as","element":"span"}],[{"style":{"width":"47%"},"width":746,"height":66,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/2-10.png","element":"img"}],[{"text":"However, the whole paradigm needs to be adjusted in the offline RL setting, as MDP is only determined from a dataset 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", which is generated by behavior policy ","element":"span"},{"style":{"height":11.6},"width":40,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-0.png","element":"img","alt":" πβ","inline":true},{"text":". Hence, the state and action space is constraint by the distribution support of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":". We redefine the MDP in the offline RL setting as: ","element":"span"},{"style":{"height":16},"width":507,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-1.png","element":"img","alt":" MD = {SD, AD, PD, r, µ0, γ}","inline":true},{"text":", where ","element":"span"},{"style":{"height":18},"width":762.48,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-2.png","element":"img","alt":" SD = {s|s ∈ ∆(sD)}, AD = {a|a ∈ ∆(πβ)}","inline":true},{"text":". Then the transition dynamic is determined by ","element":"span"},{"style":{"height":16},"width":442.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-3.png","element":"img","alt":" PD : SD × AD �→ ∆(SD)","inline":true},{"text":". Therefore, the well-known “distribution shift” problem occurs when solving the Bellman equation Eq","element":"span"},{"href":"#id-33","text":".2. 
","element":"a"},{"text":"The Bellman residual is taking expectation in ","element":"span"},{"style":{"height":14},"width":156.52,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-4.png","element":"img","alt":" SD × AD","inline":true},{"text":", while the target Q-value is calculated based on the actions from the learning policy.","element":"span"}],[{"id":"id-92","style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Consistency model","element":"span"}],[{"text":"The consistency model is an enhanced generative model compared to the diffusion model. The diffusion model gradually adds noise to transform the target distribution into a Gaussian distribution and by estimating the random noise to achieve the reverse process, i.e., sampling a priori sample from a Gaussian distribution and denoise to the target sample iteratively (forms the sample generation trajectory). The consistency model is proposed to ensures each step in a sample generation trajectory of the diffusion process aligns with the target sample (we call consistency). 
Specifically, the consistency model [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] tries to overcome the slow generation and the inconsistency along the sampling trajectory generated by the Probability Flow (PF) ODE during the training process of the diffusion model [","element":"span"},{"href":"#id-34","referenceIndex":29,"text":"29","element":"a"},{"text":", ","element":"span"},{"href":"#id-35","referenceIndex":30,"text":"30","element":"a"},{"text":", ","element":"span"},{"href":"#id-36","referenceIndex":31,"text":"31","element":"a"},{"text":", ","element":"span"},{"href":"#id-37","referenceIndex":32,"text":"32","element":"a"},{"text":", ","element":"span"},{"href":"#id-38","referenceIndex":33,"text":"33","element":"a"},{"text":"].","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":140.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-5.png","element":"img","alt":" pdata(x)","inline":true,"padRight":true},{"text":"denote the data distribution. We start by diffusing the original data distribution with the PF ODE:","element":"span"}],[{"style":{"width":"73%"},"width":1168,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-6.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-7.png","element":"img","alt":" µ(·, ·)","inline":true,"padRight":true},{"text":"is the drift coefficient, ","element":"span"},{"style":{"height":16},"width":61.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-8.png","element":"img","alt":" σ(·)","inline":true,"padRight":true},{"text":"is the diffusion coefficient, ","element":"span"},{"style":{"height":16},"width":105,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-9.png","element":"img","alt":" 
pt(xt)","inline":true,"padRight":true},{"text":"is the distribution of ","element":"span"},{"style":{"height":9.79},"width":48.52,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-10.png","element":"img","alt":" xt,","inline":true},{"style":{"height":16},"width":624.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-11.png","element":"img","alt":"p0(x) ≡ pdata(x), and {xt, t ∈ [ϵ, T]}","inline":true,"padRight":true},{"text":"is the solution trajectory of the above PF ODE.","element":"span"}],[{"text":"Consistency model aims to learn a consistency function ","element":"span"},{"style":{"height":16},"width":137,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-12.png","element":"img","alt":" fθ(xt, t)","inline":true,"padRight":true},{"text":"that maps each point in the same PF ODE trajectory to its start point, i.e., ","element":"span"},{"style":{"height":16},"width":438.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-13.png","element":"img","alt":" fθ(xt, t) = xϵ, ∀t ∈ [ϵ, T].","inline":true,"padRight":true},{"text":"Therefore, ","element":"span"},{"style":{"height":16},"width":212,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-14.png","element":"img","alt":" ∀t, t′ ∈ [ϵ, T]","inline":true},{"text":", we have ","element":"span"},{"style":{"height":16},"width":356,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-15.png","element":"img","alt":" fθ(xt, t) = fθ(xt′, t′)","inline":true},{"text":", which is the “self-consistency” property of consistency model.","element":"span"}],[{"text":"Here, ","element":"span"},{"style":{"height":16},"width":137.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-16.png","element":"img","alt":" fθ(xt, t)","inline":true,"padRight":true},{"text":"is defined 
as:","element":"span"}],[{"id":"id-39","style":{"width":"71%"},"width":1140,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-17.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.4},"width":309,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-18.png","element":"img","alt":" cskip(t) and cout(t)","inline":true,"padRight":true},{"text":"are differentiable functions, and ","element":"span"},{"style":{"height":16.4},"width":401,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-19.png","element":"img","alt":" cskip(ϵ) = 1, cout(ϵ) = 0","inline":true,"padRight":true},{"text":"such that they satisfy the boundary condition ","element":"span"},{"href":"#id-39","style":{"height":16},"width":508,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-20.png","element":"img","alt":" fθ(xϵ, ϵ) = xϵ. In (4), Fθ(xt, t)","inline":true,"padRight":true},{"text":"can be a free-form deep neural network whose output has the same dimension as ","element":"span"},{"style":{"height":9.79},"width":48,"height":24.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-21.png","element":"img","alt":" xt.","inline":true}],[{"text":"The consistency function ","element":"span"},{"style":{"height":16},"width":137.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-22.png","element":"img","alt":" fθ(xt, t)","inline":true,"padRight":true},{"text":"can be optimized by minimizing the difference between points on the same PF ODE trajectory. If we use a pretrained diffusion model to generate such a PF ODE trajectory and then use it to train a consistency model, the process is called consistency distillation. 
We use the consistency distillation method to learn a consistency model in this work and optimize the consistency distillation loss as the Definition 1 in ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"[20]","element":"a"},{"text":".","element":"span"}],[{"text":"With a well-trained consistency model ","element":"span"},{"style":{"height":16},"width":137.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-23.png","element":"img","alt":" fθ(xt, t)","inline":true},{"text":", we can generate samples by sampling from the initial Gaussian distribution ","element":"span"},{"style":{"height":17.2},"width":282,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-24.png","element":"img","alt":" ˆxT ∼ N(0, T 2I)","inline":true,"padRight":true},{"text":"and then evaluating the consistency model for ","element":"span"},{"style":{"height":13.6},"width":82.48,"height":34,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-25.png","element":"img","alt":" ˆxϵ =","inline":true},{"style":{"height":16},"width":163,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/3-26.png","element":"img","alt":"fθ(ˆxT , T)","inline":true},{"text":". This involves only one forward pass through the consistency model and therefore generates samples in a single step. This is the one-step sampling process of the consistency model.","element":"span"}]]},{"heading":"3 Q-Distribution guided Q-learning via Consistency Model","paragraphs":[[{"text":"In this work, we present a novel method for offline RL called Q-Distribution Guided Q-Learning (QDQ). First, we quantify the uncertainty of Q-values by learning the Q-distribution using the consistency model. Next, we propose a strategy to identify risky actions and penalize their Q-values based on uncertainty estimation, helping to mitigate the associated risks. 
To tackle the excessive conservatism seen in previous approaches, we introduce uncertainty-aware Q optimization within the Actor-Critic learning framework. This mechanism allows the Q-value function to perform both optimistic and pessimistic optimization, fostering a balanced approach to learning.","element":"span"}],[{"id":"id-86","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Learn Uncertainty of Q-value by Q-distribution","element":"span"}],[{"text":"Estimating the uncertainty of the Q function is a significant challenge, especially with deep neural network Q estimators. A practical indicator of uncertainty is the presence of large variances in the estimates. Techniques such as bootstrapping multiple Q-values and estimating variance [","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"7","element":"a"},{"text":"] have been used to address this issue. However, these ensemble methods often lack diversity in Q-values [","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"9","element":"a"},{"text":"] and fail to accurately represent the true Q-value distribution. 
They may require tens or hundreds of Q-values to improve accuracy, which is computationally inefficient ","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"[9, ","element":"a"},{"href":"#id-5","referenceIndex":5,"text":"5]","element":"a"},{"text":".","element":"span"}],[{"text":"Other approaches involve estimating the Q-value distribution and determining the lower confidence bound [","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":16,"text":"16","element":"a"},{"text":"], or engaging in in-distribution learning of the Q-value function [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-16","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":34,"text":"34","element":"a"},{"text":"]. However, these methods often struggle to provide precise uncertainty estimations for the Q-value [","element":"span"},{"href":"#id-15","referenceIndex":15,"text":"15","element":"a"},{"text":"]. Stabilization methods can still lead to Q-value overestimation [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":"], while inaccurate variance estimation can worsen this problem. 
Furthermore, even if the Q-value is not overestimated, there is still a risk of it being overly pessimistic or constrained by the performance of the behavior policy when using in-distribution-only training.","element":"span"}],[{"text":"In this subsection, we explain how the distribution of Q-values is learned with the consistency model, and outline the technique for estimating the uncertainty of actions and identifying risky actions. We further demonstrate the performance of the consistency model and the efficiency of the uncertainty estimation in Appendix ","element":"span"},{"href":"#id-41","text":"G.2 ","element":"a"},{"text":"and Appendix ","element":"span"},{"href":"#id-42","text":"G.3.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Trajectory-level truncated Q-value. ","element":"span"},{"text":"We choose to estimate the Q-value distribution of the behavior policy instead of the learning policy because they share a similar set of high-uncertainty actions [","element":"span"},{"href":"#id-5","referenceIndex":5,"text":"5","element":"a"},{"text":"]. Using the behavior policy’s Q-value distribution has several advantages. First, the behavior policy’s Q-value dataset comes from the true dataset, ensuring high-quality distribution learning. In contrast, the learning policy’s Q-value is unknown, counterfactually learned, and often noisy and biased, leading to poor data quality and biased distribution learning. Second, using the behavior policy’s Q-value distribution to identify high-uncertainty actions does not force the learning policy’s target Q-value to align with that of the behavior policy.","element":"span"}],[{"text":"To gain insights into the Q-value distribution of the behavior policy, we first need the raw Q-value data. 
The calculation of the Q-value operates at the trajectory level, represented as ","element":"span"},{"style":{"height":16},"width":368.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/4-0.png","element":"img","alt":" τ = (s0, a0, s1, a1, ...),","inline":true,"padRight":true},{"text":"with an infinite horizon (see Eq","element":"span"},{"href":"#id-43","text":".1)","element":"a"},{"text":". In the context of offline RL, our training relies on the dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"produced by the behavior policy. This dataset consists of trajectories generated by the behavior policy, which is the only available trajectory-level data. However, the trajectory-level data from the behavior policy often faces a significant challenge: sparsity. This issue becomes even more pronounced when dealing with low-quality behavior policies, as the generated trajectories tend to be sporadic and do not adequately cover the entire state-action space ","element":"span"},{"style":{"height":12.4},"width":108,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/4-1.png","element":"img","alt":" S × A","inline":true},{"text":", especially the high reward region.","element":"span"}],[{"text":"To address this pervasive issue of sparsity, as well as the infinite summation in Eq","element":"span"},{"href":"#id-43","text":".1, ","element":"a"},{"text":"we present a novel approach aimed at enhancing sample efficiency. Our proposed solution involves the utilization of truncated trajectories to ameliorate the sparsity conundrum and avoid infinite summation. 
By employing a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window of width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":", we systematically traverse the original trajectories, isolating segments within the window to compute the truncated Q-value (as depicted in Figure ","element":"span"},{"href":"#id-44","text":"A.1)","element":"a"},{"text":". For instance, taking the starting point of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th step sliding window as ","element":"span"},{"style":{"height":16},"width":107.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/4-2.png","element":"img","alt":" (si, ai)","inline":true},{"text":", by setting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"+ ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", we derive the truncated Q-value of this starting point as follows:","element":"span"}],[{"id":"id-45","style":{"width":"93%"},"width":1478,"height":124,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/4-3.png","element":"img"}],[{"text":"The truncation of Q-values can occur either through sliding window mechanisms or task terminations. When truncation happens due to termination, the Q-value from Eq","element":"span"},{"href":"#id-45","text":".5 ","element":"a"},{"text":"is equivalent to the true Q-value, ","element":"span"},{"style":{"height":16},"width":374.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/4-4.png","element":"img","alt":"QπβT (·, ·) ≡ Qπβ(·, ·)","inline":true},{"text":". 
In contrast, if truncation results from window blocking, our theoretical analysis in Theorem ","element":"span"},{"href":"#id-21","text":"4.1 ","element":"a"},{"text":"confirms that the distribution of truncated Q-values has properties similar to those of the true Q-value distribution.","element":"span"}],[{"text":"Using a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window does not compromise the consistency of the trajectory, owing to the inherent memory-less Markov property in RL. This strategic truncation allows for the extraction of truncated Q-values, which can improve sample efficiency, especially for long trajectories. Moreover, this approach highlights actions with potential high Q-values, as actions from lengthy trajectories—those with many successful interactions—are encountered more often during Q-distribution ","element":"span"},{"text":"training. Consequently, the uncertainty of these actions is lower, reducing the likelihood of them being overly pessimistic.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Learn the distribution of Q-value. ","element":"span"},{"text":"In distributional RL, the learning of Q-value distributions is typically achieved through Gaussian neural networks [","element":"span"},{"href":"#id-46","referenceIndex":35,"text":"35","element":"a"},{"text":", ","element":"span"},{"href":"#id-47","referenceIndex":36,"text":"36","element":"a"},{"text":"], Gaussian processes [","element":"span"},{"href":"#id-48","referenceIndex":37,"text":"37","element":"a"},{"text":", ","element":"span"},{"href":"#id-49","referenceIndex":38,"text":"38","element":"a"},{"text":"], or categorical parameterization [","element":"span"},{"href":"#id-50","referenceIndex":39,"text":"39","element":"a"},{"text":"]. However, these methods often suffer from low precision representation of Q-value distributions, particularly in high-dimensional spaces. 
Moreover, straightforward replacement of true Q-value distributions with ensembles or bootstraps can lead to reduced accuracy in uncertainty estimation (a critical aspect in offline reinforcement learning [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":"]), or impose significant computational burdens ","element":"span"},{"href":"#id-8","referenceIndex":8,"text":"[8, ","element":"a"},{"href":"#id-7","referenceIndex":7,"text":"7]","element":"a"},{"text":".","element":"span"}],[{"text":"The idea of diffusing the original distribution using random noise has rendered the diffusion model a potent and high-fidelity distribution learner. However, it has limitations when estimating uncertainty. Sampling with a diffusion model requires a multi-step forward diffusion process to ensure sample quality. Unfortunately, this iterative process can compromise the accuracy of uncertainty estimates by introducing significant fluctuations and noise into the Q-value uncertainty. For a detailed discussion, see Appendix ","element":"span"},{"href":"#id-51","text":"A.2.","element":"a"}],[{"text":"To address this issue, we suggest using the consistency model [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] to learn the Q-value distribution. The consistency model allows for one-step sampling, like other generative models, which reduces the randomness found in the multi-step sampling of diffusion models. This results in a more robust uncertainty estimation. Furthermore, the consistency feature, as explained in Theorem ","element":"span"},{"href":"#id-22","text":"4.2, ","element":"a"},{"text":"accurately captures how changes in actions affect the variance of the final bootstrap samples, making Q-value uncertainty more sensitive to out-of-distribution (OOD) actions compared to the diffusion model. Additionally, the fast-sampling process of the consistency model improves QDQ’s efficiency. 
While there may be some quality loss in restoring real samples, this is negligible for QDQ since it only calculates uncertainty based on the variance of the bootstrap samples, not the absolute Q-values of the generated samples. Overall, the consistency model is a well-suited distribution learner for uncertainty estimation due to its reliability, high fidelity, ease of training, and faster sampling.","element":"span"}],[{"text":"Once we derive the truncated Q dataset ","element":"span"},{"style":{"height":15.6},"width":53.52,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-0.png","element":"img","alt":" DQ","inline":true},{"text":", we train a conditional consistency model, denoted by ","element":"span"},{"style":{"height":16},"width":257,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-1.png","element":"img","alt":"fθ(xT , T|(s, a))","inline":true},{"text":", which approximates the distribution of Q-values. Since the consistency model supports one-step sampling, we can easily sample multiple Q-values for each action. 
Suppose we draw ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"prior noise ","element":"span"},{"style":{"height":16.19},"width":329,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-2.png","element":"img","alt":" {ˆxT1, ˆxT2, · · · , ˆxTn}","inline":true,"padRight":true},{"text":"from the initial noise distribution ","element":"span"},{"style":{"height":17.2},"width":151.48,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-3.png","element":"img","alt":" N(0, T 2)","inline":true},{"text":", and denoise the prior samples by the consistency one-step forward process: ","element":"span"},{"style":{"height":16.21},"width":642,"height":40.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-4.png","element":"img","alt":"ˆxϵi = fθ(ˆxTi, Ti|(s, a)), i = 1, 2, · · · , n","inline":true},{"text":". Then the variance of these Q-values, derived by","element":"span"}],[{"style":{"width":"88%"},"width":1406,"height":134,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-5.png","element":"img"}],[{"text":"can be used to gauge the uncertainty of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":".","element":"span"}],[{"id":"id-119","style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Q-distribution guided optimization in offline RL","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Recover Q-value function. 
","element":"span"},{"text":"We propose an uncertainty-aware optimization objective ","element":"span"},{"style":{"height":16},"width":130,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-6.png","element":"img","alt":" Luw(Q)","inline":true,"padRight":true},{"text":"to penalize Q-value for OOD actions as well as to avoid too conservative Q-value learning for in-distribution areas. The uncertainty-aware learning objective for Q-value function ","element":"span"},{"style":{"height":16},"width":190.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-7.png","element":"img","alt":" Qθ(s, a) is :","inline":true}],[{"id":"id-52","style":{"width":"75%"},"width":1196,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-8.png","element":"img"}],[{"text":"In Eq","element":"span"},{"href":"#id-52","text":".7, ","element":"a"},{"style":{"height":16},"width":142.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-9.png","element":"img","alt":" L(Qθ)H","inline":true,"padRight":true},{"text":"represents the classic Bellman residual defined in Eq","element":"span"},{"href":"#id-33","text":".2. ","element":"a"},{"text":"This residual is used in online RL and encourages optimistic optimization of the Q-value. 
In contrast, ","element":"span"},{"style":{"height":16},"width":129.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-10.png","element":"img","alt":" L(Qθ)L","inline":true,"padRight":true},{"text":"is a pessimistic Bellman residual based on the uncertainty-penalized Q target ","element":"span"},{"style":{"height":16},"width":161,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-11.png","element":"img","alt":" QL(s′, a′)","inline":true},{"text":", defined as","element":"span"}],[{"id":"id-53","style":{"width":"86%"},"width":1368,"height":100,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-12.png","element":"img"}],[{"text":"In Eq","element":"span"},{"href":"#id-53","text":".8, ","element":"a"},{"style":{"height":19.2},"width":490.96,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-13.png","element":"img","alt":" HQ(a′|s′) =�V (Xϵ|(s′, a′))","inline":true,"padRight":true},{"text":"represents the uncertainty estimate of the Q-value for action ","element":"span"},{"style":{"height":16},"width":261.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-14.png","element":"img","alt":"a′. The set U(Q)","inline":true,"padRight":true},{"text":"includes actions that may be out-of-distribution (OOD). 
We use the upper ","element":"span"},{"style":{"height":14.8},"width":162,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-15.png","element":"img","alt":" β-quantile","inline":true},{"style":{"height":22.4},"width":161,"height":56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-16.png","element":"img","alt":"HβQ(a′|s′)","inline":true,"padRight":true},{"text":"of the uncertainty estimate on actions taken by the learning policy as the threshold for ","element":"span"},{"text":"forming ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":")","element":"span"},{"text":". Additionally, we incorporate the quantile parameter ","element":"span"},{"style":{"height":14.61},"width":22.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/5-17.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"as a robust weighting factor for ","element":"span"},{"text":"the unpenalized Q-target value. This helps control the estimation error of uncertainty and enhances the robustness of the learning objective. We can also set a free weighting factor, but we use ","element":"span"},{"style":{"height":14.59},"width":22.52,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-0.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"to reduce the number of hyperparameters.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Improve the learning policy. 
","element":"span"},{"text":"The optimization of learning policy follows the classic online RL paradigm:","element":"span"}],[{"id":"id-54","style":{"width":"85%"},"width":1352,"height":72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-1.png","element":"img"}],[{"text":"In Eq","element":"span"},{"href":"#id-54","text":".9, ","element":"a"},{"text":"an entropy term is introduced to further stabilize the volatile learning process of Q-value function. For datasets with a wide distribution, we can simply set the penalization factor ","element":"span"},{"style":{"height":10.8},"width":21.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-2.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"to zero, which can further enhance performance. Furthermore, other policy learning objectives, such as the AWR policy objective ","element":"span"},{"href":"#id-55","referenceIndex":40,"text":"[40]","element":"a"},{"text":", can also be flexibly used within the QDQ framework, especially for the goal conditioned task like Antmaze.","element":"span"}],[{"text":"We outline the entire learning process of QDQ in Algorithm ","element":"span"},{"href":"#id-56","text":"1. ","element":"a"},{"text":"In Section ","element":"span"},{"text":"4, ","element":"span"},{"text":"Theorems ","element":"span"},{"href":"#id-23","text":"4.3 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-24","text":"4.4 ","element":"a"},{"text":"show that QDQ penalizes the OOD region based on uncertainty while ensuring that the Q-value function in the in-distribution region is close to the optimal Q-value. 
This alignment is the main goal of offline RL.","element":"span"}],[{"id":"id-56","style":{"width":"100%"},"width":1594,"height":860,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-3.png","element":"img"}]]},{"heading":"4 Theoretical Analysis","paragraphs":[[{"text":"In this section, we provide a theoretical analysis of QDQ. The first theorem states that if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is sufficiently large, the distribution of ","element":"span"},{"style":{"height":14.8},"width":103,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-4.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"text":"does not significantly differ from the true distribution of ","element":"span"},{"style":{"height":16.99},"width":67.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-5.png","element":"img","alt":" Qπβ","inline":true},{"text":". This shows that our sliding window-based truncated Q-value distribution converges to the true Q-value distribution, ensuring accurate uncertainty estimation. A detailed proof can be found in Appendix ","element":"span"},{"text":"B.","element":"span"}],[{"id":"id-21","style":{"fontWeight":"bold"},"text":"Theorem 4.1 ","element":"span"},{"text":"(Informal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under some mild conditions, the truncated Q-value ","element":"span"},{"style":{"height":18.99},"width":63,"height":47.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-6.png","element":"img","alt":" QπβT","inline":true},{"style":{"fontStyle":"italic"},"text":"converges in distribution to the true Q-value ","element":"span"},{"style":{"height":14.59},"width":75,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-7.png","element":"img","alt":" Qπβ.","inline":true}],[{"style":{"width":"67%"},"width":1070,"height":62,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-8.png","element":"img"}],[{"text":"In Theorem ","element":"span"},{"href":"#id-22","text":"4.2, ","element":"a"},{"text":"we analyze why the consistency model is suitable for estimating uncertainty. Our analysis shows that Q-value uncertainty is more sensitive to actions. This sensitivity helps in detecting out-of-distribution (OOD) actions. A detailed statement of the theorem and its proof can be found in Appendix ","element":"span"},{"text":"C.","element":"span"}],[{"id":"id-22","style":{"fontWeight":"bold"},"text":"Theorem 4.2 ","element":"span"},{"text":"(Informal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under the same assumptions as in [","element":"span"},{"href":"#id-20","referenceIndex":20,"style":{"fontStyle":"italic"},"text":"20","element":"a"},{"style":{"fontStyle":"italic"},"text":"], ","element":"span"},{"style":{"height":16},"width":232,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-9.png","element":"img","alt":" fθ(x, T|(s, a))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-Lipschitz. 
We also assume the truncated Q-value is bounded by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"H","element":"span"},{"style":{"fontStyle":"italic"},"text":". The action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"broadly influences ","element":"span"},{"style":{"height":16},"width":205.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-10.png","element":"img","alt":" V (Xϵ|(s, a))","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"by: ","element":"span"},{"style":{"height":21.79},"width":494.48,"height":54.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/6-11.png","element":"img","alt":" | ∂var(Xϵ)∂a | = O(L2T√log n)1.","inline":true}],[{"text":"In Theorem ","element":"span"},{"href":"#id-23","text":"4.3, ","element":"a"},{"text":"we show that the uncertainty-aware learning objective in Eq","element":"span"},{"href":"#id-52","text":".7 ","element":"a"},{"text":"converges; the details can be found in Appendix ","element":"span"},{"text":"D.","element":"span"}],[{"id":"id-23","style":{"fontWeight":"bold"},"text":"Theorem 4.3 ","element":"span"},{"text":"(Informal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Q-value function of QDQ converges to a fixed point of the Bellman equation: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"style":{"fontStyle":"italic"},"text":", where the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is defined as:","element":"span"}],[{"style":{"width":"89%"},"width":1416,"height":64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-0.png","element":"img"}],[{"text":"Theorem ","element":"span"},{"href":"#id-24","text":"4.4 ","element":"a"},{"text":"shows that QDQ penalizes the OOD region by uncertainty while ensuring that the Q-value function in the in-distribution region is close to the optimal Q-value, which is the goal of offline RL.","element":"span"}],[{"id":"id-24","style":{"fontWeight":"bold"},"text":"Theorem 4.4 ","element":"span"},{"text":"(Informal)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Under mild conditions, with probability ","element":"span"},{"style":{"height":14.61},"width":225.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-1.png","element":"img","alt":" 1 − η we have","inline":true}],[{"style":{"width":"60%"},"width":954,"height":58,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":16.99},"width":57.48,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-3.png","element":"img","alt":" Q∆","inline":true},{"style":{"fontStyle":"italic"},"text":"is learned by the uncertainty-aware loss in Eq","element":"span"},{"href":"#id-52","style":{"fontStyle":"italic"},"text":".7, ","element":"a"},{"style":{"height":7.2},"width":13.52,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-4.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is an error rate related to the difference between the classical Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the QDQ Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"text":"The Q-value function, ","element":"span"},{"style":{"height":17.01},"width":54.48,"height":42.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-5.png","element":"img","alt":" Q∆","inline":true},{"text":", learned by the QDQ algorithm can closely approximate the optimal Q-value function, 
","element":"span"},{"style":{"height":14.61},"width":44,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-6.png","element":"img","alt":" Q∗","inline":true},{"text":", benefiting from the balanced approach of the QDQ algorithm that avoids excessive pessimism for in-distribution areas. Both ","element":"span"},{"style":{"height":7.2},"width":13.48,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-7.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":10.8},"width":19,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-8.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"are small; more details are given in Appendix ","element":"span"},{"text":"E.","element":"span"}]]},{"heading":"5 Experiments","paragraphs":[[{"text":"In this section, we first delve into the experimental performance of QDQ using the D4RL benchmarks [","element":"span"},{"href":"#id-57","referenceIndex":41,"text":"41","element":"a"},{"text":"]. Subsequently, we conduct a concise analysis of parameter settings, focusing on hyperparameter tuning across various tasks. For implementation details, see Appendix ","element":"span"},{"text":"G.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"5.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Performance on D4RL benchmarks for Offline RL","element":"span"}],[{"text":"We evaluate the proposed QDQ algorithm on the D4RL Gym-MuJoCo and AntMaze tasks. 
We compare it with several strong state-of-the-art (SOTA) model-free methods: behavioral cloning (BC), BCQ [","element":"span"},{"href":"#id-58","referenceIndex":42,"text":"42","element":"a"},{"text":"], DT [","element":"span"},{"href":"#id-59","referenceIndex":43,"text":"43","element":"a"},{"text":"], AWAC [","element":"span"},{"href":"#id-60","referenceIndex":44,"text":"44","element":"a"},{"text":"], Onestep RL [","element":"span"},{"href":"#id-61","referenceIndex":45,"text":"45","element":"a"},{"text":"], TD3+BC [","element":"span"},{"href":"#id-62","referenceIndex":46,"text":"46","element":"a"},{"text":"], CQL [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":"], and IQL [","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":"]. We also include UWAC [","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"7","element":"a"},{"text":"], EDAC [","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"9","element":"a"},{"text":"], and PBRL [","element":"span"},{"href":"#id-14","referenceIndex":14,"text":"14","element":"a"},{"text":"], which use uncertainty to pessimistically adjust the Q-value function, as well as MCQ [","element":"span"},{"href":"#id-13","referenceIndex":13,"text":"13","element":"a"},{"text":"], which introduces mild constraints to the Q-value function. The experimental results for the baselines reported in this paper are derived from the original experiments conducted by the authors or from replications using their official code. The reported values are normalized scores defined in D4RL ","element":"span"},{"href":"#id-57","referenceIndex":41,"text":"[41]","element":"a"},{"text":".","element":"span"}],[{"id":"id-63","text":"Table 1: Comparison of QDQ and the other baselines on the three Gym-MuJoCo tasks. All the ","element":"figcaption","subtype":"caption"},{"text":"experiments are performed on the MuJoCo \"-v2\" datasets. 
The results are calculated over 5 random seeds. med = medium, r = replay, e = expert, ha = halfcheetah, wa = walker2d, ho = hopper.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1584,"height":444,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/7-9.png","element":"img"}],[{"text":"Table ","element":"span"},{"href":"#id-63","text":"1 ","element":"a"},{"text":"shows the performance comparison between QDQ and the baselines across Gym-MuJoCo tasks, highlighting QDQ’s competitive edge in almost all tasks. Notably, QDQ excels on datasets with wide distributions, such as medium and medium-replay datasets. In these cases, QDQ effectively ","element":"span"},{"text":"avoids the problem of over-penalizing Q-values. By balancing between being too conservative and actively exploring to find the optimal Q-value function through dynamic programming, QDQ gradually converges toward the optimal Q-value, as supported by Theorem ","element":"span"},{"href":"#id-24","text":"4.4.","element":"a"}],[{"id":"id-64","text":"Table 2: Comparison of QDQ and the other baselines on the AntMaze tasks. All the experiments are ","element":"figcaption","subtype":"caption"},{"text":"performed on the AntMaze \"-v0\" datasets for comparability with previous baselines. The results are calculated over 5 random seeds.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"98%"},"width":1554,"height":392,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-0.png","element":"img"}],[{"text":"Table ","element":"span"},{"href":"#id-64","text":"2 ","element":"a"},{"text":"presents the performance comparison between QDQ and selected baselines","element":"span"},{"text":"3 ","element":"span"},{"text":"across AntMaze tasks, highlighting QDQ’s commendable performance. While QDQ focuses on reducing overly pessimistic estimations, it does not compromise its performance on narrow datasets. 
This is evident in its competitive results on the medium-expert dataset in Table ","element":"span"},{"href":"#id-63","text":"1, ","element":"a"},{"text":"as well as its performance on AntMaze tasks. Notably, QDQ outperforms SOTA methods on several datasets. This success is due to the inherent flexibility of the QDQ algorithm. By allowing for flexible hyperparameter control and seamless integration with various policy optimization methods, QDQ achieves a synergistic performance enhancement.","element":"span"}],[{"id":"id-109","style":{"fontWeight":"bold"},"text":"5.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Parameter analysis","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"The uncertainty-aware loss parameter ","element":"span"},{"style":{"height":7.6},"width":33,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-1.png","element":"img","alt":" α.","inline":true,"padRight":true},{"text":"The parameter ","element":"span"},{"style":{"height":7.41},"width":23,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-2.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is crucial for balancing the dominance between optimistic and pessimistic updates of the Q-value (Eq","element":"span"},{"href":"#id-52","text":".7)","element":"a"},{"text":". A higher ","element":"span"},{"style":{"height":7.39},"width":23,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-3.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"value skews updates toward the optimistic side, and we choose a higher ","element":"span"},{"style":{"height":7.39},"width":23.04,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-4.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"when the dataset or task is expected to be highly robust. 
However, the setting of ","element":"span"},{"style":{"height":7.39},"width":23,"height":18.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-5.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is also influenced by the pessimism of the Q target defined in Eq","element":"span"},{"href":"#id-53","text":".8. ","element":"a"},{"text":"For a more pessimistic Q target value, we can choose a larger ","element":"span"},{"style":{"height":7.41},"width":23,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-6.png","element":"img","alt":" α","inline":true},{"text":". Interestingly, both the theoretical analyses in Theorem ","element":"span"},{"href":"#id-23","text":"4.3 ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-24","text":"4.4 ","element":"a"},{"text":"and empirical parameter tuning suggest that variability in ","element":"span"},{"style":{"height":7.41},"width":22.96,"height":18.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-7.png","element":"img","alt":"α","inline":true,"padRight":true},{"text":"across tasks is minimal, with a typical value around 0.95.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"The uncertainty related parameter ","element":"span"},{"style":{"height":14.61},"width":22.52,"height":36.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-8.png","element":"img","alt":" β","inline":true},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"text":"The parameter ","element":"span"},{"style":{"height":14.59},"width":22,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-9.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"influences both the partitioning of high uncertainty sets and acts as a relaxation variable to control uncertainty estimation errors. 
When dealing with a narrow action space or a sensitive task (such as the hopper task), the value of ","element":"span"},{"style":{"height":14.59},"width":138,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-10.png","element":"img","alt":" β should","inline":true,"padRight":true},{"text":"be smaller. In these cases, the Q-value is more likely to select OOD actions, increasing the risk of overestimation. This means we face greater uncertainty (Eq","element":"span"},{"href":"#id-53","text":".8) ","element":"a"},{"text":"and need to minimize overestimation errors. Therefore, we require stricter criteria to ensure actions are in-distribution and penalize the Q-values of OOD points more heavily. A detailed analysis of how to determine the value of ","element":"span"},{"style":{"height":14.59},"width":134.52,"height":36.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-11.png","element":"img","alt":" β can be","inline":true,"padRight":true},{"text":"found in Appendix ","element":"span"},{"href":"#id-42","text":"G.3.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"The entropy parameter ","element":"span"},{"style":{"height":14.4},"width":136.48,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-12.png","element":"img","alt":" γ. The γ","inline":true,"padRight":true},{"text":"term in Eq","element":"span"},{"href":"#id-54","text":".9 ","element":"a"},{"text":"stabilizes the learning of a simple Gaussian policy, especially for action-sensitive and narrower distribution tasks. When the dataset has a wide distribution or the task shows high robustness to actions (such as in the half-cheetah task), the Q-value function generalizes better across the action space. 
In these cases, we can set a more lenient requirement for actions, keeping the value of ","element":"span"},{"style":{"height":10.8},"width":21.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-13.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"as small as possible or even at 0. However, when the dataset is narrow (e.g., in the AntMaze task) or when the task is sensitive to changes in actions (as in the hopper or maze tasks, where small deviations can lead to failure), a larger value of ","element":"span"},{"style":{"height":10.8},"width":21.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-14.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"is necessary. For these tasks, a simple Gaussian policy can easily sample risky actions, as it fits a single-mode policy. Nonetheless, experimental results indicate that performance is not highly sensitive to the ","element":"span"},{"style":{"height":10.8},"width":21.48,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-15.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"parameter. In fact, ","element":"span"},{"style":{"height":10.8},"width":21.52,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/8-16.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"in Eq","element":"span"},{"href":"#id-54","text":".9 ","element":"a"},{"text":"is relatively small compared to the Q-value; it serves primarily to stabilize training and prevent unstable action sampling from the Gaussian policy. See Appendix ","element":"span"},{"href":"#id-65","text":"G.7 ","element":"a"},{"text":"for more details.","element":"span"}]]},{"heading":"6 Related Works","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Restricting policy deviation into OOD areas","element":"span"},{"text":". 
The distribution mismatch between the behavior policy and the learning policy can be overcome if the learning policy shares the same support as the behavior policy. One approach involves explicit distribution matching constraints, where the learning policy is encouraged to align with the behavior policy by minimizing the distance between their distributions. This includes techniques based on KL-divergence [","element":"span"},{"href":"#id-66","referenceIndex":47,"text":"47","element":"a"},{"text":", ","element":"span"},{"href":"#id-67","referenceIndex":48,"text":"48","element":"a"},{"text":", ","element":"span"},{"href":"#id-55","referenceIndex":40,"text":"40","element":"a"},{"text":", ","element":"span"},{"href":"#id-60","referenceIndex":44,"text":"44","element":"a"},{"text":", ","element":"span"},{"href":"#id-62","referenceIndex":46,"text":"46","element":"a"},{"text":"], Jensen–Shannon divergence [","element":"span"},{"href":"#id-68","referenceIndex":49,"text":"49","element":"a"},{"text":"], and Wasserstein distance [","element":"span"},{"href":"#id-66","referenceIndex":47,"text":"47","element":"a"},{"text":", ","element":"span"},{"href":"#id-68","referenceIndex":49,"text":"49","element":"a"},{"text":"]. Another line of research aims to alleviate the overly conservative nature of distribution matching constraints by incorporating distribution support constraints. 
These methods employ techniques such as Maximum Mean Discrepancy (MMD) distance [","element":"span"},{"href":"#id-5","referenceIndex":5,"text":"5","element":"a"},{"text":"], learning behavior density functions using implicit [","element":"span"},{"href":"#id-69","referenceIndex":50,"text":"50","element":"a"},{"text":"] or explicit [","element":"span"},{"href":"#id-70","referenceIndex":51,"text":"51","element":"a"},{"text":"] methods, or measuring the geometric distance between actions generated by the learning and behavior policies [","element":"span"},{"href":"#id-71","referenceIndex":52,"text":"52","element":"a"},{"text":"]. In addition to explicit constraint methods, implicit constraints can also be implemented by learning a behavior policy sampler using techniques like Conditional Variational Autoencoders (CVAE) [","element":"span"},{"href":"#id-58","referenceIndex":42,"text":"42","element":"a"},{"text":", ","element":"span"},{"href":"#id-72","referenceIndex":53,"text":"53","element":"a"},{"text":", ","element":"span"},{"href":"#id-69","referenceIndex":50,"text":"50","element":"a"},{"text":", ","element":"span"},{"href":"#id-73","referenceIndex":54,"text":"54","element":"a"},{"text":"], autoregressive generative models [","element":"span"},{"href":"#id-74","referenceIndex":55,"text":"55","element":"a"},{"text":"], Generative Adversarial Networks (GAN) [","element":"span"},{"href":"#id-68","referenceIndex":49,"text":"49","element":"a"},{"text":"], normalizing flow models ","element":"span"},{"href":"#id-75","referenceIndex":56,"text":"[56]","element":"a"},{"text":", or diffusion models ","element":"span"},{"href":"#id-76","referenceIndex":57,"text":"[57, ","element":"a"},{"href":"#id-77","referenceIndex":58,"text":"58, ","element":"a"},{"href":"#id-11","referenceIndex":11,"text":"11, ","element":"a"},{"href":"#id-78","referenceIndex":59,"text":"59, ","element":"a"},{"href":"#id-11","referenceIndex":11,"text":"11, 
","element":"a"},{"href":"#id-79","referenceIndex":60,"text":"60]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pessimistic Q-value optimization","element":"span"},{"text":". Pessimistic Q-value methods offer a direct approach to address the issue of Q-value function overestimation, particularly when policy control fails despite the learning policy closely matching the behavior policy [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":"]. A promising approach to pessimistic Q-value estimation involves estimating uncertainty over the action space, as OOD actions typically exhibit high uncertainty. However, accurately quantifying uncertainty poses a challenge, especially with high-capacity function approximators like neural networks [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":"]. Techniques such as ensemble or bootstrap methods have been employed to estimate multiple Q-values, providing a proxy for uncertainty through Q-value variance [","element":"span"},{"href":"#id-7","referenceIndex":7,"text":"7","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-14","referenceIndex":14,"text":"14","element":"a"},{"text":"], importance ratio [","element":"span"},{"href":"#id-80","referenceIndex":61,"text":"61","element":"a"},{"text":", ","element":"span"},{"href":"#id-81","referenceIndex":62,"text":"62","element":"a"},{"text":"] or approximate Lower Confidence Bounds (LCB) for OOD regions [","element":"span"},{"href":"#id-8","referenceIndex":8,"text":"8","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"9","element":"a"},{"text":"]. 
Other methods focus on estimating the LCB of Q-values through quantile regression [","element":"span"},{"href":"#id-82","referenceIndex":63,"text":"63","element":"a"},{"text":", ","element":"span"},{"href":"#id-40","referenceIndex":34,"text":"34","element":"a"},{"text":"], expectile regression [","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":", ","element":"span"},{"href":"#id-11","referenceIndex":11,"text":"11","element":"a"},{"text":"], or tail-risk measures such as Conditional Value at Risk (CVaR) [","element":"span"},{"href":"#id-12","referenceIndex":12,"text":"12","element":"a"},{"text":"]. Alternatively, some approaches seek to pessimistically estimate Q-values based on the behavior policy, aiming to underestimate Q-values under the learning policy distribution while maximizing Q-values under the behavior policy distribution [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-13","referenceIndex":13,"text":"13","element":"a"},{"text":", ","element":"span"},{"href":"#id-83","referenceIndex":64,"text":"64","element":"a"},{"text":", ","element":"span"},{"href":"#id-84","referenceIndex":65,"text":"65","element":"a"},{"text":"]. Another category of Q-value constraint methods involves learning Q-values only within in-sample data [","element":"span"},{"href":"#id-16","referenceIndex":16,"text":"16","element":"a"},{"text":", ","element":"span"},{"href":"#id-17","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":18,"text":"18","element":"a"},{"text":", ","element":"span"},{"href":"#id-19","referenceIndex":19,"text":"19","element":"a"},{"text":"], capturing only in-sample patterns and avoiding OOD risk. 
Furthermore, Q-value functions can be replaced by safe planning methods used in model-based RL, such as planning with diffusion models [","element":"span"},{"href":"#id-85","referenceIndex":66,"text":"66","element":"a"},{"text":"] or trajectory-level prediction using Transformers [","element":"span"},{"href":"#id-59","referenceIndex":43,"text":"43","element":"a"},{"text":"]. However, ensemble-based uncertainty estimates may underestimate the true uncertainty, while quantile estimation methods are sensitive to how well the Q-distribution is recovered. In-sample methods may also be limited by the performance of the behavior policy.","element":"span"}]]},{"heading":"7 Conclusion","paragraphs":[[{"text":"We introduce QDQ, a novel framework that renders Q-values pessimistic in OOD areas through uncertainty estimation. Our approach leverages the consistency model to robustly estimate the uncertainty of Q-values. By employing this uncertainty information, QDQ can apply a judicious penalty to Q-values, mitigating the overly conservative nature encountered in previous pessimistic Q-value methods. Additionally, to enhance optimistic Q-learning within in-distribution areas, we introduce an uncertainty-aware learning objective for Q optimization. Both theoretical analyses and experimental evaluations demonstrate the effectiveness of QDQ. Several avenues for future research exist, including embedding QDQ into goal-conditioned tasks and enhancing exploration in online RL through efficient uncertainty estimation. We hope our work will inspire further advancements in offline reinforcement learning.","element":"span"}]]},{"heading":"Acknowledgements","paragraphs":[[{"text":"We would like to thank the AC and the reviewers for their valuable comments on the manuscript. Bingyi Jing’s research is partly supported by NSFC (No. 
12371290).","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-0","text":"[1] ","element":"span"},{"text":"Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1708.05866","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-1","text":"[2] ","element":"span"},{"text":"John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 9, 1996.","element":"span"}],[{"id":"id-2","text":"[3] ","element":"span"},{"text":"Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. volume 2020-December, 2020.","element":"span"}],[{"id":"id-4","text":"[4] ","element":"span"},{"text":"Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2005.01643","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-5","text":"[5] ","element":"span"},{"text":"Aviral Kumar, Justin Fu, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. volume 32, 2019.","element":"span"}],[{"id":"id-6","text":"[6] ","element":"span"},{"text":"Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. volume 4, 2018.","element":"span"}],[{"id":"id-7","text":"[7] ","element":"span"},{"text":"Yue Wu, Shuangfei Zhai, Nitish Srivastava, Joshua Susskind, Jian Zhang, Ruslan Salakhutdinov, and Hanlin Goh. Uncertainty weighted actor-critic for offline reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2105.08140","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-8","text":"[8] ","element":"span"},{"text":"Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 104–114. PMLR, 2020.","element":"span"}],[{"id":"id-9","text":"[9] ","element":"span"},{"text":"Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:7436–7447, 2021.","element":"span"}],[{"id":"id-10","text":"[10] ","element":"span"},{"text":"Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2110.06169","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-11","text":"[11] ","element":"span"},{"text":"Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2304.10573","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-12","text":"[12] ","element":"span"},{"text":"Núria Armengol Urpí, Sebastian Curi, and Andreas Krause. Risk-averse offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2102.05371","element":"span"},{"text":", 2021.","element":"span"}],[{"id":"id-13","text":"[13] ","element":"span"},{"text":"Jiafei Lyu, Xiaoteng Ma, Xiu Li, and Zongqing Lu. 
Mildly conservative q-learning for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2206.04745","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-14","text":"[14] ","element":"span"},{"text":"Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.11566","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-15","text":"[15] ","element":"span"},{"text":"HJ Terry Suh, Glen Chou, Hongkai Dai, Lujie Yang, Abhishek Gupta, and Russ Tedrake. Fighting uncertainty with gradients: Offline reinforcement learning via diffusion score matching. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Robot Learning","element":"span"},{"text":", pages 2878–2904. PMLR, 2023.","element":"span"}],[{"id":"id-16","text":"[16] ","element":"span"},{"text":"Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme q-learning: Maxent rl without entropy. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2301.02328","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-17","text":"[17] ","element":"span"},{"text":"Chenjun Xiao, Han Wang, Yangchen Pan, Adam White, and Martha White. The in-sample softmax for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2302.14372","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-18","text":"[18] ","element":"span"},{"text":"Hongchang Zhang, Yixiu Mao, Boyuan Wang, Shuncheng He, Yi Xu, and Xiangyang Ji. Insample actor critic for offline reinforcement learning. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Eleventh International Conference on Learning Representations","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-19","text":"[19] ","element":"span"},{"text":"Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline rl with no ood actions: In-sample learning via implicit value regularization. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2303.15810","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-20","text":"[20] ","element":"span"},{"text":"Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2303.01469","element":"span"},{"text":", 2023.","element":"span"}],[{"id":"id-25","text":"[21] ","element":"span"},{"text":"Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1312.5602","element":"span"},{"text":", 2013.","element":"span"}],[{"id":"id-26","text":"[22] ","element":"span"},{"text":"Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the AAAI conference on artificial intelligence","element":"span"},{"text":", volume 30, 2016.","element":"span"}],[{"id":"id-27","text":"[23] ","element":"span"},{"text":"Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 1995–2003. 
PMLR, 2016.","element":"span"}],[{"id":"id-28","text":"[24] ","element":"span"},{"text":"R.S. Sutton and A.G. Barto. Reinforcement learning: An introduction. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Transactions on Neural Networks","element":"span"},{"text":", 9, 1998. ISSN 1045-9227. doi: 10.1109/tnn.1998.712192.","element":"span"}],[{"id":"id-29","text":"[25] ","element":"span"},{"text":"David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 387–395. PMLR, 2014.","element":"span"}],[{"id":"id-30","text":"[26] ","element":"span"},{"text":"Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1509.02971","element":"span"},{"text":", 2015.","element":"span"}],[{"id":"id-31","text":"[27] ","element":"span"},{"text":"John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1707.06347","element":"span"},{"text":", 2017.","element":"span"}],[{"id":"id-32","text":"[28] ","element":"span"},{"text":"Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. volume 5, 2018.","element":"span"}],[{"id":"id-34","text":"[29] ","element":"span"},{"text":"Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. 
In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 2256–2265. PMLR, 2015.","element":"span"}],[{"id":"id-35","text":"[30] ","element":"span"},{"text":"Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 32, 2019.","element":"span"}],[{"id":"id-36","text":"[31] ","element":"span"},{"text":"Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 33:12438–12448, 2020.","element":"span"}],[{"id":"id-37","text":"[32] ","element":"span"},{"text":"Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 33:6840–6851, 2020.","element":"span"}],[{"id":"id-38","text":"[33] ","element":"span"},{"text":"Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2011.13456","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-40","text":"[34] ","element":"span"},{"text":"Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, and Dmitry Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 5556–5566. 
PMLR, 2020.","element":"span"}],[{"id":"id-46","text":"[35] ","element":"span"},{"text":"Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1804.08617","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-47","text":"[36] ","element":"span"},{"text":"Brendan O’Donoghue, Ian Osband, Remi Munos, and Volodymyr Mnih. The uncertainty bellman equation and exploration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 3836–3845, 2018.","element":"span"}],[{"id":"id-48","text":"[37] ","element":"span"},{"text":"Jus Kocijan, Roderick Murray-Smith, Carl E Rasmussen, and Agathe Girard. Gaussian process model based predictive control. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 2004 American control conference","element":"span"},{"text":", volume 3, pages 2214–2219. IEEE, 2004.","element":"span"}],[{"id":"id-49","text":"[38] ","element":"span"},{"text":"Craig Knuth, Glen Chou, Necmiye Ozay, and Dmitry Berenson. Planning with learned dynamics: Probabilistic guarantees on safety and reachability via lipschitz constants. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"IEEE Robotics and Automation Letters","element":"span"},{"text":", 6(3):5129–5136, 2021.","element":"span"}],[{"id":"id-50","text":"[39] ","element":"span"},{"text":"Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International conference on machine learning","element":"span"},{"text":", pages 449–458. 
PMLR, 2017.","element":"span"}],[{"id":"id-55","text":"[40] ","element":"span"},{"text":"Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1910.00177","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-57","text":"[41] ","element":"span"},{"text":"Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2004.07219","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-58","text":"[42] ","element":"span"},{"text":"Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. volume 2019-June, 2019.","element":"span"}],[{"id":"id-59","text":"[43] ","element":"span"},{"text":"Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:15084–15097, 2021.","element":"span"}],[{"id":"id-60","text":"[44] ","element":"span"},{"text":"Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2006.09359","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-61","text":"[45] ","element":"span"},{"text":"David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. Offline rl without off-policy evaluation. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:4933–4946, 2021.","element":"span"}],[{"id":"id-62","text":"[46] ","element":"span"},{"text":"Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in neural information processing systems","element":"span"},{"text":", 34:20132–20145, 2021.","element":"span"}],[{"id":"id-66","text":"[47] ","element":"span"},{"text":"Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1911.11361","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-67","text":"[48] ","element":"span"},{"text":"Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1907.00456","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-68","text":"[49] ","element":"span"},{"text":"Shentao Yang, Zhendong Wang, Huangjie Zheng, Yihao Feng, and Mingyuan Zhou. A regularized implicit policy for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.09673","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-69","text":"[50] ","element":"span"},{"text":"Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, and Mingsheng Long. Supported policy optimization for offline reinforcement learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2202.06239","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-70","text":"[51] ","element":"span"},{"text":"Jing Zhang, Chi Zhang, Wenjia Wang, and Bingyi Jing. Constrained policy optimization with explicit behavior density for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-71","text":"[52] ","element":"span"},{"text":"Jianxiong Li, Xianyuan Zhan, Haoran Xu, Xiangyu Zhu, Jingjing Liu, and Ya-Qin Zhang. Distance-sensitive offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2205.11027","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-72","text":"[53] ","element":"span"},{"text":"Wenxuan Zhou, Sujay Bajracharya, and David Held. Plas: Latent action space for offline reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Conference on Robot Learning","element":"span"},{"text":", pages 1719–1735. PMLR, 2021.","element":"span"}],[{"id":"id-73","text":"[54] ","element":"span"},{"text":"Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, and Chongjie Zhang. Latent-variable advantage-weighted policy optimization for offline rl. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2203.08949","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-74","text":"[55] ","element":"span"},{"text":"Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, and Shixiang Shane Gu. ","element":"span"},{"text":"Emaq: Expected-max q-learning operator for simple yet effective offline and online rl. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 3682–3691. 
PMLR, 2021.","element":"span"}],[{"id":"id-75","text":"[56] ","element":"span"},{"text":"Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, and Sergey Levine. Parrot: Data-driven behavioral priors for reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2011.10024","element":"span"},{"text":", 2020.","element":"span"}],[{"id":"id-76","text":"[57] ","element":"span"},{"text":"Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2208.06193","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-77","text":"[58] ","element":"span"},{"text":"Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2209.14548","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-78","text":"[59] ","element":"span"},{"text":"Wonjoon Goo and Scott Niekum. Know your boundaries: The necessity of explicit behavioral cloning in offline rl. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2206.00695","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-79","text":"[60] ","element":"span"},{"text":"Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-80","text":"[61] ","element":"span"},{"text":"Xiaoying Zhang, Junpu Chen, Hongning Wang, Hong Xie, Yang Liu, John Lui, and Hang Li. Uncertainty-aware instance reweighting for off-policy learning. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36:73691–73718, 2023.","element":"span"}],[{"id":"id-81","text":"[62] ","element":"span"},{"text":"Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. ","element":"span"},{"text":"Optimal conservative offline rl with general function approximation via augmented lagrangian. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2211.00716","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-82","text":"[63] ","element":"span"},{"text":"Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan. Quantile qt-opt for risk-aware vision-based robotic grasping. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1910.02787","element":"span"},{"text":", 2019.","element":"span"}],[{"id":"id-83","text":"[64] ","element":"span"},{"text":"Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, Qingwei Lin, Saravanakumar Rajmohan, Thomas Moscibroda, and Dongmei Zhang. Conservative state value estimation for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-84","text":"[65] ","element":"span"},{"text":"Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, and Xiangyang Ji. ","element":"span"},{"text":"Supported value regularization for offline reinforcement learning. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 36, 2024.","element":"span"}],[{"id":"id-85","text":"[66] ","element":"span"},{"text":"Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2205.09991","element":"span"},{"text":", 2022.","element":"span"}],[{"id":"id-110","text":"[67] ","element":"span"},{"text":"James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL ","element":"span"},{"href":"http://github.com/google/jax","style":{"fontFamily":"monospace"},"text":"http://github.com/google/jax","element":"a"},{"text":".","element":"span"}],[{"text":"[68] ","element":"span"},{"text":"Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", 35:26565–26577, 2022.","element":"span"}]]},{"heading":"Appendix A Further discussion about uncertainty estimation of Q-value by Q-distribution.","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The Q-value dataset enhancement with sliding window.","element":"span"}],[{"text":"In Section ","element":"span"},{"href":"#id-86","text":"3.1, ","element":"a"},{"text":"we analyze the challenges associated with deriving the Q-value dataset. We propose the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window method to improve sample efficiency at the trajectory level. 
The implementation details of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window within an entire trajectory are illustrated in Figure ","element":"span"},{"href":"#id-44","text":"A.1.","element":"a"}],[{"text":"This sliding window framework not only facilitates the expansion of Q-value data but also prevents the state-action pairs from becoming overly dense, thereby mitigating the risk of Q-value homogenization. In continuous state-action spaces, Q-values tend to be homogeneous when states and actions are close together in a trajectory, which may hinder subsequent learning of the Q-value distribution.","element":"span"}],[{"style":{"width":"95%"},"width":1515,"height":956,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-0.png","element":"img"}],[{"id":"id-44","text":"Figure A.1: This exemplifies how the sliding window mechanism operates to augment Q data. Let’s ","element":"figcaption","subtype":"caption"},{"text":"consider a sliding window with a width of 50 and a step size of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"= 10","element":"figcaption","subtype":"caption"},{"text":". For a specific trajectory, at step 1, we commence with ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":124.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-1.png","element":"img","alt":" (s1, a1)","inline":true,"padRight":true},{"text":"and compute the truncated Q-value utilizing trajectories within the window. 
At step 2, the sliding window progresses ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"k ","element":"figcaption","subtype":"caption"},{"text":"steps forward, allowing us to compute the truncated Q-value for ","element":"figcaption","subtype":"caption"},{"style":{"height":16},"width":218.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-2.png","element":"img","alt":" (s1+k, a1+k).","inline":true}],[{"id":"id-51","style":{"fontWeight":"bold"},"text":"A.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Drawbacks of the diffusion model for estimating the uncertainty","element":"span"}],[{"text":"Suppose we use the score matching method proposed in [","element":"span"},{"href":"#id-35","referenceIndex":30,"text":"30","element":"a"},{"text":"] to learn a conditional score network ","element":"span"},{"style":{"height":16},"width":201.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-3.png","element":"img","alt":"sθ(x, σ|s, a)","inline":true,"padRight":true},{"text":"to approximate the score function of the Q-value distribution ","element":"span"},{"style":{"height":16.78},"width":101.4,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-4.png","element":"img","alt":" pQ(x)","inline":true},{"text":". Then we use the annealed Langevin dynamics as in [","element":"span"},{"href":"#id-35","referenceIndex":30,"text":"30","element":"a"},{"text":"] to sample from the learned Q-value distribution. 
For ","element":"span"},{"style":{"height":16},"width":172.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-5.png","element":"img","alt":"x0 ∼ π(x)","inline":true,"padRight":true},{"text":"from some arbitrary prior distribution ","element":"span"},{"style":{"height":16},"width":78.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-6.png","element":"img","alt":" π(x)","inline":true},{"text":", the denoised sample is:","element":"span"}],[{"id":"id-87","style":{"width":"87%"},"width":1392,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-7.png","element":"img"}],[{"text":"The distribution of ","element":"span"},{"style":{"height":16.78},"width":721.52,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/14-8.png","element":"img","alt":" xi+1 equals pQ(x) when ϵ → 0 and T → ∞.","inline":true}],[{"text":"Our primary approach to quantify the uncertainty of the action with respect to the Q-value is to evaluate the spread of the sampled Q-values for each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":". We then use the gradient of the Q-value samples with respect to the action to assess their sensitivity to changes in action. 
By iteratively deriving the gradient over the sampling chain of length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", as shown in Eq","element":"span"},{"href":"#id-87","text":".A.1, ","element":"a"},{"text":"we obtain the following approximation:","element":"span"}],[{"id":"id-88","style":{"width":"64%"},"width":1020,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-0.png","element":"img"}],[{"text":"From Eq","element":"span"},{"href":"#id-88","text":".A.2, ","element":"a"},{"text":"the impact of actions on Q-value samples learned from the diffusion model shows considerable instability, especially during the iterative gradient-solving process for the score network ","element":"span"},{"style":{"height":16},"width":201.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-1.png","element":"img","alt":"sθ(x, σ|s, a)","inline":true},{"text":". This instability often leads to vanishing or exploding gradients. While the diffusion model effectively recovers the Q-value distribution with high precision, the multi-step sampling process introduces significant fluctuations and noise, making it difficult to accurately assess the uncertainty of Q-values for different actions.","element":"span"}],[{"text":"Additionally, different prior samples may yield the same target sample, and this stochastic correspondence introduces an uncontrollable effect on the uncertainty of the Q-value. Moreover, the effect of prior sample variance on the uncertainty of the sampled Q-value cannot be quantified with the diffusion model. 
Consequently, we cannot guarantee that changes in the Q-value uncertainty are attributable to the action, so the reliability of the uncertainty estimate cannot be guaranteed.","element":"span"}]]},{"heading":"B Convergence of the truncated Q-value distribution.","paragraphs":[[{"text":"We first give a formal statement of Theorem ","element":"span"},{"href":"#id-21","text":"4.1 ","element":"a"},{"text":"as Theorem ","element":"span"},{"href":"#id-89","text":"B.1.","element":"a"}],[{"id":"id-89","style":{"fontWeight":"bold"},"text":"Theorem B.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the true distribution of the Q-value w.r.t. the behavior policy ","element":"span"},{"style":{"height":11.58},"width":40.72,"height":28.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-2.png","element":"img","alt":" πβ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is defined as ","element":"span"},{"style":{"height":17.87},"width":144.32,"height":44.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-3.png","element":"img","alt":"FQπβ (x)","inline":true},{"style":{"fontStyle":"italic"},"text":". By Eq","element":"span"},{"href":"#id-45","style":{"fontStyle":"italic"},"text":".5, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"we derive the truncated Q-value ","element":"span"},{"style":{"height":19.01},"width":66.16,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-4.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", and denote the distribution of the truncated ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q-value by ","element":"span"},{"style":{"height":22.7},"width":144.32,"height":56.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-5.png","element":"img","alt":" FQπβT (x)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Assume the true Q-value is finite over the state and action space. Then ","element":"span"},{"style":{"height":19.01},"width":66.16,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-6.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"converges in distribution to the true Q-value ","element":"span"},{"style":{"height":14},"width":80.88,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-7.png","element":"img","alt":" Qπβ.","inline":true}],[{"style":{"width":"67%"},"width":1063,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-8.png","element":"img"}],[{"text":"We will show that the distribution of the truncated Q-value ","element":"span"},{"style":{"height":19.01},"width":66.12,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-9.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"text":"has the same properties as the true distribution of the Q-value ","element":"span"},{"style":{"height":19.02},"width":66.16,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-10.png","element":"img","alt":" QπβT ","inline":true,"padRight":true},{"text":"and give a brief proof of Theorem ","element":"span"},{"href":"#id-21","text":"4.1.","element":"a"}],[{"text":"Suppose the state-action space determined by the offline RL dataset ","element":"span"},{"style":{"height":13.98},"width":344.92,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-11.png","element":"img","alt":" D is ΩD = SD × AD","inline":true},{"text":". Note that ","element":"span"},{"style":{"height":14},"width":66.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-12.png","element":"img","alt":"Qπβ ","inline":true,"padRight":true},{"text":"can be seen as an r.v. 
defined as: ","element":"span"},{"style":{"height":23.73},"width":980.56,"height":59.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-13.png","element":"img","alt":" (ΩD, F, P · πβ) Qπβ−−−→ (R, B(R), (P · πβ) ◦ (Qπβ)−1), where","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"F ","element":"span"},{"text":"is the ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-14.png","element":"img","alt":" σ","inline":true},{"text":"-field on ","element":"span"},{"style":{"height":15.58},"width":170.2,"height":38.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-15.png","element":"img","alt":" ΩD, P · πβ","inline":true,"padRight":true},{"text":"is the probability measure on ","element":"span"},{"style":{"height":13.18},"width":53.76,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-16.png","element":"img","alt":" ΩD","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"P ","element":"span"},{"text":"is the transition probability measure over state, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"text":"(","element":"span"},{"text":"R","element":"span"},{"text":") ","element":"span"},{"text":"is the Borel ","element":"span"},{"style":{"height":18.18},"width":979.56,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-17.png","element":"img","alt":" σ-field on R, (P · πβ) ◦ (Qπβ)−1 = (P · πβ)((Qπβ)−1) is the","inline":true,"padRight":true},{"text":"push forward probability measure on ","element":"span"},{"style":{"height":19.01},"width":349.72,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-18.png","element":"img","alt":" R. 
Same as Qπβ, QπβT ","inline":true,"padRight":true},{"text":"can also be seen as an r.v.","element":"span"}],[{"text":"Then we show that ","element":"span"},{"style":{"height":19.02},"width":66.16,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-19.png","element":"img","alt":" QπβT ","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":14},"width":66.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-20.png","element":"img","alt":" Qπβ","inline":true,"padRight":true},{"text":"almost surely: ","element":"span"},{"style":{"height":19.31},"width":235.84,"height":48.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-21.png","element":"img","alt":" QπβT a.s.−−→ Qπβ","inline":true},{"text":", as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"tends to infinity.","element":"span"}],[{"text":"Define the trajectory-level dataset ","element":"span"},{"style":{"height":16},"width":797.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-22.png","element":"img","alt":" Dτ = {τk|τk = (sk0, ak0, rk0, sk1, ak1, rk1, · · · )}","inline":true},{"text":". 
Then for any trajectory ","element":"span"},{"style":{"height":13.18},"width":145.6,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-23.png","element":"img","alt":" τk ∈ Dτ","inline":true},{"text":", the true Q-value w.r.t. this trajectory can be rewritten without the expectation ","element":"span"},{"style":{"height":5.2},"width":32,"height":13,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-24.png","element":"img","alt":"∞","inline":true}],[{"text":"as: ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":16},"width":221.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-25.png","element":"img","alt":"π(sk0, ak0) =","inline":true}],[{"style":{"width":"3%"},"width":53,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-26.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Truncating the Q-value by termination. ","element":"span"},{"text":"If termination occurs at step ","element":"span"},{"style":{"height":13.18},"width":32.72,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-27.png","element":"img","alt":" kt","inline":true,"padRight":true},{"text":"of the trajectory ","element":"span"},{"style":{"height":9.2},"width":47.04,"height":23,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-28.png","element":"img","alt":" τk,","inline":true,"padRight":true},{"text":"then we have ","element":"span"},{"style":{"height":17.58},"width":248.36,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-29.png","element":"img","alt":" r(skj, akj) = 0","inline":true},{"text":", for ","element":"span"},{"style":{"height":14},"width":104.6,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-30.png","element":"img","alt":" j > kt","inline":true},{"text":". 
Then the truncated Q-value is identical to the true Q-value: ","element":"span"},{"style":{"height":19.94},"width":335.2,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-31.png","element":"img","alt":"Qπβkt ≡ Qπ(sk0, ak0)","inline":true},{"text":". So the two distributions coincide when the ","element":"span"},{"text":"truncation is caused by termination of the task.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Truncating the Q-value by sliding window. ","element":"span"},{"text":"If the Q-value is truncated by a sliding window as shown in Figure ","element":"span"},{"href":"#id-44","text":"A.1, ","element":"a"},{"text":"then for a specific ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window of width ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"with starting point ","element":"span"},{"style":{"height":16},"width":115.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-32.png","element":"img","alt":" (si, ai)","inline":true}],[{"text":"over a trajectory ","element":"span"},{"style":{"height":19.02},"width":410.04,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-33.png","element":"img","alt":" τ, we have QπβT (si, ai) =","inline":true}],[{"style":{"width":"3%"},"width":62,"height":14,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/15-34.png","element":"img"}],[{"text":"Define the state-action set ","element":"span"},{"style":{"height":28.32},"width":1180.68,"height":70.8,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-0.png","element":"img","alt":" Bn(ξ) := ∞∪T =nAn(ξ), where AT (ξ) = {(s, a) : |QπβT (s, a)−Qπβ(s, a)| 
>","inline":true},{"style":{"height":16},"width":39.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-1.png","element":"img","alt":"ξ}","inline":true},{"text":". By the definition of Q-value, we have ","element":"span"},{"style":{"height":16},"width":276.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-2.png","element":"img","alt":" Bm(ξ) = Am(ξ)","inline":true},{"text":", as ","element":"span"},{"style":{"height":16},"width":556.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-3.png","element":"img","alt":" AT (ξ) ⊃ AT +1(ξ) ⊃ AT +2(ξ) ⊃","inline":true},{"style":{"height":4.8},"width":46.44,"height":12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-4.png","element":"img","alt":"· · ·","inline":true,"padRight":true},{"text":"is decreasing. So ","element":"span"},{"style":{"height":16},"width":102.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-5.png","element":"img","alt":" Bn(ξ)","inline":true,"padRight":true},{"text":"is also decreasing. 
Then,","element":"span"}],[{"id":"id-90","style":{"width":"77%"},"width":1221,"height":55,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-6.png","element":"img"}],[{"text":"By definition, ","element":"span"},{"style":{"height":17.54},"width":801.88,"height":43.84,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-7.png","element":"img","alt":" An(ξ) = {(s, a) : |Qπβn (s, a) − Qπβ(s, a)| > ξ}","inline":true},{"text":", and assume the reward function ","element":"span"},{"style":{"height":16},"width":90.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-8.png","element":"img","alt":"r(·, ·)","inline":true,"padRight":true},{"text":"is bounded as ","element":"span"},{"style":{"height":16},"width":161.08,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-9.png","element":"img","alt":" r(·, ·) < c","inline":true,"padRight":true},{"text":"for some constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"86%"},"width":1375,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-10.png","element":"img"}],[{"text":"So ","element":"span"},{"style":{"height":16.8},"width":183.84,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-11.png","element":"img","alt":" An(ξ) → ∅","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"tends to infinity, since the discount factor ","element":"span"},{"style":{"height":14.4},"width":96,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-12.png","element":"img","alt":" γ < 1","inline":true,"padRight":true},{"text":"by definition.","element":"span"}],[{"text":"Then by Eq","element":"span"},{"href":"#id-90","text":".B.2, ","element":"a"},{"text":"we have 
","element":"span"},{"style":{"height":16},"width":518.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-13.png","element":"img","alt":" limm→∞ P(Bn(ξ)) = 0, ∀ξ > 0","inline":true},{"text":". In probability theory, this is equivalent to ","element":"span"},{"style":{"height":19.33},"width":250.6,"height":48.32,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-14.png","element":"img","alt":" QπβT a.s.−−→ Qπβ.","inline":true}],[{"text":"Furthermore, since ","element":"span"},{"style":{"height":19.02},"width":66.12,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-15.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"text":"converges to ","element":"span"},{"style":{"height":14},"width":66.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-16.png","element":"img","alt":" Qπβ","inline":true,"padRight":true},{"text":"almost surely, ","element":"span"},{"style":{"height":19.02},"width":66.12,"height":47.56,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-17.png","element":"img","alt":" QπβT","inline":true,"padRight":true},{"text":"also converges to ","element":"span"},{"style":{"height":14},"width":66.12,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-18.png","element":"img","alt":" Qπβ","inline":true,"padRight":true},{"text":"in probability ","element":"span"},{"text":"and in distribution. This completes the proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"B.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-21","text":"4.1 ","element":"a"},{"text":"implies that for arbitrarily small ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-19.png","element":"img","alt":" ϵ","inline":true},{"text":", there exists a sufficiently large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"such that ","element":"span"},{"style":{"height":22.69},"width":441.36,"height":56.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-20.png","element":"img","alt":" |FQπβT (x) − FQπβ (x)| < ϵ","inline":true},{"text":". Given that the impact of rewards diminishes exponentially ","element":"span"},{"text":"as the trajectory length increases, it is unnecessary to set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"to an excessively large value. Recall that the goal of the Q dataset is to learn the Q-distribution and assess the uncertainty of different actions. Therefore, the absolute magnitude of Q is not crucial. Additionally, using too many future steps may introduce significant uncertainty into the Q-value, as predictions for the distant future can be inaccurate.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"B.2","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"In the proof, the specific starting point of the sliding window plays no role; only the length of the window matters. This is primarily due to the Markovian nature of trajectories in RL: given the current state and action, the future does not depend on earlier ones (the memoryless property). 
Consequently, the starting point of the sliding window exerts minimal influence on the computation of Q.","element":"span"}]]},{"heading":"C Robustness of consistency model for uncertainty measure.","paragraphs":[[{"text":"The formal statement of Theorem ","element":"span"},{"href":"#id-22","text":"4.2 ","element":"a"},{"text":"is given in Theorem ","element":"span"},{"href":"#id-91","text":"C.1.","element":"a"}],[{"id":"id-91","style":{"fontWeight":"bold"},"text":"Theorem C.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Following the assumptions in ","element":"span"},{"href":"#id-20","referenceIndex":20,"style":{"fontStyle":"italic"},"text":"[20]","element":"a"},{"style":{"fontStyle":"italic"},"text":", we assume ","element":"span"},{"style":{"height":16},"width":310.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-21.png","element":"img","alt":" fθ(x, T|(s, a)) is L","inline":true},{"style":{"fontStyle":"italic"},"text":"-Lipschitz.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"By using the partial gradient to analyze the influence of prior samples ","element":"span"},{"style":{"height":13.2},"width":45.76,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-22.png","element":"img","alt":" ˆxT","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":", time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"on the variance of the denoised sample ","element":"span"},{"style":{"height":16},"width":140.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-23.png","element":"img","alt":" var(Xϵ)","inline":true},{"style":{"fontStyle":"italic"},"text":", with high probability, we 
have:","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"(1) Prior ","element":"span"},{"style":{"fontStyle":"italic"},"text":"noise ","element":"span"},{"style":{"fontStyle":"italic"},"text":"influence ","element":"span"},{"style":{"fontStyle":"italic"},"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"variance ","element":"span"},{"style":{"height":16},"width":111.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-24.png","element":"img","alt":"V (Xϵ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"bounded ","element":"span"},{"style":{"fontStyle":"italic"},"text":"by: ","element":"span"},{"style":{"height":23.22},"width":233.96,"height":58.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-25.png","element":"img","alt":"| ∂V (Xϵ)∂ˆxT | =","inline":true},{"style":{"height":19.2},"width":417.64,"height":48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-26.png","element":"img","alt":"O(L2Tn−1�T log(n))1.","inline":true}],[{"style":{"width":"92%"},"width":1470,"height":145,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/16-27.png","element":"img"}],[{"text":"As discussed in Section ","element":"span"},{"href":"#id-86","text":"3.1 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-51","text":"A.2, ","element":"a"},{"text":"while the diffusion model has shown great success in learning distributions and generating samples, it is less suitable for scenarios where the influence of certain parameters on sample uncertainty must be guaranteed. When estimating uncertainty, we sample multiple Q-values for each ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"pair, and the standard deviation of these sampled Q-values measures the uncertainty. 
Therefore, it is crucial that the sample spread is sensitive to changes in the action so that OOD actions can be identified accurately.","element":"span"}],[{"text":"However, the multi-step forward denoising process of the diffusion model undermines the influence of actions on the sampled Q-values, compromising the robustness of uncertainty estimation. Additionally, the lack of a one-to-one correspondence between prior information and target samples prevents the cancellation of prior effects on the Q-value distribution through repeated sampling.","element":"span"}],[{"text":"In contrast, the consistency model addresses these challenges. It not only overcomes the aforementioned issues, but its one-step sampling also significantly improves efficiency. In the following theoretical analysis, we will demonstrate the robustness of the consistency model in estimating uncertainty.","element":"span"}],[{"text":"As described in [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] and Section ","element":"span"},{"href":"#id-92","text":"2.2, ","element":"a"},{"text":"a consistency model ","element":"span"},{"style":{"height":16},"width":123.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-0.png","element":"img","alt":" fθ(x, t)","inline":true,"padRight":true},{"text":"is trained to map the prior noise on any trajectory of the PF ODE to the trajectory’s origin ","element":"span"},{"style":{"height":9.18},"width":36.8,"height":22.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-1.png","element":"img","alt":" xϵ","inline":true,"padRight":true},{"text":"by: ","element":"span"},{"style":{"height":16},"width":217.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-2.png","element":"img","alt":" fθ(x, t) = xϵ","inline":true},{"text":", given that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and 
","element":"span"},{"style":{"height":9.18},"width":36.76,"height":22.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-3.png","element":"img","alt":" xϵ","inline":true,"padRight":true},{"text":"belong to the same PF ODE trajectory. ","element":"span"},{"style":{"height":16},"width":123.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-4.png","element":"img","alt":" fθ(x, t)","inline":true,"padRight":true},{"text":"is defined as in Eq","element":"span"},{"href":"#id-39","text":".4.","element":"a"}],[{"text":"Suppose we train a conditional consistency model ","element":"span"},{"style":{"height":16},"width":192.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-5.png","element":"img","alt":" fθ(x, t|s, a)","inline":true,"padRight":true},{"text":"with the truncated Q-value dataset ","element":"span"},{"style":{"height":19.01},"width":312.48,"height":47.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-6.png","element":"img","alt":"DQ = {QπβT (s, a)}","inline":true},{"text":". Following the one-step sampling procedure of the consistency model, we first initialize ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"noise samples ","element":"span"},{"style":{"height":17.39},"width":521.76,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-7.png","element":"img","alt":"ˆxTi ∼ N(0, T 2), i = 1, 2, ..., n","inline":true},{"text":", then perform one-step forward denoising to obtain ","element":"span"},{"style":{"fontStyle":"italic"},"text":"n ","element":"span"},{"text":"samples ","element":"span"},{"style":{"height":14.78},"width":98.68,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-8.png","element":"img","alt":" ˆxϵi 
=","inline":true},{"style":{"height":16},"width":396.04,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-9.png","element":"img","alt":"fθ(ˆxTi, T|s, a), where T","inline":true,"padRight":true},{"text":"is a fixed time step. The variance of the sampled Q-values is:","element":"span"}],[{"style":{"width":"85%"},"width":1348,"height":129,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-10.png","element":"img"}],[{"text":"Next, we derive the gradient of ","element":"span"},{"style":{"height":16},"width":111.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-11.png","element":"img","alt":" V (Xϵ)","inline":true,"padRight":true},{"text":"w.r.t. ","element":"span"},{"style":{"height":14.78},"width":154.84,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-12.png","element":"img","alt":" ˆxTi, T, a","inline":true},{"text":", and examine how changes in these variables influence the variance. 
Since the state ","element":"span"},{"style":{"fontStyle":"italic"},"text":"s ","element":"span"},{"text":"is always in-distribution during the offline RL training process and has little influence on the uncertainty of the sampled Q-value, we skip its analysis.","element":"span"}],[{"text":"Following ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"[20]","element":"a"},{"text":", we assume that ","element":"span"},{"style":{"height":16},"width":265.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-13.png","element":"img","alt":" fθ(x, t|s, a) is L","inline":true},{"text":"-Lipschitz bounded, i.e., for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"y","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"50%"},"width":798,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-14.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Gradient of the prior ","element":"span"},{"style":{"height":14.78},"width":66.24,"height":36.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-15.png","element":"img","alt":" ˆxTi.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":9.18},"width":29.56,"height":22.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-16.png","element":"img","alt":" ei","inline":true,"padRight":true},{"text":"be the square difference of the ","element":"span"},{"style":{"height":15.18},"width":222.6,"height":37.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-17.png","element":"img","alt":" i-th prior ˆxTi:","inline":true}],[{"style":{"width":"80%"},"width":1272,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-18.png","element":"img"}],[{"text":"Then we have:","element":"span"}],[{"style":{"width":"66%"},"width":1053,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-19.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":",","element":"span"}],[{"id":"id-93","style":{"width":"100%"},"width":1715,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-20.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"style":{"height":15.2},"width":65.76,"height":38,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-21.png","element":"img","alt":" ̸= i,","inline":true}],[{"id":"id-94","style":{"width":"95%"},"width":1517,"height":257,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-22.png","element":"img"}],[{"text":"Thus, plugging Eq","element":"span"},{"href":"#id-93","text":".C.4 ","element":"a"},{"text":"and Eq","element":"span"},{"href":"#id-94","text":".C.5 ","element":"a"},{"text":"into 
","element":"span"},{"style":{"height":25.81},"width":220.12,"height":64.52,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-23.png","element":"img","alt":"∂V (Xϵ)∂ˆxTi yields","inline":true}],[{"style":{"width":"97%"},"width":1541,"height":320,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/17-24.png","element":"img"}],[{"text":"As ","element":"span"},{"style":{"height":16},"width":265.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-0.png","element":"img","alt":" fθ(x, t|s, a) is L","inline":true},{"text":"-Lipschitz bounded, we have","element":"span"}],[{"style":{"width":"93%"},"width":1477,"height":237,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-1.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":17.58},"width":311.16,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-2.png","element":"img","alt":" |ˆxTi − ˆxTj|, ∀j ̸= i","inline":true,"padRight":true},{"text":"can be bounded by ","element":"span"},{"style":{"height":17.39},"width":543.48,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-3.png","element":"img","alt":" cT√log n due to ˆxTi ∼ N(0, T 2)","inline":true,"padRight":true},{"text":"with probability at least ","element":"span"},{"style":{"height":13.38},"width":133.48,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-4.png","element":"img","alt":" 1 − n−1","inline":true},{"text":". 
Denoting the constant by ","element":"span"},{"style":{"height":11.6},"width":34.24,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-5.png","element":"img","alt":" cp","inline":true,"padRight":true},{"text":"and applying the previous process to all the prior samples completes the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gradient of the time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Proof.","element":"span"}],[{"text":"Note that","element":"span"}],[{"id":"id-96","style":{"width":"100%"},"width":1661,"height":437,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-6.png","element":"img"}],[{"id":"id-95","style":{"height":35.1},"width":98.84,"height":87.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-7.png","element":"img","alt":"≤4L2n","inline":true}],[{"text":"with probability at least ","element":"span"},{"style":{"height":13.39},"width":136,"height":33.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-8.png","element":"img","alt":" 1 − n−1","inline":true},{"text":", since ","element":"span"},{"style":{"height":17.58},"width":320.16,"height":43.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-9.png","element":"img","alt":" |ˆxTi − ˆxTj|, ∀j ̸= i","inline":true,"padRight":true},{"text":"can be bounded by ","element":"span"},{"style":{"height":16.05},"width":161.4,"height":40.12,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-10.png","element":"img","alt":" cT√log n","inline":true,"padRight":true},{"text":"due to ","element":"span"},{"style":{"height":17.38},"width":263.76,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-11.png","element":"img","alt":"ˆxTi ∼ N(0, T 2)","inline":true,"padRight":true},{"text":"with probability at 
least ","element":"span"},{"style":{"height":13.78},"width":145.32,"height":34.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-12.png","element":"img","alt":" 1 − n−1.","inline":true}],[{"text":"Plugging Eq","element":"span"},{"href":"#id-95","text":".C.9 ","element":"a"},{"text":"into Eq","element":"span"},{"href":"#id-96","text":".C.8 ","element":"a"},{"text":"yields ","element":"span"},{"id":"id-97","style":{"height":39.1},"width":153.32,"height":97.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-13.png","element":"img","alt":"∂V (Xϵ)∂T","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Gradient of the action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"To obtain ","element":"span"},{"style":{"height":21.62},"width":109.2,"height":54.04,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-14.png","element":"img","alt":"∂V (Xϵ)∂a","inline":true,"padRight":true},{"text":", we only need to take the partial derivative of ","element":"span"},{"style":{"height":16},"width":111.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-15.png","element":"img","alt":" V (Xϵ)","inline":true,"padRight":true},{"text":"for each dimension of the action ","element":"span"},{"style":{"height":23.23},"width":529.96,"height":58.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-16.png","element":"img","alt":" a = {a1, a2, · · · , am} by ∂V (Xϵ)∂ai","inline":true,"padRight":true},{"text":". 
The result is the same as that obtained in Eq","element":"span"},{"href":"#id-97","text":".C.10.","element":"a"}],[{"text":"Taking the vector form, we have ","element":"span"},{"style":{"height":21.63},"width":497.88,"height":54.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-17.png","element":"img","alt":" | ∂V (Xϵ)∂a | = O(L2T√log n) · 1","inline":true},{"text":", which finishes the proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"C.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Theorem ","element":"span"},{"href":"#id-22","text":"4.2 ","element":"a"},{"text":"elucidates the diminishing impact of the random prior on the variance of the denoised Q-value as the sample size increases. Leveraging the consistency of sampling, we mitigate concerns regarding the influence of prior samples on the uncertainty of final target samples. Given the fixed sampling step size ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", we also address concerns about its effect on the uncertainty of Q samples. However, for a thorough analysis, we still include the gradient analysis of ","element":"span"},{"style":{"height":16},"width":280.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/18-18.png","element":"img","alt":" V (Xϵ) against T.","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"C.2","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"The influence of actions on sample variance depends on factors like the Lipschitz factor and sample size. As the sample size increases, the impact of actions on Q sample variance does not diminish; instead, it becomes more sensitive. Larger Q sample variance is more likely to occur when OOD actions are present. 
Consequently, the consistency model proves to be a reliable approach for Q-value sampling and uncertainty estimation.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"C.3","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"Although experiments in [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] show that the performance of the consistency model is less competitive compared to the diffusion model or adversarial generators like GANs, these findings have minimal relevance to our method. Our primary focus is not on the absolute accuracy of the sampled Q-values but on the sensitivity of Q sample dispersion to OOD actions and its ability to effectively capture uncertainty in such cases. Additionally, the high sampling efficiency achieved through one-step sampling compensates for the minor performance discrepancies of the consistency model.","element":"span"}]]},{"heading":"D Convergence of the uncertainty-aware learning objective for recovering the Q-value function.","paragraphs":[[{"text":"We first give a formal introduction of Theorem ","element":"span"},{"href":"#id-23","text":"4.3 ","element":"a"},{"text":"below.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem D.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Updating Q-value ","element":"span"},{"style":{"height":16},"width":138.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-0.png","element":"img","alt":" Qθ(s, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"via the uncertainty-aware objective in Eq","element":"span"},{"href":"#id-52","style":{"fontStyle":"italic"},"text":".7 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"is equivalent to minimizing the ","element":"span"},{"style":{"height":13.2},"width":43.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-1.png","element":"img","alt":" L2","inline":true},{"style":{"fontStyle":"italic"},"text":"-norm of Bellman residuals: ","element":"span"},{"style":{"height":19.07},"width":658.92,"height":47.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-2.png","element":"img","alt":" Es∼PD(s),a∼π(a|s)[Q(s, a) − FQ(s, a)]2","inline":true},{"style":{"fontStyle":"italic"},"text":", where the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is defined as:","element":"span"}],[{"style":{"width":"88%"},"width":1409,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-3.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"In addition, assume ","element":"span"},{"style":{"height":22.58},"width":396.04,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-4.png","element":"img","alt":"1HQ(a′|s′)1(a′∈U(Q)) < β","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is a ","element":"span"},{"style":{"height":10.4},"width":39.24,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-5.png","element":"img","alt":" cγ","inline":true},{"style":{"fontStyle":"italic"},"text":"-contraction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"operator in the ","element":"span"},{"style":{"height":13.18},"width":59.12,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-6.png","element":"img","alt":" L∞","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"norm, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c < ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":". The Q-value function ","element":"span"},{"style":{"height":16},"width":138.52,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-7.png","element":"img","alt":" Qθ(s, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can converge to a fixed point by the value iteration method.","element":"span"}],[{"text":"We first show that the uncertainty-aware learning objective is equivalent to minimizing the Bellman residual defined by a specific Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":".","element":"span"}],[{"text":"We then prove that the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"is a 
","element":"span"},{"style":{"height":10.4},"width":39.24,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-8.png","element":"img","alt":" cγ","inline":true},{"text":"-contraction operator in the ","element":"span"},{"style":{"height":13.18},"width":59.12,"height":32.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-9.png","element":"img","alt":" L∞","inline":true,"padRight":true},{"text":"norm, where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"c < ","element":"span"},{"text":"1","element":"span"},{"text":", so that the Q-value function ","element":"span"},{"style":{"height":16},"width":138.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-10.png","element":"img","alt":" Qθ(s, a)","inline":true,"padRight":true},{"text":"can converge to a fixed point by the value iteration method.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Derivation of the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontWeight":"bold"},"text":".","element":"span"}],[{"text":"Recall that the Q-value is optimized with the uncertainty-aware learning objective in Eq","element":"span"},{"href":"#id-52","text":".7:","element":"a"}],[{"id":"id-99","style":{"width":"75%"},"width":1189,"height":57,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-11.png","element":"img"}],[{"text":"where","element":"span"}],[{"id":"id-98","style":{"width":"95%"},"width":1516,"height":270,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-12.png","element":"img"}],[{"text":"The uncertainty-penalized Q target ","element":"span"},{"style":{"height":16},"width":166.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-13.png","element":"img","alt":" QL(s′, 
a′)","inline":true,"padRight":true},{"text":"is defined as:","element":"span"}],[{"style":{"width":"85%"},"width":1360,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-14.png","element":"img"}],[{"text":"For simplicity, we drop the parameter subscript of ","element":"span"},{"style":{"height":16},"width":138.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-15.png","element":"img","alt":" Qθ(s, a)","inline":true,"padRight":true},{"text":"and simply write ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"in the following proof.","element":"span"}],[{"text":"We can treat the uncertainty-aware loss in Eq","element":"span"},{"href":"#id-52","text":".7 ","element":"a"},{"text":"as a plain regression-style loss, which gives:","element":"span"}],[{"style":{"width":"100%"},"width":1676,"height":490,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-16.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"C ","element":"span"},{"text":"is a term that does not depend on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":")","element":"span"},{"text":", since the values of ","element":"span"},{"style":{"height":16},"width":202.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-17.png","element":"img","alt":" (BQ)L(s, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":")(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, 
a","element":"span"},{"text":") ","element":"span"},{"text":"are fixed as we update ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":". By the definition of ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":")(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":16},"width":202.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-18.png","element":"img","alt":" (BQ)L(s, a)","inline":true,"padRight":true},{"text":"in Eq","element":"span"},{"href":"#id-98","text":".D.3 ","element":"a"},{"text":"and Eq","element":"span"},{"href":"#id-98","text":".D.4, ","element":"a"},{"text":"the uncertainty-aware learning objective is equivalent to minimizing the following Bellman equation:","element":"span"}],[{"id":"id-100","style":{"width":"72%"},"width":1144,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/19-19.png","element":"img"}],[{"text":"where the specific Bellman operator ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":")(","element":"span"},{"style":{"fontStyle":"italic"},"text":"s, a","element":"span"},{"text":") ","element":"span"},{"text":"is defined in Eq","element":"span"},{"href":"#id-99","text":".D.1. 
","element":"a"},{"text":"This completes the proof.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"D.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Bellman operator ","element":"span"},{"style":{"height":15.6},"width":158.36,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-0.png","element":"img","alt":" FQ is cγ","inline":true},{"style":{"fontWeight":"bold"},"text":"-contraction operator in the ","element":"span"},{"style":{"height":13.2},"width":173.8,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-1.png","element":"img","alt":" L∞ norm.","inline":true}],[{"text":"We now give a brief proof of the convergence of the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"under value iteration.","element":"span"}],[{"text":"For any Q-value functions ","element":"span"},{"style":{"height":16},"width":1026.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-2.png","element":"img","alt":" Q(s, a), Q′(s, a), define I = |FQ(s, a) − FQ′(s, a)|, we have","inline":true}],[{"style":{"width":"100%"},"width":1939,"height":1426,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-3.png","element":"img"}],[{"text":"Since ","element":"span"},{"style":{"height":22.58},"width":395.08,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-4.png","element":"img","alt":"1HQ(a′|s′)1(a′∈U(Q)) < β","inline":true},{"text":", the identity ","element":"span"},{"style":{"height":24.83},"width":872.36,"height":62.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-5.png","element":"img","alt":" max{α + (1 − α)β, α + (1−α)HQ(a′|s′)} = α + (1 − α)β","inline":true,"padRight":true},{"text":"always ","element":"span"},{"text":"holds. 
As ","element":"span"},{"style":{"height":16},"width":756.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-6.png","element":"img","alt":" β < 1, then α + (1 − α)β < α + (1 − α) < 1.","inline":true}],[{"text":"Set ","element":"span"},{"style":{"height":16},"width":293,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-7.png","element":"img","alt":" c = α + (1 − α)β","inline":true},{"text":", then we have","element":"span"}],[{"style":{"width":"79%"},"width":1254,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-8.png","element":"img"}],[{"text":"which implies ","element":"span"},{"style":{"height":15.6},"width":158.4,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-9.png","element":"img","alt":" FQ is cγ","inline":true},{"text":"-contraction operator with ","element":"span"},{"style":{"height":14.4},"width":123.16,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-10.png","element":"img","alt":" cγ < 1.","inline":true}],[{"text":"Suppose ","element":"span"},{"style":{"height":16.99},"width":57.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-11.png","element":"img","alt":" Q∆","inline":true,"padRight":true},{"text":"is the stationary point of the Bellman equation in Eq","element":"span"},{"href":"#id-100","text":".D.7, ","element":"a"},{"text":"then it can be shown that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"iteratively updated by Eq","element":"span"},{"href":"#id-100","text":".D.7 ","element":"a"},{"text":"can converge to ","element":"span"},{"style":{"height":16.99},"width":70.64,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-12.png","element":"img","alt":" 
Q∆:","inline":true}],[{"style":{"width":"79%"},"width":1266,"height":177,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-13.png","element":"img"}],[{"text":"Sending ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"to infinity, we can derive ","element":"span"},{"style":{"height":16.19},"width":43.48,"height":40.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-14.png","element":"img","alt":" Qt ","inline":true,"padRight":true},{"text":"converge to ","element":"span"},{"style":{"height":16.99},"width":57.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-15.png","element":"img","alt":" Q∆","inline":true},{"text":", then we finish the proof. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"D.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"The assumption ","element":"span"},{"style":{"height":22.56},"width":389.24,"height":56.4,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-16.png","element":"img","alt":"1HQ(a′|s′)1(a′∈U(Q)) < β","inline":true,"padRight":true},{"text":"can be always satisfied. 
Roughly speaking, ","element":"span"},{"text":"since ","element":"span"},{"style":{"height":16},"width":176.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-17.png","element":"img","alt":" a′ ∈ U(Q)","inline":true},{"text":", the uncertainty of Q-value ","element":"span"},{"style":{"height":16.8},"width":165.4,"height":42,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-18.png","element":"img","alt":" HQ(a′|s′)","inline":true,"padRight":true},{"text":"on this ","element":"span"},{"style":{"height":6.8},"width":35.08,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-19.png","element":"img","alt":" a′","inline":true,"padRight":true},{"text":"has large uncertainty due to the OOD property. Furthermore, we can scale the absolute value ","element":"span"},{"style":{"height":16.78},"width":165.4,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/20-20.png","element":"img","alt":" HQ(a′|s′)","inline":true,"padRight":true},{"text":"for all the action with","element":"span"}],[{"text":"same factor to guarantee ","element":"span"},{"style":{"height":22.58},"width":392.44,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-0.png","element":"img","alt":"1HQ(a′|s′)1(a′∈U(Q)) < β","inline":true,"padRight":true},{"text":"without hurting the relative comparison for the ","element":"span"},{"text":"uncertainty. 
Furthermore, experiment results have shown that ","element":"span"},{"style":{"height":22.58},"width":389.24,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-1.png","element":"img","alt":"1HQ(a′|s′)1(a′∈U(Q)) < β","inline":true,"padRight":true},{"text":"is consistently ","element":"span"},{"text":"satisfied without additional processing.","element":"span"}]]},{"heading":"E Performance of the Q-value function Qk(s, a) derived by QDQ.","paragraphs":[[{"text":"In this section, we delve into an analysis of the performance of the Q-value function derived from the QDQ algorithm. Given that the primary aim of QDQ is to mitigate the issue of excessively conservative in most pessimistic Q-value methods, our focus is directed towards scrutinizing the disparity between the optimal Q-value within the offline RL framework and the optimal Q-value function yielded by QDQ.","element":"span"}],[{"text":"The following is a formal version of Theorem ","element":"span"},{"href":"#id-24","text":"4.4.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Theorem E.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Suppose the optimal Q-value function over state-action space ","element":"span"},{"style":{"height":13.98},"width":161.4,"height":34.96,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-2.png","element":"img","alt":" SD × AD","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"defined by the dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":14.19},"width":47.52,"height":35.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-3.png","element":"img","alt":" Q∗","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then with probability at least ","element":"span"},{"style":{"height":14.4},"width":89.56,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-4.png","element":"img","alt":" 1 − η","inline":true},{"style":{"fontStyle":"italic"},"text":", the Q-value function ","element":"span"},{"style":{"height":16.99},"width":57.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-5.png","element":"img","alt":" Q∆","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"learned by minimizing the uncertainty-aware loss (Eq","element":"span"},{"href":"#id-52","style":{"fontStyle":"italic"},"text":".7) ","element":"a"},{"style":{"fontStyle":"italic"},"text":"approaches the optimal ","element":"span"},{"style":{"height":14.18},"width":47.52,"height":35.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-6.png","element":"img","alt":" Q∗ ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"up to a small constant:","element":"span"}],[{"style":{"width":"59%"},"width":946,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-7.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-8.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the error rate related to the difference between the classical Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and the QDQ Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"fontStyle":"italic"},"text":", and 
","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-9.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is determined by the probability that ","element":"span"},{"style":{"height":22.59},"width":1584,"height":56.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-10.png","element":"img","alt":"maxa′{(1 − β))|Q(s′, a′)|1(a′ /∈U(Q)) + (1 − 1HQ(a′|s′))|Q(s′, a′)|1(a′∈U(Q))} = maxa′{(1 −","inline":true}],[{"style":{"width":"32%"},"width":520,"height":45,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-11.png","element":"img"}],[{"text":"Before the proof, we redefine some notation to make the subsequent exposition clearer.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":13.2},"width":49.12,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-12.png","element":"img","alt":" SD","inline":true,"padRight":true},{"text":"be the state space defined by the distribution of the dataset ","element":"span"},{"style":{"height":14},"width":108.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-13.png","element":"img","alt":" D, AD","inline":true,"padRight":true},{"text":"is the action space defined by the dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":", and actions outside this space are OOD actions. 
The optimal Q-value ","element":"span"},{"style":{"height":14.8},"width":266.92,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-14.png","element":"img","alt":" Q∗ on SD × AD","inline":true,"padRight":true},{"text":"can be derived by solving the following Bellman equation:","element":"span"}],[{"style":{"width":"93%"},"width":1478,"height":126,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-15.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":507.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-16.png","element":"img","alt":" (s, a) ∈ SD × AD, Q∗ = BQ∗.","inline":true}],[{"text":"We first introduce Lemma ","element":"span"},{"href":"#id-101","text":"E.1 ","element":"a"},{"text":"to facilitate the proof of Theorem ","element":"span"},{"href":"#id-24","text":"4.4.","element":"a"}],[{"id":"id-101","style":{"fontWeight":"bold"},"text":"Lemma E.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"For any ","element":"span"},{"style":{"height":14.8},"width":263.48,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-17.png","element":"img","alt":" s ∈ SD, a ∈ AD","inline":true},{"style":{"fontStyle":"italic"},"text":", with probability ","element":"span"},{"style":{"height":14.4},"width":100.84,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-18.png","element":"img","alt":" 1 − η,","inline":true}],[{"style":{"width":"79%"},"width":1264,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-19.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"and with probability ","element":"span"},{"style":{"height":10.4},"width":32.2,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-20.png","element":"img","alt":" 
η,","inline":true}],[{"style":{"width":"85%"},"width":1348,"height":93,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/21-21.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"Proof of Lemma ","element":"span"},{"href":"#id-101","style":{"fontStyle":"italic"},"text":"E.1.","element":"a"}],[{"text":"Direct computation shows that","element":"span"}],[{"style":{"width":"23%"},"width":371,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-0.png","element":"img"}],[{"style":{"height":22.26},"width":1716.36,"height":55.64,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-1.png","element":"img","alt":"=|r(s, a) + γEs′∼PD(s′){maxa′ Q(s′, a′)} − r(s, a) − γEs′∼PD(s′){maxa′ [αQ(s′, a′) + (1 − α)QL(s′, a′)]}|","inline":true}],[{"style":{"height":22.29},"width":1092.4,"height":55.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-2.png","element":"img","alt":"≤γEs′∼PD(s′)| maxa′ {Q(s′, a′) − [αQ(s′, a′) + (1 − α)QL(s′, a′)]}|","inline":true}],[{"style":{"height":22.29},"width":1092.4,"height":55.72,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-3.png","element":"img","alt":"≤γEs′∼PD(s′)| maxa′ {Q(s′, a′) − [αQ(s′, a′) + (1 − α)QL(s′, a′)]}|","inline":true}],[{"style":{"height":22.27},"width":1059.2,"height":55.68,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-4.png","element":"img","alt":"≤γEs′∼PD(s′) maxa′ |Q(s′, a′) − [αQ(s′, a′) + (1 − α)QL(s′, a′)]|","inline":true}],[{"style":{"width":"100%"},"width":1752,"height":354,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-5.png","element":"img"}],[{"text":"Then with probability ","element":"span"},{"style":{"height":14.4},"width":99.88,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-6.png","element":"img","alt":" 1 − 
η,","inline":true}],[{"style":{"width":"88%"},"width":1401,"height":119,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-7.png","element":"img"}],[{"text":"and with probability ","element":"span"},{"style":{"height":10.4},"width":31.2,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-8.png","element":"img","alt":" η,","inline":true}],[{"style":{"width":"89%"},"width":1420,"height":201,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-9.png","element":"img"}],[{"text":"This completes the proof.","element":"span"}],[{"text":"Next, we give a brief proof of Theorem ","element":"span"},{"href":"#id-24","text":"4.4 ","element":"a"},{"text":"using the results of Theorem ","element":"span"},{"href":"#id-23","text":"4.3 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-101","text":"E.1. ","element":"a"},{"text":"Suppose the stationary point (the optimal Q-value) derived from the QDQ Bellman operator is ","element":"span"},{"style":{"height":16.99},"width":57.52,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-10.png","element":"img","alt":"Q∆","inline":true},{"text":", which satisfies ","element":"span"},{"style":{"height":16.99},"width":223.6,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-11.png","element":"img","alt":" Q∆ = FQ∆.","inline":true}],[{"style":{"width":"84%"},"width":1333,"height":390,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-12.png","element":"img"}],[{"text":"Setting ","element":"span"},{"style":{"height":17.38},"width":721.16,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-13.png","element":"img","alt":" ϵ = γ(1 − α)(1 − β)(1 − cγ)−1||Q∗(·, ·)||∞","inline":true,"padRight":true},{"text":"completes the proof.","element":"span"}],[{"id":"id-103","style":{"fontStyle":"italic"},"text":"Remark 
","element":"span"},{"text":"E.1","element":"span"},{"style":{"fontStyle":"italic"},"text":". ","element":"span"},{"text":"The error ","element":"span"},{"style":{"height":17.39},"width":721.04,"height":43.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-14.png","element":"img","alt":" ϵ = γ(1 − α)(1 − β)(1 − cγ)−1||Q∗(·, ·)||∞","inline":true,"padRight":true},{"text":"can be 0 by setting ","element":"span"},{"style":{"height":14.4},"width":163.36,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-15.png","element":"img","alt":" β = 1 . In","inline":true,"padRight":true},{"text":"practice, we can ensure that ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-16.png","element":"img","alt":" ϵ","inline":true,"padRight":true},{"text":"converges to a small value by appropriately adjusting the parameters ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-17.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.4},"width":34.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-18.png","element":"img","alt":" β.","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"E.2","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"Our primary focus lies on the scenario where ","element":"span"},{"style":{"height":17.68},"width":637.48,"height":44.2,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-19.png","element":"img","alt":" maxa′{(1 − β))|Q(s′, a′)|1(a′ /∈U(Q)) +","inline":true},{"style":{"height":22.58},"width":1297.56,"height":56.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-20.png","element":"img","alt":"(1 − 1HQ(a′|s′))|Q(s′, a′)|1(a′∈U(Q))} = maxa′{(1 − β))|Q(s′, a′)|1(a′ /∈U(Q))}","inline":true},{"text":". This preference ","element":"span"},{"text":"stems from the potential minuteness of ","element":"span"},{"style":{"height":10.4},"width":20,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-21.png","element":"img","alt":" η","inline":true},{"text":". Firstly, we can rely on Theorem 1 in [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"] to ensure that the consistency model can converge to ground truth, thus guaranteeing the fidelity of the learned Q-distribution. Consequently, the accuracy of our uncertainty estimation is upheld, enabling us to effectively assess OOD points and pessimistically adjust the Q-value for such occurrences [","element":"span"},{"href":"#id-4","referenceIndex":4,"text":"4","element":"a"},{"text":"]. Secondly, in order to mitigate segmentation errors of the uncertainty set, we introduce the penalty factor ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-22.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"in Eq","element":"span"},{"href":"#id-53","text":".8. 
","element":"a"},{"text":"Finally, since actions in the set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":") ","element":"span"},{"text":"are always penalized, the Q-value always takes a lower value when ","element":"span"},{"style":{"height":16},"width":172.84,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-23.png","element":"img","alt":" a′ ∈ U(Q)","inline":true},{"text":". Collectively, these measures reinforce our aim to maintain a small ","element":"span"},{"style":{"height":10.4},"width":31.2,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/22-24.png","element":"img","alt":" η.","inline":true,"padRight":true},{"text":"Theorem ","element":"span"},{"href":"#id-24","text":"4.4 ","element":"a"},{"text":"shows the optimal Q-value by QDQ can closely approximate the true optimal Q-value ","element":"span"},{"style":{"height":14.8},"width":295.88,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-0.png","element":"img","alt":"Q∗ over SD × AD","inline":true},{"text":". 
We also provide the following corollary to show that substituting the optimal Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"introduces a controllable error at each step, and give a step-wise analysis of the convergence of the QDQ operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":".","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":16},"width":550.24,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-1.png","element":"img","alt":" ζk(s, a) = |Qk(s, a) − Q∗(s, a)|","inline":true,"padRight":true},{"text":"be the total estimation error between the Q-value learned by the QDQ algorithm and the optimal in-distribution Q-value at step k of value iteration. Let ","element":"span"},{"style":{"height":16},"width":171.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-2.png","element":"img","alt":" δk(s, a) =","inline":true},{"style":{"height":17.38},"width":391.04,"height":43.44,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-3.png","element":"img","alt":"|Qk(s, a) − FQk(s, a)|","inline":true,"padRight":true},{"text":"be the Bellman residual induced by the QDQ Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"at step k. 
Let ","element":"span"},{"style":{"height":17.9},"width":556.68,"height":44.76,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-4.png","element":"img","alt":" δ∗k(s, a) = |Qk(s, a) − BQk(s, a)|","inline":true,"padRight":true},{"text":"be the Bellman residual for the optimal in-distribution ","element":"span"},{"text":"Q-value optimization.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"At step k of value iteration, substituting the optimal Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"B","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"introduces an arbitrarily small error ","element":"span"},{"style":{"height":14},"width":32.24,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-5.png","element":"img","alt":" ξ:","inline":true}],[{"id":"id-102","style":{"width":"100%"},"width":1586,"height":240,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-6.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"With probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"η","element":"span"},{"style":{"fontStyle":"italic"},"text":",","element":"span"}],[{"style":{"width":"100%"},"width":1586,"height":351,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-7.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-101","text":"E.1, ","element":"a"},{"text":"with probability ","element":"span"},{"style":{"height":14.4},"width":99.88,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-8.png","element":"img","alt":" 1 − 
η,","inline":true}],[{"style":{"width":"81%"},"width":1287,"height":221,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-9.png","element":"img"}],[{"text":"With probability ","element":"span"},{"style":{"fontStyle":"italic"},"text":"η","element":"span"},{"text":",","element":"span"}],[{"style":{"width":"81%"},"width":1290,"height":277,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-10.png","element":"img"}],[{"text":"This completes the proof.","element":"span"}],[{"style":{"width":"100%"},"width":1891,"height":694,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/23-11.png","element":"img"}],[{"text":"By Lemma ","element":"span"},{"href":"#id-101","text":"E.1 ","element":"a"},{"text":"and Lemma ","element":"span"},{"href":"#id-102","text":"E.2, ","element":"a"},{"text":"we can easily derive, with probability ","element":"span"},{"style":{"height":14.4},"width":99.84,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-0.png","element":"img","alt":" 1 − η,","inline":true}],[{"style":{"width":"93%"},"width":1484,"height":59,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-1.png","element":"img"}],[{"text":"Setting ","element":"span"},{"style":{"height":16},"width":563.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-2.png","element":"img","alt":" ξ = γ(1 − α)(1 − β)||Q(s′, a′)||∞","inline":true},{"text":" completes the proof.","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Remark ","element":"span"},{"text":"E.3","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
","element":"span"},{"text":"During value iteration, two errors accumulate: ","element":"span"},{"style":{"height":16.51},"width":126.48,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-3.png","element":"img","alt":" δ∗k(s, a)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":18,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-4.png","element":"img","alt":" ξ","inline":true},{"text":". Given the ","element":"span"},{"text":"convergence of optimal Q-value iteration, ","element":"span"},{"style":{"height":16.51},"width":126.48,"height":41.28,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-5.png","element":"img","alt":" δ∗k(s, a)","inline":true,"padRight":true},{"text":"tends to approach an arbitrarily small value and may ","element":"span"},{"text":"even converge to 0 under optimal circumstances. The error ","element":"span"},{"style":{"height":16},"width":666.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-6.png","element":"img","alt":" ξ = γ(1−α)(1−β)||Q(s′, a′)||∞ induced","inline":true,"padRight":true},{"text":"by using the Bellman operator ","element":"span"},{"style":{"fontStyle":"italic"},"text":"F","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"instead of the optimal Bellman operator can also approach 0, as discussed in Remark ","element":"span"},{"href":"#id-103","text":"E.1.","element":"a"}]]},{"heading":"F Gap expanding of the QDQ algorithm.","paragraphs":[[{"text":"In this section we show that QDQ also has a gap-expanding property, as discussed in CQL ","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"[3]","element":"a"},{"text":". ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Proposition F.1. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The QDQ algorithm is gap expanding between in-distribution Q-values and out-of-distribution Q-values.","element":"span"}],[{"text":"(1) If ","element":"span"},{"style":{"height":16},"width":161.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-7.png","element":"img","alt":" a /∈ U(Q)","inline":true,"padRight":true},{"text":"and is indeed an in-distribution action, the target Q-value for the in-distribution point is ","element":"span"},{"style":{"height":16},"width":287.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-8.png","element":"img","alt":" (α + (1 − α)β)Q","inline":true},{"text":". The target Q-value for the OOD point is ","element":"span"},{"style":{"height":24.83},"width":354.88,"height":62.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-9.png","element":"img","alt":" (α + (1−α)HQ(a|s))Q. And","inline":true}],[{"style":{"height":24.83},"width":672.52,"height":62.08,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-10.png","element":"img","alt":"(α + (1 − α)β)Q − (α + (1−α)HQ(a|s))Q > 0","inline":true,"padRight":true},{"text":"always holds. This suggests that the Q-value will prefer ","element":"span"},{"text":"in-distribution actions.","element":"span"}],[{"text":"(2) If ","element":"span"},{"style":{"height":16},"width":161.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-11.png","element":"img","alt":" a /∈ U(Q)","inline":true,"padRight":true},{"text":"but is in fact an OOD action, the uncertainty set has misclassified it. 
In such cases, the penalty parameter ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-12.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is introduced to penalize the Q-value, resulting in ","element":"span"},{"style":{"height":16},"width":351.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-13.png","element":"img","alt":" (α+(1−α)β)Q < Q","inline":true},{"text":". Consequently, misclassifications of OOD actions can be handled in a pessimistic manner.","element":"span"}],[{"text":"While introducing the penalty parameter ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-14.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"may slightly slow down optimization updates for true in-distribution actions, it is crucial to prioritize control over out-of-distribution (OOD) actions due to the potentially severe consequences of exploration errors. In balancing the optimization of Q-learning with a pessimistic approach to OOD actions, the focus should be on reducing the impact of OOD actions. In fact, compared to previous pessimistic Q-value methods, QDQ offers greater flexibility in managing these values. 
This includes incorporating an uncertainty-aware learning objective and adjusting Q-values based on their uncertainty.","element":"span"}]]},{"heading":"G Experiment Details and More Results","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"G.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Real Q dataset generation.","element":"span"}],[{"text":"In Section ","element":"span"},{"href":"#id-86","text":"3.1, ","element":"a"},{"text":"we use a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":"-step sliding window approach with a length of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"to traverse each trajectory in the dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"and generate the Q-value dataset ","element":"span"},{"style":{"height":15.6},"width":55.72,"height":39,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-15.png","element":"img","alt":" DQ","inline":true,"padRight":true},{"text":"based on Equation ","element":"span"},{"href":"#id-45","text":"5. ","element":"a"},{"text":"A good Q-value dataset for uncertainty estimation should cover a broad state and action space to accurately characterize the Q distribution and detect high uncertainty in the OOD action region. This coverage helps identify actions with significant Q-value uncertainty in OOD areas.","element":"span"}],[{"text":"Choosing the right value for the sliding step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"and the window length ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is crucial. A large value for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"may result in a smaller Q dataset, while a small value may lead to excessive homogenization of the Q dataset. 
Both situations can negatively impact the learning of the Q-value distribution. If the Q-value dataset is too small, the learned Q distribution may not generalize well across the state-action space ","element":"span"},{"style":{"height":12.4},"width":107.72,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/24-16.png","element":"img","alt":" S × A","inline":true},{"text":". Conversely, if the Q-values are too homogeneous, it can hinder feature learning by the distribution learner.","element":"span"}],[{"text":"To illustrate this, Figure ","element":"span"},{"href":"#id-104","text":"G.1 ","element":"a"},{"text":"shows the Q dataset distributions obtained for different values of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"using the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"halfcheetah-medium ","element":"span"},{"text":"dataset. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"text":", the Q distribution is more concentrated, resulting in more homogeneous Q-values. In contrast, with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 50","element":"span"},{"text":", the Q distribution becomes sparser, showing a greater inclination towards individual features.","element":"span"}],[{"text":"In the experiments, we consider various factors when setting the value of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", including the trajectory length, the width of the sliding window, and the derived Q-value dataset size. Throughout all experiments, we set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"to 10. 
Interestingly, we observed that the distribution of Q-values remains ","element":"span"},{"text":"robust to minor adjustments in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k","element":"span"},{"text":", indicating that the choice of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"does not require stringent tuning.","element":"span"}],[{"style":{"width":"98%"},"width":1569,"height":406,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/25-0.png","element":"img"}],[{"text":"Figure G.1: The derived Q-value distribution when using different sliding steps and the same window width to scan over the trajectories of the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"halfcheetah-medium ","element":"figcaption","subtype":"caption"},{"id":"id-104","text":"dataset. The width of the sliding window is ","element":"figcaption","subtype":"caption"},{"text":"set to 200. The Q-value is scaled to facilitate comparison.","element":"figcaption","subtype":"caption"}],[{"text":"When choosing the width of the sliding window, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":", we must consider factors such as trajectory length and the resulting size of the Q-value dataset. When ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"increases, the size of the Q-value dataset decreases. However, if ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"is too small, it might truncate essential information from the true Q-value. 
We give the experimental analysis of Q-value distribution for different ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"on the sparse reward task Antmaze-medium-play in Figure ","element":"span"},{"href":"#id-105","text":"G.2.","element":"a"}],[{"id":"id-105","style":{"width":"80%"},"width":1283,"height":349,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/25-1.png","element":"img"}],[{"text":"Figure G.2: The Q distribution of the Antmaze-medium-play dataset with varying sliding window widths (100 to 300 steps) is shown in the figure. Widening the sliding window does not change the shape of the Q distribution, even though a larger window covers more information for this sparse reward task with many short trajectories. Instead, enlarging the sliding window decreases the Q value and compresses the size of the derived Q data.","element":"figcaption","subtype":"caption"}],[{"text":"In our experiments, we opted for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= 200 ","element":"span"},{"text":"across all tasks. This choice considers factors such as the maximum trajectory length (1000), the decay rate of ","element":"span"},{"style":{"height":16.99},"width":92.04,"height":42.48,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/25-2.png","element":"img","alt":" γm−1","inline":true,"padRight":true},{"text":"in Eq","element":"span"},{"href":"#id-45","text":".5. ","element":"a"},{"text":"Based on the analysis provided in Section ","element":"span"},{"text":"B, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"does not need to be very large. 
We also found that minor adjustments to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"do not significantly affect the Q-value distribution, indicating that strict tuning of this parameter is unnecessary.","element":"span"}],[{"id":"id-41","style":{"fontWeight":"bold"},"text":"G.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"The distribution of Q-value function.","element":"span"}],[{"text":"The Q-distributions based on the truncated Q-value and the learned distribution from the consistency model are shown in Figure ","element":"span"},{"href":"#id-106","text":"G.3 ","element":"a"},{"text":"for Gym-MuJoCo tasks and in Figure ","element":"span"},{"href":"#id-107","text":"G.5 ","element":"a"},{"text":"for Antmaze tasks. In Figure ","element":"span"},{"href":"#id-106","text":"G.3, ","element":"a"},{"text":"the consistency model roughly captures the main characteristics of the true Q-value distribution. However, for the Antmaze task (Figure ","element":"span"},{"href":"#id-107","text":"G.5)","element":"a"},{"text":", the learned distribution shows some fluctuations and a slightly wider support compared to the true Q-value distribution, particularly in the dynamic goal task (\"-diverse\" task).","element":"span"}],[{"text":"From the Antmaze distributions we observe that trajectories for these tasks mainly consist of suboptimal or failed experiences, which underscores the challenging nature of the Antmaze task. 
","element":"span"},{"text":"Furthermore, the narrower data distribution makes it easier to take OOD actions during offline training, and the scarcity of positive experiences limits the optimisation of Q; both can lead to failure on these kinds of tasks.","element":"span"}],[{"style":{"width":"91%"},"width":1458,"height":1107,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/26-0.png","element":"img"}],[{"text":"Figure G.3: ","element":"figcaption","subtype":"caption"},{"id":"id-106","text":"The Q-value distribution based on the truncated Q-value vs. the Q-value ","element":"figcaption","subtype":"caption"},{"text":"distribution sampled from the learned consistency model for Gym-MuJoCo tasks.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"92%"},"width":1469,"height":738,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/26-1.png","element":"img"}],[{"text":"Figure G.4: ","element":"figcaption","subtype":"caption"},{"text":"The Q-value distribution based on the truncated Q-value vs. the Q-value distribution sampled from the learned consistency model for Antmaze tasks.","element":"figcaption","subtype":"caption"}],[{"id":"id-42","style":{"fontWeight":"bold"},"text":"G.3 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Efficiency of the uncertainty measure.","element":"span"}],[{"text":"The uncertainty measure is crucial for guiding the Q-value towards a pessimistic regime within the QDQ algorithm. In this section, we first verify how uncertainty can assess the overestimation of Q-values in OOD regions. 
Then, we discuss how the uncertainty set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":") ","element":"span"},{"text":"can be shaped using the hyperparameter ","element":"span"},{"style":{"height":14.4},"width":34.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/27-0.png","element":"img","alt":" β.","inline":true}],[{"text":"To understand the differences in uncertainty between in-distribution actions and OOD actions from a random policy, we compare their distributions. In the left graph of Figure ","element":"span"},{"href":"#id-107","text":"G.5, ","element":"a"},{"text":"the red line represents the 95% quantile of the standard deviation of sampled Q-values from the learned Q-distribution for in-distribution actions. In the right graph, the same value corresponds to approximately the 75% quantile of the standard deviation for OOD actions. This indicates that OOD actions contribute to a heavy-tailed distribution of Q-value uncertainty, resulting in larger values than for in-distribution actions.","element":"span"}],[{"text":"Figure ","element":"span"},{"href":"#id-108","text":"G.6 ","element":"a"},{"text":"further illustrates this, showing that the standard deviation of sampled Q-values from the learned policy sharply increases when Q-values are overestimated and become uncontrollable. These observations suggest that the uncertainty measure in the QDQ algorithm effectively captures the overestimation phenomenon in OOD actions.","element":"span"}],[{"style":{"width":"93%"},"width":1486,"height":531,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/27-1.png","element":"img"}],[{"id":"id-107","text":"Figure G.5: The uncertainty distribution of the Q-value for in-distribution actions vs. 
the uncertainty ","element":"figcaption","subtype":"caption"},{"text":"distribution of the Q-value for OOD actions (from a random policy).","element":"figcaption","subtype":"caption"}],[{"style":{"width":"97%"},"width":1545,"height":554,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/27-2.png","element":"img"}],[{"id":"id-108","text":"Figure G.6: The Q-value learned by QDQ (left) and the standard deviation of the sampled Q-values ","element":"figcaption","subtype":"caption"},{"text":"from the consistency model for the same state-action pairs.","element":"figcaption","subtype":"caption"}],[{"text":"Table G.1: The hyperparameters used in Flow-GAN training.","element":"figcaption","subtype":"caption"}],[{"id":"id-111","style":{"width":"92%"},"width":1460,"height":1085,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-0.png","element":"img"}],[{"text":"The uncertainty set ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":") ","element":"span"},{"text":"is derived using the upper ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-1.png","element":"img","alt":" β","inline":true},{"text":"-quantile of the uncertainty values of Q over the actions taken by the learning policy. In Section ","element":"span"},{"href":"#id-109","text":"5.2, ","element":"a"},{"text":"we provide a brief discussion on determining the appropriate ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"during experiments. 
A higher ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-3.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"allows a more lenient treatment of OOD actions, which suits tasks that have a wide data distribution or are robust to action changes. Conversely, a lower ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-4.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"imposes tighter control over OOD actions, which suits tasks that have a narrow data distribution or are sensitive to action changes. During experiments, a rough starting point for setting ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-5.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"can be obtained by comparing the quantiles of the uncertainty distribution based on in-distribution actions and OOD actions. For instance, as shown in Figure ","element":"span"},{"href":"#id-107","text":"G.5, ","element":"a"},{"text":"parameter tuning might begin with ","element":"span"},{"style":{"height":14.8},"width":360.36,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-6.png","element":"img","alt":" β = 0.75 or β = 0.80.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"G.4 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Implementation details for the QDQ algorithm.","element":"span"}],[{"text":"The implementation of the QDQ algorithm consists of consistency distillation for the consistency model and offline RL training. The whole algorithm is implemented in JAX [","element":"span"},{"href":"#id-110","referenceIndex":67,"text":"67","element":"a"},{"text":"]. 
Consistency distillation and offline RL are trained independently: we first train the consistency model by consistency distillation and save the converged model. We then use the pretrained consistency model as the Q-value distribution sampler during offline RL training; see Algorithm ","element":"span"},{"href":"#id-56","text":"1 ","element":"a"},{"text":"for the whole training process.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Consistency Distillation. ","element":"span"},{"text":"The consistency model is trained using a pretrained diffusion model [","element":"span"},{"href":"#id-38","referenceIndex":33,"text":"33","element":"a"},{"text":"]. The training process follows the official implementation ","element":"span"},{"text":"4 ","element":"span"},{"text":"of the consistency model [","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"20","element":"a"},{"text":"]. Since the consistency model is designed for image data, we modified the initial architecture, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"NCSN++ model ","element":"span"},{"text":"[","element":"span"},{"href":"#id-38","referenceIndex":33,"text":"33","element":"a"},{"text":"], to better fit offline RL data. For instance, we replaced the U-Net architecture with a multilayer perceptron (MLP) that has three hidden layers, each with 256 units. This simplified architecture is used to learn the consistency function ","element":"span"},{"style":{"height":16},"width":143,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/28-7.png","element":"img","alt":" fθ(xt, t)","inline":true},{"text":". 
The main hyperparameters for the consistency distillation are shown in Table ","element":"span"},{"href":"#id-111","text":"G.1, ","element":"a"},{"text":"covering both diffusion model training and consistency model training.","element":"span"}],[{"text":"Table G.2: The hyperparameters used in Actor-Critic training.","element":"figcaption","subtype":"caption"}],[{"id":"id-112","style":{"width":"100%"},"width":1822,"height":1165,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/29-0.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Offline RL training. ","element":"span"},{"text":"The offline RL training follows TD3 [","element":"span"},{"href":"#id-62","referenceIndex":46,"text":"46","element":"a"},{"text":"], which has a delayed update schedule for both the target Q network and the target policy network. For the Gym-MuJoCo tasks, we use the raw dataset without any preprocessing, such as state normalization or reward adjustment. For the AntMaze tasks, we apply the same reward tuning as IQL [","element":"span"},{"href":"#id-10","referenceIndex":10,"text":"10","element":"a"},{"text":"], with no additional preprocessing for the state. The hyperparameters used in offline RL training are shown in Table ","element":"span"},{"href":"#id-112","text":"G.2.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"G.5 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Learning Curve","element":"span"}],[{"text":"The learning curves of the Gym-MuJoCo tasks are shown in Figure ","element":"span"},{"href":"#id-113","text":"G.7. 
","element":"a"},{"text":"The learning curves of the AntMaze tasks are shown in Figure ","element":"span"},{"href":"#id-114","text":"G.8.","element":"a"}],[{"id":"id-115","style":{"fontWeight":"bold"},"text":"G.6 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Computational efficiency of QDQ","element":"span"}],[{"text":"We provide a detailed discussion of the computational efficiency of the proposed QDQ algorithm, focusing on the training cost of the consistency model and the computational efficiency of QDQ (the distribution-based bootstrap method) compared with SOTA uncertainty estimation methods based on Q-value ensembles.","element":"span"}],[{"text":"Regarding the training cost of the consistency model, we believe it is nearly negligible. Training a diffusion model on a 4090 GPU takes about 5.2 minutes, while training a consistency model using this pretrained diffusion model takes around 16 minutes. Additionally, the consistency model can be stored and reused for subsequent RL experiments, eliminating the need for retraining.","element":"span"}],[{"text":"For a comparison of the computational costs of QDQ with SOTA uncertainty estimation methods in offline RL, see Table ","element":"span"},{"href":"#id-115","text":"G.6. ","element":"a"},{"text":"As mentioned earlier, QDQ achieves a significantly faster training speed compared to the ensemble-based uncertainty estimation method EDAC and other SOTA methods. The results for other methods are taken from Table 3 of EDAC ","element":"span"},{"href":"#id-9","referenceIndex":9,"text":"[9]","element":"a"},{"text":".","element":"span"}],[{"style":{"width":"99%"},"width":1584,"height":1312,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-0.png","element":"img"}],[{"text":"Figure G.7: Training curves of different MuJoCo tasks. All results are averaged across 5 random seeds. 
The evaluation interval is 5000 with evaluation episode length 10.","element":"figcaption","subtype":"caption"}],[{"text":"Table G.3: Computational performance of QDQ and other SOTA methods.","element":"figcaption","subtype":"caption"}],[{"id":"id-113","style":{"width":"50%"},"width":800,"height":265,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-1.png","element":"img"}],[{"id":"id-65","style":{"fontWeight":"bold"},"text":"G.7 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Ablations","element":"span"}],[{"text":"Although QDQ has three hyperparameters (","element":"span"},{"style":{"height":14.8},"width":179.2,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-2.png","element":"img","alt":"α, β, and γ","inline":true},{"text":") for flexibility, the tuning process is straightforward. For example, as discussed in Theorem 4.4 (Appendix ","element":"span"},{"text":"E)","element":"span"},{"text":", theoretically, ","element":"span"},{"style":{"height":16},"width":340.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-3.png","element":"img","alt":" (1−α)(1−β) should","inline":true,"padRight":true},{"text":"be small. Since ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-4.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"controls the size of the uncertainty set and requires flexibility across different tasks, we typically set ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-5.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"close to 1, tuning it between 0.9 and 0.995. This only requires a few experiments to find the optimal value. 
Our tuning process fixes the parameters sequentially, first selecting the best ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-6.png","element":"img","alt":" α","inline":true},{"text":", then ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-7.png","element":"img","alt":" β","inline":true},{"text":", and finally ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/30-8.png","element":"img","alt":" γ","inline":true},{"text":". Based on the characteristics of different datasets, we can set each parameter to an initial value close to its optimal value. QDQ offers evidence-based guidelines for","element":"span"}],[{"id":"id-114","style":{"width":"99%"},"width":1584,"height":1137,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-0.png","element":"img"}],[{"text":"Figure G.8: Training curves of different Antmaze tasks. All results are averaged across 5 random seeds. The evaluation interval is 100000 with evaluation episode length 100.","element":"figcaption","subtype":"caption"}],[{"text":"hyperparameter ranges, making tuning manageable. Additionally, QDQ’s efficiency (see Table ","element":"span"},{"href":"#id-115","text":"G.6) ","element":"a"},{"text":"reduces the tuning burden.","element":"span"}],[{"text":"In the ablation study, we choose four tasks that represent the different kinds of datasets we analyzed during the parameter study:","element":"span"}],[{"style":{"width":"49%"},"width":784,"height":212,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-1.png","element":"img"}],[{"text":"We perform an ablation study for the parameters discussed in Section ","element":"span"},{"href":"#id-109","text":"5.2. 
","element":"a"},{"text":"For each of the three parameters, we compare the performance of three different settings.","element":"span"}],[{"text":"The learning curves for four types of tasks with different uncertainty-aware weights ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-2.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"(Eq. ","element":"span"},{"href":"#id-52","text":"7) ","element":"a"},{"text":"are shown in Figure ","element":"span"},{"href":"#id-116","text":"G.9. ","element":"a"},{"text":"For the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"halfcheetah-medium-v2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hopper-medium-v2 ","element":"span"},{"text":"datasets, decreasing ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-3.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"harms performance. For the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"walker2d-medium-expert-v2 ","element":"span"},{"text":"dataset, a slightly lower ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-4.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"increases training volatility. In contrast, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"umaze-diverse-v0 ","element":"span"},{"text":"dataset shows high sensitivity to ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-5.png","element":"img","alt":" α","inline":true},{"text":". 
This sensitivity occurs because the Antmaze task requires combining different suboptimal trajectories, which challenges algorithms like QDQ that are not fully in-sample but in-support. For instance, an ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-6.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"value that is too high or too low can lead to overly optimistic or pessimistic Q-values, causing the algorithm to favor actions that do not align with the exact suboptimal trajectories.","element":"span"}],[{"text":"The ablation study of the uncertainty-related parameter ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-7.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is presented in Figure ","element":"span"},{"href":"#id-117","text":"G.11. ","element":"a"},{"text":"For the wide dataset ","element":"span"},{"style":{"fontStyle":"italic"},"text":"halfcheetah-medium-v2","element":"span"},{"text":", a higher ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/31-8.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"is preferred, indicating less control over the uncertainty","element":"span"}],[{"style":{"width":"99%"},"width":1584,"height":1312,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-0.png","element":"img"}],[{"text":"Figure G.9: Training curve of four different Tasks when using different ","element":"figcaption","subtype":"caption"},{"id":"id-116","style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-1.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"in Eq. ","element":"figcaption","subtype":"caption"},{"href":"#id-52","text":"7. 
","element":"a","subtype":"caption"},{"text":"All results are averaged across 5 random seeds. The evaluation interval is 5000 with evaluation episode length 10 for Mujoco tasks. The evaluation interval is 100000 with evaluation episode length 100 for Antmaze task.","element":"figcaption","subtype":"caption"}],[{"text":"penalty and a smaller uncertainty set. In contrast, the performance for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hopper-medium-v2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"umaze-diverse-v0 ","element":"span"},{"text":"datasets is sensitive to small changes in ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-2.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"due to their task sensitivity and narrow distribution. The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"walker2d-medium-expert-v2 ","element":"span"},{"text":"dataset, however, is robust to small changes in ","element":"span"},{"style":{"height":14.4},"width":34.64,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-3.png","element":"img","alt":" β.","inline":true}],[{"text":"We present the performance of different tasks for the entropy parameter ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-4.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"in Figure ","element":"span"},{"href":"#id-118","text":"G.12. 
","element":"a"},{"text":"For the wide distribution in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"halfcheetah-medium-v2","element":"span"},{"text":", a larger ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-5.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"negatively impacts performance. The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"hopper-medium-v2 ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"umaze-diverse-v0 ","element":"span"},{"text":"datasets also show sensitivity to the parameter ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-6.png","element":"img","alt":" γ","inline":true},{"text":". In the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"walker2d-medium-expert-v2 ","element":"span"},{"text":"dataset, a very small ","element":"span"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/32-7.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"may introduce volatility into the training process. However, the final result’s convergence is not significantly affected.","element":"span"}],[{"text":"Furthermore, as discussed in Section ","element":"span"},{"href":"#id-109","text":"5.2, ","element":"a"},{"text":"the gamma term primarily stabilizes the learning of a simple Gaussian policy, especially for action-sensitive and dataset-narrow tasks. We note that Gaussian policies are prone to sampling risky actions because they can only fit a single-mode policy. 
To verify the impact of uncertainty-aware Q-value optimization in QDQ, we compared the performance of Q-values without uncertainty control (using Bellman optimization like in online RL settings) to the QDQ algorithm on the action-sensitive hopper-medium dataset, using identical gamma settings. Figure ","element":"span"},{"style":{"fontWeight":"bold"},"text":"?? ","element":"span"},{"text":"shows that introducing an uncertainty-based constraint for the Q-value function in QDQ significantly improves training stability, convergence speed, and overall performance. This supports ","element":"span"},{"text":"the effectiveness of QDQ’s uncertainty-aware Q-value optimization. We believe that a stronger learning policy will further reduce the need for this stabilizing term.","element":"span"}],[{"style":{"width":"86%"},"width":1376,"height":964,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/33-0.png","element":"img"}],[{"text":"Figure G.10: The training curve of the hopper-medium dataset with QDQ’s uncertainty pessimistic Q learning and without Q-value adjustments is shown. The green curve indicates that the ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/33-1.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"term in Eq. 9 has a limited impact on performance. 
Comparing the learning curves, QDQ’s uncertainty pessimistic Q learning boosts performance, leading to faster convergence and greater stability.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1584,"height":1339,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/34-0.png","element":"img"}],[{"text":"Figure G.11: ","element":"figcaption","subtype":"caption"},{"text":"Training curve of four different Tasks when using different ","element":"figcaption","subtype":"caption"},{"id":"id-117","style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/34-1.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"in Section ","element":"figcaption","subtype":"caption"},{"href":"#id-119","text":"3.2. ","element":"a","subtype":"caption"},{"text":"All results are averaged across 5 random seeds. The evaluation interval is 5000 with evaluation episode length 10 for Mujoco tasks. The evaluation interval is 100000 with evaluation episode length 100 for Antmaze task.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"99%"},"width":1584,"height":1339,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/35-0.png","element":"img"}],[{"text":"Figure G.12: Training curve of four different Tasks when using different ","element":"figcaption","subtype":"caption"},{"id":"id-118","style":{"height":10.4},"width":22,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/35-1.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"in Eq. ","element":"figcaption","subtype":"caption"},{"href":"#id-54","text":"9. ","element":"a","subtype":"caption"},{"text":"All results are averaged across 5 random seeds. The evaluation interval is 5000 with evaluation episode length 10 for Mujoco tasks. 
The evaluation interval is 100000 with evaluation episode length 100 for Antmaze task.","element":"figcaption","subtype":"caption"}]]},{"heading":"NeurIPS Paper Checklist","paragraphs":[[{"text":"1. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Claims","element":"span"}],[{"text":"Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We explicitly discuss the contributions and research focus in the abstract and introduction sections. Guidelines:","element":"span"}],[{"text":"• The answer NA means that the abstract and introduction do not include the claims made in the paper.","element":"span"}],[{"text":"• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.","element":"span"}],[{"text":"• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.","element":"span"}],[{"text":"• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.","element":"span"}],[{"text":"2. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Limitations","element":"span"}],[{"text":"Question: Does the paper discuss the limitations of the work performed by the authors? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We discuss the limitations of this work in the experiment section. 
Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.","element":"span"}],[{"text":"• The authors are encouraged to create a separate \"Limitations\" section in their paper.","element":"span"}],[{"text":"• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.","element":"span"}],[{"text":"• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.","element":"span"}],[{"text":"• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.","element":"span"}],[{"text":"• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.","element":"span"}],[{"text":"• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.","element":"span"}],[{"text":"• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. 
The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.","element":"span"}],[{"text":"3. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Theory Assumptions and Proofs","element":"span"}],[{"text":"Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We provide the detailed assumptions and proofs for each theoretical result in Appendices ","element":"span"},{"text":"B-","element":"span"},{"text":"F.","element":"span"}],[{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include theoretical results. • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.","element":"span"}],[{"text":"• All assumptions should be clearly stated or referenced in the statement of any theorems.","element":"span"}],[{"text":"• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.","element":"span"}],[{"text":"• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.","element":"span"}],[{"text":"• Theorems and Lemmas that the proof relies upon should be properly referenced.","element":"span"}],[{"text":"4. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Result Reproducibility","element":"span"}],[{"text":"Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes]","element":"span"}],[{"text":"Justification: We give the detailed implementation of the experiments in the experiment section and Appendix ","element":"span"},{"text":"G.","element":"span"}],[{"text":"Guidelines: • The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.","element":"span"}],[{"text":"• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.","element":"span"}],[{"text":"• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.","element":"span"}],[{"text":"• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.","element":"span"}],[{"text":"(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.","element":"span"}],[{"text":"(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).","element":"span"}],[{"text":"(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.","element":"span"}],[{"style":{"width":"34%"},"width":546,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/neurips/95452/images/37-0.png","element":"img"}],[{"text":"Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? 
Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments requiring code.","element":"span"}],[{"text":"• Please see the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).","element":"span"}],[{"text":"• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (","element":"span"},{"href":"https://nips.cc/public/guides/CodeSubmissionPolicy","style":{"fontFamily":"monospace"},"text":"https://nips.cc/public/guides/CodeSubmissionPolicy","element":"a"},{"text":") for more details.","element":"span"}],[{"text":"• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.","element":"span"}],[{"text":"• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. 
If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.","element":"span"}],[{"text":"• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).","element":"span"}],[{"text":"• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.","element":"span"}],[{"text":"6. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experimental Setting/Details","element":"span"}],[{"text":"Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: We give the detailed implementation of the experiments in Appendix ","element":"span"},{"text":"G. ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.","element":"span"}],[{"text":"• The full details can be provided either with the code, in appendix, or as supplemental material.","element":"span"}],[{"text":"7. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiment Statistical Significance","element":"span"}],[{"text":"Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? 
Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The authors should answer \"Yes\" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.","element":"span"}],[{"text":"• The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).","element":"span"}],[{"text":"• The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)","element":"span"}],[{"text":"• The assumptions made should be given (e.g., Normally distributed errors).","element":"span"}],[{"text":"• It should be clear whether the error bar is the standard deviation or the standard error of the mean.","element":"span"}],[{"text":"• It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.","element":"span"}],[{"text":"• For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).","element":"span"}],[{"text":"• If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.","element":"span"}],[{"text":"8. 
","element":"span"},{"style":{"fontWeight":"bold"},"text":"Experiments Compute Resources","element":"span"}],[{"text":"Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not include experiments.","element":"span"}],[{"text":"• The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.","element":"span"}],[{"text":"• The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.","element":"span"}],[{"text":"• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).","element":"span"}],[{"text":"9. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Code Of Ethics","element":"span"}],[{"text":"Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics ","element":"span"},{"href":"https://neurips.cc/public/EthicsGuidelines","style":{"fontFamily":"monospace"},"text":"https://neurips.cc/public/EthicsGuidelines","element":"a"},{"text":"?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. 
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.","element":"span"}],[{"text":"• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).","element":"span"}],[{"text":"10. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Broader Impacts","element":"span"}],[{"text":"Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that there is no societal impact of the work performed.","element":"span"}],[{"text":"• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.","element":"span"}],[{"text":"• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.","element":"span"}],[{"text":"• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. 
On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.","element":"span"}],[{"text":"• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.","element":"span"}],[{"text":"• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).","element":"span"}],[{"text":"11. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Safeguards","element":"span"}],[{"text":"Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper poses no such risks.","element":"span"}],[{"text":"• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.","element":"span"}],[{"text":"• Datasets that have been scraped from the Internet could pose safety risks. 
The authors should describe how they avoided releasing unsafe images.","element":"span"}],[{"text":"• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.","element":"span"}],[{"text":"12. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Licenses for existing assets","element":"span"}],[{"text":"Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?","element":"span"}],[{"text":"Answer: ","element":"span"},{"text":"[Yes] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not use existing assets.","element":"span"}],[{"text":"• The authors should cite the original paper that produced the code package or dataset.","element":"span"}],[{"text":"• The authors should state which version of the asset is used and, if possible, include a URL.","element":"span"}],[{"text":"• The name of the license (e.g., CC-BY 4.0) should be included for each asset.","element":"span"}],[{"text":"• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.","element":"span"}],[{"text":"• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"paperswithcode.com/datasets ","element":"span"},{"text":"has curated licenses for some datasets. 
Their licensing guide can help determine the license of a dataset.","element":"span"}],[{"text":"• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.","element":"span"}],[{"text":"• If this information is not available online, the authors are encouraged to reach out to the asset’s creators.","element":"span"}],[{"text":"13. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"New Assets","element":"span"}],[{"text":"Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not release new assets.","element":"span"}],[{"text":"• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.","element":"span"}],[{"text":"• The paper should discuss whether and how consent was obtained from people whose asset is used.","element":"span"}],[{"text":"• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.","element":"span"}],[{"text":"14. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Crowdsourcing and Research with Human Subjects","element":"span"}],[{"text":"Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? 
Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.","element":"span"}],[{"text":"• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.","element":"span"}],[{"text":"15. ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects ","element":"span"},{"text":"Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: ","element":"span"},{"text":"[NA] ","element":"span"},{"text":"Justification: ","element":"span"},{"style":{"fontWeight":"bold"},"text":"[TODO] ","element":"span"},{"text":"Guidelines:","element":"span"}],[{"text":"• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.","element":"span"}],[{"text":"• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. 
If you obtained IRB approval, you should clearly state this in the paper.","element":"span"}],[{"text":"• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.","element":"span"}],[{"text":"• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.","element":"span"}]]}],"_version":"3.3.4"},"paperNode":"$28:props:children:props:children:0:props:product"}]]