37:[["$","audio",null,{"id":"tts"}],["$","$L3b",null,{"paperID":"2002.11611","publisher":"arxiv","paperJSON":{"title":"Online Learning in Contextual Bandits using Gated Linear Networks","paperID":"2002.11611","avgLineHeight":10.92,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"We introduce a new and completely online contextual bandit algorithm called Gated Linear Contextual Bandits (GLCB). This algorithm is based on Gated Linear Networks (GLNs), a recently introduced deep learning architecture with properties well-suited to the online setting. Leveraging data-dependent gating properties of the GLN we are able to estimate prediction uncertainty with effectively zero algorithmic overhead. We empirically evaluate GLCB compared to 9 state-of-the-art algorithms that leverage deep neural networks, on a standard benchmark suite of discrete and continuous contextual bandit problems. GLCB obtains mean first-place despite being the only online method, and we further support these results with a theoretical study of its convergence properties.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"The contextual bandit setting has been a focus of much recent attention, benefiting from both being sufficiently constrained as to be theoretically tractable, yet broad enough to capture many different types of real world applications such as recommendation systems. The linear contextual bandit problem in particular has been subject to intense theoretical investigation; the recent book by [","element":"span"},{"href":"#id-0","referenceIndex":1,"text":"1","element":"a"},{"text":"] gives a comprehensive overview. This line of investigation has yielded principled online algorithms such as ","element":"span"},{"text":"LINUCB ","element":"span"},{"text":"[","element":"span"},{"href":"#id-1","referenceIndex":2,"text":"2","element":"a"},{"text":"], that work well given informative features. 
To work around the limitations of linear representations in more difficult problems, these algorithms are often used in combination with an offline nonlinear feature extraction technique such as deep learning. A limitation of such approaches is that the feature extraction component is treated as a black box, which runs the risk of ignoring the uncertainty introduced by the offline feature extraction component.","element":"span"}],[{"text":"Recent advances in posterior approximation for deep networks have led to the introduction of a variety of approximate Thompson Sampling based contextual bandit algorithms that perform well in practice. A recurring theme across these works is to leverage some kind of efficiently approximated surrogate notion of the estimation accuracy to drive exploration. Noteworthy examples include the use of random value functions [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":4,"text":"4","element":"a"},{"text":"], Bayes by Backprop [","element":"span"},{"href":"#id-4","referenceIndex":5,"text":"5","element":"a"},{"text":"], and noise injection [","element":"span"},{"href":"#id-5","referenceIndex":6,"text":"6","element":"a"},{"text":"]. An empirical comparison of neural network based Bayesian methods can be found in [","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"7","element":"a"},{"text":"]. A major drawback of these methods is that they are not online, and often require expensive retraining at regular intervals.","element":"span"}],[{"text":"Another line of investigation has focused on using count-based approaches to drive exploration via the optimism in the face of uncertainty principle. Here various types of confidence bounds on action value estimates are obtained directly from the state/context-action visitation counts, with algorithms typically choosing an action greedily with respect to the upper confidence bound. 
Count-based methods have seen noteworthy success in finite-armed bandit problems [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"], tabular reinforcement learning [","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"9","element":"a"},{"text":", ","element":"span"},{"href":"#id-9","referenceIndex":10,"text":"10","element":"a"},{"text":"], planning in MDPs [","element":"span"},{"href":"#id-10","referenceIndex":11,"text":"11","element":"a"},{"text":"], amongst others. For the most part, however, the use of count-based approaches has been limited to low-dimensional settings, as counts get exponentially","element":"span"}],[{"id":"id-16","style":{"width":"81%"},"width":1295,"height":700,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/1-0.png","element":"img"}],[{"text":"Figure 1: (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"A","element":"figcaption","subtype":"caption"},{"text":") Illustration of half-space gating for a 2D context. Color represents how many halfspaces intersect with the data point ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x","element":"figcaption","subtype":"caption"},{"text":". Within each region of constant color (each polytope), the gated weights for a G-GLN network are constant. (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"B","element":"figcaption","subtype":"caption"},{"text":") A graphical depiction of a Gated Linear Network. Each neuron receives inputs from the previous layer as well as the broadcast side information ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"z","element":"figcaption","subtype":"caption"},{"text":". 
The side information is passed through all the gating functions, whose outputs ","element":"figcaption","subtype":"caption"},{"style":{"height":16.79},"width":361.43,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/1-1.png","element":"img","alt":" sij = cij(z) determine","inline":true,"padRight":true},{"text":"the active weight vectors (shown in blue). The dot-product of these vectors with input ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"x ","element":"figcaption","subtype":"caption"},{"text":"forms the output after being passed through a sigmoid function.","element":"figcaption","subtype":"caption"}],[{"text":"sparser as the state/context-action dimension increases. As a remedy to this problem, [","element":"span"},{"href":"#id-11","referenceIndex":12,"text":"12","element":"a"},{"text":"] proposed a notion of “pseudocounts”, which utilize density-like approximations to generalize counts across high-dimensional states/contexts. Impressive performance was obtained in popular reinforcement learning settings such as Atari game playing when using this technique to drive exploration. The idea of generalizing counts to higher-dimensional state spaces was also pursued by [","element":"span"},{"href":"#id-12","referenceIndex":13,"text":"13","element":"a"},{"text":"], who proposed an elegant approach that used the SimHash [","element":"span"},{"href":"#id-13","referenceIndex":14,"text":"14","element":"a"},{"text":"] variant of locality-sensitive hashing to map the original state space to a smaller space for which counting state visitations is tractable. This approach led to strong results in both Atari and continuous control reinforcement learning tasks.","element":"span"}],[{"text":"In this work, we introduce a new online contextual bandit algorithm that combines the benefits of scalable non-linear action-value estimation with a notion of hash-based pseudocounts. 
For action-value estimation we use a Gated Linear Network (GLN) that employs half-space gating, which has recently been shown to give rise to universal function approximation capabilities [","element":"span"},{"href":"#id-14","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"16","element":"a"},{"text":"]. To drive exploration, our key insight is to exploit the close connection between half-space gating and the SimHash variant of locality-sensitive hashing; by associating a counter with each neuron's gated weight vector, we can define a pseudocount-based exploration mechanism that can generalize in a way similar to the work of [","element":"span"},{"href":"#id-12","referenceIndex":13,"text":"13","element":"a"},{"text":"], with essentially no additional computational overhead beyond obtaining a GLN-based action-value estimate. Furthermore, since the gating in a GLN is directly responsible for determining its inductive bias, our notion of pseudocount is tightly coupled to the network's parameter uncertainty, which allows us to naturally define a UCB-like [","element":"span"},{"href":"#id-7","referenceIndex":8,"text":"8","element":"a"},{"text":"] policy as a function of the pseudocounts. We demonstrate the empirical efficacy of our method across a set of real-world and synthetic datasets, where we show that our policy outperforms all of the state-of-the-art neural Bayesian methods considered in the recent survey of ","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"[7] ","element":"a"},{"text":"in terms of mean rank.","element":"span"}]]},{"heading":"2 Background","paragraphs":[[{"text":"In this section we give a short overview of Gated Linear Networks sufficient for understanding the contents of this paper. 
We refer the reader to ","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"[16, ","element":"a"},{"href":"#id-14","referenceIndex":15,"text":"15] ","element":"a"},{"text":"for additional background.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gated Linear Networks. ","element":"span"},{"text":"(GLNs) [","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"16","element":"a"},{"text":"] are feed-forward networks composed of many layers of gated geometric mixing neurons; see Figure ","element":"span"},{"href":"#id-16","text":"1 ","element":"a"},{"text":"(B) for a graphical depiction. Each neuron in a given layer outputs a gated geometric mixture of the predictions from the previous layer, with the final layer consisting of just a single neuron that determines the output of the entire network. In ","element":"span"},{"text":"contrast to an MLP, the side information (or input features) is broadcast to every single neuron, as this is what each gating function will operate on. The distinguishing properties of this architecture are that the gating functions are fixed in advance, that each neuron attempts to predict the same target with an associated per-neuron loss, and that all learning takes place locally within each neuron.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gated Geometric Mixing. ","element":"span"},{"text":"We now give a brief overview of gated geometric mixing neurons, and describe how they learn; a comprehensive description can be found in Section 2 of ","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"[16]","element":"a"},{"text":".","element":"span"}],[{"text":"Geometric Mixing is a simple and well-studied ensemble technique for combining probabilistic forecasts. 
It has seen extensive application in statistical data compression [","element":"span"},{"href":"#id-17","referenceIndex":17,"text":"17","element":"a"},{"text":", ","element":"span"},{"href":"#id-18","referenceIndex":18,"text":"18","element":"a"},{"text":"]. One can think of it as a parametrised form of geometric averaging, or as a product of experts [","element":"span"},{"href":"#id-19","referenceIndex":19,"text":"19","element":"a"},{"text":"]. Given ","element":"span"},{"style":{"height":10},"width":219.17,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-0.png","element":"img","alt":" p1, p2, . . . , pd","inline":true,"padRight":true},{"text":"input probabilities predicting the occurrence of a single binary event, geometric mixing computes ","element":"span"},{"style":{"height":17.38},"width":689.96,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-1.png","element":"img","alt":"σ(w⊤σ−1(p)), where σ(x) := 1/(1 + e−x)","inline":true,"padRight":true},{"text":"denotes the sigmoid function, ","element":"span"},{"style":{"height":13.38},"width":65.11,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-2.png","element":"img","alt":" σ−1 ","inline":true,"padRight":true},{"text":"its inverse – the logit function, ","element":"span"},{"style":{"height":16},"width":280.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-3.png","element":"img","alt":" p := (p1, . . . 
, pd)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14.18},"width":124.18,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-4.png","element":"img","alt":" w ∈ Rd","inline":true,"padRight":true},{"text":"is the weight vector which controls the relative importance of the input forecasts.","element":"span"}],[{"text":"A gated geometric mixing neuron is the combination of a gating procedure and geometric mixing. In this work, gating has the intuitive meaning of mapping particular input examples to a particular choice of weight vector for use with geometric mixing. We can represent each neuron’s gated weights by a matrix, with each row corresponding to the weight vector selected by the gating procedure. More formally, associated to every gated geometric mixing neuron will be a gating function ","element":"span"},{"style":{"height":14},"width":175.3,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-5.png","element":"img","alt":" g : Z → S","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , S","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"for some integer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S > ","element":"span"},{"text":"1","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Z ","element":"span"},{"text":"denotes the space of possible side information and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"denotes the signature for each weight vector. 
The weight matrix can now be defined as ","element":"span"},{"style":{"height":16},"width":394.64,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-6.png","element":"img","alt":"W = (w1, ..., ws)⊤ ∈ W","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"is assumed to be a convex set ","element":"span"},{"style":{"height":14.18},"width":181.5,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-7.png","element":"img","alt":" W ⊂ Rs×d","inline":true},{"text":". The key idea is that such a neuron can specialize its weighting of the input predictions based on some neuron-specific property of the side information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":".","element":"span"}],[{"text":"Online learning under the logarithmic loss can be realized in a principled and efficient fashion using Online Gradient Descent ","element":"span"},{"href":"#id-20","referenceIndex":20,"text":"[20]","element":"a"},{"text":", as the loss function","element":"span"}],[{"id":"id-21","style":{"width":"70%"},"width":1124,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-8.png","element":"img"}],[{"text":"is a convex function of the active weights ","element":"span"},{"style":{"height":18.73},"width":252.87,"height":46.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-9.png","element":"img","alt":" Wg(z)∗ ≡ w⊤g(z)","inline":true},{"text":". By forcing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"W ","element":"span"},{"text":"to be a (scaled) hypercube, ","element":"span"},{"text":"the projection step can be implemented efficiently using clipping.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Networks of Gated Geometric Mixers. 
","element":"span"},{"text":"We now return to more concretely describing the network architecture depicted in Figure ","element":"span"},{"href":"#id-16","text":"1. ","element":"a"},{"text":"Upon receiving an input, all the gates in the network fire, which corresponds to selecting a single weight vector local to each neuron from the provided side information for subsequent use with geometric mixing. It is important to note that such networks are ","element":"span"},{"style":{"fontStyle":"italic"},"text":"data-dependent ","element":"span"},{"text":"piecewise linear networks, as each neuron’s input non-linearity (the logit function) is inverse to the output non-linearity (the sigmoid function).","element":"span"}],[{"text":"Returning to Figure ","element":"span"},{"href":"#id-16","text":"1, ","element":"a"},{"text":"each rounded rectangle depicts a Gated Geometric Mixing neuron; the bias is a scalar value between 0 and 1. There are two types of input to each neuron: the first is the side information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":", which can be thought of as the input features in a standard supervised learning setup; the second is the input to the neuron, which will be the predictions output by the previous layer, or in the case of layer 0, some function of the side information. 
The side information is fed into every neuron via the context function ","element":"span"},{"style":{"height":15.59},"width":233.15,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-10.png","element":"img","alt":" gij : Z → Sij","inline":true,"padRight":true},{"text":"for neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"to determine which weight vector ","element":"span"},{"style":{"height":12.87},"width":136.49,"height":32.18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-11.png","element":"img","alt":" wijgij(z)","inline":true,"padRight":true},{"text":"is active in matrix ","element":"span"},{"style":{"height":11.59},"width":52.8,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-12.png","element":"img","alt":" wij","inline":true,"padRight":true},{"text":"for a given input. Each neuron attempts to directly predict the target, and these predictions are fed into higher layers. The loss function associated with each neuron is given by Eq.","element":"span"},{"href":"#id-21","text":"(1) ","element":"a"},{"text":"applied to ","element":"span"},{"style":{"height":11.59},"width":52.8,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-13.png","element":"img","alt":" wij","inline":true,"padRight":true},{"text":"using its respective gating function ","element":"span"},{"style":{"height":11.59},"width":43.28,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-14.png","element":"img","alt":" gij","inline":true},{"text":". 
It is important to note that both prediction and weight updates require just a single forward computational pass of the network, as one can see from inspecting Algorithm ","element":"span"},{"href":"#id-22","text":"1.","element":"a"}],[{"style":{"fontWeight":"bold"},"text":"Random Halfspace Gating. ","element":"span"},{"text":"The choice of GLN gating function (i.e., ","element":"span"},{"style":{"height":11.59},"width":43.28,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/2-15.png","element":"img","alt":" gij","inline":true},{"text":") is paramount, as it determines the inductive bias and capacity of the network. Here we restrict our attention to halfspace gating, which was shown in [","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"16","element":"a"},{"text":"] to be universal in the sense that sufficiently large halfspace-gated GLNs can model any bounded, continuous and compactly supported density function by only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"locally optimizing ","element":"span"},{"text":"the loss at each neuron.","element":"span"}],[{"text":"Given a finite-sized halfspace GLN, we need a mechanism to select the fixed gates for each neuron. Promising initial results were shown in [","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"16","element":"a"},{"text":"] for simple classification problems when the normal ","element":"span"},{"text":"vector of each halfspace was sampled i.i.d. from a Gaussian distribution. 
Here we add some intuition about the learning dynamics which will motivate our subsequent exploration heuristic.","element":"span"}],[{"text":"In [","element":"span"},{"href":"#id-15","referenceIndex":16,"text":"16","element":"a"},{"text":"] it was shown that one can rewrite the output of an ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-layer GLN, with ","element":"span"},{"style":{"height":13.19},"width":44.85,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-0.png","element":"img","alt":" Ki","inline":true,"padRight":true},{"text":"neurons in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", and with input ","element":"span"},{"style":{"height":10},"width":36.05,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-1.png","element":"img","alt":" p0","inline":true,"padRight":true},{"text":"and side information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":", as","element":"span"}],[{"style":{"width":"57%"},"width":911,"height":115,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-2.png","element":"img"}],[{"text":"where each matrix ","element":"span"},{"style":{"height":16},"width":102.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-3.png","element":"img","alt":" Wi(z)","inline":true,"padRight":true},{"text":"is of dimension ","element":"span"},{"style":{"height":13.19},"width":181.82,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-4.png","element":"img","alt":" Ki × Ki−1","inline":true},{"text":", with each row constituting the active weights (as determined by the gating) for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th neuron in layer 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":". Here one can see that the product of matrices collapses to a multilinear polynomial in the learnt weights. Note that the resulting multilinear polynomial may be different for different ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":", resulting in a much richer class of models. Thus the depth and shape of the network influences how the GLN will generalize. Figure ","element":"span"},{"href":"#id-16","text":"1 ","element":"a"},{"text":"(Left) shows the effects on the change in decision boundary of training on a single data point marked as ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":". The magnitude of the change is largest within the convex polytope containing the training point, and decays with respect to the remaining convex polytopes according to how many halfspaces they share with the containing convex polytope. This makes intuitive sense, as since the weight update is local, each row of ","element":"span"},{"style":{"height":16},"width":102.69,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-5.png","element":"img","alt":" Wi(z)","inline":true,"padRight":true},{"text":"is pushed in the direction to better explain the data independently of each other. One can think of a GLN as a kind of smoothing technique – input points which cause similar gating activation patterns must have similar outputs.","element":"span"}],[{"text":"This observation motivated the following heuristic idea for exploration: if we associated a counter with every halfspace, which was incremented whenever we updated the weights there whenever we see a new data point, and simply summed the counts of all its active halfspaces, we would get a good sense as to how well we would expect the GLN to generalize within this region. 
This intuition is the basis for the algorithms we explore in Section 3.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Prediction and Weight Update. ","element":"span"},{"text":"Both prediction and online learning using Online Gradient Descent can be implemented in a single forward pass of the network. We will define this forward pass as a helper routine in Algorithm ","element":"span"},{"href":"#id-22","text":"1, ","element":"a"},{"text":"and in subsequent sections instantiate it to compute various quantities of interest for our contextual bandit application.","element":"span"}],[{"text":"We will use notation consistent with Figure ","element":"span"},{"href":"#id-16","text":"1. ","element":"a"},{"text":"Layer 0 will correspond to the input features. Here ","element":"span"},{"style":{"height":13.19},"width":124.82,"height":32.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-6.png","element":"img","alt":"Ki ∈ N","inline":true,"padRight":true},{"text":"denotes the number of neurons in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":", with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"denoting the number of layers excluding the base layer (Layer 0). 
The prediction made by the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th neuron in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is denoted by ","element":"span"},{"style":{"height":16.79},"width":252.34,"height":41.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-7.png","element":"img","alt":" pij ∈ [ε, 1 − ε],","inline":true,"padRight":true},{"text":"for all ","element":"span"},{"style":{"height":14},"width":189.72,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-8.png","element":"img","alt":" 0 ≤ j < Ki","inline":true},{"text":", for all layers ","element":"span"},{"style":{"height":13.2},"width":166.92,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-9.png","element":"img","alt":" 0 ≤ i ≤ L","inline":true},{"text":". The vector of predictions from all neurons within layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is denoted by ","element":"span"},{"style":{"height":16},"width":369.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-10.png","element":"img","alt":" pi = (pi0, . . . , piKi−1)","inline":true},{"text":". 
The base predictions used for the first layer need to lie within ","element":"span"},{"style":{"height":16},"width":140.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-11.png","element":"img","alt":"[ϵ, 1 − ϵ]","inline":true,"padRight":true},{"text":"to satisfy the constraints imposed by geometric mixing; if the contextual side information ","element":"span"},{"style":{"fontStyle":"italic"},"text":"z ","element":"span"},{"text":"lies outside this range, one would typically define the base prediction ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":":= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"z","element":"span"},{"text":")","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"f ","element":"span"},{"text":"is some squashing function. Here we adopt the convention that ","element":"span"},{"style":{"height":10},"width":47.32,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-12.png","element":"img","alt":" pi0","inline":true,"padRight":true},{"text":"is a constant bias ","element":"span"},{"style":{"height":16},"width":347.22,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-13.png","element":"img","alt":" β ∈ [ε, 1 − ε] \\ {0.5}","inline":true},{"text":". Associated with each neuron is a gating function ","element":"span"},{"style":{"height":11.59},"width":43.28,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-14.png","element":"img","alt":" gij","inline":true,"padRight":true},{"text":"that determines which vector of weights to use for any given side information. 
Note that all neuron predictions are clipped to lie within ","element":"span"},{"style":{"height":16},"width":145.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-15.png","element":"img","alt":" [ε, 1 − ε]","inline":true},{"text":"; this ensures that the ","element":"span"},{"style":{"height":7.6},"width":32.6,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-16.png","element":"img","alt":" ℓ2","inline":true,"padRight":true},{"text":"norm of any gradient is finite. We define the prediction clipping function as ","element":"span"},{"text":"CLIP","element":"span"},{"style":{"height":17.39},"width":561.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-17.png","element":"img","alt":"_ε^{1−ε}[x] := min{max(x, ε), 1 − ε}","inline":true},{"text":". The weight space for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th neuron in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i > ","element":"span"},{"text":"0 ","element":"span"},{"text":"is a convex set ","element":"span"},{"style":{"height":18.18},"width":236.27,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-18.png","element":"img","alt":" Wij ⊂ RKi−1","inline":true},{"text":"; typically one would use the same convex set across all neurons within a single layer, although this is not required. For each neuron ","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"i, j","element":"span"},{"text":")","element":"span"},{"text":", we project its weights after a gradient step onto the convex set ","element":"span"},{"style":{"height":15.59},"width":63.63,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-19.png","element":"img","alt":" Wij","inline":true},{"text":". 
In practical implementations one typically would set ","element":"span"},{"style":{"height":18.17},"width":298.61,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-20.png","element":"img","alt":"Wij = [−b, b]Ki−1","inline":true},{"text":", for some constant ","element":"span"},{"text":"10 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"< b < ","element":"span"},{"text":"100","element":"span"},{"text":", for all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". This projection can be implemented efficiently by clipping every component of ","element":"span"},{"style":{"height":23.52},"width":69.61,"height":58.81,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-21.png","element":"img","alt":" w(t)ijs ","inline":true,"padRight":true},{"text":"to lie within ","element":"span"},{"style":{"height":16},"width":104.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-22.png","element":"img","alt":" [−b, b]","inline":true},{"text":". The matrix of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"gated ","element":"span"},{"text":"weights for ","element":"span"},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"th neuron in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"is denoted by ","element":"span"},{"style":{"height":18.18},"width":277.09,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-23.png","element":"img","alt":" wij ∈ RSij×Ki−1","inline":true},{"text":". 
We denote by ","element":"span"},{"style":{"height":16.79},"width":236.7,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-24.png","element":"img","alt":" Θ = {wijs}ijs","inline":true,"padRight":true},{"text":"the set of all gated weight vectors for the network.","element":"span"}]]},{"heading":"3 Gated Linear Contextual Bandits","paragraphs":[[{"text":"We now introduce our Gated Linear Contextual Bandits (GLCB) algorithm, a contextual bandit technique that utilizes GLNs for estimating expected rewards of arms and using its associated gating functions to derive exploration bonuses.","element":"span"}],[{"text":"Let ","element":"span"},{"style":{"height":17.38},"width":250.34,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-25.png","element":"img","alt":" X ⊆ [0; 1]K0−1","inline":true,"padRight":true},{"text":"be a set of contexts and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"A ","element":"span"},{"text":"be a finite set of actions. At each discrete timestep ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", the agent observes a context ","element":"span"},{"style":{"height":13.19},"width":118.54,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-26.png","element":"img","alt":" xt ∈ X","inline":true,"padRight":true},{"text":"and takes an action ","element":"span"},{"style":{"height":13.99},"width":115.82,"height":34.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/3-27.png","element":"img","alt":" at ∈ A","inline":true},{"text":", receiving a context-action dependent","element":"span"}],[{"id":"id-22","text":"Table 1: (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Algorithm 1","element":"figcaption","subtype":"caption"},{"text":") Perform a forward pass and optionally update weights. 
Each layer performs clipped geometric mixing over the outputs of the previous layer, where the mixing weights are side-info-dependent via the gating function (Line 12). (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Algorithm 2","element":"figcaption","subtype":"caption"},{"text":") ","element":"figcaption","subtype":"caption"},{"text":"GLCB","element":"figcaption","subtype":"caption"},{"text":"-policy applied for ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"T ","element":"figcaption","subtype":"caption"},{"text":"timesteps. Signature counts are initialized to zero in Line 7. The exploration bonus is computed in Line 12, where the denominator of the square-root is the pseudocount term. The actions are chosen by greedily maximizing the sum of the expected reward and the exploration bonus in Line 13. GLN parameters and counts are updated in Lines 15-18.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"95%"},"width":1517,"height":983,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-0.png","element":"img"}],[{"text":"reward ","element":"span"},{"style":{"height":9.19},"width":29.98,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-1.png","element":"img","alt":" r_t","inline":true},{"text":". The goal is to maximize the cumulative rewards ","element":"span"},{"style":{"height":20.4},"width":133.07,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-2.png","element":"img","alt":"∑_{t=1}^{T} r_t","inline":true,"padRight":true},{"text":"over an unknown horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":". We ","element":"span"},{"text":"first consider the case of Bernoulli bandits, then generalize the setup to bounded continuous rewards.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Bernoulli distributed rewards. 
","element":"span"},{"text":"Assume that the rewards ","element":"span"},{"style":{"height":16},"width":358.62,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-3.png","element":"img","alt":" rxat ∼ Bernoulli(θxa)","inline":true,"padRight":true},{"text":"are conditional i.i.d. , where ","element":"span"},{"style":{"height":13.19},"width":53.79,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-4.png","element":"img","alt":" θxa","inline":true,"padRight":true},{"text":"is a context-action dependent reward probability that is unknown to the agent. We will use a separate GLN to estimate the context dependent reward probability ","element":"span"},{"text":"Pr[","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":"] = ","element":"span"},{"style":{"height":16},"width":247.73,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-5.png","element":"img","alt":"E[r|x, a] = θxa","inline":true,"padRight":true},{"text":"for each arm. Across arms, each GLN will share the same set of hyperparameters. This includes the shape of the network, the choice of randomly sampled halfspace gating functions for the contexts, the choice of clipping threshold, and weight space. The weight parameters for each neuron on layer ","element":"span"},{"style":{"height":13.2},"width":86.91,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-6.png","element":"img","alt":" i ≥ 1","inline":true,"padRight":true},{"text":"are initialized to ","element":"span"},{"style":{"height":16},"width":125.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-7.png","element":"img","alt":" 1/Ki−1","inline":true},{"text":". 
In our application, there is no need to make a distinction between the input to the network and the side information, so from here onward we drop this dependence by defining","element":"span"}],[{"style":{"width":"37%"},"width":589,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-8.png","element":"img"}],[{"text":"We use ","element":"span"},{"style":{"height":16.92},"width":47.99,"height":42.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-9.png","element":"img","alt":" Θta","inline":true,"padRight":true},{"text":"to denote the current set of GLN parameters at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"for action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":", which is learnt ","element":"span"},{"text":"from ","element":"span"},{"style":{"height":16},"width":457.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-10.png","element":"img","alt":" {(xτ, rτ) : aτ = a, τ < t}","inline":true,"padRight":true},{"text":"using Algorithm ","element":"span"},{"href":"#id-22","text":"1 ","element":"a"},{"text":"with ","element":"span"},{"style":{"height":14},"width":201.64,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-11.png","element":"img","alt":" update = ⊤","inline":true},{"text":". 
Therefore ","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":16.99},"width":131.63,"height":42.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-12.png","element":"img","alt":"ta(x) :=","inline":true,"padRight":true},{"text":"GLN","element":"span"},{"style":{"height":16.99},"width":128.9,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-13.png","element":"img","alt":"(x | Θta)","inline":true,"padRight":true},{"text":"is the estimate of the expected reward for an arm ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"given context ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":".","element":"span"}],[{"text":"From now on we assume each GLN is composed of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"neurons, which we also call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"units","element":"span"},{"text":", where we denote the index set of the units as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , U","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"which is bijected to our previous (layer,unit) index set ","element":"span"},{"style":{"height":16},"width":531.01,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-14.png","element":"img","alt":" {(i, j) : 1 ≤ i ≤ L, 0 ≤ j < Ki}","inline":true},{"text":". 
Each unit is associated with a gating function ","element":"span"},{"style":{"height":14.4},"width":147.37,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-15.png","element":"img","alt":" gu where","inline":true},{"style":{"height":11.6},"width":110.42,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-16.png","element":"img","alt":"u ∈ U.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"GLCB Policy. ","element":"span"},{"text":"The GLCB policy/action is defined as","element":"span"}],[{"style":{"width":"41%"},"width":665,"height":68,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/4-17.png","element":"img"}],[{"style":{"width":"49%"},"width":789,"height":120,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/5-0.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.53},"width":312.62,"height":41.33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/5-1.png","element":"img","alt":"¯t := t − 1, C ∈ R+","inline":true,"padRight":true},{"text":"is a constant that scales the exploration bonus, ","element":"span"},{"style":{"height":16},"width":420.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/5-2.png","element":"img","alt":" g(x) = (g1(x), ..., gU(x))","inline":true,"padRight":true},{"text":"is the total signature, and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"}],[{"text":"generalizing the exact count ","element":"span"},{"style":{"height":16},"width":139.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/5-3.png","element":"img","alt":" N¯t(x, a)","inline":true,"padRight":true},{"text":"found in UCB.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Pseudocounts for GLNs. 
","element":"span"},{"text":"Let ","element":"span"},{"style":{"height":16},"width":292.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/5-4.png","element":"img","alt":" x 0, where s := g(x).","inline":true}],[{"text":"In the realizable case in which ","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":16},"width":88.03,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-15.png","element":"img","alt":"∞a (x)","inline":true,"padRight":true},{"text":"can represent the expected reward ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"exactly, the ","element":"span"},{"text":"asymptotic GLCB policy ","element":"span"},{"style":{"height":18.83},"width":596.67,"height":47.08,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-16.png","element":"img","alt":" ˜π(x) ∈ ˜Π(x) := arg maxa GLN∞a (x)","inline":true,"padRight":true},{"text":"is (Bayes) optimal. In the unrealizable ","element":"span"},{"text":"case, which we consider here, ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-17.png","element":"img","alt":" ˜π","inline":true,"padRight":true},{"text":"is only the “optimal” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"realizable ","element":"span"},{"text":"policy. The resulting estimation error will be defined and taken into account later. The next lemma shows that sub-“optimal” (w.r.t. 
","element":"span"},{"style":{"height":14},"width":37.14,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-18.png","element":"img","alt":" ˜π)","inline":true,"padRight":true},{"text":"actions are taken sublinearly often.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Lemma 4 (sub-optimal action lemma) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Sub-“optimal” actions are taken with vanishing frequency. Formally, ","element":"span"},{"style":{"height":18.83},"width":820.8,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-19.png","element":"img","alt":" Nt(s, a) = o(t) w.p.1 ∀a ̸∈ ˜Π(x), where s = g(x).","inline":true}],[{"text":"Let us now turn to the regret, that is the error measured in terms of lost reward suffered by the online learning ","element":"span"},{"text":"GLN ","element":"span"},{"text":"policy ","element":"span"},{"style":{"height":9.19},"width":34.72,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-20.png","element":"img","alt":" πt","inline":true,"padRight":true},{"text":"compared to the “optimal” realizable policy ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-21.png","element":"img","alt":" ˜π","inline":true,"padRight":true},{"text":"in hindsight:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5 (pseudo-regret / policy error) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let PolErr","element":"span"},{"style":{"height":19.94},"width":583.94,"height":49.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-22.png","element":"img","alt":"(x) := GLN∞˜π(x)(x) − GLN∞πt(x)(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simple regret incurred by the GLCB (learning) policy 
","element":"span"},{"style":{"height":16},"width":91.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-23.png","element":"img","alt":" πt(x)","inline":true},{"style":{"fontStyle":"italic"},"text":". Then the total pseudo-regret","element":"span"}],[{"id":"id-27","style":{"width":"51%"},"width":810,"height":117,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-24.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"which implies PolErr","element":"span"},{"style":{"height":16},"width":135.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-25.png","element":"img","alt":"(x) → 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Cesaro average.","element":"span"}],[{"text":"Typically the GLN cannot represent the true expected reward exactly, which will introduce a (small) representation error (also known as approximation error):","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 6 (representation error) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") := ","element":"span"},{"text":"E","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"r","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":"] = ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"r ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":"] ","element":"span"},{"style":{"fontStyle":"italic"},"text":"be the 
true expected reward of action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"in context ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":". Let ","element":"span"},{"style":{"height":16},"width":461.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-26.png","element":"img","alt":" π∗(x) := arg maxa Q(x, a)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the (Bayes) optimal policy (in hindsight). Then, for Lipschitz ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and sufficiently large GLN, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"style":{"fontStyle":"italic"},"text":"can be represented arbitrarily well, i.e. the (asymptotic) representation error (also known as approximation error) RepErr","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") := ","element":"span"},{"style":{"height":16},"width":453.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-27.png","element":"img","alt":"maxa |Q(x, a) − GLN∞a (x)|","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"can be made arbitrarily small.","element":"span"}],[{"text":"The Theorem is stated for Bernoulli rewards, but also holds for bounded continuous rewards if ","element":"span"},{"text":"GLN ","element":"span"},{"text":"is replaced by ","element":"span"},{"text":"CTREE","element":"span"},{"text":". 
Finally, we can connect the dots and bound the true regret in terms of policy and representation error:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Corollary 7 (Simple Q-regret) ","element":"span"},{"id":"id-25","style":{"height":16},"width":905.57,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/6-28.png","element":"img","alt":" Q(x, π∗(x)) − Q(x, πt(x)) ≤ PolErr(x) + 2RepErr(x).","inline":true}],[{"text":"Corollary ","element":"span"},{"href":"#id-25","text":"7 ","element":"a"},{"text":"shows that the simple regret of GLCB is bounded by the sum of twice the representation error (which by Thm. ","element":"span"},{"href":"#id-26","text":"3 ","element":"a"},{"text":"can be made small by a large GLN) and the policy error (which by Thm. ","element":"span"},{"href":"#id-27","text":"6 ","element":"a"},{"text":"tends to zero in Cesaro average).","element":"span"}],[{"id":"id-28","style":{"width":"95%"},"width":1518,"height":652,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/7-0.png","element":"img"}],[{"text":"Table 2: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"(Left) ","element":"figcaption","subtype":"caption"},{"text":"Ranks of bandit algorithms based on average cumulative rewards obtained per dataset, sorted by mean. The raw scores used to generate this table are provided in the Appendix. (","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"Right","element":"figcaption","subtype":"caption"},{"text":") Summary of all considered bandit tasks. 
Note that the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"wheel ","element":"figcaption","subtype":"caption"},{"text":"environment is synthetically generated, so the size of the context set is not given.","element":"figcaption","subtype":"caption"}]]},{"heading":"5 Experiments","paragraphs":[[{"text":"We evaluate ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"against 9 state-of-the-art bandit algorithms, as implemented in the “Deep Bayesian Bandits” library [","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"7","element":"a"},{"text":"], which we describe further in the Appendix. Each baseline uses a neural network to estimate action values from a context, and selects actions greedily or via Thompson sampling. The neural networks themselves are trained using batch SGD with respect to the set of previously observed contexts. In contrast, ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"is online and does not require looping over or storing previous data. We used the implementation and hyperparameters provided by [","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"7","element":"a"},{"text":"], and found that further parameter tuning yielded negligible improvement. 
We tune two sets of parameters for ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"using grid search, one for the set of Bernoulli bandit tasks and another for the set of continuous bandit tasks, which we report in the Appendix.","element":"span"}],[{"text":"Each algorithm is evaluated using seven of the ten contextual bandit problems described in [","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"7","element":"a"},{"text":"] – four discrete tasks (","element":"span"},{"style":{"fontStyle":"italic"},"text":"adult","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"census","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"covertype ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"statlog","element":"span"},{"text":") adapted from classification problems, and three continuous tasks adapted from regression problems (","element":"span"},{"style":{"fontStyle":"italic"},"text":"financial","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"jester ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"wheel","element":"span"},{"text":"). The three dropped tasks were either trivial (","element":"span"},{"style":{"fontStyle":"italic"},"text":"synthetic linear bandits","element":"span"},{"text":"), did not fit the 0/1 Bernoulli or continuous bandit formulation (","element":"span"},{"style":{"fontStyle":"italic"},"text":"mushroom","element":"span"},{"text":"), or were not implemented in the library provided by [","element":"span"},{"href":"#id-6","referenceIndex":7,"text":"7","element":"a"},{"text":"] (","element":"span"},{"style":{"fontStyle":"italic"},"text":"song","element":"span"},{"text":"). A summary of each task is provided in Table ","element":"span"},{"href":"#id-28","text":"2 ","element":"a"},{"text":"(Right). 
For each time step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", a context ","element":"span"},{"style":{"height":11.6},"width":102.51,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/7-1.png","element":"img","alt":" x ∈ D","inline":true,"padRight":true},{"text":"is sampled without replacement until ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"= min","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"5000","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|D|} ","element":"span"},{"text":"(e.g. the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"financial ","element":"span"},{"text":"task runs for only ","element":"span"},{"style":{"fontStyle":"italic"},"text":"|D| ","element":"span"},{"text":"= 3713 ","element":"span"},{"text":"steps). Some baselines (e.g., L","element":"span"},{"text":"IN","element":"span"},{"text":"F","element":"span"},{"text":"ULL","element":"span"},{"text":"P","element":"span"},{"text":"OST","element":"span"},{"text":") have quadratic or even cubic time complexity and are therefore prohibitively expensive to run repeatedly using hundreds of random seeds. For this reason, we capped the time horizon ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"text":"at ","element":"span"},{"text":"5000","element":"span"},{"text":".","element":"span"}],[{"text":"Table ","element":"span"},{"href":"#id-28","text":"2 ","element":"a"},{"text":"(Left) presents the performance of ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"and the baselines. Note that ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"is the only algorithm that is online, as discussed earlier. 
It is evident that ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"performs well overall, obtaining the best average rank across the seven tasks considered. ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"ranks comparatively higher in discrete tasks (i.e., with binary rewards) than in continuous tasks. In fact, we have seen that binarizing","element":"span"},{"text":"2 ","element":"span"},{"text":"the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"financial ","element":"span"},{"text":"regression task improves the relative performance of ","element":"span"},{"text":"GLCB","element":"span"},{"text":". We suspect that regression tasks are harder to learn online, as learning fine-grained differences in action values is likely to benefit from multiple passes.","element":"span"}],[{"text":"In our implementation, the wall-clock time to select an action for ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"is between 5 and 8 ms across all datasets, and does not change as more examples are seen. This is a favourable property for practical applications, especially compared to other methods such as L","element":"span"},{"text":"IN","element":"span"},{"text":"F","element":"span"},{"text":"ULL","element":"span"},{"text":"P","element":"span"},{"text":"OST","element":"span"},{"text":", whose action selection gets slower as more data is observed.","element":"span"}]]},{"heading":"6 Discussion","paragraphs":[[{"text":"We have introduced a new algorithm for both the discrete and continuous contextual bandit settings. Leveraging architectural properties of the recently proposed Gated Linear Networks, we were able to efficiently estimate the uncertainty of our predictions with minimal computational overhead. 
Our ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"algorithm obtains a better average rank than all nine considered state-of-the-art contextual bandit algorithms across a standard benchmark of bandit problems, despite being the only considered algorithm that is online.","element":"span"}]]},{"heading":"Broader Impact","paragraphs":[[{"text":"Contextual bandit algorithms can be utilized to deliver personalized content such as news or advertising. Privacy and algorithmic bias should therefore be considered during the implementation process. GLNs are more easily interpretable than conventional neural networks [","element":"span"},{"href":"#id-14","referenceIndex":15,"text":"15","element":"a"},{"text":"], which might be helpful for understanding and addressing any potential bias.","element":"span"}],[{"text":"Our proposed algorithm is online and therefore does not require storing data. This is potentially beneficial in terms of privacy. Using smaller context dimensions might help further by avoiding contexts with a small number of data points as much as possible.","element":"span"}]]},{"heading":"Software","paragraphs":[[{"text":"All models were implemented using JAX [","element":"span"},{"href":"#id-29","referenceIndex":23,"text":"23","element":"a"},{"text":"] and the DeepMind JAX Ecosystem [","element":"span"},{"href":"#id-30","referenceIndex":24,"text":"24","element":"a"},{"text":", ","element":"span"},{"href":"#id-31","referenceIndex":25,"text":"25","element":"a"},{"text":", ","element":"span"},{"href":"#id-32","referenceIndex":26,"text":"26","element":"a"},{"text":", ","element":"span"},{"href":"#id-33","referenceIndex":27,"text":"27","element":"a"},{"text":"]. 
Open-source GLN implementations are available at: ","element":"span"},{"text":"www.github.com/deepmind/deepmind-research/ ","element":"span"},{"text":".","element":"span"}]]},{"heading":"Acknowledgments","paragraphs":[[{"text":"We thank Tor Lattimore for helpful discussions.","element":"span"}]]},{"heading":"Funding Disclosure","paragraphs":[[{"text":"All authors are employees of DeepMind.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-0","text":"[1] Tor Lattimore and Csaba Szepesvári. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bandit Algorithms","element":"span"},{"text":". Cambridge University Press, 2020.","element":"span"}],[{"id":"id-1","text":"[2] ","element":"span"},{"text":"Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 19th International Conference on World Wide Web","element":"span"},{"text":", WWW ’10, pages 661–670, New York, NY, USA, 2010. ACM.","element":"span"}],[{"id":"id-2","text":"[3] ","element":"span"},{"text":"Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 29","element":"span"},{"text":", pages 4026–4034. Curran Associates, Inc., 2016.","element":"span"}],[{"id":"id-3","text":"[4] ","element":"span"},{"text":"Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 32nd International Conference on Neural Information Processing Systems","element":"span"},{"text":", NIPS’18, pages 8626–8638, USA, 2018. 
Curran Associates Inc.","element":"span"}],[{"id":"id-4","text":"[5] ","element":"span"},{"text":"Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Francis Bach and David Blei, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 32nd International Conference on Machine Learning","element":"span"},{"text":", volume 37 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 1613–1622, Lille, France, 07–09 Jul 2015. PMLR.","element":"span"}],[{"id":"id-5","text":"[6] ","element":"span"},{"text":"Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-6","text":"[7] ","element":"span"},{"text":"Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Learning Representations","element":"span"},{"text":", 2018.","element":"span"}],[{"id":"id-7","text":"[8] ","element":"span"},{"text":"Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mach. Learn.","element":"span"},{"text":", 47(2-3):235–256, May 2002.","element":"span"}],[{"id":"id-8","text":"[9] ","element":"span"},{"text":"Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Computer and System Sciences","element":"span"},{"text":", 74(8):1309 – 1331, 2008. Learning Theory 2005.","element":"span"}],[{"id":"id-9","text":"[10] ","element":"span"},{"text":"Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted MDPs. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Theoretical Computer Science","element":"span"},{"text":", 558:125–143, 2014.","element":"span"}],[{"id":"id-10","text":"[11] ","element":"span"},{"text":"Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 17th European Conference on Machine Learning","element":"span"},{"text":", ECML’06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag.","element":"span"}],[{"id":"id-11","text":"[12] ","element":"span"},{"text":"Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems 29","element":"span"},{"text":", pages 1471–1479. Curran Associates, Inc., 2016.","element":"span"}],[{"id":"id-12","text":"[13] ","element":"span"},{"text":"Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #exploration: A study of count-based exploration for deep reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 31st International Conference on Neural Information Processing Systems","element":"span"},{"text":", NIPS’17, pages 2750–2759, USA, 2017. Curran Associates Inc.","element":"span"}],[{"id":"id-13","text":"[14] ","element":"span"},{"text":"Moses S. Charikar. 
Similarity estimation techniques from rounding algorithms. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing","element":"span"},{"text":", STOC ’02, pages 380–388, New York, NY, USA, 2002. ACM.","element":"span"}],[{"id":"id-14","text":"[15] ","element":"span"},{"text":"Joel Veness, Tor Lattimore, Avishkar Bhoopchand, David Budden, Christopher Mattern, Agnieszka Grabska-Barwinska, Peter Toth, Simon Schmitt, and Marcus Hutter. Gated linear networks, 2019.","element":"span"}],[{"id":"id-15","text":"[16] ","element":"span"},{"text":"Joel Veness, Tor Lattimore, Avishkar Bhoopchand, Agnieszka Grabska-Barwinska, Christopher Mattern, and Peter Toth. Online learning with gated linear networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"CoRR","element":"span"},{"text":", abs/1712.01897, 2017.","element":"span"}],[{"id":"id-17","text":"[17] ","element":"span"},{"text":"Christopher Mattern. Mixing strategies in data compression. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2012 Data Compression Conference, Snowbird, UT, USA, April 10-12","element":"span"},{"text":", pages 337–346, 2012.","element":"span"}],[{"id":"id-18","text":"[18] ","element":"span"},{"text":"Christopher Mattern. Linear and geometric mixtures - analysis. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"2013 Data Compression Conference, DCC 2013, Snowbird, UT, USA, March 20-22, 2013","element":"span"},{"text":", pages 301–310, 2013.","element":"span"}],[{"id":"id-19","text":"[19] ","element":"span"},{"text":"Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Computation","element":"span"},{"text":", 14(8):1771–1800, August 2002.","element":"span"}],[{"id":"id-20","text":"[20] ","element":"span"},{"text":"Martin Zinkevich. 
Online convex programming and generalized infinitesimal gradient ascent. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA","element":"span"},{"text":", pages 928–936, 2003.","element":"span"}],[{"id":"id-23","text":"[21] ","element":"span"},{"text":"Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. Compress and control. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA","element":"span"},{"text":", pages 3016–3023, 2015.","element":"span"}],[{"id":"id-24","text":"[22] ","element":"span"},{"text":"Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 34th International Conference on Machine Learning -Volume 70","element":"span"},{"text":", ICML’17, pages 449–458. JMLR.org, 2017.","element":"span"}],[{"id":"id-29","text":"[23] ","element":"span"},{"text":"James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs, 2018.","element":"span"}],[{"id":"id-30","text":"[24] ","element":"span"},{"text":"David Budden, Matteo Hessel, Iurii Kemaev, Stephen Spencer, and Fabio Viola. Chex: Testing made fun, in JAX!, 2020.","element":"span"}],[{"id":"id-31","text":"[25] ","element":"span"},{"text":"Tom Hennigan, Trevor Cai, Tamara Norman, and Igor Babuschkin. Haiku: Sonnet for JAX, 2020.","element":"span"}],[{"id":"id-32","text":"[26] ","element":"span"},{"text":"Matteo Hessel, David Budden, Fabio Viola, Mihaela Rosca, Eren Sezener, and Tom Hennigan. 
Optax: Composable gradient transformation and optimisation, in JAX!, 2020.","element":"span"}],[{"id":"id-33","text":"[27] ","element":"span"},{"text":"David Budden, Matteo Hessel, John Quan, Steven Kapturowski, Kate Baumli, Surya Bhupatiraju, Aurelia Guy, and Michael King. RLax: Reinforcement Learning in JAX, 2020.","element":"span"}],[{"id":"id-38","text":"[28] ","element":"span"},{"text":"Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat Prabhat, and Ryan P. Adams. Scalable Bayesian optimization using deep neural networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the 32nd International Conference on Machine Learning - Volume 37","element":"span"},{"text":", ICML’15, pages 2171–2180. JMLR.org, 2015.","element":"span"}],[{"id":"id-39","text":"[29] ","element":"span"},{"text":"Jose Hernandez-Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hernandez-Lobato, and Richard Turner. Black-box alpha divergence minimization. In Maria Florina Balcan and Kilian Q. Weinberger, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 33rd International Conference on Machine Learning","element":"span"},{"text":", volume 48 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 1511–1520, New York, New York, USA, 20–22 Jun 2016. PMLR.","element":"span"}],[{"id":"id-40","text":"[30] ","element":"span"},{"text":"Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"J. Mach. Learn. 
Res.","element":"span"},{"text":", 15(1):1929–1958, January 2014.","element":"span"}],[{"id":"id-41","text":"[31] ","element":"span"},{"text":"Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In Maria Florina Balcan and Kilian Q. Weinberger, editors, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of The 33rd International Conference on Machine Learning","element":"span"},{"text":", volume 48 of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of Machine Learning Research","element":"span"},{"text":", pages 354–363, New York, New York, USA, 20–22 Jun 2016. PMLR.","element":"span"}]]},{"heading":"A Tree-based discretization for regression problems","paragraphs":[[{"text":"Algorithm ","element":"span"},{"href":"#id-34","text":"3 ","element":"a"},{"text":"describes the ","element":"span"},{"text":"CTREE ","element":"span"},{"text":"algorithm, which we use for estimating expected (context-dependent) reward wherever the rewards are continuous. The algorithm operates on a complete binary tree of depth ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"that maintains a GLN at each non-leaf node. We assume that our tree divides the bounded reward range ","element":"span"},{"style":{"height":16},"width":191.33,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-0.png","element":"img","alt":" [rmin, rmax]","inline":true,"padRight":true},{"text":"uniformly into ","element":"span"},{"style":{"height":13.39},"width":36.92,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-1.png","element":"img","alt":" 2d","inline":true,"padRight":true},{"text":"bins at each level ","element":"span"},{"style":{"height":13.2},"width":113.2,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-2.png","element":"img","alt":" d ≤ D","inline":true},{"text":". 
By labelling the left branch of a node with 0 and the right branch with 1, we can associate a unique binary string ","element":"span"},{"style":{"height":13.19},"width":59.01,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-3.png","element":"img","alt":" b1:d","inline":true,"padRight":true},{"text":"to any single internal (","element":"span"},{"style":{"fontStyle":"italic"},"text":"d < D","element":"span"},{"text":") ","element":"span"},{"text":"or leaf (","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D","element":"span"},{"text":") node in the tree. The ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"th element, when it exists, is denoted as ","element":"span"},{"style":{"height":13.19},"width":34.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-4.png","element":"img","alt":" bd","inline":true},{"text":". The root node is denoted by the empty string ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-5.png","element":"img","alt":" ϵ","inline":true},{"text":". All nodes of the tree can thus be represented as ","element":"span"},{"style":{"height":20.4},"width":377.81,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-6.png","element":"img","alt":" B≤D = {ϵ} ∪ ⋃Dd=1 Bd","inline":true,"padRight":true},{"text":"and all non-leaf nodes with ","element":"span"},{"style":{"height":13.78},"width":266.32,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/11-7.png","element":"img","alt":" B 0, where s := g(x).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"By assumption, ","element":"span"},{"style":{"height":10},"width":144.06,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-22.png","element":"img","alt":" x1, ..., xt","inline":true,"padRight":true},{"text":"are sampled i.i.d. from probability measure ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P ","element":"span"},{"text":"with ","element":"span"},{"style":{"height":11.6},"width":104.5,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-23.png","element":"img","alt":" x ∈ X","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"may be discrete or continuous (","element":"span"},{"style":{"height":17.38},"width":192.22,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-24.png","element":"img","alt":"X ⊆ [0; 1]d","inline":true,"padRight":true},{"text":"in the experiments). Then ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":") := ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"g","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") = ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":"] ","element":"span"},{"text":"is a discrete probability (mass function) over finite space ","element":"span"},{"style":{"height":14.18},"width":134.02,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-25.png","element":"img","alt":" SU ∋ s","inline":true},{"text":". 
Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":") = 0 ","element":"span"},{"text":"implies ","element":"span"},{"style":{"height":16},"width":171.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-26.png","element":"img","alt":"Nt(s) = 0","inline":true},{"text":", hence such ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s ","element":"span"},{"text":"can safely be ignored. Consider ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0","element":"span"},{"text":", which implies ","element":"span"},{"style":{"height":16},"width":200.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-27.png","element":"img","alt":" Nt(s) → ∞","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":10.4},"width":117.27,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-28.png","element":"img","alt":" t → ∞","inline":true,"padRight":true},{"text":"w.p.1; indeed, ","element":"span"},{"style":{"height":16},"width":98.72,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-29.png","element":"img","alt":" Nt(s)","inline":true,"padRight":true},{"text":"grows linearly w.p.1. 
By Lemma ","element":"span"},{"href":"#id-35","text":"1, ","element":"a"},{"text":"this implies ","element":"span"},{"style":{"height":16},"width":258.21,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-30.png","element":"img","alt":" Nt(su, a) → ∞","inline":true},{"style":{"height":12.4},"width":266.41,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-31.png","element":"img","alt":"∀u ∈ U ∀a ∈ A","inline":true,"padRight":true},{"text":"w.p.1. By Proposition ","element":"span"},{"href":"#id-36","text":"2, ","element":"a"},{"text":"this implies ","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":18.64},"width":453.07,"height":46.61,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-32.png","element":"img","alt":"¯ta(x) → GLN∞a (x) w.p.1 ∀a.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Lemma 4 (sub-optimal action lemma) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Sub-“optimal” actions are taken with vanishing frequency. Formally, ","element":"span"},{"style":{"height":18.83},"width":820.8,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-33.png","element":"img","alt":" Nt(s, a) = o(t) w.p.1 ∀a ̸∈ ˜Π(x), where s = g(x).","inline":true}],[{"style":{"fontStyle":"italic"},"text":"Proof. 
","element":"span"},{"text":"Since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":") = 0 ","element":"span"},{"text":"trivially implies ","element":"span"},{"style":{"height":16},"width":210.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-34.png","element":"img","alt":" Nt(s, a) = 0","inline":true},{"text":", we can assume ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"s","element":"span"},{"text":") ","element":"span"},{"style":{"fontStyle":"italic"},"text":"> ","element":"span"},{"text":"0","element":"span"},{"text":". Assume ","element":"span"},{"style":{"height":16},"width":137.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-35.png","element":"img","alt":" Nt(s, a)","inline":true,"padRight":true},{"text":"grows faster than ","element":"span"},{"text":"log ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
Then","element":"span"}],[{"id":"id-37","style":{"width":"76%"},"width":1213,"height":125,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-36.png","element":"img"}],[{"text":"This step uses ","element":"span"},{"style":{"height":17.23},"width":156.78,"height":43.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-37.png","element":"img","alt":"ˆNt ≥ Nt","inline":true},{"text":", which implies ","element":"span"},{"text":"GLNUB","element":"span"},{"style":{"height":18.58},"width":787.17,"height":46.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-38.png","element":"img","alt":"¯ta → GLN∞a < maxa GLN∞a ← maxa GLN¯ta ≤","inline":true},{"style":{"height":18.58},"width":236.72,"height":46.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-39.png","element":"img","alt":"maxa GLNUB¯ta","inline":true},{"text":". The convergence for ","element":"span"},{"style":{"height":10.4},"width":116.4,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-40.png","element":"img","alt":" t → ∞","inline":true,"padRight":true},{"text":"w.p.1 follows from ","element":"span"},{"href":"#id-37","text":"(4) ","element":"a"},{"text":"and Theorem ","element":"span"},{"href":"#id-26","text":"3. ","element":"a"},{"text":"The inequality ","element":"span"},{"text":"is strict for sub-“optimal” ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a","element":"span"},{"text":". 
Hence GLCB does not take action ","element":"span"},{"style":{"height":18.83},"width":154,"height":47.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-41.png","element":"img","alt":" a ̸∈ ˜Π(x)","inline":true,"padRight":true},{"text":"anymore for large ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":", which contradicts ","element":"span"},{"style":{"height":16},"width":267.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/12-42.png","element":"img","alt":" Nt(su, a) → ∞.","inline":true}],[{"style":{"fontWeight":"bold"},"text":"Theorem 5 (pseudo-regret / policy error) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let PolErr","element":"span"},{"style":{"height":19.93},"width":583.94,"height":49.83,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-0.png","element":"img","alt":"(x) := GLN∞˜π(x)(x) − GLN∞πt(x)(x)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"simple regret incurred by the GLCB (learning) policy ","element":"span"},{"style":{"height":16},"width":91.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-1.png","element":"img","alt":" πt(x)","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then the total pseudo-regret","element":"span"}],[{"style":{"width":"51%"},"width":810,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-2.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"which implies PolErr","element":"span"},{"style":{"height":16},"width":135.76,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-3.png","element":"img","alt":"(x) → 0","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"in Cesàro average.","element":"span"}],[{"style":{"width":"100%"},"width":1587,"height":1782,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-4.png","element":"img"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Greedy ","element":"span"},{"text":"estimates action-values with a neural network and follows an ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/13-5.png","element":"img","alt":" ϵ","inline":true},{"text":"-greedy policy.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Neural Linear ","element":"span"},{"text":"utilizes a neural network to extract latent features, from which action values are estimated using Bayesian linear regression. 
Actions are selected by sampling weights from the posterior distribution, and maximizing action values greedily based on the sampled weights, similar to ","element":"span"},{"href":"#id-38","referenceIndex":28,"text":"[28]","element":"a"},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Linear Full Posterior ","element":"span"},{"text":"(L","element":"span"},{"text":"IN","element":"span"},{"text":"F","element":"span"},{"text":"ULL","element":"span"},{"text":"P","element":"span"},{"text":"OST","element":"span"},{"text":") performs a Bayesian linear regression on the contexts directly, without extracting features.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bootstrapped Network ","element":"span"},{"text":"(B","element":"span"},{"text":"OOT","element":"span"},{"text":"RMS) trains a set of neural networks on different subsets of the dataset, similarly to [","element":"span"},{"href":"#id-2","referenceIndex":3,"text":"3","element":"a"},{"text":"]. Values predicted by the neural networks form the posterior distribution.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bayes By Backprop ","element":"span"},{"text":"(BBB) [","element":"span"},{"href":"#id-4","referenceIndex":5,"text":"5","element":"a"},{"text":"] utilizes variational inference to estimate posterior neural network weights. 
BBBA","element":"span"},{"text":"LPHA","element":"span"},{"text":"D","element":"span"},{"text":"IV ","element":"span"},{"text":"utilizes ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bayes By Backprop","element":"span"},{"text":", where the inference is achieved via expectation propagation ","element":"span"},{"href":"#id-39","referenceIndex":29,"text":"[29]","element":"a"},{"text":".","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Dropout ","element":"span"},{"text":"policy treats the output of the neural network with dropout [","element":"span"},{"href":"#id-40","referenceIndex":30,"text":"30","element":"a"},{"text":"] – where each unit's output is zeroed with a certain probability – as a sample from the posterior distribution.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Parameter-Noise ","element":"span"},{"text":"(P","element":"span"},{"text":"ARAM","element":"span"},{"text":"N","element":"span"},{"text":"OISE","element":"span"},{"text":") [","element":"span"},{"href":"#id-5","referenceIndex":6,"text":"6","element":"a"},{"text":"] obtains the posterior samples by injecting random noise into the neural network weights.","element":"span"}],[{"text":"• ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Constant-SGD ","element":"span"},{"text":"(","element":"span"},{"text":"CONST","element":"span"},{"text":"SGD) policy exploits the fact that stochastic gradient descent (SGD) with a constant learning rate is a stationary process after an initial “burn-in” period. The analysis in [","element":"span"},{"href":"#id-41","referenceIndex":31,"text":"31","element":"a"},{"text":"] shows that, under some assumptions, weights at each gradient step can be interpreted as samples from a posterior distribution.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Processing of datasets. 
","element":"span"},{"text":"For ","element":"span"},{"text":"GLCB ","element":"span"},{"text":"we require contexts to be in ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] ","element":"span"},{"text":"and rewards to be in ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"a, b","element":"span"},{"text":"] ","element":"span"},{"text":"for a known ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"b","element":"span"},{"text":". To achieve this for Bernoulli bandit tasks (","element":"span"},{"style":{"fontStyle":"italic"},"text":"adult","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"census","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"covertype","element":"span"},{"text":", and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"statlog","element":"span"},{"text":"), let ","element":"span"},{"style":{"height":10.8},"width":199.96,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-0.png","element":"img","alt":" X be a T ×d","inline":true,"padRight":true},{"text":"matrix with each row corresponding to a dataset entry and each column corresponding to a feature. 
We linearly transform each column to the ","element":"span"},{"text":"[0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1] ","element":"span"},{"text":"range, such that ","element":"span"},{"style":{"height":16.79},"width":237.3,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-1.png","element":"img","alt":" min(X.j) = 0","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16.79},"width":237.56,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-2.png","element":"img","alt":"max(X.j) = 1","inline":true,"padRight":true},{"text":"for each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":". Rescaling for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"jester","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"wheel ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"financial ","element":"span"},{"text":"tasks is trivial. We use the default parameters of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"wheel ","element":"span"},{"text":"environment, meaning ","element":"span"},{"style":{"height":11.6},"width":143.28,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-3.png","element":"img","alt":" δ = 0.95","inline":true,"padRight":true},{"text":"as of February 2020.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Further Experimental Results. 
","element":"span"},{"text":"We present the cumulative rewards used for obtaining the rankings (Table 2 of main text) in Table ","element":"span"},{"href":"#id-42","text":"3.","element":"a"}],[{"text":"adult ","element":"span"},{"text":"census ","element":"span"},{"text":"covertype ","element":"span"},{"text":"statlog ","element":"span"},{"text":"financial ","element":"span"},{"text":"jester ","element":"span"},{"text":"wheel ","element":"span"},{"id":"id-42","text":"algorithm","element":"span"}],[{"style":{"width":"100%"},"width":1595,"height":462,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-4.png","element":"img"}],[{"text":"Table 3: Cumulative rewards averaged over ","element":"figcaption","subtype":"caption"},{"text":"500 ","element":"figcaption","subtype":"caption"},{"text":"random environment seeds. Best performing policies per task are shown in bold. ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":31,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-5.png","element":"img","alt":" ±","inline":true,"padRight":true},{"text":"term is the standard error of the mean.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Computing Infrastructure. ","element":"span"},{"text":"All computations are run on single-GPU machines.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"GLCB hyperparameters. ","element":"span"},{"text":"We sample the hyperplane weights used in gating functions uniformly from a unit hypersphere, and biases from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"d/","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"bias scale) ","element":"span"},{"text":"i.i.d. 
where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"is the context dimension. This term is needed to effectively transform context ranges from ","element":"span"},{"style":{"height":17.38},"width":96.7,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-6.png","element":"img","alt":" [0, 1]d","inline":true,"padRight":true},{"text":"to ","element":"span"},{"style":{"height":17.38},"width":207.4,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-7.png","element":"img","alt":" [−1/2, 1/2]d","inline":true},{"text":". We set the GLN weights such that for each unit the weights sum up to 1 and are equal. We decay the learning rate and the switching alpha of GLN via ","element":"span"},{"style":{"height":16},"width":939.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/14-8.png","element":"img","alt":" initial value/(1 + decay rate × Nt−1(a)) where Nt−1(a)","inline":true,"padRight":true},{"text":"is the number of times the given action is taken up until time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t","element":"span"},{"text":". 
We display the hyperparameters we use in the experiments in Table ","element":"span"},{"href":"#id-43","text":"4, ","element":"a"},{"text":"most of which are chosen via grid search.","element":"span"}]]},{"heading":"D List of Notation.","paragraphs":[[{"text":"We provide a partial list of notation in Table ","element":"span"},{"href":"#id-44","text":"5, ","element":"a"},{"text":"covering many of the variables introduced in Section 3 (Gated Linear Contextual Bandits) of the main text.","element":"span"}],[{"id":"id-43","style":{"width":"91%"},"width":1450,"height":438,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-0.png","element":"img"}],[{"text":"Table 4: ","element":"figcaption","subtype":"caption"},{"text":"GLCB ","element":"figcaption","subtype":"caption"},{"text":"hyperparameters used for the experiments.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"32%"},"width":509,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-1.png","element":"img"}],[{"style":{"height":13.19},"width":120.43,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-2.png","element":"img","alt":"K0 − 1","inline":true,"padRight":true},{"text":"Dimension of a context ","element":"span"},{"id":"id-44","style":{"height":17.38},"width":250.34,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-3.png","element":"img","alt":"X ⊆ [0; 1]K0−1","inline":true,"padRight":true},{"text":"A context set ","element":"span"},{"style":{"height":11.6},"width":447.51,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-4.png","element":"img","alt":"x ∈ X A context","inline":true},{"style":{"height":12.4},"width":101.79,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-5.png","element":"img","alt":"a ∈ A","inline":true,"padRight":true},{"text":"Action from finite set of actions 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x, a","element":"span"},{"text":") ","element":"span"},{"text":"True action value = expected reward of action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"in context ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"style":{"height":16},"width":196.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-6.png","element":"img","alt":"ε ∈ (0, 1/2)","inline":true,"padRight":true},{"text":"GLN output clipping parameter ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-7.png","element":"img","alt":"ϵ","inline":true,"padRight":true},{"text":"Empty string ","element":"span"},{"style":{"height":16.93},"width":48,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-8.png","element":"img","alt":"Θta","inline":true,"padRight":true},{"text":"Parameters of ","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":16.93},"width":17,"height":42.32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-9.png","element":"img","alt":"ta","inline":true,"padRight":true},{"text":"GLN","element":"span"},{"style":{"height":16.98},"width":309.4,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-10.png","element":"img","alt":"(x|Θta) ∈ [ε, 1 − ε]","inline":true,"padRight":true},{"text":"GLN used for estimating the reward probability of action ","element":"span"},{"style":{"fontStyle":"italic"},"text":"a ","element":"span"},{"text":"at time ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t 
","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":16.98},"width":294.32,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-11.png","element":"img","alt":"ta : X → [ε, 1 − ε]","inline":true,"padRight":true},{"text":"Equivalent to ","element":"span"},{"text":"GLN","element":"span"},{"style":{"height":16.98},"width":125.12,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-12.png","element":"img","alt":"(x|Θta).","inline":true},{"style":{"height":11.6},"width":109.26,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-13.png","element":"img","alt":"U ∈ N","inline":true,"padRight":true},{"text":"Number of GLN units ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, U","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"Index set for GLN units or gating functions ","element":"span"},{"style":{"height":16},"width":234.8,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-14.png","element":"img","alt":"u = (i, j) ∈ U","inline":true,"padRight":true},{"text":"Index of gating function or GLN unit/neuron ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"in layer ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i S ","element":"span"},{"text":"Number of signatures ","element":"span"},{"style":{"fontStyle":"italic"},"text":"S ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"{","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3 ","element":"span"},{"style":{"fontStyle":"italic"},"text":". . . 
, S","element":"span"},{"style":{"fontStyle":"italic"},"text":"} ","element":"span"},{"text":"Signature space of a gating function ","element":"span"},{"style":{"height":13.19},"width":114.24,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-15.png","element":"img","alt":"su ∈ S","inline":true,"padRight":true},{"text":"Signature of unit ","element":"span"},{"style":{"fontStyle":"italic"},"text":"u ","element":"span"},{"style":{"height":14.18},"width":121,"height":35.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-16.png","element":"img","alt":"s ∈ SU","inline":true,"padRight":true},{"text":"Total signature of all units ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"style":{"height":14},"width":195.29,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-17.png","element":"img","alt":"gu : X → S","inline":true,"padRight":true},{"text":"Gating function for unit ","element":"span"},{"style":{"height":13.59},"width":303.82,"height":33.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-18.png","element":"img","alt":" u of GLNa for all a","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"g ","element":"span"},{"style":{"height":13.79},"width":169.51,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-19.png","element":"img","alt":" : X → SU","inline":true,"padRight":true},{"text":"Gating function applied element-wise to all ","element":"span"},{"style":{"fontStyle":"italic"},"text":"U ","element":"span"},{"style":{"height":16},"width":182.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-20.png","element":"img","alt":"τ/t/T ∈ N","inline":true,"padRight":true},{"text":"Some/current/maximum time step/index 
","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-21.png","element":"img","alt":"⊤","inline":true,"padRight":true},{"text":"Boolean value for True ","element":"span"},{"style":{"height":13.34},"width":434.82,"height":33.36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-22.png","element":"img","alt":"¯t ∈ N ¯t ≡ t − 1","inline":true},{"style":{"height":16.98},"width":197.62,"height":42.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11611/images/15-23.png","element":"img","alt":"x