Regret bounds for Narendra-Shapiro bandit algorithms

2015·Arxiv

Abstract

Abstract

Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s which have been deeply studied in infinite horizon but for which scarce non-asymptotic results exist. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain some uniform explicit bounds for the regret of (over)-penalized-NS algorithms.

We also extend to the multi-armed case some convergence properties of penalized-NS algorithms towards a stationary Piecewise Deterministic Markov Process (PDMP). Finally, we establish some new sharp mixing bounds for these processes.

Keywords: Regret, Stochastic Bandit Algorithms, Piecewise Deterministic Markov Processes

1 Introduction

The so-called Narendra-Shapiro bandit algorithm (referred to as NSa) was introduced in [19] and developed in [18] as a linear learning automata. This algorithm has been primarily considered by the probabilistic community as an interesting benchmark of stochastic algorithm. More precisely, NSa is an example of recursive (non-homogeneous) Markovian algorithm, topic whose almost complete historical overview may be found in the seminal contributions of [11] and [13].

NSa belongs to the large class of bandit-type policies whose principle may be sketched as follows: a d-armed bandit algorithm is a procedure designed to determine which one, among d sources, is the most profitable without spending too much time on the wrong ones. In the simplest case, the sources (or arms) randomly provide some rewards whose values belong to {0; 1} with Bernoulli laws. The associated probabilities of success () are unknown to the player and his goal is to determine the most efficient source, i.e. the highest probability of success.

Let us now remind a rigorous definition of admissible sequential policies. We consider d independent sequences (of i.i.d. Bernoulli random variables ). Each represents the reward associated with the arm i at time n. We then consider some sequential predictions where at each stage n a forecaster chooses an arm , receives a reward and then uses this information to choose the next arm at step n + 1. As introduced in the pioneering work [20], the rewards are sampled independently of a fixed product distribution at each step n. The innovations here at time n are provided by () and we are naturally led to introduce the filtration ())). In the following, the sequential admissible policies will be a ((inhomogeneous) Markov chain. We also define another filtration by adding all the events before step n and observe that ( ¯)))To sum-up, ¯contains all the results of each arm between time 1 and n although only provides partial information about the tested arms.

In this paper, we focus on the stochastic NSa whose principle is very simple: it consists in sampling one arm according to a probability distribution on {1, . . . , d}, and in modifying this probability distribution in terms of the reward obtained with the chosen arm. From this point of view, this algorithm bears similarities with the EXP3 algorithm (and many of its variants) introduced in [5]. Among other close bandit algorithms, one can also cite the Thompson Sampling strategy where the random selection of the arm is based on a Bayesian posterior which is updated after each result. We refer to [1] for a recent theoretical contribution on this algorithm.

Instead of sampling one arm sequentially according to a randomized decision, other algorithms define their policy through a deterministic maximization procedure at each iteration. Among them, we can mention the UCB algorithm [4] and its derivatives (including MOSS [2] and KLUCB [9]), whose dynamics are dictated by an appropriate empirical upper confidence bound of the estimated best performance.

Let us now present the NSa algorithm. In fact, we will distinguish two types of NSa: crude-NSa and penalized-NSa. Before going further, let us recall their mechanism in the case of d = 2 (the general case will be introduced in Section 2). Designating as the probability of drawing arm 1 at step n and (as a decreasing sequence of positive numbers that tends to 0 when n goes to infinity, crude-NS is recursively defined by:

Note that the construction is certainly symmetric, i.e., 1(which corresponds to the probability of drawing arm 2) has a symmetric dynamics. The long-time behavior of some NSa was extensively investigated in the last decade. To name a few, in [17] and [15], some convergence and rate of convergence results are proved. However, these results strongly depend on both () and the probabilities of success of the arms. In order to get rid of these constraints, the authors then introduced in [16] a penalized NSa and proved that this method is an efficient distribution-free procedure, meaning that it unconditionally converges to the best arm on the unknown probabilities and . The idea of the penalized-NS algorithm is to also take the failures of the player into account and to reduce the probability of drawing the tested arm when it loses. Designating (as a second positive sequence, the dynamics of the penalized NSa is given by :

Performances of bandit algorithms. In view of potential applications, it is certainly important to have some informations about the performances of the used policies. To this end, one first needs to define what is a “good” sequencial algorithm. The primary efficiency requirement is the ability of the algorithm to asymptotically recover the best arm. In [16], this property is referred to as the infallibility of the algorithm. If without loss of generality, the first arm is assumed to be the best, (i.e. that max) and if denotes the probability of drawing arm 1, the algorithm is said to be infallible if

An alternative way for describing the efficiency of a method is to consider the behaviour of the cumulative reward obtained between time 1 and n:

In particular, in the old paper [20], Robbins is looking for algorithms such that

This last property is weaker than the infallibility of an algorithm since the Lebesgue theorem associated to (3) implies the convergence above.

A much stronger requirement involves the regret of the algorithm. The regret measures the gap between the cumulative reward of the best player and the one induced by the policy. The regret is the ¯-measurable random variable defined as:

A good strategy corresponds to a selection procedure that minimizes the expected regret , optimal ones being referred to as minimax strategies.

The former expected regret cannot be easily handled and is generally replaced in statistical analysis by the pseudo-regret defined as

Since can also be written as

A low pseudo-regret property then means that the quantity

has to be small, in particular sub-linear with n. The quantities and ¯are closely related and it is reasonable to study the pseudo-regret instead of the true regret, owing to the next proposition:

Proposition 1.1. (i) For any (-measurable strategy, we obtain after n plays:

(ii) Furthermore, for every integer n and d and for any (admissible) strategy,

We refer to Proposition 34 of [3] for a detailed proof of (i) and to Theorem 5.1 of [5] for (ii). As mentioned in (ii), the bounds are distribution-free (uniform in Since the MOSS method of [2] satisfies ¯25, (i) and (ii) show that a non-asymptotic distribution-free minimax rate is on the order of .

In particular, a fallible algorithm (meaning that 1) < 1) necessarily generates a linear regret and is not optimal. For example, in the case in terms of () is as follows:

Objectives. In this paper, we therefore propose to focus on the regret and to answer to the question “Are NSa competitive from a regret viewpoint? In the case of positive answer, what are the associated upper-bounds ?”

Due to some too restrictive conditions of infallibility, it will be seen that the crude-NSa cannot be competitive from a regret point of view. As mentioned before, the penalized NSa is more robust and is a priori more appropriate for this problem. More precisely, the penalty induces more balance between exploration and exploitation, i.e. between playing the best arm (the one in terms of the past actions) and exploring new options (playing the suboptimal arms). In this paper, we are going to prove that, up to a slight reinforcement, it is possible to obtain some competitive bounds for the regret of this procedure. The slightly modified penalized algorithm will be referred to as the over-penalized-algorithm below.

Outline. The paper is organized as follows : Section 2.1 provides some basic information about the crude NSa. Then, in Section 2.2, after some background on the penalized Nsa, we introduce a new algorithm called over-penalized NSa.

Section 3 is devoted to the main results: in Theorem 3.2, we establish an upper-bound of the pseudo-regret ¯for the over-penalized algorithm in the two-armed case and also show a weaker result for the penalized NSa.

In this section, we also extend to the multi-armed case some existing convergence and rate of convergence results of the two-armed algorithm. In the “critical” case (see below for details), the normalized algorithm converges in distribution toward a PDMP (Piecewise Deterministic Markov Process). We develop a careful study of its ergodicity and bounds on the rate of convergence to equilibrium are established. It uses a non-trivial coupling strategy to derive explicit rates of convergence in Wasserstein and total variation distance. The dependence of these rates are made explicit with the several parameters of the initial Bandit problem.

The rest of the paper is devoted to the proofs of the main results: Section 4 is dedicated to the regret analysis, and Section 5 establishes the weak limit of the rescaled multi-armed bandit algorithm. Finally, Section 6 includes all the proofs of the ergodic rates.

2 Deﬁnitions of the NS algorithms

2.1 Crude NSa and regret

The crude NSa (1) is rather simple: it defines a (Markov chain (and is a random variable satisfying:

The arm is selected at step n + 1 with the current distribution () and is evaluated. In the event of success, the weight of the arm is increased and the weight of the other arm is decreased by the same quantity. The algorithm can be rewritten in a more concise form as:

The arm i at step n succeeds with the probability that so that the arm 1 is the optimal one. As pointed in (6), we obtain that

This formula is important regarding the fallibility of an algorithm. In particular, it is shown in [15] that for any choice + 1)with (0, 1) and C > 0 or + 1) with C > 1, the NSa (7) may be fallible: some parameters () exist such that (a.s. converges to a binary random variable with 0. In this situation, for large enough n, we have:

It can easily be concluded that this method cannot induce a competitive policy since some “bad” values of the probabilities () generate a linear regret.

2.2 Penalized and over-penalized two-armed NSa

Penalized NSa. A major difference between the crude NSa and its penalized counterpart introduced in [16] relies on the exploitation of the failure of the selected arms. The crude NSa (1) only uses the sequence of successes to update the probability distribution () since the value of is modified generated by a potential failure of the arm . More precisely, in the event of success of the selected arm , this penalized NSa mimics the crude NSa, whereas in the case of failure, the weight of the selected arm is now multiplied (and thus decreased) by a factor (1 ) (whereas the probability of drawing the other arm is increased by the corresponding quantity). For the penalized NSa, the update formula of (can be written in the following way:

Over-penalized NSa. In view of the minimization of the regret, we will show that it may be useful to reinforce the penalization. For this purpose, we introduce a slightly “over-penalized” NSa where a player is also (slightly) penalized if it wins:

• If player 1 wins, then with probability 1 it is penalized by a factor .

• If player 2 wins, then with probability 1arm 1 is increased by a factor of ).

The over-penalized-NSa can be written as follows

where (is a sequence of i.i.d. r.v. with a Bernoulli distribution ), meaning that . Moreover, these r.v. are independent of (and in such a way that for all , and are also independent. It should be noted that

In fact, this slight over-penalization of the successful arm (with probability ) can be viewed as an additional statistical excitation which helps the stochastic algorithm to escape from local traps. The case penalized when it plays. In particular, this modification implies that the increment of is slightly weaker than in the previous case when the selected arm wins.

Asymptotic convergence of the penalized NSa. Before stating the main results, we need to understand which regret ¯could be reached by penalized and over-penalized NSa. We recall (in a slightly less general form) the convergence results of Proposition 3, Theorems 3 and 4 of [16].

Theorem 2.1 (Lamberton & Pages, [16])(0(0, 1)(8).

distribution and is the stationary distribution of the PDMP whose generator L acts on ) as

In view of Theorem 2.1, we can use formula (6) to obtain

We then obtain the key observation

where C is a constant that may depend on and . According to Theorem 2.1, it seems that the potential optimal choice corresponds to the one of (iii). Indeed, the infallibility occurs only when and 1 and Equation (10) suggests that should be chosen as large as possible to minimize the r.h.s. of (11), leading to 2. This is why in the following, we will focus on the case:

2.3 Over-penalized multi-armed NSa

We generalize the definition of the penalized and over-penalized NSa to the d-armed case, with 2. Let (0, 1)and assume that ) (the probability of success of arm i). The over-penalized NSa recursively defines a sequence of probability measures on {1, . . . , d} denoted by (Πwhere Π). At step n, the arm is sampled according to the discrete distribution and tcrthen tested through the computation of . Setting , the multi-armed NSa is defined by:

In contrast with the two-armed case, we have to choose how to distribute the penalty to the other arms when d > 2. The (natural) choice in (13) is to divide it fairly, i.e., to spread it uniformly over the other arms. Note that alternative algorithms (not studied here) could be considered.

3 Main Results

3.1 Regret of the over-penalized two-armed bandit

First, we provide some uniform upper-bounds for the two-armed -over-penalized NSa . Our main result is Theorem 3.2. Before stating it, we choose to state a new result when for the “original” penalized NSa introduced in [16].

Theorem 3.1. Let (be the two-armed penalized NSa defined by (8) with (defined by (12) with ((0, 1). Then, for every (0, 1), a positive exists such that:

Remark 3.1. The upper bound of the original penalized-NS algorithm is not completely uniform. From a theoretical point of view, there is not enough penalty when is too large, which in turn generates a deficiency of the mean-reverting effect for the sequence ((1 when is close to 0. In other words, the trap of the stochastic algorithm near 0 is not enough repulsive and Figure 1 below shows that this problem also appears numerically and suggests a logarithmic explosion of sup

This explains the interest of the over-penalization, illustrated by the next result, which is the main theorem of the paper.

Theorem 3.2. Let (be the two-armed -over-penalized NSa defined by (9) with [0, 1) and (defined by (12) with ((0, 1). Then,

(a) A ) exists such that:

(b) Furthermore, the choice yields

Remark 3.2. At the price of technicalities, could be made explicit in terms of and for every 0. The second bound is obtained by an optimization of ) (see (38) and below).

Figure 1: Evolution of supfor the over-penalized algorithm (with and comparison with EXP3 and KL-UCB.

Figure 1 presents on the left side a numerical approximation of supfor the penalized and over-penalized algorithms. The continuous curves indicate that the upper bound 312 in Theorem 3.2 is not sharp since the over-penalized NSa satisfies a uniform upper-bound on the order of 0. This bound is obtained with a small (as pointed in Theorem 3.2), and (red line in Figure 1 (left)), suggesting that the rewards should always be over-penalized with .

The right-hand side of Figure 1 focuses on the behavior of the regret with . The map (sup

performances for the cumulative regret, we should use a low value of between 0 and 3/5. The importance of this choice of seems relative since the behaviour of the over-penalized bandit is

Figure 2: Evolution of supfor the over-penalized algorithm (with