Online Learning for Active Cache Synchronization

Existing multi-armed bandit (MAB) models make two implicit assumptions: an arm generates a payoff only when it is played, and the agent observes every payoff that is generated. This paper introduces SYNCHRONIZATION BANDITS, a MAB variant where all arms generate costs at all times, but the agent observes an arm’s instantaneous cost only when the arm is played. SYNCHRONIZATION MABs are inspired by online caching scenarios such as Web crawling, where an arm corresponds to a cached item and playing the arm means downloading its fresh copy from a server. We present MIRRORSYNC, an online learning algorithm for

SYNCHRONIZATION BANDITS, establish an adversarial regret of  O(T 2/3)for it, and show how to make it practical.

Multi-armed bandits (MAB) (Robbins, 1952) have been widely applied in settings where an agent repeatedly faces K choices (arms), each associated with its own payoff distribution unknown to the agent at the start, and needs to eventually identify the arm with the highest mean payoff by pulling a subset of arms at a time and observing a payoff sampled from their distributions. MABs’ defining property is that the agent observes an arm’s instantaneous payoff when and only when the agent plays it. A key hidden assumption that goes hand-in-hand with it in the existing bandit models is that each arm generates reward when and only when it is played, which, combined with the bandit feedback property, also implies that the agent observes all generated payoffs.

In this paper, we go beyond these seemingly fundamental assumptions by identifying a class of practical settings that violate them and analyzing it using online learning theory. Specifically, this paper formalizes scenarios that we call

1Microsoft Research, Redmond 2Google Research, Berlin. Work on this paper partially done during a visit to Microsoft Research, Redmond. Correspondence to: Andrey Kolobov <akolobov@microsoft.com>.

Proceedings of the  37 th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

SYNCHRONIZATION MABs. In these settings, the agent can be thought of as holding copies of K files whose originals come from different remote sources. As time goes by, the files change at the sources, and their copies increasingly differ from the originals, becoming stale. The agent’s task is to refresh these files by occasionally downloading their new copies from remote sources, under a constraint B on the average number of downloads per time unit.

For each file, the agent is continually penalized for its staleness. The expected penalty at each time step due to this file is a non-decreasing function of the time since the file’s last refresh. Playing arm k here corresponds to refreshing file k: doing so temporarily reduces its staleness and thereby diminishes the cost incurred due to it per time unit. The goal is to find a synchronization policy that minimizes regret in terms of the average staleness penalties by refreshing files according to a well-chosen schedule.

Crucially, at any moment the agent doesn’t know how outdated its copy of a given file is, except at the time when it downloads its fresh copy, and therefore most of the time doesn’t know the penalties it is incurring. It observes the penalty only when it plays an arm, i.e., refreshes a file, and has a chance to see how different the cached copy was right before the refresh. Even this action reveals only the instantaneous penalty due to this file, not the cumulative penalty the file has brought on since its last refresh.

SYNCHRONIZATION MABs are inspired by problems such as web crawl scheduling (Wolf et al., 2002; Cho & Garcia- Molina, 2003a; Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020) and database update management (Gal & Eckstein, 2001; Bright et al., 2006). All these settings involve a cache that must proactively initiate downloads to refresh its content. This is in contrast to, e.g., Web browser caches that passively monitor a stream of download requests initiated by another program. The few existing works on policy learning for active caching (Kolobov et al., 2019a; Upadhyay et al., 2020) apply only to specific penalty functions. In contrast, the theoretical results in this paper are independent of the penalties’ functional form, and come with a practical online learning strategy for this model.

High-level analysis idea and paper outline. Online learning theory is a powerful tool for analyzing decision-making models where an agent operates in discrete-time instantaneous rounds by playing a candidate solution (arm), imme- diately getting a feedback on it (a sample from the arm’s payoff distribution), and using it to choose a candidate solution for the next round. Unfortunately, online learning’s traditional assumptions clash with the properties of our setting. As Section 2 describes, SYNCHRONIZATION MAB is a continuous-time model with non-stationary sparsely observable costs. Its candidate solutions are multi-arm policies (Section 3). Getting useful feedback on a policy, such as an estimate of its cost function or gradient, isn’t instantaneous; it requires playing the policy for a non-trivial stretch of time.

In Section 4, we present the MIRRORSYNC algorithm, which continuously plays a candidate policy along with “exploratory” arm pulls, periodically updating it with online mirror descent (Nemirovsky & Yudin, 1983; Bubeck, 2016). It uses a novel unbiased policy gradient estimator that operates in the face of sparse policy cost observations. Our regret analysis of MIRRORSYNC in Section 5 critically relies on the convexity of policy cost functions – a property we derive in Section 3 from minimal assumptions on SYNCHRO-

NIZATION MABs’ payoffs. The regret analysis treats time intervals between MIRRORSYNC’s policy updates as learning “rounds” and thereby brings online learning theory to bear on SYNCHRONIZATION BANDITS. Section 6 introduces ASYNCMIRRORSYNC, a practical MIRRORSYNC variant that lifts MIRRORSYNC’s idealizing assumptions. In Section 8, we compare the two algorithm empirically.

The contributions of this paper are thus as follows:

(1) We cast active caching as an online learning problem with sparse feedback, enabling principled theoretical analysis of this setting under a variety of payoff distributions.

(2) Based on this formulation, we propose a theoretic strategy for active caching under unknown payoff distributions and derive an adversarial regret bound of  O(T 2/3)for it. In doing so, we overcome the challenges of sparse and temporal feedback inherent in this scenario that existing online learning theory does not address.

(3) We present a practical variant of the above strategy that lifts the latter’s assumptions and, as experiments demonstrate, has the same empirical convergence rate.

SYNCHRONIZATION BANDITS are a continuous-time MAB model with K arms. Other than operating in continuous-time it differs from existing MAB formalisms in the mechanism by which arms generate costs/rewards and the observability of the generated costs from the agent’s standpoint. In this section we detail both of these aspects, using the aforementioned cache update scenario as an illustrating example.

Cost-generating processes. In SYNCHRONIZATION MABs, every arm k incurs a stochastically generated cost  ˆck,tat every time instant t, whether the arm is played or not. How- ever, the distribution of arm k’s possible instantaneous costs at time t depends on how much time has passed since the last time arm k was played to refresh the corresponding cached item. We denote the length of this time interval as τk(t) ∈ [0, ∞). Thus, arm k’s cost generation process is described by a family of random variables  {ck(τk(t))|t ≥ 0}as stated in the following assumption:

Assumption 1. (Cost generation) At time t, each arm incurs a cost independently of being played by the agent. The instantaneous cost  ˆck,tdue to arm k at time t is sampled from a random variable  ck,t = ck(τk(t))s.t.: (1)  τk(0) ≜0; (2)  ck(0) = DiracDelta(0); (3) there exists a bound U < ∞s.t.  supp(ck(τ)) ⊆ [0, U]for every arm’s cost generation process  ckand any time interval length  τ ≥ 0.

By this assumption, for every arm and any amount of time τsince its latest play, its cost expectation is well-defined:


Agent’s knowledge and cost observability. While costs are generated by all arms continually, in our model the agent doesn’t observe most of them, with an important exception: Assumption 2. (Cost knowledge and observability) For each arm k, the agent observes a cost  ˆck,t ∼ ck,tat time t if and only if the agent plays arm k at that time. The agent doesn’t know the distributions of random variables  ck,t.

Assumption 2 is crucial in two ways. First, it means that our model provides only bandit feedback. Namely, the agent doesn’t see arms’ costs at all times, unlike in related models such as maintenance scheduling (Bar-Noy et al., 1998). Second, coupled with Assumption 1 it implies that there is no causal relationship between playing an arm and incurring a cost, which is an implicit assumption that standard bandit strategies rely on.

Arm play modes. At any time t, any of SYNCHRONIZATION MAB’s arms can be played in one of two modes:

Sync mode. Playing arm k in this mode at time t resets the arm’s state, i.e., sets  τk(t) ←� 0. In addition, per Assumption 2, the agent observes the arm’s instantaneous cost sample ˆck,timmediately before  τk(t)is reset to 0.

In the case of a cache, this means downloading a fresh copy of file k, estimating the difference between  k’s currentoriginal and the cached copy, and overwriting the cached copy with the new one.

Probe mode. By playing arm k allows in probe mode, the agent observes the arm’s instantaneous cost, but the arm’s state  τk(t)is not reset.

In caching settings, this corresponds to downloading a fresh copy of item k, but using it purely to estimate the difference between k’s current original and the cached copy, without overwriting the cached copy.

Since, by Assumption 1,  ck(0) = 0, playing an arm in sync mode gives the agent a way to temporarily reduce the expected rate at which the arm incurs costs. However, due to the following assumption, after a sync play the arm’s cost generation rate starts growing again:

Assumption 3. (Cost monotonicity) For every arm k, the means  ck(τk(t))of instantaneous cost random variables ck(τk(t))are non-decreasing in time since the latest sync-mode play  τk(t). If arm k was played in sync mode at time  t0, then any sequence of arm k’s cost observations ˆck,1, ˆck,2, . . .yielded by probe-mode plays after  t0 and untilthis arm’s next sync-mode play at time  t′0 is non-decreasing.

Arm state  τk(t)can be viewed as the amount of time that has passed since the arm’s last sync by time point t; the more time has passed, the more cost the arm is incurring per time unit. Playing an arm in sync mode simply resets this time counter. Thus, according to Assumption 3, not only does the total cost generated by arm k since its previous sync play grow as time goes by – which is to be expected – but so does the rate at which it happens.

Note that probe-mode arm plays don’t help the agent reduce running costs directly. Instead, as we show in Section 4, they help the agent learn a good arm-playing policy faster.

Example. All of the above assumptions are natural in realworld scenarios that inspired the SYNCHRONIZATION model. For instance, in Web crawling each online web page accumulates changes according to a temporal process  Dk(t), whichis widely assumed to be Poisson (i.e., memoryless) in the Web crawling literature (Wolf et al., 2002; Cho & Ntoulas, 2002; Cho & Garcia-Molina, 2003a;b; Azar et al., 2018; Kolobov et al., 2019a;b; Upadhyay et al., 2020). For each indexed page, the agent (the search engine) incurs a cost Ck(d)due to serving outdated search results, as a function of the total difference d between the indexed page copy and the online original. From this perspective,  ck(τ) ≜ Ck(Dk(τ)),but at least one other approach models  ck(τ)directly as a function of a web page copy’s age (Cho & Garcia-Molina, 2000). In either case, Assumption 3 holds: the more time passes since the page’s last crawl, the higher the expected instantaneous penalty. Moreover, penalties don’t decrease between two successive crawls of a page: e.g., in case changes are generated by a Poisson process, their number can only grow with time since last refresh, and so can the penalty.

In order to derive a learning strategy for SYNCHRONIZATION BANDITS (Section 4) and its regret analysis (Section 5), we first derive the necessary building blocks: the cost of an arbitrary policy for this model, the class of policies that will serve as our algorithm’s hypothesis space, and parameterized cost functions for policies of this class.

Policy costs. Our high-level aim is finding a SYNCHRONIZA- TION policy  πthat has a low expected average cost over an infinite time horizon. Whether a policy  πis historydependent, stochastic, or neither, executing it produces a schedule  σk = ((t1, l1), (t2, l2), . . .)for each arm k, a possibly infinite sequence of time points  tnkwhen the arm is to be played and corresponding labels  lnkspecifying whether the arm should be played in probe or sync mode at that time. For convenience, WLOG assume that  t0always refers to t = 0, let  τnk ≜ tnk − tnk−1, and for any finite horizon H let  Nk(H)be the index of schedule  σk’s largest time point not exceeding H:


Given this definition, let  tNk(H)+1 ≜ H.

Recalling that each arm has a specific time-dependent cost distribution  ck(τk(t))with mean  ck(τk(t)), we define the average infinite-horizon cost  Jσkk of arm k’s schedule  σk as




we can rewrite  Jσkk ’s definition as


Here,  Ck(τnk)is the total cost that arm k is expected to incur between  (nk − 1)-th and  nk-th plays according to schedule  σk. Thus,  Jσkkis just the average of these costs over the entire schedule. If the schedule stops playing arm k forever after some time  t, Jσkkmay be infinite.

Running a policy  πamounts to sampling a joint schedule σ = {σk}Kk=1. Therefore, we define policy cost  Jπ as


Target policy class. Instead of considering all possible SYN- CHRONIZATION policies as potential solutions, in this paper we focus on those whose sync-mode plays are periodic, with equal gaps between every two consecutive such plays of a given arm. For arm k, we denote the length of these gaps as 1/rk > 0length,  rkbeing a policy parameter for this arm and  r ≜ (rk)Kk=1being the joint parameter vector for all arms. Importantly, our policies do allow probe-mode arm plays but don’t restrict how the time points for these plays are chosen. In particular, they may be chosen stochastically, as long the timings of sync-mode plays are deterministically periodic.

Formally, for a scheduled arm pull time point  tnk in sched-ule  σk, let  NextSyncσk(tnk)be the next sync-mode play of arm k in  σk, i.e.,  tn′k = NextSyncσk(tnk)if  tnk =min{tn′′k | n′′k > nk, (tn′′k , ln′′k ) ∈ σk, ln′′k = sync}. Then our target policy class is


Parameters r can be interpreted as rates at which arms are played in sync mode. For  π ∈ Π, policy costs are uniquely determined by sync-mode play rates r: although these parameters ignore the timing of probe plays, probe plays don’t affect cost generation and therefore policy cost.

We let J(r) denote  π(r) ∈ Π’s policy cost. Equation 5 implies that its cost functions J(r) have a special form critical for our regret analysis in Section 5 – they are convex:

Lemma 1. For any policy  π(r) ∈ Π,


and monotonically decreasing for r > 0.

Proof. See the Appendix. ■

The convexity of the cost functions plays a crucial role in obtaining the regret bounds (Section 5) for the policy learning algorithm presented in Section 4.

Policy constraints. Naturally, we would like to find a π(r) ∈ Πthat minimizes J(r) (Equation 7). As described, however,  Πhas no such policy: note that  limr→∞ J(r) =0, but no finite r attains this limit. However, in practical applications the rates  rkcannot be arbitrarily high individually or in aggregate, and are subject to several constraints. Therefore, in this paper we regard feasible  rkas bounded from above for all k by a universal bound  rmax. Moreover, we assume that the sum of all arms’ play rates may not exceed some value B > 0. E.g., in Web crawling B is commonly interpreted as a limit on crawl rate imposed by physical network infrastructure (Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020). Last but not least, valid  rkvalues may not be 0, since this implies never playing this arm after a certain time point. In applications such as Web crawling, this means abandoning a cached item (e.g., an indexed webpage) to grow arbitrarily stale, which is unacceptable, so we impose a minimum sync-mode play rate  rminon every arm. Note that since, by Assumption 1, every  ck(τ)is bounded for  τ ≥ 0, every Jk(rk) (Lemma 1) is bounded as well.

Policy optimization. Thus, if we knew cost processes  ck(.),we could use them to compute  Ck(.)for all arms and would face the following optimization problem:

Problem 1 (SYNCHRONIZATION BANDIT instance).


subject to:  r ∈ K ≜�r′ ∈ [rmin, rmax]K | ||r′||1 = B�

Notice that this formulation implicitly assumes that the entire bandwidth B will be used for sync-mode arm plays – indeed, if the model is known, there is no need for probes.

As a side note, we remark that the class of periodic policies Πdoesn’t restrict Problem 1’s solution quality compared to broader the class of deterministic open-loop policies. We state it here informally as a proposition, which we reformulate more precisely and prove in the Appendix A:

Proposition 1. For a given K-armed SYNCHRONIZATION BANDIT instance (Problem 1), the optimal periodic policy π∗ ∈ Πhas the same cost  J∗as the optimal general deterministic open-loop policy under the same constraints.

In reality we don’t know the cost generation processes and can’t solve the above optimization problem directly. Instead, we adopt an online perspective on SYNCHRONIZATION bandits. A key contribution of this paper that we present in this section is MIRRORSYNC (Algorithm 1), an algorithm that treats a SYNCHRONIZATION MAB as an online learning problem. A MIRRORSYNC agent can be viewed operates in rounds  T = 1, 2, . . . Tmax, in each round “playing” a candidate policy parameter vector  rT, suffering a “loss” ˆJT ∼ JT (rT ), and updating  rTto a new vector  rT +1as a result. As we show in Section 5, MIRRORSYNC has an adversarial regret of  O(T 2/3max).

MIRRORSYNC’s novelty is due to the fact that, despite super-ficial similarities to standard online learning, our setting is different from it in crucial ways, and MIRRORSYNC circumvents these differences:

(1) Although the agent can be viewed as suffering loss ˆJT , itdoesn’t observe this loss. By SYNCHRONIZATION MAB’s assumptions, it observes only samples of instantaneous costs ck(.), and only when it plays arms. Existing techniques don’t offer a way to estimate the gradient  ∇Jin this case. Moreover, even these impoverished observations take real- world time to collect. (2) In online learning, regret analysis normally assumes  ∇Jto be bounded. This isn’t quite the case in our model. While we could assume a bound on ∇Jlinear in  1/rmin, it would be detrimental to the regret bound when  1/rminis large. We show how MIRRORSYNC addresses challenge (1) in this section, and circumvent challenge (2) in Section 5.

A note on infinite vs. finite-horizon policies. The policy cost functions we derived in Section 3 describe the steady- state performance of a policy over an infinite time horizon. However, MIRRORSYNC’s rounds are finite. Thus, the cost function J (Equation 7) that MIRRORSYNC uses as a basis for policy improvement is a proxy measure for the average costs that running MIRRORSYNC incurs in each round.

MIRRORSYNC operation. At a high level, MIRRORSYNC’s main insight is allocating a small fraction of available band- width B, determined by input parameter  ε(Algorithm 1), to probe-mode arm plays, while using the bulk� 11+ε�B ofthe bandwidth (line 2) to play in sync mode according to the current rates r. MIRRORSYNC uses instantaneous cost samples obtained from both to estimate the gradient  ∇J(lines 11 - 19) by individually estimating its partial derivatives (lines 23-25), which we denote as


for short. At the end of each epoch, it does online mirror descent on these estimates to get a new sync-mode policy r (lines 20, 42-43).

In more detail, in the spirit of online learning, MIRRORSYNC assumes that at the start of each round all arms’ cost gen- eration processes have just been reset to  ck(0), and restarts the time counter at t = 0 for every arm (line 9). (In prac- tice, this assumption is unrealistic, and we lift it in another variant of MIRRORSYNC in Section 6.) Then, for every arm k, it schedules sync-mode plays at intervals  1/rk (lines 29,37-38) until the end of the current round, and attempts to insert one probe-mode play into each such interval (lines 33-36) independently with probability  ε (line 32), at a point chosen uniformly at random over the interval’s length (line 34). Executing the constructed schedule (line 13) yields cost samples that are used in the aforementioned gradient estimation, which, crucially, is unbiased:

Lemma 2. For a rate vector  r = (rk)Kk=1 and a probability ε, suppose the agent plays each arm in sync mode  1/rktime after that arm’s previous sync-mode play, observing a sample of instantaneous cost  ˆck ∼ ck(1/rk). Suppose also that in addition, with probability  εindependently for each arm k, the agent plays arm k in probe mode at time t ∼ Uniform[0, 1/rk]after that arm’s previous sync-mode play, observing a sample of instantaneous cost  ˆc(ε)k ∼ ck(t).

Then for each k,


is an unbiased estimator of  ∂kJ(rk).

Proof. See the Appendix. ■

In one round, MIRRORSYNC may get several gradient estimates for a given arm, in which case it takes the first one (line 18). To get a regret bound, however, it is crucial to ensure that for each arm MIRRORSYNC receives at least one such estimate per round, even if the estimate is 0 (line 18). This is why we set the length of each round to be  1/rmin(line 1) — the largest value  1/rkand hence the longest time that MIRRORSYNC may have to wait in order to get a cost sample at  1/rk.

We generalize our stochastic setting to an adversarial problem and prove an adversarial regret bound of order  O�T23�.This means that the cost distributions  ckand all derived quantities (¯ck, ¯Ck, Jk) need not be non-stationary from one round to the next. The cost distributions and derived functions at round T are denoted by  cTk,  ¯cTk,  ¯CTk,  JTkand can be chosen by an oblivious adversary ahead of time.

Regret. We define the regret with respect to a fixed schedule


where the expectation is over the randomness of observed costs  ˆckand  rTis the choice of the algorithm in round T . Our goal is to compete with the best possible schedule  r∗using the full available budget:


where  K0 is Kε (line 2of Algorithm 1) with  ε = 0.

Since we are not be able to obtain any information on the function value or gradient of  JT without an allocated exploration, we also define the best possible schedule  r∗ε given anexploration constrained by  ε (lines 32 - 34of Algorithm 1):


The regret can be decomposed into


which we bound separately.

In-policy regret. The problem is an instance of online learning, but online learning literature typically assumes that the gradients of the objective functions  ∇JT are uniformly bounded w.r.t. some norm. Our setting differs in a key aspect: the gradients  ∂kJ(r)scale proportionally to  r−1k .

A naive solution would be to bound  ∂kJ(r) ≲ r−1minand use gradient descent. However, the regret would scale with r−1min, which might be prohibitively large.

We show that mirror descent with a carefully chosen potential, namely the log barrier  F(r) = �Kk=1 log(rk), adapts to the geometry of the gradients and replaces the polynomial dependency on  r−1min by a logdependency.

Theorem 5.1. For any sequence of convex functions (JT )TmaxT =1and learning rate  0 < η < Kε2U, the in-policy regret of MIRRORSYNC is bounded by


Proof. See the Appendix. ■

Exploration Gap. We show that the exploration gap scales proportionally to  εand is independent of  r−1min. On first sight, this bound is surprising because the exploration gap should be approximately  ⟨�TmaxT =1 ∇JT (r∗), r∗ − r∗ε⟩ andthe gradients  ∇JTk (r∗)could be unbounded (i.e. of order r−1min). The high-level idea behind the following lemma is the observation that at the optimal point  r∗, the gradients in all coordinates must coincide and hence the gradient cannot be of order  r−1min even if r∗k = rmin.

Lemma 3. The exploration gap is bounded by


Proof. See the Appendix. ■

Finally we are ready to present the main result.

Corollary 1. The regret of MIRRORSYNC with  η =


bounded for any  Tmax > 8 log� BrminK�by


Proof. The choice of  η, εand the bound on  Tmaxensure that we can apply Theorem 5.1 to bound the in-policy regret. The in-policy regret simplifies to


Adding the exploration gap from Lemma 3 and substituting the value for  εcompletes the proof. ■

Although MIRRORSYNC lends itself to theoretical analysis, several design choices make its vanilla version impractical. (1) MIRRORSYNC assumes that all arms are synchronized “for free” at the start of each round so that each round starts in the same “state”, which is unrealistic. (2) MIRRORSYNC waits until the end of each  1/rmin-long round to update all arms’ play rates simultaneously, which could be months in applications like Web crawling. (3) MIRRORSYNC’s further source of inefficiency is that even if arm k has produced several  ∂kJ(rk)samples in a given round, MIRRORSYNC uses only one of them. ASYNCMIRRORSYNC (Algorithm 2), which can be viewed as a practical implementation of MIRRORSYNC, mitigates these weaknesses of MIRRORSYNC.

In contrast to MIRRORSYNC, which performs updates in rigidly defined rounds, ASYNCMIRRORSYNC updates policy according to a user-specified schedule S (see Algorithm 2’s inputs). The length of inter-update periods is unimportant for ASYNCMIRRORSYNC, unlike for MIRRORSYNC (line 1, Alg. 1), due to a major difference between the two algorithms. MIRRORSYNC aims to update all arms’ parameters synchronously at the end of each round and makes the rounds very long to guarantee that each arm has generated at least one gradient estimate by the end of each round. In the meantime, ASYNCMIRRORSYNC does updates asynchronously, performing mirror descent at an update time t(upd)i ∈ Sonly on those arms that happen to have generated at least one new gradient sample since the previous update time  t(upd)i−1 (lines 26-33). ASYNCMIRRORSYNC doesthese local updates while respecting the global constraint B by using the sum of current play rates or arms that are about to be updated as a local constraint (line 32). Thus, ASYNCMIRRORSYNC doesn’t need to make inter-update intervals excessively long and doesn’t suffer from issue (1).

As a side note, the reason MIRRORSYNC’s regret bound in Corollary 1 has no linear dependence on  1/rminis because it characterizes regret in terms of the number of rounds, not wall-clock time. Nonetheless, this dependency matters empirically, and obtaining a wall-clock-time regret bound that is free from it is an interesting theoretical problem.

ASYNCMIRRORSYNC’s asynchronous update mechanism also removes the need for “free” simultaneous sync-mode play of all arms after each round (2). Recall that before each sync-mode play of arm k with probability  εwe can play arm k another time, and so far we have chosen to do so in probe mode. The ScheduleArmPlays routine (line 27, Alg. 1) that both MIRRORSYNC and ASYNCMIRRORSYNC rely on attempts this (lines 32-33, Alg. 1) for every sync-mode arm play except the first arm play of each round. ASYNCMIRRORSYNC takes advantage these unused chances


to schedule sync-mode plays, which reset cost generation for some fraction of arms (line 36, Alg. 2). For the remaining arms, it simply waits until their next sync-mode play (line 37, Alg. 2) to start estimating the new gradient.

Last but not least, ASYNCMIRRORSYNC improves the ef-ficiency of updates themselves. It employs all gradient samples we get for an arm between updates, and averages them to reduce estimation variance (lines 25, 28), thereby rectifying MIRRORSYNC’s weakness (3).

There are several existing models superficially related to but fundamentally different from SYNCHRONIZATION MABs.

In maintenance job scheduling (Bar-Noy et al., 1998), as in our setting, each machine (arm) has an associated operating cost per time unit that increases since the previous maintenance, and performing a maintenance reduces this cost temporarily. However, the agent knows all arms’ cost functions and always observes the machines’ running costs.

Upadhyay et al. (2018) describe a model for maximizing a long-term reward that is a function of two general marked temporal point processes. This model is more general than

SYNCHRONIZATION MAB in some ways (e.g., not assuming cost process monotonicity) but allows controlling the policy process’s rate only via a policy cost regularization term and provides no performance guarantees.

Recharging bandits (Immorlica & Kleinberg, 2018), like

SYNCHRONIZATION MABs, have arms with non-stationary payoffs: the expected arm reward is a convex increasing function of time since the arm’s last play. In spite of this similarity, recharging bandits and other MABs with time-dependent payoffs (Heidari et al., 2016; Levine & Crammer, 2017; Cella & Cesa-Bianchi, 2020) make the common assumptions that a reward is generated only when an arm is played and that the agent observes all generated rewards. As a result, their analysis is different from ours. In general, payoff non-stationarity has been widely studied in two broad bandit classes. Restless bandits (Whittle, 1988) allow an arm’s reward distributions to change, but only independently of when the arm is played. Rested bandits (Gittins, 1979) also allow an arms’ reward distribution changes, but only when the arm is played. SYNCHRONIZATION MABs belong to neither class, since their arms’ instantaneous costs change both independently and a result of arms being played.

The goal of our experiments was to evaluate (1) the relative performance of MIRRORSYNC and ASYNCMIRRORSYNC, given that MIRRORSYNC assumes “free” arm resets at the beginning of each round and ASYNCMIRRORSYNC doesn’t, and (2) the relative performance of ASYNCMIRRORSYNC and its version with projected stochastic gradient descent (PSGD) instead of mirror descent, which we denote as ASYNCPSGDSYNC. The choice of mirror descent instead of PSGD was motivated by the intuition that with mirror descent MIRRORSYNC would achieve lower regret than with PSGD (see Section 5). In the experiments, we verify this intuition empirically. Before analyzing the results, we describe the details of our experiment setup.

Problem instance generation. Our experiments in Figures 1 and 2 were performed on SYNCHRONIZATION MAB instances generated as follows. Recall from Sections 2 and 3 that a SYNCHRONIZATION MAB instance is defined by:

 rmin, the lowest allowed arm play rate

 rmax, the highest allowed arm play rate

B, the highest allowed total arm play rate

K, the number of arms

• {ck(τ)}Kk=1, a set of cost-generating processes — time- dependent distributions of instantaneous costs, one process for each arm k.

For all problem instances in the experiments,  rmin, rmax, B,and K were as in Table 1 in Appendix B. The set {ck(τ)}Kk=1of cost-generating processes was constructed randomly for each instance. In all problem instances, each arm had a distribution over time-dependent cost functions in the form of polynomials  ck(τ) = akτ pk, where pk ∈ (0, 1)was chosen at instance creation time and  aksampled at run time as described in Appendix B. Note that MIRRORSYNC’s regret (Equation 9) depends on the cost cap U. While our polynomial cost functions are unbounded in general, they are bounded within the  [rmin, rmax]constraint region we are using (Table 1 in Appendix B). Within this constraint region, these cost functions are equivalent to  min{akτ pk, U},where U = 40.

In Appendix C we also describe a different, Poisson process-based family of cost-generating processes, and present experimental results obtained on it in Figures 3 and 4 in that section. Despite that process family being very distinct from the polynomial one, the results are qualitatively similar to those in Figures 1 and 2.

Implementation details. We implemented MIR- RORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, along with a problem instance generator that constructs SYNCHRONIZATION MAB instances as above, in Python. The implementation, available at


Scheduling, relies on scipy.optimize.minimize for convex constrained optimization in order to update the play rates r (lines 20, 42 of Alg. 1, line 33 of Alg. 2). Other convex optimizers are possible as well. The experiments were performed on a Windows 10 laptop with 32GB RAM with 8 Intel 2.3GHz i9-9980H CPU cores.

Hyperparameter tuning. Running the algorithms required


Figure 1. MIRRORSYNC’s and ASYNCMIRRORSYNC’s convergence. The two algorithms exhibit nearly identical convergence behavior using tuned hyperparameters. However, ASYNCMIRRORSYNC works without assuming the “free” arm resets after arms’ parameter updates that MIRRORSYNC relies on.


Figure 2. Asynchronous version of MIRRORSYNC with mirror descent vs. with projected SGD. The use of mirror descent with the log barrier function in MIRRORSYNC was key to constructing the regret bounds in Section 5. Empirically, although ASYNCMIRRORSYNC and ASYNCPSGDSYNC eventually converge to similar-quality policies, ASYNCMIRRORSYNC discovers good policies faster, as MIRRORSYNC’s theory predicts.

choosing values for the following parameters:

Learning rate  η.As in other learning algorithms, choosing a good value for  ηfor each of the three algorithms was critical for their convergence behavior.

Length lupd round of intervals betweenASYNCMIRRORSYNC’s and ASYNCPSGDSYNC’s play rate updates. Recall that unlike MIRRORSYNC, which updates all play rates simultaneously after every  1/rmintime units, ASYNCMIRRORSYNC and ASYNCPSGDSYNC update the model parameters according to a user-provided schedule. While the schedule doesn’t necessarily have to be periodic, in the experiments it was, with  lupd roundbeing the inter-update interval length. Intuitively, lupd roundinfluences the average number of arms updated during each update attempt and the variance of gradient estimates: the larger  lupd round, the more gradient samples ASYNCMIRRORSYNC and ASYNCPSGDSYNC can be expected to average for the upcoming update (line 28 of Alg. 2). In this respect,  lupd round’s role resembles that of minibatch size in minibatch SGD.

Exploration parameter  ε.Theory provides a horizondependent guidance for setting  εfor MIRRORSYNC (Corollary 1) but not for ASYNCMIRRORSYNC and ASYNCPSGDSYNC. For comparing relative convergence properties of MIRRORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, we fixed  ε = 0.05for all of them.

ASYNCMIRRORSYNC’s and ASYNCPSGDSYNC’s performance is determined by a combination of  η and lupd round,so we optimized them together using grid search. Please see Appendix B for more details.

Experiment results. Figures 1 and 2 compare the performance of MIRRORSYNC vs. ASYNCMIRRORSYNC and ASYNCMIRRORSYNC vs. ASYNCPSGDSYNC, respectively. The figure captions highlight important patterns we observed. The plots were obtained by running the respective pairs of algorithms on 150 problem instances generated as above, measuring the policy cost J after every update, and

averaging the resulting curves across these 150 trials.

In each trial, all algorithms were run for the amount of simulated time equivalent to 240 MIRRORSYNC rounds (see Figure 1’s and 2’s x-axis). However, note that the number of updates performed by ASYNCMIRRORSYNC and ASYNCPSGDSYNC was larger than 240. Specifically, a MIR- RORSYNC’s update round is always of length  1/rmintime units, but for ASYNCMIRRORSYNC and ASYNCPSGDSYNC it is  lupd roundunits, so for every MIRRORSYNC update round, they perform  (1/rmin)/lupd roundupdates. Although more frequent model updates is itself a strength of the asynchronous algorithms, their main practical advantage is independence of MIRRORSYNC’s “free arm resets” assumption.

This paper presented SYNCHRONIZATION MABs, a bandit class where all arms generate costs continually, independently of being played, and the agent observes an arm’s stochastic instantaneous cost only when it plays the arm. We proposed an online learning approach for this setting, called MIRRORSYNC, whose novelty is in estimating the policy cost gradient without directly observing the policy cost function and without having a closed-form expression for it. Moreover, we derived an  O(T23 )adversarial regret bound for MIRRORSYNC without explicitly requiring the gradients to be bounded. We also presented ASYNCMIRRORSYNC, a practical version of MIRRORSYNC that lifts the latter’s idealizing assumptions. The key insight behind all these contributions is that the use of mirror descent for policy updates in SYNCHRONIZATION MABs enables much faster convergence than gradient descent would. Our experiments confirmed this insight empirically.

Acknowledgments. We would like to thank Nicole Immorlica (Microsoft Research) and the anonymous reviewers for their helpful comments and suggestions regarding this work.

Azar, Y., Horvitz, E., Lubetzky, E., Peres, Y., and Shahaf, D. Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences (PNAS), 2018.

Bar-Noy, A., Bhatia, R., Naor, J., and Schieber, B. Minimiz- ing service and operation costs of periodic scheduling. In SODA, pp. 11–20, 1998.

Bright, L., Gal, A., and Raschid, L. Adaptive pull-based policies for wide area data delivery. ACM Transactions on Database Systems (TODS), 31(2):631–671, 2006.

Bubeck, S. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2016.

Cella, L. and Cesa-Bianchi, N. Stochastic bandits with delay-dependent payoffs. In AISTATS, 2020.

Cho, J. and Garcia-Molina, H. Synchronizing a database to improve freshness. In ACM SIGMOD International Conference on Management of Data, 2000.

Cho, J. and Garcia-Molina, H. Effective page refresh poli- cies for web crawlers. ACM Transactions on Database Systems, 28(4):390–426, 2003a.

Cho, J. and Garcia-Molina, H. Estimating frequency of change. ACM Transactions on Internet Technology, 3(3): 256–290, 2003b.

Cho, J. and Ntoulas, A. Effective change detection using sampling. In VLDB, 2002.

Gal, A. and Eckstein, J. Managing periodically updated data in relational databases. Journal of ACM, 48:1141–1183, 2001.

Gittins, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2), 1979.

Heidari, H., Kearns, M., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In AISTATS, 2016.

Immorlica, N. and Kleinberg, R. Recharging bandits. In FOCS, 2018.

Kolobov, A., Peres, Y., Lu, C., and Horvitz, E. Staying up to date with online content changes using reinforcement learning for scheduling. In NeurIPS, 2019a.

Kolobov, A., Peres, Y., Lubetzky, E., and Horvitz, E. Op- timal freshness crawl under politeness constraints. In SIGIR, 2019b.

Levine, N. and Crammer, K. Rotting bandits. In NIPS, 2017.

Nemirovsky, A. and Yudin, D. Problem complexity and method efficiency in optimization. Wiley, 1983.

Robbins, H. Some aspects of the sequential design of exper- iments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Upadhyay, U., De, A., and Gomez-Rodriguez, M. Deep re- inforcement learning of marked temporal point processes. In NeurIPS, 2018.

Upadhyay, U., Busa-Fekete, R., Kotlowski, W., Pal, D., and Szorenyi, B. Learning to crawl. In AAAI, 2020.

Whittle, P. Restless bandits: Activity allocation in a chang- ing world. Applied Probability, 25(A):287–298, 1988.

Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. Optimal crawling strategies for web search engines. In WWW, 2002.

A. Proofs

Proposition 1 For a given K-armed SYNCHRONIZATION BANDIT instance (Problem 1), the optimal periodic policy  π∗ ∈ Πhas the same cost  J∗ as the optimal general deterministic open-loop policy under the same constraints.

Before proving this result, we formalize its statement. Consider the class  Πo of all deterministic open-loop policies:


where  lnk = sync, denoting synchronization play, and  tnkis a time when arm k is supposed to be played.

Recall that for a schedule  σkand a horizon  H, Nk(H)is the index of  σk’s largest time point within the horizon H. We define the cost  Jσk of an arm’s schedule  σas in Equation 4 and let the cost of a policy  π ∈ Πo be


For an arm  k and rk ∈ R+, consider

i.e., the limit of the minimum average cost for arm k achievable by pulling arm k at no higher than some average rate  rk. We formulate the analog of Problem 1 for policy class  Πo in the known-parameter setting as follows:

Problem 2.

where  Jkis as in Equation 12. We can now re-state Proposition 1 formally:

Lemma. For a given K-armed SYNCHRONIZATION BANDIT instance, the periodic policy  π∗ ∈ Πthat optimally solves Problem 1 also optimally solves Problem 2.

Proof. The proof proceeds as follows. First, we derive an expression for  Jk(rk; H), the average cost of the optimal finite-horizon schedule that pulls arm k at no higher than rate  rk. Then, noting that  Jk(rk) = limH→∞, we derive an expression for  Jk(rk) (Equation 12). Finally, using this expression, we show that the policy  π∗ ∈ Πo that optimally solves Problem 2 is periodic and hence is also is a solution to Problem 1. But every solution of Problem 1 is also a solution of Problem 2, so  π∗is optimal for Problem 1 as well. Since, by convexity, Problem 1 has a unique optimal solution, the will follow.

Formally, letting  |σk|Hbe the number of schedule  σk’s arm plays up to time H, we define



Here, as defined at the beginning of Section 3,  τnkis the time interval between schedule  σk’s (nk − 1)-th and  nk-th arm play (τnk ≜ tnk − tnk−1, tNk+1 = H, and Ck(τnk) ≜� τnk0 ck(τ)dτ (Equation 3).

Now, note that for a fixed  Nkand  H, 1H�Nk+1nk=1 Ck(τnk)is convex increasing in each  τnk. This follows because each

Ck(τnk)is convex increasing, which, in turn, follows from  Ck(τnk)’s definition because  ck(τ)is positive non-decreasing by Assumption 3. Therefore, applying the method of Lagrange multipliers shows that, for a fixed  Nk and H, the minimizer of 1H�Nk+1nk=1 Ck(τnk)is the schedule  σ∗k = {( nHNk+1, sync)}Nkn=1, and we can rewrite Equation 15 as


Thus, if the number of arm plays  Nkis given, the best finite-horizon schedule is periodic. Moreover, Equation 16 implies that  Jk(H, Nk)is monotonically decreasing in  Nkfor a fixed H, because for any  Nk ≥ 1, increasing the number of arms plays by 1 always helps reduce the best schedule cost:


Equality 17 was obtained via algebraic manipulations. Inequality 18 follows because the interval  [0, HNk+2]can be bro-


Nk+2 = HNk+2 −� HNk+2 − H(Nk+2)(Nk+1)�, we have �Ck� HNk+2�− Ck� HNk+2 − H(Nk+2)(Nk+1)�� ≤�Ck� HNk+1�− Ck� HNk+2��, establishing the final inequality 19.

Crucially,  Jk(H, Nk)being monotonically decreasing in  Nkmeans that the r.h.s. of Equation 14 is minimized when  Nk isas large as possible given the consraint  Nk ≤ rkH, Nk ∈ N, i.e., when  Nk = ⌊rkH⌋. Plugging Nk = ⌊rkH⌋into Equation 16 and substituting Equation 16 into Equation 14, we can rewrite Equation 14 as


Also, notice from Equations 12 and 14 that


This implies that

Substituting this expression into Equation 13 reveals that Problem 2 is identical to Problem 1, so the optimal deterministic periodic policy under Problem 1’s constraints is also optimal for Problem 2 over all deterministic open-loop policies. ■

Lemma 1 For any policy  π(r) ∈ Π,


J(r) and Jk(rk) ≜ rkCk�1rk

Proof. First we show that the form of J(r) is as stated in the theorem, and then we show its monotonic decrease and convexity.

Recall from Section 3 that for arm k’s schedule  σk = ((t1, l1), (t2, l2), . . .), {tnk}is a possibly infinite sequence of time points when the arm is to be played, and for a finite horizon  H, Nk(H)is the index of schedule  σk’s largest time point not exceeding H. For convenience we let  t0 ≜ 0, tNk(H)+1 ≜ H, and τnk ≜ tnk − tnk−1 for nk ≥ 1.

Consider a cost function for a periodic schedule for arm k, where arm k is played at a rat  rk. Equation 4implies that it has the form


and hence


proving the first part of the lemma.

To show J’s convexity, we compute its Hessian:


noting that ∂2J∂rjrk = 0 for all j ̸= k.

Crucially, by Assumption 3,  ckis non-decreasing, so  c′k ≥ 0and  ∂2J∂r2k ≥ 0. Thus, J’s Hessian is positive semidefinite, implying that J is convex.

To see that J is monotonically decreasing in each  rk, consider ∂J∂rk = Ck�1rk


Lemma 2. For a rate vector  r = (rk)Kk=1and a probability  ε, suppose the agent plays each arm in sync mode 1/rktime after that arm’s previous sync-mode play, observing a sample of instantaneous cost  ˆck ∼ ck(1/rk). Suppose also that in addition, with probability  εindependently for each arm k, the agent plays arm k in probe mode at time t ∼ Uniform[0, 1/rk]after that arm’s previous sync-mode play, observing a sample of instantaneous cost  ˆc(ε)k ∼ ck(t). Then for each k,


is an unbiased estimator of  ∂kJ(rk).

Proof. We need to ensure that  E[∂kJ(rk)] − E [gk] = 0. By definition of  Jk (Lemma 1),


Similarly, by definition of  gk,

The last line follows because  Ck(0) = 0, since, by Assumption 1,  ck(0) = DiracDelta(0). Thus,  E[∂kJ(rk)] =E [gk]. ■

Lemma 3. The exploration gap is bounded by



i.e. the set has the same  ℓ1constrain as  Kεbut the range of  K0. Recall and define


where the last line follows from the negativity of all gradients. We note that  ˜r∗εk > r∗εkimplies  r∗εk = rmax1+ε, because otherwise one of the extreme points violates the K.K.T. conditions listed below. The gradients are monotonically increasing, so we have


Now we bound the gap between  r∗and  ˜r∗ε. The functions  JTare convex and monotonically decreasing according to Lemma 1. By convexity and Cauchy-Schwarz, it holds that

�����Tmax�T =1∇JT (˜r∗ε)�����∞∥r∗ − ˜r∗ε∥1 .Bounding the gradient norm. We show that there exists  k∗ ∈ arg mink∈[K]�TmaxT =1 ∂kJT (˜r∗ε) such that


(i) follows directly from the fact that  JTk are monotonically decreasing functions, so all gradients are negative and the infinity norm is obtained by the smallest value.

(ii) follows from the K.K.T. conditions of the extreme point  ˜r∗ε, which read:  ∃c ∈ R such that ∀k ∈ [K] :


If the set  {k ∈ [K] | ˜r∗εk = rmax}is not empty, then  k∗must lie in that set. If  ∀k : ˜r∗εk = rmin, the statement is trivial. Otherwise if there is no coordinate of value  rmaxand at least one larger than  rmin, we can simply choose  k∗as arg maxk∈[K] ˜r∗εk since the gradient is equal to c.

Note that  |∂kJTk (r)|is monotonically decreasing in  rkdue to convexity and negativity of the gradients. Furthermore ˜r∗εk∗ ≥ B(1+ε)K by definition.


Bounding the  ℓ1-norm.From the K.K.T. conditions, we can directly infer that  r∗ ≥ ˜r∗ε, or otherwise the K.K.T. conditions for the extreme point  r∗ are violated. Therefore


Combining everything finishes the proof. ■

Preliminaries for the proof of Theorem 5.1. For the proof, we require the following theorem and lemma

Theorem A.1 (Bandit Algorithms. Theorem 28.4). Let  η > 0and F be Legendre with domain D and  Kε ⊂ Rdbe a nonempty convex set with  int(dom(F)) ∩ Kε ̸= ∅. Let  r1, . . . , rT +1be the actions chosen by mirror descent, which are

assumed to exist. Furthermore assume that for any  T ∈ [Tmax]: ∇F(rT ) − ηgT ∈ int(dom(F ∗)), then for


the regret of mirror descent is bounded for any  r ∈ Kε by


Lemma 4. For any  x ≥ − 12 : − log(1 + x) + x ≤ x2.

which concludes the proof. ■

Theorem 5.1. For any sequence of convex functions  (JT )TmaxT =1and learning rate  0 < η < Kε2U, the in-policy regret of MIRRORSYNC is bounded by


Proof. Since the functions  JT are convex, we have

Furthermore the loss estimators are conditionally independent and unbiased, so


We verify that we can apply Theorem A.1. Recall our potential is  F(r) = − �Kk=1 log(rk)with domain  D = (0, ∞)K.

The convex conjugate is  F ∗(y) = −K − �Kk=1 log(−yk)with interior domain  (−∞, 0)K. It holds∂kF(rT ) − ηgTk = − 1rTk − ηgTk ≤ − 1rTk + ηUrTk Kε < 0 ,

which completes the requirements. By Theorem A.1


Now we bound the Bregman divergence terms.

Denote  BTk ∼ Ber(ε)the indicator of having used the exploration for estimating the gradient  ∂kJTin round T . by the definition of the gradient estimator, it is bounded by  |gTk | ≤ BTk2UεK + (1 − BTk ) UK . Hence


Combining everything completes the proof


B. Details of Empirical Evaluation

Constructing polynomial cost-generating processes. The cost-generating processes  ck(τ) = akτ pkused in the experiments in Figures 1 and 2 were constructed and operated as follows:

During a sync-mode play of arm k, a function  ck(.)is randomly chosen until the next sync-mode play by sampling ak ∼ Uniform([ak − ak · noise, ak + ak · noise]), where noise ∈ [0, 1]is a parameter and  akis both problem-instance-and arm-specific. To construct a problem instance, our generator randomly chose  ak ∼ Uniform[0, 1]for each k at problem creation time. The noise parameter in our experiments was shared by all problem instances and all arms, its value as in Table 1.

 pkis a parameter chosen by our problem generator at the instance creation time from the sigmoid prior  σ(scale ·Uniform[0, 1]), where the scaling parameter in our experiments is shared by all problem instances and all arms, with its value as in Table 1. Thus,  pk ∈ (0, 1) but pk values > 0.5are more likely.

Hyperparameter tuning. We performed a grid search with a fixed random number generator seed, 0, on problems generated


Table 1. Problem generator parameters for results in Figures 1 and 2.

using parameters in Table 1. Namely, for each considered parameter combination, we ran each algorithm once for a fixed number of update rounds equal to  horizon · (1/rmin)/lupd round, where horizon = 240 on the problem setup described in Section 8, and generated curves similar to those in Figures 1 and 2. For MIRRORSYNC,  lupd roundwas always 1. For ASYNCMIRRORSYNC and ASYNCPSGDSYNC,  lupd round < 8consistently caused both to either be very unstable or diverge (independently of  η), while lupd round > 40would cause them to update arm play rates less frequently than MIRRORSYNC, so we didn’t consider  lupd round /∈ [8, 40]during grid search. We inspected the resulting curves and for each algorithm chose a parameter combination whose curve showed convergence to the lowest cost J at the end but remained stable (showed little jitter). Please refer to the implementation at https://github.com/microsoft/Optimal-Freshness-Crawl-Scheduling for specific parameter combinations over which we performed the grid search.

The chosen parameter combinations that we used to get the results in Figures 1 and 2 are:


C. Additional Results

Besides the polynomial cost-generating processes in Section 8, we experimented with Poisson process-based costs of the kind common in the Web crawling literature (Cho & Garcia-Molina, 2003a; Azar et al., 2018; Kolobov et al., 2019a;b). Here, each arm’s cost-generating process  ck(τ)is driven by an associated Poisson point process  pois(λk), whose rate  λk issampled from Uniform[0.005, 5.0] at problem generation time. For each arm, we let  ck(τ) ≜ 1 if pois(λk)has generated at least one event in time  τsince the latest play of arm k, and 0 otherwise. In Web crawling,  pois(λk)models the number of meaningful changes on a Web page since it was last crawled. For problem instances in this experiment, we used problem generation parameters in Table 2.


Table 2. Problem generator parameters for results in Figures 3 and 4.

The algorithms’ hyperparameter values were chosen using the same procedure as described in Appendix B, in combination with  ε = 0.05:



Figure 3. MIRRORSYNC’s and ASYNCMIRRORSYNC’s convergence on the problem with Poisson process-based cost-generating processes. Like in Figure 1, the two algorithms exhibit very similar convergence behavior, despite ASYNCMIRRORSYNC not relying on free arm resets as MIRRORSYNC does.


Figure 4. ASYNCMIRRORSYNC vs. ASYNCPSGDSYNC. As in the experiment with polynomial cost-generating processes, ASYNCMIRRORSYNC arrives at good policies faster than ASYNCPSGDSYNC, but the former’s advantage is less pronounced than in Figure 2.

The experiment’s results are shown in Figures 3 and 4. Qualitatively, they resemble those for polynomial cost-generating processes in Figures 1 and 2 in Section 8: ASYNCMIRRORSYNC’s empirical converge rate approximately matches that of MIRRORSYNC (which, however, makes an unrealistic assumption about free arm pulls at the start of each round) and exceeds that of ASYNCPSGDSYNC, although the advantage of ASYNCMIRRORSYNC over ASYNCPSGDSYNC is less pronounced than in the first experiment.

Designed for Accessibility and to further Open Science