Multi-armed bandits (MAB) (Robbins, 1952) have been widely applied in settings where an agent repeatedly faces K choices (arms), each associated with its own payoff distribution unknown to the agent at the start, and needs to eventually identify the arm with the highest mean payoff by pulling a subset of arms at a time and observing a payoff sampled from their distributions. MABs’ defining property is that the agent observes an arm’s instantaneous payoff when and only when the agent plays it. A key hidden assumption that goes hand-in-hand with it in the existing bandit models is that each arm generates reward when and only when it is played, which, combined with the bandit feedback property, also implies that the agent observes all generated payoffs.
In this paper, we go beyond these seemingly fundamental assumptions by identifying a class of practical settings that violate them and analyzing it using online learning theory. Specifically, this paper formalizes scenarios that we call
1Microsoft Research, Redmond 2Google Research, Berlin. Work on this paper partially done during a visit to Microsoft Research, Redmond. Correspondence to: Andrey Kolobov <akolobov@microsoft.com>.
Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
SYNCHRONIZATION MABs. In these settings, the agent can be thought of as holding copies of K files whose originals come from different remote sources. As time goes by, the files change at the sources, and their copies increasingly differ from the originals, becoming stale. The agent’s task is to refresh these files by occasionally downloading their new copies from remote sources, under a constraint B on the average number of downloads per time unit.
For each file, the agent is continually penalized for its staleness. The expected penalty at each time step due to this file is a non-decreasing function of the time since the file’s last refresh. Playing arm k here corresponds to refreshing file k: doing so temporarily reduces its staleness and thereby diminishes the cost incurred due to it per time unit. The goal is to find a synchronization policy that minimizes regret in terms of the average staleness penalties by refreshing files according to a well-chosen schedule.
Crucially, at any moment the agent doesn’t know how outdated its copy of a given file is, except at the time when it downloads its fresh copy, and therefore most of the time doesn’t know the penalties it is incurring. It observes the penalty only when it plays an arm, i.e., refreshes a file, and has a chance to see how different the cached copy was right before the refresh. Even this action reveals only the instantaneous penalty due to this file, not the cumulative penalty the file has brought on since its last refresh.
SYNCHRONIZATION MABs are inspired by problems such as web crawl scheduling (Wolf et al., 2002; Cho & Garcia- Molina, 2003a; Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020) and database update management (Gal & Eckstein, 2001; Bright et al., 2006). All these settings involve a cache that must proactively initiate downloads to refresh its content. This is in contrast to, e.g., Web browser caches that passively monitor a stream of download requests initiated by another program. The few existing works on policy learning for active caching (Kolobov et al., 2019a; Upadhyay et al., 2020) apply only to specific penalty functions. In contrast, the theoretical results in this paper are independent of the penalties’ functional form, and come with a practical online learning strategy for this model.
High-level analysis idea and paper outline. Online learning theory is a powerful tool for analyzing decision-making models where an agent operates in discrete-time instantaneous rounds by playing a candidate solution (arm), imme- diately getting a feedback on it (a sample from the arm’s payoff distribution), and using it to choose a candidate solution for the next round. Unfortunately, online learning’s traditional assumptions clash with the properties of our setting. As Section 2 describes, SYNCHRONIZATION MAB is a continuous-time model with non-stationary sparsely observable costs. Its candidate solutions are multi-arm policies (Section 3). Getting useful feedback on a policy, such as an estimate of its cost function or gradient, isn’t instantaneous; it requires playing the policy for a non-trivial stretch of time.
In Section 4, we present the MIRRORSYNC algorithm, which continuously plays a candidate policy along with “exploratory” arm pulls, periodically updating it with online mirror descent (Nemirovsky & Yudin, 1983; Bubeck, 2016). It uses a novel unbiased policy gradient estimator that operates in the face of sparse policy cost observations. Our regret analysis of MIRRORSYNC in Section 5 critically relies on the convexity of policy cost functions – a property we derive in Section 3 from minimal assumptions on SYNCHRO-
NIZATION MABs’ payoffs. The regret analysis treats time intervals between MIRRORSYNC’s policy updates as learning “rounds” and thereby brings online learning theory to bear on SYNCHRONIZATION BANDITS. Section 6 introduces ASYNCMIRRORSYNC, a practical MIRRORSYNC variant that lifts MIRRORSYNC’s idealizing assumptions. In Section 8, we compare the two algorithm empirically.
The contributions of this paper are thus as follows:
(1) We cast active caching as an online learning problem with sparse feedback, enabling principled theoretical analysis of this setting under a variety of payoff distributions.
(2) Based on this formulation, we propose a theoretic strategy for active caching under unknown payoff distributions and derive an adversarial regret bound of for it. In doing so, we overcome the challenges of sparse and temporal feedback inherent in this scenario that existing online learning theory does not address.
(3) We present a practical variant of the above strategy that lifts the latter’s assumptions and, as experiments demonstrate, has the same empirical convergence rate.
SYNCHRONIZATION BANDITS are a continuous-time MAB model with K arms. Other than operating in continuous-time it differs from existing MAB formalisms in the mechanism by which arms generate costs/rewards and the observability of the generated costs from the agent’s standpoint. In this section we detail both of these aspects, using the aforementioned cache update scenario as an illustrating example.
Cost-generating processes. In SYNCHRONIZATION MABs, every arm k incurs a stochastically generated cost at every time instant t, whether the arm is played or not. How- ever, the distribution of arm k’s possible instantaneous costs at time t depends on how much time has passed since the last time arm k was played to refresh the corresponding cached item. We denote the length of this time interval as
. Thus, arm k’s cost generation process is described by a family of random variables
as stated in the following assumption:
Assumption 1. (Cost generation) At time t, each arm incurs a cost independently of being played by the agent. The instantaneous cost due to arm k at time t is sampled from a random variable
s.t.: (1)
0; (2)
; (3) there exists a bound
s.t.
for every arm’s cost generation process
and any time interval length
By this assumption, for every arm and any amount of time since its latest play, its cost expectation is well-defined:
Agent’s knowledge and cost observability. While costs are generated by all arms continually, in our model the agent doesn’t observe most of them, with an important exception: Assumption 2. (Cost knowledge and observability) For each arm k, the agent observes a cost at time t if and only if the agent plays arm k at that time. The agent doesn’t know the distributions of random variables
Assumption 2 is crucial in two ways. First, it means that our model provides only bandit feedback. Namely, the agent doesn’t see arms’ costs at all times, unlike in related models such as maintenance scheduling (Bar-Noy et al., 1998). Second, coupled with Assumption 1 it implies that there is no causal relationship between playing an arm and incurring a cost, which is an implicit assumption that standard bandit strategies rely on.
Arm play modes. At any time t, any of SYNCHRONIZATION MAB’s arms can be played in one of two modes:
Sync mode. Playing arm k in this mode at time t resets the arm’s state, i.e., sets . In addition, per Assumption 2, the agent observes the arm’s instantaneous cost sample
immediately before
is reset to 0.
In the case of a cache, this means downloading a fresh copy of file k, estimating the difference between original and the cached copy, and overwriting the cached copy with the new one.
Probe mode. By playing arm k allows in probe mode, the agent observes the arm’s instantaneous cost, but the arm’s state is not reset.
In caching settings, this corresponds to downloading a fresh copy of item k, but using it purely to estimate the difference between k’s current original and the cached copy, without overwriting the cached copy.
Since, by Assumption 1, , playing an arm in sync mode gives the agent a way to temporarily reduce the expected rate at which the arm incurs costs. However, due to the following assumption, after a sync play the arm’s cost generation rate starts growing again:
Assumption 3. (Cost monotonicity) For every arm k, the means of instantaneous cost random variables
are non-decreasing in time since the latest sync-mode play
. If arm k was played in sync mode at time
, then any sequence of arm k’s cost observations
yielded by probe-mode plays after
this arm’s next sync-mode play at time
is non-decreasing.
Arm state can be viewed as the amount of time that has passed since the arm’s last sync by time point t; the more time has passed, the more cost the arm is incurring per time unit. Playing an arm in sync mode simply resets this time counter. Thus, according to Assumption 3, not only does the total cost generated by arm k since its previous sync play grow as time goes by – which is to be expected – but so does the rate at which it happens.
Note that probe-mode arm plays don’t help the agent reduce running costs directly. Instead, as we show in Section 4, they help the agent learn a good arm-playing policy faster.
Example. All of the above assumptions are natural in realworld scenarios that inspired the SYNCHRONIZATION model. For instance, in Web crawling each online web page accumulates changes according to a temporal process is widely assumed to be Poisson (i.e., memoryless) in the Web crawling literature (Wolf et al., 2002; Cho & Ntoulas, 2002; Cho & Garcia-Molina, 2003a;b; Azar et al., 2018; Kolobov et al., 2019a;b; Upadhyay et al., 2020). For each indexed page, the agent (the search engine) incurs a cost
due to serving outdated search results, as a function of the total difference d between the indexed page copy and the online original. From this perspective,
but at least one other approach models
directly as a function of a web page copy’s age (Cho & Garcia-Molina, 2000). In either case, Assumption 3 holds: the more time passes since the page’s last crawl, the higher the expected instantaneous penalty. Moreover, penalties don’t decrease between two successive crawls of a page: e.g., in case changes are generated by a Poisson process, their number can only grow with time since last refresh, and so can the penalty.
In order to derive a learning strategy for SYNCHRONIZATION BANDITS (Section 4) and its regret analysis (Section 5), we first derive the necessary building blocks: the cost of an arbitrary policy for this model, the class of policies that will serve as our algorithm’s hypothesis space, and parameterized cost functions for policies of this class.
Policy costs. Our high-level aim is finding a SYNCHRONIZA- TION policy that has a low expected average cost over an infinite time horizon. Whether a policy
is historydependent, stochastic, or neither, executing it produces a schedule
for each arm k, a possibly infinite sequence of time points
when the arm is to be played and corresponding labels
specifying whether the arm should be played in probe or sync mode at that time. For convenience, WLOG assume that
always refers to t = 0, let
, and for any finite horizon H let
be the index of schedule
’s largest time point not exceeding H:
Given this definition, let
Recalling that each arm has a specific time-dependent cost distribution with mean
, we define the average infinite-horizon cost
’s schedule
Letting
we can rewrite ’s definition as
Here, is the total cost that arm k is expected to incur between
-th and
-th plays according to schedule
. Thus,
is just the average of these costs over the entire schedule. If the schedule stops playing arm k forever after some time
may be infinite.
Running a policy amounts to sampling a joint schedule
. Therefore, we define policy cost
Target policy class. Instead of considering all possible SYN- CHRONIZATION policies as potential solutions, in this paper we focus on those whose sync-mode plays are periodic, with equal gaps between every two consecutive such plays of a given arm. For arm k, we denote the length of these gaps as length,
being a policy parameter for this arm and
being the joint parameter vector for all arms. Importantly, our policies do allow probe-mode arm plays but don’t restrict how the time points for these plays are chosen. In particular, they may be chosen stochastically, as long the timings of sync-mode plays are deterministically periodic.
Formally, for a scheduled arm pull time point ule
, let
be the next sync-mode play of arm k in
, i.e.,
if
. Then our target policy class is
Parameters r can be interpreted as rates at which arms are played in sync mode. For , policy costs are uniquely determined by sync-mode play rates r: although these parameters ignore the timing of probe plays, probe plays don’t affect cost generation and therefore policy cost.
We let J(r) denote ’s policy cost. Equation 5 implies that its cost functions J(r) have a special form critical for our regret analysis in Section 5 – they are convex:
Lemma 1. For any policy
and monotonically decreasing for r > 0.
Proof. See the Appendix.
The convexity of the cost functions plays a crucial role in obtaining the regret bounds (Section 5) for the policy learning algorithm presented in Section 4.
Policy constraints. Naturally, we would like to find a that minimizes J(r) (Equation 7). As described, however,
has no such policy: note that
0, but no finite r attains this limit. However, in practical applications the rates
cannot be arbitrarily high individually or in aggregate, and are subject to several constraints. Therefore, in this paper we regard feasible
as bounded from above for all k by a universal bound
. Moreover, we assume that the sum of all arms’ play rates may not exceed some value B > 0. E.g., in Web crawling B is commonly interpreted as a limit on crawl rate imposed by physical network infrastructure (Azar et al., 2018; Kolobov et al., 2019a; Upadhyay et al., 2020). Last but not least, valid
values may not be 0, since this implies never playing this arm after a certain time point. In applications such as Web crawling, this means abandoning a cached item (e.g., an indexed webpage) to grow arbitrarily stale, which is unacceptable, so we impose a minimum sync-mode play rate
on every arm. Note that since, by Assumption 1, every
is bounded for
) is bounded as well.
Policy optimization. Thus, if we knew cost processes we could use them to compute
for all arms and would face the following optimization problem:
Problem 1 (SYNCHRONIZATION BANDIT instance).
subject to:
Notice that this formulation implicitly assumes that the entire bandwidth B will be used for sync-mode arm plays – indeed, if the model is known, there is no need for probes.
As a side note, we remark that the class of periodic policies doesn’t restrict Problem 1’s solution quality compared to broader the class of deterministic open-loop policies. We state it here informally as a proposition, which we reformulate more precisely and prove in the Appendix A:
Proposition 1. For a given K-armed SYNCHRONIZATION BANDIT instance (Problem 1), the optimal periodic policy has the same cost
as the optimal general deterministic open-loop policy under the same constraints.
In reality we don’t know the cost generation processes and can’t solve the above optimization problem directly. Instead, we adopt an online perspective on SYNCHRONIZATION bandits. A key contribution of this paper that we present in this section is MIRRORSYNC (Algorithm 1), an algorithm that treats a SYNCHRONIZATION MAB as an online learning problem. A MIRRORSYNC agent can be viewed operates in rounds , in each round “playing” a candidate policy parameter vector
, suffering a “loss”
, and updating
to a new vector
as a result. As we show in Section 5, MIRRORSYNC has an adversarial regret of
MIRRORSYNC’s novelty is due to the fact that, despite super-ficial similarities to standard online learning, our setting is different from it in crucial ways, and MIRRORSYNC circumvents these differences:
(1) Although the agent can be viewed as suffering loss doesn’t observe this loss. By SYNCHRONIZATION MAB’s assumptions, it observes only samples of instantaneous costs
, and only when it plays arms. Existing techniques don’t offer a way to estimate the gradient
in this case. Moreover, even these impoverished observations take real- world time to collect. (2) In online learning, regret analysis normally assumes
to be bounded. This isn’t quite the case in our model. While we could assume a bound on
linear in
, it would be detrimental to the regret bound when
is large. We show how MIRRORSYNC addresses challenge (1) in this section, and circumvent challenge (2) in Section 5.
A note on infinite vs. finite-horizon policies. The policy cost functions we derived in Section 3 describe the steady- state performance of a policy over an infinite time horizon. However, MIRRORSYNC’s rounds are finite. Thus, the cost function J (Equation 7) that MIRRORSYNC uses as a basis for policy improvement is a proxy measure for the average costs that running MIRRORSYNC incurs in each round.
MIRRORSYNC operation. At a high level, MIRRORSYNC’s main insight is allocating a small fraction of available band- width B, determined by input parameter (Algorithm 1), to probe-mode arm plays, while using the bulk
the bandwidth (line 2) to play in sync mode according to the current rates r. MIRRORSYNC uses instantaneous cost samples obtained from both to estimate the gradient
(lines 11 - 19) by individually estimating its partial derivatives (lines 23-25), which we denote as
for short. At the end of each epoch, it does online mirror descent on these estimates to get a new sync-mode policy r (lines 20, 42-43).
In more detail, in the spirit of online learning, MIRRORSYNC assumes that at the start of each round all arms’ cost gen- eration processes have just been reset to , and restarts the time counter at t = 0 for every arm (line 9). (In prac- tice, this assumption is unrealistic, and we lift it in another variant of MIRRORSYNC in Section 6.) Then, for every arm k, it schedules sync-mode plays at intervals
37-38) until the end of the current round, and attempts to insert one probe-mode play into each such interval (lines 33-36) independently with probability
), at a point chosen uniformly at random over the interval’s length (line 34). Executing the constructed schedule (line 13) yields cost samples that are used in the aforementioned gradient estimation, which, crucially, is unbiased:
Lemma 2. For a rate vector and a probability
, suppose the agent plays each arm in sync mode
time after that arm’s previous sync-mode play, observing a sample of instantaneous cost
. Suppose also that in addition, with probability
independently for each arm k, the agent plays arm k in probe mode at time
after that arm’s previous sync-mode play, observing a sample of instantaneous cost
Then for each k,
is an unbiased estimator of
Proof. See the Appendix.
In one round, MIRRORSYNC may get several gradient estimates for a given arm, in which case it takes the first one (line 18). To get a regret bound, however, it is crucial to ensure that for each arm MIRRORSYNC receives at least one such estimate per round, even if the estimate is 0 (line 18). This is why we set the length of each round to be (line 1) — the largest value
and hence the longest time that MIRRORSYNC may have to wait in order to get a cost sample at
We generalize our stochastic setting to an adversarial problem and prove an adversarial regret bound of order This means that the cost distributions
and all derived quantities (
) need not be non-stationary from one round to the next. The cost distributions and derived functions at round T are denoted by
,
,
,
and can be chosen by an oblivious adversary ahead of time.
Regret. We define the regret with respect to a fixed schedule
where the expectation is over the randomness of observed costs and
is the choice of the algorithm in round T . Our goal is to compete with the best possible schedule
using the full available budget:
where of Algorithm 1) with
Since we are not be able to obtain any information on the function value or gradient of without an allocated exploration, we also define the best possible schedule
exploration constrained by
of Algorithm 1):
The regret can be decomposed into
which we bound separately.
In-policy regret. The problem is an instance of online learning, but online learning literature typically assumes that the gradients of the objective functions are uniformly bounded w.r.t. some norm. Our setting differs in a key aspect: the gradients
scale proportionally to
A naive solution would be to bound and use gradient descent. However, the regret would scale with
, which might be prohibitively large.
We show that mirror descent with a carefully chosen potential, namely the log barrier , adapts to the geometry of the gradients and replaces the polynomial dependency on
dependency.
Theorem 5.1. For any sequence of convex functions and learning rate
, the in-policy regret of MIRRORSYNC is bounded by
Proof. See the Appendix.
Exploration Gap. We show that the exploration gap scales proportionally to and is independent of
. On first sight, this bound is surprising because the exploration gap should be approximately
the gradients
could be unbounded (i.e. of order
). The high-level idea behind the following lemma is the observation that at the optimal point
, the gradients in all coordinates must coincide and hence the gradient cannot be of order
Lemma 3. The exploration gap is bounded by
Proof. See the Appendix.
Finally we are ready to present the main result.
Corollary 1. The regret of MIRRORSYNC with
bounded for any
Proof. The choice of and the bound on
ensure that we can apply Theorem 5.1 to bound the in-policy regret. The in-policy regret simplifies to
Adding the exploration gap from Lemma 3 and substituting the value for completes the proof.
Although MIRRORSYNC lends itself to theoretical analysis, several design choices make its vanilla version impractical. (1) MIRRORSYNC assumes that all arms are synchronized “for free” at the start of each round so that each round starts in the same “state”, which is unrealistic. (2) MIRRORSYNC waits until the end of each -long round to update all arms’ play rates simultaneously, which could be months in applications like Web crawling. (3) MIRRORSYNC’s further source of inefficiency is that even if arm k has produced several
samples in a given round, MIRRORSYNC uses only one of them. ASYNCMIRRORSYNC (Algorithm 2), which can be viewed as a practical implementation of MIRRORSYNC, mitigates these weaknesses of MIRRORSYNC.
In contrast to MIRRORSYNC, which performs updates in rigidly defined rounds, ASYNCMIRRORSYNC updates policy according to a user-specified schedule S (see Algorithm 2’s inputs). The length of inter-update periods is unimportant for ASYNCMIRRORSYNC, unlike for MIRRORSYNC (line 1, Alg. 1), due to a major difference between the two algorithms. MIRRORSYNC aims to update all arms’ parameters synchronously at the end of each round and makes the rounds very long to guarantee that each arm has generated at least one gradient estimate by the end of each round. In the meantime, ASYNCMIRRORSYNC does updates asynchronously, performing mirror descent at an update time only on those arms that happen to have generated at least one new gradient sample since the previous update time
these local updates while respecting the global constraint B by using the sum of current play rates or arms that are about to be updated as a local constraint (line 32). Thus, ASYNCMIRRORSYNC doesn’t need to make inter-update intervals excessively long and doesn’t suffer from issue (1).
As a side note, the reason MIRRORSYNC’s regret bound in Corollary 1 has no linear dependence on is because it characterizes regret in terms of the number of rounds, not wall-clock time. Nonetheless, this dependency matters empirically, and obtaining a wall-clock-time regret bound that is free from it is an interesting theoretical problem.
ASYNCMIRRORSYNC’s asynchronous update mechanism also removes the need for “free” simultaneous sync-mode play of all arms after each round (2). Recall that before each sync-mode play of arm k with probability we can play arm k another time, and so far we have chosen to do so in probe mode. The ScheduleArmPlays routine (line 27, Alg. 1) that both MIRRORSYNC and ASYNCMIRRORSYNC rely on attempts this (lines 32-33, Alg. 1) for every sync-mode arm play except the first arm play of each round. ASYNCMIRRORSYNC takes advantage these unused chances
to schedule sync-mode plays, which reset cost generation for some fraction of arms (line 36, Alg. 2). For the remaining arms, it simply waits until their next sync-mode play (line 37, Alg. 2) to start estimating the new gradient.
Last but not least, ASYNCMIRRORSYNC improves the ef-ficiency of updates themselves. It employs all gradient samples we get for an arm between updates, and averages them to reduce estimation variance (lines 25, 28), thereby rectifying MIRRORSYNC’s weakness (3).
There are several existing models superficially related to but fundamentally different from SYNCHRONIZATION MABs.
In maintenance job scheduling (Bar-Noy et al., 1998), as in our setting, each machine (arm) has an associated operating cost per time unit that increases since the previous maintenance, and performing a maintenance reduces this cost temporarily. However, the agent knows all arms’ cost functions and always observes the machines’ running costs.
Upadhyay et al. (2018) describe a model for maximizing a long-term reward that is a function of two general marked temporal point processes. This model is more general than
SYNCHRONIZATION MAB in some ways (e.g., not assuming cost process monotonicity) but allows controlling the policy process’s rate only via a policy cost regularization term and provides no performance guarantees.
Recharging bandits (Immorlica & Kleinberg, 2018), like
SYNCHRONIZATION MABs, have arms with non-stationary payoffs: the expected arm reward is a convex increasing function of time since the arm’s last play. In spite of this similarity, recharging bandits and other MABs with time-dependent payoffs (Heidari et al., 2016; Levine & Crammer, 2017; Cella & Cesa-Bianchi, 2020) make the common assumptions that a reward is generated only when an arm is played and that the agent observes all generated rewards. As a result, their analysis is different from ours. In general, payoff non-stationarity has been widely studied in two broad bandit classes. Restless bandits (Whittle, 1988) allow an arm’s reward distributions to change, but only independently of when the arm is played. Rested bandits (Gittins, 1979) also allow an arms’ reward distribution changes, but only when the arm is played. SYNCHRONIZATION MABs belong to neither class, since their arms’ instantaneous costs change both independently and a result of arms being played.
The goal of our experiments was to evaluate (1) the relative performance of MIRRORSYNC and ASYNCMIRRORSYNC, given that MIRRORSYNC assumes “free” arm resets at the beginning of each round and ASYNCMIRRORSYNC doesn’t, and (2) the relative performance of ASYNCMIRRORSYNC and its version with projected stochastic gradient descent (PSGD) instead of mirror descent, which we denote as ASYNCPSGDSYNC. The choice of mirror descent instead of PSGD was motivated by the intuition that with mirror descent MIRRORSYNC would achieve lower regret than with PSGD (see Section 5). In the experiments, we verify this intuition empirically. Before analyzing the results, we describe the details of our experiment setup.
Problem instance generation. Our experiments in Figures 1 and 2 were performed on SYNCHRONIZATION MAB instances generated as follows. Recall from Sections 2 and 3 that a SYNCHRONIZATION MAB instance is defined by:
• , the lowest allowed arm play rate
• , the highest allowed arm play rate
• B, the highest allowed total arm play rate
• K, the number of arms
, a set of cost-generating processes — time- dependent distributions of instantaneous costs, one process for each arm k.
For all problem instances in the experiments, and K were as in Table 1 in Appendix B. The set
of cost-generating processes was constructed randomly for each instance. In all problem instances, each arm had a distribution over time-dependent cost functions in the form of polynomials
was chosen at instance creation time and
sampled at run time as described in Appendix B. Note that MIRRORSYNC’s regret (Equation 9) depends on the cost cap U. While our polynomial cost functions are unbounded in general, they are bounded within the
constraint region we are using (Table 1 in Appendix B). Within this constraint region, these cost functions are equivalent to
where U = 40.
In Appendix C we also describe a different, Poisson process-based family of cost-generating processes, and present experimental results obtained on it in Figures 3 and 4 in that section. Despite that process family being very distinct from the polynomial one, the results are qualitatively similar to those in Figures 1 and 2.
Implementation details. We implemented MIR- RORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, along with a problem instance generator that constructs SYNCHRONIZATION MAB instances as above, in Python. The implementation, available at
Scheduling, relies on scipy.optimize.minimize for convex constrained optimization in order to update the play rates r (lines 20, 42 of Alg. 1, line 33 of Alg. 2). Other convex optimizers are possible as well. The experiments were performed on a Windows 10 laptop with 32GB RAM with 8 Intel 2.3GHz i9-9980H CPU cores.
Hyperparameter tuning. Running the algorithms required
Figure 1. MIRRORSYNC’s and ASYNCMIRRORSYNC’s convergence. The two algorithms exhibit nearly identical convergence behavior using tuned hyperparameters. However, ASYNCMIRRORSYNC works without assuming the “free” arm resets after arms’ parameter updates that MIRRORSYNC relies on.
Figure 2. Asynchronous version of MIRRORSYNC with mirror descent vs. with projected SGD. The use of mirror descent with the log barrier function in MIRRORSYNC was key to constructing the regret bounds in Section 5. Empirically, although ASYNCMIRRORSYNC and ASYNCPSGDSYNC eventually converge to similar-quality policies, ASYNCMIRRORSYNC discovers good policies faster, as MIRRORSYNC’s theory predicts.
choosing values for the following parameters:
• Learning rate As in other learning algorithms, choosing a good value for
for each of the three algorithms was critical for their convergence behavior.
• Length ASYNCMIRRORSYNC’s and ASYNCPSGDSYNC’s play rate updates. Recall that unlike MIRRORSYNC, which updates all play rates simultaneously after every
time units, ASYNCMIRRORSYNC and ASYNCPSGDSYNC update the model parameters according to a user-provided schedule. While the schedule doesn’t necessarily have to be periodic, in the experiments it was, with
being the inter-update interval length. Intuitively,
influences the average number of arms updated during each update attempt and the variance of gradient estimates: the larger
, the more gradient samples ASYNCMIRRORSYNC and ASYNCPSGDSYNC can be expected to average for the upcoming update (line 28 of Alg. 2). In this respect,
’s role resembles that of minibatch size in minibatch SGD.
• Exploration parameter Theory provides a horizondependent guidance for setting
for MIRRORSYNC (Corollary 1) but not for ASYNCMIRRORSYNC and ASYNCPSGDSYNC. For comparing relative convergence properties of MIRRORSYNC, ASYNCMIRRORSYNC, and ASYNCPSGDSYNC, we fixed
for all of them.
ASYNCMIRRORSYNC’s and ASYNCPSGDSYNC’s performance is determined by a combination of so we optimized them together using grid search. Please see Appendix B for more details.
Experiment results. Figures 1 and 2 compare the performance of MIRRORSYNC vs. ASYNCMIRRORSYNC and ASYNCMIRRORSYNC vs. ASYNCPSGDSYNC, respectively. The figure captions highlight important patterns we observed. The plots were obtained by running the respective pairs of algorithms on 150 problem instances generated as above, measuring the policy cost J after every update, and
averaging the resulting curves across these 150 trials.
In each trial, all algorithms were run for the amount of simulated time equivalent to 240 MIRRORSYNC rounds (see Figure 1’s and 2’s x-axis). However, note that the number of updates performed by ASYNCMIRRORSYNC and ASYNCPSGDSYNC was larger than 240. Specifically, a MIR- RORSYNC’s update round is always of length time units, but for ASYNCMIRRORSYNC and ASYNCPSGDSYNC it is
units, so for every MIRRORSYNC update round, they perform
updates. Although more frequent model updates is itself a strength of the asynchronous algorithms, their main practical advantage is independence of MIRRORSYNC’s “free arm resets” assumption.
This paper presented SYNCHRONIZATION MABs, a bandit class where all arms generate costs continually, independently of being played, and the agent observes an arm’s stochastic instantaneous cost only when it plays the arm. We proposed an online learning approach for this setting, called MIRRORSYNC, whose novelty is in estimating the policy cost gradient without directly observing the policy cost function and without having a closed-form expression for it. Moreover, we derived an adversarial regret bound for MIRRORSYNC without explicitly requiring the gradients to be bounded. We also presented ASYNCMIRRORSYNC, a practical version of MIRRORSYNC that lifts the latter’s idealizing assumptions. The key insight behind all these contributions is that the use of mirror descent for policy updates in SYNCHRONIZATION MABs enables much faster convergence than gradient descent would. Our experiments confirmed this insight empirically.
Acknowledgments. We would like to thank Nicole Immorlica (Microsoft Research) and the anonymous reviewers for their helpful comments and suggestions regarding this work.
Azar, Y., Horvitz, E., Lubetzky, E., Peres, Y., and Shahaf, D. Tractable near-optimal policies for crawling. Proceedings of the National Academy of Sciences (PNAS), 2018.
Bar-Noy, A., Bhatia, R., Naor, J., and Schieber, B. Minimiz- ing service and operation costs of periodic scheduling. In SODA, pp. 11–20, 1998.
Bright, L., Gal, A., and Raschid, L. Adaptive pull-based policies for wide area data delivery. ACM Transactions on Database Systems (TODS), 31(2):631–671, 2006.
Bubeck, S. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning, 2016.
Cella, L. and Cesa-Bianchi, N. Stochastic bandits with delay-dependent payoffs. In AISTATS, 2020.
Cho, J. and Garcia-Molina, H. Synchronizing a database to improve freshness. In ACM SIGMOD International Conference on Management of Data, 2000.
Cho, J. and Garcia-Molina, H. Effective page refresh poli- cies for web crawlers. ACM Transactions on Database Systems, 28(4):390–426, 2003a.
Cho, J. and Garcia-Molina, H. Estimating frequency of change. ACM Transactions on Internet Technology, 3(3): 256–290, 2003b.
Cho, J. and Ntoulas, A. Effective change detection using sampling. In VLDB, 2002.
Gal, A. and Eckstein, J. Managing periodically updated data in relational databases. Journal of ACM, 48:1141–1183, 2001.
Gittins, J. C. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), 41(2), 1979.
Heidari, H., Kearns, M., and Roth, A. Tight policy regret bounds for improving and decaying bandits. In AISTATS, 2016.
Immorlica, N. and Kleinberg, R. Recharging bandits. In FOCS, 2018.
Kolobov, A., Peres, Y., Lu, C., and Horvitz, E. Staying up to date with online content changes using reinforcement learning for scheduling. In NeurIPS, 2019a.
Kolobov, A., Peres, Y., Lubetzky, E., and Horvitz, E. Op- timal freshness crawl under politeness constraints. In SIGIR, 2019b.
Levine, N. and Crammer, K. Rotting bandits. In NIPS, 2017.
Nemirovsky, A. and Yudin, D. Problem complexity and method efficiency in optimization. Wiley, 1983.
Robbins, H. Some aspects of the sequential design of exper- iments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
Upadhyay, U., De, A., and Gomez-Rodriguez, M. Deep re- inforcement learning of marked temporal point processes. In NeurIPS, 2018.
Upadhyay, U., Busa-Fekete, R., Kotlowski, W., Pal, D., and Szorenyi, B. Learning to crawl. In AAAI, 2020.
Whittle, P. Restless bandits: Activity allocation in a chang- ing world. Applied Probability, 25(A):287–298, 1988.
Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. Optimal crawling strategies for web search engines. In WWW, 2002.
A. Proofs
Proposition 1 For a given K-armed SYNCHRONIZATION BANDIT instance (Problem 1), the optimal periodic policy has the same cost
as the optimal general deterministic open-loop policy under the same constraints.
Before proving this result, we formalize its statement. Consider the class of all deterministic open-loop policies:
where , denoting synchronization play, and
is a time when arm k is supposed to be played.
Recall that for a schedule and a horizon
is the index of
’s largest time point within the horizon H. We define the cost
of an arm’s schedule
as in Equation 4 and let the cost of a policy
For an arm
i.e., the limit of the minimum average cost for arm k achievable by pulling arm k at no higher than some average rate . We formulate the analog of Problem 1 for policy class
in the known-parameter setting as follows:
Problem 2.
where is as in Equation 12. We can now re-state Proposition 1 formally:
Lemma. For a given K-armed SYNCHRONIZATION BANDIT instance, the periodic policy that optimally solves Problem 1 also optimally solves Problem 2.
Proof. The proof proceeds as follows. First, we derive an expression for , the average cost of the optimal finite-horizon schedule that pulls arm k at no higher than rate
. Then, noting that
, we derive an expression for
). Finally, using this expression, we show that the policy
that optimally solves Problem 2 is periodic and hence is also is a solution to Problem 1. But every solution of Problem 1 is also a solution of Problem 2, so
is optimal for Problem 1 as well. Since, by convexity, Problem 1 has a unique optimal solution, the will follow.
Formally, letting be the number of schedule
’s arm plays up to time H, we define
where
Here, as defined at the beginning of Section 3, is the time interval between schedule
-th and
-th arm play (
Now, note that for a fixed and
is convex increasing in each
. This follows because each
is convex increasing, which, in turn, follows from
’s definition because
is positive non-decreasing by Assumption 3. Therefore, applying the method of Lagrange multipliers shows that, for a fixed
, the minimizer of
is the schedule
, and we can rewrite Equation 15 as
Thus, if the number of arm plays is given, the best finite-horizon schedule is periodic. Moreover, Equation 16 implies that
is monotonically decreasing in
for a fixed H, because for any
, increasing the number of arms plays by 1 always helps reduce the best schedule cost:
Equality 17 was obtained via algebraic manipulations. Inequality 18 follows because the interval can be bro-
, we have
, establishing the final inequality 19.
Crucially, being monotonically decreasing in
means that the r.h.s. of Equation 14 is minimized when
as large as possible given the consraint
, i.e., when
into Equation 16 and substituting Equation 16 into Equation 14, we can rewrite Equation 14 as
Also, notice from Equations 12 and 14 that
This implies that
Substituting this expression into Equation 13 reveals that Problem 2 is identical to Problem 1, so the optimal deterministic periodic policy under Problem 1’s constraints is also optimal for Problem 2 over all deterministic open-loop policies.
Lemma 1 For any policy
J
Proof. First we show that the form of J(r) is as stated in the theorem, and then we show its monotonic decrease and convexity.
Recall from Section 3 that for arm k’s schedule is a possibly infinite sequence of time points when the arm is to be played, and for a finite horizon
is the index of schedule
’s largest time point not exceeding H. For convenience we let
Consider a cost function for a periodic schedule for arm k, where arm k is played at a rat implies that it has the form
and hence
proving the first part of the lemma.
To show J’s convexity, we compute its Hessian:
noting that
Crucially, by Assumption 3, is non-decreasing, so
and
. Thus, J’s Hessian is positive semidefinite, implying that J is convex.
To see that J is monotonically decreasing in each
Lemma 2. For a rate vector and a probability
, suppose the agent plays each arm in sync mode
time after that arm’s previous sync-mode play, observing a sample of instantaneous cost
. Suppose also that in addition, with probability
independently for each arm k, the agent plays arm k in probe mode at time
after that arm’s previous sync-mode play, observing a sample of instantaneous cost
. Then for each k,
is an unbiased estimator of
Proof. We need to ensure that . By definition of
Similarly, by definition of
The last line follows because , since, by Assumption 1,
. Thus,
Lemma 3. The exploration gap is bounded by
i.e. the set has the same constrain as
but the range of
. Recall and define
where the last line follows from the negativity of all gradients. We note that implies
, because otherwise one of the extreme points violates the K.K.T. conditions listed below. The gradients are monotonically increasing, so we have
Now we bound the gap between and
. The functions
are convex and monotonically decreasing according to Lemma 1. By convexity and Cauchy-Schwarz, it holds that
Bounding the gradient norm. We show that there exists
(i) follows directly from the fact that are monotonically decreasing functions, so all gradients are negative and the infinity norm is obtained by the smallest value.
(ii) follows from the K.K.T. conditions of the extreme point , which read:
If the set is not empty, then
must lie in that set. If
, the statement is trivial. Otherwise if there is no coordinate of value
and at least one larger than
, we can simply choose
as
since the gradient is equal to c.
Note that is monotonically decreasing in
due to convexity and negativity of the gradients. Furthermore
by definition.
Bounding the From the K.K.T. conditions, we can directly infer that
, or otherwise the K.K.T. conditions for the extreme point
are violated. Therefore
Combining everything finishes the proof.
Preliminaries for the proof of Theorem 5.1. For the proof, we require the following theorem and lemma
Theorem A.1 (Bandit Algorithms. Theorem 28.4). Let and F be Legendre with domain D and
be a nonempty convex set with
. Let
be the actions chosen by mirror descent, which are
assumed to exist. Furthermore assume that for any
the regret of mirror descent is bounded for any
Lemma 4. For any
which concludes the proof.
Theorem 5.1. For any sequence of convex functions and learning rate
, the in-policy regret of MIRRORSYNC is bounded by
Proof. Since the functions are convex, we have
Furthermore the loss estimators are conditionally independent and unbiased, so
We verify that we can apply Theorem A.1. Recall our potential is with domain
.
The convex conjugate is with interior domain
which completes the requirements. By Theorem A.1
Now we bound the Bregman divergence terms.
Denote the indicator of having used the exploration for estimating the gradient
in round T . by the definition of the gradient estimator, it is bounded by
Combining everything completes the proof
B. Details of Empirical Evaluation
Constructing polynomial cost-generating processes. The cost-generating processes used in the experiments in Figures 1 and 2 were constructed and operated as follows:
• During a sync-mode play of arm k, a function is randomly chosen until the next sync-mode play by sampling
is a parameter and
is both problem-instance-and arm-specific. To construct a problem instance, our generator randomly chose
for each k at problem creation time. The noise parameter in our experiments was shared by all problem instances and all arms, its value as in Table 1.
• is a parameter chosen by our problem generator at the instance creation time from the sigmoid prior
Uniform[0, 1]), where the scaling parameter in our experiments is shared by all problem instances and all arms, with its value as in Table 1. Thus,
are more likely.
Hyperparameter tuning. We performed a grid search with a fixed random number generator seed, 0, on problems generated
Table 1. Problem generator parameters for results in Figures 1 and 2.
using parameters in Table 1. Namely, for each considered parameter combination, we ran each algorithm once for a fixed number of update rounds equal to , where horizon = 240 on the problem setup described in Section 8, and generated curves similar to those in Figures 1 and 2. For MIRRORSYNC,
was always 1. For ASYNCMIRRORSYNC and ASYNCPSGDSYNC,
consistently caused both to either be very unstable or diverge (independently of
would cause them to update arm play rates less frequently than MIRRORSYNC, so we didn’t consider
during grid search. We inspected the resulting curves and for each algorithm chose a parameter combination whose curve showed convergence to the lowest cost J at the end but remained stable (showed little jitter). Please refer to the implementation at https://github.com/microsoft/Optimal-Freshness-Crawl-Scheduling for specific parameter combinations over which we performed the grid search.
The chosen parameter combinations that we used to get the results in Figures 1 and 2 are:
C. Additional Results
Besides the polynomial cost-generating processes in Section 8, we experimented with Poisson process-based costs of the kind common in the Web crawling literature (Cho & Garcia-Molina, 2003a; Azar et al., 2018; Kolobov et al., 2019a;b). Here, each arm’s cost-generating process is driven by an associated Poisson point process
, whose rate
sampled from Uniform[0.005, 5.0] at problem generation time. For each arm, we let
has generated at least one event in time
since the latest play of arm k, and 0 otherwise. In Web crawling,
models the number of meaningful changes on a Web page since it was last crawled. For problem instances in this experiment, we used problem generation parameters in Table 2.
Table 2. Problem generator parameters for results in Figures 3 and 4.
The algorithms’ hyperparameter values were chosen using the same procedure as described in Appendix B, in combination with
Figure 3. MIRRORSYNC’s and ASYNCMIRRORSYNC’s convergence on the problem with Poisson process-based cost-generating processes. Like in Figure 1, the two algorithms exhibit very similar convergence behavior, despite ASYNCMIRRORSYNC not relying on free arm resets as MIRRORSYNC does.
Figure 4. ASYNCMIRRORSYNC vs. ASYNCPSGDSYNC. As in the experiment with polynomial cost-generating processes, ASYNCMIRRORSYNC arrives at good policies faster than ASYNCPSGDSYNC, but the former’s advantage is less pronounced than in Figure 2.
The experiment’s results are shown in Figures 3 and 4. Qualitatively, they resemble those for polynomial cost-generating processes in Figures 1 and 2 in Section 8: ASYNCMIRRORSYNC’s empirical converge rate approximately matches that of MIRRORSYNC (which, however, makes an unrealistic assumption about free arm pulls at the start of each round) and exceeds that of ASYNCPSGDSYNC, although the advantage of ASYNCMIRRORSYNC over ASYNCPSGDSYNC is less pronounced than in the first experiment.