Internet advertising is one of the booming and rapidly increasing industry with revenue volume in billions of dollars [13]. The majority of the revenue generated by search engines like Google, Yahoo, and Bing comes from advertisements displayed on their platform. Typically, for any search query by a user, a search engine/the advertising platform, henceforth a center, displays ads along with the relevant results via an auction mechanism known as sponsored search auction (SSA). The fundamental difference between traditional advertising and Internet advertising is the payment model. In the former, the advertisers (henceforth agents), pay based on the number of impressions whereas in latter, the agents pay only if their ad receives a click. Thus, the probability of an ad getting clicked, referred to as click-through rate (CTR), plays a crucial role in SSA. The CTR of an ad is unknown to the center, but it can learn CTRs by displaying the ad repeatedly over a period of time. Each agent also has a private valuation for its ad, which represents its willingness to pay for a click. This valuation needs to be elicited from the agents truthfully.
In the absence of contexts, when the agents report their true valuations, we can model the problem as a Multi-Armed Bandit (MAB) problem [16] with agents playing the role of the arms. As the agents (arms) are strategic, they may misreport their valuations to maximize their utility. To elicit truthful bids from the agents, researchers have used Mechanism Design [2, 20]. However, such mechanisms are oblivious to the learning requirements and fail to avoid manipulations by the agents when learning is involved. In such cases, the researchers have modeled this problem as a MAB mechanism [7, 11, 12, 14]. The authors designed ex-post truthful mechanisms wherein the agents are not able to manipulate even when the random clicks are known to them.
Typically an individual user tends to click some specific ads more often than the other ads, which depends upon the profile of an individual user of the platform. In this work, we leverage this fact and use the profile of the user as features (for context) to personalize the ads to increase the number of clicks and hence, the social welfare. When the CTRs of the ads depend on the specific context at a particular round, we can model the problem as a Contextual MAB (ConMAB) problem [4, 17, 18, 1]. However, a naive implementation of ConMAB is not adequate in the presence of strategic agents.
To the best of our knowledge, contextual information in SSA is considered only in [12]. The authors proposed a novel, theoretically sound, deterministic, exploration-separated mechanism that offers strong game-theoretic properties. However, it faces multiple practical challenges: (i) it incurs high cost of learning (regret), (ii) the center needs to know the number of rounds for which it needs to execute SSA, and (iii) the initial rounds being free, a malicious agent may drop off after free rounds; in some cases, all the rounds could be free.
Contributions
In the presence of strategic agents, random context-arrivals, and stochastic clicks, our goal is to design a non-exploration-separated, ex-post truthful mechanism that (i) learns CTRs efficiently (minimizes regret), (ii) may not need prior knowledge of T, and (iii) does not have free rounds. We leverage popular algorithms LinUCB [18] and SupLinUCB [10] that perform well in estimating CTRs in the contextual setting to build our randomized mechanisms to avoid manipulations by strategic agents. In particular, our contributions are:
• We adapt LinUCB to design an ex-post monotone allocation rule ELinUCB-S for a single-slot SSA (Theorem 2). We further optimize ELinUCB-S by introducing batch level update to propose ELinUCB-SB and using resampling procedure by [6], we develop an ex-post truthful mechanism M-ELinUCB-SB, which is also ex-post individually rational (Theorem 3). Unlike existing ConMAB mechanism, M-ELinUCB-SB does not need to know T.
• For stronger theoretical guarantees, we adapt SupLinUCB to design an ex-post monotone allocation rule SupLinUCB-S for a single-slot SSA (Theorem 11). We prove that SupLinUCBS has regret log T) (Theorem 9) as against
log T) for the non-strategic settings; we attribute O(n) as price of truthfulness. Using resampling procedure, we develop M-SupLinUCB-S which is ex-post truthful and ex-post individually rational. M-SupLinUCB-S, however, needs to know T upfront.
• We study M-ELinUCB-SB and M-SupLinUCB-S with the existing mechanism M-Reg, on simulated data and provide empirical analysis. Empirically, M-ELinUCB-SB performs superior to M-SupLinUCB-S by large factors for less than million rounds.
First, we define our model and notation.
2.1 Model and Notation
There is a fixed set of agents N = {1, 2, . . . , n}, where each agent has exactly one ad to display and the center has one slot available for allocation. A contextual armed Multi-Armed Bandit (MAB) mechanism M proceeds in discrete rounds t = 1, 2, . . . , T. At each round t:
1. M observes a context [0, 1]
which summarizes the profile of the user arriving at round t.
2. Based on the history, , of allocations, observed clicks, and the context
chooses an agent
to display it’s ad.
3. A click is observed which is 1 if it gets clicked and 0 otherwise.
4. Mechanism M decides the positive payment to be made by the agent
to the center. The payment by any other agent is 0.
5. Update .
6. The mechanism then improves its arm-selection strategy with new observation (). No feedback is received for the agents that are not selected.
Each agent i is thus characterized by two quantities: (i) private valuation [0, 1], which represents the willingness to pay for the click received and is constant throughout the rounds, and (ii) click through rate (CTR) of its ad
[0, 1] which is an unknown parameter and is dependent on the context
. Each agent i submits the valuation of getting a click on its ad as bid
. We assume that the bids are constant across the rounds (since
’s are constant). We assume that the CTR of an agent i is linear in d-dimensional context
with some unknown coefficient vector
[18]. Thus, the problem reduces to learning the
dimensional vector
for each agent i. The probability of getting a click on the ad of agent i at any given round t is given as:
Thus, the expected valuation of agent i is . Let
be the bid vector of all the agents other than i and b denote the bid vector of all the agents. The utility of an agent i in round t with history
is given as:
and the utility of center is given as:
In this work, our aim is to maximize social welfare similar to [7, 14]. The social welfare at round t is evaluated as sum of utilities of the agents and the center and is given as:
When the CTRs are not known, the efficiency of any mechanism is measured by its rate of learning or regret. Thus, our goal reduces to design a mechanism M that minimizes the social welfare regret which is given as:
Here, (
) denote the highest expected valuation (based on bids) i.e.,
(
) =
(
. In the following section, we define game theoretic properties relevant to this work.
2.2 Game Theoretic Properties
A mechanism M = (A, P) (where A is the allocation rule and P is the payment rule) is ex-post truthful (formally called EPIC) if and only if A is ex-post monotone [19, 3]. That is, for all instances of possible click realizations and context-arrivals, by increasing its bid to , agent i should obtain at least same number of clicks at bid
if not more. Formally,
Definition 1. Ex-post monotonicity: Let ) denote total number of clicks on the ad of agent i in first t rounds. Then, A is ex-post monotone if for every possible sequence of context-arrivals and click realizations, for each agent
and two possible bids of
we have
Definition 2. EPIC: A mechanism M = (A, P) is said to be ex-post incentive compatible (EPIC) if by misreporting the bid, no agent can gain its total utility more than that it would have obtained by bidding truthfully, i.e.,
EPIC implies even if an agent has observed all the contexts and all the click realizations, it is in the agent’s best interest to report true valuation. Note that, if a mechanism does any randomization, ) is replaced by
)], where the expectation is taken w.r.t. randomization in the mechanism. Such a mechanism is still truthful for every realization of external randomness such as click realizations, context-arrivals.
One alternative notion of IC which may seem suitable in our model is dynamic IC [8]. We would like the reader to note that in our model, the bids, as well as valuations, are constant throughout the rounds, and not dependent on any round t. Thus, private information and communication with the mechanism are not dynamic. Hence, a strong game-theoretic property of ex-post IC is more apt in our model.
Definition 3. EPIR: A mechanism M = (A, P) is said to be ex-post individually rational (EPIR) if every agent has a non-negative utility with truthful bidding irrespective of the bids of other agents i.e., ,
The authors in [6] have shown the power of randomization in designing truthful mechanisms by proposing a randomized context-free MAB mechanism that is ex-post truthful and has regret ). Thus, by introducing randomness in the mechanism, they showed that it is possible to bypass the impossibility result in [7], which states that any deterministic, truthful MAB mechanism has to be exploration-separated and hence must suffer a regret of Ω(
). The main result of [6] involves designing the black box mechanism using the self-resampling procedure in Algorithm 1, which provides ex-post truthful and IR mechanism if given an ex-post monotone allocation rule (Theorem 1).
Theorem 1. (Theorem 4.5, [6]) Let A be ex-post monotone allocation rule. Applying the transformation in Algorithm 1 to A with parameter , we obtain a mechanism M such that M is EPIC and EPIR.
There is no previous work done in designing the ex-post monotone allocation rule in the contextual setting. Hence, we address this problem and design two allocation rules, which are ex-post monotone, though it has different properties.
Concerning ConMAB mechanisms for SSA, two works are closely related to our work [12] and [18]. The former considers the strategic agents in ConMAB [12] by proposing a mechanism that we will call M-Reg. The mechanism is exploration-separated [11, 7] which is deterministic and induces EPIC property. The regret achieved by this mechanism is quite high ) as compared to
) regret in the traditional ConMAB problem. The latter introduces LinUCB algorithm, which is particularly of interest to us. Hence we describe it below.
2.3 LinUCB
LinUCB [18] is a generic ConMAB algorithm that efficiently learns the CTR of an agent where the CTR model is linear in terms of context and the unknown parameters. The authors experimentally showed the efficacy of the algorithm in approximating the CTRs of news articles in news recommendation. Hence, we choose to adapt it to our setting. LinUCB is motivated by UCB [5] where upper confidence bound (UCB) is maintained for each agent. To capture the contextual information in ConMAB setting, LinUCB uses and
for each agent i, where
summarizes the information about contexts and
corresponding clicks. It maintains upper confidence bound (UCB) for each agent i as
where ˆ
and
is learning param- eter. At round t, the algorithm selects the agent
with the highest UCB ˆ
. The statistics for the selected agent
is updated as
,
+
, where
is the indicator variable of receiving click.
LinUCB was originally designed to estimate CTRs of news articles and hence does not capture strategic manipulations. Motivated by LinUCB, we build randomized EPIC mechanisms for SSA by developing an ex-post monotone allocation rule, ELinUCB-SB, and using the resampling procedure
(Algorithm 1) to design a randomized EPIC and EPIR mechanism M-ElinUCB-SB [6]. ELinUCBSB has linear regret. We present it here as the key ideas to adapt LinUCB to design truthful mechanisms are useful and carry forward when we design a more complicated mechanism, M-SupLinCUB-S, based on SupLinUCB by [10].
We first propose a single-slot allocation rule ELinUCB-S based on LinUCB. We next provide a further optimized algorithm ELinUCB-SB that incorporates mini-batch learning, which makes the algorithm efficient both in terms of regret and computation.
3.1 ELinUCB-S: LinUCB-Based Single-Slot SSA
ELinUCB-S (Algorithm 3) for single-slot allocation maintains a set of active agents . At each round, algorithm evaluates whether an agent should be retained in
or not. Once an agent is evicted from
, it can not be added back. For better understanding about the working of the algorithm, we virtually divide the LinUCB-S into 4 subroutines: i) Initialization (lines[1-7]) ii) Exploration (lines[11-20]) iii) Exploitation (lines[22-23]) iv) Elimination (lines[24-26]).
For each agent i, the algorithm maintains lower confidence bound (LCB) and upper confidence bound (UCB) as
and
respectively.
At each round t, the algorithm observes context . It determines the index of agent
whose turn is to display the ad based on round robin order, as stated in line[9]. The algorithm then checks if
. If it evaluates to true the algorithm runs Exploration subroutine else Exploitation. In Exploration subroutine the algorithm allocates the slot to
, observes click
and updates its parameters. The confidence bounds are updated if and only if the size of confidence interval decreases (line[18]). In Exploitation subroutine, the agent with the maximum estimated expected valuation among the agents in
is allocated the slot and observes click
. It is important to note that no parameter is updated during Exploitation subroutine which is crucial for the ex-post monotonicity property. At the end of each round, Elimination subroutine is executed which removes the agents
from
if UCB of agent j is less than LCB of any other agent in
.
The intuition driving the algorithm is after sufficient exploration the confidence interval becomes sufficiently small, hence the agents which are close to optimal continues to remain in and sub-optimal agents are eliminated.
3.2 Regret Analysis of ELinUCB-SB
Although the algorithm ELinUCB-S seems promising, the dynamic and varying nature of contexts and its arrival order may lead to the elimination of an optimal agent. Hence, it may continue to allocate sub-optimal agents in subsequent rounds leading to high regret on specific instances, which is evident from our simulation of the algorithm (Fig.1c). The updates in ,
depend upon the context in such a way that
is non-increasing and
is non-decreasing, as stated and proved in Claim 2. These updates being irreversible needs to be carefully handled to optimize regret. To counter this problem, we design ELinUCB-SB (Algorithm 4) in which we have introduced a subtle, yet important use of mini-batch. The algorithm ELinUCB-SB allocates an agent for bs number of rounds instead of one round. It follows similar rules for allocating agents and maintaining the active set
. It updates the bounds of agents by taking the average over the contexts arrived in bs rounds. Updating the bounds over the average of context after the completion of batch allocation handles the variance in contexts and its arrivals, thus reducing the regret significantly.
It can be shown that eventually, ELinUCB-SB will eliminate all but one arm. The remaining arm will be the dominant arm in most of the contexts. However, we can construct examples where this arm is not the best for at least one context, which has non-zero probability and thus leading to O(T) regret. However, the round number at which it happens is generally very high, which we validate experimentally. Even though ELinUCB-SB incurs linear regret theoretically, it performs well in experiments and has interesting monotonicity properties; the proofs we leverage while designing ex-post truthful ConMAB mechanism with sub-linear regret in the next section.
3.3 Monotonicity of ELinUCB-S
We now prove ex-post monotonicity property for the proposed allocation rule.
For a fixed sequence of context-arrivals , and click realization
, let
) be the set of active agents in the beginning of round t when agents bid b = (
). For each agent i, let
(b, t) and
(b, t) be the values of
and
in the round t and similarly when agents bid
. We prove ex-post monotonicity with the following claims.
Claim 1. For fixed context-arrivals and click realization
, let two bid vectors be b and
, then for each round t, if
), then:
and
are updated only in Exploration subroutine which is based on round-robin order and hence does not depend on bid. Thus, the claim follows from the fact that contexts and click realizations are fixed.
Claim 2. For a fixed bid vector b, and each agent , then for all (
) consecutive rounds
is non-decreasing and
is non-increasing.
Proof. From lines[15-20] of the Algorithm 3, we have: (
(
(
(
1). Hence the claim holds.
Claim 3. For any two bid vectors and b, where
=
and
), then
) holds.
Proof. The condition ) implies that if
), then
), hence satisfying the claim for i. For
, we will use induction on t. The claim trivially holds for t = 1. Let,
be the last round such that
) =
) and
+ 1)
=
+ 1). In this case, we prove that:
+ 1) =
+ 1).
Since, (
+ 1) =
(b, t + 1),
(
+ 1) =
(b, t + 1)
, and
(
+ 1)
(b, t + 1) from Claim 1. Thus,
(Since ) =
)). Hence,
+ 1). From, induction hypothesis,
s.t.
). We will now prove that
+ 1)
+ 1).
Consider any ): we will prove that if
+ 1), then
+ 1).
Since (
+1) =
(
+1). Also,
) and
(
+ 1) =
(
+ 1). Further,
such that
) but
),
) such that
(
+ 1)
(
+ 1) =
(
+ 1)
(
+ 1). Thus, max
(
+ 1)
max
(
+ 1). Thus,
+ 1) implies
Claim 4. For fixed context-arrivals, fixed click realizations, and fixed bids of the agents except i, that is, for a fixed , if
), then
) where b = (
) and
= (
);
.
Proof. Let 1 be the last round for i s.t. it is in active set with both bids. From Claim 1,
(
) =
(
(
) as
). As the context-arrivals, click realizations and bids of the remaining agents are fixed, if agent i becomes inactive with bid
then
The last line follows from Claim 3. Thus, i is also inactive with bid
Theorem 2. The allocation rule induced by ELinUCB-S (Algorithm 3) is ex-post monotone.
Proof. For a fixed context-arrivals , click realization
, bids of agents except i, i.e.,
and two possible bids
, let
and
be the last round for i s.t. i is in active set with bids
and
respectively. From Claim 4,
. Thus, i will receive more number of rounds with bid
as compared with bid
Proposition 1. The allocation rule induced by ELinUCB-SB (Algorithm 4) is ex-post monotone.
Proof. The difference between Algorithm 3 and Algorithm 4 is the introduction of batch allocation. In Algorithm 4 the unit of one round is equivalent to bs rounds in Algorithm 3. Hence, by replacing variable t with where
and
will satisfy all the claims (Claim 1-4). Thus, ELinUCB-SB (Algorithm 4) is still ex-post monotone.
M-ELinUCB-SB
We now propose the following mechanism M-ELinUCB-SB for the single-slot SSA. A mechanism is defined as M = (A, P). The outline of both mechanisms is defined in Mechanism 5. For both the mechanisms, we apply the resampling procedure [6] on the bids and the allocation in both cases are based on the modified bids, where is resampling parameter. For M-ELinUCB-SB, A is given by ELinUCB-SB. The payment P at round t, corresponding context
is given by
) if
= 1 and
(1
) if
1.
3.4 M-ELinUCB-SB: Game Theoretic Analysis
Theorem 3. M-ELinUCB-SB is ex-post incentive compatible (EPIC) and ex-post individually rational (EPIR) mechanism.
Proof. The result follows from Theorem 1 and by ex-post monotonicity of A defined in Algorithm 4.
As the mechanism with allocation rule in Algorithm 4 can incur linear regret, in this section, we propose a new ConMAB mechanism for SSA that achieves sub-linear regret. First, we explain how we adapt SupLinUCB [10] for SSA to derive an ex-post monotone allocation algorithm SupLinUCBS in the next subsection. In Section 4.2, we prove the regret bound on SupLinUCB-S. Then we prove the monotonicity of it. Finally, we design a truthful mechanism M-SupLinUCB-S.
4.1 SupLinUCB-S
Chu et al. [10] proposed SupLinUCB for contextual MAB settings with linear payoffs. First let us emphasize the major differences between SupLinUCB ([10]) and SupLinUCB-S (proposed here, Algorithm 6). (i) Chu et al. have considered a common to be learned across all the agent and for each round t the contexts are different for each agent whereas in our setting we have independent
to be learned for each agent i whereas the context across each agent is same. (ii) We have adapted their algorithm for auction setting such that it satisfies ex-post monotonicity property, which is necessary to design ex-post truthful mechanisms. Our algorithm is presented in 6.
In the next section, we prove the regret bounds on SupLinUCB-S, highlighting the steps which differ from regret analysis of SupLinUCB.
4.2 Regret Analysis of SupLinUCB-S
For convenience, let = (ˆ
+
) and 0
= 1
. For all round t, stage s and given context
(
) =
]. The regret analysis is along the similar lines with [10] with changes deemed necessary to incorporate in our setting. In Lemmas 4, 5, and 6, we need to work for each agent as we have different
s for different agents.
Lemma 4. (Lemma 2, [10]) For each ] and
, suppose
=
. Then, eigenvalues of
can be arranged so that
, for all j and
Lemma 5. (Lemma 3, [10]) Using notation in BaseLinUCB-S and assuming 2, we have
Lemma 6. (Lemma 4, [10]). For each ], each
], and any fixed sequence of feature vectors
, with
, the corresponding rewards
are independent random variables such that
] =
.
In our settings, rewards of the arms also have bid component which plays an important role, thus we need the following lemma.
Lemma 7. With probability 1 , for any
] and any
], the following hold:
1. bucb
[r
ucb
for all i
2. (
3.
Proof. From Lemma 15 of [4], we have, for all i. As
0, after multiplying with
inequality still holds, i.e.,
. From our assumption
1, hence the first part holds.
The lemma trivially holds for s = 1. For s > 1, ˆˆ
and from the algorithm it is clear that
and
. From part 1 of the lemma and using the above fact, for any
ˆ
we have
] and
. From definition,
]. Using this and above inequalities, agent
(
) will belong to ˆ
(i.e., will never be eliminated for context
) due to the rule defined in Line[14], Algorithm 6. Hence part 2 of the lemma is proved.
From Line[14] Algorithm 6, . Using part 1 of the lemma and above inequality the proof of part 3 follows.
Lemma 8. (Lemma 6, [10]) For all ] and
,
All the above can be summarized as the following theorem.
Theorem 9. SupLinUCB-S has regret ln T) with probability at least 1
if it is run
with
Proof. The proof is similar to the proof of Theorem 6 of [4] but requires additional terms to consolidate the difference between the algorithms, problem setting, and regret definition. We have restricted the learning during round-robin ordering only Lines[7-9] due to which we have an additional decision rule for agent selection as in Line[17]. Note that this additional rule was not in [10]. Hence the main challenge is to bound the number of rounds agent selection, which is done using this decision rule. (We refer it in our analysis.)
Let be the set of rounds in which the agent was selected in Lines[10-12]. Let
be the set of rounds agent was selected in Lines[16-17] and
=
.
Claim 5. At each stage = (
1)
.
For any stage s, let us take case of n consecutive rounds. Let us assume for each of the n rounds, selection of agent is done in Lines[16-17]. But note that selection of agent in this decision block is done if and only if there exist an agent k such that (where j is designated agent for the round) and
. But after n consecutive rounds each agent has got its designated round once (Line[4]). Hence, if for some agent
, then this agent k should be selected on its designated round. Hence our assumption of selection of agent in Lines[16-17] for n consecutive round is wrong. From above reasoning, it is clear that at least for one round out of n rounds, one of the agent must be selected at its designated round. Hence, at most for
1 rounds agent is selected in Line[16-17] out of n rounds, until condition in Line[10] is achieved. Thus, we can say that at each stage
= (
1)
.
Theorem 10. The allocation rule induced by SupLinUCB-S (Algorithm 6) is ex-post monotone.
Proof. The allocation rules Algorithm 3 and Algorithm 6 are similar in the way it learns and eliminates agents. Both algorithms learn only when a designated agent is selected based on round-robin ordering, and the elimination is based on bids, UCB and LCB estimates. The difference between elimination rules is the need for the width of an agent to reach threshold 2at stage s. Due to the above similarities, the proof follows on the similar lines of the proof of Theorem 2, and hence we skip it for ease of exposition.
M-SupLinUCB-S
The mechanism M-SupLinUCB-S follows the same structure as that of mechanism M-ELinUCB-SB. The only change is that the allocation rule A is given by SupLinUCB-S (Algorithm 6).
4.3 M-SupLinUCB-S: Game-Theoretic Analysis
Theorem 11. M-SupLinUCB-S is ex-post incentive compatible (EPIC) and ex-post individually rational (EPIR) mechanism.
Proof. The result follows from Theorem 1 and by ex-post monotonicity of A defined in Algorithm 6.
5.1 Data Preparation
Our simulated data follow the structure and information availability found in a real-world system. Typically, a center has access to user features such as gender, age, geographic features, device model, and behavioral categories (which summarizes the user’s past preferences), which constitute the context. Note that each of the stated features can be discretized. Considering the above facts, we created the corpus of users : with d = 4, for each feature, we randomly select 4 possible different values from 0 to 100, and then by taking all possible combination of features, we generated random 256 (4
) users. We normalize each
such that
[0, 1]
, with
= 1 and store these normalized contexts as a database
.
We then select uniformly at random from the context database
for each round to generate a stochastic context. For each agent (advertiser), we generate
([0, 1]
) and then normalize it s.t.
= 1. To simulate the clicks, at round t with the sampled
, we generate a click
from Bernoulli distribution with parameter
. We conduct experiments for 40 iterations, and for each iteration, we randomly generate a sequence of contexts from
for T = 10
rounds. We generate valuation of agent i for a click to be
sampled from uniform distribution [0, 1] and assume the agents bid truthfully; due to truthfulness properties of our mechanisms.
5.2 Results and Comparison
From our experimentation, we found learning parameter = 1 and batch size bs = 100 to be suitable for M-ELinUCB-SB. The metric of comparison between the mechanisms is regret, which is averaged over 40 iterations. Fig.1a compares the regret of M-Reg, M-ELinUCB-SB, and M-SupLinUCB-SB for n = 7. When n > 7, M-Reg becomes infeasible as the number of exploration rounds
exceeds the total number of rounds T, for T = 10
. In terms of regret, it is evident that both the mechanisms M-ELinUCB-SB and M-SupLinUCB-SB outperform M-Reg by a very large margin. Fig.1b highlights difference in experimental regret M-ELinUCB-SB and M-SupLinUCB-SB (it is zoomed version from Fig. 1a). We can see that M-ELinUCB-SB experimentally performs approximately 5 times better than M-SupLinUCB-SB; albeit the results are validated only on the randomly generated 256 contexts (
).
Though in theory, M-ELinUCB-SB has the worst regret, from simulations, the slope being very small, for reasonable values of T it outperforms M-Reg. Our experiments show that M-ELinUCB-SB and M-SupLinUCB-SB, both have nearly negligible regret as compared to M-Reg. Fig.1c compares the regret incurred by M-ELinUCB-SB with varying batch size 10, 25, 50, 75, 100, 125, 150} for
100k, 1000k}. From the figure, it is easy to infer the significant improvement in regret when we move from batch size bs = 1 to greater batch sizes. One may need to tune bs based on different experimental setup.
We believe that ours is the first attempt to design a non-exploration-separated ConMAB mechanism. We focused on designing ConMAB mechanisms for sponsored search auction (SSA). For a single-slot, we first designed LinUCB-based ex-post monotone allocation rule ELinUCB-S. We show that the introduction of batch size in ELinUCB-S significantly improves the regret while satisfying the ex-post monotone property. With this observation, we present ELinUCB-SB. Through simulations, we see that in practice, it performs better for regret; however, theoretically, it may
Figure 1: Regret comparisons
incur linear regret on carefully chosen contexts. To achieve sub-linear regret, we proposed another ex-post monotone allocation rule, SupLinUCB-S. We further extended these allocation rules to mechanisms, M-ELinUCB-SB, and M-SupLinUCB-SB satisfying EPIC and EPIR properties. We showed our mechanism performs significantly better than the existing mechanism M-Reg [12]. In summary, M-SupLinUCB-S is novel, truthful ConMAB mechanism that outperforms M-Reg in every aspect.
Although our mechanisms are randomized, they are game theoretically sound and scalable as compared to M-Reg. Further, in terms of regret, M-ELinUCB-SB and M-SupLinUCB-S outperforms M-Reg in experiments and theoretically M-SupLinUCB-S matches the regret in non-strategic setting up to a factor of O(n) which is the price of truthfulness. Though we presented mechanisms for single slot SSA, they can be generalized to multi-slot SSA using similar techniques. Along with SSA, this work can form a baseline for other applications such as crowdsourcing [9], smart grids [15], where similar setting arises to learn the stochastic parameters in the presence of strategic agents.
[1] Yasin Abbasi-Yadkori, D´avid P´al, and Csaba Szepesv´ari. Improved algorithms for linear stochastic bandits. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pages 2312–2320, USA, 2011. Curran Associates Inc.
[2] Gagan Aggarwal, Ashish Goel, and Rajeev Motwani. Truthful auctions for pricing search keywords. In Proceedings of the 7th ACM Conference on Electronic Commerce, EC ’06, pages 1–7, New York, NY, USA, 2006. ACM.
[3] A. Archer and ´E. Tardos. Truthful mechanisms for one-parameter agents. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer Science, FOCS ’01, pages 482–, Washington, DC, USA, 2001. IEEE Computer Society.
[4] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397–422, March 2003.
[5] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
[6] Moshe Babaioff, Robert D. Kleinberg, and Aleksandrs Slivkins. Truthful mechanisms with implicit payment computation. J. ACM, 62(2):10:1–10:37, May 2015.
[7] Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. Characterizing truthful multi- armed bandit mechanisms: Extended abstract. In Proceedings of the 10th ACM Conference on Electronic Commerce, EC ’09, pages 79–88, New York, NY, USA, 2009. ACM.
[8] Dirk Bergemann and Juuso V¨alim¨aki. The dynamic pivot mechanism. Econometrica, 78(2):771–789, 2010.
[9] Arpita Biswas, Shweta Jain, Debmalya Mandal, and Y. Narahari. A truthful budget feasible multi-armed bandit mechanism for crowdsourcing time critical tasks. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’15, pages 1101–1109, Richland, SC, 2015. International Foundation for Autonomous Agents and Multiagent Systems.
[10] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. Journal of Machine Learning Research - Proceedings Track, 15:208–214, 01 2011.
[11] Nikhil R. Devanur and Sham M. Kakade. The price of truthfulness for pay-per-click auctions. In Proceedings of the 10th ACM Conference on Electronic Commerce, EC ’09, pages 99–106, New York, NY, USA, 2009. ACM.
[12] Nicola Gatti, Alessandro Lazaric, and Francesco Trov`o. A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC ’12, pages 605–622, New York, NY, USA, 2012. ACM.
[13] IAB. Iab internet advertising revenue report. 2018 first half-year results., 2018.
[14] Shweta Jain, Sujit Gujar, Satyanath Bhat, Onno Zoeter, and Y Narahari. A quality assuring, cost optimal multi-armed bandit mechanism for expertsourcing. Artificial Intelligence, 254:44– 63, 2018.
[15] Shweta Jain, Balakrishnan Narayanaswamy, and Y. Narahari. A multiarmed bandit incentive mechanism for crowdsourcing demand response in smart grids. In Proceedings of the TwentyEighth AAAI Conference on Artificial Intelligence, AAAI’14, pages 721–727. AAAI Press, 2014.
[16] T Lai. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
[17] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS’07, pages 817–824, USA, 2007. Curran Associates Inc.
[18] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 661–670, New York, NY, USA, 2010. ACM.
[19] Roger B Myerson. Optimal auction design. Mathematics of operations research, 6(1):58–73, 1981.
[20] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V. Vazirani. Algorithmic Game Theory. Cambridge University Press, New York, NY, USA, 2007.