Graph Clustering Bandits for Recommendation

2016·Arxiv

Abstract

Abstract

We investigate an efficient context-dependent clustering technique for recommender systems based on exploration-exploitation strategies through multi-armed bandits over multiple users. Our algorithm dynamically groups users based on their observed behavioral similarity during a sequence of logged activities. In doing so, the algorithm reacts to the currently served user by shaping clusters around him/her but, at the same time, it explores the generation of clusters over users which are not currently engaged. We motivate the effectiveness of this clustering policy, and provide an extensive empirical analysis on real-world datasets, showing scalability and improved prediction performance over state-of-the-art methods for sequential clustering of users in multi-armed bandit scenarios.

1 Introduction

Exploration-exploitation techniques a.k.a. Bandits are becoming an essential tool in modern recommenders systems [9, 12]. Most recommendation setting involve an ever changing dynamic set of items, in many domains such as news and ads recommendation the item set is changing so rapidly that is impossible to use standard collaborative filtering techniques. In these settings bandit algorithms such as contextual bandits have been proven to work well [10] since they provide a principled way to gauge the appeal of the new items.

Yet, one drawback of contextual bandits is that they mainly work in a content-dependent regime, the user and item content features determine the preference scores so that any collaborative effects (joint user preferences over groups of items) that arise are being ignored. Incorporating collaborative effects into bandit algorithms can lead to a dramatic increase in the quality of recommendations. In bandit algorithms this has been mainly done by clustering the user. For instance, we may want to serve content to a group of users by taking advantage of an underlying network of preference relationships among them. These preference relationships can either be explicitly encoded in a graph, where adjacent nodes/users are deemed similar to one another, or implicitly contained in the data, and given as the outcome of an inference process that recognizes similarities across users based on their past behavior.

To deal with this issue a new type of bandit algorithms has been developed which work under the assumption that users can be grouped (or clustered) based on their selection of items e.g. [8, 11]. The main assumption is that users form a graph where the edges are constructed based on signals of similar preference (e.g. a number of similar item selections). By partitioning the graphs, one can find a coherent group of users with similar preference and online behavior. While these algorithms have been proven to perform significantly better then classical contextual bandits, they ”only” provide exploration on the set of items. For new users or users with sparse activity it is very difficult to find an accurate group assignment. This is a particularly important issue since most recommendation settings face unbalancedness of user activity levels, that is, relatively few users have very high activity while the vast majority of users have little to practically no activity at all (see Figure 2 in Section 4).

In this work, we introduce a new bandit algorithm that adds an extra exploration component over the group of users. In addition to the standard exploration-exploitation strategy over items, this algorithm explores different clustering assignments of new users and users with low activity. The experimental evaluation on four real datasets against baselines and state-of-the-art methods confirm that the additional dynamic paradigm tends to translate into solid performance benefits.

2 Learning Setting

As in previous work in this literature (e.g., [8, 4]), we assume that user behavior similarity is represented by an underlying (and unknown) clustering over the set of users. Specifically, if we let U = {1, . . . , n} be the set of n users, we assume U can be partitioned into a small number m of clusters where m is expected to be much smaller than n. The meaning of this clustering is that users belonging to the same cluster tend to have similar behavior, while users lying in different clusters have significantly diverging behavior. Both the partition and the common user behavior within each cluster are unknown to the learning system, and have to be inferred on the fly based on past user activity.

The above inference procedure has to be carried out within a sequential decision setting where the learning system (henceforth “the learne”) has to continuously adapt to the newly received information provided by the users. To this effect, the learning process is divided into a discrete sequence of rounds: in round t = 1, 2, . . . , the learner receives a user index to serve content to. Notice that the user to serve may change at every round, though the same user can recur many times. The sequence is exogenously determined by the way users interact with the system, and is not under our control. In practice, very high unbalancedness levels of user activity are often observed. Some users are very active, many others are (almost) idle or newly registered users, and their preferences are either extremely sporadic or even do not exist [9]. Along with , the system receives in round t a set of feature vectors representing the content which is currently available for recommendation to user . The learner picks some to recommend to , and then observes ’s feedback in the form of a numerical payoff . In this scenario, the learner’s goal is to maximize its total payoff over a given number T of rounds. When the user feedback the learner observes is only the click/no-click behavior, the payoff can be naturally interpreted as a binary feedback, so that the quantity

becomes a click-through rate (CTR), where if the recommended item was clicked by user , otherwise. In our experiments (Section 4), when the data at our disposal only provide the payoff associated with the item recommended by the logged policy, CTR is our measure of prediction accuracy. On the other hand, when the data come with payoffs for all possible items in , then our measure of choice will be the cumulative regret of the learner,1 defined as follows. Let be the payoff associated in the data at hand with item . Then the regret of the learner at time t is the extent to which the payoff of

the best choice in hindsight at user exceeds the payoff of the learner’s choice, i.e.,

Notice that the round-refers to the behavior of the learner when predicting preferences of user thus the cumulative regret takes into duly account the relative “importance” of users, as quantified by their activity level. The same claim also holds when measuring performance through CTR.

3 The Algorithm

Our algorithm, called Graph Cluster of Bandits (GCLUB, see Figure 1), is a variant of the Cluster of Bandits (CLUB) algorithm originally introduced in [8]. We first recall how CLUB works, point out its weakness, and then describe the changes that lead to the GCLUB algorithm.

The CLUB algorithm maintains over time a partition of the set of users U in the form of connected components of an undirected graph whose nodes are the users, and whose edges our current belief about user similarity. CLUB starts off from a randomly sparsified version of the complete graph (having O(n log n) edges instead of ), and then progressively deletes edges based on the feedback provided by the current user . Specifically, each node i of this graph hosts a linear function which is meant to estimate the payoff user i would provide had item x been recom- mended to him/her. The hope is that this estimates gets better and better over time. When the current user is , it is only vector that is updated based on ’s payoff signal, similar to a standard linear bandit algorithm (e.g., [2, 5, 1, 4]) operating on the context vectors contained in . Every user hosts such a linear bandit algorithm. The actual recommendation issued to is computed as follows. First, the connected component (or cluster) that belongs to is singled out (this is denoted by in Figure 1). Then, a suitable aggregate prediction vector (denoted by in Figure 1) is constructed which collects information from all uses in that connected component. The vector so computed is engaged in a standard upper confidence-based exploration-exploitation tradeff to select the item to recommend to user

Once is received, gets updated to ). In turn, this may change the current cluster structure, for if was formerly connected to, say, node and, as a consequence of the update is no longer close to , then this is taken as a good indication that cannot belong to the same cluster, so that edge gets deleted, and new clusters over users are possibly obtained.

The main weakness of CLUB in shaping clusters is that when responding to the current user feedback, the algorithm operates only locally (i.e., in the neigborhood of ). While advantageous from a computational standpoint, in the long run this has the severe drawback of overfocusing on the (typically few) very active users; the algorithm is not responsive enough to those users on which not enough information has been gathered, either because they are not so active (typically, the majority of them) or because they are simply newly registered users. In other words, in order to make better recommendations it’s worthy to discover and capture the “niches” of user preferences as well.

Since uneven activity levels among users is a standard pattern when users are many (and this is the case with our data, too — see Section 4), GCLUB complements CLUB with a kind of stochastic exploration at the level of cluster shaping. In every round t, GCLUB deletes edges in one of two ways: with independent high probability , GCLUB operates on component as in CLUB, while with low probability r the algorithm picks a connected component uniformly at random among the available ones at time t, and splits this component into two subcomponents by invoking a fast graph clustering algorithm (thereby generating one more cluster). This stochastic choice is only made during an initial stage of learning (when in GCLUB’s pseudocode), which we may think of as a cold start regime for most of the users. The graph clustering algorithm we used in our experiments was implemented through the Graclus software package from the authors of [7, 6]. Because we are running it over a single connected component of an initially sparsified graph, this tool turned out to be quite fast in our experimentation.

The rationale behind GCLUB is to add an extra layer of exploration-vs-exploitation tradeoff, which operates at the level of clusters. At this level, exploration corresponds to picking a cluster at random among the available ones, while exploitation corresponds to working on the cluster the current user belongs to. In the absence of enough information about the current user and his/her neighborhood, exploring other clusters is intuitively beneficial. We will see in the next section that this is indeed the case.

4 Experiments

In this section, we briefly describe the setting and the outcome of our experimental investigation.

4.1 Datasets

We tested our algorithm on four freely available real-world benchmark datasets against standard bandit baselines for multiple users.

LastFM, Delicious, and Yahoo datasets. For the sake of fair comparison, we carefully followed and implemented the experimental setup as described in [8] on these datasets, we refer the reader to that paper for details. The Yahoo dataset is the one called “ Yahoo 18k” therein.

MovieLens dataset. This is the freely available2 benchmark dataset MovieLens 100k. In this dataset, there are 100, 000 ratings from n = 943 users on 1682 movies, where each user has rated at least 20 movies. Each movie comes with a number of features, like id, movie title, release date, video release date, and genres. After some data cleaning, we extracted numerical features through a standard tf-idf procedure. We then applied PCA to the resulting feature vectors so as to retain at least 95% of the original variance, giving rise to item vectors of dimension d = 19. Finally, we normalized all features so as to have zero (empirical) mean and unit (empirical) variance. As for payoffs, we generated binary payoffs, by mapping any nonzero rating to payoff 1, and the zero rating to payoff 0. Moreover, for each timestamp in the dataset referring to user , we generated random item sets by putting into for which provided a payoff of 1, and then picking the remaining 24 vectors at random from the available movies up to that timestamp. Hence, each set is likely to contain only one (or very few) movie(s) with payoff 1, out of 25. The total number of rounds was T = 100, 000. We took the first 10K rounds for parameter tuning, and the rest for testing.

In Figure 2, we report the activity level of the users on the Yahoo and the MovieLens datasets. As evinced by these plots,3 such levels are quite diverse among users, and the emerging pattern is the same across these datasets: there are few very engaged users and a long tail of (almost) unengaged ones.

4.2 Algorithms

We compared GCLUB to three representative competitors: LinUCB-ONE, LinUCB-IND, and CLUB. LinUCB-ONE and LinUCB-IND are members of the LinUCB family of algorithms [2, 5, 1, 4] and are, in some sense, extreme solutions: LinUCB-ONE allocates a single instance of LinUCB across all users (thereby making the same prediction for all users – which would be effective in a few-hits scenario), while LinUCB-IND (“LinUCB INDependent”) allocates an independent instance of LinUCB to each user, so as to provide personalized recommendations (which is likely to be effective in the presence of many niches). CLUB is the online clustering technique from [8]. On the Yahoo dataset, we run the featureless version of the LinUCB-like algorithm in [4], i.e., a version of the UCB1 algorithm of [3]. The corresponding ONE and IND versions are denoted by UCB-ONE and UCB-IND, respectively. Finally, all algorithms have also been compared to the trivial baseline (denoted here as RAN) that selects the item within fully at RANdom.

We tuned the parameters of the algorithms in the training set with a standard grid search as in [8], and used the test set to evaluate predictive performance. The training set was about 10% of the test set for all datasets, but for Yahoo, where it turned out to be4 around 6.2%. All experimental results reported here have been averaged over 5 runs (but in fact variance across these runs was fairly small).

4.3 Results

Our results are summarized in Figure 3, where we report test set prediction performance. On LastFM, Delicious, and MovieLens, we measured the ratio of the cumulative regret of the algorithm to the cumulative regret of the random predictor RAN (so that the lower the better). On the Yahoo dataset, because the only available payoffs are those associated with the items recommended in the logs, we measured instead the ratio of Clickthrough Rate (CTR) of the algorithm to the CTR of RAN (so that the higher the better).

Whereas all four datasets are generated by real online web applications, it is worth remarking that these datasets are indeed quite different in the way customers consume the associated content. For instance, the Yahoo dataset is derived from the consumption of news that are often interesting for large portions of users, hence there is no strong polarization into subcommunities (a typical “few hits” scenario). It is thus unsurprising that on Yahoo (Lin)UCB-ONE is already doing quite well. This also explains why (Lin)UCB-IND is so poor (almost as poor as RAN). At the other extreme lies Delicious, derived from a social bookmarking web service, which is a many niches scenario. Here LinUCB-ONE is clearly underperforming. On all these datasets, CLUB performs reasonably well (this is consistent with the findings in [8]), but in some cases the improvement over the best performer between (Lin)UCB-ONE and (Lin)UCB-IND is incremental. On LastFM, CLUB is even outperformed by LinUCB-IND in the long run. Finally, GCLUB tends to outperform all its competitors (CLUB included) in all cases.

Though preliminary in nature, we believe these findings are suggestive of two phenomena: i. building clusters over users solely based on past user behavior can be beneficial; ii. in settings of highly diverse user engagement levels (Figure 2), combining sequential clustering with a stochastic exploration mechanism operating at the level of cluster formation may enhance prediction performance even further.

References

[1] Y. Abbasi-Yadkori, D. P´al, and C. Szepesv´ari. Improved algorithms for linear stochastic bandits. Proc. NIPS, 2011.

[2] P. Auer. Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2001.

[4] N. Cesa-Bianchi, C. Gentile, and G. Zappella. A gang of bandits. In Proc. NIPS, 2013.

[5] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Proc. AISTATS, 2011.

[6] I. Dhillon, Y. Guan, and B. Kulis. A fast kernel-based multilevel algorithm for graph clustering. In Proc. SIGKDD, 2005.

[7] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.

[8] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In Proc. ICML, 2014.

[9] A. Lacerda, A. Veloso, and N. Ziviani. Exploratory and interactive daily deals recommendation. In Proc. RecSys, 2013.

[10] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pages 661–670, 2010.

[11] T. T. Nguyen and H. W. Lauw. Dynamic clustering of contextual multi-armed bandits. In Proc. CIKM, 2014.

[12] L. Tang, Y. Jiang, L. Li, and T. Li. Ensemble contextual bandits for personalized recommendation. In Proc. RecSys, 2014.

Observe payoff Let

– If then pick at random index , and split subclusters by means of a standard graph clustering procedure. Delete all edges between

• Let be the resulting set of edges, set , and compute associated clusters

Figure 1: Pseudocode of the GCLUB algorithm.

Figure 2: User activity levels on the Yahoo (left) and the MovieLens (right) datasets. Users are sorted in decreasing order according to the number of times they provide feedback to the learning system. For the sake of better visibility, on the Yahoo dataset we truncated to the 2K most active users (out of 18K).

Figure 3: Results on the LastFM (top left), the Delicious (top right), MovieLens (bottom left), and the Yahoo (bottom right) datasets. On the first three plots, we display the time evolution of the ratio of the cumulative regret of the algorithm (“Alg”) to the cumulative regret of RAN, where “Alg” is either “GCLUB” (green), CLUB (blue), “LinUCB-IND” (red), or “LinUCB-ONE” (black). On the Yahoo dataset, we instead plot the ratio of Clickthrough Rate (CTR) of “Alg” to the Clickthrough Rate of RAN. Colors are consistent throughout the four plots.

designed for accessibility and to further open science