b

DiscoverSearch
About
My stuff
Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback
2020·arXiv
Abstract
Abstract

Recommendation from implicit feedback is a highly challenging task due to the lack of the reliable observed negative data. A popular and effective approach for implicit recommendation is to treat unobserved data as negative but downweight their confidence. Naturally, how to assign confidence weights and how to handle the large number of the unobserved data are two key problems for implicit recommendation models. However, existing methods either pursuit fast learning by manually assigning simple confidence weights, which lacks flexibility and may create empirical bias in evaluating user’s preference; or adaptively infer personalized con-fidence weights but suffer from low efficiency.

To achieve both adaptive weights assignment and efficient model learning, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational auto-encoder. The personalized data confidence weights are adaptively assigned with a parameterized neural network (function) and the network can be inferred from the data. Further, to support fast and stable learning of FAWMF, a new specific batch-based learning algorithm fBGD has been developed, which trains on all feedback data but its complexity is linear to the number of observed data. Extensive experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.

Recommender systems play an important role in many internet services. Since in practise most of recommender systems only have the implicit feedback (e.g. items consumed), recently research attention is increasingly shifted from explicit feedback (e.g. rating prediction) to implicit feedback. However, learning a recommender system from implicit feedback is more challenging. In implicit feedback scenarios, only positive feedback are observed, and the unobserved user-item feedback data (e.g. a user has not bought an item yet) are a mixture of real negative feedback (i.e. a user does not like it) and missing values (i.e. a user just does not know it). Existing methods address this problem by treating all the un-observed data as negative (dislike) but downweighting their confidence. Although it is reasonable, it poses two important questions: (1) How to assign confidence weights for each data? (2) How to handle the massive volume of the unobserved data efficiently?

While there is a rich literature on recommendation from implicit feedback, to our knowledge, all existing methods lack one or more desiderata. Most of these methods rely on the meticulous assignment of confidence weights to the data. For example, WMF (Hu, Koren, and Volinsky 2008) and eALS (He et al. 2016) give uniform or popularity-based confidence weights; BPR (Rendle et al. 2009), CDAE (Wu et al. 2016), or other sophisticated models downweight the contributions of the negative data in their heuristical (e.g. uniformly) negative sampling strategy. However, choosing these weights usually involves heuristic alterations to the data and needs expensive exhaustive grid search via crossvalidation. Furthermore, it is unrealistic for researchers to manually set flexible and diverse weights for millions of data. In practical scenarios, the data confidence weights may change for various user-item combinations. Some un-observed data can be attributed to user’s preference while others are the results of users’ limited scopes. To this end, exposure-based matrix factorization (EXMF) (Liang et al. 2016) has been proposed to downweight negative data automatically by predicting how a user knows a item. However, EXMF involves iterative inference of the exposure for each data, which is computationally expensive and also potentially suffers from overfitting. More recently, SamWalker (Chen et al. 2019b) adaptively assign confidence weights from users’ social contexts. However, the social network information is hard obtained in many recommender systems.

To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational auto-encoder (Kingma and Welling 2013) for both adaptive weights assignment and efficient learning. We first analyze EXMF from variational perspective and find that the variational posterior of user’s exposure acts as data con-fidence weights in learning users’ preference with MF. It is consistent with our intuitions. Only if the user is exposed to the item, he can decide whether to consume the item based on his preference. The data with larger expo-

sure are more reliable in deriving user’s preference. Further, we propose to replace confidence weights (variational posterior of user’s exposure) with a parameterized inference neural network (function), which reduces the number of the inferred parameters and is capable of capturing latent correlations between users’ exposure. In fact, the independent assumption of the exposure is not practical in real world. Typically, recent literatures (Palla et al. 2005; Zhou 2011) in social science claim that each of us belongs to some information-sharing communities. Naturally, users are exposed to these communities and thus their exposure exhibit correlations when they belong to common communities. Motivated by this point, a specific community-based inference neural network has been designed, which explores latent communities among users and infer users’ exposure from the communities that they belong to. It will capture more precise exposure and can mitigate the over-fitting problem. By optimizing the variational lower bound, both the inference neural network and MF can be learned from the data. Efficiently learning a recommendation model from im-

plicit feedback is also challenging since it requires to ac-

count for the large-scale unobserved data. Although stochastic gradient descent optimizer (SGD) with sampling strategy can be employed to speed up the inference, it usually suffers from low convergency and high gradient instability. Especially in the implicit recommendation task, the optimizer usually samples the uninformative data, which have low confidence and make limited contributions on gradient update (Chen et al. 2019b). The final recommendation performance will suffer. Instead, we turn to employ batch gradient descent optimizer (BGD), which computes stable gradient from all feedback data and usually converges to a better optimum. Unfortunately, the naive implement of BGD will suffer from low efficiency caused by the expensive full-batch gradient computation over all data. To address this problem, we develop a fast BGD-based learning algorithm fBGD for our FAWMF. With rigorous mathematical reasoning, massive repeated computations of original BGD can be avoided and the learning process can be accelerated. Notably, despite fBGD computes gradient over all data, its actual complexity is linear with the number of the unobserved positive data. Due to the sparsity of the implicit feedback data, fBGD

achieves a significant acceleration. We summarize our key contributions as follows:

We propose a FAWMF model based on variational auto-encoder to achieve both adaptive weights assignment and efficient model learning.

A batch-based learning algorithm fBGD has been developed to learn FAWMF efficiently and stably.

Extensive experimental evaluations on three well-known benchmark datasets demonstrate the superiority of our FAWMF over the existing implicit MF methods.

Recommendation from implicit feedback data. Most of the existing methods manually assign coarse-grained confi-dence weights. For example, the classic weighted factorization matrix model (WMF) (Hu, Koren, and Volinsky 2008), eALS (He et al. 2016), ICD-MF (Bayer et al. 2017), used a simple heuristic where the confidence weights are assigned uniformly or based on item popularity; Logistical matrix factorization (Johnson 2014), BPR (Rendle et al. 2009), or neural-based recommendation models (e.g. CDAE (Wu et al. 2016), NCF(He et al. 2017)) downweight the contribution of negative data implicitly in their heuristic negative sampling strategy (e.g. uniformly). More recently, a new probabilistic model EXMF(Liang et al. 2016) was proposed to incorporate user’s exposure to items into the CF methods. When inferring user’s preference, user’s exposure can be translated as data confidence. However, this method suffers from the efficiency problem. Also, some sophisticated models have been proposed to learn confidence weights from users’ social contexts (Wang et al. 2018; Jiawei et al. 2018; Chen et al. 2019b). However, the social information is not available in many cases.

Efficient learning algorithm for implicit recommendation models. To handle the large-scale unobserved data, two types of strategies have been proposed for efficient learning: sample-based learning and whole-data based learning. The first type achieves fast learning with stochastic gradient descent (SGD) and negative sampling. The most popular sampling strategy is to draw un-observed feedback data uniformly, which is widely adopted in LMF(Johnson 2014), BPR(Rendle et al. 2009), CDAE (Wu et al. 2016), NCF (He et al. 2017), Mult-DAE (Liang et al. 2018) etc. Also, (Yu, Bilenko, and Lin 2017), (Chen et al. 2017) and (Hern´andez- Lobato, Houlsby, and Ghahramani 2014) further propose item popularity-based and item-user co-bias sampling strategy. However, in recommendation task SGD usually suffers from slow convergence and gradient instability when the number of items is large (Chen et al. 2019b).

Thus, some other dynamic sampling strategies are proposed to improve convergence and accuracy. Some literatures (Rendle and Freudenthaler 2014; Yu et al. 2018; Park and Chang 2019; Wang et al. 2017) propose to oversample the “difficult” negative instances which are hard to be discriminated by the models. However, they will expose efficiency issues in sampling. Also, the stochastic gradient estimator is biased and may amplify the natural noise in users feedback data (Li et al. 2018). More recently, SamWalker (Chen et al. 2019b) conduct the random walk along the social network to select informative instances. Although effectively, it has two weakness: (1) SamWalker requires additional social information but it is not available in many recommender systems. (2) SamWalker still employs uniformly sampling strategy to update its dynamic sampler, leading to insufficient training of it.

A more effective and stable way is updating the model from the whole-data, but it face efficiency challenge due to the large number of the negative data. Thus, memorization strategies (e.g. ALS, eALS) has been proposed (Hu, Koren, and Volinsky 2008; He et al. 2016; Bayer et al. 2017; Chen et al. 2019a) to speed up learning. However, these algorithms are just suitable for the MF with simple manual confidence weights, which lacks flexible and may create empirical bias. The recommendation performance will suffer.

Adaptive weights assignment and efficient learning are both important in recommendation from implicit feedback. Existing methods are not able to provide both of these, which motivates the approach described in this paper.

In this section, we first give the problem definition of implicit recommendation. Then, we introduce exposure-based matrix factorization (EXMF) (Liang et al. 2016) framework from variational perspective to provide usual insight about the relation between user’s exposure and data confidence.

Problem definition

Suppose we have a recommender system with user set U (including n users) and item set I (including m items). The implicit feedback data is represented as  n×mmatrix X with entries  xijdenoting whether or not the user i has consumed the item j. The task of a recommender system can be stated as follow: recommending items for each user that are most likely to be consumed by him.

Exposure-based matrix factorization (EXMF)

Note that unobserved feedback data  {xij ∈ X : xij = 0}contain the real negative data (dislike) and the missing values (unknown). EXMF introduces a Bernoulli variable  aijto model users’ exposure:  aij = 1denotes that the user i knows the item j and  aij = 0denotes not. Then, EXMF models user’s consumption  xijbased on  aijas follow:

image

where  δ0denotes  p(xij = 0|aij = 0) = 1; ηijis the prior probability of exposure. Here we relax function  δ0as N(ε, λx)to make model more robust, where  εis a small constant (e.g.  ε=1e-5). When  aij = 0, we have  xij ≈ 0, since the user does not know the item and he can not consume it. When  aij = 1, when the user has learned the item, he will decide whether or not to consume the item based on his preference. Thus,  xijis modeled as the classic matrix factorization model, where  uidenotes the K-dimensional latent factor (preference) of user i, and  vjdenotes the K-dimensional latent factor (attribute) of item j.

Analyses of EXMF from variational perspective The marginal likelihood of EXMF is composed of a sum over each datapoint  log p(X) = �

can be rewritten as:

image

where  q(aij|xij)is defined as an approximated variational posterior of  aij. Since the second term – KL-divergence – is non-negative, the first term  L(θ, q; xij)is the evidence lower bound on (ELBO) the margin likelihood. Classic variational methods (Hoffman et al. 2013) usually employ conjugate variational distribution and individual variational parameters1, i.e.  q(aij|xij) = Bernoulli(γij). Then, optimizing EXMF can be transferred to minimize the following objective function w.r.t  ui, vj, γij:

image

Exposure as data confidence. The objective function consists of three parts: (1) Weighted matrix factorization to learn user’s preference; (2) The loss when the data are predicted as unknown; (3) The regularization term – KL divergency between the prior and the variational posterior. A good property is observed that the parameters  γij, which characterize the probability that user i is exposed to item j, act as the confidence of the corresponding data in learning user’s preference. It is consistent with our intuitions. Only if the user is exposed to the item, he can decide whether to consume the item based on his preference. The data with larger exposure are more reliable in deriving user’s preference.

Inefficiency problem. EXMF suffers efficiency problems. The number of inferred variational parameters (con-fidence weights)  γijgrows quickly with the number of users and items (n×m), which easily scale to billion or even larger level. It will potentially suffer from overfitting and become the time and space bottleneck for practise recommender systems. Further, the inference of EXMF requires to sum over the terms for each data, which is time-consuming.

To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational auto-encoder to achieve both adaptive weight assignment and efficient model learning. To reduce the number of the inferred parameters, FAWMF replaces individual parameters  γijwith a parameterized function  γij = gΦ(i, j, X)that maps users’ consumption on items into their exposure with parameters  Φ. It is more reasonable since there exists interactions between users. Independent assumption of exposure is not practical in real world. Typically, recent literatures (Palla et al. 2005) in social science claim that each of us belongs to some information-sharing communities. Users are exposed to these communities and thus their exposure exhibit commonality when they belong to similar communities. Motivated by this point, we summarize communities as collections of similar users and further infer users’ exposure

image

from the communities that they belong to. Concretely, the inference function can be modeled as follow:

image

where vector  θidenotes the membership that allocates each user i to a fixed D number of communities and meet θi ⪰ 0, |θi|1 = 1. We cumulate the consumptions of the users in the community to infer how items are exposed to the community, where  αkcaptures the heterogenous roles of users. Different users may have different influence strength on the communities (Shi et al. 2019a; Shi et al. 2019b). Then, a logistic linear function with parameters  wj, bjhas been employed to map cumulated consumptions into the exposure. Intuitively, the more influence users in the common community have consumed the item, the user are more likely to know it. Finally, user’s exposure can be depicted by how the user belongs to the communities and how items are exposed to these communities. Overall, the inference of the data confidence can be translated into learning the parameters of inference function  gΦ(i, j, X)(Φ ≡ {θ, α, w, b}), which captures latent correlations between  γijfor better estimation and reduces the number of inferred parameters from O(nm) to O(D(n + m)). Different from existing clustering-based methods (e.g. (Chen et al. 2015; Lee et al. 2016)) which cluster users into communities/submatrices based on users’ preference, our FAWMF deduces implicit communities based on users’ exposure.

Also, FAWMF can be well understood from neural network perspective. As shown in Figure 1, at first an outer product between each user’s consumption and his membership has been employed for interactions. The obtained tensor Z can be regarded as the influence of user’s consumptions along different dimensions (communities). Then, we use a specific striped CNN (Krizhevsky, Sutskever, and Hin- ton 2012) layer and a linear layer to encode the influences (consumptions) into the “kernel maps” of users’ exposure. Here we choose striped CNN since the adjacent elements of Z are just caused by the adjacent users(items) id and do not suggest they will have share more commonality. Further, the exposure for different user-item pairs can be depicted as the combination of the “kernel maps”. Finally, the exposure-based MF acts as a probabilistic decoder to predict users’ consumption. Overall, FAWMF forms an auto-encoder to learn both data confidence weights and latent factors of MF.

Discussion How does FAWNF mitigate overfitting? To better understand this effect, let us draw an analogy with the floating

image

The underlying correlation between instances (a) In EXMF (b) In FAWMF

Figure 2: Illustration of how FAWMF mitigates over-fitting

balls in the water, as illustrated in Figure 2. Learning EXMF model by optimizing equation (5) will give a force to push up these positive balls (instances) and push down these un-observed balls (instances). Thus, without strong priors, the confidence weights of the data easily achieve extremely values (γij ≈ 1for the positive instances and  γij ≈ 0for the unobserved instances), where unobserved data makes few contributions on training. In this situation, all the instances will be predicted as positive by MF and the model will suffer from over-fitting. In our FAWMF, this over-fitting effect will be mitigated by introducing inference network of the  γij. There exists correlations between users’ exposure, which can be analogies with elastic links lines between the balls. Typically, users tend to have similar exposure (γij), if they share more common communities. Naturally, with the training of the model, the unobserved data which have correlations with positive instances, will be pulled up due to the force from the lines. This way, when the model has well fit-ted the data, the positive and the unobserved instances will get stable at different depth in water. The unobserved instances which have stronger correlations with positive instances, usually reaches higher position than other unobserved instances.

Learning a recommendation model from implicit feedback is time-cost expensively due to the massive volume of the unobserved data. The popular solution addressing this problem is stochastic gradient descent (SGD) with negative sampling. However, SGD usually suffers from low convergency and gradient instability, which affects the performance of the model. Especially, in the recommendation task, the sampler usually select uninformative instances that have small confidence weights  γijand make limited contributions on update. Thus, we turn to use batch gradient descent optimizer (BGD), which computes stable gradient from all feedback data. Unfortunately, the orginal BGD will suffer from low efficiency due to the full-batch computation of the gradient over instances. To address this problem, we develop new specific learning algorithm fBGD for our FAWMF. We speed up learning of BGD by caching some specific intermediate variables to avoid the massive repeated computations. Here we detail the derivation of the gradient w.r.t  αk; where the counterpart of others (θi, wj, bj, ui, vj) is achieved likewise and presented in supplemental materials.

image

bj). Also, we drop out the regularization term in the objective function (Eq. (5)) since the regularization term usually makes no contribution on recommendation accuracy of FAWMF and hinders fast training. It also can be regarded that we set a large value of parameter  λx. Then, the gradient of objective function (Eq. (5)) w.r.t  αkfor each user k can be derived as follow:

image

Clearly, the computational cost lies in two parts: (1) the calculation of the gradients of J w.r.t  qjfor each item  j ∈ I, which requires the summation over the terms of each user i; (2) the summation of ∂J∂qj · ∂qj∂αkover each item  j ∈ Ito get final result. In fact, the first part produces the main cost since it requires a traversal of all users and repeat this operation over all items; while the second part can be accelerated by just iterating over these positive instances (xkj = 1). The overall computational complexity to update  {αk : k ∈ U}is O(nmKD), which is generally infeasible since nm can easily reach billion level or even higher in practise.

To speed up the learning, we rewrite the computational bottlenecks – Eq. (8) – by isolating item-independent terms:

image

By this reformulation, we can see that the major computation – the �i∈Uuikuilθiand �i∈Uθiover all users – is independent of item j. Thus, we could achieve a significant speed-up by caching these terms. That is, we cache M(q)kl∗ = �i∈Uuikuilθifor each  1 ≤ k ≤ K, 1 ≤ l ≤ Kand S(q) = �i∈Uθi, where  M(q)is a  K × K × D-dimensional tensor and  S(q)is a D-dimensional vector. Then, ∂J∂qjcan be calculated as follow:

image

The rearrangement of nested sums is the key transformation that allows the fast optimization. The time computation can be reduced from O(nmKD) to  O(K2D(n + m) +|X+|(K + D)), where  |X+|denotes the number of the observed positive data. Here we spend  O(K2Dn)time on calculating  M(q), S(q)and  O(K2Dm)time on calculating the first and second term in equation (11) for each item  j ∈ I.

Table 1: Statistics of three datasets.

image

Table 2: characteristics of the compared methods.

image

For the third term, we just sum over the terms with  xij = 1with complexity  O(|X+|(K + D)). Similar strategy can be used to accelerate the update of other parameters. With this algorithm, learning FAWMF model is efficient and stable, which trains on the all data but its complexity is linear to the number of the observed data.

In this section, we conduct experiments to evaluate the performance of FAWMF. Our experiments are intended to address the following major questions: (Q1) Does FAWMF outperform state-of-the-art implicit MF methods? (Q2) How does the proposed batch-based learning algorithm fBGD perform?

Experimental protocol

Datasets. Three benchmark datasets Moivelens 2, Amazon (food-reviews) 3, Douban 4 are used in our experiments. These datasets contain users’ feedback on items. The dataset statistics are presented in Table 1. Similar to (Chen et al. 2019b), we preprocess the datasets so that all items have at least three interactions and“binarize” user’s feedback into implicit feedback. That is, as long as there exists some user-item interactions (ratings or reviews), the corresponding implicit feedback is assigned a value of 1. Grid search and 5-fold cross validation are used to find the best parameters. In our FAWMF, we set  D=K=20, ε=1e-5 and learning rate=0.1 across all datasets. Compared methods. We compare FAWMF with follow-

ing baselines. Table 2 concludes their characteristics.

Item-pop: This is a simple baseline which recommends items based on global item popularity.

WMF(ALS) (Hu, Koren, and Volinsky 2008; Pan et al. 2008): The classic weighted matrix factorization model for implicit feedback data. The corresponding ALS-based

Table 3: The performance metrics of the compared methods. The boldface font denotes the winner in that column. The row ‘Impv’ indicates the relative performance gain of our FAWMF compared to the best results among baselines. ’†’ indicates that the improvement is significant with t-test at p < 0.05.

image

(Hu, Koren, and Volinsky 2008) algorithm can reduce inference complexity.

eALS (He et al. 2016): The improved weighted matrix factorization model, where the data confidence weights are assigned heuristically based on item-popularity.

BPR (Rendle et al. 2009): The classic pair-wise method for recommendation, coupled with matrix factorization. BPR implicitly downweights the unobserved data in their uniformly negative strategy.

CDAE (Wu et al. 2016): The advanced recommendation method based on Auto-Encoders, which is a generalization of WMF with more flexible components. CDAE also employs uniformly negative strategy to learn their model.

EXMF (Liang et al. 2016): A probabilistic model that directly incorporates user’s exposure to items into traditional matrix factorization.

Evaluation Metrics. We adopt four well-known metrics Precision@K (Pre@K), Recall@K (Rec@K), Normalized Discounted Cumulative Gain (NDCG@K) and Mean Reciprocal Rank (MRR) to evaluate recommendation performance: Recall@K (Rec@K) quantifies the fraction of consumed items that are in the top-K ranking list; Precision@K (Pre@K) measures the fraction of the top-K items that are indeed consumed by the user; NDCG@K and MRR evaluate ranking performance of the methods. Refer to our supplemental material for more details about these metrics.

Performance comparison (Q1) Table 3 presents the performance of the compared methods in terms of three evaluation metrics. The boldface font denotes the winner in that column. For the sake of clarity, the last row of Table 3 also show the relative improvements achieved by our FAWMF over the baselines. Generally speaking, with one exception, FAWMF outperforms all compared methods on all datasets for all metrics.

The improvement of FAWMF over these baselines can be attributed to three aspects: (1) In the real world, users have personalized communities and thus are exposed to diverse information. Correspondingly, the data confidence (exposure) will vary with different user-item combinations. That is, some unobserved feedback are more likely attributed to user’s preference while others are the results of users’ awareness. By adaptively learning fine-grained data confi-dence weights from the data, FAWMF achieves better performance than those baselines with manually coarse-grained confidence weights. (2) FAWMF models confidence weight with an community-based inference network, which is capable of capturing latent interactions between users and mitigating the over-fitting problems. It can bee seen from the better performance of FAWMF over EXMF. (3) Instead of employing sampling-based stochastic gradient descent optimizer (SGD), FAWMF employs a specific fast batch-based learning algorithm, which has better convergency and more stable results. We also conduct specific experiments in the next subsection to validate this point.

Running time comparisons. Figure 3 depicts running time of the six compared recommendation methods. As we can see, the speed up of our FAWMF over EXMF is significant. Especially in the largest dataset Douban, EXMF requires 56 hours for training, while FAWMF only takes 1.8 hours. The acceleration of FAWMF over EXMF can be attributed two aspects: (1) Confidence weights for each data have been modeled with a parameterized function, which reduces the number of learned parameters. (2) fBGD speed up gradients calculations. This way, FAMWF achieves similar analytical time complexity as other compared methods, which aim at fast learning but sacrifice the flexibility of the confidence weights. Their actual running time are also in the same magnitude. Also, we observe that these BGDbased methods (WMF, eALS, FAWMF) are relatively more efficient than SGD-based methods (BPR, CDAE), although they have similar analytical time complexity. It is caused by the poor convergency of SGD, which usually requires more iterations.

Effect of batch-based learning algorithm (Q2) In this subsection, we compare our fBGD with two popular SGD-based learning strategies: (1) Uniform-1X (Uniform-5X, Uniform-25X), which uniformly sample a part of unobserved instances to update the model. Note that the training time and prediction accuracy are largely determined by the size of negative samples. Here we test this learning strategy with three different sampling-size for comparisons, where Uniform-25X (Uniform-5X, Uniform-1X) denotes the number of the sampled instances is 25 (5, 1) times as large as the number of positive instances. (2) Itempop-1X (Itempop-5X,

image

Figure 4: Recommendation accuracy for different learning strategy versus the number of iterations and running time.

Itempop-25X) whose probability of sampling a un-observed data is proportional to the item popularity. We also present the performance of original BGD algorithm for running time comparisons. Here we do not compare with existing dynamic sampling strategies, since they either suffer from effi-ciency problems or require other side information.

Figure 4 presents pre@5 of FAWMF on dataset Movielens with different learning strategies versus the number of iteration and running time. There exists the efficiency and effectiveness trade-off of SGD-based learning algorithm with varying sampling-size. The smaller sampling-size will cause insufficient learning and gradient instability. It can be seen from the poor performance final and heavy fluctua-tion of Uniform-1X (Itempop-1X). On the contrary, when the sampling-size become larger, SGD performs better but spend much more time. However, even if the stochastic optimizer uses a large sampling-size, it can not achieve good performance as BGD-based learning algorithm. Also, we can observe the inefficiency of original BGD. Our proposed fBGD accelerates original BGD for 16 times and even performs more efficient than Itempop-1X. Overall, our proposed fBGD performs better than others in all convergence, speed and recommended performance.

Effect of the parameter K

Figure 5 shows the performance of FAWMF with varying latent dimension K in MF. We also present the results of two close MF methods for comparisons. First, with few exceptions, FAWMF consistently outperforms eALS and WMF across K, demonstrating the effectiveness of adaptively assigning personalized confidence weights. Second, all methods can be improved with a larger K. But a large K might have the risk of overfitting. It can be seen from the worse results when K = 60 on the dataset Movielens.

Case study We also conduct a case study to show the learned communities. We do statistical analyses on the communities, two

image

Figure 6: Case study of two communities: the top rows show top-5 items that have highest exposure to the users in the two communities; the bottom rows show the ratio of male/female and the average age of the users in each communities. Note that these side information is not used in model training.

of which are presented in Figure 6. We can find that the genres of the exposed items in the community 1 are mainly about “Horror” and “Thriller”, but the genres in the community 2 are mainly about “Comedy”. These results can be explained by the different kinds of users in the communities. Most of users in the community 1 are young men, who are more likely to enjoy these exciting movies and share these movies with each other. Meanwhile, comedy is a hot topic for the elderly people, who are the main constituents of the community 2. These results validate the effectiveness of our FAWMF. Although the genre information of users and items is not used in model training, the latent correlations between users/items can be captured. Also, we can find different users will belong to different communities and thus have different exposure. Personalized confidence weights are necessary for the implicit recommendation models.

In this paper, we present a novel recommendation method FAWMF based on variational auto-encoder to achieve both adaptive weights assignment and efficient model learning. On the one hand, FAWMF models data confidence weights with a community-based inference neural network, which reduces the number of inferred parameters and is capable of capturing latent interactions between users. On the other hand, a specific batch-based learning algorithm fBGD has been developed to learn FAWMF fast and stably. The experimental results on three real-world datasets demonstrate the superiority of FAWMF over existing implicit MF methods.

This work is supported by National Natural Science Foundation of China (Grant No: U1866602) and National Key Research and Development Project (Grant No: 2018AAA0

101503).

[Bayer et al. 2017] Bayer, I.; He, X.; Kanagal, B.; and Ren- dle, S. 2017. A generic coordinate descent framework for learning from implicit feedback. In International Conference on World Wide Web, 1341–1350. IW3C2.

[Chen et al. 2015] Chen, C.; Li, D.; Zhao, Y.; Lv, Q.; and Shang, L. 2015. Wemarec: Accurate and scalable recommendation through weighted and ensemble matrix approximation. In SIGIR, 303–312. ACM.

[Chen et al. 2017] Chen, T.; Sun, Y.; Shi, Y.; and Hong, L. 2017. On sampling strategies for neural network-based collaborative filtering. In International Conference on Knowledge Discovery and Data Mining, 767–776. ACM.

[Chen et al. 2019a] Chen, J.; Wang, C.; Shi, Q.; Feng, Y.; and Chen, C. 2019a. Social recommendation based on users attention and preference. Neurocomputing 341:1–9.

[Chen et al. 2019b] Chen, J.; Wang, C.; Zhou, S.; Shi, Q.; Feng, Y.; and Chen, C. 2019b. Samwalker: Social recommendation with informative sampling strategy. In The World Wide Web Conference, 228–239. ACM.

[He et al. 2016] He, X.; Zhang, H.; Kan, M.-Y.; and Chua, T.-S. 2016. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, 549–558. ACM.

[He et al. 2017] He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. ACM.

[Hern´andez-Lobato, Houlsby, and Ghahramani 2014] Hern´andez-Lobato, J. M.; Houlsby, N.; and Ghahramani, Z. 2014. Stochastic inference for scalable probabilistic modeling of binary matrices. In International Conference on Machine Learning, 379–387.

[Hoffman et al. 2013] Hoffman, M. D.; Blei, D. M.; Wang, C.; and Paisley, J. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14(1):1303– 1347.

[Hu, Koren, and Volinsky 2008] Hu, Y.; Koren, Y.; and Volinsky, C. 2008. Collaborative filtering for implicit feedback datasets. In International Conference on Data Mining, 263–272. Ieee.

[Jiawei et al. 2018] Jiawei, C.; Yan, F.; Martin, E.; Sheng, Z.; Chun, C.; and Wang, C. 2018. Modeling users’ exposure with social knowledge influence and consumption influence for recommendation. In International on Conference on Information and Knowledge Management, 953–962. ACM.

[Johnson 2014] Johnson, C. C. 2014. Logistic matrix fac- torization for implicit feedback data. Advances in Neural Information Processing Systems 27.

[Kingma and Welling 2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classifica-

tion with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.

[Lee et al. 2016] Lee, J.; Kim, S.; Lebanon, G.; Singer, Y.; and Bengio, S. 2016. Llorma: Local low-rank matrix approximation. The Journal of Machine Learning Research 17(1):442–465.

[Li et al. 2018] Li, D.; Chen, C.; Lv, Q.; Gu, H.; Lu, T.; Shang, L.; Gu, N.; and Chu, S. M. 2018. Adaerror: An adaptive learning rate method for matrix approximation-based collaborative filtering. In The World Wide Web Conference, 741–751. IW3C2.

[Liang et al. 2016] Liang, D.; Charlin, L.; McInerney, J.; and Blei, D. M. 2016. Modeling user exposure in recommendation. In International Conference on World Wide Web, 951–961. ACM.

[Liang et al. 2018] Liang, D.; Krishnan, R. G.; Hoffman, M. D.; and Jebara, T. 2018. Variational autoencoders for collaborative filtering. In World Wide Web Conference, 689– 698. IW3C2.

[Palla et al. 2005] Palla, G.; Der´enyi, I.; Farkas, I.; and Vic- sek, T. 2005. Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818.

[Pan et al. 2008] Pan, R.; Zhou, Y.; Cao, B.; Liu, N. N.; Lukose, R.; Scholz, M.; and Yang, Q. 2008. One-class collaborative filtering. ICDM 502–511.

[Park and Chang 2019] Park, D. H., and Chang, Y. 2019. Adversarial sampling and training for semi-supervised information retrieval. In The World Wide Web Conference, 1443–1453. ACM.

[Rendle and Freudenthaler 2014] Rendle, S., and Freuden- thaler, C. 2014. Improving pairwise learning for item recommendation from implicit feedback. In International conference on Web search and data mining, 273–282. ACM.

[Rendle et al. 2009] Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. Bpr: Bayesian personalized ranking from implicit feedback. In conference on uncertainty in artificial intelligence, 452–461. AUAI Press.

[Shi et al. 2019a] Shi, Q.; Wang, C.; Chen, J.; Feng, Y.; and Chen, C. 2019a. Location driven influence maximization: Online spread via offline deployment. Knowledge-Based Systems 166:30–41.

[Shi et al. 2019b] Shi, Q.; Wang, C.; Chen, J.; Feng, Y.; and Chen, C. 2019b. Post and repost: A holistic view of budgeted influence maximization. Neurocomputing 338:92–100.

[Wang et al. 2017] Wang, J.; Yu, L.; Zhang, W.; Gong, Y.; Xu, Y.; Wang, B.; Zhang, P.; and Zhang, D. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In SIGIR, 515–524. ACM.

[Wang et al. 2018] Wang, M.; Zheng, X.; Yang, Y.; and Zhang, K. 2018. Collaborative filtering with social exposure: A modular approach to social recommendation. In AAAI, New Orleans, Louisiana, USA, February 2-7, 2018.

[Wu et al. 2016] Wu, Y.; DuBois, C.; Zheng, A. X.; and Es- ter, M. 2016. Collaborative denoising auto-encoders for

top-n recommender systems. In International Conference on Web Search and Data Mining, 153–162. ACM.

[Yu et al. 2018] Yu, L.; Zhang, C.; Pei, S.; Sun, G.; and Zhang, X. 2018. Walkranker: A unified pairwise ranking model with multiple relations for item recommendation. In Conference on Artificial Intelligence. AAAI.

[Yu, Bilenko, and Lin 2017] Yu, H.-F.; Bilenko, M.; and Lin, C.-J. 2017. Selection of negative samples for one-class matrix factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining, 363–371. SIAM.

[Zhou 2011] Zhou, T. 2011. Understanding online commu- nity user participation: a social influence perspective. Internet research 21(1):67–81.

Details of fBGD

Here we present the detailed derivation of our fBGD. Note that the constraint of the  θ (θi ⪰ 0, |θi|1 = 1) causes the optimization problem. We re-parameterize  θiby a softmax function with new parameters  βi: θi = s(βi). The gradient of objective function w.r.t the parameters  βi, wj, bj, αk, ui, vjcan be derived as follows:

image

As we can see, the computational bottlenecks is on computing ∂J∂θi(Eq.(7)),  ∂J∂qj(Eq.(8)),  ∂J∂ui(Eq.(5)),  ∂J∂vj(Eq.(6)) since it requires a traversal of all users (items) and repeat this operation over all items (users); while others can be accelerated by just iterating over these positive instances (xij = 1). The overall computational complexity to update αis O(nmKD), which is generally infeasible since nm can easily reach billion level or even higher in practise.

To speed up the learning, we first rewrite the computational bottlenecks –Eq.(5)(6)(7)(8) by isolating item(user)-independent terms as follows:

image

By this reformulation, we can see that the major computation – the �i∈Uuikuilθi, �i∈Uθi, �i∈Uθikuiluiover all users – is independent of item j; and the�

image

could achieve a significant speed-up by caching these terms. That is, we cache:

image

for each  1 ≤ k ≤ D, 1 ≤ l ≤ Dand:

image

for each  1 ≤ k ≤ D, 1 ≤ l ≤ K. where M (θ), M (q)are  K × K × D-dimensional tensors and S(θ), S(q)are D-dimensional vectors;  M (u), M (v)are D × K × K-dimensional tensors and  S(θ), S(q)are K-dimensional vectors. This way, the computational bottlenecks Eq.(5)(6)(7)(8) can be efficiently calculated as follows:

image

Algorithm 1 Inference of FAWMF based on fBGD

image

The rearrangement of nested sums and the cache strategies can avoid the massive repeated computations. The time computation can be reduced from O(nmKD) to O((n + m)K2D +|X+|(K +D)). That is, although our fBGD uses all feedback data but its complexity is linearly to the number of observed data. due to the sparsity of the implicit feedback data, our FAWMF with fBGD learning algorithm is efficient. Our experimental results also validate this point. Overall, our fBGD algorithm is presented in algorithm 1.

Evaluation Metrics

We adopt the following metrics to evaluate recommendation performance:

Recall@K (Rec@K): This metric quantifies the fraction of consumed items that are in the top-K ranking list sorted by their estimated rankings. For each user u, we define Rec(u) as the set of recommended items in top-K and Con(u) as the set of consumed items in test data for user u. Then we have:

image

Precision@K (Pre@K): This measures the fraction of the top-K items that are indeed consumed by the user:

image

Normalized Discounted Cumulative Gain@K (NDCG@K): This is widely used in information retrieval and it measures the quality of ranking through discounted importance based on positions in the top-K recommendation lists. In recommendation, NDCG is computed as follow:

image

where  DCGu@Kis defined as follow and  IDCGu@Kis the ideal value of  DCGu@Kcoming from the best ranking.

image

where  rankuirepresents the rank of the item i in the recommended list of the user u.

Mean Reciprocal Rank (MRR): Given the ranking lists, MRR is defined as follow:

image

MRR can be interpreted as the ease of finding all consumed items, as higher numbers indicate the consumed items are higher in the list.


Designed for Accessibility and to further Open Science