In recent years, matrix completion has been a venue for constant research—both from theoretical as well as algorithmic front. The applications include collaborative filtering, multiclass learning, dimensionality reduction etc. In most general terms, the aim of these problems is to estimate the elements of a matrix from noisy observations collected on only a subset of its locations. Of course, such estimation problems are ill-posed without further assumptions, since the value of
at the unobserved locations can be arbitrary. A common approach is to augment the inference method with the assumption that the underlying matrix to be estimated exhibits some form of intrinsic low-dimensional structure, making, in some cases, the problem feasible. For example, the assumption of low-rank on the target matrix has been the main focus of research for these problems. Even though the rank constrained problem is non-convex and difficult to analyze, convex relaxations to these problems have been well studied in noiseless [2], additive uncertainty [3], and even in settings where observations are nonlinear (e.g., one-bit quantized [4]). However, due to the computational complexity reasons, the most popular algorithms to solve these problems are variants of alternating minimization [1]; the low-rank structure is imposed explicitly via bilinear models i.e.,
is assumed to be a product of two factor-matrices as
where modifications of this formulation also allows for explicit control and facilitates imposing additional constraints like positivity [5], sparsity [6], or even tree-sparsity [7] on the factors.
In terms of the well-known user-item recommendation problem [8]: pertains to the rating matrix,
matrices of user and item latent-factors, respectively. Apart from the rating information, generally, we also have other user and item side-features (like age, gender, etc) that can be used to further improve the quality of estimation. These features can be included via a Bayesian framework [9, 10] or by directly changing the model as in Inductive Matrix Completion (IMC) [11, 12] where we have the following model assumption on
where are known matrices, for example, user and item feature matrices,
is a mixing matrix whose low-rank structure is explicitly imposed via a bilinear model
The goal of this paper is to provide estimation error bounds for matrix completion under inductive setting, as in (2), by using the framework of regularized maximum likelihood estimation [1]. For generality, we derive these bounds under the assumption that both can be sparse, and noise can follow any generic distribution. This kind of sparse-structure arises in recommendation problems where the desire is to have sparse latent vectors due to requirement of low computational and storage complexity [13], and variant of dictionary learning problems where apart from the coefficient matrix being sparse, dictionary itself is modeled to be sparse [14].
1.1. Related Works
Our inference tasks here essentially entails learning two structured factors in a bilinear model. With a few notable exceptions (e.g., low-rank matrices and certain non-negative matrices [15, 16]), the non-convex bilinear models studied in this paper are difficult to analyze. Several recent efforts in the dictionary learning literature have established theoretical guarantees on identifiability, as well as algorithmic analyses of a number of optimization approaches [17–20]. The most closely related to our work are [1, 6, 17], where authors provide a framework for deriving estimation error bounds for the models of form (1) under various noise distributions and sparsity structure on by using a regularized maximum-likelihood approach based on the ideas from [21]. The Overall theme of the estimation bounds obtained using this framework is that the error of their estimate,
, decays as the ratio of number of degrees of freedom to estimate over the number of observations.
The idea of including side features in the factorization models to improve recommendations is considered using Bayesian formulations in [22] or via directly incorporating these features in a generative model in IMC [12], Factorization Machines [23]. The statistical performance of IMC has been theoretically analyzed via convex relaxations based on nuclear-norm in noiseless [12] and noisy settings [24]. From the algorithmic perspective, provable alternatingminimization based approach with bilinear models has been recently proposed and analyzed [11]. Our results in this paper provide additional insights into these problems under a general setting of noise and sparsity of the bilinear factors. It is to be noted here that the aim of all these related works is to accurately estimate the mixing matrix ; we provide error bounds for the estimation of
Many recent works have used IMC framework for large scale recommendations [25] or extended IMC to automatically construct better features [26].
1.2. Specific Contributions
In this paper we provide estimation error guarantees for matrix completion using regularized maximum likelihood estimator under the model described in (2) for the cases where the factor matrices are dense (either one or both), or when is itself sparse. In order to arrive at these bounds, we extend the fundamental theorem proposed in [1] to the inductive matrix completion setting studied here. This extension is detailed in Theorem 2.1. The main utility of this theorem is that it can provide estimation error guarantees for a variety of noise distributions; for example, Gaussian, Poisson, Laplace, and even under the extreme quantized setting of 1-bit observations. Here we instantiate it for the Gaussian noise and obtain corresponding error guarantees.
1.3. Preliminaries
To set the stage for the statement of our main result, we recall several information-theoretic preliminaries. When p(Y ) and q(Y ) denote the pdf (or pmf) of a real-valued random variable Y , the Kullback-Leibler divergence (or KL divergence) of q from p is denoted and given by
where the logarithm is taken to be the natural log. By definition, is finite only if the support of p is contained in the support of q. Further, the KL divergence satisfies
when p(Y ) = q(Y ). We also use the Hellinger affinity denoted by A(p, q) and given by
It is known that are parameterized by elements
of matrices
tively, so that
use the shorthand notation
. Finally, for a matrix M we denote by
its number of nonzero elements, and
magnitude of its largest element (in absolute value). Also,
Our focus here is to estimate a matrix that admits a model as in (2) from noisy observations
collected at a subset
of its locations. Sampling locations are randomly chosen in the sense that for an integer m satisfying
and
, each location is observed independently via a Bernoulli(
Going forward, we assume the following bounds on the participating matrices:
for some positive constants
Given , we write the joint pdf (or pmf) of the observations as
where denotes the corresponding scalar pdf (or pmf), and we use the shorthand
to denote the collection of elements of
. Then our task may be described succinctly as follows: given S and corresponding noisy observations
distributed according to (3), our goal is to estimate
under the assumption that it admits a model as in (2). Our approach will be to estimate via sparsity-penalized maximum likelihood; we consider estimates of the form
where are user-specified regularization parameter,
is shorthand for the collection
of entries of X indexed by S, and X is an appropriately constructed class of candidate estimates. To facilitate our analysis here, we take X to be a countable class of estimates constructed as follows: for a specified
we set
and construct P to be the set of all matrices
whose elements are either 0, or are discretized to one of
uniformly spaced levels in the range
be the set of all matrices
whose elements either take the value zero, or are discretized to one of
uniformly-spaced levels in the range
. Then, we let
Our main result establishes error bounds under general noise or corruption models. We state the result here as a theorem; its proof appears in Section 3.
Theorem 2.1. Let be as defined above and
is any constant satisfying
Then for any
the complexity penalized maximum likelihood estimator (4) satisfies the (normalized, per-element) error bound
Before proceeding to specific case of Gaussian noise, we note a few salient points about this result. First, as alluded above, our results is general and can be used to analyze the error performance under a variety of noise models. Specialization to a given noise model requires us to compute the upper bounds for KL divergences in terms of problem parameters, and the lower bound of negative log Hellinger affinities in terms of the required error function. Second, the above result is kind of an oracle bound, in that it is specified in terms of a minimum over . We can construct a valid upper bound by evaluating this for any
. We construct a specific
that is close to
, and evaluate the oracle bound for the constructed point to get a non-trivial upper bound.
Finally, we note that the optimization problem (4) is non-convex, as often is the case, not only because of penalty, or discretized space (X ) used for proofs in this paper, but due the fundamental challenge posed by bilinear models.
2.1. Implications for Gaussian Noise
Below we provide scaling behavior of error bound with problem parameters for Gaussian noise with known variance by instantiating Theorem 2.1. The joint likelihood of observations can be written as
where we have used the shorthand notation . To that end, we fix
for describing the number of discretization levels in the elements of each factor matrix and regularization parameters right hand side of (6). In this setting we have the following result; its proof appears in Section 3.2.
Corollary 2.1. Let be as defined above with
. The estimate
obtained via (4) satisfies
A few comments are in order regarding these error guarantees. We may interpret the quantity as the number of degrees of freedom to be estimated, and in this sense we see that the error rate of the penalized maximum likelihood estimator exhibits characteristics of the well-known parametric rate (modulo the logarithmic factor). If
are instead dense, then we get the familiar error bound that decays as
this with the case where matrix
is just low-rank (as in formal matrix completion tasks), the error rates are of the form
[1]. Clearly, the IMC model has a better error performance if
or even better if the factor matrices are sparse.
The presence of the factor is akin to the incoherence con- ditions prevalent in the matrix-completion literature. In essence, it pertains to the spikiness of the underlying matrix we are trying to estimate [27].
3.1. Proof of Theorem 2.1
We utilize the following Lemma from [1] and apply it to our problem.
Lemma 3.1. Let matrix whose elements we aim to estimate, and let X be a countable collection of candidate reconstructions
, each with corresponding penalty
so that the collection of penalties satisfies the summability condition
as explained earlier, we have that for any
, the complexity penalized maximum likelihood estimator
satisfies the (normalized, per-element) error bound
where, as denoted, the expectation is with respect to the joint distribution of
Now, consider any discretized matrix factors Q, as described in Section 2. For
, we encode each nonzero element of
bits to denote its loca- tion and
bits for its amplitude. We can then encode P with
nonzero entries by using
bits. Similarly, we can encode
where
Now, we let
be set of all such X = APQB, and let the binary code for each X be the concatenation of the binary code for P followed by the binary code for Q. It follows that we may assign penalties
length satisfy
given that A and B are known matrices.
It is easy to see that such codes are (by construction) uniquely decodable, so we have that by the well- known Kraft-McMillan Inequality [28]. Now let X be any subset of
. For randomly subsampled and noisy observations
estimates take the form
Further, using Lemma 3.1 we have
where . Finally, letting
and using the fact that
which follows by our selection of and the fact that
; it follows (after some straightforward simplifica-tion) that (
) follow (6), and the estimate (4) satisfies
as claimed.
3.2. Proof of Corollary 2.1
Proof. For as specified and any
, using the model (8) we have
for any fixed . It follows that
, and using the fact that the amplitudes of entries of
and all
are no larger than
, it is clear that we may choose
. Further, for any
and any fixed
it is easy to show that in this case
Incorporating this into Theorem 2.1, we obtain that for regularization parameters as in (6) with substituted, the sparsity penalized maximum-likelihood estimate satisfies the per-element mean-square error bound
We now establish the error bound for the case where are sparse with corresponding cardinality
Consider a candidate reconstruction of the form
where the elements of
are the closest discretized surrogates of the nonzero entries of
, and the entries of and
are the closest discretized surrogates of the nonzero entries of
. Then it is easy to see that
Given the range limits on allowable P and Q and that each range is quantized to levels, we have that
and
. Now, we can obtain a bound on the magnitudes of the elements of
that hold uniformly over all i, j, as follows
where the inequality follows from a application of triangle inequality followed by the bounds on entry-wise bounds on elements of allowable P and Q, and because
. Now, it is straight-forward to show that our choice of
(9) implies
so each entry of is bounded in magnitude by
follows that each element of the candidate
constructed above is bounded in magnitude by
is indeed a valid element of the set X .
Further, the approximation error analysis above also implies directly that
where the last line follows from the fact that our specific choice of in (9) also implies
Now, evaluating the oracle term at the candidate and using the fact that
have
[1] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Noisy matrix com- pletion under sparse factor models,” IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3636–3661, June 2016.
[2] E. J. Cand`es and B. Recht, “Exact matrix completion via con- vex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[3] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from noisy entries,” Journal of Machine Learning Research, vol. 11, pp. 2057–2078, 2010.
[4] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, “1-bit matrix completion,” Information and Inference, vol. 3, no. 3, pp. 189–223, 2014.
[5] A. Soni and J. Haupt, “Estimation error guarantees for poisson denoising with sparse and structured dictionary models,” in 2014 IEEE International Symposium on Information Theory, June 2014, pp. 2002–2006.
[6] A. Soni, S. Jain, J. Haupt, and S. Gonella, “Error bounds for maximum likelihood matrix completion under sparse factor models,” in Signal and Information Processing (GlobalSIP), 2014 IEEE Global Conference on, Dec 2014, pp. 399–403.
[7] A. Soni and J. Haupt, “On the fundamental limits of recovering tree sparse vectors from noisy linear measurements,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 133– 149, Jan 2014.
[8] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization tech- niques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
[9] Hao Ma, Haixuan Yang, Michael R Lyu, and Irwin King, “Sorec: social recommendation using probabilistic matrix factorization,” in Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 2008, pp. 931– 940.
[10] Ian Porteous, Arthur U Asuncion, and Max Welling, “Bayesian matrix factorization with side information and dirichlet process mixtures.,” in AAAI, 2010.
[11] Prateek Jain and Inderjit S Dhillon, “Provable inductive matrix completion,” arXiv preprint arXiv:1306.0626, 2013.
[12] Miao Xu, Rong Jin, and Zhi-Hua Zhou, “Speedup matrix com- pletion with side information: Application to multi-label learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2301–2309.
[13] E. Acar, G. Grdeniz, M. A. Rasmussen, D. Rago, L. O. Drag- sted, and R. Bro, “Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics,” in 2012 IEEE 12th International Conference on Data Mining Workshops, Dec 2012, pp. 1–8.
[14] R. Rubinstein, M. Zibulevsky, and M. Elad, “Double sparsity: Learning sparse dictionaries for sparse signal approximation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, March 2010.
[15] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization– provably,” Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pp. 145–162, 2012.
[16] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix factorization and dimensionality reduction on physical space,” IEEE Transactions on Image Processing, vol. 21, no. 7, pp. 3239–3252, 2012.
[17] J. D. Haupt, N. D. Sidiropoulos, and G. B. Giannakis, “Sparse dictionary learning from 1-bit data,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 7664–7668.
[18] R. Gribonval and K. Schnass, “Dictionary identification – Sparse matrix-factorization via minimization,” IEEE Transactions on Information Theory, vol. 56, no. 7, pp. 3523–3539, 2010.
[19] M. Aharon, M. Elad, and A. M. Bruckstein, “On the uniqueness of overcomplete dictionaries, and a practical way to retrieve them,” Linear algebra and its applications, vol. 416, no. 1, pp. 48–67, 2006.
[20] D. A. Spielman, H. Wang, and J. Wright, “Exact recovery of sparsely-used dictionaries,” in Proceedings of the TwentyThird international joint conference on Artificial Intelligence, 2013, pp. 3087–3090.
[21] A. R. Barron, C. Huang, J. Q. Li, and X. Luo, “The MDL prin- ciple, penalized likelihoods, and statistical risk,” Festschrift for Jorma Rissanen. Tampere University Press, Tampere, Finland, 2008.
[22] Deepak Agarwal and Bee-Chung Chen, “flda: matrix factor- ization through latent dirichlet allocation,” in Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010, pp. 91–100.
[23] Steffen Rendle, “Factorization machines,” in 2010 IEEE International Conference on Data Mining. IEEE, 2010, pp. 995– 1000.
[24] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon, “Ma- trix completion with noisy side information,” in Advances in Neural Information Processing Systems, 2015, pp. 3447–3455.
[25] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and In- derjit S Dhillon, “Tumblr blog recommendation with boosted inductive matrix completion,” in Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 203–212.
[26] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inder- jit S. Dhillon, “Goal-directed inductive matrix completion,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), aug 2016.
[27] S. Negahban and M. J. Wainwright, “Restricted strong con- vexity and weighted matrix completion: Optimal bounds with noise,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1665–1697, 2012.
[28] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley & Sons, 2006.