Consider a non-negative data matrix consisting of m samples each with n non-negative features, arranged as its columns. In the non-negative matrix factorization (NMF) problem the goal is to identify two entrywise non-negative matrices
(for some a-priori fixed
) such that the matrix product WH approximates V , where the precise sense of approximation depends on the application of interest.
Factorizations based on non-negative matrices have found numerous applications throughout most fields of science and engineering, notable examples including music analysis [7], document clustering [36], speech-source separation [34] and cancer-class identification [9]. NMFs are useful as they lead to additive and sparse representations of the input data. Indeed, a factorization V = WH can be interpreted as V (:, j) = WH(:, j), for all j = 1, . . . , m, i.e., each sample (columns of V ) can be represented as non-negative combination of basis vectors (columns of W).
The NMF problem was introduced in the engineering community in the seminal paper [30] and was subsequently popularized further in [21]. As it turns out, non-negative factrorizations were also explored earlier in the combinatorial optimization community, as a tool to describe the efficacy of linear programming for hard combinatorial problems [37].
In terms of complexity, the problem of deciding the existence of an exact NMF is NP-hard even for r = rank(V ) [35]. [5] gave the first finite algorithm for calculating the non-negative rank over the real numbers based on quantifier elimination arguments. Extending this technique, [33] derived an (algorithm for exact NMF that was subsequently improved to (2
Additionally, [33] identified sufficient conditions on V under which the NMF problem can be solved in polynomial time.
There are several practically efficient algorithms for calculating (approximate) NMFs. The starting point for the majority of existing approaches is the non-convex program
for some user-specified parameter . Nevertheless, in many applications, it is useful to use information-theoretic divergences rather than the squared Euclidean loss function. In view of the non-convexity of (NMF), the theoretical analysis of iterative algorithms for solving (NMF) typically amounts to showing that an accumulation point of the sequence of iterates (
) generated by the algorithm satisfies the first-order optimality conditions given in (1).
Many algorithms for NMF can be interpreted in a unified manner within the framework of block coordinate descent (BCD), which is also known as the Gauss-Seidel method, e.g., see the survey [20] and references therein. In the BCD framework the goal is to minimize a smooth function over a domain that can be decomposed as the Cartesian product of closed convex sets. At each iteration of the BCD method, the function is minimized with respect to a single block of variables while the rest of the blocks are kept fixed. Convergence results in the BCD framework typically require attainment and unicity of the minimizer at each step [3], the unicity assumption being redundant in the 2-block case [14]. One example is the Alternating Non-negative Least Squares (ANLS) method [19], which is a 2-block BCD with updates given by the solutions of minand min
. A second example is the Hierarchical ALS (HALS) method [1], which has been rediscovered several times, e.g. as the rank-one residue iteration [15] and FastNMF in [24]. The HALS method corresponds to a 2r-block BCD, one for each column of the factors W and H. For additional details concerning algorithms for NMF and their convergence properties the reader is referred to the surveys [20, 12] and references therein.
Going beyond the BCD framework, the multiplicative update (MU) rule introduced in [22] is by far the most popular algorithm for NMF. In the MU framework, the updates are given by
where X/Y denotes the componentwise division of two matrices. [22] showed using a majorization-minimization approach, e.g., see [38], that under the MU rule, the objective function is non-increasing. However, it has long been observed that the MU rule may fail to converge to a first-order stationary point, e.g. see [13]. Indeed, only when the MU method converges to a fixed point (W, H) with strictly positive entries this is also a KKT point. An important limitation of the MU update rule is that it cannot be implemented concurrently (as
depends on
). [25] further elaborates on the difficulties in proving convergence of the MU algorithm, while establishing convergence of a close variant of Lee-Seung’s method.
Despite the wealth of existing algorithmic approaches for finding NMFs, a common limitation is that they may fail to avoid saddle points, e.g. see [8, 13, 17, 4, 2]. This drawback is also shared with many gradient-based approaches applied to important optimization problems (e.g. training neural networks, matrix completion, community detection), as such methods are only guaranteed to converge to first-order stationary points, among which there exists a proliferation of highly-suboptimal saddle points, e.g., see [6]. At the same time, algorithmic approaches that incorporate additional curvature information typically converge to second-order stationary points, i.e., points with vanishing gradient and positive semidefinite Hessian, which turn out to be are as good as local minima in many problems of practical interest [11]. These observations have led to a flurry of research activity on avoiding saddle points the last 5 years (see [10, 23, 18, 32, 31, 28] and references therein) for both unconstrained and constrained optimization. One other line of work that deals with avoiding saddle points in non-convex settings with linear constraints and can be applied to our paper (as long as one has shown our main Theorem 2) can be found in [29, 26]. Our main goal in this work, is to identify new methods that converge to second-order stationary points of (NMF).
1.1 Summary of results and significance
Before we formally state our results, we provide some standard definitions in constrained optimization applied to the NMF problem (NMF). Moreover we give the definition of our Multiplicative Weights Update (MWU) applied to NMF.
Stationary points of (NMF). A pair of matrices () is a first-order stationary point (FOSP) of problem (NMF) if for all
] it satisfies:
Additionally, () is a second-order stationary point (SOSP) of the problem (NMF) if it is a FOSP, and moreover,
for any pair of matrices (W, H) satisfying:
for all
We now present our multiplicative weight update method, which converges to a SOSP of (NMF) with probability 1.
Denote by S the set of fixed points of MWU dynamics/algorithm (1), namely the set of points (W, H) that are invariant under the update rule of MWU, multiplied by scalar C. Then, the set of stationary points of problem (NMF) with the additional constraint that a subset of S.
We now state the main result of our paper, which informally states that the MWU dynamics (1) provably avoids fixed points that are not second order stationary points.
Main Theorem. The MWU described in Algorithm 1 converges to the set of fixed points for any initialization (. Moreover, the set of initial conditions so that MWU converges to a point ( ˜
is not a second order stationary point for problem (NMF) is of measure zero (in ∆
An immediate corollary is the following:
Corollary 1. Assume that the iterate () converges to a limit, under random initialization (any probability distribution that is absolutely continuous with respect to Lebesgue measure on ∆
suffices for the initialization) the probability of MWU (1) to converge to a point ( ˜
) so that (
is a second order stationary point for problem (NMF) is one.
An important differentiation between the concurrent MWU rule and existing MWUs and gradient based approaches, is the concurrent way of updating the entries of W and H. Both iterates and
are updated using only the values of
, an extremely useful algorithmic feature since it permits implementations in parallel computing environments. To the best of our knowledge, Algorithm 1 is the first iterative method for NMF that converges while performing its updates in a concurrent way. In all previous gradient based approaches (both in multiplicative weight algorithms and in alternating least squares) the convergence properties heavily rely on the fact that the entries of W and H are updating in an alternating way, while their concurrent counterparts may fail to converge. Such an instance is presented in Example 1 for the Lee-Seung algorithm.
1.2 Proof techniques
The main challenge in the non-convex problem of NMF is that the landscape is not Lipschitz (in the positive orthant) and there are continuums of stationary points (if W, H is a stationary point, so it is with D a diagonal matrix with positive entries). The first challenge is essentially circumvented by adding an extra linear constraint that makes the feasibility region a compact set. In particular we define a modification of the NMF problem with the extra constraint that the sum of the entries of W and H is equal to a specific constant C sufficiently large (this constant is chosen to be
). By choosing this constant that large, we are able to prove that all the fixed points of MWU that are not second order stationary points of the NMF problem are repelling for MWU inside simplex. Moreover, we are able to show that for any stationary point (
) of NMF, there exists another stationary point ˜
so that
and the sum of entries of ˜W and ˜H is exactly C. As long as we have shown these claims, we use a result from [32] (Theorem 1) that states that given any twice differentiable polynomial function f(x) that we want to maximize with simplex constraints, MWU dynamics converges to second order stationary points almost surely. Last but not least, we can show that after adding the simplicial constraint (that is the sum of entries of W, H must be equal to C) for any stationary point (
) of the classic NMF problem there exists a stationary point ( ˜
) such that
have same values) and moreover ( ˜
) lies in the positive orthant and satisfies the aforementioned simplicial constraint (see Lemma 1).
to denote the i-th column of matrix
to denote the i-th row of matrix H. We also use subscripts or superscripts with letter t to denote the t-th iterate. We denote by vec(A) the standard vectorization of matrix
the simplex of size n, that is ∆
Before giving the details of our proof, we elaborate the strong relation between Algorithm 1 and the following optimization problem, which we call simplex-NMF:
where C is any constant
Problem (S-NMF) is similar to the original NMF problem, the only difference being the additional simplex constraint. On the negative side, the feasibility set of (S-NMF) is a strict subset of the feasibility set of the original NMF problem, meaning that it may include solutions with cost much greater than the optimal value of the original NMF problem. On the positive side, this problem turns out to be algorithmically easier to tackle. More precisely, due to the recent result [32], when the sequence of matrices generated by the MWU algorithm will converge almost surely to a SOSP of (S-NMF).
Stationary points of (S-NMF). A pair of matrices () is a first-order stationary point of the problem (S-NMF) if for all
for some constant c. Additionally, a pair of matrices () is a second-order stationary of the problem (S-NMF) if it is a FOSP, and moreover,
for any pair of matrices (W, H) satisfying:
Theorem 1 ([32]). Consider the problem maxis a polynomial function. Then, the MWU algorithm with update rule
has the property that ) unless
is a fixed point of the MWU dynamics (5). Moreover, the set of initial conditions so that the MWU dynamics (5) converge to a point that is not a second order stationary point for max{Q(x) :
is of measure zero. The statement holds when
is chosen to be of order Θ(
) where L is the Lipschitz constant inside ∆
, that is
Notice that the MWU described in Algorithm 1 is the same with the MWU of Theorem 1 applied for (make it a minimization problem). Also observe that, F can be described as multivariate polynomial and thus Theorem 1 applies if the parameter
is selected appropriately small. For our case,
should be Θ(1
) (see also Algorithm 1).
To this end, Theorem 1 ensures that the MWU algorithm (almost certainly) converges to a SOSP of (S-NMF). However there is no reason why one should be interested in finding such points, since these points are not necessarily SOSPs of (NMF). In fact, a SOSP of (S-NMF) can be an arbitrarily bad solution for the initial NMF problem (e.g. consider the case C = 0). One of our main technical contributions consists in showing that if the offset C exceeds a certain threshold (depending on ) then the set of SOSPs of (S-NMF) is a subset of the second-order stationary points of (NMF). Specifically, we show that:
we have that
• The value of (S-NMF) is equal to (NMF).
• Any second-order stationary point of (S-NMF) is also a second-order stationary point of (NMF).
The proof of the main theorem follows easily by combining Theorem 2 with Theorem 1. The first part of Theorem 2 is a consequence of the following result, proven in Section 3.
Lemma 1. For any first-order stationary point of (NMF) there exists a first-order stationary point of (S-NMF) with the same value.
The second part of Theorem 2 heavily relies on the following result, which is proven in Section 4.
Lemma 2. Any second-order stationary point of the problem (S-NMF) necessarily satisfies c = 0. In particular, any second-order stationary point of (S-NMF) is a first-order stationary point of (NMF).
Lastly, note that Lemma 2 combined with Theorem 1 imply the following claim: If we apply (NMF). Although the latter claim is not enough for our initial goal (finding pair of matrices that are second-order stationary points for NMF (2)), it is the basic step of the proof of Theorem 2, that is presented in Section 5.
In Section 6, we present the results of several experimental evaluations indicating that the MWU defined in Algorithm 1 converges to the optimal pair of matrices.
2.1 An illustrative example
Before proceeding we exhibit the above discussion in a very simple but illustrative example. Consider the following instance of the NMF problem:
For the above optimization problem the set of first-order stationary points (Equation (1)) is the union of the sets, . While the set of the second-order stationary points (Equation (2)) is just the set {(x, y) :
. This simple example indicates the interest in finding second-order stationary points (Equation (2)) since the set of first order stationary points (Equation (1)) contains very bad solutions.
Now consider the same problem with an additional simplicial constraint:
In this case the set of first-order stationary points (Equation (3)) is {(0, 1), (1, 0), (1/2, 1/2)}. While the set of second order stationary points (Equation (4)) is the set {(1/2, 1/2)}. The above means that if we run the MWU algorithm runs with parameter C = 1 then for almost all initializations, the produced sequence of solutions will converge to (1/2, 1/2), since this is the only second-order stationary point of the above minimization problem. Now notice that (1/2, 1/2) is not a good solution (for the initial optimization problem), the value of the global optimal (for the initial optimization problem) is 0. More importantly, the point (1/2, 1/2) does not satisfy Equation (2), which was our initial algorithmic goal. The reason for this is that the parameter C is not chosen large enough (notice that C must be selected as 2) which is greater than 1) and thus Theorem 2 does not apply.
Now consider the problem with the same additional simplicial constraint, but with C = 4.
In this case the set of first-order stationary points (Equation (3)) is {(0, 1), (1, 0), (2, 2), (2 2 +
(2 +
. While the set of second order stationary points (Equation (4)) is
. As a result, MWU algorithm with parameter C = 4 converge either to (2
2 +
) or to (2 +
) for almost all initializations. Notice that both (2
2 +
) and (2 +
) satisfy Equation (2) (in fact they are optimal solutions). This should not be a surprise since for C = 4, Theorem 2 applies and thus MWU converges to second-order stationary points of Equation (2) .
2.2 Calculating derivatives
The entries of the gradient of F are given by:
whereas the entries of its Hessian are given by:
Using (6) and (7) we arrive at the following useful result:
Lemma 3. For any pair of matrices (W, H) and index , let the pair of matrices ( ˜
where 0 denotes the zero column vector of appropriate size.
Proof. Let us start by proving that for all the entries of the vector corresponding respectively to
By direct calculation we get that,
where the last equality follows by (6). Respectively for . Up next we prove that
Respectively for
By Theorem 6, [16], any FOSP () of the problem (NMF) satisfies
. Consider the pair matrices ( ˆ
) defined by:
for each ]. Without loss of generality we assumed that both
are not equal to zero (if one of these terms is zero, we may assume and the other is also zero and the inequality below still holds). By definition of ( ˆ
) we have that
and thus ), i.e., these two pairs of matrices have the same value. Furthermore, note that
Lastly, consider the parametrized family of matrix pairs () and increase t until
This leads to a pair of matrices that are FOSP of (S-NMF).
Let () be a SOSP of the problem (S-NMF) with
showing the existence of a pair of matrices (
) satisfying:
where we set ). We start with a technical claim that will be used up next.
Lemma 4. Let () be a FOSP for (S-NMF), for some constant
. Then for all
Proof. Direct calculation reveals that:
We are now ready to describe the construction of the matrices () satisfying (8)-(11). Setting
it follows immediately by Lemma 4 that
The matrices are defined as follows:
, while
the k-th column of
coincides with the k-th column of
and all other entries are zero. It is immediate from the definition that Conditions (8)-(9) are satisfied, since
implies
) and thus
(and analogously for
). Moreover, Condition (10) is satisfied since
where the first inequality follows from the fact that all the entries are positive and the second from the Cauchy-Schwarz inequality. In order to satisfy Condition (11), it remains to show that
0. Indeed,
where the last inequality follows by Claim 4. Combining the above with (12) we arrive at,
where we used that
Let () be a second-order stationary point of the problem (S-NMF) which is not a second-order stationary point of the unconstrained problem (NMF). Since (
) is a SOSP of (S-NMF), Lemma 2 implies that c = 0.
As in the proof of Lemma 2, we will arrive at a contradiction by identifying a non-zero pair of matrices () that satisfy the following 4 conditions:
where we set
As () is not a SOSP of (NMF), there exist a pair of matrices ( ˆ
) satisfying conditions (14), (15) and (17). In the remainder of the proof, we use the pair ( ˆ
) to construct a new pair (
) that also satisfies (16). In the construction of the pair (
), we will also use the pair ( ˜
) of Lemma 3 for
+
. We remind that ˜
and ˜
denotes the zero column vector of appropriate size),
We are now ready to describe our construction. Consider the following family of matrix pairs:
which we show satisfies Conditions (14), (15) and (17). Indeed, assuming that 0, it follows by (14) that ˆ
) is a SOSP for (S-NMF), we also have that
) satisfies (14) for all t. Analogously, if
0 it follows that
Lastly, we show that for all . To simplify notation let,
• ), respectively ˆ
• ∇
Then, we have that
where for the second equality we use Lemma 3. The first term is zero since if ˜) is a SOSP (analogously for
). By Conditions (14)-(15), we know that if
that the second term is also zero. Finally, the third term is strictly negative as ( ˆ
We complete the proof of Theorem 2 by noting that for
the remaining Condition (16) is satisfied. An important detail is that contradicts with the assumption that (
) is SOSP (recall the proof of Lemma 2 in Section 4).
In this section we present experimental evaluations on the quality of the solutions produced by the Multiplicative Weight Update algorithm (Algorithm 1), which indicate convergence to the global minimizers. More precisely, for several values of the parameters n and r, we generated random matrices with entries in [0, 1] and rank r and we checked the quality of the solutions
produced by MWU. This was done so as to ensure that the global minimum of the respective NMF problem is 0, which served as a benchmark on the quality of the solutions produced by MWU. For all the conducted experiments MWU was able to find a solution with value arbitrarily close to 0 meaning that it always converged to the right factorization. Figure 1 illustrates the number of iterations needed MWU to converge to solutions with error smaller than 1% of the initial error, for various values of n, r.
An important differentiation of the Multiplicative Weight Update depicted in Algorithm 1 with the previous multiplicative weight update and gradient based approaches, is its concurrent way of updating the entries of W and H. Both the matrices and
are updated by using only the values of
. We remark that that concurrency in the updating step is very desirable, since it permits more efficient implementations in parallel computing environments. To the best of our knowledge, Algorithm 1 is the first iterative method for non-negative matrix factorization that converges while performing its updates in a concurrent way. In all the previous gradient based approaches (both in multiplicative weight algorithms and in alternating least squares) the convergence properties heavily rely on the fact that the entries of W and H are updating in an
Figure 1: The figure depicts the number of iterations MWU needs to produce a solution with error smaller than 1% of the initial error. V is a random matrix with rank r and entries in [0, 1].
alternating way, while their concurrent counterparts may fail to converge. In Example 1 we present such a case for the Lee-Seung algorithm in which the original version of the algorithm converges, while the concurrent version fails to converge.
Example 1. Consider the matrices V =
(oscillates.
Ioannis Panageas and Xiao Wang were supported by SRG ISTD 2018 136, NRF-NRFFAI1-2019-0003 and NRF2019NRF-ANR2019. Stratis Skoulakis was supported by NRF 2018 Fellowship NRF-NRFF2018-07. Antonios Varvitsiotis was supported by SRG ESD 2020 154.
[1] S.I. Amari A. Cichocki, R. Zdunek. Hierarchical als algorithms for nonnegative matrix and 3d tensor factorization. In Lecture Notes in Computer Science, pages 169–176, 2007.
[2] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. In Computational Statistics and Data Analysis, pages 155–173, 2006.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
Figure 2: The error of the concurrent Lee-Seung algorithm versus the error of the original algorithm in the NMF instance described in Example 1.
[4] M. Chu, Fasma Diele, R. Plemmons, and Stefania Ragni. Optimality, computation, and interpretation of nonnegative matrix factorizations. 01 2004.
[5] J.E. Cohen and U.G. Rothblum. Nonnegative ranks, decompositions and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149–168, 1993.
[6] Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
[7] C. F´evotte, N. Bertin, and J.L. Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.
[8] Lorenzo Finesso and Peter Spreij. Approximate nonnegative matrix factorization via alternating minimization. In Proceedings of the 16 th MTNS Internat. Symposium, Leuven.
[9] Yuan Gao and George Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, 21(21):3970–3975, 2005.
[10] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, Paris, France, July 3-6, 2015, pages 797–842, 2015.
[11] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems 29, 2016.
[12] N. Gillis. The why and how of nonnegative matrix factorization”. In M. Signoretto J.A.K. Suykens and A. Argyriou, editors, Regularization, Optimization, Kernels, and Support Vector Machines. Chapman & Hall/CRC, 2014.
[13] Edward F. Gonzalez and Yin Zhang. Accelerating the lee-seung algorithm for nonnegative matrix factorization, 2005.
[14] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear gauss- seidel method under convex constraints. Operations Research Letters, 26:127–136, 2000.
[15] N.D. Ho. Nonnegative matrix factorization algorithms and applications. PhD thesis, Univ. Catholique de Louvain, 2008.
[16] Ngoc-Diep Ho, Paul Van Dooren, and Vincent D. Blondel. Descent methods for nonnegative matrix factorization. CoRR, abs/0801.3199, 2008.
[17] Chih jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 2007.
[18] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 1724–1732, 2017.
[19] H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.
[20] Jingu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global. Optim., 58:285–319, 2014.
[21] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
[22] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In In NIPS, pages 556–562. MIT Press, 2001.
[23] Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Math. Program., 176(1-2):311–337, 2019.
[24] L. Li and Y.-J Zhang. Fastnmf: highly efficient monotonic fixed-point nonnegative matrix factorization algorithm with good applicability. J. Electron. Imaging, 18:033004, 2009.
[25] C. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
[26] Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, and Mingyi Hong. SNAP: finding approximate second-order stationary solutions efficiently for non-convex linearly constrained problems. CoRR, abs/1907.04450, 2019.
[27] A. Moitra. An almost optimal algorithm for computing nonnegative rank. In Proc. of the 24th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 1454–1464, 2013.
[28] Maher Nouiehed, Jason D. Lee, and Meisam Razaviyayn. Convergence to second-order stationarity for constrained non-convex optimization. CoRR, 2019.
[29] Maher Nouiehed and Meisam Razaviyayn. A trust region method for finding second-order stationarity in linearly constrained non-convex optimization. 2019.
[30] Pentti Paatero and Unto Tapper. Environmetrics, 5(2):111–126, 1994.
[31] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 6471–6480, 2019.
[32] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. Multiplicative weights updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 4961–4969, 2019.
[33] R. Kannan A. Moitra S. Arora, R. Ge. Computing a nonnegative matrix factorization — provably,. In Proceedings of the Annual Symposium on Theory of Computing, pages 145–162, 2012.
[34] Mikkel N. Schmidt and Rasmus Kongsgaard Olsson. Single-channel speech separation using sparse non-negative matrix factorization. In Ninth International Conference on Spoken Language Processing, 2006.
[35] S. Vavasis. On the complexity of nonnegative matrix factorization. SIAM J. Optimization, 20:1364–1377, 2009.
[36] Y. Gong W. Xu, X. Liu. Document clustering based on non-negative matrix factorization. In the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval,, pages 267–273, 2003.
[37] M. Yannakakis. Expressing combinatorial optimization problems by linear programs. J. Comput. Syst. Sci., 43(3):441–466, 1991.
[38] Prabhu Babu Ying Sun and Daniel P. Palomar. Majorization-minimization algorithms in signal processing, communications, and machine learning. IEEE TRANSACTIONS ON SIGNAL PROCESSING,, 65(3):794–816, 2017.