Compressed Dictionary Learning

2018·arXiv

Abstract

1 INTRODUCTION

Low complexity models of high-dimensional data lie at the core of many efficient solutions in modern signal processing. One such model is that of sparsity in a dictionary, where every signal in the data class at hand has a sparse expansion in a predefined basis or frame. In mathematical terms we say that there exists a set of K unit-norm vectors referred to as atoms, such that every signal can be approximately represented in the dictionary as where I is an index set and is a sparse coefficient vector with |I| = S and .

A fundamental question associated with the sparse model is how to find a suitable dictionary providing sparse representations. When taking a learning rather than a design approach this problem is known as dictionary learning or sparse component analysis. In its most general form, dictionary learning can be seen as a matrix factorization problem. Given a set of N signals represented by the data matrix , decompose it into a dictionary matrix and a coefficient matrix ; in other words, find where every coefficient vector is sparse. Since the seminal paper by Olshausen and Field, [11], a plethora of dictionary learning algorithms have emerged, see [3], [10], [21], [23], [25], [36], [24], [26], and also theory on the problem has become available, [15], [38], [7], [2], [32], [33], [14], [8], [6], [35], [39], [40]. For an introduction to the origins of dictionary learning including well-known algorithms see [31], while pointers to the main theoretical results can be found in [34].

Fast dictionary learning algorithms that are applicable to large-scale problems have been proposed in [27], [26], [6], [35], [25]. In large-scale problems we are typically required to learn dictionaries from a massive number of training signals, potentially in high dimensions. When N is large, fast learning algorithms avoid operating explicitly with the data matrix Y by following an online-learning model; in other words, they process the data set Y column-by-column in a serial fashion without having the entire matrix Y available from the start of the learning task. When d is large, the computational cost of processing a single high-dimensional signal can get prohibitively expensive, even in an online-learning model. Random subsampling has been proposed in [26], [27] to deal with the increase in computational cost of dictionary learning in high dimensions. In random-subsampling-based learning algorithms, the computational cost of processing high-dimensional data is reduced by applying a random mask to each column of the data matrix Y , such that a random subset of the entries in the column is observed instead of all the entries. Subsampling schemes require the target dictionary to be incoherent with the Dirac basis. Further, despite the emergence of practical learning algorithms that are applicable to large-scale problems, so far there exist no identification results for fast dictionary learning algorithms which scale well both in the large d and N parameter regime.

Here we take a step towards increasing the computational efficiency of dictionary learning for high-dimensional training data. We introduce an online learning algorithm that scales well with both d and N, and can identify any dictionary with high probability, not only those incoherent with the Dirac basis. On top of that, we provide a theoretical analysis of its local convergence behavior. We will use the ITKrM algorithm introduced in [35] as our starting point. For a target recovery error and noisy signals with sparsity levels and Signal-to-Noise Ratio (SNR) of order O(1), it has been shown that except with probability ITKrM will recover the generating dictionary up to from any input dictionary within radius in iterations, as long as in each iteration a new batch of training signals is used. In particular, for sparsity levels the computational cost of dictionary recovery with ITKrM scales as .

Main Contribution. In this work we show that the signal dimension bottleneck in the computational cost of ITKrM can be shattered by using randomized dimensionality reduction techniques. Our main technical tool is a result due to Johnson and Lindenstrauss in [18], which shows that it is possible to map a fixed set of data points in a high-dimensional space to a space with lower dimension with high probability, while preserving the pairwise distances between the points up to a prescribed distortion level. We introduce the IcTKM algorithm for fast dictionary learning, where we exploit recent constructions of the Johnson-Lindenstrauss embeddings using Fourier or circulant matrices, [5], [4], [20], [16], [19], to efficiently embed both the training signals and the current dictionary. The inner products between embedded signals and atoms are then used as approximations to the exact inner products in the thresholding step. We study the local convergence behavior of IcTKM and prove a similar result as for ITKrM. For a target recovery error and noisy signals with sparsity levels and SNR of order O(1), we will show that except with probability IcTKM will recover the generating dictionary up to from any input dictionary within radius in iterations, as long as in each iteration a new batch of training signals is used and a new embedding is drawn with the embedding dimension m satisfying . This means that the computational cost of dictionary recovery with IcTKM can be as small as . We further show with several numerical experiments on synthetic and audio data that IcTKM is a powerful, low-cost algorithm for learning dictionaries from high-dimensional data sets.

Outline. The paper is organized as follows. In Section 2 we introduce notation and define the sparse signal model used to derive our convergence results. The proposed algorithm is presented in Section 3, where its convergence properties and computational complexity are studied in detail. In Section 4 we present numerical simulations on synthetic and audio training data to illustrate the ability of our proposed algorithm for learning dictionaries with fairly low computational cost on realistic, high-dimensional data sets. Lastly, in Section 5 we present conclusions and discuss possible future research directions.

2 NOTATION AND SIGNAL MODEL

Before we hit the strings, we will fine tune the notation and introduce some definitions. Regular letters will denote numbers as in . For real numbers , we use the notation to convey that . Lower-case bold letters denote vectors while upper-case bold letters are reserved for matrices, e.g., vs. . For a vector u we use u(k) to denote its kth coordinate. The supremum norm of u is defined by . For a matrix U, we denote its conjugate transpose by and its Moore-Penrose pseudo-inverse by . The operator norm of U is defined by . We say that a matrix with m < d has the Restricted Isometry Property (RIP) of order k and level , or the shorthand -RIP, if for all k-sparse vectors we have , see [9] for details. Upper-case calligraphy letters will denote sets; specifically, we let I denote an index set and use the notation to convey the restriction of matrix U to the columns indexed by I, e.g., with and |I| = n for some integer n. Further, we denote by the orthogonal projection onto the span of the columns indexed by I, i.e., . For a set V, we refer to as its indicator function, such that is equal to one if and zero otherwise.

The matrix denotes the generating dictionary, which we define as a collection of K unit-norm vectors , also referred to as atoms, and d is the ambient dimension. The rank of is called the intrinsic dimension of the dictionary, and we have . Since we are dealing with high dimensional data, we will not only consider the usual case of overcomplete dictionaries with d < K but, especially in the numerical simulations, also the undercomplete dictionaries with K < d. The maximal inner-product (in magnitude) between two different atoms of the generating dictionary is called the coherence . The dictionary will be used to generate our training signals as follows

where is a sparse coefficient vector and represents noise. By collecting a set of N training signals generated as in (1), we form our training data set as . We refer to the number of training signals N required to recover a dictionary with high probability as the sample complexity.

We will model the sparse coefficient vector x as a random permutation of a randomly chosen sequence provided with random signs, where the S-largest entry of the randomly chosen sequence is greater than the (S + 1)-largest entry in magnitude. Formally, let C denote a subset of all positive non-increasing, unit-norm sequences, i.e., for a we have with , endowed with a probability measure . To model S-sparsity we assume that almost -surely we have

where is called the absolute gap and the relative gap of our coefficient sequence. To choose the coefficients c, we first draw them according to , then draw a permutation p as well as a sign sequence uniformly at random and set in (1), where . With this notation the signal model then takes the form

For most of our derivations it will suffice to think of the sparse coefficient vector x as having exactly S randomly distributed and signed, equally sized non-zero entries; in other words, C contains one sequence c with for and c(k) = 0 for k > S, so we have and . In particular, the main theorem of the paper will be a specialization to this particular coefficient distribution, while the more detailed result that addresses general sequence sets C is deferred to the appendix. For the numerical simulations we will again use only exactly S-sparse sequences with c(k) = 0 for k > S. However, for they will form a geometric sequence whose exact generation will be discussed in the relevant section.

The noise vector r is assumed to be a centered subgaussian vector with parameter independent of x, that is, E(r) = 0, and for all vectors the marginals are subgaussian with parameter such that we have for all t > 0. Since , with equality holding in the case of Gaussian noise, and , we have the following relation between the noise level and the Signal-to-Noise Ratio, . As before, for most of the paper it suffices to think of r as a Gaussian random vector with mean zero and variance .

The scaling factor in the signal model might seem strange at first glance, but can be thought of as a first moment approximation to the normalization factor , often used in practice. It allows us to handle unbounded noise distributions, while still being relatively simple. Note that the main argument used for the result presented here does not depend on the scaling factor. The corresponding lemma as well as the other lemmata from [35], which we require for the proof of our main result, can be straightforwardly modified to handle a signal model without the scaling factor, when additionally assuming boundedness of the noise. Further, with more technical effort, e.g., by using concentration of around, the Taylor expansion of the square root, and keeping track of the small distortion coming from higher order moments, it is possible to extend the necessary lemmata in [35] to the normalized signal model.

We will refer to any other dictionary as our perturbed dictionary, meaning that it can be decomposed into the generating dictionary and a perturbation dictionary . To define this decomposition, we first consider the (asymmetric) distance of to defined by

Although is not a metric, a locally equivalent (symmetric) version may be defined in terms of the maximal distance between two corresponding atoms, see [35] for details. Since the asymmetric distance is easier to calculate and our results are local, we will refer to distances between dictionaries in terms of (4) and assume that is already signed and rearranged in a way that . With this distance in hand, let and , where by definition. We can write our perturbed dictionary by finding unit vectors with such that we have the decomposition

Lastly, we use the Landau symbol O(f) to describe the growth of a function f. We have f(t) = O(g(t)) if , where C > 0 is a constant.

3 FAST DICTIONARY LEARNING VIA ICTKM

The ITKrM algorithm is an alternating-minimization algorithm for dictionary learning which can also be interpreted as a fixed-point iteration. ITKrM alternates between 2 steps: (1) updating the sparse coefficients based on the current version of the dictionary, and (2) updating the dictionary based on the current version of the coefficients. The sparse coefficients update is achieved via thresholding, which computes the sparse support of each point in the data set Y by finding the S-largest inner products (in magnitude) between the atoms of a perturbed dictionary and the data point as follows

The dictionary update, on the other hand, is achieved by computing K residual means given by

The two most computationally expensive operations in the ITKrM algorithm are the computation of the sparse support and the projection . If we consider one pass of the algorithm on the data set Y , then finding the sparse support of N signals via thresholding entails the calculation of the matrix-product of cost O(dKN). To compute the N projections, on the other hand, we can use the eigenvalue decomposition of with total cost . Stable dictionary recovery with ITKrM can be achieved for sparsity levels up to , see [35] for details, but in practice recovery is carried out with much lower sparsity levels where thresholding becomes the determining complexity factor. Conversely, the projections would dominate the computations only for impractical sparsity levels . We will concentrate our efforts on the common parameter regime where thresholding is the computational bottleneck for stable dictionary recovery.

Although the cost O(dKN) incurred by thresholding N signals is quite low compared to the computational cost incurred by other popular algorithms such as the K-SVD algorithm [3], learning dictionaries can still be prohibitively expensive from a computational point of view when the ambient dimension is large. Our goal here is to shatter the ambient dimension bottleneck in thresholding by focusing on dimensionality-reduction techniques, which will allow us to address real-world scenarios that require handling high-dimensional data in the learning process.

3.1 Speeding-up Dictionary Learning

Our main technical tool for speeding-up ITKrM is a dimensionality reduction result due to Johnson and Lindenstrauss [18]. This key result tells us that it is possible to embed a finite number of points from a high-dimensional space into a lower-dimensional one, while preserving the relative distances between any two of these points by a constant distortion factor. Say we want to embed a set of |X| = p points into m < d, where m is the embedding dimension. By Lemma 4 in [18], there exists a JohnsonLindenstrauss (JL) mapping with , where is the embedding distortion, such that

Further, we know from [5], [4], [20], [16], [19] that the JL mapping in (8) can be realized with probabilistic matrix constructions where the embedding dimension is on par with the bound up to logarithmic factors, and fast algorithms for matrix-vector multiplication can be used to reduce the computational cost to embed the points. A precise definition of random matrices with these nice dimensionality-reduction properties is now given.

Theorem 3.1 (Fast JL-Embeddings [20]). Let be of the form , where is a normalization factor, is a diagonal matrix with entries uniformly distributed on , and is obtained by drawing m rows uniformly at random from a orthogonal matrix, e.g., discrete Fourier/cosine unitary matrices, or a circulant matrix in which the first row is a Rademacher random vector multiplied by a normalization factor, and the subsequent rows are cyclic permutations of this vector. If for obtained from orthogonal or circulant matrices, respectively, then (8) holds with probability exceeding .

Proof: The proof is a direct consequence of Theorem 3.1 in [20]. Since has the -RIP with high probability when drawing m rows at random from a Fourier/cosine matrix for (see Theorem 4.5 in [16]), and a circulant matrix for (see Theorem 1.1 in [19]), it then follows from Theorem 3.1 in [20] that the JL property in (8) holds for with probability exceeding for in the orthogonal case, and in the circulant case.

Remark 3.1 (Computational Complexity). Note that all fast JL-embeddings defined above can be decomposed as with |I| = m, where is a projection onto the indices in I, and is either an orthogonal or a circulant matrix. From this decomposition, we can see that the embedding cost is dominated by the action of Q. If Q is a circulant, or a Fourier/cosine matrix, then Q admits fast matrix-vector multiplication via the FFT and the cost of embedding a point is of order O(d log d). Therefore, from now on we will always think of as being based on one of these three constructions.

Remark 3.2 (Operator Norm). The advantage of an embedding based on an orthogonal Q, meaning a Fourier or cosine matrix, is that the operator norm of is bounded by since the operator norms of all three factors, are bounded by one. In the case of a circulant Q, we have that its singular values correspond to the magnitudes of the DFT of its first row. The operator norm of Q is, therefore, bounded by the supremum norm of the DFT of a normalized Rademacher vector which concentrates around its expectation of order , and so with high probability the operator norm of will be of order . We will see that the operator norm of Q directly affects our admissible noise level . In particular, the circulant matrix construction reduces our admissible noise by a factor of at least O(log d) compared to the orthogonal construction. For simplicity, we will state and prove only the stronger theoretical results for JL-embeddings based on an orthogonal Q, but will point out which part of the proofs needs to be amended for a circulant Q.

3.2 The Proposed Algorithm

We can reduce the computational cost of ITKrM by using fast embedding constructions such as those in Theorem 3.1. Consider updating the sparse coefficients in an alternating minimization algorithm via compressed-thresholding, which now computes the sparse support of each point in the data set Y by finding the S-largest inner products (in magnitude) between the embedded atoms of a perturbed dictionary and the embedded data point as follows

By replacing the thresholding operation of the ITKrM algorithm in (6) with its compressed version in (9), we arrive at the IcTKM algorithm, see Algorithm 3.1.

IcTKM inherits all the nice properties of ITKrM as far as the implementation of the algorithm is concerned. It can be halted after a fixed number of iterations has been reached, and is suitable for online processing and parallelization. In particular, Algorithm 3.1 may be rearranged in a way that the two inner loops are merged into a single loop that goes through the data set. In this implementation, the sparse support is computed for the signal at hand and all the atoms for which are then updated as in . The algorithm proceeds to the next signal, and the dictionary is normalized once all the signals have been processed. Since each signal can be processed independently, the learning process may be carried out in N independent processing nodes and thus we benefit from massive parallelization. Further, we have fairly low storage complexity requirements in this online implementation. We only need to store O (d(K + m)) values which correspond to the input dictionary , the current version of the updated dictionary , the JL embedding , and the current signal . Note that it is not necessary to store the data set Y in the online implementation, as this would have incurred a large storage overhead of O(dN) values in memory.

3.3 Computational Complexity

Considering one pass of the IcTKM algorithm on the data set Y , it can be seen from (9) that to find the sparse support with compressed thresholding, we first need to compress the dictionary and the data set as in and , respectively, and then compute the matrix product to find the S-largest inner products in magnitude. We know from the decomposition in Remark 3.1 that the cost of computing and is dominated by the action of the orthogonal (or circulant) matrix Q. Since Q can be applied with the FFT, the cost of embedding the dictionary and the data set reduces to O(dK log d) and O(dN log d), respectively. Thus, the computational cost of compressed thresholding for one pass on the data set Y is of order O ((d log d + mK)N). Further, we will see in our convergence result that in order to achieve stable dictionary recovery m needs to be larger than log d, and thus the cost of compressed thresholding for overcomplete dictionaries simplifies to O (mKN).

Comparing the cost O(mKN) of compressed thresholding against the cost O(dKN) of regular thresholding, we can see that a significant speed-up in dictionary learning can be achieved with IcTKM if m can be made small, ideally as in . Next, we address the convergence properties of IcTKM and answer the question of how much the data can be compressed. In particular, we present data compression ratios (d/m) : 1 that can be reliably achieved to speed-up the learning process.

3.4 Convergence Analysis

We now take a look at the convergence properties of IcTKM for exactly S-sparse training signals with randomly distributed and signed, equally sized non-zero entries. A precise convergence results for the case of approximately S-sparse signals with more general coefficient distributions can be found in Theorem A.1.

Theorem 3.2. Assume that the generating dictionary has operator norm O(K/d) = O(1) and coherence , and that the training signals are generated following the signal model in (3) with for and c(k) = 0 for k > S, and with Gaussian noise of variance .

Fix a target error and a distortion level , such that

Choose a failure probability parameter and let be a JL embedding constructed according to Theorem 3.1 by drawing

rows uniformly at random from a Fourier or cosine unitary matrix. Then for any starting dictionary within distance to the generating dictionary , after iterations of IcTKM, each using a new batch of

training signals Y and a new JL embedding , the distance of the output dictionary to the generating dictionary will be smaller than the target error except with probability .

Our first observation is that for corresponding to m = d, IcTKM reduces to ITKrM, and Theorem 3.2 reduces to the statement for ITKrM, see Theorem 4.2 in [35]. We also see that the convergence radius and sample complexity are essentially not affected by the compressed thresholding operation. This is due to the fact that we use inner products between the embedded sparse signal and embedded atoms, and exploit their excess separation outside the sparse support to prove recovery, but use only the regular inner products in the update-formula. One side effect of compression is noise folding, a recurring issue in compressed sensing algorithms, see [1] for instance. The noise folding issue manifests itself in our admissible noise level , which is reduced by a factor of compared to ITKrM.

To get a better feeling for the best achievable compression, assume that and that the SNR is 1, corresponding to . If we are satisfied with a moderate target error and a final failure probability log K/K, meaning , which results in a reasonable batch size of signals per iteration, the condition on the distortion level essentially becomes and we get for the best possible compression ratio an embedding dimension as low as (omitting log log factors)

This means that up to logarithmic factors the embedding dimension necessary for local convergence of compressed dictionary learning scales as the sparsity level, and thus is comparable to the embedding dimension for compressed sensing of sparse signals. It also shows that the cost of dictionary learning per signal can be significantly reduced from O(dK) to . To not break the flow of the paper, we present the complete proof together with the exact statement in the appendix, and provide here only a sketch which summarizes the main ideas.

Proof Sketch: To prove this result we will use the JL property in (8) and Theorem 3.1. We first need to ensure that the relative distances between all pairs of embedded atoms of the generating and perturbation dictionaries are preserved up to a distortion with high probability, which is achieved by enforcing the embedding dimension bound in (11). The distance preservation property in conjunction with the assumption that the coefficients have a well balanced distribution in magnitude will ensure that compressed thresholding recovers the generating (oracle) signal support with high probability. With this result in hand, we then make use of the same techniques used in the proof of Theorem 4.2 in [35]. Assuming that compressed thresholding recovers the generating signal support, we will apply a triangle inequality argument to the update formula and show that the difference between the residual based on the oracle signs and supports using , and the residual using concentrates around its expectation, which is small. This concentration property also ensures that the sum of residuals using converges to a scaled version of . The convergence of the sum of residuals will then be used to show that one iteration of IcTKM decreases the error, e.g., for , with high probability. Finally, we iterate the error decreasing property and show that the target error is reached, , after L iterations.

4 NUMERICAL SIMULATIONS

We will now complement our theoretical results with numerical simulations to illustrate the relation between the compression ratio, admissible sparsity level/achievable error, and the computational cost of dictionary learning in a practical setting1. The simulation results will further demonstrate that IcTKM is a powerful, low-cost algorithm for learning dictionaries, especially when dealing with training signals in dimension , where it is possible to speed-up the learning process by up to an order of magnitude. We begin with simulations carried out on synthetic training data, and then follow up with simulations on audio training data obtained from the RWC music database [13].

4.1 Synthetic data

We have generated our synthetic training set with (3), using both overcomplete and undercomplete generating dictionaries and exactly S-sparse coefficients under Gaussian noise. Details for the signal generation are described next.

Generating dictionary. We will generate two types of dictionaries: with intrinsic dimension equal or lower than the ambient dimension d. In both cases the dictionary is constructed as a subset of the union of two bases in , the Dirac basis and the first-half elements of the discrete cosine transform basis, meaning the number of atoms in the generating dictionary amounts to . In case of we simply embed the -dimensional dictionary into by zero-padding the missing entries.

Sparse coefficients. As coefficient sequences c we use geometric sequences with decay factor uniformly distributed in for some 0 < b < 1; to be more specific, we set for and c(k) = 0 for k > S with the normalisation factor ensuring . The maximal dynamic range of our coefficients for this particular arrangement is , and for a given sparsity level S we choose b so that the maximal dynamic range is exactly 4.

Sparsity level. We have experimented with two parameter regimes for the sparsity level, S = O(1) and ; or more precisely, S = 4 and . We have chosen these two sparsity level regimes to experimentally validate our theoretical findings; that for the lower sparsity levels S = O(1) the highest compression ratios can be achieved but at the expense of an increased recovery error and, on the other hand, with the higher sparsity levels recovery precision is increased but only modest improvements in the computational cost are achievable.

Recovery criteria. Given an atom from the output dictionary , the criteria for declaring the generating atom as recovered is . To estimate the percentage of recovered atoms we run 100 iterations of IcTKM (or ITKrM where applicable) with N = 50K log K using 10 completely random dictionary initializations, meaning that the atoms for the initial dictionary are chosen uniformly at random from the unit sphere in .

Noise level. The noise r is chosen as a Gaussian random vector with mean zero and variance 1/(4d). Since for our coefficient sequence and for Gaussian noise, the SNR of our training signals is exactly 4.

Next we present recovery simulations carried out with our synthetic generated training data to evaluate the achievable compression ratio and recovery rates/time. We will also evaluate how IcTKM scales with increasing ambient dimensions.

4.1.1 Compression ratio

In Table 1 we evaluate the highest achievable compression ratio to recover the generating dictionary with the Discrete-Fourier Transform (DFT), Discrete-Cosine Transform (DCT), and Circulant-Rademacher Transform (CRT) as JL embedding. Here we have used the convention that has been recovered if 90% of its atoms are recovered. Synthetic signals with ambient dimension , intrinsic dimension and compression ratios in {1.5, 2, 2.5, 2.9, 3.33, 4, 5, 6.67, 10, 20, 33.33, 40} have been used in this experiment. As predicted by our theoretical results, we can attain much higher compression ratios when using reduced sparsity levels, compare the results with in Table 1(A) and S = O(1) in Table 1(B). Additionally, although low-dimensional training signals have been used in this experiment, a compression ratio of at least 2 : 1 can be attained for sparsity levels rising to 6.67 : 1 for sparsity levels S = O(1), and these good results indicate that a large constant factor might be present in our compression ratio estimate of in the theoretical results. Lastly, note that the DFT attains consistently higher compression ratios than the DCT and CRT as JL embeddings.

TABLE 1: Highest compression ratio achieved with IcTKM.

4.1.2 Recovery rates

In Figure 1 we evaluate the attained dictionary recovery rates for synthetic training signals of ambient dimension d = 1, 024 and intrinsic dimension with ITKrM, and IcTKM using the DFT, DCT, and CRT as JL embedding and increasing compression ratios. The solid yellow line marks the highest recovery rate achieved with ITKrM in the experiment, i.e., 99% recovered atoms for sparsity levels and 94.7% for sparsity levels S = O(1). We can see from the results that IcTKM usually requires far less iterations than ITKrM to reach the target recovery rate. In particular, for sparsity levels in Figure 1(a) the DFT with compression ratios of 2.5 : 1 and 3.3 : 1, or the DCT with a compression ratio of 2.5 : 1 can attain the 99% recovery rate with roughly 60 iterations, or a 40% reduction in the number of iterations compared to ITKrM. A similar trend can be seen in Figure 1(b) for the sparsity levels S = O(1), but here the increase in convergence speed is much more pronounced. In particular, the DFT with a compression ratio of 10 : 1, or the DCT and CRT with a compression ratio of 5 : 1 can attain the 94.7% recovery rate with no more than 40 iterations, or a 60% reduction in the number of iterations compared to ITKrM. Lastly, note that for the sparsity levels , IcTKM always managed to achieve a higher recovery rate than ITKrM. For example, the DFT with compression ratios of 2.5 : 1 and 5 : 1 attained a 99.8% recovery rate compared to the 99% rate achieved with ITKrM. For the sparsity levels S = O(1), the DCT and CRT with a compression ratio of 5 : 1, and the DFT with a compression ratio of 10 : 1 also managed to attain a higher recovery rate of at least 96.5% compared to the 94.7% recovery rate attained with ITKrM, but for some compression ratios the improved recovery rates could not be attained. For example, the DCT with a compression ratio of 2.5 : 1 attained a 93.18% recovery rate.

Fig. 1: Dictionary recovery rates with increasing compression ratios.

4.1.3 Recovery time

In Figure 2 we evaluate the dictionary recovery time attained for synthetic training signals of ambient dimension d = 1, 024 and intrinsic dimension with ITKrM, and IcTKM using the DFT, DCT, and CRT as JL embedding and increasing compression ratios. The solid yellow line again marks the highest recovery rate achieved with ITKrM in the experiment. As predicted by our theoretical results, we can attain much better improvement in the computational complexity of IcTKM when using reduced sparsity levels, compare the results with in Figure 2(a) and S = O(1) in Figure 2(b). In particular, for the higher sparsity levels, IcTKM with the DCT and a compression ratio of 2.5 : 1 requires 9.02 hours to attain the 99% recovery rate compared to 18.44 hours required by ITKrM, or a 2.04 speed-up in dictionary recovery time. For the lower sparsity levels, on the other hand, the DCT with a compression ratio of 5 : 1 required 2.98 hours to attain the 94.7% recovery rate compared to 16.3 hours required by ITKrM, or a 5.47 speed-up ratio. Lastly, note that although the DFT at a given compression ratio usually requires less iterations to recover the dictionary than the DCT and CRT at same compression ratio, compare the results in Figure 1, the recovery time results do not reflect this faster convergence rate. We can see from Figure 2 that the DFT is usually slower than the other transforms to recover the dictionary. The reason for the worse performance is that the matrix product in the DFT has to be computed with complex numbers, thus requiring twice the amount of arithmetic operations than the DCT and CRT.

Fig. 2: Dictionary recovery time with increasing compression ratios.

4.1.4 Scalability

In Figure 3 we evaluate the scalability of dictionary recovery with ambient dimension for ITKrM, and IcTKM using the DFT, DCT, and CRT as JL embedding. Synthetic signals with ambient dimension ranging from d = 2, 048 up to d = 131, 072 have been used in this experiment. To carry out the learning process with these high-dimensional signals, we have fixed with across all the ambient dimensions tested, so that we could avoid the large dictionary memory overhead, e.g., in the highest dimension setting K = (3/2)d would have required more than 200 gigabytes of volatile memory to manipulate the dictionary matrix in double-precision floating-point representation. Similar to before, we compare the time needed to recover 99% of all atoms for and 94.7% of all atoms for S = O(1). We can see from the results that IcTKM performs particularly well on high-dimensional signals. For the higher sparsity levels in Figure 3(a), IcTKM with the DCT and a compression ratio of 2.5 : 1 is faster than ITKrM to recover the dictionary at the highest dimension tested, while the DFT and CRT are roughly faster. For the lower sparsity levels S = O(1) in Figure 3(b), on the other hand, IcTKM performs significantly faster than ITKrM. In particular, IcTKM with the DCT and a compression ratio of 5 : 1 is almost faster than ITKrM to recover the dictionary for signals with d = 131, 072.

Fig. 3: Scalability of dictionary recovery time with ambient dimension and compression ratio.

4.2 Audio data

For the real data we have selected several recordings comprised of stereo audio signals sampled at 44.1 KHz. Details for the audio training data and the recovery simulations are described next.

Audio recordings. We have used three audio recordings of roughly 10 minutes each for carrying out the simulations. The recordings represent distinct musical genres: classical, folk Japanese and Flamenco musical pieces, which have been obtained from RWC’s classical and music genre databases. The classical piece is the first movement of a piano sonata by Mozart. The folk Japanese piece are min’y¯o traditional songs comprised of female vocal, shamisen (three-stringed instrument), shakuhachi (an end-blown long flute), shinobue (a high-pitched short flute), and traditional percussion instruments. The Flamenco piece is solely comprised of male vocal, and guitar which also acts as a percussive instrument.

Block size/overlap. We have first summed the audio signals to mono, and then partitioned the resulting signals into smaller blocks. Short duration blocks of 0.25 seconds and long ones of 1 second have been used. The blocks were allowed to overlap such that the maximally allowed amount of overlap of one block with a shifted version of itself varied from 95% for the short block, up to 98.75% for the long block.

Training signals. The dictionaries have been learned directly from the time-domain samples of our musical recordings, with each audio block assigned to one training signal. The short and long blocks amount to training signals with ambient dimension of d = 11, 025 and d = 44, 100, respectively. The number of training signals for the three audio recordings were approximately N = 48, 000 for the classical piece, N = 42, 000 for the folk Japanese, and N = 59, 000 for the Flamenco piece.

Learning parameters. We have carried out the learning simulations with two dictionary sizes of K = 64 and K = 256 atoms, and the sparsity level was fixed at S = 4. To learn dictionaries on the audio data, we ran 200 iterations of IcTKM with a DCT based JL embedding and a compression ratio of 5 : 1.

Next, we will explore the ability of IcTKM to learn audio dictionaries for extracting notes of the musical recordings. We will also take a look at how increased ambient dimensions can be used to improve the tone quality of the audio representations.

4.2.1 Extracting musical notes

In Figures 4 to 6 we evaluate the magnitude spectra of the recovered atoms for the classical piano, folk Japanese, and Flamenco recordings. The learning simulations have been carried out with short duration blocks of 0.25 seconds, and the atoms in the learned dictionaries have been sorted by their fundamental frequency. We can see from the results that the larger dictionary is able to capture more musical notes than the smaller one. In particular, for the smaller dictionary we have identified 26 unique fundamental frequencies in the range [108, 992] Hz for the classical piano recording, and 24 frequencies in [92, 440] Hz for the Flamenco recording. For the larger dictionary, on the other hand, we have 55 unique fundamental frequencies in the range [108, 1408] Hz for the classical piano, and 56 frequencies in [88, 524] Hz for the Flamenco recording. These unique fundamental frequencies correspond to notes of the Western equally tempered 12 tone scale found in the recordings. For the folk Japanese recording we have found more notes in a larger frequency range; 31 unique fundamental frequencies in the range [132, 1584] Hz for the smaller dictionary, and 98 frequencies in [128, 1784] Hz for the larger dictionary. We can further see from the results that the learned dictionaries sometimes have multiple atoms with same fundamental frequency, but these equally pitched atoms usually differ in their harmonic structure.

4.2.2 Tone quality

The learned dictionaries have been found to possess musical notes with distinct tone quality, and this is a direct consequence of the different musical genres and instruments in the audio training data2. In particular, the musical notes found in the dictionary of the classical piece have a distinct piano tone quality, while in the Flamenco piece the notes usually have a mixed tone quality reflecting pitched notes from the guitar/vocals and percussive, un-pitched sounds from tapping the guitar plate and plucking its strings. In the dictionary for the folk Japanese piece the lower-pitched atoms have a distinct drum

1.7

2.6

3.5

1.7

2.6

3.5

Fig. 4: Spectra of recovered dictionaries for the classical piano recording using increasing dictionary sizes.

1.7

2.6

3.5

1.7

2.6

3.5

Fig. 5: Spectra of recovered dictionaries for the folk Japanese recording using increasing dictionary sizes.

1.7

2.6

3.5

1.7

2.6

3.5

Fig. 6: Spectra of recovered dictionaries for the Flamenco recording using increasing dictionary sizes.

tone quality while the mid- and high-pitched ones resemble the tone quality of the traditional Japanese flutes. The harmonic content of the female vocal can be found in many atoms of the learned dictionary, which gives them a distinct chorus-like sound quality.

In Figure 7 we evaluate the spectrograms for atoms of the folk Japanese dictionary. Dictionaries have been learned with the short and long audio blocks, and their atoms have been similarly sorted by fundamental frequency. Figure 7(a) shows the spectrograms of the atoms number 15 of the dictionaries learned with short blocks (on the left) and long blocks (on the right). Similarly, Figure 7(b) shows the spectrograms of the atoms number 183. As can be seen from these figures, the learned dictionaries can extract similar musical notes, but the higher-dimensional training signals promote notes with a much richer harmonic structure. This intricate harmonic structure translates to dictionaries where the individual instruments and vocals in the musical piece can be more easily identified.

Fig. 7: Spectrograms of atoms recovered with increasing ambient dimensions.

5 CONCLUSIONS

We have shown in this work that IcTKM is a powerful, low-computational cost algorithm for learning dictionaries. Given a recovery error and failure probability , IcTKM will recover an incoherent generating dictionary from noisy signals with sparsity levels and SNR of order O(1) in iterations, as long as the initial dictionary is within a radius to the generating dictionary, and in each iteration a new batch of training signals as well as a new random embedding with embedding dimension m satisfying is used. This means that the computational cost of dictionary recovery with IcTKM can be reduced by a factor , and lets us conclude that IcTKM is an appealing algorithm for learning dictionaries from high-dimensional signals, particularly in learning tasks with heavily-sparse data under controlled noise levels.

We have further demonstrated with numerical experiments that IcTKM can stably recover dictionaries with low computational cost in a practical setting. For synthetic signals, we had successfully carried out the learning process with high compression ratios, even when low-dimensional data were used. We have also seen that IcTKM scales quite well with increasing ambient dimensions. For high-dimensional signals with roughly a tenth-of-a-million dimensions, we were able to speed-up the learning process by up to an order of magnitude. Further, IcTKM has been shown to be a powerful algorithm for learning dictionaries from high-dimensional audio data. The learned dictionaries worked particularly well for extracting notes from large musical pieces. We have further seen that the ability to learn dictionaries from high-dimensional audio signals allows us to more easily identify individual instruments from musical pieces. The learned dictionaries have been found to contain richer harmonic structure directly corresponding to the musical instruments, particularly for the longer-duration training data.

There are a few research directions we would like to pursue for future work. In Non-Negative Matrix Factorization (NMF), the principle of learning representations by the (additive) combination of multiple bases is achieved via the matrix factorization , where and X are only allowed to have non-negative valued entries, see [22]. Sparse NMF, where X is required to be sparse in addition to non-negative, has been shown to work quite well in audio processing tasks such as pitch detection, automatic music transcription, and source separation [29], [41], [30], [37]. In these applications the learning process is typically carried out in the frequency domain, and thus the data matrix Y is usually given by the power spectrum of the audio training data. Addressing sparse NMF problems with ITKM based algorithms is a line of inquiry we would like to pursue. Non-negative ITKM requires straightforward adjustments to the sparse coefficient update formula to ensure that the updated dictionary is non-negative.

Inspired by masked dictionary learning, see [28], another line of inquiry we want to pursue is blind compressed sensing. In the masked setting we are asked to recover the generating dictionary from training data which has been corrupted or lost via a binary erasure channel. Data corruption is typically modeled with the concept of a binary mask M, a diagonal matrix with 0 or 1 entries in the diagonal, where the corrupted data is thus given by MY , see [28] for details. A practical application of masked dictionary learning is image inpainting, which is based on the observation that if the training data is S-sparse in the generating dictionary, then the corrupted data must also be sparse in the corrupted dictionary. In other words, if then , where X is sparse. We can see that masked dictionary learning is closely related to compressed dictionary learning, in the sense that the mask M has a similar role to . By erasing the data with zeros in the diagonal of M we are effectively reducing the dimension of our learning problem. However, since the erasures occur always in different coordinates we are able to observe the signals on different coordinates for each mask and combining these observations allows us to recover the full signal. To employ these concepts for compressed dictionary learning, we simply need to choose the masks such that they behave as a low-distortion embedding similar to the JL lemma. Conversely, we can use our dimensionality reduction tools to study the theoretical properties of masked dictionary learning, and thus to prove local convergence of the ITKrMM algorithm presented in [28]. Finally, note that such a combination of compressed and masked dictionary learning supported with theoretical guarantees would be a big step towards blind compressed sensing [12].

Acknowledgment

This work was supported by the Austrian Science Fund (FWF) under Grant no. Y760. The computational results presented have been achieved (in part) using the HPC infrastructure LEO of the University of Innsbruck. Further, we gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for the synthetic simulations.

Part of the work on the audio simulations was done while F. Teixeira was affiliated to the University of Innsbruck and visited the Parsimony and New Algorithms for Audio and Signal Modeling (PANAMA) team at Inria Rennes, Bretagne Atlantique. He wishes to thank the PANAMA team for its hospitality.

Finally, we would like to thank the anonymous reviewer for references and suggestions to improve the theoretical results and their presentation as well as Marie Pali and Simon Ruetz for proofreading the manuscript at various stages.

APPENDIX A EXACT STATEMENT AND PROOF OF THEOREM 3.2

Before we can state and prove the exact version of Theorem 3.2 we have to introduce additional notation and a few statistics for our signal model in (3). First, we refer to the position of the largest S terms of x (in magnitude) as the oracle support . On top of the already defined absolute and relative gaps, we also set for a -random draw from our coefficient set C the following statistics

We can think of the constant as the expected energy of the S-sparse approximations of our signals, meaning . The constant is related to the expected size of a coefficient in the sparse approximation via . In particular, we have the following bounds and . For the simple distribution based on for and 0 otherwise, we have the equality since in this simplification. Finally, the constant can be lower bounded by , see [17], [33] and thus for large we have , and so we can think of as the signal-to-noise ratio.

We are now ready to state the general version of Theorem 3.2.

Theorem A.1. Let the generating dictionary have coherence and operator norm . Assume that the training signals follow the signal model in (3) with coefficients that have an absolute gap and a relative gap . Further assume that and . Take a JL embedding based on an orthogonal transform in Theorem 3.1, and choose an embedding distortion and a target error , where

If the initial dictionary satisfies

and the embedding dimension m is at least of order

then after iterations of IcTKM, each using a new JL embedding and a new training data set Y , the output dictionary satisfies except with probability

where is the compressed-thresholding residual based on defined by

and is the oracle residual based on (or )

Applying the triangle inequality to (19), we have, where

and with

Now from Lemma B.10 in [35], we know that the inequalityimplies that

As for 0 < x < 1 we can define and assuming that 0 < t < s get

This means that to ensure the error is not increased in one iteration, a tight control of t/s needs to be established with high probability.

We proceed by controlling t and s using concentration of measure results. Starting with the first term , from Lemma B.1 to be found in Appendix B we have for and the following estimate

as long as the embedding dimension satisfies (17) and

and for , the third term is estimated as

Finally, for we estimate the last term as follows

Collecting the concentration of measure results, we have

except with probability given by the sum of the right-hand sides of (22) and (24) to (26). We can see from (27) that to have the error decreased in one iteration with high probability, it suffices to choose the constants to , and . Assuming that the target error and setting , and we arrive at . Since we assume by (21) we have , except with probability

Further substituting the value for the constant in (23), we arrive at the estimate for the convergence radius in (16).

Lastly, we need to ensure that the target error is actually reached. Since for each iteration we obtain by redrawing and m rows from an orthogonal matrix, and Y by redrawing N training signals , it follows that after L iterations the error will satisfy except with probability at most . Thus, to reach the target error we set . Using the fact that the exponential terms in (28) are dominated by the second exponential term when replacing with , we can bound the failure probability by

which leads to the final estimate in (18).

With the general Theorem in hand it is straightforward to arrive at the featured Theorem 3.2. Proof of Theorem 3.2: The proof is a direct application of Theorem A.1 to dictionaries with B = O(K/d) = O(1) and exactly S-sparse signals with for and c(k) = 0 for k > S and Gaussian noise with variance . In this case and . Therefore we get

and to have a target error satisfying , we get the following bound on the admissible sparsity level and embedding distortion

which simplifies to (10). Further, the above bound and (10) ensure the condition on and imply that , which in turn means that the condition on the initial dictionary becomes

To reach the target error we need iterations. Choosing an overall failure probability of order , the bound in (18) leads to the requirements on the embedding dimension m and the sample size N. So we need an embedding dimension

which, using the bound on together with

simplifies to (11), and a sample size

which reduces to (12).

APPENDIX B TECHNICALITIES

Lemma B.1 (Compressed-Thresholding/Oracle-Residuals Expected Difference). Assume that follows the model in (3) with coefficients that are S-sparse, and that have an absolute gap and a relative gap . Further assume that , and that is a JL embedding based on an orthogonal transform as in Theorem 3.1. We have for

whenever

and

Proof: Our first observation is that when compressed thresholding succeeds, meaning it recovers the oracle support and oracle signs, then the two residuals will coincide for all k. So defining the event of thresholding failing as

and using the orthogonality of the projectionsand the bound , we have for all k

This means that we can estimate

and that it suffices to bound the number of signals for which thresholding fails with high probability. Clearly this number depends on the properties of the chosen embedding and how well it preserves the inner products between the atoms of the current dictionary and the atoms of the generating dictionary making up the signals. Recalling the decomposition of the perturbed atoms , we define for a given distortion

We then have

We first relate the probability of drawing an embedding with too high distortion to the embedding dimension m. We define the set of vectors , where

From Theorem 3.1 we know that for a DCT or Fourier based JL embedding with embedding dimension m satisfying

which is ensured by (32), we have with probability at least , that for all

This means that the squared norms of and are preserved up to , while setting and in the polarization identities further yield

Looking back at the definition of in (36), we then see that as long as the embedding dimension m satisfies (32).

Next we assume that we have drawn satisfying (32) and estimate the number of signals for which the residuals do not coincide for all k - now only depending on the draw of signals. First we rewrite

Since the signals are independent, so are the indicator functions and applying Bernstein’s inequality to their centered versions yields

meaning we still have to bound the probability . To ensure that compressed thresholding recovers the oracle support, i.e., we need to have

Expanding the inner products using the decomposition we get

from which we obtain the following lower (upper) bounds for ()

Substituting the above into (42), and using the norm-preserving property of , meaning for all k we arrive at the following sufficient condition for having ,

Note that the condition above already implies that the embedded inner products have the correct sign. However, we are using the unembedded inner products to determine the sign in the algorithm, so we still have to analyse the second event . From (43) with we can see that it suffices to ensure that for all .

Thus, the event of thresholding failing is contained in the events , where

and in analogy for

We first bound using a union bound over k, as well as Hoeffding’s inequality and the subgaussian property of the noise

Since we fixed we have . Further, was constructed based on an orthogonal transform, so we have . Using these estimates in the bound above, we get

and repeating the same steps for we arrive at

For the second event we also use a union bound over all k, as well as Hoeffding’s inequality and the subgaussian property of the noise to arrive at

The term is again bounded by , while for the sum in the denominator involving the embedded inner products we get,

where for the last step we have used the Cauchy-Schwarz inequality. Since and we get

As before we repeat the steps above for and, using the bound , arrive at

Combining (48)/(49) with (51)/(52) and choosing

we arrive at

where we have used that for . We first observe that each exponential with prefactor S is dominated by an exponential with prefactor K and that we assumed . Next note that we chose the embedding dimension according to (38), thus , meaning that , where the last bound has the advantage of remaining true for m = d, when . Further, for already the first term on the right hand side in the inequality above is larger than one, so the bound is trivially true without the third term. On the other hand for we can bound the denominator in the exponential of the third term by , leading

to

Using the definitions of and , the fact that and the assumption we get

From Lemma A.3 in [32] we know that for we have whenever . Thus (31) implies that

and looking at the definition of we get

Substituting this bound into (41) we get for a fixed

which combined with (37) and (35) yields the statement of the Lemma.

As already mentioned, the proof can be amended to get slightly weaker results also for fast JLembeddings based on the circulant matrices. We will sketch the necessary steps in the following remark.

Remark B.1 (Circulant JL-Constructions). Note that in the case of JL-embeddings based on the circulant matrices the operator norm bound , we have used to control the noise terms is no longer valid. However, we can show that with high probability the operator norm of a circulant matrix, built from a normalized Rademacher vector v, which is equivalent to the supremum norm of the DFT of v, is of order . This means for we have

and accordingly the bound in (32) is replaced by

REFERENCES

[1] E. A.-Castro and Y. C. Eldar. Noise folding in compressed sensing. CoRR, abs/1104.3833, 2011.

[2] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In COLT 2014 (arXiv:1310.7991), 2014.

[3] M. Aharon, M. Elad, and A.M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing., 54(11):4311–4322, November 2006.

[4] N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbours. SIAM J. Computing, 39(1):302–322, 2009.

[5] N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete & Computational Geometry, 42(4):615–630, 2008.

[6] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. In COLT 2015 (arXiv:1503.00778), 2015.

[7] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT 2014 (arXiv:1308.6273), 2014.

[8] B. Barak, J.A. Kelner, and D. Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In STOC 2015 (arXiv:1407.1543), 2015.

[9] E. J. Cand`es and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[10] K. Engan, S.O. Aase, and J.H. Husoy. Method of optimal directions for frame design. In ICASSP99, volume 5, pages 2443 – 2446, 1999.

[11] D.J. Field and B.A. Olshausen. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[12] S. Gleichman and Y. C. Eldar. Blind compressed sensing. CoRR, abs/1002.2586, 2010.

[13] M. Goto. Development of the RWC music database. In Proc. 18th International Congress on Acoustics (ICA 2004), pages 553–556, 2004.

[14] R. Gribonval, R. Jenatton, and F. Bach. Sparse and spurious: dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61(11):6298–6319, 2015.

[15] R. Gribonval and K. Schnass. Dictionary identifiability - sparse matrix-factorisation via -minimisation. IEEE Transactions on Information Theory, 56(7):3523–3539, July 2010.

[16] I. Haviv and O. Regev. The restricted isometry property of subsampled fourier matrices, volume 2169 of Lecture Notes in Mathematics, pages 163–179. Springer Verlag, Germany, 2017.

[17] D. Hsu, S.M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability (arXiv:1110.2842), 17(14), 2012.

[18] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[19] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the Restricted Isometry Property. Communications on Pure and Applied Mathematics, 67(11):1877–1904, 2014.

[20] F. Krahmer and R. Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.

[21] K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T. Lee, and T.J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computations, 15(2):349–396, 2003.

[22] D. D. Lee and H. S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 1999.

[23] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computations, 12(2):337–365, 2000.

[24] J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804, 2012.

[25] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

[26] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML (arXiv:1605.00937), volume 48, pages 1737–1746, 2016.

[27] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2018.

[28] V. Naumova and K. Schnass. Fast dictionary learning from incomplete data. EURASIP Journal on Advances in Signal Processing, 2018(12), 2018.

[29] M. D. Plumbley, S. A. Abdallah, T. Blumensath, and M. E. Davies. Sparse representations of polyphonic music. Signal Process., 86(3):417–431, March 2006.

[30] M. D. Plumbley, T. Blumensath, L. Daudet, R. Gribonval, and M. E. Davies. Sparse Representations in Audio and Music: from Coding to Source Separation. Proceedings of the IEEE., 98(6):995–1005, June 2010.

[31] R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.

[32] K. Schnass. On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Applied Computational Harmonic Analysis, 37(3):464–491, 2014.

[33] K. Schnass. Local identification of overcomplete dictionaries. Journal of Machine Learning Research (arXiv:1401.6354), 16(Jun):1211–1242, 2015.

[34] K. Schnass. A personal introduction to theoretical dictionary learning. Internationale Mathematische Nachrichten, 228:5–15, 2015.

[35] K. Schnass. Convergence radius and sample complexity of ITKM algorithms for dictionary learning. accepted to Applied and Computational Harmonic Analysis, 2016.

[36] K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, 2010.

[37] P. Smaragdis. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In Independent Component Analysis and Blind Signal Separation, pages 494–499, 2004.

[38] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In COLT 2012 (arXiv:1206.5882), 2012.

[39] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: Overview and geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

[40] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: Recovery by riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–915, 2017.

[41] T. Virtanen. Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066–1074, March 2007.

Designed for Accessibility and to further Open Science