Robust Sparse Coding via Self-Paced Learning

2017·Arxiv

Abstract

Abstract

Sparse coding (SC) is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications. However, most existing sparse coding algorithms are nonconvex and are thus prone to becoming stuck into bad local minima, especially when there are outliers and noisy data. To enhance the learning robustness, in this paper, we propose a unified framework named Self-Paced Sparse Coding (SPSC), which gradually include matrix elements into SC learning from easy to complex. We also generalize the self-paced learning schema into different levels of dynamic selection on samples, features and elements respectively. Experimental results on real-world data demonstrate the efficacy of the proposed algorithms.

Introduction

Learning effective representations for high dimensional data plays an essential role in machine learning tasks for various types of data, such as text, image, speech and video data. Among these techniques, sparse coding is attracting more and more attention due to its comprehensive theoretical studies and its excellent performance in many signal processing applications, especially computer vision tasks, including supervised image classification (Wright et al. 2009), semi-supervised image classifica-tion (Lu et al. 2015), image clustering (Yang et al. 2014; Feng, Wu, and Zhou 2017), image restoration (Dong et al. 2015), etc. Usually, in these areas, the signals studied are image feature or instance, and sparse coding is a powerful tool for feature/image reconstruction(Yang et al. 2009; Feng, Wu, and Zhou 2017) and similarity measurement (Yan and Wang 2009; Yang et al. 2014). Besides, it has also been successfully applied in text mining (Li et al. 2015), speech signal processing (Zhen et al. 2016), recommendation (Feng et al. 2016), etc.

Basically, given a data matrix , consisting of n data points measured on m dimensions, is a vector represent- ing the i-th data point. Sparse Coding (SC) (Lee et al. 2006) aims at extracting a dictionary (also called a base) that consists of a set of atom items (called a basis), and converts data instances into sparse coordinates with respect to the dictionary. In other words, it decomposes the data matrix X into a dictionary matrix and a sparse code matrix , so that each data sample can be represented as a sparse linear combination of atom items (i.e., ) in the dictionary B with corresponding sparse coefficient vector . Formally, the sparse coding problem can be described as the following optimization problem:

s.t. where denotes the matrix Frobenius norm (i.e., denotes the norm of a given vector (i.e., ), and is the sparsity regularization parameter that represents the tradeoff between reconstruction error and sparsity. Naturally, the optimization of sparse coding in (1) is not jointly convex for both B and S, but it is convex in B (while holding S fixed) and convex in S (while holding B fixed) separately. Thus, it is usually solved iteratively by alternatingly minimizing over B and S, while holding the other fixed. In this way, when B is fixed, (1) becomes an -regularized non-smooth convex optimization problem and many efforts have been done for solving this type of problem (Yang et al. 2010). When S is fixed, it is a well-defined constrained least squares problem, which can be solved effi-ciently by existing optimization methods (Lee et al. 2006). Though we can optimize SC by an alternative method, there is a big limitation caused by the non-convexity of the objective in (1). This deficiency will drive the current SC optimization approach get stuck into bad local minima, especially when many noises and outliers exist in the datasets. To relieve the affect of non-convexity, a heuristic approach is to run the algorithm multiple times with different initializations or learning rates, and pick the best results based on cross-validation. However, this strategy is ad hoc and generally inconvenient to implement in unsupervised setting, since there is no straightforward criterion for choosing a proper solution. Thus, for robust SC optimization,

our goal is to alleviate the influence of the noisy and confusing data, as the confusing data usually correspond to the highly nonlinear local patterns which is hardly learnable for the model space, and the noisy ones are the outliers that should not be learned, such as noisy pixels in images, or noisy words in texts. Typically, this learning robustness will be achieved by a well sample selection to distinguish the reliable samples from the confusing ones. Fortunately, a recently studied learning regime, named Self- Paced Learning (SPL) (Kumar, Packer, and Koller 2010) is making effort for this issue. The core idea of SPL is to learn the model with easier samples first and then gradually process more complex ones, which is simulating the learning principle of humans/animals. This learning mechanism is empirically demonstrated to be ben-eficial for some learning tasks (Xu, Tao, and Xu 2015; Zhao et al. 2015; Li et al. 2017b), where it can in avoid bad local minima and achieve a better generalization result. Therefore, incorporating it into SC is expected to alleviate the local minimum issue of alternative learning.

To enhance the learning robustness, in this paper, we propose a unified framework named Self- Paced Sparse Coding (SPSC). With an adaptive pace from easy to hard in optimization, SPSC can dynamically introduce the data values into learning from easy ones to complex ones. Specifically, for different levels of dynamic selection, we extends SPSC into three variants, that is Sample-wise SPSC, Feature-wise SPSC and Element-wise SPSC. We also present a simple yet effective algorithm framework for solving the proposed SPSC problems. Experimental results on the real-world clean datasets and those with corruption demonstrate the effectiveness of the proposed approach, compared to the state-of-the-arts.

Related Work

In this section, we will review related works on sparse coding and self-paced learning, as our work Self- Paced Sparse Coding lies in cross road of these two fields.

Sparse Coding Sparse coding consists of two components, that is, the optimization of the corresponding sparse codes (as dictionary B is known) and the learning of optimal dictionary (as sparse code S is given). For sparse code optimization, when the sparsity of codes is measured as -norm, it can be solved by methods such as matching pursuit, orthogonal matching pursuit, basis pursuit, etc; and multiple approaches are presented for -norm ionization, such as coordinate descent (Wu and Lange 2008), interior-point method (Kim et al. 2007), active set method (Lee et al. 2007), and augmented lagrange multiplier method (Lin, Chen, and Ma 2010)

The dictionary is usually automatically learned from data rather than by a predetermined dictionary. Typically, sparse coding is suitable to reconstruct data, and consequently variants of traditional sparse coding are presented for different purposes: (i) Structure regularized sparse coding. Structured sparse coding methods exploit different structure priors among the data. (Sun et al. 2014) incorporate additional structured sparsity to the sparse coding objective functions, which leads to the promising performances on the applications. Graph/Hypergrpah regularized sparse coding (Zheng et al. 2011; Gao, Tsang, and Chia 2013; Feng, Wu, and Zhou 2017) are proposed to preserve the intrinsic manifold structure among high dimensional data objects, and impose Laplacian regularization as a manifold regularization on sparse coefficients. (ii) Supervised/Semisupervised sparse coding. In this group, label information of a training dataset is leveraged to learn a discriminative dictionary for better classification performances. Approaches in (M. Yang and Feng 2011) attempt to learn multiple dictionaries or category-specific dictionaries to promote discrimination between classes. Others in (Z. Jiang and Davis 2012) learn a dictionary shared by all classes by merging or selecting atom items from an initially large dictionary or incorporates discriminative terms into the objective function during training (Zhang and Li 2010; Z. Jiang and Davis 2013; Zheng and Tao 2015). (iii) Muli-modal sparse coding. In this group, traditional sparse coding approaches can be extended to benefit from the information encoded in multiple sources for cross-domain sparse representation (Jing et al. 2014; Li et al. 2017a).

Most of the current SC models are developed on nonconvex optimization problems, and thus always encounter the local minimum problem, especially when there are noisy and outlier data. We thus aim to alleviate this issue by advancing it into the SPL framework.

Self-Paced Learning

Inspired by the intrinsic cognitive principle of humans/animals, Bengio et al. (2009) initialized a new learning paradigm called curriculumlearning (CL), the core idea of which is to learn a model by gradually including samples into training from easy to complex for training non-convex objectives. This learning process is beneficial for alleviating local solation in non-convex optimization, as supported by empirical evaluation (Basu and Christensen 2013). However, it requires a priori identification to identify the easy and the hard samples in advance, which is difficult in real-world applications. To overcome the limitation due to the heuristic sample easiness ranking as in CL, Kumar et al. (2010) proposed a concise model, called Self-Paced Learning (SPL) to introduce a regularizor into the learning objective. SPL can automatically optimize an appropriate curriculum by the model itself with a weighted loss term on all samples and a general SPL regularizer imposed on sample weights. Jiang et al. (2015) presented a more comprehensive model learning underlying SPL/CL, as a general optimization problem as

s.t. (2) where a training set denotes the loss function which calculates the cost between the true label and estimated label under a model g with w as its model parameter to learn; denotes the weight variables reflecting the samples confidence; f corresponds to a self-paced regularizer function for the sample selection scheme; is a feasible region that encodes the information of a predetermined curriculum; is a parameter to control the learning pace, also referred as ”pace age”. For the self-paced regularizer function , Jiang et al. (2015) abstracted three necessary condition should be satisfied.

By sequentially optimizing the model with gradually increasing pace parameter on the SPL regularizer, more samples can be automatically discovered in a pure self-paced way. The number of samples selected at each iteration is determined by a weight that is gradually annealed such that later iterations introduce more samples. The algorithm converges when all samples have been considered and the objective function cannot be improved further. By virtue of its generality, SPL have been considered in a broad spectrum of latent variable models, such as matrix factorization (Zhao et al. 2015), clustering (Xu, Tao, and Xu 2015), multi-task learning (Li et al. 2017b), dictionary learning (Tang, Yang, and Gao 2012; Xu et al. 2016). Opposite to existed self-paced dictionary learning, in this paper we try to utilize the generalized SPL sheila and implement a soft sample selection accordingly rather than a heuristic hard strategy sample selection in (Tang, Yang, and Gao 2012). Also, in (Xu et al. 2016), it only address the issue of bad local minimum of non-convex sparse coding optimization by assigning weights to different samples of training data. However, due to the complexity and noisy nature of real data, the con-fidence levels of features of the dataset may also vary (eg., features extracted with different approaches for a set of images should have different confidence level), and the confi-dence levels of the specific values in the same sample also vary (eg., in image denoising task, noisy pixels should have lower confidence level comparing to the uncorrupted ones in the same image). Thus, considering a unified self-paced learning for sparse coding may be the further direction.

Self-Paced Sparse Coding

The Formulation of SPSC

Generally, the objective of SC is to learn a dictionary B and representation S from X such that it can well reconstruct the data and S can follow some prior knowledge as in (1). Inspired by the fact that humans often learn concepts from the easiest to the hardest as in SL and SPL, we incorporate the easy-to-hard strategy operated on the samples into the learning process of SC, as Self-Paced Sparse Coding (SPSC). Thus, following the framework of SPL in (2), the new learning objective function of SPSC can be formulated as:

where denotes the weight vector imposed on samples in X; let ’’ denote element-wise square root of a matrix; ‘’ denotes the diagonal

matrix with diagonal entries as those in the vector, that is,

It is noted that, different from the original SPL learning framework as in (2), as sparse coding is an originally unsupervised learning, the loss function corresponds to , which measures the reconstruction error be- tween and estimated after learning.

SPL Regularizer

For regularizer , the original SPL (Kumar, Packer, and Koller 2010) adopted negative -norm as

Under this regularizer, with fixed (B, S), the global optimal is calculated by the optimization as

Then, the solution can be written as a hard threshold schema:

Instead of hard weighting, in our experiments, we utilize the linear soft weighting regularizer (Jiang et al. 2015) due to its relatively easy implementation and well adaptability to complex scenarios. This regularizer penalizes the sample weights linearly in terms of the loss. Specifically, we have

(6) is convex with respect to , and thus the global minimum can be obtained with analytical solution for the linear soft weighting, as,

It is seen that if an sample can be well reconstructed (), it will be selected as an easy sample () and the weight is decreasing with respect to the reconstruction error, or otherwise unselected (). The parameter controls the pace at which the model learns new samples, and physically it can be interpreted as the ”age” of the model. When is smaller, more samples will be unselected as their corresponding losses are more likely to be greater than . As becomes larger, more samples with larger losses will be gradually considered.

Regularization on Sparse Codes Prior relational knowledge among the samples is very useful when learning new representations. To embed this knowledge into sparse coding, we will adopt the newly proposed

manifold regularizer, called hypergraph consistency regularization (HC) (Feng, Wu, and Zhou 2017), which is empirically shown to perform better in sparse coding. Specifically, it tries to reconstruct the hyperedge incidence matrix using sparse codes S in addition to the Laplcaican regulation on S as

(8) where, is the incidence matrix of hypergraph G(X, E) which consists of a set of vertices , and a set of hyperedges E = . The hypergraph can be constructed using the original data of X to record locally geometrical structure of the original data; , if , otherwise. More generally, I can also be defined in a probabilistic way (Huang et al. 2010), such that denotes the Laplacian matrix defined as is the identity matrix); is the weight matrix of a hyergraph to measure how close each pair of two vertices in the manifold space are, which can be defined as follows (with normalization): where and respectively denote the diagonal matrices of hyperedge degrees and vertex degrees, and denotes a diagonal matrix of the hyperedge weights. By minimizing R(S), the incidence matrix I can be well reconstructed and the consistency of the hypergraph structure on sparse code space S is guaranteed.

SPL Selection on Different Levels

It is seen that SPSC in (3) is implementing sample-level selection during the learning process, where the sample can be an image instance, a text, or a user in various tasks. Accordingly, we note the SPSC in (3) as Sample-wise SPSC (SPSC). Usually, a sample x is represented by a multidimensional vector with each element as a partial description of x. As in in (3), each element is treated the same, that is, equal confidence for model learning. However, due to the complexity and noisy nature of real data, the confidence levels of the specific element in the same sample also vary (eg., in image denoising task, noisy pixels should have lower confidence level comparing to the uncorrupted ones in the same image). Also the con-fidence levels of features could be also different (eg., features extracted with different approaches for a set of images should corresponds to different confidence level), We then show how to incorporate the easy-to-hard strategy operated on elements and features into the learning process of SC.

Element-wise SPSC When operating SPSC on elementlevel learning, the loss for each element and could be defined as . Then the learning objective of Element-wise SPSC (SPSC) becomes the following formulation:

Where V represents the matrix composing of which denotes the weights imposed on the observed elements of X;

‘’ denotes element-wise multiplication, that is,

Here denotes the j-th row vector of dictionary B. Thus, is the self-paced regularizer determining the elements to be selected in learning. To defines , the SPL regularizer in (6) can be easily generalized as

(10) Accordingly, the optimal solution of V can be obtained by

Feature-wise SPSC When operating SPSC on featurelevel learning, the learning objective of Feature-wise SPSC (SPSC) is as the following optimization:

s.t. (12) where denotes the weight vector imposed on different features in X; the loss of each feature corresponds to , which measures the reconstruction error of each feature and represents the j-th row vector in X, that is, a feature value across all samples. The SPL regularizer and optimal weight canaliculation in (6) and (7) can be easily adopted to SPSC, as the formulations in (12) and (3) are mathematically equivalent to each other.

Optimization of SPSC

Optimization Algorithm Comparing the proposed three formulations of SPSC in (3), (9) and (12), SPSCin (9) can be seen as a general form of these three. Thus, in the following we will show how to optimize (9). The optimization problem is not jointly convex, so we will use alternative search strategy (ASS) to solve it, suggested in (Kumar, Packer, and Koller 2010). Following ASS, we can alternatively optimizes V, B, S while keeping the other set of variables fixed. (i) Solve for V : With fixed B and S, optimization of selection weight matrix V can be solved under (11). (ii) Solve for B: With fixed codes S and weight matrix V , the optimization problem for dictionary B writes:

(13) This problem is a Quadratically Constrained Quadratic Program (QCQP) which can be solved using Lagrangian dual (Lee et al. 2006). (iii) Solve for S: With fixed dictionary B and weight matrix V , combing (9) and (8), the optimization problem for

codes S corresponds to the weighted SC:

(14) This problem is a -regularized convex optimization and off-the-shelf algorithms can be employed for solving it, such as feature-sign algorithm in (Gao, Tsang, and Chia 2013). The whole process is summarized in Algorithm 1. The al-

gorithm converges when all samples have been considered and the objective function cannot be improved further. It is seen from the algorithm that the number of samples selected at each iteration is determined by a weight that will be gradually annealed such that later iterations introduce more elements. For optimization of SPSCor SPSC, in Step 6, it can be easily solved by respectively calculating sample- wise loss or Feature-wise loss defined above. For weight update in in Step 3, all weights of elements are set equal to the weights of corresponding sample or feature, i.e. or . During each iteration, an convex minimization problem is solved for each V-step, B-step and Sstep, and the global optimum solution is obtained for each sub-step. Thus the overall objective is non-increasing with the iterations and Algorithm 1 is guaranteed to convergence.

Experiment Results

In this section, we extensively evaluate the proposed approach when applied to image clustering task on two real datasets. Experimental results demonstrate the correctness and effectiveness of the proposed model.

Experiment Setup

Datasets Two widely used real-world image dataset benchmarks are considered in the experiments1.

• COIL20. The COIL20 image library contains 1, 440 images of 20 objects, with 72 images per object. The size of each image is , with 256 grey levels per pixel.

Images of the objects were taken at angle intervals of . Each image is represented by a 1024-dimensional vector. • CMU-PIE. The CMU-PIE face database contains 41, 368 facial images of 68 subjects in various angles, facial expressions, and lighting conditions. The size of each image is , with 256 grey levels per pixel. Thus, each image is represented by a 1024-dimensional vector. In our experiments, we fix the angle and the facial expression to select 21 images under different lighting conditions for each subject, totaling 1, 428 images.

Baselines We compared the proposed approach with several state-of-the-art methods.

• Original and SC. The “original” method is to cluster image objects in the original data space (denoted as ’OS’). Sparse coding (Lee et al. 2006) is the basic sparse coding method without any regularization except sparsity constraints.

• LSC, LSCand CSC. (LSC) (Zheng et al. 2011) and (LSC) (Gao, Tsang, and Chia 2013) add a graph/hyperGraph Laplacian regularization term to the original sparse coding framework, while CSC (Feng, Wu, and Zhou 2017) (hypergraph consistent sparse coding) integrates a hypergraph incidence consistency regularization term.

• SPSC, SPSCand SPSC. These are proposed methods as in (3), (9) and (12) respectively.

Experimental Setting We use k-means clustering to group image datasets and compare the clustering results with two commonly used metrics, i.e., clustering accuracy (ACC) and normalized mutual information (NMI). Both ACC and NMI range from 0 to 1, while ACC reveals the clustering accuracy (also called purity) and NMI indicates whether the different clustering sets are identical (NMI = 1) or independent (NMI = 0). To ensure stability of results, the direct k-means clustering on the sparse code space is respectively repeated 30 times, each time with a random set of initial cluster centers. The average ACC and NMI for different methods over these 30 repetitions on each dataset will be reported. For each dataset, we normalize each image vector into a unit -norm as the input of comared sparse coding algorithms. The dictionary size r of all these models is set to be 128, since several recent works on sparse coding have advocated the use of overcomplete representations for images (Olshausen and others 1996; Zheng et al. 2011). Thus, after sparse coding, each image is represented as a 128-dimension vector, and will be used as the input of k-means clustering (k-means). The graph in LSCand hypergraph in CSC and SPSC are constructed in the original feature space after unitnorm normalization. Specifically, The k-nearest neighborbased graph (or hypergraph) (with k = 3 by Euclidean distance) is used to characterize the intrinsic manifold with a binary weighting scheme as suggested in (Zheng et al. 2011; Gao, Tsang, and Chia 2013). The parameters are tuned as follows. For SC, the sparsity weight is tuned in the range of [0.005, 0.01, 0.02, 0.04, 0.08] to get the best performance (i.e., the highest average of ACC in 30 runs), and the

Table 1: Clustering performance on COIL20 dataset using different methods

Table 2: Clustering performance on MNIST dataset using different methods

best value is used for all sparse coding framework-based methods. For LSC, the Laplacian regularization weight is tuned in the range of [0.5, 1, 2, 4, 8, 16, 32, 64]. For CSC, we respectively fix the best values of and , and tune . For SPSC, we use the best and of CSC; and we initialize to a value when 50% of the elements (samples, features) are selected in the first iteration and set stepsize .

To test the robustness of proposed SPSC to corruption using COIL20 dataset. For each image x with pixel value representation, we add white Gaussian noise according to , where n is the noise following a normal distribution with zero average and is set to be the standard deviation over entries in data matrix is a corruption ratio which increase from 0.2 to 1.0 in intervals of 0.2. Clearly, white Gaussian noise is additive. Figure. 2 shows some samples on COIL20 dataset.

Figure 1: COIL20 images with corruptions by Gaussian noise.

Results and Discussion Tables 1 and 2 report the performance of compared methods on clustering. They show the following points. 1) All three variants of proposed SPSC perform better than other baselines on both ACC and NIM, the power of self-paced

learning is demonstrated. 2) Among three variants of proposed SPSC, element-wise SPSC (SPSC) is more robust than sample-wise SPSC and sample-wise, especially when the noise level is larger. We think that this is because the element-wise SPSC is implementing on the dynamic selection on the smallest granularity. 3) When the noise is on a low level (), the clustering performance will be not severely affected. and however when the noise level is high (), the clustering performance will sharply decrease (we implement experiments for but don’t show it due to the lower accuracy).

Discussion on the Weight Learning In the experiment, we record the weight matrix V in each iteration in Algorithm 1. To show the process of dynamic selection of samples, we plot the correlation between noise level (the mean absolute value of noise over each sample) and the corresponding weight in each iteration, as in Figure. 2. We can see that, the values in the weight matrix are

Figure 2: The correlation between noise level and weight in each run.

becoming larger after each iteration and they keep a high value after about 15 iterations, following the mechanism of self-paced learning that more elements are selected after each iteration of the optimization process. Also, there is a negative correlation between noise level and the corresponding weight in the beginning of iterations, means the algorithm of SPSC is learning samples with lower noise corruption (easier samples) first.

Conclusion

Under the framework of alternative optimization of sparse coding, the non-convexity usually drive the optimization approach get stuck into bad local minima, especially when many noises and outliers exist in the datasets. To relieve the affect of non-convexity, we propose a unified framework named Self-Paced Sparse Coding (SPSC) by incorporating Self-paced learning methodology with Sparse coding as well as manifold regularization. For different levels of dynamic selection, we extends SPSC into three variants, that is Sample-wise SPSC, Feature-wise SPSC and Element-wise SPSC. The effectiveness of proposed SPSC was demonstrated by experiments on real image dataset and it with noise. The proposed method shows its advantage over current SC methods on more accurately approximating the ground truth clustering, and we will evaluate it on other tasks, such as classification and recommendation.

References

[Basu and Christensen 2013] Basu, S., and Christensen, J. 2013. Teaching classification boundaries to humans. In AAAI, 109–115.

[Bengio et al. 2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML, 41–48. ACM.

[Dong et al. 2015] Dong, W.; Li, X.; Ma, Y.; and Shi, G. 2015. Image restoration via bayesian structured sparse coding. In ICIP, 4018–4022.

[Feng et al. 2016] Feng, X.; Sharma, A.; Srivastava, J.; Wu, S.; and Tang, Z. 2016. Social network regularized sparse linear model for top-n recommendation. EAAI 51(C):5–15.

[Feng, Wu, and Zhou 2017] Feng, X.; Wu, S.; and Zhou, W. 2017. Multi-hypergraph consistent sparse coding. ACM TIST 8(6):75.

[Gao, Tsang, and Chia 2013] Gao, S.; Tsang, I.; and Chia, L. 2013. Laplacian sparse coding, hypergraph laplacian sparse coding, and applications. IEEE TPAMI 35:92–104.

[Huang et al. 2010] Huang, Y.; Liu, Q.; Zhang, S.; and Metaxas, D. N. 2010. Image retrieval via probabilistic hypergraph ranking. In CVPR, 3376–3383.

[Jiang et al. 2015] Jiang, L.; Meng, D.; Zhao, Q.; Shan, S.; and Hauptmann, A. G. 2015. Self-paced curriculum learning. In AAI, 2694–2700.

[Jing et al. 2014] Jing, X. Y.; Hu, R. M.; Wu, F.; Chen, X. L.; Liu, Q.; and Yao, Y. F. 2014. Uncorrelated multi-view discrimination dictionary learning for recognition. In AAAI, 2787–2795.

[Kim et al. 2007] Kim, S. J.; Koh, K.; Lustig, M.; Boyd, S.; and Gorinevsky, D. 2007. An interior-point method for large-scale l 1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing 1(4):606–617.

[Kumar, Packer, and Koller 2010] Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NPIS, 1189–1197.

[Lee et al. 2006] Lee, H.; Battle, A.; Raina, R.; and Ng, A. Y. 2006. Efficient sparse coding algorithms. NIPS 801–808.

[Lee et al. 2007] Lee, H.; Battle, A.; Raina, R.; and Ng, A. Y. 2007. Efficient sparse coding algorithms. In NPIS, 801–808.

[Li et al. 2015] Li, P.; Bing, L.; Lam, W.; Li, H.; and Liao, Y. 2015. Reader-aware multi-document summarization via sparse coding. In IJCAI, 1270–1276.

[Li et al. 2017a] Li, B.; Yuan, C.; Xiong, W.; Hu, W.; Peng, H.; Ding, X.; and Maybank, S. 2017a. Multi-view multiinstance learning based on joint sparse representation and multi-view dictionary learning. IEEE TPAMI PP(99):1–1.

[Li et al. 2017b] Li, C.; Yan, J.; Wei, F.; Dong, W.; Liu, Q.; and Zha, H. 2017b. Self-paced multi-task learning. In AAAI, 2175–2181.

[Lin, Chen, and Ma 2010] Lin, Z.; Chen, M.; and Ma, Y. 2010. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.

[Lu et al. 2015] Lu, Z.; Gao, X.; Wang, L.; Wen, J.; and Huang, S. 2015. Noise-robust semi-supervised learning by large-scale sparse coding. In AAAI, 2828–2834.

[M. Yang and Feng 2011] M. Yang, D. Z., and Feng, X. 2011. Fisher discrimination dictionary learning for sparse representation. In ICCV, 543–550.

[Olshausen and others 1996] Olshausen, B. A., et al. 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607– 609.

[Sun et al. 2014] Sun, X.; Qu, Q.; Nasrabadi, N. M.; and Tran, T. D. 2014. Structured priors for sparse-representation-based hyperspectral image classification. IEEE Geoscience & Remote Sensing Letters 11(7):1235– 1239.

[Tang, Yang, and Gao 2012] Tang, Y.; Yang, Y.-B.; and Gao, Y. 2012. Self-paced dictionary learning for image classifi-cation. In ACM MM, 833–836. ACM.

[Wright et al. 2009] Wright, J.; Yang, A. Y.; Ganesh, A.; Sas- try, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE TPAMI 31:210–227.

[Wu and Lange 2008] Wu, T. T., and Lange, K. 2008. Co- ordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics 224–244.

[Xu et al. 2016] Xu, D.; Song, J.; Alameda-Pineda, X.; Ricci, E.; and Sebe, N. 2016. Multi-paced dictionary learning for cross-domain retrieval and recognition. In ICPR, 3228–3233.

[Xu, Tao, and Xu 2015] Xu, C.; Tao, D.; and Xu, C. 2015. Multi-view self-paced learning for clustering. In IJCAI, 3974–3980.

[Yan and Wang 2009] Yan, S., and Wang, H. 2009. Semi- supervised learning by sparse representation. In SDM, 792– 801. SIAM.

[Yang et al. 2009] Yang, J.; Yu, K.; Gong, Y.; and Huang, T. 2009. Linear spatial pyramid matching using sparse coding for image classification. 1794–1801.

[Yang et al. 2010] Yang, A. Y.; Sastry, S. S.; Ganesh, A.; and Ma, Y. 2010. Fast -minimization algorithms and an application in robust face recognition: A review. In ICIP, 1849– 1852.

[Yang et al. 2014] Yang, Y.; Yang, J.; Yang, J.; Wang, J.; Chang, S.; and Huang, T. S. 2014. Data clustering by laplacian regularized -graph. In AAAI, 3148–3149.

[Z. Jiang and Davis 2012] Z. Jiang, G. Z., and Davis, L. S. 2012. Submodular dictionary learning for sparse coding. In CVPR, 3418–3425.

[Z. Jiang and Davis 2013] Z. Jiang, Z. L., and Davis, L. S. 2013. Label consistent k-svd: learning a discriminative dictionary for recognition. IEEE TPMAI 35:2651–2664.

[Zhang and Li 2010] Zhang, Q., and Li, B. 2010. Discrim- inative k-svd for dictionary learning in face recognition. In CVPR, 2691–2698.

[Zhao et al. 2015] Zhao, Q.; Meng, D.; Jiang, L.; Xie, Q.;

Xu, Z.; and Hauptmann, A. G. 2015. Self-paced learning for matrix factorization. In AAAI, 3196–3202.

[Zhen et al. 2016] Zhen, L.; Peng, D.; Yi, Z.; Xiang, Y.; and Chen, P. 2016. Underdetermined blind source separation using sparse coding. IEEE TNNLS PP(99):1–7.

[Zheng and Tao 2015] Zheng, H., and Tao, D. 2015. Dis- criminative dictionary learning via fisher discrimination KSVD algorithm. Neurocomputing 162:9–15.

[Zheng et al. 2011] Zheng, M.; Bu, J.; Chen, C.; Wang, C.; Zhang, L.; Qiu, G.; and Cai, D. 2011. Graph regularized sparse coding for image representation. IEEE TIP 20:1327– 1336.

designed for accessibility and to further open science