Classification is one of the fundamental machine learning tasks. In the era of big data, huge amounts of high-dimensional data become common in a wide variety of applications. As the dimensionality of the data for real-life classification tasks has increased dramatically in recent years, it is essential to develop accurate and efficient classi-fication algorithms for high-dimensional data.
One of the most popular methods to address the high-dimensionality challenge is dimensionality reduction. It allows the classification problems to be efficiently solved in a lower dimensional space. Popular dimensionality reduction methods includes Principal Component Analysis (PCA) (Jolliffe 2011), Linear Discriminant Analysis (LDA) (Mika et al. 1999) etc. However, these traditional dimensionality reduction methods are computationally expensive which makes them not suitable for big data. Recently, random projection (Bingham and Mannila 2001) that projects a high-dimensional feature vector into a low-dimensional feature
(
) using a random orthonormal matrix
. Namely,
. Since random projection methods are (1) computationally efficient while maintaining accurate results; (2) simple to implement in practice; and (3) easy to analyze (Mahoney and others 2011; Xu et al. 2017), they have attracted significant research attentions in recent years.
The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma) (Johnson, Lindenstrauss, and Schechtman 1986). JL lemma proves that in Euclidean space, high-dimensional data can be randomly projected into lower dimensional space while the pairwise distances between all data points are preserved with a high probability. Based on the JL lemma, several different ways are proposed for constructing the random matrix R: (1) Gaussian Matrix (Dasgupta and Gupta 1999): each entry in R is generated from a Gaussian distribution with mean equals to 0 and variance equals to ; (2) Achlioptas ma- trix (Achlioptas 2003): each entry in R is generated from
, 0, 1} from a discrete distribution. This method generates a sparse random matrix R. Therefore it requires less memory and computation cost; (3) sparse embedding matrix (or called count sketch matrix) (Clarkson and Woodruff 2017) and its similar variant feature hashing (Weinberger et al. 2009; Freksen, Kamma, and Larsen 2018): for each row R, only one column is randomly selected and assign either 1 or
with probability 0.5 to this entry; and (4) Subsampled Randomized Hadamard Transform (SRHT) (Tropp 2011): It uses a highly structured matrix R for random projection. The procedure of SRHT can be summarized as follows: It first rotates the input matrix
(n is the number of samples, d is the dimensionality of the data) by Randomized Walsh-Hadamard Transform and then uniformly sampling r columns from the rotated matrix. More details of SRHT will be discussed in next section.
Although random projection methods have attracted a great deal of research attention in recent years (Sarlos 2006; Choromanski, Rowland, and Weller 2017) and have been studied for regression (Lu et al. 2013), classification (Paul et al. 2013; Zhang et al. 2013; Paul, Magdon-Ismail, and Drineas 2015) and clustering (Boutsidis, Zouzias, and Drineas 2010; Liu, Shen, and Tsang 2017), most of the existing studies focused on data-independent projection. In other words, the random matrix R was constructed without considering any specific properties of a given dataset.
In this paper, we focus on SRHT for random projection and study how it affects the classification performance of linear Support Vector Machines (SVM). SRHT achieve dimensionality reduction by uniformly sampling r columns from a rotated version of the input data matrix X. Even though Randomized Walsh-Hadamard transform in SRHT tends to equalize column norms (Boutsidis and Gittens 2013), we argue that the importance of different columns in the rotated data matrix are not equal. Therefore, the uniformly random sampling procedure in SRHT would result in low accuracy when used as a dimensionality reduction method in high-dimensional data classification.
To overcome the limitation of SRHT, we propose to produce effective low-dimensional embedding by using non-uniform sampling instead of uniform sampling. To achieve this goal, we first analyze the effect of using SRHT for random projection in linear SVM classification. Based on our analysis, we have proposed importance sampling and deterministic top-r sampling methods to improve SRHT. Furthermore, we also propose a new sampling method by incorporating label information. It samples the columns which can achieve optimal inter-class separability along with the intra-class compactness of the data samples.
Finally, we performed experiments to evaluate our proposed methods on six real-life datasets. Our experimental results clearly demonstrate that our proposed method can obtain higher classification accuracies than SRHT and other three popular random projection methods while only slightly increasing the running time.
Random Projection Given a data matrix , random projection methods reduce the dimensionality of X by multiplying it by a random orthonormal matrix
with parameter
. The projected data in the low-dimensional space is
Note that the matrix R is randomly generated and is independent of the input data X. The theoretical foundation of random projection is the Johnson-Lindenstrauss Lemma (JL lemma) as shown in following,
Lemma 1 (Johnson-Lindenstrauss Lemma (JL lemma) (Johnson, Lindenstrauss, and Schechtman 1986)). For any and any integer n, let
and
be a random orthonormal matrix. Then for any set X of n points in
, the following inequality about pair-wise distance between any two data points
and
in X holds true with high probability:
≤ ∥
JL lemma proves that in Euclidean space, high-dimensional data can be randomly projected into lower dimensional space while the pairwise distances between all data points are well preserved with high probability. There are several different ways to construct the random projection matrix R. In this paper, we focus on SRHT which uses a structured orthonormal matrix for random projection. The details about SRHT are as follows.
Subsampled Randomized Hadamard Transform (SRHT) For where q is any positive integer1 , SRHT defines a
matrix as:
• is a subset of randomly sampling r columns from the
identity matrix. The purpose of multiplying S is to uniformly sample r columns from the rotated data matrix
. There are several advantages of using SRHT for random
projection. Firstly, we do not need to explicitly represent H
in memory and only need to store a diagonal random sign matrix D and a sampling matrix S with r non-zero values. Therefore, the memory cost of storing R is only O(d + r). Secondly, due to the recursive structure of Hadamard matrix H, matrix multiplication XR only takes O(nd log(d)) time by using Fast Fourier Transform (FFT) for matrix multiplication (Tropp 2011). Ailon and Liberty (Ailon and Liberty 2009) further improve the time complexity of SRHT to O(nd log(r)) if only r columns in are needed.
Limitation of Uniform Sampling in SRHT
Despite the advantages of SRHT as mentioned in the previous section, one limitation of SRHT is that the random matrix R is constructed without considering any specific properties of input data X. Based on the definition in (2), SRHT works by first rotating the input data matrix X by multiplying it with DH and then uniformly random sampling r columns from the rotated matrix . Since this uniform column sampling method does not consider the underlying data distribution of
, it would result in low and unstable accuracy when used as a dimensionality reduction method in a specific classification problem. To illustrate the limitation of uniformly random sampling step in SRHT, we generate a simple two-dimensional synthetic data
Figure 1: Limitation of uniform sampling
as shown in Figure 1(a). The first class (i.e., class label ‘+’) was generated from
The second class (i.e., class label ‘’) was generated from
ing with matrix DH, the original data was rotated into new feature space as shown in Figure 1(b). In this simple illustration example, SRHT might choose the feature that can not distinguish these two classes (i.e, the vertical feature in Figure 1(b)) and then leads to low classification accuracy. In Figure 1(c), we shown the classification accuracy of using SRHT on a real-life dataset for 15 different runs. It clearly shows that this data-independent random projection method could result in low and unstable performance on this synthetic dataset. In the following sections, we will describe our proposed methods to construct the data-dependent sampling matrix S in (2) by exploiting the underlying data properties from both unsupervised and supervised perspectives.
Improved SRHT by unsupervised non-uniform sampling
First, we present our proposed non-uniform sampling methods from unsupervised perspective. In this paper, we study SRHT in the context of using linear SVM for classification. Assume we are given a training data set , where
is a d-dimensional input feature vector,
is its corresponding class label. The dual problem of SVM can be written in matrix format as shown in following,
where is the input feature matrix, and Y is a diagonal matrix where the elements in the diagonal are class labels,
is the regularization parameter.
By applying SRHT to project original X from d-dimensional space into r-dimensional space, i.e., , the SVM optimization problem on projected
Comparing the dual problem (3) in the original space and the dual problem (4) in the projected low-dimensional space, the only difference is that in (3) is replaced by
in (4). The constraints on optimal solutions (i.e.,
of (3) and
of (4) ) do not depend on input data matrix. Therefore,
is expected to be close to
when
is close to
.
Let us rewrite where
, we
can view the effect of is to uniformly sample r columns from
and then re-scale each selected column by a scaling factor
. It corresponds to use uniform column sampling for approximating matrix multiplication (Mahoney and others 2011). It has been proved in (Drineas, Kannan, and Mahoney 2006) that (i, j)-th element in
equals to the (i, j)-th element of exact product
in expectation regardless of the sampling probabilities. However, the variance of the (i, j)-th element in
depends on the choice of sampling probabilities. Therefore, simply using uniform sampling would result in large variance of each element in
and then cause a large approximation error between
and
.Motivated by this observation, we first propose to improve SRHT by employing importance sampling for randomized matrix multiplication. Specifically, we would like to design a method to construct
based on specific data properties of
instead of using random sampling matrix S in original SRHT. The idea is based on importance sampling for approximating matrix multiplication: (1) we assign each column i in
a probability
such that
and
; (2) we then select r columns from
based on the probability distribution
to form
; (3) a col- umn in
is formed as
if this column is chosen from the i-th column in
where
factor. In other word, the matrix is constructed as: for every column j in
only one entry i (i is a random number from {1, . . . , d}) is selected by following the probability distribution
and a non-zero value is assigned to this entry. The non-zero value is set as
In the uniform sampling setting, , therefore
which is consistent with the setting in original SRHT as shown in (2).
Now, the question is what choice of probability can optimally reduce the expected approximation error between
and
. We use Frobenius norm to measure the approximation error between
and
.Let us use
to denote the j-th column of
, as shown in the Proposition 1, the optimal sampling probability to minimize
is
Proposition 1. Suppose and
where
is defined as in (5). Then, the optimal sampling probability to minimize the expectation of the Frobenius norm of the approximation error between
and
, i.e.,
, is
.
To prove Proposition 1, we first use the fact since
. Then, we apply the Lemma 4 from (Drineas, Kannan, and Mahoney 2006) to obtain the optimal sampling probability.
In addition, we also propose a deterministic top-r sampling, which select the r columns with largest Euclidean norms from . As shown in the Proposition 2, this deterministic top-r sampling method directly optimizes an upper bound of
without using re-scaling factor. Proposition 2. Suppose
and
where
is a diagonal matrix whose elements are 1 or 0.
denotes the i-th column in
is selected. Then,
is an upper bound of the approximation error
.
Proof. Since and
,
. Let us rewrite the matrix product
as the sum of d rank one matri- ces
. Similarly,
can be rewritten as
. Therefore, the approximation error
can be upper bounded as follows,
In our experimental section, we have shown that this deterministic top-r sampling obtains better classification accuracy than importance sampling method. Adelman and Silberstein(Adelman and Silberstein 2018) also observed that deterministic top-r sampling gets better results when they use randomized matrix multiplication approximation to speedup the training of deep neural networks.
Improved SRHT by Supervised Non-uniform Sampling
In this section, we describe our proposed non-uniform sampling method from supervised perspective. We propose to construct the sampling matrix by incorporating label information. Our proposed method is based on the idea of metric learning (Xing et al. 2003). The idea is to select r columns from
for improving the inter-class separability along with the intra-class compactness as used in (Lan et al. 2019).
Let be new feature representation of the i-th data sample in
. Then, we measure the interclass separability by the sum of squared pair-wise distances between
’s from different classes as
. Similarly, we mea- sure the intra-class tightness by the sum of squared pair-wise distances between
’s from the same class as
. We formulate our objective as a combination of these two terms,
where if
, and
if
, parameter a is used to balance the tradeoff between inter-class separability and intra-class tightness, L is the Laplacian matrix of A, defined as
, and D is a diagonal degree matrix of A such that
is the sam- pling matrix whose elements are 0 or 1. We want to learn
by minimizing (8).
Without losing any information, let us rewrite our variable as a
diagonal matrix P where the diagonal element
means the i-th column in
is selected and
means the i-th column is not selected. Then
can be viewed as a submatrix of P that selects the columns with
from P. It is straight-forward to verify that
in (8) equals toP. Therefore, our optimization is formulated as
Let vector b denote the diagonal of the matrix , i.e
, vector p denote the diagonal of P.
Then, the objective (9) is simplified to
The optimal solution of (10) is just to select the largest r elements in and set the corresponding entries in
to 1. Then
can easily obtained based on non-zero values in p.
Algorithm Implementation and Analysis
We summarize our proposed algorithm in algorithm 1 and name it as Improved Subsampled Randomized Hadamard Transform (ISRHT). With respect to training time complexity, step 2 needs O(ndlog(d)) time by using FFT. In step 3, for unsupervised column selection, we need O(nd) time to compute the Euclidean norm for each column. For supervised column selection, in (10) are computed as
and the Laplacian matrix L is equal to
, therefore, the required time is still O(nd). Step 4 needs O(r) time and step 5 needs O(tnr) where t is the number of iterations needed for training linear SVM. Therefore, the overall time complexity of our proposed algorithm is O(ndlog(d) + nd + tnr). With respect to prediction time complexity, our proposed method needs O(d log(d) + r) time to make a prediction for a new test sample. With respect to space complexity, D is a diagonal matrix, matrix
is a sampling matrix with r non-zero entries. Therefore, the required space for storing the projection matrix R of our proposed methods is O(d + r). Our proposed methods achieve very efficient time and space complexities for model deployment. We compare the time and space complexity of different random projection methods in Table 1.
Table 1: Comparison of different random projection methods
Extension to High-dimensional Sparse Data
In this section, we describe an extension of our algorithm on high-dimensional sparse data. One disadvantage of applying Hadamard transform on sparse data is that it produces a dense matrix. As shown in the step 2 in Algorithm 1, the memory cost is increased from O(nnz(X)) to O(nd). To tackle this challenge, we use a combination of sparse embedding and SRHT as proposed in (Chen et al. 2015) for memory-efficient projection. We improve their work by replacing the original SRHT by our proposed ISRHT methods. We expect that our proposed data-dependent projection methods can obtain higher accuracies than data-independent projection methods on high-dimensional sparse data.
Specifically, we first use sparse embedding matrix to project X into
-dimensional space and then use our improved SRHT methods to project into r dimensional space. The sparse embedding matrix
is generated as follows, for each row in random matrix
, randomly selected one column and assign either 1 or
with probability 0.5 to this entry. All other entries are 0s. Due to the sparse structure of
, matrix multiplication
takes O(nnz(X)) time. By doing this, the time complexity of the proposed method on high-dimensional sparse data was reduced to O(nnz(X) +
and the memory cost was reduced to
. In our experimental setting, we set
2r as suggested in (Chen et al. 2015).
In this section, we compare our proposed methods with other four popular random projection methods on six real-world benchmark datasets. These benchmark datasets are downloaded from LIBSVM website 2. For datasets (mushrooms,real-sim, rcv1-binary, news20-binary), we refer to the setting in (Pourkamali-Anaraki, Becker, and Wakin 2018) and randomly select 70% of the original training data as training data and the rest 30% as test data. The detailed information about these six datasets is summarized in in Table 3.
We evaluate the performance of the following seven random projection algorithms in our experiments:
• Gaussian: random projection by Gaussian Matrix (Dasgupta and Gupta 1999);
Table 2: The classification accuracy and running time (in seconds) (The best results for each dataset are in bold and italic and the best results among the seven random projection methods are in bold).
Table 3: Experiment dataset
• Achlioptas: random projection by Achlioptas matrix (Achlioptas 2003);
• Sparse Embedding: random projection by count sketch matrix (Clarkson and Woodruff 2017);
• SRHT: original subsampled randomized Hadamard transform with uniform sampling (Tropp 2011);
• ISRHT-nps: our proposed method for ISRHT with normproportional sampling as defined in (6);
• ISRHT-top-r: our proposed deterministic sampling method for ISRHT which select r columns with largest Euclidean norms;
• ISRHT-supervised: our proposed method by incorporating label information as defined in (10);
We also include the results of using all features and two other popular data-dependent dimensionality reduction methods: PCA and Deterministic Leverage Score Sampling (DLLS) (Papailiopoulos, Kyrillidis, and Boutsidis 2014) in
Table 2. Due to high computational complexity of Singular Value Decomposition (SVD) in PCA and leverage score sampling, we employed randomized SVD (Halko, Martinsson, and Tropp 2011) in our experiments.
Experimental setup. The feature values for all data sets are linearly scaled to [, 1]. For dense datasets (mushrooms, usps-b and gisette), we directly apply our proposed ISRHT methods on them. For the highdimefnsional sparse datasets (real-sim, rcv1-binary and news20-binary), we use an extension of our proposed methods that combines of sparse embedding and ISRHT for memory efficient projection as discussed in previous section. The regularization parameter C in linear SVM is chosen from
by 5-fold cross validation. The tradeoff parameter a for ISRHT-supervised is fixed to 1.0. Our experiments are performed on a server with Dual 6-core Intel Xeon 2.4GHz CPU and 128 GB RAM.
Experimental Results. The accuracy and training time of all algorithms are reported in Table 2. ’-’ is used to denote that PCA cannot be completed because of space complexity. The number of reduced dimension r is shown in the first column of the table. We also explore the impact of parameter r in Figure 2 and will be discussed later. The reported accuracies are the averaged accuracies on test data based on 15 repetitions. As shown in Table 2, our proposed ISRHT-(top-k) method achieves significant higher accuracies than other four data-independent random projection methods (i.e., Gaussian, Achlioptas, Sparse Embedding and SRHT) on all six datasets. The ISRHT-nps gets slightly better results than SRHT. Furthermore, by incorporating the
Figure 2: classification accuracies of different algorithms with different r
label information, our proposed ISRHT-supervised method gets the best accuracy on all six datasets among the seven random projection methods. Using all features gets the best accuracy on all six datasets. PCA gets better accuracy than DLSS and other random projection methods while needs longer running time. We also observe that our proposed methods produce dense feature representation after random projection and therefore it needs longer running time than directly applying liblinear on high-dimensional sparse data.
With respect to the running time, all the random projection methods are more efficient than data-dependent dimension reduction method such as PCA and DLSS especially on large datasets. Among these random projection methods Gaussian and Achiloptas are much slower than Sparse Embedding and SRHT since they need O(ndr) time for projection. The running times of our proposed methods ISRHTnps and ISRHT-top-r are very close to SRHT. The results demonstrate that our proposed methods only slightly increase the running time since the computing norm can be done very efficiently.
Impact of parameter r. We evaluate the impact of parameter r in our proposed algorithms. We show the accuracies of all the algorithms with respect to different r in Figure 2. We can observe that our proposed methods gets higher accuracy than other four methods. The accuracy improvement is large when the parameter r is set to relative small number. As expected, the difference will become smaller as the parameter r increases.
Stability Comparison Between ISRHT and SRHT. In this section, we would like to further investigate the stability of our proposed methods and SRHT with respect to differ-
Figure 3: Classification Accuracies of SRHT (x-axis) and ISRHT (y-axis) for different parameters.
ent choices of the regularization parameter C in SVM and reduced dimension r. We evaluate all methods for different parameter combinations of C and r. Regularization parameter C is chosen from and reduced dimension r is chosen from
.
Due to the space limitation, we only show the results of comparing the prediction accuracies of ISRHT-nps and ISRHT-top-r with SRHT on mushrooms dataset in Figure 3. The accuracies plotted in Figure 3 are based on the average of 15 repetitions. For each parameter combination, we plot its corresponding accuracy of SRHT on the x-axis and the accuracy of our proposed methods on y-axis. So the points above y = x line indicate an accuracy improvement of our proposed methods (y-axis) (ISRHT-nps or ISRHTtop-r) over SRHT(x-axis). Overall speaking, our proposed methods is more stable with different choices of parameter combinations.
In this paper, we propose to produce more effective low-dimensional embedding than original SRHT by using non-uniform sampling instead of uniform sampling in SRHT. To achieve this goal, we first analyze the effect of using SRHT for random projection in the context of linear SVM classification. Based on our analysis, we have proposed importance sampling and deterministic top-r sampling to improve the embedding. Secondly, we also propose a new sampling method to incorporate label information based on the idea of metric learning. We performed extensive experiments to evaluate our proposed non-uniform samplings methods. Our experimental results demonstrate that our proposed new methods can achieve better accuracy than original SRHT and other three popular random projection methods. Our results also demonstrate that our proposed method only slightly increase the running time but results in more effective embedding. In the future, we would like to extend our proposed ISRHT methods to nonlinear classifi-cation problems. Another interesting direction is to design data-dependent sparse embedding methods.
We would like to thank the anonymous reviewers for their insightful comments and valuable suggestions on our paper. This work was supported by NSFC 61906161.
Achlioptas, D. 2003. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences 66(4):671–687.
Adelman, M., and Silberstein, M. 2018. Faster neural network training with approximate tensor operations. arXiv preprint arXiv:1805.08079.
Ailon, N., and Liberty, E. 2009. Fast dimension reduction using rademacher series on dual bch codes. Discrete & Computational Geometry 42(4):615.
Bingham, E., and Mannila, H. 2001. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD, 245–250. ACM.
Boutsidis, C., and Gittens, A. 2013. Improved matrix algorithms via the subsampled randomized hadamard transform. SIAM Journal on Matrix Analysis and Applications 34(3):1301–1340.
Boutsidis, C.; Zouzias, A.; and Drineas, P. 2010. Random projections for k-means clustering. In NIPS, 298–306.
Chen, S.; Liu, Y.; Lyu, M. R.; King, I.; and Zhang, S. 2015. Fast relative-error approximation algorithm for ridge regression. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 201–210. AUAI Press.
Choromanski, K. M.; Rowland, M.; and Weller, A. 2017. The unreasonable effectiveness of structured random orthogonal embeddings. In NeurIPS, 219–228.
Clarkson, K. L., and Woodruff, D. P. 2017. Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM) 63(6):54.
Dasgupta, S., and Gupta, A. 1999. An elementary proof of the johnson-lindenstrauss lemma. International Computer Science Institute, Technical Report 22(1):1–5.
Drineas, P.; Kannan, R.; and Mahoney, M. W. 2006. Fast monte carlo algorithms for matrices i: Approximating matrix multiplication. SIAM Journal on Computing 36(1):132–157.
Freksen, C. B.; Kamma, L.; and Larsen, K. G. 2018. Fully understanding the hashing trick. In Advances in Neural Information Processing Systems, 5389–5399.
Halko, N.; Martinsson, P.-G.; and Tropp, J. A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53(2):217–288.
Johnson, W. B.; Lindenstrauss, J.; and Schechtman, G. 1986. Extensions of lipschitz maps into banach spaces. Israel Journal of Mathematics 54(2):129–138.
Jolliffe, I. 2011. Principal component analysis. Springer.
Lan, L.; Wang, Z.; Zhe, S.; Cheng, W.; Wang, J.; and Zhang, K. 2019. Scaling up kernel SVM on limited resources: A low-rank linearization approach. IEEE Trans. Neural Netw. Learning Syst. 30(2):369–378.
Liu, W.; Shen, X.; and Tsang, I. 2017. Sparse embedded k-means clustering. In NIPS, 3319–3327.
Lu, Y.; Dhillon, P.; Foster, D. P.; and Ungar, L. 2013. Faster ridge regression via the subsampled randomized hadamard transform. In NIPS, 369–377.
Mahoney, M. W., et al. 2011. Randomized algorithms for matrices and data. Foundations and Trends Rin Machine Learning 3(2):123–224.
Mika, S.; Ratsch, G.; Weston, J.; Scholkopf, B.; and Mullers, K.-R. 1999. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), 41–48. Ieee. Papailiopoulos, D.; Kyrillidis, A.; and Boutsidis, C. 2014. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 997–1006. ACM.
Paul, S.; Boutsidis, C.; Magdon-Ismail, M.; and Drineas, P. 2013. Random projections for support vector machines. In Artificial intelligence and statistics, 498–506.
Paul, S.; Magdon-Ismail, M.; and Drineas, P. 2015. Feature selection for linear svm with provable guarantees. In Artificial Intelligence and Statistics, 735–743.
Pourkamali-Anaraki, F.; Becker, S.; and Wakin, M. B. 2018. Randomized clustered nystrom for large-scale kernel machines. In AAAI’18.
Sarlos, T. 2006. Improved approximation algorithms for large matrices via random projections. In FOCS’06, 143–152. IEEE.
Tropp, J. A. 2011. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis 3(01n02):115–126.
Weinberger, K.; Dasgupta, A.; Langford, J.; Smola, A.; and Attenberg, J. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–1120. ACM.
Xing, E. P.; Jordan, M. I.; Russell, S. J.; and Ng, A. Y. 2003. Distance metric learning with application to clustering with sideinformation. In NIPS, 521–528.
Xu, Y.; Yang, H.; Zhang, L.; and Yang, T. 2017. Efficient nonoblivious randomized reduction for risk minimization with improved excess risk guarantee. In Thirty-First AAAI Conference on Artificial Intelligence.
Zhang, L.; Mahdavi, M.; Jin, R.; Yang, T.; and Zhu, S. 2013. Recovering the optimal solution by dual random projection. In Conference on Learning Theory, 135–157.