The kernel Principal Component Analysis (kPCA) is a very effective and popular technique to perform nonlinear dimensionality reduction Sch¨olkopf et al. (1998b). It is applied to a large variety of fields: image processing and signal denoising Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999), face recognition Wang (2012), credible real-time simulations in engineering applications Gonz´alez et al. (2018); Lopez et al. (2018); D´ıez et al. (2018), health and living matter sciences Shiokawa et al. (2018); Widjaja et al. (2012); Grassi et al. (2014)... kPCA is often presented giving the general ideas and avoiding the fundamental mathematical details. These details are soundly described only in few selected papers. In particular, when it comes to the forward mapping (or dimensionality reduction, see Izenman (2008)) and to the backward mapping or pre-image reconstruction, see Snyder et al. (2013).
This paper aims at providing a kPCA digest. That is, a condensed description with all the details necessary to understand and implement the method, alternative and complementary to the existing literature. Besides, motivated by different alternatives to achieve the backward mapping (or pre-image reconstruction) available in the literature Kwok and Tsang (2004); Mika et al. (1999); Wang (2012); Sch¨olkopf et al. (1998a); Rathi et al. (2006); Zheng et al. (2010), a new approach is proposed based on the minimization of a discrepancy functional (residual). This novel idea allows dealing with the very frequently used case of Gaussian kernel in a straightforward and efficient manner.
2.1 Diagonalizing the covariance matrix
Vectors in IR
are seen as
samples of some (possibly random) variable
IR
. It is assumed that
such that the sample is representative of the variability of x. The samples are collected in a
matrix X = [
].
PCA aims at discovering a linear model of dimension k, being , properly representing the variability of x. Thus, eliminating the intrinsic redundancy in the d dimensions of vector x.
The covariance matrix is defined as
In fact, the symmetric and positive definite matrix C stands for the covariance matrix of x if the mean value of x is zero. This is not a a loss of generality because it simply requires subtracting the mean value of x to all the columns in X.
Diagonalizing C results in finding a diagonal matrix of eigenvalues
0 and a unit matrix U such that
Matrix U describes an orthonormal basis in IR, the basis formed by its columns U = [
].
Vector is the expression of x in this new basis (note that x = Uz). Thus, matrix
is the corresponding matrix of samples of z. The covariance of z is
=
. Being the covariance of z diagonal, the d components of z are uncorrelated (linearly independent) random variables.
2.2 Reducing the dimension
The trace of C, which is equal to the trace of and therefore the sum of
, for i = 1, . . . , d, is the total variance of the vector sample. If eigenvalues decrease quickly, a reduced number of dimensions k is collecting a significant amount of
the variance. This is the case if, for some small tolerance is such that
Thus, eigenvalues from k + 1 to d are neglected, and consequently the last columns of matrix U (or the last rows of matrix
) are not expected to contribute to describe the variability of x. Accordingly, the last
components of vector z are suppressed without significant loss of information.
Thus, matrix
= [
], induces a new variable
in the reduced-dimension space. Thus, the samples in X map into =
of reduced dimension
.The backward mapping (from reduced-dimension space IR
to full dimension IR
) can be seen as a truncation of relation x = Uz, that is rewritten as
The reduced-order models is interpreted as taking x in the k-dimensional linear manifold (or subspace) generated by the basis . The error introduced in the reduction of dimensionality (from d to k) is associated with the discrepancy
and is decreasing as the tolerance
decreases (and k increases).
2.3 Singular Value Decomposition (SVD): an alternative to diagonalization.
The SVD provides a factorization of matrix X of the form
being U and V unit matrices of sizes and
, respectively, and
a
matrix. The singular values of
0 are the diagonal entries of
(recall
).
Note that the diagonalization of C is a direct consequence of the SVD
being . Thus, the eigenvalues of C are precisely the squared singular values of X, that is
, for i = 1, . . . , d.
2.4 Permuting d and ns.
The matrix equivalent to C taking
as input matrix (organizing data by rows instead of columns), is
It is denoted as Gramm matrix because it accounts for all the scalar products of the samples. This perspective of the problem aims at a reduction of the number of samples , while keeping the representativity of the family. This methodology is often named as Proper Orthogonal Decomposition (POD): it coincides with PCA, but aiming at reducing the number of samples
instead of the dimension d. However, the fact of changing X by
is not relevant when using SVD. Namely, the SVD (7) reads
=
And directly provides a diagonalization of the ) Gramm matrix
where the diagonal matrix
has the same nonzero entries
, for i = 1, . . . , d, as
.
The dimensionality reduction is performed exactly in the same way as described in section 2.2, allowing to reduce the dimension of the space of the samples from to k. That is, to describe with enough accuracy any of the
samples (or any other vector pertaining to the same family) as a linear combination of k vectors which are precisely the first k columns of matrix V.
2.5 The equivalence of diagonalizing C and G
When diagonalizing the covariance matrix C as indicated in (2), the columns of the transformation matrix U are precisely the eigenvectors of C, that is
The same happens with the Gramm matrix G and matrix V in (9), namely
being the i-th column of V. It is worth noting that for the for
, the columns
of matrix V correspond to the zero eigenvalue (they describe the kernel space of the matrix).
The fact that the SVD described in section 2.3 provides in one shot both U and V suggests that the computational effort provided in diagonalizing C (and therefore obtaining U, see (2)) is equivalent to the cost of diagonalizing G (and obtaining V). This is not obvious at the first sight, because the size of C is and the size of G is
, hence much larger.
This equivalence actually holds. Computing vectors , for i = 1, . . . , d, is equivalent to compute vectors
. This idea is one of the cornerstones of the kernel Principal Component Analysis (kPCA) method.
Recalling (1), isolating from the right-hand-side of (10), and using Appendix A, yields
where, assuming that U is available, the matrix B is introduced such that the entry with indices
, for
= 1
and j = 1, . . . , d, reads
Matrix B is computable and coincides precisely with the first d columns of matrix V, see Appendix B, that is
The reduced variable of dimension k is introduced in (4) . For the samples in the training set, that is for
= 1
, the i-th component of the reduced-dimensional samples read
for i = 1. That is, the compact expression for the construction of the
matrix
of samples in the reduced space reads
The backward mapping described in (6) is also written in terms of , namely
Note that, when compared with equation (6), in (17) the unknown X appears also in the right-hand-side when it has to be expressed in terms of (when
is not available, as it is the case in section 4). This is one of the difficulties to overcome in order to devise a backward mapping strategy for kPCA.
As shown above, PCA is a dimensionality reduction technique that, for some d-dimensional variable x, discovers linear manifolds of dimension where x lies (within some tolerance
). In many applications, reality is not as simple and the low-dimensional manifold characterizing the variable is nonlinear. In these cases PCA fails in identifying a possible dimensionality reduction. Several techniques allow achieving nonlinear dimensionality reduction, see Bishop (2006) and references therein. Among them, kernel Principal Component Analysis (kPCA) Sch¨olkopf et al. (1998b); Rathi et al. (2006) is appreciated for its simplicity and easy implementation.
3.1 Transformation to higher-dimension D
The kPCA conceptual idea is conceived by introducing an arbitrary transformation Φ from IRto IR
for some
. This transformation into a large dimensional space is expected to untangle (or to flatten) the nonlinear manifold where x belongs. That is, Φ and D are expected to be such that the variable x
manifold that is readily identified by the PCA. In other words, PCA is to be applied to the matrix containing the transformed samples
is applied; it diagonalizes the covariance matrix C
in section 2. This introduces two obvious difficulties: first, Φ is unknown and, second, the dimension D required to properly untangle the manifold is typically very large (in particular much larger than ) and consequently the computational effort to diagonalize C
(same as G). Moreover, as shown in section 2.4, diagonalizing G
dimension in the same way as diagonalizing C
3.2 The kernel trick
As noted above, the a priori unknown Φ is expected to map the nonlinear manifold into a linear one in a higher-dimensional space. In principle, it is not obvious to determine a proper Φ that aligns the samples into a linear subspace of IR.
of matrix G
for i, j = 1. The kernel idea is to introduce, instead of function Φ(
), some bivariate symmetric form
) such that
Comparing (20) and (21) reveals that selecting some ) is, for all practical purposes, equivalent to selecting some Φ(
). A typical choice for the kernel is the Gaussian that reads
Any other choice is valid provided that the matrix G
is symmetric and positive definite (recall that it must be a Gramm matrix). Many other alternative kernels are proposed in the literature Sch¨olkopf et al. (1998b); Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999). The idea is to select the kernel providing the best dimensionality reduction (the lower value of k).
3.3 Centering the kernel
As mentioned above, the samples have to be centered to properly use PCA. Note however that selecting some arbitrary Φ() does not guarantee that the transformed sample remains centered. However, if Φ(
) is available, the operation to center X
a matrix G
In the following, the centered magnitudes are denoted with an over bar, like x
in (23). Thus, the generic term of the centered Gramm matrix, analogously to (20), reads [G
or
Recalling (21), this matrix results as
being IR
the
matrix having all its entries equal to one. Note that expression (24) provides the centered Gramm matrix G
algebraic manipulation of matrix G
being g
useful in the following, as a vector transformation.
3.4 Diagonalization and dimensionality reduction
Recall that PCA is to be applied to the transformed sample X
is not available: only G
idea is to use POD (diagonalize G
idea of section 2.5. The diagonalization of G
The eigenvalues of
, 0, are expected to decreasesuch that the first k (with
min(
)) collect a significant amount of the full variance, with a criterion associated with some small tolerance
, see (3). Accordingly, the reduced version of V
columns of V
shown is section 2.5. That is, the matrix of samples in the reduced space
is computed as
This is equivalent to compute each vector (the j-th sample
mapped into the reduced space) as
which is precisely the forward mapping into the reduced space IRof the sample
IR
.
Figure 1: Illustration the dimensionality reduction using kPCA.
and (15) for the linear case (PCA). In the nonlinear case the forward mapping follows the idea of (28) with a vector g
the elements of the sample (the training set). Vector g
component reads
Vector g
The mapping of into the reduced space is finally obtained as
Figure 1 shows how samples in the input space, IR
are mapped forward into
IR
, in the reduced space. This is ideally performed passing through the feature space, using PCA to reduce the dimensionality of the samples Φ(
IR
, and then projecting them into the reduced space IR
(grey dotted arrows in Figure 1). In practice, the feature space is never used. With kPCA, the kernel trick introduces an alternative strategy (with no explicit definition of Φ) that directly defines a mapping forward from
IR
to
IR
(represented by the blue dashed arrow in Figure 1).
As noted in section 2.5, having only does not allow properly mapping backwards an element
IR
into its pre-image
IR
, (red dashed arrow in
Figure 1).
The typical strategy to perform the backward mapping consists in seeking the element as a linear combination (or weighted average) of the elements of the training set, that is
for some weights = 1
. Note that the weights have to be computed as a function of the available data: that is,
and the samples in the training set. Often, the weights are explicitly computed in terms of the distances from
to all the samples mapped forward into the reduced space. That is, the distances
are computed in the reduced space and the weights
are explicitly given as an inverse function of
(the larger is the distance, the lower is the weight). Of course, this introduces some arbitrariness in the definition of the weights. There are many functions that decrease with the distance d: w = 1/d; w = 1
= exp(
)... they all provide sensible results but with non-negligible discrepancies. Here, we propose to select weights with an objective criterion, independent of any arbitrary choice. The idea is to adopt the form of (31) and define the vector of unknown weights w = [
. Then, vector g
becomes a function of the unknown weights as
Vector g
of , the following discrepancy functional is introduced:
Finally, the weights w are selected such that they minimize the discrepancy
The minimization is performed with any standard optimization algorithm D´ıez (2003); Ciarlet (1989). In the example shown in the next section, the minimization method used is fmincon, interior-point algorithm implemented in Matlab, simply imposing the weights to be nonnegative, that is 0.
In the case of the Gaussian kernel given in (22), recalling that V is a unit matrix (premultiplying by V does not alter the norm) and that the logarithm is monotonic, the previous discrepancy functional is replaced by an alternative form that measures discrepancy of the values of the logarithms
Note that functionals J in (33) and (35) are different but they do lead to solutions minimizing the discrepancy of model and data.
Using the expression of the Gaussian kernel, (32) becomes
and therefore the discrepancy functional J (w) adopts an explicit form in terms of the unknown weights w, namely
In order to ease the computational effort of the minimization algorithm it is convenient to reduce the number of unknown parameters by using some criterion based on the distance . For instance, take as actual unknowns only the weights that correspond to samples with a distance in the reduced space lower than some threshold. This may drastically reduce the dimension of the minization problem from
to the number of samples to be considered close enough to
(and hence to
).
The backward mapping proposed above (pre-image reconstruction) is tested in an example illustrated in Figure 2. The sample (training set) contains = 69 frames of a video consisting in an open hand that closes and opens once (nine of them shown in figure 2).The number of grey-scale pixel values in each frame is d = 1080
1920 = 2073600. An extra sample
is kept apart to check the backward and forward mapping. Picture
is a frame of the video outside the training set of
samples and located in the center of the running time. Intuitively, this set of pictures can be described by a single parameter (the opening of the hand) and therefore the nonlinear manifold where the sample belongs is expected to be of dimension one.
The kPCA methodology described in section 3 associated with an exponential kernel, see (22), provides a reduced model of three variables k = 3 that collects approximately 50% of the variance (corresponds to = 0.5). This low level of accuracy is preferred in the example (
= 0.5 is too large) because k = 3 is allowing a graphical illustration. Actually, figure 3 shows (in blue) the representation of the reduced samples
= 1, . . . , 69. Note that the configuration of the samples in the 3D space suggest that the actual intrinsic dimension of the manifold is one. The image
is mapped to the 3D space using the strategy devised in Section 3.4 (see red dot in Figure 3).
The new sample in the reduced space is mapped backward to the input space IRfollowing the ideas in Section 4. This produces the pre-image
which is
Figure 2: Training set: 9 pictures are displayed (out of 69) of a hand in a sequence of closing and opening.
Figure 3: Set of points representing the samples = 1
for k = 3. The dashed line unites consecutive frames. The red dot is the forward mapping of the extra image
.
Figure 4: Original picture out of training set (left) and its pre-image approximation
(right, after consecutive forward and backward mapping).
an approximation to , see Figure 4. Note that the dimensionality reduction (from 2073600 degrees of freedom to 3) is preserving most of the features of the image.
A Three vectors product rule
All along the paper, the following product rule is extensively used. For three arbitrary vectors a and b and c in IR, it holds that
or, using the tensorial and scalar product notation
This is easily shown considering the index notation
where the Einstein summation convention is used.
B The first d columns of V are equal to B
Proposition 1. The matrix B introduced in (13) coincides with the first d columns of matrix V.
Proof. The matrix B is
Thus, recalling that =
the eigenvalue problem (10) results in
The left-hand-side of (40) becomes
And left multiplying (40) by yields
The previous expression is written in matrix form as
This proves that the d columns of B are indeed eigenvectors of G associated with eigenvalues , for i = 1, . . . , d and therefore they do coincide with the first d columns of V (possibly up to a normalization constant).
This work is partially funded by Generalitat de Catalunya (grant number 1278 SGR 2017-2019) and Ministerio de Econom´ıa y Empresa and Ministerio de Ciencia, Innovaci´on y Universidades (grant number DPI2017-85139-C2-2-R.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
Ciarlet, P.G., 1989. Introduction to Numerical Linear Algebra and Optimisa- tion. Cambridge University Press.
D´ıez, P., 2003. A note on the convergence of the secant method for simple and multiple roots. Applied mathematics letters 16, 1211–1215.
D´ıez, P., Zlotnik, S., Garc´ıa-Gonz´alez, A., Huerta, A., 2018. Algebraic PGD for tensor separation and compression: an algorithmic approach. Comptes Rendus M´ecanique 346, 501–5014. doi:10.1016/j.crme.2018.04.011.
Gonz´alez, D., Aguado, J., Cueto, E., Abisset-Chavanne, E., Chinesta, F., 2018. kPCA-based parametric solutions within the PGD framework. Archives of Computational Methods in Engineering 25, 69–86.
Grassi, L., Schileo, E., Boichon, C., Viceconti, M., Taddei, F., 2014. Comprehensive evaluation of PCA-based finite element modelling of the human femur. Medical engineering & physics 36, 1246–1252.
Izenman, A.J., 2008. Modern multivariate statistical techniques. Regression, classification and manifold learning. Springer Texts in Statistics, Springer. doi:10.1007/978-0-387-78189-1.
Kwok, J.Y., Tsang, I.H., 2004. The pre-image problem in kernel methods. IEEE transactions on neural networks 15, 1517–1525.
Lopez, E., Gonzalez, D., Aguado, J., Abisset-Chavanne, E., Cueto, E., Bi- netruy, C., Chinesta, F., 2018. A manifold learning approach for integrated computational materials engineering. Archives of Computational Methods in Engineering 25, 59–68.
Mika, S., Sch¨olkopf, B., Smola, A.J., M¨uller, K.R., Scholz, M., R¨atsch, G., 1999. Kernel pca and de-noising in feature spaces, in: Advances in neural information processing systems, pp. 536–542.
Rathi, Y., Dambreville, S., Tannenbaum, A., 2006. Statistical shape analysis using kernel PCA, in: Image processing: algorithms and systems, neural networks, and machine learning, International Society for Optics and Photonics. p. 60641B.
Sch¨olkopf, B., Mika, S., Smola, A., R¨atsch, G., M¨uller, K.R., 1998a. Kernel PCA pattern reconstruction via approximate pre-images, in: International Conference on Artificial Neural Networks, Springer. pp. 147–152.
Sch¨olkopf, B., Smola, A., M¨uller, K.R., 1998b. Nonlinear component analysis as kernel eigenvalue problem. Neural Computation 10, 1299–1319.
Shiokawa, Y., Date, Y., Kikuchi, J., 2018. Application of kernel principal compo- nent analysis and computational machine learning to exploration of metabolites strongly associated with diet. Scientific reports 8, 3426.
Snyder, J.C., Mika, S., Burke, K., M¨uller, K.R., 2013. Kernels, pre-images and optimization, in: Empirical Inference. Springer, pp. 245–259.
Twining, C.J., Taylor, C.J., 2001. Kernel principal component analysis and the construction of non-linear active shape models., in: BMVC, pp. 23–32.
Wang, Q., 2012. Kernel principal component analysis and its applications in face recognition and active shape models. arXiv preprint arXiv:1207.3538 .
Widjaja, D., Varon, C., Dorado, A., Suykens, J.A., Van Huffel, S., 2012. Ap- plication of kernel principal component analysis for single lead ecg-derived respiration. IEEE Transactions on Biomedical Engineering 59, 1169–1176.
Zheng, W.S., Lai, J., Yuen, P.C., 2010. Penalized preimage learning in kernel principal component analysis. IEEE Transactions on Neural Networks 21, 551–570.