A kernel Principal Component Analysis (kPCA) digest with a new backward mapping (pre-image reconstruction) strategy

2020·Arxiv

Abstract

Abstract

Methodologies for multidimensionality reduction aim at discovering low-dimensional manifolds where data ranges. Principal Component Analysis (PCA) is very effective if data have linear structure. But fails in identifying a possible dimensionality reduction if data belong to a nonlinear low-dimensional manifold. For nonlinear dimensionality reduction, kernel Principal Component Analysis (kPCA) is appreciated because of its simplicity and ease implementation. The paper provides a concise review of PCA and kPCA main ideas, trying to collect in a single document aspects that are often dispersed. Moreover, a strategy to map back the reduced dimension into the original high dimensional space is also devised, based on the minimization of a discrepancy functional.

1 Introduction

The kernel Principal Component Analysis (kPCA) is a very effective and popular technique to perform nonlinear dimensionality reduction Sch¨olkopf et al. (1998b). It is applied to a large variety of fields: image processing and signal denoising Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999), face recognition Wang (2012), credible real-time simulations in engineering applications Gonz´alez et al. (2018); Lopez et al. (2018); D´ıez et al. (2018), health and living matter sciences Shiokawa et al. (2018); Widjaja et al. (2012); Grassi et al. (2014)... kPCA is often presented giving the general ideas and avoiding the fundamental mathematical details. These details are soundly described only in few selected papers. In particular, when it comes to the forward mapping (or dimensionality reduction, see Izenman (2008)) and to the backward mapping or pre-image reconstruction, see Snyder et al. (2013).

This paper aims at providing a kPCA digest. That is, a condensed description with all the details necessary to understand and implement the method, alternative and complementary to the existing literature. Besides, motivated by different alternatives to achieve the backward mapping (or pre-image reconstruction) available in the literature Kwok and Tsang (2004); Mika et al. (1999); Wang (2012); Sch¨olkopf et al. (1998a); Rathi et al. (2006); Zheng et al. (2010), a new approach is proposed based on the minimization of a discrepancy functional (residual). This novel idea allows dealing with the very frequently used case of Gaussian kernel in a straightforward and efficient manner.

2 Principal Component Analysis (PCA)

2.1 Diagonalizing the covariance matrix

Vectors in IRare seen as samples of some (possibly random) variable IR. It is assumed that such that the sample is representative of the variability of x. The samples are collected in a matrix X = [].

PCA aims at discovering a linear model of dimension k, being , properly representing the variability of x. Thus, eliminating the intrinsic redundancy in the d dimensions of vector x.

The covariance matrix is defined as

In fact, the symmetric and positive definite matrix C stands for the covariance matrix of x if the mean value of x is zero. This is not a a loss of generality because it simply requires subtracting the mean value of x to all the columns in X.

Diagonalizing C results in finding a diagonal matrix of eigenvalues 0 and a unit matrix U such that

Matrix U describes an orthonormal basis in IR, the basis formed by its columns U = [].

Vector is the expression of x in this new basis (note that x = Uz). Thus, matrix is the corresponding matrix of samples of z. The covariance of z is = . Being the covariance of z diagonal, the d components of z are uncorrelated (linearly independent) random variables.

2.2 Reducing the dimension

The trace of C, which is equal to the trace of and therefore the sum of , for i = 1, . . . , d, is the total variance of the vector sample. If eigenvalues decrease quickly, a reduced number of dimensions k is collecting a significant amount of

the variance. This is the case if, for some small tolerance is such that

Thus, eigenvalues from k + 1 to d are neglected, and consequently the last columns of matrix U (or the last rows of matrix ) are not expected to contribute to describe the variability of x. Accordingly, the last components of vector z are suppressed without significant loss of information.

Thus, matrix = [], induces a new variable

in the reduced-dimension space. Thus, the samples in X map into = of reduced dimension .The backward mapping (from reduced-dimension space IRto full dimension IR) can be seen as a truncation of relation x = Uz, that is rewritten as

The reduced-order models is interpreted as taking x in the k-dimensional linear manifold (or subspace) generated by the basis . The error introduced in the reduction of dimensionality (from d to k) is associated with the discrepancy and is decreasing as the tolerance decreases (and k increases).

2.3 Singular Value Decomposition (SVD): an alternative to diagonalization.

The SVD provides a factorization of matrix X of the form

being U and V unit matrices of sizes and , respectively, and a matrix. The singular values of 0 are the diagonal entries of (recall ).

Note that the diagonalization of C is a direct consequence of the SVD

being . Thus, the eigenvalues of C are precisely the squared singular values of X, that is , for i = 1, . . . , d.

2.4 Permuting d and ns.

The matrix equivalent to C taking as input matrix (organizing data by rows instead of columns), is

It is denoted as Gramm matrix because it accounts for all the scalar products of the samples. This perspective of the problem aims at a reduction of the number of samples , while keeping the representativity of the family. This methodology is often named as Proper Orthogonal Decomposition (POD): it coincides with PCA, but aiming at reducing the number of samples instead of the dimension d. However, the fact of changing X by is not relevant when using SVD. Namely, the SVD (7) reads =

And directly provides a diagonalization of the ) Gramm matrix

where the diagonal matrix has the same nonzero entries , for i = 1, . . . , d, as .

The dimensionality reduction is performed exactly in the same way as described in section 2.2, allowing to reduce the dimension of the space of the samples from to k. That is, to describe with enough accuracy any of the samples (or any other vector pertaining to the same family) as a linear combination of k vectors which are precisely the first k columns of matrix V.

2.5 The equivalence of diagonalizing C and G

When diagonalizing the covariance matrix C as indicated in (2), the columns of the transformation matrix U are precisely the eigenvectors of C, that is

The same happens with the Gramm matrix G and matrix V in (9), namely

being the i-th column of V. It is worth noting that for the for , the columns of matrix V correspond to the zero eigenvalue (they describe the kernel space of the matrix).

The fact that the SVD described in section 2.3 provides in one shot both U and V suggests that the computational effort provided in diagonalizing C (and therefore obtaining U, see (2)) is equivalent to the cost of diagonalizing G (and obtaining V). This is not obvious at the first sight, because the size of C is and the size of G is , hence much larger.

This equivalence actually holds. Computing vectors , for i = 1, . . . , d, is equivalent to compute vectors . This idea is one of the cornerstones of the kernel Principal Component Analysis (kPCA) method.

Recalling (1), isolating from the right-hand-side of (10), and using Appendix A, yields

where, assuming that U is available, the matrix B is introduced such that the entry with indices , for = 1and j = 1, . . . , d, reads

Matrix B is computable and coincides precisely with the first d columns of matrix V, see Appendix B, that is

The reduced variable of dimension k is introduced in (4) . For the samples in the training set, that is for = 1, the i-th component of the reduced-dimensional samples read

for i = 1. That is, the compact expression for the construction of the matrix of samples in the reduced space reads

The backward mapping described in (6) is also written in terms of , namely

Note that, when compared with equation (6), in (17) the unknown X appears also in the right-hand-side when it has to be expressed in terms of (when is not available, as it is the case in section 4). This is one of the difficulties to overcome in order to devise a backward mapping strategy for kPCA.

3 kernel Principal Component Analysis

As shown above, PCA is a dimensionality reduction technique that, for some d-dimensional variable x, discovers linear manifolds of dimension where x lies (within some tolerance ). In many applications, reality is not as simple and the low-dimensional manifold characterizing the variable is nonlinear. In these cases PCA fails in identifying a possible dimensionality reduction. Several techniques allow achieving nonlinear dimensionality reduction, see Bishop (2006) and references therein. Among them, kernel Principal Component Analysis (kPCA) Sch¨olkopf et al. (1998b); Rathi et al. (2006) is appreciated for its simplicity and easy implementation.

3.1 Transformation to higher-dimension D

The kPCA conceptual idea is conceived by introducing an arbitrary transformation Φ from IRto IRfor some . This transformation into a large dimensional space is expected to untangle (or to flatten) the nonlinear manifold where x belongs. That is, Φ and D are expected to be such that the variable x

manifold that is readily identified by the PCA. In other words, PCA is to be applied to the matrix containing the transformed samples

is applied; it diagonalizes the covariance matrix C

in section 2. This introduces two obvious difficulties: first, Φ is unknown and, second, the dimension D required to properly untangle the manifold is typically very large (in particular much larger than ) and consequently the computational effort to diagonalize C

(same as G). Moreover, as shown in section 2.4, diagonalizing G

dimension in the same way as diagonalizing C

3.2 The kernel trick

As noted above, the a priori unknown Φ is expected to map the nonlinear manifold into a linear one in a higher-dimensional space. In principle, it is not obvious to determine a proper Φ that aligns the samples into a linear subspace of IR.

of matrix G

for i, j = 1. The kernel idea is to introduce, instead of function Φ(), some bivariate symmetric form ) such that

Comparing (20) and (21) reveals that selecting some ) is, for all practical purposes, equivalent to selecting some Φ(). A typical choice for the kernel is the Gaussian that reads

Any other choice is valid provided that the matrix G

is symmetric and positive definite (recall that it must be a Gramm matrix). Many other alternative kernels are proposed in the literature Sch¨olkopf et al. (1998b); Rathi et al. (2006); Twining and Taylor (2001); Mika et al. (1999). The idea is to select the kernel providing the best dimensionality reduction (the lower value of k).

3.3 Centering the kernel

As mentioned above, the samples have to be centered to properly use PCA. Note however that selecting some arbitrary Φ() does not guarantee that the transformed sample remains centered. However, if Φ() is available, the operation to center X

a matrix G

In the following, the centered magnitudes are denoted with an over bar, like x

in (23). Thus, the generic term of the centered Gramm matrix, analogously to (20), reads [G

or

Recalling (21), this matrix results as

being IRthe matrix having all its entries equal to one. Note that expression (24) provides the centered Gramm matrix G

algebraic manipulation of matrix G

being g

useful in the following, as a vector transformation.

3.4 Diagonalization and dimensionality reduction

Recall that PCA is to be applied to the transformed sample X

is not available: only G

idea is to use POD (diagonalize G

idea of section 2.5. The diagonalization of G

The eigenvalues of

, 0, are expected to decreasesuch that the first k (with min()) collect a significant amount of the full variance, with a criterion associated with some small tolerance , see (3). Accordingly, the reduced version of V

columns of V

shown is section 2.5. That is, the matrix of samples in the reduced space is computed as

This is equivalent to compute each vector (the j-th sample mapped into the reduced space) as

which is precisely the forward mapping into the reduced space IRof the sample IR.

Figure 1: Illustration the dimensionality reduction using kPCA.

and (15) for the linear case (PCA). In the nonlinear case the forward mapping follows the idea of (28) with a vector g

the elements of the sample (the training set). Vector g

component reads

Vector g

The mapping of into the reduced space is finally obtained as

Figure 1 shows how samples in the input space, IRare mapped forward into IR, in the reduced space. This is ideally performed passing through the feature space, using PCA to reduce the dimensionality of the samples Φ(IR, and then projecting them into the reduced space IR(grey dotted arrows in Figure 1). In practice, the feature space is never used. With kPCA, the kernel trick introduces an alternative strategy (with no explicit definition of Φ) that directly defines a mapping forward from IRto IR(represented by the blue dashed arrow in Figure 1).

4 kPCA backward mapping: from z⋆ to x⋆

As noted in section 2.5, having only does not allow properly mapping backwards an element IRinto its pre-image IR, (red dashed arrow in

Figure 1).

The typical strategy to perform the backward mapping consists in seeking the element as a linear combination (or weighted average) of the elements of the training set, that is

for some weights = 1. Note that the weights have to be computed as a function of the available data: that is, and the samples in the training set. Often, the weights are explicitly computed in terms of the distances from to all the samples mapped forward into the reduced space. That is, the distances are computed in the reduced space and the weights are explicitly given as an inverse function of (the larger is the distance, the lower is the weight). Of course, this introduces some arbitrariness in the definition of the weights. There are many functions that decrease with the distance d: w = 1/d; w = 1= exp()... they all provide sensible results but with non-negligible discrepancies. Here, we propose to select weights with an objective criterion, independent of any arbitrary choice. The idea is to adopt the form of (31) and define the vector of unknown weights w = [. Then, vector g

becomes a function of the unknown weights as

Vector g

of , the following discrepancy functional is introduced:

Finally, the weights w are selected such that they minimize the discrepancy

The minimization is performed with any standard optimization algorithm D´ıez (2003); Ciarlet (1989). In the example shown in the next section, the minimization method used is fmincon, interior-point algorithm implemented in Matlab, simply imposing the weights to be nonnegative, that is 0.

In the case of the Gaussian kernel given in (22), recalling that V is a unit matrix (premultiplying by V does not alter the norm) and that the logarithm is monotonic, the previous discrepancy functional is replaced by an alternative form that measures discrepancy of the values of the logarithms

Note that functionals J in (33) and (35) are different but they do lead to solutions minimizing the discrepancy of model and data.

Using the expression of the Gaussian kernel, (32) becomes

and therefore the discrepancy functional J (w) adopts an explicit form in terms of the unknown weights w, namely

In order to ease the computational effort of the minimization algorithm it is convenient to reduce the number of unknown parameters by using some criterion based on the distance . For instance, take as actual unknowns only the weights that correspond to samples with a distance in the reduced space lower than some threshold. This may drastically reduce the dimension of the minization problem from to the number of samples to be considered close enough to (and hence to ).

5 Numerical test

The backward mapping proposed above (pre-image reconstruction) is tested in an example illustrated in Figure 2. The sample (training set) contains = 69 frames of a video consisting in an open hand that closes and opens once (nine of them shown in figure 2).The number of grey-scale pixel values in each frame is d = 1080 1920 = 2073600. An extra sample is kept apart to check the backward and forward mapping. Picture is a frame of the video outside the training set of samples and located in the center of the running time. Intuitively, this set of pictures can be described by a single parameter (the opening of the hand) and therefore the nonlinear manifold where the sample belongs is expected to be of dimension one.

The kPCA methodology described in section 3 associated with an exponential kernel, see (22), provides a reduced model of three variables k = 3 that collects approximately 50% of the variance (corresponds to = 0.5). This low level of accuracy is preferred in the example (= 0.5 is too large) because k = 3 is allowing a graphical illustration. Actually, figure 3 shows (in blue) the representation of the reduced samples = 1, . . . , 69. Note that the configuration of the samples in the 3D space suggest that the actual intrinsic dimension of the manifold is one. The image is mapped to the 3D space using the strategy devised in Section 3.4 (see red dot in Figure 3).

The new sample in the reduced space is mapped backward to the input space IRfollowing the ideas in Section 4. This produces the pre-image which is

Figure 2: Training set: 9 pictures are displayed (out of 69) of a hand in a sequence of closing and opening.

Figure 3: Set of points representing the samples = 1for k = 3. The dashed line unites consecutive frames. The red dot is the forward mapping of the extra image .

Figure 4: Original picture out of training set (left) and its pre-image approximation (right, after consecutive forward and backward mapping).

an approximation to , see Figure 4. Note that the dimensionality reduction (from 2073600 degrees of freedom to 3) is preserving most of the features of the image.

Appendices

A Three vectors product rule

All along the paper, the following product rule is extensively used. For three arbitrary vectors a and b and c in IR, it holds that

or, using the tensorial and scalar product notation

This is easily shown considering the index notation

where the Einstein summation convention is used.

B The first d columns of V are equal to B

Proposition 1. The matrix B introduced in (13) coincides with the first d columns of matrix V.

Proof. The matrix B is

Thus, recalling that = the eigenvalue problem (10) results in

The left-hand-side of (40) becomes

And left multiplying (40) by yields

The previous expression is written in matrix form as

This proves that the d columns of B are indeed eigenvectors of G associated with eigenvalues , for i = 1, . . . , d and therefore they do coincide with the first d columns of V (possibly up to a normalization constant).

Acknowledgments

This work is partially funded by Generalitat de Catalunya (grant number 1278 SGR 2017-2019) and Ministerio de Econom´ıa y Empresa and Ministerio de Ciencia, Innovaci´on y Universidades (grant number DPI2017-85139-C2-2-R.

REFERENCES

Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.

Ciarlet, P.G., 1989. Introduction to Numerical Linear Algebra and Optimisa- tion. Cambridge University Press.

D´ıez, P., 2003. A note on the convergence of the secant method for simple and multiple roots. Applied mathematics letters 16, 1211–1215.

D´ıez, P., Zlotnik, S., Garc´ıa-Gonz´alez, A., Huerta, A., 2018. Algebraic PGD for tensor separation and compression: an algorithmic approach. Comptes Rendus M´ecanique 346, 501–5014. doi:10.1016/j.crme.2018.04.011.

Gonz´alez, D., Aguado, J., Cueto, E., Abisset-Chavanne, E., Chinesta, F., 2018. kPCA-based parametric solutions within the PGD framework. Archives of Computational Methods in Engineering 25, 69–86.

Grassi, L., Schileo, E., Boichon, C., Viceconti, M., Taddei, F., 2014. Comprehensive evaluation of PCA-based finite element modelling of the human femur. Medical engineering & physics 36, 1246–1252.

Izenman, A.J., 2008. Modern multivariate statistical techniques. Regression, classification and manifold learning. Springer Texts in Statistics, Springer. doi:10.1007/978-0-387-78189-1.

Kwok, J.Y., Tsang, I.H., 2004. The pre-image problem in kernel methods. IEEE transactions on neural networks 15, 1517–1525.

Lopez, E., Gonzalez, D., Aguado, J., Abisset-Chavanne, E., Cueto, E., Bi- netruy, C., Chinesta, F., 2018. A manifold learning approach for integrated computational materials engineering. Archives of Computational Methods in Engineering 25, 59–68.

Mika, S., Sch¨olkopf, B., Smola, A.J., M¨uller, K.R., Scholz, M., R¨atsch, G., 1999. Kernel pca and de-noising in feature spaces, in: Advances in neural information processing systems, pp. 536–542.

Rathi, Y., Dambreville, S., Tannenbaum, A., 2006. Statistical shape analysis using kernel PCA, in: Image processing: algorithms and systems, neural networks, and machine learning, International Society for Optics and Photonics. p. 60641B.

Sch¨olkopf, B., Mika, S., Smola, A., R¨atsch, G., M¨uller, K.R., 1998a. Kernel PCA pattern reconstruction via approximate pre-images, in: International Conference on Artificial Neural Networks, Springer. pp. 147–152.

Sch¨olkopf, B., Smola, A., M¨uller, K.R., 1998b. Nonlinear component analysis as kernel eigenvalue problem. Neural Computation 10, 1299–1319.

Shiokawa, Y., Date, Y., Kikuchi, J., 2018. Application of kernel principal compo- nent analysis and computational machine learning to exploration of metabolites strongly associated with diet. Scientific reports 8, 3426.

Snyder, J.C., Mika, S., Burke, K., M¨uller, K.R., 2013. Kernels, pre-images and optimization, in: Empirical Inference. Springer, pp. 245–259.

Twining, C.J., Taylor, C.J., 2001. Kernel principal component analysis and the construction of non-linear active shape models., in: BMVC, pp. 23–32.

Wang, Q., 2012. Kernel principal component analysis and its applications in face recognition and active shape models. arXiv preprint arXiv:1207.3538 .

Widjaja, D., Varon, C., Dorado, A., Suykens, J.A., Van Huffel, S., 2012. Ap- plication of kernel principal component analysis for single lead ecg-derived respiration. IEEE Transactions on Biomedical Engineering 59, 1169–1176.

Zheng, W.S., Lai, J., Yuen, P.C., 2010. Penalized preimage learning in kernel principal component analysis. IEEE Transactions on Neural Networks 21, 551–570.