Multi-Kernel Regression with Sparsity Constraint

2018·Arxiv

Abstract

Abstract

In this paper, we provide a Banach-space formulation of supervised learning with generalized total-variation (gTV) regularization. We identify the class of kernel functions that are admissible in this framework. Then, we propose a variation of supervised learning in a continuous-domain hybrid search space with gTV regularization. We show that the solution admits a multi-kernel expansion with adaptive positions. In this representation, the number of active kernels is upper-bounded by the number of data points while the gTV regularization imposes an penalty on the kernel coefficients. Finally, we illustrate numerically the outcome of our theory.

Key words: Representer theorem, regularization theory, multiple-kernel learning, generalized LASSO, generalized total variation.

1 Introduction

The determination of an unknown function from a series of samples is a classical problem in machine learning. It falls under the category of “supervised learning”, for which there exists a rich literature [7, 23, 43]. The goal of supervised learning is to recover a target function from its M noisy samples ) + = 1, 2, . . . , M, where is a disturbance term. A general way to formulate supervised learning is through the minimization problem

where the cost function is made of two terms. The first one (data fidelity) measures how well f fits the given training data set while the second one (regularization) imposes the prior knowledge about the function model. The parameter balances the terms.

1.1 RKHS in Machine Learning

The simplest form of (1) is the least-squares problem with Tikhonov regularization

where L is the regularization operator and ), known as the native space of L, is the space of functions such that L). It is a classical quadratic minimization problem that has a closed-form solution [39]. An important assumption in this formulation is the continuity of the sampling functionals ) for m = 1, 2, . . . , M. This is equivalent to ) being a reproducing-kernel Hilbert space (RKHS) [1, 11, 43].

The Hilbert space ) consisting of functions from to R is called an RKHS if there exists a bivariate positive-definite function k : such that, for all , k() and f(x) = [1]. The function k() is unique and is called the reproducing kernel of ).

The supervised learning over the RKHS ) can be formulated through the minimization

The kernel representer theorem states that the solution of (3) admits the form

for some appropriate weights , where m = 1, 2, . . . , M [24, 33]. The expansion (4) is the key element of kernel-based schemes in machine learning [34, 36, 42]. Computing the RKHS norm of a function f of the form (4) results in = , where is a positive-definite matrix with [= k(). It is called the Gram matrix of the kernel k(). The practical outcome of this observation is that the infinite-dimensional problem (3) over the space of functions ) becomes equivalent to the finite-dimensional problem

which is of size M and can be computed numerically.

1.2 Toward Sparse Kernel Expansions

In the solution form (4), the kernels are shifted to the location of the data samples. This is elegant but it can become cumbersome when the number of samples M grows large. Several schemes have been developed to reduce the number of active kernels. One proposed approach is to use a sparsity-enforcing loss such as the -insensitive norm of support-vector machine (SVM) regression [14, 42]. Another approach is to replace the quadratic regularization in the reduced finite-dimensional problem (5) by a sparsity-promoting penalty such as . This results in (5) becoming

which is called the generalized LASSO [29].

In this paper, we consider a Banach-space formulation of supervised learning. We choose the generalized total-variation (gTV) norm as the regularization term in order to promote sparsity in the continuous domain. The effect of gTV regularization has been extensively studied in the context of linear inverse problems [17, 28, 40]. For an invertible operator L, the gTV norm is defined as

where is the total-variation norm in the sense of measures [31]. One can formulate supervised learning with gTV regularization through the minimization

where ) is the native space that corresponds to the operator L and consists of functions with finite gTV norm. A consequence of the general representer theorem of [40] is that there is always a solution of (8) that admits a kernel expansion of the form

for some kernel weights and some adaptive kernel positions [22]. There, the function k() : is the shift-invariant kernel associated to the Green’s function of the operator L. In other words, we have that k(x, y) = ), where = L.

Let us also mention that a formulation with strong link to (8) has been presented in [3] for learning a function from a continuously indexed family of atoms , where V is a compact topological space. Putting it in a similar form as (8), the proposed formulation in [3] for supervised learning is equivalent to the minimization

where M(V) is the space of Radon measures over V. The relevant property there is that the minimization of (10) introduces an atomic measure = ). It hence suggests the parametric form (9) with k() = k) for the learned function.

The minimization problem (10) is a synthetis-based formulation for supervised learning where the basis functions are known a priori, in contrary to (8) which is an analysis-based formalism that relies on regularization theory in Banach spaces. Interestingly, the two formulations are equivalent when the family of atoms in (10) coincides with the class of shifted Green’s function of the regularization operator L; that is, k) = ).

To conclude this section, we discuss the connection between (8) and generalized LASSO. One readily verifies that the gTV norm enforces an penalty on the kernel coefficients . More precisely, the expansion (9) translates the original problem (8) into the discrete minimization

where Z =is the kernel-position matrix and is a matrix with [= k(). The reduced problem (11) can be seen as an extended version of the generalized LASSO in (6). The fundamental difference is that the minimization is through the positions as well.

1.3 Multi-Kernel Schemes

The solution forms (4) and (9) heavily depend on the kernel function k(). Hence, choosing the proper kernel is a challenging task that requires careful consideration. One can use a cross-validation scheme in order to compare the performance of several kernel estimators and select the best one for the desired application [21]. Another approach is to learn a new kernel function k= from a family of given kernels k[4, 27]. This transforms the original problem (5) into the joint optimization

where is the Gram matrix of the learned kernel kand R() regularizes the coefficient vector , for example like in R() = for 1 2 [5, 18, 25]. The learned function will then take the generic form

1.4 Our Contribution

In this paper, we provide a Banach-space framework for supervised learning with gTV regularization. We study the topological structures of the search space of this problem and we characterize the class of admissible regularization operators together with their associated kernel functions.

We also propose a multi-kernel extension of supervised learning with gTV regularization. To that end, we consider the minimization

In this formulation, the target function f is decomposed into N additive components, where the smoothness of each component has been expressed by its corresponding regularization operator. Our main result, which follows from Theorem 4, is the existence of a solution of (14) that yields a multi-kernel expansion of the target function and that takes the form

where kis the shift-invariant kernel associated to the operator L. Moreover, the total number of nonzero coefficients is upper-bounded by the number M of data points and, hence, is not growing with the number N of components. We also illustrate numerically the effect of using multiple kernels.

1.5 Roadmap

The paper is organized as follows: We present some mathematical preliminaries in Section 2. In Section 3, we study the Banach structure of the native spaces and we characterize the class of admissible kernels. We propose and prove our main result in Section 4. Finally, we illustrate our theory numerically in Section 5.

2 Preliminaries

In this section, we recall relevant mathematical concepts, such as the function spaces that we use throughout the paper along with properties of linear operators that are defined over those spaces.

2.1 Function Spaces

All the derivatives of a rapidly decaying function decay faster than the inverse of any polynomial at infinity. Then, a smooth and slowly growing function is an element of ) such that all of its derivatives have asymptotic growth controlled by a polynomial. Finally, a heavy-tailed function satisfies (1 + for some finite constants 0.

The Schwartz space of smooth and rapidly decaying functions is denoted by ). Its topological dual is ), the space of tempered distributions [19]. We remark that any smooth and slowly growing function specifies the continuous linear functional over ) and, hence, is an element of ).

The space of continuous functions over that vanish at infinity is ). It is a Banach space equipped with the supremum norm . The space of Schwartz functions ) is densely embedded in ). Hence, the topological dual of ) can be defined as

In fact, ) is the Banach space of Radon measures over equipped with the total-variation norm [31]. It includes the shifted Dirac impulses ), with = 1. Moreover, ) with the relation for all ). This allows one to interpret () as a generalization of ().

For a Banach space ), we consider two topologies for its continuous dual space ). The first one is the strong topology. It is induced from the dual norm in the sense that a sequence ) is said to converge in the strong topology to ) if lim= 0. The second one is the weak*-topology that comes from the predual space ) in the sense that a sequence is said to converge in the weak*-topology to if, for any test function converges to .

2.2 Linear Operators

The linear operator L : ) is called shift-invariant if, for any function ) and any shift value , we have that L= L). We recall a variant of the Schwartz celebrated kernel theorem for linear and shift-invariant (LSI) operators (see [37] for a “simple” proof of the general case).

Theorem 1 (Schwartz kernel theorem). For any LSI operator L : ), there exists a unique distribution ), known as the impulse response of L, such that

3 Banach-Space Kernels

In this section, we introduce our Banach-space framework of learning with gTV regularization. We start by defining the class of kernel-admissible operators.

Definition 1. The linear operator L : ) is called kernel-admissible (or simply admissible), if

(ii) it is an isomorphism over ), meaning that it is continuous and invertible, its inverse being the continuous operator L: );

(iii) the sampling functional ) is weak*-continuous in the topology of its native space.

The restriction to LSI operators is not crucial to our framework. However, it lends itself to the convenience of an analysis in the Fourier domain. It also allows us to provide necessary and sufficient conditions to characterize the class of admissible operators (see Theorem 3). The invertiblity assumption, on the other hand, is essential to have decaying kernels. In fact, it is known that the Green’s function of any LSI operator with a nontrivial null space necessarily has a singularity in the Fourier domain at the origin [41]. Finally, the assumption of the (weak*) continuity of the sampling functional is a natural choice in learning theory. The main motivation here is to guarantee the (weak*) lower semicontinuity of the global cost functional in (14). This can be used, together with the generalized Weierstrass theorem, to prove the existence of solutions (see Theorem 4). Let us note that the definition of weak*-continuity depends on the Banach structure of the native space. In the sequel, we first properly define native spaces and then specify their underlying Banach structures.

Definition 2. The native space of the LSI isomorphism L : ) is the pre image of L over the space of Radon measures; that is, the space ) = L.

Theorem 2 summarizes the important properties of the native spaces. Its proof is available in Appendix A.

Theorem 2. Let L : ) be an LSI isomorphism over ). Then, its native space is a topological vector space with the following properties:

(ii) The restriction of L to its native space results in the isomorphism L : ).

(iii) The adjoint operator Lis well-defined over ) and its image is the Banach space ) with the norm .

(iv) The space ) is the predual of ), meaning that (= ).

(v) The space of Schwartz functions is embedded in the native space. More- over, the native space itself is densely embedded in the space of tempered distributions. The embedding hierarchy is indicated as

Due to Theorem 2, the weak*-continuity of the sampling functional (Condition (iii) in Definition 1) is equivalent to the inclusion of the shifted Dirac impulses in the predual of the native space. In other words, for all , one should have that ).

We now define the shift-invariant kernel associated to an admissible operator.

Definition 3. The shift-invariant kernel associated to the admissible operator L : ) is the bivariate function k : with k(x, y) = ), where = Lis the Green’s function of L.

In Theorem 3, we provide the necessary and sufficient conditions that characterize the class of admissible LSI operators. The proof can be found in Appendix B.

Theorem 3. Let L be an admissible operator. Then, its associated Green’s function = Lsatisfies the following properties:

(ii) Its Fourier transform ) is a smooth, non-vanishing, slowly growing, and heavy-tailed function of .

Additionally, any function that satisfies these properties can be appointed to an admissible operator L : ) defined as

The practical implication of Theorem 3 is that it yields Fourier-domain criteria to determine the admissiblity of an operator L. As the last part of this section, we use this characterization to introduce some families of admissible kernels. Our first example is made of exponential-type kernels defined as

These functions are known to be positive-definite [41, Appendix B]. Their inverse Fourier transforms (the so-called -stable distributions) are heavy-tailed and infinitely smooth, with algebraically decaying derivatives of any order [32, Chapter 5]. Hence, they satisfy the conditions of Theorem 3. Note that the classical Gaussian kernels are excluded because their frequency responses are not heavy-tailed. However, one can get arbitrarily close by letting tend to its critical value 2. Similar to the arguments that can be made to support the use of Gaussian kernels in regularized RKHS (see for example [20, 38, 44]), we conjecture that the present Banach space formulation can be extended to cover Gaussian kernels as well. However, this requires one to consider a space larger than ). Our second example is made of Bessel potentials used in kernel estimation [2]. For a positive real number s > d, we consider the operator (I ∆): ), where ∆ is the Laplacian operator. The Bessel potentials are the Green’s function of such operators. They correspond to the shift-invariant kernels ) = 1 (1 +

Clearly, the function is in ) for s > d. By invoking the Riemann- Lebesgue lemma, we deduce that its inverse Fourier transform is a continuous function that vanishes at infinity. Hence, the kernel function ) satisfies Property (i) of Theorem 3. Moreover, from the Fourier-domain definition (24) of ), it can be seen that Property (ii) also holds. Together, we deduce the admissiblity of these kernels. We remark that the Bessel potential kernels are rotation-invariant as well. It is worth to mention that one can rotate and dilate any admissible kernel by considering an invertible mixture matrix A and by defining the transformed kernel as k(Ax, Ay). One readily verifies that the transformed kernel also satis-fies the conditions of Theorem 3 and hence, is also admissible. In Figures 1 and 2, we have plotted the introduced kernels in dimension d = 2 for different sets of parameters. It can be seen that the width and regularity of these kernels can be adjusted through their parameters. This can be exploited in our framework of learning with multiple kernels to benefit from this diversity. We shall illustrate this numerically in Section 5.

4 Multiple-Kernel Regression

In this section, we prove our main result: the representer theorem of multiple-kernel regression with gTV regularization. In effect, the gTV norm will force

Figure 1: Bessel-potential kernels ), where r = (). The parameters s > 2 and 0 adjust smoothness and width of the kernel, respectively.

the learned function to use the fewest active kernels.

Theorem 4 (Multiple-kernel regression with gTV). Given a training data set that consists of M distinct pairs () for m = 1, 2, . . . , M, we consider the minimization problem

where E() is a strictly convex functional and Lis a kernel-admissible operator in the sense of Definition 1 for n = 1, 2, . . . , N. Then, the solution set of this problem is nonempty, convex, and weak*-compact. For any of its extreme points (), we have the kernel expansions

for its components, where are kernel weights, are adaptive kernel positions, and kis the shift-invariant kernel associated to the regularization operator Lfor n = 1, 2, . . . , N. Moreover, the number of active kernels is upper-bounded by the number of data points, so that .

Proof. Our proof is divided in three parts. First, we show the existence of a solution. Then, we show that (25) is equivalent to a constrained interpolation problem with fixed function values and, from this equivalent form, we deduce

Figure 2: Exponential kernels k) = exp(), where r = (). The parameters (0, 2) and 0 adjust the smoothness and width of the kernel, respectively.

the topological properties of the solution set. Finally, we derive the form (26) for the extreme points of the solution set.

Part 1: Existence We apply a standard technique in convex analysis. We show that the cost functional is coercive and weakly lower-semicontinuous [26]. This also works when the latter property is replaced by weak* lowersemicontinuity (see Proposition 8 in [22]).

Let us denote the data-fidelity and regularization terms of the cost functional by ) and ), respectively, so that we have that

The cost functional is a weighted sum of the nonnegative data-fidelity term ) and the coercive regularization functional ). This ensures its overall coercivity.

The sampling operator is weak*-continuous by assumption. Its composition with a continuous functional E() (that follows from its strict-convexity) and summation over m yields a cost functional H(f) that is weak* lower-semicontinuous as well.

The M norm is weak*-continuous on ). Consequently, since Lis an isomorphism, its corresponding gTV norm is a weak*-continuous functional on ). Hence, we conclude the weak* lower semicontinuity of the regularization functional (in the product space). Therefore, the overall cost functional ) + ) is weak* lower-semicontinuous. Together with the coercivity of the cost functional, this proves the existence of a solution.

Part 2: Equivalence to the Constrained Problem Considering two solutions () and () of the problem, we denote their reconstructing functions by = for i = 1, 2. By contradiction assume that ) for some m. Since E() is a strictly convex function for any , we have that

Similarly, the convexity of ) implies the inequality

The inequalities (29) and (30) together imply that ( ) has a smaller cost than ()’s which contradicts their optimality. Hence, ) = ) = for m = 1, 2, . . . , M and one can rewrite the problem as

Part 3: Identifying the Solution Set Let us define = Lfor n = 1, . . . , N and ) = ) for m = 1, . . . , M. We then reformulate (31) as

Now, using the vector-valued Fisher-Jerome theorem (Appendix C), we deduce that the solution set of (32) is convex and weak*-compact with the extreme points of the form w = (), where takes the form

for some and . Moreover, the total number of Diracs in w is upper-bounded by M. This implies that the solution set of (31) (and, consequently, the one of (25)) is a convex and weak*-compact set due to the linearity and isomorphism of L. Correspondingly, the extreme points of the original problem (25) takes the form of (), where = Lhas a kernel expansion with kernels at adaptive positions subject to the constraint .

The practical outcome of Theorem 4 is that any extreme point of (25) maps into a solution of the form

for the learned function. The solution form (34) has the following important properties:

• The number of active kernels is upper-bounded by the number of samples M. This justifies the use of multiple kernels since the flexibility of the model will be increased while the problem remains well-posed.

• The gTV norm enforces an penalty on the kernel coefficients. Practically, this will result in an -minimization problem that is reminiscent of the generalized LASSO.

• The kernel positions are adaptive and will be chosen such that the solution becomes sparse.

5 Numerical Illustration

In this section, we illustrate the use of Theorem 4 with a numerical example. We consider M = 40 noisy samples of a piecewise-linear function in the interval [1] (See Figure 3). We fix the data-fidelity to be the quadratic term E(x, y) = (and we compare the performance of 5 kernel estimators to reconstruct the target function:

• RKHS regularization (5) with Gaussian kernel,

• Generalized LASSO (6) with Gaussian kernel,

• Multiple-kernel learning (12) with regularization on the vector and with a family of Gaussian kernels with different widths,

• Single-kernel gTV regularized learning (11) with exponential kernel (= 1.99), and

• Our proposed framework of learning with multiple kernels and gTV regularization with a family of exponential kernels (= 1.99) with different widths (25).

In order to avoid the difficulty of optimizing over the data centers, which would result in a nonconvex problem, we use a convex proxy where a redundant set of centers is placed on a grid and where the excess ones are suppressed with the help of -minimization. This scheme will obviously only work when the input dimension is very low, such as d = 1 for the present experiment. We use the multi-grid strategy of [12, 22] to control the accuracy. For the fixed-kernel

Figure 3: Performance of the kernel estimators. Solid line: target function. Dash-dotted line: estimated function.

Table 1: MSE and sparsity of the kernel estimators.

schemes, we adjust the width of the kernel function to yields the least error (cross-validation approach). We also optimize the parameter in each method separately. The final results are depicted in Figure 3. As it can be seen, our proposed method results in the function with the least fluctuation. This is due to the fact that we benefit from both wide and narrow kernels, a state of affairs that is unavailable with the single-kernel approaches.

We also plotted the M largest kernel coefficients for the output function of each method. As expected, our method results in the sparsest representation among these kernel-based estimators. We also report the MSE error and exact sparsity of each method in Table 1.

Finding the kernel positions in general can be very challenging. Once the

Figure 4: M largest coefficients of each expansion. The sparsest expansion is the one that uses multiple kernels.

positions are fixed, one can find the kernel weights efficiently using the classical -minimization techniques [6, 10, 16]. In low dimensions, one can use gridbased algorithms and reach a solution where the positions of the kernels are quantified [12, 22]. It is also possible to adapt the algorithms developed for finding Dirac locations in super resolution in order to find the kernel positions [8, 9, 13, 15]. However, for high-dimensional problems, this is an open challenge and requires further considerations.

6 Conclusion

In this paper, we have provided a theoretical foundation for multiple-kernel regression with gTV regularization. We have studied the Banach structure of our search space and identified the class of kernel functions that are admissible. Then, we have derived a representer theorem that shows that the learned function can be written as a linear combination of kernels with adaptive centers. Our representer theorem also provides an upperbound to the number of active elements, which allows us to use as many kernels as convenient. We illustrate numerically the effect of using multiple kernels with a sparsity constraint. Further research directions could be the development of efficient methods in high dimensions to approximate the kernel positions and an extension of the current theory to make Gaussian kernels admissible.

A Proof of Theorem 2

Proof. (i) The linearity and invertiblity of L implies that the native space together with the gTV norm is a bona fide Banach space.

(ii) The restriction of L over its native space is injective (inherited from L) and is continuous due to the definition of the gTV norm. For all ), the relation Limplies that it is surjective as well and that its inverse is the restriction of Lover ) which continuously maps ). This ensures that L : ) is an isomorphism.

(iii) The isomorphism of Part (ii) implies the existence of the adjoint operator over. By restricting the adjoint operator to ), we obtain the operator L: ), where the space ) is the image of Lover ). This space, equipped with the norm , is a Banach space due to the linearity and invertiblity of L.

(iv) Similarly to Part (ii), we readily verify that the adjoint operator L: ) is indeed an isomorphism. Therefore, the double-adjoint operator is the isomorphism L: (). Consequently, the domains of L and Lmust be equal, which implies that ) is the predual of the native space.

(v) First, we show that the operator L is closed over the space of Schwartz functions. It is known that the impulse response of L: is the flipped version of the one of L [41]. In other words, the application of Lon a Schwartz function can be expressed by

where ) is the impulse response of L, described in (17). By the change of variable y = (), one verifies that, for any ), we have that

where is the flipped version of ) with ) = ) for all . In effect, (36) shows that L) for any ).

Now, from the inclusions L) and ), we deduce that ). Moreover, ) by Definition 2. This verifies the inclusion ). To complete the proof, we need to show that the identity operators ) and id2 : ) are continuous.

For a converging sequence of Schwartz functions , the continuity of L implies that L. Since ) is continuously embedded in ), we have that Land, consequently, that . This proves that the embedding is continuous, which is denoted by ). Moreover, since the space ) is continuously embedded in ), the convergence Limplies that L. This proves that ). The latter continuous embedding is also dense due to the denseness of ) in ) and the inclusion ).

B Proof of Theorem 3

Proof. Assume that L is a kernel-admissible operator. The weak*-continuity of the sampling functional implies that the shifted Dirac impulses ) should be included in the predual space ). Therefore, Lshould be in ). Since, the Green’s functions of L and Lare flipped version of each other, we deduce that = L). For the second property, we recall that the continuity of L: ) implies the smoothness and slow growth of the Fourier transform of its frequency response. Hence, ) is smooth and slowly growing. Similarly, the continuity of L implies that

is a smooth and slowly growing function as well. Thus, ) is non-vanishing and heavy-tailed.

in Theorem 3. First, note that, if are smooth and slowly growing functions and, moreover, g is nonzero and heavy-tailed, then

is a quotient whose numerator is a smooth and slowly growing function and whose denominator is a nonzero, heavy-tailed, smooth, and slowly growing function. Hence, the quotient itself is a smooth function whose growth is bounded by a polynomial. Based on this observation, one can deduce from induction that all the arbitrary-order derivatives of can be expressed by a quotient with a slowly growing nominator and a heavy-tailed denominator. This shows that is a smooth and slowly growing function as well. These prop- erties ensure the existence of continuous LSI operators L, ˜L : ) with the frequency responses and ), respectively. The one-to-one cor- respondence between an operator and its frequency response then yields that ˜L = L, from which we conclude that L is an isomorphism over ). Moreover, due to the Property (i), we know that the Green’s function of L is in ). Hence, the Green’s function of Lis also in ) so that, for any , we have that

In other words, )) = ), which shows that the sampling functionals are weak*-continuous.

C Vector-Valued Fisher-Jerome Theorem

Here, we propose and prove a generalization of the Fisher-Jerome theorem [17] for a vector of Radon measures. The result is not deducible from the original theorem, but its proof is an adaptation of the scalar case (Theorem 7 in [40]). We denote the space of Radon vector measures () by ), where each component ) is a Radon measure. The total-variation norm of the vector w = () is defined by = .

Theorem 5 (Vector-valued Fisher-Jerome). Let ) , where N is an -dimensional normed space and assume that F : is a linear and weak*-continuous functional () such that

and that the minimization problem

is feasible for a convex and compact set . Then, V is a nonempty, convex, weak*-compact subset of B while the components of its extreme points () are all of the form

where and . Moreover, and the minimum M-norm obtained for the problem is equal to .

Proof. The proof is in two parts. First, we show that the solution set is nonempty, weak*-compact, and convex. Then, we explore the form of its extreme points to complete the theorem.

Structure of the Solution Set Consider a point in the feasible set and denote it by (). The, (40) is equivalent to the minimization

Since C is compact, A = maxwill be a finite constant. Due to the linearity of ) and due to the triangle inequality, for all (w, p) in the feasible set we have that

The conclusion is that the feasible set of (42) is bounded. It is also weak*-closed due to the weak*-continuity of ) and the closedness of C. Hence, it is weak*-compact due to the Banach-Alaoglu theorem [30, Theorem 3.15]. The conclusion is that (40) is equivalent to the minimization of a weak*-continuous functional over a weak*-compact domain. Moreover, due to the generalized Weierstrass theorem [26, Theorem 7.3.1], its solution set is nonempty. Denote the optimal cost of (40) by . Now, note that the feasible set of (40) is the preimage of the linear continuous functional F over the convex set C. So, one can rewrite the solution set V as

This implies that V is bounded (due to (43)), weak*-closed, and convex (the intersection of two weak*-closed and convex set). Hence, it is also weak*-compact. Using the Krein-Milman theorem ([30], Theorem 3.23), we deduce that V is the convex hull of its extreme points.

such as (w, p), where w = (). We show that it is not possible to have disjoint Borelian sets such that = 0, where n = 1, 2, . . . , N and l = 1with + 1. We prove the result by contradiction. Assume such disjoint sets exist. Define = = (, and let w = (). It can be seen that w = w + . Define = ). Since the are at least M + 1 vectors in , they are linearly dependent. Consequently, there exist constants , with at least one of them being nonzero, such that

For n = 1, 2, . . . , N, define = and = (). Also, denote 0. For any ), we have that 1 + 0 for all n = 1, 2, . . . , N and l = 1. We also see that

Now, for any ), we have that ) = and, therefore, (. Moreover,

Note that the nth element of has support . Moreover, the nth element of has support for = n and has empty support otherwise. Therefore,

the nth entries have disjoint supports, which allow us to write that

For sufficiently small values of , this gives either or . Therefore, = 0, which yields that + . This shows that (, which contradicts that (w, p) is an extreme point. Therefore w, is nonzero at most in M points, which yields the form of (41). Computing the norm of such an extreme point results in

which completes the proof.

References

[1] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), pp. 337–404.

[2] N. Aronszajn and K. T. Smith, Theory of Bessel potentials I, Annales de l’Institut Fourier, 11 (1961), pp. 385–475.

[3] F. Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research, 18 (2017), pp. 629–681.

[4] F. R. Bach, G. Lanckriet, and M. Jordan, Multiple kernel learning, conic duality, and the SMO algorithm, in Proceedings of the Twenty-First International Conference on Machine Learning, 2004, pp. 41–48.

[5] J. Bazerque and G. Giannakis, Nonparametric basis pursuit via sparse kernel-based learning, arXiv preprint arXiv:1302.5449, (2013).

[6] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.

[7] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[8] K. Bredies and H. Pikkarainen, Inverse problems in spaces of measures, ESAIM: Control, Optimisation and Calculus of Variations, 19 (2013), pp. 190–218.

[9] E. Cand`es and C. Fernandez-Granda, Super-resolution from noisy data, Journal of Fourier Analysis and Applications, 19 (2013), pp. 1229– 1254.

[10] eratively reweighted least squares minimization for sparse recovery, Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 63 (2010), pp. 1–38.

[11] C. De Boor and R. E. Lynch, On splines and their minimum properties, Journal of Mathematics and Mechanics, 15 (1966), pp. 953–969.

[12] T. Debarre, J. Fageot, H. Gupta, and M. Unser, B-spline-based exact discretization of continuous-domain inverse problems with generalized TV regularization, IEEE Transactions on Information Theory, 65 (2019), pp. 4457 – 4470.

[13] , The sliding Frank-Wolfe algorithm and its application to super-resolution microscopy, Inverse Problems, (2019).

[14] T. Evgeniou, M. Pontil, and T. Poggio, Regularization networks and support vector machines, Advances in Computational Mathematics, 13 (2000), p. 1.

[15] C. Fernandez-Granda, Super-resolution of point sources via convex programming, Information and Inference: A Journal of the IMA, 5 (2016), pp. 251–303.

[16] M. A. Figueiredo, R. D. Nowak, and S. Wright, Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems, IEEE Journal of Selected Topics in Signal Processing, 1 (2007), pp. 586–597.

[17] S. D. Fisher and J. W. Jerome, Spline solutions to L1 extremal problems in one and several variables, Journal of Approximation Theory, 13 (1975), pp. 73–83.

[18] J. Gao, P. Kwan, and D. Shi, Sparse kernel learning with LASSO and Bayesian inference algorithm, Neural Networks, 23 (2010), pp. 257–264.

[19] I. M. Gelfand and G. E. Shilov, Generalized Functions. Vol. 1, Properties and Operations, Academic Press, 1969.

[20] F. Girosi, M. Jones, and T. Poggio, Priors, Stabilizers and Basis Functions: From Regularization to Radial, Tensor and Additive Splines, (1993).

[21] , Journal of Machine Learning Research, 12 (2011), pp. 2211–2268.

[22] H. Gupta, J. Fageot, and M. Unser, Continuous-domain solutions of linear inverse problems with Tikhonov versus generalized TV regularization, IEEE Transactions on Signal Processing, 66 (2018), pp. 4670–4684.

[23] T. Hastie, R. Tibshirani, and J. Friedman, Overview of supervised learning, in The Elements of Statistical Learning, Springer, 2009, pp. 9–41.

[24] G. Kimeldorf and G. Wahba, Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications, 33 (1971), pp. 82–95.

[25] , in Advances in Neural Information Processing Systems, 2009, pp. 997–1005.

[26] A. Kurdila and M. Zabarankin, Convex Functional Analysis, Springer Science & Business Media, 2006.

[27] G. Lanckriet, N. Cristianini, P. Bartlett, L. Ghaoui, and M. Jordan, Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research, 5 (2004), pp. 27–72.

[28] E. Mammen and S. van de Geer, Locally adaptive regression splines, The Annals of Statistics, 25 (1997), pp. 387–413.

[29] V. Roth, The generalized LASSO, IEEE Transactions on Neural Networks, 15 (2004), pp. 16–28.

[30] W. Rudin, Functional Analysis. International Series in Pure and Applied Mathematics, McGraw-Hill, Inc., New York, 1991.

[31] W. Rudin, Real and Complex Analysis, Tata McGraw-Hill education, 2006.

[32] K. Sato, L´evy Processes and Infinitely Divisible Distributions, Cambridge University Press, 1999.

[33] theorem, in Computational Learning Theory, Springer, 2001, pp. 416–426.

[34] , Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2001.

[35] , vol. 2, Hermann Paris, 1957.

[36] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.

[37] B. Simon, Distributions and their Hermite expansions, Journal of Mathematical Physics, 12 (1971), pp. 140–148.

[38] , The connection between regularization operators and support vector kernels, Neural Networks, 11 (1998), pp. 637–649.

[39] A. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Mathematics Doklady, 4 (1963), pp. 1035–1038.

[40] M. Unser, J. Fageot, and J. Ward, Splines are universal solutions of linear inverse problems with generalized TV regularization, SIAM Review, 59 (2017), pp. 769–793.

[41] M. Unser and P. D. Tafti, An Introduction to Sparse Stochastic Processes, Cambridge University Press, 2014.

[42] V. Vapnik, Statistical Learning Theory. 1998, vol. 3, Wiley, New York, 1998.

[43] G. Wahba, Spline Models for Observational Data, vol. 59, Siam, 1990.

[44] A. Yuille, The motion coherence theory, in Proceedings of the International Conference on Computer Vision, IEEF. Computer Society Press, 1998, pp. 344–354.

designed for accessibility and to further open science