Patents and technical handbooks contain condensed knowledge. That is, they contain the distilled version of knowledge that bubbles up from a much larger body of supporting work often available as technical publications. This condensed knowledge is extremely relevant to a new experimental design process but the knowledge in most part cannot be directly applied because experimenters have to extract insights and meta-knowledge and then transfer it to the new experimental set up. For example, in metallurgy patents, we can extract the state of the art in terms of alloy compositions and heat treatments, and how it relates to properties such as yield strength. More generally in many situations we may have a good amount of data clustered around a set of “well-known” regions in experimental space and a dearth of data elsewhere.
In this paper we present a technique that makes use of any condensed knowledge we may have that is relevant to the problem to accelerate the experimental design process using a Bayesian optmisation framework. One difficulty with Bayesian optimisation using Gaussian process models in practice is that of covariance function selection. In principle one assumes that data is drawn from a distribution whose covariance function matches that of the covariance function used by the Gaussian process model. Typically this will take the form of a squared-exponential function, Mat´ern kernel or similar, where the relevant hyperparameter(s) are selected to maximise the log-likelihood (or similar) fitness criteria. However it is by no means guaranteed that this covariance function represents a good approximation of the true covariance function. In light of this, the question that naturally arises is: using condensed knowledge, how can we (automatically) find a covariance function that better fits the actual covariance? This more closely fitted covariance function allows us to build more accurate Gaussian process models, which we use to speed up the Bayesian optimisation process.
We approach this problem from the perspective of kernel methods (a kernel being a covariance function by another name). To do this we begin with the theory of m-kernels, where an m-kernel is a map that arises as a special case of reproducing kernel Banach space theory [2, 4, 16]. In particular, an m-kernel is an m-semi-inner-product [3] evaluated on the m arguments mapped into some feature space, which in the case m = 2 reduces to a standard (Mercer) kernel (covariance function). We show that, for a particular subset of m-kernels we call free m-kernels, the m-kernels have the useful property that the feature map they implicitly embody is independent of m. Using this observation we show how the feature map itself may be directly modified to give weight to relevant features and less to irrelevant ones; and moreover that this weighting may be learned directly from condensed datasets such as patents and handbooks.
We demonstrate our results using two real world settings: first we design a new hybrid Aluminium alloy based on 46 existing patents for aluminium 6000, 7000 and 2000 series. We learn the kernel from this condensed set of patents and use it to accelerate the design of the new aluminium alloy. Second we test our algorithm for a new short polymer fiber design using micro-fluid devices. Experimental data is available from Device A - a particular geometric configuartion using a gear pump. We learn the kernel using this data and then use it to design short polymer fiber on a Device B - a different configuration using a lobe pump. We demonstrate in both our experiments that our method outperforms standard Bayesian optimisation for several acquisition functions.
Our main contributions are:
• Extension of Mercer’s Theorem (Mercer’s condition) to 2q-kernels (theorem
• Definition of free kernels: families of m-kernels whose corresponding (implied) feature weights and maps are independent of m (definition 2, section 3.1).
• Definition of m-semi-inner-product kernels (m-kernel analogues of dot product kernels) and proof that these are free kernels (theorem 2, section 3.2).
• Definition of m-kernel analogues of linear, polynomial, exponential and squared-exponential kernels (table 1).
• Introduction of kernel re-weighting, a method of tuning free kernels to adjust implied feature weights (theorem 3, section 3.3).
• A modified Bayesian optimisation algorithm that leverages kernel re-weighting to pre-tune the covariance function using additional data to better match the problem at hand (algorithm 2, section 4).
• Demonstration of the utility for accelerated optimisation in two practical problems, namely short polymer fibre production and aluminium alloy design (section 5).
1.1 Notation
Sets are written A, B, . . .; and =
. If I =
then perm(I) is the set of all permutations (n-tuples) of these indices. Vectors are bold lower case
. Element i of vector a is
is the elementwise product,
the inner product,
the elementwise power, 1 a vector of 1s, and 0 a vector of 0s. The m-semi-inner-product
is the multi-linear map [3]:
We want to maximise a function that is expensive to evaluate to find
= argmax
). It is assumed that
) is a draw from a zero mean Gaussian process [12] with covariance function
. We note that in most realistic use-cases K is not given a-priori but rather obtained using a combination of experience, assumptions and heuristics such as max-log-likelihood to select a “good-enough” covariance function from a set of well-known covariance functions (squared exponential, Mat´ern, etc).
To accelerate the optimisation we are given an additional dataset =
) +
, where
is a noise term. This data may be derived from for example patents, handbooks, datasheets of similar sources of (relevant) domain knowledge. In general
and
, but we assume that
are similar in the following sense: if we were to use some machine learning algorithm to fit the models:
to and
, respectively, where
) +
represents data generated by f, then the relative magnitudes of the feature weight vectors
and
would be similar - that is (loosely speaking), if
is large (small) then
is also large (small). We will use
, and in particular the weights
(or rather their representation) obtained using a support vector machine, to directly construct a covariance function K fitted to the objective f.
2.1 Bayesian Optimisation
Bayesian optimisation [1] (BO) is a technique for maximising expensive (to evaluate) black-box functions f in the fewest iterations possible. Bayesian optimisation works by maintaining a GP model of f constructed from the set of observations of f up to (but not including) the current iteration t. Based on this model, the acquisition function
, which is cheap to evaluate and depends on our model of f, is maximised to find the next point
to be evaluated. This point is evaluated to obtain
, the model is updated, and the algorithm continues. A typical Bayesian optimisation algorithm is given in algorithm 1.
As ) is a draw from a zero mean Gaussian process [12], given observations
) +
, the posterior
)) has mean and variance:
where = [
) = [
and
= [
Using this model typical acquisition functions include probability
of improvement (PI) [7], expected improvement (EI) [6], GP upper confidence bound (GP-UCB) [14] and predictive entropy search (PES) [5]:
where is a sequence of constants [14].
2.2 m-Kernels and Support Vector Machines
Another name for a covariance function that is common in the machine learning literature is kernel function. A (Mercer) kernel function is any function K : for which there exists a corresponding (implicitly defined) feature map
Equivalently, if we define ) to be the angle between
) and
) in feature space:
Thus the kernel (covariance function) K provides a measure of the similarity of in terms of their alignment in feature space. For kernel such as the SE kernel for which K(x, x) = 1 this simplifies to
An extension that arises in the context of generalised norm SVMs [9, 10] and reproducing kernel Banach-space (RKBS) theory [2, 4, 16] is the Formally:
is a function
there exists a feature map
where -semi-inner-product [3] of the form:
and is the elementwise product. We say that feature map
satisfying (3) is implied by K.
Note that a 2-kernel is just a standard (Mercer) kernel. A canonical application of m-kernels is the p-norm support vector machine (SVM) (aka the max-margin moment classifier [2]). Let
be a regression training set.
We wish to (implicitly) find a sparse (in w) function:
to fit the data. The p-norm SVM is one approach to this problem. Letting 1 2 be dual to 2
-norm SVM primal is:
Using the m-kernel trick (see supplementary material) the dual form of the p-norm SVM may be derived:
where ) and K is a 2q-kernel (in practice we start with K and never need to know the feature map
implied by K). Moreover:
and by representor theory the (implicitly defined) weight vector w is:
Before proceeding to our main contribution we must first establish some preliminary results. We begin by presenting theoretical conditions that must be met by a function K : (to be a m-kernel under definition 1. As we are primarily interested in 2q-kernels we will largely focus on this case. The following provides an analogue of Mercer’s condition [11] for 2q-kernels:
, be a continuous function that is symmetric under the exchange of any two arguments and such that
defined by:
satisfies Mercer’s condition. Then there exists , such that:
the series being uniformly and absolutely convergent.
Proof: See supplementary material.
3.1 m-Kernels and Free Kernels
For the purposes of the present paper we define a special classes of m-kernels, namely the free kernels:
Definition 2 A free kernel is a family of functions : (
indexed by
for which there exists a feature map
and feature weights
both independent of
Note that for fixed m a free kernel defines (is) an m-kernel:
with implied feature map:
where is an the elementwise power and
) an elementwise product. Note that, while all free kernels define m-kernels, not all m-kernels derive from free kernels for given m.
We also define normalised free kernels:
Definition 3 Let : (
be a free kernel. Then for a given m the function ˆ
defined by:
is a free kernel, which we call the normalised free kernel.
The standard SE (2-)kernel is an example of a normalised 2-kernel as it may be constructed by normalising the exponential kernel is straightforward to show that if
and
are the feature weights and feature map implied by the free kernel
then the normalised free kernel ˆ
feature weights ˆ
and feature map ˆ
Table 1: “Standard” free and m kernels and their corresponding Mercer (2-)kernels. The linear, polynomial and exponential free kernels are m-semi-inner-product free kernels (see theorem 2). The SE m-kernel is the normalised form of the exponential free kernel.
3.2 Constructing Free Kernels
We now consider the question of constructing free kernels. In RKHS theory there exist a range of “standard” kernels (2-kernels) from which to choose. In [2] it is shown that any 2-kernel K may be used to construct a 2q-kernel ¯K using the symmetrised product:
This construction does not define a free kernel as the feature map implied by ¯K depends on q. To construct free kernels we use the following theorem (which may be viewed as an analogue of [13] for m-semi-inner-product-kernels):
Theorem 2 Let be a Taylor-expandable function
) =
with
. Then the function
defined by:
is a free kernel, where:
Proof: See supplementary material.
A sample of free kernels corresponding to 2-kernels are given in table 1. With the exception of the SE kernel all of these kernels are m-semi-inner-product kernels; the SE kernel is the normalised exponential kernel.
3.3 Kernel Re-Weighting
As noted, all free kernels in table 1 (except the SE free kernel) have implied feature weights defined by (11) and corresponding implied features (10). While
Figure 1: Kernel Pre-training Procedure. Working left to right in input space: (1) start with a kernel derived from a free kernel
. (2) Train SVM to obtain
. (3) Using (12) obtain re-weighted kernel
derived from free kernel
. The feature space operation implied by each step is shown below the line: (1) start with a feature map
) implied by
. (2) Train SVM to obtain weights w. (3) Using (12) obtain re-weighted feature map
) implied by
. Note that important features (as given by the SVM)
be assigned higher weights
than less important features.
this is not entirely arbitrary there is no guarantee that the weights will give (relatively) more weight to important features and (relatively) less to unimportant ones. We would like to be able to adjust these weights (tune the free kernel) to suit the problem at hand. We call this operation kernel re-weighting. It relies on the following key result:
Theorem 3 Let : (
be a free kernel with implied feature map
and feature weights
; and let
) :
. Then the function
defined by:
is a free kernel family with implied feature map =
and implied feature weights
Proof: Using the definitions:
and the result follows.
The utility of this result, as illustrated in figure 1, is that it allows us to take a kernel derived from free kernel
, train a (standard 2-norm) SVM (or similar) equipped with this kernel to obtain (implicitly defined) weights w =
) =
) (using (7) and (9)), and then use this result to generate a new kernel
derived from the free kernel
using (12)) that gives more weight to important features in feature space (as indicated by large
Figure 2: Geometry of kernel re-weighting (m = 2 case). The upper diagram shows the two vectors ) in feature space. The similarity
) as measured by K is proportional to cos
as per (2). The lower diagrams show two possible re-weightings. The left-hand case increases the relative weight of feature 2, resulting in decreased similarity, whilst the right-hand case increases the relative weight of feature 1, resulting in increased similarity.
) and less weight to less important features (as indicated by small
geometric significance of this operation is illustrated in figure 2.
Note that at no point in this process do we need to explicitly know the implied feature map or weights. Thus we can directly tune a kernel (covariance) function to fit a given set of problem. As we will show in our results, applying this procedure to Bayesian optimisation where the kernel is tuned to fit some corpus of condensed knowledge results in a significant speed-up due to better fitting of the covariance function.
Figure 3: Device geometry for short polymer fibre injection.
We now discuss our method for selecting the kernel function. As noted previously, we wish to accelerate Bayesian optimisation by leveraging the additional dataset ) +
, which may be derived for example from relevant patents, handbooks or datasheets. Our proposed algorithm is given in algorithm 2.
The key difference between our proposed algorithm and the standard Bayesian optimisation algorithm (algorithm 1) is that the covariance function K is first re-weighted based on the additional dataset to ensure that the implied feature weights match those of this dataset before proceeding with the optimisation. Thus we can expect that it will better match the problem at hand in terms of the relative weights it gives to components in feature space.
We consider three experiments here, two based on real-world problems and one simulation to highlight the properties of the algorithm.
5.1 Short Polymer Fibres
In this experiment we have tested our algorithm on the real-world application of optimizing short polymer fiber (SPF) to achieve a given (median) target length [8]. This process involves the injection of one polymer into another in a special device [15]. The process is controlled by 3 geometric parameters (channel width (mm), constriction angle (degree), device position (mm)) and 2 flow factors (butanol speed (ml/hr), polymer concentration (cm/s)) that parametrise the experiment - see figure 3. Two devices (A and B) were used. Device A is armed by a gear pump and allows for three butanol speeds (86.42, 67.90 and 43.21 cm/s). The newer device B has a lobe pump and allows butonal speed 98, 63
and 48 cm/s. Our goal is to design a new short polymer fiber for Device B that results in a (median) target fibre length of 500
We write the device parameters as and the result of experiments (median fibre length) on each device as
), respectively. Device A has been characterised to give a dataset
of 163 input/output pairs. We aim to minimise:
noting that (the objective f differs from the function generating
, although both relate to fibre length). Device B has been similarly characterised and this grid forms our search space for Bayesian optimisation.
For this experiment we have used a free SE kernel . An SVM was trained using
and an SE kernel (hyperparameters
were selected to minimise leave-one-out mean-squared-error (LOO-MSE)) to obtain
re-weighted SE kernel
obtained from this was normalised (to ensure good conditioning along the diagonal of
) and used in Bayesian optimisation as per algorithm 2. All data was normalised to [0, 1] and all experiments were averaged over 40 repetitions.
We have tested both the EI and GP-UCB acquisition functions. Figure 4 shows the convergence of our proposed algorithm. Also shown for comparison are standard Bayesian optimisation (using a standard SE kernel as our covariance function); and a variant of our algorithm where a kernel mixture model trained on is used as our covariance function - specifically:
where is an SE kernel,
a Mat´ern 1/2 kernel,
a Mat´ern 3/2 kernel; and
0 and all relevant (kernel) hyperparameters are selected to minimise LOO-MSE on
. Relevant hyperparameters in Bayesian optimisation (
method and kernel mixtures,
for standard Bayesian optimisation) were selected using max-log-likelihood at each iteration. As can be seen our proposed approach outperforms other methods with both GP-UCB and EI acquisition functions.
5.2 Aluminium Alloy Design using Thermo-Calc
This experiment considers optimising a new hybrid Aluminium alloy for target yield strength. Designing an alloy is an expensive process. Casting an alloy and then measuring its properties usually takes long time. An alloy has certain phase structures that determine its material properties. For example, phases such as C14LAVES and ALSC3 are known to increase yield strength whilst others such as AL3ZR D023 and ALLI B32 reduce the yield strength of the alloy. However a precise function relating the phases to yield strength does not exist. The simulation software Thermo-Calc takes a mixture of component elements as input and computes the phase composition of the resulting alloy.
Figure 4: Short Polymer Fibre design. Comparison of algorithms in terms of minimum squared distance from the set target versus iterations. GP-UCB and EI indicate acquisition function used. MK indicates mixture kernel used. rmK indicates our proposed method.
Figure 5: Regression coefficients for 25 phases as determined from patent data.
We consider 11 elements as potential constituents of the alloy and 24 phases. We use Thermo-Calc for this computation.
A dataset of 46 closely related alloys filed as patents was collected. This dataset consists information about the composition of the elements in the alloy and their yield strength. The phase compositions extracted from Thermo-Calc simulations for various alloy compositions were used to understand the positive or negative contribution of phases to the yield strength of the alloy using linear regression. The weights retrieved for these phases were then used formulate a utility function. Figure 5 shows the regression coefficients for the phases contributing to the yield strength.
As in the previous experiment for this experiment we have used a free SE kernel . An SVM was trained using
and an SE kernel (hyperparameters
were selected to minimise LOO-MSE) to obtain
Figure 6: Aluminium alloy design. Comparison of algorithms in terms of maximum utility score versus iterations. GP-UCB and EI indicate acquisition function used. rmK indicates our proposed method.
re-weighted SE kernel obtained from this was normalised (to ensure good conditioning along the diagonal of
) and used in Bayesian optimisation as per algorithm 2. All data was normalised to [0, 1].
We have tested both the EI and GP-UCB acquisition functions. Figure 6 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard SE kernel as our covariance function). Relevant hyperparameters in Bayesian optimisation (for our method and kernel mixtures,
for standard Bayesian optimisation) were selected using max-log-likelihood at each iteration. As can be seen our proposed approach outperforms standard Bayesian optimisation by a significant margin for both EI and GP-UCB.
5.3 Simulated Experiment
In this experiment we consider use of kernel re-weighting to incorporate domain knowledge into a kernel design. We aim to minimise the function:
as illustrated in figure 7, where 1]
. Noting that this function has rotational symmetry we select an additional dataset to exhibit this property, namely:
of 100 vectors, where is selected uniformly randomly. Thus
reflects the rotational symmetry of the target optimisation function f but not its form.
Figure 7: Target optimisation function in experiment 3.
Figure 8: Simulated experiment. Comparison of algorithms in terms of minimum f(x) versus iterations. GP-UCB and EI indicate acquisition function used. MK indicates mixture kernel used. rmK indicates our proposed method.
As for previous experiments we have started with a free SE kernel and re-weighted this by training and SVM on
(using an SE kernel with hyperparameters selected to minimise LOO-MSE) to obtain
, and constructed the re-weighted (but not normalised in this experiment) SE kernel
. However in this case in our Bayesian optimisation we have used the composite kernel:
which implies a 2-layer feature map, the first layer being the re-weighted feature map implied by and the second layer being the standard feature map implied by the SE kernel. All data was normalised to [0, 1] and all experiments were averaged over 10 repetitions.
We have tested both the EI and GP-UCB acquisition functions. Figure 8 shows the convergence of our proposed algorithm compared to standard Bayesian optimisation (using a standard SE kernel) and standard Bayesian optimisation with a kernel mixture model as per our short polymer fibre experiment (13) trained on used as the covariance function. Curiously in this case, while our method combined with GP-UCB outperforms the alternatives, the results are less clear for our method combined with EI. The precise reason for this will be investigated in future work.
In this paper we have presented a novel approach to direct covariance function learning for Bayesian optimisation, focussing on experimental design problems where an existing corpus of condensed knowledge is present. We have based our method on m-kernel techniques from reproducing kernel Banach space theory. We have extended Mercer’s theorem to 2q-kernels; defined free kernel families whose implied feature map is independent of m (and provided a means of constructing them, with examples); shown how the properties of these may be utilised to tune them by effectively modifying the weights in feature space; presented a modified Bayesian optimisation algorithm that makes use of these properties to pre-tune its covariance function on additional data provided; and demonstrated our techniques on two practical optimisation tasks and one simulated optimisation problem, demonstrating the efficacy of our proposed algorithm.
Let 0 2 be dual to 2
(i.e. 1/p + 1/2q = 1). As defined in the paper, let Θ =
be a regression training set and let the discriminant function take the standard form:
where is a feature map implied by the 2
p-norm SVM primal is:
where
Defining Lagrange multipliers ¯0,
0 for, respectively, the first, second and third constraint sets in the primal (14) (noting that one or both of ¯
must be zero for all i, as at most one the first two constraints can be met with equality for any i) we obtain the Lagrangian:
Using the fact that one or both of 0 and ¯
0 must be zero for all i we see that ¯
, and hence:
where |w| and are, respectively, the elementwise absolute value and elementwise power of w. It follows that, for optimality:
Defining = ¯
, substituting into the Lagrangian and using the definition of the 2q-kernel K (the 2q-kernel trick) we obtain the dual p-norm SVM training problem:
where ); and the discriminant function may be written:
In the case of binary classification Θ = +1}}. The discriminant function is:
where is a feature map implicit in the 2
p-norm SVM primal is:
Defining Lagrange multipliers ¯0 for, respectively, the first and second constraint sets in the primal (18) we obtain the Lagrangian:
where |w| and are, respectively, the elementwise absolute value and elementwise power of w. It follows that, for optimality:
Defining , substituting into the Lagrangian and using the definition of the 2q-kernel K (the 2q-kernel trick) we obtain the dual p-norm SVM training problem:
where ); and the discriminant function may be written:
Mercer’s theorem states that [11]:
be a continuous function for which
for all satisfying
. Then there exists
such that:
the series being uniformly and absolutely convergent.
We now prove theorem 3 from the paper:
Proof: It follows from theorem 4 that for any function K satisfying the conditions given in the theorem the function L defined by K may be written as:
the series being uniformly and absolutely convergent. Hence:
where ) =
]). If q = 1 this is sufficient to prove the theorem. For the more general case we first note that, using the symmetry of K:
for all (we denote
). This means that for any product of two such
functions we may switch arguments between the two
functions without changing the product. Changing the range of j in the above reasoning to
we see that for all
Next suppose that such that
) = 0
. Using (22) and (23) we see that
which implies that is trivially zero everywhere and may therefore be removed from the series. Hence we may assume that
there exists
such that
= 0. It further follows from (22), (23) that
and therefore:
Recursively applying (24) to ) it may be seen that:
from which we may deduce that there exists a uniformly and absolutely convergent series representation for K:
and , which completes the proof.
Proof: Using the multinomial theorem it may be seen that:
and hence:
and:
are independent of
[1] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on bayesian optimization of expensive cost functions, with applications to active user modeling and heirarchical reinforcement learning. eprint arXiv:1012.2599, arXiv.org, December 2010.
[2] Ricky Der and Danial Lee. Large-margin classification in banach spaces. In Proceedings of the JMLR Workshop and Conference 2: AISTATS2007, pages 91–98, 2007.
[3] Sever S. Dragomir. Semi-inner products and Applications. Hauppauge, New York, 2004.
[4] Gregory E. Fasshauer, Fred J. Hickernell, and Qi Ye. Solving support vector machines in reproducing kernel banach spaces with positive definite functions. Applied and Computational Harmonic Analysis, 38(1):115–139, 2015.
[5] Jos´e!Miguel Hern´andez-Lobato, Matthew Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In NIPS, pages 918–926, 2014.
[6] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
[7] Harold J. Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
[8] Cheng Li, David Rub´ın de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart Greenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid bayesian optimisation for synthesis of short polymer fiber materials. Scientific reports, 7(1):5683, 2017.
[9] Olvi L. Mangasarian. Arbitrary-norm separating plane. Technical Report 97-07, Mathematical Programming, May 1997.
[10] Olvi L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15–23, 1999.
[11] James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Transactions of the Royal Society of London, 209(A), 1909.
[12] Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, 2006.
[13] Alexander J. Smola, Zolt´an L. ´Ov´ari, and Robert C. Williamson. Regularization with dot-product kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 308–314. MIT Press, 2001.
[14] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.
[15] Alessandra Sutti, Mark Kirkland, Paul Collins, and Ross John George. An apparatus for producing nano-bodies: Patent WO 2014134668 A1.
[16] Haizhang Zhang, Yuesheng Xu, and Jun Zhang. Reproducing kernel banach spaces for machine learning. Journal of Machine Learning Research, 10:2741– 2775, 2009.