Enhanced Particle Swarm Optimization Algorithms for Multiple-Input Multiple-Output System Modelling using Convolved Gaussian Process Models

2017·Arxiv

Abstract

Abstract

: Convolved Gaussian Process (CGP) is able to capture the correlations not only between inputs and outputs but also among the outputs. This allows a superior performance of using CGP than standard Gaussian Process (GP) in the modelling of Multiple-Input Multiple-Output (MIMO) systems when observations are missing for some of outputs. Similar to standard GP, a key issue of CGP is the learning of hyperparameters from a set of input-output observations. It typically performed by maximizing the Log-Likelihood (LL) function which leads to an unconstrained nonlinear and non-convex optimization problem. Algorithms such as Conjugate Gradient (CG) or Broyden-Fletcher-Goldfarb-Shanno (BFGS) are commonly used but they often get stuck in local optima, especially for CGP where there are more hyperparameters. In addition, the LL value is not a reliable indicator for judging the quality intermediate models in the optimization process. In this paper, we propose to use enhanced Particle Swarm Optimization (PSO) algorithms to solve this problem by minimizing the model output error instead. This optimization criterion enables the quality of intermediate solutions to be directly observable during the optimization process. Two enhancements to the standard PSO algorithm which make use of gradient information and the multi-start technique are proposed. Simulation results on the modelling of both linear and nonlinear systems demonstrate the effectiveness of minimizing the model output error to learn hyperparameters and the performance of using enhanced algorithms.

1 Introduction

Gaussian Process (GP) modelling is a non-parametric data-driven technique based on Bayesian theory. A major advantage of GP models, compared with parametric data-driven models such as Artificial Neural Network (ANN) and Fuzzy Models (FMs), is that the accuracy of the predicted outputs can be naturally measured through the variances that are computed as part of the modelling process. Another advantage is that GP models generally require fewer parameters (Kocijan 2011). These parameters, also known as hyperparameters, are estimated through a learning process using the measured input-output data of the system. GP models have found many applications in science and engineering (Bailer-Jones, Bhadeshia, and Mackay 1999; Aˇzman and Kocijan 2007; Wang, Fleet, and Hertzmann 2008; Gregorˇciˇc and Lightbody 2009; Yu 2012).

A standard GP model can be applied to a Multiple-Input Single-Output (MISO) system. For systems with multiple outputs, one can use a separate GP model for each output. This approach is referred to as Independent Gaussian Process (IGP) modelling. Its disadvantage is that since the GP models are independent of each other, any correlations between outputs will not be modelled (Boyle and Frean 2005; Alvarez and Lawrence 2009; Cao, Lai, and Alam 2014). An alternative way is to use Convolved Gaussian Process (CGP) models (Alvarez and Lawrence 2009), which are able to model not only the relationships between inputs and outputs but also correlations among all outputs. The importance of modelling this correlation becomes apparent when there are missing output data (Cao, Lai, and Alam 2014).

The hyperparameters of the CGP model can be estimated by maximizing a Log-Likelihood (LL) function. This maximization is typically performed by using gradient based solutions, such as Conjugate Gradient (CG) and Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithms. The algorithms are usually required to restart many times with different initial values to overcome the issue of getting stuck in local optima caused by the sensitiveness to initial values. Evolutionary algorithms, such as standard Particle Swarm Optimization (PSO), have been used as an alternative approach to learn the hyperparameters of GP (Zhu, Xu, and Dui 2010; Petelin and Kocijan 2011) and CGP model (Cao, Lai, and Alam 2014) due to they typically perform better than gradient based methods (Noel 2012). However, the issues of poor global search ability caused by poor initialization and slow convergence due to poor local search ability remained in the existing works due to the use of standard PSO. In addition, a physically meaningful and reliable indicator of intermediate models’ quality is preferred than the use of LL values.

In view of these shortcomings, we propose three enhanced PSO algorithms to solve the optimization problem of minimizing Mean Squared Error (MSE) values of model outputs. The first one is called multi-start PSO where the standard PSO is restarted several times to diversify the particles. The second one is the gradient-based PSO which makes use of gradient information to achieve faster convergence. The last one is a hybrid of these two methods that provides good particle diversity and faster convergence. These three algorithms are studied through the modelling of Multiple-Input Multiple-Output (MIMO) Linear Time-Varying (LTV) and Nonlinear Time-Varying (NLTV) systems. Furthermore, the use of MSE as fitness function provides us a direct and reliable indication of current solutions during the optimization process.

The rest of this article is organized as follows. Section 2 provides a brief overview of the CGP modelling technique. In Section 3, we reviewed the maximizing LL function problem for learning CGP model’ hyperparameters, and defined the problem of minimizing MSE of model outputs. The standard PSO as well as three enhanced algorithms for the problems are introduced in Section 4. Simulation results comparing the proposed algorithms to standard PSO and CG approaches are presented and discussed in Section 5. Finally, Section 6 concludes the article.

2 Convolved Gaussian Process Models

Consider a system with n inputs and m outputs again. In the CGP, each output ) is modelled by,

where d = 1, 2, . . ., m and ) denotes an independent Gaussian white noise. The function ) typically is defined by a linear convolution of a smoothing kernel ) and a latent function u(x),

The correlation between outputs is derived from the latent function u(x) which has effects on all output functions. This latent function can be any appropriate random processes. If a Gaussian white noise is used, then resulting in a Dependent Gaussian Process (DGP) model. In the CGP, a wide range of latent functions are proposed to match the modelling requirements for different physical or dynamical systems (Alvarez 2011).

In addition, the CGP models allow using more than one type of latent function. Assuming Q groups of latent functions are considered, where for the group, it has smoothing kernels. Thus the output function can be rewritten by,

Then, the covariance between different outputs ) and ) can be obtained by,

where is a Kronecker delta function thus will lead to a diagonal matrix of noise variance if it is assumed that ), and the cross-covariance between ) and ) is given by,

=

Data-driven modelling using CGP basically involves obtaining the appropriate smoothing kernels and latent functions that reflect the covariance between outputs.

As given in (3), the output function is a linear combination of independent random functions. Thus, if these functions are Gaussian processes, then ) will also be a Gaussian process. In this case, the smoothing kernels can be expressed by,

where is a length-scale coefficient, is an precision matrix of the smoothing kernel. To simplify the model further, it is assumed that the covariances of latent functions ) in every group are all same Gaussian,

where is the length-scale coefficient and is another precision matrix.

To simplify the discussion again, it is assumed that = 1 for all Q groups of latent functions. In addition, the precision matrices of the smoothing kernels are assumed to be the same for each group of latent functions. As a result, given the smoothing kernel (6) and latent function covariance (7), the covariance can be obtained by,

where . Note that this multiple-output covariance function maintains a Gaussian form, i.e. ).

Then similar to standard GP models, given a set of observations, where

distribution can be defined on the output functions by,

where the output vector y(x) is given by,

with the entries,

Without loss of generality, zero means are used. In addition, the covariance matrix can be obtained by using (5) and (8). Usually, the computation of such a covariance matrix is computationally expensive. Thus, some sparse approximations have been proposed to reduce the complexities of CGP (Alvarez and Lawrence 2009). Then, the marginal likelihood can be defined by,

The joint distribution of observed y and the predicted outputs = at new input is thus still a Gaussian and is given by,

Finally, similar to standard GP models again, the predictive distribution is a Gaussian,

where the mean ) and variance ) functions are computed by,

3 Hyperparameter Learning of CGP Models

3.1 Maximizing the Log-Likelihood Function

When doing predictions using (15), the covariance matrix K is required to be specified by a set of appropriate hyperparameters . They are usually obtained by maximizing the log of marginal likelihood function.

In CGP models, the marginal likelihood is equal to the integral over a product of the likelihood function and CGP prior over the latent functions, both of which are Gaussian. Thus, the marginal likelihood is also Gaussian and defined by,

This marginal likelihood can be viewed as the likelihood of hyperparameters corrupted by noise so that we simply call likelihood function. A good point estimate ˆof hyperparameters can be subsequently obtained by maximizing this likelihood function. In practice, we usually estimate the hyperparameters by maximizing the log likelihood function due to its less computation complexities. The corresponding optimization problem can be subsequently defined as,

where,

The unconstrained optimization problem (17) is not easy to solve due to it is typically nonlinear and non-convex. However, in CGP models, the derivatives of log likelihood function with respective to (w.r.t.) the hyperparameters are mathematically analytical and can be obtained by,

where represents the entry of hyperparameters .

3.2 Minimizing the MSE Function

Equation (18) is the natural choice as the objective function for the hyperparameter learning problem. However, there are some issues involved which we shall illustrate with the modelling of a single output nonlinear dynamic system. The system is described by the following difference equation:

where u(k) is the input and y(k) is the output at time instant k. 1000 uniformly distributed input values are randomly generated within the range (4) and the corresponding outputs are computed. From these input-output data, 200 are randomly chosen for training the model. The hyperparameters of the CGP model are learned by minimizing the

Table 1 NLL and MSE values of two CGP models of system described by (20).

negative of the LL (NLL) function. The quality of the resulting CGP model is evaluated by computing the MSE of the outputs given by

using a different set of 50 values. Here, N is the number of test data, are the observed output values, and ˆis corresponding mean value of the predictive distribution obtained by (15) given the hyperparameters .

Table 1 shows two different CGP models that results from limiting the search range of the hyperparameters to [0, 100] for Model 1 and [0, 1] for Model 2. From the MSE values, it is clear that Model 2 is able to predict the outputs more accurately compared with Model 1. However, the NLL value of Model 1 is much smaller than Model 2. If the NLL function is the objective function for minimization, one may conclude that Model 1 is the better model. Thus one cannot use the NLL (and hence the LL) values to accurately gauge the quality of the intermediate models obtained during the optimization process.

We therefore propose to minimize the MSE function (21) to learn CGP’s hyperparameters by,

In addition, the following derivatives of MSE of outputs w.r.t. hyperparameters can be used to accelerate the optimization process,

with

where the computation of and can be found in (Rasmussen and Williams 2006; Alvarez and Lawrence 2011). This technique is in fact widely known as the least-square approach in the literature. In addition, from the viewpoint of non-Bayesian learning, minimizing the MSE is approximately equivalent to maximizing the LL. The proof of equivalence between these two learning strategies can be found in (Myung 2003).

4 Enhanced PSO Algorithms

In (Zhu, Xu, and Dui 2010; Petelin and Kocijan 2011; Cao, Lai, and Alam 2014), the standard PSO algorithm has been proven superior to gradient based CG and BFGS approaches in terms of accuracy and efficiency for the optimization problems (17) and (22). However, poor initializations can lead to poor global search ability, and they exhibit slow convergence due to poor local search ability. In this section, three enhancements are proposed to address these issues.

4.1 Standard PSO

We shall first outline the standard PSO algorithm for the hyperparameter learning of CGP models. Let there be a population of particles, each of which, denoted by = [], represents a potential solution to the problem (17) or (22). Each particle also records its best position as = []and its best fitness value ), where ) denotes the fitness function and could be (18) or (21). In addition, the best position of all particles is denoted by G = [and the corresponding best fitness value is denoted by = f(G). In the iteration t + 1, the velocity of particle, given by = [], along dimension is updated according to the following rule,

where and are two acceleration factors, and are two random values between [0, 1], ) represents an inertia factor.

In general, a PSO algorithm consists of two search phases, known as “exploration” and “exploitation” respectively. They are governed by the inertia factor ). The use of a larger value of ) allows the particle to explore larger areas of the search space during the exploration phase. Meanwhile, a smaller value of ) restricts the particle to a smaller region of the search space and allows the particle to converge to a local optimum in the exploitation phase. Thus, the inertia factor is usually reduced with time step. A commonly used ) is defined by,

where and are the pre-determined start and final values respectively, denotes the maximum number of iterations. and the rate of decrease is governed by the constant k. The new position of a particle can subsequently be obtained by,

For the minimization problem (22), the and + 1 iteration are updated according to the following rule,

In addition, the G and at t + 1 iteration are updated by,

We can also use the rules (28) and (29) when the maximization problem (17) becomes the minimizing negative of LL function (18). For our hyperparameter learning problem, each particle is defined by

where represents the hyperparameters of smoothing kernels (6), and = are the hyperparameters of latent functions (7). The algorithm of standard PSO based hyperparameter learning is presented in Algorithm 1.

4.2 Multi-Start PSO

In the “exploration” stage of optimization process, we want the particles to explore as much of the search space as possible. This can be achieved by setting the inertia factor ) to a suitably large value which in turn is determined by and in (26). However, suitable values for these two constants are quite specific to each problem. Another way to achieve this objective is to diversify the swarm by introducing new particles. In this paper, all particles will be reinitialized if the global best position G remains unchanged or slightly changed for a given number of iterations . This is referred as the multi-start PSO algorithm. One issue remained in the proposed algorithm is that the potentials of old particles may not be sufficiently exploited. This issue can be ignored due to we care the global search ability more than local one in the “exploration” stage. In addition, it has been proposed that only those particles that are trapped in a local optimum should be reinitialized (An et al. 2010). However, the rest of particles may still need to be reinitialized later. Besides, this approach requires checking the changes of multiple ). The proposed algorithm is therefore simpler due to only the change of f(G) is checked. Algorithm 2 describes the approach of learning CGP models’ hyperparameters through using the multi-start PSO.

4.3 Gradient-based PSO

Standard PSO also suffers from slow convergence during the “exploitation” phase. This issue can be solved through using the gradient/derivative information especially when approaching to the global or local optima. In this paper, a gradient-based PSO is proposed for the hyperparameters learning problem by combining the standard PSO and CG algorithm. In particular, the current global best position G will be exploited by solving the problem 17) or (22) by using the CG algorithm. The obtained solution is subsequently used to replace the current global position in the PSO algorithm if it produces a better fitness value. Compared with the existing work in (Noel 2012) where all particles are exploited by using a gradient-based method, the proposed algorithm only conducts gradient-based search on the current global best position if its fitness value remains unchanged or slightly changed for a specified number of iterations . The computational burden of using proposed algorithm is essentially reduced. The gradient based PSO for the hyperparameter learning of CGP models is given in Algorithm 3.

4.4 Hybrid PSO

The multi-start method in Section 4.2 and the gradient-based method in Section 4.3 can be combined in a single PSO algorithm so that both the “exploration” and the “exploitation” phases of the optimization process are enhanced. This leads to the hybrid PSO algorithm. In particular, the multi-start technique is first used such that the search space can be well covered. When the number of iterations reaches a given proportion of maximum iteration number,

the optimization process is considered to have approached near global or local optima. The algorithm subsequently switches to the use of gradient-based technique. This allows a faster convergence rate due to the nature of using gradient-based solution compared to the use of rules (25) and (27). The proposed hybrid PSO is conceptually simple and allows to adjust the proportion to suit the problem. The use of hybrid PSO in the problem of CGP models’ hyperparameter learning is given in Algorithm 4.

5 Performance Evaluation

The performance of the proposed PSO discussed in Section 4 for CGP hyperparameters learning is evaluated by computer simulation. We consider the modelling of non-trivial MISO and MIMO systems in these numerical experiments. The results are compared with those obtained using the standard CG and BFGS. In addition, results using both the Negative value of Log-Likelihood (NLL) and the MSE as the fitness function are compared.

All simulations are repeated 50 times on a computer with a 3.40GHz IntelCore2 Duo CPU with 16 GB RAM, using Matlabversion 8.1. The average results of these 50 simulation runs are shown here. Table 2 shows the key parameters of CGP and PSO used in the simulations.

5.1 Effects of Using MSE As Fitness Function

5.1.1 Single Output Modelling

The system described by (20) is used for modelling here. Although this dynamical system has only 1 input and 1 output, the CGP modelling inputs will be 1), 1) and 2), making it a 3-input and 1-output model. Only a single output is used here for modelling to simplify the comparison. In addition, we randomly chose 1000 inputs in 4) and apply them into the system. This allows us to collect 1000 observations including inputs, states and outputs.

1000 inputs for 4) are generated and applied to the system. This allows us collect 1000 observations which includes the inputs, the states and the output. From this set of observations, 200 training and 50 test data are randomly selected.

Table 2 Key parameters used in simulations

Table 3 The MSE values of predicted outputs of the CGP models learned by using standard PSO for system (20). PSO/1 uses MSE and PSO/2 usesNLL as fitness function.

CGP models are trained using the NLL and the MSE as fitness functions, denoted by PSO/1 and PSO/2 respectively, with the standard PSO algorithm. Table 3 shows the MSE values using 50 test samples on the resulting CGP models, for population sizes of 10, 20, 50 and 100. For all four population sizes, the MSE of the predicted

Table 4 MSE of the predicted outputs for CGP models learned by the proposed standard PSO with MSE fitness (PSO/2), CG and BFGS in the two-output modelling problem, where

Table 5 MSE of the predicted outputs for CGP models learned by the proposed standard PSO with MSE fitness (PSO/2), CG and BFGS in the two-output modelling problem, where

outputs for PSO/1 and PSO/2 are very close. This implies that using MSE produces models of similar quality as those obtained using NLL. Furthermore, PSO/1 and PSO/2 require similar amount of computation time.

In both cases, a larger population size produces better quality models but require a longer computation time. It seems that using a population size between 20 to 50 provides a good trade-off between model accuracy and computational efficiency. Hence a population size of 20 will be used for the rest of the simulations.

5.1.2 Two-output Modelling

Systems with multiple-outputs can be modelled in two different ways. One is to use multiple single-output models and the other is to provide a single model for all outputs at the same time. While the first approach is often simpler, the latter approach is able to capture correlation between outputs. For example, a robot arm system with multiple degrees of freedom has multiple outputs that are strongly correlated. Another example is the prediction of steel mechanical properties in (Gaffour, Mahfouf, and Yang 2010), where the yield and tensile strength are predicted from the chemical compositions and grain size. Note that these two outputs are highly correlated.

We shall continue to use the dynamical system in (20). Since it has only one output y (denoted here), a second output will be created as a function of . Two such functions are considered, one linear and the other nonlinear, given by and ) respectively. Two different sets of training data, each has 200 samples, are selected from the 1000 observations. The test data consists of 50 samples which are different from the training samples. The performance of PSO/2 is compared that obtained by CG and BFGS. Note that CG and BFGS should be restarted 20 500 times in order to provide a fair comparison to PSO/2. However this will result in much longer computation time than the PSO/2. In our simulations, CG and BFGS are restarted 2000 times so that the computation times of the three methods are comparable.

Table 4 and 5 show the predicted output MSE of the CGP models learned by the three different methods. These results show that PSO/2 outperforms the other two methods. This is confirmed by Figure 1 which shows that the predicted outputs for PSO/2 are closer to the real outputs than for CG and BFGS.

5.2 Effects of Search Space

Next, we aim to determine the influence of using different search spaces in the problem of hyperparameter learning. Two different cases are considered here. The same single-output system as in Section 5.1.1 is used here. In the first case (“case 1”), it is assumed that prior knowledge of value ranges for the parameters in (30) is available. More specifically,

where and are the elements of the diagonal precision matrices and respectively. In the second case (“case 2”), a range of [0, 100] that is much wider than (31) is used for these parameters to indicate that we do not have any prior knowledge.

Prediction accuracies of the three methods are shown in Table 6. They show that all three methods perform equally well with a well-defined search range. This is confirmed by Figure 2a for “case 1” where the predicted outputs of the three models are very close to desired one. But PSO/2 outperforms CG and BFGS when the search range is not well defined. Figure 2b shows that the models learnt by using CG and BFGS could not produce predicted outputs that follow the desired output as closely as the one learnt by PSO/2.

Figure 1 Desired outputs and predicted outputs of CGP models learned by PSO/2, CG and BFGS for the two-output modelling problem, where “Linear” denotes and “Nonlinear” represents

Figure 2 Fitting curves between desired outputs and predicted outputs of CGP models learned by the proposed standard PSO with MSE fitness (denoted by PSO/2), CG and BFGS approaches for the both two cases

Table 6 The MSE values of predicted outputs through using the CGP models learned by the proposed standard PSO with MSE fitness (denoted by PSO/2), CG and BFGS in the single-output modelling problem

Table 7 MSE of predicted outputs of CGP models learned by the enhanced and standard PSO algorithms with the NLL fitness, CG and BFGS for LTV system modelling.

Table 8 MSE of predicted outputs of CGP models learned by the enhanced and standard PSO algorithms with MSE fitness, CG and BFGS for LTV system modelling.

5.3 Enhanced PSO Algorithms

In this section, we evaluate the optimization performance of the three enhanced PSO algorithms presented in Section 4. The modelling of two non-trivial MIMO systems is considered. The results will be compared with those obtained by standard PSO, CG and BFGS algorithms.

5.3.1 LTV System Modelling

Consider a 2-input-2-output LTV system (Majji 2009) defined by,

where A, B, C and D are defined as:

Matrix A has time-varying parameters Γ= sin(10t) and Γ= cos(10t). The two control inputs are given by ) = 0.5 sin(12t) and ) = cos(7t). They have zero initial conditions.

Using a sampling interval of 0.05s, 200 data records which include the inputs, states and outputs are generated. 60 randomly selected samples are used for training, and all 200 samples are used for testing. The search range is [0, 100] and CG and BFGS algorithms are restarted 2000 times. In addition, both the NLL and MSE are used as the fitness function for the enhanced and standard PSO algorithms.

The results are shown in Tables 7 and 8. In all cases, the 3 enhanced PSO methods perform better than the standard PSO, CG and BFGS. In particular, the proposed hybrid PSO produced the lowest MSE. In addition, comparing the corresponding entries in Tables 7 and 8 suggests that using the output MSE as the fitness function for PSO algorithms seems to produce more accurate models.

Figures 3a and 3b depict the convergence behaviours of the PSO algorithms. They show that the hybrid and multi-start PSO algorithms perform a better search at the early stages (approximately before 150 iterations) than the standard and gradient-based PSOs. In addition, the hybrid and gradient-based PSO methods are able to reach more optimal solutions than the multi-start and standard alternatives. It can therefore be concluded that the hybrid and gradient-based methods have better local search abilities (approximately after 400 iterations) than the other two approaches. Among the methods considered, the proposed hybrid PSO method showed good local and global optimization performance.

Figure 3 Convergence behaviour of the proposed enhanced PSO algorithms (multi-start, gradient-based and hybrid) and standard PSO with the both NLL and MSE fitnesses in the modelling problem of the LTV system

Figure 4 Control inputs and outputs of using the Partial Form Dynamic Linearization (PFDL) appraoch for the two trajectories

5.3.2 NLTV System Modelling

The simulation in this section involves the CGP modelling of a NLTV system controlled by a PFDL based Model-Free Adaptive Control (MFAC) controller with the same parameters as in (Hou and Jin 2011). The 4-input and 2-output numerical system is described by,

where the time-varying parameters are given by,

This system is to track two trajectories. One involves a “Step” trajectory given by,

the other is “Curve” trajectory specified by,

The same initial values of the system as (Zhang, Ge, and Lee 2005) are used: (1) = (2) = (1) = (2) = 0.5, (1) = (2) = (1) = (2) = 0, and (1) = (2) = (1) = (2) = 0. 1500 and 200 records are collected for the “Curve” and “Step” trajectories, respectively. In these simulations, we use a search range of [0, 1] such that the optima or near-optima can be founded easier and faster than using [0, 100]. In addition, CG and BFGS are again restarted 2000 times.

First, 40 records are used for training the CGP models for both trajectories. The simulation results of using MSE and LL in the CGP learning problem are given in Tables 9 and 10. Similar to the results obtained in Section 5.3.1, the hybrid PSO produces the lowest MSE values. In terms of the convergence behaviour, as shown in Figure ??, the hybrid algorithm convergences as fast as the multi-start PSO at the early stage. At the same time, it is able to arrive at the most optimum values at the later stage.

The effect of the training data size on model accuracy for the hybrid PSO algorithm with MSE fitness function is now evaluated. Training data are chosen from the control intervals shown in Figures 4b and 4d. The results of using different training sizes are shown in Table 11. As expected, model accuracy improves as the training data size increases. However, the algorithm runtime increases exponentially with data size. Interestingly, for the “Step” trajectory where the outputs are piecewise constant, the system can be modelled with far fewer training data compared with the “Curve” trajectory with continuously smooth outputs.

Figure 5 Convergence behaviour of the proposed enhanced PSO algorithms (multi-start, gradient-based and hybrid) and standard PSO with the NLL fitness in the modelling problem of the NLTV system

Figure 6 Convergence behaviour of the proposed enhanced PSO algorithms (multi-start, gradient-based and hybrid) and standard PSO with the MSE fitness in the modelling problem of the NLTV system

Table 9 MSE of predicted outputs of the CGP models learned by the enhanced and standard PSO algorithms with MSE fitness, CG and BFGS for modellingthe NLTV system.

Table 10 MSE of predicted outputs of the CGP models learned by the enhanced and standard PSO algorithms with NLL fitness, CG and BFGS for modellingthe NLTV system.

Table 11 The comparison of learning the NLTV system through using the proposed MSE fitness hybrid PSO with different training data sizes in terms of the computational time and the MSE values of predicted outputs of obtained CGP models

6 Conclusion

The hyperparameters of the GP models are conventionally learnt by minimizing the NLL function. This typically leads to an unconstrained nonlinear non-convex optimization problem that is usually solved by using the CG algorithm. Three enhanced PSO algorithms have been proposed in this chapter to improve the hyperparameter learning for CGP models of MIMO systems. They make use of gradient-based technique and also combine it with the multi-start technique. Using numerical LTV and NLTV systems, we have shown that these algorithms are more effective in avoiding getting stuck in local optima. Hence they are able to produce more accurate models of the systems. Results showed that the hybrid PSO algorithm allows the faster convergence and produces the more accurate models. These algorithms also use the MSE of model outputs rather than the LL function as the fitness function of optimization problems. This enables us to assess the quality of intermediate solutions more directly.

References

Alvarez, Mauricio and Neil D Lawrence (2009). “Sparse Convolved Gaussian Processes for Multi-output Regression”. In: Advances in Neural Information Processing Systems, pp. 57–64.

Alvarez, Mauricio A (2011). “Convolved Gaussian process priors for multivariate regression with applications to dynamical systems”. PhD thesis. University of Manchester.

Alvarez, Mauricio A and Neil D Lawrence (2011). “Computationally efficient convolved multiple output Gaussian processes”. In: Journal of Machine Learning Research 12.May, pp. 1459–1500.

An, Ru et al. (2010). “A Modified PSO Algorithm for Remote Sensing Image Template Matching”. In: Photogrammetric engineering and remote sensing 76.4, pp. 379–389.

Aˇzman, Kristjan and Juˇs Kocijan (2007). “Application of Gaussian processes for black-box modelling of biosystems”. In: ISA Transactions 46.4, pp. 443–457.

Bailer-Jones, C.A.L., H.K.D.H. Bhadeshia, and D. J. C. Mackay (1999). “Gaussian process modelling of austenite formation in steel”. In: Materials Science and Technology 15.3, pp. 287–294.

Boyle, Phillip and Marcus Frean (2005). “Dependent Gaussian Processes”. In: Advances in Neural Information Processing Systems. MIT Press, pp. 217–224.

Cao, Gang, Edmund M-K Lai, and Fakhrul Alam (2014). “Particle Swarm Optimization for Convolved Gaussian Process Models”. In: International Joint Conference on Neural Networks (IJCNN). IEEE. Beijing, China, pp. 1573– 1578.

Gaffour, Sidahmed, Mahdi Mahfouf, and Yong Yao Yang (2010). “‘Symbiotic’ data-driven modelling for the accurate prediction of mechanical properties of alloy steels”. In: Proceedings of International Conference of Intelligent Systems. IEEE, pp. 31–36.

Gregorˇciˇc, Gregor and Gordon Lightbody (2009). “Gaussian process approach for modelling of nonlinear systems”. In: Engineering Applications of Artificial Intelligence 22.4, pp. 522–533.

Hou, Zhongsheng and Shangtai Jin (2011). “Data-driven model-free adaptive control for a class of MIMO nonlinear discrete-time systems”. In: IEEE Transactions on Neural Networks 22.12, pp. 2173–2188.

Kocijan, Juˇs (2011). “Control Algorithms Based on Gaussian Process Models: A State-of-the-Art Survey”. In: Proceedings of the Special International Conference on Complex Systems: Synergy of Control, Communications and Computing - COSY 2011. Ohrid, Republic of Macedonia, pp. 69–80.

Majji, Manoranjan (2009). “System identification: time varying and nonlinear methods”. PhD thesis. Texas A&M University.

Myung, In Jae (2003). “Tutorial on maximum likelihood estimation”. In: Journal of mathematical Psychology 47.1, pp. 90–100.

Noel, Mathew M (2012). “A new gradient based particle swarm optimization algorithm for accurate computation of global minimum”. In: Applied Soft Computing 12.1, pp. 353–359.

Petelin, Dejan and Jußs Kocijan (2011). “Control system with evolving Gaussian process models”. In: Workshop on Evolving and Adaptive Intelligent Systems. IEEE, pp. 178–184.

Rasmussen, CE and CKI Williams (2006). Gaussian Processes for Machine Learning. en. Cambridge, MA, USA: MIT Press, p. 248.

Wang, Jack M, David J Fleet, and Aaron Hertzmann (2008). “Gaussian process dynamical models for human motion”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 30.2, pp. 283–298.

Yu, Jie (2012). “A nonlinear kernel Gaussian mixture model based inferential monitoring approach for fault detection and diagnosis of chemical processes”. In: Chemical Engineering Science 68.1, pp. 506–519.

Zhang, Jin, Shuzhi Sam Ge, and Tong Heng Lee (2005). “Output feedback control of a class of discrete MIMO nonlinear systems with triangular form inputs”. In: IEEE Transactions on Neural Networks 16.6, pp. 1491–1503.

Zhu, Fuwei, Chong Xu, and Guansuo Dui (2010). “Particle swarm hybridize with Gaussian Process Regression for displacement prediction”. In: Proceedings of International Conference on Bio-Inspired Computing: Theories and Applications. IEEE, pp. 522–525.

designed for accessibility and to further open science