Regression analysis is a statistical analysis to find an unknown function relating predictor variables to a real response variable given some paired measurements of the predictor and response variables. Typically, the values of the predictor variables are given, and the conditional expectation or probability of the response variable for a given value of the predictor variables is estimated using the given measurements. In practice, the response variable measurements may contain outliers that do not follow the pattern of the other measurements. These outliers can produce a misleading outcome in regression analysis when conventional regression approaches are applied unless the outlier can be taken into account. Robust regression is an approach to overcome this limitation (Huber, 2011). This paper is concerned with a robust regression method.
This work is motivated by the needs of analyzing the time-dependent measurements of environmental factors such as temperature and relative humidity at a specific geographic site over a three month time period using Luna Innovations corrosion and coatings evaluation system (CorRES), a multichannel electrochemical monitoring device that continuously records environmental parameters including temperature and relative humidity. These environmental variables may be used to predict more complex parameters related to atmospheric chemistry that affect degradation of aerospace materials. A sophisticated atmospheric monitoring system that incorporates gas monitoring sensors manufactured by Airpointer
was used over the same three month time period to obtain response variable measurements for model development. Data sets from these gas monitoring systems may contain a small percentage of outlier data associated with system startup, user operation, periodic calibration, and other uncontrollable factors such as power fluctuation. Establishing an accurate regression model with less influences from the outliers is crucial for understanding the patterns of the environment factors relevant to degradation of aerospace materials. We are particularly interested in a robust Gaussian process regression with data collected by CorRES and Airpointer monitoring systems, mainly because of the generality and flexibility of a Gaussian process regression.
A Gaussian process (GP) regression is a Bayesian nonparametric approach for regression analysis (Rasmussen and Williams, 2006). In the approach, a GP prior is placed on an unknown regression function to express the prior belief on the function, and a Gaussian likelihood function is popularly used to model data as the noisy observations of the unknown function with Gaussian noises. It is well known that the Gaussian likelihood is very sensitive to outliers in data in that the mean and variance estimates are significantly affected by the outliers (Jyl¨anki et al., 2011). This is mainly because the Gaussian density is fast decaying in its tail, so data in the tail part has very low likelihood values. To address this issue, different likelihood models induced from probability distributions showing heavy-tail behaviors have been tried, including the Student-t likelihood (Jyl¨anki et al., 2011; Shah et al., 2014; Ranjan et al., 2016), Laplace likelihood (Kuss, 2006; Ranjan et al., 2016), Gaussian mixture likelihood (Naish-Guzman and Holden, 2008; Daemi et al., 2019), and data-dependent noise model (Goldberg et al., 1998). Replacing the Gaussian likelihood with a non-Gaussian likelihood makes the derivation of the posterior distribution difficult. Many numerical approximation schemes have been sought with cost of expensive computation steps, e.g., a Markov Chain Monte Carlo sampling and several variants of the Expectation Propagation (Minka, 2001).
In this paper, we propose a simpler treatment of outliers. We regard outliers as noisy and biased observations of an underlying regression function, where the ‘bias’ means the deviation of the conditional expectation of the response variable from the regression function. Accordingly, we introduce a bias term in the likelihood that explains the deviation of outliers from a regression trend for a robust GP regression. There is a clear difference from the existing approaches that try to explain outliers with different noise models. In our approach, the unknown bias term is estimated jointly with the hyperparameters of a covariance function of GP. When the bias is estimated, the posterior inference of the unknown regression function follows a standard GP regression solution procedure given in an analytically closed form. This simple idea provides the posterior mean and variance estimates of the regression function in an analytically closed form as the standard GP regression approach, so it is computationally efficient. We will study how the posterior mean and variance estimates behave with the comparisons to the more complex approximation models discussed in the literature.
The remainder of this paper is organized as follows. Section 2 describes two bias models to link biased observations to an underlying regression function and entails how the model parameters are estimated jointly with other hyperparameters of the GP regression. We also explain how the robust GP estimates can be achieved with the bias models in the same section. In Section 3, we investigate the numerical performance of the proposed method with comparisons to some existing robust GP approaches for various simulation scenarios of different outlier proportions, noise levels, and input dimensions of data. The numerical evaluation is extended with the use of real environmental sensor data containing different patterns of outliers in Section 4. We finally conclude this paper in Section 5.
Consider a robust regression problem of estimating an unknown regression function f that relates a d dimensional predictor to a real response y, using its observations, D =
. The observed response
is assumed a noisy and biased observation of f,
where is the bias of
deviating from
) is a random white noise, independent of
). We regard outliers as data with large bias
. Notice that we do not use bold font for the multivariate predictor
and reserve bold font for the collection of observed predictor locations,
. Some other notations for the future use are described as follows. Let
denote the corresponding response vector, and let
denote a vector of the unknown bias values.
with zero mean and covariance function ). The covariance function popularly used for
) is a parametric covariance function such as the Matrn covariance function and the exponential covariance function. We denote the parameters of the covariance by
the GP prior, the observations D is used to obtain the posterior predictive distribution of f at an unobserved location
, denoted by
). The joint distribution of (
follows a multivariate normal distribution,
where matrix with (
). The subscripts on
indicate the two sets of locations between which the covariance is computed, and we have abbreviated the subscript
If the parameters are given or their estimates ˆ
are given, we can apply the Gaussian conditioning formula to the joint distribution (1) to achieve the posterior predictive distribution of
The predictive mean ) is taken to be the point prediction of f(x) at location
, and its uncertainty is measured by the predictive variance
. In the subsections below, we will explain how to estimate the the parameters.
2.1 Constant Bias Model
We first regard the unknown bias as a constant vector to estimate, and the likelihood function becomes a Gaussian likelihood with data-dependent means,
which is different from the data-dependent noise variance model (Goldberg et al., 1998). The bias vector can be estimated jointly with the other parameters, the noise variance
and the covariance parameter . The negative log likelihood function of the parameter is
Estimating the parameters by minimizing the negative log likelihood is not a great idea. First of all, it is not well-defined, since the NL function has infinitely many minima. The estimates are easily overfit to y, and the f fit to the residual
gives a over-smoothed estimate of f as illustrated in Figure 1.
We place a certain regularization on for the well-posedness of the problem. We assume that the outliers occur occasionally, so the bias term
should be sparse with many zero elements but very few non-zero elements. We consider to add a
regularization on the bias term to the negative log likelihood,
Figure 1: Predicted Mean of f with the ML estimate of . Among 500 data points, two points with y > 2 are outliers. The exponential covariance function was used. The predicted mean is over-smooth mainly because the estimated bias is over-fit to data, absorbing the pattern of f as seen in (b).
Figure 2: Predicted Mean of f with the Regularized ML estimate of . Among 500 data points, two points with y > 2 are outliers. The exponential covariance function was used. The bias estimates are suppressed near zero except for the outliers.
where norm. The regularized negative log likelihood function is minimized for optimizing
jointly. This regularized ML estimates give a better posterior estimate of f as illustrated in Figure 2.
In fact, adding the regularization on
to the negative log likelihood function is equivalent to assuming a zero-mean Laplace prior on
and then taking a maximum a posteriori probability (MAP) estimate of . Please note that with the Laplace prior, the negative log joint density of
is proportional to
where is independent of
. Therefore, minimizing
a solution that corresponds to the MAP estimate of
2.2 Random Bias Model
In this section, we consider the unknown bias as a random quantity. We first consider a Gaussian prior distribution for the random quantity,
where is the mean of the bias, and
is the variance of the bias. In this case, the conditional distribution of
where ˜. This over-parameterized model is not a great choice, because the number of the parameters surpasses the number of data. This incurs a non-uniqueness issue in the parameter estimation.
We consider more restricted models described below. Depending on the restriction, this model would have different likelihoods. For example, when likelihood is reduced to a standard Gaussian likelihood, because
where ˜. Obviously, we will not consider this case. When
, the likelihood becomes similar to one for the constant bias model which we already discussed in the previous section, because
When , the likelihood is similar to a Gaussian likelihood with data-dependent noise variances, because
In this section, we choose this model as a random bias model and discuss the parameter estimation for the model. Please note that, by letting = 0, this data-dependent noise variance model is similar to one discussed in Goldberg et al. (1998), where ˜
was regressed with a log-Gaussian process prior. Here we treat ˜
as a parameter to estimate.
To discuss how we estimate the variance parameters, let the set of the parameters, the distribution of y is
where diagonal matrix with ˜
the diagonal element. The negative log likelihood function of the parameter set,
Minimizing the negative log likelihood directly for the parameter set can cause the overestimation of the bias parameters, , leading to the over-smoothed f estimate as we seen in the previous section. We regularize
norm and regularize each log ˜
with its linear and negative exponential terms,
The regularization term for ˜is actually derived from the negative log density of a inverse- gamma distribution. When ˜
), its density function is proportional to
In addition, the regularization term for is proportional to the negative log density of a normal distribution,
The negative log joint density of
where 1). Therefore, minimizing
with respect to the parameter set gives the MAP estimates of ˜
. The result of the parameter estimation with
is illustrated in Figure 3 with the same toy example which we used in the previous section. The estimates of ˜
’s are highly concentrated on one value except for two outliers, where the variance estimates are significantly higher to mitigate the effect of the outliers on f.
Figure 3: Predicted Mean of f with the Regularized ML estimate of 500 data points, two points with y > 2 are outliers. The exponential covariance function was used. The estimates of ˜
’s are highly concentrated around 0.0268, and they jump to 0.1299 and 0.7276 for the two outliers.
2.3 Tuning Parameter Selection
Both of the constant bias model and random bias model come with tuning parameters, and
. We can choose the tuning parameters iteratively with the regularized ML estimation discussed in the previous section. Let (
) denote the estimates achieved by minimizing
for an initial choice of
. We plug in the estimate into the negative log joint density in (3),
We minimize ˜) for updating
. This minimization and the minimization of
iteratively solved until the
value converges. This iteration is actually equivalent to the coordinate descent algorithm to minimize the log joint density,
We perform the selection of the tuning parameter similarly. Let (
denote the estimates achieved by minimizing
for an initial choice of
the estimates into the negative log joint density in (6), we have
We minimize ˜) for updating
. This minimization and the minimization of
are iteratively solved until the tuning parameter values converge.
In this section, we present the numerical outcomes of our two proposed bias models with a comprehensive set of simulation scenarios, comparing with the fractional EP approximation
with the Student-t likelihood (Jyl¨anki et al., 2011) and the MCMC approach with the Gaussian scale mixture likelihood (Gelman et al., 2013). We use two synthetic datasets, one with one input dimension and another with ten input dimensions. For each dataset, eight simulation scenarios are generated with different outlier patterns, and 15 replicated experiments were performed for each scenario. Below are the details of how we create the scenarios. The first dataset with one input dimension is generated from the underlying regression function,
The 300 training samples, denoted as , are randomly sampled from
where is a random variable following a mixture of two Gaussian distributions. We used
1) quantifies the fraction of outliers,
the mean bias of outliers, and
is the variance around the mean. We varied the fraction of outliers,
, and considered
. We also considered the values
be one sixth of
or one twelfth of
. If it is the one sixth, the samples from
slightly overlapped with those from
), so the outliers are less distinct from the normal data. If it is the one twelfth, they are less likely overlapping. Combining these possible values, we have eight different test scenarios in total. The 1000 test samples, denoted as
, are randomly sampled from
The second dataset has ten input dimensions, and the synthetic dataset was first introduced by Friedman (1991) and was referred in many papers for robust GP approaches (Kuss, 2006). In the dataset, the underlying regression function f(x) at a 10-dimensional input x depends on the first five input dimensions,
The 300 training samples and the 1000 test samples are generated following the same procedure used in the first dataset.
For each test scenario, we evaluated the four methods, our constant bias model (COB), our random bias model (RAB), the fractional EP approximation with the Student-t likelihood (EP) and the MCMC approach with the Gaussian scale mixture likelihood (MCMC). We collected the total computation time of the four methods to judge their computational efficiency. For comparison of prediction accuracy, we use two measures on the test data, denoted as . The first measure is the mean squared error (MSE)
which measures the accuracy of the mean prediction at location
. The second one is the negative log predictive density (NLPD)
which considers the accuracy of the predictive variance as well as the mean prediction
. These two criteria were used broadly in the GP regression literature. A smaller value of MSE or NLPD indicates a better performance.
For this numerical analysis, we used the GPStuff Version 4.7 MATLAB package (Van- hatalo et al., 2013) for MCMC and EP. We chose all hyperparameters to be optimized instead of fixing them to the same values. We implemented our two proposed approaches with MATLAB. For all of the four methods, we applied the same squared exponential covariance function. The results with the two datasets are presented in the two subsections below.
3.1 1D Synthetic Dataset
This 1D example is mainly used to illustrate the outcomes of the four compared methods and compare the predictive means and variances of the four compared methods in a low dimensional case. Figure 4 shows the outcomes of the four methods compared for q = 0.1, 12. All of the four methods work comparably while the EP approach underestimates the predictive variance for this example, giving narrower confidence intervals.
We did more quantitative analysis on the 15 replicated outcomes of MSE, NLPD and TIME for each of the eight simulation scenarios. The MSE performance is summarized using the bar charts in Figure 5. The average MSE values over 15 replicated outcomes are very close among the four methods. The variations of the MSE values are significantly different. The random bias model showed larger variations over the other methods, mainly due to the existence of a few bad performing cases. Both of the MCMC and EP approaches were quite stable. However, the MCMC approach has shown a couple of significantly bad performing cases as shown in Figure 5-(g), mainly due to MCMC sampling variations. The EP approach has shown three significantly bad performing cases as shown in Figure 5-(f), which occurred when the EP converged slowly relative to the other good performance cases. The outcomes from the constant bias model were less varied and more consistent than those from the other methods.
The NLPD performance was similar to the MSE performance except for that the average NLPD values are significantly higher than the corresponding values for the other compared methods. For example, see Figure 6-(b) and 6-(d). In the two scenarios, the MSE performance of the EP was similar to those of the other methods, so the differences in the NLPD are mainly due to the differences in the predictive variance estimates. As we illustrated in Figure 4, the EP underestimates the predictive variances for some scenarios. Besides, the variations in the NLPD value for the EP are larger for these two scenarios. Again, the random bias model has shown relatively larger variations in the NLPD values for a couple of the simulated scenarios.
The computation times of the four methods are quite different. The MCMC approach takes more computation times than the other methods, and the EP is the second slowest. Our proposed constant bias model is fastest among the four methods. The constant bias model provides a very efficient robust GP regression solution with great and less varying MSE and NLPD performances. This is the major benefit of using the proposed constant bias model.
Figure 4: Illustrative outcomes of the four methods compared for
3.2 10D Friedman Dataset
We use this example to see how the four compared methods perform for high dimensional inputs. The Friedman data has 10 input dimensions, five of which are not related to the response variable but adds complexity in data analysis. The main interest here is how the accuracy and computation time of the four compared methods are affected by the increased
Figure 5: MSE performance of the four methods compared for eight simulated scenarios with 1D synthetic dataset. q is the fraction of outliers in the training data, the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Figure 6: NLPD performance of the four methods compared for eight simulated scenarios with 1D synthetic dataset. q is the fraction of outliers in the training data, the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Figure 7: Time performance of the four methods compared for eight simulated scenarios with 1D synthetic dataset. q is the fraction of outliers in the training data, the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Table 1: Average MSE values for
input dimension. Like in the previous example, we collected the MSE, NLPD and TIME metrics for the four compared methods.
Figure 8 summarizes the MSE values of the four methods for eight test scenarios. Different from the 1D dataset, the averages and variations of the MSE values are significantly different among the four compared methods. Our proposed constant bias model (COB) was the best performer for all of the eight scenarios with significant gaps to the other methods, and its average MSE values are the lowest while the variations of the MSE values are relatively little. Our proposed random bias model (RAB) was the second-best consistently, and the MCMC approach was the third-best. The EP approach has shown a significant deterioration of the performance as the input dimension increases. A major benefit of using the COB is that the MSE performance changes much less in between the 1D dataset and this dataset. For example, its average MSE performance with the 1D data was 0.2649 while that with this 10D dataset was 0.4208 for 6, but the gaps are much more significant for the other three methods as shown in Table 1.
Figure 9 compares the NLPD performances of the four methods. The NLPD performance is quite comparable to the MSE performance. The proposed COB is the best performer with the lowest average negative log posterior density values and little variations. The NLPDs of the RAB and MCMC were comparable.
Figure 10 compares the total computation times. The COB is fastest. It is interesting to see that the EP slowed down for this high dimensional example significantly, being slower than the MCMC. Besides, the variation in the computation times was very significant for the EP, mainly because the convergence of the expectation propagation algorithm varied significantly over 15 random replicated experiments.
The main message from the two simulated studies is clear. The predictive power of the proposed COB method is very competitive and stable, which does not depend significantly on training samples and input dimensions of the dataset. Its computation time is faster than the existing approximation methods by more than one order of magnitude, and it is also less dependent on the input dimension.
In this section, we use environmental data collected by two monitoring systems deployed by the U.S. Air Force for development of atmospheric corrosion models of aluminum alloys (AA). We considered temperature and humidity from the CorRESsensing system, and nitrogen oxides (NO
) and ozone (O
) gas concentrations in parts per billion (ppb) from the Airpointer
system. Both systems were placed at an outdoor environmental test site operated by the U.S. Naval Research Laboratory in Key West FL. In this study, the variables of time, ambient temperature, in degree Celsius, and relative humidity are used as threedimensional predictors, and each of the gas concentrations are used as response variables.
Figure 8: MSE performance of the four methods compared for eight simulated scenarios with 10D Friedman dataset. q is the fraction of outliers in the training data, the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Figure 9: NLPD performance of the four methods compared for eight simulated scenarios with 10D Friedman dataset. q is the fraction of outliers in the training data, the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Figure 10: Time performance of the four methods compared for eight simulated scenarios with 10D Friedman dataset. q is the fraction of outliers in the training data, is the mean bias of outliers, and
is the standard deviation of data for both normal data and outliers.
Multiple causes for the gas concentration outlier data were identified, including loss of communications from sensors, and data produced during prescribed user calibrations. The NOdata have more intermittent outliers, while the O
data contain more clustered outliers 11. For the O
, the outliers deviate strongly from the normal data. Although outliers in the gas concentration data are only a small fraction of the total data, a computationally efficient process for accounting for these outliers increases the utility of the environmental monitoring systems and data sets. The capability of the robust GP regression to account for response variable outliers in the gas monitoring data sets was determined relative to three other existing approaches for these two unique NO
scenarios.
Figure 11: Environmental measurement data
Another major contrast from the simulated datasets is that this real dataset contains more measurements. In total, the dataset has 1,400 measurements for each of the gas concentrations. We manually identified outliers as indicated with red circles in Figure 11. There are five outliers for the NO gas and 45 outliers for the Ogas. For each of the gas types, we randomly selected 10% or 20% of the 1,400 records excluding the outliers, which serve as a test dataset. So, the test dataset only contains normal data. The training data contains the whole 1,400 records, substituting the records selected for the test dataset with outliers. The outliers are randomly sampled from those which we manually identified. Dropping our proposed random bias model (RAB) in this comparison due to its poor performance for this example, we evaluate the three other methods in terms of the three performance metrics,
Table 2: MSE, NLPD, TIME values for environmental measurement data
MSE, NLPD and TIME. In this study, we applied the exponential kernel for all of the four methods.
Table 2 summarizes the outcomes with the MSE, NLPD and TIME evaluations. The MCMC approach worked best for the NO data, and the proposed COB model worked closely to the MCMC approach. However, there is a meaningful gap in between them. By looking at their predictive means and variances as shown in Figure 12, we can see that both of the models work reasonably, but the COB model over-smooths the wiggly fluctuations of the NO measurements. The EP approach did not work very well, creating a flat predictive mean and too wide predictive variance around the mean. It is mainly due to the poor convergence of its hyperparameter optimization for this example.
For the Omeasurement, the proposed COB model worked better in MSE, but the MCMC approach worked better in NLPD. Figure 13 shows the predictive means and variances of the three compared methods. Our proposed model better catches the overall trend of normal data, and the MCMC approach is influenced by outliers for some test locations. The EP failed to find the mean line of the normal data, catching up with the outlier data below the normal mean line.
All in all, the proposed COB model works better than or close to the best performing method for all of the simulated and real data cases with much shorter computation time. The proposed model can be a very time economic option for the robust GP regression with great prediction accuracy and robustness.
We presented a new computationally efficient method to a robust GP regression. The approach regards data as noisy and biased observations of an underlying regression function. Accordingly, the likelihood linking data to the regression function includes bias terms. We modeled the bias terms in two different ways, biases as unknown fixed parameters and biases as unknown random quantities. We estimated the parameters of the resulting bias models based on the proposed regularized maximum likelihood estimation. We entailed how the
Figure 12: Predictive mean and variances of the three compared methods for environmental measurement data, NO measurements with 10% outliers.
regularized likelihood functions are formulated and how the tuning parameters are chosen. After the bias model estimation, the robust GP regression can be achieved following the standard GP procedure. This bias modeling approach is very simple and computationally efficient, while it provides excellent prediction accuracy and robustness to outliers. We examined its numerical performance based on a comprehensive simulation study. For most of the simulation cases, the proposed approach outperformed the existing approaches of the fractional EP approximation with the Student-t likelihood and the MCMC approach with the Gaussian scale mixture likelihood. The performance of the proposed approach was also less sensitive to the proportion of outliers and the noise level of data as well as the dimension of data. The computation time of the proposed approach was faster than the existing approaches by at least one order of magnitude. The new approach has also shown better trade-offs between the computational time and prediction accuracy than the existing
Figure 13: Predictive mean and variances of the three compared methods for environmental measurement data, Omeasurements with 20% outliers.
approaches in our numerical study with environmental sensor data. We believe that the proposed bias model is a very attractive solution to a robust GP regression.
We would like to acknowledge support for this work from Air Force Research Laboratory Materials and Manufacturing Directorate and members members of the AFRL Corrosion Integrated Product Team. The work was conducted under U.S. Federal Government Contract No FA8650-15-D-5405.
Atefeh Daemi, Yousef Alipouri, and Biao Huang. Identification of robust Gaussian process regression with noisy input using EM algorithm. Chemometrics and Intelligent Laboratory Systems, 191:1–11, 2019.
Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19 (1):1–67, 1991.
Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.
Paul W Goldberg, Christopher KI Williams, and Christopher M Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems, pages 493–499, 1998.
Peter J Huber. Robust Statistics. Springer, 2011.
Pasi Jyl¨anki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227–3257, 2011.
Malte Kuss. Gaussian process models for robust regression, classification, and reinforcement learning. PhD thesis, Technische Universit¨at, 2006.
Thomas P Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
Andrew Naish-Guzman and Sean Holden. Robust regression with twinned Gaussian pro- cesses. In Advances in Neural Information Processing Systems, pages 1065–1072, 2008.
Rishik Ranjan, Biao Huang, and Alireza Fatehi. Robust Gaussian process modeling using em algorithm. Journal of Process Control, 42:125–136, 2016.
C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics, pages 877–885, 2014.
Jarno Vanhatalo, Jaakko Riihim¨aki, Jouni Hartikainen, Pasi Jyl¨anki, Ville Tolvanen, and Aki Vehtari. Gpstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(Apr):1175–1179, 2013.