Per-sample Prediction Intervals for Extreme Learning Machines

2019·Arxiv

Abstract

Prediction intervals in supervised Machine Learning bound the region where the true outputs of new samples may fall. They are necessary for separating reliable predictions of a trained model from near-random guesses, minimizing the rate of False Positives, and other problem-specific tasks in applied Machine Learning. Many real problems have heteroscedastic stochastic outputs, which explains the need for input-dependent prediction intervals.

This paper proposes to estimate the input-dependent prediction intervals with a separate Extreme Learning Machine model, using the variance of its predictions as a correction term accounting for the model uncertainty. The variance is estimated from the model's linear output layer with a weighted Jackknife method. The methodology is very fast, robust to heteroscedastic outputs, and handles both extremely large datasets and insufficient amounts of training data.

1 Introduction

Practical applications of machine learning can be problematic in the sense that developers and practitioners often do not fully trust their own predictions. A fundamental reason for this mistrust is that Mean Squared Error (MSE) and other error measures averaged over a dataset are commonly used to evaluate the performance of a method or to compare different methods. Averaged error measures are unfit for business processes where each particular sample is important, as it represents a customer or other existing entity [2]. On the other hand, applied Machine Learning models might skip some data samples, because they are only a part of a bigger process structure, and uncertain data might be given to human experts to be handled [22].

The trust problem can be solved by computing a sample-specific confidence value [33]. Then predictions with high confidence (and enough trust in them) are used, while data samples with uncertain predictions are passed to the next analytical stage. The Machine Learning model works as a filter, solving easy cases automatically with confident predictions, and reducing the amount of data remaining to be analyzed [3].

Let $\{x_i, y_i\}$, $i \in [1, N]$, be a dataset where outputs $y$ are independently drawn from a normal distribution conditioned on inputs $x$:

$$y_i \sim \mathcal{N}\left(f(x_i), \sigma^2(x_i)\right) \quad (1)$$

This dataset has heteroscedastic noise because the variance $\sigma^2(x)$ is not constant. A common homoscedasticity assumption simplifies formula (1) to $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, but removes the ability to separate confident predictions from uncertain ones.
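The difference between heteroscedastic and homoscedastic outputs can be illustrated with a small synthetic sample. The sketch below is only an illustration: the mean function `f`, the noise profile `sigma`, and all constants are hypothetical choices, not the dataset used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical mean (projection) function, chosen only for illustration.
    return np.sin(2.0 * x)

def sigma(x):
    # Hypothetical input-dependent noise level; grows away from x = 0.
    return 0.1 + 0.4 * np.abs(x)

N = 1000
x = rng.uniform(-1.0, 1.0, size=N)
y_hetero = f(x) + sigma(x) * rng.standard_normal(N)  # variance depends on x
y_homo = f(x) + 0.3 * rng.standard_normal(N)         # constant variance

# Compare the spread of the heteroscedastic outputs around f(x)
# close to x = 0 and close to the edges of the input range.
inner = np.abs(x) < 0.2
outer = np.abs(x) > 0.8
spread_inner = np.std(y_hetero[inner] - f(x[inner]))
spread_outer = np.std(y_hetero[outer] - f(x[outer]))
print(f"spread near 0: {spread_inner:.2f}, spread near edges: {spread_outer:.2f}")
```

A single dataset-wide variance would be too wide near the center of this sample and too narrow near its edges, which is the motivation for per-sample intervals.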

The heteroscedasticity of outputs is a reasonable assumption because applied Machine Learning problems often have stochastic outputs. Such outputs do not have a single correct value for the given input. The variance of random noise in outputs may be assumed constant because the noise is independent of the inputs, but the same assumption cannot be made about the variance of the stochastic outputs because they depend on the inputs.

This work focuses on prediction intervals specifically for Extreme Learning Machines (ELM) [21, 25]. ELM is a fast non-linear model with universal approximation ability [18]. It has a feed-forward neural network structure but with randomly fixed hidden layer weights, so only the linear output layer needs to be trained. With a large hidden layer and L2-regularization [42], ELM exhibits stable predictions [30] that are not affected by a particular initialization of the random hidden layer weights. It is an excellent Machine Learning tool for applied problems [4, 41], with a simple formulation, few to no hyper-parameters, performance at the state-of-the-art level [17, 39, 47], and scalability to Big Data [1, 40].

The idea of the method is to use an ELM to predict an output f(x), and a second ELM to estimate its conditional variance $\sigma^2(x) = \mathbb{E}\left[(y - f(x))^2 \mid x\right]$. Furthermore, a variance analysis is done on the predictions of the second ELM. It provides upper and lower boundaries for the predicted variance. These boundaries describe the model uncertainty for samples with little similar training data available, and make the methodology uniformly applicable to different problems.

The rest of the paper is organized as follows. The next subsection describes the state of the art in prediction interval estimation, and how the proposed solution differs from the rest. Section 2 describes Extreme Learning Machines and the proposed methodology. Section 3 analyses the method's performance on small artificial and real-world datasets. Section 4 presents the results on a huge real-world dataset, and describes the runtime requirements compared to the original ELM. Section 5 summarizes the findings.

1.1 State-of-the-Art

Prediction with uncertainty is a well-known task. Probabilistic methods formulate a solution naturally. Prediction intervals are available in Bayesian formulations of ELM [12, 8], including per-sample PI [37], though the applicability is limited due to the quadratic computational cost in the number of data samples.

A fuzzy nonlinear regression approach [15] exists for problems having fuzzy inputs or outputs. It applies random weight neural networks with non-iterative training similar to ELM, but formulates the solution in terms of fuzzy set theory [5]. Such a native fuzzy approach allows for a detailed investigation of the effects of uncertainty on the learning of a method [43, 44], and has important practical applications [6] for fuzzy data problems.

Without runtime limitation, good results are achieved with model independent methods [34] based on clustering of input data and re-sampling. Clustering of inputs and repetitive model re-training during the re-sampling both scale poorly with data size, and would limit the performance of ELM otherwise capable of processing billions of data samples [1].

A specific case [26] of the model-independent approach, limited to linear models (with arbitrary solution algorithm and hyper-parameters), provides good results for heteroscedastic datasets ([26], supplementary materials), and suits the ELM output layer solution as well. The method applies to any amount of training data, and benefits from huge datasets by producing more independent models in its ensemble part. Unfortunately, it does not output prediction intervals directly.

The scope of this paper is constrained to fast ways of computing prediction intervals of outputs, tailored specifically for the Extreme Learning Machine. The proposed solution works especially well in conjunction with ELM, re-using some heavy computational parts as shown in the next section. A fast runtime is one of the key features of ELM, making it valuable for practical applications and Big Data processing. Another key feature of ELM is the approximation of complex unknown functions, and the proposed method approximates prediction intervals of model outputs in a similar fashion without probabilistic or fuzzy set notations.

2 Methodology

This section starts by introducing the Extreme Learning Machine. It continues with the prediction intervals idea, and its implementation suitable for ELM. The section concludes with a formal description of an algorithm.

2.1 Extreme Learning Machine

The Extreme Learning Machine [20] model is formulated as a feed-forward neural network with a single hidden layer. It has d input and L hidden neurons. Solution is given for one output neuron; in case of many output neurons each one has an independent solution. The hidden layer weights are initialized with random noise and fixed. Often an extra input neuron with the constant +1 value is added to function as bias.

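Hidden layer neurons apply a non-linear transformation function $\phi(\cdot)$ to their output. Typical functions are sigmoid or hyperbolic tangent, but this function may be omitted to add linear neurons. For $N$ input data samples gathered in a matrix $X \in \mathbb{R}^{N \times d}$ and hidden layer weights $W \in \mathbb{R}^{d \times L}$, the hidden layer output matrix is:

$$H_{ij} = \phi\left(\sum_{k=1}^{d} x_{ik} w_{kj}\right), \quad i \in [1, N],\ j \in [1, L]$$

where the function $\phi(\cdot)$ is applied element-wise. In matrix notation, the formula simplifies to $H = \phi(XW)$.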

The output layer of ELM is a linear regression problem $H\beta = y$, which is overdetermined in real cases with more data samples than hidden neurons (N > L). The output weights $\beta$ are given by an ordinary least squares solution computed with the Moore-Penrose pseudoinverse [36] of matrix H.

Random initialization may decrease the performance of a naive ELM. This problem is addressed by including L2 regularization in the output layer solution. The linear regression problem becomes:

$$\hat{\beta} = \arg\min_{\beta} \left( \|H\beta - y\|^2 + \alpha \|\beta\|^2 \right) = (H^T H + \alpha I)^{-1} H^T y \quad (3)$$

where $\alpha$ is the L2-regularization parameter optimized by validation. With L2 regularization and a large number of hidden neurons, ELM performance becomes stable and unaffected by a particular random initialization of W [19].
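The ELM described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the HP-ELM toolbox implementation: the function names `elm_fit` and `elm_predict`, the hyperbolic tangent activation, the normal distribution of the random weights, and the fixed regularization value `alpha` are all choices made here for the example.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, alpha=1e-3, seed=0):
    """Train a single-output ELM with an L2-regularized output layer."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # extra +1 input acts as bias
    W = rng.standard_normal((Xb.shape[1], n_hidden))  # random, then fixed
    H = np.tanh(Xb @ W)                               # hidden layer output matrix
    G = H.T @ H + alpha * np.eye(n_hidden)            # regularized normal equations
    beta = np.linalg.solve(G, H.T @ y)
    return W, beta

def elm_predict(X, W, beta):
    """Predict outputs for new inputs with a trained ELM."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.tanh(Xb @ W) @ beta

# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(500)

W, beta = elm_fit(X, y)
mse = np.mean((elm_predict(X, W, beta) - y) ** 2)
print(f"training MSE: {mse:.4f}")
```

Only `beta` is learned; the matrix `G` is small (L by L) regardless of the number of samples, which is what makes the output layer solution cheap for large N.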

2.2 Prediction Intervals

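For a normally distributed output with conditional variance $\sigma^2(x)$, the prediction interval at a confidence level $c$ is

$$\mathrm{PI}(x) = \hat{y}(x) \pm \Phi(c)\, \sigma(x) \quad (4)$$

where $\Phi(\cdot)$ is an inverse cumulative distribution function, i.e. $\Phi(95\%) \approx 1.96$.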

The maximum likelihood estimator for the variance of a homoscedastic output y is given by Mean Squared Error [7]. However, it provides uniform prediction intervals that fit poorly to practical applications of Machine Learning.
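As a point of reference, the uniform (homoscedastic) intervals mentioned above can be sketched as follows. The helper name `uniform_pi` and the synthetic data are assumptions made for this example; the quantile is taken from the standard library's `statistics.NormalDist`, with the two-sided 95% level corresponding to the 97.5% quantile of the standard normal distribution.

```python
import numpy as np
from statistics import NormalDist

def uniform_pi(y_train, y_pred_train, y_pred_test, coverage=0.95):
    """Same-width prediction intervals for every sample, based on training MSE."""
    mse = np.mean((y_train - y_pred_train) ** 2)      # variance estimate
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)    # two-sided normal quantile
    half_width = z * np.sqrt(mse)
    return y_pred_test - half_width, y_pred_test + half_width

z95 = NormalDist().inv_cdf(0.975)
print(f"two-sided 95% quantile: {z95:.3f}")

# Hypothetical homoscedastic example: constant prediction, noise std of 2.
rng = np.random.default_rng(0)
y_pred = np.zeros(10000)
y_true = 2.0 * rng.standard_normal(10000)
lower, upper = uniform_pi(y_true, y_pred, y_pred)
covered = np.mean((y_true >= lower) & (y_true <= upper))
print(f"empirical coverage: {covered:.3f}")
```

On homoscedastic data such intervals reach the target coverage; on heteroscedastic data the average coverage may still be right while individual intervals are too wide or too narrow.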

An estimation of variance in linear regression is a well-researched topic, with a plethora of theoretical [38] and experimental [34] results available. Variance of heteroscedastic model predictions $\hat{y}$ can be computed with the Bienaymé formula [27, 23] from the covariance of the model weights. However, variance of the predicted outputs corresponds to confidence intervals and does not describe the range of possible true outputs y.

The relation between the heteroscedastic prediction intervals and other methods is illustrated in Figure 1.

2.3 Prediction Intervals for Extreme Learning Machines

The idea of this paper is to estimate the variance of heteroscedastic outputs $\sigma^2(x)$ using a second ELM model. The model predictions $\hat{y}$ are computed by the first ELM, then the squared residuals $r^2 = (y - \hat{y})^2$ are used as training outputs for the second ELM, which learns to predict the conditional variance of outputs.

Figure 1: Different types of confidence analysis on a toy heteroscedastic dataset (a). Uniform PI (b) estimate per-sample variance of outputs incorrectly, while confidence intervals (c) estimate variance of model predictions that is different from the variance of outputs. Only the heteroscedastic prediction intervals (d) provide a precise description of the dataset outputs distribution. ELM model predictions are used in (b-d).

However, ELM predictions can be inaccurate, and their quality must be taken into account. For that reason, variances of the predictions of the first ELM, $\sigma^2_{\hat{y}}(x)$, and of the second ELM, $\sigma^2_{\hat{r}}(x)$, are added to the predicted squared residuals $\hat{r}^2(x)$ to bound the true variance of the outputs $\sigma^2(x)$:

$$\sigma^2(x) \le \hat{r}^2(x) + \sigma^2_{\hat{y}}(x) + \sigma^2_{\hat{r}}(x)$$

In addition to directly estimating the input-dependent variance $\sigma^2(x)$, this expression has the desired property of giving larger variance for models with an insufficient amount of training data. With an excessive amount of training data $\{x_i, y_i\}$, $i \in [1, N]$, variances of the predicted residuals $\sigma^2_{\hat{r}}(x)$ and the predicted outputs $\sigma^2_{\hat{y}}(x)$ decrease to zero and the variance of true outputs is given by its ELM estimation: $\lim_{N \to \infty} \sigma^2(x) = \hat{r}^2(x)$. A similar approach to prediction intervals exists for feed-forward neural networks [32]; however, it is valid only for the case $N \to \infty$.

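The output layer of ELM is a linear regression. The Bienaymé formula [27, 23] provides the variance of outputs in linear regression, and therefore in ELM:

$$\sigma^2_{\hat{y}}(x) = h\, \Sigma_{\hat{\beta}}\, h^T$$

where $h = \phi(xW)$ is the hidden layer output of an ELM for an input sample $x$, and $\Sigma_{\hat{\beta}}$ is the covariance matrix of the output weights.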
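For many samples at once, the variance of each prediction is the corresponding diagonal element of H Σ Hᵀ. A brief sketch is given below; the function name `prediction_variance` and the random test matrices are assumptions for this example. It computes only the diagonal, so the N-by-N product is never formed.

```python
import numpy as np

def prediction_variance(H, Sigma):
    """Row-wise quadratic form h_i Sigma h_i^T, i.e. diag(H Sigma H^T)."""
    return np.einsum("ij,jk,ik->i", H, Sigma, H)

# Hypothetical small example: 6 samples, 4 hidden neurons.
rng = np.random.default_rng(0)
H = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 4))
Sigma = B @ B.T                       # any symmetric positive semi-definite matrix

var_fast = prediction_variance(H, Sigma)
var_full = np.diag(H @ Sigma @ H.T)   # same values, but builds a 6 x 6 matrix
print(np.allclose(var_fast, var_full))
```

The memory cost of the `einsum` version grows linearly with the number of samples, which matters when the test set is large.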

There is a plethora of methods for estimating the covariance $\Sigma_{\hat{\beta}}$ of normally distributed linear system weights $\hat{\beta}$. The method of choice is the weighted Jackknife estimator [45]. It is unbiased, robust against heteroscedastic noise [38, 16, 13, 10], as fast as an ELM, and scales well with the data size. Another good method for variance estimation is the Wild Bootstrap [10], which has nice theoretical properties but is slower, as the bootstrap part requires several repetitions to converge.

2.4 Weighted Jackknife for Big Data

A summary of the Weighted Jackknife method is presented below. Its inputs are the ELM hidden layer outputs H and the residuals $r$ of the model on the training data.

The method uses three auxiliary matrices: P, S and A. Equation (9) creates a weighted data matrix by scaling every row of the original data H; its denominator includes a dot product between two vectors.
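A sketch of a weighted Jackknife covariance estimate with the three auxiliary matrices follows. The exact row scaling of equation (9) is not recoverable from this text, so the sketch assumes the classical delete-one weighted Jackknife form, where each row of H is scaled by its residual divided by the square root of one minus its leverage, and the final covariance is P A P. The function name `weighted_jackknife_cov` and the test data are hypothetical.

```python
import numpy as np

def weighted_jackknife_cov(H, r, alpha=0.0):
    """Covariance of linear output weights from data H and residuals r.

    Assumed form: P = (H^T H + alpha I)^-1, rows of S are h_i * r_i / sqrt(1 - w_i)
    with leverage w_i = h_i P h_i^T, A = S^T S, covariance = P A P.
    """
    L = H.shape[1]
    P = np.linalg.inv(H.T @ H + alpha * np.eye(L))
    w = np.einsum("ij,jk,ik->i", H, P, H)        # dot product of h_i P and h_i
    S = H * (r / np.sqrt(1.0 - w))[:, None]      # weighted data matrix
    A = S.T @ S
    return P @ A @ P

# Check on a hypothetical homoscedastic linear problem, where the
# textbook covariance noise^2 * (H^T H)^-1 is known.
rng = np.random.default_rng(0)
N, L, noise = 5000, 3, 0.5
H = rng.standard_normal((N, L))
y = H @ np.array([1.0, -2.0, 0.5]) + noise * rng.standard_normal(N)
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)

Sigma = weighted_jackknife_cov(H, y - H @ beta_hat)
reference = noise ** 2 * np.linalg.inv(H.T @ H)
print(np.diag(Sigma) / np.diag(reference))       # ratios close to 1
```

Unlike the textbook formula, this estimate does not assume a constant noise level, so it remains usable when the residual size varies with the input.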

Weighted Jackknife works well together with ELM and Big Data. First, an auxiliary matrix P in (7) is an inverse of the already computed matrix in an ELM solution (3).

Second, Big Data applications with a huge number of samples are often limited by memory size, especially if the matrix computations are run on GPUs with a very limited memory pool. Weighted Jackknife avoids such a limitation by batch computations. Let the data matrix H be split into k equal parts with N/k samples each:

$$H = \begin{bmatrix} H_1 \\ H_2 \\ \vdots \\ H_k \end{bmatrix}$$

Then the auxiliary matrix S can be computed in the corresponding parts $S_j$, $j \in [1, k]$, and the auxiliary matrix A becomes a summation over all the parts, $A = \sum_{j=1}^{k} S_j^T S_j$. The size of matrices A and P does not depend on the number of samples N, and the weighting (9) may be done in-place without consuming additional memory.

Having only one data part in memory at a time reduces the total memory requirements by a factor of k. Large enough k allows a single workstation to process billions of samples with Weighted Jackknife, the same way as presented for ELM in [1]. The practical value of k is limited by the minimum size N/k of a single batch, that cannot fully utilize CPU/GPU computational potential for small data batches of N/k < 1000 [1].
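The batch computation can be sketched as an accumulation loop. As before, the row weighting is an assumed form (residual divided by the square root of one minus the leverage), and the names `batched_A`, `H_batches` and `r_batches` are hypothetical. In a real out-of-core setting the batches would be read from disk one at a time rather than split from an in-memory array.

```python
import numpy as np

def batched_A(H_batches, r_batches, P):
    """Accumulate A = sum_j S_j^T S_j while holding one batch at a time."""
    L = P.shape[0]
    A = np.zeros((L, L))
    for H_j, r_j in zip(H_batches, r_batches):
        w = np.einsum("ij,jk,ik->i", H_j, P, H_j)      # leverages of this batch
        S_j = H_j * (r_j / np.sqrt(1.0 - w))[:, None]  # assumed row weighting
        A += S_j.T @ S_j                               # L x L, independent of N
    return A

# Hypothetical data: 1200 samples, 5 hidden neurons, 4 batches.
rng = np.random.default_rng(0)
H = rng.standard_normal((1200, 5))
r = rng.standard_normal(1200)
P = np.linalg.inv(H.T @ H)

A_batched = batched_A(np.array_split(H, 4), np.array_split(r, 4), P)

# Reference computed with all data in memory at once.
w = np.einsum("ij,jk,ik->i", H, P, H)
S = H * (r / np.sqrt(1.0 - w))[:, None]
A_full = S.T @ S
print(np.allclose(A_batched, A_full))
```

The same accumulation pattern applies to the matrix inverted to obtain P, so neither step needs the full H in memory.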

2.5 ELM Prediction Intervals Algorithm

Prediction intervals are computed in two stages. The first stage uses training data to learn the two necessary ELM models, one for the outputs and one for the squared residuals, and to estimate the covariances of the output weights in these models:

1. Train an ELM model on the training data X, y

2. Predict outputs $\hat{y}$ for the training data

3. Use weighted Jackknife to estimate the covariance of the output weights of the first model

4. Compute squared residuals $r^2 = (y - \hat{y})^2$ for the training data

5. Train another ELM model to predict the squared residuals

6. Use weighted Jackknife to estimate the covariance of the output weights of the second model

The training data X, y and auxiliary vectors $\hat{y}$, $r^2$ can be discarded at this point.

The second stage uses the previously trained models to predict test outputs, their squared residuals and all variances. Then the prediction intervals are estimated with equation (4).

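1. Compute the hidden layer outputs $H_y$ and $H_r$ for test inputs using the two ELM models (subscripts $y$ and $r$ denote the output model and the squared-residual model)

2. Predict test outputs $\hat{y} = H_y \hat{\beta}_y$

3. Compute variance of the predicted outputs $\sigma^2_{\hat{y}} = \mathrm{diag}(H_y \Sigma_y H_y^T)$

4. Predict squared residuals $\hat{r}^2 = H_r \hat{\beta}_r$

5. Compute variance of the predicted squared residuals $\sigma^2_{\hat{r}} = \mathrm{diag}(H_r \Sigma_r H_r^T)$

6. Compute prediction intervals for a desired confidence level $c$:

$$\mathrm{PI} = \hat{y} \pm \Phi(c) \sqrt{\hat{r}^2 + \sigma^2_{\hat{y}} + \sigma^2_{\hat{r}}}$$

Figure 2: Artificial dataset with true 95% intervals for noise.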

Models can have different optimal numbers of neurons, which should be validated. Using L2 regularization prevents numerical instabilities. Note that the predicted squared residuals $\hat{r}^2$ might have negative values, which are replaced by zero.
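The two stages can be put together in a compact sketch. It is an illustration under several assumptions rather than the authors' implementation: the class `ELM` and the functions `fit_elm_pi` and `predict_pi` are hypothetical names, the weighted Jackknife uses the classical delete-one row weighting, the three variance terms are summed under the square root as described in the text, and the synthetic data and hyper-parameters are arbitrary.

```python
import numpy as np
from statistics import NormalDist

class ELM:
    def __init__(self, n_inputs, n_hidden, alpha=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_inputs + 1, n_hidden))  # fixed random weights
        self.alpha = alpha

    def hidden(self, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])           # +1 bias input
        return np.tanh(Xb @ self.W)

    def fit(self, X, y):
        H = self.hidden(X)
        P = np.linalg.inv(H.T @ H + self.alpha * np.eye(H.shape[1]))
        self.beta = P @ (H.T @ y)
        r = y - H @ self.beta
        # Weighted Jackknife covariance of beta (assumed classical form).
        w = np.einsum("ij,jk,ik->i", H, P, H)
        S = H * (r / np.sqrt(1.0 - w))[:, None]
        self.Sigma = P @ (S.T @ S) @ P
        return r

    def predict(self, X):
        H = self.hidden(X)
        return H @ self.beta, np.einsum("ij,jk,ik->i", H, self.Sigma, H)

def fit_elm_pi(X, y, n_hidden=20):
    """Stage 1: output model, then a second model on squared residuals."""
    elm_y = ELM(X.shape[1], n_hidden, seed=0)
    r = elm_y.fit(X, y)
    elm_r = ELM(X.shape[1], n_hidden, seed=1)
    elm_r.fit(X, r ** 2)
    return elm_y, elm_r

def predict_pi(elm_y, elm_r, X, coverage=0.95):
    """Stage 2: predictions, variances and per-sample intervals."""
    y_hat, var_y = elm_y.predict(X)
    r2_hat, var_r = elm_r.predict(X)
    r2_hat = np.maximum(r2_hat, 0.0)                 # negative values replaced by zero
    total = np.maximum(r2_hat + var_y + var_r, 0.0)  # assumed combination of terms
    half = NormalDist().inv_cdf(0.5 + coverage / 2.0) * np.sqrt(total)
    return y_hat - half, y_hat + half

# Hypothetical heteroscedastic data for a quick check.
rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(-1.0, 1.0, size=(n, 1))
    noise = (0.1 + 0.4 * np.abs(X[:, 0])) * rng.standard_normal(n)
    return X, np.sin(2.0 * X[:, 0]) + noise

X_train, y_train = make_data(2000)
X_test, y_test = make_data(1000)
elm_y, elm_r = fit_elm_pi(X_train, y_train)
lower, upper = predict_pi(elm_y, elm_r, X_test)
picp = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"test coverage at 95% target: {picp:.3f}")
```

After `fit_elm_pi` returns, only the two sets of random weights, output weights and covariance matrices are needed for prediction, in line with the remark that the training data can be discarded after the first stage.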

3 Experimental Results

3.1 Artificial Dataset

An artificial dataset with heteroscedastic noise is shown in Figure 2. Additional tests are done on homoscedastic versions of the same dataset, with the same projection function f(x) but with input-independent normally distributed noise. All experiments used ELMs with one linear and 10 hyperbolic tangent hidden neurons, in both the output model and the squared-residual model.

Figure 3 shows the computed PI on the heteroscedastic artificial dataset at the 95% confidence level. The figure also presents the standard deviation of the predicted squared residuals $\hat{r}^2$ at 95% confidence, to show how it is affected by the amount of training data. As the amount of training data increases, PI are given more precisely by $\hat{r}^2$ and depend less on its variance $\sigma^2_{\hat{r}}$ (Figure 3, right).

Similar results are obtained for the datasets with homoscedastic noise, presented in Figure 4. Larger variance of outputs makes the prediction task harder, leading to larger errors in $\hat{y}$ (Figure 4, upper left). At the same time the variance of $\hat{r}^2$ increases (Figure 4, shaded area), and the true PI rarely go beyond their estimated boundaries. Smaller variance of noise leads to more precise PI, which still cover the true PI most of the time.

Figure 3: Estimated PI for heteroscedastic stochastic outputs. Variance of the predicted squared residuals $\hat{r}^2$ (shaded area) captures model uncertainty with less training data. Thin dashed lines are actual PI, solid line is the projection function, thick dashed line is an estimated output, and black dots are training data samples.

Figure 4: Estimated PI and their variance (shaded area) for homoscedastic stochastic outputs with different variance; more data leads to more precise PI. Thin dashed lines are actual PI, solid line is the projection function, thick dashed line is the estimated projection function, and black dots are training data samples.

Figure 5: Estimated PI and their variance (shaded area) with an insufficient amount of training data; PI are over-estimated in poorly predicted areas. Thin dash lines are actual PI, solid line is the projection function, thick dash line is estimated projection function, and black dots are training data samples.

Table 1: Real-world datasets used for comparison.

In the extreme case of a training set with only 30 samples (which is not enough for learning the correct shape of the true projection function), the predicted squared residuals $\hat{r}^2$ become unreliable. However, including their variance in the predictions compensates for the model uncertainty (see Figure 5). It sometimes leads to over-estimation of the true PI, but this is a desired property that prevents an uncertain model from predicting false highly confident outputs $\hat{y}$.

3.2 Comparison on Real World Datasets

ELM Prediction Intervals are compared on four real datasets with four other methods presented in [24]. Details of the datasets are given in Table 1. The paper uses two common metrics: Prediction Intervals Coverage Probability (PICP), which is the percentage of test samples whose outputs lie within the PI, and Normalized Mean Predicted Interval Width (NMPIW), which is the average width of PI on a test set divided by the range of the test targets. PICP should correspond to the target coverage. NMPIW shows how tight the PI are for the given task, compared to a naive approach of simply taking the full range of targets as an interval. Ideal PI have a small NMPIW with PICP equal to the target coverage.
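Both metrics follow directly from their definitions. The sketch below uses hypothetical function names `picp` and `nmpiw` and a five-sample toy example in which one interval deliberately misses its target.

```python
import numpy as np

def picp(y, lower, upper):
    """Fraction of targets that lie inside their prediction interval."""
    return float(np.mean((y >= lower) & (y <= upper)))

def nmpiw(y, lower, upper):
    """Mean interval width divided by the range of the targets."""
    return float(np.mean(upper - lower) / (np.max(y) - np.min(y)))

# Toy example: intervals of width 1; the last one is shifted past its target.
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
lower = np.array([-0.5, 0.5, 1.5, 2.5, 4.5])
upper = lower + 1.0

print(picp(y, lower, upper))   # 4 of 5 targets covered
print(nmpiw(y, lower, upper))  # width 1 over a target range of 4
```

Widening every interval raises PICP and NMPIW together, which is why the two numbers are best read as a pair.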

The two measures PICP and NMPIW are inter-dependent, as increasing PI width also increases the coverage. The comparison work [24] proposed a combined measure to replace PICP and NMPIW, but it is subjective due to two arbitrary hyper-parameters. This paper instead presents PICP and NMPIW on the same plot.

Table 2: Experimental results of ELM Prediction Intervals.

The ELM PI method proposed in the paper is compared to four other methods of computing PI for neural networks. The Delta method [9] linearizes a neural network model around a set of parameters, then applies an asymptotic theory to construct the PI. An extension of the Delta method to heteroscedastic noise is available [11], although still limited due to linearization. Bayesian learning of neural network weights allows for direct derivation of variance for particular predicted values [28], but at a very high computational cost. The Bootstrap method is directly applicable to any machine learning method including neural networks, although caution should be taken in selecting bootstrap parameters to make the method resilient to heteroscedastic noise [10]. Finally, the Lower Upper Bound Estimation (LUBE) method proposed by [24] uses two additional outputs in a neural network to predict lower and upper PI, training the network with a custom cost function that includes both PICP and NMPIW.

The experimental setup uses an L1 regularized ELM model [29] for automatic model structure selection on relatively small datasets, implemented in the HP-ELM toolbox [1]. The datasets are randomly split into 70% training and 30% test samples, and median results over 30 initializations are reported. Numerical experimental results are given in Table 2; comparison numbers for other methods are available in the corresponding paper [24]. Runtime is reported for a 1.4GHz dual-core laptop.

Performance of the methods is shown as points in NMPIW/PICP coordinates, presented in Figure 6. An ideal method would be at the left edge of the dashed line (low NMPIW with precise PICP). As shown in the figure, the ELM PI method performs better on the Steam pressure dataset, a little worse on the Plasma beta-carotene dataset, and about average on the other two.

A further analysis shows possible reasons for the good performance on Steam pressure, and the bad one on Plasma beta-carotene. The analysis compares against uniform PI using the same ELM predictions for a dataset. Such PI estimate homoscedastic noise correctly, but cannot learn heteroscedastic noise. Let a uniform PI grow starting from zero; as it grows, both the coverage and the interval width increase, generating many pairs of {NMPIW, PICP} points. These points are then connected by a line that represents the homoscedastic PI performance boundary. The homoscedastic PI performance boundary and ELM PI for the two datasets in question are shown in Figure 7.
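The construction of the homoscedastic performance boundary can be sketched as a sweep over interval half-widths. The function name `uniform_pi_boundary`, the number of sweep points, and the synthetic predictions are assumptions for this example.

```python
import numpy as np

def uniform_pi_boundary(y, y_hat, n_points=50):
    """{NMPIW, PICP} pairs for uniform intervals growing from zero width."""
    abs_err = np.abs(y - y_hat)
    y_range = np.max(y) - np.min(y)
    half_widths = np.linspace(0.0, abs_err.max(), n_points)
    nmpiw = 2.0 * half_widths / y_range                       # same width for all samples
    picp = np.array([np.mean(abs_err <= h) for h in half_widths])
    return nmpiw, picp

# Hypothetical targets and predictions with unit-variance errors.
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 10.0, size=500)
y_hat = y + rng.standard_normal(500)

nmpiw_curve, picp_curve = uniform_pi_boundary(y, y_hat)
print(nmpiw_curve[:3], picp_curve[:3])
```

A per-sample method whose {NMPIW, PICP} point lies above this curve covers more targets than a uniform interval of the same average width.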

Obviously, useful heteroscedastic PI must be above this boundary, but in practice they may end up below it due to poorer parameter estimation. Indeed, heteroscedastic PI need an interval width per sample while homoscedastic PI only have one interval width per dataset, which is easier to estimate precisely. As seen from Figure 7, this is the situation for ELM PI on the Plasma beta-carotene dataset, where uniform PI perform better. On Steam pressure, however, heteroscedastic PI perform better than uniform ones as they have higher coverage with the same average width. Another possible reason for the difference in performance is that the Plasma beta-carotene dataset has homoscedastic noise, while the Steam pressure dataset actually has heteroscedastic noise (or heteroscedastic stochastic outputs), so heteroscedastic PI provide the most benefit when computed on the latter dataset.

Figure 6: Comparison of the ELM PI method (filled star) with four other methods from [24]. Best performing methods have low NMPIW and the target coverage (points close to the upper left corner).

Figure 7: Comparison of ELM PI (black marker) with uniform PI of varying width (solid line). Heteroscedastic ELM PI perform better on the Steam pressure dataset, while uniform PI are enough for the Plasma beta-carotene dataset.

4 Minimizing False Positives on a Large Real World Dataset

This experiment uses PI to minimize the amount of false positive predictions on a large classification task. Note that the proposed PI methodology applies equally well to regression, and monotonic classification tasks are handled even better using purposely developed implementations of ELM [48].

A 4,000,000-sample dataset of pixel colors for skin/non-skin classification is created from the FaceSkin Images dataset [35]. The inputs are colors of the target pixel and its 7 × 7 neighbors, with 7 × 7 × 3 (RGB) = 147 input features in total, and the outputs are +1 for skin pixels and -1 for non-skin ones. The dataset uses photos of various people under different lighting conditions, without any pre-processing. True skin masks are created manually and are highly accurate. Half of the dataset is used for training, and the other half for test.

The applied ELM model uses 147 linear + 200 sigmoid neurons. Predictions of ELM are real values, which are turned into classes by taking their sign. Due to a simple model and input features (that are not tailored for image processing) the performance is average at about 87% accuracy. The goal of the experiment is to check whether the per-sample PI can be used to significantly improve the accuracy at a cost of coverage, compared to per-dataset PI computed by MSE.

To trade coverage for precision, a threshold is introduced. ELM predictions with an absolute value less than the threshold are ignored. The threshold value corresponding to the desired coverage percentage is found by scalar optimization methods. For per-sample PI, the threshold is multiplied by the value of the corresponding PI for each prediction.
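The thresholding scheme can be sketched as ranking predictions by a confidence score and keeping the top fraction. Note one substitution: instead of a scalar optimization, the sketch finds the threshold with an empirical quantile, which gives the same kind of coverage-matched cut on a fixed set of predictions. The function name `accuracy_at_coverage`, the synthetic labels, and the stand-in interval widths are hypothetical.

```python
import numpy as np

def accuracy_at_coverage(y_true, y_pred, score, coverage):
    """Accuracy on the `coverage` fraction of samples with the highest score."""
    threshold = np.quantile(score, 1.0 - coverage)
    keep = score >= threshold
    return float(np.mean(np.sign(y_pred[keep]) == y_true[keep]))

# Synthetic +1/-1 task where half of the samples are much noisier than the rest.
rng = np.random.default_rng(0)
n = 20000
y_true = rng.choice([-1.0, 1.0], size=n)
noise_std = rng.choice([0.2, 2.0], size=n)
y_pred = 0.5 * y_true + noise_std * rng.standard_normal(n)
pi_width = 2.0 * 1.96 * noise_std     # stand-in for estimated per-sample PI width

# Dataset-wide threshold: confidence is |prediction| alone.
acc_uniform = accuracy_at_coverage(y_true, y_pred, np.abs(y_pred), 0.10)
# Per-sample threshold: |prediction| >= t * PI, i.e. score = |prediction| / PI.
acc_per_sample = accuracy_at_coverage(y_true, y_pred, np.abs(y_pred) / pi_width, 0.10)
print(f"uniform: {acc_uniform:.3f}, per-sample: {acc_per_sample:.3f}")
```

In this toy setting the dataset-wide threshold mostly keeps large but noisy predictions, while the per-sample score favors predictions that are large relative to their own uncertainty.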

The results are shown in Figure 8. Here, an ELM model with a total of 347 hidden neurons is trained on a dataset with two million samples. The per-sample PI improve the true positive rate slightly. More importantly, they reach almost zero false positives at 3% coverage, and exactly zero at 1%. In contrast to the proposed method, uniform PI computed with MSE cannot achieve zero false positives. Although one percent of coverage seems very little, it represents 20,000 test samples for that dataset, and it is a surprising achievement for a simple ELM model that is not optimized for False Positives reduction as in custom applications [2]. A specifically designed model, or an ensemble of multiple models, could achieve zero False Positives with a larger coverage, which would be a significant result for practical use of ELM, and of Machine Learning algorithms in general.

Figure 8: True Positive versus False Positive rate for the most confident part of the predictions (depicted by percentage) for a MSE-based threshold (dash line), and sample-specific threshold based on PI (solid line). Per-sample PI give almost zero False Positives for 3% best predictions, and exactly zero for 1% best. True Positives rate is overall higher than for an MSE-based threshold.

4.1 Runtime Analysis

The runtime of per-sample PI is examined on the pixel classification dataset explained above. The experiments are run on a desktop machine with a 4-core Intel Skylake CPU, using an efficient ELM toolbox from [1]. With 2,000,000 training samples and 347 hidden neurons, training an ELM takes 12 seconds (for either of the two models). Computing each of the two covariance matrices with the weighted Jackknife method takes 25 seconds, or only about twice as long as training an ELM itself. Test predictions take 8 seconds to compute, and test per-sample PI take 32 seconds. In total, prediction intervals increase the ELM runtime by a constant factor of about 5.

Runtime on the real-world datasets is not directly comparable with the other methods as they are run on different machines, but it is the same order of magnitude as Bootstrap, an order of magnitude faster than Delta or Bayesian methods, and an order of magnitude slower than the LUBE method. Replacing the L1 regularized ELM with a standard ELM reduces the runtime to the level of the LUBE method; however, it degrades the results on small datasets with a few hundred samples. Extremely large datasets that do not need regularization benefit from the faster run speed.

5 Conclusion

The paper proposed a method of computing per-sample prediction intervals for Extreme Learning Machines. It successfully evaluates the variance of heteroscedastic stochastic outputs, using only ELM models and the weighted Jackknife method. The proposed framework also works well for homoscedastic outputs, making the method applicable on a general level. ELM PI is comparable to other methods of computing PI in neural networks on small datasets, while retaining very fast runtimes and scalability for Big Data.

On a real dataset, the method has been shown to allow for better precision and a lower False Positives rate. Heteroscedastic PI perform similarly to uniform PI from Mean Squared Error on 50%-70% of dataset samples, but they make a huge difference on the most confidently predicted 1%-10% of samples. For these samples, the proposed PI achieved a zero False Positives rate even with a basic ELM model, which is an extremely useful feature in many practical applications. The runtime is comparable to the runtime of an ELM itself, which makes the method feasible for large datasets of Big Data problems.

ELM PI can be easily extended to non-symmetric PI by using two ELM models in the second stage for predicting upper and lower boundaries separately. An ensemble of ELMs may increase the coverage for zero False Positives data predictions. These extensions will be examined and evaluated in future works on this topic.

References

[1] Anton Akusok, Kaj-Mikael Björk, Yoan Miche, and Amaury Lendasse. High-Performance Extreme Learning Machines: A Complete Toolbox for Big Data Applications. IEEE Access, 3:1011–1025, July 2015.

[2] Anton Akusok, Yoan Miche, Jozsef Hegedus, Rui Nian, and Amaury Lendasse. A Two-Stage Methodology Using K-NN and False-Positive Minimizing ELM for Nominal Data Classification. Cognitive Computation, 6(3):432–445, March 2014.

[3] Anton Akusok, Yoan Miche, Juha Karhunen, Kaj-Mikael Björk, Rui Nian, and Amaury Lendasse. Arbitrary Category Classification of Websites Based on Image Content. IEEE Computational Intelligence Magazine, 10(2):30–41, May 2015.

[4] Anton Akusok, David Veganzones, Yoan Miche, Kaj-Mikael Björk, Philippe du Jardin, Eric Séverin, and Amaury Lendasse. MD-ELM: Originally Mislabeled Samples Detection using OP-ELM Model. Neurocomputing, 159:242–250, July 2015.

[5] H Asai, S Tanaka, and K Uegima. Linear regression analysis with fuzzy model. IEEE Transactions on Systems, Man, and Cybernetics, 12(6):903–907, 1982.

[6] Rana Aamir Raza Ashfaq, Xi-Zhao Wang, Joshua Zhexue Huang, Haider Abbas, and Yu-Lin He. Fuzziness based semi-supervised learning approach for intrusion detection system. Information Sciences, 378:484–497, February 2017.

[7] Christopher M Bishop. Pattern Recognition and Machine Learning, volume 4 of Information science and statistics. Springer Science+Business Media, Singapore, 2006.

[8] Yarui Chen, Jucheng Yang, Chao Wang, and DongSun Park. Variational Bayesian extreme learning machine. Neural Computing and Applications, 27(1):185–196, 2016.

[9] G. Chryssolouris, M. Lee, and A. Ramsey. Confidence interval prediction for neural network models. IEEE Transactions on Neural Networks, 7(1):229–232, January 1996.

[10] Russell Davidson and Emmanuel Flachaire. The wild bootstrap, tamed at last. Journal of Econometrics, 146(1):162–169, September 2008.

[11] A. A. Ding and Xiali He. Backpropagation of pseudo-errors: Neural networks that are adaptive to heterogeneous noise. IEEE Transactions on Neural Networks, 14(2):253–262, March 2003.

[12] E. Soria-Olivas, J. Gomez-Sanchis, J. D. Martin, J. Vila-Frances, M. Martinez, J. R. Magdalena, and A. J. Serrano. BELM: Bayesian Extreme Learning Machine. IEEE Transactions on Neural Networks, 22(3):505–509, March 2011.

[13] Emmanuel Flachaire. Bootstrapping heteroskedastic regression models: Wild bootstrap vs. pairs bootstrap. 2nd CSDA Special Issue on Computational Econometrics, 49(2):361–376, April 2005.

[14] R Guidorzi and R Rossi. Identification of a power plant from normal operating records. Automatic Control Theory and Applications, 2(3):63–67, 1974.

[15] Yu-Lin He, Xi-Zhao Wang, and Joshua Zhexue Huang. Fuzzy Nonlinear Regression Analysis Using a Random Weight Network. Inf. Sci., 364(C):222–240, October 2016.

[16] Paul S. Horn, Amadeo J. Pesce, and Bradley E. Copeland. A robust approach to reference interval estimation and evaluation. Clinical Chemistry, 44(3):622–631, March 1998.

[17] Guang-Bin Huang, Zuo Bai, L.L.C. Kasun, and Chi Man Vong. Local Receptive Fields Based Extreme Learning Machine. IEEE Computational Intelligence Magazine, 10(2):18–29, May 2015.

[18] Guang-Bin Huang, Lei Chen, and Chee-Kheong Siew. Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Transactions on Neural Networks, 17(4):879–892, July 2006.

[19] Guang-Bin Huang, Hongming Zhou, Xiaojian Ding, and Rui Zhang. Extreme learning machine for regression and multiclass classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(2):513–529, April 2012.

[20] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neural Networks Selected Papers from Symposium on Neural Networks, 70(1–3):489–501, December 2006.

[21] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: A new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference On, volume 2, pages 985–990, 25-29 July 2004.

[22] J. Hegedus, Y. Miche, A. Ilin, and A. Lendasse. Methodology for Behavioral-based Malware Analysis and Detection Using Random Projections and K-Nearest Neighbors Classifiers. In 2011 Seventh International Conference on Computational Intelligence and Security, pages 1016–1023, 3-4 Dec. 2011.

[23] Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis, volume 5. Prentice Hall, Upper Saddle River, NJ, 2002.

[24] A. Khosravi, S. Nahavandi, D. Creighton, and A. F. Atiya. Lower Upper Bound Estimation Method for Construction of Neural Network-Based Prediction Intervals. IEEE Transactions on Neural Networks, 22(3):337–346, March 2011.

[25] Amaury Lendasse, Vong Chi Man, Yoan Miche, and Guang-Bin Huang. Advances in extreme learning machines (ELM2014). Neurocomputing, 174, Part A:1–3, 2016.

[26] Bingqing Lin, Qihua Wang, Jun Zhang, and Zhen Pang. Stable prediction in high-dimensional linear models. Statistics and Computing, 27(5):1401–1412, September 2017.

[27] Michel Loève. Probability Theory; Foundations, Random Sequences. D. Van Nostrand Company, New York, 1955.

[28] David J. C. MacKay. The Evidence Framework Applied to Classification Networks. Neural Computation, 4(5):720–736, September 1992.

[29] Yoan Miche, Antti Sorjamaa, Patrick Bas, Olli Simula, Christian Jutten, and Amaury Lendasse. OP-ELM: Optimally-Pruned Extreme Learning Machine. IEEE Transactions on Neural Networks, 21(1):158–162, January 2010.

[30] Yoan Miche, Mark van Heeswijk, Patrick Bas, Olli Simula, and Amaury Lendasse. TROP-ELM: A double-regularized ELM using LARS and Tikhonov regularization. Advances in Extreme Learning Machine: Theory and Applications Biological Inspired Systems. Computational and Ambient Intelligence Selected papers of the 10th International Work-Conference on Artificial Neural Networks (IWANN2009), 74(16):2413–2421, September 2011.

[31] David W. Nierenberg, Therese A. Stukel, John A. Baron, Bradley J. Dain, and E. Robert Greenberg. Determinants of Plasma Levels of Beta-Carotene and Retinol. American Journal of Epidemiology, 130(3):511–521, September 1989.

[32] David A. Nix and Andreas S. Weigend. Learning Local Error Bars for Nonlinear Regression. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 489–496. MIT Press, 1995.

[33] Darko Pevec and Igor Kononenko. Input dependent prediction intervals for supervised regression. Intelligent Data Analysis, 18(5):873–887, October 2014.

[34] Darko Pevec and Igor Kononenko. Prediction intervals in supervised learning for model evaluation and discrimination. Applied Intelligence, 42(4):790–804, 2015.

[35] S. L. Phung, A. Bouzerdoum, and D. Chai. Skin segmentation using color pixel classification: Analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1):148–154, January 2005.

[36] C. Radhakrishna Rao and Sujit Kumar Mitra. Generalized inverse of a matrix and its applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics, pages 601–620, Berkeley, CA, 1972. University of California Press.

[37] Zhigen Shang and Jianqiang He. Confidence-weighted extreme learning machine for regression problems. Neurocomputing, 148:544–550, January 2015.

[38] Jun Shao and C. F. J. Wu. Heteroscedasticity-Robustness of Jackknife Variance Estimators in Linear Models. The Annals of Statistics, 15(4):1563–1579, 1987.

[39] Dušan Sovilj, Emil Eirola, Yoan Miche, Kaj-Mikael Björk, Rui Nian, Anton Akusok, and Amaury Lendasse. Extreme learning machine for missing data using multiple imputations. Neurocomputing, 174, Part A:220–231, January 2016.

[40] C. Swaney, A. Akusok, K.-M. Björk, Y. Miche, and A. Lendasse. Efficient Skin Segmentation via Neural Networks: HP-ELM and BD-SOM. INNS Conference on Big Data 2015 Program San Francisco, CA, USA 8-10 August 2015, 53:400–409, January 2015.

[41] Maite Termenon, Manuel Graña, Alexandre Savio, Anton Akusok, Yoan Miche, Kaj-Mikael Björk, and Amaury Lendasse. Brain MRI morphological patterns extraction tool based on Extreme Learning Machine and majority vote classification. Neurocomputing, 174, Part A:344–351, 2016.

[42] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 5:1035–1038, 1963.

[43] X. Z. Wang, H. J. Xing, Y. Li, Q. Hua, C. R. Dong, and W. Pedrycz. A Study on Relationship Between Generalization Abilities and Fuzziness of Base Classifiers in Ensemble Learning. IEEE Transactions on Fuzzy Systems, 23(5):1638–1654, October 2015.

[44] X. Z. Wang, T. Zhang, and R. Wang. Noniterative Deep Learning: Incorporating Restricted Boltzmann Machine Into Multilayer Random Weight Neural Networks. IEEE Transactions on Systems, Man, and Cybernetics: Systems, PP(99):1–10, 2017.

[45] C. F. J. Wu. Jackknife, Bootstrap and Other Resampling Methods in Regression Analysis. Ann. Statist., 14(4):1261–1295, December 1986.

[46] I.-C. Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.

[47] Z. Huang, Y. Yu, J. Gu, and H. Liu. An Efficient Method for Traffic Sign Recognition Based on Extreme Learning Machine. IEEE Transactions on Cybernetics, 47(4):920–933, April 2017.

[48] Hong Zhu, Eric C.C. Tsang, Xi-Zhao Wang, and Rana Aamir Raza Ashfaq. Monotonic classification extreme learning machine. Neurocomputing, 225(Supplement C):205–213, 2017.