Optimal Uncertainty-guided Neural Network Training

2019·Arxiv

Abstract

Neural network (NN)-based direct uncertainty quantification (UQ) methods have achieved state-of-the-art performance since the introduction of the first such method, the lower-upper-bound estimation (LUBE) method. However, training with the currently available cost functions for uncertainty-guided NNs does not always converge, and not all converged NNs generate optimized prediction intervals (PIs). Moreover, several groups have proposed different quality criteria for PIs, which raises a question about their relative effectiveness. Most of the existing cost functions for uncertainty-guided NN training are not customizable, and the convergence of training is uncertain. Therefore, in this paper, we propose a highly customizable smooth cost function for developing NNs that construct optimal PIs. The optimized average width of PIs, the PI-failure distances, and the PI coverage probability (PICP) are computed for the test dataset. The performance of the proposed method is examined on wind power generation and electricity demand data. Results show that the proposed method reduces the variation in the quality of PIs, accelerates training, and improves the convergence probability from 99.2% to 99.8%.

Keywords: Neural Network, Wind Power, Prediction Interval, Uncertainty Quantification, Cost Function.

1. Introduction

All natural quantities have some uncertainty. A quantity may vary slightly or greatly under the same circumstances. The variance of the quantity under a given circumstance is the level of uncertainty for that circumstance. The level of uncertainty is heteroscedastic: the predictability of the same quantity can differ based on circumstances. Traditional point prediction systems predict the value that is most probable for the corresponding input combination. The actual value may differ from the prediction slightly or greatly based on circumstances [1]. For instance, electricity demand at off-peak and full-peak hours is highly predictable, but the demand at transition times may vary largely from one day to another [2, 3]. The difference between the prediction and the actual value may be caused by the modeling error or by the inherent randomness of the system [4, 5]. The inclusion of some inputs, such as the current temperature and calendar information, may reduce the uncertainty of predictions. However, some portion of the uncertainty can be random and cannot be predicted based on existing features. The uncertainty can also be asymmetrically heteroscedastic, and a point prediction with a certain error possibility cannot provide adequate information to the user [6, 7].

Probabilistic forecasts, such as an uncertainty bound, are also popular in decision-making [8, 9]. However, a single uncertainty bound is unable to represent the level of uncertainty. Multiple uncertainty bounds can be applied to quantify uncertainties. The PI is a recognized UQ method that applies an upper and a lower bound to quantify the level of uncertainty. Probabilistic forecasts such as prediction intervals (PIs) with a certain coverage probability are more appropriate for understanding the uncertain condition. Fig. 1 presents the uncertainty captured by PIs with 99% coverage probability. The probability density function changes from sample to sample [10, 11]. The width of the PI varies from sample to sample based on the corresponding uncertainty. Decision-makers get the most probable regions of targets from PIs generated by NNs, even for an asymmetric and heteroscedastic system [12].

An NN is commonly regarded as a black box by its end-user. It can provide state-of-the-art performance with proper training. The designer of an NN chooses the NN size, the activation function and the initial weights. Initial weights can be random or pre-defined. Proper training provides an optimal selection of the weights on interconnects and of the biases. The optimal NN output is a weighted sum of inputs and functions of inputs. Weight optimization is performed through a reward-based system. The reward is calculated in each cycle from the performance of the NN through a cost function. Therefore, a cost function needs to be designed considering the purpose, quality criteria, critical situations, and the convergence of optimization [13].

This paper proposes an optimal PI construction technique considering different aspects of recently proposed NN-based direct PI construction techniques. The direct construction of PIs from the NN results in a sharp PI for any type of probability distribution, such as skewed-Gaussian, log-normal and multimodal. Several direct PI construction techniques have been proposed in the literature. The relative performance of these techniques is questionable and should be comprehensively investigated. Every new algorithm is claimed to be the best based on the data analysis in its proposal. However, the result may vary with different datasets. Therefore, we discuss the philosophy of developing cost functions and provide a novel one. The proposed cost function combines the important philosophies of recently proposed NN-based direct PI computation methods. The paper presents a rigorous performance analysis for the wind power generation and electricity demand data. The method is also applicable to the UQ of several other datasets, such as electricity prices and other renewable generations. The improvements in the convergence for electricity demand, hydro and solar generations are also analyzed. Moreover, weather, geographical positions, and various human-made events can be considered as inputs of the NN. The effect of many events may not be expressible through mathematical equations. NNs can find those hidden relations with reward-based training.

Figure 1: Importance of the uncertainty quantification. The point prediction, presented by the red line with a constant error possibility (the root mean square error), cannot represent the heteroscedastic uncertainty. PIs, represented by green lines, become narrow in the less uncertain regions and wide in the more uncertain regions. Therefore, PIs can represent heteroscedastic uncertainty.

The paper is organized as follows. Section 2 presents the basics of uncertainty quantification and the advantage of the NN-based direct uncertainty quantification. Section 3 presents the NN-based direct PI construction methods. Section 4 presents the proposed PI construction method. Section 5 reports the simulation results and performance metrics. Section 6 is the concluding section.

2. Uncertainty and its Quantification

2.1. Increased Uncertainty in Power Grid

All real-world events consist of sub-events, among which many are random. However, their combined effect can be interval predictable or even deterministic. When a coin is tossed a single time, the probability of getting a head or a tail is equal. However, the outcome of one hundred tosses of a coin is quite predictable. There is a 96.5% chance of getting 40 to 60 heads and a 72.88% chance of getting 45 to 55 heads. Therefore, a 20% region of the output range contains about 96-97% of the probability density, and that outcome is interval predictable. The outcome of ten thousand tosses is more predictable. There is a 95.56% chance of getting 4900 to 5100 heads. Therefore, a 2% region of the output range contains about 95-96% of the probability density and it is point predictable with 1-2% root mean squared error (RMSE) [14].
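The coin-toss probabilities quoted above follow from the binomial distribution and can be checked exactly. The sketch below is only an illustration; the function name `prob_heads_between` is hypothetical and not part of the paper.

```python
from math import comb


def prob_heads_between(n, lo, hi):
    """Exact probability that a fair coin shows lo..hi heads (inclusive) in n tosses."""
    favourable = sum(comb(n, k) for k in range(lo, hi + 1))
    # Python integers are arbitrary precision, so this ratio is exact before
    # it is rounded to a float.
    return favourable / 2 ** n


p_100_wide = prob_heads_between(100, 40, 60)
p_100_narrow = prob_heads_between(100, 45, 55)
p_10000 = prob_heads_between(10_000, 4900, 5100)

print(f"100 tosses, 40-60 heads:        {p_100_wide:.4f}")
print(f"100 tosses, 45-55 heads:        {p_100_narrow:.4f}")
print(f"10000 tosses, 4900-5100 heads:  {p_10000:.4f}")
```

The relative width of the interval that holds most of the probability shrinks as the number of independent sub-events grows, which is the sense in which a large aggregate becomes point predictable.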

A grid is connected to millions of electrical appliances. Whether individual equipment is consuming electricity or not is difficult to predict. However, the total electricity consumption of a large grid is point predictable. A fair coin has an equal probability of head or tail but an appliance may have a 10% probability of consuming electricity and that probability varies depending on time or weather or any other conditions. Therefore, the total power consumption of a large grid is predictable and depends on major events, such as weather, time, vacation, sports, etc.

Let us consider two events: the percentage of heads in tossing 1) five coins, 2) ten thousand coins. When the first event has a higher weight than the second event on a quantity, the quantity becomes highly uncertain. The grid also contains elements of different predictability. Wind power generation is highly random. The installation of large-scale wind power plants has made the overall generation more unpredictable. The large-scale introduction of electric vehicles has made the overall consumption more unpredictable.

2.2. Uncertainty Quantification

All systems have some inherent randomness, known as aleatory uncertainty. The output value varies slightly or greatly for the same input combination. The other type of uncertainty is epistemic or subjective, which can be captured by precise modeling [15, 16]. This uncertainty can cause significant prediction error when several secondary or tertiary effects are overlooked during the modeling. The future value of the parameter or the target ($t_j$) is represented as:

$$t_j = y_j + \epsilon_j \tag{1}$$

where $y_j$ is the true regression mean, $\epsilon_j$ is the zero-expectation error signal, and $j$ ($1 \le j \le n$) is the sample number. Therefore, the total uncertainty is represented as follows:

$$t_j - \hat{y}_j = [y_j - \hat{y}_j] + \epsilon_j \tag{2}$$

where $\hat{y}_j$ is the model estimate of the true regression mean for the $j$th sample. When the two terms in (2) are independent, the total variance associated with the model outcome is represented as:

$$\sigma_j^2 = \sigma_{\hat{y}_j}^2 + \sigma_{\epsilon_j}^2 \tag{3}$$

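Under a Gaussian assumption, the traditional PI of the $j$th sample is constructed around the mean as follows:

$$PI_j = [\mu_j - z\,\sigma_j,\ \mu_j + z\,\sigma_j] \tag{4}$$

Here, $\mu_j$ is the mean and $z$ is the target coverage probability parameter. The value of $z$ is 1.15, 1.64 and 1.96 respectively for 75%, 90% and 95% theoretical coverage [18]. Therefore, the width of the PI of the $j$th sample is as follows:

$$PIW_j = 2\,z\,\sigma_j \tag{5}$$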

The distribution of uncertainties may not be purely Gaussian for every combination of inputs. Some distributions can be log-normal, skewed Gaussian, or multimodal. Therefore, PIs constructed through the Gaussian assumption fail to maintain a narrow width and the required coverage simultaneously.
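The coverage parameter values quoted above (1.15, 1.64 and 1.96) are the two-sided quantiles of the standard normal distribution. The following minimal sketch, using only the Python standard library, shows where they come from and how a Gaussian-assumption PI is built from a mean and a standard deviation. The function names are illustrative, not taken from the paper.

```python
from statistics import NormalDist


def coverage_multiplier(coverage):
    """Standard-normal multiplier for a central interval with the given coverage."""
    # A central interval with coverage c leaves (1 - c) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)


def gaussian_pi(mean, sigma, coverage):
    """Symmetric PI around the mean under a Gaussian assumption."""
    z = coverage_multiplier(coverage)
    return mean - z * sigma, mean + z * sigma


for coverage in (0.75, 0.90, 0.95):
    print(f"coverage {coverage:.0%}: multiplier {coverage_multiplier(coverage):.2f}")

lower, upper = gaussian_pi(mean=10.0, sigma=2.0, coverage=0.95)
print(f"95% PI for mean 10 and sigma 2: [{lower:.2f}, {upper:.2f}]")
```

Because the interval is symmetric by construction, it cannot follow a skewed or multimodal distribution, which is the limitation the paragraph above points out.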

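A relatively smarter approach to constructing PIs is to consider the cumulative distribution function. Conditional probability functions can construct an interval of $(1-\alpha)$ confidence level. The upper bound ($\overline{y}_j$) is a value greater than a $(1-\alpha/2)$ portion of the probability density function. Therefore, $\overline{y}_j$ can be represented as [19]:

$$P_j(t_j \le \overline{y}_j) = 1 - \alpha/2 \tag{6}$$

where $P_j(\text{condition})$ is the probability function for the target at the $j$th sample ($t_j$). The cumulative probability density function (CP) can represent the relation as follows:

$$CP_j(\overline{y}_j) = 1 - \alpha/2 \tag{7}$$

Taking the inverse:

$$\overline{y}_j = CP_j^{-1}(1 - \alpha/2) \tag{8}$$

The lower bound ($\underline{y}_j$) and the upper bound ($\overline{y}_j$) form the PI. Therefore, the PI ($=[\underline{y}_j, \overline{y}_j]$) becomes as follows:

$$PI_j = [CP_j^{-1}(\alpha/2),\ CP_j^{-1}(1 - \alpha/2)] \tag{9}$$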

The direct NN-based PI computation techniques construct an optimal PI empirically without considering any theoretical probability distribution.

Figure 2: The advantage of the NN-based direct PI construction over traditional and conditional PI construction techniques for a non-Gaussian probability distribution. The NN-optimization technique finds an optimal PI for any arbitrary probability distribution.

In Fig. 2, the traditional and the conditional PIs are computed through the Gaussian assumption and Eq. (9) respectively, and the LUBE NN is directly trained to optimize the cost function. The conditional PI is closer to optimal than the traditional PI. The NN-based direct PI construction moves the upper and lower bounds slightly and evaluates the effect during the training to achieve an optimal PI at the required PI coverage probability (PICP).
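The difference between the traditional (Gaussian-assumption) PI and the conditional (inverse cumulative distribution) PI can be illustrated on a skewed sample. The sketch below is an assumption-laden toy: it draws log-normal samples with arbitrary parameters, uses an empirical nearest-rank quantile in place of the conditional distribution, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean, pstdev


def gaussian_interval(samples, coverage):
    """Traditional PI: mean +/- z * sigma under a Gaussian assumption."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    mu, sigma = fmean(samples), pstdev(samples)
    return mu - z * sigma, mu + z * sigma


def empirical_quantile(ordered, q):
    """Nearest-rank empirical inverse CDF on samples sorted in ascending order."""
    index = min(len(ordered) - 1, max(0, round(q * (len(ordered) - 1))))
    return ordered[index]


def conditional_interval(samples, coverage):
    """PI from the alpha/2 and 1 - alpha/2 quantiles of the empirical CDF."""
    alpha = 1.0 - coverage
    ordered = sorted(samples)
    return (empirical_quantile(ordered, alpha / 2.0),
            empirical_quantile(ordered, 1.0 - alpha / 2.0))


def empirical_coverage(samples, interval):
    lower, upper = interval
    return sum(lower <= s <= upper for s in samples) / len(samples)


rng = random.Random(0)
samples = [rng.lognormvariate(0.0, 0.75) for _ in range(20_000)]  # skewed toy data

gauss_pi = gaussian_interval(samples, 0.90)
cond_pi = conditional_interval(samples, 0.90)
for name, pi in (("Gaussian", gauss_pi), ("conditional", cond_pi)):
    print(f"{name:12s} PI=[{pi[0]:.3f}, {pi[1]:.3f}] "
          f"width={pi[1] - pi[0]:.3f} coverage={empirical_coverage(samples, pi):.3f}")
```

On this skewed sample the Gaussian interval extends below zero, where the quantity cannot occur, while the quantile-based interval stays inside the support and is narrower at the nominal coverage.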

3. Direct NN-based PI Construction Methods

Several direct NN-based methods have been proposed by different groups. The current section discusses relevant methods where a cost function is proposed or modified to overcome a limitation.

3.1. Initial LUBE Method

The lower upper bound estimation (LUBE) method is the very first method for constructing PIs through direct training [21]. The LUBE method is based on the following philosophies:

• The PI should cover samples with an equal or higher probability compared to the PI nominal coverage (PINC = $1-\alpha$).

• When the required PI coverage probability (PICP) is achieved (PICP ≥ PINC), the quality of the PI depends on the narrowness of the average width.

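The LUBE method minimizes the coverage width criterion (CWC). PICP, the PI normalized average width (PINAW), and CWC are defined as follows:

$$PICP = \frac{1}{n}\sum_{j=1}^{n} c_j \tag{10}$$

provided that,

$$c_j = \begin{cases} 1, & t_j \in [\underline{y}_j, \overline{y}_j] \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

$$PINAW = \frac{1}{nR}\sum_{j=1}^{n} (\overline{y}_j - \underline{y}_j) \tag{12}$$

$$CWC = PINAW\left(1 + \gamma(PICP)\, e^{-\eta\,(PICP - PINC)}\right) \tag{13}$$

provided that,

$$\gamma(PICP) = \begin{cases} 0, & PICP \ge PINC \\ 1, & PICP < PINC \end{cases}$$

Here, $n$ is the number of samples, $R$ is the range of the targets, and $\eta = 50$ is a hyperparameter used for penalizing low coverages. The LUBE method also proposes the structure of the NN. Fig. 4 presents the structure of a lower-upper-bound-based NN. In the LUBE NN, the input and hidden layers are shared. Therefore, the lower and upper bounds differ only in the weights of the connections from the last hidden layer to the output layer and in the output-layer biases.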
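The three LUBE quantities can be computed in a few lines. The sketch below assumes the usual LUBE definitions: PICP as the covered fraction, PINAW as the mean width divided by the target range, and a CWC in which an exponential penalty multiplies the width only when PICP falls below PINC. The function names and the toy data are illustrative.

```python
from math import exp


def picp(targets, lowers, uppers):
    """Fraction of targets that fall inside their prediction interval."""
    covered = [1 if lo <= t <= up else 0 for t, lo, up in zip(targets, lowers, uppers)]
    return sum(covered) / len(targets)


def pinaw(lowers, uppers, target_range):
    """Average interval width normalized by the range of the targets."""
    widths = [up - lo for lo, up in zip(lowers, uppers)]
    return sum(widths) / (len(widths) * target_range)


def cwc(targets, lowers, uppers, pinc, eta=50.0):
    """Coverage width criterion with a multiplicative low-coverage penalty."""
    target_range = max(targets) - min(targets)
    coverage = picp(targets, lowers, uppers)
    gamma = 0.0 if coverage >= pinc else 1.0  # the penalty is active only below PINC
    return pinaw(lowers, uppers, target_range) * (1.0 + gamma * exp(-eta * (coverage - pinc)))


# Toy data: the third target lies just below its interval.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
lowers = [0.5, 1.5, 3.2, 3.5, 4.5]
uppers = [1.5, 2.5, 4.2, 4.5, 5.5]

print("PICP :", picp(targets, lowers, uppers))
print("PINAW:", pinaw(lowers, uppers, max(targets) - min(targets)))
print("CWC at PINC = 80%:", cwc(targets, lowers, uppers, pinc=0.8))
print("CWC at PINC = 90%:", cwc(targets, lowers, uppers, pinc=0.9))
```

With PINC at or below the achieved coverage the criterion reduces to the width alone; raising PINC above it switches on the exponential penalty.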

3.2. LUBE with Independent Width and Penalty Factors

In the initial LUBE method, the width factor (PINAW) is multiplied by the penalty factor, as shown in (13). Therefore, the optimization was controversial [22, 23]. One of the concerns is the optimization at zero width (PINAW = 0), which happens frequently with the initial LUBE method. Khosravi et al. resolved most of those controversies by proposing a new CWC equation with independent width and penalty factors [24], as presented in Eq. (14).

The function is modified based on the following philosophy:

• A separate PINAW in an additive manner restricts optimization at PINAW = 0.
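The zero-width concern can be shown numerically. The sketch below compares a multiplicative CWC with an additive variant at a collapsed interval (PINAW = 0, PICP = 0). Both functional forms are assumptions made for illustration: an exponential low-coverage penalty that either multiplies the width or is added to it. The exact expressions used in [24] may differ.

```python
from math import exp


def multiplicative_cwc(pinaw, picp, pinc, eta=50.0):
    """Width multiplied by (1 + penalty): the whole cost vanishes at zero width."""
    gamma = 0.0 if picp >= pinc else 1.0
    return pinaw * (1.0 + gamma * exp(-eta * (picp - pinc)))


def additive_cwc(pinaw, picp, pinc, eta=50.0):
    """Width plus penalty: the penalty survives even when the width is zero."""
    gamma = 0.0 if picp >= pinc else 1.0
    return pinaw + gamma * exp(-eta * (picp - pinc))


collapsed = dict(pinaw=0.0, picp=0.0, pinc=0.9)  # degenerate zero-width PIs
healthy = dict(pinaw=0.3, picp=0.9, pinc=0.9)    # valid PIs at nominal coverage

print("multiplicative:", multiplicative_cwc(**collapsed), "vs", multiplicative_cwc(**healthy))
print("additive      :", additive_cwc(**collapsed), "vs", additive_cwc(**healthy))
```

Under the multiplicative form the collapsed solution scores lower than the valid one, so a minimizer is drawn towards it. Under the additive form the collapsed solution carries a very large cost.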

3.3. LUBE with Continuous Cost Function

NN training in the LUBE method often fails to converge with the traditional LUBE cost function. Therefore, the following equation was recently introduced to achieve a cost function that is continuous at PICP = PINC [25]:

provided that,

where the additional term is the PI-failure penalty function. The value of that function is called the penalty. The penalty is zero for PICP ≥ PINC. In contrast, its value increases exponentially as PICP falls further below PINC.

The concept of a continuous cost function is well known in the model-design rules of circuit simulations [26, 27]. The philosophy behind the continuous cost function is as follows:

• A continuous model is the prerequisite for the convergence of the simulation. Especially when the model is applied to optimize anything through an iterative process.

• A discontinuity in the model or its derivative may result in a very large gradient ($\Delta y/\Delta w$) during an iteration, as $\Delta y$ remains large for a very small $\Delta w$ at the point of discontinuity. Therefore, the discontinuity may potentially result in an overflow while nearby regions are mapped following the gradient.

As the most optimized point of the NN training is PICP = PINC, the step sizes are reduced near that point to achieve optimized weights with high precision [28, 29]. That means the step, or a component of it, may become very small while iterating near the most optimized point. Therefore, the cost function needs to be smooth near the minimum.

3.4. Can Wan’s Interval Score-based Cost Function

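C. Wan et al. [30] consider the average coverage error (ACE), defined as follows:

$$ACE = PICP - PINC$$

where PINC is the target PI coverage, known as the PI nominal coverage. The PI width of the $j$th sample is defined as follows:

$$PIW_j = \overline{y}_j - \underline{y}_j$$

C. Wan et al. [30] also define a component named the interval score. The interval score for the $j$th sample is as follows:

$$S_j = \begin{cases} -2\alpha\,PIW_j - 4(\underline{y}_j - t_j), & t_j < \underline{y}_j \\ -2\alpha\,PIW_j, & t_j \in [\underline{y}_j, \overline{y}_j] \\ -2\alpha\,PIW_j - 4(t_j - \overline{y}_j), & t_j > \overline{y}_j \end{cases}$$

The average interval score is as follows:

$$S_{AV} = \frac{1}{n}\sum_{j=1}^{n} S_j$$

With the weighted summation of ACE and the interval score, C. Wan’s cost function is as follows:

$$\gamma\,|ACE| + \lambda\,|S_{AV}|$$

Both weights ($\gamma$ and $\lambda$) are set to one to give equal weight to the PICP calibration and the PI sharpness.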

According to our analysis, that cost function is based on the following philosophies:

• Both high and low PICPs are penalized with equal concentration to maintain a gradient throughout the input domain.

• Both the width of the interval ($PIW_j$) and the failure distance ($\underline{y}_j - t_j$ or $t_j - \overline{y}_j$) are optimized.

• The failure distance is considered with much higher priority ($2/\alpha$ times) compared to the width.
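The philosophies listed above can be made concrete with a small sketch of an interval-score-based cost. It assumes the standard interval score (a width term weighted by 2α and a miss-distance term weighted by 4), which is where the 2/α priority ratio comes from. The function names, the weight parameter names and the toy data are hypothetical.

```python
def interval_score(target, lower, upper, alpha):
    """Interval score of one sample: zero is ideal, more negative is worse."""
    score = -2.0 * alpha * (upper - lower)   # sharpness term
    if target < lower:
        score -= 4.0 * (lower - target)      # target missed below the interval
    elif target > upper:
        score -= 4.0 * (target - upper)      # target missed above the interval
    return score


def interval_score_cost(targets, lowers, uppers, alpha, w_ace=1.0, w_score=1.0):
    """Weighted sum of |ACE| and |average interval score|."""
    n = len(targets)
    coverage = sum(lo <= t <= up for t, lo, up in zip(targets, lowers, uppers)) / n
    ace = coverage - (1.0 - alpha)           # average coverage error
    mean_score = sum(interval_score(t, lo, up, alpha)
                     for t, lo, up in zip(targets, lowers, uppers)) / n
    return w_ace * abs(ace) + w_score * abs(mean_score)


print(interval_score(5.0, 4.0, 6.0, alpha=0.1))   # covered target
print(interval_score(7.0, 4.0, 6.0, alpha=0.1))   # missed by one unit
print(interval_score_cost([5.0, 7.0], [4.0, 4.0], [6.0, 6.0], alpha=0.1))
```

Because the absolute value of ACE is used, coverage above and below PINC is penalized alike, which keeps a non-zero gradient on both sides of the target coverage.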

3.5. L. G. Marín’s Deviation from Mid Interval Consideration

L. G. Marín et al. [31] consider the deviation from the mid interval and a continuous cost function. Their cost function is as follows:

They introduce the deviation from the mid interval as ||e||, which they name the error quantity, and they normalize both the error quantity and PINAW. The error quantity is defined as follows:

The function is proposed based on the following philosophy:

• The most probable region of the target may not stay near the middle of the interval in the direct PI construction method. Minimization of the deviation of the target from the center can potentially shift the most probable region near the center of PIs.

3.6. G. Zhang’s Deviation Information-based Criterion (DIC)

When the cost function only considers PICP and PINAW, the NN finds smart ways to optimize the cost function. Though the PINAW becomes much smaller, the NN often keeps a smaller width instead of covering the target in critical situations. Therefore, target PICP is achieved with much smaller PINAW. However, PI misses the target by a large distance in critical situations. The situation is illustrated with a rough drawing in Fig. 3. Intervals covering and missing the target are denoted by the green color and red color respectively. In the critical situation, the NN aims to narrow down the PI instead of covering it by increasing the width. However, the user of the prediction algorithm needs indications of sudden rises or falls with higher accuracy to manage critical situations.

G. Zhang et al. [32] tried to avoid the computationally expensive exponential cost function. They proposed the deviation information-based criterion (DIC), defined as:

provided that,

and

The cost function is modified based on the following philosophies:

• Avoiding computationally expensive exponential function.

• Bringing the most probable region of the target near the middle of the PI.

Figure 3: A rough diagram presenting NN based PIs when the optimization considers only PICP and PINAW. PIs cover 80% to 90% targets but fail to predict sharp changes. Successful and unsuccessful PIs are represented by green and red lines respectively with upper and lower bound marks.

4. Proposed Method

In this paper, a smooth and customizable cost function is proposed for the uncertainty guided NN training. Uncertainty guided NNs predict the upper bound and the lower bound. Bounds are computed with NNs without any assumption on the distribution.

From the motives of the algorithms proposed so far, we derive the following key criteria of a good cost function:

1. PICP, PINAW and PI normalized average failure distance (PINAFD) are important parameters of an ideal cost function. The consideration of the deviation from the mid-interval also results in a slightly lower failure distance but the PI becomes much wider.

2. Cost function and its derivatives need to be continuous for a better convergence profile of the NN-training.

3. Often a trained NN fails to maintain PICP ≥ PINC on the test dataset due to the slight variation between the training and test datasets. Therefore, a small coverage margin is required during the training.

4. The entire input domain needs to have a single minimum at PICP = PINC. Except at the minimum, the input domain needs to have non-zero gradients directing the optimization towards the minimum.

5. A simpler or less computationally expensive cost function is preferred. However, the simplicity of the function is not related to the quality of PIs.

6. Different users may have different preferences towards width, failure distance, and coverage penalty.

PIs may fail to cover the target during a sudden fluctuation, but the NN should try to bring the nearby bound of the PI close to the target. When the nearby bound of the PI is close to the uncovered target, the user can manage the situation with less difficulty. The proposed NN training considers the minimization of the PI normalized average failure distance (PINAFD). The expression of PINAFD is as follows:

$$PINAFD = \frac{\sum_{j=1}^{n} (1 - c_j)\,\min\left(|t_j - \underline{y}_j|,\ |t_j - \overline{y}_j|\right)}{R\left(\sum_{j=1}^{n} (1 - c_j) + \epsilon\right)}$$

where $c_j$ has the same meaning as in Eq. (10). The minimum distance of the target from the bounds is considered to be the failure distance for the corresponding sample when the target is not bounded. The total failure distance is divided by the total number of missed samples ($\sum_{j}(1 - c_j)$) and normalized by the range ($R$) to obtain a normalized average value. $\epsilon$ is a small value that avoids an undefined PINAFD value for 100% PICP. During an iteration of the training, the NN may cover all samples, resulting in $\sum_{j}(1 - c_j) = 0$. In such a situation, the value of PINAFD becomes zero divided by zero. To avoid that undefined value, a small value, 1e-10, is added to the denominator so that PINAFD is zero for 100% PICP.
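The description of PINAFD above translates directly into code. The sketch below follows the prose: the failure distance of a missed target is its distance to the nearer bound, the sum is divided by the number of missed samples and by the target range, and a small constant of 1e-10 in the denominator yields zero when every target is covered. The exact placement of the small constant inside the denominator is an assumption, and the function name and toy data are illustrative.

```python
def pinafd(targets, lowers, uppers, target_range, eps=1e-10):
    """PI normalized average failure distance over the missed samples."""
    total_distance = 0.0
    missed = 0
    for t, lo, up in zip(targets, lowers, uppers):
        if t < lo or t > up:                                  # target not bounded
            total_distance += min(abs(t - lo), abs(t - up))   # distance to nearer bound
            missed += 1
    # eps keeps the ratio defined (and equal to zero) when nothing is missed.
    return total_distance / (target_range * (missed + eps))


targets = [1.0, 2.0, 3.0, 4.0, 5.0]
lowers = [0.5, 1.5, 3.5, 3.5, 3.0]
uppers = [1.5, 2.5, 4.5, 4.5, 4.0]

# Samples 3 and 5 are missed by 0.5 and 1.0 respectively; the target range is 4.
print(pinafd(targets, lowers, uppers, target_range=4.0))

# Full coverage gives exactly zero instead of an undefined 0/0.
print(pinafd([1.0, 2.0], [0.0, 1.0], [2.0, 3.0], target_range=1.0))
```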

We formulate the proposed optimization parameter by adding a weighted PINAFD. The proposed optimization parameter is as follows:

where the three hyperparameters are the failure distance resistance parameter, the PI coverage penalty factor, and the coverage margin. Usually, the failure distance resistance parameter is set to one to give equal weight to the width and the failure distance. The coverage penalty needs to be high enough to give PICP a higher priority than a slight shrink of the average width or failure distance. Therefore, the PI coverage penalty factor is usually set to more than 200. The test PICP can be slightly lower than the training PICP due to a slight variation among datasets. The initial LUBE method strictly maintains the PICP higher than the nominal coverage. However, with a linear or a polynomial penalty, the test PICP can be slightly lower than the nominal. Therefore, a slight margin of $\alpha/50$ is kept. The variation of PICP is related to the nominal PICP. When $\alpha$ is small, the sample density near the edge of PIs is lower and the variation in PICP is also lower. With $\alpha$ = 5%, a slight margin of 1% is enough for obtaining a test PICP of 95% most of the time. Therefore, $\alpha/50$ is kept so that the margin increases with increasing $\alpha$.

The proposed optimization parameter, presented in Eq. (25), is named the Coverage Width Failure Distance Criterion (CWFDC). The proposed cost function considers the coverage (PICP), the normalized average width (PINAW), and the failure distance (FD). Besides these considerations, it is smooth and includes a small margin to withstand a slight PICP variation.
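As a rough illustration of how such a criterion can combine its three ingredients, the sketch below adds the width, a weighted failure distance, and a coverage penalty. The quadratic form of the penalty, centred at PINC plus the margin, is an assumption chosen because it is smooth and has a single minimum; it is not taken from Eq. (25). The parameter names `rho`, `beta` and `delta` are hypothetical labels for the failure distance resistance parameter, the coverage penalty factor and the coverage margin.

```python
def cwfdc_sketch(picp, pinaw, pinafd, alpha, rho=1.0, beta=1000.0, delta=None):
    """Width + weighted failure distance + smooth coverage penalty (assumed quadratic)."""
    if delta is None:
        delta = alpha / 50.0                      # coverage margin that grows with alpha
    target_coverage = (1.0 - alpha) + delta       # PINC plus the margin
    coverage_penalty = beta * (target_coverage - picp) ** 2
    return pinaw + rho * pinafd + coverage_penalty


alpha = 0.05
at_target = cwfdc_sketch(picp=0.951, pinaw=0.2, pinafd=0.05, alpha=alpha)
too_low = cwfdc_sketch(picp=0.900, pinaw=0.2, pinafd=0.05, alpha=alpha)
print(f"cost at PICP = PINC + margin: {at_target:.4f}")
print(f"cost at PICP = 90%:           {too_low:.4f}")
```

With a penalty of this kind the cost reduces to PINAW plus the weighted PINAFD once the coverage sits at its target, and a user who cares more about failure distance than width can raise `rho`.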

5. Result Evaluation

Figure 4: The structure of the NN with input-output combinations. Four recent samples and the time are applied to quantify the uncertainty on the next sample.

The wind power generation and the electricity demand samples from August 2012 to August 2019 are downloaded from the UK-grid website [33]. Four recent samples and the time of day in hours (TimeDay = hour+minutes/60) on the corresponding day are provided as the input to the NN. Fig. 4 presents the structure of the NN with input-output combinations. We apply the simulated annealing technique for NN training. NNs of different sizes and initializations are trained with different cost functions to evaluate the result. Different steps of the result evaluation are as follows:

1. Finding the optimal NN-sizes for different cost functions, different PICP, and different datasets.

2. Evaluation of PINAW, PICP, and PINAFD for NNs with optimal sizes.
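The input construction described at the start of this section (four recent samples plus TimeDay) can be sketched as follows. The sketch assumes that TimeDay is taken at the timestamp of the sample being predicted, which the text does not state explicitly. The function name and the demand values are made up for illustration.

```python
from datetime import datetime, timedelta


def build_samples(timestamps, values, n_lags=4):
    """Pair each target with the n_lags preceding samples plus the time of day."""
    inputs, targets = [], []
    for i in range(n_lags, len(values)):
        stamp = timestamps[i]                          # assumed: time of the target sample
        time_day = stamp.hour + stamp.minute / 60.0    # TimeDay = hour + minutes/60
        inputs.append(list(values[i - n_lags:i]) + [time_day])
        targets.append(values[i])
    return inputs, targets


start = datetime(2019, 7, 17, 0, 0)
timestamps = [start + timedelta(minutes=5 * k) for k in range(6)]   # 5-minute steps
demand = [30.0, 30.5, 31.0, 30.8, 31.2, 31.6]                       # made-up values

inputs, targets = build_samples(timestamps, demand)
for x, t in zip(inputs, targets):
    print(x, "->", t)
```

Each input row has five entries, matching the five NN inputs, and the NN then outputs a lower and an upper bound for the paired target.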

The NN size optimization is vital to avoid overfitting or underfitting [34]. The optimal number of neurons of any NN-based prediction system depends on both the data and the cost function. Therefore, the size of the NNs is first optimized for each cost function. Then, NNs of the optimized sizes are trained to construct PIs and to evaluate their performance on the test set.

5.1. The LUBE method

Single hidden layer NNs with different neuron numbers are trained to find an optimal NN size. The NN is trained with four random initializations and one with the lowest CWC is selected. Fig. 5(a) presents the lowest CWC values for different NN sizes and three different PINC values for wind power generation data. The number of neurons is varied between 5 and 15. The optimal NN size is found to be 9, 8, and 7 for PINC = 95%, 90%, and 80% respectively. Similarly, the optimum size of NNs is found to be 8, 8 and 6 for PINC = 95%, 90%, and 80% respectively for the electricity demand data.

Khosravi et al. [24] previously observed that the training of the LUBE method fails to converge once in twelve cases. Therefore, the training converges in roughly 91.7% of cases. Moreover, 5.2% of the converged trainings provide PIs that are much too wide or too narrow. Therefore, only acceptable PIs are considered. The performance of the LUBE method is presented as the first segment of Table 1. Five hundred NNs are trained for each of the wind power and the electricity demand data, and NNs providing logical PIs are considered. The LUBE method generates high-quality PIs on average. However, some PIs exhibit the correct PICP on the cross-validation data but provide a PICP slightly lower than PINC on the test data. This happens due to the slight variation between the test dataset and the cross-validation dataset.

Table 1: PI Optimization Performance for the 5-minutes ahead forecast on the Wind Power Generation and Electricity Demand Data of the UK grid.

Table notes: Ñ(PICP(1%)) is the median of the number of iterations to reach |PINC − PICP| < 1%. Ñ(PINAW(1%)) is the median of the number of iterations for PINAW to come within 1% of its optimized value (PINAW_Opt).

Figure 5: NN-size optimization for the uncertainty quantification of the wind power generation of the UK grid. (a) The LUBE cost function. Optimized NN-sizes are 9 for α = 5%, 8 for α = 10%, and 7 for α = 20%. (b) C. Wan’s cost function. Optimized NN-sizes are 10 for α = 5% and 9 for α = 10% and 20%. (c) L. G. Marín’s cost function. Optimized NN-sizes are 11 for α = 5% and 10 for α = 10% and 20%. (d) G. Zhang’s cost function. Optimized NN-sizes are 10 for α = 5% and 9 for α = 10% and 20%. (e) The proposed cost function. Optimized NN-size is 10 for α = 5%, 10%, and 20%.

5.2. C. Wan’s method

For each NN size, four NNs with random initializations are trained and the one with the lowest cost function value is selected. Fig. 5(b) presents those lowest cost function values. The number of neurons is varied from 5 to 15. The optimal NN size is found to be 10, 9, and 9 for PINC = 95%, 90%, and 80% respectively for the wind power data. Similarly, the optimal NN size is found to be 11, 8, and 7 for PINC = 95%, 90%, and 80% respectively for the electricity demand data.

Five hundred NNs of optimal size are trained for each of the wind power and the electricity demand data. The reported result considers the NNs that provide logical PIs. With C. Wan’s cost function, roughly 99% of the NN trainings converge and all converged NNs provide a logical PI. The performance of that method is presented as the second segment of Table 1. C. Wan’s method provides PIs of slightly higher width but reduces the average failure distance.

5.3. L. G. Marín’s method

L. G. Marín’s method considers the deviation from the mid-interval. This time, the NNs with the lowest cost function values are considered. Fig. 5(c) presents the lowest cost function values. Following the same process, the optimal NN size is found to be 11, 10, and 10 for PINC = 95%, 90%, and 80% respectively for the wind power data. The optimal NN size is found to be 11, 9, and 8 for PINC = 95%, 90%, and 80% respectively for the electricity demand data.

As this cost function and its derivatives are continuous, the convergence is higher (99.2%). The performance of the method is presented as the third segment in Table 1. As a slightly higher PICP compared to PINC is also penalized in this cost function, the average PICP becomes slightly greater compared to other methods. As Marín’s method does not optimize the failure distance, the failure distance is greater compared to C. Wan’s method. In contrast, consideration of the deviation from the mid-interval results in a lower failure distance compared to the LUBE method. Also, optimization of the deviation from the mid-interval brings the most probable regions of targets close to the middle of PIs. That results in slightly wider PIs.

5.4. G. Zhang’s method

The same simulations are performed for NNs trained using G. Zhang’s cost function. The lowest DIC values are considered for drawing Fig. 5(d). The optimal NN size is found to be 10, 9, and 9 for PINC = 95%, 90%, and 80% respectively for the wind power data. The optimal NN size is 11, 9, and 8 for PINC = 95%, 90%, and 80% respectively for the electricity demand data.

Due to the discontinuity of the cost function at PICP=PINC, the convergence of that simulation is low (96.2%). The performance of that method is presented as the fourth segment in Table 1. This function also provides a good PICP on average and a high failure distance. As the low PICP is poorly penalized and the sum of failure distance and the average width may potentially result in a wrong gradient direction, the variation in PICP is higher compared to other methods.

Figure 6: Typical representative PICP values over iterations for the proposed method and two existing methods. Only the first 100 iterations are presented. These values vary from simulation to simulation.

Figure 7: PIs of 80%, 95%, and 99% PICP with targets for (a) the electricity demand and (b) the wind power generation of the UK grid on 17/7/2019. A 5-minutes ahead UQ is performed. Two samples are drawn in each hour for better visualization. Intervals are computed through an optimally trained NN. The structure of the NN is presented in Fig. 4. Four previous samples and the time are used to compute the interval.

Figure 8: PIs of 80%, 95%, and 99% PICP with targets for (a) the electricity demand and (b) wind power generation of UK grid on 17/7/2019. A 30-minutes ahead UQ is performed. Intervals are wider compared to the 5-minutes ahead UQ.

5.5. The proposed method

NNs are trained using the proposed cost function as a part of the performance evaluation. The lowest PINAW + PINAFD values among four NNs are considered for drawing Fig. 5(e). Optimal NN size is found to be 10 for all PINC values (95%, 90%, and 80%) for the wind power generation. Optimal NN size is found to be 12, 10, and 9 respectively for 95%, 90%, and 80% PINC for the electricity demand data.

The failure distance resistance parameter is set to one to give equal weight to the width and the failure distance. The PI coverage penalty factor is set to one thousand to reduce the PICP variation among different NNs. According to observations, the PICP variation is higher for a higher $\alpha$. Therefore, the coverage margin is set to $\alpha/50$. One thousand NNs of optimal size are trained and NNs providing logical PIs are considered. 99.8% of the NN trainings converge and provide a logical PI. The probable reason for the 0.2% convergence failure can be PICP = 0 for the first thousand iterations. Although the NN training process changes weights in each iteration, the PICP remains zero and the change in PINAW + PINAFD is negligible compared to the large failure penalty at PICP = 0. Thus, the gradient becomes small, causing slow convergence.

Although the simulation does not converge in 0.2% of situations, the convergence of the proposed system is much better than that of any existing method. The simulation does not converge in 0.8% to 9% of situations with currently available methods. Fig. 6 presents PICP values over iterations. With the proposed method, the value of PICP reaches a feasible range (|PINC − PICP| < 1%) within the first thirty iterations most of the time. Usually, more than fifty iterations are required to reach such a feasible range with the initial LUBE method. Other continuous cost functions also require more than thirty iterations to reach such a feasible range. Moreover, the PICP iterates near 100% when a greater PICP (PICP > PINC) is not penalized. The variation in the finalized PICP is higher with LUBE and other continuous cost functions due to the discontinuity of the gradient near PICP = PINC. As the proposed cost function is continuous along with its high-order derivatives, the convergence is faster and has a low variance.

The performance of this method is presented as the last segment of Table 1. The PICP variation becomes much lower with the proposed method when the PI coverage penalty factor is set to 1000. The mean value of PICP remains very close to PINC and the PICP variation is close to the coverage margin. As a result, about 85% of the NNs maintain PICP > PINC. With other systems, less than 70% of the NNs maintain PICP > PINC. The proposed system has a much lower PINAW + PINAFD, as it is the performance criterion when PICP = PINC. However, the user may choose any desired failure distance resistance parameter according to their preference. Table 1 also presents the convergence of the neural network training. Both PICP and PINAW reach the neighbourhood of their optimum values in fewer iterations with the proposed cost function. A few high values can change the average greatly; therefore, we consider median values to compare convergence.

Fig. 7 presents PIs with targets for the 5-minutes-ahead prediction with the proposed NN training. Fig. 7(a) presents PIs of three different PICPs with targets for the electricity demand data. Fig. 7(b) presents PIs of three different PICPs with targets for the wind power generation data. PIs and corresponding targets are drawn on the same plot to visualize the PI of the heteroscedastic system.

Fig. 8 presents PIs with targets for the 30-minutes-ahead prediction with the proposed NN training. Fig. 8(a) presents PIs of three different PICPs with targets for the electricity demand data. Fig. 8(b) presents PIs of three different PICPs with targets for the wind power generation data.

6. Conclusion

Uncertainty is inescapable, but uncertainty-aware decisions bring higher sustainability and profitability. The NN-based LUBE PI construction method achieved state-of-the-art performance in quantifying asymmetrically heteroscedastic uncertainty in terms of narrow width and required PICP. However, NNs need to be re-trained for new types of signals, and the nonsmooth LUBE cost function often fails to produce an efficient uncertainty-guided NN. Several improvements to the LUBE method have been proposed by different researchers for optimal NN training. Each of them has some limitations in terms of convergence, understandability, smoothness, parameter insufficiency, or customizability. Therefore, a smooth optimization function is proposed. A low failure distance results in a low non-coverage penalty. Moreover, the user may prefer to minimize the failure distance. Therefore, the cost function considers coverage, width, and failure distance criteria to train NNs for the construction of PIs with higher consistency. Future research may achieve 100% training convergence.

References

[1] H. D. Kabir, A. Khosravi, M. A. Hosen, S. Nahavandi, Neural network-based uncertainty quantification: A survey of methodologies and applications, IEEE Access 6 (2018) 36218–36234.

[2] N. A. Shrivastava, A. Khosravi, B. K. Panigrahi, Prediction interval estimation of electricity prices using pso-tuned support vector machines, IEEE Transactions on Industrial Informatics 11 (2) (2015) 322–331.

[3] I. Koprinska, M. Rana, V. G. Agelidis, Correlation and instance based feature selection for electricity load forecasting, Knowledge-Based Systems 82 (2015) 29–40.

[4] Y.-Q. Zhang, X. Wan, Statistical fuzzy interval neural networks for currency exchange rate time series prediction, Applied Soft Computing 7 (4) (2007) 1149–1156.

[5] M. P. Clements, Evaluating the bank of england density forecasts of inflation, The Economic Journal 114 (498) (2004) 844–866.

[6] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, in: Advances in Neural Information Processing Systems, 2017, pp. 5580–5590.

[7] Y. Gal, Uncertainty in deep learning, University of Cambridge.

[8] S. Roy, S. B. Roy, I. N. Kar, A new design methodology of adaptive sliding mode control for a class of nonlinear systems with state dependent uncertainty bound, in: 2018 15th International Workshop on Variable Structure Systems (VSS), IEEE, 2018, pp. 414–419.

[9] N. Laptev, J. Yosinski, L. E. Li, S. Smyl, Time-series extreme event forecasting with neural networks at uber, in: International Conference on Machine Learning, no. 34, 2017, pp. 1–5.

[10] H. Quan, D. Srinivasan, A. Khosravi, Uncertainty handling using neural network-based prediction intervals for electrical load forecasting, Energy 73 (2014) 916–925.

[11] H. Quan, D. Srinivasan, A. Khosravi, Integration of renewable generation uncertainties into stochastic unit commitment considering reserve and risk: A comparative study, Energy 103 (2016) 735–745.

[12] R. Ak, Y.-F. Li, V. Vitelli, E. Zio, Adequacy assessment of a wind-integrated system using neural network-based interval predictions of wind power generation and load, International Journal of Electrical Power & Energy Systems 95 (2018) 213–226.

[13] A. Altınten, F. Ketevanlioğlu, S. Erdoğan, H. Hapoğlu, M. Alpbaz, Self-tuning pid control of jacketed batch polystyrene reactor using genetic algorithm, Chemical Engineering Journal 138 (1) (2008) 490–497.

[14] H. D. Kabir, A. Khosravi, S. Nahavandi, Partial adversarial training for neural network-based uncertainty quantification, IEEE Transactions on Emerging Topics in Computational Intelligence.

[15] B. M. Ayyub, G. J. Klir, Uncertainty modeling and analysis in engineering and the sciences, Chapman & Hall/CRC Press, 2006.

[16] M. Alam, B. Grimm, J. P. Parmigiani, Effect of incident angle on crack propagation at interfaces, Engineering Fracture Mechanics 162 (2016) 155–163.

[17] R. W. Johnson, An introduction to the bootstrap, Teaching Statistics 23 (2) (2001) 49–54.

[18] B. R. Kirkwood, J. A. Sterne, Essential medical statistics, John Wiley & Sons, 2010.

[19] P. Pinson, G. Kariniotakis, Conditional prediction intervals of wind power generation, IEEE Transactions on Power Systems 25 (4) (2010) 1845–1856.

[20] M. A. Carnero, A. Pérez, E. Ruiz, Identification of asymmetric conditional heteroscedasticity in the presence of outliers, SERIEs 7 (1) (2016) 179–201.

[21] A. Khosravi, S. Nahavandi, D. Creighton, Construction of optimal prediction intervals for load forecasting problems, Power Systems, IEEE Transactions on 25 (3) (2010) 1496–1503.

[22] P. Pinson, J. Tastu, Discussion of prediction intervals for short-term wind farm generation forecasts and combined nonparametric prediction intervals for wind power generation, IEEE Transactions on Sustainable Energy 5 (3) (2014) 1019–1020.

[23] C. Wan, Z. Xu, J. Østergaard, Z. Y. Dong, K. P. Wong, Discussion of combined nonparametric prediction intervals for wind power generation, IEEE Transactions on Sustainable Energy 5 (3) (2014) 1021–1021.

[24] A. Khosravi, S. Nahavandi, Closure to the discussion of prediction intervals for short-term wind farm generation forecasts and combined nonparametric prediction intervals for wind power generation and the discussion of combined nonparametric prediction intervals for wind power generation, Sustainable Energy, IEEE Transactions on 5 (3) (2014) 1022–1023.

[25] H. M. D. Kabir, A. Khosravi, A. Hosen, S. Nahavandi, Partial adversarial training for prediction interval, 2018 International Joint Conference on Neural Networks (IJCNN) (2018) 1–6.

[26] G. J. Coram, How to (and how not to) write a compact model in verilog-a, in: Behavioral Modeling and Simulation Conference, 2004. BMAS 2004. Proceedings of the 2004 IEEE International, IEEE, 2004, pp. 97–106.

[27] H. D. Kabir, Z. Ahmed, R. Kariyadan, L. Zhang, M. Chan, Modeling of fringe current for semiconductor-extended organic tfts, in: Electron Devices and Solid-State Circuits (EDSSC), 2016 IEEE International Conference on, IEEE, 2016, pp. 177–180.

[28] G. D. Magoulas, M. N. Vrahatis, G. S. Androulakis, Effective backpropagation training with variable stepsize, Neural networks 10 (1) (1997) 69–82.

[29] G. D. Magoulas, M. N. Vrahatis, G. S. Androulakis, Improving the convergence of the backpropagation algorithm using learning rate adaptation methods, Neural Computation 11 (7) (1999) 1769–1796.

[30] C. Wan, Z. Xu, P. Pinson, Direct interval forecasting of wind power, IEEE Transactions on Power Systems 28 (4) (2013) 4877–4878.

[31] L. G. Marín, F. Valencia, D. Sáez, Prediction interval based on type-2 fuzzy systems for wind power generation and loads in microgrid control design, in: Fuzzy Systems (FUZZ-IEEE), 2016 IEEE International Conference on, IEEE, 2016, pp. 328–335.

[32] G. Zhang, Y. Wu, K. P. Wong, Z. Xu, Z. Y. Dong, H. H.-C. Iu, An advanced approach for construction of optimal wind power prediction intervals, IEEE Transactions on Power Systems 30 (5) (2015) 2706–2715.

[33] Gridwatch, Uk national grid status (Oct. 2017). URL http://www.gridwatch.templar.co.uk/

[34] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, Journal of machine Learning research 3 (Jan) (2003) 993–1022.