Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification

2018·Arxiv

Abstract

Abstract

State-of-the-art computer codes for simulating real physical systems are often characterized by vast number of input parameters. Performing uncertainty quantification (UQ) tasks with Monte Carlo (MC) methods is almost always infeasible because of the need to perform hundreds of thousands or even millions of forward model evaluations in order to obtain convergent statistics. One, thus, tries to construct a cheap-to-evaluate surrogate model to replace the forward model solver. For systems with large numbers of input parameters, one has to deal with the curse of dimensionality - the exponential increase in the volume of the input space, as the number of parameters increases linearly. Suitable dimensionality reduction techniques are used to address the curse of dimensionality. A popular class of dimensionality reduction methods are those that attempt to recover a low dimensional representation of the high dimensional feature space. However, such methods often tend to overestimate the intrinsic dimensionality of the input feature space. In this work, we demonstrate the use of deep neural networks (DNN) to construct surrogate models for numerical simulators. We parameterize the structure of the DNN in a manner that lends the

DNN surrogate the interpretation of recovering a low dimensional nonlinear 1

manifold. The model response is a parameterized nonlinear function of the low dimensional projections of the input. We think of this low dimensional manifold as a nonlinear generalization of the notion of the active subspace. Our approach is demonstrated with a problem on uncertainty propagation in a stochastic elliptic partial differential equation (SPDE) with uncertain diffusion coefficient. We deviate from traditional formulations of the SPDE problem by not imposing a specific covariance structure on the random diffusion coefficient. Instead we attempt to solve a more challenging problem of learning a map between an arbitrary snapshot of the diffusion field and the response.

Keywords: Deep Neural Networks, dimensionality reduction, stochastic

elliptic PDE, Uncertainty quantification.

1. Introduction

With the tremendous increase in the availability of computational resources, computer codes which simulate physical systems have become highly sophisticated. Today, state-of-the-art numerical simulators are parameterized by a very large number of quantities which are used to describe material properties, initial conditions, boundary conditions, constitutive laws, etc. It is often the case, that many of the inputs to a numerical simulator are not known exactly. This raises the question - how defensible are the predictions from such numerical simulators? How do we objectively assess the effects of the uncertainties in model inputs on the quality of the model predictions? Answering such questions are at the core of the field of uncertainty quan-tification (UQ) [1, 2]. Specifically, the task of rigorously quantifying the effect of input parameter uncertainty on the model outputs is known as the forward UQ or uncertainty propagation (UP) problem.

The simplest method for tackling the UP problem is the Monte Carlo (MC) method [3, 4, 5]. The basic idea of MC is that one can compute empirical estimates of the statistics of some quantity of interest (QoI) by sampling averages. The MC method is guaranteed to converge in the limit of infinite samples. MC methods, and its advanced variants, are routinely applied to UQ tasks such as UP [6], inverse problems [7, 8], model calibration [9] and stochastic optimization [10]. The computational time to convergence of MC methods is independent of the number of the stochastic dimensions. However, the number of samples needed by MC methods, to obtain convergent statistics is large, typically being of the order of hundreds of thousands or millions. This makes MC methods unsuitable for UQ tasks involving expensive computer codes.

We typically deal with expensive computer codes, by building a cheap-to-evaluate surrogate of the response surface. To do this, a set of locations in the uncertain parameter-space are carefully selected and the forward model is evaluated at these locations. This produces a set of independent observations of the model response. The total number of such simulations to be performed is determined by one’s computational budget and desired accuracy. Because the surrogate model can be queried very cheaply, one can use it as a replacement of the original simulator and perform UQ tasks using MC techniques. Popular choices for surrogate models in the literature include, Gaussian processes [11, 12, 13, 14, 15], polynomial chaos expansions [16, 17, 18, 19], radial basis functions [20] and relevance vector machines [21]. Despite their success, these methods become intractable for problems in which the number of input stochastic dimensions is large.

In order to construct a surrogate response surface for a multivariate function with a large number of uncertain parameters, one has to overcome the phenomenon known as the curse of dimensionality, a term coined by the mathematician Richard Bellman [22]. It describes the exponential increase in the volume of a function input space even as the space dimensionality increases linearly. The implication of the curse of dimensionality is that to sufficiently explore a high dimensional space, one must visit an exponentially large number of points in that space. As a concrete example, suppose the task of approximating a surrogate model for a univariate function can be done by visiting 10 locations in the input space and evaluating the forward model at those input locations. For a bivariate function of similar lengthscale, one would need to visit roughly 10 10 = 100 points in the input space to maintain a similar level of accuracy of the constructed surrogate. Generalizing, a d-variate function requires visiting ) locations in the input space and evaluating the forward solver at those locations. Even if the forward model is inexpensive to evaluate, attempting to naively construct surrogate response surfaces for high dimensional functions is a futile task.

techniques have to be employed in order to deal with the curse of dimen-

sionality. The simplest way of doing so is to rank the input dimensions in order of their “importance”, and rejecting those input dimensions which contribute the least to the outcome of the numerical simulation. Techniques that adopt this approach include sensitivity analysis [23] and automatic relevance determination (ARD) [24]. These methods, while effective for problems with a small number of uncorrelated input dimensions, are not useful for problems involving functional uncertainties, such as stochastic partial differential equations (SPDE).

Many common dimensionality reduction techniques follow a simple idea : project the high dimensional inputs, onto a low-dimensional subspace which captures most of the information contained in the original input. In the UQ community, the most common dimensionality reduction method used is the Karhunen-Lo`eve expansion (KLE) [25, 26]. The KLE involves computing an eigendecomposition of the covariance function associated with the uncertain functional input. The eigenfunctions represent a set of orthogonal basis functions and the decay of the eigenvalues determine the set of basis functions to be retained for the purpose of constructing a low dimensional approximation of the high dimensional random field. In the machine learning (ML) community, this is more popularly known as the principal component analysis (PCA) [27, 28], whose goal is to produce a low-rank approximation of the empirical centered covariance matrix of the available input data. The result of such a computation is an orthogonal matrix, which maps a point in the high dimensional input space to a point on a low dimensional manifold, such that there is minimal reconstruction error. The obvious drawback of PCA is the fact that one is constrained to discover only linear projections. Furthermore, PCA is an unsupervised technique, which means that it only looks at samples of the input and does not consider information contained in the model outputs. As a result, PCA tends to overestimate the intrinsic dimensionality of the system. Thus, inspite of a reduction in the total number of input dimensions, the reduced representation remains very high dimensional and thereby unsuitable for surrogate model construction. One can alleviate the linearity constraint by using the kernel PCA (KPCA) [29, 30], which uses the kernel trick to discover nonlinear manifolds. Nevertheless, KPCA is also an unsupervised technique that ignores model output information.

A recent advancement in dimensionality reduction is active subspaces (AS) [31, 32, 33, 34, 35, 36]. An active subspace is a low dimensional linear manifold embedded in the input space which captures most of the model output variation. It does so by recovering an orthogonal projection matrix obtained through an eigendecomposition of an empirical covariance matrix of the model output gradients. In the absence of model output gradients (a scenario typical in engineering applications), one can estimate the projection matrix by posing it as a hyperparameter in a Gaussian process regression model and learning it from the available data through a maximum likelihood (MLE) computation [12]. While the upshot of the AS method is that one utilizes the information contained in the model outputs along with the model inputs, one is still constrained to discover only linear manifolds of the data.

In this work, we propose a systematic approach for constructing surrogate models using deep neural networks (DNNs) [37, 38, 39, 40]. Neural networks (NNs) (or artificial neural networks (ANNs)) are a class of function approximators that have shot to prominence in recent years because of breakthrough successes achieved in numerous artificial intelligence (AI) tasks such as image classification [41, 42, 43, 44, 45] and autonomous driving [46, 47]. The idea of DNNs is not new. The reason for their increased usage and popularity in recent times is due to: 1. Advancements in computer hardware leading to widespread availability of graphics processing units (GPUs) for accelerated computation; 2. Advances in stochastic optimization including techniques such as Adam [48], RMSprop [49], AdaGrad [50], AdaDelta [51] etc.; 3. Regularization techniques such as dropout [52]; and, 4. Development of easy-to-use software libraries, such as Tensorflow [53], PyTorch [54] and Theano [55].

The basic idea of DNNs is that one can represent multivariate functions through a hierarchy of features of increasing complexity. The most typical example of a DNN is a feedforward multilayer perceptron (MLP). A highly attractive property of MLPs is that, under mild assumptions on the underlying function being approximated, they are universal approximators [56]. This means that any continuous function, regardless of its complexity, can be approximated with a neural network of just one layer with a sufficient number of hidden units. DNNs tackle the curse of dimensionality through a series of nonlinear projections of the input into exploitable latent spaces. In fact, PCA can be thought of as a special case of a DNN with no hidden layers such that the latent space is recovered through an orthogonal projection of the input.

The powerful nonlinear function approximation capabilities coupled with the scalability of DNNs to high dimensions offers a very promising direction of research for the UQ community, with the potential to significantly improve upon state-of-the-art capabilities. [57] use a 3 layer convolutional DNN to learn a map between input coefficients of SPDEs to a functional of the PDE solution. [58] use a Bayesian fully convolutional encoder-decoder network to solve an image-to-image regression task mapping the random input coefficient field of an elliptic SPDE to the corresponding solution. These papers offer encouraging results for challenging problems in UQ. However, they are only applicable to tasks where input parameters and output quantities can be treated as images and they require a lot of cross validation to learn an optimal network structure. Specifically in the context of SPDEs, we are interested in learning a surrogate that can make predictions at spatial locations other than those included in the discretization of the underlying deterministic numerical solver.

The task of selecting the architecture of the network and values for hyperparameters such as regularization constants is a persistant problem in the application of DNNs. Under constraints of limited data, this task assumes added importance. In this work, we present a methodology based on MLPs where we parameterize our network in a way that lends it the interpretation of discovering a low dimensional nonlinear manifold that captures maximal variation of the model outputs. We think of this procedure as discovering a nonlinear active subspace. The projection function, which connects the high dimensional input, to the low dimensional manifold, is linked to the scalar model output through a linear transformation. We utilize a combination of Bayesian global optimization (BGO) [59] and grid search to select the best setting of the network hyperparameters and determine the appropriate structure.

This paper is organized as follows. In Sec. 2, we discuss the general setup of the problem we are dealing with. Sec. 2.1 provides a mathematical description of the UP problem. In Sec. 2.2, we discuss the task of surrogate modeling. In Sec. 2.3 through Sec. 2.6, we discuss the process of constructing a DNN surrogate model, including the parameterization of the network architecture and the optimization of the network parameters. We conclude Sec. 2 with a description of the procedure we use to select network hyperparameters. In Sec. 3, we demonstrate our methodology on a SPDE with uncertain diffusion coefficient. A novelty of our work is that we do not make any assumption on the regularity and lengthscales of the uncertain diffu-sion. Specifically, we construct a surrogate of the SPDE solver which can accurately predict the response when tested with input random fields that may not be structurally similar (in terms of smoothness and lengthscales) to samples of the input in the training dataset. This deviates from the standard formulation of this problem in the UQ literature, where a specific covariance structure is imposed on the uncertain parameter. As a result, our problem is not amenable to the application of preliminary dimensionality reduction using the KLE, thereby making it far more challenging than the traditional formulation of the problem. We wrap up this article with concluding remarks in Sec. 4.

2. Methodology

Suppose we have a computer code simulating a physical phenomena. Mathematically, we represent this simulator as a function accepts a vector of inputs could specify material properties, external loads, boundary conditions, initial conditions, etc. The output of the computer code is some scalar quantity of interest . Typically, f depends on the solution of some PDE which depends on . Furthermore, f is unknown in closed form and information about it can only be obtained by querying the simulator at feasible values of for the possibility that the output observation, y, may be a noisy estimate of the true solution is Gaussian noise. Finally, the dimensionality, D, of the input vector is large, potentially of the order of hundreds or thousands. Given a finite number of evaluations of the simulator, the task of constructing a surrogate function, ˆf, for the true response surface f becomes computationally infeasible without resorting to dimensionality reduction.

2.1. Uncertainty propagation

Suppose the inputs to the function f are not known exactly (a common scenario in numerous engineering tasks). We formalize our beliefs about using a suitable probability distribution:

Given our beliefs about , we wish to characterize the statistical properties of the output ) such as the mean:

the variance,

and the probability density,

This is formally known as the uncertainty propagation problem (UP).

2.2. Surrogate model structure

Since the function f is not known in closed form we resort to numerical approximations of the integrals in Eq. (2), (3) and (4). The easiest way to do so, would be to use the MC method. Unfortunately, as discussed earlier in Sec.(1), the MC method is computationally infeasible because of slow convergence in the number of forward model evaluations. Furthermore, information about f can only be obtained by querying the computer code at carefully selected design points. Say, we have N design points, which we collectively denote as X:

matrix, with each row representing a single sample from We evaluate the computer code for each of these N samples and obtain a potentially noisy estimate of the model output, We collectively represent all samples of the model

output as,

y is an vector in , with each element of the vector representing a sample of the output. We denote the inputs and the outputs taken together as D = (X, y). Thus, the task of building a surrogate model can be summarized as follows - given data D collected by querying the computer code at a finite number of input locations, we wish to build an approximation ˆf of the true model f. We propose a form of the surrogate model ˆf, which projects the input data onto a nonlinear low dimensional manifold, i.e., ˆ

that,

where, is the projected input corresponding to the true input x. We call the function function The link function is a generic nonlinear function of the projected inputs, . One immediately recognizes this structure as a generalization of the active subspace surrogate in [31, 12] which expresses ˆf as:

The proposed structure in Eq. 7 is capable of capturing directions which explain most of the variation in the model output. Alternatively, one also recognizes the above construction of the projection function as being the encoder section of neural network autoencoders[38]. The complete structure is posed as a Deep Neural Network (DNN) which we describe in the following section.

2.3. Structure of a feedforward Deep neural network

Neural networks (NN) are a powerful class of data-driven function approximation algorithms which represent information through a hierarchy of features. Each step in the hierarchy, beginning with the input, and ending with the final output, is known as a layer. Intermediate layers are known as hidden layers. By manipulating the number of hidden layers and the size of each hidden layer, one can learn functions of arbitrary complexity. The sizes of the input layer and output layer are fixed and determined by the dimensionality of the input and output. Fig. 1(a) shows a schematic of a NN with 2 hidden layers. Each circle in the schematic of the NN represents the fundamental unit of computation in a NN, known as the neuron. A neuron accepts one or more inputs and produces an output by performing a linear transformation followed by an element-wise nonlinear transformation. A schematic of a single neuron and the computations taking place within it are shown in Fig. 1(b). We discuss the symbols in full detail in the proceeding paragraphs.

Figure 1: 1(a)-Schematic of a neural network (NN). 1(b) - Schematic of a single neuron.

A DNN is simply a NN with a large number of hidden layers. The output of a layer is known as the activation. The activation from one layer of a DNN is used as the input to the next layer. The activation produced by the hidden layer of the DNN is given by:

where is the number of neurons in the

hidden layer. Note that is the input is a nonlinear 13

function applied element-wise on its arguments. Popular choices for clude the logistic function, the hyperbolic tangent function or the rectified linear unit (or ReLU) function [38]. The ReLU function has been utilized extensively in recent times, and has been shown to eliminate the need for an unsupervised pretraining phase while training deep architectures [60].

Figure 2: Swish activation with

However, a recent result de-

scribed in [61] demonstrates

the superior performance of

the Swish activation func-

tion defined as follows:

such that is either a con-

stant or a hyperparameter

to be learned from data. In this work, we use the Swish activation function with

The quantities , are known as the weights and biases of the network, respectively. Collectively, they are known as the parameters of the network, . The weights and biases together, fully describe the structure of the network, known as

the network architecture.

2.4. Training a deep neural network

As discussed in the previous section, ˆf is a parameterized function with parameters . Estimating reduces to the problem of minimizing a loss function ), which captures the mismatch between . For regression tasks L is typically chosen to be the mean squared error. In practice, we do not have access to the function f; only a limited set of observations, D. Suppose D is a dataset of N examples, with the example denoted as ). The training problem is cast as minimizing the mismatch between a prediction ˆ) and the correct output,

In practice, the averaging in Eq. 11 is performed over a small randomly sampled subset (or , at each iteration of the optimization procedure.

2.5. Regularized loss function

Recall that the output of the computer code at a given input location x is a potentially noisy estimate of f(x). Under the Gaussian noise assumption, the likelihood model is given by:

The unknown variance, , captures the discrepancy due to all sources of error including discretization errors, model discrepancy, noise, etc. It is easy to see that maximizing the logarithm of the conditional likelihood is equivalent to minimizing the mean squared error of the examples in the training dataset. Since DNNs are prone to overfitting [52], one resorts to penalizing the misfit function with an appropriate penalty term known as a

regularizer. This ensures that the DNN generalizes better to unseen data. 15

Popular choices for regularization include the scaled of the weights [38]. The norm penalty is known to promote sparsity in the estimator . On the other hand, the norm penalty drives the values of the weights close to 0. The elastic net regularizer introduced in [62] is a mixture between the regularizers and is known to combine the advantages of penalties (See [62]). While typically the parts in the elastic net are assigned different scaling factors, we share the scaling parameter (called the regularization constant). This choice is motivated by a need to reduce the complexity of the model selection task. Over a set consisting of M data samples the full loss function is

expressed as,

From the point of view of constrained optimization, normed weight penalties limit model complexity by shrinking the feasible region of parameters, their corresponding norm balls. There is also a Bayesian justification for the use of normed penalties on the weights. obtained by the minimization of the regularized loss function corresponds to the maximum a posteriori (MAP) estimate of , with the prior given by the chosen penalty. The penalties correspond to a Laplace and Gaussian priors on the weights while the elastic net represents a compromise between the two. In unnormalized form, the elastic net regularizer with equi-scaling of the and parts, correspond to the following prior on the weights:

2.6. Gradient computation and optimization

A DNN ˆ) is a highly complicated function of the network parameters because of the fact that it involves multiple layers of compositions of simpler functions. To perform gradient descent optimization one needs access to the gradients of the objective function. For training DNNs, this is achieved by utilizing the celebrated backpropagation algorithm [63]. In essence, the backpropagation algorithm is a recursive application of the standard chain rule. Unlike numerical differentiation schemes such as finite differences, backpropagation is exact.

Training a DNN reduces to a stochastic optimization problem with the objective function being the loss function described in Eq. (13). The most common way of solving this problem is via the stochastic gradient descent (SGD) [64] algorithm. As the name suggests, SGD is the stochastic analogue of deterministic gradient descent. The SGD algorithm produces a converging sequence of updates of the optimization variables, by making appropriately sized steps in the direction of the negative gradient of the objective function. The key idea of the SGD method, is that it approximates the negative gradient of the objective function by averaging a finite set of objective function gradient samples. This is done by independently sampling a small subset of examples, , from the full training dataset, D. The update scheme of the SGD method is:

Note that the sampling of is performed at every iteration of SGD. While the SGD algorithm is simple to implement, it is not guaranteed to perform well for complex high dimensional objective functions (as is typical for Eq. 11). While there are multiple variants of the SGD method that have demonstrated improvements over vanilla SGD, in this work, we solve Eq. (11) with the Adaptive Moments (ADAM) optimization algorithm [48]. The ADAM update scheme is as follows:

where, ) is the estimate of the objective function gradient at iteration are exponential moving average estimates of gradients and squared gradients respectively. are set to 0 and the bias introduced by this initialization is corrected by computing is a suitably small number introduced to prevent 0 denominator. and are averaging parameters which can be tuned. In practice, default values of 999, as suggested by [48] work well and we do not fiddle with these quantities.

2.7. Selecting network structure

Although various authors in the literature offer rules of thumb for selecting the number and size of DNN layers (such as those suggested in [65]), rigorous rules for the selection of these quantities do not exist. One typically resorts to extensive experimentation to arrive at a suitable network configuration. In the most naive case, the number and size of the hidden layers are hyperparameters selected using cross-validation. In this work, we are interested in learning a surrogate of the form described in Eq. (7). The function h accepts an input in a vector space of dimensions D and projects it to a vector space of dimension d, where d << D (d is to be determined through our methodology). We parameterize this section of the DNN such that the widths of it’s hidden layers decays exponentially. Specifically, the number of hidden units in the hidden layer in this section is given by:

where, represents the ceiling (closest greater integer) of the number a. The parameter determined.

Figure 3: Visualization of the parameterized network structure with L = 3 and d = 1.

The link function g is formulated as a

single layer MLP. The hidden layer in g

is taken to have a width of 300d. One

could set this width to be anywhere be-

tween 100 500 times the size of the

encoding d. The idea is that the sub-

network representing the link function

g ought to have a sufficient number of

hidden units to capture arbitrary non-

linearities. A visual representation of

the DNN surrogate is shown in Fig. 3.

Note that no activation function is used

at the output of the encoding subnetwork h, and the output of the link function subnetwork, g. The task of optimizing the network structure is then reduced to a task of cross validating over two integer quantities, L and d, a much simpler task than optimizing for the number of layers and sizes of the individual layers separately.

2.8. Combined global optimization and grid search for model selection

The stochastic optimization task stated in Eq. (11) is characterized by hyperparameters, weight decay , and the integer quantities L and h, which fully parameterize the structure of the network. We refer to structure parameters collectively, as, S = (L, h). Training a DNN involves, in addition to optimizing for , selection of appropriate values of hyperparameters. The naive approach to do this is to perform an intuition guided manual search. In this work, the task of model selection reduces to selecting 3 quantities -the discrete hyperparameters, L and d and the continuous hyperparameter, . To be systematic, we adopt a combined grid search and stochastic global optimization procedure. Specifically, we define a grid of values for L and d. Over each grid location of the structure parameters, we perform a Bayesian global optimization (BGO) [59, 66] for

We split the dataset D into 3 parts - a training set, , a validation set, and a test set, . We define a grid, G, of L and h values and seek to assign a score to each location on the grid. The optimal choice of the structure parameters, S would then be the grid location which minimizes the validation error:

where, (is the size of the validation set, and ˆis a DNN characterized by structure parameter, ) is an estimate of the network parameters, obtained by minimizing the loss function in Eq. (13), with the regularization constant set to and network structure parameter, S.

The optimal choice of regularization constant , corresponding to structure parameter, S, is:

Eq. (23) is a stochastic global optimization problem characterized by a noisy objective function. BGO sequentially seeks out the global optimum of the objective function, R, by iteratively updating a Gaussian process (GP) surrogate response surface for ). During each iteration of BGO, a new pair of input-output observations is generated by maximizing an information acquisition function (IAF). The most popular choice of IAF is the expected improvement (EI) function. In closed form, the EI-IAF is given by:

where, and Φ are the probability density function and the cumulative

distribution function of the standard normal distribution. where, ) is the predictive mean of the GP surrogate at is the predictive variance of the GP surrogate which captures the epistemic uncertainty induced due to the limited set of observations and is GP estimate of the observational noise induced by random initializations of the DNN weights and random splitting of the dataset into is thus a filtered version of the predictive variance which is robust to observational noise. The BGO algorithm is summarized in Alg. 1. Note that the we maximize the negative of the validation error R.

Finally, the optimal structure parameter, is given by:

The full algorithm is summarized in Alg. 2. The estimation of 22

individual S can be parallelized and the computational cost of the global optimization search for times the cost of a single run of the ADAM optimizer.

3. Numerical Example - Stochastic Elliptic Partial Diﬀerential Equation

We consider the following benchmark elliptic PDE on the 2-d unit square domain:

with boundary conditions:

where, x = (x, y) are the physical coordinates in 2-d Euclidean space. 23

Eq. (26) is a model for 2-d steady-state diffusion processes. The quantity a(x) is a spatially varying diffusion coefficient. The physical significance of the equation and all terms in it are derived from context. For instance, Eq. (26) could be an idealized model for single-phase groundwater flow in an aquifer [68], where a represents the transmissivity coefficient and the solution variable, u is the pressure.

It is often the case that the a is unknown throughout the PDE domain. The uncertainty in the diffusion coefficient is formalized by modeling a as

a log normal random field, i.e., i.e.,

where, ) are the mean and covariance functions, respectively, of the Gaussian random field which models the logarithm of the diffusion coefficient a(x). The mean function models beliefs about the generic trends of the diffusion field as a function of spatial location. For the sake of simplicity, we set m(x) = 0 in this example. The covariance function k models beliefs about the regularity of the diffusion field and the and the lengthscales in which it varies. A popular choice for k is the exponential kernel:

where represents the correlation length along the spatial direction. The correlation lengths are typically assigned a fixed value. One then proceeds to use the truncated KLE to produce a reduced representation of the infinite-dimensional random field. The coefficients of the KLE are i.i.d. standard normal, and realizations of the diffusion field, a, can be generated easily, i.e., by sampling the KLE coefficients. For each realization of a, the corresponding solution of Eq. (26) is obtained. Any relevant quantity of interest, q = Q[u], is computed. Finally, one learns a surrogate response surface that maps the coefficients of the truncated KL expansion, using suitable learning algorithms such as GP regression.

We broaden the scope of the problem by removing the restrictions on the lengthscales, . The goal, in this example, is to construct a surrogate, which can accurately predict the solution, u, of the PDE, regardless of the lengthscales of the realization of a. Our approach, then, is to construct a surrogate which directly maps the discretized random field, to the numerical solution of the PDE.

3.1. Forward model

We solve the PDE using the finite volume method (FVM). The solver is implemented in the Python library FiPy [69]. The unit square domain is discretized into 32 32 finite volume cells. The input to the solver, ˆis the discretized version of a sample of the random diffusion a. The output of the solver, ˆ, is the numerical solution of the PDE corresponding to the realization ˆa of the diffusion field.

The model inputs, are spatial coordinates appended to a flattened version of the discrete random field realization and the model outputs are the PDE solution at the FV cell centers, goal is to learn a surrogate response, ˆ, which maps the snapshot of the diffusion field and a particular coordinate in the unit

square to the solution of the PDE at that location.

3.2. Data Generation

Intuitively, we would like to sample more realizations, of a, that have smaller lengthscales because one would observe the most variability in the solution corresponding to low lengthscale diffusion fields. Thus, instead of sampling lengthscale pairs uniformly from the unit square, we bias our sampling procedure to pick lengthscales that are smaller. Alg. 3 describes the procedure to select lengthscales to train the DNN surrogate. Note that a lower bound on is set by constraining the lengthscale to be larger the FV cell size,

We generate n different lengthscale pairs following the procedure in Alg. 3 to obtain a design of lengthscale pairs L.

For each lengthscale pair, (, we solve the forward model N times by generating N realizations of the diffusion coefficient. In this example we set n = 60 and N = 100. A visual representation of L is shown in Fig.

4. The full data generation procedure is summarized in Alg. 4. Samples of 26

Figure 4: Visual representation of LHS design of lengthscale pairs. Each ’x’ represents a sampled pair of lengthscales.

the diffusion coefficient drawn from two different pairs of lengthscales are shown in Fig. 5 and 6.

Figure 5: Samples of the random field a(x) with lengthscales 789 along the x and y directions.

3.3. Numerical settings

3.3.1. Dataset split

We generated our dataset, 100 = 6000 pairs of ˆbased on the procedure outlined in Alg. 4. D is randomly shuffled and split into 3 parts - A set of 2000 training examples, , a set of 2000 validation examples, and a set of 2000 test examples, . For the purpose of constructing the surrogate, we work with the logarithm of both ˆduring training. Furthermore, ˆu in the training set is standardized along each dimension. Necessary inversions of the transformations are performed during test time.

Figure 6: Samples of the random field a(x) with lengthscales 099 along the x and y directions.

3.4. Model selection settings

For selection of using Alg. 1, we set the number of initial design points, = 5. The number of BGO iterations, maxiter = 10 and the bounding box, ]. The grid of structure parameters is set to be

3.4.1. Network optimizer settings

We set the ADAM optimization learning rate optimizer is run for 45000 iterations and is decreased by a factor of 0.1 every 15000 iterations. The batch size, M, is set to be 50. Default values of tunable parameters of the ADAM optimizer are used, as recommended in [48]. These settings are, by no means, universal. Refer to [65] for some practical guidelines on DNN hyperparameter selection.

We use the Python library tensorflow to write scripts for training our DNN surrogates. For the purpose of reproducibility, the NumPy pseudorandom number generator seed is fixed. The code to replicate all the results in this paper will be made available at https://github.com/rohitkt10/ deep-uq-paper upon publication of this manuscript.

3.5. Results

Fig. 7 shows a heatmap of and optimal validation error the grid of the structure parameters. We observe that for the chosen grid, the optimal structure parameter is found to be 1. Fig. 8 shows GP surrogate response for as a function of log 2). Observe, from Fig. 8 that there is a dense clustering of the ’x’ markers around the optimum, indicating the convergence of the sequential optimization process.

Figure 7: 7(a) - Heatmap of over the grid G. 7(b) - Heatmap of over the grid G.

Figure 8: Gaussian process surrogate generated during BGO. We maximize the negative of the validation error R as a function of the logarithm of the regularization parameter,

The quality of the DNN predictions are evaluated based on the following relative error metric:

where, is the Frobenius norm. ˆare the FVM PDE solution and the DNN prediction of the PDE solution corresponding to the realization ˆa of the diffusion field. We also check the coefficient of determination, (also known as the score), which is defined as:

where, k indexes all the FV cell centers, ˆare the FV solution and DNN predicted solution at the cell center respectively, and ¯is the mean of ˆ9 shows a comparison of the DNN predicted PDE solution solution corresponding to a few randomly chosen realizations of the diffusion field from . We observe that the relative error as reported on the headers of the predicted fields in Fig. 9 are less than 0scores close to 0.99, which implies that the predicted solution from the DNN matches the true very closely. We also note that the PDE solution predicted by the DNN is ‘smoother’ than the FV solution of the PDE. This effects gets more pronounced when the lengthscales of the input diffusion field are smaller. The smoothness is a consequence of regularizing the DNN loss function. Fig. 10 shows the histograms of E and scores for all samples in . Note that all testing of the predictive capacity of the network is done using the test set because the and have already been used during the training and model selection phase.

3.6. Predictions at arbitrary lengthscales

Having constructed a DNN surrogate for the FV solution of the PDE, we would like to test predictive accuracy for samples of ˆa with lengthscales which are not used to generate the dataset 10 uniform grid of lengthscales is generated in the domain [, and for each lengthscale, 100 observations of the diffusion field and it’s corresponding PDE solution are generated. The mean of the relative errors and mean of the each lengthscale pair in this uniform grid is computed and shown in Fig. 11. We observe that even when the input field has lengthscales that do not match the lengthscales used for training, we are able to predict the solution with accuracy similar to that obtained during testing with samples from . This suggests that DNN surrogate is learning to map the ‘picture’ of the input field to the corresponding output. Note that the accuracy of the DNN decreases for diffusion fields with very fine lengthscales. This is consistent with the intuitive expectation that the less “variation” in the structure of the diffusion field, the easier it is characterize the function that maps the input to the solution.

3.7. Uncertainty Propagation

Having constructed a DNN surrogate that maps the input diffusion coef-ficient of the PDE in Eq. 27 and verified its accuracy, we use this surrogate to solve a UP problem. This surrogate is generalizable for arbitrary choices of the lengthscale of the input diffusion field. We consider the following 2

different choices of input lengthscales for propagating uncertainty -

1. Case 1:

2. Case 2:

In each case, we draw 10samples from the corresponding input distribution and estimate the following output statistics from the DNN surrogate predictions:

1. Mean of ˆu.

2. Variance of ˆu.

3. Probability density of the PDE solution at

The comparison between DNN surrogate approximation of the above quantities and their corresponding MCS approximations, for cases 1 and 2, are shown in Fig. 12 and Fig. 13. The relative error and scores between the DNN surrogate and the MCS approximations of the mean and standard deviations are shown in Tab. 1. We note that the mean PDE solution from

Table 1: Relative error and scores in the mean and variance of the PDE solution for two different choices of spatial lengthscale pairs.

the DNN surrogate matches that from the MCS sampling very closely in both cases. The error in the standard deviation, while reasonably low, is increased because of the tendency of the DNN to ‘smooth out’ the solution as discussed earlier. This is why we see a somewhat larger relative error for case 2, where the smaller lengthscales of the diffusion coefficient lead to PDE solutions that are inherently less smooth than the larger lengthscales of case 1.

4. Conclusion

We propose a methodology for learning DNN surrogate models for uncertainty quantification based on a parameterization of the DNN structure, such that the DNN is a composition of an encoder and one-layer MLP. Our parameterization lends the DNN surrogate the interpretation of recovering a nonlinear active subspace. We use a combination of grid search and BGO to select model hyperparameters, namely, the number of hidden layers, L, the width of the AS, h, and the weight decay regularization constant, We demonstrate our methodology with a UP problem in elliptic SPDE with uncertain diffusion coefficient, and learn a surrogate which maps a ‘picture’ of the discretized version of the coefficient to the PDE solution. Furthermore, we demonstrated that the DNN surrogate can effectively predict the solution of the PDE, even for diffusion fields with lengthscales that are not used for training the network.

This work is an early step towards using deep learning to create surrogate models for high dimensional UQ tasks. UQ for state-of-the-art computational simulators are notoriously difficult because of the prohibitive time span for individual simulations. One can extend the methodology proposed in this work to a Bayesian treatment of DNNs[70], i.e., imposing a prior on the weights of the NN and using approximate inference techniques such as variational inference[71, 72] to estimate the posterior distribution over the weights. Additionally, the Bayesian approach would allow one to better quantify the epistemic uncertainty induced by limited data.

DNNs are also naturally suited for tasks for multilevel/multifidelity UQ [73, 74, 75]. For instance, fully convolutional networks do not impose constraints on input dimensionality and can be trained on data obtained from several simulators at varying levels of fidelity. The hierarchical representation of information with a deep network can be used to learn correlations between heterogeneous information sources.

References

[1] R. C. Smith, Uncertainty quantification: theory, implementation, and applications, volume 12, Siam, 2013.

[2] T. J. Sullivan, Introduction to uncertainty quantification, volume 63, Springer, 2015.

[3] J. S. Liu, Monte Carlo strategies in scientific computing, Springer Science & Business Media, 2008.

[4] C. Z. Mooney, Monte carlo simulation, volume 116, Sage Publications, 1997.

[5] C. P. Robert, Monte carlo methods, Wiley Online Library, 2004.

[6] P. Baraldi, E. Zio, A combined monte carlo and possibilistic approach to uncertainty propagation in event tree analysis, Risk Analysis 28 (2008) 1309–1326.

[7] K. Mosegaard, A. Tarantola, Monte carlo sampling of solutions to inverse problems, Journal of Geophysical Research: Solid Earth 100 (1995) 12431–12447.

[8] K. Mosegaard, M. Sambridge, Monte carlo analysis of inverse problems, Inverse Problems 18 (2002) R29.

[9] I. Bilionis, B. A. Drewniak, E. M. Constantinescu, Crop physiology calibration in the clm, Geoscientific Model Development 8 (2015) 1071–1083.

[10] J. C. Spall, Introduction to stochastic search and optimization: estimation, simula- tion, and control, volume 65, John Wiley & Sons, 2005.

[11] C. E. Rasmussen, Gaussian processes for machine learning (2006).

[12] R. Tripathy, I. Bilionis, M. Gonzalez, Gaussian processes with built-in dimensionality reduction: Applications to high-dimensional uncertainty propagation, Journal of Computational Physics 321 (2016) 191–223.

[13] I. Bilionis, N. Zabaras, Multi-output local gaussian process regression: Applications to uncertainty quantification, Journal of Computational Physics 231 (2012) 5718– 5746.

[14] I. Bilionis, N. Zabaras, B. A. Konomi, G. Lin, Multi-output separable gaussian process: towards an efficient, fully bayesian paradigm for uncertainty quantification, Journal of Computational Physics 241 (2013) 212–239.

[15] P. Chen, N. Zabaras, I. Bilionis, Uncertainty propagation using infinite mixture of gaussian processes and variational bayesian inference, Journal of Computational Physics 284 (2015) 291–333.

[16] D. Xiu, G. E. Karniadakis, The wiener–askey polynomial chaos for stochastic dif- ferential equations, SIAM journal on scientific computing 24 (2002) 619–644.

[17] D. Xiu, G. E. Karniadakis, Modeling uncertainty in flow simulations via generalized polynomial chaos, Journal of computational physics 187 (2003) 137–167.

[18] D. Xiu, G. E. Karniadakis, Modeling uncertainty in steady state diffusion problems via generalized polynomial chaos, Computer methods in applied mechanics and engineering 191 (2002) 4927–4948.

[19] H. N. Najm, Uncertainty quantification and polynomial chaos techniques in com-

putational fluid dynamics, Annual review of fluid mechanics 41 (2009) 35–52.

[20] J. Park, I. W. Sandberg, Universal approximation using radial-basis-function net- works, Neural computation 3 (1991) 246–257.

[21] I. Bilionis, N. Zabaras, Multidimensional adaptive relevance vector machines for uncertainty quantification, SIAM Journal on Scientific Computing 34 (2012) B881– B908.

[22] E. Keogh, A. Mueen, Curse of dimensionality, in: Encyclopedia of Machine Learn- ing, Springer, 2011, pp. 257–258.

[23] A. Saltelli, K. Chan, E. M. Scott, et al., Sensitivity analysis, volume 1, Wiley New York, 2000.

[24] R. M. Neal, Assessing relevance determination methods using delve, Nato Asi Series F Computer And Systems Sciences 168 (1998) 97–132.

[25] R. G. Ghanem, P. D. Spanos, Stochastic finite elements: a spectral approach, Courier Corporation, 2003.

[26] P. D. Spanos, R. Ghanem, Stochastic finite element expansion for random media, Journal of engineering mechanics 115 (1989) 1035–1053.

[27] I. Jolliffe, Principal component analysis, Wiley Online Library, 2002.

[28] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and intelligent laboratory systems 2 (1987) 37–52.

[29] B. Sch¨olkopf, A. Smola, K.-R. M¨uller, Kernel principal component analysis, in: International Conference on Artificial Neural Networks, Springer, pp. 583–588.

[30] X. Ma, N. Zabaras, Kernel principal component analysis for stochastic input model generation, Journal of Computational Physics 230 (2011) 7311–7331.

[31] P. G. Constantine, E. Dow, Q. Wang, Active subspace methods in theory and practice: applications to kriging surfaces, SIAM Journal on Scientific Computing 36 (2014) A1500–A1524.

[32] T. W. Lukaczyk, P. Constantine, F. Palacios, J. J. Alonso, Active subspaces for shape optimization, in: 10th AIAA Multidisciplinary Design Optimization Conference, p. 1171.

[33] P. Constantine, D. Gleich, Computing active subspaces with monte carlo, arXiv preprint arXiv:1408.0545 (2014).

[34] P. G. Constantine, M. Emory, J. Larsson, G. Iaccarino, Exploiting active subspaces to quantify uncertainty in the numerical simulation of the hyshot ii scramjet, Journal of Computational Physics 302 (2015) 1–20.

[35] P. G. Constantine, B. Zaharatos, M. Campanelli, Discovering an active subspace in a single-diode solar cell model, Statistical Analysis and Data Mining: The ASA Data Science Journal 8 (2015) 264–273.

[36] M. Bauerheim, A. Ndiaye, P. Constantine, G. Iaccarino, S. Moreau, F. Nicoud, Uncertainty quantification of thermo-acoustic instabilities in annular combustors, in: Proceedings of the Summer Program, pp. 209–218.

[37] C. M. Bishop, Neural networks for pattern recognition, Oxford university press, 1995.

[38] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.

[39] J. Schmidhuber, Deep learning in neural networks: An overview, Neural networks 61 (2015) 85–117.

[40] S. S. Haykin, S. S. Haykin, S. S. Haykin, S. S. Haykin, Neural networks and learning machines, volume 3, Pearson Upper Saddle River, NJ, USA:, 2009.

[41] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep con- volutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.

[42] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Regularization of neural networks using dropconnect, in: Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066.

[43] B. Graham, Fractional max-pooling, arXiv preprint arXiv:1412.6071 (2014).

[44] D.-A. Clevert, T. Unterthiner, S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), arXiv preprint arXiv:1511.07289 (2015).

[45] C.-Y. Lee, P. W. Gallagher, Z. Tu, Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree, in: International conference on artificial intelligence and statistics.

[46] B. Huval, T. Wang, S. Tandon, J. Kiske, W. Song, J. Pazhayampallil, M. Andriluka, P. Rajpurkar, T. Migimatsu, R. Cheng-Yue, et al., An empirical evaluation of deep learning on highway driving, arXiv preprint arXiv:1504.01716 (2015).

[47] C. Chen, A. Seff, A. Kornhauser, J. Xiao, Deepdriving: Learning affordance for direct perception in autonomous driving, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730.

[48] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

[49] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning 4 (2012) 26–31.

[50] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011) 2121– 2159.

[51] M. D. Zeiler, Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701 (2012).

[52] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting., Journal of machine learning research 15 (2014) 1929–1958.

[53] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., Tensorflow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467 (2016).

[54] A. Paszke, S. Gross, S. Chintala, G. Chanan, Pytorch, 2017.

[55] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, Y. Bengio, Theano: A cpu and gpu math compiler in python, in: Proc. 9th Python in Science Conf, pp. 1–7.

[56] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are univer- sal approximators, Neural networks 2 (1989) 359–366.

[57] Y. Khoo, J. Lu, L. Ying, Solving parametric pde problems with artificial neural networks, arXiv preprint arXiv:1707.03351 (2017).

[58] Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification, arXiv preprint arXiv:1801.06879 (2018).

[59] E. Brochu, V. M. Cora, N. De Freitas, A tutorial on bayesian optimization of

expensive cost functions, with application to active user modeling and hierarchical reinforcement learning, arXiv preprint arXiv:1012.2599 (2010).

[60] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Pro- ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323.

[61] P. Ramachandran, B. Zoph, Q. Le, Searching for activation functions (2017).

[62] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2005) 301– 320.

[63] Y. Chauvin, D. E. Rumelhart, Backpropagation: theory, architectures, and appli- cations, Psychology Press, 1995.

[64] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Pro- ceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.

[65] Y. Bengio, Practical recommendations for gradient-based training of deep architec- tures, in: Neural networks: Tricks of the trade, Springer, 2012, pp. 437–478.

[66] P. Pandita, I. Bilionis, J. Panchal, Extending expected improvement for high-dimensional stochastic optimization of expensive black-box functions, Journal of Mechanical Design 138 (2016) 111412.

[67] R. L. Iman, Latin hypercube sampling, Encyclopedia of quantitative risk analysis and assessment (2008).

[68] W. Li, G. Lin, B. Li, Inverse regression-based uncertainty quantification algorithms for high-dimensional models: Theory and practice, Journal of Computational Physics 321 (2016) 259–278.

[69] J. E. Guyer, D. Wheeler, J. A. Warren, Fipy: partial differential equations with python, Computing in Science & Engineering 11 (2009).

[70] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424 (2015).

[71] R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in: Artificial Intelligence and Statistics, pp. 814–822.

[72] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, pp. 2348–2356.

[73] B. Peherstorfer, K. Willcox, M. Gunzburger, Survey of multifidelity methods in uncertainty propagation, inference, and optimization, Preprint (2016) 1–57.

[74] M. C. Kennedy, A. O’Hagan, Predicting the output from a complex computer code when fast approximations are available, Biometrika 87 (2000) 1–13.

[75] P. Perdikaris, M. Raissi, A. Damianou, N. Lawrence, G. E. Karniadakis, Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling, in: Proc. R. Soc. A, volume 473, The Royal Society, p. 20160751.

Figure 9: Comparisons of DNN prediction of the PDE solution to that correct solution for 4 randomly chosen test examples. The left column shows the input diffusion field, the middle column shows the FV solution of the PDE and the right column shows the solution of the PDE predicted by the DNN.

Figure 10: 10(a) - Histogram of relative errors, E, for all examples in the test data set. 10(b) - Histogram of the scores for all examples in the test data set.

Figure 11: 11(a) - Mean relative errors of the predicted solution corresponding to samples of a with arbitrary pairs of lengthscales not used in the DNN training. 11(b) - Mean scores of the predicted solutions corresponding to samples of a with arbitrary pairs of lengthscales not used in the DNN training. The ’x’ markers correspond to lengthscales used in training the DNN and the solid dots correspond to lengthscales used to test the DNN surrogate.

Figure 12: Mean and standard deviation of the PDE solution obtained by MC sampling of the DNN surrogate. In each sub figure the left column shows the MCS approximation and the right column shows the DNN approximation. The top half compares the mean of the solution and the bottom half compares the standard deviation. 12(a) - Case 1: 5. 12(b) - Case 2:

Figure 13: 13(a) and 13(b) - Density of PDE solution at for cases 1 and 2 respectively. 13(c) and 13(d) - Density of PDE solution at for cases 1 and 2 respectively.

designed for accessibility and to further open science