Bayesian Deep Convolutional Encoder-Decoder Networks for Surrogate Modeling and Uncertainty Quantification

2018·Arxiv

Abstract

Abstract

We are interested in the development of surrogate models for uncertainty quantification and propagation in problems governed by stochastic PDEs using a deep convolutional encoder-decoder network in a similar fashion to approaches considered in deep learning for image-to-image regression tasks. Since normal neural networks are data intensive and cannot provide predictive uncertainty, we propose a Bayesian approach to convolutional neural nets. A recently introduced variational gradient descent algorithm based on Stein’s method is scaled to deep convolutional networks to perform approximate Bayesian inference on millions of uncertain network parameters. This approach achieves state of the art performance in terms of predictive accuracy and uncertainty quantification in comparison to other approaches in Bayesian neural networks as well as techniques that include Gaussian processes and ensemble methods even when the training data size is relatively small. To evaluate the performance of this approach, we consider standard uncertainty quantification benchmark problems including flow in heterogeneous media defined in terms of limited data-driven permeability realizations. The performance of the surrogate model developed is very good even though there is no underlying structure shared between the input (permeability) and output (flow/pressure) fields as is often the case in the image-to-image regression models used in computer vision problems. Studies are performed with an underlying stochastic input dimensionality up to 4, 225 where most other uncertainty quantification methods fail. Uncertainty propagation tasks are considered and the predictive output Bayesian

statistics are compared to those obtained with Monte Carlo estimates.

Keywords: Uncertainty Quantification, Bayesian Neural Networks, Convolutional Encoder-Decoder Networks, Deep Learning, Porous Media Flows

1. Introduction

Uncertainty in complex systems arises from model error and model parametrization, unknown/incomplete material properties, boundary conditions or forcing terms, and other. Uncertainty propagation takes place by reformulating the problem of interest as a system of stochastic partial differential equations (SPDEs). Solution of such problems often needs to rely on the solution of the deterministic problem at a finite number of realizations of the random input using Monte Carlo sampling, or collocation methods. Considering the computational cost of solving complex multiscale/multiphysics deterministic problems, one often relies on Bayesian surrogate models that are trained with only a small number of deterministic solution runs while at the same time they are capable of capturing the epistemic uncertainty introduced from the limited training data [1].

For realistic problems in science and engineering, we only have access to limited number (e.g. 100 or so) of deterministic simulation runs. Vanilla Monte Carlo for uncertainty propagation is thus hopeless. A dominant solution is to train a surrogate model using the limited simulation-based training data, and then perform prediction and uncertainty propagation tasks using the surrogate instead of solving the actual PDEs. Unfortunately most existing surrogate models have difficulty scaling to high-dimensional problems, such as the ones based on Gaussian processes (GP) [2, 3] or generalized polynomial chaos expansions (gPC [4]). High dimensionality often arises from the discretization of properties with small correlation lengths (e.g. permeability in heterogeneous media flows), random distributed sources or force input fields with multiple scales [5].

To alleviate the curse of stochastic input dimensionality, we usually assume that the given input data lie on an embedded non-linear manifold within the higher dimensional space. This intrinsic dimensionality is captured by dimensionality reduction techniques [6], such as the KarhunenLo`eve expansion (KLE), t-SNE [7], auto-encoders [8], probabilistic methods like variational auto-encoders [9], Gaussian process latent variable models (GP-LVM) [10], and many more. Most dimensionality reduction models are unsupervised learning problems that do not explicitly take the regression task into account. Thus the classical approach to uncertainty quantification is to first reduce the dimensionality of the input to obtain a low-dimensional latent representation of the input field, then to built a regression model from this latent representation to the output. This approach is certainly not efficient is particular when the map from the latent representation to the physical space of the input data is not available. In [2], KLE was used for dimensionality reduction of the permeability field and then GP was performed as independent task for Bayesian regression. In [11], this approach was taken one step further with the probabilistic mappings from input to latent space and from latent space to output being modeled by generalized linear models both trained simultaneously end-to-end using stochastic variational inference instead of performing the unsupervised and supervised/regression tasks separately.

One of the essential upcoming approaches for handling high-dimensional data is to learn the latent input representation automatically by supervision with the output in regression tasks. This is the central idea of deep neural networks [12], especially convolutional neural networks (CNNs) [13] which stack (deeper) layers of linear convolutions with nonlinear activations to automatically extract the multi-scale features or concepts from high-dimensional input [14], thus alleviating the hand-craft feature engineering, such as searching for the right set of basis functions, or relying on expert knowledge.

However, the general perspective for using deep neural networks [15, 16] in the context of surrogate modeling is that physical problems in uncertainty quantification (UQ) are not big data problems, thus not suitable for addressing them with deep learning approaches. However, we argue otherwise in the sense that each simulation run generates large amount of data which potentially reveal the essential characteristics about the underlying system. In addition, even for a relatively small dataset, deep neural networks show unique generalization property [17, 18]. These are typically over-parameterized models (hundreds and thousands of times more parameters than training data), but they do not overfit, i.e. the test error does not grow as the network parameters increase. Deep learning has been explored as a competitive methodology across fiels such as fluid mechanics [19, 20], hydrology [21], bioinformatics [22], high energy physics [23] and other.

This unique generalization behavior makes it possible to use deep neural networks for surrogate modeling. They are capable to capture the complex nonlinear mapping between high-dimensional input and output due to their expressiveness [24], while they only use small number of data from simulation runs. In addition, there has been a resurgence of interest in putting deep neural network under a formal Bayesian framework. Bayesian deep learning [25, 26, 27, 28, 29, 30, 31, 32, 33] enables the network to express its uncertainty on its predictions when using a small number of training data. The Bayesian neural networks can quantify the predictive uncertainty by treating the network parameters as random variables, and perform Bayesian inference on those uncertain parameters conditioned on limited observations.

In this work, we mainly consider surrogate modeling of physical systems governed by stochastic partial differential equations with high-dimensional stochastic input such as flow in random porous media [34]. The spatially discretized stochastic input field and the corresponding output fields are high-dimensional. We adopt an end-to-end image-to-image regression approach for this challenging surrogate modeling problem. More specifically, a fully convolutional encoder-decoder network is designed to capture the complex mapping directly from the high-dimensional input field to the output fields without using any explicit intermediate dimensionality reduction method. To make the model more parameter efficient and compact, we use DenseNet to build the feature extractor within the encoder and decoder paths [35]. Intuitively, the encoder network extracts the multi-scale features from the input data which are used by the decoder network to reconstruct the output fields. In similarity with problems in computer vision, we treat the input-output map as an image-to-image map. To account for the limited training data and endow the network with uncertainty estimates, we further treat the convolutional encoder-decoder network to be Bayesian and scale a recently proposed approximate inference method called Stein Variational Gradient Descent to modern deep convolutional networks. We will show that the methodology can learn a Bayesian surrogate for a problem with an intrinsic dimensionality of 50, achieving promising results on both predictive accuracy and uncertainty estimates using as few as 32 training data. More importantly we develop a surrogate for the case of 4225 dimensionality using 512 training data. We also show these uncertainty estimates are well-calibrated using a reliability diagram. To this end, we believe that Bayesian neural networks are strong candidates for surrogate modeling and uncertainty propagation in high-dimensional problems with limited training data.

The remaining of the paper is organized as follows: In Section 2, we present the problem setup for surrogate modeling with high-dimensional input and the proposed approach in treating it as an image regression problem. We then introduce the CNNs and the encoder-decoder network used in our model. In Section 3, we present the Bayesian formulation of neural networks and a non-parametric variational method for the underlying challenging approximate inference task. In Section 4, we provide implementation details and show the performed experiments on a porous media flow problem. We finally conclude and discuss the various unexplored research directions in Section 5.

2. Methodology

2.1. Surrogate Modeling as Image-to-Image Regression

The physical systems considered here are modeled by stochastic PDEs (SPDEs) with solutions y(s, x(s)), i.e. the model response spatial location 3), with one realization x(s) of the random field is the index set and Ω is the sample space. The formulation allows for multiple input channels (i.e. 1) even though our interest here is on one input property represented as a vector random field. This random field appears in the coefficients of the SPDEs, and is used to model material properties, such as the permeability or porosity fields in geological media flows. We assume the computer simulation for the physical systems is performed over a given set of spatial grid locations (e.g. mesh nodes in finite element methods). In this case, the random field x is discretized over the fixed grids S, thus is equivalent to a high-dimensional random vector, denoted as The corresponding response y is solved over S, thus can be represented as a vector

With the discretization described above and assuming for simplicity fixed boundary and initial conditions and source terms as appropriate, we consider the computation simulation as a black-box mapping of the form:

In order to tackle the limitations of using the deterministic computationally expensive simulator for uncertainty propagation, a surrogate model ) is trained using limited simulation data proximate the ‘ground-truth’ simulation-induced function are the model parameters, and N is the number of simulation runs (number of training simulation-based data).

Let us consider that the PDEs governing the physical system described above are solved over 2D regular grids of the number of grid points in the two axes of the spatial domain (height and width), and . It is very natural to organize the simulation data as an image dataset is one input field realization, and is the simulated steady-state output fields for discretized over the same grids. Here are the number of dimensions for the input x and the output y at one location. These are treated herein as the number of channels in input and output images, similar to RGB channels in natural images. It is easy to generalize to the 3D spatial domain by adding an extra depth axis to the images, e.g. and

Therefore, we transform the surrogate modeling problem to an image-to-image regression problem, with the regression function as

In distinction from an image classification problem which requires imagewise prediction, the image regression problem is concerned with pixel-wise predictions, e.g. predicting the depth of each pixel in an image, or in our physical problem, predicting the output fields at each grid point. Such problems have been intensively studied within the computer vision community by leveraging the rapid recent progress of convolutional neural networks (CNNs), such as AlexNet [13], VGG [36], Inception [37], ResNet [38], DenseNet [35], and many more. A common model design pattern for semantic segmentation [39] or depth regression [40] is the encoder-decoder architecture. The intuition behind regression between two high-dimensional objects is to go through a coarse-refine process, i.e. to reduce the spatial dimension of the input image to high-level coarse features using an encoder, and then recover the spatial dimension by refining the coarse features through a decoder. One of the characteristics shared by those vision tasks is that the input and output images share the underlying structure, or they are different renderings of the same underlying structure [41]. However, for our surrogate modeling tasks, the input and output images appear to be quite different, due to the complex physical influences (defined by PDEs) of the random input field, forcing terms and boundary conditions on the system response. This was after all the reason of pursuing the training of a surrogate model that avoids the repeated solution of the PDEs for different input realizations. Surprisingly, as we will discuss later on in this paper, the encoder-decoder network still works very well.

Remark 1. The training data for the surrogate model of interest here include the realizations of the random input field and the corresponding multi-output obtained from simulation. Of interest is to address problems with limited training data sets considering the high-computational cost of each simulation run. However, note that key UQ tasks include the ability to predict the system response and our confidence on it using input realizations (testing dataset) consistent with the given training data but also the computation of the statistics of the response induced by the random input. Both of these tasks require the availability of a high-number of input data sets (e.g. 500 data points for testing, and 10input data points for Monte Carlo calculation of the output statistics). The problem of generating more input realizations using only the training dataset is the solution of a generative model problem. There has been significant progress in recent years in the topic, such as the generative adversarial networks (GANs) [42] and its ever exploding variants, variational auto-encoders (VAEs) [9], autoregressive models like PixelCNN [43], PixelRNN [44], and other. However, note that in this work our focus is on the image-to-image mapping and its performance on uncertainty quantification tasks. We will thus assume that enough input samples are provided both for testing and output statistics calculation even though only a small dataset will be used for training. In our examples in Section 4, synthetic log-permeability datasets are generated by sampling a Gaussian random field with an exponential kernel. The output for each permeability sample is generated using a deterministic simulator.

2.2. Dense Convolutional Encoder-Decoder Networks

In this subsection, we briefly introduce a state-of-the-art CNN architecture called DenseNet [35] and fully convolutional encoder-decoder networks developed in computer vision, and then present how to utilize these advances to build our baseline network for surrogate modeling in uncertainty quantification.

2.2.1. Densely Connected Convolutional Networks

DenseNet [35] is a recently proposed CNN architecture which extends the ideas of ResNet [38] and Highway Networks [45] to create dense connections between all layers, so as to improve the information (gradient) flow through the network for better parameter efficiency.

Let be the output of the Traditionally CNNs pass the output of one layer only to the input of the next layer, i.e. where denotes the nonlinear function of the hidden layer. In current CNNs, h is commonly defined as a composition of Batch Normalization [46] (BatchNorm), Rectified Linear Unit [47] (ReLU) and Convolution (Conv) or transposed convolution (ConvT) [48]. ResNets [38] create an additional identity mapping that bypasses the nonlinear layer, i.e. In this way, the nonlinear layer only needs to learn a residual function which facilitates the training of deeper networks.

DenseNets [35] introduce connections from any layer to all subsequent layers, i.e. ]). To put it in another way, the input features of one layer are concatenated to the output features of this layer and this serves as the input features to the next layer. Assume that the input image has channels, and each layer outputs K feature maps, then the layer would have input with feature maps, i.e. the number of feature maps in DenseNet grows linearly with the depth. K is here referred to as the growth rate.

For image regression based on encoder-decoder networks, downsampling and upsampling are required to change the size of feature maps, which makes concatenation of feature maps unfeasible. Dense blocks and transition layers are introduced to solve this problem and modularize the network design. A dense block contains multiple densely connected layers whose input and output feature maps are of the same size. It contains two design parameters, namely the number L of layers within and the growth rate K for each layer. An illustration of the dense block is in Fig. 1.

Figure 1: (a) A dense block contains with growth rate K = 2. (b) The second layer of the dense block, where ]) is its output feature map. Notice that the input to the third layer is the concatenation of the output and input features of ]. As is often the case, each layer is composed of Batch Normalization [46] (BatchNorm), Rectified Linear Unit [47] (ReLU) and Convolution (Conv). The convolution kernel has kernel size k = 3, stride s = 1 and zero padding p = 1, which keep the size of the feature maps the same as the input.

Transition layers are used to change the size of feature maps and reduce their number between dense blocks. More specifically, the encoding layer typically halfs the size of feature maps, while the decoding layer doubles the feature map size. Both of the two layers reduce the number of feature maps. This is illustrated in Fig. 2.

Figure 2: Both (a) encoding layer and (b) decoding layer contain two convolutions. The first convolution reduces the number of feature maps while keeps their size the same using a kernel with parameters k = 1, s = 1, p = 0; the second convolution changes the size of the feature maps but not their number using a kernel k = 3, s = 2, p = 1. The main difference between (a) and (b) is in the type of the second convolution layer, which is Conv and ConvT respectively, for downsampling and upsampling. Note that no pooling is used at transition layers for maintaining the location information. The colored feature maps used here are independent from the feature maps with the same color shown in other figures.

2.2.2. Fully Convolutional Networks

The fully convolutional networks (FCNs) [49] are extensions of CNNs for pixel-wise prediction, e.g. semantic segmentation. FCNs replace the fully connected layers in CNNs with convolution layers, add upsampling layers in the end to recover the input spatial resolution, and introduce the skip connections between feature maps in downsampling and upsampling path to recover finer information lost in the downsampling path. Most of the recent work focuses in improving the upsampling path and increase the connectivity within and between upsampling and downsampling paths. U-nets [39] extend the upsampling path as symmetric to the downsampling path and add skip connections between each size of feature maps in the downsampling and upsampling paths. Within SegNets [50], the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. Fully convolutional DenseNets [51] extend DenseNets to FCNs, which are closest to our network design but with several differences. We keep all the feature maps of a dense block concatenated so far before passing to the transition layers, while they only keep the output feature maps of the last convolution layer within the dense block. The feature maps explosion problem is addressed by the first convolution layer within the transition layer. Besides that we do not use skip connections between encoding and decoding paths because of the weak correspondence between the input and output images. We also do not use max-pooling for encoding layers, instead we use convolution with stride 2.

2.3. Network architecture: DenseED

We follow the fully convolutional networks (FCNs) [49] for image segmentation without using any fully connected layers, and encode-decoder architecture similar to U-net [39] and SegNet [50] but without the concatenation of feature maps between the encoder paths and decoder paths. Furthermore, we adapt the DenseNet [35] structure into the encoder and decoder networks. After extensive hyperparameter and architecture search, we arrived at a baseline dense convolutional encoder-decoder network, called DenseED, similar to the network proposed in [51] but with noticeable differ-ences as stated above and shown in Fig. 3.

Figure 3: Network architecture: DenseED.

In the encoding path, the input field realizations are fed into the first convolution layer with large kernel size k = 7, stride s = 2 and zero padding p = 2. Then the extracted feature maps are passed through an alternative cascade of dense blocks and encoding layers as introduced in Figs. 1 and 2. The dense block after the last encoding layer outputs the high-level coarse feature maps extracted from the input, as shown in purple at the right end of the network in Fig. 3, which are subsequently fed into the decoder path. The decoding network consists of an alternation of dense blocks and decoding layers, with the last decoding layer directly leading to the prediction of the output fields.

2.4. Network architecture engineering and hyperparameter search

Network architecture engineering and hyperparameter search are among the main challenges and source of innovations in deep learning, mostly problem-specific and empirical. The general network architecture is introduced in Section 2.2, which is based on the recent development of neural network design for image segmentation. For our image regression problem, the main design considerations include the following:

• Downsampling layers: convolution or pooling;

• Upsampling layers: bilinear upsampling or transposed convolution;

• Smallest spatial dimensions of feature maps: this is determined by the number of downsampling layers;

• Add or not of skip connections between the encoding and decoding paths;

• Kernel of convolution layers, kernel size k, stride s, zero padding p;

• Number of layers L and growth rate K within each dense block;

• Regularizations: weight decay, batch normalization, dropout, etc;

• Optimizer: stochastic gradient descent algorithms and their variants, such as Adam, Adagrad, RMSprop, and others;

• Training hyperparameters: batch size, learning rate and its scheduler.

The details of architecture search and hyperparameter selection for the particular problem considered are presented in Appendix A where we also report various experiments using DenseED for surrogate modeling with limited training data. No overfitting was observed in our calculations and the obtained results were quite good. This is an intriguing and active research topic in the deep learning community [17].

In our non-Bayesian calculations, we have considered ized MSE training loss function. Given an input image x, and a target image y, the prediction f(x, w), the regularized MSE loss is

where the penalty function Ω(regularization, and Ω(w) = regularization, and is the number

of pixels in all channels of one output image. w denotes all the parameters in the network. For our case, it includes the kernel weights in all the convolution and transposed convolution layers (no bias is used in convolutional kernel), the scale and shift parameters in all the batch normalization layers. See Section 4.3 for an example of the fully convolutional encoder-decoder network used for the Darcy flow problem. Note that regularization is implemented in PyTorch optimizers by specifying in Eq. (3).

The network architecture selected for the non-Bayesian model is the same as that used for our Bayesian model introduced next.

Remark 2. In the encoder-decoder network, the batch normalization layer used after each convolutional layer can also be considered as an effective regularizer. It is commonly adopted nowadays in deep convolutional networksreplacing dropout

3. Bayesian Neural Networks

Consider a deterministic neural net y = f(x, w) with input x, output y, and all parameters w including the weights and biases. Bayesian neural networks (BNNs) treat the parameters w as random variables instead of deterministic unknowns to account for epistemic uncertainty induced by lack of training data. Besides that, usually additive noise n is introduced to model the aleatoric uncertainty which can not be reduced by having more observations, also to make the probabilistic model have an explicit likelihood depending on the noise distribution, i.e.

where f(x, w) is the output of a neural network with the uncertain w, and n is the additive noise.

3.1. Sparsity inducing prior on weights

Given the large amount of ‘un-interpretable’ parameters w in a deep neural net, there are not many choices of priors. But the demand for compression [52, 33] of the neural net for lower memory and computation cost calls for sparsity promoting priors. We assume a fully factorized Gaussian prior with zero mean and Gamma-distributed precision on parameters w

This results in a student’s t-prior for w, which has heavy tails and more mass close to zero.

3.2. Additive Noise Model

Additive noise can be considered of the following form:

• Output-wise: , same for all output pixels;

• Channel-wise: ], same across each of the put channels/fields;

• Pixel-wise: , distinct for each output pixel.

Here, are scalars, is a field with the same dimension as the output denotes the element-wise product operator. In this work, we have considered both Gaussian noise, ) and Laplacian noise, In the numerical results discussed in Section 4, we concentrate in the the output-wise and channel-wise cases above. We treat the noise precision as a random variable with a conjugate prior ). For the generated training data, a priori we assume that the noise variance (also known as nugget [53]) to be very small e.g. 10. Thus the values provide a good initial guess for the prior hyperparameters.

Remark 3. We can also model the noise varying with input (pixel-wise noise model), resulting in a heteroscedastic noise model, i.e. n(x, w) = can be Gaussian or Laplacian. The heteroscedastic noise [54, 55, 56] can be implemented as extending the output of the neural net as:

The output of the system becomes y = f(x, w) + n(x, w). The pixel-wise case ) may help capture large variations for example near discontinuous regions of the output. In practice, we apply a softplus transformation to the second part of the output of the neural net to enforce the positive variance constraint, i.e. numerical stability.

3.3. Stein Variational Gradient Descent (SVGD)

Approximate inference for Bayesian deep neural network is a daunting task because of the large number of uncertain parameters, e.g. tens or hundreds of millions in modern deep networks. In our surrogate problem, the task is to find a high-dimensional posterior distribution over millions of random variables using less than hundreds or thousands of training data. As reviewed in Section 1, most of variational inference methods [57] restrict the approximate posterior within certain parametric variational family, while sampling-based methods are slow and difficult to converge. Here we adopt a recently proposed non-parametric variational inference method called stochastic variational gradient descent (SVGD) [29, 58] that is similar to standard gradient descent while maintaining the efficiency of particle methods.

For a prescribed probabilistic model with and prior ), we are interested in Bayesian inference of the uncertain parameters to find the posterior distribution denote the i.i.d. observations (training data), i.e. BNNs with homoescedastic Gaussian noise case, . Variational inference aims to approximate the target posterior distribution a variational distribution ) which lies in a restricted set of distributions Q by minimizing the KL divergence between the two, i.e.

where ˜) is the unnor- malized posterior, and is the normalization constant or model evidence, which is usually computationally intractable, but can be ignored when we optimize the KL divergence.

The variational family considered here is a set of distributions obtained by smooth transforms from an initial tractable distribution (e.g. the prior) represented in terms of particles. The transforms applied to each particle take the following form:

where is the step size, is the perturbation direction within a function space transforms the initial density

Instead of using parametric form for the variational posterior, a particle approximation is used, i.e. a set of particles with empirical measure . We would like to have to weakly converge to the measure of the true posterior . We apply the transform T to those particles, and denote the pushforward measure of

The problem is to find out the direction to maximally decrease the KL divergence of the variational approximation and the target distribution, i.e. to solve the following functional optimization problem:

It turns out that [29]

where is called the Stein operator associated to the distribution p,

The expectation ] evaluates the difference between maximum is defined as the Stein discrepancy,

It has been shown [59] that when the functional space F is chosen to be the unit ball in a product reproducing kernel Hilbert space H with the positive kernel ), the maximal direction to perturb (or the Stein discrepancy) has a closed-form solution,

Thus we have the following algorithm to transform an initial distribution to the target posterior

This is an one-line algorithm, where the gradient ) pushes the particles towards the high posterior region by kernel smoothed gradient term , while maintaining a degree of diversity by repulsive force term ). When the number of particles becomes 1, then the algorithm reduces to the MAP estimate of the posterior. This algorithm is implemented in PyTorch with GPU acceleration and scaled to our deep convolutional encoder-decoder network DenseED.

Here we use one toy example to illustrate the idea of SVGD. We start with the 20 particles from the Normal distribution 1), and tranport the particles iteratively to the target Gaussian mixture distribution 01) with Algorithm 1.

Figure 4: Example of transporting 20 particles from 1) to Gaussian mixture 01) with SVGD using 20 particles. The green dash represents the target Gaussian mixture.

3.3.1. Implementation and Training

• We use S samples of to approximate the empirical measure of the posterior. They are initialized and stored in 20 deterministic DenseED neural networks.

• At each step t, for each model j, the gradient of its joint likelihood (or unnormalized posterior) ) is computed by the auto- matic differentiation tool in PyTorch. We first compute the joint likelihood log ) by feeding forward the data , then back propagate to compute its gradient Note that this gradient is stored in PyTorch module associated to the weights in each network.

• Then we can proceed to compute the kernel matrix

and its gradient ), as well as the kernel weighted gradient of the joint likelihood ). For this to happen, we need to vectorize (extract out) the parameters and the computed gradient ) from each neural network. The optimal pertur- bation direction is further computed by the sum of the two terms as shown in the SVGD algorithm.

• After is computed, we send ) back to each neural net- work the gradient w.r.t. its parameters (overwriting the previously computed )), and updating locally in each neural network using PyTorch’s optimization library such as Adam or SGD to compute

• With one iteration complete, the algorithm repeats the above steps until convergence in the parameters is achieved.

3.4. Uncertainty Quantification

Of interest to the classical UQ problem is the computation of the posterior predictive distribution (predict the system response for a test input) as well as the computation of the output response averaged over the input probability distribution. In particular, we are interested in computing the following:

• Predictive uncertainty at ), and in particular the moments

• Propagated uncertainty to the system response by integrating over ), and in particular can use these moments to compute the statistics of conditional output statistics, e.g. Var

We can use Monte Carlo to approximate the moments of the predictive distribution

with mean (by the law of total expectation)

Note that the SVGD algorithm provides a sample representation of the joint posterior of all parameters ). To obtain the samples of the marginal posterior p(w | D) as needed above, one simply needs to use the samples corresponding to w.

The predictive covariance can also be easily calculated using the law of total variance. The variance and expectation below are w.r.t. to the posterior of the parameters. We can show the following:

Cov(

where ). The predictive variance is the diagonal of the predictive covariance:

where 1 is a vector of ones with the same dimension as f, and the square (is applied element-wise to the vectors.

The above computation is the prediction at a specific input would also like to compute the average prediction over the distribution of the uncertain input. We first compute the output statistics given the realizations of the uncertain parameters

conditional predictive mean is

and the conditional predictive covariance is

Cov(

Also the conditional predictive variance (at each spatial location) is

Var(

where 1 is a vector of ones with the same dimension as y, and the square operator is here applied element-wise to the vectors. Then we can further compute the statistics of the above conditional statistics due to the uncertainty in the surrogate, i.e. , which are the sample means and sample variances in each output dimension of of the conditional predictive mean and variance.

4. Numerical Implementation and Results

We study the two-dimensional, single phase, steady-state flow through a random permeability field following the case study in Section 3.2 in [2]. Consider the random permeability field K on a unit square spatial domain , the pressure field p and velocity field u of the fluid through the porous media are governed by Darcy’s law:

where ˆn denotes the unit normal vector to the boundary and the source term f is used to model an injection well on the left-bottom corner of S and a production well on the right-top corner. We also enforce no-flux boundary condition, and an integral constraint to ensure the uniqueness of the solution as in [2]. More specifically,

where r is the rate of the wells and w is their size. The input log-permeability field is restricted in this work to be a Gaussian random field, i.e.

where m is the constant mean and covariance function k is specified in the following form using the norm in the exponent instead of the in [2], i.e.

4.1. Datasets

The Gaussian random field [60] with exponential kernel for the onedimensional case corresponds to the Ornstein-Uhlenbeck process which is mean-square continuous but not mean-square differentiable. Thus when we discretize the field over a grid, the field value jumps (varies highly) as we move from pixel to pixel. The field does not become smoother when we use a finer grid over a fixed spatial domain. This high variability creates a significant challenge for data-driven models to capture, i.e. the intrinsic dimensionality of the discretized random field is the total number of pixels, e.g. 4, 225 for 6565 grids (which will be our reference grid for our calculations). However, a common assumption for natural images is that the underlying dimensionality is actually small (few hundreds) despite their complex appearance. To evaluate the generality and effectiveness of the methodology, we use KLE to control the intrinsic dimensionality of the permeability dataset. We evaluated our model using datasets produced with increasing dimensionality of 50, 500, 4225 (called KLE50, KLE500, and KLE4225, respectively). Notice that when the number of KLE terms is 4225, the permeability field is directly sampled from the exponential Gaussian field without any dimensionality reduction. The intrinsic dimensionality of dataset is hidden from our model, i.e. our model do not built a map from the KLE terms to the system output. Instead, it models an end-to-end mapping from input fields to output fields. In fact, we will show one specific network architecture that works well for all three datasets obtained from the different intrinsic input data dimensions.

We consider solving the Darcy flow Eq. (18) over a unit squared domain with fixed 65 65 grid, and length scale l = 0.1, kernel mean m = 0, rate of source r = 10, size of source w = 0.125. The ratio of the cumulative sum of eigenvalues (in decreasing order) over the total sum of them is shown in Fig. 5. The Darcy flow equation is solved using mixed finite

Figure 5: KLE profile.

element formulation implemented in FEniCS [61] with third-order RaviartThomas elements for the velocity, and fourth-order discontinuous elements for the pressure. The sample input permeability field and computed output pressure and velocity fields for three datasets are shown in Fig. 6.

When the available data is limited, it is common practice to use crossvalidation to evaluate the model. Since our dataset is synthetically generated, we have access to any number of training and test data up to computing constraints to solve the Darcy flow equations. Our current data includes four sets: the training set, validation set, test set, and uncertainty propagation set. The training set is sampled using the simplest design of experiment method, Latin hypercube sampling. More specifically, the KLE for the log-permeability field is

where ) are the eigenvalues and eigenfunctions of the exponential covariance function of the Gaussian field specified in Eqs. (20) and (21), ’s are i.i.d. standard Normal, and q is the number of KLE coefficients maintained in the expansion. The maximum number that can be used is

Figure 6: Sample permeability K obtained from the (a) KLE50 dataset, (b) KLE500 dataset, and (c) KLE4225 dataset (no dimensionality reduction) and the corresponding velocity components and pressure p obtained from the simulator. All figures are shown using pixels (imshow) to reveal the high variability of the input and output fields.

finite and equal to the number of grid points used in the discetization of the field over the unit square. We first use Latin hypercube design to sample from the hypercube [0, then obtain the eigenvalue by where Φ is the cumulative distribution function of the standard normal distribution. The KLE50 case contains 32, 64, 128, and 256 training data; KLE500 contains 64, 128, 256, and 512 training data; and KLE4225 contains 128, 256, 512, and 1024 training data.

The log-permeability fields in the other three sets are reconstructed directly with , which are sampled from standard normal. The validation and test set each contains 500 input permeability fields, and the dataset for uncertainty propagation contains 10, 000 realizations. All datasets are organized as images as discussed in Section 2.1.

4.2. Evaluation Metrics

Several metrics are used to evaluate the trained models on test data . In particular, we consider the following:

where ˆis the output mean of the Bayesian surrogate, i.e. as in Eq. (12) or predictive output of the non-Bayesian surrogate, i.e. just is the test target, ¯y is the mean of test target, and T is the total number of test data. This metric enables the comparison between different datasets since the error is normalized, with the score closer to 1 corresponding to better regression. This is the only metric used for evaluating nonBayesian surrogate, the following metrics are additional metrics for evaluating the Bayesian surrogate. Note that this metric is also used for tracking the performance of the training process, thus it is evaluated for both the training and test data sets. Root Mean Squared Error (RMSE):

This is a common metric for regression that is used in our experiments for monitoring the convergence of training. Mean Negative Log-Probability (MNLP):

This metric evaluates the likelihood of the observed data. It is is used to assess the quality of the predictive model.

Predictive uncertainty and Propagated Uncertainty: These metrics were introduced in Section 3.4.

Estimated Distributions: They include histograms or kernel density estimates for the output fields at certain locations of the physical domain.

Reliability Diagram: Given a trained Bayesian surrogate and a test data set, we can compute the p% predictive interval for each test data point based on the Gaussian quantiles using the predictive mean and variance [62]. We then compute the frequency of the test targets that fall within this predictive interval. For a well-calibrated regression model, the observed frequency should be close to p%. The reliability diagram is the plot of the observed frequency with respect to p. Thus a well-calibrated model should have a reliability diagram close to the diagonal.

4.3. Non-Bayesian Surrogate Model

The hyperparameters to search include the parameters that determine the network architecture and the ones that specify training process, which both affect model performance. We use Hyperband [63] algorithm to optimize those hyperparameters with a constraint that the number of model parameters being less than 0.25 million. The details of these experiments are given in Appendix A. The network configuration with the highest score that Hyperband finds is shown in Fig. 7 with more details provided in Table 1. This configuration is referred to as DenseED-c16. The 2nd–4th columns of Table 1 show the number of output feature maps, the spatial resolution of output feature maps, the number of parameters of each layer in the network.

Figure 7: DenseED-c16 with blocks (3, 6, 3), growth rate 16, and 48 initial feature maps after the first convolution layer (yellow in the figure). There are in total 19 (conv) layers and 241, 164 parameters in the network. The number of network parameters is optimized based on the generalization error analysis reported in Appendix B. It contains two down-sampling layers, thus the smallest spatial dimension (code dimension) of feature maps (purple in the figure) is 16 16, hence the network is named DenseED-c16. The first convolution kernel is of k7s2p2. The last transposed conv kernel in the decoding layer is of k5s2p1. The number of its output feature maps is 3 (corresponding to 3 output fields). For the decoding layers, the output padding is set to 1. The other conv kernels in dense blocks and encoding, decoding layers are described in Section 2.2.

The network DenseED-c16 is trained with Adam [64], a variant of stochastic gradient descent, with the loss function being regularized MSE which is implemented as weight decay in modern neural net frameworks, such as PyTorch and TensorFlow. Other loss functions may achieve better results,

Table 1: DenseED-c16 architecture for the Darcy flow dataset

such as smoothed loss, or conditional GAN loss [41]. This requires further investigations to be considered in future publication. The initial learning rate is 0.015, weight decay (regularization on weights) is 0.0005, the batch size is 16. We also use a learning rate scheduler which drops 10 times on plateau of the rooted MSE. The model is trained 200 epochs. We train the model with the dataset introduced in Section 4.1.

Training the deterministic neural networks with regularized MSE is equivalent to finding the maximum a posterior of the uncertain parameters in Bayesian neural networks whose prior is independent normal. The typical training process is shown in Fig. 8.

Figure 8: Training process of DenseED-c16 with 128 training data.

We train each network with different number of training data of KLE50, KLE500, and KLE4225. The validation -score is shown in Fig. 9, which shows that, with the same training data, the -score is closer to 1 when the intrinsic dimensionality is smaller, and the -score is higher with more training data of the same dimensionality. Note that the score is more than 0.9 with reasonably small size training data set for all the three cases which have dimensionality from 50 to 4225. This shows the effectiveness of the network DenseED-c16 for both low-dimensional and high-dimensional prob- lems.

Figure 9: Test scores for the non-Bayesian surrogate.

The prediction of the output fields can be easily obtained in the test time by feeding the test input permeability field into the trained network, i.e. ˆWe show the prediction of the test input shown in Fig. 6 using DenseED-c16, which is trained with three datasets (KLE50, KLE500, KLE4225) in Figs. 10, 11, and 12, respectively. The predictions are quite good even for the KLE4225 case, where both the input and output fields vary rapidly in certain regions of the domain.

4.4. Bayesian Surrogate Model

For all the experiments we only consider the homoscedastic noise model for Bayesian neural networks, i.e. output-wise Gaussian noise with Gamma prior on its precision , and Student’s t-prior on w.

The set of all uncertain parameters is denoted as SVGD to the Bayesian neural network with posterior set of deterministic model parameters DenseED’s and noise precision . In implementation, this corresponds to S different initializations for the deterministic DenseED and noise precision (a scalar). We update the parameters of S DenseED’s and the corresponding noise precision using the SVGD algorithm as in Algorithm 1. The kernel is chosen to be ), with median heuristic for the choice of the kernel bandwidth is the median of the pairwise distances between the current samples . We typically use

Figure 10: Prediction for the input realization shown in Fig. 6a from the KLE50 dataset using DenseED-c16 which is trained with datasets of sizes (a) 32 and (b) 128, respectively. In both subfigures, the first row shows the three test target fields (simulation output), i.e. pressure p and velocity , the second row shows the corresponding model predictions, the third row shows the error.

Figure 11: Prediction for the input realization as shown in Fig. 6b from KLE500 dataset using DenseED-c16 which is trained with datasets of sizes (a) 64 and (b) 256, respectively. In both subfigures, the first row shows the three test target fields (simulation output), i.e. pressure p and velocity , the second row shows the corresponding model predictions, the third row shows the error.

Figure 12: Prediction for the input realization as shown in Fig. 6c from KLE4225 dataset using DenseED-c16 which is trained with datasets of sizes (a) 128 and (b) 512, respectively. In both subfigures, the first row shows the three test target fields (simulation output), i.e. pressure p and velocity , the second row shows the corresponding model predictions, the third row shows the error.

to approximate the empirical measure of the posterior. For large number of training data, the unnormalized posterior is evaluated using mini-batches of training data, i.e. ˜). We observe that even for small training data such as 512, using smaller batch size (e.g. 16) helps to get lower training and test errors, but with more time for training. We use Adam [64] to update using the gradient , instead of the vanilla stochastic gradient descent, for 300 epochs, with learning rate 0, and a learning rate scheduler that decreases by 10 times when the training RMSE is on plateau.

The algorithm is implemented in PyTorch and runs on a single NVIDIA GeForce GTX 1080 Ti X GPU which requires about 2000 7000 seconds for training 300 epochs, when the training data size varies from 32 to 512. The training time depends heavily on the training mini-batch size, which is 16 for all cases. Potential ways to speed up significantly the training process include increasing the mini-batch size, or implementing the SVGD in parallel using multi-GPUs. The python source code will become available upon publication at https://github.com/bmmi/bayesnn.

We report next the -score computed similar to the non-Bayesian case, except the predicted output mean is used to compare with the test target. The scores are shown in Fig. 13. We can see that the Bayesian surrogate improves the -score significantly over the non-Bayesian version.

Figure 13: –scores comparison for the non-Bayesian and Bayesian models.

We also report the MNLP for test data in Fig. 14, which is a proper scor- ing rule and usually used to access the quality of predictive uncertainty [65].

Figure 14: MNLP.

The -score gives us a general estimate of how well the regression performs. For a given input permeability field, the Bayesian neural network can predict the mean of corresponding output fields, and also gives uncertainty estimate represented as predictive variance at each spatial location, which is unavailable for deterministic models, and desirable when the training data is small. In Figs. 15, 16, and 17, we show predictions for the test input shown in Fig. 6 with training data from KLE50, KLE500, and KLE4225, respectively. We can see that the predictive accuracy improves as the size of the training dataset increases, while the predictive uncertainty drops.

Figure 15: Prediction for the input realization shown in Fig. 6a from the KLE50 dataset. The rows from top to bottom show the simulation output fields (ground truth), predictive mean ], the error of the above two, and two standard deviation of predictive output distribution per pixel Var(). The three columns from left to right correspond to pressure field p, and two velocity fields , respectively.

Figure 16: Prediction for the input realization shown in Fig. 6b from the KLE500 dataset. The rows from top to bottom show the simulation output fields (ground truth), predictive mean ], the error of the above two, and two standard deviation of predictive output distribution per pixel Var(). The three columns from left to right correspond to pressure field p, and two velocity fields , respectively.

Figure 17: Prediction for the input realization shown in Fig. 6c from the KLE4225 dataset. The rows from top to bottom show the simulation output fields (ground truth), predictive mean ], the error of the above two, and two standard deviation of predictive output distribution per pixel Var(). The three columns from left to right correspond to pressure field p, and two velocity fields , respectively.

We also performed uncertainty propagation by feeding the trained Bayesian surrogate with 10, 000 input realizations sampled from the Gaussian field, and calculating the output statistics as in Section 3.4. In Figs. 18, 19 and 20, we show the uncertainty propagation results and compare with Monte Carlo using the 10, 000 UP data for the Bayesian surrogate trained with the datasets KLE50, KLE500, and KLE4225.

We show the estimate of pressure p, velocity components tions (0.85, 0.88), and (0.68, 0.05) on the unit square for 128, and 512 training data of KLE4225 in Fig. 21, and 22, respectively. The PDF is obtained by kernel density estimation using the predictive mean. We can see that the density estimate is close to the Monte Carlo result even when the training dataset is small, and becomes closer as the training dataset increases. From Fig. 22, we observe that the predictions for the velocity fields are better than the pressure field especially in locations away from the diagonal of the unit square domain, and this is in general the case for our current network architecture, where the three output fields are treated the same.

In order to access the quality of the computed uncertainty, we adopt the reliability diagram which expresses the discrepancy between the predictive probability and the frequency of falling in the predictive interval for the test data. The diagram is shown in Fig. 23. Overall our models are well-calibrated since they are quite close to the ideal diagonal, especially the case when the training dataset size is 128 as shown in Fig. 23d. In general the model turns to be over-confident (small predictive uncertainty) when the training data is small, and gradually becomes prudent (larger predictive uncertainty) when the training data increases. The main reason for this observation is that the predictive uncertainty is dominated by the variation seen in the training data, which is small when small data is observed. The initial learning rates and their scheduling scheme for network parameters w and noise precision may also play roles here since the uncertainty is determined by the optimum that that stochastic optimization obtained.

The training processes with different datasets are shown in Fig. 24. We can see that the training and test RMSE converges around 50 75 epochs of training for KLE50, and KLE500, but the convergence for KLE4225 seems to take longer time. The training dataset size is 256 for all three sets.

To empirically show that using 20 samples of the Bayesian neural nets for the SVGD algorithm is sufficient in our problem, we show in Fig. 25 the convergence of the test and training RMSE when we vary the number of samples.

Figure 18: Uncertainty propagation for KLE50: In (a) and (b) from top to bottom, we show the Monte Carlo output mean, predictive output mean ]], the error of the above two, and two standard deviation of conditional predictive mean Varthree columns from left to right correspond to pressure field p, and two velocity fields , respectively. In (c) and (d) from top to bottom, we show the Monte Carlo output variance, predictive output variance )], the error of the above two, and two standard deviation of conditional predictive variance Var)). The three columns are the same as (a) and (b).

Figure 19: Uncertainty propagation for KLE500: (a), (b), (c), and (d) refer to Fig. 18.

Figure 20: Uncertainty propagation for KLE4225: (a), (b), (c), and (d) refer to Fig. 18.

Figure 21: Distribution estimate for the pressure, and the two velocity components (from left to right) at location (0.85, 0.88) with Bayesian surrogate trained with KLE4225 dataset.

Figure 22: Distribution estimate for the pressure, and the two velocity components (from left to right) at location (0.68, 0.05) with Bayesian surrogate trained with KLE4225 dataset.

Figure 23: Reliability diagrams. Subfigures (a), (b), (c) show the reliability diagram for KLE50, KLE500, KLE4225, respectively. (d) shows the diagrams when the training dataset size is 128 for all three datasets.

Figure 24: Test and training RMSE for training of SVGD using 128 training data from KLE50, KLE500, and KLE4225.

Figure 25: Test and training error using SVGD to train the Bayesian NN with 128 training data of KLE500. As we increase the number of posterior samples for , i.e. the number of deterministic neural networks (and noise precision), both training and test errors drop and become steady. This plot empirically supports the reason we choose 20 model instances/samples for SVGD.

5. Conclusions

In this work, we explore to use an alternative Bayesian surrogate, i.e. a Bayesian neural network, to predict the output fields of a systems governed by stochastic partial differential equations with high-dimensional stochastic input. The approach is built on the highly expressive deep convolutional encoder-decoder networks. The end-to-end image-to-image regression avoids the usual linear dimensionality reduction step, and achieves promising results in terms of predictive performance and uncertainty modeling, even with limited training data.

We show in experiments that one network (DenseED-c16) works well across problems with different intrinsic dimensionality. We believe the performance at the small dataset domain of deep neural networks is due to its unique generalization property [66, 67] which effectively says the over-parameterized deep neural nets lack over-fitting.

The Bayesian model provides uncertainty estimates for our predictions thus accounting for epistemic uncertainty when training with small datasets. We show that Bayesian inference based on SVGD works well for training the Bayesian neural network, and presented results on uncertainty propagation tasks. The uncertainty of the trained model is well-calibrated by investigating the reliability diagrams.

There are three potential directions to improve the regression performance of non-Bayesian surrogate:

• Add skip connections between the encoder and decoder path in each level of feature map size. The reason is that the input permiability field directly affects the output velocity field pixel-wise through the equation ). Concatenate the features extracted in the encoder path to the decoder path may help to recover the discontinuity of the output velocity fields which are caused by the discontinuity in the input.

• Conditional GANs loss. The idea is that we should use a stronger loss function such as an adversarial classifier to enforce the model output to obey the discontinuity in the data. Similar situation has been explored in jet maps in high-energy particle physics [68].

• Using group convolution in the last decoding layer to separate the influences between the pressure field and the velocity fields in the end. This is to address the observation that the prediction performance for the velocity fields is better than for the pressure field using our current network architecture DenseED-c16.

For the Bayesian neural network, the inference task is genuinely difficult since it is asked to find the posterior of millions of random variables based on hundreds of training data. Exploring other priors for the network weights, and using recent advances in variational inference for Bayesian neural networks are definitely valuable directions to pursue.

The image-to-image regression approach can be used to handle the prediction of systems with different source terms, by adding the source field as another channel in the input besides the material property field. For implementation, one only needs to increase the number of input channels by 1, and leave everything else unchanged.

Our current network does not utilize any information from physics, such as the governing equations or constraints between the three output fields. Incorporating physics information into deep neural networks is a more principled ways to improve the model performance.

Acknowledgments

This work was supported from the University of Notre Dame, the Center for Research Computing (CRC) and the Center for Informatics and Computational Science (CICS). Early developments were supported by the Computer Science and Mathematics Division of ORNL under the DARPA EQUiPS program. N.Z. thanks the Technische Universit¨at M¨unchen, Institute for Advanced Study for support through a Hans Fisher Senior Fellowship, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under Grant Agreement No. 291763. The authors acknowledge Dr. Steven Atkinson from the CICS for generating and providing us the Darcy flow datasets.

Appendix A. Experiments on Darcy Flow Problem

We optimize the network hyperparameters following the general architecture as shown in Fig. 3. The encoding layer in Fig. 2(a) down-samples its input feature maps by 2 using convolution of stride 2 as in the first convolution layer. The decoding layer in Fig. 2(b) up-samples the feature maps by 2 using transposed convolution instead of pooling layers (as commonly seen in image classification networks) since the location information is critical for regression. The number of down-sampling and up-sampling layers are the same. The datasets are described in Section 4.1. We aim to find one general network architecture that works well across different datasets from both Bayesian and non-Bayesian network models.

Experiment 1 - Spatial dimension of feature maps at the coarsest scale: This is determined by how many down-sampling layers are used. Shrinking the spatial dimension of feature maps can extract high-level or coarse information of the input permeability field, which is subsequently used to predict the output pressure and velocity fields. This design choice is related to a central concept in CNNs called receptive field [69] of an unit within a certain layer, which is the region in the input image that affects this unit feedforward, or the region in the output image that affects this unit when back-propagating gradients. For dense prediction tasks such as our surrogate modeling problems, it is important for each pixel in the code feature maps to have the suitable receptive field in the input and output images. We observe that for the dataset generated with fewer KLE terms, both the input and output velocity fields are smoother, or the output has stronger correlation across pixels, such as KLE50 in Fig. 6a. But for KLE500 or KLE4225 in Figs. 6b and 6c, the fields vary more rapidly from pixel to pixel, but still there is weak long-range correlation between pixel values.

In our experimental setup we have used no skip connections, the convolution kernels were fixed to be k7s2p2 for the first Conv layer, k3s1p1 in the layers within dense block, k1s1p0 and k3s2p1 in the encoding and decoding layers, and k5s2p2 in the last ConvtT layer. No dropout was used and the growth rate of dense blocks was taken as 16.

The loss function is taken as the regularized MSE in Eq. (3), and the validation metric as the -score in Eq. (23). Adam optimizer is used for training 200 epochs, with learning rate 0.001 and a plateau scheduler on the test RMSE. Batch size is always smaller than the number of training data (e.g. 16, 32, 64 for dataset with 32, 64, 128 data, respectively). Weight decay is set to 0.0005.

We vary the dense block configurations, i.e. a list of integers specifying the number of layers within each dense block. This list contains odd number of integers because of the symmetry between encoding and decoding paths. For example, blocks (3, 6, 3) specify the network architecture in Fig. A.26a. K16L6 in the second dense block means there are 6 layers within the dense blocks, and the growth rate for each layer is 16. In this case, the code dimension (purple feature maps) is 1616 since there are two down-sampling layers in the encoding path.

The second block configuration is (12, ), i.e. only one dense block sits after the first Conv layer, as shown in Fig. A.26b or (2, 2, 4, 2, 2) as shown in Fig. A.27.

The validation -scores of the above three networks DenseED-c16, DenseED-c8, DenseED-c32 on test set (500 Monte Carlo samples from each KLE case) are summarized in Tables A.2 , A.3, A.4. From these tables, it can be seen that the 16 network works well for all three datasets. The network DenseED-c8 does not work well for the KLE4225 dataset, since its code dimension is too small. This enforces the output field to be too smooth, which is still favorable to the KLE50 dataset. We can also clearly note that the network DenseED-c32 does not work well for KLE50, even though it performs well for the KLE4225 dataset. Predictions and learning curve for the selected network design DenseED-c16 are presented in Section 4.3. Experiment 2 - Hyperparameter optimization with Hyperband: The hyperparameters to select include the ones specifying the network architecture and the training process. We separately optimize those two sets of hyperparameters with one of them fixed. Hyperband [63] is a bandit random search algorithm which has several rounds of successive halving. There are two input for this algorithm, i.e. R = 243, the maximum iterations for each hy-

Figure A.26: Two configurations DenseED-c16 and DenseED-c32 of DenseED.

Figure A.27: Configuration DenseED-c8 with blocks (2, 2, 4, 2, 2).

Table A.2: Test

Table A.3: Test

Table A.4: Test

perparameter configuration (here each iteration corresponding to one epoch of training) and 3 of the best configurations after running through all the candidate configurations.

The search space for network architecture with the search space specified in Table A.5 with the constraint that the number of parameters being less than 0.25 million, gives the network architecture presented in Table 1.

Table A.5: Network architecture hyperparameters and associated ranges for the DenseED-c16 network.

The search for the training process hyperparameters as in Table A.6 gives approximately the following hyperparameters: initial learning rate 0.002, weight decay 0.0005, batch size 16, and Adam optimizer.

Note that the hyperparameters found by Hyperband are only sub-optimal, but indicate the potential range. The actual hyperparameters used for training are presented in Section 4.3.

Table A.6: Training Hyperparameters and associated ranges for the DenseED-c16 network.

Appendix B. Generalization behavior

Several numerical experiments were conducted to show the generalization behavior of the network DenseED-c16. The network was trained with different block configurations while keeping the number of training data to be 256. The number of parameters ranged from 37, 892 to 805, 204. The training and test RMSE are shown in Fig. B.28. Clearly, the model is over-parameterized but still shows no overfitting behavior, i.e. the training error does not become smaller and the test (generalization) error does not go higher as we increase the number of parameters. When the number of parameters is less than 200, 000 (still in the over-parameterized regime), there is plenty of room to improve in both training loss and test loss. We con-figure the number of parameters of the baseline network DenseED-c16 as in Fig. 7 to have 241, 164 parameters, which is favorable according to the generalization curve in Fig. B.28.

Figure B.28: Non-overfitting of over-parameterized neural network DenseED-c16.

References

[1] I. Bilionis, N. Zabaras, Bayesian uncertainty propagation using Gaus- sian processes, Handbook of Uncertainty Quantification (2016) 1–45.

[2] I. Bilionis, N. Zabaras, B. A. Konomi, G. Lin, Multi-output separable Gaussian process: towards an efficient, fully Bayesian paradigm for uncertainty quantification, Journal of Computational Physics 241 (2013) 212–239.

[3] I. Bilionis, N. Zabaras, Multi-output local Gaussian process regression: Applications to uncertainty quantification, Journal of Computational Physics 231 (17) (2012) 5718–5746.

[4] D. Xiu, G. E. Karniadakis, The Wiener–Askey polynomial chaos for stochastic differential equations, SIAM journal on scientific computing 24 (2) (2002) 619–644.

[5] S. Torquato, Random heterogeneous materials: microstructure and macroscopic properties, Vol. 16, Springer Science & Business Media, 2013.

[6] L. J. P. van der Maaten, E. O. Postma, H. J. van den Herik, Di- mensionality Reduction: A Comparative Review, Tech. rep., Tilburg University (2009). URL http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_ Dimensionality_Reduction_files/TR_Dimensiereductie.pdf

[7] L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, Journal of Machine Learning Research 9 (Nov) (2008) 2579–2605.

[8] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, science 313 (5786) (2006) 504–507.

[9] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114.

[10] N. D. Lawrence, Gaussian process latent variable models for visuali- sation of high dimensional data, in: Advances in neural information processing systems, 2004, pp. 329–336.

[11] C. Grigo, P.-S. Koutsourelakis, Bayesian model and dimension reduc- tion for uncertainty propagation: applications in random media, arXiv preprint arXiv:1711.02475.

[12] Y. Bengio, Learning Deep Architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127. doi:10.1561/2200000006. URL http://dx.doi.org/10.1561/2200000006

[13] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.

[14] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.

[15] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.

[16] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation 18 (7) (2006) 1527–1554.

[17] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understand- ing deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530.

[18] G. K. Dziugaite, D. M. Roy, Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, arXiv preprint arXiv:1703.11008.

[19] J. N. Kutz, Deep learning in fluid dynamics, Journal of Fluid Mechanics 814 (2017) 1–4.

[20] S. Chan, A. H. Elsheikh, A machine learning approach for ef- ficient uncertainty quantification using multiscale methods, CoRR abs/1711.04315. arXiv:1711.04315. URL http://arxiv.org/abs/1711.04315

[21] J. Mar¸cais, J.-R. de Dreuzy, Prospective interest of deep learning for hydrological inference, Groundwater 55 (5) (2017) 688–692.

[22] S. Min, B. Lee, S. Yoon, Deep learning in bioinformatics, Briefings in bioinformatics 18 (5) (2017) 851–869.

[23] P. Baldi, P. Sadowski, D. Whiteson, Searching for exotic particles in high-energy physics with deep learning, Nature communications 5 (2014) 4308.

[24] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, J. Sohl-Dickstein, On the expressive power of deep neural networks, arXiv preprint arXiv:1606.05336.

[25] D. J. MacKay, A practical Bayesian framework for backpropagation networks, Neural computation 4 (3) (1992) 448–472.

[26] R. M. Neal, Bayesian learning for neural networks, Vol. 118, Springer Science & Business Media, 2012.

[27] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: Repre- senting model uncertainty in deep learning, in: international conference on machine learning, 2016, pp. 1050–1059.

[28] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncer- tainty in neural networks, arXiv preprint arXiv:1505.05424.

[29] Q. Liu, D. Wang, Stein variational gradient descent: A general pur- pose Bayesian inference algorithm, in: Advances In Neural Information Processing Systems, 2016, pp. 2378–2386.

[30] D. P. Kingma, T. Salimans, M. Welling, Variational dropout and the local reparameterization trick, in: Advances in Neural Information Processing Systems, 2015, pp. 2575–2583.

[31] J. M. Hern´andez-Lobato, R. Adams, Probabilistic backpropagation for scalable learning of bayesian neural networks, in: International Conference on Machine Learning, 2015, pp. 1861–1869.

[32] C. Louizos, M. Welling, Multiplicative normalizing flows for variational bayesian neural networks, arXiv preprint arXiv:1703.01961.

[33] C. Louizos, K. Ullrich, M. Welling, Bayesian Compression for Deep Learning, arXiv preprint arXiv:1705.08665.

[34] S. Chan, A. H. Elsheikh, A machine learning approach for effi- cient uncertainty quantification using multiscale methods, Journal of Computational Physics 354 (Supplement C) (2018) 493 – 511. doi:https://doi.org/10.1016/j.jcp.2017.10.034. URL http://www.sciencedirect.com/science/article/pii/ S0021999117307933

[35] G. Huang, Z. Liu, K. Q. Weinberger, L. van der Maaten, Densely con- nected convolutional networks, arXiv preprint arXiv:1608.06993.

[36] K. Simonyan, A. Zisserman, Very deep convolutional networks for large- scale image recognition, arXiv preprint arXiv:1409.1556.

[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.

[38] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[39] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.

[40] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: Advances in neural information processing systems, 2014, pp. 2366–2374.

[41] P. Isola, J. Zhu, T. Zhou, A. A. Efros, Image-to-Image Translation with Conditional Adversarial Networks, CoRR abs/1611.07004. arXiv: 1611.07004. URL http://arxiv.org/abs/1611.07004

[42] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.

[43] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., Conditional image generation with pixelcnn decoders, in: Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.

[44] A. v. d. Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel recurrent neural networks, arXiv preprint arXiv:1601.06759.

[45] R. K. Srivastava, K. Greff, J. Schmidhuber, Training very deep net- works, in: Advances in neural information processing systems, 2015, pp. 2377–2385.

[46] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, 2015, pp. 448–456.

[47] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.

[48] Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions, arXiv e-prints abs/1605.02688. URL http://arxiv.org/abs/1605.02688

[49] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[50] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convo- lutional encoder-decoder architecture for image segmentation, arXiv preprint arXiv:1511.00561.

[51] S. J´egou, M. Drozdzal, D. Vazquez, A. Romero, Y. Bengio, The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, IEEE, 2017, pp. 1175–1183.

[52] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, arXiv preprint arXiv:1510.00149.

[53] R. B. Gramacy, H. K. Lee, Cases for the nugget in modeling computer experiments, Statistics and Computing 22 (3) (2012) 713–722.

[54] D. A. Nix, A. S. Weigend, Estimating the mean and variance of the target probability distribution, in: Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference On, Vol. 1, IEEE, 1994, pp. 55–60.

[55] Q. V. Le, A. J. Smola, S. Canu, Heteroscedastic Gaussian process re- gression, in: Proceedings of the 22nd international conference on Machine learning, ACM, 2005, pp. 489–496.

[56] A. Kendall, Y. Gal, What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?, arXiv preprint arXiv:1703.04977.

[57] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (518) (2017) 859–877. arXiv:https://doi.org/10.1080/

01621459.2017.1285773, doi:10.1080/01621459.2017.1285773. URL https://doi.org/10.1080/01621459.2017.1285773

[58] Q. Liu, Stein variational gradient descent as gradient flow, in: Advances in Neural Information Processing Systems, 2017, pp. 3117–3125.

[59] Q. Liu, J. Lee, M. Jordan, A kernelized Stein discrepancy for goodness- of-fit tests, in: International Conference on Machine Learning, 2016, pp. 276–284.

[60] C. E. Rasmussen, C. K. Williams, Gaussian processes for machine learn- ing, Vol. 1, MIT press Cambridge, 2006.

[61] M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, G. N. Wells, The FEniCS Project Version 1.5, Archive of Numerical Software 3 (100). doi:10.11588/ ans.2015.100.20553.

[62] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable pre- dictive uncertainty estimation using deep ensembles, arXiv preprint arXiv:1612.01474.

[63] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, A. Talwalkar, Hyper- band: A novel bandit-based approach to hyperparameter optimization, arXiv preprint arXiv:1603.06560.

[64] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[65] J. Quinonero Candela, C. Rasmussen, F. Sinz, O. Bousquet, B. Sch¨olkopf, Evaluating predictive uncertainty challenge, in: Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, Max-Planck-Gesellschaft, Springer, Berlin, Germany, 2006, pp. 1–27.

[66] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Understanding deep learning requires rethinking generalization, CoRR abs/1611.03530. arXiv:1611.03530. URL http://arxiv.org/abs/1611.03530

[67] T. Poggio, K. Kawaguchi, Q. Liao, B. Miranda, L. Rosasco, X. Boix, J. Hidary, H. Mhaskar, Theory of Deep Learning III: explaining the non-overfitting puzzle, ArXiv e-printsarXiv:1801.00173.

[68] L. de Oliveira, M. Paganini, B. Nachman, Learning particle physics by example: Location-aware generative adversarial networks for physics synthesis, arXiv preprint arXiv:1701.05927.

[69] W. Luo, Y. Li, R. Urtasun, R. Zemel, Understanding the effective re- ceptive field in deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 4898–4906.

designed for accessibility and to further open science