A machine learning approach for efficient uncertainty quantification using multiscale methods

2017·Arxiv

Abstract

Abstract

1 Introduction

Uncertainty quantification is an important task in practical engineering where some parameters are unknown or highly uncertain (Elsheikh et al., 2012, 2013, 2015). After selecting adequate priors for the uncertain parameters, simulations are performed for a large number of realizations. In the particular case of reservoir simulations, the problem is further aggravated where very fine details of the geological models are needed (large number of cells) for accurate description of the flow. One traditional approach to address this problem is to upscale the geological models Babaei et al. (2013). Another more recent approach is to use multiscale methods (Jenny et al., 2003; Hou and Wu, 1997). In these methods, the global fine scale problem is decomposed into many smaller local problems. The solution of these smaller local problems results in a set of numerically computed basis functions which are then used to build a coarse system of equations. After solving the coarse system, an interpolation is performed with the basis functions to obtain the fine scale solution.

We note that in multiscale methods, a large number of local problems are solved under the same boundary conditions to obtain the required basis functions. This process is repeated for each geological realization in uncertainty quantification tasks. Our aim is to exploit the redundancy that may arise in solving these local problems for several geological realizations by introducing a data-driven approach for estimating the basis functions efficiently. Specifically, we exploit the large number of local problem solutions that become available after a given number of full runs to construct a computationally cheap function that generates approximate solutions to local problems, i.e. approximate basis functions. In effect, what we propose is a type of hybrid surrogate model by embedding a data-driven approach into the multiscale numerical method. In this work, we focus on one multiscale method called the Multiscale Finite Volume method (MsFV) introduced by Jenny et al. (2003). However, the proposed approach can be applied to any multiscale method where the explicit construction of basis functions is performed such as in the more recent multiscale method based on restriction-smoothed basis functions (MsRSB)

Aarnes and Efendiev (2008) introduced a multiscale mixed finite element (MsMFE) method for porous media flows with stochastic permeability field where a set of precomputed basis functions is constructed based on selected set of realizations of the permeability field. These basis functions are then used to build a low-dimensional approximation space for the velocity field. We note that the cost of solving the upscaled problem (i.e. coarse scale) in (Aarnes and Efendiev, 2008) increases with the number of selected set of realizations. In contrast, in this manuscript we directly address the generation of basis functions via a “black box” surrogate modeling approach using machine learning. The generated basis functions are then directly employed in the multiscale formulation without any further modification. Another major difference is that our method directly benefits from increasing the number of realizations used to build the surrogate model without any increase in the computational cost of solving the coarse scale problems for new realizations.

This paper presents the first attempt to combine/embed machine learning techniques within multiscale numerical methods with very promising results. The motivation for our work comes from the observation that computational power and data storage capacity are ever increasing. This trend is likely to continue for some time and results in an increased ability to store and data-mine large volumes of simulation data. Another source of motivation comes from the renewed interest in machine learning among the research community, specially in the branch of deep learning to tackle AI-complete tasks such as computer vision and natural language processing. Neural network models are regarded as universal function approximators (Hornik et al., 1989, 1990; Cybenko, 1989) with capacity to learn highly non-linear maps. Therefore, they seem to be suitable for our current application.

The rest of the paper is organized as follows: In section 2, we give a brief description of the multiscale finite volume method (MsFV) and neural networks (NN). In section 3, we present

Figure 1: A square domain with a fine discretization of size 15 15 (grey thin lines). A coarse 3 is defined on top of the fine grid (black bold lines). A is highlighted in green. The centers of each primal cell are marked with red dots. From these centers, the dual grid is defined (blue dashed lines). A is highlighted in light red.

the methodology for the proposed approach for machine learning the basis. In section 4, we

examine the effectiveness of the presented method for uncertainty quantification in two test cases. Finally, in section 5 we report the conclusions of this work along with a brief discussion of future directions.

2 Background

In this section we briefly describe the two main components of the proposed method: multiscale finite volume (MsFV) methods and neural networks (NN) for surrogate modelling. A number of variants of the MsFV method have been proposed since its introduction in (Jenny et al., 2003). In this manuscript, we employ the MsFV method as described in (Lunati and Jenny,

2.1 Multiscale finite volume method

We consider an elliptic equation describing the pressure field

where p denotes the fluid pressure, q denotes fluid sources, and K denotes the permeability tensor. Discretizing Eq. (1) by the finite volume method results in a system of equations of the form Ap = q, which for some applications (such as reservoir simulation) tends to be extremely

Figure 2: Shown in blue is the support region of basis function corresponding to coarse node 5 (green cell). Basis function is obtained by solving the local problems in Eq. (2) for i = 5, then . In this example, only basis function (see Section 3).

Figure 3: Illustration of a local solution (partial basis function) and a basis function within its support region.

large. The MsFV method tackles this problem by constructing and solving a much coarser system of equations , the solution of which is then used to obtain an approximation of p by interpolation. For this, it relies on a set of basis functions which are obtained by solving local problems. In this sense, the method slightly resembles the finite element method, except that the basis functions employed are not analytic functions but numerically computed functions.

The method begins with the definition of a pair of overlapping coarse grids, namely the primal grid and the dual grid, as shown in Figure 1. In principle, the primal grid can be any coarse partition defined over the fine grid. Next, we define the coarse nodes as the fine cells at the centers of each primal cell. Lastly, the dual grid is defined by the lines connecting these coarse nodes. We denote the primal cells with Ω, and the dual cells with

Ω

A set of (partial) basis functions are obtained by solving the local problems

where denotes the (partial) basis function on dual cell Ω(see Figure 3a) associated with coarse node denotes the coordinate of coarse node denotes the tangential component over the dual cell boundary . In the 2D case, this means solving 1D problems over , the solutions of which become the boundary conditions for the 2D problem on Ω

Another component of the MsFV formulation employed are the correction functions, ob-

tained by solving the local problems

where ˆdenotes the correction function on dual cell Ω

Once the basis and correction functions are obtained, we approximate the fine scale pressure

p as an interpolation of coarse scale pressure values plus a correction term

Substituting Eq. (4) in Eq. (1), and applying finite volume discretization over the primal grid, we get the coarse system of equations

In the current work, it is more convenient to express the basis and correction functions in a global point of view (see Figure 3b) by writing

Figure 4: Representation of a 1-hidden layer neural network as a graph. The first column of nodes (from left to right) is the input layer, taking inputs ) of dimension , and the last column is the output layer with output ) of dimension The intermediate column is the hidden layer. Each line connecting two nodes represents a multiplication by a scalar weight.

basis functions, we actually only need to sum over supporting dual cells, i.e. dual cells that are associated with the corresponding node). Then, Eq. (4) has a simpler interpolation expression of the form:

In the case that the pressure solution will be utilised to drive a transport problem at the fine scale, a flux reconstruction step consisting of solving additional local Neumann problems is

2.2 Feedforward neural networks for surrogate modelling

A feedforward neural network f is a function consisting of a compositional chain of simpler functions, i.e. is said to have is the first layer, the second layer, etc. The first layer is also called the input layer, the last layer the output layer, and the layers in between are called hidden layers. The size or number of units of a layer refers to its output dimension, i.e. . Neural networks are typically represented as a graph as shown in Figure 4.

The input layer encompasses any pre-processing of the data, while the output layer may take many forms depending on the learning task. For example, in classification tasks, the softmax function is normally employed where the output layer size (number of units) equals the number of classes, and the softmax output represents the probability of the input sample belonging to each class. This is a way of embedding prior information of the learning task into the model. For regression tasks, the identity function is normally employed, but other functions can also be chosen to embed any prior information of the output space.

A conventional hidden layer is an affine transformation followed by an element-wise nonlinear function or Some popular choices for hyperbolic tangent, the sigmoid function, and more recently, the rectified linear unit (Jarrett

2.2.1 Neural network optimization

Once the network architecture is defined (number of layers n, layer sizes , activation func- tions , etc.), the optimization problem is to find ] such that f best describes the observations. Let be the set of observations used to train the model (the training set), then the minimization problem is formulated as:

where . This problem can be solved using gradient-based algorithms such as gradient descent:

where . The derivative can be obtained with numerical differentiation schemes. In neural networks, this is typically the backpropagation algorithm (Rumelhart et al.,

More recent gradient-based algorithms improve over gradient descent by offering adaptive learning rates such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014). The basic idea in these methods is to use a separate learning rate for each scalar parameter, and adapt these rates throughout the training process based on the historical values of the partial derivatives with respect to each parameter. The initial global learning rate is a tunable hyperparameter.

A recent important development regarding neural network optimization is batch normalization (Ioffe and Szegedy, 2015), which has shown to significantly speed up the optimization process. The method consists of adaptive reparametrization of inputs to each activation function. This is to address the simplifying assumptions made during the optimization where the gradients update each parameter assuming that the other layers do not change. In essence, the values after the affine transformation in a layer are normalized by the mean and standard deviation before being fed into the layer activation function.

2.2.2 Regularization

The optimization of from Eq. (6) alone yields a model that is prone to overfitting, i.e. it does not necessarily perform well for samples not seen in the training set. Hence, validation assessment is necessary where a separate set of samples that are not used in optimizing Eq. (6), called the validation set, is employed to assess the accuracy of the model. One simple regularization technique is early stopping, where the model is assessed after each update (or number of updates) of ); when the cost function on the validation set begins to increase, the optimization is early stopped. This is the stopping criteria in neural network optimization, which differs from conventional optimization where the criteria is generally based on the gradient norm. Another set of regularization techniques are parameter norm penalties, where an additional term is added to the cost function:

For example, for parameter norm, Ω(. The additional parameter is a regular- ization hyperparameter that is chosen using the validation set.

More recently, Dropout (Srivastava et al., 2014) has shown to be a very effective regularization technique that approximates model averaging in neural networks. The technique consists of randomly dropping out units of the network during the optimization iteration, by which the optimizer “sees” a number of different “models” in the process. Since traditional model averaging in neural networks is usually intractable, this approach serves as a proxy for averaging an exponential number of models. In practice, a dropout rate is chosen which indicates the probability of a unit being dropped out. This hyperparameter is tuned using a validation set. The authors in (Srivastava et al., 2014) suggested using a max-norm constraint along with dropout, which consists of constraining the norm of some of the weights by a fixed constant c, tuned with a validation set. This allows for more aggressive optimization search without the possibility of weights blowing up.

2.2.3 Architecture design and hyperparameter tuning

The design of the network architecture, i.e. defining the number of layers, the number of units for each layer, activation functions, regularizers, etc. is not a straightforward task. In principle, one can consider these parameters as additional hyperparameters and tune them using a validation set. However, hyperparameter optimization is an expensive task given that the objective function (performance on the validation set) is non-linear and non-differentiable with respect to the hyperparameters. In practice, heuristics and expertise are heavily employed in the design process to reduce the number of hyperparameters. Nevertheless, general guidelines do exist for the design of neural network models. The following is a compilation of guidelines

– Begin with a few number of layers and units, and well-tested optimizers and regularizers.

– Start with as few hyperparameters as possible to enable quick manual search to obtain

– Overfit, then regularize: increase the number of layers/units to overfit the training set,

– Regarding the choice of activation functions, the current default recommendation is to

– Whether to use few large layers, or many small layers is an open debate. It is generally

– Early stopping should almost always be employed.

– The learning rate is very influential in the model performance and should be fine-tuned.

– If there is an architecture that performs well for a similar task, use it as the base archi-

Once the general architecture is selected, key hyperparameters should be optimized. Traditional techniques such as grid search and random search are normally prohibitive. In the current manuscript we employed the Tree-Parzen Estimator (Bergstra et al., 2011), a sequential model-based hyperparameter optimization approach where a model is sequentially constructed to approximate the performance of hyperparameters based on historical measurements, and then subsequently choose new hyperparameters to test based on this model. Other hyperparameter optimization techniques include Bayesian optimization for neural networks (Snoek et al., 2012) and Hyperband (Li et al., 2016).

3 Methodology

We first introduce some terminology to simplify the presentation. A basis function is interior if its support region does not touch the domain boundary (see Figure 2). For practical purposes, we limit the learning process to interior basis functions, i.e. we build a predictor to generate

Figure 5: Workflow of the proposed method.

interior basis functions while computing the remaining basis functions as usual from the local problems defined by Eq. (2). In practice, this should not be a major concern given that the number of interior basis functions is normally much larger than the number of non-interior basis functions (basis functions on the edges and vertices of the domain). In any case, it is pretty straightforward to train additional predictors for the remaining types of basis functions.

We define a as the cropped region of the permeability field K that corresponds to the support region of a basis function . In the learning framework, the permeability patches are our inputs, and the corresponding basis functions are our outputs. In practice, it is more convenient to work with the log-permeabilities, i.e. log

For clarification, suppose we have an 81 81 Cartesian grid, over which we defined a 9 primal grid. Then there are 7 7 = 49 interior basis functions, each with array size (number of fine cells in the support region) of 19 19. The array size of each input permeability patch is 19 = 1 for isotropic fields, and d = 2 in the anisotropic case.

The method we propose aims to speedup uncertainty quantification studies where multi-scale methods are employed in the propagation task. In particular, we focus on the use of the MsFV method to solve Eq. (1) for large number of realizations of the permeability field K. Our method however could be applied to any multiscale method where the explicit generation of the basis functions is performed. Consider an uncertainty propagation task of solving Eq. (1) for be the number of full MsFV runs that we can afford. For each , the MsFV simulation delivers a set of basis functions with their corresponding permeability patches. The union of these sets provides the learning dataset . This dataset is used to train a predictor model that maps from permeability patches to basis functions that has an evaluation cost that is much cheaper than solving the local problems. The model we employed here is a neural network. Once the model is trained, it is used to predict the basis functions in the subsequent runs for . The end result is a hybrid approach where the MsFV formulation is modified to obtain the basis functions through a data-driven model instead of being computed from the local problems. Figure 5 summarises the workflow of the proposed method.

To ensure the partition of unity property of the basis functions (which is not necessarily ful-filled in the learning model), we perform a post-processing step on the generated basis functions as follows:

We note that the presented method benefits from the use of structured grids with coarse cells of same size. In the case of structured grids with cells of different sizes, the patches could be scaled to a unique input size. This is a standard preprocessing step in computer vision to handle images of different sizes. We also note that handling unstructured grids is not a straightforward task and is beyond the scope of the current manuscript.

3.1 Implementation and computational aspects

The computational gain of the proposed method is achieved by replacing the solution of local problems by a constant number of matrix-vector multiplications followed by element-wise function evaluations. For a network of input and output sizes n and hidden layers of size m, this means an evaluation complexity of O(mn). The constants associated with the evaluation cost will depend on the number of layers of the network.

Efficient algorithms for solving systems of n linear equations exist where complexities of O(n log n) (FFT) or even O(n) (multigrid methods) are achieved, however the constants associated with these algorithms are large, usually requiring a large value of n to be economical. This is in contrast to the MsFV approach where small local problems are preferred. Likewise, there exist efficient algorithms to perform matrix-vector multiplications that are only justified in practice when the matrices and vectors are very large.

Regarding the training of the neural network, this is performed in an offline phase as with other surrogate modelling techniques, and the justification of the cost will depend on the particular uncertainty quantification task at hand. The larger the uncertainty quantification task, the larger the time budget that can be assigned to the surrogate modelling process. As a practical note, it is worth mentioning that due to the surge of interest in neural networks and AI in general, efficient implementations have been intensively developed in recent years, both from the software and hardware sectors. Indeed, dedicated hardware devices are currently being released for various neural network implementations.

3.2 Other machine learning techniques

The proposed algorithm is not limited to neural networks and other traditional machine learning techniques are indeed applicable such as Gaussian processes and support vector machines to model the basis function predictor. A number of reasons led us to choose the neural network model. First, neural networks are universal approximators (Hornik et al., 1989), meaning that they can fit any measurable function with arbitrary accuracy. This practically covers any function encountered in engineering applications. Secondly, neural networks scale very well with the size of the dataset, in contrast to Gaussian processes and support vector machines. This is desirable where the trends of the cost of numerical simulations and data storage are ever decreasing. Lastly, research in neural networks is characterized by a remarkably large and evolving body of literature from which we can benefit in the near future. Advances in the field is specially strong in problems related to computer vision, which shares a lot of features with our work.

4 Numerical experiments

We consider the task of solving Eq. (1) over a unit square [0. The domain is discretized into 81 81 fine grid, with a primal coarse grid of 9 9. For estimating the statistical distributions, we utilize M = 1000 realizations of isotropic log-permeability fields generated

assuming a zero mean gaussian random field with an exponential covariance of the form

where denotes the Euclidean norm. We choose 0, and we investigate three values for the correlation length: L = 0.1, 0.2, and 0.4.

4.1 Learning process

We assume a budget of m = 20 full MsFV runs, obtaining a dataset of 980 samples (since each run yields 49 samples). The array sizes of the inputs (permeability patches) and outputs (basis functions) are 19 19. We set aside 20% of the dataset for validation (this should be done at the level of the realizations, i.e. samples generated from 4 full MsFV runs).

The architecture employed is a fully connected network with 1-hidden layer of size 1024 and ReLU activation function. Naturally, the input and output layers are of size 19

Table 1: -scores on different permeability types

Figure 6: Performance of basis function predictor, case L = 0.1

Figure 7: Performance of basis function predictor, case L = 0.2

Figure 8: Performance of basis function predictor, case L = 0.4

361, matching the size of the permeability patch and basis function. Additionally, we employ the hard sigmoid function as the activation of the output layer. This is to embed the prior knowledge that basis functions take values between 0 and 1. The hard sigmoid is the function Although this choice of function is usually problematic for gradient-based optimizations, it gave good results when coupled with dropout and batch normalization.

To train the network, the gradient-based optimizer Adam (Kingma and Ba, 2014) seemed more robust during our trials. The initial learning rate was set to . For regularization, a dropout rate of 5% after the hidden layer, and a max-norm constraint of 4 have proven useful. Additionally, early stopping is employed. All these hyperparameter values were chosen based on default recommendations along with some manual explorations.

A convenient metric employed to report the performance of a trained model is the coefficient of determination (or

where denotes the is the trained model, ˆpredicted basis functions, basis functions, and ¯

sample mean of the true basis functions. A score of 1.0 corresponds to perfect prediction, while

a score below 0 means that the predictor performance is worse than a model that always predicts the sample mean. Table 1 shows the validation scores obtained on the three permeability types considered, i.e. correlation lengths L = 0.1, 0.2 and 0.4. Figures 6, 7 and 8 show some of the predicted basis functions for cases L = 0.1, 0.2 and 0.4, respectively. We see that the prediction is more challenging for the case of shortest correlation length. This is likely due to the permeability field being more heterogeneous for shorter correlation lengths.

4.1.1 Predictor uncertainty

Error estimations of the predicted basis functions might be of interest to fully quantify the uncertainties in the results. Such estimations are readily available in machine learning models such as Gaussian processes. For neural networks, a number of methods such as Bootstrap aggregating (bagging) and others as discussed in (Tibshirani, 1996) could be employed. Another possibility is to employ the dropout technique as a Bayesian approximation method (Gal and Ghahramani, 2016). In our work, we consider the uncertainties in the predicted basis functions to be of second order and the presented numerical results support this assumption.

4.2 Hybrid model

Once the neural network model is trained, we can use it to compute the basis functions in the MsFV formulation. To assess the effectiveness of this hybrid approach (MsFV-NN), we consider two test cases:

In both cases, a pressure value of 0 is imposed at the center of the square to close the problem. The pressure Eq. (1) is solved using three methods: a standard cell-centered finite volume method at the fine-scale level which is taken as the reference “true” solution, the standard MsFV method, and the proposed hybrid method (MsFV-NN). Additionally, we also compute and compare the total velocities, which can be derived from the corresponding pressure solutions. In the reference solution, the total velocity can be derived using Darcy’s law (is the total mobility, here assumed as = 1), whereas in the MsFV and MsFV-NN methods, the total velocities are derived via a flux reconstruction step, as mentioned before. We take a further step and use the total velocity to solve a tracer flow problem. In this case, we solve the following advection equation:

where c denotes the concentration of the injected fluid (in this case water), denotes the domain porosity, denotes sources/sinks of the injected fluid, and denotes the density of the injected fluid. In all cases, we assume water with dimensionless density of = 1 that is injected into a reservoir with constant porosity 2 initially containing only oil, i.e. c(x, t = 0) = 0, which we assume to have the same viscosity as the injected fluid. Under these conditions the total velocity v does not change in time. The simulation time for both test cases is from t = 0 until t = 0.4. In reservoir engineering, it is more convenient to work with pore volume injected (PVI) as the time unit, which expresses the ratio of the total volume of fluid injected until time t and the reservoir pore volume (for constant injection, is the pore volume).

Figures 9 and 10 show sample solutions for one realization of correlation length L = 0.1, for the two test cases considered. We also show the contour plot of the difference between the reference and MsFV, the reference and MsFV-NN, and MsFV and MsFV-NN.

Figure 9: Quarter-five spot problem: Sample solution for one realization based on the

Figure 10: Uniform flow problem: Sample solution for one realization based on the reference (standard FVM), MsFV and MsFV-NN.

Table 2: Results of hyperparameter tuning.

4.2.1 Comparison of errors

The errors of the solutions (pressure, velocity, concentration) of MsFV and MsFV-NN are measured with respect to the reference solution using an area weighted norm. Let be a vector of values corresponding to cells Ωbe the area of cell i, we define the area weighted norm as . Using this notation, the pressure error (), the velocity error (), and the concentration error (

Figures 11 and 12 show scatter plots of the errors obtained by the MsFV and MsFV-NN. As expected, a better predictor performance (in terms of the -score) corresponded to a better correlation between both errors.

4.2.2 Estimated distributions

Finally, we compare all three methods in an uncertainty quantification task where we estimate the pressure p at (1/4, 1/4), the total production Q, and the water breakthrough time when water fraction reaches 1% at the production well).

Figures 13 and 14 show the estimated distributions according to each method. We can see that the distributions given by MsFV and MsFV-NN are almost indistinguishable even for the less accurate predictor (L = 0.1). From these results, it is clear that the effectiveness of the hybrid model is attached to the effectiveness of the target model (MsFV). The hybrid model is expected to perform well as long as the target model itself serves as a good proxy to the “true” solution.

4.3 Hyperparameter tuning

In this section, we show how to further improve the learning performance by fine-tuning the model with a hyperparameter optimization algorithm. Specifically, we employ the Tree-Parzen

Figure 11: Quarter five spot problem: Comparison of errors in MsFV and MsFV-NN.

Figure 12: Uniform flow problem: Comparison of errors in MsFV and MsFV-NN.

Figure 13: Quarter five spot problem: Estimated distributions by MsFV (orange dashed line), MsFV-NN (green dotted line), and reference (blue line).

Figure 14: Uniform flow problem: Estimated distributions by MsFV (orange dashed line), MsFV-NN (green dotted line), and reference (blue line).

Figure 15: Comparison of errors in MsFV and MsFV-NN (tuned model, L = 0.1). Improvements can be seen with respect to Figures 11(a) and 12(a).

Estimator algorithm. We shall consider the case of L = 0.1 where the trained predictor performed with a score of 0.927.

Previously, a dropout rate of 5% and a default learning rate of 10have been fixed. Here we let the hyperparameter optimization algorithm tune the values for the dropout rate and the learning rate. Moreover, we also employ batch normalization to enhance the optimization process.

Table 2 summarizes the hyperparameter optimization results under a budget of 20 iterations where we observe significant improvement in the -score. Figure 15 shows the error scatter plot for the resulting hybrid model. An improvement in the correlation is observed (please compare with Figures 11 and 12). Figure 16 compares the estimated distributions for the quantities of interest where a strong agreement between the data-driven approach and the MsFV is observed. Finally, Table 3 presents summary statistics of the results obtained. Overall, we see that there are improvements in both the errors and the estimated distributions when the learning performance increases, as is expected. Of course, even more improvements can be achieved by further tuning the model, for example by increasing the number of iterations of the hyperparameter optimization, or by employing additional tools such as regularizers.

Figure 16: Estimated distributions (tuned model, L = 0.1) by MsFV (orange dashed line), MsFV-NN (green dotted line), and reference (blue line). Compare with Figures 13(a) and 14(a).

Table 3: Summary statistics and point estimates (L = 0.1).

Table 4: Time to generate 1000 basis functions using different methods.

4.4 Computational gains

For an estimate of the computational speedup provided by the proposed method, we compared the time taken to generate 1000 basis functions using the predictor vs. the standard approach of solving local problems. Since the MsFV method obtains the basis functions by solving local problems which involve many intermediary steps, and this could lead to data-derived overheads which are implementation-dependent, we decided to measure the time of solving the four local 2D problems only, i.e. without accounting for the overheads of getting the local boundary conditions (which are obtained by solving the 1D problems) and assembling the local matrices. We employed two solvers for the local problems: GMRES iterative solver and UMFPACK direct solver, both highly optimized C-compiled packages provided in numpy/scipy.

Table 4 summarises the run times obtained. These results were obtained using one thread (except for the last row which is run on GPU). Here, “batch eval” refers to the prediction of the N basis functions “at once”: for a given input vector , the predictor performs a matrix-vector multiplication on . But this can be implemented as a matrix-matrix multiplication simply by building the matrix K whose columns are the vectors , allowing for additional numerical optimizations.

In this case, we see that the direct solver outperformed the iterative solver for the local problems since the local matrices are small, which is the common scenario in multiscale methods. We also see that the data-driven approach clearly outperforms the solver component, and if we add the overheads of solving the 1D problems plus the local matrix assembly, the computational advantage will be amplified. We note however that these times can vary depending on the implementation. In particular, different neural network architectures and different solvers for the system of equations may yield different times. Nevertheless, it is unlikely that solving the local problems will outperform a forward pass of a neural network, i.e. direct matrix-vector computations.

5 Conclusions and remarks

We have seen that for the presented subsurface flow problems, shallow neural networks performed very well as a simple surrogate for the computation of basis functions in the multiscale finite volume method. Further, we draw the following remarks:

– Results obtained for uncertainty propagation using MsFV and the proposed MsFV-NN

– The proposed method is applicable to any multiscale method where the sub-grid scales

– The proposed method is scalable with large coarse partitions (since more data samples

In addition, we note that if the data distribution remains unchanged (or is similar to that of the training data), then the same trained predictor can be used for different problem conditions (for example, to perform well location optimization), and further computational gains can be achieved since we avoid training a new predictor. This is the situation in cases such as steady state flow or tracer flow.

We have presented the first application of machine learning to capture sub-grid scale heterogeneities within a multiscale method. As our next step, we aim to study the application of the presented method for multiphase flow in porous media. Other possible research directions include extensions to more general permeability fields with anisotropy and channelized structures.

References

Jørg E Aarnes and Yalchin Efendiev. Mixed multiscale finite element methods for stochastic porous media flows. SIAM Journal on Scientific Computing, 30(5):2319–2339, 2008.

Masoud Babaei, Ahmed H. Elsheikh, and Peter R. King. A comparison study between an adaptive quadtree grid and uniform grid upscaling for reservoir simulation. Transport in Porous Media, 98(2):377–400, Jun 2013. ISSN 1573-1634. doi: 10.1007/s11242-013-0149-7. URL https://doi.org/10.1007/s11242-013-0149-7.

Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural networks: Tricks of the trade, pages 437–478. Springer, 2012.

James S Bergstra, R´emi Bardenet, Yoshua Bengio, and Bal´azs K´egl. Algorithms for hyper- parameter optimization. In Advances in Neural Information Processing Systems, pages 2546– 2554, 2011.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Ahmed H. Elsheikh, Matthew Jackson, and Tara Laforce. Bayesian reservoir history matching considering model and parameter uncertainties. Mathematical Geosciences, 44(5):515–543, Jul 2012. ISSN 1874-8953. doi: 10.1007/s11004-012-9397-2. URL https://doi.org/10. 1007/s11004-012-9397-2.

Ahmed H. Elsheikh, Mary F. Wheeler, and Ibrahim Hoteit. Nested sampling algorithm for subsurface flow model selection, uncertainty quantification, and nonlinear calibration. Water Resources Research, 49(12):8383–8399, 2013. ISSN 1944-7973. doi: 10.1002/2012WR013406. URL http://dx.doi.org/10.1002/2012WR013406.

Ahmed H. Elsheikh, Vasily Demyanov, Reza Tavakoli, Mike A. Christie, and Mary F. Wheeler. Calibration of channelized subsurface flow models using nested sampling and soft probabilities. Advances in Water Resources, 75:14 – 30, 2015. ISSN 0309-1708. doi: http://dx.doi.org/10.1016/j.advwatres.2014.10.006. URL http://www.sciencedirect.com/ science/article/pii/S0309170814002085.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050– 1059, 2016.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http: //www.deeplearningbook.org.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an un- known mapping and its derivatives using multilayer feedforward networks. Neural networks, 3(5):551–560, 1990.

Thomas Y Hou and Xiao-Hui Wu. A multiscale finite element method for elliptic problems in composite materials and porous media. Journal of computational physics, 134(1):169–189, 1997.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kevin Jarrett, Koray Kavukcuoglu, Yann Lecun, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pages 2146–2153. IEEE, 2009.

P Jenny, SH Lee, and HA Tchelepi. Multi-scale finite-volume method for elliptic problems in subsurface flow simulation. Journal of Computational Physics, 187(1):47–67, 2003.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hy- perband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Ivan Lunati and Patrick Jenny. Multiscale finite-volume method for compressible multiphase flow in porous media. Journal of Computational Physics, 216(2):616–636, 2006.

Ivan Lunati and Patrick Jenny. Multiscale finite-volume method for density-driven flow in porous media. Computational Geosciences, 12(3):337–350, 2008.

Olav Møyner, Knut-Andreas Lie, et al. A multiscale method based on restriction-smoothed basis functions suitable for general grids in high contrast media. In SPE Reservoir Simulation Symposium. Society of Petroleum Engineers, 2015.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi- nov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Robert Tibshirani. A comparison of some error estimates for neural network models. Neural Computation, 8(1):152–163, 1996.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012.

designed for accessibility and to further open science