Efficient Data-Driven Geologic Feature Detection from Pre-stack Seismic Measurements using Randomized Machine-Learning Algorithm

2017·Arxiv

1 INTRODUCTION

1 INTRODUCTION

It is challenging to analyze and interpret seismic measurements for identifying prospective geological features. The challenges arise from processing of large volumes of seismic data and subjective human factors. Different geologic features play different roles in characterizing the subsurface structure. Since geologic fault is one of the most interesting features in subsurface characterization, we use that as the target to demonstrate the efficacy of our new machine-learning based geologic feature detection method. Geologic fault zone is essential to various subsurface energy applications. In geothermal exploration, geologic faults provide important information for siting the drilling wells. In carbon sequestration, geologic faults can be critical to monitor the potential leaks of stored CO. In oil & gas production, geologic faults are used to signal reservoir boundaries or hydrocarbon traps.

In current seismic exploration, most subsurface characterization techniques are physics dominated, meaning that the governing physics equations are well understood and utilized to describe the underlying physics of the problems of interest. A well known examples of this is the seismic full-waveform inversion (FWI) (Lin & Huang 2015a,c; Virieux & Operto 2009). In FWI, an inverse problem is formulated to connect the measurements and the governing physics equations. Numerical optimization techniques are utilized to solve for the subsurface models. Similar framework and procedures can be applied to many other techniques such as seismic imaging (Zhang et al. 2015; Lin & Huang 2015b), tomography (Lin et al. 2015; Rawlinson & Sambridge 2014), etc. With the numerical model built up from the optimization, human intervention is involved to further interpret and finalize the acceptable models. Even though those conventional methods have been shown great success in many applications, in some situations they can be limited because of poor data coverage, computational inefficiency, and subjective human factors. A robust, efficient, and accurate subsurface characterization method is therefore needed to address those needs.

Given the advances in data science and machine learning technologies, there has been a recent surge in utilizing automated data-driven algorithms to detect subsurface geologic features (Schnetzler & Alumbaugh 2017; Araya-Polo et al. 2017; Guillen 2015; Zhang et al. 2014; Hale 2013; Ramirez & Meyer 2011). In seismic applications, those methods can be categorized into either learning-from-data or learning-from-model. The major differentiation between these two types of methods are whether the learning algorithms are employed on the pre-stack or post-stack seismic data sets. Most of the current existing work of machine learning methods for seismic applications are based on the post-stack seismic data sets, meaning migrated or inverted models need to be obtain prior to the usage of machien learning techniques. In Guillen (2015), migration imaging models are first obtained from seismic data sets. Supervised learning technique is then applied to the imaging model to automatically detect the salt body. Similarly, in Hale (2013), seismic image is first computed before the estimation of the geologic fault location. Despite the success of those methods, there are limitations. The success of the prediction heavily relies on the resulting seismic image or model obtained from the data sets. To avoid this limitation, another type of learning method has been recently proposed and developed, i.e., learning from pre-stack seismic data directly. In the work of Araya-Polo et al. (2017) and Zhang et al. (2014), supervised learning methods are directly applied to the pre-stack seismic data to look for patterns which indicates the geologic features. Specifically, in Araya-Polo et al. (2017) deep neural network was utilized to seismic data sets to obtain geologic faults. In Zhang et al. (2014), kernel regression was used to learn a mapping between seismic data and geologic faults. In both paper, promising results have been reported.

In this work, our novel geologic feature detection is belong to the learning-from-data category, meaning our algorithm is to detect geological features from pre-stack seismic data sets. Through our experiments, we notice that despite of the success of those existing learning-from-data methods in controlled experiments, they are limited by their computational efficiency, mostly due to the need to process large volumes of high-dimensional data. Consequently, none of the existing solutions are suitable for real-time or even near real-time detection. In typical exploratory geophysics applications, strongly rectangular data arise, which implies that the number of receivers is much smaller than the number of data points that each receiver collects. Hence, we develop a scalable geologic feature detection technique by utilizing tools from randomized linear algebra allowing computational efficient geological feature detection.

Randomized matrix approximation methods enable us to efficiently deal with large-scale problems by sacrificing a provably trivial amount of accuracy (Drineas & Mahoney 2016). Broadly, the underlying idea is to perform dimensionality reduction on the large-scale matrix without losing information pertinent to task under consideration. Their main idea is to construct a sketch matrix of the input matrix. The sketch matrix is usually a smaller matrix that yields a good approximation and represents the essential information of the original input. In essence, a sketching matrix is applied to the data to obtain a sketch that can be employed as a surrogate for the original data to compute quantities of interest (Drineas & Mahoney 2016). Randomized algorithms have been successfully applied to various scientific and engineering domains, such as scientific computation and numerical linear algebra (Meng & Mahoney 2014; Drineas et al. 2011; Lin et al. 2010; Rokhlin & Tygert 2008), seismic full-waveform inversion and tomography (Moghaddam et al. 2013; Krebs et al. 2009), and medical imaging (Huang et al. 2016; Wang et al. 2015; Zhang et al. 2012).

In this paper, we developed a novel geologic feature detection methods based on randomization. In particular, we consider the use of kernel machines for automated feature detection and design a scalable algorithm using the Nystr¨om approximation (Drineas & Mahoney 2005; Gittens & Mahoney 2016). It is well known that the kernel matrix used in any kernel machines is the bottleneck for scaling up computation and memory efficiency. The main idea of Nystr¨om method is to approximate an arbitrary symmetric positive semidefinite (SPSD) kernel matrix using a small subset of its columns, and the method reduces the time complexity of many matrix operations from space complexity from is the number of data samples. There has been various research work utlizing Nystr¨om approximation to improve the computational efficiency and memory usage in machine learning community. Williams & Seeger (2001) used the Nystr¨om method to speedup matrix inverse such that the inference of large-scale Gaussian process regression can be efficiently performed. Later on, the Nystr¨om method has been applied to spectral clustering (Li et al. 2011; Fowlkes et al. 2004), kernel SVMs (Zhang et al. 2008), and kernel PCA and manifold learning (Talwalkar et al. 2013), etc. In this work, we employ Nystr¨om approximation to kernel ridge regression. Instead of forming the full kernel matrix from seismic data sets, we generate a low-rank approximation of the full kernel matrix by using Nystr¨om approximation. We further validate the performance of our new subsurface geologic feature detection method using synthetic surface seismic data. Our proposed detection method significantly improves the computational efficiency while maintaining the accuracy of the full model.

In the following sections, we first briefly describe some fundamentals of underlying geology and the governing physics of our problem of interests. We then go through the data driven approaches -kernel ridge regression model (Sec. 2). We develop and discuss our novel geologic feature detection method based on randomized kernel ridge regression method (Sec. 3). We then apply our method to test problems using both acoustic and elastic velocity models, and further discuss the results (Sec. 4). Finally, concluding remarks are presented in Sec. 5.

Figure 1. An illustration of the geologic fault zones. There are three major geological fault types depending on the relative direction of displacement between the rocks on either side of the fault: normal fault, reverse fault, and strike-slip fault. The fault block above the fault surface is called the hanging wall, while the fault block below the fault is the footwall.

2 THEORY

2.1 Geologic Features of Interest: Fault Zones

Geologic fault zone provides critical information for various subsurface energy applications. As an example, in carbon sequestration, leakage of and brine along faults at carbon sequestration sites isa primary concern for stroage integrity (Zhang et al. 2009). Accurately siting the geologic fault zones is essential to monitor the storage. We first provide some fundamentals on the geological fault.

Geological fault is a fracture or crack along which two blocks of rock slide past one another (Haakon 2010). As illustrated in Fig. 1, there are three major geological fault types depending on the relative direction of displacement between the rocks on either side of the fault: normal fault, reverse fault, and strike-slip fault. The fault block above the fault surface is called the hanging wall, while the fault block below the fault is the footwall. In this study, we focus on both normal faults and reverse faults, which are the most common fault types (Haakon 2010).

Out of various geophysical exploration methods, seismic waves are more sensitive to the acoustic/elastic impedance (which depends on density and seismic velocity of the medium) of the subsurface than other geophysical measurements (Fig. 2a). Hence, seismic exploration has been widely used to infer changes in the media impedance, which indicates geologic structures. In the next section, we briefly cover the mathematics and governing physics of seismic exploration.

2.2 Physics-Driven Methods

Seismic waves are mechanical perturbations that travel in the Earth at a speed governed by the acoustic/elastic impedance of the medium in which they are traveling. In the time-domain, the acoustic-wave equation is given by

where is the density at spatial location r, K(r) is the bulk modulus, s(r, t) is the source term,

p(r, t) is the pressure wavefield, and t represents time.

The elastic-wave equation is written as

where C(r) is the elastic tensor, and u(r, t) is the displacement wavefield.

The forward modeling problems in Eqs. (1) and (2) can be written as

where P is the pressure wavefield for the acoustic case or the displacement wavefields for the elastic case, f is the forward acoustic or elastic-wave modeling operator, and m is the velocity model parameter vector, including the density and compressional- and shear-wave velocities. We use a time-domain stagger-grid finite-difference scheme to solve the acoustic- or elastic-wave equation. Throughout this paper, we consider only constant density acoustic or elastic media.

The inverse problem of Eq. (3) is usually posed as a minimization problem

where d represents a recorded/field waveform dataset, f(m) is the corresponding forward modeling result, is the data misfit, stands for the Lis a regularization parameter and R(m) is the regularization term whose form depends on the type of the regularization used. The current technology to infer the subsurface geologic features is based on seismic inversion and imaging methods, which are computationally expensive and often yield unsatisfactory resolution in identifying small geologic features (Lin & Huang 2015a,c). Recent years, with the significantly improved computational power, machine learning and data mining have been successfully employed to a various domains from science to engineering. In the next section, we provide a different perspective (data driven approach) of extracting subsurface geological features from seismic measurements.

2.3 Data-Driven Approach for Subsurface Feature Detection

In this paper, we adopt a data-driven approach, which falls into the category of supervised machine

learning. Suppose one has n historical features vectors

which are from seismic measurements and , and the associated labels

which can be the location or angle of geologic faults.

Overall, the idea of data driven approach independent of applications can be illustrated as

In particular, one can build a machine learning model, such as kernel ridge regression (KRR), and train the model using the recorded physical measurements. After training, one gets a function, which takes a d-dimensional feature vector as input and returns a prediction of its label. Then for any unseen feature vector , one can predict its label by

As for subsurface geological feature detection specifically, we illustrate our data-driven approach in Fig. 2b. Seismic measurements including both historical and simulated are utilized as training data sets, which are fed into the learning algorithms. A mapping function, , is the outcome of the training algorithms. The function, , is the detection function, which creates a link from the seismic measurements to the corresponding geological features.

Note the difference between a machine learning model and the physical driven models as in Eq. (4). A data driven model, such as KRR, is generic: it can be used to predict wine quality, web page click, house price, etc. To apply a data driven model, one need zero knowledge of the physics behind the problem; one just need to provide the historical feature vectors and labels for training. This is in sharp contrast to the physics driven model in Eq. (4), which is specific to one particular problem and requires strong domain knowledge and intricate mathematical models.

The correctness of our applied data-driven approach, KRR, is ensured by machine learning theory (Friedman et al. 2001; Mohri et al. 2012). Assume that the training and test data are generated by the same model (otherwise, what is learned from the training data does not apply to the test data). As more data are used for training, the prediction error monotonically decreases. Importantly, KRR is known to be robust to noise: even if the training data are corrupted by intensive noise, the prediction is

Figure 2. (a). An illustration of subsurface properties exploration by using seismic wave. (b). The diagram of the data-driven procedure to learn geologic features from seismic measurements. Seismic measurements including both historical and simulated are utilized as training data sets, which are fed into the learning algorithms. A mapping function, , is the outcome of the training algorithms. The function, , is the detection function, which creates a link from the seismic measurements to the corresponding geological features.

still highly accurate, provided that the number of training data is sufficiently large. The robustness is useful in practice, because the seismic measurements have noise, and the locations and angles of the geologic faults may not be exactly known.

2.4 Ridge Regression and Kernel Trick

This work proposes to learn the function in question (denote ) using data driven techniques such as ridge regression and kernel ridge regression (KRR). Since all these regression methods are central to the proposed system, we recap their definitions in the following sections.

2.4.1 Ridge Regression

Ridge regression is one of most popular regression methods, which models the linear dependencies

between features vectors x and response variables y. Its loss function is usually posed as

where the first term is the cost function and the second term is used to avoid the over fitting. The

resulting regression model can be therefore obtained by

where . The major shortcoming of ridge regression is its limitation in modeling nonlinear data sets. However, in seismic applications, the relationship between the feature vectors and the response variables is nonlinear because of the governing physics as provided in Eqs. (1) and (2). We will need more advanced regression techniques to model the data nonlinearity while maintaining feasible computational costs. Kernel tricks provide us with the tools (Sch¨olkopf & Smola 2002).

2.4.2 Kernel Ridge Regression

Kernel ridge regression (KRR) is a popular non-linear regression method (Sch¨olkopf & Smola 2002).

It is built upon kernel function , which is defined as

Definition A function is a valid kernel if

any . In addition, a valid kernel defines such a feature map

denotes the inner product in the feature space F.

In general, a kernel measures the similarity between the two samples in . In this paper,

we consider the radial basis function (RBF) as the kernel which is defined as

where is the kernel width parameter and is the vector Euclidean norm (-norm). Since linear operations within the feature space F can be interpreted as non-linear operations in the Euclidean space, the linear models learned in the feature space provide the power of a non-linear model. Let be the (training) kernel matrix, where

Suppose the seismic measurements are stored as . KRR uses the data for training and returns a function which approximates . Given a test point makes prediction . We directly state the dual problem of KRR without derivation; the reader can refer to Campbell (2001) for the details:

where is the regularization parameter and should be fine tuned. Problem (10) has a closed-form

optimal solution

where identity matrix. Finally, for any

is the prediction made by KRR.

Machine learning theory indicates that more training samples lead to smaller variance and thereby better prediction performance. Ideally, one can collect as many seismic measurements as desired in the quest to improve detection. Unfortunately, the memory costs of KRR hinder the use of such large amounts of training data. To the best of our knowledge, these computational challenges of KRR have not been addressed by any of the prior efforts on using kernel machines for subsurface applications (Schnetzler & Alumbaugh 2017; Zhang et al. 2014; Ramirez & Meyer 2011). A practical approach to large-scale KRR is randomized kernel approximation, which sacrifices a limited amount of accuracy for a tremendous reduction in time and memory costs. In this work, we apply the Nystr¨om method (Nystr¨om 1930; Williams & Seeger 2001) to make large-scale KRR feasible on a single workstation. Consequently, we can easily enable the training of KRR using much larger amounts seismic measurements, thereby achieving substantially improved geologic detection performance.

To estimate the subsurface features using seismic data sets, the number of samples, n, is usually in the magnitude of millions or even more. KRR requires forming an kernel matrix K in Eq. (10) and inverting the matrix in (11). Therefore, when n is large, the computation of KRR is very challenging. Super computer clusters and distributed computing systems have been utlized to solve Large-scale KRR (Avron et al. 2017; Tu et al. 2016). Even though KRR provides promising regression results in various problems, such an expensive computational cost of KRR will unfortunately hinders its wide applications. A practical approach to large-scale KRR is randomized kernel approxation, which sacrifices a limited amount of accuracy for a tremendous reduction in time and memory costs. We apply the Nystr¨om method (Nystr¨om 1930; Williams & Seeger 2001) to make large-scale KRR feasible on a single workstation.

3 SCALABLE GEOLOGIC DETECTION THROUGH RANDOMIZED APPROXIMATION

In this section, we introduce the Nystr¨om method—a randomized kernel matrix approximation tool— to the geologic detection task, aiming at solving large-scale problems using limited computational resources. Sec. 3.1 describes the Nystr¨om method, Sec. 3.2 theoretically justifies the Nystr¨om method and its application to KRR, Sec. 3.3 discusses the three tuning parameters, Sec. 3.4 presents the whole

Figure 3. Illustration of the Nystr¨om approximation , where the low-rank approximation can be obtained.

procedure of KRR with Nystr¨om approximation, and finally, Sec. 3.5 analyzes the time and memory costs.

3.1 The Nystr¨om Method

Nystr¨om approximation (Williams & Seeger 2001) is a popular and an efficient approach. In addition to its simplicity, the Nystr¨om method is a theoretically sound approach: its approximation error is bounded (Drineas & Mahoney 2005; Gittens & Mahoney 2016); when applied to KRR, its statistical risk is guaranteed to diminish(Alaoui & Mahoney 2015; Bach 2013).

The Nystr¨om method computes a low-rank approximation time. Here is user-specified; larger values of s leads to better approximation but incurs higher computational costs. The tall-and-skinny matrix is computed as follows: First, sample s items from uniformly at random without replacement; let the resulting set be S. Subsequently, construct a matrix contain the rows of C indexed by S. Gittens & Mahoney (2016) showed that approximation to denotes the Moore-Penrose pseudo-inverse of W. The approximation is illustrated in Fig. 3. Finally, the low-rank approximation is computed.

Besides the Nystr¨om method, a number of other kernel approximation methods exist in the machine learning literature. Random feature mapping (Le et al. 2013; Rahimi & Recht 2007) is an equally popular class of approaches. However, compared to random feature mapping, several theoretical and empirical studies (Tu et al. 2016; Yang et al. 2012) are in favor of the Nystr¨om method. Furthermore, in the recent years, alternative approaches such as the fast SPSD model (Wang et al. 2016), MEKA (Si et al. 2014), hierarchically compositional kernels (Chen et al. 2016) have been proposed to speed up KRR. Since comparing different kernel approximation methods is beyond the scope of this work, we adopt the Nystr¨om method in our algorithm.

3.2 Theoretical Justifications of the Nystr¨om Method

The Nystr¨om method has been studied by many recent works (Alaoui & Mahoney 2015; Bach 2013; Drineas & Mahoney 2005; Gittens & Mahoney 2016; Wang et al. 2016, 2017), and its theoretical properties has been well understood. In the following, we first intuitively explain why the Nystr¨om method works and then describe its theoretical properties.

Let be a column selection matrix, that is, each column of P has exactly one nonzero entry whose position indicates the index of the selected column. We let

Then the matrices C and W (in Fig. 3) can be written as

The Nystr¨om approximation can be written as

In the extreme case where s = n, the matrix is identity, and thus the Nystr¨om approximation is exact. In general s < n, the matrix is called orthogonal projection matrix, which projects any matrix to the column space of D. Low-rank approximation theories show that if the “information” in K is spread-out, then most mass of are in the column space of a small subset of columns of . Theorefore, projecting to the column space of loses only a little accuracy, and the Nystr¨om approximation well approximates K.

Theoretical bounds (Gittens & Mahoney 2016; Wang et al. 2017) guarantee that the Nystr¨om approximation is close to K in terms of matrix norms. Let ) be arbitrary integer, be the best rank k approximation to be the spectral norm, Frobenius norm, or trace norm. If the eigenvalues of K decays rapidly and the number of samples, s, is sufficiently larger than k, then is comparable to

Applied to the KRR problem, the quality of the Nystr¨om method has been studied by Alaoui & Mahoney (2015); Bach (2013). The works studied the bias and variance, which directly affect the prediction error of KRR. The works showed that using the Nystr¨om approximation, the increases in the bias is bounded, and the variance does not increase at all. Therefore, using the Nystr¨om approximation, the prediction made by KRR will not be much affected. In addition, they showed that as the number of samples, s, increases, the performance monotonically improves.

3.3 Tuning Parameters

KRR with Nystr¨om approximation has totally three parameters: the regularization parameter kernel width parameter , and the number of random samples s. We discuss the effect of the parameters.

The regularization parameter ) is defined in the KRR objective function (10) and can be arbitrarily set by users. From the statistical perspective, trades off the bias and variance of KRR: big leads to small variance but big bias, and vice versa. The optimal choice of is the one minimizes the sum of variance and squared bias. However, such optimal choice cannot be analytically calculated;

in practice, it is determined by cross-validation.

The kernel width parameter is defined in (9). It defines how far the influence of a single training example reaches, with high values meaning “far” and high values meaning “close”. As goes to zero, K tends to be identity, where the training examples do not influence each other and the KRR model is too flexible; as goes to infinity, K tends to be an all-one matrix (its rank is one), where the KRR model is restrictive and lacks expressive power. The users can either set as In practice,

fine tuned; a good heuristic is searching

or searching around this value by cross-validation. Note that computing (13) costs is thereby impractical, a good heuristic is randomly sample a subset and approximate

which costs merely

The number of random samples s trades off the accuracy and computational costs: large s leads to good prediction but large computational costs. If the dataset has n samples of d-dimension, the total time complexity is , and the space (memory) complexity is O(nd + ns). It is always good to set s as large as one can afford because the prediction monotonically improves as s increase.

3.4 Overall Algorithm: KRR with Nystr¨om Approximation

Using the Nystr¨om method, the training of KRR can be performed in time, where the user-specified parameter s directly trades off accuracy and computational costs. Empirically speaking, setting s in the order of hundreds suffices in our application. The overall algorithm is described as follows.

Training. The inputs are . User specifies s, randomly select s out of the n samples, form the kernel sub-matrices , and compute The kernel matrix K can be approximated by according to the previous subsection. Finally,

defined in Eq. (11) can be approximated by

where the latter equality follows from the Sherman-Morrison-Woodbury (SMW) matrix identity as definied in Eq. (A.1) and the detailed deviration is provided from Eq. (A.2) to Eq. (A.3). More details can be found in Wang (2015). It is worthwhile to mention that the in Eq. (14) has been replaced by the matrix of in Eq. (15), which is a much smaller dimension of leads to the significant reduction of the computational costs.

Prediction. Let be any unseen test sample. The detection step is almost identical to

Eq. (12): we use instead of and makes prediction by

The location or angle of geological fault should be close to

3.5 Computational and Memory Cost Analysis

The training of KRR without kernel approximation has time complexity and space (memory) complexity. The costs are calculated as follows. For most kernel functions, including the RBF kernel, the evaluation of time. One needs to keep the n data samples in memory to compute every entry of , which costs O(nd) memory and compute , one needs to keep K in memory and perform matrix inversion, which requires memory and

The training of KRR with Nystr¨om approximation has time complexity and O(nd+ ns) space complexity. The costs are calculated as follows. To compute one only need to evaluate ns kernel functions, which requires O(nd) memory and O(nds) time. The computation of according to Eq. (15) has O(ns) space complexity (because the matrice C and W need to be kept in memory) and time complexity.

The prediction of KRR, either with or without approximation, for an unseen test sample,

Figure 4. An illustration of the geologic fault zone. The location of a geologic fault zone can be characterized by its horizontal offset and the dipping angle (Zhang et al. 2014).

O(nd) time complexity and O(nd) memory complexity. First, one keeps the n data samples in memory to evaluate the kernel functions , which costs O(nd) time and O(nd) memory. Then, one keeps the n kernel function values and ) in memory to make prediction, which costs merely O(n) time and O(n) memory.

4 NUMERICAL RESULTS

To validate the performance of our proposed approach, we carry out evaluations with synthetic seismic measurements to detect the location of the geologic faults. The siting of geologic fault zones can be characterized by its horizontal offset and the dipping angle as shown in Fig. 4 (Zhang et al. 2014). We employ our new subsurface feature detection method to estimate both the offset and angle of geologic fault zones. As for the computing environment, we run our tests on a computer with 40 Intel Xeon E5-2650 cores running at 2.3 GHz, and 64 GB memory.

The quality of training set is critical for any machine-learning algorithms. In this work, we consider velocity models consisting of horizontal reflectors with a single fault zone (layer model) to demonstrate the performance of our new geologic feature detection method. It is straight forward to employ our method to detect multiple fault zones. To best represent the geologically realistic velocity models, we create a database containing n = 60, 000 velocity models of size grid points similar to Zhang et al. (2014). The velocity models in the database are different from one another in terms of offset (ranging from 30 grids to 70 grids), dipping angle (ranging from of layers (ranging from 3 to 5 layers), layer thickness (ranging from 5 grids to 80 grids), and layer velocity (ranging from 3000 m/s to 5000 m/s). A small portion of the training velocity models are shown in Fig. 5.

Figure 5. A database of velocity models consisting of 60, 000 models of size grid points. The velocity models in the database are different from one another in terms of offset (ranging from 30 grids to 70 grids), dipping angle (ranging from ), number of layers (ranging from 3 to 5 layers), layer thickness (ranging from 5 grids to 80 grids), and layer velocity (ranging from 3000 m/s to 5000 m/s).

The seismic measurements are collections of synthetic seismograms obtained by implementing forward modeling on those 60, 000 velocity models. One common-shot gather of synthetic seismic data with 32 receivers is posed at the top surface of the model. The receiver interval is 15 m. We use a Ricker wavelet with a center frequency of 25 Hz as the source time function and a staggered-grid finite-difference scheme with a perfectly matched layered absorbing boundary condition to generate synthetic seismic reflection data. The synthetic trace at each receiver is a collection of time-series data of length 1, 000. So, the feature dimension of seismic data sets is . Therefore, out of 60, 000 velocity models, the total volume of synthetic seismic data is . In Fig. 6, we show a portion of the synthetic seismic data sets corresponding to velocity models that we generate. Specifically, the displacement in X direction is shown in Fig. 6(a), and the displacement in Z direction is shown in Fig. 6(b).

We employ a hold-out test to assess the efficacy of our proposed algorithm. Specifically, 75.0% of the dataset is used for training the model, while the rest is used for testing. For comparison, we use the conventional KRR method (denoted by “KRR”) as the reference method. We denote our new geologic feature detection method as “R-KRR” standing for Randomized KRR method. To evaluate the performance, we report both the accuracy and the computational efficiency of different methods. We use the mean-absolute error (MAE) metric to quantify the accuracy of a learning method, which

Figure 6. Synthetic seismic data sets are obtained using a staggered-grid finite-difference scheme with a perfectly matched layered absorbing boundary condition. The displacement of X direction (a) and Z direction (b) are both used as training sets. The total volume of synthetic seismic data is

is defined as

We calculate the wall time to measure the computational efficiency of a learning method and further provide the speed-up ratio.

To have a comprehensive understanding our randomized geologic feature detection methods, we provide three sets of tests. In Sec. 4.1, we provide an overall test on the detection accuracy of our method. In Sec. 4.2, we report the performance of our method as a function of the number of random samples, s. In Sec. 4.3, we test the robustness of our method with a view on the randomness of the approximation method.

4.1 Test on Detection Accuracy

We provide our first test on the detection accuracy. The estimation result on the dipping angle and horizontal offset is provided in Fig. 7. We test the performances of our method using two different Nystr¨om approximations, s = 3, 000 and s = 6, 000, as well as one other detection approach using conventional KRR method. We report the performances of those methods using different sizes of the seismic data. Specifically, we increase the seismic dataset generated from 5, 000 velocity models to 60, 000 velocity models with an incremental of 5, 000 velocity models. The corresponding MAE values are reported in Fig. 7. In particular, the results of angle estimation is provided in Fig. 7(a) and the results of the offset estimation is provided in Fig. 7(b). We notice when the dataset used for training is small, KRR method (in cyan) yields more accurate results of both angle and offset estimations. This is reasonable since all the available data sets are used for estimation. After using data from 10, 000 velocity models, KRR method becomes extremely inefficient because of the selection of the parameters using cross-validation. It is difficult to evaluate its performance given more training data. While, our method with both Nystr¨om approximations, s = 3, 000 and s = 6, 000, still yields accurate results and efficient performance. In particular, our method with larger Nystr¨om approximation, i.e., s = 6, 000, consistently gives us better results. Our best estimate of the dipping angle on the full seismic data set is (Fig. 7(a)). Similarly, we also report the performance of offset estimation in Fig. 7(b). The best estimate of the offset using our method on the full data set is about 1 grid.

Figures 8 show the computational time for training phases required by our approach (in blue and red), and KRR method (in cyan). Observing Fig. 8, KRR approach is much more computationally expensive and memory demanding than our method even if it provides more accurate estimation. On the other hand, our method is significantly more efficient than the conventional KRR method in training when the data sets become large. Utilizing the full dataset, it takes our method on the order of 10 seconds to train the prediction model. The speed-up ratios between our method and the conventional KRR method in training phase is up to 1, 000. Such an efficiency would allow the possibility to detect the geologic features in/towards real time. The computational time in estimating the offset is similar to that of the angle estimation and hence we do not report them. Similar conclusions can be drawn that our method not only yields more accurate estimation but also is more computationally efficient out of these two methods.

To have a visualization of our estimation, we provide a specific example of the true model and our estimation in Fig. 9. In Fig. 9(a), we show our true velocity model with angle = = 49.0. The estimation result of our randomized detection method is given in Fig. 9(b). The result of our estimation is angle = and offset = 50.0. Visually, our randomized detection method yields a rather accurate estimation compared to the ground truth.

4.2 Test on the Nystr¨om Sample Size

The number of random Nystr¨om sample size, s, is critical to the accuracy and efficiency of our randomized feature detection method. The appropriate selection of the Nystr¨om sample size value depends on the redundancy of data sets, which theoretically can be justified by the spectrum spanned by the singular vectors of the data sets. In this test, we provide our estimation results by varying the Nystr¨om sample size, s, from from 1, 000 to 6, 000 with an incremental of 500. Besides the acoustic seismic data sets, we also generate elastic seismic data sets for testing our prediction model. The estimation results are provided in Fig. 10, where Fig. 10(a) is the estimation of horizontal offset and Fig. 10(b) is the estimation of the dipping angle. In both figures, the estimation results using acoustic data sets

Figure 7. Estimation error for (a) dipping angle and (b) offset using conventional KRR method (in cyan), our method using s = 3, 000 (in blue), and our method using s = 6, 000 (in red). KRR method (in cyan) yields more accurate results of both angle and offset estimations with small size of data sets, and it fails to provide estimation when data sets becomes too large. Our method yields consistently comparable results to KRR on all sizes of data sets.

are plotted in red, and the results using elastic data sets are plotted in blue. We notice that in both figures that with the increase of the Nystr¨om sample size, the estimation accuracy for both dipping angle and horizontal offset also increases. This is reasonable and can be explained by the fact that more information are included and utilized for generating the prediction model. Comparing the estimation results using elastic and acoustic seismic data sets, we notice that the one using acoustic seismic data sets yields consistently more accurate results. This is because the elastic models include much more parameters than the acoustic models, which is indicated by Eqs. (1) and (2). With more degree of freedom, more training data sets is therefore needed to achieve the same level of accuracy. To conclude on

Figure 8. Computational times for the training phase using our method (in red and blue) and the conventional KRR method (in cyan). Our method yields accurate results and is computationally and memory efficient on all data points.

Figure 9. An synthetic model with a geologic fault in it: (a) the true model with angle = and offset = 49.0; (b) the estimation using our new randomized detection method. The result of our estimation is angle = and offset = 50.0. Visually, our randomized detection method yields a rather accurate estimation compared to the ground truth.

the selection of the random Nystr¨om sample size, we would suggest to use a value in between 3, 000 to 6, 000 considering a balance between the accuracy and efficiency.

Figure 10. The estimation results by varying the Nystr¨om sample size, s, from from 1, 000 to 6, 000 with an incremental of 500. Both the estimation results on the horizontal offset (a) and dipping angle (b) are provided. In both figures, the estimation results using acoustic data sets are plotted in red, and the results using elastic data sets are plotted in blue. We notice that in both figures that with the increase of the Nystr¨om sample size, the estimation accuracy for both dipping angle and horizontal offset also increases, which is due to the fact that more information are utilized in generating the prediction model.

Figure 11. 20 different realization tests of the Nystr¨om Method are generated. Each of them is drawn from a uniform distribution. Both the dipping angle and horizontal estimation errors according to Eq. 16 are calculated. For all the tests, we set the Nystr¨om Sample Size, s = 3, 000, and use the full data set size. We report the randomness results in Fig. 11, where the acoustic estimation results are plotted in red and the elastic results are plotted in blue. All of the 20 different realizations lead to similar error estimations of both dipping angle and horizontal offset.

4.3 Test on the Randomness of the Nystr¨om Method

The Nystr¨om method is a randomization-based approach, where the randomness arises from the uniform sampling of the columns in generating the low-rank approximation as in illustrated Fig. 3. Here we test the randomness in the prediction made by KRR with Nystr¨om approximation. We use the same model as in Test 1, where the velocity models is size of grid points. One geological fault zone is contained in the model. One common-shot gather of synthetic seismic data with 32 receivers is posed at the top surface of the model. We generate 20 different realization tests of the Nystr¨om method. Each of them is drawn from a uniform distribution. We calculate their dipping angle and horizontal estimation errors according to Eq. (16). For all the tests, we set the Nystr¨om sample size, s = 3, 000, and use the full data set size. We report the randomness results in Fig. 11, where the acoustic estimation results are plotted in red and the elastic results are plotted in blue. We observe that there are two clusters of data points corresponding to the acoustic and elastic scenarios. All of the 20 different realizations lead to similar error estimations of both dipping angle and horizontal offset. From this test, we conclude that our method yields robust and accurate results regardless of the randomness nature of the Nystr¨om method.

5 CONCLUSIONS

We developed a computationally efficient, machine learning approach for subsurface geological features detection using seismic data sets. Instead of detecting geological features from the post-stack image or inversion, our proposed techniques are capable of detecting the geological features of interest from pre-stack seismic data sets. Our data-driven detection methods are based on kernel ridge regression, which can be computationally intensive in training. To overcome the issues of excessive memory and computational cost that arises with kernel machines for high dimensional dataset, we incorporated a randomized matrix sketching technique. The randomization method can be viewed as a data-reduction technique, because it generates a surrogate system that has much lower degrees of freedom than the original problem. We show through our computational cost analysis that the proposed geologic feature detection method achieves significant reduction in computational and memory costs. Furthermore, we utilized a few numerical tests of detecting geological fault zone to demonstrate the performance of our detection method. As illustrated by our tests, our method yields comparable estimation performance compared to the conventional kernel ridge regression. To summarize, our data-driven detection method presents great potential to be utilized for detection of subsurface geological features.

6 ACKNOWLEDGMENTS

This work was supported by the U.S. DOE Fossil Energy research grant. The computation was performed using super-computers of LANL’s Institutional Computing Program. J. Thiagarajanunder was supported by the U.S. DOE under Contract DE-AC52-07NA27344 to Lawrence Livermore National Laboratory.

APPENDIX A: SHERMAN-MORRISON-WOODBURY MATRIX IDENTITY

Given a square invertible matrix such that B = A + UV . Then, assumingis invertible, we have the

Sherman-Morrison-Woodbury matrix identity defined as

By letting and employing the above formulation to Eq. (14), we

will have

REFERENCES

Alaoui, A. & Mahoney, M. W., 2015. Fast randomized kernel ridge regression with statistical guarantees, in Advances in Neural Information Processing Systems (NIPS).

Araya-Polo, M., Dahlke1, T., Frogner, C., Zhang, C., Poggio, T., & Hohl, D., 2017. Automated fault detection without seismic processing, The Leading Edge, 36, 208–214.

Avron, H., Clarkson, K. L., & Woodruff, D. P., 2017. Faster kernel ridge regression using sketching and preconditioning, SIAM Journal on Matrix Analysis and Applications, to appear.

Bach, F., 2013. Sharp analysis of low-rank kernel matrix approximations, in International Conference on Learning Theory (COLT).

Campbell, C., 2001. An introduction to kernel methods, Studies in Fuzziness and Soft Computing, 66, 155– 192.

Chen, J., Avron, H., & Sindhwani, V., 2016. Hierarchically compositional kernels for scalable nonparametric learning, arXiv preprint arXiv:1608.00860.

Drineas, P. & Mahoney, M. W., 2005. On the nystr¨om method for approximating a gram matrix for improved kernel-based learning, Journal of Machine Learning Research, 6, 2153–2175.

Drineas, P. & Mahoney, M. W., 2016. RandNLA: Randomized numerical linear algebra, Communications of the ACM, 6(6), 80–90.

Drineas, P., W., M. M., S., M., & Sarlos, T., 2011. Faster least squares approximation, Numerische Mathematik, 117, 219–249.

Fowlkes, C., Belongie, S., Chung, F., & Malik, J., 2004. Spectral grouping using the nystrom method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 214–225.

Friedman, J., Hastie, T., & Tibshirani, R., 2001. The elements of statistical learning, vol. 1, Springer series in statistics New York.

Gittens, A. & Mahoney, M. W., 2016. Revisiting the Nystr¨om method for improved large-scale machine learning, The Journal of Machine Learning Research, 17(1), 3977–4041.

Guillen, P., 2015. Supervised learning to detect salt body, in SEG Technical Program Expanded Abstracts, pp. 1826–1829.

Haakon, F., 2010. Structural Geology, Cambridge University Press.

Hale, D., 2013. Methods to compute fault images, extract fault surfaces, and estimate fault throws from 3d seismic images, Geophysics, 78(2), O33–O43.

Huang, L., Shin, J., Chen, T., Lin, Y., Gao, K., Intrator, M., & Hanson, K., 2016. Breast ultrasound tomog- raphy with two parallel transducer arrays, in Proc. SPIE 9783, Medical Imaging 2016: Ultrasonic Imaging, Tomography, and Therapy, pp. 97830C–97830C–12.

Krebs, J. R., Anderson, J. E., Hinkley, D., Neelamani, R., Lee, S., Baumstein, A., , & Lacasse, M. D., 2009. Fast full-wavefield seismic inversion using encoded sources, Geophysics, 74, WCC177–WCC188.

Le, Q., Sarl´os, T., & Smola, A., 2013. Fastfood-computing hilbert space expansions in loglinear time, in Proceedings of the 30th International Conference on Machine Learning, pp. 244–252.

Li, M., Lian, X., Kwok, J. T., & Lu, B., 2011. Time and space efficient spectral clustering via column sampling, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lin, Y. & Huang, L., 2015a. Acoustic- and elastic-waveform inversion using a modified total-variation regu- larization scheme, Geophysical Journal International, 200, 489–502.

Lin, Y. & Huang, L., 2015b. Least-squares reverse-time migration with modified total-variation regularization, in SEG Technical Program Expanded Abstracts.

Lin, Y. & Huang, L., 2015c. Quantifying subsurface geophysical properties changes using double-difference seismic-waveform inversion with a modified total-variation regularization scheme, Geophysical Journal International, 203, 2125–2149.

Lin, Y., Wohlberg, B., & Guo, H., 2010. UPRE method for total variation parameter selection, Signal Processing, 90(8), 2546–2551.

Lin, Y., Syracuse, E. M., Maceira, M., Zhang, H., & Larmat, C., 2015. Double-difference traveltime tomogra- phy with edge-preserving regularization and a priori interfaces, Geophysical Journal International, 201(2), 574–594.

Meng, X. Saunders, M. A. & Mahoney, M. W., 2014. LSRN: A parallel iterative solver for strongly over- or underdetermined systems, SIAM J. Sci. Comput., 36, 95–118.

Moghaddam, P. P., Keers, H., Herrmann, F. J., & Mulder, W. A., 2013. A new optimization approach for source-encoding full-waveform inversion, Geophysics, 78(3), R125–R132.

Mohri, M., Rostamizadeh, A., & Talwalkar, A., 2012. Foundations of machine learning, MIT press.

Nystr¨om, E. J., 1930. ¨Uber die praktische aufl¨osung von integralgleichungen mit anwendungen auf randwertaufgaben, Acta Mathematica, 54(1), 185–204.

Rahimi, A. & Recht, B., 2007. Random features for large-scale kernel machines, in Advances in neural information processing systems (NIPS), pp. 1177–1184.

Ramirez, J. & Meyer, F. G., 2011. Machine learning for seismic signal processing: Phase classification on a manifold, in Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops.

Rawlinson, N. & Sambridge, M., 2014. Seismic travel time tomography of the crust and lithosphere, Advances in Geophysics, 46, 81–197.

Rokhlin, V. & Tygert, M., 2008. A fast randomized algorithm for overdetermined linear least-squares regres- sion, Proc. Natl. Acad. Sci. USA, 105(36), 13212–13217.

Schnetzler, E. T. & Alumbaugh, D. L., 2017. The use of predictive analytics for hydrocarbon exploration in the Denver-Julesburg basin, The Leading Edge, 36, 227–233.

Sch¨olkopf, B. & Smola, A. J., 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond., MIT Press.

Si, S., Hsieh, C.-J., & Dhillon, I., 2014. Memory efficient kernel approximation, in International Conference on Machine Learning (ICML), pp. 701–709.

Talwalkar, A., Kumar, S., Mohri, M., & Rowley, H., 2013. Large-scale SVD and manifold learning, Journal of Machine Learning Research, 14, 3129–3152.

Tu, S., Roelofs, R., Venkataraman, S., & Recht, B., 2016. Large scale kernel learning using block coordinate descent, arXiv preprint arXiv:1602.05310.

Virieux, J. & Operto, S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics, 74(6), WCC1–WCC26.

Wang, K., Matthews, T., Anis, F., Li, C., Duric, N., & Anastasio, M. A., 2015. Waveform inversion with source encoding for breast sound speed reconstruction in ultrasound computed tomography, IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 62(3), 475–493.

Wang, S., 2015. A practical guide to randomized matrix computations with MATLAB implementations, arXiv preprint arXiv:1505.07570.

Wang, S., Zhang, Z., & Zhang, T., 2016. Towards more efficient SPSD matrix approximation and CUR matrix decomposition, Journal of Machine Learning Research, 17(210), 1–49.

Wang, S., Gittens, A., & Mahoney, M. W., 2017. Scalable kernel k-means clustering with nystrom approxima- tion: Relative-error bounds, arXiv preprint arXiv:1706.02803.

Williams, C. & Seeger, M., 2001. Using the Nystr¨om method to speed up kernel machines, in Advances in Neural Information Processing Systems (NIPS).

Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., & Zhou, Z.-H., 2012. Nystr¨om method vs random fourier features: A theoretical and empirical comparison, in Advances in Neural Information Processing Systems (NIPS).

Zhang, C., Frogner, C., Araya-Polo, M., & Hohl, D., 2014. Machine-learning based automated fault detection in seismic traces, in Proceedings of 76th European Association of Geoscientists and Engineers Conference & Exhibition (EAGE).

Zhang, Y., Oldenburg, C. M., Finsterle, S., Jordan, P., & Zhang, K., 2009. Probability estimation of co2 leakage through faults at geologic carbon sequestration sites, Energy Procedia, 1, 41–46.

Zhang, Y., Duan, L., & Xie, Y., 2015. A stable and practical implementation of least-squares reverse time migration, Geophysics, 80(1), V23–V31.

Zhang, Z., Tsang, I., & Kwok, J., 2008. Improved nystrm low-rank approximation and error analysis, in International Conference on Machine Learning (ICML).

Zhang, Z., Huang, L., & Lin, Y., 2012. Efficient implementation of ultrasound waveform tomography using

source encoding, in Proc. SPIE 8320, Medical Imaging 2012: Ultrasonic Imaging, Tomography, and Therapy, pp. 832003–832003–10.

designed for accessibility and to further open science