A hybrid machine learning and process-based-modeling (PBM) approach is proposed and evaluated at a handful of AmeriFlux sites to simulate the top-layer soil moisture state. The Hybrid-PBM (HPBM) employed here uses the Noah land-surface model integrated with Gaussian Processes. It is designed to correct the model only in climatological situations similar to the training data else it reverts to the PBM. In this way, our approach avoids bad predictions in scenarios where similar training data is not available and incorporates our physical understanding of the system. Here we assume an autoregressive model and obtain out-of-sample results with upwards of a 3-fold reduction in the RMSE using a one-year leave-one-out cross-validation at each of the selected sites. A path is outlined for using hybrid modeling to build global land-surface models with the potential to significantly outperform the current state-of-the-art.
The value of Machine Learning (ML) in Earth Sciences is well established, and it is an active area of research across a wide variety of application domains. In hydrological modeling, we have already seen many instances where ML has substantially outperformed traditional process-based modeling (PBM) [e.g., 6-7, 9]. These results motivate the need for answers to important questions about the use of ML “black-box” models in science:
1. How can we trust ML models? 2. How do we gain an understanding of the processes
The first question arises because ML employs flexible nonparametric models whose structure is constructed entirely from data. Traditional PBMs are constructed from some underlying physical principles that we believe are generally valid in any climatology. ML or data-driven models, on the other hand, can only be expected to do well in situations similar to the data the model was trained on. It is possible, and we hope, that ML models learn the fundamental underlying processes, but at this point we do not know of a way to satisfactorily determine if this is true. An obvious
Figure 1 Example of Gaussian Process (GP) regression (blue) on pseudo-noisy-linear data (purple) compared to a standard linear regression (black). Here we see that the zero mean GP prior causes the fit to extrapolate to zero outside of the data as opposed to parametric (linear) regression.
approach to circumvent this problem is to somehow combine our physical understanding with ML, and in fact this is an area actively being pursued [5, 10]. There are two approaches that might be employed: (1) imposing physical constraints on our ML models either analytically [e.g., 1] or through penalties imposed on the cost function, and (2) integrating PBMs with ML [e.g., 8]. In this work, we propose a method using the latter approach.
The second question is perhaps more difficult to answer, and to the authors knowledge, no satisfactory and universal solution exists at this time. However, before we can learn new physics from ML, we must first build reliable global ML models that captured the fundamental processes of the systems being modeled. Here we propose a potential approach, evaluate its performance modeling the soil moisture state, and present promising initial results.
To integrate PBMs with ML, we require the basic design principle that the ML model should only be active in situations that are climatologically similar to the training data, otherwise it should revert back to the PBM. For convenience, we will refer to such a model as a Hybrid-PBM (HPBM). The HPBM addresses (1) in Section 1, and in addition it addresses the important practical problem of having a “complete” training set. In some cases, Earth observational data may be extensive and in others less so. This may be true for several reasons such as limited sensor coverage, rare climatological events, or even unseen climatology induced by anthropogenic change. A HPBM allows you to make use of incomplete data sets by only incorporating observational data when it is available. When it is not available the model defaults back to the PBM which is out best estimate based on our physical understanding of the system.
where denotes the state-vector at time , and we assume a discrete time index. The ML model is incorporated straightforwardly as
where is used to distinguish the timeseries under the evolution of y and denotes some unspecified ML model. To adhere to our principle, we must have whenever the state of the system is dissimilar to the available training data. This can be accomplished by placing a vanishing prior mean on the ML model in which case
This can be achieved naturally with Gaussian Processes (GPs), and is in fact, standard practice . To highlight this, in Figure 1 we show the fit results on pseudo-noisy-linear data using GPs and a linear-fit model. Unlike the linear-fit model, GPs extrapolate to zero outside of the data. While this lack of extrapolability is often thought of as a weakness of GPs, here it is desirable behavior. For this reason, GPs are an ideal candidate to integrate ML an PBMs, and as such, is chosen as the ML model in this study. We then write our HPBM as
We note that although simple in form, the flexibility of GPs makes this model quite general and capable of capturing a large class of climatological responses.
If we replace the second term on the RHS of Eq. 4 with a random white noise, we have the equation of a classical data assimilation (DA) problem. In fact, the HPBM is a form of data assimilation; The ML model tries to correct the PBM using observational data. However, there is a key difference. In DA,
Figure 2 Geographical locations of the ten AmerFlux sites used to test the results of the hybrid machine learning process-based model.
we add random fluctuations to the model and use a Bayesian approach to correct the model based on observations. In this case, we add a structured corrective term and learn the model structural error though the error patterns between observations and model predictions. In this way, unlike classical DA, we create a new dynamics model that can assimilate observations both in- and out-of-sample. Just like classical DA, the non-linearity of the PBMs leads to intractable integrals that have to be evaluated either using Monte Carlo techniques or by making simplifying assumptions that make the integrals tractable. Since here we are trying to learn the model structural error, it is essential to retain the non-linearity of the GP, and linear approximations like the Kalman filter are not sufficient.
Our HPBM is a Gaussian Process State-Space-Model (GPSSM), and there have been several recent advances related to training GP-SSMs. An approachable overview is given by Frigola . Though ultimately GP-SSMs will have to be employed, to investigate the potential of this approach we circumvent the intractable integrals by ignoring the uncertainty propagation and performing autoregression. This approach is possible as long as one has observational pairs at the desired simulation timestep, and the quality of the data is sufficient to determine with a small enough uncertainty. To meet these requirements, we test our approach at AmeriFlux tower sites where in-situ half-hourly data is available.
To train the autoregression model, we make the assumption that the input state . In this case the input/output training pairs are , see Equation 4. The “one-step-ahead” training set is generated as follows: (1) initialize , (2) set , (3) compute and record, and (4) repeat steps 2-3 for .
Often, as in the case of this work, one only observes part of the state-vector . In this case, the unobserved part of the state vector is evolved according to the PBM i.e. the
Table 1 Performance results of the Hybrid-Process-Based-Model (HPBM) using the Noah hydrological model. The first column indicates the site locations, see Figure 2, and the second column shows the number of years of training data. The third and fourth columns show the cross-validated average performance of the Noah and HPBM, and the last column displays the percent improvement.
prediction from the PBM model for the unobserved part of the state-vector is used as input to the next timestep unaltered by the ML model. This process is then equilibrated by enforcing temporal periodic boundary conditions and running through the data several times. In this way, the unobserved portion of the state is equilibrated to the observed portion of the climatology.
4.1. Experimental setup
To test the HPBM, we used the Noah land surface model as our PBM and predicted the top-layer (5 cm) soil moisture state at a collection of AmeriFlux towers sites, see Figure 2. The training data consist of observed values of the top layer soil moisture, , and forcing inputs, , at half-hour intervals. The unobserved portion of the state-vector , including lower level soil moisture states, is not corrected and is evolved solely by the PBM model. Denoting the top layer soil moisture by , we write the state vector as . The one-step-ahead training procedure in this case follows the prescription in Section 3 with step (1) replaced with and step (2) with .
To train the HPBM, we use the squared-exponential (SE) kernel 
and employ sparse-pseudo input GPs (SPGP) .
For each training set, we vary the number of pseudo inputs and perform several restarts to avoid bad local minima. To select the model with the best performance, it is important to measure the dynamical performance. To do this, we run the HPBM model on a test year, compute the RMSE, and choose the model with the lowest RMSE. A leave-one-out (LOO) cross-validation is performed over all available years at each site, and the results are reported in Section 5.
To carry out this and future work, we developed an efficient C++ High-Performance Computing (HPC) implementation of GPs capable of training on tens-of-millions of samples. This software package will be released under NASA’s public license and made available to the scientific community throughout this work. All computer simulations and numerical work were carried out at the NASA Center for Climate Simulations.
We identified nine AmeriFlux stations with a variety of different climates that have top layer soil moisture observations available. At each site, we first calibrate the Noah model on the training data using Shuffled Complex Evolution . Then we create the one-step-ahead training sets and train the HPBMs. We did not use a separate test set since some of the sites did not contain enough data.
For each site, we select the model with the best performance on the validation year, cross-validate over all the years, and computed the average RMSE. The results are shown in Table 1. We see that all sites show an improvement with some sites achieving a 3-fold reduction in the RMSE. Site 7 saw no significant improvement, but it was only trained using one-year of data and is likely a result of those 2 years being climatologically different.
The HPBM models were trained by varying the number of pseudo-inputs in the SPGP with ten restarts for each pseudoinput. Altogether, for each training set, we produced ~100 sets of fit parameters. We found the regression performance was good and quite stable, but the dynamical performance inand out-of-sample varied sometimes substantially. This is likely the result of ignoring the temporal uncertainty propagation, and we expect it will be resolved using GPSSMs.
In Table 1, we show the results for the parameter sets that performed best out-of-sample. In practice, the model selection should not be based on the out-of-sample dynamical performance. While the in- and out-of-sample dynamical performance is well correlated, it was not true that the models that performed the best in-sample were the ones that performed best out-of-sample. One way to incorporate all the models to create a more stable predictor is to marginalize. This can be accomplished by combining the GP posterior predictive distributions via model averaging ,
Figure 3 A time series plot of the observations (blue), Noah (orange), HPBM (green), and relative precipitation (cyan). In the spring, the Noah model exhibits an oscillatory artifact that indicates a model structural error related to snowmelt. The HPBM is able to learn and correct this structural error which is easily seen by comparing to observations.
which results in a posterior predictive mean,
For example, one could choose the marginalization weights, , to be normally distributed according to the RMSE performance. We plan to investigate this approach in the near future.
Lastly, in Figure 3, we take a closer look at the timeseries results at site 2, the Blodgett Forest Station in CA. Here we can see an oscillatory artifact in the Noah land surface model. This is due to structural error in the model that does not account properly for snowmelt. Here we clearly see the ML model is able to correct the structural error and demonstrates the HPBM is able to learn and correct structural error and is not just correcting model biases.
Here we proposed a methodology for integrating ML and PBMs to create new dynamical model behaviors. To evaluate the performance, we created a HPBM to predict the top layer soil moisture states at a handful of AmeriFlux sites. At each site, we performed a one-year LOO validation and obtained as much as a 3-fold reduction in the average RMSE. These performance gains are a strong indication of the potential of this approach. We found some instability in performance believed to be a result of ignoring uncertainty propagation during training. We propose a simple weighted average procedure to reduce these effects, but ultimately plan to employ GP-SSMs.
In this work, we demonstrated the ability to learn model structural error using a HPBM at a particular geo-location. To build global models, we will need to create HPBM that can predict out-of-sample both spatially and temporally. In addition, we will need to include remote sensing data, and properly account for uncertainty propagation using GPSSMs.
 T. Beucler, M. Pritchard, S. Rasp, P. Gentine, J. Ott, and P. Baldi, “Enforcing analytic constraints in neural-networks emulating physical systems,” arXiv preprint arXiv:1909.00912, 2019.  Q. Y. Duan, V. K. Gupta, and S. Sorooshian, “Shuffled complex evolution approach for effective and efficient global minimization,” Journal of Optimization Theory and Applications, vol. 76, no. 3, pp. 501–521,1993.  R. Frigola, “Bayesian time series learning with gaussian processes,” PhD thesis, University of Cambridge, 2015.  J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky, “Bayesian model averaging: A tutorial,” Statistical science, pp. 382– 401, 1999.  A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, “Theory-guided data science: A new paradigm for scientific discovery from data,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 10, pp. 2318–2331, 2017, issn: 1041-4347.  F. Kratzert, D. Klotz, M. Herrnegger, A. Sampson, S. Hochreiter, and G. Nearing, “Towards improved predictions in ungauged basins: Exploiting the power of machine learning,” Water Resources Research, vol. n/a, no. n/a, 2019, issn: 0043-1397.  F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing, “Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets,” Hydrol. Earth Syst. Sci., vol. 23, pp. 5089–5110, 2019.  G. S. Nearing and H. V. Gupta, “The quantity and quality of information in hydrologic models,” Water Resources Research, vol. 51, no. 1, pp. 524–538, 2015, issn: 00431397.  G. S. Nearing, B. L. Ruddell, M. P. Clark, B. Nijssen, and C. Peters-Lidard, “Benchmarking and process diagnostics of land models, Journal of Hydrometeorology, vol. preprint online, no. doi.org/10.1175/JHM-D-17-0209.1, null, 2018.  M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, and N. Carvalhais, “Deep learning and process understanding for data-driven earth system science,” Nature, vol. 566, no. 7743, p. 195, 2019, issn: 1476-4687.  C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, ser. Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press, Jan. 2006, p. 248.  E. Snelson and Z. Ghahramani, “Sparse gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing
Systems 18, Y. Weiss, B. Schölkopf, and J. C. Platt, Eds., MIT Press, 2006, pp. 1257–1264.