The gap between theory and practice in function approximation with deep neural networks

2020·Arxiv

Abstract

Abstract

Deep learning (DL) is transforming whole industries as complicated decision-making processes are being automated by deep neural networks (DNNs) trained on real-world data. Driven in part by a rapidly-expanding literature on DNN approximation theory showing that DNNs can approximate a rich variety of functions, these tools are increasingly being considered for problems in scientific computing. Yet, unlike more traditional algorithms in this field, relatively little is known about DNNs from the principles of numerical analysis, namely, stability, accuracy, computational efficiency and sample complexity. In this paper we first introduce a computational framework for examining DNNs in practice, and then use it to study their empirical performance with regard to these issues. We examine the performance of DNNs of different widths and depths on a variety of test functions in various dimensions, including smooth and piecewise smooth functions. We also compare DL against best-in-class methods for smooth function approximation based on compressed sensing. Our main conclusion from these experiments is that there is a crucial gap between the approximation theory of DNNs and their practical performance, with trained DNNs performing relatively poorly on functions for which there are strong approximation results (e.g. smooth functions), yet performing well in comparison to best-in-class methods for other functions. To analyze this gap further, we then provide some theoretical insights. We establish a practical existence theorem, which asserts the existence of a DNN architecture and training procedure that offers the same performance as compressed sensing. This result establishes a key theoretical benchmark. It demonstrates that the gap can be closed, albeit via a DNN approximation strategy which is guaranteed to perform as well as, but no better than, current best-in-class schemes. Nevertheless, it demonstrates the promise of practical DNN approximation, by highlighting the potential for developing better schemes through the careful design of DNN architectures and training strategies.

1. Introduction. The past decade has seen an explosion of interest in the field of machine learning, largely due to the impressive results achieved with DNNs. Breakthroughs have been obtained on large classes of historically-challenging problems, including: speech recognition [21, 50] and natural language processing [89], image classification [53, 75], game intelligence [74], and autonomous vehicles [34]. As DNNs have shown such promise in these real-world applications, a trend has developed in the scientific computing community towards applying them to problems in mathematical modelling and computational science. Recent studies have focused on applications ranging from image reconstruction tasks in medical imaging [5], discovering underlying partial differential equation (PDE) dynamics [71] and approximating

2 B. ADCOCK AND N. DEXTER

solutions of PDEs [31, 88] to complex mathematical modeling, prediction, and classification tasks in physics [13], biology [76,83,93], and engineering [59,82].

Simultaneously, the broader applied mathematics community has taken interest in the approximation capabilities of NNs [6,10,69,90]. The earliest results in this direction [19,51] established that even a single hidden layer fully-connected NN has universal approximation capability: so long as the number of nodes in the hidden layer are allowed to grow unbounded, such architectures are able to approximate any Borel measurable function on a compact domain to arbitrary uniform accuracy. More recent works have studied the connection between expressiveness of DNNs and their depth [57, 60, 91], while others have established connections between DNNs and other methods of approximation, e.g., sparse grids [62], splines [85], polynomials [22,72], and “h, p”-finite elements [67]. A plethora of results now exist concerning the approximation power of DNNs for different function spaces – e.g. Sobolev spaces [45], bandlimited functions [62], analytic functions [33,68], Barron functions [32], cartoon-like functions [44], H¨older spaces [73] – and tasks in scientific computing, such as approximation of high-dimensional functions [56, 72] and PDEs [11, 43, 55], dimensionality reduction [92], and methods for DEs [61, 88]. Theoretically, these works establish best-in-class approximation properties of DNNs for many problems. Concurrently, some works have sought to address the construction of DNN approximations directly, although typically without theoretical guarantees on trainability. For example [26, 36] employ ideas from greedy methods in order to do this, [27] derives a formula for integral representations of shallow ReLU networks, and [23] construct networks directly approximating the Legendre basis, using this as an initialization point for training. On the other hand, works such as [20] reinterpret DNN approximation as approximation in adaptive bases, which can be learned via training data.

1.1. Challenges. Yet despite the impressive empirical and theoretical results achieved in the broader DL community, there is concern that methods based on DNNs do not currently meet the usual rigorous standards for algorithms in computational science [9]. While the aforementioned theoretical results assert the expressibility of the class of DNNs – that is, the existence of a DNN of a given architecture that achieves a desired rate of convergence for a given problem – they say little about their practical performance when trained by modern approaches in DL. If such techniques are to achieve widespread adoption in scientific computing, it is vital they be understood through the lens of numerical analysis, namely, (i) stability, (ii) accuracy, (iii) sample complexity, (iv) curse of dimensionality and (v) computational cost.

(i) Stability. Recently, researchers have begun to question the stability properties of DNNs [35, 63, 81]. A series of works have demonstrated that DNNs trained on tasks such as image classification are vulnerable to misclassification when provided images with small “adversarial” perturbations [64] and can even completely fail on image reconstruction tasks in the presence of small structural changes in the data [4, 42]. As deep learning is increasingly being applied towards critical problems in healthcare, e.g., DeepMind’s recent work on machine-assisted diagnostic imaging in retinal disease [24], many have questioned the ethics of applying tools whose stability properties are not fully understood to such problems.

(ii) Accuracy. Over the past 5 years, many works have been published on the classes of functions, e.g., analytic or piece-wise continuous, that can be approximated by DNNs of a given size with a certain rate of convergence. These results are constructive, often showing

THE GAP BETWEEN THEORY AND PRACTICE 3

the existence of a DNN emulating another approximation scheme, e.g. polynomials, for which convergence rates have already been established. While such results provide a useful benchmark for DNN expressivity, they do not suggest methods for training DNNs that reliably achieve the tolerances required in computational science applications.

(iii) Sample complexity. Areas in which DL has seen the greatest success include problems in supervised learning such as image classification. In such settings, DNNs are trained on large sets of labeled images, yielding a model capable of predicting labels for unseen images. Popular datasets for DL competitions include the ImageNet database which contains 14 million hand-annotated images of more than 20,000 categories of subjects [25]. In contrast, problems in computational science are often relatively data-starved, e.g. applications in uncertainty quantification (UQ) which involve computing a quantity of interest from sampled solutions of a parameterized PDE [46]. As each sample involves the discretization and solution of a PDE, which may require thousands of degrees of freedom to accurately resolve, there is great attention paid in such problems to minimizing the required number of samples [3,29].

(iv) Curse of dimensionality. Many modern problems in scientific computing involve high dimensionality. High-dimensional PDEs occur in numerous applications, and parametrized PDEs in UQ applications often involve tens to hundreds of variables. Recent works have shown that certain DNNs have the expressive capabilities to mitigate the curse of dimensionality to the same extent as current best-in-class schemes [11,33,43,55,62,68,72]. Yet, as noted, this does not assert these rates can be achieved via training. Moreover, the curse of dimensionality is an important consideration in the sample complexity, as the cost of obtaining samples can often dominate the overall cost. A recent numerical study has shown that approximation quality degrades with increasing dimension for approximating solutions of high-dimesional parameterized PDEs [38]. However the observed scaling is not exponential in the dimension d, but dependent on the complexity of the underlying problem. Understanding this scaling with respect to the sample complexity is crucial to applying these methods in computational science applications.

(v) Computational cost. By far, the largest barrier to entry for DL research is the cost of training. DNNs are typically trained on graphics processing units (GPUs), and a single GPU can cost thousands of US dollars. In many industry applications, models are trained on hundreds of these specialized cards. In addition, the training process itself is very energyintensive, and can produce a large amount of excess CO. Even a small reduction in computational cost can yield large cost savings and greater access to resources for researchers.

In the near term, it seems likely that any DL implementation will pay a price in computational cost. Hence there needs to be a clear understanding of the benefits vis-a-vis properties (i)–(iv) above. The study of these concerns is the broad purpose of this paper.

1.2. Contributions. Our main objective is to examine practical DNN approximation on problems motivated by scientific computing. In many applications in computational science, the core task involves approximating a function , with domain (often 1). Hence our main aim is to examine the performance of DL on practical function

4 B. ADCOCK AND N. DEXTER

approximation through the five considerations (i)–(v). Our main contributions are:

1. We develop a computational framework for examining the practical capabilities of DNNs in scientific computing, based on the rigorous testing principles of numerical analysis. We provide clear practical guidance on training DNNs for function approximation problems.

2. We conduct perhaps the first comprehensive empirical study of the performance of training fully-connected feedforward ReLU DNNs on standard function classes considered in numerical analysis, namely, (piecewise) smooth functions on bounded domains. We compare performance over a range of dimensions, examining the capability of DL for mitigating the curse of dimensionality. We examine the effect of network architecture (depth and width) on both ease of training and approximation performance of the trained DNNs. We also make a clear empirical comparison between DL and current best-in-class approximation schemes for smooth function approximation. The latter is based on polynomial approximation via compressed sensing (CS) [3], which (as we also show) achieves exponential rates of convergence for analytic functions in arbitrarily-many dimensions, see Section 5.1 for more details.

3. We present theoretical analysis that compares the performance of DL to that of CS for smooth function approximation. In particular, we establish a novel practical existence theorem, which asserts the existence of a DNN architecture and training procedure (based on minimizing a certain cost function) that attains the same exponential rate of convergence as CS for analytic function approximation with the same sample complexity.

1.3. Conclusions. The primary conclusion of this work is the following. While it is in- creasingly well understood that DNNs have substantial expressive power for problems relevant to scientific computing, there remains a large gap between expressivity and practical performance achieved with standard methods of training. Surprisingly, trained DNNs can perform very badly on functions for which there are strong expressivity results, such as smooth functions in high dimensions and piecewise smooth functions. Yet, on other examples, they are competitive with current best-in-class schemes based on CS. We also draw the following conclusions based on our experimental results training fully-connected feedforward ReLU DNNs:

1. The accuracy of trained DNNs is limited by the precision used, but is typically nowhere near machine epsilon despite training to such tolerances in the loss. In this work, we perform experiments in both single and double precision. Yet, in both cases, it is typically impossible to get beyond four digits of accuracy even when approximating extremely smooth functions. In contrast, a combination of Legendre polynomial approximation in the hyperbolic cross subspace with weighted -minimization and a specific choice of weights motivated by smooth function approximation can reliably achieve six or seven digits of accuracy on such problems. Since our theoretical contribution shows that DNNs should be capable of obtaining these results (albeit with a different initialization and training procedure), this fact suggests a limitation of standard training methods in obtaining more accurate results.

2. The training process itself is also highly sensitive to the parameterization of the solvers and initialization of the weights and biases of the DNNs. After extensive testing, we chose the Adam optimizer with exponentially decaying learning rate over a variety of optimizers including standard SGD, finding empirically that this strategy can help to mitigate some of the challenges of solving the non-convex optimization problem of training, e.g., non-monotonic decrease in the loss, and slow convergence to minimizers. We also initialize our networks with

THE GAP BETWEEN THEORY AND PRACTICE 5

symmetric uniform or normal distributions with small variance, finding larger variances can lead to failure in training or slow convergence. These choices combined lead to a training process that is largely stable and minimizes the probability of failure in training over a wide range of architectures. However, failures can still occur and are often a consequence of choosing a network that is either too shallow and narrow (where failure occurs immediat