Resolving learning rates adaptively by locating Stochastic Non-Negative Associated Gradient Projection Points using line searches

2020·Arxiv

Abstract

Abstract

Learning rates in stochastic neural network training are currently determined a priori to training, using expensive manual or automated iterative tuning. This study proposes gradient-only line searches to resolve the learning rate for neural network training algorithms. Stochastic sub-sampling during training decreases computational cost and allows the optimization algorithms to progress over local minima. However, it also results in discontinuous cost functions. Minimization line searches are not effective in this context, as they use a vanishing derivative (first order optimality condition), which often do not exist in a discontinuous cost function and therefore converge to discontinuities as opposed to minima from the data trends. Instead, we base candidate solutions along a search direction purely on gradient information, in particular by a directional derivative sign change from negative to positive (a Non-negative Associative Gradient Projection Point (NNGPP)). Only considering a sign change from negative to positive always indicates a minimum, thus NN-GPPs contain second order information. Conversely, a vanishing gradient is purely a first order condition, which may indicate a minimum, maximum or saddle point. This insight allows the learning rate of an algorithm to be reliably resolved as the step size along a search direction, increasing convergence performance and eliminating an otherwise expensive hyperparameter.

Keywords Optimization Artificial Neural Networks Line Search Discontinuous Loss Function Stochastic Sub-Sampling

1 Introduction

The aim of this paper is to compare gradient-only line searches to minimization line searches in the context of the non-convex, discontinuous cost functions encountered

Centre for Asset and Integrity Management (C-AIM), Department of Mechanical and Aero-

in Artificial Neural Network training. With the ability to conduct line searches in this environment, it is possible to automatically resolve step sizes (learning rates) adaptively, as determined by cost function characteristics. To this end, we introduce the various concepts needed for our study.

1.1 Cost Functions and Gradients

Machine learning models are often posed with objective functions of the form

where is a vector of parameters, is a training set, and ) is a loss quantifying the performance of parameters x on training sample t. The exact gradient w.r.t. x is then computed by back-propagation [47]

For large M, the exact gradient can become computationally demanding to compute. Instead, a (mini-)batch sample, of size is selected from the training set to compute an approximate loss function

and approximate stochastic gradient

with expectation E[L(x)] = L(x) and ) [45]. Usually, the batch or mini-batch is drawn uniformly and independently from M, but it has been noted that mini-batch sampling may lead to the non-convergence of certain algorithms [4], which can be rectified using stratified or active mini-batch sampling [50]. In addition to cost, mini-batch sampling may alleviate the presence of some local minima, as neural network objective functions have been reported to be multi-modal [11,17,10,40] and therefore aid convergence [39].

Hence, since L may be non-differentiable due to discontinuities, we weaken the notion of gradients or derivatives to associated derivative and associated gradient [48,44] as defined in Definitions 1 and 2.

Notation: Explicit dependency on variables are occasionally omitted e.g. L instead of L(x). Sequences are denoted by subscripts e.g. , denoting the iterate. Entry-wise products also known as the Hadamard or Schur products [12] are explicitly denoted using .

Definition 1 Let be a piecewise smooth real univariate stepdiscontinuous function that is everywhere defined. The ) for f(x) at a point x is given by the derivative of f(x) at x when f(x) is differentiable at x. The for f(x) non-differentiable at x, is given by the left-sided derivative of f(x) when x is associated with the piecewise continuous section of the function to the left of the discontinuity, otherwise it is given by the right-sided derivative.

Definition 2 Let be a piecewise continuous function that is everywhere defined. The associated gradient g(x) for f(x) at x is given by the gradient of f(x) at x when f(x) is differentiable at x. The associated gradient g(x) for f(x) non-differentiable at x is defined as the vector of partial derivatives with each partial derivative as defined in Definition 1.

It follows from Definitions 1 and 2 that the associated gradient reduces to the gradient for everywhere differentiable functions. Derivative and gradient is from here on taken to imply respectively associated derivative and associated gradient. Directional derivative denotes a derivative or gradient vector projected along a direction. Gradient vectors are assumed to be column vectors and the vector transpose is indicated by superscript T.

1.2 Optimization Formulations: Smooth Case Using Full Batch

In general there are three formulations to define the solution or candidate solutions to an optimization problem using (1) and (2), namely:

1. Formulation 1: direct minimization of (1) [15,3,44], 2. Formulation 2: finding stationary points of (2), known as the optimality criterion [15,3,44], and

3. Formulation 3: finding non-negative gradient projection points, , [48, 44], as defined by the non-local optimality condition in Definition 3.

Definition 3 Suppose that is a real-valued function for which the associated gradient field ) is uniquely defined for every . Then, a point is a non-negative associated gradient projection point (NNGPP) if there exists a real number 0 for every such that

The non-negative associated gradient projection point is named after the directional derivative that is required to be non-negative for all multi-dimensional directions u, where the gradient is evaluated away from at .

Formulations 1-3 are well illustrated by the univariate loss function F(x) depicted in Figure 1(a) with (b) loss derivative function:

1. Formulation 1: is the minimum of F(x).

2. Formulation 2: The only stationary point (candidate minimum) is , since the necessary condition minimum if in addition to 0, which together forms the sufficiency conditions for to be a local minimum.

Fig. 1: (a) Univariate loss function and (b) derivative function. The non-negative projection point, , (Formulation 3) for the example quadratic function coincides with , at which which the derivative is also zero (Formulation 2) and the function value is a minimum (Formulation 1). Non-negative gradient projection points (NN-GPP) are defined using a non-local optimality formulation that requires all directional derivatives of points in a non-empty ball, around the NN-GPP to be non-negative. The directional derivative, , of a point in the ball is defined by the direction vector that connects the non-negative projection point to the point in the ball x, i.e. , which is then projected onto the derivative evaluated at the point in the ball ).

3. Formulation 3: The only non-negative projection point is , consider the point ), with one-dimensional direction 0 and derivative 0 that results in the directional derivative 0. Similarly, ), with direction 0 and derivative 0 that results in the directional derivative 0. Hence, any point ) results in a non-negative directional derivative. Note that neither or is a non-negative projection point, since there are points to the right of that result in negative directional derivatives (d > 0 while 0) and there are points to the left of that result in negative directional derivatives (d < 0 while 0). The non-negative projection point defines a local minimum without having to compute second order information [48,44], as positive directional derivative implies increase in function along that direction.

Consider Figures 2(a) and (b) for the application of Formulations 1-3 to the actual loss function of the full training set of the Glass1 dataset [36], for step length along the steepest descent direction, ), evaluated at . Figure 2(a) depicts the actual univariate loss function along the search (descent) direction, i.e.

with directional derivative,

along the search direction depicted in Figure 2(b). The minimum, stationary point and non-negative projection point is at 5.

Formulations 1-3 have been shown to be equivalent [48], when restricting (1) to the class of convex functions. Formulation 1 is solved by direction minimization of (1), while formulation 2 requires the roots of (2) to be computed. Formulation 3 is solved by finding sign changes along directional derivatives of descent directions from negative to positive until no more descent directions remain. Note sign changes from negative to positive along a search direction implies a minimum without requiring the directional derivative to be zero at the minimum. This distinction is important when considering discontinuous loss functions that occur when conducting mini-batch sampling.

Fig. 2: Typical loss function along the steepest descent direction in neural net training, with (a) function value and corresponding (b) directional derivative along the descent direction as a function of step size . The problem is constructed using the full training data of the Glass1 dataset [36] (for details see Section 3.5). The local minimum is indicated as a black dot in (a), while the stationary point and non-negative projection point are indicated by a black point in (b).

1.3 Optimization Formulations: Discontinuous Case Using Sub-Sampled Mini-Batches

In the case of randomly sampling a new mini-batch at every approximate loss evaluation and gradient, L(x) and g(x), discontinuities occur between evaluations. These are stochastic in nature, resulting in spurious local minima being found at discontinuities. Additionally, the probability of a stationary point (g(x) = 0) existing at a full-batch minimum, , is low, since sub-sampled losses might have gradients close to, but not equal to zero. Although the NN-GPP definition is robust to the presence of discontinuities [48,44], the stochasticity introduced due to continually sampling different mini-batches causes the locations of NN-GPPs to no longer be unique. We therefore relax and generalize its definition to the Stochastic NN-GPP as follows:

Definition 4 Suppose that is a real-valued function for which the ) is uniquely defined for every . Then, a point is a stochastic non-negative associated gradient projection point (SNN-GPP) if there exists a real number 0 for every such that

with a non-zero probability.

Therefore, every NN-GPP is also a SNN-GPP, as a NN-GPP satisfies the above definition with probability 1. However, the stochastic nature of the loss results in all SNN-GPPs of a given neighbourhood on the loss surface being contained in a ball with radius smaller than . Thus, we defined ball, as follows:

Definition 5 Consider a mini-batch sub-sampled loss function L(x), of a continuous, smooth and convex full-batch loss function L(x), such that E[L(x)] = L(x) and ) [45]. The minimum of the full-batch loss is . Due to the variance of L(x) at full-batch minimum , there exists a ball, : , that contains all the stochastic non-negative gradient projection points (SNN-GPPs).

To illustrate these concepts, reconsider the Glass1 dataset [36], with the exception of sampling using mini-batches with stochastic sub-sampling to compute the loss function and gradients. Figures 3(a) and (b) show the function values and directional derivatives along the steepest descent direction for different mini-batch sizes [10, 30, 50] and compares these to the full dataset (All Data).

Firstly, it is evident that the loss function and directional derivative along the search direction are both discontinuous as a result of sampling induced discontinuities. The equivalence between Formulations 1-3 having been addressed in the previous section when using the full batch, the differences between Formulations 1-3 are now quantified when using stochastically sub-sampled batches to approximate the loss function and directional derivative along a search direction, as depicted in Figures 3(a) and (b). It is evident that defining the solution as the minimum of (1) is both problematic and not representative of the underlying solution as shown by (All Data). Similarly, considering Formulation 2 requires the directional derivative to be zero, which evidently does not exist and also for which the point with smallest directional derivative is not necessarily a good approximation to the All Data solution. Evidently SNN-GPPs are significantly better approximations to the All Data solution. SNN-GPPs are localized in a ball around the All Data solution, due to the stochastic nature of the problem. The larger the batch size the smaller the ball radius as indicated by the vertical dotted lines in Figure 3(b) for the corresponding P values averaged over 100 reconstructions.

In addition, we count the average number of local minima and SNN-GPPs over 100 reconstructions of the line search ) over 0 10. For each run along the search direction, 100 steps are taken in increments of 0.1, where a function value and gradient evaluation is performed at each increment. The local minima and SNN-GPPs are then counted within the given range as depicted in Figure 3(c) for the Glass1 dataset. The shaded region gives an indication of the standard deviation and the solid line denotes the mean. When the full dataset is considered, both the number of local minima and SNN-GPP are one and equal to the single local minimum located at 5. It is clear, that the SNN-GPPs indicate only one solution for a much smaller batch size than the minimum of the function. Overall the number of solutions for SNN-GPPs are around half the local minima present in the loss function using single sample mini-batches. In addition to the number of solutions, the spatial location of the solutions along the line search is quantified in Figure 3(b) for a batch size of 10. The SNN-GPPs are localized to the domain 2 3, that is clustered around 5, whereas the local minima of the approximated loss function are scattered over almost the entire domain 010.

Fig. 3: (a) Loss function and (b) directional derivative along the steepest descent direction with step size for different mini-batch sizes, namely, 10, 30 and 50 samples per batch stochastically sub-sampled. The size of ball, , along the search direction is also indicated in (b) by vertical lines for different batch sizes. In addition, the (c) average number of local minima as a function of sample size for 0 10, and (d) the spatial location of local minima along the search direction are indicated. Evidently, as the sample size increases, the number as well as location of the minima become consistent. Variance in the directional derivative is far less affected by mini-batch sampling than the loss function, making strategies that sensibly rely on gradient or derivative information better suited for line searches, albeit that the variance has implications for exact and inexact search strategies.

Evidently, SNN-GPPs are better generalizers to the All Data solution than the minimum of the approximated loss function. The implication is that line searches that aim at finding SNN-GPPs should perform better than line searches that aim to minimize along a search direction. Away from the domain of SNN-GPPs, sign changes are negligibly scarce in the directional derivative and instead are centered around the All Data solution in a ball, , due to the stochastic nature of the problem. This has implications primarily for exact line searches, where we can only expect to resolve SNN-GPPs to within .

1.4 Related Work

Supervised machine learning is often divided into ”noisy” stochastic optimization problems [9] related to very small batch samples (typically a single data point) and batch averaged approximation regimes that utilize larger batch sizes. As demonstrated in this paper, both problems are discontinuous, with the size of the discontinuities decreasing as the batch size increases. Additional approaches to reduce the discontinuity size include dynamic sample sizes i.e. increasing the sample size as the optimizer converges [9] that results in a smooth and continuous function in the limit of maximum sample size i.e. full batch. As dynamic sub-sampling requires the sample size to grow to the limit of the full batch it is not well suited for mini-batch applications with memory constrained datasets. As demonstrated in this study, the generalization of the optimization problem to find SNN-GPPs allows for a framework that is well suited for stochastic or active sub-sampling [50] over the entire optimization run. Active sub-sampling [50] is another approach to reduce the discontinuity size, in which the training data is split into additional datasets based on their error and then effectively sub-sampled stochastically. Active sub-sampling [50] can also address the non-convergence issues [4] of well known machine learning optimizers such as Adam [22]. Global optimization strategies such as particle swarm optimization [13], genetic algorithms [32] and Bayesian combined genetic programming approaches [31] are useful in the context of highly non-linear and multi-modal problems while only relying on loss function evaluations. However, these methods are well-known to be computationally demanding.

Stochastic gradient algorithms were introduced by Robbins-Monro [38], and include developments by Nesterov’s dual averaging method [34]. Subgradient methods introduced by Shor [42,41] are closely related to stochastic gradient algorithms. In fact, stochastic gradient descent (SGD) is a classical subgradient method i.e. steepest descent with a fixed learning rate [7]. In turn, subgradient methods are closely related to the newly coined proximal-gradient methods [7]. All these methods make use of a priori selected step sizes that are either constant or follow some schedule with known convergence characteristics. These learning rate parameters remain the most sensitive to successful configuration [6] and are currently determined either by the user on a ”trial and error” basis, or by computationally expensive automated means [43,5,6,18]. An attempt to incorporate an adaptive learning rate using an inexact line search strategy based on the Wolfe conditions requires dynamic sub-sampling that iteratively increases to the batch size from small to the full batch size [9] to finally ensure a smooth and continuous problem that has well-known convergence characteristics. Another approach has been to conduct probabilistic line searches in a Bayesian optimization framework [29]. All the approaches discussed so far are referred to as function minimizers i.e. they aim to find the minimum function value of the loss function.

Mini-batch of sub-sampling have seen limited application of line search as they perform poorly or do not converge since the underlying assumptions on which the line searches were developed do not apply for stochastic sub-sampling. Theoretical developments of convergence proofs and estimating theoretical convergence rates include the well-known linear convergence rate of gradient descent methods for strongly-convex loss functions. Sub-linear convergence rates are achieved when f is only convex [21]. A number of alternatives to strong convexity have been presented that include error bounds [28], essential strong convexity [26], weak strong convexity [16], restricted Secant inequality [51], quadratic growth with the Mangasarian-Fromovitz constraint qualification [2], Polyak-Lojasiewicz condition [21] and associated derivative descent sequences [48]. In line with the discontinuous nature of the stochastic loss function, associated derivative descent sequences are not based on assumptions of Lipschitz continuity or convexity but only assumes associated derivative unimodal functions with convergent subsequences.

An alternative to function minimizers are gradient-only optimization strategies that solves for SNN-GPPs as defined by the gradient-only optimization problem [48,44]. SNN-GPPs were specifically proposed to define solutions for discontinuous functions. In this study we consider line searches specifically developed to find SNN-GPPs. We base our convergence proofs on associated derivative descent sequences.

1.5 Contribution

The gradient-only optimization problem which solves SNN-GPPs [48,44] is a generalized problem, which applies from full batch training (or conventional minimization problem) to mini-batch training (or discontinuous optimization problem). In practice SNN-GPPs have been resolved using gradient-only line search (GOLS) strategies [48,44]. In this paper we adopt this approach to automatically and adaptively determine the learning rates during supervised mini-batch neural network training, and demonstrate its effectiveness and generality by comparing GOLS to minimization line search strategies over a large number of ANN problems.

2 Method

As the emphasis of this study is on resolving the step size and not the implications of search directions, we restrict ourselves to the mini-batch stochastic gradient descent algorithm. We consider exact and inexact line searches that rely on both loss function values [15,3,44] and gradients-only line search strategies [48,44]. Additional benefits for considering line searches to resolve the step size include the potential to consider higher order algorithms such as conjugate gradient and QuasiNewton approaches to resolve curvature information regarding the problem [15,3, 44].

2.1 Stochastic Gradient Descent with Line Search Strategies

Consider stochastic gradient descent (SGD) algorithm as outlined:

which defines functional value based line searches. Alternatively, line searches can be used to find SNN-GPPs i.e.

to form the class of gradient-only line search strategies. For both approaches the desired points can be resolved within a small tolerance to give an exact line search, or approximately, which results in inexact line searches.

2.2 Exact line searches: Function Value based Golden Section (GS) and Bisection Gradient-only Line Search (B-GOLS)

Function value based line searches were developed for twice differentiable smooth loss functions with the assumption that the bracketing phase only bounds one minimizer. Similarly, B-GOLS was initially developed to bound a single SNNGPP for static discontinuous functions. As a result of the stochastic nature of the loss function the usage of an exact line search is a misnomer as the minimum or sign change is assumed to occur in a non-zero n-dimensional ball that is an implied dense set. Here, exact line search implies converging to a likely minimizer or SNN-GPP within this n-dimensional ball. The exact line search strategies are considered for comparative purposes in this study.

Exact line searches first bracket a candidate solution and then refine the interval to find the minimum or SNN-GPP [3]. Refinement of a minimum requires significantly more computation than isolating a SNN-GPP as illustrated in Figures 4(a) and (b). Four points forming three intervals are required to isolate a local minimum [15,3,44]). The optimal reduction is Golden Section reduction that reduces the interval by 38% per iteration [15,3,44]. Isolating a SNN-GPP is equivalent to isolating the directional derivative from negative to positive along the search direction which can be done using bisection i.e. three points forming two intervals that reduces the interval by 50% per iteration [44]. Pseudo-code B-GOLS to find SNN-GPP is included in Appendix A. It is important to note that finding only a sign change from negative to positive in the directional derivative along the search direction enhances the robustness of gradient-only approaches against noise, since magnitude variations in directional derivatives are ignored for this exact line search. For both the GS and B-GOLS the step size for the location of minima and SNN-GPP were resolved to a tolerance of 10.

Fig. 4: Comparison between exact line searches that (a) minimize, such as the Golden Section (GS) method, versus (b) identify SNN-GPP by isolating sign changes from negative to positive using the Bisection Gradient-Only Line Search (B-GOLS). The directional derivatives along the search direction d = 1 at the discrete points are indicated by the red line segments.

2.3 Inexact line search: Function value based Armijo’s rule (ARLS) and inexact gradient-only line search (I-GOLS)

Inexact line searches define ranges for acceptable steps that are not:

Ensuring that large enough step sizes are taken is usually enforced by backtracking from large step sizes until an acceptable step size has been found. Alternatively, step sizes are advanced from small to larger ones until the largest acceptable step size has been found.

Fig. 5: Schematic diagrams for (a) the function value based inexact line search that is based on Armijo’s rule (ARLS) and the (b) inexact gradient-only line search (IGOLS) with the directional derivative slopes at the points of interest highlighted in red. Armijo’s rule was developed for smooth functions, while the gradient-only inexact approach was developed for discontinuous functions.

The chosen function value based inexact line search (ARLS) is based on the practical and robust Armijo’s Rule [3] that defines an upper bound to a domain of acceptable steps

with 0 1. For p = 0 any value below the level-set F(0) is allowed, and for p = 1 the step-size for convex functions is reduced to 0. As a guideline p = 0.2 is preferred. In this study, a robust implementation of Armijo’s Rule is realized that ensures the largest feasible step size is taken as the update step. If the initial step is acceptable the step size is increased until the condition is not satisfied and the largest acceptable step taken as the step. Should the initial step fail Armijo’s rule, then the step size is reduced using backtracking until the first acceptable step size. This selective employment of advancement or backtracking ensures the largest steps are taken. In this light, increases and decreases of the step size were made with a factor of 2.

A gradient-only line search that allows for sampling beyond the actual SNNGPP is simply, any that satisfies:

with 0 1. Here, r = 0 implies any update that reduces the magnitude of the directional derivative, whereas r = 1 implies only derivative descent updates i.e. updates for which the directional derivative is negative at the update. Hence, IGOLS may be sensitive to variance in gradient magnitude. The algorithmic details of this method are given in Appendix B.

As loss functions exist that have unbounded solutions we impose a maximum step size limit on both ARLS and I-GOLS, in addition to a minimum step limit given by

The inexact line search strategies require an initial step size. For the very first iteration, this initial guess for is set to . In subsequent iterations the initial guess is set to be the resolved step size of the previous iteration, .

2.4 Theoretical Basis

Before we present proofs of convergence of associated derivative descent sequences for descendible functions, we present definitions for associate derivative (weakly) unimodal functions. Associate derivative unimodal functions include discontinuous functions with finite discontinuities but excludes piece-wise linear functions.

Definition 6 For a given sequence : suppose ) for some k and with defined in Definition 5. Then the sequence is an associated derivative descent sequence for , if an associated sequence : may be generated such that if is a descent direction from the set of all possible descent directions at , i.e. 0 then

In this study we limit the functions to the class of descendible associated derivative multivariate functions, which includes a number of piece-wise smooth step discontinuous functions but excludes piece-wise linear continuous functions.

Definition 7 A multivariate function is descendible if and is an associated derivative descent sequence, as defined in Definition 6, for f with initial point , then every subsequence of converges. The limit of any convergent subsequence of is SNN-GPP, as defined in Definition 3, of f.

Note 1 Performing exact gradient-only line searches result in strict associated derivative descent sequences, whilst our proposed inexact line search may or may not depending on the chosen parameter values. Although we consider parameter values in this study for which strict associated derivative descent sequences do not hold, the generated subsequences are convergent for the problems considered in this study.

Theorem 1 Let be a descendible associated derivative multivariate function as defined by Definition 7. Let and be an associated derivative descent sequence, as defined in Definition 6, for f with initial point , then by definition every subsequence of converges. The limit of any convergent subsequence of is a ball of Stochastic non-negative associated gradient projection point (SNN-GPP), as defined in Definition 5, of f.

Proof. It follows our assertion that f is a descendible associated derivative multivariate function given by Definition 7, that every derivative descent subsequence converges.

Now let be a convergent subsequence of and let be itslimit. Suppose, contrary to the second assertion of the theorem, that is not a SNN-GPP as defined in Definition 3 of f. Since we assume that is not a SNN-GPP, and by Definition 6, there exists a + for and : such that + 0, which contradicts our assumption that is the limit of the subsequence . Therefore, for to be the limit of an associated derivative descent subsequence as defined in Definition 5, which completes the proof.

Specific classes of descendible functions can be defined that include descendible associated derivative unimodal multivariate functions.

Definition 8 A multivariate function is descendible if and is an associated derivative descent sequence, as defined in Definition 6, for f with initial point , then every subsequence of converges. The limit of any convergent subsequence of is a generalized SNN-GPP, as defined in Definition 3, of f and the set of SNN-GPP is compact.

The class of associated derivative unimodal multivariate functions can be relaxed to descendible associated derivative weakly unimodal multivariate functions

Definition 9 A multivariate function is descendible associated derivative according to Definition 7 and weakly unimodal if the set of SNN-GPP forms a dense set.

We consider the latter class of descendible associated derivative weakly unimodal multivariate functions as an appropriate definition for a large class of machine learning loss functions.

3 Numerical Studies

3.1 Diverse Toy Classification Datasets

In the numerical investigations we consider 22 classification problem datasets that cover from 150 to 70 000 observations per dataset. Our earliest dataset was made available in 1936 and the latest in 2016. The datasets vary in input features from 4 to 784 and classes from 2 to 29. This range in characteristics gives the selected problems a large variety of cost function landscapes for the different optimization algorithms to navigate. The primary aim of our numerical studies is to demonstrate that the performance trends observed are general across vastly different datasets, and representative of ANN problems with non-convex, discontinuous cost function landscapes. Details concerning the datasets are given in Table 1 and the corresponding neural network architectures implemented for the different datasets are given in Table 2. Fully connected neural network layers with one, as well as two hidden layers were considered for all of the datasets but one. For the two hidden

Table 1: Properties of the datasets considered for the investigation.

Table 2: Properties of the neural network architecture for each dataset.

layer case, each hidden layer was given the same number of nodes, as shown in Table 2. This results in a total of 43 classification problems to be solved.

We consider all four optimizers outlined, namely the minimizing line searches (GS and ARLS) and the gradient-only line searches (B-GOLS and I-GOLS). All optimizers are run for 3000 iterations, while the mini-batch size remains constant at 10 data points over all datasets. The mini-batch size is chosen arbitrarily in this case, though arguably all datasets have different characteristics and therefore the optimal mini-batch size to allow for comparable variance between datasets as well as optimizing for computational performance may vary [37,25]. However, this is outside of the scope of this study.

For all the given datasets, the data was split into training, validation and test datasets such that the ratios between the training dataset to the validation and test datasets are 3:1 respectively. The data is split into three datasets in order to show consistency between the different fractions of the data, should a validation set be selected for a stop criterion and a test set used to evaluate the generality of the model. It is therefore desirable to observe as much consistency as possible between validation and test datasets. The loss function expressed by Equation (16) was used to determine the training as well as validation and test dataset errors, using the network in its current configuration at a given iteration. The optimization runs of 3000 iterations are repeated 10 times using exactly the same starting points, determined by generating values between [1] for each of the neural network weights [36]. This ensures that initially all sigmoid activation functions are in their sensitive domain.

As some methods in this study use function values and other use gradients, a common convention is defined to compare the relative computational cost of the different types information used. Towards this aim, results are given as a function of function evaluations, where one function value evaluation equates to one function evaluation and one gradient evaluation equates to two function evaluations. A function value evaluation is simply a forward pass to determine the error from the cost function, whereas a gradient evaluation on its own consists of both a forward and a backward pass via backpropagation, thus having twice the computational cost of a function value evaluation. For fair comparison, the number of function evaluations are reported for the different methods and their performances based thereon.

3.2 Variational Auto Encoder Training

A separate example is included that demonstrates the performance of I-GOLS in the context of training a variational auto-encoder. The code is written using Pytorch and was sourced from Zou [53], based on the work of Kingma [23]. It was subsequently modified to include and use I-GOLS in SGD. The parameters of I-GOLS were kept the same as discussed previously and the activation functions were again chosen to be sigmoids.

In this investigation we compare I-GOLS to three instances of SGD with fixed step sizes of 1010and 10. These learning rates were manually chosen and selected such that one learning rate is too high, one appropriate and one too low, each separated by an order of magnitude.

4 Results

The results are split into different error plots for the training, validation and test datasets respectively. This shows the effectiveness of the different methods during training, and displays the consistency of obtained solutions to the unseen data. It is important to note here, that since the optimizers operate only on the training data, their specific performances are gauged on the training data errors. The validation and test dataset error plots give indications on the generalization of the network configurations found. Generalization is linked to the characteristics of the individual dataset (noise, data overlap etc.) and therefore should not be over-emphasised in determining an optimizers performance. This trait underlines the importance of considering a multitude of datasets for determining optimizer performance in this study. However, the trends of validation and test error can give insight into the behaviour of different optimizers and therefore should still be considered.

4.1 Averaged results

Figure 6 shows the error plots as averaged over all the given datasets for the single hidden layer networks. The solid lines indicate the average errors as defined by the cost function for the different line search algorithms in terms of the number of function evaluations. The shadings surrounding the average error plots indicates the standard deviation of the errors.

Fig. 6: Average training, validation and test dataset error graphs averaged over (a)-(c) single hidden layer networks, (d)-(f) double hidden layer networks and (g)-(i) both single and double hidden layer networks. The graphs are corrected for the computational cost of function value and gradient information. They are given in terms of Function Evaluations, where a function value evaluation equates to one, and a gradient evaluation equates to two Function Evaluations.

The standard deviations are large in Figure 6, as the datasets considered exhibit a wide range of different characteristics, with varying minimum errors and convergence rates. The exact line search algorithms require more function evaluations per iteration than the inexact line searches, as expected. The gradient-only based line searches also require more function evaluations than the equivalent function value based methods. Therefore the question is whether the performance of the gradient-only methods outweighs their added computational cost.

Consider the case of GS in the single hidden layer average results in Figures 6(a)-(c). Though the training error GS does not progress as far down as that B-GOLS, the validation and test errors are comparatively lower. This may indicate an ability to find better local minima. However, this trend is not general, as can be seen in the case of the double hidden layer networks in Figures 6(d)-(f). In this case B-GOLS is superior in both error and computational cost. This carries over to the average over all problems in Figures 6(g)-(i).

Between the inexact line searches it is clear, that I-GOLS is superior. Function value based ARLS fails to converge to a useful solution within the given number of iterations. The gradient only based I-GOLS produces comparable performance to B-GOLS at an order of magnitude lower number of function evaluations. This is consistent over all three average error plots, demonstrating the ability of I-GOLS to be competitive over both differing datasets as well as network architectures.

4.2 Examples of individual best performances for different methods

Within the dataset pool there were isolated incidences where different methods performed better than their averages reflect. In some cases the performance could be deemed better for given methods than the rest.

This performance was measured by considering training, validation and test errors obtained at the respective computational cost, with a bias towards the errors. If the errors for the different methods were comparable, the best performer is the method with the least computational cost. However, if a clear minimum in error was achieved by a given method in the set number of iterations, that method was considered to be the best performer regardless of its computational cost. Given here are some examples of such cases and the number of their occurrences.

4.2.1 Golden Section as best performer (3 out of 43 cases)

Figure 7 gives an example where GS was effective in finding local minima. It is clear, that this particular problem is prone to overfitting, as the training error continues to decrease while that of the validation and test datasets exhibits a minimum and sharp increase as training continues. In the training data error plot in Figure 7(a), the convergence rate of GS is comparable to that of B-GOLS, albeit that it does not progress as far down in error as B-GOLS for the same number of iterations. This is not so evident in Figure 7, but can be deduced from Figures 7(b) and (c), where the GS plot ends early. However, the resulting validation and test error minima are on average an error value of 0.8 lower. Similar cases occurred for 3 out of the total of 43 problems, and only on those using single hidden layer networks.

Fig. 7: Average (a) training, (b) validation and (c) test dataset errors obtained for different line searches for the Gene1 dataset [36] using the single hidden layer architecture, an example of a dataset where the function value based GoldenSection method was the best performer.

4.2.2 Bisection Gradient-Only Line Search as best performer (6 out of 43 cases)

Figure 8 shows an example of B-GOLS being the most effective in training. This problem exhibits the characteristic where the cost function has a large flat plane to traverse over at an error value around 5. However, after this obstacle has been overcome, convergence is more rapid. It is here where B-GOLS performed the best. Its convergence rate post flat plane is the fastest, obtaining the lowest training as well as validation and test errors for the given number of iterations. The improvement over its nearest rival, I-GOLS is an error value around 0.4. Examples such as these indicate that there are problems which require more accurate resolution of the SNN-GPP in order to progress effectively. However, more iterations would be required in such cases in order to determine the absolute performance of the methods. Training had not yet been completed, thus it is possible that I-GOLS, though slower in convergence after traversing the flat plane, would be computationally cheaper in finding a good SNN-GPP. This would require further investigation. Nevertheless, given the number of fixed iterations, there were 6 datasets on which B-GOLS marginally better than the rest.

Fig. 8: Average (a) training, (b) validation and (c) test dataset errors obtained for different line searches for the Soybean1 dataset [36], an example of the Bisection Gradient-Only Line Search as the best performer.

4.2.3 Inexact Gradient-Only Line Search as best performer (34 out of 43 cases)

Fig. 9: Average (a) training, (b) validation and (c) test dataset errors obtained for different line searches for the Mushroom1 dataset [36], an example of the inexact gradient-only line search method as the best performer.

Overall, I-GOLS is the most efficient in progressing training relative to computational cost. As confirmed in the averaged error plots, it is able to reach comparable validation and test error values at an order of magnitude lower function evaluation counts. However, in some cases its performance was considerably better than that of the rest. Figure 9 shows a case where the progress made is both more efficient as well as superior in convergence. For the given dataset the method progresses quickly towards a good SNN-GPP, causing the error to drop rapidly after overcoming an initial flat plane. In training, the obtained error is 5 orders of magnitude lower than the nearest competitor B-GOLS. This improvement is more pronounced with 9 orders of magnitude for the validation and test dataset errors. This indicates that I-GOLS was superior in training, as well as finding a generalized solution over B-GOLS and GS. This is one of the more extreme examples. However, in the majority of cases the method was capable of returning comparable or better errors than the remaining methods at a order of magnitude lower computational cost, which held for 34 out of 43 of the examined problems.

This shows that the method, though inexact, is capable of reliably traversing the different features of various cost functions relating to a range of datasets and neural network architectures.

4.3 Comparing step size characteristics of the Glass1 and Cancer1 datasets

Other matters of interest are the resolved step sizes at every iteration of the different line search methods. Figures 10(a)-(f) show the average step sizes, training errors as well as validation errors over 10 runs for the Glass1 and Cancer1 datasets with the double hidden layer architecture. Figure 10(a) shows how the step sizes on average vary during an optimization run for the different line search methods on the Glass1 problem. The function value based ARLS produces very small step sizes and therefore does not progress throughout the cost function, as has been previously observed. The GS method is more effective as it produces step sizes with a reasonable magnitude. However, these seem to vary around a mean value. Conversely, I-GOLS and B-GOLS show a distinct variation in step sizes as a function of iterations, first increasing by an order of magnitude into the range of before 1000 iterations are reached, then slowly decreasing to a range of . This behaviour is matched to the shape of the error graphs, where a flat plane is overcome. As this plane is being traversed, the step size is increased. However, once this is overcome, the step size decreases as the optimization methods move along presumably more detailed features in the cost function. This is confirmation that GOLS are able to adjust the step size according to true trends of the discontinuous cost function. It would be difficult to manually pre-empt the requirement of, and determine a learning rate schedule to mimic this behaviour.

From this analysis it is also evident, that I-GOLS is more aggressive than BGOLS, resulting in larger step sizes. However, this does not negatively impact optimization, as the training and validation errors in Figures 10(b) and (c) show I-GOLS to be the overall best performer.

Fig. 10: Step size investigations for (a)-(c) the Glass1 dataset [36] and (d)-(f) the Cancer1 dataset [36] used with the double hidden layer network architecture. Shown are average Step sizes, training errors and validation errors obtained for different line search methods. In this case, the step sizes are given as a function of iterations.

Similarly, in the Cancer1 problem in Figures 10(d)-(f), the step sizes of I-GOLS are 4 orders of magnitude larger than those of B-GOLS, ending at step sizes just under . In this case the gradients decrease towards the end of training, causing I-GOLS to increase the step size to move along the cost function. These large step sizes still relate to good performance on the training loss as depicted in Figure 10(e). In terms of validation error, depicted in Figure 10(f), a decrease in generalization is observed due to over-fitting of a too flexible neural network for the given data characteristics. However, in the context of training on the data and architecture given, it is evident that I-GOLS remains the best performer.

For the two considered datasets the step sizes exhibited very different nonlinearly changing behaviour as a function of the number of iterations. It is practically intractable to determine an equivalent step size rule a priori for the given datasets, as it would require a substantial effort by the user to establish. This underlines the importance of resolving the step size automatically on an iteration basis via effective line search strategies.

The poor performance of ARLS can be attributed to local minima that result from the discontinuous loss function, in particular, positive jump discontinuities [48]. This significantly hampers the growing of the step size from its conservative initial guess. In the case of GS, an exponential bracketing strategy is used, which is less prone to this problem. It is the magnitude of the parameters in the bracketing strategy that aids the progress. The discontinuities cause the bracketing strategy to fall short of encompassing a true minimum. However, if a few probing iterations do not encounter an increasing jump discontinuity, a certain interval may already be defined and subsequently refined to an arbitrary step size therein. Statistically, this makes GS perform similarly to a fixed learning rate, dependent on the magnitude of the parameters chosen for the initial bracketing strategy. However, it is not able to reliably adjust the step size according to features in the cost function.

Conversely, both B-GOLS and I-GOLS is capable of adjusting the step size to the required magnitude within a single optimization iteration, as can be seen for both investigated cases in Figure 10(a) and (d). Both gradient based methods converge to similar magnitude step sizes initially. It is in the latter stages of training for the Cancer1 dataset that the two methods diverge in step size.

4.4 Analysis of iterative performance of line search methods

It is evident, that there are different computational costs associate with the different investigated line search methods. To quantify this, we counted the average number of function evaluations per iteration and summarize them in Table 3. To give context, a single iteration of SGD with a constant learning rate consists of 2 function evaluations, since a gradient computation is required. Function value based line search methods are on average close to half as expensive as the gradient-only based methods. This is due to the added cost of computing the gradient. The inexact methods offer about an order of magnitude computational savings over their exact equivalents. Though comparing function evaluations gives an indication of computational cost of the methods, it does not give a basis for comparison of the iterative efficiency of the respective exact and inexact line search methods. We therefore count the number of ”information calls”, where a function value and gradient evaluation each constitute a single information call. On this basis the respective exact and inexact line search methods perform very similarly, having similar number of information calls. This shows that algorithmically the methods are comparable. Thus the difference between them is simply a function of the information paradigms used. Even with the added computational cost of the gradient-only line searches, the information gain is substantial enough to offset its cost.

Table 3: Average number of function evaluations and information calls per iteration for the various line search search optimizers.

Due to the stochastic nature of the loss function it seems more reasonable to consider inexact rather than exact strategies. An exact line search strategy wastes computational resources resolving the accuracy to a bound that is smaller than the variance in the solution due to stochastic subsampling. Gradient-only based I-GOLS is able to bypass discontinuities by observing consistent gradient trends in the loss function, while requiring less gradient evaluations than B-GOLS. It is therefore a plausible method to efficiently resolve the learning rate in the context of discontinuous loss functions, as a result of stochastic data sub-sampling, to sufficient accuracy.

4.5 Investigation on a Variational Autoencoder example

Fig. 11: (a) Training loss and (b) step sizes errors for training a variational autoencoder on MNIST [24]. The gradient-only inexact line search is compared to three instances of SGD with different constant learning rates that were determined manually by the user.

The resulting training loss and corresponding step sizes for the variational autoencoder (VAE) example are given in Figure 11. The training loss in Figure 11(a) shows the performance of the different training methods. The fixed learning rates 10and 10are too large and too small respectively. This is indicated by the errors ”flat-lining” or decreasing very slowly. However, a fixed step size of 10shows marked improvement in the decrease of the training error. This demonstrates the sensitivity of this problem to the learning rate: It is necessary to determine the value to within an order of magnitude in order to have satisfactory training performance. The convergence of I-GOLS is superior to that of the fixed step size of 10.

Recall that these learning rates were manually assigned by the user. In the case of I-GOLS in Figure 11(b), the step size is immediately and automatically adjusted from the initial guess to a range predominantly between 10and . As training progresses, the step size range slowly decreases with subsequent epochs towards . This occurs entirely without modification to the parameters in I-GOLS or any other form of user intervention, showing that the SNN-GPP definition has made I-GOLS an effective line search strategy. Arguably one can also use iterative hyperparameter optimization methods [6] to resolve a constant step size for this problem. However, these methods require more than 5 guesses to find an appropriate constant learning rate, making I-GOLS superior in terms of computational cost. In addition, should a problem require an adaptive learning rate, as pointed out, I-GOLS is able to resolve this on an iteration by iteration basis.

The ”manual” optimization process was spread over a range of 3 orders of magnitude and showed the sensitivity therein. Gradient-only based I-GOLS is capable of a step size range over 15 orders of magnitude and was immediately able to resolve a competitive learning rate automatically. This practical example confirms the utility of I-GOLS search in eliminating the learning rate hyperparameter from neural network training while using stochastic sub-sampling.

5 Conclusion

Data sub-sampling is readily used to optimize the weights for neural networks, as it speeds up the optimization process for large datasets and allows for memory constrained devices such as GPUs to be more readily employed. The added benefit of adding variance to the function values and gradients, allows optimization strategies to move beyond ”weak” local minima or ”weak” SNN-GPPs in the loss function. This comes at the cost of introducing discontinuities to the loss function that introduces sampling local minima which significantly hampers a function value based minimization line searches. This is evident by the poor performance of both the exact and inexact function value based line searches in our investigations. Both methods are hampered by the multi-modal nature of the objective function used in training.

In turn, sampling discontinuities do not manifest so readily as SNN-GPPs which are solved for using gradient-only line search (GOLS) strategies. This study demonstrated that in the context of discontinuous loss functions due to stochastic data sub-sampling, the learning rate in steepest gradient descent can be reliably resolved using exact or inexact gradient-only line search strategies over 22 differ-ent datasets used in shallow, deep, and variational autoencoder neural network architectures.

Currently, learning rates or learning schedules in ANN training are either selected a priori and manually adjusted until settings are found that are deemed acceptable by the user, or expensively solved at the global hyperparameter level. This study demonstrated that line searches can be reliably performed to resolve step sizes in discontinuous loss functions as seen in Neural Network training, alleviating the need for a priori determined step sizes or step size rules.

The usage of robust line searches may have significant implications for conjugate gradient and Quasi-Newton philosophies to be applied to ANN training.

Acknowledgements This work was supported by the Centre for Asset and Integrity Management (C-AIM), Department of Mechanical and Aeronautical Engineering, University of Pretoria, Pretoria, South Africa.

References

1. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7657 LNCS, 216–223 (2012)

2. Anitescu, M.: Degenerate Nonlinear Programming with a Quadratic Growth Con- dition. SIAM Journal on Optimization 10(4), 1116–1135 (2000). DOI 10.1137/ S1052623499359178. URL http://epubs.siam.org/doi/10.1137/S1052623499359178

3. Arora, J.: Introduction to Optimum Design, Third Edition. Academic Press Inc (2011). URL https://www.sciencedirect.com/science/book/9780128008065http:// www.amazon.com/Introduction-Optimum-Design-Third-Edition/dp/0123813751

4. Balles, L., Hennig, P.: Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. arXiv:1705.07774v2 [cs.LG] pp. 1–17 (2018). URL http://arxiv.org/abs/ 1705.07774

5. Bergstra, J., Bardenet, R., Bengio, Y., K´egl, B.: Algorithms for Hyper-Parameter Opti- mization. In: Advances in Neural Information Processing Systems (NIPS), pp. 2546–2554 (2011). DOI 2012arXiv1206.2944S

6. Bergstra, J., Bengio, Y.: Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13(Feb), 281–305 (2012). DOI 10.1162/153244303322533223. URL http://www.jmlr.org/papers/v13/bergstra12a.html

7. Bertsekas, D.P., Massachusetts Institute of Technology.: Convex optimization algorithms. Athena Scientific (2015). URL http://www.athenasc.com/convexalgorithms.html

8. Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)

9. Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization meth- ods for machine learning. Mathematical Programming 134(1), 127–155 (2012). DOI 10. 1007/s10107-012-0572-5. URL http://link.springer.com/10.1007/s10107-012-0572-5

10. Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., LeCun, Y.: The Loss Surfaces of Multilayer Networks. AISTATS 2015 38, 192–204 (2015). URL http://arxiv.org/abs/ 1412.0233

11. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: ICLR 2014, pp. 1–9 (2014). URL http://arxiv.org/abs/1406.2572

12. Davis, C.: The norm of the Schur product operation. Numerische Mathematik 4(1), 343–344 (1962). DOI 10.1007/BF01386329. URL http://link.springer.com/10.1007/ BF01386329

13. Engelbrecht, A.P.: Fundamentals of computational swarm intelligence. Wiley (2005). URL http://si.cs.up.ac.za/

14. Fisher, R.A.: THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROB- LEMS. Annals of Eugenics 7(2), 179–188 (1936). DOI 10.1111/j.1469-1809.1936.tb02137. x. URL http://doi.wiley.com/10.1111/j.1469-1809.1936.tb02137.x

15. Floudas, C.A., Pardalos, P.M. (eds.): Encyclopedia of Optimization, Second Edition. Springer (2009)

16. Gong, P., Ye, J.: Linear Convergence of Variance-Reduced Stochastic Gradient without Strong Convexity. ArXiv e-prints (2014). URL http://arxiv.org/abs/1406.1102

17. Goodfellow, I.J., Vinyals, O., Saxe, A.M.: Qualitatively Characterizing Neural Network Optimization Problems. ICLR pp. 1–11 (2015). URL http://arxiv.org/abs/1412.6544

18. Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W.M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., Kavukcuoglu, K.: Population Based Training of Neural Networks. arxiv pp. 1–13 (2017). URL http: //arxiv.org/abs/1711.09846

19. Johnson, B., Tateishi, R., Xie, Z.: Using geographically weighted variables for image clas- sification. Remote Sensing Letters 3(6), 491–499 (2012). DOI 10.1080/01431161.2011. 629637. URL http://www.tandfonline.com/doi/abs/10.1080/01431161.2011.629637

20. Johnson, B.A., Tateishi, R., Hoan, N.T.: A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing 34(20), 6969–6982 (2013). DOI 10.1080/01431161.2013.810825. URL http://www.tandfonline.com/doi/abs/10.1080/01431161.2013.810825

21. Karimi, H., Nutini, J., Schmidt, M.: Linear Convergence of Gradient and Proximal- Gradient Methods Under the Polyak-�Lojasiewicz Condition. In: ECML PKDD: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, Cham (2016)

22. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: ICLR 2015, pp. 1–15 (2015). DOI http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503. URL http://arxiv.org/abs/1412.6980

23. Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. arXiv Ml, 1–14 (2013). DOI 10.1051/0004-6361/201527329. URL http://arxiv.org/abs/1312.6114

24. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998). DOI 10.1109/5.726791

25. Li, M., Zhang, T., Chen, Y., Smola, A.J.: Efficient Mini-batch Training for Stochastic Optimization. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2014). DOI 10.1145/2623330.2623612. URL http: //www.cs.cmu.edu/{~}muli/file/minibatch{_}sgd.pdf

26. Liu, J., Wright, S.J., R´e, C., Bittorf, V., Sridhar, S.: An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. Journal of Machine Learning Research 16, 285–322 (2015). URL http://www.jmlr.org/papers/volume16/liu15a/liu15a.pdf

27. Lucas, D.D., Klein, R., Tannahill, J., Ivanova, D., Brandon, S., Domyancic, D., Zhang, Y.: Failure analysis of parameter-induced simulation crashes in climate models. Geoscientific Model Development 6(4), 1157–1171 (2013). DOI 10.5194/gmd-6-1157-2013. URL http: //www.geosci-model-dev.net/6/1157/2013/

28. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research 46-47(1), 157–178 (1993). DOI

10.1007/BF02096261. URL http://link.springer.com/10.1007/BF02096261

29. Mahsereci, M., Hennig, P.: Probabilistic Line Searches for Stochastic Optimization. Jour- nal of Machine Learning Research 18(119), 1–59 (2017). DOI 10.1016/j.physa.2015.02.029. URL http://jmlr.org/papers/v18/

30. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V.: Quantitative Struc- tureActivity Relationship Models for Ready Biodegradability of Chemicals. Journal of Chemical Information and Modeling 53(4), 867–878 (2013). DOI 10.1021/ci4000213. URL http://pubs.acs.org/doi/10.1021/ci4000213

31. Marwala, T.: Bayesian training of neural networks using genetic programming. Pattern Recognition Letters 28(12), 1452–1458 (2007). DOI 10.1016/J.PATREC.2007.03.004. URL https://www.sciencedirect.com/science/article/pii/S0167865507000967

32. Montana, D.J., Davis, L.: Training feedforward neural networks using genetic algorithms (1989). URL https://dl.acm.org/citation.cfm?id=1623876

33. Nash, W.J., Sellers, T.L., Talbot, S.R., Cawthorn, A.J., Ford, W.B.: The Population Bi- ology of Abalone ( Haliotis species) in Tasmania. I. Blacklip Abalone ( H. rubra ) from the North Coast and Islands of Bass Strait. Tech. rep., Sea Fisheries Division (1994)

34. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program., Ser. B 120, 221–259 (2009). DOI 10.1007/s10107-007-0149-x. URL http://ium.mccme. ru/postscript/s12/GS-NesterovPrimal-dual.pdf

35. Paschke, F., Bayer, C., Bator, M., M¨onks, U., Dicks, A., Enge-Rosenblatt, O., Lohweg, V.: Sensorlose Zustands¨uberwachung an Synchronmotoren (2013)

36. Prechelt, L.: PROBEN1 - a set of neural network benchmark problems and benchmarking rules (Technical Report 21-94). Tech. rep., Universit¨at Karlsruhe (1994). URL http: //citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.5355

37. Radiuk, P.M.: Impact of Training Set Batch Size on the Performance of Convolutional Neural Networks for Diverse Datasets. Information Technology and Management Science 20(1), 20–24 (2017). DOI 10.1515/itms-2017-0003. URL http://www.degruyter.com/ view/j/itms.2017.20.issue-1/itms-2017-0003/itms-2017-0003.xml

38. Robbins, H., Monro, S.: A Stochastic Approximation Method. The Annals of Mathematical Statistics 22(3), 400–407 (1951). DOI 10.1214/aoms/1177729586. URL http: //projecteuclid.org/euclid.aoms/1177729586

39. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv:1609.04747v2 [cs.LG] pp. 1–14 (2016). DOI 10.1111/j.0006-341X.1999.00591.x. URL http://arxiv. org/abs/1609.04747

40. Saxe, A.M., McClelland, J.L., Ganguli, S.: Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR abs/1312.6, 1–22 (2013). URL http: //arxiv.org/abs/1312.6120

41. Shor, N.Z.: In: Minimization Methods for Non-Differentiable Functions, chap. The Sub- gra, pp. 22–47. Springer, Berlin Heidelberg (1985). URL http://www.springerlink.com/ index/10.1007/978-3-642-82118-9{_}3

42. Shor, N.Z.: Minimization Methods for Non-Differentiable Functions. Springer, Berlin Hei- delberg (1985)

43. Snoek, J., Larochelle, H., Adams, R.: Practical Bayesian Optimization of Machine Learning Algorithms. In: Nips, pp. 1–9 (2012). DOI 2012arXiv1206.2944S. URL https://papers.nips.cc/paper/ 4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

44. Snyman, J.A., Wilke, D.N.: Practical Mathematical Optimization, Springer Optimization and Its Applications, vol. 133. Springer International Publishing, Cham (2018). DOI 10. 1007/978-3-319-77586-9. URL http://link.springer.com/10.1007/978-3-319-77586-9

45. Tong, F., Liu, X.: Samples Selection for Artificial Neural Network Training in Pre- liminary Structural Design. Tsinghua Science & Technology 10(2), 233–239 (2005). DOI 10.1016/S1007-0214(05)70060-2. URL https://www.sciencedirect.com/science/ article/pii/S1007021405700602

46. Vurka¸c, M.: Clave-direction analysis: A new arena for educational and creative appli- cations of music technology. Journal of Music, Technology and Education 4(1), 27– 46 (2011). URL http://openurl.ingenta.com/content/xref?genre=article{&}issn= 1752-7066{&}volume=4{&}issue=1{&}spage=27

47. Werbos, P.J.: The Roots of Backpropagation: From Ordered Derivatives to Neural Net- works and Political Forecasting. Wiley-Interscience, New York, NY, USA (1994)

48. Wilke, D.N., Kok, S., Snyman, J.A., Groenwold, A.A.: Gradient-only approaches to avoid spurious local minima in unconstrained optimization. Optimization and Engineering 14(2), 275–304 (2013). DOI 10.1007/s11081-011-9178-7. URL http://link.springer. com/10.1007/s11081-011-9178-7

49. Yeh, I.C., Lien, C.h.: The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36(2), 2473–2480 (2009). DOI 10.1016/J.ESWA.2007.12.020. URL https: //www.sciencedirect.com/science/article/pii/S0957417407006719

50. Zhang, C., ¨Oztireli, C., Mandt, S., Salvi, G.: Active Mini-Batch Sampling using Repulsive Point Processes. ArXiv e-prints (2018). URL http://arxiv.org/abs/1804.02772

51. Zhang, H., Yin, W.: Gradient methods for convex minimization: better rates under weaker conditions. ArXiv e-prints (2013). URL http://arxiv.org/abs/1303.4645

52. Zi¸eba, M., Tomczak, S.K., Tomczak, J.M.: Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications 58, 93–101 (2016). DOI 10.1016/J.ESWA.2016.04.001. URL https://www.sciencedirect. com/science/article/pii/S0957417416301592

53. Zuo, X., Chintala, S.: Basic vae example. https://github.com/pytorch/examples/tree/ master/vae (2018). Accessed: 2018-06-07

A Artiﬁcial Neural Networks

The focus of this study is to compare different line search strategies in a stochastic environment and compare their relative performances. The choice of neural network architecture is therefore incidental, as it simply allows for the platform for comparison. Therefore fully connected feed forward single and double hidden-layer neural networks with the sigmoid activation functions were chosen. This architecture is expressed mathematically by Equations (14) and (15) below.

Suppose a given datasets has an input domain, X with B observations and D dimensions (features). The respective output domain is given by Y with corresponding observations B and output dimensions K, which in our case are representative of the class on one-hot basis. Then for every observation b and every output dimension k, a prediction of the output data ˆY can be constructed from the original data input domain X, given by

The number of nodes in the respective hidden layers is given by 2]. The nodal activation function is denoted by h and W represents the nodal weights between different layers, where 3] denotes the set of weights connecting sequential layers in the network between the input layer and the output layer in a forward direction. Thus the single hidden layer network has two sets of weights and the double hidden layer network has three respectively [8].

The nodal weights W are optimized to a configuration which best captures the relationship between the input and output of the data. The cost-function used is the mean least squared error, determined over every b in batch size P and every class according to the Proben1 dataset guidelines [36] as:

where ˆY(W) is the current output of the current network configuration, manipulated as a function of the weights, and Y is the target output of the corresponding training dataset samples.

B Exact line search: Bisection Gradient-only Line Search (B-GOLS)

The directional derivative values used in this method, denoted each calculated by taking the dot products between the gradient value, search direction, u, at the respective values for at the different points.

C Inexact line search: Inexact Gradient-only Line Search (I-GOLS)

Parameters used for this method are: