Using a thousand optimization tasks to learn hyperparameter search strategies

2020·Arxiv

Abstract

Abstract

We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional neural networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets. As an example application of such a dataset we explore meta-learning an ordered list of hyperparameters to try sequentially. By learning this hyperparameter list from data generated using TaskSet we achieve large speedups in sample efficiency over random search. Next we use the diversity of the TaskSet and our method for learning hyperparameter lists to empirically explore the generalization of these lists to new optimization tasks in a variety of settings including ImageNet classification with Resnet50 and LM1B language modeling with transformers. As part of this work we have opensourced code for all tasks, as well as 29 million training curves for these problems and the corresponding hyperparameters.1

1. Introduction

As machine learning moves to new domains, collecting diverse, rich, and application-relevant datasets is critical for its continued success. Historically, research on learning optimization algorithms have only leveraged single tasks (Andrychowicz et al., 2016; Metz et al., 2019a), or parametric synthetic tasks (Wichrowska et al., 2017), due to the difficulty of obtaining large sets of tasks.

1.1. TaskSet: A set of tasks

We present a set of tasks significantly larger than any optimizer dataset previously studied. We aim to better enable standardized research on optimizers, be that analysis of existing optimizers, or development of new learned learning algorithms. We call this suite of tasks TaskSet.

Much in the same way that learned features in computer vision outpaced hand designed features (Krizhevsky et al., 2012; LeCun et al., 2015), we believe that data driven approaches to discover optimization algorithms will replace their hand designed counterparts resulting in increased performance and usability. To this end, standardizing a large suite of optimization tasks is an important first step towards more rigorous learned optimizer research.

In this setting, a single “example” is an entire training procedure for a task defined by data, loss function, and architecture. Thus, TaskSet consists of over a thousand optimization tasks, largely focused on deep learning (neural networks). They include image classification using fully connected and convolutional models, generative models with variational autoencoders (Kingma & Welling, 2013) or flows (Dinh et al., 2016; Papamakarios et al., 2017), natural language processing tasks including both language modeling and clas-sification, as well as synthetic tasks such as quadratics, and optimization test functions. The problems themselves are diverse in size, spanning 7 orders of magnitude in parameter count, but remain reasonably fast to compute as almost all tasks can be trained 10k iterations on a CPU in under one hour. To demonstrate the breadth of this dataset we show an embedding of all the tasks in Figure 1.

1.2. Amortizing hyperparameter search

Machine learning methods are growing ever more complex, and their computational demands are increasing at a frightening pace (Amodei & Hernandez, 2018). Unfortunately, most modern machine learning models also require extensive hyperparameter tuning. Often, hyperparameter search is many times more costly than the final algorithm, which ultimately has large economic and environmental costs (Strubell et al., 2019).

Figure 1. A 2D TSNE embedding of all 1162 tasks. This embedding is produced from a 1,000 dimensional feature vector consisting of task loss evaluated with many different hyperparameter configurations. We find similar tasks – e.g. masked auto regressive flow models, and character / word RNN models – cluster, suggesting similarity in the optimizers that perform well. See §4 for more details.

The most common approach to hyperparameter tuning involves some form of quasi-random search over a pre-specified grid of hyperparameters. We build on past work (Wistuba et al., 2015b; Pfisterer et al., 2018) and explore a hyperparameter search strategy consisting of a simple ordered list of hyperparameters to try. The idea is that the first few elements in this list will cover most of the variation in hyperparameters found in typical machine learning workloads.

We choose the elements in this list by leveraging the diversity of tasks in TaskSet, by meta-learning a hyperparameter list that performs the best on the set of tasks in TaskSet. We then test this list of hyperparameters on new, larger machine learning tasks.

Although learning the list of hyperparameters is costly (in total we train 29 million models consisting of over 4000 distinct hyper parameter configurations), our final published list is now available as a good starting guess for new tasks.

Furthermore, we believe the raw training curves generated by this search will be useful for future hyperparameter analysis and meta-learning research, and we release it as part of this work2. We additionally release code in Tensor-flow(Abadi et al., 2016), Jax(Bradbury et al., 2018), and PyTorch(Paszke et al., 2019) for a reference optimizer which uses our learned hyperparameter list, and can be easily applied to any model.3 We believe this hyperparameter search strategy will enable the machine learning community to train better performing models, in less time, and with reduced compute and energy cost.

1.3. Paper structure

In §2, we define our open source optimizer dataset, Taskset. In §3, we choose several common optimizers, and we detail our algorithm for finding a performant search strategy over those optimizer’s hyperparameters for Taskset. §4 provides qualitative summary statistics for the tasks in Taskset for a typical optimizer. §5 constitutes a detailed study of the performance of our learned hyperparameter list benchmarked against several baseline search strategies, both in terms of training time and optimizer generalization performance. Finally, §6, considers transfer learning experiments of our learned list to significantly larger architectures.

2. TaskSet: A set of tasks

How should one choose what problems to include in a set of optimization tasks? In our case, we seek to include common problems in machine learning research. As such, we strive to include optimization tasks that have been influential over the course of research in the last several decades. This is necessarily subjective, but by distilling these beliefs into a clear set of tasks we are explicit about this subjectivity. Designing this dataset requires striking a balance between including realistic large-scale workloads and ensuring that tasks are fast to train so that using it for meta-learning is tractable. We chose to fill our dataset with a mixture of mostly neural network based tasks. Our chosen tasks have between ten thousand and one million parameters (much smaller than the billions commonly used today), as a result most problems can train in under an hour on a cloud CPU with 5 cores. We additionally focus on increased “task diversity” by including many different kinds of training algorithms, architectures, and datasets inspired by past work in reinforcement learning which has demonstrated large numbers of problems and increased diversity around some domain of interest is use- ful for both training and generalization (Heess et al., 2017; Tobin et al., 2017; Cobbe et al., 2018; OpenAI et al., 2019). Once again though, a balance must be struck as in the limit of too much diversity no learning can occur due to the no free lunch theorem (Wolpert & Macready, 1997).

Our dataset, TaskSet, is made up of 1162 tasks in total. We define a task as the combination of a loss function, a dataset, and initialization.

Specifically we define a task as a set of 4 functions:

• Initialization () parameter initial values

• Data generator data split (e.g. train / valid / test) batch of data

• Forward pass (batch of data, params)

• Compute gradients (input data, params) gradients ( dlossdparams)

A task has no tunable hyperparameters and, coupled with an optimizer, provides all the necessary information to train using first order optimization. This makes experimentation easier, as each task definition also specifies reasonable defaults for hyperparameters such as batch size (Shallue et al., 2018; McCandlish et al., 2018) or initialization (Schoenholz et al., 2016; Yang & Schoenholz, 2017; Xiao et al., 2018; Li & Nguyen, 2019; Pretorius et al., 2018; Hayou et al., 2018; Karakida et al., 2018; Blumenfeld et al., 2019; Hayou et al., 2019) that no longer need to be tuned.

Hand-designing architectures, datasets, and losses for thousands of neural-network-based tasks is a challenge. We augment a set of “fixed” tasks which have been designed by hand, with “sampled” tasks that are randomly generated task instances.

2.1. Sampled families of tasks

Sampled tasks are created by sampling neural network architectures, activation functions, datasets, and other properties. We organize these sampled tasks into similar families of tasks: See Appendix G for more details and example con-figurations.

• mlp: Multi layer perceptrons trained on image data.

• mlp_ae: Multi layer perceptron based auto encoder trained on image data (Hinton & Salakhutdinov, 2006).

• mlp_vae Multi layer perceptron based variational auto encoder trained on image data (Kingma & Welling, 2013).

• conv_pooling ConvNet with spatial pooling before the classification layer.

• conv_fc: ConvNet with fully connected classification network instead of pooling.

• nvp: Non volume preserving flows trained on image data (Dinh et al., 2016).

• maf: Masked autoregressive flows trained on image data (Papamakarios et al., 2017).

• char_rnn_language_model Language modeling with an RNN on characters (Graves, 2013).

• word_rnn_language_model Language modeling with an RNN on words / subwords.

• rnn_text_classification Text classification using RNN models.

• quadratic Problems based on quadratics possibly transformed by a nonlinearity.

• losg_tasks Tasks generated from the synthetic optimization problems documented in “Learned Optimizers that Scale and Generalize” (Wichrowska et al., 2017), abbreviated “losg”.

Defining a sampling distribution that generates tasks that are always valid, and that run within a time constraint, is difficult. Instead, we define a broad distribution and make use of rejection sampling to remove tasks that are either too slow, contain errors, or that we are unable to optimize at all. By starting with a distribution that is too broad, and pruning it, we hope to achieve better coverage of tasks.

2.2. Hand designed tasks

In addition to the sampled tasks, we also include 107 hand designed tasks. These consist of more common tasks that both improve the coverage beyond the sampled tasks, and provide for better interpretability through a closer match to existing tasks in the literature. These tasks span image classification, text classification, language modeling, and generative modeling, as well as some synthetic tasks such as associative retrieval (Ba et al., 2016). We leave the description of each one of these tasks to Appendix G.3.

3. Amortized hyperparameter search

As a first demonstration leveraging the large-scale task dataset for meta-learning research, we consider learning hyperparameter lists. This idea of learning lists of hyper parameters has been explored in (Wistuba et al., 2015b; Pfis- terer et al., 2018). We define an optimizer as the pairing of an optimization algorithm and all its corresponding hyperparameters (e.g. learning rate). While sometimes practitioners use a single optimizer – e.g. Adam (Kingma & Ba, 2014) with default hyperparameters – most practitioners will often run multiple optimizers and use a validation set to select the best performer.

3.1. Optimizer families

We define different parameterizations of hand designed op- timizers as an optimizer family. The optimizer families we consider consist of:

• Adam1p: One hyperparameter, the fixed learning rate

• Adam4p: Four Adam hyperparameters,

• Adam6p: Adam4p hyperparameters, and two additional hyperparameters controlling linear and exponential learning rate decays

• Adam8p: The hyperparameters in Adam6p plus two additional hyperparameters for regularization terms

• NAdamW: A 10 hyperparameter search space based on NAdam (Dozat, 2016) with cosine learning rate decay, and weight decay.

For the full update equations see Appendix C.1 for Adam and C.2 for NadamW. We chose Adam based on its use in existing work, and NAdam based on performance shown in (Choi et al., 2019).

3.2. Learned hyperparameter lists

Traditionally researchers tune hyperparameters on a per model basis. While this often results in performance gains; it comes at the cost of immense compute, and researchers are almost never able to expend enough compute to saturate model performance (Shallue et al., 2018). As an alternative to per-problem tuning, we proposes instead tuning the search strategy itself on a dataset of tasks and transferring the knowledge gained to new tasks of interest. This idea is already implicitly done by humans – e.g. we don’t start a hyperparameter search with a learning rate of values that the community has found useful.

This dataset-based tuning has a number of desirable prop- erties. First, the resulting search strategies are much more efficient, resulting in large speedups in sample efficiency on unseen tasks over a random search baseline. Second, we are less restricted by the number of optimizer parameters we search over or by needing to define reasonable search spaces. For example, if there are redundant regions of search space, our learned optimizer will be less likely to sample them repeatedly, unlike random search. If there is a region of hyperparameter space that performs poorly on all problems, the learned search strategy will avoid it.

In this work we parameterize the learned search strategy as an ordered list of optimizers to try (i.e. a list of hyper-parameter configurations). Given a fixed number of task evaluations we would like to achieve the best possible performance on all tasks in the training set of tasks. For a length k list of optimizers we define our loss as:

where are the optimizer hyperparameters for element i in the list, and f is an appropriately normalized loss computed after training task

We seek to find an optimal list of optimizers as:

3.3. Scoring an optimizer by averaging over tasks

To score a task, we initialize the parameters of the task and run 10,000 iterations of an optimizer. We monitor loss on each data split (train, validation, test) every 200 steps using an average over 50 mini-batches per evaluation. For all data presented in this paper we also compute averages over 5 random task parameter initializations.

A side effect of the diverse task dataset is that losses span multiple orders of magnitude, making direct aggregation of performance problematic. To remedy this we normalize the loss values for all tasks linearly between 0 and 1 where 1 is validation loss at initialization and zero is the lowest validation loss achieved by any tested optimizer. Loss values greater than the loss at initialization are clipped to 1.

To collapse an entire normalized training curve into a scalar cost, we compute the mean normalized loss over the 10,000 iterations. We find empirically that this choice is similar to taking the minimum (Appendix A.4). We leave exploring alternative methods such as performance profiles (Dolan & Moré, 2002) and Nash averaging (Balduzzi et al., 2018) for future work.

3.4. Greedy learning from random search

Optimizing Eq. 2 is combinatorially expensive. To tractably solve this optimization problem, we introduce two approximations. First, we shift the unconstrained search over the full space of optimizers to search over a finite set of optimizers, . This finite set can be computed ahead of time and decouples the expensive procedure of training each task with an optimizer from training the learned search space.

Figure 2. a. A histogram of parameter counts for each problems in the task suite. Our task suite spans more than 7 orders of magnitude in model size. b. Percentage of optimizers (y-axis) capable of reaching a given loss value (color) for tasks (x-axis). We find there exists around 100 “easy” tasks on which more than half of the optimizers perform well, and a large number of “difficult” tasks for which almost no optimizers perform well. c. A histogram of training times. Almost all tasks can be trained in under an hour.

Separating data and training in this way has been done for both hyperparameter search (Eggensperger et al., 2015), and neural architecture search (Klein & Hutter, 2019; Ying et al., 2019). In total we trained 1,000 optimizer configu-rations for each of Adam1p, Adam4p, Adam6p, Adam8p, and NAdamW on all 1,162 tasks with 5 random seeds per pair. Second, we use a greedy heuristic to approximate the combinatorial search over sets of k optimizers. For a single optimizer trial, k = 1, we select the best performing optimizer on average across all training tasks. We then continue to select optimizer parameters such that the minimum of all optimizer-parameters per task, aggregated over all tasks is minimized. This shifts the complexity from exponential in k to linear. Finding a length k set of optimizers can thus be efficiently computed as follows:

We note that the first argument of the outer min, b, can be computed once per set of hyperparameters as it does not depend on . Finally, as our tasks are stochastic, we order optimizers based on validation loss and report test loss (Van Hasselt et al., 2016). 4

This training strategy requires an original search space from which to collect data and build . The search space we use is described in Appendix D.1. While large, we find that the optimal parameters for each task end up covering almost the entire space.

4. Experiments: TaskSet

In this section we demonstrate various properties of the task suite. For a qualitative view, we first construct a feature space consisting of performance measurements for each task+optimizer pair (See §3.3). This forms a dense matrix of size number of tasks by number of optimizers. We then perform T-SNE (Maaten & Hinton, 2008; Van Der Maaten, 2014) to reduce the dimensionality to two and plot the results coloring by task family (Figure 1). Clusters in this space correspond to tasks that work well with similar optimizers. We find diversity of tasks with clusters occurring around similar families of tasks.

Next, we look at aggregate statistics. In Figure 2a we show histograms of compute times for all problems and find almost all problems train under an hour (see Appendix B for per task family histograms). In Figure 2c we plot a histogram of the number of parameters per tasks. Finally, in Figure 2b we show a distribution of task difficulty by plotting the fraction of optimizer configurations that achieve a certain loss value. We find that for some tasks as many as 50% of optimizers perform well while for others < 1% achieve a loss close to the smallest observed loss.

5. Experiments: Training and generalization of learned hyperparameter lists

With our dataset of tasks and data collected, we turn our attention to exploring training of the hyperparameter lists, and generalization beyond the suite of tasks in TaskSet. Our main tool to show performance are figures that sweep the number of optimizers configurations on the x-axis, and show the best performance achieved for each number of optimizers tried, averaged over some set of tasks (Eq. 1).

5.1. Learned hyperparameter lists are more efficient than random search

To demonstrate the impact of learning a search space, we take the 1,162 tasks split them into even train and test tasks.

Figure 3. a. By learning a search space we achieve large speedups over random search. On the y-axis we show J, or the best aggregated and normalized performance achieved given some number of optimizer trials (x-axis). This is computed on heldout tasks not used to train the hyperparameter list. In solid we show median performance with 25-75 percentile shown with error bars over 50 resamplings of the train-test split of tasks, as well as random samplings. In black we show a learned search space computed from the Adam8p family of optimizes. In color we show various random search baselines. See §5.1 for description of these.

We then learn a search strategy using optimizers from the Adam8p family following Eq. 5 on the train tasks. As baselines, we use random search with different search spaces, including just learning rate (Rand: Adam1p), the default Adam hyper parameters (Rand: Adam4p), as well as the Adam 8 dimensional search space (Rand: Adam8p). Search spaces are specified in Appendix D.1.

The performance of random search critically depends on the boundaries of the original search space. Without prior knowledge about the problems, however, picking a good search space is difficult. To explore this we additionally choose search spaces after collecting and looking at the data. We then use this search space to simulate random search within the constraints via rejection sampling. To find these search spaces we find the best hyper parameters for each task and construct new hyperparameter ranges with min and max values determined by the smallest and largest values of each hyperparameter which were the best hyperparameter for some task. This removes regions of the search space not used by any task. We also tested bounds based on the 5th and 95th percentile of best performing hyperparameters computed over all tasks. In the case of min and max, we find the optimal hyperparameters cover nearly all of the existing space, whereas the percentile based search spaces reduces the volume of the search hypercube by more than 90% leaving us with only 100 hyperparameter con-figurations. In Figure 5, we find, in all cases, learning the hyperparameter list is much more efficient.

5.2. More tasks lead to better generalization

We next look at the effects of the number of training tasks on generalization. We take subsets of tasks of different

Figure 4. Using more tasks to train the search space results in improved performance on heldout tasks. Left: The number of optimizers tried vs performance on heldout tasks. In color, we show different numbers of tasks used to learn the search spaces. We show median performance with error bars denoting 25 and 75 percentile. Right: Performance at a fixed number of optimizers tried vs number of tasks used for meta-training. We find performance continues to improve as we meta-train on more tasks. These plots are slices out of the left pane, with colors matching the vertical dashed lines.

size, and train hyperparameter lists using Eq.5. We compute test performance on the remainder of the tasks and plot loss averaged over different splits in Figure 4. We find that a large number of tasks (more than 100) are required to achieve near-optimal test performance. This is surprising to us given how simple our learned search strategy is (simply a list of hyperparameters), but not wholly so given past work studying generalization in RL (Cobbe et al., 2018).

5.3. Generalization to different types of problem

For learned algorithms to be generally useful, some amount of generalization to unseen task families is required. To test this, we split our data into disjoint task types. We perform two splits: testing on RNN tasks and training on all others, and testing on autoencoder tasks and training on all others. As a best case baseline we additionally train search spaces on the test task families directly. We find an order of magnitude better sample efficiency than random search for both cases and find our learned search space is close in performance to search spaces trained on just the testing tasks (Fig. 5).

5.4. Generalization to different sized problems

Training learned algorithms on large models is often infeasible for computational reasons. As such, one form of generalization needed when building learned algorithms is the ability to transfer to different sized models. As shown in Figure 2 the tasks in this suite contain a wide range of parameter counts, and can thus be used to test this kind of generalization. We split the tasks into 8 groups – one group per order of magnitude in parameter count, and train

Figure 5. We show aggregate performance (J) as a function of number of optimizers tried when training the hyperparameter list on a different distributions of tasks than those we test on. We show testing on RNNs (left) and auto encoders (right), and train on 700 tasks sampled from the remainder set of tasks. We find that these learned search spaces perform much better than random search in both the learning rate search space and in the original Adam8p search space. We additionally plot the best case performance – the case where we train and test on the same problem type. We show median and 25-75 percentile averaged over 50 different samplings.

hyperparameter lists on one range and test on the rest. In Figure 6 we plot the fraction of the training loss achieved by the test loss on the target parameter range. We find peak performance around the model sizes used for training, and smooth falloff as the testing tasks become more dissimilar as measured by parameter count. We note that our problems are not evenly distributed across these groups thus each group will contain a different percentage of the underlying tasks. While this potentially confounds these results, we believe a similar bias occurs in realistic workloads as well.

Figure 6. We show learned search space generalization, measured as a ratio of the loss achieved in training and testing, versus the number of task parameters used during search space training. Generalization falls off as one moves further away from the training regime. In black we show that a uniform mixture of the 7 parameter buckets does not fall off.

6. Experiments: Realistic problems

In §5.3 and §5.4 we explored generalization of learned hyperparameter lists to held out tasks within the TaskSet dataset. While useful for analysis, these tasks are still far from the workloads commonly employed to solve real problems. In this section, we explore the performance of our learned search space on a number of state of the art models. These models drastically differ from the training set of tasks in parameter count and compute cost. For all experiments in this section we take the optimizer ordering using the NAdamW optimizer family on all TaskSet tasks then apply the resulting search space to the target problem. The final list of hyperparameters can be found in Appendix F. We show results for ResNet50 on ImageNet, and Transformers on LM1B. Additional results with reinforcement learning using PPO are in Appendix A.1.

6.1. ImageNet Resnet50

We take the TPU implementation with default settings from the official Tensorflow models repository (Tensorflow, 2019), and swap out different optimizers. We test the default optimizer, SGD + momentum with a learning rate warmup and staircase decay, learning rate tuned Adam (in half orders of magnitudes between 1e-6, 3e-2, as well the learned list of hyperparameters. For momentum and learning rate tuned Adam we leave the default weight decay value. For our learned search space we remove weight decay as this is handled by the optimizer.

We show accuracy computed over the course of training as well as best performance for a given hyperparameter budget in Figure 7. We find that the learned search space vastly outperforms learning rate tuned Adam. After 67 model evaluations we find a optimizer that outperforms the default SGD+momentum staircase learning rate schedule commonly used to train these models. By using the list as opposed to searching randomly in the original search space we find beter hyperparameters faster. Using this list of hyperparameters does not require any problem-specific knowledge. Despite this, we are able to slightly improve upon the default methodology used when training a ResNet50.

6.2. LM1B Transformer

We take the transformer (Vaswani et al., 2017) example implemented in Jax (Bradbury et al., 2018) with Flax (Flax Developers, 2020). We train using a 2x2 TPU V2 config-uration for 100k iterations. Once again we take all other hyperparameters as is and simply swap optimizer implementation. We additionally split a second validation set from the training set to perform the max over hyperparameters over. We present 2 baselines: first, tuning learning rate only, and otherwise using the default transformer training hyper-parameters; and second a fixed learning rate Adam baseline. Results in Figure 10. We find the learned hyperparameter list dramatically outperforms the default optimizer setting and the fixed learning rate baseline. We suspect the fact that the fixed learning rate performs better than the built in learning rate schedule is due to limited training time and model hyperparameters. Nevertheless, we emphasize that

Figure 7. We find our learned optimizer outperforms learning rate tuned Adam by a large margin and matches (with a slight improvement) the default, staircase schedule which is the standard when training ResNet50. Left: Learning curves for the best optimizer from each class. Right: Number of optimizers tried vs best top 1 ImageNet accuracy achieved. For NAdamW, we first train models with a validation set created by splitting the training set. We compute maxes over this set, and show the performance when retraining on the full training set on the official ImageNet validation set. For AdamLR, we simply compute a max over the official ImageNet validation set.

our method does not require any knowledge of the underlying problem to achieve faster results. See Appendix A.2 for this same transformer with a budget of 20k iterations.

7. Related Work

The idea of sets of tasks has been explored throughout machine learning. The majority of these suites are for use in evaluation where as our suite is targeted for meta-learning. The closest family of optimization tasks for evaluation to those presented here is DeepObs (Schneider et al., 2019) which includes 20 neural network tasks. Our task suite focuses on smaller problems and contains 50x more tasks. Outside of evaluation, task suites in reinforcement learning such as Obstacle Tower (Juliani et al., 2019), ProcGen (Cobbe et al., 2019), CoinRun (Cobbe et al., 2018), and Sonic (Nichol et al., 2018) focus on training algorithms that work across a variety of settings.

The creation of TaskSet was motivated by the goal of learning learning algorithms, or meta-learning (Schmidhu- ber, 1987; 1995; Hochreiter et al., 2001), and in particular learned optimizers (Bengio et al., 1990; Andrychowicz et al., 2016; Bello et al., 2017; Wichrowska et al., 2017; Li & Ma- lik, 2017; Lv et al., 2017; Metz et al., 2019a;b). In this work we do not use this task suite to train learned optimizers, but instead focus on learning a hyperparameter search strategy. Tuning hyperparameters by leveraging multiple tasks has been explored within the contexts of Bayesian optimization (Swersky et al., 2013; Perrone & Shen, 2019; Perrone et al., 2018) as well as meta-learning (Reif et al., 2012; Gomes et al., 2012; Feurer et al., 2014; Wistuba et al., 2015b;a; Chen et al., 2017; Pfisterer et al., 2018).

Figure 8. Our learned optimizer list outperforms learning rate tuned Adam with both a constant learning rate, and a fixed learning rate schedule on a 53M parameter Transformer trained on LM1B. Left: Training curves for the best hyperparameter for each optimizer class. Right: Number of optimizers tried vs the test loss obtained given the best validation performance.

See Appendix E.1 for a full discussion of sets of tasks in machine learning, Appendix E.2 for more info on optimization in machine learning, and Appendix E.3 for a discussion on existing hyper parameter search methods.

8. Discussion

Learning optimization algorithms represents a promising direction to accelerate machine learning research. For the resulting algorithms to become useful tools, however, we must further understand the relationships between training tasks, meta-optimization, and both iid and out of distribution generalization. This work takes steps towards this goal by introducing a set of tasks which can be used to train and study optimization algorithms. We then use this task set and learned hyperparameter lists to answer questions related to optimization and generalization of learned learning algorithms. We find a large degree of generalization even to out of distribution tasks but as the tasks get more varied, transfer performance suffers. At this point, the training of learned learning algorithms is computationally expensive despite the extreme simplicity of our learned-learning algorithm parameterization (a list of hyperparameteters). We hope to explore alternative parameterizations which will increase performance such as by leveraging previous evaluations and partial model trainings (Swersky et al., 2014; Li et al., 2016).

We are releasing the optimal hyperparameter list we have found as a drop-in replacement optimizer in a variety of deep learning frameworks (Tensorflow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), and JAX (Bradbury et al., 2018)) in the hopes that the research community finds them useful. We believe this represents a new set of reasonable optimizer defaults for new problems. We additionally hope TaskSet encourages more standardized research on general purpose optimizers.

ACKNOWLEDGMENTS

We would like to thank Alex Alemi, George Dahl, Justin Gilmer, Jaehoon Lee, Anselm Levskaya, Chris Madison, Alec Radford, Christopher Shallue for input on this work. Finally we would like to thank the entire Brain team.

References

URL https://s3.amazonaws.com/ amazon-reviews-pds/readme.html.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Amodei, D. and Hernandez, D. Ai and compute. Heruntergeladen von https://blog. openai. com/aiand-compute, 2018.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989, 2016.

Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z., and Ionescu, C. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, pp. 4331–4339, 2016.

Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re- evaluating evaluation. In Advances in Neural Information Processing Systems, pp. 3268–3279, 2018.

Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., Perolat, J., Jaderberg, M., and Graepel, T. Open-ended learning in symmetric zero-sum games. arXiv preprint arXiv:1901.08106, 2019.

Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wain- wright, M., Küttler, H., Lefrancq, A., Green, S., Valdés, V., Sadik, A., et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bello, I., Zoph, B., Vasudevan, V., and Le, Q. Neural optimizer search with reinforcement learning. 2017. URL https://arxiv.org/pdf/1709.07417.pdf.

Bengio, Y., Bengio, S., and Cloutier, J. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.

Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.

Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. Al- gorithms for hyper-parameter optimization. In ShaweTaylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 2546–2554. Curran Associates, Inc., 2011.

Blumenfeld, Y., Gilboa, D., and Soudry, D. A mean field theory of quantized deep networks: The quantizationdepth trade-off. arXiv preprint arXiv:1906.00771, 2019.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

Bousquet, O., Gelly, S., Kurach, K., Teytaud, O., and Vin- cent, D. Critical hyper-parameters: No random, no cry. arXiv preprint arXiv:1706.03200, 2017.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., and Wanderman-Milne, S. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Bradski, G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

Brock, A., Lim, T., Ritchie, J. M., and Weston, N. Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., and Koehn, P. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/ abs/1312.3005.

Chen, Y., Hoffman, M. W., Colmenarejo, S. G., Denil, M., Lillicrap, T. P., Botvinick, M., and de Freitas, N. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 748–756. JMLR. org, 2017.

Chen, Y., Huang, A., Wang, Z., Antonoglou, I., Schrit- twieser, J., Silver, D., and de Freitas, N. Bayesian optimization in alphago. arXiv preprint arXiv:1812.06855, 2018.

Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., and Dahl, G. E. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.

Chrabaszcz, P., Loshchilov, I., and Hutter, F. A downsam- pled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.

Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. Lever- aging procedural generation to benchmark reinforcement learning, 2019.

Dewancker, I., McCourt, M., Clark, S., Hayes, P., John- son, A., and Ke, G. A stratified analysis of bayesian optimization methods. arXiv preprint arXiv:1603.09441, 2016.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density esti- mation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Dolan, E. D. and Moré, J. J. Benchmarking optimization software with performance profiles. Mathematical programming, 91(2):201–213, 2002.

Domhan, T., Springenberg, J. T., and Hutter, F. Speed- ing up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-Fourth International Joint Conference on Artifi-cial Intelligence, 2015.

Dozat, T. Incorporating nesterov momentum into adam. 2016.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121– 2159, 2011.

Eggensperger, K., Hutter, F., Hoos, H., and Leyton-Brown, K. Efficient benchmarking of hyperparameter optimizers via surrogates. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Elman, J. L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.

Falkner, S., Klein, A., and Hutter, F. Bohb: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018.

Feurer, M., Springenberg, J. T., and Hutter, F. Using meta- learning to initialize bayesian optimization of hyperparameters. In Proceedings of the 2014 International Conference on Meta-learning and Algorithm Selection-Volume 1201, pp. 3–10. Citeseer, 2014.

Flax Developers. Flax: A neural network library for jax designed for flexibility, 2020. URL https://github.com/google-research/ flax/tree/prerelease.

Foundation, W. Wikimedia downloads. URL https:// dumps.wikimedia.org.

Ginsburg, B., Castonguay, P., Hrinchuk, O., Kuchaiev, O., Lavrukhin, V., Leary, R., Li, J., Nguyen, H., and Cohen, J. M. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.

Gomes, T. A., Prudêncio, R. B., Soares, C., Rossi, A. L., and Carvalho, A. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3–13, 2012.

Graves, A. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Guadarrama, S., Korattikara, A., Ramirez, O., Castro, P., Holly, E., Fishman, S., Wang, K., Gonina, E., Wu, N., Kokiopoulou, E., Sbaiz, L., Smith, J., BartÃ¸sk, G., Berent, J., Harris, C., Vanhoucke, V., and Brevdo, E. TF-Agents: A library for reinforcement learning in tensorflow. https://github.com/tensorflow/ agents, 2018. URL https://github.com/ tensorflow/agents. [Online; accessed 25-June-2019].

Hansen, N., Auger, A., Mersmann, O., Tusar, T., and Brockhoff, D. Coco: A platform for comparing continuous optimizers in a black-box setting. arXiv preprint arXiv:1603.08785, 2016.

Hayou, S., Doucet, A., and Rousseau, J. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.

Hayou, S., Doucet, A., and Rousseau, J. Mean-field be- haviour of neural tangent kernel for deep neural networks, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.

Heess, N., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S., Riedmiller, M., et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the di- mensionality of data with neural networks. science, 313 (5786):504–507, 2006.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Hochreiter, S., Younger, A. S., and Conwell, P. R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94. Springer, 2001.

Hutter, F., Kotthoff, L., and Vanschoren, J. (eds.). Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018. In press, available at http://automl.org/book.

Juliani, A., Khalifa, A., Berges, V.-P., Harper, J., Henry, H., Crespi, A., Togelius, J., and Lange, D. Obstacle tower: A generalization challenge in vision, control, and planning. arXiv preprint arXiv:1902.01378, 2019.

Karakida, R., Akaho, S., and Amari, S.-i. Universal statistics of fisher information in deep neural networks: mean field approach. 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Klein, A. and Hutter, F. Tabular benchmarks for joint archi- tecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970, 2019.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with bayesian neural networks. 2016.

Krizhevsky, A., Nair, V., and Hinton, G. Cifar-10 and cifar- 100 datasets. URl: https://www. cs. toronto. edu/kriz/cifar. html, 6, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.

Kumar, M., Dahl, G. E., Vasudevan, V., and Norouzi, M. Parallel architecture and hyperparameter search via successive halving and classification. arXiv preprint arXiv:1805.10255, 2018.

LeCun, Y. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.

Li, K. and Malik, J. Learning to optimize neural nets. arXiv preprint arXiv:1703.00441, 2017.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016.

Li, P. and Nguyen, P.-M. On random deep weight-tied au- toencoders: Exact asymptotic analysis, phase transitions, and implications to training. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=HJx54i05tX.

Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.

Loshchilov, I. and Hutter, F. Fixing weight decay regular- ization in adam. arXiv preprint arXiv:1711.05101, 2017.

Lv, K., Jiang, S., and Li, J. Learning gradient descent: Better generalization and longer horizons. arXiv preprint arXiv:1703.03633, 2017.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/ anthology/P11-1015.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.

McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.

Melis, G., Dyer, C., and Blunsom, P. On the state of the art of evaluation in neural language models. arXiv preprint arXiv:1707.05589, 2017.

Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-Dickstein, J. Understanding and correcting pathologies in the training of learned optimizers. In International Conference on Machine Learning, pp. 4556–4565, 2019a.

Metz, L., Maheswaranathan, N., Shlens, J., Sohl-Dickstein, J., and Cubuk, E. D. Using learned optimizers to make models robust to input noise. arXiv preprint arXiv:1906.03367, 2019b.

Nene, S. A., Nayar, S. K., Murase, H., et al. Columbia object image library (coil-20). 1996.

Nesterov, Y. A method for unconstrained convex minimiza- tion problem with the rate of convergence o (1/kˆ 2). In Doklady AN USSR, volume 269, pp. 543–547, 1983.

Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.

Olsen, A., Konovalov, D. A., Philippa, B., Ridd, P., Wood, J. C., Johns, J., Banks, W., Girgenti, B., Kenny, O., Whinney, J., Calvert, B., Rahimi Azghadi, M., and White, R. D. DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning. Scientific Reports, 9(2058), 2 2019. doi: 10.1038/s41598-018-38343-3. URL https: //doi.org/10.1038/s41598-018-38343-3.

OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L. Solving rubik’s cube with a robot hand. arXiv preprint, 2019.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Perrone, V. and Shen, H. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pp. 12751–12761, 2019.

Perrone, V., Jenatton, R., Seeger, M. W., and Archambeau, C. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pp. 6845–6855, 2018.

Petrak, J. Fast subsampling performance estimates for classification algorithm selection. In Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pp. 3–14. Citeseer, 2000.

Pfisterer, F., van Rijn, J. N., Probst, P., Müller, A., and Bis- chl, B. Learning multiple defaults for machine learning algorithms. arXiv preprint arXiv:1811.09409, 2018.

Pretorius, A., van Biljon, E., Kroon, S., and Kamper, H. Critical initialisation for deep signal propagation in noisy rectifier neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 5717–5726. Curran Associates, Inc., 2018.

Ramachandran, P., Zoph, B., and Le, Q. Searching for activation functions. 2017.

Rapin, J. and Teytaud, O. Nevergrad - A gradientfree optimization platform. https://GitHub.com/ FacebookResearch/Nevergrad, 2018.

Reif, M., Shafait, F., and Dengel, A. Meta-learning for evo- lutionary parameter optimization of classifiers. Machine learning, 87(3):357–380, 2012.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.

Schaul, T., Antonoglou, I., and Silver, D. Unit tests for stochastic optimization. arXiv preprint arXiv:1312.6055, 2013.

Schmidhuber, J. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Schmidhuber, J. On learning how to learn learning strategies. 1995.

Schneider, F., Balles, L., and Hennig, P. Deepobs: A deep learning optimizer benchmark suite. International Conference on Learning Representations, 2019.

Schoenholz, S. S., Gilmer, J., Ganguli, S., and Sohl- Dickstein, J. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.

Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., and Fleuret, F. On the tunability of optimizers in deep learning. arXiv preprint arXiv:1910.11758, 2019.

Snoek, J., Larochelle, H., and Adams, R. P. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, M., and Adams, R. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180, 2015.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.

Swersky, K., Snoek, J., and Adams, R. P. Multi-task bayesian optimization. In Advances in neural information processing systems, pp. 2004–2012, 2013.

Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw bayesian optimization. arXiv preprint arXiv:1406.3896, 2014.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., and Riedmiller, M. DeepMind control suite. Technical report, DeepMind, January 2018. URL https://arxiv.org/abs/1801.00690.

Tensorflow. tensorflow tpu resnet50, Oct 2019. URL https://github.com/tensorflow/tpu/ tree/master/models/official/resnet.

Tieleman, T. and Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012.

Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. IEEE, 2017.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., and Larochelle, H. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.

Van Der Maaten, L. Accelerating t-sne using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforce- ment learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.

Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. Openml: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm. org/10.1145/2641190.2641198.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638, 2016.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019.

Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550– 1560, 1990.

Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and SohlDickstein, J. Learned optimizers that scale and generalize. International Conference on Machine Learning, 2017.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.

Wistuba, M., Schilling, N., and Schmidt-Thieme, L. Learn- ing hyperparameter optimization initializations. In 2015 IEEE international conference on data science and advanced analytics (DSAA), pp. 1–10. IEEE, 2015a.

Wistuba, M., Schilling, N., and Schmidt-Thieme, L. Se- quential model-free hyperparameter tuning. In 2015 IEEE international conference on data mining, pp. 1033–1038. IEEE, 2015b.

Wolpert, D. H. and Macready, W. G. No free lunch theo- rems for optimization. IEEE transactions on evolutionary computation, 1(1):67–82, 1997.

Wu, Y., Ren, M., Liao, R., and Grosse, R. B. Understanding short-horizon bias in stochastic meta-optimization. pp. 478–487, 2016.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492, June 2010. doi: 10.1109/CVPR.2010.5539970.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field

theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In International Conference on Machine Learning, 2018.

Yang, G. and Schoenholz, S. Mean field residual networks: On the edge of chaos. In Advances In Neural Information Processing Systems, 2017.

Ying, C., Klein, A., Real, E., Christiansen, E., Murphy, K., and Hutter, F. NAS-Bench-101: Towards Reproducible Neural Architecture Search. arXiv e-prints, art. arXiv:1902.09635, Feb 2019.

Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P., Riquelme, C., Lucic, M., Djolonga, J., Pinto, A. S., Neumann, M., Dosovitskiy, A., et al. The visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Zoph, B. and Le, Q. V. Neural architecture search with rein- forcement learning. International Conference on Learning Representations, 2017. URL https://arxiv. org/abs/1611.01578.

A. Additional Experiments

A.1. Reinforcement Learning with PPO

Figure 9. We find our learned hyperparameter lists performs about as well as random search on the NAdam search space, and worse than the random search on the learning rate tuned Adam search space.

We test the learned hyperparameter lists on two continuous control reinforcement learning environments, half cheetah and humanoid, from Gym’s Mujoco environments(Todorov et al., 2012; Brockman et al., 2016). We use TF-Agents (Guadarrama et al., 2018) with all non-optimizer hyperparameters set via searching a mixture of environments. In figure A.1 we find our learned hyperparameter lists achieves comparable to slightly worse performance does not out perform learning rate tuning of Adam in both efficiency nor final performance. To diagnose this behavior we ran all 1k optimizers for both problems and found the learned hyperparameter list performs comparable to random search in the underlying space. To probe further, we computed spearman correlation on the performance of each optimizer as compared to the rest of the tasks in the task suite. We found considerably worse correlations than where present for tasks in the TaskSet. This is not surprising as TaskSet contains no reinforcement learning problems.

A.2. LM1B targeting 20k iterations

We show a transformer on LM1B similar to that shown in §6.2 except run for only 20k iterations, a fith of the steps. Results in Figure 10. We find the learned hyperparameter lists are much more efficient than either of the baselines.

A.3. Probing short horizon

Often the goal when training a learned optimizers is to minimize performance after training some number of iterations. This is extremely computationally expensive and in practice approximations must be used. One common family of approximations is short horizon based methods. These methods rely upon somehow truncating training so that up-

Figure 10. We find our learned hyperparameter lists out performs learning rate tuned Adam with both a constant, and a fixed learning rate schedule on a 53M parameter Transformer trained on LM1B. Left: Learning curves for the best of the optimizers. Right: Number of optimizers tried vs best test loss.

Figure 11. Hyperparameter lists trained on short horizon data generalize remarkably well. On the y-axis we show performance evaluated on the the full 10k training iterations for a given number of optimizers tried (x-axis). In color we show different number of steps used when evaluating task optimizer performance when training the hyperparameter list.

dates can be made to the learned optimizer more frequently. This is commonly done via truncated backprop (Werbos, 1990; Wichrowska et al., 2017; Metz et al., 2019a; Wu et al., 2016), or proxy objectives such as only training for a handful of epoch (Zoph & Le, 2017). While this short horizon proxy is certainly not optimal(Wu et al., 2016), the performance gains are immense and in practice is what makes meta-training optimizers feasible. In our task suite, we test this short horizon learning by training hyperparameter lists only using some finite amount of training iterations per task and testing in the full training regieme (10k steps). Results in figure 11. We find that even when learning the hyperparameter list on a mere 200 steps, our hyperparameter list continues to generalize to outperform random search on Adam8p. This is promising as this suggests that training the learned hyperparameter list can be done with 1/50th of the total compute. This result is surprising to us as prior work indicates the effect of this bias can be severe (Wu et al., 2016; Metz et al., 2019a). We suspect it is due to the simplicity of the learned parameter space but leave a thorough analysis of this for future work.

A.4. Choice of normalization function

There is no easy way to define a single metric for optimizer performance over a mixture of tasks. This paper picks a single normalization strategy based on minimum validation loss and the validation loss at initialization presented in §3.3. In this section we show the impact of choosing a different normalization and or aggregation technique. First, instead of computing the mean over learning curves as described in §3.3 we compute a min. Second, instead of rescaling based on init and min, we linearly rescale based on the 95 percentile of validation loss and the min validation loss achieved at the end of training each task.In Figure 12 we show learned hyperparameter list training and testing performance as a function of number of optimizers tried when training with different normalization techniques. We find using the min instead of mean results in a negligible change, while using the percentile loss more significantly hurts performance. This difference can be explained by Figure 12b and 12c where we show correlations between the two losses. We find the percentile loss has a much weaker correlation to the default normalizer. We suspect this difference is due to the fact that many optimizers diverage on tasks. By using the 95 percentile we upweight optimizers that do not diverge.

A.5. Task families are diverse

To show the effects of diversity we train and test hyperparameter lists on each pair of task family. We additionally normalize each column from 0-1 to account for different mean losses across tasks. Results in Figure 13. While we do find some similarity in tasks – e.g. between MAF and NVP models, but no two tasks behave the same performance characteristics (no duplicate columns) suggesting that each task family is providing a different contribution to the space of all tasks. We also find when training on certain “far away” tasks, e.g. the quadratic family, we find poor performance on most other task families.

A.6. Effects of the meta-training search space size

Our offline learning technique described in §3.4 hinges on a finite set of optimizers collected via random search. This set is denote by in Eq.5. In this section we probe the impact of this size. We take different sized subsets of the the thousand Adam8p optimizer configurations and train and test search spaces on different iid splits of tasks. We then plot performance as a function of this number of optimizers in Figure 15. Moving left in this figure corresponds to increasing the compute needed to train the learned hyper-parameter list. We find performance continues to improve as the size of grows. Given the high dimension of our meta-parameters, 8, this is not a surprise as the number of evaluations needed to explore the space will grow exponentially. We find that the full thousand trials are needed to out perform learning rate tuned Adam when only given a single optimizer evaluation. We find around 100 optimizers (size of ) are needed in the case of 10 optimizer trials (k = 10).

Overall this sugjests that randomsearch might not be the most efficient learning method for creating hyperparameter lists. This is especially true as we work with optimizer families that have more hyperparameters. Other approximate learning methods should likely be explored such as truncated backprop through time as used by the learned optimizer community(Metz et al., 2019a), and/or population based methods (Balduzzi et al., 2019).

B. Task timings

In Figure 14 we show box plots of training times for each problem. For each task we use the median step time recorded over a mixture of different physical devices and multipled by 10k to estimate a full training time. Future versions of this dataset of tasks will contain more variation within each task family.

Figure 12. Left: Aggregate performance (y-axis) vs number of optimizer tried (x-axis) for different normalization and aggregation techniques. In each curve we train the hyperparameter list with a different normalization and aggregation strategy and test with the default normalization and aggregation technique described in 3.3. We find some some strategies are near identical in performance (e.g. min norm), while others perform significantly worse – e.g. last quantile norm. In both cases, however, we still perform better than the underlying random search. Center: Correlation between default normalization and the quantile based normalization strategy. Correlation is quite low – 0.193 Pearson’s correlation. Right: Correlation between the default normalization using a mean to aggregate over validation over the course of training vs using a min over validation over the course training. We find a much higher correlation of 0.911.

Figure 13. Learning hyperparameter lists using one task family and testing on the remainder of task families. We normalize each column from 0-1 to account for different mean losses across tasks. Lower loss means better performance. We find some groups of similar tasks, but in general no two task families behave identically.

Figure 14. Timings computed for each task family. We find most task families have a narrow distribution of compute times.

1 hparams (Adam8p)

1 hparams (Adam4p)

10 hparams (Adam8p) 10 hparams (Adam4p) 100 hparams (Adam8p) 100 hparams (Adam4p) 1 -- Adam lr

Figure 15. Performance continues to improve as more and more optimizers are used when training the search spaces. On the x-axis we show number of optimzers (size of , the number of hyperparameter evaluations used in training the learned hyperparameter list) and y-axis we show test loss achieved when applying the learned search space for a given fixed length, e.g. different values of k shown in color). We plot median with 25-75 percentile shaded over different random optimizer samples and iid task splits. Stars (with horizontal guide lines) denote best search for the corresponding number of hyperparameters for learning rate tuned Adam in half orders of magnitude.

C. Optimizer family update equations

C.1. Adam8p update equations

The 8 meta-parameters are: the learning rate, , first and second moment momentum, , the numerical stability term, regularization strength, and learning rate schedule constants and linear_decay. For Adam6p, we set

C.2. NAdamW update equations

This optimizer family has 10 hyper parameters. The base learning rate, , first and second moment momentum, , the numerical stability term, regularization strength, AdamW style weight decay, and a boolean to switch between NAdam and Adam, buse nesterov. The learning rate schedule is based off of a single cycle cosine decay with a warmup. It is controlled by 3 additional parameters – min learning rate mult.

The learning rate is defined by:

The update equations of NAdamW are quite similar to that of Adam8p. For clarity we list the full update here.

D. Optimizer family search spaces

D.1. Adam8p, Adam6p, Adam4p, AdamLr search spaces

For Adam1p, Adam4p, Adam6p, and Adam8p we sample learning rate logritmically between 1e-8 and 10, beta1 and beta2 we parametrize as and sample logrithmically between 1e-4 and 1 and 1e-6 and 1 respectively. For learning rate schedules we sample linear decay between 1e-7, 1e-4 logrithmically and exponential decay logrithmically between 1e-3, 1e-6. We sample both and logrithmcally between 1e-8, 1e1.

D.2. NAdamW search space

This search space was chosen heuristically in an effort to generalize to new problems. We would like to emphasize that it was not tuned. We used our insight from Adam based optimizer families and chose this. No iterations where done. We expect more iterations will improve not only in distribution performance, but also generalization performance.

The initial learning rate, is sampled from log space between is sampled logrithmically between is sampled between is sampled logarithmically between and 1e4. We sample using nesterov (buse nesterov) 50% of the time. We sample logrithmically between and . Equal probabilities of a third we either use both terms, zero out , or zero out . With 50% probability we use a nonzero min learning rate multiplier sampled logrithmically between 50% probability we sample the warm up fraction, between 1e-5 and 1e-1, otherwise it is set to zero. Finally, we uniformly sample the amount of time the learning rate is held constant() between 0 and 1.

E. Extended related work

E.1. Sets of tasks

Benchmarks consisting of multiple tasks are becoming an increasingly common technique for measuring improvement in algorithm design. Reinforcement learning has Atari (Bellemare et al., 2013), DMLab (Beattie et al., 2016), gym (Brockman et al., 2016), and dm_control (Tassa et al., 2018). Natural language processing has evaluation sets such as GLUE (Wang et al., 2018), Super GLUE (Wang et al., 2019), and the NLPDecathalon (McCann et al., 2018). In computer vision there is (Zhai et al., 2019) which studies transfer learning of image features. In black box optimization there is Nevergrad (Rapin & Teytaud, 2018), COmparing Continuous Optimizers (COCO) (Hansen et al., 2016) and a number of tasks to test Bayesian hyperparameter optimization presented in (Dewancker et al., 2016). For first order gradient methods there are unit tests for stochastic optimization (Schaul et al., 2013) which studies toy optimization functions, and DeepObs (Schneider et al., 2019) which includes 20 neural network tasks. Hyperparameter tuning practices on these benchmarks vary between tuning on each task separately, to tuning one set of hyperparameters for all problems. In Atari (Bellemare et al., 2013), for example, it is common practice to tune hyperparameters on a subset of tasks and evaluate on the full set. This protocol can further be extended by leveraging unseen levels or games at test time as done in Obstacle Tower (Juliani et al., 2019), ProcGen (Cobbe et al., 2019), CoinRun (Cobbe et al., 2018), and Sonic (Nichol et al., 2018). We believe generalization to unseen tasks is key for learned algorithms to be useful thus our learned search space experiments mirror this setting by making use of hold out tasks.

Existing meta-learning data sets share similar goals to our work but focus on different domains. In few shot learning there is MiniImageNet (Vinyals et al., 2016) which is built procedurally from the ImageNet dataset (Russakovsky et al., 2015). Meta-Dataset (Triantafillou et al., 2019) takes this further and also focuses on generalization by constructing few shot learning tasks using images from a number of different domains for evaluation purposes. The automated machine learning community has OpenML (Vanschoren et al., 2013) with a focus on selecting and tuning non-neural algorithms. For learning optimizers, the use of task suites has been limited and ad-hoc. Many works use a single or small number of standard machine learning tasks (Andrychowicz et al., 2016; Li & Malik, 2017; Lv et al., 2017; Metz et al., 2019a). Wichrowska et al. (2017) uses a set of synthetic problems meant to emulate many different kinds of loss surfaces. While existing collections of tasks exist for optimizer evaluation, e.g. (Schneider et al., 2019), they contain too small a number of tasks to act as a comprehensive training set for learning algorithms, and many of their tasks are additionally too computationally expensive to be useful during learning.

E.2. Hand designed and learned optimizers

Optimization is core to machine learning and thus the focus of extensive work. Methods such as Nesterov momentum (Nesterov, 1983), AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014) have all shown considerable improvements in both the speed of optimization and ease of use by exposing robust, and easier to tune hyperparameters than SGD (Sivaprasad et al., 2019). Adaptive step size methods in particular have emerged at the forefront with many works building from it including AdamW (Loshchilov & Hutter, 2017), RAdam (Liu et al., 2019), Novograd (Ginsburg et al., 2019), and NAdam (Dozat, 2016). Recently, there has been a focus on comparing optimizers either for best performance, or ease of use (Wilson et al., 2017; Choi et al., 2019; Schneider et al., 2019; Sivaprasad et al., 2019). This has proven difficult as performance is heavily dependent on the choice of search space for optimization hyperparameters (Choi et al., 2019).

Learned optimizers represent a parallel thread in the development of optimizers. By learning as opposed to hand-designing optimizers, researchers hope to not only increase performance but also ease of use (e.g. minimize the number of hyperparameters required or lower hyperparameter sensitivity) (Bengio et al., 1990; Schmidhuber, 1995; Hochreiter et al., 2001). Recently, there has been renewed interest in parameterizating learning algorithms with neural networks and learning these optimizers on neural network based losses (Andrychowicz et al., 2016; Wichrowska et al., 2017; Li & Malik, 2017; Lv et al., 2017; Metz et al., 2019a;b). Other approaches make learn symbolic parameterizations for new optimizers (Bello et al., 2017). These various methods are all trained and evaluated on different distributions of tasks making comparison across papers challenging. The dataset of tasks presented here will hopefully aid in the ability to compare and evaluate progress in learned optimizer research.

In this work, we develop a much more minimal type of “learned optimizer” than previous work which developed new functional forms for the optimizer. Optimization involves not only the functional form of the optimizer, but also the rules for choosing hyperparameters and applying the optimizer. We focus on this second aspect of optimization and learn a hyperparameter search space to improve the performance of existing hand designed methods.

E.3. Hyperparameter search

Hyperparameter search is a key component in machine learning. Considerable improvements have been made in language (Melis et al., 2017), computer vision (Snoek et al., 2012), and RL (Chen et al., 2018) simply by tuning better. Often no single hyperparameter configuration works well across all tasks for existing optimization methods. Most current hyperparameter search methods involve trying a very large number of hyperparameters for every new task, which is computationally infeasible for large tasks, and additionally can severely limit the number of hyperparameters that can be tuned. Many common techniques such as random search (Bergstra & Bengio, 2012; Bousquet et al., 2017), Bayesian optimization (Snoek et al., 2012; 2015), tree parzen estimators (Bergstra et al., 2011), or sequential halving (Kumar et al., 2018) require setting a hyperparameter search space by hand which is not only difficult but often wildly inefficient.

Learning hyperparameters or search strategies by leveraging multiple tasks has been explored within the context of Bayesian optimization (Swersky et al., 2013; Perrone & Shen, 2019; Perrone et al., 2018) as well as under the term meta-learning in Chen et al. (2017) in which an LSTM is meta-trained to produce function locations to query.

The cost of hyperparameter search is often large as each evaluation requires training a model to completion. Often multi-fidelity based approaches are used which leverage “simpler” tasks and transfer the resulting hyperparameters (Hutter et al., 2018). Common approaches include training on partial function evaluations (Swersky et al., 2014; Domhan et al., 2015; Li et al., 2016; Klein et al., 2016; Falkner et al., 2018), or leveraging simplified data and models (Petrak, 2000; Zoph & Le, 2016; Brock et al., 2017). Our dataset of tasks serves as a: “simpler” set of tasks to train on; a large and diverse enough set of problems that optimization algorithms trained on it may be expected to generalize; and a framework to test transfer across different

types of problems.

F. List of NAdam HParams

Top 50 hyper parameters found using the NAdamW search space. We find diverse learning rates, with very little warmup used. We additionally find most good performing optimizers make use of AdamW style weight decay. Finally, matching insight from (Choi et al., 2019), we find large values of

G. Description of tasks in task suite

In this section we detail the task distribution used throughout this work. In addition to this text, a Tensorflow (Abadi et al., 2016) implementation is also released at github.com/google- research/google-research/tree/master/task_set.

G.1. Sampled Tasks

As many of the sampled tasks are neural networks. We define common sampling routines used by all the sampled tasks.

Activation functions: We define a distribution of activation functions which is sampled corresponding the following listing both name and weight. These are a mix of standard functions (relu, tanh) to less standard (cos).

• relu: 6

• tanh: 3

• cos: 1

• elu: 1

• sigmoid: 1

• swish (Ramachandran et al., 2017): 1

• leaky relu (with

• leaky relu (with

• leaky relu (with

Initializations: We sample initializers according to a weighted distribution. Each initialization sample also optionally samples hyperparameters (e.g. for random normal initializers we sample standard deviation of the underlying distribution).

• he normal (He et al., 2015): 2

• he uniform (He et al., 2015): 2

• glorot normal (Glorot & Bengio, 2010): 2

• glorot uniform (Glorot & Bengio, 2010): 2

• orthogonal: 1. We sample the “gain”, or multiplication of the orthogonal matrix logarithmically between [0.1, 10].

• random uniform 1.0: This is defined between where s is sampled logarithmically between [0.1, 10].

• random normal: 1.0: The std is sampled logarithmically between (0.1, 10).

• truncated normal: 1.0: The std is sampled logarithmically between (0.1, 10).

• variance scaling: 1.0: The scale is sampled logarithmically between (0.1, 10).

RNN Cores: We define a distribution over different types of RNN cores used by the sequential tasks. With equal probability we sample either a vanilla RNN (Elman, 1990), GRU(Chung et al., 2014), or LSTM(Hochreiter & Schmid- huber, 1997). For each cell we either sample 1 shared initialization method or sample a different initialization method per parameter vector with a 4:1 ratio. We sample the core hidden dimension logarithmically between [32, 128].

Image Datasets: We sample uniformly from the following image datasets. Each dataset additionally has sampled parameters. For all datasets we make use of four data splits: train, valid-inner, valid-outer, test. Train is used to train models, valid-inner is used while training models to allow for modification of the training procedure (e.g. if validation loss doesn’t increase, drop learning rate). Valid-outer is used to select meta-parameters. Test should not be used during meta-training.

For all datasets, we sample a switch with low probability (10% of the time) to only use training data and thus not test generalization. This ensures that our learned optimizers are capable of optimizing a loss as opposed to a mix of optimizing and generalizing.

Mnist: Batch size is sampled logarithmically between [8, 512]. We sample the number of training images logarithmically between [1000, 55000] (LeCun, 1998).

Fashion Mnist: Batch size is sampled logarithmically between [8, 512]. We sample the number of training images logarithmically between [1000, 55000] (Xiao et al., 2017).

Cifar10: Batch size is sampled logarithmically between [8, 256]. The number of training examples is sampled logarithmically [1000, 50000] (Krizhevsky et al., 2009).

Cifar100: Batch size is sampled logarithmically between [8, 256]. The number of training examples is sampled logarithmically [1000, 50000] (Krizhevsky et al., 2009).

{food101_32x32, coil100_32x32, deep_weeds_32x32, sun397_32x32}: These dataset take the original set of images and resize them to 32x32 using OpenCV’s (Bradski, 2000) cubic interpolation. We ignore aspect ratio for this resize. Batch size is sampled logarithmically between [8, 256] (Bossard et al., 2014; Nene et al., 1996; Olsen et al., 2019; Xiao et al., 2010).

Imagenet32x32 / Imagenet16x16: The ImageNet 32x32 and 16x16 dataset as created by Chrabaszcz et al. (2017).

Batch size is logrithmically sampled between [8, 256].

IMDB sentiment classification: We use text from the IMDB movie reviews dataset(Maas et al., 2011) and tokenize using subwords using a vocab size of 8k(Sennrich et al., 2015). We then take length s random slice from each example where s is sampled logarithmically between [8, 64]. These examples are then batched into a batch size logarithmically sampled between [8, 512]. We sample the number of training examples logarithmically between [1000, 55000] and with 10% probability just use training data instead of valid / test to test pure optimization as opposed to generalization.

G.1.4. CHARACTER AND WORD LANGUAGE MODELING

For the character and word language modeling datasets we make use of the following data sources: imdb movie reviews(Maas et al., 2011), amazon product reviews (ama) using the Books, Camera, Home, and Video subset each as separate datasets, LM1B(Chelba et al., 2013), and Wikipedia(Foundation) taken from the 20190301 dump using the zh, ru, ja, hab, and en language codes. We split each article by new lines and only keep resulting examples that contain more than 5 characters. For infrastructure reasons, we only use a million articles from each language and only 200k examples to build the tokenizer.

Byte encoding: We take length s random slices of each example where s is sampled logarithmically between [10, 160]. These examples are then batched into a batch size logarithmically sampled between [8, 512]. With probability 0.2 we restrict the number of training examples to a number logarithmically sampled between [1000, 50000]. Finally, with a 10% probability just use training data instead of valid / test to test pure optimization as opposed to generalization.

subword encoding: We encode the text as subwords with a vocabsize of 8k (Sennrich et al., 2015). We then take length s random slices of each example where s is sampled logarithmically between [10, 256]. These examples are then batched into a batch size logarithmically sampled between [8, 512]. With probability 0.2 we restrict the number of training examples to a number logarithmically sampled between [1000, 50000]. Finally, with a 10% probability just use training data instead of valid / test to test pure optimization as opposed to generalization.

G.2. Sampled Tasks

G.2.1. MLP

This task family consists of a multi layer perceptron trained on flattened image data. The amount of layers is sampled uniformly from [1, 6]. Layer hidden unit sizes are sampled logarithmically between [16, 128] with different number of hidden units per layer. One activation function is chosen for the whole network and is chosen as described in G.1.1. One shared initializer strategy is also sampled. The image dataset used is also sampled.

Two sampled configurations are shown below.

G.2.2. MLP_AE

This task family consists of a multi layer perceptron trained with an auto encoding loss. The amount of layers is sampled uniformly from [2, 7]. Layer hidden unit sizes are sampled logarithmically between [16, 128] with different number of hidden units per layer. The last layer always maps back to the input dimension. The output activation function is sampled with the following weights: tanh:2, sigmoid:1, linear_center:1, linear:1 where linear_center is an identity mapping. When using the linear_center and tanh activation we shift the ground truth image to before performing a comparison to the model’s predictions. We sample the per dimension distance function used to compute loss with weights l2:2, l1:1, and the reduction function across dimensions to be either mean or sum with equal probability. A single activation function, and initializer is sampled. We train on image datasets which are also sampled.

A sample configurations is shown below.

G.2.3. MLP VAE

This task has an encoder with sampled number of layers between [1, 3]. For each layer we sample the number of hidden units logarithmically between [32, 128]. For the decoder we sample the number of layers uniformly between [1, 3]. For each layer we sample the number of hidden units logarithmically between [32, 128]. We use a gaussian prior of dimensionality logarithmically sampled between [32, 128]. A single activation function and initialization is chosen for the whole network. The output of the encoder is projected to both a mean, and a log standard deviation which parameterizes the variational distribution, q(z|x). The decoder maps samples from the latent space to a quantized gaussian distribution in which we compute data log likelihoods log p(x|z). The loss we optimize is the evidence lower bound (ELBO) which is computed by adding this likelihood to the kl divergence between our normal distribution prior and q(z|x). We use the reparameterization trick to compute gradients. This model is trained on sampled image datasets.

A sample configuration is listsed below.

G.2.4. CONV POOLING

This task consists of small convolutional neural networks with pooling. We sample the number of layers uniformly between [1, 5]. We sample a stride pattern to be either all stride 2, repeating the stride pattern of 1,2,1,2... for the total number of layers, or 2,1,2,1... for the total number of layers. The hidden units are logarithmically sampled for each layer between [8, 64]. We sample one activation function and weight init for the entire network. Padding for the convolutions are sampled per layer to either be same or valid with equal probability. For the convnet we also sample whether or not to use a bias with equal probability. At the last layer of the convnet we do a reduction spatially using either the mean, max, or squared mean sampled uniformly. This reduced output is fed into a linear layer and a softmax cross entropy loss. These models are trained on a sampled image dataset.

A sample configuration is shown below.

G.2.5. CONV FC

This task consists of small convolutional neural networks, flattened, then run through a MLP. We sample the number of conv layers uniformly between [1, 5]. We sample a stride pattern to be either all stride 2, repeating the stride pattern of 1,2,1,2... for the total number of layers, or 2,1,2,1... for the total number of layers. The hidden units are logarithmically sampled for each layer between [8, 64]. Padding for the convolutions are sampled per layer to either be same or valid with equal probability.

The output is then flattened, and run through a MLP with hidden layers sampled uniformly from [0, 4] and with sizes sampled logrithmically from [32, 128]. The loss is then computed via softmax cross entropy.

We sample one activation function and weight init for the entire network. For the convnet we also sample whether or not to use a bias with equal probability. These models are trained on a sampled image dataset.

An example configuration is shown below.

This task takes character embedded data, and embeds in a size s embedding vector where s is sampled logarithmically between [8, 128] with random normal initializer with std 1.0. With 80% we use all 256 tokens, and with 20% chance we only consider a subset of tokens sampled logarithmically [100, 256]. We then pass this embedded vector to a RNN with teacher forcing with equal probability we use a trainable initializer or zeros. A linear projection is then applied to the number of vocab tokens. Losses are computed using a softmax cross entropy vector and mean across the sequence.

A sample configuration is shown below.

This task takes word embedded data, and embeds in a size s embedding vector where s is sampled logarithmically between [8, 128] with random normal initializer with std 1.0. A vocab size for this embedding table is sampled logarithmically between [1000, 30000]. We then pass this embedded vector to a RNN with teacher forcing with equal probability we use a trainable initializer or zeros. A linear projection is then applied to the number of vocab tokens. Losses are computed using a softmax cross entropy vector and mean across the sequence.

A sample configuration shown below.

G.2.8. LOSG PROBLEMS

These tasks consist of a mixture of many other tasks. We sample uniformly over the following types of problems. We brielfy describe them here but refer reader to the provided source for more information. In this work we took all the base problems from (Wichrowska et al., 2017) but mod-ified the sampling distributions to better cover the space as opposed to narrowly sampling particular problem families. Future work will consist of evaluating which sets of problems or which sampling decisions are required.

quadratic: n dimensional quadratic problems where n is sampled logarithmically between [10, 1000]. Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between [0.01, 10].

bowl: A 2d qaudratic bowl problem with a sampled condition number (logrithmically between [0.01, 100]). Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between [0.01, 10].

sparse_softmax_regression: A synthetic random sparse logistic regression task.

optimization_test_problems: A uniform sample over the following functions: Ackley, Beale, Branin, logsumexp, Matyas, Michalewicz, Rosenbrock, StyblinskiTang.

fully_connected: A sampled random fully connected clas-sification neural network predicting 2 classes on synthetic data. Number of input features is sampled logrithmically between 1 and 16, with a random activation function, and a sampled number of layers uniformly sampled from 2-5.

norm: A problem that finds a minimum error in an arbitrary norm. Specifically: . The dimentionality, N, is sampled logrithmically between 3, and 1000. The power, p, is sampled uniformly between 0.1 and 5.0. W, and y are drawn from a standard normal distribution.

dependency_chain: A synthetic problem where each parameter must be brought to zero sequentially. We sample dimensionality logrithmically between 3, 100.

outward_snake: This loss creates a winding path to infin-ity. Step size should remain constant across this path. We sample dimensionality logrithmically between 3 and 100.

min_max_well: A loss based on the sum of min and max over parameters: . Note that the gradient is zero for all but 2 parameters. We sample dimentaionlity logrithmically between 10 and 1000. Noise is optionally added with probability 0.5 and of the scale s where s is sampled logarithmically between [0.01, 10].

sum_of_quadratics: A least squares loss of a dimentionality sampled logrithmically between 3 and 100 to a synthetic dataset.

projection_quadratic: A quadratic minimized by probing different directions. Dimentionality is sampled from 3 to 100 logrithmically.

In addition to these base tasks, we also provide a variety of transformations described bellow. The use of these transformations is also sampled.

sparse_problems: With probability 0.9 to 0.99 the gradient per parameter is set to zero. Additional noise is added with probability 0.5 sampled from a normal with std sampled logrithmically between [0.01, 10.0].

rescale_problems: Rescales the loss value by 0.001 to 1000.0 sampled logrithmically.

log_objective: Takes the log of the objective value.

2 Sample configurations shown below.

Masked autoregressive flows are a family of tractable density generative models. See XX for more information. The MAF is defined by a sequence of bijectors. For one bijector samples a number of layers to either be 1 or 2 with equal probability, and a number of hidden layers sampled logarithmically between [16, 128]. We sample the number of bijector uniformly from [1, 4] and use the same hidden layers across all bijector. We sample activation function, and initializer once for the whole model. In this task we model image datasets which are also sampled.

A sample configuration is shown below.

NVP are a family of tractable density generative models. See (Dinh et al., 2016) for more information. The NVP is defined by a sequence of bijectors. For one bijector samples a number of layers to either be 1 or 2 with equal probability, and a number of hidden layers sampled logarithmically between [16, 128]. We sample the number of bijector uniformly from [1, 4] and use the same hidden layers across all bijector. We sample activation function, and initializer once for the whole model. In this task we model image datasets which are also sampled.

A sample configuration shown below.

This task distribution defines a synthetic problem based on a non-linear modification to a quadratic. The dimensionality of the problem is sampled logarithmically between [2, 3000].

The loss for this task is described by:

where weight_rescale and where param is initialized by initial_dist.sample() / weight_rescale.

The output_fn is sampled uniformly between identity, and f(x) = log(max(0, x)). The loss scale is sampled logarithmically between [

We define a distribution over matrices A as a sample from one of the following: normal: we sample a mean from a normal draw with a standard deviation of 0.05 and a std from a uniform [0, 0.05]. The elements of A are drawn from the resulting distribution. uniform: linspace_eigen: logspace_eigen:

We define a distribution over B to be either normal with mean and std sampled from N(0, 1), U(0, 2) respectively or uniform with min and range equal to U(-5, 2.5), U(0, 5) respectively.

With probability 50% we add noise from a distribution whose parameters are also sampled.

A sample configuration shown below.

This task consists of using an RNN to classify tokenized text. We first trim the vocab length to be of a size logarithmically sampled between [100, 10000]. The text is then embedded into a vocab size logarithmically sampled between [8, 128]. These embeddings get fed into a sampled config RNN. With equal probability the initial state of the rnn is either sampled, or zeros. With equal probability we either take the last RNN prediction, the mean over features, or the per feature max over the sequence. This batch of activations is then passed through a linear layer and a softmax cross entropy loss. The initialization for the linear projection is sampled.

An example configuration shown below. In this version of TaskSet the dataset sampling contains a bug. All data used is from the imdb_reviews/subwords8k dataset.

G.3. Fixed Tasks

In addition to sampled tasks, we also define a set of hand designed and hand specified tasks. These tasks are either more typical of what researcher would do (e.g. using default initializations) or specific architecture features such as bottlenecks in autoencoders, normalization, or dropout.

In total there are 107 fixed tasks. Each task is labeled by name with some information about the underlying task. We list all tasks, discuss groups of tasks, but will not describe each task in detail. Please see the source for exact details.

Associative_GRU128_BS128_Pairs10_Tokens50 Associative_GRU256_BS128_Pairs20_Tokens50 Associative_LSTM128_BS128_Pairs10_Tokens50 Associative_LSTM128_BS128_Pairs20_Tokens50 Associative_LSTM128_BS128_Pairs5_Tokens20 Associative_LSTM256_BS128_Pairs20_Tokens50 Associative_LSTM256_BS128_Pairs40_Tokens100 Associative_VRNN128_BS128_Pairs10_Tokens50 Associative_VRNN256_BS128_Pairs20_Tokens50

These tasks use RNN’s to perform an associative memory task. Given a vocab of tokens, and some number of pairs to store and a query the RNN’s goal is to produce the desired value. For example given the input sequence A1B2C3?B_ the RNN should produce ________B.

This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax cross entropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only when the RNN is supposed to predict. For RNN cells we use LSTM (Hochreiter & Schmidhuber, 1997), GRU (Chung et al., 2014), and VRNN – a vanilla RNN. The previous tasks are defined with the corresponding RNN cell, number of units, batch size, sequence lengths, and number of possible tokens for the retrieval task.

Copy_GRU128_BS128_Length20_Tokens10 Copy_GRU256_BS128_Length40_Tokens50 Copy_LSTM128_BS128_Length20_Tokens10 Copy_LSTM128_BS128_Length20_Tokens20 Copy_LSTM128_BS128_Length50_Tokens5 Copy_LSTM128_BS128_Length5_Tokens10 Copy_LSTM256_BS128_Length40_Tokens50 Copy_VRNN128_BS128_Length20_Tokens10 Copy_VRNN256_BS128_Length40_Tokens50

These tasks use RNN’s to perform a copy task. Given a vocab of tokens and some number of tokens the RNN’s job is to read the tokens and to produce the corresponding outputs. For example an input might be: ABBC|____ and the RNN should output ____|ABBC. See the source for a complete description of the task. Each task in this set varies the RNN core, as well as the dataset structure.

This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax crossentropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only when the RNN is supposed to predict. For RNN cells we use LSTM (Hochreiter & Schmidhuber, 1997), GRU (Chung et al., 2014), and VRNN – a vanilla RNN. The previous tasks are defined with the corresponding RNN cell, number of units, batch size, sequence lengths, and number of possible tokens.

FixedImageConvAE_cifar10_32x32x32x32x32_bs128 FixedImageConvAE_cifar10_32x64x8x64x32_bs128 FixedImageConvAE_mnist_32x32x32x32x32_bs128 FixedImageConvAE_mnist_32x64x32x64x32_bs512

FixedImageConvAE_mnist_32x64x8x64x32_bs128

Convolutional autoencoders trained on different datasets and with different architectures (sizes of hidden units).

Convolutional variational autoencoders trained on different datasets, batch sizes, and with different architectures.

Convolutional neural networks doing supervised classification. These models vary in dataset, architecture, and initializations.

FixedLM_lm1b_patch128_GRU128_embed64_avg_bs128 FixedLM_lm1b_patch128_GRU256_embed64_avg_bs128 FixedLM_lm1b_patch128_GRU64_embed64_avg_bs128 FixedLM_lm1b_patch128_LSTM128_embed64_avg_bs128 FixedLM_lm1b_patch128_LSTM256_embed64_avg_bs128

Language modeling tasks on different RNN cell types and sizes.

Masked auto regressive flows models with different architectures (number of layers and sizes).

Autoencoder models based on multi layer perceptron with different number of hidden layers and dataset.

FixedMLPVAE_cifar101_128x128x32x128x128_bs128 FixedMLPVAE_cifar101_128x32x128_bs128 FixedMLPVAE_food10132x32_128x64x32x64x128_bs64 FixedMLPVAE_mnist_128x128x8x128_bs128 FixedMLPVAE_mnist_128x64x32x64x128_bs64 FixedMLPVAE_mnist_128x8x128x128_bs128 Imagenet32x30_FC_VAE_128x64x32x64x128_relu_bs256

Variational autoencoder models built from multi layer perceptron with different datasets, batchsizes, and architectures.

Image classification based on multi layer perceptron. We vary architecture, data, batchsize, normalization techniques, dropout, and loss type across problems.

Non volume preserving flow models with different batchsizesm and architectures.

RNN text classification problems with different RNN cell, sizes, embedding sizes, and batchsize.

2D quadratic bowls with different condition numbers.

Toy 2D test functions.

designed for accessibility and to further open science