NAS-Bench-1Shot1: Benchmarking and Dissecting One-shot Neural Architecture Search

2020·Arxiv

ABSTRACT

ABSTRACT

One-shot neural architecture search (NAS) has played a crucial role in making NAS methods computationally feasible in practice. Nevertheless, there is still a lack of understanding on how these weight-sharing algorithms exactly work due to the many factors controlling the dynamics of the process. In order to allow a scientific study of these components, we introduce a general framework for one-shot NAS that can be instantiated to many recently-introduced variants and introduce a general benchmarking framework that draws on the recent large-scale tabular benchmark NAS-Bench-101 for cheap anytime evaluations of one-shot NAS methods. To showcase the framework, we compare several state-of-the-art one-shot NAS methods, examine how sensitive they are to their hyperparameters and how they can be improved by tuning their hyperparameters, and compare their performance to that of blackbox optimizers for NAS-Bench-101.

1 INTRODUCTION

While neural architecture search (NAS) has attracted a lot of attention due to the effectiveness in automatically designing state-of-the-art neural networks (Zoph & Le, 2017; Zoph et al., 2018; Real et al., 2017; 2019), the focus has recently shifted to making the search process more efficient (Pham et al., 2018; Elsken et al., 2019; Liu et al., 2019; Xie et al., 2019; Cai et al., 2019; Casale et al., 2019). The most crucial concept which led to a reduction in search costs to the order of a single function evaluation is certainly the weight-sharing paradigm: Training only a single large architecture (the one-shot model) subsuming all the possible architectures in the search space (Brock et al., 2018; Pham et al., 2018).

Despite the great advancements of these methods, the exact results of many NAS papers are often hard to reproduce (Li & Talwalkar, 2019; Yu et al., 2020; Yang et al., 2020). This is a result of several factors, such as unavailable original implementations, differences in the employed search spaces, training or evaluation pipelines, hyperparameter settings, and even pseudorandom number seeds (Lindauer & Hutter, 2019). One solution to guard against these problems would be a common library of NAS methods that provides primitives to construct different algorithm variants, similar to what as RLlib (Liang et al., 2017) offers for the field of reinforcement learning. Our paper makes a first step into this direction.

Furthermore, experiments in NAS can be computationally extremely costly, making it virtually impossible to perform proper scientific evaluations with many repeated runs to draw statistically robust conclusions. To address this issue, Ying et al. (2019) introduced NAS-Bench-101, a large tabular benchmark with 423k unique cell architectures, trained and fully evaluated using a one-time extreme amount of compute power (several months on thousands of TPUs), which now allows to cheaply simulate an arbitrary number of runs of NAS methods, even on a laptop. NAS-Bench-101 enabled a comprehensive benchmarking of many discrete NAS optimizers (Zoph & Le, 2017; Real et al., 2019), using the exact same settings. However, the discrete nature of this benchmark does not allow to directly benchmark one-shot NAS optimizers (Pham et al., 2018; Liu et al., 2019; Xie et al., 2019; Cai et al., 2019). In this paper, we introduce the first method for making this possible.

Specifically, after providing some background (Section 2), we make the following contributions:

1. We introduce NAS-Bench-1Shot1, a novel benchmarking framework that allows us to reuse the extreme amount of compute time that went into generating NAS-Bench-101 (Ying et al., 2019) to cheaply benchmark one-shot NAS methods. Our mapping between search space representations is novel to the best of our knowledge and it allows querying the performance of found architectures from one-shot NAS methods, contrary to what is claimed by Ying et al. (2019). Specifically, it allows us to follow the full trajectory of architectures found by arbitrary one-shot NAS methods at each search epoch without the need for retraining them individually, allowing for a careful and statistically sound analysis (Section 3).

2. We introduce a general framework for one-shot NAS methods that can be instantiated to many recent one-shot NAS variants, enabling fair head-to-head evaluations based on a single code base (Section 4).

3. We use the above to compare several state-of-the-art one-shot NAS methods, assess the correlation between their one-shot model performance and final test performance, examine how sensitive they are to their hyperparameters, and compare their performance to that of black-box optimizers used in NAS-Bench-101 (Section 5).

We provide our open-source implementation1, which we expect will also facilitate the reproducibility and benchmarking of other one-shot NAS methods in the future.

2 BACKGROUND AND RELATED WORK

2.1 NAS-BENCH-101

NAS-Bench-101 (Ying et al., 2019) is a database of an exhaustive evaluation of all architectures in a constrained cell-structured space on CIFAR-10 (Krizhevsky, 2009). Each cell is represented as a directed acyclic graph (DAG) where the nodes represent operation choices and the edges represent the information flow through the neural network (see also Figure 1 and Section 3.1). To limit the number of architectures in the search space, the authors used the following constraints on the cell: 3 operations in the operation set O = {3x3 convolution, 1x1 convolution, 3x3 max-pool}, at most 7 nodes (this includes input and output node, therefore 5 choice nodes) and at most 9 edges.

These constraints, and exploiting symmetries, reduced the search space to 423k unique valid architectures. Each architecture was trained from scratch three times to also obtain a measure of variance. In addition, each architecture was trained for 4, 12, 36 and 108 epochs; for our analysis, we mainly used the results for models trained for 108 epochs, if not stated otherwise.

2.2 NAS-BENCH-102

Concurrently to this work, Dong & Yang (2020) released NAS-Bench-102, which is another NAS benchmark that, differently from NAS-Bench-101, enables the evaluation of weight-sharing NAS methods. Their search space consists of a total of 15,625 architectures, which is exhaustively evaluated on 3 image classification datasets. Similarly to Zela et al. (2020) and this work, Dong & Yang (2020) found that architectural overfitting occurs for DARTS for all their datasets.

While NAS-Bench-102 and this work go towards the same direction, they differ in many ways:

1. They use extensive computation to create a new benchmark (with 15.625 architectures), while we devise a novel reformulation to reuse the even much more extensive computation of the NAS-Bench-101 dataset ( 120 TPU years) to create three new one-shot search spaces with the larges one containing 363.648 architectures. This required zero additional computational cost.

2. We show that it is possible to reuse the graph representation in NAS-Bench-101 to run one-shot NAS methods; this requires changes to the one-shot search space, but allows a mapping which can be used for architecture evaluation.

Figure 1: Overview of the NAS-Bench-1Shot1 analysis strategy. The one-shot model we construct only contains discrete architectures that are elements of NAS-Bench-101 (Ying et al., 2019). The cell architecture chosen is similar to that of Bender et al. (2018), with each choice block containing an operation decision. Note that NAS-Bench-101 does not contain a separate reduction cell type. Plot on the right from Ying et al. (2019) (Best viewed in color).

3. They evaluate their search space on 3 image classification datasets, while we introduce 3 different search spaces (as sub-spaces of NAS-Bench-101) with growing complexity.

2.3 ONE-SHOT NEURAL ARCHITECTURE SEARCH

The NAS problem can be defined as searching for the optimal operation (e.g. in terms of validation error of architectures) out of the operation set O in each node of the DAG and for the best connectivity pattern between these nodes.

Designing architectures for optimized accuracy or to comply with resource constraints led to sig-nificant breakthroughs on many standard benchmarks (Pham et al., 2018; Zoph & Le, 2017; Brock et al., 2018; Liu et al., 2019; Cai et al., 2019; Elsken et al., 2019). While early methods were computationally extremely expensive (Zoph & Le, 2017), the weight-sharing paradigm (Brock et al., 2018; Pham et al., 2018) led to a significant increase in search efficiency. Here, the weights of the operations in each architecture are shared in a supermodel (the so-called one-shot model or convolutional neural fabric (Saxena & Verbeek, 2016)), which contains an exponential number of sub-networks, each of which represents a discrete architecture. Architectures whose sub-networks share components (nodes/edges) also share the weights for these components’ operations; therefore, in analogy to DropOut (Srivastava et al., 2014), training one architecture implicitly also trains (parts of) an exponential number of related architectures. There are a variety of methods on how to conduct NAS by means of the one-shot model (Brock et al., 2018; Pham et al., 2018; Bender et al., 2018; Liu et al., 2019; Li & Talwalkar, 2019) (see also Appendix B), but the final problem is to find the optimal sub-network in this one-shot model.

The weight sharing method was used to great effect in DARTS (Liu et al., 2019), where it allows a gradient based optimization of both the architectural and the one-shot weights. Subsequent work on DARTS has addressed further lowering the computational and the memory requirements (Dong & Yang, 2019; Xu et al., 2020; Cai et al., 2019; Casale et al., 2019).

One fundamental drawback of the weight sharing method is the fact that the architecture search typically takes place in a lower fidelity model (e.g., using less cells and/or cheaper operations): the so-called proxy model. After the search, a discrete architecture is derived from the proxy model which is then trained with more parameters — a stage often referred to as architecture evaluation. This poses the question whether the architecture found in the proxy model is also a good architecture in the bigger model, a question studied by several recent works (Bender et al., 2018; Yu et al., 2020).

3 A GENERAL FRAMEWORK FOR BENCHMARKING ONE-SHOT NAS

We will now introduce our framework for cheaply benchmarking the anytime performance of one-shot NAS methods. Our main analysis strategy is the following: First, we run the search procedure of various methods and save the architecture weights of the one-shot models for each epoch. Second, we find the discrete architecture at each epoch and query it in NAS-Bench-101. The last step is not trivial due to the different representations of the search space used in NAS-Bench-101 and standard one-shot methods. Ying et al. (2019) state that one-shot methods cannot be directly evaluated on NAS-Bench-101. In the following sections we present a mapping between these different search space representations, which eventually enable us to evaluate one-shot methods on NAS-Bench-101. To the best of our knowledge this is a novel contribution of this paper.

In order to carry out the analysis we propose in this work, we had to construct a search space that only contains discrete architectures that are also contained in NAS-Bench-101. This allows us to look up any discrete architectures’ performance in NAS-Bench-101 when the larger model is trained from scratch. Unfortunately, this is non-trivial since the NAS-Bench-101 space does not match the typical space used in one-shot NAS methods. We separately consider the various parts of the search space.

Network-Level Topology. In terms of network-level topology, our search spaces closely resemble the models which were evaluated in NAS-Bench-101. We used the same macro architecture as in NAS-Bench-101, i.e., 3 stacked blocks with a max-pooling operation in-between, where each block consists of 3 stacked cells (see Figure 1). While our final evaluation models exactly follow NAS-Bench-101 in order to be able to reuse its evaluations, our one-shot model only has 16 initial convolution filters, rather than the 128 used in NAS-Bench-101. This is a common practice to accelerate NAS and used similarly in, e.g., Liu et al. (2019).

Cell-Level Topology. The cell-level structure is represented as a DAG, where the input node is the output of a previous cell or the convolutional stem (projected using a convolution in order to scale the channel counts), and the output node consists of firstly concatenating the outputs of all intermediate nodes and adding to that the projected output of the input node. In order to have the operation choices still in the intermediate nodes of the DAG, we adapt the choice block motif from Bender et al. (2018) as depicted in Figure 1. The edges connecting input, output nodes and choice blocks represent only the information flow in the graph. To have a large and expressive enough search space(s), we introduce the following architectural weights in the DAG edges:

• to edges connecting nodes i < j to choice block j. The input of choice block j is then computed as , where is the output tensor of node i (either input

• to the edges connecting the input node or choice blocks j < k to the output node k of the cell, where the corresponding output feature maps of the choice blocks are concatenated and then added to the projected input of the cell: ,

Note that the non-linearity applied to the edge weights varies depending on the NAS optimizer used; e.g. for GDAS (Dong & Yang, 2019) and SNAS (Xie et al., 2019) it would be a GumbelSoftmax (Eric Jang & Poole, 2017) instead.

Choice Blocks. As in Bender et al. (2018), each choice block inside the cell can select between the operations in the operations set O of NAS-Bench-101. In order to find the optimal operation in each choice block via gradient-based one-shot NAS methods, we assign an architectural weight to each operation inside the choice block. The output of the choice block j is computed by adding element-wise the latent representations coming from the operations outputs:

which is basically the so-called MixedOp in DARTS. NASBench cells contain 1x1 projections in front every operation (demonstrated in Figure 1 in (Ying et al., 2019)). The number of output channels of each projection is chosen such that the output has the same number of channels as the input. This adaptive choice for the number of channels is incompatible with the one-shot model due to the different tensor dimensionality coming from previous choice blocks. We used 1x1 projections with a fixed number of channels instead.

By means of these additional weights we do not restrict the possible architectures in the search space to contain only a fixed number of edges per cell, as done for example in Zoph et al. (2018), Pham et al. (2018), Liu et al. (2019), etc. This requirement would have restricted our architectural decisions heavily, leading to only small search spaces.

Table 1: Characteristic information of the

Table 1 shows the characteristics of each search space. We propose three different search spaces by making different decisions on the number of parents each choice block has. The decisions affect the quality and quantity of the architectures contained in each search space. For all search spaces note that the sum of the number of parents of all nodes in the search space is chosen to be 9, to match the NAS-Bench-101 requirement. Search space 1, 2 and 3 have 6240, 29160 and 363648 architectures with loose ends respectively, making search space 3 the largest investigated search space. To the best of our knowledge search space 3 is currently the largest

and only available tabular benchmark for one-shot NAS. For details on each search space see Appendix A.

Given the architectural weights of the cell shown in Figure 1 we query the test and validation error of the discrete architecture from NAS-Bench-101 as follows.

1. We determine the operation chosen in each choice block by choosing the operation with the highest architectural weight.

2. We determine the parents of each choice block and the output by choosing the top-k edges according to Table 1 (e.g. for choice block 4 in search space 3 we would choose the top-2 edges as parents).

3. From 1. we construct the operation list and from 2. the adjacency matrix of the cell which we use to query NAS-Bench-101 for the test and validation error.

Each node in the graph chooses its parents during evaluation following e.g. DARTS (Liu et al., 2019). However, because edges model information flow and the output edges are also architectural decisions there is possibility of a node being a loose end. These are nodes whose output does not contribute to the output of the discrete cell, as seen in the upper cell under evaluation of Figure 1. As a result, we can count the number of architectures with or without loose ends. Note, that had we chosen the children of each node we could have invalid architectures where a node has an output but no input.

4 A GENERAL FRAMEWORK FOR ONE-SHOT NAS METHODS

Most of the follow-up works of DARTS (Algorithm 1), which focus on making the search even more efficient and effective, started from the original DARTS codebase2, and each of them only change very few components compared to DARTS.

Algorithm 2 and Algorithm 3 highlight these components (relative to DARTS) for PC-DARTS (Xu et al., 2020) and GDAS (Dong & Yang, 2019), respectively. For example, when comparing PCDARTS and DARTS, the only difference in our benchmark is the partial channel connections (line 3 of Algorithm 2) in the choice blocks, which consists of a channel sampling mask that drops feature maps coming from . GDAS, on the other hand, replaces the Softmax (S) function in DARTS by a Gumbel-Softmax (GS), which applies for every architectural weight in (lines 1-3 in Algorithm 3), and uses this concrete distribution to sample single paths through the cell during search (line 7 in Algorithm 3). Random Search with Weight Sharing (Random WS) (Li & Talwalkar, 2019) (Algorithm 4) and ENAS (Pham et al., 2018) (Algorithm 5) do not need the continuous relaxation in order to conduct the architecture search, instead they sample randomly in RandomWS or from the recurrent neural network controller (line 6 in Algorithm 1) in ENAS, in order to select the sub-network in the one-shot model to train.

These close correspondences between current one-shot NAS variants provide an opportunity to implement all of these variants in the same general code basis. This allows us to (a) automatically guard against any confounding factors when evaluating the strengths and weaknesses of different approaches, and (b) allows us to mix and match the components of different algorithms. We implemented all variants in a single code basis, which we are committed to grow into a flexible library of

Figure 2: Comparison of different one-shot NAS optimizers on the three different search spaces defined on NASBench. The solid lines show the anytime test regret (mean std), while the dashed blurred lines the one-shot validation error (Best viewed in color).

primitives for one-shot NAS methods, and for which we will gladly accept any help the community wants to provide.

One-shot NAS methods in this code basis inherit all the methods and attributes necessary for building the one-shot computational graph from a base parent class. This encapsulation and modularity ensures that all differences in their performance come from a few lines of code, and that all other confounding factors cannot affect these results. This will also facilitate the incorporation of other one-shot NAS methods and pinpoint the components that differ in them.

Furthermore, the primitives encoding the search spaces presented in Section 3 are defined separately from the NAS optimizers. This encapsulation will allow researchers to study each of these components in isolation, experimenting with one of them while being sure that the other one does not change.

5 NAS-BENCH-1SHOT1 AS A BENCHMARK AND ANALYSIS FRAMEWORK

We now demonstrate the use of NAS-Bench-1Shot1 as a benchmark for one-shot NAS. We first evaluate the anytime performance of five different one-shot NAS methods: DARTS (Liu et al., 2019), GDAS (Dong & Yang, 2019), PC-DARTS (Xu et al., 2020), ENAS (Pham et al., 2018) and Random Search with Weight Sharing (Random WS) (Li & Talwalkar, 2019).3 Afterwards, we investigate the robustness of these one-shot NAS optimizers towards their search hyperparameters and show that if these hyperparameters are carefully tuned, the one-shot NAS optimizer can outperform a wide range of other discrete NAS optimizers in our search spaces.

We ran the NAS search for 50 epochs4 using their respective default hyperparameter settings (see Appendix C). If not stated otherwise, all the following results were generated by running each experiment with six random seeds (0 to 5). All plots show the mean and standard deviation of the test regret queried from NAS-Bench-101. Over the three independent trainings contained in NASBench-101 for each architecture on each epoch budget we average. The search was done on a single NVIDIA RTX2080Ti using the same python environment. In Figure 2 we report the anytime test regret for each of these methods. Our findings can be summarized as follows:

• The test error (queried from NAS-Bench-101) for DARTS, GDAS and PC-DARTS decreases along with the validation error of the one-shot model. However, these two quantities are not necessarily correlated. One such example is the learning curve of PC-DARTS. While the validation error smoothly decreases, the test error of the found solutions slightly goes up in the beginning and then decreases again. This behavior becomes even more substantial when increasing the number of search epochs (see Appendix D). Figure 8 shows that when running DARTS for 200 epochs, the test regret diverges. The same result was also previously observed by Zela et al. (2020) on subspaces of the standard DARTS space.

• The ranking of the NAS optimizers differs across search spaces. For instance, while PC-DARTS performs the best in search space 1, this is clearly not the case in the other search spaces.

• GDAS has a better anytime performance than the other optimizers in all 3 benchmarks, however, due to the temperature annealing of the Gumbel Softmax, it manifests some premature convergence (in less than 5 search epochs) to a sub-optimal local minimum.

• Random WS and ENAS mainly explore poor architectures across all three search spaces. This behaviour is a result of the small correlation between the architectures evaluated with the one-shot weights and their true performance during architecture evaluation (as queried from NAS-Bench-101 (see Section 5.2)). This correlation directly affects these methods since in the end of search they sample a certain number of architectures (1000 randomly sampled for Random WS and 100 using the learned controller policy for ENAS) and evaluate them using the one-shot model weights in order to select the architecture which is going to be trained from scratch in the final evaluation

Figure 3: Correlation between the one-shot validation error and the corresponding NAS-Bench-101 test error for each search space. (Best viewed in color).

phase. When running ENAS for 100 epochs in search space 2 (Figure 7 in the appendix) we see that the it performs better than Random WS. ENAS also has a stronger correlation between the sampled architectures and the NAS-Bench-101 architectures for search space 2 (see Section 5.2).

Many one-shot NAS methods, such as ENAS (Pham et al., 2018), NAONet (Luo et al., 2018) or Random WS (Li & Talwalkar, 2019) select the final architecture by means of the one-shot parameters. In order to assess if this is optimal we computed the correlation between a given architecture’s one-shot test error and its respective NAS-Bench-101 test error, for all 4 available budgets in NAS-Bench-101 on every 10th search epoch. This analysis was performed for all architectures without loose ends in each search space. The only exception is ENAS, for which we decided to evaluate the correlation by sampling 100 architectures (as done by the algorithm after the search has finished to select the one to retrain from scratch) from the controller instead of evaluating every architecture in the search space.

As shown in Figure 3, there is almost no correlation between the weight sharing ranking and the true one (Spearman correlation coeff. between -0.25 and 0.3) during search for DARTS, PC-DARTS, GDAS and Random WS. Only ENAS shows some correlation for search space 2 and some anticorrelation for search spaces 1 and 3. These results agree with the ones reported by Yu et al. (2020) (who could only do this evaluationx on a small search space) and explain the poor performance of Random WS and ENAS on our benchmarks, since the architectures sampled during evaluation and ranked according to their one-shot validation error are unlikely to perform well when evaluated independently. To the best of our knowledge this is the first time that an evaluation of this correlation is conducted utilizing such a large number of architectures in the search space, namely 137406 different architectures for search space 3. We added further experiments on the correlation between the lower fidelity proxy model used in architecture search and the final model from architecture evaluation in the Appendix H.

Zela et al. (2020) observed that the generalization performance of architectures found by DARTS heavily depends on the hyperparameters used during the search phase. More specifically, they investigate regularization hyperparameters that shape the inner objective landscape, such as regularization or CutOut (DeVries & Taylor, 2017) and ScheduledDropPath (Zoph et al., 2018).

Due to the fast queries from NAS-Bench-101, NAS-Bench-1Shot1 enables researchers to assert if their one-shot NAS optimizers with some specified hyperparameter settings overfit on the validation set during search (Zela et al., 2020). In Figure 4 we plot the anytime performance of DARTS, GDAS and PC-DARTS on Search Space 3 when ran with different factors in the inner objective (see Appendix E for more results on other search spaces and regularizers such as CutOut). Based on these results we observe that:

Figure 4: The impact that weight decay has on the test (one-shot validation) performance of architectures found by DARTS, GDAS and PC-DARTS on search space 3 (Best viewed in color).

• Interestingly, while Zela et al. (2020) only showed overfitting behavior for DARTS, it may also occur for GDAS and PC-DARTS as shown Figure 4b and 4c. This indicates that this might be an intrinsic property of these methods due to the local updates in the architecture space.

• A good hyperparameter setting for one NAS optimizer is not necessarily good for others. In Figure 4 we can see that for GDAS and PC-DARTS have the best anytime performance, whilst in DARTS we notice an overfitting behavior for the same value.

• Interestingly enough, it seems that these NAS optimizers show the same pattern across search spaces for the same values as shown in Figure 12 and Figure 13 in Appendix E.

Next to the hyperparameters used in the evaluation pipeline, one-shot NAS methods have several hyperparameters of their own, such as the regularization hyperparameters studied in Section 5.3, learning rates, and other hyperparameters of the search phase.

Figure 5: Optimizing the hyperparameters of one-shot optimizers with BOHB on NAS-Bench-1Shot1. Every curve shows the meanstd of the incumbent trajectories (500 and 3 search repetitions for the black-box and one-shot optimizers, respectively).

Naively tuning these hyperparameters with the one-shot validation error as the objective would lead to sub-optimal configurations, since, as we saw, this metric is not a good indicator of generalization. In our proposed benchmarks, we can tune these hyperparameters of the NAS optimizer to minimize the validation error queried from NAS-Bench-101. By doing so, we aim to shed more light onto the influence these hyperparameters have during the one-shot search, and to study the sensitivity of the NAS method towards these hyperparameters.

To this end, we constructed 3 configuration spaces of increasing cardinality, CS1, CS2 and CS3 (see Appendix F for details), which only include hyperparameters controlling the NAS process. We chose BOHB (Falkner et al. (2018), see Appendix F for details) as the hyperparameter optimization method and DARTS (1st order), GDAS, PC-DARTS as our NAS methods to be tuned across all configuration spaces.

Starting from the respective default hyperparameter settings of each one-shot optimizer, we run BOHB with a total budget of 280 function evaluations, where in this case a function evaluation is a query from NAS-Bench-101. Note that we never use the validation set split used to evaluate the individual architectures in NAS-Bench-101 during the architecture search. This subset with 10k examples is only used to compute the objective function value that BOHB optimizes. Therefore, the one-shot optimizers use 20k examples for training and 20k for the architectural parameter updates. Figure 5 shows the anytime test regret of the architectures found by the respective NAS method’s configurations tried by BOHB on CS2 (see Figure 14 for results on CS1 and Figure 15 for results on CS3). The x-axis in these figures shows the simulated wall-clock time: , where is the time spent during search by each configuration and is the training time for 108 epochs (queried from NAS-Bench-101) of the architectures selected by the one-shot optimizer. As the figure shows, DARTS, PC-DARTS and GDAS can be tuned to find solutions closer to the global minimum compared to the default hyperparameter settings. We note that carrying out this optimization on NAS-Bench-1Shot1 reduced the time for tuning DARTS from a simulated 45 GPU days to 1 day on 16 GPUs.

Figure 5 also provides an evaluation of DARTS, GDAS and PC-DARTS compared to the state-of-art discrete NAS optimizers used by Ying et al. (2019) (such as RL, RE, and HPO methods). Since these one-shot NAS methods are much faster than black-box methods, it is possible to tune them online using BOHB and the resulting BOHB-{DARTS, GDAS, PC-DARTS} typically still yields better performance over time. From this experiment, we make the following observations:

• Across all search spaces the architectures found using the best BOHB configurations outperformed the architectures found by the default DARTS configuration by up to a factor of 10.

• As in NAS-Bench-101, Regularized Evolution (Real et al., 2019) performed the best in all search spaces of NAS-Bench-1Shot1 compared to the other black-box optimizers.

• The found robust configurations did not only avoid overfitting in the architectural level, but they also typically outperformed the architectures found by state-of-art discrete NAS optimizers used by Ying et al. (2019) (such as RL, RE, and HPO methods).

• The multi-fidelity in BOHB does not only accelerate the hyperparameter optimization procedure, but in this case also allows to determine the sufficient number of epochs to run the NAS optimizer in order to get an optimal architecture. In fact, the best configuration on each of the incumbents comes usually from the lowest budget, i.e. 25 search epochs.

6 CONCLUSION AND FUTURE DIRECTIONS

We proposed NAS-Bench-1Shot1, a set of 3 new benchmarks for one-shot neural architecture search which allows to track the trajectory and performance of the found architectures computationally cheaply. Using our analysis framework, we compared state-of-the-art one-shot NAS methods and inspected the robustness of the methods and how they are affected by different hyperparameters. Our framework allows a fair comparison of any one-shot NAS optimizer and discrete NAS optimizers without any confounding factors. We hope that our proposed framework and benchmarks will facilitate the evaluation of existing and new one-shot NAS methods, improve reproducibility of work in the field, and lead to new insights on the underlying mechanisms of one-shot NAS.

ACKNOWLEDGMENTS

The authors acknowledge funding by the Robert Bosch GmbH, support by the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation program through grant no. 716721, and by BMBF grant DeToL. We would like to thank Yuge Zhang for his work on reproducing our results, and making us aware of his results on counting the number architectures in our search space.

REFERENCES

Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understand- ing and simplifying one-shot architecture search. In International Conference on Machine Learning, 2018.

Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rydeCEhs-.

Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.

Francesco Paolo Casale, Jonathan Gordon, and Nicolo Fusi. Probabilistic neural architecture search. arXiv preprint arXiv:1902.05116, 2019.

Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Xuanyi Dong and Yi Yang. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1761– 1770, 2019.

Xuanyi Dong and Yi Yang. Nas-bench-102: Extending the scope of reproducible neural architec- ture search. In International Conference on Learning Representations, 2020. URL https: //openreview.net/forum?id=HJxyZkBKDr.

Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Efficient multi-objective neural architecture search via lamarckian evolution. In International Conference on Learning Representations, 2019.

Shixiang Gu Eric Jang and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2017.

Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter op- timization at scale. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1437–1446, Stockholmsm¨assan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/falkner18a.html.

F. Hutter, H. Hoos, and K. Leyton-Brown. An efficient approach for assessing hyperparameter importance. In E. Xing and T. Jebara (eds.), Proceedings of the 31th International Conference on Machine Learning, (ICML’14), pp. 754–762. Omnipress, 2014.

K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimiza- tion. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based con- figuration evaluation for hyperparameter optimization. In International Conference on Learning Representations, 2017.

Liam Li and Ameet Talwalkar. Random search and reproducibility for neural architecture search. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, pp. 129, 2019. URL http://auai.org/uai2019/ proceedings/papers/129.pdf.

Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Gold- berg, and Ion Stoica. Ray rllib: A composable and scalable reinforcement learning library. CoRR, abs/1712.09381, 2017. URL http://arxiv.org/abs/1712.09381.

Marius Lindauer and Frank Hutter. Best practices for scientific research on neural architecture search. arXiv preprint arXiv:1909.02453, 2019.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

Renqian Luo, Fei Tian, Tao Qin, and T. M. Liu. Neural architecture optimization. In NeurIPS, 2018.

Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, 2018.

Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2902–2911, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr. press/v70/real17a.html.

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Aging Evolution for Image Classi- fier Architecture Search. In AAAI, 2019.

Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 4053–4061. Curran Associates, Inc., 2016.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–1958, 2014.

B. Williams, T. Santner, and W. Notz. Sequential design of computer experiments to minimize integrated response functions. Statistica Sinica, 2000.

Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In International Conference on Learning Representations, 2019.

Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Guo-Jun Qi, Qi Tian, and Hongkai Xiong. Pc-darts: Partial channel connections for memory-efficient architecture search. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=BJlS634tPr.

Antoine Yang, Pedro M. Esperana, and Fabio M. Carlucci. Nas evaluation is frustratingly hard. In International Conference on Learning Representations, 2020. URL https://openreview. net/forum?id=HygrdpVKvr.

Chris Ying, Aaron Klein, Eric Christiansen, Esteban Real, Kevin Murphy, and Frank Hutter. NAS- bench-101: Towards reproducible neural architecture search. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7105–7114, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/ ying19a.html.

Kaicheng Yu, Christian Sciuto, Martin Jaggi, Claudiu Musat, and Mathieu Salzmann. Evaluating the search phase of neural architecture search. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1loF2NFwr.

Arber Zela, Thomas Elsken, Tonmoy Saikia, Yassine Marrakchi, Thomas Brox, and Frank Hut- ter. Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id= H1gDNyrKDS.

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR) 2017 Conference Track, 2017.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Conference on Computer Vision and Pattern Recognition, 2018.

A DETAILS ON SEARCH SPACES

Search space 1 The main characteristic of this search space is that the number of parents for each

choice block and output has to be exactly 2 (apart from choice block 1 which is only connected to

the input). Because of this requirement one choice block had to be discarded as that would exceed

the requirement to have at most 9 edges. The total distribution of test error in shown in Figure 6a. It

is the smallest search space discussed in this work.

Search space 2 This search space is related to search space 1 in that it only consists of 4 in-

termediate nodes, but in contrast the output has three parents and nodes 1 and 2 only one parent.

This increases the number of architectures in this space compared to search space 1. The test error

distribution is shown in Figure 6b.

Search space 3 All available intermediate nodes (up to 5) are used in this search space, making

this search space the largest, but also the search space where each node has on average the least

number of parents. The test error distribution is shown in Figure 6c.

B OPTIMIZERS

DARTS (Liu et al., 2019) uses a weighted continuous relaxation over the operations to learn an

architecture by solving a bilevel optimization problem. The training dataset is split in two parts, one

used for updating the parameters of the operations in the one-shot model, and the other to update the

weights appended to operations, that determine the importance of that operation.

For evaluation, we choose the parents of each choice block based on the highest architectural weights

and the number of parents for that choice block given by Table 1. We pick the highest weighted

operation from the choice block.

GDAS (Dong & Yang, 2019) modifies DARTS, such that individual paths are sampled differen-

tiably through each cell using Gumbel-Softmax (Eric Jang & Poole, 2017) to adapt the architecture

weights. This reduces the memory overhead created by DARTS as only the sampled paths have to

be evaluated. GDAS uses the same search space and evaluation procedure as DARTS.

PC-DARTS (Xu et al., 2020) reduces the memory overhead by only evaluating a random fraction

of the channels with the mixed-ops. The authors argue that this also regularizes the search as it

lowers the bias towards weight-free operations such as skip-connect and max-pooling, which are

often preferred early on in DARTS search. In addition to partial channel connections, the authors

propose edge normalization, which adds additional architectural parameters to the edges connecting

to an intermediate node. This is done to compensate for the added fluctuations due to the partial

channel connections. These additional weights are already part of the search spaces we proposed in

this paper.

Random Search with Weight Sharing (Random WS) (Li & Talwalkar, 2019) randomly samples

architectures from the one-shot model for each training mini-batch and trains only the selected

subset of the one-shot model on that mini-batch. Differently from DARTS, PC-DARTS and GDAS,

Figure 6: Distribution of test error in the search spaces with loose ends.

Random WS does not require a validation set, since there are no architectural weights that need

to be updated. For evaluation Random WS samples 1000 architectures from the search space and

evaluates each for only a small number of batches on the validation set using the optimized weights

of the one-shot model corresponding to the sub-networks. Then the 5 architectures with the lowest

one-shot validation error are chosen and fully evaluated on the validation set. The overall best

architecture is returned.

ENAS (Pham et al., 2018) similarly to Random WS samples sub-networks of in the one-shot model,

however by means of a recurrent neural network (RNN) controller rather than randomly. As the

search progresses the parameters of the RNN controller are updated via REINFORCE (Williams

et al., 2000) using the validation error of the sampled architectures as a reward. This way the

sampling procedure is handled in a more effective way.

C HYPERPARAMETERS

If not stated otherwise the following hyperparameters were used for all our experiments. We used

a batch size of 96 throughout for DARTS, GDAS and PC-DARTS as the search spaces are small

enough to allow it and as this reduces the randomness in the training, which makes the comparison

between optimizers easier. Random WS was trained with a batch size of 64. All other hyperparam-

eters were adapted from DARTS (Liu et al., 2019).

D COMPARISON OF OPTIMIZERS OVER DIFFERENT BUDGETS

102

Figure 7: Comparison of different One-Shot Neural Architecture optimizers on the three different search spaces defined on NAS-Bench-101 over 100 epochs.

2 × 102

3 × 102

4 × 102

Figure 8: Comparison of DARTS first and second order on the three different search spaces defined on NAS-Bench-101 for 200 epochs epochs.

E REGULARIZATION

E.1 CUTOUT

Enabling CutOut (DeVries & Taylor, 2017) during the architecture search has in general a negative

on the quality of solutions found by one-shot NAS optimizers as shown in Figure 9 for search space

3. The same observation also holds for search space 1 and 2 as shown in Figure 10 and Figure 11.

Figure 9: Illustration of the impact that Cutout has on the test regret on NAS-Bench-101 and the validation error of the one-shot model using DARTS, GDAS and PC-DARTS on search space 3 (Best viewed in color).

Figure 10: Comparison of the effect of using Cutout during architecture search on GDAS for search space 1 and 2.

Figure 11: Comparison of the effect of using cutout during architecture search on PC-DARTS for search space 1 and 2.

E.2 REGULARIZATION

In Figure 12 and 13 we can see that increasing the regularization has a positive effect on the

found architectures for GDAS and PC-DARTS in search space 1 and 2, whilst in DARTS the quality

of these architectures degenerates when increasing beyond the default in Liu et al. (2019).

2 × 102

3 × 102

4 × 102

Figure 12: The impact that weight decay has on the test (one-shot validation) performance of architectures found by DARTS, GDAS and PC-DARTS on search space 1 (Best viewed in color).

Figure 13: The impact that weight decay has on the test (one-shot validation) performance of architectures found by DARTS, GDAS and PC-DARTS on search space 2 (Best viewed in color).

F BOHB DETAILS

BOHB (Falkner et al., 2018) is a combination of Bayesian Optimization (BO) and Hyperband (HB)

(Li et al., 2017). HB uses SuccesssiveHalving (SH) (Jamieson & Talwalkar, 2016) to stop poorly

performing trainings early. SH starts trainings with an initial budget and advances the top fraction

() of them to the next stage with higher budget. HB uses this a subroutine to evaluate many

uniformly at random sampled configurations on small budgets. The budgets and scaling factors are

chosen such that all SH evaluations take approximately the same time. BOHB combines HB with

BO by using a probabilistic model to guide the search towards better configurations. As a result,

BOHB performs as well as HB during early optimization, but samples better configurations once

enough samples are available to build a model.

F.1 SETUP

We ran BOHB for 64 iterations of SuccessiveHalving (Jamieson & Talwalkar, 2016) on 16 paral-

lel workers, resulting in 280 full function evaluations. In our experiments we use the number of

epochs that the one-shot NAS optimizers run the search as the fidelity used by BOHB and optimize

the validation error after 108 epochs of training queried from NAS-Bench-101. Namely, we use

min budget = 25 epochs, max budget = 100 epochs and in BOHB. Note that this is only

the number of epochs used for the architecture search. Additionally, we never use the validation set

split used to evaluate the individual architectures in NAS-Bench-101 during the architecture search.

Therefore, each one-shot NAS optimizer will use 20k examples for training and 20k for search. The

x-axis in Figures 5, 14, 15 shows the simulated wall-clock time: , where

is the time spent during search by each NAS algorithm configuration and is the train-

ing time for 108 epochs (queried from NAS-Bench-101) of the architectures selected by the NAS

optimizers.

We build 3 configuration spaces with different cardinality and which include hyperparameters af-

fecting the architecture search process. The spaces are as follows:

1. CS, CO prob}

2. CS, CO prob, lr}

3. CS, CO prob, lr, moment, CO len, batch size, grad clip, arch lr, arch L

F.2 RESULTS

Interestingly, optimizing on CS2 (see Figure 5) led to not only a better configuration of the one-

shot NAS optimizers, but also in outperforming a state-of-the-art black-box NAS optimizer such

as Regularized Evolution (RE) (Real et al., 2019). Including the learning rate in the configuration

space was crucial to achieve such a performance. Figure 14 shows the results when running BOHB

with the same settings on CS1. Note that none of the sampled configurations outperforms RE. On

the other hand, increasing the cardinality of the configuration space requires many more samples to

build a good model. Figure 15 shows the results when optimizing with BOHB on CS3. Even though

the learning rate was inside this space, again none of the one-shot optimizers (except PC-DARTS on

search space 3) with the sampled configurations is better than the discrete NAS optimizers.

102

Figure 14: Analogous to Figure 5, with the only difference being that here we optimize on CS1.

102

Figure 15: Analogous to Figure 5, with the only difference being that here we optimize on CS3.

F.3 TRANSFERABILITY BETWEEN SPACES.

Table 2 shows the performance of the best found configuration on search space 3 for the 50 epochs

budget by BOHB when transferred to search spaces 1 and 2. The results show the mean and standard

deviation of the architectures found by 6 independent search runs with the respective optimizers and

hyperparameters. We can see that there is no clear pattern on what is transferable where.

G HYPERPARAMETER IMPORTANCE

To better understand the configuration space which we evaluated with BOHB we use functional

analysis of variance (fANOVA) (Hutter et al., 2014). The idea is to assess the importance of individ-

Table 2: Results of architectures found on search space 1 and 2 with the best found configuration for 50 epochs by BOHB on search space 3.

ual hyperparameters by marginalizing performances over all possible values other hyperparameters

could have taken. The marginalization estimates are determined by a random forest model which

was trained on all configurations belonging to specific budgets during the BOHB optimization pro-

cedure.

Figure 16 shows the interaction between the Cutout (CO) and factor when optimizing on CS2,

DARTS 1st order, for search space 1, 2 and 3. Notice that across search spaces there is some

correlation between the estimated loss functions from fANOVA, indicating that this might be an

interesting future direction for studying hyperparameter transfer across search spaces.

Figure 16: Parameter importance for two hyperparameters, Cutout (CO) and regularization (CS2) across different training epochs and search spaces (SS).

H CORRELATION BETWEEN THE ARCHITECTURE SEARCH MODEL AND THE ARCHITECTURE EVALUATION MODEL

As a further experiment we wanted to test how strongly the validation error of the models used in

architectures search and architecture evaluation are correlated. For this we sampled 150 architectures

from search space 3 and trained this architecture in the proxy model. During training we compute the

Spearman rank correlation between the validation error of the proxy model and the full architecture

evaluation as queried from NAS-Bench-101. The results are shown in Figure 17. Note that 9 cells

in the proxy was used for all of our previous experiments as it is also used by the NAS-Bench-101

models.

For 9 cells (Figure 17c) in the proxy model, we find that increasing the total number of channels leads

to stronger correlation between the proxy model and the full architecture. However, increasing it

beyond 16 channels leads to a decrease in correlation. For 3 cells (Figure 17a) the strongest anytime

correlation was interestingly found using only 2 initial channels, with more channels leading to

worse performance at the beginning and no better final performance. The results suggest that there

exists a set of good combinations between the number of cells and the initial number of channels to

get the maximum correlation between the search and evaluation model. As a future work we plan

to investigate this relationship, which could eventually lead to a more effective bandit-based NAS

method.

2 × 101

3 × 101

4 × 101

Figure 17: In this experiment we varied the total number of cells within [3, 6, 9] and the number of initial channels of the proxy model within [2, 4, 8, 16, 36].