Although deep neural networks have achieved tremendous success in many domains (e.g., computer vision [4,5,6], speech recognition [7,8], natural language processing [8,9], games [10,11]), it still remains a great challenge to design the optimal network structure for a certain task. Most existing works rely on extensive human efforts on designing and experimenting with different structures.
Optimizing the network structures involves two fundamental issues: how to define the search space of network structures; and how to design an efficient algorithm to search a good network structure in the search space. A great challenge
Fig. 1. A quick comparison with two recent similar works in terms of trade-off between the size of search space and computational costs. The first number in the parenthesis denotes the top 1 accuracy achieved on CIFAR-10 data and the second number denotes the number of GPU hours used.
in solving these issues lies in how to balance the trade-off between the size of search space and the computational cost of the search algorithm. Earlier works based on neuro-evolution for automatically discovering network structures usually impose strong restrictions on the search space of the network structures due to limited computational power and scarcity of data [12].
Recently, there emerged revived interests in optimizing network structures (especially deep convolutional neural networks) using genetic/evolutionary algorithms [13,14,15]. However, the dilemma of computational costs and search space trade-off has pushed these works into two ends. At one end, one has to restrict the search space by imposing strong constraints on the network structures. For example, in [14] a network is composed of a fixed number of stages and each stage is composed of a fixed number of nodes representing convolutional operations. In [13], Dufourq and Bassett restricted mutation operations to adding, deleting and replacing a randomly selected layer in a network with a predetermined maximum number of layers (e.g., 7 is used in their experiments). As a result, their evolved networks share a single path structure in contrast to a multiple-path structures as in residual networks [16] and Inception [17]. On the other end, Real et al. [15] investigated large-scale evolution of networks operating at unprecedented scale by using a large amount of computing resources (e.g., spending over 10 days on 250 GPUs), which verifies neuro-evolution can achieve competitive performance as hand-crafted models built on many years of human experience. However, such an brute-force approach is not affordable for general users who have limited computational resources.
In this paper, we focus on optimizing deep CNN structures for image classifi-cation due to the availability of existing results for comparison and its popularity in computer vision. Similar to [15], we also study evolution-based algorithms, which search CNNs in a large search space that is defined by a set of mutations. Nevertheless, the difference from previous works [13,14,15] is that our main focus is to tackle an important and challenging question for optimizing neural network structures, i.e., how to maximize the exploration in the search . Instead of imposing strong restrictions on the search space, we propose new effective strategies to reduce the computational costs. We use an aggressive method to select strong individuals for survival and reproduction. In particular, among a set of individuals (i.e., population) only a small number of fittest individuals that are sufficiently different from each other are selected for producing the next generation. This strategy avoids wasting time on training weaker individuals that may eventually be eliminated in a later stage. However, a potential issue caused by this strategy is that the diversity of population decreases, which is very important for genetic programming. To remedy this issue, we propose to (i) increase the number of possible mutations; (ii) make clones of the selected fittest individuals to undergo different mutations. Additional techniques are also investigated to speed up the search process and to shorten the training time of each individual during evolution. Before ending this section, we give a quick comparison of our work and two recent works highlighting the trade-off between the size of search space and computational costs in Figure 1.
The main contribution of this paper can be summarized as following: 1) Through extensive experiments, we show that we can automatically (without any human effort at all, for example, tuning, modification, adding other layers) design network structures to achieve competitive performance, which can be conducted by general users with limited computing resources; 2) From empirical experiments, we found some interesting insights in designing neural networks, for example, skip layers was found in early stage of searching process later on was replaced by other layers; 3) We propose a simple yet efficient selection strategy, which performed better compared to other strategies, has its interest and can easily be adopted by practitioners.
There exist abundant studies on using genetic or evolutionary algorithms for discovering neural networks structures before the re-emergence of deep learning [18,12,19] in 2012. These algorithms are also known as neuro-evolution. Most of these works are restricted to feedforward neural networks of a few layers. However, many techniques in these earlier works are also useful for optimizing large convolutional neural networks. In designing a neuro-evolution algorithm, several fundamental questions need to be answered, including (i) how to encode an individual; (ii) what are the allowed mutations; (iii) how to select individuals for reproduction; (iv) what is the fitness function. Existing works may differ from each other on how to address these questions, which are also related to a fundamental issue in neuro-evolution (and also in other meta-heuristic optimization algorithms): how to balance between the size of search space and the computational costs. In the following discussion, we will highlight how to address these fundamental questions and how to balance the trade-off.
In [14], the authors developed a genetic algorithm using a fixed-length binary string to encode the network structure. The search space consists of all networks with a fixed number of stages, where each stage is composed of a fixed number of nodes representing convolutional operations. Mutations are easily operated by randomly flipping each bit in the string representation, which correspond to adding, deleting and changing connections of nodes within each stage. By restricting the number of stages (e.g., 3) and the number of nodes in each stage, their computational cost is controlled under a manageable level. The selection of individuals for reproduction is done by a Russian roulette process, which selects individuals based on a non-uniform distribution whose probabilities are proportional to the fitness of the individuals, i.e., individuals with higher fitness score will be selected with higher probability.
The focus of [15] is to scale up neuro-evolution to take advantage of the tremendous computational resources Google LLC has. A total of 250 GPUs are used in their experiments. They used a graph to encode an individual, and de-fined seven mutations to change the structure of networks , and used a standard binary tournament selection method [20] for selecting individuals for reproduction.
The fitness function of the above works purely depends on the performance of individual models on a validation data, which are trained by back-propagation. The evolutionary techniques used in [13] is similar to that in [15] except that their mutations are restricted to adding, deleting and replacing one of six pre-defined layers, which include two-dimension convolution, one-dimension convolution, fully connected, dropout, one-, and two-dimension max pooling. As a result, their algorithm cannot discover multiple-path networks, which are prevalent in modern deep learning community. A common feature of these works is that they use a combined strategy that lets the structure evolve but optimizes the weights of each individual by back-propagation, which is also adopted in the present work.
Another fundamental issue in genetic/evolutionary algorithm is the diversity of the population. A traditional approach for encouraging diversity of population is fitness sharing, where the fitness of each individual is scaled based on its proximity to others. It means that originally good solutions in densely populated regions will be given a lower fitness value than comparably good solutions in sparsely populated regions. All the three recent works [14,15,13] did not use any type of fitness sharing to encourage diversity. In [15], the authors simply use a very large population size (i.e., 1000) to increase the diversity. A key difference between our work and these previous work lies in the selection process and the number of mutation operations. The proposed solution takes both the limitations of computational resources and diversity of the population into account. As a result, even though we do not impose any strong restriction on the search space, we can use less computational costs to achieve competitive if not better prediction performance than [14,13,15].
Fig. 2. Comparison between the proposed aggressive selection and mutation strategy (right) vs conventional tournament selection and mutation strategy (left). Each colored ball denotes an individual, the number within each ball denotes its fitness score, red dashed arrows denote a copy and green solid arrows denote mutations.
Other related works on automatically discovering network structures include reinforcement learning [21,22] based approaches and Bayesian optimization [23,24,25] based approaches. We refer the readers to [15] for more discussion and references.
The proposed algorithm follows the standard flow of neuro-evolution, i.e., population initialization, individual selection, reproduction, mutation/cross-over, and fitness score evaluation. The individuals in the initial population are simple neural network structures with only one global pooling layer or one fully connected layer. We use an acyclic graph to encode an individual with each node in the graph representing a basic operation or connection including convolution, pooling, fully connected, concatenation and skip. Those operations are standard in the neural network literature. Please refer to Figure 3 for examples of individuals represented by a graph. We also use the prediction performance on a validation data as fitness score.
However, different from [14,15,13], we are not only exploring how to implement a genetic/evolutionary algorithm for optimizing deep convolutional neural networks, but also exploring how to reduce the computational costs under the framework of neuro-evolution without imposing strong restriction on the search space. Next, we will present our strategies for reducing the computational costs.
3.1 Aggressive selection and mutation
A potential issue in traditional selection strategies (e.g., tournament selection or sampling-based selection) is that weak individuals might survive for a long period. While this feature is helpful to increase the diversity of the population, however it may waste a lot of time to train these weak individuals that will eventually be eliminated. We propose to eliminate these weak individuals at
their very early age, and use other approaches to increase the diversity of the population. The algorithmic description of the proposed aggressive selection is presented in Algorithm 1, and an illustration of the proposed selection process is presented in Figure 2. In particular, we greedily select the top k individuals from a population of individuals based on their fitness scores. To encourage the diversity, we also make sure the distance between the selected top individuals exceeds a certain threshold. The distance between two individuals is computed by comparing the nodes of two graph structures from input layer to output layer. If the nodes are represented by an alphabet denoting their operation or connection types, the distance is simply the hamming distance. It is notable that the distance between individuals is also considered by fitness sharing in previous studies [20] to encourage diversity.
Multiple Cloning. With the aggressive selection strategy described above, we can eliminate many weak individuals at their early ages. However, the small number of retained individuals will reduce the size of the next population and thus restrict diversity. To address this issue, we will resort to cloning, i.e., making multiple copies of the selected individuals to undergo different mutations for generating the next population. For comparison, in traditional tournament selection as illustrated in Figure 2, a weak individual might be selected and each survived individual only undergoes one mutation, which is the strategy adopted in [15,13]. In conventional sampling-based selection and mutation, each individual has a certain probability to be retained and the retained individual has a certain probability to be mutated. This is the strategy adopted in [14], where the mutation probability is set to a small value (e.g. 0.05). We can see that the proposed selection and mutation strategy is more aggressive than the existing works in that only a small number of strong individuals are retained and each survived individual reproduces themselves to undergo more mutations for potential growth in the fitness.
3.2 Mutation operations
To complement our proposed aggressive selection and mutation, we increase the number of possible mutation operations compared to the existing works [15,13,14]. We define 15 different types of mutation operations as shown in Table 1, which almost doubles the amount considered in [15]. Note that three mutation operations (resetweight, continue
learningrate) appeared in [15] are not included in Table 1 since those operations do not change the structure of a neural network. Next, we provide more details regarding each mutation operation. In the following, we discuss some implementation details of each mutation operation. We are going to focus on the add operations. For all removal operations, we randomly select and remove one of existing layers of the chosen type. If no such layer exists, no operation will be applied.
– addconvolution: Firstly, we randomly select the position to add a convolution layer. Then we insert a convolution layer with channel number 32, stride 1, filter size 3 3, and number of padding pixel 1. For simplicity, those values are chosen to ensure the input and output of the feature map dimensions do not change after the convolution. Note that even though we use a predefined set of channel number, stride and filter size, those values could be altered later through alterchannel
number, alterstride, alterfiltersize mutation operations, which we will discuss shortly. In this work, a convolutional layer is by default followed by batch normalization [26] with Relu [4] activation unit. – alterchannel
filtersize: These three types of mutation operations are to reset the hyper-parameters in a convolution layer. We randomly choose a new value for the corresponding hyperparameter from a predefined list, i.e., {8, 16, 32, 48, 64, 96, 128} for channel numbers, for filter sizes, and {1, 2} for strides. – addskip: A skip layer, illustrated in Figure 3, is to implement the skip connection introduced in residual networks [16]. Since a skip layer requires its two bottom layers to share the same feature map dimension and channel number, we first find out all pairs of layers which could potentially be bottom layers of the skip layer. Then, a skip connection is added on top of a randomly selected pair from all possible pairs. – addconcatenate: Similar to skip layer, a concatenate layer requires two bottom layers to share the same feature map dimension, but they could have different channel numbers. Thus, the addconcatenate mutation follows a similar procedure as addskip mutation. – addpooling: Here, we restrict the insertion of a pooling layer such that it can only take place right after a convolution layer. For simplicity, we limit the pooling strategy to be max pooling and kernel size to 2
2 with stride 2. This predefined pooling configuration could be relaxed in future work. – addfullyconnected: For this operation, we limit its position to be the last layer or immediately following another fully connected layer. The output dimension of this inserted layer is uniformly chosen from the following set {50, 100, 150, 200}.
Fig. 3. Example of how a neural network structure mutates to new ones after undergo- ing different mutation operations: addconvolution, add
The dotted squares mark added convolution layer “Conv3”, concatenate layer “Con-cat1” and skip layer “Skip1”.
– adddropout: For this operation, we limit its position to be immediately after a fully connected layer. For simplicity, we set dropout ratio to 0.5.
Note that applying some mutation operations (e.g., alter
stride, addpooling) may result in inconsistency of feature map dimensions. In this situation, we will adopt the following strategies to address the above issue: i) adding additional padding pixel; ii) adding a 1 1 convolution layer to adjust channel numbers. If it still results in an invalid network structure, we simply apply a mutation again. In Figure 3, we show an example of how a neural network structure evolves to new ones after undergoing some mutation operations: addconvolution, addconcatenate, addskip. We use the dotted square to mark the new layers. Clearly, we could see that as network evolves, we can explore the diverse neural network structures and obtain neural network structure with potential better performance.
3.3 Training strategy
To evaluate the fitness score of each individual, a standard approach is to use existing optimization algorithms off-the-shelf to learn the weight parameters. Training a CNN may take tens of thousands gradient descent iterations to achieve a good local minimum. However, we observe that “deep” training (i.e., setting a very stringent condition for stopping the training process) is not necessary during the evolution since our goal is to have a ranked list of individuals for selection. Therefore, a rough estimate of the prediction performance for each individual is sufficient for driving the evolution.
To reduce the training time of each individual during the evolution, we explore a different learning rate decay strategy to train a deep neural network. There are two popular strategies for decaying the learning rate. One method is
Table 1. The allowed mutation operations in our work and in [15]; represents that mutation operation is defined while - represents not available
where is the initial step size and
are the hyper-parameters. Another popular method is a multi-stage strategy [4], where the learning rate is reduced by a fixed factor (e.g., 10) after a large number of iterations. We use a mixture of both strategies. We divide our training process into three stages with a maximum number of 20000 iterations: with the first stage being the first 10000 iterations, from 10000 to 15000 iterations as the second stage, and the last 5000 iterations as the final stage. Within each stage, we use the inverse learning rate strategy. The learning rate is reduced by a fixed factor after each stage. This strategy avoids running a large number of iterations with the same step size without improving the prediction performance much at each stage, and also quickly gives a rough estimate of the prediction performance without spending long time at the tail of the learning curve that has little improvement on the prediction performance. Finally, after the neuro-evolution process terminates with a good structure, we switch to existing optimization algorithms for deep training.
3.4 Mutation operation sampling
To speed up the evolution process, we also use non-uniform sampling probabilities for choosing a mutation operation. Using uniform probabilities to choose a mutation operation will waste lots of time training weak individuals that are mutated by removing convolution, skip, concatenation from their parents in the early stage of evolution process. To avoid this issue, we explicitly set the sampling probabilities of addconvolution, add
channelnumber two times larger than that of other mutation operations at the earlier stage of the evolution process.
In this section, we report some experimental results of the proposed aggressive genetic programming approach for optimizing convolutional neural network structures. We emphasize that we are not aiming to achieve better performance than [15] due to limited computing resources. Instead, we focus on showing that the proposed strategies can reduce the computational time and also achieve competitive and even better performance than similar works using significantly less computational power.
In subsection 4.1, we describe the experiment setup including datasets and data preprocessing. In subsection 4.2, we show the performance of the aggressive selection strategy under different values of k on CIFAR-10 [3] dataset to justify the proposed aggressive selection. In subsection 4.3, we compare the performance of the proposed aggressive evolution with other genetic approaches as well as previous works that achieved the-state-of-art results by hand-crafted networks on four standard benchmark datasets. In subsection 4.4, we present the discovered neural network structures for CIFAR-10 and CIFAR-100.
4.1 Experimental setup
Datasets and Preprocessing: We conduct experiments on four benchmark datasets: MNIST [1], SVHN [2], CIFAR-10 [3] and CIFAR-100 [3]. MNIST dataset contains 60, 000 training images and 10, 000 test image where each grayscale image contains one of the 10 digits, 0 to 9. CIFAR-10 dataset [3] has 50,000 training images and 10,000 test images. It contains 10 classes and each RGB image has a size of 32 32. The data is preprocessed by applying a Global Contrast Normalization (GCN) and ZCA whitening [28] and each side is padded with four pixels. In the training phase, a 32
32 patch is randomly cropped from the padded image while in the test phase the original images are used. CIFAR-100 dataset is similar to CIFAR-10 dataset but has 100 classes in total. SVHN is a street view house number dataset which contains about 73, 257 training images and 26, 032 test images.
Experiment configuration: In our experiment, the population size is set to be 10. Given a population of 10 individuals (which are clones of top k individuals in intermediate generations), we let each individual undergo a mutation, and then select top k individuals from the 10 mutated individuals and the original 10 individuals. A new population will be created by making equal number of clones of the selected individuals to reach the population size 10. It is worth mentioning that even though we use such a small population size, our performance is competitive and even better than [13,14], in which the population size is set to 100 and 20, respectively. It is expected that using a larger population
Fig. 4. Left: The test accuracy of the best individual among the selected top k individuals vs the number of generations on CIFAR-10 dataset, where different curves correspond to aggressive selection with different values of k. Middle: The test accuracy of the best individual among the selected individuals by using aggressive, tournament, sampling uniformly and sampling by fitness selection strategies vs the number of generations on CIFAR-10 dataset. Right: The evolution of model size and test accuracy of the best individual in AG-Evolution algorithm on CIFAR-10.
size will further increase our performance according to [15]. We use mini-batch Stochastic Gradient Descent (SGD) to train each individual neural network for a maximum of 20,000 iterations with a momentum 0.9. The mini-batch size is fixed to be 128. The weight decay is set to be 0.0005. The learning rate strategy is described in subsection 3.3. The initial learning rates for the three stage are set to be 1010
10
, respectively. The parameters in (1) are set to
= 0.001 and
= 0.75. The distance threshold in aggressive selection is set to be 1. In our experiments, one evolution process is always run on one GPU.
4.2 The effect of aggressive selection
Here, we present the evidence that the proposed aggressive selection strategy can dramatically speed up the evolution process. The following experiments are conducted in the CIFAR-10 dataset. In the left of Figure 4, we plot the evolved network performance under four different values of k = 1, 2, 5, 10 used in our aggressive selection strategy. The smaller the k value, the more aggressive the selection strategy. For each experiment setting, we plot the test accuracy of the best individual among the selected top k individuals from each generation. We observe that aggressive selection with smaller values of k (e.g., 1 and 2) evolves faster than non-aggressive selection using larger values of k (e.g., 5, 10). We further compare the proposed aggressive selection strategy with other existing selection strategies such as Tournmanet, Sampling Uniformly and Sampling by Fitness. In the middle of Figure 4, we plot the test performance of the best individual in one generation by using those different selection strategies, from which we can observe that aggressive selection evolves faster than other strategies dramatically. For researchers and practitioner in genetic programming, this proposed competition strategy is of interest and can be easily adopted.
4.3 Comparison with existing methods
In this section, we compare the performance of the proposed aggressive genetic programming approach with existing genetic approaches on benchmark datasets MNIST, SVHN, CIFAR-10, CIFAR-100. We refer to the genetic approaches presented in [13,14,15] as EDEN, Genetic-CNN and LS-Evolution, respectively. For our method, we report the result using aggressive selection with k = 1, referred to AG-Evolution. For reference and comparison, we also include state-of-the-art results based on hand-crafted neural network structures (referred them as SOTA) as well as the results using non-aggressive selection (i.e., by setting k = 10 in our framework), which is referred to as NA-Evolution.
For NA-Evolution and AG-Evolution, we terminate the evolution process when the performance on the validation data saturates on all datasets except on the CIFAR-100 dataset, in which we terminate the process earlier. It is possible that by continuing the evolution process, the performance might be further improved. We would also like to emphasize that state-of-the-art results could be attributed to not only a good network structure but also some other factors (e.g., using a good pooling function [29]), which are not considered in current genetic approaches. For Genetic-CNN, we do not compare with the results in their Table 3 for SVHN, CIFAR-10 and CIFAR-100. The reason is that their results in Table 3 are not directly achieved by the evolution process. They re-trained the networks by using large number of filters, which are not exactly the networks found by their genetic algorithm.
Table 2, 3, 4, 5 show our results on different datasets. For each method, we report both the test accuracy and the computational cost measured by the total number of used GPU hours (GPUH), which is used in [14,13,15]. The GPUH numbers for other methods are directly from the original papers. It is notable that on some datasets the results of EDEN, Genetic-CNN and LS-Evolution are missing, which is because they are not reported in the original papers. From the results, we have the following observations.
– First, the performance of the discovered neural network structures by our genetic approach AG-Evolution on MNIST and SVHN datasets is very close to the state-of-the-art results.
– Second, compared with EDEN [13], AG-Evolution achieves better performance on MNIST and CIFAR-10, and compared with Genetic-CNN [14], AG-Evolution can find a much better neural network on CIFAR-10 (0.9052 vs 0.7706 for test accuracy) with much less time (72GPH vs 408GPUH).
– Third, compared with the LS-Evolution [15] on CIFAR-10 data, which achieves a test accuracy of 0.9180 with 17971 GPUH, our AG-Evolution achieves a similar performance of 0.9052 with much less time, i.e., 72GPUH. It is expected that by continuing our evolution process, we might achieve similar test accuracy to 0.9460 but with less amount of time.
– Finally, AG-Evolution uses a much shorter time to find network structures achieving almost similar performance to NA-Evolution with a much longer time on SVHN, CIFAR-10 and CIFAR-100 datasets, which further verifies the benefit of the proposed aggressive selection.
In Figure 5, we plot the test accuracy of the best individual in each generation versus the number of generations on the four datasets for AG-Evolution, from which we could see the proposed genetic approach gradually improves the
Table 2. Comparison of test accuracy and computational cost on MNIST dataset. NA-Evolution is not run on this dataset due to that AG-Evolution almost achieves the same performance as the state-of-the-art.
performance of neural network structures. We also report the evolution of model size of individuals on CIFAR-10 dataset in the right of Figure 4.
Table 3. Comparison of test accuracy and computational cost on SVHN dataset.
Table 4. Comparison of test accuracy and computational cost on CIFAR-10 dataset.
Note that the benefit of this work for practitioners in deep learning and computer vision community is that they can utilize our work to automatically search the optimal neural network structures for their own task with acceptable computing cost.
4.4 Discovered Network Structures
Finally, we show two network structures found by the proposed AG-Evolution for CIFAR-10 and CIFAR-100 datasets in Figure 6. It is notable that both net-
Table 5. Comparison of test accuracy and computational cost on CIFAR-100 dataset.
Fig. 5. Test Accuracy vs Generation Number on MNIST, SVHN, CIFAR-10, CIFAR- 100 dataset
works are multiple path networks with concatenation and skip connections similar to GoogleNet [17] and ResNet [16].
Fig. 6. Discovered neural network structures for CIFAR-10 and CIFAR-100 dataset.
In this paper, we have developed an aggressive genetic programing approach to optimize the structure of convolutional neural networks under limited computational resources without imposing strong restrictions on the search space. Our study shows that it is possible to achieve promising result using the proposed aggressive genetic programming approach in a reasonable amount of time. We expect the proposed strategies can be also useful for optimizing other types of neural networks (e.g., recurrent neural networks), which will be left as future work.
1. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324
2. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning. Volume 2011. (2011) 5
3. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. (2009)
4. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convo- lutional neural networks. In: NIPS. (2012)
5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR. (2015)
6. Girshick, R.: Fast r-cnn. In: ICCV. (2015)
7. Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A.r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al.: Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6) (2012) 82–97
8. Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on audio, speech, and language processing 20(1) (2012) 30–42
9. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug) (2011) 2493–2537
10. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge. Nature 550(7676) (2017) 354–359
11. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587) (2016) 484–489
12. Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies. Evolutionary computation 10(2) (2002) 99–127
13. Dufourq, E., Bassett, B.A.: Eden: Evolutionary deep networks for efficient machine learning. arXiv preprint arXiv:1709.09161 (2017)
14. Xie, L., Yuille, A.: Genetic cnn. arxiv preprint. arXiv preprint arXiv:1703.01513 (2017)
15. Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Le, Q., Kurakin, A.: Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041 (2017)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016)
17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
18. Miller, G.F., Todd, P.M., Hegde, S.U.: Designing neural networks using genetic algorithms. In: ICGA. Volume 89. (1989) 379–384
19. Gruau, F.: Genetic synthesis of modular neural networks. In: Proceedings of the 5th International Conference on Genetic Algorithms, Morgan Kaufmann Publishers Inc. (1993) 318–325
20. Goldberg, D.E., Deb, K.: A comparative analysis of selection schemes used in genetic algorithms. Foundations of genetic algorithms 1 (1991) 69–93
21. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016)
22. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016)
23. Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems. (2012) 2951–2959
24. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: International Conference on Machine Learning. (2013) 115–123
25. Mendoza, H., Klein, A., Feurer, M., Springenberg, J.T., Hutter, F.: Towards automatically-tuned neural networks. In: Workshop on Automatic Machine Learning. (2016) 58–65
26. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
27. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar- rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014)
28. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)
29. Lee, C.Y., Gallagher, P.W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In: Artificial Intelligence and Statistics. (2016) 464–472
30. Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R.: Regularization of neural networks using dropconnect. In: Proceedings of the 30th international conference on machine learning (ICML-13). (2013) 1058–1066
31. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected convolutional networks. arXiv preprint arXiv:1608.06993 (2016)