Predicting Neural Network Accuracy from Weights

2020·Arxiv

Abstract

We show experimentally that the accuracy of a trained neural network can be predicted surprisingly well by looking only at its weights, without evaluating it on input data. We motivate this task and introduce a formal setting for it. Even when using simple statistics of the weights, the predictors are able to rank neural networks by their performance with very high accuracy ($R^2$ score more than 0.98). Furthermore, the predictors are able to rank networks trained on different, unobserved datasets and with different architectures. We release a collection of 120k convolutional neural networks trained on four different datasets to encourage further research in this area, with the goal of understanding network training and performance better.

1. Introduction

Deep neural networks (DNNs) are considered state-of-the-art methods for many machine learning problems today. Yet, a deeper understanding of the mechanisms underlying these successes is still lacking. Deep learning phenomena, i.e. various surprising and insightful empirical findings surrounding the efforts to understand DNN training and generalization, have recently gained a lot of attention from researchers and practitioners (Zhang et al., 2017; Frankle & Carbin, 2019; Zhang et al., 2019). Research in this direction is actively growing, yet many such phenomena remain to be discovered.

This paper discusses the prediction of the accuracy of trained neural networks, using only their weights as inputs. Specifically, we consider convolutional neural networks (CNNs) trained on standard datasets for the popular task of image classification. We see this study as a step towards gaining a deeper understanding of neural network training and performance. Understanding what can be said by looking at the trained weights can be useful in understanding the training process in general. It can also have practical applications such as early stopping of unsuccessful training runs (Domhan et al., 2015).

As a first step in this direction we study CNNs trained in the under-parameterized regime, in which the observed train and test accuracies do not differ substantially. Then we show that our findings appear to transfer to the overparameterized regime (Belkin et al., 2018). We demonstrate (Section 5.2) that the predictor trained on a collection of very small CNNs is capable of ranking large ResNet models according to train/test accuracy fairly well by looking only at the ResNet’s weights.

The studies presented in this paper may raise more questions than they answer, but we hope that they will serve as a starting point for other researchers to make progress in understanding deep learning phenomena. The main contributions of this paper are:

• We propose a new formal setting that captures the approach and relates to previous works.

• We introduce a new, large dataset with strong baselines and discuss extensive empirical results. The data is of a new modality, mapping trained weights of neural networks to their accuracy.

• The experiments show that, somewhat surprisingly, it is possible to predict the accuracy using trained weights alone. Furthermore, only a few statistics of the weights are sufficient for high prediction accuracy.

• Experiments on transfer of prediction across architectures and datasets show that it is possible to rank neural network models trained on an unknown dataset just by observing the trained weights, without ever having access to the dataset itself.

Next, we describe a formal setting that considers this and related tasks (Section 2) and discuss related work (Section 3). We introduce a new dataset for this task and present empirical results on our dataset (Section 4). We also discuss the performance of the resulting predictors under domain shift (Section 5).

2. Formal setting

Consider a fixed unknown data-generating distribution $P(X, Y)$ defined over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are input and output domains, respectively. In the context of this paper, $\mathcal{X}$ will be the space of images and $\mathcal{Y}$ will be a set of class labels. We observe a training set $S$ of input-output pairs sampled i.i.d. from $P$.

Figure 1. Diagram of the learning setting. Nodes contain hyperparameters $\lambda$, CNN weights $W$, and expected accuracy $\mathrm{Acc}_P(W)$. Edges are labeled with the information necessary for the mapping: the training dataset $S$ and the data-generating distribution $P$.

We will train CNNs on $S$ using hyperparameters $\lambda$ and get a particular weight vector $W = A(S, \lambda)$, where $A$ denotes the learning procedure and $W$ may be considered a flattened vector containing all the weights. The hyperparameters $\lambda$ include architecture-specific details (e.g. number of layers and activation function), optimizer-specific details (e.g. learning rate and initialization variance), and other parameters (e.g. weight regularization and fraction of the training set to use). Notice that the training method $A$ may have internal sources of stochasticity, including the order of examples in mini-batches or weight initialization. Also note that depending on $\lambda$, the weight vector $W$ may be of a variable dimension (e.g. for a varying number of layers).

We will denote the function realized by the CNN with weights $W$ by $h(\cdot; W)$. Its training accuracy and its expected accuracy are denoted by $\mathrm{Acc}_S(W)$ and $\mathrm{Acc}_P(W)$, respectively.

The goal discussed in this paper is to predict a CNN’s expected accuracy $\mathrm{Acc}_P(W)$ by looking at its weights $W$. Importantly, since the data distribution $P(X, Y)$ is fixed, the mapping $F: W \mapsto \mathrm{Acc}_P(W)$ that we want to learn (blue arrow in Figure 1) exists and is uniquely defined. Unfortunately, like $P$, it is unknown to us, so we need to estimate it with a predictor $\hat{F}$.

To build an estimator $\hat{F}$ we need to specify how to measure its quality. In other words, we need to measure how similar the mappings $F: W \mapsto \mathrm{Acc}_P(W)$ and $\hat{F}$, both defined on the weight space $\mathcal{W}$, are. Since this work is motivated by studying CNN training, we will not compare the two on the entire space $\mathcal{W}$ but rather focus on the subset consisting of weights that can actually be obtained as a result of training. We propose to generate a set of hyperparameter configurations $\lambda_1, \dots, \lambda_K$ and then train $K$ different CNNs $W_k = A(S, \lambda_k)$ on the training set $S$. We cannot compute the exact values of $\mathrm{Acc}_P(W_k)$, but we can estimate them well using the test accuracy $\mathrm{Acc}_{S'}(W_k)$ measured on a separate test set $S'$ of i.i.d. input-output pairs sampled from $P$ independently of $S$. Finally, we can train the estimator $\hat{F}$ by minimizing its Mean Squared Error (MSE) on the CNN collection $\{(W_k, \mathrm{Acc}_{S'}(W_k))\}_{k=1}^{K}$.
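The training objective described in this paragraph can be written compactly. The following is a sketch in notation assumed here for illustration only: $W_k$ is the weight vector of the $k$-th trained CNN, $a_k$ its measured test accuracy, $K$ the number of trained CNNs, and $\mathcal{F}$ a class of candidate estimators (for example, regression trees or small DNNs).

```latex
\hat{F} \in \arg\min_{F' \in \mathcal{F}} \; \frac{1}{K} \sum_{k=1}^{K} \bigl( F'(W_k) - a_k \bigr)^2
```

The test accuracy $a_k$ stands in for the expected accuracy, which cannot be computed exactly because the data-generating distribution is unknown.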

Why use only weights? The framework proposed above already makes use of the dataset $S$ by training CNNs on it. This means that the estimator $\hat{F}$ and, as a consequence, its predictions implicitly depend on $S$. A natural idea would be to make the dependence on $S$ more explicit: e.g. by holding out some part $S_{\mathrm{ho}} \subset S$ and returning the accuracy $\mathrm{Acc}_{S_{\mathrm{ho}}}(W)$ measured on it as a prediction for the accuracy of the CNN $W$. Based on decades of theoretical and practical ML experience, this approach will likely provide a very strong baseline for the task of predicting the accuracy. So why are we considering predictors that only look at weights and do not utilize $S$ explicitly?

The main reason is that predicting the accuracy is only an indirect goal of this study. Ultimately we hope to gain insights about DNN training and generalization by understanding the structure of network weights, which are among the most prominent characteristics of a DNN. A minor, more practical advantage of not evaluating the CNN on a held-out set is that prediction requires less computational effort than an inference pass over that set.

2.1. Predicting from hyperparameters

Another important and related question is to what extent the test accuracy of $W = A(S, \lambda)$ can be predicted from the hyperparameters $\lambda$ that were used to train it. Once we fix the training set $S$ and the random seed, which determines the learning procedure’s internal source of stochasticity, there exists a unique deterministic mapping $\lambda \mapsto \mathrm{Acc}_P(A(S, \lambda))$ (dotted red arrow in Figure 1) and we may try to estimate it using the same scheme as described above. While the Bayes error of using either $\lambda$ or the resulting weights $W$ for predicting the accuracy is 0, in practice the two problems may have different sample complexities.

If the training set $S$ and/or the random seed are not fixed but instead generated each time we train the CNN, the mappings $\lambda \mapsto W$ and, as a consequence, $\lambda \mapsto \mathrm{Acc}_P(W)$ both become stochastic. In this case the estimation is possible only up to the noise introduced by the variance of $S$ and/or different choices of the seed.

2.2. Domain shift

Does the learned estimator $\hat{F}$ generalize to yet unseen data distributions $P$ or hyperparameter configurations $\lambda$? In other words, if we were to train an estimator $\hat{F}$ on CNNs which were themselves trained on CIFAR10, how accurately would $\hat{F}$ predict the test accuracy of a CNN trained on SVHN? We will refer to this setting as domain shift. A priori, even if we solve the original problem well on CIFAR10, there are no guarantees that the estimator would perform well for SVHN. The same applies to a change in the architecture.

Rather than discovering properties of DNNs that are specific to a particular dataset or architecture (which nevertheless could be interesting on its own), we are even more interested in those that hold across various datasets and architectures. In that sense, domain shift provides a setting close to what we actually are interested in: observing any sort of positive transfer between different datasets and architectures would indicate that there are properties of DNNs that transfer. Our goal is to demonstrate the existence of these invariant properties and study them.

3. Related work

There are only a few works that consider the problem setting described in Section 2. The most relevant of these are Jiang et al. (2019), Yak et al. (2019), Eilertsen et al. (2020), Martin & Mahoney (2020), and Martin et al. (2020).

The overall setting and motivations of Eilertsen et al. (2020) are similar to ours. However, the main difference is that instead of predicting the accuracy, the authors focus on predicting the hyperparameters $\lambda$ using the weights $W$ (the opposite direction of the black arrow in Figure 1).

Concurrent works from Martin & Mahoney (2020); Martin et al. (2020) confirm our findings by showing that more complex statistics derived from weight matrices (Martin & Mahoney, 2018) correlate well with the performance of state-of-the-art models in vision and language processing.

Jiang et al. (2019) and Yak et al. (2019) both investigate how to predict the generalization gap, i.e. the difference between training and test set performance, of a neural network based on the hidden activations of training set examples. Jiang et al. (2019) train large CNN/ResNet architectures on CIFAR datasets and approximate the minimal distances to the class boundary for each data point in each hidden layer. They use this margin distribution to train a linear regressor that predicts generalization gaps. Yak et al. (2019) expand upon this work by training a large number of small fully-connected networks on different variations of a generated spiral dataset. They replace the linear predictor with a recurrent neural network to handle varying neural network depth, and show that predictions transfer between small fully-connected architectures and varying synthetic datasets. Both works rely heavily on the margins in the intermediate layers of the networks. These margins cannot be computed analytically and require a computationally expensive approximation procedure (Elsayed et al., 2018), which is not guaranteed to be accurate. Margin approximation also involves an inference pass over the training set $S$. Our estimators use only the weights of the networks (or their simple statistics) to predict the accuracy. As the weights are among the most important characteristics of a trained DNN, it is interesting to study this connection without requiring information about the training set $S$. We show that these estimators transfer to networks trained on unobserved natural image datasets and with ResNet32 architectures. Finally, the experimental design used in these previous works may lead to an undesirable leakage, as discussed in Section 4.1.

DeChant et al. (2019) train ResNets and other large architectures on CIFAR and ImageNet datasets. They demonstrate that it is possible to tell whether or not the network will make a mistake on one particular image by looking at the activations of that image in the network’s layers.

The relation between the train and test accuracies is the central question of statistical learning theory (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014). Jiang et al. (2020) recently performed a large scale empirical study analyzing correlation between various generalization error bounds and network performance.

A problem somewhat similar to ours has been studied in the context of hyperparameter optimization and neural architecture search (NAS). Streeter (2019a;b) proposes procedures that select good hyperparameter values based on previous exploration. To apply early stopping to unsuccessful runs, Swersky et al. (2014) and Domhan et al. (2015) predict the final performance of a neural network based on a few training iterations. Similar techniques were applied in NAS to select candidate architectures, where the prediction is usually based on hyperparameters, architectures, information about the dataset, and performance measurements of similar architectures; see Baker et al. (2017), Istrate et al. (2019), and references therein.

4. Experiments: Small CNN Zoo

Results reported in this section are based on a new dataset which we call the Small CNN Zoo. It contains weights of a fixed CNN architecture trained on 4 different image datasets using a large number of different hyperparameter configurations. For each network, accuracy and cross-entropy loss on the train and test data are available.

4.1. The Small CNN Zoo dataset

To enable predicting accuracy from the flattened weight vector, we keep the number of weights in the architecture small: 3 convolutional layers with 16 filters each, followed by global average pooling and a fully connected layer, for a total of 4 970 learnable weights. As a result, the best test accuracies we obtain on CIFAR10 and SVHN are 56% and 78%, respectively, which is far below the state of the art. However, it is worth pointing out that the smallest CNN architectures achieving above 90% test accuracy on CIFAR10 that we are aware of require on the order of $10^6$ parameters (Lin et al., 2014; Springenberg et al., 2015), i.e. 200x more, and work on RGB inputs, while we discard color information.
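The stated total of 4 970 weights can be checked with a short calculation. The sketch below assumes 3x3 kernels, a single grayscale input channel, and 10 output classes; the kernel size is not given in this section and is an assumption made for illustration.

```python
# Parameter count for the small CNN described above.
# Assumptions (not stated in this section): 3x3 kernels, 1 grayscale
# input channel, 10 output classes.
KERNEL = 3
IN_CHANNELS = 1
FILTERS = 16
NUM_CLASSES = 10


def conv_params(in_ch, out_ch, k=KERNEL):
    """Kernel weights plus one bias per output filter."""
    return k * k * in_ch * out_ch + out_ch


def dense_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features


layer_counts = [
    conv_params(IN_CHANNELS, FILTERS),   # conv 1: 160
    conv_params(FILTERS, FILTERS),       # conv 2: 2320
    conv_params(FILTERS, FILTERS),       # conv 3: 2320
    dense_params(FILTERS, NUM_CLASSES),  # dense after global average pooling: 170
]
total = sum(layer_counts)
print(layer_counts, total)  # [160, 2320, 2320, 170] 4970
```

Under these assumptions the count matches the figure in the text, and 200 times this number is roughly one million parameters.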

We train on 4 natural image classification problems: MNIST (LeCun et al., 2010), Fashion MNIST (Xiao et al., 2017), grayscale CIFAR10 (CIFAR10-GS) (Krizhevsky, 2009), and grayscale SVHN (SVHN-GS) (Netzer et al., 2011). Global average pooling and grayscale inputs allow us to apply the same architecture across all four datasets.

For each dataset, we sample 30k different hyperparameter configurations chosen independently at random from pre-specified ranges (listed in the Supplemental A.2). We vary optimizer, learning rate, type of initialization and its variance, fraction of the training examples to use, activation function, dropout rate, and $\ell_2$-regularization of weights. We use one random seed per hyperparameter configuration. We did not use data augmentation or batch normalization.

Instead of stopping training when networks converge or reach a certain level of accuracy, we train each CNN for 86 epochs. We do so because we want to study CNNs under general conditions: properties discovered by only looking at converged models may not hold for intermediate steps.

Finally, we discard the models in which numerical instabilities (e.g. infinite gradients) were detected. This process leads to 4 CNN collections: one with 29 996 CNNs for MNIST, one with 29 999 for Fashion MNIST, one with 29 999 for CIFAR10-GS, and one with 29 987 for SVHN-GS. The Small CNN Zoo is the union of these 4 collections.

The distribution of the CNN models with respect to their test/train accuracy is reported in Figure 2. MNIST, Fashion MNIST, and CIFAR10-GS all have balanced classes and the histograms peak at around 10%—the accuracy of a random or constant prediction. SVHN-GS is unbalanced with the largest class containing around 19% of the samples. Here many models seem to converge to the constant majority class prediction, which explains the shifted peak.

We do not observe overfitting in the Small CNN Zoo dataset, even though some of the models were trained on only 10% of the training examples. This is likely due to the small architecture used. Based on this dataset we may gain insights into why and how neural networks train, but it is less likely that the dataset will directly lead to a deeper understanding of generalization.

Figure 2. Distribution of the networks from the Small CNN Zoo collection over their test accuracy (first row) and their training/test accuracies (second row).

Why not use multiple seeds? We use one random seed per hyperparameter configuration. This avoids having models that are too similar between the train and test splits of the CNN collections, which possibly leads to a leakage. In fact, this may point to a possible shortcut taking place in the studies of Jiang et al. (2019). The authors used 3 random seeds per hyperparameter configuration and did not enforce that the models trained with the same hyperparameters (but different random seeds) were allocated to the same split. Inspecting their dataset closer shows that the variance in generalization gap between the networks that only differ in random seed is orders of magnitude smaller than the average variance between all networks (vs ). A similar shortcut may take place in the studies of Yak et al. (2019). Here, the authors did not use the same hyperparameters with different seeds for training, but they trained networks with the same hyperparameters on versions of the synthetic datasets that were generated using different seeds.
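When several seeds per configuration are available, the leakage described above can be avoided by assigning all models that share a hyperparameter configuration to the same split. The following is a minimal sketch of such a grouped split; the record layout and the field name `config_id` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def grouped_split(records, test_fraction=0.5, seed=0):
    """Split so that all models sharing a config_id land in the same split."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["config_id"]].append(rec)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    test = [r for i in ids[:n_test] for r in groups[i]]
    train = [r for i in ids[n_test:] for r in groups[i]]
    return train, test


# Hypothetical collection: 100 configurations, 3 seeds each.
records = [{"config_id": c, "seed": s} for c in range(100) for s in range(3)]
train, test = grouped_split(records)
overlap = {r["config_id"] for r in train} & {r["config_id"] for r in test}
print(len(train), len(test), len(overlap))  # 150 150 0
```

With one seed per configuration, as in the Small CNN Zoo, every group has a single member and a plain random split has the same effect.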

4.2. Training the estimators

Once we have the CNN collection with weights $W_k$ and their test accuracies $\mathrm{Acc}_{S'}(W_k)$, we can start training various estimators $\hat{F}$.

Types of estimators We explore three different estimators: logit-linear model (L-Linear), gradient boosting machine using regression trees (GBM), and a fully-connected DNN. All three methods were trained to minimize MSE. Each of these 3 methods comes with its own hyperparameters and initial experiments showed that it is important to tune them.

For the logit-linear model, we train weights and offsets using mini-batch SGD/Adam, varying the learning rate, batch size, initialization, and regularization strength. We use LightGBM (Ke et al., 2017) to train the GBM model and vary the number of leaves and maximum depth of the trees, the learning rate, regularization, and parameters for the features/examples subsampling. We use a feed-forward fully-connected architecture for the DNN model with ReLU activations and a sigmoid transform. We train it with mini-batch SGD/Adam, varying the learning rate, number of layers and their width, regularization strength, initialization type and variance, and batch size.

Table 1. $R^2$ scores for predicting test accuracies of CNNs trained on CIFAR10-GS with different input features (columns) and different estimators (rows). GBM is on par with or better than DNN, and significantly better than L-Linear. All std. dev. (w. r. t. training models on three different folds of the cross-validation) for numbers in this table were below 0.005.

Input features We investigate several ways of preprocessing the weight vectors $W$ before feeding them to the estimators: (1) Using flattened parameters (weights/kernels and biases) of a single $i$-th layer (the largest index stands for the last fully connected layer); (2) Using statistics of the entire flattened vector consisting of 7 real numbers: the mean, the variance, and $q$-th percentiles for $q \in \{0, 25, 50, 75, 100\}$; (3) Computing the above statistics for each layer separately, while processing kernels and biases independently, and concatenating the results, which yields $4 \times 2 \times 7 = 56$ real-valued features; (4) Computing $\ell_1$ or $\ell_2$ norms for each layer separately, while processing kernels and biases independently, and then concatenating the results, which yields $4 \times 2 = 8$ real-valued features for each norm.
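As an illustration of the per-layer statistics in option (3), the sketch below computes the mean, variance, and five percentiles separately for the kernel and the bias of each of four layers. The percentile interpolation rule, the use of the population variance, and the toy layer sizes (which assume 3x3 kernels) are assumptions made for this example and may differ from the authors' implementation.

```python
import random

PERCENTILES = (0, 25, 50, 75, 100)


def percentile(sorted_vals, q):
    """q-th percentile with linear interpolation between closest ranks."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def seven_stats(values):
    """Mean, population variance, and five percentiles of a flat list."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    s = sorted(values)
    return [mean, var] + [percentile(s, q) for q in PERCENTILES]


def layer_statistics(layers):
    """Concatenate seven_stats over the kernel and bias of every layer."""
    features = []
    for kernel, bias in layers:
        features += seven_stats(kernel) + seven_stats(bias)
    return features


# Toy stand-in for a trained network: 4 layers of (flattened kernel, bias).
rng = random.Random(0)
shapes = [(144, 16), (2304, 16), (2304, 16), (160, 10)]
layers = [([rng.gauss(0, 0.1) for _ in range(k)], [0.0] * b) for k, b in shapes]
features = layer_statistics(layers)
print(len(features))  # 4 layers x 2 parameter groups x 7 statistics = 56
```

The resulting feature vector has a fixed length regardless of how many weights each layer holds, which is what makes this representation low-dimensional compared to the flattened weight vector.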

Training protocol and metrics Each of the 4 CNN collections is divided into two splits: 15k CNNs are used for the training split and the remaining ones are held out for the test split. The entire training and hyperparameter selection for the models took place on the training splits. The test splits are used only once to evaluate the single best model that we chose based on the 3-fold cross-validation performed on the training split.

We performed hyperparameter selection by evaluating 1k unique hyperparameter configurations sampled randomly and independently from pre-specified ranges for every combination of estimator type, input features, and CNN collection.

In all experiments we use MSE as the training objective. We also compute the mean absolute deviation and the coefficient of determination, or $R^2$ score. The $R^2$ score normalizes the MSE of the estimator by the MSE of the best constant prediction. Larger scores correspond to better predictions and the score never exceeds 1. For further details on the Small CNN Zoo dataset and the experimental setup we refer to Supplementary A.
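The relation between the MSE and the coefficient of determination described above can be sketched in a few lines. The function names and the toy accuracy values are illustrative and not taken from the paper.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - MSE(model) / MSE(best constant); the best constant is the mean."""
    mean = sum(y_true) / len(y_true)
    baseline = mse(y_true, [mean] * len(y_true))
    return 1.0 - mse(y_true, y_pred) / baseline


acc = [0.10, 0.30, 0.50, 0.70]
print(r2_score(acc, acc))         # perfect predictions give the maximum score
print(r2_score(acc, [0.40] * 4))  # predicting the mean gives a score near 0
print(r2_score(acc, [1.00] * 4))  # a poor constant gives a negative score
```

A score at or below zero therefore means the estimator does no better than always predicting the average accuracy of the collection.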

4.3. Empirical results

In the experiments, GBM and DNN models always produce significantly better results than the logit-linear model. In some cases, the DNN model achieves slightly better results than GBM, but overall it is on par or significantly worse than GBM. These conclusions hold across all 4 datasets and the corresponding results are shown for one of the datasets (CIFAR10-GS) and a selection of input features in Table 1. In the interest of space we therefore only report the results for GBM in the following. All numbers for other models can be found in Supplementary A.5.

Table 2 presents the results of training the GBM models with different input features on the 4 CNN collections.

Using flattened weights First, we notice that a naive baseline of using the entire flattened vector $W$ already achieves a rather strong performance across all 4 datasets. Interestingly, almost the same performance can be recovered just by using the parameters of the last dense layer, while using any other (convolutional) layer alone results in a noticeably worse performance. This observation is consistent with feature importance measurements produced by the GBM model (Supplementary B), which indicate that parameters of the last dense layer were among the most informative and frequently used ones.

Using weight statistics Results based on the per-layer statistics of the weights are the best obtained across all 4 CNN collections. In particular, they are significantly better than results based on the entire weight vector $W$. At first glance this may look surprising, because $W$ contains sufficient information to recover the statistics. However, computing quantiles requires sorting numbers and presumably neither the GBM nor the DNN estimators have the capacity to do this. Also, compared to the entire weight vector or the weights of the last dense layer, the feature vector of statistics is relatively low-dimensional. This may provide an additional explanation for the superior performance of the statistics: the sample complexity of the regression problem is generally known to grow with the dimension of the feature space (Tsybakov, 2008).

Notably, Eilertsen et al. (2020) also report a strong performance of the per-layer statistics in their work.

Using weight norms We also tried using the $\ell_1$ and $\ell_2$ norms of the weights as features. Norms traditionally play an important role in statistical learning theory (Neyshabur et al., 2015; Bartlett et al., 2017) and are still actively used in practice to regularize DNNs with weight decay. In contrast to weight decay, which is commonly implemented by adding the sum of the norms across all layers multiplied by a single regularization coefficient to the objective, we kept the norms for different layers separate. This should provide more flexibility for the estimator. Table 2 (first block) shows that the estimators based on the norms perform slightly (but statistically significantly) worse than the ones using the full weight vectors $W$.

Table 2. $R^2$ scores for predicting test accuracies of CNNs trained on various datasets (columns) with GBM using different input features (rows). Best numbers for each dataset are in boldface. The largest std. dev. (w. r. t. training models on three different folds of the cross-validation) across all numbers in this table was 0.002. See Sections 4.3 and 4.4 for row descriptions.

Interpreting the $R^2$ score and MSE values MSE provides an absolute measure of the model performance and on its own does not tell us much about the model: the same MSE value can correspond to good or bad performance depending on the problem. The $R^2$ score is a relative measure: it compares the MSE of the model to the MSE of a constant prediction. Moreover, the $R^2$ score is scale invariant: multiplying both the targets and the predictions by the same constant does not change the metric. In Table 2 we use the $R^2$ scores because we find them slightly easier to interpret: a non-positive value indicates that we are not doing better than fitting a constant predictor, and values close to 1 indicate stronger performance. The MSE values are reported in Supplementary A.5, and scatter plots with raw predictions and true targets can be found in Supplementary D.

4.4. Ablation studies

Table 2 (upper block) shows that the parameter vector of a trained CNN alone contains a strong signal regarding the network’s accuracy. To understand more about the nature of this signal, we performed additional studies.

We tried several other input features for the estimators, including (i) the hyperparameter configuration $\lambda$ (containing 7 parameters) used while training the CNN, (ii) the concatenation of the hyperparameters with the entire weight vector $W$, and (iii) weight statistics similar to the per-layer statistics but computed only for a subset of the layers: for the final dense layer and for the combination of the first convolutional and the final dense layers. The results are reported in Table 2 (second block).

Hyperparameters As discussed in Section 2.1, the Bayes error of the predictor based on either hyperparameters $\lambda$ or weight vectors $W$ is 0, but sample efficiency may differ. The results show that for the Small CNN Zoo predicting the accuracies with weights is easier than with hyperparameters. We also tried predicting from individual hyperparameters to see if there was a single parameter sufficient to recover the signal. They all gave similarly poor results. For reference, we include the results for the learning rate. Predicting with both $\lambda$ and weights $W$ does not improve on predicting with $W$ alone.

Statistics for subsets of layers Motivated by the fact that using the weights of the last dense layer is as good as using the whole weight vector $W$, we tested whether statistics for a subset of the layers are enough to recover the performance based on the statistics of all layers. Curiously, the statistics of the last dense layer alone perform worse. The results improve if we add the statistics of the first convolutional layer, but they are still slightly worse than with all layers.

Permutation and scale invariance We also examined how the estimator’s predictions change as we modify its inputs. Notice that two ReLU CNNs with parameters $W$ and $cW$ have exactly the same test/train accuracy (but not the same cross-entropy loss) for any real value $c > 0$, because their outputs $h(X; W)$ and $h(X; cW)$ coincide for all inputs $X$. The same is true for any CNN if we permute the order of filters/channels consistently across all layers. We want to emphasize that we did not incorporate these inductive biases in any of the estimators we trained. Nevertheless, it may be interesting to test whether these (or similar) invariances emerge naturally in the trained estimators.
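The permutation symmetry mentioned above can be illustrated on a toy network. The sketch below uses a small two-layer fully-connected ReLU network instead of a CNN, which is a simplification made here for brevity: reordering the hidden units, together with the matching columns of the next layer, changes the flattened weight vector but not the function.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def dense(x, weights, bias):
    """weights[j] holds the incoming weights of output unit j."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, w1, b1, w2, b2):
    return dense(relu(dense(x, w1, b1)), w2, b2)


rng = random.Random(0)
n_in, n_hidden, n_out = 5, 8, 3
w1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
w2 = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]

# Permute hidden units: reorder rows of w1 and b1, and matching columns of w2.
perm = list(range(n_hidden))
rng.shuffle(perm)
w1_p = [w1[j] for j in perm]
b1_p = [b1[j] for j in perm]
w2_p = [[row[j] for j in perm] for row in w2]

x = [rng.gauss(0, 1) for _ in range(n_in)]
out = forward(x, w1, b1, w2, b2)
out_p = forward(x, w1_p, b1_p, w2_p, b2)
max_diff = max(abs(a - b) for a, b in zip(out, out_p))
print(max_diff)  # zero up to floating-point rounding
```

An estimator that reads the flattened weights sees two different inputs here, so any invariance to such reorderings has to be learned rather than assumed.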

For a given estimator $\hat{F}$ trained with the entire weight vectors $W$, we tested several ways of modifying its inputs, including multiplying them with various positive factors and permuting them in several different ways. Then we looked at the absolute difference $|\hat{F}(W) - \hat{F}(\tilde{W})|$ across multiple CNNs $W$ (from the test split of the same CNN collection that $\hat{F}$ was trained on) and various types of modified inputs $\tilde{W}$. We report a short summary of this study here. Details can be found in Supplementary C.
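The sensitivity measurement described above amounts to averaging the absolute change in the prediction over a set of weight vectors. The sketch below shows the computation with a hypothetical stand-in estimator; `toy_estimator` is not one of the trained models from the paper and is used only so that the example runs.

```python
import random


def mean_abs_deviation(estimator, weight_vectors, modify):
    """Average |F(W) - F(modify(W))| over flattened weight vectors."""
    diffs = [abs(estimator(w) - estimator(modify(w))) for w in weight_vectors]
    return sum(diffs) / len(diffs)


# Hypothetical stand-in estimator: depends only on the mean absolute weight,
# so it is permutation invariant by construction but not scale invariant.
def toy_estimator(w):
    return min(1.0, sum(abs(v) for v in w) / len(w))


def scale_by(c):
    return lambda w: [c * v for v in w]


def global_permutation(seed):
    def modify(w):
        shuffled = list(w)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    return modify


rng = random.Random(0)
weights = [[rng.uniform(-0.5, 0.5) for _ in range(100)] for _ in range(20)]
mad_perm = mean_abs_deviation(toy_estimator, weights, global_permutation(1))
mad_scale = mean_abs_deviation(toy_estimator, weights, scale_by(0.5))
print(mad_perm, mad_scale)
```

For a trained estimator the same loop would be run once per type of modification, which gives one deviation value per modification to compare.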

The Mean Absolute Deviation (MAD) of the modifications that we tried spanned a range between 0.01 and 0.13. Scaling the weights with factors larger than 1 or permuting parameters within each of the first 3 convolutional layers leads to MADs less than 0.05. The estimator is more sensitive to permutations within the final dense layer, which leads to a MAD of 0.06. Global permutation of the entire vector $W$ (without preserving the layers) or scaling with small constants all lead to a MAD larger than 0.11. Summarizing, the estimator is not too sensitive to the order of parameters in the convolutional layers, and more sensitive to permutations within the final dense layer. The estimator is approximately invariant to scaling the weights with positive factors larger than 1 and changes its predictions significantly for factors smaller than 1.

Figure 3. Scatter plots of the networks trained on CIFAR10-GS, colored by test accuracy (best viewed in color). Bias range width ($\max - \min$) in the first layer (x-axis) and the last layer (y-axis), together with the upper-right corners zoomed in. Networks trained with Adam/RMSProp (left) and SGD (right).

4.5. Understanding observed behaviors: first steps

Seeing that very few values extracted from the weights already lead to good predictions, it is natural to ask if these predictions can be (at least partially) reduced to simple, human-interpretable rules. In informal experiments we explored different approaches such as GBM feature importances, LASSO, and univariate feature selection, but did not observe any clear and consistent signals. Thus, we manually inspect the CIFAR10-GS CNN collection.

As discussed above, we observe that the prediction works well when only considering simple statistics of the weights as inputs for the predictor. When predicting based on just one full network layer, the first and last layers of the network are most useful. Looking manually into weight statistics we noticed that the range ($\max - \min$) of the biases in the first and last layer is often correlated with the network’s accuracy. (Note that biases in the Small CNN Zoo were initialized to 0.) We do not claim any particular significance of this observation, but it can be used to generate hypotheses that could then be verified in further experiments. For example, we can visualize the networks’ performance in a 2D scatter plot using those two ranges (Figure 3). It seems surprising how well these two measurements already separate the data (at least visually).

We notice that networks trained on CIFAR10-GS with SGD (Figure 3, right) perform significantly worse than those trained with Adam/RMSProp (Figure 3, left). It is also interesting to note how the majority of the networks trained with SGD align along a line in this 2D space. Among the networks trained with Adam/RMSProp, we observe two well-separated groups: the strongly performing ones in the upper-right corner (also depicted in the zoomed-in subplot) and the ones with near-chance performance (the blue “tentacles” in the bottom part). Further analysis reveals that these two groups can be perfectly separated from each other by looking at the bias maxima in the final dense layer (not shown in the plots): the bias maxima are below 0.1 for the badly performing models (the “tentacles”) and above 0.1 for all the rest of the networks. In future work we would like to understand better what causes these “symptoms” during training and investigate ways to alleviate them.

5. Transfer to new architectures and datasets

In the previous section we showed using the Small CNN Zoo dataset that strong predictors of accuracy based on weights exist. Next we want to explore the domain shift setting introduced in Section 2.2 and study whether the predictors can handle networks trained on unobserved datasets or with different architectures. We emphasize that throughout this section the models were not fine-tuned or adjusted to the new collections in any way.

5.1. Networks trained on unobserved datasets

First we look at how the GBM models transfer across the CNN collections. Two examples of such experiments are shown in Figure 4. The figures demonstrate that the MSE of the predictions may not be the best metric to look at. The “drift” of points away from the diagonal line (which corresponds to zero MSE) is likely due to the difference in average accuracy between various datasets. Most of the networks in the MNIST collection achieve an accuracy higher than 60%, while the best accuracy for CIFAR10-GS was 55%. Nevertheless, we see that networks with higher accuracy tend to receive higher prediction values. In other words, the predictors are doing a reasonable job in ranking the networks. We can use Kendall’s rank correlation coefficient to measure the quality of ranking. It ranges from -1 (anti-ranking) to 1 (perfect ranking) and takes values around 0 for random ranking.
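The difference between MSE and rank quality can be illustrated with a small sketch. The code below implements the simplest variant of Kendall's coefficient (tau-a, without tie correction) in plain Python; the toy accuracies are made up, and libraries such as SciPy offer a tie-corrected version (`scipy.stats.kendalltau`) that may be preferable in practice.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs contribute zero to the numerator, so this may differ from
    the tie-corrected tau-b on data that contains ties.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1  # pair is ranked in the same order
        elif product < 0:
            score -= 1  # pair is ranked in the opposite order
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Toy values: predictions are shifted down by 0.3 but keep the same order.
true_acc = [0.62, 0.75, 0.91, 0.98]
predicted = [a - 0.3 for a in true_acc]

tau = kendall_tau_a(true_acc, predicted)  # ranking is preserved
error = mse(true_acc, predicted)          # MSE is large nonetheless
```

A constant shift of the predictions leaves the ranking untouched while inflating the MSE, which is the kind of "drift" away from the diagonal described above.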

Table 3 contains the values of Kendall’s coefficient for all possible transfer experiments performed on the Small CNN Zoo (2D plots similar to Figure 4 are reported in Supplementary D). The smallest coefficient of 0.6 corresponds to the transfer from SVHN-GS to the MNIST collection. It is perhaps surprising that the rank test shows such a large correlation. We want to highlight that when training CNNs on the 4 datasets we only rescale the pixel values to a fixed interval and do not perform any other standardization. We would expect the moments of the pixel values for the MNIST dataset to be very different from those of SVHN-GS, and this difference in distributions to affect the form of the filters in the convolutional layers.

Figure 4. Distribution of true/predicted test accuracies for networks from the SVHN-GS (left) and CIFAR10-GS (right) collections together with Kendall’s coefficient. Predictions were made with the GBM models trained on CIFAR10-GS (left) and MNIST (right) collections using

Table 3. Kendall’s rank correlation between GBM model’s predictions and true test accuracies. The GBM model trained using layer statistics as inputs on one CNN collection (rows) was used to make the predictions on the other (columns).

5.2. Networks trained with different architecture

In this section we want to test whether predictors trained on the Small CNN Zoo can rank larger, over-parametrized networks, capable of overfitting. For this purpose we will use the DEMOGEN collection (Jiang et al., 2019), which contains 216 Wide-ResNet32 models (He et al., 2016) trained on the original (colored) CIFAR10 dataset with the best models achieving 100% training and 93% test accuracy. The collection contains 72 networks for each of the three different architectures: ResNet32x1, ResNet32x2, and ResNet32x4, which differ in the number of filters.

In Section 4.4 we discovered that weight statistics computed for the final layer of the CNN provide a strong signal for predicting a network’s test accuracy. Using these features on the CIFAR10-GS collection, GBM achieves performance that is very close to the best model overall (Table 2). Because this feature vector has the same dimension for CNNs of any architecture, we can use estimators trained on the Small CNN Zoo to make predictions for the ResNet32 models from the DEMOGEN collection.
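To illustrate why such features transfer across architectures, here is a minimal sketch that maps a layer of arbitrary size to a fixed-length vector. The particular set of statistics (mean, variance, minimum, quartiles, maximum) and the function names are assumptions made for this example; the exact statistics are those defined in the main text.

```python
import statistics


def layer_statistics(values):
    """Map a flat list of parameters to a fixed-length feature vector.

    Features: mean, variance, min, 25%/50%/75% quantiles, max.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [
        statistics.fmean(values),
        statistics.pvariance(values),
        min(values),
        q1,
        median,
        q3,
        max(values),
    ]


def final_layer_features(kernel, bias):
    """Concatenate the statistics of one layer's kernel and bias."""
    return layer_statistics(kernel) + layer_statistics(bias)


# Two hypothetical final layers of very different sizes (values are made up).
small = final_layer_features(
    kernel=[0.1 * i for i in range(160)],
    bias=[0.0, 0.2, -0.1, 0.4],
)
large = final_layer_features(
    kernel=[0.01 * i for i in range(6400)],
    bias=[0.05 * i for i in range(10)],
)
```

Both calls return vectors of the same length, so a predictor trained on one architecture can be applied to the other without any change to its input layer.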

Table 4. Kendall’s coefficients between predictions and training/test accuracies of the ResNet32 models from the DEMOGEN dataset across different widths (columns). Predictions are made with the GBM model trained on the CIFAR10-GS collection from the Small CNN Zoo using the weight statistics of the final layer. The GBM model is able to rank large ResNet models according to their accuracy, despite having been trained on small CNN architectures on different datasets.

Table 4 reports the coefficients, which show how well the predictions of the GBM model trained on the CIFAR10-GS CNN collection correlate with the actual accuracies of the networks from DEMOGEN. We compare to both train and test accuracies because, as discussed in Section 4.1, for the Small CNN Zoo dataset there is no relevant difference between the two and we do not really know which of them the GBM model predicts. As a reference we also report the coefficients obtained when using the train accuracy as a proxy for the test one (or vice versa).

All the numbers are significantly larger than zero, indicating that the predictor’s ranking is far from being random. The predictions seem to correlate slightly better with train accuracy than with test. This hints that the predictors trained on the Small CNN Zoo may be using the train accuracy as a shortcut while predicting the test one. The ranking coefficient between the train and test accuracies decreases with network size, which points to increasing overfitting.

To further verify that our findings hold for other architectures, we show in Supplement E that they also apply to Multi-Layer Perceptrons.

6. Conclusions and future directions

We demonstrated that it is possible to predict the performance of a DNN using only its weights (or simple statistics thereof) as inputs. Surprisingly, these predictions are able to rank networks trained on unobserved natural image datasets or with different, larger architectures. Whether these predictions can be reduced to simple human-interpretable rules, and whether they can help to improve DNN training, remain important open questions. It also remains to be explored whether our findings transfer to domains outside of CNNs, e.g. to architectures commonly used in natural language understanding, reinforcement learning, or unsupervised applications.

Our work only used off-the-shelf regression algorithms (GBM and fully-connected DNNs) to predict the network accuracy from its weights. In the future it seems natural to try methods with stronger inductive biases. For instance, using the Deep Sets approach (Zaheer et al., 2017) to account for the invariance of CNNs w.r.t. the order of the filters and channels could allow us to get even better performance in practical applications, or yield better insights.
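The following sketch illustrates the kind of invariance such an approach would provide. It is not a method used in this work: the per-filter embedding `phi` is hand-picked here (it would be learned in a real Deep Sets model), and all names and values are hypothetical.

```python
import math
import random


def phi(filter_weights):
    """Per-filter embedding (hand-picked here; learned in a real model)."""
    return [
        math.fsum(filter_weights),
        math.fsum(w * w for w in filter_weights),
    ]


def deep_set_features(filters):
    """Sum-pool the per-filter embeddings, so filter order cannot matter."""
    embeddings = [phi(f) for f in filters]
    return [math.fsum(column) for column in zip(*embeddings)]


# Toy layer with three filters (values are made up), and a shuffled copy.
filters = [[0.1, -0.2, 0.3], [0.5, 0.5, -1.0], [0.0, 0.7, 0.2]]
shuffled = filters[:]
random.Random(0).shuffle(shuffled)

original_features = deep_set_features(filters)
shuffled_features = deep_set_features(shuffled)
```

Because the embeddings are pooled with a sum, reordering the filters leaves the features unchanged, whereas a predictor operating on the flattened weight vector has to learn this invariance from data.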

We believe our findings open the door to a number of interesting further questions. The idea that most neural networks contain a highly efficient sub-network, the “lottery ticket hypothesis” (Frankle & Carbin, 2019), recently gained a lot of attention. Morcos et al. (2019) show that these sub-networks transfer across tasks and datasets. An interesting avenue for future research would be to see if a trained classifier is able to identify these sub-networks (or other related properties) from the weights used to initialize a network.

Finally, we share a large dataset of trained CNNs in the hope that this will enable the community to further explore this interesting direction of research.

Acknowledgements

We are thankful to Ibrahim Alabdulmohsin, Iuliya Beloshapka, Samy Bengio, Lucas Beyer, Alexey Dosovitskiy, Pierre Foret, Yiding Jiang, Alexander Kolesnikov, Dilip Krishnan, Hossein Mobahi, Shay Moran, Behnam Neyshabur, Sebastian Nowozin, Paul Rubenstein, Hanie Sedghi, Jakob Uszkoreit, and Scott Yak for valuable discussions.

References

Baker, B., Gupta, O., Raskar, R., and Naik, N. Accelerating neural architecture search using performance prediction. In NeurIPS Meta Learning Workshop, 2017.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 2017.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance tradeoff. arXiv/1812.11118, 2018.

DeChant, C., Han, S., and Lipson, H. Predicting the accuracy of neural networks from final and intermediate layer outputs. In ICML Deep Learning Phenomena Workshop, 2019.

Domhan, T., Springenberg, J. T., and Hutter, F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015.

Eilertsen, G., Jönsson, D., Ropinski, T., Unger, J., and Ynnerman, A. Classifying the classifier: dissecting the weight space of neural networks. arXiv/2002.05688, February 2020. Accepted for ECAI 2020.

Elsayed, G. F., Krishnan, K., Mobahi, H., Regan, K., and Bengio, S. Large margin deep networks for classification. arXiv/1803.05598, 2018.

Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In ECCV, 2016.

Istrate, R., Scheidegger, F., Mariani, G., Nikolopoulos, D., Bekas, C., and Malossi, A. C. I. TAPAS: Train-less accuracy predictor for architecture search. In AAAI, 2019.

Jiang, Y., Krishnan, D., Mobahi, H., and Bengio, S. Predicting the generalization gap in deep networks with margin distributions. In ICLR, 2019.

Jiang, Y., Neyshabur, B., Krishnan, D., Mobahi, H., and Bengio, S. Fantastic generalization measures and where to find them. In ICLR, 2020.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In NeurIPS, 2017.

Kingma, D. P. and Lei, J. Adam: A method for stochastic optimization. arXiv/1412.6980, 2014.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. AT&T Labs [Online], 2010.

Lin, M., Chen, Q., and Yan, S. Network in network. In ICLR, 2014.

Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. arXiv/1810.01075, 2018.

Martin, C. H. and Mahoney, M. W. Heavy-tailed universality predicts trends in test accuracies for very large pretrained deep neural networks. arXiv/1901.08278, 2020.

Martin, C. H., Peng, T., and Mahoney, M. W. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. arXiv/2002.06716, 2020.

Morcos, A., Yu, H., Paganini, M., and Tian, Y. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In NeurIPS, 2019.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In COLT, 2015.

Saxe, A., McClelland, J., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv/1312.6120, 2014.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: from theory to algorithms. Cambridge University Press, 2014.

Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR Workshops, 2015.

Streeter, M. Learning effective loss functions efficiently. arXiv/1907.00103, 2019a.

Streeter, M. Learning optimal linear regularizers. In ICML, 2019b.

Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw Bayesian optimization. arXiv/1406.3896, 2014.

Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer, 2008.

Vapnik, V. Statistical Learning Theory. Wiley & Sons, 1998.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv/1708.07747, 2017.

Yak, S., Gonzalvo, J., and Mazzawi, H. Towards task and architecture-independent generalization gap predictors. In ICML Understanding and Improving Generalization in Deep Learning Workshop, 2019.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In NeurIPS, 2017.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

Zhang, C., Bengio, S., and Singer, Y. Are all layers created equal? In ICML Deep Learning Phenomena Workshop, 2019.

A. Further details on the Small CNN Zoo dataset and experiments

This section contains details on the way the Small CNN Zoo was generated and on the results of training the accuracy predictors reported in Tables 1 and 2 of the main text.

A.1. Base CNNs: architecture

All CNN models share the same architecture: 3 hidden convolutional layers with 16 filters each, followed by global average pooling and a final dense layer. Dropout is applied to every convolutional layer. Regularization is applied to all layers. For exact details refer to the code at https://github.com/google-research/google-research/tree/master/dnn_predict_accuracy.

A.2. Base CNNs: hyperparameters for training

For each dataset, we sample 30k different hyperparameter configurations of the CNN training:

• Optimizer is chosen uniformly from one of the following: vanilla SGD optimizer, Adam optimizer (Kingma & Lei, 2014), and RMSProp optimizer;

• Learning rate is sampled log-uniformly from ;

• is sampled log-uniformly from ;

• Dropout rate is sampled uniformly from [0, 0.7];

• Variance of weight initializer is sampled log-uniformly from ;

• Type of weight initializer is chosen uniformly from one of the following: Xavier normal (Glorot & Bengio, 2010), He normal (He et al., 2015), orthogonal (Saxe et al., 2014), normal, and truncated normal;

• Biases are initialized with zeros;

• Activation function is chosen uniformly from ReLU and hyperbolic tangent;

• Fraction of training examples to use is sampled uniformly from {0.1, 0.25, 0.5, 1.0};

• We never used the same hyperparameter configuration with several different random seeds.
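The sampling scheme described in the list above can be sketched as follows. This is an illustration only: the three log-uniform ranges are placeholders chosen for the example (the actual ranges are not reproduced here), and the dictionary keys and the helper `sample_log_uniform` are hypothetical names.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Sample x in [low, high] such that log(x) is uniformly distributed."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one training configuration for a base CNN."""
    return {
        "optimizer": rng.choice(["sgd", "adam", "rmsprop"]),
        # PLACEHOLDER range, not the value used for the Small CNN Zoo.
        "learning_rate": sample_log_uniform(rng, 1e-4, 1e-1),
        # PLACEHOLDER range, not the value used for the Small CNN Zoo.
        "regularization": sample_log_uniform(rng, 1e-8, 1e-2),
        "dropout": rng.uniform(0.0, 0.7),
        # PLACEHOLDER range, not the value used for the Small CNN Zoo.
        "init_variance": sample_log_uniform(rng, 1e-3, 1.0),
        "initializer": rng.choice(
            ["xavier_normal", "he_normal", "orthogonal",
             "normal", "truncated_normal"]
        ),
        "activation": rng.choice(["relu", "tanh"]),
        "train_fraction": rng.choice([0.1, 0.25, 0.5, 1.0]),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
configs = [sample_config(rng) for _ in range(5)]
```

Sampling log-uniformly spreads the draws evenly across orders of magnitude, which is the usual choice for scale-like quantities such as learning rates.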

A.3. Accuracy predictors: types of the models

We use three types of predictors: logit-linear models, gradient boosted machines using decision trees (GBM), and fully-connected ReLU networks (DNN).

The logit-linear model takes the output of the linear model $\langle w, x \rangle + b$ and transforms it with the sigmoid function $\sigma(z) = 1 / (1 + e^{-z})$. Here $\langle \cdot, \cdot \rangle$ denotes the inner product. We use a logit-linear (instead of a plain linear) model because the targets (test accuracies) are in [0, 1], and in preliminary experiments we did not observe a linear model that achieved better performance. We train the parameters $w$ and $b$ with mini-batch SGD/Adam, varying the learning rate, batch size, initialization, and regularization.
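A minimal sketch of the logit-linear prediction is given below. The symbol names `w`, `b`, and `x` (weight vector, bias, and input features) are assumptions made for this example, as are the toy numbers.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit_linear_predict(w, b, x):
    """Predicted accuracy: sigmoid of the affine map <w, x> + b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy example: <w, x> + b = 0.5 * 2.0 - 1.0 * 1.0 + 0.0 = 0.0
prediction = logit_linear_predict(w=[0.5, -1.0], b=0.0, x=[2.0, 1.0])
```

The sigmoid keeps every prediction inside [0, 1], matching the range of the accuracy targets.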

We use LightGBM (Ke et al., 2017) to train the GBM model and vary the number of leaves and the maximum depth of the trees, the learning rate, the regularization, and the parameters for subsampling features and examples.

We use a feed-forward fully-connected architecture for the DNN model with ReLU activations and a sigmoid transform on the output. We train with mini-batch SGD/Adam, varying the learning rate, the number of layers and their width, regularization, initialization type and variance, and batch size.

A.4. Accuracy predictors: hyperparameters for training

For each of the 3 types of predictors and each of the 4 CNN collections we perform hyperparameter selection by evaluating 1k unique configurations:

• For the GBM accuracy predictor we use the following protocol. Refer to the LightGBM documentation for the exact meaning of the parameters:

– max_depth is sampled uniformly from [5, 15];

– learning_rate is sampled log-uniformly from ;

– max_bin is sampled uniformly from ;

– min_child_weight is sampled uniformly from {1, 2, 3, 4, 5};

– reg_lambda is sampled uniformly from ;

– reg_alpha is sampled uniformly from ;

– subsample is sampled uniformly from {0.1, 0.2, ..., 0.9, 1};

– subsample_freq is set to 1;

– colsample_bytree for the high-dimensional inputs (all weights W, the weights of the second and third convolutional layers, and the concatenation of all weights with the hyperparameters) is sampled log-uniformly from ; for lower-dimensional inputs it is sampled uniformly from [0.7, 1];

• For the DNN accuracy predictor we use the following protocol:

– Number of layers is sampled uniformly from {3, 4, ..., 9};

– Number of units is sampled uniformly from {256, 257, ..., 511};

– ReLU activation is used for all models;

– Dropout rate is sampled uniformly from [0, 0.2];

– is sampled log-uniformly from ;

– Learning rate is sampled log-uniformly from ;

– Variance of weight initializer is sampled log-uniformly from ;

– Optimizer is chosen randomly from Adam and SGD;

– Batch size is sampled uniformly from {64, 128, 256, 512};

– Biases are initialized with zeros;

– Type of weight initializer is chosen uniformly from one of the following: Xavier normal (Glorot & Bengio, 2010), He normal (He et al., 2015), orthogonal (Saxe et al., 2014), normal, and truncated normal;

– Sigmoid transform is applied to the final layer output.

• For the logit-linear predictor we use the same protocol as for the DNN predictor, setting the number of layers to zero and applying regularization to the final dense layer.

A.5. Accuracy predictors: detailed empirical results

Tables 5 and 6 contain both $R^2$ scores and MSE values for all three types of predictors trained on all four CNN collections. Standard deviations capture the variability when training the predictors on three folds of the cross-validation. Every entry in Tables 5 and 6 is obtained by evaluating 1k hyperparameter configurations of the accuracy predictor (as described in Section A.4) and picking the best one using 3-fold cross-validation. The best configuration is then evaluated on the holdout test split of the CNN collection, and the resulting numbers are reported in the tables.
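The selection protocol can be sketched as follows. This is a simplified illustration under stated assumptions: the fold assignment is interleaved rather than random, the selected configuration is refit on the full training split before the holdout evaluation, and `toy_fit` is a hypothetical stand-in for training a real predictor.

```python
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k):
    """Split range(n) into k interleaved validation folds."""
    return [list(range(i, n, k)) for i in range(k)]


def select_and_evaluate(configs, fit, train, holdout, k=3):
    """Pick the config with the best mean k-fold R^2, then score on holdout."""
    xs, ys = train
    folds = k_fold_indices(len(xs), k)

    def cv_score(config):
        scores = []
        for fold in folds:
            held = set(fold)
            fit_x = [x for i, x in enumerate(xs) if i not in held]
            fit_y = [y for i, y in enumerate(ys) if i not in held]
            model = fit(config, fit_x, fit_y)
            scores.append(
                r2_score([ys[i] for i in fold], [model(xs[i]) for i in fold])
            )
        return fmean(scores)

    best = max(configs, key=cv_score)
    final_model = fit(best, xs, ys)  # assumption: refit on all training data
    hold_x, hold_y = holdout
    return best, r2_score(hold_y, [final_model(x) for x in hold_x])


def toy_fit(config, xs, ys):
    """Hypothetical 'training': the config is simply the slope of the model."""
    return lambda x: config * x


train = ([float(i) for i in range(9)], [2.0 * i for i in range(9)])
holdout = ([10.0, 11.0, 12.0], [20.0, 22.0, 24.0])
best, holdout_r2 = select_and_evaluate([1.0, 2.0, 3.0], toy_fit, train, holdout)
```

The holdout split is touched only once, after the configuration has been chosen, so the reported score is not biased by the hyperparameter search.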

Tables 5 and 6 do not contain results for several input types, including those based on norms, on weight statistics computed for subsets of layers, and on the learning rate. Results for these input types are reported in Tables 7 and 8.

B. GBM importance plots

Figure 5 presents importance values for various entries of the weight vector W when training the GBM accuracy predictor. Importance values reported in the figure are based on the number of times a single feature (a particular entry of the vector W in our case) was chosen in the nodes of the trees. Higher numbers correspond to more important (more frequently used) features. We see that all four models make extensive use of parameters of the final dense layer. Among those, biases seem to be slightly more important than weights.
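Split-count importance of this kind can be sketched in a few lines. The nested-dictionary tree format and the toy ensemble below are hypothetical and serve only to illustrate the counting; LightGBM exposes the same quantity as its "split" importance type.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used in a split node."""
    if "feature" not in node:  # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_importance(trees):
    """Return a Counter mapping feature name to number of splits."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return counts


# Toy ensemble (made up); feature names follow the style used in Figure 5.
leaf = {"value": 0.0}
trees = [
    {
        "feature": "L4-B7",
        "left": leaf,
        "right": {"feature": "L4-W3", "left": leaf, "right": leaf},
    },
    {"feature": "L4-B7", "left": leaf, "right": leaf},
]
importance = split_importance(trees)
```

Note that this measure counts how often a feature is used, not how much each split reduces the loss, so frequently used features are not necessarily the most informative ones.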

Table 5. $R^2$ scores (together with standard deviations) for predicting test accuracies of CNNs trained on various datasets (blocks) with various models (rows) using different input features (columns). Standard deviations capture the variability when training the models on 3 different folds of the cross-validation. “Lin” refers to the logit-linear model. See main text for the descriptions of input features.

Table 6. MSE (together with standard deviations) for predicting test accuracies of CNNs trained on various datasets (blocks) with various models (rows) using different input features (columns). Standard deviations capture the variability when training the models on 3 different folds of the cross-validation. “Lin” refers to the logit-linear model. See main text for the descriptions of input features.

Table 7. $R^2$ scores (together with standard deviations) for predicting test accuracies of CNNs trained on various datasets (blocks) with various models (rows) using different input features (columns). Standard deviations capture the variability when training the models on 3 different folds of the cross-validation.

Table 8. MSE (together with standard deviations) for predicting test accuracies of CNNs trained on various datasets (blocks) with various models (rows) using different input features (columns). Standard deviations capture the variability when training the models on 3 different folds of the cross-validation.

Figure 5. LightGBM feature importance values based on the number of times the feature appeared in the trees. The four plots correspond to GBM predictors trained on the four CNN collections using entire weight vectors W as inputs. “L” in feature names refers to the layer, “W” to the (filter) weights, “B” to the biases. For example, “L4-B7” is the 7th bias parameter of the final dense layer and “L1-W123” is the 123rd filter weight parameter of the first convolutional layer.

C. Permutation and scale invariance

This section contains the results of a study on how an accuracy predictor’s outputs change as we modify its inputs. For a given accuracy predictor trained using weight vectors W as inputs, we test several ways of modifying its inputs:

1. Globally permuting the elements of W;

2. Permuting the order of parameters within each layer of W;

3. Permuting the order of parameters within all three convolutional layers of W;

4. Permuting the order of parameters in the final dense layer;

5. Multiplying all elements of W by a constant c > 0.

For every type of permutation we try two options: (a) permuting biases and weights jointly, allowing them to mix and (b) permuting biases and weights separately, without mixing them.

We test these modifications with the GBM predictor trained using weight vectors W as inputs on the CIFAR10-GS CNN collection, which has an $R^2$ score of 0.97. We use uniformly sampled random permutations and scale factors c. For every type of modification we take the absolute difference between the predictions on the modified and the original CNN. We then average these differences across 1000 CNNs W from the test split of the CIFAR10-GS CNN collection. Results are reported in Table 9.
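The sensitivity measure described above can be sketched as follows. The predictor `toy_predict` is a hypothetical stand-in (it is not the GBM model), the networks are random toy data, and only two of the listed modifications are shown.

```python
import random
from statistics import fmean


def permute_within_layers(layers, rng):
    """Shuffle parameters inside each layer, keeping layer boundaries."""
    return [rng.sample(layer, len(layer)) for layer in layers]


def scale_all(layers, c):
    """Multiply every parameter by the constant c > 0."""
    return [[c * w for w in layer] for layer in layers]


def sensitivity(predict, networks, modify):
    """Mean absolute change of the prediction under an input modification."""
    return fmean(abs(predict(modify(net)) - predict(net)) for net in networks)


def toy_predict(layers):
    """Stand-in predictor: depends only on the range of the last layer."""
    last = layers[-1]
    return min(1.0, max(last) - min(last))


rng = random.Random(0)
# 100 toy "networks", each with two layers of random parameters.
networks = [
    [[rng.gauss(0, 1) for _ in range(8)], [rng.gauss(0, 0.1) for _ in range(4)]]
    for _ in range(100)
]

perm_effect = sensitivity(
    toy_predict, networks, lambda net: permute_within_layers(net, rng)
)
scale_effect = sensitivity(toy_predict, networks, lambda net: scale_all(net, 2.0))
```

This toy predictor is permutation-invariant by construction but not scale-invariant; a predictor trained on raw weight vectors has no such built-in guarantee, which is what the study in Table 9 measures.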

Table 9. Sensitivity of the GBM predictor trained using weights W on the CIFAR10-GS collection w.r.t. various modifications of its inputs.

D. Detailed results on the transfer experiments

Figure 6 contains the results for all possible transfer experiments performed on Small CNN Zoo as described in Section 5.1. Diagonal plots correspond to the holdout test evaluation of four GBM models.

Figure 6. Distribution of true/predicted test accuracies for networks from different CNN collections (columns) together with Kendall’s coefficients. Predictions were made with the GBM models trained on different CNN collections (rows, same order) using

Table 10. Kendall’s coefficients between predictions and test accuracies on networks of different sizes and/or datasets. 80% of the data was used for training and 20% for evaluation. Subscript u indicates that the estimator was trained on networks with hidden layers of size u.

Table 11. Kendall’s coefficients between predictions and test-accuracies on MLPs of different sizes and/or datasets. For each row, we show the results of training on a given dataset/MLP-size and evaluating on all other sizes. Subscript u indicates that the estimator was trained on networks with hidden layers of size u.

E. Results on Multi-Layer Perceptrons

To verify that our results do not apply only to CNNs, we performed experiments on fully connected Multi-Layer Perceptrons (MLPs). We trained 10k MLPs each on CIFAR10 and SVHN, using the same hyperparameters as in the Small CNN Zoo (see Section A.2), except that we also sampled the number of hidden units in each layer to be either 8, 16, 32, or 64. This gave us four different neural network sizes for each dataset. To save computation time and verify that our observations also hold with different estimators, we used a Random Forest estimator with 32 trees in all of the experiments. The estimator was trained on networks from one specific dataset and hidden-unit size, and evaluated on all other settings. We used the same weight statistics as in the main text as input features. Table 10 shows the results of this experiment. We then verified that these results transfer across network architectures and datasets; the resulting Kendall’s coefficients are listed in Table 11. Together, these results confirm our CNN findings, namely that it is possible to predict the performance of an MLP based on its weights, and that this prediction transfers across models of different sizes as well as across datasets.