b

DiscoverSearch
About
My stuff
A Variational-Sequential Graph Autoencoder for Neural Architecture Performance Prediction
2019·arXiv
Abstract
Abstract

In computer vision research, the process of automat- ing architecture engineering, Neural Architecture Search (NAS), has gained substantial interest. In the past, NAS was hardly accessible to researchers without access to large-scale compute systems, due to very long compute times for the recurrent search and evaluation of new candidate architectures. The NAS-Bench-101 dataset facilitates a paradigm change towards classical methods such as supervised learning to evaluate neural architectures. In this paper, we propose a graph encoder built upon Graph Neural Networks (GNN). We demonstrate the effectiveness of the proposed encoder on NAS performance prediction for seen architecture types as well an unseen ones (i.e., zero shot prediction). We also provide a new variational-sequential graph autoencoder (VS-GAE) based on the proposed graph encoder. The VS-GAE is specialized on encoding and decoding graphs of varying length utilizing GNNs. Experiments on different sampling methods show that the embedding space learned by our VS-GAE increases the stability on the accuracy prediction task.

Deep learning using convolutional neural architectures has been the driving force of recent progress in computer vision and related domains. Multiple interdependent aspects such as the increasing availability of training data and compute resources are responsible for this success. Arguably, none has had as much impact as the advancement of novel neural architectures [19, 8]. Thus, the focus of computer vision research has shifted from a feature engineering process to an architecture engineering process. The urge to automate this process using machine learning techniques is a direct consequence. Neural Architecture Search (NAS) [5] attends to techniques automating architecture engineering. Due to very long compute times for the recurrent search and evaluation of new candidate architectures for example in proposed genetic algorithms or reinforcement learning [38], NAS research has hardly been accessible for researchers without access to large-scale compute systems.

Yet, the publication of NAS-Bench-101 [34], a dataset of over 423k fully trained neural architectures, facilitates a paradigm change in NAS research. Instead of carefully evaluating each new proposed neural architecture, NAS-Bench-101 enables to experiment with classical data-based methods such as supervised learning to evaluate neural architectures. In this paper, we tackle the task of learning to predict the accuracy of convolutional neural architectures in a supervised way using continuous representations of neural architectures. We further demonstrate that such embeddings can be employed for the reconstruction and generation of neural architectures. In combination, these two abilities allow to sample neural architectures at a given target accuracy.

Most current neural architectures for computer vision can be represented as directed, acyclic graphs (DAGs). Thus, we base our approach on Graph Neural Networks. Graph Neural Networks (GNNs) [31] have proven very powerful comprehending local node features and graph substructures. This makes them a very useful tool to embed nodes as well as full graphs like the NAS-Bench-101 architectures into continuous spaces.

Within this setting, finding new neural architectures translates into generating new graphs. From the few existing graph generating models, sequential approaches like [35] or [21] are very promising. The model Deep Generative Models of Graphs (DGMG) in [21] utilizes GNNs and shows dominance over Recurrent Neural Network (RNN) methods. In natural language processing, it is common to use RNNs on an encoder-level as well as on a decoder-level. Inspired by this approach, we extend the ideas of the DGMG model to build a Variational-Sequential Graph Autoencoder (VS-GAE), a variational autoencoder [15] that utilizes GNNs on the encoder-level and decoder-level simultaneously. To the best of our knowledge, we propose the first graph autoencoder that is built on GNNs and acts on graphs of different sizes. This makes it a powerful tool

to handle neural architectures.

Having a densely sampled and fully evaluated search space at hand, such as provided by NAS-Bench-101, is a luxury that is rarely given for real-world problems. Thus, sampling methods play an important role when working on neural architectures. While current sampling driven approaches to NAS [10] rely on uniform or equidistant sampling in the discrete search space defined through the edit distance, we argue that such a space is artificial by nature and does not reflect the similarities between architectures appropriately. If for example a  3×3convolution is replaced by a  5 × 5convolution, the distance is the same as if it were replaced by a pooling operation. We thus propose to sample unseen neural architectures from our learned VS-GAE 1 embedding space as an alternative and show increased stability on the accuracy prediction task. In summary, we make the following contributions:

We present a graph encoder built on GNNs and adjusted to the NAS-Bench-101 neural architectures. We outline the benefit of this graph encoder for performance prediction on the one hand and for modeling a graph decoder on the other hand.

We provide VS-GAE, a new variational-sequential graph autoencoder. VS-GAE is specialized on the encoding and decoding of graphs of varying length utilizing GNNs. This makes it a strong tool for designing new architectures and generating a meaningful graph latent space.

We outline different sampling methods. Using VSGAE, we introduce a method to sample in the graph latent space. We compare this new sampling approach with existing methods based on the discrete graph space and show increased stability.

The remaining paper is structured as follows: Section 2 gives a short review of the related work. In Section 3, we further discuss GNNs and graph generating models. In Sections 4 - 6, we present our proposed model. We introduce the encoder, our model VS-GAE and the sampling methods. In Section 7, we present our experiments and results. Finally, we give a conclusion and outline some future directions in Section 8.

Neural Architecture Search (NAS) [37, 29, 38, 23, 28], the process of designing neural network architectures in an automatic way, gained substantial attention recently. See [5] for an overview and detailed survey over recent NAS methods. The currently most successful approaches follow different paradigms: Reinforcement learning (RL) [37, 38, 28] as a NAS strategy considers the neural architecture generation as the agent’s action with it’s reward given in terms of validation accuracy. Evolutionary Algorithm (EA) [29, 22] approaches optimize the neural architectures themselves by guiding the mutation of architectures and evaluating their fitness given in terms of validation accuracy. Bayesian optimization (BO) [14] derive kernels for architecture similarity measurements to extrapolate the search space. Gradient based methods [23, 24] use continuous relaxations of neural architectures to allow for gradient-based optimization.

NAS-Bench-101 [34] is a public dataset of neural architectures in a restricted cell structured search space [38] evaluated on the CIFAR-10-classification set [18]. NAS-Bench-101 considers the following constraints to limit the search space: it only considers directed acyclic graphs, the number of nodes is limited to  |V | ≤ 7, the number of edges is limited to  |E| ≤ 9and only 3 different operations are allowed {3 × 3 convolution, 1 × 1 convolution, 3 × 3 max − pool}. These restrictions lead to a total of 423k unique convolutional architectures. The architectures have been trained for four increasing numbers of epochs {4, 12, 36, 108}. Each of these architectures is mapped to its validation and training measures. In this paper we use the architectures trained for 108 epochs and their corresponding validation accuracy.

The idea of Graph Neural Networks as an iterative process which propagates the node states until an equilibrium is reached, was initially mentioned 2005 in [9]. Years later when CNNs became popular, [4] and [12] defined graph convolutions in the Fourier domain by utilizing the graph Laplacian. The modern interpretation of GNNs was first mentioned in [26] and [16] where node information was inductively updated through aggregating information of each node’s neighborhood. This approach was further specified and generalized by [11] and [7].

Existing graph generating models can roughly be classi-fied in global approaches and sequential approaches. Global approaches output the full graph at once usually by relaxing the adjacency matrix [17, 30]. The sequential approach is an iterative process of adding nodes and edges alternately. [24] used RNNs to generate neural architectures in this sequential manner. The model in [35] introduced a second edge-level RNN capturing the edge dependencies. [36] created and used an asynchronous message passing scheme instead of RNNs to decode the computations of neural architectures. In contrast, the model in [21] utilizes the synchronous message passing scheme as known from GNNs for the sequential graph generation.

Variational autoencoders (VAE) [15] allow to learn a recognition model  qφ(z|G), i.e., the encoder, and a generative model  pθ(G|z), i.e., the decoder, jointly, where  φand θare the outputs of the encoder and decoder, respectively. Note, the VAE learns the parameters of the data’s probabil- ity distribution instead of a lower dimensional representation of the input data.

The VAE learns by maximizing a lower bound estimator on the log-likelihood log  pθ(G):

image

From the trained VAE model new data can be generated by sampling z  ∼ pθ(z)and decode it by means of  pθ(G|z).

Combining modern machine learning methods with graph structured data has increasingly gaining popularity. And rightly so; one can interpret it as an extension of deep learning techniques to non-Euclidean data [3] or even as inducing relational biases within deep learning architectures to enable combinatorial generalization [2]. Because of the discrete nature of graphs, they can not trivially be optimized in differentiable learning methods that act on continuous spaces. In this paper, we address this problem. We want to use continuous methods to handle the graphs characterizing neural architectures from the NAS-Bench-101 dataset.

3.1. From Discrete to Continuous

The research in GNNs enabled breakthroughs in multiple areas related to graph analysis such as computer vision [32, 20, 33], natural language processing [1], recommender systems [25], chemistry [7] and others. The capability of GNNs to accurately model dependencies between nodes makes them the foundation of our research. We utilize them to move from the discrete graph space to the continuous space and vice versa.

3.2. From Continuous to Discrete

Generating new graphs particularly new neural architectures is an ambitious task that has to overcome multiple fundamental challenges. The main focus lies in the highly variable graph search space of NAS-Bench-101 and the complex dependencies within a single graph.

Global approaches like [17] or [30] are restricted to a fixed and small number of nodes since they utilize relaxations of the adjacency matrix which is by nature quadratic in the number of nodes. [24] were the first to use RNNs to generate neural architectures in a sequential manner. Their model acts on graphs with a fixed number of nodes and lacks the capability to induce the complex graph structures. These issues were partially addressed in [35]. This model acts on graphs of variable sizes and utilizes a second edge-level RNN to capture the edge dependencies. In [21], the superiority of using GNNs over RNNs during the graph generating process was expressed.

image

Figure 1: Illustration of the graph encoding process: The node-level propagation using T rounds of bidirectional message passing (left) and the graph-level aggregation into a single graph embedding  hG(right).

Our model can be interpreted as an extension of the conditional version of [21]. To the best of our knowledge, our model is the first that utilizes GNNs on the encodinglevel as well as the decoding-level obtaining a variational-sequential graph autoencoder acting on graphs of different sizes. The model in [36] is also related to our work; unlike our model, it acts on a fixed number of nodes. Also in contrast to our method, [36] built a model on an asynchronous message passing scheme that encodes computations instead of graph structures.

In this section we present our GNN-based model to encode the discrete graph space of NAS-Bench-101 into a continuous vector space. One can imagine a single GNN iteration as a two-step procedure. First, each node sends out a message to its neighbors alongside its edges. Second, each node aggregates all incoming messages to update itself. After a final amount of these iteration steps, the individual node embeddings are aggregated into a single graph embedding.

4.1. Node-Level Propagation

Let G = (V, E) be a graph with nodes  v ∈ Vand edges e ∈ E ⊆ V ×V. We denote  N(v) = {u ∈ V | (u, v) ∈ E}and  N out(v) = {u ∈ V | (v, u) ∈ E}as the directed neighborhoods of a node  v ∈ V. For each node  v ∈ V, we associate an initial node embedding  hv ∈ Rdn. In our experiments we use a learnable look-up table based on the node types. Propagating information through the graph can be seen as an iterative message-passing process

image

with a differentiable message module  M (t)in (2), a differentiable update module  U (t)in (3) and a differentiable, permutation invariant aggregation function  Ξ. The message module  M (t)is illustrated by the green arrows in Figure 1 (left). To address the directed nature of the NAS-Bench-101 graphs, we add a reverse message module

image

This is outlined in Figure 1 (left) by the red arrows and leads to so-called bidirectional message passing. The update module  U (t)utilizes each node’s incoming messages to update that node’s embedding from  h(t−1)v to h(t)v.

Exploring many different choices for the message and update modules experimentally, we find that the settings similar to [21] work best for our needs. We pick a concatenation together with a single linear layer for our message modules. The update module consists of a single gated recurrent unit (GRU) where  h(t−1)vis treated as the hidden state. For the aggregation function, we choose the sum. To increase the capacity of our model, on the one hand, we apply multiple rounds of propagation and on the other hand, we use a different set of parameters for each round.

4.2. Graph-Level Aggregation

After the final round of message-passing, the propagated node embeddings  h = (hv)v∈Vare aggregated into a single graph embedding  hG ∈ Rdg, where

image

We obtain good results by using a linear layer combined with a gating layer that adjusts each node’s fraction in the graph embedding. This aggregation layer A in (6) is further illustrated in Figure 1 (right). In case that variational outputs are required, we interpret  hGas the mean and add an extra graph aggregation layer ˜Ain (7) which outputs the variance  hvarG.

In this section we present our Variational-Sequential Graph Autoencoder (VS-GAE). The VS-GAE is composed of the variational version of the graph-based encoder from Section 4 and a graph generating decoder using a sequential process. The encoder  qφ(z|G)takes graph G with node labels as input and outputs a prior distribution p(z) over the latent space in line with common variational autoencoders. The decoder  pθ(G|z)takes a sampled point z from this latent space p(z) as input and generates a graph iteratively as a sequence of operations that add new nodes and edges until the end/output node is generated. Note that the sampled point z contains a summary of the original graph G.

5.1. Graph Generating Process

The decoder consists of multiple modules, mainly describing a distribution over the outcomes of a specific step in the generating sequence. Each module utilizes for each iteration t one or multiple of the following inputs:

image

Note that the learnable embedding look-up table L is independent of the one in Section 4. Correspondent to the NAS-Bench-101 graphs, we begin the iteration with the start/input node that receives an initial node embedding h(0) = (h0)according to the sampled point and the look-up table. With these preparations we can represent the full graph generating process through iterating over the following modules. Such iteration step can be tracked module by module through following Figure 2. Note that the modules’ weights are shared over different iterations.

GraphProp This module processes the embedding  h(t)of the previously created nodes together with their underlying graph structure  G(t)for two complementary but distinct tasks. First, the node embeddings are updated by propagating through them. Second, the updated node embeddings are read out and aggregated into a single graph embedding,

image

This is illustrated in Figure 2 a). The graph embedding hG(t)can be interpreted as a summary of the hitherto created partial graph. We use a GNN to propagate and aggregate the node embeddings  h(t)in (8). More precisely, we use an exact copy of our encoder from Section 4 initialized with its own weights. This is motivated by NLP methods that use two distinct RNNs on the encoder-level and the decoder-level, respectively.

AddNode In this module, a new node is created and its node type is selected. The input is the summary of the original graph, i.e., the sampled point z from the latent space as well as the summary  hG(t)of the already created partial graph. The intention behind these inputs and the ones for the following modules is always the same: What does the original graph look like? What does the partially created graph look like? What is missing? This concludes in following module,

image

The addNode module is outlined in Figure 2 b). The output of (9) is a categorical distribution over all possible node

image

Figure 2: Illustration of a single iteration during the graph generation process. a) A decoder-level GNN propagates the node embeddings through the partially created graph and aggregates them into a summary of this graph. b) A new node is created and its node type is selected using the summary of the partially created graph as well as the original one. c) The newly created node is initialized with a node embedding. d) A score of all edges connecting the new node is calculated and evaluated into the set of new edges.

types. Note that sampling over this distribution yields a one-hot encoding of a specific node type. In line with the structure of NAS-Bench-101 graphs, the iteration stops after running through the step that adds the end/output node.

InitNode This module initializes the node embedding of the just created node. The input is the sampled point z, the summary  hG(t)of the created graph and the embedding of the node type L[type] in the look-up table. This embedding of the new node is then added to the already existing propagated node embeddings,

image

This process is further illustrated in Figure 2 c).

AddEdges In this module, the edges towards the newly created node are selected. For this purpose, a score between the new node  ht+1and each previous node is calculated, respectively. A high score stands for a high probability of an edge and vice versa. This is illustrated in Figure 2 d).

image

where Edges is a family of Bernoulli distributed random variables describing a probability for each possible edge e connecting the new node. Sampling over these distributions yields the new set of edges. We interpret each edge as directed towards the new node.

In all our experiments, we let  faddNode, finitNodeand faddEdgesbe two-layer MLPs with ReLU nonlinearities.

5.2. Training and Loss

Recall that a VAE maximizes the lower bound estimator for a graph G and a latent space representation z:

image

The first term of (14) is the model specific reconstruction loss which enforces high similarity between the input graph and the generated graph. The second term is the Kullback– Leibler Divergence which regularizes the latent space. In the following, we will discuss the reconstruction loss of VSGAE.

We train the encoder and the decoder of VS-GAE jointly in an unsupervised manner. Although the encoder is by construction invariant under graph isomorphisms, the decoder needs a certain ordering over the nodes. To fulfill the prior of the decoder that each edge is directed towards the new node, this ordering has to be in such a manner that the adjacency matrix is an upper triangle matrix. This is, for example, given by the canonical ordering in which the graphs of NAS-Bench-101 are provided.

Given this fixed ordering of the nodes, we know the ground truth of the outputs of AddNode (9) and AddEdges (13) during training. One the one hand, we can use this ground truth to calculate a node-level loss  LiVand an edge- level loss  LiEat each iteration step, respectively. On the other hand, we can replace the model’s output by the ground truth such that possible errors will not accumulate through iterations. This is also known as teacher forcing.

In order to calculate the overall reconstruction loss for a graph G, we sum up node losses and edge losses over all

iterations

image

Following [15], we assume  pθ(z) ∼ N(z; 0, 1)and  pθ(G|z) ∼ N(µ, Σ). Furthermore, we approximate the posterior by a multivariate Gaussian distribution with diagonal covariance structure. This can be written as log qφ(z|G) = log N(z; µ, σ21)and ensures a closed form of the KL divergence

image

Thus, it holds for the overall loss

image

where the KL divergence is additionally regularized. Following previous work of [13] and [36], we set the regularization term  α = 0.005.

In this section we present multiple sampling methods for the NAS-Bench-101 dataset. We distinguish between methods acting directly in the discrete graph space and a method in the continuous latent space utilizing the VS-GAE embedding.

6.1. Discrete Space

A fundamental property of the NAS-Bench-101 dataset is the varying graph size from 2 to 7 nodes. We make use of this feature to implement a simple sampling method. For each set of graphs with the same node size we sample uniformly at random.

The second approach is a more sophisticated sampling method in the discrete space. A recent work [10] finds that sampling uniformly with respect to the graph space constraints performs better than the purely random search. For this purpose, we make use of the edit distance. The edit distance is the smallest number of changes which are required to transform one graph into another; one change consists of either turning an operation, i.e., the node’s label, or adding and removing an edge, respectively. Edit sampling consists of sampling uniformly by means of the edit distance of graphs with the same length.

6.2. Continuous Space

For the continuous approach, we fully train VS-GAE without supervision on the NAS-Bench-101 dataset and pick  dG = 56as the dimension of the latent space. Through its variational nature, the latent space builds upon a probability distribution space p(z). For our sampling method,

image

Figure 3: Two distinct properties of NAS-Bench-101; The allocation of the dataset sorted by the ground truth accuracy in logarithmic scale (left)  ∼98.8% in the two last bins. The explained variance of the VS-GAE latent space regarding a principal component analysis (right)  ∼16 significant dimensions.

we transform this distribution space into a regular vector space by discarding the variance, i.e., each graph is now represented by its mean derived by the encoder. The principal components of the embedded graphs in this vector space, see Figure 3 (right), suggest that it suffices to reduce the dimension drastically. By reducing the dimensions to 4, we keep enough information to define a useful sampling method while preventing the curse of dimensionality regarding equidistant sampling.

This sampling method operates as follows: We divide the reduced vector space into a fixed number of bins of same size. The number of bins is detected in a preprocessing step, where we aim to have roughly the same amount of non-empty bins as required number of samples in order to reduce the compute time. To obtain randomness, we shift these bins by a random proportion in [0, 1] of the bin size in each dimension, respectively. For each non-empty bin, we select the graph whose embedding is the closest to the center of the bin. In case of too many samples, we discard the redundant samples randomly. If not enough graphs are sampled, i.e., the number of non-empty bins is less than the required amount of samples, we have to differentiate between two scenarios: first, if the number of non-empty bins is greater than the number of missing samples k, we pick k non-empty bins randomly and sample one graph of these bins randomly. Second, if the number of missing samples exceeds the number of non-empty bins, we randomly sample graphs from each of these bins and repeat the procedure until we arrive at the first scenario.

Our experiments aim to cover three different but still complementary domains. The core of VS-GAE is its encoder since the full model is built around it. Our first experiment evaluates the encoder’s performance prediction ability. For the second experiment, we test the abilities of VSGAE itself. Last, we unite all discussed theories and show experimentally how sampling in the latent space generated

image

Figure 4: The predicted accuracy and ground truth of 100 randomly sampled graphs from the NAS-Bench-101 dataset sorted by the ground truth showing a low prediction error for graphs with high accuracy. For low accuracy architectures, our model mostly predicts low values.

image

Figure 5: The mean and variance of the squared error of the test set performance prediction sorted by the ground truth accuracy in logarithmic scale. Predictions are less certain for architectures in the low accuracy domain.

by VS-GAE is a useful tool to reinforce the performance prediction of the encoder.

If not mentioned differently, we set  dv = 250for the node dimensions and  dG = 56for the dimension of the latent space. We split the dataset 70%/20%/10% edit-sampled into training-, test- and validation set.

All our experiments are implemented using PyTorch [27] and PyTorch Geometric [6]. Our code is available at :

image

7.1. Performance Prediction

In these experiments, we evaluate how well the latent space generated by the encoder can predict a metric of interest of the NAS-Bench-101 graphs, i.e., the validation accuracy on the CIFAR-10 classification task. For this purpose, we utilize a simple predictor, i.e., a four-layer MLP with ReLU nonlinearities, and jointly train the encoder and the predictor end-to-end in a supervised manner. We test for prediction as well as for zero shot prediction. Figure 6 displays the development of training loss against validation loss measured by means of the rooted-mean-squared error

image

Figure 6: Progress of loss and validation error over 50 epochs regarding performance prediction. Best validation RMSE  ∼ 0.0487.

(RMSE). There are a few outliers in the NAS-Bench-101 graphs that end up with a low validation accuracy on the CIFAR-10 classification task. Figure 4 visualizes these outliers and shows that our model is able to find them even if it cannot perfectly predict their accuracies. One can see that the model predicts the ground truth of the other graphs very accurately. To further explore the loss, Figure 5 illustrates the mean and variance of the squared error of the test set partitioned in 10 bins with respect to the ground truth accuracy. The greater part of the loss arises from graphs with a low accuracy. More importantly, our model is very accurate in its prediction for graphs of interest namely graphs with high accuracy. The rather bad prediction of graphs with low and intermediate accuracy can be explained through their low share in the dataset, see Figure 3 (left).

Next, we consider the task of predicting the validation accuracy of previously unseen graph types, i.e., zero-shot prediction. The zero shot prediction task is furthermore divided into two subtasks. First, the encoder is learned on all graphs of length 2, 3, 4, 5, 7 and tested on graphs of length 6. Second, we learn the encoder on graphs of length 2, 3, 4, 5, 6 and test it on graphs of length 7. The progress of the loss and the test error can be seen in Figure 7. The experiments show that our model is able to accurately predict data that it has never seen before. The behavior of the test error during the second zero shot prediction task, see Figure 7 (right), displays interesting information. During the first epochs, the error rises before it starts decreasing and approaching the training loss asymptotically. One interpretation could be that the model first learns simple graph properties like the number of nodes before it learns more complex graph substructures that generalize to the unseen data.

Table 1 summarizes the performance prediction results. All experiments are repeated 5 times and we report the mean and the relative standard deviation.

image

Figure 7: Progress of loss and test error over 50 epochs regarding zero shot prediction with two distinct splits. One training set consists of all graphs of length 2, 3, 4, 5, 7 with a test set of the graphs of length 6 (left). The other consists of all graphs of length 2, 3, 4, 5, 6 with a test set of the graphs of length 7 (right).

image

Table 1: Predictive performance of the VS-GAE encoder in terms of RMSE on prediction and zero shot prediction.

image

Table 2: The reconstruction accuracy and prior validity of VS-GAE in %.

7.2. Model Ability

In this section, we evaluate VS-GAE by means of reconstruction ability and valid generation of neural architectures. To evaluate these abilities, we train the VS-GAE on 90% of the dataset and test it on the 10% held-out data.

We first measure the reconstruction accuracy which describes how often our model can reconstruct the input graphs of the test set perfectly. For this purpose, after calculating the mean  hGand the variance  hvarGof the approxi- mated posterior  qφ(z|G)for the test set, we sample z from the latent representation of each input graph 10 times and decode each sample again 10 times. The average portion of the decoded graphs that are identical to the input ones is then reported as the reconstruction accuracy.

The second ability we are interested in is the prior validity which quantifies how often our model is able to generate valid graphs from the VS-GAE prior distribution. Following [36], we sample 1, 000 vectors from the latent space with prior distribution p(z) and decode each vector 10 times. The average portion of the decoded graphs that are valid is then reported as the prior validity. For a valid graph by means of the NAS-Bench 101 [34] search space,

image

Figure 8: The test error regarding sample method and sample size visualized in two different resolutions. Sampling based on discrete methods can yield very inaccurate predictions while sampling in the VS-GAE space is most stable.

it has to pass the following validity checks: 1) exactly one starting point, i.e., the input node, 2) exactly one ending point, i.e., the output node, 3) there exist no nodes which do not have any predecessors, except for the input node, 4) there exist no nodes which do not have any successors, except for the output node, 5) the graphs are DAGs.

See Table 2 for the evaluation results. Our model, VSGAE, shows a nearly perfect reconstruction accuracy with a high prior validity. Thus, VS-GAE represents a strong tool for further downstream tasks like sampling or neural architecture generation.

7.3. Stability regarding Sampling

In this experiment we use the different sampling methods from Section 6 to downsize the training set with different factors ranging from 1% to 90% of the original size. We keep the validation set and the test set unchanged and calculate the test error regarding the best validation error over 100 epochs. The results are outlined in Figure 8. While the error fluctuates drastically for the sampling methods in the discrete space, the latent space created by VS-GAE enables a sampling that allows for stable accuracy predictions. Interestingly, we can not confirm that sampling uniformly by means of the edit distance gives any stability advantage over sampling uniformly at random.

In this paper, we proposed VS-GAE, a new variational-sequential graph autoencoder, specialized on graphs representing neural architectures. Through multiple experiments on NAS-Bench-101, we examined various capabilities of VS-GAE. On the one hand, its encoder is a powerful tool regarding performance prediction, even in the zero-shot setup. On the other hand, the latent space generated by VS-GAE enables useful sampling methods in the graph space. Further research will mainly review the possibilities of neural architecture generation of VS-GAE, potentially in accordance with further performance prediction.

The authors acknowledge support by the German Federal Ministry of Education and Research Foundation via the project DeToL.

[1] Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheg- giani, and Khalil Sima’an. Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675, 2017. 3

[2] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Al- varo Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018. 3

[3] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. 3

[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Le- Cun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. 2

[5] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. arXiv preprint arXiv:1808.05377, 2018. 1, 2

[6] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019. 7

[7] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263– 1272. JMLR. org, 2017. 2, 3

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. 1

[9] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE, 2005. 2

[10] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single path one-shot neural architecture search with uniform sampling. CoRR, abs/1904.00420, 2019. 2, 6

[11] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017. 2

[12] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convo- lutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. 2

[13] Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. Junction tree variational autoencoder for molecular graph

generation. In Proceedings of the 35th International Conference on Machine Learning, pages 2328–2337, 2018. 6, i

[14] Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnab´as P´oczos, and Eric P. Xing. Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 2020–2029, 2018. 2

[15] Diederik P Kingma and Max Welling. Auto-encoding vari- ational bayes. arXiv preprint arXiv:1312.6114, 2013. 1, 2, 6

[16] Thomas N Kipf and Max Welling. Semi-supervised classi- fication with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 2

[17] Thomas N Kipf and Max Welling. Variational graph auto- encoders. arXiv preprint arXiv:1611.07308, 2016. 2, 3

[18] Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012. 2

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 1

[20] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4558–4567, 2018. 3

[21] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs, 2018. 1, 2, 3, 4

[22] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisan- tha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. CoRR, abs/1711.00436, 2017. 2

[23] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018. 2

[24] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. Neural architecture optimization. In Advances in neural information processing systems, pages 7816–7827, 2018. 2, 3

[25] Federico Monti, Michael Bronstein, and Xavier Bresson. Ge- ometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems, pages 3697–3707, 2017. 3

[26] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016. 2

[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 7

[28] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. In Proceedings of the 35th International Conference on Machine Learning, pages 4092–4101, 2018. 2

[29] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Sax- ena, Yutaka Leon Suematsu, Jie Tan, Quoc V. Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, pages 2902–2911, 2017. 2

[30] Martin Simonovsky and Nikos Komodakis. Graphvae: To- wards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018. 2, 3

[31] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019. 1

[32] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5419, 2017. 3

[33] Li Yi, Hao Su, Xingwen Guo, and Leonidas J Guibas. Sync- speccnn: Synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2282–2290, 2017. 3

[34] Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. Nas-bench-101: Towards reproducible neural architecture search. arXiv preprint arXiv:1902.09635, 2019. 1, 2, 8

[35] Jiaxuan You, Rex Ying, Xiang Ren, William L Hamil- ton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. arXiv preprint arXiv:1802.08773, 2018. 1, 2, 3

[36] Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, and Yixin Chen. D-vae: A variational autoencoder for directed acyclic graphs. arXiv preprint arXiv:1904.11088, 2019. 2, 3, 6, 8, i

[37] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, 2017. 2

[38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018. 1, 2

In this section, we give further details about the implementation of our proposed models.

A.1. The Encoder

Message The message module  M (t)concatenates the embedding of the considered node  h(t−1)vas well as the incoming embedding  h(t−1)u, each of dimension  dn. It further performs a linear transformation on the concatenated embedding. The reverse message module  M (t)outis a clone of M (t)initialized with its own weights,

image

The message module (green) and the reverse message (red) can be seen on the left side of Figure 1 in the main paper.

Update The update module  U (t)is a single GRU cell. First, the incoming messages  mu→vand  moutu→vare added and handled as the GRU input. Second, the node embedding h(t−1)vis treated as the hidden state and is updated,

image

Aggregation We use two rounds of propagation before aggregating the node embeddings into a single graph embedding. This graph aggregation consists of two parts. First, a linear layer transforms the node embeddings to the required graph embedding dimension  dg. Second, another linear layer combined with a sigmoid handles each node’s fraction in the graph embedding,

image

The aggregation function for the variational outputs ˜Ais an exact copy of A with its own weights. An illustration of the aggregation module is given in Figure 1 (right).

A.2. VS-GAE

InitNode The learnable embedding look-up table E consists of five embeddings of size  dn, one for each of the five node types. It is initialized from N(0, 1). The InitNode module concatenates the sampled point of the latent space z of size  dg, the summary of the partially created graph  hG(t)of size  dgand the node embedding of the picked node type,

image

This can be seen in Figure 2 c). For the very first embedding, we exclude the partially created graph,

image

GraphProp The GraphProp module consists of two rounds of message passing with the exact same modules from above and a variance-free graph aggregation. Each of these modules is initialized with its own weights, see Figure 2 a).

AddNode The AddNode module concatenates the sampled point of the latent space z and the summary of the partially created graph  hG(t)and outputs logits over all five possible node types,

image

This can be seen in Figure 2 b).

AddEdges The AddEdges module concatenates the embedding of the newly created node  ht+1and each previous node  hvas well as the sampled point of the latent space z and the summary of the partially created graph  hG(t),

image

The output is the score that describes the probability of an edge in logits. This process is illustrated in Figure 2 d).

The node and graph dimensions,  dn = 250and  dg = 56, are chosen as in [13, 36] to attain comparability. For all experiments, we used the Adam optimizer with no dropouts. For training VS-GAE, we used a learning rate of  1e−4for a total amount of 300 epochs. Whenever the loss didn’t drop for 10 epochs, we decreased the learning rate by a factor of  1e−1.

B.1. Performance Prediction

The hidden layers of the predictor are of size 28, 14 and 7. We used no activation function for the very last output (linear regression) and trained the joint model with a learning rate of  1e−5.


Designed for Accessibility and to further Open Science