Diffusion-based Deep Active Learning

2020·Arxiv

Abstract

Abstract

The remarkable performance of deep neural networks depends on the availability of massive labeled data. To alleviate the load of data annotation, active deep learning aims to select a minimal set of training points to be labelled which yields maximal model accuracy. Most existing approaches implement either an ‘exploration’-type selection criterion, which aims at exploring the joint distribution of data and labels, or a ‘refinement’-type criterion which aims at localizing the detected decision boundaries. We propose a versatile and efficient criterion that automatically switches from exploration to refinement when the distribution has been sufficiently mapped. Our criterion relies on a process of diffusing the existing label information over a graph constructed from the hidden representation of the data set as provided by the neural network. This graph representation captures the intrinsic geometry of the approximated labeling function. The diffusion-based criterion is shown to be advantageous as it outperforms existing criteria for deep active learning.

1. Introduction

Deep learning has provided unprecedented performance in various semi-supervised learning tasks ranging from speech recognition to computer vision and natural language processing. Deep Convolutional Neural Networks (CNN), in particular, have demonstrated object recognition that exceeds human’s performance. However, this success comes with the requirement for massive amounts of labelled data. While data collection has become easier, the annotation of such data with labels is time consuming and expensive. In fact, the annotation of data has become a major bottleneck in the application of deep learning to many real life problems.

Active learning provides a plethora of techniques that allows to select a minimal set of data points for labeling which optimally minimize the error probability under a fixed budget of a labeling effort (see (Settles, 2012) for a review). A well known trade-off in active learning is between the exploration and refinement stages. Exploration aims at mapping the joint data and labeling distribution in order to identify decision boundaries, and typically yields optimal results at the earlier stage of active learning. Refinement (also referred to as exploitation), on the other hand, characterizes acquisition of labels at the proximity of a discovered decision boundary. Refinement typically provides better gains in accuracy when it follows an exploration stage. It has been shown that active learning exhibits improved results when the balance between exploration and refinement is optimal (e.g. (Krause and Guestrin, 2007)).

Incorporating active learning into deep learning is yet a challenging task for several reasons. First, the network representation does not allow to construct an easy probabilistic criterion that incorporates exploration and refinement. Thus, most existing works in active deep learning focus on either exploration (e.g. (Sener and Savarese, 2017)) or on refinement type criteria (e.g. (Gal et al., 2017)), which often leads to sub-optimal results. Second, training of deep networks is a costly operation. Hence, acquiring only a small amount of labels at each step is impractical in terms of the overall running time of retraining the model sequentially. Therefore, at each step a sufficiently large batch of labels needs to be queried. This batch of points needs to be diverse enough to provide higher gains in accuracy for the labeling effort spent. Selecting such a batch may be a daunting task, as often a model needs to be trained to generate prediction for the unlabeled set while a candidate set of labels is hypothesized, as, for example, in the model-change criterion for active learning (Settles, 2012), or in (Settles et al., 2007). This state pushed active deep learning to focus mostly on simple uncertainty-based criteria (e.g. (Gal et al., 2017; Wang et al., 2016; Stark et al., 2015)), or geometric criterion (e.g. (Sener and Savarese, 2017)) to avoid at least some of the re-training burden.

In this paper we propose a batch active learning criterion that uses a label diffusion process on a graph constructed from the hidden layer representation. The main contributions of this approach are with addressing the solution of the above problems. First, the label diffusion based query criterion exhibits a natural switch from exploration to refinement, which avoids the need for an additional optimization machinery (e.g. as in (Yang et al., 2015)). It has the versatility of an explorer and a refiner by using just a single query criterion. Moreover, the learning of the data distribution is done on a graph constructed from the top hidden layers which are highly correlated with the labeling approximation. The graph representation captures the intrinsic geometry of the labeling function instead of probing it in the ambient feature space as, for example, in (Sener and Savarese, 2017).

To address the second problem above, we note that label diffusion has running time that is linear in data size as it involves the application of a sparse matrix to a sparse vector for only a handful of iterations. The graph construction is based on a nearest neighbor procedure, which can be efficiently done if the hidden layer representation is low dimensional, as typical to the top layers. Once the diffusion iterations are done, the subsequent query selection step involves nothing but a quick sort of the diffused values. Therefore, the batch querying process is fast, can scale to and diverse batches, and does not require model training, as for example in model-change-based criteria.

The optimality of our criterion is demonstrated on benchmark data over a range of state-of-the-art active learning criteria and in particular those constructed for deep learning.

2. Related work

Active learning is a well-studied branch of machine learning. To reflect on the diverse approaches used in active learning we note some of the earlier seminal works of (Lewis and Gale, 1994) on using posterior probability for uncertainty sampling, (MacKay, 1992) using information theory, (Tong and Koller, 2002) for using geometric distance to decision boundary, (Seung et al., 1992; Dagan and Engelson, 1995) which suggested an ensemble committee of classifiers, and (Roy and McCallum, 2001) using expected risk estimation. For an introduction, we refer to the comprehensive review (Settles, 2012).

In light of the important impact of active learning, it has recently been incorporated with deep models. In this context, a first class of active learning criteria are uncertainty based method. Such methods sample points from the unlabeled data set according to some measure of uncertainty, extracted from the so-far learned classifier. A few recent works (Stark et al., 2015; Wang et al., 2016; Lin et al., 2018) considered these methods to learn deep neural nets. Recently, the authors of (Ducoffe and Precioso, 2018) proposed margin-based approach for deep learning. In (Ducoffe and Precioso, 2018), the distance of data points from the decision boundaries is estimated as the distance to the closest adversarial examples. Points closer to the decision boundaries are queried. While uncertainty methods can be effective at later stages of active learning, when the learned model is already a discrete classifier, their performances can be even lower than random labeling at earlier stages, as we show in Section 6.

Another important family of active learning techniques is the one of ensemble methods. In this setting the uncertainty measure is computed over an ensemble of models trained on the existing labeled dataset (Beluch et al., 2018). Due to the high computational effort of training deep neural networks, such methods can be impractical. A recent work-around has been recently proposed in (Gal et al., 2017), based on an equivalence between dropout and approximate Bayesian inference (Gal and Ghahramani, 2016). In this work the authors average the uncertainties over the outputs of an ensemble of networks obtained via dropout.

Exploratory criteria have also been experimented for learning deep neural networks. The works (Sener and Savarese, 2017) and (Geifman and El-Yaniv, 2017) proposed to perform large batch queries by choosing data that form an -covering of the data distribution. These methods were shown to yield gains in some cases, but they seem to be less effective for querying smaller batches at a time and they typically require an expensive optimization problem. In a recent work a discriminative network was trained in (Gissin and Shalev-Shwartz, 2019) as a proxy for exploration. While promising, these methods seemed to perform similarly on some benchmark data sets, as discussed in (Gissin and Shalev-Shwartz, 2019).

We believe that a key concept missing in the above cited methods is the combination of exploration and refinement criteria in active learning. The above cited works either try to tackle the decision boundaries based on the current model or to cover the data distribution. In the deep learning setting, a first attempt to merge the two types of criteria was proposed in (Liu and Ferrari, 2017) where the authors use a heuristic method to combine uncertainty with representativeness designated for the specific task of human pose annotation. We introduce in this paper a versatile and principled active learning criterion based on label-diffusion over graphs. Label-diffusion based criterion was first proposed in (Kushnir, 2014), where it was coupled with a label-adaptation of the graphs diffusion weights to mitigate diffusion in the label space. Here we do not require the label adaptation of the diffusion kernel, as the graph is constructed from the top hidden layer, which is highly correlated with the hypothesized labels.

3. Problem set-up

We consider the following multi-class classification set-up. Let D be a generic probability distribution over indicates the number of classes. Our aim is to find a classifier denotes the space of probability measures over [C]) that minimizes the classification error

In deep learning, we consider the parametric functions

where . The parameter defines our final model . In particular, in the following experiments, is taken to be a neural network. In this case the function corresponds to the i-th layer of the network and to the final classifier layer.

Active learning In active learning, one is given a pool of input data points and a budget Q of data points to select for labeling. Optionally, for a subset the labels of the data points may also be given as input. In the following, we denote as the set of unlabeled points in . The learner is given the possibility to query from an oracle the labels of up to . At each iteration of active learning, the newly labeled points are added to the existing set of labeled points, and the model can be retrained with the updated training set. The aim of active learning is to minimize the approximation error while querying at most Q points from . An active learning strategy consists of the following steps:

1. Train on a given set of labeled samples

2. Until a given budget Q of labels is queried, repeat:

(b) Query the labels of the data points in from an oracle and add them (with their labels) to

The task in this paper is to design an efficient and meaningful criterion for data labeling in the above setting.

4. Classiﬁcation via Graph Diﬀusion

In this section, we describe a framework for classification based on label diffusion over graphs. This framework is at the core of the diffusion-based active learning criterion that we introduce for deep learning in Section 5.

We assume a finite weighted graph G = (V, E) consisting of a set of N vertices V , a set of , and a non-negative . We interpret the weight as a measure of similarity between the vertices . The graph kernel is defined by

where D is diagonal with . The operator M is a weighted averaging operator over functions f defined over the graph: , with the weights given by the similarities . It has the ‘averaging effect’ of smoothing the function f over the graph.

In the data context, a graph G can be constructed in which the vertices of G correspond to the data points in . The weights W represents similarities between data points:

where is a local scaling parameter, m is a decreasing function, and is a measure of distance between data points. The actual similarities used are the local ones in order to preserve local geometry and reduce computation time by using sparse matrices. These local similarities are realized by computing the K-nearest-neighbors for each point x, denoted by N(x).

4.1 Label diffusion on graphs

At the center of our algorithmic ideas lies a Markov process that is used to propagate labels from . The graph kernel defined in (1) is a row-stochastic matrix, which can be viewed as the transition probabilities of a Markov random walk on the graph G. Specifically, the onestep transition probability between states is given by

We consider a random walk as a mean to assign a label to The predicted label of is associated with the probability of arriving to a labeled point x of class 1 after performing a t-step random walk starting at (Zhu et al., 2003) (we consider here the binary case). Marking this probability as , it can be derived by the recursive relation

We associate with the probability . For labeled points in we see that and its sign can be used to generate binary labels. The vector can be partitioned as corresponds to the indices of the labeled nodes in corresponds to the indices of the unlabeled nodes . Similarly, D and W can be partitioned into blocks

Eq. (2) can be transformed and re-written for

resulting in the system

where is the graph Laplacian, and the sign of each provides the predicted hard label of . A similar system, motivated by quadratic energy minimization, is obtained by minimizing

while forcing equality on the labeled set (Chapelle et al., 2010). Specifically, minimizing (4) with respect to

which is the same as (3), since

The system (5) can be solved via the well-known Jacobi method (Chapelle et al., 2010). The iterative Jacobi method solves the system Az = b by approximating the solution at the step t + 1 by

The Jacobi iteration matrix is defined as is a diagonal matrix with -th diagonal element, and R and Q are the upper and lower triangular matrices of A. In matrix notation the iteration scheme is

For the system (5) we have , which then yields the iteration

Since training points labels are restarted to their true values after every iteration an equivalent system can be considered with which yields the same result. It is clear now that (7) is a label diffusion process: transducing a label to weighted average of the labels of its neighbors with the transition weights.

4.2 Convergence

The equivalence we have just drawn between our label propagation and the Jacobi iteration plays an important role in deriving convergence and solution properties for the diffusion process that we propose for active learning.

converges as

4.3 Algorithm

Let be the available labeled subset of a model which has been trained on . Our goal is to select a batch of unlabeled points of a given size be labeled and added to the training set such that the accuracy of a model retrained with is maximized on D. To facilitate our presentation we first describe in detail the basic query step of our algorithm. On the following section 5.1 we address the mechanism to select a large and diverse batch size.

where

The neighborhoods N(i) are determined as K-NN neighborhoods based on the distance above, where represents the penultimate layer output of the network Therefore, the constructed graph corresponds to a weighted K-NN graph over the set of represented data points

The next step consists of propagating the uncertainty over the graph as suggested in Section 4.1. The diffusion iteration starts with setting the values in the vector each class

The values are propagated as

for t = T time steps. In our supplementary material we propose possible variants on the initialization of

4.4 Setting K and T.

The parameters T and K determine the level of confidence imposed by the diffused training set over the unlabeled set. Higher K impose strong confidence in the labeling hypothesis over larger neighbourhood around each labeled point, but makes the method less computationally efficient (as it controls the sparsity of M). Similarly, large number of iterations T enables communication between further nodes, but T too large may result in an overly smoothed (and thus less informative) signal . During the exploration stage large T imposes an hypothesis that may be locally correct but is still far from being globally reliable, as too few labeled samples cannot correctly approximate the labeling function.

Denote the average number of unlabeled points per a labeled point as active learning settings we assume that the labeled set is minimal we have that some degree K we can approximate , with the intuition that the labeling assignment can be recovered by diffusion from the labelled set to all unlabeled nodes iterations. Hence,

The interplay between K and N: although K can be treated as a constant, it is preferred that the graph will be connected, otherwise, the exploration phase of the algorithm will be excessively used to label each disconnected component. Hence, if we assume a characteristic graph model (such as Erdősh-Rénye), with degree k = 2 log(N) the graph will be connected with high probability. In our experiments, we use large enough to allow the graph to be connected, as per the above. We further discuss on the choices of K and T in the supplemental material.

We conclude that label diffusion is a convergent iteration which provide the label probabilities for each node.

5. Active Learning

In this section we provide the full description of the diffusion-based active deep learning algorithm. The algorithms is based on the graph diffusion framework presented in Section 4.1, and extended to the batch querying in the deep network setting.

Query criterion. To this end the matrix of propagated values can be interpreted as uncertainties measured by the absolute value . Specifically, the absolute value magnitude represents a measure of uncertainty on whether vertex i belongs to class c. This can be used to select the new batch to query as

Here denotes the B smallest elements.

The main idea behind the criterion in (8) is that it automatically switches between exploration and refinement stages when exploration provides only little accuracy gains. In the first exploratory phase, very few labeled points reside sparsely on the graph. Following the diffusion process, the unlabeled points with small label magnitude correspond to unexplored regions that reside far from the labeled set on the data manifold. By querying such data points we ensure to cover the data distribution efficiently. At later stages, when the data distribution has been explored, the criterion becomes a refiner. At this point, the data points in close proximity to decision boundaries tend to receive the same amount of signal from points of different classes and present smaller label magnitude. Queries that focus on such data points refine the existing decision boundaries.

Running time The running time of a batch query is composed of three parts. First, we need to compute the K-NN graph. This is the most expensive operations and brute force algorithms come with a complexity of is the dimensionality of the data used to compute distances. Using the penultimate layer of the network to represent the data allows us to notably reduce this computational cost. Indeed, in this way we can use procedures for k-NN based on KD (Bentley, 1975) or ball trees (Omohundro, 1989), which come with a computational cost of approximately . Moreover, at larger stages of active learning, the network representation is closer to be linearly separable, and this also notably speeds up this construction. In practice, in our experiments, we have been using the K-NN graph implementation offered in the Scikit-learn library (Pedregosa et al., 2011). Other alternatives include approximate search (e.g. (Datar et al., 2004)) which suggests a trade-off of an almost linear time complexity with a prescribed error constant

Second, the diffusion vector is diffused with the kernel matrix. Addressing its sparsity as O(KN) non-zero entries, this operation scales linearly in N as O(TKN). Using our arguments in section 4.4: , which leads to an operations for the diffusion stage.

be applied, which can be done in We conclude that the running time is

5.1 Enhanced large batch queries

When applying deep active learning to massive data sets, querying large batch sizes is more of a necessity than a choice. For example, querying large size batches using (8) could possibly results in over-sampling certain regions of D, and wasting labeling efforts. In order to avoid over sampling, we propose to split the batch query in a series of R mini-batch queries of much smaller size P (with R = B/P). After each mini-query, the diffusion vector is updated with the newly added mini-batch in order to enhance the diversification of the next mini-queries.

The batch query procedure BatchQuery for large batches is summarized in Algorithm 1. BatchQuery takes as input the set of labeled points , the model , the batch size B, and the mini-batch size P. At start, the K-NN graph is constructed as outlined in Section 5. Then, for , we diffuse the labels in and use the obtained vector to perform the mini-batch query:

to update is the set of unlabeled data points at iteration i. After each mini-batch iteration the labeled set is updated with the new mini-batch, until R mini-batches are queried. We note that the graph G stays unchanged during this process.

The batch query criterion is summarized in Algorithm 1. The full active learning procedure is summarized in Algorithm 2.

6. Experiments

Our experiments include a demonstration on the trade off between exploration and refinement with a 2-dimensional toy data set, showing the advantages of our versatile query criterion over other approaches. We then follow with a set of experiments on the benchmark data sets MNIST, CIFAR10 and SVHN, where we compare other query criteria and algorithms to our diffusion-based criterion. We conclude with a discussion on the experimental results.

6.1 Exploration vs. refinement: a toy example

We consider a simple non-separable 2D data set for binary classification. The data set is given as points on a 2-dimensional binary ‘checkerboard’ where each color corresponds to a class (‘red’, ‘blue’); see Figure 1. This example demonstrates the utility of refinement vs. exploration criteria in (Baram et al., 2004). We consider a pool data set of labeled points drawn from the checkerboard distribution. We start with a training set

composed of 4 points randomly drawn from the pool set for each of the two classes (so that ) and train a feed-forward neural network. We run the chosen active learning criterion up to 120 queries with |B| = 5, and present the first 110 queries in yellow color in Fig. 1. The accuracy and its variance are measured on a separate test data set of drawn from the checkerboard distribution in Fig. 2. All the results are averaged over 5 runs. Further standard parameter details of the network architecture and training are reported in the supplemental material.

We considered four different selection criteria:

1. Random - the points are drawn uniformly at random from at each query iteration.

2. Uncertainty - the points where the current model is the most uncertain are selected (Lewis and Gale, 1994).

Figure 1: From left to right: (1) Binary checkerboard pool dataset (2) Points queried (in yellow) using coreset criteria (Sener and Savarese, 2017) (3) Points queried using uncertainty criterion (Lewis and Gale, 1994) (4) Points queried using our criterion.

Figure 2: Performance analysis for active learning on the binary checkerboard set. Top: Accuracy versus size of training data set. Bottom: Variance of accuracy versus size of training data set.

3. Coreset - this is the greedy version of the algorithm introduced in (Sener and Savarese, 2017). It essentially aims to select points in order to form an approximate the pool data.

4. Diffusion

Figure 1-Coreset demonstrates the exploratory nature of (Sener and Savarese, 2017). Coreset aims at uniformly covering the data distribution. Yet, as seen, the decision boundaries are left unrefined, and therefore its accuracy at the refinement stage is inferior to the diffusion-based criterion. On the other hand, Figure 1-Uncertainty shows to select points closer to the decision boundary of the current model . If the model has a discrete understanding of the data distribution, these are close to the decision boundaries of the problem. Nevertheless, at earlier stages of learning, the data queried so far is not enough to cover the distribution and the learned model may be erroneous yet highly confident on unexplored regions. As the model is highly confident on these regions, no points are selected from these regions, leaving them unexplored and causing mis-classification that persists at later stages. At query 110, we see most uncertainty-based queries concentrate near the detected decision boundary, while boundary segments which have not been explored are completely missed in the classifier.

Finally, Fig. 1-Diffusion illustrates how the diffusion criterion operates in two phases. In the first, exploratory, phase, it tends to cover uniformly the distribution and identifies all decision boundaries. At later stages, once the model has enough information of the structure of D, it queries labels near the decision boundaries. Figure 2-bottom shows that diffusion-based active learning also exhibit smaller variance than all other criteria over different realizations of the same experiments.

6.2 Benchmarks evaluation

We experiment with the benchmark classification problems: MNIST , CIFAR10, SVHN (figs. 3 and 4). The advantage of the diffusion-based criterion is especially prominent during the early exploration and the transition to refinement. The competitive accuracy persists into the longer tail of the refinements stage, Where uncertainty criteria is approaching from below.

Datasets and experimental setups. We compared the performance of the diffusion-based active learner with representative methods including (Sener and Savarese, 2017) (Gal et al., 2017) and with a range of criteria included in (Settles, 2012):

1. Random

2. Three different uncertainty-based criteria:

(b) Margin - the points with lowest output probability difference between the two most probable alternatives are selected (method (2.2) in (Settles, 2012)). We note that margin is also an explorer-type criterion as it probes representatives points in the early stages of active learning.

(c) Entropy - the points with highest entropy in the model’s posterior label distribution are selected (method (2.3) in (Settles, 2012)).

3. Coreset - (Sener and Savarese, 2017)

4. Bayesian Dropout - the algorithm of (Gal et al., 2017), where dropout is used to obtain an ensemble of models. An uncertainty measure is then evaluated by averaging over the ensemble of models. We considered the following measures of uncertainty:

Experimental details. For MNIST we consider both fully connected and convolutional models . For CIFAR10 and SVHN we take to be a VGG-16 network (Simonyan and Zisserman, 2014). For CIFAR10, the network was pre-trained on ImageNet. We performed batch queries of size B = 20 for MNIST and B = 200 for CIFAR10 and SVHN. The reported accuracies are averaged over 5 runs. All the experiments were performed using PyTorch (Paszke et al., 2017). Additional network configuration and training details are reported in Appendix A.

Figure 3: Left: Results on the MNIST dataset using a fully connected neural net. Right: using a convolutional neural net.

Figure 4: Left: Results for CIFAR10 data set using a VGG16 network pre-trained on ImageNet. Right: SVHN data set with VGG16 architecture.

6.3 Discussion

The experiments demonstrate the strong improvement that is obtained by incorporating exploration with refinement through a principled graph-diffusion framework. We observe that during exploration stage ‘margin’ (which may also act as an explorer) as well as ‘coreset’, which are applied in the ambient space, fall behind the diffusion-based learner. We advocate that this is a result of our use of the graph representation which captures the intrinsic geometry of the data points in the labeling space, and accurately models the probability of label assignment in this space.

Finally, we note the stability of our method vs. all other methods as explicitly shown in Fig. 2-bottom and in the error bars in Figs. 3-4. This high stability of the diffusion-based method is a key advantage - showing robustness in achieving high accuracy in different initialization (i.e. initially labeled data points) or noisy settings. The graph representation plays a key role as it denoises the data. On the contrary, we observe that the ’uncertainty‘ criterion exhibits the highest variance in results as it selects points that do not necessarily represent the distribution (including outliers). Therefore, it is more susceptible to initialization.

References

Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. J. Mach. Learn. Res., 5:255–291, December 2004. ISSN 1532-4435. URL http://dl.acm. org/citation.cfm?id=1005332.1005342.

William H. Beluch, Tim Genewein, Andreas Nürnberger, and Jan M. Köhler. The power of ensembles for active learning in image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

Olivier Chapelle, Bernhard Schlkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125, 9780262514125.

Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic clas- sifiers. In Proceedings of the Twelfth International Conference on International Conference on Machine Learning, ICML’95, pages 150–157, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-377-8. URL http://dl.acm.org/citation.cfm? id=3091622.3091641.

Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hash- ing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253–262, 2004.

Melanie Ducoffe and Frederic Precioso. Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841, 2018.

Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.

Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1183–1192. JMLR. org, 2017.

Yonatan Geifman and Ran El-Yaniv. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941, 2017.

Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. arXiv preprint arXiv:1907.06347, 2019.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Andreas Krause and Carlos Guestrin. Nonmyopic active learning of gaussian processes: An exploration-exploitation approach. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 449–456, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-793-3. doi: 10.1145/1273496.1273553. URL http://doi.acm.org/10.1145/ 1273496.1273553.

Dan Kushnir. Active-transductive learning with label-adapted kernels. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 462–471. ACM, 2014.

David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’94, pages 3–12, New York, NY, USA, 1994. Springer-Verlag New York, Inc. ISBN 0-387-19889-X. URL http://dl.acm.org/citation. cfm?id=188490.188495.

L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang. Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):7–19, Jan 2018. doi: 10.1109/TPAMI.2017.2652459.

Buyu Liu and Vittorio Ferrari. Active learning for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4363–4372, 2017.

David JC MacKay. Information-based objective functions for active data selection. Neural computation, 4(4):590–604, 1992.

Stephen M Omohundro. Five balltree construction algorithms. International Computer Science Institute Berkeley, 1989.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De- Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differenti-ation in PyTorch. In NIPS Autodiff Workshop, 2017.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon- del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 441–448, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm? id=645530.655646.

Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003. ISBN 0898715342.

Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489, 2017.

Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114, 2012.

Burr Settles, Mark Craven, and Soumya Ray. Multiple-instance active learning. In Proceedings pages 1289–1296, USA, 2007. Curran Associates Inc. ISBN 978-1-60560-352-0. URL http: //dl.acm.org/citation.cfm?id=2981562.2981724.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 287–294, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X. doi: 10.1145/130385.130417. URL http://doi.acm.org/10.1145/130385.130417.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Fabian Stark, Caner Hazırbas, Rudolph Triebel, and Daniel Cremers. Captcha recognition with active deep learning. In Workshop new challenges in neural computation, volume 2015, page 94. Citeseer, 2015.

Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, March 2002. ISSN 1532-4435. doi: 10.1162/153244302760185243. URL https://doi.org/10.1162/153244302760185243.

Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.

Yi Yang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and Alexander G. Hauptmann. Multi- class active learning by uncertainty sampling with diversity maximization. Int. J. Comput. Vision, 113(2):113–127, June 2015. ISSN 0920-5691. doi: 10.1007/s11263-014-0781-x. URL https://doi.org/10.1007/s11263-014-0781-x.

Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. Semi-supervised learning using gaus- sian fields and harmonic functions. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, pages 912–919. AAAI Press, 2003. ISBN 1-57735-189-4. URL http://dl.acm.org/citation.cfm?id=3041838. 3041953.

Appendix A. Training details

In the following we report the hyper-parameter used in the experiments in Section 6 for each

of the data sets.

Checkerboard We trained a fully-connected network with 2 hidden layers of width 30 each.

We optimized using SGD with batch size 1, learning rate 0.001 and momentum 0.9. We ran

100 epochs after each query. The experiments were run on a pool set of size 2000. The

accuracy was evaluated on a separate test data set sampled from the same distribution (200

points).

MNIST The fully-connected model used had 2 hidden layers of width 100 and 50 respec-

tively. The convolutional model used was composed by a convolutional layer with 16 channels

and a kernel of size , a MaxPool layer with a kernel of size 2 and padding 2, and 2 hidden

fully-connected layers of width 20 each. For both models, we optimized using Adam (Kingma

and Ba, 2014) with batch size 8 and learning rate 0.001. For both models, a BatchNorm

(Ioffe and Szegedy, 2015) layer was added before each hidden fully-connected layer. We ran

100 epochs after each query. The experiments were run on a pool set formed by a (balanced)

randomly selected subset of the training data set of size 10000. The accuracy was evaluated

on the test data set (10000 points).

CIFAR10 The network used was a VGG-16 architecture pre-trained on ImageNet. We

took the convolutional part of such network and added 2 fully-connected layers of width 512

and 20 respectively. A dropout layer was added after each of these fully-connected layers.

Only the fully-connected layers were trained. We optimized using Adam (Kingma and Ba,

2014) with batch size 100. After each query the learning rate was initialized to 0.0003 and

decayed by 0.5 every 30 epochs. We ran 100 epochs after each query. The full training data

set (50000 points) was used as pool data set for the experiments. The accuracy was evaluated

on the test data set (10000 points).

SVHN The network used was a VGG-16 architecture. We optimized using SGD with

batch size 50, learning rate 0.005, momentum 0.9 and weight-decay 0.0005. We ran 50 epochs

after each query. The experiments were run on a pool set formed by a (balanced) randomly

selected subset of the training data set of size 20000. The accuracy was evaluated on the test

data set (26032 points).

Diffusion The following parameters were used for the diffusion algorithm in the experiments

presented in Section 6. For the experiment with the checkerboard data set (experiment in

Figures 1 and 2), we used T = 4, K = 10 and P = 1. For the experiments with the MNIST

data set (experiment in Figure 3), we used T = 5, K = 10 and P = 1. For the experiment

with the CIFAR10 data set (experiment in Figure 4-Top), we used T = 4, K = 20 and P = 10.

We also used the technique explained in Section B.3 with . For the experiment with

the SVHN data set (experiment in Figure 4-Bottom), we used T = 5, K = 20 and P = 1. We

also used the technique explained in Section B.3 with and initialized the signal

using the soft-labels information (as explained in Section B.2) for each query after the first.

Bayesian criterion In order to perform the active learning queries using the Bayesian

criterion, we added a dropout layer after each hidden fully connected layer to the models with

no dropout layers.

Appendix B. Variants and enhancements of diffusion-based criterion

B.1 Choosing among points not reached by diffusion

In the first stages of active learning, the diffusion process may not reach all the points. This

means that, given a finite time T, some of the points may have identically zero

propagated vector: . Clearly, the query criterion would pick such points (according

to (8)). But in case that there are more of such points than points to query, i.e.

we need a criterion to choose B points out of the zero labeled ones. We provide a secondary

criterion on top of the diffusion procedure in Algorithm 2: points are chosen according to an

influence criterion

where we denoted

B.2 Uncertainty signal initialization

In Section 5, we described the initialization of the signal by using the available

labels:

Another possibility is to use the softlabels provided by the current model to initialize

as follows:

B.3 Dynamically changing diffusion time

We propose a criterion that reduces T based on the number of points reached by diffusion.

This criterion can be incorporated in the enhanced batch query method presented in Section

5.1. At each call of BatchQuery, the parameter T is initialized to a chosen value

iteration i of BatchQuery, we count the number of points not reached by the diffusion:

If , we reduce is an additional parameter that

needs to be fixed.

Appendix C. Proof of Lemma 1

Proof The proof relies on the convergence proof of the Jacobi method Saad (2003): if A

in (6) is strictly diagonally dominant then the iteration converges. Since for

, we have that, for all

is strictly diagonally dominant. This implies that the sums of the rows

of the iteration matrix are smaller than 1. Since , we obtain

that the spectral radius of is bounded by 1: . Therefore

convergent matrix and the Jacobi iteration (7) converges.

designed for accessibility and to further open science