Unsupervised Learning of the Set of Local Maxima

2020·Arxiv

ABSTRACT

ABSTRACT

This paper describes a new form of unsupervised learning, whose input is a set of unlabeled points that are assumed to be local maxima of an unknown value function v in an unknown subset of the vector space. Two functions are learned: (i) a set indicator c, which is a binary classifier, and (ii) a comparator function h that given two nearby samples, predicts which sample has the higher value of the unknown function v. Loss terms are used to ensure that all training samples x are a local maxima of v, according to h and satisfy c(x) = 1. Therefore, c and h provide training signals to each other: a point in the vicinity of x satisfies or is deemed by h to be lower in value than x. We present an algorithm, show an example where it is more efficient to use local maxima as an indicator function than to employ conventional classification, and derive a suitable generalization bound. Our experiments show that the method is able to outperform one-class classification algorithms in the task of anomaly detection and also provide an additional signal that is extracted in a completely unsupervised way.

1 INTRODUCTION

...from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved. (Darwin, 1859)

When we observe the natural world, we see the “most wonderful” forms. We do not observe the even larger quantity of less spectacular forms and we cannot see those forms that are incompatible with existence. In other words, each sample we observe is the result of optimizing some fitness or value function under a set of constraints: the alternative, lower-value, samples are removed and the samples that do not satisfy the constraints are also missing.

The same principle also holds at the sub-cellular level. For example, a gene can have many forms. Some of them are completely synonymous, while others are viable alternatives. The gene forms that become most frequent are those which are not only viable, but which also minimize the energetic cost of their expression (Farkas et al., 2018). For example, the genes that encode proteins comprised of amino acids of higher availability or that require lower expression levels to achieve the same outcome have an advantage. One can expect to observe most often the gene variants that: (i) adhere to a set of unknown constraints (“viable genes”), and (ii) optimize an unknown value function that includes energetic efficiency considerations.

The same idea, of mixing constraints with optimality, also holds for man-made objects. Consider, for example, the set of houses in a given neighborhood. Each architect optimizes the final built form to cope with various aspects, such as the maximal residential floor area, site accessibility, parking considerations, the energy efficiency of the built product, etc. What architects find most challenging, is that this optimization process needs to correspond to a comprehensive set of state and city regulations that regard, for example, the proximity of the built mass of the house to the lot’s boundaries, or the compliance of the egress sizes with current fire codes.

In another instance, consider the weights of multiple neural networks trained to minimize the same loss on the same training data, each using a different random initialization. Considering the weights of each trained neural network as a single vector in a sample domain, also fits into the framework of local optimality under constraints. By the nature of the problem, the obtained weights are the local optimum of some loss optimization process. In addition, the weights are sometimes subject to constraints, e.g., by using weight normalization.

The task tackled in this paper is learning the value function and the constraints, by observing only the local maxima of the value function among points that satisfy the constraints. This is an unsupervised problem: no labels are given in addition to the samples.

The closest computational problem in the literature is one-class classification (Moya et al., 1993), where one learns a classifier c in order to model a set of unlabeled training samples, all from a single class. In our formulation, two functions are learned: a classifier and a separate value function. Splitting the modeling task between the two, a simpler classifier can be used (we prove this for a specific case) and we also observe improved empirical performance. In addition, we show that the value function, which is trained with different losses and structure from those of the classifier, models a different aspect of the training set. For example, if the samples are images from a certain class, the classifier would capture class membership and the value function would encode image quality. The emergence of a quality model makes sense, since the training images are often homogeneous in their quality.

The classifier c and the value function v provide training signals to each other, in an unsupervised setting, somewhat similar to the way adversarial training is done in GANs (Goodfellow et al., 2014), although the situation between c and v is not adversarial. Instead, both work collaboratively to minimize similar loss functions. Let S be the set of unlabeled training samples from a space X. Every satisfies c(x) = 1 for a classifier that models the adherence to the set of constraints (satisfies or not). Alternatively, we can think of c as a class membership function that specifies, if a given input is within the class or not. In addition, we also consider a value function v, and for every point , such that , for a sufficiently small , we have: . This structure leads to a co-training of v and c, such that every point in the vicinity of x can be used either to apply the constraint on v, or as a negative training sample for c. Which constraint to apply, depends on the other function: if , then the first constraint applies; if , then is a negative sample for c. Since the only information we have on v pertains to its local maxima, we can only recover it up to an unknown monotonic transformation. We therefore do not learn it directly and instead learn a comparator function h, which given two inputs, returns an indication which input has the higher value in v.

An alternative view of the learning problem we introduce considers the value function v (or equivalently h) as part of a density estimation problem, and not as part of a multi-network game. In this view, c is the characteristic function (of belonging to the support) and h is the comparator of the probability density function (PDF).

2 RELATED WORK

The input to our method is a set of unlabeled points. The goal is to model this set. This form of input is shared with the family of methods called one-class classification. The main application of these methods is anomaly detection, i.e., identifying an outlier, given a set of mostly normal (the opposite of abnormal) samples (Chandola et al., 2009).

The literature on one class classification can be roughly divided into three parts. The first includes the classical methods, mostly kernel-base methods, which were applying regularization in order to model the in-class samples in a tight way (Sch¨olkopf et al., 2001). The second group of methods, which follow the advent of neural representation learning, employ classical one-class methods to representations that are learned in an unsupervised way (Hawkins et al., 2002; Sakurada & Yairi, 2014; Xia et al., 2015; Xu et al., 2015; Erfani et al., 2016), e.g., by using autoencoders. Lastly, a few methods have attempted to apply a suitable one-class loss, in order to learn a neural networkbased representation from scratch (Ruff et al., 2018). This loss can be generic or specific to a data domain. Recently, Golan & El-Yaniv (2018) achieved state of the art one-class results for visual datasets by training a network to predict the predefined image transformation that is applied to each of the training images. A score is then used to evaluate the success of this classifier on test images, assuming that out of class images would be affected differently by the image transformations.

Despite having the same structure of the input (an unlabeled training set), our method stands out of the one-class classification and anomaly detection methods we are aware of, by optimizing a specific model that disentangles two aspects of the data: one aspect is captured by a class membership function, similar to many one-class approaches; the other aspect compares pairs of samples. This dual modeling captures the notion that the samples are not nearly random samples from some class, but also the local maximum in this class. While “the local maxima of in-class points” is a class by itself, a classifier-based modeling of this class would require a higher complexity than a model that relies on the structure of the class as pertaining to local maxima, as is proved, for one example, in Sec. A. In addition to the characterization as local maxima, the factorization between the constraints and the values also assists modeling. This is reminiscent of many other cases in machine learning, where a divide and conquer approach reduces complexity. For example, using prior knowledge on the structure of the problem, helps to reduce the complexity in hierarchical models, such as LDA (Blei et al., 2003).

While we use the term “value function”, and this function is learned, we do not operate in a reinforcement learning setting, where the term value is often used. Specifically, our problem is not inverse reinforcement learning (Ng & Russell, 2000) and we do not have actions, rewards, or policies.

3 METHOD

Recall that S is the set of unlabeled training samples, and that we learn two functions c and v such that for all it holds that: (i) c(x) = 1, and (ii) x is a local maxima of v.

For every monotonic function f, the setting we define cannot distinguish between v, and . This ambiguity is eliminated, if we replace v by a binary function h that satisfies if and otherwise. We found that training h in lieu of v is considerably more stable. Note that we do not enforce transitivity, when training h, and, therefore, h can be such that no underlying v exists.

3.1 TRAINING c AND h

When training c, the training samples in are positive examples. Without additional constraints, the recovery of c is an ill-posed problem. For example, Ruff et al. (2018) add an additional constraint on the compactness of the representation space. Here, we rely on the ability to generate hard negative points1. There are two generators and , each dedicated to generating negative training points to either c or h, as described in Sec. 3.2 below.

The two generators are conditioned on a positive point and each generates one negative point per each and . The constraints on the negative points are achieved by multiplying two losses: one pushing to be negative, and the other pushing to be negative.

Let be the binary cross entropy loss for . c and h are implemented as neural networks trained to minimize the following losses, respectively:

The first sum in ensures that c classifies all positive points as positive. The second sum links the outcome of h and c for points generated by . It is given as a multiplication of two losses. This multiplication encourages c to focus on the cases where h predicts with a higher probability that the point is more valued than x.

The first term of (respectively ) depends on c’s (respectively h’s) parameters only. The second term of (respectively ), however, depends on both h’s and c’s parameters as well as ’s (respectively ’s) parameters.

The loss is mostly similar. It ensures that h has positive values when the two inputs are the same, at least at the training points. In addition, it ensures that for the generated negative points , is , especially when is high.

. This, in our experiments, leads to very simi-lar results, and we opt for the slightly simpler version.

3.2 NEGATIVE POINT GENERATION

We train two generators, and , to produce hard negative samples for the training of c and h, respectively. The two generators both receive a point as input, and generate another point in the same space X. They are constructed using an encoder-decoder architecture, see Sec. 3.4 for the exact specifications.

When training , the loss is minimized. In other words, finds, in an adversarial way, points , that maximize the error of c (the first term of does not involve and does not contribute, when training ).

minimizes during training the loss , for some parameter . Here, in addition to the adversarial term, we add a term that encourages to be in the vicinity of x. This is added, since the purpose of h is to compare nearby points, allowing for the recovery of points that are local maxima. In all our experiments we set .

The need for two generators, instead of just one, is verified in our ablation analysis, presented in Sec. 4. One may wonder why two are needed. One reason stems from the difference in the training loss: h is learned locally, while c can be applied anywhere. In addition, c and h are challenged by different points, depending on their current state during training. By the structure of the generators, they only produce one point per input x, which is not enough to challenge both c and h.

3.3 TRAINING PROCEDURE

The training procedure follows the simple interleaving scheme presented in Alg. 1. We train the networks in turns: and then c, followed by and then h. Since the datasets in our experiments are relatively small, each turn is done using all mini-batches of the training dataset S. The ADAM optimization scheme is used with mini-batches of size 32.

The training procedure has self regularization properties. For example, assuming that , as a function of h, has a trivial global minima. This solution is to assign to 1 iff . However, for this specific h, the only way for to maximize is to rely on c and h being smooth and to select points that converge to x, at least for some points in . In this case, both and will become high, since and .

3.4 ARCHITECTURE

In the image experiments (MNIST, CIFAR10 and GTSRB), the neural networks and employ the DCGAN architecture of Radford et al. (2015). This architecture consists of an encoder-decoder type structure, where both the encoder and the decoder have five blocks. Each encoder (resp. decoder) block consists of a 2-strided convolution (resp. deconvolution) followed by a batch norm layer, and a ReLU activation. The fifth decoder block consists of a 2-strided convolution followed by a tanh activation instead. c and h’s architectures consist of four blocks of the same structure as for the encoder. This is followed by a sigmoid activation.

For the Cancer Genome Atlas experiment, each encoder (resp. decoder) block consists of a fully connected (FC) layer, a batch norm layer and a Leaky Relay activation (slope of 0.2). Two blocks are used for the encoder and decoder. The encoder’s first FC layer reduces the dimension to 512 and the second to 256. The decoder is built to mirror this. c and h consist of two blocks, where the first FC layer reduces the dimension to 512 and the second to 1. This is followed by a sigmoid activation.

3.5 ANALYSIS

In Appendix A, We show an example in which modeling using local-maxima-points is an efficient way to model, in comparison to the conventional classification-based approach. We then extend the framework of spectral-norm bounds, which were derived in the context of classification, to the case of unsupervised learning using local maxima.

4 EXPERIMENTS

Since we share the same form of input with one-class classification, we conduct experiments using one-class classification benchmarks. These experiments both help to understand the power of our model in capturing a given set of samples, as well as study the properties of the two underlying functions c and h.

Following acceptable benchmarks in the field, specifically the experiments done by Ruff et al. (2018), we consider single classes out of multiclass benchmarks, as the basis of one-class problems. For example, in MNIST, the set S is taken to be the set of all training images of a particular digit. When applying our method, we train h and c on this set. To clarify: there are no negative samples during training.

Post training, we evaluate both c and h on the one class classification task: positive points are now the MNIST test images of the same digit used for training, and negative points are the test images of all other digits. This is repeated ten times, for digits 0–9. In order to evaluate h, which is a binary function, we provide it with two replicas of the test point.

The classification ability is evaluated as the AUC obtained on this classification task. The same experiment was conducted for CIFAR-10 where instead of digits we consider the ten different class labels. The results are reported in Tab. 1, which also states the literature baseline values reported by Ruff et al. (2018). As can be seen, for both CIFAR-10 and MNIST, c strongly captures classmembership, outperforming the baseline results in most cases. h is less correlated with class membership, resulting in much lower mean AUC values and higher standard deviations. However, it should not come as a surprise that h does contain such information.

Indeed, the difference in shape (single input vs. two inputs) between c and h makes them different but not independent. c, as a classifier, strongly captures class membership. We can expect h, which compares two samples, to capture relative properties. In addition, h, due to the way negative samples are collected, is expected to model local changes, at a finer resolution than c. Since it is natural to expect that the samples in the training set would provide images that locally maximize some clarity score, among all local perturbations, one can expect quality to be captured by h.

To test this hypothesis, we considered positive points to be test points of the relevant one-class, and negative points to be points with varying degree of Gaussian noise added to them. We then measure using AUC, the ability to distinguish between these two classes.

Table 1: One class experiments on the MNIST and CIFAR-10 datasets. For MNIST, there is one experiment per digit, where the training samples are the training set of this digit. The reported numbers are the AUC for classifying one-vs-rest, using the test set of this digit vs. the test sets of all other digits. For CIFAR-10, the same experiment is run with a class label, instead of the digits. Reported numbers (in all tables) are averaged over 10 runs with random initializations. Each reported value is the mean result the standard deviation.

Table 2: An ablation analysis on the ten CIFAR classes (shown in order, Airplane to Truck).

As can be seen in Fig. 1, h is much better at identifying noisy images than c, for all noise levels. This property is class independent, and in Fig. 3 (Appendix C), we repeat the experiment for all test images (not just from the one class used during training), observing the same phenomenon.

We employ CIFAR also to perform an ablation analysis comparing the baseline method’s c and h with four alternatives: (i) training c without training h, employing only ; (ii) training h and without training c nor ; (iii) training both h and c but using only the generator to obtain negative samples to both networks; and (iv) training both h and c but using only the generator for both. The results, which can be seen in Tab. 2, indicate that the complete method is superior to the variants, since it outperform these in the vast majority of the experiments.

Next, we evaluate our method on data from the German Traffic Sign Recognition (GTSRB) Benchmark of Houben et al. (2013). The dataset contains 43 classes, from which one class (stop signs, class #15) was used by Ruff et al. (2018) to demonstrate one-class classification where the negative class is the class of adversarial samples (presumably based on a classifier trained on all 43 classes).

Figure 1: The ability to differentiate between an in-class image and an in-class image with added noise for both c (yellow) and h (blue). The x-axis is the amount of noise (SD of the Gaussian noise). The y-axis is the AUC. As can be seen, for both CIFAR-10 and MNIST, h is much more attuned to the image quality.

We were not able to obtain these samples by the time of the submission. Instead, We employ the sign data in order to evaluate three other one-class tasks: (i) the conventional task, in which a class is compared to images out of all other 42 classes; (ii) class image vs. noise image, as above, using Gaussian noise with a fixed noise level of ; (iii) same as (ii) only that after training on one class, we evaluate on images from all classes.

The results are presented, for the first 20 classes of GTSRB, in Tab. 3. The reported results are an average over 10 random runs. On the conventional one-class task (i), both our c and h neural networks outperform the baseline Deep-SVDD method, with c performing better than h, as in the MNIST and CIFAR experiments. Also following the same pattern as before, the results indicate that h captures image noise better than both c and Deep-SVDD, for both the test images of the training class and the test images from all 43 classes.

In order to explore the possibility of using the method out of the context of one-class experiments and for scientific data analysis, we downloaded samples from the Cancer Genome Atlas (https: //cancergenome.nih.gov/). The data contains mRNA expression levels for over 22,000 genes, measured from the blood of 9,492 cancer patients. For most of the patients, there is also survival data in days. We split the data to 90% train and 10% test.

We run our method on the entire train data and try to measure whether the functions recovered are correlated with the survival data on the test data. While, as mentioned in Sec. 1, the gene expression optimizes a fitness function, and one can claim that gene expressions that are less fit, indicate an expected shortening in longevity, this argument is speculative. Nevertheless, since survival is the only regression signal we have, we focus on this experiment.

We compare five methods: (i) the h we recover, (ii) the c we recover, (iii) the h we recover, when learning only h and not c, (iv) the c we recover, when learning only c and not h, (v) the first PCA of the expression data, (vi) the classifier of DeepSVDD. The latter is used as baseline due to the shared form of input with our method. However, we do not perform an anomaly detection experiment.

Table 3: Results obtained on the GTSRB dataset on three one-class tasks. Reported are AUC values in percents. DS denotes Deep-SVDD by Ruff et al. (2018).

Table 4: Correlation between the recovered functions and the patient’s survival.

In the simplest experiment, we treat h as a unary function by replicating the single input, as done above. We call this the standard correlation experiment. However, h was trained in order to compare two local points and we, therefore, design the local correlation protocol. First, we identify for each test datapoint, the closest test point. We then measure the difference in the target value (the patient’s survival) between the two datapoints, the difference in value for unary functions (e.g., for c or for the first PCA), or h computed for the two datapoints. This way vectors of the length of the number of test data points are obtained. We use the Pearson correlation between these vectors and the associated p-values as the test statistic.

The results are reported in Tab. 4. As can be seen, the standard correlation is low for all methods. However, for local correlation, which is what h is trained to recover, the h obtained when learning both h and c is considerably more correlated than the other options, obtaining a significant p-value of 0.021. Interestingly, the ability to carve out parts of the space with c, when learning h seems significant and learning h without c results in a much reduced correlation.

Finally, we test our method on the Gaussian Mixture Model data following Balduzzi et al. (2018), who perform a similar experiment in order to study the phenomenon of mode hopping. In this experiment, the data is sampled from 16 Gaussians placed on a 4x4 grid with coordinates ,

Figure 2: Results for a mixture of Gaussians placed on a 2D grid, following Balduzzi et al. (2018). (a) The values of the function c across the 2D domain, when c and h are jointly trained. (b) The comparator h shown as a quiver plot of the direction of maximal increase in value. (c) The values of c when it is trained alone without h. (d) A quiver plot for h, when it is trained without c.

0.5 and 1.5 in each axis. In our case, since we model local maxima, we take each Gaussian to have a standard deviation that is ten times smaller than that of Balduzzi et al. (2018): 0.01 instead of 0.1. We treat the mixture as a single class and sample a training set from it, to which we apply our methods as well as the variants where each network trains separately.

The results are depicted in Fig. 2, where we present both c and h. As can be seen, our complete method captures with c the entire distribution, while training c without h runs leads to an unstable selection of a subset of the modes. Similarly, training h without c leads to an h function that is much less informative than the one extracted when the two networks are trained together.

5 DISCUSSION

The current machine learning literature focuses on models that are smooth almost everywhere. The label of a sample is implicitly assumed as likely to be the same as those of the nearby samples. In contrast to this curve-based world view, we focus on the cusps. This novel world view could be beneficial also in supervised learning, e.g., in the modeling of sparse events.

Our model recovers two functions: c and h, which are different in form. This difference may be further utilized to allow them to play different roles post learning. Consider, e.g., the problem of drug design, in which one is given a library of drugs. The constraint function c can be used, post training, to filter a large collection of molecules, eliminating toxic or unstable ones. The value function h can be used as a local optimization score in order to search locally for a better molecule.

ACKNOWLEDGEMENTS

This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC CoG 725974). The contribution of Sagie Benaim is part of Ph.D. thesis research conducted at Tel Aviv University.

REFERENCES

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.

David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Grae- pel. The mechanics of n-player differentiable games. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 354–363, Stockholmsmssan, Stockholm Sweden, 10– 15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/balduzzi18a. html.

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS.

David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent dirichlet allocation. Journal of Machine Learning Research, 3:2003, 2003.

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM computing surveys (CSUR), 41(3):15, 2009.

Charles Darwin. On the Origin of Species by Means of Natural Selection. Murray, London, 1859. or the Preservation of Favored Races in the Struggle for Life.

Sarah M Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. High- dimensional and large-scale anomaly detection using a linear one-class svm with deep learning. Pattern Recognition, 58:121–134, 2016.

Zolt´an Farkas, Dorottya Kalapis, Zolt´an B´odi, B´ela Szamecz, Andreea Daraba, Karola Alm´asi, K´aroly Kov´acs, G´abor Boross, Ferenc P´al, P´eter Horv´ath, et al. Hsp70-associated chaperones have a critical role in buffering protein production costs. eLife, 7:e29845, 2018.

Izhak Golan and Ran El-Yaniv. Deep anomaly detection using geometric transformations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 9781– 9791. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/ 8183-deep-anomaly-detection-using-geometric-transformations.pdf.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using repli- cator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pp. 170–180. Springer, 2002.

Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detec- tion of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.

David Mcallester. Simplified pac-bayesian margin bounds. In In COLT, pp. 203–215, 2003.

M. M. Moya, M. W. Koch, and L. D. Hostetler. One-class classifier networks for target recognition applications. NASA STI/Recon Technical Report N, 93, 1993.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.

Andrew Y Ng and Stuart J Russell. Algorithms for inverse reinforcement learning. In ICML, pp. 663–670, 2000.

Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 09 1962.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexan- der Binder, Emmanuel M¨uller, and Marius Kloft. Deep one-class classification. In ICML, 2018.

Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimen- sionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4. ACM, 2014.

Seebock Schlegl. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. IPMI, pp. 146157, 2017.

Bernhard Sch¨olkopf, John C. Platt, John C. Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computing, 13(7):1443–1471, 2001.

Y. Xia, X. Cao, F. Wen, G. Hua, and J. Sun. Learning discriminative reconstructions for unsupervised outlier removal. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.

Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553, 2015.

A ANALYSIS

We show an example in which modeling using local-maxima-points is an efficient way to model,

in comparison to the conventional classification-based approach. We then extend the framework of

spectral-norm bounds, which were derived in the context of classification, to the case of unsuper-

vised learning using local maxima.

A.1 MODELING USING IS BENEFICIAL

While modeling with a classifier c is commonplace, modeling a set as the local maxima

of a function is much less conventional. Next, we will argue that at least in some situations, it may

be advantageous.

We compare the complexity of a ReLU neural network modeling a set

of m real numbers as either a classifier or as maxima of a value function. Here, and are

linear transformations, b is a vector and is the

ReLU activation function, for and . We denote by the

function that satisfies, if and only if .

For this purpose, we take any distribution D over that has positive probability for

sampling from S. Formally, D is a mixture distribution that samples at probability q > 0 from

and probability from , where is a distribution supported by S and is a distribution

supported by the segment . The task is to achieves error in approximating .

The error of a function is measured by 11, which is the probability of

c incorrectly labeling a random number .

We show that there is a ReLU neural network with 2m neurons, such

that, the set of local maxima of v is S. In particular, we have: 11. Here,

, that satisfies, if and only if x is a local maxima of v. Additionally,

we show that any such v has at least hidden neurons. On the other hand, we show that

for any distribution D (with the specifications above), any classification neural network c(x) =

that has error 11has at least 3m hidden neurons.

Theorem 1. Let be any set of points such that for all

1}. We define to be the function, such that if and only if . Then,

1. There is a ReLU neural network of the form with 2m hidden neurons such that the set of local maximum points of v is S.

2. Any ReLU neural network of the form such that any is a local maxima of v has at least neurons.

3. Let be a distribution that samples at probability q > 0 from and probability from , where is a distribution supported by S and is a distribution supported by the segment . Then, for a small enough , every ReLU neural network of the form , such that 11has at least 3m hidden neurons.

Proof. We begin by proving (1). We construct v as follows:

, x: v.

, . . . , m , : v.

, . . . , m , x: v.

): + 1.

we consider that v is a piece-wise linear function with 2m linear pieces and arg max v = S. By

Thm. 2.2 in Arora et al. (2018), this function can be represented as a ReLU neural network of the

form , that has 2m hidden neurons.

Next, we prove (2). Let be a function of the form , such that

each is a local maxima of it. First, by Thm. 2.1 in Arora et al. (2018), v is a piece-wise linear

function. We claim that v has at least two linear pieces between each consecutive points and ,

for . Assume the contrary, i.e., there is an index , such that,

v has only one piece between and . If , then, v is constant between and

, and therefore, and are not local maximas of v, in contradiction. If ,

then, because v is linear between and , for every point , we have ,

in contradiction to the assumption that is a local maxima of v. If , then, because

v is linear between and , for every point , we have , in

contradiction to the assumption that is a local maxima of v. We conclude that v has at least

two linear pieces between the points and , for all . Therefore, v has at

least 2m pieces. By Thm. 2.2 in Arora et al. (2018), v has at least hidden neurons.

Next, we prove (3). We denote and . Let be the probability

of sampling from . Since is supported by S, we have: . We define

. In addition, is a continuous distribution supported by the closed

segment . Thus, by Weierstrass’ extreme value theorem, the probability density

function of that is a continuous function, obtains its extreme values within the segment.

In addition, since is supported by , we have: for all

. By combining the above two statements, we conclude that there is a point

such that for every . We denote by

. Since we are interested in proving the claim

for a small enough , we can simply assume that and a ReLU neural

network of the form .

We have:

Assume by contradiction that: . Then, 11in contradic-

tion. Therefore, for every .

We also have:

Assume by contradiction that there is , such that the set

is finite. Then,

in contradiction. Let and be two points such that

and . Since c is a continuous function and piece-wise linear and the four points

are not co-linear, we conclude that c has at least three linear pieces in

the segment . Similarly, c has at least two linear pieces in each of the segments

and . We conclude that c has at least 3m+1 pieces. By Thm. 2.2 in Arora et al. (2018),

c has at least 3m hidden neurons.

In the above theorem we showed that there exists a ReLU neural network

with 2m hidden neurons that captures the set S as its local maximas. Furthermore, we note that the

set of functions that satisfy these conditions (i.e., shallow ReLU neural networks with 2m hidden

neurons that capture the set S) is relatively limited. For instance, any such v behaves as a piece-

wise linear function between any and with only two linear pieces. Therefore, any such v is

uniquely determined by the set S up to some freedom in the selection of the linear pieces between

and in S. On the other hand, a shallow ReLU neural network with

more than 2m hidden neurons is capable of having more than two linear pieces between any and

.

A.2 A GENERALIZATION BOUND

The following lemma provides a generalization bound that expresses the generalization of learning

c along with v. In the following generalization bound, we assume there is an arbitrary distribution D

of positive samples. In addition, we parameterize the class, , of value

functions by vectors of parameters and the class, ,

of classifiers by . We upper bound the probability of a mistake done by any classifier

and value function on a random sample . In this case, and mistake if x is not a

local maxima of or classified as negative by . The upper bound is decomposed into the sum of

the average error of and on a dataset and a regularization term.

See Appendix B for the exact formulation and the proof.

Lemma 1 (Informal). Let be a class of value functions and

a class of classifiers. Assume that and are ReLU

neural networks of fixed architectures, with parameters and (resp.). Let C(g) is the spectral

complexity of the neural network g and an -neighborhood of

x. Let D be a distribution of positive examples. With probability at least over the selection of

the data , for every and , we have:

The above lemma shows that the probability of to be a local maxima of and classified

as a positive example by , is at most the sum of the probability of to be a local maxima

of and classified as a positive example by and a penalty term. The penalty in this case is

of the form O

is the sum of the spectral norms of and . This suggests a tradeoff between the

sum of the spectral complexities of and and the ability to generalize. The bound is similar

asymptotically to the bounds of Neyshabur et al. (2018) and Bartlett et al. for (multi-class) super-

vised classification. In their bound, the penalty term is of the form O

the (multi-class) classifier is of the form , for a neural network

.

Our analysis focused on the value and not on the comparator h. However, the complexities of the

two are expected to be similar, since a value function can be converted to a comparator by employing

.

B A FORMAL STATEMENT OF THE GENERALIZATION BOUND

In this section, we build upon the theory presented by Neyshabur et al. (2018) and provide a gener-

alization bound that expresses the guarantees of learning c, along with v for a specific setting.

Before we introduce the generalization bound, we introduce the necessary terminology and

setup. We assume that the sampling space is a ball of radius B, i.e.,

. Each value function is a ReLU neural network of the form

, where, for such that

and . In addition, is the ReLU activation

function extended to all and . We denote, . The set C consists

of classifiers such that each function is a ReLU neural network of

the form , where, for such that

and . We denote . Additionally, we denote,

and .

The spectral complexity of a ReLU neural network with parameters

is defined as follows:

For two distributions P and Q over a set X, we denote the KL-divergence between them by,

. For two functions function , we denote

the asymptotic symbols: g(x) = O(f(x)) to specify that , for some constant c > 0.

We denote by 11[x] the indicator, if a boolean true, false} is true or false.

We define a margin loss of the form:

where, are fixed margins and is the -neighborhood

of x, for a fixed . In this model, the margins serve as parameters that dictate the amount of

tolerance in classifying an example as positive. Similar to the standard learning framework, for a

fixed distribution D over X, the goal of a learning procedure is to return (given some input) and

that minimize the following generalization risk function:

The learning process has no direct access to the distribution D. Instead, it is provided with a set

of m i.i.d samples from . In order to estimate the generalization risk, the

empirical risk function is used during training:

The following lemma provides a generalization bound that expresses the generalization of learning

c along with h.

Lemma 2. Let and C be as above. Let D be a distribution of positive examples. With

probability at least over the selection of the data , for every and

, we have:

B.1 PROOF OF LEM. 2

All over the proofs, we will make use of two generic classes of functions

and . For simplicity, we denote the indices of and

by and 1 (instead of 1 and 2). Given a target function and two functions

and , we denote the loss of them with respect to a sample x by:

The generalization risk:

And the empirical risk:

We modify the proof of Lem. 1 in Neyshabur et al. (2018), such that it will fit our purposes.

Lemma 3. Let be a target function. Let and

be two classes class of functions (not necessarily neural networks).

Let and be any two distributions on the parameters and (resp.) that are independent of the

training data. Then, for any , with probability over the training set of size m, for

any two posterior distributions and over and (resp.), such that

Proof. Let be a set with the following properties:

S, ω< γand < γ

We construct a distribution over , with probability density function:

Here, Z is a normalizing constant. By the assumption in the lemma, .

By the definition of , we have:

Since this equation holds uniformly for all , we have:

Now using the above inequalities together with Eq. 6 in Mcallester (2003), with probability

over the training set we have:

where the last inequality follows from the following observation.

Let denote the complement set of and denote the density function restricted

to and normalized. In addition, we denote . Then,

where is the binary entropy function. Since the

KL-divergence is always positive, we get,

Since are are independent joint distributions, we have:

.

Lemma 4. Let be a class of value functions

and a class of classifiers (not necessarily neural

networks). We define two classes of functions and

. Then,

where, and .

Proof. We would like to prove that if

and if . Since and are independent, it willprove the desired inequality.

First, we consider that:

With no loss of generality, we assume that and denote

. Therefore, we have:

Next, we consider that:

Lemma 5. Let be a class of value functions

and a class of classifiers (not necessarily neural

networks). Let and be any two distributions over the parameters and (resp.) that are

independent of the training data. Then, for any , with probability over the

training set of size m, for any two posterior distributions and over and (resp.), such that

and , we have:

Proof. Let G and H be as in Lem. 4. By Lem. 4 and our assumption,

We note that all of the samples in D are positive. Therefore, by Lem. 3, with probability at least

, we have:

By the definition of . In addition, by the

definition of and . Therefore, we can rephrase Eq. 31 as

follows:

Proof of Lem. 2. We apply Lem. 5 with priors and posteriors , distributions simi-

lar the proof of Thm. 1 in (Neyshabur et al., 2018). In their proof, they show that for their

selection of prior and posterior distributions: (1)

1

holds and (2) . By taking to be half

of the value the use and therefore, (from their proof) to be half of the value they use

as well, we obtain and

. We select and in a similar fashion. In particular, we can re-

place the penalty term in Lem. 5 as follows:

C ADDITIONAL FIGURES

Figure 3: Same as Fig. 1, but where the images are taken from the test set of all classes, regardless of the single class used for training.