A Connection between Feed-Forward Neural Networks and Probabilistic Graphical Models

2017·Arxiv

Abstract

Abstract

Two of the most popular modelling paradigms in computer vision are feed-forward neural networks (FFNs) and probabilistic graphical models (GMs). Various connections between the two have been studied in recent works, such as e.g. expressing mean-field based inference in a GM as an FFN. This paper establishes a new connection between FFNs and GMs. Our key observation is that any FFN implements a certain approximation of a corresponding Bayesian network (BN). We characterize various benefits of having this connection. In particular, it results in a new learning algorithm for BNs. We validate the proposed methods for a classification problem on CIFAR-10 dataset and for binary image segmentation on Weizmann Horse dataset. We show that statistically learned BNs improve performance, having at the same time essentially better generalization capability, than their FFN counterparts.

1 Introduction

In this work we study connections between feed-forward neural networks (FFNs) and graphical models (GMs). The latter can be divided in two classes – directed and undirected ones (see e. g. [9]). The corresponding statistical models are known as Bayesian networks (BNs) and Markov random fields (MRFs), respectively. On the neural network side, perhaps the two most popular types are restricted Boltzmann machines (RBMs, [1]) and feed-forward networks. RBMs are statistical models, i. e. they implement Boltzmann distributions, which are nothing but MRFs. Hence, there is a one-to-one correspondence between RBMs and undirected GMs. The situation is quite different for FFNs. Although the architectures of FFNs and BNs are basically the same, FFNs do not represent probability distributions at all. They are considered just as deterministic mappings, transforming observed values into output values. The main aim of this work is to bridge this gap between FFNs and BNs by showing that each FFN implements an approximation of a corresponding BN.

Undirected GMs have been used successfully for decades, especially in computer vision. Directed GMs appear to be less popular. At the same time, deep learning methods have found tremendous success in recent years, especially convolutional neural networks (CNNs, [13]). The most successful techniques, however, have no statistical background as of yet. Hence from a practical viewpoint, the goal of this work is to establish a connection that makes a wide range of elaborate statistical inference and learning techniques available to modern deep architectures.

Related works. There are many works showing how certain algorithms for graphical models can be represented as applications of specially designed networks. One popular approach is e. g. to express mean-field based inference techniques as neural networks, which allows for very fast implementations [10]. In [20] it is shown how to express mean-field based inference as a differentiable layer in a CNN. In [14] it is proposed to directly learn messages for energy minimization, again expressed in terms of neural computations. In these works, however, only specific aspects are considered. The full power of MRFs, including uncertainty, statistical learning, decisions other than maximum a-posteriori etc. is not exploited. More importantly, the connections established in these works go from GM to FFN, i. e. for a particular GM algorithm, the corresponding FFN architecture is designed. Our aim, in contrast, is to address the opposite direction, i. e. to be able to transform a given FFN to the corresponding GM, thereby taking into account that the latter implements a complex probability distribution, rather than an optimization problem.

The observation that incorporating stochastics into the learning process can improve performance is not new. The most famous example may be the “drop-out” method [18]. Injecting noise (see e. g. [3] and references therein) is also a popular approach. Note how the use of randomness improves the generalization capabilities of the learned models, as it is usual for statistical models in contrast to discriminative ones. However, these approaches use stochastics in a rather ad-hoc manner, i. e. the proposed procedures are not derived from a statistically sound probability model for all constituents.

Probably the closest to our work are [17] and [19], where algorithms for learning Bayesian networks are presented. The approach in [17] employs techniques commonly used for undirected graphical models. In particular, the Bayesian network is transformed into the corresponding undirected graphical model of higher order, and an adapted version of mean-field approximation is applied to it. Finally, the algorithm uses a complex message passing scheme, which, while very interesting from a theoretical point of view, seems to be not efficient enough for very deep models and large data. The algorithm in [19] in contrast does not rely on the mean-field approximation but aims to maximize the original likelihood using importance sampling. As acknowledged by the authors, importance sampling has certain requirements in order to work in practice, namely that the proposal distribution should be “not small”. Unfortunately, for very deep architectures, it seems to be just the opposite case. Note that in both of these works no connection between FFN and BN is explicitly given. In other words, there are no statements on how to interpret deterministic computations performed by an FFN in terms of a corresponding statistical BN. In contrast, our work provides this interpretation, making it possible to directly compare FFNs to their statistical counterparts.

From the viewpoint of modeling, our work is motivated by the following observation. One of the recent trends in computer vision is to combine FFNs and conditional random fields (CRFs, [12]) in a unified framework, and to learn them jointly (see e.g. [2]). There are different ways to achieve this. For instance, a CRF can be seen as a differentiable layer in an FFN as in [20]. Another option is to use a FFN for computing the potentials of a CRF sitting on top of it. As an example, in [5] a maximum likelihood based learning scheme is presented, where the gradient of the likelihood is approximately computed using belief propagation. In [8] an efficient sampling-based approach is presented. However, in both [5] and [8] the FFN is again only considered as a mapping, producing CRF potentials from the input in a deterministic manner. There is no statistical model or interpretation of the FFN as such. Obviously, for such FFN-CRF combinations it would be profitable to be able to represent both FFN and CRF in a unified manner, because then they could be combined in a transparent way and the same learning principles applied for both parts.

Contributions. First, we show that for each FFN, i.e. a deterministic mapping, a corresponding probabilistic directed graphical model, or Bayesian network, can be defined that implements a probability distribution such that the former is an approximation of the latter. We discuss possible benefits from having such a correspondence with respect to modeling, learning and inference.

Second, based on this correspondence, we derive a learning algorithm for BNs, which has the following properties. On the one side, it implements a statistical learning. As a consequence, learned probabilistic models are less vulnerable to over-fitting as compared to the corresponding discriminative/deterministic approaches. At the same time, unlike many statistical learning algorithms, our method enjoys perfect sampling, that makes it quite efficient in practice. On the other side, the proposed method is very similar to the commonly used error back-propagation (EBP) algorithm for FFNs. It potentially allows the proposed approach to use numerous findings made for FFNs in the past years, like e.g. elaborated general solvers. Nonetheless, due to its similarity to the EBP, the proposed method can be seamlessly incorporated into existing deep learning frameworks, like e.g. Caffe [6].

In the experimental section, we evaluate the proposed approach with respect to the obtained performance and generalization capability for a classification task on CIFAR-10 database and for binary segmentation on Weizmann Horse dataset. We show improved performance and great improvement of the generalization capability of the proposed method as compared to the corresponding deterministic models.

2 Statistical interpretation of FFNs

A unified representation of different types of neurons. We start from considering a simple sigmoid neuron: , where sigmis the input vector, and w are the weights. Its statistical interpretation is well known to be a logistic regression. Let us introduce a random variable and its conditional probability distribution (log-linear model). Then sigm. Hence, the outputs of sigmoid neurons in an FFN can be interpreted as probability values.

The problems that prevent us from following this specific interpretation of FFN neurons are the following. For one, the idea can not be applied to many other types of neurons, such as tanh or RELU, simply because these neurons compute values that may be negative or unbounded. But we deem the next problem to be even more crucial. Assume that the output of a sigmoid neuron serves as input for another one. Then this output value is multiplied by a weight, which can be arbitrary. This operation, however, has no statistical interpretation, since the obtained value can have arbitrary range depending on the weight. The values x for a neuron are usually understood as features. But it is hardly possible to reinterpret probability values as features, as it is difficult to pose a well-founded statistical model with clearly defined sample space, random variables, probability distribution etc. in this case.

The key idea is to interpret the output of a sigmoid neuron in an FFN not as a probability value, but as an expectation

with y and p(y|x; w) as described above. Now it is easy to see how this interpretation can be applied for other neuron types. Consider e. g. a tanh neuron with the transfer function

Let us introduce the random variable and the conditional probability distribution , similar to (1).

Concerning the RELU neuron, we borrow the idea of [15], where it is shown that RELUs can be represented as a sum of sigmoid neurons with appropriately defined weights. Since the sum of expectations is the expectation of a sum, we can assert the applicability of (1) for RELUs as well.

In fact, it is possible to represent any deterministic mapping y = f(x) as an expectation. The simplest variant is

where if its argument is true. Of course, this interpretation is not very useful, since the function remains in fact deterministic – there is only one possible value of y for each value of f(x). It asserts, however, the possibility to represent all thinkable neuron types (e. g. pooling, normalization etc.) in a unified manner. To summarize, neurons can be represented in a unified manner by means of the following constituents: (i) two random variables x and y, and (ii) a conditional probability distribution p(y|x). According to this representation, the application of a neuron consists in the computation of the expectation of y,

FFN and the corresponding BN. Consider an FFN that consists of neurons Each neuron has its inputs . The simplest way to represent an FFN is as a directed graph G = (V, E), where nodes correspond to neurons and directed edges are defined as E = . Each neuron v is associated with its output value and is characterized by a function , where denotes the vector of values for all neurons . Some neurons are input ones, namely those for which . The values of these neurons are not computed by the network, but are just given as observations. There are also output neurons , i. e. those which do not serve as input for some other neurons in the network. The set of all other neurons is denoted by . The feed-forward architecture means that the graph G = (V, E) has no directed cycles. Hence, it is possible to introduce a partial ordering on the set of neurons, such that each neuron has higher rank than its inputs, i. e. . The application of the network consists of computing the output values given the input values . Since the neurons are ordered, the computation of can be performed sequentially according to the introduced partial ordering

The corresponding probabilistic directed graphical model, or Bayesian network, is built as follows. First, we introduce random variables for each neuron. Let us denote by variables that correspond to the input neurons, are output variables, and all intermediate variables. Each variable takes values that correspond to the particular type of neuron. For example, if v is a sigmoid neuron, if v is a tanh neuron, for input neurons etc. In order to unify and simplify notation, let us also introduce “unified input vectors” for all neurons (analogous to the previously introduced vector of input values for a neuron v. These values can be each non-input neuron , we define the conditional probability distribution , which also corresponds to a particular type of neuron (e. g. for both sigmoid neuron and tanh, where w are the weights). An elementary event in our statistical model is a vector (x, y, z) of values for all variables, where x denotes the vector of all input values; y and z defined analogously. The model is the conditional probability distribution

In order to be inline with graphical model representation, let us define clique functions as for all and for all . Then the probability distribution (5) can be written as with the energy

Note that unlike in undirected GMs, there is no normalization constant Z, since we use logarithms of real conditional probabilities as clique functions.

FFNs approximate the corresponding BNs. For simplicity, let us consider first a BN as described above with only one hidden layer, i.e. -s for all contain only input variables . Let also the output y depends only on z, i.e. p(y|z, x) = p(y|z) for clarity. Since z-s are auxiliary variables, at the end we are interested in the conditional probability distribution of the output variables given the input, i.e. p(y|x), which is obtained by marginalization over auxiliary variables . This probability distribution can be seen as a convex combination of probability distributions p(y|z) with different values of the argument z and coefficients p(z|x). Let us approximate the convex combination of function values by the value of the function, obtained at convex combination of its argument, i.e.

Since p(z|x) is conditionally independent (see (5)), the components of are expectations

This is however exactly what neurons compute (see (4)). To conclude, FFN with one hidden layer approximates the true probability distribution p(y|x) of the corresponding BN using (7).

It is very easy to see how to extend the above statement to the models with more than one hidden layer. Let us consider the layered structure as follows. We partition the set of intermediate variables layers according to the previously introduced order of neurons (i.e. inputs for neurons in the l-th layer belong to previous layers ). Let us denote by the vector of all variable values in the l-th layer. The probability distribution (5) can be rewritten as

The probability of interest p(y|x) is obtained by marginalization over all . Let us first perform marginalization over values of only, using thereby the approximation (7). In doing so we obtain

The value of can be seen as a “new observation” for a network that has one layer less than the original one. By proceeding further sequentially we finally obtain

Hence, multi-layer FFNs perform a sequence of approximations in the corresponding BNs. We will call it sequential approximation.

At this point we would like to discuss a crucial difference between the original model (9) and its sequential approximation (11). In fact, the latter implements a conditionally independent posterior probability distribution p(y|x), since all are computed deterministically. It means, that (11) is not able to capture the dependencies between output variables for a particular input in principle. The original model (9) can indeed, because the desired posterior is obtained by marginalization over all hidden , i.e. output variables are not conditionally independent. Hence, especially for structured output we could expect essentially better performance of the original model as compared to its approximation.

What is the benefit of having this connection? In essence, the presented correspondence between FFNs and Bayesian networks allows to use for FFNs the rich and diverse set of methods that has been developed for graphical models, both directed and undirected ones, since the former can be trivially transformed into the latter. Below, we give a brief overview of possible research directions in this context.

Inference. After training, FFNs are typically applied through forward-propagation, which we have shown to be equivalent to computing a sequential approximation (11). However, since we can now represent FFNs as graphical models (6), we can try other inference strategies, such as minimizing the associated energy of (6) in order to compute the maximum a-posteriori (MAP) estimate. This task is known to be hard for BNs [17]. However, there are many elaborated techniques for energy minimization that have recently been developed for undirected graphical models [7]. Thus far, we did not pursue this direction, mainly because the corresponding energy minimization problem is of higher order, hence a specialized solver should be developed/chosen for this.

In general, we can apply Bayesian decision theory with a loss function of our choice to obtain a corresponding estimator. The associated optimization problem is typically intractable. However, note that BNs admit drawing perfect independent samples efficiently from the posterior p(y|x). Hence, we can rather easily approximate Bayesian estimators through the use of samples. To that end, we use samples to approximate the maximum marginal decision in our experiments for binary image segmentation (see below).

Learning. There are many approaches to learning graphical models. For example, one could use standard conditional likelihood maximization of the original model (5). However, we did some preliminary experiments and found this not to work well. Hence, we instead present an algorithm in the following section that maximizes a lower bound of the conditional likelihood, which we found to be quite stable and efficient.

Another possibility would be to pose the learning task in a discriminative manner by using the framework of structural support vector machines (SSVM, see e.g. [16] for an overview) with latent variables. A very important property of the SSVM framework is the possibility to learn with application-specific losses. Unfortunately, again, in order to perform this kind of learning, a specialized solver for the corresponding energy minimization problems should be developed first.

CNN+CRF combinations. Using the representation of an FFN as a graphical model makes it essentially easier to combine a CRF with a CNN, that basically means just to extend the energy function (6) by adding the potential functions associated with the CRF.

Practical benefits. Leveraging the connection between FFNs and the corresponding BNs gives the possibility to combine or alternate between different learning algorithms. For instance, common error back-propagation is faster as compared to BN learning. Moreover, it scales more easily to large amounts of data. Hence, a reasonable procedure could be to start learning by error back-propagation, then convert the learned FFN into the corresponding BN and perform fine-tuning by training the BN. On the other hand, stochastic methods are known to be beneficial to escape local optima. Hence, another interesting direction would be to alternate between error back-propagation and statistical BN learning, using the former to speed up convergence and the latter to escape local optima.

3 Learning

We pose the learning task as maximization of the conditional data likelihood

where are the training data, summarizes all unknown parameters, and is obtained by marginalization over all hidden , organized in layers (see (9)). In further we omit the summation over t to simplify notations, i.e. we concentrate on optimization for just one learning example (x, y), where y is the ground truth for x.

Let us assume for now that we want to learn only the parameters of the output layer. For this, we approximate (12) by its lower bound using Jensen’s inequality and obtain

where represents the last hidden layer, and is obtained by marginalization over all hidden layers but L. Furthermore, let us specify where is some function. Using this, the gradient of (13) wrt.

The summations over and over in (14) are infeasible. Therefore, we employ stochastic gradient ascent as follows: (i) draw a sample according to the current probability distribution , (ii) draw a sample according to the current probability distribution , and (iii) compute and apply the stochastic gradient

Note, that stochastic gradient ascent is not an approximation in the common sense, because it is exact in the limit of infinity. We would also like to point in that we are able to draw perfect independent samples, since we have a directed graphical model behind our probability distribution. In particular, we have to draw sequentially a and so on up to the needed in (15). Note also, that each sampling step can be performed very efficiently, since all probability distributions are conditionally independent. This is in contrast to sampling in general undirected graphical models using e.g. Gibbs sampling, where obtaining independent samples is quite time consuming.

In order to learn the parameters of an l-th layer, we proceed as follows. We “split” the network in two parts – one from x to and another from to the output y. Again, using Jensen’s inequality, the corresponding lower bound of the likelihood (12) reads

The summation over is again carried out using stochastic approximation. For use the approximation (11), i.e. we approximate the part of BN from by the corresponding FFN.

To summarize, the resulting algorithm consists of two steps: (i) draw a sample using current model, and (ii) propagate an “error” through the network, compute and apply stochastic gradient, like (15) for the output layer. The second step is thereby exactly the same as in the commonly used error back-propagation (EBP) for FFNs. The difference is only in the first step – EBP deterministically computes the output value for each neuron, whereas in our algorithm the output values are sampled according to the corresponding conditional probability distributions. In essence, the core difference between EBP and our statistical learning is the following. EBP approximates the underlying probability distribution but uses exact gradient. Our approach is to train the original model, but use approximated gradient ascent for learning.

4 Experiments

Image classification. Here, we validate our approach on the CIFAR-10 dataset [11]Caffe framework [6] in these experiments and the “standard” network implementationas the baseline. First, we experimented with the network variant built of sigmoid neurons. In order to further simplify experiments and avoid influences of additional design choices we used simple stochastic gradient descent solver without momentum and fixed gradient step size. At the same time we slightly increased the step size and learned models with essentially higher number of iterations. Both EBP for the baseline FFN and our algorithm for the corresponding BN were learned with exactly the same parameters. We also performed several runs for both methods and averaged the measurements.

Figure 1: Results for CIFAR-10 benchmark. Left – training and test accuracies for the baseline and the proposed approach (with sigmoid neurons) as the dependencies on time during learning, dashed lines – training, solid lines – test. Right – results summary (classification accuracies in %), “orig.” – the original network, “appr.” – RELUs are approximated by the sum of sigmoids as in [15].

Fig. 1 (left) shows the dependencies of the training and test accuracies on time. At the beginning of the learning process the test accuracy of the FFN slightly outperforms the one of the BN. After relatively short time however it saturates and even starts to decrease, that is a sign for over-fitting. In contrast, the BN’s test accuracy constantly increases and finally outperforms FFN by about 2%. What is however even more important in our opinion is that the BN shows essentially smaller difference between the training and the test accuracies. Hence, it is much more robust to over-fitting as compared to its deterministic counterpart. The results are summarized in Fig. 1 (right).

We also experimented with the RELU-variant of the network. In doing so, we used all default parameters (step size, momentum, weight decay etc.), increasing only the number of learning iterations. Remember that first we need to approximate RELUs by a sum of sigmoid neurons. Actually, it makes the test accuracy essentially worse (see Fig. 1 (right), “orig.” vs. “appr.”). On the other side, the original RELU network turns out to be extremely over-fitting. As in the previous esperiment with sigmoids, BN improves both the performance and the generalization of its deterministic counterpart. Among all experiments, the BN with stochastic RELUs shows the best generalization.

Binary image segmentation. Next, we evaluate the proposed method for binary image segmentation on Weizmann Horse dataset [4]. It consists of 328 images, showing horses in a side view on different backgrounds. In order to accelerate computations, we downscaled all images to the size of

For this experiment we designed our own convolutional (CNN) architecture shown in Fig. 2 (left, bottom). Formally, the input variables correspond to image pixels, whereas the RGB-color. There is one output variable for each pixel that assigns a label to it. Neurons are organized in layers with one neuron for each pixel. We will refer all neurons by their “coordinates” (l, i, j), where l is the layer number and (i, j) are image coordinates. The set of inputs for each neuron consists of neurons from previous layers that are within a certain radius around (i, j). Neurons from the first layers have also access to the image pixels, again within the radius . All neurons are sigmoids.

We study the model performance as well as the generalization capability with respect to varying network depth and the number of training examples. One experiment consists in following. First, we draw randomly a pre-defined number of training examples (out of 328 in total), the rest is used for test. Then we learn the deterministic variant of the CNN by EBP and the corresponding BN by the proposed algorithm. After the models are learned we perform inference using maximum marginal decision strategy and compute intersection over union (IoU) for both models and both training and test sets. Each experiment is repeated 10 times (i.e. with different randomly sampled training/test sets of the same size) and the measures are averaged.

Fig. 2 shows the dependencies of IoU on the network depth (upper row) and on the training set size (bottom row, middle and right). First of all, it is clearly seen that statistical model has essentially better generalization capabilities, since the difference between training and test accuracies is always smaller. Concerning performance (i.e. IoU on the test set), statistical model is also superior in most cases. This performance gain decreases for larger training sets (see Fig. 2,

Figure 2: Results for Weizmann Horse dataset. Dependencies of the IoU on the model capacity (top row) and on the size of the training set (bottom row, middle and right). Top, left – weakest training set, middle – strongest training set/model, right – averages over training sets/models. Dashed lines – training, solid lines – test. Bottom, left – the used network architecture (only one image row is shown). The input sets for two neurons depicted in bold are shown. The connection to intermediate variables are shown in green, connections to the image – in blue. For this example,

top, middle, and bottom, right). This is indeed expected, since it is well known that discriminative approaches are less robust with respect to over-fitting in general, as compared to the statistical ones. At the same time, they are less vulnerable to misspecification, i.e. they perform better for large training data in general.

5 Conclusion

In this work we established a connection between feed-forward neural networks and Bayesian networks. We showed that the application of an FFN can be seen as an approximation of the corresponding BN. Based on this connection, we developed a learning algorithm for BNs. Evaluation shows that BNs outperform the corresponding FFNs in terms of both performance and generalization capabilities. We also characterized various benefits of having this connection between FFNs and GMs (see the discussion in Sec. 2), which motivate us for further work.

References

[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.

[2] Anurag Arnab, Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Måns Larsson, Alexander Kirillov, Bogdan Savchynskyy, Carsten Rother, Fredrik Kahl, and Philip Torr. Conditional random fields meet deep neural networks for semantic segmentation.

[3] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.

[4] E Borenstein, E Sharon, and S Ullman. Combining top-down and bottom-up segmentation. In CVPR, pages 46–46, 2004.

[5] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning deep structured models. In ICML, pages 1785–1794, 2015.

[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[7] Jörg H. Kappes, Bjoern Andres, Fred A. Hamprecht, Christoph Schnörr, Sebastian Nowozin, Dhruv Batra, Sungwoong Kim, Bernhard X. Kausler, Thorben Kröger, Jan Lellmann, Nikos Komodakis, Bogdan Savchynskyy, and Carsten Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155–184, 2015.

[8] A. Kirillov, D. Schlesinger snd S. Zheng, B. Savchynskyy, P.H.S. Torr, and C. Rother. Joint training of generic cnn-crf models with stochastic optimization. In ACCV, 2016.

[9] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques -Adaptive Computation and Machine Learning. The MIT Press, 2009.

[10] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS2011, pages 109–117, 2011.

[11] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[12] J. Lafferty, A. McCallum, and F. CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

[13] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[14] Guosheng Lin, Chunhua Shen, Ian Reid, and Anton van den Hengel. Deeply learning the messages in message passing inference. In NIPS, pages 361–369, 2015.

[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[16] Sebastian Nowozin and Christoph H Lampert. Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3-4):3–4, 2011.

[17] Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan. Mean field theory for sigmoid belief networks. J. Artif. Int. Res., 4(1):61–76, March 1996.

[18] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[19] Yichuan Tang and Ruslan Salakhutdinov. Learning stochastic feedforward neural networks. In NIPS2013, pages 530–538, 2013.

[20] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

Appendices

A Approximation with discrete variables.

First of all we would like to discuss in more detail the approximation

(see (7) in the main part), especially for the case of discrete variables z.

Consider e.g. the case that the variable z of some conditional probability distribution p(z|x) is discrete, e.g. are not numbers, but just elements of a general discrete set (like labels). The expectation is not defined in this case, because addition and multiplication are not defined. In praxis however, the values of z are supposed to be (although discrete) elements of a vector space in fact, i.e. they are “encoded” e.g. by , or similar. Moreover, often the values of a random discrete variable z are further used in another probability distribution p(y|z), where p(y|z) is specified by scalar product, i.e. Hence again, the values of z should be vectors in order to be able to define scalar product.

In order to be able to work with discrete variables, let us explicitly introduce the “encoding function” , the argument of which is a value of a random variable, while output is an element of a vector space, i.e. addition and multiplication with a scalar are well-defined for values of the encoding function makes possible to consider the expectation of values of instead of values of , which is indeed well-defined. For instance, in the example above, a and b are to be considered as events, whereas is a random variable. For instance, and is the encoding used in sigmoid neurons, and corresponds to a tanh-neuron, etc. Using these notations (17) can be rewritten as

Correspondingly, the sequential approximation

(see (11) in the main part) can be rewritten as

In further however, we will still use “simplified” notations instead of ”mathematically correct” just to omit notational clutter.

B The learning algorithm.

Here, we give a more detailed derivation of our learning algorithm. Let us first recall the problem statement. Our model is (see (9) in the main part)

The learning task is posed as maximization of the conditional data likelihood

where are the training data, summarizes all unknown parameters, and is obtained by marginalization of (21) over all hidden . In further we omit the summation over t to simplify notations, i.e. we concentrate on optimization for just one learning example (x, y), where y is the ground truth for x.

Output layer. Let us assume for now that we want to learn only the parameters of the output layer. For this, we approximate (22) by its lower bound using Jensen’s inequality and obtain

where represents the last hidden layer, and is obtained by marginalization over all hidden layers but L. Furthermore, let us specify where is some function. Using this, the gradient of (23) reads

The summations over and over in (24) are infeasible. Therefore, we employ stochastic gradient ascent, i.e. we replace the expectation over by a realization. In particular, first, we draw a sample according to the current probability distribution , and second, apply stochastic gradient

Moreover, the second addend in (25) is again an expectation. Hence, we can employ stochastic approximation here as well. In particular, given an , we draw an from , and our final stochastic gradient is

To be more concrete, let us consider an example, where the output layer consists of sigmoid neurons, i.e.

where the lower index v indicates a particular output neuron, and represents its weights. Correspondingly, the gradient (26) is

Intermediate layers. In order to learn the parameters of an l-th layer, we proceed as follows. We “split” the network in two parts: one from x to and another from to the output y. Again, using Jensen’s inequality, the corresponding lower bound of the likelihood (22) reads

The conditional probability distribution is obtained by marginalization over all layers . The key idea to cope with this infeasible marginalization is to approximate by the corresponding feed-forward network using our approximation (19), i.e.

which will be denoted by . After this approximation the lower bound (29) reads

Its gradient is w.r.t. is obtained via chain rule as

The summation over is again carried out using stochastic approximation: (i) draw , and (ii) apply the stochastic gradient

The first factor in (32) is obtained using standard error back-propagation. It computes an error , which is a vector of values for each neuron v in the l-th layer. The second factor in (32) is

Combining (32) and (33) for a particular neuron v we obtain the stochastic gradient w.r.t. its parameters

Let us again concretize (34) for sigmoid neurons (27). Remember that in this case i.e. the sum over is reduced to just one value for . Hence, the stochastic gradient for sigmoid neurons is

There is a practical concern with the resulting algorithm. The problem is that for each layer we need to perform its own forward and backward pass to compute the first factor in (32), that is quite time consuming. Therefore in practice, we perform only one forward and one backward pass per gradient update to save computations. During the forward pass we sample values which are necessary for stochastic approximations (25), (26) and (32). During the backward pass we use these sampled values for error propagation and update parameters by (26) and (34).

All in all, the resulting algorithm turns out to be very similar to the commonly used error back-propagation. The difference is only in the forward pass – error back-propagation deterministically computes the output value for each neuron, whereas in our algorithm the output values are sampled according to the corresponding conditional probability distribution.

C Additional experiments

Here we present some additional experiments for binary image segmentation on Weizmann Horse dataset.

Robustness against image noise. This experiment aims to compare the behaviour of the models with respect to image noise. For this, we learned two models: a feed-forward network (FFN) and the corresponding Bayesian network (BN) with depth equal 50 on the whole dataset of 328 images (see the model description in the experimental section of the main part). Perhaps the most popular noise types are pixel-wise independent Gaussian noise and Gaussian blur. We performed series of experiments with an image from the training set, which was segmented very well by both FFN and BN. We corrupted it by (i) Gaussian noise of increasing variance and (ii) Gaussian blur with increasing radius. The results are shown in Fig. 3. We prefer to give not the decisions but gray-value coded foreground marginals, since these are more illustrative in our opinion. It can be clearly seen that the segmentations produced by FFN degenerate faster with the increasing noise level as compared to those produced by the corresponding BN. It asserts the favourable robustness of the latter to image noise.

Figure 3: Results for images with increasing noise level. Middle row – FFN, bottom row – BN. Left – original image, second and third columns – Gaussian noise of increasing variance, fourth and fifth columns – Gaussian blur with increasing radius.

Figure 4: Results for interactive segmentation. Middle – without and right – with user scribbles (marked red).

Semi-supervised segmentation. Let us assume that at the inference stage some pixels are marked by the user as being the foreground or background. Obviously, this additional information (we call it “user scribbles”) does not play any role, when working with FFN, again, because the underlying probability distribution for output variables is conditionally independent given an image. Hence, a given decision for a particular pixel does not influence other pixels at all. In the corresponding BN in contrast, the additional user information can be taken into account as follows. Let us denote by the subset of output variables that are set by user, and vector of corresponding values. Let be the set of all other output variables, and the vector of their values. Instead to consider p(y|x) as at the usual inference, one can consider the posterior . In order to get the decision according to the maximum marginal strategy, we need marginal probability distributions for unknown output variables for , which are obtained by marginalization over all hidden z-s. Since we have a usual graphical model (although of higher order, see (6) in the main paper), we can estimate these probabilities. In particular, we sample configurations according to the posterior using Gibbs sampling. The results are presented in Fig. 4. We used an “erroneously” segmented image from the previous experiment with blur. Note, how greatly the user scribble influences the marginal distributions even in pixels that are quite far away from it.

designed for accessibility and to further open science