Probabilistic Discriminative Learning with Layered Graphical Models

2019·arXiv

Abstract

Probabilistic graphical models are traditionally known for their successes in generative modeling. In this work, we advocate layered graphical models (LGMs) for probabilistic discriminative learning. To this end, we design LGMs in close analogy to neural networks (NNs), that is, they have deep hierarchical structures and convolutional or local connections between layers. Equipped with tensorized truncated variational inference, our LGMs can be efficiently trained via backpropagation on mainstream deep learning frameworks such as PyTorch. To deal with continuous valued inputs, we use a simple yet effective soft-clamping strategy for efficient inference. Through extensive experiments on image classification over MNIST and FashionMNIST datasets, we demonstrate that LGMs are capable of achieving competitive results comparable to NNs of similar architectures, while preserving transparent probabilistic modeling.

1. Introduction

Probabilistic graphical models (Koller & Friedman, 2009) offer an expressive approach to represent and reason with probability distributions. They have been successfully applied in various generative tasks such as extracting biologically similar visual features (Lee et al., 2008) or inpainting occluded images (Mnih et al., 2011; Ping & Ihler, 2017), and are also widely used for structured prediction, such as semantic image segmentation (Krähenbühl & Koltun, 2011; Chen et al., 2018) in computer vision, or processing sequential data in domains like natural language processing (Manning & Schütze, 1999; Collobert et al., 2011) and signal processing (Chen, 2003).

Figure 1. Illustration of a layered graphical model consisting of five layers of nodes and four layerwise connections (two local and two dense connections). All connections are undirected.

1.1. Challenge

For classical discriminative problems such as image classification, transparent probabilistic modeling would, to a large extent, facilitate model interpretability and uncertainty estimation (Lipton, 2018). This goal, however, is challenging to achieve: state-of-the-art graphical model-based solutions are often limited to small-scale or binary image datasets (Mnih et al., 2011; Ping & Ihler, 2017), due to intractable inference on general loopy graphs (Koller & Friedman, 2009). Neural networks (NNs) (Goodfellow et al., 2016), though currently delivering the best performance (Krizhevsky et al., 2012; He et al., 2016), have no clear probabilistic interpretation, are hard to analyze, and are vulnerable to human-indiscernible adversarial attacks (Goodfellow et al., 2015).

This paper addresses this challenge by proposing a layered graphical model framework, equipped with efficient inference and training, for probabilistic discriminative learning.

1.2. Related work

Conditional restricted Boltzmann machines (CRBMs) (Mnih et al., 2011) extend restricted Boltzmann machines (RBMs) from generative to discriminative settings. Prior works (Mnih et al., 2011; Ping & Ihler, 2017) have shown that by using approximate inference (via sampling or variational inference), CRBMs are able to handle binary image classification problems with noisy or occluded inputs. In particular, Ping & Ihler (2017) demonstrate, with matricized operations, the effectiveness of loopy belief propagation (Pearl, 1988), which was previously deemed practical only for graphical models of moderate size (Mnih et al., 2011). However, the simplistic structure of the CRBM limits its usage for more complex modeling.

Using truncated Gibbs sampling, contrastive divergence (CD) (Hinton, 2012) is designed to train RBMs efficiently. It can be further extended to train deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov & Hinton, 2012) in a greedy, layerwise fashion. While these CD-trained models are effective for complex generative tasks (Hinton et al., 2006; Salakhutdinov & Hinton, 2012), the probabilistic meaning of the whole model is somewhat lost due to the greedy layerwise approximation. Also, CD has been observed to be less effective in training conditional models (Mnih et al., 2011; Ping & Ihler, 2017).

In terms of probabilistic modeling in deep learning, the variational autoencoder (Kingma & Welling, 2014) models its latent space with a mixture of Gaussians to generate data; Bayesian deep learning (Blundell et al., 2015; Gal & Ghahramani, 2016) introduces a prior over the weights and applies Bayesian reasoning to model uncertainty (Kendall & Gal, 2017). However, their probabilistic modeling is entangled with “black-box” NNs and the overall representations are not transparent.

As a side note, there also exist works that combine NNs and graphical models for structured prediction (Chen et al., 2018; Zheng et al., 2015; Huang et al., 2015). In particular, Zheng et al. (2015) unroll (truncated) mean-field inference and integrate it into the NN for end-to-end training.

1.3. Contributions

The contributions of our paper are summarized as follows:

• We propose layered graphical models (LGMs) with hierarchical structures and convolutional and local connections in close analogy to convolutional NNs.

• We integrate tensorized variational inference into MLE training of LGMs, and devise efficient training based on truncated inference and backpropagation. To deal with continuous inputs such as grayscale images, we use a “soft” clamping approach.

• Through extensive experiments we demonstrate that LGMs can achieve competitive results comparable to NN baselines on various image classification tasks, while preserving transparent probabilistic modeling.

2. Layered graphical models

To harness the power of hierarchical representation, in this work, we design a family of undirected graphical models with layered structure which we refer to as layered graphical models. The layered structure is attractive because: 1) It introduces a clear, compact and hierarchical representation of abstraction; 2) All associated computation can be performed in tensor form, which can be easily accelerated in modern computing systems.

2.1. General graphical representation

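The layered graphical model is a special instance of pairwise undirected graphical models where the nodes are arranged into layers. Denote the node and edge sets by V and E, respectively. We model the joint distribution of the random variables $x = (x_i)_{i \in V}$, where $x_i \in \mathcal{X}_i$ for each $i \in V$, as a Gibbs distribution given in energy form:

$$P(x) = \frac{1}{Z} \exp\Big(-\sum_{i \in V} \theta_i(x_i) - \sum_{(i,j) \in E} \theta_{ij}(x_i, x_j)\Big), \tag{1}$$

where Z is the partition function, and $\theta_i$ and $\theta_{ij}$ are the unary and pairwise energies, respectively.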

Furthermore, we enforce the following constraints:

1. Each layer is homogeneous, i.e., with all its nodes having the same set of labels;

2. There are no intra-layer connections.

Here we do not specify how the layers are connected to each other; some possible connection types are discussed in Section 2.2. Also, there is no constraint on the connection pattern of the hypergraph formed by the layers: they can be connected a priori into loops, cliques, hypercubes, etc., although in this work we mainly focus on structures with chain-like connection patterns.
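As a concrete illustration of the energy-form Gibbs distribution on a layered structure, the following minimal sketch enumerates all joint states of a toy model by brute force. The function name `gibbs_distribution`, the dictionary layout, and the energy values are assumptions made for this example only; enumeration is feasible only for very small models.

```python
import itertools
import math

def gibbs_distribution(unary, pairwise):
    """Brute-force P(x) proportional to exp(-sum of unary and pairwise energies).

    unary:    dict node -> list of energies, one per label
    pairwise: dict (i, j) -> table with table[x_i][x_j] the edge energy
    """
    nodes = sorted(unary)
    weights = {}
    for labels in itertools.product(*(range(len(unary[i])) for i in nodes)):
        x = dict(zip(nodes, labels))
        energy = sum(unary[i][x[i]] for i in nodes)
        energy += sum(table[x[i]][x[j]] for (i, j), table in pairwise.items())
        weights[labels] = math.exp(-energy)
    z = sum(weights.values())  # partition function
    return {labels: w / z for labels, w in weights.items()}

# Hypothetical toy model: layer A = nodes 0 and 1 (binary labels),
# layer B = node 2 (three labels). Edges only link different layers.
unary = {0: [0.0, 0.5], 1: [0.0, -0.5], 2: [0.0, 0.0, 0.0]}
pairwise = {
    (0, 2): [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]],
    (1, 2): [[0.0, 0.0, -1.0], [0.0, -1.0, 0.0]],
}
dist = gibbs_distribution(unary, pairwise)
```

Both constraints of this section are visible in the toy model: each layer uses a single label set, and no edge joins two nodes of the same layer.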

2.2. Layerwise connections

One direct way to connect two layers is to connect all possible combinations and form a dense connection, as shown in Figure 2 on the left. This ensures that all possible interactions are considered by the learning process.

Figure 2. Dense (left) and convolutional (right) connection. In the right subfigure, edges with the same color share the same weight.

On the other hand, sometimes structured sparse connections are preferred for structured data such as images, where local patterns are predominant. In this case, one may wish to connect only local patches between layers in an LGM, which yields the convolutional connection (cf. Figure 2 right) if we enforce shared weights. We will also consider the variant without weight-sharing, which we refer to as local connection. Again, like in convolutional neural networks, we can customize parameters like kernel size, stride, dilation, etc.
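To make the difference between the three connection types tangible, the sketch below counts raw pairwise energy entries for a single-channel 2D connection. It assumes no padding and no compact reparametrization, and the helper names are illustrative; the example sizes (28 by 28 binary input, kernel size 5, stride 2) are chosen to resemble the MNIST setting and are not taken from a specific table in the paper.

```python
def conv_output_size(size, kernel, stride):
    """Side length of the output layer for an unpadded 2D connection."""
    return (size - kernel) // stride + 1

def pairwise_energy_entries(kind, in_size, kernel, stride, in_labels, out_labels):
    """Number of raw pairwise energy entries for one layerwise connection."""
    out_size = conv_output_size(in_size, kernel, stride)
    table = in_labels * out_labels  # one energy per label pair on an edge
    if kind == "dense":   # every input node connects to every output node
        return (in_size ** 2) * (out_size ** 2) * table
    if kind == "local":   # one kernel-sized patch per output node, no sharing
        return (out_size ** 2) * (kernel ** 2) * table
    if kind == "conv":    # a single patch shared by all output nodes
        return (kernel ** 2) * table
    raise ValueError(f"unknown connection type: {kind}")

counts = {
    kind: pairwise_energy_entries(kind, 28, 5, 2, 2, 2)
    for kind in ("dense", "local", "conv")
}
```

Under these assumptions weight sharing reduces the parameter count by a factor equal to the number of output nodes, which is the usual argument for convolutional structure on image data.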

2.3. Connection to existing structures

Boltzmann machine. RBM, conditional RBM and DBM are all special cases of LGM whose nodes have binary labels. In particular, the LBP updates in Section 3.1 for LGM generalize the work of Ping & Ihler (2017) and operate on general multi-labeled layered structures. Compared to layered Boltzmann structures, higher-cardinality nodes allow for a natural representation of mutually exclusive cases, e.g., output classes for classification and discretized input.

Neural network. LGMs and neural networks have similar connection patterns. However, they represent different models in nature: neural networks are considered “black-box” universal function approximators, while LGMs offer transparent probabilistic modeling; accordingly, the inference process for neural networks is the feed-forward function evaluation, while LGMs require probabilistic or MAP inference procedures.

3. Inference and learning

In this section, we present in detail the efficient variational inference and learning methods for LGM.

3.1. Inference on LGM

3.1.1. VARIATIONAL INFERENCE METHODS

We begin with an overview of several variational inference methods, and customize them for LGM.

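Mean field (MF). The (naïve) mean field (Opper & Saad, 2001) approximates the joint distribution P by a simpler distribution Q consisting of the product of unary beliefs:

$$Q(x) = \prod_{i \in V} b_i(x_i). \tag{2}$$

By minimizing the KL-divergence $\mathrm{KL}(Q \,\|\, P)$, we obtain the following update formula:

$$b_i(x_i) \propto \exp\Big(-\theta_i(x_i) - \sum_{j \in N(i)} \sum_{x_j} b_j(x_j)\, \theta_{ij}(x_i, x_j)\Big), \tag{3}$$

where $\theta_i$ and $\theta_{ij}$ denote the unary and pairwise energies and $N(i)$ the set of neighbors of node $i$.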
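The mean-field update can be sketched in a few lines for a small pairwise model. This is a plain-Python illustration with parallel scheduling, not the tensorized implementation of the paper; the function name `mean_field` and the data layout (energies stored per node and per edge) are assumptions of this example.

```python
import math

def mean_field(unary, pairwise, iterations):
    """Parallel naive mean-field updates on a pairwise model.

    unary:    dict node -> list of unary energies
    pairwise: dict (a, b) -> table with table[x_a][x_b] the edge energy
    Returns a dict node -> list of belief values.
    """
    beliefs = {i: [1.0 / len(u)] * len(u) for i, u in unary.items()}
    for _ in range(iterations):
        updated = {}
        for i, u in unary.items():
            # Expected energy of each label of node i under neighbor beliefs.
            field = list(u)
            for (a, b), table in pairwise.items():
                for xi in range(len(u)):
                    if a == i:
                        field[xi] += sum(q * table[xi][xj]
                                         for xj, q in enumerate(beliefs[b]))
                    elif b == i:
                        field[xi] += sum(q * table[xj][xi]
                                         for xj, q in enumerate(beliefs[a]))
            weights = [math.exp(-f) for f in field]
            z = sum(weights)
            updated[i] = [w / z for w in weights]
        beliefs = updated  # parallel schedule: all nodes updated at once
    return beliefs

# With zero pairwise energies the update reduces to a softmax of the
# negative unary energies, which is the exact marginal in that case.
independent = mean_field({0: [0.0, 1.0], 1: [0.0, 0.0]},
                         {(0, 1): [[0.0, 0.0], [0.0, 0.0]]}, 2)
```

A sequential (layerwise) schedule would instead reuse freshly updated beliefs within the same sweep.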

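Loopy belief propagation (LBP). Loopy belief propagation (Pearl, 1988) generalizes belief propagation from tree-structured graphs to general graphs. Its updates are expressed as follows:

$$b_i(x_i) \propto \exp(-\theta_i(x_i)) \prod_{j \in N(i)} m_{j \to i}(x_i), \tag{4}$$

$$m_{i \to j}(x_j) \propto \sum_{x_i} \exp(-\theta_{ij}(x_i, x_j))\, \frac{b_i(x_i)}{m_{j \to i}(x_i)}, \tag{5}$$

where $b_i$ is the unary belief of node $i$, $m_{i \to j}$ the message from node $i$ to node $j$, $\theta_i$ and $\theta_{ij}$ the unary and pairwise energies, and $N(i)$ the set of neighbors of node $i$. Here we directly construct the node-to-node messages since we have a pairwise model. As a side note, the above sum-product updates perform probabilistic inference and can be easily modified to perform MAP inference (max-product) by changing the summation to maximization in Eq. (5).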
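A minimal sketch of sum-product message passing with node-to-node messages is given below, using a parallel schedule and the belief-divided-by-reverse-message form of the message update. The names `loopy_bp` and `exact_marginals` and the data layout are assumptions of this example; the brute-force helper is included only to check the result on a small tree, where belief propagation is exact.

```python
import math

def loopy_bp(unary, pairwise, iterations):
    """Parallel sum-product updates with node-to-node messages."""
    # Directed energy tables: energy[(s, d)][x_s][x_d].
    energy = {}
    for (i, j), table in pairwise.items():
        energy[(i, j)] = table
        energy[(j, i)] = [list(col) for col in zip(*table)]  # transpose
    msgs = {(s, d): [1.0] * len(unary[d]) for (s, d) in energy}

    def beliefs():
        out = {}
        for i, u in unary.items():
            b = [math.exp(-e) for e in u]
            for (s, d), m in msgs.items():
                if d == i:  # multiply in every incoming message
                    b = [bx * mx for bx, mx in zip(b, m)]
            z = sum(b)
            out[i] = [bx / z for bx in b]
        return out

    for _ in range(iterations):
        b = beliefs()
        new = {}
        for (s, d), table in energy.items():
            back = msgs[(d, s)]  # reverse message, divided out of the belief
            m = [sum(math.exp(-table[xs][xd]) * b[s][xs] / back[xs]
                     for xs in range(len(unary[s])))
                 for xd in range(len(unary[d]))]
            z = sum(m)
            new[(s, d)] = [mx / z for mx in m]
        msgs = new
    return beliefs()

def exact_marginals(unary, pairwise):
    """Brute-force marginals for binary nodes 0..n-1 (checking only)."""
    n = len(unary)
    states = [[(k >> i) & 1 for i in range(n)] for k in range(2 ** n)]
    w = [math.exp(-sum(unary[i][x[i]] for i in range(n))
                  - sum(t[x[i]][x[j]] for (i, j), t in pairwise.items()))
         for x in states]
    z = sum(w)
    return {i: [sum(wk for x, wk in zip(states, w) if x[i] == k) / z
                for k in (0, 1)] for i in range(n)}

# Hypothetical three-node chain (a tree), so LBP should match enumeration.
unary = {0: [0.0, 1.0], 1: [0.3, 0.0], 2: [0.0, -0.4]}
pairwise = {(0, 1): [[-1.0, 0.5], [0.5, -1.0]], (1, 2): [[0.2, -0.7], [-0.3, 0.1]]}
approx = loopy_bp(unary, pairwise, 3)
exact = exact_marginals(unary, pairwise)
```

Replacing the inner `sum` by `max` in the message computation would give the max-product variant mentioned above.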

Tree-reweighted message passing (TRW). Tree-reweighted message passing (Wainwright et al., 2005) approximates the original graph as a convex combination of its spanning trees. The update is similar to loopy belief propagation but involves defining a distribution over the set of spanning trees from which the edge appearance probabilities are deduced.

For LGMs having a tree-structured layerwise connection, we can choose an arbitrary layer as root and construct spanning trees as combinations of mapping connections between connected layers following the leaf-to-root direction, provided that the mappings to the root layer are surjective. For classification tasks, the output layer typically has only one node, therefore the mappings onto it are trivially surjective.

Once the edge appearance probabilities are determined, the update formulas for TRW are obtained by adding the corresponding edge appearance probability as a factor to each message in the belief update (Eq. (4)) and dividing the pairwise energy by it in the message update (Eq. (5)).

3.1.2. COMPACT PARAMETRIZATION IN LOG DOMAIN

The variational inference methods in Section 3.1.1 are all formulated in the domain of the exponential of negative energy, for ease of understanding. We found that in practice it is beneficial to reformulate these inference updates in the log domain, since this generally gives better numerical efficiency and stability.

Notably, the formulation in the log domain allows us to easily remove redundant parameters and achieve a compact representation: unary beliefs and messages over l labels can be represented as the softmax of l − 1 parameters along with a fixed 0 last term; unary and binary energies can be reparametrized so that one slice along each label direction is zeroed out. It turns out that all inference updates we considered can be formulated directly with this compact representation. Further details are provided in the Supplementary Manuscript.
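The compact representation of a belief or message over l labels can be sketched as follows. The function name `compact_softmax` is an assumption of this example; it takes l − 1 free log-domain parameters and appends the fixed zero term before normalizing.

```python
import math

def compact_softmax(params):
    """Belief over len(params) + 1 labels from a compact parametrization.

    The last logit is fixed to 0, which removes the redundant degree of
    freedom of an ordinary softmax (adding a constant to all logits).
    """
    logits = list(params) + [0.0]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

binary_belief = compact_softmax([0.0])        # one parameter, two labels
ternary_belief = compact_softmax([1.0, -2.0])  # two parameters, three labels
```

For binary nodes this reduces to a single parameter per belief, whose softmax is the logistic sigmoid.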

Also, we provide the tensorized implementations (including the pseudo-codes) of all aforementioned inference updates in the Supplementary Manuscript.

3.2. Learning with LGM

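The parameters of an LGM are learned by maximizing the likelihood of the training data. Specifically, the variable nodes of the LGM can be partitioned into three subsets: the input nodes v, the hidden nodes h and the output nodes y. For given data $\{(v^{(n)}, y^{(n)})\}_{n=1}^{N}$, we train the model parameters $w$ to minimize the negative log-likelihood (NLL):

$$w^* = \arg\min_{w} \; -\sum_{n=1}^{N} \log P(y^{(n)} \mid v^{(n)}; w), \quad \text{with} \quad P(y \mid v; w) = \sum_{h} P(y, h \mid v; w). \tag{6}$$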

Here the input nodes v are always observable and hence set as conditioned nodes, so that LGM does not need to model them. As we will see in Section 5.5, we can easily extend the learning framework to the case where v is partially observable, and LGM will infer the missing part as a byproduct of the learning process.

To find the optimal parameters efficiently on a large dataset, we can estimate the likelihood using probabilistic inference, then perform mini-batch gradient descent with the negative log-likelihood loss. The mini-batch gradient can be computed using backpropagation, which is supported by mainstream deep learning frameworks. The probabilistic inference part, however, requires more attention, as it is intractable in general for loopy graphical models such as LGM.

3.2.1. TRUNCATION OF ITERATIVE INFERENCE

We use the methods described in Section 3.1.1 to perform approximate inference efficiently. They are all local and iterative updates, and we schedule them to run in parallel at the global scale, or layerwise if sequential updates are desired. Approximate inference such as LBP has no convergence guarantee, but is observed to work well in practice (Murphy et al., 1999).

Furthermore, we truncate the inference procedure to a fixed number of iterations. The underlying rationale is that truncated iterative inference provides a sufficiently good approximation of the prediction provided that convergence takes place. In case of non-convergence, it instead provides a reasonable surrogate of the true prediction. The experiments in Section 5.2.1 are consistent with this reasoning.

As a remark, truncated iterative inference was previously studied by Domke (2011) using conditional random fields on a small dataset. His observations agree with ours, although his study did not involve stochastic gradients or more complex models with hierarchical structures.

3.2.2. TRAINING PROCESS OF LGM

Algorithm 1 summarizes the overall training process of LGM with T inference iterations and backpropagation.

3.2.3. REMARKS ON ANALYTICAL GRADIENT

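An alternative way of estimating the gradient of the log-likelihood is to use its analytical expression. Let $\theta_c(x_c)$ be the energy depending on the variables $x_c$ of factor c (either unary or pairwise in LGM); then the analytical gradient of the negative log-likelihood with respect to $\theta_c(\bar{x}_c)$ can be written (with Iverson bracket) as:

$$\frac{\partial \big[-\log P(y \mid v)\big]}{\partial\, \theta_c(\bar{x}_c)} = \mathbb{E}_{P(h \mid v, y)}\big[[x_c = \bar{x}_c]\big] - \mathbb{E}_{P(h, y \mid v)}\big[[x_c = \bar{x}_c]\big], \tag{7}$$

which takes the form of a difference of two expectations over an indicator function. Both expectations need to be estimated using probabilistic inference.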
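The "difference of two expectations" form can be checked numerically on a toy model small enough for exact enumeration. In the sketch below the model has one hidden node h and one output node y with energy `unary_h[h] + theta[h][y]` (the clamped input is assumed to be absorbed into `unary_h`); all names and values are hypothetical, and exact enumeration stands in for the approximate inference used in the paper.

```python
import math

def joint_weights(theta, unary_h):
    """Unnormalized weights exp(-energy) for every (h, y) pair."""
    return [[math.exp(-(unary_h[h] + theta[h][y]))
             for y in range(len(theta[h]))] for h in range(len(theta))]

def nll(theta, unary_h, target):
    """Negative log-likelihood of the target label, h marginalized out."""
    w = joint_weights(theta, unary_h)
    clamped = sum(row[target] for row in w)  # y fixed to the target
    free = sum(sum(row) for row in w)        # y free
    return math.log(free) - math.log(clamped)

def analytic_gradient(theta, unary_h, target):
    """Clamped expectation minus free expectation of the indicator."""
    w = joint_weights(theta, unary_h)
    clamped = sum(row[target] for row in w)
    free = sum(sum(row) for row in w)
    return [[(w[h][y] / clamped if y == target else 0.0) - w[h][y] / free
             for y in range(len(theta[h]))] for h in range(len(theta))]

def numeric_gradient(theta, unary_h, target, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grad = []
    for h in range(len(theta)):
        row = []
        for y in range(len(theta[h])):
            plus = [list(r) for r in theta]
            minus = [list(r) for r in theta]
            plus[h][y] += eps
            minus[h][y] -= eps
            row.append((nll(plus, unary_h, target)
                        - nll(minus, unary_h, target)) / (2 * eps))
        grad.append(row)
    return grad

theta = [[0.2, -0.5, 0.1], [-0.3, 0.4, 0.0]]  # hypothetical pairwise energies
unary_h = [0.0, 0.7]
g_analytic = analytic_gradient(theta, unary_h, target=1)
g_numeric = numeric_gradient(theta, unary_h, target=1)
```

With exact expectations the two gradients agree; the concern raised in this section arises when the expectations are replaced by truncated approximate inference.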

In principle, the analytical expression might offer better flexibility, since there is no need to build up the computation graph in the inference process as in the case of backpropagation. However, as shown in Section 5.2.3, the analytical gradient (evaluated using truncated variational inference) yields significantly inferior results on loopy graphs, compared to the backpropagation approach. We observe that the analytical gradient estimate is prone to inaccuracy from the approximate inference used for Eq. (7). In contrast, the backpropagation approach is always faithful to the truncated iterative inference, even if that inference is inaccurate.

4. Input modeling and soft clamping

Continuous values such as grayscale intensity are commonly modeled in graphical models with their discretized representation. This is problematic since: 1) the high cardinality (e.g., 256 for grayscale pixels) results in high computation cost and over-parametrization; 2) the natural ordering of the input is not preserved. Previous works on conditional RBMs (Mnih et al., 2011; Ping & Ihler, 2017) avoid this problem by using binarized inputs. While the first problem can be alleviated in some cases with a coarser quantization (see Section 4.3), a better input modeling is needed to properly tackle these problems.

In this section, we introduce soft clamping as a simple yet effective way to model continuous inputs, which enables efficient inference procedure.

4.1. A soft clamped representation

Soft clamping is based on a simple observation that any ranged continuous value (e.g., grayscale intensity) can be regarded as a probabilistic state between its two endpoints (e.g., black and white for image intensity). This allows us to model a continuous value using just one binary node. Accordingly, instead of considering an observation as a “hard clamping” of a node V to a certain state, we “soft clamp” the mean of the Bernoulli distribution of the binary node to the observed value.

Figure 3. A comparative illustration of discretized labeling (left) and soft clamping (right).

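Thus, using soft clamping, we keep the original information and the ordering of the continuous observation, while reducing the label cardinality of the node to binary. Our objective function, however, changes to minimizing the expected negative log-likelihood:

$$\mathbb{E}_{v \sim \bar{v}}\big[-\log P(y \mid v)\big], \tag{8}$$

where each binary input node $v_i$ follows a Bernoulli distribution whose mean is the observed value $\bar{v}_i$.

4.2. Approximation for efficient inference

For each combination of hard clamped input $v$, it is possible to estimate the negative log-likelihood $-\log P(y \mid v)$ using an inference procedure. However, naively evaluating Eq. (8) would require performing an exponential number of inferences, one for every possible combination of $v$, which is simply intractable.

Instead, we remark that, with $\Theta(v, h, y)$ denoting the total energy of the model:

$$\mathbb{E}_{v}\big[-\log P(y \mid v)\big] = \mathbb{E}_{v}\Big[\log \sum_{h, y'} \exp\big(-\Theta(v, h, y')\big) - \log \sum_{h} \exp\big(-\Theta(v, h, y)\big)\Big], \tag{9}$$

which is an expectation of the difference of two convex “log-sum-exp” terms.

In soft clamping, we first compute the expected energy:

$$\bar{\Theta}(h, y) = \mathbb{E}_{v}\big[\Theta(v, h, y)\big], \tag{10}$$

and then approximate Eq. (9) by:

$$\log \sum_{h, y'} \exp\big(-\bar{\Theta}(h, y')\big) - \log \sum_{h} \exp\big(-\bar{\Theta}(h, y)\big), \tag{11}$$

where both terms above are lower approximations, due to Jensen’s inequality, of the respective terms in Eq. (9).

It turns out that Eq. (11) is exactly the negative log-likelihood of the distribution with energy $\bar{\Theta}$. It can then be computed efficiently with only one inference after computing $\bar{\Theta}$, which in practice simply requires a tensor product between the observations and the binary energies of their connections (instead of slicing as in the hard clamping case).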
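The relationship between the expected negative log-likelihood and its soft-clamped approximation can be illustrated on a toy model. The sketch below assumes two binary input nodes connected directly to one three-label output node with no hidden layer, so that enumeration is trivial; all names and energy values are hypothetical. In this special case the clamped term is linear in the energy, so only the free log-sum-exp term is affected by Jensen's inequality and the soft-clamped value is a lower bound of the expected one. With hidden nodes both terms are lower approximations and no such ordering is guaranteed.

```python
import itertools
import math

def log_sum_exp(values):
    shift = max(values)
    return shift + math.log(sum(math.exp(v - shift) for v in values))

def energy(v, y, pair, bias):
    """Total energy for hard-clamped binary inputs v and output label y."""
    return bias[y] + sum(pair[i][vi][y] for i, vi in enumerate(v))

def nll_hard(v, y, pair, bias):
    free = log_sum_exp([-energy(v, k, pair, bias) for k in range(len(bias))])
    return energy(v, y, pair, bias) + free

def expected_nll(means, y, pair, bias):
    """Exact expectation over all 2^n hard clampings (exponential cost)."""
    total = 0.0
    for v in itertools.product((0, 1), repeat=len(means)):
        prob = math.prod(m if vi == 1 else 1.0 - m for m, vi in zip(means, v))
        total += prob * nll_hard(v, y, pair, bias)
    return total

def nll_soft(means, y, pair, bias):
    """One inference on the expected energy (soft clamping)."""
    def expected_energy(k):
        # Weighted combination of the two energy slices of each input node.
        return bias[k] + sum((1.0 - m) * pair[i][0][k] + m * pair[i][1][k]
                             for i, m in enumerate(means))
    free = log_sum_exp([-expected_energy(k) for k in range(len(bias))])
    return expected_energy(y) + free

# pair[i][v_i][y]: hypothetical input-output energies; bias[y]: unary energies.
pair = [[[0.0, 0.5, -0.5], [-1.0, 0.2, 0.3]],
        [[0.1, 0.0, -0.2], [0.4, -0.6, 0.0]]]
bias = [0.0, 0.0, 0.0]
soft = nll_soft([0.3, 0.8], 1, pair, bias)
exact = expected_nll([0.3, 0.8], 1, pair, bias)
```

When every observed value is exactly 0 or 1, the expected energy coincides with the hard-clamped energy and the two objectives agree.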

4.3. Remarks on coarse quantization

A simple “hard clamping” alternative to reduce the complexity of the original discretization is to use a coarser quantization. This makes sense when the original precision is not critical for the task. As illustrated by the example of FashionMNIST below, the grayscale image classification problem fulfills this criterion:

Figure 4. (Re)quantization of FashionMNIST data samples to 2, 4 and 256 (original) colors.

In Figure 4, we see that quantization to 4 colors can already give us sufficient information to recognize the shown objects. This indicates that there can be a trade-off between the input precision and the over-parametrization of learning model.
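A coarse requantization of 8-bit intensities can be sketched as below. Uniform binning is an assumption of this example (the text does not state the exact quantization rule), and `requantize` is a hypothetical helper name.

```python
def requantize(pixels, levels):
    """Map 8-bit intensities (0..255) to `levels` equally sized bins.

    Returns the bin index of each pixel, which can serve as the discrete
    label of a hard-clamped input node with `levels` labels.
    """
    return [p * levels // 256 for p in pixels]

samples = [0, 127, 128, 255]
two_colors = requantize(samples, 2)
four_colors = requantize(samples, 4)
```

With 256 levels the mapping is the identity, which corresponds to the original discretization.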

5. Experiments

In this section, we present and analyze the experimental results for LGM on several image classification problems.

5.1. General settings

We conducted a series of experiments on image classification using LGM to examine its properties. We tested on the MNIST (Lecun et al., 1998) and the FashionMNIST (Xiao et al., 2017) datasets, both containing grayscale images, 60,000 training samples, 10,000 test samples and 10 balanced classes. MNIST consists of black-and-white hand-written digits where grayscale is only used for antialiasing, while FashionMNIST contains images from online shopping catalogs in true grayscale. For both datasets, we further split the training samples into 48,000 images (80%) for training and 12,000 (20%) for validation.

In our implementation we use PyTorch (Paszke et al., 2017) for GPU acceleration and auto-differentiation. For weight updates we use the Adam optimizer (Kingma & Ba, 2014) with default settings. The batch size is set to 20 and training is stopped when the validation loss ceases to decrease.

5.2. Introspective tests

We start with a series of experiments on MNIST to test out several aspects of LGM, namely the effect of truncated inference, soft clamping and the comparison with estimated analytical gradient learning method.

5.2.1. INFLUENCE OF TRUNCATION

First of all, we study the behavior of truncated inference using a sequential LGM with dense connections and a certain number of binary hidden layers with 100 nodes in between, as shown in Figure 5. The inputs are thresholded to binary.

Figure 5. Structure of sequential model with dense connections. The number of nodes and labels are indicated under each layer.

Table 1 summarizes the results of LGMs with varying depth and truncation. Each test was run with sequential LBP, and the same truncation was used for training and testing:

Table 1. Test accuracy of sequential models with 0–4 hidden layers and 1–5 inference iterations.

Here one can observe a clear separation of success and failure (in gray) cases, indicating that the number of inference iterations should be no smaller than the depth of the model (i.e. the distance between the input and the output layer). If this is fulfilled, we can see that the truncation works quite well, and more iterations do not necessarily lead to better classification results.
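The depth condition can be illustrated on a toy chain. The following sketch assumes a parallel message schedule (the experiments in Table 1 use sequential LBP), binary nodes with zero unary energies, and a hypothetical `coupling` strength that favors equal neighboring labels; `output_belief` is an illustrative helper, not part of the authors' implementation. Only forward messages are tracked, since on a chain the output belief depends on them alone.

```python
import math

def output_belief(input_label, depth, iterations, coupling=1.0):
    """Belief of the last node of a chain x_0 - ... - x_depth, x_0 clamped.

    Messages start uniform and are all updated in parallel, so information
    from the clamped input travels one edge per iteration.
    """
    # fwd[k] is the message from node k to node k + 1.
    fwd = [[0.5, 0.5] for _ in range(depth)]
    # Hard clamping puts all mass of node 0 on the observed label.
    clamp = [1.0 if x == input_label else 0.0 for x in (0, 1)]
    # Pairwise factor exp(-energy): larger when both labels agree.
    factor = [[math.exp(coupling if a == b else -coupling) for b in (0, 1)]
              for a in (0, 1)]
    for _ in range(iterations):
        new = []
        for k in range(depth):
            incoming = clamp if k == 0 else fwd[k - 1]  # previous iteration
            m = [sum(factor[a][b] * incoming[a] for a in (0, 1))
                 for b in (0, 1)]
            z = sum(m)
            new.append([value / z for value in m])
        fwd = new
    return fwd[depth - 1]

too_short = output_belief(input_label=0, depth=3, iterations=2)
long_enough = output_belief(input_label=0, depth=3, iterations=3)
```

With fewer iterations than the input-to-output distance, the output belief stays uniform regardless of the input, so no learning signal can relate the two; once the iteration count reaches the depth, the belief depends on the clamped label.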

Some further experiments indicate that using either longer or shorter iteration at test time seems to deteriorate the results in general. It is thus better to use exactly matched truncation during training and evaluation.

5.2.2. SOFT CLAMPING FOR MNIST

Also, we analyze the effect of using soft clamping for MNIST input. Table 2 provides the comparison results:

Table 2. Comparison of soft clamping against binary thresholding for MNIST: test accuracy of LGM with 1 hidden layer and 5 inference iterations.

We see that even with the nearly binary MNIST dataset, soft clamping gives a visible boost to the result. Experiments in Section 5.4 further show the advantage of soft clamping for true grayscale data.

We will use soft clamping for LGMs in later experiments, unless stated otherwise.

5.2.3. ANALYTIC GRADIENT

Additionally, we tested the approach of analytical gradient approximation. We compared 1) the case (Exact) of the sequential structure in Section 5.2.1 with no hidden layer so that the inference is exact; and 2) the case (Dense) with one hidden layer where the inference is approximative.

Table 3. Performance of analytic gradient in Exact and Dense cases (with number of inference iterations).

The tests in Table 3 were run with sequential LBP and show comparable results for exact inference. However, compared to the second column of Table 1, we see a significant performance drop for approximate inference; performance deterioration can also be observed with other inference methods.

Also, we see a limited improvement for analytic gradient estimation with increased inference iterations; however, it cannot fully compensate for the performance drop caused by the inaccuracy of the approximate inference used for Eq. (7).

5.3. Comparison of approximate inference methods

We then analyze the performance of the variational inference methods presented in Section 3.1 (i.e. MF, LBP, TRW) with parallel and sequential scheduling (with prefix “Par” and “Seq”, respectively), as well as the effect of different connections seen in Section 2.2: we use the sequential model as in Section 5.2.1 with one hidden layer (Dense) for pure dense connection and the structure presented in Figure 6 for convolutional (Conv) or local (Local) connections.

Figure 6. Structure of LGM with two 2D conv/local connections of kernel size 5 and stride 2. Node shapes/sizes and label sizes are indicated for each layer.

As baselines we used neural networks with sigmoid/rectified linear unit (ReLU) activation function and similar architecture to the corresponding LGMs. We used 5 inference iterations for all LGM models.

For comparison, we also included the results for contrastive divergence (CD) with 1 and 10 Gibbs sampling steps. In this case, we pretrained all connections except the last one with greedy layerwise CD algorithm and then trained the last layer with softmax non-linearity as classifier. The results for local structure are not provided due to the lack of existing implementation for transposed local operation.

Table 4. Comparison of inference algorithms and neural network baselines with “Dense”, “Conv” and “Local” structures.

Table 4 reports the comparison. We observe that

• Overall, convolutional and local connections result in better performance, both for neural networks and LGMs;

• The variational inference approaches clearly outperform layerwise contrastive divergence on all architectures, while tree-reweighted message passing yields the best results for variational inference;

• Compared to neural networks, LGMs achieve comparable results with convolutional/local connections but worse results with only dense connections.

5.4. Dealing with grayscale images

Furthermore, we test our system for grayscale image classification using FashionMNIST.

5.4.1. SOFT CLAMPING VS. REQUANTIZATION

Recall from Section 4 that we have discussed two methods for efficiently dealing with grayscale images: soft clamping and coarse quantization. We will validate our intuition that a coarse quantization is sufficient for the classification task, as well as compare the performance of these two approaches.

An additional issue for requantization is how to extend the hidden part to accommodate the larger input label space. To achieve this, we can increase either the number of nodes in the hidden layers or the number of labels. To compare these two approaches, we used a dense sequential model with one hidden layer. For input with $2^n$ colors:

• The “N” approach increases the size of the hidden layer while keeping the binary labeling;

• The “L” approach extends the label size to n + 1 while keeping the layer size at 100.

Table 5 shows the results with sequential LBP. We reduced the batch size to 4 for the tests with 32 colors and to 1 for 256 colors in order to limit the runtime memory load. We also consider the no-scaling baseline (“F”).

Table 5. Comparison of soft clamping and quantization to 2, 4, 8, 32, 256 colors with different scaling strategies: “F” fixes the hidden layer to be binary of size 100, “N” extends its size, and “L” extends its cardinality to n + 1.

We conclude from Table 5 that finer quantization does not necessarily improve the results for LGMs. Instead, it seems to cause over-fitting. Also, the soft clamping approach outperforms requantization by a considerable margin.

5.4.2. RESULTS ON FASHIONMNIST

Considering the previous results, we now perform a comparison of the two approaches using convolutional/local connections with neural network baselines. For soft clamping and neural network baselines, we reuse the structure shown in Figure 6, while for requantization we set the input to have 4 colors, and extend the first hidden layer to 3 labels. Sequential TRW is used for all LGM structures.

Table 6. Comparison between message passing algorithms and neural network baselines.

Table 6 lists the results. Again, the soft clamping approach outperforms the requantization approach. And we see that with soft clamping LGM is able to attain accuracy comparable to neural networks with similar architectures.

5.5. Classification of partially observable inputs

Finally, we experiment on MNIST and FashionMNIST with partially observable inputs: in this case, each input pixel has a certain probability to be observed. To handle this, LGM models the input explicitly by taking the unobserved pixels as unclamped nodes in the inference process. For the neural network baselines, we heuristically fill up the missing pixels with gray value 0.5.
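The input preparation for this experiment can be sketched as follows. The helper names are hypothetical, and `None` is used here as an illustrative marker for an unobserved pixel: an LGM would leave such nodes unclamped during inference, while the neural network baseline receives the gray value 0.5 instead.

```python
import random

def mask_pixels(pixels, visible_prob, rng):
    """Keep each pixel independently with probability `visible_prob`.

    Unobserved pixels are marked with None.
    """
    return [p if rng.random() < visible_prob else None for p in pixels]

def fill_for_baseline(masked, fill=0.5):
    """Heuristic fill-in of missing pixels for the neural network baseline."""
    return [fill if p is None else p for p in masked]

rng = random.Random(0)  # fixed seed so the example is reproducible
image = [0.0, 0.2, 0.9, 1.0]  # toy intensities in [0, 1]
masked = mask_pixels(image, 0.7, rng)
baseline_input = fill_for_baseline(masked)
```

Note that the fill value 0.5 is indistinguishable from a genuinely mid-gray observation, which is one reason explicit modeling of unobserved nodes can be preferable.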

Also, to account for the uncertainty of the output caused by unreliable input, we apply smoothing to the ground-truth labels: instead of the one-hot representation, we set the probability of correct label to for the rest. We fix and observe consistent improvement over all methods in our experiments.

Table 7. Accuracies on MNIST with partially observed input (0.3 and 0.7 visible) using LGM and neural network baselines.

Table 8. Accuracies on FashionMNIST with partially observed input (0.3 and 0.7 visible) using LGM and neural network baselines.

Tables 7 and 8 summarize the results for experiments with partial input. Compared to neural network baselines, LGM yields slightly sub-optimal performance in some cases, however the results are still comparable in general.

Interestingly, the probabilistic modeling of LGM provides additional insights. For example, we obtain the beliefs of the missing input pixels as an outcome of probabilistic inference; see visualizations in Figures 7 and 8. Figure 8 also shows that the end-to-end probabilistic modeling of LGM is able to correctly handle ambiguous inputs.

Figure 7. Samples from Local LGM for FashionMNIST with 30% visible input. The silhouettes become clearer with filled beliefs.

Figure 8. An ambiguous sample from MNIST with 70% visible input, its belief-filling with Local LGM, and its probabilistic prediction: the model shows uncertainty between 4 and 9.

6. Conclusion

We propose the layered graphical model framework for efficient probabilistic discriminative learning. Combining

• layered architecture,

• local or convolutional connections,

• truncated variational inference with backpropagation,

• soft clamping,

our layered graphical models are able to go beyond the existing application range of probabilistic graphical models. As shown in Sections 5.3 and 5.4, they achieve performance comparable to that of neural networks with similar architectures on grayscale image classification.

Compared to neural networks, layered graphical models additionally provide a transparent probabilistic representation, which, as indicated by Section 5.5, allows for natural modeling of uncertain inputs and inference of missing information.

We expect our work to open new opportunities for graphical models in uncertainty modeling and interpretable learning.

References

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 1613–1622. JMLR, 2015.

Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.

Chen, Z. Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond. Technical report, McMaster University, 2003.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12: 2493–2537, November 2011.

Domke, J. Parameter learning with truncated message-passing. In CVPR 2011, pp. 2937–2943, June 2011.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML’16, pp. 1050–1059. JMLR, 2016.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. stat, 1050:20, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778, 2016.

Hinton, G. E. A practical guide to training restricted Boltzmann machines. In Neural networks: Tricks of the trade, pp. 599–619. Springer, Berlin Heidelberg, 2012.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput., 18(7): 1527–1554, July 2006.

Huang, Z., Xu, W., and Yu, K. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991, 2015.

Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In ICLR, 2014.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

Krähenbühl, P. and Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 109–117, 2011.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Proceedings of Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 2012.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.

Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area v2. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T. (eds.), Advances in Neural Information Processing Systems 20, pp. 873–880. Curran Associates, Inc., 2008.

Lipton, Z. C. The mythos of model interpretability. Commun. ACM, 61(10):36–43, 2018.

Manning, C. D. and Schütze, H. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-13360-1.

Mnih, V., Larochelle, H., and Hinton, G. E. Conditional restricted boltzmann machines for structured output prediction. In UAI 2011, Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, Barcelona, Spain, July 14-17, 2011, pp. 514–522, 2011.

Murphy, K. P., Weiss, Y., and Jordan, M. I. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI’99, pp. 467–475, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

Opper, M. and Saad, D. Advanced mean field methods: Theory and practice. MIT press, 2001.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W, 2017.

Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

Ping, W. and Ihler, A. T. Belief propagation in conditional rbms for structured prediction. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pp. 1141–1149, 2017.

Salakhutdinov, R. and Hinton, G. An efficient learning procedure for deep boltzmann machines. Neural Comput., 24(8):1967–2006, August 2012.

Wainwright, M. J., Jaakkola, T., and Willsky, A. S. Map estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. H. S. Conditional random fields as recurrent neural networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1529–1537, 2015.

Supplementary Manuscript: Probabilistic Discriminative Learning with Layered Graphical Models

Here we provide additional details for the practical implementation of LGM. The demo code is provided at https://github.com/tum-vision/lgm.

A. Parametrization of LGM

In this section, we specify the way layers and connections are parametrized in our implementation.

A.1. Compact parametrization in log domain

To represent the parameters and the intermediate states of LGM in a compact and non-redundant way, we make use of the following compact representation in logarithmic domain (denotes the label set of node i):

Energies The unary and pairwise energies are the parameters of LGM which determine the joint distribution. The energy terms are defined up to a constant, and we can reparametrize them to satisfy the following constraints:

Beliefs, messages The inference methods that we consider in this work are all iterative, and intermediate states such as unary beliefs and (normalized) messages are all probability-like. For example, the unary belief has the following constraints:

We can thus define the following log-form:

so that we have

The log-domain beliefs are thus unconstrained and match the intrinsic degrees of freedom in the original parameterization. The same holds for normalized messages.

Note that the choice of always “zeroing-out” the first label is merely for notational convenience; the label choice can be arbitrary for each node.
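The compact log-form can be illustrated with a small sketch. It assumes a strictly positive probability vector and takes the first label as the zeroed-out reference, as in the text; the function names `to_log_form` and `from_log_form` are hypothetical and not taken from the released code.

```python
import math

def to_log_form(belief):
    """Compact log-form of a strictly positive probability vector.

    The first label is the reference: its log-value is fixed to 0 and
    is not stored, so K probabilities become K - 1 free numbers.
    """
    ref = math.log(belief[0])
    return [math.log(b) - ref for b in belief[1:]]

def from_log_form(log_belief):
    """Recover the full probability vector from the compact log-form.

    The implicit first label contributes exp(0) to the normalizer.
    """
    m = max([0.0] + list(log_belief))  # shift for numerical stability
    weights = [math.exp(-m)] + [math.exp(v - m) for v in log_belief]
    z = sum(weights)
    return [w / z for w in weights]

belief = [0.2, 0.5, 0.3]
compact = to_log_form(belief)      # two unconstrained real numbers
restored = from_log_form(compact)  # sums to one again
```

Any real-valued vector of length K - 1 is a valid compact log-form, which is why these quantities can be updated without explicit normalization constraints.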

A.2. Parametrization for dense connection

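For two layers p and q, each with its own number of nodes and labels per node, the full (dense) connection introduces one edge for every pair of nodes taken from the two layers. Using the minimal representation described in Section A.1, the unary energies in, e.g., layer p are stored in a tensor indexed by node and by explicit (non-zeroed) label, and the pairwise energies between p and q in a tensor indexed by the nodes and explicit labels of both layers.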
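As a rough illustration of the dense case, the sketch below counts the edges and lists one possible tensor layout under the minimal representation. The symbols `n_p`, `k_p`, `n_q`, `k_q` (number of nodes and of labels per node in layers p and q), the function name, and the axis order are assumptions made for this example; they are not the notation of the original implementation.

```python
def dense_connection_shapes(n_p, k_p, n_q, k_q):
    """Edge count and one possible tensor layout for a dense connection.

    Under the minimal representation, each node stores k - 1 unary
    energies and each edge stores (k_p - 1) * (k_q - 1) pairwise
    energies, because entries involving the zeroed-out label are fixed.
    """
    return {
        "edges": n_p * n_q,                           # every node pair is linked
        "unary_p": (n_p, k_p - 1),                    # assumed axis order
        "unary_q": (n_q, k_q - 1),
        "pairwise_pq": (n_p, k_p - 1, n_q, k_q - 1),  # assumed axis order
    }

shapes = dense_connection_shapes(n_p=4, k_p=3, n_q=2, k_q=5)
```

With binary nodes (two labels each), the label axes have size one, so the pairwise tensor reduces to a single weight per edge.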

A.3. Parametrization for local and convolutional connections

Similar to a d-dimensional convolutional layer in a neural network, each layer p has several channels of nodes arranged on a d-dimensional grid, which fixes the shape of its unary energy tensor in the minimal representation. A kernel size (i.e., patch size) is defined between layers p and q, which fixes the shape of the pairwise energy tensor for a local connection. A convolutional connection is a local connection with energies shared across all patches, hence yielding a pairwise energy tensor that no longer depends on the patch position.

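A.4. Equivalence between the two directions of a connection

With the above tensor representation, the pairwise energy tensors of a connection viewed from layer p to layer q and from layer q to layer p typically have different shapes. Nevertheless, they represent the same set of parameters arranged in different orders, and we can define a “flip” operation to transform between these two shapes.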
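For a dense connection, such a reordering can be sketched as a plain axis permutation. The example assumes the pairwise energies are stored as a nested list indexed `[node in p][label in p][node in q][label in q]`; this layout and the name `flip_dense` are illustrative assumptions, and the local and convolutional cases would need additional index bookkeeping.

```python
def flip_dense(theta_pq):
    """Reorder theta_pq[i][a][j][b] into theta_qp[j][b][i][a].

    The entries are unchanged; only their arrangement differs.
    """
    n_p, k_p = len(theta_pq), len(theta_pq[0])
    n_q, k_q = len(theta_pq[0][0]), len(theta_pq[0][0][0])
    return [[[[theta_pq[i][a][j][b] for a in range(k_p)]
              for i in range(n_p)]
             for b in range(k_q)]
            for j in range(n_q)]

# Small example: 2 nodes with 3 explicit labels in p, 4 nodes with 1 in q.
theta_pq = [[[[1000 * i + 100 * a + 10 * j + b for b in range(1)]
              for j in range(4)]
             for a in range(3)]
            for i in range(2)]
theta_qp = flip_dense(theta_pq)
```

Applying the operation twice returns the original arrangement, which reflects that both tensors hold the same parameters.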

B. Efficient variational inference

In Section 3.1, we reviewed several variational inference methods and their iterative updates. Here we provide explicitly the updates with compact parametrization.

B.1. Inference updates with compact parametrization

Mean field (MF) For mean field we have
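A minimal sketch of what a mean-field update looks like in the compact parametrization is given below. It assumes the Gibbs convention p(x) ∝ exp(-E(x)) and that all energy entries involving the first label are zeroed out (Section A.1), so only explicit labels are stored. The function names and data layout are hypothetical and do not reproduce the original equation.

```python
import math

def softmax_implicit(log_belief):
    """Probabilities of the explicit labels; the implicit first label
    has log-value 0 and only enters through the normalizer."""
    m = max([0.0] + list(log_belief))
    w = [math.exp(v - m) for v in log_belief]
    z = math.exp(-m) + sum(w)
    return [x / z for x in w]

def mean_field_update(theta_i, neighbors):
    """New compact log-belief of node i.

    theta_i:   unary energies of node i for its explicit labels.
    neighbors: list of (theta_ij, log_belief_j), where theta_ij[a][b]
               is the pairwise energy for explicit labels a of node i
               and b of node j.
    """
    new = [-t for t in theta_i]
    for theta_ij, log_belief_j in neighbors:
        b_j = softmax_implicit(log_belief_j)
        for a in range(len(new)):
            # Expected pairwise energy under the neighbor's belief; the
            # implicit label of j contributes zero energy.
            new[a] -= sum(theta_ij[a][b] * b_j[b] for b in range(len(b_j)))
    return new

updated = mean_field_update(
    theta_i=[1.0, -0.5],
    neighbors=[([[0.5, -1.0], [2.0, 0.0]], [0.0, 0.0])],
)
```

Because the reference label has zero energy everywhere, the update needs no explicit normalization: the result is already a valid compact log-belief.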

Loopy belief propagation (LBP) The loopy belief propagation updates become the following:
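The following sketch shows one way the sum-product updates can be written in the compact log-domain. It assumes p(x) ∝ exp(-E(x)), energies zeroed out on the first label (Section A.1), and messages normalized so that their first entry has log-value 0. All names are hypothetical, and the code illustrates the idea rather than reproducing the original equations.

```python
import math

def logsumexp_implicit(values):
    """logSumExp over the explicit labels plus the implicit label,
    whose log-value is 0."""
    m = max([0.0] + list(values))
    return m + math.log(math.exp(-m) + sum(math.exp(v - m) for v in values))

def lbp_message(theta_ij, log_belief_j, log_msg_i_to_j):
    """Compact log-message from node j to node i.

    theta_ij[a][b]: pairwise energy for explicit labels a (node i)
    and b (node j).
    """
    # Belief of j with the contribution of i divided out.
    cavity = [b - m for b, m in zip(log_belief_j, log_msg_i_to_j)]
    norm = logsumexp_implicit(cavity)  # value for the implicit label of i
    return [logsumexp_implicit([c - t for t, c in zip(row, cavity)]) - norm
            for row in theta_ij]

def lbp_belief(theta_i, incoming):
    """Compact log-belief of node i from its incoming log-messages."""
    return [-t + sum(msg[a] for msg in incoming)
            for a, t in enumerate(theta_i)]

# Two binary nodes joined by one edge: a single pass is exact on a tree.
u, v, w = 0.3, -0.7, 1.2
belief_j = lbp_belief([v], [[0.0]])           # uniform initial message
message = lbp_message([[w]], belief_j, [0.0])
belief_i = lbp_belief([u], [message])
```

In this two-node case the resulting log-belief equals the exact log-odds of the second label of node i, which gives a quick sanity check of the update.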

Tree-reweighted message passing (TRW) With as defined in Section 3.1.1, the update for tree-reweighted message passing becomes:

These updates closely resemble Eqs. (19) and (20), so we omit the pseudo-code for TRW in Section B.2.

B.2. Implementation of efficient updates for LGM

Based on Eqs. (18), (19), and (20), we implement the inference updates in tensor form using the minimal representation. We keep a unary belief tensor for each layer p and an incoming message tensor for each connection between layers p and q. Both take the (log-form) minimal representation. The belief tensor has the same shape as the unary energy tensor of layer p. The message tensor has a shape similar to that of the pairwise energy tensor, except that it does not have the label dimension of layer p, nor does it have shared weights. We also define the following tensor operations:

• reshape adds necessary broadcastable dimensions to inputs so that the corresponding dimensions are aligned for elementwise operations;

• flip transforms the input to its alternative representation, as described in Section A.4;

• sumSource sums over the dimensions related to the source layer.

The inference update step for mean field with LGM is shown in Algorithm 2. Its notation comprises an elementwise product between tensors, an operation over the label dimension that takes the implicit label into account, and a term denoting the contribution to the update from layer q.

Algorithms 3 and 4 show the inference update steps for loopy belief propagation on LGM. They use the logSumExp operator over the latter label dimension, again taking the implicit label into account. Without showing further details, we remark that LBP can be turned into MAP inference by replacing logSumExp with max in Algorithm 3. With minor changes as discussed in Section 3.1.1, LBP can also be adapted for tree-reweighted message passing.
