b

DiscoverSearch
About
My stuff
Correlated Equilibria for Approximate Variational Inference in MRFs
2016·arXiv
Abstract
Abstract

Almost all of the work in graphical models for game theory has mirrored previous work in probabilistic graphical models. Our work considers the opposite direction: Taking advantage of recent advances in equilibrium computation for probabilistic inference. In particular, we present formulations of inference problems in Markov random fields (MRFs) as computation of equilibria in a certain class of game-theoretic graphical models. While some previous work explores this direction, none of that work concretely establishes the precise connection between variational probabilistic inference in MRFs and correlated equilibria. There is no work that exploits recent theoretical and empirical results from the literature on algorithmic and computational game theory on the tractable, polynomial-time computation of exact or approximate correlated equilibria in graphical games with arbitrary, loopy graph structure. Our work discusses how to design new algorithms with equally tractable guarantees for the computation of approximate variational inference in MRFs. In addition, inspired by a previously stated game-theoretic view of state-of-the-art tree-reweighed (TRW) message-passing techniques for belief inference as zero-sum game, we propose a different, general-sum potential game to design approximate fictitious-play techniques. We perform synthetic experiments evaluating our proposed approximation algorithms with standard methods and TRW on several classes of classical Ising models (i.e., with binary random variables). We also evaluate the algorithms using Ising models learned from the MNIST dataset. Our experiments show that our global approach is competitive, particularly shinning in a class of Ising models with constant, “highly attractive” edge-weights, in which

it is often better than all other alternatives we evaluated. With a notable exception, our more local approach was not as effective as our global approach or TRW. Yet, in fairness, almost all of the alternatives are often no better than a simple baseline: estimate the marginal probability to be 0.5.

Almost all of the work in graphical games has borrowed heavily from analogies to probabilistic graphical models. Yet, over-reliance on those analogies and previous standard approaches to exact inference might have led that approach to face the same computational roadblocks that plagued most exact-inference techniques.

As an example of work that heavily exploits previous work in probablistic graphical models (PGMs), Kakade et al. [2003] designed polynomial-time algorithms based on linear programming for computing correlated equilibria (CE) in standard graphical games with tree graphs. The approach and polynomial-time results extend to graphical games with bounded-tree-width graphs and graphical polymatrix games with tree graphs. Exact inference is tractable in PGMs whose graphs have bounded treewidth, but intractable in general [Cooper, 1990, Shimony, 1994, Istrail, 2000]. In 2005, Papadimitriou and Roughgarden showed the intractability of computing the “social-welfare” optimum CE in arbitrary graphical games (see also Papadimitriou and Roughgarden [2008]). Everything seemed to point toward an eventual resignation that the approach of Kakade et al. [2003], along with any other approach to the problem for that matter, had hit the “bounded-treewidth-threshold wall.”

Yet, soon after, Papadimitriou [2005] took a radically different approach to the problem, and surprised the community with an efficient algorithm for computing CE not only in graphical games, but also in almost all known compactly representable games. Jiang and Leyton-Brown [2015a] built upon Pa- padimitriou’s idea to provide what most people would consider an improved polynomial-time algorithm, because of the simplification of the CE that their algorithm outputs (see also Jiang and Leyton-Brown, 2011, for a summary). 1

An immediate question that arises from the algorithmic results just described is, what is so fundamentally different between the problem of exact inference in graphical models and equilibrium computation that made this result possible in the context of graphical games? Of course, CE, probabilistic inference, and their variants are different problems, even within the same framework of graphical models. The question is, how different are they?

It is well-known that pure strategy Nash equilibrium (PSNE) is inherently a classical/standard discrete constraints satisfaction problem (CSP). It is also

well-known that any CSP can be cast as a most-likely, or equivalently, a maximum a posteriori (MAP) assignment estimation problem in Markov random fields (MRFs). 2Through this connection, it is clear that there exists a MAP formulation of PSNE. But what about other, more general forms of equilibria?

We present here a formulation of the problem of equilibrium computation as a kind of local conditions for different approximations to belief inference. Similarly, we show how one can view some special games, called graphical potential games [Ortiz, 2015], as defining an equivalent MRF whose “locally optimal” solutions correspond to arbitrary equilibria of the game. Hence, Papadimitriou’s result, and later that of Jiang and Leyton-Brown, open up the possibility that at least new classes of problems in probabilistic graphical models could be solved exactly and efficiently. The question is, which classes?

While we provide specific connections between the two fields that yield immediate theoretical and computational implications, we also provide practical alternatives that result from those connections. That is, the foundation of both Papadimitriou’s and Jiang and Leyton-Brown’s algorithms is the ellipsoid method, which is one approach that leads to the polynomial-time algorithm for linear programming. This approach, while provably efficient in theory, is often seen as less practical as other alternatives such as so-called interior-point methods. This is in contrast to the simple linear programs that are possible for certain classes of graphical games [Kakade et al., 2003]. Are there simpler and practically effective variants of Papadimitriou’s or Jiang and Leyton-Brown’s algorithms? While the last question is an important open question, we do not address it directly in this paper. Instead, We employ ideas from the literature of learning in games [Fudenberg and Levine, 1999], particularly no-regret algorithms and fictitious play, to propose two specific instances of game-theoretic inspired, practical, and effective heuristics for belief inference in MRFs. One heuristic takes a local approach, and the other takes a global approach. We evaluate our proposed algorithms within the context of the most popular, standard, and state-of-art techniques from the literature in probabilistic graphical models.

This manuscript describes our work, which starts to address some of the questions above, and reports on our progress.

1.1 Overview of the Paper

Section 2 provides preliminary material, introducing basic notation, terminology, and concepts from graphical models and game theory.

Section 3 is the main technical section of the paper. It shows reductions of different problems in belief inference in MRFs as computing equilibria in graphical potential games compactly represented as Gibbs potential games [Ortiz, 2015]. The reductions presented here vary in generality from MAP assignment, marginals, and full-joint estimation to pure-strategy Nash equilibria (PSNE), mixed-strategy Nash equilibria (MSNE), and correlated equilibria (CE), respectively. We briefly discuss a connection between Papadimitriou’s algorithm, as well as Jiang and Leyton-Brown’s, and the work of Jaakkola and Jordan [1997] on variational approximations to the problem of probabilistic inference in MRFs via mean-field mixtures. The paper also includes a discussion on the connections to previous work in computer vision on the problem of relaxation labeling, and work on game-theoretic approaches to (Bayesian) statistical estimation. We then present an alternative approach based on a more global view of the problem, in contrast to the more local approach of the formulations mentioned above. More specifically, we formulate the inference problem using a two-player potential game, inspired by the work on tree reweighed (TRW) message-passing [Wain- wright et al., 2005]. We propose a special type of sequential, “hybrid” standard and stochastic fictitious play algorithm for belief inference.

Section 4 reports on our experimental evaluation. We compare our proposed algorithms to the popular, most commonly used, standard, and easily implementable approximation techniques in use today.

Section 5 discusses future work and suggests new opportunities for other potential research directions, beyond those already discussed in the main technical sections of the paper.

Section 6 concludes the paper with a summary of our contributions.

This section introduces basic notation and concepts in graphical models and game theory used throughout the paper. It also includes brief statements on current state-of-the-art mathematical and computational results in the area.

Basic Notation. Denote by  x ≡ (x1, x2, . . . , xn) an n-dimensional vector and by  x−i ≡ (x1, . . . , xi−1, xi+1, . . . , xn) the same vector without component i. Similarly, for every set  S ⊂ [n] ≡ {1, . . . , n}, denote by  xS ≡ (xi : i ∈ S) the (sub-)vector formed from x using only components in S, such that, letting  Sc ≡[n] − Sdenote the complement of S, we can denote  x ≡ (xS, xSc) ≡ (xi, x−i) for every i. If  A1, . . . , Anare sets, denote by  A ≡ ×i∈[n]Ai, A−i ≡ ×j∈[n]−{i}Ajand  AS ≡ ×j∈SAj.

Graph Terminology and Notation. Let G = (V, E) be an undirected graph, with finite set of n vertices or nodes V = {1, . . . , n} and a set of (undirected) edges E. For each node i, let  N(i) ≡ {j | (i, j) ∈ E}be the set of neighbors of i in G, not including i, and  N(i) ≡ N(i) ∪ {i}the set including i. A clique C of G is a set of nodes with the property that they are all mutually connected: for all  i, j ∈ C, (i, j) ∈ E; in addition, C is maximal if there is no other node k outside C that is also connected to each node in C, i.e., for all k ∈ V − C, (k, i) /∈ Efor some  i ∈ C.

Another useful concept in the context of this paper is that of hypergraphs, which are generalizations of regular graphs. A hypergraph graph G = (V, E) is defined by a set of nodes V and a set of  hyperedges E ⊂ 2V. We can think of the hyperedges as cliques in a regular graph. Indeed, the primal graph of the hypergraph is the graph induced by the node set V and where there is an edge between two nodes if they both belong to the same hyperedge; in other words, the primal graph is the graph induced by taking each hyperedge and forming cliques of nodes in a regular graph.

2.1 Probabilistic Graphical Models

Probabilistic graphical models are an elegant marriage of probability and graph theory that has had tremendous impact in the theory and practice of modern artificial intelligence, machine learning, and statistics. It has permitted effective modeling of large, structured high-dimensional complex systems found in the real world. The language of probabilistic graphical models allows us to capture the structure of complex interactions between individual entities in the system within a single model. The core component of the model is a graph in which each node i corresponds to a random variable  Xiand the edges express conditional independence assumptions about those random variables in the probabilistic system.

2.1.1 Markov Random Fields, Gibbs Distributions, and the Hammersley-Clifford Theorem

By definition, a joint probability distribution P is a Markov random field (MRF) with respect to (wrt) an undirected graph G if for all x, for every node i, P(Xi = xi | X−i = x−i) =  P(Xi = xi | XN (i) = xN (i)).In that case, the neighbors/variables  XN (i)form the Markov blanket of node/variable  Xi.

Also by definition, a joint distribution P is a Gibbs distribution wrt an undirected graph G if it can be expressed as P(X = x) = �C∈CΦC(xC) for some functions ΦCindexed by a clique  C ∈ C, the set of all (maximal) cliques in G, and mapping every possible value  xCthat the random variables  Xcassociated with the nodes in C can take to a non-negative number.

We say that a joint probability distribution P is positive if it has full support (i.e., P(x) > 0 for all x). 3

image

Let P be a positive joint probability distribution. Then, P is an MRF with respect to G if and only if P is a Gibbs distribution with respect to G.

In the context of the theorem, the functions ΦCare positive, which allows us to define MRFs in terms of  local potential functions {φC}over each clique C in the graph. Define the function Ψ(x) ≡ �C∈C φC(xC). Let us refer to any function of this form as a Gibbs potential with respect to G. A more familiar expression of an MRF is  P(X = x) ∝exp(�C∈C φC(xC)) = exp(Ψ(x)).

2.1.2 Some Inference-Related Problems in MRFs

One problem of interest in an MRF is to compute a most likely assignment x∗ ∈arg maxx P(X = x) = arg maxx�C∈C φC(xC)); that is, the most likely outcome with respect to the MRF P. Another problem is to compute the in- dividual marginal probabilities P(Xi = xi) = �x−i P(Xi = xi, X−i = x−i) ∝�x−iexp(�C∈C φC(xC))) for each variable  Xi. A related problem is to com- pute the normalizing constant Z = �xexp(�C∈C φC(xC))) (also known as the partition function of the MRF).

Another set of problems concern so called “belief updating.” That is, computing information related to the  posterior probability distribution P ′having observed the outcome of some of the variables, also known as the evidence. For MRFs, this problem is computationally equivalent to that of computing prior marginal probabilities.

2.1.3 Brief Overview of Computational Results in Probabilistic Graphical Models

Both the exact and approximate versions of most inference-related problems in MRFs are in general intractable (e.g., NP-hard), although polynomial-time algorithms do exists for some special cases (see, e.g., Dagum and Luby, 1993, Roth, 1996, Istrail, 2000, Wang et al., 2013, and the references therein). The complexity of exact algorithms is usually characterized by structural properties of the graph, and the typical statement is that running times are polynomial only for graphs with bounded treewidth (see, e.g., Russell and Norvig, 2003 for more information). Several deterministic and randomized approximation approaches exist (see, e.g., Jordan et al., 1999, Jaakkola, 2000, Geman and Ge- man, 1984). An approximation approach of particular interest in this paper is variational inference [Jordan et al., 1999, Jaakkola, 2000]. Roughly speaking, the general idea is to approximate an intractable MRF P by a “closest” probability distribution  Q∗within a “computationally tractable” class Q: formally,  Q∗ ∈arg maxQ∈QKL(Q ∥ P), where KL(Q ∥ P) ≡ �x Q(x) ln  Q(x)P (x)is the Kullback-Leibler (KL) divergence between probability distributions P and Q wrt Q. The simplest example in the so called mean-field (MF) approximation, in which Q = {Q | Q(x) = �i Q(xi) for all  x ∈ Ω}consists of all possible product distributions. Even if P is an IM, no closed-form solution exists for its mean-field approximation, and the most common computational scheme is based on simple axis parallel optimizations, leading to individual local conditions of optimality and potential local minima: that is, the problem is essentially reduced to finding  Q∗(x) = �i Q∗i(xi) such that for all i, we have

image

H(Qi) ≡ HQi(Xi) ≡ − �xi Qi(xi) ln  Qi(xi) is the (Shannon) entropy of ran- dom variable  Xi ∼ Qi.

2.2 Game Theory

Game theory [von Neumann and Morgenstern, 1947] provides a mathematical model of the stable behavior (or outcome) that may result from the interaction of rational individuals. This paper concentrates on noncooperative settings: individuals maximize their own utility, act independently, and do not have (direct) control over the behavior of others. 4

The concept of equilibrium is central to game theory. Roughly, an equilibrium in a noncooperative game is a point of strategic stance, where no individual player can gain by unilaterally deviating from the equilibrium behavior.

2.2.1 Games and their Representation

Let V = [n] denote a finite set of n players in a game. For each player  i ∈V , let  Aidenote the set of actions or pure strategies that i can play. Let A ≡ ×i∈V Aidenote the set of  joint actions, x ≡ (xi, . . . , xn) ∈ Adenote a joint action, and  xithe individual action of player i in x. Denote by  x−i ≡(x1, . . . , xi−1, xi+1, . . . , xn) the joint action of all the players except i, such that x ≡ (xi, x−i). Let  Mi : A → Rdenote the payoff/utility function of player i. If the  Ai’s are finite, then  Miis called the payoff matrix of player i. Games represented this way are called normal- or strategic-form games.

There are a variety of compact representations for large games inspired by probabilistic graphical models in AI and machine learning [La Mura, 2000, Kearns et al., 2001, Koller and Milch, 2003, Leyton-Brown and Tennenholtz, 2003, Jiang and Leyton-Brown, 2008]. The results of this paper are presented in the context of the following generalization of graphical games [Kearns et al., 2001], a simple but powerful model inspired by probabilistic graphical models such as MRFs previously defined by Ortiz [2014]. 5

Definition 1. A graphical multi-hypermatrix game (GMhG) is defined by

a directed graph G = (V, E) in which there is a node  i ∈ Vin G for each of the n players in the game (i.e., |V | = n), and the set of directed edges, or arcs, E defines a set of  neighbors N(i) ≡ {j | (j, i) ∈ E, i ̸= j}whose action affect the payoff function of i (i.e., j is a neighbor of i if and only if there is an arc from j to i); and

for each player  i ∈ V,

image

a hypergraph where the vertex set is its (inclusive) neighborhood N(i) ≡ N(i) ∪ {i}and the hyperedge set is a set of cliques of players Ci ⊂ 2N(i), and

a set  {M ′i,C:  AC → R | C ∈ Ci}of local-clique payoff (hyper)matrices.

The interpretation of a GMhG is that, for each player i, the local and global payoff (hyper)matrices M ′i:  AN(i) → Rand  Mi : A → Rof i are (implicitly) defined as  M ′i(xN(i)) ≡ �C∈Ci M ′i,C(xC) and  Mi(x) ≡ M ′i(xN(i)), respectively.

Graphical potential games. Graphical potential games are special instances of GMhGs. They play a key role in establishing a stronger connection between probabilistic inference in MRFs and equilibria in games than previously noted. Ortiz [2015] provides a characterization of graphical potential games, and discusses the implication of convergence of certain kinds of “playing” processes in games based on connections to the Gibbs sampler [Geman and Geman, 1984], via the Hammersley-Clifford Theorem [Hammersley and Clifford, 1971, Besag, 1974]. Yu and Berthod [1995] (implicitly) used graphical potential games to establish an equivalence between local maximum-a-posteriori (MAP) inference in Markov random fields and Nash equilibria of the game, a topic revisited in Section 3.1. 6

2.2.2 Equilibria as Solution Concepts

Equilibria are generally considered the solutions of games. Various notions of equilibria exist. A pure strategy (Nash) equilibrium (PSNE) of a game is a joint action  x∗such that for all players i, and for all actions  xi, Mi(x∗i , x∗−i) ≥Mi(xi, x∗−i).That is, no player can improve its payoff by unilaterally deviating from its prescribed equilibrium  x∗i, assuming the others stick to their actions x∗−i. Some games, such as the extensively-studied Prisoner’s Dilemma, have PSNE; many others, such as “playground” Rock-Paper-Scissors, do not. This is problematic because it will not be possible to “solve” some games using PSNE.

A mixed-strategy of player i is a probability distribution  Qiover  Aisuch that  Q(xi) is the probability that i chooses to play action  xi. 7A joint mixed-strategy is a joint probability distribution Q capturing the players behavior, such that Q(x) is the probability that joint action x is played, or in other words, each player i plays action in component  xiof x. Because we are assuming that the players play independently, Q is a product distribution: Q(x) = �i Qi(xi). Denote by  Q−i(x−i) ≡ �j̸=i Qj(xj) the joint mixed strategies of all the players except i. The expected payoff of a player i when some joint mixed-strategy Q is played is �x Q(x)Mi(x); abusing notation, denote it by  Mi(Q). The conditional expected payoff of a player i given that he plays action  xiis �x−i Q−i(x−i)Mi(xi, x−i); abusing notation again, denote it by  Mi(xi, Q−i).

image

is a product distribution formed by the individual players mixed strategies  Q∗isuch that, for all players i, and any other alternative mixed strategy  Q′ifor his play,  Mi(Q∗i , Q∗−i) ≥ Mi(Q′i, Q∗−i). Everygame in normal-form has at least one such equilibrium [Nash, 1951]. Thus, every game has an MSNE “solution.” One relaxation of MSNE considers the case where the amount of gain each player can obtain from unilateral deviation is very small. This concept is particularly useful to study approximation versions of the computational problem. Given  ϵ ≥0, an  (approximate) ϵ-Nash equilibrium (MSNE)is defined as above, except that the expected gain condition becomes  Mi(Q∗i , Q∗−i) ≥Mi(Q′i, Q∗−i) − ϵ.Several refinements and generalizations of MSNE have been proposed. One of the most interesting generalizations is that of a correlated equilibrium (CE) [Aumann, 1974]. In contrast to MSNE, a CE can be a full joint distribution, and thus characterize more complex joint-action behavior by players. Formally, a correlated equilibrium (CE) is a joint probability distribution Q over A such that, for all players  i, xi, x′i ∈ Ai, xi ̸= x′i, and  Q(xi) >0,

image

where  Q(xi) ≡ �x−i Q(xi, x−i) is the (marginal) probability that player i will play  xiaccording to Q and  Q(x−i|xi) ≡ Q(xi, x−i)/ �x′i Q(x′i, x−i) is the con- ditional given  xi. An MSNE is CE that is a product distribution. An equivalent expression of the CE condition above is �x−i Q(xi, x−i)Mi(xi, x−i) ≥�x−i Q(xi, x−i)Mi(x′i, x−i).As was the case for MSNE, we can relax the con- dition of deviation to account for potential gains from small deviation. Given ϵ >0, adding the term “−ϵ” to the right-hand-side of the condition above defines an  (approximate) ϵ-CE. 8

CE have several conceptual and computational advantages over MSNE. For instance, all players may achieve better expected payoffs in a CE than those achievable in any MSNE; 9some “natural” forms of play are guaranteed to converge to the (set of) CE [Foster and Vohra, 1997, 1999, Fudenberg and Levine, 1999, Hart and Mas-Colell, 2000, 2003, 2005]; and CE is consistent with a Bayesian framework [Aumann, 1987], something not yet possible, and apparently unlikely for MSNE [Hart and Mansour, 2007].

2.2.3 Brief Overview of Results in Computational Game Theory

There has been an explosion of computational results on different equilibrium concepts on a variety of game representations and settings since the beginning of this century. The following is a brief summary. We refer the reader to a book by Nisan et al. [2007] for a (partial) introduction to this research area.

The problem for two-player zero-sum games, where the sum of the entries of both matrix is zero, and therefore only one matrix is needed to represent the game, can be solved in polynomial time: It is equivalent to linear programming [von Neumann and Morgenstern, 1947, Sz´ep and Forgo´o, 1985, Karlin, 1959]. After being open for over 50 years, the problems of the complexity of computing MSNE in games was finally settled recently, following a very rapid sequence of results in the last part of 2005 [Goldberg and Papadimitriou, 2005, Daskalakis et al., 2005, Daskalakis and Papadimitriou, 2005, Daskalakis et al., 2009b, Chen and Deng, 2005b]: Computing MSNE is likely to be hard in the worst case, i.e., PPAD-complete [Papadimitriou, 1994], even in games with only two players [Chen and Deng, 2005a, 2006, Chen et al., 2009, Daskalakis et al., 2009a,b]. The result of Fabrikant et al. [2004] suggests that computing PSNE in succinctly representable games is also likely to be intractable in the worst case, i.e., PLS-complete [Johnson et al., 1988]. A common statement is that computing MSNE, and in some cases even PSNE, with “special properties” is hard in the worst case [Gilboa and Zemel, 1989, Gottlob et al., 2003, Conitzer and Sandholm, 2008]. Computing approximate MSNE is also thought to be hard in the worst case [Chen et al., 2006, 2009]. We refer the reader to Ortiz and Irfan [2017], and the references therein, for recent results along this line and a brief survey of the state-of-the-art for this problem.

Most current results for computing exact and approximate PSNE or MSNE in graphical games essentially mirror those for MRFs and constraint networks: polynomial time for bounded treewidth graph; intractable in general [Kearns et al., 2001, Gottlob et al., 2003, Daskalakis and Papadimitriou, 2006, Ortiz, 2014]. This is unsurprising because they were mostly inspired by analogous versions in probabilistic graphical models and constraint networks in AI, and therefore share similar characteristics. Several heuristics exist for dealing with general graphs [Vickrey and Koller, 2002, Ortiz and Kearns, 2003, Daskalakis and Papadimitriou, 2006].

In contrast, there exist polynomial-time algorithms for computing CE, both for normal-form games (where the problem reduces to a simple linear feasibility problem) and even most succinctly-representable games known today [Papadim- itriou, 2005, Jiang and Leyton-Brown, 2015a], including graphical games.

The line of work presented in this section is partly motivated by the following question: Can we leverage advances in computational game theory for problems in the probabilistic graphical models community? Establishing a strong bilateral connection between both problems may help us answer this question.

The literature on computing equilibria in games has skyrocketed since the beginning of this century. As we discover techniques developed early on within the game theory community, and as new results are generated from the extremely active computational game theory community, we may be able to adapt those techniques for solving games to the inference setting. If we can establish a strong bilateral connection between inference problems and the computation of equilibria, we may be able to relate algorithms in both areas and exchange previously unknown results in each.

3.1 Pure-Strategy Nash Equilibrium and Approximate MAP Inference

Consider an MRF P with respect to graph G and Gibbs potential Ψ defined by the set of potential functions  {φC}. For each node i, denote by  Ci ⊂ Cthe subset of cliques in G that include i. Note that the (inclusive) neighborhood of player i is given by N(i) =  ∪C∈CiC.

Define an MRF-induced GMhG, and more specifically, a (hyperedge-symmetric) hypergraphical game [Papadimitriou, 2005, Ortiz, 2015], with the same graph G, and for each player i, hypergraph with hyperedges  Ciand local-clique payoff hypermatrices  M ′i,C(xC) ≡ φC(xC) for all  C ∈ Ci. A few observations about the game are in order.

Property 1. The representation size of the MRF-induced game is the same as that of the MRF: not exponential in the largest neighborhood size, but the size of the largest clique in G.

Property 2. The MRF-induced game is a graphical potential game [Ortiz, 2015] with graph G and (Gibbs) potential function Ψ: i.e., for all i, x and x′i, Mi(xi, x−i) − Mi(x′i, x−i) =  M ′i(xi, xN (i)) − M ′i(x′i, xN (i))

image

Remark 1. Through the connection established by the last property, it is easy to see that sequential best-response dynamics is guaranteed to converge to a PSNE of the game in finite time, regardless of the initial play. 10In fact, we can conclude that a joint-action  x∗is a PSNE of the game if and only if  x∗is a local maxima or a critical point of the MRF P. Thus, the MRF-induced game, like all potential games [Monderer and Shapley, 1996b], always has PSNE. 11

Similarly, for any potential game, one can define a game-induced MRF using the potential function of the game whose set of local maxima (and critical points) corresponds exactly to the set of PSNE of the potential game. Through this connection we can show that solving the local-MAP problem in MRFs is PLScomplete in general [Fabrikant et al., 2004]. 12

One question that comes to mind is whether one can say anything about the properties of the globally optimal assignment in the game-induced MRF and the payoff it supports for the players. Or whether it can be characterized by stronger notions of equilibria. For example, are strong NE, in which no coalition of players could obtain a Pareto dominated set of payoffs by unilaterally deviating, joint MAP assignments of the MFR? Or more generally, what characteristics can we assign to the MAP assignments of the game-induced MRF?

In short, we can use algorithms for PSNE as heuristics to compute locally optimal MAP assignments of P and vice versa. 13

Remark 2. Daskalakis et al. [2007] extended results in game theory characterizing the number of PSNE in normal-form games (see Stanford, 1995, Rinott and Scarsini, 2000, and the references therein) to graphical games, but now taking into consideration the network structure of the game. Information about the number of PSNE in games can provide additional insight on the structure of MRFs.

For example, one of the results of Daskalakis et al. [2007] states that for graphs respecting certain expansion properties as the number of nodes/players increases, the number of PSNE of the graphical game will have a limiting distribution that is a Poisson with expected value 1. Also according to Daskalakis et al. [2007], a similar behavior occurs for games with graphs generated according to the Erd¨os-R´enyi model with sufficiently high average-degree (i.e., reasonably high connectivity). Thus, either the set of MRF-induced games has significantly low measure relative to the set of all possible randomly generated games (something that seems likely), or the number of local maxima (and critical points) of the MRF will have a similar distribution, and thus that number is expected to be low. The latter would suggest that local algorithms such as the max-product algorithm may be less likely to get stuck in local maxima (or critical points) of the MRF.

In addition, there have been several results stating that PSNE are unlikely to exist in many graphs, and that, when they do exist, they are not that many [Daskalakis et al., 2007]. 14MRF-induced games would in that sense represent a very rich class of non-randomly generated graphical games for which the results above do not hold.

3.2 Mixed-strategy Equilibria and Belief Inference

Going beyond PSNE and MAP estimation, this subsection begins to establish a stronger, and potentially more useful connection between probabilistic inference and more general concepts of equilibria in games.

image

QS(xS) ≡ �xV −S Q(x) the (marginal) probability distribution of Q over possi- ble joint actions of players in S. Consider the condition for correlated equilibria (CE), which for the MRF-induced game we can express as, for all  i, xi, x′i ̸= xi,

image

Commuting the sums and simplifying we get the following equivalent condition:

image

This simplification is important because it highlights that, modulo expected payoff equivalence, we only need distributions over the original cliques, not the induced neighbohoods/Markov blankets, to represent CE in this class of games, in contrast to Kakade et al. [2003]; thus, we are able to maintain the size of the representation of the CE to be the same as that of the game. As an alternative, we can use the fact that the MRF-induced game is a potential game and, via some definitions and algebraic manipulation, get the following sequence of equivalent conditions, which hold for all  i, xiand  x′i. �x−i Q(xi, x−i) (Mi(xi, x−i) − Mi(x′i, x−i)) ≥0 �x−i Q(xi, x−i) (Ψ(xi, x−i) −Ψ(x′i, x−i)) ≥0 �x−i Q(xi, x−i) (ln  P(xi, x−i) −ln  P(x′i, x−i)) ≥0

Rewriting the last expression, we get the following equivalent condition: for all i, xiand  x′i,

image

The following are some additional remarks on the implications of the last condition. 15

Remark 3. First, it is useful to introduce the following notation. For any distribution  Q′, let  H(Q′, P) ≡ �x Q′(x)[−log2 P(x)] be the cross entropy be- tween probability distributions  Q′and P, with respect to P. 16Denote by Q−i(x−i) ≡ �xi Q(xi, x−i) the marginal distribution of play over the joint- actions of all players except player i. Denote by  Q′iQ−ithe joint distribution defined as (Q′iQ−i)(x) ≡ Q′i(xi)Q−i(x−i) for all x. Then, condition 2 implies the following sequence of conditions, which hold for all i.

image

As anonymous reviewer pointed out, the condition is actually that of a coarse CE (CCE) [Hannan, 1957, Moulin and Vial, 1978], which is a superset of CE and allows us to apply several simple methods for computing such equilibrium concept, as discussed later in this section. Hence, any CE of the MRF-induced

image

The following property summarizes this remark.

Property 3. For any MRF P, any correlated equilibria Q of the game induced by P satisfies  H(Q, P) ≤miniminQ′i H(Q′iQ−i, P).

Remark 4. Let us introduce some additional notation. For any joint distribution of play  Q′, let  H(Q′) ≡ �x Q′(x)[−log2 Q′(x)] be its entropy. Similarly, for any player i, for any marginal/individual distribution of play  Q′i, let  H(Q′i) ≡�xi Q′i(xi)[−log2 Q′i(xi)] be its (marginal) entropy. For any distribution  Q′and P, let KL(Q′ ∥ P) ≡ �x Q′(x) log2(Q′(x)/P(x)) =  H(Q′, P) − H(Q′) be the Kullback-Leibler divergence between  Q′and P, with respect to  Q′. Denote by  H(Qi|−i) ≡ �xi,x−i Q(xi, x−i) log2(Q(xi, x−i)/Q−i(xi)) =  H(Q−i) − H(Q) the conditional entropy of the individual play of player i given the joint play of all the players except i, with respect to Q.

Then, we can express the condition 2 as the following equivalent conditions, which hold for all i.

image

Hence, any CE of a MRF-induced game is a kind of approximate local optimum (or critical point) of a special kind of variational approximation of the MRF. The following property summarizes this remark.

Property 4. For any MRF P, any correlated equilibria Q of the game induced by P satisfies KL(Q ∥ P) ≤mini�minQ′iKL(Q′iQ−i ∥ P) +  H(Q′i)�− H(Qi|−i).

Note that the last property implies that the approximation Q satisfies the local condition KL(Q ∥ P) ≤miniminQ′iKL(Q′iQ−i ∥ P) + log2 |Ωi|.

Before continuing exploring connections to CE, it is instructive to first consider MSNE.

3.2.1 Mixed-strategy Nash Equilibria and Mean-Field Approximations

image

Denoting by  X +i ≡ {xi ∈ Ai | Qi(xi) > 0}, the last condition implies that �

image

image

which, in turn, we can express as H(Q, P) = minQ′i H(Q′Q×, P) .The last expression is also equivalent to

image

Hence, a NE Q of the game is almost a locally optimal mean-field approximation, except for the extra entropic term. In summary, for MSNE we have the following tighter condition than for arbitrary CE.

Property 5. For any MRF P, any MSNE Q of the game induced by P satisfies KL(Q ∥ P) =�minQ′iKL(Q′iQ×−i ∥ P) +  H(Q′i)�− H(Qi), for all i.

Note that the last property implies that the mean-field approximation Q satisfies the local condition KL(Q ∥ P) ≤minQ′iKL(Q′iQ×−i ∥ P) + log2 |Ωi|for all i.

One possible way to address the issue of the extra entropic term is to consider instead the MRF-induced infinite game, where each player i has the (continuous) utility function 17

image

and wants to maximize over its mixed-strategy  Qigiven the other player mixed-strategies  Qjfor all  j ̸= i.

Property 6. The MRF-induced infinite game defined above is an infinite Gibbs potential game with the same graph G and the following potential over the set of individual (product) mixed strategies

image

where Z is the normalizing constant for P. From this we can derive that the individual player mixed-strategies  {Qi}are a “pure strategy” equilibrium of the infinite game if and only if

image

Or, in other words, if Q is a PSNE of the infinite game, then Q is also a local optimum (or critical point) of the mean-field approximation of P.

Remark 5. The local payoff function defined above for the infinite game also has connections to the game theory literature on learning in games [Fudenberg and Levine, 1999]. This area studies properties of processes by which players “learn” how to play in (usually repeated) games; especially properties related to the existence of convergence of the learning (or playing) dynamics to equilibria. In particular, the local payoff function is similar to that used by logistic fictitious play, a special version of a “learning” process called smooth fictitious play. The difference is that the last entropy term involving the individual player’s mixed strategy has a regularization-type factor  λ >0 such that players play strict best-response as  λ →0. In addition, logistic fictitious play is an instance of a learning process that, if followed by a player, achieves so called approximate universal consistency (i.e., roughly, in the limit of infinite play, the average of the payoffs obtained by the player will be close to the best obtained overall during repeated play, regardless of how the other players behave), also known as Hannan consistency [Hannan, 1957], for appropriate values of  λdepending on the desired approximation level.

Indeed, it is not hard to see that in fact the best-response mixed-strategy Qiof player i to the mixed strategies  QN (i)of their neighbors is  Qi(xi) ∝

image

Hence, running sequential best-response dynamics in the MRF-induced infinite game is equivalent to finding a variational mean-field approximation via recursive updating of the first derivative conditions. 18The process will then be equivalent to minimizing the function  F(Q) ≡KL(Q ∥ P) by axis-parallel updates. The resulting sequence of distributions/mixed-strategies monotonically decreases the value of F and is guaranteed to converge to a local optimum or a critical point of F. Hence, the corresponding learning process is guaranteed to converge to a PSNE of the infinite game, which is in turn an approximate MSNE of the original game. But this is not surprising in retrospect, given the last property (Property 6). That property essentially states a broader property of all potential games: they are isomorphic to so called games with identical interests [Monderer and Shapley, 1996b], which are games where every player has exactly the same payoff function.

Remark 6. The previous discussion suggests that we could use appropriately-modified versions of algorithms for MSNE, such as NashProp [Ortiz and Kearns, 2003], as heuristics to obtain a mean-field approximation of the true marginals.

Going in the opposite direction, the discussion above also suggests that, by treating any (graphical) potential game as an MRF, for any fixed  λ >0, logistic fictitious play in any potential game converges to an approximate (λ/mini |Ai|)-MSNE of the potential game. Indeed, there has been recent work in this direction, which explores the connection between learning in games and mean-field approximations in machine learning [Rezek et al., 2008]. That work proposes new algorithms based on fictitious play for simple mean-field approximation applied to statistical (Bayesian) estimation.

The game-induced MRF is a  λ-temperature Gibbs measure. As we take λ →0, we get the limiting 0-temperature Gibbs measure which is a probability distribution over the set of global maxima of the potential function of the game, and 0 probability everywhere (i.e., the support of the limiting distribution is the set of joint-actions that maximize the potential function). The support of the 0-temperature Gibbs measure is a subset of the “globally optimal” PSNE of the potential game. But there might be other equilibria corresponding to local optima (or critical points) of the potential function.

Are there other connections between the Nash equilibria of the game and the support of the limiting distribution?

3.2.2 Correlated Equilibria and Higher-order Variational Approximations

Kakade et al. [2003] designed polynomial-time algorithms based on linear programming for computing CE in standard graphical games with tree graphs. The approach and polynomial-time results extend to graphical games with bounded-tree-width graphs and graphical polymatrix games with tree graphs. Ortiz et al. [2007] (see also Ortiz et al., 2006) proposed the principle of maximum entropy (MaxEnt) for equilibrium selection of CE in graphical games. They studied several properties of the MaxEnt CE, designed a monotonically increasing algorithm to compute it, and discussed a learning-dynamics view of the algorithm. Kamisetty et al. [2011] employed advances in approximate inference methods to propose approximation algorithms to compute CE. In all of those cases, the general approach is to use ideas from probabilistic graphical models to design algorithms to compute CE. The focus of this paper is the opposite direction: employing ideas from game theory to design algorithms for belief inference in probabilistic graphical models.

Property 4 suggests that we can use the CE for the MRF-induced game as a heuristic approximation to higher-order variational approximations. In fact, one would argue that in the context of inference, doing so is more desirable because, in principle, it can lead to better approximations that can capture more aspects of the joint distribution than a simple mean-field approximation would alone. For example, mean-field approximations are likely to be poor if the MRF is multi-modal. Motivated by this fact, Jaakkola and Jordan [1997] suggest using mixture of product distributions to improve the simple variational mean-field approximation.

3.2.3 Some Computational Implications

But, consider the algorithms of Papadimitriou [2005] or Jiang and Leyton-Brown [2015a] (see also Papadimitriou and Roughgarden, 2008, and Jiang and Leyton- Brown, 2011), which we can use to compute a CE of the MRF-induced game in polynomial time. Such CE will be, by construction, also a (polynomially-sized) mixture of product distributions. (In the case of Jiang and Leyton-Brown’s algorithm it will be a mixture of a subset of the joint-action space, which is equivalent to a probability mass function over a polynomially-sized subset of the joint-action space; said differently, a mixture of product of indicator functions, each product corresponding to particular outcomes of the joint-action space.) Hence, the algorithms of Papadimitriou and Jiang and Leyton-Brown both provide a means to obtain a heuristic estimate of a local optimum (or critical point) of such a mixture in polynomial time. The result would not be exactly the same as that obtained by Jaakkola and Jordan [1997] in general, because of the extra entropic term mentioned in the discussion earlier. Can we

image

Regardless, at the very least one could use the resulting CE to initialize the technique of Jaakkola and Jordan [1997] without specifying an a priori number of mixtures.

Having said that, both Papadimitriou’s and Jiang and Leyton-Brown’s algorithms make a polynomial number of calls to the ellipsoid-algorithm, or more specifically, its “oracle,” to obtain each of the product distributions whose mixture will form the output CE. It is known that the ellipsoid algorithm is slow in practice. Papadimitriou [2005], Papadimitriou and Roughgarden [2008], and Jiang and Leyton-Brown [2015a] leave open the design of more practical algorithms based on interior-point methods.

Finally, this connection also suggests that we can (in principle) use any learning algorithm that guarantees convergence to the set of CE (as described in the section on preliminaries on game theory where the concept was introduced) as a heuristic for approximate inference. Several so-called “no-regret” learning algorithms satisfy those conditions. Indeed, we use two simple variants of such algorithms in our experiments. Viewed that way, such learning algorithms would be similar in spirit to stochastic simulation algorithms with a kind of “adaptivity” reminiscent of the work on adaptive importance sampling (see, e.g., Cheng and Druzdzel, 2000, Ortiz and Kaelbling, 2000, Ortiz, 2002, and the references therein). Establishing a possible stronger connection between learning in games, CE, and probabilistic inference seems like a promising direction for future research. In fact, as previously mentioned (at the end of Remark 5), there has already been some recent work in this direction, but specifically for MSNE and mean-field approximations [Rezek et al., 2008].

Later in this paper, we present the results of an experimental evaluation of the performance of a simple no-regret learning algorithm in computational game theory [Fudenberg and Levine, 1999, Blum and Mansour, 2007, Hart and Mas-Colell, 2000] in the context of probabilistic inference. Those are iterative algorithms like many other approximate inference methods such as mean field and other variational approximations, but closer in spirit to sampling/simulationbased methods such as the Gibbs sampler and other similar MCMC methods. Indeed, the running time per iteration of those algorithms is roughly the same as that of sampling-based methods. We delay the details until the Experiments section (Section 4).

3.3 Other Previous and Related Work

Earlier work on the so called “relaxation labeling” problem in AI and computer vision [Rosenfeld et al., 1976, Miller and Zucker, 1991] has established connections to polymatrix games [Janovskaja, 1968] (see also Hummel and Zucker, 1983, although the connection had yet to be recognized at that time). That work also establishes connections to inference in Hopfield networks, dynamical systems, and polymatrix games [Miller and Zucker, 1991, Zucker, 2001]. A reduction of MAP to PSNE in what we call here a GMhG was introduced by Yu and Berthod [1995] in the same context (see also Berthod et al., 1996); although they concentrate on pairwise potentials, which reduce to polymatrix games in this context. Because, in addition, the ultimate goal in MAP inference is to obtain a global optimum configuration, Yu and Berthod [1995] proposed a Metropolis-Hastings-style algorithm in an attempt to avoid local minima. Their algorithm is similar to simulated annealing algorithms used for solving satisfi-ability problems, and other local methods such as WalkSAT [Selman et al., 1996] (see, e.g., Russell and Norvig, 2003 for more information). The algorithm can also be seen as a kind of learning-in-games scheme [Fudenberg and Levine, 1999] based on best-response with random exploration (or “trembling hand” best response). That is, at every round, some best-response is taken with some probability, otherwise the previous response is replayed. Zucker [2001] presents a modern account of that work. The connection to potential games, and all its well-known properties (e.g., convergence of best-response dynamics) does not seem to have been recognized within that literature. Also, none of the work makes connections to higher-order (i.e., beyond mean-field) inference approximation techniques or the game-theoretic notion of CE.

3.4 Approximate Fictitious Play in a Two-player Potential Game for Belief Inference in Ising Models

This section presents a game-theoretic fictitious-play approach to estimation of node-marginal probabilities in MRFs. The approach this time is more global in terms of how we use the whole joint-distribution for the estimation of individual marginal probabilities. The inspiration for the approach presented here follows from the work of Wainwright et al. [2005]. The section concentrates on Ising models, an important, special MRF instance from statistical physics with its own interesting history.

Definition 2. An Ising model wrt an undirected graph G = (V, E) is an MRF wrt G such that

image

where  θ ≡(,¯W) is the set of node biases  bi’s and edge-weights  wij’s, which are the parameters defining the joint distribution  Pθover  {−1,+1}n.

It is fair to say that interest on more general classes of MRFs originates from the special class of Ising models. It is also fair to say that, because of the relative simplicity and importance of Ising models for problems in statistical physics, as well as to other ML and AI applications areas such as computer vision and NLP, Ising models have become the most common platforms in which to empirically study approximation algorithms for arbitrary MRFs. In short, simplicity of presentation and empirical evaluation guide the focus of Ising models in this section: Generalizations to arbitrary MRFs are straightforward but cumbersome to present. Hence, in this manuscript, we omit the details of such generalizations.

As an outline, the current section begins with an algorithmic instantiation of the iterative approach. The exact instantiation depends on whether we are using CE or MSNE as the solution concept. The section then follows with an informal discussion of the game-theoretic foundations of the general framework behind the approach, and a discussion of immediate implications to computational properties and potential convergence.

Denote by  TGthe set of all spanning trees of connected (undirected) graph G = (V, E) that are maximal with respect to E (i.e., does not contain any spanning forests). If spanning tree  T ∈ TG, we denote by  E(T) ⊂ Ethe set of edges of T. To simplify the presentation of the algorithm, let

image

Initialize  x(1) ←Uniform({−1,+1}n), and for each (i, j) ∈ E, �µ(1)(i,j) ←x(1)i x(1)j. At each iteration l = 1, 2, . . . , m,

image

image

as the estimate of the exact Ising-model’s marginal probability  pi ≡ P(Xi= 1).

The running time of the algorithm is dominated by the computation of the maximum spanning tree (Step 1) which is O(|E| + n log n). All other steps take O(|E|).

Within the literature on probabilisitic graphical models, Hamze and de Fre- itas [2004] propose an MCMC approach based on sampling non-overlapping trees. While our approach has a sampling flavor, its exact connection to MCMC is unclear at best. Also, the spanning trees that our algorithm generates may overlap.

The following discussion connects the algorithm above to an approximate version of fictitious play from the literature on learning in games in game theory. For the most part, we omit discussions to approximate variational inference in this manuscript, except to say that TRW message-passing [Wainwright et al., 2005] is the inspiration behind our proposed algorithm above.

The game implicit in the heuristic algorithm above is a two-player potential game between a  “joint-assignment” (JA)player and a  “spanning-tree” (ST)player. The potential function is ΨX,T (x, T).The payoff functions  MXand  MTof the JA player and the ST player, respectively, are identical and equal the potential function ΨX,T (x, T): formally,  MX(x, T) =  MT (x, T) = ΨX,T (x, T). Note that the payoff function of the ST player is strategically equivalent to the function �(i,j)∈E 1[(i, j) ∈ E(T)]wijxixj.

Technically, this is a game with identical payoffs, which are known to have what Monderer and Shapley [1996a] called the fictitious play property: the empirical play of fictitious play is guaranteed to converge to an MSNE of the game. While determining a best-response for the ST player is easy (i.e., using an algorithm for computing maximal spanning tree such as Kruskal’s, as we do in our implementation for the experiments), unfortunately the same is in general not possible for the JA player, whose best-response is as hard as computing a MAP assignment of another Ising model with the same graph and (generally non-zero) biasnode parameters, but a slightly different set of edge-weights. 19

One approach to deal with the problem of obtaining a best-response from the JA player is to draw one tree uniformly at random from the empirical distribution and find a best-response to that tree. Such an approach is equivalent to a type of smooth best-response. If both players were to do the same, simultaneously, the result is a stochastic version of fictitious play or stochastic fictitious play for short [Fudenberg and Levine, 1999]. The empirical distribution of play of stochastic fictitious play in a game with identical payoffs, or what’s strategically equivalent, any potential game, also converges to an MSNE of the game [Hofbauer and Sandholm, 2002]. In our case, however, we really have a type of “hybrid” sequential-version, where the ST player is always behaving as in standard fictitious play, while the JA player is behaving according to a stochastic fictitious play.

In addition, as an alternative to the best-response computation for player JA, we might want to add an entropic (preference) function of the mixed-strategy to the JA player as an additional term in JA’s payoff, so that the result is really a “smooth” best-response, or more specifically in this case a smooth stochastic fictitious play [Fudenberg and Levine, 1999]. Such an addition would make the connection to variational inference more evident, and would allow us to develop more direct bounds on the quality of the variational result. The main problem is that we do not know of any study of such hybrids within game theory. In addition, most instances of fictitious play assume simultaneous moves. Numerical instability is another problem we found in practice when using such smooth best-response. Even in instances where that was not a problem, the performance was indistinguishable, in a formal statistical sense, from the version of the algorithm that we propose above.

In the context of belief inference, we believe it actually makes more sense to have a so called “sequential” play, where players trade moves: the JA player starts by choosing some action (i.e., full, joint assignments to the random variables), the ST player best-responds to that action, then the JA player best-responds to the ST player’s action, continuing in that way, such that at each round, each player is best-responding to the empirical distribution of play 20up to the time the player makes a move (i.e., draws an action). While this type of sequential process often helps to stabilize the dynamics and improve the likelihood of convergence, it seems that such sequential processes have received considerably less attention than their simultaneous-move counterpart within the game-theory community.

We conjecture, however, that the type of fictitious play process defined above in fact converges. We believe that the proof follows from combining results from standard and stochastic fictitious play for games with identical payoffs, which are (strategically equivalent) instances of potential games [Monderer and Shapley, 1996a, Hofbauer and Sandholm, 2002]. The derivation is complex and not trivial, involving key mathematical concepts from the literature in stochastic approximation. Delving into such level of complexity not only goes beyond the scope of this paper, but more importantly, doing so distracts attention from the paper’s main focus: to provide a general, broad illustration of how ideas and results from game theory may be useful in providing alternative, effective, and practical approaches to hard belief-inference problems in probabilistic graphical models. Thus, we leave the formal proof of our conjecture as future work.

As a last point, it is important to understand and keep in mind that, as it is well-known, in the context of potential games, while sequential best-reply converges to a PSNE (i.e., a joint assignment), fictitious play can converge to an MSNE of the game. 21Monderer and Shapley [1996a] provide an example in a 2-player 2-action normal-form (coordination) game with identical payoffs. Said differently, the resulting empirical distribution of play for the JA player may be to what Monderer and Shapley [1996a] themselves call a “purely mixed strategy” (i.e, every action is played with positive probability; or said differently, the corresponding probability mass function has full suport over the action set of the player). 22In the context of belief inference, the resulting mixed-strategy would correspond to an (approximate) marginal distribution, not a particular joint-assignment. Hence, in the context of belief inference, the convergence of the procedure above may not have to be to a single (possibly local) optimum of the potential function ΨX,T: in principle, convergence could be to a (nondeterministic) mixture over joint-assignments. In fact, this is what we observe in our experiments, albeit after only a finite number of iterations. A thorough understanding of the convergence properties observed in practice requires considerably more experimental work than is reasonable within the context and purpose of the work described in this manuscript.

3.5 Sketch of algorithm derivation and relation to TRW

The connection between TRW and the algorithm presented in this section results from a stochastic minimization of a precise upper bound on a variational approximation that uses a joint distribution (CE) or a product distributions (MSNE) over the spanning trees of the MRF graph G = (V, E) and the original MRF’s random variables X. Here is a sketch of the precise mathematical expressions for the CE-based case. (The ones for the MSNE-case are very similar, and omitted for brevity.) While the derivation is more general, we only present it in the context of Ising models.

We use the following notation for the purpose of the discussion here. Let

QX,Tbe the variational joint distribution over the random variables X and T corresponding to joint variable assignments and spanning tree, respectively. Let QXbe the marginal probability of  QX,Tover X: i.e.,  QX(x) ≡ �T ∈TG QX,T (x, T), for all  x ∈ {−1,+1}. Let  PX ≡ Pθbe the ground-truth joint distribution (defin-ing the Ising model) we would like to approximate. Denote by

image

the KL-divergence of between  QXand  PXwith respect to  QX. Let �QX,Tthe empirical joint distribution of ”joint actions” for the both players generated during fictitious play: i.e., �QX,T (x, T) ≡ 1m�ml=1 1�x(l)=  x, T (l)=  T�. Let vij ≡ E �QX,T[XiXj] =  �x �QX(x)xixj, where  �QX(x) ≡ �T ∈TG �QX,T (x, T) de- notes the empirical marginal over X only; that is, summed over all spanning trees T over G = (V, E) with respect to �Q, which is clearly easy to compute: �QX(x) ≡ �T ∈TG �QX,T (x, T) for all x. Let

image

where  Q∗X,T ∈arg maxQX,T�x,T QX,T (x, T)Ψ(x, T). Denote by  H(Q∗X) ≡− �x Q∗X(x) ln  Q∗X(x) the standard ”Shannon’s entropy” of X with respect to the marginal of  Q∗X,Tover X (i.e.,  Q∗X(x) =  �T Q∗X,T(x, T)); and similarly for H( �QX). After some algebra, we can obtain the following bound for the variational approximation:

image

The first term in the bound, maximizing over  QX,T, inspires the application of fictitious play. As an aside, note that we can generate a family of upper bounds (details omitted); e,g,, for �QX,

image

Note that H( �QX) is easy to compute, and that the last expression leads immediately to an easily computable lower bound on ln Z.

In this section we present the results of synthetic experiments on the performance of the game-theoretic-inspired heuristics we propose in this paper for approximate belief inference in MRFs. Our algorithms have very simple implementations. We also compare them with the most popular approximation algorithms and heuristics, with equally simple implementations, proposed in the literature on probabilistic graphical models.

4.1 Experimental Design: Synthetic Models

The experimental design in terms of the class of Ising models is as in Domke and Liu [2013]. We consider Ising models with  d × dsimple grid graphs, which are planar (i.e., no “wrap around” edges, such that each of the four corner nodes have exactly two neighboring nodes, any other non-internal node has exactly three neighbors, while the rest, i.e., all internal nodes, have exactly four neighbors). Hence, the number of variables or nodes is  n = d2. We used d ∈ {8,12} for our experiments. We did not consider edge-weights magnitude parameters 1.0 or 1.5, because it is really hard to beat a Gibbs sampler for maximum weight magnitudes smaller than 2.0, relative to the bias parameters bi’s being in the real-valued interval [−1,1]. The reason for this might be that, because, as stated in Domke and Liu [2013], the mixing rate of a Gibbs sampler in such models grows roughly exponential with the magnitude, the induced Markov chain mixes pretty fast for such cases; thus convergence is quick. For each value  w ∈ {2.0, 2.5, 3.0, 3.5, 4.0}, we generated random Ising models with edge-weights  wij ∼Uniform([−w, w]) or  wij ∼Uniform([0, w]) for the “mixed” or “attractive” case, respectively, for each (i, j) ∈ E, i.i.d., and node biases bi ∼Uniform([−1,+1]), also i.i.d. for all i, and independent of the edge-weights.

One exception on the class of Ising models used for evaluation is a class we use with edge-weights with constant magnitude (i.e., w = max(i,j)∈E |wij|), but in which we vary the probability q of attractive edge-weights; that is, given a probability q, the sign of the edge-weight are i.i.d. random variables in which the sign is positive with probability q, and negative with probability 1  − q. 23We propose this class of Ising models for future evaluations of approximate belief inference techniques. For evaluation using this class, we consider  w ∈{2.0, 2.5, 3.0, 4.0}. For each q, we randomly generated 50 Ising models as samples for w = 4, and 5 samples for each  w ̸= 4.

Note that despite the graphs being planar, the bias parameter is non-zero in general, so that the known polynomial-time exact algorithms for planar graphs do not technically apply.

Here, we consider simple no-regret algorithms from the literature of learning in games [Fudenberg and Levine, 1999, Blum and Mansour, 2007]. The two most common notions of regret are external and swap regret, the latter being stronger than the former. (Another name often used for external regret is “unconditional regret,” while other names for swap regret are “internal regret” and “conditional regret.” We refer the reader to the references to standard literature on learning in games for specific definitions.) There are several such no-regret algorithms in the literature with different types of convergence guarantees depending on the exact notion of regret used. Here we consider two types of no-regret algorithms, and leave the evaluation of other no-regret algorithms for future work. One type of algorithm we consider is really a class of algorithms based on the Multiplicative Weight Update (MWU) algorithm [Blum and Man- sour, 2007]. In our implementation of the MWU algorithm, for each player i at each round  t ≥1, we set the probability of playing action  xiat round t + 1, which we denote by  x(t+1)i, to be  pt+1(xi) ∝ pt(xi)�1 − ηt(1 −¯Mi(xi, x(t)−i))�,

where  ηtis analogous to a learning rate in ML (i.e., the step size when using standard gradient descent/ascend for optimization), and ¯Miis the normalized payoff function for player i (i.e., the expression  xi��(i,j)∈E wi,jxj + bi�nor-

malized so that minimum and maximum values are 0 and 1, respectively). In general, if we set the value of  ηtto a constant  ηwe guarantee that the empirical joint distribution over joint actions induced by the played sequence of joint actions  x(t)converges to the set of approximate CCE, where the level of the approximation depend on  η. If we set  ηt=

convergence to the set of CCE. 24There is a simple construction that allows us to use the MWU algorithm to construct algorithms for which a player can have either no swap regret, or approximately swap-regret, depending on the value of ηt: if constant (η) the empirical distribution of empirical play converges to the set of approximate CE, while if set as above convergence is guaranteed to the set of CE. 25We refer to the different versions of no-regret algorithms based on MWU related to convergence to exact or approximate CCE as “mw er” and “mw er cf,” respectively; and to those related to exact or approximate CE as “mw sr” and “mw sr cf,” respectively. We set  ηto 0.01 in all of our experiments. We also evaluate a simple (approximate) no-swap-regret algorithm by Hart and Mas-Colell [2000], which we denote as “nr” from now on. Like all (approximate) no-swap-regret algorithms, nr guarantees to converge to (the set of approximate) CE. Each iteration of nr takes roughly the same amount of time as that for Gibbs sampling. We set the number of iterations of the nr algorithm to 105for the standard experimental setup, and to 106for our proposed new evaluation setting. Our exact implementation is a natural adaptation we believe is more amenable to the belief-inference setting. In particular, we evaluate a version in which we update the mixed-strategy each player uses to draw an action at every iteration t as follows. For each player, (1) we set the probability of switching the player’s last action being equal to the empirical regret, or 0 if the empirical regret is negative; and (2) we set the player’s probability of playing action +1 by “damping” the currently suggested probability of playing

+1,  pt(1), for the corresponding player by the original algorithm: 26that is, we use the update 0.99 × pt(1) + 0.01 ×(0.5). We also use 105iterations. Also, we only present results for the sequential, “semi-stochastic” fictitious play we discuss in Section 3.4, for the case of CE only, which we denote as “fp (ce)” from now on. We set the number of iterations m = 15. 27Finally, the results for the MSNE-instantiation of the fictitious play algorithm we propose are quite similar to those for fp (ce), at least for m = 15; thus, we omit those results in the interest of keeping the plots less “crowded” and thus easier to interpret.

We compare the different mw-type algorithms, the simple nr algorithm, and our proposed fp (ce) to (1) standard mean-field approximation (mf), with se-quentialaxis-parallel updates; (2) standard belief propagation (bp), with simultaneous updates; (3) TRW (trw); and (4) the Gibbs sampler (gs). The running times per iteration of all methods is O(|E|), except that of fp (ce) which is O(|E| + n log n), and of course that of bl which is constant. In the next paragraph, we provide more detail on the specifics of the implementations of methods (1–4).

As baseline (bl), we use the simplest possible estimator from the perspective of average marginal-error to measure quality: always use 0.5 as the estimate of the exact marginal distribution of each variable. Certainly, one would expect that for an algorithm to be competitive, its performance should be better than bl. As we soon discuss, our experimental results suggest that this is not always the case; that is, several standard methods, including some of the ones proposed here and even state-of-the-art such as TRW, do not satisfy that condition for “hard” cases. We evaluate mean field (mf) using sequentialaxis-parallel updates, stopping if the maximum absolute difference in probability values between iterations is  ≤10−5, and using a maximum number of iterations = 106. For belief propagation (bp) we use simultaneous updates, and “smooth” the update based on the average of the current value and the new value in order to “dampen” or at least try to prevent oscillations and improve the likelihood of convergence, 28stopping if the maximum absolute difference in probability values between iterations is  ≤10−7, and a maximum number of iterations = 105. For tree reweighed message-passing (trw), we use a constant parameter  ρ= 0.55 for all corresponding edge-appearance-probability parameters  ρij’s [Wainwright et al., 2005], along with a smooth update and the same stopping criterion as for bp. For the Gibbs sampler, we use 106iterations.

4.2 Experimental Results: Synthetic Models

Fig. 1 summarizes our results for the most common classes of Ising models considered in the experimental evaluation of approximation algorithms and heuristic for belief inference in the literature as described above. We perform hypothesis testing for the result in these classes of Ising models using paired z-tests on the individual (i.e., not joint) differences, each with p-value 0.05. Hence, all the statements are statistically significant with respect to such hypothesis tests. Note that there is no globally best approximation technique overall for these classes. Finally, plots for both 8x8 and 12x12 models are included to illustrate how the relative performance of these approximation algorithms are not strongly affected by the grid size d. For brevity, we will only discuss results for the 12x12 case. We refer the reader to Appendix A for experimental results and discussion for the 8x8 case.

Among all mw-type algorithms, we only present the results of mw er cf because it outperforms all other types almost consistently, as we discuss later in this section and plot in Fig. 5. Hence, we refer to mw er cf simply as ’mw’ from now on.

“Mixed” 12x12 case (Bottom left plot, Fig. 1).Clearly, gs is best for all w in this case. Among the other approximation algorithms, we observe the following.

1. fp (ce) is worse than bp for w < 3.5, and indistinguishable from bp for w ≥ 3.5.

2. fp (ce) is consistently better than trw.

3. trw is consistently worse than bp.

4. mw is worse than fp (ce) for w < 3.0, and indistinguishable from fp (ce) for  w ≥ 3.0

5. All methods, except for mf and nr, are consistently better than bl. mf and nr are consistently worse than bl.

6. mf is better than nr for  w ≥ 3.0, and indistinguishable from each other for w < 3.0.

“Attractive” 12x12 case (Bottom right plot, Fig. 1).In this case, there is no clear overall best. We also observe the following.

1. trw is best among all methods for  w ≥ 3.0, indistinguishable from gs for w = 2.5, and worse than gs for w = 2.0.

2. fp (ce) is worse than gs for w = 2.0, but better than gs for w = 4.0, and indistinguishable from gs otherwise.

3. mw and fp (ce) are consistently indistinguishable.

4. mf, nr, and bp are consistently indistinguishable from each other, except for w = 2.0, where bp is better than nr.

5. bp and bl are consistently indistinguishable, except for w = 4.0, where bp is better.

Fig. 2 summarizes our experimental results for a class of Ising models which appears to lead to “harder” Ising-model instances. 29We perform hypothesis testing for the result in these classes of Ising models using two approaches depending on w. For w = 4, where we draw 50 models as samples for each q, we use appropriately modified paired z-tests on the individual (i.e., not joint) differences, each with p-value 0.05. We modify the calculation of the variances resulting from the average over the samples computed for each q. We do so because the distributional properties of the empirical meanaverage for each q may differ. For w < 4, where we only draw 5 models as samples for each q, we use bootstrapped-based, individual, paired hypothesis-testing over each pair of aggregate differences between the methods for each of those values of w; we use 100 bootstrap samples, and p-value 0.05 All the statements are statistically significant with respect to such hypothesis tests.

Aggregate results for 12x12 (Bottom left plot, Fig. 2). The bottom left-hand plot in Fig. 2 shows the aggregate results for this case. There is no clear overall best over all q. We also observe the following.

1. fp (ce) is best among all methods except for when w = 2.0, where gs is better.

2. trw is second best among all methods, except for when w = 2.0, where it is third best (behind fp (ce) and gs).

3. bp is consistently better than mf and nr except when w = 3.5, where it is indistinguishable from nr (but still better than mf).

4. mf is consistently worse than bl, except when w = 4.0, where they are indistinguishable. nr is also consistently worse than bl, except when w = 2.5, where they are indistinguishable

5. gs is consistently better than mf, nr, and bl, except when w = 4.0, where gs and bl are indistinguishable.

6. mw is better than bl when w < 3.5, but indistinguishable from bl when w ≥ 3.5.

Results for constant edge-weight magnitude w = 4 as a function of probability of attractive interaction q (Right plots, Fig. 2). The right-hand plots in Fig. 2 shows finer-grain results for this case, for both the 8x8 and 12x12 models. The results suggest that in fact such instances of Ising models tend to be harder in the sense that even state-of-the-art algorithms such as trw are no better than the simple baseline estimation, in which  �pi= 0.5 for all nodes/variables i, for about half of the full range of values of the sign probability q (i.e., for  q ∈ {0.1, 0.4, 0.5, 0.6, 0.8, 0.9}). In fact, the performance of trw is almost exactly the same as baseline across the range of non-extreme values of q. (Note how the plot of the values for trw and bl are essentially on top of each other for values of q other than 0 or 1.) On the other hand, note how fp (ce) is consistently better than bl across the whole range of values for q. In fact, fp (ce) is always in the set of (statistically) best performers for all q with the exception of  q ∈ {0.0, 1.0}, where trw is better. Almost all the methods other than fp (ce) are no better, and often worse, than bl. Two notable exceptions are that trw and mw beat bl only when  q ∈ {0.0, 1.0}. Also, mw and fp (ce) are indistinguishable, except for  q ∈ {0.0, 1.0}where fp (ce) is better.

Fig. 3 plots the proportion of non-convergent runs of bp (higher curve) and trw (lower curve). Note the interesting behavior of bp: the likelihood of convergence diminishes considerably as q nears 0.5. The effect is almost symmetrical. In contrast, the effect on the non-convergence of trw is negligible. Note, however, that the bp’s non-convergence does not seem to really affect its performance in terms of marginal error. This plot provides additional evidence for our claim that the generative model of random Ising models used for evaluation does lead to harder problem instances.

Results for the effect of different types of no-regret algorithms. Fig. 5 shows the results of various types of no-regret algorithms based on the multiplicativeweights algorithm. The variants result from the combination of (a) external vs. swap regret and (b) exact vs. approximate no-regret guarantees. The take-home message is that the version for external regret with approximate no-regret guarantees, which we refer to both as ’mw’ and ’mw er’ throughout this section, is consistently better or no worse than the others, as we stated at the beginning of this subsection (second paragraph). This result appears counter-intuitive given that such variant has the lowest guarantees in terms of optimality/equilibrium conditions. Said differently, the possible set of solutions is largest among all variants. An analogy with gradient-descent based optimizations in other machinelearning contexts may provide a possible explanation for this behavior. For example, it is well known that reducing the learning rate or step size in inverse proportion to the number of iterations when learning neural networks via backprop theoretically guarantees convergence in parameter (weight) space. Yet, it is equally well-known that doing so is often slow in practice, and that using a small contant as the learning rate tends to lead to faster convergene to good models, despite the lack of theoretical guarantees. We refer the reader to the caption in Fig. 5 for further discussion.

Results on the effect of the number of iterations. Finally, Fig. 4 shows the marginal error of the estimates obtained by fp (ce), for different numbers of iterations. Increasing the number of iterations from m = 15 to m = 50 (and greater) only yields minimal improvement in marginal error. In addition, each run of fp results in pretty consistent marginal errors at each iteration level. Based on this, it appears that fp converges to an estimate in a fairly low number of iterations, and does so consistently. Compare this with Fig. 6, which shows two similar plots each for the trw and gs algorithms. Though trw results in a comparable average marginal error, the marginal error of each run varies more than in fp. Increasing the number of iterations for trw does not decrease this variance, either. The same behavior occurs with gs, though its average marginal error is a bit higher than trw and fp.

4.3 Experimental Design: MNIST-based Ising Models

We also evaluated the various algorithms on Ising models for more realistic settings. We use images of handwritten digits from the popular MNIST dataset to build Ising models for soft de-noising. Note that the interest here is not classification nor MAP estimation, but belief inference: using the individual marginal probabilities as confidence measure on the individual pixel values of the de-noised image. The images consist of 28x28 pixel images, so our Ising models are 28x28 simple planar grid graphs, as in the synthetic experiments. The grayscale pixel values in the original MNIST images are converted to black (+1) and white (−1) values, using a threshold of 0.5. Rather than randomly setting edge weights, we compute the average product between neighboring pixels, taken across all training images, then use that average product as the edge-weight between those neighboring pixels. That is, denote by  Ilthe matrix representation of the l-th image in the MNIST training dataset, and denote by m the number of images in that training dataset. We set the weight  w(i,j),(i,j+1)of the edge between nodes/pixels (i, j) and (i, j + 1), for example,

image

values. and the prior bias  b(i,j)for node/pixel (i, j), for example, as

image

and the normalization factor is such that max(i,j) |b(i,j)|= 1. We select 100 images uniformly at random from the MNIST test dataset, and apply the thresholding described above to turn them into a BW images. We add 5% noise to each of the resulting BW images by ”flipping” each pixel value independently with probability p = 0.05. Hence, we have a different Ising model for each image: the edge weights are all the same, but the biases differ depending on the specific value of the test image. That is, if I denotes the matrix representation of the noisy BW test image, then the Gibbs potential of the Ising model becomes ΨI(x) ≡ �((i,j),(r,s))∈E �w(i,j),(r,s) x(i,j) x(r,s) + � (i, j)�b(i,j)(I(i, j)) x(i,j) ,where �w(i,j),(r,s) ∝ w(i,j),(r,s)and �b(i,j)(I(i, j)) ∝ b(i,j)+ 12I(i, j) ln  1−pp, and the nor-

malization factor is such that max(i,j) |�b(i,j)(I(i, j))|= 1 (for consistency with the node biases of Ising models in the synthetic experiments).

We ran the exact same algorithms on the constructed Ising models as in the synthetic experiments, though we used a slightly different number of (max) iterations: 100 for fp (all variants); 104for bp, nr, and mw (all variants); 105for trw and gs; and 106for mf.

4.4 Experimental Results: MNIST-based Ising Models

Fig. 7 shows our results of the experimental evaluation of the algorithms described in earlier sections (with the exception of gs). Like in the synthetic experiments, we perform hypothesis testing using paired z-tests on the individual differences, each with p-value 0.05. It is important to note that statements in this section comparing algorithm performance are always made with respect to gs. That is, the better performing algorithms here are actually producing output most similar to gs.

The computed edge-weights for the Ising models derived from MNIST images actually consist of all positive values, so this experiment can be thought of as analogous to the “attractive” case in the synthetic experiments. However, as evidenced in Fig. 7, the MNIST-derived models appear to be much “easier” than the synthetic models, since every algorithm performs much better than baseline. For example, mf has very low marginal error in this case, even though in the synthetic experiments it was often indistinguishable from bl. In order, the best performing algorithms are 1) bp, 2) trw, 3) mf, 4) mw er, 5) mw er cf, 6) fp, 7) nr, 8) mw sr, 9) mw sr cf, and 10) bl.

Fig. 7 shows our results for the case of Ising models for the Handwritten Digit ”1” only. That is, the edge weights were computed using only images in the MNIST dataset with a training label of the ”1” digit. Likewise, the observed image samples came only from images in the test data set with a label of the ”1” digit. The figure shows that while the various algorithms have the same relative order to each other when run on ”1’s” vs ”all” digits, the range of average marginal errors they achieved tightened.

It would be nice to have a better understanding of the exact relationship between the true joint distribution of the MRF and the equilibrium points of the induced graphical potential game. For example, it is known that no-external-regret-based algorithms like mw converge to PSNE in “generic” potential games [Klein- berg et al., 2009], such as the MRF-induced game. In fact, we observe such consistent convergence to PSNE by mw in our experiments, which is unlike the behavior we observed for fp (ce). But convergence to PSNE means that we are essentially approximating the whole distribution with a single joint assignment (i.e., a point mass). Yet, mw can outperform state-of-the-art algorithms like trw, particularly on “hard” instances, despite yielding such extremely coarse approximations. In addition, best-response dynamics in the MRF-induced game converges to a PSNE and is equivalent to the method of iterated conditional modes (ICM) [Besag, 1986] in PGMs, which converges to a locally optimal joint assignment of the original MRF. We did not include ICM in our experiments because it is generally considered inferior to other methods. The results of mw suggests we might want to also evaluate ICM for hard instances and compare its output and performance to that of mw. One interesting question is whether mw often finds better local minima that ICM, or whether ICM is equally effective, in those hard cases. As another example of how the proposed study would be useful, it might give us a better idea as to whether one can think of a Gibbs sampler, or other Monte-Carlo sampling algorithms, as providing solutions to equilibrium problems of certain quality.

Here we establish a connection between mf and MSNE. Despite the fact that mf often provides poor approximations, even worst than baseline in many cases, it would still be theoretically interesting to study the relationship between the output of mf and that of algorithms that compute approximate MSNE in loopy graphical games, such as NashProp [Ortiz and Kearns, 2003].

The focus of the experimental evaluation in this paper was testing our proposed, game-theoretically-inspired algorithms for belief inference with standard algorithms in the literature of probabilistic graphical models with relatively “simple” implementations (e.g., do not require calls to software packages or the implementation of complex optimizations). An empirical study involving such algorithms with considerably more complex implementations must have a precise experimental methodology and design that accounts for not only the complexity of implementation, but also a fair comparison that achieves the right balance between measures of solution quality and running times. We leave such evaluations for future work because of the level of complexity required to carry them out correctly.

The work in this paper just “scratches the surface” in terms of the synergy between equilibirum computation in game theory and belief inference in probabilistic graphical models. We state and discuss several immediate theoretical, algorithmic, and computational implications, but many more may be possible. An even broader and more thorough literature review than the one provided in this manuscript is necessary to fully exploit this connection. Thus, many opportunities for novel contributions remain available in either direction.

We provide general formulations of the problem of inference in MRFs as equilibrium computation in graphical potential games. We provide connections, particularly to variational inference approaches, with immediate algorithmic, computational, and theoretical implications to belief inference in probabilistic graphical models that follow immediately from the game-theory literature to various related problems. We provide two approaches for approximate belief inference: a local and a global approach. We experimentally evaluate the effectiveness of the proposed algorithms in the context of Ising models with grid graphs, and provide a characterization of their computational effectiveness based on common measures used to characterize classes of Ising models (e.g., mixed and attractive models with different relative levels of magnitude between the edge weights and node bias values). We also empirically evaluate effectiveness using a slightly different approach in which we keep the edge-weight magnitude constant but vary the “sign probability.” We show how most methods are often not much better than a simple baseline (i.e., estimate that the marginal probabilities are all equal to 0.5) in that class of Ising models. Our results suggest that the proposed class of Ising models does indeed lead to harder instances than the popular models used for empirical evaluation in the same context of Ising models. We empirically show that our proposed method based on a global approach is best, beating even TRW within that class, and shinning in a class of Ising models with constant, “highly attractive” edge-weights, in which it is often better than all other alternatives we evaluated. Note that TRW is generally considered state-of-the-art. We propose such class of Ising models for future evaluations because our experimental results suggest that instances from that class are often the hardest. While our more local approach is not as effective as our global approach or TRW, in fairness, almost all of the alternatives were no better than a simple baseline: estimate the marginal probability to be 0.5.

Some reviewers have expressed the view that our general equilibrium-based approach to approximate inference is “limited to locally optimal solutions to inference problems.” We would like to point that almost all approaches to approximate inference based on variational approximation employed in practice, including simple methods such as mean field and state-of-the-art methods such as TRW, suffer from exactly the same limitations.

In closing, our hope is that the work we present in this manuscript will start a conversation on the synergy between equilibrium computation and belief inference. We believe our work and results establish sufficient precedent for research in the direction of formulating probabilistic inference problems as problems of equilibrium computation. We believe this research direction is sci-entifically intriguing and potentially fruitful for mathematical, algorithmic, and computational game-theory, as well as for probabilistic graphical models.

R.J. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1, 1974.

R.J. Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica, 55, 1987.

Marc Berthod, Zoltan Kato, Shan Yu, and Josiane Zerubia. Bayesian image

classification using Markov random fields. Image and Vision Computing, 14 (4):285–295, 1996.

Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192– 236, 1974. ISSN 00359246. URL http://www.jstor.org/stable/2984812.

Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B (Methodological), 48(3):259–302, 1986. ISSN 00359246. URL http://www.jstor.org/stable/2345426.

Avrim Blum and Yishay Mansour. Learning, regret minimization, and equilib- ria. In Noam Nisan, Tim Roughgarden, ´Eva Tardos, and Vijay V. Vazirani, editors, Algorithmic Game Theory, chapter 4, pages 79–102. Cambridge University Press, 2007.

Xi Chen and Xiaotie Deng. Settling the complexity of 2-player Nashequilibrium. Technical Report 140, Electronic Colloquium on Computational Complexity (ECCC), 2005a. http://eccc.hpi-web.de/eccc-reports/ 2005/TR05-140/index.html.

Xi Chen and Xiaotie Deng. 3-NASH is PPAD-complete. Technical Report 134, Electronic Colloquium on Computational Complexity (ECCC), 2005b. http://eccc.hpi-web.de/eccc-reports/2005/TR05-134/index.html.

Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilib- rium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), 2006.

Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Computing Nash equilibria: Approximation and smoothed complexity. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), 2006.

Xi Chen, Xiaotie Deng, and Shang-Hua Teng. Settling the complexity of com- puting two-player Nash equilibria. J. ACM, 56(3):1–57, 2009.

Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research, 13:155–188, 2000.

Vincent Conitzer and Tuomas Sandholm. New complexity results about Nash equilibria. Games and Economic Behavior, 63(2): 621 641, 2008. ISSN 0899-8256. doi: DOI:10.1016/j.geb. 2008.02.015. URL http://www.sciencedirect.com/science/article/ B6WFW-4SDGR4B-1/2/cf2498492436f6eaa78dd3314a877366. Second World Congress of the Game Theory Society.

Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks (research note). Artif. Intell., 42(2-3):393– 405, 1990.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley & Sons, New York, second edition, 2006.

Paul Dagum and Michael Luby. Approximating probabilistic inference in bayesian belief networks is np-hard. Artificial Intelligence, 60(1):141 – 153, 1993. ISSN 0004-3702. doi: https://doi.org/10.1016/0004-3702(93) 90036-B. URL http://www.sciencedirect.com/science/article/pii/ 000437029390036B.

Constantinos Daskalakis and Christos H. Papadimitriou. Computing pure Nash equilibria in graphical games via Markov random fields. In  EC ’06:Proceedings of the 7th ACM conference on Electronic commerce, pages 91– 99, New York, NY, USA, 2006. ACM. ISBN 1-59593-236-4. doi: http: //doi.acm.org/10.1145/1134707.1134718.

Constantinos Daskalakis, Alexandros G. Dimakis, and Elchanan Mossel. Con- nectivity and equilibrium in random games, 2007. http://www.citebase. org/abstract?id=oai:arXiv.org:math/0703902.

Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. SIAM Journal on Computing, 39(1):195–259, 2009a.

Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. Commun. ACM, 52(2): 89–97, 2009b.

Konstantinos Daskalakis and Christos H. Papadimitriou. Three-player games are hard. Technical Report 139, Electronic Colloquium on Computational Complexity (ECCC), 2005. http://eccc.hpi-web.de/eccc-reports/ 2005/TR05-139/index.html.

Konstantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. Technical Report 115, Electronic Colloquium on Computational Complexity (ECCC), 2005. http: //eccc.hpi-web.de/eccc-reports/2005/TR05-115/index.html.

Justin Domke and Xianghang Liu. Projecting Ising model parameters for fast mixing. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 665–673. 2013. URL http://media.nips.cc/nipsbooks/ nipspapers/paper_files/nips26/391.pdf.

Alex Fabrikant, Christos Papadimitriou, and Kunal Talwar. The complexity of pure Nash equilibria. In  STOC ’04: Proceedings of the Thirty-Sixth AnnualACM Symposium on Theory of Computing, pages 604–612, New York, NY, USA, 2004. ACM.

D. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 1997.

D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, pages 7 – 36, 1999.

D. Fudenberg and D. Levine. The Theory of Learning in Games. MIT Press, 1999.

Drew Fudenberg and Jean Tirole. Game Theory. The MIT Press, 1991.

Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721–741, Nov. 1984.

I. Gilboa and E. Zemel. Nash and correlated equilibria: some complexity con- siderations. Games and Economic Behavior, 1:80–93, 1989.

Paul W. Goldberg and Christos H. Papadimitriou. Reducibility among equilibrium problems. Technical Report 090, Electronic Colloquium on Computational Complexity (ECCC), 2005. http://eccc.hpi-web.de/ eccc-reports/2005/TR05-090/index.html.

Georg Gottlob, Gianluigi Greco, and Francesco Scarcello. Pure Nash equilibria: Hard and easy games. In  TARK ’03: Proceedings of the 9th conference onTheoretical aspects of rationality and knowledge, pages 215–230, New York, NY, USA, 2003. ACM. ISBN 1-58113-731-1. doi: http://doi.acm.org/10. 1145/846241.846269.

J.M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished, 1971.

Firas Hamze and Nando de Freitas. From fields to trees. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pages 243– 250, Arlington, Virginia, United States, 2004. AUAI Press. ISBN 0-9749039-0-6. URL http://dl.acm.org/citation.cfm?id=1036843.1036873.

James Hannan. Approximation to Bayes risk in repeated play. In M. Dresher, A. W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume III, pages 97–140. Princeton University Press, 1957.

Sergiu Hart and Yishay Mansour. The communication complexity of uncoupled nash equilibrium procedures. In STOC ’07:Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 345–353, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-631-8. doi: http://doi.acm.org/10.1145/1250790. 1250843. URL http://portal.acm.org/ft_gateway.cfm?id=1250843& type=pdf&coll=Portal&dl=GUIDE&CFID=48027515&CFTOKEN=24230463.

Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127 – 1150, 2000.

Sergiu Hart and Andreu Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, 2003.

Sergiu Hart and Andreu Mas-Colell. Stochastic uncoupled dynamics and Nash equilibrium: Extended abstract. In TARK, pages 52–61, 2005.

Josef Hofbauer and William H. Sandholm. On the global convergence of stochas- tic fictitious play. Econometrica, 70(6):2265–2294, 2002. ISSN 00129682, 14680262. URL http://www.jstor.org/stable/3081987.

J J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8):2554–2558, 1982. URL http://www. pnas.org/content/79/8/2554.abstract.

Robert A. Hummel and Steven W. Zucker. On the foundations of relaxation labeling processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-5(3):267–287, May 1983.

Sorin Istrail. Statistical mechanics, three-dimensionality and NPcompleteness: I. universality of intracatability for the partition function of the Ising model across non-planar surfaces (extended abstract). In  STOC ’00: Proceedings of the thirty-second annual ACM sympo-sium on Theory of computing, pages 87–96, New York, NY, USA, 2000. ACM. ISBN 1-58113-184-4. doi: http://doi.acm.org/10.1145/335305. 335316. URL http://portal.acm.org/ft_gateway.cfm?id=335316&type= pdf&coll=Portal&dl=GUIDE&CFID=46800608&CFTOKEN=87597059.

Tommi S. Jaakkola. Tutorial on variational approximation methods. In M. Op- per and D. Saad, editors, Advanced Mean Field Methods: Theory and Practice, pages 129–159. MIT Press, Cambridge, MA, 2000.

Tommi S. Jaakkola and Michael L. Jordan. Improving the mean field approxi- mation via the use of mixture distributions. In Proceedings of the NATO ASI on Learning in Graphical Models. Kluwer, 1997. http://people.csail.mit. edu/tommi/papers/mix.ps.

E. B. Janovskaja. Equilibrium situations in multi-matrix games. Litovsk. Mat. Sb., 8:381–384, 1968.

Albert Xin Jiang and Kevin Leyton-Brown. Action-graph games. Technical Report TR-2008-13, University of British Columbia, September 2008. URL http://www.cs.ubc.ca/cgi-bin/tr/2008/TR-2008-13.

Albert Xin Jiang and Kevin Leyton-Brown. Polynomial computation of exact correlated equilibrium in compact games. SIGecom Exchanges, 10(1):6–8, 2011.

Albert Xin Jiang and Kevin Leyton-Brown. Polynomial-time computation of exact correlated equilibrium in compact games. Games and Economic Behavior, 91:347 – 359, 2015a. ISSN 0899-8256. doi: http://dx.doi.org/10.1016/j. geb.2013.02.002. URL http://www.sciencedirect.com/science/article/ pii/S0899825613000249.

Albert Xin Jiang and Kevin Leyton-Brown. Polynomial-time computation of exact correlated equilibrium in compact games. Games and Economic Behavior, 91:347 – 359, 2015b. ISSN 0899-8256. doi: http://dx.doi.org/10.1016/j. geb.2013.02.002. URL http://www.sciencedirect.com/science/article/ pii/S0899825613000249.

David S. Johnson, Christos H. Papadimitriou, and Mihalis Yannakakis. How easy is local search? Journal of Computer and System Sciences, 37(1):79 – 100, 1988.

Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, 1999.

Sham Kakade, Michael Kearns, John Langford, and Luis Ortiz. Correlated equi- libria in graphical games. In  EC ’03: Proceedings of the 4th ACM Conferenceon Electronic Commerce, pages 42–47, New York, NY, USA, 2003. ACM.

Hetunandan Kamisetty, Eric P. Xing, and Christopher J. Langmead. Approx- imating correlated equilibria using relaxations on the marginal polytope. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1153–1160, New York, NY, USA, 2011. ACM. URL http://www.icml-2011.org/papers/ 594_icmlpaper.pdf.

Samuel Karlin. Mathematical Methods and Theory in Games, Programming, and Economics. Addison Wesley Publishing Company, 1959.

M. Kearns, M. Littman, and S. Singh. Graphical models for game theory. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 253–260, 2001.

Robert Kleinberg, Georgios Piliouras, and Eva Tardos. Multiplicative updates outperform generic no-regret learning in congestion games: Extended abstract. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC ’09, pages 533–542, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-506-2. doi: 10.1145/1536414.1536487. URL http://doi.acm.org/10.1145/1536414.1536487.

Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181–221, 2003.

Pierfrancesco La Mura. Game networks. In Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00), 2000.

Kevin Leyton-Brown and Moshe Tennenholtz. Local-effect games. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 772–777, 2003.

Douglas A. Miller and Steven W. Zucker. Copositive-plus Lemke algorithm solves polymatrix games. Operations Research Letters, 10(5):285 – 290, 1991.

Douglas A. Miller and Steven W. Zucker. Efficient simplex-like methods for equi- libria of nonsymmetric analog networks. Neural Computation, 4(2):167–190, 1992. doi: 10.1162/neco.1992.4.2.167. URL http://www.mitpressjournals. org/doi/abs/10.1162/neco.1992.4.2.167.

Dov Monderer and Lloyd S. Shapley. Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258 – 265, 1996a. ISSN 0022-0531. doi: http://dx.doi.org/10.1006/jeth.1996.0014. URL http: //www.sciencedirect.com/science/article/pii/S0022053196900149.

Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior, 14(1):124 – 143, 1996b. ISSN 0899-8256. doi: DOI: 10.1006/game.1996.0044. URL http://www.sciencedirect.com/science/ article/B6WFW-45MH08H-2C/2/69a63e2471795b08d937785bb923e9ea.

H. Moulin and J. P. Vial. Strategically zero-sum games: The class of games whose completely mixed equilibria cannot be improved upon. International Journal of Game Theory, 7(3):201–221, 1978. ISSN 1432-1270. doi: 10.1007/ BF01769190. URL http://dx.doi.org/10.1007/BF01769190.

John Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, September 1951.

Noam Nisan, Tim Roughgarden, ´Eva Tardos, and Vijay V. Vazirani, editors. Algorithmic Game Theory. Cambridge University Press, 2007.

Luis E. Ortiz. Selecting Approximately-Optimal Actions in Complex Structured Domains. PhD thesis, Brown University, May 2002. http://www-personal. umd.umich.edu/~leortiz/phd.pdf.

Luis E. Ortiz. On sparse discretization for graphical games. CoRR, abs/1411.3320, 2014. URL http://arxiv.org/abs/1411.3320.

Luis E. Ortiz. Graphical potential games. CoRR, abs/1505.01539, 2015. URL http://arxiv.org/abs/1505.01539.

Luis E. Ortiz and Mohammad T. Irfan. Tractable algorithms for approximate Nash equilibria in generalized graphical games with tree structure. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 635–641. AAAI Press, 2017.

Luis E. Ortiz and Leslie Pack Kaelbling. Adaptive importance sampling for esti- mation in structured domains. In  UAI ’00: Proceedings of the 16th Conferenceon Uncertainty in Artificial Intelligence, pages 446–454, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-709-9.

Luis E. Ortiz and Michael Kearns. Nash propagation for loopy graphical games. In Suzanna Becker Becker, Sebastian Thrun Thrun, and Klaus Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 817– 824, 2003.

Luis E. Ortiz, Robert E. Schapire, and Sham M. Kakade. Maximum entropy cor- related equilibrium. Technical Report TR-2006-21, CSAIL MIT, Cambridge, MA USA, March 2006.

Luis E. Ortiz, Robert E. Schapire, and Sham M. Kakade. Maximum entropy cor- related equilibria. In Marina Meila and Xiaotong Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), volume 2, pages 347–354. Journal of Machine Learning Research - Proceedings Track, 2007. URL http://jmlr.csail.mit.edu/ proceedings/papers/v2/ortiz07a/ortiz07a.pdf.

C. H. Papadimitriou, A. A. Sch¨affer, and M. Yannakakis. On the complexity of local search. In  STOC ’90: Proceedings of the twenty-second annual ACMsymposium on Theory of computing, pages 438–445, New York, NY, USA, 1990. ACM. ISBN 0-89791-361-2. doi: http://doi.acm.org/10.1145/100216. 100274.

Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. J. Comput. Syst. Sci., 48(3):498–532, 1994. ISSN 0022-0000. doi: http://dx.doi.org/10.1016/S0022-0000(05)80063-7.

Christos H. Papadimitriou. Computing correlated equilibria in multi-player games. In  STOC ’05: Proceedings of the Thirty-Seventh Annual ACM Sym-posium on Theory of cComputing, pages 49–56, 2005.

Christos H. Papadimitriou and Tim Roughgarden. Computing equilibria in multi-player games. In  SODA ’05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 82–91, 2005.

Christos H. Papadimitriou and Tim Roughgarden. Computing correlated equilibria in multi-player games. J. ACM, 55(3):1–29, 2008. ISSN 0004-5411. doi: http://doi.acm.org/10.1145/1379759.1379762. URL http://portal.acm.org/ft_gateway.cfm?id=1379762&type=pdf&coll= Portal&dl=GUIDE&CFID=48210635&CFTOKEN=34538586.

Iead Rezek, David S. Leslie, Steven Reece, Stephen J. Roberts, Alex Rogers, Rajdeep K. Dash, and Nicholas R. Jennings. On similarities between inference in game theory and machine learning. J. Artif. Intell. Res. (JAIR), 33:259– 283, 2008. URL http://dblp.uni-trier.de/db/journals/jair/jair33. html#RezekLRRRDJ08.

Yosef Rinott and Marco Scarsini. On the number of pure strategy Nash equi- libria in random games. Games and Economic Behavior, 33(2):274 – 293, 2000.

Azreil Rosenfeld, Robert A. Hummel, and Steven W. Zucker. Scene labeling by relaxation operations. IEEE Transactions of Systems, Man, and Cybernetics, SMC-6(6), June 1976.

Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273 – 302, 1996. ISSN 0004-3702. doi: https://doi.org/10.1016/ 0004-3702(94)00092-1. URL http://www.sciencedirect.com/science/ article/pii/0004370294000921.

Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.

Bart Selman, Henry Kautz, and Bram Cohen. Local search strategies for satis- fiability testing. In David S. Johnson and Michael A. Trick, editors, Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, October 11-13, 1993, volume 26 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science. AMS, 1996.

Solomon Eyal Shimony. Finding MAPs for belief networks is NP-hard. Artifi-cial Intelligence, 68(2):399 – 410, 1994. ISSN 0004-3702. doi: DOI:10.1016/ 0004-3702(94)90072-8. URL http://www.sciencedirect.com/science/ article/B6TYF-47YY1PM-5/2/464fa2711fc166135138f15751a06bd2.

William Stanford. A note on the probability of k pure Nash equilibria in matrix games. Games and Economic Behavior, 9(2):238 – 246, 1995.

J. Sz´ep and F. Forgo´o. Introduction to the Theory of Games. D. Reidel Publishing Company, 1985.

D. Vickrey and D. Koller. Multi-agent algorithms for solving graphical games. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), pages 345–351, 2002.

John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1947. Second Edition.

M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, July 2005. ISSN 0018-9448. doi: 10.1109/TIT. 2005.850091.

Chaohui Wang, Nikos Komodakis, and Nikos Paragios. Markov random field modeling, inference & learning in computer vision & image understanding: A survey. Computer Vision and Image Understanding, 117 (11):1610 – 1627, 2013. ISSN 1077-3142. doi: http://dx.doi.org/10. 1016/j.cviu.2013.07.004. URL http://www.sciencedirect.com/science/ article/pii/S1077314213001343.

S. Yu and M. Berthod. A game strategy approach for image labeling. Computer Vision and Image Understanding, 61(1):32 – 37, 1995.

Steven W. Zucker. Relaxation labeling: 25 years and sill iterating. In L. S. Davis, editor, Foundations of Image Understanding, chapter 10, pages 289– 322. Kluwer Academic Publishers, Boston, 2001.

Fig. 1 summarizes our results for the most common classes of Ising models considered in the experimental evaluation of approximation algorithms and heuristic for belief inference in the literature as described above. We perform hypothesis testing for the result in these classes of Ising models using paired z-tests on the individual (i.e., not joint) differences, each with p-value 0.05. Hence, all the statements are statistically significant with respect to such hypotesis tests. Note that there is no globally best approximation technique overall for these classes.

“Mixed” case (Left plot, Fig. 1).Clearly, gs is best for all w in this case. Among the other approximation algorithms, we observe the following

1. fp (ce) is best and better than bp for w = 4, indistinguishable from bp for w = 3, and worst than bp for w = 2 where bp is best.

2. fp (ce) is consistently better than trw.

3. trw is worst than bp for w < 4, but better than bp for w = 4.

4. All methods, except for mf and nr, are consistently better than bl; mf and nr are consistently worst than bl, except for w = 2 where mf is indistinguishable from bl.

5. mf and nr are indistinguishable, except for w = 4 where nr is better than mf.

“Attractive” case (Right plot, Fig. 1).In this case, there is no clear overall best. We also observe the following.

1. trw is best among all methods except for w = 2 where gs is best, and trw is second best.

2. fp (ce) is better than all other methods, except trw, and gs for w = 2; and bp for w < 4 where fp (ce) is indistinguishable from bp.

3. mf, nr, bp, and gs are consistently indistinguishable from bl, and from each other; except for w = 2 where gs is best, of course.

Fig. 2 summarizes our experimental results for a class of Ising models which appears to lead to “harder” Ising-model instances. 30We perform hypothesis testing for the result in these classes of Ising models using two approaches depending on w. For w = 4, where we draw 50 models as samples for each q, we use appropriately modified paired z-tests on the individual (i.e., not joint) differences, each with p-value 0.05. We modify the calculation of the variances resulting from the average over the samples computed for each q. We do so because the distributional properties of the empirical mean/average for each q may differ. For w < 4, where we only draw 5 models as samples for each q, we use bootstrapped-based, individual, paired hypothesis-testing over each pair of aggregate differences between the methods for each of those values of w; we use 100 bootstrap samples, and p-value 0.05 All the statements are statistically significant with respect to such hypotesis tests.

Aggregate results (Left plot, Fig. 2). The left-hand plot in Fig. 2 shows the aggregate results for this case. There is no clear overall best over all q. We also observe the following.

1. fp (ce) is best for w = 4 and w = 3, while being second best to gs for w = 2; and, for w = 2.5, tied for best with gs (i.e., indistinguishable from gs).

2. trw is consistently better than bp, mf, bl, and nr, except for w = 2.5 where trw is indistinguishable from bp; trw is consistently worst than fp (ce)

3. Only mf is worst than bl for  w ∈ {2, 4}; mf is indistinguishable from bl for  w ∈ {2.5, 3}; also, bp and nr are indistinguishable from bl, except for w = 2.5, where bp is better than bl.

4. bp is better than mf except for w = 2; bp is indistinguishable from nr for w ∈ {2, 3}, but bp is better than nr for  w ∈ {2.5, 4}.

5. gs is consistently better than mf and nr; indistinguishable from trw, except, of course, for w = 2 where gs is tied for best with fp (ce).

Results for constant edge-weight magnitude w = 4 as a function of probability of attractive interaction q (Right plot, Fig. 2). The right-hand plot in Fig. 2 shows finer-grain results for this case. The results suggest that in fact such instances of Ising models tend to be harder in the sense that even state-of-the-art algorithms such as TRW are no better than the simple baseline estimation, in which  �pi= 0.5 for all nodes/variables i, for less than half of the full range of values of the sign probability q (i.e., for q ∈ {0.1, 0.4, 0.5, 0.6, 0.8, 0.9}). In fact, the performance of TRW is almost exactly the same as baseline accros the range of non-extreme values of q. (Note how the plot of the values for trw and bl are essentially on top of each other for values of q other than 0 or 1.) On the other hand, note how fp (ce) is consistently better than bl across the whole range of values for q. In fact, fp (ce) is always in the set of (statistically) best performers for all q: i.e., the single best for q = 1.0, and indistinguishable from trw for q = 0.0; gs and bp for q ∈ {0.1, 0.5, 0.6, 0.7}; nr, gs, and bp for q = 0.2; gs for  q ∈ {0.3, 0.4, 0.9}; and gs and mf for q = 0.8. The proposed fp (ce) is also best at both extremes, while trw is only best when all weights are negative. Almost all the methods other than fp (ce) are no better, and often worst, than bl, except for bp and trw for q = 0.0; trw for  q ∈ {0.2, 0.7, 1.0}; trw and gs for q = 0.3; and gs for q = 0.8.

image

Figure 1: Standard Evaluation on Ising Models with 8x8 and 12x12 Grids. The left and right plots are for the so-called “mixed” and “attractive” instances of Ising models, respectively. For all plots, the x-axis is the largest magnitude of the edge-weights: i.e., w = max(i,j)∈E |wij|. The y-axis is the average, over 50 randomly generated Ising models, of the average, over all of the 144 variables, of the absolute difference between the estimate and exact marginal probability for the random variable corresponding to that node, along with their corresponding 95% confidence intervals (CIs). The legend in each plot is for different approximation algorithms: bl = baseline; mf = mean field; bp = belief propagation; trw = tree reweighed message-passing; nr = simple no-regret algorithm; gs = Gibbs sampler; mw = multiplicative weights; and fp (ce) = the CE version of our version of the fictitious play for the 2-player potential game described in Section 3.4. We refer the reader to the main body for implementation details and a thorough discussion of the results.

image

Figure 2: Evaluation on Ising Models with 8x8 and 12x12 Grids, Uniform Interaction Magnitude, and Varied Probability of Attractive Interactions. (Left plots) The x-axis, y-axis, and legend are as in Fig. 1, except the edge-weight magnitude w is constant for each interaction strength in the x-axis (i.e., 2, 2.5, 3, and 4), and nr uses 106 iterations. For all cases, the result is the average over all values of the probability of attractive interaction  q ∈ {0.0, 0.1, . . . , 0.9, 1.0}, and over 5 Ising models for each q; except for the case of constant edge-weight magnitude w = 4, in which case the average for each q is over 50 Ising-model samples. Said differently, the overal average for the cases of  w ∈ {2.0, 2.5, 3.0}is over a total of 55 Ising models, while those for the case of w = 4 is over a total of 550 models. Note that the standard 95% CIs based on a Gaussian approximation resulting from the Central Limit Theorem (CLT) do not directly apply here because the averages are over different q values, each of which may have different distributional properties (e.g., different variances). For w < 4, because we are computing the average marginal-error over every q, each based on only 5 samples, we use the bootstrap method to compute the 95% CIs over the overall average for each method and each w, using 100 samples. For w = 4, because we have 50 samples for each q, we use a properly adapted version of the standard 95% CIs which modifies the calculation of the overall variance to account for distributional differences from each q. (Right plots) Results for each q value with w = 4, with 50 models as samples for each, along with their corresponding individual 95% CIs computed as usual. We refer the reader to the main body for a thorough discussion.

image

Figure 3: Evaluation on Ising Models with 8x8 Grids, Uniform In-

image

This plot shows proportion, along with standard individual 95% CIs, of non-convergent runs (y-axis) of bp (higher curve) and trw (lower curve), as a function of the probability of attractive interaction q (x-axis) for Ising models with constant edge-weights magnitude equal to 4. The setup is as described in the right plot of Fig. 2 for the case of edge-weight magnitude w = 4. The proportion is out of 50 runs for each  q ∈ {0, 0.1, 0.2, . . . , 0.9, 1}. Note how the convergence of bp degrades when q nears 0.5. Note the almost symmetric effect on non-convergence for bp. Note also that bp non-convergence seems uncorrelated with its performance, as shown in Fig. 2 (Right plot). While trw may also show non-convergence outside non-uniform edge-weights, the effect is less drastic than for bp.

image

Figure 4: Evaluation on Ising Models with 12x12 Grids, Uniform In-

image

Interactions: Marginal Error of fp (ce) by Number of Iterations. This plot shows the marginal error of the estimates obtained by the fp algorithm, for different numbers of iterations used in the algorithm. The x-axis is the number of iterations m, while the y-axis is the same as in Fig. 1 and Fig. 2. The marginal error of each run is represented by a circle on the graph. Each run is the result of averaging over all values of the probability of attractive interaction  q ∈ {0.0, 0.1, . . . , 0.9, 1.0}. The average marginal error is shown as a line, and is obtained from 20 randomly generated Ising models and corresponding estimates. The number of iterations used were  m ∈ {15,50, 100, 200, 500}.

image

Figure 5: Evaluation on Ising Models with 12x12 Grids, Multiplicative Weights Algorithms. The top plots are the same as those in Fig. 1 and the bottom plots are the same as those in Fig. 2, with the exception that only multiplicative weights algorithms are included in these plots (along with the baseline algorithm). The legend is as follows: mw er = external regret minimization; mw er cf = external regret minimization using a constant  η= 0.01; mw sr = swap regret minimization; mw sr cf = swap regret minimization using a constant  η= 0.01. We refer the reader to the main body for more details regarding the implementation of these algorithms.

image

Figure 6: Evaluation on Ising Models with 12x12 Grids, Uniform In-

image

This plot shows the marginal error of the estimates obtained by the trw (top plots) and gs (bottom plots) algorithms, for different numbers of iterations. The axes are the same as in Fig. 4. Note that the right plots use a logarithmic scale for the number of iterations. The number of iterations used in the left plots were  m ∈ {15,50, 100, 200, 500}, and  m ∈ {102,103,104,105}for the right plots. The marginal error of each run (circles) and the average marginal error (line) were found using the same procedure as in Fig. 4.

image

Figure 7: Evaluation on Ising Models Derived from MNIST Images, 28x28 Grids. This box plot compares the marginal errors of all algorithms with respect to gs. Because each algorithm was run on 100 image samples, each box consists of 100 average (over 282= 784 variables) marginal errors. Data points that are more than 1.5 times the interquartile range away from the median are considered outliers, and are drawn as short lines. The “whiskers” of the boxes are drawn as long lines.

image

Figure 8: Evaluation on Ising Models Derived from MNIST Images of Handwritten Digit ”1” Only, 28x28 Grids.This box plot compares the marginal errors of all algorithms with respect to gs. Because each algorithm was run on 100 image samples, each box consists of 100 average (over 282= 784 variables) marginal errors. Data points that are more than 1.5 times the interquartile range away from the median are considered outliers, and are drawn as short lines. The “whiskers” of the boxes are drawn as long lines.


Designed for Accessibility and to further Open Science