Learning performance in inverse Ising problems with sparse teacher couplings

2019·Arxiv

Abstract

Abstract

We investigate the learning performance of the pseudolikelihood maximization method for inverse Ising problems. In the teacher-student scenario under the assumption that the teacher’s couplings are sparse and the student does not know the graphical structure, the learning curve and order parameters are assessed in the typical case using the replica and cavity methods from statistical mechanics. Our formulation is also applicable to a certain class of cost functions having locality; the standard likelihood does not belong to that class. The derived analytical formulas indicate that the perfect inference of the presence/absence of the teacher’s couplings is possible in the thermodynamic limit taking the number of spins N as infinity while keeping the dataset size M proportional to N, as long as Meanwhile, the formulas also show that the estimated coupling values corresponding to the truly existing ones in the teacher tend to be overestimated in the absolute value, manifesting the presence of estimation bias. These results are considered to be exact in the thermodynamic limit on locally tree-like networks, such as the regular random or Erd˝os–R´enyi graphs. Numerical simulation results fully support the theoretical predictions. Additional biases in the estimators on loopy graphs are also discussed.

1 Introduction

Inference based on the classical Ising model is called the inverse Ising problem or Boltzmann machine learning, which is attracting more and more attention with the increasing interest in machine learning technologies. One recent application spurring this trend is for retinal neurons [1, 2], and subsequent applications to a wide range of systems have been conducted [3, 4, 5, 6, 7, 8, 9, 10], showing the potential usefulness of the inverse Ising framework.

However, it is difficult to compute standard estimators such as the maximum likelihood estimator in this framework when the system size is large. Thus, certain approximations and/or algorithms must be tailored to ease this difficulty and meet the demands of advanced applications, which have been attempted in previous studies [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. One of the most effective examples is the pseudolikelihood method [11, 20]. This method approximates the likelihood function by the product of conditional likelihood functions, each of which is for a single random variable conditioned by its neighboring variables. This is useful for a wide range of problems defined on graphical models, and enables us to treat large systems because local couplings directly connected to a focused single random variable are isolated from the other couplings, and thus can be estimated independently. This local nature is sometimes referred to as local learning [24].

Another benefit of local learning is its theoretical tractability in high-dimensional settings. Recent theoretical analyses based on the replica method revealed the tight limit of inference accuracy in the thermodynamic limit where the dataset size is comparable to model dimensionality [24, 25, 26]. This provides a firm theoretical basis for the inverse Ising framework.

Previous studies of [24, 25, 26] focused on fully-connected Ising models. In high-dimensional settings, however, sparsely-connected models are more interesting because the inference accuracy is expected to be much better than the dense case, and the inferred estimator is expected to have clearer interpretations owing to the sparsity. The present paper deals with this case. We investigate the so-called teacher-student scenario using the replica method by drawing on previous studies [24, 25, 26], but refine the theoretical treatment in [24] for dealing with the teacher with sparse connections. The cavity method is used but the cavity field is assumed to consist of separately treatable signal and noise; the functional form of the associated probability distribution is hypothesized. The network structure of couplings is assumed to be locally tree-like, and our theoretical result is expected to be exact on those networks in the thermodynamic limit. To check the accuracy, numerical experiments are also conducted.

The remainder of this paper is organized as follows. In sec. 2, we review the inverse Ising framework and the statistical mechanical formulation to analyze inference accuracy. We also briefly review the analysis of the fully-connected case given in [24] to demonstrate how the formulation is applied. In sec. 3, we extend the method in the statistical mechanical formulation to the sparsely-connected case. Some theoretical treatments developed in [24] are refined and our ansatz to deal with the teacher model with sparse connections is stated. In sec. 4, our numerical analysis results are provided and compared with our theoretical results to check their accuracy. The final section presents a summary and discussion.

2 Formulation

In this section, we briefly review the inverse Ising framework and two associated inference methods: the maximum likelihood (ML) and pseudolikelihood (PL) methods. The class of local learning is also introduced and explained. In addition, the general framework of the statisticalmechanical analysis proposed in [24] is presented. We emphasize its technically important points and review the case of the fully-connected Ising model [24] to demonstrate its application.

2.1 Inference framework

Let us consider an Ising model consisting of N spin variables and obeying the following distribution:

where are termed couplings and external fields, respectively. The setting of the inverse Ising problem is aimed at inferring the couplings and external fields from a given dataset of spin snapshots (samples) denotes the dataset size. The representative statistical solution or estimator is the ML estimator defined by

Hereafter, the symbol ˆis used to represent an estimator. This canonical estimator has some desirable properties such as consistency, unbiasedness, and efficiency when the dataset size M is large enough while N is kept finite (the asymptotic limit) [27, 28]. Consistency means that the estimator converges in probability to the true parameter in the large M limit; unbiasedness means that the mean of the estimator (which is a random variable when the dataset is randomly generated) becomes identical to the true parameter; unbiased estimators yielding the lowest mean-squared error are termed efficient. Note that the ML estimator enjoys the unbiasedness and the efficiency only in the asymptotic sense, and for finite size datasets it can have bias and be not efficient.

In this sense, the ML estimator is good, but is not always appropriate for the inverse Ising framework for some computational reasons. The PL method [11] overcomes this problem by replacing the likelihood with the conditional distribution is the coupling vector connected to the ith spin, and denotes the spin vector without the ith component. The explicit form is

The PL estimator is obtained separately for each i by

where

Two remarkable properties are held by the PL estimator. The first is its consistency. When the dataset size M is sufficiently large, the PL estimator converges to the true values The second is its locality. Owing to the factorized nature of PL, each coupling vector can be assessed independently with low (polynomial) computational cost, which is in contrast to the ML estimator for which the exponentially large computational cost is necessary when assessing the partition function. However, coupling symmetry is inevitably lost ( ˆin general).

This local property of PL also simplifies the theoretical treatment, which inspired explorations of the “optimal cost function” in a class of models with the same locality as the PL case [24, 26]. On the basis of these studies, we treat this local learning class and assume that the cost function argument is the only product of the local spin and effective field. The cost function and corresponding estimator are denoted as , respectively. When specifying a certain model in the class, the appropriate superscript will be attached as eqs. (5,7).

We have an additional remark on the lost symmetry in the PL estimator. This limitation of the PL estimator can be recovered by adding PL cost functions of different sites i and j and by performing the minimization with respect to a symmetric variable shared in both terms [29]. Although our theoretical analysis presented in this paper can also be generalized to this case with a slight increase in the respective computational steps shown in sec. 3, we restrict ourselves on the simplest case without such a symmetrizing operation.

2.1.1 Teacher-student scenario

Here, we consider the inverse Ising problem in the teacher-student scenario. Specifically, the dataset is assumed to be independently identically distributed (i.i.d.) samples from a teacher Ising model with the couplings and external fields , and a student Ising model attempts to infer the teacher couplings and fields from the dataset. The inference accuracy is quantified by the residual sum of squares (RSS) between the teacher couplings and student’s estimator

where we defined the following three macroscopic parameters:

These are later associated with the order parameters in the statistical mechanical analysis.

2.2 Statistical mechanical analysis

Here, we explain the statistical mechanical formulation developed in [24] to analyze the theoretical performance of local learning models. For simplicity of the theoretical analysis, we assume the absence of external fields H = 0 both in the teacher and student models.

2.2.1 General framework

The basic idea of statistical mechanical analysis is to introduce the following Hamiltonian and Boltzmann distribution induced by the cost function

where

Here, we reorder the spins and focus on the zeroth spin and its coupling vector J. The external field is set to zero as declared above; Trdenotes the integration with respect to J with an appropriate measure, the detailed definition of which is given in sec. A. In the limit , the Boltzmann distribution converges to a pointwise measure on the estimator ˆ; thus, the Boltzmann distribution enables the capture of the estimator’s properties. We are interested in the thermodynamic limit keeping (1); hence, the free energy density will show the selfaveraging property. Thus, for typical datasets, the free energy density converges to its average in this limit. The averaged free energy density is denoted as

where the square brackets [denote the average over the dataset, which is the average over the teacher Ising model:

The average of log is, however, generally difficult to compute. The replica method is a prescription to overcome this difficulty and is symbolized by the following identity:

Assuming that n is a positive integer, we can rewrite [

Rewriting this by introducing variables, which are hereafter called cavity fields, we get

[

At the last equality, we took the summation over the spins except for , yielding the joint distribution ) of the cavity fields. The normalization constant is defined through its marginal as

To proceed further with the computation, we need to specify the functional form of the cavity field distribution and take the average over it. This is possible when the teacher is a fully-connected model, as demonstrated in [24]. We review this result below as it contains some steps essential for the sparsely-connected case.

2.2.2 Revisit the fully-connected case

When the teacher is a fully-connected model, we can use the central limit theorem and assume that the cavity fields are multivariate Gaussian variables with appropriate covariances and means. In [24], the authors assumed that the teacher system is in the paramagnetic phase and the replica symmetry (RS) holds in both the student and teacher systems. These assumptions imply that the following four order parameters are sufficient to describe the free energy density:

where is the correlation matrix between the spins:

where denotes the average over the teacher Ising model without the zeroth spin; the last equality is due to the paramagnetic assumption. The covariances of the cavity fields are

described by eq. (20) asassuming this, we can rewrite [

where

Deferring the detailed computations to sec. A, we immediately have the result in the limit

Further, we take the limit , which requires the following relation:

After straightforward calculations, we get

where denotes the extremization condition with respect to x. The extremization condition yields the following equations of state (EOS):

where

Using the solution of eq. (30), we can evaluate the macroscopic parameters (9) by

These relations can be derived by a standard technique using auxiliary variables [24] and the derivation is given in sec. B. Once the values of are obtained, the RSS is eventually computed by eq. (8). Note that the values of are directly determined by the problem setting, as well as the inverse correlation function. To obtain these quantities, we need to separately solve the direct problem.

3 Details of the sparsely-connected case

This section provides the extension of the above result to the sparsely-connected case, which is the main contribution of this paper. To this end, we introduce an ansatz about the estimator’s behavior as well as the functional form of the cavity field distribution. Under the ansatz, the cavity field is decomposed into a signal and a noise, and it is shown that the noise part obeys essentially the same EOS as the fully-connected case. To complete the computation under the ansatz, the tree-like structure of the coupling network of the teacher model is employed.

3.1 Ansatz for the sparse case

In contrast to the fully-connected case, the cavity field distribution in the sparse case cannot be regarded as Gaussian; the distribution of actually becomes the sum of a few pointwise measures, which is far from Gaussian. Hence, we need a new ansatz to handle the cavity field distribution in the sparse case.

To obtain an idea of how to resolve this, let us consider an ideal situation where we know which couplings are nonzero. Let us suppose that the zeroth spin is connected to c = O(1) neighboring spins, and introduce two sets of indices Ω = ¯Ω = , where Ω (¯Ω) is called the active (inactive) set; and . If we know Ω and ¯Ω in advance, then the inference should be conducted only on . Accordingly, the number of variables to be inferred is just c = O(1); hence, the dataset size M = O(N) is sufficiently large. Thus, we can apply the asymptotic theory of statistics, which implies that an estimator in this ideal case behaves as

this is called an oracle estimator. The “error” from the true solution, ∆, is a random variable. In the local learning class with appropriate (consistent) cost functions such as PL [30], ∆considered to be zero mean with variance decreasing at the rate of RSS is written as ), and vanishes in the thermodynamic limit.

Based on these observations about the oracle estimator, we assume that the (non-oracle) estimator obtained from consistent cost functions obeys the following form:

where we again assume that ∆is a random variable which is asymptotically zero mean with variance scaled as ); the correlations among are also assumed to be sufficiently weak. The quantity ¯is interpreted as the mean value of the estimator and will deviate from the true value owing to the extensive number of noise terms . The values of later computed by taking the minimization condition of the free energy as the order parameters. The applicable range of this ansatz is discussed in sec. 3.3.

Let us examine the consequence of the ansatz. The RSS can be written as

Now, there are two non-negligible contributions to the RSS coming from the bias in Ω and the noise in ¯Ω; the RSS remains finite even in the limit in contrast to the ideal case. The ansatz also allows us to decompose the cavity field as

where is termed as the “noise” part. Furthermore, we can assume that asymptotically independent in the limit . This assumption is reasonable, and a schematic to explain this is given in Fig. 1. Owing to the tree-like structure, we can define the generation g of a spin s from Ω as the shortest path length between s and any spin in Ω along the network. As g grows, the correlation with decays exponentially fast, while the number of spins

Figure 1: Schematic explaining the independence between (. The majority in the sum in is uncorrelated with spins in Ω and dominates in the thermodynamic limit.

belonging to generation g exponentially increases. If the correlation decay is sufficiently faster than the increase of the spins, then the majority of spins in the network can be regarded as uncorrelated with Ω. Some terms in are certainly correlated with , but their contribution is ) because ∆) and the number of correlating terms is O(1) owing to the sufficiently fast decay of correlations. Hence, the contribution of correlating terms vanishes and the uncorrelated majority with Ω completely dominates in the thermodynamic limit. These observations indicate that eq. (17) can be now decomposed as follows:

In the second line, we performed the variable transformation and neglected the contribution as eq. (37). In the third line, we denoted Ω, and performed the sum over , yielding the joint distribution ). In the fourth line, we used the asymptotic uncorrelatedness be- tween ) discussed so far.

Now, the central limit theorem can be applied to the noise part , and they can be regarded as Gaussian variables. As the fully-connected case, two order parameters describing

their covariances are introduced:

Counterparts of are unnecessary because the dependence on () is separately and explicitly treated in the present formulation. Then,

where

Again, using the techniques in sec. A we get

Employing the relation (28) and taking the limit, we get

The extremization condition with respect to

where

Further, the mean estimates are also evaluated by the extremization condition. The result for ¯is given by

Using the parameters computed by eqs. (47-49), we can evaluate the RSS which is expressed in the present setting as

The quantitywill be computed in another discussion on the direct problem as the fully-connected case. In addition, ) will also be assessed separately in the sparse case. These points are addressed in the next subsection.

3.2 Direct problem’s properties

The inverse problem essentially requires certain information from its direct problem counterpart. Necessary information to compute the quantities of interest depends on the system’s properties. In the fully-connected case, two-body quantities such asare sufficient. However, in the sparse case, higher-order information is needed because the central limit theorem does not fully dominate the system. Hence, the functional form of necessary, as seen in eq. (46). Techniques for computing such quantities in the sparse case largely advanced in the ’90–’00s. Here, we quote a portion of the results to compute the necessary quantities. For readers interested in the detailed techniques, please refer to [31, 32]. Although these techniques are applied in general situations, to obtain compact analytic forms of the quantities of interest, we rely on the assumptions that the teacher model is in the paramagnetic phase and the external fields are absent.

3.2.1 Marginal distribution of the teacher model

The marginal distribution ) is computed by marginalizing the whole distribution ) with respect to In general, this operation requires nontrivial computations and the resultant distribution becomes dependent on parameters among the marginalized spins. However, under the present assumptions, such dependencies do not exist and the expression becomes rather simple:

where Ωdenotes the union index set of 0 and Ω. This form is applied to eqs. (47,49) to obtain the order parameters.

3.2.2 Inverse correlation function

Next, we compute the inverse correlation function ; hereafter, we treat the whole system and discard the superscript as it is not essential.

On the tree-like networks, the so-called Bethe approximation gives an exact result as the system size tends to infinity, and it provide a compact analytic form of the inverse correlation function. The derivation is given in [33, 34] and we here just quote the result. Under the present assumptions without magnetizations and external fields, it becomes

where denotes the index set of nodes connected to denotes the connectivity or the number of edges connecting to node i.

3.3 Applicable range of the ansatz

Here we consider the applicable range of the ansatz (34). This ansatz is a strong statement since it allows us to not think of any possible biases of the estimators outside the active set Ω. When is this valid? How does it relate to the tree-like network structure?

To obtain answers to these questions, we rethink eq. (49). An important observation is that this equation is merely the zero-gradient condition of with respect to Ω) averaged over the datasets, as shown below. Denoting the empirical average on the dataset and using the statistical mechanical analysis explained so far, we can write the zero-gradient condition with respect to

where we put . With this expression, we replace the estimator ˆJ with the average over eq. (11), take the average over the dataset, and use the replica method. The result is

where 1). Applying this form and following the same line of computations as in sec. 3.1, we get

and the factorin eq. (54) becomes unity when taking the 0 limit, yielding the identical result to eq. (49).

This computation naturally leads to the following question: Should we compute all the zero-gradient conditions not only for Ω but also for ¯Ω? This point is important because if this is the case, then the ansatz (34) is insufficient as it only suffices those for the active set consistent, the answer is considered to be yes in general; hence, we need to take into account the zero-gradient conditions for Ω. This implies that the ansatz (34) should be modified and we need to introduce mean estimates ¯Ω in general situations.

Fortunately, if the network is tree-like, we can show that all the zero-gradient conditions are automatically satisfied once those for Ω are met. Hence, the ansatz (34) is consistent on such networks; we show proof of this below. For technical reasons, we recover the external field in the remaining of this subsection. When the external field exists, the student model should also have the external field variable, and hence the replica result is slightly modified. That modification is accomplished by replacing(46,48) and (55). Here, ¯denotes the mean estimate of the external field variable acting on the focused spin of the student model, and is determined by the extremization condition of the free energy, yielding

Under the above setup, we show the consistency of eq. (34) on tree-like networks. The first step is to write down the zero-gradient condition for Ω. The result of applying the averages and replica method is the replacement of in eq. (54). With this expression, we perform the following transformation:

where denotes the average over By following the same computations so far, the second term vanishes in the limit limbecause the coefficient converges to the right-hand side of (57) giving zero. Meanwhile, in the same limit, the first term can be transformed as

and the dependence on appears only in the marginal distribution tree-like networks, the marginal distribution necessarily takes the following form:

where is the effective field obtained by marginalizing the descendant spins of i, and is usually termed as cavity field. An important point of eq. (60) is the absence of higher-order interactions among active set spins because of the tree-like structure. Hence, the differentiation of only through that of the effective fields. Furthermore, owing to the tree-like structure, only one of the effective fields is dependent on . Specifying the corresponding index as Ω), we get

Hence, the zero-gradient conditions on the inactive set ¯Ω are satisfied once those of the active set Ω hold, proving our statement.

The above proof also provides a perspective for loopy graphs. If loops exist, then higher-order interactions emerge in ); they generally depend on in a complex manner and yield some additional terms as a result of differentiation. In such situations, additional mean estimates ¯Ω will be necessary to satisfy the corresponding zero-gradient conditions; however, treating all variables in ¯Ω is clearly infeasible. Tailoring good approximations in such cases may be interesting in future work, although in sec. 4.3 we show an example in which our present theoretical treatment becomes a good approximation even for loopy graphs.

4 Numerical experiments

In this section, we conduct numerical experiments to check the accuracy of the theoretical computations. The actual behavior of the order parameters and related quantities depends on the details of the coupling ensembles. Hence, we treat the regular random (RR) graph and Erd˝os–R´enyi (ER) graph as representative examples of sparse tree-like graphs. The RR graph is characterized by one connectivity parameter c, while the ER graph is characterized by the connection probability p. To keep the generated graph sparse enough in the ER case, we assume the probability is scaled as p = d/N, yielding the mean degree d. Furthermore, we also assume that the couplings of the teacher model have the same probability of taking both signs and the strength is constant: 0. The coupling strength K is assumed to be small enough to satisfy the paramagnet assumption of the teacher model. In particular, for the RR graph, the paramagnetic condition is

while that of the ER one is

For readers interested in the derivation, please refer to [31, 32]. The cost function is fixed to that of PL in the following, as the simplest and commonly used case. The result of the RR graph case is shown below in sec. 4.1, and that of the ER graph is in sec. 4.2. For comparison, some numerical results on the square lattice are shown in sec. 4.3, focusing on the approximation nature of the present theoretical results. Furthermore, as another common cost function, the so-called interaction screening (IS) method [21] belonging to the local learning class is examined and compared with PL.

Owing to the uniformity of the coupling strength, the strength of mean estimates can also be set to a uniform value , where the bias factor

is introduced. By the same reason, the marginal distribution can be simplified by again introducing the cavity field

where . If the focused spin’s connectivity is c, then the cavity field distribution becomes

Applying the reduction (65) in eqs. (47,49) with replacement in eq. (48) reduces the computation of mean estimates to that of the bias factor ˆb. The theoretically evaluated ˆb was compared with that obtained by numerical experiments to check the validity of our theoretical treatment.

The numerical computation of the order parameter Q will be conducted below, but it has some delicate points. In our actual computations, the following procedure was adopted: From the generated teacher model we first compute the inverse correlation functioncavity formula (52) and numerically invert it to obtain ; then we introduce from the learning result ˆJ and the mean estimate ¯J which is obtained as ¯the theoretically evaluated value of ˆb is inserted; finally we get a numerical value of Q by . Although it is also possible to evaluate by the Monte Carlo (MC) sampling instead of using the formula (52), this method is better for controlling fluctuations and reducing computational cost.

The outline of our numerical experiment is as follows. We first generate teacher model, next perform MC sampling, and finally choose a center spin and conduct learning by numerically minimizing the PL cost function (5). The obtained estimator is used to compute relevant quantities such as RSS. As a result, there are some distinctive sources of fluctuation in the estimate, but we do not discriminate them below. The error bar is accordingly defined by the standard error coming from those fluctuations. The number of datasets used to compute the error bar is hereafter denoted as . More details of the numerical experiment are summarized in sec. C.

4.1 RR graph case

In the case of the RR graph with connectivity c, using eq. (52) the trace of the inverse correlation function becomes

Substituting this in conjunction with the parameters obtained by eqs. (47-49) into eq. (50), we obtain the RSS. Below, we compare these theoretical values with the numerically evaluated ones.

We start by comparing the theoretical and numerical values of . In Fig. 6, these quantities are plotted against = 3. In all the plots, the

Figure 2: Plots of E (left), Q (middle), and ˆb (right) against (200, 3). Dotted lines and color markers are the theoretical and numerical values, respectively. The agreement between them is fairly good. The left and middle panels are plotted in the double log scale because E and Q drastically diverge in the limit 2. The error bars obtained from = 100 datasets are shown, although they tend to be comparable with the size of markers.

agreement between the theoretical (dotted lines) and numerical (color markers) results is fairly good, supporting the validity of our analytical treatment.

Next, we consider the distributions of the estimators in Fig. 3, which were normalized as probability distribution functions. The left panel is the distribution of the estimators on the

Figure 3: Distribution of the estimators ˆJ on the active and inactive sets are given in the left and middle panels, respectively. The right panel is the distribution of the noise part on the active set, . The system parameters are (middle and right panels imply that the noise parts obey the zero-mean Gaussian distribution and have no discriminative difference between the active and inactive sets. Here, the histograms are generated from = 500 datasets; from each dataset, the number of obtained estimators is c = 3 for Ω while that for ¯Ω is

active set Ω. We can observe that two peaks are located around the theoretical prediction . In the middle panel, the estimator distribution on the inactive set ¯Ω is shown, yielding a Gaussian-like distribution with zero mean. Similar behavior is observed for the noise part on the active set, , the distribution of which is given in the right panel. Here, the mean estimates are computed by multiplying the theoretically evaluated bias ˆb by the true coupling . These observations are again consistent with our theoretical analysis.

Thirdly, we check the finite size effect. In Fig. 4, against the system size N, the RSS and rescaled variance (multiplied by N) of the noise parts ˆare plotted in the upper and lower panels, respectively. Although the finite size effect behaves in different ways depending on the parameters and quantities, we can see that the numerical results (markers) fairly matched the theoretical values (black dotted lines) as the system size is large. Here, the rescaled variance

Figure 4: (Upper): Plot of E against the system size = 5 (left) and 50 (right) at (K, c) = (0.4, 3). The black dotted lines denote the theoretical result and the markers are the numerical ones. The numerical results tend to converge with the theoretical results as the system size grows, although the finite size effect seems to be different between the left and right panels. The error bar is obtained from = 500 datasets for and = 800. (Lower): The rescaled variance (multiplied by N) of the noise part ˆis plotted against the system size N. The parameters are the same as those of their counterparts in the upper panels. Although in this closeup scale there is a small gap between the numerical and theoretical results within the one standard error, this gap can be eliminated by taking a larger number of samples. Here, the error bar was obtained using the bootstrap method by considering each realization and component of ˆas i.i.d..

corresponds to the quantity in our theoretical computation, which is consistent with eq. (50). These results again confirm the validity of our computations.

Finally, we have some noteworthy remarks. The results shown in Figs. 3 and 4 imply the possibility of an efficient method of debiasing and hypothesis testing. The bias factor ˆb can be computed from our analytical result, and hence we can debias our estimator in an efficient manner. The residual after debiasing ˆis considered to obey a Gaussian distribution, as shown in Fig. 3, and is supported by our analytical computations in sec. A. Thus, we can efficiently compute the P-value according to the standard hypothesis testing method, enabling us to judge the relevance of the estimated couplings. Moreover, in the thermodynamic limit show that the perfect reconstruction of the teacher’s network is possible for any 2. To do so, we need to evaluate the probability of getting false positives in the estimator. To control false positives, we introduce a constant threshold value 0), and consider estimated couplings with absolute values less than as negligible and set to zero; we independently repeat this procedure for all . Let us evaluate the probability of successfully screening out false positives using this method. The observations so far imply, on the inactive set ¯Ω, that the estimator behaves as

where (1)) is the rescaled variance of the estimate, with relation (1. Hence, the probability of successfully screening out these estimators on ¯Ω is

The second approximate equality comes from the asymptotic formula of the integral, which can be justified for large N. The last limiting behavior holds as long as is bounded from above, because the exponential factor e

decays fast enough compared with the number of products 1. Hence, we can completely suppress the false positives in the limit . Meanwhile, we also desire to accurately reproduce the presence of couplings on Ω. This can be done by tuning the threshold value as smaller than the true coupling strength K (the mean estimates ¯J are larger than K in the absolute value). In practical situations, we do not know the true coupling strength K in advance, and thus it is nontrivial to correctly tune . In such cases, it may be better to tune by monitoring the distribution of estimators such as Fig. 3, and to find a value that effectively separates the modes of distribution. The present theoretical analysis supports this process by manifesting the behavior of estimators in the limit

4.2 ER graph case

For the ER graph with connection probability p = d/N, the evaluation of the order parameters and related quantities is slightly more complex than the RR case because of the distributed nature of the connectivity. In the thermodynamic limit, the distribution of connectivity c in the ER graph obeys the Poisson distribution:

The trace of the inverse correlation function fortunately becomes simple in the limit:

When focusing on spin i with connectivity in the ER graph, its associated order parameters are computed by eq. (47) with ) defined in eq. (66), and the RSS is given by

This explicit dependence of the order parameter on is the complex point of the ER case. Upon realizing this property, we can easily compute the mean RSS for the whole network, which is written as

These provide explicit formulas of the RSSs for the ER case.

As an interesting departure from the RR case, we here examine the connectivity dependence of our quantities of interest. The plots of are given in Fig. 5. In this experiment, we generated ten different ER networks, performed

Figure 5: Plots of E (left), Q (middle), and ˆb (right) against dotted lines and markers are the theoretical and numerical values, respectively; the different colors correspond to different K. The agreement between them is fairly good.

two independent MC samplings, and conducted learning for all . The error bars were placed using the obtained datasets, and thus varied among different connectivity c. The agreement between the theoretical and numerical results is fairly good, supporting our theoretical result. Although a slight deviation at large c in E(c) and Q(c) was observed, this was presumably attributed to the finite size effect, which increased at large c because of insufficient system size for generating nodes with large c. We have tried to control this deviation but found it is difficult to conduct experiments of sufficiently large systems in reasonable time: The generation probability of node with, say, c = 13 can be estimated as hence for stably generating networks with such large degree nodes we need at least which is too much in our experiment. We thus leave this problem untouched.

We also computed the mean RSS (73) for the whole network. The theoretical value is 4780, while the present experimental value is 0041. The slight difference between these is again attributed to the finite size effect. Here, the theoretical value was obtained by taking the sum of eq. (73) up to c = 20; the effect of this truncation was found to be small.

4.3 Square lattice case for comparison

The cavity method in direct problems is known to yield good approximations even for loopy graphs, when correlations among spins are weak; it is sometimes referred to as Bethe approximation. Here, we examined this approximation nature of the present theoretical computation of inverse problems. To this end, we compared our theoretical result for c = 4 with the simulation result on the square lattice with periodic boundary condition. To avoid possible complexity due to frustration, the present teacher couplings were assumed to be all positive and constant,

In Fig. 6, we plotted 2 on the square lattice of size 20 in comparison with our theoretical result (dotted line) computed with the assumption of the tree-like network structure. The agreement between the theoretical and numerical results is

Figure 6: Plots of E (left) and ˆb (right) against 2 on the square lattice of size 20 20. For comparison, the theoretical results derived by assuming the tree-like structure of the coupling network are plotted as the dotted lines. The agreement between the markers (numerical results) and lines is fairly good. The error bars obtained from = 400 datasets are shown.

fairly good. This indicates that our theoretical result can be a good approximation even for loopy graphs.

Another interesting phenomenon for loopy graphs is the possible presence of bias in the estimated couplings for spins in ¯Ω, as discussed in sec. 3.3. To examine this, in the upper panels of Fig. 7 we show the distributions of the coupling estimates corresponding to the next nearest neighbors (NNN) from the center spin in the teacher model for the square lattice (left) and for the RR graph with c = 4 (right). To make a fair comparison, the present teacher couplings for the RR graph case are all positive and constant as the square lattice case. These two distributions are very similar, implying that the bias in coupling estimates for remote spins is, even if it exists in loopy graphs, very weak for the present situation. For further quantitative information, the means of those distributions were plotted against the system size in the lower panels. Again, we observed no clear deviation from zero and no significant difference between the two cases of the square lattice and RR graph. These suggest the practicality of the theoretical results for wider situations than tree-like networks.

Figure 7: (Upper): Distributions of the NNN estimators for the 20 20 square lattice (left) and for the RR graph with (N, c) = (400, 4). In both cases, other parameters are set to be (= 400. No clear positive/negative tendency is observed in both cases. (Lower). Plots of the mean of the NNN estimate distribution against the system size for the square lattice (left) and RR graph (right). The other parameters are similar to those of the corresponding upper panels. The means are quite small, and no clear deviation from zero is observed. The dataset sizes are respectively. The error bars are obtained using the bootstrap method.

4.4 Comparison with interaction screening

We here examine the IS cost function [21, 22], which is another common method for the inverse Ising problem. The IS cost function is given by

The quantitative comparison with the PL method is of our interest.

In Fig. 8, we plot for the RR graph at (c, K) = (3, 0.4), with two theoretical curves of IS (dashed line) and PL (dotted line). Numerical results (asterisks) are also shown to validate the theoretical result of the IS case. The important observation is that

Figure 8: Comparison of the PL and IS results. Plots of E (left), Q (middle), and ˆb (right) against 4) for the RR graph are given. The theoretical curves for PL and IS are described by dotted and dashed lines, respectively. The color markers are the numerical results for the IS case at N = 200, whose error bars are obtained from = 100 datasets. The consistency of the numerical and theoretical results is fairly good. The IS result provides larger values in all these three quantities than those of PL, implying that IS gives larger error, variance, and bias.

the IS result consistently gives larger values of than those of PL. This implies that the IS method gives larger error, variance, and bias than PL. Although it is known that IS achieves the optimal scaling of the dataset size M [21, 22, 35] necessary for perfect reconstruction of the coupling network structure, it does not necessarily mean the better accuracy in terms of variance/bias. The present analysis demonstrates this in a concise way.

5 Summary and discussion

We proposed a theory to evaluate the reconstruction performance in inverse Ising problems with sparse couplings by maximizing of the pseudolikelihood in the thermodynamic limit. A large part of the theory relies on the statistical mechanical formulation in [24], but we refined the theoretical treatment in the cavity method to handle the teacher model with sparse couplings. The resultant expression requires a full functional form of the cavity field distribution, which is far from Gaussian but was obtained by appropriate consideration of the direct problem counterpart. The theoretical result shows fairly good agreement with numerical experiments conducted on the RR and ER graphs, justifying our theoretical treatment. This agreement holds even for the case of the square lattice, suggesting the practicality of the present result as an approximation for loopy graphs.

The crucial assumptions of our treatment are the asymptotic behavior of the estimator (34) and the paramagnet assumption of the teacher model, leading to the decoupled distributions of the cavity fields. The former assumption implies that the teacher’s couplings can be reconstructed by the student almost perfectly, as discussed at the end of sec. 4.1 according to the hypothesis testing framework, providing a theoretical reasoning to use the inverse Ising framework. The latter assumption requires the smallness of the coupling strength, implying that strongly-correlated datasets cannot be treated by the proposed theory. It will be a challenge to overcome this applicability limitation.

Concerning to the perfect reconstruction of the sparse network in the inverse Ising framework, earlier studies reported similar results in empirical and theoretical ways [29, 35, 36, 37]. Especially, a series of analyses by Wainwright and the collaborators [35, 36, 37] derived the necessary and sufficient conditions for the perfect reconstruction in the high-dimensional limit, clarifying that the necessary size of the dataset scales as M = O(log N) when the maximum degree of the network is bounded from above. Compared to this scaling, our result on the scaling M = O(N) is rather conservative. Our formulation, however, has some nontrivial advantages by deriving more detailed information about the system. For example, our formulation can treat the ER graph, the maximum degree of which is not bounded and the proofs established in [35, 36, 37] are not applicable. Our result also clarifies by directly assessing the estimator’s fluctuation that the hypothesis testing can actually achieve the perfect reconstruction, which provides another efficient way of reconstruction than the regularization used in the above earlier studies. The explicit computation of the bias on the estimator is also another profit. In this way, the present formulation can provide finer information in concrete cases and will be beneficial in wider situations.

For handling real-world datasets, finite magnetizations as well as possible loop structures in the network should be taken into account. For such realistic situations, the computation of (1) will be more complicated. For evaluating those quantities, advanced techniques such as Bethe approximation [33, 34], high-temperature expansion, and MC samplings will be useful. The ansatz (34) should be also modified for the case of loopy graphs, as discussed in sec. 3.3. The presented result can be still practical as an approximation for treating such situations, as demonstrated in sec. 4.3. Certain data analysis utilizing these theoretical results will be interesting and useful.

A clear drawback of the estimator treated in this paper is that it is not informative in the region 2, as indicated by the divergent RSS in the limit 2 + 0 shown in sec. 4 and [24]. To overcome this, the use of regularizations will be promising. The regularization will be particularly useful to control false positives in the estimated couplings. It is also possible to employ hypothesis testing in conjunction with regularization [38]. An extension of the present analysis to these cases is interesting and will be our focus in future work.

Another interesting extension of the present analysis might be the analysis of the model-mismatched cases where the student model cannot be equal to the teacher one. Even in such cases, some limited information in the teacher, such as the coupling network structure, might be recovered in some conditions [10]. Pursuing this possibility will be interesting because it can provide a better justification for applications of the inverse Ising framework to the analysis of real-world datasets. The analysis of model-mismatched cases may also have a connection to “criticality” observed in earlier studies [39].

The inverse Ising problem or Boltzmann machine has been treated in this paper. Although this model is much simpler than the current models of machine learning communities, we believe that it is important to enhance theoretical knowledge on such simple models to maintain the reliability and interpretability of the results given by machine learning technologies. We hope that our study contributes to this direction, which will lead to a better understanding of and relationship with more complex models.

Acknowledgement

This work was supported by JSPS KAKENHI Nos. 25120013 and 17H00764, and JST CREST Grant Number JPMJCR1912, Japan.

A Computations for L and S

For computing L, the following decomposition of the cavity fields becomes useful:

where are i.i.d Gaussian variables with zero mean and unit variance. It is easy to confirm that this decomposition reproduces the covariances among . Using this and performing the integration with respect to

where we use the relation , which was canceled with a factor appearing by the integration of . Eq. (26) is easily derived from this.

For computing the entropic term ), we use the rescaled variable and set the integration measure as Tr. Here, we use the uniform measure because in the present setting the student has no prior information about the teacher couplings. If certain prior knowledge is available such as the teacher coupling sparseness, it can be suitable to introduce another measure such as Laplace prior. Further, we represent the delta functions by the Fourier expressions as follows:

where the integration contour is the imaginary axis and are appropriate normalization constants; however, these points are irrelevant and ignored hereafter. Inserting this into eq. (23), we get

where

To decouple different replicas and components of , we use the expression where Λ is the diagonal matrix consisting of the eigenvalues is the appropriate orthogonal matrix. Performing the variable transformation ˜W = OW and applying the Hubbard– Stratonovich transformation, we get

Note that this quadratic form with respect to ˜W implies that ˜W essentially obeys Gaussian, and thus the estimator ˆJ also does. This knowledge of the distribution can be used for hypothesis testing as addressed in the main text.

In the thermodynamic limit , we can use the saddle-point (or Laplace) method to avoid the explicit integrations with respect to ˜. This yields

where we used the relations . The limit leads to

and the extremization condition gives

Substituting these relations into eq. (84), we obtain eq. (25). If we ignore the terms related to , we have eq. (44).

B Derivation of macroscopic parameters R and ρ

To derive the expressions of , we can employ the technique of auxiliary variables. We introduce two terms in eq. (81), and perform the same line of computations as in sec. A. As a result, the entropic term is modified to the following expression:

Taking the differentiation with respect to and taking the limit

The last expression is obtained by using eq. (85). Similarly,

These give eq. (32).

In the sparse case, we need to compute for computing the RSS. By construction, this is equivalent with R when m is absent. Hence is given by putting m = 0 in eq. (88), leading to eq. (50).

C Details of numerical experiments

The actual experimental procedures are summarized as follows. We first generated a random graph and the teacher couplings on it, and obtained spin snapshots using MC sampling. Then, we randomly chose a center spin from the whole spins and learned the couplings connected to by minimizing the PL cost function defined with a dataset obtained from the sampled spin configurations. This single sequence of operations provided single values of the quantities of interest, such as E and Q. To obtain the error bars of those quantities, we repeated this sequence many times. Here, the experiment had three different sources of fluctuations: the generated teacher model (graph shape and couplings), the choice of the center spin, and the MC sampling. We did not discriminate between these three fluctuations unless explicitly mentioned, and we defined the error bar as the standard error among the obtained values according to their recurrence; the number of datasets obtained this way is hereafter denoted as . In the MC sampling, we started from a random initial configuration and updated the state by the standard Metropolis method; one MC step (MCS) is defined by N trial flips of spins, where N is the total number of spins. We discarded the first 10MCSs as burn-in to avoid systematic errors from the initialization. Furthermore, to avoid possible correlations in samples, each dataset for learning was generated by subsampling from a much larger dataset, which consists of all the configurations recorded after every few numbers of MCS. The size of the subsampled dataset was chosen to be at least five times smaller than that of the larger dataset. The optimization algorithm is a standard trust-region method using the second-order expansion of the objective function.

References

[1] Elad Schneidman, Michael J Berry, Ronen Segev, and William Bialek. Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440(7087):1007–1012, apr 2006.

[2] Jonathon Shlens, Greg D Field, Jeffrey L Gauthier, Matthew I Grivich, Dumitru Petrusca, Alexander Sher, Alan M Litke, and E J Chichilnisky. The structure of multi-neuron firing patterns in primate retina. J. Neurosci., 26(32):8254–8266, aug 2006.

[3] Aonan Tang, David Jackson, Jon Hobbs, Wei Chen, Jodi L Smith, Hema Patel, Anita Prieto, Dumitru Petrusca, Matthew I Grivich, Alexander Sher, Pawel Hottowy, Wladyslaw Dabrowski, Alan M Litke, and John M Beggs. A maximum entropy model applied to spatial and temporal correlations from cortical networks in vitro. J. Neurosci., 28(2):505–518, jan 2008.

[4] Liberty S Hamilton, Jascha Sohl-Dickstein, Alexander G Huth, Vanessa M Carels, Karl Deisseroth, and Shaowen Bao. Optogenetic activation of an inhibitory network enhances feedforward functional connectivity in auditory cortex. Neuron, 80(4):1066–1076, 2013.

[5] Takamitsu Watanabe, Satoshi Hirose, Hiroyuki Wada, Yoshio Imai, Toru Machida, Ichiro Shirouzu, Seiki Konishi, Yasushi Miyashita, and Naoki Masuda. A pairwise maximum entropy model accurately describes resting-state human brain networks. Nat. Commun., 4(May 2012):1370, jan 2013.

[6] Takamitsu Watanabe, Naoki Masuda, Fukuda Megumi, Ryota Kanai, and Geraint Rees. Energy landscape and dynamics of brain activity during human bistable perception. Nat. Commun., 5, jan 2014.

[7] Gaˇsper Tkaˇcik, Olivier Marre, Dario Amodei, Elad Schneidman, William Bialek, and Michael J. Berry. Searching for collective behavior in a large network of sensory neurons. PLoS Comput. Biol., 10(1):e1003408, jan 2014.

[8] Gaia Tavoni, Ulisse Ferrari, Francesco P. Battaglia, Simona Cocco, and R´emi Monasson. Inferred model of the prefrontal cortex activity unveils cell assemblies and memory replay. bioRxiv, 2015.

[9] Yu Terada, Tomoyuki Obuchi, Takuya Isomura, and Yoshiyuki Kabashima. Objective pro- cedure for reconstructing couplings in complex systems. arXiv preprint arXiv:1803.04738, 2018.

[10] Yu Terada, Tomoyuki Obuchi, Takuya Isomura, and Yoshiyuki Kabashima. Objective and efficient inference for couplings in neuronal networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4971–4980. Curran Associates, Inc., 2018.

[11] Julian Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):192–225, 1974.

[12] Hilbert J. Kappen and Francisco de Borja Rodr´ıguez. Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10(5):1137–1156, 1998.

[13] Toshiyuki Tanaka. Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2):2302, 1998.

[14] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

[15] Tamara Broderick, Miroslav Dudik, Gasper Tkacik, Robert E Schapire, and William Bialek. Faster solutions of the inverse pairwise Ising problem. arXiv preprint arXiv:0712.2437, 2007.

[16] Vitor Sessak and R´emi Monasson. Small-correlation expansions for the inverse Ising prob- lem. Journal of Physics A: Mathematical and Theoretical, 42(5):055001, 2009.

[17] Yasser Roudi and John Hertz. Mean field theory for nonequilibrium network reconstruction. Phys. Rev. Lett., 106:048702, Jan 2011.

[18] M M´ezard and J Sakellariou. Exact mean-field inference in asymmetric kinetic Ising systems. Journal of Statistical Mechanics: Theory and Experiment, 2011(07):L07001, jul 2011.

[19] S. Cocco and R. Monasson. Adaptive cluster expansion for inferring Boltzmann machines with noisy data. Phys. Rev. Lett., 106:090601, Mar 2011.

[20] Erik Aurell and Magnus Ekeberg. Inverse Ising inference using all the data. Phys. Rev. Lett., 108:090201, Mar 2012.

[21] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In Advances in Neural Information Processing Systems, pages 2595–2603, 2016.

[22] Andrey Y Lokhov, Marc Vuffray, Sidhant Misra, and Michael Chertkov. Optimal structure and parameter learning of Ising models. Science advances, 4(3):e1700791, 2018.

[23] Marc Vuffray, Sidhant Misra, and Andrey Y. Lokhov. Efficient learning of discrete graphical models. CoRR, abs/1902.00600, 2019.

[24] Ludovica Bachschmid-Romano and Manfred Opper. A statistical physics approach to learn- ing curves for the inverse Ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2017(6):063406, 2017.

[25] Ludovica Bachschmid-Romano and Manfred Opper. Learning of couplings for random asymmetric kinetic Ising models revisited: random correlation matrices and learning curves. Journal of Statistical Mechanics: Theory and Experiment, 2015(9):P09016, 2015.

[26] Johannes Berg. Statistical mechanics of the inverse Ising problem and the optimal objective function. Journal of Statistical Mechanics: Theory and Experiment, 2017(8):083402, aug 2017.

[27] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[28] H Chau Nguyen, Riccardo Zecchina, and Johannes Berg. Inverse statistical problems: from the inverse Ising problem to data science. Advances in Physics, 66(3):197–261, 2017.

[29] Aur´elien Decelle and Federico Ricci-Tersenghi. Pseudolikelihood decimation algorithm im- proving the inference of the interaction network in a general class of Ising models. Physical review letters, 112(7):070603, 2014.

[30] Aapo Hyv¨arinen. Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18(10):2283–2292, 2006.

[31] Manfred Opper and David Saad. Advanced Mean Field Methods : Theory and Practice. Neural information processing series. MIT Press, 2001.

[32] Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, 2009.

[33] Federico Ricci-Tersenghi. The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods. Journal of Statistical Mechanics: Theory and Experiment, 2012(08):P08015, 2012.

[34] H Chau Nguyen and Johannes Berg. Bethe–Peierls approximation and the inverse ising problem. Journal of Statistical Mechanics: Theory and Experiment, 2012(03):P03004, 2012.

[35] Narayana P Santhanam and Martin J Wainwright. Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Transactions on Information Theory, 58(7):4117–4134, 2012.

[36] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. , 1(1–2):1–305, 2008.

[37] Pradeep Ravikumar, Martin J Wainwright, and John D Lafferty. High-dimensional Ising model selection using -regularized logistic regression. The Annals of Statistics, 38(3):1287– 1319, 2010.

[38] Yingying Xu, Santeri Puranen, Jukka Corander, and Yoshiyuki Kabashima. Inverse finite- size scaling for high-dimensional significance analysis. Physical Review E, 97(6):062112, 2018.

[39] Iacopo Mastromatteo and Matteo Marsili. On the criticality of inferred models. Journal of Statistical Mechanics: Theory and Experiment, 2011(10):P10012, 2011.