Scale-invariant Feature Extraction of Neural Network and Renormalization Group Flow

2018·Arxiv

Abstract

Theoretical understanding of how a deep neural network (DNN) extracts features from input images is still lacking, but it is widely believed that the extraction is performed hierarchically through a process of coarse-graining. This is reminiscent of the basic concept of the renormalization group (RG) in statistical physics. In order to explore possible relations between DNN and RG, we use the restricted Boltzmann machine (RBM) applied to the Ising model and construct a flow of model parameters (in particular, temperature) generated by the RBM. We show that the unsupervised RBM trained by spin configurations at various temperatures from T = 0 to T = 6 generates a flow along which the temperature approaches the critical value $T_c = 2.27$. This behavior is opposite to the typical RG flow of the Ising model. By analyzing various properties of the weight matrices of the trained RBM, we discuss why it flows towards $T_c$ and how the RBM learns to extract features of spin configurations.

1 Introduction

Machine learning has attracted interdisciplinary interest as the core method of artificial intelligence, particularly of big data science, and is now widely used to discriminate subtle images by extracting specific features hidden in complicated input data. A deep neural network (DNN), which is motivated by the human brain, is one of the well-known algorithms [1]. Despite its enormous successes, it is still unclear why a DNN works so well and how it can efficiently extract specific features. In discriminating images, we first provide samples of input images with assigned labels, such as a cat or a dog, and then train the neural network (NN) so as to correctly predict the labels of new, previously unseen, input images: this is supervised learning, and its predictive ability depends on how many relevant features the NN can extract. On the other hand, in unsupervised learning algorithms, a NN is trained without assigning labels to the data; instead it is trained so as to generate output images that are as close to the input ones as possible. If the NN is successfully trained to reconstruct the input data, it must have acquired specific features of the input data. With this in mind, unsupervised learning is often adopted for pre-training of supervised NNs.

How can a DNN efficiently extract features? Specific features characteristic of input data usually have hierarchical structures. An image of a cat can still be identified as an animal in a very low resolution image, but one may not be able to distinguish it from a dog. Thus it is plausible that the depth of neural networks reflects such a hierarchy of features. Namely, a DNN learns low-level (microscopic) characteristics in the upper stream of the network and gradually extracts higher-level (macroscopic) characteristics as the input data flow downstream. In other words, the initial data get coarse-grained towards the output. This viewpoint is reminiscent of the renormalization group (RG) in statistical physics and quantum field theories, and various ideas and studies based on this analogy have been proposed [2–9]. In particular, in a seminal paper [4], Mehta and Schwab proposed an explicit mapping between the RG and the Restricted Boltzmann Machine (RBM) [1,10–14].

RG is the most important concept and technique for understanding critical phenomena in statistical physics, and it also plays an essential role in constructively defining quantum field theories on a lattice. It is based on the idea (and proved by Kenneth Wilson [15]) that the long-distance macroscopic behavior of a many-body system is universally described by relevant operators (relevant information) around a fixed point, and is not affected by microscopic details in the continuum limit. Through the reduction of degrees of freedom in RG, the relevant information is emphasized while other, irrelevant information is discarded. In particular, suppose that the statistical model is described by a set of parameters $\{\lambda\}$, and that the parameters are mapped to a different set $\{\tilde{\lambda}\}$ by RG transformations. Repeating such RG transformations, we can draw a flow diagram in the parameter space of the statistical model,

$$\{\lambda\} \to \{\tilde{\lambda}\} \to \{\tilde{\tilde{\lambda}}\} \to \cdots . \tag{1}$$

These RG flows control the behavior of the statistical model near the critical point where a second order phase transition occurs.

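The simplest version of the RBM is a NN consisting of two layers, a visible layer with variables $\{v_i = \pm 1\}$ and a hidden layer with variables $\{h_a = \pm 1\}$, that are coupled to each other through the Hamiltonian

$$\Phi(\{v_i\},\{h_a\}) = -\Big( \sum_{i,a} W_{ia} v_i h_a + \sum_i b^{(v)}_i v_i + \sum_a b^{(h)}_a h_a \Big). \tag{2}$$

A probability distribution of a configuration $\{v_i, h_a\}$ is given by

$$p(\{v_i\},\{h_a\}) = \frac{1}{Z}\, e^{-\Phi(\{v_i\},\{h_a\})}, \tag{3}$$

where we defined the partition function by $Z = \sum_{\{v_i\},\{h_a\}} e^{-\Phi(\{v_i\},\{h_a\})}$. No intra-layer couplings are introduced in the RBM. Now suppose that the RBM is already trained and the parameters of the Hamiltonian (2), namely $(W_{ia}, b^{(v)}_i, b^{(h)}_a)$, are already fixed through a process of training. The probability distribution $p(\{v_i\},\{h_a\})$ also provides the following conditional probabilities for $\{h_a\}$ (or $\{v_i\}$) with the other variables being kept fixed:

$$p(\{h_a\}|\{v_i\}) = \frac{p(\{v_i\},\{h_a\})}{\sum_{\{h_a\}} p(\{v_i\},\{h_a\})}, \tag{4}$$

$$p(\{v_i\}|\{h_a\}) = \frac{p(\{v_i\},\{h_a\})}{\sum_{\{v_i\}} p(\{v_i\},\{h_a\})}. \tag{5}$$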

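These conditional probabilities generate a flow of distributions, and consequently a flow of parameters of the corresponding statistical model. Suppose that we have a set of N initial configurations $\{v_i = \sigma^A_i\}$ (A = 1, . . . , N), which are generated by a statistical model with parameters $\{\lambda\}$, such as the Ising model at temperature T. In the large N limit, the distribution function

$$q_0(\{v_i\}) = \frac{1}{N} \sum_{A=1}^{N} \prod_i \delta_{v_i, \sigma^A_i} \tag{6}$$

faithfully characterizes the statistical model with parameters $\{\lambda\}$. Multiplying the conditional probabilities (4) and (5) iteratively, we can generate a flow of probability distributions as

$$q_0(\{v_i\}) \to r_1(\{h_a\}) = \sum_{\{v_i\}} p(\{h_a\}|\{v_i\})\, q_0(\{v_i\}), \tag{7}$$

$$r_1(\{h_a\}) \to q_1(\{v_i\}) = \sum_{\{h_a\}} p(\{v_i\}|\{h_a\})\, r_1(\{h_a\}), \tag{8}$$

and so on for $q_n(\{v_i\}) \to r_{n+1}(\{h_a\}) \to q_{n+1}(\{v_i\})$. Let us focus on Eq. (7). If the probability distribution $r_1(\{h_a\})$ is well approximated by the Boltzmann distribution of the same statistical model with different parameters $\{\tilde{\lambda}\}$, we can say that the RBM generates a transformation $\{\lambda\} \to \{\tilde{\lambda}\}$. If more than two layers are stacked iteratively, we can obtain a flow of parameters as in Eq. (1). Another way to obtain a flow is to look at the transformations $q_0(\{v_i\}) \to q_1(\{v_i\}) \to q_2(\{v_i\}) \to \cdots$ and translate the flow of probability distributions into a flow of parameters $\{\lambda\} \to \{\tilde{\lambda}\} \to \cdots$. In the present paper, we consider the latter flow to discuss a relation with RG.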
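The flow of distributions described above can be made concrete on a toy RBM that is small enough to enumerate exactly. The following is only a minimal sketch under stated assumptions: the helper names (`states`, `boltzmann_weights`, `rbm_step`) and the numerical values of the weights and biases are hypothetical and are not taken from the paper; the Hamiltonian is assumed to have the standard two-layer form with spins taking values ±1.

```python
import itertools
import math


def states(n):
    """All 2**n configurations of n spins taking values -1 or +1."""
    return list(itertools.product((-1, 1), repeat=n))


def boltzmann_weights(W, bv, bh):
    """Unnormalized weights exp(-Phi(v, h)) for every pair (v, h)."""
    weights = {}
    for v in states(len(bv)):
        for h in states(len(bh)):
            phi = -(
                sum(W[i][a] * v[i] * h[a]
                    for i in range(len(v)) for a in range(len(h)))
                + sum(b * s for b, s in zip(bv, v))
                + sum(b * s for b, s in zip(bh, h))
            )
            weights[(v, h)] = math.exp(-phi)
    return weights


def rbm_step(q, W, bv, bh):
    """Map q(v) -> r(h) -> q_next(v) using the conditionals p(h|v) and p(v|h)."""
    V, H = states(len(bv)), states(len(bh))
    w = boltzmann_weights(W, bv, bh)
    norm_v = {v: sum(w[(v, h)] for h in H) for v in V}  # sum over h at fixed v
    norm_h = {h: sum(w[(v, h)] for v in V) for h in H}  # sum over v at fixed h
    r = {h: sum(w[(v, h)] / norm_v[v] * q[v] for v in V) for h in H}
    q_next = {v: sum(w[(v, h)] / norm_h[h] * r[h] for h in H) for v in V}
    return r, q_next


if __name__ == "__main__":
    W = [[0.8], [0.5]]          # hypothetical weights: 2 visible spins, 1 hidden spin
    bv, bh = [0.1, -0.2], [0.0]
    q = {v: 0.0 for v in states(2)}
    q[(1, 1)] = 1.0             # start from a single configuration
    for n in range(3):
        r, q = rbm_step(q, W, bv, bh)
        print(n + 1, {v: round(p, 4) for v, p in q.items()})
```

A useful sanity check is that the marginal distribution of the model itself is left unchanged by one step of this map, since summing the conditionals against the joint distribution reproduces the marginals.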

Mehta and Schwab [4] pointed out the similarity between the RG transformations of Eq. (1) and the above flows of parameters in the unsupervised RBM. But in order to show that the transformation of parameters in the RBM indeed generates the conventional RG transformation, it is necessary to show that the weight matrix and the biases of the RBM are appropriately chosen so as to generate the correct RG transformation that performs coarse-graining of input configurations. In Ref. [4], a multi-layer RBM is employed as an unsupervised-learning NN, and the weights and the biases are chosen by minimizing the KL divergence (relative entropy) between the input probability distribution and the reconstructed distribution obtained by integrating (marginalizing) over the hidden variables. The authors suggested the similarity by looking at the local spin structures in the hidden variables, but they did not show explicitly that the weights determined by the unsupervised learning actually generate the flow of RG transformations.

The arguments of Ref. [4] and related misconceptions in the literature are criticized in Ref. [6]. In a wider context, the criticism is related to the following question: what determines whether a specific feature of input data is relevant or not? In RG transformations of statistical models, long-wavelength (macroscopic) modes are retained while short-wavelength modes are discarded as noise. In this way, RG transformations can extract the universal behavior of the model at long wavelengths. But, of course, this is so because we are interested in the macroscopic behavior of the system: if we are instead interested in short-wavelength physics, we need to extract the opposite features of the model. Thus, we may say that the extraction of relevant features needs pre-existing biases for making the judgment, and supervised learning is necessary to give such biases to the machine. However, this does not mean that unsupervised learning has nothing to do with the RG. Even in unsupervised learning, a NN automatically notices and extracts some kind of features of the input data, and the flow generated by the trained NN reflects such features.

In the present paper, we investigate the relationship between the RBM and the RG by further studying the flows of distributions, Eqs. (7) and (8), that the unsupervised RBM generates. Note that in defining the flow of (7) and (8), we need to specify how we have trained the RBM, because the training determines the properties of the weights and biases, and accordingly the behavior of the flow. In this paper we mostly use the following three training schemes. One type of RBM (which we call type V) is trained by configurations at various temperatures from low to high. The other two types (type H and type L) are trained by configurations only at high temperatures and only at low temperatures, respectively. Then we translate these flows of probability distributions defined by Eqs. (7) and (8) into flows of temperature of the Ising model,

$$T \to \tilde{T} \to \tilde{\tilde{T}} \to \cdots . \tag{9}$$

In order to measure temperature, we prepare another NN trained by supervised learning. The results of our numerical simulations lead to a surprising conclusion. In the type V RBM, which has adequately learned the features of configurations at various temperatures, we found that the temperature approaches the critical point, $T \to T_c$, along the RBM flow. This behavior is opposite to the conventional RG flow of the Ising model.

The paper is organized as follows. In section 2, we explain the basic settings and the methods of our investigations. We prepare sample images of the spin configurations of the Ising model, and train RBMs by the configurations without assigning labels of temperature. Then we construct flows of parameters (i.e., temperature) generated by the trained RBM. In section 3, we show various results of the numerical simulations, including the RBM flows of parameters. In section 4, we analyze properties of the weight matrices using the method of singular value decomposition. The final section is devoted to summary and discussions. Our main results on the RBM flow and conjectures about the feature extraction of the unsupervised RBM are presented in Sec. 3.2.

2 Methods

We explain various methods for numerical simulations to investigate relations between the unsupervised RBM and the RG of the Ising model. Though most methods in this section are standard and well known, we explain them in detail to make the paper self-contained. In Sec. 2.3, we explain the central method of generating the RBM flows. Basic materials of the RBM are given in Sec. 2.2. The other two sections, Secs. 2.1 and 2.4, can be skipped unless one is interested in how we generate the initial spin configurations and measure the temperature of a set of configurations.

2.1 Monte-Carlo simulations of Ising model

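We first construct samples of configurations of the two-dimensional Ising model by using Monte-Carlo simulations. The spin variables $\sigma_{x,y} = \pm 1$ are defined on a two-dimensional lattice of size $L \times L$. The index (x, y) represents each lattice site and takes $x, y = 0, 1, \dots, L-1$.

The Ising model Hamiltonian is given by

$$H = -J \sum_{x,y=0}^{L-1} \sigma_{x,y} \left( \sigma_{x+1,y} + \sigma_{x,y+1} \right). \tag{10}$$

It describes a ferromagnetic model for J > 0 and an anti-ferromagnetic model for J < 0. Here we impose the periodic boundary conditions for the spin variables,

$$\sigma_{L,y} = \sigma_{0,y}, \qquad \sigma_{x,L} = \sigma_{x,0}. \tag{11}$$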

Spin configurations at temperature T are generated by the Metropolis Monte Carlo (MMC) method. In this method, we first generate a random configuration $\{\sigma_{x,y}\}$. We then choose one of the spins $\sigma_{x,y}$ and flip it with the probability

$$p_{x,y} = \min\left(1,\ e^{-dE_{x,y}/k_B T}\right), \tag{12}$$

where $dE_{x,y}$ is the change of energy of the system caused by the flip,

$$dE_{x,y} = 2J \sigma_{x,y} \left( \sigma_{x+1,y} + \sigma_{x-1,y} + \sigma_{x,y+1} + \sigma_{x,y-1} \right). \tag{13}$$

The flipping probability (12) satisfies the detailed balance condition $P_{s \to s'}\, \rho_s = P_{s' \to s}\, \rho_{s'}$, where $\rho_s \propto e^{-E_s/k_B T}$ is the canonical distribution of the spin configuration s at temperature T. Thus after many iterations of flipping all the spins, the configuration approaches the equilibrium distribution at T. Since all physical quantities are written in terms of the combination $J/k_B T$, we can set the Boltzmann constant $k_B$ and the interaction parameter J to be equal to 1 without loss of generality.
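The Metropolis procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`energy`, `metropolis_sweep`, `generate_configuration`), the choice of picking sites at random, and the default arguments are assumptions. Units with $k_B = J = 1$ are used, and one sweep is taken to mean $L^2$ attempted single-spin flips.

```python
import math
import random


def energy(spins, J=1.0):
    """Ising energy with periodic boundaries; each bond is counted once."""
    L = len(spins)
    return -J * sum(
        spins[x][y] * (spins[(x + 1) % L][y] + spins[x][(y + 1) % L])
        for x in range(L) for y in range(L)
    )


def metropolis_sweep(spins, T, rng, J=1.0):
    """Attempt L*L single-spin flips in place (units with k_B = 1)."""
    L = len(spins)
    for _ in range(L * L):
        x, y = rng.randrange(L), rng.randrange(L)
        neighbours = (spins[(x + 1) % L][y] + spins[(x - 1) % L][y]
                      + spins[x][(y + 1) % L] + spins[x][(y - 1) % L])
        dE = 2.0 * J * spins[x][y] * neighbours
        # Accept with probability min(1, exp(-dE / T)); at T = 0 only dE <= 0.
        if dE <= 0 or (T > 0 and rng.random() < math.exp(-dE / T)):
            spins[x][y] = -spins[x][y]


def generate_configuration(L=10, T=2.25, sweeps=100, seed=0):
    """Start from a random configuration and apply the given number of sweeps."""
    rng = random.Random(seed)
    spins = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    for _ in range(sweeps):
        metropolis_sweep(spins, T, rng)
    return spins


if __name__ == "__main__":
    for row in generate_configuration(L=10, T=2.25, sweeps=100, seed=1):
        print("".join("+" if s > 0 else "-" for s in row))
```

At T = 0 the acceptance rule only allows flips that do not raise the energy, which gives a simple check of the implementation.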

In the following analysis, we set the lattice size to $L \times L = 10 \times 10$ and repeat the procedure of MMC simulations $100 L^2 = 10000$ times to construct spin configurations. In our simulations, we generated spin configurations at various temperatures T = 0, 0.25, . . . , 6. Typical spin configurations are shown in Fig. 1.

Figure 1: Examples of spin configurations at temperatures T = 0, 2, 3, 6

2.2 Unsupervised learning of the RBM

Our main motivation in the present paper is to study whether the RBM is related to the RG in statistical physics. In this section, we review the basic algorithm of the RBM [1,10–14], which is trained by the configurations constructed by the MMC method of Sec. 2.1. As explained in the Introduction, the RBM consists of two layers, as shown in the left panel of Fig. 2. The initial configurations $\{\sigma^A_i\}$ of the Ising model generated at various temperatures are input into the visible layer $\{v_i\}$. The number of neurons in the visible layer is fixed at $N_v = L^2 = 100$ to represent the spin configurations of the Ising model. On the other hand, the hidden layer can take an arbitrary number of neurons, $N_h$. In the present paper, we consider 7 different sizes of the hidden layer, ranging from $N_h = 16$ up to 225 and 400. Thus the spin variables in the hidden layer are given by $\{h_a = \pm 1\}$ with $a = 1, \dots, N_h$.

Figure 2: (a) Two-layer neural network of the RBM with a visible layer $\{v_i\}$ and a hidden layer $\{h_a\}$. These two layers are coupled but there are no intra-layer couplings. (b) The RBM generates reconstructed configurations from the input configuration through the hidden configuration.

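The RBM is a generative model of probability distributions based on Eq. (3). We first explain how we can train the RBM by optimizing the weights $W_{ia}$ and the biases $b^{(v)}_i$, $b^{(h)}_a$. Our goal is to represent the given probability distribution $q_0(\{v_i\})$ in Eq. (6), as faithfully as possible, in terms of a model probability distribution defined by

$$p(\{v_i\}) = \sum_{\{h_a\}} p(\{v_i\},\{h_a\}) = \frac{1}{Z} \sum_{\{h_a\}} e^{-\Phi(\{v_i\},\{h_a\})}, \tag{14}$$

where $\Phi$ is the Hamiltonian of Eq. (2). The partition function Z is difficult to evaluate, but summations over only one set of spin variables (e.g. over $\{h_a\}$) are easy to perform because of the absence of intra-layer couplings. For the same reason, the conditional probabilities (4) and (5) can be rewritten as products of probability distributions of each spin variable:

$$p(\{h_a\}|\{v_i\}) = \prod_a p(h_a|\{v_i\}), \qquad p(h_a|\{v_i\}) = \frac{1}{1 + \exp\left[-2 h_a \left( \sum_i W_{ia} v_i + b^{(h)}_a \right)\right]}, \tag{15}$$

$$p(\{v_i\}|\{h_a\}) = \prod_i p(v_i|\{h_a\}), \qquad p(v_i|\{h_a\}) = \frac{1}{1 + \exp\left[-2 v_i \left( \sum_a W_{ia} h_a + b^{(v)}_i \right)\right]}. \tag{16}$$

Then the expectation values of spin variables in the hidden (or visible) layer in the background of spin configurations in the other layer are calculated as

$$\langle h_a \rangle_{\{v_i\}} = \tanh\Big( \sum_i W_{ia} v_i + b^{(h)}_a \Big), \tag{17}$$

$$\langle v_i \rangle_{\{h_a\}} = \tanh\Big( \sum_a W_{ia} h_a + b^{(v)}_i \Big). \tag{18}$$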

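Now the task is to train the RBM so as to minimize the distance between the two probability distributions $q(\{v_i\}) \equiv q_0(\{v_i\})$ and $p(\{v_i\})$ by appropriately choosing the weights and the biases. The distance is called the Kullback-Leibler (KL) divergence, or relative entropy, and is given by

$$KL(q||p) = \sum_{\{v_i\}} q(\{v_i\}) \log \frac{q(\{v_i\})}{p(\{v_i\})}. \tag{19}$$

If the two probabilities are equal, the KL divergence vanishes. Otherwise, derivatives of KL(q||p) with respect to the weights and the biases are given by

$$\frac{\partial KL}{\partial W_{ia}} = -\langle v_i h_a \rangle_{\rm data} + \langle v_i h_a \rangle_{\rm model}, \quad \frac{\partial KL}{\partial b^{(v)}_i} = -\langle v_i \rangle_{\rm data} + \langle v_i \rangle_{\rm model}, \quad \frac{\partial KL}{\partial b^{(h)}_a} = -\langle h_a \rangle_{\rm data} + \langle h_a \rangle_{\rm model}, \tag{20}$$

where averages are defined by

$$\langle A(\{v_i\}) \rangle_{\rm data} = \sum_{\{v_i\}} q(\{v_i\})\, A(\{v_i\}), \tag{21}$$

$$\langle A(\{v_i\},\{h_a\}) \rangle_{\rm model} = \sum_{\{v_i\},\{h_a\}} p(\{v_i\},\{h_a\})\, A(\{v_i\},\{h_a\}), \tag{22}$$

and, in the data averages, $h_a$ is replaced by $\langle h_a \rangle_{\{v_i\}}$ of Eq. (17). In training the RBM, we change the weights and biases so that the KL divergence is reduced. Using the method of back propagation [22], we renew the values of the weights and biases as

$$W_{ia} \to W_{ia} + \delta W_{ia}, \qquad b^{(v)}_i \to b^{(v)}_i + \delta b^{(v)}_i, \qquad b^{(h)}_a \to b^{(h)}_a + \delta b^{(h)}_a, \tag{23}$$

where

$$\delta W_{ia} = \epsilon \left( \langle v_i h_a \rangle_{\rm data} - \langle v_i h_a \rangle_{\rm model} \right), \quad \delta b^{(v)}_i = \epsilon \left( \langle v_i \rangle_{\rm data} - \langle v_i \rangle_{\rm model} \right), \quad \delta b^{(h)}_a = \epsilon \left( \langle h_a \rangle_{\rm data} - \langle h_a \rangle_{\rm model} \right). \tag{24}$$

Here $\epsilon$ denotes the learning rate, which we set to 0.1. The first terms are easy to calculate, but the second terms are difficult to evaluate since they require the knowledge of the full partition function Z.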

To avoid this difficulty, one can use the method of Gibbs sampling to approximately evaluate these expectation values $\langle \cdots \rangle_{\rm model}$. In practice we employ a more simplified method, which is called the method of contrastive divergence (CD) [23–25]. The idea is very simple, and reminiscent of the mean field approximation in statistical physics. Given the input data of the visible spin configurations $\{v^{(0)}_i = \sigma^A_i\}$, the expectation value of the hidden spin variable can be easily calculated as Eq. (17). We write the expectation value as

$$h^{(1)}_a = \tanh\Big( \sum_i W_{ia} v^{(0)}_i + b^{(h)}_a \Big). \tag{25}$$

Then in this background of the hidden spin configurations, the expectation value of $v_i$ can again be easily calculated by using Eq. (18). We write it as

$$v^{(1)}_i = \tanh\Big( \sum_a W_{ia} h^{(1)}_a + b^{(v)}_i \Big). \tag{26}$$

Then we obtain $h^{(2)}_a = \tanh\big( \sum_i W_{ia} v^{(1)}_i + b^{(h)}_a \big)$, and so on. We can iterate this procedure many times and replace the second terms in Eq. (20) by the expectation values generated by this method.

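In doing the numerical simulations in the present paper, we adopt the simplest version of CD, called CD$_1$, which gives us the following approximate formulas:

$$\langle v_i \rangle_{\rm data} = \frac{1}{N} \sum_A \sigma^A_i, \qquad \langle h_a \rangle_{\rm data} = \frac{1}{N} \sum_A h^{A(1)}_a, \qquad \langle v_i h_a \rangle_{\rm data} = \frac{1}{N} \sum_A \sigma^A_i h^{A(1)}_a, \tag{27}$$

and

$$\langle v_i \rangle_{\rm model} \simeq \frac{1}{N} \sum_A v^{A(1)}_i, \qquad \langle h_a \rangle_{\rm model} \simeq \frac{1}{N} \sum_A h^{A(2)}_a, \qquad \langle v_i h_a \rangle_{\rm model} \simeq \frac{1}{N} \sum_A v^{A(1)}_i h^{A(2)}_a. \tag{28}$$

Here $\sigma^A_i$ denotes each spin configuration generated by the method of Sec. 2.1, and $h^{A(1)}_a$, $v^{A(1)}_i$ and $h^{A(2)}_a$ are obtained from it by alternately applying Eqs. (17) and (18). As input data to train the RBM, we generated 1000 spin configurations for each of 25 different temperatures T = 0, 0.25, . . . , 6. Then the index A runs from 1 to N = 25000. In some cases, as we will see in Sec. 3.2, we use only a restricted set of configurations at high or low temperatures; then the index A runs from 1 to N = 1000 × (number of temperatures).

We repeat the renewal procedure (23) many times (5000 epochs), and obtain adjusted values of the weights and biases. In this way we train the RBM by using a set of configurations $\{\sigma^A_i\}$.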
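The training loop described in this section can be sketched as follows. This is a minimal mean-field CD$_1$ illustration under stated assumptions, not the authors' implementation: the function names (`hidden_mean`, `visible_mean`, `cd1_epoch`), the use of a full-batch average, and the in-place update are hypothetical choices; spins are assumed to take values ±1 so that conditional expectations are given by tanh.

```python
import math


def hidden_mean(v, W, bh):
    """Expectation of each hidden spin given the visible layer (tanh form)."""
    return [math.tanh(sum(W[i][a] * v[i] for i in range(len(v))) + bh[a])
            for a in range(len(bh))]


def visible_mean(h, W, bv):
    """Expectation of each visible spin given the hidden layer (tanh form)."""
    return [math.tanh(sum(W[i][a] * h[a] for a in range(len(h))) + bv[i])
            for i in range(len(bv))]


def cd1_epoch(data, W, bv, bh, lr=0.1):
    """One full-batch CD-1 update of W, bv, bh (modified in place)."""
    n, nv, nh = len(data), len(bv), len(bh)
    dW = [[0.0] * nh for _ in range(nv)]
    dbv, dbh = [0.0] * nv, [0.0] * nh
    for sigma in data:
        h1 = hidden_mean(sigma, W, bh)   # data-driven hidden expectation
        v1 = visible_mean(h1, W, bv)     # one-step reconstruction
        h2 = hidden_mean(v1, W, bh)      # hidden expectation of the reconstruction
        for i in range(nv):
            dbv[i] += (sigma[i] - v1[i]) / n
            for a in range(nh):
                # "data" term minus the CD-1 estimate of the "model" term
                dW[i][a] += (sigma[i] * h1[a] - v1[i] * h2[a]) / n
        for a in range(nh):
            dbh[a] += (h1[a] - h2[a]) / n
    for i in range(nv):
        bv[i] += lr * dbv[i]
        for a in range(nh):
            W[i][a] += lr * dW[i][a]
    for a in range(nh):
        bh[a] += lr * dbh[a]


if __name__ == "__main__":
    data = [[1, 1, 1, 1], [-1, -1, -1, -1], [1, 1, -1, -1]]  # toy configurations
    W = [[0.01 * (i - a) for a in range(2)] for i in range(4)]
    bv, bh = [0.0] * 4, [0.0] * 2
    for epoch in range(100):
        cd1_epoch(data, W, bv, bh, lr=0.1)
    print([round(x, 3) for x in visible_mean(hidden_mean(data[0], W, bh), W, bv)])
```

In practice one would call `cd1_epoch` once per epoch on the full training set; mini-batches or sampled (rather than mean-field) hidden states are common alternatives that this sketch does not cover.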

2.3 Generation of RBM flows

As discussed in the Introduction, once the RBM is trained and the weights and biases are fixed, the RBM generates a sequence of probability distributions (8). Then we translate the sequence into a flow of parameters (i.e., temperature). In generating the sequence, the initial set of configurations should be prepared separately, in addition to the configurations that are used to train the RBM.

We can also generate a flow of parameters in a slightly different way. For a specific configuration $\{v^{(0)}_i\}$, we can define a sequence of configurations following Eqs. (25) and (26) as

$$\{v^{(0)}_i\} \to \{h^{(1)}_a\} \to \{v^{(1)}_i\} \to \{h^{(2)}_a\} \to \cdots . \tag{29}$$

The right panel of Fig. 2 shows a generation of new configurations from $\{v^{(0)}_i\}$. Since each value of $v^{(n)}_i$ and $h^{(n)}_a$ (n > 0) is defined by an expectation value as in Eqs. (25) and (26), it does not take an integer value ±1 but a fractional value between −1 and 1. In order to get a flow of spin configurations, we need to replace these fractional values by ±1 with probability $(1 \pm \langle v_i \rangle)/2$. It turns out that the replacement is usually a good approximation, since the expectation values are likely to take values close to ±1 owing to the large magnitude of the trained weights. In this way, we obtain a flow of spin configurations

$$\{v^{(0)}_i\} \to \{v^{(1)}_i\} \to \{v^{(2)}_i\} \to \cdots \tag{30}$$

starting from the initial configuration $\{v^{(0)}_i\}$. The flow of configurations is transformed to a flow of temperature distributions by using the method explained in Sec. 2.4.
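The iteration that produces a flow of spin configurations can be sketched as below. It is a hedged illustration only: the names `sample_spins` and `rbm_flow` are hypothetical, and the sketch assumes that both the hidden and the visible expectation values are replaced by ±1 spins, each spin being set to +1 with probability (1 + m)/2, where m is its tanh expectation value.

```python
import math
import random


def hidden_mean(v, W, bh):
    """Expectation of each hidden spin given the visible layer."""
    return [math.tanh(sum(W[i][a] * v[i] for i in range(len(v))) + bh[a])
            for a in range(len(bh))]


def visible_mean(h, W, bv):
    """Expectation of each visible spin given the hidden layer."""
    return [math.tanh(sum(W[i][a] * h[a] for a in range(len(h))) + bv[i])
            for i in range(len(bv))]


def sample_spins(means, rng):
    """Replace each expectation m by +1 with probability (1 + m) / 2, else -1."""
    return [1 if rng.random() < 0.5 * (1.0 + m) else -1 for m in means]


def rbm_flow(v0, W, bv, bh, n_iter, rng):
    """Return [v0, v1, ..., v_n], the reconstructed visible configurations."""
    flow = [list(v0)]
    v = list(v0)
    for _ in range(n_iter):
        h = sample_spins(hidden_mean(v, W, bh), rng)
        v = sample_spins(visible_mean(h, W, bv), rng)
        flow.append(v)
    return flow


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[1.5, -0.5], [0.5, 1.0], [-1.0, 0.5]]   # hypothetical trained weights
    for v in rbm_flow([1, 1, -1], W, [0.0] * 3, [0.0] * 2, 5, rng):
        print(v)
```

Each configuration in the returned list can then be passed to a temperature-measuring network to turn the flow of configurations into a flow of temperature distributions.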

2.4 Temperature measurement by a supervised-learning NN

Next we design a neural network (NN) to measure temperature of spin configurations. The NN for the supervised learning has three layers with one hidden layer in the middle (See Fig. 3).

Figure 3: Three-layer neural network for supervised learning with an input layer, a hidden layer and an output layer.

The input layer consists of $L^2 = 100$ neurons into which we input spin configurations of the Ising model. The output layer has 25 neurons, which correspond to the 25 different temperatures that we want to measure. The number of neurons in the hidden layer is set to 64. We train this three-layer NN by a set of spin configurations, each of which has a label of temperature. Thus this is supervised learning. As input data to train the NN, we use the same N = 25000 configurations which were used to train the RBM.

The training of the NN is carried out as follows. Denote the input data as

$$X^A_i = \sigma^A_i ,$$

where $\sigma^A_i$ are the spin configurations as in Sec. 2.2. The input data is transformed to $Y^A_a$ in the hidden layer by the following nonlinear transformation:

$$Y^A_a = f\Big( \sum_i W^{(1)}_{ai} X^A_i + b^{(1)}_a \Big),$$

where $W^{(1)}_{ai}$ is a 64 × 100 weight matrix, $b^{(1)}_a$ is a bias, and f(x) is the activation function. $Y^A_a$ is transformed to $Z^A_\mu$ in the output layer, which corresponds to the label, namely temperature, of each configuration. The output is given by

$$Z^A_\mu = g\Big( \sum_a W^{(2)}_{\mu a} Y^A_a + b^{(2)}_\mu \Big),$$

where $W^{(2)}_{\mu a}$ and $b^{(2)}_\mu$ are another weight matrix and bias. The function g(x) is the softmax function

$$g(x_\mu) = \frac{e^{x_\mu}}{\sum_\nu e^{x_\nu}},$$

so that $Z^A_\mu$ can be regarded as a probability, since $\sum_\mu Z^A_\mu = 1$ is satisfied for each configuration A. Thus the NN transforms an input spin configuration to the probability of the configuration to take the $\mu$-th output value (i.e., temperature).
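The forward pass of the three-layer network can be sketched as follows. This is a minimal illustration, not the authors' code: the names `softmax` and `forward`, the weight layout, and in particular the use of tanh as the hidden activation are assumptions made here for concreteness.

```python
import math


def softmax(x):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]


def forward(x, W1, b1, W2, b2, f=math.tanh):
    """Input x -> hidden activations y -> output probabilities z.

    W1[a][i] couples input i to hidden unit a; W2[m][a] couples hidden unit a
    to output m. The activation f = tanh is an assumption of this sketch.
    """
    y = [f(sum(W1[a][i] * x[i] for i in range(len(x))) + b1[a])
         for a in range(len(b1))]
    z = softmax([sum(W2[m][a] * y[a] for a in range(len(y))) + b2[m]
                 for m in range(len(b2))])
    return y, z


if __name__ == "__main__":
    x = [1, -1, 1, 1]                                  # a toy 4-spin configuration
    W1 = [[0.2, -0.1, 0.3, 0.0], [0.1, 0.4, -0.2, 0.1]]
    W2 = [[0.3, -0.2], [0.1, 0.5], [-0.4, 0.2]]
    y, z = forward(x, W1, [0.0, 0.0], W2, [0.0, 0.0, 0.0])
    print([round(p, 4) for p in z], round(sum(z), 6))
```

For the network in the text, `x` would have 100 entries, the hidden layer 64 units and the output 25 units, one per temperature label.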

Each of the input configurations is generated by the MMC method at temperature T, where T takes one of the 25 discrete values $T_\mu$ ($\mu = 1, \dots, 25$). If the A-th configuration is labelled by $\mu_A$, we want the NN to give an output as close as possible to the following one-hot representation:

$$d^A = (0, \dots, 0, 1, 0, \dots, 0),$$

with the 1 in the $\mu_A$-th entry; that is, its $\mu$-th component is given by $d^A_\mu = \delta_{\mu \mu_A}$. It can be interpreted as a probability of the configuration A to take the $\mu$-th output. Then the task of the supervised training is to minimize the cross entropy, which is equivalent to the KL divergence of the desired probability $d^A_\mu$ and the output probability of the NN, denoted here by $Z^A_\mu$. The loss function is thus given by the cross entropy,

$$E = -\sum_A \sum_\mu d^A_\mu \log Z^A_\mu . \tag{36}$$

Then, using the method of back propagation, we renew the values of the weights and biases from the lower to the upper stream by gradient descent on the loss function, with the learning rate set to 0.1. The variations of the weights and biases connected to the output layer (the lower stream) are computed first, and these lower-stream variations are then used to change the upper-stream weights and biases, which connect the input layer to the hidden layer.

We repeat this renewal procedure many times (7500 epochs) for the training of the NN to obtain suitably adjusted values of the weights and biases.
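One renewal step of the supervised network can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the hidden activation is taken to be tanh, and a single configuration is used per step, whereas the actual training presumably sums over the whole data set. With a softmax output and a cross-entropy loss, the error signal at the output is simply the output probability minus the one-hot target, and it is propagated back to the first layer through the second weight matrix.

```python
import math


def softmax(x):
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    s = sum(e)
    return [ei / s for ei in e]


def forward(x, W1, b1, W2, b2):
    """tanh hidden layer (an assumption of this sketch) and softmax output."""
    y = [math.tanh(sum(W1[a][i] * x[i] for i in range(len(x))) + b1[a])
         for a in range(len(b1))]
    z = softmax([sum(W2[m][a] * y[a] for a in range(len(y))) + b2[m]
                 for m in range(len(b2))])
    return y, z


def loss(x, label, W1, b1, W2, b2):
    """Cross entropy for a one-hot target: -log z[label]."""
    return -math.log(forward(x, W1, b1, W2, b2)[1][label])


def gradients(x, label, W1, b1, W2, b2):
    """Gradients of the loss with respect to (W1, b1, W2, b2)."""
    y, z = forward(x, W1, b1, W2, b2)
    # Output layer first: softmax + cross entropy gives z - one_hot(label).
    delta2 = [z[m] - (1.0 if m == label else 0.0) for m in range(len(z))]
    gW2 = [[delta2[m] * y[a] for a in range(len(y))] for m in range(len(z))]
    # Then propagate back through W2 and the tanh derivative 1 - y**2.
    delta1 = [(1.0 - y[a] ** 2) * sum(W2[m][a] * delta2[m] for m in range(len(z)))
              for a in range(len(y))]
    gW1 = [[delta1[a] * x[i] for i in range(len(x))] for a in range(len(y))]
    return gW1, delta1, gW2, delta2


def sgd_step(x, label, W1, b1, W2, b2, lr=0.1):
    """Gradient-descent update of all parameters in place."""
    gW1, gb1, gW2, gb2 = gradients(x, label, W1, b1, W2, b2)
    for a in range(len(b1)):
        b1[a] -= lr * gb1[a]
        for i in range(len(x)):
            W1[a][i] -= lr * gW1[a][i]
    for m in range(len(b2)):
        b2[m] -= lr * gb2[m]
        for a in range(len(b1)):
            W2[m][a] -= lr * gW2[m][a]
```

A finite-difference comparison of `gradients` against `loss` is a convenient way to check such an implementation before training.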

Finally we note how we measure the temperature of a configuration. If the size of a configuration generated at temperature T is large enough, the trained NN will reproduce the temperature of the configuration quite faithfully. However, our configurations are small, with only L = 10. Thus we instead need an ensemble of many spin configurations and measure a temperature distribution of the configurations. The supervised learning gives us this probability distribution of temperature.
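One simple way to turn the per-configuration outputs into a temperature distribution for an ensemble is to average the output probabilities over the configurations. The sketch below assumes this averaging; the paper does not spell out the exact procedure in this section, and the names `temperature_distribution` and `peak_temperature` are hypothetical.

```python
def temperature_distribution(outputs):
    """Average a list of probability vectors (one per configuration)."""
    n, k = len(outputs), len(outputs[0])
    return [sum(out[m] for out in outputs) / n for m in range(k)]


def peak_temperature(distribution, temperatures):
    """Temperature label at which the averaged distribution is largest."""
    best = max(range(len(distribution)), key=lambda m: distribution[m])
    return temperatures[best]


if __name__ == "__main__":
    temperatures = [0.25 * m for m in range(25)]       # T = 0, 0.25, ..., 6
    # Two made-up network outputs, each concentrated near T = 2.25.
    out_a = [0.0] * 25
    out_a[9], out_a[10] = 0.7, 0.3
    out_b = [0.0] * 25
    out_b[8], out_b[9] = 0.4, 0.6
    dist = temperature_distribution([out_a, out_b])
    print(peak_temperature(dist, temperatures))
```

The shape of the averaged distribution, and not only its peak, is what distinguishes ensembles generated at different temperatures.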

3 Numerical results

In this section we present our numerical results for the flows generated by unsupervised RBM, and discuss a relation with the renormalization group flow of Ising model. Our main results of the RBM flows are written in Sec. 3.2.

3.1 Supervised learning for temperature measurement

Before discussing the unsupervised RBM, let us first see how we trained the NN to measure temperature.

In Fig. 4, we plot the behavior of the loss function (36) as we iterate renewals of the weights and biases (37). The blue (lower) line shows the training error, namely the values of the loss function (36) after iterations of training using 25000 configurations. It is continuously decreasing, even after 7500 epochs. On the other hand, the red (upper) line shows the test error, namely the values of the loss function for an additional 25000 configurations which are not used for the training. This is also decreasing at first, but after 6000 epochs it becomes almost constant. In fact, after 7500 epochs it starts to increase. This means that the machine becomes over-trained, and therefore we stopped the learning at 7500 epochs.

In Fig. 5 we show the probability distributions of temperature that this NN measures. Here we use configurations at T = 0, 2, 3, 6 which are not used for the training. Though they are not sharply peaked at the temperatures where the configurations are generated, each of them has a characteristic shape that differs from temperature to temperature. Thus it is possible to distinguish the temperature of the input configurations by looking at the shape of the probability distribution, even if these configurations are not used for the training of the NN. In the following, by using this NN, we measure the temperature of configurations that are generated by the RBM flow.

Figure 4: Training error and test error (up to 7500 epochs)

Figure 5: Probability distributions of measured temperatures for various sets of configurations generated at T = 0, 2, 3, 6 respectively. Temperature of the configurations can be distinguished by looking at the shapes of the distributions.

3.2 Unsupervised RBM flows

Now we present the main results of the present paper, namely the flows generated by the unsupervised RBM. We sometimes call such a flow the RBM flow. As discussed in the Introduction, if the RBM is similar to the conventional RG in that it possesses a function of coarse-graining, the RBM flow must go away from the critical point $T_c = 2.27$. In order to check this, we construct three different types of unsupervised RBMs, which we call type V, type L, and type H respectively, using the method of Sec. 2.2. Each of them is trained by a different set of spin configurations generated at a different set of temperatures. We then generate flows of temperature distributions by using these trained RBMs, following the methods of Secs. 2.3 and 2.4.

Type V RBM: Trained by configurations at T = 0, 0.25, . . . , 6

First we construct type V RBM, which is trained by configurations at temperatures ranging widely from low to high, T = 0, 0.25, . . . , 6. The temperature range includes the critical temperature $T_c = 2.27$. After training is completed, this unsupervised RBM will have learned features of spin configurations at these temperatures.

Once the training is finished, we then generate a sequence of reconstructed configurations as in Eq. (30) using the methods in Sec. 2.3. For this, we prepare two different sets of initial configurations. One is a set of configurations at T = 0, and another at T = 6. These initial configurations are not used for the training of the RBM. Then by using the supervised NN in Sec. 3.1, we measure temperature and translate the flow of configurations to a flow of temperature distributions.

In Figs. 6 and 7, we plot temperature distributions of configurations that are generated by iterating the RBM reconstruction in Sec. 2.3. The “itr” in the legends means the number of iterations n by the unsupervised RBM. Fig. 6 shows a flow of temperature distributions starting from spin configurations generated at T = 0. Fig. 7 starts from T = 6. In all the figures, the black lines are the measured temperature distributions of the initial configurations. Colored lines show temperature distributions of the reconstructed configurations after various numbers of iterations. The left panels show the temperature distributions at small iterations (up to 10 in Fig. 6 and 50 in Fig. 7), while the right panels are at larger iterations up to 1000. These results indicate that the critical temperature $T_c$ is a stable fixed point of the flows in type V RBM. This is apparently different from the naive expectation that the RBM flow should show the same behavior as the RG flow. Indeed it is in the opposite direction. Whichever temperature, T = 0 or T = 6, we start the RBM iteration from, the peak of the temperature distributions approaches the critical point (T = 2.27).

Figure 6: Temperature distributions after various numbers of iterations of type V RBM, which is trained by the configurations at T = 0, 0.25, . . . , 6. The original configurations are generated at T = 0. After only several iterations, the temperature distribution is peaked around $T_c$ and stabilizes there: $T_c$ is a stable fixed point of the flow.

Figure 7: Temperature distributions after various numbers of iterations of the same RBM as Fig. 6. The original configurations are generated at T = 6. After 50 iterations, the distribution stabilizes at $T_c$.

In order to confirm the above behavior, we provide another set of configurations at T = 2.25 as initial configurations, and generate the flow of temperature by the same trained RBM. The flow of temperature distributions is shown in Fig. 8. We can see that the temperature distribution of the reconstructed configurations remains near the critical point, and never flows away from there. If the process of the unsupervised RBM corresponded to coarse-graining of spin configurations, the temperature distributions of the reconstructed configurations would have to flow away from $T_c$. Though the direction of the flow is opposite to the RG flow, both flows have the same property in that the critical point plays an important role in controlling the flows.

Figure 8: Temperature distributions after various numbers of iterations of the same RBM as Figs. 6 and 7. The original configurations are generated at T = 2.25. The distribution is stable at around $T_c$.

So far, in obtaining the above results of Figs. 6, 7 and 8, we used an unsupervised RBM with 64 neurons in the hidden layer. We also trained other RBMs with different sizes of the hidden layer, but by the same set of spin configurations. When the size of the hidden layer is smaller than (or equal to) that of the visible layer, $N_h \le N_v = 100$ (down to $N_h = 16$), we find that the temperature distribution approaches the critical point. A difference is that for smaller $N_h$, the flow approaches $T_c$ faster (i.e., the flow arrives at $T_c$ after a smaller number of iterations).

In contrast, when the RBM has more than 100 neurons in the hidden layer, we obtain different results. Fig. 9 shows the case of $N_h = 225$ neurons. Until about ten iterations, the measured temperature distribution behaves similarly to the case of $N_h \le 100$, i.e., it approaches the critical temperature. However, afterward it passes the critical point and flows away to higher temperature. In the case of 400 neurons, it moves towards high temperature at a faster speed. This behavior suggests that, if the hidden layer is larger than necessary, the NN tends to learn a lot of noisy fluctuations. Since configurations at higher temperatures are noise-like, the flow should go away to high temperature. We come back to this conjecture in later sections.

Figure 9: Temperature distribution after various numbers of iterations of type V RBM with 225 neurons in the hidden layer; i.e., $N_h = 225$. The original configurations are generated at T = 0. The distribution has a peak at $T_c$ after 10 iterations, but then moves towards higher temperature.

Figure 10: Flow of temperature distributions starting from T = 0 in type H RBM. Type H RBM is trained by configurations at only T = 4, 4.25, . . . , 6. The NN has $N_h = 64$ neurons (left) and a larger number of neurons (right) in the hidden layer. The speed of the flow is slower for the larger hidden layer.

Type H/L RBM: Trained by configs at Higher/Lower temperatures

Next we construct another type of RBM, which is trained by configurations at higher temperatures T = 4, 4.25, . . . , 6. We call it type H RBM. The results of the flows of temperature distributions in type H RBM are drawn in Fig. 10. In this case, the measured temperature passes the critical point and goes away towards higher temperature. The behavior is understandable since the RBM must have learned only the features at higher temperatures. We also find that, if the number of neurons in the hidden layer is increased, the flow moves more slowly.

Figure 11: Flow of temperature distributions starting from T = 6 in type L RBM. Type L RBM is trained by configurations at only T = 0. The NN has $N_h = 64$ (left) and $N_h = 225$ (right) neurons in the hidden layer.

Finally, we construct type L RBM, which is trained by configurations only at the lowest temperature T = 0. Fig. 11 shows the numerical results of flows in the type L RBM. Similarly to the type H RBM, the measured temperature passes the critical point, but flows towards lower temperature instead of higher temperature. This is, of course, as expected because the type L RBM must have learned the features of spin configurations at T = 0. In the type L RBM, as far as we have studied, the flow never goes back to higher temperature even for large $N_h$. This is presumably because the T = 0 configurations used for training do not contain any of the noisy fluctuations specific to high temperatures. This also suggests that the RBM does not learn features that are not contained in the configurations used for training.

Summaries and Conjectures

Here we first summarize the numerical results:

For the type V RBM,

• When $N_h \le N_v$, the measured temperature T approaches $T_c$ (Figs. 6, 7 and 8).

• However, for $N_h > N_v$, the flow eventually goes away towards higher temperature.

• Speed of flow is slower for a larger $N_h$.

For the type H/L RBM,

• The temperature T flows towards high temperature / T = 0, respectively (Figs. 10 and 11).

• Speed of flow is slower for a larger $N_h$.

Here $N_h$ and $N_v$ are the numbers of hidden and visible neurons in the RBM. These behaviors are reflections of the properties of the weights and biases that the unsupervised RBMs have learned in the process of training.

Understanding the above behaviors is equivalent to answering what the unsupervised RBMs have learned in the process of training. The most important question is why the temperature approaches $T_c$ in the type V RBM with $N_h \le N_v$, instead of, e.g., broadening over the whole range of temperature from T = 0 to T = 6. Note that we taught the NN neither the critical temperature nor the presence of a phase transition. We just trained the NN by configurations at various temperatures, from T = 0 to T = 6. Nevertheless the numerical simulations show that the temperature distributions are peaked at $T_c$ after some iterations of the RBM reconstruction. Thus we are forced to conclude that the RBM has automatically learned features specific to the critical temperature $T_c$.

An important feature at $T_c$ is scale invariance. We have generated spin configurations at various temperatures by the Monte Carlo method, and each configuration has typical fluctuations specific to its temperature. At very high temperature, fluctuations are almost random at each lattice site and there are no correlations between spins at distant positions. At lower temperature, they become correlated: the correlation length becomes larger as $T \to T_c$ and diverges at $T_c$. On the other hand, below $T_c$, spins are clustered and within each domain all spins take the same value, $+1$ or $-1$. At low temperature configurations have only big clusters, and as the temperature increases small clusters appear. At $T_c$, spin configurations come to have clusters of various sizes in a scale-invariant way.
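The way such configurations can be produced is illustrated by the following minimal sketch of single-spin-flip Metropolis sampling. It is not the authors' code: the periodic boundary conditions, the units $J = k_B = 1$, the number of sweeps and the function names `metropolis_sweep` and `sample_configuration` are all assumptions made for illustration.

```python
import numpy as np

def metropolis_sweep(spins, T, rng):
    """One Metropolis sweep (L*L attempted flips) on a periodic L x L Ising lattice."""
    L = spins.shape[0]
    for _ in range(L * L):
        x, y = rng.integers(0, L, size=2)
        # Sum of the four nearest neighbours (periodic boundaries are an assumption).
        nn = (spins[(x + 1) % L, y] + spins[(x - 1) % L, y]
              + spins[x, (y + 1) % L] + spins[x, (y - 1) % L])
        dE = 2.0 * spins[x, y] * nn  # energy change if spin (x, y) is flipped, J = 1
        # Accept downhill moves always, uphill moves with probability exp(-dE / T).
        if dE <= 0 or (T > 0 and rng.random() < np.exp(-dE / T)):
            spins[x, y] *= -1
    return spins

def sample_configuration(L, T, n_sweeps, rng):
    """Start from a random configuration and apply n_sweeps Metropolis sweeps."""
    spins = rng.choice(np.array([-1, 1]), size=(L, L))
    for _ in range(n_sweeps):
        metropolis_sweep(spins, T, rng)
    return spins

rng = np.random.default_rng(0)
for T in (1.0, 2.25, 6.0):
    s = sample_configuration(L=10, T=T, n_sweeps=200, rng=rng)
    print(T, abs(s.mean()))  # |magnetization| per spin of one sample
```

A single sample is noisy on a 10 × 10 lattice, so any comparison between temperatures should average over many configurations, as is done with the 1000 configurations per temperature used in the text.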

Now let us come back to the question of why the type V RBM generates a flow approaching $T_c$ and does not randomize to broaden the temperature distribution over the whole range. We have trained the type V RBM on configurations at various temperatures with clusters of different sizes, and in the process the machine must have simultaneously acquired features at various temperatures. Consequently, the process of RBM reconstruction adds the various features that the machine has learned to a reconstructed configuration. If only a single feature at a specific temperature were added to the reconstructed configuration, the distribution would develop a peak at that temperature. But this cannot happen, because various features of different temperatures are added to a single configuration by iterations of the reconstruction process. Then one may ask whether there is a configuration that is stable under additions of features at various different T.

Our first conjecture about this question is that the set of configurations at $T_c$ is a stabilizer (and, even more, an attractor) of the type V RBM with $N_h < N_v$. This must be due to the scale-invariant properties of the configurations at $T_c$. Namely, since these configurations are scale invariant, they have all the features of various temperatures simultaneously, and consequently they can be the stabilizer of this RBM. This sounds plausible since scale invariance means that the configurations have various different characteristic length scales. However, we note that this does not mean that the RBM has forgotten the features of configurations away from the critical point. Rather, it means that the RBM has learned features of all temperatures simultaneously. Nor does it mean that the configurations at $T_c$ have exerted an especially strong influence on the machine in the process of training. This can be confirmed as follows. Suppose we train an RBM on configurations at temperatures excluding the vicinity of $T_c$, namely at all temperatures except T = 2.25 and 2.5. We found in the numerical simulations that this RBM still generates a flow towards the critical point, though we did not provide configurations near $T_c$. Therefore we can say that the type V RBM has learned the features at all the temperatures, and that configurations at $T_c$ are special because they contain all the features of various temperatures.

Our second conjecture, which is related to the behavior of the type V RBM with $N_h > N_v$, is that RBMs with an unnecessarily large hidden layer tend to learn many irrelevant features. In the present case, these are noisy fluctuations of configurations at high temperatures. High temperature configurations have only short distance correlations, whose behavior is similar to the typical behavior of noise. The conjecture is partially supported by the similarity of the RBM flows between the type V RBM with $N_h > N_v$ and the type H RBM. Namely, both RBM flows converge on $T = \infty$. The similarity indicates that the NN with a larger $N_h$ may have learned too many noise-like features of configurations at higher temperatures. The above considerations suggest that a moderate size of the hidden layer, $N_h < N_v$, is the most efficient for properly extracting the features.

4 Analysis of the weight matrix

In the previous section, we showed our numerical results for the flows generated by unsupervised RBMs, and proposed two conjectures. One is that the scale invariant configurations are stabilizers of the type V RBM flow. The other is that an RBM with an unnecessarily large hidden layer tends to learn too much irrelevant noise. In this section, to further understand the theoretical basis of feature extraction and to give supporting evidence for our conjectures, we analyze various properties of the weight matrices and biases of the trained RBMs. In particular, we study properties of $WW^T$ by looking at spin correlations in Sec. 4.2, magnetization in Sec. 4.3, and the eigenvalue spectrum in Sec. 4.4.

4.1 Why $WW^T$ is important

All the information that the machine has learned is contained in the weights and the biases. Since the biases typically have smaller values than the weights (at least in the present situations), we will concentrate on the weight matrix $W$ (an $N_v \times N_h$ matrix) in the following.

Let us first note that the weight matrix transforms as

$$W \to U\, W\, V^T$$

under transformations exchanging the basis of neurons in the visible layer ($U$) and the hidden layer ($V$). Since the choice of basis in the hidden layer is arbitrary, relevant information in the visible layer is stored in a combination of $W$ that is invariant under transformations of $V$. The simplest combination is the product

$$WW^T. \qquad (42)$$

It is an $N_v \times N_v = 100 \times 100$ matrix, independent of the size of $N_h$. But its properties depend on $N_h$, because the rank of $WW^T$ can never exceed $\min(N_v, N_h)$: if $N_h < N_v$, the product is strongly constrained; e.g. a unit matrix is not allowed.
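The two properties just stated, invariance of the product under a relabelling of hidden neurons and the rank bound, can be checked numerically. The sketch below uses a random matrix as a hypothetical stand-in for trained weights; the sizes $N_v = 100$, $N_h = 16$ follow the text, everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
N_v, N_h = 100, 16
W = rng.normal(size=(N_v, N_h))  # random stand-in, not a trained weight matrix

# A permutation matrix V relabels the hidden neurons; it is orthogonal, V^T V = 1.
V = np.eye(N_h)[rng.permutation(N_h)]
W_relabelled = W @ V.T

G = W @ W.T
G_relabelled = W_relabelled @ W_relabelled.T

print(np.allclose(G, G_relabelled))          # the product does not see the relabelling
print(G.shape, np.linalg.matrix_rank(G))     # rank is bounded by min(N_v, N_h)
```

Because the rank is at most $N_h = 16$ here, the 100 × 100 product has at least 84 vanishing eigenvalues and therefore cannot equal the unit matrix.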

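This simplest product (42) plays an important role in the dynamics of the flow generated by the RBM. This can be shown as follows. If the biases are ignored, the conditional probability (15) and the expectation value (17) for $h_a$ in the background of $v_i$ are given by

$$p(h_a \mid v) = \frac{e^{h_a \sum_i v_i W_{ia}}}{2\cosh\left(\sum_i v_i W_{ia}\right)}, \qquad \langle h_a \rangle = \tanh\Big(\sum_i v_i W_{ia}\Big).$$

In $p(h_a \mid v)$, the combination $\sum_i v_i W_{ia}$ can be regarded as an external magnetic field for $h_a$. Thus these two variables, $h_a$ and $\sum_i v_i W_{ia}$, tend to correlate with each other. Namely, the probability $p(h_a \mid v)$ becomes larger when they have the same sign. Moreover, for small $\sum_i v_i W_{ia}$, $\langle h_a \rangle$ is approximated by $\sum_i v_i W_{ia}$, and we can roughly identify these two variables,

$$h_a \approx \sum_i v_i W_{ia}. \qquad (44)$$

This is usually not a good approximation since weights can have larger values, but let us assume it for the moment. For a large value of $|\sum_i v_i W_{ia}|$, $\langle h_a \rangle$ is saturated at $\pm 1$.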

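Suppose that the input configuration is given by $v_i = \sigma^A_i$. If Eq. (44) is employed, we have $h_a = \sum_i \sigma^A_i W_{ia}$. Then the conditional probability (16) in the background of $h_a$,

$$p(v_i \mid h) = \frac{e^{v_i \sum_a W_{ia} h_a}}{2\cosh\left(\sum_a W_{ia} h_a\right)},$$

can be approximated as

$$p(v_i \mid h) \approx \frac{e^{v_i \sum_j (WW^T)_{ij} \sigma^A_j}}{2\cosh\left(\sum_j (WW^T)_{ij} \sigma^A_j\right)}.$$

The RBM learns the input data so that the probability distribution $p(v)$ reproduces the probability distribution of the initial data $\{\sigma^A_i\}$. Therefore, training of the RBM will be performed so as to enhance the value $\sum_{i,j} \sigma^A_i (WW^T)_{ij} \sigma^A_j$. This means that $W$ is chosen so that $(WW^T)_{ij}$ will reflect the spin correlations of the input configurations, $\langle \sigma^A_i \sigma^A_j \rangle$.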

In this simplified discussion, learning of the RBM is performed through the combination $WW^T$. Of course, we neglected the nonlinear property of the neural network and the above statement cannot be justified as it is. Nevertheless, we will find below that the analysis of $WW^T$ is quite useful to understand how the RBM works.
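The simplified argument can be illustrated numerically. The sketch below assumes $\pm 1$ units with biases ignored and deliberately small random weights (a hypothetical stand-in, not trained values), and compares one mean-field reconstruction step with the linear field built from $WW^T$.

```python
import numpy as np

rng = np.random.default_rng(1)
N_v, N_h = 100, 64
W = 0.01 * rng.normal(size=(N_v, N_h))           # small weights: the linear regime
v = rng.choice(np.array([-1.0, 1.0]), size=N_v)  # a random +-1 input configuration

h_mean = np.tanh(W.T @ v)    # expectation of the hidden units given v
v_mean = np.tanh(W @ h_mean) # expectation of the visible units given h = <h>
v_lin = W @ W.T @ v          # linearised field acting on the visible units

print(np.max(np.abs(v_mean - v_lin)))  # small compared with the entries of v_lin
```

For larger weights the two quantities separate, since $\tanh$ saturates at $\pm 1$; this is the nonlinearity that the simplified discussion neglects.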

4.2 Spin correlations in $WW^T$

In Fig. 12, we plot the values of the matrix elements of the 100 × 100 matrix $WW^T$. The three panels correspond to RBMs with different sizes of $N_h$. We can see that they have large values in the diagonal and near-diagonal elements. Note that the spin variables in the visible layer, $\sigma_{x,y}$ with x, y = 1, . . . , L = 10, are lined up as $(\sigma_{1,1}, \sigma_{1,2}, \ldots, \sigma_{L,L})$ and named $(\sigma_1, \ldots, \sigma_{N_v})$. Hence lattice points $i$ and $j$ are adjacent to each other when $j = i \pm 1$ or $j = i \pm L$. In the following, we mostly discuss the type V RBM unless otherwise stated.

Figure 12: Elements of $WW^T$ when the hidden layer has 16 (left), 100 (center), 400 (right) neurons.

As discussed above, the product of weight matrices $WW^T$ must reflect correlations between spin variables of the input configurations used for the training of the RBM. The strongest correlation in $\langle \sigma_i \sigma_j \rangle$ is of course the diagonal component, i = j. Thus we expect that the matrix $WW^T$ will have large diagonal components. Indeed, such behavior can be seen in Fig. 12. In particular, for $N_h$ = 400 (the rightmost panel), $WW^T$ is clearly close to a diagonal matrix. This is almost true for the case of $N_h$ = 100 (the middle panel). However, for $N_h$ = 16 (the leftmost panel), it is different from a unit matrix and the off-diagonal components $(WW^T)_{ij}$ also have large values, in particular at j = i + 1 and j = i + 2. This behavior must be a reflection of the spin correlations of the input configurations. It is also a reflection of the fact that, if $N_h < N_v$, the rank of $WW^T$ is smaller than $N_v$ and $WW^T$ cannot be a unit matrix. Thus, even though less information can be stored in the weight matrix for a smaller number of hidden neurons, the relevant information of the spin correlations is well encoded in the weight matrix of the RBM with $N_h < N_v$ compared with the RBM with larger $N_h$. Then we wonder why such relevant information is lost in the RBM with larger $N_h$. This question might be related to our second conjecture, proposed at the end of Sec. 3.2, that the RBM with very large $N_h$ will learn too much irrelevant information, namely noise in the input configurations. It is interesting and a bit surprising that the RBM with fewer hidden neurons seems to learn the relevant information of the spin correlations more efficiently.

In order to further confirm the relation between the correlations in the combination $WW^T$ of the weight matrix and the spin correlations of the input configurations, we study the structure of the weight matrices of other types of RBMs. In Fig. 13, we plot the behavior of the off-diagonal components of $WW^T$ for various RBMs. Each RBM is trained on configurations at a single temperature, T = 0 (type L), T = 2, T = 3 and T = 6, respectively. The size of the hidden layer is set to $N_h$ = 16. For comparison, we also plot the behavior of the off-diagonal components for the type V RBM.

Fig. 13 shows that the correlation in $WW^T$ decays more rapidly at higher temperature, which is consistent with the expected behavior of spin correlations. Therefore, the RBM seems to learn correctly about the correlation length, or the size of clusters, which becomes smaller at higher temperature. Furthermore, we find that, for the type V RBM that has learned all temperatures T = 0, . . . , 6, the off-diagonal elements decrease with a decay rate between those of the T = 2 case and the T = 3 case. This indicates that the type V RBM has acquired features similar to those of the configurations around $T_c \simeq 2.27$. This is consistent with the numerical results of Figs. 6, 7 and 8, and gives further circumstantial evidence supporting the first conjecture in Sec. 3.2.
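A quantity of the kind plotted in Fig. 13 can be sketched as follows. The exact averaging used for the figure is not specified in the text, so the choice below, averaging $(WW^T)_{i,i+d}$ over $i$ for each index offset $d$ and dividing by the mean diagonal element, is an assumption, as is the function name `offdiagonal_profile`. The toy weight matrix is constructed so that the expected answer is known.

```python
import numpy as np

def offdiagonal_profile(W, max_offset):
    """Mean of (W W^T)_{i, i+d} over i for d = 0..max_offset, divided by the mean diagonal."""
    G = W @ W.T
    diag_mean = np.mean(np.diag(G))
    return np.array([np.mean(np.diag(G, k=d)) / diag_mean
                     for d in range(max_offset + 1)])

# Toy check: take W as the Cholesky factor of a matrix with entries exp(-|i-j|/xi),
# so that W W^T reproduces that matrix and the profile must decay as exp(-d/xi).
N_v, xi = 100, 3.0
idx = np.arange(N_v)
target = np.exp(-np.abs(idx[:, None] - idx[None, :]) / xi)
W_toy = np.linalg.cholesky(target)

profile = offdiagonal_profile(W_toy, max_offset=5)
print(profile)
```

Applied to trained weights, a faster decay of this profile would correspond to a shorter correlation length, which is the comparison between temperatures made above.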

4.3 Magnetization and singular value decomposition (SVD)

Figure 13: Averaged values of the off-diagonal components of $WW^T$ (normalized by the diagonal components). Each colored line corresponds to the RBM that has learned configurations at a single temperature T = 0, 2, 3, 6 respectively. The black line (the middle line) is the behavior of the type V RBM that has learned all the temperatures T = 0, . . . , 6.

Information about the weight matrix W can be inferred by using the method of singular value decomposition (see, e.g., [26,27]). Suppose that the matrix $WW^T$ has eigenvalues $\lambda_k$ ($k = 1, \ldots, N_v$) with corresponding eigenvectors $u_k$,

$$WW^T u_k = \lambda_k u_k.$$

Decomposing an input configuration vector $v$ in terms of the eigenvectors as $v = \sum_k c_k u_k$ with a normalization condition $\sum_k c_k^2$ = 1, we can rewrite

$$v^T WW^T v = \sum_k c_k^2 \lambda_k.$$

Thus if a vector $v$ contains more components with larger eigenvalues of $WW^T$, the quantity $v^T WW^T v$ becomes larger.
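The decomposition can be verified with a few lines of NumPy. This is only a sketch: the random weights are a hypothetical stand-in for a trained $W$, and the symbols `lam`, `U`, `c` are names chosen here for the eigenvalues, eigenvectors and expansion coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
N_v, N_h = 100, 64
W = rng.normal(scale=0.1, size=(N_v, N_h))  # random stand-in for trained weights
G = W @ W.T

lam, U = np.linalg.eigh(G)  # eigenvalues (ascending) and orthonormal eigenvectors (columns)

v = rng.choice(np.array([-1.0, 1.0]), size=N_v)
v = v / np.linalg.norm(v)   # normalise so that the coefficients satisfy sum_k c_k^2 = 1
c = U.T @ v                 # coefficients of v in the eigenvector basis

lhs = v @ G @ v             # quadratic form evaluated directly
rhs = np.sum(lam * c**2)    # the same quantity as a weighted sum of eigenvalues
print(np.isclose(lhs, rhs), np.isclose(np.sum(c**2), 1.0))
```

Since the weights $c_k^2$ are non-negative and sum to one, the quadratic form is a weighted average of the eigenvalues and is bounded by the smallest and largest of them.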

Fig. 14 shows averaged values of $v^T WW^T v$ over the 1000 configurations at each temperature. For comparison between different RBMs, we subtracted the values at T = 6. The figure shows a big change near the critical point, which is reminiscent of the magnetization of the Ising model. Since $WW^T$ should contain more information than the magnetization itself, the behavior cannot be exactly the same. But it is quite intriguing that Fig. 14 shows behavior similar to the magnetization. It might be because the quantity mostly contains information about the lower temperatures after subtraction of the values at higher temperature.

Figure 14: Averaged values of $v^T WW^T v$ over the 1000 input configurations at each temperature. Different colors correspond to type V RBMs with different numbers of hidden neurons $N_h$. In this figure, the values at T = 6 are subtracted for comparison between different RBMs.

In order to see the properties of $WW^T$ beyond the magnetization in Fig. 14, we plot the same quantities but without subtracting the values at T = 6. Fig. 15 shows two cases, $N_h$ = 64 and $N_h$ = 225. These figures show that, at high temperature, the RBM with large $N_h$ in the right panel has larger components of the principal eigenvectors than the RBM with small $N_h$ in the left panel. This difference must have caused the different behaviors in the RBM flows shown in Fig. 6 ($N_h$ = 64) and Fig. 9 ($N_h$ = 225). Namely, the former RBM flow approaches the critical temperature $T_c$, while the latter eventually goes towards higher temperature. The difference between the two panels of Fig. 15 indicates that the RBM with larger $N_h$ seems to have learned more characteristic features at high temperatures than the RBM with smaller $N_h$. Then, does the RBM with small $N_h$ fail to learn the features of high temperatures? Which RBM is more adequate for feature extraction? Although it is difficult to answer which is more adequate without specifying what we want the machine to learn, we believe that the RBM with $N_h$ = 64 properly learns all the features of various temperatures, while the RBM with $N_h$ = 225 has learned too many irrelevant features of high temperature. This is nothing but the second conjecture in Sec. 3.2, and is supported by the behavior of the correlations in $WW^T$ discussed in Sec. 4.2.

Figure 15: Averaged values of $v^T WW^T v$ over the 1000 configurations at each temperature. The left and right panels show the quantities in the type V RBM with $N_h$ = 64 and $N_h$ = 225, respectively.

4.4 Eigenvalue spectrum and information stored in W

Finally we study the eigenvalue spectrum of the matrix $WW^T$. Figs. 16 and 17 show the eigenvalues in descending order. In Fig. 16, the red dots (a smooth line) show the eigenvalues of the type V RBM trained on configurations at all the temperatures (T = 0, 0.25, . . . , 6), while the blue dots (a steplike line) are the eigenvalues of the type L RBM (only T = 0).

Figure 16: Eigenvalues of $WW^T$ for type V RBM (red, a smooth line) and type L RBM (blue, a steplike line). Both RBMs have 64 neurons in the hidden layer.

These are obviously different. For the type L RBM, only several eigenvalues are especially large, and the rest are clearly smaller. On the other hand, for the type V RBM, the eigenvalues decrease gradually and there are no jumps or big distinctions between larger and smaller eigenvalues. This behavior indicates that, in the type L RBM, the weight matrix holds only a small amount of relevant information and only a small number of neurons is sufficient in the hidden layer. In the type V RBM, however, since it is trained on configurations at various different temperatures, all the eigenvectors are equally utilized to represent relevant features of spin configurations at various temperatures. Namely, in order to learn features over a wide range of temperatures, a larger number of neurons in the hidden layer is necessary. With such larger degrees of freedom, the weight matrix has learned configurations with various characteristic scales at various temperatures, so that the RBM can grasp the rich properties of these configurations.

The difference in the eigenvalues between type V and type L can also be phrased as follows: type V has a scale invariant eigenvalue spectrum. In contrast, the eigenvalues of the type L RBM are separated into distinct regions, in which the corresponding eigenvectors might represent features with different scales. This might be related to our previous numerical results, shown in Figs. 6, 7 and 8, that the type V RBM generates a flow toward the critical point, where the configurations have scale invariance.
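Spectra such as those in Figs. 16 and 17 can be obtained from a weight matrix along the following lines. This is a sketch with a random matrix standing in for trained weights, so it does not reproduce the smooth or steplike shapes discussed above; the helper name `descending_spectrum` and the relative threshold used to count non-vanishing eigenvalues are assumptions.

```python
import numpy as np

def descending_spectrum(W):
    """Eigenvalues of the symmetric matrix W W^T, largest first."""
    return np.linalg.eigvalsh(W @ W.T)[::-1]

rng = np.random.default_rng(3)
N_v, N_h = 100, 64
W = rng.normal(scale=0.1, size=(N_v, N_h))  # hypothetical stand-in for trained weights
lam = descending_spectrum(W)

# Count eigenvalues that are non-zero up to rounding; at most N_h of them can be.
n_nonzero = int(np.sum(lam > 1e-10 * lam[0]))
print(n_nonzero)
# The eigenvalues sum to the trace of W W^T, i.e. the sum of squared weights.
print(np.isclose(lam.sum(), np.sum(W**2)))
```

Plotting `lam` against its index on a logarithmic scale is one simple way to see whether the spectrum decreases gradually or is split into separated groups.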

Finally, in Fig. 17 we show differences in the eigenvalue spectrum between RBMs with different numbers of hidden neurons $N_h$. As shown in the left panel of Fig. 17, in the type V RBM with $N_h$ = 100, most eigenvalues have similar values. In contrast, for a smaller $N_h$, large and small eigenvalues are very different and the spectrum has a hierarchical structure. The type L RBM shows similar behavior, as shown in the right panel of Fig. 17. This might indicate that the RBM with larger $N_h$ has learned too many details of the input configurations and that the most relevant features are weakened. In other words, it has learned too many irrelevant features, which are especially specific to configurations at higher temperature. This may explain our numerical results shown in Fig. 9, which is clearly different from Fig. 6, that the flow first approaches the critical point but passes it and eventually goes away to higher temperature. This view is consistent with the discussion at the end of Sec. 4.3 and supports the second conjecture in Sec. 3.2.

Figure 17: Eigenvalues of $WW^T$ for type V RBM (left) and type L RBM (right). The legend shows the number of hidden neurons $N_h$.

To summarize, we find that the type V RBM with smaller $N_h$ can adequately learn configurations over a wide range of temperatures, without learning too many features at higher temperature. All the neurons in the hidden layer are efficiently used to represent features of various temperatures in a scale invariant way, as seen in the eigenvalue spectrum. As a result, after many iterations of the RBM reconstruction, initial configurations are transformed into configurations around the critical temperature $T_c$. Thus the RBM flow has a similarity to the RG flow, but the RBM flow is in the opposite direction to the conventional RG flow and a naive analogy does not hold.
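The iterated reconstruction referred to here can be sketched as alternating sampling of the hidden and visible layers. The code below assumes $\pm 1$ units, ignores the biases, and samples both layers stochastically; whether the flow in the text uses samples or expectation values at each step is not stated in this section, and the function names `sample_pm1` and `rbm_flow` as well as the random weights are illustrative assumptions. Measuring the temperature of each configuration along the flow is a separate step that is not shown.

```python
import numpy as np

def sample_pm1(field, rng):
    """Sample +-1 units with p(s) proportional to exp(s * field)."""
    p_up = 1.0 / (1.0 + np.exp(-2.0 * field))  # equals e^f / (e^f + e^-f)
    return np.where(rng.random(field.shape) < p_up, 1.0, -1.0)

def rbm_flow(v0, W, n_steps, rng):
    """Iterate v -> h -> v' and return all visible configurations, including v0."""
    v = v0.copy()
    history = [v.copy()]
    for _ in range(n_steps):
        h = sample_pm1(W.T @ v, rng)  # hidden layer in the background of v
        v = sample_pm1(W @ h, rng)    # reconstructed visible layer
        history.append(v.copy())
    return np.array(history)

rng = np.random.default_rng(4)
N_v, N_h = 100, 64
W = rng.normal(scale=0.1, size=(N_v, N_h))  # random stand-in for trained weights
v0 = rng.choice(np.array([-1.0, 1.0]), size=N_v)

flow = rbm_flow(v0, W, n_steps=10, rng=rng)
print(flow.shape)  # one row per step of the flow
```

With trained weights in place of the random matrix, each row of `flow` is one configuration along the RBM flow whose temperature can then be estimated.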

5 Discussions

In this paper, in order to see what the restricted Boltzmann machine (RBM) learns in the process of training, we investigated the flow of configurations generated by the weight matrices of the RBM. In particular, we studied the Ising model and found that, if the RBM is trained on spin configurations at various different temperatures (we call it the type V RBM), the temperature of an initial configuration flows towards the critical point $T_c$, where the system becomes scale invariant. The result suggests that the configurations at $T_c$ are attractors of the RBM flow. In order to understand the numerical results of the RBM flows and to find a clue to what the machine has learned, we explored properties of the weight matrix $W$, especially those of the product $WW^T$, by looking at the eigenvalue spectrum.

There are still many unsolved issues left for future investigations. If we admit that an eigenvector represents a “feature” that the RBM has learned, the magnitude of the corresponding eigenvalue is an indicator of how much influence the feature has, and is reminiscent of the critical exponents of relevant operators in the renormalization group (RG). It will be interesting to pursue this analogy further, to extract more information from the eigenvalue spectrum, and to connect it with the behavior of the RBM flow. The RBM flow gives important clues to what the machine has learned. If an RBM is trained on configurations at a single specific temperature, it generates a flow toward that temperature. This confirms the hypothesis that the unsupervised RBM indeed extracts relevant features of the configurations and that the flow is consequently attracted to the configurations with these relevant features. The RBM flow is a stochastic process with random noise, and should be described by a Langevin (or Fokker-Planck) equation whose drift term is given by the relevant features. We want to come back to this problem in future investigations.

In the present paper, we picked the Ising model, the simplest statistical model with a second order phase transition, and found that the critical point is an attractor of the RBM flow. Then we wonder what happens in the case of a first order phase transition. A simple example is the Blume-Capel model on a two dimensional lattice. The Hamiltonian is given by

$$H = -J \sum_{\langle i j \rangle} s_i s_j + \Delta \sum_i s_i^2,$$

where $s_i = \pm 1, 0$ is a spin at site i. The model undergoes a first order phase transition across a boundary that separates the parameter space (J, ∆), and a second order phase transition at the tricritical point. If we train an RBM on configurations at various different parameters (like the type V RBM of the Ising model), the flow of parameters will be attracted to the tricritical point. On the other hand, if we use only a restricted set of configurations for training, e.g. various J with a fixed ∆, where is the RBM flow attracted? It is not certain whether it is attracted to the phase boundary of the first order phase transition or to the tricritical point. This is under investigation and we want to report the numerical results in future publications.
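For concreteness, the energy of a Blume-Capel configuration can be evaluated as in the sketch below. It uses the standard form $H = -J \sum_{\langle ij \rangle} s_i s_j + \Delta \sum_i s_i^2$ with periodic boundaries on a square lattice; the sign convention, the boundary conditions and the function name `blume_capel_energy` are assumptions made for illustration.

```python
import numpy as np

def blume_capel_energy(s, J, Delta):
    """Energy of a spin-1 configuration s (entries -1, 0, +1) on a periodic square lattice."""
    # Shifting by one site along each axis counts every nearest-neighbour bond once.
    bonds = np.sum(s * np.roll(s, 1, axis=0)) + np.sum(s * np.roll(s, 1, axis=1))
    return -J * bonds + Delta * np.sum(s**2)

L = 4
all_up = np.ones((L, L), dtype=int)   # fully ordered state, s_i = +1 everywhere
empty = np.zeros((L, L), dtype=int)   # all sites in the s_i = 0 state

# Ordered state: 2*L*L bonds and L*L occupied sites, so E = -2*J*L^2 + Delta*L^2.
print(blume_capel_energy(all_up, J=1.0, Delta=1.0))
# Empty state: every term vanishes.
print(blume_capel_energy(empty, J=1.0, Delta=1.0))
```

Comparing these two energies shows the competition the model encodes: a large ∆ favours the $s_i = 0$ state, while a large J favours the ordered state.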

Finally we would like to comment on a possible relation between the structure of RBM flows and how we recognize the world around us. Our finding is that there is an attractor of the RBM flow which characterizes the relevant features that the machine has learned. We human beings also meet similar phenomena: we can recognize what we have already learned many times more easily and comfortably than what we experience for the first time. Also, what we find beautiful is usually not what we experience for the first time but a combination of what we have already experienced before. Good examples are looking at abstract paintings or tasting bitter coffee. This suggests that attractors are constructed in the process of learning and that we feel comfortable when input data are close to the attractors of the neural network in our brain. As a test of the conjecture, it would be amusing to train an RBM on a variety of human faces and generate the RBM flow of a human face to see if there is an attractor face. The attractor might provide a standard for beauty. In this way, we can guess that attractors of the RBM flows in the NN of our brain may control our value judgments.

Acknowledgments

We would like to thank participants of I-URIC frontier colloquium 2017, especially Shunichi Amari, Taro Toyoizumi and Shinsuke Koyama for fruitful discussions. We also thank Masato Taki for his intensive lectures on machine learning at KEK. This work of S.I. and S.S. is supported in part by Grants-in-Aid for Scientific Research (No. 16K05329 and No. 16K17711, respectively) from the Japan Society for the Promotion of Science (JSPS).

References

[1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science 313 (2006) 504.

[2] C. Bény, “Deep learning and the renormalization group,” arXiv:1301.3124 [quant-ph].

[3] S. Saremi and T. J. Sejnowski, “Hierarchical model of natural images and the origin of scale invariance,” Proceedings of the National Academy of Sciences 110 (2013) 3071.

[4] P. Mehta and D. J. Schwab, “An exact mapping between the Variational Renormalization Group and Deep Learning,” arXiv:1410.3831 [stat.ML].

[5] A. Paul and S. Venkatasubramanian, “Why does Deep Learning work? - A perspective from Group Theory,” arXiv:1412.6621 [cs.LG].

[6] H. W. Lin, M. Tegmark and D. Rolnick, “Why Does Deep and Cheap Learning Work So Well?,” J. Stat. Phys. 168 (2017) 1223-1247 [arXiv:1608.08225 [cond-mat.dis-nn]].

[7] M. Sato, “Renormalization Group Transformation for Hamiltonian Dynamical Systems in Biological Networks,” arXiv:1609.02981 [q-bio.OT].

[8] K.-I. Aoki and T. Kobayashi, “Restricted Boltzmann machines for the long range Ising models,” Mod. Phys. Lett. B 30 (2016) 1650401 [arXiv:1701.00246 [cond-mat.stat-mech]].

[9] M. Koch-Janusz and Z. Ringel, “Mutual Information, Neural Networks and the Renormalization Group,” arXiv:1704.06279 [cond-mat.dis-nn].

[10] R. Salakhutdinov, A. Mnih and G. Hinton, “Restricted Boltzmann machines for collaborative filtering,” Proceedings of the 24th international conference on Machine learning (ACM, 2007) pp. 791-798.

[11] H. Larochelle and Y. Bengio, “Classification using discriminative restricted Boltzmann machines,” Proceedings of the 25th international conference on Machine learning (ACM, 2008) pp. 536-543.

[12] G. Hinton, “A practical guide to training restricted Boltzmann machines,” Momentum 9 (2010) 926.

[13] G. E. Hinton, “A practical guide to training restricted Boltzmann machines,” Neural networks: Tricks of the Trade, Springer (2012) pp. 599-619.

[14] D. S. P. Salazar, “Nonequilibrium Thermodynamics of Restricted Boltzmann Machines,” Phys. Rev. E 96 (2017) 022131 [arXiv:1704.08724 [cond-mat.stat-mech]].

[15] K. G. Wilson, “Renormalization group and critical phenomena. 1. Renormalization group and the Kadanoff scaling picture,” Phys. Rev. B 4 (1971) 3174. K. G. Wilson, “Renormalization group and critical phenomena. 2. Phase space cell analysis of critical behavior,” Phys. Rev. B 4 (1971) 3184.

[16] L. Wang, “Discovering phase transitions with unsupervised learning,” Phys. Rev. B 94 (2016) 195105 [arXiv:1606.00318 [cond-mat.stat-mech]].

[17] G. Torlai and R. G. Melko, “Learning thermodynamics with Boltzmann machines,” Phys. Rev. B 94 (2016) 165134 [arXiv:1606.02718 [cond-mat.stat-mech]].

[18] A. Tanaka and A. Tomiya, “Detection of phase transition via convolutional neural network,” J. Phys. Soc. Jap. 86 (2017) no.6, 063001 [arXiv:1609.09087 [cond-mat.dis-nn]].

[19] S. J. Wetzel, “Unsupervised learning of phase transitions: from principal component analysis to variational autoencoders,” Phys. Rev. E 96 (2017) 022140 [arXiv:1703.02435 [cond-mat.stat-mech]].

[20] W. Hu, R. R. P. Singh, and R. T. Scalettar, “Discovering Phases, Phase Transitions and Crossovers through Unsupervised Machine Learning: A critical examination,” Phys. Rev. E 95 (2017) 062122 [arXiv:1704.00080 [cond-mat.stat-mech]].

[21] A. Morningstar and R. G. Melko, “Deep Learning the Ising Model Near Criticality,” arXiv:1708.04622 [cond-mat.dis-nn].

[22] S. Amari, “Theory of adaptive pattern classifiers”, IEEE Trans. Elect. Comput. EC-16 (1967) 299-307.

[23] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation 14 (2002) 1771.

[24] M. Á. Carreira-Perpiñán and G. E. Hinton, “On contrastive divergence learning,” Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS, 2005) pp. 33-40.

[25] Y. Bengio and O. Delalleau, “Justifying and generalizing contrastive divergence.” Neural Computation 21 (2009) pp. 1601-1621.

[26] C. H. Lee, Y. Yamada, T. Kumamoto and H. Matsueda, “Exact Mapping from Singular Value Spectrum of Fractal Images to Entanglement Spectrum of One-Dimensional Quantum Systems,” J. Phys. Soc. Jap. 84 (2015) no.1, 013001 [arXiv:1403.0163 [cond-mat.stat-mech]].

[27] T. Kumamoto, M. Suzuki, H. Matsueda, “Singular-Value-Decomposition Analysis of Associative Memory in a Neural Network,” J. Phys. Soc. Jap. 86 (2017) no.2, 024005 [arXiv:1608.08333 [cond-mat.stat-mech]].
