Linear Independent Component Analysis over Finite Fields: Algorithms and Bounds

2018·Arxiv

Abstract

Abstract

Independent Component Analysis (ICA) is a statistical tool that decomposes an observed random vector into components that are as statistically independent as possible. ICA over finite fields is a special case of ICA, in which both the observations and the decomposed components take values over a finite alphabet. This problem is also known as minimal redundancy representation or factorial coding. In this work we focus on linear methods for ICA over finite fields. We introduce a basic lower bound which provides a fundamental limit to the ability of any linear solution to solve this problem. Based on this bound, we present a greedy algorithm that outperforms all currently known methods. Importantly, we show that the overhead of our suggested algorithm (compared with the lower bound) typically decreases, as the scale of the problem grows. In addition, we provide a sub-optimal variant of our suggested method that significantly reduces the computational complexity at a relatively small cost in performance. Finally, we discuss the universal abilities of linear transformations in decomposing random vectors, compared with existing non-linear solutions.

Index Terms—Independent Component Analysis, Binary ICA, Blind Source Separation, Minimal Redundancy Representation, Minimum Entropy Codes, Factorial Codes.

I. INTRODUCTION

INDEPENDENT Component Analysis (ICA) addresses therecovery of unknown statistically independent source signals from their observed mixtures, without full prior knowledge of the mixing model or the statistics of the source signals. The classical Independent Components Analysis framework usually assumes linear combinations of the independent sources over the field of real valued numbers. A special variant of the ICA problem is when the sources, the mixing model and the observed signals are over a finite field, such as Galois Field of order q, GF(q). Several types of generative mixing models may be assumed when working over GF(q). This includes modulo additive operations, OR operations (over a binary field) and others. Existing solutions to ICA mainly differ in their assumptions of the generative model, the prior distribution of the mixing matrix (if such exists) and the noise model. Nevertheless, the common assumption to all of these approaches is that there exists a set of fully statistically independent source signals to be decomposed. However, these assumptions do not usually

A. Painsky is with the School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel. contact: amichai.painsky@huji.mail.ac.il

M. Feder is with the Department of Electrical Engineering, Tel Aviv University, Israel

The material in this paper was presented in part at the 2016 IEEE International Workshop on Machine Learning for Signal Processing

hold, and a more robust approach is required. In other words, we would like to decompose any given observed mixture into “as independent as possible” components, with no prior assumption on the way it was generated. This problem was first introduced by Barlow [1] as minimal redundancy representation and was later referred to as factorial representation [2] or generalized ICA over finite alphabets [3].

A factorial representation has several advantages. The probability of occurrence of any realization can be simply computed as the product of the probabilities of the individual components that represent it (assuming such decomposition exists). In addition, any method of finding factorial codes can be viewed as implementing the Occam’s razor principle, which prefers simpler models over more complex ones, where simplicity may be defined as the number of parameters necessary to represent the joint distribution of the data. In the context of supervised learning, independent features can also make later learning easier; if the input units to a supervised learning networks are uncorrelated, then the Hessian of its error function is diagonal, allowing accelerated learning abilities [4]. There exists a large body of work which demonstrates the use of factorial codes in learning problems. This mainly manifests in artificial neural networks [5], [6] with application to facial recognition [7]–[9] and deep learning [10], [11]. Recently, factorial codes were also shown to attain favorable compression rates in large alphabet source coding [12]–[15].

In this work we focus on linear solutions to the factorial representation problem; we seek for a linear transformation (over GF(q)), that decomposes the observed mixture into “as statistically independent as possible” components. Linear solutions have several advantages. They are easy to implement in linear time-invariant (LTI) systems, they are robust to noisy measurements and they are storage space efficient. Importantly, they are usually simpler to derive and analyze than in the non-linear case. For these reasons, linear ICA has received most of the attention over the years [16]–[22].

This paper is an extended version of our initial results, presented in [23]. In [23] we introduced a lower bound to the linear binary ICA problem, followed by a simple practical algorithm. Here we significantly extend the scope of this study. This includes the following contributions:

• A comprehensive derivation of the computational and statistical properties of the methods presented in [23].

• A novel computationally efficient algorithm which reduces the complexity of our previously suggested schemes.

• Generalization of our framework to finite fields of any order.

• A rigorous study of the flexibility of linear ICA methods, including both analytical and empirical results.

• Comparative study of linear and non-linear ICA methods.

• A data compression application, which demonstrates the use of our suggested approach in large alphabet source coding.

II. PREVIOUS WORK

The problem considered in this paper has a long history. In his pioneering work from 1989, Barlow [1] presented a minimally redundant representation scheme for binary data. He claimed that a good representation should capture and remove the redundancy of the data. This leads to a factorial representation/ encoding in which the components are as mutually independent of each other as possible. Barlow suggested that such representation may be achieved through minimum entropy encoding: an invertible transformation (i.e., with no information loss) which minimizes the sum of marginal entropies (as later presented in (1)). Unfortunately, Barlow did not propose any direct method for finding factorial codes. Atick and Redlich [24] proposed a cost function for Barlow’s principle for linear systems, which minimize the redundancy of the data subject to a minimal information loss constraint. This is closely related to Plumbey’s [25] objective function, which minimizes the information loss subject to a fixed redundancy constraint. Schmidhuber [2] then proposed several ways of approximating Barlow’s minimum redundancy principle in the non-linear case. This naturally implies much stronger results of statistical independence. However, Schmidhuber’s scheme is rather complex, and subjects to local minima [5]. Recently, we introduced a novel approach for finding factorial codes over non-linear transformation [3]. In that work, Barlow’s objective is tightly approximated with a series of linear problems. Despite its favorable computational properties, the approach suggested in [3] is quite analytically involved. Later, we introduced a simpler sub-optimal solution, which applies an order permutation according to the probability mass function of the observed mixture [14]. This method is shown to minimize the desired objective function up to a small constant, for increasing alphabets sizes, on the average.

In a second line of work, the factorial coding problem is studied under the assumptions that a fully independent decomposition exists, and that the mixture is linear. In his first contribution to this problem, Yeredor [16] considered a linear mixture of statistically independent sources over GF(2) (namely, binary ICA) and proposed a method for source separation based on entropy minimization. Yeredor assumed that the mixing model is a d-by-d invertible matrix and proved that the XOR model is invertible and that there exists a unique transformation matrix to recover the independent components. Yeredor further suggested several algorithms for the linear binary ICA (BICA) problem, including the AMERICA and the enhanced MEXICO algorithms. Further, Yeredor generalized his work [17] to address the ICA problem over Galois fields of any prime number. His ideas were analyzed and improved by Gutch et al. [26]. In [21], Attux et al. extended Yeredor’s formulation for a more robust setup, in which the sources are not necessarily independent and presented a heuristic immune-inspired algorithm [21], which was later improved and generalized to any GF(q) [27].

In addition to generative XOR model, there exist several alternative mixture assumptions. In [18] and [22] the authors considered a noise-OR model, where the probabilistic dependency between observable vectors and latent vectors is modeled via the noise-OR conditional distribution. Streith et al. [19] studied the binary ICA problem, where the observations are either drawn from a signal following OR mixtures or from a noise component. The key assumption made in this work is that the observations are conditionally independent given the model parameters (as opposed to the latent variables). This greatly reduces the computational complexity and makes the scheme amenable to an objective descent-based optimization solution. In [20], Nguyen and Zheng considered OR mixtures and proposed a deterministic iterative algorithm to determine the distribution of the latent variables and the mixing matrix.

III. LINEAR BINARY INDEPENDENT COMPONENT ANALYSIS

We begin our presentation by discussing the simpler binary case. Throughout this manuscript we use the following standard notation: underlines denote vector quantities, while their respective components are written without underlines but with an index. The probability function of the vector X is denoted as , while H (X) is the entropy of X. This means where the log function denotes a logarithm of base 2, unless stated otherwise. Further, we denote the binary entropy of the parameter p as .

A. Problem Formulation

Let be a given random vector of dimension d.We are interested in an invertible transformation, Y = g(X), such that the components of Y are “as statistically independent as possible”. Notice that the common notion of ICA is not limited to invertible transformations (hence Y and X may be of different dimensions). However, in our work we focus on this setup as we would like Y = g(X) to be “lossless” in the sense that we do not loss any information. Further motivation to this setup is discussed in [1], [2].

We distinguish between linear and non-linear invertible transformations. In the linear case, is a d-by-d invertible matrix over the XOR field. This means we seek where and rank(W) = d. In the non-linear case, we notice that an invertible transformation of a vector X is a one-to-one mapping (i.e., permutation) of its words. This means there exist invertible transformations from which only are linear [17].

To quantify the statistical independence among the components of Y we use the well-known total correlation criterion, which was first introduced by Watanabe [28] as a multivariate generalization of the mutual information,

This measure can also be viewed as the cost of coding the vector Y component-wise, as if its components were statistically independent, compared to its true entropy. Notice that the total correlation is non-negative and equals zero iff the components of Y are mutually independent. Therefore, “as statistically independent as possible” may be quantified by minimizing C(Y ). The total correlation measure was first considered as an objective for factorial representation by Barlow [1].

Since we define Y to be an invertible transformation of X

we have H(Y ) = H(X) and so our minimization objective, in the binary case, is

where is the sum of probabilities of all words whose bit equals 0. This means that the transformations are not unique. For example, we can invert the bit of all words to achieve the same objective, or even shuffle the bits.

Notice that the probability mass function of X is defined over values. Therefore, any approach which exploits the full statistical description of X would require going over all possible words at least once. On the other hand, there exist at most possible invertible transformations. The complexity of currently known binary ICA methods (and factorial codes) fall within this range. In the linear case (and under a complete independence assumption), the AMERICA algorithm [17], which assumes a XOR mixture, has a complexity of . The MEXICO algorithm, which is an enhanced version of AMERICA, achieves a complexity of under some restrictive assumptions on the mixing matrix. In the non-linear case, an approximation of the optimal solution may be achieved in a computational complexity of , where k is the accuracy parameter [29], while the simpler order permutation [14] requires .

B. Lower Bound on Linear Binary ICA

In this section we introduce a lower bound to the binary linear ICA problem. Specifically, we present a method which obtains an infimum of (2), under Y = WX and , rank(W) = d.

In his linear binary ICA work, Yeredor established a methodology based on a basic property of the binary entropy [17]. He suggested that the binary entropy of the XOR of two independent binary variables is greater than each variable’s entropy. Unfortunately, there is no such guarantee when the variables are dependent. This means that in general, the entropy of the XOR of binary variables may or may not be greater than the entropy of each of the variables (for example, , where is the xor operand). When minimizing (2) over linear transformations, Y = WX, we notice that each is a XOR of several, possibly dependent, variables . This means that naively, we need to go over all possible subsets of and evaluate their XOR. Formally, we define a matrix A of all possible realizations of d bits. Then, we would like to compute for all . This means that each row in the matrix A corresponds to a subset of variables from , on which we apply a XOR. Then, we evaluate the binary entropy of each . A necessary condition for W to be invertible is that it has no two identical rows. Therefore, a lower bound on (2) may be achieved by simply choosing d rows of the matrix A, for which are minimal. Notice that this lower bound is not tight or attainable. It defines a simple lower bound on (2), which may be attained if we are lucky enough to have chosen d rows of the matrix A which are linearly independent.

The derivation of our suggested bound requires the computation of binary entropy values, where each entropy corresponds to a different vectorial XOR operation of length d. This leads to a computational complexity of . Then, we sort the entropy values in an ascending order, using a simple quick-sort implementation. This again requires operations. Therefore, the total computation complexity of our suggested bound is . In other words, we attain a lower bound to any linear solution of (2), in a computational complexity that is asymptotically competitive to all currently known methods (for both linear and non-linear methods).

C. A Greedy Algorithm for Linear Binary ICA

We now present our suggested algorithmic approach for the linear BICA problem, based on the same methodology presented in the previous section. Again, we begin by evaluating all possible XOR operations for all . Further, we evaluate the binary entropy of each . We then sort the rows of A according to the binary entropy values of their corresponding . Denote the sorted list of rows as . Our remaining challenge is to choose d rows from such that the rank of these rows is d. Clearly, our objective suggests that we choose rows which are located at the top of the matrix , as they result in lower entropy values. Our suggested greedy algorithm begins with an empty matrix W. It then goes over the rows in in an ascending order. If the current row in is linearly independent of the current rows in W it is added to W. Otherwise, it is skipped and the algorithm proceeds to the next row in . The algorithm terminates once W is of full rank. The rank of W is evaluated by a simple Gauss elimination procedure (implemented on a Matlab platform as gfrank, for example) or by more efficient parallel computation techniques [30]. We denote our suggested algorithm as Greedy Linear ICA (GLICA).

Although GLICA looks for d linearly independent rows from in a no-regret manner, we may still consider its statistical properties and evaluate how much it deviates from the lower bound. Let us analyze the case where the order of the rows is considered random. This means that although the rows of are sorted according to the value of their corresponding entropy, , we still consider the rows as randomly ordered, in the sense that the position of each row in is random and independent of the other rows. This assumption is typically difficult to justify. However, we later demonstrate in a series of experiments that the derived properties may explain the favorable performance of GLICA. This suggests that although there is typically dependence among the row, it is practically low.

Let us examine GLICA at an iteration where it already found k < d linearly independent rows (rank(W) = k), and it is now seeking an additional independent row. Notice that there exist at most rows which are linearly dependent of the current rows in W (all possible linear combinations of these rows). Assume we uniformly draw (with replacement) an additional row from a list of all possible rows (of size ). The probability of a drawn row to be independent of the k rows in W is simply . Therefore, the number of draws needed in order to find another linearly interdependent row follows a geometric distribution with a parameter (as the draws are i.i.d.). This means that the average number of draws needed to find an additional row is , while its

variance is . Denote the number of rows that GLICA examines before termination by L. We have that E(L) = and var. It can be shown [31] that and . Further, var. We may now apply Chebyshev’s inequality to conclude that for every a > 0 we have that

This result implies that if we choose rows from , even with replacement, our suggested algorithm skips up to 2 rows on the average, before terminating with a full rank matrix W, under the assumptions mentioned above. Further, the probability that our algorithm skips 2 + a rows (a > 0) is bounded from above by . Notice that this bound is independent of d. Therefore, the overhead from our suggested the lower bound becomes negligible, as d increases. For example, for d = 100, the probability that we will examine 108 rows or more (a = 6) is not greater than 0.077.

D. Block-wise Approach for Linear Binary ICA

Despite its favorable computational and statistical properties, our suggested greedy algorithm (GLICA) still requires the computation of all possible XOR operations. This becomes quite costly as d increases (or as alphabet size grows, as we see in Section IV). Therefore, we suggest a simple modification to circumvent this computational burden.

Let us split the d components of X into b disjoint sets, (blocks). For example, for d = 7 and b = 3 we may have that and . Let us now apply GLICA to each block. Denote the outcome of this operation as , where the matrix is block diagonal. As discussed in Section III-A, our objective (2) is invariant to shuffling of components. It means we can randomly shuffle the components of Y (by multiplying it with a permutation matrix), and maintain the same objective value . In our example (d = 7, b = 3), the random shuffling may attain, for example, , and . Notice that we now have a new set of blocks, on which we may again apply the GLICA algorithm, to further reduce our objective. We repeat this process for a configurable number of times, or until (local) convergence occurs. Notice that the convergence is guaranteed since we generate a sequence of non-increasing objective values, which is bounded from below by our suggested lower bound (Section III-B). The resulting linear transformation of this entire process is simply the multiplication of all matrices that we apply along the process. We denote this iterative block-wise algorithm as Block GLICA, or simply BloGLICA. Our suggested approached is summarized in Algorithm 1

Notice that this block-wise approach is a conceptual framework which reduces the computational complexity of any finite alphabet ICA method. In other words, we may replace the GLICA algorithm in line 4 of Algorithm 1 by any finite field ICA algorithm, and by that reduce its computational burden. Let us set the maximal number of iterations to M. Further, set the maximal size of each block as . Then, the computational complexity of BloGLICA is simply . Notice this complexity is typically much smaller than GLICA’s , since we do not examine all possible XOR operations and focus on local random searches that are implied by the block structure that we define.

E. Illustrations

Let us now illustrate the performance of our suggested algorithms and bounds in several experiments. In the first experiment we examine our ability to recover d independent binary sources that were mixed by an unknown matrix B. Let be a d-dimensional binary vector. Assume that the components of S are i.i.d. with a parameter p = 0.4. This means that the joint entropy of S is . We draw 10, 000 i.i.d. samples from S and mix them with a binary matrix B. Then, we apply our binary ICA approach to recover the original samples and the mixing matrix B. Figure 1 demonstrates the averaged results we achieve for different number of components d, where B is an arbitrary invertible matrix that is randomly drawn, prior to the experiment. We verify that matrix B is not an identity, nor a permutation matrix, to make the experiment meaningful. We compare our suggested approach with three alternative methods: AMERICA, MEXICO (described in Section II) and cobICA [27]. As previously described, the cobICA is an immune-inspired algorithm. It starts with a random “population” where each element in the population represents a valid transformation (an invertible matrix). At each step, the algorithm evaluates the objective function for each element in the population, and subsequently “clones” the elements. Then, the clones suffer a mutation process, generating a new set of individuals. This new set is evaluated again in order to select the individuals with the best objective values. The entire process is repeated until a pre-configured number of repetitions is executed. The cobICA algorithm requires a careful tuning of several parameters. Here, and in the following experiments, we follow the guidelines of the authors; we set the general parameters as appear in Table 1 of [27], while the rest of the parameters (concentration and suppression) are then optimized over a predefined set of values, in a preliminary independent experiment.

We first notice that for , both GLICA and AMERICA successfully recover the mixing matrix B (up to permutation of the sources), as they achieve an empirical sum of marginal entropies, which equals to the entropy of the samples prior to the mixture (blue curve at the bottom). Second, we notice that for the same values of d, our suggested lower bound (Section III-B) seems tight, as GLICA and AMERICA attain it. This is quite surprising since our bound is not expected to be attainable. This phenomenon may be explained by the simple setting and the relatively small dimension. In fact, we notice that as the dimension increases beyond d = 12, our suggested bound drops beneath the joint entropy, H(Y ), and GLICA fails to perfectly recover B. The green curve corresponds to MEXICO, which demonstrates inferior performance, due to its design assumptions (see Section III-A) . Finally, the red curve with the circles is cobICA, which is less competitive as the dimension increases. It is important to emphasize that while AMERICA and MEXICO are designed under the assumption that a perfect decomposition exists, cobICA and GLICA do not assume a specific generative model. Nevertheless, GLICA demonstrates competitive results, even when the dimension of the problem increases.

2 4 6 8 10 12 14 -0.05

Fig. 1: Recovering independent sources experiment. Black curve at the bottom: lower bound on linear ICA. Blue curve: GLICA. Green dashed curve: AMERICA. Green curve: MEXICO. Red curve with the circles (on top): cobICA.

In our second experiment we consider a binary source vector over an alphabet size , whose joint distribution

follows a Zipf’s law distribution,

where m is the alphabet size and s is the “skewness” parameter. The Zipf’s law distribution is a commonly used heavytailed distribution, mostly in modeling of natural (real-world) quantities. It is widely used in physical and social sciences, linguistics, economics and many other fields. In this experiment, we draw 10, 000 i.i.d. random samples from a Zipf’s law distribution with s = 1.01, where each sample is represented by a d-dimensional binary vector, and the representation is chosen at random. We evaluate the lower bound of (2) under linear transformations, followed by the GLICA algorithm. In addition, we examine the block-wise approach, BloGLICA (Section III-D) with b = 2, 3. We compare our suggested bound and algorithms to cobICA [27]. Notice that in this experiment we omit the AMERICA and MEXICO algorithms, as they assume a generative mixture model, which is heavily violated in our setup. Figure 2 demonstrates our objective (2), for an increasing number of components d.

2 4 6 8 10 12 14 0

Fig. 2: Minimizing (2) given independent draws from a Zipf’s distribution. Black curve at the bottom: lower bound on linear ICA. Blue curve above it: GLICA. Blue dashed curve: BloGLICA with b = 2. Blue curve with X’s: BloGLICA with b = 3. Red curve with circles (on top): cobICA.

We first notice that our suggested algorithms lie between the lower bound and cobICA. Interestingly, bloGLICA preforms reasonably well, even as the number of blocks increases. On the computational side, Figure 3 demonstrates the runtime of each of these methods, over a standard personal computer. While we do not claim that these are the optimal implementations of the algorithms, obvious optimizations of the algorithms were implemented. The fact that they all were implemented in the same language (Matlab) is assumed to give none of the methods a significant advantage over the others. The same run-time comparison approach was taken, for example, in [26].

Here we notice the extensive runtime required by cobICA, compared to our suggested methods. In addition, we see that

Fig. 3: Runtime of the experiment presented in Figure 2.

as the dimension increases, GLICA necessitates a rapidly increasing runtime (as it requires to compute and sort entropy values). However, by introducing BloGLICA we avoid this problem at a relatively small overhead in the objective.

Finally, we repeat the previous experiment, where the samples are now drawn from a source distribution with a greater entropy. Specifically, we consider 10, 000 i.i.d. draws from a Beta-Binomial distribution over an alphabet size ,

where a and b are the parameters of the distribution and Bis the Beta function. We set a = b = 3 in our experiment. The Beta-Binomial distribution is the Binomial distribution in which the probability of success at each trial is not fixed but random and follows the Beta distribution. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics, to capture overdispersion in Binomial type distributed data. Although the entropy of this distribution does not hold an analytical expression, it can be shown to be greater than for a = b = 3. As before, each drawn sample is represented by a d-dimensional binary vector, where the initial representation is chosen at random. Figure 4 demonstrates the results we achieve, applying the same binary ICA schemes as in the previous experiment.

Here again, we notice the same qualitative behavior as with the Zipf’s law experiment, where the main difference is that the ICA algorithms attain smaller values of (2). The reason for this phenomenon lies on the observation that sources with greater entropy are typically easier to decompose into independent components. This property is further discussed in Section VI.

IV. LINEAR ICA OVER FINITE FIELDS

Let us now extend our scope to general (possibly non- binary) finite fields of higher order. Finite fields (Galois fields) with q elements are commonly denoted as GF(q). A finite field only exists if q is a prime power i.e. for some prime p and a positive integer z. When , z = 1 the field is called a prime field and its elements can be represented by the integers

Fig. 4: Minimizing (2) given independent draws from a BetaBinomial distribution with a = b = 3. The curves represent the same schemes described in Figure 2.

. Addition, subtraction and multiplication are then performed by modulo-q operations, while division is not completely straight-forward, but can be easily implemented using B´ezout’s identity [26]. The simplicity of this field structure allows an easy implementation on any mathematical software – all it requires is multiplication, addition and the modulo operation on integers. For these reasons, we focus on linear ICA over prime fields, although our results can be easily extended to any finite field of prime power. As with the binary case, we are interested in minimizing the sum of marginal entropies, where it is now defined as

Here, we have that Y = WX where and the multiplication is modulo–q. As before, each is a linear combination (over GF(q)) of possibly dependent variables . This means that naively, we need to go over all possible linear combinations of to find the linear combinations that minimize the marginal entropies. Formally, we may define a matrix D of all possible linear combinations. We would like to compute , for all i = . Then, we evaluate the entropy of each . As before, a necessary condition for W to be invertible is that it has no two identical rows. Therefore, a lower bound on (4) may be achieved by simply choosing d rows of the matrix D, for which are minimal. As in the binary case, this lower bound is not tight or attainable; it defines a simple bound on (4), which may be attained if we are lucky enough to have chosen d rows of the matrix D that are linearly independent. Notice that this bound requires the computation of entropy values, followed by sorting them in a ascending order. This leads to a computational complexity of , which may become quite costly as q and b increase.

As discussed in Section III-C, we may derive a simple algorithm from our suggested bound, that seeks for d independent rows from in a greedy manner. As in the binary case, we may derive the second order statistics of our suggested scheme, under the same assumptions mentioned in Section III-C. Here, we have that the expected number of rows that our suggested algorithm examines before termination, E(L), satisfies

Further, the variance is bounded from above by

where the last inequality is due to (5). A similar derivation is given in [31]. Notice that both the expected overhead and the variance of L converge to zero as the order of the field grows. We now apply Chebyshev’s inequality to conclude that for every a > 0 we have that

Again, this result implies that under the independent draws assumption, choosing rows from the sorted D, even with replacement, skips up to rows on the average, before terminating with a full rank matrix W. Further, the probability that our algorithm will skip more than rows is governed by the left term of (7). Unfortunately, despite these desired statistical properties, the computational complexity of this algorithm, , becomes quite intractable as both q and d increase. Therefore, we again suggest a block-oriented algorithm, in the same manner described in Section III-D. This reduces the computation complexity to , where M is the maximal number of iterations and is the maximal number of components in each of the b blocks.

Figure 5 illustrates the performance of our suggested bound and algorithms when the alphabet size increases. Let be a six-dimensional random vector that follows a Zipf’s law distribution with s = 1.01. Here, we draw i.i.d. random samples of X for a dimension d = 6 and different prime numbers q. We evaluate our suggested lower bound, as previously discussed. We further apply the GLICA algorithm and examine BloGLICA with b = 2 and 3. Figure 5 demonstrates our objective for q = 2, 3, 5 and 7.

We first notice that GLICA achieves a sum of marginal entropies that is remarkably close the the lower bound. In addition, we notice that BloGLICA results in only a slight deterioration in performance (for b = 2, 3). As we examine

Fig. 5: Minimizing (4) for and q = 2, 3, 4 and 7. Black curve at the bottom: lower bound on linear transformation. Blue curve (right on top of it): GLICA. Blue dashed curve: BloGLICA with b = 2. Blue curve with X’s: BloGLICA with b = 3

the runtime of each of these methods (Figure 6), we notice a dramatic increase in GLICA’s computational complexity when q grows. As mentioned above, this computational drawback may be circumvented by applying the BloGLICA algorithm, at quite a mild overhead cost in the objective.

Fig. 6: Runtime of the experiment presented in Figure 5.

We further compare our suggested approach to AMERICA, MEXICO and cobICA, under a generative linear mixture model assumption, as described in Section III-E. Here again, we observe that both GLICA and AMERICA successfully decompose the mixture model for smaller values of d and q, while MEXICO and cobICA are slightly inferior. The detailed results are located in the first author’s webpage1, due to space limitation.

V. APPLICATIONS

Although ICA over finite fields does not originate from a specific application, it applies to a variety of problems. In [26], Gutch et al. suggested a specific application to this framework, in the context of eavesdropping on a multi-user MIMO digital communications system. Here, we show that ICA over finite fields also applies for large alphabet source coding. For the simplicity of the presentation we focus on the binary case.

Consider d independent binary sources with corresponding Bernoulli parameters . Assume that the sources are mixed by an unknown matrix B over GF(2). We draw n i.i.d. samples from this mixture, which we would like to efficiently transmit to a receiver. Large alphabet source coding considers the case where the number of draws n is significantly smaller than the alphabet size . This means that the empirical entropy of drawn samples is significantly smaller the entropy of the source. Therefore, even if we assume that both the transmitter and the receiver know , coding the samples according to the true distribution is quite wasteful. Large alphabet source coding has been extensively studies over the past decade. Several key contributions include theoretical performance bounds and practical methods in different setups (see [14] for an overview). Here, we show that under a mixed independent sources assumption, our suggested binary ICA framework demonstrates favorable properties, both in terms of code-rate and run-time. Specifically, we show that by applying our suggested GLICA scheme, we (strive) to recover the original independent sources. Then, we encode each recovered source independently, so that the sources are no longer considered over “large alphabet” (as the alphabet is now binary). Notice that the redundancy of this scheme consists of three terms: the decomposition cost (1), the (negligible) redundancy of encoding the recovered binary sources, and the cost of describing the matrix W to the receiver (bits).

We illustrate our suggested coding scheme in the following experiment. We draw n i.i.d. samples from a random mixture of d = 20 sources, with a corresponding set of parameters . Notice that the joint entropy of this source is 14.36 bits. We compare three different compression scheme. First, we examine the “textbook” approach for this problem. Here, we construct a Huffman code for the n samples, based on their empirical distribution. Since the receiver is oblivious to this code, we need to transmit it as well. This results in a total compression size of at least n times the empirical entropy, plus a dictionary size of , where is the number of unique symbols that appear in the n samples. Second, we apply our suggested GLICA algorithm, followed by arithmetic coding to each of the recovered sources. Finally, we apply BloGLICA with b = 2, to reduce the run-time. Notice that the cost of describing the transformation is also reduced to . Figure 7 demonstrated the compression rate we achieve for different values of n. The red curve with the asterisks corresponds to a Huffman compression (following [32]) with its corresponding dictionary. We first observe the notable effect of the dictionary’s redundancy when , leading to a compression rate which is even greater than d = 20 bits. However, as n increases the relative portion of the dictionary decreases, since grows much slower. This leads to a quick decay in the compression rate. The blue curve at the bottom is GLICA, while the black curve is the empirical sum of marginal entropies of the original sources. Here we observe that GLICA successfully decomposes the sources, where the difference between the two curves is due to the cost of describing W (which becomes negligible as n increases). Further, we compare GLICA with marginal encoding to each of the mixed components (that is, without trying to decompose it first). This results in an compression rate of approximately 20 bits (magenta curve with rhombuses), as the mixture increased the empirical marginal entropies to almost a maximum. Finally, we apply the BloGLICA (blue dashed curve). We notice an increased compression rate compared with GLICA, due to a greater sum of marginal entropies. However, BloGLICA is a much more practical approach, as it takes only 25 seconds to apply (for n = 10, 000), compared with 453 seconds by GLICA. Notice that we omit other alternative methods that are inadequate or impractical to apply to this high dimensional problem (cobICA, AMERICA, MEXICO).

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 14

Fig. 7: Large alphabet source coding experiment. The red curve with the asterisks is a Huffman code according to the empirical joint distribution. The Blue curve is GLICA. The dashed blue curve is BloGLICA. The black curve is the sum of empirical marginal entropies of the original sources, and the magenta curve with the rhombuses is the sum of empirical marginal entropies of the mixed sources.

To conclude, we show that by applying our suggested scheme we are able to efficiently compress data in a large alphabet regime. Although our results strongly depend on the independent sources assumption, they are not limited to this model. In other words, we may apply GLICA to any set of samples and compare the resulting empirical sum of marginal entropies with the empirical joint entropy of the data; if the difference between the two is relatively small, then there is no significant loss in encoding the data component-wise, and the binary ICA compression scheme is indeed favorable.

VI. ON THE FLEXIBILITY OF LINEAR TRANSFORMATIONS

In the previous sections we introduced fundamental bounds and efficient algorithms for the linear ICA problem over finite fields. However, it is not quite clear how “powerful” this tool is, for a given arbitrary mixture. In other words, compared to its alternatives, how well can a linear transformation minimize the sum of marginal entropies, in the general case? To answer this question, we first need to specify an alternative approach.

A. Non-linear binary ICA

Assume we are interested in minimizing (2) over non-linear invertible transformations. As mentioned in Section II, this problem was originally introduced by Barlow [1] as minimally redundant representations, and is considered hard. In [3], the authors introduce a piece-wise linear relaxation algorithm which tightly approximates the solution to (2) with a series of linear problems. Here we focus on a simplified greedy algorithm which strives to minimize (2) in a more computationally efficient manner with favorable theoretical properties. This approach, along with its computational and statistical characteristics, was previously introduced in [14]. Here, we briefly review these results.

1) The Order Permutation: As mentioned above, an invertible transformation Y = g(X) is a one-to-one mapping (i.e. permutation) of its alphabet symbols. In other words, an invertible transformation permutes the mapping of the m symbols to the m values of its probability mass function . Since our objective (1) is quite involved, we modify it by imposing an additional constraint. Specifically, we would like to sequentially minimize each term of the summation, , for j = 1, . . . , d, in a no-regret manner. As shown in [14], the optimal solution to this problem is the order permutation, which suggests to map the smallest probability to the word (in its binary representation). This means, for example, that the all zeros word will be assigned to the smallest probability value in while the all ones word is assigned to the maximal probability value.

2) worst-case and average-case performance: At this point, it is quite unclear how well the order permutation performs as a minimizer to (1). The following theorems present two theoretical properties which demonstrate its capabilities. These theorems (and their proofs) were first presented in [14].

We begin with the worst-case performance of the order permutation. Here, we denote our objective (1) as C(p, g), as it solely depends on the probability vector and the transformation. Further, let us denote the order permutation as and the optimal permutation (which minimizes (1)) as . In the worst-case analysis, we would like to quantify the maximum of over all probability mass functions p, of a given alphabet size .

Theorem 1: For any random vector , over an alphabet size we have that

Theorem 1 shows that even the optimal non-linear transformation achieves a sum of marginal entropies which is bits greater than the joint entropy, in the worst case. This means that there exists at least one source X with a probability mass function which is impossible to encode as if its components are independent without losing at least bits. This result is quite unfortunate since we have that

Notice that this worst-case derivation obviously also applies for linear transformations. In other words, we can always find a worst-case source which no ICA algorithm (linear or non-linear) can efficiently decompose.

We now turn to an average-case analysis. Here, we show that the expected value of is bounded by a small constant, when averaging uniformly over all possible p over an alphabet size .

Theorem 2: Let be a random vector of an alphabet size and joint probability mass function p. Let be the order permutation. For , the expected value of , over a prior uniform simplex of joint probability mass functions p, satisfies

This means that when the alphabet size is large enough, the order permutation achieves, on the average, a sum of marginal entropies which is only 0.0162 bits greater than the joint entropy, when all possible probability mass functions p are equally likely to appear. Notice that the uniform prior assumption may not be adequate for every setup, but it does provide a universal result for the performance of this minimizer. Proofs of these theorems are located in [14] under different notations, and in the first author’s webpage2 in the same notation that is used above.

B. Average-case performance of Linear ICA transformations

As with the non-linear case, we would like to evaluate the average-case performance of linear transformations.

Theorem 3: Let be a random vector of an alphabet size and joint probability mass function p. Let be the optimal linear transformation (notice that the optimal transformation depends on p). Then, the expected

value of , over a prior uniform simplex of joint probability mass functions p, satisfies

A proof for this theorem is provided in the Appendix. This theorem suggests that when the number of components is large enough, even the optimal linear transformation preforms very poorly on the average, as it attains the maximal possible marginal entropy value (). Moreover, it is shown in the Appendix that this average-case performance is equivalent to applying no transformation at all. In other words, when the number of components is too large, linear transformations are useless in the worst-case, and on the average. The intuition behind these results is that our objective depends on the entire probability mass function of X (which is of order ), while the number of free parameters, when applying linear transformations, is only . This means that when d increases, linear transformations are simply not flexible enough to minimize the objective.

C. Illustrations

Let us now compare our suggested linear algorithms and bound with the non-linear order permutation discussed above. In the first experiment we draw independent samples from a Zipf law distribution with s = 1.01 and a varying number of components d. We evaluate the lower bound of (2) under linear transformations, as discussed in Section III-B and further apply the GLICA algorithm (Section III-C). We compare our linear results with the non-linear order permutation on one hand, and with applying no transformation at all on the other hand. Figure 8 demonstrates the results we achieve. As we can see, the order permutation outperforms any linear solution quite remarkably, for the reasons mentioned above. Moreover, we notice that the gap increases as the number of components grows.

Fig. 8: Minimizing (2) given independent draws from a Zipf’s distribution, using linear and non-linear methods. Black curve with X’s: the order permutation. Black curve: lower bound on linear transformation. Blue curve above it: GLICA. Red curve with circles: applying no transformations.

We now turn to illustrate the average-case performance, as discussed in previous sections. Figure 9 shows the mean of our objective (empirically evaluated over independent draws from a uniform simplex), for the four methods mentioned above. We first see that the non-linear order permutation converges to a small overhead constant, as indicated in Theorem 2. On the other hand, the lower bound of the linear solution behaves asymptotically like applying no transformation at all. This can be easily derived from Theorem 3 and the fact that , where is a digamma function, as shown in [14].

VII. CONCLUSION

In this work we consider a framework for ICA over finite fields, in which we strive to decompose any given vector into

Fig. 9: Average-case analysis of minimizing (2) using linear and non-linear methods. The curves are described in Figure 8

“as statistically independent as possible” components, using only linear transformations. Over the years, several solutions have been proposed to this problem. Most of them strongly depend on the assumption that a perfect decomposition exists, while the rest suggest a variety of heuristics to the general case. In this work, we present a novel lower bound which provides a fundamental limit for the performance of any linear solution. Based on this bound, we suggest a simple greedy algorithm which shows favorable statistically properties, compared with our suggested bound. In addition, we introduced a simple modification to our suggested algorithm which significantly reduces its computational complexity at a relatively small cost in the objective.

In addition, we discuss the basic limitations of working with linear transformations. We show that this class of transformations becomes quite ineffective in the general case, when the dimension of the problem increases. Specifically, we show that when averaging over a uniform prior, applying the optimal linear transformation is equivalent to applying no transformation at all. This result should not come as a surprise, since the objective depends on the full statistical description of the problem (which is exponential in the number of components d), while linear transformations only provides free parameters. That being said, we do not intend to discourage the use of linear transformations for finite field ICA. Our analysis focuses on a universal setup in which no prior assumption is made. This is not always the case in real-world applications. For example, linear transformations may be quite effective in cases where we assume a generative linear model but the sources are not completely independent.

Although the focus of this paper is mostly theoretical, linear ICA is shown to have many applications in a variety of fields. We believe that the theoretical properties that we introduce, together with the practical algorithms which utilize a variety of setups, make our contribution applicable to many disciplines.

APPENDIX

This Appendix provide the proof of Theorem 3. We begin the proof by introducing the following Lemma:

Lemma 1: Let be a binary random vector of d components, . Then,

where the expectation is over a prior of uniform simplex of probability mass functions p and is the digamma function.

Proof The probability vector p consists of elements and follows a uniform simplex prior. This means that it follows a flat Dirichlet distribution with a parameter . The marginal probability of each component of Y is a summation of half of the elements of p. Therefore, following the Dirichlet distribution properties, we have that again follows a Dirichlet distribution, with . This means that Beta. Notice that for a symmetric Beta distributed random variable , we have that where is the digamma function. Therefore,

Proof of Theorem 3: Let be a binary random vector of d components, . For the simplicity of the presentation, denote our objective as . Its expectation over a prior of uniform simplex of probability mass functions is defined as where is the unit simplex and C is a normalization constant such that . Let us use the symmetry of the simplex to reformulate this definition of expectation. Denote

where is the set of all ascending ordered probability vectors of size and R(q) is the set of all possible permutations of the vector q. Notice we have that iff such that and . In other words, we can express every probability mass function as a permutation of an ascending ordered probability mass function. Therefore, .

Let us now focus on the set R(q), for a fixed q. Notice that this set represents the set of all invertible transformations of . In other words, assume , then Y = g(X) is an invertible transformation iff . Linear invertible transformations define a (disjoint) partitioning of R(q). This means that for every , we can define an invertible linear transformation such that . In other words, for every we can define a set of probability mass functions that are a result of all invertible linear transformations on it. These sets are disjoint since every set of linear transformations is closed; if and , then there exists a linear transformation from any element in to any element in , which contradicts the definition of above. Since these transformations are invertible (by definition), they are also elements in R(q). Denote the size of the set as A. Notice that while

p, q)| =

that these figures only depend of the dimension of the problem (and not on p or q).

of our objective. We denote it by . Further, define the set of all linear minimizers as . We have that

since . Let us now multiply both sides by A and average over all q. We get that

where the equality follows from . Finally, we have that

due to Lemma 1 above. Notice that the expression on the left is exactly the expected value of the optimal linear solution, as desired.

REFERENCES

[1] H. Barlow, T. Kaushal, and G. Mitchison, “Finding minimum entropy codes,” Neural Computation, vol. 1, no. 3, pp. 412–423, 1989.

[2] J. Schmidhuber, “Learning factorial codes by predictability minimiza- tion,” Neural Computation, vol. 4, no. 6, pp. 863–879, 1992.

[3] A. Painsky, S. Rosset, and M. Feder, “Generalized independent compo- nent analysis over finite alphabets,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 1038–1053, 2016.

[4] S. Becker and Y. Le Cun, “Improving the convergence of back- propagation learning with second order methods,” in Proceedings of the 1988 connectionist models summer school. San Matteo, CA: Morgan Kaufmann, 1988, pp. 29–37.

[5] S. Becker and M. Plumbley, “Unsupervised neural network learning pro- cedures for feature extraction and classification,” Applied Intelligence, vol. 6, no. 3, pp. 185–203, 1996.

[6] D. Obradovic, An information-theoretic approach to neural computing. Springer Science & Business Media, 1996.

[7] S. Choi and O. Lee, “Factorial code representation of faces for recogni- tion,” in Biologically Motivated Computer Vision. Springer, 2000, pp. 42–51.

[8] M. S. Bartlett, J. R. Movellan, and T. J. Sejnowski, “Face recognition by independent component analysis,” IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1450–1464, 2002.

[9] M. S. Bartlett, “Information maximization in face processing,” Neurocomputing, vol. 70, no. 13, pp. 2204–2217, 2007.

[10] J. Schmidhuber, D. Cires¸an, U. Meier, J. Masci, and A. Graves, “On fast deep nets for agi vision,” in Artificial General Intelligence. Springer, 2011, pp. 243–246.

[11] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.

[12] A. Painsky, S. Rosset, and M. Feder, “Universal compression of memo- ryless sources over large alphabets via independent component analysis,” in Data Compression Conference (DCC). IEEE, 2015, pp. 213–222.

[13] ——, “A simple and efficient approach for adaptive entropy coding over large alphabets,” in Data Compression Conference (DCC). IEEE, 2016, pp. 369–378.

[14] ——, “Large alphabet source coding using independent component analysis,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6514–6529, 2017.

[15] A. Painsky, “Phd dissertation: Generalized Independent Components Analysis over finite alphabets,” arXiv preprint arXiv:1809.05043, 2018.

[16] A. Yeredor, “ICA in boolean XOR mixtures,” in Independent Component Analysis and Signal Separation. Springer, 2007, pp. 827–835.

[17] ——, “Independent analysis over Galois fields of prime order,” IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5342–5359, 2011.

[18] T. ˇSingliar and M. Hauskrecht, “Noisy-or component analysis and its application to link analysis,” The Journal of Machine Learning Research, vol. 7, pp. 2189–2213, 2006.

[19] A. P. Streich, M. Frank, D. Basin, and J. M. Buhmann, “Multi- assignment clustering for Boolean data,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 969–976.

[20] H. Nguyen and R. Zheng, “Binary Independent Component Analysis with OR mixtures,” IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3168–3181, 2011.

[21] D. G. Silva, R. Attux, E. Z. Nadalin, L. T. Duarte, and R. Suyama, “An immune-inspired information-theoretic approach to the problem of ICA over a galois field,” in Information Theory Workshop (ITW). IEEE, 2011, pp. 618–622.

[22] F. Wood, T. Griffiths, and Z. Ghahramani, “A non-parametric Bayesian method for inferring hidden causes,” arXiv preprint arXiv:1206.6865, 2012.

[23] A. Painsky, S. Rosset, and M. Feder, “Binary independent component analysis: Theory, bounds and algorithms,” in The International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016, pp. 1–6.

[24] J. J. Atick and A. N. Redlich, “Towards a theory of early visual processing,” Neural Computation, vol. 2, no. 3, pp. 308–320, 1990.

[25] M. D. Plumbley, “Efficient information transfer and anti-hebbian neural networks,” Neural Networks, vol. 6, no. 6, pp. 823–833, 1993.

[26] H. W. Gutch, P. Gruber, A. Yeredor, and F. J. Theis, “ICA over finite fields - separability and algorithms,” Signal Processing, vol. 92, no. 8, pp. 1796–1808, 2012.

[27] D. G. Silva, J. Montalvao, and R. Attux, “cobICA: A concentration- based, immune-inspired algorithm for ica over galois fields,” in IEEE Symposium on Computational Intelligence for Multimedia, Signal and Vision Processing (CIMSIVP), 2014, pp. 1–8.

[28] S. Watanabe, “Information theoretical analysis of multivariate correla- tion,” IBM Journal of research and development, vol. 4, no. 1, pp. 66–82, 1960.

[29] A. Painsky, S. Rosset, and M. Feder, “Generalized binary independent component analysis,” in International Symposium on Information Theory (ISIT). IEEE, 2014, pp. 1326–1330.

[30] K. Mulmuley, “A fast parallel algorithm to compute the rank of a matrix over an arbitrary field,” in Proceedings of the eighteenth annual ACM symposium on Theory of computing. ACM, 1986, pp. 338–339.

[31] N. Shulman, “Communication over an unknown channel via common broadcasting,” Ph.D. dissertation, 2003.

[32] I. H. Witten, A. Moffat, and T. C. Bell, Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, 1999.

Amichai Painsky received his B.Sc. in Electrical Engineering from Tel Aviv University (2007), his M.Eng. degree in Electrical Engineering from Princeton University (2009) and his Ph.D. in Statistics from the School of Mathematical Sciences in Tel Aviv University. He is currently a Post-Doctoral Fellow, co-affiliated with the Israeli Center of Research Excellence in Algorithms (I-CORE) at the Hebrew University of Jerusalem, and the Signals, Information and Algorithms (SIA) Lab at MIT. His research interests include Data Mining, Machine Learning, Statistical Learning and their connection to Information Theory.

Saharon Rosset is a Professor in the department of Statistics and Operations Research at Tel Aviv University. His research interests are in Computational Biology and Statistical Genetics, Data Mining and Statistical Learning. Prior to his tenure at Tel Aviv, he received his PhD from Stanford University in 2003 and spent four years as a Research Staff Member at IBM Research in New York. He is a five-time winner of major data mining competitions, including KDD Cup (four times) and INFORMS Data Mining Challenge, and two time winner of the best paper award at KDD (ACM SIGKDD International Conference on Knowledge Discovery and Data Mining)

Meir Feder (S’81-M’87-SM’93-F’99) received the B.Sc and M.Sc degrees from Tel-Aviv University, Israel and the Sc.D degree from the Massachusetts Institute of Technology (MIT) Cambridge, and the Woods Hole Oceanographic Institution, Woods Hole, MA, all in electrical engineering in 1980, 1984 and 1987, respectively. After being a research associate and lecturer in MIT he joined the Department of Electrical Engineering - Systems, School of Electrical Engineering, Tel-Aviv University, where he is now a Professor and the incumbent of the Information Theory Chair. He had visiting appointments at the Woods Hole Oceanographic Institution, Scripps Institute, Bell laboratories and has been a visiting professor at MIT. He is also extensively involved in the high-tech industry as an entrepreneur and angel investor. He co-founded several companies including Peach Networks, a developer of a server-based interactive TV solution which was acquired by Microsoft, and Amimon a provider of ASIC’s for wireless high-definition A/V connectivity. Prof. Feder is a co-recipient of the 1993 IEEE Information Theory Best Paper Award. He also received the 1978 ”creative thinking” award of the Israeli Defense Forces, the 1994 Tel-Aviv University prize for Excellent Young Scientists, the 1995 Research Prize of the Israeli Electronic Industry, and the research prize in applied electronics of the Ex-Serviceman Association, London, awarded by Ben-Gurion University.

designed for accessibility and to further open science