The goal of linear Independent Component Analysis (ICA) is to find an unmixing function of the given data such that the resulting representation has statistically independent components. Common tools for solving this problem are based on maximizing some measure of nongaussianity, e.g. kurtosis (Hyvärinen, 1999; Bell and Sejnowski, 1995) or skewness (Spurek et al., 2017). An obvious limitation of those approaches is the assumption of linearity, as real-world data usually contain complicated, nonlinear dependencies (see for instance (Larson, 1998; Ziehe et al., 2000)). Designing an efficient and easily implementable nonlinear analogue of ICA is a much more complex problem than its linear counterpart. A crucial complication is that, without any limitations imposed on the space of mixing functions, the nonlinear ICA problem is ill-posed, as there are infinitely many valid solutions (Hyvärinen and Pajunen, 1999).
As an alternative to the fully unsupervised setting of nonlinear ICA, one can assume some prior knowledge about the distribution of the sources, which makes it possible to obtain identifiability (Hyvärinen and Morioka, 2016; Hyvärinen et al., 2019). Several algorithms exploiting this property have recently been proposed, assuming either access to segment labels of the sources (Hyvärinen and Morioka, 2016), temporal dependency of the sources (Hyvärinen and Morioka, 2017) or, more generally, that the sources are conditionally independent and the conditioning variable is observed along with the mixes (Hyvärinen et al., 2019; Khemakhem et al., 2019). However, it may sometimes be hard to generalize those approaches to the fully unsupervised setting, where such prior knowledge is unavailable or the properties of the data remain unknown to the researcher.
An additional complication in devising nonlinear ICA algorithms lies in proposing an efficient measure of independence whose optimization would encourage the model to disentangle the components. One of the most common nonlinear methods is MISEP (Almeida, 2003) which, similarly to the popular INFOMAX algorithm (Bell and Sejnowski, 1995), uses the mutual information criterion. In consequence, the procedure involves the calculation of the Jacobian of the modeled nonlinear transformation, which often causes a computational overhead when both the input and output dimensions are large. Another approach is applied in NICE (Nonlinear Independent Component Estimation) (Dinh et al., 2014). The authors propose a fully invertible neural network architecture where the Jacobian is trivially obtained. The independent components are then estimated using the maximum likelihood criterion. The drawback of both MISEP and NICE is that they require choosing the prior distribution family of the unknown independent components. An alternative approach is given by ANICA (Adversarial nonlinear ICA) (Brakel and Bengio, 2017), where the independence measure is directly learned in each task with the use of a GAN-like adversarial method combined with an autoencoder architecture. However, the introduction of a GAN-based independence measure often results in unstable adversarial training.
In this paper we present a competitive approach to nonlinear independent component analysis: WICA (Nonlinear Weighted ICA). A crucial role in our approach is played by the conclusion from (Bedychaj et al., 2019), which proves that to verify nonlinear independence it is sufficient to check the linear independence of the normally weighted dataset, see Fig. 1. Based on this result we introduce the weighted independence index (wii), which relies on computing weighted covariance and can be applied to the verification of nonlinear independence, see Section 2. Consequently, the constructed WICA algorithm is based on simple operations on matrices, and therefore is well suited to GPU calculation and parallel processing.
We construct it by incorporating the introduced cost function into the auto-encoder framework commonly used in ICA problems (Brakel and Bengio, 2017; Le et al., 2011), where the role of the decoder is to limit the unmixing function so that the independent components learned by the encoder contain the information needed to reconstruct the inputs, see Section 3. We verified our algorithm on a source signal separation problem. In Section 6, we present the results of WICA for nonlinear mixes of images and for the decomposition of electroencephalogram signals. It turns out that WICA outperforms other methods of nonlinear ICA, both with respect to unmixing quality and the stability of the results, see Fig. 10.
To fairly evaluate various nonlinear ICA methods in the case of higher dimensional datasets, we introduce a measure called OTS, based on Spearman’s rank correlation coefficient. In the definition of OTS, similarly to the clustering accuracy (ACC) (Cai et al., 2005, 2010), we use optimal transport to obtain the minimal mismatch cost. This approach has its merit here, since the correspondence between the input coordinates and the reconstructed components in a higher dimensional space is nontrivial.
Another important ingredient of this paper is the introduction of a new and fully invertible nonlinear mixing function. In the case of linear ICA, one can easily construct many experimental settings that can be used to evaluate and compare different methods. Such standards are unfortunately not present in the case of nonlinear ICA. Therefore it is not clear what kind of nonlinear mixing should be used in the benchmark experiments. In most cases authors use mixing functions which correspond to the architecture of their models (Almeida, 2003; Brakel and Bengio, 2017). In contrast to such methodology, we propose a new iterative nonlinear mixing function based on flow models (Dinh et al., 2014; Kingma and Dhariwal, 2018). This method does not relate to the internal design of our network architecture, is invertible, and allows for changing the task complexity by varying the number of iterations, making it a useful tool in the verification of nonlinear ICA models.
Let us consider a random vector $X$ in $\mathbb{R}^d$ with density $f$. Then $X$ has independent components iff $f$ factors as
$$f(x_1, \ldots, x_d) = f_1(x_1) \cdots f_d(x_d)$$
for some densities $f_1, \ldots, f_d$. Those functions are called the marginal densities of $f$. A related, but much weaker notion, is uncorrelatedness. We say that $X$ has uncorrelated components if the covariance of $X$ is diagonal. Contrary to independence, correlation has estimators that are fast and easy to compute. Independence of the components implies uncorrelatedness, but the converse does not hold, see Fig. 1.
Figure 1. Sample from a random vector whose Pearson’s correlation is equal to zero (left), but whose components are not independent. Since the components are not independent, one can choose Gaussian weights so that Pearson’s correlation of the weighted dataset is not zero (right).
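The effect described in Fig. 1 can be reproduced numerically. The following is a minimal sketch, not the authors' experiment: the pair $(x, x^2)$, the sample size, and the center `p` are illustrative assumptions chosen only to make the phenomenon visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dependent but uncorrelated pair: the second coordinate is a function of the first.
x = rng.uniform(-1.0, 1.0, size=20_000)
sample = np.column_stack([x, x**2])


def weighted_corr(data, weights):
    """Pearson correlation of a two-column sample under non-negative weights."""
    cov = np.cov(data.T, aweights=weights, ddof=0)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])


# Unweighted correlation is close to zero because E[x^3] = 0 for symmetric x.
plain = weighted_corr(sample, np.ones(len(sample)))

# Gaussian weights N(p, I) evaluated at every point (constant factor omitted,
# it cancels in the correlation). The center p is a hypothetical choice.
p = np.array([1.0, 0.5])
w = np.exp(-0.5 * np.sum((sample - p) ** 2, axis=1))
weighted = weighted_corr(sample, w)

print(f"plain correlation:    {plain:.3f}")
print(f"weighted correlation: {weighted:.3f}")
```

Shifting the center away from the symmetry axis of the data breaks the cancellation that makes the unweighted correlation vanish, which is why the weighted value is clearly positive.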
Let us mention that there exist several measures which verify independence. One of the most well-known measures of independence of random vectors is the distance correlation (dCor) (Székely et al., 2007), which is applied in (Matteson and Tsay, 2017) to solve the linear ICA problem. Unfortunately, to verify the independence of components of the samples, dCor needs comparisons, where d is the dimension of the sample and N is the sample size. Moreover, even a simplified version of dCor which checks only pairwise independence has high complexity and does not obtain very good results (which can be seen in the experiments from Section 6). This motivates the research into fast, stable and efficient measures of independence which are adapted to GPU processing.
2.1 Introducing the wii index
In this subsection we fill this gap and introduce a method of verifying independence which is based on the covariance of the weighted data. The covariance scales well with respect to the sample size and data dimension, therefore the proposed covariance-based index inherits similar properties.
To proceed further, let us introduce weighted random vectors.
Definition 2.1. Let $w : \mathbb{R}^d \to \mathbb{R}_+$ be a bounded weighting function. By $X_w$ we denote a weighted random vector with density
$$f_w(x) = \frac{w(x) f(x)}{\int w(z) f(z)\,dz}.$$
Observation 2.1. Let X be a random vector which has independent components, and let w be an arbitrary weighting function. Then $X_w$ has independent components as well.
One of the main results of (Bedychaj et al., 2019) is that a strong version of the converse of the above observation holds. Given $m \in \mathbb{R}^d$, we consider the weighting of X by the standard Gaussian with center at m (N(m, I)):
We quote the following result which follows directly from the proof of Theorem 2 from (Bedychaj et al., 2019):
Theorem 2.1. Let X be a random vector, and let $p \in \mathbb{R}^d$ and $r > 0$ be arbitrary. If the weighting of X by $N(q, I)$ has linearly independent components for every $q \in B(p, r)$, where $B(p, r)$ is a ball with center p and radius r, then X has independent components.
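Given a sample $X = (x_i)_{i=1}^{N} \subset \mathbb{R}^d$ and weights $W = (w_i)_{i=1}^{N} \subset \mathbb{R}_+$, we define the weighted sample as:
$$X_W = (x_i, w_i)_{i=1}^{N}.$$
Then the mean and covariance of the weighted sample are given by:
$$\mathrm{mean}\,X_W = \frac{1}{\sum_i w_i} \sum_i w_i x_i$$
and
$$\mathrm{cov}\,X_W = \frac{1}{\sum_i w_i} \sum_i w_i \,(x_i - \mathrm{mean}\,X_W)(x_i - \mathrm{mean}\,X_W)^T.$$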
The informal conclusion from the above theorem can be stated as follows: if the covariance of the sample X weighted by $N(p, I)$ is (approximately) diagonal for a sufficiently large set of p, then the sample X was generated from a distribution with independent components.
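Let us now define an index which measures the distance from independence. We define the weighted independence index (wii(X, p)) as
$$\mathrm{wii}(X, p) = \frac{2}{d(d-1)} \sum_{1 \le i < j \le d} c_{ij}^2,$$
where d is the dimension of X and
$$c_{ij} = \frac{2 z_{ij}}{z_{ii} + z_{jj}},$$
with $Z = [z_{ij}]$ denoting the covariance of the sample X weighted by $N(p, I)$.
Observation 2.2. Let us first observe that $c_{ij}$ is a measure close to the correlation $\rho_{ij} = z_{ij} / \sqrt{z_{ii} z_{jj}}$:
$$|c_{ij}| \le |\rho_{ij}|,$$
where the equality holds iff the i-th and j-th components of the weighted sample have equal standard deviations.
Proof. Obviously
$$|c_{ij}| = \frac{2 \sqrt{z_{ii} z_{jj}}}{z_{ii} + z_{jj}} \, |\rho_{ij}|.$$
Since $2ab \le a^2 + b^2$ (where the equality holds iff a = b), we obtain the assertion of the observation.
Consequently, wii(X, p) = 1 iff all components of the weighted sample are linearly dependent and have equal standard deviations. Thus, the minimization of wii simultaneously aims at maximizing the independence and increasing the difference between the standard deviations.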
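We extend the index to a sequence of points $p_1, \ldots, p_n$ as the mean of the indexes for each $p_i$:
$$\mathrm{wii}(X; p_1, \ldots, p_n) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{wii}(X, p_i).$$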
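As an illustration, the index can be computed with a few matrix operations. The sketch below assumes that wii(X, p) is the mean of $c_{ij}^2$ over pairs $i < j$, with $c_{ij} = 2 z_{ij} / (z_{ii} + z_{jj})$ and $Z$ the covariance of the sample under Gaussian weights $N(p, I)$; this form is inferred from Observation 2.2 and from the fact that the index equals 1 for linearly dependent components with equal standard deviations. The function names are hypothetical.

```python
import numpy as np


def weighted_cov(X, p):
    """Covariance of the rows of X under Gaussian weights N(p, I)."""
    # The normalizing constant of the Gaussian cancels, so it is omitted.
    w = np.exp(-0.5 * np.sum((X - p) ** 2, axis=1))
    return np.cov(X.T, aweights=w, ddof=0)


def wii(X, p):
    """Mean of c_ij^2 over i < j, with c_ij = 2 z_ij / (z_ii + z_jj)."""
    Z = weighted_cov(X, p)
    var = np.diag(Z)
    C = 2.0 * Z / (var[:, None] + var[None, :])
    i, j = np.triu_indices(X.shape[1], k=1)
    return float(np.mean(C[i, j] ** 2))


def wii_points(X, centers):
    """Average of wii(X, p) over a sequence of centers."""
    return float(np.mean([wii(X, p) for p in centers]))


rng = np.random.default_rng(0)
independent = rng.normal(size=(5000, 3))
x = rng.normal(size=5000)
duplicated = np.column_stack([x, x])  # fully dependent, equal deviations

print(wii(independent, np.zeros(3)))  # close to 0
print(wii(duplicated, np.zeros(2)))   # equal to 1 up to rounding
```

Only a weighted covariance matrix is needed, so the cost grows linearly with the sample size and quadratically with the dimension.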
2.2 Selecting the weighting points
To implement the weighted independence index in practice, we need to find a good choice of weighting centers. First, we assume that the dataset in question is normalized componentwise (in particular, the variance of each coordinate is one). We argue that the right choice of centers should satisfy the following two conditions:
• selected weights do not concentrate on a small percentage of the data,
• for different centers selected from the dataset, weights diversify the data points.
At first glance, it would seem that the simplest choice for the points is to sample them from the standard normal distribution. However, our preliminary experiments (see Fig. 2) demonstrate that sampling from $N(0, \frac{1}{d} I)$ would be a better choice.
Figure 2. In the experiment, we sampled twenty points from N(0, I) (x-axis). Then, we calculated the weights of the points with respect to $N(p, I)$. We present the values of those weights (sorted in decreasing order) for centers $p$ chosen according to $N(0, I)$ and $N(0, \frac{1}{d} I)$. One can see that weights derived from $N(0, \frac{1}{d} I)$ balance more data points, in contrast to N(0, I), which focuses on a smaller amount of data (the weights for N(0, I) converge to 0 earlier).
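Consider the case when the data come from the standard normal distribution. For given weights w and density f we define the measure
$$P(w, f) = \frac{\left( \int w(x) f(x)\,dx \right)^2}{\int w^2(x) f(x)\,dx}. \qquad (1)$$
Observe that if w is constant on a subset U of the support of f (on which the functions w and f are well-defined) and zero otherwise, then the above reduces to $\int_U f(x)\,dx$, which for a finite sample is the normalized counting measure of U. Intuitively, P(w, f) returns the percentage of the population which has nontrivial weights.
Let $w = N(p, I)$ be the weighting density and let $f = N(0, I)$, that is, our dataset is normalized as stated above. Then, directly from (1), one obtains:
$$P(N(p, I), N(0, I)) = \frac{\left( \int N(p, I)(x)\, N(0, I)(x)\,dx \right)^2}{\int N(p, I)^2(x)\, N(0, I)(x)\,dx}.$$
Applying the formula for the product of two normal densities:
$$\int N(m_1, \Sigma_1)(x)\, N(m_2, \Sigma_2)(x)\,dx = N(m_1, \Sigma_1 + \Sigma_2)(m_2),$$
we get:
$$\int N(p, I)(x)\, N(0, I)(x)\,dx = N(p, 2I)(0)$$
for the numerator, and
$$\int N(p, I)^2(x)\, N(0, I)(x)\,dx = (4\pi)^{-d/2}\, N(p, \tfrac{3}{2} I)(0)$$
for the denominator. The equation for the denominator follows from the simple fact that:
$$N(p, I)^2(x) = (4\pi)^{-d/2}\, N(p, \tfrac{1}{2} I)(x).$$
Summarizing, we obtain that
$$P(N(p, I), N(0, I)) = \left( \tfrac{3}{4} \right)^{d/2} \exp\left( -\tfrac{1}{6} \|p\|^2 \right). \qquad (2)$$
Normalizing (2) by its maximum obtained at 0, we get
$$\exp\left( -\tfrac{1}{6} \|p\|^2 \right).$$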
Clearly, if p were chosen from the standard normal distribution, the value of $\|p\|^2$ for large dimensions would be approximately d, and consequently the weights for randomly chosen points would become concentrated at a single point (see Fig. 2). To keep the quotient approximately constant, we should choose p so that its norm is approximately one. This leads to the choice of p from the distribution $N(0, \frac{1}{d} I)$. One can observe that if $X \sim N(0, I)$, then we can sample from $N(0, \frac{1}{d} I)$ by taking the mean of d randomly chosen vectors from X. This leads to the following definition:
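The concentration argument can be checked empirically. The sketch below is an illustration under stated assumptions: the share of the population with nontrivial weights is estimated by the effective-sample-size style ratio $(\sum_i w_i)^2 / (n \sum_i w_i^2)$, and the dimension, sample size and number of centers are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 20_000
X = rng.normal(size=(n, d))  # data assumed to follow N(0, I)


def effective_fraction(X, p):
    """(sum w)^2 / (n * sum w^2) for Gaussian weights N(p, I)."""
    w = np.exp(-0.5 * np.sum((X - p) ** 2, axis=1))
    return w.sum() ** 2 / (len(X) * np.sum(w**2))


# Centers drawn from N(0, I): their squared norm is about d.
centers_std = rng.normal(size=(30, d))

# Centers built as the mean of d random data points: squared norm about 1.
idx = rng.integers(0, n, size=(30, d))
centers_mean = X[idx].mean(axis=1)

frac_std = np.mean([effective_fraction(X, p) for p in centers_std])
frac_mean = np.mean([effective_fraction(X, p) for p in centers_mean])

print(f"centers from N(0, I):         {frac_std:.3f}")
print(f"centers as means of d points: {frac_mean:.3f}")
```

Centers with a norm close to one keep a markedly larger share of the sample in play, which is the behavior the choice of centers is meant to secure.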
Definition 2.2. For the dataset $X \subset \mathbb{R}^d$ we define
wii(X) = E{wii(Y, p) : p a mean of random d elements of Y },
where Y is a componentwise normalization of X and E stands for the expected value.
Let us summarize why centering the weights at the mean of d elements from the dataset has good properties:
• if the data is restricted to some subspace S of the space, then the mean also belongs to S;
• if the data comes from the normal distribution $N(0, I)$, then the mean of d elements comes from $N(0, \frac{1}{d} I)$;
• if the data has heavy tails (e.g. comes from a Cauchy distribution), then the distribution of the mean of a set of d elements can be close to the original dataset mean.
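Definition 2.2 suggests a simple Monte Carlo estimator: normalize the data, draw centers as means of d random rows, and average the per-center index. The sketch below is one possible reading, not the authors' implementation; the number of centers `n_centers`, the helper names, and the per-center formula (mean squared $c_{ij}$ with $c_{ij} = 2 z_{ij} / (z_{ii} + z_{jj})$) are assumptions.

```python
import numpy as np


def wii_at(Y, p):
    """Per-center index: mean squared c_ij under Gaussian weights N(p, I)."""
    w = np.exp(-0.5 * np.sum((Y - p) ** 2, axis=1))
    Z = np.cov(Y.T, aweights=w, ddof=0)
    var = np.diag(Z)
    C = 2.0 * Z / (var[:, None] + var[None, :])
    i, j = np.triu_indices(Y.shape[1], k=1)
    return np.mean(C[i, j] ** 2)


def wii_dataset(X, n_centers=32, seed=0):
    """Monte Carlo estimate of wii(X) with centers as means of d random rows."""
    rng = np.random.default_rng(seed)
    Y = (X - X.mean(axis=0)) / X.std(axis=0)   # componentwise normalization
    n, d = Y.shape
    idx = rng.integers(0, n, size=(n_centers, d))
    centers = Y[idx].mean(axis=1)              # shape (n_centers, d)
    return float(np.mean([wii_at(Y, p) for p in centers]))


rng = np.random.default_rng(1)
x = rng.normal(size=4000)
X_dep = np.column_stack([x, x + 0.1 * rng.normal(size=4000)])
X_ind = rng.normal(size=(4000, 2))

print(wii_dataset(X_dep))  # close to 1
print(wii_dataset(X_ind))  # close to 0
```

Because the centers are built from the data itself, they automatically stay inside any subspace the data occupies.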
In this section we propose the WICA algorithm for nonlinear ICA decomposition which exploits the wii(X) index in practice.
Following (Brakel and Bengio, 2017), we use an auto-encoder (AE) architecture, which consists of an encoder function $E$ and a complementary decoder function $D$.
The role of the encoder is to learn a transformation of the data that unmixes the latent components, utilizing some measure of independence (we use the wii(X) index). The decoder is responsible for limiting the encoder, so that the learned representation does not lose any information about the input. In practice, this is implemented by simultaneously minimizing the reconstruction error:
$$\mathrm{rec\_error}(X; E, D) = \sum_{i=1}^{n} \| x_i - D(E(x_i)) \|^2.$$
Reducing the difference between the input and the output is crucial to recover an unmixing mapping close to the inverse of the mixing one. Thus our final cost function is given by
$$\mathrm{cost}(X; E, D) = \mathrm{rec\_error}(X; E, D) + \beta\, \mathrm{wii}(E(X)),$$
where $\beta$ is a hyperparameter which weights the role of reconstruction against that of independence (analogous to $\beta$-VAE (Higgins et al., 2017)). The training procedure follows the steps:
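The structure of the cost (reconstruction error plus a weighted independence penalty) can be sketched without a deep learning framework. In the snippet below the encoder and decoder are plain callables, the reconstruction error is averaged over the batch (a scaling choice), and `offdiag_corr_index` is a stand-in independence index used only to keep the example short; it is not wii. All names are hypothetical.

```python
import numpy as np


def reconstruction_error(X, encoder, decoder):
    """Squared distance between inputs and reconstructions, batch-averaged."""
    X_hat = decoder(encoder(X))
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))


def cost(X, encoder, decoder, index, beta=1.0):
    """Reconstruction error plus beta times an independence index of the code."""
    return reconstruction_error(X, encoder, decoder) + beta * index(encoder(X))


def offdiag_corr_index(Z):
    """Stand-in index: mean squared off-diagonal Pearson correlation."""
    R = np.corrcoef(Z.T)
    i, j = np.triu_indices(Z.shape[1], k=1)
    return float(np.mean(R[i, j] ** 2))


rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(2000, 3))   # independent sources
A = rng.normal(size=(3, 3))                  # hypothetical linear mixing
X = S @ A.T
A_inv = np.linalg.inv(A)


def encode(x):
    return x @ A_inv.T   # exact unmixing


def decode(z):
    return z @ A.T


def identity(x):
    return x


good = cost(X, encode, decode, offdiag_corr_index, beta=10.0)
trivial = cost(X, identity, identity, offdiag_corr_index, beta=10.0)
print(good, trivial)
```

Both pairs reconstruct the input perfectly, so only the independence term separates them: the unmixing encoder is rewarded, while the identity encoder is penalized for its correlated code.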
Let us start with a discussion of possible definitions of the nonlinear mixing function used for benchmarking ICA methods. We first briefly explain some approaches used in linear ICA, and then propose a mixing which has the properties desired when comparing the results obtained by nonlinear ICA algorithms.
In the case of linear ICA the experiments are usually conducted on an artificial dataset, which is obtained by mixing two or more independent source signals. This allows for the comparison of the results returned by the analyzed methods with the original independent components. In real-world applications such a procedure is of course infeasible, but in an experimental setting it provides a good basis for benchmarking different models. In the classical ICA setup, creating an artificial mixing function is equivalent to selecting a random invertible matrix A such that $X = AS$, where S are the true sources and X are the observations, which are then passed to the evaluated methods. Such mixing is used by (Bedychaj et al., 2019; Hyvärinen, 1999; Spurek et al., 2017).
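The linear benchmark setup can be written in a few lines. This is a minimal sketch; the Laplace sources, the sizes, and storing samples as rows are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3

S = rng.laplace(size=(n, d))   # independent non-Gaussian sources, one per column
A = rng.normal(size=(d, d))    # random mixing matrix, invertible with probability 1
X = S @ A.T                    # observations: x = A s for every sample (row)

# With A known, the sources are recovered exactly by solving A s = x.
S_back = np.linalg.solve(A, X.T).T
print(np.allclose(S_back, S))
```

An ICA method sees only `X`; its output is then compared with `S`, up to permutation and scaling of the components.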
Unfortunately, there are no mixing standards for the nonlinear ICA problem. A common way of setting up comparable environments for testing nonlinear ICA models is to interlace linear mixes of signals with nonlinear functions (Almeida, 2003; Brakel and Bengio, 2017). During our experiments we found that these methods of nonlinear mixing are ineffective in large dimensions. The aforementioned approaches usually apply only a shallow stack of linear projections followed by a nonlinearity. In consequence, the obtained observations are either close to the linear mixing (and therefore not hard enough to be properly challenging for the linear models) or become degenerate (i.e. all points cluster towards zero). Results of such mixing techniques are presented in Fig. 3.
Figure 3. Results of the nonlinear mixing techniques proposed in (Brakel and Bengio, 2017) on normalized synthetic lattice data. The post-nonlinear mixing model (PNL) introduces only slight nonlinearities, which are easy to solve even for linear algorithms. On the other hand, the multilayer perceptron mixing (MLP) technique collapses after just a couple of iterations.
Figure 4. Results of our proposed mixing on normalized synthetic lattice data. One may observe that after multiple iterations of the proposed mixing, the results become highly nonlinear but do not degenerate into the obscure solutions known from the previous setup.
Because of aforementioned disadvantages we propose our own
1. Take random isometry:
where is a randomly initialized neural network and
from the split of X into half.
4. Return
One can easily increase the number of mixes by iterating the procedure, taking the splits of X in reverse order on odd iterations. The effects of applying the proposed mixing to two-dimensional data are presented in Fig. 4.
Our mixing procedure scales well to higher dimensions by iterating over the splits. Additionally, it is easily invertible, so there is a guarantee that the source components can be retrieved.
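A flow-style mixing of this kind can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' exact procedure: each step applies a random isometry, splits the coordinates in half, and adds the output of a small randomly initialized network of the first half to the second half (an additive coupling in the spirit of NICE). The network size and all names are hypothetical, and the reversal of the split order on odd iterations is omitted.

```python
import numpy as np


def make_mixing_step(d, rng, hidden=16):
    """One step: random isometry followed by an additive coupling."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
    h = d // 2
    W1 = rng.normal(size=(h, hidden))
    W2 = rng.normal(size=(hidden, d - h)) / np.sqrt(hidden)

    def phi(x1):
        # Small random network; it never has to be inverted.
        return np.tanh(x1 @ W1) @ W2

    def forward(X):
        Y = X @ Q.T
        Y1, Y2 = Y[:, :h], Y[:, h:]
        return np.hstack([Y1, Y2 + phi(Y1)])

    def inverse(Y):
        Y1, Y2 = Y[:, :h], Y[:, h:]
        X = np.hstack([Y1, Y2 - phi(Y1)])
        return X @ Q   # Q is orthogonal, so this undoes X @ Q.T

    return forward, inverse


def make_mixing(d, n_iter, seed=0):
    """Compose n_iter steps; returns the mixing and its exact inverse."""
    rng = np.random.default_rng(seed)
    steps = [make_mixing_step(d, rng) for _ in range(n_iter)]

    def mix(X):
        for forward, _ in steps:
            X = forward(X)
        return X

    def unmix(Y):
        for _, inverse in reversed(steps):
            Y = inverse(Y)
        return Y

    return mix, unmix


rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(500, 4))
mix, unmix = make_mixing(d=4, n_iter=5)
X = mix(S)
print(np.allclose(unmix(X), S))
```

The number of iterations controls how far the result departs from a linear mixing, while invertibility holds by construction at every depth.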
For the benchmark experiments we want to be able to measure the similarity between the obtained results Z and the original sources S. In the case of linear mixing the common choice is the maximum absolute correlation over all possible permutations of the signals (denoted hereafter as max_corr (Hyvarinen and Morioka, 2016; Hyvärinen and Morioka, 2017; Hyvärinen et al., 2019; Spurek et al., 2020; Zheng et al., 2007; Bengio et al., 2013; Hyvärinen, 1999)).
However, this measure is based on Pearson’s correlation coefficient and therefore is not able to capture any higher-order dependencies. To address this problem we introduce a new measure based on the nonlinear Spearman’s rank correlation coefficient and optimal transport.
Let Z denote the signal retrieved by an ICA algorithm and let $r_s(z_j, s_k)$ be Spearman’s rank correlation coefficient between the j-th component of Z and the k-th component of S. We define
$$M(Z, S) = \left[ 1 - |r_s(z_j, s_k)| \right]_{j,k = 1, \ldots, d},$$
where the zero entries indicate a monotonic relationship between the corresponding features.
This matrix is then used as the transportation cost of the components. Formally, we compute the value of the optimal transport problem formulated in terms of integer linear programming:
$$\mathrm{OTS}(Z, S) = 1 - \frac{1}{d} \min_{\gamma} \sum_{j,k} \gamma_{jk} M_{jk},$$
where
$$\sum_{j} \gamma_{jk} = 1, \qquad \sum_{k} \gamma_{jk} = 1, \qquad \gamma_{jk} \in \{0, 1\}.$$
As a result of the last constraint, the obtained transport plan defines a one-to-one map from the retrieved signals to the original sources. In addition, the proposed Spearman-based measure (OTS) is sensitive to monotonic nonlinear dependencies and is also relatively easy to compute with existing tools for integer programming.
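A Spearman-based matching score of this kind is straightforward to compute. The sketch below assumes the score is one minus the mean cost of the optimal one-to-one matching, with cost $1 - |r_s|$; the function names are hypothetical. It uses SciPy's `linear_sum_assignment`, which solves the assignment problem exactly, so the integrality constraint is satisfied without a general integer programming solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import rankdata


def spearman_cross(Z, S):
    """Spearman correlations between columns of Z (rows) and S (columns)."""
    d = Z.shape[1]
    # Spearman's coefficient is Pearson's coefficient computed on ranks.
    ranks = np.hstack([rankdata(Z, axis=0), rankdata(S, axis=0)])
    return np.corrcoef(ranks.T)[:d, d:]


def ots(Z, S):
    """One minus the mean cost of the optimal one-to-one matching."""
    M = 1.0 - np.abs(spearman_cross(Z, S))   # transport cost matrix
    rows, cols = linear_sum_assignment(M)    # optimal assignment
    return float(1.0 - M[rows, cols].mean())


rng = np.random.default_rng(0)
S = rng.normal(size=(500, 3))

# Permuted components passed through a monotone nonlinearity.
Z_monotone = np.exp(S[:, [2, 0, 1]])
# Components unrelated to the sources.
Z_random = rng.normal(size=(500, 3))

print(ots(Z_monotone, S))  # equal to 1 up to rounding
print(ots(Z_random, S))    # close to 0
```

Since ranks are invariant under monotone maps, the first case scores 1 even though the Pearson correlation between the matched columns is below 1.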
Another difference between OTS and max_corr is that the latter favors strong disentanglement of a few components, while OTS favors outcomes that decompose the observation more equally. In other words, consider an experiment in which n signals were mixed. Further, assume that some (nonlinear) ICA algorithm failed to unmix all but one component (i.e. only one unmixed component matches exactly one source signal, while the rest are still highly unrecognizable). In such a situation the max_corr value will be significantly higher than OTS, although only a small portion of the base dataset was recovered.
In order to empirically demonstrate this property, we artificially mixed a multidimensional grid using the mixing function from Section 4. Next, we randomly swapped one of the mixed signals with the original signal from the base dataset. We compared this mixed-and-swapped data to the source signals using max_corr and OTS. The results over different mixing iterations are presented in Fig. 5. One may observe that max_corr values are always above the OTS ones, suggesting that the max_corr measure prefers such a recovery more than OTS. Naturally, in the case when all signals are far from the true sources, the values of max_corr and OTS are almost exactly the same (see Fig. 6).
In consequence, the max_corr measure can help to assess the maximum informativeness of the retrieved signal. This can be desired in situations that favor good decomposition of a few components at the cost of lower correlation of the remaining ones (which may happen, for instance, in denoising problems). In the case where approximately equal recovery of all the signals is requested, the OTS measure would be a better choice.
Figure 5 (panels: 2, 4, 6 and 8 dimensions). Results of the experiment where, in an n-dimensional mixed observation, one component was swapped with a randomly chosen source signal. One may observe that max_corr almost always prefers such a situation, while OTS seems to be more rigorous.
Figure 6 (panels: 2, 4, 6 and 8 dimensions). OTS and max_corr values for a fully mixed dataset. One may observe that both measures give similar outcomes in this case.
In this section we present several simulated experiments that validate the WICA algorithm empirically. Because there is no clear benchmark definition for nonlinear ICA evaluation, we selected the most illustrative and easily interpretable setup, which we present in the following subsections. In addition, we analyzed an electroencephalographic (EEG) signal according to the procedure presented in (Sun et al., 2005; Onton and Makeig, 2006), to validate our method in a more natural setting, that is, without artificially generated mixing and access to the true source components.
Figure 7. Two-dimensional example of the problem of unmixing natural images. One can easily spot that WICA leaves the smallest amount of artifacts after retrieving the signals. All of the scatter plots were normalized and are presented in the same scale. It is also valuable to look at the attached marginal histograms, where some of the similarities between the original signal and its retrieved counterpart may be observed.
6.1 Qualitative results
We start with a simulated example of an ICA application to the image separation problem. We use this regime because the results can be assessed by the reader with the naked eye.
To construct this experiment one needs to apply some artificial mixing function (e.g. a linear transformation or the mixing function from Section 4) to the independent source signals. Such a mixture is then passed to the ICA model in question to perform the unmixing task.
In order to compare the WICA algorithm to other nonlinear ICA approaches we evaluated the models' performance on the separation of artificially mixed images. As an initial setup for this blind source separation task, we randomly sampled two flattened images from the Berkeley Segmentation Dataset (Martin et al., 2001) and mixed them using the function defined in Section 4. We compared our method with dCor (Spurek et al., 2020), PNLMISEP (Zheng et al., 2007), ANICA (Brakel and Bengio, 2017) and linear FastICA. Results of this toy example are presented in Fig. 7.
Besides the retrieved images and their scatter plots, we also show the projections of the marginal densities. The desired goal is to achieve images and marginal densities similar to those of the source (original) pictures.
One can easily spot that FastICA and dCor seem to only rotate the mixed signals. ANICA, on the other hand, transformed the observations to a large extent, but the recovered signals are visually worse than the original pictures. Similarly to the previous algorithm, PNLMISEP and WICA also performed some nontrivial shift on the marginal densities, but in this case the retrieved densities resemble the original ones more naturally.
This experiment was fully qualitative and the outcome is subject to one’s individual perception. We show the images purely as a visualization of the performance of different ICA models in a simple nonlinear setup. We report quantitative results in the next subsection.
Figure 8. The mean rank results for different mixes measured by max_corr (top) and OTS (bottom). The lower the better.
6.2 Quantitative results
From the preliminary results reported in the previous subsection, we moved to a more complex scenario in which we quantitatively evaluated the ICA methods in a higher dimensional setup.
We uniformly sampled d flattened images from the Berkeley Segmentation Dataset (Martin et al., 2001) to form the source components. We used five different source dimensions, $d \in \{2, 4, 6, 8, 10\}$. The observations were then obtained by using the function described in Section 4, applied iteratively $i \in \{10, 20, 30, 40, 50\}$ times. For each dimension d we randomly picked 5 different sets of source images. Every method was evaluated 10 times on each set of sources, dimensions and mixes.
We fit each nonlinear algorithm using a grid search over the learning rate. For the auto-encoder based models we also performed a grid search over the scaling of the independence measure. These hyper-parameters were tuned on randomly sampled observations from the set of all obtained mixtures. The examples used to tune the architectures were then excluded from the dataset on which we performed the actual evaluation. It is worth mentioning that we had to fix the batch size to 256, because any bigger value caused instabilities in the ANICA results. To be fair in the comparisons, we used the same neural network architecture for WICA, ANICA and dCor. Both the encoder and the decoder were composed of 3 hidden layers with 128 neurons each. In the case of MISEP we used the PNL version from (Zheng et al., 2007). The outcomes of each method were measured by both max_corr and OTS against the true source components.
Figure 9. Results of the analysis of the EEG signals. After removing from the decomposition the suspicious signals selected by an expert, one can easily spot that the reconstructed components are more homogeneous and do not have as many artifacts as the original EEG data. In both methods the same number of signals was cleared. The results are satisfactory in both cases. Additionally, WICA preserves the scale of the retrieved signals, which is a helpful property in further cleansing of the EEG data.
Figure 10. Comparison between standard ICA methods (PNLMISEP, dCor, ANICA, FastICA) and our approach by using OTS (left) and max_corr (right) measures in the setup where 50 mixing iterations were performed. In the experiment we train five models and present the mean and standard deviation of each of the used measures (the higher the better). One can observe that WICA consistently obtains good results for all of the dimensions and outperforms the other methods in higher dimensions. Moreover, it has the lowest standard deviation across all the nonlinear algorithms. More numerical results of the experiment are presented in Table 1.
Performance across different dimensions. We plotted the results of this experiment with respect to the data dimension d in Fig. 10. The outcomes demonstrate that the WICA method outperformed every other nonlinear algorithm in the proposed task by achieving high and stable results regardless of the considered data dimension. In terms of the stability of the results, WICA loses only to the linear method, FastICA, which, unfortunately, cannot satisfactorily factorize nonlinear data. This experiment demonstrates that WICA is a strong competitor to other models in a fully unsupervised environment for nonlinear ICA.
It is also worth mentioning the difference between the results measured by OTS and max_corr for the ANICA and FastICA models applied in high dimensions. We hypothesize that this may indicate that those algorithms were able to retrieve only a small subset of the components very well, while the remaining variables were still highly mixed, leading to an effect similar to the one described in Section 5.
Performance across different mixes. For every model we evaluated the mean OTS and max_corr score on a given dimension d and number of mixing iterations i. Then, for each pair (d, i) we ranked the tested models based on their performance. We report the mean rank of models for each mixing iteration i in Fig. 8 (the lower the better).
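The ranking procedure can be sketched as follows. The scores below are random placeholders and the model names, dimensions and iteration counts are hypothetical; only the ranking and averaging logic is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
models = ["model_a", "model_b", "model_c"]   # hypothetical model names
dims = [2, 4, 6]                             # hypothetical dimensions d
mixes = [10, 20]                             # hypothetical mixing iterations i

# scores[m, a, b]: placeholder score of models[m] for dims[a] and mixes[b].
scores = rng.uniform(size=(len(models), len(dims), len(mixes)))

# Rank the models inside every (d, i) cell; rank 1 is the highest score.
# A double argsort turns the sort order into ranks (valid without ties).
ranks = np.argsort(np.argsort(-scores, axis=0), axis=0) + 1

# Average the ranks over dimensions: one mean rank per model and mixing count.
mean_rank = ranks.mean(axis=1)

for m, name in enumerate(models):
    print(name, dict(zip(mixes, np.round(mean_rank[m], 2))))
```

Ranking inside each cell before averaging makes the comparison insensitive to the fact that scores for different dimensions and mixing depths live on different scales.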
One may observe that for tasks relatively similar to the linear case, where the number of mixes is equal to 10, the PNLMISEP method performs best on both max_corr and OTS. However, as the number of mixes increases, the WICA algorithm usually outperforms all the other methods in both measures, achieving the lowest mean rank. As a complement to the above discussion we also provide the complete numerical results for all mixtures on all tested dimensions in Table 1.
Mixing iterations | d = 2 | d = 4 | d = 6 | d = 8 | d = 10
10 | 0.771 | 0.910 | 0.821 | 0.814 | 0.812
20 | 0.870 | 0.957 | 0.795 | 0.844 | 0.858
30 | 0.925 | 0.820 | 0.887 | 0.746 | 0.835
40 | 0.862 | 0.847 | 0.701 | 0.861 | 0.859
50 | 0.759 | 0.774 | 0.769 | 0.831 | 0.819

Mixing iterations | d = 2 | d = 4 | d = 6 | d = 8 | d = 10
10 | 0.798 | 0.890 | 0.807 | 0.784 | 0.742
20 | 0.884 | 0.945 | 0.776 | 0.797 | 0.790
30 | 0.797 | 0.805 | 0.865 | 0.702 | 0.782
40 | 0.869 | 0.781 | 0.636 | 0.729 | 0.820
50 | 0.828 | 0.735 | 0.735 | 0.766 | 0.766
Table 1. Comparison between nonlinear ICA methods (PNLMISEP, dCor, ANICA, WICA) and the classical linear ICA approach (FastICA) on the image separation problem (with different dimensions) using the max_corr and OTS measures. In the experiment we tuned and trained four models (excluding FastICA, which is a linear model) and present the mean and standard deviation in tabular form.
6.3 Decomposing EEG data
Finally, we want to show the usability of the WICA method on real-life data. An example of a task that can be tackled by ICA algorithms is electroencephalogram (EEG) decomposition.
An EEG is a test used to evaluate the electrical activity in the brain. The brain cells communicate via electrical impulses and are active all the time. In the original scalp channel data, each row of the data recording matrix represents the time course of summed voltage differences between source projections to one data channel and one or more reference channels. We followed a common experimental framework proposed in (Sun et al., 2005; Onton and Makeig, 2006) to detect artifacts in the unmixed signal representation which can indicate blinks or eye movement during the test.
The setup for this decomposition is different from that in the previous sections. The original EEG mixture taken for this experiment consisted of 40 scalp electrode signals. Those signals were selected as an input for the WICA model. The retrieved data were analysed by an expert, who identified signs of blinking in the recovered components. The manually selected subset of suspicious components was then nullified. The unmixed signal with masked (nullified) components was then fed back to the decoder obtained from the training of the WICA model.
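The cleaning step (encode, nullify the flagged components, decode) can be sketched as follows. The encoder and decoder below are linear stand-ins built from a random orthogonal matrix, not a trained WICA model, and the flagged indices are arbitrary; in the experiment they come from an expert's inspection.

```python
import numpy as np


def remove_components(X, encoder, decoder, bad):
    """Encode, zero the components flagged as artifacts, and decode back."""
    Z = np.array(encoder(X), copy=True)
    Z[:, list(bad)] = 0.0
    return decoder(Z)


rng = np.random.default_rng(0)
n_samples, n_channels = 500, 40

# Stand-in for a trained encoder/decoder pair: an orthogonal linear map.
Q, _ = np.linalg.qr(rng.normal(size=(n_channels, n_channels)))


def encoder(X):
    return X @ Q


def decoder(Z):
    return Z @ Q.T


X = rng.normal(size=(n_samples, n_channels))   # placeholder for EEG recordings
bad = [3, 17]                                  # hypothetical flagged components
cleaned = remove_components(X, encoder, decoder, bad)

print(cleaned.shape)
```

The output keeps the original channel layout, so the cleaned recording can be inspected with the same tools as the raw one.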
As researchers, we do not know how deeply EEG signals are mixed or dependent. The crucial functionality that ICA serves in this setting is normalizing and cleansing the dataset. From that point, the time series produced from the recovered signals have to be analysed by an expert. In this experiment we want to show that the high dimension of the input data and the unknown entanglement of the components are not a limitation for WICA. Visual results of this experiment are presented in Fig. 9. For comparison we used the results of another standard ICA algorithm used for this kind of task, linear FastICA. The details of the "remixing" process for this method are described in (Sun et al., 2005). This experiment showed that WICA is able to handle multidimensional data of a size well above that tested for other nonlinear models. Moreover, the results of our method on this task are good enough for it to be used as a preliminary step in cleaning the data.
In this paper we presented a new approach to the nonlinear ICA task.
In addition to the investigation of the WICA method, which matches or exceeds the results of the other tested nonlinear algorithms, we proposed a new mixing function for validating nonlinear tasks in a structured manner. Our mixing scales to higher dimensions and is easily invertible.
Lastly, we defined OTS, a measure that can catch nonlinear dependence and is easy to compute. The OTS measure and the proposed mixing have the potential to become benchmarking tools for all future work in this field.
The work of P. Spurek was supported by the National Centre of Science (Poland) Grant No. 2019/33/B/ST6/00894. The work of J. Tabor was supported by the National Centre of Science (Poland) Grant No. 2017/25/B/ST6/01271. A. Nowak carried out this work within the research project "Bio-inspired artificial neural networks" (grant no. POIR.04.04.00-00-14DE/18-00) within the Team-Net program of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund.
Almeida, L. B. (2003), ‘Misep–linear and nonlinear ica based on mutual information’, Journal of Machine Learning Research 4(Dec), 1297–1318.
Bedychaj, A., Spurek, P., Struski, Ł. and Tabor, J. (2019), ‘Independent component analysis based on multiple data-weighting’, arXiv preprint arXiv:1906.00028.
Bell, A. J. and Sejnowski, T. J. (1995), ‘An information-maximization approach to blind separation and blind deconvolution’, Neural computation 7(6), 1129–1159.
Bengio, Y., Courville, A. and Vincent, P. (2013), ‘Representation learning: A review and new perspectives’, IEEE transactions on pattern analysis and machine intelligence 35(8), 1798–1828.
Brakel, P. and Bengio, Y. (2017), ‘Learning independent features with adversarial nets for non-linear ica’, arXiv preprint arXiv:1710.05050.
Cai, D., He, X. and Han, J. (2005), ‘Document clustering using locality preserving indexing’, IEEE Transactions on Knowledge and Data Engineering 17(12), 1624–1637.
Cai, D., He, X. and Han, J. (2010), ‘Locally consistent concept factorization for document clustering’, IEEE Transactions on Knowledge and Data Engineering 23(6), 902–913.
Dinh, L., Krueger, D. and Bengio, Y. (2014), ‘Nice: Non-linear independent components estimation’, arXiv preprint arXiv:1410.8516.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M. M., Mohamed, S. and Lerchner, A. (2017), beta-vae: Learning basic visual concepts with a constrained variational framework, in ‘ICLR’.
Hyvärinen, A. (1999), ‘Fast and robust fixed-point algorithms for independent component analysis’, Neural Networks, IEEE Transactions on 10(3), 626–634.
Hyvarinen, A. and Morioka, H. (2016), Unsupervised feature extraction by time-contrastive learning and nonlinear ica, in ‘Advances in Neural Information Processing Systems’, pp. 3765–3773.
Hyvärinen, A. and Morioka, H. (2017), Nonlinear ica of temporally dependent stationary sources, in ‘International Conference on Artificial Intelligence and Statistics’, Microtome Publishing, pp. 460–469.
Hyvärinen, A. and Pajunen, P. (1999), ‘Nonlinear independent component analysis: Existence and uniqueness results’, Neural Networks 12(3), 429–439.
Hyvärinen, A., Sasaki, H. and Turner, R. E. (2019), Nonlinear ica using auxiliary variables and generalized contrastive learning, in ‘The 22nd International Conference on Artificial Intelligence and Statistics’, Journal of Machine Learning Research, pp. 859–868.
Khemakhem, I., Kingma, D. P. and Hyvärinen, A. (2019), ‘Variational autoencoders and nonlinear ica: A unifying framework’, arXiv preprint arXiv:1907.04809.
Kingma, D. P. and Dhariwal, P. (2018), Glow: Generative flow with invertible 1x1 convolutions, in ‘Advances in Neural Information Processing Systems’, pp. 10215–10224.
Larson, L. E. (1998), ‘Radio frequency integrated circuit technology for low-power wireless communications’, IEEE Personal Communications 5(3), 11–19.
Le, Q. V., Karpenko, A., Ngiam, J. and Ng, A. Y. (2011), Ica with reconstruction cost for efficient overcomplete feature learning, in ‘Advances in Neural Information Processing Systems’, pp. 1017–1025.
Lisha Sun, Ying Liu and Beadle, P. J. (2005), Independent component analysis of eeg signals, in ‘Proceedings of 2005 IEEE International Workshop on VLSI Design and Video Technology, 2005.’, pp. 219–222.
Martin, D., Fowlkes, C., Tal, D. and Malik, J. (2001), A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in ‘Proc. 8th Int’l Conf. Computer Vision’, Vol. 2, pp. 416–423.
Matteson, D. S. and Tsay, R. S. (2017), ‘Independent component analysis via distance covariance’, Journal of the American Statistical Association pp. 1–16.
Onton, J. and Makeig, S. (2006), Information-based modeling of event-related brain dynamics, in ‘Progress in Brain Research’, Elsevier, pp. 99–120.
Spurek, P., Nowak, A., Tabor, J., Maziarka, Ł. and Jastrzębski, S. (2020), Non-linear ica based on cramer-wold metric, in ‘International Conference on Neural Information Processing’, Springer, pp. 294–305.
Spurek, P., Tabor, J., Rola, P. and Ociepka, M. (2017), ‘Ica based on asymmetry’, Pattern Recognition 67, 230–244.
Székely, G. J., Rizzo, M. L., Bakirov, N. K. et al. (2007), ‘Measuring and testing dependence by correlation of distances’, The annals of statistics 35(6), 2769–2794.
Zheng, C.-H., Huang, D.-S., Li, K., Irwin, G. and Sun, Z.-L. (2007), ‘Misep method for postnonlinear blind source separation’, Neural computation 19, 2557–78.
Ziehe, A., Muller, K.-R., Nolte, G., Mackert, B.-M. and Curio, G. (2000), ‘Artifact reduction in magnetoneurography based on time-delayed second-order correlations’, IEEE Transactions on biomedical Engineering 47(1), 75–87.