Discovering Nonlinear Relations with Minimum Predictive Information Regularization

2020·Arxiv

Abstract

Abstract

Identifying the underlying directional relations from observational time series with nonlinear interactions and complex relational structures is key to a wide range of applications, yet remains a hard problem. In this work, we introduce a novel minimum predictive information regularization method to infer directional relations from time series, allowing deep learning models to discover nonlinear relations. Our method2 substantially outperforms other methods for learning nonlinear relations in synthetic datasets, and discovers the directional relations in a video game environment and a heart-rate vs. breath-rate dataset.

1 Introduction and Related Work

Imagine a dataset with tens or hundreds of observational time series. There may exist interesting directional relations between the time series which we want to uncover, but their relation graph may be complicated, and the relation may be nonlinear as we do not know its functional form. How can we discover the underlying relations of those challenging scenarios in an efficient way, or at least identify candidate relations that are worth further investigation by a researcher? Problems of this type are omnipresent and important in a variety of scientific endeavors and applications, e.g., gene regulatory networks [24], neuroscience [30, 42], economics [11, 45] and finance [17, 13].

To address this question, the field of causal learning has proposed a large class of methods to discover or quantify causal relations. These methods have certain limitations in regards to capability of handling nonlinearity, and/or scalability and efficiency to large numbers of time series. Pearl [32, 33, 34] defines causality in terms of intervention and structural dependence, under the structural equation models (SEM). However, in our problem, where only observational time series is available, Pearl’s definition may not be applicable. In his seminal work, Granger [11, 12] defines causality via prediction: if the prediction of the future Y via a linear model can be improved by including the information of X, then X causes Y in the Granger sense. The original Granger causality is limited to linear causal models. Although later works also extend Granger causality to kernel methods [1, 25, 26, 43], they may still be insufficient to model and discover the nonlinear causal relations in real data. On the other hand, the causal measures of transfer entropy [41] and causal influence [20] are in theory able to handle any nonlinearity. However, both measures require density estimation of the joint distribution for the full N time series (N is the number of time series), which is difficult and data-hungry when N is large. Constraint-based methods [44, 15, 32, 44] require repetitive conditional independence tests, where the number of tests will grow large when N is large and the underlying causal graph is dense. Score-based methods search for the structure that yields the optimal score w.r.t. the data, generally using greedy search methods, for example GES [7], rankGES [29] and GIES [16]. This in general requires steps, and the number of neighboring states may grow very large at each step. Another closely related field is sparse learning/feature selection methods. Some important classes are Lasso [47] and elastic net [53], which are effective but subject to the limitations of linear models. For nonlinear models, although L1 and group L1 regularization [27, 40, 46] can induce sparsity in the model parameters, they are model and input dependent.

To handle the nonlinear relations in time series, a promising tool is neural nets. Not only are neural nets universal function approximators [18], a deep neural net also provides exponentially large expressive power [39], making it particularly suitable for modeling the unknown nonlinear relations in time series. Recently there has been an increasing amount of work on learning the dynamic models of interacting systems [2, 6, 14, 49, 19, 48]. However, their main focus is to make better predictions, using implicit interaction models (e.g. using fully connected graph networks). In this paper, we are mainly interested in discovering the underlying directional relations in an explicit form, utilizing the expressive power of neural nets.

To discover nonlinear directional relations from potentially large number of time series in an efficient way, the contribution of our work is as follows:

• We introduce a novel relational learning with Minimum Predictive Information Regularization (MPIR) method for exploratory discovery of nonlinear directional relations from observational time series. It is based on minimizing a mutual information-regularized risk with learnable input noise of a prediction model, which allows function approximators such as neural nets to learn nonlinear relations, combining the benefits of the Granger causality paradigm with deep learning models. At the minimization of the objective, the minimum predictive information term quantifies the directional predictive strength between each pair of time series given other time series. For discovering the directional relations among N time series, it only has to learn N models, and does not requires density estimation for the joint N time series.

• We prove that the minimum predictive information is able to differentiate dependence or independence between pairs of time series, which allows for statistical test. Moreover, we prove that the minimum predictive information is invariant to the scaling of input and reparameterization of the model. We further provide intuition that under certain conditions, our method is likely to discover direct relations instead of indirect associations.

• We demonstrate on nonlinear synthetic datasets that our method outperforms other methods in discovering true causal relations with larger N, and discovers the directional relations in video game environment and real-world heart-rate vs. breath-rate datasets.

2 Method

2.1 Problem setup

We consider N time series , where each time series and each is an M-dimensional vector. Denote with

maximum time horizon of K, and . We also denote

for t = K + 1, K + 2, ... . Here are noise variables that are mutually independent, are independent of any , . For any , we assume that the variables have probability density function

Our method is inspired by Granger causality [11, 12], which defines causality via predictions, making it especially suitable for relational inference of observational time series. Adapting to our notation:

Granger causality [12]: Assuming causal sufficiency [35], we say does not Granger- cause . Otherwise, we say Granger-causes

In practice, we say that time series j Granger-causes time series i, if it can be shown via significance tests that the null hypothesis of is rejected, i.e. statistically significant information for predicting

In his original work, Granger [11] investigates causality with linear function predictors. Later works have extended it to kernel methods [1, 25, 26, 43], which essentially estimate linear Granger causality on the feature space of the kernel. To learn potentially highly nonlinear response functions, it may be desirable to use expressive and universal function approximators [18] such as neural nets. Neural nets are much more flexible than linear models, and do not require kernel selection as in kernel methods.

2.2 Our method

Based on the definition of Granger causality, a naïve way to combine it with neural net is: for each , train two neural nets, one predicting based on , another predicting based on the full , and test whether former MSE is significantly larger than the latter. This method suffers from two major drawbacks: (1) instability: different training of the neural net may end up in different local minima, so that the two MSEs have large variance, which is observed in our initial explorations; (2) inefficiency: to discover the relations among N time series, it has to train at least models (for each , train models with one removed, and Q models () with full for accumulating statistics). On the other hand, these two drawbacks exactly inspire our method. Instead of predicting with one missing at a time, what if we let each corruption, and encourage each to provide as little information to possible while maintaining good prediction? In this way, we have a single shared model that can span the full product space of [total corruption, no corruptioninput time series, which is more stable and efficient than the removing one at a time and training N models. To achieve this, we add independent noise with learnable amplitudes to each input , and measure the corruption by the mutual information between the input and the corrupted input. We then define the following risk:

where (or element-wise, ) is the noise-corrupted inputs with learnable noise amplitudes , and is a positive hyperparameter for the mutual information . Intuitively, the minimization of the second term requires the noise amplitude to go up. The minimization of the first term requires the noise amplitude to go down, and the larger causal strength from to , the larger this force. The minimization of the two terms strikes a balance, at which point the measures the minimum number of bits of information the time series j need to provide to the learner, without compromising the prediction.

At the minimization of , we define , which we term minimum predictive information, where . Essentially, measures the predictive strength of time series j for predicting time series i, conditioned on all the other observed time series. We have that satisfies the following properties:

The proofs are provided in Appendix B. Property 1 shows that is able to differentiate time series that are dependent or independent with the target time series i. Empirically, to perform statistical tests, we can let the null hypothesis be . Before training, we append to some fake time series (e.g. by randomly permuting . After optimizing

w.r.t. to the augmented dataset, the values of form a distribution for which we know that the null hypothesis is true. Then if certain is greater than the quantile (e.g. ) of the distribution, we can reject the null hypothesis of independence. Properties 2 and 3 show the benefit of our method which essentially regularizes the input information, compared with L1 and group L1 [27, 40, 46] which regularize the model and thus do not satisfy these two properties.

Moreover, in Appendix B we further provide intuition that under certain conditions, is likely to favor the time series that directly causes time series i, compared with the time series that relate to i via the direct causal connections. Note that our method is not guaranteed to identify direct causal relations (in Granger [12] or Pearl [32] sense), which is a very hard problem given the potential large number of time series and nonlinearity present. However, our method provides an effective data exploratory tool to identify time series that are predictive of one another, conditioned on all the other observed time series, whose identified directional relations can be investigated further by a researcher. As stated above, under certain conditions, our method does favor the direct causal relations. And in the experiment section, we will compare the estimated with true causal relations if available.

Empirically, we minimize the following empirical risk:

In general, it may be inefficient to estimate the mutual information with large dimension of such that the expression is also differentiable w.r.t. . Utilizing the property of Gaussian channels, in Appendix C we prove that

the variance of across t. Therefore, in practice to improve efficiency, we can optimize an upper bound of the risk:

When the dimension of is large, a differentiable estimate of the mutual information (e.g. MINE [3]) can be applied. We provide Algorithm 1 to empirically estimate , which we term relational learning with Minimum Predictive Information Regularization (MPIR). The steps 1-3 construct fake input time series (which we know the null hypothesis of is true) to append to . Steps 4-9 optimize the objective w.r.t. the augmented dataset, and obtain a . Step 10 performs significance test and only preserve the values in the main matrix that are statistically significant. In the case where we only need to estimate the predictive strength, this step is not required. Finally the main matrix is returned.

To select an appropriate hyperparameter , we can additionally append to the target a few time series constructed from . We then select such that the estimated causal strength between (for which we know the causal relations) is at least away from the estimated causal strength between and (for which we know that they are independent). See Appendix A for details.

3 Experiments

To demonstrate that our proposed method is able to discover interesting underlying directional (possibly causal) relations, we test it on both synthetic and real datasets. We first use synthetic datasets, where we know the underlying causal structure and compare with other methods. We then test whether our algorithm can infer directional relations among trajectories of objects from watching an agent playing video games. Finally, we apply our algorithm to a real-world heart-rate vs. breath-rate dataset and a rat EEG dataset to test its effectiveness. We use the for optimization for all experiments.

3.1 Synthetic experiment with log-normal causal strengths

In this experiment, we evaluate our method together with other methods using a nonlinear synthetic dataset generated to have a known causal structure (hidden to the methods being compared). We study performance with varying number N of time series, with N up to 30. To generate the data, we let each have dimension M = 1, and also set the maximum time horizon is a matrix. We use the following realization of the response function in Eq. (1):

where denotes element-wise multiplication, and Hand Hare two nonlinear functions to make the response functions nonlinear. In this experiment, we use Hsoftplus, and His a random matrix, whose element is sampled from is a matrix, with 0.5 probability of being a zero matrix and 0.5 probability of being a nonzero random matrix, characterizing the underlying causal strength from j to i. Crucially, to reflect that the causal strength may span different orders of magnitude, if is sampled to be a nonzero matrix, then the amplitude of each of its element is sampled from a log-normal distribution with , their sign sampling from . Denote 1(A) as the 0-1 indicator matrix of causality (). The goal of each algorithm being evaluated is to produce an score matrix , where each entry characterizes the directional strength from j to i. Then the flattened is evaluated against the flattened 1(A) (excluding diagonal elements of the matrices) via different metrics. Fig. S4 in Appendix E shows example snapshots of the time series.

In general, for a large N, the number of possible causal graphs grows double exponentially: there are possible matrix of 1(A). To give an estimate, for N = 3, 4, 5, 8, 10, 20, 30, there are number of possible graphs, respectively. Therefore, estimating the underlying causal graph is in general a non-trivial task when N is large. We compare our algorithm with previous methods including transfer entropy [41], causal influence [20], linear Granger causality [11, 9], kernel Granger causality [25, 26], and three baselines: (1) mutual information (which gives ), (2) a sparse feature selection

Table 1: Mean and standard deviation of AUC-PR (%) vs. N, over 10 random sampling of datasets. Bold font marks the top method for each N.

method, elastic net [53], and (3) a random matrix, each element of which is drawn from a standard Gaussian distribution. For each N, we sample 10 datasets with different and matrices, and compare each method’s average performance over 10 datasets together with their standard deviation. The implementation details for each method and each experiment are provided in Appendix D and E, respectively. Since many of the methods do not provide a threshold or significance test, we use the standard metrics of area under the precision-recall curve (AUC-PR) [8] (Table 1 below) and area under the ROC curve (AUC-ROC) (Table S1 in Appendix F) to compare their performance.

We see that for smaller ), methods with smaller expressivity (linear Granger, kernel Granger) performs slightly better. However, as N becomes larger, our method outperforms other methods with increasing margin, demonstrating our method’s capability to infer complex relational structures from interacting time series. Particularly, although two linear methods, linear Granger and elastic net, have relatively strong performance with , they quickly degrade with larger N due to more nonlinearity present in the data. With the help of kernels, kernel Granger degrades slower, but can not compete in larger N with our method which allows expressive neural nets to model complex nonlinear interactions. For the Causal Influence method, although it has very good mathematical properties, it may be impractical in practice, as is also shown in the table. This is due to that it is defined as the KL-divergence between and its counterpart (whose causal arrows to and from time series j are cut), each of which is an dimensional vector, which can quickly go to high dimensions, where density estimation required to calculate KL-divergence is in general data-hungry and difficult. In comparison, our method that estimates predictive strength via minimizing prediction errors is comparatively easier in high dimensions.

3.2 Experiments with video games

To see how our method can discover the directional (possibly causal) relations in real video games, and potentially improve reinforcement learning (RL) or imitation learning (IL), we apply our method to the relational inference between the trajectories of different objects from a trained CNN RL-agent playing Atari Breakout games ([4], implementation details see Appendix H). Fig. 1 shows the inferred matrix for our method and compared methods, respectively. The true underlying causal chain is marked in dark color in Fig. 1e, with light color marking the competing causal relations that are indistinguishable from data (e.g. decrease of bricks and increase of reward happen at the same time step, so we cannot distinguish ball-ybrick and ball-yreward). Compared with other methods, we see that our method is able to discover comparatively most of the causal relations without finding false positives. Specifically, it correctly discovers a prominent causal direction from the ball’s y position to the reward, as well as brick reward, ball-x action, ball-y action. The latter two show that the ball’s x and y positions also have influences on the trained agent’s action: in order that the ball does not fall to the bottom, the agent has to position itself at the right position depending on the x and y positions of the ball.

In comparison, mutual information (Fig. 1b) gives a symmetric matrix that does not differentiate the two possible directions, and also misses the arrows ball-yreward. For transfer entropy (Fig. 1c), although it correctly discovers a number of causal arrows, it also gives relatively high scores for some incorrect arrows: brickaction, ball-yball-x. For kernel Granger (Fig. 1f), although it correctly discovers four causal relations, it also incorrectly finds rewardball-y and reward

Figure 1: (a) Predictive strength inferred by our method in Section 3.2. The (j, i) element denotes the inferred causal strength from j to i. (e) True underlying causal relations are marked dark, with light color marking competing causal relations that are indistinguishable from data. Other subfigures are: directional strength inferred by (b) mutual information (c) transfer entropy (d) linear Granger (f) kernel Granger (g) elastic net (h) causal influence.

For elastic net (Fig. 1g), it correctly discovers two prominent causal relations: ball-yaction and ball-yreward, but misses a few others. Linear Granger (Fig. 1d) and causal influence (Fig. 1h) fail to discover useful causal arrows.

3.3 Experiment with heart-rate vs. breath-rate and rat brain EEG datasets

Now we test our algorithm with real-world datasets. As a common dataset studied in previous causal works, we use the time-series of the breath rate and instantaneous heart rate of a sleeping patient suffering from sleep apnea (samples 2350-3550 of data set B from Santa Fe Institute time series contest held in 1991, available in [36]). We apply our method to infer the directional relations between the breath rate and heart rate, with different maximum time horizon K. The result is shown in Fig. 2. We see that the predictive strength from heart to breath is significantly higher than the reverse direction that is basically 0, consistent with the results from previous causal inference methods [41, 1, 25] as also shown in Fig. 2(b)(c). Notably, the from heart to breath estimated by our method remains at roughly the same level for different Ks, in contrast to the decaying causality index w.r.t. increasing history length in ([1], Fig. 2 (c)), showing a merit of our method in estimating directional strength across different time-horizons, aided by the flexibility of neural nets in extracting the right information to predict the future. The implementation details are provided in Appendix I. In

Figure 2: (a) Predictive strength inferred by our method with the heart-rate vs. breath-rate dataset, averaged over 50 initializations of . The shaded areas are the 95% confidence interval. (b) Upper: the filtered causality index vs. varying width of Gaussian kernel ]; lower: transfer entropy vs. r, the length scale [41]; (c) The causality index for breathheart (lower) and heartbreath (upper) in [1], where m is the maximum time lag (equivalent to our K).

addition, in Appendix J we test our algorithm on a rat EEG dataset, and obtain consistent result with previous works.

4 Discussion and conclusion

In this paper, we have introduced a novel relational learning with Minimum Predictive Information Regularization (MPIR) method for exploratory discovery of nonlinear directional relations from observational time series. It allows functional approximators like neural nets to learn complex directional relations from time series data. We prove its three theoretical properties, and provide intuition that it favors variables that directly cause the variable of interest. We demonstrate in synthetic datasets, a video game environment and heart-rate vs. breath-rate dataset, that our method has better capability to handle nonlinearity, and can scale to large numbers of time series. We believe our work endows practitioners with a useful tool for deciphering the directional relations in complex systems, and are excited to see it in broader applications.

References

[1] Ancona, N., Marinazzo, D., and Stramaglia, S. Radial basis function approach to nonlinear granger causality of time series. Physical Review E, 70(5):056221, 2004.

[2] Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510, 2016.

[3] Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., and Courville, A. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

[4] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47: 253–279, 2013.

[5] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016.

[6] Chang, M. B., Ullman, T., Torralba, A., and Tenenbaum, J. B. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016.

[7] Chickering, D. M. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.

[8] Davis, J. and Goadrich, M. The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233–240. ACM, 2006.

[9] Ding, M., Chen, Y., and Bressler, S. L. Granger causality: basic theory and application to neuroscience. Handbook of time series analysis: recent theoretical developments and applications, pp. 437–460, 2006.

[10] Diuk, C., Cohen, A., and Littman, M. L. An object-oriented representation for efficient reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 240–247. ACM, 2008.

[11] Granger, C. W. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pp. 424–438, 1969.

[12] Granger, C. W. Testing for causality: a personal viewpoint. Journal of Economic Dynamics and control, 2:329–352, 1980.

[13] Granger, C. W., Huangb, B.-N., and Yang, C.-W. A bivariate causality between stock prices and exchange rates: evidence from recent asianflu. The Quarterly Review of Economics and Finance, 40(3):337–354, 2000.

[14] Guttenberg, N., Virgo, N., Witkowski, O., Aoki, H., and Kanai, R. Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530, 2016.

[15] Harris, N. and Drton, M. Pc algorithm for nonparanormal graphical models. The Journal of Machine Learning Research, 14(1):3365–3383, 2013.

[16] Hauser, A. and Bühlmann, P. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13 (Aug):2409–2464, 2012.

[17] Hiemstra, C. and Jones, J. D. Testing for linear and nonlinear granger causality in the stock price-volume relation. The Journal of Finance, 49(5):1639–1664, 1994.

[18] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural networks, 4 (2):251–257, 1991.

[19] Hoshen, Y. Vain: Attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, pp. 2701–2711, 2017.

[20] Janzing, D., Balduzzi, D., Grosse-Wentrup, M., Schölkopf, B., et al. Quantifying causal influences. The Annals of Statistics, 41(5):2324–2358, 2013.

[21] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical review E, 69(6):066138, 2004.

[23] Lizier, J. T., Prokopenko, M., and Zomaya, A. Y. Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77(2):026110, 2008.

[24] Lozano, A. C., Abe, N., Liu, Y., and Rosset, S. Grouped graphical granger modeling for gene expression regulatory networks discovery. Bioinformatics, 25(12):i110–i118, 2009.

[25] Marinazzo, D., Pellicoro, M., and Stramaglia, S. Kernel method for nonlinear granger causality. Physical review letters, 100(14):144103, 2008.

[26] Marinazzo, D., Pellicoro, M., and Stramaglia, S. Kernel-granger causality and the analysis of dynamical networks. Physical review E, 77(5):056215, 2008.

[27] Meier, L., Van De Geer, S., and Bühlmann, P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.

[28] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[29] Nandy, P., Hauser, A., Maathuis, M. H., et al. High-dimensional consistency in score-based and hybrid structure learning. The Annals of Statistics, 46(6A):3151–3183, 2018.

[30] Neves, G., Cooke, S. F., and Bliss, T. V. Synaptic plasticity, memory and the hippocampus: a neural network approach to causality. Nature Reviews Neuroscience, 9(1):65, 2008.

[31] Papoulis, A. Probability, random variables and stochastic processes. 1985.

[32] Pearl, J. Causality: models, reasoning, and inference. IIE Transactions, 34(6):583–589, 2002.

[33] Pearl, J. Causality. Cambridge university press, 2009.

[34] Pearl, J. et al. Causal inference in statistics: An overview. Statistics surveys, 3:96–146, 2009.

[35] Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. MIT press, 2017.

[36] PhysioNet. Physionet data bank. URL http://www.physionet.org/.

[37] Quiroga, R. Q. The dataset can be downloaded from. URL www.vis.caltech.edu/~rodri.

[38] Quiroga, R. Q., Kraskov, A., Kreuz, T., and Grassberger, P. Performance of different synchronization measures in real data: a case study on electroencephalographic signals. Physical Review E, 65(4):041903, 2002.

[39] Rolnick, D. and Tegmark, M. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018. URL https://openreview. net/forum?id=SyProzZAW.

[40] Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.

[41] Schreiber, T. Measuring information transfer. Physical review letters, 85(2):461, 2000.

[42] Seth, A. K., Barrett, A. B., and Barnett, L. Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience, 35(8):3293–3297, 2015.

[43] Sindhwani, V., Quang, M. H., and Lozano, A. C. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and granger causality. arXiv preprint arXiv:1210.4792, 2012.

[44] Spirtes, P., Glymour, C. N., Scheines, R., Heckerman, D., Meek, C., Cooper, G., and Richardson, T. Causation, prediction, and search. MIT press, 2000.

[45] Stock, J. H. and Watson, M. W. Interpreting the evidence on money-income causality. Journal of Econometrics, 40(1):161–181, 1989.

[46] Tank, A., Covert, I., Foti, N., Shojaie, A., and Fox, E. Neural granger causality for nonlinear time series. arXiv preprint arXiv:1802.05842, 2018.

[47] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.

[48] van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.

[49] Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., and Tacchetti, A. Visual interaction networks: Learning a physics simulator from video. In Advances in neural information processing systems, pp. 4539–4547, 2017.

[50] White, H. and Chalak, K. Settable systems: an extension of pearl’s causal model with optimization, equilibrium, and learning. Journal of Machine Learning Research, 10(Aug):1759–1799, 2009.

[51] White, H. and Lu, X. Granger causality and dynamic structural systems. Journal of Financial Econometrics, 8(2):193–243, 2010.

[52] White, H., Chalak, K., and Lu, X. Linking granger causality and the pearl causal model with settable systems. In NIPS Mini-Symposium on Causality in Time Series, pp. 1–29, 2011.

[53] Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

Appendix A Hyperparameter λ selection

For selecting an appropriate hyperparameter , we run our experiments for the synthetic dataset with . For each experiment involving N time series, we append independent time series , generated by randomly sampling time series from and performing random permutation across the examples. We also appendtime series , such that , where Q is a fixed random matrix, so that we know does not cause i, s. We apply Alg. 1 to the augmented dataset, and produce the estimated predictive strength from . For each hyperparameter , we then fit a Gaussian distribution to the estimated predictive strengths from to ), and fit another Gaussian distribution to the estimated predictive strengths from to , , and select the such that the upper value of is smaller than the lower value of . In this way, for the known causal and non-causal relations, they are sufficiently apart. We find that satisfy this criterion, while larger fails to satisfy. We then set for all our experiments.

B Proof and analysis of the Minimum Predictive Information regularized risk

In this section we prove the three properties of in Section 2.2, and analyze why it is likely to select variables that directly causes the variable of interest.

Firstly we state the assumption that will be used throughout this section:

Assumption 1. Assume that is a continuous function and has enough capacity so that it can approximate any. Let and assume that has support with intrinsic dimension of KM.

Also we emphasize that in this paper, the expected risks (with symbol R) are w.r.t. the distributions, and the empirical risks (with symbol ) are w.r.t. a dataset drawn from the distribution, with finite number of examples. The theorems in this paper are all proved w.r.t. distributions (assuming infinite number of examples). Sample complexity results will be left for future work.

Before going forward with the main proof, we first prove the following lemma.

B.1 Proving a lemma

Lemma 0.1. Suppose that Assumption 1 holds. Denote

as the standard MSE loss, we have

and

In other words, for the MSE loss, its minimum is attained when is the expectation of conditioned on

Proof. The proof of the lemma is adapted from [31]. The risk

Note that here is an inner product in

For any , treating as a vector, let’s calculate its value such that the integral attains its minimum.

Therefore, for any is the only stationary point for

Therefore, for any is the only global minimum of

The minimum of the risk is attained iff attains minimum at every , i.e.,

and

B.2 Proof of the three properties of

The three properties are

Proof. (1) If , then . Since where , we have . Recall Eq. (2):

let given a certain

where the second equality is due to that the mutual information term in does not depend on , and the last equality is due to Lemma 0.1. Let , since

Therefore,

which does not depend on . Finally, we have

For the last equality, the elements in the parenthesis does not depend on , and only the term depends on . Therefore, at the minimization of the whole objective , we have attains its minimum of 0, at which . By the definition of . Proof completes.

In essence, the proof states that if , then at the minimization of the whole objective, the MSE term does not depend on , and the mutual information term w.r.t. time series j can be independently minimized and approach 0.

(2) Suppose that we replace by where . Let .

We have , and therefore

, where the last equality is due to that mutual information is invariant to invertible transformations. Furthermore, due to Assumption 1, we can find another which undoes this affine transformation on , so the MSE term can be kept the same. Therefore, we have a one-to-one mapping between the original and the new such that value of the MSE term and the mutual information term remain unchanged. Thus at the minimization of the objective, remains the same.

(3) This is trivial to prove. We see that in , the MSE term remains the same if the mapping f remains the same, regardless of how we parameterize f in terms of parameter . The second term does not depend on . Therefore, at the minimization of , the

is invariant to the reparameterization of the same f in terms of parameter . As a direct corollary, is insensitive to the network architecture, as long as the capacity is enough (provided with sufficient number of examples). This is confirmed in Table S2 in Appendix G.

Note that L1 and group L1 regularization do not have this property, since they explicitly regularize on the parameter

B.3 Analysis of the minimum predictive information-regularized risk

After proving the three properties of , now we analyze why the minimum predictive information-regularized risk is likely to select the variables that directly cause , under some additional assumptions. We first state the additional assumption needed to perform the analysis, then we restate the definitions of direct causality to make our statements more rigorous. We then prove two lemmas in Appendix B.3.1, and finally perform the analysis in Appendix B.3.2.

Assumption 2. Assume that causal sufficiency [35] is satisfied, i.e. the observed time series 1, 2, ...N are all the variables that take part in the dynamics (no hidden confounding variables). Also assume that in the response function Eq. (1), the noise variable are effective variables, so each is not a deterministic mapping. Assume that by saying “causality", we mean “causality in mean".

To make our statement of causality more rigorous, here we restate the definition of direct (structural) causality [52] using our notations of the system Eq. (1). This definition is a natural extension to Pearl causality [33] in canonical settable systems [50, 52], which formalizes time series in its full generality.

Direct (structural) causality [52] We say does not directly (structurally) cause for all possible values of , the function is constant in . Otherwise, we say directly (structurally) causes

The relationship between direct causality and Granger causality in Section 2.1 is the following Lemma, which states that for our system, Granger causality is a sufficient condition for direct (structural) causality.

Lemma 0.2. Assuming causal sufficiency, for system Eq. 1, for any Granger-causes directly structurally causes

Proof. We base the proof on the Theorem 5.6 in [52]. Firstly, by definition, the system Eq. (1) belongs to the canonical settable system (Def. 3.3 in [52]), on which their Theorem 5.6 is based. To prove that in our system Granger causality can deduce direct structural causality, we only have to prove that the assumption A.1 and assumption A.2 in [52] are satisfied by our system. If we identify our with their with their with their , then our system Eq. (1) satisfies their Assumption A.1. Additionally, by definition, our are random variables that are mutually independent, and also independent of any , . Therefore, our system satisfies their strict exogeneity (in our representation which is a sufficient condition for Assumption A.2. Therefore, both their Assumption A.1 and Assumption A.2 are satisfied by our system Eq. (1). Applying their Theorem 5.6, we prove Lemma 0.2.

Therefore, for our system Eq. (1), applying the results by [52], we have that Granger causality is a sufficient condition for direct structural causality. The reason that here Granger causality can deduce direct structural causality is in part due to the fact that for system Eq. (1), conditional exogeneity [52] is automatically satisfied.

Note that the reverse of the statement is not true, i.e. a failed Granger causality test does not necessarily imply that there is no direct structural causality (White & Lu [51] give several examples, and also note that these instances are exceptional).

After stating Assumption 2 and clarifying the definition of causalities, now we prove two lemmas, which are important for the analysis of our objective.

B.3.1 Minimum MSE with different variables

Lemma 0.3. Suppose that Assumption 1 and 2 holds, and are mutually exclusive sets of variables satisfying

Fig. S1 below shows the relations between the variables, where the dashed arrows denote the potential existence of causal relations between variables. We see that conditioned on are independent, while conditioned on independent. Lemma 0.3 states that under the above scenario and under Assumptions 1 and 2, using can achieve a lower MSE than using

Figure S 1: Diagram of variables for Lemma 0.3. The dashed arrows denote the possible existence of causal relations between variables.

Proof. Since Assumption 1 holds, according to Lemma 0.1, Lemma 0.3 is equivalent to

We have

The third equality (the one before the inequality) is due to that , lead- ing to . The inequality step first uses the Assumption 2 that the noise variables are effective arguments of the response functions , and that each is “causality in mean". Therefore, . Using Lemma 0.1, we have minimizes , hence the inequality.

Using Lemma 0.3 recursively, we see that using variables that directly causes achieve the lowest MSE. Formalizing the above intuition, we have

Lemma 0.4. Suppose that Assumption 1 and 2 holds, and are the set of variables that directly causes

Specifically, we have

Proof. For any , let . Then are mutually exclusive, and . Now we prove that with , the corresponding , , satisfy the condition for Lemma 0.3. Since are the set of variables that directly causes , there does not exist a such that the corresponding (otherwise it violates the direct causality). Thus . To prove , note that does not directly cause , then does not Granger-cause , i.e. , which is equivalent to . The special case of follows directly that letting

B.3.2 Qualitative and quantitative behaviors of the mutual information-regularized risk

In this section, we analyze the qualitative and quantitative behaviors of the mutual information-regularized risk (Eq. 2), with varying noise levels . For each variable define as a “rescaled" mutual information between

and . When so that , at which the input is fully preserved. When all elements of , at which is fully corrupted. Denoting , we can then rewrite the mutual information-regularized risk (Eq. 2) as

tanh. Let be the set of variables that directly causes , and denote the corresponding set of as . Denote and the corresponding set of as . For any i = 1, 2, ...N, it is easy to see that MMSEhas the following properties:

1. MMSEattains maximum at

2. MMSEis monotonically decreasing w.r.t. each

3. MMSE(using Lemma 0.4).

4. MMSEattains minimum at is constant w.r.t.

To get a better intuition of the landscape of , let’s investigate a simple example. Let the

where are independent unit Gaussian variables, and . For , since only and are d-connected to , at the minimization of , only and may have a finite (the other are all infinite). There- fore, setting the not corresponding to and as infinity, and let , being independent unit Gaussian variables. Let , then we can get an analytic expression for

R

Minimizing

Substituting into

Here we have neglected the constant . To obtain , let

, we have . Substituting, we have

Fig. S2 shows the landscape of MMSE. We see that MMSEsatisfies the above mentioned four properties. Particularly, MMSEMMSE. After adding , the has global minimum along largely due to this property. Therefore, for this particular example, when is minimized,

By varying the value of , we can tune the relative influence of the two terms MMSEand . The landscape corresponding to are plotted in Fig. S3. We see that when , the MMSE term dominates, and it is possible that the global minimum of . This is similar to the effect of a L1 regularization, where if the coefficient for the L1 is vanishingly small, the L1 regularization will barely influence the loss landscape. When is not vanishingly small, as in Fig. S3 (b), we see that the global minimum of lies on . When , the term dominates and the global minimum is at

In general, we expect behave qualitatively similar. When , the global minimum for . As we ramp down , the dimension that has largest influence on MMSE will first host the global minimum with nonzero , which is most likely the variable that directly causes . When is further ramping down, we expect that the variables that host the global minimum with nonzero will more likely be those that directly causes , due to the landscape influenced by the four properties of MMSE. This can justify the mutual information-regularized risk as a good objective for causal discovery/variable selection. The experiments in the paper will empirically test the performance of the mutual information-regularized risk.

Figure S 3: (a) for (a) , (b) , (c) and (d) in section B.3.2, for

C Upper bound for the mutual information-regularized risk

In this section, we prove that

where l is the element of a vector, stdis the standard deviation of across t. The equality is reached when obeys a multivariate Gaussian distribution with diagonal covariance matrix satisfying

Here is differential entropy. For , its variance at the dimension is

The second equality is due to that is independent of . Using the principle of maximum entropy, the distribution that maximizes subject to the constraint of VarVaris a Gaussian distribution whose diagonal covariance matrix satisfies . Its entropy is

The equality is reached when obeys a multivariate Gaussian distribution with diagonal covari- ance matrix satisfying

D Implementation details for the methods

Here we state the implementation details for our method, as well as other methods being compared. Throughout this paper, unless otherwise specified, we use the standard k-nearest neighbor technique in [22] to estimate the KL-divergence and mutual information (with number of neighbors k = 5) and conditional mutual information (with number of neighbors k = 3), which is used in our implementations of Mutual information, Transfer Entropy and Causal Influence.

D.1 Our method

Without stating otherwise, our method (Algorithm 1) as a default uses a three layer neural net, with two hidden layers having 8 neurons and leakyReLU (max(0.3x, x)) activation, and the last layer having linear activation. We set the number of fake time series , and significance level . Adam [21] optimizer with learning rate is used as default throughout this paper. We set and . We use 30000 epochs. It also has a 400 epoch warm-up period where the mutual information term is turned off, to allow to find a good initial model as a start. We use the the upper bound (Eq. 4) as the risk and also in estimating , as discussed in the main text in Section 2.2. In this work, the relative noise amplitude is shared across the dimension l for each time series j. This simplifies the risk calculation, and is invariant to the rescaling of each time series . We also tested fully parameterizing with a similar performance.

D.2 Transfer Entropy

We use the definition of transfer entropy as defined in [41]. In that work the transfer entropy is defined for two time series. To deal with multiple time series, we let also include other time series, similar to the extension of transfer entropy as in [23].

D.3 Causal Influence

For causal influence [20], we use the same network architecture as in our method, to learn a prediction model. Then the KL divergence is estimated via the technique in [22].

D.4 Linear Granger

We follow the definition of linear Granger causality (Eq. (7) and (8) in [9]) to calculate linear Granger causality. Specifically, we calculate the residual squared error of a linear predictor of without (both with ). Then the linear Granger causality equals the log of the ratio of the two residual squared errors.

D.5 Kernel Granger

We use the implementation3 for [25, 26] for estimating kernel Granger causality. We use their default settings, with inhomogeneous polynomial (IP) kernel of degree p = 2. We follow the normalization requirement of the algorithm to normalize the data for each experiment.

D.6 Elastic Net

We use elastic net [53] with 5-fold time-series-split cross-validation, along the following regularization path: L1-ratio: 0.5, 0.8, 0.9, 0.95, 0.99, and strength of penalization being a 200-step geometric series from to . The score function used for cross-validation is the coefficient of determination (). The elastic net is implemented with scikit-learn’s ElasticNetCV module4, with optimization tolerance of

D.7 Gaussian Random

For Gaussian Random, we draw 10,000 random matrices, each element of which is drawn from a standard Gaussian distribution.

Table S 1: Mean and standard deviation of AUC-ROC (%) vs. N, over 10 random sampling of datasets. Bold font marks the top method for each N.

E Implementation details for synthetic experiments

For all experiments in this section, each metric is obtained by performing the experiments (including generation of the dataset and the training) ten times with seed = 0, 30, 60, 90, 120, 150, 180, 210, 240, 270 and averaging the resulting metrics (for Gaussian random matrices, for each true causal matrix A sample 10,000 random matrices ). For the ground-truth causal tensor A, each element is a matrix, with 0.5 probability of being an all-zero matrix, and 0.5 probability of being a nonzero matrix. If is a nonzero matrix, its each element is sampled from a log-normal distribution with and . For B, each is also a matrix, with each element sampling from Hin equation (5). As a default, 500 time series each with length of 22 are generated from Eq. (5), each of which is wrapped into 19 pairs (since K = 3), so there are in total examples for each dataset. Since we are using AUC as metrics where a threshold is not necessary, we neglect step 10 in Alg. 1 for synthetic experiment. The train-test-split is 9:1 for all experiments in this paper. See Fig. S4 for example snapshots of time series together with the corresponding

F AUC-ROC table for synthetic experiment

Table S1 show the AUC-ROC table for the synthetic experiment, where for each N, 10 datasets are randomly sampled according to Eq. (5) using random seed 0, 30, 60, 90, 120, 150, 180, 210, 240, 270, over which each method is run and their metrics are accumulated. It has similar behavior as the AUC-PR table (Table 1) in the main text.

G Additional experiment: testing with model capacity variations

Since in practice, we do not know the underlying causal structure a priori, it presents a greater challenge to select the model capacity for , as compared with supervised learning method where we can do cross-validation. To see how the capacity of the function approximator influences our method, we vary the number of layers and the number of neurons in each layer at N = 10, using the same 10 datasets as in Section 3.1. Table S2 summarizes the result. We see that our method’s performance here is hardly influenced by the model capacity, with only a slight degradation at very low capacity. This shows that our method is quite tolerant and stable with model capacity variations.

H Details for the video game dataset

Here, we implement a custom Atari Breakout game in the OpenAI Gym [5] environment, mimicking the original game5, where we can access the state of the ball, paddle and bricks, etc. This representation is also used in the OO-MDP [10] paradigm for a more efficient representation of the environment

Time series 1 Time series 2 Time series 3 Time series 4 Time series 5 Time series 6 Time series 7 Time series 8 Time series 9 Time series 10 Time series 11 Time series 12 Time series 13 Time series 14 Time series 15 Time series 16 Time series 17 Time series 18 Time series 19 Time series 20 Time series 21 Time series 22 Time series 23 Time series 24 Time series 25 Time series 26 Time series 27 Time series 28 Time series 29 Time series 30

Figure S 4: Example snapshots of the synthetic time series with (a) N = 8, (b) N = 15, and (c) N = 30. The inset is the hidden underlying matrix, whose (j, i) element denotes the causal strength from time series j to i. We see that the causal strength varies in orders, making it very difficult to identify each edge correctly.

state. We use the DQN algorithm, the same CNN architecture as in [28] to train an RL agent. Then we let it play the game for 45000 steps, obtaining a dataset with time-length of 45000 steps (if the Table S 2: Average and standard deviation of AUC-PR and AUC-ROC for different network structures for N = 10 with our method. Here for example, (8, 8, 8) means that the has 3 hidden layers, each with 8 neurons.

Figure S 5: Time series of the heart rate and breath rate of a patient suffering sleep apnea. The data is normalized to have 0 mean and standard deviation of 1. Sample rate is 2Hz.

agent dies, we restart the game) and 6 time series: action, paddle’s x position, ball’s x position, ball’s y position, number of bricks and reward. We then feed the time series (each time series normalized to mean of 0 and variance of 1) to our method, the same procedure as performed in the synthetic experiment, to let it produce an inferred matrix , which is shown in Fig. 1 in main text. All the datasets used in this paper and code will be open-sourced upon publication of the paper.

I Implementation details for experiment with heart-rate vs. breath-rate

For the two real-world datasets, we obtain the data with the same procedure as in [1] (See Fig.S5 for their plots). Then the data (each time series normalized to mean of 0 and variance of 1) are fed into our algorithm to infer the causal strength , the experiments are run for 50 times with seed from 0 to 49, and Fig. 2 in the main text is obtained by averaging over the inferred W matrix.

J Additional experiment: rat EEG dataset

As another real-world example, we apply our algorithm to estimate the directional relations of the EEG signals between the right and left cortical intracranial electrodes [37], before and after lesion (see Fig. S6 and S7 for the signals), also studied in [1, 38, 25]. Figure S8 (left) shows the inferred predictive strength for the EEG signals of a normal rat. We see that there is only a slight asymmetry, with the right channel having a slightly stronger influence on the left channel than the reverse direction. Fig. S8 (right) shows for the EEG signals with unilateral lesion in the rostral

0.5 Right Left

Figure S 6: Time series of a normal rat EEG signals from right and left cortical intracranial electrodes. The data is normalized to have 0 mean and standard deviation of 1, and the left signal is plotted with offset for better visualization. Sample rate is 200Hz.

Figure S 7: Time series of a rat EEG signals from right and left cortical intracranial electrodes, after lesion. The data is normalized to have 0 mean and standard deviation of 1, and the left signal is plotted with offset for better visualization. Sample rate is 200Hz.

pole of the reticular thalamic nucleus. We see that there is stronger predictive strength from the left to the right channels. Compared with the result of previous works [1, 25] as also shown in Fig. S9, we see that all methods correctly infer the directional relations before and after brain lesion. In addition, our method shows only a slight decay of predictive strength with increasing history length, in contrast to the much more rapid decay of causality index in [1], again demonstrating our method’s insensitivity against history length, due to its flexibility in extracting the right amount of information in order to predict the future. This experiment and the breadth rate vs. heart rate experiment in Section 3.3 demonstrate our method’s capability in inferring the directional relations from noisy, real-world data.

Figure S 8: Predictive strength inferred by our method with the EEG datasets, for different maximum time horizon K, averaged over 50 initializations of , for a normal rat (left) and after brain lesion (right).

Figure S 9: Causal indices for the rat EEG dataset with previous methods. (a) By [1]. Left: the variance for the left EEG (open circles) and right EEG (diamonds) vs. time lag m before brain lesion. Right: the causality index after brain lesion. (b) By [26]. The filtered causality index vs. varying p, the order of the inhomogeneous polynomial kernel, before (upper) and after (lower) brain lesion.