Causal discovery of linear non-Gaussian acyclic models in the presence of latent confounders

2020·Arxiv

Abstract

Causal discovery from data affected by latent confounders is an important and difficult challenge. Causal functional model-based approaches have not been used to present variables whose relationships are affected by latent confounders, while some constraint-based methods can present them. This paper proposes a causal functional model-based method called repetitive causal discovery (RCD) to discover the causal structure of observed variables affected by latent confounders. RCD repeats inferring the causal directions between a small number of observed variables and determines whether the relationships are affected by latent confounders. RCD finally produces a causal graph where a bi-directed arrow indicates the pair of variables that have the same latent confounders, and a directed arrow indicates the causal direction of a pair of variables that are not affected by the same latent confounder. The results of experimental validation using simulated data and real-world data confirmed that RCD is effective in identifying latent confounders and causal directions between observed variables.

Keywords: Causal discovery, Causal structures, Latent confounders

1. Introduction

Many scientific questions aim to find the causal relationships between variables rather than only find the correlations. While the most effective measure for identifying the causal relationships is controlled experimentation, such experiments are often too costly, unethical, or technically impossible to conduct. Therefore, the development of methods to identify causal relationships from observational data is important.

Many algorithms that have been developed for constructing causal graphs assume that there are no latent confounders (e.g., PC [1], GES [2], and LiNGAM [3]). They do not work effectively if this assumption is not satisfied. Conversely, FCI [4] is an algorithm that presents the pairs of variables that have latent confounders. However, since FCI infers causal relations on the basis of the conditional independence in the joint distribution, it cannot distinguish between the two graphs that entail exactly the same sets of conditional independence. Therefore, to understand the causal relationships of variables where latent confounders exist, we need a new method that satisfies the following criteria: (1) the method should accurately (without being biased by latent confounders) identify the causal directions between the observed variables that are not affected by latent confounders, and (2) it should present variables whose relationships are affected by latent confounders.

Compared to the constraint-based causal discovery methods (e.g., PC [1] and FCI [4]), causal functional model-based approaches [5, 6, 7, 8, 9] can identify the entire causal model under proper assumptions. They represent an effect Y as a function of its direct cause X. They infer that variable X is the cause of variable Y when X is independent of the residual obtained by the regression of Y on X but not independent of Y.

Most of the existing methods based on causal functional models identify the causal structure of multiple observed variables that form a directed acyclic graph (DAG) under the assumption that there is no latent confounder. They assume that the data generation model is acyclic, and that the external effects of all the observed variables are mutually independent. Such models are called additive noise models (ANMs). These methods discover the causal structures in two steps: (1) identifying the causal order of variables and (2) eliminating unnecessary edges. DirectLiNGAM [8], which is a variant of LiNGAM [3], performs regression and independence testing to identify the causal order of multiple variables. DirectLiNGAM finds a root (a variable that is not affected by other variables) by performing regression and independence testing of each pair of variables. If a variable is exogenous to the other variables, then it is regarded as a root. Thereafter, DirectLiNGAM removes the effect of the root from the other variables and finds the next root in the remaining variables. DirectLiNGAM determines the causal order of variables according to the order of identified roots. RESIT [9], a method extended from Mooij et al. [6], identifies the causal order of variables in a similar manner by performing an iterative procedure. In each step, RESIT finds a sink (a variable that is not a cause of the other variables). A variable is regarded as a sink when it is endogenous to the other variables. RESIT disregards the identified sinks and finds the next sink in each step. Thus, RESIT finds a causal order of variables. DirectLiNGAM and RESIT then construct a complete DAG, in which each variable pair is connected with a directed edge based on the identified causal order. Thereafter, DirectLiNGAM eliminates unnecessary edges using the adaptive Lasso [10]. RESIT eliminates the edge from X to Y if X is independent of the residual obtained by the regression of Y on Z \ {X}, where Z is the set of causes of Y in the complete DAG.

Causal functional model-based methods effectively discover the causal structures of observed variables generated by an additive noise model when there is no latent confounder. However, the results obtained by these methods are likely disturbed when there are latent confounders because they cannot find a causal function between variables affected by the same latent confounders. Furthermore, the causal functional model-based approaches have not been used to show variables that are affected by the same latent confounder, as FCI does.

This paper proposes a causal functional model-based method called repetitive causal discovery (RCD) to discover the causal structures of the observed variables that are affected by latent confounders. RCD is aimed at producing causal graphs where a bi-directed arrow indicates the pair of variables that have the same latent confounders, and a directed arrow indicates the direct causal direction between two variables that do not have the same latent confounder. It assumes that the data generation model is linear and acyclic, and that external influences are non-Gaussian. Many causal functional model-based approaches discover causal relations by identifying the causal order of variables and eliminating unnecessary edges. However, RCD discovers the relationships by finding the direct or indirect causes (ancestors) of each variable, distinguishing direct causes (parents) from indirect causes, and identifying the pairs of variables that have the same latent confounders.

Our contributions can be summarized as follows:

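• We developed a causal functional model-based method that can present variable pairs whose relationships are affected by the same latent confounders.

• The method can also identify the causal direction of variable pairs that are not affected by the same latent confounder.

• The results of experimental validation using simulated data and real-world data confirmed that RCD is effective in identifying latent confounders and causal directions between observed variables.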

A briefer version of this work without detailed proofs can be found in [11].

2. Problem definition

2.1. Data generation process

This study aims to analyze the causal relations of observed variables confounded by unobserved variables. We assume that the relationship between each pair of (observed or unobserved) variables is linear, and that the external influence of each (observed or unobserved) variable is non-Gaussian. In addition, we assume that (observed or unobserved) data are generated from a process represented graphically by a directed acyclic graph (DAG). The generation model is formulated using Equation 1:

$$x_i = \sum_{j \neq i} b_{ij} x_j + \sum_{k} \lambda_{ik} f_k + e_i \tag{1}$$

where $x_i$ denotes an observed variable, $b_{ij}$ is the causal strength from $x_j$ to $x_i$, $f_k$ denotes a latent confounder, $\lambda_{ik}$ denotes the causal strength from $f_k$ to $x_i$, and $e_i$ is an external effect. The external effect $e_i$ and the latent confounder $f_k$ are assumed to follow non-Gaussian continuous-valued distributions with zero mean and nonzero variance and are mutually independent. The zero/nonzero pattern of $b_{ij}$ and $\lambda_{ik}$ corresponds to the absence/existence pattern of directed edges. Without loss of generality [12], latent confounders $f_k$ are assumed to be mutually independent. In a matrix form, the model is described as Equation 2:

$$\mathbf{x} = B\mathbf{x} + \Lambda \mathbf{f} + \mathbf{e} \tag{2}$$

where the connection strength matrices B and Λ collect $b_{ij}$ and $\lambda_{ik}$, and the vectors x, f and e collect $x_i$, $f_k$ and $e_i$.
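To make the data generation process concrete, the following is a minimal sketch that draws samples from the linear model with latent confounders. It is an illustration under stated assumptions, not the authors' simulation code: the function names `generate` and `sample_non_gaussian`, the uniform noise distribution, the example coefficients, and the requirement that variables are indexed in a causal order are all choices made here for illustration.

```python
import random


def sample_non_gaussian(rng):
    # Assumption: uniform noise on [-1, 1] (zero mean, nonzero variance, non-Gaussian).
    return rng.uniform(-1.0, 1.0)


def generate(B, Lam, n_samples, seed=0):
    """Draw samples from x_i = sum_j B[i][j] x_j + sum_k Lam[i][k] f_k + e_i.

    B[i][j] is the strength from x_j to x_i. The variables are assumed to be
    indexed in a causal order (B strictly lower triangular), so each x_i can
    be computed once its causes are known.
    """
    rng = random.Random(seed)
    p, q = len(B), len(Lam[0])
    data = []
    for _ in range(n_samples):
        f = [sample_non_gaussian(rng) for _ in range(q)]  # latent confounders
        x = [0.0] * p
        for i in range(p):
            e_i = sample_non_gaussian(rng)  # external effect of x_i
            x[i] = (sum(B[i][j] * x[j] for j in range(i))
                    + sum(Lam[i][k] * f[k] for k in range(q))
                    + e_i)
        data.append(x)
    return data


# Hypothetical example: one latent confounder affects x_0 and x_1, and x_1 -> x_2.
B = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.8, 0.0]]
Lam = [[0.7], [0.9], [0.0]]
X = generate(B, Lam, n_samples=1000)
```

In this toy example x_0 and x_1 are correlated only through the latent confounder, which is the situation a bi-directed arrow is meant to represent.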

2.2. Research goals

This study has two goals. First, we extract the pairs of observed variables that are affected by the same latent confounders. This is formulated by a matrix C whose element $c_{ij}$ is defined by Equation 3:

$$c_{ij} = \begin{cases} 0 & \text{if } \lambda_{ik} = 0 \text{ or } \lambda_{jk} = 0 \text{ for every } k \\ 1 & \text{otherwise} \end{cases} \tag{3}$$

Element $c_{ij}$ equals 0 when there is no latent confounder affecting variables $x_i$ and $x_j$. Element $c_{ij}$ equals 1 when variables $x_i$ and $x_j$ are affected by the same latent confounders.

Figure 1: (a) Data generation model ($f_1$ and $f_2$ are latent confounders). (b) Causal graph that RCD produces. A bi-directed arrow indicates that two variables are affected by the same latent confounders.

The second goal is to estimate the absence/existence of the causal relations between the observed variables that do not have the same latent confounder. This is defined by a matrix P whose element $p_{ij}$ is expressed by Equation 4:

$$p_{ij} = \begin{cases} 0 & \text{if } b_{ij} = 0 \text{ or } c_{ij} = 1 \\ 1 & \text{otherwise} \end{cases} \tag{4}$$

$p_{ij} = 0$ when $c_{ij} = 1$ because we do not aim to identify the causal direction between the observed variables that are affected by the same latent confounders.

Finally, RCD produces a causal graph where a bi-directed arrow indicates the pair of variables that have the same latent confounders, and a directed arrow indicates the causal direction of a pair of variables that are not affected by the same latent confounder. For example, given the data generation model shown in Figure 1-(a), our final goal is to draw the causal diagram shown in Figure 1-(b), where variables $f_1$ and $f_2$ are latent confounders, and variables A–H are observed variables.
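The two target matrices can be computed directly when the true coefficients are known, for example to build ground truth for a simulation. The sketch below assumes that "affected by the same latent confounder" means that both variables have a nonzero direct loading on the same latent variable; the function names and the example coefficients are hypothetical.

```python
def confounder_matrix(Lam):
    """C[i][j] = 1 if x_i and x_j both load on at least one common latent f_k.

    Assumption: only direct loadings (nonzero entries of Lam) are considered.
    """
    p = len(Lam)
    q = len(Lam[0]) if p else 0
    return [[int(i != j and any(Lam[i][k] != 0 and Lam[j][k] != 0
                                for k in range(q)))
             for j in range(p)] for i in range(p)]


def direct_cause_matrix(B, C):
    """P[i][j] = 1 if x_j -> x_i exists (B[i][j] != 0) and the pair is unconfounded."""
    p = len(B)
    return [[int(B[i][j] != 0 and C[i][j] == 0) for j in range(p)]
            for i in range(p)]


# Hypothetical example: x_0 -> x_1 -> x_2, with one latent confounder on x_0 and x_1.
B = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.8, 0.0]]
Lam = [[0.7], [0.9], [0.0]]
C = confounder_matrix(Lam)
P = direct_cause_matrix(B, C)
```

Here the edge from x_0 to x_1 is dropped from P because that pair is confounded, so the target graph shows a bi-directed arrow between x_0 and x_1 and a directed arrow from x_1 to x_2.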

3. Proposed Method

3.1. The framework

RCD involves three steps: (1) It extracts a set of ancestors of each variable. An ancestor is a direct or indirect cause. In this paper, $M_i$ denotes the set of ancestors of $x_i$, and $M_i$ is initialized as $M_i = \emptyset$. RCD repeats the inference of causal directions between variables and updates $M = \{M_i\}$. When inferring the causal directions between observed variables, RCD removes the effect of the already identified common ancestors. The causal direction between variables $x_i$ and $x_j$ can be identified when the set of identified common causes (i.e., $M_i \cap M_j$) satisfies the back-door criterion [13, 14] to $x_i$ and $x_j$. The repetition of causal inference is stopped when M no longer changes. (2) RCD extracts parents (direct causes) from M. When $x_j$ is an ancestor but not a parent of $x_i$, the causal effect of $x_j$ on $x_i$ is mediated through $M_i \setminus \{x_j\}$. RCD distinguishes direct causes from indirect causes by inferring conditional independence. (3) RCD finds the pairs of variables that are affected by the same latent confounders by extracting the pairs of variables that remain correlated but whose causal direction is not identified.

3.2. Finding ancestors of each variable

RCD repeats the inference of causal directions between a given number of variables to extract the ancestors of each observed variable. We introduce Lemmas 1 and 2, by which the ancestors of each variable can be identified when there is no latent confounder. Then, we extend them to Lemma 3, by which RCD extracts the ancestors of each observed variable in the case that latent confounders exist. We first quote the Darmois-Skitovitch theorem (Theorem 1), proved in [15, 16], because it is used to prove the lemmas.

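Theorem 1. Define two random variables $y_1$ and $y_2$ as linear combinations of independent random variables $s_i$ ($i = 1, \dots, q$): $y_1 = \sum_{i=1}^{q} \alpha_i s_i$, $y_2 = \sum_{i=1}^{q} \beta_i s_i$. Then, if $y_1$ and $y_2$ are independent, all variables $s_j$ for which $\alpha_j \beta_j \neq 0$ are Gaussian.

Lemma 1. For two observed variables $x_i$ and $x_j$, let $r_i^{(j)}$ denote the residual obtained by the linear regression of $x_i$ on $x_j$, and let $r_j^{(i)}$ denote the residual obtained by the linear regression of $x_j$ on $x_i$. The causal relation between variables $x_i$ and $x_j$ is determined as follows: (1) If $x_i$ and $x_j$ are not linearly correlated, then there is no causal effect between $x_i$ and $x_j$. (2) If $x_i$ and $x_j$ are linearly correlated and $x_j$ is independent of residual $r_i^{(j)}$, then $x_j$ is an ancestor of $x_i$. (3) If $x_i$ and $x_j$ are linearly correlated and $x_j$ is dependent on $r_i^{(j)}$ and $x_i$ is dependent on $r_j^{(i)}$, then $x_i$ and $x_j$ have a common ancestor. (4) There is no case that $x_i$ and $x_j$ are linearly correlated and $x_j$ is independent of $r_i^{(j)}$ and $x_i$ is independent of $r_j^{(i)}$.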

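Proof. The causal relationship between two variables $x_i$ and $x_j$ can be classified into the following four cases: (Case 1) There is no common cause of the two variables, and there is no causal effect between them; (Case 2) There is no common cause of the two variables, and one variable is a cause of the other variable; (Case 3) There are common causes of the two variables, and there is no causal effect between them; (Case 4) There are common causes of the two variables, and one variable is a cause of the other variable. Cases 1, 2, 3, and 4 are modeled by Equations 5, 6, 7, and 8, respectively:

$$x_i = e_i, \quad x_j = e_j \tag{5}$$

$$x_i = b_{ij} x_j + e_i, \quad x_j = e_j \tag{6}$$

$$x_i = c_i + e_i, \quad x_j = c_j + e_j \tag{7}$$

$$x_i = b_{ij} x_j + c_i + e_i, \quad x_j = c_j + e_j \tag{8}$$

where $e_i$ and $e_j$ are the non-Gaussian external effects that are mutually independent, $b_{ij}$ is the non-zero causal strength from $x_j$ to $x_i$, and $c_i$ and $c_j$ are the linear combinations of the common causes of $x_i$ and $x_j$. The linear combinations of the common causes $c_i$ and $c_j$ are linearly correlated and are independent of $e_i$ and $e_j$. We investigate the following three points for each case: (1) whether $x_i$ and $x_j$ are linearly correlated, (2) whether $x_j$ is independent of $r_i^{(j)}$, the residual of the regression of $x_i$ on $x_j$, and (3) whether $x_i$ is independent of $r_j^{(i)}$, the residual of the regression of $x_j$ on $x_i$.

Case 1: Variables $x_i$ and $x_j$ are mutually independent because of Equation 5. Therefore, $x_i$ and $x_j$ are not linearly correlated. Let $\gamma$ denote the coefficient of $x_j$ when $x_i$ is regressed on $x_j$. Since $x_i$ and $x_j$ are mutually independent, $\gamma = 0$. Then,

$$r_i^{(j)} = x_i - \gamma x_j = e_i \tag{9}$$

Therefore, $x_j$ is independent of $r_i^{(j)}$ because $e_i$ and $e_j$ are mutually independent. Similarly, $x_i$ is independent of $r_j^{(i)}$.

Case 2: Variables $x_i$ and $x_j$ are linearly correlated because $b_{ij} \neq 0$. Let $\gamma$ denote the coefficient of $x_j$ when $x_i$ is regressed on $x_j$. Then, $\gamma = b_{ij}$ because $b_{ij} x_j$ is the only term on the right side of equation $x_i = b_{ij} x_j + e_i$ that covaries with $x_j$. Then, we have $r_i^{(j)}$:

$$r_i^{(j)} = x_i - b_{ij} x_j = e_i \tag{10}$$

Then, $x_j$ is independent of $r_i^{(j)}$ because $e_j$ is independent of $e_i$. Let $\delta$ denote the coefficient of $x_i$ when $x_j$ is regressed on $x_i$. Since $x_i$ and $x_j$ are linearly correlated, $\delta \neq 0$. Then, we have $r_j^{(i)}$:

$$r_j^{(i)} = x_j - \delta x_i = (1 - \delta b_{ij}) e_j - \delta e_i \tag{11}$$

Then, $x_i$ is not independent of $r_j^{(i)}$ because of the term $\delta e_i$ in Equation 11 and Theorem 1.

Case 3: Since $c_i$ and $c_j$ are linearly correlated, $x_i$ and $x_j$ are linearly correlated. Let $\gamma$ denote the coefficient of $x_j$ when $x_i$ is regressed on $x_j$. Since $x_i$ and $x_j$ are linearly correlated, $\gamma \neq 0$. Then, we have $r_i^{(j)}$:

$$r_i^{(j)} = x_i - \gamma x_j = (c_i - \gamma c_j) + e_i - \gamma e_j \tag{12}$$

Then, $x_j$ is not independent of $r_i^{(j)}$ because of the term $\gamma e_j$ in Equation 12 and Theorem 1. Similarly, $x_i$ is not independent of $r_j^{(i)}$.

Case 4: Since $c_i$ and $c_j$ are linearly correlated, $x_i$ and $x_j$ are linearly correlated. Let $\gamma$ denote the coefficient of $x_j$ when $x_i$ is regressed on $x_j$. Then, $\gamma \neq b_{ij}$ because $x_j$ covaries with terms $b_{ij} x_j$ and $c_i$ on the right side of equation $x_i = b_{ij} x_j + c_i + e_i$. We have $r_i^{(j)}$:

$$r_i^{(j)} = x_i - \gamma x_j = (b_{ij} - \gamma)(c_j + e_j) + c_i + e_i \tag{13}$$

Then, $x_j$ is not independent of $r_i^{(j)}$ because of the term $(b_{ij} - \gamma) e_j$ in Equation 13 and Theorem 1. Let $\delta$ denote the coefficient of $x_i$ when $x_j$ is regressed on $x_i$. Since $x_i$ and $x_j$ are linearly correlated, $\delta \neq 0$. Then, we have $r_j^{(i)}$:

$$r_j^{(i)} = x_j - \delta x_i = (1 - \delta b_{ij})(c_j + e_j) - \delta c_i - \delta e_i \tag{14}$$

Then, $x_i$ is not independent of $r_j^{(i)}$ because of the term $\delta e_i$ in Equation 14 and Theorem 1.

These cases can be summarized as follows: (Case 1) $x_i$ and $x_j$ are not linearly correlated; (Case 2) $x_i$ and $x_j$ are linearly correlated, $x_j$ is independent of $r_i^{(j)}$, and $x_i$ is not independent of $r_j^{(i)}$ when the causal direction is $x_j \to x_i$; (Cases 3 and 4) $x_i$ and $x_j$ are linearly correlated, $x_j$ is not independent of $r_i^{(j)}$, and $x_i$ is not independent of $r_j^{(i)}$. Lemma 1-(1) assumes that $x_i$ and $x_j$ are not linearly correlated. This assumption only corresponds to Case 1. Therefore, there is no causal effect between $x_i$ and $x_j$. Lemma 1-(2) assumes that $x_i$ and $x_j$ are linearly correlated, and $x_j$ is independent of $r_i^{(j)}$. This assumption only corresponds to Case 2. Therefore, $x_j$ is an ancestor of $x_i$. Lemma 1-(3) assumes that $x_i$ and $x_j$ are linearly correlated, $x_j$ is not independent of $r_i^{(j)}$, and $x_i$ is not independent of $r_j^{(i)}$. This corresponds to Case 3 or Case 4. Therefore, $x_i$ and $x_j$ have common ancestors. As stated in Lemma 1-(4), there is no case among Cases 1–4 where $x_i$ and $x_j$ are linearly correlated, $x_j$ is independent of $r_i^{(j)}$, and $x_i$ is independent of $r_j^{(i)}$.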
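The asymmetry that Lemma 1 relies on can be observed numerically. The following sketch simulates a pair in which x_j causes x_i with no common cause, regresses in both directions, and compares how strongly each regressor depends on the corresponding residual. It is an illustration under assumptions made here: uniform noise, a unit causal strength, a biased HSIC estimate with unit-bandwidth Gaussian kernels on standardized data, and a comparison of raw HSIC values rather than the p-value test with an alpha level that the paper describes. All function names are hypothetical.

```python
import math
import random


def standardize(v):
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / s for a in v]


def pearson(x, y):
    xs, ys = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(xs, ys)) / len(xs)


def ols_residual(y, x):
    """Residual of the simple linear regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def hsic(x, y):
    """Biased HSIC estimate, trace(K H L H) / n^2, with Gaussian kernels."""
    xs, ys = standardize(x), standardize(y)
    n = len(xs)
    K = [[math.exp(-0.5 * (a - b) ** 2) for b in xs] for a in xs]
    L = [[math.exp(-0.5 * (a - b) ** 2) for b in ys] for a in ys]
    row = [sum(r) / n for r in K]   # K is symmetric: row means equal column means
    tot = sum(row) / n
    return sum((K[i][j] - row[i] - row[j] + tot) * L[i][j]
               for i in range(n) for j in range(n)) / n ** 2


# Simulated Case 2: x_j -> x_i, no common cause, non-Gaussian (uniform) noise.
rng = random.Random(0)
n = 400
x_j = [rng.uniform(-1.0, 1.0) for _ in range(n)]
x_i = [a + rng.uniform(-1.0, 1.0) for a in x_j]

correlation = pearson(x_i, x_j)
dep_forward = hsic(x_j, ols_residual(x_i, x_j))    # x_j versus residual of x_i on x_j
dep_backward = hsic(x_i, ols_residual(x_j, x_i))   # x_i versus residual of x_j on x_i
```

In the causal direction the residual is close to the external effect of x_i, so `dep_forward` should be near the level expected under independence, while `dep_backward` should be clearly larger. A full implementation would replace this raw comparison with an independence test at a chosen alpha level.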

It is necessary to remove the effect of common causes to infer the causal directions between variables. When the set of the identified common causes of variables $x_i$ and $x_j$ satisfies the back-door criterion, the causal direction between $x_i$ and $x_j$ can be identified. The back-door criterion [13, 14] is defined as follows:

Definition 1. A set of variables Z satisfies the back-door criterion relative to an ordered pair of variables ($x_i$, $x_j$) in a DAG G if no node in Z is a descendant of $x_i$, and Z blocks every path between $x_i$ and $x_j$ that contains an arrow into $x_i$.

Lemma 1 is generalized to Lemma 2 to incorporate the process of removing the effects of the identified common causes. Lemma 2 can also be used to determine whether the identified common causes are sufficient to detect the causal direction between the two variables.

Lemma 2. Let $H_{ij}$ denote the set of common ancestors of $x_i$ and $x_j$. Let $y_i$ and $y_j$ denote the residuals when $x_i$ and $x_j$ are regressed on $H_{ij}$, respectively. Let $r_i^{(j)}$ and $r_j^{(i)}$ denote the residual obtained by the linear regression of $y_i$ on $y_j$, and $y_j$ on $y_i$, respectively. The causality and the existence of the confounders are determined by the following criteria: (1) If $y_i$ and $y_j$ are not linearly correlated, then there is no causal effect between $x_i$ and $x_j$. (2) If $y_i$ and $y_j$ are linearly correlated and $y_j$ is independent of the residual $r_i^{(j)}$, then $x_j$ is an ancestor of $x_i$. (3) If $y_i$ and $y_j$ are linearly correlated and $y_j$ is dependent on $r_i^{(j)}$ and $y_i$ is dependent on $r_j^{(i)}$, then $x_i$ and $x_j$ have a common ancestor other than $H_{ij}$, and $H_{ij}$ does not satisfy the back-door criterion to ($x_i$, $x_j$) or ($x_j$, $x_i$). (4) There is no case that $y_i$ and $y_j$ are linearly correlated and $y_j$ is independent of $r_i^{(j)}$ and $y_i$ is independent of $r_j^{(i)}$.

Proof. When Lemma 1 is applied to $y_i$ and $y_j$, Lemma 2 is derived.

Figure 2: (a) Variables A, B, and C are the causes of variable D, and they have a common cause, $f_1$. (b) Variables A and B are the causes of D, but C is not.

Next, we consider the case that there are latent confounders. In Lemma 2, the direction between two variables is inferred by regression and independence tests. However, if there are two paths from latent confounder $f_k$ to $x_i$, and $x_j$ is only on one of the paths, then the set of identified common ancestors of $x_i$ and $x_j$ cannot satisfy the back-door criterion. For example, in Figure 2-(a), variables A, B, and C are the causes of variable D, and the causes are also affected by the same latent confounder $f_1$. The causal direction between A and D cannot be inferred only by inferring the causality between them because the effect of $f_1$ is mediated through B and C to D. Therefore, A, B, and C are the causes of D when they are independent of the residual obtained by the multiple regression of D on {A, B, C}. However, it is necessary to confirm that variables in each proper subset of {A, B, C} are not independent of the residual obtained by the regression of D on the proper subset (i.e., no proper subset of {A, B, C} satisfies the back-door criterion). For example, in Figure 2-(b), C is not a cause of D, but A, B, and C are all independent of the residual obtained by the multiple regression of D on {A, B, C}. C should not be regarded as a cause of D because A and B are also independent of the residual when D is regressed on {A, B}. This example is generalized and formulated by Lemma 3:

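Lemma 3. Let X denote the set of all observed variables. Let U denote a subset of X that contains $x_i$ (i.e., $U \subset X$ and $x_i \in U$). Let M denote the sequence of $M_j$, where $M_j$ is a set of ancestors of $x_j$. For each $x_j \in U$, let $y_j$ denote the residual obtained by the multiple linear regression of $x_j$ on the common ancestors of U, where the set of common ancestors of U is $\bigcap_{x_j \in U} M_j$. We define $f(x_i, U, M)$ as a function that returns 1 when each $y_j$ with $x_j \in U \setminus \{x_i\}$ is independent of the residual obtained by the multiple linear regression of $y_i$ on $\{y_j \mid x_j \in U \setminus \{x_i\}\}$; otherwise it returns 0. If $f(x_i, V, M) = 0$ for each proper subset $V \subset U$ that contains $x_i$ and $f(x_i, U, M) = 1$, then each $x_j \in U \setminus \{x_i\}$ is an ancestor of $x_i$.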

Proof. We prove Lemma 3 by contradiction. Assume that is not an ancestor of , even though ) = 0 for each , and ) = 1. Let denote the set that consists of the descendants of and itself.

Then,

Let denote the set of common causes of U (i.e. = ). Let

denote the coefficient of when is regressed on . Then,

Let denote the residual obtained by the multiple regression of on , and let denote the coefficient of obtained by the multiple

regression of on . Then, we have :

There is no term that includes , the external effect of , other than in Equation 15. External effect is independent of the other terms in Equation 15. Since is independent of , = 0 by Theorem 1. Therefore, we have as follows:

Every is independent of . This means ) = 1, and it contradicts the assumption; that is, ) = 0 for each .

We describe the procedure and the implementation of how RCD extracts the ancestors of each observed variable in Algorithm 1. The output of the algorithm is sequence $M = \{M_i\}$, where $M_i$ is the set of identified ancestors of $x_i$. Argument $\alpha_C$ is the alpha level for the p-value of the Pearson's correlation. If the p-value of two variables is smaller than $\alpha_C$, then we estimate that the variables are linearly correlated. Argument $\alpha_I$ is the alpha level for the p-value of the Hilbert-Schmidt independence criterion (HSIC) [17]. If the p-value of the HSIC of two variables is greater than $\alpha_I$, then we estimate that the variables are mutually independent. Argument $\alpha_S$ is the alpha level to test whether a variable is generated from a non-Gaussian process using the Shapiro-Wilk test [18]. Argument n is the maximum number of explanatory variables used in multiple linear regression for identifying causal directions; i.e., the maximum value of $(|U| - 1)$ in Lemma 3. In practice, this should be set to a small number when the number of samples is smaller than the number of variables. RCD does not perform multiple regression analysis of more than n explanatory variables.

RCD initializes $M_i$ to be an empty set for each $x_i \in X$. RCD repeats the inference between the variables in each $U \subset X$ that has (l + 1) elements. Number l is initialized to 1. If there is no change in M, l is increased by 1. If there is a change in M, l is set to 1. When l exceeds n, the repetition ends. Variable changed has information about whether there is a change in M within an iteration.
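The iteration control described above can be sketched as follows. This is only the outer loop, not Algorithm 1 itself: the statistical checks are hidden behind a hypothetical callable `is_sink(x, U, M)`, and the function name `extract_ancestors` is likewise chosen here for illustration. The sketch also assumes, as the text on the sink candidates suggests, that a variable is skipped when one of its already identified ancestors is in U and that an update is made only when U has exactly one sink candidate.

```python
from itertools import combinations


def extract_ancestors(variables, n, is_sink):
    """Repeat the inference over subsets U of size l + 1 until l exceeds n.

    is_sink(x, U, M) is a hypothetical stand-in for the regression and
    independence checks; it returns True when x passes the sink test in U.
    """
    M = {v: set() for v in variables}   # M[x]: identified ancestors of x
    l = 1
    while l <= n:
        changed = False
        for U in combinations(variables, l + 1):
            # Candidates: no already identified ancestor of x is in U.
            S = [x for x in U
                 if not (set(U) - {x}) & M[x] and is_sink(x, U, M)]
            if len(S) == 1:             # accept only a unique sink
                x_i = S[0]
                M[x_i] |= set(U) - {x_i}
                changed = True
        l = 1 if changed else l + 1     # restart at l = 1 after any change
    return M


# Toy check with an oracle in place of statistical tests: a -> b -> c.
TRUE_ANCESTORS = {"a": set(), "b": {"a"}, "c": {"a", "b"}}


def oracle_is_sink(x, U, M):
    return set(U) - {x} <= TRUE_ANCESTORS[x]


M = extract_ancestors(["a", "b", "c"], n=2, is_sink=oracle_is_sink)
```

Because each accepted update adds at least one new ancestor, the loop terminates after finitely many restarts.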

In line 16 of Algorithm 1, RCD confirms that there is no identified ancestor of $x_i$ in U by checking that $(U \setminus \{x_i\}) \cap M_i = \emptyset$. This confirms that $f(x_i, V, M) = 0$ for each $V \subset U$ in Lemma 3. In lines 17–24, RCD checks whether $f(x_i, U, M) = 1$ in Lemma 3. When $f(x_i, U, M) = 1$ is satisfied, $x_i$ is put into S. S is a set of candidates for a sink (a variable that is not a cause of the others) in U. It is necessary to test whether there is only one sink in U because two variables may be misinterpreted as causes of each other when the alpha level for the independence test ($\alpha_I$) is too small.

We use least squares regression for removing the effect of common causes in line 12 of Algorithm 1, but we use a variant of multiple linear regression called multilinear HSIC regression (MLHSICR) to examine the causal directions between variables in U in line 20 of Algorithm 1 when there are two or more explanatory variables (i.e., $|U| > 2$). Coefficients obtained by multiple linear regression using the ordinary least squares method with linearly correlated explanatory variables often differ from true values due to estimation errors. Thus, the relationship between the explanatory variables and the residual may be misinterpreted to be dependent in the case that explanatory variables are affected by the same latent confounders. To avoid such failure, we use MLHSICR defined as follows:

Definition 2. Let variable $x_i$ denote an explanatory variable, x denote a vector that collects explanatory variables $x_i$, and y denote a response variable. MLHSICR models the relationship $y = \boldsymbol{\lambda}^{\top} \mathbf{x}$ by the coefficient vector $\hat{\boldsymbol{\lambda}}$ in the following equation:

$$\hat{\boldsymbol{\lambda}} = \arg\min_{\boldsymbol{\lambda}} \sum_{i} \mathrm{HSIC}(x_i, \; y - \boldsymbol{\lambda}^{\top} \mathbf{x}) \tag{17}$$

where HSIC(a, b) denotes the Hilbert-Schmidt independence criterion of a and b.

Mooij et al. [6] have developed a method to estimate the nonlinear causal function between variables by minimizing the HSIC between the explanatory variables and the residual. RCD estimates $\hat{\boldsymbol{\lambda}}$ by minimizing the sum of the HSICs in Equation 17 using the L-BFGS method [19], similar to Mooij et al. [6]. L-BFGS is a quasi-Newton method, and RCD uses the coefficients obtained by the least squares method as the initial value of $\boldsymbol{\lambda}$.

3.3. Finding parents of each variable

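When $x_j$ is an ancestor but not a parent of $x_i$, the effect of $x_j$ on $x_i$ is mediated through $M_i \setminus \{x_j\}$. Therefore, $x_i \perp\!\!\!\perp x_j \mid M_i \setminus \{x_j\}$. The authors of [20] proposed a method to test the conditional independence using unconditional independence testing in Theorem 2 (proved by them):

Theorem 2. If $x_i$ and $x_j$ are neither directly connected nor unconditionally independent, then there must exist a set of variables Z and two functions f and g such that $x_i - f(Z) \perp\!\!\!\perp x_j - g(Z)$, and $x_i - f(Z) \perp\!\!\!\perp Z$ or $x_j - g(Z) \perp\!\!\!\perp Z$.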

where f and g are multiple linear regression functions of on and

on , respectively. Since (, we can assume

that ⊥⊥ | (\ {}) ⇔ ) ⊥⊥ \ {}) where h is a

multiple linear regression function of on ().

Based on Theorem 2, RCD uses Lemma 4 to distinguish the parents from the ancestors. We proved Lemma 4 without using Theorem 2.

Lemma 4. Assume that $x_j \in M_i$; that is, $x_j$ is an ancestor of $x_i$. Let $z_i$ denote the residual obtained by the multiple regression of $x_i$ on $M_i \setminus \{x_j\}$. Let $w_j$ denote the residual obtained by the multiple regression of $x_j$ on ($M_i \cap M_j$). If $z_i$ and $w_j$ are linearly correlated, then $x_j$ is a parent of $x_i$; otherwise, $x_j$ is not a parent of $x_i$.

Proof. Variable and are formulated as follows:

Let denote the coefficient of ) when is regressed on .

Then,

Let denote the coefficient of ) when is regressed on .

Then,

=

=

Since and do not have the same latent confounder:

From Equations 21, 22, and 23, the residuals $z_i$ and $w_j$ are linearly correlated when $b_{ij} \neq 0$. It means that $x_j$ is a parent (direct cause) of $x_i$. When $b_{ij} = 0$, $z_i$ and $w_j$ are not linearly correlated. It means that $x_j$ is not a parent of $x_i$.
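The parent check of Lemma 4 can be illustrated on a simulated chain. The sketch below assumes that the second regression in Lemma 4 is on the common ancestors of the two variables, and it replaces the p-value test of Pearson's correlation with a fixed threshold on the absolute correlation. The helper names, the threshold, the uniform noise, and the coefficients are all assumptions made for illustration.

```python
import math
import random


def residual(y, regressors):
    """OLS residual of y on the regressors (with intercept), via Gram-Schmidt.

    Assumes the regressors are not perfectly collinear.
    """
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    def project_out(v, b):
        coef = sum(p * q for p, q in zip(v, b)) / sum(q * q for q in b)
        return [p - coef * q for p, q in zip(v, b)]

    basis = []
    for col in regressors:
        q = center(col)
        for b in basis:
            q = project_out(q, b)
        basis.append(q)
    r = center(y)
    for b in basis:
        r = project_out(r, b)
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_parent(data, M, j, i, threshold=0.1):
    """Lemma 4 check: is ancestor x_j a parent of x_i?"""
    z_i = residual(data[i], [data[k] for k in sorted(M[i] - {j})])
    w_j = residual(data[j], [data[k] for k in sorted(M[i] & M[j])])
    return abs(pearson(z_i, w_j)) > threshold   # threshold replaces the p-value test


# Simulated chain x0 -> x1 -> x2 with uniform (non-Gaussian) noise.
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
x1 = [0.9 * a + rng.uniform(-1.0, 1.0) for a in x0]
x2 = [0.8 * b + rng.uniform(-1.0, 1.0) for b in x1]
data = {0: x0, 1: x1, 2: x2}
M = {0: set(), 1: {0}, 2: {0, 1}}               # ancestor sets, assumed already identified

x0_parent_of_x2 = is_parent(data, M, j=0, i=2)
x1_parent_of_x2 = is_parent(data, M, j=1, i=2)
x0_parent_of_x1 = is_parent(data, M, j=0, i=1)
```

In this chain x0 influences x2 only through x1, so the check should reject x0 as a parent of x2 while accepting x1 as a parent of x2 and x0 as a parent of x1.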

3.4. Identifying pairs of variables that have the same latent confounders

RCD infers that two variables are affected by the same latent confounders when those two variables are linearly correlated even after removing the effects of all the parents. RCD identifies the pairs of variables affected by the same latent confounders by using Lemma 5.

Lemma 5. Let $M_i$ and $M_j$ respectively denote the sets of ancestors of $x_i$ and $x_j$, and $P_i$ and $P_j$ respectively denote the sets of parents of $x_i$ and $x_j$. Assume that $x_i \notin M_j$ and $x_j \notin M_i$. Let $z_i$ denote the residual obtained by the multiple regression of $x_i$ on $P_i$, and $z_j$ denote the residual obtained by the multiple regression of $x_j$ on $P_j$. If $z_i$ and $z_j$ are linearly correlated, then $x_i$ and $x_j$ have the same latent confounders.

Proof. Variable and are formulated as follows:

Let denote the coefficient of when is regressed on . Then,

Let denote the coefficient of when is regressed on . Then,

Variables and are independent of each other. If we assume that and

do not have the same latent confounder, then,

Then, the residuals $z_i$ and $z_j$ are mutually independent. However, this contradicts the assumption of Lemma 5 that $z_i$ and $z_j$ are linearly correlated. Therefore, $x_i$ and $x_j$ have the same latent confounders.
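The check in Lemma 5 can be sketched in the same spirit: remove the effect of each variable's parents and see whether the residuals remain correlated. The example assumes that the ancestor and parent sets have already been identified, skips pairs in which one variable is an ancestor of the other, and uses a fixed correlation threshold in place of a p-value test. Names, coefficients, and the threshold are hypothetical.

```python
import math
import random
from itertools import combinations


def residual(y, regressors):
    """OLS residual of y on the regressors (with intercept), via Gram-Schmidt."""
    def center(v):
        m = sum(v) / len(v)
        return [a - m for a in v]

    def project_out(v, b):
        coef = sum(p * q for p, q in zip(v, b)) / sum(q * q for q in b)
        return [p - coef * q for p, q in zip(v, b)]

    basis = []
    for col in regressors:
        q = center(col)
        for b in basis:
            q = project_out(q, b)
        basis.append(q)
    r = center(y)
    for b in basis:
        r = project_out(r, b)
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))


def confounded_pairs(data, M, P, threshold=0.1):
    """Pairs whose residuals (after removing parents) stay linearly correlated."""
    pairs = set()
    for i, j in combinations(sorted(data), 2):
        if i in M[j] or j in M[i]:
            continue                     # the lemma applies to unordered pairs only
        z_i = residual(data[i], [data[k] for k in sorted(P[i])])
        z_j = residual(data[j], [data[k] for k in sorted(P[j])])
        if abs(pearson(z_i, z_j)) > threshold:
            pairs.add((i, j))
    return pairs


# Simulated data: a latent f affects x0 and x1, and x1 -> x2.
rng = random.Random(0)
f = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
x0 = [0.7 * a + rng.uniform(-1.0, 1.0) for a in f]
x1 = [0.9 * a + rng.uniform(-1.0, 1.0) for a in f]
x2 = [0.8 * b + rng.uniform(-1.0, 1.0) for b in x1]
data = {0: x0, 1: x1, 2: x2}
M = {0: set(), 1: set(), 2: {1}}         # ancestors, assumed already identified
P = {0: set(), 1: set(), 2: {1}}         # parents, assumed already identified

pairs = confounded_pairs(data, M, P)
```

Only the pair (x0, x1) should be reported: once the parent x1 is regressed out of x2, what remains of x2 is essentially its own external effect, which is unrelated to x0.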

4. Performance evaluation

We evaluated the performance of RCD relative to the existing methods in terms of how accurately it finds the pairs of variables that are affected by the same latent confounders and how accurately it infers the causal directions of the pairs of variables that are not affected by the same latent confounder. In regard to the latent confounders, we compared RCD with FCI [4], RFCI [21], and GFCI [22]. In addition to these three methods, we compared RCD with PC [1], GES [2], DirectLiNGAM [8], and RESIT [9] to evaluate the accuracy of causal directions. In the following sections, DirectLiNGAM is called LiNGAM for simplicity.

4.1. Performance on simulated structures

Figure 3: Performance evaluation on causal graphs using simulated data: The vertical red lines indicate the median values of the results. The evaluation of the latent confounders corresponds to the evaluation of bi-directed arrows. The evaluation of causality corresponds to the evaluation of directed arrows.

We performed 100 experiments to evaluate RCD relative to the existing methods. We prepared 300 sets of samples for each experiment. The data of each experiment were generated as follows: The data generation process was modeled the same as Equation 1. The number of observed variables was set to 20 and the number of latent confounders was set to 4. Let X and Y denote the stochastic variables, and assume that $Y \sim N(0.0, 0.5)$ and $X = Y^3$. We used the random samples of X for $e_i$ and $f_k$ because X is non-Gaussian. The number of causal arrows between the observed variables is 40, and the start point and the end point of each causal arrow were randomly selected. We randomly drew two causal arrows from each latent confounder to the observed variables. Let Z denote a stochastic variable that comes from a uniform distribution on $[-1.0, -0.5]$ and $[0.5, 1.0]$. We used the random samples of Z for $b_{ij}$ and $\lambda_{ik}$.

We evaluated (1) how accurately each method infers the pairs of variables that are affected by the same latent confounders (called the evaluation of latent confounders), and (2) how accurately each method infers causality between the observed variables that are not affected by the same latent confounder (called the evaluation of causality). The evaluation of latent confounders corresponds to the evaluation of bi-directed arrows in a causal graph, and the evaluation of causality corresponds to the evaluation of directed arrows. We used precision, recall, and F-measure as evaluation measures. In regard to the evaluation of latent confounders, true positive (TP) is the number of true bi-directed arrows that are correctly inferred. In regard to causality, TP is the number of true directed arrows that a method correctly infers in terms of their positions and directions. Precision is TP divided by the number of estimated arrows, and recall is TP divided by the number of all true arrows. F-measure is defined as F-measure = 2 · precision · recall / (precision + recall).
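The evaluation measures can be computed as in the following sketch. It is illustrative only: the function name, the representation of directed arrows as (cause, effect) tuples, and the convention of returning 0 when a denominator is 0 are assumptions made here, and the example edge sets are made up.

```python
def precision_recall_f(true_arrows, estimated_arrows):
    """Precision, recall, and F-measure for a set of estimated arrows.

    Directed arrows are (cause, effect) tuples, so a reversed arrow does not
    count as a true positive. Bi-directed arrows can be passed as frozensets.
    """
    true_arrows, estimated_arrows = set(true_arrows), set(estimated_arrows)
    tp = len(true_arrows & estimated_arrows)
    precision = tp / len(estimated_arrows) if estimated_arrows else 0.0
    recall = tp / len(true_arrows) if true_arrows else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: two of four estimated arrows are correct; ("C", "B") is reversed.
true_arrows = {("A", "B"), ("B", "C"), ("C", "D")}
estimated = {("A", "B"), ("C", "B"), ("C", "D"), ("A", "D")}
precision, recall, f_measure = precision_recall_f(true_arrows, estimated)
```

For this example, TP is 2, so precision is 2/4, recall is 2/3, and the F-measure is 4/7.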

The arguments of RCD, that is, $\alpha_C$ (alpha level for Pearson's correlation), $\alpha_I$ (alpha level for independence), $\alpha_S$ (alpha level for the Shapiro-Wilk test), and n (maximum number of explanatory variables for multiple linear regression) were set as $\alpha_C = 0.01$, $\alpha_I = 0.01$, $\alpha_S = 0.01$, and n = 2.

In regard to the types of edges, FCI, RFCI, and GFCI produce partial ancestral graphs (PAGs) that include six types of edges: → (directed), ↔ (bi-directed), o→ (partially directed), o-o (nondirected), and o- (partially undirected). In the evaluation, we only used the directed and bi-directed edges. PC, GES, LiNGAM, and RESIT produce causal graphs only with the directed edges; thus, we did not evaluate those methods in terms of latent confounders.

The box plots in Figure 3 display the results. The vertical red lines indicate the median values. Note that some median values are the same as the upper or lower quartiles. For example, the median and the upper quartile of the recalls of RCD in the results of latent confounders are the same. It means that the results between the median and the upper quartile are the same. In regard to the evaluation of latent confounders, the precision, recall, and F-measure values are almost the same for RCD, FCI, RFCI, and GFCI, but the medians of precision, recall, and F-measure values of RCD are the highest among them. In regard to causality, RCD scores the highest medians of the precision and F-measure values among all the methods, and the median of recall for RCD is the second highest next to RESIT.

The results suggest that RCD does not greatly improve the performance metrics compared to the existing methods. However, there is no other method that has the highest or the second highest performance for each metric. FCI, RFCI, and GFCI perform as well as RCD in terms of finding the pairs of variables that are affected by the same latent confounders, but they do not perform well in terms of the recall of causality. In addition, no other method performs well in terms of both precision and recall of causality. RCD can successfully find the pairs of variables that are affected by the same latent confounders and identify the causal direction between variables that are not affected by the same latent confounder.

4.2. Performance on real-world structures

Causal structures in the real world are often very complex. Therefore, RCD likely produces a causal graph where each pair of observed variables is connected with a bi-directed arrow. The result of identifying latent confounders is affected by the threshold of the p-value for the independence test, $\alpha_I$. If $\alpha_I$ is too large or too small, then all the variable pairs are likely concluded to have the same latent confounders. Therefore, we need to find the most appropriate value of $\alpha_I$. We increased k from 1 to 25, set $\alpha_I$ as $\alpha_I = 0.1^k$, and repeated the process. We adopted the result that has the smallest number of pairs of variables with the same latent confounders.

We analyzed the General Social Survey data set, taken from a sociological data repository. The data have been used for the evaluation of DirectLiNGAM in Shimizu et al. [8]. The sample size is 1380. The variables and the possible directions are shown in Figure 4. The directions were determined based on the domain knowledge in Duncan et al. [23] and the temporal order of the variables.

Figure 4: Variables and causal relations in the General Social Survey data set used for the evaluation.

Table 1: The results of the application to sociological data.

We evaluated the directed arrows (causality) in the causal graphs produced by RCD and the existing methods, based on the directed arrows in Figure 4. In addition, we evaluated the bi-directed arrows in the causal graphs produced by the methods as accurate inferences if they exist in Figure 4 as directed arrows.
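The evaluation rule above can be sketched as follows, assuming that a bi-directed arrow counts as correct when the reference graph contains a directed arrow between the same two variables in either direction. The function name and the variable names `v1` to `v3` are hypothetical placeholders and do not correspond to the variables in Figure 4.

```python
def count_correct_arrows(pred_directed, pred_bidirected, reference_directed):
    """Count predicted arrows that agree with the reference graph.

    A directed arrow is correct if the same (cause, effect) arrow is in the
    reference. A bi-directed arrow is correct if the reference has a directed
    arrow between the same two variables, whatever its orientation.
    """
    reference = set(reference_directed)
    reference_pairs = {frozenset(edge) for edge in reference}
    predicted_pairs = {frozenset(edge) for edge in pred_bidirected}
    correct_directed = sum(1 for edge in set(pred_directed) if edge in reference)
    correct_bidirected = sum(1 for pair in predicted_pairs if pair in reference_pairs)
    return correct_directed, correct_bidirected


# Hypothetical reference graph and predictions.
reference_graph = {("v1", "v2"), ("v2", "v3"), ("v1", "v3")}
predicted_directed = {("v1", "v2"), ("v3", "v2")}  # second arrow is reversed
predicted_bidirected = {("v3", "v1")}              # matches v1 -> v3 in the reference

n_directed_ok, n_bidirected_ok = count_correct_arrows(
    predicted_directed, predicted_bidirected, reference_graph
)
directed_precision = n_directed_ok / len(predicted_directed)
bidirected_precision = n_bidirected_ok / len(predicted_bidirected)
```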

The results are listed in Table 1. Regarding bi-directed arrows (latent confounders), the number of successful inferences by RCD is the highest, and the precisions of RCD, FCI, and RFCI are all 1.0. Regarding directed arrows (causality), the numbers of successful arrows of RCD, RESIT, and LiNGAM are the highest. The precisions of RCD and LiNGAM are also the highest. The causal graph produced by RCD is shown in Figure 5. The dashed arrow is an incorrect inference, but the others are correct.

Figure 5: Causal graph produced by RCD. The dashed arrow is an incorrect inference.

RCD performs best among the compared methods in terms of both identifying the pairs of variables that are affected by the same latent confounders and identifying the causal direction of the pairs of variables that are not affected by the same latent confounder.

5. Conclusion

We developed a method called repetitive causal discovery (RCD) that produces a causal graph in which a directed arrow indicates the causal direction between observed variables, and a bi-directed arrow indicates a pair of variables that have the same latent confounder. RCD produces a causal graph by (1) finding the ancestors of each variable, (2) distinguishing the parents from the indirect causes, and (3) identifying the pairs of variables that have the same latent confounders. We confirmed that RCD effectively analyzes data confounded by unobserved variables through validations using simulated and real-world data.

In this paper, we did not discuss the utilization of prior knowledge. However, it is possible to make use of prior knowledge of causal relations in practical applications of RCD. In this study, the information about the ancestors of each variable was initialized to an empty set. If we have prior knowledge about causal relations, the information about the ancestors of each variable that RCD retains can be set according to the prior knowledge.
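One way to picture this initialization is the sketch below. It assumes, purely for illustration, that the ancestor information is held as one set per variable and that prior knowledge is given as a list of known (cause, effect) pairs; the function name, this data layout, and the transitive closure step are assumptions and are not part of the method as described in this paper.

```python
def initial_ancestors(variables, known_edges=()):
    """Build per-variable ancestor sets, empty unless prior knowledge is given."""
    ancestors = {v: set() for v in variables}
    for cause, effect in known_edges:
        ancestors[effect].add(cause)
    # Assumption: an ancestor of an ancestor is also an ancestor, so the
    # known relations are closed transitively before being handed over.
    changed = True
    while changed:
        changed = False
        for v in variables:
            inherited = set()
            for a in ancestors[v]:
                inherited |= ancestors[a]
            if not inherited <= ancestors[v]:
                ancestors[v] |= inherited
                changed = True
    return ancestors


# Without prior knowledge every ancestor set starts empty, as in this study.
no_prior = initial_ancestors(["x", "y", "z"])

# With hypothetical prior knowledge x -> y and y -> z.
with_prior = initial_ancestors(["x", "y", "z"], [("x", "y"), ("y", "z")])
```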

There is still room for improvement in the RCD method. The optimal settings of the arguments of RCD and the extension of RCD for nonlinear causal relations will be investigated in future studies.

6. Acknowledgments

We thank Dr. Samuel Y. Wang for his useful comments on a previous version of our algorithm proposed in [11]. Takashi Nicholas Maeda has been partially supported by Grant-in-Aid for Scientific Research (C) from Japan Society for the Promotion of Science (JSPS) #20K19872. Shohei Shimizu has been partially supported by ONRG NICOP N62909-17-1-2034 and Grant-in-Aid for Scientific Research (C) from Japan Society for the Promotion of Science (JSPS) #16K00045 and #20K11708.

References

[10] H. Zou, The adaptive lasso and its oracle properties, Journal of the Amer-

[11] T. N. Maeda, S. Shimizu, RCD: Repetitive causal discovery of linear non-

[12] P. O. Hoyer, S. Shimizu, A. J. Kerminen, M. Palviainen, Estimation of

[13] J. Pearl, Comment: Graphical models, causality and intervention, Statis-

[14] J. Pearl, Causality: models, reasoning and inference, Cambridge University

[15] G. Darmois, Analyse générale des liaisons stochastiques: etude particulière

[16] V. P. Skitovitch, On a property of the normal distribution, Doklady

[17] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, A. J. Smola, A

[18] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (com-

[19] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale

[20] H. Zhang, S. Zhou, K. Zhang, J. Guan, Causal discovery using regression-

[21] D. Colombo, M. H. Maathuis, M. Kalisch, T. S. Richardson, Learning

[22] J. M. Ogarrio, P. Spirtes, J. Ramsey, A hybrid causal search algorithm for

[23] O. D. Duncan, D. L. Featherman, B. Duncan, Socioeconomic background