A causation coefficient and taxonomy of correlation/causation relationships
2017·arXiv
Abstract

This paper introduces a causation coefficient which is defined in terms of probabilistic causal models. This coefficient is suggested as the natural causal analogue of the Pearson correlation coefficient and permits comparing causation and correlation to each other in a simple, yet rigorous manner. Together, these coefficients provide a natural way to classify the possible correlation/causation relationships that can occur in practice and examples of each relationship are provided. In addition, the typical relationship between correlation and causation is analyzed to provide insight into why correlation and causation are often conflated. Finally, example calculations of the causation coefficient are shown on a real data set.

The maxim, “Correlation is not causation”, is an important warning to analysts, but provides very little information about what causation is and how it relates to correlation. This has prompted other attempts at summarizing the relationship. For example, Tufte [1] suggests either, “Observed covariation is necessary but not sufficient for causality”, which is demonstrably false or, “Correlation is not causation but it is a hint”, which is correct, but still underspecified. In what sense is correlation a ‘hint’ to causation?

Correlation is well understood and precisely defined. Generally speaking, correlation is any statistical relationship involving dependence, i.e. the random variables are not independent. More specifically, correlation can refer to a descriptive statistic that summarizes the nature of the dependence. Such statistics do not provide all of the information available in the joint probability distribution, but can provide a valuable summary that is easier to reason about. Among the most popular, and often referred to as just “the correlation coefficient” [2], is the Pearson correlation coefficient, which is a measure of the linear correlation between variables.

Causality is an intuitive idea that is difficult to make precise. The key contribution of this paper is the introduction of a “causation coefficient”, which is suggested as the natural causal analogue of the Pearson correlation coefficient. The causation coefficient permits comparing correlation and causation to each other in a manner that is both rigorous and consistent with common intuition.

The rest of this paper is outlined as follows: The statistical/causal distinction is discussed to provide background. The existing probabilistic causal model approach to causality is briefly summarized. The causation coefficient is defined in terms of probabilistic causal models and some of the properties of the coefficient are discussed to support the claim that it is the natural causal analogue of the Pearson correlation coefficient.

The definition of the causation coefficient permits the following new analyses to be conducted: A taxonomy of the possible relationships between correlation and causation is introduced, with example models. The typical relationship between correlation and causation is analyzed to provide insight into why correlation and causation are often conflated. Finally, example calculations of the causation coefficient are shown on a real data set.

Causality is difficult to formalize. Causality is implicit in the structure of ordinary language [3] and the words ‘causality’ and ‘causal’ are often used to refer to a number of disparate concepts. In particular, much confusion stems from conflating three distinct tasks in causal inference [4]:

1. Definitions of counterfactuals
2. Identification of causal models from population distributions
3. Selection of causal models given real data

Counterfactuals, as defined in philosophy, are hypothetical or potential outcomes – statements about possible alternatives to the actual situation [5]. A classic example is, “If Nixon had pressed the button, there would have been a nuclear holocaust” [6], a statement which seems intuitively correct, but difficult to formally model and impossible to empirically verify. Defining causation in terms of counterfactuals originates with Hume in defining a cause to be, “where, if the first object had not been, the second never had existed” [7]. Indeed, in a world where there had been a nuclear war during the Nixon administration, it would be quite reasonable to claim that launching nuclear missiles was a cause. A key difficulty in making this notion of causality precise is that it requires precise models of counterfactuals and therefore precise assumptions that can be unobservable and untestable even in principle.

For example, consider the possible results of treating a patient in a clinical setting. In the notation of the Rubin causal model [8], a particular patient or unit, u, can be potentially exposed to either treatment, t, or control, c. The treatment effect, Y_t(u) − Y_c(u), is the difference between the outcomes when the patient is exposed to treatment and when the same patient is exposed to the control. Determining the treatment effect on a unit is usually the ultimate goal of causal inference. It is also impossible to observe – the same patient cannot be treated and not treated – a problem which Holland names the Fundamental Problem of Causal Inference.

This is not to suggest that causal inference is impossible, merely that additional assumptions must be made if causal conclusions are to be reached. A well-known assumption that makes causal inference possible is randomization. Assuming that units are randomly assigned to treatment or control groups, it is possible to estimate the average treatment effect, E[Y_t − Y_c]. Note that randomization is an assumption external to the data; it is not possible to determine, from the data alone, that it was obtained from a randomized controlled trial. Another example of an assumption that permits causal inference is unit homogeneity, which can be thought of as “laboratory conditions”. If different units are carefully prepared, it may be reasonable to assume that they are equivalent in all relevant aspects, i.e. Y_t(u_1) = Y_t(u_2) and Y_c(u_1) = Y_c(u_2). For example, it is often assumed that any two samples of a given chemical element are effectively identical. In these cases, treatment effect can be calculated directly as Y_t(u_1) − Y_c(u_2).

A closely related concept is ceteris paribus, roughly, “other things held constant”, which is a mainstay of economic analysis [4]. For example, increasing the price of some good will cause demand to fall, assuming that no other relevant factors change at the same time. This is not to suggest that no other factors will change in a real economy; ceteris paribus simply isolates the effect of one particular change to make it more amenable to analysis.

In practice, the first causal inference task, defining counterfactuals, requires having a scientific theory. For example, classical mechanics describes the possible states of idealized physical systems and can provide an account of manipulation. The theory can predict what would happen to the trajectory of some object if an external force were to be applied, whether or not such a force was actually applied in the real world. Scientific theories are usually parameterized; one example of a parameter is standard gravity, g_n ≈ 9.8 m/s², the acceleration of an object due to gravity near the surface of the earth [9].

The second causal inference task, identification from population distributions, is a problem of uniquely determining a causal model or some property of a causal model from hypothetical population data. In other words, the problem is to find unique mappings from population distributions or other population measures to causal parameters. This can be thought of as the problem of determining which scientific theory is correct, given data without sampling error. A well-designed experiment to determine g_n will, in the limit of infinite samples, yield the exact value for the parameter.

The third task, selection of causal models given real data, is the problem of inference in practice. Any real experiment can only provide an analyst with a finite-sample distribution subject to random error. This problem lies in the domain of estimation theory and hypothesis testing.

In addition to the standard population/sample distinction, this paper follows Pearl’s conventions in referring to the statistical/causal distinction [10]. A statistical concept is a concept that is definable in terms of a joint probability distribution of observed variables. Variance is an example of a statistical parameter; the statement that f is multivariate normal is an example of a statistical assumption. A causal concept is a nonstatistical concept that is definable in terms of a causal model. Randomization is an example of a causal, not statistical, assumption because it is impossible to determine from a joint probability distribution that a variable was randomly assigned.

This distinction draws a sharp line between statistical and causal analysis, which can be thought of as the difference between analyzing uncertain, yet static conditions versus changing conditions [11]. Estimating and updating the likelihood of events based on observed evidence are statistical tasks, given that experimental conditions remain the same. Causal analysis aims to infer the likelihood of events under changing conditions, brought about by external interventions such as treatments or policy changes. Much like how statistical inference is performed with respect to assumptions formalized in a statistical model, rigorous causal inference requires formal causal models.

Probabilistic causal models are an approach to causality characterized by nonparametric models associated with a type of directed acyclic graph (DAG) called a causal diagram. The concept of using graphs to model probabilistic and causal relationships originates with Wright’s path analysis [12]. The modern, nonparametric version appears to have been first proposed by Pearl and Verma [13] and has been the subject of considerable research since then [14] [15] [16]. This section is meant to provide a high-level summary of probabilistic causal models, sufficient to explain the proposed causation coefficient.

Causal models

The philosophy of probabilistic causal models is that of Laplacian quasideterminism – a complete description of the state of a system is sufficient to exactly determine how the system will evolve [17]. In this view, randomness is a statement of an analyst’s ignorance, not inherent to the system itself.

A causal model, M, consists of a set of equations where each child-parent family is represented by a deterministic function:

$$X_i = f_i(pa_i, \epsilon_i), \quad i = 1, \ldots, n$$

The X_i variables are the endogenous variables, determined by factors in the model. pa_i denotes the parents of X_i, which can be thought of as the direct, known causes of X_i. The ϵ_i variables are the exogenous variables, which can be considered ‘background’ variables or ‘error terms’ and correspond to variables that are determined by factors outside of the model [10]. Since the exogenous variables model those factors that cannot be directly accounted for, they are treated as random variables. Regardless of the distribution of the exogenous variables, or the functional form of the f_i equations, a probability distribution P(ϵ) over the exogenous variables induces a probability distribution P(x_1, . . . , x_n) over the endogenous variables [10]. The resulting model is called a probabilistic causal model.
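The way a distribution over the exogenous variables induces one over the endogenous variables can be sketched by brute-force enumeration. The three structural equations below are hypothetical and chosen only for illustration; they are not a model from this paper, and `model` and `induced_distribution` are illustrative names.

```python
from fractions import Fraction
from itertools import product

def model(eps_x, eps_y, eps_z):
    """Hypothetical structural equations: each variable is a deterministic
    function of its parents and its own exogenous term."""
    z = eps_z
    x = z & eps_x
    y = x | eps_y
    return x, y, z

def induced_distribution(model, p_eps=Fraction(1, 2)):
    """Push independent Bernoulli(p_eps) exogenous variables through the model
    and collect the resulting probability of each endogenous outcome."""
    dist = {}
    for eps in product((0, 1), repeat=3):
        weight = Fraction(1)
        for e in eps:
            weight *= p_eps if e == 1 else 1 - p_eps
        outcome = model(*eps)
        dist[outcome] = dist.get(outcome, Fraction(0)) + weight
    return dist

joint = induced_distribution(model)  # maps (x, y, z) to P(x, y, z)
```

Exact fractions are used so that the induced probabilities can be compared without rounding error.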

Each causal model induces a causal diagram, G, where each X_i corresponds to a vertex and each parent-child relationship between pa_i and X_i corresponds to a directed edge from parent to child. In this paper, it is assumed that all models are recursive, i.e. all models induce an acyclic causal diagram.

This paper adopts the convention that all of the exogenous variables are mutually independent. If all of the endogenous variables are observable – denoted in the causal diagram by solid nodes – then the model is called Markovian. The joint probability function, P(x_1, . . . , x_n), in a Markovian model is said to be Markov compatible with G in that P(x_1, . . . , x_n) respects the Markov condition: each variable is independent of all its non-descendants given its parents in the graph [18]. Dependence between two observable variables that have no observable ancestor can be introduced by adding a latent endogenous variable, denoted in a causal diagram by an open node. Such a model is called semi-Markovian.

Each causal diagram can be thought of as denoting a set of causal models. Most of this paper considers the following set of models with endogenous variables X, Y and Z:

image

image

Figure 1: Causal diagrams for Markovian and semi-Markovian models

Causal effect

Without any additional context, this characterization of probabilistic causal models appears merely to be a way to generate Bayesian networks. However, the functional, quasi-deterministic approach also specifies how the probability distribution of the observable variables changes in response to an external intervention. The simplest external intervention is where a single variable X_i is forced to take some fixed value x_i, ‘setting’ or ‘holding constant’ X_i = x_i. Such an atomic intervention corresponds to replacing the equation X_i = f_i(pa_i, ϵ_i) with the constant x_i, generating a new model. This can be extended to sets of variables.

Causal effect [14]. Given two disjoint sets of variables, X and Y, the causal effect of X on Y, denoted either as P(y | ˆx) or P(y | do(x)), is a function from X to the space of probability distributions on Y. For each realization x of X, P(y | ˆx) gives the probability of Y = y induced by deleting from the model all equations corresponding to variables in X and substituting X = x in the remaining equations.

Crucially, causal effect,  P(y | ˆx), is fundamentally different than conditioning or observation, P(y | x). The latter is a function of the joint probability distribution of the original model, M. The former is a function of the distribution of the submodel,  Mx, that results from the effect of action do(X = x) on M. Intuitively, this can be thought of as ‘cutting’ all of the incoming edges to X and replacing the random variable with the constant x.
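The difference between conditioning and intervening can be made concrete with a very small example. The model below (Z a fair coin, X = Z, Y = X OR Z) is a hypothetical one chosen for illustration; the function names are likewise illustrative.

```python
from fractions import Fraction

def model(eps_z, do_x=None):
    """Hypothetical model: Z = eps_z, X = Z, Y = X OR Z.
    Passing do_x replaces the equation for X with a constant (the submodel)."""
    z = eps_z
    x = z if do_x is None else do_x
    y = x | z
    return x, y

def p_y1_given_x(x):
    """Conditioning: keep only the worlds of the original model where X = x.
    eps_z is a fair coin, so the two worlds are equally likely."""
    worlds = [model(eps_z) for eps_z in (0, 1)]
    matching = [y for (observed_x, y) in worlds if observed_x == x]
    return Fraction(sum(matching), len(matching))

def p_y1_do_x(x):
    """Intervention: evaluate the submodel over all worlds."""
    return Fraction(sum(model(eps_z, do_x=x)[1] for eps_z in (0, 1)), 2)
```

In this sketch, observing X = 0 guarantees Y = 0, because it reveals Z = 0. Forcing X = 0 leaves Z untouched, so Y = 1 still occurs half the time.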

image

Figure 2: An intervention do(X = x) on causal model M produces submodel M_x

It is possible for DAGs to be observationally equivalent, i.e. Markov compatible with the same set of joint probability distributions [13]. Two observationally equivalent DAGs cannot be distinguished without performing interventions or drawing on additional causal information. For example [10], in a causal diagram modeling relationships between the season, rain, sprinkler settings and whether the ground is wet, it would be reasonable to accept a model where the season causally affects the sprinkler settings, but not vice versa. While indistinguishable from observation alone, the two models imply different results under intervention; it would be implausible that changing the settings on a sprinkler would cause the season of the year to change as a result.

image

Figure 3: Observationally equivalent DAGs

Identification of causal effect

The problem of whether a causal query can be uniquely answered is referred to as causal identifiability. An unbiased estimate of P(y | ˆx_i) can always be calculated from observational (preintervention) probabilities in Markovian models by conditioning on the parents of X_i and averaging the result, weighted by the probabilities of PA_i = pa_i. This operation is called “adjusting for PA_i” or “adjustment for direct causes” [10]. More formally, the observational probability distribution P and causal diagram G of a Markovian model identify the effect of the intervention do(X_i = x_i) on Y, which is given by:

$$P(y \mid \hat{x}_i) = \sum_{pa_i} P(y \mid x_i, pa_i)\, P(pa_i)$$
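Adjustment for direct causes can be checked numerically on a small Markovian model. The structural equations and exogenous probabilities below are hypothetical, chosen so that every combination of X and its parent Z occurs with positive probability. The adjusted quantity uses observational probabilities only and is compared against the effect obtained by actually replacing the equation for X.

```python
from fractions import Fraction
from itertools import product

# Hypothetical P(eps = 1) for each exogenous variable.
P_EPS = {"z": Fraction(1, 2), "x": Fraction(1, 4), "y": Fraction(1, 2)}
INDEX = {"x": 0, "y": 1, "z": 2}

def model(eps_z, eps_x, eps_y, do_x=None):
    """Hypothetical equations: Z is the only parent of X; Y depends on X and Z."""
    z = eps_z
    x = (z ^ eps_x) if do_x is None else do_x
    y = (x | z) & eps_y
    return x, y, z

def distribution(do_x=None):
    """Joint distribution over (x, y, z), observational or under do(X = do_x)."""
    dist = {}
    for eps_z, eps_x, eps_y in product((0, 1), repeat=3):
        w = Fraction(1)
        for name, e in (("z", eps_z), ("x", eps_x), ("y", eps_y)):
            w *= P_EPS[name] if e else 1 - P_EPS[name]
        key = model(eps_z, eps_x, eps_y, do_x)
        dist[key] = dist.get(key, Fraction(0)) + w
    return dist

def prob(dist, **fixed):
    """Probability that the named coordinates take the given values."""
    return sum(p for k, p in dist.items()
               if all(k[INDEX[n]] == v for n, v in fixed.items()))

def adjusted(obs, x, y=1):
    """Sum over z of P(y | x, z) P(z), computed from observational data only."""
    return sum(prob(obs, x=x, y=y, z=z) / prob(obs, x=x, z=z) * prob(obs, z=z)
               for z in (0, 1))

obs = distribution()
naive = prob(obs, x=0, y=1) / prob(obs, x=0)   # plain conditioning, P(Y=1 | X=0)
truth = prob(distribution(do_x=0), y=1)        # P(Y=1 | do(X=0)) from the submodel
```

In this sketch the adjusted value agrees with the interventional one, while plain conditioning does not.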

Semi-Markovian models do not always permit identification. A simple example is when a single latent variable is a parent of every observable variable. Informally, it is not possible to determine if observed covariation is indicative of a causal effect between two variables, or whether their common, unobservable parent brings about the correlation. For example [15], consider two models,  M1 and M2 where both models have observable variables X, Y , latent  Z ∼ Bernoulli(0.5), and  fX(z) = z. In  M1, fY (z, x) = z XOR x; in  M2, fY (z, x) = 0. These models are compatible with the same causal diagrams and have identical observational probability distributions, but different causal effects,  P(y | ˆx). Since the causal effect cannot be uniquely calculated from the available information, it is not identifiable. However, many semi-Markovian models still permit estimation of certain causal effects. Complete methods are described in [15].
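The two-model example can be verified by enumeration. The sketch below encodes M1 and M2 exactly as described, with latent Z a fair coin and X = Z; only the function names are introduced here for illustration.

```python
from fractions import Fraction

def f_y_m1(z, x):
    return z ^ x   # M1: Y = Z XOR X

def f_y_m2(z, x):
    return 0       # M2: Y = 0

def observational(f_y):
    """Distribution of the observable pair (X, Y); Z is latent and X = Z."""
    dist = {}
    for z in (0, 1):
        x = z
        key = (x, f_y(z, x))
        dist[key] = dist.get(key, Fraction(0)) + Fraction(1, 2)
    return dist

def causal_effect(f_y, x):
    """P(Y = 1 | do(X = x)): X is set by hand while Z stays a fair coin."""
    return sum((Fraction(1, 2) for z in (0, 1) if f_y(z, x) == 1), Fraction(0))
```

Both models produce Y = 0 in every observed world, yet forcing X changes Y in M1 and never in M2.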

The Pearson product-moment correlation coefficient, ρ, is a standard measure of correlation between random variables. This is commonly described as a measure of how well the relationship between X and Y can be modeled by a linear relationship, with ρ = −1 or ρ = +1 being a perfect negative or positive linear relationship, respectively, and 0 representing no linear relationship at all. The population correlation coefficient is defined as a normalized covariance [2]:

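$$\rho_{X,Y} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$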

For discrete random variables, this is a function of the joint probability mass function (for continuous random variables that admit a probability density function, the summations are replaced with integrals):

image

The causation coefficient relies on the observation that the correlation coefficient can, by the law of total probability, be rewritten as a function of the conditional distribution, P(y | x), and marginal distribution, P(x), instead of in terms of the joint density:

image

Syntactically, the causation coefficient, γ_{X→Y}, is defined by replacing P(y | x) with P(y | ˆx) and P(x) with ˆP(x). As a convenience, the following terms are also defined:

$$\mathrm{Var}[\hat{X}] = \sum_x x^2 \hat{P}(x) - \Big(\sum_x x\, \hat{P}(x)\Big)^2$$

$$\mathrm{Var}[Y_{\hat{X}}] = \sum_x \sum_y y^2 P(y \mid \hat{x})\, \hat{P}(x) - \Big(\sum_x \sum_y y\, P(y \mid \hat{x})\, \hat{P}(x)\Big)^2$$

The full definition of γ_{X→Y} is then:

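$$\gamma_{X \to Y} = \frac{\sum_x \sum_y x\, y\, P(y \mid \hat{x})\, \hat{P}(x) - \Big(\sum_x x\, \hat{P}(x)\Big)\Big(\sum_x \sum_y y\, P(y \mid \hat{x})\, \hat{P}(x)\Big)}{\sqrt{\mathrm{Var}[\hat{X}]\, \mathrm{Var}[Y_{\hat{X}}]}}$$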

Where  P(y | ˆx) is the causal effect of do(X = x) on Y and ˆP(x) is the distribution of interventions.
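For discrete variables, the syntactic definition translates directly into a short computation: a Pearson-style coefficient evaluated on the product P(y | ˆx) ˆP(x). The following is a minimal sketch under that reading; the function name, the dictionary layout and the example numbers are illustrative assumptions, not values from the paper.

```python
from math import sqrt

def causation_coefficient(effect, p_hat):
    """Causation coefficient for discrete X and Y.

    effect[x][y] is the causal effect P(y | do(x));
    p_hat[x] is the distribution of interventions.
    """
    pairs = [(x, y, effect[x][y] * p_hat[x]) for x in p_hat for y in effect[x]]
    e_x = sum(x * w for x, _, w in pairs)
    e_y = sum(y * w for _, y, w in pairs)
    e_xy = sum(x * y * w for x, y, w in pairs)
    var_x = sum(x * x * w for x, _, w in pairs) - e_x ** 2
    var_y = sum(y * y * w for _, y, w in pairs) - e_y ** 2
    return (e_xy - e_x * e_y) / sqrt(var_x * var_y)

# Hypothetical binary example: treatment raises P(Y = 1) from 0.25 to 0.75.
effect = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}
p_hat = {0: 0.5, 1: 0.5}
gamma = causation_coefficient(effect, p_hat)
```

The coefficient is undefined when either variance is zero, for example when ˆP puts all of its weight on a single intervention; the sketch does not guard against that case.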

Distribution of interventions

In the discrete case, the distribution of interventions can be thought of as a set of weights for averaging the possible causal effects. It also has an interpretation in the context of observational and experimental studies. As an example, consider a scenario where patients decide for themselves whether or not to take a drug (X), and observe whether or not they recover (Y ). The population joint probability distribution, P(x, y), provides all of the information available from an idealized version of this observational study. For intuition, it may be helpful to imagine P(x, y) as being calculated from millions of samples to the point where random sampling error has ceased to be a relevant consideration.

The simplest way to model this is with Bernoulli (binary) random variables for X and Y, with 0 representing no treatment or failure to recover and 1 representing treatment or recovery. The probability of patients deciding for themselves whether or not to take the drug, in this observational study, is the marginal probability P(x). In clinical terms, P(X = 0) and P(X = 1) are the relative sizes of the cohorts.

However, even in an idealized observational study, P(y | x) would not provide definitive information on whether treatment actually improves patient outcomes. Hypothetically, the drug could cause unpleasant side effects in the patients that would have received the greatest benefit, leading those patients to choose not to take the drug. An idealized randomized controlled trial would permit an analyst to directly measure  P(y | ˆx), as randomization explicitly cuts out confounding. However, randomized controlled trials are often impractical (e.g. too expensive or unethical) to run in practice.

The relative sizes of the cohorts in an observational study may be different than the relative sizes of the treatment and control groups in a corresponding randomized controlled trial – this is the use of the distribution of interventions ˆP. Experiments are often designed to have equal group sizes as this typically provides maximum statistical power, but this is by no means universal. Also, it is not uncommon for patients to drop out or otherwise be disqualified from studies, so the cohorts will often be unequal in practice.

The natural causation coefficient, denoted γ_{X→Y} or γ, is defined for ˆP(x) equal to the pre-intervention marginal distribution, P(x). This corresponds to an experimental trial where the treatment groups are scaled to be proportional to the relative sizes seen in the observational study.

The maximum entropy causation coefficient, denoted γ_{H,X→Y} or γ_H, is the causation coefficient where ˆP(x) is a maximum entropy probability distribution. For random variables with bounded support, this is the uniform distribution and corresponds to equal treatment group sizes.

Other distributions of interventions are possible, to reweigh the effects of certain interventions relative to others in the computation of the causation coefficient. These should be denoted explicitly as  γ ˆP. For example, a certain drug may be known to be helpful in certain small doses, but worse than no treatment at all in larger doses, in which case both the natural and maximum entropy coefficients could be misleading. In such cases, a distribution of interventions corresponding to current best practices may be more informative.

Independence and invariance

The definition of independence of random variables X and Y is: ∀x, y P(x, y) = P(x)P(y) or, equivalently: ∀x, y P(y | x) = P(y). In other words, observing X provides no information about Y (and vice versa). The causal equivalent is invariance of Y to X: ∀x, y P(y | ˆx) = P(y); that is to say, no possible intervention on X can affect Y [18]. Unlike independence, invariance is not symmetric. The term mutually invariant is suggested to refer to when both Y is invariant to X and X is invariant to Y.

For Bernoulli random variables, X and Y are uncorrelated (ρ = 0) if and only if they are independent. The analogous condition holds for the causation coefficient. For Bernoulli distributed X and Y, γ_{X→Y} = 0 if and only if Y is invariant to X (see appendix for proof). However, both the correlation and causation coefficients have difficulty capturing nonlinear relationships between variables. In general, independence implies ρ = 0 and invariance implies γ = 0, but the converse does not hold for many distributions.

Table 1: Non-invariant interventional distributions where γH = 0

image

As a simple example, Table 1 contains interventional distributions where Y is not invariant to X, but the maximum entropy causation coefficient γ_H = 0. The natural causation coefficient may be positive, negative or zero depending on the observational (pre-intervention) distribution P(x).
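A comparable situation can be sketched with a three-valued X and binary Y. The interventional probabilities below are hypothetical and are not the entries of Table 1. Only the numerator of the coefficient, the covariance under P(y | ˆx) ˆP(x), is computed, since it alone determines whether the coefficient is zero and what sign it takes.

```python
from fractions import Fraction

def interventional_covariance(p_y1_do, p_hat):
    """Covariance of X and binary Y under P(y | do(x)) * P_hat(x).

    p_y1_do[x] is P(Y = 1 | do(X = x)); p_hat[x] is the distribution
    of interventions.
    """
    e_x = sum(x * p_hat[x] for x in p_hat)
    e_y = sum(p_y1_do[x] * p_hat[x] for x in p_hat)
    e_xy = sum(x * p_y1_do[x] * p_hat[x] for x in p_hat)
    return e_xy - e_x * e_y

# Hypothetical effect: the middle dose helps, the low and high doses do not.
p_y1_do = {0: Fraction(1, 5), 1: Fraction(4, 5), 2: Fraction(1, 5)}

uniform = {x: Fraction(1, 3) for x in p_y1_do}                      # maximum entropy
skew_low = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}
skew_high = {0: Fraction(1, 5), 1: Fraction(3, 10), 2: Fraction(1, 2)}
```

Here Y is clearly not invariant to X, yet the uniform weighting gives a covariance of exactly zero. The two skewed weightings, standing in for possible observational marginals P(x), give covariances of opposite sign.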

Average treatment effect

Average treatment effect is defined as [19]:

$$\mathrm{ATE}(X \to Y) = E[Y \mid do(X = 1)] - E[Y \mid do(X = 0)]$$

This is the probabilistic causal model equivalent of the Rubin causal model definition of average treatment effect. Positive ATE implies that treatment is, on average, superior to non-treatment, while negative ATE implies the opposite. For Bernoulli distributed X and Y, γ_{X→Y} reduces to (see appendix for proof):

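$$\gamma_{X \to Y} = \mathrm{ATE}(X \to Y)\, \sqrt{\frac{\mathrm{Var}[\hat{X}]}{\mathrm{Var}[Y_{\hat{X}}]}}$$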

Since variance is strictly positive for nondegenerate Bernoulli distributions, this implies that γ has the same sign as the average treatment effect.
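The sign relationship can be checked numerically for binary X and Y. The sketch below computes the coefficient twice: once from the covariance form of the definition, and once as the ATE scaled by a ratio of standard deviations. The scaled form is derived here from the definition for the Bernoulli case and is an assumption about the shape of the reduction, not a quotation of it. The function name and inputs are illustrative.

```python
from math import sqrt

def bernoulli_summary(p_y_do0, p_y_do1, p_hat1):
    """Return (ATE, gamma from the definition, gamma from the ATE form).

    p_y_do0, p_y_do1 are P(Y = 1 | do(X = 0)) and P(Y = 1 | do(X = 1));
    p_hat1 is the weight the distribution of interventions puts on X = 1.
    """
    ate = p_y_do1 - p_y_do0
    e_x = p_hat1
    e_y = p_y_do1 * p_hat1 + p_y_do0 * (1 - p_hat1)
    e_xy = p_y_do1 * p_hat1              # X * Y is 1 only when both are 1
    var_x = e_x * (1 - e_x)              # Bernoulli variance
    var_y = e_y * (1 - e_y)
    gamma_def = (e_xy - e_x * e_y) / sqrt(var_x * var_y)
    gamma_ate = ate * sqrt(var_x / var_y)
    return ate, gamma_def, gamma_ate

helpful = bernoulli_summary(0.3, 0.6, 0.5)   # treatment raises recovery
harmful = bernoulli_summary(0.6, 0.3, 0.2)   # treatment lowers recovery
```

The two computations should agree up to floating point error, and the coefficient takes the sign of the ATE in both examples.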

For Bernoulli X and Y, ρ and γ provide a natural way to classify the possible correlation/causation relationships. ρ and γ can each be positive, negative or zero, implying 9 possible relationships. These are grouped into 5 classifications in Table 2.

Table 2: Correlation/causation relationships

image

In Table 2, “0” is a zero value for the coefficient, and “+/-” refers to the coefficient taking on a positive or a negative value (e.g. inverse causation refers to either a model with positive ρ and negative γ, or negative ρ and positive γ). Note that ρ and γ are population coefficients; this taxonomy can be thought of as categorizing the possible relationships between correlation and causation, in the limit of infinite samples.

Many of the relationships described in the following sections are well known and existing terminology is used where appropriate. Examples of each relationship are given, as well as simple probabilistic causal models of three Bernoulli distributed variables that produce the described relationship. Notably absent is the notion of mutual causation, which is beyond the scope of this paper. Note that while ρ is symmetric, i.e. ρ_{X,Y} = ρ_{Y,X}, at least one of γ_{X→Y}, γ_{Y→X} is zero in all recursive probabilistic causal models (see appendix for proof).

Independent and invariant

Two variables that are independent and mutually invariant are completely unrelated – neither observing nor manipulating one can provide information about or change the other. This is usually the default assumption when studying a system – in hypothesis testing, the null hypothesis is usually “no effect”. For a somewhat absurd example, researchers would not believe that the average gas mileage of a Prius is related in any way to the minimum width of the English Channel [20] by default – some sort of evidence would be expected before taking such a suggestion seriously. The notion of light cones provides an example familiar to physicists – the principle of locality and the theory of special relativity imply that nothing outside of someone’s past and future light cones can ever affect them.

Independent and invariant variables can be trivially mathematically modeled. An example is provided here to introduce the conventions used throughout the rest of this section. Let ϵ_X, ϵ_Y, ϵ_Z be fair coins, i.e. independent Bernoulli distributed random variables with p = 0.5. These are the exogenous variables of the probabilistic causal model. X will generally model a cause or treatment, Y, an effect or response, and Z, a confounding variable that causally affects X and Y. An example model with independent and invariant X and Y is simply:

image

Table 3: Observational distribution of independent and invariant model

image

Table 4: Interventional distributions of independent and invariant model

image

X and Y are clearly independent and invariant, and the correlation and causation coefficients are both 0.

Common causation

Reichenbach appears to be the first to propose the “Principle of the Common Cause” claiming, “If an improbable coincidence has occurred, there must exist a common cause” [21]. Elaborating on this, he suggests that correlation between events A and B indicates either that A causes B, B causes A or A and B have a common cause. This philosophical claim naturally suggests the following definition:

Common causation. X and Y are said to experience common causation when X and Y are mutually invariant but not independent.

This effect is sometimes referred to as a “spurious relationship” or “spurious correlation” – a term originally coined by Pearson [22]. This risks conflating several distinct concepts: the interventional distributions from which γ is calculated, the population observational distribution from which ρ is calculated, and the finite-sample observational distribution from which the sample correlation coefficient, r, is calculated. Consider the following scenarios:

• A very large number of samples are taken from invariant X and Y , but due to a latent confounding variable, X and Y are correlated.

• A small number of samples are taken from independent and invariant X and Y , but due to random sampling errors, the sample correlation coefficient suggests that X and Y are correlated.

In both scenarios, there is a spurious relationship between X and Y . The first scenario exhibits common causation. The second scenario is due to random sampling error and, as the number of samples increases, the observed correlation will tend to zero. The term “coincidental correlation” is suggested to distinguish this finite-sample effect from common causation.

An example of a common cause can be found in a study on myopia and ambient lighting at night [23]. Development of myopia (shortsightedness) is correlated with nighttime light exposure in children, although the latter does not cause the former. The common cause is that myopic parents are likely to have myopic children, and also more likely to set up night lights.

The following is a simple common causation model: Let ϵ_X, ϵ_Y, ϵ_Z be fair coins and X, Y and Z be defined by the following three equations:

image

Table 5: Observational distribution of common cause model

image

Table 6: Interventional distributions of common cause model

image

From the observational distribution, it is clear that X and Y are correlated (ρ = 1/3) and from the interventional distributions, that X and Y are invariant (γ = 0).
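The values ρ = 1/3 and γ = 0 can be reproduced by enumeration. The structural equations used below (Z = ϵ_Z, X = Z AND ϵ_X, Y = Z AND ϵ_Y) are an assumption: they form one common-cause model over three fair coins that yields these coefficients, and may differ from the equations intended here. The natural causation coefficient is computed as the Pearson coefficient of P(y | ˆx) P(x).

```python
from fractions import Fraction
from itertools import product
from math import sqrt

def model(eps_x, eps_y, eps_z, do_x=None):
    """Assumed common-cause model: Z drives X and Y; X never enters Y."""
    z = eps_z
    x = (z & eps_x) if do_x is None else do_x
    y = z & eps_y
    return x, y

def joint(do_weights=None):
    """Observational joint of (X, Y) if do_weights is None; otherwise the
    mixture of P(y | do(x)) weighted by do_weights[x]."""
    settings = [(None, Fraction(1))] if do_weights is None else list(do_weights.items())
    out = {}
    for eps in product((0, 1), repeat=3):          # three fair coins
        for do_x, w in settings:
            xy = model(*eps, do_x=do_x)
            out[xy] = out.get(xy, Fraction(0)) + w * Fraction(1, 8)
    return out

def pearson(dist):
    """Pearson coefficient of a pmf given as {(x, y): probability}."""
    e_x = sum(x * p for (x, _), p in dist.items())
    e_y = sum(y * p for (_, y), p in dist.items())
    e_xy = sum(x * y * p for (x, y), p in dist.items())
    var_x = sum(x * x * p for (x, _), p in dist.items()) - e_x ** 2
    var_y = sum(y * y * p for (_, y), p in dist.items()) - e_y ** 2
    return float(e_xy - e_x * e_y) / sqrt(var_x * var_y)

observed = joint()
p_x = {x: sum(p for (xx, _), p in observed.items() if xx == x) for x in (0, 1)}
rho = pearson(observed)
gamma = pearson(joint(do_weights=p_x))   # natural coefficient: P_hat(x) = P(x)
```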

Inverse causation

A classic veridical paradox is the relationship between tuberculosis and dry climate [24]. At one point, Arizona, with one of the driest climates in the United States was found to also have the largest share of tuberculosis deaths. This is because tuberculosis patients greatly benefit from a dry climate, and many moved there. The following is proposed as a definition for this type of scenario:

Inverse causation. X and Y are said to experience inverse causation when the correlation coefficient ρ and causation coefficient γ have opposite signs.

Inverse causation is of special importance when considering clinical treatment; a case of inverse causation is a case where the correct treatment option is the opposite of what a naive interpretation of correlation would suggest.

The following is a simple model that exhibits inverse causation: Let ϵ_Z be a fair coin, and ϵ_Y be Bernoulli distributed with p = 3/4. The following is an inverse causation model with ρ = −1/2 and γ = 1/4:

image

Table 7: Observational distribution of inverse causation model

image

Table 8: Interventional distributions of inverse causation model

image
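The opposite signs can be illustrated by comparing an observational contrast with an interventional one. The structural equations below are an assumption: they are one model over a fair coin ϵ_Z and a Bernoulli(3/4) ϵ_Y that gives the stated coefficients, and may differ from the equations intended here.

```python
from fractions import Fraction
from itertools import product

P_EPS_Y = Fraction(3, 4)   # P(eps_y = 1); eps_z is a fair coin

def model(eps_z, eps_y, do_x=None):
    """Assumed equations: X copies Z; Y follows X when X disagrees with Z,
    and otherwise equals eps_y XOR Z."""
    z = eps_z
    x = z if do_x is None else do_x
    y = x if x != z else eps_y ^ z
    return x, y

def p_y1(x, intervene):
    """P(Y = 1 | do(X = x)) if intervene is True, else P(Y = 1 | X = x)."""
    num = den = Fraction(0)
    for eps_z, eps_y in product((0, 1), repeat=2):
        w = Fraction(1, 2) * (P_EPS_Y if eps_y else 1 - P_EPS_Y)
        realized_x, y = model(eps_z, eps_y, do_x=x if intervene else None)
        if realized_x == x:
            den += w
            num += w * y
    return num / den

risk_difference = p_y1(1, False) - p_y1(0, False)   # observational contrast
ate = p_y1(1, True) - p_y1(0, True)                 # interventional contrast
```

In this sketch X and Y each have mean 1/2 both observationally and under the natural distribution of interventions, so all four variances equal 1/4 and the two contrasts coincide with ρ and γ respectively.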

“Inverse causation” is suggested to avoid confusion with other terminology. “Anti-causation” is inappropriate, as “anti-causal filters” in digital signal processing are filters whose outputs depend on future inputs. “Reverse causation” is also inappropriate, as this refers to mistakenly believing that Y has a causal effect on X when, in fact, X causes Y.

Unfaithfulness

The Markov condition entails a set of conditional independence relations between variables corresponding to nodes in a DAG. The faithfulness condition [spirtes2000] (also referred to as stability [10]) is the converse.

Faithfulness condition. A distribution P is faithful to a DAG G if no conditional independence relations other than the ones entailed by the Markov condition are present.

This is a global condition, applying to a joint probability distribution and a DAG. The following local condition is defined in terms of two random variables X and Y in a causal model:

Unfaithful. X and Y are said to be unfaithful if they are independent but not mutually invariant.

This local condition can only occur if the global faithfulness condition is violated (see appendix for proof). For Bernoulli random variables, X and Y are unfaithful if and only if ρ = 0 and γ ≠ 0.

The following model is a simple example where X and Y are unfaithful. Let ϵ_Z, ϵ_Y be fair coins. Then, in the following model, ρ = 0 and γ = 1/2:

image

image

Table 9: Observational distribution of unfaithful model

image

Table 10: Interventional distributions of unfaithful model

image

Almost all models are faithful in a formal sense – models that do not respect the faithfulness condition have Lebesgue measure zero in probability spaces where model parameters have continuous support and are independently distributed [25]. However, this does not mean that such models can be dismissed out of hand; they are vanishingly unlikely to occur by chance, but can be deliberately engineered.

Friedman’s thermostat and the traitorous lieutenant

Consider “Friedman’s Thermostat”: a correctly functioning thermostat would keep the indoor temperature constant, regardless of the external temperature, by adjusting the furnace settings.⁷ Observation would show external temperature and furnace settings to be anticorrelated with each other and internal temperature to be uncorrelated with both. This does not correspond to the true causal effect that external temperature and furnace settings have on internal temperature.

The sharp-eyed reader will note that the Friedman’s thermostat example is not a recursive (acyclic) causal model. An example of unfaithfulness with a recursive causal model can be seen in the following “Traitorous Lieutenant” problem. Consider the problem of a general trying to send a one-bit message. The general has two lieutenants available to act as messengers; however, one of them is a traitor and will leak whatever information they have to the enemy. The general observes the following protocol: to send a 1, the general either gives the first lieutenant a 1 and the second a 0, or the first a 0 and the second a 1, with equal probability. To send a 0, the general either gives both lieutenants a 0, or both lieutenants a 1, with equal probability. The recipient of the message XORs both lieutenants’ bits to recover the original message.

image

Figure 4: Diagram of the traitorous lieutenant problem

In this scenario, the traitor will see a 0 and 1 with equal probability, regardless of the actual message. This is unfaithfulness; a lieutenant changing their bit has a causal effect on the final message, but observing a single lieutenant’s bit provides no information.
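The protocol is small enough to check by exhaustive enumeration. The following is a minimal sketch, not part of the original analysis: the function names `encode` and `decode` are illustrative, and the message is assumed to be a fair coin purely so that a joint distribution can be written down.

```python
from fractions import Fraction
from itertools import product


def encode(message, coin):
    """General's protocol: a fair coin picks the first lieutenant's bit and the
    second bit is chosen so that the XOR of the two bits equals the message."""
    first = coin
    second = coin ^ message
    return first, second


def decode(first, second):
    """Recipient XORs both lieutenants' bits."""
    return first ^ second


# Assumption: the message itself is a fair coin, so each (message, coin)
# pair has probability 1/4.
quarter = Fraction(1, 4)
joint = {}  # P(message, first lieutenant's bit)
flip_always_changes_message = True

for message, coin in product((0, 1), repeat=2):
    first, second = encode(message, coin)
    assert decode(first, second) == message  # protocol is correct
    joint[(message, first)] = joint.get((message, first), Fraction(0)) + quarter
    # Intervention: force the first lieutenant's bit to the opposite value.
    if decode(1 - first, second) == message:
        flip_always_changes_message = False

# Marginals of the message and of the first lieutenant's bit.
p_message = {m: sum(p for (mi, _), p in joint.items() if mi == m) for m in (0, 1)}
p_first = {f: sum(p for (_, fi), p in joint.items() if fi == f) for f in (0, 1)}

# Observational independence: P(message, bit) = P(message) P(bit) everywhere.
independent = all(
    joint.get((m, f), Fraction(0)) == p_message[m] * p_first[f]
    for m, f in product((0, 1), repeat=2)
)
```

Here `independent` captures the observational side (a single bit carries no information about the message) and `flip_always_changes_message` captures the interventional side (changing a bit always changes the decoded message).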

Genuine causation and confounding bias

“Genuine causation” is suggested for referring to models where ρ and γ have the same sign. However, due to confounding bias, the strength of the true causal effect may be different than what a naive interpretation of correlation would suggest.

The causal definition of no confounding is provided by Pearl [28].

No confounding. Let M be a causal model. X and Y are not confounded in M if and only if P(y | x̂) = P(y | x).

By the definition of the causation coefficient, no confounding implies ρ = γ. Genuine causation with negative confounding bias corresponds to γ > ρ, and can be thought of as a weaker version of the type of confounding effect that produces unfaithfulness or inverse causation. In such cases, the true causal effect will be stronger than correlation suggests. Let ϵZ be a fair coin and ϵY be Bernoulli distributed with p = 1/4. Then, the following model exhibits genuine causation with negative confounding bias, with ρ = 1/2 and

image

Table 11: Observational distribution of negative confounding bias model

image

Table 12: Interventional distributions of negative confounding bias model

image

Genuine causation with positive confounding bias corresponds to γ < ρ; in such cases, the true causal effect will be weaker than correlation suggests. Let ϵX, ϵY, ϵZ be fair coins. In the following model, ρ ≈ 0.745, the natural causation coefficient, γ ≈ 0.447, and the maximum entropy causation coefficient, γH = 0.5:

image

image

Table 13: Observational distribution of positive confounding bias model

image

Table 14: Interventional distributions of positive confounding bias model

image

Common intuition suggests that correlation is closely related to causation. However, the models in the previous section act as a constructive proof that the sign of the correlation coefficient provides no guarantees about the true causal effect. Some insight on this apparent discrepancy can be found by considering the following set of linear probabilistic causal models, parameterized by σ²ϵX, σ²ϵY, σ²ϵZ, αZ, βX, βZ:

image

Since these models are linear and covariance is bilinear, the population correlation coefficient can be calculated analytically, regardless of the underlying distribution of the error terms:

image

The natural causation coefficient can also be calculated directly from the definitions of the causation coefficient and causal effect:

image

The typical relationship between correlation and causation can be analyzed by constructing a probability distribution for the parameters of the linear model. αZ, βX, βZ have support over the entire real line; σ²ϵX, σ²ϵY, σ²ϵZ have support over (0, ∞). Given mean 0 and variance 1, the maximum entropy distributions are N(0, 1) and exp(1), respectively. Assuming jointly independent distributions over the parameters, it is straightforward to randomly sample models and compute their correlation and causation coefficients. Plotting γ against ρ yields a graph where each point represents a single linear probabilistic causal model. The (smoothed) result of plotting such a graph is in Figure 5.

In the graph of γ vs ρ, the upper left and lower right quadrants contain inverse causation models and the other two quadrants contain genuine causation models. Except for (0, 0), which corresponds to an invariant and independent model, the horizontal line, γ = 0, contains common causation models and the vertical line, ρ = 0, contains unfaithful models.

With maximum entropy distributions over the parameters, the probability of a random linear model exhibiting inverse causation is ≈ 0.122, genuine causation with negative bias ≈ 0.364, and genuine causation with positive bias ≈ 0.514. This matches closely with common intuition. Typically, a strong positive correlation indicates a strong positive causal effect – this can be seen in the upper right quadrant, with a high density of models. Inverse causation is possible, although much less likely, and unfaithful models have measure 0, which accounts for why they are often considered counterintuitive. However, this is an analysis of population, not sample, coefficients and the measure of nearly unfaithful models is nonzero.⁸ In practice, this means that unfaithfulness cannot be dismissed as irrelevant. Although the population correlation in such models is not exactly zero, the sample correlation will often be indistinguishable from zero, despite the possibility of a nontrivial causal effect.

image

Figure 5: Causation vs correlation coefficient (kernel density estimation with 10⁶ samples). Darker shading indicates higher density of models. The curves at the top and right of the graph are the marginal densities of ρ and γ.

The choice of a maximum entropy distribution in this analysis is based on the principle of maximum entropy, which states that the appropriate prior distribution, given the absence of any other information, is the maximum entropy distribution [30]. However, the choice of linear models and the particular parameterization remain somewhat subjective. The statement that inverse causation only occurs in ≈ 12% of models should be seen as qualitatively consistent with the intuition that such situations are rare, but not quantitatively significant.
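The sampling procedure described above can be sketched in a few lines. The structural equations are not legible in this text, so the sketch assumes the form suggested by the parameter names: Z = ϵZ, X = αZ·Z + ϵX, Y = βX·X + βZ·Z + ϵY, with the natural causation coefficient computed by drawing the intervention from the observational distribution of X. The function names, the sample size, and the use of absolute values to extend “negative bias” (γ > ρ) to negative coefficients are all choices made for this illustration.

```python
import math
import random


def sample_coefficients(rng):
    """Draw one linear model and return (rho, gamma).

    Assumed model (an assumption, see lead-in):
        Z = e_Z,  X = a_z * Z + e_X,  Y = b_x * X + b_z * Z + e_Y
    """
    # Coefficients ~ N(0, 1); error variances ~ Exp(1).
    a_z, b_x, b_z = (rng.gauss(0.0, 1.0) for _ in range(3))
    var_ex, var_ey, var_ez = (rng.expovariate(1.0) for _ in range(3))

    # Observational moments, using bilinearity of covariance.
    var_x = a_z ** 2 * var_ez + var_ex
    cov_xz = a_z * var_ez
    cov_xy = b_x * var_x + b_z * cov_xz
    var_y = b_x ** 2 * var_x + b_z ** 2 * var_ez + 2 * b_x * b_z * cov_xz + var_ey
    rho = cov_xy / math.sqrt(var_x * var_y)

    # Interventional moments: X is set independently of Z, with Var[X] kept
    # equal to its observational value (natural causation coefficient).
    var_y_do = b_x ** 2 * var_x + b_z ** 2 * var_ez + var_ey
    gamma = b_x * var_x / math.sqrt(var_x * var_y_do)
    return rho, gamma


def classify(rho, gamma):
    """Label a model by quadrant; boundary cases have probability zero here."""
    if rho * gamma < 0:
        return "inverse causation"
    if abs(gamma) > abs(rho):
        return "genuine causation, negative bias"
    return "genuine causation, positive bias"


rng = random.Random(0)
samples = [sample_coefficients(rng) for _ in range(20000)]

shares = {}
for rho, gamma in samples:
    label = classify(rho, gamma)
    shares[label] = shares.get(label, 0) + 1 / len(samples)

for label, share in sorted(shares.items()):
    print(f"{label}: {share:.3f}")
```

If the assumed structural equations match the model used in the text, the printed shares should be comparable to the three probabilities quoted above; any discrepancy would point to a difference in the assumed model rather than in the sampling scheme.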

Randomization of an independent variable effectively cuts all incoming edges to that node in a causal diagram, removing potential confounding effects. Reporting a correlation coefficient, in the context of a randomized controlled trial, can be viewed as reporting an estimate of the causation coefficient, with the distribution of interventions, P̂(x), equal to the distribution of interventions that were performed in the experiment.

When randomization is not available, it may still be possible to calculate a sample causation coefficient by estimating P(y | do(x)). Presented here is a simple example, using data from a study on the treatment of kidney stones [31]. More advanced techniques for identifying P(y | do(x)) are given in [15].

The subgroups (Z) in Table 15 refer to kidney stone size. Group 1 is small kidney stones; group 2 is large kidney stones. This study can be modeled with binary treatment (X) and response (Y ) variables, with the decision to perform percutaneous nephrolithotomy (PCNL) as 0 and surgery as 1.

Table 15: Success rate of treatment; successful/total (probability)

image

The naive model is that there is no confounding (Figure 6a). In such a case, the population natural causation coefficient equals the population correlation coefficient and therefore the sample correlation coefficient r is equal to the sample causation coefficient g.⁹ Given the data, r = g = −0.057. The cohorts are equal in size, so this is also an estimate of the maximum entropy causation coefficient.

image

Figure 6: Some of the possible causal diagrams for modeling kidney stone treatment

Hypothetically, if the subgroups were postoperative infection and the treatment affected the likelihood of postoperative infection, which, in turn, affected recovery (Figure 6b), the natural causation coefficient would still equal the correlation coefficient – adjusting for subgroups would still be incorrect. This is an immediate consequence of the do-calculus [15].

However, the correct set of causal assumptions is that kidney stone size affects treatment and recovery (Figure 6c) – doctors took kidney stone size into account when making the decision whether or not to send a patient to surgery. Correctly estimating the causation coefficient in this model can be done with an adjustment for direct causes, P(y | do(x)) = Σz P(y | x, z)P(z). With respect to the correct causal diagram, estimating the causation coefficient yields g = 0.068. This is a case of inverse causation and the best treatment option for patients is the opposite of what a naive interpretation of correlation would suggest.
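The two estimates can be reproduced with a short calculation. The sketch below is illustrative only: Table 15 is not legible in this text, so the counts are the widely reproduced figures from Charig et al. [31] and should be checked against the table. The helper `bernoulli_coefficient` is a hypothetical name; it applies the Bernoulli identity from the appendix, coefficient = (p1 − p0)·(Var[X]/Var[Y])^(1/2), to either conditional or interventional success probabilities.

```python
import math

# (successes, total) indexed by (x, z).
# x: 0 = PCNL, 1 = open surgery; z: 1 = small stones, 2 = large stones.
# Assumption: counts taken from the commonly cited Charig et al. data.
counts = {
    (1, 1): (81, 87),
    (1, 2): (192, 263),
    (0, 1): (234, 270),
    (0, 2): (55, 80),
}
n = sum(total for _, total in counts.values())


def p_x(x):
    """P(X = x), the share of patients receiving treatment x."""
    return sum(t for (xi, _), (_, t) in counts.items() if xi == x) / n


def p_z(z):
    """P(Z = z), the share of patients in stone-size group z."""
    return sum(t for (_, zi), (_, t) in counts.items() if zi == z) / n


def p_y_given_x(x):
    """P(Y = 1 | X = x), pooling over stone size (no adjustment)."""
    succ = sum(s for (xi, _), (s, _) in counts.items() if xi == x)
    tot = sum(t for (xi, _), (_, t) in counts.items() if xi == x)
    return succ / tot


def p_y_do_x(x):
    """P(Y = 1 | do(X = x)) = sum_z P(y | x, z) P(z), adjusting for Z."""
    return sum(counts[(x, z)][0] / counts[(x, z)][1] * p_z(z) for z in (1, 2))


def bernoulli_coefficient(p1, p0, px1):
    """(p1 - p0) * sqrt(Var[X] / Var[Y]) for Bernoulli X and Y, where Y's
    marginal is the px1-weighted mixture of p1 and p0."""
    py = px1 * p1 + (1 - px1) * p0
    return (p1 - p0) * math.sqrt(px1 * (1 - px1) / (py * (1 - py)))


r = bernoulli_coefficient(p_y_given_x(1), p_y_given_x(0), p_x(1))  # correlation
g = bernoulli_coefficient(p_y_do_x(1), p_y_do_x(0), p_x(1))        # causation
print(f"r = {r:.3f}, g = {g:.3f}")
```

With these counts the sketch gives r ≈ −0.057 and g ≈ 0.068, in agreement with the values reported in the text; the sign change comes entirely from replacing P(y | x) with the adjusted P(y | do(x)).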

The reversal effect seen here is well known as Simpson’s paradox, but requires causal knowledge to resolve correctly [32]. Adjusting for the wrong variables will produce incorrect estimates of causal effect.

There are many different ways in which positive correlation can be misleading with respect to causation. Population distributions may exhibit common causation (ρ > 0, γ = 0) or inverse causation (ρ > 0, γ < 0). Sampling error introduces the possibility of coincidental correlation (r > 0, ρ = 0). Unfaithfulness (ρ = 0, γ ≠ 0) implies that the absence of correlation cannot guarantee the absence of causation. And even if there is no confounding (ρ = γ), human error introduces the possibility of reverse causation (γX→Y ≠ γY→X).

Despite the warning that, “Correlation is not causation”, the two are easy to conflate because of the high likelihood that a random model will have ρ ≈ γ. However, there remains a nontrivial possibility of encountering other correlation/causation relationships such as inverse causation, a problem that no amount of additional data sampling will mitigate. There is simply no substitute for accurate causal assumptions.

By emphasizing the population/sample and statistical/causal distinctions and explicitly naming the different ways in which correlation can relate to causation, it is hoped that these effects will become easier to recognize in practice.

Theorem. For Bernoulli X, Y, γX→Y = 0 if and only if Y is invariant to X.

Proof. Consider the definition of average treatment effect, ATE(X → Y) = P(y = 1 | do(x = 1)) − P(y = 1 | do(x = 0)). Average treatment effect is

image

Theorem. For Bernoulli X, Y: γX→Y = ATE(X → Y) (Var[X̂] / Var_X̂[Y])^(1/2)

Proof. Consider the numerator of  γ. For Bernoulli random variables:

image

Therefore, γX→Y = ATE(X → Y) (Var[X̂] / Var_X̂[Y])^(1/2). □

Theorem. In all recursive probabilistic causal models, at least one of γX→Y , γY →X is zero.

Proof. Assume without loss of generality that γX→Y is nonzero. This implies that Y is not invariant to X. Since P(y | x̂) is a nonconstant function of x, X must be an ancestor of Y in the associated causal diagram. Consider the submodel My that results from do(Y = y). Since X is an ancestor of Y in the original model, X and Y must be d-separated in My. Therefore, X is invariant to Y and γY→X is zero. □

Theorem. If X and Y are unfaithful in causal model M, then the observational distribution P and causal diagram G associated with M violate the faithfulness condition.

Proof. Assume without loss of generality that Y is not invariant to X. Therefore, X is an ancestor of Y in the associated causal diagram and X and Y are d-connected [10]. However, X and Y are independent, an independence relation not entailed by the Markov condition. Therefore the observational distribution P is not faithful to G. □

Thanks to James Reggia, Brendan Good, Donald Gregorich and Richard Bruns for their comments on drafts of this paper.

[1] E. R. Tufte, The cognitive style of powerpoint: Pitching out corrupts within. Graphics Press, 2006.

[2] E. Weisstein, Correlation coefficient, MathWorld.

[3] R. Brown and D. Fish, “The psychological causality implicit in language,” Cognition, vol. 14, no. 3, pp. 237–273, 1983.

[4] J. J. Heckman, “The scientific model of causality,” Sociological Methodology, vol. 35, no. 1, pp. 1–97, 2005.

[5] D. Lewis, “Causation,” The Journal of Philosophy, vol. 70, no. 17, pp. 556–567, 1973.

[6] K. Fine, “Critical notice,” Mind, vol. 84, no. 355, pp. 451–458, 1975.

[7] D. Hume, An enquiry concerning human understanding. 1748.

[8] P. Holland, “Statistics and causal inference,” Journal of the American Statistical Association, 1986.

[9] “The international system of units,” National Institute of Standards and Technology, Tech. Rep., 2008.

[10] J. Pearl, Causality. Cambridge university press, 2009.

[11] J. Pearl, “Bayesianism and causality, or, why i am only a half-bayesian,” in Foundations of bayesianism, Springer, 2001, pp. 19–36.

[12] S. Wright, “The method of path coefficients,” The annals of mathematical statistics, vol. 5, no. 3, pp. 161–215, 1934.

[13] J. Pearl and T. Verma, “A theory of inferred causation,” Morgan Kaufmann, 1991, pp. 441–452.

[14] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669–688, 1995.

[15] I. Shpitser and J. Pearl, “Complete identification methods for the causal hierarchy,” Journal of Machine Learning Research, vol. 9, no. Sep, pp. 1941–1979, 2008.

[16] E. Bareinboim and J. Pearl, “Transportability from multiple environments with limited experiments: Completeness results,” in Proceedings of the 27th Annual Conference on Neural Information Processing Systems, 2014.

[17] P. S. Laplace, A philosophical essay on probabilities. 1814.

[18] E. Bareinboim, C. Brito, and J. Pearl, “Local characterizations of causal bayesian networks,” in Graph Structures for Knowledge Representation and Reasoning, Springer, 2012, pp. 1–17.

[19] D. M. Chickering and J. Pearl, “A clinician’s tool for analyzing noncompliance,” in Proceedings of the National Conference on Artificial Intelligence, 1996, pp. 1269–1276.

[20] R. Munroe, Dimensional analysis, 2010.

[21] H. Reichenbach, The direction of time. University of California Press, 1956.

[22] K. Pearson, “Mathematical contributions to the theory of evolution,” Proceedings of the Royal Society of London, 1896.

[23] G. E. Quinn, C. H. Shin, M. G. Maguire, and R. A. Stone, “Myopia and ambient lighting at night,” Nature, vol. 399, no. 6732, pp. 113–114, 1999.

[24] M. Gardner, Aha! a two volume collection. MAA, 2006.

[25] P. Spirtes, C. N. Glymour, and R. Scheines, Causation, prediction, and search. 2000.

[26] M. Friedman, “The fed’s thermostat,” The Wall Street Journal, 2003.

[27] N. Rowe. (2010). Milton friedman’s thermostat, [Online]. Available: http://worthwhile.typepad.com/worthwhile_canadian_initi/2010/12/milton-friedmans-thermostat.html.

[28] J. Pearl, “Why there is no statistical test for confounding, why many think there is, and why they are almost right,” UCLA, Tech. Rep., 1998.

[29] C. Uhler, G. Raskutti, P. Bühlmann, B. Yu, et al., “Geometry of the faithfulness assumption in causal inference,” The Annals of Statistics, vol. 41, no. 2, pp. 436–463, 2013.

[30] E. T. Jaynes, Probability theory: The logic of science. Cambridge University Press, 2003.

[31] C. R. Charig, D. R. Webb, S. R. Payne, and J. E. Wickham, “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.,” Br Med J (Clin Res Ed), vol. 292, no. 6524, pp. 879–882, 1986.

[32] J. Pearl, “Comment: Understanding simpson’s paradox,” The American Statistician, vol. 68, no. 1, pp. 8–13, 2014.

