High-Dimensional Independence Testing via Maximum and Average Distance Correlations

2020·Arxiv

Abstract

This paper introduces and investigates maximum and average distance correlations for multivariate independence testing. We characterize their consistency properties in high-dimensional settings with respect to the number of marginally dependent dimensions, assess the advantages of each test statistic, examine their respective null distributions, and present a fast chi-square-based testing procedure. The resulting tests are non-parametric and applicable to both Euclidean distance and the Gaussian kernel as the underlying metric. To better understand the practical use cases of the proposed tests, we evaluate the empirical performance of the maximum distance correlation, average distance correlation, and the original distance correlation across various multivariate dependence scenarios, and conduct a real data experiment that tests for dependence between cancer types and peptide levels in human plasma.

Keywords: unbiased distance correlation, chi-square test, testing independence

1 Introduction

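Given pairs of observations $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}^q$ for $i = 1, \ldots, n$, assume they are independently and identically distributed as $F_{XY}$. The statistical hypothesis for testing independence is formulated as:

$$H_0: F_{XY} = F_X F_Y, \qquad H_A: F_{XY} \neq F_X F_Y.$$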

Traditional correlation measures like Pearson’s correlation (Pearson, 1895) are commonly used but cannot detect nonlinear and high-dimensional dependencies. More recent measures, such as the distance correlation (Szekely et al., 2007; Szekely and Rizzo, 2009) and the Hilbert-Schmidt independence criterion (Gretton et al., 2005; Gretton and Gyorfi, 2010), share similar characteristics: they equal zero if and only if the variables are independent, can uncover any type of dependency given a sufficient sample size, and can reliably test independence for any joint distribution of any fixed dimensionality. Dependence measures are valuable in various statistical applications, such as feature screening (Li et al., 2012; Zhong and Zhu, 2015; Shen et al., 2024), time-series (Zhou, 2012; Fokianos and Pitsillou, 2018; Shen et al., 2023), conditional independence (Fukumizu et al., 2007; Szekely and Rizzo, 2014; Wang et al., 2015), clustering (Szekely and Rizzo, 2005; Rizzo and Szekely, 2010), graph testing (Lee et al., 2019; Xiong et al., 2022), and deep learning (D. Guo and Zha, 2022; Zhen et al., 2022).

Detecting multivariate dependencies, and especially in high-dimensional scenarios, remains a challenging task with limited understanding. As the number of dimensions increases relative to the sample size, the testing power of existing dependence measures may diminish (Shen et al., 2020; Ramdas et al., 2015). In situations where the dimensions (p or q) approach infinity and the sample size grows more slowly than the dimension, distance correlation may fail to detect certain multivariate dependencies (Zhu et al., 2019). To address this issue, several solutions have been proposed, such as computing marginal distance covariance for each dimension in X and Y and averaging them to form the test statistic (Zhu et al., 2019), or considering a random rotation of the average distance covariance (Huang and Huo, 2017). Moreover, hypothesis testing through marginal covariance typically involves employing a standard permutation test to calculate the p-value, which can be computationally intensive and cost as much as O(rpqn log(n)), where r is the number of random permutations.

In this paper, we examine the utilization of maximum distance correlation and average distance correlation as test statistics for multivariate dependence testing. We formulate the maximum distance correlation and the average distance correlation based on pairwise, unbiased, and marginal distance correlations. To understand their respective advantages, we establish their consistency properties for high-dimensional independence testing, using the concept of marginally dependent dimensions. Subsequently, we analyze their limiting null distribution and propose a valid chi-square-based test to calculate p-values, which is significantly faster than permutation tests for large datasets. Our numerical study compares the performance of maximum, average, and original distance correlation using both Euclidean distance and the Gaussian kernel across various simulation settings. Finally, we provide a real data experiment on cancer types and peptide levels in human plasma to illustrate their practical applications. All theorem proofs are in the Appendix.

2 Background

In this section, we provide a review of existing results, including the unbiased distance correlation, its relationship with the Hilbert-Schmidt independence criterion (HSIC), its validity and consistency in testing independence, the permutation test, the limiting null distribution, and the chi-square test.

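We denote the paired sample data, assumed to be independently and identically distributed as $F_{XY}$, as follows:

$$(X, Y) = \{(x_i, y_i) \in \mathbb{R}^{p} \times \mathbb{R}^{q}, \ i = 1, \ldots, n\}.$$

Moreover, we always assume finite moments for $F_{XY}$ throughout this paper.

Given a distance metric $d(\cdot, \cdot)$ such as the Euclidean metric, we denote $D^X$ as the $n \times n$ pairwise distance matrix of $X$, with entries $D^X_{ij} = d(x_i, x_j)$. Similarly, we denote $D^Y$ as the pairwise distance matrix of $Y$.

Next, we compute a modified matrix $C^X$ as follows:

$$C^X_{ij} = \begin{cases} D^X_{ij} - \frac{1}{n-2}\sum_{t=1}^{n} D^X_{it} - \frac{1}{n-2}\sum_{s=1}^{n} D^X_{sj} + \frac{1}{(n-1)(n-2)}\sum_{s,t=1}^{n} D^X_{st}, & i \neq j, \\ 0, & i = j, \end{cases}$$

and similarly compute $C^Y$ from $D^Y$. The unbiased sample distance covariance and correlation are then given by:

$$\mathrm{dCov}_n(X, Y) = \frac{1}{n(n-3)} \mathrm{trace}(C^X C^Y), \qquad \mathrm{dCor}_n(X, Y) = \frac{\mathrm{dCov}_n(X, Y)}{\sqrt{\mathrm{dCov}_n(X, X)\, \mathrm{dCov}_n(Y, Y)}} \in [-1, 1].$$

If n < 4 or the denominator term is not a positive real number, the unbiased sample distance correlation is set to 0.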
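The computation described above can be sketched in a few lines. The following is a minimal illustration, assuming the Euclidean metric and the U-centering form of the modified matrix from Szekely and Rizzo (2014); the function names (`distance_matrix`, `u_center`, `unbiased_dcor`) are hypothetical, and the quadratic-time loops are written for readability rather than speed.

```python
import math


def distance_matrix(samples):
    """Pairwise Euclidean distances; each sample is a sequence of floats."""
    return [[math.dist(a, b) for b in samples] for a in samples]


def u_center(dist):
    """Modified (U-centered) matrix: zero diagonal, adjusted off-diagonal entries."""
    n = len(dist)
    row = [sum(r) for r in dist]  # row sums equal column sums by symmetry
    total = sum(row)
    centered = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                centered[i][j] = (dist[i][j]
                                  - row[i] / (n - 2)
                                  - row[j] / (n - 2)
                                  + total / ((n - 1) * (n - 2)))
    return centered


def inner(a, b):
    """trace(A B) / (n (n - 3)) for symmetric matrices A and B."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * (n - 3))


def unbiased_dcor(x, y):
    """Unbiased sample distance correlation between paired samples x and y."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must contain the same number of samples")
    if n < 4:
        return 0.0  # convention stated in the text
    cx = u_center(distance_matrix(x))
    cy = u_center(distance_matrix(y))
    denom_sq = inner(cx, cx) * inner(cy, cy)
    if denom_sq <= 0:
        return 0.0  # denominator is not a positive real number
    return inner(cx, cy) / math.sqrt(denom_sq)
```

For example, `unbiased_dcor([(float(i),) for i in range(10)], [(2.0 * i + 1.0,) for i in range(10)])` evaluates a perfectly linear relationship, for which the two modified matrices are proportional and the statistic equals 1 up to rounding.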

The above unbiased statistic was introduced in Szekely and Rizzo (2014). Compared to the biased statistic introduced in Szekely et al. (2007), the unbiased statistic satisfies

$$E(\mathrm{dCov}_n(X, Y)) = 0$$

if and only if X and Y are independent.

By default, distance correlation utilizes the Euclidean distance as its metric. However, it is versatile enough to accommodate any distance metric or kernel choice by setting two sample kernel matrices. It is worth noting that when the Gaussian kernel is used, distance correlation effectively becomes equivalent to HSIC. In fact, one can interchange between distance and kernel metrics through an appropriate kernel-to-distance transformation (Sejdinovic et al., 2013).
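To make the kernel-to-distance idea concrete, here is a minimal sketch. It assumes the transformation $d(x, y) = k(x, x) + k(y, y) - 2k(x, y)$ discussed in Sejdinovic et al. (2013) and a Gaussian kernel with a hypothetical bandwidth parameter `sigma`; the function names are illustrative and the bandwidth choice is not taken from this paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed bandwidth."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_induced_distance(x, y, kernel=gaussian_kernel):
    """Distance generated by a kernel: k(x, x) + k(y, y) - 2 k(x, y)."""
    return kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)


def induced_distance_matrix(samples, kernel=gaussian_kernel):
    """Pairwise induced distances, usable wherever a distance matrix is expected."""
    return [[kernel_induced_distance(a, b, kernel) for b in samples] for a in samples]
```

For the Gaussian kernel, $k(x, x) = 1$, so the induced distance reduces to $2 - 2k(x, y)$; because distance correlation is a ratio, a constant rescaling of the distance matrix leaves it unchanged.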

When the metric used is of strong negative type (Lyons, 2013, 2018), such as the Euclidean distance, or when a characteristic kernel is used (Gretton et al., 2005; Fukumizu et al., 2007; Gretton and Gyorfi, 2010), like the Gaussian kernel, the resulting distance correlation exhibits the following property:

$$\mathrm{dCor}_n(X, Y) \xrightarrow{n \to \infty} 0$$

if and only if X and Y are independent. Equivalently, distance correlation converges to a positive constant if and only if X and Y are dependent. This fundamental property makes distance correlation a valid and universally consistent statistic for testing independence using the permutation test, which is a standard approach in testing independence (Good, 2005; Heller et al., 2013; Shen

Specifically, when X and Y are dependent, performing a permutation test on distance correlation yields an asymptotic p-value of 0, leading to testing power that converges to 1 as the sample size n increases. Conversely, when X and Y are independent, the p-value follows a uniform distribution in the range [0, 1], and the testing power equals the type 1 error level $\alpha$.

The permutation test can be computationally intensive: it requires at least r = 100 permutations of the sample data, with the permuted statistic recomputed for each permutation. Given that computing distance correlation typically takes $O(n^2)$ time, the testing procedure entails a computational complexity of $O(rn^2)$.
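The procedure can be sketched generically as follows. This is a minimal illustration rather than the authors' implementation: `statistic` is any callable taking paired samples, the sample covariance below is only a lightweight stand-in for distance correlation, and the `(count + 1) / (r + 1)` correction is a common convention assumed here.

```python
import random


def covariance(x, y):
    """Toy statistic used as a stand-in for distance correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n


def permutation_pvalue(x, y, statistic, r=100, seed=0):
    """Permutation p-value: shuffle y to break the pairing, recompute the statistic."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(r):
        rng.shuffle(y_perm)  # one random permutation of the second sample
        if statistic(x, y_perm) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (r + 1)
```

Each of the r iterations recomputes the statistic once, which is where the factor r in the overall cost comes from.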

Recent advancements have led to a faster implementation of distance correlation with a time complexity of O(n log(n)) when p = q = 1 and Euclidean distance is used (Huo and Szekely, 2016; Chaudhuri and Hu, 2019). Furthermore, significant progress has been made in characterizing the null distribution of distance correlation, i.e., the distribution when X and Y are independent. This distribution can be fully specified in the limit (Zhang et al., 2018) and approximated by the following chi-square distribution (Shen et al., 2022):

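Theorem 1. The limiting null distribution of the unbiased distance correlation satisfies

$$n \cdot \mathrm{dCor}_n(X, Y) \xrightarrow{D} \sum_{i,j=1}^{\infty} w_{ij} (N_{ij}^2 - 1),$$

where the weights satisfy $w_{ij} \in [0, 1]$ and $\sum_{i,j=1}^{\infty} w_{ij}^2 = 1$, and $N_{ij}$ are independently and identically distributed as the standard normal distribution.

For sufficiently large n, there exists $\alpha > 0$ such that

$$n \cdot \mathrm{dCor}_n(X, Y) \preceq_{\alpha} \chi^2_1 - 1$$

regardless of the metric choice or marginal distributions.

Here, the notation $\preceq_{\alpha}$ means upper tail dominance in distribution, defined as follows:

Definition 1. Given two random variables U and V, we say U dominates V in upper tail at probability level $\alpha$, or equivalently $V \preceq_{\alpha} U$, if and only if

$$F_V(x) \geq F_U(x) \quad \text{for all } x \geq F_U^{-1}(1 - \alpha).$$

Consequently, the p-value of the unbiased distance correlation can be computed as

$$1 - F_{\chi^2_1 - 1}(n \cdot \mathrm{dCor}_n(X, Y)) = 1 - F_{\chi^2_1}(n \cdot \mathrm{dCor}_n(X, Y) + 1),$$

which has a time complexity of O(1) and is straightforward to implement in any programming language. This test is valid for any type 1 error level $\alpha$ where the upper-tail dominance holds, and it remains universally consistent against any dependence, meaning that the p-value converges to 0 when X and Y are dependent.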
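As a small illustration of how cheap this step is, the sketch below evaluates the chi-square p-value from a sample size and an already computed unbiased distance correlation. It assumes the p-value takes the form $1 - F_{\chi^2_1}(n \cdot \mathrm{dCor}_n + 1)$ as in Shen et al. (2022); the function names are hypothetical. The upper tail of a chi-square variable with one degree of freedom is obtained from the complementary error function, so only the standard library is needed.

```python
import math


def chi2_1_upper_tail(z):
    """P(chi-square with 1 degree of freedom > z), via the complementary error function."""
    if z <= 0:
        return 1.0
    return math.erfc(math.sqrt(z / 2.0))


def dcor_chi2_pvalue(n, dcor):
    """Chi-square p-value for an unbiased distance correlation computed from n samples."""
    return chi2_1_upper_tail(n * dcor + 1.0)
```

With n = 100, a statistic of about 0.0284 gives $n \cdot \mathrm{dCor}_n + 1 \approx 3.84$, the familiar 95% quantile of $\chi^2_1$, so the p-value is about 0.05.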

Although the level $\alpha$ cannot be exactly determined in closed form, for sample sizes n > 30, the chi-square test on distance correlation yields testing power similar to the permutation test and is approximately valid for any $\alpha \leq 0.05$. For example, in high-dimensional scenarios where both n and p tend to infinity, assuming that X is continuous and each dimension of X is exchangeable, the null distribution of $n \cdot \mathrm{dCor}_n(X, Y)$ converges to N(0, 2) (Szekely and Rizzo, 2013). In such cases, the chi-square test is strictly valid for any $\alpha \leq 0.05$ (Shen et al., 2022).

3 Main Results

3.1 Maximum and Average Distance Correlations

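Given $(X, Y)$ as the sample data, let $X^s$ denote the $s$th dimension of the sample data $X$ for $s \in [p] = \{1, \ldots, p\}$. Similarly for $Y^t$ with $t \in [q]$.

For every $(s, t) \in [p] \times [q]$, we refer to the distance correlation between $(X^s, Y^t)$ as the marginal distance correlation, denoted by $\mathrm{dCor}_n(X^s, Y^t)$. To distinguish, we will refer to $\mathrm{dCor}_n(X, Y)$, which incorporates all dimensions of the sample data, as the original distance correlation.

We introduce the maximum and average distance correlations as follows:

$$\mathrm{dCor}^{\max}_n(X, Y) = \max_{s \in [p],\, t \in [q]} \mathrm{dCor}_n(X^s, Y^t), \qquad \mathrm{dCor}^{\mathrm{avg}}_n(X, Y) = \frac{1}{pq} \sum_{s=1}^{p} \sum_{t=1}^{q} \mathrm{dCor}_n(X^s, Y^t).$$

In essence, $\mathrm{dCor}^{\max}_n(X, Y)$ is the maximum of all marginal distance correlations per dimension, while $\mathrm{dCor}^{\mathrm{avg}}_n(X, Y)$ is the average of all marginal distance correlations. Note that all the marginal sample statistics employ the computation of unbiased distance correlation.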
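The two statistics can be sketched directly from this description. The example below is an illustration under stated assumptions: each marginal statistic is the unbiased distance correlation between one column of X and one column of Y with the absolute difference as the metric, and the names `dcor_1d` and `max_and_average_dcor` are hypothetical. The loops cost $O(pqn^2)$ and are meant for small inputs.

```python
import math


def dcor_1d(u, v):
    """Unbiased distance correlation between two one-dimensional samples."""
    n = len(u)
    if n < 4:
        return 0.0

    def centered(w):
        d = [[abs(a - b) for b in w] for a in w]
        row = [sum(r) for r in d]
        tot = sum(row)
        return [[0.0 if i == j else
                 d[i][j] - (row[i] + row[j]) / (n - 2) + tot / ((n - 1) * (n - 2))
                 for j in range(n)] for i in range(n)]

    def inner(a, b):
        # The common factor 1 / (n (n - 3)) cancels in the correlation ratio.
        return sum(e * f for ra, rb in zip(a, b) for e, f in zip(ra, rb))

    cu, cv = centered(u), centered(v)
    denom_sq = inner(cu, cu) * inner(cv, cv)
    return inner(cu, cv) / math.sqrt(denom_sq) if denom_sq > 0 else 0.0


def max_and_average_dcor(x, y):
    """x: n rows with p features each; y: n rows with q features each."""
    x_cols = list(zip(*x))  # p columns, each of length n
    y_cols = list(zip(*y))  # q columns, each of length n
    marginals = [dcor_1d(xs, yt) for xs in x_cols for yt in y_cols]
    return max(marginals), sum(marginals) / len(marginals)
```

When only one pair of dimensions carries a strong signal, the maximum stays near that marginal value while the average is pulled toward zero by the remaining pairs, which is the trade-off examined in the next subsection.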

3.2 Consistency for Testing Marginal Dependence

In this section, our objective is to determine when and how the proposed statistics are suitable for testing independence in high-dimensional scenarios, particularly when pq is large and increases concurrently with n. To achieve this, we introduce the concept of marginal dependence:

Definition 2. Given random variables $(X, Y) \in \mathbb{R}^p \times \mathbb{R}^q$, we define $\Delta(X, Y)$ as the set of pairwise marginally dependent dimensions. In other words, the element $(s, t) \in \Delta(X, Y)$ if and only if $X^s$ and $Y^t$ are dependent. We use $\delta(X, Y)$ to denote the cardinality of this set, which means it represents the number of marginally dependent dimensions.

Clearly, if $\delta(X, Y) > 0$, then X and Y must be dependent. However, the converse is not always true; that is, X and Y may be dependent while $\delta(X, Y) = 0$. In practice, creating a counter-example usually requires special construction, and the concept of marginal dependence effectively captures a significant portion of dependence.
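One such special construction, offered here as an illustration and not taken from the paper, uses two independent fair signs $X^1, X^2 \in \{-1, 1\}$ and sets $Y = X^1 X^2$. The sketch below enumerates the four equally likely outcomes with exact fractions and checks that Y is independent of each coordinate of X separately, yet dependent on X jointly, so no pair of dimensions is marginally dependent even though X and Y are dependent.

```python
from fractions import Fraction
from itertools import product

# Each outcome is (x1, x2, y) with y = x1 * x2; all four are equally likely.
outcomes = [(x1, x2, x1 * x2) for x1, x2 in product((-1, 1), repeat=2)]
prob = Fraction(1, len(outcomes))


def marginal(indices):
    """Exact distribution of the coordinates listed in `indices`."""
    table = {}
    for outcome in outcomes:
        key = tuple(outcome[i] for i in indices)
        table[key] = table.get(key, 0) + prob
    return table


def independent(idx_a, idx_b):
    """True when the joint distribution factorizes into the two marginals."""
    pa, pb, pab = marginal(idx_a), marginal(idx_b), marginal(idx_a + idx_b)
    return all(pab.get(ka + kb, 0) == pa[ka] * pb[kb] for ka in pa for kb in pb)


x1_vs_y = independent((0,), (2,))    # X^1 against Y
x2_vs_y = independent((1,), (2,))    # X^2 against Y
x_vs_y = independent((0, 1), (2,))   # (X^1, X^2) jointly against Y
```

Here `x1_vs_y` and `x2_vs_y` are both true while `x_vs_y` is false, matching the remark that dependence can exist with no marginally dependent pair.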

The following two theorems establish the high-dimensional behavior of the maximum and average statistics under the null and alternative hypotheses, respectively.

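Theorem 2. When $\delta(X, Y) = 0$, the average distance correlation satisfies:

$$\mathrm{dCor}^{\mathrm{avg}}_n(X, Y) \xrightarrow{n \to \infty} 0$$

regardless of pq.

Theorem 3. When $\delta(X, Y) > 0$, the maximum distance correlation satisfies:

$$\lim_{n \to \infty} \mathrm{dCor}^{\max}_n(X, Y) > 0$$

regardless of pq.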

with equality holds when

Therefore, both the maximum and average distance correlations are asymptotically consistent for testing the presence of marginal dependence when pq is fixed, because either statistic tends to zero asymptotically if and only if $\delta(X, Y) = 0$. However, in the high-dimensional setting where pq increases concurrently with n, the maximum statistic may not be consistent if pq increases too rapidly relative to n, while the average statistic is not consistent when $\delta(X, Y)$ is too small relative to n. This can be summarized in the following corollary, which directly follows from Theorem 2 and Theorem 3.

Corollary 1. As n increases to infinity, the average distance correlation is asymptotically consistent in testing the existence of marginal dependence when pq is fixed or increasing pq.

The maximum distance correlation is asymptotically consistent in testing the existence of marginal dependence when pq is fixed or for increasing pq.

Therefore, as long as n is not too small, the maximum correlation is more advantageous because the average statistic may be too small when a dependence signal is present in only a few dimensions. On the other hand, for small-sample problems, the maximum correlation can be significantly biased under the null, potentially inflating its p-value. These behaviors are also observed in the numerical study.

3.3 Limiting Null Distribution and Chi-square-based Tests

While one could employ the permutation test on either the maximum or average statistic, it tends to be very slow for large n. To expedite the testing process, we delve into the null distributions of the maximum and average statistics:

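Theorem 4. Assume that X and Y are independent, and that the dimensions within X and within Y are also independent. For sufficiently large n and sufficiently small $\alpha$, it holds that

$$n \cdot \mathrm{dCor}^{\max}_n(X, Y) \preceq_{\alpha} U,$$

where $U$ is a random variable with distribution function $F_U(x) = [F_{\chi^2_1 - 1}(x)]^{pq}$.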

While the null distribution of the maximum statistic relies on upper-tail dominance, the null distribution of the average statistic is simpler: it converges to a normal distribution as the dimensions increase.

Theorem 5. Under the same condition in Theorem 4, and further assuming that both n and pq increase to infinity, it holds that

$$n \sqrt{pq} \cdot \mathrm{dCor}^{\mathrm{avg}}_n(X, Y) \xrightarrow{D} N(0, 2).$$

Note that this coincides with the limiting null distribution of the original distance correlation in high-dimensions (Szekely and Rizzo, 2013). Moreover, while Theorem 4 holds for any pq, Theorem 5 requires pq to increase. In practice, we have found that pq > 30 suffices for a good approximation. Alternatively, if we assume each marginal distance correlation actually follows $(\chi^2_1 - 1)/n$, then the null distribution of $n \cdot pq \cdot \mathrm{dCor}^{\mathrm{avg}}_n(X, Y)$ is $\chi^2_{pq} - pq$. This provides a better empirical approximation for small pq, and as pq increases, it also converges to N(0, 2) after dividing by $\sqrt{pq}$.

Utilizing the null distributions, we can construct chi-square-based tests for both the maximum and average distance correlations. Specifically, we calculate p-values as follows:

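• For the maximum distance correlation, we let $z = n \cdot \mathrm{dCor}^{\max}_n(X, Y) + 1$, and compute the p-value as $1 - [F_{\chi^2_1}(z)]^{pq}$.

• For the average distance correlation, we let $z = n \cdot \mathrm{dCor}^{\mathrm{avg}}_n(X, Y) + 1$, and compute the p-value as $1 - F_{\chi^2_{pq}}(pq \cdot z)$.

Note that for the average distance correlation, we employ the $\chi^2_{pq}$ distribution instead of the normal distribution. This choice provides a better approximation for small values of pq while being equivalent to the normal distribution for large pq.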
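The two p-value computations can be sketched as below. This is a hedged illustration, not the authors' code: it assumes the p-value for the maximum statistic is $1 - [F_{\chi^2_1}(z)]^{pq}$ and the p-value for the average statistic is $1 - F_{\chi^2_{pq}}(pq \cdot z)$, with $z$ equal to n times the statistic plus 1. The chi-square distribution function is evaluated with a standard power series for the regularized lower incomplete gamma function, which is adequate for moderate arguments; all function names are hypothetical.

```python
import math


def chi2_cdf(z, df):
    """Chi-square CDF with df degrees of freedom (series for the incomplete gamma)."""
    if z <= 0:
        return 0.0
    a, x = df / 2.0, z / 2.0
    term = total = 1.0
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= x / (a + k)  # next term: x^k / ((a + 1) ... (a + k))
        total += term
    scale = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    return min(1.0, total * scale)


def max_dcor_pvalue(n, pq, max_stat):
    """P-value for the maximum statistic: maximum of pq centered chi-square variables."""
    z = n * max_stat + 1.0
    return 1.0 - chi2_cdf(z, 1) ** pq


def avg_dcor_pvalue(n, pq, avg_stat):
    """P-value for the average statistic: chi-square with pq degrees of freedom."""
    z = n * avg_stat + 1.0
    return 1.0 - chi2_cdf(pq * z, pq)
```

Both functions need only n, pq, and the already computed statistic, so the testing step adds negligible cost once the pq marginal correlations are available. With pq = 1 the two p-values coincide with the chi-square test of the original statistic.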

Both of these chi-square-based tests are considered valid according to the following theorem:

Theorem 6. Under the same condition in Theorem 4, the chi-square test for the maximum correlation is a valid test of independence for sufficiently large n and sufficiently small type 1 error level $\alpha$. Moreover, the chi-square test for the average correlation is a valid test of independence for sufficiently large n and pq, at any type 1 error level $\alpha$.

It is important to note that our results do not rely on any particular choice of distance metric. For instance, one can employ the Gaussian kernel and compute the maximum and average HSIC, and the chi-square-based tests remain approximately valid and consistent.

Regarding the assumption of independence among dimensions within X and Y, we note that even when this assumption is not met, the chi-square-based test remains approximately valid and is often more conservative. In other words, the p-value obtained through the chi-square-based test tends to be larger than the p-value obtained through the permutation test. For example, in the extreme scenario where all dimensions of X are identical and all dimensions of Y are identical, Theorem 4 still holds, but with a conservative bound. In this case, a tighter bound for the null distribution should be $n \cdot \mathrm{dCor}^{\max}_n(X, Y) \preceq_{\alpha} \chi^2_1 - 1$.

Empirically, the presence of inter-dimension dependence appears to have relatively little impact on both tests as long as n is moderate. However, for small samples, the p-value for the maximum distance correlation can be more conservative. In such cases, a permutation test may provide a more accurate result while remaining cost-effective for small n. These behaviors will be observed in the experiments section.

4 Simulation Study

In our simulation study, we first demonstrate that the chi-square-based distribution provides an accurate approximation of the true null distribution. Next, we evaluate the testing power of the maximum and average tests across a variety of multivariate dependence structures.

4.1 Chi-Square vs True Null

Figure 1 displays the comparison between the chi-square-based distribution and the true null distribution for original, maximum, and average statistics, using Euclidean distance and Gaussian kernel respectively. The cumulative distribution function is plotted based on Theorem 4 and Theorem 5 for a sample size of n = 300 and pq = 100. The true null distribution is obtained through repeated generation of independent (X, Y). For the original and maximum distance correlations, the chi-square-based distribution dominates the true null for small $\alpha$, regardless of whether Euclidean distance or Gaussian kernel is used. For the average distance correlation, the chi-square-based distribution aligns closely with the true null distribution.

Figure 1: Compare the chi-square distribution and true null distribution for distance correlation. The top row considers the Euclidean distance (DCor), while the bottom row employs the Gaussian kernel (HSIC). In the first column, we compare the null approximation for the original statistic as presented in Theorem 1. In the second and third columns, we compare the null approximation for the maximum and average statistics, respectively, based on Theorems 4 and 5.

4.2 Fixed δ(X, Y ) with Increasing pq

We evaluate the testing power of original statistic, maximum statistic, and average statistic, in detecting multivariate dependence structures. The data is generated by sampling , using the 0], and considering

• Linear (

• Quadratic (using dimension-wise square.

• Fourth Root (

• Independence (

We perform the simulation study with n = 100 by gradually increasing p from 5 to 100 (except in the linear case, where it is increased to 1000). At each p, we generate sample data 1000 times, run each method, and record the number of times the p-value is below 0.05. The results are plotted in Figure 2, showing the testing power for each method.
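The power estimate described here is a plain Monte Carlo rejection rate, which can be sketched generically. In the illustration below, `sample_data` and `p_value` are hypothetical callables standing in for a data-generating model and a testing method; the toy sampler and p-value at the end simply produce uniform p-values, mimicking a valid test under independence.

```python
import random


def estimate_power(sample_data, p_value, reps=1000, alpha=0.05, seed=0):
    """Fraction of replicates whose p-value falls below alpha."""
    rng = random.Random(seed)
    rejections = sum(p_value(*sample_data(rng)) < alpha for _ in range(reps))
    return rejections / reps


def uniform_null_sampler(rng):
    """Toy stand-in for a data generator: returns a single uniform draw."""
    return (rng.random(),)


def identity_pvalue(u):
    """Toy stand-in for a test: treats the uniform draw itself as the p-value."""
    return u


null_rejection_rate = estimate_power(uniform_null_sampler, identity_pvalue)
```

Under independence a valid test has uniform p-values, so the estimated rejection rate should sit near the type 1 error level; under dependence the same loop estimates testing power.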

In these scenarios (excluding independence), the number of marginally dependent dimensions is limited, i.e., $\delta(X, Y)$ remains fixed while pq increases. The maximum test delivers near perfect power, followed by the average test, and the original test has the lowest power. As pq increases, the power of all tests declines, but the maximum test appears to be the least affected by increasing dimensions. This pattern remains consistent regardless of whether the Euclidean distance or Gaussian kernel is employed.

Furthermore, in the case of independence, the chi-square-based test effectively controls the type 1 error, and the test power closely aligns with $\alpha$, affirming the validity of the tests.

4.3 Increasing δ(X, Y ) with Fixed pq

In this scenario, we examine different multivariate dependence structures where the number of marginally dependent dimensions $\delta(X, Y)$ increases, while pq remains fixed. Let

, and consider

• Linear (1) otherwise.

Figure 2: Compare the testing power of maximum statistic, average statistic, and original statistic in linear, quadratic, fourth root, and independent settings as the number of dimensions increases while the number of marginally dependent dimensions is fixed. The top row utilizes Euclidean distance (DCor), while the bottom row employs the Gaussian kernel (HSIC).

• Trigonometry (1) otherwise.

We set p = 50 and n = 20 in linear, and p = 30 and n = 100 in trigonometry. For each d = 1, . . . , 10, we repeat the experiment 1000 times, run the test with all methods, and plot the testing power at a type 1 error level of 0.05 in Figure 3. In these settings, as the number of marginally dependent dimensions $\delta(X, Y)$ increases, all testing methods eventually achieve a testing power of 1. Among these, the maximum test consistently outperforms the average test, and the original test has the lowest power.

Figure 3: Compare the testing power of maximum statistic, average statistic, and original statistic in linear and trigonometric relationships as the number of marginally dependent dimensions increases from 1 to 10. The top row considers the Euclidean distance (DCor), while the bottom row employs the Gaussian kernel (HSIC).

5 Real Data

This experiment aimed to investigate the presence of any dependency between the abundance levels of peptides in human plasma and the occurrence of cancers. Selected Reaction Monitoring (SRM) was employed as a targeted quantitative proteomics technique for measuring protein and peptide abundance in complex biological samples (Wang et al., 2011). A prior study utilized SRM to identify a total of 318 peptides from a total of 98 individuals, among whom 33 were normal subjects, 10 had pancreatic cancer, 24 had colorectal cancer, and 28 had ovarian cancer (Wang et al., 2017). Consequently, X represents the sample peptide levels with p = 318. The data is publicly available on a GitHub repository in MATLAB format.

We performed independence tests based on various combinations, including utilizing the entire sample dataset, where Y represents a label vector indicating the cancer type each subject has. Other test scenarios included: distinguishing normal individuals from others, where Y is a label vector with normal subjects as 1 and all others as 2; distinguishing pancreatic from colorectal cancer, with Y as a label vector where pancreatic subjects are labeled as 1, colorectal subjects as 2, and others as unused; and so forth.

Our previous study (Vogelstein et al., 2019) indicated that all such testing combinations should yield significant p-values. Table 1 presents the test statistics and p-values for the maximum distance correlation, average distance correlation, and original distance correlation. While the original distance correlation performed well, there were three combinations where it failed to detect dependence: pancreatic vs others, colorectal vs others, and pancreatic vs colorectal. This may not be surprising, given that there were only 10 subjects with pancreatic cancer, which is the smallest group in the dataset, and colorectal cancer is the second smallest group. Since the sample size n is small in these cases, we also conducted permutation tests using 100 random permutations for these insignificant pairs (results reported in brackets), and the results remained insignificant.

The maximum correlation yielded significant results overall, except in three testing combinations involving pancreatic cancer. This aligns with the theoretical findings where the maximum correlation may not perform well for small sample sizes due to bias from the null distribution, and the chi-square test can be overly conservative. Therefore, we conducted permutation tests in these cases, and the results improved significantly, with all of them becoming statistically significant at type 1 error level 0.05.

The average correlation yielded significant p-values in almost all combinations, as it is not sensitive to small sample sizes. The only exception was the test for pancreatic vs. colorectal cancer, where it did not yield a significant result, while the maximum method would have tested significant at level 0.07. Its underperformance in this case suggests that $\delta(X, Y)$, the number of marginally dependent dimensions, could be very small, leading to its lack of sensitivity.

Table 1: Results for cancer peptide testing. We consider 11 different combinations of the sample data to test significant relationship, and use the original distance correlation, maximum distance correlation, and average distance correlation for testing. The statistic and the p-value are reported.

6 Conclusion

In this paper, we propose the maximum and average distance correlations using pairwise, unbiased, and marginal distance correlations. This formulation facilitates the understanding of their consistency properties, relative advantages, limiting distributions, and enables the use of chi-square-based tests. The numerical experiments further confirm our findings and shed light on their practical usages.

Acknowledgments

This work was supported by the National Science Foundation awards DMS-1921310 and DMS-2113099, the University of Delaware Data Science Institute Seed Funding Grant, and the Defense Advanced Research Projects Agency’s L2M program FA8650-18-2-7834.

References

Chaudhuri, A. and Hu, W. (2019). A fast algorithm for computing distance correlation. Computational Statistics & Data Analysis, 135:15–24.

D. Guo, C. Wang, B. W. and Zha, H. (2022). Learning fair representations via distance correlation minimization. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14.

Fokianos, K. and Pitsillou, M. (2018). Testing independence for multivariate time series via the auto-distance correlation matrix. Biometrika, 105(2):337–352.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2007). Kernel measures of conditional dependence. In Advances in neural information processing systems.

Good, P. (2005). Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer.

Gretton, A. and Gyorfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Scholkopf, B. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129.

Heller, R., Heller, Y., and Gorfine, M. (2013). A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510.

Huang, C. and Huo, X. (2017). A statistically and numerically efficient independence test based on random projections and distance covariance. arXiv.

Huo, X. and Szekely, G. (2016). Fast computing for distance covariance. Technometrics, 58(4):435–447.

Lee, Y., Shen, C., Priebe, C. E., and Vogelstein, J. T. (2019). Network dependence testing via diffusion maps and distance-based correlations. Biometrika, 106(4):857–873.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of American Statistical Association, 107:1129–1139.

Lyons, R. (2013). Distance covariance in metric spaces. Annals of Probability, 41(5):3284–3305.

Lyons, R. (2018). Errata to “distance covariance in metric spaces”. Annals of Probability, 46(4):2400–2405.

Pearson, K. (1895). Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58:240–242.

Ramdas, A., Reddi, S. J., Póczos, B., Singh, A., and Wasserman, L. (2015). On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In 29th AAAI Conference on Artificial Intelligence.

Rizzo, M. and Szekely, G. (2010). DISCO analysis: A nonparametric extension of analysis of variance. Annals of Applied Statistics, 4(2):1034–1055.

Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2291.

Shen, C., Chung, J., Mehta, R., Xu, T., and Vogelstein, J. T. (2023). Independence testing for temporal data. https://arxiv.org/abs/1908.06486.

Shen, C., Panda, S., and Vogelstein, J. T. (2022). The chi-square test of distance correlation. Journal of Computational and Graphical Statistics, 31(1):254–262.

Shen, C., Priebe, C. E., and Vogelstein, J. T. (2020). From distance correlation to multiscale graph correlation. Journal of the American Statistical Association, 115(529):280–291.

Shen, C. and Vogelstein, J. T. (2021). The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, 105(3):385–403.

Shen, C., Wang, S., Badea, A., Priebe, C. E., and Vogelstein, J. T. (2024). Discovering the signal subgraph: An iterative screening approach on graphs. https://arxiv.org/abs/1801.07683.

Szekely, G. and Rizzo, M. (2005). Hierarchical clustering via joint between-within distances: Extending ward’s minimum variance method. Journal of Classification, 22:151–183.

Szekely, G. and Rizzo, M. (2009). Brownian distance covariance. Annals of Applied Statistics, 3(4):1233–1303.

Szekely, G. and Rizzo, M. (2013). The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117:193–213.

Szekely, G. and Rizzo, M. (2014). Partial distance correlation with methods for dissimilarities. Annals of Statistics, 42(6):2382–2412.

Szekely, G., Rizzo, M., and Bakirov, N. (2007). Measuring and testing independence by correlation of distances. Annals of Statistics, 35(6):2769–2794.

Vogelstein, J. T., Wang, Q., Bridgeford, E., Priebe, C. E., Maggioni, M., and Shen, C. (2019). Discovering and deciphering relationships across disparate data modalities. eLife, 8:e41690.

Wang, Q., Chaerkady, R., Wu, J., Hwang, H. J., Papadopoulos, N., Kopelovich, L., Maitra, A., Matthaei, H., Eshleman, J. R., Hruban, R. H., Kinzler, K. W., Pandey, A., and Vogelstein, B. (2011). Mutant proteins as cancer-specific biomarkers. Proceedings of the National Academy of Sciences of the United States of America, (6):2444–9.

Wang, Q., Zhang, M., Tomita, T., Vogelstein, J. T., Zhou, S., Papadopoulos, N., Kinzler, K. W., and Vogelstein, B. (2017). A selected reaction monitoring approach for validating candidate biomarkers. PNAS.

Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. (2015). Conditional Distance Correlation. Journal of the American Statistical Association, 110(512):1726–1734.

Xiong, J., Shen, C., Arroyo, J., and Vogelstein, J. T. (2022). Graph independence testing: Applications in multi-connectomics. https://arxiv.org/abs/1906.03661.

Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. (2018). Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130.

Zhen, X., Meng, Z., Chakraborty, R., and Singh, V. (2022). On the versatile uses of partial distance correlation in deep learning. In European Conference on Computer Vision, pages 327–346.

Zhong, W. and Zhu, L. (2015). An iterative approach to distance correlation-based sure independence screening. Journal of Statistical Computation and Simulation, 85(11):2331–2345.

Zhou, Z. (2012). Measuring nonlinear dependence in time-series, a distance correlation approach. Journal of Time Series Analysis, 33(3):438–457.

Zhu, C., Yao, S., Zhang, X., and Shao, X. (2019). Distance-based and rkhs-based dependence metrics in high dimension. https://arxiv.org/abs/1902.03291.

7 All Proofs

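Theorem 2. When $\delta(X, Y) = 0$, the average distance correlation satisfies:

$$\mathrm{dCor}^{\mathrm{avg}}_n(X, Y) \xrightarrow{n \to \infty} 0$$

regardless of pq.

Proof. When $\delta(X, Y) = 0$, every pair $(X^s, Y^t)$ is independent, so $\mathrm{dCor}_n(X^s, Y^t) \to 0$ for any pair of (s, t).

Consequently, when p and q are fixed, both the maximum distance correlation and average distance correlation are asymptotically 0.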

If pq increases together with n, there can be an infinite number of marginal correlations. The average distance correlation still converges to 0 by law of large numbers, as it represents the mean of pq marginal correlations, all of which converge to 0.

However, the maximum distance correlation may be influenced, and a careful analysis is required to determine whether the convergence still holds in probability. For any $\epsilon > 0$, it suffices to prove

Here, the second line follows from basic order statistics, the third line follows from the existing dominance results by Theorem 1, the fourth line is based on the approximation of the standard normal distribution at the tail, where

Therefore, it suffices to consider when pq is a function of n, where we have:

or equivalently:

By computing the limit while treating pq as a function of n, and then using L’Hôpital’s rule (details omitted), we find that the above limit is 0 as long as:

where (represents its derivative with respect to n. A sufficient condition is therefore pq =

Theorem 3. When $\delta(X, Y) > 0$, the maximum distance correlation satisfies:

$$\lim_{n \to \infty} \mathrm{dCor}^{\max}_n(X, Y) > 0$$

regardless of pq.

with equality holds when

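Proof. When $\delta(X, Y) > 0$, there exists at least one pair of $(s, t)$ such that $X^s$ and $Y^t$ are dependent, so $\mathrm{dCor}_n(X^s, Y^t)$ converges to a positive constant.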

Therefore, the maximum distance correlation is always greater than 0.

This also holds true for the average distance correlation when pq is fixed. However, when pq also increases to infinity (at any rate relative to

where the second line follows because each marginal correlation is bounded in [

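Theorem 4. Assume that X and Y are independent, and that the dimensions within X and within Y are also independent. For sufficiently large n and sufficiently small $\alpha$, it holds that

$$n \cdot \mathrm{dCor}^{\max}_n(X, Y) \preceq_{\alpha} U,$$

where $U$ is a random variable with distribution function $F_U(x) = [F_{\chi^2_1 - 1}(x)]^{pq}$.

Proof. For each $(s, t) \in [p] \times [q]$, we apply the upper-tail dominance property from Theorem 1 to each marginal distance correlation, i.e.,

$$n \cdot \mathrm{dCor}_n(X^s, Y^t) \preceq_{\alpha} \chi^2_1 - 1$$

for sufficiently large n and sufficiently small $\alpha$.

By order statistics of independent random variables, we can establish the distribution of the maximum distance correlation as follows:

$$F_{n \cdot \mathrm{dCor}^{\max}_n(X, Y)}(x) = \prod_{s=1}^{p} \prod_{t=1}^{q} F_{n \cdot \mathrm{dCor}_n(X^s, Y^t)}(x) \geq [F_{\chi^2_1 - 1}(x)]^{pq} = F_U(x)$$

for all $x$ in the upper tail where the marginal dominance holds. Consequently, we have

$$n \cdot \mathrm{dCor}^{\max}_n(X, Y) \preceq_{\alpha} U.$$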

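Theorem 5. Under the same condition in Theorem 4, and further assuming that both n and pq increase to infinity, it holds that

$$n \sqrt{pq} \cdot \mathrm{dCor}^{\mathrm{avg}}_n(X, Y) \xrightarrow{D} N(0, 2).$$

Proof. From Theorem 1, the limiting null distribution of each marginal distance correlation satisfies

$$n \cdot \mathrm{dCor}_n(X^s, Y^t) \xrightarrow{D} \sum_{i,j=1}^{\infty} w_{ij} (N_{ij}^2 - 1),$$

where the weights satisfy $w_{ij} \in [0, 1]$ and $\sum_{i,j=1}^{\infty} w_{ij}^2 = 1$, and $N_{ij}$ are independently and identically distributed as the standard normal distribution. Therefore, as $n \to \infty$, each scaled marginal distance correlation $n \cdot \mathrm{dCor}_n(X^s, Y^t)$ has an expected value of 0 and a variance of $2 \sum_{i,j} w_{ij}^2 = 2$.

By central limit theorem,

$$n \sqrt{pq} \cdot \mathrm{dCor}^{\mathrm{avg}}_n(X, Y) = \frac{1}{\sqrt{pq}} \sum_{s=1}^{p} \sum_{t=1}^{q} n \cdot \mathrm{dCor}_n(X^s, Y^t) \xrightarrow{D} N(0, 2).$$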

Theorem 6. Under the same condition in Theorem 4, the chi-square test for the maximum correlation is a valid test of independence for sufficiently large n and sufficiently small type 1 error level $\alpha$. Moreover, the chi-square test for the average correlation is a valid test of independence for sufficiently large n and pq, at any type 1 error level $\alpha$.

Proof. The validity of these tests can be directly deduced from Theorem 4 and Theorem 5. For the maximum statistic, owing to upper-tail dominance, the p-value obtained through the chi-square-based test is always greater than or equal to the p-value derived from the true null. Consequently, the test is valid and tends to be more conservative than the permutation test. In the case of the average statistic, the p-value produced by the chi-square-based test converges to the p-value generated by the true null. Hence, the test is valid and matches the p-value of the permutation test for sufficiently large n, p, and q.