b

DiscoverSearch
About
My stuff
On the Consistency of Optimal Bayesian Feature Selection in the Presence of Correlations
2020·arXiv
Abstract
Abstract

Optimal Bayesian feature selection (OBFS) is a multivariate super- vised screening method designed from the ground up for biomarker discovery. In this work, we prove that Gaussian OBFS is strongly consistent under mild conditions, and provide rates of convergence for key posteriors in the framework. These results are of enormous importance, since they identify precisely what features are selected by OBFS asymptotically, characterize the relative rates of convergence for posteriors on different types of features, provide conditions that guarantee convergence, justify the use of OBFS when its internal assumptions are invalid, and set the stage for understanding the asymptotic behavior of other algorithms based on the OBFS framework.

image

Biomarker discovery aims to identify biological markers (genes or gene products) that lead to improved diagnostic or prognostic tests, better treatment recommendations, or advances in our understanding of the disease or biological condition of interest (Ilyin et al., 2004; Rifai et al., 2006; Ramachandran et al., 2008). Reliable and reproducible biomarker discovery has proven to be difficult (Diamandis, 2010); while one reason is that preliminary small-sample datasets are inherently difficult to analyze, another factor is that most popular algorithms and methods employed in bioinformatics have inherent limitations that make them unsuitable for many discovery applications.

Consider univariate filter methods like t-tests, which are perhaps the most ubiquitous throughout the bioinformatics literature. Since they focus on only one feature at a time, filter methods cannot take advantage of potential correlations between markers. In particular, they cannot identify pairs (or sets) of markers with tell-tale behaviors only when observed in tandem. Multivariate methods, on the other hand, can account for correlations, but the vast majority are wrapper or embedded feature selection algorithms designed to aid in classification or regression model reduction. Model construction is not a reasonable goal in most small-scale exploratory studies, particularly in biology where the expected number of variables is large and the nature of their interactions can be highly complex and context dependent. Furthermore, feature selection methods designed for model reduction intrinsically penalize redundant features and reward smaller feature sets (Sima and Dougherty, 2008; Awada et al., 2012; Ang et al., 2016; Li et al., 2017). This can be counterproductive in biomarker discovery; so much so that filter methods often far outperform multivariate methods (Sima and Dougherty, 2006, 2008; Foroughi pour and Dalton, 2018b).

Optimal Bayesian feature selection (OBFS) is a supervised multivariate screening method designed to address these problems (Dalton, 2013; Foroughi pour and Dalton, 2019). The OBFS modeling framework, discussed in detail in Section 2, assumes that features can be partitioned into two types of independent blocks: blocks of correlated “good” features that have distinct joint distributions between two classes, and blocks of correlated “bad” features that have identical joint distributions in all classes (Foroughi pour and Dalton, 2018a). Given distributional assumptions on each block, and priors on the distribution parameters, OBFS can be implemented using one of many selection criteria, for example the maximum number correct (MNC) criterion, which maximizes the posterior expected number of features correctly identified as belonging to good versus bad blocks, or the constrained MNC (CMNC) criterion, which additionally constrains the number of selected features. In this way, OBFS aims to optimally identify and rank the set of all features with distinct distributions between the classes, which represents the predicted set of biomarkers, while accounting for correlations.

Optimal Bayesian feature filtering (OBF) is a special case of OBFS that assumes that features are independent (blocks of size one) and the parameters governing each feature are independent (Foroughi pour and Dalton, 2015). Gaussian OBF (which assumes independent Gaussian features with conjugate priors, henceforth referred to as simply OBF), has very low computation cost, robust performance when its modeling assumptions are violated, and particularly excels at identifying strong markers with low correlations (Foroughi pour and Dalton, 2017a, 2018a). Furthermore, it has been shown that OBF is strongly consistent under very mild conditions, including cases where the features are dependent and non-Gaussian (Foroughi pour and Dalton, 2019). In particular, OBF, in the long run, selects precisely the set of features with distinct means or variances between the classes. The consistency theorem in (Foroughi pour and Dalton, 2019) formalizes our intuition that OBF cannot take advantage of correlations; although OBF identifies features that are individually discriminating, we see that it has no capacity to identify features that only discriminate when grouped together, or features that are merely correlated with (or linked through a chain of correlation with) other discriminating features.

In bioinformatics, features that only discriminate when grouped together and features that are correlated with other strong markers are of tremendous interest, since they may represent important actors in the biological condition under study. While multivariate methods have the potential to discover such features, as discussed above, methods focused on model reduction tend to do so unreliably and inconsistently. Rather than involve classification or regression models, OBFS searches for intrinsic differences between features. Like most multivariate methods, one difficulty with OBFS is that it is computationally expensive (except in certain special cases, like OBF). If an exact solution is desired, even Gaussian OBFS (which assumes independent blocks of Gaussian features with conjugate priors, henceforth referred to as simply OBFS) is currently only tractable in small-scale problems with up to approximately ten features. This has lead to the development of a number of promising computationally efficient heuristic algorithms based on OBFS theory, for example 2MNC-Robust (Foroughi pour and Dalton, 2014), POFAC, REMAIN, and SPM (Foroughi pour and Dalton, 2018a).

Although OBFS and heuristic algorithms based on OBFS have demonstrated better performance than other multivariate methods in biomarker discovery, it is currently unknown precisely what features they select in the long run. Our main contribution is a theorem presented in Section 3, analogous to the consistency theorem presented in (Foroughi pour and Dalton, 2019) for OBF, that shows that OBFS is strongly consistent under mild conditions. The consistency proof for OBFS utilizes nine lemmas and is significantly more complex than that of OBF. This theorem identifies precisely what features are selected by OBFS asymptotically, provides conditions that guarantee convergence, justifies the use of OBFS in non-design settings where its internal assumptions are invalid, and characterizes rates of convergence for key posteriors in the framework, including marginal posteriors on different types of features.

The asymptotic behavior of optimal feature selection provides a frame of reference to understand the performance of heuristic algorithms based on OBFS; for instance, we may compare the sets of features selected asymptotically, the conditions required to guarantee convergence, and rates of convergence. Furthermore, while numerous works emphasize the need to account for gene interactions and to detect families of interacting biomarkers [e.g., see Han et al. (2017), Xi et al. (2018) and Fang et al. (2019)], typically the focus is on simple statistics that measure pairwise interactions in the data, rather than on establishing intrinsic characteristics of complete marker families, or on quantifying the performance and properties of selection algorithms designed to identify marker families. Thus, this work is also important because it proposes a formal definition for marker families (correlated blocks of features), and shows that these marker families are identifiable (via OBFS).

We begin with an overview of the Gaussian modeling framework presented in Foroughi pour and Dalton (2018a), and a derivation of the corresponding optimal selection rule, OBFS. Let F be the set of feature indices, each corresponding to a real-valued feature. We call feature indices we wish to select “true good features,” denoted by ¯G, and we call feature indices we wish to not select “true bad features,” denoted by ¯B = F\ ¯G. In this model, good features assign (w.p. 1 over the prior) distinct distributions between two classes, labeled y = 0 and y = 1, while bad features assign the same distribution across the whole sample. We further assume that ¯G and ¯B can each be partitioned into sub-blocks of interacting features. We call the set of all sub-blocks a “true feature partition,” denoted by ¯P. If ¯G and ¯B are non-empty, we write ¯P = ({ ¯G1, . . . , ¯G¯u}, { ¯B1, . . . , ¯B¯v}), where ¯u and ¯v are positive integers, and the set of all ¯Gi’s and ¯Bj’s are disjoint such that ¯G = ∪¯ui=1¯Gi, ¯B = ∪¯vj=1¯Bj, and all ¯Gi’s and ¯Bj’s are non-empty. If ¯G is empty, then we still denote ¯P this way, but also define  ∪Gi = ∅, and define sums and products from 1 to ¯u to be 0 and 1, respectively. We define similar conventions when ¯B is empty.

In the Bayesian model, ¯G, ¯B and ¯P are random sets, and we denote instantiations of these random sets by G, B, and P = ({G1, . . . , Gu}, {B1, . . . , Bv}), respectively. We define a prior on the true feature partition,  π(P) ≡P( ¯P = P), which induces a prior on the true good features,  π(G) ≡P( ¯G = G) = �P :∪Gi=G π(P).

Given a true feature partition ¯P = P, define  θGiyfor each class y = 0, 1 and each good block index i = 1, . . . , u, and  θBjfor each bad block index j = 1, . . . v. Let  θP

denote the collection of all  θGiy’s and  θBj’s. Assume the  θGiy’s and  θBj’s are mutually independent, i.e., we have a prior of the form

image

For good block  A = Gi, assume  θAy= [µAy , ΣAy] for each class y = 0, 1, where  µAyis a length |A| column vector, ΣAyis an  |A| × |A|matrix, and |A| denotes the cardinality of set A. We assume  π(θAy) =  π(µAy |ΣAy)π(ΣAy) is normal-inverse-Wishart:

π(ΣAy) =  KAy |ΣAy |−0.5(κAy +|A|+1)exp�−0.5 tr(SAyAy)−1)�,(2.2) π(µAy |ΣAy) =  LAy |ΣAy |−0.5exp�−0.5νAy(µAy − mAy)TAy)−1(µAy − mAy)�,(2.3)

where |X| is the determinant, tr(X) is the trace, and  XTis the transpose of matrix X. For a proper prior,  κAy > |A| −1,  SAyis a symmetric positive-definite  |A| × |A|matrix, νAy >0,  mAyis an  |A| ×1 vector,

image

and Γddenotes the multivariate gamma function, where d is a positive integer. Likewise, for bad block  A = Bjassume  θA= [µA, ΣA], and that  π(θA) is normal-inverse-Wishart with hyperparameters  κA, SA, νA, and  mA, and scaling constants  KAand  LA.

Let  Snbe a sample composed of n labeled points with  nypoints in class y, where labels are determined by a process independent from ¯P and  θ ¯P(for instance, using random sampling or separate sampling). Given ¯P = P, θP, and the labels, assume all sample points are mutually independent, and features in different blocks are also independent. Assume that features in block  Gi, i= 1, . . . , u, are jointly Gaussian with mean  µGiyand covariance matrix ΣGiyunder class y, and that features in block  Bj, j = 1, . . . , v, are jointly Gaussian with mean  µBjand covariance ΣBjacross the whole sample.

The posterior on ¯P over the set of all valid feature partitions is given by the normalized product of the prior and likelihood function. It can be shown that  π∗(P) ≡P( ¯P = P|Sn) ∝ π(P)q(P)a(P), where  π∗(P) is normalized to have unit sum, and

image

image

For each good block  A = Gi, i= 1, . . . , u, and class y = 0, 1,

image

and ˆµAyand ˆΣAyare the sample mean and unbiased sample covariance of features in A under class y. Similarly, for each bad block  A = Bj, we find  νA∗, κA∗, QA, SA∗and CAusing (2.8) through (2.12) with all subscript y’s removed, where ˆµAand ˆΣAare the sample mean and unbiased sample covariance of features in A across the whole sample.

The posterior on ¯P induces a posterior on ¯G over all subsets  G ⊆ F,

image

as well as posterior probabilities that each individual feature  f ∈ Fis in ¯G,

image

Note  π∗(f) = P(f ∈¯G|Sn) and  π∗({f}) = P( ¯G = {f}|Sn) are distinct.

The objective of OBFS is to identify the set of true good features, ¯G. We will consider two objective criteria: MNC and CMNC. MNC maximizes the expected number of correctly identified features; that is, MNC outputs the set  G ⊆ Fwith complement B = F\G that maximizes E[|G ∩¯G| + |B ∩¯B|], which is given by  GMNC=  {f ∈ F: π∗(f) > 0.5}(Foroughi pour and Dalton, 2014). CMNC maximizes the expected number of correctly identified features under the constraint of selecting a specified number of features, D, and  GCMNCis found by picking the D features with largest  π∗(f) (Foroughi pour and Dalton, 2017b). Both MNC and CMNC require computing  π∗(f) for all f ∈ F, which generally requires computing  π(P) for all valid feature partitions P, and is generally intractable unless |F| is small.

Under proper priors, Gaussian OBFS takes the following modeling parameters as input: (1)  π(P) for each valid feature partition, P, (2)  νAy >0,  mAy,  κAy > |A| −1, and symmetric positive-definite  SAyfor all y and all possible good blocks A in valid P, and (3)  νA >0,  mA, κA > |A| −1, and symmetric positive-definite  SAfor all possible bad blocks A in valid P. If CMNC is used, it also takes D as input. When  π(θP |¯P = P) is improper, the above derivations are invalid, but  π∗(P) ∝ π(P)q(P)a(P) can still be taken as a definition and computed as long as: (1)  π(P) is proper, (2)  νA∗y >0, κA∗y > |A| −1, and  SA∗yis symmetric positive-definite, (3)  νA∗ >0,  κA∗ > |A| −1, and  SA∗is symmetric positive-definite, and (4)  KAyand  LAyare no longer given by (2.4) and (2.5), and similarly  KAand  LAare no longer given by analogous equations; instead these are positive input parameters specified by the user.

Gaussian OBF is a special case where  π(P) assumes that all blocks ¯G1, . . . ,¯G¯uand ¯B1, . . . , ¯B¯vare of size one, and the events  {f ∈ ¯G}are mutually independent. For each  f ∈ F, OBF takes as input (1) marginal priors  π(f) ≡P(f ∈¯G), (2) scalars ν{f}y , m{f}y , κ{f}yand  S{f}yfor all y, (3) scalars  ν{f}, m{f}, κ{f}and  S{f}, and (4) Lf ≡ K{f}0 L{f}0 K{f}1 L{f}1 /(K{f}L{f}) if improper priors are used. OBF also takes D as input if CMNC is used. Under OBF, it can be shown that  π∗(f), defined in (2.14), simplifies to  π∗(f) = h(f)/(1 + h(f)), where

image

κ{f}∗y , Q{f}yand  S{f}∗yare defined in (2.9), (2.10) and (2.11), respectively, and  κ{f}∗, Q{f}and  S{f}∗are defined similarly. Rather than evaluate  π∗(P) for all feature partitions, OBF under both MNC and CMNC reduces to simple filtering, where features are scored by the h(f) given in (2.15).

Let  F∞be an arbitrary sampling distribution on an infinite sample,  S∞. Each sample point in  S∞consists of a binary label, y = 0, 1, and a set of features corresponding to a set of feature indices, F. For each n = 1, 2, . . ., let  Sndenote a sample consisting of the first n points in  S∞, let  nydenote the number of points in class y, and define ρn = n0/n.

The goal of feature selection is to identify a specific subset of features (say those with different distributions between two classes), which we denote by ¯G. Proving strong consistency for a feature selection algorithm thus amounts to showing that

image

with probability 1 (w.p. 1) over the infinite sampling distribution, where n is the sample size,  Snis a sample of size n, and ˆG(Sn) is the output of the selection algorithm. Here, ˆG(Sn) and ¯G are sets, and we define limn→∞ ˆG(Sn) = ¯G to mean that ˆG(Sn) = ¯G for all but a finite number of n.

OBFS and OBF fix the sample and model the set of features we wish to select and the sampling distribution as random. To study consistency, we reverse this: now the sampling distribution is fixed and the sample is random. We begin by reviewing a few definitions and a special case of the strong consistency theorem for OBF.

Definition 1 (Foroughi pour and Dalton, 2019). ¯G is an independent unambiguous set of good features if the following hold, where  µfyand  σfyare the mean and variance of feature f under class y, respectively:

1. For each  g ∈¯G, µgyand  σgyexist for all y such that either  µg0 ̸= µg1or  σg0 ̸= σg1. 2. For each  b /∈¯G, µbyand  σbyexist for all y such that  µb0=  µb1and  σb0=  σb1.

Definition 2 (Foroughi pour and Dalton, 2019). An infinite sample,  S∞, is a balanced sample if lim infn→∞ ρn >0, lim supn→∞ ρn <1, and, conditioned on the labels, sample points are independent with points belonging to the same class identically distributed.

Theorem 1 (Foroughi pour and Dalton, 2019). Let  S∞be a fixed infinite sample and let G be a fixed subset of F. If limn→∞ π∗(G) = 1, then limn→∞ˆGMNC(Sn) = G and limn→∞ˆGCMNC(Sn) = G if D = |G|, where ˆGMNC(Sn) and ˆGCMNC(Sn) are the MNC and CMNC feature sets under  π∗(G), respectively.

image

By Theorems 1 and 2, Gaussian OBF under MNC and Gaussian OBF under CMNC with “correct” D (the user knows in advance how many features to select) are strongly consistent if the conditions of Theorem 2 hold. Condition (i) uses Definition 1, and characterizes the features that OBF aims to select. Condition (iii) uses Definition 2, and characterizes requirements of the sample. Conditions (i)–(iii) impose very mild assumptions on the data generating process, which is not required to satisfy the Gaussian OBF modeling assumptions in Condition (iv).

Let us now turn to OBFS. Denote the true mean and covariance of features in feature set A and class y by  µAyand ΣAy, respectively (the features need not be Gaussian). OBFS aims to select the smallest set of features, ¯G, with different means or covariances between the classes, where ¯B = F\ ¯G has the same means and covariances between the classes and is uncorrelated with ¯G. This is formalized in the following definition.

Definition 3. ¯G is an unambiguous set of good features if the following hold:

image

4. Each feature in ¯G is uncorrelated with each feature in ¯B in both classes. 5. Conditions 1-4 do not all hold for any strict subset  G ⊂¯G.

Assuming all first and second order moments exist, an unambiguous set of good features always exists and is unique. To prove uniqueness, let ¯G1and ¯G2be arbitrary unambiguous sets of good features, and let ¯G3= ¯G1 ∩¯G2. By condition 4, ΣFy, has a block diagonal structure for each y = 0, 1 corresponding to the sets of features ¯G3, ¯G1\ ¯G3, ¯G2\ ¯G3, and F\( ¯G1 ∪ ¯G2). Thus, condition 4 holds for ¯G3. By condition 3, µA0=  µA1and ΣA0= ΣA1for all of these blocks except ¯G3. Thus, condition 3 holds for ¯G3. If  µ¯G10 ̸= µ¯G11, then (since the means for each class are equal for ¯G1\ ¯G3) µ¯G30 ̸= µ¯G31. Alternatively, if Σ¯G10 ̸= Σ¯G11, then (since ¯G3and ¯G1\ ¯G3are uncorrelated for each class, and the covariances between each class are equal for ¯G1\¯G3) Σ¯G30 ̸= Σ¯G31. In either case, condition 2 holds for ¯G3. Since ¯G3 ⊆¯G1and ¯G3 ⊆¯G2, by condition 5 we must have ¯G1= ¯G2= ¯G3. We denote the unique unambiguous set of good features by ¯G, and its complement by ¯B, throughout.

Similar to the Bayesian model, define a feature partition to be an ordered set of the form P = ({G1, . . . , Gu}, {B1, . . . , Bv}), where the set of  Gi’s and  Bj’s partition F. We call each  Gia “good block” and each  Bja “bad block”, keeping in mind that these terms are always relative to a specified arbitrary partition. Feature partitions with no good blocks or no bad blocks are permitted, with appropriate conventions for unions, sums, and products over the (empty set of) corresponding blocks. The following definitions formalize a non-Bayesian analog of the true feature partition.

Definition 4. Let  P1and  P2be arbitrary feature partitions.  P1is a mesh of  P2if every block in  P1is a subset of a block in  P2. P1is a refinement of  P2if every good block in P1is a subset of a good block in  P2and every bad block in  P1is a subset of a bad block in  P2. P1is a strict refinement of  P2if it is a refinement and  P1 ̸= P2.

Definition 5. ¯P = ({ ¯G1, . . . ,¯G¯u}, {¯B1, . . . ,¯B¯v}) is anunambiguous feature partition

image

1. ¯G = ∪¯ui=1¯Giis an unambiguous set of good features. 2. Each feature in any arbitrary block is uncorrelated with each feature in any other block in both classes. 3. Conditions 1 and 2 do not hold for any feature partition P that is a strict refine-ment of ¯P.

An unambiguous feature partition always exists and is unique, and we denote it by ¯P throughout. By definition, the unambiguous feature partition ¯P induces the unambiguous set of good features ¯G.

Our main result is given in the following theorem (Theorem 3), which provides sufficient conditions for the (almost sure) convergence of  π∗(P) to a point mass at ¯P, thereby guaranteeing the (almost sure) convergence of  π∗(G) to a point mass at ¯G. By Theorem 1, Gaussian OBFS under MNC and Gaussian OBFS under CMNC with “correct” D are strongly consistent if the conditions of Theorem 3 hold, which are very mild. Condition (i) is based on Definition 3 and essentially guarantees that ¯G really represents the type of features OBFS aims to select (those with different means or covariances between the classes). Conditions (i) and (ii) ensure certain moments exist, and Condition (iii) ensures that all covariances are full rank. There are no other distributional requirements. Condition (iv) is based on Definition 2 and characterizes the sampling strategy; the proportion of points observed in any class must not converge to zero, and sample points should be independent. Condition (v) places constraints on the inputs to the OBFS algorithm; we must assign a non-zero probability to the true feature partition.

Finally, from the proof of Theorem 3 we have that, under the conditions stated in the theorem, w.p. 1 there exist 0 < r < 1 and N > 0 such that

image

for all n > N and all  P ̸= ¯P that label any good features as bad or that put features that are correlated in separate blocks. Also, w.p. 1 there exist s, N > 0 such that

image

for all n > N and all  P ̸= ¯P. Fix  f ∈¯G. By (3.2), for each feature partition P such that  f /∈ ∪Gi, w.p. 1 there exist 0  < rP <1 and  NP >0 such that  π∗(P)/π∗( ¯P) < rnPfor all  n > NP. Therefore, w.p. 1,

image

for all n > N, where maxP :f /∈∪Gi rP < r <1 and N > maxP :f /∈∪Gi NP. Thus, w.p. 1,

image

for all n > N. The marginal posterior of good features converges to 1 at least exponentially w.p. 1. Now fix  f ∈¯B. By (3.3), w.p. 1 there exist s, N > 0 such that

image

for all n > N. In other words, the marginal posterior of bad features converges to 0 at least polynomially w.p. 1. This characterizes rates of convergence of the posterior on feature partitions, and the marginal posterior probabilities on individual features.

Throughout, “f and g are asymptotically equivalent” and “f ∼ gas  n → ∞” mean that limn→∞ f(n)/g(n) = 1, v(i) denotes the ith element of vector v, M(i, j) denotes the ith row, jth column element of matrix M, 0a,bdenotes the all-zero  a × bmatrix, and the sample mean and unbiased sample covariance of features in block A and class y are denoted by ˆµAyand ˆΣAy, respectively.

image

(i) ¯P is the unambiguous feature partition. (ii) For all  f ∈ F, the fourth order moment across both classes exists and is finite. (iii) ΣF0and ΣF1are full rank. (iv)  S∞is a balanced sample w.p. 1. (v)  π∗(P) assumes a fixed Gaussian OBFS model for all n with  π( ¯P) > 0.

image

Fix  P ̸= ¯P. If  π(P) = 0 then (3.8) holds trivially. In the remainder of this proof, assume π(P) ̸= 0. It suffices to show:

image

We prove this by constructing a sequence of partitions from P to ¯P in six steps. In Step 1, blocks that are labeled bad in P are split into subsets of bad blocks in ¯P and subsets of the unambiguous set of good features ¯G. In Steps 2 and 3, blocks that are labeled bad in the previous partition but that are actually subsets of ¯G are labeled good (and in Step 3 they are also merged with other good blocks). In Step 4, blocks that are labeled good in the previous partition are split into subsets of (good and bad) blocks in ¯P. In Step 5, blocks that are labeled good in the previous partition but that are actually subsets of bad blocks in ¯P are labeled bad. In Step 6, we merge blocks in the previous partition until we arrive at ¯P. An example of this sequence of partitions is illustrated in Fig. 1. Finally, Step 7 uses the sequence of partitions constructed in Steps 1-6 to show (3.9).

Step 1. Let  P 11 , . . . , P K11 , K1 ≥2, be a sequence of partitions where (1) we let  P 11= P, (2) for each k = 1, . . . , K1 −1, we let  P k+11be identical to  P k1except with one bad block partitioned into two bad blocks such that one of the smaller blocks is the intersection of the original block with a bad block in ¯P, and (3) we iterate until no more blocks can be split. The order we split blocks does not matter; at the end we always obtain the partition  P1 = P K11that is identical to P except with each bad block split into smaller blocks that are each either a subset of a bad block in ¯P or a subset of the unambiguous set of good features ¯G. Suppose  P ̸= P1. Note  P1is a strict refinement of P. By Lemma 1, w.p. 1 there exist  s1, N1 >0 (which may depend on the sample) such that  q(P)/q(P1) < n−s1for all  n > N1. Furthermore,

image

image

Figure 1: An example of the sequence of partitions constructed in the consistency proof for the OBFS algorithm, starting from the partition P (top) and ending at ¯P (bottom). Green represents good blocks in the current partition (thus ¯G is the set of green blocks at the bottom), and red represents bad blocks in the current partition (thus ¯B is the set of red blocks at the bottom).

For each k = 1, . . . , K1 −1, by Lemma 3, w.p. 1 there exist  ck1, N k1 >0 (which may depend on the sample) such that  a(P k1)/a(P k+11 ) <(log  n)ck1for all  n > N k1. By (3.10), w.p. 1 there exist  s1, c1, N >0 (namely  s1from above,  c1 = c11+  . . . + cK1−11, and N = max(N1, N 11 , . . . , N K1−11)) such that (q(P)/q(P1)) × (a(P)/a(P1)) < n−s1(log  n)c1for all n > N. Finally, since  n−s1for  s1 >0 dominates (log  n)c1, (q(P)/q(P1)) ×(a(P)/a(P1)) →0 w.p. 1.

Step 2. Construct a sequence of partitions starting from  P1where, in each step, one bad block in the current partition that is (a) contained in ¯G, and (b) has either different means or different covariances between the classes, is labeled good in the next partition, and iterate until no more blocks can be re-labeled. The order we re-label blocks does not matter; we always obtain the same final partition, which we call  P2. Suppose  P1 ̸= P2. By Lemma 1, w.p. 1 there exists  s2such that  q(P1)/q(P2) < n−s2for all n large enough. By Lemma 4, w.p. 1 there exists 0  < r2 <1 such that  a(P1)/a(P2) < rn2for all n large enough. Since  rn2for 0  < r2 <1 dominates  n−s2, q(P1)/q(P2)×a(P1)/a(P2) →0 w.p. 1.

Step 3. Some bad blocks in  P2may have equal means and covariances between the classes, but may still be contained in ¯G because they are correlated with, or connected through a chain of correlation with, features in ¯G that have either different means or different covariances between the classes. Here we construct a sequence of partitions to correct the label of these features. The construction goes as follows: starting from P2, in each step merge one bad block in the current partition that is contained in ¯G (which must have the same means and covariances between classes after Step 2) with one good block in the current partition that is correlated in at least one class (because it is correlated it must also be in ¯G), label the resulting block as good in the next partition, and iterate until no more blocks can be merged.

First only blocks directly correlated with good blocks in the original partition can be moved, then only blocks directly correlated with good blocks in the original partition or the block just moved can be moved, etc. All bad blocks in  P2that are actually in ¯G will eventually be moved because: (1) all features that have either different means or different variances have already been correctly labeled good in Step 2, and (2) the definition of ¯P guarantees that every feature in ¯G is connected to at least one feature in ¯G with either different means or different variances via at least one chain of pairwisecorrelated features that are all in the same good block in ¯P. Thus, each bad block in the final partition, call it  P3, is guaranteed to be a subset of a bad block in ¯P.

Suppose  P2 ̸= P3. By Lemma 1, w.p. 1 there exists  s3such that  q(P2)/q(P3) <n−s3for all n large enough. By Lemma 5, w.p. 1 there exists 0  < r3 <1 such that a(P2)/a(P3) < rn3for all n large enough. Since  rn3for 0  < r3 <1 dominates  n−s3, q(P2)/q(P3) × a(P2)/a(P3) →0 w.p. 1.

Step 4. Construct a sequence of partitions starting from  P3where, in each step, one good block in the current partition is split into two good blocks such that one of the smaller blocks is the intersection of the original block with a (good or bad) block in ¯P, and iterate until no more blocks can be split. The order we split blocks does not matter; in the final partition, call it  P4, each bad block is a subset of a bad block in ¯P and each good block is a subset of a good or bad block in ¯P. Suppose  P3 ̸= P4. Note that  P4is a strict refinement of  P3. By Lemma 1, w.p. 1 there exists  s4 >0 such that q(P3)/q(P4) < n−s4for all n large enough. By Lemma 6, w.p. 1 there exists  c4 >0 such that  a(P3)/a(P4) <(log  n)c4for all n large enough. Finally, since  n−s4for  s4 >0 dominates (log  n)c4, q(P3)/q(P4) × a(P3)/a(P4) →0 w.p. 1.

Step 5. Construct a sequence of partitions starting from  P4where, in each step, one good block in the current partition that is contained in a bad block in ¯P is labeled bad in the next partition, and iterate until no more blocks can be re-labeled. The order we re-label blocks does not matter; in the final partition, call it  P5, each block is a subset of a block in ¯P of the same type. Suppose  P4 ̸= P5. Note that  P5is a mesh of  P4, every good block in  P5is a good block in  P4, and every bad block in  P5is a block (good or bad) in  P4. By Lemma 1, w.p. 1 there exists  s5 >0 such that q(P4)/q(P5) < n−s5for all n large enough. By Lemma 7, w.p. 1 there exists  c5 >0 such that  a(P4)/a(P5) <(log  n)c5for all n large enough. Since  n−s5for  s5 >0 dominates (log  n)c5, q(P4)/q(P5) × a(P4)/a(P5) →0 w.p. 1.

Step 6. Construct a sequence of partitions from  P5to ¯P where each partition is a strict refinement of ¯P and, in each step, two blocks of the same type (good or bad) in the current partition are merged into a larger block of the same type in the next partition that is contained in a block of the same type in ¯P. The order we merge blocks does not matter, as long as the pair of blocks merged in each step are correlated in at least one class. Suppose  P5 ̸= ¯P. By Lemma 1, w.p. 1 there exists  s6such that  q(P5)/q( ¯P) < n−s6for all n large enough. By Lemmas 8 and 9, w.p. 1 there exists 0  < r6 <1 such that a(P5)/a( ¯P) < rn6for all n large enough. Since  rn6for 0  < r6 <1 dominates  n−s6, q(P5)/q( ¯P) × a(P5)/a( ¯P) →0 w.p. 1.

Step 7. We have that

q( ¯P)a( ¯P) =  q(P)a(P)q(P1)a(P1) · q(P1)a(P1)q(P2)a(P2) · q(P2)a(P2)q(P3)a(P3) · q(P3)a(P3)q(P4)a(P4) · q(P4)a(P4)q(P5)a(P5) · q(P5)a(P5)q( ¯P)a( ¯P) .

(3.11) Since  P ̸= ¯P, at least one of Steps 1-6 applies. For the steps that apply, the corresponding ratio in (3.11) converges to 0 w.p. 1. For the steps that do not apply, the corresponding ratio equals 1. Thus, (3.9) holds.

Lemma 1. Let P = ({G1, . . . , Gu}, {B1, . . . , Bv}) and P ′= ({G′1, . . . , G′u′}, {B′1, . . . , B′v′}) be arbitrary feature partitions, and let S∞be a fixed sample such that lim infn→∞ ρn >0 and lim supn→∞ ρn <1. Then there exists s ∈ Rand N > 0 such that  q(P)/q(P ′) < n−sfor all n > N. Further, this holds for s > 0 if: (i)  P ′is a strict refinement of P, or (ii)  P ′is a mesh of P where  P ̸= P ′, every good block in  P ′is a subset of a good block in P, and every bad block in  P ′is a subset of a good or bad block in P.

Proof. For any feature partition P, we may write

image

where  cP1is a positive constant that does not depend on n. We first focus on each block individually.  νGi∗yand  κGi∗yare asymptotically equivalent to  ny, and  νBj∗and  κBj∗are asymptotically equivalent to n. Further, for all positive integers d,

image

and by Stirling’s formula,

image

as  x → ∞. Thus, for all  A ⊆ F,

image

image

as  n → ∞, where  cA4 >0 does not depend on n. Applying (3.18) and (3.19) in (3.12), we have that  q(P) ∼ cP5(n0n1)−rP1 n−rP2(2e)−0.5|F |nas  n → ∞, where  cP5 >0 does not depend on  n, rP1 ≡ 0.75|G|+ 0.25 �ui=1 |Gi|2, rP2 ≡ 0.75|B|+ 0.25 �vj=1 |Bj|2, and we treat  n0and  n1as functions of n. Applying this in the ratio, we have:

image

as  n → ∞, where  c ≡ cP3 /cP ′3,  r1 ≡ rP ′1 − rP1and  r2 ≡ 2rP ′1 − 2rP1+  rP ′2 − rP2. We always have lim supn→∞ ρn(1 − ρn) ≤ 0.25, and by our assumed bounds on the limit inferior and limit superior of  ρnwe also have lim infn→∞ ρn(1 − ρn) >0. Thus, for all  s < r2,

image

Further,

r2= 14

If  P ′is a strict refinement of P, then  r2 >0. Also, if  P ′is a mesh of P where  P ̸= P ′, then every good block in  P ′is a subset of a good block in P and every bad block in  P ′is a subset of a good or bad block in P, thus  r2 >0. If  r2 >0, then (3.21) holds for all 0  < s < r2.

Lemma 2. Let  A ⊆ Fbe any non-empty feature set such that ΣA0and ΣA1are full rank. Suppose for all  f ∈ Fthe fourth order moment across both classes exists and is finite. Let  S∞be a balanced sample. Then, w.p. 1 there exist K, N > 0 (which may depend on the sample) such that

image

A. Foroughi pour and L. A. Dalton 15 �����|CAy ||ˆΣAy |− 1����� < Kn ,(3.24)

image

Proof. Fix A. Since  S∞is a balanced sample,  ny → ∞as  n → ∞, and by the strong law of large numbers ˆµAyconverges to  µAyand ˆΣAyconverges to ΣAyw.p. 1 for both y = 0, 1. In the rest of this proof, we only consider the event where ˆµAyand ˆΣAyconverge.

Since  S∞is a balanced sample, there exist 0  < p0, p1 <1 and  N1 >0 such that ny/n > pyfor all  n > N1and y = 0, 1. By the law of the iterated logarithm (Hartman and Wintner, 1941), w.p. 1 there exist  K1 >0 and  N2 > N1such that

image

for all  n > N2and i = 1, . . . , |A| (Foroughi pour and Dalton, 2019). This proves (3.23).

We can decompose the sample covariance for class y as follows:

image

image

for all  n > N3and i, j = 1, . . . , |A|. (xik − µAy(i))(xjk − µAy(j)) is a random variable with mean ΣAy(i, j) and finite variance  Vy(i, j). Suppose  Vy(i, j) >0. By the law of the iterated logarithm, w.p. 1

image

for all i, j = 1, . . . , |A|. Thus, w.p. 1 there exists  K3 >0 and  N4 > N3such that ����� 1ny −1

image

log log n n (3.36)

for all  n > N4and i, j = 1, . . . , |A|. A similar inequality holds when  Vy(i, j) = 0. Combining (3.33), (3.34) and (3.36), w.p. 1 there exists  K4 >0 such that

image

for all  n > N4and i, j = 1, . . . , |A|. Thus (3.25) holds. Since |ˆΣAy |is a polynomial function of the ˆΣAy(i, j), where each term has degree |A| and coefficient  ±1, w.p. 1 there exists  K5 >0 such that

image

for all  n > N4. Dividing both sides by  |ΣAy |proves (3.26).

Note that,

image

image

Further (3.23) implies that w.p. 1 ˆµAy(i) − mAy(i) is bounded for all  n > N2. Similarly, (3.25) implies that w.p. 1 ˆΣAy(i, j) is bounded for all  n > N3. Thus, from (3.39), w.p. 1 there exists  K6 >0 and  N5 > N4such that

image

for all  n > N5and i, j = 1, . . . |A|. Again noting that the determinant is a polynomial, w.p. 1 there exists  K7 >0 such that

image

for all  n > N5. By (3.38), |ˆΣAy |is bounded for all  n > N4w.p. 1. Thus, (3.24) holds. Applying the triangle inequality on  CAy(i, j) − ΣAy(i, j) and applying (3.25) and (3.40), we have that (3.27) also holds.

The sample covariance across all samples in both classes can be decomposed as

image

Define

image

Again since w.p. 1 ˆµAy(i) and ˆΣAy(i, j) are bounded for all  n > N3, by the triangle inequality w.p. 1 there exists  K8 >0 such that

image

for all  n > N3and i, j = 1, . . . , |A|. Further, by the triangle inequality

image

for all  n > N3and i, j = 1, . . . , |A| w.p. 1, where

image

image

d0= ˆµA0 − µA0,  d01 = µA0 − µA1and  d1 = µA1 − ˆµA1.  d01(i) is constant for all i = 1, . . . , |A|. By (3.23), w.p. 1 there exists  K9, N6 > N5such that  |dy(i)| < K9�log log n/n for all  n > N6, y= 0, 1 and i = 1, . . . , |A|. Thus, w.p. 1 there exists  K10 >0 such that d(i, j) < K10�log log n/n for all  n > N6and i, j = 1, . . . , |A|. Applying this fact and (3.25) to (3.44), w.p. 1 there exists  K11 >0 and  N7 > N6such that

image

for all  n > N7and i, j = 1, . . . , |A|. Combining this fact with (3.45), w.p. 1 there exists K12 >0 such that

image

for all  n > N7and i, j = 1, . . . , |A|. Again since the determinant is a polynomial, w.p. 1 there exists  K13such that

image

for all  n > N7. Observe from (3.31) that ΣAn(i, j) is lower bounded by

image

and upper bounded by

image

Since  |ΣAn |is a polynomial function of the ΣAn(i, j), where each term has degree |A| and coefficient  ±1, |ΣAn |must also be upper and lower bounded when n is large enough. (3.29) follows by dividing both sides of (3.49) by  |ΣAn |, and applying a bound on  |ΣAn |on the right hand side.

Observe that

image

Since ˆµA=  ρnˆµA0+ (1  − ρn)ˆµA1, (3.23) implies that w.p. 1 ˆµA(i) − mA(i) is bounded for all  n > N2. Similarly, (3.48) and the fact that ΣAn(i, j) is bounded for n large enough

implies that w.p. 1 there exists  N8 > N7such that ˆΣA(i, j) is bounded for all  n > N8. Thus, from (3.52), w.p. 1 there exists  K14 >0 and  N9 > N8such that

image

for all  n > N9and i, j = 1, . . . |A|. Again since the determinant is a polynomial, w.p. 1 there exists  K15 >0 such that

image

for all  n > N9. By (3.49) and the fact that  |ΣAn |is bounded for n large enough, w.p. 1 there exists  N10 > N9such that |ˆΣAy |is bounded for all  n > N10. Thus, (3.28) holds. Applying the triangle inequality on  CA(i, j) − ΣAn(i, j) and applying (3.48) and (3.53), we have that (3.30) also holds.

Corollary 1. Let  A ⊆ Fbe any non-empty feature set such that ΣA0and ΣA1are full rank. Suppose for all  f ∈ Fthe fourth order moment across both classes exists and is finite. Let  S∞be a balanced sample. Then, w.p. 1 the following all hold for all y = 0, 1 and i, j = 1, . . . , |A|, where ΣAnis given in (3.31).

image

Proof. (1) follows from (3.27). (2) follows by combining (3.24) and (3.26). (5) follows from (3.30). (6) follows by combining (3.28) and (3.29). (3) and (4) are special cases of (5) and (6), where in this case ΣAn= ΣAis constant. Recall that in Lemma 2 we showed that ΣAn(i, j) and  |ΣAn |are lower and upper bounded when n is large enough. By (3.30), CA(i, j) must also be upper and lower bounded when n is large enough, so (7) holds. Finally, since by item 6  |CA|and  |ΣAn |are asymptotically equivalent,  |CA|must also be upper and lower bounded when n is large enough, so (8) holds.

Lemma 3. Let  A1and  A2be any disjoint feature sets such that  µA10 = µA11, ΣA1 ≡ΣA10= ΣA11, features in  A1and  A2are uncorrelated in both classes, and ΣA0and ΣA1are full rank, where  A = A1 ∪ A2. Suppose for all  f ∈ Fthe fourth order moment across both classes exists and is finite. Let  S∞be a balanced sample. Then, w.p. 1 there exist c, N > 0 such that

image

for all n > N. The left hand side of (3.55) is  a(P)/a(P ′) when P and  P ′are identical feature partitions except A is a bad block in P, and  A1and  A2are bad blocks in  P ′.

image

where

image

By Corollary 1, the first term in (3.56) is upper bounded by a positive finite constant for n large enough w.p. 1. Without loss of generality, assume features in A are ordered such that

image

for some matrix V . Observe that

image

where  W ≡ V T(CA1)−1V. Since  µA10 = µA11, ΣA10= ΣA11, and features in  A1and  A2are uncorrelated in both classes,

image

where ΣAnand ΣA2nare defined in (3.31). By Lemma 2, w.p. 1 there exist  K1, N1 >0 such that for all  n > N1, i= 1, . . . , |A1|and j = 1, . . . , |A2|,

image

Since W is comprised of only quadratic terms in V and  CA1(i, j) → ΣA1(i, j) w.p. 1 (Corollary 1), w.p. 1 there exist  K2, N2 >0 such that for all  n > N2, i= 1, . . . , |A1|and j = 1, . . . , |A2|,

image

Further, since  |CA2| − |CA2 − W|is a polynomial function of the elements of W, where each term has a degree between 1 and  |A2|and a coefficient that is a polynomial function of the elements of  CA2of at most degree  |A2|−1, and since  CA2(i, j) is upper and lower bounded for n large enough w.p. 1 for all i, j = 1, . . . , |A| (Corollary 1), w.p. 1 there exists  K3, N3 >0 such that for all  n > N3,

image

Therefore,

image

for  n > N3w.p. 1, where the first line follows from (3.57) and (3.59) and the second line from (3.63). Further, by Corollary 1, w.p. 1 there exist  K4 >0 and  N4 > N3such that  |CA2| > K4for all  n > N4, thus

image

for all  n > N4. The second line holds for  N4large enough that x = (K3/K4) log log n/n is between 0 and 0.5 for all  n > N4, so that (1  − x)−1 <1 + 2x. The last line follows from the fact that (1 +  t/x)x < etfor all x, t > 0. From (3.56), the theorem holds with c = K5 + K3/K4and  N = N4, where  K5is such that (log  n)K5exceeds a bound on the first term in (3.56).

Lemma 4. Let G be any feature set such that either  µG0 ̸= µG1or ΣG0 ̸= ΣG1, and ΣG0and ΣG1are full rank. Let  S∞be a balanced sample. Then, w.p. 1 there exist 0  ≤ r <1 and N > 0 such that

image

for all n > N. The left hand side of (3.66) is  a(P)/a(P ′) when P and  P ′are identical feature partitions except G is a bad block in P and a good block in  P ′.

image

where

image

By Corollary 1, the first term in (3.67) is upper bounded by a positive finite constant for n large enough, and

image

as  n → ∞, both w.p. 1. Suppose ΣG ≡ ΣG0= ΣG1and  µG0 ̸= µG1. Then,

image

The last line follows from the matrix determinant lemma (Harville, 1998). Since  S∞is balanced and ΣGis positive-definite, there exists 0 < T < 1 such that 0  ≤ Tn < Tfor all n large enough. The theorem holds for any T < r < 1. If ΣG0 ̸= ΣG1, then

image

Fan (1950) showed that for any symmetric positive-definite X and Y and 0  ≤ p ≤1,

image

Although not mentioned by Fan, if 0 < p < 1 then equality holds if and only if X = Y (by the weighted arithmetic-geometric-mean inequality). Suppose  X ̸= Yand fix 0 < δ < 0.5. Then there exists  ǫ >0 such that  |X|p|Y |1−p/|pX+ (1  − p)Y | < 1 − ǫfor all p ∈ (δ, 1 − δ). Applying this fact here, since  S∞is balanced there exists 0 < T < 1 such that 0  ≤ Tn < T. The theorem holds for any T < r < 1.

Lemma 5. Let  G1and  G2be any disjoint feature sets such that  µG20 = µG21, ΣG2 ≡ΣG20= ΣG21, there exists at least one feature in  G1and one feature in  G2that are correlated in at least one class, and ΣG0and ΣG1are full rank, where  G = G1 ∪ G2. Let S∞be a balanced sample. Then, w.p. 1 there exist 0  ≤ r <1 and N > 0 such that

image

for all n > N. The left hand side of (3.73) is  a(P)/a(P ′) when P and  P ′are identical feature partitions except  G1is a good block in  P, G2is a bad block in P, and G is a good block in  P ′.

Proof. We have that

image

where

image

By Corollary 1, the first term in (3.74) converges to a positive finite constant w.p. 1, and

image

where  Vyis a matrix of correlations between features in  G1and  G2. Since

image

we have that

image

We always have  |ΣG2y −V TyG1y)−1Vy| ≤ |ΣG2y |, and thus 0  ≤ Ry ≤1. Since there exists at least one feature in  G1and one feature in  G2that are correlated in at least one class, in this class  Vyis non-zero,  |ΣG2y − V TyG1y)−1Vy| < |ΣG2y |, and  Ry <1. Since  S∞is balanced, there exist 0  < p0, p1 <1 such that  n0/n < p0and  n1/n < p1for n large enough. Thus,

image

for n large enough, where  Rn = Rp00,nRp11,n. Note  Rn → Rp00 Rp11 ≡ R, where 0  ≤ R <1. The theorem holds for any R < r < 1.

Lemma 6. Let  A1and  A2be any disjoint feature sets such that features in  A1and A2are uncorrelated in both classes, and ΣA0and ΣA1are full rank, where  A = A1 ∪ A2. Suppose for all  f ∈ Fthe fourth order moment across both classes exists and is finite. Let  S∞be a balanced sample. Then, w.p. 1 there exist c, N > 0 such that

image

for all n > N. The left hand side of (3.81) is  a(P)/a(P ′) when P and  P ′are identical feature partitions except A is a good block in P, and  A1and  A2are good blocks in  P ′.

Proof. We have that

image

image

where

image

By Corollary 1, the first term in (3.82) converges to a positive finite constant w.p. 1. Fix  y ∈ {0, 1}. Without loss of generality, assume features in A are ordered such that

image

for some matrix  Vy. Observe that

image

where  Wy ≡ V Ty(CA1y)−1Vy. Since features in  A1and  A2are uncorrelated in both classes,

image

By Lemma 2, w.p. 1 there exist  K1, N1 >0 such that for all  n > N1, y= 0, 1, i = 1, . . . , |A1|and j = 1, . . . , |A2|we have

image

Since  Wyis comprised of only quadratic terms in  Vyand  CA1y(i, j) → ΣA1y(i, j) w.p. 1 (Corollary 1), w.p. 1 there exist  K2, N2 >0 such that for all  n > N2, y= 0, 1, i = 1, . . . , |A1|and j = 1, . . . , |A2|

image

Further, since  |CA2y |−|CA2y −Wy|is a polynomial function of the elements of  Wy, where each term has a degree between 1 and  |A2|and a coefficient that is a polynomial function of the elements of  CA2yof at most degree  |A2|−1, and since  CA2y(i, j) → ΣA2y(i, j) w.p. 1 (Corollary 1), w.p. 1 there exists  K3, N3 >0 such that for all  n > N3and y = 0, 1,

image

Therefore,

image

image

for  n > N3w.p. 1, where the first line follows from (3.83) and (3.85) and the second line from (3.89). Further, w.p. 1 there exists  N4 > N3such that for all  n > N4and y = 0, 1,

image

The first line holds as long as  N4is large enough that  |CA2y | > |ΣA2y |/2 for all  n > N4and y = 0, 1 (this is possible since both  CA2yconverge). The second line holds as long as N4is large enough that x = (2K3/|ΣA2y |) log log n/n is between 0 and 0.5 for all  n > N4and y = 0, 1, so that (1−x)−1 <1+2x. The third line holds as long as  N4is large enough that  ny/n < pyfor all  n > N4and y = 0, 1 and some 0  < p0, p1 <1 (this is possible since S∞is balanced). Finally, the last line follows from the fact that (1 +  t/x)x < etfor all x, t > 0. From (3.82), the theorem holds with  c = K4+2p0K3/|ΣA20 |+2p1K3/|ΣA21 |and N = N4, where  K4is such that (log  n)K4exceeds a bound on the first term in (3.82).

Lemma 7. Let B be any feature set such that  µB ≡ µB0=  µB1, ΣB ≡ ΣB0= ΣB1, and ΣBis full rank. Suppose for all  f ∈ Fthe fourth order moment across both classes exists and is finite. Let  S∞be a balanced sample. Then, w.p. 1 there exist c, N > 0 such that

image

for all n > N. The left hand side of (3.92) is  a(P)/a(P ′) when P and  P ′are identical feature partitions except B is a good block in P and a bad block in  P ′.

Proof. We have that

image

where

image

image

Let  N2 > N1be such that  x = K1/nis between 0 and 0.5 for all  n > N2, so that (1  − x)−1 <1 + 2x. Then for all  n > N2,

image

w.p. 1, where the last line follows from the fact that (1 +  t/x)x < etfor all x, t > 0. From (3.42),

image

and for y = 0, 1,

image

for all  n > N3, y= 0, 1 and i, j = 1, . . . , |B|. Since  S∞is balanced, there exist 0 < p0, p1 <1 and  N4 > N3such that  ny/n < pyfor all  n > N4and y = 0, 1. Since ˆΣBy → ΣBas  n → ∞w.p. 1, there exists B > 0 such that  |ˆΣBy(i, j)| < B for all n, y, i and j w.p. 1. By Lemma 2, w.p. 1 there exists  K3 >0 and  N5 > N4such that |ey(i)| < K3�log log n/n for all  n > N5and y = 0, 1. By the triangle inequality,

image

image

for all  n > N5and i, j = 1, . . . , |B| w.p. 1. In particular, there exists  K4 >0 such that

image

Observe that |ˆΣB|is a polynomial function of the elements of ΣB, ρnE0, (1  − ρn)E1and E, where each term has a degree of |B| and a coefficient of  ±1. In particular,

image

where  εi1,...,i|B|is the Levi-Civita symbol, equal to +1 if (i1, . . . , i|B|) is an even permutation of (1, . . . , |B|), −1 if its an odd permutation, and 0 otherwise,  X1= ΣB, X2 = ρnE0, X3= (1  − ρn)E1, X4 = E, and

image

From (3.101) and (3.103), w.p. 1 there exists  K5 ∈ Rsuch that

image

+  K5log log n n (3.106)

for all  n > N5, where

M2 = {(k1, . . . , k|B|) : exactly one k equals 2, the rest equal 1},

M3 = {(k1, . . . , k|B|) : exactly one k equals 3, the rest equal 1},

image

image

for all  n > N5, where

image

X′1= ΣB, X′2=  E0, and we have used the facts that  m′(1, . . . ,1) = m(1, . . . , 1) and ρnm′(k1, . . . , k|B|) =  m(k1, . . . , k|B|) when (k1, . . . , k|B|) ∈ M2. Similarly, w.p. 1 there exists  K7 ∈ Rsuch that

image

for all  n > N5. Combining (3.106), (3.107) and (3.109), w.p. 1 there exists  K8 ∈ Rsuch that

image

for all  n > N5. Thus, w.p. 1 there exists  K9 ∈ Rsuch that

image

for all  n > N5, where  K9is chosen based on the fact that |ˆΣBy | → |ΣB|w.p. 1 (an application of Corollary 1 with hyperparameters  νBy= 0,  κBy= |A| and  SBy= 0 in place of the hyperparameters used by the selection rule), and thus |ˆΣBy |must be bounded for all n. In addition, w.p. 1 there exists  K10 >0 and  N6 > N5such that

image

image

for  n > N6, where in the second line we have applied the triangle inequality and the fact that |ˆΣB1 | → |ΣB|w.p. 1, so  N6is chosen such that  |ˆΣB1 | > 0.5|ΣB|for all  n > N6, and the last line follows from Lemma 2. By Lemma S3 in Foroughi pour and Dalton (2019), there exists  r ∈(0, 1) such that for all  ρ ∈(0, 1) and  x ∈(1  − r,1 + r),

image

where in the last inequality we use the fact that  ρ(1 − ρ) < 0.25 for all  ρ ∈[0, 1]. Thus, w.p. 1 there exists  N7 > N6such that

image

for all  n > N7, where the last line follows from the fact that (1 +  t/x)x < etfor all x, t > 0. From (3.93), the theorem holds with  c = K12+ 0.5K11and  N = N7, where K12is such that (log  n)K12exceeds  e1.5K1times a bound on the first term in (3.93).

Lemma 8. Let  G1and  G2be any disjoint feature sets such that there exists at least one feature in  G1and one feature in  G2that are correlated in at least one class, and ΣG0and ΣG1are full rank, where  G = G1 ∪ G2. Let  S∞be a balanced sample. Then, w.p. 1 there exist 0  ≤ r <1 and N > 0 such that

image

for all n > N. The left hand side of (3.116) is  a(P)/a(P ′) when P and  P ′are strict refinements of the unambiguous feature partition that are identical except  G1and  G2are good blocks in P and G is a good block in  P ′.

image

where

image

By Corollary 1, the first term in (3.117) converges to a positive finite constant w.p. 1, and (3.76) holds. The rest of the proof proceeds exactly as in Lemma 5.

Lemma 9. Let  B1and  B2be any disjoint feature sets such that  µB0=  µB1, ΣB ≡ ΣB0= ΣB1, there exists at least one feature in  B1and one feature in  B2that are correlated in at least one class, and ΣBis full rank, where  B = B1 ∪ B2. Let  S∞be a balanced sample. Then, w.p. 1 there exist 0  ≤ r <1 and N > 0 such that

image

for all n > N. The left hand side of (3.119) is  a(P)/a(P ′) when P and  P ′are strict refinements of the unambiguous feature partition that are identical except  B1and  B2are bad blocks in P and B is a bad block in  P ′.

image

where

image

By Corollary 1, the first term in (3.120) converges to a positive finite constant w.p. 1, and

image

as  n → ∞, both w.p. 1, where ΣB1 ≡ ΣB10= ΣB11and ΣB2 ≡ ΣB20= ΣB21. Without loss of generality, assume features in B are ordered such that

image

where V is a matrix of correlations between features in  B1and  B2. Since

image

we have that

image

Since there exists at least one feature in  B1and one feature in  B2that are correlated, V is non-zero,  |ΣB2 − V TB1)−1V | < |ΣB2|, and 0  ≤ R <1. The theorem holds for any R < r < 1.

The consistency theory presented herein is important because it provides a richer understanding the type of features selected by OBFS. Furthermore, we have characterized rates of convergence for the posterior on feature partitions, and the marginal posterior probabilities on individual features.

Although here we focus on identifying ¯G using the posterior  π∗(G), the OBFS framework can be used for optimal Bayesian partition selection, which aims to identify ¯P using the full posterior  π∗(P). Partition selection may be of interest, for instance, if one wishes to identify communities of closely interacting genes. Since Theorem 3 proves that  π∗(P) converges to a point mass at ¯P, this theorem has direct implications on the consistency of optimal Bayesian partition selection as well.

The conditions in Theorem 3 are sufficient but not necessary. For example, it may be possible to relax Condition (iii) of Theorem 3. It is also possible for  π∗(G) to converge to a point mass at ¯G, but for  π∗(P) to not converge to a point mass at ¯P. For example, if non-zero correlations are present in the data generation process, then the OBF variant of OBFS sets  π( ¯P ) = 0, which means that Condition (v) of Theorem 3 does not hold. However, if the independent unambiguous set of good features given in Definition 1 and the unambiguous set of good features given in Definition 3 happen to be equal (the latter always contains the former), then OBF can be strongly consistent relative to the unambiguous set of good features.

OBFS searches for the unambiguous set of good features, which expands upon the independent unambiguous set of good features targeted by OBF. In particular, the unambiguous set of good features includes features that are only strongly discriminating when grouped together, features that are individually weak but strongly correlated with strong features, and features that are linked to discriminating features only through a chain of correlation. The unambiguous feature set is complete in the sense that any features that are not included must not have any first or second order distributional differences between the classes and must be uncorrelated with all features that are included, and minimal in the sense that it is the smallest feature set with this property. The unambiguous feature set is thus of great interest and importance in bioinformatics and many other application areas, although we are not aware of any selection algorithms besides OBFS that aim to identify this set.

This work is supported by the National Science Foundation (CCF-1422631 and CCF-1453563).

Ang, J. C., Mirzal, A., Haron, H., and Hamed, H. N. A. (2016). “Supervised, unsuper- vised, and semi-supervised feature selection: A review on gene selection.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 13(5): 971–989. 2

Awada, W., Khoshgoftaar, T. M., Dittman, D., Wald, R., and Napolitano, A. (2012). “A review of the stability of feature selection techniques for bioinformatics data.” In Proceedings of the 2012 IEEE 13th International Conference on Information Reuse and Integration (IRI), 356–363. 2

Dalton, L. A. (2013). “Optimal Bayesian feature selection.” 65–68. IEEE. 2

Diamandis, E. P. (2010). “Cancer biomarkers: Can we turn recent failures into success?” Journal of the National Cancer Institute, 102(19): 1462–1467. 1

Fan, K. (1950). “On a theorem of Weyl concerning eigenvalues of linear transformations II.” Proceedings of the National Academy of Sciences, 36(1): 31–35. 22

Fang, G., Wang, W., Paunic, V., Heydari, H., Costanzo, M., Liu, X., Liu, X., Vander- Sluis, B., Oately, B., Steinbach, M., et al. (2019). “Discovering genetic interactions bridging pathways in genome-wide association studies.” Nature communications, 10(1): 1–18. 3

Foroughi pour, A. and Dalton, L. A. (2014). “Optimal Bayesian feature selection on high dimensional gene expression data.” In Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 1402–1405. 3, 5

— (2015). “Optimal Bayesian feature filtering.” In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACMBCB), 651–652. 2

— (2017a). “Robust feature selection for block covariance Bayesian models.” In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2696–2700. 2

— (2017b). “Robust feature selection for block covariance Bayesian models.” 5

— (2018a). “Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure.” BMC Bioinformatics, 19(Suppl 3): 70. 2, 3

— (2018b). “Optimal Bayesian filtering for biomarker discovery: Performance and ro- bustness.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, early access. 2

— (2019). “Theory of Optimal Bayesian Feature Filtering.” Bayesian Analysis, accepted. 2, 3, 7, 15, 29

Han, K., Jeng, E. E., Hess, G. T., Morgens, D. W., Li, A., and Bassik, M. C. (2017). “Synergistic drug combinations for cancer identified in a CRISPR screen for pairwise genetic interactions.” Nature biotechnology, 35(5): 463. 3

Hartman, P. and Wintner, A. (1941). “On the law of the iterated logarithm.” American Journal of Mathematics, 63(1): 169–176. 15

Harville, D. A. (1998). “Matrix algebra from a statistician’s perspective.” Technometrics, 40(2): 164–164. 22

Ilyin, S. E., Belkowski, S. M., and Plata-Salam´an, C. R. (2004). “Biomarker discovery and validation: Technologies and integrative approaches.” Trends in Biotechnology, 22(8): 411–416. 1

Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., and Liu, H. (2017). “Feature selection: A data perspective.” ACM Computing Surveys, 50(6): 94. 2

Ramachandran, N., Srivastava, S., and LaBaer, J. (2008). “Applications of protein microarrays for biomarker discovery.” Proteomics - Clinical Applications, 2(10–11): 1444–1459. 1

Rifai, N., Gillette, M. A., and Carr, S. A. (2006). “Protein biomarker discovery and validation: The long and uncertain path to clinical utility.” Nature Biotechnology, 24(8): 971–983. 1

Sima, C. and Dougherty, E. R. (2006). “What should be expected from feature selection in small-sample settings.” Bioinformatics, 22(19): 2430–2436. 2

— (2008). “The peaking phenomenon in the presence of feature-selection.” Pattern Recognition Letters, 29(11): 1667–1674. 2

Xi, J., Li, A., and Wang, M. (2018). “A novel unsupervised learning model for detecting driver genes from pan-cancer data through matrix tri-factorization framework with pairwise similarities constraints.” Neurocomputing, 296: 64–73. 3


Designed for Accessibility and to further Open Science