Unsupervised Bump Hunting Using Principal Components

2014·Arxiv

Abstract

Abstract

1 Introduction

The PRIM algorithm for bump hunting was first developed by Friedman and Fisher (1999). It is an intuitively useful computational algorithm for the detection of local maxima (or minima) on target functions. Roughly speaking, PRIM peels the (conditional) distribution of a response from the outside in, leaving at the end rectangular boxes which are supposed to contain a bump (see the formal description in Algorithm 1) at page 5. However, some shortcomings against this procedure have also appeared in the literature when several dimensions are under consideration. For instance, as Polonik and Wang (2010) explained it, the method could fail when there are two or more modes in high-dimensional settings.

Almost at the same time, Dazard and Rao (2010) proposed a supervised bump hunting strategy, given that the use of PRIM is still “challenged in the context of high-dimensional data”. The strategy, called Local Sparse Bump Hunting (LSBH) is outlined in Algorithm 2 at page 3. Summarizing the algorithm, it uses a recursive partitioning algorithm (CART) to identify subregions the whole space where at most one mode is estimated to be present; then a Sparse Principal Component Analysis (SPCA) is performed separately on each local partition; and finally, the location of the bump is determined via PRIM in the local, rotated and projected subspace induced by the sparse principal components.

As an example, we show in Figure 1 simulation results representing a multivariate bimodal situation in the presence of noise, similarly to the simulation design used by Dazard and Rao (2010). We simulated in a three-dimensional input space (p = 3) for visualization purposes. The data consists of a mixture of two trivariate normal distributions, taking on discrete binary response values (), noised by a trivariate uniform distribution with a null response (Z = 0), so that the the data can be written by where 1] is the mixing weight, and (

Notice how the data in the PC spaces determined by Partition #1 and #2 do align with the PC coordinate axes , respectively (Figure 1).

Our goal in this paper is to provide some theoretical basis for the use of PCs in mode hunting using PRIM and a modified version of this algorithm that we called “fastPRIM”. Although the original LSBH algorithm accepts more than one mode by partition, we will restrict ourselves to the case in which there is at most one on each partition, in order to get more workable developments and more understandable results in this work.

In Section 2 we define the algorithms we are working with and set some useful notation. Section 3 proposes a modification of PRIM (called fastPRIM) for the particular case in which the bumps are modes in a setting of normal variables that allows to compare the boxes in the original space and in the rotation induced by principal components. The approach goes beyond normality and can be shown to be true for every symmetric distributions with finite second moment, and it is also an important reduction on the computational complexity since it is also useful for samples when 0, via the central limit theorem (Subsection 3.3). In this section we also present simulations which display the differences between considering the original space or the PC rotation for PRIM and fastPRIM. Finally, Section 4 proves Theorem 1, a result explaining why the (volume-standardized) output box mode is higher in the PC rotation than in the original input space, a situation observed computationally by Dazard and Rao (2010) for which we give here a formal explanation. Theorem 2 shows that in terms of bias and variance, fastPRIM does better than PRIM. Finally, in Section 5 we show additional simulations relevant to the results found in Section 4.

Figure 1: Illustration of the efficiency of the encapsulation process by LSBH of two target normal distributions (red and green dots), in the presence of 10% (w = 0.9) noise distribution (black dots) in a three-dimensional input space (p = 3). We let the total sample size be . Top row: each plot represents a projected view of the data in input subspace () with 95% confidence ellipses (dotted red and green contours - top left panel) and partitions vertices (top right panel). Only those partitions encapsulating the target distributions are drawn. Bottom row: each plot represents a projected view of the data in the PC subspace () of Partition #1 (bottom left), and () of Partition #2 (bottom right).

2 Notation and basic concepts

We set here the concepts that will be useful throughout the paper to define the algorithms and its modifications. Our notation on PRIM follows as a guideline the one used by Polonik

Let X be a p-dimensional real-valued random vector with distribution F. Let Z be an integrable random variable. Let . Assume without loss of generality that

Define interested in a region C such that

where ). Note then that ave(C) is just a notational convenience for the average of

Given a box B whose sides are parallel to the coordinate axes of , we peel small pieces of B parallel to its sides and we stop peeling when what remains of the box B becomes too small. Let the class of all these boxes be denoted by B. Given a subset a parameter 1), we define

where ave(B|S) = I(B|S)/F(B|S). In words, is the box with maximum average of Z among all the boxes whose F-measure, conditioned to the points in the box former definitions set the stage to define Algorithm 1 at page 5 below.

Some remarks are in order given Algorithm 1:

is the second tuning parameter and -quantile of the marginal conditional distribution function of given the occurrence of construction,

Remark 2. Conditioning on an event, say ˜A, is equivalent to conditioning on the random variable ; i.e., when this occurs, as in (2), we are conditioning on a Bernoulli random variable.

Remark 3. When dealing with a sample, we define analogs of the terms used previously and

replace those terms in Algorithm 1 with:

where is the empirical cumulative distribution of

Remark 4. Ignore the pasting stage, considering only peeling and covering. Let us call the probability of the final region. Then

2.1 Principal Components

The theory about PCA is widely known, however we will oultine it here for the sake of completeness and to define notation. Among others, Mardia (1976) presents a thorough analysis.

If x is a random centered vector with covariance matrix Σ, we can define a linear transformation T such that

where Γ is a matrix such that its columns are the standardized eigenvectors of Σ := ΓΛΓΛ is a diagonal matrix with , are the eigenvalues of Σ. Then T is called the principal components transformation.

Let ) the original p-dimensional space where ) the rotated p-dimensional space where y lives, and ) the rotated and projected space on the PC’s.

As we will explain later, we are not advising on the reduction of dimensionality in the context of regression or other learning settings. However, since it is relevant to some features of our simulations, we consider the case

3 fastPRIM: a More Eﬃcient Approach to mode hunt-

Despite successful applications in many fields, PRIM presents some shortcomings. For instance, Friedman and Fisher (1999), the proponents of the algorithm, show that in the presence of high collinearity or high correlation PRIM is likely to behave poorly. This is also true when there is significant background noise. Further, PRIM becomes computationally expensive in simulations and real data sets in large dimensions. In this section we propose a modified version of PRIM, called “fastPRIM”, aimed to solve these two problems when we are hunting the mode. The high collinearity problem can be solved via principal components. The computational problems can be solved via the CLT and the geometric properties of the normal distribution, if we can warrant

The following situations are variations from simple to complex of the input X and the response Z being normally distributed ), respectively. We are interested on maximizing the density of Z given X. But there are several ways to define the mode of a continuum distribution. So for simplicity, let us define the mode of Z as the region

with that maximizes

(note the similarity of M(C) with I(C) in Equation (1)). In terms of PRIM, we are interested in the box defined on Equation (2). That is, is a box such that inside it the mean density of the response Z is maximized. Then, since the mean and the mode of the normal distribution coincide, finding a box of size centered around the mean of X is equivalent to finding a box that maximizes the mode of Z (since X and Z are both centered around the origin).

Although it is good to have explicit knowledge of our final region of interest, on what follows most of the results —with the exception of Theorem 1 below— can be stated without direct reference to the mode of Z, taking into account that the mode of Z is centered around the mean of X.

3.1 fastPRIM for Standard Normality

Let living in the space 1). Since the whole input space is defined by symmetric uncorrelated variables, PRIM can be modified in a very efficient way. (See below Algorithm 3.)

Several comments are worthy to mention related to this modification.

1. Given that the standard normal is spherical, the final box at the end of the peeling algorithm is centered. It is also squared in that all its marginals have the same Lebesgue measure and the same probability measure . Then, instead of doing the whole peeling stage, we can reduce it to select the central box whose vertices are located at the coordinates corresponding to the quantiles of each marginal.

2. Say we want to apply t steps of covering. Since the boxes chosen are centered at the end of the t-th covering step, the final box will have probability measure

(which, by Remark 4, produces the same probability than PRIM), each marginal has measure (, and the vertices of each marginal are located at the coordinates corresponding to the quantiles . It means that the whole fastPRIM is reduced to calculating this central box of probability measure

3. The only non-zero values outside the diagonal in the covariance matrix of ((+1) are possibly the non-diagonal terms in the first row and the first column. Let us call them . From this we get that

4. It does not make too much sense to have a pasting stage, since we will be adding the same we just peeled in portions of ) at each side. However, a possible way to add this whole stage is to look for the dimension that maximizes the conditional mean, once a portion of probability 2 have been added to each side of the selected dimension. All this, of course, provided that this maximal conditional mean be higher than the one already found during the peeling stage. If this stage is applied as described, the final region will be a rectangular centered box.

Points 1, 2 and 3 can be stated as follows:

. Let us iterate t times Algorithm 3. Then the whole algorithm can be reduced to a single stage of finding a centralized box with vertices located at the coordinates corresponding to the quantiles of each of the p variables.

3.2 fastPRIM and Principal Components

Note that if Σ), the same algorithm as in Section 3 can be used. The only difference is that the final box will be a rectangular Lebesgue set, not necessarily a square as before (although it continues being a square in probability). Some comments are in order.

First, with each of the variables having possible different variances, we are also peeling the random variables with lower variance. That is, we are peeling precisely the variables that we do not want to touch. The whole idea behind PRIM, however, is to peel from the variables with high variance, leaving the ones with lower variance as untouched as possible. The obvious solution is to use a PCA to project on the variables with higher variance, peel on those variables, and after the box is obtained to add the whole set of variables we chose not to touch. Adding to the notation developed in Section 2.1 for PCA, call the projection of Y to its firsts principal components, where 0 . Algorithm 4 below makes this explicit.

In this way, we avoid to select for peeling the variables with lower variance. Concededly, we are still peeling the same amount (we are getting squares, not rectangles, in probability), but we are also getting an important simplification in algorithmic complexity cost. Besides

this fact, most of the comments in Section 3.1 are still valid but one clarification has to be made: The covariance matrix of () has size (+ 1); as before, all the non-diagonal elements are zero, except possibly the ones in the first row and the first column. Call where -th component of the random vector

As before, we can state the following lemma:

times the covering stage of Algorithm 4. Then the whole algorithm can be reduced to a two-stage setting: First, to find a centralized box with vertices located at the coordinates corresponding to the quantiles

of each of the variables. Second, add the dimensions left untouched to the final box.

Remark 5. Even though we have developed the algorithm with , it is not wise to try to reduce the dimensions of the input. To be sure, the rotation of the input in the direction of the principal components is a useful thing to do in learning settings, as D´ıaz-Pach´on et al. (2014) have showed. However, Cox (1968), Hadi and Ling (1998), and Joliffe (1982), have warned against the reduction of dimensionality.

3.3 fastPRIM and Data

The usefulness of the previous result can be more easily seen when, for relatively large n, we consider the iid vectors with finite second moment, since in this way we can approximate to a normal distribution by the Multivariate Central Limit Theorem:

Call ] and let us assume that 0. By the multivariate central limit theorem, if the vectors of observations are iid, such that their distribution has mean variance Σ, we can approximate -variate normal distribution with parameters can be approximated to a distribution Now, is the PC transformation of is the matrix of eigenvectors of S, the sample covariance matrix of is the diagonal matrix of eigenvalues of

As before, call the projection of Y to its firsts principal components. Apply Algorithm 4.

Note that the use of the CLT is indeed well justified: since the asymptotic mean of is 0, its asymptotic mode is also at 0 (or around 0).

3.4 Graphical Illustrations

In the following simulations, we first test PRIM and fastPRIM and illustrate graphically how fastPRIM compares to PRIM either in the input space X(p) or in the PC space We generated a synthetic dataset derived from a simulation setup similar to the one used in Section 1, although with a single target distribution and a continuous normal response, without noise. Thus, the data X was simulated as Σ) with response To control the amount of variance for each input variable and their correlations, the sample covariance matrix Σ was constructed from a specified sample correlation matrix R and sample variance matrix V such that Σ := , after ensuring that the resulting matrix Σ is symmetric positive definite.

Simulations were carried out with a continuous normal response with parameters and 2, a fixed sample size , and no added noise (i.e. mixing weight w = 1). Here, we limited ourselves to a low dimensional space (= 2) for graphical visualization purposes. Simulations were for a fixed peeling quantile , a fixed minimal box support fixed maximal coverage parameter t, and no pasting for PRIM. Empirical results presented in Figure 2 show the marked computational efficiency of fastPRIM compared to PRIM. CPU times are plotted against PRIM and fastPRIM coverage parameters , respectively, in the original input space X(2) and PC space

Further, empirical results presented in Figure 3 show PRIM and fastPRIM box coverage sequences as a function of PRIM and fastPRIM coverage parameters , respectively. Notice the centering and nesting of the series of fastPRIM boxes in contrast to the sequence of boxes induced by PRIM (Figure 3).

Figure 2: Total CPU time as a function of coverage. For all plots, comparison of speed metrics are reported against coverage parameter for PRIM and coverage parameter for fastPRIM, in the original input space X(2) (left), and the PC space (2) (right) for each algorithm. Total CPU time in seconds (s). Mean estimates and standard errors of the means are reported after generating 128 Monte-Carlo replicates.

4 Comparison of the Algorithms in the Input and PC

The greatest theoretical advantage of fastPRIM is that, because of the centrality of the boxes, it gives us a framework to compare the output mean in the original input space and in the PC space, something that cannot be attained with the original PRIM algorithm in which the behaviour of the final region is unknown (see Figure 2). Polonik and Wang (2010) explain how PRIM tries to approximate regression level curves, an objective that the algorithm does not accomplish in general. With the idea of level curves in mind, it is clear that the bump of a multivariate normal distribution can be seen as the data inside the ellipsoids of concentration. This concept is the key to prove the optimality of the box found on the PC space. By optimality here we mean the box with minimal Lebesgue measure among all possible central boxes found by fastPRIM with probability measure

Lemma 3. Let E be a p-dimensional ellipsoid. The rectangular box that is circumscribing E (i.e. centered at the center of E, with sides parallel to the axes of E, such that each of its edges is of length equal to the axis length of E in the corresponding dimension), is the box with the minimal volume of all the rectangular boxes containing E.

Figure 3: PRIM and fastPRIM box coverage sequences. Top row: PRIM complete sequence of coverage boxes, each corresponding to a coverage step with a fixed peeling quantile 05, and a fixed maximal coverage parameter t = 20, corresponding to a fixed minimal box support 05. Bottom row: fastPRIM complete sequence of coverage boxes, each corresponding to a fixed coverage parameter , with a fixed Results are given in the input space X(2) (left) and in the PC space (2) (right). The red to blue palette corresponds to a range of box output means from the largest to the smallest, respectively.

The proof of Lemma 3 is well-known and is omitted here.

. Assume that the true bump E of X has probability measure . Then, it is possible to find a rectangular box R by fastPRIM that circumscribes E under the PC rotation with minimal Lebesgue measure over all rectangular boxes containing E and the set of all possible rotations.

Proof. The true bump satisfies that . This bump, by definition of normality, lives inside an ellipsoid of concentration E, of volume Vol(the length of the semi axis of the dimension is a constant that only depends on the dimension p. By Lemma 3 above, the box R with sides parallel to the axes of E, and circumscribing E, has minimal volume over all the boxes containing E and its volume is 2. Let us assume that

Note now that R is parallel to the axes in the space of principal components it is centered at its origin. Therefore, provided an appropriate small (it is possible that we need to adjust proportionally on each direction of the principal components to obtain the box that circumscribes E), the minimal rectangular box R containing the bump E can be approximated through fastPRIM and is in the direction of the principal components. As such, then the box R has smaller Lebesgue measure than any other approximation in every other rotation.

Remark 6. The box of size circumscribing the ellipsoid of concentration E is identical to in equation (2).

Proposition 1 allows us to compare box estimates in the PC space of PRIM (Figure 2, top-right) versus fastPRIM (Figure 2, down-right). Remember from Equation (5) that 1) is the box obtained with PRIM after a single stage of coverage. We now restrict ourselves to the case of 1) in the direction of the principal components (i.e., its sides are parallel to the axes of )). We establish the following result:

the final fastPRIM box resulting from Algorithm 4 and assume . As in (5), call also ˆthe final box from Algorithm 1 after one stage of coverage. Assume that contain the true bump. Then

that is, the volume-adjusted box output mean of the mode of Z given R is bigger than the volume-adjusted box output mean of the mode of

Proof. Note that by definition, the two boxes have sides parallel to the axes of proof is direct because of the assumptions. By Proposition 1, R is the minimal box of measure that contains the true bump. Therefore, any other box with parallel sides to R that contains the bump also contains R. Since R is centered around the mean of Z, every point z in the support of have less density than arg minTherefore ). From Proposition 1 we also get that

Since 1) is but a particular case of a box , the result follows.

Not only R has better volume-adjusted output mean than 1). We conclude showing the optimality of the latter over the former in terms of bias and variance.

as the true bump, and let us assume that both R is unbiased while

1) are estimators of , as defined in Equation (2). Algorithm 4 is producing unbiased boxes since by construction it is centered around the mean. In fact, R would be unbiased even if not taken in the direction of the PC. On the other hand, ˆalmost surely biased, even in the direction of the principal components, since it is producing boxes that are not centered around the mean.

Now, the inequality )) stems from the fact that R is the box with minimal volume containing E. Since R is in the direction of the principal components, every other box that contains E in the same direction also contains R, in particular

5 Simulations

Next, we illustrate how the optimality of the box encapsulating the true bump is improved in the PC space ) as compared to the input space X(p). Empirical results presented in Figure 4 are for the same simulation design and the same fastPRIM and PRIM parameters as described in Subsection 3.4, except that we now allow for higher dimensionality since no graphical visualization is desired here (p = 100).

Some of the theoretical results between the original input space and the PC space are borne out based on the empirical conclusions plotted in Figure 4. In sum, for situations with no added noise, one observes for both algorithms that: i) the effect of PCA rotation dramatically decreases the box geometric volume; ii) the box output (response) means are almost identical in the PC space and in the original input space; and iii) the volume-adjusted box output (response) means are markedly larger in the PC space than in the original input space - indicating a much more concentrated determination of the true bump structure (Figure 4).

Some additional comments:

Figure 4: Box statistics and performance metrics as a function of coverage. For all plots, results are plotted against PRIM coverage parameter and fastPRIM coverage parameter in the original input space X(100) (red) vs. the PC space (green), that is for = 100, for each algorithm: PRIM (top row) vs. fastPRIM (bottom row). First column: box geometric volume (Log scale); second column: box output (response) mean; third column: volume-adjusted box output (response) mean (Log scale). See simulation design for details and metrics definitions. Mean estimates and standard errors of the means are reported after generating 128 Monte- Carlo replicates.

1. As each algorithm covers the space (up to step k = t), the box support and the box geometric volume are expected to increase monotonically (up to sampling variability) for both algorithms.

2. The boxes are equivalent for the mean of Z and the mode of Z because Z is normal, we expect the fastPRIM box being centered around the mean and therefore the conditional

mean of Z should be 1 (because in this simulation the mean of Z is 1). While, the box for Z given PRIM must have a different conditional expectation. This justifies the fact of looking at the mode of Z inside the boxes, and not directly the mode of Z.

3. Since the the box output (response) mean is almost perfectly constant at 1 for fastPRIM and close to 1 for PRIM, it is expected that the box volume-adjusted output mean decreases monotonically at the rate of the box geometric volume for both algorithms.

4. Also, as coverage k, t increases, the two boxes ) of each algorithm converge to each other (covering most of the space), so it is expected that the output (response) means inside the final boxes converge to each other as well (i.e. towards the whole space mean response 1).

To illustrate the effect of increasing dimensionality, we plot in Figure 5 the profiles of gains in volume-adjusted box output (response) mean as a function of increasing dimensionality . Here, the gain is measured in terms of a ratio of the quantity of interest in the PC space ) over that in the original input space X(p). Empirical results presented are for the same simulation design and the same fastPRIM and PRIM parameters as described in subsection 3.4. Notice the extremely fast increase in volume-adjusted box output (response) mean ratio as a function of dimensionality p, that is, the marked larger value of volume-adjusted box output (response) mean in the PC space as compared to the one in the input space for both algorithms. Notice also the weak dependency with respect to the coverage parameters (k, t).

Further, using the same simulation design and the same fastPRIM and PRIM parameters as described in subsection 3.4, we compared the efficiency of box estimates generated by both algorithms in the PC space ) as a function of dimension and coverage parameters k, t for PRIM or fastPRIM, respectively. Notice, the reduced box geometric volume (Figure 6) and increased box volume-adjusted output (response) mean (Figure 7) of fastPRIM as compared to PRIM.

Finally, in Figures 8 and 9 below we compare variances of fastPRIM and PRIM volume-adjusted box output (response) means in the PC space ) as a function of dimension coverage parameters k, t for PRIM or fastPRIM, respectively. Empirical results are presented for the same simulation design and the same fastPRIM and PRIM parameters as described in subsection 3.4. Results show that the variance of fastPRIM box geometric volume (Figure 8) is reduced than its PRIM counterparts for coverage t not too large (15), which is matched to a reduced variance of fastPRIM volume-adjusted box output (response) mean for coverage t not too small (

Figure 5: Gains profiles in volume-adjusted box output (response) mean as a function of dimensionality p. For all plots, comparison of box statistics and performance metrics profiles are reported as a ratio of the values obtained in the PC space ) (denoted Y) over the original input space X(p) (denoted X). We show empirical results for varying dimensionality , a range of PRIM and fastPRIM coverage parameters (), and for both algorithms: PRIM (left) vs. fastPRIM (right). Both coordinate axes are on the log scale.

Of note, the results in Figures 6 and 7 below, and similarly in 8 and 9, are for the sample size n = 1000 of this simulation design. In particular, efficiency results of fastPRIM versus PRIM box estimates show some dependency with respect to coverage parameters k, t for large coverages and increasing dimensionality. As discussed above, this reflects a finite sample-effect favoring PRIM box estimates in these coverages and dimensionality.

Notice finally in Figures 6 and 7 how the curves approach each other for the largest coverage step k = t = 20, and similarly in 8 and 9 how the curves approach the identity line. This is in line with the aforementioned convergence point of the two boxes coverage increases.

Figure 6: Comparative profiles of box geometric volumes in the PC space ) as a function of dimension and coverage parameters for PRIM or fastPRIM, respectively. We show results for a range of dimension a range of PRIM and fastPRIM coverage parameters ’y’ axes are on the Log scale.

Figure 7: Comparative profiles of box volume-adjusted output (response) means in the PC space ) as a function of dimension and coverage parameters for PRIM or fastPRIM for PRIM and fastPRIM, respectively. We show results for a range of dimension and a range of PRIM and fastPRIM coverage parameters . The ’y’ axes are on the Log scale.

Figure 8: Comparative profiles of variances of box geometric volumes in the PC space as a function of dimensionality and coverage parameters for PRIM or fastPRIM, respectively. In all subplots, we show the variances of box geometric volumes of both algorithms against each other for a range of PRIM and fastPRIM coverage parameters () in four dimensions The identity (doted) line is plotted. All axes are on the Log scale.

Figure 9: Comparative profiles of variances of box volume-adjusted output (response) means in the PC space ) as a function of dimensionality and coverage parameters for PRIM or fastPRIM, respectively. In all subplots, we show the variances of the volume-adjusted box output (response) means of both algorithms against each other for a range of PRIM and fastPRIM coverage parameters () in four dimensions . The identity (doted) line is plotted. All axes are on the Log scale.

6 Discussion

Our analysis here corroborates what D´ıaz-Pach´on et al. (2014) have showed on how the rotation of the input space to the one of principal components is a reasonable thing to do when modeling a response-predictor relationship. In fact, Dazard and Rao (2010) use a sparse PC rotation for improving bump hunting in the context of high dimensional genomic predictors. And Dazard et al. (2012) also show how this technique can be applied to find additional heterogeneity in terms of survival outcomes for colon cancer patients. The geometrical analysis we present here shows that as long as the principal components are not being selected prior to modeling the response, then these improved variables can produce more accurate mode characterizations. In order to elucidate this effect, we introduced the fastPRIM algorithm, starting with a supervised learner and ending up with an unsupervised one. This analysis opens the question on whether is possible to go from supervised to unsupervised settings in more general bump hunting situations, not only modes; and more generally, whether is possible to go from unsupervised to supervised in other learning contexts beyond bump hunting.

Acknowledgements: All authors supported in part by NIH grant NCI R01-CA160593A1. We would like to thank Rob Tibshirani, Steve Marron and Hemant Ishwaran for helpful discussions of the work. This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University.

References

Cox, D. (1968), “Notes on some aspects of regression analysis,” J. R. Stat. Soc. Ser. A., 131, 265/279.

Dazard, J.-E., Rao, J., and Markowitz, S. (2012), “Local sparse bump hunting reveals molec- ular heterogeneity of colon tumors,” Stat. Med., 31, 1203–1220.

Dazard, J.-E. and Rao, J. S. (2010), “Local Sparse Bump Hunting,” J. Comput. Graph. Stat., 19, 900–929.

D´ıaz-Pach´on, D., Rao, J., and Dazard, J.-E. (2014), “On the explanatory power of principal components,” Under review.

Friedman, J. and Fisher, N. (1999), “Bump hunting in high-dimensional data,” Stat. Comput., 9, 123–143.

Hadi, S. and Ling, R. (1998), “Some Cautionary Notes on the Use of Principal Components Regression,” Am. Stat., 52, 15–19.

Joliffe, I. (1982), “A Note on the Use of Principal Components in Regression,” J. R. Stat. Soc. Ser. C. Appl. Stat., 31, 300–303.

Mardia, K. (1976), Multivariate Analysis, Academic Press.

Polonik, W. and Wang, Z. (2010), “PRIM Analysis,” J. Multivar. Anal., 101, 525–540.

designed for accessibility and to further open science