Large-Scale Shrinkage Estimation under Markovian Dependence

2020·arXiv

1 Introduction

2 Shrinkage estimation in a hidden Markov model

Consider a two-state hidden Markov model. Suppose we are interested in estimating the mean vector based on observed output . In HMMs, the observed data XXX can be viewed as a contaminated version of the effect sizes , where the contaminations are white noises following a zero-mean normal distribution with known variance . The HMM further assumes that s are independently distributed conditional on the unknown states , where are Bernoulli variables forming a Markov chain. The latent state usually represents the null or the in-control state and corresponds to the non-null or the out-of-control state. For example, in CNV studies, is the observed number of repeats in the genome, represents healthy part of genome, and represents the part of the genome that is susceptible to perturbation in diseased patients.

2.1 Model and notations

Formally, the data generation process can be described by a hierarchical model:

In practice, we often assume that the null distribution is either known or can be modeled by a pre-specified parametric family1. By contrast, we do not specify any parametric forms on because we usually do not have sufficient knowledge on the non-null process , which may be difficult to describe using any pre-specified parametric families. The setting is reasonable for a wide range of application scenarios.

tion probabilities

The above probabilities are unknown and obey standard stochastic constraints 0 < Figure 2.1 presents a schematic diagram of the model. We aim to find a decision rule ˆfor estimating the unknown mean vector . The goal is to minimize the Bayes risk , where is the mean squared error (MSE, or loss):

Fig. 1 Schematic diagram of the data generation process. We observe s only and would like to estimate the s which are related through the Markov chain

Let denote the HMM parameters, where A is the transition matrix, is the initial distribution, and

We use to denote the normal density with mean 0 and variance with the suffix dropped for standard normal density.

2.2 Oracle estimator under independence

Ignoring the dependence structure, the optimal solution that minimizes the Bayes risk is given by Tweedie’s formula:

and f is the marginal distribution of all the observed X. The formula first appeared in Robbins (1955), who attributed the idea to Maurice Kenneth Tweedie. Tweedie’s formula can be implemented using empirical Bayes methods for constructing a class of non-parametric estimators (Brown and Greenshtein, 2009; Efron, 2011). The crucial observation is that it works directly with the marginal distribution, which is in particular attractive in large-scale estimation problems since f and can be well estimated using standard density estimation techniques (Silverman, 2018).

2.3 Oracle Estimator under HMM dependence

Our idea is to extend Tweedie’s formula to correlated observations under the HMM dependence. This can effectively increase the statistical efficiency by borrowing strength from adjacent observations. It has been shown in the multiple testing literature that exploiting the dependence structure can greatly improve the power of existing false discovery rate (Benjamini and Hochberg, 1995) methods (Sun and Cai, 2009; Wei et al., 2009; Sun et al., 2015). We further show in this article that dependence structure is also highly informative and can be exploited to improve the accuracy of shrinkage estimators.

Our methodological development is divided into two steps. We first assume that an oracle knows the hidden states and study the oracle rule. Then we discuss the case when the states are unknown and propose data-driven methods to emulate the oracle rule. If the states are known, then it is natural to apply Tweedie’s formula to two separate states:

The oracle estimator (3) ˜provides a benchmark in shrinkage estimation, i.e. the theoretically achievable lower limit of the estimation risk.

Next we consider a “weaker” oracle estimator that only knows , the collection of hyper-parameters in the HMM. Then the optimal solution is given by the next lemma.

Lemma 1 Consider Model (1) and assume that HMM parameters are known. Then the estimator that minimizes the MSE is given by

The proof of lemma is straightforward and hence omitted. The formula (4) can be viewed as an extension of Tweedie’s formula under HMM dependence. The next section considers the case where HMM parameters are unknown. We shall develop data-driven estimators and computational algorithms to emulate the Bayes estimator (4).

3 Data-Driven Estimator and Computational Algorithms

We discuss how to estimate and to implement the Bayes estimator. The proposed estimator is a non-parametric Tweedie-based shrinkage estimator under dependence (TD). In light of (4), we consider a class of estimators in the form:

where ˆare estimates of the true conditional probability , and ˆand ˆare estimates of and . Next we describe an algorithm for constructing estimates of the unknown quantities in (5).

3.1 The modified Baum-Welch algorithm

The most well-known method for constructing estimators of the form (5), under the conventional setting with pre-specified parametric families, is the Baum-Welch (BW) algorithm (Rabiner, 1989; Yang et al., 2015; Baum et al., 1970). We emphasize that the BW algorithm must proceed with user specified parametric forms for ’s. However, while , the in-control distribution, can be well modeled by parametric densities, the out-of-control observations may not be easily approximated by parametric densities. For example, the Gaussian mixture model with a fixed number of components is not suitable for approximating very heavy tailed . To overcome the issue, we describe an HMM-based Tweedie (HMMT) estimator that employs parametric and non-parametric . The essential component in our estimator is a generalized BW algorithm that updates the estimates of and iteratively based on posterior probability estimates of the latent states.

To avoid the identifiability issues, assume is unimodal and 5. These assumptions are usually reasonable in applications. In our HMMT estimator, is assumed to be a Gaussian density with possibly unknown location and scale . The Gaussian model can be generalized to Gaussian mixtures as well as other parametric families without essential difficulty. In contrast with existing HMM algorithms that utilize parametric assumptions on , we consider a nonparametric approach to estimating . The proposed methodology employs a class of weighted kernel density estimators

where 1, h is the bandwidth and is a standard Gaussian kernel func- tion. Other choices of kernel such as the Cauchy kernel can also be used particularly in applications that are sensitive to the tails of the density.

With an initial choice of , ˆand ˆ, the forward-backward propagation steps of the Baum-Welch algorithm provides updated estimates of the posterior probabilities ˆ. These estimates can be utilized to update ˆusing new weights ˆand ˆ:

The process is repeated until convergence.

3.2 Choice of h

Choosing an appropriate bandwidth h is crucial for constructing an efficient estimator. If h is too small, then we will end up with estimating by the naive estimator . If h is too large, then it will be difficult to identify ˆ, which leads to a severely biased and inaccurate estimator. This section describe a cross validation method for choosing the tuning parameter h.

The first step is to split the observed sample XXX into (UUU,VVV) by artifically adding independent Gaussian noise: , where . Note thatUUU,VVV are normally distributed random effects with same meanbut different variances and . By construction, UUU and VVV are mutually independent conditioned on .

The key idea in the second step is to use UUU for estimation and VVV for bandwidth selection. One can think of as a measure of how much data we use for estimation (Brown et al., 2013). When 0, the entire data set is used for estimation and none for hyper-parameter calibration. This article uses 10%.

The steps are repeated for a list of prefixed h values yielding estimates of the posterior probabilities : 1 of the latent states and marginal densities estimates ˆand ˆ. For each h value, we estimate the MSE as

where, ˆand ˆ. The optimal value of bandwidth is

Subsequently, the HMMT estimator is computed by running the generalized BW algorithm with the bandwidth set at ˆh and 0.

3.3 Initialization and the HMMT estimator

Another important caveat about Algorithm 1 is the initialization step. To prevent the algorithm getting trapped in a local maximum, a good initialization is important. For this purpose we use the algorithm developed in Ko et al. (2015) for estimation in an HMM with unknown number of changing points. As the probability of the process being in control is estimated by the mode of the component with the largest probability which is subsequently attributed to ˆ. The remaining states as estimated via a Dirichlet process model contributes towards the out-of-process marginal densities ˆ. It was found that such an initialization produced reasonable values of ˆ. Henceforth for simplicity, we assume 1. The detailed estimation procedure is summarized by Algorithm 1 below.

Algorithm 1 HMM-Tweedie Estimator

2: Intialization Use the algorithm in Ko et al. (2015) to get estimates 3: E step: For 1, compute the following in the t iteration using the previous iteration estimates.

5: Iterate until the parameters and the non-parametric density estimate converges. 6: Repeat steps 3-5 for a list of h values and store the final marginal densities estimates ˆ

and the posterior probabilities of the latent states. 7: Choose bandwidth ˆ

3.4 Operational characteristics of the new algorithm

This section discusses the operation characteristics of Algorithm 1 that distinguishes our work from existing ones. The discussions provide intuitions that are helpful for our later theoretical analysis.

Algorithm 1 includes several modifications to the conventional Baum-Welch algorithm of Baum et al. (1970). For understanding the crucial differences first consider 1, i.e, no sample splitting. The average log-likelihood for complete data assuming known is:

When is unknown, interchanging logarithm and summation above, we iteratively maximize the following lower bound ˜of ,

where, log is the entropy, and decouples as with

and log . We iteratively max- imize over based on previous iterate . If the posterior probabilities for are updated to ˆthen, equals

Although is concave in , it is not strongly concave. Updating by maximizing over the 1 dimensional hyperplane as would not yield good solutions. We interchange the logarithm and summation in the first term in above. The resultant ˜have a closed form maxima at ˆ.

backward procedure (Rabiner, 1989) is used to maximize and produce updated posterior probabilities ˆ. For i = 1, . . ., n and j, k = 0, 1 the forward variable: and the backward variable: are computed. and the posterior probabilities are updated as:

The objective ˜is bounded above and increasing at each step, therefore, our algorithm will converge. In particualr, as , the true solution is a fixed point of the algorithm. This insight is crucial for establishing the theoretical properties for the proposed HMMT estimator.

4 Asymptotic Properties of the proposed estimator

For establishing risk properties of our proposed shrinkage estimators, we consider the following assumption that is popularly imposed for theoretical analysis in HMM starting from the pioneering works of Bickel et al. (1996, 1998):

Assumption A1. The hidden states form a stationary ergodic Markov chain and the transition probabilities satisfy 0 1. The Markov chain begins in its stationary state and is reversible.

To understand the risk properties of the proposed estimator ˆ, we consider the following two quantities:

which are helpful for understanding the behaviors of the proposed estimate of . Note that is an estimator that cannot be implemented in practice, since it uses knowledge of unknown . By contrast, ˆis a practical estimator that substitutes ˆin place of . ˆ, which can be conceptualized as the estimates of posterior probabilities , can be obtained as the outputs from the proposed algorithm.

Consider ˆdefined by (8) at an arbitrary iterative step. Barring sample splitting, the estimate of in every iterative step of Algorithm 1 has the functional form of ˆ. We consider a class of that are smooth, bounded and -Holder continuous.

Assumption A2. comes from a smooth class of functions and has bounded support. is -Holder continuous, i.e., , where

It can be seen (cf. Ch. 6 of Wasserman, 2006) that the squared distance of the oracle kernel density estimator from the true ,

has the following asymptotic expected value:

The intuition is that, under assumption A1, ˆcan be decomposed as

where the residual ˆis asymptotically negligible as provided ˆare asymptotically unbiased estimates of . Theory underpinning this intuition is rigorously established in the proof of the next lemma. Hence

where is a fixed probability whose existence is ensured the stationarity of the Markov chain (Assumption A1). We summarize the above discussions in the lemma below.

Lemma 2 If is -Holder continuous and has bounded support then under Assumption A1, the integrated Bayes risk of non-parametric density estimators of the form (8)-(9) for any h > 0 is

In particular, if ˆare unbiased and log then,

The above results shows that with good posterior probability estimates, the algorithm 1 provides consistent non-parametric estimation of . As is parametrized as a Gaussian density, its estimation consistency also subsequently follows. The free parameters in are and and . For understanding the evolution of the estimates in Algorithm 1 from the initial solutions towards the true , note that under Assumption A1, the population criteria ˜lim˜and ˜˜lim˜˜are well-defined. Suppose our initial solution is in the vicinity of such that with estimates ˆand ˆ1that are improvements over the initial estimates ˆand ˆin the sense d( ˆand , maximizing and the subsequent application of (8) produces better estimates ˆof the true posterior probabilities such that ˆgiven by (9) is better than ˆwith d( ˆ. If this feature of the evolution is maintained over the successive iterations, the bias in the posterior probability estimates decrease in every step and ˆand

1will ultimately converge to the true and respectively at the rate given in Lemma 2. Having a good initial solution that has negligible asymptotic bias, which implies that successive iterations would have the aforementioned properties, is hence essential for successful convergence in Algorithm 1. Our assumption on the properties related to initialization is given next.

Assumption A3. The initial solution , is in the neighborhood of the true solution. The behaviour of the population log-likelihood in this neighborhood is such that for all t, the posterior estimates ˆbased on argmaxproduce estimates and in Algorithm 1 which satisfy: maxmax

The results in Lemma 2 establish the risk optimality of our proposed ˆ. To facilitate a universal optimality statement that is appropriate for both nonsparse and sparse regimes, we impose the following assumption on similar to Brown and Greenshtein (2009):

Assumption A4. For any fixed positive , we have sup0 as .

Remark 1 Based on our set-up in Section 2.1 and Assumption A1, the above condition is equivalent to restricting the support of the priors and to where 0 as 0 for any positive .

For any fixed h > 0, let ˆ, ˆand ˆbe the final estimates of the , and the mode of respectively, from algorithm 1. Recall that as 1, the estimates of the mean when the process is out of control is

Barring sample splitting (using 0, not steps 7 and 8 in algorithm 1) the HMMT estimator is ˆ. As the HMMT estimator involves ratio of estimates ˆand ˆf , the approximation error bound in Lemma 2 on ˆf does not trivally lead to optimal risk properties of the HMMT estimator. For achieving asymptotically optimal risk performance, consider a modified robust version of the HMMT estimator that truncates very high out-of-control mean values:

The following result akin to Theorem 1 of Brown and Greenshtein (2009) shows that ˆis asymptotically oracle optimal as it can achieve the mean square error of the oracle shrinkage estimator defined in (3). The proofs of the results in this section are provided in the supplements.

Theorem 1 Under Assumptions A1 to A4, for any such that log 0 but for any positive , we have

5 Simulation Studies

This section conducts simulation studies to compare the performance of our proposed HMMT with the following four estimators:

(a) GMM.3: we use BW algorithm and Gaussian mixture models with 3 mixing components for modeling ;

(b) GMM.DP: similar to GMM.3, we use BW algorithm and Gaussian mixture models. However, the number of components in the Gaussian mixture will not be fixed as before. Instead, the number of components is estimated using the Dirichlet Process model as described in Ko et al. (2015);

(c) T.ND: we ignore the Markovian dependence structure and apply Tweedie’s formula. The algorithm in Fu et al. (2017) has been used to choose the tuning parameter.

(d) OR: the oracle estimator in (3) which uses knowledge about the latent parameters.

5.1 Comparison of the MSEs

In Table 1 we report the mean squared errors averaged over 50 replications. We consider 11 simulation scenarios. In each scenario, we simulate n = 2, 000 observations. The latent states are generated based on a two-states Markov chain where the transition matrix has the probability of being in-control fixed at 0.95 while the probability of being out-of-control is varied between 0.2 to 0.8. Across all the scenarios was fixed as the distribution with point mass at 0. In cases I to VI, the out-of-control prior was generated from the following 6 densities:

(a) uniform distribution with support on ; (b) Asymmetric triangle distribution on 30] with mode at 6; (c) Levy distribution with scale 7 on ; (d) Non-central Chi-square distribution with 3 degrees of freedom and noncentrality parameter at 2;

(f) Burr distribution with location 0, scale 2 and shape parameters 2 and 0.5. In cases VII to XI, we study the performance of the estimators when is generated from mixture of the above densities.

The following observations can be made based on the simulation results.

• Across all the regimes, our proposed HMMT estimator significantly improves the non-parametric shrinkage estimator T.ND which does not use dependent structure, which shows the importance of taking the dependence structure into account.

• In many cases, incorporating Markov structure result in efficiency gain even when the model is mis-specified.

• The HMMT estimator also improves on the Gaussian mixture model based estimation strategies and has error rates quite close to that of the Oracle estimator in (3). GMM.3 sometimes has higher MSE than even the non-parametric shrinkage estimator T.ND that does not use dependent structure. This reflects the usefulness of using Kernel density based non-parametric approach in HMMT.

5.2 Comparison of the estimated f1’s

When implementing the proposed HMMT method, the estimate GMM.DP has been used in the initialization step. This indicates that, in the cases where HMMT improves significantly over GMM.DP, the proposed algorithm employs that is sufficiently far away from the Gaussian mixture family. This would typically happen when the distribution of out-of-control averages is heavy-tailed. In Figure 2, the estimated across these simulation regimes are displayed for the case 8. Across all the studied regimes, the HMMT estimator is evidently smoother and its shape is closer to the truth.

Fig. 2 From top to bottom (by columns), we have the estimated density ˆfrom GMM.DP (blue) and HMMT (red) across the eleven cases considered in Table 1. The true is plotted in black.

Table 1 As the probability of remaining out of control 1 and the out of control prior MSE of the Tweedie estimator that does not use dependence structure (T.ND), conventional BaumWelch algorithm using Gaussian mixture models with 3 mixing components (GMM.3) and with the number of components estimated by using Dirichlet Process model (GMM.DP) are reported along with the performance of our proposed HMMT estimator and that of the Oracle estimator in (3).

6 Real data analysis

This section we illustrate our methodology by applying it to several real data analyses.

6.1 Copy number variation

We analyze the IMR103 data in Sharp et al. (2006), which are displayed in Figure 3. The gene copy number is the number of copies of a particular gene in the genotype of an individual. It is widely believed that in healthy cells, the average copy number should be 2. A shift away from 2 is a genomic disorder and is usually related to certain disease. It is clear from the left panel of Figure 3 that in the region from 100000 to 110000, there is a shift away from 0 in ratio. We take and model the data using an HMM. The two hidden states 0 and 1 can be interpreted as healthy/unhealthy genes. The noise variance is estimated as the sample variance of the the first 10000 data (0.183). The right panel of Figure 3 shows the histogram of the first 10000 data along with the density function of N(0, 0.1832). We will take this as .

Fig. 3 Left panel: The IMR103 data, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.Right panel: Histogram of the first 10000 genes’ mean BAC log 2 ratio.

For the data analysis, we focus on the data from position position 111000 to 112999 (Call it UUU) and from position 1113000 to 117999 (Call it VVV). For each data in UUU we add a random noise 1832). Define , For each data in VVV we also add a noise from the same distribution 1832). Let and . Under this construction, and are independent with the same mean. We will construct the estimators based on , and estimate the mean vector of (call it ˆ). And use the average of as prediction error. The plot of UUU and VVV are shown in figure 14 and figure 15 respectively.

Fig. 4 Left panel: The IMR103 data gene number 111000 to 112999, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.Right panel: The IMR103 data gene number 113000 to 118000, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.

The estimated using the Gaussian mixture model and our non-parametric method are displayed in Figure 5. Use the sample splitting method we discussed in Section 3.2, the bandwidth is chosen to be 0.28. Finally we apply the GMM.DP and HMMT methods to the data set. The prediction error for GMM.DP is 0.130, whereas the prediction error of the proposed HMMT is 0.121. This illustrates the benefit of using the nonparametric approach to modeling .

Fig. 5 Left panel: Estimated for mean BAC log 2 ratio using GMM. Right panel: Estimated for mean BAC log 2 ratio using HMMTweedie.

6.2 Internet search trend

This section applies the proposed methods for analysis of search trend data. The key word of interest is “NBA”. The data are collected from Google Trend for the period 08/29/2013 to 05/28/2018. According to Google, “Numbers represent search interest relative to the highest point on the chart for the given region and time. Hence a value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.” Because of the increasing accessibility of the Internet, if we directly take the time interval to be from 08/29/2013 to 05/28/2018, there will be a clear increasing trend. To adjust for that, we collect the data by taking the time window to be 4 months. Since the numbers are relative, homoscedastic errors seem to be a reasonable assumption. Figure 6 display the search trend data and its histogram. Bottom panel of Figure 6 shows the histogram of the data with value less than 30, where the line represents the density function of 13

Fig. 6 Top left panel: Plot for the search interest score, x-axis is time, y-axis is the google search interest score for the keyword "NBA". Top right panel: Histogram of the search interest score for the keyword "NBA". Bottom panel: Histogram of the search interest score (<30) for the keyword "NBA"

The two hidden states 0 and 1 can be interpreted as no event/event. An event could be an important game (all-stars,finals etc.). To estimate the homoscedastic error, we compute the standard deviation of the data from 08/29/2013 to 10/02/2013, since it is during the off-season, the search trend should be stable. It turns out the standard deviation is around 3. We are primarily interested in the time when search interest is high. Thus we use to model those time when search interest is low. By inspection, We take to be the density for 13 . The estimated is displayed in Figure 7. For the proposed HMMT method, the bandwidth is chosen to be 3.

Fig. 7 Left panel: Estimated using GMM. Right panel: Estimated using HMMTweedie

Using the estimated transition matrix and , we can calculate the expected proportion of days when the search interest is greater than 80. The Gaussian mixture model gives an estimate of 0.0294, the non-parametric model gives an estimate of 0.0266, whereas the proportion obtained from the data is 0.0243. We can see that the HMMT method provides a better estimate.

6.3 Change in unemployment rate

Finally we illustrate the method using the unemployment data. Left panel of Figure 8 is a plot of monthly unemployment rate change from February 1948 to August 2018. The data is from U.S. Bureau of Labor Statistics website. Since we are only considering the changes of unemployment rate, it is reasonable to assume that the errors are homoscedastic.

Fig. 8 Left panel: Plot of unemployment rate change from February 1948 to August 2018. x-axis is time, y-axis is unemployment rate change. Right panel: Histogram of of unemployment rate change from February 1948 to August 2018

We can see from the left panel of Figure 8 that most of the time the change in unemployment rate is close to 0, and dramatic change often appears in clusters, exhibiting dependence structure that can be reasonably described by an HMM.

Suppose we are interested in the months where there is a big increase in unemployment rate. Economists and policy makers may want to focus on those times and try to figure out the possible causes to avoid increase in unemployment rate in the future. The right panel of Figure 8 shows a histogram of the unemployment rate change when it is 2%, together with the density function for 0. We can see the density function matches the data quite well, therefore, we assume to be 0The estimated is given in Figure 9. For the proposed HMMT, we choose the bandwidth to be 0.15. The Gaussian mixture model assumes there are 6 components. However, as we can see from the histogram, the mixture model is inadequate. By contrast, the proposed nonparametric model describes the data quite well; this clearly shows the flexibility and benefit of the HMMT method.

Fig. 9 Left panel: Estimated for unemployment rate change using GMM. Right panel: Estimated for unemployment rate change using HMMTweedie

Next we apply our GMM.DP and HMMT to analyze the unemployment rate data (monthly, seasonal adjusted, January 1960 to June 2018), which are displayed in Figure 10. The data are obtained from the website of Federal Reserve Bank of St. Louis. Top right panel of Figure 10 presents the estimated true change of unemployment rate using a Gaussian mixture model. Bottom panel of Figure 10 is the estimate obtained using the proposed HMMT. We can see that HMMT has a clear “de-noise” effect on the data. The HMMT estimate is less scattered and more robust than GMM.DP. This can help researchers and policy makers to have a better idea of the true change of unemployment rate, and focus on the studies on the policies over the months that have a real impact on the unemployment rate.

Fig. 10 Top left panel: Plot of change of unemployment rate January 1960 to June 2018. Top right panel: Estimated true change of unemployment rate January 1960 to June 2018 using GMM. Bottom: Estimated true change of unemployment rate January 1960 to June 2018 using HMMTweedie.

7 Discussion

We developed a non-parametric estimator for estimating mean vector under Markovian dependence. It will be interesting to introspect if following the bivariate kernel density approach in Fu et al. (2017) the proposed methodology can be extended to heteroskedastic Markov set-ups with known variances. Developing shrinkage algorithms when the variance in the observations is unknown and needs to jointly estimated from the data will be important. Estimation in multivariate HMMs where the observed outputs ’s are vectors finds wide applications (Kale et al., 2003; Fiecas et al., 2017). In this context it will be useful to study shrinkage estimation particularly in the presence of latent structual properties on the covariances.

8 Acknowledgements

G. Mukherjee’s work was partly supported by the NSF grant DMS-1811866. W. Sun’s work was partly supported by the NSF grant DMS-1712983.

References

Ahmed SE, Reid N (2012) Empirical bayes and likelihood inference, vol 148. Springer Science & Business Media

Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The annals of mathematical statistics 41(1):164–171

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc B 57:289–300

Bickel PJ, Ritov Y, et al. (1996) Inference in hidden markov models i: Local asymp- totic normality in the stationary case. Bernoulli 2(3):199–228

Bickel PJ, Ritov Y, Ryden T, et al. (1998) Asymptotic normality of the maximum- likelihood estimator for general hidden markov models. The Annals of Statistics 26(4):1614–1635

Brown LD, Greenshtein E (2009) Nonparametric empirical bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics pp 1685–1704

Brown LD, Greenshtein E, Ritov Y (2013) The poisson compound decision problem revisited. Journal of the American Statistical Association 108(502):741–749

Carlin BP, Louis TA (2010) Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC

Churchill GA (1992) Hidden markov chains and the analysis of genome structure. Computers & chemistry 16(2):107–115

Dicker LH, Zhao SD (2016) High-dimensional classification via nonparametric empirical bayes and maximum likelihood inference. Biometrika 103:21–34

Efron B (2011) TweedieâĂŹs formula and selection bias. Journal of the American Statistical Association 106(496):1602–1614

Efron B, Zhang NR (2011) False discovery rates and copy number variation. Biometrika 98(2):251–271

Fiecas M, Franke J, von Sachs R, Tadjuidje Kamgaing J (2017) Shrinkage estima- tion for multivariate hidden markov models. Journal of the American Statistical Association 112(517):424–435

Fourdrinier D, Strawderman WE, Wells MT (2017) Shrinkage Estimation. Springer

Fu L, Sun W, Gareth J (2017) Nonparametric empirical bayes tweedieâĂŹs estimator for normal means with heteroscedastic errors. Manuscript

Fu L, Gang B, James GM, Sun W (2019) Information loss and power distortion from standardizing in multiple hypothesis testing. arXiv preprint arXiv:191008107

James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, vol 1, pp 361–379

Jiang W, Zhang CH, et al. (2009) General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics 37(4):1647–1684

Jiang Y, Oldridge DA, Diskin SJ, Zhang NR (2015) Codex: a normalization and copy number variation detection method for whole exome sequencing. Nucleic acids research 43(6):e39–e39

Johnstone IM (2012) Gaussian estimation: Sequence and wavelet models, available at: "http://www-stat.stanford.edu/~imj"

Kale A, Chowdhury AKR, Chellappa R (2003) Towards a view invariant gait recog- nition algorithm. In: null, IEEE, p 143

Ko SI, Chong TT, Ghosh P, et al. (2015) Dirichlet process hidden markov multiple change-point model. Bayesian Analysis 10(2):275–296

Koenker R, Mizera I (2014) Convex optimization, shape constraints, compound de- cisions, and empirical bayes rules. Journal of the American Statistical Association 109(506):674–685

Rabiner LR (1989) A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286

Robbins H (1955) An empirical Bayes approach to statistics. Office of Scientific Research, US Air Force

Saha S, Guntuboyina A (2017) On the nonparametric maximum likelihood estimator for gaussian location mixture densities with application to gaussian denoising. arXiv preprint arXiv:171202009

Sharp AJ, Hansen S, Selzer RR, Cheng Z, Regan R, Hurst JA, Stewart H, Price SM, Blair E, Hennekam RC, et al. (2006) Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nature genetics 38(9):1038

Silverman BW (2018) Density estimation for statistics and data analysis. Routledge

Sun W, Cai T (2009) Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2):393–424

Sun W, Reich BJ, Cai TT, Guindani M, Schwartzman A (2015) False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(1):59–83

Wasserman L (2006) All of nonparametric statistics. Springer Science & Business Media

Wei Z, Sun W, Wang K, Hakonarson H (2009) Multiple testing in genome-wide association studies via hidden markov models. Bioinformatics 25(21):2802–2808

Yang F, Balakrishnan S, Wainwright MJ (2015) Statistical and computational guaran- tees for the baum-welch algorithm. In: Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, pp 658–665

Supplement

Here, we provide the proofs of the results in Section 4.

Proof of Lemma 2

Note that it is sufficient to bound The result then follows from the triangle inequality. Let B be the event that , throughout the proof we assume B holds, as by Hoeffding’s inequality, . Write ˆ, and then

We first bound . Write . It is clear that and are both of order . Hence . Next consider . We have

Similarly It follows from Cauchy-Schwarz inequality that Therefore and . Using the fact that, we have

Note that

Assumption A1 implies for some 0. Hence for every j we can focus on its log n neighborhood, it follows that

3 from Fu et al. (2019), together with the Markov structure, we have

Hence log n). Note that It

follows that . By Lemma 4 from Fu et al. (2019) and the Markov structure we also have log n). Hence,

The lemma follows.

Proof of Theorem 1

First, notice that as a consequence of A1, Thus the conditions of theorem 1 in Brown and Greenshtein, 2009 are satisfied. Our theorem is an adaption of theorem 1 in Brown and Greenshtein, 2009. The proof follows the same logic, we provide a sketch.

We will show for any >0. Let . Note that it follows from the proof of Lemma 2 that ˆ. This fact together with Lemma 2 in Brown and Greenshtein (2009) implies we only need to show

Brown and Greenshtein, 2009 we first show for an independent sample ˜XXX = ( ˜˜˜we have

We can write ˆ. Then on the region R defined in the proof of lemma 3 in in Brown and Greenshtein, 2009,

A consequence of Lemma 2 is that 2 as . We only need to bound The variance of . The variances of are the variances of the density estimators ˆand ˆ. Notice that the variances of and

Since 0, then as ,

Use the same argument as in the proof of Lemma 3 in in Brown and Greenshtein, 2009, (13) follows. To show (12), we use the same argument in Brown and Green- shtein, 2009 . Since the Markov structure at most contribute a factor of log n, the conclusion is not affected. (12) and Lemma 2 in Brown and Greenshtein, 2009 together with Cauchy-Schwarz inequality proves the theorem.

Designed for Accessibility and to further Open Science