My stuff
Large-Scale Shrinkage Estimation under Markovian Dependence
1 Introduction
1 Introduction

Shrinkage is a useful notion that provides an elegant and powerful framework for compound estimation problems. The topic has been extensively studied since the celebrated work of James and Stein (1961). A plethora of influential results obtained by various researchers show that shrinkage often leads to substantial improvements in the performances of conventional estimators. The optimal directions and magnitudes of shrinkage estimators in parametric models have been extensively studied in the literature, see Fourdrinier et al. (2017); Johnstone (2012); Ahmed and Reid (2012); Carlin and Louis (2010) for reviews of related topics. In real-world applications, parametric families of estimators often have limited usage. Nonparametric shrinkage methods, exemplified by Tweedie’s formula (Efron, 2011), have received renewed attention in large-scale inference problems where thousands of parameters are estimated simultaneously; see Brown and Greenshtein (2009); Jiang et al. (2009); Koenker and Mizera (2014); Saha and Guntuboyina (2017); Dicker and Zhao (2016) for some important recent developments.

One essential limitation of existing non-parametric shrinkage methods is that most assume that observations are independently sampled from an underlying distribution. However, observations arising from large-scale estimation problems are often dependent. Ignoring the dependence structure may result in significant loss of effi-ciency and invalid inference. This article aims to develop non-parametric shrinkage

Bowen Gang University of Southern California, 3620 S. Vermont ave, Los Angeles, CA 90089, USA e-mail:


Gourab Mukherjee University of Southern California, 3670 Trousdale Pkwy, Los Angeles, CA 90089, e-mail:


Wenguang Sun University of Southern California, 3670 Trousdale Pkwy, Los Angeles, CA 90089, e-mail:


estimators under the widely used hidden Markov model and show that the efficiency of existing nonparametric methods can be greatly improved by incorporating the Markovian dependence structure.

Consider the problem of simultaneous estimation of a sequence of dependent parameters that are generated from a Hidden Markov Model (HMM) (Rabiner, 1989; Churchill, 1992). Markovian dependence provides a powerful tool for modeling data arising from a wide range of modern scientific applications where parameters of interest are spatially or temporally correlated. This article focuses on an important class of HMMs that have two underlying states: one being the default (null) state where the process is in-control; the other being the abnormal (non-null) state where the process is out-of-control. The observed data are independent conditional on the unknown states. The two-state HMM is a popular model that can be used to describe many data sets collected from various applications. For example, in estimating the Copy Number Variations (CNV) across a genome (Efron and Zhang, 2011; Jiang et al., 2015), there are regions with high and low variations that can be well described by the non-null and null states, respectively. In event analysis application, such as estimating hourly social media trends of a marketing campaign there are dormant (null) and active (non-null) states which can be adaptively incorporated into an HMM for building a better tracker.

To the best of our knowledge, the important problem of nonparametric shrinkage estimation under dependence has not been studied in the literature. The key idea in our methodology is to combine the ideas in the elegant Tweedie’s formula (Brown and Greenshtein, 2009; Jiang et al., 2009; Efron, 2011) and the forward-backward algorithms in HMMs to infer the underlying states, which is further utilized to facilitate fast and robust estimation of the unknown effect sizes (mean parameters) under Markovian dependence. We establish decision-theoretic properties of the proposed estimator and exhibit its enhanced efficacy over popular shrinkage methods developed under the independence assumption. In contrast with existing algorithms in HMMs that proceed with pre-specified families of parametric densities (e.g. Gaussian mixtures), our method is nonparametric and capable of handling a wider class of distributions. Through extensive numerical experiments, we demonstrate the superior performance of our proposed algorithm over existing state-of-the-art methods.

The chapter is organized as follows: in Section 2 we formulate the problem, develop an oracle estimator and compare this estimator with classical shrinkage estimators that do not incorporate HMM structure. In Section 3, we propose a data-driven procedure that mimics the oracle estimator and discuss how to overcome the new difficulties and challenges in the non-parametric approach. In Section 4, we establish theoretical properties of our proposed estimator and show that the nonparametric approach offers substantial efficiency gain over HMM algorithms employing Gaussian mixture models. In Section 5, we present numerical experiments to compare our estimator with existing methods. Section 6 applies the proposed method to three real data examples. Section 7 concludes the article with a discussion on interesting extensions.

Consider a two-state hidden Markov model. Suppose we are interested in estimating the mean vector  µµµ = (µ1, · · · , µn)based on observed output  XXX = (X1, · · · , Xn). In HMMs, the observed data XXX can be viewed as a contaminated version of the effect sizes  µµµ, where the contaminations are white noises following a zero-mean normal distribution with known variance  σ2. The HMM further assumes that  µis are independently distributed conditional on the unknown states  θθθ = (θ1, · · · , θn) ∈{0, 1}n, where  θiare Bernoulli variables forming a Markov chain. The latent state [θ = 0]usually represents the null or the in-control state and  [θ = 1]corresponds to the non-null or the out-of-control state. For example, in CNV studies,  Xiis the observed number of repeats in the genome,  [θi = 0]represents healthy part of genome, and  [θi = 1]represents the part of the genome that is susceptible to perturbation in diseased patients.

2.1 Model and notations

Formally, the data generation process can be described by a hierarchical model:


In practice, we often assume that the null distribution  g0is either known or can be modeled by a pre-specified parametric family1. By contrast, we do not specify any parametric forms on  g1because we usually do not have sufficient knowledge on the non-null process  g1, which may be difficult to describe using any pre-specified parametric families. The setting is reasonable for a wide range of application scenarios.


tion probabilities


The above probabilities are unknown and obey standard stochastic constraints 0 < ajk < 1, aj0 + aj1 = 1.Figure 2.1 presents a schematic diagram of the model. We aim to find a decision rule ˆµˆµˆµfor estimating the unknown mean vector  µµµ. The goal is to minimize the Bayes risk  R = E{l(ˆµµµ, µµµ)}, where  l(ˆµµµ, µµµ)is the mean squared error (MSE, or  l2loss):


Fig. 1 Schematic diagram of the data generation process. We observe  xis only and would like to estimate the  µis which are related through the Markov chain  {θ1, θ2, · · · , θn }.

Let  Λ = (A, (ψ0, ψ1), f0, f1)denote the HMM parameters, where A is the transition matrix,  (ψ0, ψ1)is the initial distribution, and


We use  φσto denote the normal density with mean 0 and variance  σ2with the suffix dropped for standard normal density.

2.2 Oracle estimator under independence

Ignoring the dependence structure, the optimal solution that minimizes the Bayes risk is given by Tweedie’s formula:


and f is the marginal distribution of all the observed X. The formula first appeared in Robbins (1955), who attributed the idea to Maurice Kenneth Tweedie. Tweedie’s formula can be implemented using empirical Bayes methods for constructing a class of non-parametric estimators (Brown and Greenshtein, 2009; Efron, 2011). The crucial observation is that it works directly with the marginal distribution, which is in particular attractive in large-scale estimation problems since f and  f ′can be well estimated using standard density estimation techniques (Silverman, 2018).

2.3 Oracle Estimator under HMM dependence

Our idea is to extend Tweedie’s formula to correlated observations under the HMM dependence. This can effectively increase the statistical efficiency by borrowing strength from adjacent observations. It has been shown in the multiple testing literature that exploiting the dependence structure can greatly improve the power of existing false discovery rate (Benjamini and Hochberg, 1995) methods (Sun and Cai, 2009; Wei et al., 2009; Sun et al., 2015). We further show in this article that dependence structure is also highly informative and can be exploited to improve the accuracy of shrinkage estimators.

Our methodological development is divided into two steps. We first assume that an oracle knows the hidden states and study the oracle rule. Then we discuss the case when the states are unknown and propose data-driven methods to emulate the oracle rule. If the states  θiare known, then it is natural to apply Tweedie’s formula to two separate states:


The oracle estimator (3) ˜µiOR(xxx)provides a benchmark in shrinkage estimation, i.e. the theoretically achievable lower limit of the estimation risk.

Next we consider a “weaker” oracle estimator that only knows  Λ, the collection of hyper-parameters in the HMM. Then the optimal solution is given by the next lemma.

Lemma 1 Consider Model (1) and assume that HMM parameters are known. Then the estimator that minimizes the MSE is given by


The proof of lemma is straightforward and hence omitted. The formula (4) can be viewed as an extension of Tweedie’s formula under HMM dependence. The next section considers the case where HMM parameters are unknown. We shall develop data-driven estimators and computational algorithms to emulate the Bayes estimator (4).

We discuss how to estimate  f0, f1and  P(θi = k|xxx)to implement the Bayes estimator. The proposed estimator is a non-parametric Tweedie-based shrinkage estimator under dependence (TD). In light of (4), we consider a class of estimators in the form:


where ˆpiare estimates of the true conditional probability  P(θi = 1|X), and ˆf0and ˆf1are estimates of  f0and  f1. Next we describe an algorithm for constructing estimates of the unknown quantities in (5).

3.1 The modified Baum-Welch algorithm

The most well-known method for constructing estimators of the form (5), under the conventional setting with pre-specified parametric families, is the Baum-Welch (BW) algorithm (Rabiner, 1989; Yang et al., 2015; Baum et al., 1970). We emphasize that the BW algorithm must proceed with user specified parametric forms for  fk’s. However, while  f0, the in-control  (θ = 0)distribution, can be well modeled by parametric densities, the out-of-control observations may not be easily approximated by parametric densities. For example, the Gaussian mixture model with a fixed number of components is not suitable for approximating very heavy tailed  f1. To overcome the issue, we describe an HMM-based Tweedie (HMMT) estimator that employs parametric  f0and non-parametric  f1. The essential component in our estimator is a generalized BW algorithm that updates the estimates of  f0and  f1iteratively based on posterior probability estimates of the latent states.

To avoid the identifiability issues, assume  f0is unimodal and  π0 > 0.5. These assumptions are usually reasonable in applications. In our HMMT estimator,  f0is assumed to be a Gaussian density  φ[ν, τ]with possibly unknown location  νand scale  τ. The Gaussian model can be generalized to Gaussian mixtures as well as other parametric families without essential difficulty. In contrast with existing HMM algorithms that utilize parametric assumptions on  f1, we consider a nonparametric approach to estimating  f1. The proposed methodology employs a class of weighted kernel density estimators


where �ni=1 wi =1, h is the bandwidth and  K(·)is a standard Gaussian kernel func- tion. Other choices of kernel such as the Cauchy kernel can also be used particularly in applications that are sensitive to the tails of the density.

With an initial choice of  A(0), ˆf (0)0and ˆf (0)1, the forward-backward propagation steps of the Baum-Welch algorithm provides updated estimates of the posterior probabilities ˆp(0)i. These estimates can be utilized to update ˆf1using new weights w(0)i = ˆp(0)i /�ni=1ˆp(0)iand ˆf0:


The process is repeated until convergence.

3.2 Choice of h

Choosing an appropriate bandwidth h is crucial for constructing an efficient estimator. If h is too small, then we will end up with estimating  µiby the naive estimator xi. If h is too large, then it will be difficult to identify ˆf1, which leads to a severely biased and inaccurate estimator. This section describe a cross validation method for choosing the tuning parameter h.

The first step is to split the observed sample XXX into (UUU,VVV) by artifically adding independent Gaussian noise:  UUU = XXX + αZZZ, VVV = XXX − (1/α)ZZZ, where  ZZZ d= N(0, In). Note thatUUU,VVV are normally distributed random effects with same meanθθθbut different variances  σ2u = (1 + α2)σ2and  σ2v = (1 + α−2)σ2. By construction, UUU and VVV are mutually independent conditioned on  θθθ.

The key idea in the second step is to use UUU for estimation and VVV for bandwidth selection. One can think of  αas a measure of how much data we use for estimation (Brown et al., 2013). When  α =0, the entire data set is used for estimation and none for hyper-parameter calibration. This article uses  α =10%.

The steps are repeated for a list of prefixed h values yielding estimates of the posterior probabilities  { ˆpi[h]: 1  ≤ i ≤ n}of the latent states and marginal densities estimates ˆf1[h]and ˆf0[h]. For each h value, we estimate the MSE as


where, ˆq1,j[h] = ˆp1[h]and ˆq0, j[h] = 1 − ˆp1[h]. The optimal value of bandwidth is


Subsequently, the HMMT estimator is computed by running the generalized BW algorithm with the bandwidth set at ˆh and  α =0.

3.3 Initialization and the HMMT estimator

Another important caveat about Algorithm 1 is the initialization step. To prevent the algorithm getting trapped in a local maximum, a good initialization is important. For this purpose we use the algorithm developed in Ko et al. (2015) for estimation in an HMM with unknown number of changing points. As the probability of the process being in control  π0 ≥ 0.5, ν(0)is estimated by the mode of the component with the largest probability which is subsequently attributed to ˆf (0)0. The remaining states as estimated via a Dirichlet process model contributes towards the out-of-process marginal densities ˆf (0)1. It was found that such an initialization produced reasonable values of ˆA(0). Henceforth for simplicity, we assume  σ =1. The detailed estimation procedure is summarized by Algorithm 1 below.

Algorithm 1 HMM-Tweedie Estimator


2: Intialization Use the algorithm in Ko et al. (2015) to get estimates  Λ(0) = { ˆA(0), ψ(0)0 , ψ(0)1 , ˆf (0)0 , ˆf (0)1 }.3: E step: For  t ≥1, compute the following in the t iteration using the previous iteration estimates.


5: Iterate until the parameters and the non-parametric density estimate converges. 6: Repeat steps 3-5 for a list of h values and store the final marginal densities estimates ˆf0[h], ˆf1[h]

and the posterior probabilities  { ˆpi[h](j) : 1 ≤ i ≤ n, 0 ≤ j ≤ 1}of the latent states. 7: Choose bandwidth ˆh = argminh �mse[h] where,


3.4 Operational characteristics of the new algorithm

This section discusses the operation characteristics of Algorithm 1 that distinguishes our work from existing ones. The discussions provide intuitions that are helpful for our later theoretical analysis.

Algorithm 1 includes several modifications to the conventional Baum-Welch algorithm of Baum et al. (1970). For understanding the crucial differences first consider  α =1, i.e, no sample splitting. The average log-likelihood for complete data  {xxx,θθθ}assuming  θθθknown is:


When  θθθis unknown, interchanging logarithm and summation above, we iteratively maximize the following lower bound ˜ln(Λ(t)|Λ(t−1))of  ln(Λt),


where,  nHn(Λ(t−1)) = −Eθθθ |XXX,Λ(t−1)log  P(θθθ|XXX, Λ(t−1))is the entropy, and  Qndecouples as  Qn = Q1,n + Q2,nwith


and  n Q2,n(Λ(t)|Λ(t−1)) = �ni=1 Eθi |XXX,Λ(t−1)log  P(xi|θi−1, f1, f0). We iteratively max- imize  Qn(Λ(t)|Λ(t−1))over  Λ(t)based on previous iterate  Λ(t−1). If the posterior probabilities for  [θi = 1]are updated to ˆp(t)ithen,  n Q2,n(Λt|Λ(t−1))equals


Although  Q2,nis concave in  (wt1, ..., wtn), it is not strongly concave. Updating  wtjby maximizing  Q2,nover the  n −1 dimensional hyperplane as  n → ∞would not yield good solutions. We interchange the logarithm and summation in the first term in Q2,nabove. The resultant ˜Q2,nhave a closed form maxima at  wtj = ˆp(t)j /�ni=1ˆp(t)i.


backward procedure (Rabiner, 1989) is used to maximize  Q1,nand produce updated posterior probabilities ˆp(t)i. For i = 1, . . ., n and j, k = 0, 1 the forward variable: α(t)i (j) = PΛ(t−1)(X1 = x1, ..., Xi = xi|θi = j)and the backward variable:  β(t)i (j) =PΛ(t−1)(Xi+1 = xi+1, ..., Xn = xn|θi = j)are computed. and the posterior probabilities are updated as:


The objective  Q1,n +˜Q2,nis bounded above and increasing at each step, therefore, our algorithm will converge. In particualr, as  n → ∞, the true solution is a fixed point of the algorithm. This insight is crucial for establishing the theoretical properties for the proposed HMMT estimator.

For establishing risk properties of our proposed shrinkage estimators, we consider the following assumption that is popularly imposed for theoretical analysis in HMM starting from the pioneering works of Bickel et al. (1996, 1998):

Assumption A1. The hidden states  θθθform a stationary ergodic Markov chain and the transition probabilities satisfy 0  < a00, a11 <1. The Markov chain begins in its stationary state and is reversible.

To understand the risk properties of the proposed estimator ˆµµµHMMT, we consider the following two quantities:


which are helpful for understanding the behaviors of the proposed estimate of  f1. Note that  f or1,n[h](u)is an estimator that cannot be implemented in practice, since it uses knowledge of unknown  θθθ. By contrast, ˆf1,n[h]is a practical estimator that substitutes ˆpi,nin place of  θθθ. ˆpi,n, which can be conceptualized as the estimates of posterior probabilities  P(θi = 1|XXX), can be obtained as the outputs from the proposed algorithm.

Consider ˆpi,ndefined by (8) at an arbitrary iterative step. Barring sample splitting, the estimate of  f1in every iterative step of Algorithm 1 has the functional form of ˆf1,n[h]. We consider a class of  f1that are smooth, bounded and  γ-Holder continuous.

Assumption A2.  f1comes from a smooth class of functions and has bounded support.  f1is  γ-Holder continuous, i.e.,  f1 ∈ Hγ, where


It can be seen (cf. Ch. 6 of Wasserman, 2006) that the squared  L2distance of the oracle kernel density estimator  f or1,n[h]from the true  f1,


has the following asymptotic expected value:


The intuition is that, under assumption A1, ˆf1,n[h]can be decomposed as


where the residual ˆRn[h]is asymptotically negligible as  n → ∞provided ˆpi,nare asymptotically unbiased estimates of  θi. Theory underpinning this intuition is rigorously established in the proof of the next lemma. Hence


where  ψ1 = P(θi = 1)is a fixed probability whose existence is ensured the stationarity of the Markov chain (Assumption A1). We summarize the above discussions in the lemma below.

Lemma 2 If  f1is  γ-Holder continuous and has bounded support then under Assumption A1, the integrated  L2Bayes risk of non-parametric density estimators of the form (8)-(9) for any h > 0 is


In particular, if ˆp1,nare unbiased and  hon = (log  n/n)1/(2γ+1)then,


The above results shows that with good posterior probability estimates, the algorithm 1 provides consistent non-parametric estimation of  f1. As  f0is parametrized as a Gaussian density, its estimation consistency also subsequently follows. The free parameters in  Λare  A = (a00, a11)and  ν, τand  f1. For understanding the evolution of the estimates in Algorithm 1 from the initial solutions towards the true  Λ∗, note that under Assumption A1, the population criteria  Q1(Λ|˜Λ) :=limn→∞ Q1,n(Λ|˜Λ)and ˜Q2(Λ|˜Λ) :=limn→∞˜Q2,n(Λ|˜Λ)are well-defined. Suppose our initial solution  Λ0nis in the vicinity of  Λ∗such that with estimates ˆf (1)0,nand ˆf (1)1,nthat are improvements over the initial estimates ˆf (1)0,nand ˆf (1)1,nin the sense d( ˆf (1)1,n, f ∗1 ) < d( ˆf (0)1,n, f ∗1 )and  d( ˆf (1)0,n, f ∗0 ) < d( ˆf (0)0,n, f ∗0 ), maximizing  Q1(A| ˆf (1)0,n, ˆf (1)1,n)and the subsequent application of (8) produces better estimates ˆp(1)i,nof the true posterior probabilities such that ˆf (2)1,ngiven by (9) is better than ˆf (1)1,nwith  d( ˆf (2)1,n, f ∗1 ) <d( ˆf (1)1,n, f ∗1 ). If this feature of the evolution is maintained over the successive iterations, the bias in the posterior probability estimates decrease in every step and ˆf (1)0,nand


1,nwill ultimately converge to the true  f ∗0and  f ∗1respectively at the rate given in Lemma 2. Having a good initial solution that has negligible asymptotic bias, which implies that successive iterations would have the aforementioned properties, is hence essential for successful convergence in Algorithm 1. Our assumption on the properties related to initialization is given next.

Assumption A3. The initial solution  A0, f (0)0,  f (1)1is in the neighborhood of the true solution. The behaviour of the population log-likelihood in this neighborhood is such that for all t, the posterior estimates ˆppp(t)based on  A(t) =argmaxA Q1(A| f (t−1)0 , f (t−1)1 )produce estimates  f (t)1and  f (t)0in Algorithm 1 which satisfy: maxA Q(A, f (t)0 , f (t)1 ) >maxA Q(A, f (t−1)0 , f (t−1)1 ).

The results in Lemma 2 establish the risk optimality of our proposed ˆµµµHMMT. To facilitate a universal optimality statement that is appropriate for both nonsparse and sparse regimes, we impose the following assumption on  µµµsimilar to Brown and Greenshtein (2009):

Assumption A4. For any fixed positive  δ, we have  n−δsupi=1,...,n |µi| →0 as n → ∞.

Remark 1 Based on our set-up in Section 2.1 and Assumption A1, the above condition is equivalent to restricting the support of the priors  g1and  g0to  [−Sn, Sn]where n−δSn →0 as  n →0 for any positive  δ.

For any fixed h > 0, let ˆpi[h], ˆf1[h]and ˆγ[h]be the final estimates of the  P(θi|XXX), f1and the mode of  f0respectively, from algorithm 1. Recall that as  σ =1, the estimates of the mean when the process is out of control is


Barring sample splitting (using  α =0, not steps 7 and 8 in algorithm 1) the HMMT estimator is ˆµHMMTi [h] = (1 − ˆpi[h])ˆν[h] + ˆpi[h] ˆµOCi [h]. As the HMMT estimator involves ratio of estimates ˆf ′and ˆf , the approximation error bound in Lemma 2 on ˆf does not trivally lead to optimal risk properties of the HMMT estimator. For achieving asymptotically optimal risk performance, consider a modified robust version of the HMMT estimator that truncates very high out-of-control mean values:


The following result akin to Theorem 1 of Brown and Greenshtein (2009) shows that ˆµµµTis asymptotically oracle optimal as it can achieve the mean square error of the oracle shrinkage estimator defined in (3). The proofs of the results in this section are provided in the supplements.

Theorem 1 Under Assumptions A1 to A4, for any  hnsuch that  hnlog  n →0 but nδhn → ∞for any positive  δ, we have


This section conducts simulation studies to compare the performance of our proposed HMMT with the following four estimators:

(a) GMM.3: we use BW algorithm and Gaussian mixture models with 3 mixing components for modeling  f1;

(b) GMM.DP: similar to GMM.3, we use BW algorithm and Gaussian mixture models. However, the number of components in the Gaussian mixture will not be fixed as before. Instead, the number of components is estimated using the Dirichlet Process model as described in Ko et al. (2015);

(c) T.ND: we ignore the Markovian dependence structure and apply Tweedie’s formula. The algorithm in Fu et al. (2017) has been used to choose the tuning parameter.

(d) OR: the oracle estimator in (3) which uses knowledge about the latent parameters.

5.1 Comparison of the MSEs

In Table 1 we report the mean squared errors averaged over 50 replications. We consider 11 simulation scenarios. In each scenario, we simulate n = 2, 000 observations. The latent states  θθθare generated based on a two-states Markov chain where the transition matrix has the probability of being in-control  A00fixed at 0.95 while the probability of being out-of-control  A11is varied between 0.2 to 0.8. Across all the scenarios  g0was fixed as the distribution with point mass at 0. In cases I to VI, the out-of-control prior  g1was generated from the following 6 densities:

(a) uniform distribution with support on  [−9, 9]; (b) Asymmetric triangle distribution on  [−30,30] with mode at 6; (c) Levy distribution with scale 7 on  [0, ∞]; (d) Non-central Chi-square distribution with 3 degrees of freedom and noncentrality parameter at 2;


(f) Burr distribution with location 0, scale 2 and shape parameters 2 and 0.5. In cases VII to XI, we study the performance of the estimators when  g1is generated from mixture of the above densities.

The following observations can be made based on the simulation results.

• Across all the regimes, our proposed HMMT estimator significantly improves the non-parametric shrinkage estimator T.ND which does not use dependent structure, which shows the importance of taking the dependence structure into account.

• In many cases, incorporating Markov structure result in efficiency gain even when the model is mis-specified.

• The HMMT estimator also improves on the Gaussian mixture model based estimation strategies and has error rates quite close to that of the Oracle estimator in (3). GMM.3 sometimes has higher MSE than even the non-parametric shrinkage estimator T.ND that does not use dependent structure. This reflects the usefulness of using Kernel density based non-parametric approach in HMMT.

5.2 Comparison of the estimated f1’s

When implementing the proposed HMMT method, the estimate GMM.DP has been used in the initialization step. This indicates that, in the cases where HMMT improves significantly over GMM.DP, the proposed algorithm employs  f1that is sufficiently far away from the Gaussian mixture family. This would typically happen when the distribution of out-of-control averages is heavy-tailed. In Figure 2, the estimated  f1across these simulation regimes are displayed for the case  A11 = 0.8. Across all the studied regimes, the HMMT estimator is evidently smoother and its shape is closer to the truth.


Fig. 2 From top to bottom (by columns), we have the estimated density ˆf1from GMM.DP (blue) and HMMT (red) across the eleven cases considered in Table 1. The true  f1is plotted in black.

Table 1 As the probability of remaining out of control  A11 and the out of control prior  g1 varies,MSE of the Tweedie estimator that does not use dependence structure (T.ND), conventional BaumWelch algorithm using Gaussian mixture models with 3 mixing components (GMM.3) and with the number of components estimated by using Dirichlet Process model (GMM.DP) are reported along with the performance of our proposed HMMT estimator and that of the Oracle estimator in (3).


This section we illustrate our methodology by applying it to several real data analyses.

6.1 Copy number variation

We analyze the IMR103 data in Sharp et al. (2006), which are displayed in Figure 3. The gene copy number is the number of copies of a particular gene in the genotype of an individual. It is widely believed that in healthy cells, the average copy number should be 2. A shift away from 2 is a genomic disorder and is usually related to certain disease. It is clear from the left panel of Figure 3 that in the region from 100000 to 110000, there is a shift away from 0 in  log2ratio. We take  g0 = δ0and model the data using an HMM. The two hidden states 0 and 1 can be interpreted as healthy/unhealthy genes. The noise variance is estimated as the sample variance of the the first 10000 data (0.183). The right panel of Figure 3 shows the histogram of the first 10000 data along with the density function of N(0, 0.1832). We will take this as  f0.


Fig. 3 Left panel: The IMR103 data, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.Right panel: Histogram of the first 10000 genes’ mean BAC log 2 ratio.

For the data analysis, we focus on the data from position position 111000 to 112999 (Call it UUU) and from position 1113000 to 117999 (Call it VVV). For each data in UUU we add a random noise  ϵ1 ∼ N(0.0.1832). Define  UUU1 = UUU +ϵ1ϵ1ϵ1, For each data in VVV we also add a noise from the same distribution  ϵ2 ∼ N(0, 0.1832). Let  VVV1 = VVV +ϵ2ϵ2ϵ2and  VVV2 = VVV − ϵ2ϵ2ϵ2. Under this construction,  VVV1and  VVV2are independent with the same mean. We will construct the estimators based on  UUU1, and estimate the mean vector of  VVV1(call it ˆµµµ). And use the average of  ∥ ˆµµµ − VVV2∥22as prediction error. The plot of UUU and VVV are shown in figure 14 and figure 15 respectively.


Fig. 4 Left panel: The IMR103 data gene number 111000 to 112999, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.Right panel: The IMR103 data gene number 113000 to 118000, the x-axis is the gene number, y-axis is the mean BAC log 2 ratio.

The estimated  f1using the Gaussian mixture model and our non-parametric method are displayed in Figure 5. Use the sample splitting method we discussed in Section 3.2, the bandwidth is chosen to be 0.28. Finally we apply the GMM.DP and HMMT methods to the data set. The prediction error for GMM.DP is 0.130, whereas the prediction error of the proposed HMMT is 0.121. This illustrates the benefit of using the nonparametric approach to modeling  f1.


Fig. 5 Left panel: Estimated  f1for mean BAC log 2 ratio using GMM. Right panel: Estimated  f1for mean BAC log 2 ratio using HMMTweedie.

6.2 Internet search trend

This section applies the proposed methods for analysis of search trend data. The key word of interest is “NBA”. The data are collected from Google Trend for the period 08/29/2013 to 05/28/2018. According to Google, “Numbers represent search interest relative to the highest point on the chart for the given region and time. Hence a value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term.” Because of the increasing accessibility of the Internet, if we directly take the time interval to be from 08/29/2013 to 05/28/2018, there will be a clear increasing trend. To adjust for that, we collect the data by taking the time window to be 4 months. Since the numbers are relative, homoscedastic errors seem to be a reasonable assumption. Figure 6 display the search trend data and its histogram. Bottom panel of Figure 6 shows the histogram of the data with value less than 30, where the line represents the density function of 13 N(5, 3) + 13 N(13, 3) + 13 N(25, 3)


Fig. 6 Top left panel: Plot for the search interest score, x-axis is time, y-axis is the google search interest score for the keyword "NBA". Top right panel: Histogram of the search interest score for the keyword "NBA". Bottom panel: Histogram of the search interest score (<30) for the keyword "NBA"

The two hidden states 0 and 1 can be interpreted as no event/event. An event could be an important game (all-stars,finals etc.). To estimate the homoscedastic error, we compute the standard deviation of the data from 08/29/2013 to 10/02/2013, since it is during the off-season, the search trend should be stable. It turns out the standard deviation is around 3. We are primarily interested in the time when search interest is high. Thus we use  f0to model those time when search interest is low. By inspection, We take  f0to be the density for 13 N(5, 3) + 13 N(13, 3) + 13 N(25, 3). The estimated  f1is displayed in Figure 7. For the proposed HMMT method, the bandwidth is chosen to be 3.


Fig. 7 Left panel: Estimated  f1using GMM. Right panel: Estimated  f1using HMMTweedie

Using the estimated transition matrix and  f1, we can calculate the expected proportion of days when the search interest is greater than 80. The Gaussian mixture model gives an estimate of 0.0294, the non-parametric model gives an estimate of 0.0266, whereas the proportion obtained from the data is 0.0243. We can see that the HMMT method provides a better estimate.

6.3 Change in unemployment rate

Finally we illustrate the method using the unemployment data. Left panel of Figure 8 is a plot of monthly unemployment rate change from February 1948 to August 2018. The data is from U.S. Bureau of Labor Statistics website. Since we are only considering the changes of unemployment rate, it is reasonable to assume that the errors are homoscedastic.


Fig. 8 Left panel: Plot of unemployment rate change from February 1948 to August 2018. x-axis is time, y-axis is unemployment rate change. Right panel: Histogram of of unemployment rate change from February 1948 to August 2018

We can see from the left panel of Figure 8 that most of the time the change in unemployment rate is close to 0, and dramatic change often appears in clusters, exhibiting dependence structure that can be reasonably described by an HMM.

Suppose we are interested in the months where there is a big increase in unemployment rate. Economists and policy makers may want to focus on those times and try to figure out the possible causes to avoid increase in unemployment rate in the future. The right panel of Figure 8 shows a histogram of the unemployment rate change when it is  ≤ 0.2%, together with the density function for 0.8N(0, 0.11)+0.2N(−0.25, 0.11). We can see the density function matches the data quite well, therefore, we assume  f0to be 0.8N(0, 0.11) + 0.2N(−0.25, 0.11)The estimated  f1is given in Figure 9. For the proposed HMMT, we choose the bandwidth to be 0.15. The Gaussian mixture model assumes there are 6 components. However, as we can see from the histogram, the mixture model is inadequate. By contrast, the proposed nonparametric model describes the data quite well; this clearly shows the flexibility and benefit of the HMMT method.


Fig. 9 Left panel: Estimated  f1for unemployment rate change using GMM. Right panel: Estimated f1for unemployment rate change using HMMTweedie

Next we apply our GMM.DP and HMMT to analyze the unemployment rate data (monthly, seasonal adjusted, January 1960 to June 2018), which are displayed in Figure 10. The data are obtained from the website of Federal Reserve Bank of St. Louis. Top right panel of Figure 10 presents the estimated true change of unemployment rate using a Gaussian mixture model. Bottom panel of Figure 10 is the estimate obtained using the proposed HMMT. We can see that HMMT has a clear “de-noise” effect on the data. The HMMT estimate is less scattered and more robust than GMM.DP. This can help researchers and policy makers to have a better idea of the true change of unemployment rate, and focus on the studies on the policies over the months that have a real impact on the unemployment rate.


Fig. 10 Top left panel: Plot of change of unemployment rate January 1960 to June 2018. Top right panel: Estimated true change of unemployment rate January 1960 to June 2018 using GMM. Bottom: Estimated true change of unemployment rate January 1960 to June 2018 using HMMTweedie.

We developed a non-parametric estimator for estimating mean vector under Markovian dependence. It will be interesting to introspect if following the bivariate kernel density approach in Fu et al. (2017) the proposed methodology can be extended to heteroskedastic Markov set-ups with known variances. Developing shrinkage algorithms when the variance in the observations is unknown and needs to jointly estimated from the data will be important. Estimation in multivariate HMMs where the observed outputs  Xi’s are vectors finds wide applications (Kale et al., 2003; Fiecas et al., 2017). In this context it will be useful to study shrinkage estimation particularly in the presence of latent structual properties on the covariances.

G. Mukherjee’s work was partly supported by the NSF grant DMS-1811866. W. Sun’s work was partly supported by the NSF grant DMS-1712983.

Ahmed SE, Reid N (2012) Empirical bayes and likelihood inference, vol 148. Springer Science & Business Media

Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The annals of mathematical statistics 41(1):164–171

Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc B 57:289–300

Bickel PJ, Ritov Y, et al. (1996) Inference in hidden markov models i: Local asymp- totic normality in the stationary case. Bernoulli 2(3):199–228

Bickel PJ, Ritov Y, Ryden T, et al. (1998) Asymptotic normality of the maximum- likelihood estimator for general hidden markov models. The Annals of Statistics 26(4):1614–1635

Brown LD, Greenshtein E (2009) Nonparametric empirical bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics pp 1685–1704

Brown LD, Greenshtein E, Ritov Y (2013) The poisson compound decision problem revisited. Journal of the American Statistical Association 108(502):741–749

Carlin BP, Louis TA (2010) Bayes and empirical Bayes methods for data analysis. Chapman and Hall/CRC

Churchill GA (1992) Hidden markov chains and the analysis of genome structure. Computers & chemistry 16(2):107–115

Dicker LH, Zhao SD (2016) High-dimensional classification via nonparametric empirical bayes and maximum likelihood inference. Biometrika 103:21–34

Efron B (2011) TweedieâĂŹs formula and selection bias. Journal of the American Statistical Association 106(496):1602–1614

Efron B, Zhang NR (2011) False discovery rates and copy number variation. Biometrika 98(2):251–271

Fiecas M, Franke J, von Sachs R, Tadjuidje Kamgaing J (2017) Shrinkage estima- tion for multivariate hidden markov models. Journal of the American Statistical Association 112(517):424–435

Fourdrinier D, Strawderman WE, Wells MT (2017) Shrinkage Estimation. Springer

Fu L, Sun W, Gareth J (2017) Nonparametric empirical bayes tweedieâĂŹs estimator for normal means with heteroscedastic errors. Manuscript

Fu L, Gang B, James GM, Sun W (2019) Information loss and power distortion from standardizing in multiple hypothesis testing. arXiv preprint arXiv:191008107

James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, vol 1, pp 361–379

Jiang W, Zhang CH, et al. (2009) General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics 37(4):1647–1684

Jiang Y, Oldridge DA, Diskin SJ, Zhang NR (2015) Codex: a normalization and copy number variation detection method for whole exome sequencing. Nucleic acids research 43(6):e39–e39

Johnstone IM (2012) Gaussian estimation: Sequence and wavelet models, available at: "http://www-stat.stanford.edu/~imj"

Kale A, Chowdhury AKR, Chellappa R (2003) Towards a view invariant gait recog- nition algorithm. In: null, IEEE, p 143

Ko SI, Chong TT, Ghosh P, et al. (2015) Dirichlet process hidden markov multiple change-point model. Bayesian Analysis 10(2):275–296

Koenker R, Mizera I (2014) Convex optimization, shape constraints, compound de- cisions, and empirical bayes rules. Journal of the American Statistical Association 109(506):674–685

Rabiner LR (1989) A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286

Robbins H (1955) An empirical Bayes approach to statistics. Office of Scientific Research, US Air Force

Saha S, Guntuboyina A (2017) On the nonparametric maximum likelihood estimator for gaussian location mixture densities with application to gaussian denoising. arXiv preprint arXiv:171202009

Sharp AJ, Hansen S, Selzer RR, Cheng Z, Regan R, Hurst JA, Stewart H, Price SM, Blair E, Hennekam RC, et al. (2006) Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nature genetics 38(9):1038

Silverman BW (2018) Density estimation for statistics and data analysis. Routledge

Sun W, Cai T (2009) Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2):393–424

Sun W, Reich BJ, Cai TT, Guindani M, Schwartzman A (2015) False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77(1):59–83

Wasserman L (2006) All of nonparametric statistics. Springer Science & Business Media

Wei Z, Sun W, Wang K, Hakonarson H (2009) Multiple testing in genome-wide association studies via hidden markov models. Bioinformatics 25(21):2802–2808

Yang F, Balakrishnan S, Wainwright MJ (2015) Statistical and computational guaran- tees for the baum-welch algorithm. In: Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, IEEE, pp 658–665

Here, we provide the proofs of the results in Section 4.

Proof of Lemma 2

Note that it is sufficient to bound  E(θ,XXX)d2� ˆf1,n[h], f or1,n[h]�.The result then follows from the triangle inequality. Let B be the event that  | �ni=1 θi − ψ1n| < 12ψ1n, throughout the proof we assume B holds, as by Hoeffding’s inequality,  P(Bc) =O(e−n/2). Write  bj = θj/�ni=1 θi, b∗j = ˆpj,n/�ni=1ˆpi,n, and  φh(x − xj) = 1h K( x−xjh )then


We first bound  E(b∗j − bj)2. Write  E(b∗j − bj)2 = {E(b∗j − bj)}2 + Var(b∗j − bj). It is clear that  E(b∗j)and  E(bj)are both of order  O(n−1). Hence  {E(b∗j − bj)}2 = O(n−2). Next consider  Var(b∗j − bj) = Var(b∗j) + Var(bj) − 2Cov(b∗j, bj). We have


Similarly  Var(bj) = O(n−2).It follows from Cauchy-Schwarz inequality that Cov(b∗j, bj) = O(n−2).Therefore  Var(b∗j − bj) = O(n−2)and  E(b∗j − bj)2 = O(n−2). Using the fact that∫φ2h(x − xj)dx = O(h−1), we have


Note that


Assumption A1 implies  Cov(θj, θk) = O(γ |j−k |)for some  γ >0. Hence for every j we can focus on its log n neighborhood, it follows that


3 from Fu et al. (2019), together with the Markov structure, we have


Hence  Ebj = 1/n + O(n−2log n). Note that  Eb∗j = 1/n + O� E ˆp1,n−ψ1n + log nn2 �.It

follows that  E(b∗j − bj)E(b∗k − bk) = O� E( ˆp1,n−ψ1)2n2 + log2 nn4 �. By Lemma 4 from Fu et al. (2019) and the Markov structure we also have  Cov(b∗j − bj, b∗k − bk) =O(n−3log n). Hence,


The lemma follows.

Proof of Theorem 1

First, notice that as a consequence of A1,  E||µµµ − ˆµµµOR(xxx)||22 = O(n).Thus the conditions of theorem 1 in Brown and Greenshtein, 2009 are satisfied. Our theorem is an adaption of theorem 1 in Brown and Greenshtein, 2009. The proof follows the same logic, we provide a sketch.

We will show  E|| ˆµµµTn − ˜µµµORn ||22 = o(nϵ)for any  ϵ>0. Let  pi = P(θi = 1|xxx). Note that it follows from the proof of Lemma 2 that ˆpi − pi = o(1). This fact together with Lemma 2 in Brown and Greenshtein (2009) implies we only need to show


Brown and Greenshtein, 2009 we first show for an independent sample ˜XXX = ( ˜X1, ...,˜Xn),˜Xi ∼ N(µi, 1), i = 1, ...nwe have


We can write ˆf ′1 [h](xi)ˆf1[h](xi) = f ′1 [h](xi)+R1f1[h](xi)+R2. Then on the region R defined in the proof of lemma 3 in in Brown and Greenshtein, 2009,


A consequence of Lemma 2 is that  E(Ri) → 0, i = 1,2 as  n → ∞. We only need to bound The variance of  Ri. The variances of  R1, R2are the variances of the density estimators ˆf ′1[h]( ˜Xi)and ˆf1[h]( ˜Xi). Notice that the variances of  f or1,n[h]( ˜Xi)and


Since  E(R2) →0, then as  n → ∞,


Use the same argument as in the proof of Lemma 3 in in Brown and Greenshtein, 2009, (13) follows. To show (12), we use the same argument in Brown and Green- shtein, 2009 . Since the Markov structure at most contribute a factor of log n, the conclusion is not affected. (12) and Lemma 2 in Brown and Greenshtein, 2009 together with Cauchy-Schwarz inequality proves the theorem.

Designed for Accessibility and to further Open Science