A major goal in behavioral medicine is identifying temporal patterns of risk factors preventing an individual from successfully modifying a health-related behavior. In smoking cessation, one would like to know times of day, locations, and other contextual factors such as smoking opportunity [16] that may precipitate lapse to inform interventions to prevent lapse [22]. A basic task is to describe when smoking occurs through modeling the counting process of repeated negative events. One goal is to estimate the mean function of the counting process to characterize the dynamics of health-related behaviors at the population level.
The most common and widely-used measurement for behaviors is self-report via Ecological Momentary Assessment (EMA): a participant responds to prompts and enters data via a phone app [26]. An EMA can be randomly triggered by the app. Random EMAs provide a less-biased data collection paradigm relative to user initiated EMAs and have been used in a wide variety of behavioral studies [25, 21, 20]. Here, a participant is prompted 3-4 times a day at random times. As EMA times will not generally co-occur with events of interest, participants report the cumulative number of events since the last EMA. For example, how many cigarettes they smoked since the last prompt. This takes the form of panel count data [15][27], where exact times of a counting process are unobserved. Only the cumulative number of events since the last observation are measured, where observation times are assumed to follow some unknown distribution.
The counting process mean function can describe population-level temporal patterns of smoking. It converts the discrete patterns of smoking counts from a population of participants (see Fig. 1) into a temporally-continuous summary of smoking behavior (see Fig. 2b). However, the missingness inherent in EMA data makes consistent estimation of the mean function difficult. There are several conditions causing missingness. An EMA may be ignored by the user or opened and then abandoned. Second, the mobile app itself may postpone the triggering of an EMA for any of several reasons (e.g. battery low). While EMAs are triggered randomly, the random process is constrained to have a minimum temporal spacing between EMAs, in order to keep participant burden at an acceptable level. If EMAs are postponed or ignored too many times, it will not be possible to trigger the full set of EMAs for the day, resulting in missing EMAs. Missing data is a common issue in studies that involve EMAs, with [14] noting that over 126 studies, the average missingness rate is 25%.
For mean function estimation, missing EMAs cause problems when they lead to inaccurate counts of the total cigarettes smoked between EMAs. Due to recall bias, longer intervals between EMAs are less reliable.1 Behavioral scientists have developed heuristic imputation schemes to adjust for missing counts [11], but these may not consistently estimate the mean function. This paper presents the first self-contained and systematic treatment of missing data for panel count data, by providing a simple Expectation-Maximization (EM) algorithm [7, 3] for estimating the mean function with finite-sample theoretical guarantees.
Our primary methodological contribution is a functional EM algorithm to wrap standard non-parametric mean function estimators to handle missing data under a missing completely at random (MCAR) assumption [17]. The E-step uses estimates from a fitting method to impute missing data, and the M-step calls that fitting method to estimate a mean function. This extends several classic non-parametric methods [32, 18] and the baseline-only version of a semi-parametric mean function estimation method [30] to the setting of missing data.
We analyze our EM algorithm using the frameworks described in [3, 35] and obtain finite sample guarantees. This requires care as we are extending their work from the parametric to the non-parametric setting. This paper addresses three major theoretical challenges in the context of functional EM by: (i) noting that an infinite dimensional derivative in our setting is analogous to the inner products used in [35, 3], (ii) showing local uniform strong concavity of our population E-step, and (iii) a high probability finite sample bound on the convergence of the M-step. The lack of ground truth for missing data in real-world settings makes evaluation difficult and theoretical guarantees important. Finally, our proposed algorithm can consistently estimate the mean function even when the Poisson process assumption is violated. This is achieved by recovering the population MLE, and noting that under certain integrability conditions the population MLE of the Poisson process log-likelihood is the true mean function of the counting process, even when the counting process is not Poisson.
Ecological Momentary Assessments are frequently used to record counts for behavior, including smoking counts [24, 29, 10], alcohol counts [6] and promiscuous behaviors [34]. However, none of these papers use panel count data methods to estimate the mean function.
Panel count data analysis comes from nonparametric statistics, but our missing data problem has not been addressed. The closest works to ours are [30, 32, 18]. These papers estimate the mean function from panel count data, but cannot account for missing EMAs as in our motivating application. [30] develop an EM approach under a more limited missingness model assuming that the total counts for each participant are known prior to data analysis. This is unsuitable for analyzing data with missing EMAs. However, two strengths are that they can incorporate baseline covariate information, and they relax the standard assumption of independence between observation times and the counting process.
Figure 1: Raw cigarette counts smoked between observations. We want to convert this to a mean function of cumulative cigarettes smoked (Figure 2b), but some counts are missing or unreliable.
Neither of [32, 18] handle missing counts, and cannot be applied directly to our setting. We propose an EM algorithm for missing data to wrap any of [30, 32, 18], building on existing methods in the M-step. [32] provides the first strong theoretical guarantees for a mean function estimator in the panel count setting. [18] improve their rate of convergence with spline mean functions.
Many other papers in the literature focus on either extending to the semi-parametric setting, relaxing assumptions in either the non-parametric or semi-parametric setting, or improving estimation under specific distributional assumptions. They do not address missing data. [33] extended [32] to the semi-parametric setting, [13] analyzed panel count with informative observation times and subject-specific frailty, and [12] proposed using a smooth semi-parametric estimator to handle over-dispersion in panel count data. [8] proposed using a squared Gaussian process intensity function.
EM Theory. Also relevant is recent work on convergence of the EM algorithm to the true parameter [3, 35]. In [3], they show that under good initialization and certain regularity conditions, population EM converges to the true parameter. With finite sample uniform convergence of the M-step, one can also show a finite-sample bound for EM. [35] show that by assuming the ability to optimize over a ball around the true parameter, one can weaken regularity conditions and replace uniform convergence of the M-step with three concentration inequalities, often easier to prove. We base our population guarantees on [35]. We base our finite-sample guarantees on [3], because [35] assumes that empirical and population norms are the same which does not hold in our non-parametric setting.
3.1 Complete Data
Let be a univariate counting process. The goal is to estimate the mean function over a study window be the random number of observations for a participant. Let be a random vector of observation times, with . Let be the count increments: . Let be mean function increments. For each participant , we observe are the corresponding sample spaces. Let denote the empirical and true measures on , respectively. For a measurable function f, let and . Under a non-homogeneous Poisson process, the complete data sample log-likelihood is
and the population log-likelihood is similar but with P instead of . The goal is consistent mean function estimation even when the Poisson process assumption is violated.
3.2 Missing Data
Unlike previous work, we assume certain observations are missing. For , each observation is missing completely at random with probability (w.p.) be the missingness pattern, where but the notation represents missingness. Let if is missing. Let and , where is elementwise product. Then . We observe but not . Let represent our missing data vector, and let Z be the space of values for our missing data. Then gives us the observed and missing parts of the data, respectively. In this case are now the empirical and true measures for
We first describe the general setting for EM. Let be joint, marginal, and conditional densities respectively. Let be a convex set of functions and a sieve: a nested sequence is dense in . We define the following.
Definition 1. (Population
Definition 2. (Sample
Definition 3. is the essential supremum.
Definition 4. (Population
Definition 5. (Sample
A Q-function is an E-step of the EM algorithm.. Population EM repeatedly takes where denotes iteration t’s estimate. Sample EM repeatedly takes describe EM computation. Algorithm 1 describes sample EM; population EM is similar.
Algorithm 1 assumes is known following [35]: in practice it is unknown, and we suggest trying multiple initializations and choosing the one that leads to the highest final log-likelihood. In our numerical experiments and data analysis, for illustration we assume {monotone step functions with at most n steps}. Next we describe the E- and M-steps.
4.1 E-Step
The population and sample Q-functions replace missing counts with mean function estimates. The sample Q-function (the population version is similar but uses P instead of
4.2 M-Step
The population M-step maximizes
where is a uniform upper bound for functions in this set. The sample M-step uses a convex set (sieve estimator). is potentially monotonic step functions [32, 30] or monotonic splines [18], subject to the constraint of having the upper bound of over the study time. Maximization proceeds via existing methods [32, 18, 30]. Like is unknown.
Section 5.1 defines assumptions, 5.2 defines distances and a quasi-inner product, and 5.3 gives two regularity condition lemmas based on [35]. Proposition 1 shows that with good initialization and sufficiently small missingness probability, a population EM step gives a contraction, moving our estimate closer to the true mean function after every iteration. Theorem 1 then shows that population EM converges linearly to the true mean function.
We then show finite-sample theory. Proposition 2 states that with high probability, the sample M-step converges uniformly to the population M-step. Theorem 2 states that with high probability the distance between the current estimate and the true mean function is bounded by two terms: one describes applying population EM, and the other involves the uniform convergence of the M-step.
5.1 Assumptions
We make the following assumptions, similar assumptions were made in [32, 18, 30]:
1. The counting process N is independent of the number of observations K and observation times T, respectively;
2. The observation times are random variables taking values in the bounded set where
3. The number of observations is bounded, i.e., there exists
4. The true mean function is uniformly bounded over the study, satisfying for some is a uniform upper bound on functions in
5. The first derivative of has a positive lower bound in
6. The observation times are -separated. That is, for some and all
7. E[exp(aN(t))] is uniformly bounded for some constant a;
8. The count increments are missing completely at random (MCAR) w.p. are iid Bernoullirandom variables.
In the context of an observational smoking study, some of these assumptions can be scientifically expressed: 1) and 2) EMAs are delivered at random times throughout the study and are independent of smoking times 3) there is a maximum number of EMAs delivered over the study 5) there is a minimum smoking risk throughout the study 6) there is a minimum time between EMAs 8) the probability that an EMA is missed is . The only new assumption for panel count is assumption 8.
5.2 Measures, Distance Metrics, and Quasi-Inner Product
We next define measures for constructing distance metrics between mean functions. We then define a quasi-inner product that is used to show regularity conditions for population EM theory. Let be the intersection of Borel sets in
Definition 6 (Measures for sets containing observation times). and
Then observations in one observation in Definition 7 (metric for mean functions).
This is the metric of [32]. Under assumption 3, convergence in implies convergence in We base our theory on convergence in this norm. Now define
We do not claim that (1) is a valid inner product, but it is closely related to the inner product from [35, 3], the Q-function’s directional derivative in the direction of a parameter vector, while (1) is the Q-function’s right Gateaux derivative in the direction of a mean function, analogous in function space. The connection to inner products from [3, 35] is key to using existing EM theory to prove our Lemmas 1 and 2. We prove Equation (1) in A.1 in the supplementary material.
5.3 Population Theory
Before stating our main population EM convergence theorem, we define important constants and state two lemmas for the population Q-function. Lemmas 1 and 2 mirror gradient stability and local uniform strong concavity from [35], respectively, but they studied Q-functions with quadratic dependence on parameters. Our objective function does not have quadratic dependence on the mean function. However, we can decompose our Q-function into a sum of two terms: one locally uniformly strongly concave, and one strictly concave. The sum is then locally uniformly strongly concave. Lemma 2 shows this lower bound holds, and is a key step that allows us to apply [35] to our setting. We then prove Proposition 1 which states that population EM steps contract, bringing estimates closer to the true mean function after each iteration.
Definition 8. Let be a uniform lower bound on increments of the true mean function. This exists by assumptions 5 and 6.
Definition 9. Let be a uniform upper bound on increments of mean functions in is a uniform upper bound on functions in
Definition 10. Let
Lemma 1. (Gradient Stability) Assume assumptions 1, 2, 3, 4, 5, 6, and 8 hold. Then for (proof in A.2),
Lemma 2. (Local Uniform Strong Concavity) Assume all assumptions hold. Then if and (proof in A.3),
See section A.5 in the supplementary material for a detailed proof. This is similar to Proposition 3.2 in [35]. In order for this to give a contraction, we need , which holds if . Thus if the uniform lower bound on increments of the true mean function goes up or the uniform upper bound on increments in the ball goes down, we can tolerate a higher probability of missing data.
Theorem 1. (Population EM Convergence to True Mean Function) Suppose the assumptions of the above hold and . Take the EM sequence and
The proof is in A.6 in the supplementary material. At each EM step, we move towards the true function by a multiplicative factor of. This is similar to Theorem 3.1 in [35].
5.4 Sample Theory
We next discuss convergence of the sample EM algorithm. We again require the initial estimate and subsequent estimates to be close, i.e., in the set . The key additional assumption is that the sample size is large enough to satisfy certain conditions. For all n greater than this minimum sample size, we can guarantee approximate contraction of the sample EM algorithm, where the term that does not contract goes to 0 in large samples.
Proposition 2. Suppose all assumptions hold. Assume has radius . For any L > 0,
there exists such that w.p.
See B.1 of the supplementary material for proof, as well as a definition for . This is the rate of convergence an M-estimator, and generalizes the rate of convergence of [1] from maximum likelihood estimates to M-steps. We now state our main result.
Theorem 2. Suppose and all assumptions hold such that the population contractivity holds. Let . Take the EM sequence and . Then if the sample size is large enough that , then w.p. at least
See section B.2 in the supplementary material for a proof. The proof follows that of Theorem 5 of [3] closely. Note that this immediately implies that our algorithm can recover the true population MLE in large samples. Further by [32], the population MLE is the true mean function under assumptions 1,4 and 7 even under Poisson process violations. Thus our method is robust to Poisson process violations.
Here we show numerical and real data results to illustrate the utility of the proposed functional EM algorithm. In the M-step, we can choose any reasonable mean function estimator. In our experiments, we use a general likelihood-based augmented estimating equation (AEE) method [30], which uses monotone step functions when obtaining the mean function estimate with complete data. [31, 5] provide software implementations of AEE and a wide variety of other panel count methods. First, we perform synthetic experiments that demonstrate accurate recovery of the true mean function (see Section C of the supplemtary material). In particular, we also demonstrate good recovery of the true mean function using simulated data based on mixed poisson processes (which violate the Poisson process assumption) hence confirming theoretical robustness of functional EM to Poisson process assumption. In the following, we focus on two experiments involving real data.
6.1 Real Data with Synthetic Missingness
We analyze blaTum (bladder tumor dataset) [4] with 85 patients and counts of tumors taken at appointment times. We artificially delete intervals completely at random with probability 0.2. We then initialize by replacing the missing data with Poisson(1) random variables and fitting AEE. We bootstrap 1,000 times, and plot the sample mean of our learned mean functions under complete data. We also set counts to zero in intervals with missing counts, which biases the mean function estimates. Figure 2a compares inference from complete vs partially missing data using our wrapper vs zeroing out missing data. Our wrapper learns a model much closer to the complete data than its initialization or the zeroing model. Section D of the supplementary material has additional experiments: we investigate sensitivity to initialization and different missingness probabilities. We also replace AEE with other M-step methods in Section D of the supplementary material and again confirm the utility of functional EM in approximating the mean function. The actual estimates vary slightly. In practice we recommend comparing results when making scientific conclusions; similar results from a wide variety of flexible methods indicate more robust scientific evidence.
Figure 2: The estimated mean function and the 95% bootstrap confidence band from a) bladder tumor dataset: synthetic 20% missingness probability. Initialization has missing data set to Poisson(1) values. b) smoking cessation study. Initialization treats self-reported counts in intervals over 24 hours (about 5.6% of observations) as valid.
6.2 Smoking Cessation Study
We analyze data from an ongoing smoking cessation study in which the following EMA question was delivered randomly 3-4 times per day: "Since the last assessment, how many cigarettes did you smoke?" We have 125 participants tracked over 14 days, with 3-4 random EMAs targeted. The first five days are their normal smoking behavior, while the remaining days involve them attempting to quit. In both curves the smoking rates are much higher pre-quit than post-quit, suggesting that they do in fact attempt to quit. After discussion with psychologists, we treat counts in intervals of longer than 24 hours as missing because they are considered to be unreliable. We use the counts to initialize our model. Unlike the previous experiments, we lack a ground truth. Note that this study is currently ongoing and we cannot link to it or share the dataset.
Figure 2b shows the results based on 1, 000 non-parametric bootstrap samples and two models: one treating long intervals as valid (“initial"), and the other treating them as missing where we use EM. Long intervals make up 5.6% of observations. The EM algorithm estimates that on average smokers attempting to quit smoked 80.4 cigarettes by the end of the study; in contrast, AEE underestimated (66.11), a difference of 21.7%. This is consistent with our collaborating behavioral scientists’ hypothesis that participants may under-report a cigarette count when the gap between completed EMAs is large. The proposed functional EM is able to borrow count information across multiple shorter overlapping intervals from other participants to alleviate this under-reporting. We discuss other analyses of the dataset further in section E of the supplementary material.
This paper proposed a functional EM algorithm to estimate the mean function for incomplete panel count data. Extending existing EM algorithm analysis to general non-parametric settings, we provided finite sample convergence guarantees to the truth. We conducted extensive experiments to illustrate the effectiveness of the proposed algorithm in recovering the true mean function. We applied the proposed algorithm to a smoking cessation study and found that participants may underestimate their cigarettes smoked over intervals longer than 24 hours. To the best of our knowledge, we are the first to apply panel count data methods to EMAs, despite their central role in behavioral science research.
From a theoretical perspective, the main limitation is the MCAR assumption. Deviations from MCAR can happen if non-reporting depends on a subject’s emotional state, which may be related to their smoking counts. Extensions to other missingness mechanisms such as Missing at Random (MAR) warrant future research; our analysis provides a general framework and first step, which can eventually be extended to also incorporate covariates under a more complex missing data model. Like some existing theoretical analysis of EM algorithms [35], another limitation is the need to optimize over a ball around the true mean function, which is unknown. Relaxing this condition is likely very challenging but also important future work. Finally, we would like test statistics for our estimator. This is challenging as the differences between fitted and true mean functions in norm do not converge to a normal distribution. However, there are test statistics based on weighting [2] that could be extended to this setting.
Understanding the dynamics for individuals who attempt to change and maintain behaviors to improve health has important societal value, for example, a comprehensive understanding of how smokers attempt to quit smoking may guide behavioral scientists to design better intervention strategies that tailor to the highest risk windows of relapse. Our theory and method provide a valid approach to understanding a particular aspect of the smoking behavior (mean function). The resulting algoithm is robust, readily adaptable and simple to implement, highlighting the potential for its wider adoption. The negative use case could be lack of sensitivity analysis around the assumptions such as missing data mechanism which may lead to misleading conclusions. Our current recommendation is to consult scientists about the plausibility of the assumption about missing data.
[1] BALAKRISHNAN, N., AND ZHAO, X. A class of multi-sample nonparametric tests for panel count data. Annals of the Institute of Statistical Mathematics 63, 1 (2011), 135–156.
[2] BALAKRISHNAN, N., ZHAO, X., ET AL. New multi-sample nonparametric tests for panel count data. The Annals of Statistics 37, 3 (2009), 1112–1149.
[3] BALAKRISHNAN, S., WAINWRIGHT, M. J., YU, B., ET AL. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics 45, 1 (2017), 77–120.
[4] BYAR, D. The veterans administration study of chemoprophylaxis for recurrent stage i bladder tumours: comparisons of placebo, pyridoxine and topical thiotepa. In Bladder tumors and other topics in urological oncology. Springer, 1980, pp. 363–370.
[5] CHIOU, S. H., HUANG, C.-Y., XU, G., AND YAN, J. Semiparametric regression analysis of panel count data: A practical review. International Statistical Review 87, 1 (2019), 24–43.
[6] COLLINS, R. L., KASHDAN, T. B., AND GOLLNISCH, G. The feasibility of using cellular phones to collect ecological momentary assessment data: Application to alcohol consumption. Experimental and clinical psychopharmacology 11, 1 (2003), 73.
[7] DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. B. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological)
39, 1 (1977), 1–22.
[8] DING, H., LEE, Y., SATO, I., AND SUGIYAMA, M. Variational inference for gaussian process with panel count data. Uncertainty in Artificial Intelligence (2018).
[9] GERW (HTTPS://MATH.STACKEXCHANGE.COM/USERS/58577/GERW). Taylor expansion for gteaux derivative. Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/3076726 (version: 2019-01-17).
[10] GRIFFITH, S. D., SHIFFMAN, S., AND HEITJAN, D. F. A method comparison study of timeline followback and ecological momentary assessment of daily cigarette consumption. Nicotine & tobacco research 11, 11 (2009), 1368–1373.
[11] HALLGREN, K. A., AND WITKIEWITZ, K. Missing data in alcohol clinical trials: a comparison of methods. Alcoholism: Clinical and Experimental Research 37, 12 (2013), 2152–2160.
[12] HUA, L., ZHANG, Y., AND TU, W. A spline-based semiparametric sieve likelihood method for over-dispersed panel count data. Canadian Journal of Statistics 42, 2 (2014), 217–245.
[13] HUANG, C.-Y., WANG, M.-C., AND ZHANG, Y. Analysing panel count data with informative observation times. Biometrika 93, 4 (2006), 763–775.
[14] JONES, A., REMMERSWAAL, D., VERVEER, I., ROBINSON, E., FRANKEN, I. H., WEN, C. K. F., AND FIELD, M. Compliance with ecological momentary assessment protocols in substance users: a meta-analysis. Addiction 114, 4 (2019), 609–619.
[15] KALBFLEISCH, J., AND LAWLESS, J. F. The analysis of panel data under a markov assumption. Journal of the American Statistical Association 80, 392 (1985), 863–871.
[16] KIRCHNER, T. R., CANTRELL, J., ANESETTI-ROTHERMEL, A., GANZ, O., VALLONE, D. M., AND ABRAMS, D. B. Geospatial exposure to point-of-sale tobacco: real-time craving
and smoking-cessation outcomes. American journal of preventive medicine 45, 4 (2013), 379–385.
[17] LITTLE, R. J., AND RUBIN, D. B. Statistical analysis with missing data, vol. 793. John Wiley & Sons, 2019.
[18] LU, M., ZHANG, Y., AND HUANG, J. Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika 94, 3 (2007), 705–718.
[19] MA, S., AND KOSOROK, M. R. Robust semiparametric m-estimation and the weighted bootstrap. Journal of Multivariate Analysis 96, 1 (2005), 190–217.
[20] O’CONNELL, K. A., GERKOVICH, M. M., COOK, M. R., SHIFFMAN, S., HICKCOX, M., AND KAKOLEWSKI, K. E. Coping in real time: Using ecological momentary assessment techniques to assess coping with the urge to smoke. Research in Nursing & Health 21, 6 (1998), 487–497.
[21] PIASECKI, T. M., JAHNG, S., WOOD, P. K., ROBERTSON, B. M., EPLER, A. J., CRONK, N. J., ROHRBAUGH, J. W., HEATH, A. C., SHIFFMAN, S., AND SHER, K. J. The subjective effects of alcohol–tobacco co-use: An ecological momentary assessment investigation. Journal of abnormal psychology 120, 3 (2011), 557.
[22] REHG, J. M., MURPHY, S. A., AND KUMAR, S. Mobile Health. Springer, 2017.
[23] SEN, B. A gentle introduction to empirical process theory and applications.
[24] SHIFFMAN, S. How many cigarettes did you smoke? assessing cigarette consumption by global report, time-line follow-back, and ecological momentary assessment. Health Psychology 28, 5 (2009), 519.
[25] SHIFFMAN, S., BALABANIS, M. H., GWALTNEY, C. J., PATY, J. A., GNYS, M., KASSEL, J. D., HICKCOX, M., AND PATON, S. M. Prediction of lapse from associations between smoking and situational antecedents assessed by ecological momentary assessment. Drug and alcohol dependence 91, 2-3 (2007), 159–168.
[26] SHIFFMAN, S., STONE, A. A., AND HUFFORD, M. R. Ecological momentary assessment. Annu. Rev. Clin. Psychol. 4 (2008), 1–32.
[27] SUN, J., AND KALBFLEISCH, J. Estimation of the mean function of point processes based on panel count data. Statistica Sinica (1995), 279–289.
[28] VAN DER VAART, A. W., AND WELLNER, J. A. Weak convergence. In Weak convergence
and empirical processes. Springer, 1996, pp. 16–28.
[29] WANG, H., SHIFFMAN, S., GRIFFITH, S. D., AND HEITJAN, D. F. Truth and memory: Linking instantaneous and retrospective self-reported cigarette consumption. The annals of applied statistics 6, 4 (2012), 1689.
[30] WANG, X., MA, S., AND YAN, J. Augmented estimating equations for semiparametric panel count regression with informative observation times and censoring time. Statistica Sinica (2013), 359–381.
[31] WANG, X., AND YAN, J. Fitting semiparametric regressions for panel count survival data with an r package spef. Computer methods and programs in biomedicine 104, 2 (2011), 278–285.
[32] WELLNER, J. A., ZHANG, Y., ET AL. Two estimators of the mean of a counting process with panel count data. The Annals of statistics 28, 3 (2000), 779–814.
[33] WELLNER, J. A., ZHANG, Y., ET AL. Two likelihood-based semiparametric estimation methods for panel count data with covariates. The Annals of Statistics 35, 5 (2007), 2106–2142.
[34] WRAY, T. B., KAHLER, C. W., AND MONTI, P. M. Using ecological momentary assessment (ema) to study sex events among very high-risk men who have sex with men (msm). AIDS and Behavior 20, 10 (2016), 2231–2242.
[35] WU, C., YANG, C., ZHAO, H., AND ZHU, J. On the convergence of the em algorithm: A data-adaptive analysis. arXiv preprint arXiv:1611.00519 (2016).
A.1 Proof of Equality in Equation 1
We start with the numerator
Now consider
We next need to show that we can pull the limit as under integrals. There are two relevant terms:
Noting that is monotone increasing to , we can apply the monotone convergence theorem to pull the limit under the integral. Then
then any integral of this over will also be finite, and we can apply Fubini’s theorem to such integrals. Then recalling that is the MCAR probability (assumption 8),
Here we used Fubini’s theorem to pull out . Applying the CS inequality for inner products in finally applying the CS inequality for expectations gives us the result. The first term on rhs is then:
This gives us
A.3 Proof of Lemma 2
Note that
Define
and note that if and are in- tegrable over with respect to the measure P, we can apply Fubini’s theorem to obtain
This proof takes place in four parts. We first show an inequality that allows us to characterize r for . Next we show that in this ball the integrability conditions above hold. We then show that and are both strictly concave. Finally we show that is strongly concave and thus Q, a positive linear combination of a strongly concave and a strictly concave function, is strongly concave.
A.3.1 Characterizing the Radius of Contraction
Proof. Let . Then we can Taylor expand
where lies between 0 and y. Then noting that
for this implies that
i.e. . Let . Then we need for to hold. Note that if
Then
and if z = 0.5 the desired result holds. For equation 15 to hold it suffices to have . Now noting that
w.p. 1, and noting that we want z = 0.5 a sufficient condition is
A.4 Integrability Conditions
First note by assumption 7, all moments of are uniformly bounded and thus . Then for and by assumption 3,
A.4.1 Strict Concavity of
For
and for the same argument can be made.
A.4.2 Strong Concavity of Q
Now note that
where the first inequality holds by claim 1. Now by strict concavity of as shown in A.4.1 we have
Summing
A.5 Proof of Population Contractivity
We now state the main proof of population contractivity. Denote
then
and rearranging terms and dividing both sides by gives the desired result. Note that we used , which if were a valid inner product would be the KKT conditions. However since may not be a valid inner product, they must be checked specifically. [32] does it in the sample case for the true log-likelihood: it is easy to verify that it still holds in the population case for Q-functions. Noting that maximizes , we have that
A.6 Proof of Theorem 1
By induction. It holds for t = 0. Assume it holds for . Then and by assumption . Applying population contractivity and the induction assumption,
B.1 Proof of Proposition 2
B.1.1 Definitions and Background from the Literature
Before proving the proposition, we restate several important results that we use from existing literature. We repeat two definitions and two theorems from [23], adjusted to our notation. Note that are normally called -brackets. However, since we have already used to denote MCAR probabilities, we call them
Definition 11. (-bracket) Let (F, d) be a normed space of functions distance metric d induced by some norm. Given two functions , the bracket [l, g] is the set of all functions -bracket is a bracket
Definition 12. (Bracketing numbers). The bracketing number is the minimum number of -brackets needed to cover
Definition 13. (Bracketing Integral) The bracketing integral is defined as
Note that since any non-empty set requires at least one bracket to cover it and for
for some . We call this the separation condition. Further, let
and assume that there exists some function satisfying the following three conditions
3. for the rate of convergence as n varies.
Here the right hand side times a constant. Then for every with probability at least . Here , where only depends on the constants in the separation condition and the expected supremum bound.
Note that this is essentially a special case of Theorem 3.2.5 of [28], but it uses expectations instead of outer expectations. It makes the stronger version of their assumptions and draws a stronger conclusion, giving a finite sample bound. The key in our setting will be to ensure that does not vary across iterations, which requires that the constants for the separation condition and the expected supremum bound do not vary across iterations. Importantly, we cannot always apply this theorem in the general functional EM setting: it requires that the sample Q-function at its maximizer over a Sieve is equal to the sample Q-function at its maximizer over the full function space. This holds for the Poisson process log-likelihood for panel count data for a range of cases: for instance with step functions or splines where jumps/knots occur at observation times. It does not necessarily hold for arbitrary models. In the latter case we may be able to use Theorem 6.1 of [23], but this would require checking it carefully for each potential model.
We also note Theorem 4. (Theorem 4.12 of [23]) For any class F of measurable functions
where is some constant.
Importantly, constant and does not depend on F. This was noted by [19]. A version of this theorem was originally Lemma 3.4.2 in [28].
B.1.2 Outline
We follow [1], which proves the rate of convergence for the maximum pseudo-likelihood of a Poisson process objective function for panel count data: extending their proof to the expected complete data log-likelihood case is straightforward. However, we face the issue that we want a high probability uniform bound on the distance between sample and population M-steps across EM iterations, whereas they neeeded an asymptotic high probability rate of convergence for the pseudo MLE. This poses three challenges: 1) our objective function at each iteration is the expected log-likelihood rather than the log-likelihood. Thus we cannot prove that the separation condition holds using the same techniques. 2) the separation condition must hold always rather than only in a neighborhood of the optimum 3) we need to check that constants are the same across EM iterations. Our aim is to apply Theorem 3 and show that the constant in does not vary across EM iterations. Note that other than checking the separation condition, the majority of this proof simply repeats the proof strategy of [1] but fills in details of results they call to make sure that constants don’t vary across iterations of EM.
This proof takes place in four parts. We first show that the separation condition holds. We then bound the expectation of the supremum of the magnitude of an empirical process. We next prove the two properties of the function involved in that bound to show the rate of convergence, and finally conclude by applying Theorem 3.
B.1.3 Separation Condition
We first prove that the separation condition given by Equation 22 holds. This involves applying functional second order Taylor expansions to and and using the remainder terms to obtain quadratic lower bounds.
Claim 2. For any
Proof. Consider the functional second order expansion of at , which we can do by [9].
where the last line follows since w.p. 1. Noting that by the KKT conditions,
and thus the separation condition holds since we optimized over
B.1.4 Bounding the Expectation of the Supremum of the Magnitude of the Empirical Process
Our aim in this section is to apply Theorem 4 and use the result to show that the expected supremum condition, Equation 24 in Theorem 3 holds. Let
This section proceeds as follows. We first show that for all for some constant and for some . We next show a bound on the bracketing entropy in terms of the bracket size. We then use this to bound the bracketing integral using . We combine all of this to bound the expectation of the supremum of interest. We must carefully note that relevant constants do not vary across iterations.
Claim 3. For that does not depend on
Proof.
Then note
where the first line uses the inequality which implies. Plugging this back into equation 29 we obtain
so that we have for some constant does not depend on either
Claim 4. For does not depend on
Proof. Again using
where we used that
By theorem 2.7.5 of [28], which bounds the bracketing number of monotone functions mapping to [0, 1], and noting that has bracketing number less than or equal to that of , which was shown in [2],
where only depends on , the uniform upper bound in . Noting that , we have
where again only depends on
and note that
Claim 5. Because we have
for does not depend on
Proof. Here we use Theorem 4 in order to prove Equation 24 in Theorem 3. By Theorem 4 and again noting that
With this we have the bound on the expected supremum and have proven Equation 24 for Theorem 3.
B.1.5 Characterizing the Function in the Bound
We first prove Equation 25 in Theorem 3.
and thus
Next we prove Equation 26. Let
and
and thus
B.1.6 Putting it All Together
We have now proven that all of the assumptions of Theorem 3 hold, and that constants do not vary across iterations. Then for some constant does not vary across iterations, for any L > 0 and using
B.2 Proof of Theorem 2
Note that if , then we have . We claim for any t > 0,
We prove by induction. First, this holds for t = 1.
Now assume holds true for t > 0. Then for t + 1,
Now iterating we have
C.1 Square Root Synthetic Experiment
We generate synthetic panel count data from mixed inhomogeneous Poisson processes with conditional mean functions , where . The mean functions are then and , respectively. The counting process conditional on X is Poisson, but the marginal counting process is not. We use 100 trajectories, each with 30 observations and for each observation set it to missing with probability 0.2. We initialize the mean function by replacing the missing data with Poisson(1) random variables and fitting a model. We generate 1000 Monte Carlo runs and create Monte Carlo marginal confidence intervals from those runs.
Fig 3 compares the true mean function against AEE wrapped with our method vs AEE directly on the corrupted data. Taking the corrupted data as given learns highly biased results, while wrapping AEE with our algorithm learns close to the true mean function for both experiments.
Figure 3: Mean+95% Monte Carlo CIs of 1000 runs of Mixed Poisson processes, which violate the Poisson process assumption. Synthetic datasets: 20% missingness using AEE wrapped with our method. Black/solid is the true mean function. Red/dotted is initialization with missing data set to Poisson(1) values. Blue/dashed is our EM algorithm initialized at red. (a) a square root mean function (b) a quadratic mean function
D.1 Comment on Mean and Confidence Intervals
In order to form the mean and marginal confidence intervals, we can note that the observation times are all discrete valued at times (months) 1 to 50. Thus we can simply take mean and marginal confidence intervals at those points. To do so, for each bootstrap replicate, we add a small amount of noise to the observed time points for identifiability purposes in order to train, and then take mean and marginal confidence intervals at the described points.
D.2 Varying Missingness Probabilities
In this experiment we vary the missingness probability with . We do so for each of the five following methods: the non-parametric step function maximum pseudo-likelihood (NPMPLE) of [32], the smoothed maximum pseudo-likelihood (MPLs) and and maximum likelihood (MLs) estimators of [18], and the step function solution to the augmented estimating equation (AEE) and the informative censoring (AEEX) version methods of [30]. We show that bias is low but increasing as a function of the missingness probability for the five methods. Figure 4 shows results. The bias is very low in all cases. Further, in all cases it is much lower than the initialization with corrupted data shown in Figure 2a.
D.3 Varying Initialization
In this experiment we vary the initialization. We note that the mean count across all observations is 0.44. We investigate Poisson(1) to Poisson(4) initializations under both and . We use AEE in all cases. Figure 5 shows results. We see that further initializations increase bias, but only slightly for and more so for . This is not surprising as our theory requires good initialization and sufficiently low missingness probability.
D.4 Heterogeneity in Missingness Probabilities
Even if the MCAR assumption does hold, the missingness probability may vary between subjects. In this experiment, we let where and . This can capture between subject heterogeneity in missingness. Within each subject, we compare initialization of Poisson(1) to Poisson(4). Figure 6 shows results using the AEE method. They look very similar to results without heterogeneity in the missingness probabilities between subjects.
D.5 Missing at Random
Here we investigate the recovery under missing at random (MAR). For MAR, we note that 83.7% of the counts are 0: that is, there are no new tumors. We then set the missingness probability for a
Figure 4: results for mean and 95% CIs for 1000 bootstrap replicates for bladder tumor dataset, with the missingness probability . We see that bias is very low in all settings, although it increases with the missingness probability. The methods wrapped with our method are (a) maximum pseudo-likelihood (MPL) of [32] (b) augmented estimating equations (AEE) of [30] (c) augmented estimating equations with informative censoring (AEEX) of [30] (d) maximum pseudo-likelihood splines (MPLs) of [18] note that for this we need to reduce the time axis slightly as spline models cannot interpolate far past the region where they have values in learning (e) maximum likelihood splines (MLs) of [18], same issue.
Figure 5: Here we vary the initialization by initializing increasingly far from the mean observed under complete data. The true sample mean of all intervals is 0.44. We initialize to Poisson random variables with means 1 to 4. We see that initializing further from the truth increases the bias, and that it is worse for the higher missingness probability of 0.4. Our theory requires good initialization and the missingness probability to be sufficiently low, so this is not surprising.
Figure 6: Here we let the missingness probability vary per participant while still being MCAR by multiplying the mean missingness by some value drawn from uniform[0, 2]. We again initialize to Poisson(1) to Poisson(4) for a) mean missingness 0.2 and b) mean missingness 0.4.
Figure 7: Missing at random (MAR). The time-varying missingness probability is if the previous (approximately 84% of observations) and otherwise.
specific observation for a participant to if and otherwise . Figure 7 shows the results. Recovery is very good.
In this section we plot several histograms and boxplots to get insight into behavior variability in the study. We plot cigarettes since the last assessment and number of days between EMAs, where for both we aggregate both between and within subjects. We also plot the percentage of long intervals between subjects. That is, we take the percentage for each participant and then take the boxplots and histograms with the summary statistic for each participant as a data point.
Figure 8 shows the plots for number of cigarettes since the last assessment. We see that most of the time they don’t smoke any cigarettes. The histogram looks like a geometric distribution for the number of cigarettes. The mean number of cigarettes is 1.76, the median is 0, and the 25 and 75 percentiles are 0 and 2, respectively. Thus, in at least half of the intervals they don’t smoke, and in 75% percent of them they smoke at most two cigarettes, but in some cases they smoke some huge number: in one case someone smoked over 40 cigarettes. That said, the number of cigarettes also will depend on the length of time elapsed since the last assessment, and the intensity function is time-varying: particularly, we noted in Figure 2b) that they tend to smoke more frequently in the pre-quit period.
Figure 9 plots the days between EMAs. Here we see that most observations are under one day, but there are a substantial number of outliers. In a number of cases the time between EMAs is over two days, and in several cases it is over six days. The mean time between observations is 0.34 days or approximately eight hours, the median is 0.17 or approximately four hours, and the 25 and 75 percentile are approximately two and 11 hours, respectively. Summarizing, most inter-EMA durations are under one day, but a few are very long.
Finally, also relevant is whether there is heterogeneity in the proportion of long intervals between subjects. Figure 10 investigates this. We see that participants vary between having no unreliable intervals and in one case having all unreliable intervals. The mean is 11%, the median is 6%, and the
Figure 8: a) boxplot b) histogram for number of cigarettes since the last assessment. We see that the median number of cigarettes is 0, and that there are many outliers.
Figure 9: a) boxplot b) histogram for number of days between EMAs. Recall that we treat intervals over one day as missing/unreliable. From the boxplot we see that the mean and 75% percentile are under one day, but that there are a substantial number of observations over one day. The histogram suggests a similar finding.
25 and 75 percentile are 1.8% and 11.4%, respectively. Thus there is substantial heterogeneity. While our theory does not currently account for this and we leave it to future work, the experiments from the previous section where each subject had a different missingness probability (Figure 6) suggests that our model can handle this issue well.
Figure 10: a) boxplot b) histogram for percent of long intervals between subjects. We see that most participants have a small proportion of long intervals (the median is 6.1%), while a few have a much larger proportion of long intervals.