It has been known for decades that a relatively small group of patients, termed high-cost claimants
(HiCCs), accounts for a disproportionate share of healthcare costs and insurance claims [1]. For example,
members with claims over $50,000 per year represented 1.2% of the U.S. insured population but
comprised 31% of total spending [2]. In the Medicare population in the U.S., McWilliams and Schwartz
[3] found that 17% of the population incurred 75% of all costs. In our comprehensive data on the non-
Medicare insured population [4], members with annual costs greater than $250,000 comprise just 0.1% of the population yet account for 9% of overall costs. Moreover, in this population the number of HiCCs
with $250,000 or more has risen by 62% from 36,449 in 2012 to 58,897 in 2016, and the average cost
was $446,748 per HiCC in 2017. It is therefore not surprising that when asked to list the most important strategies for healthcare in the next five years, midsized and large employers ranked managing high-cost claimants at the top of the list [5].
As a population, HiCCs are frequently burdened with multiple chronic diseases, functional limitations,
and other barriers [6]. Often, high medical expenditures occur as part of acute or invasive therapies that
are frequently unsuccessful and involve tremendous suffering and disability [7]. Fortunately, there are
many interventions that could prevent relatively healthy individuals from becoming high-cost claimants in the first place [8,9]. An intervention is tailored to a member’s specific circumstances and might consist
of some combination of a telephone call, additional diagnostic studies, referral to a specialist, digital
coaching, or other services. Health insurers and providers often maintain a care management
organization whose role is to identify patients who would benefit from interventions and match them
with intervention programs (Fig. 1).
Figure 1. The care management process. Members in the covered population are first reviewed by the
healthcare professional, who, informed by the algorithm, selects some of the members for possible
intervention. Selected members are then stratified according to dimensions such as risk and the
availability of suitable interventions, and if appropriate, receive an intervention. This referral decision is not based solely on the algorithm’s prediction score; rather, the decision is holistic, with the algorithm’s
prediction score to be considered in the context of the member’s overall clinical situation. The
intervention is expected to result in some health benefits and cost savings. After the intervention the
individual returns to the population.
Because interventions are costly, potential benefits and cost savings that might result from the
intervention must be balanced against the cost of the intervention itself [10]. Furthermore, it is
challenging to determine which individuals would derive the greatest benefit from the intervention,
even within a narrow subpopulation that has a serious pre-existing condition. Therefore, one of the
central challenges of a care management program is to identify individuals at risk of acute and expensive health outcomes [8].
The emergence of powerful predictive AI methods seems ideally suited to address this identification
challenge, since an AI algorithm could potentially predict future costs or medical needs at an individual
level [11,12]. Such an identification algorithm can then guide limited intervention resources towards
the highest-risk and highest-need individuals. Therefore, our goal here was to apply machine learning to
identify members who are at risk of exceeding a total healthcare cost of $250,000 over the next 12
months. We hypothesized that using a relatively large dataset of insurance claims and a large
investment in engineering new input variables we could exceed previously published benchmarks in this field.
Overview of existing research
Much research has examined the problems of predicting medical costs and identifying high-cost
claimants [12–14]. However, not many studies used very high-cost thresholds, i.e., studies examining
the top 1% of claimants (or a rarer group). Additionally, many studies did not report predictive
performance in terms of area under the ROC or precision-recall curves, or were descriptive in nature
rather than predictive; see Table 1.
Historically, the problem of high-cost claimants was studied by actuarial scientists at the population
level, with emphasis on parameter estimation and statistical significance tests [15–19]. But with the
growing availability of data, computing power, and new artificial intelligence and machine learning
methods (AI/ML), there has recently been increased interest in predicting costs at the level of the
individual member instead of estimating parameters for populations [13,20] (for general treatments on the contrast between model- and data-driven methods see [21–23]).
Table 1: Summary of studies that used machine learning methods to identify high-cost claimants.
Studies were selected by filtering the list of 55 studies in Table 1 of [14], and retaining only those studies where the population to be identified was the top 1% or rarer. High-cost claimants above $250,000 have
a nearly 100-fold lower prevalence than the top 10% of claimants, posing a much greater predictive
challenge. The only study that reported performance values was [24] (area under the ROC curve, AUC-ROC: 81%-86%); all other studies at this high threshold did not assess predictive modeling, did not report the AUC-ROC, or both.
Detailed review of previous prediction approaches
A common approach to the problem of predicting high-cost claimants is the use of logistic regression. A 2010 paper using logistic regression found that inclusion of medical condition information substantially
improved the prediction of high-cost patients, resulting in "good discrimination" (area under the ROC
curve, AUC-ROC=0.84) [38]. The paper also concluded that the number of chronic conditions should be considered as a predictor for high-cost prediction models. In a 2015 study, logistic regression was used
to predict which patients would transition from an intermediate-cost subpopulation to a high-cost
subpopulation with “reasonable discrimination” (AUC-ROC=0.67) [39]. Two predictors that were
significantly associated with high future costs were the count of chronic conditions and having a
diagnosis of congestive heart failure. In a 2017 paper, a group-based trajectory model based on logistic regression was applied to data from a large insurer to accurately predict patients in the highest spending
trajectory and the top fifth percentile for spending [40]. Using data from the Danish National Health
Service and Civil Registration System, Tamang et al. implemented a penalized logistic regression model with over 1,000 predictors and were able to achieve good predictive performance (AUC-ROC of 0.79) in a “cost bloom analysis” [41].
Alternative methods such as machine learning techniques are being applied increasingly to the problem
of predicting high-cost claimants. Using neural networks, a 2005 study comparing a population model
against three disease-specific models found that larger cohorts tended to result in a greater predictive
power of the disease-specific models compared to the population model [42]. A 2013 study using
routine electronic service records found that a score based on six simple dichotomous questions had
only “fair” predictive power for health and social care costs in older patients discharged from acute
medical units, with an AUC-ROC of 0.70 [43]. A Canadian group [44] applied machine learning to a
large range of clinical measurements to identify the top 5% of claimants, attaining an AUC-ROC of
0.81-0.94. In a study where an extreme gradient boosting model (XGBoost [45]) was applied to an
imbalanced dataset from three of the largest health insurers in the U.S., Hartman found that
oversampling the minority class resulted in better predictive performance (AUC-ROC 0.835) than
undersampling the majority class [46], at least at the highest thresholds. Gibbs et al. proposed the use of asymmetric cost matrices to optimize the threshold for an intervention [10].
Whether one uses logistic regression or alternative machine learning techniques, obvious questions
might be (1) which analytic approaches tend to result in the best predictive performance? and (2) which predictors tend to best explain high utilization? A recent review of 55 papers in the literature revealed
that high utilization was primarily explained by high levels of chronic and mental illness [14]. Another
2018 literature review concluded that gradient boosting had the best predictive performance overall
and for low- to medium-cost members, but neural networks and ridge regression had the highest
performance for high-cost members [13].
Our contribution
In this paper, we describe a novel solution to the problem of identifying future high-cost claimants using machine learning (ML). We used one of the largest datasets of healthcare insurance claims (over 50 million members), achieving one of the highest performance results reported in the literature reviewed
here. Because health insurance plans typically have access only to their own claims data and not to
hospital records or specialized registries, we constrained our algorithm to use variables available
through health insurance claims alone, along with public data on social determinants of health. No
hospital records or specialized registries were used.
Because we anticipated using the predictive model on the full population without any filtering, no
requirement regarding continuous enrollment in health insurance was imposed on the data. The
algorithm was therefore designed to operate for members with an incomplete medical history, such as less than one year of data. Additionally, the algorithm can function for members for whom only medical claims and not pharmaceutical claims are available – a situation occurring in approximately 40% of the insured population in our data.
Similar previous studies most often defined HiCCs using thresholds between $50,000 and $100,000 per
year; some other studies used instead the top 1% to 10% in costs [14]. We instead defined HiCCs as
members with a yearly total allowed amount that exceeded $250,000: this is the threshold at which
many reinsurance policies attach. At this threshold, high-cost claimants accounted for about 1.6 out of 1,000 members, making identification particularly challenging and requiring truly massive datasets.
Our goal was to identify members who would exceed a total healthcare cost of $250,000 over the next 12 months. Identification of high-cost claimants is typically modeled as a prediction problem in the framework of machine learning. Two formulations are used: (1) predicting cost, namely predicting
a member’s future dollar cost amount over the next 12 months; or (2) binary classification, namely,
predicting whether or not a member will exceed a certain cost amount. We evaluated both of these
formulations and then selected the second formulation (classification) after considering the business
requirements that are often around a particular threshold and the performance of the models.
Applying our methods to the problem we found that the best performing model overall was the Light
Gradient Boosted Trees Classifier (LightGBM) algorithm [47], achieving an AUC-ROC of 91.25% on a
holdout dataset. This is consistent with the findings of a recent literature review [13], where gradient
boosting methods had the best predictive performance overall. The model’s AUC-ROC was 7% higher
than all previous models at the $250,000 point, and it is estimated that the model’s high performance could generate considerable benefits for care management programs.
Construction of input variables
Our model for predicting health costs was constructed from administrative claims data – the data
created as part of electronic exchanges between medical facilities, professionals, and pharmacies on the
one side, and payers on the other. In the U.S. the format of claims data is standardized by HIPAA and
contains a listing of diagnoses, procedures, drugs, as well as costs. Taken together, claims data provide a
nearly complete summary of each patient’s medical journey across all types of care. Claims data are
readily available to all actors in the healthcare system – the medical insurers, government payers, and sponsors of healthcare – and are increasingly available to patients.
The majority of input variables were calculated from claims data over the one-year interval
4/1/2017 to 3/31/2018, called the “reporting period.” The remaining variables, specifically cost trend
and some enrollment variables, also included any claims available before the start of the reporting
period. If no data were available, we assigned null values. The following one-year interval from 4/1/2018
to 3/31/2019 was considered the “prediction period” and was used to define the high-cost status. At
the time of the calculation, claims data were >99% adjudicated.
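The separation between the two one-year windows can be illustrated with a minimal sketch. The dates and the $250,000 threshold come from the text; the record layout (pairs of service date and allowed amount), the function names, and the two toy features are hypothetical and stand in for the much richer variables built in SQL.

```python
from datetime import date

# Windows taken from the text; the claim-record layout below is hypothetical.
REPORTING_START, REPORTING_END = date(2017, 4, 1), date(2018, 3, 31)
PREDICTION_START, PREDICTION_END = date(2018, 4, 1), date(2019, 3, 31)
HICC_THRESHOLD = 250_000  # total allowed amount in dollars


def split_claims(claims):
    """Partition (service_date, allowed_amount) pairs into the three windows."""
    history, reporting, prediction = [], [], []
    for service_date, allowed in claims:
        if service_date < REPORTING_START:
            history.append(allowed)      # earlier claims, e.g. for cost trend
        elif service_date <= REPORTING_END:
            reporting.append(allowed)    # source of most input variables
        elif service_date <= PREDICTION_END:
            prediction.append(allowed)   # used only to define the label
    return history, reporting, prediction


def build_example(claims):
    """Return (features, label) for one member; missing data become None."""
    history, reporting, prediction = split_claims(claims)
    features = {
        "reporting_allowed": sum(reporting) if reporting else None,
        "history_allowed": sum(history) if history else None,
    }
    label = int(sum(prediction) > HICC_THRESHOLD)
    return features, label


claims = [
    (date(2017, 1, 15), 1_200.0),
    (date(2017, 9, 3), 40_000.0),
    (date(2018, 6, 20), 180_000.0),
    (date(2018, 11, 2), 95_000.0),
]
features, label = build_example(claims)
print(features, label)
```

Keeping the label computation in a separate list from the feature inputs is one simple way to make leakage from the prediction period into the predictors harder to introduce by accident.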
Predictors were constructed using SQL (Vertica Analytic Database v8.0.1-3; Micro Focus, Berkshire,
England, UK). Total allowed amounts for medical conditions (diagnoses) and procedures (services) in the reporting period were summed using BHI’s Episodes of Care grouper methodology [48]. To prevent
information leakage, data were processed in an isolated schema that excluded any events occurring
outside the reporting period. Indicators of social determinants of health were constructed using publicly
available data from the U.S. Census Bureau, specifically the 2017 American Community Survey (ACS) 5-year
estimates at the ZIP code level, which were linked to each member based on the member’s ZIP code. ACS
data included housing conditions, unemployment, poverty and fraction of the population who are a
racial minority.
Table 2 lists the types of variables used in the prediction. The variables were selected by pruning an
initial list of 6,006 variables, which improved the performance. Feature pruning was performed using a
feature importance metric available in DataRobot, which implemented a model-agnostic algorithm
based on permutation testing. We also used a variable importance metric available in DataRobot, which captured information about tree splitting in tree-based algorithms.
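Permutation-based importance can be sketched generically as follows. This is not DataRobot's implementation, whose details are not given here; it is a minimal model-agnostic version in which the importance of a feature is the mean drop in a score when that feature's column is shuffled. The toy model, the use of accuracy as the score, and the pruning cutoff of 0.01 are assumptions for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model output matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when one column is shuffled, per feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a trained classifier; it only looks at feature 0."""
    return int(row[0] > 0.5)


# Toy data: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(500)]
labels = [int(x0 > 0.5) for x0, _ in rows]

importances = permutation_importance(toy_model, rows, labels, n_features=2)
keep = [j for j, imp in enumerate(importances) if imp > 0.01]  # hypothetical cutoff
print(importances, keep)
```

In this toy setting the noise feature receives an importance of zero and would be pruned, which mirrors the idea of reducing a large candidate list to the variables that carry predictive signal.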
TABLE 2. List of the types of input variables. The input variables used in the final model were
selected from a larger list of 6,006 variables. The total number of predictors before and after feature
selection was 6,006 and 255, respectively. Columns “Original Count” and “Final Count” show the
number of predictors before and after feature selection, respectively. Details are found in S1 Table.
Training and holdout population
The model was trained and tested using realistic data, since we anticipated using the predictive model on the entire insured population including individuals with incomplete medical histories. Members were
required to be enrolled only on the first day of the last month of the reporting period (4/1/2017 to
3/31/2018) and on the last day of the first month of the prediction period (4/1/2018 to 3/31/2019). No
requirement regarding continuous enrollment was imposed on the data. In the resulting population,
63% had both medical and pharmaceutical claims, and 78% had at least one year of continuous
enrollment.
Due to limits on the file size for upload to DataRobot, the training dataset was reduced to a
subpopulation of 3 million members by downsampling the majority class as follows [46]. Due to the high
threshold of $250,000, HiCCs are rare and therefore each becomes very informative. All high-cost
claimants in the original training dataset were retained. The non-HiCCs in the original training dataset
were randomly sampled to bring the total complement in the reduced training dataset to 3 million. In the final training dataset, the number of HiCCs and non-HiCCs was 61,277 and 2,938,723, respectively,
and 20% of the training data was used for model selection. Training of all candidate models was
completed in 5 hours using a high-performance cluster.
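The downsampling step described above can be sketched in a few lines. The member dictionaries, the `is_hicc` field, and the function name are hypothetical; the logic (retain every HiCC, randomly sample non-HiCCs until a fixed total is reached) follows the description in the text.

```python
import random


def downsample_majority(members, target_size, seed=0):
    """Keep every positive; randomly sample negatives to reach target_size rows."""
    positives = [m for m in members if m["is_hicc"]]
    negatives = [m for m in members if not m["is_hicc"]]
    n_negatives = target_size - len(positives)
    if n_negatives < 0 or n_negatives > len(negatives):
        raise ValueError("target_size is incompatible with the class counts")
    rng = random.Random(seed)
    sample = positives + rng.sample(negatives, n_negatives)
    rng.shuffle(sample)  # avoid leaving all positives at the start of the file
    return sample


# Toy population: 20 HiCCs among 10,000 members, reduced to 1,000 rows.
population = [{"member_id": i, "is_hicc": i < 20} for i in range(10_000)]
training = downsample_majority(population, target_size=1_000)
n_pos = sum(m["is_hicc"] for m in training)
print(n_pos, len(training) - n_pos)  # 20 980
```

With the counts reported above, HiCCs make up 61,277 of 3,000,000 training rows (about 2.0%), which is well above their natural rate, so raw scores from a model trained this way would generally need recalibration before being read as probabilities in the full population.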
To report on model performance, the final model was evaluated on a holdout dataset of 9,684,279
members, described in Tables 3 and 4, which was not used in the training. The proportion of high-cost
claimants (HiCCs) in the holdout reflected the natural proportions in the commercially insured
population in the US. A small fraction of the population 65 or older was also included in this
population. We reported the performance on all age and gender groups. Additionally, we further
stratified the HiCCs into emergent and recurrent, corresponding to individuals with no previous history
of HiCC status, and those who previously were high-cost claimants. These were, in effect, two distinct
prediction problems: the emergent population was identified from a very large (N= 48,402,958) set of
candidates, whereas the recurrent population was identified in a much smaller set (N= 28,249), of which approximately 37% became HiCCs in subsequent years.
TABLE 3. Shown are demographics of the holdout population used for model evaluation.
TABLE 4. Shown are the highest-cost conditions for HiCCs (same year, top 20 categories ordered by
number of HiCCs), in the holdout population. * indicates not otherwise specified.
Machine learning and statistical methods
BHI uses DataRobot’s predictive platform (DR) version 5.0.1 (DataRobot, Boston, MA). Our platform
includes industry-leading tools for exploratory data analysis (EDA), model training, validation, and
deployment. Once the training data were uploaded, dozens of machine learning models were trained in
a high-throughput supervised learning system. Training used the log-loss error function and all the
classifiers have built-in regularization to minimize overfitting in the presence of class imbalances (see
e.g. [49]). Missing values were extremely rare and were imputed automatically with median values. Data used in model training were stripped of basic HIPAA identifiers and anonymized. We implemented multiple levels of security that governed access to both the models and the results.
Model selection proceeded in three rounds, where in each round, more data were used (16%, 32%, and then 64% of the sample), and the best-performing algorithm in each round was passed on to subsequent
rounds. The final round included a grid scan of hyperparameters. The algorithms considered included
Random Forests, Support Vector Machines, Gradient Boosted Trees, Elastic Nets, Extreme Gradient
Boosting, and ensembles. Multiple implementations of the algorithms were tested, including the open source machine learning libraries from R, scikit-learn, TensorFlow, Vowpal Wabbit, Spark ML, XGBoost, and LightGBM.
Once the automatic selection was complete, we reviewed the model performance on a holdout dataset. The top model was selected by reviewing each model holistically, including its predictive performance, scoring speed, and interpretability. The model was then subjected to a clinical review and assessment in a separate holdout dataset (see below). We found that we could improve the model’s performance by
pruning variables of lower importance. We minimized the risk of overfitting by preferring algorithms
that are inherently resistant to overfitting such as LightGBM, and by confirming that the model’s
performance was consistent on the training and holdout data.
Because HiCCs are rare, we used performance metrics that are appropriate for classification problems
with class imbalances [50,51]: area under the ROC Curve (AUC-ROC) and area under the precision-recall
curve (AUC-PR). The AUC-PR has been increasingly suggested as the best overall metric of model
performance [52], but we used both metrics since most of the existing literature in the field still uses the AUC-ROC metric. We also computed recall (also called sensitivity and true positive rate), precision (also
called positive predictive value), false positive rate, F1 score, and Matthews correlation coefficient
(MCC). In production, we applied isotonic regression to obtain calibrated probability scores [53]. The
optimal prediction threshold was normally selected using economic analysis (see below) or in other
cases, by selecting the score that maximizes the F1 score and the MCC – which agreed within 1% [54]. The model was deployed both as an API connected to our predictive platform and as an executable Java
Archive file (JAR), which can be run in any environment supporting Java (e.g., Linux or Windows). The
scores of the two implementations were checked and found to agree numerically to within
0.0001. Model validation used JARs containing the model to calculate prediction scores. The JAR
model scores over 400,000 rows per minute. We applied a stringent development and
validation process to ensure credibility and accuracy in our recommendations. Our development process and model training are very dependent on iterative clinician review and acceptance by clinical staff.
Also, the internal model validation was conducted by an independent team.
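For reference, the threshold-dependent metrics named above (precision, recall, F1 score, and MCC) can be computed from the confusion matrix as in the following sketch, which also selects the score threshold that maximizes a chosen metric. The function names and the ten-row toy data are hypothetical; this is a plain illustration of the standard definitions rather than the platform's own code.

```python
import math


def confusion(scores, labels, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are called positive."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and Matthews correlation coefficient."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


def best_threshold(scores, labels, metric="f1"):
    """Return the observed score that maximizes the chosen metric."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: metrics(*confusion(scores, labels, t))[metric])


scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
t_f1 = best_threshold(scores, labels, "f1")
t_mcc = best_threshold(scores, labels, "mcc")
result = metrics(*confusion(scores, labels, t_f1))
print(t_f1, t_mcc, result)
```

In this toy example the F1-optimal and MCC-optimal thresholds coincide, which is the kind of agreement between the two criteria noted in the text.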
Health economics methods
Going beyond statistical methods for performance, we estimated the financial and health impact of the model. For this estimation, we placed the model in a typical scenario of a care management program in which the model identifies individuals in need of timely interventions [8]. In such a program, individuals
with the highest model score are evaluated by clinical experts such as nurses, and if appropriate,
enrolled in an intervention program. We compared the identification model to common care
management programs, which often use simple rules to identify members at risk and therefore have a
very high false alert rate.
Because the model is expected to be used in a care management program, the model’s probable
financial and health impact was assessed in the following care management scenario, which is closely
based on our data from multiple health insurers in the U.S. In this scenario, one million members are
covered by the insurer, the rate of HiCCs is 0.16%, and the mean cost per HiCC case is $413,975. We
assumed that an intervention program costs $10,000 per member and achieves an average cost reduction of
15% per HiCC. We conservatively did not attempt to estimate the value of the
intervention beyond the first year or its effect on non-HiCC members. The care management program
has the capacity to treat between 300 and 1000 people per year. The model was considered as a
replacement for an existing rule-based HiCC identification system, which was assumed to have a
precision of 2%.
Our model training found that the highest-performing model was a Light Gradient Boosted Trees Classifier
[47]. The classifier uses 410 trees with a maximum of 16 leaves per tree, boosted at a learning rate of
0.05, and with no regularization. The three most important predictor variables are age, a tendency for
rising cost in the last three months of the reporting period, and life expectancy based on actuarial
tables (see S1 Table).
When evaluated on the holdout dataset, the algorithm achieves an area under ROC curve of 91.2%, and an area under the precision-recall curve of 23.1% (Figure 2). At a threshold of 0.76 (consistent with the highest F1 score), the model gives recall of 33.0% and precision of 29.9% (see Table 5).
Figure 2: (A) The receiver operating characteristic of the HiCC predictive model is shown. It has an area
under the ROC curve (AUC-ROC) of 91.2%. The red-dashed diagonal line indicates the chance-level
performance benchmark. (B) The precision-recall curve of the HiCC predictive model is shown. It has an
area under the PR curve (AUC-PR) of 23.1%. A red-dashed line just above the X axis indicates the
reference, i.e., the proportion of high-cost claimants in the holdout data, or 0.16% (cf. TABLE 3). The
model attains a precision > 80% when high predictive score thresholds are used.
Table 5. Shown is the performance of the model on the holdout population at various thresholds,
including the threshold that maximized the F1-Score, 0.76. TP, FP, FN, and TN the number of true
positives, false positives, false negatives, and true negatives, respectively. Recall is also called true
positive rate or sensitivity; TNR is true negative rate (also called specificity); precision is also called
positive predictive value; and NPV is negative predictive value.
The following tables (Tables 6 and 7) show the model’s performance in the population of emergent and
recurrent HiCCs, namely those without and with prior history of HiCC status, respectively. The AUC of
the model is higher in the emergent than the recurrent population (89% vs. 81%) and consistent
across gender and demographic cohorts. Members who either (1) lacked data on pharmacy benefits
or (2) lacked one full year of data represented 37% and 22% of the population, respectively, yet the
model maintained its AUC in these populations to within 1%.
TABLE 6. Shown is the model performance in the emergent HiCC population. The threshold for a positive
class is 0.76, which maximizes the F1 score. AUC is area under the ROC curve; recall is also called true
positive rate or sensitivity; FPR is false positive rate; precision is also called positive predictive value; and NPV is negative predictive value.
TABLE 7. Shown is the model performance for the recurrent HiCC population. The threshold for a positive class is 0.92, which maximizes the F1 score. AUC is area under the ROC curve; recall is also called true positive rate or sensitivity; FPR is false positive rate; precision is also called positive predictive value; and NPV is negative predictive value.
We assessed whether the model possibly under-predicts the number of HiCCs in populations with racial minorities and found no evidence for this. In a univariate analysis correlating the average model score
in each ZIP code with the ZIP’s fraction of minority population we found a fairly strong positive
relationship (R² = 47%): more HiCCs are predicted in areas with a larger minority population. There was
no evidence of higher cost in areas with a larger minority population, and indeed the average medical cost
tended to be slightly lower in these areas ($1.60 lower for every percentage point increase in minority
status).
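The univariate analysis described above amounts to averaging model scores within each ZIP code and computing the squared Pearson correlation with the ZIP-level minority fraction. A minimal sketch follows; the ZIP codes, scores, and fractions are invented purely to exercise the code and carry no empirical meaning, and the function names are hypothetical.

```python
from collections import defaultdict


def mean_score_by_zip(members):
    """Average model score per ZIP code from (zip_code, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for zip_code, score in members:
        totals[zip_code][0] += score
        totals[zip_code][1] += 1
    return {z: total / count for z, (total, count) in totals.items()}


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Invented inputs: member-level scores and ZIP-level minority fractions.
members = [("00001", 0.02), ("00001", 0.04), ("00002", 0.05),
           ("00002", 0.07), ("00003", 0.10)]
minority_fraction = {"00001": 0.1, "00002": 0.2, "00003": 0.4}

means = mean_score_by_zip(members)
zips = sorted(means)
r2 = r_squared([minority_fraction[z] for z in zips], [means[z] for z in zips])
print(round(r2, 3))
```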
Health economic analysis
We estimated the financial and health impact of placing the ML algorithm in a typical care management program covering a population of 1 million individuals. In the first step of the program, identification, a
subset of the population is identified as likely future high-cost claimants (Table 8). In a representative
case, the program has a capacity of 1,000 HiCCs, including 500 recurrent (previously known) and 500
emergent HiCCs. We set the classification threshold separately for each population and
calculated that the algorithm would attain precision of 32% and 66% for the emergent and recurrent
cohorts, respectively.
Table 8. Shown is the ability of the machine learning algorithm to identify HiCCs for emergent and
recurrent care management programs.
To compare the machine learning (ML) algorithm to a conventional rule-based system, we assumed that
the program has an overall capacity of 500. The prediction threshold of the algorithm is set to 93% in
order to generate 500 members with the highest risk scores. At this threshold, the precision was 39.8%, producing a population of 199 true HiCCs (Table 9). By contrast, the rule-based system identified only
10 true HiCCs, and thus the ML algorithm can impact nearly 20 times as many HiCCs. The cost of the
program was $5 million in both cases, which translates to a cost per HiCC of $25,125 and $500,000 for the ML algorithm and the rule-based system, respectively. The machine learning-based system would
result in a net savings of $7.3 million against a net financial cost of $4.4 million for the rule-based
system.
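The arithmetic behind this comparison can be sketched using the scenario parameters stated in the methods: an intervention cost of $10,000 per member, a mean cost of $413,975 per HiCC, and a 15% cost reduction per HiCC. The function name is hypothetical, and the sketch assumes, as the scenario does, that savings accrue only for correctly identified HiCCs within the first year.

```python
def program_economics(capacity, precision, cost_per_member=10_000,
                      mean_hicc_cost=413_975, reduction=0.15):
    """First-year economics of enrolling `capacity` flagged members."""
    true_hiccs = round(capacity * precision)       # flagged members who are HiCCs
    program_cost = capacity * cost_per_member      # everyone flagged is treated
    gross_savings = true_hiccs * mean_hicc_cost * reduction
    return {
        "true_hiccs": true_hiccs,
        "program_cost": program_cost,
        "cost_per_hicc": program_cost / true_hiccs,
        "net_savings": gross_savings - program_cost,
    }


ml = program_economics(capacity=500, precision=0.398)
rules = program_economics(capacity=500, precision=0.02)
for name, result in (("ML algorithm", ml), ("rule-based", rules)):
    print(name, result["true_hiccs"],
          round(result["cost_per_hicc"]), round(result["net_savings"]))
```

Under these assumptions the sketch gives net savings of about $7.36 million for the ML algorithm and a net cost of about $4.38 million for the rule-based system. Small differences from the rounded figures quoted in the text may reflect details of the original calculation that are not stated here.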
Table 9. Shown is the estimated impact of HiCC identification on a typical care management program, and comparison of a conventional rule-based system with the current machine learning algorithm.
Our study describes an algorithm for identification of high-cost claimants at the level of $250,000 per
year using the methods of machine learning. We demonstrate that using administrative claims with
census data alone makes it possible to achieve AUC-ROC scores greater than 90%, even though HiCCs
represent only 0.1% of the commercially insured population in the U.S. These results compare very
favorably with results published in the literature, which attain performance of 80%-85%, even for
populations that are easier to identify. This opens an opportunity to make interventions and achieve
significant cost savings. The performance remains essentially unchanged even in populations with
limited data, such as partial-year enrollment or lack of drug benefits.
Unlike previous studies that used a single ML method, we applied a modern parallel machine learning
platform that considered over 50 models and automatically tuned their hyperparameters. In our
experience, the best performing non-ensemble models tended to be the eXtreme Gradient Boosting
algorithm [45] and its derivative, the LightGBM [47]. We also found that ensembling (or blending) of
models tended to increase performance by 0.2% of AUC-ROC (results not shown here), but these models
were not adopted because these gains were outweighed by the increase in computational cost and
because they created certain practical barriers to model deployment.
While the literature on predictive models traditionally focuses on AUC-ROC and AUC-PR, in this
application a more important measure of performance was the precision (also known as PPV) at the
highest-risk 1,000 members. This is simply due to logistics and the extreme rarity of very high-cost
claimants. Because the program has a limited enrollment capacity, only the highest-risk HiCCs are
referred to this program, and the financial and health outcomes are influenced by the precision in this elite cohort.
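Precision among the top-ranked members, and the score threshold implied by a fixed enrollment capacity, can be computed as in the sketch below. The function names and the six-member toy data are hypothetical; ties in scores are not handled specially.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring members."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(y for _, y in top) / len(top)


def threshold_for_capacity(scores, k):
    """Lowest score that still falls within the k highest-scoring members."""
    return sorted(scores, reverse=True)[k - 1]


scores = [0.99, 0.95, 0.93, 0.80, 0.40, 0.10]
labels = [1, 0, 1, 1, 0, 0]
capacity = 3  # toy stand-in for a program capacity such as 1,000 members

p_at_k = precision_at_k(scores, labels, capacity)
cutoff = threshold_for_capacity(scores, capacity)
print(p_at_k, cutoff)
```

Fixing the capacity first and deriving the threshold from it reflects the operational constraint described above: the program can only enroll a set number of members, so the relevant question is how many of those are true HiCCs.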
The model was assessed by considering the model’s precision and recall, combined with the projected effectiveness and cost of interventions. We found that in a typical care management program, the ML
algorithm would create significant health and economic benefits. When compared to a typical rule-
based system, the algorithm identifies approximately 20 times as many high-cost, high-needs
individuals, and thus has a nearly 20 times lower cost per case. The care management program results in considerable net savings ($7.3 million) versus a net cost in the rule-based system ($4.4 million).
Future work could attempt to improve the model further, at the very least by incorporating new
predictive information. Our experience indicates that claims data are an incredibly rich source of
information, and we believe that there remain opportunities to improve model performance through
the synthesis of new predictor variables. Another possibility for improvement might be the inclusion of information from external specialty sources, such as credit reporting databases and electronic health records.
Limitations and appropriate use of the model
A general limitation of all predictive methods in practice is that they do not give a prescriptive solution, and usually a separate prescriptive methodology is needed for planning interventions for HiCCs, which is
outside the scope of this work. Common interventions include finding better and cost-efficient
healthcare settings, closely monitoring members to control chronic conditions, and others. Many but
not all members at risk for high costs would be willing or able to receive interventions [8].
We took multiple steps to minimize errors in the trained model, including quality assurance
and other sound software development life cycle practices. Prediction algorithms need to be updated
regularly because the healthcare system changes rapidly, with new treatments, new treatment
pathways, and changes in costs. We plan to address these changes by regularly
updating the training data and the model after deployment. Because our training data contain only
the U.S. commercially insured population and their dependents, the model’s performance is expected to degrade in populations aged 65 or older and in populations without commercial insurance. We evaluated
the performance of the model in a variety of populations and found that the degradation is small;
however, the model should be used with additional caution in such populations.
The algorithm presented here is primarily designed to predict the risk of high cost, rather than measures
of health status or health needs. Health is known to be systematically different from cost, and
populations with barriers to healthcare, such as lower socio-economic classes or certain racial
minorities, have systematically lower healthcare expenditure [55,56]. Furthermore, the algorithm uses demographic data such as age, gender, and ZIP code-linked variables in order to maximize its predictive
performance. Therefore, the most appropriate use of this model is either in a strictly financial setting,
or in a holistic care management decision system. Predictions should be used by a
healthcare professional who has access to rich contextual data, so that the professional can
account for information not available to the algorithm, compensate for gaps in the algorithm’s performance, and ensure equitable outcomes.
Conclusion
The predictive model described here demonstrates the potential of the next generation of predictive
algorithms in healthcare. High-cost claimants exceeding $250,000 in annual cost account for
nearly 10% of overall costs but are very rare, representing just 1.6 in every 1,000 members. By using
hundreds of variables, rich claims data, and modern machine learning, it was possible to train a machine
learning model that attains an AUC-ROC of 91% and a precision of more than 30% in the top 1,000
members. Predictive performance at this level could make it possible to implement cost-effective
interventions.
Acknowledgments
We thank our colleagues Michael Rogero, Munir Islam, Joulan Wu, and Carolyn Jevit for implementing this algorithm and DataRobot for its support. Richard Paul provided oversight of the solution architecture and release management. Alan Schwartz, Ilya Safro, Ehsan Sadrfaridpour, Justin Sybrandt, and Brian Hartman provided helpful comments that greatly improved this study.
References
1. Zook CJ, Moore FD. High-Cost Users of Medical Care. New England Journal of Medicine. 1980;302: 996–1002. doi:10.1056/NEJM198005013021804
2. Wilson DM, Troy TD, Jones KL. High Cost Claimants: Private vs. Public Sector Approaches. American Health Policy Institute; 2016. Available: http://www.americanhealthpolicy.org/Content/documents/resources/High_Cost_Claimants.pdf
3. McWilliams JM, Schwartz AL. Focusing on High-Cost Patients — The Key to Addressing High Costs? N Engl J Med. 2017;376: 807–809. doi:10.1056/NEJMp1612779
4. Blue Health Intelligence. Getting to the Root of High-Cost Claimants. 2018. Available: https://bluehealthintelligence.com/getting-to-the-root-of-high-cost-claimants/
5. National Survey of Employer-Sponsored Health Plans. Mercer HR and Healthcare Consulting; 2018. Available: https://www.mercer.us/what-we-do/health-and-benefits/strategy-and-transformation/mercer-national-survey-benefit-trends.html
6. Blumenthal D, Abrams MK. Tailoring Complex Care Management for High-Need, High-Cost Patients. JAMA. 2016;316: 1657. doi:10.1001/jama.2016.12388
7. Das LT, Abramson EL, Kaushal R. High-Need, High-Cost Patients Offer Solutions for Improving Their Care and Reducing Costs. 2019; 4.
8. Anderson GF, Ballreich J, Bleich S, Boyd C, DuGoff E, Leff B, et al. Attributes Common to Programs That Successfully Treat High-Need, High-Cost Individuals. American Journal of Managed Care. 2015;21: 4.
9. Figueroa JF, Zhou X, Jha AK. Characteristics And Spending Patterns Of Persistently High-Cost Medicare Patients. Health Affairs. 2019;38: 107–114. doi:10.1377/hlthaff.2018.05160
10. Gibbs Z, Hartman BM. Using Asymmetric Cost Matrices to Optimize Wellness Intervention. Society of Actuaries Health Meeting; 2018; Austin, Texas. Available: https://www.soa.org/globalassets/assets/files/e-business/pd/events/2018/health-meeting/pd-2018-06-health-session-057.pdf
11. Bates DW, Saria S, Ohno-Machado L, Shah A, Escobar G. Big Data In Health Care: Using Analytics To Identify And Manage High-Risk And High-Cost Patients. Health Affairs. 2014;33: 1123–1131. doi:10.1377/hlthaff.2014.0041
12. Hileman G, Steele S. Accuracy of Claims-Based Risk Scoring Models. 2016. Available: https://www.soa.org/globalassets/assets/files/research/research-2016-accuracy-claims-based-risk-scoring-models.pdf
13. Morid MA, Kawamoto K, Ault T, Dorius J, Abdelrahman S. Supervised Learning Methods for Predicting Healthcare Costs: Systematic Literature Review and Empirical Evaluation. AMIA Annu Symp Proc. 2017;2017: 1312–1321.
14. Wammes JJG, van der Wees PJ, Tanke MAC, Westert GP, Jeurissen PPT. Systematic review of high-cost patients’ characteristics and healthcare utilisation. BMJ Open. 2018;8: e023113. doi:10.1136/bmjopen-2018-023113
15. Deb P, Munkin MK, Trivedi PK. Bayesian analysis of the two-part model with endogeneity: application to health care expenditure. J Appl Econ. 2006;21: 1081–1099. doi:10.1002/jae.891
16. Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for analyzing health care utilization and costs. Annual Review of Public Health. 1999;20: 125–144. doi:10.1146/annurev.publhealth.20.1.125
17. Duan N. Smearing Estimate: A Nonparametric Retransformation Method. Journal of the American Statistical Association. 1983;78: 605–610. doi:10.1080/01621459.1983.10478017
18. Duan N, Manning WG, Morris CN, Newhouse JP. A Comparison of Alternative Models for the Demand for Medical Care. Journal of Business & Economic Statistics. 1983;1: 115. doi:10.2307/1391852
19. Mihaylova B, Briggs A, O’Hagan A, Thompson SG. Review of statistical methods for analysing healthcare resources and costs. Health Economics. 2011;20: 897–916. doi:10.1002/hec.1653
20. Chechulin Y, Nazerian A, Rais S, Malikov K. Predicting Patients with High Risk of Becoming High-Cost Healthcare Users in Ontario (Canada). Healthcare Policy | Politiques de Santé. 2014;9: 68–79. doi:10.12927/hcpol.2014.23710
21. Breiman L. Statistical Modeling: The Two Cultures. Statistical Science. 2001;16: 199–215.
22. Kattan MW, Gönen M. The prediction philosophy in statistics. Urologic Oncology: Seminars and Original Investigations. 2008;26: 316–319. doi:10.1016/j.urolonc.2006.12.002
23. Shmueli G. To Explain or to Predict? Statist Sci. 2010;25: 289–310. doi:10.1214/10-STS330
24. Meehan J, Chou C-A, Khasawneh MT. Predictive Modeling and Analysis of High-Cost Patients. Institute of Industrial Engineers; 2015. pp. 2566–2575.
25. Ash AS, Zhao Y, Ellis RP, Kramer MS. Finding Future High-cost Cases: Comparing Prior Cost Versus Diagnosis-based Methods. : 13.
26. Coughlin TA, Long SK. Health Care Spending and Service Use among High-Cost Medicaid Beneficiaries, 2002—2004. Inquiry. 2010;46: 405–417.
27. DeLia D. Mortality, Disenrollment, and Spending Persistence in Medicaid and CHIP. Medical Care. 2017;55: 220–228. doi:10.1097/MLR.0000000000000648
28. Hensel JM, Taylor VH, Fung K, Vigod SN. Rates of Mental Illness and Addiction among High-Cost Users of Medical Services in Ontario. Can J Psychiatry. 2016;61: 358–366. doi:10.1177/0706743716644764
29. Meenan RT, Goodman MJ, Fishman PA, Hornbrook MC, O’Keeffe-Rosetti MC, Bachman DJ. Using Risk-Adjustment Models to Identify High-Cost Risks. Medical Care. 2003;41: 1301–1312.
30. Monheit AC. Persistence in health expenditures in the short run: prevalence and consequences. Med Care. 2003;41: III53–III64. doi:10.1097/01.MLR.0000076046.46152.EF
31. Powers BW, Chaguturu SK. ACOs and High-Cost Patients. N Engl J Med. 2016;374: 203–205. doi:10.1056/NEJMp1511131
32. Riley GF. Long-Term Trends In The Concentration Of Medicare Spending. Health Affairs. 2007;26: 808–816. doi:10.1377/hlthaff.26.3.808
33. Robst J. Developing Models to Predict Persistent High-Cost Cases in Florida Medicaid. Population Health Management. 2015;18: 467–476. doi:10.1089/pop.2014.0174
34. Rosella LC, Fitzpatrick T, Wodchis WP, Calzavara A, Manson H, Goel V. High-cost health care users in Ontario, Canada: demographic, socio-economic, and health status characteristics. BMC Health Serv Res. 2014;14: 532. doi:10.1186/s12913-014-0532-2
35. Wammes JJG, Tanke M, Jonkers W, Westert GP, Van der Wees P, Jeurissen PP. Characteristics and healthcare utilisation patterns of high-cost beneficiaries in the Netherlands: a cross-sectional claims database study. BMJ Open. 2017;7: e017775. doi:10.1136/bmjopen-2017-017775
36. Wodchis WP, Austin PC, Henry DA. A 3-year study of high-cost users of health care. CMAJ. 2016;188: 182–188. doi:10.1503/cmaj.150064
37. Zhao Y, Ash AS, Haughton J, McMillan B. Identifying Future High-Cost Cases Through Predictive Modeling. Disease Management & Health Outcomes. 2003;11: 389–397. doi:10.2165/00115677-200311060-00005
38. Fleishman JA, Cohen JW. Using Information on Clinical Conditions to Predict High-Cost Patients. Health Services Research. 2010;45: 532–552. doi:10.1111/j.1475-6773.2009.01080.x
39. Lu J, Britton E, Ferrance J, Rice E, Kuzel A, Dow A. Identifying Future High Cost Individuals within an Intermediate Cost Population. Qual Prim Care. 2016;23: 318–326.
40. Lauffenburger JC, Franklin JM, Krumme AA, Shrank WH, Brennan TA, Matlin OS, et al. Longitudinal Patterns of Spending Enhance the Ability to Predict Costly Patients: A Novel Approach to Identify Patients for Cost Containment. Medical Care. 2017;55: 64–73. doi:10.1097/MLR.0000000000000623
41. Tamang S, Milstein A, Sørensen HT, Pedersen L, Mackey L, Betterton J-R, et al. Predicting patient ‘cost blooms’ in Denmark: a longitudinal population-based study. BMJ Open. 2017;7: e011580. doi:10.1136/bmjopen-2016-011580
42. Crawford AG, Fuhr JP, Clarke J, Hubbs B. Comparative Effectiveness of Total Population versus Disease-Specific Neural Network Models in Predicting Medical Costs. Disease Management. 2005;8: 277–287. doi:10.1089/dis.2005.8.277
43. Edmans J, Bradshaw L, Gladman JRF, Franklin M, Berdunov V, Elliott R, et al. The Identification of Seniors at Risk (ISAR) score to predict clinical outcomes and health service costs in older people discharged from UK acute medical units. Age and Ageing. 2013;42: 747–753. doi:10.1093/ageing/aft054
44. Izad Shenas SA, Raahemi B, Hossein Tekieh M, Kuziemsky C. Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes. Computers in Biology and Medicine. 2014;53: 9–18. doi:10.1016/j.compbiomed.2014.07.005
45. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16. San Francisco, California, USA: ACM Press; 2016. pp. 785–794. doi:10.1145/2939672.2939785
46. Hartman B, Owen R, Gibbs Z. Predicting High-cost Members in the HCCI Database. Brigham Young University; 2019. Available: https://hartman.byu.edu/docs/files/HartmanOwenGibbs_HighCostClaims.pdf
47. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017. pp. 3146–3154. Available: http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
48. Blue Health Intelligence. Episode of Care (EoC) Grouper. 2019. Available: https://bluehealthintelligence.com/wp-content/uploads/2019/06/bhi_eocgrouper_0619.pdf
49. Sadrfaridpour E, Razzaghi T, Safro I. Engineering fast multilevel support vector machines. Mach Learn. 2019;108: 1879–1917. doi:10.1007/s10994-019-05800-7
50. Sammut C, Webb GI, editors. Encyclopedia of machine learning. New York; London: Springer; 2010.
51. Vermeulen AF. Industrial Machine Learning - using artificial intelligence as a transformational disruptor. S.l.: APRESS; 2020.
52. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning - ICML ’06. Pittsburgh, Pennsylvania: ACM Press; 2006. pp. 233–240. doi:10.1145/1143844.1143874
53. Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’02. Edmonton, Alberta, Canada: ACM Press; 2002. p. 694. doi:10.1145/775047.775151
54. Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining. 2017;10: 35. doi:10.1186/s13040-017-0155-3
55. Adler NE, Newman K. Socioeconomic Disparities In Health: Pathways And Policies. Health Affairs. 2002;21: 60–76. doi:10.1377/hlthaff.21.2.60
56. Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366: 447–453. doi:10.1126/science.aax2342
Table S1. The top 20 input variables of the final model ranked by variable importance. Variable importance was calculated based on the weighted number of tree splits and has been normalized so that the most important variable (AGE) has a relative importance of 1. A total of 255 variables were used in the final model. The allowed amount, sometimes simply referred to as cost, is the cost of care after the settlement between payers and providers.