Background: As part of the effort to improve quality and to reduce national healthcare costs, the Centers
for Medicare & Medicaid Services (CMS) are responsible for creating and maintaining an array of
clinical quality measures (CQMs) for assessing healthcare structure, process, outcome, and patient
experience across various conditions, clinical specialties, and settings. The development and maintenance
of CQMs involves substantial and ongoing evaluation of the evidence on the measure’s properties—
importance, reliability, validity, feasibility, and usability. As such, CMS conducts monthly environmental
scans of the published clinical and health services literature. Conducting time-consuming, exhaustive
evaluations of the ever-changing healthcare literature presents one of the largest challenges to an
evidence-based approach to healthcare quality improvement. Thus, it is imperative to leverage automated techniques to aid CMS in the identification of clinical and health services literature relevant to CQMs.
Additionally, the estimated labor hours and related cost savings of using CMS Sematrix compared to a
traditional literature review are roughly 818 hours and $122,000 for a single monthly environmental scan [1].
Objective: To design CMS Sematrix, an automated knowledge extraction framework that scans
published clinical and health services literature, identifies relevant articles for a given CQM, and stores
evidence presented by the articles in a form capable of analysis and synthesis.
Methods: CMS Sematrix contains three major components: (1) a quality measure ontology to describe
high-level knowledge constructs contained in CQMs; (2) a natural language processing (NLP) system to
extract concepts and relations that correspond to the ontology from text; and (3) a graph database to store the concepts and relations extracted from text as Resource Description Framework (RDF) triples.
To build the framework, a set of 65 CQMs covering a variety of healthcare domains and 98 biomedical
articles (PubMed Abstracts and PubMed Central Full Articles) were manually annotated with CQM
ontology specific concepts and relations. In addition, the 65 CQMs were manually reviewed by subject matter experts in order to extract the high-level quality constructs. Lastly, to validate that the documents returned by CMS Sematrix contain information relevant to the given quality measure, we developed an automated procedure for identifying relevant documents. The results of this automated procedure were then compared to a manual document review for a set of 9 randomly selected measures from the set of 65 CQMs.
Results: The NLP component of CMS Sematrix was able to correctly identify CQM concepts with an
average recall score of 87% for measure descriptions and 86% for articles. In addition, CMS Sematrix
achieved overall precision and recall scores of 84% and 62% when extracting concept relations. We then conducted an environmental scan of the PubMed and PubMed Central abstracts and articles using the set of 65 CQMs. For the 9 measures selected for manual review, our automated procedure for determining
relevant documents obtained average precision and recall scores of 84% and 88%. Running this
procedure on the full set of 65 CQMs, we found that on average roughly 72% of the articles returned by
CMS Sematrix for a given measure contain information relevant to the measure description using our
June 2018 environmental scan data.
Conclusions: CMS Sematrix is able to identify articles published in the clinical and health services
literature that contain information relevant to a given CQM. In practice, CMS Sematrix can reduce the
time-consuming burden of the CMS monthly environmental scans and allow measure developers to
quickly and accurately design CQMs to track outcomes in order to improve the national healthcare
system.
Keywords: Quality of Health Care - Natural Language Processing - Biomedical Ontologies
The IMPACT Act [2], MACRA [3], and the 21st Century Cures Act [4] are three of the more
recent legislative manifestations of the acknowledged importance of reducing the cost of healthcare while improving quality and enabling innovation. Recent estimates project that the cost of healthcare will reach
nearly 20% of the Gross Domestic Product by 2026 [5], and those are dollars that might otherwise be
spent on complementary societal needs like infrastructure, education, housing, and many others,
especially at the state and local level. As the nation’s largest payer for healthcare, the Centers for
Medicare & Medicaid Services (CMS) is a central component to the success of this effort.
The transformation of the healthcare system from volume to value is the preferred mechanism to
achieve cost reduction, quality improvement, and innovation. This transformation requires an array of
clinical quality measures (CQMs) for assessing healthcare structure, process, outcome, and patient
experience across various conditions, clinical specialties, and settings. To achieve its quality and
transformation priorities, CMS maintains an inventory of over 2,000 CQMs for use in quality
improvement, comparative reporting, value-based purchasing, and alternative payment models
(http://cmit.cms.gov). The development and maintenance of CQMs involves substantial and ongoing
evaluation of the evidence on the measure’s properties—importance, reliability, validity, feasibility, and
usability. The use of measures with poor reliability and validity wastes time and resources and may result
in unintended system harms. Measures that are not feasible impose significant burden on consumers and
clinicians. Measures must have information value to be usable for selecting clinicians or health plans (for consumers), allocating resources to quality improvement (for clinicians), or prioritizing clinical and health services research (for government).
To ensure this evidence is timely and complete, CMS conducts a monthly environmental scan of
the published clinical and health services literature for all 2,000 CQMs. Conducting a scan for such a
high volume of measures would be challenging enough; however, the challenge is further exacerbated
by the rapid increase in the number of research publications. In 2017 alone, MEDLINE, the U.S.
National Library of Medicine’s database of journal articles on biomedicine, added more than 813,500 new
citations [6]. Human review of the results of this scan would be cost-prohibitive and would not keep
pace with the increase in the number of publications. A human reviewer, no matter how proficient, must select relevant keywords to perform the search, read each returned abstract to establish relevance (or not) with the measure under consideration, rank the relevant abstracts to identify the subset of full-text articles to review, read the identified full-text articles, extract the knowledge contained in the full-text articles that
provides evidence on the measure properties, and store that evidence in some form capable of analysis
and synthesis. For a small set of measures in a common domain, a human review may take 1,000 hours and cost hundreds of thousands of dollars.
To facilitate the monthly environmental scan for every measure in the CMS Measure Inventory,
we have collaborated with the CMS Measures Management System (MMS) to develop a system called
CMS Sematrix that automates the identification of clinical and health services literature relevant to
CQMs, the extraction of knowledge contained in the relevant abstracts and full-text articles that provides
evidence on the measure properties, and the storage of that evidence in a form capable of analysis and
synthesis. CMS Sematrix contains three major components: (1) a quality measure ontology to describe
high-level knowledge constructs contained in CQMs; (2) a natural language processing (NLP) system to
extract concepts and relations that correspond to the ontology from text; and (3) a graph database to store the concepts and relations extracted from text as Resource Description Framework (RDF) [7] triples
that can be queried to deduce measure components within documents. To our knowledge, there is no
currently available off-the-shelf computational cognitive service that provides a competitive option to
CMS Sematrix due to its utilization of a highly specific clinical quality measure ontology created
explicitly for use in our system.
The overall objective of this study is to design an automated knowledge extraction framework we
call CMS Sematrix that scans the published clinical and health services literature, identifies relevant
articles for a given CQM, and stores the evidence contained within the articles in a form capable of
analysis and synthesis. To achieve this objective, we detail the steps required to build the individual
components that make up the CMS Sematrix system, namely the definitions of the CQM ontology, the
structure and training methodology of the NLP engine, and the resulting knowledge database. Lastly, we
aim to show that CMS Sematrix dramatically reduces the labor hours and related cost compared to a
traditional literature review without losing much accuracy for developing and maintaining CQMs, and the results returned by the system are relevant to CQM developers.
Clinical Quality Measure (CQM) Ontology
The goal of the CQM ontology is to standardize the essential features of a CQM into a set of
abstract concepts with defined relationships between them. The components of the measure, such as the
measure focus, target population, quality construct, and quality priority, can then be systematically
represented as combinations of these concepts. This allows NLP tools to identify and extract these
concepts and relations and place them in a structured format that can be used for semantic reasoning and
analysis. The specific application for CMS was to extract these concepts and relations from both
clinical and health services research articles and the measure description text to identify articles that
contain information relevant to a specific measure.
The abstract concepts in the ontology are displayed in Table 1. It is important to note that the
Population concept can also have the attributes “Age Group”, “Gender”, or social determinants of health
which can be used for further refinement. Similarly, the health status concept has attributes “severity”
and “time”. In addition to concepts, we have defined the ways in which the concepts can relate to each other (see Table 2). Each relation has a specified domain and range among the concepts, as denoted in the table.
Table 1. Abstract CQM Concepts. The five high-level measure concepts captured by the CQM ontology, along with their definitions and examples.
Table 2. CQM Relations with Domain and Range. The five base semantic relations in the CQM ontology along with their definitions and the concepts they relate.
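To illustrate how concepts, attributes, and relations with a specified domain and range can be held in a structured form, the following Python sketch models a small part of such an ontology. The concept names follow the text; the single relation shown (`experiences`, from Population to Change Concept) and all class and function names are assumptions made for this example and do not reproduce Table 2.

```python
from dataclasses import dataclass, field
from enum import Enum


class Concept(Enum):
    POPULATION = "Population"
    HEALTH_STATUS = "Health Status"
    CHANGE_CONCEPT = "Change Concept"
    OUTPUT = "Output"
    UTILIZATION = "Utilization"


@dataclass(frozen=True)
class Relation:
    name: str
    domain: Concept   # concept type allowed as the subject
    range_: Concept   # concept type allowed as the object


@dataclass
class Entity:
    text: str
    concept: Concept
    attributes: dict = field(default_factory=dict)  # e.g. "Age Group", "Gender"


# Illustrative relation only; its domain and range are assumed, not taken from Table 2.
EXPERIENCES = Relation("experiences", Concept.POPULATION, Concept.CHANGE_CONCEPT)


def is_valid(subject: Entity, relation: Relation, obj: Entity) -> bool:
    """Check that a candidate statement respects the relation's domain and range."""
    return subject.concept is relation.domain and obj.concept is relation.range_


patients = Entity("patients 18 years and older", Concept.POPULATION,
                  {"Age Group": "18 years and older"})
therapy = Entity("primary reperfusion therapy", Concept.CHANGE_CONCEPT)
ami = Entity("AMI", Concept.HEALTH_STATUS)

print(is_valid(patients, EXPERIENCES, therapy))
print(is_valid(ami, EXPERIENCES, therapy))
```

Keeping the domain and range on the relation itself means a candidate statement can be rejected before it is stored, which is one simple way to enforce the restrictions the ontology describes.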
Natural Language Processing (NLP)
The recent emergence of Big Data RDF triple stores makes it possible to merge massive amounts
of structured and unstructured data by defining a common ontology model for representing the domain
knowledge and storing all the domain assertions as semantic triples. However, technology gaps exist.
More specifically, there is a lack of: (1) efficient and accurate algorithms or tools to automatically
transform unstructured document content into knowledge graphs; (2) rich and complete semantic
representation to store and query actionable domain knowledge that is compatible with the RDF standard;
and (3) methods for accessing information to enable intelligent search applications while hiding the
underlying complexity of the voluminous semantic data being searched.
CMS Sematrix uses the K-Extractor [8] NLP technology for the extraction of detailed semantic
statements from unstructured text. The driver of the K-Extractor is the deep NLP Pipeline (Figure 1),
which spans the lexical, syntactic, and semantic layers of knowledge extraction from text. It acts as a
pipeline for filtering, data reduction, and value-added (semantics) functions, and discovers concepts and
relations relevant to the ontology in the form of entities and relations between these entities (Figure 2).
More specifically, CMS Sematrix accepts text documents (primarily scientific or technical) as inputs and then extracts both entities (e.g., health status, change concept, or output) and the significant relationships between and among them using a pipeline of NLP modules. It uses the resulting semantic Web Ontology
Language (OWL)/RDF [7] knowledge base to support semantic query and graph visualization. CMS
Sematrix is scalable and can process approximately 25,000 documents per day per processing core. It has
already processed 8.5 million PubMed abstracts and 1.9 million PubMed Central full articles from 2005
to the present.
Figure 1. K-Extractor's Deep NLP Pipeline. The flow of K-Extractor’s NLP pipeline starting from the raw, unstructured document text and ending with the structured RDF triple representation of the extracted concepts and relations.
Figure 2. An example of the entities and semantic relations identified by CMS Sematrix along with their associated ontology classes. Screenshot of the text of a measure description that has been
annotated with the CQM concepts and relations using the brat annotation tool. This section of the text
only contains health statuses (AMI, ST-segment elevation, and LBBB), the affected population (patients 18 years and older), and the change concept (primary reperfusion therapy). They are each connected by the appropriate semantic relation.
Concept Identification
K-Extractor’s concept detection methods range from the detection of simple nominal and verbal
concepts to more complex named entity and phrasal concepts. The use of a hybrid approach to named
entity recognition using machine learning classifiers, cascades of finite-state automatons, and lexicons
makes it possible to label more than 80 types of domain independent named entities, including person,
organization, and various types of locations, quantities, numerical values, etc.
The finite-state automatons framework uses a pattern-based machine-learning approach and hand-coded
rules, which allow for a highly customizable and adaptable process for detecting domain relevant
concepts including signs, symptoms, disorder, disease, complication, functional status, advanced illness,
population, outcome, etc. To learn the lexicon and rules-based models for extracting CQM ontology
specific concepts, a set of 65 quality measures and 98 biomedical articles (PubMed Abstracts and
PubMed Central Full Articles) were manually annotated using the Brat rapid annotation tool [9], [10]
(Figure 2 and see appendix for the list of measures and articles). A random 80:20 split of the manually
annotated data was created for training and testing respectively.
Semantic Relation Identification
In the K-Extractor, semantic relations are instruments used to abstract underlying linguistic
relations between concepts. Semantic relations can occur within a word, between words, between phrases,
and between sentences. Because semantic relations provide connectivity between concepts, their
extraction from text is essential for the ultimate goal of machine text understanding. We use a fixed set of
26 relationships [11] (Table 3), which strike a good balance between too specific and too general. They
include the thematic roles proposed by Fillmore [12] and others [13], and the semantic roles in PropBank
[14], while also incorporating relationships outside of the verb-argument settings, which highlight key
interactions between entities, events, causes, time and space, and others.
Table 3. K-Extractor's 26 base semantic relations. The 26 relations in K-Extractor that are used to construct the higher level CQM relations.
K-Extractor uses a hybrid approach to semantic parsing. This hybrid approach includes machine
learning classifiers for argument pairs identified using syntactic patterns and filtered using extended
definitions for our semantic relationships, which describe the possible domain and range information for a
relation and impose these semantic restrictions on candidate arguments [11]. Additional modules with
specific relational targets are also used.
The example below depicts the conversion of text into a graph by automatically extracting the
base semantic relations listed in Table 3 using the K-Extractor:
“The cfr gene, originally identified in a bovine Staphylococcus sciuri isolate, was found to code for a RNA methyltransferase which modifies the adenine residue at position 2503 in the 23S rRNA and thereby confers resistance not only to oxazolidinones, but also to phenicols, lincosamides, pleuromutilins, and streptogramin A antibiotics.”
Figure 3. Example depicting the extraction of base semantic relation from text. How K-Extractor represents the semantic relations between the entities in the example sentence given above.
The base 26 semantic relations can capture the underlying linguistic relations between concepts in text. However, the base semantic relations do not directly map to the relations defined in domain ontology
such as the CQM ontology. K-Extractor provides an easily extensible framework to extract the domain
specific relations defined in the domain ontology. K-Extractor uses Semantic Calculus rules (Tatu &
Moldovan, 2006) to extract new types of semantic relations by defining how two or more base
semantic relations can be combined. The Semantic Calculus defines axioms for semantic relations R0 that may hold between two concepts c1 and c2, which are linked by two semantic relationships R1 and R2 (not necessarily distinct) that share a third concept c3 as a common argument. More formally:
For instance, the semantic calculus axiom:
where x is a population concept, z is a health status concept, and y is an experiencing related verb. This
axiom can be used to derive new semantic information IsMadeUpOf(patients, hypothyroidism) from a
sentence such as “The impact of yoga upon female patients suffering from hypothyroidism”.
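The composition described above, in which two relations R1(c1, c3) and R2(c3, c2) that share the argument c3 yield a derived relation R0(c1, c2), can be sketched in Python as follows. This is a minimal illustration only: the base relation names `BASE_REL_1` and `BASE_REL_2` are placeholders rather than entries from Table 3, the argument order is an assumption, and the concept-type constraints on x, y, and z are omitted for brevity.

```python
from typing import NamedTuple, Optional


class Rel(NamedTuple):
    name: str
    arg1: str
    arg2: str


class Axiom(NamedTuple):
    r1: str       # name of the first base relation, R1(c1, c3)
    r2: str       # name of the second base relation, R2(c3, c2)
    derived: str  # name of the derived relation, R0(c1, c2)


def apply_axiom(axiom: Axiom, a: Rel, b: Rel) -> Optional[Rel]:
    """Return R0(c1, c2) when a = R1(c1, c3) and b = R2(c3, c2) share c3."""
    if a.name == axiom.r1 and b.name == axiom.r2 and a.arg2 == b.arg1:
        return Rel(axiom.derived, a.arg1, b.arg2)
    return None


# Placeholder base relations linking "patients" and "hypothyroidism"
# through the experiencing verb "suffering".
axiom = Axiom("BASE_REL_1", "BASE_REL_2", "IsMadeUpOf")
r1 = Rel("BASE_REL_1", "patients", "suffering")
r2 = Rel("BASE_REL_2", "suffering", "hypothyroidism")

derived = apply_axiom(axiom, r1, r2)
print(derived)
```

A learned axiom set of the size reported below would typically be indexed by the pair of base relation names so that each candidate pair of relations is checked against only the axioms that could apply.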
To automatically learn the semantic calculus axioms required to extract CQM ontology specific
semantic relations, the same 65 quality measures and 98 biomedical articles were manually annotated
with CQM ontology specific relations. The semantic calculus axioms learning framework used a high
recall focus to automatically learn more than 20,000 axioms using the manually annotated examples.
Knowledge Structures
The CMS Sematrix NLP and knowledge base use a commonly accepted model for knowledge
representation known as the Semantic Web. Each piece of knowledge extracted from the text (and tables) in the literature by means of NLP can be stored in a standard and open format known as RDF. The RDF specification represents each statement or assertion as a common data structure, known colloquially as a
“triple”, that can be thought of as similar to the grammatical notion of “subject [entity] – verb
[relationship] – object [entity or value]”. Specific (e.g. “named”) entities are assigned to classes, or
categories, which may be defined in a logical domain ontology (see below). For example, the sentence
“John Smith suffers from hypertension” encountered in the text could be represented in pseudo-RDF as something like the example in Table 4.
Table 4. Example of an RDF triple. The RDF triple representation of the two entities and single semantic relation from the sentence “John Smith suffers from hypertension.”
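As a rough illustration of the subject, relationship, object structure, the sentence can be written out as triples in Python and serialized in an N-Triples-like form. The namespace `http://example.org/cqm#`, the predicate name `suffersFrom`, and the class assignments are assumptions made for this sketch and are not taken from Table 4.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


EX = "http://example.org/cqm#"  # hypothetical namespace for this example
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def uri(local_name: str) -> str:
    """Wrap a local name in the example namespace."""
    return f"<{EX}{local_name}>"


triples = [
    # The statement itself: subject - relationship - object
    Triple(uri("JohnSmith"), uri("suffersFrom"), uri("Hypertension")),
    # Class assignments for the two entities (illustrative classes)
    Triple(uri("JohnSmith"), RDF_TYPE, uri("Population")),
    Triple(uri("Hypertension"), RDF_TYPE, uri("HealthStatus")),
]


def to_ntriples(items) -> str:
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{t.subject} {t.predicate} {t.obj} ." for t in items)


print(to_ntriples(triples))
```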
The RDF triples and associated metadata extracted from each document are stored in a graph
database. Knowledge graphs can be constructed from the triples by connecting the like entities or more
generally, like concepts per the ontology. This allows for graphs that span specific mentions of an entity
within a document or can span documents if desired. For example, in Figure 4 we see the components of a health status dependent quality measure represented as a graph on the measure ontology. The NOT edges represent relationships that are disallowed between the two nodes in the graph.
Figure 4. The 4 components of a health status dependent measure. The Numerator, Denominator, Opportunity for Improvement, and Rationale components are represented as graphs made up of the five concepts and five semantic relations from the CQM ontology. Care Setting (Utilization-Change Concept) is an optional concept that denotes the setting where the patient experiences the Change Concept.
It is important to note that, in Figure 4, the Health Status that appears in the Numerator can be (and
often is) different from the Health Status that appears in the three other graphs. However, the Population, Change Concept, and Output are the same across all four component graphs.
To see this more concretely, we provide the following example in Table 5 from CMS Measures
Inventory Tool (CMIT) measure number 4 titled “Aspirin Prescribed at Discharge”.
Table 5. CQM Concepts Extracted from CMIT 4. Manually extracted CQM concepts from CMS Measures Inventory Tool measure number 4.
Matching Publications to Measures
Once the knowledge structures have been extracted from the journal publications, the same
procedure is applied to the measure description text. Five knowledge structures (i.e. Keywords,
Biomedical Concepts, Biomedical Concept Expansions, Semantic Relations, and CQM Model Semantic Relations) extracted by the K-Extractor from the measure descriptions are then used to create a semantic
query. The semantic query consists of 5 different fields/components (one per knowledge structure) and
each knowledge structure’s field is assigned an importance weight. The process to compute the
importance weight is introduced in the Optimizing the Component Weights section. The semantic query
is used to match against the same 5 components extracted from the documents indexed in the publication
database with the goal of returning publications that contain relevant information to the measure
description. Each publication is then returned with a score denoting its relevancy to the given measure. The overall score utilized by CMS Sematrix is the Lucene Practical Scoring Function [16]:
where q is a search query, created from processing some inputs (for example, one can be created from a measure XML), d is a document in the search index, f is a field or component of the score (see below), t is
a term in the field, weight(f) is a weight given to a particular field to boost its importance in the overall
score, coord(q,d,f) is used to reward documents that contain a higher percentage of the query terms,
fieldNorm(f) is the inverse square root of the number of terms in the field, value(t) is the value that a term
has in the query (typically the number of occurrences of this term in the query input), tf(t,d) is the term
frequency value for the term t in document d, and idf(t) is the inverse document frequency for term t
among all documents (the logarithm of the number of documents in the index, divided by the number of
documents that contain the term). The full details of the terms making up the Lucene Practical Scoring
Function can be found in their documentation [16]. CMS Sematrix uses 5 different components (detailed below) when computing the document score.
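To make the roles of the individual terms concrete, the following Python sketch computes a weighted multi-field score from the quantities defined above. How the factors are combined here (a weighted sum over fields of coord, fieldNorm, and a sum of value, tf, and idf products) is an assumption made for illustration; it is not the exact function used by CMS Sematrix or by Lucene, and the data layout and function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, docs, field):
    """Logarithm of (number of documents / number of documents containing the term)."""
    df = sum(1 for d in docs if term in d.get(field, []))
    return math.log(len(docs) / df) if df else 0.0


def field_score(query_terms, doc, docs, field):
    doc_terms = doc.get(field, [])
    if not doc_terms or not query_terms:
        return 0.0
    tf = Counter(doc_terms)        # tf(t, d): occurrences of t in the document field
    value = Counter(query_terms)   # value(t): occurrences of t in the query input
    matched = [t for t in value if tf[t] > 0]
    coord = len(matched) / len(value)             # share of query terms found
    field_norm = 1.0 / math.sqrt(len(doc_terms))  # inverse square root of field length
    total = sum(value[t] * tf[t] * idf(t, docs, field) for t in matched)
    return coord * field_norm * total


def score(query, doc, docs, weights):
    """Weighted sum of per-field scores; weights maps field name to weight(f)."""
    return sum(w * field_score(query.get(f, []), doc, docs, f)
               for f, w in weights.items())


docs = [
    {"keywords": ["aspirin", "discharge", "ami"]},
    {"keywords": ["yoga", "hypothyroidism"]},
    {"keywords": ["aspirin", "stroke"]},
]
query = {"keywords": ["aspirin", "discharge"]}
weights = {"keywords": 1.0}

scores = [score(query, d, docs, weights) for d in docs]
print(scores)
```

In this toy index the first document matches both query terms and so ranks above the third, which matches only one; the second matches none and scores zero.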
As currently implemented, CMS Sematrix searches either the abstract or article component of the
content management system for the most relevant documents associated with a single measure. One
measure is searched at a time. The system returns the number of documents requested by the user with
the top 30 highest overall relevance scores for that measure. The system is not designed to report the five
individual scores utilized in the overall score function. It returns the overall, measure-specific relevance
score associated with each document searched.
Individual Score Components
Each of the 5 component scores is defined by various levels of the ontology utilized in CMS
Sematrix’s natural language processing:
Keywords: the words of an article/measure after lemmatization and stopword removal. Lemmatization is the process of normalizing/generalizing morphological variations of a word to its base form. Example: adults lemmatized to adult, diagnosed lemmatized to diagnose, scanning lemmatized to scan, etc. Stopword removal is the process of removing non-content words such as a, the, and, on, etc.
Biomedical Concepts: a concept is a thing mentioned in an article/measure that can span from 1 to n words. Example: high blood pressure, water pressure, video recording, Mediterranean diet, etc. A biomedical concept is a concept that is valid for a particular biomedical domain. Example: for the CMS project, we include all biomedical concepts such as high blood pressure, Mediterranean diet, etc.
Biomedical Concept Expansions: concepts similar to the biomedical concepts present in an article/measure. Example: occurrence of high blood pressure in an article/measure will result in concepts such as hypertension, blood pressure, etc. being added (with an appropriate similarity weight) into this field for matching purposes.
Semantic Relations: base semantic relations (concept1, relation, concept2) that occur between biomedical concepts present in an article/measure. Example: hypertension isCauseOf headache, 120/80 isValueOf blood pressure, etc.
CQM Model Semantic Relations: CQM ontology relations that occur between biomedical concepts present in an article/measure. Example: asian isAPartOf patients, adult experiences mediterranean diet, etc. The full list of CQM model semantic relations can be found in Table 2.
Optimizing the Component Weights
To ensure that articles relevant to a given measure are scored higher in the measure search results, the weights for each of the 5 components are optimized in order to maximize a modified mean reciprocal rank (MRR) score [17]. MRR is a statistic for evaluating any process aimed at selecting a best option by
ranking the options, ordered by a score. It is calculated as the average of the inverse rank of the best
option over multiple executions of the process. The process being evaluated here has multiple correct
options. Due to this difference, a modification of the MRR is proposed:
where $N_i$ denotes the total number of cited articles associated with the measure of the $i$-th search, $n_i$ denotes the number of cited articles in the measure description returned by the $i$-th search, and $r_{ij}$ denotes the rank (determined by ranking the scores from highest to lowest) of the $j$-th cited article returned by the $i$-th search.
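One way to read these definitions is sketched below in Python. The sketch assumes that the modified MRR sums the reciprocal ranks of all cited articles returned by a search, divides by the total number of cited articles for that measure, and averages the result over searches; this form is an assumption inferred from the definitions, and the function name and input layout are hypothetical.

```python
def modified_mrr(searches):
    """Average, over searches, of (sum of reciprocal ranks) / (total cited articles).

    searches: list of (total_cited, ranks) pairs, one per measure search.
      total_cited: number of cited articles associated with the measure.
      ranks: 1-based ranks of the cited articles that the search returned.
    """
    per_search = []
    for total_cited, ranks in searches:
        if total_cited == 0:
            continue  # a measure with no cited articles contributes nothing
        per_search.append(sum(1.0 / r for r in ranks) / total_cited)
    return sum(per_search) / len(per_search) if per_search else 0.0


# Search 1: 2 cited articles, both returned, at ranks 1 and 4.
# Search 2: 4 cited articles, only one returned, at rank 2.
example = [(2, [1, 4]), (4, [2])]
print(modified_mrr(example))
```

Under this reading, a cited article that is not returned at all contributes zero, so the score rewards both retrieving the cited articles and ranking them highly.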
In order to efficiently determine the optimal set of weights for the 5 components, the Lucene
practical scoring function was re-formulated in a way that allows the scores for each component to be computed independently from the field/component weights. This means that the entire search query does not need to be
re-run every time a new set of weights is tested. Briefly, the score for each individual component can be
written in terms of weight independent Numerator and Denominator parts so that the total score is
computed as follows:
where q is the current query, d is the given document, weight(f) is the weight for the current field
(component), and Numerator(q,d,f) and Denominator(q,f) are combinations of Solr functions used by
the Lucene Practical Scoring Function given above.
To generate the dataset for the weight optimization, queries were run for the 65 measures used in the
NLP training. Five separate queries were run for each measure, in each of which the weight of only 1 of the 5
components was set to one while the rest were set to zero. For each measure and each component, the Numerator and Denominator parts were computed for each document returned by the query and the top 1,000 documents
with the highest Numerator parts were returned (since the Denominator part is independent of the
document). Thus, for a given set of weights, the Lucene Practical Scoring Function can be computed
without re-running the search query to obtain the Numerator and Denominator parts.
One issue that arises when computing the overall score is that, for a given measure, an article can
appear in the top 1,000 results for one component and may not appear in the results for the other
components. That is, a given document from a particular measure query may not have Numerator and
Denominator parts for all 5 fields/components. When computing the combined score across the 5
components for a given document and measure query, if that document is missing a Numerator and
Denominator part for a given field, then the Numerator for that document is set to the minimum
Numerator value found in the 1,000 search results for the field for the given measure. Since the
Denominator is independent of the document, it is set to the same Denominator value as all other 1,000 search results for the field and the given measure.
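The recombination and the handling of missing parts can be sketched in Python as follows. The sketch assumes that the total score is the weighted sum over fields of Numerator divided by Denominator, which is an assumption about the form of the re-formulated score; the function name and the dictionary layout of the cached parts are hypothetical.

```python
def combined_scores(parts, denominators, weights):
    """Recombine cached per-field parts into one score per document.

    parts: {field: {doc_id: numerator}} cached top results for each field.
    denominators: {field: denominator}, independent of the document.
    weights: {field: weight} for the weight combination being tested.
    """
    all_docs = set()
    for numerators in parts.values():
        all_docs.update(numerators)

    scores = {}
    for doc_id in all_docs:
        total = 0.0
        for f, w in weights.items():
            numerators = parts.get(f, {})
            if not numerators:
                continue  # no cached results for this field
            # A document missing from a field's results receives the minimum
            # Numerator found among that field's cached results.
            num = numerators.get(doc_id, min(numerators.values()))
            total += w * num / denominators[f]
        scores[doc_id] = total
    return scores


parts = {
    "keywords": {"a": 4.0, "b": 2.0},
    "concepts": {"a": 3.0, "c": 9.0},
}
denominators = {"keywords": 2.0, "concepts": 3.0}
weights = {"keywords": 0.5, "concepts": 1.0}

result = combined_scores(parts, denominators, weights)
print(sorted(result.items()))
```

Because only `weights` changes between trials, the cached `parts` and `denominators` can be reused for every weight combination without querying the index again.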
To find the optimal component weights, a grid search was performed in which the weight for each of the five components was varied between 0 and 1 in 0.1 increments, resulting in $11^5$ (= 161,051) possible weight
combinations. The weight combination that maximized the average MRR across the 63 measures was
selected. We obtained the best MRR of 0.1098 (for 65 measures) for the following components’ weight combination: $W_{\mathrm{Keywords}}=0.1$, $W_{\mathrm{Concepts}}=0.3$, $W_{\mathrm{Expansion}}=0.2$, $W_{\mathrm{Relation}}=1.0$, $W_{\mathrm{CQM\_Relation}}=0.3$.
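A grid search of this kind can be sketched in a few lines of Python. The field names and the `objective` callable are placeholders: in the setting described above, the objective would be the average modified MRR obtained with a given weight combination, whereas the toy objective used here exists only to make the example self-contained.

```python
from itertools import product

FIELDS = ("keywords", "concepts", "expansion", "relation", "cqm_relation")


def weight_grid(steps=10):
    """Yield every weight combination with values 0, 1/steps, ..., 1 per field."""
    values = [i / steps for i in range(steps + 1)]
    for combo in product(values, repeat=len(FIELDS)):
        yield dict(zip(FIELDS, combo))


def grid_search(objective, steps=10):
    """Return the weight combination with the highest objective value."""
    best_weights, best_value = None, float("-inf")
    for weights in weight_grid(steps):
        value = objective(weights)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value


# Toy objective: highest when the weights equal a chosen target combination.
target = dict(zip(FIELDS, (0.0, 0.5, 1.0, 0.5, 0.0)))
best, value = grid_search(
    lambda w: -sum((w[f] - target[f]) ** 2 for f in FIELDS), steps=2)
print(best, value)
```

With the default `steps=10`, each field takes 11 values, so the generator yields 11 to the power 5 combinations, matching the count given in the text.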
Identifying Relevant Measure Concept Graphs
The Lucene Practical Scoring Function used for scoring a document’s relevancy to a measure is
focused around discovering specific terms and relationships within a document. That is, it does not
consider the specific structure of the measure concept graphs shown in Figure 4. To assess the degree to
which the associated literature provides evidence for the given measure, we developed a procedure for
identifying specific, relevant measure concepts in literature associated with each measure by the monthly
environmental scan. The goal of this was twofold: (1) to provide a separate verification that the
documents returned by the Lucene Practical Scoring Function contain information relevant to a given
measure; and (2) to allow measure developers to more quickly and efficiently review environmental scan results.
First, the RDF triples for a given article or abstract are retrieved. As mentioned above, each triple contains the subject text, relationship text, and object text, along with attributes such as: the standardized
subject and object text, called subject and object alias text, which collapse instances like hemorrhagic
stroke and brain hemorrhage into a single concept by mapping the subject and object text instances to
Unified Medical Language System [18] and K-Extractor lexicons (Tatu et al., 2016); the identifier of the
document from which the triple was extracted; a measure of the system’s confidence in the correctness of
the assigned relationship expressed by the triple; and the entity type (class). Below is an example of a
triple returned by CMS Sematrix:
Next, the triples are converted to a graph structure where the nodes are the instances of the
concepts extracted from the document (for example, a Health Status node could be Heart Failure) and the
edges between the nodes are the semantic relations. All the triples in the CQM ontology from a given
document are combined to form a large “document graph.”
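The conversion from triples to a document graph can be sketched in Python as below. The dictionary keys used for each triple and the relation name `related_to` are hypothetical and chosen for this example; they do not reflect the actual field names or relations returned by CMS Sematrix.

```python
from collections import defaultdict


def build_document_graph(triples):
    """Build a typed, labeled graph from a document's triples.

    Returns (nodes, edges) where nodes maps node text to its concept type
    and edges maps a source node to a set of (relation, target) pairs.
    """
    nodes = {}
    edges = defaultdict(set)
    for t in triples:
        nodes[t["subject"]] = t["subject_type"]
        nodes[t["object"]] = t["object_type"]
        edges[t["subject"]].add((t["relation"], t["object"]))
    return nodes, dict(edges)


triples = [
    {"subject": "patients", "subject_type": "Population",
     "relation": "experiences",
     "object": "aspirin", "object_type": "Change Concept"},
    {"subject": "patients", "subject_type": "Population",
     "relation": "related_to",  # placeholder relation name
     "object": "Heart Failure", "object_type": "Health Status"},
]

nodes, edges = build_document_graph(triples)
print(nodes)
print(edges)
```

Because nodes are keyed by their text, two triples that mention the same entity share one node, which is what connects individual triples into a single graph.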
Creating Document Graphs
When constructing the document graphs, it was discovered that there are often instances that
appear in the triples that should be merged together. For example, the acronym AMI and the phrase
Acute Myocardial Infarction both appear and would be treated as separate nodes in the graph. Not merging
them leaves a more disconnected graph, as edges that should be associated with just Acute Myocardial
Infarction get split between two different nodes. To retrieve the full phrase, candidate acronyms, defined as any
strings with a small number of characters (<=5), are looked up in the Text2Knowledge Acronym Finder
database [19]. Additionally, we found that Population and Output tended to vary quite a bit within a
given document which also resulted in a disconnected document graph. To remedy this, we converted the text of any tagged Population and Output entities to generic ‘Population’ and ‘Output’.
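The two normalization steps described in this section can be sketched in Python as follows. The in-memory `ACRONYMS` dictionary is a hypothetical stand-in for the acronym database lookup, and the function name is an assumption made for this example.

```python
# Hypothetical stand-in for an acronym database lookup.
ACRONYMS = {"AMI": "Acute Myocardial Infarction"}

# Entity types whose text is replaced by a generic label.
GENERIC_TYPES = {"Population", "Output"}


def normalize_node(text, entity_type, acronyms=ACRONYMS, max_len=5):
    """Return the node label to use in the document graph."""
    if entity_type in GENERIC_TYPES:
        return entity_type
    stripped = text.strip()
    # Short strings (<= max_len characters) are treated as candidate acronyms.
    if len(stripped) <= max_len:
        return acronyms.get(stripped.upper(), stripped)
    return stripped


print(normalize_node("AMI", "Health Status"))
print(normalize_node("female patients", "Population"))
```

Applying this function to the subject and object of every triple before building the graph means that edges attached to AMI and to Acute Myocardial Infarction end up on the same node.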
Finding Measure Concept Graphs
Next, subgraph matching algorithms are used to identify subgraph-patterns consistent with the
“concept maps” in Figure 4, which represent the four basic elements used in the definition of a measure. Essentially, these algorithms enumerate all potential subgraphs of the large document graph in
an efficient manner and check them against the aggregate graph pattern to determine whether they are
isomorphic (i.e., they have the same node types and relations between them). To perform the subgraph
matching, we use the R programming language implementation of the VF2 subgraph isomorphism
algorithm [20]. Only the first three concept maps have been used in the graph-matching analysis so far, as the Rationale requires a more sophisticated algorithm than subgraph matching.
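To make the idea of typed subgraph matching concrete, the following Python sketch uses a plain backtracking search. It is not VF2 and not the R implementation used in this work; it is a simpler substitute that only checks that every pattern edge is present in the document graph with matching node types. The pattern, the relation label `AFFECTS`, and the example nodes are hypothetical.

```python
def find_matches(pattern_nodes, pattern_edges, doc_nodes, doc_edges):
    """Yield mappings from pattern node ids to document node names.

    pattern_nodes / doc_nodes: dict of node -> entity type.
    pattern_edges / doc_edges: set of (source, relation, target).
    """
    order = list(pattern_nodes)

    def consistent(mapping):
        # Each pattern edge with both endpoints mapped must exist in the document.
        for src, rel, dst in pattern_edges:
            if src in mapping and dst in mapping:
                if (mapping[src], rel, mapping[dst]) not in doc_edges:
                    return False
        return True

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        p = order[len(mapping)]
        for name, node_type in doc_nodes.items():
            # Node types must agree and each document node is used at most once.
            if node_type != pattern_nodes[p] or name in mapping.values():
                continue
            mapping[p] = name
            if consistent(mapping):
                yield from extend(mapping)
            del mapping[p]

    yield from extend({})

# Hypothetical two-node pattern and a small document graph.
pattern_nodes = {"cc": "Change Concept", "hs": "Health Status"}
pattern_edges = {("cc", "AFFECTS", "hs")}
doc_nodes = {
    "aspirin": "Change Concept",
    "statin": "Change Concept",
    "Acute Myocardial Infarction": "Health Status",
}
doc_edges = {("aspirin", "AFFECTS", "Acute Myocardial Infarction")}

matches = list(find_matches(pattern_nodes, pattern_edges, doc_nodes, doc_edges))
```

Here only the aspirin node is returned for the Change Concept slot, because the statin node has no edge to the Health Status node. A production system would prefer an established VF2 implementation, which prunes the search far more aggressively.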
Determining Relevancy of Measure Concept Graphs Found in Documents
The subgraph matching algorithm discussed in the previous section only returns subgraphs that
match the pattern of the associated measure concept graph. It does not reveal anything about the
relevancy of the instances of Population, Health Status, Change Concept, Output, or Utilization that
appear in the subgraph to those of the measure concept graph constructed from the actual measure
description. Thus, one could potentially be faced with hundreds of subgraphs that all match the concept
graph patterns, but a number of those subgraphs can contain instances of Population, Health Status,
Change Concept, Output, or Utilization that are not actually relevant to the current measure. Thus, we
then developed a procedure to filter out the non-relevant subgraphs so as to verify that CMS Sematrix is returning documents that contain information that is relevant to the measure being searched.
First, the concept graphs associated with each measure needed to be extracted from the
measure descriptions in order to have a “gold standard” to which potential subgraphs are compared.
Initially, this gold standard data set was created by manually reviewing the descriptions for the 65 CQMs to extract the necessary triples to construct the measure concept graphs, but one measure description did not contain enough information for constructing the measure concept graphs. In particular, the instances
of Population, Health Status, Change Concept, Output, and Utilization were determined from the measure descriptions (e.g., Table 5).
Each potential measure concept graph that is found by the subgraph matching algorithm in a
document is then compared to the manually derived measure concept graphs from the measure description in
order to determine if there is a match. The methodology for determining a match is as follows: for a
given Numerator, Denominator, or Opportunity graph (Figure 4),
A given document graph is considered a match only if every node in the graph is found to be a match to
the corresponding nodes of the manually derived concept graph from the measure description.
Lastly, we say that a document returned by CMS Sematrix for a given measure is relevant if and
only if it contains at least one (Numerator, Denominator, or Opportunity) graph that matches the
corresponding manually derived measure concept graph. However, in the Results, we also look at the more
stringent case where a document is considered relevant if and only if it contains matching Numerator,
Denominator, and Opportunity graphs.
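The two relevancy definitions can be expressed as a short Python sketch. Several simplifications are assumptions made here, not part of the described method: each graph is reduced to one instance per entity type, and the node-level comparison (`exact_match`, case-insensitive string equality) is a placeholder for whatever node-matching criterion is actually applied.

```python
GRAPH_KINDS = ("Numerator", "Denominator", "Opportunity")

def exact_match(candidate_text, gold_text):
    # Placeholder node comparison (an assumption for this sketch).
    return candidate_text.strip().lower() == gold_text.strip().lower()

def graph_matches(candidate, gold, node_match=exact_match):
    """A candidate graph matches only if every node matches its gold counterpart."""
    return all(
        etype in candidate and node_match(candidate[etype], text)
        for etype, text in gold.items()
    )

def classify_document(candidates, gold_graphs, node_match=exact_match):
    """Return (relevant, stringent_relevant) for one document.

    candidates: dict of graph kind -> list of candidate graphs found in the document.
    gold_graphs: dict of graph kind -> manually derived measure concept graph.
    """
    matched = {
        kind: any(graph_matches(c, gold_graphs[kind], node_match)
                  for c in candidates.get(kind, []))
        for kind in GRAPH_KINDS
    }
    relevant = any(matched.values())    # at least one graph kind matches
    stringent = all(matched.values())   # all three graph kinds match
    return relevant, stringent

# Hypothetical gold graphs and candidates for illustration.
gold_graphs = {
    "Numerator": {"Change Concept": "aspirin", "Health Status": "acute myocardial infarction"},
    "Denominator": {"Population": "Population", "Health Status": "acute myocardial infarction"},
    "Opportunity": {"Change Concept": "aspirin", "Output": "Output"},
}
candidates = {
    "Numerator": [{"Change Concept": "Aspirin", "Health Status": "Acute Myocardial Infarction"}],
    "Denominator": [{"Population": "Population", "Health Status": "stroke"}],
}
relevant, stringent = classify_document(candidates, gold_graphs)
```

In this example the document is relevant (its Numerator graph matches) but not stringent relevant, since the Denominator graph differs and no Opportunity graph was found.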
Validating NLP System
In all the experiments listed in this section, 80% of the annotated data was randomly selected for training the CQM concept and semantic relation extraction modules. The remaining 20% of the annotated examples were used for testing the quality of the trained modules and computing evaluation results.
Concept lexicons and rules were learnt using the manually annotated examples. The training was focused on maximizing the recall (the fraction of the correct concepts that are successfully identified).
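For illustration, a random 80/20 split of annotated examples can be sketched in Python as below. The function name, the fixed seed, and the use of integer stand-ins for annotated examples are assumptions of this sketch; the actual splitting procedure is not specified beyond being random.

```python
import random

def split_annotations(examples, train_fraction=0.8, seed=0):
    """Randomly split annotated examples into training and test sets."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder examples: eight go to training, two to testing.
train, test = split_annotations(range(10))
```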
Table 6 provides a summary of the number of CQM concept instance examples manually annotated in the
test set of quality measures and biomedical articles, and the recall results obtained by the trained NLP
models. The models were trained and tested on either only the annotated quality measure data or only the annotated biomedical article data.
Table 6. CQM Concept Extraction Results. K-Extractor’s performance on extracting the CQM
concepts from either the measure text (Quality Measures columns) or the article text (Biomedical Articles columns). In each case, K-Extractor was trained and tested either only on the measure text, or only on the article text.
For the final NLP model used in the CMS Sematrix systems, all of the annotated data was used for
training.
Table 7 provides a summary of the number of CQM semantic relation instance examples manually
annotated in the quality measures and biomedical articles, and the recall and precision (the fraction of
identified relations that are correct) results obtained by the trained NLP models. The models were trained
and tested on either only the annotated quality measure data, or the annotated measure and biomedical
article data. As with the concepts, the final semantic relations extraction model used in the CMS Sematrix systems used all of the annotated data for training.
Table 7. CQM Semantic Relation Extraction Results. K-Extractor’s performance on extracting the CQM relations from either the measure text, or the measure and article text. In each case, K-Extractor was trained and tested either only on the measure text, or on the measure and article text.
The CMS Sematrix content management system includes abstracts from PubMed from 2007 to
2018 and full text articles from PubMed Central over the same time period. Additionally, licenses for all
articles cited during the development of the core and high impact measures that are NQF endorsed were
obtained, and these articles are included in the content management system. Currently, the content
management system includes approximately 8.5 million abstracts and over 1.9 million full text articles.
Validating Results Returned by Sematrix
As mentioned in Methods, a document returned by CMS Sematrix for a given measure is
considered relevant if and only if it contains at least one measure concept graph (Numerator,
Denominator, or Opportunity) that matches the corresponding manually derived measure concept graph;
and is stringent relevant if and only if it contains matching Numerator, Denominator, and Opportunity
graphs. To validate that the documents returned by CMS Sematrix contain information relevant to the
given quality measure, we examined the associated top 30 articles for 9 randomly selected measures from
the set of 65 CQMs. The 9 randomly selected measures are CMIT 4, 254, 573, 888, 1014, 1241, 1765,
1898, and 2552 (see the corresponding measure descriptions in Appendix). Each article was manually
reviewed to determine if it contained information relevant to the associated quality measure.
The results of the manual review were then compared to the results obtained using our automated method to determine relevant and stringent relevant documents. For the comparison, the results from the
manual review were considered to be the “true” relevant documents. Figure 5 shows boxplots of the
precision (the fraction of automatically identified relevant documents that were also identified as relevant
by the manual procedure) and recall (the fraction of manually identified relevant documents that were
also identified as relevant by the automated procedure) scores for relevant and stringent relevant results
aggregated across the 9 measures. Overall, the average precision and recall are 84% and 88%,
respectively, both of which indicate that our automated approach can successfully determine relevant
documents. In addition, the stringent relevant approach slightly increases the average precision
(85%) but causes a large drop in the average recall (56%), which indicates that the relevant approach
better aligns with the results of the manual review.
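The precision and recall definitions used in this comparison reduce to simple set arithmetic. The Python sketch below treats the manually reviewed set as ground truth; the document identifiers are hypothetical and the numbers are not results from this study.

```python
def precision_recall(automated, manual):
    """Compare automatically flagged documents against manual review.

    precision: fraction of automatically relevant documents also judged
               relevant by manual review.
    recall:    fraction of manually relevant documents also flagged by
               the automated procedure.
    """
    automated, manual = set(automated), set(manual)
    true_positives = len(automated & manual)
    precision = true_positives / len(automated) if automated else 0.0
    recall = true_positives / len(manual) if manual else 0.0
    return precision, recall

# Hypothetical document ids for one measure.
automated = {"d1", "d2", "d3", "d4"}
manual = {"d2", "d3", "d4", "d5", "d6"}
precision, recall = precision_recall(automated, manual)
```

In this toy case three documents are shared, giving a precision of 3/4 and a recall of 3/5. Averaging these per-measure scores across the reviewed measures yields summary figures of the kind reported above.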
Figure 5. Comparison of the automated method for determining relevant documents against the
manually determined relevant documents for the set of 9 random measures. Boxplots showing the precision and recall scores for the automated relevancy method using either the relevant (left) or stringent relevant (right) criteria. The thick horizontal line denotes the median, the lower and upper hinges correspond to the first and third quartiles, the whiskers extend from the hinge to the most extreme value no further than 1.5 × interquartile range from the hinge, and the dots beyond the whiskers are outliers.
To provide an example of the information that our automated relevancy procedure extracts from
documents, we reviewed the relevant documents returned for measure CMIT 4 (described in Table 5).
Table 8 shows an example of the relevant Numerator, Denominator, and Opportunity graphs extracted
from the article PMC-4631331 titled “Acute Myocardial Infarction Risk in Patients with Coronary Artery Disease Doubled after Upper Gastrointestinal Tract Bleeding: A Nationwide Nested Case-Control Study”
which was deemed relevant by our automated procedure. The Denominator and Numerator Health
Statuses in the article (acute myocardial infarction) are exactly the same as those extracted from the
measure description (see Table 5), while the article’s Change Concept (antiplatelet therapy) is closely
related to the Change Concept extracted from the measure description (aspirin). There are several
Outputs found in the article graphs as the automated search procedure treats Outputs as a single generic
value (see Methods). Nonetheless, the Output extracted from the measure description (reduce) is found
among the list of outputs in the article graphs. Thus, this compact representation very clearly shows
that this article contains information relevant to quality measure CMIT 4 and can be utilized by measure developers to quickly summarize and parse the scientific literature.
Table 8. The relevant Numerator, Denominator, and Opportunity graphs extracted from the article titled “Acute Myocardial Infarction Risk in Patients with Coronary Artery Disease Doubled after Upper Gastrointestinal Tract Bleeding: A Nationwide Nested Case-Control Study” for measure CMIT 4.
Next, we used the results from CMIT 4 to investigate the discrepancy between the documents that were determined relevant via manual review and those deemed relevant by our automated procedure.
Table 9 provides two examples of returned documents for measure CMIT 4 that had different relevancy results from the two different methods.
Table 9. Examples of the returned documents for measure CMIT 4 that had different relevancy results for the automated method and the manual examination.
For PMC-539261, although the term “aspirin” is used once in the discussion section, in context
the term does not significantly contribute to evidence supporting the objective of the article. The term
“aspirin” may not be a suitable Change Concept when considering the article as a whole; rather, “standard
optimal coronary care” is the appropriate Change Concept, as stated in the abstract. On the other hand,
although PMC-5862020 is not strictly focused on acute myocardial infarction (AMI) as a Health Status,
the article demonstrates important correlations in approaches to therapy between AMI and Chronic
Obstructive Pulmonary Disease (COPD) that may be useful to providers, such as the use of antiplatelet
therapy (i.e., the focus of CMIT 4). To emphasize this point, the article presents evidence in a tabular
format that COPD is an important factor in whether AMI patients receive aspirin at discharge. However,
the table uses complex formatting (i.e., blank cells; multiple lines of data per cell) that may not be easily
processed by Sematrix. Improvement in the way Sematrix digests tables, particularly tables with complex formatting, should improve relevancy scores for articles that use tables to present findings.
Lastly, we looked at the number of articles that are deemed relevant and stringent relevant by our
automated procedure for the top-30 documents returned by CMS Sematrix for each of the 65 measures
except for CMIT 967 (the measure description of CMIT 967 did not include enough information for
manually annotating the gold standard graphs). The results are shown in Figure 6. We found on average
roughly 72% of the articles returned by CMS Sematrix for a given measure contain information relevant
to the measure description (i.e., ~21 out of the 30 returned documents). However, there were a few
measures where CMS Sematrix did not return any relevant documents. For example, CMIT 284 did not
appear to have any relevant documents according to our automated procedure. This is most likely due to
the fact that the measure description was updated after we extracted the “gold standard” measure concept
graphs (see Methods). We also found that CMS Sematrix had relatively poor performance for CMIT 78,
80, 86, and 89 in terms of number of relevant documents returned, which is likely caused by the fact that
their measure descriptions did not provide enough information to extract the precise measure concepts
(e.g., the Change Concept was not specified). Thus, a very general change concept (e.g., quality
improvement) was inferred making it difficult to match relevant documents.
Figure 6. Relevancy results across the entire set of 65 CQMs using the automated method. The bar plot shows the number of documents (out of the 30 returned search results) determined relevant (red) or stringent relevant (red) by our automated method.
To effectively evaluate the quality of health care, the developers of clinical quality measures face
the arduous task of scanning the biomedical literature each month in order to ensure that the evidence
supporting each of their roughly 2,000 CQMs is timely and complete. In this work, we have detailed our tool, CMS Sematrix, which is aimed at reducing the burden placed on measure developers by effectively automating the knowledge discovery process of the monthly scans. CMS Sematrix contains three major
components: (1) A quality measure ontology to describe high-level knowledge constructs contained in
CQM; (2) an NLP system to extract concepts and relations that correspond to the ontology from text; and (3) a graphical database to store the concepts and relations extracted from text as RDF triples that can be
queried to deduce measure components within documents. We have shown that the NLP component of
CMS Sematrix was able to correctly identify CQM concepts with an average recall score of 87% for
measure descriptions and 86% for articles. In addition, CMS Sematrix achieved overall precision and
recall scores of 84% and 62% when extracting concept relations. We then conducted an environmental
scan of the PubMed and PubMed Central abstracts and articles using a set of 65 CQMs. For the 9
measures selected for manual review, our automated procedure for determining relevant documents
obtained average precision and recall scores of 84% and 88%. Running this procedure on the full set of
65 CQMs, we found that on average roughly 72% of the articles returned by CMS Sematrix for a given
measure contain information relevant to the measure description using our June 2018 environmental scan data.
CMS Sematrix is able to identify articles published in the clinical and health services literature
that contain information relevant to a given CQM. In practice, CMS Sematrix can reduce the time-
consuming burden of the CMS monthly environmental scans and allow measure developers to quickly
and accurately design CQMs to track outcomes in order to improve the national healthcare system.
Support for this work was provided in part by the Centers for Medicare and Medicaid Services under task order HHSM-500-T0001 and contract HHSM-500-2013-13005I and Battelle Memorial Institute.
[1] L. Abrahamyan, N. Boom, L. R. Donovan, J. V. Tu, and Canadian Cardiovascular Society Quality Indicators Steering Committee, “An international environmental scan of quality indicators for cardiovascular care,” Canadian Journal of Cardiology, vol. 28, no. 1, pp. 110–118, 2012.
[2] “IMPACT Act of 2014 (2014 - H.R. 4994),” GovTrack.us. [Online]. Available: https://www.govtrack.us/congress/bills/113/hr4994. [Accessed: 29-Aug-2018].
[3] “Medicare Access and CHIP Reauthorization Act of 2015 (2015 - H.R. 2),” GovTrack.us. [Online]. Available: https://www.govtrack.us/congress/bills/114/hr2. [Accessed: 29-Aug-2018].
[4] “21st Century Cures Act (2016; 114th Congress H.R. 34) - GovTrack.us.” [Online]. Available: https://www.govtrack.us/congress/bills/114/hr34. [Accessed: 29-Aug-2018].
[5] G. A. Cuckler et al., “National Health Expenditure Projections, 2017–26: Despite Uncertainty, Fundamentals Primarily Drive Spending Growth,” Health Affairs, vol. 37, no. 3, pp. 482–492, Feb. 2018.
[6] “MEDLINE®: Description of the Database.” [Online]. Available: https://www.nlm.nih.gov/bsd/medline.html. [Accessed: 29-Aug-2018].
[7] “RDF - Semantic Web Standards.” [Online]. Available: https://www.w3.org/RDF/. [Accessed: 29-Aug-2018].
[8] M. Tatu, M. Balakrishna, S. Werner, T. Erekhinskaya, and D. Moldovan, “Automatic Extraction of Actionable Knowledge,” in 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), 2016, pp. 396–399.
[9] “brat rapid annotation tool.” [Online]. Available: http://brat.nlplab.org/. [Accessed: 29-Aug-2018].
[10] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii, “BRAT: a web-based tool for NLP-assisted text annotation,” in Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107.
[11] D. Moldovan and E. Blanco, “Polaris: Lymba’s Semantic Parser,” Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), 2012.
[12] C. J. Fillmore, “Types of Lexical Information,” Studies in Syntax and Semantics, pp. 109–137, 1969.
[13] P. C. Horgan, The Cambridge Encyclopaedia of the Language Sciences. Cambridge University Press, 2010.
[14] M. Palmer, D. Gildea, and P. Kingsbury, “The Proposition Bank: An Annotated Corpus of Semantic Roles,” Computational Linguistics, vol. 31, no. 1, pp. 71–106, Mar. 2005.
[15] M. Tatu and D. Moldovan, “A Logic-based Semantic Approach to Recognizing Textual Entailment,” in Proceedings of the COLING/ACL on Main Conference Poster Sessions, Stroudsburg, PA, USA, 2006, pp. 819–826.
[16] “TFIDFSimilarity (Lucene 4.0.0 API).” [Online]. Available: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html. [Accessed: 29-Aug-2018].
[17] M. Tatu, M. Balakrishna, S. Werner, T. Erekhinskaya, and D. Moldovan, “A Semantic Question Answering Framework for Large Data Sets,” Open Journal of Semantic Web (OJSW), vol. 3, no. 1, pp. 16–31, 2016.
[18] O. Bodenreider, “The Unified Medical Language System (UMLS): integrating biomedical terminology,” Nucleic Acids Res, vol. 32, no. suppl_1, pp. D267–D270, Jan. 2004.
[19] “T2K Medical Acronym Finder - free biological/medical acronym database.” [Online]. Available: http://www.bioinformatics.org/textknowledge/acronym.php. [Accessed: 29-Aug-2018].
[20] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub)graph isomorphism algorithm for matching large graphs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 10, pp. 1367–1372, Oct. 2004.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
Table A.1 shows the CMIT ID and measure description for the list of 65 quality measures.
Table A.1. The set of 65 quality measures.
Table A.2. The list of returned documents for measure CMIT 4 with manually examined and MIF analysis results.