The large amount of biomedical information in databases such as PubMed [1] is a valuable source for
automated information extraction [2] that facilitates the development of more efficient biomedical information retrieval systems. The concept of ‘deep learning’ has recently gained a lot of attention. It refers to unsupervised
learning algorithms which automatically discover structure in data without the need to supply specific domain
knowledge [3]. This approach often achieves higher performance than supervised and informed methods
when processing large unstructured corpora. However, the utility of these algorithms applied to realistic,
domain-specific use-cases still needs further evaluation.
Word2vec [4] implements an efficient deep learning algorithm for computing high-dimensional vector
representations of words and their relationships [5] based on unstructured text data. Once a vector model is
created from a text corpus, word2vec provides two basic tools to use these models: distance and analogy. The distance tool provides a list of words closely related to a particular word from the vector model. These results also contain the corresponding cosine similarity of each related word that indicates how close the words are in the vector space model. The analogy tool, on the other hand, is able to query for textual regularities captured in the vector model through simple vector subtraction and addition.
For example, let us assume that we use word2vec to create a vector model of the words appearing in a large
corpus of news articles. If the resulting vector space representation of cities and countries is projected to a two-dimensional representation, we can observe patterns such as those sketched in Figure 1. Not only are ‘similar’ (e.g., bordering) countries close to each other in vector space, we also see that their capital cities are arranged at predictable distances from their countries, because the deep learning algorithm was able to capture a notion
of the ‘capital city’ relation between two entities from unstructured text sources. Because of this, the vector
operation France – Paris + Berlin would result in a position in the vector space model that is close to the
position of the word Germany, i.e., the regularities in the vector representation can be used to search for words
related through a certain relationship to a query word. Because queries are simply defined ‘by example’,
word2vec allows querying for poorly formalized relationships, which is of special interest in complex and
evolving knowledge domains.
Figure 1. Simplified, two-dimensional example of the regularities in the vector space representation
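The vector arithmetic behind such a query can be sketched in a few lines of Python. The three-dimensional vectors below are hand-picked, hypothetical values chosen only to mimic the pattern in Figure 1; they are not taken from any trained model, and the function names are our own. The sketch answers the query "Paris is to France as Berlin is to ?" by adding the Paris-to-France offset to Berlin and ranking the remaining words by cosine similarity.

```python
import math

# Hypothetical, hand-picked 3-D vectors for illustration only; real word2vec
# vectors are learned from text and have hundreds of dimensions.
TOY_VECTORS = {
    "france":  (1.00, 0.10, 0.00),
    "paris":   (1.00, 0.10, 1.00),
    "germany": (0.15, 0.95, 0.05),
    "berlin":  (0.10, 1.00, 1.00),
    "italy":   (0.70, 0.70, 0.00),
    "rome":    (0.70, 0.70, 1.00),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors=TOY_VECTORS, top_n=3):
    """Answer 'a is to b as c is to ?' by ranking words near vec(b) - vec(a) + vec(c)."""
    query = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    # The three query words themselves are excluded from the candidates.
    candidates = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_n]


# With these toy vectors, "germany" is ranked first.
print(analogy("paris", "france", "berlin"))
```

Because the query is defined purely by an example pair, the same function could be applied to any other relationship for which one example is known, provided the model has captured that regularity.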
These characteristics make word2vec of potential interest for improving the accessibility of unstructured
medical content. For example, word2vec could assist the curation of structured knowledge bases and ontologies,
and could help in refining information retrieval algorithms. However, deep learning algorithms such as
word2vec are known to require very large amounts of training data to provide good results, and the amount of
accessible, high-quality literature in specific domains (such as medicine) is often restricted, potentially
decreasing the practical utility of the approach.
While similar approaches such as GloVe [6] were recently claimed to outperform word2vec tools for the
unsupervised learning of word representations and word analogy, we utilized word2vec because it has been
widely studied [6-8] and, therefore, our results can be easily compared with others.
In this paper we report on our evaluation of word2vec for clinically relevant medical content based on diverse,
openly available, mid-sized medical text corpora. We compare the word relationships learned by word2vec
with curated medical relationships encoded in the National Drug File – Reference Terminology (NDF-RT)
ontology [9] to evaluate the results. The results of this exploratory work are intended to serve as an initial
guidance that informs more in-depth work on applying word2vec in the medical domain.
Word2vec
Word2vec is an efficient implementation of deep learning techniques based on two architectures, continuous
bag-of-words (CBOW) and skip-gram (SG) [5], for computing continuous distributed vector representations of words from large datasets (up to hundreds of billions of words).
Word2vec requires training a vector model on the corpora using one of these architectures. The training tool provides the
following options: (i) type of architecture: continuous bag-of-words or skip-gram; (ii) the dimensionality of the vector space; (iii) the size of the context window in number of words; (iv) the training algorithm: hierarchical softmax and / or negative sampling; (v) the threshold for downsampling the frequent words; (vi) the number of threads to use; and (vii) the format of the output word vector file (an example command line is shown in Table 1).
Table 1. An example of the parameters used by the word2vec training tool to generate the vector model of the medical corpora.
In this example, the training tool uses the corpus file “corpora.txt” to generate the vector model and serialize it to the file “vector-
threshold of 0.001 for downsampling the frequent words (-sample 1e-3). The training tool uses 12 execution threads to generate the vector model as a binary file.
Medical text corpora
We assembled a collection of openly available text repositories relevant to clinical medicine (excluding
veterinary medicine) for use in this evaluation.
Two corpora were derived from PubMed. The first corpus was made up of PubMed abstracts with clinical
relevance. To select abstracts of clinical relevance, a PubMed query was assembled by merging the lists of
journals screened by the evidence-based medicine repositories DynaMed [10] and EvidenceUpdates [11]. The
list was further manually edited, and additional constraints were added (e.g., excluding articles published
before January 2005, excluding editorials) to create the final selection of PubMed abstracts. From this corpus
of PubMed abstracts we derived another corpus by extracting the conclusion sections of the abstracts. The
conclusion sections were further processed by expanding locally defined abbreviations in each abstract. This
resulted in a smaller corpus made up of very high-quality, self-contained key assertions made in the clinically relevant research literature.
We created a corpus of medically relevant content from Wikipedia by selecting all articles that were associated with Wikiproject Medicine or Wikiproject Pharmacology [12] through manual curation of Wikipedia editors.
This produced a corpus of Wikipedia articles with a good coverage of all major clinically relevant topics.
We also included two popular publicly available websites with content for medical professionals: Medscape
[13] and Merck Manual [14]. We created a script for crawling medical content from these websites based on
the PHPcrawl open-source library [15]. We also created scripts for stripping non-relevant portions of web
pages (such as headers and footers) that would have significantly degraded the quality of the corpora. HTML markup was removed from the source data to yield raw text representations of the page contents.
Statistics on word counts and vocabulary sizes of the corpora generated in this way are summarized in Table 2.
Table 2. Corpora used in the experiment.
Word counts refer to the final corpora that were derived from source datasets after all processing steps. Vocabulary sizes refer to the number of distinct words found in each corpus. Underscored corpora were used for evaluating word2vec.
National Drug File – Reference Terminology (NDF-RT) ontology
The NDF-RT is a formal representation of knowledge about drugs and is maintained by the US Department of
Veterans Affairs. We chose the NDF-RT ontology as a reference for evaluating the results produced by the
word2vec algorithms because it is one of the richest manually curated and openly available knowledge bases
on medical drugs. Several relationships between entities, such as ‘may treat’ or ‘has mechanism of
action’, were extracted from NDF-RT through SPARQL queries. The relationships that were extracted in this manner are described in Table 3.
Table 3. Relationships between entities extracted from NDF-RT.
In the text below we refer to relationships captured by NDF-RT as ‘correct’ relationships. Of course, this is a
simplification of reality: it might well be possible that not all relationships found in NDF-RT are factually
correct, and it is also likely that there are factually correct relationships missing from the NDF-RT ontology.
Evaluation
In order to conduct the pilot evaluation of word2vec tools for medical text, we defined the following workflow for analyzing and evaluating the results produced by the word2vec tools: (1) gathering and processing openly
available medical corpora; (2) training word2vec vector space models on these corpora; (3) comparing results from the word2vec distance and analogy tools to assess how well the models captured the
relationships encoded in NDF-RT; and (4) calculating statistics to evaluate the impact of different parameter configurations.
Gathering and processing medical corpora
The texts gathered from the selected sources needed to be pre-processed before they could be used for training
vector models, since word2vec has no built-in functionalities for term normalisation or dealing with
punctuation. We found that unprocessed corpora contained an abundance of basic syntactic variations and
punctuation that had a negative impact on how word2vec indexes the terms and, therefore, the quality of the
resulting vector space models.
The processing of corpora involved:
The processing workflow is depicted in Figure 2.
Figure 2. Workflow for gathering and pre-processing the content of the corpora.
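As a rough illustration of this kind of pre-processing, the following Python sketch removes punctuation and possessive suffixes and joins known multiword terms with underscores. It is a minimal sketch under our own assumptions, not the pipeline used in this work: the lowercasing step, the regular expressions, the function name and the one-entry term list are all hypothetical, and in practice the term list would be derived from an ontology.

```python
import re

# Hypothetical list of multiword terms; in practice this could be derived
# from an ontology. Only one entry is given here for illustration.
MULTIWORD_TERMS = ["acetylsalicylic acid"]


def normalise(text, multiword_terms=MULTIWORD_TERMS):
    """Reduce basic syntactic variation before training (illustrative only)."""
    text = text.lower()  # assumption: case is not informative
    # Drop possessive suffixes such as "aspirin's" -> "aspirin".
    text = re.sub(r"[’']s\b", "", text)
    # Join multiword terms with underscores, longest terms first so that
    # shorter terms do not split longer ones.
    for term in sorted(multiword_terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, term.replace(" ", "_"), text)
    # Replace brackets and other punctuation with spaces.
    text = re.sub(r"[^a-z0-9_\s-]", " ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


print(normalise("Aspirin's effect vs. [Coumadin]; acetylsalicylic acid."))
```

Joining multiword terms before training matters because word2vec treats every whitespace-separated token as a separate vocabulary entry.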
Training word2vec vector space models
The pre-processed corpora described above were used to build vector models with the word2vec training tool. The results of word2vec do not depend exclusively on the corpora but also on the parameters used. In order to
test the impact of these parameter settings we compared the results of vector models trained with different
parameter settings. We trained models on the corpora using (a) CBOW and SG vector model architectures; (b) vector space dimensionalities of 200, 300,
500 and 800; and (c) word window sizes of 5, 10 and 20. Other parameters,
such as the training algorithm or the threshold for downsampling frequent words, were not varied to keep the
complexity of results manageable.
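The resulting parameter grid (2 architectures, 4 dimensionalities and 3 window sizes, i.e., 24 configurations) can be enumerated programmatically. The Python sketch below builds one command line per configuration for the word2vec training tool. The flags are those of the original word2vec command-line tool; the path to the binary and the output file naming scheme are hypothetical, the fixed values (-sample 1e-3, 12 threads, binary output) follow the example described for Table 1, and the training algorithm flags are left at the tool's defaults as an assumption.

```python
from itertools import product

# Value of the -cbow flag for each architecture (1 = CBOW, 0 = skip-gram).
ARCHITECTURES = {"CBOW": 1, "SG": 0}
DIMENSIONS = (200, 300, 500, 800)
WINDOWS = (5, 10, 20)


def training_commands(corpus_path, binary="./word2vec", threads=12):
    """Return one argument list per (architecture, dimensionality, window)."""
    commands = {}
    for (arch, cbow_flag), size, window in product(
        ARCHITECTURES.items(), DIMENSIONS, WINDOWS
    ):
        # Hypothetical naming scheme for the output files.
        output = f"vectors-{arch.lower()}-{size}-{window}.bin"
        commands[(arch, size, window)] = [
            binary,
            "-train", corpus_path,
            "-output", output,
            "-cbow", str(cbow_flag),
            "-size", str(size),
            "-window", str(window),
            "-sample", "1e-3",
            "-threads", str(threads),
            "-binary", "1",
        ]
    return commands


commands = training_commands("corpora.txt")
print(len(commands))  # number of configurations in the grid
print(" ".join(commands[("SG", 300, 10)]))
```

Each argument list could be passed to `subprocess.run` to train the corresponding model, assuming the word2vec binary has been compiled at the given path.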
Assessment system
To assess the vector models trained on the corpora, the results of the word2vec distance and analogy tools were
compared to the curated content of the NDF-RT ontology as a gold standard. To perform this evaluation, we
developed a system that automatically queried a trained vector model using the distance and analogy tools of
word2vec and matched the resulting list of words with the content of the NDF-RT ontology.
Figure 3 shows the software architecture of the assessment system. The “RESTful server” module made the
word2vec analogy and distance tools accessible through RESTful services. RESTful services were deployed
through the Java-based Jersey framework [16] and facilitated the access of the vector space models from
external applications. Both services returned forty words with their cosine similarity values by default.
Figure 3. The software architecture of the assessment system of word2vec tools using NDF-RT ontology.
The “RESTful client” module (Figure 3) was responsible for gathering required information from the NDF-RT
ontology, preparing the queries to consult the analogy and distance services, processing the responses and
matching the retrieved terms with the information from NDF-RT to obtain the evaluation results. A batch
process was also defined to automatically execute these tasks and to obtain the corresponding evaluation result for each trained corpus. The “Query module” collected subject-predicate-object triples for particular predicates
from the NDF-RT ontology. Then, it created a list of unique subjects from all collected triples and used it to
call the analogy and distance services. The “Matching module” received the results from the services and
checked which words from the retrieved vector matched their corresponding words from the object values of
the collected triples. Finally, the results of the word2vec tools evaluation were processed to calculate their
accuracy values.
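The matching step can be sketched as follows. This is a simplified Python illustration under our own assumptions rather than the released implementation: the function names are hypothetical, the triples and the retrieved word lists are made up for the example, and accuracy is assumed to be the fraction of query subjects for which at least one reference object appears among the retrieved words.

```python
from collections import defaultdict


def objects_by_subject(triples, predicate):
    """Group the object values of all triples with the given predicate by subject."""
    grouped = defaultdict(set)
    for subject, pred, obj in triples:
        if pred == predicate:
            grouped[subject].add(obj)
    return dict(grouped)


def accuracy(reference, retrieve, top_n=40):
    """Fraction of subjects with at least one reference object among the
    top_n retrieved words (an assumed definition, for illustration)."""
    if not reference:
        return 0.0
    hits = sum(
        1
        for subject, expected in reference.items()
        if expected.intersection(retrieve(subject)[:top_n])
    )
    return hits / len(reference)


# Made-up triples and tool output, for illustration only.
TRIPLES = [
    ("aspirin", "may_treat", "pain"),
    ("aspirin", "may_treat", "fever"),
    ("metformin", "may_treat", "diabetes_mellitus"),
    ("aspirin", "has_MoA", "cyclooxygenase_inhibitors"),
]
FAKE_RESULTS = {
    "aspirin": ["ibuprofen", "pain", "clopidogrel"],
    "metformin": ["insulin", "glucose"],
}

reference = objects_by_subject(TRIPLES, "may_treat")
# One of the two subjects has a matching object in its retrieved list.
print(accuracy(reference, lambda word: FAKE_RESULTS.get(word, [])))
```

Passing the retrieval step as a function keeps the matching logic independent of whether the words come from the distance or the analogy service.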
The main results of this paper are: (1) the definition of a methodology to evaluate deep learning techniques in
medical corpora; (2) the development of the batch system to automatically run the word2vec tools and match
the results against the content of the NDF-RT ontology as a gold standard; and (3) the assessment of the results
of the batch system for the evaluation of word2vec on medical corpora. The source code of the software for
running these experiments is freely available for download at [17]; the software can be easily adapted to other use-cases, corpora and ontologies.
Removing sources of syntactic variability in the corpora proved to be a good strategy for increasing the utility
of the trained vector models. As an example, Table 4 shows an excerpt of the result list when searching for
words related to “aspirin” in the combined corpus with the distance tool. The results from the raw corpus
include terms such as “[coumadin]” or “aspirin’s” which contain irrelevant symbols. Furthermore, the lack of
multiword terms in the raw corpus reduced the quality of retrieved terms, e.g., the term “acetylsalicylic_acid” (the active ingredient of Aspirin) is the closest to “aspirin” in the pre-processed corpus with a cosine similarity
value of 0.83, whereas in the raw corpus the term “acetylsalicylic” is only found at the 5th position with a
cosine similarity of 0.68.
Table 4. Results of the distance tool for the term “aspirin” when querying the pre-processed and raw
versions of the combined corpus.
The vector model based on the pre-processed corpus yields better results than the vector model based on the raw corpus, e.g., the term ‘acetylsalicylic_acid’ is the closest term to ‘aspirin’.
The analogy tool aims to find textual regularities captured in the vector model to search for words related to a
query word through a specific kind of relationship, while the distance tool simply returns words that are
similar or generically related to a query word. To test whether the vector model actually captured textual
regularities as expected, we tested if the analogy tool was superior in retrieving the right words from a specific
relation. We calculated the accuracy of both tools for the three selected corpora: combined,
pubmed_key_assertions and Wikipedia (see Table 2); and the relationships: may_treat, may_prevent, has_PE
and has_MoA (see Table 3) from the reference ontology NDF-RT as a gold standard. Table 5 shows that the
analogy tool indeed provided better results than the unselective distance tool, which underlines the ability of
the analogy tool to recognize textual regularities beyond generic notions of relatedness. Models trained on the
combined corpus, the largest corpus among the test corpora, produced best results, which confirms the
hypothesis that word2vec model quality increases as corpus size increases. The number of correct results for
the may_treat relationship with the combined corpus and the analogy tool was the best among all test cases
with an accuracy of 38.78%. Consequently, we chose this configuration to run our evaluations regarding the influence of window-size and vector dimensionality parameters on the accuracy of the resulting vector models.
The window-size and vector dimensionality parameters are described by the developers of
word2vec as the most important ones to optimise for achieving good results. The window-size parameter corresponds to
the span of words in the text that is taken into account, backwards and forwards, when iterating through the
words during model training.
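The effect of the window-size parameter can be illustrated with a short Python sketch that returns the context words for a given position in a tokenised sentence. The function name and the example sentence are hypothetical, and the sketch treats the window size as a fixed span, which is a simplification of what the training tool does internally.

```python
def context_window(tokens, index, window):
    """Return up to `window` tokens before and after position `index`."""
    start = max(0, index - window)
    before = tokens[start:index]
    after = tokens[index + 1:index + 1 + window]
    return before + after


sentence = ["aspirin", "reduces", "the", "risk", "of", "stroke"]
# Context of the word "the" (position 2) with a window size of 2.
print(context_window(sentence, 2, 2))
```

A larger window pairs each word with more distant neighbours, which increases the number of training examples per word and therefore the training time.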
Table 5. Accuracy values of the word2vec tools.
Figure 4 shows the accuracy when querying the resulting vector models generated with SG and CBOW
architectures and with three different window sizes, 5, 10 and 20. Larger window sizes increase the time
required to train vector models, so choosing the optimal window size can reduce computing time. Our findings
show that the window-size parameter is also relevant for improving the accuracy of models trained on the gathered medical
corpora. With our corpora, the optimal value is a window size of 10.
Figure 4. Comparison between the accuracy values using skip-gram (SG) and continuous bag-of-words
(CBOW) architectures. Comparisons were conducted for window sizes of 5, 10 and 20. Results are based
The influence of the vector dimensionality parameter was evaluated using vector models generated with 200,
300, 500 and 800 vector dimensions, and the SG and CBOW architectures. Figure 5 shows the comparison
between the accuracy values with various parameter configurations. In our experiments, the best ranking
results were obtained with the SG architecture and a vector dimensionality of 300, while worst ranking results were observed with the CBOW architecture and a vector dimensionality of 800. The results suggest that the SG architecture consistently yielded better result ranking than the CBOW architecture. This is also consistent with
other evaluations such as [6], which obtained an accuracy of 61% with SG, a vector dimensionality of 300 and a 1B-word
corpus. We observed an inverted U-shaped relationship between the dimensionality of the model and the accuracy
of results with both the SG and CBOW architectures, so increasing the vector dimensionality further is unlikely to provide better results
than those reported here. A higher dimension of the vectors also implies a bigger size of the resulting vector model.
Therefore, having a vector model with a dimension of 300 requires less memory space and provides better
results than a vector model with 500 or 800 vector dimensions.
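The memory argument can be made concrete with a back-of-the-envelope estimate. The Python sketch below assumes that each vector component is stored as a 4-byte floating point number and ignores the space needed for the vocabulary strings; the vocabulary size of 100,000 words is a hypothetical round number, not a value from Table 2.

```python
def model_size_bytes(vocabulary_size, dimensions, bytes_per_value=4):
    """Approximate size of the vector matrix: one value per word and dimension."""
    return vocabulary_size * dimensions * bytes_per_value


VOCABULARY_SIZE = 100_000  # hypothetical vocabulary size

for dims in (200, 300, 500, 800):
    size_mb = model_size_bytes(VOCABULARY_SIZE, dims) / 1_000_000
    print(f"{dims} dimensions: {size_mb:.0f} MB")
```

Under these assumptions the model size grows linearly with the dimensionality, so a model with 800 dimensions needs roughly 2.7 times the space of one with 300 dimensions.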
Figure 5. Accuracy of the vector models generated using the combined corpus with skip-gram (SG) and
With the size-restricted corpora we used, the ability of word2vec to retrieve expected terms is not sufficient for
applications requiring high precision, since we only obtained an accuracy of 49.28%. These modest results
could be explained by the restricted size of gathered medical corpora as well as the complexity of the medical
knowledge domain. It could be of interest for future research to test this tool with larger, commercially
available medical corpora.
Due to the complexity of medical terminology, we found pre-processing of corpora necessary to decrease
syntactic variability. We also found that many relevant medical terms of interest were composed of multiple
words, and that ontology-based pre-processing of these terms led to a marked improvement of the results from the word2vec tools.
As expected, the analogy tool produced better results for identifying related entities for a specific type of
relationship than the distance tool, and the largest corpus provided better results than smaller sub-corpora. As a
consequence of our results, we recommend using the SG architecture rather than CBOW to train models on other medical
datasets because this architecture always produced the best accuracy values. Moreover, the combination of a
window size of 10 with a vector dimensionality of 300 produced the best results among all tested parameter
configurations. We also conclude that a vector dimension greater than 800 and a window size greater than 20
are not recommended due to the observed quick deterioration of accuracy values.
Given the low matching rate observed in our evaluation with mid-sized medical corpora, our future
research will focus on improving accuracy values by combining the word2vec analogy tool with knowledge-based
resources such as ontologies to create hybrid systems, and on running the evaluations with larger,
commercially available corpora.
We thank the word2vec team for providing assistance with tuning parameters of the word2vec tools.
1. PubMed home. Available: http://www.ncbi.nlm.nih.gov/pubmed/. Accessed 14 February 2010.
2. Kim J-D, Ohta T, Tsujii J (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 9: 10. doi:10.1186/1471-2105-9-10.
3. Bengio Y, Courville A, Vincent P (2013) Representation Learning: A Review and New Perspectives. IEEE Trans Pattern Anal Mach Intell 35: 1798–1828. doi:10.1109/TPAMI.2013.50.
4. word2vec - Tool for computing continuous distributed representations of words. - Google Project Hosting. Available: https://code.google.com/p/word2vec/. Accessed 22 August 2014.
5. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. ArXiv13013781 Cs. Available: http://arxiv.org/abs/1301.3781. Accessed 22 August 2014.
6. Pennington J, Socher R, Manning CD (2014) GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
7. Levy O, Goldberg Y (2014) Linguistic Regularities in Sparse and Explicit Word Representations. Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Baltimore, Maryland, USA.
8. Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Mikolov T (2013) DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems. pp. 2121–2129.
9. National Drug File - Reference Terminology - Summary | NCBO BioPortal. Available: http://bioportal.bioontology.org/ontologies/NDFRT. Accessed 24 August 2014.
10. Evidence Based Content, Up to Date Content, Clinical Reference | DynaMed. Available: https://dynamed.ebscohost.com/. Accessed 16 October 2013.
11. EvidenceUpdates. Available: http://plus.mcmaster.ca/evidenceupdates/. Accessed 16 October 2013.
12. Heilman JM, Kemmann E, Bonert M, Chatterjee A, Ragar B, et al. (2011) Wikipedia: a key tool for global public health promotion. J Med Internet Res 13: e14. doi:10.2196/jmir.1589.
13. Diseases & Conditions - Medscape Reference. Available: http://emedicine.medscape.com/. Accessed 15 October 2013.
14. THE MERCK MANUALS - Trusted Medical and Scientific Information. Available: http://www.merckmanuals.com/. Accessed 15 October 2013.
15. PHPCrawl webcrawler/webspider library for PHP - About. Available: http://phpcrawl.cuab.de/. Accessed 16 October 2013.
16. Jersey. Available: https://jersey.java.net/index.html. Accessed 22 August 2014.
17. biomedical-text-exploring-tools - Deep learning on medical text corpora on the World Wide World - Google Project Hosting. Available: https://code.google.com/p/biomedical-text-exploring-tools/. Accessed 22 August 2014.