One of the most significant challenges for veterinary data science is that veterinary primary practices rarely code clinical findings in electronic health records (EHRs). This makes it hard to perform core tasks like case finding and cohort selection, or to produce basic descriptive statistics like disease prevalence. It is increasingly accepted that spontaneous diseases in animals have important translational impact on the study of human disease across a variety of disciplines (Kol et al., 2015). Beyond zoonotic diseases, which represent 60-70% of all emerging diseases, non-infectious diseases such as cancer are increasingly studied in companion animals as a way to mitigate some of the problems with rodent models of disease (LeBlanc, Mazcko, and Khanna, 2016). Additionally, spontaneous models of disease in companion animals are being used in drug development pipelines because these models more closely resemble the “real world” clinical settings of diseases than genetically altered mouse models do (Grimm, 2016; Klinck et al., 2017; Baraban and Löscher, 2014; Hernandez et al., 2018).
In comparison to the human EHR, there has been little ML work on veterinary EHRs, which face a unique challenge. The labeled data that are accessible to researchers reside only in referral teaching hospitals. These hospitals often specialize in specific types of disease, so neither the patient population nor the disease distribution resembles that of the general population. Machine learning models trained on such data can easily become biased and perform poorly on general clinical records. We refer to this as the cross-hospital challenge.
Our contributions We develop an algorithm to leverage one million unlabeled clinical notes through generative sequence modeling, and demonstrate that such large-scale modeling can substantially improve the model’s performance in a cross-hospital setting. We adapt the new state-of-the-art Transformer model proposed by Vaswani et al. (2017). We systematically evaluate the model performance in this cross-hospital setting, where the algorithm trained on one hospital is evaluated in a different hospital with substantial domain shift. In addition, we provide interpretation for what is learned by the deep network. Our algorithm addresses an important application in healthcare, and our experiments add insights into the power of generative sequence modeling for clinical NLP.
Figure 1: Our proposed model architecture for automated disease coding. Two tasks are shown: generative modeling (top) and supervised learning (bottom). The dashed arrows represent the generative modeling process on the unlabeled SAGE data, and the solid arrows represent the supervised learning process on the labeled CSU data. An additional test is done on the PP data (not shown).
We formulate the problem of automated disease coding as a multi-label classification problem. Given a veterinary record $X$, which contains a detailed description of the diagnosis, we try to infer a subset of diseases $y \subseteq Y$, given a pre-defined set of diseases $Y$. The problem of inferring a subset of disease codes can be viewed as a series of independent binary prediction problems (Sorower, 2010).
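The reduction to independent binary problems can be sketched as follows. This is a minimal illustration, not the authors' code: the three label names, the example probabilities, and the 0.5 decision threshold are all assumptions made for the example.

```python
from typing import List, Set

# Hypothetical label inventory; in the paper the set Y holds
# disease-level SNOMED-CT codes.
LABELS: List[str] = ["neoplasm", "infection", "dermatitis"]


def to_multi_hot(codes: Set[str], labels: List[str] = LABELS) -> List[int]:
    """Encode a set of disease codes as one binary target per label."""
    return [1 if label in codes else 0 for label in labels]


def decode(probs: List[float], labels: List[str] = LABELS,
           threshold: float = 0.5) -> Set[str]:
    """Make an independent yes/no decision for every label.

    The 0.5 threshold is an assumption for illustration.
    """
    return {label for label, p in zip(labels, probs) if p >= threshold}


# One record annotated with two diseases, and one set of model outputs.
target = to_multi_hot({"neoplasm", "dermatitis"})
predicted = decode([0.91, 0.08, 0.64])
```

Each position of the multi-hot vector is a separate binary target, so a record may receive zero, one, or several codes.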
We use three datasets in this work (Appendix Table S1). CSU (Labeled): We use a curated set of 112,558 veterinary notes from the Colorado State University College of Veterinary Medicine and Biomedical Sciences. Each note is annotated with a set of SNOMED-CT codes by veterinarians at Colorado State. PP (Labeled): We obtain a smaller set of 586 discharge summaries curated from a commercial veterinary practice located in Northern California. Two veterinary experts applied SNOMED-CT codes to these records and achieved consensus on the records used for validation. This dataset differs drastically from the CSU dataset, as evidenced by its shorter notes and its use of abbreviations. SAGE (Unlabeled): We obtain a large set of 1,019,747 unlabeled notes from the SAGE Centers for Veterinary Specialty & Emergency Care. This is a set of raw clinical notes without any codes applied to them. The characteristics of this dataset should be similar to those of the PP dataset because both come from primary local clinics.
Our proposed model architecture is shown in Figure 1. Two tasks are shown: generative modeling and supervised learning. We describe these two tasks in the following section.
3.1 Generative Modeling
A generative model over text is also referred to as a language model. A text sequence is an ordered list of tokens $x_1, \dots, x_T$. Therefore, we can build an autoregressive model to estimate the joint probability of the entire sequence: $P(x_1, \dots, x_T)$. In an ordered sequence, we can factorize it as
$$P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}).$$
Concretely, we estimate the token distribution of $x_t$ by using the contextualized representation $h_{t-1}$ provided by our encoder:
$$P(x_t \mid x_1, \dots, x_{t-1}) = \mathrm{softmax}(W h_{t-1}).$$
We optimize over the negative log-likelihood of the distribution:
$$\mathcal{L}_{LM} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}).$$
Table 1: Evaluation of trained classifiers on the CSU test data and PP data. EM is the fraction of cases where the set of diseases predicted by the model exactly matches the expert labels. The classifiers are trained on a subset of CSU. Notation: LSTM and Transformer are our two base encoder models; +Word2Vec uses Word2Vec trained on SAGE to initialize; +Pretrain uses generative modeling loss on SAGE to initialize; +Auxiliary uses generative modeling loss on CSU in addition to classification objective on CSU.
In our model, we examine the effect of generative modeling on two encoder architectures: the Transformer and the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). We use this objective in two parts of our system: 1) to pretrain the encoder’s parameters; 2) as an auxiliary task during training of the classifier.
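To make the generative objective concrete, the sketch below computes the per-token negative log-likelihood of a sequence from next-token scores, together with the perplexity derived from it. It is an illustration under assumptions: the function names are hypothetical, the logits stand in for whatever the encoder and output layer produce, and the loss is averaged over tokens so that its exponential is the perplexity.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over vocabulary scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_nll(step_logits: List[List[float]], targets: List[int]) -> float:
    """Average negative log-likelihood per token.

    step_logits[t] holds the scores over the vocabulary for position t,
    computed from the preceding tokens only; targets[t] is the index of
    the token that actually occurs at position t.
    """
    nll = 0.0
    for logits, target in zip(step_logits, targets):
        nll -= math.log(softmax(logits)[target])
    return nll / len(targets)


def perplexity(avg_nll: float) -> float:
    """Perplexity is the exponential of the average per-token NLL."""
    return math.exp(avg_nll)


# A model that is uniform over a 4-word vocabulary has perplexity 4.
uniform_ppl = perplexity(sequence_nll([[0.0] * 4] * 3, [1, 2, 3]))
```

Lower perplexity means the model assigns higher probability to the observed notes.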
3.2 Supervised Learning
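The classifier uses a dot-product attention layer to get a summary representation $c$ for the entire sequence. We describe the computation in Appendix Eqn 5. We then use a fully connected layer to down project it and calculate the probability of each disease $i$: $p_i = \sigma(w_i^\top c + b_i)$. We compute the binary cross entropy loss across all labels:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{|Y|} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right],$$
where $y_i \in \{0, 1\}$ indicates whether disease $i$ is annotated for the record. Finally, we use a mixture of two losses, $\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{LM}$, where $\mathcal{L}_{LM}$ is the generative modeling loss of Section 3.1, and use the hyperparameter $\lambda$ to set the strength of the auxiliary task loss when we use generative modeling as an auxiliary task in our classification training.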
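The mixture of the classification loss and the auxiliary generative loss can be sketched as below. This is a hedged illustration: summing the binary cross entropy over labels, the additive form of the mixture, and the name `lam` for the weighting hyperparameter are assumptions, and the numeric values are made up.

```python
import math
from typing import List


def binary_cross_entropy(probs: List[float], targets: List[int],
                         eps: float = 1e-12) -> float:
    """Binary cross entropy summed over all disease labels."""
    loss = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


def combined_loss(cls_loss: float, lm_loss: float, lam: float) -> float:
    """Classification loss plus the weighted auxiliary generative loss."""
    return cls_loss + lam * lm_loss


# Illustrative values only: three labels and a made-up language-model loss.
cls = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
total = combined_loss(cls, lm_loss=3.0, lam=0.5)
```

Setting `lam` to zero recovers plain supervised training, which is one way to read the ablation without the auxiliary task.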
We conduct systematic experiments on different models and ablations to quantify which components of our model improve the automatic coding performance (Table 1).
Neural networks outperform feature-based models We use the popular MetaMap, a program developed by the National Library of Medicine (NLM) (Aronson and Lang, 2010), as a baseline. MetaMap processes a document and outputs a list of matched medically relevant keywords with their frequencies in the given document. We train directly on the sparse bag-of-words feature representation from MetaMap, using the SVM or MLP implementation from scikit-learn (Pedregosa et al., 2011) as the classification algorithm. We find that this baseline performs worse than CAML (Mullenbach et al., 2018), LSTM and Transformer on both the CSU and PP test data.
Table 2: Most influential words in the best model (Transformer+Auxiliary+Pretrain). We select five representative disease categories. For each disease, we show the top 10 words in the MetaMap medical dictionary that the model most strongly associates with the disease.
Generative modeling outperforms Word2Vec The test perplexity achieved by generative modeling on the SAGE dataset is 20.7 with the LSTM and 15.6 with the Transformer, so the Transformer outperforms the LSTM on generative modeling pretraining. We find that generative modeling as pretraining is sufficient for models to learn useful word embeddings, and models with +Pretrain outperform models with +Word2Vec on both CSU and the cross-hospital dataset PP.
Generative modeling helps Transformer more In our experiment, we compare the performance of our system with and without the generative modeling objective as an auxiliary task during classification training. Adding generative modeling as an auxiliary task improves both Transformer and LSTM on the CSU test set as well as on the cross-hospital PP evaluation set. The effect of auxiliary training is larger on Transformer than on LSTM. We also combine generative modeling pretraining with the auxiliary task during classification training and observe substantially better performance than the baseline model with either encoder.
In order to gain intuition on how deep learning models process clinical notes, we implement a gradient-based interpretation method on our model. The method attributes prediction scores to the input by computing the attribution score as gradient × input (Ancona et al., 2018). We compute the frequency of words whose score exceeds a threshold (chosen to select on average 3% of the words per note), use the MetaMap dictionary as a filter to extract medically relevant terms, and then sort them in decreasing order of frequency. We sample 5 diseases and report the top 10 clinically relevant terms extracted by the model in Table 2. Words captured by the model are of high quality and agree with medical domain knowledge. Most words captured by the model are in the expert-curated dictionary from MetaMap. Moreover, we notice that the model is capable of capturing abbreviations (e.g., ‘kcs’), combinations (e.g., ‘immune-mediated’) and rare professional terms (e.g., ‘cryptorchid’) that MetaMap fails to extract.
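The gradient × input attribution can be illustrated on a toy model where the gradient is available in closed form. Everything here is an assumption made for the example: the two-dimensional embeddings, the weights, the bag-of-embeddings classifier (a stand-in for the deep encoder), and the threshold value.

```python
import math
from typing import Dict, List

# Toy embeddings and classifier weights; all values are made up.
EMBED: Dict[str, List[float]] = {
    "the": [0.0, 0.1],
    "dog": [0.1, 0.0],
    "mass": [1.2, 0.4],
}
W: List[float] = [1.5, 0.5]
B: float = -1.0
THRESHOLD: float = 0.1  # hypothetical attribution cut-off


def predict(tokens: List[str]) -> float:
    """Disease probability from a sum of token embeddings."""
    z = B + sum(w * x for tok in tokens for w, x in zip(W, EMBED[tok]))
    return 1.0 / (1.0 + math.exp(-z))


def gradient_times_input(tokens: List[str]) -> List[float]:
    """Per-token attribution: sum over dimensions of grad * embedding."""
    p = predict(tokens)
    dp_dz = p * (1.0 - p)  # derivative of the sigmoid output
    # d p / d e[t][d] = dp_dz * W[d]; multiply by the input and sum over d.
    return [sum(dp_dz * w * x for w, x in zip(W, EMBED[tok]))
            for tok in tokens]


note = ["the", "dog", "mass"]
scores = gradient_times_input(note)
influential = [tok for tok, s in zip(note, scores) if s > THRESHOLD]
```

With a real network the gradient would come from automatic differentiation, and counting how often each word clears the threshold across notes gives the frequencies that are sorted.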
We propose a framework that is robust to the cross-hospital generalization problem in the veterinary medicine automated coding task. By training the model on 1 million raw notes with a generative modeling objective, and by using the state-of-the-art Transformer model, we substantially increase the performance of the framework on clinical notes gathered from and annotated for a private hospital. Our framework can be applied to other medical domains that currently lack medical coding resources.
Ancona, M.; Ceolini, E.; Oztireli, C.; and Gross, M. 2018. Towards better understanding of gradient-based attribution methods for deep neural networks. In 6th International Conference on Learning Representations (ICLR 2018).
Aronson, A. R., and Lang, F.-M. 2010. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 17(3):229–236.
Baraban, S. C., and Löscher, W. 2014. What new modeling approaches will help us identify promising drug treatments? In Issues in Clinical Epileptology: A View from the Bench. Springer. 283–294.
Bird, S., and Loper, E. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, 31. Association for Computational Linguistics.
Donnelly, K. 2006. SNOMED-CT: The advanced terminology and coding system for eHealth. Studies in health technology and informatics 121:279.
Grimm, D. 2016. From bark to bedside.
Hernandez, B.; Adissu, H. A.; Wei, B.-R.; Michael, H. T.; Merlino, G.; and Simpson, R. M. 2018. Naturally occurring canine melanoma as a predictive comparative oncology model for human mucosal and other triple wild-type melanomas. International journal of molecular sciences 19(2):394.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
Klinck, M. P.; Mogil, J. S.; Moreau, M.; Lascelles, B. D. X.; Flecknell, P. A.; Poitte, T.; and Troncy, E. 2017. Translational pain assessment: could natural animal models be the missing link? Pain 158(9):1633–1646.
Kol, A.; Arzi, B.; Athanasiou, K. A.; Farmer, D. L.; Nolta, J. A.; Rebhun, R. B.; Chen, X.; Griffiths, L. G.; Verstraete, F. J.; Murphy, C. J.; et al. 2015. Companion animals: Translational scientist’s new best friends. Science translational medicine 7(308):308ps21–308ps21.
LeBlanc, A. K.; Mazcko, C. N.; and Khanna, C. 2016. Defining the value of a comparative approach to cancer drug development. Clinical Cancer Research 22(9):2133–2138.
Mullenbach, J.; Wiegreffe, S.; Duke, J.; Sun, J.; and Eisenstein, J. 2018. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training.
Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Sorower, M. S. 2010. A literature survey on algorithms for multi-label learning. Oregon State University, Corvallis 18.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
A.1 Model Details
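LSTM The Long Short-Term Memory network (LSTM) is a recurrent neural network with a long short-term memory cell (Hochreiter and Schmidhuber, 1997). It maintains gating functions specifically designed to capture long-term dependencies between words. At time step $t$ with word embedding input $x_t$, the recurrent computation of the LSTM network can be described in Equation 1.
$$\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{1}$$
Here $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ is the Hadamard product.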
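A single LSTM time step can be sketched in plain Python as follows. This assumes the widely used formulation with input, forget, and output gates plus a tanh candidate; the parameter layout (a dictionary keyed by gate name) and all numeric values are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Params = Dict[str, Tuple[Matrix, Matrix, Vector]]  # gate -> (W, U, b)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Params) -> Tuple[Vector, Vector]:
    """One recurrent step: returns the new hidden and cell states."""
    def gate(name: str, act: Callable[[float], float]) -> Vector:
        w, u, b = params[name]
        pre = [a + r + bias
               for a, r, bias in zip(matvec(w, x), matvec(u, h_prev), b)]
        return [act(p) for p in pre]

    f = gate("f", sigmoid)    # forget gate
    i = gate("i", sigmoid)    # input gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell state
    # Element-wise (Hadamard) products combine the gates and states.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# With all-zero parameters every sigmoid gate is 0.5 and the candidate is 0.
zero: Params = {k: ([[0.0]], [[0.0]], [0.0]) for k in ("f", "i", "o", "g")}
h_new, c_new = lstm_step([1.0], [0.0], [2.0], zero)
```

In the zero-parameter example the cell state is simply halved, which makes the role of the forget gate easy to see.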
Transformer Transformer was proposed by Vaswani et al. (2017) as a machine translation architecture. We use a multi-layer Transformer decoder similar to the setup in Radford et al. (2018).
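Let the previous layer’s output be $h^{l-1}$. At the first layer, these values are equal to the word embeddings added to a positional encoding defined in Equation 2, where $i$ indicates the dimension of the positional embedding and $t$ indicates the position of the token in the sequence.
$$PE(t, 2i) = \sin\left(t / 10000^{2i/d}\right), \qquad PE(t, 2i+1) = \cos\left(t / 10000^{2i/d}\right) \tag{2}$$
For the multi-head attention, we first use three linear projections to transform $h^{l-1}$ into the query, key and value matrices $Q = h^{l-1} W^Q$, $K = h^{l-1} W^K$ and $V = h^{l-1} W^V$. We compute the new hidden states according to Equation 3, where $d_k = d/n$.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \tag{3}$$
An $n$-headed attention computes Equation 3 $n$ times and concatenates the $n$ resulting matrices. In order to prevent dimension blow-up as the layers go deeper, the multi-head attention matrices $W^Q$, $W^K$ and $W^V$ all have dimensions $d \times d/n$. In Equation 4, we describe the Transformer block. The matrix multiplications by the two feed-forward weight matrices, which project from dimension $d$ to $D$ and back to $d$, are referred to as a bottleneck computation, where $D$ is much larger than $d$.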
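The positional encoding and a single attention head can be sketched in plain Python as below. This is an illustration under assumptions: it uses the sinusoidal encoding and scaled dot-product attention of Vaswani et al. (2017), and it applies a causal mask (each position attends only to itself and earlier positions), which is what a decoder-style language model would need. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def positional_encoding(t: int, d: int) -> List[float]:
    """Sinusoidal encoding of position t with an even dimension d."""
    pe: List[float] = []
    for i in range(0, d, 2):
        angle = t / (10000 ** (i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def causal_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention where position t sees positions <= t."""
    d_k = len(k[0])
    out: Matrix = []
    for t, q_t in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_t, k[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(weights[s] * v[s][j] for s in range(t + 1))
                    for j in range(len(v[0]))])
    return out


# Zero queries and keys give uniform weights over the visible positions.
attended = causal_attention([[0.0, 0.0], [0.0, 0.0]],
                            [[0.0, 0.0], [0.0, 0.0]],
                            [[1.0], [3.0]])
```

A multi-head layer would run this once per head on separately projected inputs and concatenate the results.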
Classifier The drawback of letting $c = h_T$ is that we essentially lose much of the information from time steps before $T$. We use a dot-product attention layer to transform $h_1, \dots, h_T$ to a vector $c$ that summarizes the entire sequence. The computation is defined in Equation 5.
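One common way to realize such a summary is attention pooling with a learned query vector, sketched below. This is an assumption about the form of the layer rather than a transcription of Equation 5: the `query` vector and the softmax weighting are illustrative choices.

```python
import math
from typing import List


def attention_pool(hidden: List[List[float]],
                   query: List[float]) -> List[float]:
    """Weighted average of hidden states; weights come from dot products
    between each state and a (hypothetical) learned query vector."""
    scores = [sum(q * h for q, h in zip(query, h_t)) for h_t in hidden]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    return [sum(w * h_t[j] for w, h_t in zip(weights, hidden))
            for j in range(dim)]


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero query weights all time steps equally, giving the plain mean.
summary = attention_pool(states, query=[0.0, 0.0])
# A large query concentrates the weight on the best-matching state.
focused = attention_pool(states, query=[10.0, 10.0])
```

Unlike taking only the last hidden state, every time step can contribute to the summary, with the weights learned from the classification objective.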
Experimental Setup We filter out all non-ASCII characters in our documents, convert all letters to lower case, and then tokenize with NLTK (Bird and Loper, 2004). We apply the standard BPE (Byte Pair Encoding) algorithm (Sennrich, Haddow, and Birch, 2015) with a vocabulary size of 50k to address the out-of-vocabulary problem. We truncate all documents to no more than 600 tokens and pad them with start-of-sentence and end-of-sentence tokens. The word embedding dimension and encoder latent dimension are both set to 768. For the Transformer, we stack 6 Transformer blocks, with 8 heads for the multi-head attention on each layer. We set the feed-forward dimension to 2048. We implement our model in PyTorch. We use the Noam optimizer (Vaswani et al., 2017) with 8000 warm-up steps. The dropout rate is set to 0.1 during training to reduce overfitting. We split the datasets into training, validation and test sets (Table S1). All models are trained for 10 epochs. We use the validation set to select our best model, and evaluate that model on the CSU test set and the PP test set. We use a batch size of 10 for LSTM and a batch size of 5 for Transformer, which is the maximum that fits on a single GPU.
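The warm-up schedule of the Noam optimizer can be sketched as a plain function of the step count. The formula is the learning-rate schedule from Vaswani et al. (2017); the default `d_model=768` and `warmup=8000` mirror the setup above, while the `factor` argument and its value of 1.0 are assumptions.

```python
def noam_lr(step: int, d_model: int = 768, warmup: int = 8000,
            factor: float = 1.0) -> float:
    """Learning rate that grows linearly during warm-up, then decays
    proportionally to the inverse square root of the step number."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup ** -1.5)


# The rate peaks exactly at the end of warm-up.
peak = noam_lr(8000)
halfway = noam_lr(4000)
later = noam_lr(32000)
```

Halfway through warm-up the rate is half of its peak, and after warm-up it falls off slowly, which tends to stabilize early Transformer training.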
A.2 Dataset Details
Table S1: Descriptive statistics of the three datasets.
SNOMED-CT Codes SNOMED-CT is a comprehensive clinical health terminology managed by the International Health Terminology Standards Development Organization (Donnelly, 2006). Annotations are applied from the SNOMED-CT veterinary extension (SNOMED-CT VET) of the International SNOMED-CT edition. In this work, we try to predict disease-level SNOMED-CT codes.
Example We select three examples from each dataset and show them in Figure S1.
Length Distribution We plot a histogram of the proportion of records of each length in each dataset in Figure S2.
Number of Labels per Document We plot a histogram of the proportion of records with each number of labels in each labeled dataset in Figure S3.
Figure S1: Examples from the CSU, PP and SAGE datasets. CSU and PP are expert labeled and SAGE is unlabeled.
Figure S2: Document length distribution.
Figure S3: Label number distribution.
Species Distribution We plot pie charts to show the proportion of species in each labeled dataset in Figure S4.
Data Availability The data that support the findings of this study are available from Colorado State University College of Veterinary Medicine, a private practice veterinary hospital near San Francisco and SAGE Centers for Veterinary Specialty & Emergency Care, but restrictions apply to the availability of these data, which were made available to Stanford for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Colorado State University College of Veterinary Medicine, the private hospital and SAGE Centers for Veterinary Specialty & Emergency Care.
A.3 Result Details
We compute precision, recall, F1 and accuracy score for 20 most frequent disease categories. We list the results in Table S2.
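The per-category metrics, together with the exact-match (EM) score reported in Table 1, can be computed from multi-hot label vectors as sketched below. The function names and the tiny example data are illustrative assumptions; undefined ratios are set to 0.0 by convention here.

```python
from typing import List, Tuple


def per_label_prf(y_true: List[List[int]], y_pred: List[List[int]],
                  label: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one disease category."""
    pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def exact_match(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of records whose predicted label set equals the gold set."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Three records, two disease categories (made-up labels).
gold = [[1, 0], [1, 1], [0, 1]]
pred = [[1, 0], [0, 1], [0, 1]]
precision0, recall0, f1_0 = per_label_prf(gold, pred, label=0)
em = exact_match(gold, pred)
```

EM is a strict metric because a single missed or spurious code makes the whole record count as wrong, whereas the per-label scores show which categories drive the errors.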
To investigate the effectiveness of generative modeling pretraining and generative modeling as an auxiliary task, we compare the performance of two models, Transformer vs. Transformer+Auxiliary+Pretrain, on both the CSU and PP datasets. We report precision, recall and F1 score for the 20 most frequent disease categories, as shown in Figure S5. We observe a significant improvement in recall for the Transformer+Auxiliary+Pretrain model, which explains the overall improvement in F1 score.
Figure S4: Species distribution in CSU dataset (left) and PP dataset (right).
Table S2: Performance of the best model (Transformer+Auxiliary+Pretrain) for 20 most frequent disease categories.
A.4 Interpretation Details
We use a gradient-based attribution algorithm to compute the frequency of words whose score exceeds a threshold (chosen heuristically), use the MetaMap dictionary as a filter to extract medically relevant terms, and then sort them in decreasing order of frequency. We select the top 50 words and display the words that intersect with the MetaMap expert-curated dictionary. We show the results in Tables S3, S4 and S5. Disease categories without influential words are not shown.
Figure S5: Performance comparison on the CSU and PP dataset for the 20 most frequent disease categories. Generative modeling pretraining and generative modeling as an auxiliary task improve recall significantly.
Table S3: Most influential words in the best model (Transformer+Auxiliary+Pretrain). Disease categories without influential words are not shown.
Table S4: Most influential words in the best model (Transformer+Auxiliary+Pretrain). Disease categories without influential words are not shown.
Table S5: Most influential words in the best model (Transformer+Auxiliary+Pretrain). Disease categories without influential words are not shown.