Alzheimer’s Disease (AD) is a neurodegenerative disease which affects 5.5 million Americans and whose care cost $259 billion in the United States in 2017 [2]. Despite its prevalence, it can be challenging to recruit participants with cognitive decline for research studies, due to issues ranging from ethics protocol restrictions for vulnerable populations to caregiver fatigue. Datasets for AD are therefore often sparse [20].
Language decline is one of the main symptoms of AD, and several studies have consequently applied natural language processing and machine learning to quantify differences between AD and healthy speech. Wankerl et al [23] used a simple N-gram based approach to build language models for control participants and AD patients. Using a perplexity measure, they achieved a classification result of 77.1%. Rentoumi et al [20] considered a slightly more challenging task, using frequency unigrams to differentiate between picture descriptions from AD participants with and without additional vascular pathology (N = 18 in each group); their highest accuracy was 75%. Other approaches have considered a greater number of features. Guinn et al [9] distinguished between AD and healthy language samples, with up to 79.5% accuracy, using 80 conversations and features such as filled pauses, repetitions, and incomplete words. Meilan et al [17] distinguished between 30 patients with AD and 36 healthy controls, obtaining an accuracy of 84.8%, using temporal and acoustic features such as percentage and number of voice breaks, shimmer, and noise-to-harmonics ratio. Similarly, Fraser et al detected primary progressive aphasia [7] and AD [6] with up to 100% and 82% accuracy, respectively, using a wide array of lexicosyntactic and acoustic features extracted from story retelling and picture description tasks.
Previous experiments on classifying between AD and healthy speech highlight the need for bigger datasets. For example, Andrade De Oliveira et al [1] used normative data to identify AD patients, albeit on neuroimaging data. We take a similar approach and explore combining normative data with existing speech and text transcripts collected on participants with AD in a picture description task. By combining synthetic sampling and normative data, we obtain state-of-the-art results on the DementiaBank dataset.
Here, we combine a dataset containing AD participants, DementiaBank, with each of two normative datasets, consisting of only healthy participants, from the Wisconsin Longitudinal Study and the Talk2Me project. Table 1 shows demographics for these datasets, each of which employs the same ‘Cookie Theft’ picture description task from the Boston Diagnostic Aphasia Examination [8].
2.1 DementiaBank (DB)
In DB, which is part of the TalkBank project [15], each participant was above 44 years old and had at least 7 years of education. Participants also had no history of nervous system disorders, had an initial Mini-Mental State Exam (MMSE) score of 10 or greater, and were able to give informed consent [3]. Each participant was assigned to either the ‘Dementia’ group (N = 167) or the ‘Control’ group (N = 97) based on their medical histories and an extensive neuropsychological and physical assessment battery. Additionally, since many subjects repeated their engagement at yearly intervals (up to five years), we use 240 samples from those in the ‘Dementia’ group, and 233 from those in the ‘Control’ group. Each speech sample was recorded and manually transcribed at the word level following the CHAT protocol [14]. Narratives were segmented into utterances and annotated with filled pauses, paraphasias, and unintelligible words.
2.2 Talk2Me (T2M)
Talk2Me is an online language assessment from the University of Toronto. It consists of seven tasks, including picture descriptions, story retellings, word-colour Stroop, fluency tasks, and self-reported evaluation of mood. The tasks are performed online, with participants entering their answers through text or through speech recordings. Answers to the picture description task are collected as audio recordings. Participants are shown a random picture during each session, including pictures from Flickr, the Webber Photo Cards: Story Starters collection [24], and Cookie Theft. Crucially, unlike DB, no human-produced transcripts are included. We therefore apply the Kaldi open-source automatic speech recognition (ASR) engine [19], using a long short-term memory network with i-Vector input [22] and a reverberation model, trained on the Fisher data [4]. Our ad hoc evaluation of a random portion of these data suggests a word-error rate of approximately 12.5%.
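For reference, word-error rate is conventionally the word-level edit distance between an ASR hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that computation; the function name and the example sentences are illustrative assumptions and are not taken from the T2M data or from the evaluation described above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented toy example: one substituted word out of eight reference words.
reference = "the boy is taking cookies from the jar"
hypothesis = "the boy is taking cookie from the jar"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.3f}")
```

In practice, the rate would be aggregated over all evaluated recordings (total edits over total reference words) rather than computed on a single sentence.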
2.3 Wisconsin Longitudinal Study (WLS)
The second normative dataset is the Wisconsin Longitudinal Study (WLS), which has been recorded over several decades on a 1/3 random sample of all Wisconsin high school graduates in 1957 (N = 10,317), born between 1938 and 1940 [11]. Survey data were collected from the original respondents or their parents in 1957, 1964, 1975, 1992, 2004, and 2011, and participants performed the ‘Cookie Theft’ picture description task in the 2011 survey. Only the audio was retained from that survey, so we apply the same Kaldi-based ASR engine to these data as we do in Section 2.2.
Table 1: Demographics for the three data sets for patients with AD and controls (CT). Years are indicated by their means and standard deviations.
From the picture description transcripts, we extract 567 features, including various lexical features (e.g., mean number of syllables per word, mean word length, various parts-of-speech and phrase type counts and ratios), and syntactic features (e.g., ratios of various context-free grammatical constructions, and the total number of T-units). We also compute vocabulary similarity with the cosine distance between words. Finally, we compute various subjective measures, including the Flesch-Kincaid score for readability [12], LIWC psycholinguistic features [18], and valence from the Stanford Sentiment Analyzer [21].
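For concreteness, the Flesch-Kincaid grade level in its standard form [12] depends only on word, sentence, and syllable counts. How sentences and syllables are counted in automatically transcribed speech is an implementation choice that is not specified here.

```latex
\[
\mathrm{FKGL} = 0.39\,\frac{N_{\mathrm{words}}}{N_{\mathrm{sentences}}}
              + 11.8\,\frac{N_{\mathrm{syllables}}}{N_{\mathrm{words}}}
              - 15.59
\]
```

Higher values correspond to text that is nominally harder to read, so shorter utterances and shorter words both lower the score.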
We then perform a one-way ANOVA and retain the features with p-values below 0.005, a threshold set empirically. In the binary classification task, this selects 142 features with DB only, 311 features with DB + WLS, and 364 features with DB + T2M. In the multi-class classification task, we use 174 features with DB only, 293 features with DB + WLS, and 361 features with DB + T2M. A subset of the top features identified by the ANOVA test for differentiating between CT and AD participants is presented in Table 2.
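The selection step can be illustrated with a small sketch of the one-way ANOVA F statistic computed per feature. The feature names and values below are invented for illustration only. The sketch ranks features by F rather than thresholding on p, since converting F to a p-value requires the F distribution (for instance, `scipy.stats.f_oneway` returns both quantities).

```python
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F statistic; `groups` holds one list of values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = fmean(x for g in groups for x in g)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf")  # classes are perfectly separated with no spread
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical feature values per class (not from DB, WLS, or T2M).
features = {
    "mean_word_length": {"CT": [4.1, 4.3, 4.2, 4.4], "AD": [3.6, 3.8, 3.7, 3.5]},
    "filler_ratio": {"CT": [0.10, 0.30, 0.20, 0.25], "AD": [0.15, 0.28, 0.22, 0.24]},
}
scores = {name: anova_f(list(by_class.values())) for name, by_class in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: F = {scores[name]:.2f}")
```

The same function applies unchanged to the three-class case (Moderate, Mild, CT), since it accepts any number of groups.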
Table 2: Top features for DB, DB + T2M, and DB + WLS, following a one-way ANOVA test for differentiating between AD vs CT participants.
We combine DB data with each normative dataset, WLS or T2M, in turn. To avoid bias introduced by class imbalance, we oversample the minority class with ADASYN [10]. ADASYN extends methods such as SMOTE by synthesizing points closer to the decision boundary. Data were randomly split 80/20 for training/testing, ensuring each participant’s samples do not occur in both sets. We apply ADASYN on the training set only.
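A participant-level split of this kind can be sketched as follows. This is an assumption about one reasonable way to do it, not a description of the exact procedure used: participants (rather than samples) are shuffled and 20% of them are held out, so repeated visits by the same person always land on the same side. The tuple layout and identifiers are hypothetical.

```python
import random


def split_by_participant(samples, test_fraction=0.2, seed=0):
    """Split (participant_id, features, label) tuples without participant overlap."""
    participants = sorted({pid for pid, _, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(participants)
    n_test = max(1, round(test_fraction * len(participants)))
    test_ids = set(participants[:n_test])
    train = [s for s in samples if s[0] not in test_ids]
    test = [s for s in samples if s[0] in test_ids]
    return train, test


# Hypothetical data: 10 participants with two visits each.
samples = [
    (f"p{i:02d}", [float(i), float(visit)], "AD" if i < 3 else "CT")
    for i in range(10)
    for visit in range(2)
]
train, test = split_by_participant(samples)
print(len(train), "training samples;", len(test), "test samples")
```

Oversampling would then be fitted on `train` only (for example with the `ADASYN` class from the imbalanced-learn package and its `fit_resample` method), leaving `test` untouched so that no synthetic points influence evaluation.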
We consider a random forest (with 100 trees), a gradient boosting classifier (with 100 estimators), an SVM (with a radial basis kernel), and a four-layer DNN (trained using Adam for 100 epochs with a batch size of 100). First, we look at binary classification of CT vs AD. We then further split the AD group into two categories, Mild and Moderate, given MMSE scores above and below 10, respectively. Results for multi-class and binary classification are presented in Tables 3 and 4 below. We report the F1 averages. The macro average assigns equal weight to each class, whereas the micro average accounts for the frequency of each class.
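The difference between the two averages can be made concrete with a short sketch. The labels below are invented: a classifier that predicts the majority class for every sample on an imbalanced test set obtains a high micro F1 but a low macro F1.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 over counts pooled across classes.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Invented imbalanced example: 18 CT and 2 AD samples, all predicted as CT.
y_true = ["CT"] * 18 + ["AD"] * 2
y_pred = ["CT"] * 20
macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

Here the micro average is 0.9 even though no AD sample is recognized, which is why the macro average is the more informative figure when a normative dataset makes one class much larger than the others.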
Table 3: Moderate vs Mild vs CT. The three highest F1 macro scores are shown in bold.
Table 4: AD vs CT. The three highest F1 macro scores are shown in bold.
Effectively monitoring and assessing the linguistic symptoms of dementia automatically will have major potential impacts on health care. Among these is the ability to remotely assess cognitive function in mobility-reduced (and rural) individuals, which would considerably lessen the burden on healthcare workers. Clearly, using normative data greatly improves classification accuracy, and these improvements are generally maintained through class-balancing with ADASYN, although a Kruskal-Wallis test does not find statistical significance. However, an n-way ANOVA reveals significant main effects of model and task (binary vs trinary), as well as interaction effects between model and task (at the 0.05 level) and between database and task, with database and oversampling as covariates. While oversampling does not typically improve estimates, it is important for verification, due to the massive class imbalance otherwise.
Considering the binary task in Table 4, it is clear that adding WLS, rather than T2M, improves performance the most. There are two possible explanations for this: first, the demographics of WLS more closely resemble those of DB than do those of T2M (especially in terms of age) and, second, T2M also includes some picture descriptions of images other than the Cookie Theft. Indeed, the features selected (as shown in Table 2) indicate what sets these normative data sets apart. While both data sets reveal similar differences to speakers with AD in the DB data set, in terms of grammatical features and semantic similarity, the latter is amplified in WLS, and T2M reveals more subjective or psycholinguistic differences. Interestingly, many lexical features (found to be indicative in previous work that only used the DB data set [6]) ceased to be important when including the normative data. Future work may reveal whether differences in grammatical construction are indicative of slight cultural differences.
Because of the relatively small size of the DB dataset, we resorted to extensive feature engineering, and extracted a total of 567 features. The small amount of data also limited the performance of the DNN, since it only achieved comparable results when supplemented with a normative dataset.
As expected, classification is easier in the binary case than the trinary case, especially due to relatively minor differences between speakers with mild versus moderate cognitive impairment. The very high micro F1-scores on the augmented datasets can be misleading without context, due to the high class imbalance. Selecting appropriate evaluation metrics, especially in this context, is paramount.
Ongoing work is focused on augmenting the T2M dataset, by collecting data from individuals with cognitive decline, and by also introducing a telephone-based interface. We are currently applying transfer- and multiview-learning on these and related data sets of pathological speech.
The Wisconsin Longitudinal Study is sponsored by the National Institute on Aging (grant numbers R01AG009775, R01AG033285, and R01AG041868), and was conducted by the University of Wisconsin.
[1] Ailton Andrade De Oliveira, Maria Teresa Carthery-Goulart, Pedro Paulo De Magalhães Oliveira Júnior, Daniel Carneiro Carrettiero, and João Ricardo Sato. Defining multivariate normative rules for healthy aging using neuroimaging and machine learning: An application to Alzheimer’s disease. Journal of Alzheimer’s Disease, 43(1):201–212, 2014.
[2] Alzheimer’s Association. 2017 Alzheimer’s disease facts and figures. Alzheimer’s & dementia : the journal of the Alzheimer’s Association, 13(4):325–373, April 2017.
[3] James T Becker, Francois Boller, Oscar I Lopez, Judith Saxton, and Karen L McGonigle. The Natural History of Alzheimer’s Disease: Description of Study Cohort and Accuracy of Diagnosis. Archives of Neurology, 51(6):585–594, 1994.
[4] Christopher Cieri et al. Fisher english training speech parts 1 and 2 transcripts ldc2004t19 and ldc2005t19. https://catalog.ldc.upenn.edu/ldc2005t19, Linguistic Data Consortium, 2005.
[5] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. Mini-mental state: A practical method for grading the cognitive state of patients for the clinician. Journal of Psychiatric Research, 12(3):189–98, Nov 1975.
[6] Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease, 49(2):407–422, 2015.
[7] Kathleen C Fraser, Frank Rudzicz, and Elizabeth Rochon. Using text and acoustic features to diagnose progressive aphasia and its subtypes. In Proceedings of Interspeech 2013, pages 2177–2181, Lyon, France, Aug 2013.
[8] Harold Goodglass and Edith Kaplan. Boston Diagnostic Aphasia Examination, 1983.
[9] Curry Guinn and Anthony Habash. Language analysis of speakers with dementia of the Alzheimer’s type. In AAAI Fall Symposium: Artificial Intelligence for Gerontechnology, pages 8–13, 2012.
[10] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE, June 2008.
[11] Pamela Herd, Deborah Carr, and Carol Roan. Cohort Profile: Wisconsin longitudinal study (WLS). International Journal of Epidemiology, 43:34–41, 2014.
[12] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. Derivation of new readability formulas (automated readability index, FOG count and Flesch reading ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command Millington TN Research Branch, 1975.
[13] Xiaofei Lu. Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4):474–496, 2010.
[14] Brian MacWhinney. The CHILDES project: Tools for analyzing talk. Child Language Teaching and Therapy, 8(2):217–218, 1992.
[15] Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. AphasiaBank: Methods for studying discourse. Aphasiology, 25(11):1286–1307, 2011.
[16] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
[17] Juan José G Meilán, Francisco Martínez-Sánchez, Juan Carro, Dolores E López, Lymarie Millian-Morell, and José M Arana. Speech in Alzheimer’s disease: Can temporal and acoustic parameters discriminate dementia? Dementia and Geriatric Cognitive Disorders, 37:327–334, 2014.
[18] James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. The Development and Psychometric Properties of LIWC2015. https://repositories.lib.utexas.edu/handle/2152/31333, Sep 2015.
[19] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukás Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 1–4, 2011.
[20] Vassiliki Rentoumi, Ladan Raoufian, Samrah Ahmed, Celeste de Jager, and Peter Garrard. Features and machine learning classification of connected speech samples from patients with autopsy proven Alzheimer’s disease with and without additional vascular pathology. Journal of Alzheimer’s disease, 42(3):3–17, 2014.
[21] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, 2013.
[22] Pulkit Verma and Pradip K Das. i-Vectors in speech processing applications: A survey. International Journal of Speech Technology, 18(4):529–546, 2015.
[23] Sebastian Wankerl, Elmar Nöth, and Stefan Evert. An N-gram based approach to the automatic diagnosis of Alzheimer’s disease from spoken language. In Interspeech 2017, pages 3162–3166. ISCA, Aug 2017.
[24] Sharon G Webber. Webber photo cards: Story starters, 2005.
For the top syntactic features, the T-units are extracted using the Lu Syntactic Complexity analyzer [13], and the grammatical constituents are extracted from parse trees generated by the Stanford Parser [16]. The semantic similarity measures consist of various metrics related to the cosine measure of similarity taken on each pair of utterances. For the top subjective measures, we consider reading norms, psycholinguistic features, and sentiment analysis. Reading norms were derived from the Flesch reading-ease score and Flesch-Kincaid grade level formula [12]. Psycholinguistic measures were derived from LIWC [18] and Receptiviti. Measures for negative polarity are extracted from the Stanford Sentiment Analyzer [16].
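As an illustration of the pairwise similarity measures, the sketch below computes the cosine similarity for every pair of utterances in a transcript and summarizes the values. It assumes a simple bag-of-words count vector per utterance; the vector representation actually used is not specified here, and the utterances are invented.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(utt_a: str, utt_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two utterances."""
    a = Counter(utt_a.lower().split())
    b = Counter(utt_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented utterances; a repeated utterance yields a similarity of 1.
utterances = [
    "the boy is on the stool",
    "the boy is on the stool",
    "the sink is overflowing",
]
pairwise = [cosine_similarity(u, v) for u, v in combinations(utterances, 2)]
summary = {
    "mean": sum(pairwise) / len(pairwise),
    "min": min(pairwise),
    "max": max(pairwise),
}
print(summary)
```

Summary statistics such as these turn a variable number of utterance pairs into a fixed-length set of features per transcript; high values indicate repetitive content.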
Table 5: Lexical features
Table 6: Syntactic features
Table 7: Semantic similarity measures
Table 8: Subjective measures
The random forest classifier fits 100 decision trees and considers a subset of the features when looking for the best split. Each decision tree in the ensemble is built from a sample drawn with replacement from the training set.
The DNN consists of four layers of 512 units. The tanh activation function is used at each hidden layer, and a dropout of 0.1 is applied after each hidden layer. The DNN is trained using Adam for 100 epochs with a batch size of 100, a learning rate of 0.1, and the cross-entropy loss.
The receiver operating characteristic curve plots the true positive rate vs the false positive rate for different decision thresholds. The area under the curve (AUC) for each experiment is displayed in the tables below.
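To make the definition concrete, the following minimal sketch sweeps the decision threshold over the predicted scores, collects the resulting (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The labels and scores are invented, and the function name is an illustrative assumption.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve; y_true uses 1 for the positive class, 0 otherwise."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Consume all samples tied at this score before emitting a ROC point.
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration over consecutive ROC points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Invented scores for the positive (e.g., AD) class.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

For a one-vs-all evaluation of a multi-class problem, the same function can be applied per class by setting `y_true` to 1 for the class of interest and 0 for all others, and passing the predicted score for that class.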
Table 9: AUC values for binary classification. AUC values for the AD class are computed.
Table 10: AUC values for the multi-class problem. AUC values are computed in a one-vs-all approach. Values for the undersampled classes Moderate and Mild are reported.