Socially assistive robotics (SAR) is a promising new subfield of human-robot interaction (HRI), with a focus on developing intelligent robots that provide assistance through social interaction (1, 2). As overviewed in this journal (3), researchers have been exploring SAR as a means of providing accessible, affordable, and personalized interventions to complement human care. However, HRI methods are still limited in their ability to autonomously perceive, interpret, and naturally respond to behavioral cues from atypical users in everyday contexts. This hinders the ability of SAR interventions to be tailored toward the specific needs of each user (4,5).
These HRI challenges are not only amplified in the context of SAR for individuals with autism spectrum disorders (ASD), but ASD is also the context where SAR is especially promising. ASD is a developmental disability characterized by difficulties in social communication and interaction. About 1 in 160 children worldwide are diagnosed with ASD (6), with a higher rate of 1 in 59 children in the United States (7). Therapists offer individualized services for helping children with ASD to develop social skills through games or storytelling (8), but such services are not universally accessible or affordable. To this end, researchers have been actively exploring SAR for children with ASD (9). Several short-term studies have already shown SAR to support learning in ASD users (10). Moreover, in this journal, Scassellati et al. (11) reported on a long-term, in-home SAR intervention that helped children with ASD to improve social skills such as perspective-taking and joint attention with adults.
SAR systems must engage users to be effective. Robot perception of user engagement is a key HRI capability that makes it possible for the robot to achieve the specified goals of the interaction and, in the SAR context, the intervention (13). Past work has used rule-based methods to approximate the engagement of children with ASD. For instance, Kim et al. (14) indirectly assessed engagement by estimating emotional states from audio data. Esteban et al. (13) gauged
Fig. 1. Long-term, real-world SAR intervention setup. In this month-long, in-home study, child participants with ASD played math games on a touchscreen tablet, while a socially assistive robot used multimodal data to provide personalized feedback and instruction (12).
engagement by using the frequency of measured variables, such as how many times the child looked at the robot. More recently in this journal, Rudovic et al. (4) used supervised machine learning (ML) to model engagement in a single-session laboratory study. The study focused on developing post hoc personalized models using deep neural networks, and achieved an average agreement of 60% with human annotations of engagement on a continuous scale from -1 to +1.
This article addresses the feasibility and challenges of applying supervised ML methods to model user engagement in long-term, real-world human-robot interactions, with a focus on SAR interventions for children with ASD. The contributions of this work differ from past work in two key aspects.
First, the methods and results are based on data from month-long, in-home SAR interventions with 7 child participants with ASD. Whereas single-session and short-term studies of SAR for ASD are numerous (15), the work by Scassellati et al. (11) in this journal is the only other long-term, in-home SAR for ASD study conducted to date. Long-term, in-home studies and data collections are important for many reasons: They more realistically represent real-world learning environments, they provide more opportunities for the user to learn and interact with the robot, and they produce more relevant training datasets (16). Furthermore, long-term, in-home settings present new modeling challenges, given the significantly larger quantity and variance in user data.
Second, this work emphasizes engagement models that are practical for online use in human-robot interactions. With a supervised ML approach, models require labeled data for training, which are often expensive or unfeasible to obtain. Previous works report on models trained and tested on randomly sampled subsets of each participant’s data (4). However, that approach is impractical for online use if labeled training data for a given user are obtained chronologically after the testing data. In contrast, this work presents, for each user: (i) generalized models trained on data from different users; and (ii) individualized models trained on an early subset of the user’s data. As detailed in Materials and Methods, models were developed for different numbers of training users in generalized models and varying sizes of early subsets in individualized models. An early subset of the data is defined as the first X% of a user’s data sorted chronologically. Furthermore, this work also analyzes the temporal structure of model predictions to examine the possibility of initiating re-engagement actions at appropriate times.
The presented engagement models are trained on data from month-long, in-home SAR interventions. During interventions, child participants with ASD played space-themed math games on a touchscreen tablet while a Stewart platform robot named Kiwi provided verbal and expressive feedback (17). The robot’s feedback and instruction challenge levels were personalized to each user’s unique learning patterns with reinforcement learning over the month-long intervention. All participants showed improvements in reasoning skills and long-term retention of intervention content (18,19). Over the month-long data collection, we collected an average of 3 hours of multimodal data across multiple sessions for each participant, including video, audio, and performance on the games. As Figure 1 shows, a USB camera mounted on top of the game tablet recorded a front view of the user. Visual and audio features were extracted from the camera data and performance features were derived from the answers to game questions recorded on the tablet. As detailed in Materials and Methods, the open-source data processing tools used to extract these features are appropriate for online use in HRI contexts (20–22).
This work frames engagement modeling as a binary classification problem, similar to most previous relevant works (4). Participants were annotated as engaged or disengaged in each camera frame using standard definitions of engagement as a combination of behavioral, affective, and cognitive constructs (9). A participant was considered to be engaged when paying full attention to the interaction, immediately responding to the robot’s prompts, or seeking further guidance from others in the room. The binary labels simplify the representation of participants’ behavior, which may vary in degree of engagement and disengagement (23). However, temporal patterns in binary labels can provide additional context (18); therefore, this work also analyzes the length of time a participant is continuously engaged and disengaged. Trained annotators labeled engagement, and an inter-rater reliability of k = 0.84 (Fleiss’ Kappa) was achieved. The Materials and Methods section provides additional details about the data and the annotation process.
This article focuses on post hoc models of user engagement based on data from month-long, in-home SAR interventions. The presented approaches are suitable for online perception of engagement, and are intended to inform the design of more engaging and personalized HRI. The contributions of this work especially aim to improve SAR’s effectiveness in supporting learning by children with ASD.
This work presents two types of supervised ML models of user engagement in long-term, real-world HRI intended for online implementation: (i) generalized models trained on data from different users; and (ii) individualized models trained on data from early subsets of the users’ interventions. On average, these models achieved area under the receiver operating characteristic (AUROC) values of approximately 90%. AUROC is a commonly-used ML metric for binary classification problems; specifically, it measures the probability that the models would rank a randomly chosen engaged instance higher than a randomly chosen disengaged instance (24). In order to evaluate these two approaches, we also implemented models trained on random samples of all user data. Random sampling yielded significantly higher recall for disengagement compared to generalized and individualized models. This is likely because the month-long, in-home setting led to a large variance in both engagement states and recorded data. Variance in data manifested not only across participants but also within each participant, highlighting an important characteristic of real-world HRI in the ASD context. Despite the lower recall and higher variance for disengagement, temporal patterns in model predictions can be used to reliably initiate re-engagement actions at appropriate times.
Observed User Engagement
Over the course of the month-long, in-home intervention, participants were engaged an average of 65% of the time during the child-robot interactions. However, engagement varied considerably across participants and for each participant, as shown in Figure 2. Average engagement for participants ranged from 48% to 84%, with a standard deviation of 14%. Analyzing each participant’s engagement chronologically over 10% increments also showed a standard deviation of 15%. Moreover, all participants had a significant (p < 0.01) decrease in engagement over the month-long intervention, as determined by a regression t-test and shown by the plotted
Fig. 2. Engagement by participant. A significant variance in engagement was observed across participants (A) and for each participant (B). A decreasing trend (p < 0.01) in engagement was also observed over the month-long intervention (B), indicating the need for online engagement recognition and response.
trend line. For example, Participant 2 was engaged 82% of the time in the first 10% and only 19% of the time in the last 10% of the month-long intervention.
This substantial variance in user engagement over the course of a long-term, real-world study indicates the need for online recognition of and response to disengagement. This study observed higher engagement for all participants shortly after the robot had spoken. Specifically, participants were engaged about 70% of the time when the robot had spoken in the previous minute, but less than 50% of the time when the robot had not spoken for over a minute. This validates the use of appropriately-timed robot speech as a tool for eliciting and maintaining user engagement.
Generalized and Individualized Model Results
This work presents generalized and individualized models of user engagement using data from long-term, in-home SAR interventions. As detailed in Materials and Methods, generalized models were developed by training on data from a given subset of users and then testing on different users. Individualized models were developed by sorting a user’s data chronologically
Fig. 3. Model results. Generalized models trained on different users and individualized models trained on early subsets of the intervention achieved comparable AUROC to models trained on random samples of all users’ data (A), but had much lower recall for disengagement (B).
and using an early subset for training and later subset for testing the model. We designed these two approaches to be feasible for online use in HRI; the labeled data required for supervised ML models are practical to obtain in both cases. In order to evaluate these approaches, models trained on random samples of all users’ data were also implemented, despite random sampling not being feasible for online use.
As shown in Figure 3A, generalized and individualized models achieved approximately 90% AUROC. For generalized models, the number of training users had little effect on AUROC; models trained on six users resulted in only a 3% improvement in AUROC over models trained on one user. On the other hand, individualized models performed better with additional data; models trained on the first 10% of a user’s data only achieved 77% AUROC, whereas models trained on the first 50% of a user’s data achieved 87% AUROC. Overall, both generalized and individualized models achieved comparable AUROC to models trained on random samples, which obtained 90% AUROC by training on as little as 30% of data across all users.
However, generalized and individualized models differ from models trained on random samples when considering other ML evaluation metrics such as precision and recall. For a given class (engagement or disengagement), precision measures the proportion of predictions of the class that are correct, and recall measures the proportion of actual instances of the class that are predicted correctly. As Figure 3B shows, there is an especially large difference between models in recall for disengagement. On average, training on random samples resulted in 82% recall for disengagement, whereas training on different users and early subsets resulted in only 61% and 45% recall, respectively. This indicates that generalized and individualized models would produce a high number of false negatives for detecting disengagement if implemented online in HRI. Supplementary Tables S4, S5, and S6 contain detailed model results for all approaches and evaluation metrics.
Variance in Data Across Users, Sessions, and Engagement States
This work’s long-term, real-world setting resulted in significantly different means and variances of data across participants, sessions, and engagement states, as shown in Figure 4. The figure compresses recorded data with high face detection confidence to two dimensions using principal component analysis (PCA), a commonly used unsupervised dimensionality reduction technique. Plotting compressed data reveals limited overlap between two participants (Figure 4A) and two sessions from the same participant (Figure 4B). Additionally, Figure 4C shows a higher variance in data when participants are disengaged, which may explain the low recall values for disengagement reported in the previous section. Supplementary Figures S1 and S2 show similar visualizations for all participants and all sessions for the same participant in Figure 4B.
Statistical analysis confirmed that both the means and variances of features differed significantly (p < 0.01) across participants, sessions, and engagement states. We used a one-way analysis of variance (ANOVA) to test differences in means, and an F-test for differences in variance. Tests were performed on the principal components of all data.
Fig. 4. Comparing data across participants, sessions, and engagement states. A visualization of limited overlap in data of two participants (A), two sessions from the same user (B), and across engagement states (C). Higher variance in data also observed when users are disengaged (C). Visualized data are those with high face detection confidence, compressed to two dimensions using principal component analysis (PCA); statistically significant (p < 0.01) differences in means and variances determined using the complete dataset.
Fig. 5. Sequences of engagement and disengagement. Median duration of engagement sequences (11.0s) is significantly (p < 0.01) higher than the median duration of disengagement sequences (4.0s). Despite prevalence of shorter sequences, disengagement sequences longer than the upper quartile (9.5s) accounted for 75% of the total time users were disengaged.
Fig. 6. Re-engagement strategy results. Post hoc analysis of strategy to re-engage users if predicted engagement probability is less than a threshold on average over a window. As shown by the orange line in (A) and (B), a threshold of 0.35 and window of 3s would have re-engaged users in 73% of long disengagement sequences but also 15% of engagement sequences. Varying window lengths (A) and thresholds (B) illustrates the trade-off between maximizing re-engagement in disengaged sequences and minimizing re-engagement in engaged sequences. Results based on generalized models trained on six participants.
Detecting Disengagement Sequences Using Temporal Patterns
This study demonstrates the importance and feasibility of detecting longer sequences of disengagement using the temporal structure of model predictions. Engagement sequences (ES) are periods in the interaction when the user is continuously engaged, while disengagement sequences (DS) are periods when the user is continuously disengaged. As Figure 5 shows, the duration of ES had an interquartile range of 5.0 to 27.0 seconds (s) while the duration of DS had an interquartile range of 2.5s to 9.5s. This work defines long DS as having a duration greater than the upper quartile (9.5s) and short DS as having a duration less than the lower quartile (2.5s). Long DS accounted for 75% of the total time users were disengaged, whereas short DS accounted for only 5% of the total time users were disengaged. This suggests that re-engagement strategies should focus on long DS despite the presence of many shorter sequences.
The results and insights from the data suggest the following strategy for determining when to initiate re-engagement actions (RA): (i) average a model’s predicted probability of engagement over a given window, and then (ii) initiate RA if the engagement probability is less than a given threshold. This approach should maximize long DS with RA, and minimize the percentage of ES with RA. Other considerations include the percentage of short DS with RA, the median duration of DS with RA, and the median elapsed time in DS before RA. The window length and threshold will affect these evaluation metrics so the choices for these parameters should depend on the intervention design and implemented RA.
Figure 6 presents a post hoc analysis of the proposed re-engagement strategy. For example, suppose this study initiated RA if the predicted engagement probability was less than 0.35 on average for a 3s window. This approach would have led to RA in 73% of long DS. The median duration of DS with RA would have been 25s, and RA would have occurred 2.5s into these sequences. However, RA would also have occurred in 5% of short DS and 15% of ES.
Varying the window lengths and thresholds highlights the trade-off between maximizing RA in DS and minimizing RA in ES. The window length was negatively correlated with the percentage of long DS () with RA for a fixed threshold, as shown in Figure 6A. On the other hand, the threshold was positively correlated with the percentage of long DS () with RA for a fixed window length, as shown in Figure 6B. The reported results are based on generalized models trained on six users. Supplementary Tables S7 and S8 contain results for both generalized and individualized models with additional window length and threshold combinations.
Different Modalities and Model Types
Over the month-long, in-home SAR interventions, we collected a rich multimodal dataset from which we derived visual, audio, and game performance features to model engagement. To assess each modality’s importance, we created separate models using each feature group. As Figure 7A shows, all modalities together outperformed each individual modality. However, models created using only visual features outperformed those created using audio or game performance features by about 20% AUROC. These results support related work in this journal (4) that also found visual features as the most significant but multiple modalities as complimentary.
Moreover, analyzing individual features revealed that the results of this work could largely be replicated using only seven key features. Feature analysis was performed using Pearson’s correlation coefficient (r), and key features were determined using a threshold of |r| > 0.20. The key features are: the elapsed time in a session, the number of people in the environment, the direction of the user’s eye gaze, the distance from the camera to the user, the elapsed time since the robot last talked, the count of incorrect responses to game questions, and the confidence value with which the user’s face is being detected in the camera frame. Models using only these seven key features achieved AUROC values within 5% of the results described above.
Additionally, this work explored several supervised ML model types, but found tree-based models to be the most successful. The following conventional model types were considered: Naive Bayes, K-Nearest Neighbors, Support Vector Machines, Neural Networks, Logistic Regression, Random Decision Tree Forests, and Gradient Boosted Decision Trees. Of these, Gradient Boosted Decision Trees had the highest AUROC, as shown in Figure 7B, and are the basis for model evaluation metrics reported in previous sections. We also explored sequential models, such as Hidden Markov Models, Conditional Random Fields, and Recurrent Neural Networks, but found these to be less effective than conventional static models.
Fig. 7. Results across different modalities and model types. All modalities together outperformed each modality separately, but visual features were the most significant (A). Tree-based models were the most successful among conventional supervised ML model types (B).
A few alternative modeling approaches were considered as well. Interestingly, an ensemble of generalized and individualized models did not lead to better results than those approaches applied separately. Other explored approaches included: (i) rule-based models with the key features, (ii) deep neural networks with re-weighting techniques (25), and (iii) synthetic oversampling of the disengaged class (26). None of these approaches outperformed Gradient Boosted Decision Trees.
Robot perception of user engagement is a crucial HRI capability previously unexplored in the context of long-term, in-home SAR interventions for children with ASD. This study is the first to model engagement in this complex setting, and it also differs from previous work by developing supervised ML models intended for real-world deployments. The discussion below focuses on how this work highlights the feasibility and challenges of online recognition and response to disengagement in long-term, in-home SAR contexts. These contributions aim to inform the design of more engaging and personalized HRI, and improve SAR’s effectiveness in augmenting learning by children with ASD.
Feasibility of Online Closed-Loop Implementation
This study presents supervised ML models that are feasible for use in online robot perception of and closed-loop response to disengagement. We developed two types of models for each user: (i) generalized models trained on data from different users; and (ii) individualized models trained on data from early subsets of users’ interventions. The visual, audio, and performance features along with the labeled training data used in these models can be obtained in online deployments, as discussed further in Materials and Methods. Generalized and individualized models achieved approximately 90% AUROC in this work’s post hoc analysis. Individualized models performed better with additional data, likely because participants had higher engagement in early subsets used for training. Overall, the similar performance of these approaches indicates the possibility of having one model for multiple users. Generalized and individualized models also attained comparable AUROC to models trained on random samples of all participants’ data, which is an ideal but impractical method.
A shortcoming of generalized and individualized models is revealed through a 50% false negative rate for detecting disengagement. This effect is likely due to the higher variance in features when participants were disengaged compared to when they were engaged. Despite the low recall values, a SAR system that accurately recognizes some instances of disengagement can still considerably enhance the interaction by attempting to re-engage the user at appropriate times. Analyzing disengagement sequences (DS), or periods in the interaction when the user is continuously disengaged, shows that 75% of the total time participants were disengaged occurred during DS that were 10 seconds long or longer. Some examples of participant behavior during these long DS included playing with toys, interacting with siblings, and even abruptly leaving the intervention setting. Shorter DS typically involved brief shifts in participant focus to other aspects of the environment. Moreover, most long DS required caregivers to re-engage participants, whereas participants re-engaged on their own in short DS. This suggests that re-engagement strategies should focus on counteracting longer DS.
This work’s post hoc analysis shows generalized and individualized models could be used to reliably initiate re-engagement actions (RA) during long DS. The presented re-engagement strategy would have initiated RA if the average predicted engagement probability over a window of time fell under a set threshold. Using a 0.35 threshold and a 3-second window would have resulted in RA for about 75% of long DS, with the first RA occurring 2.5 seconds into these sequences on average; however, RA would also have been erroneously initiated in 15% of engagement sequences (ES). An exploration of various window lengths and thresholds reveals the trade-off between maximizing RA in DS and minimizing RA in ES. This balance is important for maintaining RA effectiveness, and choices for these parameters should depend on the implemented RA and overall intervention design.
The presented models are also readily interpretable, an important characteristic for facilitating implementation. Interpretability of ML is especially important in the ASD context, where therapists and caregivers need an understanding of the system’s behavior in order to trust and adopt it (4). The described models achieved interpretability in two ways: (i) through a simplified feature set and (ii) through the selected model types. First, this work replicated the described model accuracies to within 5% using seven key features, as described in Results. This shows the problem of modeling engagement to be more tractable and provides insights for the design of future HRI studies in similar contexts. Second, we found tree-based ML models to be the most effective. Such methods are comparatively easier to train and interpret compared to more complex ML models (27). Nevertheless, it is unknown whether the effectiveness of tree-based models in this work would generalize to other contexts. The interpretability of this work further demonstrates the feasibility of applying supervised ML to model user engagement online in closed-loop HRI.
Challenges of the Real-World In-Home Context
A long-term, real-world HRI setting raises many modeling challenges due to the significant noise and variance in data. The unconstrained home environment in particular presented several unforeseen problems. First, the camera was attached to the top of the game tablet, but it’s position was frequently disturbed by both caregivers and child participants. For example, some caregivers temporarily turned the camera towards the ground or toward the robot when child participants were taking a break, instead of using the system’s power switch as instructed. As a result, the camera position varied throughout the study adding noise to the extracted visual features. The audio data in this study also contained a high level of background noise, including sounds from television, pets, kitchen appliances, and lawn mowers. All participating families chose to place the SAR system in their living rooms, so external parties such as siblings, friends, and neighbors regularly interrupted sessions. The camera sometimes failed to capture all individuals in the environment, as the system was not designed for multi-party interactions. Finally, the variance in data was higher in this study given participants were children with ASD, who display atypical and highly diverse behaviors (4).
Substantial variance in data leaves supervised ML models vulnerable to overfitting. To mitigate this risk, standard ML practices such as bagging, boosting, and early stopping were used, as discussed below in Materials and Methods. In spite of the challenges of a real-world setting, generalized and individualized models achieved AUROC values around 90% and could reliably initiate re-engagement actions in long sequences of disengagement. However, these models only had 50% recall for disengagement. Further improving the false negative rate is a key area of future work.
Limitations and Future Work
As discussed in the previous section, a key modeling challenge in real-world HRI is the substantially increased variance in data, especially when users are disengaged. The solution to this problem in traditional ML is to obtain a large sample of labeled training data. This is not always feasible in HRI and is especially complex for atypical user populations. Moreover, this challenge is especially acute in the ASD context, where high variance in behaviors manifest not only between individuals but also within each individual.
Active learning (AL) is a promising approach to this challenge, as it seeks to automatically select the most informative instances that need labeling (28). Preliminary work has shown AL to successfully train models of user engagement with a small amount of data (29). However, AL is yet to be validated in long-term, real-world settings, as discussed previously in this journal (30). A future approach could first use supervised ML to train base models on available labeled data from other users or a user’s beginning sessions, as done in this work. Then, AL could be applied to decide when to request a label for unseen data. A therapist or caregiver could provide the labels off-line after sessions, allowing the model to iteratively improve in a long-term setting.
Ultimately, the most important direction for future work is to deploy ML frameworks online in real-world HRI and SAR. Such deployments are critical for understanding how well models recognize disengagement under realistic constraints of noise, uncertainty, and variance in data. When implemented online, these models could inform the activation of robot re-engagement actions; specifically, these could entail verbal and non-verbal robot responses such as socially supportive gestures and phrases (31). Overall, online recognition of and response to disengagement will enable the design of more engaging, personalized, and effective HRI, especially in SAR for the ASD community.
Multimodal Dataset
The presented engagement models are based on data from month-long, in-home SAR interventions with children with ASD. During child-robot interactions, participants played a set of space-themed math games on a touchscreen tablet while a 6 degrees of freedom Stewartplatform robot named Kiwi provided verbal and expressive feedback, as shown in Figure 1. The study was approved by the Institutional Review Board of the University of Southern California (UP-16-0075(v), and we obtained informed consent from the children’s caregivers. The 7 child participants in this work had a clinical diagnosis of ASD in mild to moderate ranges as described in the Diagnostic and Statistical Manual of Mental Disorders (32). Supplementary Table S1 reports the ages and genders of the participants: ages were between 3 years, 11 months and 7 years, 2 months; 3 were female and 4 were male. An earlier article provides further details about the SAR system and intervention design, with a focus on how the robot’s feedback and instruction challenge levels were personalized to each user’s unique learning patterns using online reinforcement learning (19).
Over the course of the month-long, in-home study, we collected a large multimodal dataset from which we derived visual, audio, and game performance features to model engagement. Due to numerous technological challenges commonly encountered in noisy real-world studies, this work only considers approximately 21 hours of interaction from 7 participants who had sufficient multimodal data. Participant 4 had the maximum interaction time analyzed (3 hours, 48 minutes), and Participant 6 had the minimum interaction time analyzed (1 hour, 52 minutes). Data collected in individual sessions were truncated to only use the content between the first and last game, because session data often included unstructured interactions before and after the games. Each participant was given a tutorial session as an introduction to the SAR system; data from that session were not included in the analysis.
A USB camera mounted at the top of the game tablet recorded a front view of the participants. Visual and audio features were extracted from this camera data using OpenFace (20), OpenPose (21), and Praat (22), open-source data processing tools feasible for online use in HRI. Visual features derived from OpenFace included: (i) the face detection confidence value, (ii) eye gaze direction, (iii) head position; and (iv) facial expression features. OpenPose was only used to estimate the number of people in the environment, since the camera’s field of view centered on the user’s face. Audio features derived from Praat included pitch, frequency, intensity, and harmonicity. Game performance features were also derived from system recordings and included the challenge level of the game being played, the count of incorrect responses to game questions, and the elapsed time in a session, game, and since the robot last talked. Supplementary Note S1 lists all visual, audio, and game performance features used for modeling engagement.
In this work, a participant was annotated to be engaged or disengaged using standard definitions of engagement as a combination of behavioral, affective, and cognitive constructs (9). Specifically, a participant was considered to be engaged when paying full attention to the interaction, immediately responding to the robot’s prompts, or seeking further guidance from others in the room. The lead author of this work annotated whether a participant was engaged or disengaged for the seven participants. To verify the absence of bias, two additional annotators independently annotated 20% of the data for each participant; inter-rater reliability was measured using Fleiss’ Kappa, and a reliability of k = 0.84 was achieved between the primary and verifying annotators. Supplementary Table S2 contains the specific criteria followed by all annotators.
Modeling Approaches
This work applied and evaluated conventional supervised ML methods to model user engagement in month-long, in-home SAR interventions for children with ASD. First, we applied a few preprocessing techniques to the multimodal dataset described in the previous section to address missing data and possible errors in the fusion of modalities. Although we obtained data for each camera frame at a standard rate of 30 frames per second, this work considered the median value of features and annotations in overlapping one second intervals (i.e., 0 to 1 second, 0.5 to 1.5 seconds, 1 to 2 seconds, etc.). The following features were added for each interval: (i) the variance of continuous-valued features in the interval; and (ii) an indicator for whether discrete-valued features changed in the interval. This also addressed the problem of low OpenFace confidence for detecting the user’s face; low confidence occurred in 38% of camera frames overall, but in only 3 continuous frames on average. Furthermore, all features were standardized to have zero mean and unit variance since raw values were measured on different scales. The means and variances of each feature needed for standardization were obtained with respect to the training set in order to maintain the feasibility of online implementation.
To model user engagement, this work used two supervised ML approaches that are practical for online implementation in closed-loop HRI: (i) generalized models trained on data from different users; and (ii) individualized models trained on data from early subsets of the users’ interventions. Generalized models were implemented by training on data from a given subset of M participants. The models were then tested on the remaining N users not in the training subset. We considered all possible combinations of participants; because there were 7 participants, values of M and N ranged from 1 to 6. Individualized models were developed by sorting a user’s data chronologically and using an early subset for training and later subset for testing the model. In particular, we applied this evaluation to training sets from the first 10% to the first 90% of a user’s data, in increments of 10%. We used this approach to standardize the training sets across differences in participant interaction times; future implementations could use beginning sessions as training data instead. The generalized and individualized approaches are both feasible for online use in HRI deployments; the labeled training data required for supervised ML models can be obtained in both cases. In order to evaluate these approaches, we also implemented models trained and tested on distinct random samples of all users’ data despite this approach being impractical for online use. The proportions of training data evaluated in the random sampling approach also ranged from 10% to 90%, in increments of 10%.
Using the generalized, individualized, and random sampling approaches, this work implemented several supervised ML model types. All considered model types are reported in the Results section; Gradient Boosted Decision Trees were the most successful. Specifically, we implemented Gradient Boosted Decision Trees with early stopping and bagging (33). Boosting algorithms train weak learners sequentially, with each learner trying to correct its predecessor. Early stopping partitions a small percentage of the training data for validation, and ends training when performance on the validation set stops improving. Bagging fits several base classifiers on random subsets of the training data, and then aggregates the individual predictions to form a final prediction. These techniques were adopted to mitigate the increased risk of overfitting in high variance datasets, as was the case in this long-term, in-home study.
We implemented the ML models in Python using the following libraries: Scikit-learn version 0.21.3 (34), XGBoost version 0.90 (33), Hmmlearn version 0.2.1 (35), CRFsuite version 0.12 (36), TensorFlow version 1.15.0 (37), and Keras version 2.2.4 (38). All models were implemented with default hyperparameters, as specified in Supplementary Table S3. We used default hyperparameters because the variance in data caused commonly used strategies such as cross validation and grid search to overfit to the training data. All reported model results are from Scikit-learn implementations, except for Gradient Boosted Decision Trees, which we implemented using XGBoost for improved computational performance. Neural Networks were also implemented in TensorFlow and Keras, and had similar performance to the reported results from Scikit-learn. Sequential models were explored using Hmmlearn, CRFsuite, and Keras.
The authors thank K. Mahajan and L. Mathur for their help with data analysis, K. Peng and J. Keller for their assistance with annotations, and T. Groechel, L. Klein, and R. Pakkar for their advice and support. In addition, the authors are very grateful to G. Ragusa for a key role in the original study design, recruitment, assessments, and more. The entire research team thanks the children and families who participated in the study that generated the dataset.
Funding: This research was supported by the National Science Foundation Expedition in Computing Grant NSF IIS-1139148.
Author contributions: S.J. led this work’s conceptualization and investigation. S.J., B.T., and Z.S. processed the data, implemented the methods, and analyzed the results. C.C. designed the study and managed the deployments that generated the datasets used in this work. M.J.M. was the leading faculty advisor for the overarching study on SAR for ASD, and all conducted research, data collection, and analysis. S.J., B.T., Z.S., and M.J.M. all contributed to the writing of this article.
Competing interests: M.J.M. is a co-founder of Embodied, Inc. but has not been involved with the company since December 2016. C.C. is now a full-time employee of Embodied, Inc. but was not involved with the company while the reported work was done.
Data and materials availability: All data needed to evaluate the conclusions that can be released under USC IRB policies are included in the article or the Supplementary Materials. Please contact S.J. and M.J.M. for questions about other materials.
File S1 (.csv format). Dataset for engagement models. Google Drive Link.
File S2 (.csv format). Descriptions of columns in File S1. Google Drive Link.
Note S1. List of multimodal features.
Figure S1. Comparing data across users.
Figure S2. Comparing data across sessions.
Table S1. Participant demographic information.
Table S2. Engagement annotation criteria.
Table S3. Model hyperparameters.
Table S4. Generalized model results.
Table S5. Individualized model results.
Table S6. Random sampling model results.
Table S7. Re-engagement strategy evaluation using generalized models.
Table S8. Re-engagement strategy evaluation using individualized models.
1. D. Feil-Seifer, M. J. Matari´c, Rehabilitation Robotics, 2005. ICORR 2005. 9th International Conference on (IEEE, 2005), pp. 465–468.
2. M. J. Matari´c, B. Scassellati, Springer Handbook of Robotics (Springer, 2016), pp. 1973– 1994.
3. M. J. Matari´c, Science Robotics 2 (2017).
4. O. Rudovic, J. Lee, M. Dai, B. Schuller, R. W. Picard, Science Robotics 3 (2018).
5. C.-M. Huang, B. Mutlu, Proceedings of the 2014 ACM/IEEE International Conference on Human-robot Interaction, HRI ’14 (ACM, New York, NY, USA, 2014), pp. 57–64.
6. Autism spectrum disorders, Tech. rep., World Health Organization (2018).
7. Data and statistics on autism spectrum disorder, Tech. rep., Centers for Disease Control and Prevention (2019).
8. M. B. Ospina, et al., PloS one 3, e3755 (2008).
9. B. Scassellati, H. Admoni, M. Matari´c, Annual review of biomedical engineering 14, 275 (2012).
10. J. J. Diehl, L. M. Schmitt, M. Villano, C. R. Crowell, Research in autism spectrum disorders 6, 249 (2012).
11. B. Scassellati, et al., Science Robotics 3 (2018).
12. M. O’Brien, K. Tobin, Science nation: Socially assistive robots for children on the autism spectrum.
13. P. Gomez Esteban, et al., Paladyn, Journal of Behavioral Robotics 8, 18 (2017).
14. J. C. Kim, P. Azzi, M. Jeon, A. M. Howard, C. H. Park, 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI) (2017), pp. 39–44.
15. C. Clabaugh, M. Matari´c, Annual Review of Control, Robotics, and Autonomous Systems 2, 33 (2019).
16. K. Dautenhahn, ACM Trans. Hum.-Robot Interact. 7, 4:1 (2018).
17. E. S. Short, M. J. Matari´c, RO-MAN 2016 (2016).
18. C. Clabaugh, et al., 2018 International Symposium on Experimental Robotics (ISER) (2018).
19. C. Clabaugh, et al., Frontiers in Robotics and AI (2019).
20. T. Baltrusaitis, A. Zadeh, Y. C. Lim, L. Morency, 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018) (2018), pp. 59–66.
21. Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, Y. Sheikh, arXiv preprint arXiv:1812.08008 (2018).
22. P. Boersma, Praat, a system for doing phonetics by computer (2002).
23. A. Ben-Youssef, et al., Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI ’17 (ACM, New York, NY, USA, 2017), pp. 464–472.
24. A. P. Bradley, Pattern Recogn. 30, 1145 (1997).
25. M. Ren, W. Zeng, B. Yang, R. Urtasun, CoRR abs/1803.09050 (2018).
26. K. W. Bowyer, N. V. Chawla, L. O. Hall, W. P. Kegelmeyer, CoRR abs/1106.1813 (2011).
27. Y. Li, L. Yang, B. Yang, N. Wang, T. Wu, Neurocomputing 333, 273 (2019).
28. C. Clabaugh, M. Matari, Annual Review of Control, Robotics, and Autonomous Systems 2, 33 (2019).
29. O. Rudovic, M. Zhang, B. W. Schuller, R. W. Picard, CoRR abs/1906.03098 (2019).
30. C. Clabaugh, M. Matari´c, Science Robotics 3 (2018).
31. L. Brown, R. Kerwin, A. M. Howard, 2013 IEEE International Conference on Systems, Man, and Cybernetics (2013), pp. 4360–4365.
32. A. P. Association, et al., Diagnostic and statistical manual of mental disorders (DSM-5 R(American Psychiatric Pub, 2013).
33. T. Chen, C. Guestrin, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16 (ACM, New York, NY, USA, 2016), pp. 785–794.
34. F. Pedregosa, et al., Journal of Machine Learning Research 12, 2825 (2011).
35. hmmlearn: Hidden markov models in python (2015).
36. N. Okazaki, Crfsuite: a fast implementation of conditional random fields (2007).
37. M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org.
38. F. Chollet, Keras, https://keras.io (2015).
39. T. Baltruaitis, M. Mahmoud, P. Robinson, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (2015), vol. 06, pp. 1–6.
Note S1. List of multimodal features.
tions for children with ASD. Video and audio features were extracted from camera data using OpenFace (20), OpenPose (21), and Praat (22). Game performance features were derived from the tablet interactions. A list of all features used for modeling engagement is included here.
• Visual Features :
• Audio Features: (i) harmonicity: measure of voice quality, (ii) intensity: power carried
• Game Performance Features: (i) challenge level of game being played, (ii) count of in-
Fig. S1. Comparing data across users. Statistically significant (p < 0.01) difference in means and variances of all recorded data across users. Visualized data are those with high face detection confidence, compressed to two dimensions using principal component analysis (PCA).
Fig. S2. Comparing data across sessions. Statistically significant (p < 0.01) difference in means and variances of all recorded data across sessions for each user. Sessions shown here from Participant two. Visualized data are those with high face detection confidence, compressed to two dimensions using principal component analysis (PCA).
Table S1. Participant demographic information. Gender and age of the 7 child participants, in random order. All participants had a clinical diagnosis of ASD in mild to moderate ranges as described in the Diagnostic and Statistical Manual of Mental Disorders (32). Additional details about the participants and study that generated the dataset are reported in an earlier article (19).
Table S2. Engagement annotation criteria. Participants were annotated as engaged or disengaged in each camera frame using the criteria above.
Table S3. Model hyperparameters. Several supervised ML model types were considered, as shown here and detailed in Results. All models were implemented using the default hyperparameters shown here because substantial variance in data caused hyperparameter tuning strategies to overfit to training sets. ML models were implemented in Python using the following libraries: Scikit-learn (34), XGBoost (33), Hmmlearn (35), CRFsuite (36), TensorFlow (37), and Keras (38).
Table S4. Generalized model results. Model evaluation metrics for the generalized approach: training on a given subset of users and testing on different users. Evaluated for all possible combinations of the 7 participants, and results averaged across models with the same number of training users. Results reported for Gradient Boosted Decision Trees, the most successful model type.
Table S5. Individualized model results. Model evaluation metrics for the individualized approach: sorting a user’s data chronologically, using an early subset for training and later subset for testing. Applied to proportions of training data ranging from the first 10% to the first 90% of a user’s data, in increments of 10%; results averaged across all users. Results reported for Gradient Boosted Decision Trees, the most successful model type.
Table S6. Random sampling model results. Model evaluation metrics for the random sampling approach: training and testing on distinct random samples of all users data. Applied to proportions of training data ranging from 10% to 90%, in increments of 10%; results averaged over 10 iterations. Results reported for Gradient Boosted Decision Trees, the most successful model type.
Table S7. Re-engagement strategy evaluation using generalized models. Post hoc strategy to re-engage users if predicted engagement probability is less than a threshold on average over a window. Analysis based on generalized models trained on 6 users for varying thresholds and window lengths. Evaluated on the following metrics: percentage of long disengagement sequences (DS) with RA, percentage of engagement sequences (ES) with RA, percentage of short DS with RA, the median duration of DS with RA, and the median elapsed time in DS before RA.
Table S8. Re-engagement strategy evaluation using individualized models. Post hoc strategy to re-engage users if predicted engagement probability is less than a threshold on average over a window. Analysis based on individualized models trained on the first 50% of users’ data for varying thresholds and window lengths. Evaluated on the following metrics: percentage of long disengagement sequences (DS) with RA, percentage of engagement sequences (ES) with RA, percentage of short DS with RA, the median duration of DS with RA, and the median elapsed time in DS before RA.