We performed four sets of experiments. These were carefully designed to evaluate the role of the model personalization and the interpretability/utility of the proposed personalized framework. We also provide comparisons of the proposed PPA-net with alternative approaches, and investigate the role of different behavioral modalities on robot perception of affect and engagement.
2.1 System design
Fig. 2 shows the proposed PPA-net that we designed and implemented to optimally handle: (i) the multi-modal and missing nature of the children’s data (feature layer) (44, 45), (ii) highly heterogeneous children’s data (a famous adage says: “If you have met one person with autism, you have met one person with autism.”) (context layer), and (iii) simultaneous and continuous estimation of the children’s affective dimensions: valence, arousal, and engagement, all three being critical for evaluating the efficacy of the autism therapy (decision layer). For (i), we perform the fusion of different modalities to obtain optimal input features for each child. We used a special type of network based on auto-encoders (46) (Sec. 4). The learned feature representations are further augmented by the expert knowledge within the context layer, quantified using the expert-assessed childhood autism rating scale (CARS) (47). The architecture of this layer is designed based on the nesting of the children using their demographics (culture and gender), followed by individual models for each child. Since the target outputs exhibit different dependence structures for each child (Fig. 3), we used the notion of multi-task learning (48,49) to learn the child-specific decision layers.
Figure 2: Personalized Perception of Affect Network (PPA-net). The feature layer uses (supervised) autoencoders to filter the noise in the features and reduce the feature size. At the intermediate level (context layer), behavioural scores of the child’s mental, motor, and verbal ability (also quantified by CARS) are used to adaptively select the optimal features for each child in the fusion part of the network. Further personalization of the network to each child is achieved via the demographic variables: culture and gender. Finally, the estimation of valence, arousal, and engagement is accomplished in the decision layer using the child-specific network layers, which output continuous estimates of target states.
therapy for autism, led by experienced therapists, and assisted by a humanoid robot NAO (43). The data come from 35 children: 17 from Japan (C1), and 18 from Serbia (C2), all of whom had a prior diagnosis of autism. These data include: (i) video recordings of facial expressions and head movements, body movements, pose and gestures, (ii) audio recordings, and (iii) autonomic physiology from the child: heart rate (HR), electrodermal activity (EDA), and body temperature (T), as measured on the non-dominant wrist of the child. We extracted various features (the sensing step) from these modalities using state-of-the-art tools for video and audio processing (OpenFace (50), OpenPose (51), and OpenSmile (52)); see Appendix/C. We also developed feature extraction/noise-cleaning tools for processing of physiology data. Fig. 3(A) summarizes the results. Note that in 48% of the data, at least one of the target modalities was absent (e.g., when the child’s face was not visible, and/or when a child refused to wear the wristband).
valence, arousal, and engagement on a continuous scale from [...] by five trained human experts, while watching the audio-visual recordings of the sessions. The coders’ agreement was measured using the intra-class correlation (ICC) (53), type (3,1). The ICC ranges from [...] and is commonly used in behavioral sciences to assess the coders’ agreement. The average ICC per output was: valence ([...]), arousal ([...]) and engagement ([...]). The codings were pre-processed and averaged across the coders. These were then used as the ground truth for training ML models, where the data sets of each child were separated in disjoint training, validation, and testing data subsets (Sec. 4.5). A detailed summary of the data, features, and coding criteria is provided in Appendix/B,C.
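The agreement measure named above, ICC type (3,1), can be computed from a targets-by-raters matrix with the standard two-way ANOVA decomposition of Shrout and Fleiss. The following is a minimal sketch of that textbook formula, not the authors' evaluation code; the function name `icc_3_1` and the array layout (rows are rated items, columns are coders) are assumptions made for illustration.

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    ratings: array-like of shape (n_targets, k_raters); this layout is an
    assumption of the sketch, not something specified in the text.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares of the two-way ANOVA without replication.
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between targets
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ss_total - ss_rows - ss_cols                # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

Because type (3,1) measures consistency rather than absolute agreement, a constant offset between two coders does not lower the score; the same function can be applied to a two-column matrix of model predictions and averaged human codings.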
2.2 Effects of Model Personalization
The main premise of model personalization via the PPA-net is that disentangling different sources of variation in behavioral modalities of the children with ASC is expected to improve the individual estimation performance compared to the traditional "one-size-fits-all" approach. Fig. 3(D) depicts the ICC scores computed at each level in the model hierarchy. Specifically, the performance scores at the top node in the graph are obtained using the predicted outputs for children from both cultures using the proposed personalized PPA-net and the group-level perception of affect network (GPA-net). Overall, the strength of the model personalization
Figure 3: (A) Summary of the fraction of data present across the different modalities both individually and concurrently. (B) The dependence patterns derived from the manual codings of the valence, arousal, and engagement. Note the large differences in these patterns at the culture and individual levels. (C) Clustering of the children from C1&C2 using the t-SNE, an unsupervised dimensionality reduction technique, applied to the auto-encoded features (Sec. 2.2). (D) ICC scores per child: C1 (17) and C2 (18) for valence (V), arousal (A) and engagement (E) estimation. Note the effects of the model personalization: the performance of the proposed personalized PPA-net (in black) improves at all three levels (culture, gender, and individual) when compared to the GPA-net (in gray). At the individual level, we depict difference in their performance
can be seen in the performance improvements at culture, gender, and individual levels in all (sub)groups of the children. A limitation can be seen in the adverse gender-level performance by the PPA-net on the two females from C1: this is due to the lack of data at this branch in the model hierarchy, which evidently led to the PPA-net overfitting the data of these two children when fine-tuning their individual layers, a common bottleneck of ML algorithms when trained on limited data (54). Since the layers of the GPA-net were tuned using data of all the children, this resulted in more robust estimation of the network on these individuals.
Fig. 3(D). The improvements in ICC due to the network personalization range from per child. We also note drops in the PPA-net performance on some children. This is common in multi-task learning, where the gain in performance on some tasks (here "tasks" are children) comes at the expense of performance on the others. We also ran the paired t-test with unequal variances (
), using the ICC scores from 10 repetitions of the random splits of the data per child (see Sec. 4.5), and compared the two models per child. Across both cultures, the number of children on which the personalized PPA-net significantly outperformed the GPA-net was: V = 24/35, A = 23/35, and E = 31/35. More specifically, within C1, we obtained:
E = 16/18. Finally, the personalized PPA-net performed significantly better on all three outputs (V, A, E) on 6/17 children from C1, and 9/18 children from C2. Taken together, these results demonstrate notable benefits of the model personalization to improving the robot perception of affect and engagement, at each level in the model hierarchy.
(Sec. 4). The original size of the auto-encoded features was 250 dimensions (D). Fig. 3(C) shows the embeddings of these features into a 2D space obtained using the t-Distributed Stochastic Neighbor Embedding (t-SNE) (55) – a popular ML technique for unsupervised dimensionality reduction and visualization of high-dimensional data. Note the clustering of the children’s data in the projected space (the children’s IDs were not used), confirming the high heterogeneity in behavioral cues of these children. Personalizing the PPA-net to each child allows it to accommodate individual differences at different levels in the model hierarchy, leading to overall better performance compared to the GPA-net.
2.3 Interpretability and Utility
A barrier to the adoption of deep learning arises when interpretability is paramount. Understanding the features that lead to a particular output builds trust with clinicians and therapists using the system in their daily practice. To analyze the contribution of each behavioral modality, and the features within, we used DeepLift (Learning Important FeaTures) (56), an open-source method for computing importance scores in a neural network. Fig. 4(A) shows the importance scores of the input features from each modality and from CARS for estimation of engagement, obtained by applying DeepLift to the PPA-net. We note that the body and face modalities are dominant when the scores are computed for both cultures together. The prior information derived from CARS also strongly influences the estimation of engagement. However, at the culture level, DeepLift produces opposite scores for the body modality/CARS for the two cultures. This indicates that the model was capable of disentangling the culture-level differences. By analysis of CARS (total scores) in our previous work (43), we found statistically significant differences in the scores for the two cultures (p < 0.05), which, in part, also explains the difference in their contribution across the two cultures. Similar observations can be made at the gender level, yet these are more difficult to interpret due to the imbalance in males vs. females.
robot sensing and perception. Fig. 4 (B) depicts the estimated affect and engagement levels, along with the key autonomic physiology signals of a child undergoing the therapy (currently,
Figure 4: (A) Interpretability can be enhanced by looking at the influence of the input features on the output target: for example, here the output target is engagement level. The relative importance scores (y-axis) are shown for the face, body, audio, physiology, and CARS features (x-axis). These are obtained from the DeepLift (56) tool, which provides negative/positive values when the input feature drives the output toward . The most evident differences arise from body features, also indicating cultural differences in the two groups. Therapy monitoring (B) and summarization (C), based on the robot’s sensing of behavioral cues (audio-visual and autonomic physiology), and perception of the current affective states and engagement levels, of the child. In (B), we depict the engagement (E), arousal (A) and valence (V) levels of the child. The automatically estimated levels using the PPA-net are shown in blue, and the ground-truth based on the human coding is shown in red. We also plot the corresponding signals measured from the child’s wrist: accelerometer readings (ACC) showing the movement intensity
along 3-axes (x, y, z); blood-volume pulse (BVP) and electro-dermal activity (EDA). The bars in plot (B) summarize the therapy in terms of the average levels of E, V and A (± SD) within each phase of the therapy: (1) pairing, (2) recognition, and (3) imitation (there is no phase (4) - storytelling - because the child left after phase (3)).
these results are obtained by an off-line analysis of the recorded video). We note that the PPA-net was able to accurately detect the changes in the child’s engagement levels (e.g., during the disengagement segment), while providing estimates that are overall highly consistent with human coders. Since the PPA-net is personalized using data of each child, it evidently learned particular expressions of affect and engagement for the interacting child. Fig. 4(C) summarizes the therapy in terms of average valence, arousal, and engagement levels (along with their variability) within each phase of the therapy. Compared to human coders, the PPA-net produces these statistics accurately for engagement and arousal levels, while it overestimates the valence levels. However, as more data of the target child become available, these can be improved by re-training his/her individual layer.
2.4 Alternative Approaches
How much advantage does the new personalized deep learning approach obtain over more traditional ML? Table 1 (Appendix/A) shows the estimation results obtained by alternative approaches and evaluated using the same experimental protocol (Sec. 4). Here, we compare the performance of the proposed PPA-net/GPA-net with traditional multi-layer perceptron (MLP) deep networks with the same hierarchical structure but optimized using standard learning techniques (i.e., without sequential nesting of the layers).1 We also include the traditional ML models: linear regression (LR) (6), support vector regression (SVR) (6), and gradient boosted regression trees (GBRTs) (57). In the ML literature, LR is usually considered the baseline model. SVR is an adaptation of the SVM models (6), used in state-of-the-art works on human-robot interaction (e.g., (11, 40)), for estimation of continuous outputs. On the other hand, GBRTs are commonly used in clinical decision tasks due to their easy interpretation of input features (58). For more details about training and evaluation of the models, see Appendix/A.
net). The joint learning of all layers in the MLP results in a lack of discriminative power of the network. The unpersonalized models (GPA-net, MLP-0, LR, SVR, and GBRTs) also show a gap in performance relative to the personalized PPA-net. While LR fails to account for highly non-linear dependencies in the data, the non-linear kernel method (SVR) captures them to some extent, but does not reach the full performance attained by the PPA-net due to the absence of hierarchical structure. On the other hand, GBRTs are capable of discovering a hierarchy in the features, yet they lack a principled way of adapting to each child. Also, the large variance in performance of all the models is due to the high heterogeneity in behavioral expressions of children with ASC. By ranking the models based on the number of ‘winning’ tasks (TaskRank), the PPA-net outperforms the compared models on the largest share of the tasks (48%), followed by SVR (22%) and GBRT (13%).
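The TaskRank comparison above can be illustrated with a short sketch. It assumes that a "task" is one column of a models-by-tasks score matrix (for example, one child and one output), that higher scores are better, and that ties go to the first model listed; none of these details are stated in the text, and the function name `task_rank` is hypothetical.

```python
import numpy as np

def task_rank(scores, model_names):
    """Fraction of tasks on which each model attains the best score.

    scores: array-like of shape (n_models, n_tasks), higher is better.
    Ties are resolved in favour of the first model (argmax behaviour),
    which is an assumption of this sketch.
    """
    scores = np.asarray(scores, dtype=float)
    winners = scores.argmax(axis=0)                      # best model per task
    counts = np.bincount(winners, minlength=len(model_names))
    n_tasks = scores.shape[1]
    return {name: counts[i] / n_tasks for i, name in enumerate(model_names)}

# Toy ICC-like scores for three hypothetical models on four tasks.
toy_scores = [[0.6, 0.2, 0.5, 0.7],
              [0.5, 0.4, 0.3, 0.1],
              [0.1, 0.3, 0.9, 0.2]]
toy_rank = task_rank(toy_scores, ["model_a", "model_b", "model_c"])
```

A ranking of this kind rewards how often a model wins rather than by how much, so it complements, rather than replaces, the average ICC scores.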
2.5 Effects of Different Modalities
To assess the contribution of each modality for estimation of target outputs, we evaluated the PPA-net using visual (face and body), audio and physiological features both independently and together. Fig. 8 (Appendix/A) shows the average results for both cultures, and for the children within each culture. As expected, the fusion approach outperforms the individual modalities across all three outputs (valence, arousal, and engagement). Also, higher performance was achieved on C1 than C2 with the multi-modal approach – confirming the complementary and additive nature of these modalities (59). Furthermore, the body features outperform the other individual modalities, followed by the face and physiology modalities. The low performance by the audio modality is attributed to a high level of background noise, which is difficult to control in real-world settings (60). Also, while the physiology features are comparable to the best performing individual modality (body) in C1, this is not the case in C2.
The overall objective of this work was to demonstrate the feasibility of an automated system for robot perception of affect and engagement during autism therapy. This is driven by the societal need for new technologies that can facilitate and improve existing therapies for a growing number of children with ASC. Recent advances in ML and data collection, using unobtrusive sensors such as cameras and microphones, and wearable technology for measurement of autonomic physiology, have paved the way for such technology (61); however, little progress has been made so far (62). To this end, we introduced a novel personalized ML framework that can easily adapt to a child’s affective states and engagement even across different cultures and individuals. This framework builds upon state-of-the-art deep learning techniques (7, 42), which we used to implement the proposed Personalized Perception of Affect Network (PPA-net). While deep learning has shown great success in a variety of learning tasks (e.g., object and scene recognition (63,64) and sentiment analysis (65)), it has not been explored before in the context of robot perception for use in autism therapy. One of the reasons is the previous lack of data needed to take full advantage of deep learning. Using the cross-cultural and multi-modal dataset containing over 500k images of child-robot interactions during autism therapy (43), we were able to successfully design the robot perception based on the PPA-net.
GPA-net and MLP (i.e., the traditional "one-size-fits-all" approaches), deep learning has had great success in leveraging a vast amount of data. However, realizing the full potential of our framework on the data of children with ASC requires the network to personalize for each child. We showed that with the PPA-net, an average intra-class agreement (ICC) of 59% can be achieved between the model predictions and human (manual) coding of children’s affect and engagement levels, where the average agreement between the human coders was 55.3%. This does not imply that humans are not better at estimating affect and engagement but rather that the proposed framework provides a more consistent and less biased estimation approach. Compared to the standard approach in the field to coding affect (valence and arousal) levels, in the most recent and largest public dataset of human faces (Affect-Net (66)), the coders’ agreement was 60.7%, and the automatic prediction of valence and arousal (using CNN-AlexNet (67)) was 60.2% and 53.9%, respectively, in terms of Pearson correlation. Note, however, that these results are obtained from face images of typical individuals, whereas coding and automatic estimation of the same dimensions from children with ASC is a far more challenging task.
ticularly suited for the task at hand and different from existing approaches. First, it uses a novel learning algorithm that allows the deep network to take full advantage of data sharing at each level in the model hierarchy (i.e., the culture, gender, and individual level). This is achieved via the newly introduced network operators (learn, nest, and clone) and fine-tuning strategies (Sec. 4), where the former are based on the notion of network nesting (68) and deeply-supervised nets (69). We showed that, overall, this approach improves the estimation of affect and engagement at each level in the model hierarchy, obtaining statistically significant improvements on 15/35 children (across all three outputs) when compared to the GPA-net (Fig. 3(D)). Second, previous deep models (e.g., (70,71)) that focused on multi-task learning do not leverage the contextual information such as demographics (culture and gender), nor account for the expert knowledge. We also showed in our experiments on the network interpretability (Sec. 2.3) that this is important for disentangling different sources of variance arising from the two cultures and individuals. This, in turn, allows the PPA-net to focus on individual variation when learning the network parameters for each child. Third, using the network layers as building blocks in our framework, we efficiently personalized the network to the target context. Traditional ML approaches such as SVMs, used in previous attempts to implement the robot perception, and an ensemble of regression trees (6), do not offer this flexibility. By contrast, the PPA-net brings together the interpretability, design flexibility and overall improved performance.
nature of the data, especially in the presence of noisy and missing modalities. We showed in Sec. 2.5 that the fusion of audio-visual and physiological cues contributes to increasing the network performance. While our experiments revealed that body and face modalities play a central role in the estimation, the autonomic physiology is also an important facet of affect and engagement (72). This is the first time that both outward and inward expressions of affect and engagement were used together to facilitate the robot perception in autism therapy. We also found that the autonomic physiology influences the output differently in the two cultures. Namely, in C1 the physiology modality alone achieved an average ICC of above 50%, whereas in C2 this score was around 30%. This disparity may be attributed to cultural differences, as children from C2 were moving more during the interactions, which often caused faulty readings from the physiology sensors. Furthermore, the audio modality underperformed in our experiments, despite the evidence in previous works that successfully used it in estimation of affect (73). There are potentially two ways to remedy this: using more advanced techniques for reduction of background noise and user diarization, and using a richer set of audio descriptors (60). By analyzing the feature importance, we found that CARS largely influences the estimation of engagement. This suggests that, in addition to the data-driven approach, the expert knowledge is important for informing the robot perception in the form of prior knowledge. Lastly, in this work, we adopted a feature-level fusion of different modalities; however, more advanced approaches can be used to personalize the feature fusion to each child (e.g., using a mixture-of-experts approach (74)).
use for therapists and clinicians working with children with ASC? The potential utility of our personalized ML framework within autism therapy is through the use of visualization of the estimated affect and engagement levels, and the key autonomic physiology signals of a child undertaking the therapy (Fig. 4(B)). We note at least two benefits of this: first, the obtained scores can be used by the robot to automatically adapt its interaction with the child. This can also assist a therapist to monitor in real time the target behavioral cues of the child, and to modify the therapy “on the fly”. It should also inform the therapist about the idiosyncratic behavioral patterns of the interacting child. Furthermore, it can assist the therapists in reading the children’s inward behavioral cues, i.e., their autonomic physiology, which cannot easily be read from outward cues (e.g., EDA as a proxy of the child’s internal arousal levels, the increase of which, if not detected promptly, can lead to meltdowns in children with ASC). Second, as we show in Fig. 4(C), the output of the robot perception can be used to summarize the therapy in terms of average valence, arousal, and engagement levels (along with their variability) within each phase of the therapy. This, in turn, would allow for a long-term monitoring of the children’s progress, also signaling when the robot fails to accurately perceive the child’s signals. This can be used to improve certain aspects of the child’s behavioral expressions by profiling the child and designing strategies to optimize his/her engagement through a personalized therapy content.
for future method enhancement. First, in the current structure of the proposed PPA-net, we assumed that the children are split based solely on their demographics. While the findings in Sec. 2.3 show that the body modality has the opposite influence between the two cultures on estimation of engagement, thus justifying the current PPA-net architecture, other network structures are also feasible. For example, an adaptive robot perception would adopt a hybrid approach where prior knowledge (e.g., demographics) is combined with a data-driven approach to automatically learn the network structure (75). Also, our current framework is static, while the data we used are inherently dynamic (the sensed signals are temporally correlated). Incorporating the temporal context within our framework can be accomplished at multiple levels: for example, different network parameters can be learned for each phase of the therapy. To this end, more advanced models, such as recurrent neural networks (76), can be used in the individual layers. Furthermore, the network generalizability not only within the previously seen children, as currently done by the PPA-net, but also to new children is another important aspect of robot perception. Extending the current framework so that it can be optimized for previously unseen children would additionally increase its utility. Due to the hierarchical nature of the PPA-net, a simple way to currently achieve this is by adding an individual layer for each new child, while re-using the other layers in the network.
though we used a rich dataset of child-robot interactions to build the robot perception system, this dataset contains a single therapy session per child. An ideal system would have constant access to the therapy data of a target child, allowing the robot to actively adapt its interpretations of the child’s affect and engagement, and further personalize the PPA-net as the therapy progresses. For this, ML frameworks such as active learning (77) and reinforcement learning (78) are a good fit. This would allow the robot to continuously adjust the network parameters using new data, and also reduce the coding effort by only asking human coders to provide labels for cases for which it is uncertain. Another constraint of the proposed robot perception solution is that the video data come from a background camera/microphone. While this allows us to have a more stable view for the robot sensing of the face-body modality, the view from the robot’s perspective would enable more naturalistic interactions. This is also known as active vision (79); however, it poses a number of challenges including the camera stabilization and multi-view adaptation (80). Finally, one of the important avenues for future research on robot perception for autism is to focus on its utility and deployment within every-day autism therapies. Only in this way can the robot perception and the learning of children with ASC be mutually enhanced.
4.1 Data representations
We used the feed-forward multi-layer neural network approach (81) to implement the proposed deep learning architecture (Fig. 2). Each layer receives the output of the layer above as its input, producing higher-level representations (82) of the features extracted from the behavioral modalities of the children. We began with the GPA-net, where all layers are shared among the children. The network personalization was then achieved (i) by replicating the layers to construct the hierarchical architecture depicted in Fig. 2, and (ii) by applying the proposed fine-tuning strategies to optimize the network performance on each child. The last layers of the network were then used to make individual estimations of affect and engagement.
4.2 Feature Fusion and Autoencoding
We applied the feature-level fusion to the face ($x_f$), body ($x_b$), audio ($x_a$), and physiology ($x_p$) features of each child as:
$$x = [x_f; x_b; x_a; x_p] \in \mathbb{R}^{D_x \times 1},$$
where $D_x$ is the overall dimension of the input. The continuous labels for valence ($y_v$), arousal ($y_a$), and engagement ($y_e$) for each child were stored as $y = [y_v; y_a; y_e] \in \mathbb{R}^{3 \times 1}$. Furthermore, the data of each child were split into non-overlapping training, validation and test datasets (Sec. 4.5). To reduce the adverse effects of partially-observed and noisy features in the input x (Fig. 8 (A) - Appendix/A), we used an autoencoder (AE) (83) in the first layer of the PPA-net. The AE transforms x to a hidden representation $h_0$ (with an encoder) through a deterministic mapping:
$$h_0 = f_{\theta_0^{(e)}}(x) = W_0^{(e)} x + b_0^{(e)},$$
parametrized by $\theta_0^{(e)}$, where $e$ designates the parameters on the encoder side. We used the linear activation function (LaF), where the parameters $\theta_0^{(e)} = \{W_0^{(e)}, b_0^{(e)}\}$ are a weight coefficient matrix and a bias vector, respectively. This hidden representation is then mapped back to the input, producing the decoded features:
$$z_0 = f_{\theta_0^{(d)}}(h_0) = W_0^{(d)} h_0 + b_0^{(d)},$$
where d designates the parameters of the decoder, and $W_0^{(d)} = W_0^{(e)\top}$ are the tied weights used for the inverse mapping of the encoded features (decoder). In this way, the input data were transformed to a lower-dimensional and less noisy representation (‘encoding’). Since the input data are multi-modal, the encoded subspace also integrates the correlations among the modalities, rendering more robust features for learning of the subsequent layers in the network.
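for each hidden layer (69). The CoF acts as a regularizer on the network weights, enabling the outputs of each layer to pass the most discriminative features to the next layer. Using the CoF, the AE also reconstructs target outputs $y_0$ (in addition to $z_0$) as:
$$y_0 = f_{\theta_0^{(c)}}(h_0) = W_0^{(c)} h_0 + b_0^{(c)}.$$
The AE parameters $\omega_0 = \{\theta_0^{(e)}, \theta_0^{(d)}, \theta_0^{(c)}\}$ were then optimized over the training dataset to minimize the mean-squared-error (MSE) loss (defined as $\alpha(a, b) = \|a - b\|^2$) for both the decoding ($\alpha_d$) and output ($\alpha_c$) estimates:
$$\omega_0^{*} = \arg\min_{\omega_0} \frac{1}{N} \sum_{i=1}^{N} \Big[ (1 - \lambda)\, \alpha_d\big(x_i, z_{0,i}\big) + \lambda\, \alpha_c\big(y_i, y_{0,i}\big) \Big],$$
where N is the number of training datapoints from all the children. The parameter $\lambda$ was chosen to balance the network’s generative power (the feature decoding) and discriminative power (the output estimation), and was optimized using validation data (in our case, the optimal value was [...]). The learned $f_{\theta_0^{(e)}}$ was applied to the input features x, and the resulting code $h_0$ was then combined with the CARS ($x_{cars}$) for each child:
$$\tilde{h}_0 = [h_0; x_{cars}].$$
This new data representation was used as input to the subsequent layers of the network.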
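The supervised autoencoder described in this section (linear encoder, tied-weight decoder, and a linear CoF output, trained on a weighted sum of reconstruction and output losses) can be sketched in NumPy as follows. This is a minimal forward-pass and loss sketch under stated assumptions, not the authors' Keras implementation: the function names, the toy dimensions, the random toy data, and the convex weighting `(1 - lam)` / `lam` of the two loss terms are all assumptions made for illustration.

```python
import numpy as np

def init_sae(d_in, d_hid, d_out, rng):
    """Randomly initialise a supervised AE with tied encoder/decoder weights."""
    return {
        "W_e": rng.normal(0.0, 0.01, (d_hid, d_in)),   # encoder weights
        "b_e": np.zeros(d_hid),                        # encoder bias
        "b_d": np.zeros(d_in),                         # decoder bias
        "W_c": rng.normal(0.0, 0.01, (d_out, d_hid)),  # CoF (output) weights
        "b_c": np.zeros(d_out),                        # CoF bias
    }

def sae_forward(p, X):
    """Return the code H, the decoded features Z, and the CoF outputs Y."""
    H = X @ p["W_e"].T + p["b_e"]   # linear encoder
    Z = H @ p["W_e"] + p["b_d"]     # tied decoder: W_d is the transpose of W_e
    Y = H @ p["W_c"].T + p["b_c"]   # linear CoF predicting the labels
    return H, Z, Y

def sae_loss(p, X, Y_true, lam):
    """Weighted sum of reconstruction MSE and output MSE (assumed weighting)."""
    _, Z, Y = sae_forward(p, X)
    rec = np.mean(np.sum((X - Z) ** 2, axis=1))
    out = np.mean(np.sum((Y_true - Y) ** 2, axis=1))
    return (1.0 - lam) * rec + lam * out

rng = np.random.default_rng(0)
# Toy modality features for 5 samples; the sizes are hypothetical.
face, body, audio, physio = (rng.normal(size=(5, d)) for d in (4, 3, 3, 2))
X = np.concatenate([face, body, audio, physio], axis=1)  # feature-level fusion
Y_true = rng.uniform(-1.0, 1.0, size=(5, 3))             # toy V, A, E labels
cars = rng.uniform(1.0, 4.0, size=(5, 2))                # toy CARS scores
params = init_sae(X.shape[1], 4, 3, rng)
H0, Z0, Y0 = sae_forward(params, X)
fused = np.concatenate([H0, cars], axis=1)               # code combined with CARS
```

Tying the decoder to the transpose of the encoder halves the number of weight matrices to learn, which is one common motivation for tied weights when data are limited; the gradient-based optimization itself is omitted here.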
Figure 5: The learning of the PPA-net. (A) The supervised-AE performs the feature smoothing by dealing with missing values and noise in the input, while preserving the discriminative information in the subspace - constrained by the CoF. The learning operators in the PPA-net: (B) learn, (C) nest and (D) clone, are used for the layer-wise supervised learning, learning of the subsequent vertical layers, and horizontal expansion of the network, respectively. (E) The group level GPA-net is first learned by sequentially increasing the network depth using learn & nest. The GPA-net is then used to initialize the personalized PPA-net weights at the culture, gender, and individual level (using clone). (F) The network personalization is then accomplished via the fine tuning steps I and II (Sec. 4).
4.3 Group-level Network
We first trained the GPA-net, where all network layers are shared among the children (Fig. 5 (E)). The weights of the GPA-net were also used to initialize the PPA-net, followed by the proposed fine-tuning strategies to personalize the network (Sec. 4.4). The former step is important because each layer below the culture level in the PPA-net uses only a relevant subset of the data (e.g., in C1, data of two females are present below the gender layer), resulting in less data to train these layers. This, in turn, could easily lead to overfitting of the PPA-net, especially of its child-specific layers, if only the data of a single child were used to learn their weights. To this end, we employed a supervised layer-wise learning strategy, similar to that proposed in recent deep learning works (68,69). The central idea is to train the layers sequentially and in a supervised fashion by optimizing two layers at a time: the target hidden layer and its CoF.
operator is called when simultaneously learning the hidden and CoF layers. For the hidden layers, we used the rectified linear unit (ReLU) (7), defined as: $h_l = \max(0, W_l h_{l-1} + b_l)$, where $\theta_l = \{W_l, b_l\}$. ReLU is a popular activation function with a constant derivative for positive inputs, resulting in fast learning and preventing vanishing gradients in deep neural networks (7). The AE output and CARS ($x_{cars}$) were fed into the fusion (l = 1) layer, followed by the culture (l = 2), gender (l = 3), and individual (l = 4, 5) layers, as depicted in Fig. 5, where each CoF is a fully connected LaF with parameters $\theta_l^{(c)} = \{W_l^{(c)}, b_l^{(c)}\}$. The optimal parameters of the $l$-th layer, $\omega_l = \{\theta_l, \theta_l^{(c)}\}$, were found by minimizing the loss:
$$\omega_l^{*} = \arg\min_{\omega_l} \alpha_c(y_l, y),$$
where $\alpha_c$ is the MSE loss (Sec. 4.2), computed between the output of the ReLU layer ($h_l$) passed through the LaF layer of the CoF ($y_l = W_l^{(c)} h_l + b_l^{(c)}$), and true outputs (y).
fashion as in (68), to initialize the parameters as:
where the weight matrix of the ReLU was set to an identity matrix (I). To avoid the network being trapped in a local minimum of the previous layer, we added a low Gaussian noise (
) to the elements of I. We set the parameters of the supervised linear layer using the weights of the CoF above, which ensures that the network achieves similar performance after nesting the new ReLU layer. Before training the nested layer, we ‘froze’ all the layers above by setting the gradients of their weights to zero, a common approach in layer-wise training of deep models (84). This allows the network to learn the best weights for the target layer (at this stage). The learn and nest steps were applied sequentially to all subsequent layers in the network. Then, the network hidden layers and the last CoF were fine-tuned jointly. We initially set the number of epochs to 500, with early stopping, i.e., training until the error on a validation set reaches a clear minimum (82) (100 epochs).
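The nest step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes that the new ReLU layer has the same width as the layer below and no bias, that the incoming activations are non-negative (as they are after a ReLU), and it uses a hypothetical noise level `sigma = 1e-3`. Under these assumptions, a near-identity weight matrix leaves the activations almost unchanged, so the CoF above sees nearly the same input as before nesting.

```python
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def nest_identity_layer(size, sigma=1e-3, seed=0):
    """Weights of a new layer: identity matrix plus small Gaussian noise.

    sigma is a hypothetical value chosen for illustration only.
    """
    rng = random.Random(seed)
    return [[(1.0 if i == j else 0.0) + rng.gauss(0.0, sigma)
             for j in range(size)] for i in range(size)]


# Activations of the previously trained ReLU layer (non-negative by construction).
h_prev = [0.0, 0.7, 1.3, 2.1]
W_new = nest_identity_layer(len(h_prev))
h_new = relu(matvec(W_new, h_prev))

# Largest change introduced by the nested layer before any further training.
max_change = max(abs(a - b) for a, b in zip(h_new, h_prev))
```

The small noise breaks the exact symmetry of the identity mapping, so subsequent training of the nested layer does not start from exactly the solution of the previous layer.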
Briefly, the backpropagation algorithm indicates how a model should change its parameters that are used to compute the representation in each layer from the representation in the previous layer. The loss of the AE layer and each pair of the ReLU/LaF(CoF) layers was minimized using the Adadelta gradient descent algorithm with learning rate lr = 1, 200 epochs, and a batch size of 100. The optimal network configuration had hidden neurons (h) in the AE and its CoF layers, respectively. Likewise, the size of the fusion ReLU was
and
for all subsequent ReLU layers. The size of their CoF layers was
We implemented the PPA-net using the Keras API (85) with a TensorFlow backend (86), on a Dell Precision workstation (T7910) with two GPUs (NVIDIA GeForce GTX 1080 Ti).
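For readers unfamiliar with Adadelta, the following sketch shows its update rule for a single scalar parameter in plain Python. It is an illustration only, not the Keras implementation used in the experiments; the decay rate `rho = 0.95` and the constant `eps = 1e-6` are commonly used values and are assumptions here, while `lr = 1` follows the text.

```python
import math


def adadelta_minimize(grad, x0, lr=1.0, rho=0.95, eps=1e-6, steps=100):
    """Scalar Adadelta: the step size adapts from running averages of g^2 and dx^2."""
    x, avg_g2, avg_dx2 = x0, 0.0, 0.0
    for _ in range(steps):
        g = grad(x)
        # Running average of squared gradients.
        avg_g2 = rho * avg_g2 + (1.0 - rho) * g * g
        # Update scaled by the ratio of the two running RMS values.
        dx = -math.sqrt(avg_dx2 + eps) / math.sqrt(avg_g2 + eps) * g
        # Running average of squared updates.
        avg_dx2 = rho * avg_dx2 + (1.0 - rho) * dx * dx
        x += lr * dx
    return x


# Toy objective f(x) = x^2 with gradient 2x, started at x = 1.
x_final = adadelta_minimize(lambda x: 2.0 * x, x0=1.0)
```

Because the effective step size is derived from the history of gradients and updates, no manual tuning of a learning-rate schedule is needed, which is convenient for the layer-wise training stage.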
4.4 Network Personalization
To personalize the GPA-net, we devised a learning strategy that consists of three steps: the network initialization followed by two fine-tuning steps. For the initialization, we introduced a new operator, named clone, which widens the network to produce the architecture depicted in Fig. 2. Specifically, the AE (l = 0) and fusion (l = 1) layers were configured as in the GPA-net (using the same parameters). The clone operator was then applied to generate the culture, gender, and individual layers, the parameters of which were initialized as follows:
As part of the clone procedure, the culture and gender layers were shared among the children, while the individual layers were child-specific.
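A possible reading of the clone operator is sketched here in plain Python. The sketch assumes that every cloned layer starts as a copy of the corresponding GPA-net layer; this initialization, the dictionary layout, and the names `gpa_params` and `clone_network` are illustrative assumptions rather than the authors' code.

```python
import copy


def clone_network(gpa_params, cultures, genders, children):
    """Widen a shared network into culture-, gender-, and child-specific branches.

    Assumption: each branch is initialized as a copy of the matching GPA-net layer.
    """
    return {
        "ae": gpa_params["ae"],          # l = 0, configured as in the GPA-net
        "fusion": gpa_params["fusion"],  # l = 1, configured as in the GPA-net
        "culture": {c: copy.deepcopy(gpa_params["culture"]) for c in cultures},
        "gender": {(c, g): copy.deepcopy(gpa_params["gender"])
                   for c in cultures for g in genders},
        "individual": {k: copy.deepcopy(gpa_params["individual"])
                       for k in children},
    }


# Toy parameters: one list of weights per layer (hypothetical values).
gpa = {"ae": [1.0], "fusion": [2.0], "culture": [3.0],
       "gender": [4.0], "individual": [5.0]}
ppa = clone_network(gpa, ["C1", "C2"], ["male", "female"],
                    ["child_01", "child_02"])

# Changing one child's layer leaves the GPA-net and the other children untouched.
ppa["individual"]["child_01"][0] = 9.0
```

Deep copies keep the branches independent, so later fine-tuning of one child's layers cannot silently modify the layers of another child.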
strategies. We report here a two-step fine-tuning strategy that performed the best. First, we updated the network parameters along the path to a target child, while freezing the layers not intersecting with that particular path. For instance, for child k and demographics the following updates were made:
Practically, this was achieved by using a batch of 100 random samples of the target child to compute the network gradients along that child’s path. In this way, the network gradients were accumulated across all the children, and then back-propagated (1 epoch). This was repeated for 50 epochs, and the Stochastic Gradient Descent (SGD) algorithm (lr = 0.03) was used to update the network parameters. At this step, SGD produced better parameters than Adadelta. Due to its adaptive lr, Adadelta quickly altered the initial network parameters, overfitting the parameters of the deeper layers, which are trained on less data, as discussed above. This, in turn, diminished the shared knowledge provided by the GPA-net. By contrast, SGD with a low and fixed lr made small updates to the network parameters at each epoch, allowing the network to better fit each child while preserving the shared knowledge. This was followed by the second fine-tuning step, where the child-specific layers (ReLU/LaF(CoF)) were further optimized. For this, we used Adadelta (lr = 1) to tune the child-specific layers one child at a time (200 epochs). Further details of these learning strategies are provided in Appendix A.
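The first fine-tuning step can be sketched as follows. This is a simplified illustration under several assumptions: each parameter group is reduced to a single scalar, the per-child gradients are given rather than computed, and the path is assumed to run through the fusion, culture, gender, and individual groups. Only lr = 0.03 is taken from the text; the function names are hypothetical.

```python
def path_to_child(child):
    """Parameter groups on the (assumed) path from the fusion layer to one child."""
    return [
        ("fusion",),
        ("culture", child["culture"]),
        ("gender", child["culture"], child["gender"]),
        ("individual", child["id"]),
    ]


def finetune_epoch(params, grads_per_child, children, lr=0.03):
    """One epoch: accumulate path gradients over all children, then one SGD step.

    Groups that are not on a child's path receive no gradient from that child,
    which corresponds to freezing them for that child.
    """
    total = {name: 0.0 for name in params}
    for child in children:
        for name in path_to_child(child):
            total[name] += grads_per_child[child["id"]][name]
    return {name: w - lr * total[name] for name, w in params.items()}


children = [
    {"id": "a", "culture": "C1", "gender": "male"},
    {"id": "b", "culture": "C2", "gender": "male"},
]
params = {name: 1.0 for child in children for name in path_to_child(child)}
# Hypothetical gradients: 1.0 for every group on each child's own path.
grads_per_child = {c["id"]: {n: 1.0 for n in path_to_child(c)} for c in children}
updated = finetune_epoch(params, grads_per_child, children)
```

In this toy setting the shared fusion group receives gradient from both children, whereas each culture, gender, and individual group is updated only by the child whose path passes through it.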
4.5 Evaluation Procedure
We performed a random split of the data of each child into three disjoint sets: we used 40% of a child’s data as a training set, and 20% as the validation data to select the best model configuration. The remaining 40% were used as the test set to evaluate the model’s generalization to previously unseen data. This protocol imitates a realistic scenario where a portion of a child’s data (e.g., annotated by child therapists) is used to train and personalize the model to the child, and the rest is used to estimate affective states and engagement from new data of that child. To avoid any bias in the data selection, this process was repeated ten times. The input features were z-normalized (zero mean, unit variance), and the model’s performance is reported in terms of ICC (and MSE) computed from the model estimates and ground-truth labels (see Appendix).
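The evaluation protocol can be sketched in plain Python as follows. The split proportions and the ten repetitions follow the text; computing the normalization statistics on the training portion only is an assumption, since the text does not state which data were used for that purpose, and the function names are hypothetical.

```python
import random
import statistics


def split_indices(n, seed, train=0.4, val=0.2):
    """Random 40/20/40 split of one child's sample indices into disjoint sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_tr, n_va = round(train * n), round(val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]


def z_normalize(values, reference):
    """Scale values to zero mean and unit variance using reference statistics."""
    mu = statistics.mean(reference)
    sd = statistics.pstdev(reference)
    return [(v - mu) / sd for v in values]


# Ten repetitions of the random split for a child with 100 samples.
splits = [split_indices(100, seed) for seed in range(10)]
train_idx, val_idx, test_idx = splits[0]

# Normalize one toy feature with statistics of the (assumed) training portion.
feature = [1.0, 2.0, 3.0]
z = z_normalize(feature, reference=feature)
```

Using a different seed for each repetition yields ten independent partitions, over which the reported scores can be averaged.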
We thank the Serbian Autism Society, and the educational therapist Ms S. Babovic for her invaluable feedback during this study. We would also like to thank the Ethics Committee from Japan (Chubu IRB), Serbia - MHI, and USA (MIT IRB), for allowing this research to be conducted. We also thank Ms Havannah Tran and Ms Jiayu Zhou for helping us to prepare the figures in the paper, Dr Javier Hernandez, for his insights into the processing and analysis of physiological data, Dr Manuel Cebrian for his advice on formatting the paper, and MIT undergraduate students: Mr John Busche, for his support in experiments for alternative approaches, and Ms Sadhika Malladi, for her help with running the DeepLift code. Our special thanks go to all the children and their parents who participated in the data collection - without them, this research would not be possible.
Funding: This work has been supported by MEXT Grant-in-Aid for Young Scientists B Grant No. 16763279, and Chubu University Grant I Grant No. 27IS04I (Japan). The work of O. R. has been funded by EU HORIZON 2020 under grant agreement no. 701236 (ENGAGEME) - Marie Skłodowska-Curie Individual Fellowship, and the work of B.S. under grant agreement no. 688835 (DE-ENIGMA).
Author contributions: O.R. and R.P. conceived the personalized machine learning framework. O.R. derived the proposed deep learning method. M.D. and O.R. implemented the method and conducted the experiments. J.L. supported the data collection, data processing and analysis of the results. B.S. provided insights into the method and audio-data processing. All authors contributed to the writing of the paper.
Competing interests: The authors declare that they have no competing interests.
Table 1: The mean (SD) of the ICC scores (in %) for estimation of the children’s valence, arousal, and engagement. TaskRank quantifies the portion of tasks (35 children × 3 outputs = 105 in total) on which the target model outperformed the compared models, including standard deep multi-layer perceptron network with last layers adapted to each child (MLP), joint MLP (MLP-0), linear regression (LR), support-vector regression (SVR) with a Gaussian kernel, and gradient-boosted regression trees (GBRT).
Figure 6: Empirical Cumulative Distribution Function (CDF) computed from average estimation errors for valence, arousal, and engagement levels, in terms of (A) ICC and (B) MSE. We show the performance of the three top-ranked models (based on TaskRank in Table 1). The individual performance scores for 35 children are used to compute the CDFs in the plots. From the plots, we see that the improvements due to the network personalization are most pronounced for 40% < F(X) < 75% of the children. On the other hand, the model personalization exhibits similar performance on the children for whom the group-level models perform very well (0% < F(X) < 40%), or largely underperform (75% < F(X) < 100%). This indicates that for the underperforming children, the individual expressions of affect and engagement vary largely across the children. Thus, more data of those children is needed to achieve a more effective model personalization.
Figure 7: The networks’ learning: Mean Squared Errors (MSE) during each epoch in the network optimization are shown for the personalized (PPA-net and MLP) and group-level (GPA-net) models, and for training (tr) and validation (va) data. Note that the GPA-net learns faster and reaches a better local minimum than the standard MLP. This is due to the former using the layer-wise supervised learning strategy. This is further enhanced by the fine-tuning steps in the PPA-net, which achieves the lowest MSE during the model learning owing to its ability to adapt its parameters to each culture, gender and individual.
Figure 8: The contribution of visual (face and body), audio and physiology modalities in the estimation of the valence, arousal, and engagement levels of the children using PPA-net. The fusion approach (’ALL’) outperforms the individual modalities, evidencing the additive contribution of each modality to predicting the target outputs. The large error bars reflect the high level of heterogeneity in the individual performance of the network on each child, as expected for many children with ASC.
In Table 1, we compare the different methods described in Sec. 2.4 of the paper and detailed below. In Fig. 6, we depict the error distributions of the top-performing methods, highlighting the regions in the error space where the proposed PPA-net is most effective (and otherwise). Fig. 7 shows the convergence rates of the deep models evaluated in the paper, in terms of the learning steps and the loss minimization (MSE). Note that the proposed PPA-net is able to fit the target children significantly better, while still outperforming the compared methods on the previously unseen data of those children (Table 1). In traditional ML, where the goal is to generalize to previously unseen subjects, this could be considered algorithmic bias (the model overfitting). By contrast, in personalized ML, as proposed here, it is beneficial because it allows the model to perform best on unseen data of the target subject to whom we aim to personalize the model. Fig. 8 depicts the contribution of each modality to the estimation performance (Sec. 2.5). The bars in the graph show the mean (SD) ICC performance for each modality, obtained by averaging across the children. The PPA-net configuration used to make predictions from each modality was the same as in the multi-modal scenario. However, the size of the auto-encoded space varied in order to accommodate the size of the input features. Specifically, the optimal size of the encoded features was 150 for the face and 50 for the body modality, whereas the original feature sizes were used for the audio (24) and physiology (30) modalities. In what follows, we provide additional details on the training procedures for the alternative methods used in our experiments.
Specifically, for MLP-0/MLP we used the Keras API (85), and for the rest we used scikit-learn (87), a Python toolbox for ML.
We report the results obtained on the dataset of children undergoing occupational therapy for autism (43). The therapy was led by experienced child therapists, and assisted by a humanoid robot NAO. The goal of the therapy was to teach the children to recognize and imitate emotive behaviors (using the Theory of Mind concept (88)) as expressed by the NAO robot. During the therapy, the robot was driven by a "person behind the curtain" (i.e., the therapist), but the data were collected to enable the robot to have a future autonomous perception of the affective states of a child learner. The data include: (i) video recordings of facial expressions, head and body movements, pose and gestures, (ii) autonomic physiology (heart rate (HR), electrodermal activity (EDA) and body temperature (T)) from the children, as measured on their non-dominant wrist, and (iii) audio-recordings (Fig. 1). The data come from 35 children with different cultural backgrounds. Namely, 17 children (16 males / 1 female) are from Japan (C1), and 19 children (15 males / 4 females) are from Serbia (C2) (43). Note that in this paper we excluded the data of one male child from C2 due to the low-quality recording, which leaves the 35 children used here. Each child participated in a 25-minute-long child-robot interaction. Children’s ages ranged from 3 to 13, and all the children have a prior diagnosis of autism (see Table 2). The protocol for the data acquisition was reviewed and approved by relevant Institutional Review Boards (IRBs), and informed consent was obtained in writing from the parents of the children. More details about the data, recording setting and therapy stages (pairing, recognition, imitation and story-telling) can be found in (43).
Table 2: The summary of participants [taken from (43)]. The average CARS scores of the two groups are statistically different (
The raw data of synchronized video, audio and autonomic physiology recordings were processed using state-of-the-art open-source tools. For analysis of facial behavior, we used the OpenFace toolkit (50). This toolkit is based on Conditional Local Neural Fields (CLNF) (89), an ML model for detection and tracking of 68 fiducial facial points, described as two-dimensional (2D) coordinates (x, y) in face images (Fig. 1). It also provides 3D estimates of head-pose and eye-gaze direction (one for each eye), as well as the presence and intensity (on a 6-level Likert scale) of 18 facial action units (AUs) (90). The latter are usually referred to as the judgment level descriptors of facial activity, in terms of activations of facial muscles. Most human facial expressions can be described as a combination of these AUs and their intensities, and they have been the focus of research on automated analysis of facial expressions (91). For capturing the body movements, we used the OpenPose toolkit (51) for automated detection of 18-keypoint body pose locations, 21-keypoint hand estimation, and 70 fiducial facial landmarks (all in 2D), along with their detection confidence (0-1). From this set, we used the body pose and facial landmarks, and disregarded the hand tracking (due to frequent occlusions of the children’s hands). OpenPose is built upon recent advances in convolutional neural networks (CNNs), specifically the VGG-19 net (92), and the part affinity fields for part association (51).
acoustic low-level descriptors (LLDs) from the speech waveform on frame level. Specifically, we used 24 LLDs (pitch, MFCC, LSP, etc.) provided by openSMILE, which have already been used effectively for cross-lingual automatic diagnosis of ASC from children’s voices (73). These features were computed over sliding windows of length 100 ms with a 10 ms shift, and then aligned with the visual features using time-stamps stored during the data recording.
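The windowing and alignment can be illustrated with a short sketch in plain Python. The window length and shift follow the text; matching each video frame to the audio window with the nearest start time is an assumption (the text only states that time-stamps were used), and the video time-stamps in the example are hypothetical.

```python
import bisect


def window_starts(duration_ms, win_ms=100, shift_ms=10):
    """Start times (ms) of all full windows that fit in the recording."""
    return list(range(0, duration_ms - win_ms + 1, shift_ms))


def align_nearest(src_times, dst_times):
    """For each destination time-stamp, index of the nearest source time-stamp.

    src_times must be sorted in ascending order.
    """
    matched = []
    for t in dst_times:
        i = bisect.bisect_left(src_times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(src_times)]
        matched.append(min(candidates, key=lambda j: abs(src_times[j] - t)))
    return matched


audio_t = window_starts(1000)   # one second of audio
video_t = [0, 33, 67, 100]      # hypothetical video frame time-stamps (ms)
matched = align_nearest(audio_t, video_t)
```

With a 10 ms shift, the nearest audio window is never more than 5 ms away from a video frame that falls inside the recording, which is well below the duration of a single video frame.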
commercially available E4 wrist-worn sensor (93,94). This wristband provides real-time readings of blood volume pulse (BVP) and HR (64Hz), EDA via the measurement of skin conductance (4Hz), skin T (4Hz), and 3-axis accelerometer (ACC) data (32Hz). From these signals, we also extracted additional commonly used hand-crafted features (34), as listed in Table 3. Note that since HR is obtained from BVP, we used only the raw BVP. Again, these were temporally aligned with the visual features using time-stamps stored during the data recording.
Table 3: The summary of the features used from different data modalities.
predictors of target affective states and engagement in our personalized affect perception deep networks. From the OpenPose output, we used the face and body features with the detection confidence over each feature set (face and body) above 30%, which we found to be a good threshold by visually inspecting the detection results. The final feature set was formed as follows: (i) Face: we used the facial landmarks from OpenPose, enhanced with the head-pose, eye-gaze and AUs, as provided by OpenFace. (ii) Body: we merged the OpenPose body-pose features and the E4 ACC features encoding the hand movements. (iii) Audio: the original feature set is kept. (iv) Physiology: contains the features derived from the E4 sensor, without the ACC features. Table 3 summarizes these features.
children’s behavioral severity at the time of the interaction (after the recordings) was scored on the CARS (47) by the therapists (Table 2). The CARS form is typically completed in less than 30 minutes, and it asks about 15 areas of behavior defined by a unique rating system (0-4) developed to assist in identifying individuals with ASC. The rating values given for the 15 areas are summed to produce a total score for each child. CARS covers the three key behavioral dimensions pertinent to autism: social-emotional, cognitive, and sensory, and based on the total scores, the children fall into one of the following categories: (i) no autism (score below 30), (ii) mild-to-moderate autism (score: 30–36.5), and (iii) moderate-to-severe autism (37–60). We used this 15-D feature set (the CARS scores for each of the 15 areas) as a unique descriptor for each child - encoding the expert knowledge about the children’s behavioral traits.
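The scoring rule described above can be written as a small helper. The thresholds are those given in the text; the function name and the example ratings are hypothetical, and the sketch is meant only to make the mapping from the 15 item ratings to the total score and category explicit.

```python
def cars_category(item_scores):
    """Total CARS score and severity category from the 15 item ratings."""
    if len(item_scores) != 15:
        raise ValueError("CARS requires ratings for exactly 15 areas")
    total = sum(item_scores)
    if total < 30:
        category = "no autism"
    elif total <= 36.5:
        category = "mild-to-moderate autism"
    else:
        category = "moderate-to-severe autism"
    return total, category


# Hypothetical ratings for three children (same rating in every area).
low = cars_category([1.5] * 15)
mid = cars_category([2.0] * 15)
high = cars_category([3.0] * 15)
```

Note that the network uses the 15 individual ratings as the child descriptor, not the total score or the category, so the helper is relevant mainly for summarizing the participants.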
The dataset was labeled by human experts in terms of two most commonly used affective dimensions (valence and arousal), and engagement, all rated on a continuous scale in the range from . Specifically, five expert therapists (two from C1 and three from C2) coded the videos independently while watching the audio-visual recordings of target interactions. As a measure of the coders’ agreement, we used the intra-class correlation (ICC) score, type (3,1) (53). This score is a measure of the proportion of a variance that is attributable to objects of measurement compared to the overall variance of the coders. The ICC is commonly used in behavioral sciences to assess the agreement of judges. Unlike the well-known Pearson correlation (PC), ICC penalizes the scale differences and offset between the coders, which makes it a more robust measure of coders’ agreement. The codings were aligned using the standard alignment techniques: we applied time-shifting of
seconds to each coder, and selected the shift which produced the highest average inter-coder agreement. The ground truth labels that we used to evaluate the ML models were then obtained by averaging the codings of the three (out of five) coders who had the highest agreement (based on the pair-wise ICC scores). We found empirically that outlying codings can be significantly reduced in this way. The obtained coding ("the gold standard") was then used as the ground truth for training ML models for estimation of valence, arousal, and engagement levels during the child-robot interactions. Finally, note that in our previous work (43), we used discrete annotations for the three target dimensions. Since these were coded per manually selected engagement episodes, for this work we re-annotated the data to obtain more fine-grained (i.e., continuous) estimates of the affect and engagement from the full dataset. The description of the exemplary behavioral cues used during the coding process is given in Table 4.
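For reference, ICC(3,1) can be computed from the two-way ANOVA mean squares as (BMS − EMS) / (BMS + (k − 1) EMS), where BMS is the between-targets mean square, EMS the residual mean square, and k the number of coders (53). The sketch below implements this formula in plain Python; the function name and the toy ratings are hypothetical.

```python
def icc_3_1(ratings):
    """ICC(3,1) for a table with one row per target and one column per coder."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between coders
    bms = ss_rows / (n - 1)
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)


# Two coders who agree exactly.
same = icc_3_1([[1, 1], [2, 2], [3, 3], [4, 4]])
# The second coder uses twice the scale of the first one.
scaled = icc_3_1([[1, 2], [2, 4], [3, 6], [4, 8]])
```

In the second example the two codings are perfectly linearly related, so their Pearson correlation equals 1, whereas the ICC falls below 1 because the coders use different scales.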
Table 4: The description of the behaviors and corresponding cues given to the coders as reference points when coding target affective states and engagement levels. All three dimensions are coded on a continuous scale based on the perceived intensity of the target dimensions.
1. T. Kanda, H. Ishiguro, Human-robot interaction in social robotics (CRC Press, 2017).
2. T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and autonomous systems 42, 143–166 (2003).
3. E. S.-W. Kim, Robots for social skills therapy in autism: Evidence and designs toward clinical utility, Ph.D. thesis, Yale University (2013).
4. G.-Z. Yang, et al., Medical robotics—regulatory, ethical, and legal considerations for increasing levels of autonomy (2017).
5. L. D. Riek, Healthcare robotics, Communications of the ACM 60, 68–78 (November 2017).
6. C. M. Bishop, Pattern recognition and machine learning (Springer, 2006).
7. Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521, 436–444 (May 2015).
8. M. J. Matarić, Socially assistive robotics: Human augmentation versus automation, Science Robotics 2 (March 2017).
9. A. Tapus, M. Mataric, B. Scassellati, Socially assistive robotics [Grand Challenges of Robotics], IEEE Robotics & Automation Magazine 14, 35–42 (2007).
10. A. Peca, Robot enhanced therapy for children with autism disorders: Measuring ethical acceptability, IEEE Technology and Society Magazine 35, 54–66 (June 2016).
11. P. G. Esteban, et al., How to build a supervised autonomous system for robot-enhanced therapy for children with autism spectrum disorder, Paladyn, Journal of Behavioral Robotics 8, 18–38 (April 2017).
12. D. Freeman, et al., Virtual reality in the assessment, understanding, and treatment of mental health disorders, Psychol. Med. pp. 1–8 (2017).
13. American Psychiatric Association, Diagnostic and statistical manual of mental disorders (DSM-5®) (American Psychiatric Pub, 2013).
14. D. L. Christensen, et al., Prevalence and characteristics of autism spectrum disorder among 4-year-old children in the autism and developmental disabilities monitoring network, Journal of Developmental & Behavioral Pediatrics 37, 1–8 (January 2016).
15. D. Feil-Seifer, M. Mataric, Robot-assisted therapy for children with autism spectrum disorders, Proc. of the 7th International Conference on Interaction Design and Children (2008), pp. 49–52.
16. W. A. Bainbridge, J. W. Hart, E. S. Kim, B. Scassellati, The benefits of interactions with physically present robots over video-displayed agents, Int. J. Soc. Robot. 3, 41–52 (January 2011).
17. M. Helt, et al., Can children with autism recover? if so, how?, Neuropsychology review 18, 339–366 (2008).
18. C. M. Corsello, Early intervention in autism, Infants & Young Children 18, 74–85 (April 2005).
19. S. Baron-Cohen, A. M. Leslie, U. Frith, Does the autistic child have a "theory of mind"?, Cognition 21, 37–46 (October 1985).
20. S. Harker, Applied behavior analysis (aba), Encyclopedia of Child Behavior and Development pp. 135–138 (2011).
21. R. L. Koegel, L. Kern Koegel, Pivotal Response Treatments for Autism: Communication, Social, and Academic Development. (2006).
22. J. J. Diehl, L. M. Schmitt, M. Villano, C. R. Crowell, The clinical use of robots for individuals with Autism Spectrum Disorders: A critical review, Research in Autism Spectrum Disorders 6, 249–262 (2012).
23. B. Scassellati, H. Admoni, M. Matarić, Robots for use in autism research, Annual review of biomedical engineering 14, 275–294 (2012).
24. K. Dautenhahn, I. Werry, Towards interactive robots in autism therapy: Background, motivation and challenges, Pragmatics & Cognition 12, 1–35 (2004).
25. P. Liu, D. F. Glas, T. Kanda, H. Ishiguro, Data-driven hri: Learning social behaviors by example from human–human interaction, IEEE Transactions on Robotics 32, 988–1008 (2016).
26. E. S. Kim, R. Paul, F. Shic, B. Scassellati, Bridging the research gap: Making hri useful to individuals with autism, Journal of Human-Robot Interaction 1 (2012).
27. B. M. Scassellati, Foundations for a theory of mind for a humanoid robot, Ph.D. thesis, Massachusetts Institute of Technology (2001).
28. K. Dautenhahn, I. Werry, Towards interactive robots in autism therapy: Background, motivation and challenges, Pragmatics & Cognition 12, 1–35 (2004).
29. C. L. Breazeal, Designing sociable robots (MIT press, 2004).
30. P. Pennisi, et al., Autism and social robotics: A systematic review, Autism Research 9, 165–183 (February 2016).
31. M. A. Goodrich, A. C. Schultz, Human-robot interaction: a survey, Foundations and trends in human-computer interaction 1, 203–275 (February 2007).
32. S. M. Anzalone, S. Boucenna, S. Ivaldi, M. Chetouani, Evaluating the engagement with social robots, International Journal of Social Robotics 7, 465–478 (August 2015).
33. M. B. Colton, et al., Toward therapist-in-the-loop assistive robotics for children with autism and specific language impairment, Autism 24, 25 (2009).
34. J. Hernandez, I. Riobo, A. Rozga, G. D. Abowd, R. W. Picard, Using electrodermal activity to recognize ease of engagement in children during social interactions, Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing (2014), pp. 307–317.
35. M. E. Hoque, Analysis of speech properties of neurotypicals and individuals diagnosed with autism and down, Proceedings of the 10th International ACM SIGACCESS Conference on Computers and Accessibility (2008), pp. 311–312.
36. A. Baird, et al., Automatic classification of autistic child vocalisations: A novel database and results, Interspeech pp. 849–853 (2017).
37. T. Belpaeme, et al., Multimodal child-robot interaction: Building social bonds, Journal of Human-Robot Interaction 1, 33–53 (December 2012).
38. Z. Zheng, et al., Robot-mediated imitation skill training for children with autism, IEEE Transactions on Neural Systems and Rehabilitation Engineering 24, 682–691 (June 2016).
39. J. Sanghvi, et al., Automatic analysis of affective postures and body motion to detect engagement with a game companion, The 6th ACM/IEEE International Conference on Human-Robot Interaction (HRI) (2011), pp. 305–311.
40. J. C. Kim, P. Azzi, M. Jeon, A. M. Howard, C. H. Park, Audio-based emotion estimation for interactive robotic therapy for children with autism spectrum disorder, The 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI) (2017), pp. 39–44.
41. S. S. Rajagopalan, O. R. Murthy, R. Goecke, A. Rozga, Play with me – measuring a child’s engagement in a social interaction, The 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (2015), vol. 1, pp. 1–8.
42. M. I. Jordan, T. M. Mitchell, Machine learning: Trends, perspectives, and prospects, Science 349, 255–260 (Jul 2015).
43. O. Rudovic, J. Lee, L. Mascarell-Maricic, B. W. Schuller, R. W. Picard, Measuring engagement in robot-assisted autism therapy: a cross-cultural study, Frontiers in Robotics and AI 4, 36 (2017).
44. J. Ngiam, et al., Multimodal deep learning, Proceedings of the 28th International Conference on Machine Learning (ICML) (2011), pp. 689–696.
45. N. Jaques, S. Taylor, A. Sano, R. Picard, Multimodal autoencoder: A deep learning approach to filling in missing sensor data and enabling better mood prediction, International Conference on Affective Computing and Intelligent Interaction (ACII) (2017).
46. Y. Bengio, L. Yao, G. Alain, P. Vincent, Generalized denoising auto-encoders as generative models, Advances in Neural Information Processing Systems (2013), pp. 899–907.
47. E. Schopler, M. E. Van Bourgondien, G. J. Wellman, S. R. Love, The childhood autism rating scale, (CARS2) (WPS Los Angeles, 2010).
48. Y. Zhang, Q. Yang, An overview of multi-task learning, National Science Review (2017).
49. N. Jaques, O. Rudovic, S. Taylor, A. Sano, R. Picard, Predicting tomorrow’s mood, health, and stress level using personalized multitask learning and domain adaptation, IJCAI 2017 Workshop on Artificial Intelligence in Affective Computing (2017), pp. 17–33.
50. T. Baltrušaitis, P. Robinson, L.-P. Morency, Openface: an open source facial behavior analysis toolkit, IEEE Winter Conference on Applications of Computer Vision (2016), pp. 1–10.
51. Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, IEEE Conference on Computer Vision and Pattern Recognition (2017).
52. F. Eyben, F. Weninger, F. Gross, B. Schuller, Recent developments in opensmile, the munich open-source multimedia feature extractor, Proceedings of the 21st ACM International Conference on Multimedia (2013), pp. 835–838.
53. P. E. Shrout, J. L. Fleiss, Intraclass correlations: uses in assessing rater reliability, Psychol. Bull. 86, 420 (March 1979).
54. H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring strategies for training deep neural networks, J. Mach. Learn. Res. 10, 1–40 (January 2009).
55. L. v. d. Maaten, G. Hinton, Visualizing data using t-sne, J. Mach. Learn. Res. 9, 2579–2605 (November 2008).
56. A. Shrikumar, P. Greenside, A. Shcherbina, A. Kundaje, Not just a black box: Learning important features through propagating activation differences, International Conference on Computer Vision and Pattern Recognition (2016).
57. J. H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. pp. 1189–1232 (October 2001).
58. V. Podgorelec, P. Kokol, B. Stiglic, I. Rozman, Decision trees: an overview and their use in medicine, Journal of medical systems 26, 445–463 (2002).
59. R. Picard, M. Goodwin, Developing innovative technology for future personalized autism research and treatment, Autism Advocate 50, 32–39 (2008).
60. B. W. Schuller, Intelligent audio analysis (Springer, 2013).
61. E. Brynjolfsson, T. Mitchell, What can machine learning do? workforce implications, Science 358, 1530–1534 (2017).
62. M. R. Herbert, Treatment-guided research, Autism Advocate 50, 8–16 (2008).
63. A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, International Conference on Computer Vision and Pattern Recognition (2017).
64. R. Salakhutdinov, J. B. Tenenbaum, A. Torralba, Learning with hierarchical-deep models, IEEE transactions on pattern analysis and machine intelligence 35, 1958–1971 (August 2013).
65. W. Wang, S. J. Pan, D. Dahlmeier, X. Xiao, Recursive neural conditional random fields for aspect-based sentiment analysis, Computation and Language (2016).
66. A. Mollahosseini, B. Hasani, M. H. Mahoor, Affectnet: A database for facial expression, valence, and arousal computing in the wild, IEEE Transactions on Affective Computing PP, 1-1 (2017).
67. A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems (2012), pp. 1097–1105.
68. T. Chen, I. Goodfellow, J. Shlens, Net2net: Accelerating learning via knowledge transfer, International Conference on Learning Representations (ICLR) (2016).
69. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, Artificial Intelligence and Statistics (2015), pp. 562–570.
70. S. Ruder, An overview of multi-task learning in deep neural networks, arXiv preprint arXiv:1706.05098 (2017).
71. S. A. Taylor, N. Jaques, E. Nosakhare, A. Sano, R. Picard, Personalized multitask learning for predicting tomorrow’s mood, stress, and health, IEEE Transactions on Affective Computing PP, 1-1 (2017).
72. R. El Kaliouby, R. Picard, S. Baron-Cohen, Affective computing and autism, Annals of the New York Academy of Sciences 1093, 228–248 (December 2006).
73. M. Schmitt, E. Marchi, F. Ringeval, B. Schuller, Towards cross-lingual automatic diagnosis of autism spectrum condition in children’s voices, Proceedings of the 12th Symposium on Speech Communication (2016), pp. 1–5.
74. N. Shazeer, et al., Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, International Conference on Learning Representations (ICLR) (2017).
75. Y. Lu, et al., Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
76. R. J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural computation 1, 270–280 (1989).
77. B. Settles, Active learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 6, 1–114 (2012).
78. V. Mnih, et al., Human-level control through deep reinforcement learning, Nature 518, 529–533 (2015).
79. S. Chen, Y. Li, N. M. Kwok, Active vision in robotic systems: A survey of recent developments, The International Journal of Robotics Research 30, 1343–1377 (2011).
80. H. I. Christensen, et al., Next generation robotics, A Computing Community Consortium (CCC) (2016).
81. Y. Bengio, Learning deep architectures for ai, Foundations and trends in Machine Learning 2, 1–127 (November 2009).
82. Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (March 2013).
83. G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313, 504–507 (July 2006).
84. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, Advances in neural information processing systems (2007), pp. 153–160.
85. F. Chollet, et al., Keras (2015).
86. M. Abadi, et al., Tensorflow: A system for large-scale machine learning, Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (USENIX Association, 2016), pp. 265–283.
87. F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12, 2825–2830 (2011).
88. J. A. Hadwin, P. Howlin, S. Baron-Cohen, Teaching Children with Autism to Mind-Read: Workbook (John Wiley & Sons, 2015).
89. T. Baltrušaitis, P. Robinson, L.-P. Morency, 3d constrained local model for rigid and nonrigid facial tracking, IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 2610–2617.
90. P. Ekman, Facial Expressions (John Wiley & Sons, Ltd, 2005).
91. J. F. Cohn, F. De la Torre, Automated face analysis for affective computing, The Oxford handbook of affective computing (2015).
92. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, International Conference on Computer Vision and Pattern Recognition (2014).
93. Empatica e4: https://www.empatica.com/en-eu/research/e4/ (2015).
94. J. Hernandez, D. J. McDuff, R. W. Picard, Bioinsights: extracting personal data from “still” wearable motion sensors, The 12th IEEE International Conference on Wearable and Implantable Body Sensor Networks (2015), pp. 1–6.