Dialogue-based interactive simulation is a class of intelligent tutoring system (ITS) often designed for teaching awareness in domains that are not well structured. Examples include interpersonal skills training [1], cross-cultural competence [2]–[4], negotiation skills [5], [6], confrontation management [7], foreign language learning [8]–[10], and interviewing skills [11].
This paper describes the design and development of an intelligent tutoring system for cross-cultural training. Training to improve cross-cultural interactions is difficult, but simulations and games designed for this task have proven to be more effective than traditional classroom instruction [12]. However, most of these systems have used multiple-choice selection or menu-based interfaces to capture learners’ actions [5], [13]. Menu-based interfaces can be very restrictive and do not reflect the natural mode of human-to-human communication. Multiple-choice interaction may also reduce the educational efficacy of the training systems [14], [15] by limiting the learner’s ability to transfer knowledge gained to a real-world situation.
In addition, the assessment models in existing ITS use simplistic uni-dimensional categorization [2]. These models assign learners’ actions to one of a few categories, such as correct/incorrect or negative/neutral/positive. Such evaluations do not capture the multiple features and dimensions that characterize interactions in a typical unstructured domain.
This paper expands upon existing ITS by incorporating dialogue-based interaction for cultural competency training. Building this new system required completing the following tasks:
1) Abstraction of cultural concepts into a computationally tractable feature space;
2) Training multiple expert models to evaluate and score users’ responses in the simulation;
3) Integrating a speech recognition system as the input interface;
4) Implementing an expert assessment model and an adaptive feedback mechanism that work together to provide feedback based on the trainee input; and
5) Designing and conducting an experiment to evaluate the usability and performance of the system.
Since the goal of our training simulation is to improve the cultural competency of soldiers working with their counterparts from the Chinese army, the ITS described here is built on a Disaster Management Exercise (DME) scenario involving a joint coalition of the US and Chinese armies.
There is a significant amount of literature related to the main elements of this project. By way of organization we have grouped this research into four major areas: Architecture; Interface; Data collection; and User response evaluation.
A. Architecture
A common architecture of intelligent instructional systems consists of four main components: the domain expert model, the learner model, the pedagogical model, and the user interface. The domain expert model contains the concepts, rules and problem-solving strategies for the domain to be learned. It provides the standard for evaluating the learner’s responses. Expert knowledge can be represented in various ways, including network representations [16], behavior trees [17], [18], and finite state machines [19]. Most instructional systems perform a range of actions during interaction, such as providing instruction, hints and feedback to the learner [20]–[22]. The expert model is used to analyze the user’s input in order to provide feedback on errors. The learner model consists of a dynamic representation of the learner’s evolving knowledge. This includes aspects (variables) of the learner’s behavior and knowledge that are assumed to affect performance and learning. The pedagogical model handles the planning and regulation of teaching activities. This includes making decisions on the sequence of activities and the strategy to achieve the learning objectives. The plans of this module are based on observations of the learner’s progress as defined in the domain expert model. The user interface is the connecting point between the tutor and the learner. It serves to collect the user’s input as well as render the actions of the tutor. It may come in different forms and media, including a web/mobile app interface or a virtual-reality environment.
B. Speech-based dialogue interface
Speech-driven interfaces in human-computer tutoring systems have been shown to help improve learning [14], [15] and to facilitate adaptation of the instructional strategy based on perceived affect in learners’ responses [23]. While human tutors can respond to the content of learners’ responses and perceive their emotions [24], the menu-option selection interfaces in existing dialogue-based cultural training simulations limit this capability. The need to close this performance gap in simulation systems for learning cultural awareness, coupled with recent advances in natural language processing, automatic speech recognition (ASR) and text-to-speech systems, drove the investigations presented in this paper.
C. Data collection
Data-driven dialogue systems have mostly relied on Wizard-of-Oz (WOz) [25] or role-play simulation [26] methods to gather data. The collected data is then used to train a natural language understanding model to recognize users’ input to the system. Other works have approached data collection through crowd-sourcing on Amazon Mechanical Turk (m-turk) [27]. Sharma et al. [27] went a step further by applying a data augmentation technique to improve the sample size and class distribution of originally crowd-sourced data.
D. Response evaluation and feedback mechanism
Evaluating the trainee’s response is the main task of the expert model in an intelligent tutoring system. However, semantic representation and natural language understanding steps need to be performed prior to this evaluation. Continuous bag of words, term frequency-inverse document frequency (TF-IDF) and word vector representations are some of the techniques that have been proposed in the natural language processing literature [28]–[31].
Menu-based option selection systems map each pre-defined trainee response to a corresponding performance measure and feedback [13], [32], [33]. These systems evaluate trainees’ input using a supervised machine learning classification algorithm that learns from training examples annotated by subject matter experts [1], [2], [4]. The trained models classify new input as positive, neutral, or negative; others use a simpler binary categorization of positive or negative [2]. Applying this annotation method to a cultural training simulation would mean defining a score as 1 = Culturally inappropriate; 2 = Neither appropriate nor inappropriate; and 3 = Culturally appropriate.
This approach has a number of limitations for our system: i) The complexity of culture-sensitive interaction requires context to properly classify a dialogue. A single score is uninformative and captures neither context nor dialogue history; ii) Distinguishing between the labels may also be a problem, as an utterance could fall into more than one category [2]. This happens if a response contains tokens that qualify it as an appropriate response but also contains an addendum that changes the meaning when the utterance is considered as a whole. Humans may still be able to separate such responses, but the model will be uncertain as to which score to assign to such an utterance; iii) Adaptive feedback is constructed to help learners get better as they interact with the system. Handcrafted feedback may be off point and may not generalize to all possible expressions that players could make. This may degrade the user experience and provide no information on where the trainee actually erred. It also makes constructing adaptive feedback almost impossible; iv) Lastly, a single-score assignment model does not reflect the way humans judge an utterance for cultural appropriateness. Even though there may be individual differences in interpretation, it is natural to decompose a sentence into components that fit different cultural features. In this paper, we extend this single annotation mechanism into a multi-label model where each label defines a feature that characterizes the utterance based on the expectations of the target culture. This method facilitates the decomposition of possible user responses into a feature space that simplifies the natural language understanding task for the machine learning expert model.
Similar to the information state update formulation in task-oriented dialogue systems [34]–[37], this method also defines a structure in which cultural training dialogues become easier to implement, in addition to facilitating adaptive feedback. Adapting instruction, hints and feedback [20] to a trainee’s performance in a simulation environment may lead to improved awareness of the target domain [13], [38] and also lets trainees know how to provide more appropriate responses subsequently [22].
A. Data collection, annotation and scoring methods
In an earlier work [27] we described the foundation for our data collection effort. Based on the dialogue designed by the Chinese culture experts [32], players in the simulation are evaluated at fourteen (14) different points during the interaction. In order to train the expert models for this task, we crowd-sourced data using Amazon Mechanical Turk (m-turk). Three approaches were used to prompt m-turk users: i) using context from the scenario, they were asked to provide their own responses to the avatar’s comments; ii) using feedback provided by the Chinese culture experts in our multiple-choice version [32], they were asked to construct alternative responses; and iii) they were asked to rephrase the multiple-choice responses in the earlier version [32] of the simulation. The collected data was annotated by a minimum of two (2) annotators. The annotators had an average inter-rater reliability of 66% [27]. The next section details how the annotated data was used in training the models to evaluate and score trainees’ responses in the simulation. The scores assigned by the models are then mapped to the abstracted concepts. This mapping is used to construct feedback that informs the players of their errors.
1) Cultural Concept abstraction: In [32], subject matter experts provided feedback for each pre-defined option in the system. Using this feedback, we developed binary features by abstracting the concepts that the Chinese culture experts used to determine the appropriateness of a response. These features allowed us to cast the problem as a multi-label machine learning classification problem.
Multi-label annotation assigns a score vector with each component being mutually exclusive as follows;
This approach makes more sense for a cultural training simulation because: i) In human-to-human interaction, an inappropriate utterance may not be completely inappropriate, as elements of the utterance may still match the interlocutor’s expectation. By decomposing the interlocutor’s expectation into a set of features, the model can assess an utterance feature by feature, based on the natural language representation discussed in section II-D, and decide whether each feature is a hit or a miss. ii) Since context is also very important, the features can be defined to implicitly capture what makes an appropriate response within a particular context.
Table I shows example responses in one of the sections of the simulation, where the player tries to greet the Chinese officer. The features in this section are as follows:
A: Greets officer by saying: “Hi”, “Hello” or “Good morning”;
B: Avoids asking about officer’s welfare on first meeting; and
C: Uses an honorific expression
This feature-based utterance labeling allows us to automate feedback construction: each response is mapped to the features, and feedback is constructed based on the score received from the expert model.
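The labeling scheme above can be illustrated with a short sketch. The Python below is not the authors' annotation tool; the names `SECTION_1_FEATURES`, `annotate` and `all_label_combinations` are hypothetical and chosen only for illustration. It encodes which of the features A, B and C an utterance satisfies as a binary score vector, and enumerates the label combinations that k binary features allow.

```python
from itertools import product

# Feature identifiers for the greeting section (A, B, C as listed above).
SECTION_1_FEATURES = ("A", "B", "C")


def annotate(hit_features, features=SECTION_1_FEATURES):
    """Return a binary score vector: 1 where the utterance satisfies a feature."""
    unknown = set(hit_features) - set(features)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return [1 if name in hit_features else 0 for name in features]


def all_label_combinations(k):
    """Enumerate every possible score vector for k binary features (2**k of them)."""
    return list(product((0, 1), repeat=k))


# An utterance that greets the officer but misses features B and C.
example_vector = annotate({"A"})
```

Under these assumptions, an utterance that only greets the officer maps to `[1, 0, 0]`, and a section with three binary features admits eight distinct label combinations.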
TABLE I: Cultural concept abstraction
B. System Architecture
Figure 1 shows the architecture of the simulation. The modular architecture allowed for easy integration, troubleshooting and error correction. We describe the interface and each of the components in more detail below.
Fig. 1: Simulation system architecture
1) User Interface: Virtual environment: The simulation environment was designed using the Unity game engine. Characters were modeled and rigged using Adobe Fuse, while animation was done using the Mixamo 3D character animator. Lip-syncing and facial animation helped bring the characters to life and created a more realistic interaction in the virtual environment. Figure 2 shows one of the American soldier characters in the virtual environment.
Fig. 2: virtual simulation environment
Voice Detection (VDM) and Automatic Speech Recognition (ASR) Modules: The voice detection module is a gated listening mechanism that manages turn-taking between the player and the avatar. It uses a sampling technique [39] to detect when the player starts to speak so as to begin the recording. Once the player completes their response, the recorded audio is passed to the ASR system for speech-to-text transcription. While the avatar is speaking, the listening gate is shut until the feedback has been completely delivered to the player. We found better transcription accuracy using this approach than with direct streaming to the ASR system. Recording the audio section by section also allowed us to evaluate the effect of ASR errors on the models’ ability to properly score the player’s input. We describe this in more detail later. We used the Google speech-to-text API for the speech transcription.
2) Cultural-Expert Module: This module houses the pre-trained input vectorizer and the multi-label classifier models, as shown in figure 3. We trained multiple multi-label classifier models as our domain experts to evaluate and score the player’s response for each section. We treated the problem of recognizing whether a user’s utterance is culturally appropriate as a multi-label text classification problem [40], [41], as described in section II-D. Each user utterance is scored based on the features defined for the section. Since the goal of the simulation is to improve the cultural awareness of soldiers working with their counterparts from the Chinese army, we built the scenario around a Disaster Management Exercise (DME) involving a joint coalition effort between the US and Chinese armies. Chinese culture experts designed a dialogue tree structure with fourteen (14) strategic points where the players’ responses are evaluated for cultural awareness. For improved accuracy, each of these evaluation tasks is handled by a separate evaluation model trained on a separate dataset, as described in section II-C.
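The one-model-per-evaluation-point arrangement described above can be sketched as a small dispatcher. This is an illustrative outline under stated assumptions, not the authors' implementation: the class name `CulturalExpertModule`, its methods, and the toy vectorizer and classifier are all hypothetical stand-ins for the trained components.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vectorizer = Callable[[str], Sequence[float]]
Classifier = Callable[[Sequence[float]], List[int]]


class CulturalExpertModule:
    """Routes an utterance to the pipeline registered for its evaluation point."""

    def __init__(self) -> None:
        self._pipelines: Dict[int, Tuple[Vectorizer, Classifier, int]] = {}

    def register(self, section: int, vectorizer: Vectorizer,
                 classifier: Classifier, n_features: int) -> None:
        """Attach a (vectorizer, classifier) pair to one evaluation point."""
        self._pipelines[section] = (vectorizer, classifier, n_features)

    def score(self, section: int, utterance: str) -> List[int]:
        """Return the binary score vector for an utterance in a given section."""
        if section not in self._pipelines:
            raise KeyError(f"no model registered for section {section}")
        vectorizer, classifier, n_features = self._pipelines[section]
        scores = list(classifier(vectorizer(utterance)))
        if len(scores) != n_features:
            raise ValueError("classifier returned the wrong number of labels")
        return scores


# Toy stand-ins for a trained vectorizer and classifier (NOT the trained models).
def toy_vectorizer(text: str) -> List[float]:
    words = text.lower().split()
    return [float("morning" in words), float("how" in words)]


def toy_classifier(vector: Sequence[float]) -> List[int]:
    return [int(vector[0] > 0), int(vector[1] == 0), 0]
```

Keeping the pipelines in a dictionary keyed by evaluation point means each section can use its own vectorizer, classifier and number of features without affecting the others.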
Fig. 3: Cultural-expert module
Input representation: After experimenting with different dimensions of semantic representations such as pretrained word vectors [31], bag of words (BOW) and Latent Dirichlet Allocation (LDA) [42], we found that the term frequency-inverse document frequency (TF-IDF) representation gave the best performance [30] in our model. Each evaluation point also had a pre-trained vectorizer that outputs the TF-IDF vector representation for each data point in our dataset. We used the same vectorizers in the simulation to convert the text output from the ASR system into the corresponding vector representation. The architecture for this module is shown in figure 3.
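To make the fit-once, reuse-at-runtime idea concrete, the sketch below implements a minimal TF-IDF vectorizer in plain Python. It is a simplified illustration, not the vectorizer used in the paper: the class name `TinyTfidf`, the whitespace tokenizer, the smoothed IDF formula and the L2 normalization are all assumptions made for this example.

```python
import math
from collections import Counter


class TinyTfidf:
    """Minimal TF-IDF vectorizer: fit on training text, reuse on new text."""

    def fit(self, documents):
        tokenized = [doc.lower().split() for doc in documents]
        # Fixed vocabulary learned at training time.
        self.vocabulary = sorted({tok for doc in tokenized for tok in doc})
        n_docs = len(tokenized)
        doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
        # Smoothed inverse document frequency (an assumed variant).
        self.idf = {
            tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            for tok in self.vocabulary
        }
        return self

    def transform(self, text):
        """Vectorize one utterance; words outside the vocabulary are ignored."""
        counts = Counter(text.lower().split())
        raw = [counts[tok] * self.idf[tok] for tok in self.vocabulary]
        norm = math.sqrt(sum(value * value for value in raw))
        return [value / norm for value in raw] if norm else raw


# Fit on (hypothetical) training utterances, then vectorize an ASR transcript.
vectorizer = TinyTfidf().fit(
    ["good morning captain", "hello captain", "how are you"]
)
transcript_vector = vectorizer.transform("good morning sir")
```

Because the vocabulary and IDF weights are frozen after fitting, a transcript from the ASR system is always projected into the same feature space the classifier was trained on.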
It is important that the expert module receives the correct text to score, so the output of the ASR system must be as accurate as possible. We therefore set a confidence threshold for the output from the ASR system such that, if the confidence is below this threshold, the avatar asks the player to repeat themselves.
The ASR module is configured to output both the transcribed text and a confidence value for the recognized speech. We tried different values of the confidence threshold to balance the trade-off between frustrating users and the accuracy of the expert model. The threshold allows the speech recognizer to pass only output above it to the expert module. If the ASR confidence level is below this threshold, the dialogue loops back and the avatar asks the player to repeat what they said.
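The confidence gate can be sketched as a simple retry loop. The code below is an illustration under stated assumptions rather than the deployed logic: `AsrResult`, `collect_response`, the default threshold of 0.8 and the cap on attempts are all hypothetical, since the threshold value and retry policy are not reported here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AsrResult:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def collect_response(
    recognize: Callable[[], AsrResult],
    ask_to_repeat: Callable[[], None],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
    max_attempts: int = 3,   # assumption: cap retries so the sketch terminates
) -> Optional[str]:
    """Return a transcript only when the ASR confidence reaches the threshold."""
    for attempt in range(max_attempts):
        result = recognize()
        if result.confidence >= threshold:
            # Confident enough: pass the text on to the expert module.
            return result.text
        if attempt < max_attempts - 1:
            # Loop back: the avatar asks the player to repeat the response.
            ask_to_repeat()
    return None
```

A higher threshold protects the expert model from mis-transcribed input at the cost of more repeat requests, which is the trade-off against user frustration described above.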
Classifier models: The models were trained on our annotated dataset, as detailed in [27]. We experimented with different classifier models as well as different combinations of the input vector dimensions to determine the best-performing model. We evaluated the performance of the models based on the F1-score, precision and recall. We obtained the best performance with k-nearest neighbours (KNN), RandomForest and multi-layer perceptron (MLP) classifier models. The RandomForest and MLP models were integrated into the final simulation because, while KNN performed well, its computational cost during the simulation was too high, and the time taken to evaluate the player’s response made it less suitable for deployment in a real-time interactive simulation. Our evaluation metrics were chosen to capture both the pedagogical objective and user-experience considerations. Below, we detail the suitability of each of these metrics.
Precision: is calculated as the ratio of true positives (TP) to the sum of true positives and false positives (FP). High precision indicates high TP and low FP, which means that the expert module mostly recognizes when the users give an appropriate response. Low FP also means that players were rarely credited with a score when they gave an inappropriate response. A false positive is bad for the pedagogical efficacy of the system, since the users will not be corrected for what they did wrong. Minimizing FP is therefore important for the pedagogical efficacy of our system.
Recall: is the ratio of TP to the sum of TP and false negatives (FN). High recall indicates a high TP-to-FN ratio, which is desirable for a good user experience in the simulation. It means that the classifier mostly recognizes when the users give an appropriate response and scores them properly for it. Since the feedback received is based on the output score, low FN means the users are not frustrated by the system correcting them even when they give an appropriate response. We wanted a good recall rate so that players would not receive feedback saying their response is inappropriate when it is in fact appropriate. Inaccurate feedback is bad for the educational effectiveness of the interaction, as players may get confused as to what actually defines an appropriate response.
F1 score: is the harmonic mean of the precision and recall metrics. The F1 score is an important metric in this system because it is pulled toward the lower of the two values, so a poor precision or a poor recall cannot be masked by the other. It offers information on the trade-off we are making between pedagogical efficacy and the user experience of the simulation. Precision and recall can individually be maximized by minimizing false positives and false negatives, respectively.
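The three metrics defined above can be computed directly from TP, FP and FN counts. The sketch below pools the counts over all labels (micro-averaging), which is an assumption made for this example since the averaging scheme is not stated here; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP and FN over all labels of all score vectors (micro-averaged)."""
    tp = fp = fn = 0
    for truth, prediction in zip(y_true, y_pred):
        for t, p in zip(truth, prediction):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1  # credited for a feature the human annotators did not grant
            elif p == 0 and t == 1:
                fn += 1  # corrected despite satisfying the feature
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Human-annotated baseline vs. model output for three utterances (made-up data).
baseline = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
model_output = [[1, 0, 0], [1, 0, 1], [0, 1, 1]]
metrics = precision_recall_f1(*confusion_counts(baseline, model_output))
```

Because F1 is a harmonic mean, a precision of 0.9 paired with a recall of 0.5 gives an F1 of about 0.64, below the arithmetic mean of 0.7.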
ASR Word Error Rate: measures the performance of the speech recognition module. This metric helps us understand and compare the performance of the expert models, since model performance depends on whether the model received the correct tokens from the ASR system. If, due to a speech recognition error, the expert module does not find the token that identifies a user response as appropriate, it will consider the response inappropriate and may assign a score of zero for that feature.
3) Pedagogical module: Two pedagogical components were implemented to help players navigate interactions with the avatars. A simulation guide helps players navigate the interaction by providing background information about the non-player characters. The feedback module informs the players of what was right and wrong about their responses. We describe both components in detail below.
Simulation guide: The simulation guide is pre-recorded audio with a corresponding text display, as shown in figures 2 and 4. It provides the players with the context of the scenarios and guides them by providing additional information that they need to properly construct their responses. This is important for several reasons: i) it helps players clarify and contextualize what was said by the avatar and connects it to the history of the conversation, helping them recall the dialogue in previous scenes and relate it to the current one; ii) it helps narrow the space of potential responses so that players can articulate their response within a shorter time; iii) it helps the flow of the interaction when players are able to respond promptly; iv) it increases the players’ immersion in the scenes and their likelihood of making the proper response; and v) for non-military players who may not be familiar with the scenario, the guide helps them understand the context of what they need to deal with.
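Word error rate is conventionally computed as the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below follows that standard definition; the function name and the lowercase whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)
```

For instance, if “good morning captain wang” is transcribed as “good morning captain one”, one of four reference words is substituted, giving a WER of 0.25.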
Adaptive feedback generation: Adaptive feedback based on intelligent error analysis of learners’ solutions can help learners achieve correct responses with minimal error as they progress through the simulation [22], [38]. This is especially useful for our simulation because it helps players continuously improve their responses as they try to implement the corrections in subsequent turns [43]. Our adaptive feedback mechanism takes the output score vector from the expert module, as shown in equation 1, and maps it to the domain feature space of the dialogue. Depending on the score output, the feedback generator constructs feedback that tells the player what an appropriate response should be, what they did right, and what they could improve upon in subsequent dialogue.
Fig. 4: Simulation guide
As an example, consider the first response from table I, an utterance that a player could say when meeting the Chinese captain for the first time (i.e., “Good morning captain Wang, how are you doing?”). The domain-expert module returns a score vector of [1,0,0] over features A, B and C of this section, and the feedback generated is as follows:
“a culturally appropriate response in this section should include greeting the officer, avoiding asking about the officer’s welfare on a first meeting, and using an honorific expression. From your response, you succeeded in greeting the officer, but your response could be improved by avoiding asking about the officer’s welfare on a first meeting and using an honorific expression.”
The generated feedback text is passed to the feedback speech synthesizer, converted to audio and played to the player before proceeding to the next turn.
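The mapping from a score vector to feedback text can be sketched with a simple template. This is a minimal illustration, not the authors' generator: the names `GREETING_FEATURES`, `join_phrases` and `build_feedback` are hypothetical, and the wording for the all-correct and all-incorrect cases is assumed, since only the mixed case is quoted above.

```python
# Feature phrases for the greeting section, written so they fit the template.
GREETING_FEATURES = [
    "greeting the officer",
    "avoiding asking about the officer's welfare on a first meeting",
    "using an honorific expression",
]


def join_phrases(phrases):
    """Join phrases as 'a', 'a and b', or 'a, b, and c'."""
    if len(phrases) == 1:
        return phrases[0]
    if len(phrases) == 2:
        return " and ".join(phrases)
    return ", ".join(phrases[:-1]) + ", and " + phrases[-1]


def build_feedback(features, scores):
    """Turn a binary score vector into feedback on hits and misses."""
    if not features or len(features) != len(scores):
        raise ValueError("features and scores must be non-empty and aligned")
    hits = [f for f, s in zip(features, scores) if s == 1]
    misses = [f for f, s in zip(features, scores) if s == 0]
    expectation = ("A culturally appropriate response in this section should "
                   "include " + join_phrases(features) + ".")
    if hits and misses:
        detail = ("From your response, you succeeded in " + join_phrases(hits)
                  + ", but your response could be improved by "
                  + join_phrases(misses) + ".")
    elif hits:
        # Assumed wording for a fully appropriate response.
        detail = "From your response, you succeeded in " + join_phrases(hits) + "."
    else:
        # Assumed wording when no feature was satisfied.
        detail = "Your response could be improved by " + join_phrases(misses) + "."
    return expectation + " " + detail
```

Since the text is assembled from the feature phrases and the score vector, no feedback has to be written by hand for individual utterances.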
We recruited 18 subjects to test the usability and performance of the simulation. We evaluated the performance of the models and the overall system as a potential training tool for cultural awareness. Participants were mostly undergraduates from the school of engineering at the University of Virginia. The objective of our simulation system design was to explore a culturally aware, data-driven model that evaluates the speech-based dialogue between the user and the computer avatar. Our experiment did not test learning improvement in the target culture, as this was outside the scope of this paper. The experiment was approved by the IRB, and all participants consented to the collection of their data.
Our hypotheses are as follows:
H1: The expert model will not assign scores comparable with human-adjudged scores based on the players’ responses in the simulation.
H2: The word error rate of the speech recognizer does not have any impact on the performance of the expert model when compared with manually transcribed input.
H3: The adaptive feedback mechanism aids users in making more culturally appropriate responses as they progress through the interaction.
H4: An integrated interactive simulation system with speech-driven input designed for cultural awareness training creates room for more naturalistic interaction with a better user experience.
To analyse the performance of the cultural-expert models, we collected the text transcripts of the participants’ responses as well as the audio recordings. The logged text also includes the score assigned by the expert module and the ASR confidence level. For the user experience, we administered questionnaires to the participants to understand what went right and how the system can be improved.
The first five subjects’ data were removed because the system did not use the final versions of our classification models. This left the data for 13 participants. The system also froze twice, and the data collected for the two (2) affected participants was omitted from the analysis as well, leaving data for eleven (11) participants.
A. Expert Model performance
The recorded audio responses were manually transcribed, and the result was then annotated by two human annotators. The annotators had an inter-rater agreement of 78%, and the scores they assigned were compared with the output scores from the expert models. Using the scores from the human annotators as a baseline, we show the performance of the expert models in table II. The model-assigned scores were compared with the baseline to calculate the F1-score, recall, and precision. We also considered the ASR system’s word error rate to explain in part what was observed in the performance of the models.
Fig. 5: Model performance and ASR word error rate
TABLE II: Performance of cultural expert module
Figure 5 plots the model performance in table II. The expert models had an average F1-score of 80.7%, an average precision of 79.4% and an average recall of 84.4%. The average WER of the ASR system was 19.6%.
H1: The F1-measure of the expert models, as shown in table II, averages about 80%. Except for models 7 and 8, with 3 and 4 features respectively, all models had an F1-score above 75%. Models 7 and 8 had more binary features, resulting in $2^k$ classes (8 and 16 for $k = 3$ and $k = 4$) into which every response could fall. From this we see that the number of features affects the performance of the models, as the classification task gets more difficult with an increasing number of features.
The high WER for section 1 arises because participants mostly mentioned the name of the Chinese avatar (“Captain Wang”) with whom they were speaking, and the ASR system struggled to recognize it. Training the ASR system on our own dataset might have improved the performance in this section. With an average precision of 79.4%, true positives (TP) substantially outnumber false positives (FP), which shows that the models rarely assign a score when they receive an inappropriate response from the players. Similarly, the average recall of 84.4% shows that the models’ ratio of TP to the sum of TP and false negatives (FN) is around 0.84. This indicates a positive user experience, as players are mostly not corrected for what should pass as a culturally appropriate response. Both measures show high TP values, which is good for the pedagogical efficacy of the simulation. Since low recall values indicate higher FN, models 7 and 8, which have the lowest recall, corrected players more often, on average, than a human evaluator would have corrected similar utterances. Based on the F1-score (t(13) = 26.49, p = 1.07e-12), we have enough evidence to conclude that the performance of the expert module was comparable with that of human experts using manually transcribed data.
H2: We performed a simple regression analysis to examine the relationship between the WER and the F1-score. Our result [t(13) …] indicated that no significant relationship exists between the word error rate of the speech recognizer and the performance of the expert models. Similarly, the performance of the ASR system had no impact on the other metrics (precision and recall) of the expert models. This shows that the tokens needed to score the participants’ responses were mostly recognized properly. For example, in model 1, where participants receive a score when they greet the army officer with whom they are interacting, tokens such as “Good morning”, “hi” and “hello” were mostly correctly recognized and appropriately scored by the model. However, a multiple regression analysis using the number of features together with the WER to predict the F1-score of the models showed that both variables have a jointly significant effect [F(2,11) = 6.179, p = 0.016] on the performance of the models. Therefore, as the number of features increases, the speech recognizer error tends to begin to affect model performance.
H3: Using average aggregate scores across sections (equation (2)), we evaluated the performance of the participants as they progressed through the simulation. We hypothesized that providing adaptive feedback to the players based on the model-assigned score would improve their performance as they progressed through the interaction. Natural human-to-human interaction does not usually provide explicit feedback to help speakers identify errors in their responses. However, it was still necessary to provide feedback to participants to help them learn of their mistakes based on the norms of the target culture.
In equation (2), N is the number of participants and k is the number of cultural concepts that each model evaluates in the users’ utterances; k is defined for each model as the number of labels/features indicated in table II.
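One way to compute an average aggregate score for a section from the quantities N and k defined above is sketched below. The exact form of equation (2) is not reproduced here, so this is an assumed formulation: each participant's score vector is normalized by k so that sections with different numbers of features are comparable, and the result is averaged over the N participants. The function name is hypothetical.

```python
def average_aggregate_score(score_vectors):
    """Mean over N participants of (sum of the k feature scores) / k."""
    n_participants = len(score_vectors)
    if n_participants == 0:
        raise ValueError("at least one participant is required")
    k = len(score_vectors[0])
    if k == 0 or any(len(vector) != k for vector in score_vectors):
        raise ValueError("every score vector must have the same non-zero length k")
    # Assumption: normalize by k so that sections with different k are comparable.
    return sum(sum(vector) / k for vector in score_vectors) / n_participants


# Made-up score vectors for three participants in a section with k = 3 features.
section_scores = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
section_average = average_aggregate_score(section_scores)
```

With this normalization the value lies between 0 and 1 for every section, so averages can be plotted side by side across the fourteen evaluation points.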
Fig. 6: Participants average score
From Figure 6, we did not observe any improvement in the aggregate score received by the participants during the interaction. This may be due to the independence of the features at each evaluation point. Also, most of the participants had no prior experience using such a simulation system for cultural training, and this may have affected their performance as first-timers. The participants may also have found it difficult to remember the corrections they received in previous sections. Moreover, lessons learnt in one simulation run can be readily applied in a subsequent run, but not necessarily during the same interaction. Players would have had time to internalize the corrections from the first experience, and if they went through the simulation a second time, their performance might be better.
Fig. 7: User experience survey result
H4: We used a post-simulation survey to evaluate the users’ experience. We issued a questionnaire to the participants after they completed the scenarios in the simulation to evaluate the performance of the integrated system. Five (5) questions were asked to measure the performance of each component.
Figure 7 shows that a minimum of 80% of the participants had a positive response about each component. We also asked an open question in which participants could freely express what they thought needed to be improved in the system. Responses to this question are presented in the word cloud.
Comments made by the participants on improving the simulation include: i) adding a progress bar; ii) improving the feedback system; and iii) making the interaction more understandable to a non-military audience.
Appendix-I shows an example of a participant’s dialogue with the Chinese avatar during the simulation as well as the score received from the classification and the corresponding generated feedback.
Fig. 8: Word cloud of participants’ responses
In this paper we presented a data-driven simulation system for cultural awareness training. We improved on previous simulation systems by implementing a spoken dialogue system in place of the menu-based option selection model. We implemented data-driven, multi-model cultural experts to evaluate and assign scores to users’ spoken responses based on abstracted cultural features. Our design also implemented an adaptive feedback mechanism that uses the score assigned by the expert module to serve the pedagogical role of informing the players how culturally appropriate their responses were. We evaluated the system by comparing the performance of the models with scores assigned by two human annotators. Our results showed that our data-driven intelligent expert models performed comparably to human judgment of similar utterances. The feature abstraction technique and adaptive feedback mechanism also removed the need to manually score users’ utterances. We also did not have to write feedback for every possible utterance that a user could make in the simulation. These components also allowed for real-time adaptive feedback construction based on the score vector assigned to the learner’s utterance. Whereas most prior works have focused on developing techniques to improve individual components of the intelligent tutoring system, our simulation focused on improving most of the modules to improve the realism of the interaction. Because providing explicit feedback in an interactive situation does not reflect the natural way in which learning occurs during interaction, our future work will explore techniques for adaptive response generation with implicit feedback. This has the ultimate goal of improving the learning efficacy of cultural training simulations.
The project is funded by the Army Research Lab under Research Funding Source Award Number: W911NF1820279. We thank the Data Science Institute of the University of Virginia for the support received for this project.
[1] H. C. Lane, M. G. Core, D. Gomboc, A. Karnavat, and M. Rosenberg, “Intelligent tutoring for interpersonal and intercultural skills,” University of Southern California, Marina del Rey, CA, Inst. for Creative . . . , Tech. Rep., 2007.
[2] H. C. Lane and M. J. Hays, “Getting down to business: Teaching cross-cultural social interaction skills in a serious game,” in Workshop on Culturally Aware Tutoring Systems (CATS), 2008, pp. 35–46.
[3] P. Dillenbourg, D. Schneider, and P. Synteta, “Virtual learning environments,” in 3rd Hellenic Conference “Information & Communication Technologies in Education”. Kastaniotis Editions, Greece, 2002, pp. 3–18.
[4] H. C. Lane, M. J. Hays, M. G. Core, and D. Auerbach, “Learning intercultural communication skills with virtual humans: Feedback and fidelity.” Journal of Educational Psychology, vol. 105, no. 4, p. 1026, 2013.
[5] J. M. Kim, R. W. Hill Jr, P. J. Durlach, H. C. Lane, E. Forbell, M. Core, S. Marsella, D. Pynadath, and J. Hart, “Bilat: A game-based environment for practicing negotiation in a cultural context,” International Journal of Artificial Intelligence in Education, vol. 19, no. 3, pp. 289–308, 2009.
[6] M. Core, D. Traum, H. C. Lane, W. Swartout, J. Gratch, M. Van Lent, and S. Marsella, “Teaching negotiation skills through practice and reflection with virtual humans,” Simulation, vol. 82, no. 11, pp. 685–701, 2006.
[7] J. Kolkmeier, M. Lee, and D. Heylen, “Moral conflicts in VR: Addressing grade disputes with a virtual trainer,” in International Conference on Intelligent Virtual Agents. Springer, 2017, pp. 231–234.
[8] M. L. Swartz and M. Yazdani, Intelligent tutoring systems for foreign language learning: The bridge to international communication. Springer Science & Business Media, 2012, vol. 80.
[9] H. Wang, C. J. Waple, and T. Kawahara, “Computer assisted language learning system based on dynamic question generation and error prediction for automatic speech recognition,” Speech Communication, vol. 51, no. 10, pp. 995–1005, 2009.
[10] P. Wik and A. Hjalmarsson, “Embodied conversational agents in computer assisted language learning,” Speech Communication, vol. 51, no. 10, pp. 1024–1037, 2009.
[11] Z. Yu, V. Ramanarayanan, P. Lange, and D. Suendermann-Oeft, “An open-source dialog system with real-time engagement tracking for job interview training applications,” in Advanced Social Interaction with Agents. Springer, 2019, pp. 199–207.
[12] D. E. Brown, A. Moenning, S. Guerlain, B. Turnbull, D. Abel, and C. Meyer, “Design and evaluation of an avatar-based cultural training system,” The Journal of Defense Modeling and Simulation, vol. 16, no. 2, pp. 159–174, 2019.
[13] K. Georgila, M. G. Core, B. D. Nye, S. Karumbaiah, D. Auerbach, and M. Ram, “Using reinforcement learning to optimize the policies of an intelligent tutoring system for interpersonal skills training,” in Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 737–745.
[14] D. J. Litman and S. Silliman, “ITSPOKE: An intelligent tutoring spoken dialogue system,” in Demonstration papers at HLT-NAACL 2004. Association for Computational Linguistics, 2004, pp. 5–8.
[15] R. Moreno, R. E. Mayer, H. A. Spires, and J. C. Lester, “The Case for Social Agency in Computer-Based Teaching: Do Students Learn More Deeply When They Interact With Animated Pedagogical Agents?” Cognition and Instruction, vol. 19, no. 2, pp. 177–213, Jun. 2001. [Online]. Available: https://doi.org/10.1207/S1532690XCI1902_02
[16] H. Gamboa and A. Fred, “Designing intelligent tutoring systems: a Bayesian approach,” Enterprise Information Systems III. Edited by J. Filipe, B. Sharp, and P. Miranda. Springer Verlag: New York, pp. 146–152, 2002.
[17] X. Geng, S. Qin, H. Chang, and Y. Yang, “A hybrid knowledge representation for the domain model of intelligent flight trainer,” in 2011 IEEE International Conference on Cloud Computing and Intelligence Systems. IEEE, 2011, pp. 29–33.
[18] T.-W. Chan, “Curriculum tree: a knowledge-based architecture for intelligent tutoring systems,” in International Conference on Intelligent Tutoring Systems. Springer, 1992, pp. 140–147.
[19] A. Yankovskaya and N. Yevtushenko, “Finite state machine (FSM)-based knowledge representation in a computer tutoring system,” New Media and Telematic Technologies for Education in Eastern European Countries, pp. 67–74, 1997.
[20] J. Vassileva, “Reactive instructional planning to support interacting teaching strategies,” in Proceedings of the 7-th World Conference on AI and Education, 1995, pp. 334–342.
[21] S. Gross, B. Mokbel, B. Hammer, and N. Pinkwart, “Feedback Provision Strategies in Intelligent Tutoring Systems Based on Clustered Solution Spaces,” in DeLFI, 2012.
[22] F. Gutierrez and J. Atkinson, “Adaptive feedback selection for intelligent tutoring systems,” Expert Systems with Applications, vol. 38, no. 5, pp. 6146–6152, 2011.
[23] K. Forbes-Riley and D. Litman, “Designing and evaluating a wizarded uncertainty-adaptive spoken dialogue tutoring system,” Computer Speech & Language, vol. 25, no. 1, pp. 105–126, Jan. 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0885230809000734
[24] D. Litman and K. Forbes, “Recognizing emotions from student speech in tutoring dialogues,” in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), Nov. 2003, pp. 25–30.
[25] J. Gratch, D. DeVault, G. M. Lucas, and S. Marsella, “Negotiation as a challenge problem for virtual humans,” in International Conference on Intelligent Virtual Agents. Springer, 2015, pp. 201–215.
[26] C. K. Shepherd, M. McCunnis, L. Brown, and M. Hair, “Investigating the use of simulation as a teaching strategy.” Nursing Standard, vol. 24, no. 35, 2010.
[27] V. Sharma, B. Shpringer, S. M. Yang, M. Bolger, S. Adewole, D. Brown, and E. Gharavi, “Data collection methods for building a free response training simulation,” in 2019 Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2019, pp. 1–6.
[28] M. Mateas and A. Stern, “Structuring content in the Façade interactive drama architecture,” in AIIDE, 2005, pp. 93–98.
[29] T. Joachims, “A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization,” Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, Tech. Rep., 1996.
[30] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, vol. 24, no. 5, pp. 513–523, 1988.
[31] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[32] M. Sheridan, B. An, D. Brown, M. Bolger, M. Epstein, F. Matteo, R. Semunegus, C. J. Daniel, C. Clarkin, G. Gringley et al., “Investigating the effectiveness of virtual reality for cross-cultural competency training sieds 2018,” in 2018 Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2018, pp. 53–57.
[33] A. Moenning, B. Turnbull, D. Abel, C. Meyer, M. Hale, S. Guerlain, and D. Brown, “Developing avatars to improve cultural competence in US soldiers,” in 2016 IEEE Systems and Information Engineering Design Symposium (SIEDS). IEEE, 2016, pp. 148–152.
[34] D. R. Traum and S. Larsson, “The information state approach to dialogue management,” in Current and new directions in discourse and dialogue. Springer, 2003, pp. 325–353.
[35] Y. Ko, “New feature weighting approaches for speech-act classification,” Pattern Recognition Letters, vol. 51, pp. 107–111, 2015.
[36] S. Kang, H. Kim, and J. Seo, “A reliable multidomain model for speech act classification,” Pattern Recognition Letters, vol. 31, no. 1, pp. 71–74, 2010.
[37] M. Kim and H. Kim, “Integrated neural network model for identifying speech acts, predicators, and sentiments of dialogue utterances,” Pattern Recognition Letters, vol. 101, pp. 1–5, 2018.
[38] R. Lütticke, “Problem solving with adaptive feedback,” in International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems. Springer, 2004, pp. 417–420.
[39] J. Ramirez, J. M. Górriz, and J. C. Segura, “Voice activity detection. Fundamentals and speech recognition system robustness,” in Robust Speech Recognition and Understanding. IntechOpen, 2007.
[40] A. McCallum, “Multi-label text classification with a mixture model trained by EM,” in AAAI Workshop on Text Learning, 1999, pp. 1–7.
[41] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains for multi-label classification,” Machine Learning, vol. 85, no. 3, p. 333, 2011.
[42] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[43] D. A. Kolb, Experiential learning: Experience as the source of learning and development. FT press, 2014.
Below is the transcript of one participant’s dialogue for the first three scenes. We show the score received (in parentheses).
Scene 1:
Captain Heist: Good Morning. As you know, we are scheduled to meet our Chinese counterparts for the first time today. It is important that you make a good first impression. Unfortunately, I have had a requirement come up last minute and will not able to meet the Chinese captain today. I am having you go in my place. Please receive him cordially and begin making plans with him as soon as possible.
Participant: Thanks for the information.
Scene 2:
Participant: Good morning Captain Wang.
Captain Wang: Good morning, Lieutenant.
Participant: Are you the leader of the Chinese component of the coalition.
Captain Wang: Yes, I am. I made arrangements with Captain Heist. Is he here?
Participant: Unfortunately Captain Heist had prior duties and I will be taking his place.
Captain Wang: Why couldn’t he come?
Participant: He unfortunately had a prior engagements with other officers.
Captain Wang: I understand. Let’s begin preparing because we need to leave soon.
Participant: Great let’s get started sir.
Scene 3:
Participant: Captain Wang is there anything I can help you with?
Captain Wang: I’m Fine, Lieutenant. I just had some questions about the mission.
Participant: Well what are your concerns officer.
Captain Wang: Well, my team and I were wondering what kind of supplies your team is planning on bringing.
Participant: Well Captain we haven’t worked out all the Kinks but if you would like to meet are at our briefing later this afternoon you’re more than welcome to come.
Captain Wang: I completely understand, but even some basic information would be useful for me to start my planning.
Participant: Well Captain we haven’t worked out all the Kinks but if you would like to meet are at our briefing later this afternoon you’re more than welcome to come.
Captain Wang: Please Lieutenant, it would help me a great deal if I knew exactly what supplies you have for the mission.
Participant: Well caps and we plan on bringing food water and other Hospital related equipment such as medicine to the site but we aren’t sure what sort of Weaponry will be using but that will be discussed in the brief.
Captain Wang: Is there any way for me to get a look at the supply depot where you keep everything? I don’t want to bother you any longer, so if it will save you time, I’ll walk over myself.
Participant: Well Captain the Supply Depot is down the way to the left so if you want to check there that would be a good hint to so it will bring.
Captain Wang: Oh, well thank you Lieutenant! Good luck on the planning.
Participant: I will be around until the mission brief this afternoon if you need me at any point.