Multi-modal emotion detection and sentiment analysis in conversation has been attracting considerable attention recently, owing to its potential use cases and the rapid growth of online social media platforms such as YouTube, Facebook, Instagram, and Twitter (Chen et al., 2017; Poria et al., 2016; Poria et al., 2017; Zadeh et al., 2016b; Zadeh et al., 2017), especially since information obtained from any combination of two or more of the available modalities (e.g. text, audio, video) can be used to produce meaningful results.
Current state-of-the-art systems for multi-modal emotion detection and sentiment analysis do not treat the modalities in accordance with the information they are capable of holding (e.g. textual information is significantly more likely to hold contextual information than audio or video features are), lack an adequate fusion mechanism, and fail to effectively capture the context of a conversation in a multi-modal setting. In addition to this lack of proper usage of the available modalities, models also fail to effectively capture the flow of a conversation, the separation between speaker and listener states, and the emotional effect a speaker's utterance has on the listener(s) in dyadic conversations.
Our proposed model, Multilogue-Net, attempts to embed basic domain knowledge and takes insight from Poria et al. (2019), assuming that the sentiment or emotion governing a particular utterance predominantly depends on four factors: interlocutor state, interlocutor intent, the preceding and future emotions, and the context of the conversation. Of these, interlocutor intent is particularly difficult to model due to its dependency on prior knowledge about the speaker, but modelling the other three separately, yet in an interrelated manner, was theorized to produce meaningful results if they could be captured effectively. The key intention was to simulate the setting in which an utterance is said, and to use the actual utterance at that point to gain better insight into its emotion and sentiment. The model uses information from all modalities, learning multiple state vectors (representing interlocutor state) for a given utterance, followed by a pairwise attention mechanism inspired by Ghosal et al. (2018) that attempts to better capture the relationship between all pairs of the available modalities.
The model uses two gated recurrent units (GRUs) (Chung et al., 2014) for each modality to model interlocutor state and emotion. Along with these GRUs, the model also uses an interconnected context network, consisting of as many GRUs as there are available modalities, to model a different learned context representation for each modality. The incoming utterance representations and the historical GRU outputs are used at every timestamp to arrive at a prediction for that timestamp.
The model produces m different representations at every timestamp (where m is the number of modalities), each being the emotional state at that timestamp as conveyed by one of the modalities. The fusion mechanism combines information from these m representations to arrive at the final prediction for that timestamp. We understand that the pairwise attention mechanism, along with the emotion GRU, is what makes the model flexible across tasks.
Using only the text representation as input to the context GRUs has been observed to be key to the results, as the context of the conversation is better captured by textual information than by audio or video information. We believe that Multilogue-Net performs better than the current state of the art (Ghosal et al., 2018) on multi-modal datasets because of better context representation leveraging all available modalities.
The remaining sections of the paper are arranged as follows: Section 2 discusses related work; Section 3 discusses the model in detail; Section 4 provides experimental results, dataset details, and analysis; Section 5 contains our ablation studies and their implications; and finally, Section 6 discusses potential future work and concludes the paper.
Multi-modal emotion recognition and sentiment analysis have long attracted attention in multiple fields such as natural language processing, psychology, and cognitive science (Picard, 2010). Previous work has studied factors of variation that correlate more directly with emotion, such as Ekman et al. (1992), who found a correlation between emotion and facial cues, and many studies focus extensively on emotions and their relationships with one another, such as Plutchik's wheel of emotions, which defines eight primary emotion types, each of which has a multitude of emotions as sub-types.
Early work leveraging multi-modal information for emotion recognition includes Datcu and Rothkrantz (2012), who fused acoustic information with visual cues, and Eyben et al. (2010), who used contextual information for emotion recognition in multi-modal settings. More recently, deep recurrent neural networks have been used to make the best of the learned representations of the available modalities and give effective and accurate emotion and sentiment predictions. Poria et al. (2017) successfully used RNN-based deep networks for multi-modal emotion recognition, which was followed by multiple other works (Chen et al., 2017; Zadeh et al., 2018a; Zadeh et al., 2018c) giving results far better than what was seen before. Recent work also includes Hazarika et al. (2018), who used memory networks for emotion recognition in dyadic conversations, where two distinct memory networks enabled inter-speaker interaction.
Some works, such as DialogueRNN (Majumder et al., 2018), though focused on emotion recognition and sentiment analysis using a single modality (text), work very well in a multi-modal setting by simply replacing the text representation with a concatenated vector of all the modality representations. DialogueRNN effectively leveraged the separation between the speakers by maintaining two independent gated recurrent units to keep track of the interlocutor states, while also effectively capturing context in the conversation, yielding state-of-the-art performance on uni-modal data. Even though DialogueRNN gave reasonably good results on multi-modal data, the lack of an adequate fusion mechanism and of a focus on multi-modal representation held its multi-modal performance back.
Apart from works such as those above, which propose a methodology or a model, Poria et al. (2019) discussed the research challenges and advancements in emotion detection in conversation extensively and gave a comprehensive overview of the problem. Most recently, Ghosal et al. (2018) introduced the idea of learning the relationship between all pairs of available modalities using pairwise attention in a multi-modal setting, where similar attributes learned by multiple modalities are emphasized and differences between the modality representations are diminished. Pairwise attention proved to be highly effective, yielding state-of-the-art performance on multi-modal data with just simple representations for each modality.
3.1 Problem Formulation
Let there be P participants in the conversation. The problem is defined such that every utterance by any participant is assigned a sentiment score along with a predicted emotion label (one of happy, sad, angry, surprise, disgust, and fear). Each utterance corresponds to a particular participant of the conversation, which allows this formulation to also capture the average sentiment of a participant in the conversation. Making predictions over utterances, which is also common practice, avoids problems that arise when predictions are made for fixed time intervals, such as classification during long moments of silence.
For every utterance, where p is the party who uttered it, there exist three independent representations, one each for the text, audio, and video modalities, which are obtained using the feature extractors further explained in Section 4.2.
This gives us our overall formulation of the problem: to learn a function that takes as input three independent representations of a particular utterance, information regarding the previous emotional state of the participant, and a representation of the current context of the conversation, and maps them to an output prediction of a sentiment score and an emotion label.
Details regarding how these representations are updated and how the output is generated using these inputs are described in detail below.
3.2 Model Details
Modelling was done under the underlying assumption that the sentiment or emotion of an utterance predominantly depends on four factors as mentioned before:
• Interlocutor State
• Interlocutor Intent
• Context of the conversation until that point
• Previous interlocutor states and emotions of a particular participant in the conversation
The proposed model attempts to model three of these four factors explicitly, and assumes that interlocutor intent will be modelled implicitly during training. Interlocutor state is modelled using a state GRU (referred to as sGRU), a context GRU (cGRU) is used to keep track of the context of the conversation, and an emotion GRU (eGRU) is used to keep track of the emotional state of each participant. Finally, a pairwise attention mechanism, which uses the emotion representations of all modalities at a particular timestamp, is used to leverage the important modalities and the relevant combinations of modalities for emotion or sentiment prediction at that timestamp.
Figure 1: Description of all the state updates at timestamp t for a single participant
Every utterance has three independent feature representations (text, audio, and video features). Each of these feature representations is treated and operated on independently until the pairwise attention mechanism. The model consists of two GRUs (a state GRU and an emotion GRU) for every modality and participant, and a context GRU for each modality that is common to all participants in the conversation (if p is the number of participants and m is the number of modalities, the model has a total of 2mp + m GRUs). The inputs at the current timestamp and the previous state, context, and emotion representations are combined to arrive at the prediction at that timestamp. Figure 1 describes the updates at a particular timestamp, and the role of each GRU is further explained below.
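The GRU count stated above can be checked with a short sketch. The function name and the dictionary layout are illustrative assumptions; only the 2mp + m formula comes from the text.

```python
def count_grus(num_participants: int, num_modalities: int) -> dict:
    """Count the GRUs implied by the architecture description."""
    state = num_participants * num_modalities    # one sGRU per (participant, modality)
    emotion = num_participants * num_modalities  # one eGRU per (participant, modality)
    context = num_modalities                     # one cGRU per modality, shared by all participants
    return {
        "state": state,
        "emotion": emotion,
        "context": context,
        "total": state + emotion + context,      # equals 2*m*p + m
    }


# Dyadic conversation (p = 2) with text, audio, and video (m = 3).
counts = count_grus(num_participants=2, num_modalities=3)
print(counts["total"])
```

For a dyadic, tri-modal conversation this gives 6 state GRUs, 6 emotion GRUs, and 3 context GRUs.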
3.2.1 Context GRU (cGRU)
The context GRU (cGRU) for each modality aims to capture the context of the conversation by jointly encoding the utterance representation of that modality at timestamp t (text, audio, or video) and the speaker state GRU output of that modality from the previous timestamp. This accounts for inter-speaker and inter-utterance dependencies to produce an effective context representation.

Figure 2: State updates and final prediction output in a conversation between two participants, where the updates of each participant at a timestamp are as given in Figure 1

The current utterance changes the state of the speaker from its value at the previous timestamp to its value at the current timestamp. To capture this change in context, we use a GRU cell cGRU for each modality. Its output size is the size of the context vectors, and its input is the concatenation of the utterance representation of that modality (whose size differs between text, audio, and video) and the previous speaker state of that modality (all state vectors share one size). All GRU weight and bias shapes are such that they produce the expected output shapes given the input shapes.
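The context update described above can be sketched as a single GRU step whose input is the concatenation of an utterance representation and the previous speaker state. This is a minimal pure-Python illustration, not the authors' implementation: the vector sizes, the random weights, the omission of biases, and the helper names `make_gru` and `gru_step` are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def make_gru(input_size, hidden_size, seed=0):
    """Create small random GRU weights (biases omitted for brevity)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = rand_matrix(hidden_size, input_size)   # input weights
        params["U" + gate] = rand_matrix(hidden_size, hidden_size)  # recurrent weights
    return params


def gru_step(params, x, h_prev):
    """One standard GRU update (gate conventions vary between papers)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(params["Wz"], x), matvec(params["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(params["Wh"], x), matvec(params["Uh"], gated))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


# Hypothetical sizes: utterance features (4), speaker state (3), context (5).
utterance = [0.2, -0.1, 0.4, 0.0]
prev_state = [0.0, 0.0, 0.0]
prev_context = [0.0] * 5

cgru = make_gru(input_size=len(utterance) + len(prev_state), hidden_size=5)
# List "+" concatenates the utterance representation and the previous state.
context = gru_step(cgru, utterance + prev_state, prev_context)
print(len(context))
```

The same `gru_step` shape logic would apply to the state and emotion GRUs, with only the concatenated inputs and the hidden sizes changing.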
3.2.2 State GRU (sGRU)
The network keeps track of the participants involved in a conversation by employing one sGRU per participant and modality (p × m in total), where p is the number of participants in the conversation and m is the number of available modalities. The sGRU associated with a participant outputs fixed-size vectors which serve as an encoding of the interlocutor state, and which are used directly both for emotion and sentiment prediction and for updating the context vectors.
All the state vectors are initialized to null at the first timestamp. For a timestamp t, the state vector of participant p for a given modality is updated using the input feature representation of that modality and simple attention over all the context vectors until that timestamp. The simple attention mechanism works in two steps. First, attention scores are calculated over the context representations of all previous utterances, highlighting the relative importance of each previous context vector to the current utterance representation; a softmax layer is applied to amplify this relative importance. Second, the final output of attention over context is calculated by pooling the previous context vectors with these attention scores.

We then employ the sGRU cells to update the previous state vector of each modality to its current value on the basis of the incoming utterance representation of that modality and the attended context representation. The output size of each cell is the size of the state vectors, and its input is the concatenation of the utterance representation (text, audio, or video) and the attended context. All GRU weight shapes are such that they produce the expected output shapes given the input shapes.
The intended purpose of using this concatenation as the input to the sGRU is to model the dependency of the speaker state on the context of the conversation, as understood from the utterances until that point, along with the utterance representation at that point. The output of the sGRU for modality m and timestamp t serves as an encoding of the speaker state as conveyed by modality m at time t.
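The attention over previous context vectors can be sketched as follows. The scoring function is an assumption: a bilinear score between the current utterance representation and each previous context vector, with a hypothetical weight matrix `W`, is used here because the exact form is not recoverable from the text. The softmax and the weighted pooling follow the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_context(utterance, contexts, W):
    """Return (attention weights, pooled context).

    utterance: list of length D_u
    contexts:  list of previous context vectors, each of length D_c
    W:         D_u x D_c matrix (assumed bilinear scoring weights)
    """
    d_c = len(W[0])
    # Project the utterance into the context space: q = W^T u.
    query = [sum(utterance[i] * W[i][j] for i in range(len(utterance))) for j in range(d_c)]
    # One score per previous context vector.
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in contexts]
    alpha = softmax(scores)
    # Pool the previous context vectors with the attention weights.
    pooled = [sum(a * ctx[j] for a, ctx in zip(alpha, contexts)) for j in range(d_c)]
    return alpha, pooled


# Toy example: 2-dimensional utterance, three previous 3-dimensional contexts.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
alpha, pooled = attend_over_context([1.0, 0.0],
                                    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
                                    W)
print(alpha, pooled)
```

In this toy case the third context vector aligns most strongly with the utterance, so it receives the largest weight in the pooled result.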
3.2.3 Emotion GRU (eGRU)
The emotion GRU serves as the decoder for the encoding produced by the state GRU. It uses the eGRU output of the previous timestamp and the encoding provided by the sGRU to produce an emotion or sentiment representation, which is then used by the pairwise attention mechanism to produce the relevant output for prediction. At timestamp (t + 1), the emotion vector of each modality is updated by its eGRU from the previous emotion vector and the current state vector. The output size of each eGRU is the size of the emotion vectors, and its input size is the size of the state vectors; all GRU weight shapes are such that they produce the expected output shapes given the input shapes.
The emotion GRU acts as a decoder to the encoding produced by the associated state GRU, producing a vector which can be used for both sentiment and emotion prediction.
3.2.4 Pairwise Attention Mechanism
The emotion GRU for each timestamp produces m vectors (where m is the number of modalities available). Pairwise attention is then used over these m vectors to produce the final prediction output. In our case, pairwise attention is calculated over the three pairs of modalities: text and video, text and audio, and audio and video. The computation for each pair combines the two emotion representations using element-wise products and concatenation.

Figure 3: Pairwise attention mechanism used as the fusion mechanism followed by the final prediction layer
A complete analysis of the pairwise attention mechanism has been done by Ghosal et al. (2018).
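One pairwise attention block can be sketched as below. This is an illustration under stated assumptions rather than the exact computation used here: it follows the bi-modal attention pattern as we understand it from Ghosal et al. (2018), namely cross-modal matching matrices, a row-wise softmax, an attended representation, an element-wise product, and a final concatenation. The matrix layout (one row per utterance, both modalities sharing one feature size) and all function names are hypothetical.

```python
import math


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(left, right):
    """Plain matrix product of two lists of rows."""
    right_cols = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in right_cols] for row in left]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def pairwise_attention(A, B):
    """Fuse two modality matrices of shape (utterances x d) into (utterances x 2d)."""
    M1 = matmul(A, transpose(B))          # cross-modal matching scores, A against B
    M2 = matmul(B, transpose(A))          # cross-modal matching scores, B against A
    N1 = softmax_rows(M1)
    N2 = softmax_rows(M2)
    O1 = matmul(N1, B)                    # B attended from A's point of view
    O2 = matmul(N2, A)                    # A attended from B's point of view
    A1 = [[o * a for o, a in zip(ro, ra)] for ro, ra in zip(O1, A)]  # element-wise product
    A2 = [[o * b for o, b in zip(ro, rb)] for ro, rb in zip(O2, B)]
    return [r1 + r2 for r1, r2 in zip(A1, A2)]                      # concatenation


text = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
audio = [[0.3, 0.1, 0.0], [0.2, 0.2, -0.2]]
fused = pairwise_attention(text, audio)
print(len(fused), len(fused[0]))
```

The block has no trainable parameters of its own, so adding a pair adds compute but no weights in this sketch.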
For emotion prediction, we use a fully connected layer along with a final softmax layer to calculate the 6 emotion class probabilities from the output of the fusion mechanism.
3.2.6 Training
Fairly standard practices have been employed for the training of the model. Categorical cross-entropy has been used along with L2 regularization as the loss function during training for emotion prediction, to maximize likelihood over each of the classes.
Table 1: Multilogue-Net performance on CMU-MOSI in comparison with the current and previous state of the art on the dataset. A2 indicates accuracy with 2 classes, and F1 indicates F1 score.

Mean squared error (MSE) along with L2 regularization has been employed as the loss function during training for sentiment regression. Using a saturating output layer with a loss function that does not undo the saturation causes the model to stop training when it makes extreme predictions (close to -1 or +1), because the gradients become very small. Initialization strategies that start from smaller model weights, the mini-batch gradient-descent-based Adam optimizer (Kingma and Ba, 2014), and L2 regularization are used to avoid this failure mode.
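The vanishing-gradient failure mode described above can be illustrated numerically. The sketch assumes a tanh output unit, which is one common saturating layer with range (-1, +1); the text does not name the activation. The function name is hypothetical.

```python
import math


def mse_grad_through_tanh(pre_activation, target):
    """Gradient of (tanh(z) - target)^2 with respect to z."""
    prediction = math.tanh(pre_activation)
    # Chain rule: d/dz (y - t)^2 = 2 (y - t) * (1 - y^2), where y = tanh(z).
    return 2.0 * (prediction - target) * (1.0 - prediction ** 2)


# The target is -1, but the model predicts values increasingly close to +1.
for z in (0.5, 2.0, 5.0):
    print(z, mse_grad_through_tanh(z, target=-1.0))
```

Even though the error grows as the prediction moves towards +1, the factor (1 - y^2) shrinks faster, so the gradient at z = 5 is orders of magnitude smaller than at z = 0.5. Starting from small weights keeps the pre-activation away from this region early in training.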
4.1 Datasets
We evaluate our model using two benchmark datasets: CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) (Zadeh et al., 2016a) and the recently published CMU Multi-modal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset (Zadeh et al., 2018b).
4.1.1 CMU-MOSI
The CMU-MOSI dataset consists of 93 videos spanning 2199 utterances. Each utterance has a sentiment label associated with it. It has 52, 10, and 31 videos in the training, validation, and test sets, accounting for 1151, 296, and 752 utterances, respectively. Each utterance in the CMU-MOSI dataset has been annotated as either positive or negative.
4.1.2 CMU-MOSEI
CMU-MOSEI has 3229 videos with 22676 utterances from more than 1000 online YouTube speakers. The training, validation, and test sets consist of 16216, 1835, and 4625 utterances, respectively. In the CMU-MOSEI dataset, labels lie in a continuous range of -3 to +3 and are accompanied by an emotion label, which is one of six emotions. However, in this work we also project the instances of CMU-MOSEI into a two-class classification setup, where values ≥ 0 signify positive sentiment and values < 0 signify negative sentiment. We call this A2 accuracy (accuracy with 2 classes). Along with this, we also show results for continuous-range prediction between -3 and +3, and for emotion prediction with the 6 emotion labels for each utterance in CMU-MOSEI. We use A2 as a metric to be consistent with previously published work on the CMU-MOSEI dataset (Ghosal et al., 2018; Zadeh et al., 2018b). CMU-MOSEI has further been used for other comprehensive experiments due to its large size and easier feature extraction.
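The two-class projection and the A2 metric can be sketched as follows. The threshold at zero, with zero itself counted as positive, is an assumption about the boundary case; the function names are hypothetical.

```python
def to_binary_label(score):
    """Map a continuous sentiment score in [-3, +3] to 1 (positive) or 0 (negative)."""
    return 1 if score >= 0 else 0


def a2_accuracy(predicted_scores, gold_scores):
    """Two-class accuracy after projecting both predictions and labels."""
    if len(predicted_scores) != len(gold_scores) or not gold_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        to_binary_label(p) == to_binary_label(g)
        for p, g in zip(predicted_scores, gold_scores)
    )
    return hits / len(gold_scores)


# Toy scores, not taken from the dataset.
print(a2_accuracy([0.5, -1.2, 0.0, 2.0], [1.0, 0.3, -0.1, 3.0]))
```

Because only the sign matters, a regression model trained on the continuous range can be scored on A2 without retraining.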
4.2.1 CMU-MOSEI
We use the CMU-Multi-modal Data SDK (Zadeh et al., 2018b) for feature extraction. For the MOSEI dataset, sentiment label-level features were provided, where the text features are GloVe embeddings (Pennington et al., 2014), the visual features are extracted by Facet (Stöckli et al., 2017), and the acoustic features by openSMILE (Eyben et al., 2010). Thereafter, we compute the average of the sentiment label-level features in an utterance to obtain the utterance-level features. For each sentiment label-level feature, the dimension of the feature vector is set to 300 (text), 35 (visual), and 384 (acoustic).
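The averaging step that turns finer-grained feature vectors into a single utterance-level vector can be sketched as below. The function name and the toy vectors are illustrative; real vectors would have 300, 35, or 384 dimensions depending on the modality.

```python
def average_features(feature_vectors):
    """Average a list of equal-length feature vectors into one utterance-level vector."""
    if not feature_vectors:
        raise ValueError("an utterance needs at least one feature vector")
    count = len(feature_vectors)
    # zip(*...) groups the values of each dimension across all vectors.
    return [sum(dimension) / count for dimension in zip(*feature_vectors)]


# Two toy 2-dimensional vectors belonging to one utterance.
print(average_features([[1.0, 2.0], [3.0, 4.0]]))
```

Averaging keeps the feature dimension fixed regardless of utterance length, which is what lets every utterance feed the same GRU input size.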
4.2.2 CMU-MOSI
In contrast, for the MOSI dataset we use the utterance-level features provided in Poria et al. (2017). These utterance-level features represent the outputs of a convolutional neural network (Karpathy et al., 2014), a 3D convolutional neural network (Ji et al., 2010), and openSMILE (Eyben et al., 2010) for the text, visual, and acoustic modalities, respectively. The dimensions of the utterance-level features are 100, 100, and 73 for text, visual, and acoustic, respectively.
Table 2: Multilogue-Net performance on CMU-MOSEI sentiment labels compared to previous state-of-the-art models on regression and accuracy metrics. For all metrics apart from MAE, higher values are better; for MAE, lower values are better.
We evaluate our proposed approach on CMU-MOSI (test set) on accuracy and F1 score, and on CMU-MOSEI (dev set) on accuracy, F1 score, mean absolute error (MAE), Pearson score (r), and accuracies on the emotion labels. Due to the lack of speaker information in CMU-MOSI, we were not able to use the CMU-Multi-modal Data SDK for sentiment label extraction, and therefore could not evaluate our approach on CMU-MOSI using mean absolute error and Pearson score.
Results have also been reported for usage of two of the three available modalities. Uni-modal performance has not been reported as the focus of the paper is the effective usage of multi-modal data. In a uni-modal setting the model would not be using the fusion mechanism and the output would be equivalent to having a few dense layers after the emotion GRU to directly output the final prediction. F1 scores have not been mentioned by most previous models being used for comparison, but have been reported for Multilogue-Net for additional comparison to any future models using CMU-MOSI dataset.
Table 1 shows the performance of Multilogue-Net on the CMU-MOSI dataset, compared to the current state of the art (Ghosal et al., 2018), the previous state of the art (Poria et al., 2017), and DialogueRNN (Majumder et al., 2018). Multi-modal performance of DialogueRNN was not reported by Majumder et al. (2018); we ran these experiments ourselves for a better comparative study, using concatenation of the input representations as the fusion mechanism. Our model consistently outperforms the previous state of the art, but performs better than the current state of the art on only one of the modality subsets.

Table 3: Multilogue-Net performance on MOSEI emotion labels compared with that of Graph-MFN on weighted accuracy and F1 score. MOSEI emotion label results have been presented by only one model, and comprehensive results have not been published for the same.
In comparison to MMMU-BA, our model also falls short in multi-modal performance. We theorize that this is because of the low number of training examples (CMU-MOSI consists of only 93 conversations, of which 62 were used for training and validation) relative to the high capacity of our model compared with the models it is evaluated against. Since Multilogue-Net learns many intermediate representations in order to make a prediction, it needs a larger dataset with more variability to learn meaningful representations. The proposition that performance suffers due to a lack of training examples is backed by the results on CMU-MOSEI (shown in a comparative setting in Tables 2 and 3), where the model outperforms the current state of the art on most metrics.
On CMU-MOSEI, our model performs consistently on both sentiment and emotion labels. It outperforms the current state of the art on all but one metric (across both regression and accuracy metrics) on sentiment labels in the tri-modal setting. Multilogue-Net also outperforms the current state of the art on the emotion labels by a considerable margin (this is partly attributable to the fact that few models have presented results on these labels).
Similar observations are made in both datasets,
Until now, some architectural considerations, such as the use of the eGRU and the fusion mechanism, have been briefly explained but not empirically justified. This section aims to provide empirical evidence regarding the effectiveness of these modules. Since our model hinges entirely on the context and state GRUs, our ablation studies and analysis focus only on the fusion mechanism and the emotion GRU (eGRU).
5.1 Fusion Mechanism
The effectiveness of the fusion mechanism can be examined easily by observing the results of the model on both tasks, sentiment regression and emotion recognition, with and without the fusion mechanism. Table 4 shows these results on CMU-MOSEI modality subsets.
Table 4: Multilogue-Net performance on CMU-MOSEI with and without the fusion mechanism. For 'without' fusion, we have concatenated all the representations and passed them directly to the prediction layer.

The bi-modal results in Table 4 involve evaluating the pairwise attention module only once (since there is only one pair available), directly followed by the prediction layer. The tri-modal case, on the other hand, involves evaluating the pairwise attention module three times (once for each pair). In general, the number of times this module has to be evaluated for m modalities is m(m - 1)/2, the number of distinct modality pairs, which raises a fair concern regarding the trade-off between the additional computational cost and performance.

We empirically observe that the additional computational cost can be considered negligible in the context of the increased performance, largely owing to the non-parametric nature of the fusion mechanism and the relatively small number of additional parameters in the prediction layer, for both sentiment regression and emotion recognition.
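The growth in the number of pairwise attention evaluations can be checked with a short sketch. The modality names and the function name are illustrative; the fourth modality in the last call is purely hypothetical.

```python
from itertools import combinations


def modality_pairs(modalities):
    """List every unordered pair of modalities that needs one attention block."""
    return list(combinations(modalities, 2))


print(modality_pairs(["text", "audio"]))                 # bi-modal: one pair
print(modality_pairs(["text", "audio", "video"]))        # tri-modal: three pairs
print(len(modality_pairs(["text", "audio", "video", "extra"])))  # hypothetical fourth modality
```

The count grows quadratically in the number of modalities, which is why the cost question becomes more relevant as modalities are added.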
The fusion mechanism is clearly beneficial in all of the reported cases apart from video + audio, implying that the fusion mechanism is useful only in cases where the text representation is used. This further strengthens our claim that the text representation guides tri-modal performance.
5.2 Emotion GRU (eGRU)
Unlike the fusion mechanism, the effectiveness of the eGRU cannot be examined by evaluating metrics with and without it. Removing the emotion GRU would clearly be detrimental to the results, and would not convey the intention behind having it.
The primary intention of having the eGRU is to maintain consistency between tasks. Table 5 demonstrates this effect quantitatively. The model was trained separately for the emotion detection and sentiment regression tasks. After both models were trained satisfactorily, inference was run on a particular sample from the test set (test sample 6). We then retrieved the intermediate text representations (the context, state, and emotion representations of the text modality) at a particular timestamp (t = 4) for both models on that sample. The Euclidean distance between these two sets of representations (one for each task) was evaluated and is shown in Table 5, where we can clearly observe that the Euclidean distance between the emotion representations is much larger than that between the state and context representations.

Table 5: Euclidean distance between the same representations for sentiment regression as compared to emotion detection. (Distances have been converted to units for convenience and easier comparison.)
This shows that for both tasks, interlocutor state and context representations are relatively similar to each other, whereas the emotion state representation is more varied and task dependent. This not only allows us to use the same cGRU and sGRU weights across tasks, but would also allow us to train for multiple tasks in parallel using a different eGRU for each task, giving us consistent and accurate predictions across multiple tasks. Analysis of such a network, and of whether training for multiple tasks in parallel helps each task, has not been covered in this paper and is left to future work.
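The comparison behind Table 5 amounts to a Euclidean distance between matching representations from the two separately trained models. A minimal sketch follows; the vectors are made-up placeholders rather than values from the experiments, and the dictionary keys are hypothetical names.

```python
import math


def euclidean_distance(first, second):
    """Euclidean distance between two equal-length vectors."""
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))


# Placeholder representations from a sentiment model and an emotion model.
sentiment_model = {"context": [0.1, 0.2], "state": [0.3, 0.1], "emotion": [0.9, -0.4]}
emotion_model = {"context": [0.1, 0.25], "state": [0.28, 0.1], "emotion": [-0.2, 0.6]}

for name in ("context", "state", "emotion"):
    print(name, euclidean_distance(sentiment_model[name], emotion_model[name]))
```

A small distance for a representation suggests it is largely task independent, which is the property that would justify sharing those weights across tasks.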
In this paper, we have presented an RNN architecture for multi-modal sentiment analysis and emotion detection in conversation. In contrast to the current state-of-the-art models, our model focuses on effectively capturing the context of a conversation and treats each modality independently, taking into account the information a particular modality is capable of holding. Our model consistently performs well on benchmark datasets such as CMU-MOSI and CMU-MOSEI in any multi-modal setting.
The model can be further extended to have better feature extractors, and increase both the number of modalities and the number of participants in the conversation. Due to the lack of availability of datasets consisting of these extensions with emotion or sentiment labels, we have left this to our future work.
Minghai Chen, Sen Wang, Paul Pu Liang, Tadas Baltrušaitis, Amir Zadeh, and Louis-Philippe Morency. 2017. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, pages 163–171.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Y. Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling.
Dragos Datcu and Léon Rothkrantz. 2012. Semantic audiovisual data fusion for automatic emotion recognition.
P. Ekman, Edmund Rolls, David Perrett, and H. Ellis. 1992. Facial expressions of emotion: An old controversy and new findings: Discussion. Royal Society of London Philosophical Transactions Series B, 335:69–.
Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. pages 1459–1462.
Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. 2018. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3454–3466.
Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann. 2018. Conversational memory network for emotion recognition in dyadic dialogue videos. volume 2018, pages 2122–2132.
Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2010. 3D convolutional neural networks for human action recognition. volume 35, pages 495–502.
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. pages 1725–1732.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.
Navonil Majumder, Soujanya Poria, Devamanyu Hazarika, Rada Mihalcea, Alexander Gelbukh, and Erik Cambria. 2018. DialogueRNN: An attentive RNN for emotion detection in conversations.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. volume 14, pages 1532–1543.
Rosalind Picard. 2010. Affective computing: From laughter to IEEE. IEEE Transactions on Affective Computing, 1:11–17.
Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 873–883.
Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In IEEE 16th International Conference on Data Mining (ICDM), pages 439–448.
Soujanya Poria, Navonil Majumder, Rada Mihalcea, and Eduard Hovy. 2019. Emotion recognition in conversation: Research challenges, datasets, and recent advances. arXiv:1905.02947.
Sabrina Stöckli, Michael Schulte-Mecklenbeck, Stefan Borer, and Andrea Samson. 2017. Facial expression analysis with AFFDEX and FACET: A validation study. Behavior Research Methods, 50.
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1103–1114.
Amir Zadeh, Paul Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning.
Amir Zadeh, Paul Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. pages 2236–2246.
Amir Zadeh, Paul Liang, Soujanya Poria, Prateek Vij, Erik Cambria, and Louis-Philippe Morency. 2018c. Multi-attention recurrent network for human communication comprehension. Proceedings of the 2018 AAAI Conference on Artificial Intelligence, 2018.
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016a. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos.
Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016b. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. Intelligent Systems, IEEE, pages 82–88.