A facial micro-expression (ME) is a brief, local facial movement that can be triggered under high emotional pressure. Its duration is less than 500ms [1]. It is an important non-verbal communication cue, and its involuntary nature makes it possible to analyze a person's genuine emotional state. ME analysis has many potential applications in national security [2], medical care [3], educational psychology [4], and political psychology [5]. Given the growing importance of MEs, researchers [6] have worked collaboratively to solicit work in this area by organizing challenges on ME datasets and methods. This year, the theme of the Second Facial Micro-Expression Grand Challenge has been extended to include a spotting challenge.
The main idea of most ME spotting methods is to compare the feature differences between the first frame and the other frames in a time window. The feature descriptors used in the state of the art are diverse, to name a few: LBP [7], [8], HOG [9], optical flow [10]–[13], integral projection [14], Riesz pyramid [15], and frequency domain [16]. Feature differences allow consistent comparisons between frames over a time window of the size of an ME. However, the movements spotted between frames might not be ME movements; they could be noise, macro-movements or illumination changes. This is why distinguishing MEs from other movements (such as blinking or subtle head movements) remains an open challenge.
This work is supported by Chinese scholarship council and ANR reflet. This paper is also supported in part by grants from the National Natural Science Foundation of China (61772511) and The Royal Society (IF160006).
Methods utilizing machine learning are now emerging [17]–[20]. Furthermore, [21] employed deep learning for the first time to perform ME spotting. The machine learning process enhances the ability to distinguish micro-expressions from other movements. However, spatial patterns are still the primary feature for the classifier. The temporal variation pattern of facial movement over the duration of an ME has yet to attract sufficient attention. Meanwhile, few articles spot micro-expressions directly from local regions, even though the fact that a micro-expression is a local facial movement could help to reduce false positives.
In this paper, we spot micro-expression clips in two recently published databases and establish the baseline method for the ME spotting challenge by directly using a temporal pattern extracted from local regions [22]. Frames in an ME duration are taken into account to obtain a genuinely local and temporal pattern (LTP), and the LTPs are then recognized by a classifier. Even though the spatial pattern is not studied, the spotted facial motions are differentiated by a fusion process from local to global. This method helps to improve the ability to distinguish MEs from other movements. Furthermore, it allows the spatial local region of the ME and its temporal onset index to be found. We compare the results of our proposed LTP-ML method with an LBP approach, the LBP-χ2-distance method of Moilanen et al. [7].
The rest of the paper is organized as follows: Section II presents the methodology and performance metrics. Section III presents the experimental settings and the detailed experimental results. Section IV concludes the paper.
This section describes the benchmark databases, the proposed LTP-ML method, the state-of-the-art LBP method and the performance metrics.
A. Databases
The two most recent spontaneous micro-expression databases with long videos, SAMM [23] and CAS(ME)2 [24], are used for the ME spotting challenge. Both databases contain long videos recorded in a strictly controlled laboratory environment. Table I compares these two databases. The notable differences are the resolution and frame rates used in the experimental settings. These differences make it a great challenge for the computer vision and machine learning community to produce a robust method that works for both databases. The detailed information on these two databases is presented in the following two subsections.
978-1-7281-0089-0/19/$31.00
Fig. 1. Facial landmarks tracking and ROI selection. On the left: an example from SAMM; on the right: an example from CAS(ME)2
1) SAMM Long Videos Database: The SAMM database consists of a total of 32 subjects, each with 7 videos [23]. The average length of the videos is 35.3s. The original release of SAMM consists of micro-movement clips labelled with Action Units. Recently, the authors [25] introduced objective classes and emotion classes for the database. The recognition challenge uses the emotion classes from the database as ground truth. The spotting challenge focuses on 79 videos, each containing one or more micro-movements, with a total of 159 micro-movements. The indices of the onset, apex and offset frames of the micro-movements are provided as the ground truth. The micro-movement interval runs from the onset frame to the offset frame. In this database, all the micro-movements are labeled. Thus, the spotted frames can indicate not only MEs but also other facial movements, such as eye blinks.
2) CAS(ME)2 Database: In part A of the CAS(ME)2 database [24], there are 22 subjects and 97 long videos. The average duration is 148s. The facial movements are classified as macro- and micro-expressions. The video samples may contain multiple macro or micro facial expressions. The onset, apex and offset indices for these expressions are given in an Excel file. In addition, the eye blinks are labeled with onset and offset times.
B. LTP-ML: Our Proposed Baseline Method
The baseline method is based on the LTP-ML (local temporal pattern-machine learning) method proposed in [22]. The method is extended to long videos by employing a sliding temporal window. The main idea and the modifications of the LTP-ML method are presented in the following paragraphs.
1) Pre-processing: As the ME is a local facial movement, we analyze MEs only on a selection of regions of interest (ROIs). First of all, as shown in Figure 1, 84 facial landmarks are tracked in the video sequence by utilizing the Genfacetracker (©Dynamixyz). Then the side length a of each ROI square is set in proportion to the distance L between the inner corners of the left and right eyes. 12 ROI squares are chosen based on the regions where MEs happen most frequently, i.e. the corners of the eyebrows and of the mouth. Two ROIs in the nose region are chosen as references because the nose is the most rigid facial region.
Since the average duration of an ME is around 300ms, and the subjects barely move within one second, the long videos in these two databases are processed by a temporal sliding window Wvideo whose length is 1s. The overlap is set to 300ms to avoid missing any possible ME movements. Thus, the video is separated into an ensemble of small sequences [I1,I2,...,IM] by the sliding temporal window, as shown in Figure 2. The positions of the 12 chosen ROIs for all frames in one sequence are determined by the landmarks detected in the first frame of the window.
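The windowing step can be sketched as follows. This is a minimal illustration rather than the implementation used for the baseline: the function name `split_into_windows`, the rounding of durations to whole frames, and the shortened final window are assumptions.

```python
def split_into_windows(num_frames, fps, window_s=1.0, overlap_s=0.3):
    """Return (start, end) frame index pairs, end exclusive, for overlapping windows."""
    window = max(1, round(window_s * fps))   # window length in frames
    overlap = round(overlap_s * fps)         # frames shared by consecutive windows
    step = window - overlap
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:                # last (possibly shorter) window reached
            break
        start += step
    return windows


# A 70-frame clip at 30 fps gives three windows sharing 9 frames each.
print(split_into_windows(70, 30))  # [(0, 30), (21, 51), (42, 70)]
```

Working in frame indices keeps the same code usable for databases recorded at different frame rates.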
Fig. 2. PCA process analysis. The long video is divided into small sequences by a sliding window. Then the PCA process is performed respectively on time axis for 12 ROIs sequences in one small divided clip.
2) Feature Extraction: In this part, local temporal patterns (LTPs) [22] are analyzed in the local regions to distinguish MEs from other movements. They are extracted from each of the 12 ROIs in each small sequence. For a sequence $I_m$ ($m \le M$), as illustrated in the lower part of Figure 2, PCA is performed on the temporal axis of each ROI sequence to conserve the principal variation in this region. The first two components of each ROI frame are used to analyze the variation pattern of the local movement. The PCA process for the ROI sequence $ROI^m_j$ ($j \le 12$) in $I_m$ can be presented as in equation 1,
where $F^m_j$ represents the pixels in one ROI frame, $P^m_j(n)$ holds the first two components of PCA, and $n$ is the frame index in this ROI sequence ($n \le N$). Hence, each frame in $ROI^m_j$ can be represented by a point $P^m_j(n)$.
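The projection of each ROI frame onto two principal components can be illustrated with a short sketch. It assumes that every ROI frame is flattened into a row vector and that PCA is computed through a singular value decomposition of the mean-centered frames; the function name `project_roi_sequence` is hypothetical and the exact formulation used in [22] may differ.

```python
import numpy as np


def project_roi_sequence(roi_frames, n_components=2):
    """Map each frame of one ROI sequence to a low-dimensional point.

    roi_frames: array of shape (N, h, w) holding N grey-level ROI frames.
    Returns an array of shape (N, n_components), one point per frame.
    """
    frames = np.asarray(roi_frames, dtype=float)
    n = frames.shape[0]
    flat = frames.reshape(n, -1)             # one row of pixels per frame
    centered = flat - flat.mean(axis=0)      # remove the mean frame
    # Thin SVD: rows of vt are the principal axes, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T    # PCA scores of each frame
```

With a 1s window, each ROI sequence then becomes a short 2-D trajectory whose shape over time can be analyzed.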
Then, a sliding window $W_{ROI}$ is set according to the average duration of an ME (300ms). The distances between the first frame and the other frames in this window are calculated. The window goes through each frame in the sequence $ROI^m_j$, so that for each frame $n$ the set of distances between frame $n$ and frame $n+w$ ($w \le W_{ROI}$) is obtained, as shown in Figure 3.
Fig. 3. Distance calculation for one ROI sequence ROImj in video clip Im.
The distance values are then normalized over the entire $ROI^m_j$ to avoid the influence of different movement magnitudes in different videos. Hence, the feature of frame $n$ for $ROI^m_j$ can be represented as $[CN^m_j, d^m_j(n, n+1), \dots, d^m_j(n, n+W_{ROI})]$, where $d^m_j(n, n+1)$ is the normalized distance value and $CN^m_j$ is the normalization coefficient. A more detailed derivation can be found in [22]. The feature for one ROI sequence of the entire long video is the concatenation of the features of all the separated sequences.
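The construction of the per-frame feature can be sketched as below. Two points are assumptions made only for illustration: the distance between PCA points is taken to be Euclidean, and the normalization coefficient is taken to be the largest distance in the ROI sequence. The actual definitions are those of [22]. The function name `ltp_features` is hypothetical.

```python
import math


def ltp_features(points, w_roi):
    """Build [CN, d(n, n+1), ..., d(n, n+w_roi)] for each frame n of one ROI sequence.

    points: list of (x, y) PCA points, one per frame.
    w_roi:  length of the window W_ROI in frames.
    Frames without a complete window at the end of the sequence are skipped.
    """
    last_start = max(0, len(points) - w_roi)
    raw = [
        [math.dist(points[n], points[n + w]) for w in range(1, w_roi + 1)]
        for n in range(last_start)
    ]
    # Assumed normalization coefficient: largest distance in this ROI sequence.
    cn = max((d for row in raw for d in row), default=0.0)
    scale = cn if cn > 0 else 1.0            # avoid dividing by zero when nothing moves
    return [[cn] + [d / scale for d in row] for row in raw]
```

Keeping the coefficient as the first element of the feature preserves the magnitude information that the normalization would otherwise discard.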
3) Local Classification: As presented in the above paragraph, one video yields 12 feature ensembles, one from each of the 12 ROIs. Li et al. [22] showed that the LTP patterns are similar for all chosen ROIs and for all kinds of ME. The patterns that represent local ME movements can therefore be recognized by a local classification. A supervised SVM classifier is employed with Leave-One-Subject-Out cross validation. The feature selection and label annotation are presented in [22].
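The Leave-One-Subject-Out protocol can be illustrated with a small sketch that only produces the train/test index splits; the SVM itself is omitted, and any SVM implementation could be trained on each fold. The function name `leave_one_subject_out` is hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    subject_ids: one subject label per sample, in sample order.
    """
    for subject in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test


# Samples of the held-out subject never appear in the training fold.
for subject, train, test in leave_one_subject_out(["s1", "s1", "s2", "s3", "s2"]):
    print(subject, train, test)
```

Splitting by subject rather than by sample prevents the classifier from being evaluated on a face it has already seen during training.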
4) Global Fusion: After the LTPs that fit the local ME movement pattern are recognized, a global fusion is performed to eliminate the false positives caused by other movements and the false negatives caused by our recognition process. As introduced in [22], there are three steps: a local qualification, a spatial fusion and a merge process.
C. LBP-χ2-distance Method
This method was first proposed in [7]. It is the most commonly used method for result comparison in ME spotting. Based on [7] and [18], the configuration of LBP-χ2 is set as follows: the entire face region is divided into 36 blocks. The overlap rates between blocks on the X and Y axes are 0.2 and 0.3 respectively. LBP features are extracted from the blocks with uniform mapping. The radius r is set to r = 3, and the number of neighboring points p is set to p = 8. The χ2 distances of each frame are computed over an interval of 2 × Linterval + 1.
First of all, the value of the LBP-χ2-distance is compared over the whole long video. However, the method can barely spot any micro-expression intervals, while producing many false positives. This is because the method spots the maximal movements in the video, and both databases contain movements larger than MEs. Hence, the entire video is separated into a set of sub-videos by a sliding window, with the same settings as for the LTP-ML method. For each sub-video, the feature differences are calculated and sorted to find the maximal movement in this short interval. This gives the chance to spot more MEs that would be ignored when comparing over the entire video.
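The two ingredients of this comparison, a χ2 distance between histograms and the search for the largest feature difference inside each sub-video, can be sketched as follows. The χ2 formula used here is the common symmetric form, which is an assumption since the formula is not restated in this section; both function names are hypothetical.

```python
def chi2_distance(hist_a, hist_b):
    """Chi-squared distance between two histograms with the same number of bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        if a + b > 0:                        # empty bins contribute nothing
            total += (a - b) ** 2 / (a + b)
    return total


def strongest_frame_per_window(differences, windows):
    """Return, for each non-empty (start, end) window, the frame with the largest difference."""
    return [
        max(range(start, end), key=differences.__getitem__)
        for start, end in windows
    ]
```

Searching for the maximum per sub-video rather than over the whole video means that one large movement can no longer mask every smaller movement elsewhere in the recording.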
D. Performance Metrics
There are three evaluation methods used to compare the performance of the spotting tasks:
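1. True positive in one video definition Supposing there are m ground-truth MEs in the video and n spotted intervals, a spotted interval $W_{spotted}$ is regarded as a true positive (TP) if it satisfies
$$\frac{W_{spotted} \cap W_{groundTruth}}{W_{spotted} \cup W_{groundTruth}} \ge k,$$
where k is set to 0.5 and $W_{groundTruth}$ represents the micro-expression interval (onset-offset). Otherwise, the spotted interval is regarded as a false positive (FP).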
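2. Result evaluation in one video Supposing the number of TP in one video is a (a ≤ m and a ≤ n), then FP = n − a and false negative (FN) = m − a. The Recall, Precision and F1-score are defined as:
$$\text{Recall} = \frac{a}{m}, \quad \text{Precision} = \frac{a}{n}, \quad \text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$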
In practice, these metrics might not be suitable for some videos, as the following situations can arise on a single video:
• The test video does not contain any micro-expression sequence; thus m = 0, and the denominator of recall is zero.
• The spotting method does not spot any interval. The denominator of precision is zero since n = 0.
• Suppose there are two spotting methods, where Method1 spots p intervals and Method2 spots q intervals, with p < q. If the number of true positives is 0 for both methods, the metrics (recall, precision and F1-score) all equal zero for both. However, Method1 in fact performs better than Method2, because it produces fewer false positives.
Considering these situations, we propose that, for a single video, the result be recorded in terms of TP, FP and FN. For performance comparison, the other metrics are then calculated for the entire database.
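3. Evaluation for entire database Supposing that in the entire database there are V videos and M micro-expression sequences, and that the method spots N intervals in total. The database can be considered as one long video; thus, the metrics for the entire database can be calculated by:
$$\text{Recall}_D = \frac{\sum_{i=1}^{V} a_i}{M}, \quad \text{Precision}_D = \frac{\sum_{i=1}^{V} a_i}{N}, \quad \text{F1-score}_D = \frac{2 \times \text{Recall}_D \times \text{Precision}_D}{\text{Recall}_D + \text{Precision}_D}$$
where $a_i$ is the number of TP in the i-th video.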
TABLE II BASELINE RESULT FOR MICRO-EXPRESSION SPOTTING. SAMMcME REPRESENTS THE SAMM CROPPED-FACE VIDEOS CONTAIN ME, SAMM fME ARE
The final results of the different methods are evaluated by F1-score, since it takes both recall and precision into account.
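The evaluation procedure can be sketched in a few lines. The sketch rests on assumptions that are not stated in the text: the overlap between a spotted interval and a ground-truth interval is measured as intersection over union on inclusive frame indices, each ground-truth interval can be matched at most once, and a metric whose denominator is zero is reported as 0. All function names are hypothetical.

```python
def overlap_ratio(spotted, truth):
    """Intersection over union of two inclusive frame intervals (onset, offset)."""
    s_on, s_off = spotted
    t_on, t_off = truth
    intersection = max(0, min(s_off, t_off) - max(s_on, t_on) + 1)
    union = (s_off - s_on + 1) + (t_off - t_on + 1) - intersection
    return intersection / union


def count_video(spotted_intervals, truth_intervals, k=0.5):
    """Return (TP, FP, FN) for one video."""
    matched = set()                          # ground-truth intervals already used
    tp = 0
    for s in spotted_intervals:
        for idx, t in enumerate(truth_intervals):
            if idx not in matched and overlap_ratio(s, t) >= k:
                matched.add(idx)
                tp += 1
                break
    return tp, len(spotted_intervals) - tp, len(truth_intervals) - tp


def database_metrics(per_video_counts):
    """Pool per-video (TP, FP, FN) counts and return (recall, precision, F1)."""
    tp = sum(c[0] for c in per_video_counts)
    fp = sum(c[1] for c in per_video_counts)
    fn = sum(c[2] for c in per_video_counts)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1


counts = [
    count_video([(10, 29), (100, 119)], [(12, 31)]),  # one hit, one false alarm
    count_video([(5, 20)], []),                       # video without any ME
    count_video([], [(50, 70)]),                      # nothing spotted
]
print(counts)                    # [(1, 1, 0), (0, 1, 0), (0, 0, 1)]
print(database_metrics(counts))  # recall 0.5, precision 1/3, F1 0.4
```

The second and third videos show why counts are recorded per video: recall or precision is undefined for each of them individually, while the pooled metrics remain well defined.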
As introduced in Section II, SAMM and CAS(ME)2 have different frame rates and resolutions. Hence, the length of the sliding window Wvideo, the overlap size, the interval length of WROI and the ROI size are different for these two databases. Table III lists the experimental parameters.
TABLE III PARAMETER CONFIGURATION FOR SAMM AND CAS(ME)2. Lwindow IS THE LENGTH OF SLIDING WINDOW Wvideo, Loverlap IS THE OVERLAP SIZE BETWEEN SLIDING WINDOWS, Linterval IS THE INTERVAL LENGTH OF WROI.
The CAS(ME)2 database has 97 videos, but only 32 of them contain micro-expressions. Thus, results are given under two conditions: one considers only the 32 videos that contain MEs (CAS(ME)2ME), and the other includes the entire database (all 97 videos). Since the raw videos in the SAMM database are too big to download (700GB), only 79 videos (full frame: 270GB and cropped face: 11GB) were provided for the challenge. In this work, we report the results on these two versions of the SAMM database: one is the cropped videos (SAMMcME) provided by the authors using the method in [26], and the other is the full-frame videos (SAMM fME). The spotting process is performed only on the downloaded databases.
A. Experiments Results of LTP-ML Method
After performing the LTP-ML method on these two databases, the spotting results for the whole databases are listed in Table II. The F1-scores for SAMMcME and CAS(ME)2ME are 0.0316 and 0.0179 respectively. LTP-ML performs better on SAMMcME than on SAMM fME, since the face-cropping process has already aligned the face region in the video and reduced the influence of irrelevant movements. Concerning the spotting result on CAS(ME)2, there are more FPs because the videos in this database that have no ME may contain macro-expressions.
B. Experiments Results of LBP-χ2-distance (LBP-χ2) Method
The result is compared with the LBP-χ2-distance (LBP-χ2) method. The spotting result is listed in Table II. For CAS(ME)2ME, the best result for the LBP-χ2 method is obtained when the threshold for peak selection is set to 0.15, giving an F1-score of 0.0111. Meanwhile, the highest F1-score on SAMMcME is 0.0055, obtained when the threshold is set to 0.05.
Compared with the LTP-ML method, the LBP-χ2 method is less accurate. The LTP-ML method is capable of spotting subtle movements based on patterns that represent the temporal variation of an ME. Yet, the F1-score is low because of the large number of FPs. Both databases contain noise and irrelevant facial movements; especially for CAS(ME)2, it is not easy to separate macro-expressions from micro-expressions based on 30fps videos. The ability to distinguish MEs from other movements still needs to be enhanced.
This paper addresses the challenge of spotting MEs in long video sequences using the two most recent databases, i.e. SAMM and CAS(ME)2. We proposed LTP-ML for spotting MEs and provided a set of performance metrics as a guideline for result evaluation in ME spotting. The baseline results on these two databases are provided in this paper. Our proposed method achieves higher F1-scores than the LBP approach in spotting MEs (0.0316 against 0.0055 on SAMMcME, and 0.0179 against 0.0111 on CAS(ME)2ME). Whilst the method was able to produce a reasonable number of TPs, a huge challenge still lies ahead due to the large number of FPs. Further research will focus on enhancing the ability to distinguish MEs from other facial movements in order to reduce FPs, including the implementation of deep learning approaches when sufficient data are available.
The authors gratefully acknowledge the contribution of the Organisers and Program Committee Members.
[1] P. Ekman and W. V. Friesen, “Nonverbal leakage and clues to deception,” Psychiatry, vol. 32, no. 1, pp. 88–106, 1969.
[2] P. Ekman, “Lie catching and microexpressions,” The philosophy of deception, pp. 118–133, 2009.
[3] J. Endres and A. Laidlaw, “Micro-expression recognition training in medical students: a pilot study,” BMC medical education, vol. 9, no. 1, p. 47, 2009.
[4] M.-H. Chiu, H. L. Liaw, Y.-R. Yu, and C.-C. Chou, “Facial micro-expression states as an indicator for conceptual change in students’ understanding of air pressure and boiling points,” British Journal of Educational Technology.
[5] P. A. Stewart, B. M. Waller, and J. N. Schubert, “Presidential speechmaking style: Emotional response to micro-expressions of facial affect,” Motivation and Emotion, vol. 33, no. 2, p. 125, 2009.
[6] M. H. Yap, J. See, X. Hong, and S.-J. Wang, “Facial micro-expressions grand challenge 2018 summary,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 675–678.
[7] A. Moilanen, G. Zhao, and M. Pietikäinen, “Spotting rapid facial movements from videos using appearance-based feature difference analysis,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 1722–1727.
[8] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikäinen, “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Transactions on Affective Computing, 2017.
[9] A. Davison, W. Merghani, C. Lansley, C.-C. Ng, and M. H. Yap, “Objective micro-facial movement detection using facs-based regions and baseline evaluation,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 642–649.
[10] W.-J. Yan, X. Li, S.-J. Wang, G. Zhao, Y.-J. Liu, Y.-H. Chen, and X. Fu, “Casme ii: An improved spontaneous micro-expression database and the baseline evaluation,” PloS one, vol. 9, no. 1, p. e86041, 2014.
[11] X. Li, J. Yu, and S. Zhan, “Spontaneous facial micro-expression detection based on deep learning,” in Signal Processing (ICSP), 2016 IEEE 13th International Conference on. IEEE, 2016, pp. 1130–1134.
[12] S.-J. Wang, S. Wu, and X. Fu, “A main directional maximal difference analysis for spotting micro-expressions,” in Asian Conference on Computer Vision. Springer, 2016, pp. 449–461.
[13] H. Ma, G. An, S. Wu, and F. Yang, “A region histogram of oriented optical flow (rhoof) feature for apex frame spotting in micro-expression,” in Intelligent Signal Processing and Communication Systems (ISPACS), 2017 International Symposium on. IEEE, 2017, pp. 281–286.
[14] H. Lu, K. Kpalma, and J. Ronsin, “Micro-expression detection using integral projections,” 2017.
[15] C. Duque, O. Alata, R. Emonet, A.-C. Legrand, and H. Konik, “Microexpression spotting using the riesz pyramid,” in WACV 2018, 2018.
[16] Y. Li, X. Huang, and G. Zhao, “Can micro-expression be recognized based on single apex frame?” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 3094–3098.
[17] Z. Xia, X. Feng, J. Peng, X. Peng, and G. Zhao, “Spontaneous micro-expression spotting via geometric deformation modeling,” Computer Vision and Image Understanding, vol. 147, pp. 87–94, 2016.
[18] T.-K. Tran, X. Hong, and G. Zhao, “Sliding window based micro-expression spotting: A benchmark,” in International Conference on Advanced Concepts for Intelligent Vision Systems. Springer, 2017, pp. 542–553.
[19] D. Borza, R. Danescu, R. Itu, and A. Darabant, “High-speed video system for micro-expression detection and recognition,” Sensors, vol. 17, no. 12, p. 2913, 2017.
[20] P. Husák, J. Čech, and J. Matas, “Spotting facial micro-expressions “in the wild”,” in 22nd Computer Vision Winter Workshop, 2017.
[21] Z. Zhang, T. Chen, H. Meng, G. Liu, and X. Fu, “Smeconvnet: A convolutional neural network for spotting spontaneous facial micro-expression from long videos,” IEEE Access, vol. 6, pp. 71 143–71 151, 2018.
[22] J. Li, C. Soladié, and R. Séguier, “Ltp-ml: Micro-expression detection by recognition of local temporal pattern of facial movements,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 634–641.
[23] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap, “Samm: A spontaneous micro-facial movement dataset,” IEEE Transactions on Affective Computing, vol. 9, no. 1, pp. 116–129, 2018.
[24] F. Qu, S.-J. Wang, W.-J. Yan, H. Li, S. Wu, and X. Fu, “Cas(me)^2: a database for spontaneous macro-expression and micro-expression spotting and recognition,” IEEE Transactions on Affective Computing, 2017.
[25] A. Davison, W. Merghani, and M. Yap, “Objective classes for micro-facial expression recognition,” Journal of Imaging, vol. 4, no. 10, p. 119, 2018.
[26] A. K. Davison, M. H. Yap, and C. Lansley, “Micro-facial movement detection using individualised baselines and histogram-based descriptors,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2015, pp. 1864–1869.