In recent years, AR devices and applications have received unprecedented attention from both the research community and industry. To enable a high-quality AR experience, an AR device should be able to understand its own motion [12,19,39], understand its surrounding environment [11, 13], and interact with humans via control commands [24,28]. Among the interaction technologies, HGR is one of the most natural and efficient methods, and it greatly enhances the AR experience. Thus, in this paper, we focus on the problem of HGR to enable a high-quality experience on low-cost AR devices.
Early studies on HGR mainly focus on 3D skeleton-based methods using RGB-D cameras. An RGB-D camera directly measures depth, which greatly simplifies the algorithm design. However, its relatively high price makes it unaffordable for low-cost AR devices. Recently, Molchanov et al. [18] introduced an RGB-based HGR system using a 3D-CNN-based method. While this method is theoretically sound and interesting, its high computational cost makes it infeasible on low-cost devices. Moreover, 3D-CNN-based algorithms suffer from reduced performance in scenes with complex backgrounds or multiple hands.
Although real-time HGR on low-cost embedded devices is a less-explored topic, embedded-friendly CNNs are under active study through the design of lightweight, efficient networks [8,15, 22]. However, the well-established techniques for other visual tasks are not directly applicable to gesture recognition because of the high degrees of freedom (DoF) and the fast motion of the hand. This makes real-time hand interaction on low-cost embedded devices a challenging problem.
Figure 2: Multi-task cascaded network architecture. Top sub-figure: The online detector for generating candidate bounding boxes for hands. Bottom sub-figure: The multi-task CNN for eliminating the false candidates and regressing the keypoint locations.
Figure 3: The proposed TraceSeqNN architecture. The TraceSeqNN is composed of transformer, LSTM and FC blocks. We adopt a velocity branch and a shape branch to better learn the motion features.
In this paper, we propose a lightweight and computationally efficient gesture recognition network, namely LE-HGR, to enable real-time gesture recognition on embedded devices with limited computational and memory resources, and to improve generalization to complex AR interaction scenes. In the first step, we feed low-resolution images into a candidate detector for real-time inference. To speed up the inference phase, we adopt MobileNet [8,26] as our backbone network, which greatly reduces the required parameters and computations. Subsequently, a multi-task CNN is exploited to extract the temporal features of the motion trace and the spatial features of the hand shape. Finally, we implement a TraceSeqNN to recognize the hand gesture based on these motion features. A video file is also attached as supplementary material, in which real-time gesture recognition is used for AR applications. Our main contributions are summarized as follows:
• Firstly, we design a cascaded multi-task network to enable real-time hand detection on AR devices. Subsequently, to enable interactive hand applications in complicated environments with multiple persons or multiple hands, we propose a fast mapping method to track the hand trace via predicted locations.
• Secondly, to tackle online HGR, we design the TraceSeqNN to encode the temporal correlation information and predict the gesture category. A velocity branch and a shape branch are exploited to better learn the motion information. Additionally, an effective data augmentation method for the sequence samples is proposed to prevent overfitting and to simulate different temporal domains of hand motion.
• Finally, we propose a novel gesture recognition framework and apply our system to interactive AR applications to enhance the immersive AR experience. Specifically, we show that our method outperforms R3DCNN [5] by 1.2% in accuracy on the Nvidia gesture dataset [5]. More importantly, our method runs 500x faster than [5] and is able to perform real-time tasks on low-cost devices.
2.1 Visual Recognition Methods for Hand Detection
One of the best-known families of visual recognition algorithms is the R-CNN family [4,23], which generates potential bounding boxes via region proposal methods and runs a classifier on these proposed boxes. However, these methods require a large amount of computational resources, making them infeasible for low-cost embedded devices.
Unlike region proposal-based techniques, the single shot multi-box detector (SSD) [16] produces predictions on multi-scale feature maps and achieves competitive accuracy even with relatively low-resolution input. To run convolutional neural networks in real time on low-cost devices, Howard et al. [8] proposed MobileNet, which is based on depth-wise separable convolutions to reduce the number of parameters.
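To make the parameter saving of depth-wise separable convolutions concrete, the following minimal sketch counts the weights of a standard convolution and of its depth-wise separable counterpart. The layer shape (3x3 kernel, 128 input and 256 output channels) is a hypothetical example, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise convolution followed by a 1x1 point-wise convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
std = standard_conv_params(3, 128, 256)
sep = separable_conv_params(3, 128, 256)
print(std, sep, round(std / sep, 2))
```

For this example layer the separable form needs roughly one ninth of the weights, which is the kind of reduction that makes such backbones attractive on embedded processors.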
2.2 Gesture Recognition using Spatial-temporal Features
2.2.1 HGR via the RGB-D based Approaches
RGB-D cameras have a natural advantage in action recognition due to their capability of directly capturing dense 3D point cloud data. In recent years, a variety of methods have been successfully applied to 3D skeleton-based action recognition problems [7,14,21,34,37, 38]. De Smedt et al. [2,38] encoded the temporal pyramid information using 3D skeleton-based geometric features and used a linear support vector machine (SVM) for classification. Chen et al. [1] proposed a motion feature augmented recurrent neural network (RNN) for skeleton-based HGR.
2.2.2 Gesture Recognition using the Sequential Video Sequences
A large number of 3D-CNN-based methods [5, 17, 29, 32] have been developed to improve performance on various video analysis tasks. Among them, ConvLSTM [31, 35, 36, 40] combined the 3D-CNN and LSTM to effectively learn different levels of spatial-temporal characteristics. CLDNN [25] analyzed the effect of adding CNN layers to LSTM, and LRCN [3] extended the convolutional LSTM for visual recognition problems. Besides, multiple-stream methods have been proposed for handling multi-modal data [20,27]; they separately learn the spatial features from video frames and the temporal features from dense optical flow for action recognition.
3.1 Framework Overview
In this section, we illustrate the overall pipeline of our approach. As shown in Fig. 1, our LE-HGR mainly consists of four stages: detecting hand candidates from input images, refining detection results and regressing hand keypoints using a multi-task network, trace mapping for addressing multi-hand ambiguities, and gesture recognition from the hand-skeleton sequence. In the first stage, the hand bounding boxes are detected via an ultra-simplified SSD [15] network. Since this lightweight detector inevitably produces false-positive detections, we propose a cascaded multi-task network to refine those results and to regress the hand keypoints in parallel. To effectively handle the situation in which multiple hands appear simultaneously in the input images, we maintain the temporal traces by associating the predicted hands with the existing hand motion trails. Finally, hand keypoints in the temporal trace are stacked as the sequential input of the TraceSeqNN to recognize the gesture.
3.2 Joint Hand Detection and Keypoints Regression using the Multi-task Cascaded Network
In this section, we describe the architecture and training objective of our multi-task network, which is also shown in Fig. 2.
3.2.1 Online Hand Candidate Detector
Detector Network Architecture. In this stage, SSD [16] is chosen as our candidate detector to balance speed and accuracy. MobileNet [8] is used as the backbone CNN, and the width multiplier (WM) is set to 0.25 [8] to improve the inference speed. The multi-scale pyramid feature maps are produced by max pooling, similar to PPN [9], to reduce parameters.
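As a rough illustration of why a pooling-based pyramid adds no parameters, the sketch below builds multi-scale maps from a single feature map by repeated max pooling. The 2x2 window with stride 2, the toy single-channel map, and the function names are assumptions made for illustration; they are not taken from the detector configuration.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D list; odd trailing rows/columns are dropped."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def build_pyramid(fmap, levels):
    """Return up to `levels` feature maps, each obtained by pooling the previous one."""
    pyramid = [fmap]
    while len(pyramid) < levels and len(pyramid[-1]) >= 2 and len(pyramid[-1][0]) >= 2:
        pyramid.append(max_pool_2x2(pyramid[-1]))
    return pyramid


# Toy 4x4 single-channel feature map.
toy_map = [[r * 4 + c for c in range(4)] for r in range(4)]
for level in build_pyramid(toy_map, 3):
    print(len(level), "x", len(level[0]))
```

Because pooling has no learnable weights, every additional scale comes without extra parameters, in contrast to building each scale with additional convolution layers.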
Training Loss of the Candidate Detector. The overall loss function is a weighted sum of the confidence loss $L_{conf}$ and the regression loss $L_{box}$, where $L_{box}$ is the regression loss of bounding boxes following the concept of [4,16], and $L_{conf}$ is the cross-entropy loss between the ground-truth label y and the predicted candidate probability p.
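The detector loss described above can be sketched as follows. This is a minimal sketch under stated assumptions: the weight `alpha`, the use of the smooth L1 form of [4] for the box term, and applying the box term only to positive candidates are illustrative choices, not values or rules confirmed by the text.

```python
import math


def binary_cross_entropy(y: int, p: float, eps: float = 1e-7) -> float:
    """Cross-entropy between ground-truth label y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def smooth_l1(pred, target) -> float:
    """Smooth L1 loss summed over box coordinates."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def detector_loss(y, p, box_pred, box_gt, alpha=1.0):
    """Weighted sum of confidence loss and box regression loss (alpha is assumed)."""
    l_conf = binary_cross_entropy(y, p)
    # Assumption: box regression only contributes for positive candidates.
    l_box = smooth_l1(box_pred, box_gt) if y == 1 else 0.0
    return l_conf + alpha * l_box


print(detector_loss(1, 0.9, [0.1, 0.0, 0.2, 0.1], [0.0, 0.0, 0.0, 0.0]))
```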
3.2.2 Hand Detection and Keypoints Regression using the Multi-task CNN
Multi-task Network Architecture. A multi-task CNN is used as the validation stage to reject erroneous candidates and regress the hand keypoints at the same time. Two branches of fully connected (FC) layers are added to the end of the network for hand confidence prediction and keypoint regression, respectively.
Training Loss of the Multi-task Network. The complete loss function of the multi-task network is composed of the cross-entropy loss over the binary classification and the Euclidean loss over the keypoints, where u is the corresponding coordinate of a hand keypoint and $\bar{p}$ is the predicted probability of a hand.
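A minimal sketch of such a two-term objective is given below. The balancing weight `beta`, the use of summed squared distances as the Euclidean term, and restricting the keypoint term to true hands are assumptions made for illustration only.

```python
import math


def multitask_loss(y, p_hand, kp_pred, kp_gt, beta=1.0, eps=1e-7):
    """Binary cross-entropy on hand confidence plus Euclidean loss on keypoints.

    y       : ground-truth label (1 = hand, 0 = not a hand)
    p_hand  : predicted hand probability
    kp_pred : predicted keypoints as (x, y) pairs
    kp_gt   : ground-truth keypoints as (x, y) pairs
    beta    : assumed weight of the keypoint term
    """
    p = min(max(p_hand, eps), 1.0 - eps)
    l_cls = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    # Assumption: keypoints are only supervised for positive samples.
    l_kp = 0.0
    if y == 1:
        l_kp = sum((px - gx) ** 2 + (py - gy) ** 2
                   for (px, py), (gx, gy) in zip(kp_pred, kp_gt))
    return l_cls + beta * l_kp


print(multitask_loss(1, 0.8, [(0.5, 0.5)], [(0.4, 0.6)]))
```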
3.2.3 Hand Trace Mapping via the Predicted Locations
When an HGR system is used in complicated environments, ambiguity caused by multiple hands is the norm rather than the exception. While other methods [5] often ignore this problem, we introduce hand trace mapping to solve it.
Hand Trace Mapping. The hand mapping approach matches the predicted hand of the current frame with the maintained traces using the current bounding box and keypoints. We adopt the match loss $L_{match}$ to measure the similarity between the current hand and the existing traces; it combines the intersection over union (IoU) and the keypoint locations. The overall loss function of the matching phase combines three terms: the IoU loss $L_{IoU}$ and the region area loss $L_{area}$ of the hand bounding box indicate the regional similarity, and the Euclidean loss $L_{loc}$ measures the similarity of the keypoint positions.
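One possible realization of this matching step is sketched below. The exact form of each term, the weights `w_iou`, `w_area`, `w_loc`, the normalization by the box diagonal, and the rejection threshold `max_loss` are all hypothetical; only the combination of IoU, area, and keypoint-location terms follows the description above.

```python
import math


def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def match_loss(box, kps, trace_box, trace_kps, w_iou=1.0, w_area=1.0, w_loc=1.0):
    """Lower is more similar; combines IoU, area, and keypoint-location terms."""
    l_iou = 1.0 - iou(box, trace_box)
    a, b = box_area(box), box_area(trace_box)
    l_area = 1.0 - min(a, b) / max(a, b) if max(a, b) > 0 else 1.0
    # Mean keypoint distance, normalized by the trace box diagonal (assumed).
    diag = math.hypot(trace_box[2] - trace_box[0], trace_box[3] - trace_box[1])
    dists = [math.hypot(px - qx, py - qy)
             for (px, py), (qx, qy) in zip(kps, trace_kps)]
    l_loc = (sum(dists) / len(dists)) / diag if dists and diag > 0 else 0.0
    return w_iou * l_iou + w_area * l_area + w_loc * l_loc


def assign_to_trace(box, kps, traces, max_loss=1.5):
    """Return the id of the best-matching trace, or None to start a new trace."""
    best_id, best = None, max_loss
    for trace_id, (t_box, t_kps) in traces.items():
        loss = match_loss(box, kps, t_box, t_kps)
        if loss < best:
            best_id, best = trace_id, loss
    return best_id


traces = {"a": ((0, 0, 10, 10), [(5, 5)]), "b": ((50, 50, 60, 60), [(55, 55)])}
print(assign_to_trace((1, 1, 11, 11), [(6, 6)], traces))
```

Returning None for a detection that matches no trace well is one way to let a newly appearing hand open its own trace instead of corrupting an existing one.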
Temporal Motion Features. Once a detected hand is associated with a trace, the temporal motion features $X^t_{mot}$ of this tracked hand can be directly computed from the location vector $X^t_p$. Here, $X^t_v$ denotes the velocity of the tracked hand, and $X^t_e$ denotes the edge vectors of the hand skeleton over the edge set E, which describe the hand shape information.
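The motion features can be sketched as follows, assuming that the velocity is the frame-to-frame difference of keypoint locations, that each edge is a pair of keypoint indices, and that the two parts are simply concatenated. These are illustrative assumptions; the function names are hypothetical.

```python
def velocity(curr, prev):
    """Per-keypoint displacement between consecutive frames."""
    return [(cx - px, cy - py) for (cx, cy), (px, py) in zip(curr, prev)]


def edge_vectors(kps, edges):
    """Vector from keypoint i to keypoint j for every skeleton edge (i, j)."""
    return [(kps[j][0] - kps[i][0], kps[j][1] - kps[i][1]) for i, j in edges]


def motion_features(curr, prev, edges):
    """Flattened velocity and edge features for one timestep."""
    feats = []
    for vx, vy in velocity(curr, prev) + edge_vectors(curr, edges):
        feats.extend([vx, vy])
    return feats


# Two keypoints joined by one skeleton edge (toy example).
prev_kps = [(0.0, 0.0), (2.0, 1.0)]
curr_kps = [(1.0, 1.0), (3.0, 1.0)]
print(motion_features(curr_kps, prev_kps, edges=[(0, 1)]))
```

Edge vectors are invariant to where the hand sits in the image, so they carry shape information, while the velocity part carries the motion of the hand as a whole.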
Table 1: The hand detection results. The top two rows show the impact of image input size. The proposed cascaded network is effective in improving the performance for low-resolution input.
Table 2: Evaluation of the regression errors on public datasets. We conduct experiments on public datasets to compute the error metrics for keypoint regression.
3.3 HGR using Temporal Motion Features
3.3.1 Online Recognition based on the TraceSeqNN
As depicted in Fig. 3, the proposed TraceSeqNN consists primarily of a transformer, an LSTM block, and an FC block. Given timestep T, the sequential input $X^T_{mot}$ is fed into a transformer to reshape the motion features for the LSTM layers [6].
In temporal movements, the hand shape and the velocity of keypoints demonstrate different characteristics. We use a velocity branch and a shape branch to learn the velocity features $X^T_v$ and the hand shape features $X^T_e$ separately. The outputs of these two branches are then stacked and fed into the FC layers, and the probability of each gesture category is finally predicted via the softmax layer.
3.3.2 Sequence Sample Generation and Augmentation
Table 3: Evaluation of the TraceSeqNN for HGR on our dataset. We compare our models against different state-of-the-art approaches on the same sequence dataset, which consists of left waving, right waving, and negative samples. Taking our temporal motion features as input, the HGR can run at 250 fps on a low-cost MediaTek MTK-8167S processor.
Sample Generation Principle. Given the start and end timestamps of annotated segments in the captured video, we generate the sample set for the training and test phases by clipping the original video with an objective timestep $T_{obj}$. To distinguish positive and negative samples, we compute the label of each sample according to the similarity between the clipped sample and its corresponding original segment. Firstly, we represent the similarity as the intersection over a sample, $r_{IoS}$, and the intersection over its annotation, $r_{IoA}$, both computed from the start and end timestamps of the clipped sequence in the original video. Subsequently, the label of the clipped sample is computed from this similarity metric using the thresholds of the sampling principle, where c is the annotated category in the original video.
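The two overlap ratios of the sample generation principle can be sketched as follows. The ratio definitions follow their names (intersection divided by the clip length and by the annotation length, respectively); the labeling rule and the way the thresholds `pos_thresh` and `neg_thresh` are applied are hypothetical and serve only to illustrate how such ratios can separate positive, negative, and ambiguous clips.

```python
NEGATIVE = 0  # assumed id of the negative (no gesture) class


def overlap_ratios(clip_start, clip_end, ann_start, ann_end):
    """Return (r_IoS, r_IoA): intersection over sample and over annotation."""
    inter = max(0.0, min(clip_end, ann_end) - max(clip_start, ann_start))
    clip_len = clip_end - clip_start
    ann_len = ann_end - ann_start
    r_ios = inter / clip_len if clip_len > 0 else 0.0
    r_ioa = inter / ann_len if ann_len > 0 else 0.0
    return r_ios, r_ioa


def label_clip(clip_start, clip_end, ann_start, ann_end, category,
               pos_thresh=0.85, neg_thresh=0.3):
    """Hypothetical rule: gesture label for high overlap, negative for low overlap."""
    r_ios, r_ioa = overlap_ratios(clip_start, clip_end, ann_start, ann_end)
    if max(r_ios, r_ioa) >= pos_thresh:
        return category
    if r_ios <= neg_thresh and r_ioa <= neg_thresh:
        return NEGATIVE
    return None  # ambiguous clip, left out of the sample set


print(overlap_ratios(0, 10, 5, 25))
print(label_clip(5, 15, 0, 20, category=3))
```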
Data Augmentation for the Temporal Domain. To generate a large number of hand motion samples of different temporal domains from the limited annotated segments, we propose a novel data augmentation method to increase the diversity of temporal domains. Firstly, the sequence samples are generated via a timestep set, where $T_{min}$ is the minimum timestep and is configured to a constant value.
Subsequently, using a nonlinear transformation, i.e., interpolation or down-sampling operations, the generated samples are further re-sampled and deformed to the target timestep $T_{obj}$.
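The re-sampling step can be sketched as follows. This sketch assumes linear interpolation between neighboring frames and a sliding-window clipping scheme; the interpolation scheme, the clipping stride, and the function names are assumptions, and the timestep set is passed in by the caller rather than derived here.

```python
def resample_sequence(frames, target_len):
    """Resample a sequence of feature vectors to target_len frames (linear interpolation)."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def augment(sequence, timesteps, target_len):
    """Clip the sequence with every window length in `timesteps`, then resample each clip."""
    samples = []
    for t in timesteps:
        for start in range(0, len(sequence) - t + 1):
            samples.append(resample_sequence(sequence[start:start + t], target_len))
    return samples


seq = [[float(i)] for i in range(5)]
print(resample_sequence(seq, 3))           # down-sampling
print(resample_sequence(seq[:2], 3))       # interpolation
print(len(augment(seq, [4, 5], 3)))
```

Clips shorter than the target length are stretched and longer ones are compressed, so one annotated segment yields samples that imitate the same gesture performed at different speeds.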
4.1 Dataset
4.1.1 Public Nvidia Gesture Dataset
Nvidia Corporation published a dataset of 25 gesture types intended for touchless interfaces in cars [5]. The dataset consists of a total of 1050 training and 482 testing video sequences, covering both bright and dim artificial lighting. To validate the effectiveness of our proposed algorithm on the Nvidia dataset, we also add annotations for the keypoints and bounding boxes of this dataset during the training phase. All new annotations will be made publicly available.
4.1.2 Our Dataset
In addition to testing on the public Nvidia dataset, we also build our own dataset through crowd-sourcing. This dataset consists of more than 150k RGB images captured under different environmental conditions (e.g., different backgrounds and lighting conditions) with different RGB cameras, of which 120k are training images and 30k are testing images. In addition, we recorded 30k video sequences using different RGB cameras at 30 fps.
4.2 Evaluation of Multi-task Cascaded Network for Hand Detection
The proposed cascaded multi-task network is adopted to predict the hand confidence and regress the keypoint locations. In this section, we conduct experiments to quantitatively evaluate our architecture design on a hand-held low-cost interactive device. The main processor of this device is a MediaTek MTK-8167S.
Table 4: Comparison with the state-of-the-art methods on the published Nvidia dataset. Proposed-reg uses the multi-task CNN to regress the keypoints, while proposed-cpm adopts the CPM-1Stage network to predict the keypoint locations. TraceSeqNN-SB uses a single branch to learn the motion features.
4.2.1 Training Details
To train the proposed network, we annotated 8867 ground-truth bounding boxes for the Nvidia dataset and 90000 ground-truth bounding boxes for our dataset. The candidate detector was trained with the annotated datasets. Subsequently, for the multi-task CNN, we used the candidate detector model to generate positive samples with an IoU over 0.5 and negative samples with an IoU below 0.3. The hand keypoints of the positive samples are annotated only for the training phase of the multi-task CNN.
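The sample selection for the multi-task CNN can be sketched as follows, using the IoU thresholds stated above. The box format (x1, y1, x2, y2), the use of the best-overlapping ground-truth box, and the "ignored" bucket for candidates between the two thresholds are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_candidate(candidate, gt_boxes, pos_iou=0.5, neg_iou=0.3):
    """Label a detector candidate by its best IoU with any ground-truth box."""
    best = max((iou(candidate, g) for g in gt_boxes), default=0.0)
    if best > pos_iou:
        return "positive"
    if best < neg_iou:
        return "negative"
    return "ignored"  # assumed handling of in-between candidates


gt = [(0, 0, 10, 10)]
for cand in [(0, 0, 10, 8), (0, 0, 10, 4), (20, 20, 30, 30)]:
    print(cand, label_candidate(cand, gt))
```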
4.2.2 Evaluation of the Multi-task Cascaded Network
We compared the proposed network against the standard SSD to evaluate the accuracy on the previously mentioned datasets. For a fair comparison, we chose MobileNetV1 as the backbone of both our detector and the standard SSD. The results in Table 1 show that the detection results of SSD degrade drastically with low-resolution input. In contrast, the proposed network greatly improves both detection recall and precision at only negligible computational cost.
We also measured the error metrics on our dataset and the Nvidia dataset [5] to assess the precision of the regression. The results are shown in Table 2. Furthermore, we implemented a one-stage convolutional pose machine (CPM-1Stage) [33] using MobileNet as the backbone.
Figure 4: Confusion matrix of 25 gesture types. Confusion matrix on all 482 sequences of the Nvidia test set.
Figure 5: Examples of failure cases. We show failure cases due to ambiguity of the hand gesture or overlap between gestures, e.g., "Two fingers left" is part of the "Rotate fingers CW" gesture.
4.3 Evaluation of the TraceSeqNN and Data Augmentation
4.3.1 Training Details
For data augmentation of the training set, we set the objective timestep $T_{obj}$ to 13 according to the data distribution. The minimum timestep $T_{min}$ is set to 8, and the accompanying timestep-set parameter is set to 5. For calculating the labels of the clipped samples, two of the thresholds are set to 0.3 and the third is set to 0.85. During the training phase of TraceSeqNN, the initial learning rate is set to 0.004. The number of LSTM hidden layers is set to 64. We train all the models using the Adam optimizer [10], with the dropout rate of the LSTM cells at 0.2.
4.3.2 Metrics on Our Dataset using TraceSeqNN
We designed comprehensive experiments to evaluate different input channels, different data augmentation methods, and different network architectures (i.e., TDNN [30], LSTM [6], CLDNN [25] and LRCN [3]) of our TraceSeqNN. The results are shown in Table 3. They demonstrate that our data augmentation method significantly improves the overall prediction accuracy. Compared with using $X^T_{box}$, using the motion features $X^T_{mot}$ as the network input largely eliminates false positives. To summarize, the proposed TraceSeqNN achieves the highest accuracy for HGR.
4.4 Evaluation on the Public Datasets for HGR
Figure 6: Typical AR application. The left column shows finger tracking and gesture recognition ("click") via our LE-HGR approach. The right column shows the application to AR education.
Table 4 shows the results of comparing our approach against competing state-of-the-art networks on the Nvidia dataset. As some of the operations are not supported on the hand-held low-cost device, to ensure a fair comparison, all the experiments were performed on a MacBook Pro Core i7 computer with 16GB of memory, using only the CPU. Table 4 demonstrates that C3D [18] and R3DCNN [5] rely heavily on computational resources and can only run at 1 fps.
For the proposed TraceSeqNN, benefiting from the designed lightweight cascaded framework, our LE-HGR significantly improves the inference speed to 76 fps. In addition, the proposed approach achieves the highest accuracy among the competing methods by using temporal motion features. With our proposed data augmentation, we observe an increase of 5% on this challenging gesture dataset. Our final accuracy is 75.3% using only the RGB modality of the dataset.
The complete confusion matrix is shown in Fig. 4. If equipped with a more precise keypoint predictor (CPM-1Stage), we obtain a 2% improvement. We show some failure cases in Fig. 5.
4.5 Typical AR Applications
In general, AR devices have limited user interfaces, most often small buttons or a touchscreen. These interfaces are not natural for humans and break the immersive AR experience.
To enrich the AR experience, we design an online AR interaction system based on direct hand touch and gesture sensing. As shown in Fig. 6, by locating and tracking the finger position, the system can understand the user's intention (e.g., "clicking", "draw star", and so on). By applying this system to educational and entertainment applications, we can provide more engaging AR experiences. We present a video showing these AR application demos in our supplementary material.
We have presented a complete online RGB-based gesture recognition framework, which is extremely lightweight and efficient for low-cost commercial embedded devices. An improved multi-task cascaded network is introduced for online hand detection, and a prototype of the hand trace mapping is implemented to handle the interference of multiple hands. Besides, a TraceSeqNN combined with an effective data augmentation method is used to learn the temporal information and recognize the gesture categories.
The experimental results show that our proposed cascaded multi-task network significantly improves the recall and precision rates of hand detection. The designed LE-HGR framework achieves competitive accuracy with significantly reduced computational complexity. To extend this work, we are interested in investigating the challenging problem of directly predicting 3D hand keypoints using only RGB cameras. Moreover, a promising future research direction is to explore the attention mechanism for RGB-based HGR applications.
[1] X. Chen, H. Guo, G. Wang, and L. Zhang. Motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition. In IEEE International Conference on Image Processing (ICIP), pp. 2881–2885, 2017.
[2] Q. De Smedt, H. Wannous, J.-P. Vandeborre, J. Guerry, B. L. Saux, and D. Filliat. 3d hand gesture recognition using a depth and skeletal dataset: Shrec’17 track. In Proceedings of the Workshop on 3D Object Retrieval, pp. 33–38. Eurographics Association, 2017.
[3] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634, 2015.
[4] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
[5] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural networks. CVPR, 2016.
[6] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] J. Hou, G. Wang, X. Chen, J.-H. Xue, R. Zhu, and H. Yang. Spatial-temporal attention res-tcn for skeleton-based dynamic hand gesture recognition. In European Conference on Computer Vision, pp. 273–286. Springer, 2018.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[9] P. Jin, V. Rathod, and X. Zhu. Pooling pyramid network for object detection. CoRR, abs/1807.03284, 2018.
[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[11] M. Klingensmith, I. Dryanovski, S. Srinivasa, and J. Xiao. Chisel: Real time large scale 3d reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems, 2015.
[12] M. Li and A. I. Mourikis. High-precision, consistent ekf-based visual-inertial odometry. The International Journal of Robotics Research, 32(6):690–711, 2013.
[13] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125, 2017.
[14] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang. Skeleton-based action recognition using spatio-temporal lstm network with trust gates. IEEE transactions on pattern analysis and machine intelligence, 40(12):3007–3021, 2018.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pp. 21–37. Springer, 2016.
[16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015.
[17] Q. Miao, Y. Li, W. Ouyang, Z. Ma, X. Xu, W. Shi, and X. Cao. Multimodal gesture recognition based on the resc3d network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3047–3055, 2017.
[18] P. Molchanov, S. Gupta, K. Kim, and J. Kautz. Hand gesture recognition with 3d convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1–7, 2015.
[19] R. Mur-Artal and J. D. Tardós. ORB-SLAM2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
[20] N. Nishida and H. Nakayama. Multimodal gesture recognition using multi-stream recurrent neural network. In Pacific-Rim Symposium on Image and Video Technology, pp. 682–694. Springer, 2015.
[21] E. Ohn-Bar and M. M. Trivedi. Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations. In IEEE transactions on intelligent transportation systems, pp. 2368–2377, 2014.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
[23] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
[24] H. Sagawa, H. Nagayoshi, H. Kiyomizu, and T. Kurihara. Hands-free ar work support system monitoring work progress with point-cloud data processing. In IEEE International Symposium on Mixed and Augmented Reality, pp. 172–173, 2015.
[25] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4580–4584, 2015.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
[27] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition. In Proceedings of the Neural Information Processing Systems (NIPS), 2015.
[28] A. Tewari, B. Taetz, F. Grandidier, and D. Stricker. A probabilistic combination of cnn and rnn estimates for hand gesture based interaction in car. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR-Adjunct), pp. 1–6, 2017.
[29] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497, 2015.
[30] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. Backpropagation: Theory, Architectures and Applications, pp. 35–61, 1995.
[31] H. Wang, P. Wang, Z. Song, and W. Li. Large-scale multimodal gesture segmentation and recognition based on convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3138–3146, 2017.
[32] L. Wang, Y. Xu, J. Cheng, H. Xia, J. Yin, and J. Wu. Human action recognition by learning spatio-temporal features with deep neural networks. IEEE access, 6:17913–17922, 2018.
[33] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732, 2016.
[34] C. Xie, C. Li, B. Zhang, C. Chen, J. Han, C. Zou, and J. Liu. Memory attention networks for skeleton-based action recognition. arXiv preprint arXiv:1804.08254, 2018.
[35] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pp. 802–810, 2015.
[36] L. Zhang, G. Zhu, P. Shen, J. Song, S. Afaq Shah, and M. Bennamoun. Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3120–3128, 2017.
[37] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE transactions on pattern analysis and machine intelligence, 2019.
[38] D. Zhao, Y. Liu, and G. Li. Skeleton-based dynamic hand gesture recognition using 3d depth data. Electronic Imaging, 2018(18):1–8, 2018.
[39] X. Zheng, Z. Moratto, M. Li, and A. I. Mourikis. Photometric patch-based visual-inertial odometry. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3264–3271, 2017.
[40] G. Zhu, L. Zhang, P. Shen, and J. Song. Multimodal gesture recognition using 3-d convolution and convolutional lstm. IEEE Access, 5:4517–4524, 2017.