Image-based OoD-Detector Principles on Graph-based Input Data in Human Action Recognition

2020·Arxiv

Abstract

In a world as complex as ours, a practical implementation of a machine learning system cannot assume a closed world. A learning-based system in a real-world environment must therefore be aware of its own capabilities and limits and be able to distinguish between confident and unconfident inference results, especially if a sample cannot be explained by the underlying distribution. This knowledge is particularly essential in safety-critical environments and tasks, e.g. self-driving cars or medical applications. To this end, we transfer image-based Out-of-Distribution (OoD) methods to graph-based data and show their applicability in action recognition.

The contribution of this work is (i) the examination of the portability of recent image-based OoD-detectors for graph-based input data, (ii) a Metric Learning-based approach to detect OoD-samples, and (iii) the introduction of a novel semi-synthetic action recognition dataset.

The evaluation shows that image-based OoD-methods can be applied to graph-based data. Additionally, there is a gap between the performance on the intraclass and the intradataset task. Early methods such as the examined baseline or ODIN provide reasonable results. More sophisticated network architectures, in contrast to their image-based application, were surpassed in the intradataset comparison and even lead to lower classification accuracy.

I. INTRODUCTION

Modern deep convolutional neural networks are able to recognize objects in images, segment areas pixelwise, and even generate realistic looking photos. Despite their superb capabilities in those areas, they are not able to expose their own lack of knowledge. As some studies have found, the confidence of a network in its output is as high for irrelevant or non-human-understandable input data as for in-distribution input data [1]–[3]. As a result, there are numerous approaches [2], [4]–[7] for detecting so-called out-of-distribution (OoD) data.

Instead of proposing another image-based approach, this work investigates the applicability of OoD-detection methods to graph-based input data. To the best of our knowledge, there are no OoD-detection methods which are usable and have been investigated on graph-based data. Since human skeleton graphs can be easily generated from RGB images [8], [9], depth data [10], and even RF-signals [11], the dynamics of human actions can be captured without the high computational cost of optical flow or problems regarding poor visual conditions. The contribution of this work is: (i) the examination of the portability of ODIN [4] and the confidence learning approach from [5], when using graph-structured input data in an action recognition task. As a baseline, the softmax output comparison proposed in [2] is used. Additionally, (ii) a Metric Learning-based approach for detecting OoD-samples is developed. (iii) To ensure a controlled and repeatable evaluation environment, a novel semi-synthetic action recognition dataset is also introduced.

In the following section an overview of related work on both graph-based structured action recognition and OoD-detection is given. The baseline method and the examined methods are explained in section III. The semi-synthetic dataset and the quantitative evaluation are presented in section IV.

Fig. 1. The definition of OoD-data is mandatory. Values in the range are explainable by the given distribution but significantly less common than values in the range

II. RELATED WORK

There is a large body of related work in both action recognition and outlier detection. We focus on skeleton-based action recognition as well as deep neural network outlier detection approaches. Additional information regarding action recognition can be found in the surveys [12]–[14]. A good overview of outlier detection is given by [15]–[17].

A. Skeleton-based Action Recognition

Recognizing actions based on image data is one way to solve action recognition tasks. Another strategy uses skeleton data which can be extracted by a 2D or 3D pose estimator such as Stacked Hourglass Networks [18], PersonLab [9], or OpenPose [8]. The extracted landmarks can be seen as human joints and form the nodes of a skeleton graph (Figure 2). Based upon a time series of this graph input data, there are several ways to recognize an action.

A common approach is the analysis and classification of hand-crafted features using Hidden Markov Models [10], Support Vector Machines [19], or k-Nearest-Neighbor classifiers [20]. Deep learning models [21]–[24] are trained in an end-to-end manner and do not rely on hand-crafted features.

In [21], the skeleton graph is divided into five parts according to the human physical structure. These five parts are then fed separately into five bidirectional recurrent subnets (BRNN). The outputs of the subnets are successively fused to form the input of higher BRNN layers and finally the input of the classification layer.

Part-aware LSTM networks are introduced in [22]. A part-aware LSTM splits the entire motion of the human body into multiple part-based LSTM cells. To keep the context of each body part separated from one another, each cell has its individual input, forget, and modulation gates. Only the output gate is shared among all body parts.

An approach using spatial temporal graph convolutional networks (ST-GCN) is given by [23]. The skeleton sequence is interpreted as a graph in such a way that, in each frame, the naturally connected joints of the human body are connected by an edge. To include the temporal domain, the same joints in consecutive frames share an edge. The resulting graph is then propagated through the proposed graph convolution network, whose output forms the input of the final classification layer.
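The graph construction described above can be illustrated with a minimal sketch. The function name `build_st_edges`, the node numbering scheme, and the toy three-joint skeleton are assumptions made for illustration; they are not taken from [23].

```python
def build_st_edges(spatial_edges, num_joints, num_frames):
    """Build the edge list of a spatial-temporal skeleton graph.

    Assumed node numbering (hypothetical): node id = frame * num_joints + joint.
    """
    edges = []
    for t in range(num_frames):
        offset = t * num_joints
        # Spatial edges: naturally connected joints within one frame.
        edges += [(offset + a, offset + b) for a, b in spatial_edges]
        # Temporal edges: the same joint in two consecutive frames.
        if t + 1 < num_frames:
            edges += [(offset + j, offset + num_joints + j) for j in range(num_joints)]
    return edges


# Toy skeleton with three joints in a chain, observed over two frames.
toy_edges = build_st_edges([(0, 1), (1, 2)], num_joints=3, num_frames=2)
print(len(toy_edges))
```

The sketch only enumerates edges; the graph convolution itself and the partitioning strategies of ST-GCN are not covered.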

Spatial reasoning and temporal stack learning networks are used in [24]. While the former models high-level spatial structural information within each frame, the latter is responsible for generating detailed temporal dynamics. A spatial reasoning network encodes the coordinate vector of each body part and feeds them into a residual graph neural network, which models the structural relationship between body parts. Those relationships are then analyzed in the temporal stack learning network, which stacks previous high-level features to generate even more high-level features. Based on the most high-level features, the system classifies an action.

B. OoD-Detection

Since detecting OoD-samples is an established topic, there are numerous detection methods which [15] categorizes into statistical [25]–[31], machine learning [32], [33], and neural network [2], [4]–[7], [34], [35] based methods.

Fig. 2. Available skeleton data in the dataset: 18 node ground truth data (a) and 25 node OpenPose generated data (b).

Assuming normally distributed data, [25] calculates the mean and standard deviation of an attribute among all given data. An outlier is present if the difference between the query value and the mean, divided by the standard deviation, exceeds the critical value belonging to a predefined significance level.
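A simplified version of such a statistical check is sketched below. It is a plain z-score test, not the full procedure of [25]; the function names and the `critical_value` parameter are assumptions, and in practice the critical value would be looked up for the chosen significance level and sample size.

```python
import statistics


def z_scores(values):
    """Distance of each value from the mean, in units of the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def flag_outliers(values, critical_value):
    """Mark every value whose absolute z-score exceeds the (assumed) critical value."""
    return [abs(z) > critical_value for z in z_scores(values)]


print(flag_outliers([1.0, 2.0, 3.0, 4.0, 5.0, 100.0], critical_value=2.0))
```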

To be able to handle multivariate data, [27] uses the Mahalanobis distance to handle possible inter-attribute dependencies. The outlier detection is then performed by generating a boxplot based upon the calculated distance.

A biologically inspired method to detect novelty is presented by [30] and uses an ensemble of simple detectors. Each detector checks the given data against its own definition of normality. If a detector detects an abnormal state, novelty has been detected.

To detect inliers, [31] uses a Support Vector Machine where the decision boundary is given by the sphere with minimal volume containing all data.

A method based on decision trees is presented by [32], where a decision tree is repeatedly constructed and pruned. After each pruning step, incorrectly classified samples are removed from the training set and are marked as outliers.

Some early neural network based methods are given by [34] and [35]. The former takes advantage of the fact that a multilayer perceptron (MLP) works well for interpolating but poorly for extrapolating data. More precisely, the MLP models the unconditional probability density of the input data used during training [34]. The latter trains an autoencoder on the training data. If the system is not able to sufficiently reconstruct the input during the test phase, the input is marked as OoD [35].

A more recent neural network method is given by [2], who propose to check the maximum value of the softmax output of a classifying neural network against a predefined threshold. If the maximum is below the threshold, the system marks the input as an outlier. The authors mention that this method can be considered as a baseline, as it is the most naïve way to decide whether an in- or an outlier is present.

The method proposed in [4] can be seen as an extension of the baseline method above. It only differs in the use of the tempered softmax [36] during the test phase. Since the network is trained with the default softmax, the tempered softmax (parameterized with high temperatures) forces the network to be sure of its decisions during the test phase.

In [5], a confidence estimation branch is appended to the network. This branch enables the network to directly estimate a degree of confidence instead of just classifying the input as in- or out-of-distribution. During the training, an additional confidence loss is added to prevent the network from being doubtful. The trained network is then able to provide an additional confidence output for a given input.

Another method modifying the basic network is given by [7]. Instead of a confidence branch, the presented extension maps the basic output onto a manifold and makes it possible to use the Euclidean distance as a measure of out-of-distributionness.

III. OUT-OF-DISTRIBUTION DETECTORS

Our proposed method is inspired by the metric learning [7] and confidence learning [5] methods and tries to enable the network to estimate the local density around a sample in the estimated manifold.

We start with a definition of OoD-samples: A naïve definition is that OoD-samples are not explainable by an underlying learned distribution. As shown in Figure 1, this interpretation is problematic. Even if the values between -47 and 53 are explainable by the given distribution, the likelihood of having a sample in this range is negligibly small. Therefore, this work requires an in-distribution sample to be significantly explainable by the learned distribution.

For OoD-samples, [7] distinguishes between novelty and anomaly based OoD-samples. While the former describes samples sharing some common space with the trained distribution, the latter includes samples that are unrelated to the trained distribution. Credit card fraud, terrorist activities, and system failures are prominent examples of anomalies of high interest [37]. A third category ignored by [7] consists of plain outliers that are neither part of a new class nor part of an anomaly. They simply lie on or beyond the decision borders for their classes. This can be the result of bad training data or insufficient training.

The experimental setup of this work can be seen as a novelty detection problem: A predetermined single class is excluded during the training and only present during the test phase. The predetermined class can be seen as the OoD-class and should be rejected by the system.

A. Baseline

Fig. 3. Tempered softmax applied to the same input with different values for T. The higher the temperature parameter, the more equally distributed is the output. Note the range of the y-axis.

The approach presented in [2] is used as a baseline. Given a pre-trained classifier which uses a softmax output layer, the proposed OoD-detector simply checks the maximum softmax output against a predefined threshold. If the maximum is greater than the threshold, the system continues its classification task. Otherwise the input is marked as OoD and hence rejected. The threshold value is determined in such a way that an error of 5% is allowed. Therefore, the true positive rate is fixed at 95%. Except for the threshold determination, this method is one of the most naïve ways of handling the detection of OoD-samples.
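The baseline detector can be sketched in a few lines. The helper names and the way the threshold is picked from a list of in-distribution scores are assumptions for illustration; the source only states that the threshold is chosen so that the true positive rate is 95%.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_at_tpr(in_dist_scores, tpr=0.95):
    """Largest threshold that still accepts at least `tpr` of the in-distribution scores."""
    ranked = sorted(in_dist_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of samples that must be accepted
    return ranked[k - 1]


def is_ood(logits, threshold):
    """Reject the input when its maximum softmax probability is below the threshold."""
    return max(softmax(logits)) < threshold
```

In this sketch a sample is accepted when its score is greater than or equal to the threshold, so the sample that defines the threshold is itself accepted.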

B. Out-of-DIstribution detector for Neural networks

ODIN [4] can be seen as an extension of the baseline approach. Inspired by [36], the tempered softmax

$$S_i(x; T) = \frac{\exp\left(f_i(x)/T\right)}{\sum_{j \in C} \exp\left(f_j(x)/T\right)}$$

is used during the test phase, where $f_i(x)$ denotes the network output for class $i$ before the softmax. The higher the temperature parameter T, the more equally distributed is its output among all available classes C (see Figure 3). As a result, a high temperature parameter during the test phase forces the network to be confident in its classification decision. If it is not, the maximum softmax output is suppressed by the resulting almost equally distributed class probabilities. Both ODIN and the baseline have the advantage that no further changes to the network architecture are required.
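The effect of the temperature parameter can be reproduced with a short sketch. The function name `tempered_softmax` and the example logits are assumptions chosen for illustration.

```python
import math


def tempered_softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature; temperature = 1 is the default softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # hypothetical network outputs for three classes
for t in (1.0, 10.0, 1000.0):
    print(t, max(tempered_softmax(logits, t)))
```

With rising temperature the maximum probability approaches the uniform value of one over the number of classes, which is why a fixed threshold becomes harder to exceed.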

C. Learning Confidence Approach

In contrast to the methods mentioned above, the confidence learning approach presented in [5] changes the underlying network architecture by adding a confidence branch. The branch enables the network to output a degree of confidence for a given input instead of just declaring an input sample as in- or out-of-distribution. To be able to estimate the confidence inside this branch, the training procedure is changed as follows: The classification output o is interpolated with the one-hot encoded ground truth y,

$$o' = c \cdot o + (1 - c) \cdot y,$$

where the degree of interpolation $c$ is the confidence of the network for the given input. In order to prevent the network from always stating a low confidence and thereby obtaining a low classification loss, a weighted confidence loss

$$\lambda \mathcal{L}_c, \qquad \mathcal{L}_c = -\log(c),$$

is added to the classification loss. The weight $\lambda$ of the confidence loss is governed by a budget parameter $\beta$ and is adjusted whenever the weights are updated: If the confidence loss is greater than $\beta$, $\lambda$ is increased and the system is punished more for low confidences. Otherwise $\lambda$ is decreased and the system is punished less as a result of having a high confidence.
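The training signal described above can be sketched for a single sample as follows. The function names, the small constant `eps`, and the multiplicative `step` used to adjust the weight are assumptions for illustration and are not taken from the source.

```python
import math


def confidence_loss_terms(probs, confidence, target_index, eps=1e-12):
    """Return (classification loss, confidence loss) for one sample.

    The prediction is interpolated with the one-hot ground truth, weighted by
    the confidence, before the negative log-likelihood is computed.
    """
    one_hot = [1.0 if i == target_index else 0.0 for i in range(len(probs))]
    mixed = [confidence * p + (1.0 - confidence) * y for p, y in zip(probs, one_hot)]
    task_loss = -math.log(mixed[target_index] + eps)
    conf_loss = -math.log(confidence + eps)
    return task_loss, conf_loss


def update_weight(weight, conf_loss, budget=0.3, step=0.01):
    """Raise the confidence-loss weight above the budget, lower it otherwise (step is hypothetical)."""
    if conf_loss > budget:
        return weight / (1.0 - step)
    return weight / (1.0 + step)
```

A confidence of zero makes the classification loss vanish but is punished heavily by the confidence loss, which is the trade-off the budget parameter balances.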

D. Metric Learning-based Approach

Like the confidence learning method, the approach based on Metric Learning changes the underlying network. More precisely, a Metric Learning layer (f(x) in Figure 4) is inserted between the base network and the classification layer. Additionally, a branch for learning the confidence by approximating either the density or the entropy of the learned manifold is appended. The training is divided into two phases. First, the Metric Learning layer is trained with the contrastive loss [38]. Afterwards, the classification and confidence branches are trained on the resulting embeddings. The classification branch is trained in a straightforward manner by propagating the embeddings through a residual layer followed by a softmax activation. In contrast, the confidence approximation is more involved.
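For orientation, a pairwise contrastive loss can be sketched as below. This assumes the commonly used formulation with a margin, in which pairs of the same class are pulled together and pairs of different classes are pushed apart until they are at least `margin` away; whether [38] is used in exactly this form here is an assumption, as are the function name and the default margin.

```python
import math


def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise contrastive loss on two embeddings (assumed standard margin formulation)."""
    d = math.dist(emb_a, emb_b)  # Euclidean distance between the embeddings
    if same_class:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


print(contrastive_loss((0.0, 0.0), (0.0, 0.5), same_class=False))
```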

Both the density and the entropy approximation use the local neighborhood

$$N(x) = \{\, x' \in B \setminus \{x\} \;:\; d(f(x), f(x')) < m \,\}$$

of an embedded sample x in a batch B to calculate the appropriate score. The neighborhood is given by all other samples in the batch whose (Euclidean) distance d(., .) to the corresponding embedding is lower than a predefined margin m.

1) Density Approximation: One of the simplest ways of determining how dense the area around a given sample x is, is to calculate the local density

$$\rho(x) = \frac{|N(x)|}{|B|}$$

of the neighborhood, where $|N(x)|$ is the number of samples in the neighborhood of x and the normalization factor $|B|$ is directly given by the batch size. Using this value as the ground truth during the confidence training, the network learns to approximate the density but still lacks information about the purity of the area. In an unclear decision region, the network should be able to give additional information, especially in terms of decision confidence. The entropy approach tries to fix this issue.
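The neighborhood and the density target can be sketched as follows, assuming that the density is the number of neighbors divided by the batch size. The function names and the toy embeddings are hypothetical.

```python
import math


def neighborhood(embeddings, index, margin):
    """Indices of all other batch samples closer than `margin` to the sample at `index`."""
    anchor = embeddings[index]
    return [
        j for j, emb in enumerate(embeddings)
        if j != index and math.dist(anchor, emb) < margin
    ]


def local_density(embeddings, index, margin):
    """Neighborhood size normalized by the batch size (assumed form of the density target)."""
    return len(neighborhood(embeddings, index, margin)) / len(embeddings)


batch = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]  # toy 2D embeddings
print(local_density(batch, 0, margin=1.0), local_density(batch, 3, margin=1.0))
```

An isolated sample such as the last one in the toy batch receives a density of zero, which corresponds to the empty neighborhoods that the entropy variant has to handle with a weighting term.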

2) Entropy Approximation: Instead of simply calculating the density, the entropy (Equation 6) or Gini impurity (Equation 7) provides information about the purity of the local neighborhood.

where C is the set of all available classes. The approach is similar to the density approximation but requires a few tweaks in the ground truth calculation. Since the Gini impurity (the entropy) reaches its minimum (maximum) if all samples belong to the same class, a weighting needs to take care of empty neighborhoods. This refers to neighborhoods consisting only of the processed sample itself. The weighting term for a sample x is therefore given by

and weights neighborhoods according to their size. After applying the weighting term, the resulting loss for the Gini impurity and entropy approximation is

$$\mathcal{L}_{conf} = \frac{1}{|B|} \sum_{x \in B} \left| g(x) - \hat{c}(x) \right|,$$

which is the mean of the absolute differences between the calculated ground truth values $g(x)$ for the batch B and the network's confidence output $\hat{c}(x)$.
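The purity measures and the loss can be sketched as below. The sketch uses the textbook definitions of Gini impurity and Shannon entropy over the class labels of a neighborhood; the exact definitions and the weighting term used in this work are not reproduced, and all function names are hypothetical.

```python
import math
from collections import Counter


def class_fractions(labels):
    """Fraction of neighborhood samples per class; the neighborhood must not be empty."""
    if not labels:
        raise ValueError("empty neighborhood: needs the separate weighting term")
    counts = Counter(labels)
    return [n / len(labels) for n in counts.values()]


def gini_impurity(labels):
    """Textbook Gini impurity: 0 for a pure neighborhood."""
    return 1.0 - sum(p * p for p in class_fractions(labels))


def shannon_entropy(labels):
    """Textbook Shannon entropy (natural logarithm): 0 for a pure neighborhood."""
    return -sum(p * math.log(p) for p in class_fractions(labels))


def l1_confidence_loss(targets, predictions):
    """Mean absolute difference between ground truth values and confidence outputs."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)
```

For example, a neighborhood with labels `["walk", "walk", "run", "run"]` yields a Gini impurity of 0.5, while a neighborhood containing only `"walk"` yields 0.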

Fig. 4. Metric Learning-based approach to detect in- and out-of-distribution samples. The base network is extended by a Metric Learning layer (f(x)) as well as a confidence layer. The classification layer contains a residual block and is activated with a softmax.

IV. EVALUATION

The following section first describes the pipeline and introduces the novel semi-synthetic dataset. Afterwards the evaluation metrics are explained. Finally, the results are presented in a quantitative way.

A. Pipeline

As already mentioned, this work does not focus on the application of OoD-detector methods in the image domain. Instead, the applicability to graph-based data is examined. As an example, the problem of action recognition is analyzed where the input data is given in the form of a sequence of graph skeleton data. Figure 5 shows the basic pipeline, which is similar to the one presented in [23]. Given a video input, single frames are extracted and analyzed by a 2D pose estimator (e.g. OpenPose [8]). The resulting sequence of skeleton data is then propagated through a graph CNN (e.g. ST-GCN [23]), resulting in a regularized high-level representation of the input data. Based on this extracted high-level representation, the OoD-detectors are examined and the classification is done.

Fig. 5. Basic pipeline for all experiments. First, the video input is divided into single frames. Those frames are then analyzed by a 2D pose estimator (e.g. OpenPose). The resulting skeleton sequences are then propagated through a graph CNN (e.g. ST-GCN) and finally analyzed by an OoD-detector.

B. Semi-synthetic Dataset

To obtain reproducible results, a novel semi-synthetic dataset is used. The dataset provides a controllable environment and is based on skeleton data of the CMU Graphics Lab Motion Capture Database [39]. This skeleton data is used to animate a human 3D model [40]. The resulting sequences are rendered with Blender [41] from 144 different camera settings (Figure 7). This can be seen as data augmentation and enables a scale and viewpoint invariance of the network [42]. Each rendered RGB image has a resolution of px, depth data in the form of a corresponding px 16-bit grayscale image (Figure 6) and an 18-node ground truth skeleton (Figure 2). Currently, there are 32 different classes of actions in 109 sequences.

To verify the results and to be able to check on interdataset OoD-samples, the NTU RGB+D [22] dataset is used additionally. It contains 60 action classes, presented in RGB videos with a resolution of px each, recorded from three different viewpoints. Compared to the short basic actions of the novel synthetic dataset, the NTU RGB+D dataset contains more complex actions in which several persons may be involved.

C. Metrics

There are four established metrics used for the comparison of the different approaches [2], [4], [5]. The first one is the false positive rate (FPR) when the true positive rate (TPR) is fixed at 95% (FPR 95). The second one is the detection error at the same fixed true positive rate. The area under the receiver-operator characteristic (AUROC) and the area under the precision-recall curve (AUPR) are the last two.

FPR 95: The FPR 95 measures the false positive rate when the true positive rate is fixed at 95%.

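Detection Error: The detection error measures the misclassification probability when a fixed true positive rate of 95% is given. More precisely, assuming as in [4] that in- and out-of-distribution samples are equally likely, the error is given by:

$$P_{err} = 0.5\,(1 - \mathrm{TPR}) + 0.5\,\mathrm{FPR}$$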

AUROC: The receiver-operator characteristic compares the true positive rate of a classifier with the corresponding false positive rate. The area under the receiver-operator characteristic is a threshold independent metric, measuring the overall performance of a classifier.

AUPR: Another threshold independent metric is the area under the precision-recall curve. Unlike the AUROC, the AUPR is more sensitive to imbalanced datasets, which is a desirable feature when examining OoD-detectors. Since the inliers and outliers can both be handled as positives in the AUPR calculation, an AUPR-IN and an AUPR-OUT score are given respectively.
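Three of the metrics can be sketched on two lists of detector scores. The sketch assumes that higher scores indicate in-distribution samples, that in-distribution samples are the positives, and that both groups are equally likely in the detection error; the function names are hypothetical.

```python
import math


def fpr_at_tpr(in_scores, out_scores, tpr=0.95):
    """False positive rate and achieved true positive rate at the chosen operating point."""
    ranked = sorted(in_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # positives that must be accepted
    threshold = ranked[k - 1]
    achieved_tpr = sum(s >= threshold for s in in_scores) / len(in_scores)
    fpr = sum(s >= threshold for s in out_scores) / len(out_scores)
    return fpr, achieved_tpr


def detection_error(tpr, fpr):
    """Misclassification probability for equally likely in- and out-of-distribution samples."""
    return 0.5 * (1.0 - tpr) + 0.5 * fpr


def auroc(in_scores, out_scores):
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    wins = 0.0
    for a in in_scores:
        for b in out_scores:
            wins += 1.0 if a > b else 0.5 if a == b else 0.0
    return wins / (len(in_scores) * len(out_scores))
```

The pairwise AUROC computation is quadratic in the number of samples and is meant for illustration only; the AUPR variants are omitted here.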

D. Experimental Setup

This work distinguishes between an interclass and interdataset OoD-detection. For the interclass case, only the semi-synthetic dataset is taken into account. For each of the 32 different classes and each detector, a network is trained. In each training, a single class represents the OoD-class and is excluded from the training whilst the other 31 classes are in-distribution classes and included in the training. Therefore this problem can be labelled as novelty detection. For the interdataset case, the trained networks from the interclass OoD-detection were investigated with regard to how well they distinguish between the 31 semi-synthetic (inlier) classes and the NTU RGB+D plus the selected semi-synthetic (outlier) classes.

The data is split by stratified cross-validation into a test and a training set in a ratio of 1:4. As data augmentation, the skeleton graphs are modified by the following pipeline: First, a random start and end point of a sequence is chosen such that a sequence length of 20 frames is guaranteed. Then Gaussian noise is added to the node values. After this, there is a 50% chance that nodes will be set to zero (a kind of dropout) and a 50% chance that a vertical and horizontal mirroring is also applied. The noise as well as the application of dropout and mirroring is fixed for a whole sequence.
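The augmentation pipeline can be sketched as follows. Several details are assumptions because the source does not specify them: the noise standard deviation `noise_std`, the per-node dropout probability `drop_prob`, the interpretation that one noise offset per node is reused in every frame, and mirroring implemented as a sign flip of coordinates that are centered around the origin.

```python
import random


def augment(sequence, length=20, noise_std=0.01, drop_prob=0.1, rng=random):
    """Augment a skeleton sequence given as a list of frames of (x, y) node coordinates."""
    # Random crop to a fixed number of frames.
    start = rng.randint(0, len(sequence) - length)
    clip = sequence[start:start + length]
    num_nodes = len(clip[0])
    # All random decisions are drawn once and kept fixed for the whole sequence.
    noise = [(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std)) for _ in range(num_nodes)]
    dropped = set()
    if rng.random() < 0.5:  # 50% chance that node dropout is applied at all
        dropped = {n for n in range(num_nodes) if rng.random() < drop_prob}
    mirror = rng.random() < 0.5  # 50% chance of vertical and horizontal mirroring
    augmented = []
    for frame in clip:
        new_frame = []
        for n, (x, y) in enumerate(frame):
            if n in dropped:
                new_frame.append((0.0, 0.0))
                continue
            x, y = x + noise[n][0], y + noise[n][1]
            if mirror:
                x, y = -x, -y
            new_frame.append((x, y))
        augmented.append(new_frame)
    return augmented
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation repeatable, for example `augment(sequence, rng=random.Random(0))`.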

The human skeleton graphs are extracted by OpenPose using the default settings. ST-GCN with the spatial configuration partitioning strategy has been chosen for the analysis of the resulting graphs [23]. The ST-GCN networks are all initialized with random values. Adam was used as optimizer with its default parameters except for the initial learning rate.

For the evaluation, the procedure described in [2] has been followed. First the test set is separated into correctly and incorrectly classified examples. From the two resulting groups, the AUROC and AUPR scores are calculated. Afterwards, the confidence threshold is estimated in such a way that the true positive rate of the correctly classified examples drops to 95%. Based on this threshold, the FPR and the detection error are calculated. Since there are 32 different classes and therefore 32 trained networks for a given method, the results are averaged and the corresponding standard deviation is given.

Fig. 6. Example image of a sequence of the semi-synthetic dataset: The same image as (a) RGB image, (b) depth image and (c) RGB image with a modified background.

Fig. 7. Camera positions during the rendering. All cameras (blue dots and green triangles) are directed towards the scene center (black cross). The blue cameras have a distance of 5m to the scene center while the green ones have a distance of 10m.

1) Baseline and ODIN: The baseline method and ODIN do not require a modification of the existing network, so they can easily be examined without retraining. However, in order to be able to compare these straightforward approaches without interference from the more sophisticated architectures, they are analyzed with the base architecture. The networks are trained over 100 epochs with a batch size of 512. The initial learning rate is set to 0.001 and is reduced every 40 epochs by a factor of 10.

2) Learning Confidence: Like the previous networks, the learning confidence system is also trained over 100 epochs. The batch size is set to 1024 and the initial learning rate is 0.001, which is reduced every 30 epochs by a factor of 10. The budget parameter is set to 0.3.

3) Metric Learning: The training of the Metric Learning approach is divided into three parts. First the Metric Learning layer is trained to get a good representation of the data in the manifold. Based on the embedding, the classifier and the OoD-detector are then trained, while the weights of the Metric Learning layer are held fixed.

The Metric Learning layer is trained over 200 epochs with a batch size of 512. The learning rate also starts at 0.001 and is reduced every 80 epochs by a factor of 10. The layer maps the input onto a 256-dimensional output. The classification layer and the confidence layer are then both separately trained over 50 epochs with an initial learning rate of 0.0001 and a reduction every 20 epochs by a factor of 10.
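The stepwise learning rate schedules used in the three setups can be written as one small function. The function name and the dictionary keys are hypothetical; the numbers are the ones stated in the text.

```python
def step_lr(epoch, initial_lr, step_size, factor=10.0):
    """Learning rate after dividing by `factor` once every `step_size` epochs."""
    return initial_lr / (factor ** (epoch // step_size))


# Schedules as described in the experimental setup.
schedules = {
    "baseline_odin": dict(initial_lr=0.001, step_size=40),
    "learning_confidence": dict(initial_lr=0.001, step_size=30),
    "metric_layer": dict(initial_lr=0.001, step_size=80),
    "metric_heads": dict(initial_lr=0.0001, step_size=20),
}

for name, cfg in schedules.items():
    print(name, [step_lr(e, **cfg) for e in (0, 40, 80)])
```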

E. Results

In the following, the results are presented in quantitative terms. To give a hint where the value 0.95 resides, each of the following plots contains a dotted red line. The temperature parameter is displayed logarithmically on the x-axis. Table I and Table II provide results of the intraclass and intradataset evaluation. The row of the ODIN method in both tables contains the values of the ODIN parameterization with the lowest valid FPR score.

1) Baseline and ODIN: The results for the baseline as well as ODIN are shown in Figure 8. The parameter T = 1 equals the baseline. In the intraclass OoD-detection, the FPR reaches its minimum at a temperature parameter of T = 1.6. For temperature parameter values above 100, the required TPR of 95% cannot be guaranteed, so these values are not taken into account. Compared to the results in [4], the curve has an unusual course for a rising temperature parameter. The curve of the intradataset case, on the other hand, shows a similar course to the results in [4].

2) Learning Confidence: Figure 9 shows the results of the confidence learning approach. The intradataset ODIN curve differs strongly from the one depicted in Figure 8. It should also be noted that for the intradataset case the method performs significantly worse than ODIN, even if ODIN operates on the modified network architecture. Another notable problem is that the average accuracy, which is not included in the plots, has dropped from (base architecture) down to 0.09 (learning confidence).

3) Metric Learning: The density approximating as well as the entropy approximating Metric Learning approaches were investigated and result in similar plots (Figure 10, Figure 11) for the intraclass and intradataset comparison. Therefore the following analysis can be applied to both of them. The ODIN intraclass FPR curve has some similarities with the curve of the base architecture. Nonetheless, the TPR curve drops much more sharply at T = 100 than in the base architecture. In terms of the intraclass task, both Metric Learning approaches perform worse than the learning confidence approach or ODIN. In terms of the intradataset task, they perform slightly better than the learning confidence approach but also have a higher variance. The average accuracies of the density and entropy classifiers are and and therefore also slightly better than the learning confidence ones ().

Fig. 8. ODIN Results: TPR and FPR for different temperature parameters. At T = 100, the TPR is no longer fixed at 95%, which leads to a better FPR but also allows more errors.

Fig. 9. Learning Confidence Results: It is noticeable that the FPR as well as the TPR increase for T = 2000, which is a strange behavior compared to the other plots.

V. CONCLUSION

As the evaluation shows, OoD-methods can successfully be applied to graph-based data, but their behavior differs from that on image-based data. Experiments showed that the OoD-detector method ODIN outperforms the more sophisticated learning confidence and metric learning based methods. Despite the modification of the network architecture made by these two methods, ODIN is superior when used with the modified architecture. As shown in Table I and Table II, the modified network architectures even have a negative impact on the classification accuracy (ACC). This is of particular interest as the learning confidence method outperforms ODIN in the original paper in nearly every case. Another interesting observation is that the learning confidence method performs better in the intraclass task than in the intradataset task.

Fig. 10. Metric Learning Results: Density approximation. The intraclass and intradataset FPR values for the density approach differ primarily in the standard deviation.

Fig. 11. Metric Learning Results: Entropy approximation. As with the density approximation, the FPR values of the entropy approach differ primarily in the standard deviation.

TABLE I INTRACLASS COMPARISON: TRAINING AND TEST ON THE SAME DATASET. A SINGLE CLASS IS EXCLUDED FROM THE TRAINING AND SERVES AS OOD-CLASS IN THE TEST PHASE.

TABLE II INTRADATASET COMPARISON: TRAINING ON THE SEMI-SYNTHETIC DATASET, TEST ON THE NTU RGB+D DATASET.

In this paper we have shown with our novel semi-synthetic dataset that, among the examined OoD-methods, ODIN currently performs best on graph-based data.

Our presented metric learning based method embeds the high-level features into a manifold and learns to estimate the density or entropy of the local neighborhood of an embedded sample. Since it is crucial to find a good embedding, other embedding methods than the contrastive loss could be investigated in future work.

ACKNOWLEDGMENT

This work was developed in the Fraunhofer Cluster of Excellence “Cognitive Internet Technologies”.

REFERENCES

[1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” ICLR, 2014.

[2] D. Hendrycks and K. Gimpel, “A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks,” ICLR, 2017.

[3] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in CVPR, 2015, pp. 427–436.

[4] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” ICLR, 2018.

[5] T. DeVries and G. W. Taylor, “Learning Confidence for Out-of-Distribution Detection in Neural Networks,” arXiv preprint arXiv:1802.04865, 2018.

[6] M. Kliger and S. Fleishman, “Novelty Detection with GAN,” arXiv preprint arXiv:1802.10560, 2018.

[7] M. Masana, I. Ruiz, J. Serrat, J. van de Weijer, and A. M. Lopez, “Metric Learning for Novelty and Anomaly Detection,” BMVC, 2018.

[8] Z. Cao, T. Simon, S. E. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using part affinity fields,” in CVPR, 2017, pp. 1302–1310.

[9] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, “Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model,” in ECCV, 2018.

[10] G. T. Papadopoulos, A. Axenopoulos, and P. Daras, “Real-time skeleton-tracking-based human action recognition using kinect data,” in MMM, 2014, pp. 473–483.

[11] M. Zhao, T. Li, M. Abu Alsheikh, Y. Tian, H. Zhao, A. Torralba, and D. Katabi, “Through-wall human pose estimation using radio signals,” in CVPR, 2018.

[12] P. Turaga, R. Chellappa, V. S. Subrahmanian, and O. Udrea, “Machine recognition of human activities: A survey,” TCSVT, vol. 18, no. 11, pp. 1473–1488, 2008.

[13] R. Poppe, “A survey on vision-based human action recognition,” Image and Vision Computing, vol. 28, no. 6, pp. 976–990, 2010.

[14] Y. Kong and Y. Fu, “Human Action Recognition and Prediction: A Survey,” arXiv preprint arXiv:1806.11230, 2018.

[15] V. J. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.

[16] S. V. Bhosale, “Holy Grail of Outlier Detection Technique: A Macro Level Take on the State of the Art,” IJCSIT, 2014.

[17] A. Zimek and P. Filzmoser, “There and back again: Outlier detection between statistical reasoning and data mining algorithms,” 2018.

[18] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in ECCV. Springer, 2016, pp. 483–499.

[19] T. Kerola, N. Inoue, and K. Shinoda, “Spectral graph skeletons for 3D action recognition,” in ACCV, 2014, pp. 417–432.

[20] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, “3-D Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold,” IEEE Transactions on Cybernetics, vol. 45, no. 7, pp. 1340–1352, 2015.

[21] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in CVPR, 2015.

[22] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A large scale dataset for 3D human activity analysis,” in CVPR, 2016.

[23] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” AAAI, pp. 7444–7452, 2018.

[24] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan, “Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning,” in Lecture Notes in Computer Science, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., vol. 11205 LNCS, Cham, 2018, pp. 106–121.

[25] F. E. Grubbs, “Procedures for Detecting Outlying Observations in Samples,” Technometrics, vol. 11, no. 1, pp. 1–21, 1969.

[26] E. M. Knox and R. T. Ng, “Algorithms for mining distance-based outliers in large datasets,” in VLDB. Citeseer, 1998, pp. 392–403.

[27] J. Laurikkala, M. Juhola, and E. Kentala, “Informal Identification of Outliers in Medical Data,” IDAMAP, vol. 1, pp. 20–24, 2000.

[28] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang, “Topic Detection and Tracking Pilot Study,” Topic Detection and Tracking Workshop Report, 2001.

[29] A. H. Seheult, P. J. Green, P. J. Rousseeuw, and A. M. Leroy, “Robust Regression and Outlier Detection,” Journal of the Royal Statistical Society. Series A (Statistics in Society), vol. 152, no. 1, p. 133, 1989.

[30] D. Dasgupta and S. Forrest, “Novelty detection in time series data using ideas from immunology,” The 8th International Conference on Intelligent Systems, pp. 82–87, 1999.

[31] D. Tax, A. Ypma, and R. Duin, “Support vector data description applied to machine vibration analysis,” Proc. 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 15–23, 1999.

[32] G. H. John, “Robust Decision Trees: Removing Outliers from Databases,” KDD, pp. 174–179, 1995.

[33] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in KDD, vol. 96, 1996, pp. 226–231.

[34] C. M. Bishop, “Novelty detection and neural network validation,” IEE Proceedings: Vision, Image and Signal Processing, vol. 141, no. 4, pp. 217–222, 1994.

[35] N. Japkowicz, C. Myers, and M. Gluck, “A Novelty Detection Approach to Classification,” in IJCAI, Montreal, 1995, pp. 518–523.

[36] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015.

[37] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection,” ACM Computing Surveys, vol. 41, no. 3, pp. 1–58, 2009.

[38] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in CVPR, vol. 2. IEEE, 2006, pp. 1735–1742.

[39] Carnegie Mellon Graphics Lab, “CMU Graphics Lab Motion Capture Database.”

[40] makehumancommunity, “www.makehumancommunity.org.”

[41] Blender Online Community, “Blender - a 3D modelling and rendering package,” Amsterdam, 2019.

[42] J. Bayer, D. Muench, and M. Arens, “Viewpoint independency for skeleton based human action recognition,” Fraunhofer IOSB, Ettlingen, Tech. Rep., 2020.