An Interval-Based Bayesian Generative Model for Human Complex Activity Recognition

2017·Arxiv

Abstract

Abstract

Complex activity recognition is challenging due to the inherent uncertainty and diversity of performing a complex activity. Normally, each instance of a complex activity has its own configuration of atomic actions and their temporal dependencies. We propose in this paper an atomic action-based Bayesian model that constructs Allen’s interval relation networks to characterize complex activities with structural varieties in a probabilistic generative way: By introducing latent variables from the Chinese restaurant process, our approach is able to capture all possible styles of a particular complex activity as a unique set of distributions over atomic actions and relations. We also show that local temporal dependencies can be retained and are globally consistent in the resulting interval network. Moreover, network structure can be learned from empirical data. A new dataset of complex hand activities has been constructed and made publicly available, which is much larger in size than any existing datasets. Empirical evaluations on benchmark datasets as well as our in-house dataset demonstrate the competitiveness of our approach.

Index Terms—complex activity recognition, Allen’s interval relation, Bayesian network, probabilistic generative model, Chinese restaurant process, temporal consistency, American Sign Language dataset.

I. INTRODUCTION

A complex activity consists of a set of temporally-composed events of atomic actions, which are the lowest-level events that can be directly detected from sensors. In other words, a complex activity is usually composed of multiple atomic actions occurring consecutively and concurrently over a duration of time. Modeling and recognizing complex activities remains an open research question as it faces several challenges: First, understanding complex activities calls for not only the inference of atomic actions, but also the interpretation of their rich temporal dependencies. Second, individuals often possess diverse styles of performing the same complex activity. As a result, a complex activity recognition model should be capable of capturing and propagating the underlying uncertainties over atomic actions and their temporal relationships. Third, a complex activity recognition model should also tolerate errors introduced from atomic action level, due to sensor noise or low-level prediction errors.

A. Related Work

Currently, a lot of research focuses on semantic-based complex activity modeling. Many semantic-based models such as context-free grammar (CFG) [26] and Markov logic network (MLN) [11], [18]) are used to represent complex activities, which can handle rich temporal relations. Yet formulae and their weights in these models (e.g. CFG grammars and MLN structures) need to be manually encoded, which could be rather difficult to scale up and is almost impossible for many practical scenarios where temporal relations among activities are intricate. Although a number of semantic-based approaches have been proposed for learning temporal relations, such as stochastic context-free grammars [29] and Inductive Logic Programming (ILP) [9], they can only learn formulas that are either true or false, but cannot learn their weights, which hinders them from handling uncertainty.

On the other hand, graphical models become increasingly popular for modeling complex activities because of their capability of managing uncertainties [31]. Unfortunately, most of them can handle three temporal relations only, i.e. equals, follows and precedes. Both Hidden Markov model (HMM) and conditional random field (CRF) are commonly used for recognizing sequential activities, but are limited in managing overlapping activities [13]. Many variants with complex structures have been proposed to capture more temporal relations among activities, such as interleaved hidden Markov models (IHMM) [20], skip-chain CRF [12] and so on. However, these models are time point-based, and hence with the growth of the number of concurrent activities they are highly computationally intensive [22]. Dynamic Bayesian network (DBN) can learn more temporal dependencies than HMM and CRF by adding activities’ duration states, but imposes more computational burden [21]. Moreover, the structures of these graphical models are usually manually specified instead of learned from the data. The interval temporal Bayesian network (ITBN) [31] differs significantly from the previous methods, as being a graphical model that first integrates interval-based Bayesian network with the 13 Allen’s relations. Nonetheless, ITBN has several significant drawbacks: First, its directed acyclic Bayesian structure makes it have to ignore some temporal relations to ensure a temporally consistent network. As such, it may result in loss of internal relations. Second, it would be rather computationally expensive to evaluate all possible consistent network structures, especially when the network size is large. Third, neither can ITBN manage the multiple occurrences of the same atomic action, nor can it handle arbitrary network size as it remains unchanged as the count of atomic action types. Figure 1 illustrates the graph structures of the three commonly-used graphical models. It is worth

Figure 1. The structures of three graphical models for complex activity recognition. (a)IHMM, where the observations of atomic actions (squareshaped nodes) and several chains of hidden states (round-shaped nodes) are used to handle overlapping; (b) DBN, where duration states and atomic action states are represented as chains of nodes; (c) ITBN, where any atomic action type (A1-A9) is represented by a node and the set of all possible interval relations between any pair of atomic action types Ai and A j is represented by a link Ii

noting that we will focus on complex activity recognition in this paper, and interested readers may consult the excellent reviews [1], [6], [14], [7], [4], [8] for further details regarding atomic-level action recognition.

B. Our approach

To address the problems in the previous models, we present an interval-based Bayesian generative network (IBGN) to explicitly model complex activities with inherent structural varieties, which is achieved by constructing probabilistic interval-based networks with temporal dependencies. In other words, our model considers a probabilistic generative process of constructing interval-based networks to characterize the complex activities of interests. Briefly speaking, a set of latent variables, called tables, which are generated from the Chinese restaurant process (CRP) [23] are introduced to construct the interval-based network structures of a complex activity. Each latent variable characterizes a unique style of this complex activity by containing its distinct set of atomic actions and their temporal dependencies based on Allen’s interval relations. There are two advantages to using CRP: It allows our model to describe a complex activity of arbitrary interval sizes and also to take into account multiple occurrences of the same atomic actions. We further introduce interval relation constraints that can guarantee the whole network generation process is globally temporally consistent without loss of internal relations. In addition, instead of manually specify a network to a fixed structure, the network structure in our approach is learned from training data. By learning network structures, our model is more effective than existing graphical models in characterizing the inherent structural variability in complex activities. A further comparison study is summarized in Table I, which also shows our main contributions.

Table I A SUMMARY COMPARISON OF GRAPHICAL MODEL-BASED APPROACHES FOR RECOGNIZING COMPLEX ACTIVITIES.

It is worth mentioning that in spite of the increasing need from diverse applications in the area of complex activity recognition, there are only a few publicly-available complex activity recognition datasets [25], [3], [15]. In particular, the number of instances are on the order of hundreds at most. This motivates us to propose a dedicated large-scale dataset on depth camera-based complex hand activity recognition. We have constructed a new dataset of complex hand activities which contains instances that are about an order of magnitude larger than the existing datasets. We have made the dataset and related tools made publicly available on a dedicated project website1 in support of the open-source research activities in this emerging research community.

II. DEFINITIONS AND PROBLEM FORMULATION

Assume we have at hand a dataset D of N instances from a set of L types of complex activities involving a set of M types of atomic actions A = {A1,A2,...,AM}. An atomic action interval (interval for short) is written by Ii = ai@, where tand trepresent the start and end time of the atomic action ai , respectively. Each complex activity instance is a sequence of k ordered intervals, i.e. I1,I2,...,Ik, such that if 1 k, then or and t. Seven Allen’s interval relations (relations for short) [2] are used to represent all possible temporal relationships between two intervals, denoted by , which is summarized below:

We define an interval-based network (network for short) G = (V,E) to represent a complex activity containing the temporal relationships between intervals. A node vi V represents the i-th interval (i.e. Ii) in a complex activity instance, while a directed link eiE represents the relation between two intervals (i.e. Ii and Ij), where 1 is the cardinality of the set V). Note a link always starts from a node with a smaller index than its arrival node. Each link ei

Figure 2. Two possible styles of the complex activity offensive play and its corresponding networks where the nodes represent the intervals of atomic actions and the links represent their temporal relations. Atomic actions: A1

involves one and only one relation ri. Figure 2 illustrates the corresponding networks of a complex activity.

The temporal relations on links shall be globally consistent in any interval-based network. Given any two relations rirjon links eiand e j, respectively, the relation rion link eimust follow the transitivity properties as shown in Figure 3. For example, suppose Ii meets Ij and Ij starts Ik, Ii is certainly meets Ik. However, if Ii starts Ij and Ij is finished by Ik, there are three possible relations between Ii and Ik, that is, before, meet and overlap. We formally use rirjto denote such composition operation following the transitivity properties. We say a network is consistent such that ririrjfor any triplet of links ei, ejand eiin the network.

Figure 3. The transitivity table for any interval relation rithrough composition operation ri

It is clear that a network can characterize one possible style of a complex activity with diverse combinations of atomic actions and their interval relations, as illustrated in Figure 2. From another point of view, a complex activity can be instantiated by sampling atomic actions and relations assigned to their associated nodes and links in a network following certain probabilities. In this way, a recognition model built on such networks is able to handle uncertainty in complex activity recognition. In addition, multiple occurrences of the same atomic action can appear in the same network but in different nodes (i.e. intervals). To this end, we present the probabilistic generative model IBGN where these networks can be constructed following the styles of the complex activities of interests under uncertainty. We shall also consider the IBGN model to construct networks with variable sizes of nodes and arbitrary structures. Note in our approach, for each type of complex activities we would learn one such dedicated IBGN model.

III. OUR IBGN MODEL

For any complex activity type l (1 L), denote the corresponding subset of Nl instances, where each element d is an instance of the l-th type of complex activity In IBGN, the generative process of constructing an interval-based network Gd = (Vd,Ed) for describing the observed instance d consists of two parts, node generation and link generation, which are described below.

A. Node Generation

In our IBGN model, we consider generating a network where each node is associated with an atomic action in a probabilistic manner. We also require our model to be capable of generating variable numbers of nodes and handling multiple occurrences of the same atomic action in our network, as summarized in Table I.

To accomplish these tasks, we first extend the process of the CRPs to make it available in our IBGN model. Originally, a CRP is analogous to a stochastic process of choosing tables for customers in a Chinese restaurant. In a nutshell, suppose there are an infinite number of tables {T1,T2,...}. The first customer (n = 1) always selects the first table; Any other customer (n > 1) randomly selects an occupied table or an unoccupied empty table with a certain probability.

We continue this process by serving dishes right after each customer is seated at a table. Assume there are a finite number of dishes and an infinite number of cuisines. Each table is associated with a cuisine that dishes are served with a unique probability distribution. Any customer sitting at a table randomly selects a dish with the probability relating to its corresponding cuisine.

In our model, a network contains a group of customers where each customer is analogous to a node while a dish is analogous to an atomic action type. Suppose customers from the same group prefer similar cuisines, which is analogous to a complex activity of interest, and are more likely to sit at the same tables. Formally, denote {t1,t2,...} the variables of tables, and {a1,a2,...} the variables of atomic actions 2. To construct a network Gd = (Vd,Ed), tn and an are the table and the atomic action (dish) assigned to the node (customer) vn Vd (n = 1,2,...,|Vd|). The generative process operates as follows. The n-th node vn selects a table tn that is drawn from the following distribution:

P

Figure 4. An illustration of the generative process of choosing a table tn and an atomic action an for node vn.

where ntis the number of existing nodes occupied at table T), with nt1, t = T1. is a tuning hyperparameter. It is worth mentioning that the distribution over table assignments in CRP is invariant and exchangeable under permutation according to de Finetti’s Theorem [28]. After vn is assigned with a table tn , an atomic action (dish) an is chosen from the table by:

where is a hyperparameter. Note is the parameter vector of a multinomial distribution at table T. Figure 4 presents a cartoon explanation of this node generation process of our IBGN model. Since we would obtain one

(a) Distributions of atomic actions at different tables that collectively represent the complex activity offensive play.

(b) Distributions of atomic actions at different tables that collectively represent the complex activity defensive play.

Figure 5. Examples of the tables and corresponding distributions over atomic actions for representing two complex activities, offensive play and defensive play, respectively. Here each table contains a set of atomic actions under a specific distribution.

dedicated IBGN model for each complex activity, a complex activity is now characterized as a unique set of distributions over atomic actions, i.e. . As illustrated in Figure 5, the two different complex activities offensive play and defensive play are associated with two distinct sets of tables with their own distributions over atomic actions. It suggests that our IBGN modeling approach is capable of differentiating the underlying characteristics associated with atomic actions from different complex activity categories.

B. Link Generation

Once each node (interval) is assigned to an atomic action, links and their associated relations are to be generated next. It is important to ensure consistency of the resulting relations over all links. Formally, given two relations rnand rnon the links enand en(nn), respectively, the interval relation rnon link enshall follow the transitivity properties listed in Figure 3. It is straightforward to verify that the set R is closed under the composition operation. As a result, by applying the transitivity table, for any composition there exists only 11 possible transitive relations in Figure 6, denoted as C = {Cz : 1 11}. In other words, any composition of two consecutive relations satisfies rnrn.

Figure 6. The 11 possible interval composition relations.

To construct a globally consistent network, the relations on every triangle in a network must also be consistent. Namely, for any triplet of nodes vn, vnand vn, if there exist three links rn, rn, and rn, they need to satisfy the transitive relation rnrnrn. As such, we define the interval relation constraint variable as follows:

Definition 1. Give an arbitrary interval-based network G = (V,E), the interval relation constraint cnfor a link enE (1 ) is

In link generation, our IBGN model follows the rule that any relation rncan only be drawn from cn. We have proved that a network constructed under this rule is globally consistent and complete, with proofs relegated to the Appendix A.

Theorem 1. (Network Consistency and Completeness)

A network Gd constructed by obeying the interval relation constraints is always temporally consistent, and any possible combination of relations in Gd can be constructed through our IBGN model.

Now, we are ready to assign relations to links. Suppose anAi, an = Aj and cnCz (1 M, 1 11), a relation rnon link enEd is chosen from a distribution over all possible relations in cnas follows:

where is the parameter vector of the multinomial distribution associated with the triplet (Ai,Aj,Cz). Note that for an interval relation constraint containing only one relation (i.e. C1-C7), the probability of choosing that relation is always one.

It is also important to notice that the quality of the network structure plays an extremely important role in activity

Figure 7. Three possible network structures in link generation.

modeling. In our previous work [17], two variants with fixed network structures are considered: chain-based network as in Figure 7(a), where only the links between two neighbouring nodes are constructed in networks; fully-connected network as in Figure 7(b), where all pairwise links are constructed in networks. In fact, they are two special cases of our IBGN model. In chain-based networks, only a set of |Vd1 links e1e2are generated, with en(1 |Vd1) representing the link of two neighboring nodes vn vn. Any interval relation constraint cnin chain-based networks equals to R, and thus such networks are inherently consistent because no inconsistent triangle exists. However crucial relations may be missing in this model. On the other side of the spectrum, we have fully-connected networks, where all possible pairwise links are considered. Any xpin fully-connected networks equals to rp. When fitting the IBGN model, it is possible to increase the likelihood by adding links, but doing so may result in overfitting. Instead of prefixing the network structures, in this work we relax the assumption of a structure being either fully-connected or chain-based, and consider learning an optimal structure (i.e. to decide which links should exist in the network) from data. This allows the IBGN model to handle temporal consistency for arbitrary network structures, as presented in Figure 7(c).

C. The Generative Process

For each dataset containing a particular complex activity 1 L, our model assumes the whole generative process including node generation and link generation in Algorithm 1. Notice that the optimal network structure Gwould be learned from with details to be elaborated in section IV-A. The structure Gdemonstrates whether a link needs to be generated in the network. For example, to construct the network Gd = (Vd,Ed) for a certain complex activity instance d, the link enfrom vnto vn is involved in Gd if and only if enobeys the structure of G, denoted by enEd .

The joint distribution of variables t, a, and r, is given by:

The total number of variables t, a, and r are Vd|, Vd| and EG, respectively. It is worth noting that

IV. IBGN LEARNING

In what follows we focus on how to learn the network structure and the parameter vectors and from the training data for a particular complex activity l.

A. Learning Network Structure

Instead of configuring a network with chain-based or fully-connected links, we would like to learn an network structure G in IBGN according to a score function that best matches the training data , i.e., learning from empirical data on which links to select in our constructed networks. An IBGN model is built over a collection of variables t for table assignments, a for atomic-action assignments and r for relation assignments. In detail, for a specific instance d =< I1,I2,...,Ik > with k = |Vd|, its corresponding network involves variables t = {t1,t2,...,tk}, a = {a1,a2,...,ak}, r = {r1r1r1r2r2r2rk, c = {c1c1c1c2c2c2ck. To consider a general network structure, we first introduce a null interval to make every instance in having the same number of kmax d{|Vd|} intervals. A null interval is defined as nullsuch that its associated atomic action is null and its temporal relation with any other intervals is null. In other words, null can be viewed as a special atomic action class. For any instance of size k , kk null intervals are appended at the rear of the instance. In this way, every instance has the same number of kintervals. Correspondingly,

Figure 8. The illustration of the IBGN network structure G associated with variables.

any IBGN has a total number of possible random variables, with kpossible variables for tables t, kpossible variables for atomic actions a, kpossible variables for interval relations r and kpossible variables for interval relation constraints c. An exemplar IBGN model can be described in Figure 8, where each vi is associated with variables ti and ai, and each eiis associated with variable riand ci, with 1 . To this end, our IBGN structure learning problem is defined as to find a G = (VG,EG) such that the score of G given is maximum.

Next, we employ structure constraints to translate the IBGN model to a corresponding problem in Bayesian networks.

Definition 2. Given an IBGN model G, its corresponding Bayesian network is defined as a directed bipartite graph GVGEGwhere VGUGWG, with the structure constraints such that

(1)|UGWG

: :

: in-degree: in-degree0 or 2,

where denotes the number of distinct elements of v.

A node vin UG(1 UG) maps to the variable an in G, while a node vin WG(1 UG) maps to the variable rnin G. That is, UGa1,...,akand WGr1r1r1rk. Notice that a null is introduced to represent the absence of a node or a relation in an instance of (constraint (2)), and thus there are M + 1 possible atomic-action assignments for a node in UGand |R| + 1 = 8 possible relation assignments for a node in WG. Moreover, any node vin UGhas no parent, and any node vin WGhas either being connected to the nodes vnand vn in UGor not being connected to any node (constraint (3)). That means a link enexists in G if and only if its corresponding node vin Ghas in-degree2. The structure of Gassociated with the variables is illustrated in Figure 9.

Figure 9. The corresponding bipartite graph Gof the IBGN structure G.

Now, the problem of determining whether a link should exist in IBGN is converted to the problem of finding a set of links EGwith the maximal score under the above constraints. In particular, we consider the Bayesian Information Criterion (BIC) as the score function, which addresses the overfitting issue by introducing a penalty term in the structure, as proved in the Appendix B:

B. Learning Parameters

We first estimate the parameters for node generation. Since the variable tn is latent for each node vn in our generative process, we shall approximately estimate the posterior distribution on t by Gibbs sampling. Formally, we marginalize the joint probability in Eq.(2) and derive the posterior probability of latent variable tn being assigned to the table T(1 ) as follows:

where nais the count that nodes have been assigned to the atomic action type Aiat the table T, and ntis the count of nodes in Gd that has been assigned to the table T. ˜i refer to the atomic action assignments for nodes vn. NTn is the count of occupied tables with nti 1. The suffix n of na means the count that does not include the current assignment of table for the node vn. is the tuning parameter for the -th table selection during CRP; is the Dirichlet prior for the i-th atomic action conditioned on the -th table. We provide the detailed derivations of the Gibbs sampling in Appendix C. By sampling the latent tables following the above distribution, the distributions of (1 ) can be estimated as

Normally, the hyperparameters and are set as fixed values before the execution of a Gibbs sampler. In our IBGN model, and involve a number of and M prior parameters, respectively. As they are unfortunately unknown beforehand, it is difficult to manually encode each parameter to proper values. As a result, we need to instead learn each hyperparameter to obtain reasonable results. The adoption of Gibbs sampling enables us to seamlessly incorporate the tuning of these hyperparameters as presented in Algorithm 2. The stop condition may be that a predefined max iterations

has been reached or that an estimation function converges to a given threshold. To update the hyperparameters and , a convergent method proposed by Minka [19] is used as follows:

where is digamma function, and the superscript (s) indicates the sample generated by the Gibbs sampler at the s-th iteration.

Next, we estimate the parameters : 1 Mfor link generation. It can be seen that the probability distribution of variable rnrelies on the triplet (Ai,A j,Cz) only (where anAi, an = A j and cnCz), and thus we can learn these parameters by maximum likelihood estimate method. In our IBGN model, given a triplet (Ai,A j,Cz, the conditional probability distribution on rnis a multinomial over all possible relations in Cz. Then, the likelihood of the parameter for P(rnAi,A j,Cz) with respect to is:

By applying a Lagrange multiplier to ensure 1, maximum likelihood estimate for is

where nriis the number of links enare labeled with the r-th relation in , with anAi, an = Aj, cnCz and en. Note the trivial cases of 1 for 1 7 as each of them contains only one element as indicated in Figure 6.

Now, by integrating out the latent variable t with all the parameters derived above, the probability of the occurrence of a new instance given the l-th type of complex activity is estimated below

where aand rare the sets of atomic actions and their relations in the new instance, respectively, and iindicate only these links obeying the structure of Gare counted. To predict which type of complex activity a new instance belongs to, we simply evaluate the posterior probabilities over each of the L possible types of complex activities as

V. EXPERIMENTS

Experiments are carried out on three benchmark datasets as well as our in-house dataset on recognizing complex hand activities. In addition to the proposed approach IBGN, two variants with fixed network structures are also considered: One is IBGN-C for chain-based structures, where only the links between two neighbouring nodes in networks are constructed; The other one is IBGN-F for fully-connected structures, where all pairwise links in networks are constructed. Several wellestablished models are employed as the comparison methods, which include IHMM [20], dynamic Bayesian network (DBN) [12] and ITBN [31], where IHMM and DBN are implemented on our own, and ITBN is obtained from the authors. All internal parameters are tuned for best performance for a fair comparison. The standard evaluation metric of accuracy is used, which is computed as the proportion of correct predictions. a) Experimental Set-Ups: The Raftery and Lewis diagnostic tool [24] is employed to detect the convergence of the Gibbs sampler (Algorithm 2) for the IBGN family. It has been observed that overall we have a short burn-in period, which suggests the Markov chain samples are mixing well. Thus nt and na are set to the averaged counts of their first 1000 samples after convergence. In addition, we utilize the branch-and-bound algorithm [5] for constraints-based structure learning. This approach can strongly reduce the time and memory costs for learning Bayesian network structures based on the BIC score function (Eq. (3)) without losing global optimality guarantees. Besides, to avoid the division-by-zero issue in practice (i.e. nri0 in Eq. (5)), we instead use by introducing a small smoothing constant 10) in the following experiments.

b) Time Complexity Analysis: The time complexities of IBGN-C and IBGN-F are O(M2 Tnfor training each complex activity category, where Tn is the number of iterations executed in Algorithm 2. IBGN has an extra time complexity of Ofor structure learning, where K . On the other hand, the time complexities of the IBGN family at the testing stage are the same, which is Ofor a single test instance.

A. Experiments on Three Existing Benchmark Datasets

a) Datasets and Preprocessing: The three publicly-available complex activity recognition datasets collected from different types of cameras and sensors are considered, as summarized in Table. II. We employ these datasets in our evaluation due to their distinctive properties: The OSUPEL dataset [3] can be used to evaluate the case where only a handful of atomic action types and simple relations are recorded; Opportunity [25] is challenging as it contains a large number of atomic action types and also involves intricate interval relations in instances; CAD14 [15] represents the datasets having relatively larger number of complex activity categories.

Table II SUMMARY OF THE PUBLICLY AVAILABLE DATASETS.

To recognize atomic actions in each dataset, we adopt the methods developed in their respective corresponding work. That is, we employed the dynamic Bayesian network models that are used to model and recognize each atomic action including shoot, jump, dribble and so on for OSUPEL [31]. Similarly, for CAD14 [15] we adopted the hierarchical discriminative model to recognize atomic action such as clap, talk phone and so on through an discriminative pose dictionary. For the sensor-based Opportunity dataset, we utilized the activity recognition chain system (ARC) [4] to recognize atomic action recognition from sensors. It is worth mentioning that we can also recognize null type of complex activity by labeling the intervals that are not annotated to any activities in the datasets. b) Comparison under an Ideal Condition: First of all, the competing models are evaluated under the condition that all intervals are correctly detected. Table III displays the averaged accuracy results over 5-fold cross-validations, where the proposed IBGN family clearly outperforms other methods with a big margin on all three datasets. The reason is that IBGNs engage the rich interval relations among atomic actions. Although ITBN can also encode relations, it however

fails with the multiple occurrences of the same atomic actions or when inconsistent relations existing among training instances. As a considerable amount of multiple occurrences of the same atomic actions and inconsistent temporal relations exist in CAD14, ITBN performs the worst. It can also be seen that IBGN-F with fully-connected relations performs better than IBGN-C on the Opportunity and CAD14 datasets where relations are intricate. However, IBGN-F might be overfitted when relations are simple, e.g. the OSUPEL dataset. Overall IBGN outperforms its two variants by its ability to adaptively learn network structures from data, where fixed structures might face the issue of either overfitting or underfitting.

Table III ACCURACY COMPARISON ON DIFFERENT DATASETS.

c) Robustness Tests under Atomic-Level Errors: In practice the accuracy of atomic action recognition will significantly affect complex activity recognition results. To evaluate the performance robustness, we also compare the competing models under atomic action recognition errors. First, it is important to check whether our model is robust under label perturbations of atomic-action-level (or atomic-level for short). To show this, we synthetically perturb the atomic-level predictions. Figure 10 reports the comparison results on Opportunity under two common atomic-level errors. It can be seen that IBGNs are more robust to misdetection errors where atomic actions are not detected or are falsely recognized as another actions. We perturbed the true labels with error rates ranging from 10-30 percents to simulate synthetic misdetection errors. Similarly, we also perturbed the start and end time of intervals with noises of 10-30 percents to simulate duration-detection errors where interval durations are falsely detected. It is also clear that IBGNs outperform other competing models under duration-detection errors. In addition, we report the evaluated performances under real detected errors caused by the ARC system for atomic-level recognition. We chose three classifiers for atomic action recognition from the ARC system, i.e. kNN, SVM and DT. Features such as mean, variance, correlation and so on are selected by setting a time-sliced window of 1s. After classifi-cation, each interval is assigned to an atomic action type. As shown in Figure 10, the models which can manage interval relations are relatively more robust to the atomic-level errors than other models, such as ITBN and IBGNs. Moreover, it is evident that IBGNs are noticeably more robust than ITBN with around 15% – 87% performance boost. IBGN performs the best among its family because it is more capable of handling the structural variability in complex activities than the other two variants, which may avoid more noise existing in training and testing information. Note similar conclusions are also obtained on the OSUPEL and CAD14 datasets.

Figure 10. Accuracies under atomic-level errors on Opportunity dataset.

Figure 11. Two instances of the complex activity ASL word ketchup. A1- A5 refer to the straightening state of thumb, index, middle, ring and pinky fingers, respectively (white bars), while A6-A10 refer to the bending state of these fingers (black bars).

B. Our Complex Hand Activity Dataset

a) Data Collection: We propose a new complex activity dataset on depth camera-based complex hand activities on performing American Sign Language (ASL). It is an ongoing effort, and at the moment it contains 3,480 annotated instances, which is already about 5-fold larger than existing ones. As illustrated in Fig. 11, complex activities in our dataset are defined as selected ASL hand-actions. There are 20 atomic actions, which are defined as the states of individual fingers, either straight or bent. It is important to realize that in a complex activity, there could be multiple occurrences of the same atomic action, as is also exemplified in Fig. 11(f) where an action A2 appears twice in the sub-network. Sixteen subjects participate in the data collection, with various factors being taken into consideration to add to the data diversities. Subjects of different genders, ages groups, races are present in the dataset. The male to female ratio is 12 to 4, races ratio is 14 to 2 and participants’ ages span from around 15 to 40. For each subject, the depth image sequences are recorded with a front-mount SoftKinetic camera while performing designated complex activities in office settings,

Figure 12. Performance changes vs. perturbation of atomic-action-level errors.

with a frame-rate of 25 FPS, image size of 320240, and hand-camera distance of around 0.6-1m. In total, the dataset contains 19 ASL hand-action complex activities, with each having 145-290 instances collected among all subjects. Each of the instance is comprised of 5-17 atomic action intervals. The 19 ASL hand-actions are air, alphabets, bank, bus, gallon, high school, how much, ketchup, lab, leg, lady, quiz, refrigerator, several, sink, stepmother, teaspoon, throw, xray. b) Atomic Hand Action Detection: To detect atomic-level hand-actions, we make use of the existing hand pose estimation system [30] with a postprocessing step to map joint location prediction outputs to the bent/straight states of fingers. To evaluate performance of the interval-level atomic action detection results, we follow the common practice and use the intersection-over-union of intervals with a 50% threshold to identify a hit/false alarm/missing, respectively. Finally F1 score is used based on the obtained precision and recall values. Note here the finger bent states are considered as foreground intervals. c) Robustness Tests under Atomic-Level Errors: We first evaluate the performance on simulated synthetic misdetection errors and duration-detection errors. From Fig. 12, we observe that overall IBGN is notably more robust than other approaches, meanwhile IHMM consistently produces the worst results. Our model is relatively robust in the presence of atomic-level errors. Now we are ready to show the performance of our model when working with our atomic-level predictor as mentioned previously. Overall our atomic-level predictor achieves F1 score of 0.724. Table IV summarizes the final accuracy comparisons of the six competing approaches based on our atomic-level predictor vs. the atomic-level ground-truth labels on our hand-action dataset. It is not surprising that in both scenarios IBGN again significantly outperforms the rest approaches. It is worth noting that taking into account the challenging task of atomic-level hand pose estimation on its own, the gap in performances of 0.58 vs. 0.86 on predicting over 19 complex activity categories is reasonably, which is also certainly one thing we should improve over in the future. Fig. 13 presents the confusion matrix of IBGN working with our atomic-level predictor. We observe that our system is able to recognize the ASL words such as xray, alphabets and quiz very well. At the same time, several ASL words turn to be difficult to deal with, with accuracy under 50%. This may mainly due to the relatively low accuracy of the atomic-level predictor we are using on the particular atomic actions.

Table IV COMPLEX ACTIVITY ACCURACY COMPARISONS ON OUR HAND-ACTION DATASET.

Figure 13. Confusion matrix of IBGN on our hand-action dataset.

VI. CONCLUSION

We present an interval-based Bayesian generative network approach to account for the latent structures of complex activities by constructing probabilistic interval-based networks with temporal dependencies in complex activity recognition. In particular, the Bayesian framework and the novel application of Chinese restaurant process (CRP) of our IBGN model enable us to explicitly capture inherit structural variability in each of the complex activities. In addition, we make publicly available a new complex hand activity dataset dedicated to the purpose of complex activity recognition, which contains around an-order-of-magnitude larger number of annotated instances. Experimental results suggest our proposed model outperforms existing state-of-the-arts by a large margin in a range of well-known testbeds as well as our new dataset. It is also shown that our approach is rather robust to the errors introduced by the low-level atomic action predictions from raw signals. As part of future work, we are considering relaxing the assumption that the IBGN models share the same structure for representing multiple instances of the same complex activity, and will instead learn more flexible structures for each class of complex activities by introducing latent structure variables that decide whether a link should exist in an instance. Also, we will continue the finalization of our hand-action dataset, improving the atomic-level atomic action prediction method, as well as attempting toward the establishment of standardized comparisons on exiting systems.

VII. ACKNOWLEDGMENTS

This research was supported in part by grant CQU903005203326 from the Fundamental Research Funds for the Central Universities in China, grants R-252-000-473-133 and R-252-000-473-750 from the National University of Singapore, and A*STAR JCO grants 15302FG149 and 1431AFG120.

We first prove that the composition operation on the interval relation union set S satisfies the associative law. The set of all the 127 unions is denoted by S.

Definition 3. (Composition Product of Two Relation Sets) The composition product of two sets of interval relations R, i.e. R = {r1and R, where any ri, is defined as Rri .

Lemma 2. (Associative Law on Composition)

Proof. Let Rx and Rz , where any ri. Then, Ryand Rx Since the com- position on the seven interval relation set satisfies the associative law that is both left and right associative, the associative law also holds on R, which is closed under the composition operation, i.e. So, we have .

Lemma 3. (Path Consistency)

Given an IBGN Gd, if xixixjfor any 1 |Vd|, then for any path P = vi vivivkvk in Gd, xixixixk.

Proof. For any path P = vi vivivkvk in Gd, we have xixixiand xixixi, then xixixixixixixi. Iteratively, we finally have xixixixk.

Lemma 3 indicates that if any ijk in the network satisfies the transitivity properties, then any path in the network also satisfies the transitivity properties. Hence, the entire network generated through our construction is consistent. (Completeness) (by contradiction)

Suppose there exists one interval relations xithat cannot be generated through our construction, i.e. xixixj, which means there exists at least one j(1 1) that xixj. This contradicts the fact that the network is consistent. Besides, any possible relation in R can be generated between two neighboring nodes. That is to say, any possible triangle that is consistent can be generated by our model.

Proof. The corresponding variable dependence in the IBGN model G is shown as follows:

According to our definition of the generative process, the structure between t and a (marked as blue line), the structure among t and the structure between c and r (marked as blue dotted line) are fixed. We only need to learn the structure between a and r.

In effect, the structure without t and c and their associated links (blue dotted lines) is equivalent to the Bayesian structure Gdefined in Definition 2, where UGa1,a2,...,akand WGr1r1r1rk(constraint (1)). The number of categories of any atomic action variable ai (1 i ) is M + 1, including the null atomic action. Similarly, the number of categories of any relation variable ri(1 i ) is 8, including the null relation (constraint (2)). Given a training dataset D, for any instance d and its corresponding network Gd, if j > |Vd|, then aj = null and rinull. Moreover, any atomic action variable ai has no parent, and any relation variable rihas either two parents or zero parents (constraint (3)). That is, a link eiexists in G if and only if both ai and aj are connected with riin G.

Formally, let VG =VtVaVrVc where Vt = {t1,...,tk, Va = {a1,...,ak, Vr = {r1r1r1rkand Vc = {r1c1c1ckdenotes the parents of vi VG and vj, where vj is the number of categories of vj. In particular, we have

where c j is a constant. Suppose the corresponding variable cnof the node vj Vc equals |Cz|, we have

Since the structure between t and a, the structure among t and the structure between c and r are fixed, the number of parents of a node vi VtVaVc are fixed. Then, we have

Also, let be the entire vector of parameters such that vik , where vik and denotes that vi is assigned with the k-th element and its parent is assigned with the j-th element in , respectively; ni jk indicates how many

are the parameter vectors associated with Vt, Va, Vr and Vc, respectively. Given a training dataset , we have

1. We split the parameter vector into two parts: and are the parameter vectors associated with UGand WG, respectively. Then, we have

Let

Given a dataset , the parameter vectors , can be estimated to maximize their respective logarithmic equations regardless of the network structures. Similarly, for the parameters such that vi Vr and Vc, they can also be estimated invariably under arbitrary network structures. Therefore, is constant. Then we have

APPENDIX C CONDITIONAL DISTRIBUTION FOR GIBBS SAMPLING

Since the probability of an conditioned on tn is Dirichletmultinomial distribution, we can write out the integration formulae:

The total number of tables cannot exceed the maximum network size .

The probability of tn conditioned on its previous tables t1,t2,...,tnfollows CRP . We adopt the technique introduced by Sudderth [27] to sample the n-th table in Gd as follows

where ntk is the number of time the previous n 1 nodes in Gd are assigned to the table Tk, and is the Kronecker delta function that

and ¯k denotes a previously empty table.

Now we derive the conditional probability of each variable tn of the n-th table in Gd assigned to the table Tby Gibbs sampling as follows:

REFERENCES

[1] Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Computing Surveys 43(3), 16 (2011)

[2] Allen, J.: Maintaining knowledge about temporal intervals. ACM Communications 26(11), 832–843 (1983)

[3] Brendel, W., Fern, A., Todorovic, S.: Probabilistic event logic for interval-based event recognition. In: CVPR (2011)

[4] Bulling, A., Blanke, U., Schiele, B.: A tutorial on human activity recognition using body-worn inertial sensors. ACM Computing Surveys 46(3), 33 (2014)

[5] Campos, C.D., Ji, Q.: Efficient structure learning of Bayesian networks using constraints. JMLR 12, 663–689 (2011)

[6] Chen, L., Hoey, J., Nugent, C., Cook, D., Yu, Z.: Sensor-based activity recognition. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 42(6), 790–808 (2012)

[7] Cook, D.J., Krishnan, N.C., Rashidi, P.: Activity discovery and activity recognition: a new partnership. IEEE Transactions on Cybernetics 43(3), 820–828 (2013)

[8] Devanne, M., Wannous, H., Berretti, S., Pala, P.: 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold. IEEE Transactions on Cybernetics 45(7), 1023–1029 (2014)

[9] Dubba, K.S.R., Cohn, A.G., Hogg, D.C., Bhatt, M., Dylla, F.: Learning relational event models from video. Journal of Artificial Intelligence Research 53(1), 41–90 (2015)

[10] Fan, X., Yuan, C.: An improved lower bound for Bayesian network structure learning. In: AAAI (2015)

[11] Helaoui, R., Niepert, M., Stuckenschmidt, H.: Recognizing interleaved and concurrent activities: A statistical-relational approach. In: IEEE International Conference on Pervasive Computing and Communications (2011)

[12] Hu, D., Yang, Q.: CIGAR: Concurrent and interleaving goal and activity recognition. In: AAAI (2008)

[13] Kim, E., Helal, S., Cook, D.: Human activity recognition and pattern discovery. IEEE Pervasive Computing 9(1), 48–53 (2010)

[14] Lara, O.D., Labrador, M.: A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials 15(3), 1192–1209 (2013)

[15] Lillo, I., Soto, A., Niebles, J.: Discriminative hierarchical modeling of spatio-temporally composable human activities. In: CVPR (2014)

[16] Liu, A.A., Su, Y.T., Jia, P.P., Gao, Z., Hao, T., Yang, Z.X.: Multipe/single-view human action recognition via part-induced multitask structural learning. IEEE Transactions on Cybernetics 45(6), 1194–1208 (2015)

[17] Liu, L., Cheng, L., Liu, Y., Jia, Y., Rosenblum, D.: Recognizing complex activities by a probabilistic interval-based model. In: AAAI (2016)

[18] Liu, L., Wang, S., Peng, Y., Huang, Z., Liu, M., Hu, B.: Mining intricate temporal rules for recognizing complex activities of daily living under uncertainty. Pattern Recognition 60, 1015–1028 (2016)

[19] Minka, T.: Estimating a Dirichlet distribution (2000)

[20] Modayil, J., Bai, T., Kautz, H.: Improving the recognition of interleaved activities. In: Int. Conf. Ubiquitous computing (2008)

[21] Oliver, N., Horvitz, E.: A comparison of hmms and dynamic Bayesian networks for recognizing office activities. In: User Modeling, pp. 199– 209 (2005)

[22] Pinhanez, C.S.: Representation and recognition of action in interactive spaces. Ph.D. thesis, MIT (1999)

[23] Pitman, J.: Combinatorial stochastic processes. Tech. rep., Technical Report 621, Dept. Statistics, UC Berkeley (2002)

[24] Raftery, A.E., Lewis, S.: How many iterations in the Gibbs sampler. Bayesian statistics 4(2), 763–773 (1992)

[25] Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., et al.: Collecting complex activity datasets in highly rich networked sensor environments. In: Int. Conf. Networked Sensing Systems (2010)

[26] Ryoo, M., Aggarwal, J.: Semantic representation and recognition of continued and recursive human activities. Int. j. of compt. vision 82(1), 1–24 (2009)

[27] Sudderth, E.B.: Graphical models for visual object recognition and tracking. Ph.D. thesis, MIT (2006)

[28] Teh, Y.W.: Dirichlet process. In: Encyclopedia of machine learning, pp. 280–287 (2011)

[29] Veeraraghavan, H., Papanikolopoulos, N., Schrater, P.: Learning dynamic event descriptions in image sequences. In: IEEE Conference on Computer Vision & Pattern Recognition, pp. 1–6 (2007)

[30] Xu, C., Nanjappa, A., Zhang, X., Cheng, L.: Estimate hand poses efficiently from single depth images. International Journal of Computer Vision 116(1), 21–45 (2016)

[31] Zhang, Y., Zhang, Y., Swears, E., Larios, N., Wang, Z., Ji, Q.: Modeling temporal interactions with interval temporal Bayesian networks for complex activity recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(10), 2468–2483 (2013)

designed for accessibility and to further open science