Video data abound. As of this writing the equivalent of 500 hours of video is uploaded to Youtube alone, every minute. Since the majority of video feeds will never be inspected by humans, analysis of video, the ability to extract meaningful information from video feeds as well as the ability to express and evaluate different types of video analytics queries has become a pressing concern.
Recent advances in Deep Learning (DL) [5, 16] technologies introduced algorithms and models to comprehend challenging data types such as images, video and unstructured text (among many others); recent breakthroughs in applications such as image classification and object detection offer the ability to classify objects in image frames, detect object locations in frames as well as track objects from frame to frame with solid accuracy.
There has been increasing recent interest in the community [10,11,12,13,27] in the processing of video queries utilizing DL models as primitives for data extraction from videos. We are also concerned with applications that involve video collected from static cameras as it is prevalent in surveillance and security applications. Our focus however is on the effi-cient execution of temporal queries over video feeds. A video feed is an ordered sequence of frames, presented at a specific frame rate, typically 30 frames per second (fps) to align with human visual perception. Different types of objects appear in frames and can be easily detected by state-of-the-art object detection algorithms. Similarly, DL models are highly effective to track the same object (e.g., car) from one frame to the next, as objects persist across frames. Given this basic ability to detect and track objects across the temporal evolution of a video feed, several queries of interest arise. For example, we may be interested to discover video segments during which only the exact same two cars (and no other object) appear continuously within a 10 second interval; similarly we may wish to detect video segments during which the same three trucks and the same human object appear for a duration of one minute jointly (possibly along with many other cars/trucks and humans among them). These queries become highly sophisticated when the query objects can be associated in arbitrary Conjunctive Normal Form (CNF) expressions. Moreover, when query specified objects can be connected to external identities (e.g., license plates) such queries become highly powerful.
Figure 1: Video Examples
As a sample application scenario, consider surveillance analysis, where a lot of video footage is examined after an incident for certain events of interest involving joint presence of objects. For example, after an incident witnesses may report a white car and two males on the street. Subsequently, authorities may wish to detect in a large video footage segments where a white car and two humans appear jointly for at most five minutes (e.g., Figure 1a). Efficient execution of such queries can automate many of these tasks.
Evaluating queries on video feeds involving multiple objects and temporal constraints presents numerous challenges, however. First, finding video frames where the above query (a white car and two males appear jointly) is satisfied can be very expensive since both the actual objects and frames that satisfy the query are uncertain, which means that a bruteforce approach has to consider all possible object combinations and frames, thus leading to a large computation space. Sets of objects that are not promising to be part of any query answer should be pruned immediately from further evaluation. Second, Occlusions occur naturally among objects in videos, where some objects disappear from the visible screen in some frames (because of being blocked by other objects) and then reappear. For example, in Figure 1b the human may disappear behind the car for a few seconds or longer. Occlusions can be captured by state-of-art object tracking algorithms [14, 25], and thus they need to be considered in the query semantics. Similarly, although state-of-the-art object tracking algorithms perform well when tracking an object across frames, they are still error prone. As a result, some objects may disappear or are assigned new identifiers in some frames. Temporal query semantics has to account for pragmatic challenges such as occlusion, and subsequent query evaluation needs to take such semantics into account.
In this paper, we place the problem of evaluating temporal queries over video feeds into perspective, and in particular make the following contributions:
• We define semantics for temporal queries on video feeds accounting for aspects intrinsic to video objects such as object occlusion.
• We decouple object detection/classification from query evaluation, introducing an intermediate layer (MCOS Generation) that optimizes the input to the query evaluation module.
• We fully develop the MCOS Generation module, detailing a way to organize objects detected through state maintenance.
• We propose Marked Frame Set (MFS) to prune unnecessary states early. Additionally, to provide even better pruning efficiency in certain scenarios, we also propose to maintain states in a graph structure called Strict State Graph (SSG).
• We propose an algorithm called State Traversal (ST) that processes incoming frames against the SSG, dy-
namically adjusting its structure and incrementally and efficiently maintaining all primitives necessary for succinct query evaluation.
• We propose several modifications to a CNF query execution algorithm, and demonstrate how to efficiently couple CNF query execution with algorithm ST, yielding improved performance in our application scenario.
• We present a detailed experimental evaluation of the proposed algorithms utilizing both synthetic and real data, demonstrating the trade-offs between MFS and SSG. Both approaches for state maintenance yield sig-nificant performance improvements over baselines on real datasets, with benefits increasing as query parameters vary. We also demonstrate that adapting our state maintenance during query evaluation yields an optimized approach that exhibits significant speedup on query evaluation; up to more than 100 times in our experiments.
The remainder of this paper is organized as follows. In Section 2, we formally introduce the problem of interest in this work, followed by Section 3 which presents an overview of our solutions. Section 4 presents an overview of the proposed MCOS Generation module along with our main technical proposals. In Section 5, we discuss query evaluation. Section 6 presents a thorough experimental evaluation of all introduced approaches. Section 7 reviews related work, and Section 8 concludes the paper. All proofs of theorems can be found in the appendix.
Consider a video feed containing 5 frames, ({B}, {ABC}, {ABDF}, {ABCF}, {ABD}), where each letter represents an object, and each object set represents a frame in order. Assume all objects are of the same class. Queries such as “select the object set that appears in all frames” can be easily answered by intersecting the object set of all frames directly. However, if we allow occlusions/detection errors and issue queries such as “Select the video frames where some objects appear jointly for at least 3 frames in a window of 5 frames”, the problem becomes nontrivial. In this case, the object sets {B} (appearing in frames {0, 1, 2, 3, 4}) and {AB} (in frames {1, 2, 3, 4}) should be selected. If we further relax the duration threshold from 3 frames to 2 frames, three more object sets ({ABC}, {ABD}, {ABF}) should also be selected. In real-world applications, such queries may further be restricted to a certain number of objects (e.g. at least 2 persons and 1 car to appear jointly).
To formalize the problem, we consider a video feed V as a bounded sequence of frames, i.e., where N is the total length of the video. We use ID to denote the set of possible object identifiers (id) that can appear in the video feed (unique identifiers that are persistent for each detected object across video frames). Typically such identifiers are extracted by the application of object tracking models [14]. We use L to denote the set of class labels for all objects. Class labels (e.g., human, car) are extracted by the application of object classification/detection models on each frame.
Utilizing object detection and tracking algorithms, we can extract a set of objects from each frame
, and for each
) and its class label
we can obtain a structured relation V R from the video feed V , with schema (fid, id, class), where fid is the frame id
is the object id, and
the class label. Object tracking algorithms ensure that an object has the same id for all frames in which it appears, even under the presence of occlusions [19,25,26].
Given a window size to denote the structured relation obtained from the frames in the current window,
. To capture object co-occurrence across frames in
, we introduce a new predicate,
), to determine whether the objects spec-ified in the set
are co-occurring in frame f. That is,
) is TRUE iff
use F to denote the set containing all the frame ids in the current window,
). We slightly abuse terminology and use the term frame and frame id interchangeably as all frames are unique.
A co-occurrence object set is a set of objects that all appear jointly in certain frames, defined as follows:
a frame set such that
is TRUE, then
occurrence object set of frame set
For a given frame set , an object set is a maximum co-occurrence object set if and only if none of its supersets is a co-occurrence object set of
, defined as follows:
In the example mentioned in the beginning of this section, multiple MCOSs exist in the window of 5 frames. For example, object set {AB} is an MCOS of frame set {1, 2, 3, 4}, while {ABC} is an MCOS of frame set {1, 3}.
(CNF) queries over the video feeds. We adopt sliding window query semantics ; each query is expressed along with a temporal window context w. The window advances every time a new frame is encountered and the query result is evaluated over the most recently encountered w frames.
where x is the class label of an object, n is an integer value and For example,
specifies the number of cars to be greater or equal to value n. Bounded range queries can also be constructed based on conjunctions, e.g.,
3 queries whether the number of people is between 3 and 5. Since objects can appear and disappear in the visible screen due to occlusion, we introduce a duration parameter d to each query, where 0
Without occlusion an object should appear in the visible screen for w frames; with occlusion, however, the number of appearances can be less than w. The duration parameter captures this and specifies the minimum number of frames an object (more generally an MCOS) should appear in the current window.
condition c will be evaluated as TRUE iff there are frames , such that the following conditions are satisfied.
Figure 2: Overview
result is a set of MCOSs, which includes all the possible frame sets where the overall CNF expression is evaluated to TRUE.
The architecture of our overall solution is shown in Figure 2. Our solution consists of three modules, namely Object Detection and Tracking, Maximum Co-occurrence Object Set (MCOS) Generation, and Query Evaluation.
In the Object Detection/Tracking module, the input video feed, V , is processed to extract object ids and associated class labels. This module consumes the raw video feed (frames) and produces a structured relation V R. In this module we deploy state-of-the-art object detection and tracking algorithms. Specifically, we use Faster R-CNN [22] as the object detection algorithm and Deep Sort [25] as the tracking algorithm. The module is designed to be ‘plug and play’ such that any algorithm from the computer vision community can be adopted as it becomes available. The output of this layer is a relation of tuples with schema (fid, id, class) indicating that an object with identifier id is of a specific class and is detected at frame fid. This relation of tuples is relayed to the MCOS Generation module.
MCOS Generation is responsible for generating and materializing MCOS for query evaluation. The module computes MCOS incrementally by reusing intermediate results from the previous window. It also maintains sufficient information to decide when an MCOS should be removed as the window slides. To achieve this, the module utilizes information from queries, such as the window size w, duration condition d, and classes involved in the query. Queries with the same window size will be grouped together since they may share the same MCOS. Objects with class not requested by any query may be dropped from V R upon entering the MCOS generation module. We introduce the notion of a State to represent the basic unit that materializes an intermediate set of objects during generation, defined as follows:
Definition 3. State. Let F be the set of all frames in the current window and ID the set of all object identifiers. A state COS of frame set
Each state consists of two types of information: an object set and the frames in which the objects appeared. Note that the object set may not be an MCOS in some states. A state, is an MCOS of frame set
Otherwise, it is an invalid state. MCOS Generation should only produce valid states to Query Evaluation since queries are evaluated on MCOS only.
We use S to represent the set of all possible states in the current window. The object set in a state s is an MCOS as long as
For each valid state, s, processing on the MCOS can be divided into three steps:
1. Check against the duration parameter d. The query could be evaluated as TRUE only if
2. Conditions in CNF expressions are aggregates evaluated on class labels (e.g., ‘car Thus, objects have to be aggregated after classification based on their class labels.
3. Once all conditions of an expression have been processed, we can move on to evaluate CNF expressions. For this step we adapt CNF evaluation algorithms [24]. In MCOS Generation module, simple optimizations can reduce number of states evaluated. For example testing the duration parameter d can be conducted when the MCOS are generated (a form of push down optimization). Namely, a state s containing an MCOS can be pruned if (unsatisfied states). States with MCOS that are not pruned (satisfied states) will be relayed to the Query Evaluation module. Since we are evaluating queries over a window, some states that are not satisfied at frame i may be satis-fied in the future as the window slides. Subsequently state maintenance, requires succinct book-keeping to assume both efficient execution and correctness. Since queries are defined and evaluated on windows, we maintain states for each window and update these states incrementally. Queries are analyzed first, objects that are not required for the given query will be discarded immediately. States can be generated, satisfied, and invalid as new frames arrive. We present the MCOS Generation module in the next section. Then we discuss the Query Evaluation module in Section 5.
4.1 Objective
For each new arriving frame, the objective of MCOS generation is to maintain a set of states S and select a subset is a valid and satisfied state, while generating
efficiently. We refer to this procedure as state maintenance.
4.2 Marked Frame Set
4.2.1 Objective
An MCOS can be obtained by intersecting two or more object sets from the current window. Generating MCOS by assessing object intersections across object sets of frames in each window can be very expensive. Since some MCOSs are shared between windows, we use states to keep the information of an MCOS, and derive new states from existing states computed in the previous window.
When the window slides, frames that are removed from the window are denoted as expired. For each existing state, we remove the expired frame ids from their frame sets. A state can be removed from the collection of states in the current window only if the frame set is empty.
4.2.2 A First Attempt
As a first attempt to state maintenance, consider the following approach. Let the set of all states in the current window be S, and assume that the last frame id in the current window is i. When a new frame with id arrives with associated object set
, the following steps are conducted during state maintenance:
1. Append the new frame id: , the new frame id can be simply appended to the existing state
Example 1. Consider generating MCOS for single query with duration parameter d = 3 frames and a window size of w = 4 frames, as shown in Table 1. Each letter represents a detected object. The Frame and Object Set columns provide the frame id and associated object set obtained from the video feed. The column States depicts the set of states maintained in the current window, where satisfied states are underlined. Satisfied MCOSs are listed in the EXP(Expected) column.
If we maintain states by simply keeping the frame set of each object set, the states are shown in the Frame Set States column in the table.
A new state with object set {B} is created at frame 0 since there are no existing states. When frame 1 arrives, according to the above steps: we first append the new frame id 1 to state {B}; then a new state {ABC} with frame set {1} is created (step 2.b). With the same procedure, when frame 2 arrives, the new frame id is added to state {B}. Two new states are created: {AB} is created by step 2.a, since {ABDF} = {AB}; {ABDF} is created by step 2.b, since no existing states have the same object set as {ABDF}. By keep doing so, we obtain the complete state set in the current window for each frame. Note that the object set {A} is not generated and maintained in any window. This is because {A} is not an MCOS since A always co-occurs with object B (in frames 1-4).
No object set is satisfied in frame 0 and 1 since the total number of frames is less than 3. At frame 2, frame set {0, 1, 2} generates one object set {B}, which is an MCOS. When frame 3 arrives, frame set {1, 2, 3} produces another MCOS {AB}. Note that in frame 4, the state with object set {B} is satisfied (size of the frame set ); however, it is not an MCOS of the current frame set. The only MCOS in the current window is {AB}, which appeared in all frames.
In the above example, the only MCOS of frame set {1, 2, 3, 4} in the window of frame 4 is {AB}. However, we are keeping two states ({AB} and {B}) with the same frame set. {B} is not an MCOS of the frame set, thus the corresponding state is an invalid state, and should not be processed
Table 1: A Video Segment Example
further and/or evaluated. If we maintain it will impose unnecessary overhead and it will be processed against new arriving frames.
4.2.3 Key Frame Set
Such invalid states (shown in Section 4.2.2) should be removed from the state set so we introduce a mechanism to do so. Comparing frame sets across states can be costly since we don’t know which two states have to be compared without any indexes. To avoid such computations we maintain information regarding when a state becomes invalid. Once a state is invalid, it is removed from the state set immediately.
Consider the state ({AB}, {1, 2, 3, 4}) from frame 4 in Table 1. Removing frame 3 from the frame set, the resulting state ({AB}, {1, 2, 4}) is still valid since {AB} is still an MCOS of the new frame set. However, if we remove both frames 1 and 3, the new state ({AB}, {2, 4}) will become invalid since the MCOS for frame set {2, 4} is {ABD}. Thus, for a valid state s, a subset of determines the validity of a state. If we identify these frames, we can assert whether a state is valid or not. Instead of waiting until every frame expires, a state can be pruned as long as all frames in a key frame set expired. This observation can readily be used in Example 1 to prune {B}. To formally define these frames we introduce the notion of Key Frame Sets:
Definition 4. Key Frame Set. Let s be a state with MCOS of the frame set
. For some frame set
is a key frame set iff:
set.
That is, a Key Frame Set of state, s, is a set of frames with the following properties: if we remove all frames in the key frame set from a state will not be an MCOS; if we add one of the key frames back,
an MCOS. In example 1 if we remove frames 1 and 3 from state {AB}, another state ({ABD}, {2, 4}) has the same frame set. Since
, the object set {AB} is no longer an MCOS. Thus, {1, 3} is one of the key frame sets. For the same reason {2, 4} and {1, 4} are also key frame sets. Since frames are expired in temporal order, state {AB} will be invalid after frame 3 expires if no other frames are added to the state.
For a state, s, that is generated by some frame, k, directly, the frame k is a key frame of s (as in case 2 of Section 4.2.2). Even when the window slides, if the frame, k, remains in the window, the state, s, will always be valid. To highlight a key frame in a frame set, we add a mark (*) before the frame id. A frame set with such marked frames will be denoted as Marked Frame Set.
Suppose a new frame i arrives, and a state ns is created by frame i (Case 2 of Section 4.2.2). For each state frames will be marked according to the Frame Marking Rules, as follows:
1. For the state ns that is created by frame i (Step 2.b in Section 4.2.2), frame i will always be marked in 2. If
marked in
will also be marked in
By applying the above rules, the following theorem can be proved:
Theorem 1. The set of marked frames in each state is a Key Frame Set.
For a state s, if at least one frame is marked, then it has at least one key frame (according to theorem 1). According to the definition of Key Frame Set, state s contains an MCOS. Thus, state s is a valid state if at least one frame in marked. A state is eliminated from the current collection of states once all marked frames in the frame set expire.
We show in the following example that the marked frame set can reduce the maintenance cost significantly while pro- ducing the correct result.
Table 2: Example with Marked Frame Set
Example 2. Consider the example in Table 1 again. State with object set {B} will be generated at frame 0, while 0 will be marked in the frame set. From frame 1 to 3, the new frame id will be appended to the existing state. When frame 4 arrives, frame 0 will be removed from the set along with the mark. Since all the marked frames have been removed from the state, the state is invalid and will be removed. Thus, at frame 4, only state {AB} will be selected as the result.
4.2.4 The MFS Approach
Following the Frame Marking Rules (Section 4.2.3), we propose the MFS approach, which utilizes a set of states to maintain all possible object sets along with their Marked Frame Set. In particular, in MFS, we maintain a set of existing states from the previous window, and when a new frame arrives, we compute the object set intersection between each existing state and the new frame and mark the frame set according to the Frame Marking Rules.
4.3 Strict State Graph
Although MFS provides the ability to remove unnecessary states early, all states still have to be processed for each arriving frame. In this section, we further explore the relationship between states and provide an alternative approach to provide better pruning power.
4.3.1 Principal States
We first introduce Principal State to denote a subset of the states in the current window, defined as follows:
Definition 5. Principal State. For a window size w, let be the corresponding video frames. A state s in the current window is a principal state iff
, such that
A state may become a principal state when a new frame, k, arrives: a) if , such that s shares the same object set as the new frame
will becomes a principal state. 2) if none of the states in S has the same object set as the new frame, a new state, s, will be created with object set that of frame k and will become a principal state. We will say that a principal state, s, is created from frame k, and that frame k created principal state s. For the principal state, s, the frame that created it, k, is always a key frame of state s. A principal state will cease being one, once all the frames that created it expire from the current window.
For example, at frame 0 in Table 2, state {B} is the only state and is generated by the arriving frame directly, thus it is a principal state. For the same reason, principal state {ABC} is created when frame 1 arrives, while another principal state {ABDF} is created by frame 2. At frame 3, there are four principal states in total: {B}, {ABC}, {ABDF} and {ABCF}. When frame 4 arrives, frame 0 expires from the current window, thus state {B} is no longer a principal state.
Principal states carry an important property. Consider a window on the relation; starting with one frame, the only existing state (that was created from the frame) is a principal state. If we add one more frame to the window, a new state may be created based on the existing principal state and the new principal state. By keep on adding more frames, new states can always be created based on the existing states and the new principal state. Thus, every state in the current window is generated by principal states directly or indirectly.
4.3.2 Strict State Graph
To capture the relationship of how the states are generated, we use State Graph to connect those states, defined as follow:
Definition 6. State Graph(SG). A state graph is a directed graph, which can be represented as G = (S, E), where S is the set of all states and E is the set of all edges. For each edge state
is generated from state
For any state graph G, the following holds:
Property 1.
Suppose we have two principal states, {ABD} and {ABCF}, the State Graph generated is shown as Figure 3a. According to Property 1, if a new frame with objects {EG} arrives,
Figure 3: Adding New Edges to SSG
the state with object set {AB} can be omitted due to empty object intersection between the new frame and any existing principal states.
Assume a new frame with {ABDF} arrives. A new state {ABF} can be generated by computing object intersection between {ABCF} and the new frame (Figure 3b). However, if we simply connect states as the way they are generated, we will obtain a graph with unnecessary edges (Figure 3c). To remove these unnecessary edges, we propose to generate graphs maintaining the following property:
is the edge set containing all of the edges starting from
(
, we always have
We use Strict State Graph (SSG) to denote a State Graph that satisfies the above property. The graph that conforms to property 2 is shown in Figure 3d. To leverage the structure of SSG, two crucial issues have to be addressed: how to maintain the edges of the graph, and update the marked frame set for each state. We first focus on edge maintenance in Section 4.3.3, and address the state marking issue in Section 4.3.6. Then we provide the main algorithm in Section 4.3.7. The algorithm analysis will be presented in Section 4.3.8.
4.3.3 Edge Maintenance
Since other states are generated by principal states directly or indirectly (Section 4.2), we always maintain the set of principal states in the current window, denoted as PS.
When a new frame i arrives, let ns be the principal state created by i. We start traversing the graph from one of the states in PS (note that ns will be added to PS after all members of PS are visited). Assume during traversal of the graph we are visiting state s (s could be a principal state or another existing states on the graph). Utilizing the Strict State Graph, state s will be handled during traversal by the Graph Maintenance Procedure as follows:
1. We first compute the object set intersection between ns and state s. Let inter denote the intersection result,
2. If . That means for any state adjacent to s, the object intersection between that state and ns is empty. Thus, all of the adjacent states no longer need to be processed.
3. If . We add frame i to the frame set of s.
4. Otherwise, we need to further compute object intersections between ns and the adjacent states of s. Let denote the set of adjacent states of s:
(a) If , then the state with object set inter is already on the graph and no new states or edges need to be created.
(b) If , a new state with object set inter is created and added to the graph. We say that the new state with the object set, inter, can be generated from s.
5. After all states are visited according to above 4 steps, the new principal state, ns, will be added to the graph. To satisfy Property 2, we maintain edges in an SSG in two steps:
1. Modifying Existing Edges. When a new state is generated by intersecting objects between the new principal state and existing states (step 4.b of the Graph maintenance Procedure), we need to check whether existing edges should be deleted, and add new edges to the graph to satisfy Property 2.
2. Connecting the New Principal State. When a new principal state is added to the graph, existing states should be carefully chosen to build new edges (step 5 of the Graph maintenance Procedure) while guaranteeing Property 2.
4.3.4 Modifying Existing Edges
Suppose a new principal state ns is created by a new frame, and state is being visited (
could either be an existing principal state or other states on the graph).
another state that can be generated from
Assume after adding the edge, (), Property 2 is violated. That means,
to satisfy the property, edges have to be removed from the graph. To determine which edge should be removed, we need to check the object set intersections between
and the object sets of all adjacent states of
. In the above case, the intersection between object sets of state
will be the same as the object set in
, and thus edge (
) should be removed from the graph, while a new edge (
) should be added.
4.3.5 Connecting the New Principal State
To connect a new principal state to the graph, we first introduce the following theorem:
Theorem 2. Let PS be the set of principal states and ns be a new principal state. In an SSG, a state adjacent to
, such that
Thus, as per Theorem 2 one can obtain at most one state to connect ns starting from each state in PS during graph traversal. We use C to denote a set of states that could be adjacent to the new frame state ns. The set C is constructed as follows: Starting from each principal state, ps:
1. If , then no state will be added to C. 2. Otherwise, let s be the state with object set
could be an existing states in the graph (Graph Maintenance Procedure step 4.a), or a new state (Graph Maintenance Procedure step 4.b). In either case, state s will be added to C. For a state,
be the set containing all states that can be obtained by performing a DFS (Depth-First
Search) from s on the graph. States in C will be selected according to the State Selection Procedure as follows:
1. We sort the states in C according to (object set) in descending order, where
Meanwhile, an empty set, RS, is initialized, to materialize all reachable states from ns, and is dynamically updated.
2. State, s, is processed and selected as follows:
• If , then state s will be selected. All states from
will be added to RS.
• If will not be selected. After selecting the states, we can now add an edge between each selected state and the new principal state. The following theorem can be proved.
Theorem 3. The State Selection Procedure will not violate Property 2 while all newly generated states can be reached from the new principal state.
Now consider the example in Figure 3a again. When a new frame {ABDF} arrives: From state {ABCF}, a candidate, , with state {ABF} will be created, along with the reachable state set {{ABF}, {AB}}; From state {ABD}, a candidate,
, with state {ABD} will be created, along with the reachable state set {{ABD}, {AB}}. Since they have the same object set size, we can start from either candidate. Assume we select state from
first, after that the reachable state set for state {ABDF} is {{ABF}, {AB}}. {ABD} is not in the set, so we still need to select the state from
. After that, a new edges will be created between {ABD}, {ABF} and {ABDF}. The result is the same as Figure 3d, and Property 2 is maintained.
4.3.6 Marking the Frame Sets of States
During edge maintenance, the intersections between the new arriving frame and existing states are computed on SSG. To minimize the overall maintenance cost, we introduce State Marking Procedure to update the Marked Frame Set for each state at the same time.
For each new arriving frame state ns at frame i, we then visit the SSG starting from principal states in their arrival order. the marked frame set for each state is updated as follows:
1. Frame i in state ns will always be marked. 2. For each state s: • If a new state is generated from s. We copy the marked frame set from
, we also append the new frame i to its marked frame set.
• If , we simply append the new frame i to the marked frame set of s.
3. After state s is processed, it is marked as visited. The adjacent states of s will be further processed by applying the above rule recursively until there are no adjacent states or all adjacent states are already visited.
4. If some principal state, ps, (created from frame already visited, and
. This means from ps and ns, a new state
is generated. Since ps is already visited, the new state
must already created on the SSG. In this case, we mark the frame id
the marked frame set of
The correctness of the above procedure is guaranteed by the following theorem.
Theorem 4. Using the State Marking Procedure, a state is valid in SSG iff at least one frame in its frame set is marked.
4.3.7 The Algorithm
We now present the algorithm that builds SSGs. The algorithm consists of two main parts: State Traversal (ST), and Connecting the New Principal State (CNPS).
1. We maintain principal states in their order of arrival. The State Traversal (ST) algorithm implements the Graph Maintenance Procedure and State Marking Procedure. Meanwhile, a set of states, C (described in Section 4.3.5), will also be generated by the ST algorithm.
2. Connecting the New Principal State (CNPS) algorithm, implements the state selection procedure to add new edges between existing states and the new principal state, as discussed in Section 4.3.5. The ST algorithm is shown as Algorithm 1, which receives 6 parameters: the arriving frame id, , an existing state to visit, s, the parent state ps of state s, the new principal state, ns, the previous object intersection pInter (generated by ps and ns), and the current candidate c. Starting from principal states, pInter is initialized as an empty set, while its parent state ps and the current candidate c is set to null. To avoid visiting the same state multiple times, we include a flag to each state (shown as Lines 1-2). The flag will be set to the current frame id once the state is visited. For each state, we first remove expired frames from its frame set and prune the state if necessary (function pruneState Line 3). Then, the intersection between the object sets of states s and ns will be computed (Line 4). Lines 5-31 implement the Graph Maintenance Procedure (Steps 1-4), meanwhile, the frame set of each state is updated according to the State Marking Procedure (Steps 1-3). In the algorithm, we utilize two other functions, visitNext and createState: createState will create a state (or retrieve an existing one) with object set equal to inter and modify edges according to Section 4.3.4; while visitNext visits the adjacent vertices of the given state s, calling the ST algorithm recursively. After each principal state is visited by the ST Algorithm, the CNPS Algorithm is responsible for connecting the new principal state to the graph, which is described as the State Selection Procedure in Section 4.3.5, shown as Algorithm 2. In the algorithm, c stores a state (c.s) and the reachable state set (c.CS) from the state, both of which are obtained in the ST algorithm. A state s is satisfied and valid if
and at least one frame in
is marked. Let
which contains all satisfied and valid states in the window of frame i. The result state set,
, is sent to the Query Evaluation module. Meanwhile, we also keep a copy of set
When a new frame
arrives, let
be the new state graph. The new result,
of satisfied and valid states from
in the current window, while
is the set of satisfied and valid states obtained on the graph. This is because, for a new principal state, ns, we visit state
on the graph only if
States
in the previous result set
may still be valid and satisfied. If the intersection of object sets between state
and ns is empty, we may leave it out during graph traversal. Keeping only the previous result set is sufficient since for any existing unsatisfied state, s, it will become satisfied only if s is visited on the graph in the current window. Thus, if s is a satisfied state at frame , then it is either: a) visited in
the current window at frame , or b) was also satisfied in the window of the previous frame i.
4.3.8 Analysis
Assume the number of frames that share the same object set is on average. Given a window size
unique principal states will be generated. In the full version of the paper we prove:
Theorem 5. In SSG, there will be at most 2states and
number of edges. Thus, the complexity of algorithm ST is
5.1 The CNF Algorithm
Table 3: Inverted Index
For our query evaluation module any CNF evaluation algorithm can be employed. We choose to utilize the evaluation algorithm proposed in [24], which will be referred to as CNFEval.
The CNFEval algorithm evaluates a set of CNF queries that include (set membership) predicates. The algorithm first builds an inverted index for the starting set of queries. The inverted index is dynamically maintained as queries are inserted and/or deleted.
For example, query ), specifies a CNF on name-value pairs. Specifically, this query can be interpreted as follows: for name ”age”, the value is 2 or 3; for name ”state”, the value is ”CA”; and for name ”gender”, the value is ”F”.
An inverted index for containing posting lists is generated as shown in Table 3. A posting list is a list of triples that are generated according to query conditions. A triplet is represented as (qid, p, disjId), where qid is the query id, p is the predicate (
is the disjunction id (starting from 0) of the condition. For example, condition
is in the second disjunction, and thus a triplet (1,
, 1) is generated.
For a given set of name-value pairs, the algorithm retrieves posting lists from the inverted index first. By ordering and scanning all retrieved posting lists, queries can be answered correctly. For example, given an input {(age, 3), (gender, F)}, two posting lists (11) are retrieved. From the two triplets in the lists, we determine that both disjunctions 0 and 1 are satisfied. Thus, query
is evaluated as TRUE. The details of the entire algorithm are available elsewhere [24].
5.2 CNF EvaluationwithInequalityPredicates
The CNFEval algorithm only handles queries containing set predicates. As a result inequality conditions are not supported naturally. Thus, we propose to build three separate inverted indexes for the query conditions of the form label is used as the key, while n is used as the value), with an extra column, value, associated to each of the three cases,
=. Consider a query
3)
5) in our framework. Tables 4 and 5 depict the inverted indexes for this query. Since this query contains both
conditions, two inverted indexes are built. Each key in the inverted list is associated with an ordered list; ordering takes place by value in descending (if
) or ascending (if
) order. Given an input with a set of name-value pairs, posting lists are retrieved as follows. For each name-value pair, (k, v):
passes our enhancements, as CNFEvalE. It is utilized to evaluate queries on the Result State Set (Section 4.3.7) produced by the MCOS Generation module. The procedure to evaluate queries on video feeds is described as follows:
Table 4:
Table 5:
1. For the given queries, we first build inverted lists. 2. For each state, s, in the Result State Set (Section 4.3.7) generated by the MCOS Generation module: (a) A set of aggregate values, , is computed based on the MCOS of state s. Each aggregate value is represented as (
is the class label, while v is the number of objects of type l.
(b) The aggregate value set, , is utilized as input to CNFEvalE to produce the results.
(c) If some query, q, is evaluated as true, the frame set of the state, , is produced as the result.
5.3 Pruning States by Evaluation Results
Query evaluation proceeds on the MCOS of a state in the Result State Set (Section 4.3). We make the following observation: if queries are evaluated as FALSE on the MCOS generated by some state s, then they will always be evaluated as FALSE on every possible MCOS generated by state is generated from s.
This observation is captured by the following proposition:
Proposition 1. Let s be some state in an SSG, and be the MCOS generated by s. For each condition c in query q, if c is evaluated as FALSE on
, then for any state
with , the condition c will always be evaluated as FALSE on
If Proposition 1 holds, such states can be safely removed from the graph while guaranteeing correctness. Proposition 1 can be satisfied only if . To adopt such a pruning strategy, for a given set of queries, we test whether they contain
predicates only. If so, when a new state is generated (Graph Maintenance Procedure step 4.b), we evaluate the MCOS of that state using the Query Evaluation module. In this case, we do not wait for the entire Result State Set to be generated. If all queries are evaluated as FALSE, the state will be marked as ”terminated”. Terminated states will no longer be processed, and thus we can reduce the number of states maintained in the MCOS Generation module. This provides additional pruning flexibility to the MCOS generation module in certain cases.
In this section, we present a thorough experimental evaluation of the two proposed approaches (Marked Frame Set and Strict State Graph) on both synthetic and real datasets and study the tradeoffs between them. We first evaluate the MCOS generation methods proposed in this paper in Section 6.2. Then, based on MCOS generation , we show the performance of our Query Evaluation in Section 6.3.
6.1 Settings
Environment. Our experiments are conducted on a machine with Intel i5-9600K CPU, one GeForce GTX 1070 GPU and 16 GB RAM. All algorithms are implemented in Java except the object detection and tracking algorithms for which we utilize standard implementations.
Table 6: Dataset Statistics
Object Detection and Tracking. We adopt Faster RCNN [22] as the object detection algorithm to identify all objects in videos and Deep SORT [25] as the object tracking algorithm. For both algorithms, we utilize standard implementation from open-source projects
Synthetic Datasets. We use synthetic videos generated by the VisualRoad benchmark [6], which is proposed to evaluate the performance of video database management systems (VDBMSs). We select two representative videos provided on the project homepage: rain with light traf-fic (V1), and postpluvial with heavy traffic (V2). We also generate videos with different configurations to study the relationship between the performance and the number of objects per frame.
Real Datasets. We also evaluate our approach based on four videos from Detrac [18] and MOT16 datasets [20]. From Detrac, we use MVI 40171 (D1) and MVI 40751 (D2). From MOT16, we use MOT16-06 (M1) and MOT16-13 (M2).
Dataset Statistics. In the rest of this section, we use V1 and V2 to denote two synthetic videos, while using D1, D2, and M1, M2 to denote four real videos. D1 and D2 are captured by static cameras, while M1 and M2 are captured by moving cameras. In our experiments, we only identify objects belonging to one of the following classes: person, car, truck, and bus. After applying object detection and tracking algorithms, the statistical information of these videos can be summarized as Table 6. In the table, Frames and Objects represent the total number of frames and unique objects, respectively; Obj/F represents the average number of objects per frame; Occ/Obj represents the average occlusion times per object; while F/Obj denotes the number of frames that each object appears on average.
6.2 MCOS Generation
We implement a baseline approach that simply stores the set of frames that each object set appears. We first collect all the object sets that satisfy the given duration threshold, and then check whether they share the same frame set. If that is the case, according to the definition of MCOS, we only keep the object set with the maximum size. The refer to this approach as NAIVE. We also implemented the MFS approach as described in Section 4.2, as well as SSG as presented in Section 4.3.7. All three methods are memory-based, where a hash table is used to map the object sets with frame sets (NAIVE), or marked frame sets (MFS), or internal states (SSG). To study the performance of the three methods, we design experiments to measure only the MCOS generation time.
We vary the window size, w (default value is set to 300 frames), and the duration parameter, d, (default value is set to 240 frames). With 30 fps, the default setting aims to identify objects that appear at least 8 seconds in the window of 10 seconds. The number of occlusions per object is also varied during experiments. By default we do not vary occlusion; each video has a number of occlusions per object occurring naturally in each data-set (Table 6).
We then vary the number of total frames evaluated on datasets, shown as Figure 4. The performance of all methods demonstrates a similar general trend: when the number of frames increases, the total time also increases. In all three methods, MCOSs are generated according to the object sets, thus the number of distinct object sets can have a profound impact on the overall performance. Thus, even under the same window size and duration threshold, their performance on different videos can vary significantly. For example, in Figure 4c, processing the first 400 frames is very expensive, the total cost does not increase significantly as the number of frames increases. In contrast, processing the last 145 frames is much more costly on video D2 (Figure 4d). This is because these frames include more objects per frame than the average objects per frame for the clip. As a result, these frames contribute significantly to the number of states created overall and subsequently to the overall cost. MFS performs slightly better than SSG on all videos generated by the visualroad benchmark. These videos have smaller number of objects per frame, while the duration for each object in the visible screen (F/Obj in Table 6) is relatively longer than other videos. In such scenarios, the number of unique states is relatively small, while most states can be generated directly from principal states. The pruning power of SSG is not significant compared with MFS since the graph structure does not provide significant advantage. Instead, extra cost is paid to maintain the graph in the case of SSG. In contrast, on other datasets (D1 to M2),the performance of SSG is better than MFS due to the pruning effectiveness attributed to the graph structure. For example, in Figure 4d, SSG is around 10% faster than MFS and 30% faster in Figure 4f. In these datasets, the number of objects per frame is much larger (Obj/F in Table 6) or the duration of objects is the visible screen is lower, thus more states are maintained.
Figure 4: Varying the Total Number of Frames
With a window size w = 300 frames, we vary the duration parameter d from 180 to 270 frames shown as Figure 5. The performance of all methods is relatively stable since the duration parameter only influences the size of the Result State Set. All possible states still have to be maintained and computed. Among all datasets, MFS performs best on V2, where the maximum speed up is more than 3 times than NAIVE, while SSG performs best on M2, where the maximum speedup is around 3.5 times than NAIVE. This observation is also consistent with the previous figures, since V2 has the least number of objects per frame, while M2 has the most.
Figure 5: Varying Duration d
We next vary the window size w and fix the duration parameter d = 240 showing the results as Figure 6. All methods require more time to compute as the window size increases since more states have to be maintained and computed. The NAIVE and MFS methods are penalized more, since they both have to compute object set intersection between each existing state and the new arriving frame. As we increase the window size, the trends remain overall the same. It is interesting to observe how data set characteristics affect performance. For example, Dataset M1, M2 are captured by moving cameras; this means the duration of each object in the visible screen (Obj/F in Table 6) is smaller and new objects are introduced frequently to the visible screen, leading to more unique states. By increasing the window size, SSG benefits most due to its pruning power, which can be observed from Figure 6e (where SSG is 40% faster than MFS) and 6f (where SSG requires almost half the time of MFS to execute). Such trends do not exist on other datasets generated or captured by static cameras (Figure 6a to 6d).
So far both real and synthetic datasets utilized the intrinsic number of object occlusions present in them. To introduce more occlusions (and be able to vary them) we reuse the same object id after an object disappears from the video. Thus, to simulate more occlusions, we introduce an occlusion parameter, . Each object id will be reused at most
times. Figure 7 depicts the results, where
varies from 0 to 3. In general, more occlusions increase the chances that the intersection of object sets between states is non-empty. This penalizes the overall performance as larger number of states have to be maintained. Such trends can be observed
Figure 6: Varying Window Size w
for both synthetic and real data sets. Since both MFS and SSG provide the ability to remove invalid states early, both methods can benefit from more occlusions. For example, in Figure 7a, MFS is more than 3.8 times faster than NAIVE when = 3, while SSG is more than 2.8 times faster. As
increases, more states may need to be explored on the graph, thus the pruning power is reduced. We can observe MFS can perform slightly better than SSG when
(Figure 7e).
Figure 7: Varying # Occlusions
In summary, it is evident that both MFS and SSG offer large performance improvements compared to the NAIVE method. On videos with small number of objects per frame, MFS could perform better than SSG. When the number of objects per frame is relatively larger and/or the duration of objects in the visible screen is relatively smaller, SSG has the best performance, up to more than 3 times faster than the NAIVE method and can be up to one time faster than MFS.
Figure 8: Varying # Queries
In general, the performance trade-off between MFS and SSG depends on the characteristics of different datasets. On datasets with more objects per frame and/or data sets captured by moving cameras (yielding shorter duration of objects in the visible screen and increased number of new objects entering the visible screen, thus more unique states), SSG achieves higher speedups, as is observed from experimental results.
6.3 Query Evaluation
Based on MCOS Generation methods, we benchmark our Query Evaluation module (Section 5.2). We vary the number of queries from 10 to 50, as shown in Figure 8. The y-axis depicts overall performance including both MCOS generation and query evaluation. The performance of each method remains almost the same even when the number of queries increases. Both MFS and SSG are more than 2 times faster than NAIVE in Figure 8a, while SSG is even better than MFS in Figure 8b, with an overall speedup of more than 3. Thus, it is evident that the overheads of query evaluation on MCOS are negligible compared to that of state maintenance. To evaluate the pruning strategy introduced in Section 5.3, we implemented five different methods: MFS O and SSG O, which implement the pruning strategy based on MFS and SSG methods; NAIVE E, MFS E and SSG E, which only follow the CNFEvalE method proposed in Section 5.2. We generate 100 queries containing conditions only. Let C be the set containing all conditions from the given queries; we use
to denote the minimum value in all conditions, where
is the threshold value in condition c (conditions are of the form
where x is the object class). We vary
from 1 to 9, the evaluation result on real datasets is shown as Figure 9. As
increases, methods adopting SSG perform better than those adopting MFS in general. For example, in Figure 9c, when
= 3, SSG O is 3 times faster than NAIVE, while MFS O is only 2 times faster. Similarly, in Figure 9b, when
= 5, SSG O is also around one time faster than MFS O compared to NAIVE, where the speedup is five times. When
is large enough, we observe a significant performance improvement on all methods that adopt the pruning strategy (MFS O and SSG O). Among them, the SSG O has the best performance in general. When
= 1, methods adopting the pruning strategy are slightly more expensive on some datasets (Figure 9a, 9b) since more states in the graph are evaluated on the given queries. We observe that methods based on SSG are always the best compared to other approaches, while the pruning strategy offers significant improvements on both MFS and SSG methods. When the pruning strategy is enabled, the optimized approach provides more than 100 times speedup on real datasets, as is evident in Figure 9, where the
is set to 9. We finally present the performance of the three methods in
Figure 9: Varying
Figure 10: End-to-end Evaluation Time
an end-to-end manner, as shown in Figure 10. We measure the average time per query (issuing 50 queries and computing the average) in seconds for each dataset (the lower the better), including the object detection and tracking time. As can be observed, both MFS and SSG have leading performance, among which SSG has the best performance overall.
Several pieces of recent work focus on different aspects of declarative query processing over video feeds [10, 11, 12]. Kang et. al. [12] present an approach to filter frames via inexpensive filters for query processing purposes. Subsequently they present query processing techniques for spe-cific types of aggregates over the video feed [10, 11]. On a related thread, recent declarative query processing frameworks support queries with spatial constraints among objects on video feeds [13, 27] as well as interactions among objects in the visible screen [2]. Other proposals, [8, 17] present system approaches to query processing across video feeds and are concerned mainly with scalability aspects. In the vision community, several Deep Learning approaches for object detection and classification have been proposed with impressive accuracy [3, 4, 7, 15, 23]. Although not directly related, some work has been conducted in the context of mining frequent itemsets over sliding windows. The main focus of these works is to produce itemsets given a support/confidence threshold [1,9,21].
We considered the problem of evaluating CNF temporal queries over video feeds. Utilizing DL approaches, it is possible to extract valuable context from frames and enable advanced query processing. We introduced the Marked Frame Set and Strict State Graph approaches to maintain context that aims to reduce the overheads of processing new frames and facilitate CNF query evaluation. We demonstrate via experiments that our approaches yield vast performance improvements during query execution over alternate approaches.
[1] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Fourth IEEE International , pages 59–66. IEEE, 2004.
[2] N. K. Daren Chao and I. Xarchakos. Svq++: Querying for object interactions in video streams. In Proceedings of ACM SIGMOD, Demo Track, 2020.
[3] R. B. Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1440–1448, 2015.
[4] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587, 2014.
[5] I. J. Goodfellow, Y. Bengio, and A. C. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.
[6] B. Haynes, A. Mazumdar, M. Balazinska, L. Ceze, and A. Cheung. Visual road: A video data management benchmark. In Proceedings of the 2019 International Conference on Management of Data, pages 972–987. ACM, 2019.
[7] K. He, G. Gkioxari, P. Doll´ar, and R. B. Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988, 2017.
[8] K. Hsieh, G. Ananthanarayanan, P. Bodik, S. Venkataraman, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 269–286, Carlsbad, CA, Oct. 2018. USENIX Association.
[9] N. Jiang and L. Gruenwald. Cfi-stream: mining closed frequent itemsets in data streams. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 592–597, 2006.
[10] D. Kang, P. Bailis, and M. Zaharia. Blazeit: Fast exploratory video queries using neural networks. In https://arxiv.org/abs/1805.01046.
[11] D. Kang, P. Bailis, and M. Zaharia. Challenges and opportunities in dnn-based video analytics: A demonstration of the blazeit video query engine. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.
[12] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia. Noscope: Optimizing neural network queries over video at scale. Proc. VLDB Endow., 10(11):1586–1597, Aug. 2017.
[13] N. Koudas, R. Li, and I. Xarchakos. Video monitoring queries. In Proceedings of IEEE ICDE, 2020.
[14] S. Krebs, B. Duraisamy, and F. Flohr. A survey on leveraging deep neural networks for object tracking. In 20th IEEE International Conference on Intelligent
Transportation Systems, ITSC 2017, Yokohama, Japan, October 16-19, 2017, pages 411–418, 2017.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
[16] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[17] Y. Lu, A. Chowdhery, and S. Kandula. Optasia: A relational platform for efficient large-scale video analytics. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC ’16, pages 57–70, New York, NY, USA, 2016. ACM.
[18] S. Lyu, M.-C. Chang, D. Du, L. Wen, H. Qi, Y. Li, Y. Wei, L. Ke, T. Hu, M. Del Coco, et al. Ua-detrac 2017: Report of avss2017 & iwt4s challenge on advanced traffic monitoring. In Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on, pages 1–7. IEEE, 2017.
[19] X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai. Minimum error bounded efficient 1 tracker with occlusion detection. In CVPR 2011, pages 1257–1264. IEEE, 2011.
[20] A. Milan, L. Leal-Taix´e, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[21] F. Nori, M. Deypir, and M. H. Sadreddini. A sliding window based algorithm for frequent closed itemset mining over data streams. Journal of Systems and Software, 86(3):615–623, 2013.
[22] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[24] S. E. Whang, H. Garcia-Molina, C. Brower, J. Shanmugasundaram, S. Vassilvitskii, E. Vee, and R. Yerneni. Indexing boolean expressions. Proceedings of the VLDB Endowment, 2(1):37–48, 2009.
[25] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
[26] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2411–2418, 2013.
[27] I. Xarchakos and N. Koudas. Svq: Streaming video queries. In Proceedings of ACM SIGMOD, Demo Track, 2019.
Table 7: Table of Notations
Theorem 1. The set of marked frames in each state is a Key Frame Set.
to denote the maximum frame id, i.e.
must be one of the following cases:
1. is marked. This means the object set can be directly generated by frame
. Assume the frame set is marked properly before frame
arrives. We use
to denote the key frame set. Since
can generate the object set at frame
is no longer a valid key frame set. According to the definition,
has to be added to it. Since
is marked, the new marked frame set is a valid key frame set.
2. is not marked. Then s must can be generated by two or more states. We use
denote the set of states that generates s. Assume all the marked frame set in states in
are key frame sets. We use
to denote the key frame set of state s. Then we have
Suppose the marked frame set of the new state is not a key frame set. Then
which is not a key frame. According to the definition, we can still obtain
if we remove
from the key frame set. However,
, which means
be removed from the marked frame set of
a contradiction. Suppose a key frame
is missing in the generated state, Then
can still be generated if we remove all of the marked frames. Assume
, we remove the marked frames from its frame set and use
to denote the object set generated by the rest of the frames. Then we have
. However, for each
Which means,
. Thus the marked frame set is not a key frame set for some state
which is a contradiction.
Theorem 2. Let PS be the set of principal states and ns be a new principal state. In an SSG, a state adjacent to
, such that
Proof. Assume s is adjacent to . Then there must be some principal state
PS, such that
. For a State Graph, s must can be reached from some principal state
is adjacent to ns, s can be generated by intersecting objects between ns and other existing states. Assume there are some state
with object set
must also be adjacent to ns. However, since
both are adjacent to
, which violates the property 2.
Theorem 3. The State Selection Procedure will not violate Property 2 while all newly generated states can be reached from the new principal state.
Proof. Assume it breaks the property 2. connected in the graph through one or more edges. Suppose
can be reached from
. Then we have
. Since we sort C before selection, the only case is
is obtained using DPS, which means,
can not be reached from
, which leads to a contradiction.
Assume some edge () is missing, where
, such that
, it is reachable from state
, which is a contradiction.
Theorem 4. Using the State Marking Procedure, a state is valid in SSG iff at least one frame in its frame set is marked.
Proof. When there is only one principal state, the marked frame set is always correct. Assume after processing frame i (which principal state is ns), all valid states have at least one frame marked. According to the State Marking Procedure, after processing a new arriving frame ns, all states that are principal states or generated by some principal states directly will be marked.
Suppose at frame with principal state
valid state s, no frame is marked. Then s must be generated by principal states indirectly. We use
to denote the set of states that are adjacent to s. Since s is valid, s can be generated by computing intersections between some existing state,
. By applying this rule repeatedly, eventually, we have some state
, which is generated by some existing principal state
. Since at least one frame is marked in
, which is a contradiction to the assumption.
If a state, s, is invalid. Then starting from all the principal states from the current window on the SSG, state s can not be generated. State s must be generated from some previous window with some marked frame, , expired. Since s can no longer be generated in the current window, the marked frames of s will not be updated. Therefore, no frame is marked in state, s, which is invalid and will be pruned.
Theorem 5. In SSG, there will be at most 2states and
number of edges. Thus, the complexity of algorithm ST is
Proof. Suppose the number of frames that share the same object set is on average. Given window size
of unique principal states will be generated. We use
to denote the number of unique principal states. We use
to denote the number of objects in each frame on average.
We first consider the case where all principal states share some objects in common. In this case, if we revert the direction of each edge, we can obtain a tree, where the root is the state with the common object set among all frames, and the leaves are principal states.
In the worst case, starting from , every possible state should be generated with different object set sizes. Suppose a new state can be generated by any two states, then
number of states can be generated with object size
1. The maximum number of states would be
+
. For each principal state with object set size of size, there will be at most
1 number of edges. Thus in the worst case, the total number of edges would be
). Normally,
we have the upper bound
Since we are using the DFS algorithm as the graph traversal algorithm, the complexity of the algorithm is O(V +