The ability to foresee what may happen in the future is one of the factors that make humans intelligent. Predicting the future state of the environment conditioned on the past and current states requires good perception and understanding of the environment as well as its dynamics. This ability allows humans to plan ahead and choose actions that shape the environment in their interest.
In this paper, we focus on improving the model’s understanding of the environment’s dynamics by simply observing it. Due to the availability of unlabeled video data, self-supervised learning from observations is very attractive compared to approaches that require explicit labeling of large amounts of data.
In comparison to the literature on understanding the current state of the environment (works typically known under the terms semantic segmentation or action classification), there is limited work addressing the problem of predicting future states. In this paper, we are interested in predicting future activities. Our work is different from most prior works on video prediction, as they focus on predicting whole frames of the future [27, 42, 5]. In the context of decision making systems, pixel-wise future prediction is too detailed and cannot be expected to enable longer prediction horizons than just a few frames. The strategy of Luc et al. [30], who predict the segmentation of future frames by forecasting the future semantics instead of raw RGB values, appears much more promising. We follow a similar strategy to predict abstract features and even increase the level of abstraction by dealing with activities rather than pixel-wise labeling; see Figure 1.
Figure 1: Future prediction at the activity level. From a spatio-temporal observation, we predict multiple feature hypotheses that suggest possible future activities. Each feature represents a future action and a future object. A language module generates a caption for this representation of the future.
This connects a large part of the problem with activity classification: before making predictions about future states, we must interpret the given video input and extract features that describe the current state of the environment. In order to cover not only the present state but also the context of the past, we build on the work by Zolfaghari et al. [44]. This work on activity classification samples frames from a large time span of the past and converts the context from these frames into a feature representation optimized for classifying the observed activity. We argue that this feature representation is a good basis for learning a representation of what is likely to happen in the future. While we keep the former untouched, we learn the latter from the time-course of the videos. This even allows us to learn the dynamics in a self-supervised way. In this paper, we report results for both supervised (with action class labels) and self-supervised (unlabeled videos) training.
The predicted future state is provided as an activity class label or as a caption generated by a captioning module based on the predicted representation. Since the future is non-deterministic, forcing the network to predict a single possible outcome leads to contradictory learning signals and will hamper learning good representations. Therefore, we use a multi-hypotheses scheme that can represent multiple futures for each video observation.
Moreover, we decouple the prediction of the action and the object involved in an activity. This allows the model to generalize the same action across multiple objects and to learn from only a few shots or even without observing all combinations during training.
Overall, we propose the first approach for Predicting Future Activity (PreFAct) over large time horizons. Our method involves four important components: (1) a future prediction module that transforms abstract features of an observation to a representation of the future; (2) decoupling of the future representation into object and action; (3) representation of multiple hypotheses of the future; (4) natural language caption of the future representation.
Future Image Prediction. Many existing approaches for future prediction focus on generating future frames [29, 39, 35]. Since predicting RGB pixel intensities is difficult and the future is ambiguous, these methods usually end up predicting a blurry average image. To cope with the non-determinism, Mathieu et al. [33] suggest using a multiscale architecture with adversarial training. Stochastic approaches [3, 24] use adversarial losses and latent variables to explicitly model the underlying ambiguities.
However, pixel-level prediction is still limited to a few frames into the future, especially when the scene is highly dynamic and visual cues change rapidly. Moreover, fine-detailed pixel-level future prediction is not necessary for many decision making systems.
Future Semantic Prediction. Many works also tackle future prediction in a more abstract way [38, 37, 23, 44]. Han et al. [14] introduced a stacked LSTM based method to learn the task grammar and predict the future using both RGB and flow cues. The key component of their method is the estimation of task progress, which uses separate networks for each level of granularity. This makes the approach not only inefficient but also very specific to each task, since the granularity level differs across activities and environments. To predict the starting time and label of the future action, Mahmud et al. [32] propose to use an LSTM to model long-term sequential relationships. More recently, Farha et al. [10] proposed a deep network to predict future activity. These methods rely on partial observations of the future, and their predictions are limited to a fixed time horizon into the future. Another very interesting future prediction task, required by autonomous driving, interactive agents, or surveillance systems, is forecasting the locations of objects or humans in the future [4, 9]. Fan et al. [9] introduced a two-stream network to infer future representations to predict future locations. Their method is limited to 1 to 5 seconds into the future. Bhattacharyya et al. [4] further addressed the multi-modality and the uncertainty of the future prediction by modeling both data and model uncertainties.
Future Feature Prediction. Predicting the future at the semantic level is easier and more appealing for many applications such as autonomous driving [31]. Vondrick et al. [42] predicted the visual representation of a future frame. This approach is based on a single frame and is therefore limited in terms of the dynamics of the actions; it also considers a short time horizon of 5 seconds into the future.
In contrast to these works, we restrict ourselves to looking only at the current observations to infer the future activity, without limiting the time horizon of the future prediction. Inspired by Luc et al. [30], we explicitly learn a translation from current features to future features. Moreover, we address the ambiguous nature of the future by predicting multiple possible future representations with their uncertainties.
Uncertainty Estimation in CNNs. Modern CNNs have been shown to be overconfident about their estimations [13], which makes them less trusted than traditional non-blackbox counterparts despite their high performance. Recently, well-calibrated uncertainty estimation for CNNs has gained significant importance as a way to tackle this shortcoming. One of the most popular uncertainty estimation methods for modern CNNs is MCDropout by Gal and Ghahramani [11, 18]. They show that using dropout over the weights yields easy and efficient sampling for model uncertainty. Lakshminarayanan et al. [22] propose using network ensembles instead of dropout ensembles for better uncertainty predictions. A less resource-expensive alternative is snapshot ensembling [15] over networks trained with Stochastic Gradient Descent with Warm Restarts [28]. None of these methods can avoid the sampling cost. Ilg et al. [16] propose multi-hypotheses networks (MHN) [25, 6, 36] for uncertainty estimation in order to overcome sampling. They show that the MHN is not only able to produce multiple samples in one forward pass but also provides state-of-the-art uncertainty estimates for optical flow. We therefore adapt the MHN to classification in order to estimate the uncertainty of future action classification.
Given an observation at a current time segment $t$, future activity prediction aims at estimating the activity class of the video segment at $t + \Delta$, where $\Delta$ is the prediction horizon. Rather than learning this mapping directly, which we show to be clearly inferior, we use the features from an activity classification network for the current time segment and learn a mapping from these current features to features in the future.
Figure 2: Training scheme for PreFAct. The model learns to transform features from the observed input to future features based on a supervised classification loss and/or a self-supervised regression loss.
A coarse view of the network training is shown in Figure 2. Figure 3 shows a more detailed view of the overall model. As the base action classification network, we use ECO [44]. Between the convolutional encoder and the fully connected layers of ECO, we add the future prediction module. We explore different designs for this module, shown in Figure 4, including three different inception blocks. Moreover, we evaluate six different ways of integrating these modules into the ECO architecture, as illustrated in Figure 5. Experimental results with these different options are presented in Section C.
The weights of the ECO base network stay fixed (both the convolutional encoder and the fully connected activity classifier), while the future prediction module is trained. The training scheme is illustrated in Figure 2. Ground-truth features in the future are simply extracted by running ECO on the future time segment (lower part of Figure 2), i.e., training can work in a self-supervised manner as regression without annotation of class labels. The corresponding training objective is simply the mean squared error between predicted features $\hat{f}_i$ and extracted features $f_i$ for video segment $i$:
$$\mathcal{L}_R = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{f}_i - f_i \rVert_2^2, \tag{1}$$
where $N$ is the number of training segments.
Concurrently, the future prediction module can also take the class labels of a labeled video into account. In this case, the weights are optimized for the cross-entropy loss on the activity class labels in the future time segment:
$$\mathcal{L}_C = -\frac{1}{N} \sum_{i=1}^{N} \log p_i(c_i), \tag{2}$$
where $p_i$ is the softmax output for the future activity prediction based on $\hat{f}_i$, and $c_i$ is the class label of the future segment.
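The two training objectives described above (feature regression against ECO features of the future segment, and cross-entropy on the future class label) can be illustrated with a minimal pure-Python sketch. The function names, the list-of-lists feature layout, and the `ce_weight` balancing factor are assumptions made for illustration; the text does not state how the two losses are weighted when trained jointly.

```python
import math


def regression_loss(predicted, extracted):
    """Squared L2 distance between predicted and extracted features, averaged over segments."""
    if len(predicted) != len(extracted):
        raise ValueError("need one extracted feature vector per prediction")
    total = 0.0
    for pred_vec, true_vec in zip(predicted, extracted):
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, true_vec))
    return total / len(predicted)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def classification_loss(logits_batch, future_labels):
    """Cross-entropy on the class label of the future segment, averaged over segments."""
    total = 0.0
    for logits, label in zip(logits_batch, future_labels):
        total -= math.log(softmax(logits)[label])
    return total / len(future_labels)


def joint_loss(predicted, extracted, logits_batch, future_labels, ce_weight=1.0):
    """Sum of both objectives; ce_weight is a hypothetical knob, not taken from the paper."""
    return regression_loss(predicted, extracted) + ce_weight * classification_loss(
        logits_batch, future_labels
    )
```

Only the regression term needs no annotation, which is what makes the self-supervised variant possible: the targets come from running the frozen feature extractor on the future segment.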
3.1. Class representation of objects and actions
One way to represent the result of future prediction is by activity classification. Human activities are characterized by the objects people interact with and the actions they perform. In activity learning, one often optimizes for the activity class directly, which leads to a combinatorial explosion of possible activities. Training directly on these can result in bad representations. For example, if in the training set the action ‘put’ is always combined with ‘plate’, the network will not learn the action ‘put’ but rather will recognize plates. Such a representation will not generalize to somebody putting a cup.
Understanding the relationship between actions and objects leads to a more comprehensive interpretation. For instance, if the model has already learned what a ‘put’ action means, it can more easily generalize to various scenarios such as ‘put butter’ or ‘put spoon’. This enables us to extend the model to unseen object-action combinations by providing only a very small set of samples. Therefore, we propose to decouple the object and action classes. Treating them as separate sets but learning them jointly still exploits the relationship between them. We will show the advantage of this decoupling in Section 5.6.
3.2. Video captioning
A richer way of representing the results of the prediction module is via language. A video caption usually has more details than an activity class label. For instance, the caption ‘put celery back into fridge’ conveys more details than the label ‘put celery’. We use an LSTM based architecture, semantic compositional networks [12], for generating a caption describing the future feature representation. The semantic concepts are trained separately from scratch for each video dataset. These concepts are used to extend the weight matrices of the caption-generating decoder, as described in [12].
If the next action is deterministic and depends only on the previous action, learning the mapping from the present to the future is almost trivial. It is a simple look-up table to be learned. However, the future action typically depends on subtle cues in the input and contains non-deterministic elements. Thus, multiple reasonable possibilities exist for the future activity. Therefore, we propose learning multiple hypotheses with their uncertainties, similar to multi-hypotheses networks (MHN) [16, 25, 6, 36]. In our setting, a multi-hypotheses network is used for predicting multiple feature vectors corresponding to the various possible outcomes together with their uncertainties. Each hypothesis yields the object and action class together with their class uncertainties; see Figure 3. We have separate uncertainties for objects and actions because each task has different uncertainty levels. For instance, if a person is washing and there is a spoon, a plate, and a knife in the sink, the uncertainty for the chosen object will be much higher than for the action. The feature uncertainties allow the captioning LSTM to reason about which features are most reliable.
Figure 3: Overview of the future prediction model: PreFAct. It consists of two main modules. Module “A” is the future representation learning stage: given the current observation, the model learns to transform current features into future representations and to predict the labels of the future object and action. Module “B” is an LSTM based network which captions the future by fusing the multiple feature representations and their corresponding uncertainties.
To model the data uncertainty (aleatoric), the network yields the parameters of a parametric distribution, e.g., a Gaussian or Laplacian. This enables learning not only the mean prediction but also its variance, which can be interpreted as uncertainty. To cover the model uncertainty (epistemic), however, sampling from the network parameters is needed to compute the variation inherent within the model. Multiple-hypotheses networks create multiple samples in one forward pass, which approximates sampling from the network in a very efficient way.
4.1. Feature uncertainties
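Following Ilg et al. [16], we model our posterior by a Laplace distribution parameterized by median $a$ and scale $b$ for the ground-truth feature $f$ as:
$$p(f \mid a, b) = \frac{1}{2b} \exp\left(-\frac{|f - a|}{b}\right). \tag{3}$$
During training, we minimize its negative log-likelihood (NLL), which up to an additive constant is:
$$\mathcal{L}_{NLL}(a, b, f) = \frac{|f - a|}{b} + \log b. \tag{4}$$
As commonly done in the literature, we predict $\log b$ instead of $b$ for more stable training.
To also include the model uncertainty, we minimize the multi-hypotheses loss:
$$\mathcal{L}_{MH} = \sum_{i} \mathcal{L}_{NLL}\left(a_i^{k_i^*}, b_i^{k_i^*}, f_i\right), \tag{5}$$
where $a_i^k$ and $b_i^k$ are the median and scale predicted by hypothesis $k$ for training sample $i$. For each training sample $i$, only the best feature among all hypotheses is penalized while the others stay untouched. The best feature is defined as the hypothesis closest to its ground truth in terms of distance as follows:
$$k_i^* = \arg\min_{k \in \{1, \dots, T\}} \lVert a_i^k - f_i \rVert. \tag{6}$$
This winner-takes-all principle, also used in [16, 25, 6, 36], fosters diversity among the hypotheses.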
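The winner-takes-all selection in Section 4.1 can be sketched as follows: each hypothesis predicts a median and a log-scale per feature dimension, the hypothesis whose median is closest to the ground truth is selected, and only that hypothesis receives the Laplace NLL penalty. This is a minimal sketch under stated assumptions: the Euclidean distance is assumed (the text only says "distance"), the constant log 2 term is dropped, and all function names are hypothetical.

```python
import math


def laplace_nll(target, median, log_scale):
    """Laplace negative log-likelihood summed over feature dimensions (constant log 2 dropped)."""
    total = 0.0
    for f, a, log_b in zip(target, median, log_scale):
        # Predicting log b keeps the scale positive and the loss numerically stable.
        total += abs(f - a) / math.exp(log_b) + log_b
    return total


def best_hypothesis(target, medians):
    """Index of the hypothesis whose median is closest to the target (Euclidean, an assumption)."""
    def distance(median):
        return math.sqrt(sum((a - f) ** 2 for a, f in zip(median, target)))

    return min(range(len(medians)), key=lambda k: distance(medians[k]))


def winner_takes_all_loss(target, medians, log_scales):
    """Penalize only the best hypothesis; the others receive no gradient signal."""
    k = best_hypothesis(target, medians)
    return k, laplace_nll(target, medians[k], log_scales[k])
```

Because the losing hypotheses are left untouched, they are free to specialize on other plausible futures, which is the source of the diversity mentioned above.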
4.2. Classification uncertainties
For the classification loss, we model the data uncertainty as the learned noise scale [18]. In order to learn both the score maps and their noise scale, we minimize the negative expected log likelihood:
$$\mathcal{L}_{cls} = -\log \frac{1}{T} \sum_{t=1}^{T} \frac{\exp(\hat{s}_{t,c})}{\sum_{c'} \exp(\hat{s}_{t,c'})}, \tag{7}$$
where $c$ is the observed class and $\hat{s}_t = s + \sigma \epsilon_t$, with $\epsilon_t \sim \mathcal{N}(0, I)$, are predicted logits corrupted by Gaussian noise with the learned noise scale $\sigma$. Note that both $s$ and $\sigma$ are learned by the network. This formula can be interpreted as first corrupting the logits with noise $T$ times, where $T$ is the number of hypotheses, and normalizing them by softmax to get pseudo-probabilities $p_t(c)$, then averaging over these $T$ pseudo-probabilities to get the final pseudo-probabilities $p(c)$. Finally, the cross-entropy loss is applied to the pseudo-probabilities.
Figure 4: Future feature transformation modules. We considered fully-connected layers, convolutional layers, and different inception modules.
Figure 5: Architectures for future representation learning. The transformation modules from Figure 4 can be integrated into the ECO architecture at various places. We considered 6 variants shown in this figure.
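The sampling interpretation of the classification loss in Section 4.2 (corrupt the logits with Gaussian noise several times, apply softmax, average the pseudo-probabilities, then take the cross-entropy) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the per-class `noise_scales` layout, the function name, and the seeded default random generator are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def noisy_softmax_nll(logits, noise_scales, label, num_samples=8, rng=None):
    """Average softmax outputs over noise-corrupted logits, then apply cross-entropy.

    logits and noise_scales hold one value per class; both would be network outputs.
    """
    rng = rng if rng is not None else random.Random(0)
    mean_probs = [0.0] * len(logits)
    for _ in range(num_samples):
        # Corrupt every logit with Gaussian noise of the predicted scale.
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s, sigma in zip(logits, noise_scales)]
        for c, p in enumerate(softmax(noisy)):
            mean_probs[c] += p / num_samples
    return -math.log(mean_probs[label]), mean_probs
```

With all noise scales set to zero the function reduces to the ordinary softmax cross-entropy, which is a convenient sanity check.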
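From variances to uncertainties. For both feature regression and object/action classification, we compute the final uncertainties as the entropy of the distributions:
$$H_{feat} = \log(2be), \qquad H_{cls} = -\sum_{c} p(c) \log p(c), \tag{8}$$
i.e., the entropy of the Laplace distribution with scale $b$ and the entropy of the categorical distribution $p(c)$, respectively.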
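Both entropies have simple closed forms: a Laplace distribution with scale b has entropy 1 + log(2b), and a categorical distribution has entropy equal to the negative sum of p log p. The sketch below computes them from a predicted log-scale and from class pseudo-probabilities; the function names are hypothetical, and how per-dimension feature entropies are aggregated into one value per hypothesis is not specified in the text.

```python
import math


def laplace_entropy(log_scale):
    """Entropy of a Laplace distribution, 1 + log(2b), given the predicted log b."""
    return 1.0 + math.log(2.0) + log_scale


def categorical_entropy(probs):
    """Entropy of class pseudo-probabilities; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A larger predicted scale or a flatter class distribution both yield a higher entropy, i.e., a higher uncertainty.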
5.1. Datasets
Since future activity prediction has received little attention so far, there is no dedicated dataset for this task. We conducted our experiments on the Epic-Kitchens dataset [7] and the Breakfast dataset [20]. Both show sequential meal preparation activities with sufficient diversity. These two are the most suitable datasets for our task since they include temporally meaningful actions which follow each other in a procedural way, e.g., “Peel Potato” is followed by “Cut Potato”.
The Epic-Kitchens dataset [7] includes videos of people cooking in a kitchen from a first-person view. Each video is divided into multiple video segments. In total there are 272 video sequences with 28,561 activity segments for training/validation and 160 video sequences with 11,003 action segments for testing. These segments are annotated using a total of 125 verb and 331 noun classes. From the video sequences in the training/validation set, we randomly choose 85% of the videos for training and 15% for validation.
The Breakfast dataset [20] includes meal preparation videos of 10 common breakfast items in third-person view. On average each item has 200 preparation videos, where some videos show the same scene from multiple camera angles. All videos are divided into multiple video segments which are annotated with one of 48 predefined activity labels. We convert activity classes into object and action classes, e.g., the activity “take cup” becomes action “take” and object “cup”. All videos in this dataset are provided by 52 participants. We use data from 39 of these participants for training and data from the remaining 13 participants for testing.
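The conversion of composed activity labels into separate action and object classes can be sketched as a simple string split. The label format (action first, then object, separated by a space) and the helper names are assumptions for illustration; the actual annotation files may use a different layout.

```python
def decouple_activity(label, separator=" "):
    """Split an activity label such as 'take cup' into (action, object).

    Assumes the first token is the action; labels without an object yield None.
    """
    action, _, obj = label.strip().partition(separator)
    return action, (obj if obj else None)


def build_vocabularies(activity_labels):
    """Collect the separate action and object class sets from composed labels."""
    actions, objects = set(), set()
    for label in activity_labels:
        action, obj = decouple_activity(label)
        actions.add(action)
        if obj is not None:
            objects.add(obj)
    return sorted(actions), sorted(objects)
```

With the labels "take cup", "take plate", and "put plate", the decoupled vocabularies already cover the unseen combination "put cup", whereas a composed activity vocabulary would not contain it.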
5.2. Evaluation metrics
For classification, we use accuracy as the quantitative measure, i.e., the fraction of correctly predicted classes among all predictions.
The captioning models are evaluated using the standard metrics BLEU (B-1) [34], ROUGE-L [26], METEOR [8], and CIDEr [41]. BLEU (B-1) calculates the geometric mean of n-gram precision scores weighted by a brevity penalty. ROUGE-L measures the longest common subsequence between the generated caption and the ground truth. METEOR is defined as the harmonic mean of precision and recall of matched unigrams between the generated caption and its ground truth. CIDEr measures the consensus between the generated caption and the ground truth.
For evaluating the quality of the uncertainty predictions we use reliability diagrams [13]. A reliability diagram plots the expected quality as a function of uncertainty. If the model is well calibrated, this plot should show a diagonal decrease.
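A reliability diagram of this kind can be computed by sorting samples by predicted uncertainty, grouping them into bins, and averaging the quality within each bin. The sketch below uses equal-count bins, which is an assumption; the text does not specify the binning scheme or the exact quality measure, and the function name is hypothetical.

```python
def reliability_curve(uncertainties, qualities, num_bins=10):
    """Return (mean uncertainty, mean quality) per bin of samples sorted by uncertainty.

    Equal-count binning is an assumption made for this sketch.
    """
    if len(uncertainties) != len(qualities):
        raise ValueError("need one quality value per uncertainty")
    pairs = sorted(zip(uncertainties, qualities))
    n = len(pairs)
    curve = []
    for b in range(num_bins):
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue  # more bins than samples: skip empty bins
        mean_u = sum(u for u, _ in chunk) / len(chunk)
        mean_q = sum(q for _, q in chunk) / len(chunk)
        curve.append((mean_u, mean_q))
    return curve
```

For a well-calibrated model the mean quality should fall from bin to bin as the mean uncertainty rises.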
5.3. Implementation and training details
We base our feature extraction module on ECO [44]. Following the original paper, we take the ECO model which was pretrained on Kinetics [17] and then further trained on Breakfast or Epic-Kitchens, depending on the dataset used in the experiments. When we retrain ECO for the baseline comparisons, we follow the design choices of the original paper unless mentioned otherwise. We provide all details in the supplementary material. Data augmentation is also applied as in the original work. Keeping the ECO feature extractor fixed, we train our future representation module, which is initialized randomly. We use a mini-batch SGD optimizer with Nesterov momentum of 0.9, weight decay of 0.0005, and mini-batches of 64. We utilize dropout after each fully connected layer. For the multi-hypotheses experiments we fix the number of hypotheses (T) to 8.
We extract frames from the video segments following the sampling strategy explained in the original paper. In this sampling, each segment is split into 16 subsections of equal size, and from each subsection a single frame is randomly sampled. This sampling provides robustness to variations, enables the network to fully exploit all frames, and enables us to predict arbitrary horizons into the future.
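The sampling strategy described above can be sketched as follows: the segment is divided into 16 equal subsections and one frame index is drawn uniformly at random from each. The function name and the handling of segments shorter than 16 frames (raising an error) are assumptions for this sketch.

```python
import random


def sample_frame_indices(num_frames, num_subsections=16, rng=None):
    """Draw one random frame index from each equal-size subsection of a segment."""
    if num_frames < num_subsections:
        raise ValueError("segment has fewer frames than subsections")
    rng = rng if rng is not None else random.Random()
    indices = []
    for k in range(num_subsections):
        start = k * num_frames // num_subsections        # inclusive
        end = (k + 1) * num_frames // num_subsections    # exclusive
        indices.append(rng.randrange(start, end))
    return indices
```

Because every subsection contributes exactly one frame, the number of network inputs stays fixed regardless of segment length, while repeated epochs see different frames of the same segment.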
5.4. Comparison of feature translation modules
Table 1 compares different design choices for the feature translation module, as depicted in Figures 4 and 5. The architectures M3 and M6 provide the best performance. M3 corresponds to placing a grid inception block with 2D convolutions before the last two fully connected layers. M6 is the same with 3D convolutions. For the rest of the experiments, we used M3 as the feature transformation module. In these experiments, we report the accuracy on the (composed) activity recognition for the Breakfast dataset and on the (single) action recognition for the Epic-Kitchens dataset. This differs from the results we provide in the following sections, where we evaluate our decomposed action/object classes separately.
5.5. Results on future activity prediction
Due to the lack of previous work on this problem, we compare to some simple and some more in-depth baselines. A comparison of these baselines is shown in Table 2. As an upper bound, the table provides the classification accuracy with ECO where the time frame of interest is observed, i.e., the problem is a standard classification problem without a future prediction component.
Table 1: Comparison of design choices for the feature transformation modules. M3 and M6, as shown in Figures 4 and 5, are best.
Table 2: Next activity classification accuracy on the Breakfast (Brk) and Epic-Kitchens (Epic) datasets. “A”: Action, “O”: Object. PreFAct improves over all baselines, including the ECO baseline, which corresponds to learning the mapping from observation to future class label directly.
The ”largest class” baseline just assigns the label of the largest class in the training data to the label of the future activity. This is the accuracy achieved by simply exploiting the data imbalance of the datasets.
The “copy current label” baseline performs activity classification on the current observation and considers the predicted current label as the future activity label. This approach only works in cases where the action or the object does not change over time.
The “association rule mining” baseline picks, as the label of the future activity, the activity that most frequently follows the currently observed activity. See supplementary material for details about each baseline.
As can be seen from Table 2, the results we get with the future activity prediction network (PreFAct) are much better than these simple baselines. PreFAct-C denotes the network that was trained in a supervised manner using only the class labels, whereas PreFAct-R was trained only in a self-supervised manner without using class labels. PreFAct-R+C used both losses jointly for training. As expected, supervised training works better than only self-supervised training, and using both losses works marginally better than only supervised training. Note that self-supervised learning can leverage additional unlabeled video data. We explore this in more detail in Section 5.7.
The two most interesting baselines are the two state-of-the-art methods on video understanding - ECO [44] and Epic-Kitchens [7] - which we trained to predict future activities rather than the present activity. For a fair comparison, we modified the methods such that they provide both object and action classes rather than a single activity class. PreFAct clearly improves over this ECO baseline, which shows that the future prediction module is advantageous over directly learning the mapping from the observation to the future activity.
PreFAct-MH shows results with our multi-hypotheses network. Generating multiple hypotheses has the potential to lead to a significant performance improvement, as demonstrated by the Oracle selection, where the best hypothesis is selected based on the true label. In contrast, automated selection of the best hypothesis via the uncertainty estimates does not lead to a significant difference over the version with a single hypothesis. This is consistent with the findings in other works, which showed good uncertainty estimates but could not benefit from these hypotheses to select the best solution within a fusion approach.
5.6. Learning unseen combinations
The decomposition of activities into the action and the involved object allows us to generalize to new combinations not seen during training. Table 3 shows results on an object-action pair when all pairs of the specified object with the 5 actions in the top row were completely removed from the training data. In brackets are the numbers when these activities were part of the training set. In most cases, the approach is able to compensate for the missing object-action pairs by using the information from a related object or another action not among the 5 actions.
5.7. Self-supervised learning
While the self-supervised regression loss yields inferior results compared to supervised training on class labels, self-supervised learning has the advantage that it can be run effortlessly on unlabeled video data.
Table 4 shows how the self-supervised learning improves when adding extra unlabeled data S1 and S2 provided by the Epic-Kitchens dataset. S1 contains 8048 samples of seen kitchens, and S2 contains 2930 samples of unseen kitchens. The improvement is small but increases as more data is added.
5.8. Future captioning
We use semantic compositional networks [12] for captioning current and future video features. For each dataset, we obtain a separate semantic concept detector by training a multi-label classifier for the set of selected concepts from each dataset. For most experiments, we use the full vocabulary as the set of concepts.
Table 3: Action prediction of current observation for unseen object-action combinations on the Epic-Kitchens dataset. For each object-action pair, we report the accuracy when excluding all 5 actions for these objects during training. The numbers in parentheses indicate the accuracy when training on the entire dataset. Our decomposition of action/object classes helps compensate for the missing information.
Table 4: Performance of the self-supervised learning (PreFAct-R) on the Epic-Kitchens dataset (left/right: object/action), as additional unlabeled data S1, S2, or both are added for training. The more unlabeled data, the better.
Figure 6: Qualitative results on the Breakfast dataset obtained from our captioning module with feature fusion. Only the TOP3 hypotheses are visualized.
Our feature representations provide multiple features and classes with their uncertainties. We explored various options on how to fuse and feed this information into the captioning module (Fig. 3(B)). They are tagged as in Table 5. We use the class certainties to select features: the feature yielding the highest class certainty (BEST), the three features with the highest certainties (TOP3), and all features (ALL). For fusion of the selected features, we considered concatenating them with their certainties (concat), and multiplying them with their certainties and then concatenating (mult). We obtain the certainties by first normalizing the uncertainties to the (0,1) range and then subtracting them from 1.
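The selection and fusion options can be sketched as follows. This is a minimal illustration under several assumptions: min-max normalization is used to map uncertainties to the unit range (the text does not name the normalization), each hypothesis carries a single scalar certainty, and all function names are hypothetical.

```python
def certainties_from_uncertainties(uncertainties):
    """Normalize uncertainties to the unit range (min-max, an assumption) and invert them."""
    lo, hi = min(uncertainties), max(uncertainties)
    if hi == lo:
        return [1.0] * len(uncertainties)  # all hypotheses equally certain
    return [1.0 - (u - lo) / (hi - lo) for u in uncertainties]


def select_hypotheses(certainties, mode):
    """Indices of the hypotheses to keep, most certain first: BEST, TOP3, or ALL."""
    order = sorted(range(len(certainties)), key=lambda k: certainties[k], reverse=True)
    keep = {"BEST": 1, "TOP3": 3, "ALL": len(certainties)}[mode]
    return order[:keep]


def fuse(features, certainties, mode="ALL", fusion="mult"):
    """Build the captioning input from the selected hypotheses."""
    fused = []
    for k in select_hypotheses(certainties, mode):
        if fusion == "concat":
            # Append the feature vector followed by its certainty.
            fused.extend(features[k])
            fused.append(certainties[k])
        elif fusion == "mult":
            # Scale the feature vector by its certainty, then concatenate.
            fused.extend(x * certainties[k] for x in features[k])
        else:
            raise ValueError("fusion must be 'concat' or 'mult'")
    return fused
```

With the mult fusion, uncertain hypotheses are attenuated rather than discarded, so the captioning module still sees every predicted future but is steered toward the confident ones.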
Table 5 compares these different options. Using all feature hypotheses multiplied by their certainties (ALL with mult fusion) yields the best results by a large margin compared to the alternatives. This suggests that capturing the future with its multi-modality and variation is the key to representing future semantics.
Tables 6 and 7 show that the multi-hypotheses design is clearly superior to its single-prediction counterpart on both the Breakfast and the Epic-Kitchens dataset. While multiple hypotheses could not be exploited at the classification level in Section 5.5, they help a lot on the captioning task.
Figure 6 shows some qualitative results of future captioning on the Breakfast dataset. For each sample, future action/object classes of top-3 hypotheses are presented. In the top-left case, hypotheses are certain about the future object ”egg”, but for the action there is high uncertainty. In contrast, in the bottom-left case, uncertainty on the future object is higher than for the future action ”put”.
5.9. Uncertainty evaluation
In Figure 7, we provide the reliability diagram for the uncertainties of the feature hypotheses on the Epic-Kitchens dataset. The diagonal decrease suggests that our uncertainties are well calibrated with the errors of the features. In the supplementary material we provide more details about our method and more in-depth evaluations, as well as more qualitative results including failure cases.
We presented the problem of predicting future activities based on past observations. To this end, we leveraged a feature embedding used for action classification and extended it by learning a dynamics module that transfers these features into the future. The network predicts multiple hypotheses to model the uncertainty of the future state. While this had little effect on future activity classes, it helped substantially for future video captioning. Due to the decomposed representation into object and action, the approach generalizes well to unseen activities. The approach also allows for fully self-supervised training. Although the performance is still inferior to the supervised setting, it has a lot of potential when applied to large-scale unlabeled videos. We believe there is promise in investigating further in this direction.
Figure 7: Reliability diagram for our feature uncertainties on the Epic-Kitchens dataset. The diagonal decrease suggests that our uncertainties are well calibrated and potentially useful.
Table 5: Captioning performance based on different hypotheses selection and processing strategies. For “Top3” and “Best” we pick the three features and the single feature with the lowest class uncertainty, respectively. Lower entropy means that we are more certain about the prediction. “All” means all features are fed to the captioning module. “concat”: concatenation of feature and certainty vectors, “mult”: multiplying features with their certainties and concatenating.
Table 6: Future captioning results on the Breakfast dataset. PreFAct: Our model with regression and classification losses, PreFAct-MH: Our multi-hypotheses model.
Table 7: Future captioning results on the Epic-Kitchens dataset. PreFAct: Our model with regression and classification losses, PreFAct-MH: Our multi-hypotheses model.
Supplementary Material
“Association rule mining” [1] discovers relations between activities in the dataset. For instance, in the Epic-Kitchens dataset the actions “take” and “put” frequently occur together. Therefore, by identifying these relations we can predict the future class label based on the class label of the current observation. In the previous example, the rule would be: if action “take” is observed, then “put” will be the next action.
Using this method, we find the most probable patterns between activities. We first obtain the activity label of the current observation; then, using association rule mining, the label of the future activity is chosen as the consequent activity that co-occurs most frequently with it.
Table 8 shows frequently occurring action sequences in the Epic-Kitchens dataset. In this table, we provide three different measures: “Support”, “Confidence”, and “Lift”. Support refers to the popularity of an action set, Confidence refers to the likelihood that an action “B” happens if action “A” has already happened, and Lift measures the dependency between actions.
As shown in Table 8, the most frequent action set (ending in “put”) accounts for 12.2% of the dataset, and the most frequent object set has 1.2% occurrence. We utilized Confidence to find the most probable action “B” after observing action “A”. For instance, if the currently observed action is “Open”, then action “put” will happen with a probability of 24.2%.
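The three measures and the resulting baseline can be sketched with the standard association-rule definitions: support is the relative frequency of a pair, confidence is the conditional frequency of the consequent given the antecedent, and lift is confidence divided by the overall frequency of the consequent. Treating consecutive activity pairs as the transactions and the function names below are assumptions made for illustration.

```python
from collections import Counter


def rule_statistics(sequence):
    """Support, confidence, and lift for every observed rule A -> B over consecutive pairs."""
    pairs = list(zip(sequence, sequence[1:]))
    pair_counts = Counter(pairs)
    antecedent_counts = Counter(a for a, _ in pairs)
    consequent_counts = Counter(b for _, b in pairs)
    n = len(pairs)
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / antecedent_counts[a]
        lift = confidence / (consequent_counts[b] / n)
        stats[(a, b)] = (support, confidence, lift)
    return stats


def predict_next(current, stats):
    """Pick the consequent with the highest confidence for the observed label, if any."""
    candidates = {b: conf for (a, b), (_, conf, _) in stats.items() if a == current}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For the toy sequence take, put, take, put, open, take, the rule take -> put has support 0.4 and confidence 1.0, so observing "take" leads to the prediction "put".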
During training, we use the SGD optimizer with Nesterov momentum of 0.9 and weight decay of 0.0005. For the Epic-Kitchens dataset, training runs for up to 80 epochs with randomized mini-batches of 64 samples, where each sample contains 16 frames of a current video segment.
For the Epic-Kitchens dataset, the initial learning rate is 0.007 and is decreased by a factor of 10 when the validation error saturates for 10 epochs.
Training runs for up to 60 epochs on the Breakfast dataset. We use dropout of 0.3 for the last fully connected layer. For the Breakfast dataset, the initial learning rate is 0.001 and is decreased by a factor of 10 when the validation error saturates for 8 epochs. In addition, we apply data augmentation techniques similar to [44]: we resize the input frames and employ fixed-corner cropping and scale jittering with horizontal flipping. Afterwards, we apply per-pixel mean subtraction and resize the cropped regions to the network input resolution.
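The step-wise learning-rate schedule described above can be sketched in plain Python. This is an illustration under assumptions, not the training code used for the experiments: the class name `PlateauSchedule`, the reading of "saturates" as "no improvement of the best validation error for `patience` consecutive epochs", the dictionary layout, and the toy error values are all hypothetical.

```python
class PlateauSchedule:
    """Divide the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, initial_lr, patience, factor=10.0):
        self.lr = initial_lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_error):
        """Update with this epoch's validation error and return the next learning rate."""
        if val_error < self.best:
            self.best = val_error
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor
                self.stale_epochs = 0
        return self.lr


# Optimizer settings quoted in the text (shared by both datasets).
OPTIMIZER = {"nesterov_momentum": 0.9, "weight_decay": 0.0005}

# Per-dataset settings quoted in the text; the layout is only for illustration.
CONFIGS = {
    "epic_kitchens": {"initial_lr": 0.007, "patience": 10, "max_epochs": 80},
    "breakfast": {"initial_lr": 0.001, "patience": 8, "max_epochs": 60},
}

cfg = CONFIGS["breakfast"]
schedule = PlateauSchedule(cfg["initial_lr"], cfg["patience"])

# Hypothetical validation errors: two improving epochs, then an 8-epoch plateau.
errors = [0.9, 0.8] + [0.8] * 8
lrs = [schedule.step(e) for e in errors]
print(lrs[-1])
```

With these toy values the learning rate stays at its initial value until the eighth stagnating epoch and is divided by 10 afterwards.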
At inference time, we draw 16 samples from the video, apply only center cropping, and feed them directly to the network to obtain the final future predictions. For captioning, we use the same approach but extract the features from the regression layer. We use the extracted features to train the LSTM to provide a caption for each video segment.
For the future translation modules, we design several architectures consisting of fully connected and convolutional layers; see Table 9. For F3, F4 and F5, we make use of the inception modules introduced in [40]. For simplicity, we present each layer of modules F3, F4 and F5 in the following format (Table 9):
[input : operation d filters : output]
IN: Input from a specific layer of the ECO network.
Out: Output to the rest of the ECO network.
Table 8: Frequency of action sequences in the Epic-Kitchens dataset. Support measures the frequency of an itemset, Confidence gives the probability of seeing the consequent action given the previous action, and Lift measures how strongly two actions depend on each other.
Table 9: Architecture details for feature translation modules. The input to each module is the output of a specific layer of the ECO network depicted in Fig. 5 of the main paper.
For evaluating the quality of the uncertainty predictions we use reliability diagrams [13] and sparsification plots [2, 43, 19, 21, 16]. A reliability diagram plots the expected quality as a function of uncertainty. If the model is well calibrated, this plot should show a diagonal decrease. A sparsification plot shows the quality gain obtained by gradually removing the samples with the highest uncertainties. In the best case, samples would be removed in order of their ground-truth error, and this ordering serves as the oracle for the sparsification plot.
In Figure 8 we show the sparsification plots of the best, the worst, and the average hypothesis for the classification of actions and objects on the Breakfast (first row) and Epic-Kitchens (second row) datasets. The curves tend to increase consistently as the most uncertain samples are removed. To assess the quality of these plots, we also provide the oracle plot. The oracle repeats the sparsification using the true error instead of the uncertainties, which gives an upper bound for the uncertainties. Ideally, the sparsification plot should be as close to its oracle as possible. One possible reason for the relatively large gap in our plots is that activity prediction has not yet reached saturation (70% error), whereas image classification has (typically < 5% error).
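A sparsification curve and its oracle can be computed as in the following sketch. It assumes that per-sample correctness flags, predicted uncertainties and true errors are available as plain lists; the function name `sparsification_curve` and the toy values are hypothetical, and accuracy is used as the quality measure as in the classification plots.

```python
def sparsification_curve(correct, ranking_score, fractions):
    """Accuracy of the samples kept after removing the top fraction by `ranking_score`.

    correct:        list of 0/1 flags, 1 if the prediction is right
    ranking_score:  per-sample score; samples with the highest score are removed first
    fractions:      fractions of samples to remove, each in [0, 1)
    """
    # Sort sample indices from lowest to highest score, so removal trims the tail.
    order = sorted(range(len(correct)), key=lambda i: ranking_score[i])
    curve = []
    for f in fractions:
        keep = order[: len(order) - int(f * len(order))]
        curve.append(sum(correct[i] for i in keep) / len(keep))
    return curve


# Hypothetical toy data: 1 = correct prediction, 0 = wrong prediction.
correct = [1, 0, 1, 1, 0, 1, 0, 1]
uncertainty = [0.1, 0.9, 0.2, 0.7, 0.8, 0.3, 0.4, 0.5]  # predicted by the model
true_error = [1 - c for c in correct]                     # ranking signal of the oracle
fractions = [0.0, 0.25, 0.5]

model_curve = sparsification_curve(correct, uncertainty, fractions)
oracle_curve = sparsification_curve(correct, true_error, fractions)
print(model_curve, oracle_curve)
```

The gap between the two curves at each fraction is what the plots visualize: the oracle removes wrong predictions first, so no ranking by predicted uncertainty can exceed it.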
In Figure 9 we report the reliability diagram per hypothesis on both the Breakfast and Epic-Kitchens datasets for future action/object classification. The diagonal increase suggests that our uncertainties are well calibrated. Accuracy tends to increase consistently as the confidence threshold for removing samples increases.
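For classification, a reliability diagram can be sketched with the common equal-width binning described in [13]. Whether the figures use exactly this binning is an assumption; the function names `reliability_bins` and `expected_calibration_error`, the number of bins and the toy predictions are hypothetical, and the calibration error is included only as a convenient summary of the diagram.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins and report per-bin accuracy.

    Returns a list of (mean_confidence, accuracy, count) for the non-empty bins.
    """
    # Per bin: [sum of confidences, number of correct predictions, number of samples]
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        sums[b][0] += conf
        sums[b][1] += ok
        sums[b][2] += 1
    return [(s / n, k / n, n) for s, k, n in sums if n > 0]


def expected_calibration_error(bins):
    """Sample-weighted mean gap between confidence and accuracy over the bins."""
    total = sum(n for _, _, n in bins)
    return sum(n * abs(conf - acc) for conf, acc, n in bins) / total


# Hypothetical toy predictions.
confidences = [0.95, 0.9, 0.55, 0.5, 0.15, 0.1]
correct = [1, 1, 1, 0, 0, 0]
bins = reliability_bins(confidences, correct)
ece = expected_calibration_error(bins)
print(bins, ece)
```

For a well-calibrated model the per-bin accuracy rises with the mean confidence of the bin, which corresponds to the diagonal increase seen in the diagrams.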
In Figure 10 we report the reliability diagram per hypothesis on the Breakfast dataset for feature reconstruction. For the Epic-Kitchens diagram, see Figure 7 of the main paper. The diagonal decrease suggests that our uncertainties are useful, as is also supported by our captioning results.
In Figure 11 we show the sparsification plots per hypothesis for feature reconstruction on both datasets. The error tends to decrease as the highly uncertain features are removed. However, as in the classification case, there is a large gap to the oracle due to the difficulty of future prediction. When the predictions do not generalize, the uncertainties do not either.
Figure 12 shows representative results of our method. We input the current observation to the model and get the future captions.
The Epic-Kitchens dataset is egocentric: the camera is mounted on the person’s head. Future prediction on this dataset is therefore more challenging due to quick changes in viewpoint and motion blur. In the Breakfast dataset, the camera location is fixed throughout the video recording.
Figure 8: Sparsification plot of the proposed method for the Breakfast (a and b) and Epic-Kitchens (c and d) datasets. The plot shows the accuracy of future activity prediction after each fraction of the samples with the highest uncertainties has been removed. The oracle sparsification shows the upper bound, obtained by removing each fraction of samples ranked by the cross-entropy loss between the prediction and the ground truth. The large gap to the oracle can be explained by the relatively large errors inherent in the future prediction task.
In the last row of Figure 12, it is not known whether the tap is on or off in the Epic-Kitchens example, and it is not known whether the pan has been buttered in the Breakfast example. This implies that providing a longer history of previously performed actions would decrease the ambiguity of future prediction.
For instance, in the last row for Epic-Kitchens the observation is "put down washing liquid" and the prediction is "turn on tap", while the ground truth is "take spoon".
[1] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, pages 207–216, New York, NY, USA, 1993. ACM. 9
[2] O. M. Aodha, A. Humayun, M. Pollefeys, and G. J. Brostow. Learning a confidence measure for optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5):1107–1120, May 2013. 10
[3] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. CoRR, abs/1710.11252, 2017. 2
Figure 9: Reliability diagrams for Breakfast (a and b) and Epic-Kitchens (c and d) datasets. Diagonal increase in accuracy suggests that our uncertainties are well calibrated. By increasing the confidence threshold, accuracy increases consistently.
Figure 10: Reliability diagram for our feature uncertainties on Breakfast dataset. Diagonal decrease suggests that our uncertainties are well calibrated and potentially useful.
[4] A. Bhattacharyya, M. Fritz, and B. Schiele. Long-term on-board prediction of people in traffic scenes under uncertainty. CoRR, abs/1711.09026, 2017. 2
[5] W. Byeon, Q. Wang, R. Kumar Srivastava, and P. Koumoutsakos. Contextvp: Fully context-aware video prediction. In The European Conference on Computer Vision (ECCV), September 2018. 1
Figure 11: Sparsification plot for feature reconstruction uncertainties for Breakfast (a) and Epic-Kitchens (b). The big difference to its oracle can be explained by the relatively big errors inherent in future prediction task.
[6] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE Int. Conference on Computer Vision (ICCV), 2017. 2, 3, 4
[7] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The EPIC-KITCHENS dataset. CoRR, abs/1804.02748, 2018. 5, 6, 7
[8] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014. 5
[9] C. Fan, J. Lee, and M. S. Ryoo. Forecasting hand and object locations in future frames. CoRR, abs/1705.07328, 2017. 2
[10] Y. A. Farha, A. Richard, and J. Gall. When will you do what? - anticipating temporal occurrences of activities. CoRR, abs/1804.00892, 2018. 2
[11] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Int. Conference on Machine Learning (ICML), 2016. 2
[12] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, 2017. 3, 8
[13] C. Guo, G. Pleiss, Y. Sun, and K. Weinberger. On calibration of modern neural networks. In Int. Conference on Machine Learning (ICML), 2017. 2, 6, 10
[14] T. Han, J. Wang, A. Cherian, and S. Gould. Human action forecasting by learning task grammars. CoRR, abs/1709.06391, 2017. 2
[15] G. Huang, Y. Li, and G. Pleiss. Snapshot ensembles: Train 1, get M for free. In Int. Conference on Learning Representations (ICLR), 2017. 2
[16] E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox. Uncertainty estimates and multi-hypotheses networks for optical flow. In European Conference on Computer Vision (ECCV), 2018. https://arxiv.org/abs/1802.07095. 2, 3, 4, 10
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017. 6
[18] A. Kendall and Y. Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In Int. Conference on Neural Information Processing Systems (NIPS), 2017. 2, 4
[19] C. Kondermann, R. Mester, and C. Garbe. A Statistical Confidence Measure for Optical Flows, pages 290–301. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. 10
[20] H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 780–787, June 2014. 5
[21] J. Kybic and C. Nieuwenhuis. Bootstrap optical flow confidence and uncertainty measure. Computer Vision and Image Understanding, 115(10):1449–1462, 2011. 10
[22] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS workshop, 2016. 2
[23] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 689–704, Cham, 2014. Springer International Publishing. 2
[24] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018. 2
[25] S. Lee, S. Purushwalkam, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra. Stochastic multiple choice learning for training diverse deep ensembles. In Int. Conference on Neural Information Processing Systems (NIPS), 2016. 2, 3, 4
[26] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004. 5
Figure 12: Qualitative examples of future prediction on the Epic-Kitchens and Breakfast datasets. For each example, the current observation and the future observation are provided. The last row shows failure cases of future prediction.
[27] W. Liu, D. L. W. Luo, and S. Gao. Future frame prediction for anomaly detection – a new baseline. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1
[28] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. In Int. Conference on Learning Representations (ICLR), 2017. 2
[29] W. Lotter, G. Kreiman, and D. D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. CoRR, abs/1605.08104, 2016. 2
[30] P. Luc, C. Couprie, Y. LeCun, and J. Verbeek. Predicting future instance segmentations by forecasting convolutional features. CoRR, abs/1803.11496, 2018. 1, 2
[31] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. ICCV, 2017. 2
[32] T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5784–5793, Oct 2017. 2
[33] M. Mathieu, C. Couprie, and Y. LeCun. Deep multiscale video prediction beyond mean square error. CoRR, abs/1511.05440, 2015. 2
[34] K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., 2002. 5
[35] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. CoRR, abs/1412.6604, 2014. 2
[36] C. Rupprecht, I. Laina, R. DiPietro, M. Baust, F. Tombari, N. Navab, and G. D. Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In International Conference on Computer Vision (ICCV), 2017. 2, 3, 4
[37] G. Singh, S. Saha, and F. Cuzzolin. Predicting action tubes. CoRR, abs/1808.07712, 2018. 2
[38] K. Soomro, H. Idrees, and M. Shah. Predicting the where and what of actors and actions through online action localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. 2
[39] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2015. 2
[40] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016. 9
[41] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 5
[42] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating the future by watching unlabeled video. CoRR, abs/1504.08023, 2015. 1, 2
[43] A. S. Wannenwetsch, M. Keuper, and S. Roth. Probflow: Joint optical flow and uncertainty estimation. In IEEE Int. Conference on Computer Vision (ICCV), Oct 2017. 10
[44] M. Zolfaghari, K. Singh, and T. Brox. ECO: efficient convolutional network for online video understanding. Computer Vision – ECCV 2018, pages 713–730, 2018. 1, 2, 3, 6, 7, 8, 9