Assistive technologies are attracting increasing attention as a promising application of first-person vision — computer vision using wearable cameras such as Google Glass and GoPro HERO. Much like how we use our eyes, first-person vision techniques can act as an artificial visual system that perceives the world around camera wearers and assist them to decide on what to do next. Recent work has focused on a variety of assistive technologies such as blind navigation [20, 39], object echo-location [38], and person-
Figure 1. Future Person Localization. Given a first-person video of a certain target person, our network predicts where the target person will be located in the future frames based on the poses and scales of the person as well as the ego-motions of the camera wearer.
alized object recognition [15].
In this work, we are particularly interested in helping a user to navigate in crowded places with many people present in the user’s vicinity. Consider a first-person video stream that a user records with a wearable camera. By observing people in certain frames and predicting how they move subsequently, we would be able to guide the user to avoid collisions. As the first step to realizing such safe navigation technologies in a crowded place, this work proposes a new task that predicts locations of people in future frames, i.e., future person localization, in first-person videos as illustrated in Figure 11.
In order to enable future person localization, this work makes three key observations. First, ego-motion of a camera wearer is clearly observed in the form of global motion of first-person videos. This ego-motion should be incorporated in the prediction framework as it greatly affects future locations of people. For example, if a camera wearer is moving forward, apparent vertical locations of people in the first-person video will be moving down accordingly. Moreover, if the camera wearer is walking towards people would change walking direction slightly to avoid a collision. This type of interacting behaviors would also affect the future locations of people.
Another key observation is that the scale of people acts as a salient cue to capture a perspective effect in first-person videos. Since the optical axis of a wearable camera tends to be parallel to the ground plane, visual distances in first-person video frames correspond to different physical distances depending on where people are observed in the frames. Such differences have to be taken into account for better future localization, especially when localizing people who are moving towards or away from the camera wearer.
The last key observation that improves the prediction capability is that, the pose of a person indicates how that person is moving and will be located in the near future. Firstperson videos can be used effectively to get access to such pose information as they often capture people up-close.
Based on these key observations, we propose a method to predict the future locations of a person seen in a first-person video based on ego-motions of the video, poses, scales, and locations of the person in the present and past video frames (also refer to Figure 1). Specifically, we develop a deep neural network that learns the history of the above cues in several previous frames and predicts locations of the target person in the subsequent future frames. A convolution-deconvolution architecture is introduced to encode and decode temporal evolution in these histories.
To validate our approach, we develop a new dataset of first-person videos called First-Person Locomotion (FPL) Dataset. The FPL Dataset contains about 5,000 people seen at diverse places. We demonstrate that our method successfully predicts future locations of people in first-person videos where state-of-the-art methods for human trajectory prediction using a static camera such as [1] fail. We also confirmed a promising performance of our method on a public first-person video dataset [8].
A typical problem setting involving first-person vision is to recognize activities of camera wearers. Recently, some work has focused on activity recognition [7, 22, 23, 28], activity forecasting [6, 9, 26, 31], person identification [11], gaze anticipation [45] and grasp recognition [3, 4, 21, 35]. Similar to our setting, other work has also tried to recognize behaviors of other people observed in first-person videos, e.g., group discovery [2], eye contact detection [42] and activity recognition [33, 34, 44].
To the best of our knowledge, this work is the first to address the task of predicting future locations of people in first-person videos. Our task is different from egocentric future localization [26] that predicts where ‘the camera wearers’ will be located in future frames. One notable exception is the recent work by Su et al. [37]. Although they proposed a method to predict future behaviors of basketball players in first-person videos, their method requires multiple first-person videos to be recorded collectively and synchronously to reconstruct accurate 3D configurations of camera wearers. This requirement of multiple cameras is in contrast to our work (i.e., using a single camera) and not fit for assistive scenarios where no one but the user on assistance is expected to wear a camera.
Finally, the task of predicting future locations of people itself has been studied actively in computer vision. Given both locations of start and destination, work based on inverse reinforcement learning can forecast in-between paths [17, 24]. Several methods have made use of Bayesian approaches [18, 36], recurrent neural networks [1, 19], fully-convolutional networks [12, 43], and other social or contextual features [32, 41] for predicting human trajectories from images or videos. These methods are, however, not designed to deal with first-person videos where signifi-cant ego-motion affects the future location of a certain person. Also, while the fixed camera setting assumed in these methods can suffer from oblique views and limited image resolutions, egocentric setting provides strong appearance cues of people. Our method utilizes ego-motion, scale and pose information to improve the localization performance in such an egocentric setting.
3.1. Overview
In this section, we first formulate the problem of predicting future locations of people in first-person videos. Consider a certain target person seen in a current frame of a first-person video recorded on the street. Our goal is to predict where the target person will be seen in subsequent frames of the video based on the observation up to the current frame. Formally, let be the 2D lo- cation of the person in the frame t. As illustrated in Figure 2, we aim to predict the person’s relative locations in the subsequent
frames from the current one at
(red frames in the figure), that is,
, based on observations in the previous
frames (blue ones).
The key technical interest here is what kind of observations can be used as a salient cue to better predict . Based on the discussions we made in Section 1 (also refer to Figure 2), we focus on c-1) locations and c-2) scales of target people, d) ego-motion of the camera wearer, and e) poses of target people as the cues to approach the problem. In order to predict future locations from those cues, we
Figure 2. Future Person Localization in First-Person Videos. Given a) -frames observations as input, we b) predict future locations of a target person in the subsequent
frames. Our approach makes use of c-1) locations and c-2) scales of target persons, d) ego-motion of camera wearers and e) poses of the target persons as a salient cue for the prediction.
Figure 3. Proposed Network Architecture. Blue blocks correspond to convolution/deconvolution layers while gray blocks describe intermediate deep features.
develop a deep neural network that utilizes a multi-stream convolution-deconvolution architecture shown in Figure 3. Input streams take the form of fully-convolutional networks with 1-D convolution filters to learn sequences of the cues shown above. Given a concatenation of features provided from all input streams, the output stream deconvolutes it to generate . The overall network can be trained end-to-end via back-propagation. In the following sections, we describe how each cue is extracted to improve prediction performance. Concrete implementation details and training strategies are discussed in Section 4.2.
3.2. Location-Scale Cue
The most straightforward cue to predict future locations of people is their previous locations up to the current frame
. For example, if a target person is walking in a certain direction at a constant speed, our best guess based on only previous locations would be to expect them to keep going in that direction in subsequent future frames too. However, visual distances in first-person videos can correspond to different physical distances depending on where people are observed in the frame.
In order to take into account this perspective effect, we propose to learn both locations and scales of target people jointly. Given a simple assumption that heights of people do not differ too much, scales of observed people can make a rough estimate of how large movements they made in the actual physical world. Formally, let be a history of previous target locations. Then, we extend each location
of a target person by adding the scale information of that person
, i.e.,
. Then, the ‘location- scale’ input stream in Figure 3 learns time evolution in
, and the output stream generates
.
3.3. Ego-Motion Cue
While explicitly describes how a target person is likely to move over time, the direct prediction of
from
is still challenging due to significant ego-motion present in first-person videos. More specifically, the coordinate system to describe each point
changes dynamically as the camera wearer moves. This makes
and
quite diverse depending on both walking trajectories of the target person and ego-motion of camera wearers.
Moreover, ego-motion of camera wearers could affect how the target people move as a result of interactive dynamics among people. For instance, consider a case where a target person is walking towards the camera wearer. When the target person and the camera wearer notice that they are going to collide soon, they will explicitly or implicitly condition themselves to change their walking speed and direc- tion to avoid the potential collision. Although some recent work has tried to incorporate such interactive behaviors into human trajectory prediction [1, 19, 24, 32], their approaches need all interacting people to be observed in a static camera view and cannot be applied directly to our case.
In order to improve future localization performance for first-person videos, we propose to learn how the camera wearer has been moving, i.e., the ego-motion cue. Specifi-cally, we first estimate the rotation and translation between successive frames. Rotation is described by a rotation matrix and translation is described by a 3D vector
(i.e., x-, y-, z-axes), both from frame
to frame t in the camera coordinate system at frame
. These vectors represent the local movement between the successive frames, however, does not capture the global movement along multiple frames. Therefore, for each frame t within the input interval
, we accumulate those vectors to describe time-varying ego-motion patterns in the camera coordinate system at frame
:
We form the feature vector for each frame by concatenating the rotation vector (i.e., yaw, roll, pitch) converted from
and
, resulting in a 6-dimensional vector
. Fi- nally, we stack them to form an input sequence
for the ‘ego-motion’ stream shown in Figure 3.
3.4. Pose Cue
Another notable advantage of using first-person videos is the ability to observe people up-close. This makes it easier to capture what poses they take (e.g., which directions they orient), which could act as another strong indicator of the direction they are going to walk along.
The ‘pose’ stream in Figure 3 is aimed at encoding such pose information of target people. More specifically, we track temporal changes of several body parts of target people including eyes, shoulders, and hips as a feature of target poses. This results in an input sequence where
is a 2V -dimensional vector stacking locations of V body parts.
To investigate the effectiveness of our approach in detail, we first construct a new first-person video dataset recorded
Figure 4. First-Person Locomotion Dataset recorded by wearable chest-mounted cameras under diverse environments, which comprises more than 5,000 people in total.
by a person walking on the street. We also evaluate our method on First-Person Social Interaction Dataset [8] to see if our approach can be applied to a more general case where camera wearers take a variety of actions while walking.
4.1. First-Person Locomotion Dataset
To the best of our knowledge, most of the first-person video datasets comprise scenes where only a limited number of people are observed, e.g., CMU Social Interaction Dataset [27], JPL Interaction Dataset [34], HUJI EgoSeg Dataset [29]. In this work, we introduce a new dataset which we call First-Person Locomotion (FPL) Dataset. The FPL Dataset consists of about 4.5 hours of first-person videos recorded by people wearing a chest-mounted camera and walking around in diverse environments. Some example frames are shown in Figure 4. The number of observed people is more than 5,000 in total.
Training and testing samples are given in the form of a tuple , where
is location-scale,
is camera ego-motion,
is pose, and
is relative future location-scale with respect to
are available both in training and testing times and defined in interval
. On the other hand,
serves as ground-truth defined in
, which we can access only during the training time. In this experiment, we set
at 10 fps, i.e., a time window of one second for both observation and prediction.
We generated the samples as follows. For each frame, we detected people with OpenPose [5]. We tracked the upper body of detected people over time using the kernelized correlation filter [10] after two consecutive frames were aligned with homography. We terminated the tracking if subsequent detection results were not found within a certain pre-defined spatiotemporal range. As a result of this tracking, we obtained many short tracklets2. These tracklets were then merged to generate longer ones with the conditions 1) if the detected person at the tail of one tracklet is visually similar to that at the head of the other tracklet and 2) if these tracklets were also spatiotemporally close enough. A cosine distance of deep features extracted by Faster R-CNN [30] was used to measure visual similarity.
For each tracklet, we extracted locations , scales
, poses
, and ego-motion
as follows. First, we extracted 18 body parts using OpenPose [5].
was then defined by the middle of two hips. Also,
was given by the distance between the location of the neck and
. Furthermore, we obtained
as a 36-dimensional feature (i.e., V = 18), which was normalized by subtracting
and divided by
.
was estimated by the unsupervised ego-motion estimator [46]. Finally, we applied sliding window to generate multiple fixed length (i.e., 2 seconds) samples. As a result of this procedure, we obtained approximately 50,000 samples in total.
4.2. Implementation Details
Architecture choice The full specification of the proposed network architecture is shown in Table 1. Each input stream feeds -dimensional inputs (where D changes depending on which cues we focus on) to four cascading 1D temporal convolution layers of different numbers of channels, each of which is followed by batch normalization (BN) [14] and rectified linear unit (ReLU) activation [25]. Then,
-dimensional features from the input streams are concatenated and fed to the output stream consisting of two 1D convolution layers with BN and ReLU, four cascading 1D deconvolution layers also with BN and ReLU, and one another 1D convolution layer with linear activation.
Optimization To train the network, we first normalized and
to have zero-mean and unit variance. We also adopted a data augmentation by randomly flipping samples horizontally. The loss functions to predict
was defined by the mean squared error (MSE). We optimized the network via Adam [16] for 17,000 iterations with mini-batches of 64 samples, where a learning rate was initially set to 0.001 and halved at 5,000, 10,000, 15,000 iterations. All implementations were done with Chainer [40].
4.3. Evaluation Protocols
Data splits We adopted five-fold cross-validation by randomly splitting samples into five subsets. We ensured that samples in training and testing subsets were drawn from different videos. Training each split required about 1.5 hours
Table 1. Our Network Architecture where BN: batch normalization [14] and ReLU: rectifier linear unit [25]. The network consists of three input streams and one output stream, where inputs have different dimensions D depending on the streams: D = 3 for the location-scale stream, D = 6 for the ego-motion stream, and D = 36 for the pose stream.
on a single NVIDIA TITAN X. Also when evaluating methods with testing subsets, we further divided samples into three conditions based on how people walked (i.e., walking directions): target people walked a) Toward, b) Away from, or c) Across the view of a camera. Further details on how to segregate the samples into these three categories are present in our supplementary materials.
Evaluation metric Although our network predicts both locations and scales of people in the future frames, we measured its performance based on how accurate the predicted locations were. Similar to [1], we employed the final displacement error (FDE) as our evaluation metric. Specifically, FDE was defined by the L2 distance between predicted final locations and the corresponding ground-truth locations.
Baselines Since there were no prior methods that aimed to predict future person locations in first-person videos, we have implemented the following baselines.
• ConstVel: Inspired by the baseline used in [26], this method assumes that target people moved straight at a constant speed. Specifically, we computed the average speed and direction from to predict where the target would be located at the
-th frame.
• NNeighbor: We selected k-nearest neighbor input sequences in terms of the L2 distance on the se-
quences of locations and derived the average of k-corresponding locations at frame
. In our experiments, we set k = 16 as it performed well.
• Social LSTM [1]: We also evaluated Social LSTM, one of the state-of-the-art approaches on human trajectory prediction, with several minor modifications to better work on first-person videos. Specifically, we added the scale information to inputs and outputs. The estimation of Gaussian distributions was replaced by direct prediction of as it often failed on the FPL Dataset. The neighborhood size
used in the paper was set to
.
4.4. Results
Quantitative evaluation Table 2 reports FDE scores on our FPL Dataset. Overall, all methods were able to predict future locations of people with the FDE less than about 15% of the frame width (approximately in horizontal angle). We confirmed that our method (Ours) has significantly outperformed the other baselines. Since walking speeds and directions of people were quite diverse and changing dynamically over time, naive baselines like ConstVel and NNeighbor did not perform well. Moreover, we found that Social LSTM [1] performed poorly. Without explicitly taking into account how significant ego-motion affects people locations in frames, temporal models like LSTM would not be able to learn meaningful temporal dynamics, ultimately rendering their predictions quite unstable. Note that without our modification shown in Section 4.3, the performance of vanilla Social LSTM was further degraded (i.e., 152.87 FDE on average). Comparing results among walking directions, Toward was typically more challenging than other conditions. This is because when target people walked toward the view of a camera, they would appear in the lower part of frames, making variability of future locations much higher than other walking directions.
Error analysis We investigated the distribution of the errors. With our method, 73% samples received error less than 100 pixels (in horizontal angle). There were only 1.4% samples suffered from significant error larger than 300 pixels (
in horizontal angle). Additionally, we calculated the errors normalized by each sample’s scale. By assuming that the length between the center hip and the neck of a person to be 60 cm, the average error obtained by our method approximately corresponded to 60 cm in the physical world.
Qualitative evaluation Figure 9 presents several visual examples of how each method worked. Examples (a), (b), and (c) are results drawn respectively from Toward, Across, and Away subsets. Especially, significant ego-motion of the camera wearer to turn right was observed in
Table 2. Comparisons to Baselines. Each score describes the final displacement error (FDE) in pixels with respect to the frame size of
Table 3. Ablation Study. : locations,
location-scales,
ego-motion, and
: poses. Each score describes the final displacement error (FDE) in pixels with respect to the frame size of
Example (b), making predictions of baseline methods completely failure. Another case where ego-motion played an important role was when target people did not move, such as the person standing still in Example (d). Example (e) involves not only significant ego-motion but also changes in walking direction of the target. Our method successfully performed in this case as it could capture postural changes of target persons for prediction.
Ablation study We made an ablation study to see how each of scales, ego-motion, and poses contributed overall prediction performances. Specifically, we started from the only location information , then added scale information to use
. For these two conditions, we learned a singlestream convolution-deconvolution architecture. Then, we evaluated the combination of
(locations, scales, and ego-motion) and that of
(locations, scales, and poses) by learning two-stream convolution-deconvolution architectures. Results are shown in Table 3. We confirmed that all of the cues helped individually to improve prediction performances. Especially, significant performance gains were observed on the Toward subset from
to
, i.e., by introducing scale information, and from
to
, i.e., by further combining pose information.
Failure cases and possible extensions Figure 6 shows several typical failure cases. On both examples, our method
Figure 5. Visual Examples of Future Person Localization. Using locations (shown with solid blue lines), scales and poses of target people (highlighted in pink, left column) as well as ego-motion of camera wearers in the past observations highlighted in blue, we predict locations of that target (the ground-truth shown with red crosses with dotted red lines) in the future frames highlighted in red. We compared several methods: Ours (green), NNeighbor (cyan), and Social LSTM [1] (yellow).
and other baselines did not perform accurately as camera wearers made sudden unexpected ego-motion. One possible solution to cope with these challenging cases is to predict future movements of the camera wearers as done in [26].
4.5. Evaluation on Social Interaction Dataset
Finally, we evaluate how our approach works on FirstPerson Social Interaction Dataset [8]. This dataset consists of several first-person videos taken in an amusement park and involves a variety of social moments like communicating with friends, interacting with a clerk, and waiting in line, standing for a more general and challenging dataset. In our experiment, we manually extracted a subset of videos where camera wearers kept walking while sometimes interacting with others. From this subset, we collected approximately 10,000 samples in total. Similar to the previous experiment, we adopted five-fold cross-validation to evaluate how our method and other baselines performed.
Training setup In this dataset, camera wearers frequently turned their head to pay their attention to various different locations. This made ego-motion estimator [46] completely inaccurate as it was originally trained to estimate ego-motion of vehicle-mounted cameras, where such frequent turning was hardly observed in their training datasets. To cope with this, instead of the velocity and rotation used in Section 3.3, we made use of optical flows to describe ego-motion cues. More specifically, we computed dense optical
Figure 6. Failure Cases. Given previous locations (blue), predictions by our method (green) and Social LSTM [1] (yellow) both deviated from ground-truth future locations (red).
Table 4. Results on Social Interactions Dataset [8]. Each score describes the final displacement error (FDE) in pixels with respect to the frame sizes of either
flows using [13] and divided them into grids. We then computed average flows per grid and concatenate them to obtain 24-dimensional vector for describing ego-motion per frame. For the training, we first pre-trained our network on FPL Dataset with the same training strategies shown in Section 4.2 but with the above flow-based ego-motion feature 3. We then fine-tuned this trained network on the Social Interaction Dataset for 200 iterations using Adam with a learning rate of 0.002.
Results FDE scores are shown in Table 7. Similar to the previous experiment, we divided testing datasets into three subsets, Toward, Away, and Across, based on walking directions of target people. Although performances of all methods were rather limited compared to the previous results in Table 2, we still confirmed that our method was able to outperform other baseline methods including Social LSTM [1]. Some visual examples are also shown in Figure 7.
Figure 7. Visual Examples from Social Interaction Dataset [8]: previous locations (blue lines) of target people (pink bounding boxes); predictions by our method (green lines); and ground-truth future locations (red lines).
We have presented a new task called future person localization in first-person videos. Experimental results have revealed that ego-motion of camera wearers as well as scales and poses of target people were all necessary ingredients to accurately predict where target people would appear in future frames.
As we discussed with the failure cases, one possible direction for extending this work is to incorporate future localization of camera wearers [26]. By knowing how the camera wearers move in the near future, we should be able to predict future locations of observed people more accurately in first-person videos.
Figure 8 presents frequency distributions of lengths of the tracklets extracted from First-Person Locomotion Dataset and Social Interaction Dataset [8]. These statistics revealed that most people appeared only for a short time period. In our experiments, we tried to pick out tracklets which were 1) longer enough to learn meaningful temporal dynamics and 2) observed frequently in the datasets to stably learn our network. These requirements resulted in our 50,000 samples consisting of the tracklets longer than or equal to 20 frames (i.e., 2 seconds at 10 fps) and our problem setting of ‘predicting one-second futures from one-second histories’.
Details of sample division: We first calculated the mean of scale normalized lengths between the left hip and the right hip for the target person. If this mean is less than 0.25 we categorized the clip as Across. In the remaining clips, we labeled each frame of the clip as either Toward if
Table 5. Different Input/Output Lengths. Final Displacement Error (FDE) for various combinations of input () and output (
Table 6. Predicting Two-Second Futures. Final Displacement Error (FDE) where was set to
Table 7. Ablation Study on Social Interactions Dataset [8]. Final displacement error (FDE) for various combination of input features. Notations were the same as those of Table 6.
X-coordinate of the left hip is larger than that of the right hip and Away otherwise. If the number of frames labeled Toward is more than 75% of the total number of frames in the clip, the clip is categorized as Toward and as Away if it is less than 25%.
B.1. Other Choices of Input/Output Lengths
In our experiments, we fixed the input and output lengths to be
. Table 5 shows how performances changed for other choices of
and
. Overall, longer input lengths led to better performance (
vs. 10). Also, predicting more distant futures becomes more difficult (
vs. 6). To receive shorter inputs, we applied 1-padding to the first and second convolution layer in each stream.
We also compared our method against Social LSTM [1] on the task of predicting two-second futures (i.e., 20) in Table 6. We confirmed that our method still worked well on this challenging condition. To generate 20 frame prediction, we changed the kernel size of the deconvolution layers of 3, 3, 3, 3 to 3, 5, 7, 7.
B.2. Other Visual Examples
Figure 9 shows additional visual examples of how our method, as well as several baselines, predicted future locations of people.
B.3. Ablation Study on Social Interaction Dataset
We performed an ablation study on Social Interaction Dataset [8] in Table 7. While we computed ego-motion based on optical flows, the combination of ego-motion and pose cues contributed to performance improvements.
This work was supported by JST CREST Grant Number JPMJCR14E1, Japan.
Figure 8. Distributions of Tracklet Lengths. Frequency distributions of various lengths of tracklets extracted from First-Person Locomotion Dataset and Social Interaction Dataset [8] for three walking directions and the entire database, respectively.
Figure 9. Qualitative Examples of Future Person Localization on First Person Locomotion Dataset. (Row 1) Even though input sequence is almost static, our model is able to capture the left turn caused by the wearer’s ego-motion. (Row 2, 3) In the input sequence, the target is changing the pose to move right. While compared model fails to predict because of being agnostic to the pose information, our model produces a better prediction. (Row 4) The behavior with respect to complicated ego-motion. In the input sequence, the wearer is turning left to avoid other pedestrians. However, in the future frames, the wearer moves to the opposite side to avoid contact with the target. In this case, our prediction is perturbed due to ego-motion and predicts worse than Social LSTM. (Row 5) Our model works well both in outdoor scenes as well as indoor scenes.
[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016. 2, 4, 5, 6, 7, 8, 9
[2] S. Alletto, G. Serra, S. Calderara, and R. Cucchiara. Un- derstanding social relationships in egocentric vision. Pattern Recognition, 48(12):4082–4096, 2015. 2
[3] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision, pages 1949– 1957, 2015. 2
[4] M. Cai, K. M. Kitani, and Y. Sato. A scalable approach for understanding the visual structures of hand grasps. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1360–1366, 2015. 2
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi- person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7291 – 7299, 2017. 4, 5
[6] C. Fan, J. Lee, and M. S. Ryoo. Forecasting hand and object locations in future frames. CoRR, abs/1705.07328, 2017. 2
[7] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding Ego- centric Activities. In Proceedings of the IEEE InternationalConference on Computer Vision, pages 407–414, 2011. 2
[8] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interactions: A first-person perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1226–1233, 2012. 2, 4, 7, 8, 9, 10
[9] A. Furnari, S. Battiato, K. Grauman, and G. M. Farinella. Next-active-object prediction from egocentric videos. Journal of Visual Communication and Image Representation, 49(C):401–411, 2017. 2
[10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High- speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015. 4
[11] Y. Hoshen and S. Peleg. An egocentric look at video pho- tographer identity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4284– 4292, 2016. 2
[12] S. Huang, X. Li, Z. Zhang, Z. He, F. Wu, W. Liu, J. Tang, and Y. Zhuang. Deep learning driven visual path prediction from a single image. IEEE Transactions on Image Processing, 25(12):5892–5904, 2016. 2
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462 – 2470, 2017. 8
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448– 456, 2015. 5
[15] H. Kacorri, K. M. Kitani, J. P. Bigham, and C. Asakawa. People with visual impairment training personal object recognizers: Feasibility and challenges. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pages 5839–5849, 2017. 1
[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5
[17] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In Proceedings of the European Conference on Computer Vision, pages 201–214, 2012. 2
[18] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila. Context-based pedestrian path prediction. In Proceedings of the European Conference on Computer Vision, pages 618– 633, 2014. 2
[19] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345, 2017. 2, 4
[20] T.-S. Leung and G. Medioni. Visual navigation aid for the blind in dynamic environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 153 – 158, 2014. 1
[21] C. Li and K. M. Kitani. Pixel-level hand detection in ego- centric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3570– 3577, 2013. 2
[22] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 287–295, 2015. 2
[23] M. Ma, H. Fan, and K. M. Kitani. Going deeper into first- person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1894–1903, 2016. 2
[24] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecast- ing interactive dynamics of pedestrians with fictitious play. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 774 – 782, 2017. 2, 4
[25] V. Nair and G. E. Hinton. Rectified linear units improve re- stricted boltzmann machines. In Proceedings of the International Conference on Machine Learning, pages 807–814, 2010. 5
[26] H. S. Park, J.-J. Hwang, Y. Niu, and J. Shi. Egocentric fu- ture localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4697– 4705, 2016. 2, 5, 7, 8
[27] H. S. Park, E. Jain, and Y. Sheikh. 3D Social Saliency from Head-mounted Cameras. In Proceedings of the Advances in Neural Information Processing Systems, pages 1–9, 2012. 4
[28] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2847–2854, 2012. 2
[29] Y. Poleg, C. Arora, and S. Peleg. Head motion signatures from egocentric videos. In Proceedings of the Asian Conference on Computer Vision, pages 1–15, 2014. 4
[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To- wards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 1–9, 2015. 5
[31] N. Rhinehart and K. M. Kitani. First-person activity fore- casting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 3696–3705, 2017. 2
[32] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese. Learning social etiquette: Human trajectory understanding in crowded scenes. In Proceedings of the European Conference on Computer Vision, pages 549–565, 2016. 2, 4
[33] M. S. Ryoo, T. J. Fuchs, L. Xia, J. Aggarwal, and L. Matthies. Robot-centric activity prediction from first-person videos: What will they do to me? In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 295–302, 2015. 2
[34] M. S. Ryoo and L. Matthies. First-person activity recogni- tion: What are they doing to me? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2737, 2013. 2, 4
[35] A. Saran, D. Teney, and K. M. Kitani. Hand parsing for fine- grained recognition of human grasps in monocular images. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1–7, 2015. 2
[36] N. Schneider and D. M. Gavrila. Pedestrian path prediction with recursive bayesian filters: A comparative study. In Proceedings of the German Conference on Pattern Recognition, pages 174–183, 2013. 2
[37] S. Su, J. Pyo Hong, J. Shi, and H. Soo Park. Predicting be- haviors of basketball players from first person videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1501–1510, 2017. 2
[38] T. J. J. Tang and W. H. Li. An assistive eyewear prototype that interactively converts 3d object locations into spatial au-
dio. In Proceedings of the ACM International Symposium on Wearable Computers, pages 119–126, 2014. 1
[39] Y. Tian, Y. Liu, and J. Tan. Wearable navigation system for the blind people in dynamic environments. In Proceedings of the Cyber Technology in Automation, Control and Intelligent Systems, pages 153 – 158, 2013. 1
[40] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next- generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems, pages 1–6, 2015. 5
[41] D. Xie, S. Todorovic, and S. C. Zhu. Inferring dark matter and dark energy from videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 2224– 2231, 2013. 2
[42] Z. Ye, Y. Li, Y. Liu, C. Bridges, A. Rozga, and J. M. Rehg. Detecting bids for eye contact using a wearable camera. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, pages 1–8, 2015. 2
[43] S. Yi, H. Li, and X. Wang. Pedestrian behavior understand- ing and prediction with deep neural networks. In Proceedings of the European Conference on Computer Vision, pages 263–279, 2016. 2
[44] R. Yonetani, K. M. Kitani, and Y. Sato. Recognizing micro- actions and reactions from paired egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2629–2638, 2016. 2
[45] M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao, and J. Feng. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017. 2
[46] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsuper- vised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851 – 1860, 2017. 5, 7, 8