Recently, activity recognition from first-person camera views has been attracting increasing interest, motivated by advances in wearable device technology. Recognition of activities of daily living (ADL) from first-person views is an important task related to activity recognition. ADL are basic activities in a typical human life such as “making coffee” or “cutting paper.” If the system recognizes ADL properly, then it is applicable to nursing services, rehabilitation, and lifestyle habit improvements.
To recognize ADL, it is important to examine objects undergoing hand manipulation specifically. For example, a cleaning activity might be recognized only by recognizing that a user is using a vacuum cleaner. One can also recognize a coffee-making activity if it is observed that a user is handling a mug and coffee beans. Pirsiavash and Ramanan [26] described the importance of recognizing handled objects for ADL recognition. They developed an ADL dataset collected using a chest-mounted camera. Then, they implemented ADL recognition in different homes by detecting the user’s hands and handled objects. The result suggests the crucial importance of detecting the handled objects properly in various environments for ADL recognition.
Figure 1. Activities of daily living (ADL) captured by a wrist-mounted camera (left) and a head-mounted camera (right).
In a view from a head-mounted camera or a chest-mounted camera, handled objects are captured at a small scale in various positions. Furthermore, many non-handled objects also appear in the captured image. Consequently, many studies have examined hand detection or gaze prediction to develop a means of discerning picked-up and handled objects from other detected objects. However, such approaches entail the following difficulties: (a) Despite the advances in object-detection techniques, object detection is not an easy task in various environments. (b) Discerning a handled object from detected objects with hand detection or gaze prediction is not an easy task. (c) To train object detectors, it is necessary to collect numerous images with bounding boxes. Building a high-quality dataset with bounding boxes requires a considerable amount of labor, which hinders us from expanding a dataset.
To train an ADL recognition system without pixel-level annotations or bounding boxes of objects, we consider mounting a wearable camera on the wrist of the user’s dominant arm because the objects are handled mainly by the user’s dominant hand. We designate this camera as a wrist-mounted camera in this paper. For ADL recognition, a wrist-mounted camera has numerous advantages over a head-mounted camera or a chest-mounted camera: (a) Wrist-mounted cameras can capture a large image of the handled objects. (b) Because handled objects are close to a user’s dominant hand, the object positions are restricted in the images from the wrist-mounted camera. (c) Because of the above reasons, we can skip object detection and do not need a dataset of ADL with bounding boxes. We only need a dataset of ADL with annotations about the activity time segments in the videos.
We also propose a recognition system for videos captured by a wrist-mounted camera, which have strong spatial bias and weak temporal bias. As shown in Figure 2, an image captured by a wrist-mounted camera has strong spatial bias, meaning that hand-manipulated objects tend to be located in the central area. In addition, the order of manipulated objects is mostly fixed for each action, which we call “weak temporal bias.” The state-of-the-art video representation, which extracts local features containing spatial information from pre-trained convolutional neural networks (CNNs), loses most spatial information and all temporal information after encoding. Therefore, we also propose a novel video representation that retains spatial and temporal information after encoding to exploit the above-mentioned biases.
Our three contributions are the following:
1. We propose the use of a wrist-mounted camera for ADL recognition instead of a head-mounted camera or a chest-mounted camera.
2. We propose a discriminative video representation that retains spatial and temporal information. This is a method for the dataset captured from a wrist-mounted camera that has a large bias of spatial information and a small bias of temporal information.
3. We developed a novel and publicly available dataset that includes videos and annotations of ADL captured from a head-mounted camera and a wrist-mounted camera simultaneously.
2.1. Egocentric vision for ADL recognition
Various approaches for ADL recognition based on handled objects have been proposed [24, 29, 35]. Because wearable devices with cameras such as GoPro and Google Glass have been developed recently, ADL recognition with first-person viewpoint cameras has received a considerable amount of attention. Some works on egocentric ADL recognition have achieved results in a single environment, such as a kitchen or an office [9, 7, 18, 19].
For more practical settings, Pirsiavash and Ramanan [26] estimated the type of a handled object by detecting the object and arm from a wearable camera’s viewpoint. They demonstrated that action recognition performs well in diverse environments. However, it is necessary to provide positional information of all objects in all frames of the video at the time of learning. In addition, detecting an entire handled object itself is still difficult. Although their dataset has various annotations, such as the type of activity and its duration, as well as the type of an object and its location, it took 10 part-time annotators over one month to produce these annotations. Consequently, expanding the dataset is not practical. In a more practical setting, an ADL recognition system that uses wearable devices in diverse environments should be trained with labels obtained by simpler annotation methods.
2.2. Video representation for action recognition
Video representation has been well studied in the action recognition domain. Some deep-learning approaches for action recognition have been proposed [23, 31]. However, these approaches require the use of large-scale video datasets (e.g. Sports 1M [14]), which are difficult to address and which require enormous amounts of time for the whole learning process.
Motion features: The general pipeline to obtain a video representation for action recognition models the distribution of local features from training videos. Local features representing motion information (e.g., HOG [4], HOF [20], and MBH [5]) are usually used. The combination of local features and improved dense trajectory (iDT) [34], which compensates for camera motion, is the de facto standard. It has shown great performance for action recognition [33].
CNN descriptors: CNN has achieved superior results to the standard pipeline for object recognition [16]. Jain et al. [12] brought CNN to action recognition. They obtain the state of a fully connected layer from each frame in videos and calculate the video representation by averaging all CNN features. Their method exhibits performance that is surprisingly comparable to the combination of iDT, MBH, and Fisher vector (FV) [25]. To obtain more discriminative features containing spatial information, Xu et al. [36] proposed the extraction of latent concept descriptors (LCDs) from the pool5 layer and the application of VLAD [13] instead of averaging. However, spatial information is ignored when applying VLAD. Since our task lies between action recognition and object recognition, we build on the CNN-based video representation above to design a video representation for ADL recognition from wrist-mounted cameras, which have a strong spatial information bias.
Figure 2. Mean images of a head-mounted camera (left) and a wrist-mounted camera (right). Skin pixels are visible on the right side of the wrist-mounted camera image, although we cannot see anything in the head-mounted camera image. This implies that the user’s hand always appears in the right side and handled objects appear near the center of the wrist-mounted camera image.
Some works in the area of interface research have shed light on wrist-mounted cameras [32, 15]. In ADL recognition, Maekawa et al. [21] conducted multimodal ADL recognition using a wrist-mounted device that has a camera, microphone, acceleration sensor, illuminance meter, and digital compass. However, the color histogram alone is used as an image feature. This system is too simple to identify handled objects. Wrist-mounted cameras have never been evaluated carefully in ADL recognition. Therefore, we discuss the superiority of wrist-mounted cameras in this section.
Wrist-mounted cameras capture handled objects very closely, as shown in Figure 1. In addition, as shown in Figure 2, the user’s hand invariably appears on the right side of the image captured by a wrist-mounted camera, unlike that by a head-mounted camera. This trend of wrist-mounted cameras also means that handled objects always appear near the center of the captured image. Because of these strong spatial biases, we can recognize handled objects well even without manually annotating the bounding boxes of objects in the dataset. We need only to annotate the time segments of the activities. Wrist-mounted cameras have limitations: they cannot take pictures of human faces or recognize posture-defined actions such as “jumping” or “skipping.” Although there are such limitations, wrist-mounted cameras are more suitable for recognizing ADL, which mostly involves object manipulation.
As another feature, wrist-mounted cameras can process large motions. Because the handled objects and a wrist-mounted camera move together, other irrelevant parts, which move relative to the wrist-mounted camera, are blurred, whereas the handled objects are captured clearly. In addition, the objects, while moving, appear as static objects in the camera view, which enables robust recognition. This blurring effect is better obtained by setting the focal length to 10–30 cm.
We propose a new video representation based on LCDs [36] to take advantage of the strong spatial bias and weak temporal bias of the video captured by a wrist-mounted camera. Although LCD retains the spatial information in each frame at the descriptor level, spatial information and temporal information are dropped when the descriptors are encoded and aggregated into a video representation. However, the video captured by a wrist-mounted camera has a strong spatial bias, as described in Section 3. Although not as strong as spatial bias, temporal bias also exists because the order of handling objects is fixed roughly in each action class. Therefore, we use the benefits of these spatial and temporal biases specifically for a wrist-mounted camera and ADL.
Our method encodes LCDs at each location in all frames into single VLAD [13] vectors and optimizes the weights for the VLAD vectors to aggregate them into a video representation. The weight for a VLAD vector extracted from each location is designated as a spatial weight. Furthermore, we propose a method that divides a video into short sequences and optimizes the temporal weights for aggregating descriptors. Here, we describe the original LCD in Section 4.1, the proposed method to optimize spatial weights in Section 4.2, and the proposed method to optimize spatial and temporal weights in Section 4.3.
4.1. CNN latent concept descriptors
Latent concept descriptors [36] constitute a state-of-the-art video representation using CNN, which is obtained as follows. (i) Given a video $E$ including $T$ frames, $E = \{I_1, I_2, \ldots, I_T\}$, each frame is input to VGG net [27] pre-trained on the ImageNet2012 dataset [6] to obtain the pool5 layer’s output. The dimension of the pool5 features is $a \times a \times M$, where $a$ is the size of the filtered images of the last pooling layer and $M$ is the number of convolutional filters in the last convolutional layer (in the case of VGG net, $a = 7$ and $M = 512$). (ii) The responses of the $M$ filters are concatenated for the respective locations of the pool5 layer. Then, a set of descriptors $D_t$ is obtained from the $t$-th frame as follows.
\[ D_t = \{\, \mathbf{d}_t(i,j) \in \mathbb{R}^{M} \mid 1 \le i, j \le a \,\} \tag{1} \]
(iii) All descriptors in $\bigcup_{t=1}^{T} D_t$ are encoded with VLAD into a video representation $\mathbf{v}$. Letting $\{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$ denote a set of $K$ coarse centers obtained by K-means, we obtain $\mathbf{u}_k$ as follows:
\[ \mathbf{u}_k = \sum_{t=1}^{T} \; \sum_{(i,j):\, \mathrm{NN}(\mathbf{d}_t(i,j)) = \mathbf{c}_k} \bigl( \mathbf{d}_t(i,j) - \mathbf{c}_k \bigr) \tag{2} \]
Therein, $\mathrm{NN}(\mathbf{d}_t(i,j))$ represents the nearest center of $\mathbf{d}_t(i,j)$. Then, $\mathbf{v}$ is obtained as an $MK$-dimensional VLAD encoding vector by concatenating $\mathbf{u}_k$ over all $K$ centers. (iv) Finally, $\mathbf{v}$ is normalized by power and L2 normalization with intra-normalization [3].
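Steps (ii) to (iv) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `frame_descriptors` and `vlad_encode` are hypothetical, the centers are assumed to be given (for example from K-means), and the order of the three normalizations (intra-normalization, then power, then L2) is an assumption, since the text does not fix it.

```python
import numpy as np

def frame_descriptors(pool5):
    """Turn one (a, a, M) pool5 map into a*a descriptors of dimension M."""
    a, _, m = pool5.shape
    return pool5.reshape(a * a, m)  # row i*a + j holds the descriptor of cell (i, j)

def vlad_encode(descriptors, centers):
    """Encode (n, M) descriptors against (K, M) centers into an M*K vector."""
    # Squared Euclidean distances between every descriptor and every center.
    d2 = ((descriptors ** 2).sum(axis=1)[:, None]
          - 2.0 * descriptors @ centers.T
          + (centers ** 2).sum(axis=1)[None, :])
    nearest = d2.argmin(axis=1)

    k, m = centers.shape
    u = np.zeros((k, m))
    for idx in range(k):
        members = descriptors[nearest == idx]
        if len(members):
            # Sum of residuals to the assigned center (standard VLAD).
            u[idx] = (members - centers[idx]).sum(axis=0)

    # Intra-normalization: L2-normalize each per-center block (assumed order).
    u /= np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1e-12)
    v = u.ravel()
    # Power normalization (signed square root), then global L2 normalization.
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / max(np.linalg.norm(v), 1e-12)
```

In use, the descriptors of all frames would be stacked (for example with `np.concatenate`) before calling `vlad_encode`, so that one vector is produced per video.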
Figure 3. Illustration of the proposed video representation (DSTAR). For this example, we set a = 2 and L = 2.
4.2. Discriminative spatial aggregated latent concept descriptors
The original LCD [36] drops the spatial information in the process of VLAD encoding because the descriptors are equally weighted when they are encoded with VLAD into a video representation $\mathbf{v}$. We introduce spatial weights for the VLAD encoding vectors distinguished by their locations when we aggregate them into a final video representation. Because of the spatial weights, we can address the spatial bias such that the center area is more important than its surroundings, for example, because hand-manipulated objects tend to be located at the center view of a wrist-mounted camera. Specifically, we obtain a video representation $\mathbf{v}(i, j)$ for each cell $(i, j)$ over all the $T$ frames by encoding the descriptors at that cell in every frame, i.e., the set $\{\mathbf{d}_t(i,j)\}_{t=1}^{T}$. In the same manner as in [36], $\mathbf{v}(i, j)$ is normalized by power and L2 normalization with intra-normalization. Letting $\mathbf{w}_n$ denote an $a^2$-dimensional weight vector, we obtain a weighted sum of $\mathbf{v}(i, j)$ as
\[ \mathbf{F} = \mathbf{W}\mathbf{V} \tag{3} \]
where $\mathbf{W}$ and $\mathbf{V}$ are defined as shown below.
\[ \mathbf{W} = (\mathbf{w}_1, \ldots, \mathbf{w}_N)^{\top} \in \mathbb{R}^{N \times a^2}, \qquad \mathbf{V} = (\mathbf{v}(1,1), \mathbf{v}(1,2), \ldots, \mathbf{v}(a,a))^{\top} \in \mathbb{R}^{a^2 \times MK} \tag{4} \]
As described in this paper, we obtain $\mathbf{W}$ by arranging $N$ eigenvectors $\mathbf{w}_1, \ldots, \mathbf{w}_N$ obtained by partial least squares (PLS) in $N$ rows. Note that PLS is a method that can extract common information between sets of observed features. Here, $N$ is the number of eigenvectors obtained from PLS that we use. Details related to computing the eigenvectors are given in the Supplemental Materials. Finally, we obtain a video representation $\mathbf{v}_{\mathrm{DSAR}}$, which is called discriminative spatial aggregated LCDs (DSAR), by concatenating all elements in $\mathbf{F}$ in (3). Here, $\mathbf{v}_{\mathrm{DSAR}}$ is normalized by power and L2 normalization.
The idea to use the eigenvectors obtained by PLS as spatial weights was derived from the discriminative spatial pyramid representation (D-SPR) [10]. Consequently, the proposed method described in this section can be regarded as the combination of LCD and D-SPR. Our method optimizes the weights for cells in the output of pool5, whereas D-SPR optimizes the weights for areas in a spatial pyramid of each frame.
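The spatially weighted aggregation of the per-cell vectors v(i, j) in (3) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `dsar_aggregate` is hypothetical, the per-cell VLAD vectors are assumed to be stacked row by row, and the weight matrix is taken as given, because the PLS computation is described only in the Supplemental Materials.

```python
import numpy as np

def dsar_aggregate(cell_vectors, spatial_weights):
    """Weighted sums of per-cell VLAD vectors, concatenated and normalized.

    cell_vectors:    (a*a, D) array, one already-normalized VLAD vector per cell.
    spatial_weights: (N, a*a) array, one weight vector per row
                     (assumed to come from PLS; not computed here).
    """
    weighted = spatial_weights @ cell_vectors      # (N, D): one weighted sum per row
    f = weighted.ravel()                           # concatenate all elements
    f = np.sign(f) * np.sqrt(np.abs(f))            # power normalization
    return f / max(np.linalg.norm(f), 1e-12)       # L2 normalization
```

With a single row of uniform weights, this reduces to a normalized average over cells, which ignores location; non-uniform rows let the center cells count more than the border cells.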
4.3. Discriminative spatiotemporal aggregated latent concept descriptors
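The wrist-mounted camera dataset not only has a strong spatial information bias, but also has a weak temporal information bias. The original LCD described in Section 4.1 and the proposed DSAR described in Section 4.2 lose temporal information in the process of VLAD encoding. Inspired by the idea of spatiotemporal pyramids [17], we introduce temporal weights for the VLAD encoding vectors distinguished by their time stamps when we aggregate them into a final video representation. Because of the temporal weights, we can control how much each part of the video contributes to the whole video representation. Specifically, we split a video into $S$ sequences consisting of equal numbers of frames $T/S$. The $s$-th sequence is a set of frames
\[ E_s = \{\, I_t \mid (s-1)\,T/S < t \le s\,T/S \,\}. \tag{5} \]
Then, we obtain a video representation $\mathbf{v}_s(i,j)$ for each cell $(i, j)$ over all the $T/S$ frames in the $s$-th sequence by encoding the descriptors at that cell in those frames, i.e., the set $\{\mathbf{d}_t(i,j) \mid I_t \in E_s\}$. Here, we consider multiple levels of the splitting ($l = 0, \ldots, L$), with $S_l$ sequences at level $l$, such that we obtain a set of $\mathbf{v}^{l}_{s}(i,j)$ as follows:
\[ \{\, \mathbf{v}^{l}_{s}(i,j) \mid l = 0, \ldots, L,\; s = 1, \ldots, S_l \,\}. \tag{6} \]
Again, $\mathbf{v}^{l}_{s}(i,j)$ is normalized by power and L2 normalization with intra-normalization. Letting $\mathbf{w}^{\mathrm{tm}}_{m}$ denote a $P$-dimensional temporal weight vector, where $P = \sum_{l=0}^{L} S_l$ is the total number of sequences over all levels, and $\mathbf{w}^{\mathrm{sp}}_{n}$ an $a^2$-dimensional spatial weight vector, we obtain a weighted sum
\[ \mathbf{f}_{n,m} = \sum_{i,j} \sum_{l=0}^{L} \sum_{s=1}^{S_l} w^{\mathrm{sp}}_{n}(i,j)\, w^{\mathrm{tm}}_{m}(l,s)\, \mathbf{v}^{l}_{s}(i,j), \qquad \mathbf{F} = \{\mathbf{f}_{n,m}\}. \tag{7} \]
Here, we define $\mathbf{W}_{\mathrm{sp}}$ and $\mathbf{W}_{\mathrm{tm}}$, where $n = 1, \ldots, N_{\mathrm{sp}}$ and $m = 1, \ldots, N_{\mathrm{tm}}$, as shown below.
\[ \mathbf{W}_{\mathrm{sp}} = (\mathbf{w}^{\mathrm{sp}}_{1}, \ldots, \mathbf{w}^{\mathrm{sp}}_{N_{\mathrm{sp}}})^{\top} \in \mathbb{R}^{N_{\mathrm{sp}} \times a^2}, \qquad \mathbf{W}_{\mathrm{tm}} = (\mathbf{w}^{\mathrm{tm}}_{1}, \ldots, \mathbf{w}^{\mathrm{tm}}_{N_{\mathrm{tm}}})^{\top} \in \mathbb{R}^{N_{\mathrm{tm}} \times P} \tag{8} \]
As described in this paper, we optimize the spatial weights $\mathbf{W}_{\mathrm{sp}}$ and the temporal weights $\mathbf{W}_{\mathrm{tm}}$ iteratively and alternately. Specifically, we repeat the following two steps.
Step 1: optimizing $\mathbf{W}_{\mathrm{sp}}$. In this step, we fix $\mathbf{W}_{\mathrm{tm}}$ and optimize $\mathbf{W}_{\mathrm{sp}}$. We obtain $(N_{\mathrm{tm}} \cdot MK)$-dimensional vectors $\tilde{\mathbf{v}}(i,j)$ by concatenating all the elements in $\bigl\{ \sum_{l,s} w^{\mathrm{tm}}_{m}(l,s)\, \mathbf{v}^{l}_{s}(i,j) \bigr\}_{m=1}^{N_{\mathrm{tm}}}$. Letting $\tilde{\mathbf{V}}$ denote $(\tilde{\mathbf{v}}(1,1), \ldots, \tilde{\mathbf{v}}(a,a))^{\top}$, we can rewrite (7) as
\[ \mathbf{F} = \mathbf{W}_{\mathrm{sp}} \tilde{\mathbf{V}}. \tag{9} \]
This formulation is identical to (3). Therefore, we optimize $\mathbf{W}_{\mathrm{sp}}$ in the manner described in Section 4.2.
Step 2: optimizing $\mathbf{W}_{\mathrm{tm}}$. In this step, we fix $\mathbf{W}_{\mathrm{sp}}$ and optimize $\mathbf{W}_{\mathrm{tm}}$. We obtain $(N_{\mathrm{sp}} \cdot MK)$-dimensional vectors $\hat{\mathbf{v}}^{l}_{s}$ by concatenating all the elements in $\bigl\{ \sum_{i,j} w^{\mathrm{sp}}_{n}(i,j)\, \mathbf{v}^{l}_{s}(i,j) \bigr\}_{n=1}^{N_{\mathrm{sp}}}$. Letting $\hat{\mathbf{V}}$ denote $(\hat{\mathbf{v}}^{0}_{1}, \ldots, \hat{\mathbf{v}}^{L}_{S_L})^{\top}$, we can then rewrite (7), up to the arrangement of its elements, as
\[ \mathbf{F} = \mathbf{W}_{\mathrm{tm}} \hat{\mathbf{V}}. \tag{10} \]
This formulation is identical to (3). Therefore, we optimize $\mathbf{W}_{\mathrm{tm}}$ in the manner presented in Section 4.2.
We iterate Step 1 and Step 2 several times. Finally, we obtain a video representation $\mathbf{v}_{\mathrm{DSTAR}}$, which is called discriminative spatiotemporal aggregated LCDs (DSTAR), by concatenating all the elements in $\mathbf{F}$ in (7) with power and L2 normalization. An illustration of $\mathbf{v}_{\mathrm{DSTAR}}$ is shown in Figure 3.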
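The alternating scheme relies on the fact that, once one set of weights is fixed, the spatiotemporal weighted sum takes the same form as the purely spatial one. The NumPy sketch below illustrates this under several assumptions: all function names are hypothetical, the weights are taken as given instead of being computed by PLS, and level l of the temporal pyramid is assumed to split the video into 2**l near-equal parts, which the text does not state explicitly.

```python
import numpy as np

def pyramid_segments(num_frames, num_levels):
    """Frame-index arrays for every temporal segment, level 0 first.

    Assumption: level l splits the video into 2**l near-equal parts.
    """
    segments = []
    for level in range(num_levels + 1):
        segments.extend(np.array_split(np.arange(num_frames), 2 ** level))
    return segments

def dstar_aggregate(V, W_sp, W_tm):
    """Bilinear weighted sum over cells and temporal segments.

    V:    (P, C, D) VLAD vectors for P temporal segments and C = a*a cells.
    W_sp: (N_sp, C) spatial weights; W_tm: (N_tm, P) temporal weights.
    Returns an (N_sp, N_tm, D) array of weighted sums.
    """
    return np.einsum("nc,mp,pcd->nmd", W_sp, W_tm, V)

def spatial_step_view(V, W_tm):
    """With temporal weights fixed: one (N_tm*D)-dim vector per cell."""
    _, num_cells, _ = V.shape
    return np.einsum("mp,pcd->cmd", W_tm, V).reshape(num_cells, -1)

def temporal_step_view(V, W_sp):
    """With spatial weights fixed: one (N_sp*D)-dim vector per segment."""
    num_segments, _, _ = V.shape
    return np.einsum("nc,pcd->pnd", W_sp, V).reshape(num_segments, -1)
```

Multiplying `W_sp` by the output of `spatial_step_view` (or `W_tm` by the output of `temporal_step_view`) reproduces the entries of `dstar_aggregate`, so each step can reuse whatever procedure optimizes the spatial weights alone.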
We created a new ADL dataset that uses both a wrist-mounted camera and a head-mounted camera because there are as yet no published ADL datasets that use wrist-mounted cameras. In this section, we present the details of our dataset.
Note that it is also important to compare a wrist-mounted camera with a chest-mounted camera, rather than only with a head-mounted one, since a chest-mounted camera is closer to the user’s hands. We encourage comparison of wrist mounting with other mountings [22] for ADL recognition as future work.
Table 1. Duration of each class and the distribution of the 23 classes in our dataset.
5.1. Activity class
We chose activity classes by referring to previous studies of ADL [26, 30, 8]. First, we removed some classes that many users were reluctant to record on video such as “brushing teeth” and “laundry.” Next, to introduce more variety into our dataset, we added some actions referring to other ADL recognition studies [30] and an evaluation of Alzheimer rehabilitation [8]. As Table 1 shows, we strove to recognize 23 ADL classes in this study. Detailed information is given in the Supplemental Materials.
5.2. Collection and annotation
To assemble the dataset, we used a GoPro HERO3+ as the head-mounted camera and an HX-A100 as the wrist-mounted camera. Each user wore these two cameras, as shown in Figure 4. Each user therefore recorded two videos simultaneously. As in a previous ADL egocentric dataset [26], we did not instruct the users in detail how to act, so as to obtain realistic data. After taking videos, all users manually annotated the duration and the action class in their own videos. The definition of an action includes some initial and final actions related to the action. For example, the action “cutting paper” is defined as follows: The initial action of “cutting paper” is to take scissors from the table and the final action is to put them back on the table. We recruited 20 people to perform these tasks. All users were right-handed. Our wrist-mounted camera dataset and head-mounted camera dataset each contain 6.5 h (about 690,000 frames) of images.
Figure 4. Wearing a head-mounted camera and a wrist-mounted camera.
Figure 5. Example images from our head-mounted dataset (top half) and wrist-mounted dataset (bottom half). We present a wide variety of scenes and ADL classes.
5.3. Characteristics
Various objects are handled in daily life. Therefore, for ADL recognition, it is important to be able to recognize them in diverse environments. For this study, we asked users to take videos in their own homes. As shown in the examples in Figure 5, the environments caught on camera differ depending on the user. More examples are shown in the Supplemental Materials.
6.1. Experiment protocols
Table 2. Mean classification accuracy of the proposed methods on the wrist-mounted camera dataset (WCD) and the head-mounted camera dataset (HCD). STAR is the method without weight optimization, which is equivalent to a spatiotemporal pyramid [17].
We used the 16-layer VGG net [27] pre-trained on the ImageNet 2012 dataset [6] for the CNN architecture in the same manner as the LCD [36]. Motion features are also important in action recognition. Therefore, we evaluated our dataset not only with CNN descriptors but also with iDT [34]. Following [34], we reduced the dimensions of the descriptors (HOG, HOF, and MBH) by a factor of 2 with PCA and encoded them with FV, where the number of components of the Gaussian mixture model was 256. We applied power and L2 normalization to the aggregated vectors. As a classifier, we used a one-vs.-all SVM with a linear kernel, setting C = 100. We used leave-one-user-out cross-validation for the evaluation so that the same person does not appear in both the training and test data for ADL recognition. The number of iterations of our methods is fixed at five because the optimization usually converges in a few iterations.
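The leave-one-user-out protocol can be sketched as follows. This is a hedged illustration, not the evaluation code used in the paper: `leave_one_user_out`, `mean_accuracy`, and `nearest_centroid` are hypothetical names, accuracy is assumed to be averaged over folds, and the nearest-centroid classifier is only a stand-in for the one-vs.-all linear SVM so that the example has no dependencies beyond NumPy.

```python
import numpy as np

def leave_one_user_out(user_ids):
    """Yield (held_out_user, train_indices, test_indices) for each user."""
    user_ids = np.asarray(user_ids)
    for user in np.unique(user_ids):
        test_mask = user_ids == user
        yield user, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

def nearest_centroid(train_x, train_y, test_x):
    """Placeholder classifier (stand-in for the linear SVM in the text)."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    d2 = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

def mean_accuracy(features, labels, user_ids, fit_predict=nearest_centroid):
    """Average per-fold accuracy; no user appears in both train and test."""
    features, labels = np.asarray(features), np.asarray(labels)
    accuracies = []
    for _, train_idx, test_idx in leave_one_user_out(user_ids):
        predicted = fit_predict(features[train_idx], labels[train_idx],
                                features[test_idx])
        accuracies.append(np.mean(predicted == labels[test_idx]))
    return float(np.mean(accuracies))
```

Any classifier with the same `fit_predict(train_x, train_y, test_x)` shape can be passed in place of the placeholder.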
6.2. Evaluating DSTAR and our dataset
We evaluated our approach on our wrist-mounted camera dataset (WCD) and head-mounted camera dataset (HCD). For fair comparison, we reduced the LCD dimensions from 512-D to a range of dimensions such as 64-D, 128-D, and 256-D with PCA, and encoded them with various numbers of centers K in VLAD such as K = 64, 128, 256, 512, 1024 as in [36] to find the best ones. We also explored the best choice of the dimensions and the remaining parameters for our method. We describe the best parameters and how they are determined in the Supplemental Materials because of the limited space here.
Table 2 presents the action classification accuracy on our dataset. Comparing the cameras, we found the accuracy on WCD to be superior to that on HCD for every method. Next, we compared each method on WCD. DSAR, which retains spatial information after encoding, showed superior performance to LCD; DSTAR, which retains not only spatial information but also temporal information, exhibited superior performance to DSAR on both datasets. Results showed that an LCD with a spatial pooling layer (LCDspp) did not improve performance on our dataset, unlike on TRECVID MEDTest 13 and 14 [1, 2]. The images captured by a wrist-mounted camera have strong spatial bias, and ADL actions have weak temporal bias, as described in Section 3. From the obtained results, we can confirm that using spatial and temporal bias improves recognition accuracy on WCD. However, DSAR and DSTAR did not improve HCD performance. As shown in Figure 2, we cannot confirm strong spatial bias in the images captured by a head-mounted camera. Splitting the features by cell only made them sparse: if images have no strong spatial bias, the aggregated video representation does not become more discriminative than the one obtained without splitting. Consequently, DSAR and DSTAR improve recognition accuracy more for WCD than for HCD. Through these recognition results, we can confirm that using a wrist-mounted camera and considering spatial and temporal information improved ADL recognition performance.
Figure 6. Visualization example of iDT on a head-mounted camera (left) and a wrist-mounted camera (right). These images were captured simultaneously. Green lines are trajectories that were removed from the backgrounds with iDT. It is apparent that many background points in the image of a wrist-mounted camera are regarded as the foreground because of large motion on the camera.
Table 4. Mean classification accuracy of combining CNN-based descriptors with motion features, and a wrist-mounted camera with a head-mounted camera.
6.3. Applicability in existing datasets
Table 3 shows how widely our methods can be applied. We first evaluated LCD and our methods on UCI ADL [26]. As shown in the table, our methods did not improve the performance since this dataset has low spatial bias. Moreover, we evaluated them on UCF101 [28], which is one of the representative datasets of typical action recognition. Although its bias is not as strong as that of our wrist-mounted dataset, this dataset has substantial bias. Therefore, DSTAR showed better performance on this dataset than LCD did. Details are shown in the Supplemental Materials.
6.4. Fusing motion features and cameras
From Tables 2 and 4, we can confirm that iDT features are less discriminative than CNN-based features on WCD.
Comparing iDT on both cameras, we found that iDT on HCD showed better performance than on WCD, unlike the CNN descriptors. The reason is apparent from Figure 6. As the figure shows, iDT failed to remove the backgrounds from the video captured by a wrist-mounted camera, in contrast to the head-mounted camera, because of the large motion of the camera. Therefore, iDT on HCD is superior to that on WCD.
Figure 7. Visualization of DSTAR spatial weights on the wrist-mounted camera dataset.
Figure 8. Visualization of DSTAR temporal weights on the wrist-mounted camera dataset.
Jain et al. [11] showed that combining object features extracted by CNN with motion features such as iDT boosts action classification accuracy. Following their conclusion, we also examined how our method was affected by the combination with motion features. We fused our methods with iDT on each dataset by simply averaging the score obtained using our methods and the mean score obtained from all iDT scores. As Table 4 shows, the performance of motion features was boosted by our methods more than by LCD. Although iDT features were more discriminative on HCD than on WCD, the combined features showed better performance on WCD than on HCD. Unlike in action recognition, object features are more effective than motion features in ADL recognition because the critical key is the handled object. Therefore, we find that wrist-mounted cameras are more suitable for ADL recognition than head-mounted cameras.
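The score-level fusion described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `fuse_scores` is hypothetical, each score array is assumed to have shape (number of videos, number of classes), and the scores of the different classifiers are assumed to be on comparable scales, which the text does not discuss.

```python
import numpy as np

def fuse_scores(cnn_scores, idt_scores_list):
    """Average the CNN-based score with the mean of the iDT descriptor scores.

    cnn_scores:      (n_videos, n_classes) scores from the CNN-based method.
    idt_scores_list: list of (n_videos, n_classes) arrays, one per iDT
                     descriptor (e.g. HOG, HOF, MBH).
    """
    idt_mean = np.mean(np.stack(idt_scores_list), axis=0)
    return 0.5 * (np.asarray(cnn_scores) + idt_mean)

def predict(fused_scores):
    """Pick the class with the highest fused score for each video."""
    return fused_scores.argmax(axis=1)
```

Averaging the iDT scores first keeps the motion features from outweighing the CNN-based score simply because there are several of them.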
In case the user wears both a head-mounted camera and a wrist-mounted camera, we can choose superior information from wrist-mounted cameras and head-mounted cameras. Better object information is obtainable from wrist-mounted cameras, but better motion information is obtainable from head-mounted cameras. Therefore, we combined DSTAR on WCD with iDT on HCD to achieve the best accuracy of 89.7% on our dataset.
7.1. Visualized weights of DSTAR
Figures 7 and 8 show the absolute values of the spatial and temporal weights calculated using DSTAR on WCD. These figures present the optimal discriminative weights for the respective cells.
Spatial weight: In Figure 7, it is apparent that cells near the center are important for classification, whereas cells on the right side are less important. The user’s palm always appears on the right side of a wrist-mounted camera image, so no object appears there. Therefore, the right side area in the image has less information for recognition. The features obtained from the upper left cell and the bottom left cell are also less discriminative because backgrounds unrelated to the user’s action are often captured in these cells. However, handled objects often appear in the middle area of a wrist-mounted camera image. Discriminative features can be obtained from the cells of these areas.
Figure 9. This figure represents the recognition accuracy of each ADL class. It also shows the differences between the models.
Temporal weight: Although not as strong as the spatial bias of the image captured by a wrist-mounted camera, each ADL class has weak temporal bias. As Figure 8 shows, although full-length features (level 0) are the most important, temporally cut features (levels 1 and 2) have different weights. Using temporally cut pyramids improved the recognition performance, as presented in Table 2. Additionally, slight differences are apparent at the same level. At level 2, the beginning and the end of the action are slightly more important than the middle of the action.
7.2. Analysis of ADL classification results
We analyze the classification results here. Figure 9 presents the results of four different methods: LCD on HCD, LCD on WCD, DSTAR on WCD, and, finally, DSTAR on WCD and iDT on HCD.
Comparing HCD with WCD: We first compared both cameras using LCD. Results showed that 18 classes showed superior performance on WCD over HCD; these classes improved by 28.1% on average. In particular, “write on paper,” “cut paper,” and “staple paper” improved significantly. These classes are actions wherein users use small objects such as a pen, scissors, and a stapler. A head-mounted camera captures these objects at a small scale. However, a wrist-mounted camera can capture large-scale images even of small objects. Four classes, however, showed inferior performance on WCD compared to HCD, and “dry dishes” is the class in which the classification accuracy declined most, by 18.5%. On WCD, “dry dishes” was more often confused with “wash dishes” than on HCD, although all actions of “wash dishes” were recognized correctly.
Comparing LCD with DSTAR: Next, we compared DSTAR with LCD on WCD. Using DSTAR instead of LCD improved the accuracy on 14 classes. Its average improvement rate was 10.1%. One significantly improved action class was “vacuuming.” When we used a vacuum cleaner with a wrist-mounted camera, the floor and other unrelated backgrounds appeared on the left side of the wrist-mounted camera image. DSTAR presumably improved the performance by reducing the importance of the features extracted from these areas. On five classes, the accuracy decreased, but the average rate of decrease was only 5.5%.
Adding iDT on HCD with DSTAR on WCD: Finally, we examined how adding iDT on HCD affected the results. Fourteen classes improved their performance when iDT on HCD was added; these classes improved by 10.4% on average. In particular, “wipe desk” improved greatly. Because the object information of “wipe desk” on HCD was often confused with other actions done near the table, using a wrist-mounted camera improved the performance. Moreover, the “wipe desk” motion was distinctive. Consequently, adding iDT on HCD boosted the performance again. The accuracy declined on only five classes, and the average decline was only 4.7%. When the user wears both cameras, we can obtain better recognition.
This study examined the recognition of ADL with a wrist-mounted camera. We developed a publicly available dataset of videos taken with a head-mounted camera and a wrist-mounted camera. Additionally, we proposed a novel video representation that aggregates CNN descriptors spatially and temporally, and optimizes their weights iteratively and alternately. Finally, using the proposed dataset, we quantitatively demonstrated the benefits of a wrist-mounted camera over a head-mounted camera and those of our proposed method over previous methods. We believe that our work will help spread the use of cameras attached to wrist-mounted devices.
[1] TRECVID MED 13. http://www.nist.gov/itl/iad/mig/ med13.cfm. 6
[2] TRECVID MED 14. http://www.nist.gov/itl/iad/mig/ med14.cfm. 6
[3] R. Arandjelovic and A. Zisserman. All about VLAD. In CVPR, 2013. 3
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005. 2
[5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006. 2
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei- Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 3, 6
[7] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In CVPR, 2011. 2
[8] D. Galasko, D. Bennett, M. Sano, C. Ernesto, R. Thomas, M. Grundman, and S. Ferris. An inventory to assess activities of daily living for clinical trials in Alzheimer’s disease. Alzheimer Disease & Associated Disorders, 11:33–39, 1997. 5
[9] M. Hanheide, N. Hofemann, and G. Sagerer. Action recog- nition in awearable assistance system. In ICPR, 2006. 2
[10] T. Harada, Y. Ushiku, Y. Yamashita, and Y. Kuniyoshi. Dis- criminative spatial pyramid. In CVPR, 2011. 4, 11, 15
[11] M. Jain, J. C. van Gemert, and C. G. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In CVPR, 2015. 7
[12] M. Jain, J. C. van Gemert, and C. G. M. Snoek. University of Amsterdam at THUMOS Challenge 2014. In ECCV workshop on THUMOS Challenge, 2014. 2
[13] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010. 2, 3
[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2
[15] D. Kim, O. Hilliges, S. Izadi, A. D. Butler, J. Chen, I. Oikonomidis, and P. Olivier. Digits: freehand 3D interactions anywhere using a wrist-worn gloveless sensor. In UIST, 2012. 3
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012. 2
[17] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008. 4, 6
[18] Y. Li, A. Fathi, and J. M. Rehg. Learning to predict gaze in egocentric video. In ICCV, 2013. 2
[19] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In CVPR, 2015. 2
[20] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. IJCAI, 81:674–679, 1981. 2
[21] T. Maekawa, Y. Yanagisawa, Y. Kishino, K. Ishiguro, K. Kamei, Y. Sakurai, and T. Okadome. Object-based activity recognition with heterogeneous sensors on wrist. In ICPC, 2010. 3
[22] W. W. Mayol-Cuevas, B. J. Tordoff, and D. W. Murray. On the choice and placement of wearable vision sensors. Systems, Man and Cybernetics, Part A: Systems and Humans, 39(2):414–425, 2009. 5
[23] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015. 2
[24] D. J. Patterson, D. Fox, H. Kautz, and M. Philipose. Fine-grained activity recognition by aggregating abstract object usage. In ISWC, 2005. 2
[25] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007. 2
[26] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012. 1, 2, 5, 7
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015. 3, 6, 15
[28] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402, 2012. 7
[29] M. Stikic, T. Huynh, K. Van Laerhoven, and B. Schiele. ADL recognition based on the combination of RFID and accelerometer sensing. In Pervasive Computing Technologies for Healthcare, 2008. 2
[30] E. M. Tapia, S. S. Intille, and K. Larson. Portable wireless sensors for object usage sensing in the home: Challenges and practicalities. In Ambient Intelligence, 2007. 5
[31] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015. 2
[32] A. Vardy, J. Robinson, and L.-T. Cheng. The wristcam as input device. In ISWC, 1999. 3
[33] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60–79, 2013. 2
[34] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013. 2, 6, 7
[35] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg. A scalable approach to activity recognition based on object use. In ICCV, 2007. 2
[36] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, 2015. 2, 3, 4, 6, 7, 13
We compute eigenvectors that construct by calculating partial least squares (PLS). The idea to use PLS for obtaining weights for discriminative features was derived from [10]. Here, we reproduce Eq. (3), (4), and (5) in our paper below:
We let be a column vector in
and
denote the corresponding column vector in V (i.e.
). Suppose that we have N labeled training samples
with C classes, where
and
represents the class label of the i-th training sample ranging from 1 to C. The between-class covariance matrix
can be written as follows:
where , and
is the number of samples in the c-th class. The trace of
is given by:
where
Here, is the mean of
belonging to the c-th class, and
is the mean of all samples in the training dataset. By maximizing Eq. (17) under the condition
, we obtain the eigenvector of the following eigenvalue problem:
where is the eigenvalue corresponding to the eigenvector w. We select the
largest eigenvalues
, and the corresponding eigenvectors
. Finally, we create
by arranging
in a row. As described in our paper,
can be obtained in the same manner as
when
is fixed.
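The procedure above (between-class covariance, trace maximization under a norm constraint, and selection of the leading eigenvectors) can be sketched numerically as follows. This is a minimal sketch and not the authors' implementation. It assumes, in the style of D-SPR [10], that each training sample is a d × S matrix whose columns are per-cell features and that the representation is the weighted sum of those columns. The function name `between_class_weights`, the array layout, and the unit-norm constraint on the weight vector are assumptions made for illustration.

```python
import numpy as np


def between_class_weights(F, y, k):
    """Return the k largest eigenvalues and their eigenvectors.

    F : array of shape (N, d, S); F[i] is the d x S feature matrix of
        sample i (assumed layout: one column per cell).
    y : array of shape (N,) with integer class labels.
    k : number of leading eigenvectors to keep.
    """
    F = np.asarray(F, dtype=float)
    y = np.asarray(y)
    N, _, S = F.shape

    M = F.mean(axis=0)  # mean feature matrix over all samples, shape (d, S)
    sigma = np.zeros((S, S))
    for c in np.unique(y):
        F_c = F[y == c]
        diff = F_c.mean(axis=0) - M  # class mean minus global mean
        sigma += len(F_c) * diff.T @ diff  # weighted by the class size
    sigma /= N

    # sigma is symmetric, so eigh applies; it returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1][:k]
    # Columns of the returned matrix are the selected eigenvectors.
    return eigvals[order], eigvecs[:, order]
```

Under these assumptions, projecting each sample with the leading eigenvector w (that is, computing F[i] @ w) gives features whose between-class covariance has a trace equal to the leading eigenvalue, which is the quantity being maximized.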
Table 5 shows the definition of each class in our dataset.
Table 5. Activity definitions of our dataset
Figures 10, 11, 12, and 13 show the confusion matrices, which reveal how the action classes are confused with one another. For example, “make coffee” and “make tea” are confused in every case. These actions share handled objects such as a mug and a pot. The main difference is whether the user uses a tea bag or coffee beans and a filter. It is difficult for a head-mounted camera to recognize these objects, whereas a wrist-mounted camera can recognize small handled objects easily. Thus, LCD [36] on the wrist-mounted camera dataset (WCD) distinguishes “make coffee” from “make tea” better than LCD on the head-mounted camera dataset (HCD).
Figure 10. The confusion matrix for LCD on HCD. Class labels: vacuuming, empty trash, wipe desk, turn on air-conditioner, open and close door, make coffee, make tea, wash dishes, dry dishes, use microwave, use refrigerator, wash hands, dry hands, drink water from a bottle, drink water from a cup, read book, write on paper, open and close drawer, cut paper, staple paper, fold origami, use smartphone, watch TV.
Figure 11. The confusion matrix for LCD on WCD
Figure 12. The confusion matrix for DSTAR on WCD
Figure 13. The confusion matrix for DSTAR on WCD & iDT on HCD
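For reference, a confusion matrix of the kind shown in these figures can be computed from per-video ground-truth and predicted labels as sketched below. The labels in the usage example are made up for illustration and are not results from our experiments. The function names are likewise hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def row_normalize(matrix):
    """Convert counts to per-class rates so that each non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Made-up labels for illustration only (not results from the dataset).
classes = ["make coffee", "make tea", "wash dishes"]
truth = ["make coffee", "make coffee", "make tea", "make tea", "wash dishes"]
predicted = ["make coffee", "make tea", "make coffee", "make tea", "wash dishes"]

counts = confusion_matrix(truth, predicted, classes)
rates = row_normalize(counts)
```

Off-diagonal entries in a row indicate which classes a given activity is mistaken for. In this toy example, the two drink-preparation classes are confused with each other, while the third class is not.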
In this section, we find the best parameters for each method.
D.1. Parameters for LCD
We first find the best parameters for LCD: the number of centers K in VLAD and the descriptor dimension D. Table 6 shows that the features become more discriminative as K increases on WCD. However, when K = 1024, the features become too sparse and less discriminative. Although the best parameters are (K, D) = (128, 256), the dimension to which the descriptors are compressed by PCA does not seem to have much effect.
Table 7 also shows that the features become more discriminative as K increases on HCD. However, when K = 512, the features become too sparse and less discriminative. Unlike on WCD, when each descriptor is compressed from 512-D to 64-D, it loses its discriminative ability. This can be understood as follows: the images captured by a wrist-mounted camera have less variety than those captured by a head-mounted camera. Thus, the descriptors extracted from WCD can be more compact than those from HCD.
Table 6. Impact on dimensions and numbers of centers K for LCD on WCD
Table 7. Impact on dimensions and numbers of centers K for LCD on HCD
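To make the roles of K and D concrete, the following sketch encodes a set of D-dimensional descriptors with VLAD [13] using K centers. It is a simplified illustration under stated assumptions and not the LCD pipeline of [36]: the descriptors are assumed to be already reduced to D dimensions by PCA, the centers are assumed to come from k-means, only a global L2 normalization is applied, and the name `vlad_encode` is hypothetical.

```python
import numpy as np


def vlad_encode(descriptors, centers):
    """Encode descriptors of shape (n, D) with centers of shape (K, D).

    Returns an L2-normalized vector of length K * D.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    centers = np.asarray(centers, dtype=float)

    # Squared Euclidean distance from every descriptor to every center.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)  # index of the nearest center

    vlad = np.zeros_like(centers)
    for k in range(len(centers)):
        members = descriptors[assignment == k]
        if len(members):
            # Accumulate residuals between the members and their center.
            vlad[k] = (members - centers[k]).sum(axis=0)

    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


# Toy example: K = 2 centers, D = 2 dimensions (values are arbitrary).
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 1.0]])  # both nearest to center 0
encoding = vlad_encode(descriptors, centers)
```

The output has K × D dimensions, and every center that attracts no descriptor contributes an all-zero block (the second block in the toy example). This is one way to see why a very large K can yield sparse features.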
We also find the best parameters for LCD with a spatial pyramid pooling (SPP) layer. Tables 8 and 9 show the results obtained for LCD with SPP. We can see a trend similar to that of LCD without the SPP layer, shown in Tables 6 and 7. The best parameters (K, D) are (128, 256) for LCD with SPP on WCD and (256, 256) for LCD with SPP on HCD. We use the scores obtained with these parameters in the submitted paper.
Table 8. Impact on dimensions and numbers of centers
Table 9. Impact on dimensions and numbers of centers
D.2. Number of spatial elements
Next, we find the best parameters for DSAR: the number of centers K in VLAD, the descriptor dimension, and the number of spatial elements. Tables 10 and 11 show the best parameters for DSAR on WCD and DSAR on HCD. We can see that the number of spatial elements does not need to be large, although it can be set up to 49 in the VGG-net [27] case. A similar trend can be seen in D-SPR [10]. If the features are cast into a well-isolated space by PLS, using too many spatial elements means adding inefficient features.
As for the number of clusters, K = 512 seems too sparse, unlike in Table 6. We calculate the weights shown in Eq. (15) from separately aggregated features in each cell. These separately aggregated features can be sparser than LCD features. Thus, the best number of clusters for DSAR is smaller than that for LCD.
The best parameters for DSAR on WCD and for DSAR on HCD can be found in Tables 10 and 11.
Table 10. Impact on dimensions, numbers of centers for DSAR on WCD.
Table 11. Impact on dimensions, numbers of centers for DSAR on HCD.
D.3. Number of spatial and temporal elements
Finally, we find the best parameters for DSTAR: the number of centers K in VLAD, the descriptor dimension, and the number of temporal elements. Following the result described in Section D.2, we fix the number of spatial elements. Tables 12 and 13 show the results, from which the best parameters for DSTAR on WCD and for DSTAR on HCD can be found.
Table 12. Impact on dimensions, numbers of centers for DSTAR on WCD, with fixed
Table 13. Impact on dimensions, numbers of centers for DSTAR on HCD, with fixed