The eye is an important organ through which humans obtain information from the outside world. According to statistics [1], nearly 80% of environmental information (color, brightness, shape, movement, depth, etc.) comes from vision. Computer vision (CV) gives computers the ability to "see the world" like humans. It uses cameras to mimic the function of the human eye in order to extract, recognize and track objects. Visual tracking is one of the most challenging problems in computer vision. It can provide a robot with tracking, localization and recognition of a specified target, and the parameters of the target or environment can be passed to the controller for subsequent use. It is widely applied in machine intelligence, including mobile robotics, autonomous driving, human-computer interaction, automated surveillance and eye-tracking technology.
1.1 Tracking algorithm and visual tracker
The traditional tracking algorithm differs from the visual tracker in CV. The former is better regarded as a tracking strategy: it predicts the motion state of the target in the next frame by using a mathematical formulation to model how the state space of the target changes in the time domain. The latter integrates a detection algorithm, tracking strategy, update strategy, online classifier, re-detector and other component algorithms in CV, and therefore has a more complex system structure. This paper focuses on introducing and analyzing work related to the latter.
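As an illustration of a tracking strategy in the sense described above, the following minimal sketch predicts the target position in the next frame with a constant-velocity Kalman filter. The state layout `[x, y, vx, vy]`, the noise levels `q` and `r`, and all function names are assumptions made for this example and are not taken from the cited works.

```python
import numpy as np

def make_cv_model(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity model for a 2-D position; the state is [x, y, vx, vy]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only the position is observed
    Q = q * np.eye(4)                           # process noise (assumed value)
    R = r * np.eye(2)                           # measurement noise (assumed value)
    return F, H, Q, R

def kf_predict(x, P, F, Q):
    """Propagate the state and its covariance to the next frame."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Correct the prediction with the measured position z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

def track(measurements):
    """Return the per-frame position predictions and the final state."""
    F, H, Q, R = make_cv_model()
    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = 10.0 * np.eye(4)                        # large initial uncertainty
    predictions = []
    for z in measurements[1:]:
        x, P = kf_predict(x, P, F, Q)
        predictions.append(x[:2].copy())        # predicted position before seeing z
        x, P = kf_update(x, P, np.asarray(z, dtype=float), H, R)
    return np.array(predictions), x
```

In a visual tracker, the measurement `z` would typically come from the detection or matching stage, and the prediction would be used to place the search window in the next frame.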
1.2 Aim and outline
Visual tracking is one of the research hotspots in computer vision. To evaluate the overall performance of visual trackers, starting with PETS [2] and VIVID [3], many researchers have provided evaluation datasets and proposed tracking training sets [4-6] (shown in Table 1). From Wu's evaluation benchmark [7, 8] to the VOT [9-14] visual tracking competition, the performance of state-of-the-art visual trackers has been ranked, and some of the trackers have been open-sourced. Using the data of these evaluation databases as a reference, we first introduce the difficulties and basic framework of visual tracking in Section 2. In Section 3, state-of-the-art trackers based on tracking-by-detection are summarized. Section 4 is dedicated to analyzing the characteristics required of trackers in the field of mobile robots. Finally, the conclusion and future directions are given in Section 5.
Table 1. Datasets proposed in recent years
Note: The VOT competition changes every year: it was recalibrated in 2016, 10 easy sequences were replaced with 10 difficult ones in 2017, and 35 long-term tracking sequences (VOT-LT) were added in 2018. ST: Short-Term, LT: Long-Term, Tr: Training set, Te: Test set.
2.1 Basic Framework and the problems in tracking system
Visual tracking has developed significantly over the past few decades [26-32], and its overall process has remained clear since it was first put forward. For an input video or image sequence, the state of the target in the current frame is first taken as the initial state of tracking (initializing the model parameters), and the key points are extracted and modeled. The target model is then applied to the subsequent frames, and the current state of the target is estimated by the tracking strategy (filtering method, optical flow method, etc.). Next, the target model is updated with the current state. Finally, the updated target model is tracked in the next frame. The basic flowchart of visual tracking is shown in Figure 1.
Fig. 1. The framework flow of a visual tracking system. Naiyan Wang et al. [33] analyzed the traditional visual tracking framework in detail. They decomposed visual tracking into five parts: Motion Model, Feature Extractor, Observation Model, Model Updater and Ensemble Post-processor. Their experimental results showed that feature extraction is far more important than the observation model in visual tracking.
In the above tracking framework, the feature extractor describes the target, and the object description model is constructed on the basis of the extracted target features. Trackers can be divided into two categories according to the way the target features are extracted and the observation model (online learning method) is built: generative methods and discriminative methods. The method used to predict the trajectory of a target in the observation model is the tracking strategy, such as the Kalman filter [34], extended Kalman filter [35], particle filter [36], L-K optical flow algorithm [37], Markov chain Monte Carlo algorithm [38], Normalized Cross Correlation [39], Mean-Shift [28, 40] and Cam-shift [41]. During visual tracking, the state of the target and its surrounding environment change constantly (Figure 2), which not only makes it difficult to extract features and build models, but also requires trackers to be more robust and more accurate. On this basis, the tracker should also be able to run in real time.
Fig. 2. Challenges and difficulties in visual tracking. The generally recognized difficulties in tracking are: (1) Appearance deformation; (2) Illumination change; (3) Appearance similarity; (4) Motion blur; (5) Background clutter; (6) Occlusion; (7) Out of view; (8) Scale change; (9) Out-of-plane rotation; (10) In-plane rotation; (11) Background similarity.
2.2 Generative Method
In the learning process, the generative method obtains the conditional probability distribution $P(Y \mid X)$ as the prediction model by maximizing the joint probability $P(X, Y)$ over the data [42]. That is, the data likelihood model is built on the global state: $P(Y \mid X) = P(X, Y) / P(X)$. The generative method tries to find out how the data is generated. Generally, it learns a model that represents the target, searches the image regions with this model, and classifies a signal by minimizing the reconstruction error. Based on the generative model, it looks for candidates similar to the model description and performs template matching to find the best-matching region in the image, which is taken as the target in the current frame. The specific steps are shown in Figure 3 [43].
Fig. 3. Generative method tracker framework. First, the video frame is input and the target is selected for initialization. Next, the target features are extracted from the current frame. The model is then described according to the target features, and the probability density function is established. After that, the image regions of the next frame are searched and template matching is performed to find the region with the highest similarity to the model. Finally, the target bounding box is output.
In the visual tracker framework, extracting target features during target description is a very important step that strongly influences the accuracy and speed of tracking. Feature extraction is not only used by the generative method, but is also one of the important steps of model checking in the discriminative method. See Table 2 for commonly used feature representations.
As shown in Figure 3, describing and modeling the target are important steps in the generative method and affect the efficiency and accuracy of the tracker. The model description method varies with the difficulty of the target. Commonly used description methods include the kernel trick [44, 45], incremental learning [46], Gaussian mixture model [47], linear subspace [48], Bayesian network [49], sparse representation [50], hidden Markov model [51] and so on. Finally, the similarity measure function is used as a confidence index that reflects the reliability of each tracking result and determines whether the target is lost.
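The template-matching step of a generative tracker, together with a similarity measure used as the confidence index, can be sketched as follows. This is a minimal illustration that uses normalized cross-correlation on raw gray values and an exhaustive local search; the function names, the `radius` parameter and the choice of similarity are assumptions for this example rather than the method of any cited tracker.

```python
import numpy as np

def ncc(a, b, eps=1e-12):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def match_template(frame, template, center, radius):
    """Search a (2*radius+1)^2 neighbourhood of `center` (top-left corner,
    row/column) and return the best position with its similarity score."""
    th, tw = template.shape
    H, W = frame.shape
    cy, cx = center
    best_score, best_pos = -np.inf, center
    for y in range(max(0, cy - radius), min(H - th, cy + radius) + 1):
        for x in range(max(0, cx - radius), min(W - tw, cx + radius) + 1):
            score = ncc(frame[y:y + th, x:x + tw], template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

A tracker built on this sketch could declare the target lost when the returned score falls below a chosen threshold, which is the role the confidence index plays in the text above.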
2.3 Discriminative Method
The basic idea of the discriminative method is to learn the decision function $Y = f(X)$ directly from the data, or to maximize the conditional probability distribution $P(Y \mid X)$, as the prediction model. The procedure is to establish the discriminant function (posterior probability function) from a finite sample and to build the data likelihood model $P(Y \mid X)$ in the global state, studying the prediction model directly without considering the generative model of the samples [42]. In computer vision, this approach usually combines image features with machine learning: after the target features are extracted, a classifier is trained with a machine learning method to distinguish the target from the background. The architecture of the discriminative tracking method is shown in Figure 4. Because background information is included in training, the background and target can be distinguished clearly and performance is more robust, so the discriminative method has gradually become the mainstream in visual tracking.
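The idea of learning $P(Y \mid X)$ directly and using it to separate target from background can be sketched with a small logistic-regression classifier. The example below is illustrative only: the feature vectors are assumed to be already extracted, and the learning rate, number of epochs and function names are assumptions rather than settings from any cited tracker.

```python
import numpy as np

def target_probability(X, w, b):
    """P(y = target | x) under a logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def train_logistic(X, y, lr=0.1, epochs=300, l2=1e-3):
    """Fit the classifier by batch gradient descent on the log-loss.
    X: (n, d) feature vectors, y: (n,) labels, 1 = target, 0 = background."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = target_probability(X, w, b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b

def locate(candidates, w, b):
    """Return the index of the candidate window most likely to be the target."""
    return int(np.argmax(target_probability(candidates, w, b)))
```

In a tracking-by-detection loop, `candidates` would hold the features of windows sampled around the previous position, and the classifier would be retrained or updated with new target and background samples after each frame.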
In computer vision, target tracking and target detection are two important components. The purpose of detection is to find static or dynamic targets in video, whereas tracking locates a dynamic target. The tracking algorithm was originally used to compensate for the low speed of the detection algorithm: it predicted the location of the target in the next frame, and the detection algorithm was then used to mark the location of the target. Later, some researchers segmented the video sequence into fixed time periods and ran detection on every frame within each period, so that detection achieved an effect similar to tracking. Such tracking is equivalent to detecting every frame, which is a kind of pseudo-tracking. Tracking thus developed into "dynamic detection", also known as tracking-by-detection, which is the mainstream research direction of visual tracking today [52].
Table 2. Recent advances on visual descriptors.
There are usually two kinds of tracking-by-detection methods. One is Correlation Filtering (CF), which trains a filter by regressing the input features to a Gaussian-shaped target response and locates the target in subsequent frames by finding the peak of the predicted response [87-91]. The other is Deep Learning (DL), which improves the ability to distinguish the target from its neighboring background by updating the weights of the foreground and background in the classifier [92-94].
In recent years, a large number of machine learning methods have been adapted to tracking-by-detection as methods for training the classifier. For classifier training, supervised learning and semi-supervised learning are commonly used, while unsupervised learning is used less often (Table 3).
Fig. 4. Discriminative method framework. The discriminative method does not care how the data is generated; it only cares about the difference between signals. It regards tracking as a binary classification problem and simply categorizes a given signal by this difference. Generally speaking, it finds the decision boundary between the target and the background. Tracking is treated as a frame-by-frame detection problem, and the target box is selected manually in the first frame.
Table 3. Common machine learning methods
3.1 Correlation Filter
The principle of the correlation filter (CF), also called the discriminative correlation filter (DCF), is that the correlation response of two correlated signals $f$ and $g$ is greater than that of uncorrelated signals (1):

$$(f \star g)(\tau) = \int_{-\infty}^{\infty} f^{*}(t)\, g(t + \tau)\, dt, \qquad (f \star g)[n] = \sum_{m=-\infty}^{\infty} f^{*}[m]\, g[m + n] \tag{1}$$

where $f^{*}$ is the complex conjugate of $f$, the $\int$ is used in the continuous domain and the $\sum$ is used in the discrete domain. In visual tracking, the filter generates a high response only to each object of interest and a low response to the background. Owing to the introduction of the circulant matrix and the application of the Fast Fourier Transform (FFT), Discrete Fourier Transform (DFT) and Inverse FFT (IFFT), the speed of visual tracking is greatly improved. Computational complexity dropped from
Since Bolme et al. proposed the average of synthetic exact filters (ASEF) [115] and the minimum output sum of squared error (MOSSE) filter [116], Correlation Filter-based Trackers (CFTs) have attracted considerable attention in the visual tracking community [57, 117]. Chen et al. [118] summarized the general framework of recent correlation filter tracking methods (Figure 5). Most current CFTs are based on this framework and improve or replace only one part of it without affecting the structure of the whole. MOSSE uses only single-channel gray features and runs at 615 FPS, which demonstrates the speed advantage of correlation filtering. CSK [57] then extended MOSSE with padding and the circulant matrix. After Galoogahi et al. proposed MCCF [119] with multi-channel features, the Kernelized Correlation Filter (KCF) [117], an improved multi-channel version of CSK, outperformed the best tracker at that time (Struck [120]) on OTB50 [7] in both precision and FPS (Table 4). CN [75] extends CSK with the Color Names color feature. As the number of feature channels increases, from MOSSE (615 FPS) to CSK (292 FPS), KCF (172 FPS) and CN (152 FPS), the speed of the tracker gradually decreases, but the tracking results improve steadily while the speed remains well above real time. CSK [57], KCF/DCF [117] and CN [75], which have been used as baselines on various databases, are all correlation filter-based trackers. In the VOT2014 visual tracking competition, correlation filter-based trackers [62, 117, 121] took the top three places. Since CSK was proposed, sparse representation-based trackers [83, 122, 123] have gradually been replaced by faster and simpler CFTs.
Fig. 5. General framework for correlation filter visual tracking methods. After initialization in the first frame, in each subsequent frame an image patch at the previously estimated position is cropped as the current input. The input is then described by extracting different visual features, and a cosine window is usually applied to smooth the boundary effect of the window. Afterwards, the correlation between the input signal and the learned filter is computed by means of the convolution theorem: the FFT converts the signal into the frequency domain, and the symbol in the figure denotes element-wise computation. After the correlation, a spatial confidence map is obtained by the IFFT, and its peak is taken as the predicted new position of the target. Lastly, features are extracted at the newly estimated position to train and update the correlation filter with a desired output.
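The pipeline in Figure 5 can be sketched for the simplest case of a single gray-scale channel and a filter trained on one frame, in the spirit of MOSSE. The closed form used here, $H^{*} = (G \odot \bar{F}) / (F \odot \bar{F} + \lambda)$, is the commonly stated single-sample MOSSE solution; the regularizer `lam`, the width `sigma` of the desired output and the function names are assumptions for this illustration, and the running-average model update of a full tracker is omitted.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired output: a Gaussian peak centred in the patch."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def preprocess(patch):
    """Normalize the patch and apply a cosine (Hann) window."""
    patch = (patch - patch.mean()) / (patch.std() + 1e-5)
    window = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    return patch * window

def train_filter(patch, lam=1e-2):
    """Learn the conjugate filter H* in the frequency domain from one patch."""
    F = np.fft.fft2(preprocess(patch))
    G = np.fft.fft2(gaussian_response(patch.shape))
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H_conj, patch):
    """Correlate a new patch with the filter and return the estimated
    displacement (rows, columns) of the target plus the response map."""
    Z = np.fft.fft2(preprocess(patch))
    response = np.real(np.fft.ifft2(H_conj * Z))   # element-wise product, then IFFT
    peak = np.unravel_index(np.argmax(response), response.shape)
    dy = int(peak[0]) - patch.shape[0] // 2
    dx = int(peak[1]) - patch.shape[1] // 2
    return (dy, dx), response
```

All heavy operations are FFTs and element-wise products, which is why trackers of this family reach the frame rates quoted in the text.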
Table 4. CSK-based trackers compared with the state-of-the-art tracker at that time
Using richer features slows the tracker down, and the fixed filter size makes it impossible to respond well to scale changes of the target, so many researchers have focused on improving the correlation filtering framework. Danelljan et al. proposed DSST [62], which uses only HOG features and introduces a filter architecture that combines a translation filter with a scale filter. DCF is used as the filter to detect translation, and a correlation filter similar to MOSSE is trained to detect the scale change of the target. However, because the translation filter and the scale filter are solved separately, the regression in DSST reaches only a local optimum, and its real-time performance is limited (25 FPS). To overcome this problem, Danelljan et al. proposed an accelerated version, fDSST [61], which uses PCA dimensionality reduction, reduces the number of scales from 33 to 17, and improves the running speed (54 FPS). Yang Li et al. proposed SAMF [121], which is based on KCF, is similar to DSST and uses HOG plus CN features. The image patch is resized at multiple scales, and the target is then detected by a translation filter. Unlike DSST, SAMF combines scale estimation with position estimation and achieves a global optimum through iterative optimization. Kiani et al. proposed a family of trackers based on MOSSE: by adding a mask matrix P, the filter can crop real small-size samples from large circularly shifted patches and thereby increase the proportion of real samples. This family includes CFLB [124], based on grayscale features, and BACF [125], based on HOG features, both of which can run in real time (CFLB at 87 FPS, BACF at 35 FPS). Sui et al. proposed RCF [126], which uses three sparse correlation loss functions in the original CF structure and improves tracking robustness while remaining real-time (37 FPS). Zhang et al. explored a new way of using trackers and proposed MEEM [127], which is essentially a combined tracker. It runs multiple trackers at the same time and selects the best one according to a cumulative loss function, but its running speed in practice is mediocre (13 FPS).
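The multi-scale search described for SAMF, in which the patch is resized at several scales and the best-scoring scale is kept, can be sketched as follows. To stay self-contained, this example replaces the translation filter response with a normalized cross-correlation against a fixed template and uses nearest-neighbour resampling; both substitutions, the scale pool and the function names are assumptions for illustration and not the SAMF implementation.

```python
import numpy as np

def sample_patch(frame, center, size, out_size):
    """Crop a square of side `size` around `center` (row, column) and
    resample it to out_size x out_size by nearest-neighbour lookup."""
    offsets = (np.arange(out_size) - out_size / 2 + 0.5) * size / out_size
    ys = np.clip(np.round(center[0] + offsets).astype(int), 0, frame.shape[0] - 1)
    xs = np.clip(np.round(center[1] + offsets).astype(int), 0, frame.shape[1] - 1)
    return frame[np.ix_(ys, xs)]

def similarity(a, b, eps=1e-12):
    """Normalized cross-correlation, standing in for the filter response."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def best_scale(frame, template, center, base_size, scales=(0.95, 1.0, 1.05)):
    """Score the target at every scale in the pool and return the best one."""
    out_size = template.shape[0]
    scores = [similarity(sample_patch(frame, center, base_size * s, out_size), template)
              for s in scales]
    k = int(np.argmax(scores))
    return scales[k], scores[k]
```

Because every candidate is resampled to the template size, the same fixed-size model can be compared against the target at all scales; the cost grows linearly with the size of the scale pool.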
The CF template matching method tracks poorly under fast deformation and fast motion of the target, while color features cope poorly with illumination change and background similarity, so neither performs satisfactorily when used alone. Bertinetto et al. proposed Staple [128], which combines the template feature-based method DSST with the color histogram-based method DAT [78] (15 FPS). They found that the combined tracker, which benefits from the robustness of HOG features to illumination variation and the insensitivity of color features to deformation, achieved higher accuracy and speed than either tracker alone, running at up to 80 FPS. Since then, HOG and Color Names have become the standard hand-crafted features in tracking algorithms. Bertinetto et al. then proposed Staple+ [128] to improve tracking performance: it increases the number of feature channels from 28 to 56 and adds the response of large-displacement optical flow motion estimation to translation detection. Performance improved, but at the cost of no longer running in real time. Similarly, Lukezic et al. proposed CSR-DCF [63], which combines the ideas of DAT and CFLB. It uses the mask matrix P of CFLB with an added adaptive coefficient, and the maximum response point is determined by a weighted sum of the CF response map and the color histogram probability. The results are impressive, but the speed is only 13 FPS.
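The fusion of a template response with a color response described above can be sketched as a convex combination of two response maps. The per-map normalization and the merge weight `alpha` are assumptions chosen for this illustration; the cited trackers define their own weighting.

```python
import numpy as np

def normalize(response):
    """Rescale a response map to [0, 1] so that the two cues are comparable."""
    response = response - response.min()
    return response / (response.max() + 1e-12)

def fuse_responses(cf_response, color_response, alpha=0.3):
    """Weighted sum of a correlation filter response and a colour-histogram
    response; returns the peak location and the fused map."""
    fused = (1 - alpha) * normalize(cf_response) + alpha * normalize(color_response)
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return (int(peak[0]), int(peak[1])), fused
```

When the template response is ambiguous, for example after a deformation, the color cue can tip the decision toward the correct peak, and vice versa under illumination change.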
The boundary effect has always been one of the difficulties in visual tracking. Under fast motion, the real samples escape from the cosine window, so background is trained into the classifier; the samples become contaminated and tracking fails. To solve this problem, Danelljan et al. proposed SRDCF [129], which learns a spatial regularization term that penalizes the filter coefficients in the boundary region and thereby suppresses the boundary effect. However, the iterative optimization has no closed-form solution, so the tracker cannot run in real time (5 FPS). Gundogdu et al. analyzed the disadvantages of the cosine window and proposed a new window function, SWCF [130], which suppresses regions irrelevant to the target and highlights the relevant regions. However, because of the complexity of the new window function, the tracker runs at only 5 FPS. Hu et al. proposed MRCT [131], a manifold regularization-based correlation filter. A regression model is established from augmented samples and the classifier is trained with unsupervised learning; similar to BACF, the augmented samples are generated from one positive sample cropped from the target region and multiple negative samples cropped from non-target regions, which reduces the boundary effect. Bibi et al. proposed the CF+AT [132] framework, in which the target response is regularized by replacing the samples generated by cyclic shifts with samples from actual translations, so as to mitigate the boundary effect. Mueller et al. proposed the context-aware correlation filter framework CACF [66], which can be used in the learning phase of a traditional CF and applied to many different types of CFTs. CF+AT and CACF improve tracker performance significantly, but the additional computation also reduces tracker speed.
Tracking confidence is a necessary part of a tracker and is used to judge whether the target is lost. The generative method usually uses a similarity measure function, while the discriminative method uses the classification probability provided by the classifier trained with machine learning. In general, CFTs use the Maximum Response Peak (MRP, Eq. (2), per channel) as the confidence parameter, but in complex environments it is difficult to determine the target location reliably from it. The earliest correlation filtering method (MOSSE) used the Peak-to-Sidelobe Ratio (PSR, Eq. (3)) combined with MRP to judge the confidence level. Wang et al. proposed LMCF [133] (85 FPS), based on hand-crafted features, and Deep-LMCF (8 FPS), based on CNN features. They combined a structured SVM with CF and proposed the Average Peak-to-Correlation Energy (APCE, Eq. (4)), which deals effectively with target occlusion and loss. Yao Sui et al. proposed PSCF [134] based on RCF [126], which uses a new Peak-Strengthened metric (PS, Eq. (5)) to improve the discriminative ability of the correlation filters. The tracker runs at 13 FPS on a desktop. Lukezic et al. argue that the per-channel detection reliability is reflected in the expressiveness of the major mode in the response of each channel, so they introduced Spatial Reliability (Eq. (6)) in CSR-DCF [63]. Combined with the MRP, this tracker runs at 13 FPS.
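$$w_d = \zeta \max(f_d \star h_d) \tag{2}$$

where $\star$ denotes the correlation in (1), $h_d$ denotes a filter, $f_d$ denotes the discriminative feature channel, and the normalization scalar $\zeta$ ensures that $\sum_d w_d = 1$.

$$\mathrm{APCE} = \frac{|F_{\max} - F_{\min}|^{2}}{\mathrm{mean}\left(\sum_{w,h} (F_{w,h} - F_{\min})^{2}\right)} \tag{4}$$

where $F_{\max}$, $F_{\min}$ and $F_{w,h}$ denote the maximum, the minimum and the element in the $w$-th row and $h$-th column of the response map, respectively.

In the PS measure (5), $R$ denotes the peak value of the response, $n$ denotes the number of neighboring response values around the peak, and the remaining terms are the $j$-th response value and the positions of the response peak (correlation output) and of the ground-truth peak (center of the target location).

Spatial Reliability is based on the ratio between the second and first major modes in the response map, and the per-channel detection reliability is estimated as (6):

$$w_d^{(\mathrm{det})} = 1 - \min\left(\rho_{\max 2} / \rho_{\max 1},\ \tfrac{1}{2}\right) \tag{6}$$

where $\rho_{\max 1}$ and $\rho_{\max 2}$ denote the first and second major modes of the response map.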
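To make the confidence measures concrete, the sketch below computes the maximum response peak, the peak-to-sidelobe ratio and the average peak-to-correlation energy of a response map. The PSR follows the commonly stated definition (peak minus sidelobe mean, divided by sidelobe standard deviation, with a window around the peak excluded); the window half-width `exclude` and the function names are assumptions for this illustration.

```python
import numpy as np

def max_response_peak(response):
    """MRP: the highest value of the response map."""
    return float(response.max())

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio. The sidelobe is everything outside a
    (2*exclude+1)^2 window centred on the peak."""
    py, px = np.unravel_index(np.argmax(response), response.shape)
    mask = np.ones(response.shape, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1,
         max(0, px - exclude):px + exclude + 1] = False
    sidelobe = response[mask]
    return float((response[py, px] - sidelobe.mean()) / (sidelobe.std() + 1e-12))

def apce(response):
    """Average peak-to-correlation energy of the response map."""
    f_max, f_min = response.max(), response.min()
    return float((f_max - f_min) ** 2 / (np.mean((response - f_min) ** 2) + 1e-12))
```

A single sharp peak gives high PSR and APCE values, whereas a response with several competing peaks, as is typical under occlusion, gives lower values; a tracker can therefore skip the model update or trigger re-detection when these measures drop.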
Most CFTs only address short-term tracking and do not consider long-term tracking, in which the target may be occluded or disappear at any time. Kalal et al. first proposed the long-term tracking framework TLD (Tracking-Learning-Detection) [74], which adopts the MedianFlow tracker for tracking, the P-N learning mechanism, and a random fern classifier for detection. Although TLD does not use CF, it provided the original idea for long-term tracking, and the tracker can run in real time. Ma et al. proposed LCT [135], which builds on the translation filter and scale filter of DSST and adds a third correlation filter responsible for estimating the target confidence. It adopts the random fern classifier of TLD as the online detector and runs at 27 FPS. Ma et al. further proposed LCT+, a filter with long-term and short-term memory that adds an online SVM detector and CNN features. LCT+ runs at 20 FPS with hand-crafted features and at 14 FPS with CNN features. Hong et al. proposed MUSTer [136], which has long-term and short-term memory based on the Atkinson-Shiffrin memory model; it performs well but runs very slowly (0.287 FPS). Zhu et al. proposed the collaborative correlation tracker (CCT) [137], which uses Multi-scale Kernelized Correlation Tracking (MKC) and an online CUR filter for long-term tracking. Detection with the CUR filter reduces the drift caused by long-term occlusion or by the model losing the target, and the tracker can reach 52 FPS.
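The interplay between a short-term tracker, a confidence test and a re-detector in long-term tracking can be sketched as a simple control loop. The callables `short_term`, `detector` and `update_model`, as well as the two thresholds, are hypothetical placeholders introduced for this example; real systems such as those cited above define these components in their own ways.

```python
def long_term_track(frames, short_term, detector, update_model,
                    lost_thr=0.25, update_thr=0.5):
    """Run a short-term tracker and fall back to a global detector when the
    confidence is low. Returns a list of (position, mode) per frame."""
    position, log = None, []
    for frame in frames:
        candidate, confidence = short_term(frame, position)
        if confidence >= lost_thr:
            position = candidate
            if confidence >= update_thr:
                update_model(frame, position)   # update only on reliable frames
                mode = "updated"
            else:
                mode = "tracked"                # keep tracking, freeze the model
        else:
            found = detector(frame)             # search the whole frame
            if found is not None:
                position, mode = found, "redetected"
            else:
                mode = "lost"                   # keep the last reliable position
        log.append((position, mode))
    return log
```

Freezing the model at intermediate confidence is one common way to avoid learning from occluded or drifting samples, while the lower threshold decides when the target is treated as lost.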
As can be seen from the above work, the main research directions of CFTs are as follows: (1) adopt better learning methods; (2) optimize the regression equation; (3) extract more powerful features; (4) reduce the impact of scale change; (5) weaken the impact of boundary effects; (6) use better confidence criteria; (7) combine with a long-term target memory model, etc.
3.2 Deep Learning
In recent years, Deep Learning (DL) has attracted wide attention [84]. As a representative algorithm, the CNN has, after a series of developments, achieved impressive results in image and speech recognition thanks to its powerful feature representation ability [108, 138-141]. In visual tracking, most DL-based trackers belong to the discriminative method. Since 2015, the top international conferences (ICCV, CVPR, ECCV) have shown that more and more DL-based trackers achieve surprisingly good performance [11].
CNN-SVM [142], proposed by the POSTECH team in Korea, is one of the earliest DL-based trackers; it combines a Convolutional Neural Network (CNN) with Support Vector Machine (SVM) classifiers. The target-specific saliency map is taken as the observation, and tracking is performed by sequential Bayesian filtering. After that, a large number of CNN-based trackers (CNTs) sprang up. MDNet [143], an improvement on CNN-SVM, extracts motion features with deep learning and adds them to the tracking process. It demonstrated the potential of CNNs in visual tracking, but the tracker is only suitable for running on a desktop computer or server, not on ARM. To improve the speed of DL-based methods, Held et al. proposed GOTURN, the first DL-based tracker that can run at 100 FPS. To achieve this speed, it relies on offline training with a large amount of data and avoids online fine-tuning; moreover, it is a regression-based approach that does not classify patches but regresses the bounding box of the object directly. These measures yield a higher frame rate, but at the price of lower tracking accuracy.
Bertinetto et al. proposed SiameseFC (SiamFC) [144], which uses a Siamese architecture (Figure 6). It was the first tracker trained on samples from the VID [4] dataset. It performed better than GOTURN and SRDCF at that time and runs very fast on GPU (SiamFC at 58 FPS and SiamFC-3s at 86 FPS). On VOT2016, the ResNet-based SiamFC-R and the AlexNet-based SiamFC-A performed outstandingly, and SiamFC was the winner of the speed test on VOT2017 [9, 10]. SiamFC has attracted a lot of attention because of its excellent performance. It can be said to have opened up another direction for DL-based visual tracking, and the VID dataset has become the standard training database for DL-based trackers because it is very suitable for pre-training. Within just one year, a number of good follow-up works appeared [145-150]. The results of VOT2017 [9] show that the SiamFC series is among the few surviving end-to-end offline-trained trackers; it is the only direction that can currently rival CFTs, and it is the most promising direction for benefiting from big data and DL.
Fig. 6. Fully-convolutional Siamese architecture. SiamFC learns a function $f(z, x)$ that compares an exemplar image $z$ to a candidate image $x$ of the same size and returns a high score if the two images depict the same object and a low score otherwise. The embedding function $\varphi$ is fully-convolutional with respect to the exemplar and candidate image. The output is a scalar-valued score map whose dimension depends on the size of the candidate image. The similarity responses of all translated sub-windows within the search image are computed in one evaluation, and a metric function $g$ is learned according to $f(z, x) = g(\varphi(z), \varphi(x))$. Finally, the target position is determined by the metric function $g$.
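The way a Siamese tracker scores all translated sub-windows in one evaluation can be sketched by cross-correlating the embedded exemplar over the embedded search image. In the sketch below the learned CNN embedding is replaced by a fixed horizontal-gradient filter, purely as a stand-in that is also translation-equivariant; this stand-in, the bias `b` and the function names are assumptions for illustration and not the SiamFC network.

```python
import numpy as np

def embed(image):
    """Stand-in for the learned embedding: a fixed horizontal gradient.
    Like a convolutional layer, it commutes with translation."""
    return image[:, 1:] - image[:, :-1]

def score_map(z_feat, x_feat, b=0.0):
    """Cross-correlate the exemplar features over the search features.
    Entry (i, j) scores the sub-window whose top-left corner is (i, j)."""
    zh, zw = z_feat.shape
    xh, xw = x_feat.shape
    scores = np.empty((xh - zh + 1, xw - zw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(z_feat * x_feat[i:i + zh, j:j + zw]) + b
    return scores
```

The size of the score map grows with the search image, which matches the statement in the caption that the output dimension depends on the size of the candidate image.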
Because of the structural properties of CNNs, their running speed is always limited. Many researchers have therefore proposed combining CF with CNN to speed up the tracker. Bertinetto et al. proposed CFNet [145], an improvement on SiamFC, in which they derived a differentiable closed-form solution of CF so that it becomes a layer of the CNN. CF is used to build the filter template in SiamFC, and the resulting CNN-CF can be trained end-to-end, which yields convolutional features better suited to CF tracking. The tracker runs at 43 FPS when conv5 is used. Meanwhile, Wang et al. proposed DCFNet [146], which uses CNN features instead of HOG features in discriminative correlation filters (DCF). Apart from the CNN features, the other parts are still computed quickly in the frequency domain. The feature resolution is nearly 3 times higher than that of CFNet, and the localization accuracy is higher. The tracker runs at 60 FPS, but the boundary effect limits the detection area. The latest version, DCFNet 2.0, which has been trained on VID, makes a significant leap in performance over CFNet and runs at 100 FPS on GPU. CFCF [151] (the winner of the VOT2017 challenge), proposed by Gundogdu et al., also constructs a CNN that can be trained end-to-end on the VID dataset. Unlike the previous trackers, CFCF uses this fine-tuned CNN to extract convolutional features; the rest is exactly the same as C-COT, and the tracker cannot run in real time. Fan et al. proposed PTAV [152], which combines SiamFC with fDSST and multithreading and, drawing on the experience of parallel tracking and mapping in VSLAM, runs a tracker T and a verifier V in parallel on two separate threads. By using the verifier to correct the tracker, it studies the problem from a new point of view and obtains good experimental results (25 FPS). The Perception and Intelligence Lab in Korea has also carried out many studies on the CNN-CF method [153-156], using Random Forests, Deep Reinforcement Learning, Markov Chains and other machine learning algorithms to optimize the accuracy of the classifier, but none of them reaches real-time speed.
Huang et al. proposed the EArly-Stopping Tracker (EAST) [147], the first CPU-friendly CNT and also an improvement on SiamFC. It tracks simple frames (similar or static) with simple hand-crafted (HC) features, and uses stronger convolutional features only for complex frames (with obvious changes). As a result, the average speed of the tracker reaches 23 FPS, and 50% of the time it operates at 190 FPS. On the other hand, tracking complex frames that require convolutional features is very slow, so the frame rate of the tracker fluctuates considerably. Tao et al. proposed SINT [157] based on Content-Based Image Retrieval (CBIR), which uses only the original observation of the target from the first frame. The matching function is obtained by offline training, and a Siamese network is used to find the patch that best matches the target annotated in the initial frame according to this matching function. In the experiments, adding an optical flow tracking module to SINT (SINT+) improved the results, but neither version could run in real time. Wang et al. proposed SINT++ [158], which adds a positive sample generation network (PSGN) and a hard positive transformation network (HPTN) to improve the accuracy of the samples. Although the method is novel and uses the popular Generative Adversarial Network (GAN), its actual performance is not impressive.
The CRT [159] proposed by Chen et al. differs from the traditional DCF in that it does not need the analytical solution of the regression problem. Instead, it obtains an approximate solution by using gradient descent and a single convolutional layer to solve the regression equation. Since convolution regression is trained only on "real" samples without background, it is theoretically possible to incorporate unlimited negative samples. The UCT [150] proposed by Zhu et al. treats feature extraction and the tracking procedure as convolution operations, forming a fully convolutional network architecture. Similarly, it uses stochastic gradient descent (SGD) to solve the ridge regression problem in DCF and offline training of the CNN for acceleration. They also introduced a new confidence parameter, the Peak-versus-Noise Ratio (PNR, Eq. (7)); the standard UCT (with ResNet-101) and UCT-Lite (with ZF-Net) run at 41 FPS and 154 FPS, respectively. Song et al. proposed CREST [160], which also reformulates DCF as a one-layer CNN and uses neural networks to integrate feature extraction, response map generation and model update for end-to-end training. Features are transformed into the response map through base and residual mappings for better tracking performance. Park et al. proposed Meta-Tracker [161], an offline meta-learning-based method for adjusting the initial deep networks used in online adaptation-based tracking. They demonstrated this approach on the CNN-based MDNet [143] and the CNN-CF-based CREST [160], and model training speed improved significantly. Yao et al. investigated the joint learning of deep representation and model adaptation on the basis of BACF [125] and proposed RTINet [162], which runs at 9 FPS and reaches a real-time speed of 24 FPS in its rapid version.
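The idea shared by CRT and UCT, solving the ridge regression of DCF iteratively instead of in closed form, can be illustrated on a small generic problem. The sketch below uses plain full-batch gradient descent on a dense design matrix rather than SGD on a convolutional layer, so it is a simplified stand-in; the objective $\tfrac{1}{2}\lVert Xw - y\rVert^2 + \tfrac{\lambda}{2}\lVert w\rVert^2$, the step-size rule and the function names are assumptions for this example.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Analytical solution w = (X^T X + lam I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient_descent(X, y, lam, steps=500):
    """Approximate the same solution by gradient descent.
    The step size is 1/L, with L the largest eigenvalue of X^T X + lam I."""
    d = X.shape[1]
    w = np.zeros(d)
    L = np.linalg.eigvalsh(X.T @ X + lam * np.eye(d)).max()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w
        w -= grad / L
    return w
```

On a well-conditioned problem the two solutions agree closely after a few hundred steps. The iterative route needs no matrix inversion and no circulant-sample assumption, which is what allows arbitrary real training samples to be added.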
Table 5 collates the online evaluation database maintained by Wang et al., showing the top 20 best-performing trackers at this stage, including those from CVPR 2018. Except for the CF-based tracker BACF and the HC-based tracker ECO-HC (Turbo BACF can exceed 300 FPS, but its source code is not open), the trackers are based on DL frameworks, and most of them on CNNs, but their frame rates are generally in single digits. PTAV (SLAM-based), SiamRPN (Siamese network-based) and RASNet achieve real-time speed on GPU.
Table 5. The trackers are ordered by the average overlap scores.
Note: AUC (area under the curve) and Precision are the standard metrics. Real Time gives the speed in FPS as reported in the original papers, not tested on the same platform. Red: the best; Green: the second; Blue: the third.
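For readers unfamiliar with the two metrics in the note, the sketch below shows one common way to compute them from per-frame bounding boxes. It is a minimal illustration, not the evaluation code behind Table 5: the 21 evenly spaced overlap thresholds and the 20-pixel center-error threshold are assumed conventions, and the boxes are made up.

```python
def iou(a, b):
    """Overlap (intersection over union) of two boxes given as (x, y, w, h)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    dx = (a[0] + a[2] / 2) - (b[0] + b[2] / 2)
    dy = (a[1] + a[3] / 2) - (b[1] + b[3] / 2)
    return (dx * dx + dy * dy) ** 0.5


def success_auc(pred, truth, steps=21):
    """Mean success rate over overlap thresholds sampled evenly in [0, 1]."""
    overlaps = [iou(p, t) for p, t in zip(pred, truth)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > th for o in overlaps) / len(overlaps) for th in thresholds]
    return sum(rates) / len(rates)


def precision(pred, truth, max_error=20.0):
    """Fraction of frames whose center error is within max_error pixels."""
    errors = [center_error(p, t) for p, t in zip(pred, truth)]
    return sum(e <= max_error for e in errors) / len(errors)


# Three made-up frames: a perfect hit, a partial overlap and a complete miss.
truth = [(10, 10, 20, 20), (10, 10, 20, 20), (10, 10, 20, 20)]
pred = [(10, 10, 20, 20), (20, 10, 20, 20), (100, 100, 20, 20)]
print(success_auc(pred, truth), precision(pred, truth))
```

Because both metrics are computed per frame against ground truth, they say nothing about speed, which is why the Real Time column has to be read separately.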
Research in recent years has shown that making GPU-based real-time trackers run well on a CPU remains difficult. SiamFC [144] cannot run in real time on a CPU because AlexNet must be evaluated once per scale, which seriously slows it down. The fastest tracker, DCFNet [146], uses a two-layer CNN instead of HOG, and the computation of its conv2 features is acceptable, but pre-training and fine-tuning make it weak on a CPU. EAST [147], although a CNN-based tracker, tracks in the form of KCF in most cases and uses conv5 features only in difficult scenarios. In view of the above, if a CNT is to run on a CPU or ARM platform, three points should be noted: (1) The capacity of the CNN must be controlled: convolutional layers account for most of the computation and need careful optimization to keep the CNN fast. (2) The target model should not be updated online (no fine-tuning): the target features are fixed after offline CNN training, which avoids Stochastic Gradient Descent (SGD) and back-propagation, both of which are almost impossible to run in real time during tracking.
3.3 Convolutional Features
CFTs offer good speed and precision, while CNTs achieve higher accuracy and remain fast on GPU. Improving the performance of CFTs requires deep features, and CF layers can in turn be added to CNTs for end-to-end training. CF and DL do not develop independently; they complement and promote each other. The current development direction of trackers is shown in Fig. 7.
Fig. 7. The development tree of current trackers. At present, there are three kinds of tracking methods: (1) CF-based methods; (2) CNN-based methods; (3) others. The main directions are CF and DL. Large black fonts mark the development stages. Pink lines represent the contributions of Danelljan et al. Yellow lines represent the contributions of Ma et al.
The right side of Fig. 7 represents CF-based trackers, most of which fall into two categories according to the choice of feature channel. The first combines correlation filtering with hand-crafted features such as HOG, CN or CH (Color Histogram), which ensures very high speed and good precision; examples are BACF [125], ECO-HC [164] and Staple [128]. The second combines CF with deep convolutional features to achieve higher accuracy. The convolutional features of a pre-trained CNN model are very strong and generalize well, but they are slow; examples are C-COT [168], ECO [164] and CFCF [151].
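The speed of the first category comes from the fact that a correlation filter is trained and applied with a few FFTs. The following is a minimal single-channel sketch in the spirit of MOSSE-style filters, not the implementation of any tracker cited above: a random array stands in for a feature channel such as HOG, and the regularization weight, Gaussian width and function names are assumptions.

```python
import numpy as np


def gaussian_response(shape, center, sigma=2.0):
    """Desired filter output: a Gaussian peak at the target center (row, col)."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def train_filter(patch, target, lam=1e-2):
    """Closed-form single-channel filter in the Fourier domain (MOSSE-style)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return G * np.conj(F) / (F * np.conj(F) + lam)


def detect(filter_hat, patch):
    """Apply the filter to a new patch and return the response peak (row, col)."""
    response = np.real(np.fft.ifft2(filter_hat * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)


rng = np.random.default_rng(0)
frame = rng.standard_normal((64, 64))          # stand-in for a feature channel
h_hat = train_filter(frame, gaussian_response(frame.shape, (32, 32)))

# Circularly shift the patch by (5, -3); the response peak moves with it.
shifted = np.roll(frame, shift=(5, -3), axis=(0, 1))
print(detect(h_hat, shifted))
```

Training and detection each cost only a handful of element-wise operations and FFTs, which is why trackers built on hand-crafted channels can run at high frame rates on a CPU, whereas in the second category the feature extraction itself dominates the run time.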
The left side of Fig. 7 represents DL-based trackers, most of which use CNNs trained on samples, and they can also be divided into two subcategories. The precision-oriented MDNet [143] and its extensions can rival the top CFTs on datasets, but because of the limitations of the training set, their generalization ability may be questioned. The speed-oriented SiamFC [144] and its extensions run far faster than real time on GPU. In particular, after the introduction of the CF layer, convolutional feature extraction can be combined with CF detection, so the CNN framework can also perform dense detection, and both accuracy and speed reach a higher level.
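The dense detection mentioned above can be pictured as sliding the target template over the search region and scoring every position. The sketch below illustrates only this matching step, assuming raw random arrays in place of learned CNN embeddings and arbitrary sizes; it is not the SiamFC implementation.

```python
import numpy as np


def cross_correlate(search, exemplar):
    """Dense score map: inner product of the exemplar with every window."""
    eh, ew = exemplar.shape
    rows = search.shape[0] - eh + 1
    cols = search.shape[1] - ew + 1
    scores = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            scores[r, c] = np.sum(search[r:r + eh, c:c + ew] * exemplar)
    return scores


rng = np.random.default_rng(0)
search = rng.standard_normal((48, 48))      # stand-in for search-region features
exemplar = search[10:26, 20:36].copy()      # 16x16 template cut out at (10, 20)

scores = cross_correlate(search, exemplar)
peak = np.unravel_index(np.argmax(scores), scores.shape)
print(peak)  # top-left corner of the best-matching window
```

In a Siamese tracker the same operation is applied to embeddings produced by a shared network, so the expensive part is the feature extraction rather than the matching itself.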
A series of works by Danelljan et al. [61, 62, 75, 163, 164, 168, 175] can represent the history of CFTs: from improving the correlation filtering architecture, to solving the boundary effect, to using better features, and then to extracting features with sub-pixel precision, the trackers have performed better and better. They presented the theoretical framework of the Continuous Convolution Operator Tracker (C-COT) [168], which uses cubic interpolation to map feature maps of different resolutions into a continuous spatial domain. It achieves excellent tracking results, but because of the heavy computation its speed is only 0.3 FPS. ECO [164] is an accelerated version of C-COT [168]. It introduced a factorized convolution operator, a compact generative model and an interval update strategy, which together improve both tracking speed and robustness. The GPU version of ECO operates at 8 FPS, and ECO-HC operates at 60 FPS on CPU. On the basis of ECO, He et al. put forward CFWCR [176], which weights two layers of CNN features (conv1 and conv5) and abandons the HC features completely. Although it performs better than ECO, it sacrifices running speed: CFWCR averages 4 FPS on GPU and 1.4 FPS on CPU. Bhat et al. analyzed the relationship between deep and hand-crafted features based on ECO and proposed UPDT [163], which lets the tracker benefit from better and deeper CNN layers. It outperforms ECO with a relative gain of 18% on the VOT2016 dataset, and it still shows an overwhelming advantage over some state-of-the-art trackers from CVPR 2018 [165, 177-179]. However, the UPDT paper only describes the adaptive fusion of feature layers and does not report the running speed; given the deeper convolutional features, it is likely to be very slow.
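Of the three ECO ingredients listed above, the interval update strategy is the simplest to picture: the target is still located in every frame, but the costly model update runs only once every few frames. The sketch below is a schematic of that idea only; the callbacks, the function name and the interval of 6 are illustrative assumptions rather than the ECO implementation.

```python
def track_with_interval_update(frames, locate, update, interval=6):
    """Locate the target in every frame, but refresh the model only
    once every `interval` frames."""
    positions, updates = [], 0
    for index, frame in enumerate(frames):
        positions.append(locate(frame))
        if index % interval == 0:
            update(frame, positions[-1])
            updates += 1
    return positions, updates


# Toy run: 30 dummy frames, a locator that echoes the frame index,
# and an update callback that only records when it was called.
update_log = []
positions, updates = track_with_interval_update(
    frames=range(30),
    locate=lambda frame: (frame, frame),
    update=lambda frame, position: update_log.append(frame),
)
print(updates, update_log)
```

In this toy run the model is updated 5 times instead of 30, which shows where the time saving comes from.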
Ma et al. have done a series of works on the use of deep convolutional features. They proposed HCFT [180], which tracks with pure convolutional features: it uses the activations of Conv5-4, Conv4-4 and Conv3-4 in VGG19 as feature layers and locates the target by linearly weighting their responses. It operates at 11 FPS on GPU. They then proposed HCFT+ [181] and HCFT* [182]. HCFT+ builds on HCFT by adding CF as part of the convolutional layers. By using traditional CF to compute correlation response maps on the Conv4-4 and Conv5-4 layers, it improves tracking accuracy and runs at 12 FPS. HCFT* adds a long-term memory filter to HCFT+ for long-term tracking, together with a region-based object re-detection and scale estimation scheme and an incremental updating method for two kinds of CF with different learning rates. It runs at 6.7 FPS and performs better than HCFT+.
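The linear weighting of layer responses described for HCFT can be sketched as a weighted sum of per-layer response maps followed by a peak search. The example below uses synthetic Gaussian maps in place of real correlation responses; the weights, peak positions and function names are illustrative assumptions, not values taken from [180].

```python
import numpy as np


def fuse_responses(responses, weights):
    """Weighted sum of per-layer response maps; returns the peak (row, col)."""
    fused = sum(w * r for w, r in zip(weights, responses))
    return np.unravel_index(np.argmax(fused), fused.shape)


def peak_map(shape, center, sigma):
    """Synthetic response map with a Gaussian peak at `center`."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


shape = (40, 40)
# Assumed behavior: the deepest layer gives a broad peak, shallower layers
# give sharper peaks that may sit at slightly different positions.
conv5 = peak_map(shape, (20, 20), sigma=6.0)
conv4 = peak_map(shape, (21, 19), sigma=3.0)
conv3 = peak_map(shape, (21, 20), sigma=1.5)

# The weights are illustrative, not values taken from [180].
location = fuse_responses([conv5, conv4, conv3], weights=[1.0, 0.5, 0.25])
print(location)
```

Here the broad deep-layer response fixes the rough location while the sharper shallow-layer responses pull the fused peak by about one pixel.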
In addition to the above two lines of CF trackers with convolutional features, many researchers have proposed more novel methods. Lu et al. proposed LSART [165], the winner of VOT2017, which combines CNN and CF in a new way. They used the iterative method of CF and a regularized kernel in the spatial domain to solve the CNN, which is more effective than the traditional method. Chen et al. proposed LHCF [183], a long-term correlation filter tracker based on convolutional features, which is similar to HCFT and LCT. Its innovation is to estimate the translation of the target by training on three convolutional feature layers. Choi et al. proposed TRACA [184], a correlation filter-based tracker using context-aware compression of raw deep features. Multiple auto-encoders handle different categories of objects and compress the high-dimensional features into low-dimensional ones, which reduces redundancy and sparsity and improves both accuracy and speed. It can run at over 100 FPS.
According to the VOT2017 test results, the high-performance trackers are mainly the following: C-COT [168], which combines CF with convolutional features; its accelerated version ECO [164]; the fine-tuned version CFCF [151]; and the ECO-based GNet with GoogLeNet features. The high-speed CPU trackers are ECO-HC [164], Staple [128], ASMS [79] and the C++-based CSR-DCF++ [63]. The high-speed GPU trackers include SiamFC [144], its extended version SiamDCF [146] and UCT [150]. Although these results are good, they are all obtained on test sets. For practical application scenarios, especially mobile robots, which are the main direction in the future, current tracking algorithms still have many difficulties to overcome.
(1) Tracking accuracy and speed coordination
For a mobile robot, the most important property of a visual tracker is its ability to run in real time. At the present stage, visual tracking research falls into two main categories. The first focuses on improving accuracy, such as MDNet [143], CFCF [151] and TCNN [167]. Trackers of this kind achieve high precision and high rankings on each dataset, but they are very slow on both CPU and GPU and cannot meet the real-time requirements of mobile robots. The second focuses on real-time performance, such as Staple [128], ECO-HC [164] and EAST [147], which maintain accuracy while running much faster than DL-based architectures. In the VOT2017 challenge results, the top ten trackers on the public dataset are C-COT-based or ECO-based, and they mainly use convolutional and hand-crafted features. Trackers with convolutional features perform better than those with hand-crafted features only, but their speed drops severely. Although GPU-based trackers keep getting better and faster with the rapid development of deep learning and perform well in desktop settings, they cannot be applied on mobile platforms (based on ARM or CPU). Whether convolutional features are needed, or whether better and faster features can be found, is what tracker design must consider for mobile robots.
(2) Combination of target detection and tracking
In tracker test databases, all targets are pre-calibrated: the initial position of the target in each video sequence is known before tracking, and the tracker follows the calibrated target directly. The benchmarks therefore do not measure the ability to initialize a tracker in practical applications. The human brain has strong logical reasoning ability, so it can identify a target at any time, quickly store a "first impression" of it and then follow it continuously. Visual trackers, however, selectively ignore the important question of where the first-frame bounding box comes from, and some of them appear weak when they must select and track new targets on their own. The next step for visual tracking is to fuse it with target detection and recognition, so that a system can confirm the target independently and then track it. In mobile robot applications, detect-to-track would certainly form a closed loop in the future, rather than being limited to the performance or speed of the tracker alone.
(3) Long-term tracking ability
Since the 2014 Long-Term Detection and Tracking workshop (LTDT), long-term tracking has been a major concern. VOT2018 also added a new long-term tracking sub-challenge, which requires the tracker to determine when the target disappears and to re-detect and track it when the target enters the scene again. This shows that long-term tracking is receiving more and more attention. At present, most visual trackers focus on the accuracy of short-term tracking (e.g. 100~500 frames). In practical applications such as mobile robots, however, the tracking time is often uncertain: it may be a few minutes, a dozen minutes or even longer. Many occlusion and target-loss problems that are not prominent in short-term tracking then affect the actual use of the tracker, so the tracker is required to track stably over the long term. Long-term tracking adds a re-detector and a long-term memory model to the traditional tracker, which can be called to correct the tracker if tracking fails. Of course, the short-term tracking performance of a tracker also affects the quality of its long-term tracking.
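The interplay between a short-term tracker and a re-detector described above can be sketched as a simple control loop. This is a schematic under stated assumptions, not a method from the cited works: `track` and `redetect` are hypothetical callbacks, the confidence threshold of 0.3 is arbitrary, and the toy "frames" are just the true target position or None while the target is out of view.

```python
def long_term_track(frames, init_box, track, redetect, min_confidence=0.3):
    """Follow a target over a sequence, calling the re-detector on failure.

    track(frame, box) -> (box, confidence) from the short-term tracker
    redetect(frame)   -> box, or None while the target is out of view
    Returns one entry per frame: a box, or None when the target is absent.
    """
    results = []
    box = init_box
    for frame in frames:
        confidence = 0.0
        if box is not None:
            box, confidence = track(frame, box)
        if confidence < min_confidence:
            # Tracking failed or the target is still missing: try to recover.
            box = redetect(frame)
        results.append(box)
    return results


# Toy sequence: each frame is the true position, or None during occlusion.
frames = [5, 6, None, None, 9, 10]


def toy_track(frame, box):
    # Confident when the target is visible, otherwise keeps the old box.
    return (frame, 0.9) if frame is not None else (box, 0.0)


def toy_redetect(frame):
    # Finds the target whenever it is visible.
    return frame


results = long_term_track(frames, 5, toy_track, toy_redetect)
print(results)
```

The loop reports the target as absent during the occlusion and resumes tracking as soon as the re-detector finds it again, which is the behavior the long-term sub-challenge asks for.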
(4) Good portability
At present, most trackers are based on Tracking-by-Detection, and the selection of features has a great influence on tracking performance. Danelljan et al. [185] showed that deep convolutional features have good rotation invariance, but introducing them sacrifices the speed advantage. Nowadays, DL-based trackers (including those that extract convolutional features) take the GPU as the core of computing and need specialized computing cards such as Tesla or Titan to pre-train on datasets; such setups often consist of multiple graphics cards, which are expensive and power-hungry. Table 5 shows that DL-based trackers are also becoming more and more difficult to run on GPU, and DL-based trackers cannot benefit from deeper CNNs [163]. For a mobile robot, the portability of the controller is an important factor affecting the robot's physical parameters, such as volume, endurance and structural complexity, so such DL-based trackers are not suitable. Finding CPU-friendly DL-based trackers (such as EAST [147]) may be a future development direction. For now, compared with deep learning, CFTs are more suitable for mobile robots.
Visual tracking is an important component of computer vision with many applications, which makes it a highly attractive research problem. In this paper, we summarized the difficulties and general architecture of visual tracking. We then provided a list of visual feature descriptors and summarized the machine learning methods used in trackers. With a view to real-time performance, state-of-the-art visual trackers based on Tracking-by-Detection were introduced from the perspectives of Correlation Filter, Deep Learning and Convolutional Features. Finally, the key points for applying trackers in mobile robots were analyzed, which are also the forthcoming research directions for trackers.
Although the generative framework has the advantages of good real-time performance and few parameters to adjust, its modeling complexity limits its further development. With the development of correlation filters and deep learning, discriminative methods based on the Tracking-by-Detection architecture have become the mainstream, and their speed, precision and robustness have completely exceeded those of generative methods. However, the potential of deep learning in visual tracking has not been well demonstrated, and swapping in different neural networks does not yield substantial performance improvements [163]. Because their architecture makes computation inherently slow on CPU, trackers based on DL, or on CF with convolutional features, have no absolute advantage in practical applications and are not far ahead of CFTs, even though they outperform CFTs based on HC features by 10% ~ 15%. Instead, their running speed on the CPU constrains their performance. All in all, running speed is one of the most important indexes of a computer vision algorithm, especially for visual trackers, where speed always comes before accuracy in practical applications. In academic research, however, accuracy is often emphasized and real-time testing is neglected. The ultimate purpose of a visual tracker should be practical application, rather than “Benchmark & Tuning” within its own research circle.
Unlike a fixed-position manipulator, the video camera carried by a mobile robot moves along with the robot and sometimes has to rotate itself, so what the tracker needs to locate is the relative position of the target. Moreover, the datasets used to test visual trackers contain only two-dimensional image information and no depth information, which is an important physical parameter for mobile robots. How to convert vision algorithms suitable for planar tracking into algorithms suitable for spatial tracking may therefore be a future research direction. Illumination variation is also a common problem for video cameras: the white balance changes suddenly when the camera is exposed to strong light, which interferes with tracking and with updating the target features. Research on high-performance visual trackers for mobile robots should not be limited to testing on databases; it also needs to combine many kinds of sensors to assist visual tracking and to track targets accurately in open environments that are not restricted by databases and training datasets. Finally, the use of specific environmental information is also an important research direction. In vehicle tracking, for example, cars should stay on the road, not in the sky or on a wall. This kind of semantic or environmental information is also very useful for the development of trackers.
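The vehicle example suggests one simple way such environmental information could enter a tracker: discard candidate boxes that fall outside a region where the target can plausibly be. The sketch below is only an illustration of that idea; the boolean road mask, the bottom-center test and the function name are all assumptions, and how the mask would be obtained in practice is left open.

```python
def filter_by_region(candidates, allowed):
    """Keep candidate boxes whose bottom-center pixel lies in an allowed region.

    candidates: list of (x, y, w, h, score) boxes
    allowed:    2-D list of booleans, allowed[row][col], e.g. a road mask
    """
    kept = []
    for x, y, w, h, score in candidates:
        col = int(x + w / 2)
        row = int(y + h) - 1          # bottom edge, where a car meets the road
        inside = 0 <= row < len(allowed) and 0 <= col < len(allowed[0])
        if inside and allowed[row][col]:
            kept.append((x, y, w, h, score))
    return kept


# Toy 10x10 scene: the lower half is road, the upper half is sky.
road = [[row >= 5 for _ in range(10)] for row in range(10)]
candidates = [
    (1, 0, 4, 3, 0.95),   # high score, but floating in the sky
    (2, 5, 4, 3, 0.80),   # on the road
]
best = max(filter_by_region(candidates, road), key=lambda c: c[4])
print(best)
```

The higher-scoring candidate is rejected because it lies in the sky, so the tracker falls back to the plausible one on the road.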
Development Program of China (No. 2018YFC0808000) and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), China.
Author Contributions: Shaoze You designed the architecture and finalized the paper. Hua Zhu conceived the idea. Menggang Li and Yutan Li did the proof reading.
Conflicts of Interest: The authors declare no conflict of interest.
References
[1] J. Zhang, Attention-based Target Recognition Algorithm and Applications in the Mobile Robot, Chongqing University, 2013.
[2] R.B. Fisher, The PETS04 surveillance ground-truth data sets, Proc. 6th IEEE international workshop on performance evaluation of tracking and surveillance (2004), pp. 1-5.
[3] R. Collins, X. Zhou, S.K. Teh, An open source tracking testbed and evaluation web site, IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2005) (2005), pp. 35.
[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, Imagenet large scale visual recognition challenge, International Journal of Computer Vision, 115 (2015) 211-252.
[5] E. Real, J. Shlens, S. Mazzocchi, X. Pan, V. Vanhoucke, YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video, Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, (IEEE2017), pp. 7464-7473.
[6] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, B. Ghanem, TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild, arXiv preprint arXiv:1803.10794, (2018).
[7] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, Proceedings of the IEEE conference on computer vision and pattern recognition (2013), pp. 2411-2418.
[8] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (2015) 1834-1848.
[9] M. Kristan, A. Eldesokey, Y. Xing, Y. Fan, Z. Zhu, Z. Zhang, Z. He, G. Fernandez, A. Garciamartin, A. Muhic, The Visual Object Tracking VOT2017 Challenge Results, IEEE International Conference on Computer Vision Workshop (2017), pp. 1949-1972.
[10] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, T. Vojír̃, G. Häger, A. Lukežič, G. Fernández, The Visual Object Tracking VOT2016 Challenge Results, European Conference on Computer Vision, (Springer2016), pp. 777-823.
[11] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, R. Pflugfelder, The Visual Object Tracking VOT2015 Challenge Results, Computer Vision Workshop (ICCVW), 2015 IEEE International Conference on, (IEEE2015), pp. 564-586.
[12] M. Kristan, J. Matas, A. Leonardis, T. Vojíř, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, L. Čehovin, A novel performance evaluation methodology for single-target trackers, IEEE transactions on pattern analysis and machine intelligence, 38 (2016) 2137-2155.
[13] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin, G. Nebehay, T. Vojíř, G. Fernández, A. Lukežič, A. Dimitriev, The Visual Object Tracking VOT2014 Challenge Results, IEEE International Conference on Computer Vision Workshops (2015), pp. 98-111.
[14] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Cehovin, G. Nebehay, G. Fernandez, T. Vojir, A. Gatt, The Visual Object Tracking VOT2013 Challenge Results, IEEE International Conference on Computer Vision Workshops (2013), pp. 98-111.
[15] S. Song, J. Xiao, Tracking revisited using RGBD camera: Unified benchmark and baselines, Proceedings of the IEEE international conference on computer vision (2013), pp. 233-240.
[16] A.W. Smeulders, D.M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: An experimental survey, IEEE Transactions on Pattern Analysis & Machine Intelligence, (2013) 1.
[17] P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: Algorithms and benchmark, IEEE Transactions on Image Processing, 24 (2015) 5630-5644.
[18] A. Li, M. Lin, Y. Wu, M.-H. Yang, S. Yan, Nus-pro: A new visual tracking challenge, IEEE transactions on pattern analysis and machine intelligence, 38 (2016) 335-349.
[19] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for uav tracking, European conference on computer vision, (Springer2016), pp. 445-461.
[20] H.K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, S. Lucey, Need for speed: A benchmark for higher frame rate object tracking, Computer Vision (ICCV), 2017 IEEE International Conference on, (IEEE2017), pp. 1134-1143.
[21] S. Li, D.Y. Yeung, Visual Object Tracking for Unmanned Aerial Vehicles: A Benchmark and New Motion Models, AAAI (2017), pp. 4140-4146.
[22] L.C. Zajc, A. Lukezic, A. Leonardis, M. Kristan, Beyond standard benchmarks: Parameterizing performance evaluation in visual object tracking, Computer Vision (ICCV), 2017 IEEE International
[23] A. Moudgil, V. Gandhi, Long-Term Visual Object Tracking Benchmark, arXiv preprint
[24] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, VOT2018 Challenge, http://www.votchallenge.net/vot2018/dataset.html, (2018).
[25] J. Valmadre, L. Bertinetto, J.F. Henriques, R. Tao, A. Vedaldi, A. Smeulders, P. Torr, E. Gavves, Longterm Tracking in the Wild: A Benchmark, arXiv preprint arXiv:1803.09502, (2018).
[26] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, International Joint Conference on Artificial Intelligence (1981), pp. 674-679.
[27] J. Shi, C. Tomasi, Good features to track, Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on (2002), pp. 593-600.
[28] D. Comaniciu, V. Ramesh, P. Meer, Real-time tracking of non-rigid objects using mean shift, Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, (IEEE2000), pp. 142-149.
[29] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, Computer vision and pattern recognition, 2006 IEEE Computer Society Conference on, (IEEE2006), pp. 798-805.
[30] S. Avidan, Ensemble tracking, IEEE transactions on pattern analysis and machine intelligence, 29 (2007).
[31] Y. Li, H. Ai, T. Yamashita, S. Lao, M. Kawade, Tracking in low frame rate video: A cascade particle filter with discriminative observers of different life spans, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30 (2008) 1728-1740.
[32] M. Ozuysal, M. Calonder, V. Lepetit, P. Fua, Fast keypoint recognition using random ferns, IEEE transactions on pattern analysis and machine intelligence, 32 (2010) 448-461.
[33] N. Wang, J. Shi, D.Y. Yeung, J. Jia, Understanding and Diagnosing Visual Tracking Systems, (2015) 3101-3109.
[34] R.E. Kalman, A new approach to linear filtering and prediction problems, Journal of basic Engineering, 82 (1960) 35-45.
[35] N.J. Gordon, D.J. Salmond, A.F. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings F (Radar and Signal Processing), (IET1993), pp. 107-113.
[36] R.S. Bucy, Bayes theorem and digital realizations for non-linear filters, Journal of the Astronautical Sciences, 17 (1969) 80.
[37] S. Baker, I. Matthews, Lucas-kanade 20 years on: A unifying framework, International journal of computer vision, 56 (2004) 221-255.
[39] J.-C. Yoo, T.H. Han, Fast Normalized Cross-Correlation, Circuits, systems and signal processing, 28 (2009) 819.
[40] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on pattern analysis and machine intelligence, 24 (2002) 603-619.
[43] W. Zhu, Y. Liu, B.L. Bian, Z. Zhang, Survey on Object Tracking Method Base on Generative Model, MICROPROCESSORS, 38 (2017) 41-47.
[44] B. Han, D. Comaniciu, Y. Zhu, L.S. Davis, Sequential Kernel Density Approximation and Its Application to Real-Time Visual Tracking, IEEE Transactions on Pattern Analysis & Machine Intelligence, 30 (2008) 1186-1197.
[45] D. Comaniciu, V. Ramesh, P. Meer, Kernel-Based Object Tracking, Pattern Analysis & Machine Intelligence, 25 (2003) 564-575.
[46] D.A. Ross, J. Lim, R.S. Lin, M.H. Yang, Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, 77 (2008) 125-141.
[47] A.D. Jepson, D.J. Fleet, T.F. El-Maraghi, Robust online appearance models for visual tracking, Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on (2003), pp. 1296-1311.
[48] M.J. Black, A.D. Jepson, EigenTracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation, International Journal of Computer Vision, 26 (1998) 63-84.
[49] O. Tuzel, F. Porikli, P. Meer, A Bayesian Approach to Background Modeling, Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops. IEEE Computer Society Conference on (2005), pp. 58-58.
[50] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse Representation for Computer Vision and Pattern Recognition, Proceedings of the IEEE, 98 (2010) 1031-1044.
[51] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Readings in Speech Recognition, 77 (1990) 267–296.
[52] H. Yang, L. Shao, F. Zheng, L. Wang, Z. Song, Recent advances and trends in visual tracking: A review, Neurocomputing, 74 (2011) 3823-3831.
[53] F. Porikli, Integral histogram: A fast way to extract histograms in cartesian spaces, Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, (IEEE2005), pp. 829-836.
[54] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Adaptive correlation filters with long-term and short-term memory for object tracking, International Journal of Computer Vision, (2018) 1-26.
[55] L. Sevilla-Lara, E. Learned-Miller, Distribution fields for tracking, Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, (IEEE2012), pp. 1910-1917.
[56] M. Felsberg, Enhanced distribution field tracking using channel representations, Proceedings of the IEEE International Conference on Computer Vision Workshops (2013), pp. 121-128.
[57] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, European conference on computer vision, (Springer2012), pp. 702-715.
[58] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (2005), pp. 886-893.
[59] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International journal of computer vision, 60 (2004) 91-110.
[60] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, British Machine Vision Conference, Nottingham, September 1-5, 2014, (BMVA Press2014).
[61] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Discriminative Scale Space Tracking, IEEE Transactions on Pattern Analysis & Machine Intelligence, 39 (2016) 1561-1575.
[62] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Accurate Scale Estimation for Robust Visual Tracking, British Machine Vision Conference (2014), pp. 65.61-65.11.
[63] A. Lukezic, T. Vojir, L.C. Zajc, J. Matas, M. Kristan, Discriminative Correlation Filter with Channel and Spatial Reliability, IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4847-4856.
[64] Y. Liu, X. Chen, H. Yao, X. Cui, C. Liu, W. Gao, Contour-motion feature (CMF): A space–time approach for robust pedestrian detection, Pattern Recognition Letters, 30 (2009) 148-156.
[65] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, C.-S. Kim, Sowp: Spatially ordered and weighted patch descriptor for visual tracking, Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 3011-3019.
[66] M. Mueller, N. Smith, B. Ghanem, Context-Aware Correlation Filter Tracking, Computer Vision and Pattern Recognition (2017), pp. 1387-1395.
[67] K. Zhang, L. Zhang, Q. Liu, D. Zhang, M.H. Yang, Fast Visual Tracking via Dense Spatio-temporal Context Learning, 8693 (2014) 127-141.
[68] B.S. Manjunath, W.Y. Ma, Texture Features for Browsing and Retrieval of Image Data, IEEE Trans Pami, 18 (1996) 837-842.
[69] W.H. Liao, Region Description Using Extended Local Ternary Patterns, International Conference on Pattern Recognition (2010), pp. 1003-1006.
[70] X. Tan, B. Triggs, Enhanced Local Texture Feature Sets for Face Recognition Under Difficult Lighting Conditions, IEEE Trans Image Process, 19 (2010) 1635-1650.
[71] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns, IEEE Transactions on Pattern Analysis & Machine Intelligence, 24 (2000) 971-987.
[72] S. Liao, X. Zhu, Z. Lei, L. Zhang, S.Z. Li, Learning Multi-scale Block Local Binary Patterns for Face Recognition, International Conference on Advances in Biometrics (2007), pp. 828-837.
[73] J. Chen, S. Shan, C. He, G. Zhao, M. Pietikäinen, X. Chen, W. Gao, WLD: A Robust Local Image Descriptor, IEEE Transactions on Pattern Analysis & Machine Intelligence, 32 (2010) 1705.
[74] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-Learning-Detection, IEEE Transactions on Pattern Analysis & Machine Intelligence, 34 (2012) 1409-1422.
[77] G. Pass, R. Zabih, J. Miller, Comparing images using color coherence vectors, ACM International
[78] H. Possegger, T. Mauthner, H. Bischof, In defense of color-based model-free tracking, Computer Vision
[79] T. Vojir, J. Noskova, J. Matas, Robust Scale-Adaptive Mean-Shift for Tracking, Scandinavian
[80] C.P. Papageorgiou, M. Oren, T. Poggio, A general framework for object detection, International
[81] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference
[82] B. Babenko, M.H. Yang, S. Belongie, Visual Tracking with Online Multiple Instance Learning, Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), pp. 983-990.
[83] K. Zhang, L. Zhang, M.H. Yang, Real-time compressive tracking, European Conference on Computer
[85] X. Wang, T.X. Han, S. Yan, An HOG-LBP human detector with partial occlusion handling, Proc.ieee
[86] W. Kuan, J. Chen, C. Liang, Y. Wu, R. Hu, Object tracking via online trajectory optimization with multi-
[87] X. Wu, T. Xu, W. Xu, Review of Target Tracking Algorithms in Video Based on Correlation Filter,
[88] W. Zhang, B. Kang, Recent Advances in Correlation Filter-Based Object Tracking: A Review, Journal
[89] Q. Wei, S. Lao, L. Bai, Visual Object Tracking Based on Correlation Filters: A Survey, Computer Science,
[90] J. Ye, J. Lei, H. Wu, M. Peng, N. Xue, A Survey of Object Tracking Based on Discriminative Classifier
[91] D. Cao, C. Fu, G. Jin, Survey of Target Tracking Algorithms Based on Machine Learning, Computer
[92] J. Jia, Y. Qin, Survey on Visual Tracking Algorithms Based on Deep Learning Technologies, Computer
[93] H. Luo, L. Xu, B. Hui, Z. Chang, Status and prospect of target tracking based on deep learning, Infrared
[94] X. Gu, Y. Mao, Q. Li, Survey on Visual Tracking Algorithms Based on Mean Shift, Computer Science,
[96] N.C. Oza, Online bagging and boosting, IEEE International Conference on Systems, Man and
[97] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, ICML, (Citeseer, 1996), pp. 148-
[99] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, Conference on
[101] N.S. Altman, An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression, American
[102] K. Nigam, A.K. Mccallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled
[103] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: Bootstrapping binary classifiers by structural
[104] J.D. Lafferty, A. Mccallum, F.C.N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Eighteenth International Conference on Machine Learning (2001), pp. 282-289.
[105] K. Fukushima, S. Miyake, Neocognitron: A Self-Organizing Neural Network Model for a Mechanism
[106] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation,
[107] N. Wang, D.Y. Yeung, Learning a deep compact image representation for visual tracking,
[108] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, science,
[109] R. Girshick, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision (2015),
[110] H. Zhang, S. Sheng, Learning Weighted Naive Bayes with Accurate Ranking, IEEE International
[112] J. Cheng, R. Greiner, Comparing Bayesian Network Classifiers, IEEE Transactions on Vehicular
[114] N. Abramson, D. Braverman, G. Sebestyen, Pattern Recognition and Machine Learning (Springer,
[115] D.S. Bolme, B.A. Draper, J.R. Beveridge, Average of Synthetic Exact Filters, Computer Vision and
[116] D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation
[117] J.F. Henriques, C. Rui, P. Martins, J. Batista, High-Speed Tracking with Kernelized Correlation Filters,
[118] Z. Chen, Z. Hong, D. Tao, An Experimental Survey on Correlation Filter-based Tracking, Computer
[119] H.K. Galoogahi, T. Sim, S. Lucey, Multi-channel Correlation Filters, IEEE International Conference
[120] S. Hare, A. Saffari, P.H.S. Torr, Struck: Structured output tracking with kernels, International
[122] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, Computer vision and pattern recognition (CVPR), 2012 IEEE Conference on, (IEEE, 2012), pp. 1838-1845.
[123] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, Computer vision and pattern recognition (CVPR), 2012 IEEE Conference on, (IEEE, 2012), pp. 1822-1829.
[124] H.K. Galoogahi, T. Sim, S. Lucey, Correlation Filters with Limited Boundaries, Computer Vision and
[125] H.K. Galoogahi, A. Fagg, S. Lucey, Learning Background-Aware Correlation Filters for Visual
[126] Y. Sui, Z. Zhang, G. Wang, Y. Tang, L. Zhang, Real-Time Visual Tracking: Promoting the Robustness
[127] J. Zhang, S. Ma, S. Sclaroff, MEEM: Robust Tracking via Multiple Experts Using Entropy
[128] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H.S. Torr, Staple: Complementary Learners for
[129] M. Danelljan, G. Hager, F. Shahbaz Khan, M. Felsberg, Learning Spatially Regularized Correlation Filters for Visual Tracking, Proceedings of the IEEE International Conference on Computer Vision (2015),
[130] E. Gundogdu, A.A. Alatan, Spatial windowing for correlation filter based visual tracking, IEEE
[131] H. Hu, B. Ma, J. Shen, L. Shao, Manifold Regularized Correlation Object Tracking, IEEE Transactions
[132] A. Bibi, M. Mueller, B. Ghanem, Target response adaptation for correlation filter tracking, European
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA (2017), pp. 21-26.
[134] Y. Sui, G. Wang, L. Zhang, Correlation filter learning toward peak strength for visual tracking, IEEE Transactions on Cybernetics, 48 (2018) 1290-1303.
[135] C. Ma, X. Yang, C. Zhang, M.H. Yang, Long-term correlation tracking, Computer Vision and Pattern Recognition (2015), pp. 5388-5396.
[136] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, MUlti-Store Tracker (MUSTer): A cognitive psychology inspired approach to object tracking, Computer Vision and Pattern Recognition (2015), pp. 749-758.
[137] G. Zhu, J. Wang, Y. Wu, H. Lu, Collaborative Correlation Tracking, BMVC (2015), pp. 184.181-184.112.
[138] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, (California Univ San Diego La Jolla Inst for Cognitive Science, 1985).
[139] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural computation, 1 (1989) 541-551.
[140] K. Fukushima, Neural network model for a mechanism of pattern recognition unaffected by shift in position-Neocognitron, IEICE Technical Report, A, 62 (1979) 658-665.
[141] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998) 2278-2324.
[142] S. Hong, T. You, S. Kwak, B. Han, Online tracking by learning discriminative saliency map with convolutional neural network, International Conference on Machine Learning (2015), pp. 597-606.
[143] H. Nam, B. Han, Learning Multi-domain Convolutional Neural Networks for Visual Tracking, Computer Vision and Pattern Recognition (2016), pp. 4293-4302.
[144] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-Convolutional Siamese Networks for Object Tracking, (2016) 850-865.
[145] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P.H.S. Torr, End-to-End Representation Learning for Correlation Filter Based Tracking, (2017) 5000-5008.
[146] Q. Wang, J. Gao, J. Xing, M. Zhang, W. Hu, DCFNet: Discriminant Correlation Filters Network for Visual Tracking, (2017).
[147] H. Chen, S. Lucey, D. Ramanan, Learning Policies for Adaptive Tracking with Deep Feature Cascades, (2017) 105-114.
[148] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, S. Wang, Learning Dynamic Siamese Network for Visual Object Tracking, IEEE International Conference on Computer Vision (2017), pp. 1781-1789.
[149] T. Yang, A.B. Chan, Recurrent Filter Learning for Visual Tracking, arXiv preprint arXiv:1708.03874, (2017).
[150] Z. Zhu, G. Huang, W. Zou, D. Du, C. Huang, UCT: learning unified convolutional networks for real-time visual tracking, Proc. of the IEEE Int. Conf. on Computer Vision Workshops (2017), pp. 1973-1982.
[151] E. Gundogdu, A.A. Alatan, Good Features to Correlate for Visual Tracking, IEEE Transactions on Image Processing, 27 (2017).
[152] H. Fan, H. Ling, Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking, Proc. IEEE Int. Conf. Computer Vision, Venice, Italy (2017).
[153] J. Choi, H.J. Chang, S. Yun, T. Fischer, Y. Demiris, J.Y. Choi, Attentional Correlation Filter Network for Adaptive Visual Tracking, CVPR (2017), pp. 7.
[154] L. Zhang, J. Varadarajan, P.N. Suganthan, N. Ahuja, P. Moulin, Robust visual tracking using oblique random forests, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 5589-5598.
[155] B. Han, J. Sim, H. Adam, BranchOut: Regularization for Online Ensemble Tracking with Convolutional Neural Networks, Computer Vision and Pattern Recognition (2017), pp. 521-530.
[156] D. Yeo, J. Son, B. Han, J.H. Han, Superpixel-based tracking-by-segmentation using markov chains, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2017), pp. 511-520.
[157] R. Tao, E. Gavves, A.W. Smeulders, Siamese instance search for tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 1420-1429.
[158] X. Wang, C. Li, B. Luo, J. Tang, SINT++: Robust Visual Tracking via Adversarial Positive Instance Generation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4864-4873.
[159] K. Chen, W. Tao, Convolutional Regression for Visual Tracking, IEEE Transactions on Image Processing, 27 (2018) 3611-3620.
[160] Y. Song, C. Ma, L. Gong, J. Zhang, R.W.H. Lau, M.H. Yang, CREST: Convolutional Residual Learning for Visual Tracking, IEEE International Conference on Computer Vision (2017), pp. 2574-2583.
[161] E. Park, A.C. Berg, Meta-Tracker: Fast and Robust Online Adaptation for Visual Object Trackers, arXiv preprint arXiv:1801.03049, (2018).
[162] Y. Yao, X. Wu, L. Zhang, S. Shan, W. Zuo, Joint Representation and Truncated Inference Learning for Correlation Filter based Tracking, arXiv preprint arXiv:1807.11071, (2018).
[163] G. Bhat, J. Johnander, M. Danelljan, F.S. Khan, M. Felsberg, Unveiling the Power of Deep Tracking, arXiv preprint arXiv:1804.06833, (2018).
[164] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, ECO: Efficient Convolution Operators for Tracking, (2016) 6931-6939.
[165] C. Sun, D. Wang, H. Lu, M.-H. Yang, Learning spatial-aware regressions for visual tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 8962-8970.
[167] H. Nam, M. Baek, B. Han, Modeling and Propagating CNNs in a Tree Structure for Visual Tracking, (2016).
[168] M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking, (2016).
[169] Z. Teng, J. Xing, Q. Wang, C. Lang, S. Feng, Y. Jin, Robust Object Tracking Based on Temporal and Spatial Deep Networks, IEEE International Conference on Computer Vision (2017), pp. 1153-1162.
[170] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, S. Maybank, Learning attentions: residual attentional Siamese Network for high performance online visual tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4854-4863.
[171] T. Zhang, C. Xu, M.H. Yang, Multi-task Correlation Particle Filter for Robust Object Tracking, IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4819-4827.
[172] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High Performance Visual Tracking With Siamese Region Proposal Network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 8971-8980.
[173] Z. Chi, H. Li, H. Lu, M. Yang, Dual Deep Network for Visual Tracking, IEEE Transactions on Image Processing, 26 (2017) 2005-2015.
[174] S. Yun, J. Choi, Y. Yoo, K. Yun, Y.C. Jin, Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning, IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 1349-1358.
[175] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Convolutional Features for Correlation Filter Based Visual Tracking, IEEE International Conference on Computer Vision Workshop (2016), pp. 621-629.
[176] Z. He, Y. Fan, J. Zhuang, Y. Dong, H. Bai, Correlation Filters with Weighted Convolution Responses, ICCV Workshops (2017), pp. 1992-2000.
[177] A. He, C. Luo, X. Tian, W. Zeng, A twofold siamese network for real-time object tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4834-4843.
[178] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. Lau, M.-H. Yang, VITAL: VIsual Tracking via Adversarial Learning, arXiv preprint arXiv:1804.04273, (2018).
[179] Z. Zhu, W. Wu, W. Zou, J. Yan, End-to-End Flow Correlation Tracking with Spatial-Temporal Attention, illumination, 42 (2017) 20.
[180] C. Ma, J.B. Huang, X. Yang, M.H. Yang, Hierarchical Convolutional Features for Visual Tracking, IEEE International Conference on Computer Vision (2016), pp. 3074-3082.
[181] C. Ma, Y. Xu, B. Ni, X. Yang, When Correlation Filters Meet Convolutional Neural Networks for Visual Tracking, IEEE Signal Processing Letters, 23 (2016) 1454-1458.
[182] C. Ma, J. Huang, X. Yang, M.-H. Yang, Robust Visual Tracking via Hierarchical Convolutional Features, arXiv preprint arXiv:1707.03816, (2017).
[183] H. Chen, B. Fan, Hierarchical Convolutional Features for Long-Term Correlation Tracking, (Springer Singapore, Singapore, 2017), pp. 677-686.
[184] J. Choi, H.J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, J.Y. Choi, Context-aware Deep Feature Compression for High-speed Visual Tracking, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 479-488.
[185] J. Johnander, M. Danelljan, F.S. Khan, M. Felsberg, DCCO: Towards Deformable Continuous Convolution Operators for Visual Tracking, International Conference on Computer Analysis of Images and Patterns, (Springer, 2017), pp. 55-67.