Learning-based Tracking of Fast Moving Objects
2020·arXiv
Abstract
Tracking fast moving objects, which appear as blurred streaks in video sequences, is a difficult task for standard trackers, as the object position does not overlap in consecutive video frames and the texture information of the objects is blurred. Current approaches tuned for this task are based on background subtraction with a static background and on slow deblurring algorithms. In this paper, we present a tracking-by-segmentation approach implemented using state-of-the-art deep learning methods that performs near-realtime tracking on real-world video sequences. We implemented a physically plausible FMO sequence generator as a robust foundation for our training pipeline, and we demonstrate how easily the generator and the network can be adapted to different FMO scenarios in terms of foreground variations.

Object tracking is a well-explored field of computer vision. The majority of object tracking algorithms, from basic correlation trackers up to state-of-the-art deep network trackers, utilize texture-based correlation or feature-based methods. Modern video capturing devices with built-in processing algorithms are capable of producing sharp images of moving objects. Moreover, the person capturing the object in motion typically follows it, so it predominantly stays in the center of the image and in focus. For such tasks, correlation-based trackers are thus sufficient.

The situation changes dramatically when the object in question moves so fast that it cannot be captured sharply in video. We call such an object in motion an 'FMO', short for Fast Moving Object [1].

An FMO can be loosely defined as an object traveling a distance larger than its diameter within one frame of the video sequence (Figure 1). The inter-frame object overlap is negligible, which causes problems for many conventional trackers.
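The loose definition above can be written as a one-line test. The following is a minimal sketch, assuming the object center is known in two consecutive frames and that the diameter is given in the same pixel units; the function name `is_fmo` is hypothetical and not part of the paper.

```python
import math


def is_fmo(prev_center, curr_center, diameter):
    """Return True if the object moved farther than its own diameter in one frame.

    prev_center, curr_center: (x, y) positions in two consecutive frames.
    diameter: object diameter in the same units (assumed known here).
    """
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    return math.hypot(dx, dy) > diameter
```

For example, a ball of diameter 20 pixels that moves 50 pixels between frames qualifies, while one that moves 5 pixels does not.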

A typical manifestation of an FMO in video frames is an elongated streak without any particular texture, colored with the prevailing color of the object or a combination of object colors; see Figure 1. The lack of any sharp object texture renders most texture-based correlation trackers unusable.

The first tracking algorithm specifically designed for FMOs uses a method based on background subtraction [1]. This technique requires a static background, a static camera and large, prominent foregrounds. It is also prone to losing the object, which then requires a time-consuming re-detection.

More recent approaches deal with the problem of FMO tracking by running a de-blurring algorithm [2], [3], [4]. These methods perform considerably better but are extremely slow, as they require a full-blown de-blurring optimization pipeline. Therefore, they are unusable for real-time video stream processing.

The proposed solution is to approach the problem not as tracking by correlation but rather as tracking by segmentation. Our view is that the segmentation task is more useful in these cases. The resulting segmentation can be further used for trajectory prediction and, further down the pipeline, even for trajectory estimation in conventional de-blurring algorithms.

Our primary goal is to provide a method operating in real-world scenarios such as tracking ping pong balls, squash balls, badminton shuttlecocks and similar objects. The method uses a convolutional neural network (CNN) with real-time performance on videos with a resolution of 320x240. For training and evaluation, we synthesized a dataset composed of relevant YouTube sport video sequences. We propose an on-demand synthetic FMO data generator to produce annotated data automatically.

For comparison with state-of-the-art approaches, we evaluate our method on a well-established FMO dataset [1]. We demonstrate that our approach shows competitive results and investigate the cases where our algorithm outperforms or under-performs current methods in both precision and execution time.

Object tracking is a well-established field of research in computer vision. Many methods have been proposed for tracking single or multiple objects in video sequences, namely tracking by detection [5], [6], tracking by features [7], [8], tracking by correlation [9] and others. All of the mentioned approaches are based either on object detection using texture information of the tracked object or on features extracted from it. This assumes that the object image contains some minimum level of detail.

image

Fig. 1: Examples of Fast Moving Objects in real-world videos.

Also, many conventional trackers perform best when the bounding boxes of the tracked object largely overlap in consecutive frames. Neither of these assumptions holds in sequences with an FMO.

FMO Tracking. FMO tracking has recently been attracting the attention of more and more researchers. Initial work in this field was done in [1], where the authors first introduced the topic and proposed the first tracker, based on background subtraction. At the heart of the method lies a tracker capable of tracking the background changes. When the tracker fails, a time-consuming re-detection is run to resume tracking. More recently, the tracking problem was defined in [2] as a de-blurring optimization problem. In a similar approach [4], the authors show the intra-frame tracking capability of the de-blurring approach. Although the results in both publications are promising, these methods focus on videos with a static camera and background, and they cannot be used in real time because of the high processing time demands of the optimization algorithm.

In this section, we first give a brief introduction to the overall framework of the proposed method. Then we briefly investigate various strategies, and finally we describe the proposed method in depth.

A. Overview

The work of [1] inspired us to tackle the problematic cases on which that method did not perform well enough, namely tracking of very small objects. Most sport videos are sharp, either because they were recorded with modern capturing devices or because the camera operator actively tracks the object of interest. Small objects moving very fast are an exception. These are typically balls in specific sports such as tennis, softball or badminton.

After some failed attempts to solve tracking of such small fast objects (FMOs) by conventional means, we turned our attention towards deep learning methods. Deep learning methods achieve top results in many segmentation tasks in terms of both computation time and precision. We investigated several state-of-the-art segmentation networks and achieved the best results with ENet [10], a U-Net-type architecture with inception bottleneck modules. Please refer to Figure 2.

Because we aimed to track problematic real-world sports videos on which other tracking methods fail due to low resolution and small object size, we included YouTube sports videos in the performance evaluation. However, the YouTube dataset used is not annotated and was evaluated only by the personal opinion of a human rater. Therefore, it is not included in the resulting statistics. This dataset nevertheless had a profound influence on network design and hyper-parameter tuning.

As a benchmark dataset we chose the publicly available FMO dataset, in order to compare the performance of the proposed approach with the original method [1]. We preprocess the dataset so that the foreground size and color resemble the foregrounds used for training. Another way to make the method work on different foregrounds is to fine-tune the network with different dataset generator parameters. An example of fine-tuning can be seen in Table I, method II, where we fine-tuned the network to detect bigger foregrounds, e.g. a frisbee or a volleyball.

B. Backbone architecture

As with most modern methods, our approach is based on a deep network architecture. As the backbone of our segmentation network we use a U-Net-type architecture called ENet [10], consisting of inception blocks based on [11]. The initial choice of this network design was based on the speed of evaluation and the performance on various benchmark datasets. We found that this design outperformed the other backbone designs we tested, such as the classical U-Net or Mask-RCNN.

image

Fig. 2: Processing pipeline: During the training phase (top section) the sequences are dynamically synthesized using preprocessed video sequences, foregrounds and a path generator. Next, the frames are concatenated and input to the network as a 15-channel image. During the inference phase (bottom section) the sequences are segmented by the network and a Kalman-based tracker is used for path prediction.

The basic idea behind FMO trace segmentation is to train the network to recognize elongated objects with no apparent texture, typically white in color to resemble the most common sports balls, such as those in ping pong, squash and even badminton or tennis. In our opinion, these represent the majority of the problematic sport videos.

Furthermore, because of the difficulty of the task, single-image segmentation proved to be too difficult and produced too many false positives. This is expected, as the network basically learns to recognize bright smears and therefore falsely segments any bright spots or lines in the image. To overcome this, we tested approaches with a sequence of several consecutive frames as the network input. The idea was that a sequence of images improves the consistency of the trace over time. For this purpose, we tested several multi-frame approaches, namely 3 and 5 frames, either concatenated in the color channels or as a full 4D input to a 4D network. (Even though this approach is mathematically equivalent to channel concatenation, the idea was to achieve faster learning and fewer false positives.) The best results were achieved by using 5 consecutive video frames concatenated in the color channels, i.e. the input to the network is a single 15-channel image. See Figure 2.
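The channel concatenation can be illustrated with a short sketch. It assumes frames are stored as nested lists of `[r, g, b]` pixels and that the window slides by one frame at a time; the helper names `stack_frames` and `sliding_windows` are hypothetical, and an actual pipeline would more likely use array operations.

```python
def stack_frames(frames):
    """Concatenate consecutive RGB frames along the channel axis.

    frames: list of T images, each a list of rows, each row a list of
    [r, g, b] pixels. Returns one image with 3*T channels per pixel.
    """
    height, width = len(frames[0]), len(frames[0][0])
    stacked = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel = []
            for frame in frames:
                pixel.extend(frame[y][x])  # append this frame's 3 channels
            row.append(pixel)
        stacked.append(row)
    return stacked


def sliding_windows(video, length=5):
    """Yield (middle_frame_index, stacked_input) for every full window."""
    for start in range(len(video) - length + 1):
        window = video[start:start + length]
        yield start + length // 2, stack_frames(window)
```

With `length=5`, each yielded input has 15 channels per pixel, and the returned index identifies the middle frame of the window.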

The images used for training are synthetic FMO sequences based on real-world sports background images. Because every deep network is only as good as the dataset used for training, we invested considerable effort in creating a quality tool for generating the synthetic sequences. Please refer to Section III-C for further details.

Although the majority of state-of-the-art deep learning methods depend heavily on re-using learned parameters from their successful predecessors, transfer learning proved inapplicable in our case. This is due to the specificity of our task, which cannot exploit convolution kernels learned on other problems that rely on the extraction of texture features.

C. Dataset generator

At the heart of any modern machine learning method is a good dataset. Because no FMO training dataset existed, we created our own FMO sequence generator that obeys the laws of physics. First, we obtained a dataset of YouTube sports videos, which we used as backgrounds. To eliminate any genuine fast moving objects from the videos, we generated sequences of median images. Every frame of such a sequence was calculated as the median of 5 consecutive frames. Next, we created a foreground generator based on selected ball images from a variety of sports. Finally, we designed a physically plausible generator of trajectories, including random bounces and occlusions. See Figure 2.

image

Fig. 3: FMO synthetic data generator example. The rightmost image shows an example of a small emulated bounce.
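The median filtering of the backgrounds can be sketched as follows. This is a minimal illustration on grayscale frames stored as nested lists; whether the windows overlap is not stated in the text, so the sliding window used here is an assumption, and the name `median_background` is hypothetical.

```python
from statistics import median


def median_background(frames, window=5):
    """Return per-pixel temporal medians over sliding windows of frames.

    frames: list of grayscale images (lists of rows of intensities).
    A streak that is present in only one frame of a window is suppressed,
    because the median ignores isolated outliers.
    """
    height, width = len(frames[0]), len(frames[0][0])
    result = []
    for start in range(len(frames) - window + 1):
        chunk = frames[start:start + window]
        result.append([
            [median(frame[y][x] for frame in chunk) for x in range(width)]
            for y in range(height)
        ])
    return result
```

A pixel that is bright in a single frame out of five keeps its background value in the output, which is the property the generator relies on to obtain clean backgrounds.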

At the core of the image synthesizer is a random motion path generator, which takes into account a fully simulated camera (including CCD size, resolution and aperture properties) as well as the motion of the simulated object in space. The generator starts from a random initial velocity vector and then iterates in time, simulating the motion. For even better plausibility, gravitational acceleration is taken into account, too. Sudden velocity changes (as if hit by a racket), bounces (as if off a wall, the ground or a table), occlusions and sudden motion stops are simulated as well. Refer to Figure 3, right.
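A heavily simplified sketch of such a path generator is shown below. It models the ball as a 2D point mass with gravity and a single floor bounce, and it omits the camera simulation, racket hits, occlusions and stops described above. All parameter names and default values (`substeps`, `restitution`, the velocity ranges) are assumptions made for illustration only.

```python
import random


def simulate_path(num_frames, substeps=20, dt=1.0 / 30, gravity=9.81,
                  floor_y=0.0, restitution=0.8, seed=0):
    """Simulate a 2D point mass; return one list of (x, y) samples per frame.

    Coordinates are in metres with y pointing up. Each frame is divided
    into `substeps` samples, which together trace the intra-frame path.
    """
    rng = random.Random(seed)
    x, y = 0.0, rng.uniform(0.5, 2.0)        # random start height
    vx = rng.uniform(5.0, 20.0)              # random initial velocity
    vy = rng.uniform(-5.0, 5.0)
    h = dt / substeps
    frames = []
    for _ in range(num_frames):
        samples = []
        for _ in range(substeps):
            vy -= gravity * h                # gravitational acceleration
            x += vx * h
            y += vy * h
            if y < floor_y:                  # bounce off the ground
                y = floor_y + (floor_y - y)
                vy = -vy * restitution
            samples.append((x, y))
        frames.append(samples)
    return frames
```

The samples of one frame, once projected to pixel coordinates and normalized to sum to 1, would play the role of the path PSF for that frame.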

The generated trajectory is then convolved with the foreground to create the motion trace, which is finally inserted as a weighted sum into the sequence of background images using the following formula:

image

where P_t is the path PSF normalized to sum to 1, F(x) is the random foreground image, b_f is the overexposure brightness factor (described in the next paragraph), M(x) is the foreground indicator function and B(x) is the background image. The foreground image is created by randomly selecting from real-world white ball images, which are tinted in a random bright color and resized to a pre-defined range of foreground sizes.
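To make the roles of these symbols concrete, the sketch below composites a blurred foreground into a background. The blending rule it uses, I(x) = (1 - [P_t * M](x)) B(x) + b_f [P_t * F](x), is an assumption based on the symbol definitions above and on the common FMO formation model, not a transcription of the paper's formula. The final clipping to `max_value` and the helper names `splat` and `composite` are likewise assumptions.

```python
def splat(psf, patch):
    """Convolve a sparse path PSF with a small patch.

    Implemented by accumulating a weighted copy of `patch` centered at
    every non-zero PSF pixel.
    """
    height, width = len(psf), len(psf[0])
    ph, pw = len(patch), len(patch[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weight = psf[y][x]
            if weight == 0:
                continue
            for j in range(ph):
                for i in range(pw):
                    yy, xx = y + j - ph // 2, x + i - pw // 2
                    if 0 <= yy < height and 0 <= xx < width:
                        out[yy][xx] += weight * patch[j][i]
    return out


def composite(background, psf, foreground, mask, brightness=1.0,
              max_value=255.0):
    """Blend the blurred foreground into the background (assumed model)."""
    blurred_f = splat(psf, foreground)   # [P_t * F]
    blurred_m = splat(psf, mask)         # [P_t * M]
    return [
        [min(max_value, (1.0 - m) * b + brightness * f)
         for b, f, m in zip(b_row, f_row, m_row)]
        for b_row, f_row, m_row in zip(background, blurred_f, blurred_m)
    ]
```

Under this model, the blurred mask `[P_t * M]` is also the quantity that serves as a per-frame segmentation target.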

Another aspect that was taken into account is the overexposure of fast moving objects, which results from an 'HDR' effect of the moving object. The overall brightness of the object in one frame can be, and often is, higher than the maximum brightness point in the rest of the image. Every camera has to convert the high brightness range of the world to a quantized range of brightness values (0 to 255 for 8-bit images). This is done by several techniques that are out of the scope of this article. The conversion usually includes some form of clipping of brightness levels that are too high, in order to optimize the overall brightness balance of the image. In a typical image without any FMO, the overexposed parts of the image are clipped to the maximum allowed brightness. In the case of a fast moving object, however, the true brightness of the object at rest corresponds to the integral of its brightness along the object trajectory. In other words, the overall brightness of the object is spread out along the object path, so it does not exceed the maximum pixel brightness at any point of the image. It is therefore often the case that the true brightness of the object, when aggregated along the path, exceeds the maximum brightness of the image, especially for a white ball. If this effect were not taken into consideration, the rendered object would appear very dim in the resulting image. This led us to set the absolute brightness factor of the foreground to between 0.8 and 1.4 times the maximum brightness.

As the ground truth mask image used in the training phase, we use the foreground path mask corresponding to the middle frame of the sequence. It is again calculated as the foreground mask convolved with the trajectory corresponding to the middle frame ([P3 ∗ M]). Please see Figure 2 for an illustration.

D. Tracking

On top of the successful segmentation we implemented a simple tracker. The tracker is responsible for the final estimation of the object trajectory. First, we select the blob that most likely represents the tracked object. This can be achieved by simply selecting the largest connected component in the segmentation image. For sequences containing many false positives, more sophisticated logic can be applied. We used a weighted combination of two measures: connected component size and shape. Since we are looking for an elongated object, we use the second central moments of the connected components to estimate the elongation. In the tracker, the position of the blob is represented by a bounding box.
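The blob selection can be sketched in plain Python as follows. The scoring rule (a weighted sum of pixel count and elongation), the weights, the 4-connectivity and all function names are assumptions made for illustration; the text only states that size and shape are combined. The elongation is computed from the eigenvalues of the second central moment matrix, with 1/12 added to account for the extent of a single pixel.

```python
import math
from collections import deque


def connected_components(mask):
    """Return components of a binary mask as lists of (y, x) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def elongation(pixels):
    """Major-to-minor axis ratio from second central moments (>= 1)."""
    n = len(pixels)
    my = sum(p[0] for p in pixels) / n
    mx = sum(p[1] for p in pixels) / n
    myy = sum((p[0] - my) ** 2 for p in pixels) / n + 1.0 / 12
    mxx = sum((p[1] - mx) ** 2 for p in pixels) / n + 1.0 / 12
    mxy = sum((p[0] - my) * (p[1] - mx) for p in pixels) / n
    half_trace = (mxx + myy) / 2
    root = math.sqrt(((mxx - myy) / 2) ** 2 + mxy ** 2)
    return math.sqrt((half_trace + root) / (half_trace - root))


def select_blob(mask, size_weight=1.0, shape_weight=5.0):
    """Return the bounding box (x_min, y_min, x_max, y_max) of the best blob."""
    components = connected_components(mask)
    if not components:
        return None
    best = max(components,
               key=lambda c: size_weight * len(c) + shape_weight * elongation(c))
    ys = [p[0] for p in best]
    xs = [p[1] for p in best]
    return (min(xs), min(ys), max(xs), max(ys))
```

With these weights, a thin streak is preferred over a compact blob of similar area, which matches the intent of favoring elongated traces.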

Sequences of bounding box positions are used by the tracker to extrapolate the object trajectory. For frames with missing or too small blobs, we utilize a Kalman filter to estimate the missing part of the trajectory or to predict the trajectory when the object is lost or occluded.
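How a Kalman filter bridges missing detections can be illustrated with a small constant-velocity filter applied independently to the x and y coordinates of the bounding box center. The motion model, the noise values and the names `AxisKalman` and `fill_track` are assumptions for this sketch; the text does not specify the filter design that was actually used.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate.

    State: position p and velocity v (per frame). The covariance is stored
    as the three distinct entries of a symmetric 2x2 matrix.
    """

    def __init__(self, position, process_noise=1.0, measurement_noise=4.0):
        self.p, self.v = float(position), 0.0
        self.c_pp, self.c_pv, self.c_vv = 100.0, 0.0, 100.0
        self.q, self.r = process_noise, measurement_noise

    def predict(self):
        """Advance one frame: x <- F x, C <- F C F^T + Q."""
        self.p += self.v
        self.c_pp += 2.0 * self.c_pv + self.c_vv + self.q
        self.c_pv += self.c_vv
        self.c_vv += self.q
        return self.p

    def update(self, measured):
        """Correct the predicted state with a measured position."""
        s = self.c_pp + self.r                    # innovation variance
        k_p, k_v = self.c_pp / s, self.c_pv / s   # Kalman gain
        residual = measured - self.p
        self.p += k_p * residual
        self.v += k_v * residual
        c_pp, c_pv, c_vv = self.c_pp, self.c_pv, self.c_vv
        self.c_pp = (1.0 - k_p) * c_pp
        self.c_pv = (1.0 - k_p) * c_pv
        self.c_vv = c_vv - k_v * c_pv


def fill_track(centers):
    """Return a full trajectory; None entries are replaced by predictions.

    centers: list of (x, y) tuples or None; the first entry must be a
    valid detection.
    """
    kx, ky = AxisKalman(centers[0][0]), AxisKalman(centers[0][1])
    track = [(kx.p, ky.p)]
    for center in centers[1:]:
        kx.predict()
        ky.predict()
        if center is not None:
            kx.update(center[0])
            ky.update(center[1])
        track.append((kx.p, ky.p))
    return track
```

For an object moving at a roughly constant velocity, a frame with no detection is filled with a position close to the one obtained by linear extrapolation.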

The output of the tracker is a sequence of bounding boxes representing the estimated object trajectory. Refer to the examples in Figure 4.

image

Fig. 4: Tracker output example

In this section we present the results of the proposed method and compare them to the original CVPR paper [1]. We focused on real-world applications, considering both the speed of inference and the accuracy of detecting small ball-like objects.

A. Evaluation

The proposed method was evaluated on the FMO dataset [1], where it achieved comparable or better results when compared to the published method.

The performance criteria were selected to correspond to the evaluation statistics in the original paper. These are precision TP/(TP + FP), recall TP/(TP + FN) and F1-score 2TP/(2TP + FN + FP), where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. A true positive detection has an intersection over union (IoU) with the ground truth polygon greater than 0.5 and an IoU larger than that of the other detections. The second condition ensures that multiple detections of the same object generate only one TP. False negatives are FMOs in the ground truth with no associated TP detection.
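The three statistics and the IoU test can be written down directly. The sketch below uses axis-aligned bounding boxes instead of ground truth polygons, which is a simplification made here for brevity; the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def scores(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fn + fp) if tp + fn + fp else 0.0
    return precision, recall, f1
```

A detection would count as a true positive when `iou(detection, ground_truth) > 0.5` and no other detection of the same object has a larger IoU.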

The results for both the original method and our approach are listed in Table I. The overall mean F1-score is slightly better for our method, as is the mean recall. We were also able to avoid significant under-sizing of the resulting segmentation of the FMO trace, which would cause high precision values at the cost of low recall. Therefore, we argue that the results of our approach are more balanced in terms of the precision and recall metrics.

Training of the segmentation network was performed using the synthetic data generator described in Section III-C.

The performance of the method reflects the purpose of our algorithm. It performs well on sequences with a small ball-shaped object moving relatively fast (ping-pong, softball, tennis and squash). Poor performance was recorded on sequences with foregrounds other than balls (such as darts or archery) and on sequences with low background-foreground contrast (darts window and blue ball). The method under-performs on data with a greater ratio of foreground size to velocity (frisbee and volleyball). The foregrounds in these sequences are larger and do not move farther than their diameter within one frame, as required by the FMO definition in Section I.

An advantage of our approach is that the network can easily be fine-tuned, with the image synthesizer set up for another sequence type, such as a particular background, a particular foreground (e.g. a yellow ball) or foregrounds of a different size range. For comparison, we re-trained the network to detect bigger foregrounds and slower motions. The results are in the rightmost part of Table I. The segmentation network is no longer sensitive to smaller foregrounds, such as ping pong, squash or tennis balls, and starts to perform well in cases with larger foregrounds, such as a frisbee or a volleyball.

Another main difference is that our algorithm tackles the problem of detecting very small objects (from roughly 2 pixels in diameter) or objects crossing a background of similar color.

B. Computational time

Another benefit of using a neural network is the relatively short inference time. The state-of-the-art approaches [2], [3], [4] are based on foreground de-blurring and are therefore inherently slow. The authors of [4] report a mean time of 4 seconds per frame. Our method is capable of running in a near-realtime regime on a widely available graphics card. For more details, please refer to Table II, which lists mean frame evaluation times on an NVidia Tesla X GPU for several image resolutions.

The next goal will be to reduce the inference times enough for the method to run in a real-time environment. We plan to achieve this by reducing the network size through pruning, using a more compact backbone, reducing numerical precision or quantizing the network.

C. YouTube sport videos

As mentioned before, our primary goal was tracking balls in sports videos with predominantly high relative speed, such as ping pong, baseball, tennis or badminton. For this purpose we downloaded more than 900 000 YouTube sport videos to form the basis of the backgrounds for our synthetic data generator. Over 1800 of these sequences contain ping-pong matches, which we used for testing our framework. Although we measured our performance on the FMO dataset, we also aim for good performance on real-world sequences. Examples of ping-pong sequence evaluation can be seen in Figure 5.

image

TABLE I: Performance of the original CVPR2017 method [1] in comparison to the proposed method (method I - trained for smaller foregrounds; method II - trained for bigger foregrounds). The results suggest a better overall F1 performance score of the trace segmentation for method I.

image

TABLE II: Some examples of video inference times achieved using an NVidia Tesla X GPU.

We proposed a method for the difficult task of real-time, real-world fast moving object detection and tracking. We overcame limitations of the previous works in this field, namely the long computation time and the difficulty of detecting small and very fast objects or objects crossing a background of similar color. We introduced a synthetic, physically plausible fast moving object sequence generator, which we utilize for network training. We showed how simple it is to adapt the generator to another type of foreground and then fine-tune the network, which allows us to detect foregrounds of different size and color.

In future work, we would like to focus on optimizing the processing pipeline with respect to speed, in order to achieve true real-time performance on high-resolution videos and to automatically track all kinds of sports balls in video streams. This can be further utilized in various applications such as instantaneous ball speed measurement or the detection of missed balls and balls out of bounds.

[1] D. Rozumnyi, J. Kotera, F. Sroubek, L. Novotny, and J. Matas, “The world of fast moving objects,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5203–5211.

[2] D. Rozumnyi, J. Kotera, F. Šroubek, and J. Matas, “Non-causal tracking by deblatting,” in German Conference on Pattern Recognition. Springer, 2019, pp. 122–135.

[3] J. Kotera and F. Šroubek, “Motion estimation and deblurring of fast moving objects,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2860–2864.

[4] J. Kotera, D. Rozumnyi, F. Sroubek, and J. Matas, “Intra-frame object tracking by deblatting,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.

[5] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, “Struck: Structured output tracking with kernels,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2096–2109, 2015.

[6] J. Zhang, S. Ma, and S. Sclaroff, “Meem: robust tracking via multiple experts using entropy minimization,” in European conference on computer vision. Springer, 2014, pp. 188–203.

[7] K.-W. Chen and Y.-P. Hung, “Multi-cue integration for multi-camera tracking,” in 2010 20th International Conference on Pattern Recognition. IEEE, 2010, pp. 145–148.

[8] S. Gladh, M. Danelljan, F. S. Khan, and M. Felsberg, “Deep motion features for visual tracking,” in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 1243–1248.

[9] H. Liu, Q. Hu, B. Li, and Y. Guo, “Long-term object tracking with instance specific proposals,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1628–1633.

[10] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016.

[11] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.

image

Fig. 5: YouTube real world ping pong sequences evaluation.

