• “On-blocks” is always first
• Class starts at the point when the swimmers are on the blocks at the start of the race
• The transition to the next class will be defined as the point when the swimmer is no longer touching the blocks
2. Diving
• Diving is always after “On-blocks”
• Defined as the period when the swimmer is in mid-air: no longer on the block, and not yet underwater or swimming
• The transition out of diving will be defined as the point when the entire swimmer becomes occluded by the water and splash of the dive entry
• In the case that a swimmer fails to completely submerge themselves, skip the underwater class completely and start annotating as swimming
3. Underwater
• Underwater can only happen after a turn or diving
• Defined as any point in the race when the swimmer is completely submerged, not touching a wall and not swimming
• The transition out of underwater will be defined as the point when the swimmer breaks the water with any part of their body to start swimming
• Do not annotate a swimmer who cannot be seen, i.e. when 90% of the swimmer is hidden due to the camera angle, lane ropes and refraction of the water
4. Swimming
• Swimming comes after underwater, diving or turning
• Defined as any point in the race when the swimmer is completing legal stroke cycles and not touching a wall
• The transition out of swimming into turning can occur on a touch turn or on a flip turn
• When performing a touch turn, turning commences when the swimmer touches the wall
• When performing a flip turn, the turn commences when the swimmer is on their front and the head is submerged due to the flip
• The transition out of swimming into finishing is when the swimmer touches the wall and the race has concluded
5. Turning
• Turning only happens after swimming
• Defined as any point in a race when the swimmer comes to a stop near enough to a wall in order to touch or push off the wall
• The transition out of turning to underwater is when the swimmer’s feet (or, possibly, the last body part in contact) leave the wall
• The point when a swimmer is completely straight can also signify the transition to underwater
• In the case that a swimmer fails to completely submerge themselves after a touch, skip the underwater class completely and start annotating as swimming
• A turning swimmer should always be boxed, since the swimmer is somewhere on the wall, unless the swimmer is cut off by the camera or camera angle
6. Finishing
• Finishing only happens after swimming
• Defined as any point after the conclusion of the race distance
• Finishing is always the final class of a swimmer
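The ordering rules above can be read as a small state machine, which is useful for checking a finished annotation track for impossible class sequences. The following is a minimal sketch under stated assumptions: the class identifiers, the `ALLOWED_NEXT` table and the function name are illustrative and not part of the original annotation tool, and the input is assumed to be the per-frame labels of one swimmer over a full race, with unannotated frames already removed.

```python
# Allowed class transitions, as read from the rules above.
# All identifiers here are hypothetical names chosen for illustration.
ALLOWED_NEXT = {
    "on_blocks": {"diving"},
    "diving": {"underwater", "swimming"},   # swimming if the swimmer never fully submerges
    "underwater": {"swimming"},
    "swimming": {"turning", "finishing"},
    "turning": {"underwater", "swimming"},  # swimming if the swimmer never fully submerges
    "finishing": set(),                     # always the final class
}


def first_invalid_transition(labels):
    """Return the index of the first label that breaks the ordering rules, or None."""
    if not labels:
        return None
    if labels[0] != "on_blocks":
        return 0  # "On-blocks" is always first
    for i in range(1, len(labels)):
        prev, cur = labels[i - 1], labels[i]
        # Staying in the same class is always fine; a change must be allowed.
        if cur != prev and cur not in ALLOWED_NEXT[prev]:
            return i
    return None


track = ["on_blocks", "diving", "underwater", "swimming",
         "turning", "underwater", "swimming", "finishing"]
print(first_invalid_transition(track))
```

A check of this kind only catches ordering mistakes; it says nothing about whether the frame chosen for each transition is the right one.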
Class Annotation Details
This section outlines how to assign a bounding box to each example of a swimmer. In general, the box must be the smallest possible box containing the entire swimmer, “except where the bounding box would have to be made excessively large to include a few additional pixels (5%)” (Everingham and Winn 2007). If 80% to 90% of a swimmer is cut off by the camera, do not give that swimmer a box. Put a box around any swimmer that can be identified in any way, unless the swimmer is cut off by the camera or camera angle. Because this rule becomes ambiguous in a variety of situations, general guidelines for specific classes follow.
“On-blocks” For swimmers in the farther lanes and behind other swimmers, add the tightest box possible around all visible parts of the swimmer. If only the tip of a foot is visible from behind another swimmer, for example, do not enlarge the box far beyond the visible majority of the swimmer. An example is the swimmer above the annotated swimmer in Figure 3a.
Swimming Stretch the box to include arms and feet. If the feet are not visible, center the foot end of the box on the splash produced by the kick.
Underwater When a swimmer is visible, create the smallest possible box that encompasses the swimmer (see Figure 3c). When a swimmer becomes too difficult to box accurately, do not annotate the swimmer (see the top three swimmers in Figure 3c).
Turning Create the smallest box that encompasses the swimmer. Do this for all swimmers, regardless of how much of the swimmer is occluded. If more than ninety percent of the swimmer is out of camera view, do not annotate the swimmer.
Figure 2: Examples of swimmer states
Finishing As a swimmer finishes, they generally look to the clock to see their time. As this happens, they transition from a horizontal body position to a vertical body position. Due to the refraction of the water and the bubbles formed by the swimmer, the body of the swimmer becomes invisible to the camera. Thus, a minimal box around what is visible is all that is required.
Diving It can be difficult to determine exactly which swimmer is being annotated. This is because the minimal box including the entirety of one swimmer could also require that the swimmer below is included. Create a minimal box around the swimmer being annotated even if this means the box created also includes a large portion of the other swimmer.
When annotating video, many frames are highly correlated, and thus are redundant for training a detection or tracking algorithm. For this reason, tests were performed to reduce the amount of redundant annotations collected. In these experiments, we examined how to limit annotation to avoid collecting redundant data while still capturing sufficient data variety to train a successful model.
Collecting Swimmer Data for Tracking
To illustrate the importance of efficiently and effectively collecting data, a simple and somewhat extreme example is considered. Consider a single race video of a fifteen-hundred freestyle, at thirty frames per second, with eight swimmers, and with a length of sixteen minutes (regarded as a slow men’s time on the world stage). If annotated in full, the video would yield more than 230,000 examples of swimmers, performing all six classes of swimming across its frames. Given the nature of a fifteen-hundred freestyle, it is obvious that these examples do not contain the right proportion of swimmer examples: at the very least, they will not contain all four strokes, their respective turns, or both genders of swimmers. There are other problems with collecting one single race, which are mentioned in the summary of swim race footage variability. Regardless, using a custom-built labelling system, annotations of swimmers required an average of two seconds per bounding box. This means labelling the entirety of the aforementioned video would take over five days of continuous work. Because of the high redundancy in these images, such annotation would be an inefficient use of time. The following describes the experiments conducted in order to find a better annotation procedure.
Table 1: The amount of collected data for each class
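The figures quoted for the fifteen-hundred freestyle example follow from simple arithmetic. The sketch below reproduces them, assuming (as the example implicitly does) that all eight swimmers are visible and boxed in every frame; the variable names are illustrative.

```python
# Worked arithmetic for the fifteen-hundred freestyle example.
race_minutes = 16      # length of the race video
fps = 30               # frames per second
swimmers = 8           # assumed visible in every frame
seconds_per_box = 2    # average labelling time per bounding box

frames = race_minutes * 60 * fps          # 28,800 frames
boxes = frames * swimmers                 # 230,400 swimmer examples
label_seconds = boxes * seconds_per_box   # 460,800 seconds of labelling
label_days = label_seconds / 86_400       # about 5.3 days of continuous work

print(f"{frames} frames, {boxes} boxes, {label_days:.2f} days")
```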
Extraction of Swimming Video Features
The method used to test which frames in race video to annotate and which frames to skip was as follows. Using footage found on USA Swimming’s YouTube page (Swim USA productions 2009), data was collected from a few videos of one competition, the 2019 TYR Pro Swim Series - Bloomington. All possible strokes, turns and dives were present in the data collected. One in every three frames of the footage was annotated, as suggested by Victor et al. (2017). An exception was made for footage containing diving, which was annotated frame by frame. This exception was due to the large amount of movement a dive contains and its short duration. The result was three-thousand frames of data with 25,000 examples of swimmers in various classes; the exact values can be seen in Table 1.
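The sampling rule described here (one frame in three, except that frames containing a dive are all kept) can be sketched as follows. This is only an illustration: it assumes the set of classes visible in each frame is already known from a coarse first pass, and the function and parameter names are hypothetical rather than taken from the labelling system used in the paper.

```python
def frames_to_annotate(frame_classes, stride=3, dense_class="diving"):
    """Pick the frame indices to annotate.

    frame_classes: list in which entry i is the set of classes visible in frame i.
    Every `stride`-th frame is kept, and every frame containing `dense_class`
    is kept regardless of the stride.
    """
    selected = []
    for i, classes in enumerate(frame_classes):
        if dense_class in classes or i % stride == 0:
            selected.append(i)
    return selected


# Two on-blocks frames, three dive frames, then four swimming frames.
clip = [{"on_blocks"}] * 2 + [{"diving"}] * 3 + [{"swimming"}] * 4
print(frames_to_annotate(clip))  # [0, 2, 3, 4, 6]
```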
After the collection of this data, multiple models were trained with different subsets of the collected data to find the amount and distribution of data that produces optimal results. Here, optimal means reducing the redundancy in the dataset while still obtaining detection results that are good enough. The first method of creating subsets was to randomly select a specified percentage of the three-thousand frames. The second method of subset creation was to randomly select a specified percentage of each class of the three-thousand frames. This method guaranteed that there would always be the same “percent of total” in all classes. Tests of the models using the second method’s data will show whether a certain class should have had more annotations in the initial collection phase.
Figure 3: Additional examples of swimmer states
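The two subset methods can be sketched as below. This is one possible reading of the description, not the authors' code: annotations are assumed to be `(frame_id, class_name)` pairs, the first method samples whole frames, and the second samples annotations within each class so that every class keeps the same fraction. The function names and the fixed seed are illustrative.

```python
import random
from collections import defaultdict


def subset_by_frame(annotations, fraction, seed=0):
    """Method 1: keep every annotation from a random fraction of the frames."""
    rng = random.Random(seed)
    frame_ids = sorted({frame for frame, _ in annotations})
    k = max(1, round(fraction * len(frame_ids)))
    keep = set(rng.sample(frame_ids, k))
    return [a for a in annotations if a[0] in keep]


def subset_by_class(annotations, fraction, seed=0):
    """Method 2: keep the same random fraction of the annotations of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann[1]].append(ann)
    subset = []
    for cls in sorted(by_class):
        items = by_class[cls]
        k = max(1, round(fraction * len(items)))
        subset.extend(rng.sample(items, k))
    return subset


# Toy data: 100 frames with one swimming example each, 10 of them also with a dive.
annotations = [(f, "swimming") for f in range(100)] + [(f, "diving") for f in range(10)]
by_frame = subset_by_frame(annotations, 0.2)
by_class = subset_by_class(annotations, 0.2)
print(len({f for f, _ in by_frame}), len(by_class))
```

With the first method a rare class such as diving can shrink or vanish at small percentages, whereas the second keeps its share fixed, which is what allows the two sets of results to be compared.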
The Darknet-53, YOLOv3-416 model (Redmon 2018a) and the Darknet-15, YOLOv3-tiny-416 model (Redmon 2018b) were considered for testing. Their results were found to be almost identical, so the following tests were completed with the tiny model. In total, fifteen models were trained with 1%, 2%, 5%, 10%, 25%, 50%, 75% and 100% of the data collected. For each test the models were given the exact same architecture and parameters; the only way they differed was in the datasets they were trained on. Their performance was tested against a dataset of five-hundred frames that were not used in training. Half of the test set was obtained from the same pool used for training and the other half was obtained from a different pool but with similar conditions. These pools are designated as Bloomington and Winter National, respectively. The performance was gauged using mean average precision (mAP) for each class; for more details, see Everingham et al. (2010). The mAP of tracking was also collected. Tracking disregards the classes, so the mAP value of tracking represents how good the model is at identifying the position of a swimmer in a pool.
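The difference between the per-class scores and the tracking score comes down to whether the class label takes part in matching a detection to a ground-truth box. The sketch below illustrates this with an intersection-over-union (IoU) test. The 0.5 threshold is the PASCAL VOC convention and is assumed here because the paper cites Everingham et al. (2010); the paper does not state the threshold it used, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, use_class=True, threshold=0.5):
    """pred and truth are (class_name, box) pairs.

    With use_class=False the class label is ignored, which corresponds to the
    class-agnostic "tracking" evaluation.
    """
    if use_class and pred[0] != truth[0]:
        return False
    return iou(pred[1], truth[1]) >= threshold


pred = ("swimming", (0, 0, 10, 10))
truth = ("turning", (0, 0, 10, 9))
print(is_match(pred, truth), is_match(pred, truth, use_class=False))  # False True
```

A detection that finds the swimmer but assigns the wrong class therefore counts against the class score and in favor of the tracking score.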
Figure 4: Results from the data collection test
Figure 4 shows a condensed breakdown of the results. The top plots represent testing of footage from the same pool, Bloomington. The bottom plots represent testing of footage from a different pool, Winter National. The x-axis represents the percent of data used for training and the y-axis represents the mAP value. Each line in a plot is either a different class or the tracking results. The plots on the left represent the first subset distribution and the plots on the right represent the second subset distribution.
The first thing to notice from this test is that, based on the top two graphs, using roughly twenty percent of the data collected was sufficient to produce comparable results. This corresponds to annotating one in every fifteen frames, and one in every five frames for diving. When the data is reduced below twenty percent, a steep decline in overall performance is observed.
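The conversion from "twenty percent of the collected data" to a frame interval can be checked directly. The sketch below assumes thirty frames per second, the rate used in the earlier fifteen-hundred freestyle example; the frame rate of the Bloomington footage itself is not stated, and the variable names are illustrative.

```python
# Convert "twenty percent of the collected data" into annotation intervals.
base_stride = 3    # one in every three frames was originally annotated
dive_stride = 1    # diving was originally annotated frame by frame
fraction = 0.20    # share of the collected data found to be sufficient
fps = 30           # assumed frame rate

stride = round(base_stride / fraction)        # every 15th frame
dive_interval = round(dive_stride / fraction) # every 5th frame for diving
rate = fps / stride                           # annotated frames per second
dive_rate = fps / dive_interval               # annotated frames per second, diving

print(stride, dive_interval, rate, dive_rate)  # 15 5 2.0 6.0
```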
The next thing to notice is the extremely poor performance of the model when predicting diving at the Winter National pool, and the less than optimal performance in the turning and underwater classes. These results are due to the difference in camera angles from one pool to the other, as can be seen in Figure 5. This could have been partially fixed by flipping the training images horizontally, but the camera angle between pools is different even with the horizontal flip. That being said, the swimming in Winter National was captured at roughly the same position as in Bloomington. This is confirmed in Figure 4, as the Bloomington and Winter National swimming plots have the same profile. This is in contrast to the rest of the Winter National plots, which have drastically different profiles from the Bloomington results.
Lastly, there seems to be no significant difference between classes in the amount of data at which the mAP value drops sharply. This might indicate that collecting annotations once every fifteen frames (once every five for diving) is a good enough approximation. However, this conclusion is based on a relatively small test set. If more insight is to be gained on the distribution of classes collected from swimming footage, more tests need to be conducted.
Figure 5: Difference between dive camera angles: Winter National (top) and Bloomington (bottom)
In this paper we presented the first step in a project for the automation of swimming analytics. We began construction of a dataset through identifying the important aspects of data collection and annotation. The results suggest that data can be efficiently collected from a video by annotating frames at two or three frames a second (six frames a second for diving). Such analysis provided validation that, under optimal circumstances, a detection system can exist. This experiment also gave a general intuition of how deep learning detection models such as Darknet (Redmon and Farhadi 2018) respond to swimmer data. Specifically, a lighter detection model such as Darknet-15 performs roughly the same as Darknet-53 for the detection of swimmers (Redmon 2018a; 2018b). Lastly, there is no reason to believe that a system such as this one should not work in a general sense, once given more training examples of swimmers in different competitions.
With the tools put forth by this paper we are able to begin the next steps in automating swimming analytics. We will use the procedure presented here to collect more data from a variety of sources, creating an annotated dataset for swimming. Next, we will build better tracking solutions incorporating swimmer dynamics, such as in (Bewley et al. 2016), and finally we will build metric collection solutions to automatically derive common swimming metrics such as stroke count and stroke length. The beauty of this work is that it is very modular and as such can be built upon once the groundwork has been completed. Upon completion, we are confident that this project will greatly help simplify the collection and use of swimming analytics, assisting coaches and athletes across all levels of swimming, and possibly even help to increase viewer interest in swimming.
Australian Sports Commission, A. 1981. Australian institute of sport. [Online], Available: https://ais.gov.au/ [Accessed: Jan. 1, 2020].
Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; and Upcroft, B. 2016. Simple online and realtime tracking. In Proc. IEEE ICIP, 3464–3468.
Canadian Sport Institute Pacific, C. 1999. Canadian sport institute pacific. [Online], Available: http://www.csipacific.ca/ [Accessed: Jan. 1, 2020].
Chan, K. L. 2013. Detection of swimmer using dense optical flow motion map and intensity information. Mach. Vision Appl. 24(1):75–101.
Elipot, M. 2019. A new paradigm to do and understand the race analyses in swimming: The application of convolutional neural networks. ISBS Proceedings Archive 37(1):455.
Everingham, M., and Winn, J. 2007. Pascal visual object classes challenge 2007 (voc2007) annotation guidelines. [Online], Available: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/guidelines.html [Accessed: Nov. 19, 2019].
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vision 88(2):303–338.
Form Swim, F. 2019. Form. [Online], Available: https://www.formswim.com/ [Accessed: Nov. 19, 2019].
Kalal, Z.; Mikolajczyk, K.; and Matas, J. 2011. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7):1409–1422.
Mooney, R.; Corley, G.; Godfrey, A.; Quinlan, L. R.; and ÓLaighin, G. 2016. Inertial sensor technology for elite swimming performance analysis: A systematic review. Sensors 16(1):18.
RaceTeck. 2018. Raceteck. [Online], Available: https://racetek.ca/ [Accessed: Nov. 19, 2019].
Ranjbar Alvar, S., and Bajić, I. V. 2018. MV-YOLO: motion vector-aided tracking by semantic object detection. In Proc. IEEE MMSP’18, 1–5.
Redmon, J., and Farhadi, A. 2018. YOLOv3: an incremental improvement. arXiv.
Redmon, J. 2018a. YOLOv3. [Online], Available: https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg [Accessed: Sept. 10, 2019].
Redmon, J. 2018b. YOLOv3-tiny. [Online], Available: https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-tiny.cfg [Accessed: Sept. 10, 2019].
Sha, L.; Lucey, P.; Morgan, S.; Pease, D.; and Sridharan, S. 2013. Swimmer localization from a moving camera. In Proc. Int. Conf. Digital Image Computing: Techniques and Applications (DICTA), 1–8.
Sha, L.; Lucey, P.; Sridharan, S.; Morgan, S.; and Pease, D. 2014. Understanding and analyzing a large collection of archived swimming videos. In Proc. IEEE Wint. Conf. Appl. Comput. Vision, 674–681.
Swim USA productions, S. U. 2009. Usa swimming. [Online], Available: https://www.youtube.com/user/USASwimmingOrg/featured [Accessed: Sept. 10, 2019].
Tritonwear. 2015. Tritonwear, take the guesswork out of swimming faster. [Online], Available: https://www.tritonwear.com/ [Accessed: Nov. 19, 2019].
Victor, B.; He, Z.; Morgan, S.; and Miniutti, D. 2017. Continuous video to simple signals for swimming stroke detection with convolutional neural networks. In Proc. IEEE CVPR Workshops, 66–75.
Zecha, D.; Greif, T.; and Lienhart, R. 2012. Swimmer detection and pose estimation for continuous stroke-rate determination. In SPIE Multimedia on Mobile Devices 2012; and Multimedia Content Access: Algorithms and Systems VI, volume 8304, 830410.