SKT-MOT and DyTracker: A Multiobject Tracking Dataset and a Dynamic Tracker for Speed Skating Video

Wang, Junwu; Li, Zongmin; Li, Yachuan; Yang, Shaobo; Wang, Ben; Li, Hua

doi:https://doi.org/10.1155/2023/3895703

Scientific Programming


Research Article | Open Access

Volume 2023 | Article ID 3895703 | https://doi.org/10.1155/2023/3895703


Academic Editor: Jianping Gou

Received 17 Apr 2023

Revised 09 Sept 2023

Accepted 20 Sept 2023

Published 18 Oct 2023

Abstract

Speed skating serves as a significant application domain for multiobject tracking (MOT), presenting unique challenges such as frequent occlusion, highly similar appearances, and motion blur. To address these challenges, this paper constructs an MOT dataset called SKT-MOT for speed skating and analyzes the shortcomings of existing datasets and methods. Accordingly, we propose a dynamic MOT method called DyTracker. The method builds upon the DeepSORT baseline and enhances three key modules. At the global level, we design the track dynamic management (TDM) algorithm. In the motion branch, a novel metric is proposed to evaluate occlusion and Kalman filter dynamic update (KFDU) is implemented. In the appearance branch, we account for the difference in human posture and propose the feature dynamic selection and updating (FDSU) strategy. This makes our DyTracker flexible and efficient to achieve a multiobject tracking accuracy (MOTA) of 93.70% and identification F1 (IDF1) score of 92.39% on SKT-MOT, which is a significant advantage over existing SOTA methods. To validate the generalization of our proposed module, two dynamic update modules are inserted into other methods and validated on the public dataset MOT17, and the accuracy is generally improved by 0.2%–0.6%.

1. Introduction

Speed skating is of significant importance as a prominent winter Olympics event, with substantial influence worldwide. The application of multiobject tracking (MOT) technology to provide supplementary data analysis for speed skaters holds practical significance. Tracking speed skaters presents a distinctive case within MOT, entailing numerous unique challenges. This paper aims to enable MOT to be efficiently completed in speed skating scenarios.

MOT is a classic problem in computer vision that aims to identify and track all objects of a specific category in a video. Early methods [1, 2] relied mainly on handcrafted features to compute the similarity between frames and achieve object association. With the development of deep learning, methods based on deep neural networks have gradually become mainstream. SORT [3] was the first method to apply object detection networks to MOT, completing the association task through the Kalman filter (KF) [4] and the Hungarian matching algorithm [5]. It was also the first framework in the tracking-by-detection (TBD) paradigm. DeepSORT [6] introduced a reidentification module on this basis, so that tracking is completed jointly using appearance features and motion cues. It is the most widely used method in the industry. As MOT developed, some methods [7–11, 21, 22] integrated these modules into a unified network to reduce the inference time and attempted to achieve end-to-end tracking. These methods belong to the joint detection and tracking (JDT) paradigm. Both paradigms have made significant progress in recent years.

As MOT technology matures, its applications become increasingly widespread. In certain competitive sports fields [12–14, 20], MOT technology is widely used to guide the training and competition of athletes. However, in the speed skating scene, the development of MOT is relatively slow, mainly due to a lack of data and unique challenges in speed skating. To this end, we first construct SKT-MOT, an MOT dataset for short-track speed skating, consisting of 56 video clips with three scenes and a total of 53,178 images. Based on this, we also developed object detection and re-ID datasets for speed skaters. Additionally, we analyze the unique challenges and advantages of speed skating scenarios. These challenges include: (1) Frequent occlusions between athletes. (2) Athletes dress similarly or even identically. (3) Speed skating is fast and prone to motion blur.

These difficulties hinder the efficient completion of MOT tasks. However, the speed skating scene also has advantages: the number of athletes is relatively small and fixed, and the environment is relatively clean.

Aiming at these advantages and challenges, we propose DyTracker, an MOT method that builds on the DeepSORT baseline and improves three modules: (1) track dynamic management (TDM), which employs a dynamic track management algorithm to overcome the influence of false detections and keep the number of tracks stable; (2) Kalman filter dynamic update (KFDU), which evaluates the degree of occlusion of each athlete and updates the KF dynamically, improving the robustness of the KF against occlusion; and (3) feature dynamic selection and updating (FDSU), which analyzes the deficiencies of traditional association methods under highly similar appearance and detection noise and proposes a dynamic matching and updating strategy based on differences in posture and detection quality.

In summary, the main contributions of this paper are as follows: (1) We constructed SKT-MOT, an MOT dataset for speed skating, to compensate for the lack of data. (2) We analyzed the unique advantages and challenges of the speed skating scene and proposed a dynamic MOT method, DyTracker. (3) We carried out extensive experiments on the SKT-MOT and MOT17 [16] datasets to verify the effectiveness and generalization of the proposed method and modules.

The paper is organized as follows: Section 1: introduction, Section 2: related work, Section 3: SKT-MOT dataset, Section 4: DyTracker, and Section 5: experiment and discussion, followed by conclusions. In the appendix, we list the specific meanings of the abbreviations in the article.

2. Related Work

2.1. MOT Datasets

In various scenarios, numerous datasets for MOT have emerged, as shown in Table 1. Alongside the dataset we have proposed, several existing datasets also concentrate on human tracking. For example, PETS2009 [18] and TUD [19] are early pedestrian tracking datasets, albeit with relatively small scales. To form larger-scale datasets, MOT15 [15] integrates these early pedestrian datasets. MOT17 [16] further enriches pedestrian tracking by expanding to new scenes and dynamic perspective. MOT20 [17] increases the difficulty of tracking by increasing pedestrian density. In recent years, human tracking datasets have emerged in complex scenarios, such as DanceTrack [35] in dance scenes and SoccerNet-Tracking [12] in soccer scenes, which significantly contribute to the advancement of MOT in human tracking.

In addition to human tracking, other datasets have been proposed for various object types. In the field of autonomous driving, there exists a dataset called KITTI [36] that specifically focuses on vehicle tracking, representing the earliest large-scale MOT dataset in this domain. Additionally, BDD100K [37] and KITTI360 [38] further expand vehicle tracking data. CTMC [39] is dedicated to tracking biological cells, while TAO [40] focuses on multicategory tracking, annotating 833 target categories, significantly enriching the content of MOT.

2.2. MOT Methods

Most MOT methods can be categorized into TBD and JDT, as shown in Table 2. TBD [3, 6, 23, 24, 41] involves three independent components:

(1) An existing object detector to generate detection boxes for each frame.
(2) A re-ID embedding model used to extract the appearance features of objects.
(3) A tracker to associate objects based on their motion cues or appearance features.

TBD is a flexible framework, with each component able to be replaced, giving it high generalization and suitability for complex scenes. However, it has the drawback of being time-consuming during inference.

Instead, JDT [7–11, 21, 22] incorporates several components into a unified network, reducing the inference time. Typically, JDT builds on a detector, either fusing a tracker into it or adding a feature extraction branch. In recent years, this paradigm has become mainstream. However, because of conflicts between the modules, achieving a global optimum under the JDT paradigm is difficult.

In recent years, significant advancements have been made in both paradigms. DeepSORT [6] represents the classic method within the TBD paradigm, leveraging motion and appearance as the two primary target features to accomplish the tracking task through Hungarian matching. StrongSORT [41] enhances DeepSORT by incorporating more powerful components. ByteTrack [23], in pursuit of faster speed, utilizes only motion cues for data association. In addition, it incorporates low-scoring detection frames into the association process, significantly reducing missed detections. On this basis, OC-SORT [25] corrects the accumulation of errors in Kalman filtering and introduces the directional consistency metric, which effectively improves robustness to occlusion; BoT-SORT [24] introduces camera motion compensation while adjusting the state parameters of the KF.

Among JDT methods, JDE [9] and FairMOT [10] integrate a feature extraction branch into the original detector, unifying the detector and the feature extraction model. CTracker [44] proposes a chain tracking framework based on two-frame input, transforming the data association problem into a pairwise bounding box regression problem. SCT [45] chains them together using IoU, KF, and binary matching and introduces attention to better extract features. CenterTrack [21] follows this two-frame input framework and borrows the idea of using points to represent objects from CenterNet [42]. It directly predicts the offset of the target between frames to achieve data association. TraDeS [8] constructs a global similarity matrix to predict this offset while simultaneously correcting the detection and segmentation results of the current target. The transformer-based MOTR [11] introduces a novel concept called the track query. Each track query models the complete track of a target and is transferred and updated from frame to frame, thereby achieving end-to-end tracking. Recently, Unicorn [43] and OmniTracker [46] presented unified frameworks that use a single network to simultaneously address four tracking tasks: single object tracking (SOT), MOT, video object segmentation (VOS), and multiobject tracking and segmentation (MOTS).

2.3. Motivation

Section 2.1 presents several human tracking datasets; however, most of them [15–19] predominantly focus on urban street scenes and indoor environments, while sports scenes are relatively scarce. Moreover, these datasets have certain limitations: (1) They exhibit simple motion patterns, primarily slow and linear motion. (2) The objects in these datasets have significant appearance differences, making them easily distinguishable.

These limitations have somewhat hindered the development of MOT. To address this gap, we proposed SKT-MOT, which provides new data for sports scenes, breaks through existing limitations, and poses new challenges to MOT.

In Section 2.2, we discuss different method types and mainstream approaches. Considering the challenges associated with optimizing the JDT paradigm and the lack of speed skating data, we followed the TBD paradigm. This paradigm allows us to independently train the detector and utilize additional detection data, making it easier to achieve higher accuracy.

The current mainstream TBD framework completes the association using two major cues: motion and appearance. Some methods use only motion cues in pursuit of speed, but the smaller number of individuals in the speed skating scene means less inference time, so we chose DeepSORT [6], which utilizes both cues, as the baseline. DeepSORT is, in fact, not a novel approach. However, it can still perform well when equipped with a powerful detector and an appropriate association strategy, as verified by this paper and StrongSORT [41].

However, the accuracy is generally not high when mainstream methods [6, 7, 9, 10, 23] such as DeepSORT are applied to speed skating scenes. This could be attributed to the fact that existing methods are constrained by the dataset limitations and struggle to handle frequent occlusions, motion blur, and similar clothing between skaters. In addition, we investigated existing MOT methods for speed skating scenes but found only one work, LocalSort [47], which designs a local matching measurement method for occlusion problems but does not take similarities in appearance and motion blur into account. To this end, we have performed a comprehensive analysis of the impact of these challenges and designed an efficient dynamic tracker to enhance tracking performance in speed skating scenes.

3. SKT-MOT

3.1. Dataset Construction

The SKT-MOT dataset comprises 56 videos of short-track speed skaters in daily training, recorded at a frame rate of 30 fps and a resolution of 1,920 × 1,080. Thirty-six videos were selected as the training set, 10 as the validation set, and 10 as the test set. The videos were taken in two speed skating scenes, at Beijing Capital Indoor Stadium and at the Ice and Snow Sports Base of Beijing Sport University; 44,402 and 8,776 images, respectively, were labeled in the two scenes with LabelMe [26]. The labeling information includes the athlete’s identification and the bounding box. For fully occluded objects, the ID is kept consistent before and after occlusion. The basic information of SKT-MOT is shown in Table 3.

3.2. Dataset Analysis

We quantitatively analyzed the clothing similarity between athletes, as shown in Figure 2. The results indicate a high degree of similarity between athletes’ appearance color. In terms of the motion pattern, we analyze the trajectory and speed changes, as shown in Figure 3. The speed skating trajectory showed a unique insole shape, which differs significantly from the general pedestrian trajectory. The speed changes exhibit ups and downs, and the average speed is high at 10 m/s, making it prone to motion blur. Moreover, due to the intense competition in short-track speed skating, athletes frequently exchange positions, resulting in frequent occlusion occurring in a single view. These unique issues pose new challenges for MOT.


4. DyTracker

DyTracker (Dynamic Tracker) is an efficient MOT method for speed skating scenes, and Figure 4 illustrates our DyTracker built upon the TBD paradigm. It improved DeepSORT [6] with TDM, KFDU, and FDSU modules.

4.1. Preview DeepSORT

The DeepSORT algorithm is a two-branch framework consisting of a motion branch and a feature branch, where the detection results are fed into both branches frame-by-frame to complete the matching and updating process.

4.1.1. Matching

In the motion branch, the KF [4] predicts the state of the trajectory (box position and scale, etc.) in the current frame. The correlation between the predicted state of the trajectory and the newly input detection information is computed using the Mahalanobis distance [28]:

$$d^{(1)}(i, j) = (d_j - y_i)^{T} S_i^{-1} (d_j - y_i),$$

where $d_j$ is the newly input $j$th detection information, $y_i$ is the predicted state of the $i$th trajectory, and $S_i$ is a covariance matrix.

In the feature branch, a reidentification module is used to extract the appearance feature of the newly input detection. Furthermore, it uses a feature gallery to store the features of the latest 100 frames for each trajectory and integrates them as the trajectory feature in frame $k$. Then, the feature similarity is measured by the minimum cosine distance:

$$d^{(2)}(i, j) = \min \left\{ 1 - r_j^{T} r_k^{(i)} \mid r_k^{(i)} \in R_i \right\},$$

where $r_j$ is the feature of the newly entered $j$th detection, $r_k^{(i)}$ is the $i$th trajectory feature in frame $k$, and $R_i$ is the feature gallery of the $i$th trajectory.

The two distances mentioned above are used together to construct a similarity matrix. On the basis of Hungarian matching [5], a cascade matching strategy is proposed for a two-round matching process. The first round depends on the similarity matrix, and the second round uses a simple IoU.
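The matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DeepSORT implementation: the covariance is taken to be diagonal, the helper names (`mahalanobis_sq_diag`, `min_cosine_distance`, `build_cost_matrix`, `best_assignment`) and the dictionary keys are hypothetical, the two distances are fused by a plain weighted sum with an assumed appearance weight `w_app`, and an exhaustive search over permutations stands in for Hungarian matching, which is only practical because a speed skating scene contains few athletes.

```python
import itertools
import math


def mahalanobis_sq_diag(detection, predicted, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((d - p) ** 2 / v for d, p, v in zip(detection, predicted, variances))


def cosine_distance(a, b):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def min_cosine_distance(det_feature, gallery):
    """Smallest cosine distance between a detection and any gallery feature."""
    return min(cosine_distance(det_feature, f) for f in gallery)


def build_cost_matrix(tracks, detections, w_app=0.98):
    """cost[i][j] fuses appearance and motion distance for track i, detection j.

    ASSUMPTION: a plain weighted sum; w_app is the weight of the appearance cost.
    """
    cost = []
    for trk in tracks:
        row = []
        for det in detections:
            d_motion = mahalanobis_sq_diag(det["box"], trk["pred"], trk["var"])
            d_app = min_cosine_distance(det["feat"], trk["gallery"])
            row.append(w_app * d_app + (1.0 - w_app) * d_motion)
        cost.append(row)
    return cost


def best_assignment(cost):
    """Exhaustive minimum-cost assignment (a stand-in for Hungarian matching).

    Requires at least as many detections (columns) as tracks (rows).
    Returns a list of (track_index, detection_index) pairs.
    """
    n_tracks, n_dets = len(cost), len(cost[0])
    best = min(
        itertools.permutations(range(n_dets), n_tracks),
        key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)),
    )
    return list(enumerate(best))
```

In a real tracker, the exhaustive search would be replaced by a proper Hungarian solver, and the second matching round based on IoU would handle whatever remains unmatched.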

4.1.2. Updating

After the matching is completed, in the motion branch, the KF performs a state update, fuses the detection and prediction values to generate the final correction result, and updates the relevant parameters; in the feature branch, newly matched object features are inserted into the feature gallery to complete the feature update.

4.2. TDM

TDM focuses on a characteristic of speed skating: once speed skating begins, athletes rarely disappear from the video or join halfway through, so the number of tracks in a video is relatively fixed. However, false detections can easily disrupt the stability of the number of tracks, as shown in Figure 5. Based on this, we designed a dynamic management module for the number of tracks, as shown in Algorithm 1, to keep the number of tracks stable and improve the robustness to false detection.

Input: Video length T;
  Number of tracks N;
  N ← initial frame tracks number;
  state ← instability;
for frame t = 1 to T do
  if state is instability then
    if presence of n detections not matched to any track and their confidence is high enough then
      Generate n new tracks;
      N ← N + n;
    else if presence of m tracks not matched to any detection for fifteen consecutive frames then
      Delete these m tracks;
      N ← N − m;
    else if N has no change for fifteen consecutive frames then
      state ← stability;
    end
  else if state is stability then
    if presence of n detections not matched to any track for fifteen consecutive frames then
      Generate n new tracks;
      N ← N + n;
    else if presence of m tracks not matched to any detection for fifteen consecutive frames then
      Delete these m tracks;
      N ← N − m;
    else
      N remains unchanged;
    end
  end
end

Algorithm 1: TDM.
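The state logic of Algorithm 1 can be sketched as a small Python class. This is an illustrative sketch rather than the authors' code: the class name `TrackCountManager` and its arguments are hypothetical, and the caller is assumed to have already counted, for the current frame, the confident unmatched detections, the detections that stayed unmatched for fifteen frames, and the tracks that stayed unmatched for fifteen frames. The sketch also assumes that the state never returns to instability once it is stable, since Algorithm 1 does not describe such a transition.

```python
from dataclasses import dataclass

PATIENCE = 15  # "fifteen consecutive frames" in Algorithm 1


@dataclass
class TrackCountManager:
    """Keeps the number of tracks stable, following the two states of Algorithm 1."""

    num_tracks: int
    stable: bool = False
    unchanged_frames: int = 0

    def step(self, confident_new=0, persistent_new=0, stale=0):
        """Process one frame and return the updated number of tracks.

        confident_new:  unmatched detections with high confidence in this frame
        persistent_new: detections left unmatched for PATIENCE consecutive frames
        stale:          tracks left unmatched for PATIENCE consecutive frames
        """
        before = self.num_tracks
        if not self.stable:
            # Unstable state: confident detections may start tracks at once.
            if confident_new > 0:
                self.num_tracks += confident_new
            elif stale > 0:
                self.num_tracks -= stale
            if self.num_tracks == before:
                self.unchanged_frames += 1
                if self.unchanged_frames >= PATIENCE:
                    self.stable = True
            else:
                self.unchanged_frames = 0
        else:
            # Stable state: a detection must persist before it becomes a track.
            if persistent_new > 0:
                self.num_tracks += persistent_new
            elif stale > 0:
                self.num_tracks -= stale
        return self.num_tracks
```

Once the manager is stable, a single confident but spurious detection no longer changes the track count, which is the robustness to false detection that TDM aims for.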

4.3. KFDU

In the motion branch, the KF [4] takes the position and scale of each detection as input. However, frequent occlusion makes this information inaccurate, which in turn degrades the accuracy of the KF. To address this problem, KFDU proposes a metric to evaluate the degree of occlusion and updates the KF dynamically according to this metric, improving the robustness to occlusion.

4.3.1. Evaluate Occlusion

The evaluation of an athlete’s occlusion is often based on the detection confidence, but this criterion is not specific enough. Confidence is jointly determined by object classification and location accuracy. Since the KF relies on location information as input, and since motion blur interferes with the confidence, location information should be given more weight. Figure 6 shows how motion blur affects detection confidence, making occlusion assessment less reliable. To address these issues, we propose an adjustment factor computed from the IoU and the distance between the central points (similar to DIoU) of the current detection box and each of the other boxes. The maximum value of this factor is used to adjust the detection confidence and obtain the final occlusion metric, which weakens the effect of motion blur and enriches the location information. Here, IoU is the degree of overlap between two boxes, the distance term is the distance between the central points of the current detection box and another box, considered together with the diagonal length of their smallest enclosing rectangle, and the confidence is the detection confidence.

In this paper, no further distinction is made between the occluder and the occluded, both of whom should receive less trust compared to athletes without occlusion. In addition, detection confidence can distinguish them to some extent.
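To make the idea of the occlusion metric concrete, the following Python sketch computes a DIoU-style factor between one box and all other boxes and uses its maximum to scale down the detection confidence. The DIoU term itself (IoU minus the squared center distance over the squared enclosing diagonal) is the standard definition. The way the factor is combined with the confidence, `conf * (1 - factor)` with the factor clamped to [0, 1], is an assumption made for illustration only and is not taken from the paper; the function names are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """IoU minus the squared center distance over the squared enclosing diagonal."""
    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    # Diagonal of the smallest rectangle enclosing both boxes.
    ex = max(a[2], b[2]) - min(a[0], b[0])
    ey = max(a[3], b[3]) - min(a[1], b[1])
    c2 = ex ** 2 + ey ** 2
    return iou(a, b) - (rho2 / c2 if c2 > 0 else 0.0)


def occlusion_metric(box, other_boxes, conf):
    """Higher values mean a clearer (less occluded) detection.

    ASSUMPTION: the confidence is scaled by (1 - max DIoU), clamped to [0, 1].
    """
    if not other_boxes:
        return conf
    factor = max(diou(box, other) for other in other_boxes)
    factor = min(max(factor, 0.0), 1.0)  # distant boxes give a negative DIoU
    return conf * (1.0 - factor)
```

With this convention, an isolated athlete keeps the original confidence, while an athlete whose box heavily overlaps a nearby box receives a lower value even if the detector is confident.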

4.3.2. Kalman Filter Dynamic Update

KFDU retains the KF state prediction step and improves the state update step. The KF process is shown in Figure 7. In the update step, the observation noise covariance $R$ reflects the observation uncertainty: a smaller observation noise means that the observation is more trustworthy. However, in the standard KF algorithm, $R$ is a constant matrix, which gives the same trust to observations of different qualities, whereas it should be dynamic. In other words, when the object is heavily occluded, we should weaken the observation and give more trust to the prediction; for high-quality observations, we should give more trust to the observation. KFDU is shown in Algorithm 2. Specifically, the occlusion metric is used to measure the observation quality and dynamically adjust the measurement noise covariance $R$. This gives the KF a dynamic trust for different observations.
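The effect of a dynamic observation noise can be illustrated with a scalar Kalman update in Python. The gain, state, and covariance equations are the standard KF update. The rule used to rescale the noise, `r_dyn = (1 - metric) * r`, is only an assumed example of "less noise for clearer observations" and is not taken from the paper; the function name and the small floor on `r_dyn` are likewise illustrative choices.

```python
def kf_dynamic_update(x_pred, p_pred, z, r, metric, h=1.0):
    """Scalar Kalman update whose observation noise depends on observation quality.

    x_pred, p_pred: predicted state and its variance
    z:              observation
    r:              baseline observation noise variance
    metric:         observation quality in [0, 1]; higher means clearer
    h:              observation model (scalar)
    """
    # ASSUMPTION: clearer observations get proportionally less noise.
    r_dyn = max((1.0 - metric) * r, 1e-9)
    # Standard Kalman gain with the rescaled noise.
    k = p_pred * h / (h * p_pred * h + r_dyn)
    # Fuse the observation and the predicted state.
    x_new = x_pred + k * (z - h * x_pred)
    # Update the state variance.
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new
```

For the same prediction and observation, a clear detection pulls the state closer to the observation than an occluded one, for example `kf_dynamic_update(0.0, 1.0, 10.0, 1.0, 0.9)` moves further toward 10 than the same call with a metric of 0.1.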

4.4. FDSU

The appearance branch mainly includes feature similarity matching and feature updating. In the matching step, we considered that the athletes in the speed skating scene are dressed similarly, but they have differences in their postures. Therefore, we proposed a similarity-matching method that dynamically selects these two features (FDS). In the update step, occlusion and motion blurring produce low-quality detections and pollute the feature gallery, for which we proposed a dynamic feature update strategy (FDU).

4.4.1. Feature Dynamic Selection (FDS)

The existing human tracking datasets [16–19] show large differences in clothing but similar postures, so traditional matching methods ignore posture information and rely solely on differences in appearance color, using historical features for the association. In the speed skating scene, however, the opposite holds, which means that traditional matching methods do not apply. To address this, we considered the differences in posture between athletes and the instantaneous invariance of posture, and argued that matching using only adjacent-frame features can also be efficient for tracking. Figure 8 illustrates this point.

Figure 8

Comparison of athlete posture changes. The horizontal axis is time, the vertical axis is the athlete’s ID, and the four athletes are taken from the same video. The figure shows that, in the speed skating scene, (1) due to differences in the position and habits of athletes, there are apparent differences in posture among them at the same time; (2) the posture shows instantaneous invariance (i.e., high similarity of the same athlete between adjacent frames) due to motion inertia; and (3) because the athletes wear similar clothing, appearance color may not be as reliable.

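Input: Observation $z$
  Observation noise covariance $R$
  Measured occlusion degree
  Predicted state $\hat{x}^{-}$
  Predicted state covariance $P^{-}$
  The observation model $H$
Output: Updated state $\hat{x}$
  Updated state covariance $P$
Step:
1  Rescale $R$ according to the measured occlusion degree to obtain $\tilde{R}$
//Dynamically updating the observation noise covariance
2  $K = P^{-} H^{T} (H P^{-} H^{T} + \tilde{R})^{-1}$
//Calculating the corrected Kalman gain
3  $\hat{x} = \hat{x}^{-} + K (z - H \hat{x}^{-})$
//Based on K, fusing the observation and the predicted state
4  $P = (I - K H) P^{-}$
//Updating the state covariance

Algorithm 2: KFDU.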

Specifically, we reduced the weighting of historical features and gave more consideration to proximity features in order to increase the weighting of posture features. Additionally, on top of the existing gallery (color gallery), FDS adds a posture gallery that only stores the features of adjacent frames. For athletes with clear postures, we select the posture gallery for similarity matching; otherwise, we use the color gallery. Whether the athlete’s posture is clear is judged by the occlusion metric described in Section 4.3.1. If the metric exceeds the threshold of 0.9, the posture is considered clear; otherwise, it is considered blurry. With this strategy, athletes can dynamically select the appropriate feature gallery for similarity matching.
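The selection rule can be sketched as follows. The threshold of 0.9 comes from the text; everything else is an assumption for illustration: the function names are hypothetical, galleries are plain lists of feature vectors, the distance is the minimum cosine distance used in the feature branch, and the sketch falls back to the color gallery when the posture gallery is still empty.

```python
import math

CLEAR_THRESHOLD = 0.9  # threshold on the occlusion metric, as stated in the text


def cosine_distance(a, b):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_gallery(metric, posture_gallery, color_gallery, threshold=CLEAR_THRESHOLD):
    """Use the posture gallery for clear athletes, the color gallery otherwise.

    ASSUMPTION: fall back to the color gallery if the posture gallery is empty.
    """
    if metric > threshold and posture_gallery:
        return posture_gallery
    return color_gallery


def appearance_distance(det_feature, metric, posture_gallery, color_gallery):
    """Minimum cosine distance to the dynamically selected gallery."""
    gallery = select_gallery(metric, posture_gallery, color_gallery)
    return min(cosine_distance(det_feature, f) for f in gallery)
```

A detection with a metric of 0.95 is therefore compared only with adjacent-frame features, whereas a blurred or occluded detection with a metric of 0.5 is compared with the longer feature history.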

4.4.2. Feature Dynamic Update (FDU)

In the feature update, DeepSORT [6] builds a gallery of features for each trajectory and inserts new features into it to perform the update, which wastes considerable space and time. JDE/FairMOT [9, 10] improved this approach with an exponential moving average (EMA) feature update strategy in which only one feature state is maintained per trajectory; this saves resources and is currently the dominant feature update solution. However, this approach has flaws. Occlusion and motion blur increase detection noise, so detections differ in quality, yet the EMA strategy treats detections of different qualities equally. This process should instead be dynamic: high-quality features should be retained with greater weight, while low-quality detections should be ignored. Specifically, we introduce the detection confidence to reflect detection quality and dynamically adjust the momentum term, which achieves dynamic updating of the EMA. The update thus depends on the detection confidence, the feature state of the trajectory, the appearance embedding of the new detection, and the original static momentum term.
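A confidence-aware EMA can be sketched as below. This is one plausible form and not the paper's formula: the sketch assumes that the momentum is the weight given to the new embedding and that this weight is simply multiplied by the detection confidence, so that a detection with zero confidence leaves the feature state untouched. The function name `dynamic_ema_update` and the default momentum are illustrative.

```python
def dynamic_ema_update(state, embedding, confidence, momentum=0.65):
    """EMA feature update whose step size shrinks for low-quality detections.

    state:      current feature state of the trajectory (list of floats)
    embedding:  appearance embedding of the new detection (same length)
    confidence: detection confidence in [0, 1]
    momentum:   static weight of the new embedding (ASSUMED convention)
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # ASSUMPTION: the effective weight of the new embedding is momentum * confidence.
    w_new = momentum * confidence
    return [(1.0 - w_new) * s + w_new * e for s, e in zip(state, embedding)]
```

Under this convention, a momentum of 1 with a fully confident detection replaces the state by the newest embedding, which matches the idea of a gallery that only reflects adjacent frames.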

4.5. Complexity Analysis

The TDM algorithm controls the variation in the number of trajectories to address false and missed detections effectively, keeping the trajectory library clean. In terms of time complexity, assuming a video length of $T$ frames, the algorithm performs one or two judgments per frame in a single loop, so the time complexity is $O(T)$. In terms of space complexity, the algorithm only needs to store two scalar variables (the total number of trajectories and its state) and two timers, which are updated in place without keeping overwritten data, so its space complexity is $O(1)$.

KFDU largely inherits the original KF and adds only a two-step operation in the update step, so its time complexity is consistent with that of the KF: with one iteration per frame, the cost is dominated by the matrix operations performed in each iteration. In terms of space complexity, the state vector and error covariance matrix at each moment need to be stored, and the size of these matrices grows with the square of the number of observation data $n$, so the space complexity is $O(n^2)$.

Like DeepSORT, DyTracker is difficult to analyze as a whole because its overall complexity is affected by multiple modules. Therefore, we mainly compare DyTracker with DeepSORT here. TDM and KFDU have been analyzed above. For the FDSU module, in terms of time complexity, it mainly adds one judgment and two numerical operations, so the added time can be ignored; in terms of space, it adds a gallery for posture information, but this gallery only stores information from adjacent frames, so the added space cost is not significant. Overall, compared with DeepSORT, DyTracker adds little time and space consumption, while the performance gain is significant.

4.6. Datasets and Metrics

4.6.1. Datasets

We conduct experiments on the SKT-MOT and MOT17 [16] datasets. SKT-MOT is the short-track speed skating dataset proposed in this article; details are given in Section 3. For the detection and re-ID modules, we converted the data into the formats of the COCO [32] and MARS [33] datasets, respectively, and divided the dataset in a 7 : 2 : 1 ratio. MOT17 is a popular MOT dataset consisting of seven sequences (5,316 frames) for training and seven sequences (5,919 frames) for testing. For ablation studies, we take the first half of each sequence in the MOT17 training set for training and the second half for validation.

4.6.2. Metrics

The evaluation of tracking performance is mainly based on multiobject tracking accuracy (MOTA), identification F1 (IDF1) score, and multiobject tracking precision (MOTP).

MOTA is an evaluation metric for MOT algorithms that focuses on tracking accuracy. It is calculated on the basis of false positives (FP), false negatives (FN), and identification switches (IDSW):

$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}}{\mathrm{GT}},$$

where FP represents false detections, FN represents missed detections, IDSW counts the number of identity switches of an object, and GT is the number of ground-truth objects. Despite its limitations and criticisms, it is still the most widely accepted evaluation metric for MOT.

IDF1 is another important metric in MOT that evaluates the precision of object identification. It responds more to the accuracy of ID matching:

$$\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}.$$

Here, identification true positive (IDTP) stands for the correctly identified object, identification false positive (IDFP) stands for the incorrectly identified object, and identification false negative (IDFN) stands for the unidentified identity information. MOTP, which measures the overlap between the resulting bounding box and ground truth, describes the localization precision of the object.
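For reference, the two headline metrics can be computed from their standard definitions, MOTA = 1 − (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN), where GT is the total number of ground-truth objects. The short Python sketch below only encodes these formulas; the function names are illustrative, and the counts themselves are assumed to come from an evaluation toolkit that has already matched predictions to ground truth.

```python
def mota(fn, fp, idsw, gt):
    """Multiobject tracking accuracy from error counts and ground-truth count."""
    if gt <= 0:
        raise ValueError("gt must be positive")
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp, idfp, idfn):
    """Identification F1 score from identity-level true/false positives and negatives."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no identity matches or errors to evaluate")
    return 2 * idtp / denom
```

For example, 50 missed detections, 30 false detections, and 20 identity switches over 1,000 ground-truth objects give a MOTA of 0.9. Note that MOTA can become negative when the number of errors exceeds the number of ground-truth objects.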

5. Experiments and Discussion

5.1. Experimental Details

For detection, we use YOLOv5-x [29] pretrained on the COCO dataset, introduce DIoU-NMS, change the localization loss to the CIoU loss, and keep the original training schedule. For re-ID feature embedding, the re-ID module [34] of DeepSORT is used, trained with the Adam optimizer [30] at an initial learning rate of 0.1. For DyTracker, a threshold of 0.65 is set for nonmaximum suppression (NMS) and a threshold of 0.7 for detection confidence. The minimum feature distance threshold is 0.2, and the momentum terms in the color gallery and the posture gallery are 0.65 and 1, respectively. The weight factor for the appearance cost is 0.98.

All experiments are conducted on a server with two 12 GB 2080Ti GPUs.

5.2. Comparative Experiment

We compared DyTracker with state-of-the-art methods on the SKT-MOT dataset; Table 4 lists the detailed results. DyTracker achieves the highest MOTA of all methods, 93.70%. It has a significant advantage over JDT methods [7, 9, 10], and it also improves on comparable TBD methods [6, 23] that use the same detector. DyTracker also performs best in MOTP and IDF1, which confirms that it achieves more accurate object localization, completes association more effectively, and tracks speed skaters better. In addition, our method shows significant advantages over LocalSORT, which is also designed for speed skating scenarios. Limited by the two-stage framework, the FPS of our method is mediocre. Figure 9 visualizes the DyTracker tracking results on SKT-MOT.

Compared to our baseline DeepSORT, DyTracker improves MOTA and IDF1 by 11.92% and 14.78%, respectively, and the remaining metrics also improve significantly. As the FPS shows, these gains come at only a minimal cost in time. Figure 10 compares the visualization results of DeepSORT and DyTracker. When occlusion occurs, the object boxes in DeepSORT clearly deviate from the skaters. Moreover, due to similar appearance and other problems, ID matching errors and IDSW also occur frequently. In contrast, the tracks in DyTracker are more precise and stable, which further demonstrates that our proposed method performs better in complex situations such as occlusion and similar dress.


5.3. Ablation Study

5.3.1. Ablation Study for DyTracker

Table 5 summarizes the DeepSORT to DyTracker process:

(1) TDM: Overcoming the influence of false detection significantly reduces FP, thereby improving MOTA. At the same time, the range of ID values is also controlled, which reduces the occurrence of ID switches, makes ID matching more accurate, and improves IDF1.
(2) FDSU: Improves matching accuracy and reduces the influence of detection noise, significantly increasing IDF1 while also improving MOTA to a certain extent.
(3) KFDU: Improves object location accuracy, resulting in a significant increase in MOTP, and improves the robustness of the KF, leading to an improvement in MOTA.
(4) Occlusion: Using the occlusion metric instead of the confidence to evaluate the degree of occlusion reduces the impact of motion blur, leading to improvements in all metrics.

5.3.2. Extended Experiments for KFDU and FDU

To address the occlusion problem, this article proposes two modules, KFDU and FDU. Because occlusion occurs to some extent in current datasets, we argue that these two modules should be broadly applicable. To test this, we inserted the two update modules into existing methods; Table 6 shows that both MOTA and IDF1 improve on the MOT17 validation set.
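How an occlusion-aware update might be dropped into another tracker can be sketched as follows. This is only an illustrative sketch, not the KFDU or FDU formulation from Section 4: it assumes an occlusion score in [0, 1], an exponential-moving-average appearance update, and a linearly inflated Kalman measurement noise. The function names and the parameters `base_momentum` and `max_scale` are hypothetical.

```python
import numpy as np


def dynamic_feature_update(track_feat, det_feat, occlusion, base_momentum=0.9):
    """EMA appearance update whose step shrinks as occlusion grows.

    Assumption: occlusion is a score in [0, 1]; 1 means fully occluded.
    """
    occ = float(np.clip(occlusion, 0.0, 1.0))
    # Clear detection -> alpha = base_momentum; full occlusion -> alpha = 1,
    # i.e. the stored track feature is left unchanged.
    alpha = base_momentum + (1.0 - base_momentum) * occ
    fused = (alpha * np.asarray(track_feat, dtype=float)
             + (1.0 - alpha) * np.asarray(det_feat, dtype=float))
    norm = np.linalg.norm(fused)
    return fused / norm if norm > 0 else fused


def dynamic_measurement_noise(base_r, occlusion, max_scale=10.0):
    """Inflate the measurement noise covariance for occluded detections.

    Assumption: a linear scale from 1x (clear) to max_scale (fully occluded).
    """
    occ = float(np.clip(occlusion, 0.0, 1.0))
    return np.asarray(base_r, dtype=float) * (1.0 + (max_scale - 1.0) * occ)


old_feat = np.array([1.0, 0.0])
new_feat = np.array([0.0, 1.0])
print(dynamic_feature_update(old_feat, new_feat, occlusion=0.0))
print(dynamic_feature_update(old_feat, new_feat, occlusion=1.0))
```

Both functions only change how much a new observation is trusted, so they can wrap the update step of a host tracker without touching its association logic.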

5.3.3. Ablation Study for Threshold

We conducted an ablation experiment on the threshold used to decide whether a pose is clear. As shown in Figure 11, MOTA and IDF1 follow essentially the same trend, and we chose 0.9, the value at the peak, as the final threshold. When the threshold is small enough, the tracker relies entirely on the posture gallery, which only correlates adjacent frames, and still achieves good results. When the threshold is large enough, the tracker relies entirely on the color gallery, which takes more account of historical features, and performs no better than the former. This further suggests that, in speed skating scenes, features from nearby frames should be weighted more heavily and historical features weighted less.
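The role of the threshold can be illustrated with a short sketch. It assumes a per-detection pose clarity score in [0, 1] that is compared against the threshold, which matches the two limiting cases described above; the function names, the score itself, and the sweep numbers are hypothetical placeholders and are not read from Figure 11.

```python
from typing import Mapping, Tuple


def select_gallery(pose_clarity: float, threshold: float = 0.9) -> str:
    """Pick the gallery used for appearance matching.

    Assumption: a clarity score at or above the threshold means the pose is
    clear, so the posture gallery (adjacent frames) is used; otherwise the
    color gallery (historical features) is used.
    """
    return "posture" if pose_clarity >= threshold else "color"


def best_threshold(sweep: Mapping[float, Tuple[float, float]]) -> float:
    """Return the threshold with the highest MOTA + IDF1 in a sweep."""
    return max(sweep, key=lambda t: sum(sweep[t]))


# Placeholder (MOTA, IDF1) pairs, made up purely to exercise the function.
hypothetical_sweep = {
    0.0: (0.80, 0.78),   # always posture gallery
    0.5: (0.81, 0.79),
    0.9: (0.83, 0.82),
    1.01: (0.79, 0.76),  # above any score: always color gallery
}
chosen = best_threshold(hypothetical_sweep)
print(chosen, select_gallery(0.95, chosen), select_gallery(0.40, chosen))
```

With a threshold of 0 every detection passes the test and only the posture gallery is used; with a threshold above the maximum score only the color gallery is used.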

5.4. Limitations and Future Work

5.4.1. Limitations

The speed skating scene differs from general scenes. Take appearance similarity as an example: in general scenes, targets differ considerably in appearance while their poses are similar, whereas the opposite holds in speed skating scenes (similar appearance, large pose differences). This explains why existing algorithms are not well suited to speed skating, and it also poses a challenge for the performance of our proposed modules in general scenes.

To better demonstrate the generalization of our modules, we conducted extensive ablation studies, as shown in Section 5.3, which show that the two proposed update modules are generally applicable. However, the TDM algorithm is limited to scenarios with a relatively fixed number of trajectories, while FDS is limited to cases with large pose differences.

Additionally, because speed skating data are relatively scarce and end-to-end methods are difficult to optimize, we opted for a relatively basic two-stage framework. Although it achieves good accuracy, it runs slowly.

5.4.2. Future Work

Although FDS in this paper has certain limitations, we believe that the idea of dynamic selection has great potential for further research. For example, when objects are occluded, it is difficult to distinguish them by appearance alone, so motion direction and displacement should be given more weight. When the appearance difference between objects is large, appearance can serve as the main factor for matching. For scenarios with large pose differences, pose information can be taken into account. Depending on the situation, different cues can be chosen as the leading factor, yielding an adaptive matching process.
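One way such an adaptive matching process could look is sketched below. This is purely illustrative and is not a method from this paper: it assumes three precomputed cost matrices (motion, appearance, pose) and three scene descriptors in [0, 1]. The names `cue_weights`, `fused_cost`, `appearance_gap`, and `pose_gap` are hypothetical.

```python
import numpy as np


def cue_weights(occlusion, appearance_gap, pose_gap, eps=1e-6):
    """Return normalized weights for (motion, appearance, pose) cues.

    Assumptions: all inputs lie in [0, 1]; heavy occlusion favors motion,
    otherwise the cue with the larger inter-object gap leads.
    """
    occ, app, pose = (float(np.clip(v, 0.0, 1.0))
                      for v in (occlusion, appearance_gap, pose_gap))
    raw = np.array([occ, (1.0 - occ) * app, (1.0 - occ) * pose]) + eps
    return raw / raw.sum()


def fused_cost(motion_cost, appearance_cost, pose_cost, weights):
    """Weighted sum of per-cue cost matrices (tracks x detections)."""
    w_motion, w_app, w_pose = weights
    return w_motion * motion_cost + w_app * appearance_cost + w_pose * pose_cost


# Heavy occlusion: motion dominates. Similar clothing with distinct poses
# and no occlusion: pose dominates.
print(cue_weights(occlusion=0.9, appearance_gap=0.2, pose_gap=0.8))
print(cue_weights(occlusion=0.0, appearance_gap=0.1, pose_gap=0.9))
```

The fused matrix can then be passed to any standard assignment step, so only the weighting changes from one situation to another.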

In addition, we believe that pose information can be highly valuable in certain MOT scenarios, such as DanceTrack and skating. We will continue to pursue this direction, for example by using pose information to guide feature extraction and by unifying pose recognition and tracking in a shared network. We will also collect more speed skating data to expand the dataset and try more end-to-end methods in search of higher speed and accuracy.

6. Conclusions

This study explores the potential development space of MOT from the perspective of short-track speed skating. First, we constructed a short-track speed skating MOT dataset and analyzed its unique challenges, revealing the limitations of existing datasets and the inadequacies of existing methods. Accordingly, we proposed a dynamic tracker specifically designed for speed skating scenarios, which improves three modules of DeepSORT. The TDM module mainly addresses FP and missed detections. KFDU enhances the robustness of the KF against occlusion. FDSU considers posture differences to address the clothing similarity problem and uses a dynamic update strategy to mitigate the impact of occlusion and motion blur. Compared with existing methods, our method achieved the highest MOTA of 93.70% and IDF1 of 92.39% on the SKT-MOT dataset. Furthermore, we conducted extensive ablation experiments to analyze the generalization and potential value of all modules. We believe that the differences and similarities between speed skating scenarios and general scenarios provide new insights for solving existing MOT problems and have great research value.

Appendix

This article uses many abbreviations. To aid readers, Table 7 explains each of them.

Data Availability

The data used to support the findings of this study are available from the corresponding author on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the National Key R&D Program (grant no. 2019YFF0301800), National Natural Science Foundation of China (grant no. 61379106), and the Shandong Provincial Natural Science Foundation (grant nos. ZR2013FM036, ZR2015FM011).

References

G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1815–1821, IEEE, Providence, RI, USA, 2012.
K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in CVPR 2011, pp. 1345–1352, IEEE, Colorado Springs, CO, USA, 2011.
A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468, IEEE, Phoenix, AZ, USA, 2016.
R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, pp. 83–97, 1955.
N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649, IEEE, Beijing, China, 2017.
S. Sun, N. Akhtar, H. Song, A. Mian, and M. Shah, “Deep affinity network for multiple object tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 104–119, 2021.
J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, “Track to detect and segment: an online multi-object tracker,” in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, pp. 12352–12361, Computer Vision Foundation/IEEE, 2021.
Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang, “Towards real-time multi-object tracking,” in Computer Vision–ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. M. Frahm, Eds., vol. 12356 of Lecture Notes in Computer Science, pp. 107–122, Springer, Cham, 2020.
Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “FairMOT: on the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, vol. 129, pp. 3069–3087, 2021.
F. Zeng, B. Dong, Y. Zhang, T. Wang, X. Zhang, and Y. Wei, “Motr: end-to-end multiple-object tracking with transformer,” in Computer Vision–ECCV 2022, Proceedings, Part XXVII, pp. 659–675, Springer, Tel Aviv, Israel, 2022.
A. Cioppa, S. Giancola, A. Deliège et al., “Soccernet-tracking: Multiple object tracking dataset and benchmark in soccer videos,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3491–3502, IEEE Computer Society, Los Alamitos, CA, USA, 2022.
Y. Gong and G. Srivastava, “Multi-target trajectory tracking in multi-frame video images of basketball sports based on deep learning,” EAI Endorsed Transactions on Scalable Information Systems, vol. 10, no. 2, Article ID e9, 2023.
B. T. Naik and M. F. Hashmi, “YOLOv3-Sort: detection and tracking player/ball in soccer sport,” Journal of Electronic Imaging, vol. 32, Article ID 011003, 2023.
L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: towards a benchmark for multi-target tracking,” 2015, arXiv preprint arXiv: 1504.01942.
A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, “MOT16: a benchmark for multi-object tracking,” 2016, arXiv preprint arXiv: 1603.00831.
P. Dendorfer, H. Rezatofighi, A. Milan et al., “MOT20: a benchmark for multi object tracking in crowded scenes,” 2020, arXiv preprint arXiv: 2003.09003.
J. Ferryman and A. Shahrokni, “PETS2009: dataset and challenge,” in 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 1–6, IEEE, Snowbird, UT, USA, 2009.
M. Andriluka, S. Roth, and B. Schiele, “Monocular 3D pose estimation and tracking by detection,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 623–630, IEEE, San Francisco, CA, USA, 2010.
J. Wang, Y. Peng, X. Yang, T. Wang, and Y. Zhang, “Sportstrack: an innovative method for tracking athletes in sports scenes,” 2022, arXiv preprint arXiv: 2211.07173.
X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” 2020, European Conference on Computer Vision (ECCV).
P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951, IEEE, 2019.
Y. Zhang, P. Sun, Y. Jiang et al., “Bytetrack: multi-object tracking by associating every detection box,” in Computer Vision–ECCV 2022, vol. 13682 of Lecture Notes in Computer Science, pp. 1–21, Springer, Israel, 2022.
N. Aharon, R. Orfaig, and B.-Z. Bobrovsky, “BoT-SORT: robust associations multi-pedestrian tracking,” 2022, arXiv preprint arXiv: 2206.14651.
J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, “Observation-centric sort: rethinking sort for robust multi-object tracking,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9686–9696, IEEE, Vancouver, BC, Canada, 2023.
A. Torralba, B. C. Russell, and J. Yuen, “Labelme: online image annotation and applications,” Proceedings of the IEEE, vol. 98, no. 8, pp. 1467–1484, 2010.
R. M. Haralick, “Using perspective transformations in scene analysis,” Computer Graphics and Image Processing, vol. 13, no. 3, pp. 191–221, 1980.
R. De Maesschalck, D. Jouan-Rimbaud, and D. L. Massart, “The mahalanobis distance,” Chemometrics and Intelligent Laboratory Systems, vol. 50, no. 1, pp. 1–18, 2000.
G. Jocher, “YOLOv5 by Ultralytics,” 2020, [Online]. Available: https://github.com/ultralytics/yolov5.
D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” 2014, arXiv preprint arXiv: 1412.6980.
Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: exceeding yolo series in 2021,” 2021, arXiv preprint arXiv: 2107.08430.
T.-Y. Lin, M. Maire, S. Belongie et al., “Microsoft COCO: common objects in context,” in Computer Vision–ECCV 2014, vol. 8693 of Lecture Notes in Computer Science, pp. 740–755, Springer, Cham, 2014.
L. Zheng, Z. Bie, Y. Sun et al., “MARS: a video benchmark for large-scale person re-identification,” in Computer Vision–ECCV 2016, vol. 9910 of Lecture Notes in Computer Science, pp. 868–884, Springer, Cham, 2016.
Z. Pei, “Deepsort pytorch,” 2019, [Online]. Available: https://github.com/ZQPei/deep_sort_pytorch.
P. Sun, J. Cao, Y. Jiang et al., “Dancetrack: multi-object tracking in uniform appearance and diverse motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20993–21002, IEEE, 2022.
A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361, IEEE, Providence, RI, USA, 2012.
F. Yu, H. Chen, X. Wang et al., “Bdd100k: A diverse driving dataset for heterogeneous multitask learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2636–2645, IEEE, 2020.
Y. Liao, J. Xie, and A. Geiger, “KITTI-360: a novel dataset and benchmarks for urban scene understanding in 2D and 3D,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2023.
S. Anjum and D. Gurari, “CTMC: cell tracking with mitosis detection dataset challenge,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 4228–4237, IEEE, Seattle, WA, USA, 2020.
A. Dave, T. Khurana, P. Tokmakov, C. Schmid, and D. Ramanan, “TAO: a large-scale benchmark for tracking any object,” in Computer Vision–ECCV 2020, pp. 436–454, Springer, Glasgow, UK, 2020.
Y. Du, Z. Zhao, Y. Song et al., “StrongSORT: make deepSORT great again,” IEEE Transactions on Multimedia, 2023.
X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019, arXiv preprint arXiv: 1904.07850.
B. Yan, Y. Jiang, P. Sun et al., “Towards grand unification of object tracking,” in Computer Vision–ECCV 2022, vol. 13681 of Lecture Notes in Computer Science, pp. 733–751, Springer, Cham, 2022.
J. Peng, C. Wang, F. Wan et al., “Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking,” in Computer Vision–ECCV 2020, vol. 12349 of Lecture Notes in Computer Science, pp. 145–161, Springer, Cham, 2020.
S. A. Qureshi, L. Hussain, Q. ul-ain-Chaudhary et al., “Kalman filtering and bipartite matching based super-chained tracker model for online multi object tracking in video sequences,” Applied Sciences, vol. 12, no. 19, Article ID 9538, 2022.
J. Wang, D. Chen, Z. Wu et al., “OmniTracker: Unifying object tracking by tracking-with-detection,” 2023, arXiv preprint arXiv: 2303.12079.
Q. Li, H. Mo, X. Wang, and H. Li, “Multiple object tracking and kinematic simulation for short track speed skating,” Journal of System Simulation, vol. 33, no. 5, pp. 1039–1050, 2021.

Copyright

Copyright © 2023 Junwu Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
