OKSort

multi-object-tracking
pose-estimation
A multi-object tracker that swaps bounding-box IoU for keypoint similarity — outperforming bbox-based and appearance-based trackers on deformable, non-standard subjects while staying an order of magnitude faster than appearance methods.
Published October 1, 2025

OKSort is a multi-object tracker that embeds Object Keypoint Similarity directly into the SORT-family predict–match–update loop — to my knowledge, the first to do so — purpose-built for deformable subjects where bounding boxes poorly describe body overlap.

Multi-object trackers in the SORT lineage (SORT, ByteTrack, OC-SORT, BoT-SORT, StrongSORT) associate detections across frames using bounding-box IoU and, optionally, learned appearance embeddings. Both break down in specific settings. For deformable and elongated bodies, an axis-aligned bounding box around a diagonal fish or a curled animal is mostly background, so box IoU decouples from actual body overlap. For visually near-identical subjects, appearance embeddings trained on pedestrian datasets have no discriminative signal and can actively hurt performance by assigning tracks based on irrelevant features. And for compute budgets that exclude heavy ReID backbones, appearance-based trackers run an order of magnitude slower than pure-motion ones. Any domain doing keypoint detection on deformable subjects — sports analytics, surgical instruments, livestock, behavioral neuroscience — has some subset of these problems.
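The IoU failure mode is easy to demonstrate numerically. In this toy example (illustrative coordinates, not from the benchmark), two elongated bodies lie along opposite diagonals of the same region: their bounding boxes are identical, so box IoU is a perfect 1.0, even though their endpoint keypoints are 100 px apart.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def bbox(kpts):
    """Tight axis-aligned box around a (K, 2) keypoint array."""
    return (kpts[:, 0].min(), kpts[:, 1].min(), kpts[:, 0].max(), kpts[:, 1].max())

# Two elongated bodies along opposite diagonals of the same 100x100 region.
kpts_a = np.array([[0.0, 0.0], [50.0, 50.0], [100.0, 100.0]])  # one diagonal
kpts_b = np.array([[0.0, 100.0], [50.0, 50.0], [100.0, 0.0]])  # the other

print(iou(bbox(kpts_a), bbox(kpts_b)))          # 1.0 — the boxes are identical
print(np.linalg.norm(kpts_a - kpts_b, axis=1))  # yet the endpoints are 100 px apart
```

Box IoU says these are the same object; any keypoint-aware metric says they are not.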

Approach

OKSort extends the SORT-family loop with three changes. First, Object Keypoint Similarity replaces bounding-box IoU as the association metric. OKS is scale-normalized and per-keypoint confidence-weighted, so it degrades gracefully when parts of the subject are occluded or poorly localized. Second, a 24-dimensional Kalman filter tracks all six keypoints jointly (six points × 2D position + 2D velocity), with measurement noise scaled inversely by keypoint confidence — well-localized keypoints dominate state updates, low-confidence ones do not pollute them. Third, an orientation-curvature matching term augments the OKS cost with the cosine similarity of spine heading vectors between predicted and observed poses. Around these sit the strongest mechanisms from the SORT family: ByteTrack-style two-phase matching, OC-SORT-style observation-centric recovery, and a merger-detection heuristic that extends the unmatched-track coast limit from 1 s to 3 s when two subjects collapse into one detection.
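To make the first change concrete, here is a minimal sketch of OKS-based association under COCO-style assumptions: a Gaussian falloff per keypoint controlled by constants kappa, object scale taken from the detection, and detection confidences used as weights. The function names, the gating threshold, and the parameter choices are illustrative, not OKSort's actual API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def oks(pred, det, conf, scale, kappa):
    """COCO-style Object Keypoint Similarity, weighted by keypoint confidence.

    pred, det: (K, 2) keypoint coordinates; conf: (K,) detection confidences;
    scale: object scale (e.g. sqrt of area); kappa: (K,) per-keypoint falloff.
    """
    d2 = np.sum((pred - det) ** 2, axis=1)
    sim = np.exp(-d2 / (2.0 * scale**2 * kappa**2))
    return np.sum(conf * sim) / (np.sum(conf) + 1e-9)

def associate(track_preds, detections, confs, scales, kappa, oks_min=0.3):
    """Build a (1 - OKS) cost matrix and solve it with the Hungarian algorithm."""
    cost = np.ones((len(track_preds), len(detections)))
    for t, pred in enumerate(track_preds):
        for d, det in enumerate(detections):
            cost[t, d] = 1.0 - oks(pred, det, confs[d], scales[d], kappa)
    rows, cols = linear_sum_assignment(cost)
    # Reject pairs whose similarity falls below the gate, as SORT-family trackers do.
    return [(t, d) for t, d in zip(rows, cols) if 1.0 - cost[t, d] >= oks_min]
```

In the full tracker this cost would also carry the orientation-curvature term and feed ByteTrack-style two-phase matching; the sketch shows only the OKS core.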

OKSort runs in production as the per-camera tracking stage of AquaPose, but it is packaged as a standalone library because the approach transfers to any domain with per-frame keypoint detections.

Benchmark

OKSort was benchmarked against seven alternatives — six bounding-box trackers (ByteTrack, OC-SORT, BoT-SORT, BoT-SORT-ReID, SFSORT, StrongSORT) and one keypoint-based baseline (KeySort) — on 9,340 frames of dense multi-subject tracking (72,381 detections, 14.8% occlusion rate). All trackers received identical ground-truth detections, so differences are attributable entirely to association quality. OKSort achieved the highest HOTA (0.452), 8.0% above the best bounding-box tracker (BoT-SORT, 0.419) and 31% above KeySort (0.345) — a keypoint tracker that uses raw Euclidean distance rather than scale-normalized OKS, isolating the normalization as the source of the advantage. Extended across all twelve cameras, OKSort led on ten of twelve (0.563 aggregate HOTA), with the advantage growing on the six densest camera views (>20,000 detections each). Runtime is 2–3× slower than the fastest bounding-box trackers due to the per-keypoint cost matrix, but 16× faster than StrongSORT and comfortably inside the offline-pipeline compute budget.
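The KeySort gap can be illustrated numerically. Because OKS divides squared keypoint error by the squared object scale, the same relative localization error scores identically at any subject size, while raw Euclidean distance grows linearly with scale. A toy demonstration (single kappa for all keypoints, illustrative values, not KeySort's exact distance):

```python
import numpy as np

def oks_scalar(pred, det, scale, kappa=0.1):
    """Unweighted OKS with one falloff constant for all keypoints."""
    d2 = np.sum((pred - det) ** 2, axis=1)
    return np.mean(np.exp(-d2 / (2.0 * scale**2 * kappa**2)))

pose = np.array([[0.0, 0.0], [10.0, 5.0], [20.0, 10.0]])
err = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Same pose and same relative error at 1x and 2x subject size.
small_pred, small_det, small_s = pose, pose + err, 20.0
big_pred, big_det, big_s = 2 * pose, 2 * (pose + err), 40.0

eucl_small = np.linalg.norm(small_pred - small_det, axis=1).sum()
eucl_big = np.linalg.norm(big_pred - big_det, axis=1).sum()
print(eucl_big / eucl_small)                        # 2.0 — raw distance doubles
print(oks_scalar(small_pred, small_det, small_s))   # OKS is unchanged...
print(oks_scalar(big_pred, big_det, big_s))         # ...at the larger scale
```

A raw-distance tracker must therefore tune its association threshold per subject size, while an OKS threshold transfers across near and far subjects in the same scene.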

Appearance-based re-identification actively hurt performance on this data. BoT-SORT-ReID scored 0.014 HOTA below the same tracker without ReID. Pedestrian-trained ReID features provide no discriminative signal for visually similar non-human subjects, and they inject noise into otherwise clean geometric associations.

Stack: Python, NumPy, SciPy. Evaluated with TrackEval (MOTChallenge format); compared against trackers from the boxmot library.