AquaPose

computer-vision

3d-reconstruction

multi-object-tracking

pose-estimation

End-to-end 3D pose tracking for many near-identical animals behind a refractive interface, with identity maintained using geometry alone — no appearance features required.

Published

March 15, 2026

AquaPose is a vertically integrated pipeline from synchronized multi-camera video to per-fish 3D midline trajectories, holding identity across thousands of frames of heavy occlusion using geometric consistency alone.

Multi-camera 3D pose tracking is well-studied for humans, lab mice, and other terrestrial subjects. Three requirements are not: many near-identical subjects, where appearance-based re-identification provides no discriminative signal; refractive imaging geometry, where cameras looking through an air–water interface break the pinhole projection that standard multi-view geometry depends on; and long continuous recordings, where even a 0.01% per-frame identity error means one expected identity swap per 10,000 frames, or roughly every 5.6 minutes at 30 fps. These constraints stack. Accurate cross-view association needs accurate 3D geometry; accurate 3D geometry in aquatic settings needs refraction handling; and long-run identity on visually similar subjects needs something other than appearance embeddings. To my knowledge, no existing tool addresses this combination.
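The compounding argument is easy to check numerically. The sketch below assumes identity errors are independent across frames with a constant per-frame probability, which is a simplification; the function names are illustrative only.

```python
def expected_seconds_to_swap(per_frame_error: float, fps: float) -> float:
    """Mean time until the first identity swap, assuming independent frames.

    With a constant per-frame swap probability p, the number of frames until
    the first swap is geometric with mean 1 / p.
    """
    return (1.0 / per_frame_error) / fps


def prob_swap_free(per_frame_error: float, n_frames: int) -> float:
    """Probability that n_frames consecutive frames contain no swap."""
    return (1.0 - per_frame_error) ** n_frames


p = 0.0001   # 0.01% per-frame identity error
fps = 30.0

minutes = expected_seconds_to_swap(p, fps) / 60.0
one_hour = prob_swap_free(p, int(fps * 3600))

print(f"expected time to first swap: {minutes:.1f} min")
print(f"chance of a swap-free hour:  {one_hour:.1e}")
```

Under these assumptions the mean gap between swaps is a little over five and a half minutes, and an hour with no swap at all is vanishingly unlikely, which is why per-frame accuracy alone is a poor predictor of long-run identity quality.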

Pipeline

Five stages move multi-camera video to 3D trajectories:

Oriented bounding box detection (Ultralytics YOLO) — rotated boxes that hug elongated bodies at arbitrary orientations, avoiding the wasted crop area of axis-aligned boxes on diagonal subjects.
Six-keypoint pose estimation on affine-warped crops (YOLO Pose). A midline skeleton (nose → tail) matches the laterally symmetric body plan; branching skeletons used for limbed animals don’t transfer.
Per-camera tracking via OKSort — a keypoint-based multi-object tracker I packaged as a standalone library because the approach generalizes beyond this application.
Cross-view identity association — per-chunk (300-frame) stateless matching via ray-based affinity scoring under the refractive model, partitioned into identity groups by Leiden community detection. Each chunk re-solves identity from scratch; there is no “lock it in and pray” identity map to drift.
Refraction-aware triangulation — confidence-weighted DLT under the shared AquaCal refractive projection model, with optional B-spline midline fitting.
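To make the refraction step concrete, here is a minimal sketch of how a camera ray can be bent at a flat air–water interface and how several bent rays can be combined into one 3D point. It is not the AquaCal model or the pipeline's DLT formulation: it assumes a horizontal water surface at z = 0, nominal refractive indices, and swaps in a confidence-weighted least-squares ray intersection as a simpler stand-in. All function names are hypothetical.

```python
import numpy as np

N_AIR, N_WATER = 1.0, 1.333  # nominal refractive indices (assumed values)


def refract(d, n, n1=N_AIR, n2=N_WATER):
    """Refract direction d at a surface with normal n (vector form of Snell's law).

    n must point back toward the incoming ray (n . d < 0).
    Returns the unit refracted direction, or None on total internal reflection.
    """
    d = d / np.linalg.norm(d)
    n = n / np.linalg.norm(n)
    r = n1 / n2
    cos_i = -float(n @ d)
    sin2_t = r * r * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None
    return r * d + (r * cos_i - np.sqrt(1.0 - sin2_t)) * n


def underwater_ray(cam_center, pixel_dir, water_z=0.0):
    """Follow a downward camera ray to the plane z = water_z and bend it into the water."""
    d = pixel_dir / np.linalg.norm(pixel_dir)
    t = (water_z - cam_center[2]) / d[2]          # ray-plane intersection parameter
    hit = cam_center + t * d
    return hit, refract(d, np.array([0.0, 0.0, 1.0]))


def triangulate(origins, directions, weights):
    """Confidence-weighted least-squares point closest to a set of rays."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d, w in zip(origins, directions, weights):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)            # projects onto the plane normal to d
        A += w * P
        b += w * (P @ o)
    return np.linalg.solve(A, b)


# Synthetic check: place three above-water cameras whose rays all bend toward one point.
target = np.array([0.1, -0.2, -0.5])              # half a unit below the surface
surface_hits = [np.array([0.3, 0.0, 0.0]),
                np.array([-0.2, 0.1, 0.0]),
                np.array([0.0, -0.4, 0.0])]
origins, directions = [], []
for q in surface_hits:
    # Trace water -> air to find where a camera seeing the target through q would sit.
    up_air = refract(q - target, np.array([0.0, 0.0, -1.0]), N_WATER, N_AIR)
    cam = q + 1.0 * up_air
    o, d = underwater_ray(cam, -up_air)           # forward trace: air -> water
    origins.append(o)
    directions.append(d)

estimate = triangulate(origins, directions, weights=[0.9, 0.6, 0.8])
print(np.round(estimate, 6))
```

The weights stand in for per-keypoint detection confidences, so a low-confidence view pulls the solution less. Ignoring refraction here would amount to extending the in-air rays straight through the surface, which biases the recovered depth.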

Long recordings run as fixed chunks. Kalman and identity state are carried across chunk boundaries for continuity, but cross-view association is re-solved from geometry inside each chunk, so an association failure in one chunk cannot propagate into the next. A global stitcher reconciles chunk-local IDs into consistent fish identities using per-fish body length as a biometric to detect and correct residual swaps.
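One way a body-length check could flag residual swaps at a chunk boundary is sketched below. The text above only states that body length is used as a biometric; the matching strategy, the relative tolerance, the function names, and the example lengths are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_by_body_length(prev_lengths, next_lengths):
    """Assign chunk-local tracks to known identities by body length.

    prev_lengths: {global_id: median body length from earlier chunks}
    next_lengths: {local_id: median body length in the new chunk}
    Returns {local_id: global_id} minimising total absolute length mismatch.
    """
    global_ids = list(prev_lengths)
    local_ids = list(next_lengths)
    cost = np.array([[abs(next_lengths[l] - prev_lengths[g]) for g in global_ids]
                     for l in local_ids])
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one assignment
    return {local_ids[r]: global_ids[c] for r, c in zip(rows, cols)}


def find_swaps(proposed, prev_lengths, next_lengths, tolerance=0.05):
    """Flag proposed links whose length disagrees by more than `tolerance`
    (relative) while the length-based matching prefers a different identity."""
    by_length = match_by_body_length(prev_lengths, next_lengths)
    swaps = []
    for local_id, global_id in proposed.items():
        expected = prev_lengths[global_id]
        rel_err = abs(next_lengths[local_id] - expected) / expected
        if rel_err > tolerance and by_length[local_id] != global_id:
            swaps.append((local_id, global_id, by_length[local_id]))
    return swaps


# Made-up lengths (mm). Tracks 0 and 1 were linked to the wrong fish by continuity.
prev = {"fish_A": 82.0, "fish_B": 95.0, "fish_C": 70.5}
new = {0: 94.6, 1: 82.3, 2: 70.9}
proposed = {0: "fish_A", 1: "fish_B", 2: "fish_C"}

swaps = find_swaps(proposed, prev, new)
print(swaps)   # each entry: (local_id, proposed identity, identity preferred by length)
```

A length check like this only discriminates between fish whose sizes differ by more than the measurement noise, so it works as a corrective signal on top of geometric continuity rather than as a standalone identifier.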

The pipeline bootstraps from a small manual annotation set (43 OBB images) and iteratively self-labels: its own 3D reconstructions reproject back into each camera view as pseudo-labels, optionally human-reviewed in Label Studio, then fed into retraining. Thin-plate-spline elastic augmentation synthesizes the high-curvature postures that are behaviorally interesting but underrepresented in real data. A targeted wall-fish augmentation pipeline inpaints easy sand-background fish out of training images, forcing the detector to learn the hard wall-adjacent cases.

Performance

The evaluation set is 9,450 frames (315 s at 30 fps) of 9 Aulonocara sp. “Yellow Head” across 12 synchronized cameras, recorded under deliberately hard conditions: a median nearest-neighbor distance of 1.0 body length, a 14.8% 2D occlusion rate, and 6 near-identical females. On this set the full pipeline produced one confirmed identity swap across the entire run (occurring in the first 7% of the recording), reconstructed 95.6% of possible fish-frames, held at least 8 of 9 fish in 94.9% of frames, and achieved a median reprojection error of 2.82 pixels. End-to-end throughput on a single desktop GPU (RTX 4070) is 4.26 fps across 12 cameras.

On a follow-up 5.8-hour continuous recording (~628,000 frames, 24.5M detections), reconstruction coverage and reprojection error showed no systematic degradation over the run. Gradual degradation is the failure mode that typically kills multi-object trackers on long recordings, because identity error compounds frame by frame. AquaPose avoids it by construction: the per-chunk stateless cross-view association re-solves identity from geometry each window, rather than carrying errors forward indefinitely.

Training-data ablations quantified each intervention. Detection on wall-adjacent fish improved by 25.5 mAP points (0.398 → 0.653 mAP₅₀₋₉₅, a 64% relative gain) through pseudo-labeling, hard-case curation, and wall augmentation. Pose precision rose from 0.948 to 0.992, an 85% relative error reduction (0.052 → 0.008). TPS elastic augmentation specifically reduced curvature-dependent accuracy loss on the highest-curvature tercile by 11%.
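The relative figures follow directly from the before-and-after scores quoted above. A short sketch of the two calculations, with illustrative function names, treating 1 − score as the error for the reduction:

```python
def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of a score relative to its starting value."""
    return (after - before) / before


def relative_error_reduction(before: float, after: float) -> float:
    """Treat 1 - score as the error and return the fraction of it removed."""
    err_before, err_after = 1.0 - before, 1.0 - after
    return (err_before - err_after) / err_before


map_gain = relative_gain(0.398, 0.653)             # wall-adjacent detection mAP
pose_cut = relative_error_reduction(0.948, 0.992)  # pose precision

print(f"wall-fish mAP relative gain: {map_gain:.0%}")
print(f"pose error reduction:        {pose_cut:.0%}")
```

Reporting the error reduction alongside the raw scores matters near the ceiling: a move from 0.948 to 0.992 looks small in absolute terms but removes most of the remaining error.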

Stack: Python, PyTorch, Ultralytics YOLO, OpenCV, NumPy, SciPy, Leiden community detection, SQLite, HDF5, Label Studio.
