AquaMVS
AquaMVS produces dense 3D reconstructions of submerged environments from calibrated multi-camera video, threading a refractive projection model through every geometric stage — sparse matching, dense depth estimation, multi-view fusion, and surface reconstruction.
Multi-view stereo is a mature computer vision problem with strong off-the-shelf solutions: COLMAP, OpenMVS, commercial photogrammetry packages. None of them handles refraction correctly. The pinhole camera model they depend on is a physical approximation that breaks at air–water interfaces: epipolar lines bend into curves, feature-match correspondences drift, and the errors compound through every downstream stage. Treating refraction as a post-hoc lens distortion does not work either; each pipeline stage needs geometry-aware projection, or the errors propagate. And the research tools that do model refraction typically target a single moving camera imaging small objects, not simultaneous multi-camera arrays imaging tank-floor-scale environments.
Approach
AquaMVS threads the shared AquaCal refractive projection model through every geometric operation in a dense MVS pipeline. Projection operations are reformulated in PyTorch so every stage is GPU-accelerated and differentiable; forward projection runs a fixed count of ten Newton-Raphson iterations against the Snell's-law constraint, with an analytical Jacobian that remains compatible with autograd.
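As a concrete sketch of that fixed-iteration scheme, the snippet below solves a simplified flat-interface refraction constraint with ten Newton steps and an analytical derivative, unrolled so gradients flow through the solve. The geometry, function name, and parameters here are illustrative assumptions, not the AquaCal model.

```python
import torch

def refract_interface_x(xp, d_air=0.5, d_w=1.0, n1=1.0, n2=1.333, iters=10):
    # Toy flat-interface setup (not the AquaCal model): camera at (0, d_air)
    # in air, underwater point at (xp, -d_w). Solve for the interface crossing
    # x where Snell's law n1*sin(theta_air) = n2*sin(theta_water) holds,
    # using a fixed number of Newton steps with an analytical derivative.
    x = xp * d_air / (d_air + d_w)  # straight-line initial guess
    for _ in range(iters):
        ra = torch.sqrt(x**2 + d_air**2)
        rw = torch.sqrt((xp - x)**2 + d_w**2)
        f = n1 * x / ra - n2 * (xp - x) / rw              # Snell residual
        df = n1 * d_air**2 / ra**3 + n2 * d_w**2 / rw**3  # analytical slope, > 0
        x = x - f / df
    return x

xp = torch.tensor([0.3, 0.8, 1.5], requires_grad=True)
x = refract_interface_x(xp)
sin_air = x / torch.sqrt(x**2 + 0.5**2)
sin_wat = (xp - x) / torch.sqrt((xp - x)**2 + 1.0**2)
x.sum().backward()  # gradients flow through the unrolled iterations
```

Because the iteration count is fixed and the residual is monotone in x, the solve is branch-free, batches trivially on GPU, and differentiates cleanly under autograd.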
Two complementary reconstruction pathways share the same refractive backbone. The first uses RoMa v2 — a transformer-based dense warp matcher that produces pixel-level correspondence fields, yielding the highest-density reconstructions where visual texture is available. The second pairs SuperPoint + LightGlue sparse feature matching (which establishes per-pixel depth bounds) with plane-sweep stereo inside those bounds: a cost volume built via NCC or SSIM photometric similarity, sampled across depth hypotheses with sub-pixel parabolic refinement.
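The sub-pixel refinement at the end of the plane sweep can be sketched with the standard three-point parabola fit: evaluate the cost profile's best depth hypothesis and its two neighbours, and take the parabola's vertex. This is an illustrative helper, not the AquaMVS API.

```python
import numpy as np

def parabolic_refine(costs, depths):
    # Fit a parabola through the best cost and its two neighbours along the
    # depth axis and return the vertex depth. Assumes uniform depth sampling
    # and a similarity score where higher is better (e.g. NCC).
    i = int(np.argmax(costs))
    if i == 0 or i == len(costs) - 1:
        return depths[i]  # no neighbours to fit through at the boundary
    c0, c1, c2 = costs[i - 1], costs[i], costs[i + 1]
    denom = c0 - 2.0 * c1 + c2
    offset = 0.0 if denom == 0 else 0.5 * (c0 - c2) / denom  # in [-0.5, 0.5]
    step = depths[1] - depths[0]
    return depths[i] + offset * step

depths = np.linspace(1.0, 2.0, 11)        # 11 hypotheses, 0.1 m apart
true_d = 1.537
costs = -(depths - true_d) ** 2           # synthetic peaked cost profile
d_hat = parabolic_refine(costs, depths)   # recovers true_d despite coarse grid
print(round(d_hat, 3))  # → 1.537
```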
For time-series reconstruction, temporal median filtering suppresses transient objects (swimming fish, debris) so the reconstructed geometry reflects the static environment. Per-camera depth maps are geometrically filtered, back-projected, and fused via voxel-grid downsampling with normal estimation. The fused point cloud is converted to a textured triangle mesh via screened Poisson surface reconstruction.
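The temporal median step can be illustrated in a few lines: a per-pixel median over a window of depth maps keeps the static background as long as transients occupy a pixel in fewer than half the frames. Function and variable names here are hypothetical, not the AquaMVS API.

```python
import numpy as np

def temporal_median(depth_stack):
    # Median over the time axis of per-frame depth maps. Transient objects
    # (fish, debris) that occupy a pixel in a minority of frames are rejected
    # and the static background depth survives; NaNs mark invalid pixels.
    return np.nanmedian(depth_stack, axis=0)

static = np.full((4, 4), 2.0)                       # static floor at 2 m
frames = np.repeat(static[None], 7, axis=0).copy()  # 7-frame window
frames[2, 1, 1] = 0.8                               # a fish crosses pixel (1,1)
frames[3, 1, 1] = 0.8                               # ...in two of seven frames
fused = temporal_median(frames)
print(fused[1, 1])  # → 2.0 (transient rejected)
```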
Reconstruction quality
On real video from a 13-camera rig imaging a 2 m submerged arena, the RoMa pathway produced 14.3 million points and a 2.2 million-vertex mesh from a single frame — 2.1× the density of the LightGlue + plane-sweep pathway at 14% lower wall-clock time (1960 s vs. 2271 s on a laptop RTX 3060).
Dimensional accuracy was measured purely from the reconstructed point cloud, with no reference dimensions entering the computation. The recovered tank diameter was within 1% of known values at both the sand level (1888 mm reconstructed vs. 1871 mm reference) and the water surface (1960 mm vs. ~1956 mm). The boundary-circle fit at sand level had an RMS residual of 3.5 mm, confirming that the reconstruction recovers the circular cross-section faithfully.
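To make the diameter measurement concrete, the sketch below fits a circle to noisy boundary points by algebraic least squares (the Kasa method); this illustrates the kind of boundary-circle fit behind the numbers above and is an assumption, not necessarily the exact procedure used.

```python
import numpy as np

def fit_circle(x, y):
    # Kasa algebraic circle fit: solve [2x 2y 1] @ [cx cy c] = x^2 + y^2
    # in least squares, then recover radius from r^2 = c + cx^2 + cy^2.
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return cx, cy, np.sqrt(c + cx**2 + cy**2)

# Synthetic boundary points at the sand-level scale: a 1888 mm diameter
# circle with ~3.5 mm noise, matching the RMS residual reported above.
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 500)
r_true = 1888.0 / 2
x = r_true * np.cos(t) + rng.normal(0, 3.5, t.size)
y = r_true * np.sin(t) + rng.normal(0, 3.5, t.size)
cx, cy, r = fit_circle(x, y)
print(round(2 * r, 1))  # diameter estimate, close to 1888 mm
```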
Across five consecutive reconstructions of the static environment — any variation here is an upper bound on reconstruction noise, since the substrate is fixed — 85–90% of pixels showed frame-to-frame height differences below 0.5 mm, ~95% fell below 1.0 mm, and there was no systematic drift. Qualitatively, the reconstructed surfaces resolve biologically meaningful features at the ~40 mm scale of substrate relief: bowers (male-constructed mound structures), sand gradients, and clean sand-to-wall transitions at the tank perimeter.
Stack: Python, PyTorch, RoMa v2, SuperPoint + LightGlue, OpenCV, NumPy, Open3D (point-cloud ops and surface reconstruction).
Links
- Install: pip install aquamvs
- GitHub: tlancaster6/AquaMVS
- Companion tools: AquaCal (calibration), AquaPose (animal tracking)
- Paper: Lancaster et al. (in prep)