Robotics

Lite Any Stereo: The Ultra-Light Model That Shatters the Efficiency-Accuracy Trade-Off in Stereo Vision

AI Research
March 26, 2026
4 min read

For decades, stereo vision—the process of estimating depth from two offset images—has been a cornerstone of computer vision, powering everything from autonomous driving to robotics. While deep learning has propelled accuracy to new heights, these gains have come at a steep cost: state-of-the-art models are often massive, computationally expensive behemoths that struggle to run in real time on resource-limited hardware. This has created a frustrating dichotomy in the field, where researchers believed that lightweight models simply lacked the capacity for strong zero-shot generalization—the ability to perform accurately across diverse, unseen scenarios without fine-tuning. Now, a groundbreaking paper from Imperial College London, titled "Lite Any Stereo: Efficient Zero-Shot Stereo Matching," challenges this assumption head-on, introducing a model that delivers top-tier accuracy while requiring less than 1% of the computational cost of leading models.

The researchers, led by Junpeng Jing, Weixun Luo, and Ye Mao, designed Lite Any Stereo from the ground up to bridge the gap between efficiency and zero-shot capability. At its core is a novel hybrid cost aggregation module that combines 2D and 3D representations. 2D convolutions are efficient but limited in capturing disparity cues, while 3D convolutions are powerful but computationally heavy; the team found that a serial connection—applying a small 3D block first, followed by a 2D block—best captures complementary spatial and disparity information. This design, which uses ConvNeXt layers for the 2D component, keeps the model lean at roughly 33 billion multiply-accumulate operations (33 GMACs), compared with more than 3,600 GMACs for some accuracy-focused models. The backbone is MobileNetV2, chosen because its channel configuration aligns with the aggregation module, and the model avoids external priors such as DepthAnything to keep overhead minimal.
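To get a feel for why a small 3D block followed by a 2D block stays so cheap, the sketch below does back-of-the-envelope MAC accounting for the two convolution types over a downsampled cost volume. Every shape here (channel counts, kernel sizes, the 48×120×160 volume) is a hypothetical illustration of the bookkeeping, not the paper's actual configuration.

```python
# Rough MAC accounting for a serial 3D -> 2D cost aggregation block.
# All layer shapes below are illustrative, not the paper's real config.

def conv3d_macs(c_in, c_out, k, d, h, w):
    """MACs for a dense 3D convolution over a (d, h, w) volume."""
    return c_in * c_out * (k ** 3) * d * h * w

def conv2d_macs(c_in, c_out, k, h, w):
    """MACs for a dense 2D convolution over an (h, w) feature map."""
    return c_in * c_out * (k ** 2) * h * w

# Hypothetical quarter-resolution cost volume: 48 disparity candidates
# over a 120 x 160 spatial grid.
d, h, w = 48, 120, 160

# Small 3D block first (captures disparity cues cheaply at low channel
# width), then a 2D block over the aggregated map (spatial refinement).
macs_3d = conv3d_macs(c_in=8, c_out=8, k=3, d=d, h=h, w=w)
macs_2d = conv2d_macs(c_in=48, c_out=48, k=7, h=h, w=w)

total = macs_3d + macs_2d
print(f"3D block:     {macs_3d / 1e9:.2f} GMACs")
print(f"2D block:     {macs_2d / 1e9:.2f} GMACs")
print(f"serial total: {total / 1e9:.2f} GMACs")
```

Even with a 7×7 kernel in the 2D stage, the serial pair lands in single-digit GMACs for these toy shapes, which is the intuition behind keeping the expensive 3D computation small and narrow.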

However, the architectural innovations are only half the story. The team's three-stage training strategy is what truly unlocks the model's zero-shot prowess. Stage ① involves supervised training on 1.8 million synthetic stereo pairs from datasets like SceneFlow and TartanAir, building a foundational matching ability. Stage ② introduces self-distillation: a teacher model receives clean inputs, while a student model gets perturbed data, encouraging domain-invariant features through a feature alignment loss. Crucially, the team found that keeping the teacher's weights fixed outperformed more complex update schemes. Stage ③ leverages 0.5 million unlabeled real-world stereo images, using pseudo-labels generated by a frozen accurate model (FoundationStereo) for knowledge distillation. This approach effectively bridges the sim-to-real gap, with the authors noting that data quality trumped sheer scale—low-quality datasets were excluded to prevent performance degradation.
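The stage-② mechanics can be sketched in a few lines: a frozen teacher sees the clean image, the student sees a perturbed copy, and a feature alignment loss pulls the student's features toward the teacher's. This toy NumPy version uses a single linear map as a stand-in feature extractor and Gaussian noise as the perturbation; both are illustrative placeholders, not the paper's actual network or augmentation pipeline.

```python
import numpy as np

# Toy sketch of stage-2 self-distillation: frozen teacher on clean input,
# student on perturbed input, aligned via a mean-squared feature loss.

rng = np.random.default_rng(0)

def extract_features(weights, image):
    """Stand-in feature extractor: one linear map over flattened pixels."""
    return weights @ image.ravel()

def feature_alignment_loss(teacher_feat, student_feat):
    """Mean squared error between teacher and student feature vectors."""
    return float(np.mean((teacher_feat - student_feat) ** 2))

image = rng.standard_normal((8, 8))
perturbed = image + 0.1 * rng.standard_normal((8, 8))  # photometric-style noise

teacher_w = rng.standard_normal((16, 64))  # frozen for the whole stage
student_w = teacher_w.copy()               # student initialized from teacher

t_feat = extract_features(teacher_w, image)      # clean input  -> teacher
s_feat = extract_features(student_w, perturbed)  # noisy input  -> student

loss = feature_alignment_loss(t_feat, s_feat)
print(f"alignment loss: {loss:.4f}")
```

In the real setup this loss would be backpropagated through the student only, which matches the authors' finding that leaving the teacher's weights fixed works better than more complex update schemes.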

The results are nothing short of spectacular. In zero-shot evaluations on four major benchmarks—KITTI 2012, KITTI 2015, ETH3D, and Middlebury—Lite Any Stereo achieved the highest accuracy among all efficient models by a large margin. It even matched or surpassed non-prior-based accurate models such as Selective-IGEV, which uses over 100 times more computation. On the KITTI 2015 leaderboard, it ranked first among efficient methods at the time of submission. Qualitatively, the model produces smooth, detailed disparity maps on in-the-wild 4K images, handling difficult cases such as reflections and repetitive textures where other models falter. Impressively, it runs in 21 milliseconds on a GTX 1080 GPU, demonstrating real-time potential on older hardware, and uses just 2.5 GB of memory for 2K inputs, making it suitable for embedded systems and drones.

Despite its achievements, Lite Any Stereo is not without limitations. The authors acknowledge that it still trails depth-prior-based approaches, and performance on indoor datasets like Middlebury saw a slight drop in the final training stage, likely due to limited high-quality real-world indoor data. Scenes with transparent objects and reflections also remain areas for improvement. Nonetheless, this work fundamentally shifts the paradigm in stereo matching, proving that ultra-light models can indeed excel in zero-shot settings. By open-sourcing their code and model, the team sets a new standard for practical deployment, offering a path toward more accessible and efficient depth sensing. Future work may focus on scaling real-world data collection and expanding the model zoo to cover varied computational budgets.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
