A significant advancement in computer vision has arrived with the introduction of SAM 3, a model that can detect, segment, and track objects in images and videos based on simple concept prompts. This development addresses a critical gap in AI's ability to find and segment all instances of a visual concept anywhere in a scene, such as identifying every 'cat' in a video, rather than just one object per prompt. The model represents a step change in promptable segmentation, improving upon previous systems and setting a new standard for what researchers call Promptable Concept Segmentation (PCS). By enabling users to specify objects with short noun phrases or image examples, SAM 3 opens up practical applications in robotics, content creation, augmented reality, and scientific research, making complex visual analysis more accessible and efficient.
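To make the prompting model concrete, here is a minimal sketch of what concept-prompted segmentation could look like in practice. The `sam3` package name, the `load` and `segment` calls, and the result fields are all assumptions for illustration, not the released API.

```python
# Hypothetical sketch of Promptable Concept Segmentation (PCS); the package
# name, checkpoint id, and every method/field below are assumed, not the
# actual SAM 3 API.
from PIL import Image

import sam3  # hypothetical package name

model = sam3.load("sam3-base")           # hypothetical checkpoint id
frame = Image.open("street.jpg")

# A short noun phrase asks for *every* instance of the concept,
# not just one object per prompt.
result = model.segment(frame, prompt="yellow school bus")

for obj in result.objects:               # one entry per detected instance
    print(obj.score, obj.box)            # confidence and bounding box
    mask = obj.mask                      # per-instance binary mask (H x W)
```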
The core innovation of SAM 3 lies in its unified architecture, which combines an image-level detector with a memory-based video tracker sharing a single backbone. To tackle the challenge of open-vocabulary concept detection, the researchers introduced a presence head that decouples recognition from localization, significantly boosting detection accuracy. This design lets the model first decide whether a concept is present in an image at all before pinpointing its location, an approach that proved especially effective when training with challenging negative phrases. The model also supports interactive refinement, letting users add prompts to correct errors, and it processes a single image with over 100 detected objects in about 30 milliseconds on an H200 GPU, sustaining near real-time performance for videos with roughly five concurrent objects.
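To illustrate the decoupling idea, the sketch below factorizes each detection score into an image-level presence probability and a per-query localization probability. The module structure, names, and feature dimension are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the recognition/localization decoupling behind a
# "presence head": one global score for "is the concept anywhere in this
# image?", separate per-query scores for "is it *here*?". Names and
# dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class PresenceHeadSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence = nn.Linear(dim, 1)      # image-level: concept present?
        self.localization = nn.Linear(dim, 1)  # query-level: this box matches?

    def forward(self, global_token, query_tokens):
        # global_token: (B, dim) pooled image/prompt feature
        # query_tokens: (B, N, dim) one feature per candidate detection
        p_present = torch.sigmoid(self.presence(global_token))    # (B, 1)
        p_local = torch.sigmoid(self.localization(query_tokens))  # (B, N, 1)
        # The final score factorizes into the two terms, so hard negatives
        # ("no zebra in this image") can supervise recognition without
        # corrupting the localization signal.
        return p_local * p_present.unsqueeze(1)                   # (B, N, 1)
```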
To achieve these performance gains, the team built a scalable data engine that pairs human and AI annotators in a feedback loop, using multimodal large language models as 'AI verifiers' to roughly double annotation throughput. The resulting dataset, Segment Anything with Concepts (SA-Co), spans 4 million unique concept phrases and 52 million masks across images and videos, complemented by a synthetic set of 38 million phrases and 1.4 billion masks. Experiments show that SAM 3 roughly doubles the accuracy of existing systems on both image and video PCS, reaching a zero-shot mask average precision of 48.8 on the LVIS benchmark against the previous best of 38.5, and surpassing baselines on the new SA-Co benchmark by at least a factor of two.
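The verification loop can be pictured as a simple routing step: the current model proposes masks, an MLLM verifier grades them, and only low-confidence cases reach a human. All names and the routing threshold below are assumptions sketching the idea, not the team's actual pipeline.

```python
# Hedged sketch of a human-and-AI annotation loop: an MLLM "AI verifier"
# screens machine-proposed masks so human annotators only see hard cases.
# Every name and the threshold are hypothetical.
def data_engine_step(image, phrase, proposer, ai_verifier, human_annotator,
                     confidence_threshold=0.9):
    masks = proposer.propose(image, phrase)            # current model's guesses
    verdict = ai_verifier.judge(image, phrase, masks)  # MLLM quality check

    if verdict.confidence >= confidence_threshold:
        # High-confidence verdicts skip the human queue entirely, which is
        # how such a loop can roughly double annotation throughput.
        return masks if verdict.accepted else None
    # Everything else falls through to a human correction pass.
    return human_annotator.correct(image, phrase, masks)
```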
The potential applications of SAM 3 extend across various real-world domains, offering enhanced capabilities wherever precise object identification and tracking are required. In robotics, it could improve navigation and manipulation by accurately segmenting tools or obstacles; in content creation, editors could quickly isolate and modify specific elements in videos; and in data annotation, it could automate the labeling of large datasets, saving time and reducing human error. The model's acceptance of simple noun phrases and image exemplars makes it approachable for non-experts, while its integration with multimodal large language models lets it tackle more complex queries, such as those requiring reasoning about relationships between objects.
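One way such an integration could work is an agent-style decomposition: the MLLM rewrites a complex query into the simple noun phrases SAM 3 accepts, then filters the returned instances. The function and method names here are hypothetical.

```python
# Illustrative sketch of pairing SAM 3 with a multimodal LLM for queries
# that need reasoning (e.g. "the person seated farthest from the door").
# `mllm.extract_noun_phrases`, `mllm.select`, and `sam3_model.segment`
# are assumed names, not a documented API.
def segment_complex_query(image, query, mllm, sam3_model):
    # The MLLM reduces the referring expression to simple noun phrases
    # that SAM 3 can handle directly.
    phrases = mllm.extract_noun_phrases(query)   # e.g. ["person", "door"]
    candidates = {p: sam3_model.segment(image, prompt=p) for p in phrases}
    # The MLLM then decides which candidate instances satisfy the full query.
    return mllm.select(image, query, candidates)
```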
Despite these advancements, SAM 3 has limitations. The model struggles to generalize to out-of-domain terms, such as fine-grained concepts like specific aircraft types or medical terminology, particularly in niche visual domains like thermal imagery. It is also constrained to simple noun phrase prompts and does not support multi-attribute queries or longer referring expressions unless combined with a multimodal large language model. In videos, inference cost scales linearly with the number of tracked objects, which may limit real-time use in crowded scenes. The researchers note, however, that performance can be improved by fine-tuning on small amounts of human-annotated data or on domain-specific synthetic data generated by their data engine, pointing to scalable future enhancements.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.