Understanding how objects interact in videos is crucial for applications like robotics, surveillance, and assistive systems, but existing AI tools often operate as black boxes with no room for human input. Researchers from UC Santa Barbara have developed Click2Graph, a framework that bridges this gap by allowing users to guide video analysis with a single click or bounding box. This interactive approach transforms how AI systems interpret dynamic scenes, making them more controllable and interpretable for real-world use.
The key finding is that Click2Graph can generate detailed scene graphs from minimal user prompts, such as clicking on a subject in a video frame. The system then automatically segments and tracks that subject across time, discovers interacting objects, and predicts relationships between them, all while maintaining temporal consistency. For example, as shown in Figure 1, clicking on a dog leads the system to segment a carpet and predict the activity 'sitting,' while clicking on a child yields a dog and the relationship 'playing.' This demonstrates how user guidance can direct AI attention to specific interactions, reducing errors and enhancing relevance in complex environments.
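The click-to-scene-graph loop described above can be sketched in a few lines. This is a toy illustration, not the paper's actual API: the frame layout, the `click_to_triple` function, and the hand-coded relation table are all assumptions standing in for the learned segmentation and relation modules.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str

# Toy per-frame scene model: each entity has a label and a bounding box.
FRAME = {
    "dog":    (40, 60, 120, 140),
    "child":  (150, 30, 220, 160),
    "carpet": (0, 120, 320, 240),
}

def hit(click, box):
    """Return True if the click (x, y) falls inside box (x0, y0, x1, y1)."""
    x, y = click
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def click_to_triple(click, frame, relations):
    """Resolve a user click to a subject, then look up its top interaction.

    `relations` plays the role of the learned interaction-discovery and
    classification modules: it maps a subject label to (object, predicate).
    """
    subject = next((name for name, box in frame.items() if hit(click, box)), None)
    if subject is None or subject not in relations:
        return None
    obj, predicate = relations[subject]
    return Triple(subject, predicate, obj)

# Hand-coded stand-in for model predictions, mirroring Figure 1's examples.
RELATIONS = {"dog": ("carpet", "sitting"), "child": ("dog", "playing")}

print(click_to_triple((80, 100), FRAME, RELATIONS))   # click lands on the dog
print(click_to_triple((180, 90), FRAME, RELATIONS))   # click lands on the child
```

In the real system the click prompts SAM2 to segment and track the subject across frames, and the downstream modules predict the interacting object and predicate; here both are replaced by lookups to keep the control flow visible.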
Click2Graph builds on SAM2, a promptable segmentation model that produces precise masks from visual cues but lacks semantic reasoning. It adds two components: a Dynamic Interaction Module (DIDM) and a Semantic Classification Head (SCH). The DIDM, illustrated in Figure 3, generates subject-conditioned prompts to find interacting objects, using a transformer-based approach to predict their locations. The SCH then classifies subject, object, and predicate labels from the segmented masks, enabling full scene graph generation. The system was trained on the OpenPVSG dataset with a multi-task loss function incorporating mask, localization, and semantic prediction losses, and it runs at about 10 frames per second on standard hardware.
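The multi-task objective can be sketched as a weighted sum of the three loss terms named above. The specific loss choices (binary cross-entropy for masks, L1 for boxes, softmax cross-entropy for labels) and the unit weights are assumptions for illustration; the summary does not give the paper's exact formulation.

```python
import math

def bce(pred, target):
    """Binary cross-entropy over mask probabilities (stand-in for the mask loss)."""
    eps = 1e-7
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def l1(pred, target):
    """Mean absolute error on box coordinates (stand-in for localization loss)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Softmax cross-entropy for a semantic label (subject/object/predicate)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def multitask_loss(mask_pred, mask_gt, box_pred, box_gt, sem_logits, sem_label,
                   w_mask=1.0, w_loc=1.0, w_sem=1.0):
    """Weighted sum of mask, localization, and semantic terms (weights assumed)."""
    return (w_mask * bce(mask_pred, mask_gt)
            + w_loc * l1(box_pred, box_gt)
            + w_sem * cross_entropy(sem_logits, sem_label))

# Near-perfect predictions should give a small combined loss.
loss = multitask_loss(
    mask_pred=[0.9, 0.1], mask_gt=[1, 0],
    box_pred=[10, 10, 50, 50], box_gt=[10, 10, 50, 50],
    sem_logits=[5.0, 0.0, 0.0], sem_label=0,
)
print(round(loss, 3))
```

Balancing the three terms is what lets a single backbone serve segmentation, localization, and classification jointly; in practice the weights would be tuned on validation data.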
Experiments on the OpenPVSG benchmark, detailed in Tables 2-4, show that Click2Graph achieves competitive performance despite generating fewer predictions than fully automated methods. For instance, it maintains strong spatial interaction recall and prompt localization recall, indicating reliable object discovery and segmentation. However, end-to-end semantic recall remains challenging due to fine-grained label confusions, such as distinguishing between 'child' and 'baby' or 'on' and 'sitting.' Qualitative analysis in Figure 4 highlights successes in recovering interactions and handling occlusions, but also reveals failures in predicate granularity, underscoring the difficulty of semantic reasoning in diverse video contexts.
The implications of this research are significant for making video analysis more accessible and practical. By enabling user guidance, Click2Graph allows non-experts to direct AI systems in safety-critical or complex scenarios, such as monitoring interactions in healthcare or autonomous driving. It complements fully automated pipelines by offering corrective capabilities, potentially reducing errors and improving trust in AI-driven decisions. The framework's robustness to different prompt types, including points, boxes, and masks, as shown in Table 3, further supports its deployment in real-world settings where user input may be imperfect.
Limitations of the current system include an inability for users to directly modify predicted labels during inference, and corrections do not yet feed back into the model dynamically. The paper notes that semantic classification is the primary bottleneck, with errors arising from visually similar categories in the OpenPVSG dataset. Future work could integrate language models to enhance predicate reasoning or develop multi-subject prompting for more complex interactions, as suggested in the conclusion.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.