
AI Learns to Compose Videos Like a Pro

A new method splits unlabeled videos into layers and merges them back, enabling realistic video editing without manual data—but faces a trade-off between visual fidelity and text control.

AI Research
March 26, 2026
4 min read

A new artificial intelligence system can now compose videos by seamlessly integrating moving subjects into different backgrounds, a task that has long challenged both automated tools and human editors. Developed by researchers from the University of Illinois Urbana-Champaign and Google, the system, called Split-then-Merge (StM), addresses a core problem in video generation: how to maintain precise control over dynamic elements without relying on extensive labeled datasets. This breakthrough could transform fields from film production to virtual reality, where realistic video compositing is essential for creating immersive experiences.

The key finding is that StM can generate coherent videos by learning from unlabeled footage alone, outperforming existing state-of-the-art methods in preserving motion and visual identity. The system takes a foreground video, such as a person walking, and a background video, like a city street, and merges them into a single video where the subject moves naturally within the new scene. For example, it can place a pig on a forest road or a lunar surface while adapting its motion and adding realistic shadows, as shown in Figure 1 of the paper. This goes beyond simple copy-paste by ensuring affordance-aware placement, meaning objects are positioned in physically plausible ways, like a swan in water rather than on ground.
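To see what "beyond simple copy-paste" means, consider the naive copy-paste baseline that StM is compared against in the results below: per-frame alpha-blending of the foreground into the background using its mask. The sketch below (with invented variable names, not the authors' code) makes clear why this alone cannot adapt motion, reposition the subject plausibly, or add shadows.

```python
import numpy as np

def copy_paste_composite(foreground: np.ndarray, background: np.ndarray,
                         mask: np.ndarray) -> np.ndarray:
    """Naive per-frame alpha blend: the baseline StM is designed to surpass.

    foreground, background: (T, H, W, 3) float frames in [0, 1].
    mask:                   (T, H, W, 1) soft foreground mask in [0, 1].
    """
    # The subject is pasted verbatim: no affordance-aware placement,
    # no motion adaptation, no shadows or relighting.
    return mask * foreground + (1.0 - mask) * background
```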

The methodology relies on a two-stage process: splitting and merging. First, the Decomposer uses off-the-shelf models to automatically split unlabeled videos into layers—a foreground subject, a background scene, a mask, and a text caption—without human annotation. This creates a training dataset called StM-50K, comprising 50,000 video clips from sources like Panda-70M and Animal Kingdom. Then, the Composer, built on a latent diffusion transformer, learns to merge these layers back into the original video through a self-composition approach. Critical innovations include a transformation-aware training pipeline that applies random augmentations to the foreground to prevent shortcuts, and an identity-preservation loss that balances foreground fidelity with harmonious blending, using a weighted sum of foreground and background sub-losses as defined in Equation 4; a minimal sketch of that loss follows.
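The paper defines the identity-preservation loss as a weighted sum of foreground and background sub-losses. Here is a minimal PyTorch sketch of that idea, assuming the sub-losses are mask-weighted reconstruction errors and that the weights lambda_fg and lambda_bg are hyperparameters; the exact formulation in the paper's Equation 4 may differ.

```python
import torch

def identity_preservation_loss(pred: torch.Tensor, target: torch.Tensor,
                               fg_mask: torch.Tensor,
                               lambda_fg: float = 1.0,
                               lambda_bg: float = 1.0) -> torch.Tensor:
    """Weighted sum of foreground/background sub-losses (sketch of Eq. 4).

    pred, target: (B, T, C, H, W) video tensors (or their latents).
    fg_mask:      (B, T, 1, H, W) soft foreground mask in [0, 1].
    """
    sq_err = (pred - target) ** 2
    # Foreground term: penalizes identity drift of the composited subject.
    fg_loss = (sq_err * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
    # Background term: keeps the scene intact while allowing harmonization.
    bg_mask = 1.0 - fg_mask
    bg_loss = (sq_err * bg_mask).sum() / bg_mask.sum().clamp(min=1.0)
    return lambda_fg * fg_loss + lambda_bg * bg_loss
```

Weighting the two terms separately is what lets the model trade off strict foreground fidelity against harmonious blending into the new scene.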

Results from the paper demonstrate StM's superiority across multiple metrics. In quantitative evaluations on a test set of 93 unseen video pairs, StM achieved the highest scores for identity preservation, with 84.82 for foreground and 92.88 for background, and the best motion alignment, with 1.22 for foreground action and 16.36 for background motion, as detailed in Table 2. Qualitative comparisons in Figures 5 and 6 show that StM preserves complex dynamics, such as a goat's running motion under rapid camera movement, while baselines like Copy-Paste + I2V or SkyReels often fail, producing static backgrounds or distorted appearances. User studies with 50 participants and VLLM-based judges further confirmed StM's advantages, with win rates over 80% for metrics like motion alignment and overall quality, as reported in Table 3.

The implications of this research are significant for industries reliant on video content creation, such as entertainment, advertising, and education. By automating the tedious process of video compositing, StM could reduce production costs and time, enabling more dynamic and personalized media. For instance, it could allow filmmakers to insert actors into virtual sets with realistic interactions, or help educators create engaging instructional videos by combining elements from different sources. The method's scalability, owing to its use of unlabeled data, also makes it adaptable to diverse applications without the need for costly annotations.

However, the paper acknowledges limitations, including a trade-off between visual fidelity and textual alignment. StM prioritizes preserving the input videos' motion and appearance, which sometimes reduces strict adherence to text prompts, as reflected in lower textual alignment scores compared to text-guided baselines. Additionally, performance depends on the quality of the off-the-shelf decomposition models, and errors in foreground mask extraction can produce artifacts. Future work may focus on striking a better balance with text control and on improving decomposition robustness to enhance overall reliability.

Original Source

Read the complete research paper on arXiv.

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn