Editing a 3D scene captured from multiple angles has traditionally required skilled artists or dense sets of images to maintain consistency. A new AI framework now makes this possible with just a handful of casually taken photos, opening doors for applications in product visualization, real estate, and augmented reality. Researchers from Technion and NVIDIA have developed InstructMix2Mix (I-Mix2Mix), a system that can modify a scene according to a text instruction—like 'turn him into a knight' or 'make it snowy'—while ensuring the edits look coherent from every viewpoint, even when only four input images are available. This addresses a critical limitation in current 3D editing tools, which often struggle with sparse views, producing artifacts or inconsistent details that break the illusion of a unified scene.
The core idea is that by distilling the editing capabilities of a powerful 2D image editor into a pre-trained multi-view diffusion model, the system can achieve robust 3D consistency without requiring a dense neural field representation. The researchers used InstructPix2Pix as the teacher model for editing and Stable Virtual Camera (SEVA) as the student model, which carries an inherent data-driven prior for generating view-consistent scenes. A tailored version of Score Distillation Sampling (SDS) personalizes the student model to the target scene and edit instruction, enabling it to output a set of edited images that are both faithful to the prompt and geometrically coherent across all provided viewpoints. This approach bypasses the need for per-scene optimization of neural radiance fields or 3D Gaussian splatting, which typically demand many input views to function reliably.
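The SDS mechanics can be illustrated with a deliberately tiny numpy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: `teacher_eps` caricatures InstructPix2Pix's noise prediction as a pull toward a fixed `edited_target`, and the "image" is just a small vector.

```python
import numpy as np

rng = np.random.default_rng(0)
edited_target = 1.0  # hypothetical "edited" value the teacher steers toward

def teacher_eps(x_noisy):
    # Toy stand-in for the 2D teacher's noise prediction: it points from the
    # noisy sample toward the edited target, much as InstructPix2Pix points
    # toward the instructed edit.
    return x_noisy - edited_target

def sds_update(x, sigma=0.2, lr=0.05):
    # One Score Distillation Sampling step: perturb the current sample, query
    # the teacher, and descend along (predicted noise - injected noise).
    noise = rng.standard_normal(x.shape)
    eps_pred = teacher_eps(x + sigma * noise)
    return x - lr * (eps_pred - noise)

x = np.full(8, 5.0)  # start far from the edit
for _ in range(400):
    x = sds_update(x)
# x has drifted close to edited_target
```

The point of the sketch is the shape of the update: the teacher is never backpropagated through; its prediction residual alone supplies the gradient, which is what lets a frozen 2D editor supervise a different student model.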
The methodology involves several novel adaptations of the standard SDS pipeline to accommodate a multi-view diffusion student. Instead of rendering from a neural field, the student generates samples via its denoising trajectory, with distillation performed incrementally across timesteps to avoid costly full sampling runs. A key innovation is the replacement of the conventional neural-field consolidator with the multi-view model itself, leveraging its built-in 3D prior. The process includes an initialization step in which one reference image is edited by the teacher and encoded to guide the student, followed by iterative distillation stages: student query, alignment via simple bilinear interpolation, perturbation with a specialized stochastic noise schedule, teacher prediction enhanced by a Random Cross-View Attention mechanism, and student weight updates. This attention mechanism, which aligns all frames to a randomly selected key frame each iteration, strengthens cross-view coherence without additional computational cost, as detailed in Figure 1 of the paper.
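The iterative stages can be caricatured in numpy. All names here are toy stand-ins under loose assumptions: `random_cross_view_attention` blends each frame with a randomly chosen key frame (the paper's mechanism instead routes attention keys and values to the key frame inside the teacher's layers), the teacher is reduced to pointing at a fixed edited target, and the "student" is just the current set of frames rather than a diffusion model.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cross_view_attention(frames, rng):
    # Toy stand-in for Random Cross-View Attention: mix every frame with one
    # randomly chosen key frame per iteration. A fixed 50/50 blend keeps the
    # sketch simple while still tying the views together.
    key = rng.integers(len(frames))
    return 0.5 * frames + 0.5 * frames[key]

def distill_step(frames, teacher_edit, sigma=0.2, lr=0.05):
    # 1) Student query: the student's current multi-view sample
    #    (toy: just its state; the real student denoises along its trajectory).
    x = frames
    # 2) Alignment: the paper uses bilinear interpolation to match teacher and
    #    student resolutions; shapes already agree here, so it is the identity.
    x_aligned = x
    # 3) Perturbation with a stochastic noise schedule.
    noise = rng.standard_normal(x.shape)
    x_noisy = x_aligned + sigma * noise
    # 4) Teacher prediction, with cross-view coherence injected by the
    #    random-key blend above.
    eps_pred = random_cross_view_attention(x_noisy, rng) - teacher_edit
    # 5) Student update along the SDS-style residual.
    return frames - lr * (eps_pred - noise)

frames = rng.standard_normal((4, 16)) * 2.0   # four sparse "views"
target = np.ones((4, 16))                     # hypothetical edited scene
for _ in range(600):
    frames = distill_step(frames, target)
# all four views end up near the target and close to one another
```

The random key frame is the interesting design choice: because a different view anchors the attention each iteration, no single view dominates, yet every update nudges the views toward mutual agreement at no extra forward-pass cost.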
Experimental results demonstrate that I-Mix2Mix significantly outperforms prior methods in multi-view consistency while maintaining competitive per-frame edit quality. Quantitative evaluations on scenes from datasets such as Instruct-NeRF2NeRF and CO3D show that it achieves the highest CLIP Directional Consistency score (0.337 vs. 0.287 for the next best baseline, DGE), indicating better preservation of the semantic edit direction across views. Qualitative comparisons in Figure 4 reveal that baseline methods such as Instruct-GS2GS and DGE often produce inconsistencies like mismatched textures or Janus-like artifacts in sparse-view settings, whereas I-Mix2Mix delivers coherent edits. A human study further confirms this advantage: raters identified fewer inconsistencies in I-Mix2Mix outputs (1.34 on average) than in DGE's (2.02), and I-Mix2Mix won in 75% of scenes, with statistically significant differences.
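A plausible reading of the consistency metric, with random vectors standing in for actual CLIP features: each view contributes an "edit direction" (edited embedding minus original embedding), and the score averages the cosine similarity of those directions over all view pairs, so an edit applied coherently to every view scores near 1. The exact formulation in the paper may differ in details.

```python
import numpy as np

rng = np.random.default_rng(0)

def directional_consistency(orig_feats, edit_feats):
    # Each view's edit direction = edited embedding - original embedding;
    # the score is the mean cosine similarity over all distinct view pairs.
    dirs = edit_feats - orig_feats
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    pair_sims = dirs @ dirs.T
    i, j = np.triu_indices(len(dirs), k=1)
    return pair_sims[i, j].mean()

views = rng.standard_normal((4, 32))       # stand-in CLIP features, 4 views
shared_shift = rng.standard_normal(32)     # a coherent edit shifts all views alike
score_coherent = directional_consistency(views, views + shared_shift)
score_random = directional_consistency(views, views + rng.standard_normal((4, 32)))
# score_coherent is (numerically) 1.0; independent per-view shifts score far lower
```

This is why the metric rewards view agreement rather than raw edit strength: an editor that changes each view in its own way can still look good per frame, but its directions decorrelate and the pairwise cosine average drops.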
The implications of this work are substantial for industries relying on visual content creation, as it enables efficient 3D scene editing from limited data, reducing the need for extensive photography or manual labor. However, the approach inherits limitations from its backbone models: InstructPix2Pix and SEVA can struggle with certain edit prompts or perfect consistency, and the distillation process is computationally intensive, taking about 40 minutes per edit on a high-end GPU, which is more than twice as slow as some competitors. Additionally, while the framework is modular and could integrate stronger future models, current extensions to tasks like multi-view conditional generation using ControlNets tend to produce blurry outputs, a known artifact of SDS-based optimization. Future work may focus on speeding up the distillation and improving generalization beyond editing tasks, but for now, I-Mix2Mix represents a significant step toward accessible and consistent 3D content manipulation from sparse inputs.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.