
Meta's WorldGen: From Text to Traversable 3D Worlds in Minutes

A new AI system automates large-scale virtual environment creation, bridging language and interactive design.

AI Research
March 26, 2026
4 min read

Imagine typing a simple phrase like 'medieval village' and watching a fully traversable, interactive 3D world materialize before your eyes, ready for exploration in a game engine. This is the promise of WorldGen, a groundbreaking system from Meta's Reality Labs that transforms natural language descriptions into large-scale, functional virtual environments. Announced in a technical report dated November 2025, WorldGen represents a significant leap in generative AI for 3D content, aiming to democratize world-building by eliminating the need for manual modeling or specialized expertise. By bridging the gap between creative intent and executable digital spaces, the technology could revolutionize game development, simulation, and immersive social experiences, offering a glimpse into a future where anyone can become a creator of complex virtual worlds.

The core innovation of WorldGen lies in its modular, four-stage pipeline that systematically converts a text prompt into a coherent 3D scene. The process begins with Scene Planning, where a Large Language Model (LLM) parses the user's prompt into structured JSON parameters to drive a procedural generator. This creates a 3D blockout—a rough geometric sketch of the scene's layout—ensuring navigability through features like open spaces and chokepoints. From this blockout, the system extracts a navigation mesh (navmesh) defining walkable areas and generates a reference image using a depth-conditioned diffusion model, which establishes the scene's theme and style. This stage guarantees functional layouts, addressing a key weakness where even advanced image generators often fail to produce traversable scenes.
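The Scene Planning stage described above can be sketched in a few lines. This is an illustrative mock-up, not Meta's code: the JSON schema, field names, and validation rule are all assumptions standing in for whatever structured parameters the LLM actually emits to drive the procedural blockout generator.

```python
import json

def plan_from_llm_json(llm_output: str) -> dict:
    """Parse LLM-produced JSON parameters for the procedural generator.

    The field names below are hypothetical; the point is that the LLM's
    free-form understanding of the prompt is funneled into a strict,
    machine-checkable structure before any geometry is generated.
    """
    params = json.loads(llm_output)
    # A traversable layout needs open space; reject degenerate plans early.
    if params["open_area_ratio"] <= 0.2:
        raise ValueError("layout risks being untraversable")
    return params

# Stand-in for the LLM's response to the prompt "medieval village".
llm_output = json.dumps({
    "theme": "medieval village",
    "grid_size": [50, 50],        # metres, matching the ~50x50 m scenes
    "open_area_ratio": 0.4,       # fraction of walkable open space
    "chokepoints": 3,             # narrow passages linking regions
    "landmarks": ["church", "market square", "well"],
})

params = plan_from_llm_json(llm_output)
print(params["theme"], params["grid_size"])
```

The key design point is the handoff: once intent is captured as structured parameters, the blockout, navmesh extraction, and reference-image generation can proceed deterministically from the same plan.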

Next, Scene Reconstruction takes the plan—comprising the blockout, reference image, and navmesh—and produces a holistic, low-resolution 3D mesh of the entire scene. The team leverages AssetGen2, Meta's state-of-the-art image-to-3D model, fine-tuned to condition generation on both the reference image and the navmesh. This dual conditioning is crucial: it ensures the generated geometry aligns with navigable regions, preserving connectivity and reducing artifacts like unreachable areas, even in occluded parts not visible in the reference image. The result is a single textured mesh that maintains global coherence, though at a resolution insufficient for final use, serving as a scaffold for subsequent refinement.
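The effect of navmesh conditioning can be illustrated with a toy check (again an assumption-laden sketch, not WorldGen's actual mechanism): a 2D occupancy grid stands in for the reconstructed geometry, and the conditioning constraint amounts to requiring that no generated solid occupies a cell the navmesh marks as walkable.

```python
def respects_navmesh(occupancy, navmesh_cells):
    """Return True if no generated geometry blocks a walkable cell.

    `occupancy` is a toy 2D grid (1 = solid geometry, 0 = empty);
    `navmesh_cells` is the set of (row, col) cells the planner marked
    walkable. Both representations are illustrative simplifications.
    """
    return all(occupancy[r][c] == 0 for (r, c) in navmesh_cells)

# A 3x3 scene with a vertical walkable corridor down the middle column.
occupancy = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]
navmesh = {(0, 1), (1, 1), (2, 1), (2, 0)}
print(respects_navmesh(occupancy, navmesh))  # corridor stays open: True
```

In the real system this constraint is enforced softly, through conditioning the fine-tuned AssetGen2 model on the navmesh, rather than through a hard post-hoc check like this one.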

The third stage, Scene Decomposition, breaks this monolithic mesh into individual, semantically meaningful objects using an enhanced version of AutoPartGen, a model for autoregressive part generation. To handle the complexity of scenes with many objects, the system accelerates inference by generating parts in order of connectivity degree—prioritizing structural anchors like the ground—before decomposing the remainder via connected-component analysis. This step, fine-tuned on a curated dataset of scene-level assets, efficiently segments the scene into components such as buildings, trees, and props, enabling localized editing and enhancement without regenerating the entire world.
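The ordering heuristic is straightforward to sketch. In this toy example (the contact graph and part names are invented for illustration), parts that touch many others, like the ground, are handled first, and whatever remains is split by a standard connected-component pass.

```python
# Which scene parts physically touch which (illustrative contact graph).
contacts = {
    "ground": ["house", "tree", "well", "wall"],
    "house":  ["ground", "wall"],
    "tree":   ["ground"],
    "well":   ["ground"],
    "wall":   ["ground", "house"],
    "cart":   ["barrel"],   # a small cluster not touching the ground
    "barrel": ["cart"],
}

# 1) Order parts by connectivity degree: structural anchors first.
by_degree = sorted(contacts, key=lambda p: len(contacts[p]), reverse=True)

# 2) Split the remainder into connected components via depth-first search.
def components(graph):
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(graph[node])
        comps.append(comp)
    return comps

print(by_degree[0])               # 'ground': highest degree, generated first
print(len(components(contacts)))  # 2: main scene plus the cart/barrel cluster
```

Handling high-degree anchors first means the autoregressive model resolves the pieces that constrain everything else before spending inference on isolated props.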

Finally, Scene Enhancement elevates each object's quality through per-object refinement. For each decomposed low-resolution mesh, the system renders a view and uses an LLM-Vision model to generate a high-resolution image, hallucinating fine details while maintaining style consistency with the global scene. A mesh refinement model then reconstructs the object's geometry from this image and the coarse shape, preserving orientation and overall structure. High-resolution textures are synthesized via a multi-view generation process conditioned on normal maps and a de-lit (lighting-removed) version of the enhanced image, employing disentangled attention for cross-view consistency. The output is a set of high-fidelity, textured meshes that assemble into a cohesive, navigable world.
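The enhancement stage is, structurally, an independent loop over decomposed objects, which is what makes the parallel processing mentioned below possible. The sketch here replaces each model with a trivial stub (all function behaviour is placeholder logic, not Meta's models) purely to show the shape of the per-object pipeline.

```python
def render_view(obj):
    """Stand-in for rendering a view of the coarse, low-resolution mesh."""
    return f"view_of_{obj['name']}"

def hallucinate_hires(view, style):
    """Stand-in for the LLM-Vision model that adds fine detail in-style."""
    return f"hires_{view}_in_{style}"

def refine_mesh(obj, hires_image):
    """Stand-in for mesh refinement from the image plus the coarse shape."""
    return {**obj, "resolution": "high", "texture_ref": hires_image}

def enhance_scene(objects, style="medieval"):
    enhanced = []
    for obj in objects:              # each object refined independently,
        view = render_view(obj)      # so the loop parallelizes trivially
        hires = hallucinate_hires(view, style)
        enhanced.append(refine_mesh(obj, hires))
    return enhanced

scene = [{"name": "house", "resolution": "low"},
         {"name": "well", "resolution": "low"}]
result = enhance_scene(scene)
print([o["resolution"] for o in result])  # ['high', 'high']
```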

Results showcased in the paper demonstrate WorldGen's ability to generate diverse, visually rich scenes—from medieval towns to sci-fi colonies—that are geometrically consistent and immediately deployable in game engines like Unreal or Unity. Qualitative comparisons highlight advantages over prior work: single-shot image-to-3D models lack detail and compositionality, while view-based generators like Marble produce limited 'bubbles' whose fidelity degrades beyond a few meters. In contrast, WorldGen scenes span approximately 50x50 meters, maintain consistency throughout, and output standard meshes rather than less portable representations like Gaussian splats. The entire pipeline completes in about five minutes with parallel processing, enabling rapid prototyping.

However, the system has notable limitations. It relies on a single reference view, restricting scene scale and complicating the generation of vast open worlds or multi-layered environments without stitching artifacts. The independent object representation also raises efficiency concerns for very large scenes due to a lack of geometry or texture reuse. Future work may address scalability through tiling strategies and material sharing. Despite these limitations, WorldGen marks a pivotal step toward accessible, generative world-building, potentially reshaping how interactive 3D content is created for gaming, simulation, and beyond.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.
