
New GPU Design Makes 3D Graphics Faster and More Efficient

A RISC-V-based GPU prototype accelerates 3D Gaussian Splatting, achieving up to 152 FPS for rendering and 38.6 iterations per second for training, bridging the gap to real-time applications in AR/VR and robotics.

AI Research
March 26, 2026
4 min read

A new GPU architecture could transform how devices handle complex 3D graphics, making real-time rendering and training more feasible for applications like augmented reality, robotics, and volumetric video. The research, detailed in a paper from the University of Texas at Austin, introduces Vorion, a RISC-V-based GPU prototype that hardware-accelerates 3D Gaussian Splatting (3DGS), a technique for creating photorealistic 3D scenes. Current implementations struggle with performance: for example, on edge devices like the Jetson Orin NX, rendering often falls below 20 frames per second (FPS), far from the 60–90 FPS needed for immersive AR/VR, while training on high-end GPUs like the RTX 4090 can take 30–90 minutes per scene. This gap limits the adoption of 3DGS in real-time scenarios, but Vorion aims to close it by integrating dedicated hardware into the graphics pipeline.

The key finding is that Vorion significantly speeds up both rendering and training for 3D Gaussian Splatting. The researchers discovered that alpha blending and gradient accumulation are the dominant bottlenecks, accounting for over 70% of rendering time and more than 60% of training runtime. By accelerating these stages with a custom Gaussian rasterizer, Vorion achieves substantial performance gains. In tests, a silicon prototype with 8 SIMT cores and 2 rasterizers delivered 19 FPS for rendering and up to 4.97 iterations per second for training, representing speedups of 30.2–38.4 times for rendering and 51–58.4 times for training compared to software-only implementations on the same hardware. A scaled design with 64 SIMT cores and 16 rasterizers pushed these numbers to 152 FPS for rendering and 38.6 iterations per second for training, demonstrating near-linear scalability as more rasterization units are added.
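To see why alpha blending dominates, it helps to look at the inner loop that 3DGS runs for every pixel. Below is a minimal Python sketch of the standard front-to-back compositing used by 3DGS (the Gaussian values are made-up illustration data, not from the paper); every covered pixel repeats this loop over its depth-sorted Gaussians, and this is exactly the work Vorion's rasterizer moves into hardware.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending as in 3D Gaussian Splatting:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j),
    with colors (N, 3) and alphas (N,) sorted near-to-far."""
    pixel = np.zeros(3)
    transmittance = 1.0            # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:   # early exit once the pixel is opaque
            break
    return pixel

# Illustrative values only: three Gaussians overlapping one pixel.
colors = np.array([[1.0, 0.2, 0.1], [0.1, 0.9, 0.2], [0.2, 0.3, 1.0]])
alphas = np.array([0.5, 0.3, 0.8])
print(composite_pixel(colors, alphas))
```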

The methodology involved designing a GPU architecture that builds on the RISC-V-based Vortex platform, with modifications to support 3DGS efficiently. The team analyzed the 3DGS pipeline, identifying bottlenecks through runtime breakdowns on datasets like CoMap and Tanks & Temples. They proposed three main strategies: using larger 64x64 tile sizes to reduce Gaussian invocations by over 80%, introducing z-tiling to parallelize processing along the depth axis, and implementing a hybrid dataflow that switches between Gaussian-centric and pixel-centric modes in heavily occluded scenes. The Gaussian rasterizer, which mimics traditional triangle rasterizers, includes a Gaussian buffer, a pixel buffer, and raster lanes to handle blending and gradient computations. For training, the rasterizer was extended to compute color and opacity gradients directly in hardware, while other gradients were offloaded to programmable kernels.
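The Gaussian-centric and pixel-centric dataflows can be pictured as a loop-order swap. The sketch below is an illustration under my own naming (the tile and coverage representation is simplified, not the paper's hardware layout): the Gaussian-centric loop scatters each depth-sorted Gaussian into the pixels it covers, while the pixel-centric loop gathers per pixel and can stop as soon as that pixel saturates.

```python
def blend(pixel, color, alpha):
    """One compositing step; pixel is a list [r, g, b, transmittance]."""
    r, g, b, t = pixel
    pixel[:] = [r + t * alpha * color[0],
                g + t * alpha * color[1],
                b + t * alpha * color[2],
                t * (1.0 - alpha)]

def gaussian_centric(gaussians, pixels):
    # Outer loop over depth-sorted Gaussians; each one scatters into
    # every pixel index it covers within the tile.
    for color, alpha, covered in gaussians:   # covered: set of pixel indices
        for i in covered:
            blend(pixels[i], color, alpha)

def pixel_centric(gaussians, pixels):
    # Outer loop over pixels; each gathers contributions and breaks
    # once effectively opaque -- the win in heavily occluded tiles.
    for i, px in enumerate(pixels):
        for color, alpha, covered in gaussians:
            if px[3] < 1e-4:                  # transmittance exhausted
                break
            if i in covered:
                blend(px, color, alpha)
```

Both orders compute the same image; the hybrid dataflow simply picks whichever wastes less work for the tile at hand.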

The results, based on a prototype fabricated in TSMC 16nm FinFET technology and on post-layout simulations, show compelling performance improvements. As reported in the paper (Figure 8), the silicon prototype operated at frequencies from 100 MHz to 530 MHz and voltages from 0.57 V to 1.15 V, with power staying below 600 mW. The scaled design, simulated at 500 MHz, achieved up to 152 FPS for rendering and 38.6 iterations per second for training. Compared to prior accelerators, Vorion maintains full programmability and FP32 precision without accuracy loss, and achieves 1.7 times better area efficiency per rasterizer than GPU rasterizer designs. The researchers extrapolate that with future scaling to 1,024 rasterizers, training times could drop below 3 seconds per optimization pass, moving closer to real-time operation.
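The "near-linear" claim is easy to sanity-check from the reported figures, treating the two configurations as otherwise comparable: an 8x increase in rasterizers (2 to 16) yields an 8x rendering speedup (19 to 152 FPS) and roughly 7.8x for training (4.97 to 38.6 iterations per second). A quick back-of-the-envelope check:

```python
# Scaling sanity check using only the numbers reported above.
small = {"rasterizers": 2,  "fps": 19,  "train_ips": 4.97}
large = {"rasterizers": 16, "fps": 152, "train_ips": 38.6}

units = large["rasterizers"] / small["rasterizers"]   # 8.0x rasterizers
fps   = large["fps"] / small["fps"]                   # 8.0x rendering
train = large["train_ips"] / small["train_ips"]       # ~7.77x training

print(f"{units:.0f}x units -> {fps:.1f}x rendering, {train:.2f}x training")
```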

The implications of this work are broad for real-world applications. By making 3D Gaussian Splatting faster and more efficient, Vorion could enable real-time 4D video capture, improve robotic perception systems, and enhance AR/VR experiences with smoother, more interactive graphics. The architecture's scalability means it can be adapted from edge devices to server-class hardware, supporting use cases from mobile platforms to high-performance workstations. This advancement addresses a critical paradox in the field: current hardware limits the quality of 3DGS, those limitations hinder the algorithm's adoption, and low adoption in turn reduces the motivation to build custom hardware. Vorion breaks this cycle by offering low-cost integration into next-generation GPUs.

However, the paper notes several limitations. Z-tiling, which enables depth-parallel execution, is not applicable during training because the backward pass must process Gaussians in back-to-front order, though this is mitigated by existing data parallelism. The hybrid dataflow, which switches to pixel-centric mode in occluded scenes, is optional and primarily benefits dense indoor environments; its effectiveness varies with scene complexity. Additionally, the current implementation relies on FP32 precision; the paper mentions that future integration of quantization and compression techniques could improve performance further, but these are out of scope and not evaluated. The research also does not address broader adoption challenges, such as software ecosystem support or energy efficiency in ultra-low-power devices, leaving room for future work.
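The back-to-front constraint on the backward pass follows from the structure of the blending gradient: the derivative of the pixel color with respect to a Gaussian's opacity depends on the total contribution of every Gaussian behind it, which is cheapest to keep as a running suffix accumulator while walking the list in reverse. Here is a minimal sketch of that pattern for the standard 3DGS blending equation (variable names and the single-pixel simplification are mine, not the paper's):

```python
def alpha_grads(colors, alphas, grad_pixel, T_final):
    """d(pixel color)/d(alpha_i) for front-to-back blending
    C = sum_i c_i * alpha_i * T_i, with T_i = prod_{j<i} (1 - alpha_j).
    T_final is the transmittance left after the last Gaussian.
    Assumes alphas are clamped below 1, as standard 3DGS does."""
    n = len(alphas)
    grads = [0.0] * n
    T = T_final
    suffix = [0.0, 0.0, 0.0]       # sum of c_j * alpha_j * T_j for j > i
    for i in range(n - 1, -1, -1):           # back-to-front, by necessity
        T /= 1.0 - alphas[i]                 # recover T_i from T_{i+1}
        for k in range(3):
            # dC_k/dalpha_i = c_i[k] * T_i - suffix[k] / (1 - alpha_i)
            grads[i] += grad_pixel[k] * (colors[i][k] * T
                                         - suffix[k] / (1.0 - alphas[i]))
        for k in range(3):
            suffix[k] += colors[i][k] * alphas[i] * T
    return grads
```

Because `suffix` must already contain every Gaussian behind position i before i's gradient can be computed, the depth axis cannot be split into independent z-tiles here, which is exactly the limitation the paper describes.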

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn