AIResearch
Hardware

AI Teaches Itself to Write Faster Code

A new method uses reinforcement learning to automatically generate high-performance computing kernels, achieving a geometric mean speedup of 6.65× over existing approaches without human intervention.

AI Research
November 06, 2025
3 min read

As machine learning models grow increasingly complex and computing hardware becomes more diverse, achieving optimal performance has become a significant challenge. Researchers from ETH Zurich have developed PerfLLM, a system that uses artificial intelligence to automatically discover high-performance implementations of computational kernels—the fundamental building blocks of modern computing—across different hardware architectures. This breakthrough could dramatically reduce the engineering effort required to optimize software for new processors, making high-performance computing more accessible and efficient.

The key finding is that PerfLLM can automatically generate optimized code implementations that outperform existing state-of-the-art libraries. On the GH200 processor, PerfLLM achieved a geometric mean speedup of 6.65× relative to previous methods and 13.65× relative to other approaches. For specific operations like elementwise multiplication, PerfLLM's implementations showed 1.62× speedup over TVM and 1.22× over PyTorch on MI300A hardware, while on GH200 it achieved 1.71× speedup over PyTorch and 3× over TVM.
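The headline numbers are geometric mean speedups, the standard way to aggregate per-kernel ratios, since it prevents one outlier kernel from dominating and treats a 2× gain and a 0.5× loss as cancelling out. A minimal illustration (the per-kernel ratios below are made up for the example, not taken from the paper):

```python
import math

def geomean_speedup(ratios):
    """Geometric mean of per-kernel speedup ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical per-kernel speedups over a baseline library.
ratios = [2.0, 0.5, 4.0]
print(geomean_speedup(ratios))  # the 2x gain and 0.5x loss cancel, leaving ~1.587 (= 4^(1/3))
```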

The methodology centers on PerfDojo, a framework that treats code optimization as a game where AI agents apply transformations to improve performance while maintaining program correctness. The system uses large language models to encode program representations and reinforcement learning to navigate the vast space of possible optimizations. Unlike traditional approaches that rely on human expertise, PerfLLM employs a modified reinforcement learning algorithm called Max Q-learning that focuses on discovering the single best-performing trajectory rather than maximizing cumulative reward, making it particularly effective for finding optimal code implementations.
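The best-trajectory idea behind Max Q-learning can be sketched with a tabular update whose target keeps the maximum of the current estimate and the newly observed backed-up value, rather than moving toward an expectation as standard Q-learning does. This is a simplified illustration of the principle described above, not the paper's actual algorithm, which operates over LLM-encoded program states:

```python
from collections import defaultdict

def max_q_update(Q, state, action, reward, next_state, actions):
    """Best-trajectory backup: keep the max of the current estimate and
    the newly observed return; a worse sample never lowers the estimate."""
    backed_up = reward + max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = max(Q[(state, action)], backed_up)
    return Q[(state, action)]

# Toy example: one program state, two code transformations (names hypothetical).
Q = defaultdict(float)
actions = ["vectorize", "tile"]
max_q_update(Q, "s0", "vectorize", 1.5, "s1", actions)
max_q_update(Q, "s0", "vectorize", 0.2, "s1", actions)  # worse sample is ignored
print(Q[("s0", "vectorize")])  # stays at 1.5
```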

Results analysis shows that PerfLLM successfully discovered optimization techniques that humans might overlook. For example, in elementwise multiplication, the AI system identified that vectorizing the innermost dimension with size 4 and using 128-bit loads instead of 32-bit loads could improve performance—techniques not implemented in standard libraries. The system's performance improvements are demonstrated across multiple hardware platforms including x86, Arm, and RISC-V processors, as well as AMD MI300A and Nvidia GH200 accelerators. Figure 1b shows the geometric mean speedup of 6.65× on GH200 hardware, while Figure 13 compares performance across various kernels.
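The innermost-dimension vectorization can be pictured in NumPy: splitting the last axis into chunks of 4 contiguous float32 values means each chunk occupies exactly 128 bits (4 × 32 bits), so the hardware can service it with one wide load instead of four scalar ones. This is a conceptual sketch of the transformation, not PerfLLM's generated code:

```python
import numpy as np

# Elementwise multiply with the innermost dimension split into chunks of 4.
# Four contiguous float32 values = 4 x 32 bits = 128 bits, i.e. one wide load.
a = np.arange(16, dtype=np.float32).reshape(2, 8)
b = np.full((2, 8), 2.0, dtype=np.float32)

# Transformation: view the last axis as (outer, 4) so each inner row
# corresponds to one 128-bit vector register's worth of data.
a4 = a.reshape(2, 2, 4)
b4 = b.reshape(2, 2, 4)
out = (a4 * b4).reshape(2, 8)

assert np.array_equal(out, a * b)  # same result, chunked memory-access pattern
```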

This research matters because manually optimizing code for different hardware architectures is time-consuming and requires specialized expertise. The automated approach could significantly reduce development costs and make high-performance computing more accessible. For regular users, this could translate to faster machine learning training, more efficient scientific simulations, and improved performance in everyday applications without requiring manual optimization efforts.

The approach has limitations, primarily in computational cost—the PerfLLM optimization process takes 5× to 100× longer than heuristic-guided methods. Optimizing a single kernel can require up to eight hours, and tuning a complete library of approximately 160 operators would need an estimated 1,280 node-hours. Additionally, the system currently focuses on a specific set of program transformations and may not discover all possible optimization opportunities, particularly for novel hardware features not represented in the training environment.
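The cost figures above follow directly from the per-kernel numbers, and the reported 5×–100× overhead gives a rough sense of what a heuristic-guided tuner would spend on the same workload:

```python
node_hours = 160 * 8   # ~160 operators x up to 8 hours each
print(node_hours)      # 1280 node-hours for a full library

# The reported 5x-100x overhead implies a heuristic-guided tuner would
# cover the same library in roughly 12.8 to 256 node-hours.
print(node_hours / 100, node_hours / 5)  # 12.8 256.0
```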

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn