Hardware

A Universal Language for All GPUs Emerges from Hardware Necessity

Researchers identify ten fundamental computational primitives shared across all major GPU vendors, paving the way for a vendor-neutral instruction set that could break software lock-in and boost portable performance.

AI Research
April 01, 2026
3 min read

A new analysis reveals that despite years of proprietary competition, all major GPU architectures share a core set of computational building blocks driven by physical necessity, not convention. This opens the door to a universal instruction set architecture for GPUs, similar to ARM for CPUs, which could dismantle the software lock-in that currently ties code to specific hardware vendors like NVIDIA. For non-technical readers, this means future AI and scientific applications could run efficiently on any GPU without costly rewrites, potentially lowering costs and increasing innovation in high-performance computing.

The researchers systematically analyzed over 5,000 pages of documentation across NVIDIA, AMD, Intel, and Apple GPUs, spanning 16 microarchitectures and 15 years of evolution. They identified ten hardware-invariant primitives that appear in every architecture, such as lockstep thread groups, mask-based divergence for handling conditional code, and workgroup-scope barriers for synchronization. These elements are not arbitrary but stem from thermodynamic and information-theoretic constraints, such as the high energy cost of instruction fetch relative to arithmetic, which makes amortizing a single fetch across many threads essential for efficiency. The study argues that this convergence is inevitable, driven by the physical realities of parallel computation rather than by vendor imitation.
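Two of these primitives, lockstep execution and mask-based divergence, can be illustrated with a small Python sketch that simulates how a SIMT machine handles a branch: every lane in the wave steps through both sides of the conditional, and a per-lane boolean mask decides which lanes commit results on each path. The 8-lane wave and the doubling/incrementing branch bodies are hypothetical, chosen only for illustration, not taken from the paper.

```python
# Minimal SIMT simulation: an 8-lane wave executes both sides of a
# branch in lockstep; a per-lane active mask selects which lanes
# commit results on each path.
def simt_branch(values, threshold):
    mask = [v > threshold for v in values]      # divergence mask
    out = list(values)
    # "then" path: run by the whole wave, committed where mask is True
    for lane, active in enumerate(mask):
        if active:
            out[lane] = values[lane] * 2
    # "else" path: committed where mask is False
    for lane, active in enumerate(mask):
        if not active:
            out[lane] = values[lane] + 1
    return out

print(simt_branch([1, 5, 2, 9, 0, 7, 3, 8], threshold=4))
# lanes above the threshold are doubled, the rest incremented:
# [2, 10, 3, 18, 1, 14, 4, 16]
```

The cost model this implies is the one the paper's constraints predict: a divergent branch takes the time of both paths combined, which is why hardware keeps waves converged whenever it can.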

To conduct this cross-vendor analysis, the team compiled official ISA reference manuals, architecture whitepapers, patent filings, and community reverse-engineering efforts, with the Apple data flagged at lower confidence due to the lack of public documentation. They extracted details along eight dimensions, including execution model and memory hierarchy, and classified each as invariant, parameterizable, or divergent. This methodology allowed them to distinguish universal necessities, like the register-occupancy tradeoff governed by fixed SRAM area, from vendor-specific choices, such as wave width or support for double-precision floating-point operations. The analysis excluded graphics pipeline primitives and focused solely on general-purpose computation to maintain scope.
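The register-occupancy tradeoff mentioned above follows from simple arithmetic: a register file of fixed size must be divided among all resident threads, so the more registers each thread uses, the fewer threads a compute unit can keep in flight to hide memory latency. A sketch with an illustrative register-file size (64K 32-bit registers per compute unit, a figure typical of recent NVIDIA SMs; the number is an assumption, not from the paper):

```python
# Occupancy from a fixed register file: registers per thread and
# resident thread count trade off against each other directly.
REGISTER_FILE = 65536  # 32-bit registers per compute unit (illustrative)

def max_resident_threads(regs_per_thread):
    """Upper bound on threads a compute unit can host at once."""
    return REGISTER_FILE // regs_per_thread

for regs in (32, 64, 128, 256):
    print(f"{regs} regs/thread -> {max_resident_threads(regs)} threads")
# 32 regs/thread -> 2048 threads
# 64 regs/thread -> 1024 threads
# 128 regs/thread -> 512 threads
# 256 regs/thread -> 256 threads
```

Because SRAM area is fixed by the silicon budget, every vendor faces this same division, which is why the study classifies the tradeoff as invariant even though the exact register-file size is a parameter.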

Benchmarks show that on five of six benchmark-platform pairs tested, an abstract model based solely on these universal primitives matched or exceeded native vendor-optimized performance. For example, in a GEMM kernel on NVIDIA T4 hardware, the abstract model achieved 126.1% of native performance, while on Apple M1 it reached 101.2%. However, a reduction benchmark on NVIDIA T4 revealed a critical outlier: the abstract model achieved only 62.5% of native performance because it lacked intra-wave shuffle operations, which are essential for efficient data sharing within thread groups on that platform. This finding led the researchers to refine their model, adding shuffle as an eleventh mandatory primitive to ensure portable performance across all vendors.
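The shuffle primitive the study promoted to mandatory status can be sketched in Python: a butterfly reduction lets lanes exchange values at XOR-paired offsets, so a wave of width W sums itself in log2(W) steps with no round trip through shared memory. The simulation below is a hypothetical illustration of the access pattern that hardware shuffle instructions enable, not code from the paper.

```python
# Butterfly (xor-shuffle) reduction across a wave: at each step every
# lane i adds the value held by lane i XOR offset, halving the offset
# until all lanes hold the wave-wide sum.
def wave_reduce_sum(lanes):
    width = len(lanes)          # wave width; must be a power of two
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals                 # every lane now holds the full sum

print(wave_reduce_sum(list(range(8))))
# every lane ends with 0+1+...+7 = 28: [28, 28, 28, 28, 28, 28, 28, 28]
```

Without a shuffle primitive, each of those exchange steps must instead go through workgroup shared memory with a barrier in between, which is the extra traffic behind the 62.5% outlier on the reduction benchmark.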

The implications of this work are significant for industries reliant on GPU acceleration, such as machine learning and scientific computing, where NVIDIA's CUDA platform currently dominates through software lock-in. A vendor-neutral ISA could enable code to run on any GPU without performance penalties, fostering competition and reducing dependency on single suppliers. This is particularly timely given geopolitical tensions and export controls that make sovereign AI compute a priority. The proposed abstract execution model follows a thin abstraction principle, defining what hardware must do without prescribing how, allowing for microarchitectural innovation while maintaining binary compatibility.

Limitations of the study include reliance on reverse-engineered data for Apple GPUs, flagged at lower confidence, and benchmarking on only two of the four vendors, though NVIDIA and Apple represent the most architecturally distant pair. The native implementations used for comparison were hand-written tiled kernels, not production-optimized library code, meaning the reported figures represent a lower bound on abstraction cost. Future work will extend benchmarks to AMD and Intel platforms, incorporate irregular workloads, and develop a formal ISA specification with a prototype compiler to automate translation to vendor-native backends.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn