Hardware

AI Accelerators Get a Speed Boost from Simple Tiling Trick

A new method for organizing data in AI chips increases performance by up to 4.5 times, setting a new record for AMD's latest hardware without changing the underlying technology.

AI Research
March 26, 2026
4 min read

A simple change in how data is organized within AI chips has led to dramatic performance improvements, challenging long-held assumptions in hardware design. Researchers have demonstrated that by decoupling the buffered tile sizes of input and output operands in matrix multiplication—a technique called asymmetric tile buffering (ATB)—they can achieve up to a 4.54× speedup on AMD's latest XDNA2™ AI Engine. This approach, detailed in a recent paper, boosts mixed-precision GEMM performance from 4.8 to 24.6 TFLOPS, establishing a new performance record for this hardware. The finding is significant because it shows that conventional symmetric buffering, widely used across CPUs, GPUs, and accelerators, is unnecessarily restrictive, leaving potential performance gains untapped.

At its core, the key finding is that asymmetric tile buffering allows for higher arithmetic intensity by reducing buffer pressure on input data. In conventional symmetric buffering, the buffered tile size of input A along the M dimension matches the output tile size of C, which increases buffer requirements and limits reuse. The researchers observed that the buffer lifetime of input A only needs to last as long as it takes to accumulate a single row of C, meaning the input and output tile buffers do not need to share the same M dimension. This insight enables ATB to allocate more buffer capacity to operands B and C, increasing data reuse and overall efficiency. The paper shows that this decoupling, parameterized by (T_MA, T_MC, T_K, T_N) where T_MC ≥ T_MA, can improve GEMM throughput by up to 40% even over highly optimized symmetric kernels.
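To make the buffer-pressure argument concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the element sizes (BF16 inputs, FP32 accumulator) and the example tile shapes are assumptions chosen purely for illustration.

```python
# Sketch: compare L1 buffer footprints for symmetric vs. asymmetric tile
# buffering in a GEMM  C[M,N] += A[M,K] @ B[K,N].
# Element sizes are illustrative assumptions (BF16 inputs, FP32 accumulator).

BYTES_A, BYTES_B, BYTES_C = 2, 2, 4  # assumed bytes per element

def footprint(t_ma, t_mc, t_k, t_n):
    """L1 bytes needed for one set of tile buffers.

    Symmetric buffering is the special case t_ma == t_mc; ATB allows
    t_ma < t_mc because A's buffer only has to live while one row strip
    of the C tile is being accumulated.
    """
    a = t_ma * t_k * BYTES_A   # input strip, refilled per row strip of C
    b = t_k * t_n * BYTES_B    # reused across all T_MC rows of C
    c = t_mc * t_n * BYTES_C   # accumulator tile, kept resident
    return a + b + c

# Symmetric: A's buffer must span the full M extent of the C tile.
sym = footprint(t_ma=128, t_mc=128, t_k=64, t_n=128)
# Asymmetric (rho = T_MC / T_MA = 8): same C tile, much smaller A buffer.
atb = footprint(t_ma=16, t_mc=128, t_k=64, t_n=128)
print(f"symmetric: {sym/1024:.1f} KiB, ATB: {atb/1024:.1f} KiB")
# -> symmetric: 96.0 KiB, ATB: 82.0 KiB
```

The bytes freed by shrinking A's buffer can then be spent on a larger B or C tile, which is where the extra reuse and arithmetic intensity come from.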

The methodology involved developing an analytical performance model that captures both the benefits and costs of ATB, providing principled guidance for selecting tiling factors. The model incorporates arithmetic intensity gains from higher reuse and the overheads of increased kernel switching costs. For a case study, the researchers applied ATB to AMD's XDNA2™ AI Engine, a spatial architecture with 32 compute cores, each with 64 KB of local L1 memory. They used the IRON API and MLIR-AIE toolchain to program the NPU, measuring performance under three precision configurations: mixed-precision BF16-BFP16, all BFP16 with BFP16 accumulation, and all BFP16 with BF16 accumulation. The model helped optimize microkernel design by considering instruction-level parallelism, accumulation chains, and double buffering of input registers to hide latency.
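The flavor of such a model can be sketched in a few lines: score each candidate tiling by its arithmetic intensity minus a kernel-switching penalty, keeping only tilings that fit in the 64 KB L1. This is a simplified stand-in, not the paper's actual equations; the element sizes, switch cost, penalty weight, and candidate tile sizes are all assumptions.

```python
# Hedged sketch of an ATB-style tiling search (not the paper's exact model).
from itertools import product

L1_BYTES = 64 * 1024                  # per-core local memory (from the paper)
BYTES_A, BYTES_B, BYTES_C = 2, 2, 4   # assumed element sizes
SWITCH_COST = 200                     # assumed cycles lost per kernel switch

def fits(t_ma, t_mc, t_k, t_n):
    """Does one set of tile buffers fit in L1?"""
    return (t_ma*t_k*BYTES_A + t_k*t_n*BYTES_B + t_mc*t_n*BYTES_C) <= L1_BYTES

def intensity(t_mc, t_k, t_n):
    """FLOPs per byte moved between L1 and the next memory level."""
    flops = 2 * t_mc * t_n * t_k
    bytes_moved = t_mc*t_k*BYTES_A + t_k*t_n*BYTES_B + t_mc*t_n*BYTES_C
    return flops / bytes_moved

def switches(t_ma, t_mc, t_k, k_total=2048):
    """Microkernel invocations per C tile: one per (A strip, K step) pair."""
    return (t_mc // t_ma) * (k_total // t_k)

best = max(
    ((ma, mc, k, n) for ma, mc, k, n in product([16, 32, 64, 128], repeat=4)
     if mc >= ma and mc % ma == 0 and fits(ma, mc, k, n)),
    # Weighting of the switch penalty is an arbitrary illustrative choice.
    key=lambda t: intensity(t[1], t[2], t[3])
                  - 1e-4 * SWITCH_COST * switches(t[0], t[1], t[2]),
)
print("chosen (T_MA, T_MC, T_K, T_N):", best)
```

Even this toy version shows the central tension the paper formalizes: more asymmetry raises reuse but multiplies kernel invocations, so the two terms must be balanced rather than maximized independently.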

Results from the evaluation show substantial improvements across all configurations. In Config 1 (BF16-BFP16 mixed-precision), ATB enabled a 128×64×128 L1 tile that would require 91 KB under symmetric buffering but became feasible with ATB, achieving 24.3 TFLOPS—a 4.54× speedup over the MLIR-AIE baseline of 4.8 TFLOPS. Table 2 in the paper details these gains, with Config 2 reaching 31.3 TFLOPS (3.07× over baseline) and Config 3 achieving 28.5 TFLOPS (2.79× over baseline). The data also reveals trade-offs: when memory-bound, ATB benefits from smaller T_K and a larger asymmetry ratio ρ to increase arithmetic intensity; when compute-bound, larger T_K improves core efficiency by reducing kernel switching overhead. For example, at T_K=64, core efficiency decreased only 5.8% as ρ increased from 1 to 8, compared to a 25.6% drop at T_K=8.
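The T_K / ρ trade-off can be illustrated with a toy amortization model: each microkernel invocation pays a fixed switch cost, and the useful work per invocation shrinks as ρ grows and grows with T_K. The constants below are invented for illustration and do not reproduce the paper's measured percentages, only the qualitative trend.

```python
# Illustrative-only model of the T_K vs. rho trade-off: fixed switch cost
# per microkernel invocation, amortized over that invocation's MACs.
SWITCH_CYCLES = 5        # assumed fixed cost per kernel switch
MACS_PER_CYCLE = 512     # assumed peak MACs/cycle of one core

def core_efficiency(t_mc, rho, t_k, t_n=64):
    """Fraction of cycles doing useful work for one A-strip invocation."""
    t_ma = t_mc // rho                               # ATB shrinks the A strip
    work_cycles = t_ma * t_k * t_n / MACS_PER_CYCLE  # ideal compute time
    return work_cycles / (work_cycles + SWITCH_CYCLES)

for t_k in (8, 64):
    e1 = core_efficiency(128, rho=1, t_k=t_k)
    e8 = core_efficiency(128, rho=8, t_k=t_k)
    print(f"T_K={t_k:2d}: efficiency rho=1 -> rho=8: {e1:.2f} -> {e8:.2f}")
```

With these (arbitrary) constants the efficiency loss from raising ρ is small at T_K=64 and much larger at T_K=8, mirroring the direction of the paper's 5.8% vs. 25.6% observation.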

These findings have important implications for AI hardware design and software optimization, as they demonstrate that simple tiling adjustments can unlock significant performance without hardware changes. For everyday users, this means faster AI applications on devices like laptops with AMD Strix Point processors, potentially improving tasks such as real-time language translation or image generation. The research also provides a framework for developers to manually optimize kernels using techniques like unrolling, register-level tiling, and double buffering, which proved essential for BFP16 precision where compiler-driven optimizations were ineffective. By moving GEMM from memory-bound to compute-bound in some cases, ATB helps overcome bottlenecks that limit AI acceleration.

However, the paper acknowledges limitations, including the complexity of managing data movement across multiple memory levels and the overheads introduced by kernel switching. The performance model shows that while ATB increases arithmetic intensity, it also raises switching costs, requiring careful balance in tiling factor selection. Additionally, the study is specific to AMD's XDNA2™ architecture, and its applicability to other NPUs, CPUs, or GPUs may vary due to hardware constraints. Future work could explore automated search and scheduling to make ATB-based kernel generation fully automatic, extending these benefits more broadly.

Original Source

Read the complete research paper

View on arXiv

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn