A new approach to solving large symmetric eigenvalue problems has been developed, delivering significant speed improvements on modern multi-GPU systems. This breakthrough addresses a critical bottleneck in scientific computing, where current libraries like cuSOLVERMp and MAGMA only utilize around 1.5% of the peak multi-GPU performance, as shown in Table 1 of the paper. The researchers found that by redesigning the algorithm to pipeline different computational stages, they could achieve mean speedups of 5.74x over cuSOLVERMp and 6.59x over MAGMA on an 8×A100 GPU platform, with similar gains on H100 GPUs. This advancement is particularly important for fields like condensed matter physics, quantum chemistry, and density functional theory, where large eigenvalue decomposition problems are common and often exceed the capabilities of single GPUs.
The key finding is that conventional eigenvalue decomposition implementations suffer from severe underutilization of GPU resources due to sequential processing and load imbalance. The paper demonstrates that existing state-of-the-art libraries achieve only 2.18 TFLOPS (MAGMA) and 2.37 TFLOPS (cuSOLVERMp) on 8 A100 GPUs for a 49152×49152 matrix, which represents just 1.3% and 1.5% of theoretical peak performance. By introducing a pipelined two-stage algorithm in place of the conventional sequential approach, the researchers were able to overlap computations across stages, reduce synchronization overhead, and improve overall GPU utilization. This results in better strong and weak scalability, as evidenced by the experimental results showing near-perfect weak scalability on both A100 and H100 platforms.
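To make the stage structure concrete, here is a minimal single-node NumPy sketch of the sequential three-stage flow that the paper pipelines. This is an illustration only, not the authors' multi-GPU implementation: stage 1 below is plain Householder tridiagonalization, standing in for the paper's two-stage band-reduction and bulge-chasing path, and `np.linalg.eigh` stands in for the divide-and-conquer tridiagonal solver.

```python
import numpy as np

def tridiagonalize(A):
    """Reduce symmetric A to tridiagonal T with orthogonal Q: T = Q.T @ A @ Q."""
    n = A.shape[0]
    T, Q = A.copy(), np.eye(n)
    for k in range(n - 2):
        x = T[k + 1:, k]
        s = -1.0 if x[0] >= 0 else 1.0
        v = x.copy()
        v[0] -= s * np.linalg.norm(x)        # Householder vector, no cancellation
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue                          # column already in reduced form
        v /= nv
        H = np.eye(n)
        H[k + 1:, k + 1:] -= 2.0 * np.outer(v, v)
        T = H @ T @ H                         # symmetric similarity transform
        Q = Q @ H                             # accumulate the transformations
    return T, Q

rng = np.random.default_rng(0)
n = 32
A = rng.standard_normal((n, n))
A = (A + A.T) / 2.0                           # make it symmetric

T, Q = tridiagonalize(A)                      # stage 1: reduction to tridiagonal
w, V = np.linalg.eigh(T)                      # stage 2: tridiagonal eigensolver
X = Q @ V                                     # stage 3: back transformation
```

In the conventional flow these three stages run strictly one after another; the paper's contribution is to overlap them across GPUs (and to run stage 2 on CPUs concurrently), which is where the reported utilization gains come from.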
The methodology involves several innovative techniques to enable pipelining. First, the researchers abandoned the conventional block-cyclic data distribution in favor of a blockwise distribution, which allows different stages, such as successive band reduction and bulge chasing, to overlap. They also reordered the stages of the eigenvalue decomposition process, treating the divide-and-conquer solver as an independent task that can run concurrently on CPUs while the GPUs handle the other stages. This reordering required mathematical proof to ensure correctness, with experiments confirming that the backward error and orthogonality are preserved within floating-point precision, as shown in Table 2 comparing against cuSOLVER's Dsyevd routine.
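The difference between the two data distributions can be sketched as a simple ownership rule. The 1-D layout, device count, and function names below are illustrative assumptions, not the paper's exact scheme:

```python
def block_cyclic_owner(block_idx: int, n_gpus: int) -> int:
    """Block-cyclic: blocks are dealt out round-robin across GPUs."""
    return block_idx % n_gpus

def blockwise_owner(block_idx: int, n_blocks: int, n_gpus: int) -> int:
    """Blockwise: each GPU owns one contiguous run of blocks, so a stage
    working on a trailing block range touches only a few GPUs, leaving
    the others free to start the next pipelined stage."""
    per_gpu = -(-n_blocks // n_gpus)  # ceiling division
    return block_idx // per_gpu

n_blocks, n_gpus = 16, 4
cyclic = [block_cyclic_owner(i, n_gpus) for i in range(n_blocks)]
blockwise = [blockwise_owner(i, n_blocks, n_gpus) for i in range(n_blocks)]
# cyclic    -> [0, 1, 2, 3, 0, 1, 2, 3, ...]  every GPU is touched by any range
# blockwise -> [0, 0, 0, 0, 1, 1, 1, 1, ...]  contiguous ownership enables overlap
```

Under the cyclic layout any contiguous block range involves all GPUs, forcing global synchronization between stages; under the blockwise layout, stages operating on disjoint block ranges can proceed on disjoint GPUs, which is what makes the overlap possible.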
Analysis from the paper shows consistent performance gains across different hardware and problem sizes. Figure 13 illustrates that the pipelined eigenvalue decomposition outperforms cuSOLVERMp and MAGMA on both 8 A100 and 8 H100 GPUs for matrix sizes ranging from 20000 to 100000. On A100 GPUs, it achieves speedups of 5.74x and 6.59x, while on H100 GPUs, the speedups are 5.25x and 9.24x respectively. The scalability tests in Figures 14 and 15 demonstrate superior strong and weak scalability compared to baselines, with the pipelined approach maintaining robust performance across 1 to 8 GPUs, whereas cuSOLVERMp and MAGMA show poor or negative scaling in some cases.
The implications of this work are substantial for scientific computing applications that rely on large eigenvalue problems. By significantly improving performance and scalability, the pipelined algorithm enables researchers in physics and chemistry to solve larger problems more efficiently, potentially accelerating discoveries in materials science and quantum simulations. The paper notes that applications like VASP, BerkeleyGW, and Gromacs typically rely on eigenvalue decomposition solvers, and this new approach could enhance their capabilities. Additionally, the optimizations, such as communication-avoiding successive band reduction and BLAS2-based bulge-chasing back transformation, reduce communication volume and improve kernel performance, making the approach suitable for future scaling to even larger systems with thousands of GPUs.
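The rotation primitive underlying bulge chasing can be shown on a small example: symmetric Givens rotations reduce a band (here pentadiagonal) matrix to tridiagonal form while preserving its spectrum. This is a didactic dense version; the paper's kernel orders the rotations to chase the fill-in ("bulges") along the band and fuses their application into GPU-friendly BLAS2 back transformations.

```python
import numpy as np

def band_to_tridiagonal(B):
    """Zero everything below the first subdiagonal with Givens rotations,
    applied symmetrically so that T = G B G.T keeps B's eigenvalues."""
    T = B.astype(float).copy()
    n = T.shape[0]
    for k in range(n - 2):
        for i in range(n - 1, k + 1, -1):    # clear column k from the bottom up
            if T[i, k] == 0.0:
                continue
            a, b = T[i - 1, k], T[i, k]
            r = np.hypot(a, b)
            c, s = a / r, b / r               # rotation that zeroes T[i, k]
            G = np.array([[c, s], [-s, c]])
            T[[i - 1, i], :] = G @ T[[i - 1, i], :]    # rotate rows i-1, i
            T[:, [i - 1, i]] = T[:, [i - 1, i]] @ G.T  # and the matching columns
    return T

# Build a random symmetric pentadiagonal matrix (bandwidth 2).
rng = np.random.default_rng(1)
n = 12
B = np.zeros((n, n))
for d in (0, 1, 2):
    vals = rng.standard_normal(n - d)
    B += np.diag(vals, d) + (np.diag(vals, -d) if d else 0.0)

T = band_to_tridiagonal(B)
```

Each rotation here creates new fill just outside the band, which later rotations eliminate; the art in the paper is scheduling those dependent rotations across GPUs without serializing them.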
However, the paper acknowledges limitations in the current implementation. The reordered-stages strategy introduces an extra matrix multiplication step to form the final eigenvectors, which adds computational overhead, though this multiplication is efficient on modern GPUs. The load-balancing approach, which adjusts block sizes in the back transformation to compensate for imbalance in earlier stages, requires careful tuning and may not generalize perfectly to all hardware configurations. Furthermore, the algorithm is specifically designed for symmetric eigenvalue problems, and extending it to non-symmetric matrices remains a direction for future work. The researchers also note that while the pipeline improves utilization, it may still face bottlenecks in extreme-scale deployments beyond 8 GPUs, necessitating further optimization for supercomputing environments.
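The block-size compensation idea reduces to simple arithmetic under a linear cost model: GPUs that were busy longer in the earlier stages receive proportionally fewer back-transformation rows so that all devices reach the final synchronization together. The timings, the per-row cost constant, and the function below are illustrative assumptions, not the paper's tuner.

```python
def balanced_rows(stage_busy_ms, total_rows, ms_per_row):
    """Assign back-transformation rows so every GPU finishes at the same
    time K, i.e. busy_i + ms_per_row * rows_i == K for all i, with the
    rows summing to total_rows (assumes imbalance is small enough that
    no assignment goes negative)."""
    n = len(stage_busy_ms)
    finish = (sum(stage_busy_ms) + ms_per_row * total_rows) / n
    rows = [(finish - t) / ms_per_row for t in stage_busy_ms]
    assert all(r >= 0 for r in rows), "imbalance too large to absorb"
    return rows

# Four GPUs with uneven busy time from the reduction stages, 1000 rows of
# back transformation to distribute, 0.5 ms per row (made-up numbers).
rows = balanced_rows([100.0, 120.0, 90.0, 110.0], 1000, 0.5)
# -> [260.0, 220.0, 280.0, 240.0]; every GPU finishes at 230 ms
```

The caveat in the paragraph above shows up directly in the model: the per-row cost differs across hardware, so the constant must be retuned, and a large enough imbalance cannot be absorbed by redistribution alone.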
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.