A new optimization for AI language models addresses a hidden inefficiency that slows down responses in common scenarios like brief chatbot interactions. When users ask short questions or engage in quick conversations, large language models such as Llama 3.1-70B-Instruct can underutilize the powerful hardware they run on, specifically NVIDIA's H100 GPUs. This underutilization occurs because the standard scheduling in FlashAttention-3, a widely used attention mechanism, disables a key parallelization technique for sequences of 512 tokens or fewer, leaving over 90% of the GPU's Streaming Multiprocessors idle. Researchers from the Barcelona Supercomputing Center have developed a sequence-aware split heuristic that overcomes this bottleneck, delivering kernel-level speedups of 21 to 24% for metadata-enabled inference paths without affecting performance in other cases.
The key finding is that allowing sequence-level parallelism even in low-head-count decoding configurations significantly improves hardware utilization. In low-head-count regimes, such as Multi-Query Attention (MQA) or Grouped-Query Attention (GQA) with few key/value heads, the workload per decode step is small, so as few as 8 thread blocks are launched on an H100 GPU with 132 Streaming Multiprocessors. This results in an occupancy of only about 6%, a severe underutilization. The researchers found that increasing the number of splits along the sequence dimension under these conditions recovers the lost performance, with the evolved heuristic forcing split counts of 12 or 16 for short single-batch prompts, as shown in Figure 1 of the paper. This bypasses the static guard in FlashAttention-3 that previously prevented splitting for sequences of 512 tokens or fewer.
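The occupancy arithmetic above can be sketched in a few lines. This is illustrative only (the `occupancy` helper is a back-of-the-envelope model, not the paper's code): launching one thread block per (batch, KV-head) unit leaves most SMs idle, while splitting the sequence multiplies the block count.

```python
# Illustrative arithmetic for the occupancy figures quoted in the text.
# The helper below is a simplified model, not FlashAttention-3 code.
SM_COUNT = 132  # Streaming Multiprocessors on an NVIDIA H100

def occupancy(thread_blocks: int, sms: int = SM_COUNT) -> float:
    """Fraction of SMs that receive at least one thread block (capped at 1.0)."""
    return min(thread_blocks / sms, 1.0)

# Low-head-count decode step: only 8 thread blocks launched.
print(f"no splitting: {occupancy(8):.1%}")       # ~6% of SMs busy
# Forcing 16 splits along the sequence multiplies the block count.
print(f"16 splits:    {occupancy(8 * 16):.1%}")  # ~97% of SMs busy
```

The same arithmetic explains why the heuristic stays conservative for saturated workloads: once the unsplit block count already covers the 132 SMs, extra splits add scheduling overhead without adding occupancy.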
The methodology involved using OpenEvolve, an LLM-guided evolutionary search framework, to dynamically generate and refine Python-based heuristics for workload scheduling. The experiment targeted minimizing Time per Output Token (TPOT) for standard chat interactions with Llama-3.1-70B-Instruct, focusing on short prompts (sequence length ≤ 512) at batch size 1. The search space included parameters such as num_splits, which controls sequence-level parallelization, pack_gqa for memory layouts, and sm_margin for resource management. By isolating scheduling semantics from mathematical correctness, the evolutionary agent identified that increasing num_splits in low-throughput regimes correlates directly with latency reductions. This process revealed that the static short-sequence guard in FlashAttention-3 was prematurely limiting split counts, and the top-performing candidates overrode it by enforcing higher splits to spread parallel work across SMs.
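A minimal sketch of such a sequence-aware policy might look like the following. This is an assumption-laden illustration, not the evolved heuristic itself: the function name `choose_num_splits`, the threshold values, and the head-count cutoff are all hypothetical, chosen only to mirror the behavior described in the text (force 12 or 16 splits for short single-batch prompts in the 512-token boundary bucket, leave saturated and shorter-sequence cases untouched).

```python
# Hypothetical sketch of a sequence-aware split heuristic, mirroring the
# behavior described in the text. Names and thresholds are illustrative.
SM_COUNT = 132  # SMs on an H100

def choose_num_splits(seq_len: int, batch_size: int, num_kv_heads: int) -> int:
    blocks_without_split = batch_size * num_kv_heads
    # Saturated workloads already cover the SMs: keep the default (no split).
    if blocks_without_split >= SM_COUNT:
        return 1
    # Boundary bucket (384 < seq_len <= 512), single batch: force high split
    # counts, echoing the evolved heuristic's choice of 12 or 16.
    if batch_size == 1 and 384 < seq_len <= 512:
        return 16 if num_kv_heads <= 2 else 12
    # Shorter sequences are left unchanged in the initial policy.
    return 1

print(choose_num_splits(512, 1, 2))   # short single-batch, 2 KV heads -> 16
print(choose_num_splits(256, 1, 2))   # shorter sequence -> default 1
```

For the Table 1 configurations (512 tokens, batch 1, 1 or 2 KV heads) this sketch would return 16, raising the launched thread blocks well above the 8 observed without splitting.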
Analysis from kernel-level A/B testing shows clear improvements. In Table 1, for a sequence length of 512 tokens with 1 or 2 key/value heads, the patched kernel reduced execution times from 13.72 microseconds to 11.37 microseconds (a 21% speedup) and from 13.52 microseconds to 10.93 microseconds (a 24% speedup), respectively. These gains are specific to metadata-enabled inference paths used by stacks like vLLM; without precomputed metadata, gains are more modest (∼1.0 to 1.05×). The policy is conservative, applying only to the nblk = 4 boundary bucket represented by LK = 512, and leaves shorter sequences (≤ 384 tokens) and saturated workloads unchanged. An extended split sweep in Figure 3 demonstrates that enabling splitting drops latency into a broad low-latency plateau, with the chosen split count of 3 being the smallest to enter this regime, offering nearly the full benefit while minimizing complexity.
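The quoted speedups follow directly from the Table 1 timings, since speedup is simply the baseline time divided by the patched time:

```python
# Verifying the Table 1 speedups quoted in the text.
baseline_us = {1: 13.72, 2: 13.52}  # execution time in microseconds, by KV-head count
patched_us  = {1: 11.37, 2: 10.93}

for kv_heads in (1, 2):
    speedup = baseline_us[kv_heads] / patched_us[kv_heads]
    print(f"{kv_heads} KV head(s): {speedup:.2f}x")  # 1.21x and 1.24x
```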
The implications of this research are practical for real-world AI applications, particularly for improving the responsiveness of chatbots and other interactive AI systems. By optimizing how GPUs handle short, low-head-count decoding tasks, the sequence-aware heuristic makes AI models faster without requiring changes to the models themselves or the underlying hardware. This is especially relevant for deployments using inference stacks that precompute scheduler metadata, as the full 21-24% improvement applies there. The approach also illustrates a broader principle: automated search tools like OpenEvolve can uncover systems-level optimizations that are then distilled into simple, upstreamable changes, potentially benefiting a wide range of AI infrastructure.
Limitations of the study include the narrow scope of the evaluated policy, which focuses only on the representative case of sequence length 512 tokens within the nblk = 4 boundary bucket. The paper explicitly notes that lower sequence lengths (e.g., 128, 256, 384 tokens) are left unchanged in this initial policy, and extending the benefit to these cases is future work. Additionally, the gains are contingent on the use of precomputed scheduler metadata; without it, improvements are minimal. The heuristic also does not address all configurations, as it defaults back to standard behavior for dense workloads where splitting might introduce overhead, ensuring no performance regressions across a sweep of 160 configurations. Future research could explore more configuration-specific split counts and apply similar principles to other hardware or model architectures.