Large language models have transformed how we interact with technology, but their ability to reason through complex, multi-step problems remains a critical frontier. To assess this capability, researchers often use a metric called pass@k, which measures the probability that at least one of k generated attempts is correct. The metric has become a standard benchmark in fields like code generation and mathematical problem-solving, where models produce several candidate answers and keep the best one. However, a new theoretical study questions the wisdom of using pass@k as a direct training objective, arguing that it can inadvertently stifle the very exploration needed for robust reasoning.
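To make the metric concrete, here is the standard unbiased pass@k estimator used in code-generation benchmarks, which avoids naively resampling by counting combinations of incorrect samples (a sketch of the common formula, not code from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of k
    samples chosen from n generated samples, of which c are correct, passes.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 correct solutions out of 10 samples, evaluated at k = 5.
print(pass_at_k(10, 2, 5))
```

Averaging this quantity over prompts gives the benchmark score; the study's argument concerns what happens when this same quantity is used as a training objective.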
The researchers found that optimizing for pass@k does not encourage models to explore diverse solutions but instead reduces to a simple reweighting of the pass@1 objective, which measures success in a single attempt. Specifically, they derived the gradient of the pass@k objective and showed it is a scalar multiple of the pass@1 gradient, with a scaling factor that depends on the model's current performance. This means the training signal vanishes when the model struggles to find any correct solutions, precisely when exploration is most needed. Conversely, as the model improves and concentrates probability on a single solution, pass@k converges to pass@1, making the extra computational cost of generating multiple samples redundant during training.
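The reduction described above can be sketched in standard notation. Assuming the k samples for each prompt are drawn independently, and writing $p_\theta(x)$ for the per-prompt pass@1 probability (our notation, not necessarily the paper's), the per-prompt pass@k objective and its gradient are:

```latex
\mathrm{pass@k}(\theta) \;=\; \mathbb{E}_{x}\!\left[\,1 - \bigl(1 - p_\theta(x)\bigr)^{k}\,\right],
\qquad
\nabla_\theta\,\mathrm{pass@k} \;=\; \mathbb{E}_{x}\!\left[\,k\,\bigl(1 - p_\theta(x)\bigr)^{k-1}\,\nabla_\theta\, p_\theta(x)\,\right].
```

Per prompt, the pass@k gradient is the pass@1 gradient $\nabla_\theta\, p_\theta(x)$ multiplied by the scalar $k\,(1-p_\theta(x))^{k-1}$: collinear with pass@1 and carrying no new direction, exactly the reweighting the study describes.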
To analyze this, the study formalized the reinforcement learning framework for large language models, where a model acts as a stochastic policy generating sequences of tokens. In reinforcement learning with verifiable rewards, a deterministic verifier checks correctness, and the pass@k objective aims to maximize the chance that at least one of k samples is correct. The researchers mathematically demonstrated that the gradient of this objective is collinear with that of pass@1, offering no new direction for optimization. They visualized this relationship in Figure 1, showing how the scaling factor and objective value change with pass@1 probability, highlighting regimes where learning signals diminish.
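The regimes shown in Figure 1 can be probed numerically. Under the i.i.d.-samples derivation (an assumption of this sketch), the scalar relating the per-prompt pass@k gradient to the pass@1 gradient is k(1-p)^(k-1), where p is the per-prompt pass@1 probability:

```python
def pass_at_k_scale(p: float, k: int) -> float:
    """Scaling factor relating the per-prompt pass@k gradient to the pass@1
    gradient, from d/dp [1 - (1-p)**k] = k * (1-p)**(k-1)."""
    return k * (1.0 - p) ** (k - 1)

# Sweep pass@1 probability to see where the signal concentrates or dies off.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  scale(k=8) = {pass_at_k_scale(p, 8):.4f}")
```

The factor collapses toward zero as p grows, so pass@k training behaves like pass@1 training once the model is competent; and when p is near zero, the pass@1 gradient itself is empirically zero (no sampled solution earns a reward), so rescaling it cannot create a learning signal.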
The findings indicate that using pass@k as an optimization target can lead to 'exploration collapse,' where models narrow their focus to already discovered solutions, reducing diversity. The study proved that as a policy concentrates probability mass on a dominant mode, the gap between pass@k and pass@1 vanishes, diminishing the benefit of drawing multiple samples. This aligns with empirical observations cited in the paper, such as those by Peng et al., who found that probability concentration correlates with worse pass@k performance. The analysis suggests that pass@k is better suited as a diagnostic tool for measuring reasoning potential under repeated sampling than as a training objective.
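The vanishing gap can be illustrated with a toy calculation under the i.i.d.-samples assumption (a simplification, not the paper's full argument about mode concentration). Writing p for the per-prompt pass@1 probability, the gap is (1-p) - (1-p)^k, which is largest at intermediate p and shrinks toward zero as the policy becomes deterministic in either direction:

```python
def pass_at_k_prob(p: float, k: int) -> float:
    """Per-prompt pass@k when k i.i.d. samples each succeed with probability p."""
    return 1.0 - (1.0 - p) ** k

# The pass@k advantage over pass@1 peaks at intermediate p and vanishes
# at both extremes, where the policy is effectively concentrated.
for p in (0.001, 0.2, 0.5, 0.9, 0.999):
    gap = pass_at_k_prob(p, 8) - p
    print(f"p = {p:.3f}  pass@8 - pass@1 = {gap:.4f}")
```

Once the gap is negligible, generating k samples per prompt during training buys nothing over a single sample, which is the redundancy the study highlights.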
For practical applications, this insight has significant implications. In tasks like code synthesis or mathematical reasoning, where solution diversity is crucial, training on pass@k might hinder models from discovering novel approaches. The researchers argue that effective optimization requires mechanisms that explicitly encourage exploration, such as those proposed in recent works like Pass@K Policy Optimization or SimKO, which aim to balance exploration and exploitation. However, they caution that directly optimizing pass@k remains problematic: high pass@k is a consequence of good exploration, not a driver of it.
The study acknowledges its limitations, noting that the analysis is theoretical and based on standard policy gradient methods. It does not cover every possible algorithm, and its broader impact is methodological, concerning optimization objectives rather than immediate societal effects. Future work could explore alternative metrics or training strategies that promote exploration without the pitfalls identified here. Ultimately, this research urges a shift in how we evaluate and train AI for reasoning, emphasizing the need for objectives that foster diversity rather than inadvertently suppressing it.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.