
Deep SOR Minimax Q-Learning: Accelerating Competitive AI with Neural Networks


AI Research
March 26, 2026
4 min read

In the rapidly evolving landscape of artificial intelligence, the ability to handle competitive multi-agent scenarios is becoming increasingly crucial. From autonomous vehicles navigating complex traffic to algorithmic trading systems in financial markets, many real-world problems are inherently adversarial, where one agent's gain is another's loss. These scenarios are mathematically modeled as two-player zero-sum games, a subclass of Markov games that has long challenged researchers due to its computational complexity and slow convergence rates. Traditional reinforcement learning approaches, while effective for single-agent problems, often struggle in these competitive settings, where agents must simultaneously learn optimal strategies against opponents who are also adapting their behavior. The fundamental challenge lies in developing algorithms that can efficiently navigate high-dimensional state-action spaces while guaranteeing convergence to Nash equilibria—the optimal strategies where neither player can unilaterally improve their outcome.

Researchers from the Indian Institute of Science have now introduced a significant advancement in this domain with their Deep Successive Over-Relaxation Minimax Q-Learning (D-SOR-MQL) algorithm. This novel approach addresses the critical limitations of existing approaches by combining deep neural networks as function approximators with a successive over-relaxation technique that dramatically accelerates convergence. The algorithm builds upon the foundation of Generalized Policy Iteration (GPI), which unifies value iteration and policy iteration within a common framework. What makes this work particularly noteworthy is its departure from tabular methods that suffer from the curse of dimensionality—the exponential growth of computational requirements as state-action spaces increase. Instead, the researchers employ deep neural networks with two hidden layers (256 and 128 units, respectively) using ReLU activations to approximate the joint Q-value function, making the approach scalable to real-world problems with continuous or high-dimensional spaces.
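The described architecture—two ReLU hidden layers of 256 and 128 units mapping a state to a joint Q-value over both players' actions—can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the state encoding, action-space sizes, and He initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # He initialization, a common default for ReLU layers (assumed here).
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

class JointQNetwork:
    """Approximates the joint Q-value Q(s, a, b) for a two-player
    zero-sum game: input is a state feature vector, output is a matrix
    of values over all (player action, opponent action) pairs."""

    def __init__(self, state_dim, n_actions_p1, n_actions_p2):
        self.n1, self.n2 = n_actions_p1, n_actions_p2
        # Two hidden layers of 256 and 128 units, as in the paper.
        self.W1, self.b1 = init_layer(state_dim, 256)
        self.W2, self.b2 = init_layer(256, 128)
        self.W3, self.b3 = init_layer(128, n_actions_p1 * n_actions_p2)

    def forward(self, s):
        h1 = np.maximum(0.0, s @ self.W1 + self.b1)   # ReLU
        h2 = np.maximum(0.0, h1 @ self.W2 + self.b2)  # ReLU
        q = h2 @ self.W3 + self.b3
        return q.reshape(self.n1, self.n2)            # Q(s, a, b) matrix

net = JointQNetwork(state_dim=8, n_actions_p1=4, n_actions_p2=4)
q_matrix = net.forward(rng.normal(size=8))
```

Returning the output as an n1 × n2 matrix is convenient because the minimax value of a state is then just a matrix-game computation over that output.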

The experimental results presented in the paper demonstrate the algorithm's effectiveness across multiple challenging environments. In the Guard-Invader game—a non-symmetric two-player zero-sum Markov game where a guard tries to intercept an invader before it reaches a door—the D-SOR-MQL algorithm showed remarkable improvements over baseline methods. For environments with 49 states, the algorithm achieved losses as low as 0.1429 ± 0.0018 with optimal parameter settings, compared to 0.8617 ± 0.0405 for the baseline M2DQN algorithm. Similar improvements were observed in the Soccer environment, another classic competitive domain where players must maneuver a ball into opposing goals. The researchers conducted extensive ablation studies to determine the optimal successive over-relaxation parameter w, finding that values between 1.4 and 2.0 generally provided the best performance, though the exact optimal value varied by environment and state-space size.
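To make the role of the relaxation parameter w concrete, here is a tabular sketch of an SOR-style minimax Q-learning update: the target blends the bootstrapped value at the next state (weight w) with the current state's own minimax value (weight 1 − w). Two simplifications are assumed for brevity: a pure-strategy minimax (max over own actions of the min over opponent actions) stands in for the mixed-strategy matrix-game value a full implementation would compute, and the step size, table shape, and sample transition are illustrative.

```python
import numpy as np

def minimax_value(Q_s):
    # Pure-strategy minimax over one state's joint Q matrix.
    # (A full implementation would solve the matrix game for the
    # mixed-strategy value, e.g. via linear programming.)
    return np.max(np.min(Q_s, axis=1))

def sor_minimax_update(Q, s, a, b, r, s_next, alpha, gamma, w):
    """One SOR-style minimax Q-learning step. With w = 1 this reduces
    to the standard minimax Q-learning target r + gamma * val(s_next)."""
    target = (w * (r + gamma * minimax_value(Q[s_next]))
              + (1.0 - w) * minimax_value(Q[s]))
    Q[s, a, b] += alpha * (target - Q[s, a, b])
    return Q

# Hand-checkable single step on an all-zero table:
# target = 1.5 * (1 + 0) + (-0.5) * 0 = 1.5, so Q[0,1,2] = 0.5 * 1.5 = 0.75.
Q = np.zeros((2, 3, 3))                       # 2 states, 3x3 joint actions
Q = sor_minimax_update(Q, s=0, a=1, b=2, r=1.0, s_next=1,
                       alpha=0.5, gamma=0.9, w=1.5)
```

Note that w > 1 (over-relaxation) puts negative weight on the current state's value, which is what pushes the iterate past the ordinary Bellman target and, in the right regime, speeds up convergence.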

Beyond its empirical success, the paper provides rigorous theoretical guarantees that establish the algorithm's mathematical soundness. The researchers prove finite-time convergence with O(ϵ⁻²) sample complexity—matching the best known bounds for standard Q-learning while extending them to the more complex minimax setting with function approximation. This theoretical analysis, conducted under linear function approximation for tractability, demonstrates that the algorithm maintains contraction properties essential for convergence while incorporating the successive over-relaxation mechanism. The proof relies on three standard assumptions: that the operator F is a contraction, that the noise sequence forms a martingale difference, and that each component is visited sufficiently often during sampling. These represent a significant advancement over previous work that either lacked finite-time guarantees or was restricted to tabular settings impractical for real-world applications.
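The acceleration that over-relaxation buys can be seen even in a toy planning setting. The experiment below (illustrative, not from the paper) runs standard and relaxed value iteration on a tiny single-agent MDP with strong self-transitions—the regime where the SOR operator's effective contraction modulus drops below the discount factor. Both variants reach the same fixed point, but the relaxed sweep gets there in fewer iterations; the specific MDP and the choice w* = 1 / (1 − γ·p_self) are assumptions for the demo.

```python
import numpy as np

gamma = 0.9
p_self = 0.5  # every (s, a) pair keeps the agent in place w.p. 0.5

# 2 states, 2 actions: transitions P[s, a, s'] and rewards R[s, a]
P = np.zeros((2, 2, 2))
P[0, :, 0] = p_self; P[0, :, 1] = 1 - p_self
P[1, :, 1] = p_self; P[1, :, 0] = 1 - p_self
R = np.array([[1.0, 0.0], [0.0, 2.0]])

def relaxed_value_iteration(w, tol=1e-8, max_iter=10_000):
    """Iterates T_w V = max_a [ w (R + gamma P V) + (1 - w) V ].
    w = 1 recovers standard value iteration; any w in the convergent
    range shares the same fixed point V*."""
    V = np.zeros(2)
    for k in range(1, max_iter + 1):
        Q = R + gamma * (P @ V)                        # (2, 2) action values
        V_new = np.max(w * Q + (1.0 - w) * V[:, None], axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, k
        V = V_new
    return V, max_iter

w_star = 1.0 / (1.0 - gamma * p_self)   # ~1.82: cancels the self-loop term
V_std, iters_std = relaxed_value_iteration(w=1.0)
V_sor, iters_sor = relaxed_value_iteration(w=w_star)
```

The intuition: with w = w*, the coefficient multiplying V(s) itself, w·γ·p_self + (1 − w), vanishes, so the update's dependence on the current state's stale value disappears and the per-sweep contraction factor shrinks from γ to w·γ·(1 − p_self).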

Despite these impressive achievements, the research acknowledges important limitations that point toward future directions. The theoretical analysis, while comprehensive, is conducted under linear function approximation rather than the deep neural networks used in the experiments—creating a gap between theory and practice that the researchers explicitly identify as an area for future work. Additionally, the current implementation assumes simultaneous action selection by both players, which may not capture all real-world competitive scenarios where actions might be sequential or partially observable. The algorithm's performance also depends on careful tuning of the successive over-relaxation parameter w, with the paper noting that values beyond an environment-specific threshold w* can lead to inconsistent performance. These limitations, however, do not diminish the algorithm's significance but rather highlight the complexity of the problem domain and the need for continued innovation in competitive reinforcement learning.

Reference: Gautam, S., Mandal, L., & Bhatnagar, S. (2025). Deep SOR Minimax Q-Learning for Two-Player Zero-Sum Games. arXiv preprint arXiv:2511.16226.


About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.

Connect on LinkedIn