Optimization algorithms are the backbone of deep learning training, determining how fast models converge, how stable training is, and how well final models generalize. As models grow larger and tasks diversify, selecting the right optimizer has become critical for both research and industry applications. A new study provides a comprehensive comparison of five major algorithms—SGD, Mini-batch SGD, Momentum, Adam, and Lion—offering evidence-backed guidance that could reshape how practitioners approach optimization in AI systems.
The researchers found that Lion and Momentum emerged as top performers in experimental tests, while Adam and Mini-batch SGD showed significant limitations under default settings. On the MNIST handwritten digit classification task, Momentum achieved the highest test accuracy of 0.9815, closely followed by Lion at 0.9799, with SGD maintaining a reliable 0.9749. In contrast, Adam and Mini-batch SGD lagged behind at 0.9644 and 0.9614, respectively. For the California Housing regression task, Lion achieved the best generalization with a root mean squared error (RMSE) of 0.54, outperforming Momentum (0.56) and SGD (0.58), while Adam and Mini-batch SGD were the worst performers with RMSE around 0.59. These results highlight that algorithm performance is highly task-dependent, challenging the one-size-fits-all approach often seen in practice.
The study systematically analyzed each algorithm's core principles and historical development. SGD, the foundational algorithm, uses simple stochastic updates with theoretical roots in the 1951 Robbins–Monro framework, offering solid generalization but slower convergence. Mini-batch SGD balances computational efficiency and stability by averaging gradients over small data subsets, though it requires careful batch-size tuning. Momentum accelerates convergence by accumulating a velocity vector from historical gradients, smoothing oscillations through a physical analogy of inertia. Adam combines momentum with per-parameter adaptive learning rates, using exponential moving averages of gradients and their squares, and applies bias correction for early iterations. Lion, discovered through automated program search, employs momentum-augmented sign updates: it takes the sign of a mixed gradient signal to unify step sizes across parameters, which enhances memory efficiency by avoiding second-moment estimation.
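The update rules described above can be sketched as single-step functions. This is a minimal NumPy illustration of each rule in its common textbook form; hyperparameter names and default values are generic conventions, not the paper's exact settings, and Mini-batch SGD uses the same step as SGD with the gradient averaged over a batch.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Plain stochastic update: step against the (possibly mini-batch-averaged) gradient.
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Accumulate a velocity vector from historical gradients (the "inertia" analogy),
    # which smooths oscillations across steps.
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, grad, m, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum plus a per-parameter adaptive rate from exponential moving
    # averages of the gradient (m) and its square (s).
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)  # bias correction for early iterations (t starts at 1)
    s_hat = s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

def lion_step(w, grad, m, lr=1e-4, b1=0.9, b2=0.99):
    # Sign of a mixed gradient/momentum signal gives every parameter the same
    # step magnitude; only first-order momentum is stored (no second moment).
    update = np.sign(b1 * m + (1 - b1) * grad)
    m = b2 * m + (1 - b2) * grad
    return w - lr * update, m
```

Note how Lion's state is half of Adam's: one momentum buffer instead of two moving averages, which is the source of its memory advantage.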
Experimental results, visualized in figures from the paper, reveal distinct behavioral patterns. On MNIST, loss curves in Figure 1 show Momentum and Lion converging fastest with low initial loss and steady declines, while SGD's curve is smoother but slower. Mini-batch SGD exhibited high loss and fluctuations, indicating noise sensitivity. Adam declined rapidly early but jittered mid-training, reflecting adaptive sensitivity to noise. Gradient norm trends in Figure 5 further illustrate these differences: SGD's norm decreased regularly, Mini-batch SGD's remained high with fluctuations, Momentum's decreased slowly but consistently, Adam's fluctuated considerably, and Lion's showed low-amplitude decreases with minimal fluctuations. On California Housing, Figure 3 and Figure 4 indicate Lion's RMSE declined rapidly with small fluctuations, while Momentum remained stable, and Adam and Mini-batch SGD were unstable with multiple spikes. Figure 8 quantifies final RMSE, confirming Lion's superiority in noisy regression tasks.
The implications for real-world AI development are substantial. For practitioners, the study suggests that Lion and Momentum are preferred choices under many conditions, offering robust performance across tasks like image classification and regression. Lion's memory efficiency—maintaining only first-order momentum instead of the second-order estimates Adam requires—makes it suitable for large-scale training, though it calls for smaller learning rates and warmup strategies. Momentum excels in structured, low-noise data scenarios, aligning with its design to damp oscillations. SGD remains a reliable baseline for stability and generalization, while Adam and Mini-batch SGD demand careful hyperparameter tuning, such as using AdamW with decoupled weight decay or adjusting batch sizes. This challenges the common reliance on Adam as a default optimizer, emphasizing the need for task-specific selection to improve training efficiency and model outcomes.
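The AdamW tuning point can be made concrete. Below is a hedged NumPy sketch contrasting classic L2 regularization (folded into the gradient, so Adam's adaptive denominator rescales it) with AdamW-style decoupled weight decay (applied directly to the weights); hyperparameter names and defaults are illustrative, not the paper's settings.

```python
import numpy as np

def adam_l2_step(w, grad, m, s, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Classic approach: weight decay enters the gradient, so it gets
    # rescaled by Adam's per-parameter adaptive denominator.
    grad = grad + wd * w
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad**2
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

def adamw_step(w, grad, m, s, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled approach: the decay term shrinks the weights directly,
    # independent of the adaptive rescaling of the gradient.
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad**2
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)
    w = w - lr * wd * w  # decoupled weight-decay term
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```

With a zero gradient, the two variants shrink the weights by visibly different amounts, which is exactly the behavioral gap that makes decoupled decay a distinct tuning knob rather than a cosmetic change.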
Limitations of the study, as noted in the paper, include the reliance on simplified multi-layer perceptron models and two datasets, which may not fully capture performance in more complex architectures or diverse tasks. The theoretical underpinnings of Lion are less established than those of algorithms like SGD and Adam, leaving gaps in convergence analysis. Additionally, hyperparameter settings were standardized to default recommendations, but real-world applications often require extensive tuning, which could alter the results. Future work should extend validation to larger datasets and advanced network types to generalize the findings, ensuring that optimization choices adapt to evolving AI systems.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.