Training deep neural networks, the backbone of modern artificial intelligence, often relies on optimizers like Adam, which cope well with complex, non-convex regions of the loss function but can converge slowly near a minimum. Researchers have introduced a two-phase training algorithm that dynamically switches from Adam to the conjugate gradient method based on the convexity of the loss function, achieving faster convergence and higher accuracy across several model architectures without additional tuning. This approach addresses a fundamental challenge in machine learning: efficiently navigating the intricate landscape of loss functions to find better minima.
The key finding is that loss functions in deep neural networks often exhibit a predictable structure: training starts in non-convex regions, where the gradient norm increases, and then enters convex regions, where it decreases. By detecting the peak in the gradient norm during training, the algorithm switches from Adam, which is effective in non-convex areas, to conjugate gradient, which offers superlinear convergence in convex zones. In the reported experiments, this method consistently outperformed Adam-only training, reducing final loss values and improving validation accuracy on datasets such as CIFAR-10, CIFAR-100, and MNIST.
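The link between a gradient-norm peak and a change in convexity can be illustrated on a one-dimensional toy function. The sketch below is not from the paper: the function f(x) = -exp(-x²/2), the starting point, the learning rate, and the use of plain gradient descent in place of Adam are all assumptions chosen for illustration. For this function the second derivative is negative for |x| > 1 and positive for |x| < 1, so a descent path that starts at x = 2 crosses from a non-convex into a convex region at x = 1.

```python
import math

# Toy loss (an assumption for illustration, not the paper's setup):
# f(x) = -exp(-x^2 / 2), non-convex for |x| > 1 and convex for |x| < 1.

def grad(x):
    # First derivative of f.
    return x * math.exp(-x * x / 2)

def second_derivative(x):
    # Second derivative of f; its sign indicates local convexity.
    return (1 - x * x) * math.exp(-x * x / 2)

def descend(x0=2.0, lr=0.1, steps=300):
    # Plain gradient descent, recording (x, |gradient|, curvature) before each step.
    x = x0
    history = []
    for _ in range(steps):
        g = grad(x)
        history.append((x, abs(g), second_derivative(x)))
        x -= lr * g
    return history

history = descend()

# Locate the step at which the gradient magnitude was largest.
peak_index = max(range(len(history)), key=lambda i: history[i][1])
x_at_peak = history[peak_index][0]

print(f"gradient norm peaked at step {peak_index}, x = {x_at_peak:.3f}")
print(f"curvature at start: {history[0][2]:.3f}, at end: {history[-1][2]:.3f}")
```

In this toy case the gradient magnitude peaks close to x = 1, the point where the curvature changes sign. In higher dimensions the relationship is less direct, which is why the article describes the pattern as an empirical observation rather than a guarantee.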
Methodologically, the researchers designed a simple procedure that monitors the gradient norm epoch by epoch. Initially, Adam handles the non-convex phase, where gradients fluctuate and norms rise. Once the norm peaks and begins to decline, indicating entry into a convex region, the system switches to conjugate gradient optimization. The switch is automated using a tolerance threshold set at 0.9 times the maximum observed gradient norm, so no manual parameter adjustment is needed beyond the batch size, for which the researchers settled on 512 elements for robustness.
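The switching rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `PhaseSwitchDetector`, the helper `phase_per_epoch`, and the reading of the rule as "switch once the epoch's gradient norm falls below 0.9 times the largest norm seen so far" are all hypothetical, and the optimizers themselves are left out.

```python
class PhaseSwitchDetector:
    """Tracks per-epoch gradient norms and signals when to leave the Adam phase.

    Hypothetical helper: assumes the switch fires once the current norm drops
    below `tolerance` times the maximum norm observed so far.
    """

    def __init__(self, tolerance=0.9):
        self.tolerance = tolerance
        self.max_norm = 0.0
        self.switched = False

    def update(self, grad_norm):
        # Once the switch has happened it is never undone.
        if self.switched:
            return True
        if grad_norm > self.max_norm:
            # Still climbing: record the new maximum.
            self.max_norm = grad_norm
        elif grad_norm < self.tolerance * self.max_norm:
            # The norm has fallen clearly below its peak: switch phases.
            self.switched = True
        return self.switched


def phase_per_epoch(grad_norms, tolerance=0.9):
    # Returns the phase decided after observing each epoch's gradient norm.
    detector = PhaseSwitchDetector(tolerance)
    return ["cg" if detector.update(n) else "adam" for n in grad_norms]


# Made-up gradient norms that rise to a peak of 3.0 and then decline.
norms = [1.0, 2.0, 3.0, 2.9, 2.8, 2.6, 2.0]
phases = phase_per_epoch(norms)
print(phases)
```

With these made-up norms the threshold is 0.9 × 3.0 = 2.7, so the small dips to 2.9 and 2.8 do not trigger the switch and the detector stays in the Adam phase until the norm reaches 2.6. The tolerance keeps noisy fluctuations near the peak from causing a premature switch.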
Results from the paper show clear advantages. For instance, on CIFAR-10 with a Vision Transformer architecture, the two-phase approach achieved lower training losses (e.g., 0.0008 for vit-mlp with two-phase training vs. 0.0016 with Adam-only) and higher validation accuracies (0.995 vs. 0.990). The conjugate gradient phase accelerated convergence significantly after the switch, as illustrated in the paper's Figure 5, where the loss decreases rapidly in convex regions. The pattern held across model architectures, including reduced transformers and VGG5 convolutional networks: all variants exhibited the hypothesized convexity transition, and two-phase training yielded better results in both loss minimization and generalization metrics.
In practical terms, this innovation matters because it enhances the efficiency of AI model training, which is computationally intensive and time-consuming for large-scale applications like image recognition and natural language processing. By leveraging inherent function properties, it could reduce resource costs and improve model performance in real-world systems, from healthcare diagnostics to autonomous vehicles, without requiring complex modifications or extensive hyperparameter tuning.
Limitations noted in the paper include the empirical nature of the findings, which, while consistent across tested architectures, may not generalize to all neural networks, especially very large text-based models not yet evaluated. Additionally, the assumption of a simple convexity pattern is not universally guaranteed; in highly complex functions, alternating non-convex and convex regions could lead to pitfalls like spurious minima, though these were not observed in the experiments. Future work aims to verify the method's applicability to broader model classes.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.