Training neural networks often gets stuck in suboptimal solutions, limiting their real-world accuracy. Researchers have developed a simple add-on to popular optimizers that encourages continued exploration of the solution space, leading to better performance in tasks like image recognition and language modeling. This innovation addresses a fundamental challenge in deep learning where algorithms typically stop searching once they find a local minimum, potentially missing superior options.
The key finding is that modifying gradient-based optimizers with an 'E' adaptor lets training keep exploring along 'valleys' in the loss landscape (regions of nearly identical low loss) in search of flatter and lower minima. Such minima are associated with improved generalization, meaning the models perform better on unseen data. In experiments, this approach increased accuracy by an average of 2.5% across various large-batch tasks compared to state-of-the-art methods.
The methodology involves integrating the adaptor into optimizers like SGD and Adam, where it uses remembered gradient information from early training stages to guide exploration later on. Specifically, it combines current gradients with a momentum term adjusted by a parameter α, which controls the balance between exploration and exploitation. For large batches, α is set negative to promote flat minima, while for small batches, it can be positive for faster convergence. This design avoids the high computational cost of directly calculating complex landscape features.
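The idea of mixing the current gradient with a remembered term can be sketched in a few lines. This is a minimal illustration, not the paper's exact update rule: the function name `adapted_sgd_step`, the use of an exponential moving average as the "remembered" gradient information, the toy quadratic loss, and the default values of `lr`, `alpha`, and `beta1` are all assumptions made for the example.

```python
import numpy as np

def adapted_sgd_step(params, grad, state, lr=0.1, alpha=-0.5, beta1=0.9):
    """One SGD step with a hypothetical exploration adaptor (illustrative only)."""
    # Assumption: "remembered gradient information" is modeled as an
    # exponential moving average (EMA) of past gradients.
    state["ema"] = beta1 * state["ema"] + (1.0 - beta1) * grad
    # Combine the current gradient with the remembered term, scaled by alpha.
    # With alpha = 0 this reduces to plain SGD.
    effective_grad = grad + alpha * state["ema"]
    return params - lr * effective_grad, state

def loss_grad(p):
    # Gradient of the toy loss 0.5 * (10 * x**2 + 0.01 * y**2):
    # steep in x, a nearly flat "valley" along y.
    return np.array([10.0 * p[0], 0.01 * p[1]])

params = np.array([1.0, 1.0])
state = {"ema": np.zeros_like(params)}
for _ in range(200):
    params, state = adapted_sgd_step(params, loss_grad(params), state)

print(params)  # x is driven close to 0; y moves slowly along the valley
```

In this sketch `alpha` is the single knob that weights the remembered term against the current gradient, mirroring the role the article describes for α. How the sign of α affects flatness in real networks is a claim of the paper, not something this toy problem demonstrates.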
Results from the paper show that the adapted optimizer, ALTO, achieved a top-1 accuracy of 70.83% on ImageNet with a batch size of 4086, slightly outperforming the 70.64% that the standard Lamb optimizer reached at a much smaller batch size of 256. In natural language processing, ALTO reduced perplexity to 78.37 on GPT-2 tasks, compared to 83.13 for Lamb (lower perplexity is better). Figures from the study, such as Figure 3, show that ALTO leads to lower top Hessian eigenvalues and reduced parameter drift, which is consistent with convergence to flatter regions.
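The top Hessian eigenvalue is a common proxy for how sharp a minimum is: the larger it is, the more steeply the loss curves in its worst direction. A minimal sketch of how such a value can be estimated with power iteration follows. The function name `top_hessian_eigenvalue`, the finite-difference Hessian-vector product, and the two toy gradient functions are assumptions for illustration; the paper's own measurement procedure is not described in this article.

```python
import numpy as np

def top_hessian_eigenvalue(grad_fn, params, iters=100, eps=1e-4, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(params.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        # Central finite-difference Hessian-vector product:
        # H v ≈ (g(p + eps * v) - g(p - eps * v)) / (2 * eps)
        hv = (grad_fn(params + eps * v) - grad_fn(params - eps * v)) / (2.0 * eps)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue

# Toy quadratic losses with known Hessians diag(10, 1) and diag(0.5, 0.1).
sharp_grad = lambda p: np.array([10.0 * p[0], 1.0 * p[1]])
flat_grad = lambda p: np.array([0.5 * p[0], 0.1 * p[1]])

origin = np.zeros(2)
sharp_value = top_hessian_eigenvalue(sharp_grad, origin)
flat_value = top_hessian_eigenvalue(flat_grad, origin)
print(sharp_value, flat_value)  # the flatter minimum has the smaller value
```

For real networks the Hessian-vector product would normally come from automatic differentiation rather than finite differences; the finite-difference form is used here only to keep the example dependency-free.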
This advancement matters because it enables more efficient use of computational resources in large-scale AI training, such as in data centers and cloud services, where batch sizes are increasing to leverage multiple GPUs. By finding better solutions without additional data or major hardware changes, it could accelerate developments in areas like autonomous systems, medical imaging, and language models, making AI more reliable and cost-effective.
Limitations noted in the paper include that the optimal settings for parameters like α and β1 vary by task and batch size, and the improvement is marginal in small-batch scenarios due to noisy gradients. Future work will focus on refining these hyperparameters and extending the adaptor to other optimizer types.
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.