AI Trains Faster with Less Memory

Training complex AI models often demands immense computational power and memory, which limits scalability and drives up costs. A new method for neural ordinary differential equations (neural ODEs) addresses this with mixed-precision training, which cuts memory usage by up to 50% and nearly doubles training speed while maintaining accuracy. The approach makes advanced AI applications more accessible and efficient, benefiting fields like image classification and generative modeling.

The researchers developed a framework that combines low-precision and high-precision computations to optimize neural ODE training. Neural ODEs model continuous-time dynamics, such as how data evolves over time, but traditional training methods are computationally expensive because of repeated network evaluations and large memory demands. The method evaluates the velocity function (the neural network that parameterizes the system) in low precision, for example in 16-bit formats like float16 or bfloat16, and stores intermediate states and adjoints in high precision, for example in 32-bit. This reduces resource use without sacrificing performance. A dynamic scaling heuristic automatically adjusts scaling factors to prevent numerical issues like underflow or overflow during backpropagation, which keeps training stable.
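
To see why the state is kept in high precision while only the velocity is evaluated in 16 bits, consider a toy sketch. It is not the researchers' implementation: it uses plain Python, a hypothetical one-dimensional velocity f(y) = -y in place of a neural network, forward Euler as the integrator, and Python's 64-bit floats standing in for the high-precision format. The helper names (to_half, euler_mixed, euler_all_half) are made up for illustration.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def velocity_half(y: float) -> float:
    """Toy velocity f(y) = -y with float16 input and output (assumed stand-in for a network)."""
    return to_half(-to_half(y))


def euler_mixed(y0: float, t_end: float, n_steps: int) -> float:
    """Forward Euler: velocity in float16, state accumulated in high precision."""
    h = t_end / n_steps
    y = y0
    for _ in range(n_steps):
        y = y + h * velocity_half(y)  # the update itself is not rounded to float16
    return y


def euler_all_half(y0: float, t_end: float, n_steps: int) -> float:
    """Forward Euler with the state, step size, and update all rounded to float16."""
    h = to_half(t_end / n_steps)
    y = to_half(y0)
    for _ in range(n_steps):
        y = to_half(y + to_half(h * velocity_half(y)))
    return y


exact = math.exp(-1.0)  # solution of dy/dt = -y, y(0) = 1, at t = 1
y_mixed = euler_mixed(1.0, 1.0, 10_000)
y_half = euler_all_half(1.0, 1.0, 10_000)

print(f"exact      : {exact:.6f}")
print(f"mixed      : {y_mixed:.6f}  (error {abs(y_mixed - exact):.2e})")
print(f"all float16: {y_half:.6f}  (error {abs(y_half - exact):.2e})")
```

With 10,000 steps, each update is smaller than half the spacing between float16 values just below 1, so the all-float16 state is rounded back to its initial value at every step and never moves. The mixed-precision run lands close to the exact answer exp(-1), roughly 0.368, even though every velocity evaluation was rounded to 16 bits.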

In practice, the team implemented this in an open-source PyTorch package called rampde, designed as a drop-in replacement for existing neural ODE code. They tested it on continuous normalizing flows for generative modeling, optimal transport flows on the BSDS300 dataset, and image classification on the STL-10 dataset. In the STL-10 classification task, for example, the method achieved competitive test accuracies above 76% while reducing peak memory usage from 21.5 GB to as low as 2.2 GB and cutting training times significantly. Mixed precision maintained model quality: relative errors in the gradients remained stable rather than growing uncontrollably with the number of integration steps, a result supported by the paper's theoretical analysis.
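
Accurate gradients in 16-bit arithmetic depend on small backward-pass values surviving the cast to low precision, which is the job of the dynamic scaling described earlier. The sketch below shows the general idea only. It is a generic power-of-two scaling loop assumed for illustration, not the heuristic from the paper or the rampde code; the names scaled_product and MIN_NORMAL_HALF are hypothetical, and a single scalar product stands in for the adjoint computation in backpropagation.

```python
import math
import struct

MIN_NORMAL_HALF = 2.0 ** -14  # smallest normal float16 magnitude


def to_half(x: float) -> float:
    """Round to float16; values beyond the float16 range become +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def scaled_product(adjoint: float, jac: float, scale: float = 1.0, max_tries: int = 60):
    """Multiply adjoint * jac in float16, adapting a power-of-two scale.

    Hypothetical heuristic: halve the scale on overflow, double it while the
    float16 result is zero or subnormal, then unscale in high precision.
    Returns (unscaled product, scale that was used).
    """
    for _ in range(max_tries):
        prod = to_half(to_half(adjoint * scale) * to_half(jac))
        if math.isinf(prod):
            scale /= 2.0  # overflow: shrink the scale
        elif abs(prod) < MIN_NORMAL_HALF:
            scale *= 2.0  # underflow or lost digits: grow the scale
        else:
            return prod / scale, scale  # division happens in high precision
    raise RuntimeError("no usable scale found")


adjoint, jac = 1e-6, 1e-3  # true product is 1e-9, below the float16 range

naive = to_half(to_half(adjoint) * to_half(jac))
scaled, used_scale = scaled_product(adjoint, jac)

print(f"unscaled float16 product: {naive}")
print(f"scaled float16 product  : {scaled:.4e} (scale {used_scale:g})")
```

Without scaling, the product underflows to zero in float16 and the gradient information is lost. With scaling, the multiplication happens in a range float16 can represent, and dividing by the scale in high precision recovers a value close to the true 1e-9. Powers of two are used because multiplying or dividing by them does not add rounding error of its own.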

This advancement matters because it enables more efficient AI development, especially for large-scale problems where computational resources are a bottleneck. In real-world terms, it could accelerate research in areas like medical imaging or autonomous systems by allowing faster iterations and lower costs. For instance, companies developing AI-driven simulations or data analysis tools could deploy models more quickly, while researchers with limited hardware could tackle more complex tasks.

However, the method has limitations. Performance gains depend on hardware with Tensor Core support, such as modern GPUs, and the method currently covers explicit time integrators, not implicit ones. The paper notes that roundoff errors, though bounded, could become significant with very small step sizes, and the approach may require further adaptation for stochastic differential equations or other extensions. Future work could explore 8-bit quantization or applications in control problems to broaden its impact.

About the Author

Guilherme A.