AI Optimizer Outperforms Standard Methods in Key Tests

AI Research
April 5, 2026
4 min read
TL;DR

A new algorithm called Sven uses a clever mathematical trick to train neural networks faster and more accurately, especially for regression tasks, while keeping computational costs low.

A new optimization algorithm for neural networks, named Sven (Singular Value dEsceNt), has been developed by researchers from MIT, the University of Oxford, and the NSF Institute for Artificial Intelligence and Fundamental Interactions (IAIFI). It offers a more efficient way to train models by leveraging the natural structure of loss functions. Unlike traditional methods, which collapse all data points into a single gradient, Sven treats each data point's contribution as a separate condition to be satisfied simultaneously, using linear algebra to find the optimal parameter update. This approach not only speeds up convergence but also achieves lower final losses in regression tasks, outperforming widely used optimizers like Adam while remaining computationally feasible.

How Sven Works

Sven works by exploiting the fact that standard loss functions in machine learning are sums over individual data points, each representing a condition the model should meet. Instead of reducing these to a single scalar for gradient descent, Sven computes a Jacobian matrix that captures how each data point's residual changes with parameters. It then uses the Moore-Penrose pseudoinverse of this matrix, approximated via a truncated singular value decomposition (SVD), to find the minimum-norm update that best satisfies all conditions at once.

The truncated SVD incurs only a factor of k more computational overhead than stochastic gradient descent, where k is a hyperparameter controlling the number of retained singular values. This keeps Sven scalable, in contrast to traditional natural gradient methods, whose cost grows quadratically with the number of parameters.
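As a concrete illustration, the update described above can be sketched in a few lines of NumPy. This is a toy reconstruction under stated assumptions (a linear model, full-batch residuals), not the authors' implementation; the function name `sven_step` and all numerical details here are our own.

```python
import numpy as np

# Toy setting: for a linear model y = X @ theta, the residual for data
# point i is r_i = X_i @ theta - y_i, so the Jacobian of the residual
# vector with respect to theta is simply X. For a neural network, each
# row of J would come from autograd instead.

def sven_step(J, r, k, lr=1.0):
    """Minimum-norm update from a rank-k truncated pseudoinverse of J."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    U, s, Vt = U[:, :k], s[:k], Vt[:k]      # keep the top-k singular values
    # Truncated Moore-Penrose pseudoinverse: J+ ~= V @ diag(1/s) @ U.T
    return -lr * (Vt.T @ ((U.T @ r) / s))

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                # batch of 32 points, 4 parameters
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true

theta = np.zeros(4)
r = X @ theta - y                           # one residual per data point
theta += sven_step(X, r, k=4)               # full-rank step solves it exactly
print(np.allclose(theta, theta_true))       # → True
```

In this over-determined linear case, a full-rank step recovers the true parameters in one update; choosing k smaller than the number of parameters trades some accuracy per step for the factor-of-k cost mentioned above.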

Experimental Results

In experiments detailed in the paper, Sven was tested on three datasets: 1D regression, random polynomial regression over six dimensions, and MNIST classification using a label regression loss. Results demonstrate that Sven significantly outperforms standard first-order methods like SGD, RMSProp, and Adam in regression tasks, converging faster per epoch and to lower validation losses.

For instance, on the 1D regression problem, Sven with a batch size of 32 and k=16 achieved much lower loss than SGD and the other baselines. While L-BFGS, a second-order method, sometimes reached lower losses, it took at least ten times longer in wall-clock time, highlighting Sven's efficiency. On MNIST, Sven matched Adam's performance but did not surpass it, indicating a gap between regression and classification settings that warrants further investigation.

Applications Beyond Machine Learning

Applications of Sven extend beyond standard machine learning benchmarks, particularly into scientific computing where loss functions often decompose into interpretable conditions, such as physical constraints or boundary equations. The authors — Samuel Bright-Thonney and Jesse Thaler from MIT and IAIFI, along with Thomas R. Harvey and Andre Lukas from Oxford — note an upcoming application to the numerical modular bootstrap, suggesting Sven could enhance optimization in fields like physics and engineering.

By providing a global view of loss decomposition, Sven adds a tool to the practitioner's toolkit, complementing existing techniques like weight decay and momentum. Its ability to handle over-parametrized regimes — common in modern neural networks — without the prohibitive costs of natural gradient methods makes it a promising alternative for training large models efficiently.

Limitations and Future Directions

Despite its advantages, Sven faces limitations, primarily related to memory overhead when dealing with many conditions, as computing the Jacobian requires storing multiple model copies per batch. The paper discusses mitigation strategies, such as micro-batching and parameter batching, but notes that these would require modifications to standard autograd tools like PyTorch.
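The micro-batching idea can be illustrated with a small NumPy sketch. This is our own toy example, not the paper's implementation: the Jacobian is assembled in chunks so that only a few rows of per-sample gradients need to be held at once, rather than the full batch.

```python
import numpy as np

def jacobian_microbatched(residual_grad, n_samples, n_params, chunk=8):
    """Assemble an (n_samples, n_params) Jacobian a few rows at a time.

    `residual_grad(start, stop)` returns the gradient rows for samples
    start..stop; in a real setting this would be one small autograd pass
    per chunk, bounding peak memory by `chunk` instead of the batch size.
    """
    J = np.empty((n_samples, n_params))
    for start in range(0, n_samples, chunk):
        stop = min(start + chunk, n_samples)
        J[start:stop] = residual_grad(start, stop)   # grads for one chunk
    return J

# Toy residuals r_i = X_i @ theta - y_i have gradient rows equal to X_i.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
J = jacobian_microbatched(lambda a, b: X[a:b], 20, 3, chunk=6)
print(np.allclose(J, X))  # → True
```

Note the trade-off the paper points to: chunking caps the memory used while computing gradients, but the full Jacobian must still exist before the SVD, and wiring this into standard autograd loops is what would require the tooling changes mentioned above.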

Additionally, the performance gap between regression and classification tasks remains unexplained; the paper suggests it may relate to differences in singular value spectra, with regression showing rapid decay and classification showing flatter distributions. Future work will need to address scaling to larger models and explore the hyperparameter κ, which affects how residuals are defined, to fully realize Sven's potential across diverse applications.

Original Source

Read the complete research paper on arXiv.
About the Author
Guilherme A.

Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.