A breakthrough in artificial intelligence research has introduced a way to train massive neural networks without relying on gradient backpropagation, a fundamental technique that has powered most modern AI systems. This new approach, called Evolution Guided General Optimization via Low-rank Learning (EGGROLL), leverages evolution strategies (ES) to optimize models with billions of parameters, overcoming long-standing scalability issues. By using low-rank matrix perturbations, EGGROLL reduces computational and memory costs dramatically, making it feasible to apply ES to large-scale tasks like language model pre-training and fine-tuning. This advancement opens doors to training models with non-differentiable components or using integer-only operations, which could lead to more energy-efficient and robust AI systems.
Evolution strategies are a class of black-box optimization algorithms that explore parameter space by evaluating a population of perturbed models rather than computing gradients through backpropagation. This makes ES suitable for optimizing non-differentiable or noisy objectives, such as rewards in reinforcement learning or outcomes in language model reasoning. However, traditional ES becomes prohibitively expensive at scale because it requires generating a full-rank matrix perturbation for each population member, leading to high memory and computation demands. With billion-parameter models, for instance, naïve ES would need to store and process massive matrices, limiting population sizes and slowing training. EGGROLL addresses this by approximating these perturbations with low-rank matrices: it samples random matrices A and B of small rank r and forms each perturbation as AB⊤ scaled by 1/√r. This reduces auxiliary storage from mn to r(m+n) per layer and cuts the cost of forward passes from O(mn) to O(r(m+n)), as detailed in the paper.
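The low-rank perturbation scheme above can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the layer shape, the scale `sigma`, and the helper name are assumptions.

```python
import numpy as np

# Hypothetical layer shape and rank; EGGROLL's actual hyperparameters may differ.
m, n, r = 1024, 1024, 4
sigma = 0.01  # perturbation scale (assumed)

rng = np.random.default_rng(0)

def low_rank_perturbation(rng, m, n, r, sigma):
    """Sample A (m x r) and B (n x r) and form sigma * A @ B.T / sqrt(r)."""
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return sigma * (A @ B.T) / np.sqrt(r)

E = low_rank_perturbation(rng, m, n, r, sigma)
print(E.shape)             # (1024, 1024): a full-size perturbation...
print(r * (m + n), m * n)  # ...parameterized by 8192 numbers instead of 1048576
```

The product AB⊤ is a full m-by-n matrix, but only the r(m+n) entries of A and B ever need to be stored or communicated, which is where the memory savings come from.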
The methodology behind EGGROLL involves generating low-rank perturbations that approximate the full-rank updates of standard ES, with a theoretical guarantee of fast convergence: the researchers proved that the low-rank update converges to the full-rank Gaussian ES update at an O(1/r) rate, meaning even small ranks like r=1 can provide accurate approximations. In practice, EGGROLL uses a deterministic random number generator to reconstruct noise on demand, avoiding the need to store perturbations in memory. During training, it batches low-rank adapters and shares base activations, enabling efficient parallel evaluation on GPUs. In experiments, EGGROLL achieved a hundredfold increase in training throughput for billion-parameter models at large population sizes, nearly matching the throughput of pure batch inference: in Figure 2a, its normalized training speed is 91, compared to 0.41 for OpenES and 0 for PPO.
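A minimal sketch of the seed-based noise regeneration and the ES update loop, assuming a vanilla fitness-weighted recombination on a toy objective. The function names, hyperparameters, and objective here are hypothetical stand-ins, not the paper's code.

```python
import numpy as np

def perturbation(seed, m, n, r, sigma):
    """Regenerate a low-rank perturbation deterministically from its seed,
    so noise never needs to be stored between sampling and recombination."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return sigma * (A @ B.T) / np.sqrt(r)

def es_step(W, fitness_fn, pop_size, base_seed, r=1, sigma=0.05, lr=0.1):
    """One ES update: evaluate perturbed candidates, then recombine the
    regenerated perturbations weighted by normalized fitness."""
    m, n = W.shape
    seeds = base_seed + np.arange(pop_size)
    fits = np.array([fitness_fn(W + perturbation(s, m, n, r, sigma))
                     for s in seeds])
    fits = (fits - fits.mean()) / (fits.std() + 1e-8)
    grad = sum(f * perturbation(s, m, n, r, sigma)
               for f, s in zip(fits, seeds)) / (pop_size * sigma)
    return W + lr * grad

# Toy objective: move a 4x4 weight matrix toward a fixed target.
target = np.full((4, 4), 0.5)
fitness = lambda W: -np.sum((W - target) ** 2)

W = np.zeros((4, 4))
for t in range(200):
    W = es_step(W, fitness, pop_size=64, base_seed=t * 64)
```

Note that `perturbation` is called twice per seed (once for evaluation, once for recombination) instead of keeping the noise in memory; this trades a little recomputation for a large reduction in storage, which is the trick that lets ES scale to huge populations.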
Results from the paper demonstrate EGGROLL's effectiveness across diverse domains without compromising performance. In tabula rasa reinforcement learning settings, EGGROLL matched or outperformed standard ES algorithms like OpenES on 14 of 16 environments, as illustrated in Figure 4 for tasks such as Pendulum-v1 and Craftax Symbolic. For language model fine-tuning, EGGROLL was competitive with GRPO on reasoning tasks: on the countdown task with an RWKV-7 1.5B model, it reached 35% validation accuracy versus GRPO's 23% under the same hardware and time constraints, as seen in Figure 5a. Most notably, EGGROLL enabled stable pre-training of a nonlinear recurrent language model (called EGG) that operates purely in integer datatypes, scaling population sizes from 64 to 262,144 and achieving a test loss of 3.41 bits/byte, as shown in Figure 2b. This integer-only training is significant because it allows hardware-friendly operations that could reduce energy consumption in AI systems.
The implications of EGGROLL extend beyond current AI practice, offering a pathway to train models that are difficult or impossible to optimize with gradient-based methods. By eliminating the need for differentiability, EGGROLL can handle discrete parameter spaces, noisy objectives, and models with non-differentiable components, such as neuro-symbolic systems. This could lead to more robust AI that integrates symbolic reasoning or operates in low-precision environments, such as edge devices. The efficiency gains also let researchers explore larger population sizes and more complex architectures without prohibitive costs, potentially accelerating progress in areas like drug discovery or autonomous systems. The paper suggests future applications in training end-to-end systems in which language models interact with other agents, highlighting the approach's versatility.
Despite its advantages, EGGROLL has limitations that warrant further investigation. The theoretical analysis assumes certain regularity conditions, such as bounded fitness functions and symmetric distributions, which may not hold in all real-world scenarios. While EGGROLL works well even at low ranks such as r=1, the paper notes that more analysis is needed to fully explain this success, especially in extreme cases. The integer-only training experiments, though promising, were conducted on a character-level dataset (minipile) and may not scale directly to larger, more complex language tasks without adjustments. The researchers also acknowledge that EGGROLL's performance in multi-agent reinforcement learning and other domains requires more extensive validation to establish generalizability. These limitations point to future work such as refining score-function approximations and exploring hybrid approaches with gradient-based techniques.
Original Source
Read the complete research paper
About the Author
Guilherme A.
Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.