AIResearch

AI Trains Smarter with Less Power

A new low-precision method slashes the compute and memory demands of AI training while closing more than half of the accuracy gap with full-precision training, addressing the massive energy costs of large language models.

AI Research
November 14, 2025
3 min read

Training large language models like those behind modern AI assistants is notoriously expensive, with costs soaring into the millions of dollars due to immense computational demands. A new approach, TetraJet-v2, tackles this by enabling highly efficient, low-precision training that slashes resource use without sacrificing performance, making advanced AI more accessible and sustainable. This breakthrough could lower barriers for researchers and companies developing next-generation AI systems.

Researchers discovered that fully quantized training at 4-bit precision—in which weights, activations, and gradients are all represented with just four bits to reduce computational load—can achieve near-lossless results when key issues like weight oscillation and activation outliers are controlled. In experiments, TetraJet-v2 reduced the performance gap with high-precision training by an average of 51.3% across models of varying sizes, from 70 million to 370 million parameters, trained on up to 200 billion tokens. This means AI models can be trained effectively with far less memory and computing power.
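The core idea of block-wise low-precision training can be illustrated with a minimal sketch. The snippet below is a simplification, assuming a symmetric integer grid rather than the actual NVFP4 floating-point format; `quantize_4bit` and its block size are illustrative choices, not the paper's implementation:

```python
import numpy as np

def quantize_4bit(x, block_size=16):
    # Toy block-wise 4-bit quantizer: each block of values shares one
    # scale, and entries are rounded onto a 15-level grid [-7, 7].
    # This simplifies NVFP4, which stores FP4 values with its own
    # block-scaling scheme; the integer grid here is illustrative.
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # Per-block scale maps the largest magnitude onto the 4-bit range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero in all-zero blocks
    q = np.clip(np.round(blocks / scales), -7, 7)
    return (q * scales).reshape(-1)[: len(x)]

x = np.linspace(-1.0, 1.0, 32)
xq = quantize_4bit(x)
err = np.abs(x - xq).max()  # worst-case rounding error, at most scale / 2
```

Because each small block gets its own scale, one large value only degrades the precision of its own block rather than the whole tensor, which is the motivation for block-scaled formats like NVFP4.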

The method employs a double-block linear layer design tailored for the NVFP4 data format, which groups data into small blocks for more accurate scaling. To suppress weight oscillation—where values fluctuate near quantization thresholds and hinder convergence—the team introduced OsciReset, an algorithm that identifies and resets oscillating weights to stable states during training. For activation outliers, which are extreme values that distort low-precision calculations, OutControl retains critical data in higher precision (e.g., FP8 or BF16) during forward and backward passes, using a static selection of channels based on persistent patterns observed in model activations.
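A rough sketch of the oscillation-suppression idea follows, with the caveat that the flip-counting heuristic and reset-to-mean rule below are assumptions for illustration, not the published OsciReset algorithm:

```python
import numpy as np

def osci_reset(weight_history, flip_threshold=0.5):
    # Sketch of oscillation suppression: a weight whose update direction
    # reverses in at least `flip_threshold` of recent steps is treated as
    # oscillating and reset to its running mean, a stable value between
    # the quantization levels it bounces across. The flip-rate heuristic
    # and reset rule are illustrative, not the paper's exact method.
    hist = np.asarray(weight_history, dtype=np.float64)  # (steps, n_weights)
    signs = np.sign(np.diff(hist, axis=0))
    flips = (signs[1:] * signs[:-1]) < 0  # True where direction reversed
    oscillating = flips.mean(axis=0) >= flip_threshold
    stabilized = hist[-1].copy()
    stabilized[oscillating] = hist.mean(axis=0)[oscillating]
    return stabilized, oscillating

# Weight 0 bounces between two values; weight 1 moves steadily.
history = [[0.1, 0.0], [-0.1, 0.1], [0.1, 0.2], [-0.1, 0.3], [0.1, 0.4]]
weights, mask = osci_reset(history)
```

Resetting to the running mean places an oscillating weight between the two quantization levels it was bouncing across, so subsequent updates can settle on one side instead of flipping every step.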

Results from pre-training evaluations on models like OLMo-2 show that TetraJet-v2 consistently outperforms prior methods, achieving lower perplexity scores (e.g., 18.52 on Wikitext-103 for the 370M model) and higher accuracy on benchmarks such as ARC and MMLU. For instance, in one test, the approach reached an average accuracy of 43.41% across multiple tasks, closing the gap with full-precision training. The combination of oscillation suppression and outlier control proved crucial, with ablation studies confirming that both components contribute significantly to stability and performance.
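The outlier-control idea described above—keeping a static set of extreme channels in higher precision while quantizing the rest—can be sketched as follows. The channel selection and precision choices here are illustrative assumptions, not the paper's OutControl implementation:

```python
import numpy as np

def outlier_aware_quantize(acts, outlier_channels, n_levels=7):
    # Sketch of outlier control: a static set of outlier channels bypasses
    # 4-bit quantization and stays at full precision (standing in for the
    # FP8/BF16 path), while all remaining channels are quantized onto a
    # small integer grid. Channel list and grid are illustrative.
    acts = np.asarray(acts, dtype=np.float64)  # (tokens, channels)
    out = acts.copy()
    mask = np.ones(acts.shape[1], dtype=bool)
    mask[outlier_channels] = False  # columns that WILL be quantized
    regular = acts[:, mask]
    scale = np.abs(regular).max() / n_levels
    if scale == 0.0:
        scale = 1.0
    out[:, mask] = np.clip(np.round(regular / scale), -n_levels, n_levels) * scale
    return out

# Channel 1 carries extreme values; keep it high-precision, quantize the rest.
acts = np.array([[0.1, 50.0, -0.2], [0.05, 49.0, 0.1]])
out = outlier_aware_quantize(acts, outlier_channels=[1])
```

Without the bypass, the 50.0 outlier would set the quantization scale for every channel and crush the small values to a few coarse levels; excluding it keeps the grid fine where most activations live.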

This advancement matters because it addresses the growing energy and cost challenges of AI development, potentially enabling more organizations to train sophisticated models without prohibitive expenses. In real-world terms, it could lead to faster innovation in areas like natural language processing and automated systems, while reducing the environmental footprint of data centers. However, the paper notes limitations, including that the method was tested only up to 370 million parameters and 200 billion tokens, and hardware support for efficient 4-bit computations is not yet widely available, leaving scalability and practical speedups to future work.

About the Author

Guilherme A.


Former dentist (MD) from Brazil, 41 years old, husband, and AI enthusiast. In 2020, he transitioned from a decade-long career in dentistry to pursue his passion for technology, entrepreneurship, and helping others grow.